WhatsApp Chat Fraud Analysis Using Support Vector Machine Method

ABSTRACT


Introduction
The development of information technology has driven the growth of internet use in Indonesia, which increases every year. Indonesia is among the countries with the most internet users in the world: it ranked 4th worldwide by number of internet users in the first quarter of 2021 [1]. Information is data whose content gives meaning to its users, and that content can be positive or negative. To obtain information, people interact by communicating with one another through various media [2]. Increasingly sophisticated technology allows people to communicate quickly, and one of the most popular online communication applications in Indonesia is WhatsApp. According to a survey by Ding, a mobile top-up platform, WhatsApp is the instant messaging service that most Indonesians use to communicate with others. The survey found that 89% of respondents named WhatsApp as their first choice for communicating with each other, followed by Facebook with 44% and Instagram with 41% [3].
The Directorate of Cyber Crime of Bareskrim Polri reported that throughout 2019 there were 4,586 reports of cybercrime, of which 1,617 were fraud cases. Meanwhile, from January to December 2020, there were 2,259 cybercrime reports and 649 fraud cases [4]. Fraud is a form of crime committed through lies or deception with the aim of benefiting oneself, as described in Article 378 of the Criminal Code (KUHP) [5]. One type of fraud is fraud via WhatsApp chat, which uses the internet as the means of carrying out the fraud [6]. The fraud schemes vary, ranging from asking for money, taking over WhatsApp accounts, and offering prizes or quiz winnings to messages claiming to come from a bank. Fraudsters are skilled at deceiving their victims, for example by using certain photos or unusual but persuasive grammar. WhatsApp fraud takes many forms, and these forms change over time.
Several previous studies have applied the Support Vector Machine (SVM). One of them, [7], used the SVM method to find digital evidence related to cases of cyberbullying victims in Instagram comments. The classification results take the form of positive and negative classes, that is, positive and negative cyberbullying sentiment, with a highest accuracy of 90% at a composition of 50% training data and 50% test data. Another study [8] classified SMS spam and investigated several data mining techniques: support vector machines, multinomial naive Bayes, and decision trees. The tests showed that the Support Vector Machine produced the highest accuracy of the three methods tested, 98.33%. Another study [9] used the SVM method to detect hate speech in comments in online media, with an average accuracy of 53.88%; 87 comments fell into the hate speech category and 105 into the non-hate-speech category.
Support Vector Machine (SVM) is a learning system that uses learning algorithms based on optimization theory [10]. SVM belongs to the class of supervised learning methods, so its implementation requires a training phase and a testing phase [11].
This research uses the SVM method to detect chat fraud on WhatsApp and reports the level of accuracy achieved in classifying fraudulent chats. To measure classification accuracy, the RapidMiner application is used, and several test scenarios are carried out to obtain better results. The aim is for this research to serve as a reference that encourages social media users to be careful when using online communication applications such as WhatsApp, so as to minimize fraud cases on social media in the future.

Research Methodology
This research applies the literature study method to find secondary data in reference theories and relevant research. The framework of this research consists of 4 phases, each of which is explained as follows.

Data Collection
The initial data collection process is carried out to identify existing problems. The method used is to collect data related to fraud through chat on WhatsApp. The documents or data obtained are raw documents. These raw documents contain parts that are meaningless for the sentiment analysis classification process, such as Indonesian stopwords.

Initial Data Management
At this phase, the data are pre-processed in four phases: cleansing, case folding, tokenization, and stop word elimination. Pre-processing is done to prepare the data for classification and to facilitate data processing. Data cleaning (cleansing) is conducted to prevent missing values in the data. It is the process of removing incomplete, erroneous, and unuseful data so that only clean, relevant, and useful data are processed [12].

Case folding
Case folding is the process of making letter case uniform across a document. This is done to make searching easier. Not all text documents use capital letters consistently, so this phase converts the entire text of the document into a standard form (lowercase).
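As a minimal sketch of this phase, case folding can be expressed with Python's built-in string method. The function name `case_fold` and the sample sentence are illustrative assumptions; the research itself performs this step in RapidMiner.

```python
def case_fold(text: str) -> str:
    """Convert every letter in the text to lowercase."""
    return text.lower()


# Hypothetical sample sentence, used only to illustrate the step.
sample = "Your Shopee Account has received a CASHBACK Voucher"
folded = case_fold(sample)
print(folded)
```

After folding, "Shopee", "SHOPEE", and "shopee" all map to the same term, which keeps the vocabulary seen by the classifier smaller.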

Tokenization
Tokenization is the process of dividing the text of a sentence or paragraph into certain parts (tokens) [13].

Stopword elimination
A stopword is defined as a word that appears often in a text document but does not contribute to the content of the document [14]. This phase removes such words.

Stemming
Stemming removes the affixes from each word to reduce it to its base word; this phase also aims to clean up improperly spelled words. For example, the word "opened" has the suffix "-ed", so the word is returned to its base form by removing that suffix.
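The stopword elimination and stemming phases described above can be sketched as follows. This is a toy illustration under stated assumptions: the stopword set, the suffix list, and the function names are hypothetical, and the suffix stripper only handles simple English cases such as "opened". Real Indonesian text would need a full stopword list and a proper Indonesian stemmer.

```python
# Illustrative stopword list (assumption); not the list used in the research.
STOPWORDS = {"the", "a", "and", "to", "that", "has"}

# Toy English suffixes (assumption), checked in this order.
SUFFIXES = ("ing", "ed", "s")


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = ["the", "account", "has", "opened"]
kept = remove_stopwords(tokens)        # ["account", "opened"]
stems = [toy_stem(t) for t in kept]    # ["account", "open"]
print(stems)
```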

Data Labeling
The data obtained do not yet have a class, so at this phase labeling is carried out to determine the class of each text. There are two class labels: Positive and Negative.

Analysis
Testing was conducted on the data using the SVM algorithm. The test results were measured using three performance measures (accuracy, precision, and recall) derived from the confusion matrix. The results of this test are evaluated and analyzed to become the output of this research. The SVM algorithm was chosen for two reasons. The first is its ability to minimize error on the training data while also minimizing the error associated with dimensionality; the strategy used is called Structural Risk Minimization (SRM). The second is that SVM can be used in high-dimensional problems where the number of data samples is limited. The performance measures are used to evaluate how accurately the system classifies fraudulent chats in the data.
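For reference, the three performance measures are conventionally defined from the confusion-matrix counts as shown here, where TP, FP, TN, and FN denote true positives, false positives, true negatives, and false negatives. These are the standard textbook definitions; that the tool computes them in exactly this form is an assumption.

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN}
\end{align}
```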
According to [15], cross validation can be used to find the parameter values that give the optimum SVM performance. Cross validation is a statistical method of evaluating and comparing learning algorithms by dividing the data into two segments: one used to train a model (training data) and one used to validate it (testing data). The k-fold cross-validation process is as follows: first the data are divided into a predetermined number k of parts (folds); then k iterations are run, and in each iteration a different fold is used for validation while the remaining folds are used for training, until every fold has been used once. In data mining and machine learning, 10-fold cross-validation is the most commonly used [16].
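The k-fold procedure described above can be sketched in plain Python. The function name `k_fold_splits` and the sample sizes are hypothetical, and the sketch splits the indices in order without shuffling or stratification, which tools typically add.

```python
def k_fold_splits(n_samples: int, k: int):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Hypothetical example: 10 records split into 5 folds.
splits = list(k_fold_splits(n_samples=10, k=5))
for train, test in splits:
    print("train:", train, "test:", test)
```

Each record appears in the test segment exactly once, so the k accuracy values can be averaged into a single estimate.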

Result and Discussion
This chapter describes the process carried out during the research. The researcher uses RapidMiner as the tool for processing the data with the Support Vector Machine algorithm.

Data Collection
The data were taken from the chats of several WhatsApp users and identified as either fraud or not fraud. All chats found in these WhatsApp accounts were collected, and some of them were used as research data. An example of the chats taken can be seen in Figure 1.

Figure 1. Example of Chat

Initial Data Processing
The data collected are fraud chats on WhatsApp; the chats are then analyzed based on the Positive and Negative categories. The following is an example of the data being processed:
I contacted you to inform that your shopee account has received a cashback voucher!!! idr 1,000,000 and can be activated in shopeepay balance.

Preprocessing Data
1. Case folding: this process converts all letters to lowercase.
2. Tokenization: this phase divides the text of the sentence based on numbers, spaces, and punctuation marks for the next phase of text analysis.
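The tokenization rule in step 2 can be sketched with a regular expression that keeps only runs of letters, so that digits, spaces, and punctuation act as separators. The function name `tokenize` is hypothetical, and treating every non-letter as a separator is an assumption about how the rule is applied; the input is a fragment of the sample chat.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text, then split it on digits, whitespace, and punctuation."""
    # [^\W\d_]+ matches one or more letters (word characters minus digits and "_").
    return re.findall(r"[^\W\d_]+", text.lower())


chat = "received a cashback voucher!!! idr 1,000,000"
tokens = tokenize(chat)
print(tokens)  # ['received', 'a', 'cashback', 'voucher', 'idr']
```

Under this rule the amount "1,000,000" disappears entirely, which is worth keeping in mind if monetary amounts are expected to be a useful signal.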

Data Process
After preprocessing, the next step is to convert the data into documents with the 'process document from data' operator. The data are read with the Read Excel operator, because they are stored in Excel. The field to be used as the class is then set with the 'set role' operator. Next, the process runs inside the Cross Validation operator, which splits the data into training data for building the model and testing data for validating it. The model formed from the training data is applied to the testing data through the 'apply model' operator. The last phase of the process is calculating performance. The output of the process is the classification performance of the SVM algorithm, measured as accuracy, precision, and recall. Table 5 shows that the accuracy of classification with the Support Vector Machine algorithm is 84.21%, with 33 true positives, 5 false positives, 31 true negatives, and 7 false negatives.
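The reported accuracy can be checked from the four counts in Table 5. The sketch below uses the standard confusion-matrix formulas; the function name `classification_metrics` is hypothetical, and the precision and recall it prints are derived from the counts taken at face value rather than figures quoted from the text.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Counts as given for Table 5.
metrics = classification_metrics(tp=33, fp=5, tn=31, fn=7)
print(f"accuracy  = {metrics['accuracy']:.2%}")   # accuracy  = 84.21%
print(f"precision = {metrics['precision']:.2%}")
print(f"recall    = {metrics['recall']:.2%}")
```

The accuracy of 64 correct records out of 76 reproduces the 84.21% stated in the text.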

Conclusion
Based on the results of the research and discussion described above, it can be concluded that this research successfully implemented the SVM algorithm for WhatsApp fraud chat analysis, with an accuracy of 84.21%.

Figure 1. Research Methodology

Table 1. Example of Case Folding

Table 2. Example of tokenization

Table 5. Result Analysis