OPINION ANALYSIS OF TRAVELER BASED ON TOURISM SITE REVIEW USING SENTIMENT ANALYSIS

Advances in technology make it easier for people to access information, including information about places to visit. Many prospective visitors read reviews from people who have already visited a place to find out how they rate it, and the opinions expressed in those reviews strongly influence others' decisions about where to go. Opinion analysis can be carried out by conducting a sentiment analysis of hotel customer reviews. The data used are traveler reviews of hotels in East Java on the TripAdvisor site. The review data were collected by crawling the site, and the unstructured reviews were preprocessed and their terms weighted using the TF-IDF method. Classification was performed using the support vector machine method to determine whether the opinions in traveler reviews are positive or negative. Based on the classification results, the hotels with the most positive sentiment in Surabaya are Harris Hotel Gubeng and Pop! Hotel Gubeng, each with 252 reviews, while the hotel with the most positive sentiment in Malang is Harris Hotel Malang with 311 reviews. The results of the opinion analysis are expected to help hotel managers evaluate and develop their services to increase the number of tourist visits.


INTRODUCTION
Internet development is now very rapid. Internet technology is used in many areas: as a medium for getting various kinds of information, for running businesses (selling or shopping online), for entertainment (music, movies, magazines), and for accessing social media such as Instagram, Twitter, and Facebook. Today many people use the internet to get information about hotels. The increasing number of hotels available for a vacation can make it hard for people to determine which hotel suits them. Many sites provide information about hotels in various countries; one of them is TripAdvisor. Based on data released by Skift in 2013, which used SimilarWeb statistics on visitor traffic during October 2013, TripAdvisor ranked second among the ten online travel sites most accessed and used by tourists worldwide, with a total of 48.5 million visitors. On such sites, people can find information about hotels, restaurants, and tourist attractions in various countries, along with reviews from people who have visited them. On TripAdvisor, people who have visited a hotel, restaurant, or tourist attraction can write their opinions about it. Potential customers usually use these comments as a reference when deciding whether to visit a place, so other people's opinions strongly influence their decisions.
Indonesia is famous for its natural and cultural beauty, and it has many tourist attractions that can be visited by both local and foreign tourists. One of the provinces in Indonesia that can be visited for a vacation is East Java. Many cities and regencies in East Java serve as tourist destinations because of their natural beauty, such as beaches, mountains, and waterfalls, which are widely available in several regions. In addition, East Java has many man-made and cultural attractions. According to the head of the East Java Culture and Tourism Office, the number of local tourist visits to East Java in 2017 was 65,623,535, an increase of 13.01% from 58,068,493 in 2016. The number of foreign tourist visits to East Java in 2017 was 690,509, an increase of 11.62% from 618,651 in 2016. The large number of tourists interested in visiting East Java also opens opportunities for hotel managers to attract customers to stay at the hotels they manage. Hotel managers should therefore pay more attention to the reviews of people who have stayed at their hotels, because doing so can help maintain and increase the number of visits.
Chory et al. [1], in "Sentiment Analysis on User Satisfaction Level of Mobile Data Services Using Support Vector Machine (SVM) Algorithm", and Ayu and Sarno [2], in "Sentiment Detection of Comment Titles in Booking.com Using Probabilistic Latent Semantic Analysis", which analyzed customer review data on the Booking.com site using the PLSA method, indicate that it is necessary to analyze user comments to see how satisfied the public is with a service.
The review features on various sites do not show positive reviews (reviews with good opinions) and negative reviews (reviews with bad opinions) separately, so sentiment analysis can be used to find out what the traveler reviews on those sites express. Travelers write reviews based on what they feel at the time of writing, so the reviews can be regarded as honest. Sentiment analysis is the study of how to analyze opinions, sentiments, evaluations, judgments, attitudes, and emotions toward an entity, which can be a product, service, individual, issue, event, organization, or topic [3]. Sentiment analysis can be applied at different levels of text, such as documents or sentences. Reviews from travelers are influenced by emotions (sentiments), so they can be classified and their polarity determined as positive or negative [4].
Many methods can be used to classify text. One of them is the Support Vector Machine (SVM), which is also one of many methods that can be used to classify spatial data. This method has been widely applied to problems in many fields, such as gene expression analysis, finance, weather, and medicine. In sentiment analysis, the support vector machine method is widely used because it can provide better results than similar classification methods such as the Artificial Neural Network (ANN), especially because SVM can find a globally optimal solution [5][6][7][8].
Based on this background, this study conducted a sentiment analysis to find out the opinions in reviews written by hotel customers on TripAdvisor. The reviews were also analyzed by the aspect into which each review can be categorized. Three aspects, taken from the TripAdvisor site, are used: location, cleanliness, and service. The data used is review data from TripAdvisor obtained by crawling with the scrapy library in Python. The Term Frequency-Inverse Document Frequency (TF-IDF) method is used to find the weight of each word, and the Support Vector Machine method is used to determine opinion sentiment.

LITERATURE REVIEW

TripAdvisor
TripAdvisor is one of the largest travel sites in the world and is headquartered in Needham, Massachusetts. Figure 1 shows the TripAdvisor website. The site provides information on tourist attractions, hotels, restaurants, and flights, and can help tourists book their tours. It offers a review feature through which tourists can share their experiences of the attractions, hotels, and restaurants they have visited with people worldwide, and it has 315 million active and inactive reviewers. Figure 2 shows a snapshot of the customer review page. There are also features to compare flight prices, to follow links for booking travel packages, and to find hotel prices.

Sentiment Analysis
Sentiment analysis is a branch of text mining, natural language processing, and artificial intelligence that studies how to analyze opinions, sentiments, evaluations, attitudes, judgments, and emotions toward an entity, which can be a product, service, organization, individual, issue, event, or topic [9,10]. Sentiment analysis is also called opinion mining. It is used in natural language processing, text mining, and computational linguistics, and it aims to determine opinions on a particular topic, where such opinions can indicate judgments, reasons, and trends [11]. Sentiment analysis is mostly used to assess public opinion, such as whether people like or dislike particular goods or services. Sentiment is subjective information with positive or negative polarity, and this polarity value can be used as a parameter in making a decision [4].

Text Mining
Text mining is the process of finding patterns, in the form of information or knowledge that was not previously visible, in a document or other source, for a particular purpose [12]. Text mining is often used to help analyze information, support decision-making, and manage large amounts of textual information. The data can be processed with various methods such as classification, clustering, and sentiment analysis. Text mining is part of data mining but has different and more numerous process stages, because it processes text data, whose characteristics are more complex than those of ordinary structured data. Therefore, text mining requires several initial steps to turn the text into structured data [4].

Text Pre-processing
In text mining, there is an initial stage before the data is processed, namely preprocessing. Preprocessing aims to convert data that is initially raw text into the required format. Text preprocessing can include several steps, namely case folding, tokenizing, filtering, and stemming.

TF-IDF
Term Frequency-Inverse Document Frequency (TF-IDF) is a method used to calculate the weight of each extracted word and is commonly used for term weighting in information retrieval. The TF-IDF weighting model integrates term frequency (TF) and inverse document frequency (IDF): TF counts the number of occurrences of a word in a document/text, while IDF reduces the weight of terms that appear across many documents/texts, which are considered general terms and therefore not important [13][14][15].
In the TF-IDF method, the first step is to count the term frequency $tf_{t,d}$, where $t$ is a term and $d$ is a document, so that $tf_{t,d}$ is the number of times term $t$ appears in document $d$. This affects the term's weight, which is higher the more often the term appears in a document [16]. The term frequency is then converted into a term frequency weight using the formula in Equation 1.
Words that appear many times in a document often have a high term frequency without being important. To avoid giving weight to such non-essential words, document frequency weighting is used, which counts the number of documents containing the term. A term that appears in most documents interferes with the search for distinctive terms. Inverse Document Frequency (IDF) is useful for reducing the weight of a term whose appearances are spread across all documents. The formula of the inverse document frequency is shown in Equation 2.
Finally, the TF-IDF weight is obtained by multiplying the term frequency weight by the inverse document frequency. The formula is shown in Equation 3.
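The three weighting steps above can be sketched in Python as follows. The sketch assumes the common log-scaled variant, $W_{tf} = 1 + \log_{10}(tf_{t,d})$ and $idf_t = \log_{10}(N / df_t)$, with the final weight being their product. This choice of variant, the function name tf_idf, and the sample documents are illustrative assumptions and may differ from the exact formulation used in this study.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting (an illustrative choice, not confirmed by the study):
      W_tf = 1 + log10(tf)   for tf > 0
      idf  = log10(N / df)
      w    = W_tf * idf
    """
    n_docs = len(documents)

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))

    weights = []
    for tokens in documents:
        tf = Counter(tokens)  # raw occurrences of each term in this document
        weights.append({
            term: (1 + math.log10(count)) * math.log10(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Hypothetical preprocessed reviews (already tokenized and stemmed).
docs = [
    ["hotel", "bersih", "bersih"],
    ["hotel", "lokasi", "strategis"],
]
weights = tf_idf(docs)
```

In this toy example the term "hotel" appears in every document, so its IDF and therefore its weight is zero, while "bersih", which appears twice in a single document, receives the largest weight.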

Support Vector Machine
Support Vector Machine is a classification method that uses supervised learning. The algorithm was created by Vladimir Vapnik and works by analyzing data and recognizing patterns to produce classification results. The support vector machine has a simple concept: it calculates the best hyperplane, or boundary line, separating two classes, so that each data point falls into one category or the other [17,18]. An example of class separation in a support vector machine is shown in Figure 3.
Figure 3 shows two classes, namely +1, the positive class, and -1, the negative class. The best hyperplane is found by calculating the hyperplane margin and looking for its maximum. The best hyperplane dividing the two classes is defined in Equation 4.
A pattern that belongs to class -1 is formulated in Equation 5.
A pattern that belongs to class +1 is formulated in Equation 6.
The largest margin is found by maximizing the distance between the hyperplane and its closest point.
A problem in the classification process is that most sample data are not linearly separable, so a linear support vector machine gives suboptimal results and poor classification. A linear support vector machine can be turned into a non-linear one by adding a kernel function. This approach works by mapping the input data to a higher-dimensional feature space, in which the mapped data is expected to be linearly separable so that the optimal hyperplane can be found [19]. Frequently used kernel functions include the linear kernel, the polynomial kernel, the radial basis function (RBF) kernel, and the sigmoid kernel.
Three algorithms can be used to process support vector machine training data, namely Quadratic Programming, Sequential Minimal Optimization, and Sequential Training. The advantages and disadvantages of each must be considered. Quadratic Programming gives numerical results but uses a complex algorithm and takes a long time. Sequential Minimal Optimization is a development of Quadratic Programming that solves the problem through a series of small optimizations. Sequential Training is a simple algorithm that does not take much time. The sequential training steps are as follows [20,21].
First, we initialize $\alpha_i = 0$ and set the other parameters, namely the $\gamma$, $C$, and $\varepsilon$ values. Here $\alpha_i$ is used to find the support vectors, $\gamma$ is a constant that controls the learning speed, $C$ is the slack variable, and $\varepsilon$ is used to find the error value.
Second, we calculate the Hessian matrix, obtained by multiplying the polynomial kernel by $y$, a vector whose entries are 1 and -1. The calculation follows Equation 12.
Third, in each iteration, we calculate the value given by Equation 13.
Fourth, we calculate the update value $\delta\alpha_i$ using Equation 14.
Fifth, we update the value of $\alpha_i$ using Equation 15.
We then return to the third step and repeat until the maximum number of iterations is reached or $\max(|\delta\alpha_i|) < \varepsilon$. This process yields the support vectors (SV), where SV = ($\alpha_i$ > thresholdSV). After that, the bias value is calculated using Equation 16.
Sixth, we calculate the function f(x) with Equation 17 to determine the sentiment class of a document.
If the function is positive, then the document would be classified in the positive sentiment class. In contrast, if the function is negative, it would be classified in the negative sentiment class.
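The sequential training steps can be sketched in plain Python as follows. The sketch follows the commonly published form of the algorithm and makes several assumptions that are not stated in the text: the Hessian is taken as $D_{ij} = y_i y_j (K(x_i, x_j) + \lambda^2)$, the third-step value as $E_i = \sum_j \alpha_j D_{ij}$, the update as $\delta\alpha_i = \min(\max(\gamma(1 - E_i), -\alpha_i), C - \alpha_i)$, and the bias as $-\frac{1}{2}(w \cdot x^+ + w \cdot x^-)$, where $x^+$ and $x^-$ are the samples with the largest $\alpha$ in each class. The polynomial kernel form, the parameter values, the function names, and the toy data are likewise hypothetical.

```python
def poly_kernel(x, z, degree=2):
    """Polynomial kernel (x . z + 1) ** degree (assumed form)."""
    return (sum(a * b for a, b in zip(x, z)) + 1) ** degree


def score(x, X, y, alpha):
    """Kernel expansion sum_i alpha_i * y_i * K(x_i, x), without the bias."""
    return sum(a * yi * poly_kernel(xi, x) for a, yi, xi in zip(alpha, y, X))


def train_sequential(X, y, gamma=0.05, C=1.0, eps=1e-6, lam=0.5, max_iter=2000):
    """Return (alpha, bias) learned by sequential training."""
    n = len(X)
    # Step 2: Hessian matrix D_ij = y_i * y_j * (K(x_i, x_j) + lam^2).
    D = [[y[i] * y[j] * (poly_kernel(X[i], X[j]) + lam ** 2) for j in range(n)]
         for i in range(n)]
    alpha = [0.0] * n  # Step 1: every alpha_i starts at zero.
    for _ in range(max_iter):
        # Step 3: E_i = sum_j alpha_j * D_ij.
        E = [sum(alpha[j] * D[i][j] for j in range(n)) for i in range(n)]
        # Step 4: delta_i = min(max(gamma * (1 - E_i), -alpha_i), C - alpha_i).
        delta = [min(max(gamma * (1 - E[i]), -alpha[i]), C - alpha[i])
                 for i in range(n)]
        # Step 5: alpha_i = alpha_i + delta_i.
        alpha = [a + d for a, d in zip(alpha, delta)]
        # Stop once the largest update falls below epsilon.
        if max(abs(d) for d in delta) < eps:
            break
    # Bias from the sample with the largest alpha in each class.
    pos = max((i for i in range(n) if y[i] == 1), key=lambda i: alpha[i])
    neg = max((i for i in range(n) if y[i] == -1), key=lambda i: alpha[i])
    bias = -0.5 * (score(X[pos], X, y, alpha) + score(X[neg], X, y, alpha))
    return alpha, bias


def predict(x, X, y, alpha, bias):
    """Step 6: the sign of f(x) gives the sentiment class (+1 or -1)."""
    return 1 if score(x, X, y, alpha) + bias >= 0 else -1


# Toy two-dimensional feature vectors standing in for TF-IDF weights.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y = [1, 1, -1, -1]
alpha, bias = train_sequential(X, y)
```

With this toy data, predict([0.8, 0.2], X, y, alpha, bias) falls on the positive side of the hyperplane and predict([0.2, 0.8], X, y, alpha, bias) on the negative side. The learning rate $\gamma$ must be small relative to the size of the Hessian entries for the updates to settle.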

MATERIAL AND METHOD
This study was carried out in four sequential processes. They are data collection, data pre-processing, feature extraction, and sentiment analysis. Figure 4 shows the processes.

Literature Study
A literature study was carried out by looking for sources related to sentiment analysis, text mining, the support vector machine method, and the TF-IDF method, in order to gain knowledge and support the research. The literature was drawn from books, journals, and previous theses.

Data Collection
At this stage, the data used in the research is collected. The data consists of reviews from the TripAdvisor site, restricted to hotels in East Java. The data is retrieved using the scrapy library in Python, and the reviews are saved in a CSV file. The reviews are then split on the period (full stop), so that a review containing two sentences is counted as two documents. After that, the data is divided into training data and testing data. The training data is labeled, and the aspects are categorized manually. As shown in Figure 5, the aspect categories are location, cleanliness, and service.
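The splitting of reviews into sentence-level documents can be illustrated with a minimal Python sketch that uses only the standard library. The function names, the CSV column names, and the sample review are hypothetical, and the crawling step itself is not shown.

```python
import csv
import io
import re


def split_into_documents(review):
    """Split a review on periods; every non-empty sentence becomes one document."""
    return [part.strip() for part in re.split(r"\.+", review) if part.strip()]


def reviews_to_csv(reviews):
    """Write one row per sentence-level document and return the CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["review_id", "document"])  # hypothetical column names
    for review_id, review in enumerate(reviews, start=1):
        for document in split_into_documents(review):
            writer.writerow([review_id, document])
    return buffer.getvalue()


# Hypothetical review with two sentences, which yields two documents.
reviews = ["Lokasinya strategis di tengah kota. Kamarnya bersih."]
csv_text = reviews_to_csv(reviews)
```

Splitting on every period is a simplification: it would also split decimal numbers or abbreviations, so a real pipeline might need a more careful sentence splitter.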

Data Preprocessing
After the review data has been crawled from the TripAdvisor site, it proceeds to the preprocessing stage. This stage consists of several steps, namely case folding, tokenizing, filtering, and stemming.
Case folding. At this stage, the inconsistently cased review data is converted to all lowercase letters, and characters other than letters, such as numbers and punctuation marks, are removed, so that the resulting document/text is entirely lowercase and free of punctuation.
Tokenizing. At this stage, the lowercase, punctuation-free reviews are tokenized, that is, broken down into individual words.
Filtering. At this stage, the results of tokenizing are filtered, and words that are considered unimportant are deleted.

Stemming. After filtering, affixes (prefixes, suffixes, or infixes) are trimmed from the words so that each becomes a base word. The result of this stage is passed to the TF-IDF method for weighting before entering the classification stage.
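The four preprocessing steps can be sketched as a small Python pipeline. This is a simplified illustration that uses only the standard library: the stop-word list is a tiny hypothetical sample, and the stemmer strips only the suffix "-nya", whereas a real Indonesian stemmer handles many more affixes.

```python
import re

# Tiny illustrative stop-word list; a real list would be much longer.
STOPWORDS = {"di", "dan", "yang", "ke", "dari"}


def case_fold(text):
    """Lowercase the text and replace every non-letter character with a space."""
    return re.sub(r"[^a-z\s]", " ", text.lower())


def tokenize(text):
    """Break the text into words on whitespace."""
    return text.split()


def filter_tokens(tokens):
    """Delete words that are considered unimportant (stop words)."""
    return [token for token in tokens if token not in STOPWORDS]


def stem(token):
    """Toy stemmer: strip the suffix '-nya' when enough of the word remains."""
    if token.endswith("nya") and len(token) > 5:
        return token[:-3]
    return token


def preprocess(review):
    """Run case folding, tokenizing, filtering, and stemming in order."""
    tokens = filter_tokens(tokenize(case_fold(review)))
    return [stem(token) for token in tokens]


result = preprocess("Lokasinya strategis, di tengah kota!")
```

Here the punctuation is dropped by case folding, the stop word "di" is removed by filtering, and "lokasinya" is reduced to "lokasi" by the toy stemmer.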

Feature Extraction
Next comes the feature extraction process, also called term weighting. At this stage, the important words obtained from preprocessing are weighted. Feature extraction is carried out using the Term Frequency-Inverse Document Frequency (TF-IDF) method, and the weighting is done using the scikit-learn module in Python.

Sentiment Classification
After the topic of each review has been determined, the sentiment of each review is determined to find out whether its opinion is positive or negative. Sentiment in this study is determined using the support vector machine method.
The support vector machine method finds the best hyperplane, or dividing line, between the two classes and classifies each document according to the side of the line on which it falls.

Data Collection
The review data in this study were obtained by crawling with the scrapy library in Python. The data consists of hotel customer reviews in East Java from the TripAdvisor site and was divided into training data and testing data. The training data consists of 300 reviews, 150 labeled positive and 150 labeled negative, from various hotels in East Java. The testing data consists of reviews from five hotels in Surabaya and five hotels in Malang that are the most popular by rating on the TripAdvisor site. Examples of the raw data crawled from TripAdvisor can be seen in Table 1.
The training data was then labeled, and the aspects were categorized manually. Table 2 shows an example of the result of the labeling process.

Data Preprocessing
After crawling, the review data proceeds to the preprocessing stage, which consists of case folding, tokenizing, filtering, and stemming. Preprocessing is done using the Natural Language Toolkit (NLTK) in Python. Its output is a set of documents whose sentences have been split into words and reduced to their base forms, and this output is used in the next process.
Case folding. In this process, we converted the reviews to all lowercase letters and removed characters other than letters, such as numbers and punctuation marks. Table 3 shows an example of case folding on a list of reviews, for instance the case-folded review "tapi akan lebih bagus lagi seandainya disekitar hotel ada minimarket sehingga tidak menyulitkan jika ingin belanja" (in English: "but it would be even better if there were a minimarket around the hotel so that shopping would not be difficult").
Tokenizing. In this process, we broke the data down into individual words. Table 4 shows an example of tokenizing on a list of reviews.
Filtering. In this process, we filtered the data and deleted insignificant words. Table 5 shows an example of filtering on a list of reviews.
Stemming. In this process, we trimmed any prefix, suffix, or infix from the words, transforming each word into its base form. Table 6 shows an example of stemming on a list of reviews.

Feature Extraction
After preprocessing, the data is weighted using the TF-IDF method before entering the classification stage. First, the values of TF, DF, and IDF are calculated, as shown in Table 7. The TF-IDF weight is then obtained by multiplying the term frequency by the inverse document frequency. Table 8 shows the results.
Table 9 shows the sentiment classification results for hotel customers in Surabaya and Malang. Based on Table 9, the hotels in Surabaya with the most positive sentiment are Harris Hotel Gubeng and Pop! Hotel Gubeng, each with 252 reviews. The hotel with the most negative sentiment is Grand Darmo Suite in Surabaya, with 46 reviews.

Sentiment Classification Based on Aspect
The next step is to analyze the sentiment results by the aspect category of each review, so that the aspects with the most positive and the most negative sentiment can be identified. Three aspect categories are used: location, cleanliness, and service.
Based on Table 11, at Grand Darmo Suite Surabaya the aspect with the most positive sentiment is service, with 89 reviews or 39.9% of the total reviews. At Harris Hotel Gubeng Surabaya, it is service, with 133 reviews or 49.3% of the total reviews. At POP! Gubeng Hotel Surabaya, it is service, with 135 reviews or 49.3% of the total reviews. At Primebiz Hotel Surabaya, it is service, with 88 reviews or 41.9% of the total reviews. Finally, at Swiss-Bellin Tunjungan Surabaya, it is service, with 120 reviews or 44.3% of the total reviews. Based on these results, service is the aspect most reviewed by customers at the five hotels in Surabaya, and it is the aspect that most satisfies hotel customers in Surabaya.
Based on Table 12, at the 101 OJ Hotel Malang the aspect with the most positive sentiment is service, with 114 reviews or 59.1% of the total reviews. At Harris Hotel Malang, it is service, with 184 reviews or 47.2% of the total reviews. At Kartanegara Premium Guest House Malang, it is service, with 63 reviews or 35% of the total reviews. At Santika Premiere Hotel Malang, it is service, with 102 reviews or 48.3% of the total reviews. Finally, at Tugu Hotel Malang, it is service, with 44 reviews or 40% of the total reviews. Based on these results, service is the aspect most reviewed by customers at the five hotels in Malang, and it is the aspect that most satisfies hotel customers in Malang.
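The per-aspect figures above pair a count of positive reviews with its share of a hotel's total reviews. A minimal sketch of that aggregation is shown below; the function name, the label strings, and the sample data are hypothetical and are not taken from Tables 11 or 12.

```python
from collections import Counter


def positive_share_by_aspect(labeled_reviews):
    """Map each aspect to (positive count, percentage of all reviews).

    labeled_reviews is a list of (aspect, sentiment) pairs for one hotel.
    """
    total = len(labeled_reviews)
    positives = Counter(
        aspect for aspect, sentiment in labeled_reviews if sentiment == "positive"
    )
    return {
        aspect: (count, round(100 * count / total, 1))
        for aspect, count in positives.items()
    }


# Hypothetical classified reviews for a single hotel.
sample = [
    ("service", "positive"),
    ("service", "positive"),
    ("location", "positive"),
    ("cleanliness", "negative"),
]
shares = positive_share_by_aspect(sample)
best_aspect = max(shares, key=lambda aspect: shares[aspect][0])
```

In this sample, service has two positive reviews out of four in total, so it is reported as the aspect with the most positive sentiment at 50.0%.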

Evaluation
The success rate of the classifier is measured through evaluation. In general, a confusion matrix is used, from which the values of accuracy, precision, recall, and F1-score are calculated. Table 13 shows the results of the measurements.
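The four measures can be derived from the confusion matrix counts as in the following sketch. The function name and the sample labels are hypothetical, and the positive class is assumed to be encoded as +1 and the negative class as -1.

```python
def evaluate(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1-score for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    # Confusion matrix counts.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical true labels and predictions for five reviews.
y_true = [1, 1, 1, -1, -1]
y_pred = [1, 1, -1, -1, 1]
metrics = evaluate(y_true, y_pred)
```

For this sample there are two true positives, one true negative, one false positive, and one false negative, which gives an accuracy of 3/5 and a precision, recall, and F1-score of 2/3 each.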

CONCLUSION
The support vector machine method combined with TF-IDF can solve problems in sentiment classification. This is evidenced by the ability of the TF-IDF method to give a weight value to each word and the ability of the support vector machine method to label each review as positive or negative. Service is the aspect most reviewed by hotel customers in Surabaya and Malang, and it is the aspect that most satisfies hotel customers in both cities.
On the accuracy test, the highest accuracy on hotel review data in Surabaya was obtained for POP! Gubeng Hotel Surabaya, at 94%, while the highest accuracy on hotel review data in Malang was obtained for The 101 OJ Malang Hotel, at 99%. These accuracy values indicate that the classifier worked well in classifying reviews. Sentiment analysis of hotel reviews can thus be used to find out the opinions of hotel customers, and it can help hotel managers improve their service to increase the number of visits.