Comparative Study of KNN, SVM and Decision Tree Algorithm for Student’s Performance Prediction

Abstract-Non-active students reduce the number of students who graduate on time. Predicting student performance makes it possible to identify such students early. This study compares the KNN, SVM, and Decision Tree algorithms to obtain the best predictive model. The models were built in the following steps: data collection, pre-processing, model building, comparison of models, and evaluation. The results show that the SVM algorithm has the best prediction accuracy, at 95%. The Decision Tree algorithm has a prediction accuracy of 93% and the KNN algorithm has a prediction accuracy of 92%.

I. INTRODUCTION
Every college department strives to improve the quality of its education and its accreditation. The timeliness of student graduation is one of the elements of accreditation assessment [1]. Accreditation improves when more students graduate on time. Non-active students reduce the number of students who graduate on time. Thus, the more students who graduate late, the lower the department's accreditation.
Predicting student performance makes it possible to identify non-active students early. Several studies on student performance have been conducted. Some use Data Mining algorithms, which have been applied to build a student performance analysis system (SPAS) [2], to analyze student performance using clustering techniques [3], and to predict student performance (poor, average, good, and excellent) from educational data [4]. Other studies apply Decision Tree algorithms, for example to predict college drop-outs based on GPA [5] and to predict the timely completion of 4-year studies [6]. Further work includes predicting student performance at the start of a course program [7], predicting student performance in distance higher education using active learning [8], predicting student performance in relation to course activities [9], and predicting student performance using advanced learning analytics to compare features [10]. In addition to Data Mining algorithms, Fuzzy methods have also been used to predict student performance. A Fuzzy Support System was used to evaluate student performance in the laboratory [11], and fuzzy logic has been applied to the evaluation of student academic performance [12].
Several studies have compared algorithms to find the one that gives the best predictions. Examples include a comparison of the Simple Logistic classifier and SVM algorithms for predicting an athlete's win [13], a comparative analysis of SVM and KNN classifiers for EMG signal classification [14], and a comparison of the KNN, SVM, and Random Forest algorithms for facial expression classification [15]. Comparative studies of algorithms for predicting student performance have also been carried out. These include a search for classification algorithms suitable for predicting student performance [16], a comparison of the Bayesian and Decision Tree algorithms [17], a comparison of the Apriori and K-Means algorithms [18], and a comparison of the Neural Network, SVM, and Decision Tree algorithms [19]. The KNN, SVM, and Decision Tree algorithms have also been compared [20]. Recent research comparing the KNN, SVM, and Decision Tree algorithms concluded that the SVM algorithm has the best accuracy. That study used K = 5 for the KNN algorithm, so further research with different values of K is warranted. This paper continues that research by comparing the accuracy of the KNN, SVM, and Decision Tree algorithms while varying the value of K in the KNN algorithm.

II. METHODS
This research uses several Machine Learning algorithms, namely KNN, SVM, and Decision Tree. The tool used is RStudio, with the caret package as the library. The Machine Learning workflow consists of several steps: data collection, pre-processing, model building, comparison of models, and evaluation [21]. The research process is shown in Figure 1. Data collection combines all data into a single dataset with the same attributes. The attributes used are: GP (grade point), GPA (grade point average), hometown, type of school, major, parent's job, and student performance (active/non-active). Pre-processing improves the data before a Machine Learning model is built. Typical problems in the data include inconsistent attributes and missing values. Pre-processing also splits the data into training and testing sets. The training data is used to build the models. Each model is then tested on the testing data to determine the accuracy of its predictions. The next step is to compare the models built with the KNN, SVM, and Decision Tree algorithms. The final step is an evaluation that determines the best algorithm for predicting student performance based on the models obtained.
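As a rough illustration of the model-building and testing steps, the following Python sketch implements a plain k-nearest-neighbours classifier and scores several values of k on held-out data. It is not the caret code used in this study; the function names (`knn_predict`, `accuracy_for_k`) and the toy data are assumptions made only for this example.

```python
from collections import Counter
import math


def knn_predict(train_x, train_y, query, k):
    """Predict the label of `query` by majority vote of its k nearest neighbours."""
    # Euclidean distance from the query to every training point, nearest first.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    # most_common(1) returns the label with the highest vote count.
    return votes.most_common(1)[0][0]


def accuracy_for_k(train_x, train_y, test_x, test_y, k):
    """Fraction of test points classified correctly for a given k."""
    hits = sum(
        knn_predict(train_x, train_y, x, k) == y for x, y in zip(test_x, test_y)
    )
    return hits / len(test_y)


# Toy data (hypothetical): (GP, GPA) pairs labelled 1 = active, 0 = non-active.
train_x = [(3.5, 3.4), (3.2, 3.1), (3.8, 3.6), (1.0, 1.5), (0.5, 1.2), (1.4, 1.8)]
train_y = [1, 1, 1, 0, 0, 0]
test_x = [(3.0, 3.2), (0.8, 1.0)]
test_y = [1, 0]

# Try several odd values of k; odd values avoid tied votes with two classes.
for k in (1, 3, 5):
    print(k, accuracy_for_k(train_x, train_y, test_x, test_y, k))
```

On real data the categorical attributes would need an encoding or a distance measure suited to them; the sketch uses only the two numeric scores to keep the distance computation simple.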

III. RESULTS
This paper uses student academic data from the Informatics Engineering Department of Politeknik Harapan Bersama. The dataset consists of 1530 rows and 7 attributes. The first 6 attributes are used to predict the 7th. Table I shows the details of the data.
GP (Grade Points) is the average score of learning outcomes in each semester, where 0 is the lowest score and 4 is the highest. GPA (Grade Point Average) is the cumulative average over all completed semesters, where 0 is the lowest score and 4 is the highest. Hometown is the student's hometown: 0 means the student comes from a city near the campus and 1 means the student comes from a city far from the campus. Type of school is the type of high school attended: 0 means a private school and 1 means a public school. Major is the student's high school major: 1 means computer/informatics, 2 means natural science, and 3 means any other major. Parent's job is the occupation of the student's parents: 1 means civil servant, 2 means private employee, 3 means entrepreneur, 4 means farmer/fisherman, and 5 means any other occupation. Active is the student's performance: 0 means the student is not active and 1 means the student is active.
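The coding scheme described above can be written down as a small validated record type. The Python sketch below is an illustration only: the class name `StudentRecord`, the field names, and the example values are assumptions, while the code tables follow the attribute descriptions in the text.

```python
from dataclasses import dataclass

# Code tables transcribed from the attribute descriptions in the text.
HOMETOWN = {0: "near campus", 1: "far from campus"}
SCHOOL_TYPE = {0: "private", 1: "public"}
MAJOR = {1: "computer/informatics", 2: "natural science", 3: "other"}
PARENT_JOB = {
    1: "civil servant",
    2: "private employee",
    3: "entrepreneur",
    4: "farmer/fisherman",
    5: "other",
}
ACTIVE = {0: "non-active", 1: "active"}


@dataclass(frozen=True)
class StudentRecord:
    """One row of the dataset: six predictors plus the class label."""
    gp: float
    gpa: float
    hometown: int
    school_type: int
    major: int
    parent_job: int
    active: int

    def __post_init__(self):
        # GP and GPA are scores on a 0 to 4 scale.
        for name in ("gp", "gpa"):
            if not 0.0 <= getattr(self, name) <= 4.0:
                raise ValueError(f"{name} must lie between 0 and 4")
        # Every categorical attribute must use one of its documented codes.
        for name, table in (
            ("hometown", HOMETOWN),
            ("school_type", SCHOOL_TYPE),
            ("major", MAJOR),
            ("parent_job", PARENT_JOB),
            ("active", ACTIVE),
        ):
            if getattr(self, name) not in table:
                raise ValueError(f"{name} has an unknown code")

    def predictors(self):
        """Return the six attributes used to predict `active`."""
        return (self.gp, self.gpa, self.hometown,
                self.school_type, self.major, self.parent_job)


# Hypothetical example row, not taken from the study's data.
record = StudentRecord(gp=3.2, gpa=3.1, hometown=0, school_type=1,
                       major=2, parent_job=3, active=1)
print(record.predictors(), ACTIVE[record.active])
```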

A. Model Result
Before processing, the dataset is split into two parts at a ratio of 75:25, with 75% used for training and 25% for testing. The training data is used to construct the models. It comprises 1148 samples, 6 predictors, and 2 classes, and the models are trained with 10-fold cross-validation repeated 3 times. The output of training is a model used for classification. The models that were built are shown in Table II.
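The split and resampling scheme can be sketched in a few lines of Python. This is a simplified illustration rather than the caret procedure used in the study: the split here is a plain shuffle without class stratification, rounding the training size up is an assumption chosen because it reproduces the 1148 training samples reported, and the function names and the seed are arbitrary.

```python
import math
import random


def train_test_split(n_rows, train_fraction, rng):
    """Shuffle row indices and split them into training and testing sets."""
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = math.ceil(n_rows * train_fraction)  # assumption: round up
    return indices[:n_train], indices[n_train:]


def repeated_kfold(indices, n_folds, n_repeats, rng):
    """Yield (fit_indices, validation_indices) for each fold of each repeat."""
    for _ in range(n_repeats):
        shuffled = list(indices)
        rng.shuffle(shuffled)
        # Deal indices round-robin so fold sizes differ by at most one.
        folds = [shuffled[i::n_folds] for i in range(n_folds)]
        for held_out in range(n_folds):
            validation = folds[held_out]
            fit = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
            yield fit, validation


rng = random.Random(42)  # fixed seed, chosen arbitrarily for reproducibility
train_idx, test_idx = train_test_split(1530, 0.75, rng)
resamples = list(repeated_kfold(train_idx, n_folds=10, n_repeats=3, rng=rng))
print(len(train_idx), len(test_idx), len(resamples))
```

Each candidate model would be fitted on the `fit` indices and scored on the `validation` indices of every resample, and the testing set would be kept aside until the final evaluation.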
Each model was then evaluated on the testing data to determine its accuracy. Table III shows the matrix of testing results for the KNN algorithm, Table IV shows the testing results for the SVM algorithm, and Table V shows the testing results for the Decision Tree algorithm.

B. Classification Results
The classification results are obtained from the models that have been tested. Table VI compares the testing results of the KNN, SVM, and Decision Tree algorithms using the confusion matrix. Figure 2 compares the accuracy of the algorithms for each class.
The final result is a comparison of the classification models to see which algorithm has the best accuracy. Table VII shows this comparison.
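To make the quantities in this comparison concrete, the Python sketch below builds a two-class confusion matrix and derives overall and per-class accuracy from it. The labels are invented for illustration and are not the study's test results. Per-class accuracy is taken here to mean the share of each actual class that is predicted correctly, which is an assumption, since other definitions are possible.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels=(0, 1)):
    """Return matrix[actual_label][predicted_label] = number of cases."""
    counts = Counter(zip(actual, predicted))
    return {a: {p: counts[(a, p)] for p in labels} for a in labels}


def overall_accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[c][c] for c in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total


def per_class_accuracy(matrix):
    """Share of each actual class that was predicted correctly."""
    return {c: matrix[c][c] / sum(matrix[c].values()) for c in matrix}


# Made-up labels for illustration only: 1 = active, 0 = non-active.
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]

matrix = confusion_matrix(actual, predicted)
print(matrix)
print(overall_accuracy(matrix))
print(per_class_accuracy(matrix))
```

Under this definition, overall accuracy is a weighted average of the per-class values, so a model can rank first overall while ranking second on the smaller class.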

IV. DISCUSSIONS
The best KNN model for predicting student performance uses k = 3, with an accuracy of 94.5%. The comparison based on the confusion matrix shows a different picture from the previous comparisons. The SVM algorithm has the best accuracy for predicting active students (96%), compared with KNN (96%) and Decision Tree (92%). However, the Decision Tree algorithm has the best accuracy for predicting non-active students (92%), compared with SVM (91%) and KNN (88%). Although the Decision Tree algorithm is the most accurate for non-active students, it leads the SVM algorithm by only 1 percentage point. For active students, SVM differs from Decision Tree and KNN by 4 percentage points. The SVM algorithm can therefore still be considered the best of the three. The overall accuracy supports this: SVM has the best classification accuracy at 95%, while KNN has 94.5% and Decision Tree has 93%. Thus, the best algorithm for predicting student performance is SVM.

V. CONCLUSIONS
The KNN algorithm predicts student performance well with k = 3. The best SVM model for predicting student performance uses C = 1. For the Decision Tree algorithm, the best predictions are obtained with cp = 0.6689113. A comparison of the three machine learning algorithms (KNN, SVM, and Decision Tree) shows that SVM has the best accuracy (95%) in predicting student performance, compared with KNN (94.5%) and Decision Tree (93%).