ALGORITHMS COMPARISON FOR NON-REQUIREMENTS CLASSIFICATION USING THE SEMANTIC FEATURE OF SOFTWARE REQUIREMENT STATEMENTS

Noise in a Software Requirements Specification (SRS) is an irrelevant requirements statement or a non-requirements statement. Noise can confuse the reader and can have negative repercussions in later stages of software development. This study proposes a classification model to detect the second type of noise, the non-requirements statement. The model is built on the semantic features of non-requirements statements. This research also compares five of the best supervised machine learning methods to date: support vector machine (SVM), naïve Bayes (NB), random forest (RF), k-nearest neighbor (kNN), and Decision Tree. The comparison aims to determine which method produces the best non-requirements classification model. It shows that the best model is produced by the SVM method, with an average accuracy of 0.96. The most significant features in this non-requirements classification model are the requirements or non-requirements statement label, statement id, normalized mean value, standard deviation value, similarity variant value, standard deviation normalization value, maximum normalized value, similarity variant normalization value, bad NN value, mean value, number of sentences, bad VB score, and project id.


INTRODUCTION
The Software Quality Assurance (SQA) process is considered very important for preventing cost increases during software development, especially at the requirements specification stage [1,2]. Meyer emphasized that a Software Requirements Specification (SRS) written in natural language has several weaknesses compared to one written in a formal language. These weaknesses stem from seven common mistakes that software developers make, which Meyer introduced as the seven sins of software requirements specifications [1,3-6]. The seven sins are noise, silence, overspecification, contradiction, ambiguity, forward reference, and wishful thinking. Several previous works focus on detecting the presence of these seven errors in the software requirements specification [3,5-7]. Another work by Enda and Siahaan focuses on correcting ambiguous requirements statements [4]. However, past research offers no prior publications on noise detection in software requirements specifications. Noise is a common problem not only in signal processing but also in software requirements specifications.
These facts motivate research into effective ways to detect noise in the software requirements specification. Noise is one of the types of errors that requirements engineers often make [1]. It may appear during the requirements specification process in the Software Requirements Specification (SRS). Noise occurs when the software developer includes information that is not relevant to the overall software requirements or includes non-requirements in the requirements specification. This can confuse the reader and can have negative repercussions in later stages of software development.
Noise is divided into two types: irrelevant requirements and non-requirements statements. Figure 1 shows the two types of noise. Irrelevant requirements discuss topics outside the general topic covered by all the relevant requirements. Non-requirements statements tend to contain topics that also appear in the requirements statements of the same SRS. Although noise in different SRSs shows different part-of-speech patterns, non-requirements statements tend to contain specific terms (such as project, schedule, object-oriented, and user acceptance). Noise in a software requirements specification may result from the engineer's tendency to say more than necessary. In the end, it can exhaust project resources and threaten the success of the project. Therefore, noise in a software requirements specification should be detected and removed as early as possible. Noise detection in a previous study resulted in an unsatisfactory Kappa coefficient of 0.4426. This is because the method developed in that study detects only the first type of noise, irrelevant requirements statements [8]. The present study improves on the previous method by developing a classification model for detecting the second type of noise, non-requirements statements.
There are several studies on noise or outlier detection in text data. Several studies focus on adapting unsupervised machine learning to detect noise [9-13]. Another study focuses on detecting irregularities in documents using conceptual charts [14]. That study provides a graphic visualization of the deviations that occur in text data.
This study exercises five well-known, representative supervised methods, i.e., Support Vector Machine (SVM), Naïve Bayes (NB), Random Forest (RF), k-Nearest Neighbor (kNN), and Decision Tree [15-18]. The contribution of this research is a classification model for non-requirements statements that can be used for detecting noise.
Furthermore, the classification model is built from the results of the best classification method. Thus, the non-requirements classification model is expected to improve the quality of the SRS.

Dataset
This study used 648 software requirements statements that were extracted manually from 14 SRS documents. Table 1 shows the dataset used in this study. Each document refers to a different project from various problem domains [8]. Five annotators labeled each statement as requirements, non-requirements, or irrelevant. The label is a boolean value. An annotator labels a statement with 1 (true) in the noise column if and only if the annotator judges the statement to be noise, and with 0 (false) if and only if the annotator judges it to be a relevant requirements statement. The reliability between annotators on the non-requirements statements is shown by the red diagram in Figure 2; the average reliability between annotators is 0.87. This shows that the five annotators have a very good level of reliability.
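The text does not say which reliability statistic produced the 0.87 figure. As one possibility only, agreement among a fixed number of annotators can be measured with Fleiss' kappa. The sketch below assumes every statement is labeled by the same number of annotators; the function name and the toy labels are illustrative and not taken from the study.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-statement label lists.

    Each inner list holds one label per annotator; all inner lists
    must have the same length (assumed here: five annotators).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    total_counts = Counter()
    agreement_sum = 0.0
    for row in ratings:
        counts = Counter(row)
        total_counts.update(counts)
        # Proportion of agreeing annotator pairs for this statement.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))
    observed = agreement_sum / n_items
    # Agreement expected by chance from the overall label proportions.
    expected = sum(
        (c / (n_items * n_raters)) ** 2 for c in total_counts.values()
    )
    if expected == 1.0:
        return 1.0  # every annotator used the same single label
    return (observed - expected) / (1.0 - expected)


# Toy example: 1 = noise, 0 = relevant requirement (invented labels).
toy = [[1, 1, 1, 1, 0], [0, 0, 0, 0, 0]]
print(round(fleiss_kappa(toy), 3))
```

Other statistics, such as the mean of pairwise Cohen's kappa values, would be equally plausible readings of "average reliability between annotators".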

FIGURE 2
Reliability between annotators on a non-requirements statement.

Development of a Non-Requirements Statement Classification Model
The steps for building a classification model for non-requirements statements can be seen in Figure 3. The procedure consists of three main processes: creating sentence-level features, creating discourse-level features, and building a noise classification model.

FIGURE 3
Building the classification models.

Sentence-Level Feature Generation
As shown in Figure 4, sentence-level feature generation comprises two main processes, i.e., sentence preprocessing and feature generation. Sentence preprocessing involves part-of-speech (POS) tagging, lemmatization, stopword removal, and a word-POS tag frequency counter. Based on the study carried out by Hussain [19], we considered seven word-POS tag pairs to be counted, i.e., word-MD (modal), word-NN (noun), word-VB (verb), word-RB (adverb), word-JJ (adjective), word-DT (determiner), and word-other (conjunction, preposition, cardinal, etc.). Since there are several noun tags (e.g., NN, NNS, NNP, and NNPS), we grouped them under a single NN tag. The same approach was applied to the other POS tags. We maintained a bag of words for each type of POS tag. Each bag of words contains unique word-POS tags. Based on the seven bags of words, the semantic-based word-POS tag frequency counter counts the number of word-POS tag occurrences in non-requirements and requirements statements for each type of word-POS tag pair.
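The grouping and counting step described above can be sketched as follows. This is a minimal illustration, assuming the statements have already been tagged with Penn Treebank style tags and lemmatized by some external tool; the function names and the sample statements are hypothetical.

```python
from collections import Counter, defaultdict

# The six named tag groups; every other tag falls into "other".
POS_GROUPS = ("MD", "NN", "VB", "RB", "JJ", "DT")


def group_tag(tag):
    """Collapse a fine-grained tag (e.g. NNS, VBZ) into its group."""
    for prefix in POS_GROUPS:
        if tag.startswith(prefix):
            return prefix
    return "other"


def count_word_pos(tagged_statements):
    """Count lemma occurrences per tag group and per statement class.

    tagged_statements: iterable of (tokens, is_non_requirement) pairs,
    where tokens is a list of (lemma, tag) pairs.
    """
    counts = defaultdict(lambda: {"non_req": Counter(), "req": Counter()})
    for tokens, is_non_req in tagged_statements:
        label = "non_req" if is_non_req else "req"
        for lemma, tag in tokens:
            counts[group_tag(tag)][label][lemma.lower()] += 1
    return counts


# Invented sample: one requirement and one non-requirement statement.
sample = [
    ([("system", "NN"), ("shall", "MD"), ("store", "VBZ")], False),
    ([("project", "NN"), ("schedule", "NNS")], True),
]
bags = count_word_pos(sample)
print(dict(bags["NN"]["non_req"]))
```

Keeping one pair of counters per tag group mirrors the seven bags of words, so the same word can carry different weight as a noun and as a verb.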

FIGURE 4
Automatic noise-related sentence-level feature generation.
The feature generation step involves two main processes, i.e., keyword ranker and feature selection. The keyword ranker uses the output from the previous process to calculate the discriminative strength (in terms of likelihood ratio, or LR) of each word-POS tag. The LR of a word-POS tag can be calculated using equation 1. The keyword ranker then orders the word-POS tags by their LR values.
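Equation 1 is not reproduced in this text, so the following is only a sketch under stated assumptions. It takes LR to be the ratio between the smoothed relative frequency of a word in non-requirements statements and its smoothed relative frequency in requirements statements. The add-one smoothing, the ascending sort order, and all names are assumptions for illustration, not the authors' definition.

```python
def likelihood_ratio(word, non_req_counts, req_counts, alpha=1.0):
    """Ratio P(word | non-requirement) / P(word | requirement).

    Both probabilities use additive smoothing (alpha) so that a word
    missing from one class does not cause a division by zero.
    """
    vocab_size = len(set(non_req_counts) | set(req_counts))
    p_non = (non_req_counts.get(word, 0) + alpha) / (
        sum(non_req_counts.values()) + alpha * vocab_size
    )
    p_req = (req_counts.get(word, 0) + alpha) / (
        sum(req_counts.values()) + alpha * vocab_size
    )
    return p_non / p_req


def rank_keywords(non_req_counts, req_counts):
    """Return (word, LR) pairs sorted by increasing LR."""
    vocab = set(non_req_counts) | set(req_counts)
    scored = [
        (w, likelihood_ratio(w, non_req_counts, req_counts)) for w in vocab
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Invented counts for one bag (e.g. the NN bag).
non_req = {"project": 3, "schedule": 2}
req = {"system": 4, "project": 1}
for word, lr in rank_keywords(non_req, req):
    print(word, round(lr, 2))
```

Under this reading, an LR well above 1 marks a word as typical of non-requirements statements, and an LR well below 1 marks it as typical of requirements statements.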

Feature Extraction
The next process was extracting sentence-level and discourse-level features from the training data. This training data was used to build the classification model for detecting non-requirements noise. Figure 5 and Figure 6 show the process diagrams for extracting sentence-level and discourse-level features, respectively. In sentence-level feature extraction, the frequency counter process refers to the seven bags of word-POS tags generated by the previous process. Since the corpus has a limited vocabulary, we need to extend its scope to handle broader cross-domain projects. We used the WordNet thesaurus to find similar words, given a similarity threshold.
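The thresholded matching against a bag can be sketched as below. The similarity function is passed in as a parameter; in practice it might be backed by a WordNet-based measure, but the toy lookup table, the threshold of 0.8, and the function names here are assumptions made only so the example is self-contained.

```python
def count_matches(lemmas, bag, similarity, threshold=0.8):
    """Count lemmas that are in the bag or similar enough to a bag word."""
    hits = 0
    for lemma in lemmas:
        if lemma in bag or any(
            similarity(lemma, word) >= threshold for word in bag
        ):
            hits += 1
    return hits


# Toy stand-in for a thesaurus-based similarity (invented scores).
TOY_SCORES = {frozenset({"timetable", "schedule"}): 0.9}


def toy_similarity(a, b):
    """Symmetric lookup; unrelated pairs score 0.0."""
    if a == b:
        return 1.0
    return TOY_SCORES.get(frozenset({a, b}), 0.0)


bag_nn = {"schedule", "project"}
statement = ["timetable", "system", "schedule"]
print(count_matches(statement, bag_nn, toy_similarity))
```

Raising the threshold makes the counter stricter, so fewer out-of-vocabulary words are mapped onto the bag; lowering it widens coverage across domains at the cost of more false matches.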

FIGURE 5
Sentence-level feature extraction.

Test Results
The features described in the previous section were tested using the Weka application. The initial process is cross-validation over all project IDs. Next, training and testing were carried out in a cycle: project DA-1 was used as test data while projects DA-2 to DA-14 were used as training data; then project DA-2 was used as test data while projects DA-1 and DA-3 to DA-14 were used as training data; and so on, until project DA-14 was the test data and projects DA-1 through DA-13 were the training data. Table 2 shows the performance of each classification method in the cross-validation environment. Table 3 shows the performance of each method in building a classification model for non-requirements statements and testing it on each dataset. From the training and testing process, it can be seen that SVM and Random Forest are almost equally stable in accuracy (with variances of 0.012 and 0.011, respectively). kNN and Decision Tree are less stable, each with a variance of 0.016. Naive Bayes has the highest variance, which means that its accuracy is the least consistent of the five methods. These results indicate that the training model produced by SVM has a higher average accuracy (0.883) than the other four methods. The classification model produced by SVM also has the highest average accuracy (0.910). These results are consistent with [15,20], in which SVM produced more accurate classification results than other classification methods. Table 3 also shows statistical information on the classification results of each method. This confirms that, in general, the classification model produced by SVM has a more stable performance with respect to the population size of the dataset and the ratio between requirements and non-requirements statements.
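The train/test cycle over the 14 projects amounts to leave-one-project-out evaluation. The sketch below only generates the splits and summarizes a list of per-project accuracies; the classifier itself is omitted, the accuracy values are invented, and the use of population variance is an assumption, since the text does not say which variance estimator was used.

```python
from statistics import mean, pvariance


def leave_one_project_out(project_ids):
    """Yield (training_ids, test_id), holding out one project at a time."""
    for held_out in project_ids:
        training = [p for p in project_ids if p != held_out]
        yield training, held_out


def summarize(accuracies):
    """Return (mean, population variance) of per-project accuracies."""
    return mean(accuracies), pvariance(accuracies)


projects = [f"DA-{i}" for i in range(1, 15)]
for training, test in leave_one_project_out(projects):
    # A real run would train on `training` and score on `test` here.
    pass

# Invented accuracies, only to show the summary step.
print(summarize([0.9, 0.8, 1.0]))
```

Splitting by project rather than by statement keeps all statements of one SRS on the same side of the split, which is what makes the reported variance a measure of stability across problem domains.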
The precision-recall plot in Figure 7 and the ROC-recall plot in Figure 8 place each method in a two-dimensional space and so measure its ability to distinguish the two possible outcomes. The classification model that produces the highest number of results located above the diagonal line is considered the best model. The two plots show that the classification model produced by SVM is more promising than the others. In addition, this study found an insignificant variance in classification results among problem domains for all methods.
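Each point in such plots comes from a confusion matrix. The sketch below computes precision, recall, and false positive rate for one set of predictions and checks whether the resulting ROC point lies above the chance diagonal (true positive rate greater than false positive rate). The labels are invented and the function name is illustrative.

```python
def plot_point(y_true, y_pred):
    """Return precision, recall, and false positive rate (1 = noise)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "fpr": fpr,
        # Above the ROC diagonal means better than random guessing.
        "above_roc_diagonal": recall > fpr,
    }


truth = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
print(plot_point(truth, predicted))
```

A point near (1, 1) in precision-recall space, or near the top-left corner in ROC space, corresponds to a classifier that finds most non-requirements statements while raising few false alarms.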

TABLE 3
The accuracy of classification models in the testing process.

SVM Representation
A simple representation of the SVM is given in equation 3.
The variable ξ_i is called the slack variable and measures the error made at point (x_i, y_i). Kernel functions play an important role in SVM performance. They are based on reproducing kernel Hilbert spaces, as in Eq. 4.
If K is a symmetric positive definite function that satisfies Mercer's condition, then it corresponds to an inner product in some feature space. The polynomial kernel is a popular method for non-linear modeling. Of its two common forms, the second is usually preferred because it avoids problems with the Hessian becoming zero.
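Equations 3 and 4 do not survive in this text. For orientation only, the standard textbook soft-margin formulation that matches the description above (slack variables, a kernel defined through a feature map, and two polynomial kernels) is given below. The symbols C (penalty parameter), φ (feature map), and d (polynomial degree) are assumptions and may differ from the authors' notation.

```latex
% Soft-margin SVM (assumed form of Eq. 3)
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \bigl( \langle w, \phi(x_i) \rangle + b \bigr) \ge 1 - \xi_i,
\qquad \xi_i \ge 0,\ i = 1, \dots, n

% Kernel as an inner product in feature space (assumed form of Eq. 4)
K(x, x') = \langle \phi(x), \phi(x') \rangle

% Two common polynomial kernels; the second adds a constant term
K(x, x') = \langle x, x' \rangle^{d},
\qquad
K(x, x') = \bigl( \langle x, x' \rangle + 1 \bigr)^{d}
```

Here ξ_i is zero for points that lie on the correct side of the margin and grows with the size of the margin violation at (x_i, y_i), so C controls the trade-off between a wide margin and few training errors.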

Attribute Selection
Attribute selection is performed to find out which attributes are the most significant. The attribute evaluator is InfoGainAttributeEval, used with the Ranker search method from the Weka application. Table 4 explains the column codes used in Table 5. In Table 5, it can be seen that all features can affect accuracy (with the greatest variance being 0.0001). From the attribute selection results, it can be seen that of the nineteen features used to build the model, only eleven are considered significant for classifying non-requirements statements. The eleven features represent semantic and statistical features. Table 6 shows the correlation between variables after feature selection. The letters a-k along the top and left side of Table 6 are explained in Table 4.
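Ranking attributes by information gain can be sketched as follows. This is not Weka's implementation; it is a minimal illustration that assumes every feature has already been discretized into nominal values, and the feature names and toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a nominal feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(features, labels):
    """Return (name, gain) pairs sorted from most to least informative."""
    scored = [
        (name, information_gain(values, labels))
        for name, values in features.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented data: 1 = non-requirement, 0 = requirement.
labels = [1, 1, 0, 0]
features = {
    "bad_nn_high": [1, 1, 0, 0],   # separates the classes perfectly
    "long_sentence": [1, 0, 1, 0],  # carries no class information
}
print(rank_features(features, labels))
```

A feature whose gain is close to zero tells the classifier almost nothing about the class, which is the sense in which some of the nineteen features are treated as not significant.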

Discussion
The results above show that there is a set of semantic features that can distinguish requirements statements from non-requirements statements. SVM was chosen because it is the best supervised method for classifying requirements statements and non-requirements statements.

CONCLUSION
Non-requirements statements in an SRS can be detected by classifying the statements in the SRS. Of the five classification methods compared (SVM, Naive Bayes, Random Forest, kNN, and Decision Tree), SVM is the best method for detecting non-requirements statements, as indicated by an average accuracy of 0.96. In addition, in the precision-recall and ROC-recall plots, the points generated by the SVM method lie further above the diagonal line than those of the other methods, with x and y values approaching 1.
The features that affect the results of noise detection are the maximum normalization value, mean normalization value, normalized standard deviation value, variance normalization value, maximum value, mean value, standard deviation value, variance value, bad NN value, bad VB value, and bad other value. For further development, other methods could be tried to obtain a better score. In addition, it is hoped that the classification model built with the Naive Bayes method can be developed further so that it detects non-requirements statements better.