Case Study

Effective simultaneous feature and sample reduction for highly imbalance datasets: A case study on intrusion detection data



Abstract

The class imbalance problem is an important issue in many datasets, especially in applications concerned with detection and diagnosis. Important classification problems in this category include network intrusion detection and disease diagnosis. Class imbalance may bias decisions towards the majority class. This paper proposes a novel simultaneous feature and sample selection approach that addresses this problem by weighting the samples (features) based on their importance and their role in the constitution of the whole dataset. For this purpose, a new criterion called centrality is defined, which accounts for the data points located in the core of the distribution. This criterion is calculated based on the number of neighbors of each data point on the data manifold. The locally linear embedding (LLE) algorithm is used to find the data manifold. The approach tries to preserve the main structure of the dataset in the space even after feature/sample removal. Experiments on the NSL-KDD dataset, an imbalanced dataset for intrusion detection, show that the approach can maintain or even increase the average classification accuracy while improving the precision and F-measure on the minority classes.

Keywords: imbalanced data; feature selection; prototype selection; intrusion detection; manifold learning; locally linear embedding (LLE)

 

  1. Introduction

Learning from imbalanced datasets is an important issue in machine learning research because imbalanced scenarios arise in many applications such as fraud detection, customer turnover, product defect detection, disease detection, disaster prediction, regression, image annotation, video processing, text classification, support vector machine learning, anomaly detection, classifier performance estimation, and more [1, 2].

An imbalanced dataset is one in which the number of samples differs greatly between classes. Classes with the fewest samples are called minority classes. Most learning algorithms train a classifier under the assumption that the classes have roughly equal numbers of samples. Therefore, when these algorithms are applied to imbalanced datasets, the classifier is trained mostly on majority class instances. This leads to very poor prediction of minority class examples, because the minority class is not learned properly.
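The effect described above can be illustrated with a small sketch. The class names and the 95/5 split are assumptions chosen for illustration only, and the helper `accuracy_and_minority_recall` is hypothetical. A classifier that always predicts the majority class looks accurate overall while missing every minority sample.

```python
def accuracy_and_minority_recall(y_true, y_pred, minority):
    """Return (overall accuracy, recall on the minority class)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    minority_total = sum(t == minority for t in y_true)
    minority_hits = sum(
        t == minority and p == minority for t, p in zip(y_true, y_pred)
    )
    return correct / len(y_true), minority_hits / minority_total


# Hypothetical data: 95 "normal" samples and 5 "attack" samples.
y_true = ["normal"] * 95 + ["attack"] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = ["normal"] * 100

accuracy, recall = accuracy_and_minority_recall(y_true, y_pred, "attack")
print(accuracy, recall)
```

Under these assumptions the overall accuracy is 95% while the minority recall is zero, which is why accuracy alone is a misleading criterion on imbalanced data.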

Distribution of classes, or in other words the proportion of samples belonging to each class in the dataset, plays a very important role in building classifiers. Imbalanced classification arises when one class, often the concept to be learned (the minority class), occurs less frequently than the other classes (the majority classes). Classification of imbalanced data is one of the important research issues in the field of data mining, and it becomes even more challenging in the presence of outliers. Creating a classifier that handles these two issues simultaneously is one of the major challenges [3-5].

The challenges of classifying imbalanced data include the extraction of biased models from training data, incorrect classification of minority classes, neglect of minority data, and over-fitting. Solutions for imbalanced data are divided into four categories: data-level approaches, algorithm-level approaches, cost-sensitive approaches, and ensemble-level approaches.

Data-level approaches attempt to balance the data by changing the data distribution. This involves either reducing the majority class samples or increasing the minority class samples, which are called under-sampling and over-sampling, respectively. Each of these methods has disadvantages.

In under-sampling, reducing the majority class deletes data that may contain information useful for building a classifier. In over-sampling, increasing the data in the minority class results in increased runtime and over-fitting.
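A minimal sketch of random under-sampling and random over-sampling follows. It is an illustration under stated assumptions, not the method proposed in this paper: the function names, the seed parameter, and the toy labels are all hypothetical, and balancing to the smallest or largest class size is just one possible policy.

```python
import random


def _group_by_class(samples, labels):
    """Collect samples into a dict keyed by class label."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    return by_class


def random_under_sample(samples, labels, seed=0):
    """Randomly discard samples so every class matches the smallest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = min(len(xs) for xs in by_class.values())
    return [(x, y) for y, xs in by_class.items() for x in rng.sample(xs, target)]


def random_over_sample(samples, labels, seed=0):
    """Randomly duplicate samples so every class matches the largest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = max(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        balanced.extend((x, y) for x in xs + extra)
    return balanced


# Hypothetical toy data: 8 majority samples and 2 minority samples.
samples = list(range(10))
labels = ["normal"] * 8 + ["attack"] * 2
print(len(random_under_sample(samples, labels)))  # 2 per class, 4 in total
print(len(random_over_sample(samples, labels)))   # 8 per class, 16 in total
```

The sketch makes both drawbacks visible: under-sampling throws away six of the eight majority samples, while over-sampling grows the dataset with exact duplicates of the two minority samples.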

Algorithm-level approaches refine standard classification algorithms to adapt them to imbalanced data.

In cost-sensitive approaches, a cost is assigned to each type of misclassification. Misclassifying minority class data is typically assigned a higher cost than misclassifying majority class data, and the costs are usually expressed as a cost matrix.

The most fundamental disadvantage of cost-sensitive methods is the difficulty of quantifying the costs. Two common methods in this category are cost-sensitive decision trees and cost-sensitive neural networks. In the learning process, the goal is to reduce the total misclassification cost rather than to increase the accuracy rate.
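The role of the cost matrix can be sketched as a minimum expected cost decision. The function name, the posterior probabilities, and the cost values below are assumptions made for illustration; how such costs should be chosen in practice is exactly the difficulty noted above.

```python
def min_cost_class(posteriors, cost):
    """Pick the class with the lowest expected misclassification cost.

    posteriors[j] is the estimated probability that the true class is j.
    cost[j][i] is the cost of predicting class i when the true class is j.
    Returns (chosen class index, list of expected costs per prediction).
    """
    n = len(posteriors)
    expected = [
        sum(posteriors[j] * cost[j][i] for j in range(n)) for i in range(n)
    ]
    best = min(range(n), key=lambda i: expected[i])
    return best, expected


# Hypothetical two-class case: 0 = majority ("normal"), 1 = minority ("attack").
posteriors = [0.8, 0.2]

uniform_cost = [[0, 1], [1, 0]]      # every error costs the same
weighted_cost = [[0, 1], [10, 0]]    # missing an attack costs 10 times more

print(min_cost_class(posteriors, uniform_cost)[0])   # chooses class 0
print(min_cost_class(posteriors, weighted_cost)[0])  # chooses class 1
```

With uniform costs the decision follows the larger posterior, whereas the weighted matrix shifts the decision towards the minority class even though its posterior is only 0.2.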

In ensemble level approaches, several classifiers are applied to the dataset. In these methods it is possible to use different classifiers. Finally, the method of voting between classifiers is usually used to predict new data labels.

Examples of ensemble-level approaches are bagging-based and boosting-based methods, such as adaptive boosting (AdaBoost), gradient tree boosting, and XGBoost [3].
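The voting step mentioned above can be sketched as a plain majority vote over the labels predicted by several classifiers. The function name and the predictions are hypothetical; real ensemble methods such as boosting typically use weighted votes, which this sketch does not model.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one list of voted labels.

    predictions is a list with one entry per classifier; each entry is the
    list of labels that classifier predicted for the same samples.
    On a tie, the label that was seen first among the tied ones wins.
    """
    voted = []
    for votes in zip(*predictions):
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted


# Hypothetical predictions of three classifiers on three samples.
predictions = [
    ["normal", "attack", "normal"],
    ["normal", "attack", "attack"],
    ["attack", "attack", "normal"],
]
print(majority_vote(predictions))  # ['normal', 'attack', 'normal']
```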

An imbalanced dataset may disrupt learning, and proper modeling of the minority class is important. The most common way to deal with imbalanced datasets is data sampling. Data sampling algorithms interact directly with training data and work independently of the classifier.

In general, sampling algorithms fall into two categories: the first increases the number of samples in the minority class (over-sampling), and the second reduces the number of samples in the majority class (under-sampling). Over-sampling algorithms are in turn divided into two categories: the first randomly selects samples from the minority class and adds copies of them to the dataset, and the second synthesizes new samples.

The SMOTE over-sampling technique is one of the most popular sampling algorithms. The algorithm selects two neighboring instances in the minority class and generates a new instance at a random point between them. Since SMOTE can select any pair of neighboring samples in the minority class, noisy samples are easily generated in the feature space. Various versions of SMOTE have been introduced to overcome the disadvantages of this method [1].
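The interpolation idea behind SMOTE can be sketched as follows. This is a simplified illustration, not a reference implementation: the function name `smote_like`, the default `k`, the seed, and the toy points are assumptions, and the sketch uses a brute-force neighbor search with Euclidean distance.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority
    points and their nearest minority neighbors (needs at least 2 points)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbors of the base point.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbor = minority[rng.choice(others[:k])]
        # Place the new point at a random position on the connecting segment.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Hypothetical minority class with three two-dimensional points.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
for point in smote_like(minority, n_new=3):
    print(point)
```

Because every synthetic point lies on a segment between two existing minority points, a noisy minority sample will propagate into the new samples, which is the weakness noted above.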

Prototype selection is a multi-objective problem in which both accuracy and compression rate are important. Sample selection is done correctly when the noise is first removed and the data is then compressed [6].

For large datasets, scalability is an important issue. One of the most common ways to deal with massive amounts of information is to reduce it. Among the data reduction techniques, one of the most common is instance selection. Instance selection involves deleting missing, redundant, or erroneous samples from the training set.

In instance selection methods, a subset of the data is selected such that a learner trained on it performs similarly to one trained on the entire dataset. Instance selection consists of two main models. In the first model, instance selection is the method for selecting the prototypes for algorithms such as k-nearest neighbor. In the second model, instance selection creates the training set for learning algorithms such as decision trees or neural networks [2].

Prototype selection is one of the most common preprocessing tasks in data mining applications. When dealing with large amounts of data in practical problems, removing noisy, redundant, or useless samples is the first step. There are many algorithms for prototype selection. However, in difficult problems, using only one method will not work well [7].

Dimensionality reduction is a challenging task for high-dimensional data processing in machine learning and data mining. Dimensionality reduction can help reduce computation time, save storage space and improve the performance of learning algorithms. Feature selection, an effective dimensionality reduction technique, finds a subset of features that retains the most relevant information [8].

Feature selection is an important tool for many machine learning and data mining tasks. By eliminating irrelevant features and reducing data processing complexity, feature selection can significantly improve the performance of classification or clustering tasks [9].

Traditional unsupervised feature selection algorithms usually assume that data samples are identically distributed and do not consider any dependencies between them. In practice, however, data samples are not only high-dimensional but also inherently dependent on one another [10].

Our aim in this research is to propose a method that improves the classification of minority classes while keeping the ability to classify majority class data at an acceptable level. In the remainder of the article, Section 2 reviews related work. Section 3 describes the research tools and methods. Section 4 focuses on the problem of imbalanced data and describes the proposed method. In Section 5, we present the results of the simulation of the proposed method. Section 6 summarizes the conclusions and future work.

 

  2. Related work

2-1-Improved feature selection method

In 2014, Wang and his colleagues incorporated feature selection and multiple kernel learning into the sparse coding on the manifold. To this end, unified objective functions are defined for feature selection, multiple kernel learning, sparse coding, and graph regularization. By optimizing the objective functions iteratively, they developed new data representation algorithms with feature selection and multiple kernels learning, respectively. The purpose of this method was to improve the feature selection method [11].

In 2014, Cheng and colleagues proposed a new feature selection method based on locally linear embedding (LLE), which is a manifold learning method. They showed that the LLE objective can be decomposed according to the data dimensionalities in the subset selection problem. They then proposed a new unsupervised feature selection algorithm that selects a subset of features representing the data manifold and called it locally linear selection (LLS). Based on the local relationships among samples, they estimated a score for each feature and called it the LLS score. They then ranked the scores and used them to select features. The purpose of this method was to develop a new feature selection method that can be formulated as nonlinear dimensionality reduction with discrete constraints [12].

In 2017, Yao and his colleagues exploited the power of the LLE method by applying the idea of LLE in the graph-preserving feature selection framework. They proposed a new filter-based feature selection method based on LLE and named it LLE score. The proposed criterion measures the difference between the local structure of each feature and the local structure of the original data. Their goal was to improve feature selection [13].

In 2018, Feng and his colleagues proposed an unsupervised feature selection approach based on autoencoders, which leverages a single-layer autoencoder in a joint framework for feature selection and manifold learning. In addition, they incorporated spectral graph analysis on the projected data into the learning process to preserve the local data geometry from the original data space in the low-dimensional feature space. Their goal was to improve feature selection [14].

In 2019, Zeng and his colleagues proposed a new feature selection method with a local adaptive loss function and a global sparsity constraint. This method can be used to model data with different distributions. Given the local and global sparsity of data, this method is able to select the most discriminating features. The purpose of this method was to improve the feature selection method [9].

In 2019, Yu proposed a stacked denoising autoencoder method with manifold regularization, called manifold regularized stacked denoising autoencoders (MRSDAE), based on particle swarm optimization. In this way, manifold regularization and feature selection are both embedded in deep learning. The purpose of this method was to select discriminant features based on an integration of structure and parameter optimization, manifold regularization and feature selection techniques [15].

2-2-Improved instance selection method

In 2018, Gunn and his colleagues presented a taxonomy of instance selection and instance generation methods for data classification. This taxonomy categorized a large number of offline sample selection methods. Their aim was to review the literature on sampling methods [16].

In 2018, González and his colleagues proposed an adaptation of the local set concept to multi-label data and verified its effectiveness by designing two new algorithms. The first algorithm cleans the datasets to improve predictive capabilities, while the second aims to reduce the size of the datasets. Their aim was to provide sample selection methods for multi-label learning [17].

In 2019, Gonzalo and his colleagues proposed a different approach that takes into account not only the number of times each prototype is selected but also the subsets of prototypes that are selected. They developed the GEEBIES method, a new way of combining the results of ensembles of prototype selectors. They then showed that their proposed method was better than the standard boosting method. Their aim was to improve the sampling method [7].

In 2019, Morais and his colleagues introduced the k-INOS method, a new algorithm to prevent contamination of over-sampling algorithms with minority class noisy samples. The k-INOS method is based on the concept of neighborhood of influence and works as a wrapper around each over-sampling algorithm. The purpose of the k-INOS method was to improve the performance of sampling algorithms in most criteria [1].

 

 

2-3-Improved algorithm performance parameters

In 2006, Pękalska and colleagues performed a number of experiments on different metric and non-metric dissimilarity representations and prototype selection methods. They compared several approaches, such as traditional feature selection methods, mode seeking, and linear programming, against random selection. The purpose of this work was to improve the accuracy of the classification [18].

In 2012, Ren and his colleagues proposed a feature selection approach based on local and global structure preserving called LGFS. This method first uses two graphs, nearest neighbor graph and farthest neighbor graph to describe the local and global structure of the samples, respectively. To eliminate redundancy among the selected features, the E-LGFS criterion was introduced as it uses normalized mutual information to measure the dependency between a pair of features. The purpose of this method was to improve classification accuracy [19].

In 2012, Wei and his colleagues developed a new feature selection technique based on the graph embedded framework for manifold learning. They showed that the Linear Discriminant Analysis score and the Marginal Fisher Analysis score could be a direct application of the graph preserving criterion. They proposed two methods for recursive feature elimination (RFE) based on the feature score and subset level score, respectively to identify the optimal feature subset. The purpose of this method was to improve the performance of the algorithm [20].

In 2013, Verbiest and colleagues believed that the accuracy of the k Nearest Neighbor method could be improved by using prototype selection, so they prepared KNN with a reduced but reinforced dataset to select its neighbors from. They used fuzzy rough set theory to express the quality of the instances, and used a wrapper method to determine which instances were pruned. Their method was called FRPS. The purpose of this method was to improve classification accuracy [21].

In 2013, Zare and his colleagues introduced a new framework for selecting a set of prototypes from a labeled graph set, taking into account their discriminative power. The purpose of this method was to improve the accuracy of the classification [22].

In 2013, Yao and his colleagues considered the unsupervised feature selection problem. They proposed a new feature selection method called LapGOFS. Their method minimized the maximum variance of the predicted value of the regression model. Using manifold learning techniques and optimal experimental design, their proposed approach was able to select the most effective features to improve learning performance. The purpose of this method was to improve the performance of the algorithm [23].

In 2014, Hamidzadeh and his colleagues proposed a Large Margin Instance Reduction Algorithm called LMIRA. Their method eliminated non-border instances and retained border instances. The instances reduction process was formulated as a constrained binary optimization problem and then solved by a functional algorithm. The purpose of this method was to improve classification accuracy and increase the reduction rate [24].

In 2015, Calvo and his colleagues proposed a method in which the PS algorithm acts as a fast recommender system: given a new instance, it retrieves the classes the instance is most likely to belong to. The actual classification is then done using only the prototypes from the initial training set that belong to the suggested classes. The purpose of this method was to improve the efficiency and demonstrate the noise resistance of the method [25].

In 2015, Zhang and his colleagues proposed a new algorithm called EMR-SLRA to embed multipurpose features. The EMR-SLRA algorithm is based on the least-squares component analysis framework; in particular, the low-dimensional feature representation and the projection matrix are obtained by the low-rank approximation of the connected multidimensional feature matrix. Due to the complementary property among multiple features, the proposed algorithm simultaneously applies the ensemble manifold regularization on the output feature embedding. The purpose of this method was to improve the performance of the algorithm [26].

In 2015, Hamidzadeh and his colleagues proposed an instance reduction method based on hyper rectangle clustering, called IRAHC. This method removes non-border instances and keeps the border and near border instances. The purpose of this method was to improve classification accuracy and increase the reduction rate [27].

In 2016, Fan and his colleagues used model selection and evolutionary computation to solve the uncertainty and hide partial data in the imbalanced data clustering. They used probabilistic models to solve uncertainty and an evolutionary process to adjust and estimate optimal parameters. The purpose of this method was to improve the clustering performance [4].

In 2016, Jian and his colleagues introduced a new different contribution sampling (DCS) method based on the contributions of the support vectors (SVs) and the nonsupport vectors (NSVs) for classification. The DCS method uses different sampling methods for the SVs and the NSVs and uses the biased support vector machine (B-SVM) method to detect the SVs and the NSVs of an imbalanced data. In addition, the SMOTE and RUS techniques are used to reproduce the SVs in the minority data and the NSVs in the majority data, respectively. The purpose of this method was to improve the classification performance [5].

In 2016, Wang and his colleagues proposed an unsupervised spectral feature selection method with the l1-norm graph, called USFS. The proposed algorithm performed spectral clustering and the l1-norm graph jointly to select discriminative features. The manifold structure of the original datasets was first learned by spectral clustering of unlabeled samples and then used for feature selection. In addition, the l1-norm graph was used to maintain a clear manifold structure. The purpose of this method was to improve the performance of the algorithm [28].

In 2016, Dornaika and colleagues applied nonlinear data self-representativeness to overcome the disadvantages of the SMRS method by using two types of data projections: kernel trick and column generation. Qualitative evaluations were performed by summarizing two video films. The purpose of this method was to improve the performance of the algorithm [29].

In 2016, Du and his colleagues proposed a multiple graph unsupervised feature selection method to leverage local and global graph information simultaneously. In addition, they used the l2,p-norm to achieve more flexible sparse learning. The purpose of this method was to improve the performance of the algorithm [30].

In 2016, Valero and his colleagues conducted an empirical study in a simulated framework in which PS methods can be compared under classical conditions, as expected in distributed scenarios. The purpose of this method was to improve the efficiency and robustness of the algorithm [31].

In 2016, Li and his colleagues proposed constructing the graph using heterogeneous feature representations from multiple perspectives. They call their proposed method MRMVFS, which can exploit label information, label relationship, data distribution as well as correlations among different types of features simultaneously to boost feature selection performance. The purpose of this method was to improve the performance of the algorithm [32].

In 2017, Zhu and his colleagues jointly focused on regression and classification and proposed a new feature selection approach by embedding observational relational information into a sparse multi-task learning framework. The relational information includes three types of relationships, namely attribute-attribute, response-response, and sample-sample relationships, which maintain the similarity between attributes, response variables, and samples, respectively. With the reduced data, they then used the support vector regression model for prediction and the support vector classification model for labeling. The purpose of this method was to improve the performance of the algorithm [33].

In 2017, Du and his colleagues proposed an unsupervised feature selection method called RUFSM, in which both robust discriminative feature selection and robust clustering are performed simultaneously while the local manifold structures of the data are maintained. Their method first predicted both the latent orthogonal cluster centers and the sparse representation of the projected data points based on matrix factorization to select robust discriminative features, and then the feature selection and the clustering were applied simultaneously to guarantee a general optimization. The purpose of the proposed algorithm was to improve the clustering performance [8].

In 2018, Pang and his colleagues proposed a safe instance reduction method, called SIR-KMTSVM, to improve computational efficiency. This method can eliminate most of the redundant examples both for the target classes and the remaining classes, so training is greatly accelerated. The purpose of this method was to improve the algorithm's performance [34].

In 2018, Tang and his colleagues proposed a robust and efficient unsupervised feature selection method based on dual self-representation and manifold regularization, referred to as DSRMR. The feature self-representation term is used to learn the feature representation coefficient matrix, which measures the importance of different feature dimensions. The sample self-representation term is used to automatically learn a sample similarity graph, so that the local geometric structure of the data is preserved in unsupervised feature selection. The purpose of this method was to improve the accuracy of clustering and improve the normalized mutual information (NMI) [35].

In 2018, Haro-García and his colleagues proposed an evolutionary sample selection algorithm that combines three strategies. First, it used the framework of the CHC genetic algorithm. Second, it allowed the selection of each sample more than once, and finally it provided a local value for k, which depended on the nearest neighborhood of each test sample. The purpose of this method was to improve the performance of the algorithm [36].

In 2018, González and colleagues focused on the use of prototype selection in multi-label datasets as a preliminary step in the learning process. They used three single-label prototype selection algorithms and data transformation methods such as binary relevance, dependent binary relevance, label power set, and random k label sets. The purpose of this method was to improve the accuracy of the classification if samples were reduced [37].

In 2018, Zhang and his colleagues introduced a semi-supervised local multi-manifold Isomap learning framework by linear embedding and called it SSMM-Isomap. This method can use both labeled and unlabeled training samples to learn neighborhood-preserving local nonlinear manifold features and a linear feature extractor simultaneously. The SSMM-Isomap formulation was introduced with the aim of minimizing the pairwise distances of intra-class points in the same manifold and maximizing the distances between different manifolds. The purpose of this method was to improve the performance of the algorithm [38].

In 2018, Yang and his colleagues proposed a sample reduction algorithm called NNGIR. A natural neighborhood graph (NaNG) is automatically generated by the natural neighbor search algorithm. The proposed algorithm uses the NaNG to divide the original training set into noisy, border and internal samples. Next, the algorithm generates a reduced set by eliminating noisy and redundant points. The purpose of this method was to improve the prediction accuracy while achieving a high reduction rate [39].

In 2019, Hosseini and his colleagues proposed a new feature selection method based on interaction information (II) to analyze higher-level interactions and improve the search method in the feature space. In this regard, an evolutionary feature subset selection algorithm based on interaction information is presented. In the first step, candidate features are determined using symmetric uncertainty (SU) approaches and bivariate interaction information. In the second step, candidate feature subsets are created and evaluated using multivariate interaction information. Finally, the best candidate feature subsets are selected using dominant/dominated relationships. The purpose of this method was to improve classification precision, the F-measure criterion, and algorithm robustness [40].

In 2019, Tang and his colleagues proposed a robust unsupervised feature selection method that incorporated the latent representation learning into feature selection. The latent representation is modeled by non-negative matrix factorization of the affinity matrix, which explicitly represents the relationships among the data samples. At the same time, the local manifold structure of main data space is maintained by a graph based manifold regularization term in the transformed feature space. The purpose of this method was to improve the performance of the algorithm [10].

In 2019, Kordos and his colleagues developed a multi-objective evolutionary algorithm for evaluating the selections using two criteria: training data set compression and prediction quality in terms of root mean squared error. A multi-objective KNN regression was used to evaluate the training error. The purpose of this method was to reduce the dataset and improve the prediction method in trained multi-output regressions on the reduced dataset [6].

In 2019, Pourpanah and his colleagues introduced a BSO-based feature selection technique for data classification. This method combines the FAM model, which is an incremental learning neural network, and BSO, a feature selection method, to produce the hybrid FAM-BSO model for feature selection and optimization. In the first step, FAM is used to create a number of prototype nodes incrementally. The BSO was then used to search and select the optimal subset of features. The purpose of FAM-BSO method was to improve classification accuracy with minimum number of features [41].

In 2019, Tao and his colleagues proposed a new over-sampling technique that used the RNS procedure to generate synthetic minority data without the need for actual minority data. The minority data generated was combined with the majority data as input for a bi-class classification for learning. The purpose of this method was to improve performance including the G-Mean and F-Measure criteria [42].

In 2019, Zhang and his colleagues proposed an embedded multi-label feature selection approach with manifold regularization. Specifically, a low-dimensional embedding is built on the core feature space to fit the label distribution and maintain local label correlations, using limited label information in consideration of the co-occurrence relationships of label pairs. The purpose of this method was to improve the classification performance [43].

In 2019, Cruz and his colleagues proposed the FIRE-DES++ method, an advanced version of the FIRE-DES method. Their proposed method eliminates noise and reduces overlap of classes in the validation set. In addition, the proposed method defines the region of competence using an equal number of samples in each class and avoids selecting a region of competence with samples of a single class. The purpose of this method was to improve the performance of the algorithm [44].

In 2019, Haro-García and his colleagues developed a new method using boosting to obtain a subset of samples. Their method was able to improve the accuracy of dataset classification even with a significant reduction of samples. The samples were gradually added. Samples were selected to maximize the accuracy of the subset by using the weighting of the samples generated from the ensembles structure of the classifiers and adding new ones step by step. The purpose of this method was to improve classification accuracy[2].

In 2019, Lai and his colleagues presented a mathematical analysis of the impact of imbalanced data related to PPMC analysis. Then a new framework called RCAF is presented to improve the accuracy of correlation analysis. The purpose of this method was to reduce the standard deviation of the correlation coefficient [45].

In 2019, Malhotra and colleagues examined the effectiveness of classifiers for software defect prediction by using sampling methods and cost-sensitive classifiers. They proposed a new method called SPIDER3 after modifying the SPIDER2 over-sampling method. The purpose of this method was to improve the predictability of classifiers using over-sampling methods [46].

2-4-Improved imbalanced data problem

In 2008, García and his colleagues presented a memetic algorithm model that included an ad hoc local search. They used the proposed model specifically to optimize the prototype selection problem with the aim of tackling the scaling up problem [47].

In 2016, Zhang and his colleagues conducted an analysis to investigate one-vs-one schema empowerment for multi-class imbalanced classification problems. They used binary ensemble learning approaches. They proposed several methods of ensemble learning to solve pairwise tasks in multiclass data. They then used the aggregation strategy to combine the outputs of the methods to reconstruct the original multiclass task. The purpose of this method is to solve the imbalances in the data set [3].

In 2018, Bhagat and his colleague introduced the RKWELM method, a type of kernel-based WELM method to handle the problem of imbalanced data. This method benefits from ensemble methods. The computational complexity of KELM depends on the number of kernels. KELM uses the Gaussian kernel function. The purpose of this method is to solve the imbalances in the data set [48].

In 2019, Koziarski and his colleagues proposed a new strategy to counter imbalanced data, focusing on noise presence. Their method was called RBO, which could identify areas where artificial elements of the minority class would have to be produced based on estimating the imbalanced distribution with radial basis functions. The purpose of this method is to solve unbalanced data problems [49].

In 2019, Xie and his colleagues introduced a new data level approach to tackle imbalanced problems, calling it GL. They fitted the distribution of the original data. They generated new data based on the distribution of the Gaussian hybrid model. The generated data, including synthetic minority and majority classes, were used to train learning models. The purpose of this method is to solve the imbalances in the data set [50].

2-5- Improved noise control and handling

In 2018, Lu and his colleagues proposed an unsupervised feature selection method. They used their model to capture relationships between features without learning the cluster labels. Each feature can be reconstructed using a linear combination of all the features in the original feature space. A representative feature should provide a greater weight for reconstructing other features. In addition, a structure preserved constraint was incorporated into their model to maintain the local manifold structure of the data. The purpose of this method was to handle Gaussian noise [51].

2-6- Improve the preprocessing stage

In 2014, Verbiest and his colleagues presented an approach where data is cleaned before applying the SMOTE technique to improve the quality of the samples produced. They also cleaned the data after applying SMOTE. For this purpose they proposed two prototype selection techniques based on fuzzy rough set theory. The first technique removes the noisy samples from the imbalanced dataset and the second one cleans the data generated by SMOTE. The purpose of this method was to improve the preprocessing stage in the classification of noisy data [52].

 

  3. Materials and methods

3-1-Fisher’s score

The Fisher score is a supervised, filter-based feature selection method. It measures the discriminative power of a feature by simultaneously evaluating how well the feature maximizes the distances between samples from different classes and minimizes the distances between samples from the same class. FisherScore_r denotes the Fisher score of the r-th feature and is calculated as in Eq. (1), where μ_r^p is the mean of the r-th feature in the p-th class, C denotes the number of classes, f_ri denotes the r-th feature of the i-th sample, and n_p denotes the number of samples in the p-th class.

(1)

The features are then ranked by the FisherScore_r criterion, and the features with the highest scores are selected [13].
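The ranking step can be illustrated with a minimal sketch. It assumes the commonly used form of the Fisher score, the between-class scatter of a feature (each class mean compared with the overall mean, weighted by class size) divided by its within-class scatter; the exact form of Eq. (1) is not recoverable from this text, so the formula, the function name `fisher_scores` and the toy data are assumptions for illustration only.

```python
from collections import defaultdict


def fisher_scores(X, y):
    """Return one score per feature; higher means more discriminative.

    Assumed form: sum_p n_p (mean_p - mean)^2 / sum_p sum_i (f_i - mean_p)^2.
    X is a list of samples (lists of feature values), y the class labels.
    """
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    scores = []
    for r in range(len(X[0])):
        column = [row[r] for row in X]
        overall_mean = sum(column) / len(column)
        between = 0.0  # spread of the class means around the overall mean
        within = 0.0   # spread of the samples around their own class mean
        for rows in by_class.values():
            values = [row[r] for row in rows]
            class_mean = sum(values) / len(values)
            between += len(values) * (class_mean - overall_mean) ** 2
            within += sum((v - class_mean) ** 2 for v in values)
        # A feature that is constant inside every class has zero within-class scatter.
        scores.append(between / within if within > 0 else float("inf"))
    return scores


# Toy data: feature 0 separates the two classes, feature 1 does not.
X = [[1.0, 5.0], [1.2, 1.0], [3.0, 4.0], [3.2, 2.0]]
y = ["a", "a", "b", "b"]
scores = fisher_scores(X, y)
ranking = sorted(range(len(scores)), key=lambda r: scores[r], reverse=True)
print(scores, ranking)
```

In this toy case the class means of feature 1 coincide, so its score is zero and it is ranked last.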

3-2-LLE score

LLE is one of the most popular manifold learning algorithms. It first learns the local structure of the data in the original space and then finds a lower-dimensional representation that preserves these local structures. In previous research, LLE for feature extraction has been embedded into the graph framework, so extending LLE to filter-based feature selection is not a difficult task. However, few methods have so far used LLE for feature ranking.

In this study, we first describe how to embed LLE into a graph-preserving framework. To do this, we model the local structure in the same way as LLE does, as summarized below. For each sample xi:

1) Find the set of neighbors Ni = {xj, j ∈ Ji}, where Ji indexes the K nearest neighbors of xi.

2) Calculate the reconstruction weights that minimize the reconstruction error of xi using Ni samples.

The first step is usually done by using the Euclidean distance based on the K-nearest neighbors to find the neighbors.

The second step is to find the best weights of reconstruction. The optimal weights are determined by solving the problem (2).

(2)

Repeating the first and second steps for all samples yields the reconstruction weights, which form the weighting matrix M = [mij]n×n. In the matrix M, mij = 0 if xj ∉ Ni. It is worth noting that the dimension d of a sample is usually larger than the number of neighbors K, in other words d > K. Therefore, the least squares method is used to solve Eq. (2). Each feature is then evaluated by its ability to preserve these weights. The LLEScore_r criterion of the r-th feature, which should be minimized, is given in Eq. (3).

(3)             =

 

Then the features are ranked according to the LLEScorer criterion. The top d features with the lowest scores are selected. The detailed expression of the above method is presented in Algorithm 1.

 

Algorithm 1: Embedding LLE into the graph-preserving feature selection framework
Input: The data matrix X.

Output: The ranked feature list.

1: For each sample xi, compute its K nearest neighbors and then calculate its reconstruction weights mij through Eq. (2). Doing this for all samples gives the weighting matrix M;

2: Compute the score of each of the d features by Eq. (3);

3: Rank the d features in ascending order of score;

4: return the ranked feature list.
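The steps of Algorithm 1 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the weights are taken to be the usual LLE reconstruction weights (constrained to sum to one over the K neighbors), a small regularization term `reg` is added to keep the local linear system well conditioned, and the score of the r-th feature is taken to be the squared error made when that feature is reconstructed with the weights in M. The helper names and the toy data are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def lle_weights(X, k, reg=1e-3):
    """Step 1: K nearest neighbors and reconstruction weights for every sample."""
    n = len(X)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        def sq_dist(j):
            return sum((a - b) ** 2 for a, b in zip(X[i], X[j]))

        neighbors = sorted((j for j in range(n) if j != i), key=sq_dist)[:k]
        # Local Gram matrix of the neighbors centered on x_i.
        Z = [[a - b for a, b in zip(X[j], X[i])] for j in neighbors]
        G = [[sum(p * q for p, q in zip(za, zb)) for zb in Z] for za in Z]
        trace = sum(G[a][a] for a in range(k))
        for a in range(k):
            G[a][a] += reg * (trace if trace > 0 else 1.0)  # assumed regularizer
        w = solve(G, [1.0] * k)
        total = sum(w)
        for j, wj in zip(neighbors, w):
            M[i][j] = wj / total  # weights of each sample sum to one
    return M


def lle_scores(X, M):
    """Step 2: reconstruction error of each feature under the weights M."""
    n = len(X)
    scores = []
    for r in range(len(X[0])):
        f = [row[r] for row in X]
        error = 0.0
        for i in range(n):
            reconstructed = sum(M[i][j] * f[j] for j in range(n))
            error += (f[i] - reconstructed) ** 2
        scores.append(error)
    return scores


# Toy data: feature 0 varies along a line, feature 1 is constant.
X = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]]
M = lle_weights(X, k=2)
scores = lle_scores(X, M)
ranking = sorted(range(len(scores)), key=lambda r: scores[r])  # Step 3: ascending
print(scores, ranking)
```

Under this reading the score equals f_r^T (I - M)^T (I - M) f_r, which matches the matrix A = I - M - M^T + M^T M used in the text.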

 

Considering A = I - M - M^T + M^T M and λ = 0, the proposed method can be embedded in Eq. (4) [13].

(4)
  4. Proposed Method

The proposed method focuses on simultaneous sample selection and feature selection. Classification is performed with an SVM and evaluated by 10-fold cross-validation. Figure 1 shows the flowchart of the proposed method.

Figure 1. Flowchart of the proposed method. The steps shown are: NSL-KDD dataset preprocessing; sample selection based on Supervised Centrality Degree; score calculation with the FisherScore method; score calculation with the LLEScore method; combining the scores; feature selection with the help of the ultimate scores; classification of samples.

Our aim in this study is to propose a method that improves classification on the minority classes while keeping the ability to classify the majority classes at an acceptable level. To this end, we present a new sampling method that we apply only to the majority classes.

Standard classification algorithms tend to predict only the majority class; when they encounter minority-class samples, they treat them as noise and often ignore them. As a result, a minority-class sample is much more likely to be misclassified than a majority-class sample.

 

 

 

 

4-1-Preprocessing

Weka software was used to reduce the size of the problem in the pre-processing phase. Not all features in the NSL-KDD dataset are of equal importance, so by eliminating irrelevant, duplicate, and unimportant features, the problem dimensions are reduced. Thus, in the pre-processing phase it was attempted to extract the most important features from the 41 features. As a result, smaller search space was applied as input to the proposed model process. The Relief filter was applied to the NSL-KDD dataset by Weka software and reduced the search space.

4-2-Select samples based on Supervised Centrality Degree

4-2-1-Supervised Centrality Degree Approach

In this research, a new method for sample selection is presented. In the proposed method, the dimensionality of the data is first reduced by a manifold learning method such as LLE, so that the data is mapped to a lower-dimensional space. Then, in the new space, the KNN method is used to find the neighbors of each sample.

The proposed method works on labeled samples, and we illustrate it with an example. Suppose the samples Xi lie in a four-dimensional space. First, they are mapped to a two-dimensional space by one of the manifold learning methods, which creates the new samples Yi. Figure 2 illustrates this.

To calculate centrality, we find the K nearest neighbors of each sample based on Euclidean distance and build a graph from these neighbor relations. Then the adjacency matrix is calculated and the degree of each node is determined. Nodes with a higher degree are more central; samples with higher scores are more important and must be preserved.

Figure 2. Dimension reduction with manifold learning methods (the samples Xi are mapped to the samples Yi)

 

Suppose Figure 3 represents data samples in two-dimensional space. Each sample carries one of two labels, ‘a’ or ‘b’. We want to calculate the centrality of the sample Y4. Samples Y1 to Y9 are shown with their labels in Figure 3.

Figure 3. Data samples in two-dimensional space

 

Suppose K = 5. The KNN method identifies the neighbors of sample Y4, which is labeled ‘b’. According to Figure 3, the five nearest neighbors of Y4 are identified as in Eq. (5), and a graph is drawn from them.

(5) KNN(Y4) = {Y1, Y2, Y3, Y5, Y6}

Among these nearest neighbors, samples Y3 and Y5 are labeled ‘a’ and samples Y1, Y2 and Y6 are labeled ‘b’.

The proposed centrality method is a supervised one. Let L denote the label of the sample itself (similar label) and L’ any other label (dissimilar label). Number(YjL) is the number of neighbors with the similar label and Number(YjL’) is the number of neighbors with a dissimilar label. In the proposed method, we subtract the number of neighbors from dissimilar classes from the number of neighbors in the same class and assign the resulting integer to each sample as its score, according to Eq. (6).

(6) Centrality(Yi) = Number(YjL) - Number(YjL’)

The highest possible score is K and the lowest is -K. Several samples may obtain the same score. The score for the sample Y4 is calculated as follows.

(7) Centrality(Y4b) = 3 - 2 = 1
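The calculation in Eqs. (5) to (7) can be reproduced with a short sketch. The coordinates of Y1 to Y9 and the labels of Y7 to Y9 are hypothetical, since they cannot be read from the figure here; they were chosen only so that the five nearest neighbors of Y4 are Y1, Y2, Y3, Y5 and Y6, with the labels stated in the text. The function name `supervised_centrality` is likewise an assumption.

```python
def supervised_centrality(Y, labels, k):
    """Score = (# of the k nearest neighbors with the same label)
             - (# of the k nearest neighbors with a different label)."""
    scores = []
    for i, yi in enumerate(Y):
        others = [j for j in range(len(Y)) if j != i]
        # Euclidean distance in the embedded (low-dimensional) space.
        others.sort(key=lambda j: sum((a - b) ** 2 for a, b in zip(yi, Y[j])))
        neighbors = others[:k]
        same = sum(1 for j in neighbors if labels[j] == labels[i])
        scores.append(same - (len(neighbors) - same))
    return scores


# Hypothetical 2-D coordinates for Y1..Y9 (index 0 is Y1, index 3 is Y4).
Y = [(1, 0), (0, 1), (-1, 0), (0, 0), (0, -1), (1, 1), (5, 5), (6, 5), (5, 6)]
labels = ["b", "b", "a", "b", "a", "b", "a", "a", "a"]

centrality = supervised_centrality(Y, labels, k=5)
print("Centrality(Y4) =", centrality[3])

# Rank the samples in descending order of centrality, as the method prescribes.
ranked = sorted(range(len(Y)), key=lambda i: centrality[i], reverse=True)
```

With K = 5, every score lies between -5 and 5, and Y4 obtains 3 - 2 = 1 as in Eq. (7).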

Geometric distance, that is, distance measured along the data manifold, is used to calculate centrality in high dimensions. Using it removes the outliers and keeps the samples that lie on the manifold: nearest neighbors are picked on the manifold, so the degree of on-manifold samples stays high while the degree of outliers decreases. Outliers are therefore rarely selected among the K nearest neighbors and are gradually eliminated.

Once the data is mapped from the high-dimensional space to the low-dimensional space, it is no longer necessary to calculate the geometric distance in the original space; the Euclidean distance in the lower-dimensional space can be used instead. In other words, the Euclidean distance in the lower dimension is similar to the geometric distance in the higher dimension, provided that a manifold learning method is used to reduce the dimension. The manifold is first learned from the original data. After mapping and counting neighbors, samples with more neighbors have a higher centrality.

In the proposed method, the importance of a sample is preserved: an important sample remains important even when features are discarded. Conversely, discarding features does not reinforce weak samples; weak samples remain weak and outliers remain outliers.

Centrality can also be applied to features. So far, the samples with the most neighbors were the most important ones. When the rows and columns of the original data matrix are swapped, that is, when the matrix is transposed, the same procedure gives each feature a score.

The feature with the lowest score is the more important one and should be preserved. In other words, features that are similar to one another are redundant and can be deleted.

At this point, the score for each sample is calculated. The samples are then ranked in descending order, so that the top-ranked samples are those with the highest scores.

 

4-3-Score calculation with FisherScore method

At this point, the FisherScore criterion, that is, the score of each feature, is calculated by Eq. (1). The features are then ranked in descending order, so the top-ranked feature has the highest score. The purpose of the Fisher method is to increase the separation between classes. The score vector should then be normalized.

4-4- Score calculation with LLEScore method

At this point, the LLEScore criterion, that is, the score of each feature, is calculated by Eq. (3). The features are then ranked in ascending order, so the top-ranked feature has the lowest score. The purpose of the LLE method is to maintain the local structure of the data. The score vector should then be normalized.

4-5- Combine Scores

At this point, the scores from the previous two steps are combined by Eq. (8) and the degree of effectiveness of each method is determined by the alpha value. We choose the feature that minimizes the FinalScore criterion.

 

(8)
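The combination step can be sketched as follows. The exact form of Eq. (8) is not recoverable from this text, so the sketch assumes one plausible form: both score vectors are min-max normalized, the Fisher score is flipped so that lower is better for both criteria, and alpha weights the LLE term. This convex combination, the role assigned to alpha, and the names `min_max`, `final_scores` and `select_features` are all assumptions.

```python
def min_max(values):
    """Scale a score vector to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def final_scores(fisher, lle, alpha):
    """Assumed combination: alpha * LLE + (1 - alpha) * (1 - Fisher).

    The Fisher score is better when high, the LLE score when low, so the
    normalized Fisher score is flipped before mixing. Lower is better.
    """
    f_norm, l_norm = min_max(fisher), min_max(lle)
    return [alpha * l + (1.0 - alpha) * (1.0 - f) for f, l in zip(f_norm, l_norm)]


def select_features(fisher, lle, alpha, n_keep):
    """Return the indices of the n_keep features with the lowest final score."""
    final = final_scores(fisher, lle, alpha)
    order = sorted(range(len(final)), key=lambda r: final[r])
    return order[:n_keep]


# Toy scores for four features (illustrative values only).
fisher = [4.0, 0.5, 2.0, 1.0]
lle = [0.5, 0.1, 0.3, 0.9]
selected = select_features(fisher, lle, alpha=0.5, n_keep=2)
print(selected)
```

Under this assumed form, alpha = 0 ranks the features by the Fisher score alone and alpha = 1 by the LLE score alone, which is how the two extreme settings of a mixing parameter are usually interpreted.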

4-6- Feature selection with the help of Ultimate Scores

At this point, the FinalScore vector is used to select the best 15, 20, or 25 features.

4-7- Classification of samples

At this stage, the SVM classifier is evaluated by 10-fold cross-validation, and performance criteria such as classification accuracy, precision, recall and F-measure are calculated.

 

  5. Experiments

This section describes the implementation of the proposed model with feature selection. We simulate the base model, the proposed model, and the proposed model with feature selection, and compare their results. The classification criteria are accuracy, precision, recall and F-measure, and classification is evaluated by 10-fold cross-validation.

5-1- NSL-KDD dataset

The KDD99 dataset is a standard dataset for intrusion detection systems (IDS). Most researchers have worked with the KDD99 fields and have reported precision, accuracy, recall and F-measure. In this study, a modified version of the KDD99 dataset, called the NSL-KDD dataset, was used. The NSL-KDD dataset contains two main classes, Normal and Attack. The Attack class itself contains four subclasses: DoS, R2L, U2R and Probe. NSL-KDD is therefore a five-class problem in which Normal and DoS are the majority classes and Probe, R2L and U2R are the minority classes. The dataset has 41 features.

5-2- NSL-KDD database preprocessing

In the first step, the names of the 41 fields were replaced with simpler tags. In the second step, the string values were converted to numeric values so that the support vector machine could be used. Since this study addresses the five-class problem, in the third step field 42, which represents the class type, was labeled numerically.

Table 1. Statistical description of the data set in the experiments

| Data set | 100% NSL-KDD | Selected NSL-KDD |
|---|---|---|
| Number of classes | One normal class, four attack classes | One normal class, four attack classes |
| Number of features | 41 | 41 |
| Number of samples | 125973 | 17040 |

 

The NSL-KDD dataset contains 125973 records and 41 features. In the fourth step, to reduce memory use and computation time, we kept 10% of the samples of the Normal and DoS majority classes, 40% of the samples of the Probe class, and 100% of the samples of the R2L and U2R minority classes. Duplicate samples were then removed. A statistical description of the dataset is provided in Table 1.
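A class-wise reduction of this kind can be sketched as below. The text does not say how the retained samples were chosen, so uniform random sampling with a fixed seed is an assumption, as are the function name `subsample_per_class` and the toy data; only the per-class fractions (10%, 40%, 100%) come from the text.

```python
import random


def subsample_per_class(samples, labels, keep_fraction, seed=0):
    """Keep a given fraction of each class, then drop duplicate samples.

    Returns the sorted indices of the retained samples.
    """
    rng = random.Random(seed)  # assumed: uniform random choice, fixed seed
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    kept = []
    for label, indices in by_class.items():
        n_keep = round(keep_fraction[label] * len(indices))
        kept.extend(rng.sample(indices, n_keep))
    kept.sort()

    # Remove duplicates: identical feature vector with an identical label.
    seen, unique = set(), []
    for i in kept:
        key = (tuple(samples[i]), labels[i])
        if key not in seen:
            seen.add(key)
            unique.append(i)
    return unique


# Toy data set: 20 Normal, 10 Probe and 5 U2R samples, all distinct.
labels = ["Normal"] * 20 + ["Probe"] * 10 + ["U2R"] * 5
samples = [[float(i)] for i in range(len(labels))]
fractions = {"Normal": 0.10, "Probe": 0.40, "U2R": 1.00}

kept = subsample_per_class(samples, labels, fractions)
counts = {c: sum(1 for i in kept if labels[i] == c) for c in fractions}
print(counts)
```

Keeping every sample of the smallest classes while thinning the largest ones reduces the imbalance ratio before any further selection is applied.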

In the fifth step, the dimensions of the problem were reduced. The Relief filter was applied to the NSL-KDD dataset by Weka software. By applying the Relief filter by Weka, features that cannot separate the different classes are eliminated. We apply the Relief filter to the features of the NSL-KDD dataset, and finally, 31 features remain and the rest of the features are removed (according to Table 2). The number of samples is reduced to 15604.

 

Table 2. Statistical description of the data after filtering by Weka

| Refining with Weka | Value |
|---|---|
| Number of classes | DoS, R2L, U2R, Probing, Normal |
| Number of features after dimension reduction | 31 |
| Number of samples after preprocessing | 15604 |

 

After applying the filter, the remaining features are applied as input to the proposed model process. After applying the filter, a list of 31 features is presented in Table 3.

 

Table 3. A list of features remaining after applying the filter

| Filter method | Number of features | Remaining features |
|---|---|---|
| Relief method | 31 | 1, 2, 3, 4, 8, 10, 12, 14, 15, 17, 18, 19, 22, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41 |

 

5-3- Sampling

In the experiments that determine the effectiveness of the proposed model, only a percentage of the data is selected and the rest is discarded. As a result, the accuracy on the majority classes may be reduced, but the accuracy on the minority classes is likely to improve.

While the sampling method does not reduce the density of the majority class samples, it lets the classifier use the minority class samples more effectively. In the proposed method, the numbers of minority-class and majority-class samples should be kept in proportion.

In these experiments, the number of samples from the Normal and DoS majority classes and from the Probe minority class was reduced by the proposed sample selection method, but the number of samples in the R2L and U2R minority classes remained unchanged. The accuracy and performance of the classifier can then be calculated. Reporting the accuracy of each class separately shows that the proposed method works well for imbalanced data.

Table 4 lists the number of samples per class in four cases. The first case covers all NSL-KDD data with 41 features. The second covers the selected percentage of NSL-KDD with 41 features, that is, 17040 samples. In the third case, the Relief filter is applied to these 17040 samples, which leaves 31 features and 15604 samples. In the fourth case, the proposed sample selection method is applied to these 15604 samples, after which the dataset has 31 features and 13036 samples.

5-4- Performance evaluation criteria

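The performance of an IDS is evaluated by its ability to make accurate predictions. Comparing the true nature of an event with the prediction of the IDS gives four possible outcomes: true positive (TP), false positive (FP), false negative (FN) and true negative (TN). Security executives seek to reduce false alarms (false positives) and increase true positives. Classification performance criteria such as accuracy, precision, recall and F-measure were used in this study.

(9) Accuracy = (TP + TN) / (TP + FP + FN + TN)
(10) Precision = TP / (TP + FP)
(11) Recall = TP / (TP + FN)
(12) F-measure = 2 * Precision * Recall / (Precision + Recall)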

The performance evaluation of the classification algorithm is performed by the confusion matrix. This matrix contains information about the actual classes and the predicted classes. [53]
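As an illustration, the per-class criteria can be computed from a five-class confusion matrix using the standard one-vs-rest definitions of precision, recall and F-measure. The counts below are read from Table 7 (proposed model, 31 features); the function name `per_class_metrics` is an assumption, and a class with no predicted samples is given a precision of zero by convention.

```python
def per_class_metrics(confusion, classes):
    """confusion[i][j] = number of samples of actual class i predicted as class j."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(classes)))
    report = {}
    for i, name in enumerate(classes):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                  # actual i, predicted as another class
        fp = sum(row[i] for row in confusion) - tp   # other classes predicted as i
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        report[name] = {"recall": recall, "precision": precision, "f_measure": f_measure}
    return correct / total, report


classes = ["DoS", "Probe", "R2L", "U2R", "Normal"]
# Rows are actual classes, columns are predicted classes (counts from Table 7).
confusion = [
    [327, 3, 0, 0, 19],
    [2, 331, 0, 0, 15],
    [0, 0, 90, 0, 7],
    [0, 0, 1, 0, 3],
    [1, 5, 2, 0, 491],
]

accuracy, report = per_class_metrics(confusion, classes)
for name in ("Probe", "R2L", "U2R"):
    print(name, {k: round(v, 2) for k, v in report[name].items()})
print("Accuracy", round(accuracy, 2))
```

Rounded to two decimals, these values agree with the figures reported in Table 8.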

 

 

 

 

 

 

 

 

Table 4. Number of samples per class

| Classes | Case 1: 41 features, 100% NSL-KDD | Case 2: 41 features, selected NSL-KDD | Case 3: 31 features, after applying the Relief filter | Case 4: 31 features, after applying the sample selection method |
|---|---|---|---|---|
| DoS | 45927 | 4584 | 4444 | 3499 |
| Probe | 11656 | 4660 | 4352 | 3499 |
| R2L | 995 | 995 | 987 | 987 |
| U2R | 52 | 52 | 52 | 52 |
| Normal | 67343 | 6749 | 5769 | 4999 |
| Sum | 125973 | 17040 | 15604 | 13036 |

 

5-5- Configure the test environment

MATLAB was used for the simulation, together with LIBSVM and DRToolBox. The Grid Search method was used to tune simulation parameters such as alpha. Grid Search takes a range of candidate values for each parameter and repeats the experiments over these values to find the best setting for each parameter.

The kernel type must be specified to apply the SVM classifier. In this simulation, the RBF kernel of equation (13) was used. Its only kernel parameter is gamma; coef0, which LIBSVM also exposes, is not used by the RBF kernel, and C is the penalty parameter of the SVM itself. Standard values were used for these parameters, as in other papers.

(13) RBF kernel: exp(-gamma * |u - v|^2)

These values are the LIBSVM default settings: gamma = 1 / num_features, coef0 = 0, and C = 1.
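For reference, the kernel of equation (13) with the default gamma can be written out directly. This is only a sketch of the kernel function itself, not of the SVM training performed by the library; the function name `rbf_kernel` and the sample vectors are assumptions.

```python
import math


def rbf_kernel(u, v, gamma):
    """RBF kernel of Eq. (13): exp(-gamma * |u - v|^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


# Default described in the text: gamma = 1 / num_features (31 features after Relief).
num_features = 31
gamma = 1.0 / num_features

u = [0.0] * num_features
v = [1.0] * num_features          # hypothetical sample at squared distance 31 from u
similarity = rbf_kernel(u, v, gamma)
print(round(similarity, 4))       # exp(-1), since gamma * 31 = 1
```

The kernel equals 1 for identical samples and decays toward 0 as the samples move apart; a larger gamma makes this decay faster.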

In the pre-processing phase, Relief filtering was performed by Weka software. The purpose of the filter was to reduce search space. After reducing the search space, the proposed sample selection technique was performed. Then Fisher and LLE based feature selection technique was applied and SVM classification evaluation criteria were calculated.

In our research work, experiments were repeated for seven different alpha parameter values. The results were then shown separately, and finally, in Table 25, our results were compared with those of papers with the same evaluation criteria.

The proposed method with feature selection was used to select the best 15 features and the best 20 features and the best 25 features.

5-5-1- Basic model simulation without feature selection

The purpose of these experiments is to determine classification accuracy and other performance measures such as precision, recall and F-measure in the baseline model for minority classes such as Probe, R2L and U2R, while the number of features is 41. The SVM classifier was used in these experiments. These experiments were performed with the RBF kernel in the five-class mode. The classification method is 10-fold.

5-5-1-1- Basic Model Performance Criteria with 41 Features

Table 5 shows the confusion matrix of the base model in the five-class mode on all data, i.e. 17040 samples with 41 features.

 

 

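Table 5. Confusion matrix in base model with 41 features

| Actual \ Predicted | DoS | Probe | R2L | U2R | Normal |
|---|---|---|---|---|---|
| DoS | 408 | 3 | 0 | 0 | 45 |
| Probe | 3 | 400 | 0 | 0 | 61 |
| R2L | 0 | 0 | 74 | 0 | 25 |
| U2R | 0 | 0 | 0 | 0 | 5 |
| Normal | 0 | 0 | 0 | 0 | 674 |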

 

According to Table 6, the classification accuracy and other performance measures such as precision, recall, and F-measure are shown in the base model for the Probe, R2L and U2R minority classes. These experiments were performed using a 10-fold evaluation method on all data sets, namely 17040 samples with 41 features in five classes.

 

Table 6. Basic Model Performance Criteria with 41 Features

| Class | Recall | Precision | F-measure |
|---|---|---|---|
| Probe | 0.86 | 0.99 | 0.92 |
| R2L | 0.75 | 1.00 | 0.86 |
| U2R | 0.00 | 0.00 | 0.00 |

Accuracy: 0.92

 

 

 

5-5-2- Simulation of the proposed sample selection model without feature selection

According to Table 4, after applying the Relief filter to 17040 samples and then applying the proposed method to select the prototype on the majority classes, the number of samples was reduced to 13036 with 31 features.

The purpose of these experiments is to determine classification accuracy and other performance measures such as precision, recall and F-measure in the proposed model for minority classes such as Probe, R2L and U2R, while the number of features is 31. The SVM classifier was used in these experiments. These experiments were performed with the RBF kernel in the five-class mode. The classification method is 10-fold.

5-5-2-1- Proposed Model Performance Criteria with 31 Features

Table 7 shows the confusion matrix of the proposed model in the five-class mode on 13036 samples with 31 features.

 

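Table 7. Confusion matrix in proposed model with 31 features

| Actual \ Predicted | DoS | Probe | R2L | U2R | Normal |
|---|---|---|---|---|---|
| DoS | 327 | 3 | 0 | 0 | 19 |
| Probe | 2 | 331 | 0 | 0 | 15 |
| R2L | 0 | 0 | 90 | 0 | 7 |
| U2R | 0 | 0 | 1 | 0 | 3 |
| Normal | 1 | 5 | 2 | 0 | 491 |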

 

According to Table 8, the classification accuracy and other performance measures such as precision, recall, and F-measure are shown in the proposed model for the Probe, R2L and U2R minority classes. These experiments were performed using a 10-fold evaluation method on 13036 samples with 31 features in five classes.

 

Table 8. Proposed Model Performance Criteria with 31 Features

| Class | Recall | Precision | F-measure |
|---|---|---|---|
| Probe | 0.95 | 0.98 | 0.96 |
| R2L | 0.93 | 0.97 | 0.95 |
| U2R | 0.00 | 0.00 | 0.00 |

Accuracy: 0.96

5-5-3- Analyzing and examining experiments

Comparing the experiments in Sections 5-5-1 and 5-5-2 shows that the proposed model is not only more accurate than the base model but also performs better on minority classes such as Probe and R2L. Classifier accuracy increased from 92 percent to 96 percent. Classifier performance for each class is better represented by other criteria such as precision, recall and F-measure.

According to Table 5, the Probe class has 464 samples. The base model predicts 400 of them as Probe, 61 as Normal and 3 as DoS; most errors go to Normal because it is the majority class. According to Table 7, the Probe class has 348 samples, of which the proposed model predicts 331 as Probe, 15 as Normal and 2 as DoS. In other words, the proposed method predicts a larger share of the Probe samples as Probe.

According to Table 5, the R2L class has 99 samples. The base model predicts 74 of them as R2L and 25 as Normal, again because Normal is the majority class. According to Table 7, the R2L class has 97 samples, of which the proposed model predicts 90 as R2L and 7 as Normal. In other words, the proposed method predicts more of the R2L samples as R2L.

The ability to learn minority classes such as Probe and R2L has therefore been greatly improved, but the classifier performance for the U2R minority class has not changed.

Our proposed model is effective because it eliminates weak Normal samples in the same way as outliers. It is therefore expected to predict fewer Normal samples correctly in absolute terms: according to Table 5, the base model correctly predicts 674 Normal samples, whereas according to Table 7 the proposed model correctly predicts 491.

For the R2L class, recall increased from 75% to 93% according to Tables 6 and 8; that is, more R2L samples are predicted as R2L. Precision for the R2L class decreased from 100% to 97% according to Tables 6 and 8, because a number of Normal and U2R samples are predicted as R2L.

While the proposed method has improved the F-measure criterion for minority classes, it has also increased the accuracy criterion for the majority classes.

5-5-4- Simulation of the proposed model with feature selection method

In this part of the experiment, the best 15 features, the best 20 features, and the best 25 features are identified by Eq. (8). These experiments are repeated with different values of alpha: 0, 0.1, 0.3, 0.5, 0.7, 0.9 and 1. The selected features are then extracted from the dataset with 13036 samples and 31 features, namely Case 4 of Table 4.

The purpose of these experiments is to determine classification accuracy and other performance measures such as precision, recall and F-measure in the proposed model for minority classes such as Probe, R2L and U2R when the best 15, 20 and 25 features are selected. The SVM classifier was used in these experiments. These experiments were performed with the RBF kernel in the five-class mode and evaluated by 10-fold cross-validation.

5-5-4-1- Experiments with zero alpha value

According to Table 9, the confusion matrix in the proposed model in the five-class mode with zero alpha value is visible on 13036 samples with 15 features, 20 features, and 25 features.

According to Table 10, the classifier accuracy and other performance measures such as precision, recall, and F-measure are shown in the proposed model for the Probe, R2L and U2R minority classes. These experiments were performed using a 10-fold evaluation method on 13036 samples with 15 features, 20 features, and 25 features in five classes with zero alpha value.

5-5-4-2- Experiments with an alpha value of 0.1

According to Table 11, the confusion matrix in the proposed model in the five-class mode with an alpha value of 0.1 is visible on 13036 samples with 15 features, 20 features, and 25 features.

According to Table 12, the classifier accuracy and other performance measures such as precision, recall, and F-measure are shown in the proposed model for the Probe, R2L and U2R minority classes. These experiments were performed using a 10-fold evaluation method on 13036 samples with 15 features, 20 features, and 25 features in five classes with an alpha value of 0.1.

 

 

5-5-4-3- Experiments with an alpha value of 0.3

According to Table 13, the confusion matrix in the proposed model in the five-class mode with an alpha value of 0.3 is visible on 13036 samples with 15 features, 20 features, and 25 features.

According to Table 14, the classifier accuracy and other performance measures such as precision, recall, and F-measure are shown in the proposed model for the Probe, R2L and U2R minority classes. These experiments were performed using a 10-fold evaluation method on 13036 samples with 15 features, 20 features, and 25 features in five classes with an alpha value of 0.3.

5-5-4-4- Experiments with an alpha value of 0.5

According to Table 15, the confusion matrix in the proposed model in the five-class mode with an alpha value of 0.5 is visible on 13036 samples with 15 features, 20 features, and 25 features.

According to Table 16, the classifier accuracy and other performance measures such as precision, recall, and F-measure are shown in the proposed model for the Probe, R2L and U2R minority classes. These experiments were performed using a 10-fold evaluation method on 13036 samples with 15 features, 20 features, and 25 features in five classes with an alpha value of 0.5.

5-5-4-5- Experiments with an alpha value of 0.7

According to Table 17, the confusion matrix in the proposed model in the five-class mode with an alpha value of 0.7 is visible on 13036 samples with 15 features, 20 features, and 25 features.

According to Table 18, the classifier accuracy and other performance measures such as precision, recall, and F-measure are shown in the proposed model for the Probe, R2L and U2R minority classes. These experiments were performed using a 10-fold evaluation method on 13036 samples with 15 features, 20 features, and 25 features in five classes with an alpha value of 0.7.

5-5-4-6- Experiments with an alpha value of 0.9

According to Table 19, the confusion matrix in the proposed model in the five-class mode with an alpha value of 0.9 is visible on 13036 samples with 15 features, 20 features, and 25 features.

According to Table 20, the classifier accuracy and other performance measures such as precision, recall, and F-measure are shown in the proposed model for the Probe, R2L and U2R minority classes. These experiments were performed using a 10-fold evaluation method on 13036 samples with 15 features, 20 features, and 25 features in five classes with an alpha value of 0.9.

5-5-4-7- Experiments with an alpha value of one

According to Table 21, the confusion matrix in the proposed model in the five-class mode with an alpha value of one is visible on 13036 samples with 15 features, 20 features, and 25 features.

According to Table 22, the classifier accuracy and other performance measures such as precision, recall, and F-measure are shown in the proposed model for the Probe, R2L and U2R minority classes. These experiments were performed using a 10-fold evaluation method on 13036 samples with 15 features, 20 features, and 25 features in five classes with an alpha value of one.

 

 

 

 

 

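Table 9. Confusion matrix in the proposed model with 15 features, 20 features and 25 features and zero alpha value (rows are actual classes, columns are predicted classes)

| Features | Actual | DoS | Probe | R2L | U2R | Normal |
|---|---|---|---|---|---|---|
| 15 | DoS | 322 | 15 | 0 | 0 | 12 |
| 15 | Probe | 1 | 337 | 0 | 0 | 11 |
| 15 | R2L | 0 | 0 | 88 | 0 | 9 |
| 15 | U2R | 0 | 0 | 1 | 2 | 1 |
| 15 | Normal | 3 | 15 | 5 | 0 | 474 |
| 20 | DoS | 321 | 4 | 0 | 0 | 23 |
| 20 | Probe | 2 | 329 | 0 | 0 | 17 |
| 20 | R2L | 0 | 0 | 93 | 0 | 4 |
| 20 | U2R | 0 | 0 | 1 | 1 | 1 |
| 20 | Normal | 3 | 13 | 6 | 0 | 475 |
| 25 | DoS | 327 | 5 | 0 | 0 | 17 |
| 25 | Probe | 4 | 333 | 0 | 0 | 11 |
| 25 | R2L | 0 | 0 | 94 | 0 | 3 |
| 25 | U2R | 0 | 0 | 1 | 3 | 0 |
| 25 | Normal | 1 | 4 | 3 | 0 | 490 |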

 

 

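Table 10. Performance criteria of the proposed model with 15 features, 20 features and 25 features and zero alpha value

| Features | Class | Recall | Precision | F-measure |
|---|---|---|---|---|
| 15 | Probe | 0.97 | 0.92 | 0.94 |
| 15 | R2L | 0.91 | 0.94 | 0.92 |
| 15 | U2R | 0.50 | 1.00 | 0.67 |
| 20 | Probe | 0.95 | 0.95 | 0.95 |
| 20 | R2L | 0.96 | 0.93 | 0.94 |
| 20 | U2R | 0.33 | 1.00 | 0.50 |
| 25 | Probe | 0.96 | 0.97 | 0.97 |
| 25 | R2L | 0.97 | 0.96 | 0.96 |
| 25 | U2R | 0.75 | 1.00 | 0.86 |

Accuracy: 0.94 (15 features), 0.94 (20 features), 0.96 (25 features)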

 

 

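Table 11. Confusion matrix of the proposed model with 15, 20, and 25 features and an alpha value of 0.1 (rows: actual class; columns: predicted class)

| Features | Actual | DoS | Probe | R2L | U2R | Normal |
|---|---|---|---|---|---|---|
| 15 | DoS | 322 | 15 | 0 | 0 | 12 |
| 15 | Probe | 1 | 337 | 0 | 0 | 11 |
| 15 | R2L | 0 | 0 | 88 | 0 | 9 |
| 15 | U2R | 0 | 0 | 1 | 2 | 1 |
| 15 | Normal | 3 | 15 | 5 | 0 | 474 |
| 20 | DoS | 321 | 4 | 0 | 0 | 23 |
| 20 | Probe | 2 | 329 | 0 | 0 | 17 |
| 20 | R2L | 0 | 0 | 93 | 0 | 4 |
| 20 | U2R | 0 | 0 | 1 | 1 | 1 |
| 20 | Normal | 3 | 13 | 6 | 0 | 475 |
| 25 | DoS | 327 | 5 | 0 | 0 | 17 |
| 25 | Probe | 4 | 333 | 0 | 0 | 11 |
| 25 | R2L | 0 | 0 | 94 | 0 | 3 |
| 25 | U2R | 0 | 0 | 1 | 3 | 0 |
| 25 | Normal | 1 | 4 | 3 | 0 | 490 |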

 

 

Table 12. Performance criteria of the proposed model with 15, 20, and 25 features and an alpha value of 0.1

| Features | Class | Recall | Precision | F-measure |
|---|---|---|---|---|
| 15 | Probe | 0.97 | 0.92 | 0.94 |
| 15 | R2L | 0.91 | 0.94 | 0.92 |
| 15 | U2R | 0.50 | 1.00 | 0.67 |
| 20 | Probe | 0.95 | 0.95 | 0.95 |
| 20 | R2L | 0.96 | 0.93 | 0.94 |
| 20 | U2R | 0.33 | 1.00 | 0.50 |
| 25 | Probe | 0.96 | 0.97 | 0.97 |
| 25 | R2L | 0.97 | 0.96 | 0.96 |
| 25 | U2R | 0.75 | 1.00 | 0.86 |

Accuracy: 0.94 (15 features), 0.94 (20 features), 0.96 (25 features)

 

 

 

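Table 13. Confusion matrix of the proposed model with 15, 20, and 25 features and an alpha value of 0.3 (rows: actual class; columns: predicted class)

| Features | Actual | DoS | Probe | R2L | U2R | Normal |
|---|---|---|---|---|---|---|
| 15 | DoS | 321 | 15 | 0 | 0 | 11 |
| 15 | Probe | 1 | 336 | 0 | 0 | 11 |
| 15 | R2L | 1 | 0 | 88 | 0 | 9 |
| 15 | U2R | 0 | 0 | 1 | 2 | 1 |
| 15 | Normal | 3 | 16 | 5 | 0 | 474 |
| 20 | DoS | 321 | 4 | 0 | 0 | 23 |
| 20 | Probe | 2 | 328 | 0 | 0 | 18 |
| 20 | R2L | 0 | 0 | 93 | 0 | 4 |
| 20 | U2R | 0 | 0 | 1 | 1 | 1 |
| 20 | Normal | 3 | 13 | 6 | 0 | 476 |
| 25 | DoS | 326 | 5 | 0 | 0 | 18 |
| 25 | Probe | 4 | 333 | 0 | 0 | 11 |
| 25 | R2L | 0 | 0 | 94 | 0 | 3 |
| 25 | U2R | 0 | 0 | 1 | 2 | 1 |
| 25 | Normal | 1 | 4 | 3 | 0 | 489 |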

 

 

Table 14. Performance criteria of the proposed model with 15, 20, and 25 features and an alpha value of 0.3

| Features | Class | Recall | Precision | F-measure |
|---|---|---|---|---|
| 15 | Probe | 0.97 | 0.92 | 0.94 |
| 15 | R2L | 0.90 | 0.94 | 0.92 |
| 15 | U2R | 0.50 | 1.00 | 0.67 |
| 20 | Probe | 0.94 | 0.95 | 0.95 |
| 20 | R2L | 0.96 | 0.93 | 0.94 |
| 20 | U2R | 0.33 | 1.00 | 0.50 |
| 25 | Probe | 0.96 | 0.97 | 0.97 |
| 25 | R2L | 0.97 | 0.96 | 0.96 |
| 25 | U2R | 0.50 | 1.00 | 0.67 |

Accuracy: 0.94 (15 features), 0.94 (20 features), 0.96 (25 features)

 

 

 

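Table 15. Confusion matrix of the proposed model with 15, 20, and 25 features and an alpha value of 0.5 (rows: actual class; columns: predicted class)

| Features | Actual | DoS | Probe | R2L | U2R | Normal |
|---|---|---|---|---|---|---|
| 15 | DoS | 322 | 15 | 0 | 0 | 11 |
| 15 | Probe | 1 | 337 | 0 | 0 | 11 |
| 15 | R2L | 0 | 0 | 88 | 0 | 9 |
| 15 | U2R | 0 | 0 | 1 | 1 | 1 |
| 15 | Normal | 3 | 16 | 5 | 0 | 474 |
| 20 | DoS | 320 | 4 | 0 | 0 | 23 |
| 20 | Probe | 2 | 329 | 0 | 0 | 17 |
| 20 | R2L | 0 | 0 | 93 | 0 | 4 |
| 20 | U2R | 0 | 0 | 1 | 1 | 1 |
| 20 | Normal | 3 | 13 | 6 | 0 | 476 |
| 25 | DoS | 326 | 5 | 0 | 0 | 18 |
| 25 | Probe | 4 | 333 | 0 | 0 | 12 |
| 25 | R2L | 0 | 0 | 93 | 0 | 3 |
| 25 | U2R | 0 | 0 | 1 | 2 | 1 |
| 25 | Normal | 1 | 4 | 3 | 0 | 489 |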

 

 

Table 16. Performance criteria of the proposed model with 15, 20, and 25 features and an alpha value of 0.5

| Features | Class | Recall | Precision | F-measure |
|---|---|---|---|---|
| 15 | Probe | 0.97 | 0.92 | 0.94 |
| 15 | R2L | 0.91 | 0.94 | 0.92 |
| 15 | U2R | 0.33 | 1.00 | 0.50 |
| 20 | Probe | 0.95 | 0.95 | 0.95 |
| 20 | R2L | 0.96 | 0.93 | 0.94 |
| 20 | U2R | 0.33 | 1.00 | 0.50 |
| 25 | Probe | 0.95 | 0.97 | 0.96 |
| 25 | R2L | 0.97 | 0.96 | 0.96 |
| 25 | U2R | 0.50 | 1.00 | 0.67 |

Accuracy: 0.94 (15 features), 0.94 (20 features), 0.96 (25 features)

 

 

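Table 17. Confusion matrix of the proposed model with 15, 20, and 25 features and an alpha value of 0.7 (rows: actual class; columns: predicted class)

| Features | Actual | DoS | Probe | R2L | U2R | Normal |
|---|---|---|---|---|---|---|
| 15 | DoS | 322 | 15 | 0 | 0 | 11 |
| 15 | Probe | 1 | 337 | 0 | 0 | 11 |
| 15 | R2L | 0 | 0 | 88 | 0 | 9 |
| 15 | U2R | 0 | 0 | 1 | 2 | 1 |
| 15 | Normal | 3 | 15 | 5 | 0 | 475 |
| 20 | DoS | 321 | 4 | 0 | 0 | 23 |
| 20 | Probe | 2 | 328 | 0 | 0 | 18 |
| 20 | R2L | 0 | 0 | 93 | 0 | 4 |
| 20 | U2R | 0 | 0 | 1 | 1 | 1 |
| 20 | Normal | 3 | 13 | 7 | 0 | 476 |
| 25 | DoS | 325 | 5 | 0 | 0 | 19 |
| 25 | Probe | 4 | 333 | 0 | 0 | 12 |
| 25 | R2L | 0 | 0 | 94 | 0 | 3 |
| 25 | U2R | 0 | 0 | 1 | 2 | 1 |
| 25 | Normal | 1 | 4 | 4 | 0 | 489 |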

 

 

Table 18. Performance criteria of the proposed model with 15, 20, and 25 features and an alpha value of 0.7

| Features | Class | Recall | Precision | F-measure |
|---|---|---|---|---|
| 15 | Probe | 0.97 | 0.92 | 0.94 |
| 15 | R2L | 0.91 | 0.94 | 0.92 |
| 15 | U2R | 0.50 | 1.00 | 0.67 |
| 20 | Probe | 0.94 | 0.95 | 0.95 |
| 20 | R2L | 0.96 | 0.92 | 0.94 |
| 20 | U2R | 0.33 | 1.00 | 0.50 |
| 25 | Probe | 0.95 | 0.97 | 0.96 |
| 25 | R2L | 0.97 | 0.95 | 0.96 |
| 25 | U2R | 0.50 | 1.00 | 0.67 |

Accuracy: 0.94 (15 features), 0.94 (20 features), 0.96 (25 features)

 

 

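Table 19. Confusion matrix of the proposed model with 15, 20, and 25 features and an alpha value of 0.9 (rows: actual class; columns: predicted class)

| Features | Actual | DoS | Probe | R2L | U2R | Normal |
|---|---|---|---|---|---|---|
| 15 | DoS | 316 | 6 | 0 | 0 | 26 |
| 15 | Probe | 3 | 328 | 0 | 0 | 17 |
| 15 | R2L | 0 | 0 | 87 | 0 | 9 |
| 15 | U2R | 0 | 0 | 1 | 1 | 2 |
| 15 | Normal | 2 | 11 | 14 | 0 | 470 |
| 20 | DoS | 324 | 5 | 0 | 0 | 19 |
| 20 | Probe | 4 | 332 | 0 | 0 | 12 |
| 20 | R2L | 0 | 0 | 94 | 0 | 3 |
| 20 | U2R | 0 | 0 | 1 | 3 | 0 |
| 20 | Normal | 1 | 4 | 3 | 0 | 489 |
| 25 | DoS | 327 | 5 | 0 | 0 | 17 |
| 25 | Probe | 4 | 333 | 0 | 0 | 11 |
| 25 | R2L | 0 | 0 | 94 | 0 | 3 |
| 25 | U2R | 0 | 0 | 1 | 3 | 1 |
| 25 | Normal | 1 | 4 | 3 | 0 | 489 |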

 

 

Table 20. Performance criteria of the proposed model with 15, 20, and 25 features and an alpha value of 0.9

| Features | Class | Recall | Precision | F-measure |
|---|---|---|---|---|
| 15 | Probe | 0.94 | 0.95 | 0.95 |
| 15 | R2L | 0.91 | 0.85 | 0.88 |
| 15 | U2R | 0.25 | 1.00 | 0.40 |
| 20 | Probe | 0.95 | 0.97 | 0.96 |
| 20 | R2L | 0.97 | 0.96 | 0.96 |
| 20 | U2R | 0.75 | 1.00 | 0.86 |
| 25 | Probe | 0.96 | 0.97 | 0.97 |
| 25 | R2L | 0.97 | 0.96 | 0.96 |
| 25 | U2R | 0.60 | 1.00 | 0.75 |

Accuracy: 0.93 (15 features), 0.96 (20 features), 0.96 (25 features)

 

 

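Table 21. Confusion matrix of the proposed model with 15, 20, and 25 features and an alpha value of 1 (rows: actual class; columns: predicted class)

| Features | Actual | DoS | Probe | R2L | U2R | Normal |
|---|---|---|---|---|---|---|
| 15 | DoS | 320 | 3 | 0 | 0 | 25 |
| 15 | Probe | 5 | 324 | 0 | 0 | 20 |
| 15 | R2L | 0 | 0 | 85 | 0 | 13 |
| 15 | U2R | 0 | 0 | 0 | 0 | 3 |
| 15 | Normal | 0 | 4 | 1 | 0 | 493 |
| 20 | DoS | 323 | 4 | 0 | 0 | 22 |
| 20 | Probe | 6 | 325 | 0 | 0 | 18 |
| 20 | R2L | 0 | 0 | 87 | 0 | 10 |
| 20 | U2R | 0 | 0 | 1 | 0 | 3 |
| 20 | Normal | 0 | 5 | 1 | 0 | 492 |
| 25 | DoS | 325 | 3 | 0 | 0 | 21 |
| 25 | Probe | 4 | 329 | 0 | 0 | 16 |
| 25 | R2L | 0 | 0 | 89 | 0 | 9 |
| 25 | U2R | 0 | 0 | 1 | 0 | 3 |
| 25 | Normal | 0 | 5 | 1 | 0 | 491 |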

 

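Table 22. Performance criteria of the proposed model with 15, 20, and 25 features and an alpha value of 1

| Features | Class | Recall | Precision | F-measure |
|---|---|---|---|---|
| 15 | Probe | 0.93 | 0.98 | 0.95 |
| 15 | R2L | 0.87 | 0.99 | 0.92 |
| 15 | U2R | 0.00 | 0.00 | 0.00 |
| 20 | Probe | 0.93 | 0.97 | 0.95 |
| 20 | R2L | 0.90 | 0.98 | 0.94 |
| 20 | U2R | 0.00 | 0.00 | 0.00 |
| 25 | Probe | 0.94 | 0.98 | 0.96 |
| 25 | R2L | 0.91 | 0.98 | 0.94 |
| 25 | U2R | 0.00 | 0.00 | 0.00 |

Accuracy: 0.94 (15 features), 0.95 (20 features), 0.95 (25 features)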

 

 

5-5-5- Analyzing and examining experiments

Across the experiments in Sections 5-5-4-1 through 5-5-4-7, the proposed model with feature selection is not only more accurate than the base model but also performs better on the minority classes Probe, R2L, and U2R. Classifier accuracy increased from 92 percent to 96 percent. Per-class performance is better captured by other criteria such as precision, recall, and F-measure.

According to Table 5, the Probe class has 464 samples. The base model predicted 400 of them as Probe, 61 as Normal, and 3 as DoS; the errors lean towards Normal because it is the majority class. In Tables 9 and 11, the Probe class has 348 samples. With 25 selected features, the proposed model predicted 333 of them as Probe, 11 as Normal, and 4 as DoS.

In Table 19, the Probe class also has 348 samples. With 20 selected features, the proposed model predicted 332 of them as Probe, 12 as Normal, and 4 as DoS. In other words, the proposed method classifies a larger share of Probe samples correctly.

According to Table 5, the R2L class has 99 samples. The base model predicted 74 of them as R2L and 25 as Normal, again because Normal is the majority class. In Tables 9, 11, and 19, the R2L class has 97 samples. The proposed model with 25 selected features (Tables 9 and 11) and with 20 selected features (Table 19) predicted 94 of them as R2L and 3 as Normal. In other words, the proposed method classifies a larger share of R2L samples correctly.

According to Table 5, the U2R class has 5 samples. The base model predicted all of them as Normal, because Normal is the majority class.

In Tables 9, 11, and 19, the U2R class has 4 samples. The proposed model with 25 selected features (Tables 9 and 11) and with 20 selected features (Table 19) predicted 3 of them as U2R and 1 as R2L. In other words, the proposed method classifies more U2R samples correctly.

Overall, the ability to learn minority classes such as Probe and R2L has improved considerably.

The proposed model with feature selection is therefore effective, because it eliminates weak Normal samples as if they were outliers. As a consequence, it is expected to predict fewer Normal samples correctly. According to Table 5, the base model predicts 674 samples as the Normal class, whereas this number is reduced to 490 with 25 selected features (Tables 9 and 11) and to 489 with 20 selected features (Table 19).

For the Probe class, recall increased from 86% (Table 6) to 95% and 96% (Tables 10 and 12 with 25 selected features, and Table 20 with 20 selected features). In other words, more Probe samples are predicted as Probe.

For the R2L class, recall increased from 75% (Table 6) to 97% (Tables 10 and 12 with 25 selected features, and Table 20 with 20 selected features). In other words, more R2L samples are predicted as R2L.

For the U2R class, recall increased from 0% (Table 6) to 75% (Tables 10 and 12 with 25 selected features, and Table 20 with 20 selected features). In other words, more U2R samples are predicted as U2R.

For the Probe class, precision decreased from 99% (Table 6) to 97% and 96% (Tables 10 and 12 with 25 selected features, and Table 20 with 20 selected features), because a number of Normal and DoS samples are predicted as Probe.

For the R2L class, precision decreased from 100% (Table 6) to 96% (Tables 10 and 12 with 25 selected features, and Table 20 with 20 selected features), because a number of Normal and U2R samples are predicted as R2L.

For the U2R class, precision increased from 0% (Table 6) to 100% (Tables 10 and 12 with 25 selected features, and Table 20 with 20 selected features), because no samples of the other classes are predicted as U2R. With 25 selected features (Tables 10 and 12), the F-measure increased relative to Table 6 from 92% to 97% for Probe, from 86% to 96% for R2L, and from 0% to 86% for U2R.

While the proposed method with feature selection has improved the F-measure for the minority classes, it has also increased the accuracy for the majority classes.
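The recall, precision, and F-measure values discussed in this section follow directly from the confusion matrices. The sketch below recomputes them for one case, the 25-feature model at an alpha value of 0, using the counts read from Table 9. The helper names `per_class_metrics` and `accuracy` are illustrative and not part of the original work; precision and F-measure are taken to be 0 when their denominators are 0, which is an assumption consistent with the zero entries in Table 22.

```python
# Confusion matrix for the 25-feature model at alpha = 0, as read from Table 9.
# Rows are actual classes, columns are predicted classes.
CLASSES = ["DoS", "Probe", "R2L", "U2R", "Normal"]
CONFUSION = [
    [327, 5, 0, 0, 17],   # actual DoS
    [4, 333, 0, 0, 11],   # actual Probe
    [0, 0, 94, 0, 3],     # actual R2L
    [0, 0, 1, 3, 0],      # actual U2R
    [1, 4, 3, 0, 490],    # actual Normal
]


def per_class_metrics(matrix, index):
    """Return (recall, precision, F-measure) for the class at `index`."""
    true_positive = matrix[index][index]
    actual_total = sum(matrix[index])                   # row sum
    predicted_total = sum(row[index] for row in matrix) # column sum
    recall = true_positive / actual_total if actual_total else 0.0
    precision = true_positive / predicted_total if predicted_total else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return recall, precision, f_measure


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


for name in ("Probe", "R2L", "U2R"):
    r, p, f = per_class_metrics(CONFUSION, CLASSES.index(name))
    print(f"{name}: recall={r:.2f} precision={p:.2f} F-measure={f:.2f}")
print(f"accuracy={accuracy(CONFUSION):.2f}")
```

Rounded to two decimals, the output reproduces the 25-feature entries of Table 10: Probe 0.96, 0.97, 0.97; R2L 0.97, 0.96, 0.96; U2R 0.75, 1.00, 0.86; accuracy 0.96.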

5-5-6- Simulation of the proposed model with feature selection method on whole NSL-KDD data

Table 23 shows the confusion matrix of the proposed model with feature selection in the five-class setting with an alpha value of 0.5, on 125973 samples with 15, 20, and 25 features.

Table 24 reports the classifier accuracy together with the precision, recall, and F-measure of the proposed model with feature selection for the minority classes Probe, R2L, and U2R. These experiments used 10-fold cross-validation on 125973 samples in five classes, with 15, 20, and 25 features and an alpha value of 0.5.

 

 

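Table 23. Confusion matrix of the proposed model with 15, 20, and 25 features on the whole NSL-KDD data and an alpha value of 0.5 (rows: actual class; columns: predicted class)

| Features | Actual | DoS | Probe | R2L | U2R | Normal |
|---|---|---|---|---|---|---|
| 15 | DoS | 40826 | 3439 | 65 | 30 | 1567 |
| 15 | Probe | 36 | 11224 | 6 | 0 | 390 |
| 15 | R2L | 2 | 0 | 915 | 1 | 77 |
| 15 | U2R | 1 | 3 | 9 | 23 | 16 |
| 15 | Normal | 423 | 2466 | 830 | 21 | 63603 |
| 20 | DoS | 40780 | 1016 | 159 | 14 | 3958 |
| 20 | Probe | 59 | 10925 | 17 | 0 | 655 |
| 20 | R2L | 1 | 4 | 961 | 1 | 28 |
| 20 | U2R | 0 | 2 | 15 | 20 | 15 |
| 20 | Normal | 385 | 2216 | 1037 | 22 | 63683 |
| 25 | DoS | 42449 | 1211 | 7 | 0 | 2260 |
| 25 | Probe | 121 | 11061 | 7 | 4 | 463 |
| 25 | R2L | 1 | 6 | 977 | 1 | 10 |
| 25 | U2R | 0 | 1 | 10 | 28 | 13 |
| 25 | Normal | 301 | 852 | 586 | 50 | 65554 |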

 

 

Table 24. Performance criteria of the proposed model with 15, 20, and 25 features on the whole NSL-KDD data and an alpha value of 0.5

| Features | Class | Recall | Precision | F-measure |
|---|---|---|---|---|
| 15 | Probe | 0.96 | 0.66 | 0.78 |
| 15 | R2L | 0.92 | 0.50 | 0.65 |
| 15 | U2R | 0.44 | 0.31 | 0.36 |
| 20 | Probe | 0.94 | 0.77 | 0.85 |
| 20 | R2L | 0.97 | 0.44 | 0.60 |
| 20 | U2R | 0.38 | 0.35 | 0.37 |
| 25 | Probe | 0.95 | 0.84 | 0.89 |
| 25 | R2L | 0.98 | 0.62 | 0.76 |
| 25 | U2R | 0.54 | 0.34 | 0.41 |

Accuracy: 0.93 (15 features), 0.92 (20 features), 0.95 (25 features)

 

 

5-5-7- Discussion

The purpose of testing the proposed model with feature selection on the whole NSL-KDD dataset was to investigate whether accuracy can be maintained at an acceptable level. We first built the proposed model with feature selection on 13036 samples with an alpha value of 0.5 (Section 5-5-4). We then tested the same model on 125973 samples and obtained some interesting results. According to Table 24, accuracy was maintained at about 93% and 95%. Furthermore, according to Tables 6 and 24, the F-measure for the U2R class increased from 0% to 37% and 41%. These results show the power of the proposed model with feature selection.
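As a worked check of the U2R figures quoted above, the F-measure can be written in terms of the counts in a confusion matrix. The example below uses the standard definition of the F-measure (the harmonic mean of precision and recall) and the U2R row and column of the 25-feature matrix in Table 23; the symbols TP, FP, and FN are introduced here for illustration and do not appear in the original text.

```latex
% Standard definitions (P = precision, R = recall)
\[
R = \frac{TP}{TP + FN}, \qquad
P = \frac{TP}{TP + FP}, \qquad
F = \frac{2PR}{P + R} = \frac{2\,TP}{2\,TP + FP + FN}
\]

% U2R class, 25 features, whole NSL-KDD data (Table 23):
%   TP = 28                      (U2R predicted as U2R)
%   FN = 0 + 1 + 10 + 13 = 24    (U2R predicted as another class)
%   FP = 0 + 4 + 1 + 50 = 55     (other classes predicted as U2R)
\[
R = \frac{28}{52} \approx 0.54, \qquad
P = \frac{28}{83} \approx 0.34, \qquad
F = \frac{56}{56 + 55 + 24} = \frac{56}{135} \approx 0.41
\]
```

These values agree with the 25-feature U2R entries of Table 24.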

  1. Conclusions and future work

This paper proposes a novel and effective approach that tackles the class imbalance problem and feature redundancy simultaneously. The NSL-KDD dataset is a good example of a highly imbalanced dataset with a huge number of samples. For this purpose, a neighborhood is defined for each sample and for each feature. This neighborhood, which is defined on a manifold, is constructed so that the main structure and distribution of the data samples are preserved even after feature reduction or sample removal. The manifold is learned with the well-known LLE algorithm. A new sample/feature selection criterion, called the centrality measure, is applied to weight the features and samples on the manifold and to remove the least effective samples.

The experiments with different parameter settings show the effectiveness of the algorithm. Accuracy, one of the main performance measures, even increases compared with using the whole dataset. However, our main focus was on measures specific to class imbalance, such as the F-measure, which, as noted, shows considerable improvement with the proposed approach. We believe that more experiments are necessary, both on other high-dimensional, highly imbalanced datasets and with the more complex classifiers previously proposed in related work.

 

References

1.de Morais, R.F.A.B. and G.C. Vasconcelos, Boosting the performance of over-sampling algorithms through under-sampling the minority class. Neurocomputing, 2019. 343: p. 3-18.

2.de Haro-García, A., G. Cerruela-García, and N. García-Pedrajas, Instance selection based on boosting for instance-based learners. Pattern Recognition, 2019. 96: p. 106959.

3.Zhang, Z., et al., Empowering one-vs-one decomposition with ensemble learning for multi-class imbalanced data. Knowledge-Based Systems, 2016. 106: p. 251-263.

4.Fan, J., et al., Probability model selection and parameter evolutionary estimation for clustering imbalanced data without sampling. Neurocomputing, 2016. 211: p. 172-181.

5.Jian, C., J. Gao, and Y. Ao, A new sampling method for classifying imbalanced data based on support vector machine ensemble. Neurocomputing, 2016. 193: p. 115-122.

6.Kordos, M., Á. Arnaiz-González, and C. García-Osorio, Evolutionary prototype selection for multi-output regression. Neurocomputing, 2019. 358: p. 309-320.

7.Cerruela-García, G., et al., Improving the combination of results in the ensembles of prototype selectors. Neural Networks, 2019. 118: p. 175-191.

8.Du, S., et al., Robust unsupervised feature selection via matrix factorization. Neurocomputing, 2017. 241: p. 115-127.

9.Zeng, Z., et al., Local adaptive learning for semi-supervised feature selection with group sparsity. Knowledge-Based Systems, 2019. 181: p. 104787.

10.Tang, C., et al., Unsupervised feature selection via latent representation learning and manifold regularization. Neural Networks, 2019. 117: p. 163-178.

11.Wang, J.J.-Y., H. Bensmail, and X. Gao, Feature selection and multi-kernel learning for sparse representation on a manifold. Neural Networks, 2014. 51: p. 9-16.

12.Feng, D.-C., F. Chen, and W.-L. Xu, Detecting Local Manifold Structure for Unsupervised Feature Selection. Acta Automatica Sinica, 2014. 40(10): p. 2253-2261.

13.Yao, C., et al., LLE Score: A New Filter-Based Unsupervised Feature Selection Method Based on Nonlinear Manifold Embedding and Its Application to Image Recognition. IEEE Transactions on Image Processing, 2017. 26(11): p. 5257-5269.

14.Feng, S. and M.F. Duarte, Graph autoencoder-based unsupervised feature selection with broad and local data structure preservation. Neurocomputing, 2018. 312: p. 310-323.

15.Yu, J., Manifold regularized stacked denoising autoencoders with feature selection. Neurocomputing, 2019. 358: p. 235-245.

16.Gunn, I.A.D., Á. Arnaiz-González, and L.I. Kuncheva, A taxonomic look at instance-based stream classifiers. Neurocomputing, 2018. 286: p. 167-178.

17.Arnaiz-González, Á., et al., Local sets for multi-label instance selection. Applied Soft Computing, 2018. 68: p. 651-666.

18.Pękalska, E., R.P.W. Duin, and P. Paclík, Prototype selection for dissimilarity-based classifiers. Pattern Recognition, 2006. 39(2): p. 189-208.

19.Ren, Y., et al., Local and global structure preserving based feature selection. Neurocomputing, 2012. 89: p. 147-157.

20.Wei, D., S. Li, and M. Tan, Graph embedding based feature selection. Neurocomputing, 2012. 93: p. 115-125.

21.Verbiest, N., C. Cornelis, and F. Herrera, FRPS: A Fuzzy Rough Prototype Selection method. Pattern Recognition, 2013. 46(10): p. 2770-2782.

22.Zare Borzeshi, E., et al., Discriminative prototype selection methods for graph embedding. Pattern Recognition, 2013. 46(6): p. 1648-1657.

23.Yao, G., K. Lu, and X. He, G-Optimal Feature Selection with Laplacian regularization. Neurocomputing, 2013. 119: p. 175-181.

24.Hamidzadeh, J., R. Monsefi, and H.S. Yazdi, LMIRA: Large Margin Instance Reduction Algorithm. Neurocomputing, 2014. 145: p. 477-487.

25.Calvo-Zaragoza, J., J.J. Valero-Mas, and J.R. Rico-Juan, Improving kNN multi-label classification in Prototype Selection scenarios using class proposals. Pattern Recognition, 2015. 48(5): p. 1608-1622.

26.Zhang, L., et al., Ensemble manifold regularized sparse low-rank approximation for multiview feature embedding. Pattern Recognition, 2015. 48(10): p. 3102-3112.

27.Hamidzadeh, J., R. Monsefi, and H. Sadoghi Yazdi, IRAHC: Instance Reduction Algorithm using Hyperrectangle Clustering. Pattern Recognition, 2015. 48(5): p. 1878-1889.

28.Wang, X., et al., Unsupervised spectral feature selection with l1-norm graph. Neurocomputing, 2016. 200: p. 47-54.

29.Dornaika, F., I. Kamal Aldine, and A. Hadid, Kernel sparse modeling for prototype selection. Knowledge-Based Systems, 2016. 107: p. 61-69.

30.Du, X., et al., Multiple graph unsupervised feature selection. Signal Processing, 2016. 120: p. 754-760.

31.Valero-Mas, J.J., J. Calvo-Zaragoza, and J.R. Rico-Juan, On the suitability of Prototype Selection methods for kNN classification with distributed data. Neurocomputing, 2016. 203: p. 150-160.

32.Li, Y., et al., Manifold regularized multi-view feature selection for social image annotation. Neurocomputing, 2016. 204: p. 135-141.

33.Zhu, X., et al., A novel relational regularization feature selection method for joint regression and classification in AD diagnosis. Medical Image Analysis, 2017. 38: p. 205-214.

34.Pang, X., C. Xu, and Y. Xu, Scaling KNN multi-class twin support vector machine via safe instance reduction. Knowledge-Based Systems, 2018. 148: p. 17-30.

35.Tang, C., et al., Robust unsupervised feature selection via dual self-representation and manifold regularization. Knowledge-Based Systems, 2018. 145: p. 109-120.

36.de Haro-García, A., J. Pérez-Rodríguez, and N. García-Pedrajas, Combining three strategies for evolutionary instance selection for instance-based learning. Swarm and Evolutionary Computation, 2018. 42: p. 160-172.

37.Arnaiz-González, Á., et al., Study of data transformation techniques for adapting single-label prototype selection algorithms to multi-label learning. Expert Systems with Applications, 2018. 109: p. 114-130.

38.Zhang, Y., et al., Semi-supervised local multi-manifold Isomap by linear embedding for feature extraction. Pattern Recognition, 2018. 76: p. 662-678.

39.Yang, L., et al., Natural neighborhood graph-based instance reduction algorithm without parameters. Applied Soft Computing, 2018. 70: p. 279-287.

40.Hosseini, E.S. and M.H. Moattar, Evolutionary feature subsets selection based on interaction information for high dimensional imbalanced data classification. Applied Soft Computing, 2019. 82: p. 105581.

41.Pourpanah, F., et al., Feature selection based on brain storm optimization for data classification. Applied Soft Computing, 2019. 80: p. 761-775.

42.Tao, X., et al., Real-value negative selection over-sampling for imbalanced data set learning. Expert Systems with Applications, 2019. 129: p. 118-134.

43.Zhang, J., et al., Manifold regularized discriminative feature selection for multi-label learning. Pattern Recognition, 2019. 95: p. 136-150.

44.Cruz, R.M.O., et al., FIRE-DES++: Enhanced online pruning of base classifiers for dynamic ensemble selection. Pattern Recognition, 2019. 85: p. 149-160.

45.Lai, C.S., et al., A robust correlation analysis framework for imbalanced and dichotomous data with uncertainty. Information Sciences, 2019. 470: p. 58-77.

46.Malhotra, R. and S. Kamal, An empirical study to investigate oversampling methods for improving software defect prediction using imbalanced data. Neurocomputing, 2019. 343: p. 120-140.

47.García, S., J.R. Cano, and F. Herrera, A memetic algorithm for evolutionary prototype selection: A scaling up approach. Pattern Recognition, 2008. 41(8): p. 2693-2709.

48.Raghuwanshi, B.S. and S. Shukla, UnderBagging based reduced Kernelized weighted extreme learning machine for class imbalance learning. Engineering Applications of Artificial Intelligence, 2018. 74: p. 252-270.

49.Koziarski, M., B. Krawczyk, and M. Woźniak, Radial-Based oversampling for noisy imbalanced data classification. Neurocomputing, 2019. 343: p. 19-33.

50.Xie, Y., et al., Generative learning for imbalanced data using the Gaussian mixed model. Applied Soft Computing, 2019. 79: p. 439-451.

51.Lu, Q., X. Li, and Y. Dong, Structure preserving unsupervised feature selection. Neurocomputing, 2018. 301: p. 36-45.

52.Verbiest, N., et al., Preprocessing noisy imbalanced datasets using SMOTE enhanced with fuzzy rough prototype selection. Applied Soft Computing, 2014. 22: p. 511-517.

53.Wu, S.X. and W. Banzhaf, The use of computational intelligence in intrusion detection systems: A review. Applied Soft Computing, 2010. 10(1): p. 1-35.

54.Hosseini Bamakan, S.M., et al., An effective intrusion detection framework based on MCLP/SVM optimized by time-varying chaos particle swarm optimization. Neurocomputing, 2016. 199: p. 90-102.

55.Wang, H., J. Gu, and S. Wang, An effective intrusion detection framework based on SVM with feature augmentation. Knowledge-Based Systems, 2017. 136: p. 130-139.

56.Singh, R., H. Kumar, and R.K. Singla, An intrusion detection system using network traffic profiling and online sequential extreme learning machine. Expert Systems with Applications, 2015. 42(22): p. 8609-8624.

57.Lin, S.-W., et al., An intelligent algorithm with feature selection and decision rules applied to anomaly intrusion detection. Applied Soft Computing, 2012. 12(10): p. 3285-3290.

58.Chung, Y.Y. and N. Wahid, A hybrid network intrusion detection system using simplified swarm optimization (SSO). Applied Soft Computing, 2012. 12(9): p. 3014-3022.

 

 

 

 
