Using Machine Learning Techniques to Detect Malicious URLs
A study by Kuyama, Kakizaki, and Sasaki designed a method to detect malicious domains in URLs using DNS and WHOIS features. Targeted attacks cause severe damage, as the majority aim at unlawful access to, and acquisition of, confidential and private information from an organization or company. Detecting infected terminals that communicate with a command and control (C&C) server on a LAN requires identifying the C&C server beforehand. The method developed by the researchers therefore assists in identifying new C&C servers: it extracts feature points from the WHOIS record and Domain Name System (DNS) information of the C&C server domain and attempts to detect new servers using neural networks.
The method proposed by the researchers identifies C&C servers by employing a supervised machine learning technique and feature points acquired from the DNS and WHOIS records of C&C server domains and ordinary domains. The feature points are chosen for being difficult to spoof and include the expiration date, registration term, and email addresses obtained from WHOIS, together with mail exchanger (MX) records and name server (NS) records from DNS. The method developed by Kuyama, Kakizaki, and Sasaki improves on earlier techniques such as the data-mining tool RIPPER, which combines information received from external server repositories with domain information; its failure rate was high, hence the need for machine learning techniques to detect malicious domains.
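Although the study's own code is not reproduced here, the pipeline can be sketched as follows, assuming the WHOIS and DNS attributes have already been collected and converted to numbers; the feature names, values, and classifier settings below are illustrative assumptions, not the authors' implementation:

```python
# Illustrative sketch: classify domains as C&C or benign from numeric
# WHOIS/DNS features, in the spirit of Kuyama et al. (not their code).
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Each row: [days_until_expiration, registration_term_days,
#            num_ns_records, num_mx_records, registrant_email_present]
X_train = [
    [30,   365,  1, 0, 0],   # short-lived, sparsely provisioned domain
    [3200, 3650, 4, 2, 1],   # long-registered, well-provisioned domain
    [15,   365,  1, 0, 0],
    [2800, 3650, 3, 1, 1],
]
y_train = [1, 0, 1, 0]       # 1 = C&C domain, 0 = benign (toy labels)

model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)
model.fit(X_train, y_train)

print(model.predict([[20, 365, 1, 0, 0]]))  # likely flagged as C&C
```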
In another study, Sahoo, Liu, and Hoi offer a basic understanding and a detailed survey of malicious URL detection strategies that use artificial intelligence, particularly machine learning. Malicious URLs are a common cybersecurity threat: they host content such as drive-by downloads, phishing, and spam that lures unsuspecting web users into scams, causing monetary loss, malware installation, and loss of sensitive information, and adding up to billions of dollars in losses. As a result, it is vital to detect and deter such threats, but the solution must go beyond blacklists to incorporate machine learning techniques. Naga, Naveen, and Verma argue that it is possible to use lexical features to gain more insight about a URL without intensely exhausting a single source. For instance, it is possible to recognize malicious short URLs by capitalizing on the visual attributes that make them classifiable. Social media corporations such as Facebook and Twitter use these kinds of primitive features to decide whether to check for malicious URLs on their servers.
Technically, such systems are referred to as recommendation systems. It is possible to identify four groups of obfuscation strategies for disguising malicious URLs as benign ones, including obfuscating the host with an IP address or with another domain, obfuscating with long hostnames, and misspelling the domain. The proposed machine learning framework for detecting malicious URLs therefore makes use of convolutional neural networks, which existing studies show to outperform logistic regression and support vector machine algorithms. Compared to those classifiers, convolutional neural networks reach a precision of approximately 96%. As a deep learning technique, the method also exposes the dynamic attack approach of injecting JavaScript into a URL to bypass a server's detection mechanisms.
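A minimal character-level convolutional model of this kind can be sketched with TensorFlow/Keras; the architecture, layer sizes, and toy URLs below are illustrative assumptions rather than the exact model in the cited studies:

```python
# Minimal character-level CNN for URL classification (illustrative
# architecture, not the exact model from the cited work).
import numpy as np
import tensorflow as tf

MAX_LEN = 200      # URLs padded/truncated to this many characters
VOCAB_SIZE = 128   # raw ASCII codes serve as the character vocabulary

def encode(url):
    """Map a URL to a fixed-length sequence of ASCII codes."""
    codes = [min(ord(c), VOCAB_SIZE - 1) for c in url[:MAX_LEN]]
    return codes + [0] * (MAX_LEN - len(codes))

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 32),
    tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(malicious)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

urls = ["http://paypa1-login.example.ru/verify", "https://www.wikipedia.org/"]
X = np.array([encode(u) for u in urls])
y = np.array([1, 0])  # toy labels: 1 = malicious, 0 = benign
model.fit(X, y, epochs=1, verbose=0)
```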
Even though machine learning techniques are already in use for detecting malicious URLs, Ferreira argues that the performance and efficiency of these methods depend on the choice of algorithm. After converting the features of a URL into vectors, various algorithms can be used to determine whether the URL is benign or malicious. According to Ferreira, the algorithms fall into two categories, namely batch learning and online learning. A typical algorithm of the former class is Naïve Bayes. The algorithm classifies URLs and is 'naïve' in that it treats all features as independent of one another. For every feature, it therefore calculates the conditional probability that helps predict whether a URL on a server is malicious.
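The conditional-probability idea can be illustrated with scikit-learn's MultinomialNB over simple URL tokens; the URLs and labels below are toy assumptions, not a real training set:

```python
# Naive Bayes over URL tokens: each token's conditional probability
# contributes independently to the malicious/benign decision.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

urls = [
    "http://secure-login.paypal.example.com/update",
    "http://paypa1.example.ru/login/verify",
    "https://www.wikipedia.org/wiki/URL",
    "https://github.com/scikit-learn/scikit-learn",
]
labels = [1, 1, 0, 0]  # 1 = malicious, 0 = benign (toy labels)

# Split URLs on non-alphanumeric characters to obtain "words".
clf = make_pipeline(
    CountVectorizer(token_pattern=r"[A-Za-z0-9]+"),
    MultinomialNB(),
)
clf.fit(urls, labels)
print(clf.predict(["http://paypa1-login.example.ru/verify"]))
```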
Apart from the Naïve Bayes algorithm, Support Vector Machines (SVMs) are renowned for binary classification of high-dimensional data. The algorithm's decision rule is expressed through a kernel function that computes the similarity between feature vectors, weighted by non-negative coefficients attached to the training examples lying closest to the decision boundary. In this way, an SVM categorizes new cases by computing their signed distance to the decision boundary.
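As a sketch of this behaviour, scikit-learn's SVC exposes both the kernel-based decision rule and the signed distance of new cases to the boundary; the numeric features here are hypothetical:

```python
# SVM sketch: classify URL feature vectors and inspect the signed
# distance of a new case to the decision boundary (toy features).
from sklearn.svm import SVC

# Hypothetical numeric features: [url_length, num_dots, num_digits]
X_train = [[25, 1, 0], [90, 5, 12], [30, 2, 1], [110, 6, 20]]
y_train = [0, 1, 0, 1]  # 1 = malicious

svm = SVC(kernel="rbf")  # kernel measures similarity between vectors
svm.fit(X_train, y_train)

x_new = [[95, 4, 15]]
print(svm.decision_function(x_new))  # signed distance to the boundary
print(svm.predict(x_new))
```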
The online learning category treats data as an instantaneous stream, which implies that the algorithms learn from the training data and predict labels for incoming URLs simultaneously. The most common variants are first-order, second-order, and active online learning. First-order algorithms update a weight vector over labels such as benign or suspicious using first-order information from the available training data. In contrast, second-order algorithms go beyond first-order information to boost learning efficiency by exploiting second-order aspects such as statistical features. In active online learning, the algorithm asks whether a URL has already been categorized as malicious whenever it reports a low confidence level. The technique usually assumes that labels for the training data are received without incurring costs, but this is false because, in real systems, the labelling process is slow and expensive.
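A first-order online learner of this kind can be sketched with scikit-learn's partial_fit interface, which updates the weight vector one example at a time; the feature vectors in the stream are hypothetical:

```python
# Online learning sketch: the model updates its weight vector one URL
# at a time via partial_fit, a first-order update rule.
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="hinge")  # first-order, perceptron-like updates
classes = np.array([0, 1])         # label set must be declared up front

stream = [  # (feature vector, label) pairs arriving as a flow
    ([25, 1, 0], 0), ([90, 5, 12], 1), ([30, 2, 1], 0), ([110, 6, 20], 1),
]
for x, y in stream:
    clf.partial_fit([x], [y], classes=classes)  # learn and move on

print(clf.predict([[100, 5, 14]]))
```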
Machine Learning Approaches
Machine learning approaches analyze the information contained in URLs and their related webpages or websites by mining relevant feature representations of URLs and then training a prediction model to categorize URLs as malicious or benign. The approaches use two types of features, namely static and dynamic. In static analysis, a webpage is examined using information available without opening the URL or executing its JavaScript. The information mined includes lexical features of the URL string, information about the host, and the JavaScript or HTML content. Static features are safer to collect than dynamic ones. The underlying assumption is that the distribution of static features differs between malicious and benign URLs; using this distribution information, it is possible to build a prediction model that detects new malicious URLs on a server.
In contrast, dynamic feature analysis involves monitoring the behaviour of a system at risk of malicious attack for anomalies. Such techniques include mining log data and monitoring system-call sequences for suspicious activity. For detecting questionable URLs, a range of features is used to provide the information that machine learning techniques need.
Lexical Features
A lexical feature is obtained from the properties of the URL name or string, on the assumption that it is possible to identify whether a URL is malicious from how it is composed. For instance, the majority of obfuscation methods imitate benign URLs, adding minor variations to their names. Lexical features are usually applied in conjunction with other features, such as host-based features, to boost model performance. First, however, the URL string must be processed to retrieve the relevant features. Typically, statistical characteristics of the URL, including the length of its components, the overall URL length, and the number of characters, are computed alongside the words mined from the string.
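Such statistical lexical features can be sketched with only the Python standard library; the particular feature set below is an illustrative assumption:

```python
# Sketch of simple statistical lexical features drawn from a URL string.
from urllib.parse import urlparse

def lexical_features(url):
    parsed = urlparse(url)
    return {
        "url_length": len(url),
        "host_length": len(parsed.netloc),
        "path_length": len(parsed.path),
        "num_digits": sum(c.isdigit() for c in url),
        "num_special": sum(not c.isalnum() for c in url),
        "num_dots_in_host": parsed.netloc.count("."),
    }

print(lexical_features("http://paypa1-login.example.ru/verify/account"))
```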
The bag-of-words approach can be considered a machine-compatible blacklist approach. Here, the URL string is processed so that every segment delimited by special characters is treated as a word. From the distinct words across all URLs, a dictionary is created in which every word becomes a feature: a feature is assigned the value 1 if the word is present in the URL and 0 otherwise. For machine learning, instead of examining the entire URL string, the approach assigns varied scores to URLs based on small elements of their strings.
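A minimal version of this binary bag-of-words representation, assuming words are the alphanumeric runs between special characters:

```python
# Bag-of-words over URL "words": each distinct word becomes a feature
# set to 1 if present in the URL and 0 otherwise.
from sklearn.feature_extraction.text import CountVectorizer

urls = [
    "http://paypa1-login.example.ru/verify",
    "https://www.wikipedia.org/wiki/URL",
]
vectorizer = CountVectorizer(token_pattern=r"[A-Za-z0-9]+", binary=True)
X = vectorizer.fit_transform(urls)

print(vectorizer.get_feature_names_out())  # the dictionary of words
print(X.toarray())                         # 1/0 presence per URL
```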
Apart from traditional lexical features, which are mined directly from URL strings without considerable computation or domain knowledge, advanced lexical features are now used as well. These are based on five groups of features derived from heuristics that make detection resistant to obfuscation: URL-related, domain, directory-related, filename, and argument features. Another notable feature is Kolmogorov Complexity, the degree of complexity of a string. Using conditional Kolmogorov Complexity, the complexity of a string given another string as auxiliary input, it is possible to combine these measures to determine whether a URL is suspicious. However, a significant limitation of this test is scalability, as applying it to large sets of URLs is impractical.
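Kolmogorov Complexity is uncomputable in general, so practical systems approximate it; a common stand-in, sketched below, is compressed length, with the conditional variant measured as the extra compressed length a string adds to an auxiliary input. This compression-based approximation is an assumption of the sketch, not the cited test itself:

```python
# Compression-based approximation of (conditional) Kolmogorov Complexity:
#   K(x)   ~ len(compress(x))
#   K(x|y) ~ len(compress(y + x)) - len(compress(y))
import zlib

def k_approx(s: bytes) -> int:
    return len(zlib.compress(s))

def conditional_k_approx(x: bytes, y: bytes) -> int:
    return k_approx(y + x) - k_approx(y)

url = b"http://paypa1-login.example.ru/verify"
benign_corpus = b"http://www.paypal.com/ http://www.wikipedia.org/"

print(k_approx(url))
print(conditional_k_approx(url, benign_corpus))  # low = similar to corpus
```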
As a result, emerging measures introduce the concept of intra-relatedness, which measures the relationships between the different words comprising a URL, with particular focus on the relation between the registered domain and the rest of the URL. Currently, intra-relatedness is measured using two quantities, the path brand name distance and the domain brand name distance, which are kinds of edit distance between strings for detecting suspicious URLs that seek to imitate popular websites or brands.
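Edit distance itself is straightforward to compute; the sketch below applies a standard Levenshtein dynamic program to a domain label and a brand name (the spoofing example is hypothetical):

```python
# Edit (Levenshtein) distance sketch: how many single-character edits
# separate a URL's domain label from a known brand name.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(edit_distance("paypa1", "paypal"))     # 1: a single spoofing edit
print(edit_distance("wikipedia", "paypal"))  # large: unrelated strings
```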
Host-oriented Features
A host-oriented feature is derived from the hostname element of a URL. It allows determination of the identity, location, properties, and management style of a malicious host. Studies of the effect of host-oriented features on the suspiciousness of a URL reveal that phishers exploit short-URL services, which makes this family of features a vital element in detecting malicious URLs. Host-based features in current use include WHOIS information, IP address properties, domain name properties, and connection speed.
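A few of these host-oriented signals can be sketched with the Python standard library alone; real systems would additionally query WHOIS and passive-DNS sources, and the helper below is an illustrative assumption:

```python
# Sketch of basic host-oriented features using only the standard library.
import socket
import ipaddress
from urllib.parse import urlparse

def host_features(url):
    host = urlparse(url).netloc.split(":")[0]
    try:
        ip = socket.gethostbyname(host)   # does the hostname resolve?
        resolvable = True
    except socket.gaierror:
        ip, resolvable = None, False
    is_ip_literal = False
    try:
        ipaddress.ip_address(host)        # host given directly as an IP?
        is_ip_literal = True
    except ValueError:
        pass
    return {"host": host, "ip": ip, "resolvable": resolvable,
            "is_ip_literal": is_ip_literal}

print(host_features("http://192.168.0.1/login"))  # IP-literal obfuscation
```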
In one study, Patil and Patil propose a methodology for detecting malicious URLs, together with the type of attack, using multi-class classification. The researchers offer 42 new features of phishing, spam, and malware URLs not included in earlier studies of attack-type identification and detection. Binary and multi-class datasets were created using 26,041 benign and 23,894 malicious URLs, the latter comprising 11,297 malware, 8,976 phishing, and 3,621 spam URLs. The researchers used state-of-the-art supervised batch and online machine learning classifiers, with experiments carried out on both the binary and multi-class datasets. The findings reveal that the learned classifiers achieve detection accuracies of 98.44% in the multi-class setting and 99.86% in the binary setting, the best reported detection accuracy rates.
In another study, Babagoli proposed an approach to phishing URL detection that employs a meta-heuristic-based non-linear regression algorithm together with a feature selection technique. The researchers used a dataset containing 11,055 legitimate and phishing webpages and chose 20 main features to extract from the selected websites. They employed two feature selection techniques, namely the wrapper method and the decision tree, the latter yielding a detection accuracy of 96.32%. Following the feature selection phase, the researchers implemented two schemes to detect and predict malicious websites, deploying the harmony search (HS) meta-heuristic alongside both support vector machine and non-linear regression approaches. As the study notes, the non-linear regression approach was essential in classifying the websites, with the parameters of the regression model determined by the harmony search algorithm. The analysis demonstrated that the non-linear regression technique based on HS achieved detection accuracy rates of 94.13% on the training set and 92.80% on the test set. The performance analysis also showed that the HS-based non-linear regression approach outperformed SVM.
Zuhair also proposed a series of 58 newly derived hybrid webpage features, obtained directly from webpage content and URLs, and refined them into a small subset of maximally relevant, minimally redundant, and robust features. The researchers considered two feature categories throughout the experiments: the first was a set of 48 content features, particularly embedded-object and cross-site-scripting features, while the second was a set of 10 URL features obtained from webpage URLs. Furthermore, the researchers employed the maximum relevance and minimum redundancy (mRMR) criterion to determine the feature subset best suited to phishing detection; according to the researchers, mRMR removes redundant and irrelevant features simultaneously over a high-dimensional feature space. The study evaluated the approach with an SVM classifier and metrics such as true positives (TP), false positives (FP), false negatives (FN), recall, and F-measure. The findings demonstrated that the method is relevant to optimizing phishing detection for future anti-phishing schemes.
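The mRMR criterion can be sketched as a greedy search that scores each candidate feature by its mutual-information relevance to the label minus its average redundancy with already-selected features; this is an illustrative reading of the criterion, assuming discretized feature columns, not Zuhair's implementation:

```python
# Greedy mRMR sketch: pick features with maximum relevance to the label
# and minimum redundancy with the features already chosen.
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr_select(X, y, k):
    """X: 2-D array of discrete feature columns; y: labels; k: subset size."""
    n = X.shape[1]
    relevance = np.array([mutual_info_score(X[:, j], y) for j in range(n)])
    selected = [int(np.argmax(relevance))]  # start with the most relevant
    while len(selected) < k:
        scores = {
            j: relevance[j]
               - np.mean([mutual_info_score(X[:, j], X[:, s]) for s in selected])
            for j in range(n) if j not in selected
        }
        selected.append(max(scores, key=scores.get))
    return selected
```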
In a related study, Choi proposed an approach that uses machine learning to detect malicious URLs of popular attack categories such as malware, phishing, and spam, and to determine the nature of the attack launched by a suspicious URL. The researchers used features such as link popularity, lexical content, network traffic, webpage content, and DNS information including DNS fluctuation. They collected real data from different sources, from Yahoo's directory to PhishTank, and employed three machine learning algorithms, namely SVM for detecting suspicious URLs and ML-kNN and RAkEL for multi-class categorization. The researchers evaluated their approach on 40,000 benign and 32,000 malicious URLs and achieved a detection accuracy of 98% and attack-type identification accuracy of 93%.
Canali proposed a filter named Prophiler that uses static analysis to inspect webpages for malicious content quickly. The researchers employed features obtained from the HTML content of webpages, the associated JavaScript code, and the corresponding URLs, and evaluated them with machine learning algorithms such as Random Tree, Naïve Bayes, Random Forest, J48, Logistic regression, and Bayes Net. As the researchers note, their filtering technique reduced the load on more costly dynamic tools such as Wepawet by at least 85%, with only a small amount of missed suspicious content. Earlier, Justin Ma and colleagues had proposed an automated URL classification approach that applies statistical techniques to the host-based and lexical characteristics of suspicious website URLs. They extracted host-based and lexical features, namely WHOIS properties, IP address properties, geographic properties, and domain name properties, and carried out the evaluation using machine learning algorithms such as SVM and Logistic regression. According to the results, the resulting classifiers achieved 95-99% detection accuracy over large numbers of suspicious website URLs, with only modest false positives.
Abdelhamid et al. designed a system called Multi-label Classifier based on Associative Classification (MCAC) to detect phishing URLs on websites. The researchers employed 16 features and categorized website URLs into three groups, namely legitimate, suspicious, and phishing. MCAC is a rule-based algorithm that mines classification rules from a phishing dataset. Patil and Patil also provide an overview of the different types of malicious webpage attacks in their survey of malicious webpage detection approaches. Hadi et al., on the other hand, employed the Fast Associative Classification Algorithm (FACA) to classify phishing website URLs; the algorithm works by finding every frequent rule-item set, subject to minimum support and minimum confidence parameters, and using the resulting rules to build the classification model. The researchers examined a dataset comprising 11,055 instances belonging to two classes, phishing and legitimate.
Jeena, Preethi, Praveena, and Preethi also proposed a dynamic approach that integrates machine learning strategies into the process of sensing malicious URLs. Owing to the increasing use of smartphones, especially for online transactions, users face a constant threat to the safety and security of the data exchanged through their devices. In the approach, the researchers used two algorithms, namely a Convolutional Neural Network (CNN) and a Support Vector Machine. The system was given as input approximately 1,000 suspicious websites mined from reputable sources such as the Alexa ranking. To make the technique more robust, logistic regression was applied for classification and prediction. The supervised learning algorithms, SVM and CNN, achieved accuracy and F1 scores of 0.81 and 0.74, respectively.
Xuan, Nguyen, and Nikolaevich also proposed a method to detect malicious URLs with machine learning strategies based on the attributes and behaviours of URLs. The study also examined and exploited big data technology to boost the capacity for sensing malicious software through abnormal behaviours. In short, the proposed detection system comprises a new series of URL behaviours and features, a machine learning algorithm, and big data technology. The findings illustrate that the proposed behaviours and attributes of URLs can considerably boost the capacity to detect malicious URLs and offer a friendly, optimized solution for detecting malicious software.
Works Cited
Abdelhamid, Neda, Aladdin Ayesh, and Fadi Thabtah. “Phishing detection based associative classification data mining.” Expert Systems with Applications 41.13 (2014): 5948-5959.
Babagoli, Mehdi, Mohammad Pourmahmood Aghababa, and Vahid Solouk. “Heuristic non-linear regression strategy for detecting phishing websites.” Soft Computing 23.12 (2019): 4315-4327.
Canali, Davide, et al. “Prophiler: A Fast Filter for the Large-Scale Detection of Malicious Web Pages.” Proceedings of the 20th International Conference on World Wide Web, 2011.
Cho, Do Xuan, and Ha Hai Nam. “A Method of Monitoring and Detecting APT Attacks Based on Unknown Domains.” Procedia Computer Science 150 (2019): 316-323.
Jeena, R., Preethi, G., Praveena, A., and Preethi, A. “Malicious URL Detection Using Machine Learning Techniques.” International Journal of Innovative Research in Science, Engineering and Technology 8.3 (2019): 1751-1754.
Patil, Dharmaraj Rajaram, and J. B. Patil. “Survey on malicious web pages detection techniques.” International Journal of u-and e-Service, Science and Technology 8.5 (2015): 195-206.
Patil, Dharmaraj R., and J. B. Patil. “Malicious web pages detection using static analysis of URLs.” International Journal of Information Security and Cybercrime 5.2 (2016): 31-50.
Sahoo, Doyen, Chenghao Liu, and Steven CH Hoi. “Malicious URL detection using machine learning: A survey.” arXiv preprint arXiv:1701.07179 (2017).
Xuan, Cho Do, Hoa Dinh Nguyen, and Tisenko Nikolaevich. “Malicious URL Detection based on Machine Learning.” International Journal of Advanced Computer Science and Applications 11.1 (2020): 148-153.
Zuhair, Hiba, Ali Selamat, and Mazleena Salleh. “Selection of robust feature subsets for phish webpage prediction using maximum relevance and minimum redundancy criterion.” Journal of Theoretical and Applied Information Technology 81.2 (2015): 188.