Machine Learning: Best Approach toward Ransomware Detection



Introduction
As technology has advanced, the use of computer networks has grown enormously. This dependence on computers has attracted hackers and other cybercriminals. Attacks can disrupt governments and the businesses they affect, and are therefore an issue that needs to be tackled.
Using machine learning and artificial intelligence to fix vulnerabilities and identify threats early could help increase confidence in computer networks and protect these systems. Machine learning and AI can be used to gather information, analyze threats against endpoints and networks, model attack scenarios, and locate perimeter holes and weak points. Expanding the use of machine learning in cybersecurity could be a game-changer for the effectiveness of threat detection.

Ransomware Background
To apply machine learning to ransomware detection, one first needs to understand the threat mechanisms of ransomware, since strong and effective detection systems are built on that understanding. Ransomware attacks target critical files on a computer system and lock them until a ransom is paid to the attackers.
One of the most infamous ransomware attacks occurred in 2017 and is known as the WannaCry attack. According to Zhang et al. (2019), the attack infected more than 200,000 computers in 150 countries in less than a day. Such attack mechanisms pose enormous dangers to users, especially companies that may be forced to incur losses in order to free themselves from ransomware attacks.
Unlike other forms of malware, ransomware is increasingly hard to deal with. According to Bae, Lee & Im (2020), the number of ransomware variants has increased rapidly every year, with each generation more sophisticated than the last. In the past, machine learning systems have been used to detect malware and build strong defense systems that hackers find hard to penetrate. In machine learning-based anti-malware programs, the algorithm retains signatures of different types of malware, making it easier for the system to recognize a piece of malware in the event of an attack. With ransomware, however, the increasing sophistication makes it difficult to maintain a stable database of samples that can be used to train detection systems. Satter (2021) notes that it is almost impossible to protect a system against ransomware due to the ever-changing variants. This does not mean that ransomware cannot be detected. A well-designed machine learning algorithm with all the necessary prerequisites could help organizations protect themselves against a ransomware attack.
Ransomware Detection Taxonomy
Designing a good ransomware detection system requires an understanding of how machine learning algorithms work, since that understanding leads to the best solutions to the ransomware problem. Different types of malware call for different machine learning algorithms, each built around the characteristics of the problem being solved.
To detect ransomware, the first step is to extract the features. Every piece of ransomware originating from a different attacker may have a unique signature that informs its attack method and proliferation into the host computer system. Fernando, Komninos & Chen (2020) provide a taxonomy for ransomware detection that tracks some of the steps that could be applied to build an effective machine learning model. The extraction of features could be based on multiple information criteria about ransomware. In short, one could extract the features of ransomware based on prerecorded data that outlines the working of ransomware, especially in relation to sensitive files within a computer system.
Alternatively, the machine learning model could use a custom feature selection algorithm, that is, a specialized algorithm that selects the salient features of a suspected file to determine whether it is indeed ransomware. The next step is to apply specific machine learning models to train an algorithm to separate ransomware files from normal system files. The efficacy of the algorithm is determined by how accurately it separates the two. Machine learning techniques that could be used include Support Vector Machine (SVM), logistic regression, decision trees, and random forests (Fernando, Komninos & Chen, 2020). Each technique offers a different level of accuracy and effectiveness in detecting ransomware in a computer system.
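To make the training step concrete, the following is a minimal sketch of one of the listed techniques, logistic regression, written in plain Python. It is an illustration under stated assumptions rather than a method taken from the cited studies: the two features (a normalized file write rate and a normalized entropy change) and all of the sample values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and a bias with batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (ransomware) or 0 (benign) for one feature vector."""
    score = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
    return 1 if score >= 0.5 else 0


# Hypothetical, already-normalized features: [write_rate, entropy_change].
rows = [
    [0.10, 0.20], [0.20, 0.10], [0.15, 0.30], [0.30, 0.20],  # benign files
    [0.80, 0.90], [0.90, 0.70], [0.70, 0.80], [0.85, 0.95],  # ransomware
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

weights, bias = train_logistic_regression(rows, labels)
print(predict(weights, bias, [0.9, 0.9]), predict(weights, bias, [0.1, 0.1]))
```

A real detector would be trained on a far larger and messier dataset, and the other techniques named above would be compared on the same features before one is chosen.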

Features of a Machine Learning Application for Detecting Ransomware
The advantage of machine learning applications over traditional malware detection systems is automation. Machine learning systems run an algorithm that analyzes suspect files in real time and determines whether they are ransomware. Designing an efficient ransomware detection algorithm that uses machine learning as its main process requires large representative datasets. The dataset is used to train a machine learning algorithm to detect and report ransomware. A ransomware dataset comprises a large number of files, some of which are ransomware, while others are normal files found on a computer system. Because ransomware mostly targets computer files, the algorithm must be trained so that it does not mistake legitimate system files for ransomware.
The trained algorithm then needs to be interpreted. According to Kaspersky Labs (2020), the trained model has to be interpretable in the sense that it can take input from its environment and return the desired results. The algorithm should be able to distinguish actual threats to the computer system from false ones. The Kaspersky Labs literature on machine learning in malware detection further states that the false positive rate of the trained algorithm should be low. False positives are cases where the algorithm reports ransomware when there is none. The opposite error is a false negative, in which the system fails to flag an actual ransomware threat. False negatives are more dangerous to the security of an organization because they increase the likelihood of a successful ransomware attack. False positives, on the other hand, could cause panic throughout the organization and lead to unnecessary downtime.
The machine learning algorithm for ransomware should also adapt to an attacker's processes. According to Kaspersky Labs (2020), attackers often apply countermeasures that allow them to circumvent the security settings put in place against malware. In the case of ransomware detection, the machine learning algorithm should anticipate any modifications that the attacker may make to ensure that a ransomware attack is successful. Automated adaptability to threat scenarios ensures that the systems remain formidable, even without human insight.
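The two kinds of error can be measured directly once a model's predictions are compared with known labels. The sketch below computes the false positive rate and the false negative rate; the label convention (1 for ransomware, 0 for benign) and the sample predictions are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count outcomes, treating label 1 as ransomware and 0 as benign."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1  # ransomware correctly flagged
        elif truth == 0 and pred == 1:
            fp += 1  # benign file wrongly flagged
        elif truth == 0 and pred == 0:
            tn += 1  # benign file correctly passed
        else:
            fn += 1  # ransomware missed
    return tp, fp, tn, fn


def error_rates(y_true, y_pred):
    """Return (false positive rate, false negative rate)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


# Hypothetical ground truth and model output for eight files.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
fpr, fnr = error_rates(y_true, y_pred)
print(f"false positive rate: {fpr:.2f}, false negative rate: {fnr:.2f}")
```

Because a missed attack is costlier than a false alarm, a detection threshold would typically be tuned to keep the false negative rate low while holding the false positive rate at a level the organization can tolerate.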

Classifying Ransomware
Machine learning systems use classification strategies that determine the salient characteristics of an item before a selection is made. The use of large representative datasets, as defined by Kaspersky Labs (2020), facilitates the classification of files into their correct categories, including the categories that signal malware. Various studies have been conducted to determine which features of ransomware could be used to separate it from other files. Chen et al. (2018) summarize a catalog of classification strategies from other scholars that gives an idea of the process that could be applied in the classification of ransomware. One way to classify ransomware is to monitor network traffic. According to Chen et al. (2018), one could analyze HTTP message sequences and content sizes to detect ransomware. This process requires an understanding of the standard message sequences of communications between servers and hosts as well as the typical content size of the packets being transmitted.
Different types of ransomware can be classified in different ways. Arivudainambi, KA & Visu (2020) present a detection model that monitors incoming TCP/IP packets through the server. This process is similar to the one described by Chen et al. (2018) above. However, Arivudainambi, KA & Visu (2020) analyze the packet header and use the command and control (C&C) server to blacklist detected ransomware attacks. In this case, the packet header of the ransomware traffic provides critical information for determining whether the content is safe for the host system. Another threat detection model could involve collecting sample logs from different ransomware families. Longitudinal studies of ransomware have produced standard ransomware descriptors, enabling security experts to classify malware into families. To maximize the potential for detecting ransomware, the salient characteristics of every ransomware family are sampled and then used to train a machine learning algorithm that can detect and report the ransomware as needed. Relying on only one or two aspects of ransomware reduces the overall chances of identifying malware because new variants appear so frequently.
With knowledge of ransomware families, one can build the maximal frequent patterns of ransomware to put into perspective the most likely attacks that a system can expect. For example, suppose the CryptoLocker ransomware becomes common in a certain region of the US over the course of, say, a month. In that case, security experts can sample the most frequently repeated characteristics of the ransomware at that time and then build a machine learning algorithm that can detect and isolate future threats. According to Arivudainambi, KA & Visu (2020), using the maximum frequent pattern method resulted in 95 percent accuracy in detecting ransomware samples. Moreover, observing ransomware patterns for similarities means that organizations have to work closely together to provide data that could enable better classification of malware for future protection. Classifying ransomware is the first step toward ensuring that the machine learning algorithm becomes more efficient at extracting features from suspicious files. A complex machine learning model includes as many classification features as possible.
The need for multiple classification features in the characterization of ransomware comes from the known weaknesses of signature-based malware detection techniques. Signature-based techniques, which are applied in most anti-virus programs, are most effective at identifying known threats. Hwang et al. (2020) note that a strong malware detection program should be able to identify various behavioral traits of malware. These traits include the file's activities, its API call patterns and frequencies, registry keys, and file extensions. In a nutshell, one is looking at the immediate characteristics of the suspected ransomware file and the interactions between the file and the rest of the system. Such comprehensive functions can only be delivered by well-designed machine learning algorithms that incorporate comprehensive classification strategies. The machine learning algorithm should also be able to perform its classification without affecting the availability of critical system files.
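The behavioral traits listed above can be turned into numeric features that a classifier can consume. The following sketch assumes a hypothetical event trace, such as one recorded in a sandbox, in which each event is a (kind, value) pair; the event kinds, the feature names, and the sample trace are all illustrative assumptions and are not drawn from Hwang et al. (2020).

```python
from collections import Counter


def behavior_features(events):
    """Summarize a trace of (kind, value) events into numeric features."""
    api_calls = [value for kind, value in events if kind == "api"]
    call_counts = Counter(api_calls)
    # Consecutive call pairs serve as a simple stand-in for API call patterns.
    call_pairs = Counter(zip(api_calls, api_calls[1:]))
    registry_keys = {value for kind, value in events if kind == "registry"}
    extensions = {
        value.rsplit(".", 1)[-1].lower()
        for kind, value in events
        if kind == "file_write" and "." in value
    }
    return {
        "encrypt_calls": call_counts["CryptEncrypt"],
        "delete_calls": call_counts["DeleteFileW"],
        "read_then_encrypt": call_pairs[("ReadFile", "CryptEncrypt")],
        "registry_keys_touched": len(registry_keys),
        "distinct_written_extensions": len(extensions),
    }


# Hypothetical trace of a process that encrypts two documents.
trace = [
    ("api", "ReadFile"),
    ("api", "CryptEncrypt"),
    ("api", "WriteFile"),
    ("file_write", "report.docx.locked"),
    ("api", "DeleteFileW"),
    ("api", "ReadFile"),
    ("api", "CryptEncrypt"),
    ("api", "WriteFile"),
    ("file_write", "budget.xlsx.locked"),
    ("api", "DeleteFileW"),
    ("registry", r"HKCU\Software\Microsoft\Windows\CurrentVersion\Run"),
]
features = behavior_features(trace)
print(features)
```

Combining several such features in one vector reflects the point made above: a model that relies on a single trait is easier for a new variant to evade.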

Sample Ransomware Machine Learning Application – Khammas (2020)
The main stages of ransomware detection include preprocessing, feature selection, and classification. The model proposed by Khammas (2020) offers an example that can be used to summarize the discussions presented in the preceding sections of this research. Figure 1 below summarizes the model proposed by Khammas (2020).

Figure 1. Khammas’ (2020) model of classifying malware
The first stage in the model is the input stage. In this stage, the large representative dataset, made up of "goodware exe files" and "ransomware exe files", is subjected to feature extraction. In the earlier discussions of this research, Kaspersky Labs (2020) noted the need for large representative datasets that contain both malware and actual system files. Khammas' (2020) model makes this distinction in its design. Additionally, Hwang et al. (2020) mentioned that file extensions could be used to classify malware. In the presented example, the choice of executable files as input highlights the mode of classification used.
The first processing stage is preprocessing. In this stage, the datasets are processed to produce salient features that can help the ransomware detection algorithm detect malware. The method starts by extracting the features of the files input at the beginning of the process. The model applies frequent pattern mining to produce features that may be shared by or differ between the goodware and ransomware exe files. The frequency of these patterns helps the model determine which features best represent ransomware or goodware files. The significance of this process is that it enables the algorithm to differentiate between the two kinds of files in real-world use. The data are then normalized to ensure that there are no repeated sets or inconsistent information that may affect the model's accuracy when trying to identify malware files.
The next step in the Khammas (2020) model is feature selection. In this stage, the extracted features that best represent the characteristics of the ransomware files are selected. A gain ratio is calculated for each feature, which then determines the final features that the algorithm will use to classify the goodware and ransomware files. The last stage of the algorithm is the classification stage, where the algorithm applies a random forest approach to determine which files are goodware and which are ransomware.
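The preprocessing stage can be sketched as follows. The sketch makes several assumptions that are not stated above: patterns are taken to be byte n-grams, a pattern counts as frequent when it appears in at least a chosen share of the files, and normalization is interpreted as min-max scaling of the resulting counts. It illustrates the general idea and should not be read as Khammas' (2020) exact procedure.

```python
from collections import Counter


def byte_ngrams(data, n=2):
    """Return every run of n consecutive bytes in the data."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def frequent_patterns(files, n=2, min_support=0.5):
    """Return n-grams that occur in at least min_support of the files."""
    doc_freq = Counter()
    for data in files:
        doc_freq.update(set(byte_ngrams(data, n)))
    cutoff = min_support * len(files)
    return sorted(p for p, count in doc_freq.items() if count >= cutoff)


def feature_vector(data, patterns, n=2):
    """Count how often each frequent pattern occurs in one file."""
    counts = Counter(byte_ngrams(data, n))
    return [counts[p] for p in patterns]


def min_max_normalize(rows):
    """Scale every feature column to the range 0 to 1."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Tiny stand-ins for the contents of executable files.
files = [b"abab", b"abcd", b"cdcd"]
patterns = frequent_patterns(files)
vectors = [feature_vector(f, patterns) for f in files]
normalized = min_max_normalize(vectors)
print(patterns, vectors, normalized)
```

Real executables would yield a very large number of candidate patterns, which is why a feature selection stage normally follows this step.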
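The gain ratio used in feature selection is the information gain of a feature divided by its split information, where both quantities are built from entropy. The sketch below computes it for discrete features and ranks them; the feature names and sample values are hypothetical, and the code follows the standard definition of gain ratio rather than any implementation detail from the cited paper.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of a list of discrete values."""
    total = len(values)
    return sum((n / total) * math.log2(total / n)
               for n in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain of the feature divided by its split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0


def select_top_features(columns, labels, k):
    """Return the names of the k features with the highest gain ratio."""
    ranked = sorted(columns, key=lambda name: gain_ratio(columns[name], labels),
                    reverse=True)
    return ranked[:k]


# Hypothetical binary features: 1 means the pattern is present in the file.
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = ransomware, 0 = goodware
columns = {
    "pattern_a": [1, 1, 1, 1, 0, 0, 0, 0],  # matches the labels exactly
    "pattern_b": [1, 0, 1, 0, 1, 0, 1, 0],  # unrelated to the labels
    "pattern_c": [1, 1, 1, 0, 0, 0, 0, 0],  # partly informative
}
print(select_top_features(columns, labels, 2))
```

The selected features would then be passed to the classification stage, which in this model is a random forest.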
Machine Learning as a Backup Mechanism
There is no guarantee that machine learning systems will detect every form of malware, especially if the attacker applies a unique strategy that the machine learning model cannot adapt to. In such a case, an organization's cybersecurity team carries the burden of ensuring that the ransomware attack does not result in the loss of critical system files. Eckerle (2021) describes a machine learning method that could be used as a backup for system files. When attackers lock critical files and demand a ransom, the organization has backup copies that can be relied on while it deals with the attack. Eckerle (2021) describes the Rubrik Cloud Data Management Platform, which takes snapshots of existing files and creates a metafile with a log of the changes made to each file in every subsequent operation. Instead of being processed on site by computationally intensive machine learning algorithms, the files are uploaded to a machine learning pipeline that resides in the cloud. The machine learning system keeps a continuous log that can facilitate recovery in the event of a cyberattack.
The Rubrik system also analyzes the file system to determine which files are safe and which are not. A file system behavioral analysis looks at measures such as the number of files added and the number of files deleted, which are salient features of a ransomware attack. Assessing the behavior of the file system makes it possible for the machine learning system to raise alerts when malicious activity is detected on the file system. This approach to recognizing anomalies departs from network-based anomaly detection and moves the machine learning to the file system level. Another method of analysis that could be applied to the files themselves is content analysis. The Rubrik system has an application called Radar, which looks for a sharp increase in the entropy of a file, something that is characteristic of ransomware attacks. This approach to ransomware detection applies machine learning as the method of last resort to ensure that impending ransomware attacks do not infiltrate the file system.
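The idea of snapshots with a log of changes can be illustrated with a small sketch. It is not a description of how the Rubrik platform works internally: the in-memory file dictionaries, the use of SHA-256 digests, and the function names are all assumptions made for the example.

```python
import hashlib


def take_snapshot(files):
    """Map each path to a SHA-256 digest of its content."""
    return {path: hashlib.sha256(data).hexdigest()
            for path, data in files.items()}


def diff_snapshots(old, new):
    """Return the change log between two snapshots."""
    return {
        "added": sorted(set(new) - set(old)),
        "deleted": sorted(set(old) - set(new)),
        "modified": sorted(p for p in set(old) & set(new) if old[p] != new[p]),
    }


# Hypothetical file contents on two consecutive days.
monday = take_snapshot({
    "a.txt": b"alpha",
    "b.txt": b"beta",
    "c.txt": b"gamma",
})
tuesday = take_snapshot({
    "a.txt": b"alpha",
    "b.txt": b"\x8f\x02\xc1",
    "d.txt.locked": b"\x91\x7e",
})
change_log = diff_snapshots(monday, tuesday)
print(change_log)
```

A sudden rise in the number of added, deleted, and modified files between two snapshots is the kind of signal that could prompt an alert and point to the last clean snapshot for recovery.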
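The entropy check can be illustrated with Shannon entropy measured in bits per byte. Ordinary text uses a small set of byte values and scores low, whereas encrypted content is close to uniformly distributed and scores near the maximum of 8. The sketch below is a generic illustration and not Radar's actual implementation; the threshold of 2.0 bits and the sample data are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(data):
    """Return the Shannon entropy of a byte string in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    return sum((n / total) * math.log2(total / n)
               for n in Counter(data).values())


def entropy_jump(before, after, threshold=2.0):
    """Report whether entropy rose by at least the threshold between versions."""
    return shannon_entropy(after) - shannon_entropy(before) >= threshold


plain = b"quarterly report: revenue up, costs flat. " * 200
# Stand-in for encrypted content: every byte value appears equally often.
scrambled = bytes(range(256)) * 32

print(round(shannon_entropy(plain), 2), round(shannon_entropy(scrambled), 2))
print(entropy_jump(plain, scrambled))
```

Compressed files such as archives and images also have high entropy, so a check of this kind is more reliable when it compares successive versions of the same file than when it relies on an absolute value.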
Conclusion
As identified above, the traits of ransomware are constantly changing. However, the variations in the characteristics of malware should not deter organizations from installing the processes needed to keep their systems safe. There is a chance to stop hackers by applying an automated process that provides 24-hour protection to a system and its files. Machine learning systems can detect repeating patterns of ransomware signatures and raise alarms for cybersecurity teams to act before the ripple effects of the ransomware are felt throughout an organization. For instance, ransomware often whitelists files. Whitelist files are files that grant access to system files such as the C:/program files (x86)/steam/Na (Moussaileb et al., 2018). A machine learning system can be trained to identify executable programs that attempt to access the system files and subsequently cause the system to block the malicious file. The risk of a ransomware attack is not a matter of if but a matter of when. The growing pervasiveness of cyberattacks around the globe makes it difficult to predict a ransomware attack. Additionally, attackers may take advantage of the hours when an organization's capacity to respond would be low to initiate a ransomware attack. The best way to act would be to use machine learning methods at both the network and file system levels to reduce the likelihood of successful ransomware attacks.

References
Arivudainambi, D., KA, V. K., & Visu, P. (2020). Ransomware Traffic Classification Using Deep Learning Models: Ransomware Traffic Classification. International Journal of Web Portals (IJWP), 12(1), 1-11.
Bae, S. I., Lee, G. B., & Im, E. G. (2020). Ransomware detection using machine learning algorithms. Concurrency and Computation: Practice and Experience, 32(18), e5422.
Chen, L., Yang, C. Y., Paul, A., & Sahita, R. (2018). Towards resilient machine learning for ransomware detection. arXiv preprint arXiv:1812.09400.
Craigen, D., Diakun-Thibault, N., & Purse, R. (2014). Defining Cybersecurity. Technology Innovation Management Review, 4(10). https://www.timreview.ca/article/835
Eckerle, A. (2021). Using machine learning for anomaly detection and ransomware recovery. CIO. Retrieved November 20, 2021, from https://www.cio.com/article/3633171/using-machine-learning-for-anomaly-detection-and-ransomware-recovery.html.
Fernando, D. W., Komninos, N., & Chen, T. (2020). A Study on the Evolution of Ransomware Detection Using Machine Learning and Deep Learning Techniques. IoT, 1(2), 551-604.
Hwang, J., Kim, J., Lee, S., & Kim, K. (2020). Two-stage ransomware detection using dynamic analysis and machine learning techniques. Wireless Personal Communications, 112(4), 2597-2609.
Kaspersky Labs. (2020). Machine learning for malware detection. Retrieved November 20, 2021, from https://media.kaspersky.com/en/enterprise-security/Kaspersky-Lab-Whitepaper-Machine-Learning.pdf.
Khammas, B. M. (2020). Ransomware Detection Using Random Forest Technique. ICT Express, 6(4), 325-331.
Moussaileb, R., Bouget, B., Palisse, A., Le Bouder, H., Cuppens, N., & Lanet, J. L. (2018, August). Ransomware's early mitigation mechanisms. In Proceedings of the 13th International Conference on Availability, Reliability, and Security (pp. 1-10).
Polyakov, A. (2019, July 31). Machine Learning for Cybercriminals. Medium. https://towardsdatascience.com/machine-learning-for-cybercriminals-a46798a8c268
Satter, R. (2021, July 5). Up to 1,500 businesses affected by a ransomware attack, U.S. firm's CEO says. Reuters. Retrieved November 20, 2021, from https://www.reuters.com/technology/hackers-demand-70-million-liberate-data-held-by-companies-hit-mass-cyberattack-2021-07-05/.
Zhang, H., Xiao, X., Mercaldo, F., Ni, S., Martinelli, F., & Sangaiah, A. K. (2019). Classification of ransomware families with machine learning based on N-gram of opcodes. Future Generation Computer Systems, 90, 211-221.
