A guide to Machine Learning for Application Security

Machine Learning is definitely not the magic bullet it is sometimes made out to be, but it is by far the most promising area in cybersecurity. If applied properly, it will unlock a variety of new capabilities, especially when dealing with the growing complexity of the AppSec landscape. We will take a look at how you can leverage different ML algorithms to improve your application security posture and strengthen your company’s defense capabilities.

But first, let’s take a closer look at why web applications and their vulnerabilities have been rising in importance.

If you are already an expert on application security, you can skip to our introduction to Machine Learning or directly to the application of ML to AppSec.

Application Security is becoming increasingly complex and difficult to maintain

Web applications, APIs, microservices, serverless: the modern portfolio of software assets at a company is growing across all verticals as more of our life becomes digital. Establishing and maintaining security for all these applications has become notoriously difficult due to increasing code and architecture complexity. As a result, we are seeing a big increase in the number of vulnerabilities and breaches originating in the application layer. Software that is publicly reachable via the internet is now such an attractive target that 75% of attacks happen at the application level.

According to the Open Web Application Security Project’s list (OWASP Top 10 - 2017), the most prevalent risks include Sensitive Data Exposure, Security Misconfiguration, and Cross-Site Scripting. By abusing these, hackers are able to compromise sensitive data, gain unauthorized access to system data or functionality, steal credentials, deliver malware to victims, and much more.

Examples of such attacks can be observed almost weekly, such as the leak of non-medical personal data of 1.5 million patients of Singapore’s largest group of healthcare institutions, reported on the day of writing this article. According to a recent IBM security study, the average cost of a data breach amounts to roughly $8 million in the US, disregarding the often much larger costs arising from a loss of customer trust. This shows that companies have to fight the battle on three fronts: safeguarding the data, the integrity, and the availability of their applications.

Simplified view of an attack.

Meanwhile, most traditional security solutions still primarily deliver protection at the network layer, focused on signature-based intrusion prevention and detection, which has little or no impact on attacks happening at the application layer. This gap calls for increased awareness of software application security and for the use of new technologies such as Machine Learning to compensate for the shortcomings of legacy solutions.

Machine Learning and Artificial Intelligence provide a great solution to some of cybersecurity’s biggest challenges: analyzing data and finding signs of a breach as early as possible.

Artificial Intelligence, Machine Learning or Data Mining?

According to Merriam-Webster, Artificial Intelligence (AI) can be defined as follows:

“The capability of a machine to imitate intelligent human behavior.”

In reality, however, the goals behind applying artificial intelligence frequently go beyond replicating the human mind, instead using human reasoning as a model to create services that perform tasks impossible for a human.

Machine Learning (ML) is the foundation that allows us to build such smart systems: systems that can make a decision on their own, without having been explicitly programmed to do so. For example, a human could look at a hundred incoming client requests and flag some of them as potentially malicious. The same can be achieved with a Machine Learning algorithm, which looks at data previously labeled by a human, learns from those decisions, and can then classify incoming requests on its own. Where the two diverge, however, is in the quantity of information they can process: while a human would likely not be able to recognize such anomalies among thousands of data points within a reasonable timeframe, an algorithm is capable of performing the same task for millions of requests, with the only limitation being the available computing power.

Machine Learning itself can be split into two primary disciplines which differ in their training and application: supervised and unsupervised learning. Essentially, in supervised learning, algorithms are used to “reverse-engineer” the function f needed to predict an output variable Y from an input x (f(x) = Y), an approach commonly used for regression and classification. Conversely, in unsupervised learning, the input data x is clustered or associations are established without previously defining a corresponding output variable Y. Semi-supervised learning can be seen as a hybrid between the two that is particularly relevant to application security, since it enables the user to work with a combination of labeled and unlabeled data, making use of the fact that a lot of data in the field is similar.
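
To make the distinction concrete, here is a minimal sketch in plain Python. The single feature (for instance the length of a query string), the sample values, and all function names are illustrative assumptions rather than real traffic or a recommended model. The supervised part learns f(x) = Y from labeled pairs, while the unsupervised part groups the same values without ever seeing a label.

```python
# Illustrative data: one numeric feature x per request and a human-assigned label Y
# (0 = clean, 1 = malicious). The values are made up for this sketch.
labeled = [(12, 0), (15, 0), (18, 0), (20, 0), (95, 1), (110, 1), (130, 1)]

def fit_threshold(samples):
    """Supervised: learn f(x) = Y as the midpoint between the two class means."""
    clean = [x for x, y in samples if y == 0]
    malicious = [x for x, y in samples if y == 1]
    return (sum(clean) / len(clean) + sum(malicious) / len(malicious)) / 2

def predict(threshold, x):
    """Apply the learned function to a new, unlabeled input."""
    return 1 if x > threshold else 0

def two_means(values, iterations=10):
    """Unsupervised: 1-D k-means with k = 2; no labels are involved."""
    low, high = min(values), max(values)
    for _ in range(iterations):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    return sorted(group_low), sorted(group_high)

threshold = fit_threshold(labeled)
clusters = two_means([x for x, _ in labeled])
```

Note that the clustering step only reports that two groups exist. Deciding which group, if any, is malicious still requires a label or a human.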

Deep learning, in turn, is simply one of many approaches to Machine Learning and can be applied to both supervised and unsupervised learning. It differs from other approaches in that its algorithms are loosely modeled on the biological structure of the brain. In fact, deep learning is a subset of a group of algorithms called artificial neural networks (NN). At their core, deep neural networks break up problems into different components across multiple layers, which are then pieced back together, similar to how our brains work. As a result, deep learning algorithms can often achieve better results than other approaches, particularly on large and complex data sets.

So how does data mining fit into this picture? Data scientists apply mining algorithms to automatically look for specific features and patterns within data sets or aggregate information from various sources. However, as opposed to Machine Learning, data mining doesn’t learn and apply knowledge on its own without human interaction. It does, however, serve as a common precursor to many real-world applications of AI and ML.

Overview of the relationship among different data science fields.

Machine Learning x Application Security

What makes applying Machine Learning to application security so crucial? While ML is already being applied to other areas of cybersecurity, application security has been relatively untouched by it. On the offensive side of things, however, there is a lot of movement: according to a recent joint study led by the University of Oxford, the growing use of artificial intelligence by hackers will lead to an expansion of existing threats by replacing human labor, as well as to the introduction of entirely new threat scenarios. This significantly increases the effectiveness and efficiency of attacks. Less interesting targets can now be placed into the scope of an attacker, and attribution can become increasingly difficult. One bad actor now has the tools to do more damage. This creates a strong need for applying Machine Learning to application security, but also a caveat: AI used for attacks is better at exploiting vulnerabilities in AI defense systems than humans are, which means special attention needs to be paid to preventing such abuse by combining human with artificial intelligence.

Another strong application for Machine Learning in AppSec is the defense against zero-day exploits. Protection against such attacks is crucial since they are rarely noticed right away. It usually takes months to discover and address these breaches, and in the meantime large amounts of sensitive data are exposed. Machine Learning can provide a way to protect against such attacks by identifying malicious behavior not only based on rules but also by detecting abnormal data movement and helping to spot outliers.

In application security, large amounts of data are produced that can be used to train ML models efficiently. The challenge, however, lies in preprocessing the data and correctly labeling requests as clean or malicious. Additional complexity is introduced by the fact that attacks often consist of a chain of requests. Looking at individual requests within that chain might not set off any alarms, but looking at the entire chain will uncover what’s going on.
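
The point about request chains can be sketched in a few lines of Python. The log format (session ID, path, HTTP status), the sample entries, the per-request rule, and the limit of three failed logins are all hypothetical choices for illustration. No single request in the log looks malicious, yet grouping the requests by session reveals a repeated-login pattern.

```python
from collections import defaultdict

# Hypothetical request log: (session_id, path, HTTP status).
requests_log = [
    ("s1", "/login", 401), ("s1", "/login", 401), ("s1", "/login", 401),
    ("s1", "/login", 401), ("s1", "/login", 200),
    ("s2", "/login", 401), ("s2", "/login", 200), ("s2", "/profile", 200),
]

def request_looks_malicious(path):
    """Per-request view: a naive check for obviously hostile input in the path."""
    return "../" in path or "<script" in path.lower()

def suspicious_sessions(log, max_failed_logins=3):
    """Chain view: count failed logins per session and flag sessions above the limit."""
    failed = defaultdict(int)
    for session_id, path, status in log:
        if path == "/login" and status == 401:
            failed[session_id] += 1
    return sorted(s for s, count in failed.items() if count > max_failed_logins)

single_request_alerts = [r for r in requests_log if request_looks_malicious(r[1])]
flagged_sessions = suspicious_sessions(requests_log)
```

Here the per-request check raises no alerts at all, while the session-level view flags the session with four failed logins followed by a success.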

Overview of ML-Applications in AppSec

In relation to our description of different Machine Learning approaches above, four primary AppSec applications can be identified as illustrated below:

Overview of common ML applications to AppSec.

While anomaly and misuse detection constitute primary actions of automated intrusion detection systems, data exploration and risk scoring can be seen as support functions that enable security analysts to perform the right responses.

Anomaly detection defines normal behavior first and then identifies every other behavior as an anomaly and thereby a potential threat (comparable to whitelisting).
For misuse detection the reverse is true: Here, malicious behavior is identified based on training with labeled data. Usually, all traffic not classified as malicious is allowed through (comparable to blacklisting).
Data exploration is used to identify characteristics of the data, often using visual exploration which can serve both as the foundation for anomaly or misuse detection and directly assist security analysts by increasing the ‘readability’ of incoming requests.
Finally, risk scoring can be used to assess the probability that a certain user’s behavior is malicious, which can be done either by assigning an absolute risk score or by classifying a user based on the probability that they are a bad actor.
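
The contrast between the first two approaches can be illustrated with a short Python sketch. Everything in it is an assumption made for the example: the set of known paths and the length limit stand in for a learned profile of normal behavior, and the two hand-written patterns stand in for what a classifier trained on labeled data would recognize.

```python
import re

# Assumed profile of "normal" traffic (in practice this would be learned).
NORMAL_PATHS = {"/", "/login", "/search", "/profile"}
MAX_PARAM_LENGTH = 40

# Assumed stand-ins for known-bad behavior (in practice learned from labeled data).
MALICIOUS_PATTERNS = [
    re.compile(r"<script", re.IGNORECASE),
    re.compile(r"union\s+select", re.IGNORECASE),
]

def is_anomaly(path, param):
    """Anomaly detection: anything outside the normal profile is flagged."""
    return path not in NORMAL_PATHS or len(param) > MAX_PARAM_LENGTH

def is_misuse(path, param):
    """Misuse detection: only input matching known-bad behavior is flagged."""
    return any(pattern.search(param) for pattern in MALICIOUS_PATTERNS)
```

A request to an unknown path such as `/admin/debug` is caught by `is_anomaly` but passes `is_misuse`, whereas a short script injection on `/search` is caught by `is_misuse` but fits the normal profile. This is one reason the two approaches are often combined.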

Most promising use of ML in Application Security

In Machine Learning, the quality and quantity of available data usually make or break the performance of the resulting algorithms, and application security is no exception. This makes some of the potential applications less promising than others:

Supporting misuse detection by classifying incoming requests requires significant amounts of labeled data. Unfortunately, it is very difficult to obtain labeled data of client requests to an application, because it is almost impossible to label such data automatically, so labeling requires extremely large amounts of human labor. Therefore, researchers and companies alike often rely on constructed datasets, where all requests (malicious or not) were triggered by the creator of the dataset. While this makes labeling easy, these datasets have been shown to be skewed and are also limited in the types of malicious behavior they represent, as most were created 15 to 20 years ago, such as the DARPA 1998 / 1999 datasets and the KDD 1999 dataset.

Anomaly detection, on the other hand, does not rely on labeled data as extensively and therefore finds broader real-life application in cybersecurity. For example, in network security, AI can be used to monitor a company’s IT network and detect abnormal behavior of users or devices within it. The algorithms first learn what normal behavior looks like and then flag abnormal, and thereby potentially malicious, behavior, enabling the security team to take appropriate action. For application security, this translates to monitoring incoming user requests and traffic to teach the algorithm what ‘normal’ looks like. Such an algorithm can then be applied to identify attacks by spotting deviations from normal traffic, often by observing outliers with a predefined distance metric. Neither of these applications is without complications, however. For example, it is difficult to ensure that collected training data actually represents normal behavior. Further, the model needs to allow for changes in behavior triggered by non-malicious events and deal with false positives appropriately. Additionally, extracting useful information from HTTP messages is difficult, and observing each individual piece of traffic is computationally expensive. Nevertheless, since anomaly detection offers the potential to generate insights without extensive data labeling, and since the aforementioned problems can be overcome, it is the primary focus of this guide.
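
As a rough illustration of outlier spotting with a predefined distance metric, the following Python sketch learns a profile from a handful of requests assumed to be normal and measures how far a new request lies from it. The two features (request length and share of non-alphanumeric characters), the sample traffic, and the threshold of 3.0 are assumptions chosen for the example, not tuned or recommended values.

```python
import math

def extract_features(request_line):
    """Turn a raw request line into numbers: length and share of non-alphanumeric characters."""
    special = sum(1 for ch in request_line if not ch.isalnum())
    return [len(request_line), special / max(len(request_line), 1)]

def fit_normal_profile(training_requests):
    """Learn per-feature mean and standard deviation from traffic assumed to be normal."""
    rows = [extract_features(r) for r in training_requests]
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    # Fall back to 1.0 if a feature never varies, to avoid dividing by zero later.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(columns, means)]
    return means, stds

def anomaly_distance(request_line, profile):
    """Distance metric: Euclidean distance measured in standard deviations."""
    means, stds = profile
    feats = extract_features(request_line)
    return math.sqrt(sum(((f - m) / s) ** 2 for f, m, s in zip(feats, means, stds)))

# Made-up sample of normal traffic.
normal_traffic = [
    "GET /search?q=shoes", "GET /search?q=red+hat", "GET /profile?id=42",
    "GET /login", "GET /search?q=laptop+bag", "GET /cart?item=7",
]
profile = fit_normal_profile(normal_traffic)
THRESHOLD = 3.0  # assumed cut-off; would need tuning against false positives

def is_outlier(request_line):
    return anomaly_distance(request_line, profile) > THRESHOLD
```

With this profile, a request such as `GET /search?q=boots` stays well below the threshold, while a long SQL injection string on the same endpoint lies far outside it. With so few training samples the profile is brittle, which mirrors the false-positive problem described above.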

Aside from data collection, data preparation is incredibly important for deriving value from said request data. The sheer amount of available data calls for smart decisions on which subset to consider. Dimensionality reduction, which can be divided into feature selection and feature extraction, is often needed to aid the learners. Feature selection focuses on eliminating variables that don’t add predictive value, which can, for example, be identified by a high correlation with another variable. Feature extraction, on the other hand, describes the combination of multiple variables into one while losing as little predictive value as possible.
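
Correlation-based feature selection can be sketched as follows. The feature names, the sample values, and the cut-off of 0.95 are hypothetical, and the sketch assumes that no feature column is constant, since the correlation would otherwise be undefined.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def select_features(columns, max_correlation=0.95):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < max_correlation for k in kept):
            kept.append(name)
    return kept

# Hypothetical per-request features; url_length and url_bytes carry the same information.
features = {
    "url_length":  [19, 21, 18, 10, 24, 61],
    "url_bytes":   [19, 21, 18, 10, 24, 61],
    "param_count": [1, 1, 1, 0, 1, 1],
    "digit_count": [0, 0, 2, 0, 0, 3],
}
selected = select_features(features)
```

In this example `url_bytes` is dropped because it duplicates `url_length`, while the other features are kept.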

Human security expertise is essential for building effective algorithms

Unfortunately, a good data collection and preparation process isn’t sufficient for building algorithms that reliably identify malicious behavior in software systems. One of the main challenges is actually the technical implementation. Web traffic has a lot of variables to consider and isn’t as black and white as a numerical data set. These difficulties are compounded by one of the key challenges associated with unsupervised learning in general, namely establishing appropriate distance functions. When you, for example, apply clustering algorithms and manage to cluster traffic successfully, it doesn’t end there: now you need to find a way to interpret the clusters and identify which clusters or outliers signify malicious behavior. Additionally, even when running the same algorithm on the same dataset, the resulting visualizations might look very different from run to run, making it very difficult to identify anomalies by comparing different snapshots in time.
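
To show why distance functions for web traffic require deliberate design, here is one possible sketch in Python. The request representation (a dict with `method`, `path` and `body_size`), the three terms, and their equal weighting are all assumptions made for illustration. A different weighting would produce different clusters, which is exactly where domain knowledge comes in.

```python
def request_distance(a, b):
    """Assumed distance between two requests, each given as a dict.

    Combines a categorical mismatch (HTTP method), a path dissimilarity
    (Jaccard distance over path segments) and a scaled numeric difference
    (body size). Each term lies between 0 and 1; the result is their average.
    """
    method_term = 0.0 if a["method"] == b["method"] else 1.0
    seg_a = set(a["path"].strip("/").split("/"))
    seg_b = set(b["path"].strip("/").split("/"))
    path_term = 1.0 - len(seg_a & seg_b) / len(seg_a | seg_b)
    largest = max(a["body_size"], b["body_size"], 1)
    size_term = abs(a["body_size"] - b["body_size"]) / largest
    return (method_term + path_term + size_term) / 3

# Hypothetical requests for illustration.
read_user = {"method": "GET", "path": "/api/users/42", "body_size": 0}
read_other = {"method": "GET", "path": "/api/users/43", "body_size": 0}
bulk_export = {"method": "POST", "path": "/api/admin/export", "body_size": 5000}
```

Under these assumptions the two user lookups end up close to each other, while the export request lies far from both.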

This is why human expertise is still very much a relevant component, both for creating the actual algorithms and for making sense of their findings. Prior knowledge of what you are looking for in the results can help to construct the right distance functions, which in turn may help to identify additional anomalies. The important balancing act is to inject enough domain knowledge without overly limiting the algorithm’s ability to identify previously unknown anomalies. As exemplified by clustering, only once you are able to understand what the different clusters are signaling do they start providing value to your security.

Future Outlook

As we have shown, Machine Learning provides several practical applications for cybersecurity, but there is still a long way to go. More importantly, it is highly unlikely that ML algorithms will eliminate the need for human intelligence in application security, at least in the short term. Instead, the relationship between human and artificial intelligence can best be described as symbiotic: in the first step, cybersecurity expertise is needed to prepare and classify training data, select appropriate algorithms and, in many cases, establish the right distance metric. Then, based on the results of the algorithm, human expertise is needed once again to make decisions based on the data classified or visualized by the machine.
