A Closer Look at Deep Learning Techniques for Software Security

Program as a Sequence of Tokens

Sequence-based representation of source code is one of the most common ways to use deep learning in software security. In this approach, the source code is transformed into a sequence of tokens or characters, which is then fed into the neural model. ChatGPT is an example of this approach, where the model is trained on a large corpus of text data and then fine-tuned for a specific task, such as bug detection or vulnerability classification.
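To make the idea of a token sequence concrete, here is a minimal sketch that flattens a small piece of Python source into tokens using the standard-library tokenize module. The helper name to_token_sequence and the choice to drop layout tokens are assumptions made for illustration; production models typically use learned subword vocabularies rather than a language tokenizer.

```python
import io
import tokenize

# Layout-only tokens that carry no lexical content of their own.
_SKIP = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.COMMENT, tokenize.ENDMARKER,
}

def to_token_sequence(source):
    """Return the list of token strings for a piece of Python source."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.type not in _SKIP]

if __name__ == "__main__":
    snippet = "def f(xs):\n    return list(reversed(xs))\n"
    # A sequence model would consume this flat list (after mapping to ids).
    print(to_token_sequence(snippet))
```

Note that nothing in this flat list records which tokens depend on which; any structure has to be inferred by the model from token order alone.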

Despite its impressive performance in natural language processing, ChatGPT is just one of many sequence-based ML models available for software security. Codex (the model underlying the GitHub Copilot coding assistant) and PaLM (Google’s large language model) are other examples of this type of system.

Token sequences are useful for representing programs because they bypass hand-engineered rules, graphs, or patterns, but this convenience may come at the cost of semantic understanding, robustness, and interpretability. This is especially true when only a small amount of data is available. As it stands, the scarcity of clean, high-quality, deduplicated data is a major barrier to training effective machine learning models for security.

Moreover, sequence-based models are usually not robust to unexpected input changes: their predictions can be swayed by changes that alter the surface of the input but leave its behavior unchanged. Although ChatGPT outperforms most of its competitors, it is unclear whether it is robust in software security tasks. To illustrate what we mean by a behavior-preserving change, consider the code snippets below:

Original Program

String[] f(final String[] array) {
    final String[] arr
        = new String[array.length];
    for (int i = 0; i < array.length; i++)
        arr[array.length - i - 1] = array[i];
    return arr;
}

Transformed Program

String[] f(final String[] nodes) {
    int merger = 1;
    int stop = nodes.length;
    final String[] sorter
         = new String[nodes.length];
    int visitor = 0;
    while (visitor < stop) {
        int checkout = nodes.length - visitor;
        sorter[checkout - merger] = nodes[visitor];
        visitor += 1;
    }
    return sorter;
}

The original and transformed programs are simple Java functions that reverse the order of the input array. ChatGPT easily predicts the name of the original program as reverseOrder. The prompt we used in all cases was:

Predict the name of the following function:

code

Next, consider the transformed code snippet. This program implements the same functionality, but the for-loop has been transformed into a while-loop, the names of the variables have been changed, and some intermediate variables have been added. This time, ChatGPT fails to suggest a correct name; it predicts sortDescending as the function name. (Note: both predictions were made by the Feb 13 version of ChatGPT.)
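The claim that the two programs behave identically can be checked mechanically. The sketch below ports both Java functions to Python (the names reverse_original and reverse_transformed are ours, chosen for illustration) and compares their outputs on a few inputs.

```python
def reverse_original(array):
    """Python port of the original program: a for-loop with direct index math."""
    arr = [None] * len(array)
    for i in range(len(array)):
        arr[len(array) - i - 1] = array[i]
    return arr

def reverse_transformed(nodes):
    """Python port of the transformed program: while-loop, renamed and extra variables."""
    merger = 1
    stop = len(nodes)
    sorter = [None] * len(nodes)
    visitor = 0
    while visitor < stop:
        checkout = len(nodes) - visitor
        sorter[checkout - merger] = nodes[visitor]
        visitor += 1
    return sorter

if __name__ == "__main__":
    samples = [[], ["a"], ["a", "b", "c"], ["x", "y", "x", "z"]]
    for sample in samples:
        # Both versions must agree with each other and with plain reversal.
        assert reverse_original(sample) == reverse_transformed(sample) == sample[::-1]
    print("both versions agree on", len(samples), "inputs")
```

Testing on a handful of inputs is not a proof of equivalence, but it is a cheap sanity check that a surface transformation has not changed behavior.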

Another challenge of sequence-based approaches, which might be even more problematic in industry, is the lack of explainability or interpretability. Ideally, a model should provide useful information in addition to its final prediction. However, sequence-based models are black boxes: it is difficult to understand how or why they arrive at their output, because they cannot give the user a human-understandable account of their reasoning. This lack of transparency can make it difficult to trust the predictions made by AI tools.

Finally, a challenge that is unique to models like ChatGPT or Codex is crafting prompts that get the model to produce the intended results. Additionally, because these models are non-deterministic to some degree, the same prompt may result in different predictions. Crafting effective prompts is so challenging that it has opened a new field of research called prompt engineering.

Program as a Collection of Semantic Facts

Semantic-based representation of source code is another approach to using deep learning in software security. Instead of focusing on tokens or characters, this kind of model is trained on the semantics of the source code. These semantics are often collected using traditional software analysis techniques and compiler tools. This approach makes use of graph-based or trace-based representations to capture structural, semantic, and functional dependencies.

In one example of a graph-based representation, the code is viewed as a directed or undirected graph where nodes represent code elements and edges represent their dependencies. These dependencies can be dataflow edges, control flow edges, call graph edges, or other complex relational dependencies between program elements. With graph-based representations, code interactions and their behaviors can be captured effectively.
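As a rough illustration of a graph-based representation, the sketch below builds a tiny program graph from Python source with the standard-library ast module. Syntax-tree nodes become graph nodes, parent-child links become "child" edges, and a deliberately naive rule adds "dataflow" edges from the most recent assignment of a variable to each later read. The function name build_graph and the dataflow rule are assumptions for illustration; real analyses use proper control-flow and dataflow algorithms.

```python
import ast

def build_graph(source):
    """Return (labels, edges) for a toy program graph of the given source."""
    tree = ast.parse(source)
    ids, labels, edges = {}, [], []

    # One graph node per AST node (context markers such as Load/Store are skipped).
    for node in ast.walk(tree):
        if isinstance(node, ast.expr_context):
            continue
        ids[node] = len(labels)
        label = type(node).__name__
        if isinstance(node, ast.Name):
            label += ":" + node.id
        labels.append(label)

    # Structural edges: parent -> child in the syntax tree.
    for node in list(ids):
        for child in ast.iter_child_nodes(node):
            if child in ids:
                edges.append((ids[node], ids[child], "child"))

    # Naive dataflow edges: last textual assignment of a name -> later reads.
    # This ignores branches, loops, and cases like `x = x + 1`.
    names = sorted((n for n in ids if isinstance(n, ast.Name)),
                   key=lambda n: (n.lineno, n.col_offset))
    last_store = {}
    for name in names:
        if isinstance(name.ctx, ast.Store):
            last_store[name.id] = name
        elif name.id in last_store:
            edges.append((ids[last_store[name.id]], ids[name], "dataflow"))
    return labels, edges

if __name__ == "__main__":
    labels, edges = build_graph("x = 1\ny = x + 2\n")
    for src, dst, kind in edges:
        print(f"{labels[src]} -[{kind}]-> {labels[dst]}")
```

Unlike a flat token list, this graph states explicitly that the x read on the second line depends on the x assigned on the first.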

CodeTrek is an example of a semantic-based approach that represents codebases as databases that are built according to rich relational schemas. Here is an overview of CodeTrek:

[Figure: overview of CodeTrek]

CodeTrek takes any relational program database as input, and generates a graph using the database schema. Each named tuple (row) in the database is translated into a node and each key-foreign-key relationship is translated into an edge in the program graph.
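The row-to-node and foreign-key-to-edge translation described above can be sketched in a few lines. The toy tables, column names, and the database_to_graph helper below are hypothetical and chosen only for illustration; they are not CodeTrek's actual schema or API.

```python
# Hypothetical toy database: table name -> list of rows, each with an "id" key.
database = {
    "function": [{"id": 1, "name": "f"}],
    "call": [
        {"id": 10, "caller": 1, "callee_name": "loads"},
        {"id": 11, "caller": 1, "callee_name": "decode"},
    ],
}

# Hypothetical foreign keys: (table, column) -> referenced table (matched on "id").
foreign_keys = {("call", "caller"): "function"}

def database_to_graph(database, foreign_keys):
    """Translate every row into a node and every foreign-key reference into an edge."""
    nodes = {}
    for table, rows in database.items():
        for row in rows:
            nodes[(table, row["id"])] = row

    edges = []
    for (table, column), target in foreign_keys.items():
        for row in database[table]:
            referenced = (target, row[column])
            if referenced in nodes:  # skip dangling references
                edges.append(((table, row["id"]), referenced, column))
    return nodes, edges

if __name__ == "__main__":
    nodes, edges = database_to_graph(database, foreign_keys)
    print(len(nodes), "nodes")
    for src, dst, label in edges:
        print(src, f"-[{label}]->", dst)
```

Because the graph is derived from the schema, adding a new table or relationship to the database extends the graph without changing this translation step.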

Next, CodeTrek samples some walks over the generated graph, and feeds them to a deep neural network. The deep neural network can be used for a variety of applications such as classifying buggy/non-buggy programs or predicting program properties.
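Walk sampling can be sketched as follows. This version picks the next node uniformly at random from the current node's neighbors, which is a simplifying assumption rather than CodeTrek's actual sampling strategy; the adjacency list and node names are likewise hypothetical.

```python
import random

def sample_walks(adjacency, start, num_walks, max_length, seed=0):
    """Sample random walks from `start`, each containing at most `max_length` nodes."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    walks = []
    for _ in range(num_walks):
        walk = [start]
        while len(walk) < max_length:
            neighbors = adjacency.get(walk[-1], [])
            if not neighbors:
                break  # dead end: no outgoing edges
            walk.append(rng.choice(neighbors))
        walks.append(walk)
    return walks

# Hypothetical program graph as an adjacency list.
adjacency = {
    "try": ["call:parse_http_response", "call:get_http_response"],
    "call:parse_http_response": ["call:decode"],
    "call:get_http_response": ["call:loads"],
    "call:decode": ["raise:UnicodeDecodeError"],
}

if __name__ == "__main__":
    for walk in sample_walks(adjacency, "try", num_walks=3, max_length=4):
        print(" -> ".join(walk))
```

Each walk is a short sequence of nodes, so it can be embedded and fed to a neural network much like a sentence, while still following real dependencies in the program.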

Its robustness to subtle changes distinguishes CodeTrek from sequence-based approaches. Let us see this feature in action. Take the code snippet below (Example: Easy Exception) as an example. Both ChatGPT and CodeTrek correctly predict that the most appropriate exception to catch at the highlighted location (????) is HTTPError. The prompt we used in ChatGPT is the following:

Predict the exception type to catch in “????”:

code

Example: Easy Exception

response = requests.get(url)

if response.status_code == 200:
    user_data = response.json()
else:
    try:
        response.raise_for_status()
    except ????:
        print("Request failed.")

An example of a heuristic that a sequence-based approach appears to learn is catching HTTPError whenever tokens like request or response appear in the context. This is usually a correct assumption. (Note: as mentioned earlier, sequence-based models are black boxes, so this is merely an observation-based guess.)

However, one can trick the sequence-based model into making incorrect predictions by preserving the semantics but adding names such as request or response to the context. To see an example, consider the following Python code snippets. Both pieces of code provide the same functionality: they read the content of a URL and load it as JSON. So, the exception type that should be caught is UnicodeDecodeError in both cases.

Both CodeTrek and ChatGPT correctly predict the exception type to catch in the code snippet below.

Original Program

from json import loads
from urllib.request import urlopen

url = "http/path/to/file"

with urlopen(url) as response:
    data = response.read()

try:
    content = loads(data.decode('utf-8'))
except ????:
    print("Failed.")

Transformed Program

from json import loads
from urllib.request import urlopen

url = "http/path/to/file"

with urlopen(url) as response:
    data = response.read()

parse_http_response = lambda x: x.decode('utf-8')
get_http_response = lambda x: loads(x)

try:
    content = get_http_response(parse_http_response(data))
except ????:
    print("Failed.")

To complicate the code on the surface, we introduce two lambdas, get_http_response() and parse_http_response(), and call them in place of the commonly used functions loads() and decode(). This indeed tricks ChatGPT into suggesting HTTPError as the exception to catch, even though UnicodeDecodeError is the one that must be caught.

CodeTrek, on the other hand, does not rely merely on memorizing tokens; it learns to follow a chain of function calls from the try block to the location in the program (or its libraries) where the exception is originally raised. As a result, CodeTrek correctly predicts UnicodeDecodeError for the transformed code.
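That the lambdas change only the surface of the program can be confirmed by running both versions on bytes that are not valid UTF-8. In the sketch below, the network call is replaced by a hard-coded byte string, and the wrapper names original, transformed, and raised_exception are ours, introduced only for this check.

```python
from json import loads

def original(data):
    """Decoding and parsing as written in the original program."""
    return loads(data.decode("utf-8"))

# The two lambdas from the transformed program.
parse_http_response = lambda x: x.decode("utf-8")
get_http_response = lambda x: loads(x)

def transformed(data):
    """Same logic routed through the misleadingly named lambdas."""
    return get_http_response(parse_http_response(data))

def raised_exception(fn, data):
    """Return the name of the exception type fn(data) raises, or None."""
    try:
        fn(data)
    except Exception as exc:
        return type(exc).__name__
    return None

if __name__ == "__main__":
    bad_bytes = b"\xff\xfe not utf-8"  # 0xff is never a valid UTF-8 start byte
    print(raised_exception(original, bad_bytes))
    print(raised_exception(transformed, bad_bytes))
```

Both calls fail inside decode(), before any JSON parsing happens, so the exception type is the same regardless of what the wrapper functions are called.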

In summary, a semantic-based approach can be more robust than a sequence-based approach, since it enables the model to capture the meaning of the code rather than just its syntax. The trade-off is that it requires more complex data processing.

Conclusion

Deep learning offers a powerful set of techniques for improving software security. While ChatGPT has received significant attention, it is one of many learning-based models that can be applied to software security. In addition to sequence-based models like ChatGPT, semantic-based models offer another approach to using deep learning in software security and have the potential to be even more powerful. Ultimately, the most effective approach will depend on the specific application and the nature of the data being analyzed. Incorporating semantic-based models into software security applications may be an effective way forward.