Beyond prompting: the role of phrasing tasks in vulnerability prediction for Java
Citation Link: https://doi.org/10.15480/882.16305
Publication Type
Journal Article
Date Issued
2025-12-08
Language
English
Author(s)
TORE-DOI
10.15480/882.16305
Journal
Cybersecurity
Volume
8
Article Number
111
Citation
Cybersecurity 8: 111 (2025)
Publisher DOI
Publisher
Springer
Abstract
Predicting vulnerability in a code element, such as a function or method, often leverages machine or deep learning models to classify whether it is vulnerable or not. Recently, novel solutions exploiting conversational large language models (LLMs) have emerged, which allow the task to be formulated through a prompt containing natural language elements and the input code element, obtaining a natural language response. Despite promising initial results, there is currently no broad exploration of (i) how the input prompt influences the prediction capabilities and (ii) which characteristics of the model response relate to correct predictions. In this paper, we conduct an empirical investigation into how accurately two popular conversational LLMs, i.e., GPT-3.5 and Llama-2, predict whether a Java method is vulnerable, employing a thorough prompting strategy that (i) adheres to the Zero-Shot (ZS) and Zero-Shot Chain-of-Thought (ZS-CoT) techniques and (ii) formulates the prediction task in alternative ways via rephrasing. After a manual inspection of the generated responses, we observed that GPT-3.5 displayed more variable F1 scores than Llama-2, which was steadier but often gave no direct classification. ZS prompts achieved F1 scores between 0.53 and 0.69, with a tendency to classify methods as positive (i.e., ‘vulnerable’); conversely, ZS-CoT prompts produced a broader range of scores, from 0.35 to 0.72, often with inconsistent results. We then phrased the task in its “inverted form”, i.e., asking the LLM to check for the absence of vulnerabilities, which led to worse results for GPT-3.5, while Llama-2 occasionally performed better. The study further suggests that textual metrics provide important information on LLM outputs. Despite this, these metrics do not correlate with the actual outcomes, as the models respond with uniform confidence irrespective of whether the prediction is correct or not. This underscores the need for customized prompt engineering and response analysis strategies to improve the precision and reliability of LLM-based systems for vulnerability prediction. In addition, we applied our study to two state-of-the-art LLMs, validating the broader applicability of our methodology. Finally, we analyzed various textual properties of the model responses, such as response length and readability scores, to further explore the characteristics of the responses given for vulnerability detection tasks.
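For context, the sketch below (in Python) illustrates what the three prompt formulations described in the abstract can look like for a Java method: Zero-Shot, Zero-Shot Chain-of-Thought, and the “inverted” rephrasing that asks about the absence of vulnerabilities. The template wording, the example JAVA_METHOD, and the build_prompt helper are illustrative assumptions, not the exact prompts used in the paper.

```python
# Minimal sketch of the three prompt styles described in the abstract:
# Zero-Shot (ZS), Zero-Shot Chain-of-Thought (ZS-CoT), and the "inverted"
# rephrasing. All wording below is an illustrative assumption.

JAVA_METHOD = """\
public String readFile(String path) throws IOException {
    return new String(Files.readAllBytes(Paths.get(path)));
}"""

ZS_TEMPLATE = (
    "Is the following Java method vulnerable? Answer 'yes' or 'no'.\n\n{code}"
)

# ZS-CoT appends the usual "think step by step" trigger to elicit reasoning.
ZS_COT_TEMPLATE = (
    "Is the following Java method vulnerable? Answer 'yes' or 'no'.\n\n{code}\n\n"
    "Let's think step by step."
)

# Inverted form: the task asks about the ABSENCE of vulnerabilities,
# so a 'yes' answer now corresponds to the non-vulnerable class.
INVERTED_TEMPLATE = (
    "Is the following Java method free of vulnerabilities? "
    "Answer 'yes' or 'no'.\n\n{code}"
)

def build_prompt(template: str, code: str) -> str:
    """Fill a prompt template with the Java method under analysis."""
    return template.format(code=code)

if __name__ == "__main__":
    for name, template in [("ZS", ZS_TEMPLATE),
                           ("ZS-CoT", ZS_COT_TEMPLATE),
                           ("Inverted", INVERTED_TEMPLATE)]:
        print(f"--- {name} prompt ---")
        print(build_prompt(template, JAVA_METHOD))
        print()
```

When scoring such responses with a binary F1, the ‘yes’/‘no’ answers must be mapped back to the vulnerable class, and the mapping flips for the inverted formulation.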
Subjects
Vulnerability prediction
Large language models
Prompt engineering
Prompt rephrasing
Empirical study
DDC Class
005: Computer Programming, Programs, Data and Security
006: Special computer methods
Publication version
publishedVersion
Name
s42400-025-00476-0.pdf
Size
2.7 MB
Format
Adobe PDF