Retrieve, Refine, or Both? Using Task-Specific Guidelines for Secure Python Code Generation
Publication Type
Conference Paper
Date Issued
2025-09-07
Language
English
Author(s)
Citation
IEEE International Conference on Software Maintenance and Evolution, ICSME 2025
Contribution to Conference
IEEE International Conference on Software Maintenance and Evolution, ICSME 2025
Publisher DOI
Scopus ID
Publisher
IEEE
Abstract
Large Language Models (LLMs) are increasingly used for code generation, but they often produce code with security vulnerabilities. While techniques like fine-tuning and instruction tuning can improve security, they are computationally expensive and require large amounts of secure code data. Recent studies have explored prompting techniques to enhance code security without additional training. Among these, Recursive Criticism and Improvement (RCI) has demonstrated strong improvements by iteratively refining the generated code using the LLM's self-critiquing capabilities. However, RCI relies on the model's ability to identify security flaws, which is constrained by its training data and its susceptibility to hallucinations. This paper investigates the impact of incorporating task-specific secure coding guidelines, extracted from MITRE's CWE and CodeQL recommendations, into LLM prompts. For this, we employ Retrieval-Augmented Generation (RAG) to dynamically retrieve the relevant guidelines that help the LLM avoid generating insecure code. We compare RAG with RCI and observe that both deliver comparable code security, with RAG consuming considerably less time and fewer tokens. Additionally, combining both approaches further reduces the amount of insecure code generated while requiring only slightly more resources than RCI alone, highlighting the benefit of relevant guidelines for improving the security of LLM-generated code.
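The sketch below illustrates the two prompting strategies the abstract contrasts: retrieving task-specific secure-coding guidelines into the prompt (RAG) and iterative self-critique (RCI), plus their combination. It is a minimal, self-contained approximation, not the authors' implementation: the guideline snippets, the token-overlap retriever, and the llm_generate stub are illustrative assumptions standing in for a real guideline corpus, an embedding-based retriever, and an actual LLM API call.

```python
"""Sketch of guideline retrieval (RAG) and Recursive Criticism and
Improvement (RCI) for secure code generation. All names below are
placeholders, not the paper's actual corpus, retriever, or model."""

from collections import Counter

# Toy guideline store; the paper draws such entries from MITRE CWE and
# CodeQL recommendations.
GUIDELINES = {
    "CWE-89": "Use parameterized queries or prepared statements; never build "
              "SQL by concatenating user input.",
    "CWE-78": "Avoid shell=True and string-built commands; pass argument "
              "lists to subprocess and validate inputs.",
    "CWE-798": "Do not hard-code credentials; load secrets from the "
               "environment or a secrets manager.",
}


def llm_generate(prompt: str) -> str:
    """Stub standing in for a call to an actual LLM."""
    return f"# code generated for prompt:\n# {prompt[:60]}..."


def retrieve_guidelines(task: str, k: int = 2) -> list[str]:
    """Rank guidelines by naive token overlap with the task description.
    A real system would use embedding similarity instead."""
    task_tokens = Counter(task.lower().split())
    scored = sorted(
        GUIDELINES.items(),
        key=lambda kv: -sum(task_tokens[t] for t in kv[1].lower().split()),
    )
    return [f"{cwe}: {text}" for cwe, text in scored[:k]]


def generate_with_rag(task: str) -> str:
    """RAG: prepend the retrieved guidelines so the model avoids
    the insecure patterns they describe."""
    guidelines = "\n".join(retrieve_guidelines(task))
    prompt = (f"Follow these secure-coding guidelines:\n{guidelines}\n\n"
              f"Task: {task}\nWrite secure Python code.")
    return llm_generate(prompt)


def refine_with_rci(task: str, code: str, rounds: int = 2) -> str:
    """RCI: ask the model to critique its own output, then revise it."""
    for _ in range(rounds):
        critique = llm_generate(f"Review this code for security flaws:\n{code}")
        code = llm_generate(f"Task: {task}\nCritique:\n{critique}\n"
                            f"Rewrite the code fixing the issues:\n{code}")
    return code


if __name__ == "__main__":
    task = "Build a login endpoint that looks up a user in a SQL database."
    draft = generate_with_rag(task)       # RAG alone
    final = refine_with_rci(task, draft)  # combined RAG + RCI
    print(final)
```

In this sketch, RAG adds a single retrieval step before generation, whereas RCI issues extra critique-and-revise calls per round, which is consistent with the abstract's observation that RAG uses less time and fewer tokens while the combination costs only slightly more than RCI alone.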
DDC Class
005.8: Computer Security