Impact of identifier normalization on vulnerability detection techniques
Publication Type
Conference Paper
Date Issued
2025-04
Language
English
Author(s)
Start Page
69
End Page
76
Citation
IEEE International Conference on Software Analysis, Evolution and Reengineering - Companion, SANER-C 2025
Contribution to Conference
Publisher DOI
Publisher
IEEE
ISBN of container
979-8-3315-3749-4
This study examines the impact of identifier normalization on software vulnerability detection using three approaches: static application security testing (SAST), specialized machine learning (ML) models, and large language models (LLMs). Using the BigVul dataset of vulnerabilities in C/C++ projects, the research evaluates the performance of these methods on normalized inputs (generalized variable and function names) and on the original code. SAST tools such as Flawfinder and Cppcheck exhibit limited effectiveness (F1 scores ∼ 0.1) and are unaffected by normalization. Specialized ML models, such as LineVul, achieve high F1 scores on non-normalized data (F1 ∼ 0.9) but suffer significant performance drops when tested on normalized inputs, highlighting their lack of generalizability. In contrast, LLMs such as Llama3, although underperforming in their pre-trained state, show substantial improvement after fine-tuning, achieving robust and consistent results on both normalized and non-normalized datasets. The findings suggest that while SAST tools are less effective, fine-tuned LLMs hold strong potential for scalable and generalizable vulnerability detection. The study recommends further exploration of hybrid approaches that combine ML models, LLMs, and traditional tools to enhance accuracy and adaptability across diverse scenarios.
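
The exact normalization procedure is not specified in this record; the following minimal Python sketch illustrates one plausible token-level scheme for C code, in which user-defined names are mapped to generic VARn / FUNn placeholders. The keyword list, placeholder naming, and regex-based tokenization are illustrative assumptions, not the authors' pipeline; a production setup would use a real C parser.

import re

# Subset of C keywords to leave untouched (illustrative, not exhaustive).
C_KEYWORDS = {
    "if", "else", "for", "while", "return", "int", "char", "void",
    "unsigned", "long", "short", "float", "double", "struct", "sizeof",
    "static", "const", "break", "continue", "switch", "case", "default",
}

def normalize_identifiers(code: str) -> str:
    """Replace user-defined identifiers with generic VARn / FUNn tokens.

    An identifier directly followed by '(' is treated as a function
    name; any other identifier becomes a variable. Keywords pass
    through unchanged, and repeated names reuse the same placeholder.
    """
    var_map: dict[str, str] = {}
    fun_map: dict[str, str] = {}

    def repl(match: re.Match) -> str:
        name, paren = match.group(1), match.group(2)
        if name in C_KEYWORDS:
            return match.group(0)
        if paren:  # function call or definition
            return fun_map.setdefault(name, f"FUN{len(fun_map) + 1}") + paren
        return var_map.setdefault(name, f"VAR{len(var_map) + 1}")

    return re.sub(r"\b([A-Za-z_]\w*)\b(\s*\()?", repl, code)

print(normalize_identifiers("int add(int a, int b) { return a + b; }"))
# -> int FUN1(int VAR1, int VAR2) { return VAR1 + VAR2; }

A transformation of this kind preserves program structure while erasing naming cues, which is precisely what exposes models that rely on memorized identifier patterns rather than code semantics.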
Subjects
Vulnerability Detection
Data Set Normalization
LLM
Large Language Models
Machine Learning
Static Application Security Testing
DDC Class
005.8: Computer Security