Security of LLM-generated Code: A Comparative Analysis

Published: arXiv: 2605.23091v1
Authors

Srivathsan G Morkonda, Mahmoud Selim, Hala Assal

Abstract

The majority of software developers use or are planning to use Artificial Intelligence (AI) tools in their development processes. Their top reasons include improving productivity and faster learning. In fact, Large Language Model (LLM)-generated code is currently in production, including in major tech companies. However, concerns were raised about the risks associated with the use of AI tools to generate code. In this paper, we focus our attention on the risks to software security. We empirically evaluate the security of code generated by seven popular LLMs. We build upon previous work to mimic the behaviours of developers when using LLMs to generate code. Our results show that all seven LLMs that we have evaluated generate code that contains vulnerabilities, the majority of which are of critical or high severity.

Paper Summary

Problem
This paper addresses the security risk of using Large Language Model (LLM)-generated code in software development. As adoption of AI tools like ChatGPT grows, so do concerns about security vulnerabilities in the code these models generate. The paper investigates the security of LLM-generated code and identifies where it falls short.
Key Innovation
The key innovation of this paper is its comparative analysis of seven popular LLMs: Claude 3, Perplexity AI, OpenAI GPT-4o, Google Gemini, Phind-70B, Amazon CodeWhisperer, and IBM watsonx. The researchers used a standardized set of Python prompts to generate code and then evaluated the security of the generated code using two methods: CodeQL and GPT-4o. This study is the first to comparatively evaluate LLMs with respect to the security of code they generate.
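To make "insecure generated code" concrete, here is a minimal sketch of one well-known weakness class, SQL injection (CWE-89), next to a parameterized fix. It is an illustration only: the function names, table, and payload are hypothetical and are not taken from the paper's prompts or findings.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory database (hypothetical demo data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    return conn


def find_user_unsafe(conn, username):
    # CWE-89: user input is concatenated straight into the SQL string,
    # so a crafted username can change the meaning of the query.
    query = "SELECT id, name FROM users WHERE name = '" + username + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, username):
    # Parameterized query: the driver treats the input as data, not SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (username,)
    ).fetchall()


if __name__ == "__main__":
    conn = make_demo_db()
    payload = "x' OR '1'='1"
    print(find_user_unsafe(conn, payload))  # every row is returned
    print(find_user_safe(conn, payload))    # no user has this literal name
```

The two functions differ by a single line, which is part of why such weaknesses are easy to miss when reviewing generated code by eye.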
Practical Impact
This research highlights the security risks of using LLM-generated code. The study found that all 7 LLM tools generated vulnerable code, with at least 73% of generated code snippets containing one or more Common Weakness Enumeration (CWE) weaknesses. This finding matters for developers, organizations, and regulatory bodies, as it suggests that LLM-generated code may not be suitable for production use without proper security checks and validation.
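A figure such as "share of snippets with one or more CWEs" can be computed per tool from a list of findings. The sketch below assumes a simple record format of (tool, snippet id, list of CWE ids); that format, the function name `vulnerable_share`, and the sample data are all invented for illustration and are not the paper's data or results.

```python
from collections import defaultdict


def vulnerable_share(findings):
    """Return, per tool, the fraction of snippets with at least one CWE.

    `findings` is an iterable of (tool, snippet_id, cwe_ids) tuples,
    where `cwe_ids` is a (possibly empty) list of CWE identifiers.
    """
    totals = defaultdict(int)
    vulnerable = defaultdict(int)
    for tool, _snippet_id, cwe_ids in findings:
        totals[tool] += 1
        if cwe_ids:  # one or more weaknesses reported for this snippet
            vulnerable[tool] += 1
    return {tool: vulnerable[tool] / totals[tool] for tool in totals}


# Made-up placeholder data, NOT results from the study.
SAMPLE_FINDINGS = [
    ("tool_a", 1, ["CWE-89"]),
    ("tool_a", 2, []),
    ("tool_a", 3, ["CWE-78", "CWE-22"]),
    ("tool_a", 4, ["CWE-79"]),
    ("tool_b", 1, ["CWE-798"]),
    ("tool_b", 2, ["CWE-89"]),
]

if __name__ == "__main__":
    for tool, share in vulnerable_share(SAMPLE_FINDINGS).items():
        print(f"{tool}: {share:.0%} of snippets have at least one CWE")
```

A snippet with several CWEs still counts once, which matches the "one or more" wording of the statistic.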
Analogy / Intuitive Explanation
Think of LLM-generated code like a recipe book. Just as a recipe book can provide a framework for cooking, but may not account for all possible variations and ingredients, LLM-generated code provides a framework for writing software, but may not account for all possible security vulnerabilities. The researchers in this study are like quality control inspectors, testing the recipe book for potential errors and weaknesses. By identifying these vulnerabilities, they can help developers and organizations create safer and more secure software.
Paper Information
Categories: cs.SE cs.AI cs.CR
arXiv ID: 2605.23091v1
