With the widespread and growing use of ChatGPT and other large language models (LLMs) in recent years, cybersecurity has become a major concern. Among the many open questions, cybersecurity professionals have wondered how effective these tools are at launching attacks. Cybersecurity researchers Richard Fang, Rohan Bindu, Akul Gupta, and Daniel Kang recently conducted a study to find out. The conclusion: they are very effective.
GPT-4 quickly exploits one-day vulnerabilities
During the study, the team used 15 one-day vulnerabilities that occurred in real life. A one-day vulnerability is one in the window between public disclosure of an issue and the release of a patch, meaning it is a known but still-exploitable vulnerability. The cases included vulnerable websites, container management software, and Python packages. Since all of the vulnerabilities came from the CVE database, each one included the official CVE description.
The LLM agents also had access to web browsing, a terminal, search results, file creation, and a code interpreter. Additionally, the researchers used a very detailed prompt with a total of 1,056 tokens and 91 lines of code, which also included debugging and logging instructions. The agents did not include subagents or a separate planning module, however.
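To make that setup concrete, the sketch below shows what such a tool-using agent loop might look like. This is an illustration only, not the researchers' code: `llm_complete` and `run_terminal` are hypothetical stand-ins, and the real agent wired in more tools (browsing, search, file creation, a code interpreter) than the single terminal tool shown here.

```python
# Illustrative sketch of a tool-using LLM agent loop (not the study's code).
# llm_complete() is a hypothetical stand-in for a real model API client.
import subprocess

def run_terminal(cmd: str) -> str:
    """Terminal tool: run a shell command and return its combined output."""
    result = subprocess.run(cmd, shell=True, capture_output=True,
                            text=True, timeout=60)
    return result.stdout + result.stderr

def llm_complete(messages: list[dict]) -> dict:
    """Hypothetical model call. Returns {'tool': name, 'input': arg} to
    request a tool, or {'content': text} to finish. Stubbed here."""
    return {"content": "attach a real model client to run this sketch"}

SYSTEM_PROMPT = (
    "You are a security-testing agent with tools: terminal, browse, "
    "search, write_file, run_code. Log and debug your steps."
)

def run_agent(cve_description: str, max_steps: int = 30) -> str:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Target vulnerability:\n{cve_description}"},
    ]
    for _ in range(max_steps):
        reply = llm_complete(messages)
        if "tool" in reply:
            # Feed each tool's output back so the model can plan its next step
            # (only the terminal tool is shown, for brevity).
            observation = run_terminal(reply["input"])
            messages.append({"role": "tool", "content": observation})
        else:
            return reply["content"]  # the agent's final report
    return "step budget exhausted"
```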
The team quickly found that GPT-4 was able to successfully exploit the one-day vulnerabilities 87% of the time. Every other method tested, including other LLMs such as GPT-3.5 and open-source vulnerability scanners, failed to exploit a single vulnerability. According to the report, GPT-4 failed on only two vulnerabilities, both of which proved especially difficult for the agent to handle:
“The Iris web application is extremely difficult for an LLM agent to navigate because navigation is done via JavaScript. As a result, the agent tries to access forms and buttons without first interacting with the elements that make them available, which causes it to fail. The detailed description of HertzBeat is in Chinese, which may confuse the GPT-4 agent we deploy because we use English for the prompt,” the report explains.
GPT-4’s success rate still hinges on the CVE description
The researchers attributed the high success rate to the tool’s ability to exploit complex vulnerabilities in multiple stages, launch different attack methods, write exploit code, and manipulate non-web vulnerabilities.
The study also revealed a significant limitation of GPT-4 for vulnerability scanning. When asked to exploit a vulnerability without the CVE description, the model could not perform at the same level: GPT-4 succeeded only 7% of the time, a drop of 80 percentage points. Given this significant gap, the researchers took a step back and isolated how often GPT-4 could determine the correct vulnerability on its own, which turned out to be 33.3% of the time.
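As a back-of-the-envelope check, 87% and 7% on a 15-CVE benchmark correspond to roughly 13 of 15 versus 1 of 15 targets exploited. A scoring harness for that kind of ablation could look like the hypothetical sketch below, where `attempt_exploit` stands in for one full agent run.

```python
# Hypothetical scoring harness for the with/without-description ablation
# (illustrative only; attempt_exploit() stands in for a full agent run).
def attempt_exploit(prompt: str) -> bool:
    """Stub: a real harness would launch the agent against a sandboxed
    target and check whether the exploit landed."""
    return False

def success_rate(cves: list[dict], include_description: bool) -> float:
    """Fraction of CVEs exploited when the prompt does or does not
    carry the official CVE description."""
    exploited = 0
    for cve in cves:
        prompt = cve["description"] if include_description else cve["id"]
        if attempt_exploit(prompt):
            exploited += 1
    return exploited / len(cves)
```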
“Surprisingly, we found that the average number of actions performed with and without a CVE description differed by only 14% (24.3 actions vs. 21.3 actions). We believe this is partly due to the length of the context window, further suggesting that a planning mechanism and subagents could increase performance,” the researchers wrote.
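The planner-and-subagents design the researchers allude to could, in rough outline, look like the sketch below: a planning step decomposes the task, and each subtask runs in a fresh agent with its own short context. Every name here is hypothetical, and the stubs are hardcoded so the sketch runs end to end.

```python
# Hypothetical planner/subagent pattern, following the researchers' remark
# that context-window limits may hurt a single long-running agent.
def plan_steps(goal: str) -> list[str]:
    """Planner: a real system would ask the model to decompose the goal;
    hardcoded here for illustration."""
    return [f"reconnaissance for: {goal}", "identify entry point",
            "attempt exploit"]

def run_subagent(subtask: str, notes: str) -> str:
    """One fresh, short-context agent per subtask; stubbed to show data flow."""
    return f"[result of '{subtask}' given {len(notes)} chars of prior notes]"

def exploit_with_planner(goal: str) -> str:
    notes = ""
    for subtask in plan_steps(goal):
        # Each subagent starts with a clean context, so long tool transcripts
        # from earlier steps do not crowd out the instructions.
        notes += run_subagent(subtask, notes) + "\n"
    return notes
```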
The future effect of LLMs on one-day vulnerabilities
The researchers concluded that their study demonstrates that LLMs can autonomously exploit one-day vulnerabilities, though only GPT-4 can currently achieve this. The concern, however, is that the capability and functionality of LLMs will only grow, making them an even more destructive and powerful tool for cybercriminals.
“Our findings demonstrate both the possibility of an emerging capability and the fact that it is harder to discover a vulnerability than it is to exploit it. Nevertheless, our findings underscore the need for the broader cybersecurity community and LLM vendors to think carefully about how to integrate LLM agents into defensive measures and how to deploy them at scale,” the report concludes.