The uGen system uses retrieval-augmented generation and a multi-agent design to close the knowledge gaps that prevent unaugmented language models from producing functional microarchitectural attack code, enabling scalable vulnerability assessment across diverse hardware in under four minutes per run.

Key Takeaways

  • uGen, the first LLM-driven framework for automated microarchitectural attack code generation, achieves up to 100% success on Spectre-v1 (Claude Sonnet-4) and 80% on Prime+Probe cache side-channel attacks (Qwen3-Coder).
  • Unaugmented large language models, including GPT, Claude, and Qwen3, consistently misgenerate or misplace the critical hardware-level operations that make microarchitectural exploits functional, regardless of model capability.
  • A complete, deployable proof-of-concept costs $1.25 and takes under four minutes, making per-target, per-configuration vulnerability assessment economically viable at scale for the first time.
  • Microarchitectural attacks remain an expanding threat surface: VMScape (2025) extended speculative execution exploitation into virtualization boundaries, and Prime+Probe has been adapted against AMD SEV-SNP-protected virtual machines.
  • A parallel cluster of multi-agent LLM penetration testing frameworks, including CurriculumPT, PentestMCP, and AWS Security Agent, indicates that automated exploit synthesis is becoming a structural feature of both offensive and defensive security workflows.

The central finding of the uGen paper, published in May 2026, is technically specific and strategically consequential: large language models left to their own devices cannot reliably generate functional microarchitectural attack code. They consistently misplace the precise hardware-level operations that make cache side-channels and speculative execution exploits work. The uGen framework, built on retrieval-augmented generation and a multi-agent architecture, corrects this by injecting missing domain knowledge at inference time, producing functional proof-of-concept code for Spectre-v1 and Prime+Probe attacks across diverse microarchitectures for $1.25 per run. For security teams, this converts labor-intensive, environment-sensitive vulnerability assessment into a scalable, repeatable process.

Unaugmented LLMs Fail at Microarchitectural Exploit Synthesis Because They Misplace Critical Attack Primitives

The problem uGen solves is not general code quality but placement precision. According to the uGen paper, state-of-the-art models including GPT, Claude, and Qwen3 frequently misgenerate or misplace critical attack primitives: the specific sequences of memory access, cache manipulation, and branch prediction control that distinguish a functional exploit from structurally similar code that produces no actual leakage.

Microarchitectural attacks depend on sequences of hardware-level operations that do not map cleanly to the higher-level programming abstractions dominant in LLM training corpora. Writing a Spectre-v1 gadget that leaks data requires precise placement of memory accesses relative to speculative execution windows. Writing a Prime+Probe attack that resolves cache set conflicts requires understanding how a specific CPU microarchitecture maps virtual addresses to cache sets. LLMs trained on general codebases lack dense coverage of these constraints.
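
To make the address-to-set constraint concrete, the following minimal sketch computes which cache set an address falls into under simple modular indexing. The geometry (64-byte lines, 64 sets) and the function names are illustrative assumptions, not values taken from the uGen paper. Real last-level caches often index on physical addresses and apply slice hashing, which this sketch does not model.

```python
LINE_SIZE = 64  # bytes per cache line (assumed)
NUM_SETS = 64   # number of sets (assumed; e.g. a 32 KiB, 8-way cache)


def cache_set_index(address: int, line_size: int = LINE_SIZE,
                    num_sets: int = NUM_SETS) -> int:
    """Return the set index of an address under simple modular indexing."""
    return (address // line_size) % num_sets


def congruent_addresses(base: int, count: int, line_size: int = LINE_SIZE,
                        num_sets: int = NUM_SETS) -> list[int]:
    """Return `count` addresses that map to the same set as `base`."""
    stride = line_size * num_sets  # distance between same-set addresses
    return [base + i * stride for i in range(count)]


if __name__ == "__main__":
    base = 0x1040
    group = congruent_addresses(base, 8)
    print(cache_set_index(base))                # 1
    print({cache_set_index(a) for a in group})  # {1}
```

Change either constant and the set of conflicting addresses changes with it, which is one reason code written for one processor generation may not carry over to another.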

uGen's design treats this as a retrieval problem rather than a generation problem. A RAG pipeline identifies which attack primitives are missing or incorrectly placed in a draft output and retrieves relevant domain-specific documentation. A multi-agent architecture then synthesizes corrected, functionally complete proof-of-concept code tailored to specified defender requirements and target architectures.
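
The gap-detection idea can be sketched at a very high level. The example below is not uGen's implementation: the step labels, the `KNOWLEDGE_BASE` dictionary, and the `find_gaps` and `notes_for` helpers are all hypothetical. It works on abstract step labels rather than code, and only shows how a draft might be checked against a reference ordering before documentation is looked up for each gap.

```python
from dataclasses import dataclass

# Hypothetical reference ordering of abstract step labels for one attack class.
REQUIRED_ORDER = ["prime", "wait", "probe", "measure"]

# Hypothetical knowledge base: step label -> short documentation note.
KNOWLEDGE_BASE = {
    "prime": "Fill the monitored cache set before the victim runs.",
    "wait": "Let the victim execute between priming and probing.",
    "probe": "Re-access the same set after the victim has run.",
    "measure": "Time each probe access to detect evictions.",
}


@dataclass(frozen=True)
class Gap:
    step: str
    kind: str  # "missing" or "misplaced"


def find_gaps(draft_steps: list[str]) -> list[Gap]:
    """Compare a draft's step labels against the reference ordering."""
    gaps = [Gap(s, "missing") for s in REQUIRED_ORDER if s not in draft_steps]
    # Known steps in draft order (duplicates dropped), then in reference order.
    present = list(dict.fromkeys(s for s in draft_steps if s in REQUIRED_ORDER))
    expected = [s for s in REQUIRED_ORDER if s in present]
    gaps += [Gap(got, "misplaced")
             for got, want in zip(present, expected) if got != want]
    return gaps


def notes_for(gaps: list[Gap]) -> list[str]:
    """Look up one documentation note per detected gap."""
    return [f"{g.step} ({g.kind}): {KNOWLEDGE_BASE[g.step]}" for g in gaps]


if __name__ == "__main__":
    for note in notes_for(find_gaps(["prime", "probe", "wait"])):
        print(note)
```

In a multi-agent setting, notes like these would be handed to a revising agent together with the draft. How uGen actually splits that work across agents is not described in this article.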

uGen Achieves 100% Spectre-v1 Success and 80% Prime+Probe Success Across Diverse Hardware

Evaluation results for uGen, drawn from testing across cache-based and speculative execution attacks, diverse microarchitectures, and three LLM platforms, provide the clearest evidence of the framework's operational capability. According to the uGen paper, the framework reached a success rate of up to 100% for Spectre-v1 attacks using Claude Sonnet-4, and 80% for Prime+Probe attacks using Qwen3-Coder.

Success rates varied meaningfully by model and attack type, establishing that the choice of underlying LLM is a material variable in output quality. Claude Sonnet-4 led on speculative execution attacks; Qwen3-Coder led on cache-based attacks. The cross-model benchmark covering GPT, Claude, and Qwen3 represents the most systematic published assessment of LLM capability on microarchitectural attack synthesis.

The cost and speed metrics are operationally significant beyond the headline figures. At $1.25 per run in under four minutes, uGen makes per-target, per-configuration testing economically viable at a scale that was previously impossible. Existing microarchitectural vulnerability assessment requires deep hardware expertise and extended manual effort, and it produces code that frequently fails to port across processor generations. The framework's portability across diverse architectures directly addresses what the paper identifies as the primary barrier to systematic assessment.
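
The arithmetic behind the per-target, per-configuration claim is simple to sketch. The $1.25 and four-minute figures come from the article. The fleet size in the usage example, and the assumption of exactly one run per target and configuration with no retries, are illustrative only.

```python
COST_PER_RUN_USD = 1.25    # per-run cost reported in the article
MAX_MINUTES_PER_RUN = 4.0  # upper bound ("under four minutes")


def campaign_budget(targets: int, configs_per_target: int) -> tuple[int, float, float]:
    """Return (runs, total cost in USD, upper bound on sequential hours).

    Assumes one run per (target, configuration) pair and no retries.
    """
    runs = targets * configs_per_target
    cost = runs * COST_PER_RUN_USD
    hours = runs * MAX_MINUTES_PER_RUN / 60.0
    return runs, cost, hours


if __name__ == "__main__":
    runs, cost, hours = campaign_budget(targets=20, configs_per_target=5)
    print(f"{runs} runs, ${cost:.2f}, at most {hours:.1f} h sequential")
    # 100 runs, $125.00, at most 6.7 h sequential
```

Because reported success rates fall below 100% for some model and attack combinations, a real budget would need headroom for repeated runs.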

Microarchitectural Attacks Have Continued to Expand Since Spectre and Meltdown, and uGen Arrives into a Growing Threat Surface

The attack environment is not static. Since Spectre and Meltdown were disclosed in January 2018, researchers have documented a continuous expansion of exploitable microarchitectural surfaces. VMScape, disclosed in 2025, extended speculative execution exploitation into virtualization boundaries by targeting incomplete isolation in branch prediction units, allowing a malicious guest VM to manipulate host processes and exfiltrate sensitive data including cryptographic keys, even on systems running current Spectre mitigations.

Cache-based attack techniques have expanded similarly. Recent research demonstrated Prime+Probe attacks against virtual machines protected by AMD SEV-SNP, a hardware-based confidential computing feature designed explicitly to isolate sensitive workloads. The technique has also been extended to Apple M1 and Samsung Exynos architectures, establishing that cache side-channels are not an Intel-specific problem.

Intel's 2018 hardware revisions addressed Meltdown and Spectre-v2 but not Spectre-v1. Software mitigations including Google's Retpoline technique and compiler-level speculation barriers have been deployed broadly, but impose performance costs and require ongoing enforcement across compiler toolchains and operating system updates. According to the Spectre vulnerability record, complete elimination of speculative execution would prevent these attacks but would cause significant processor performance degradation. This creates a permanent partial-mitigation condition that uGen is designed to probe systematically.

A Parallel Ecosystem of LLM Penetration Testing Frameworks Signals Structural Change in Automated Security Research

uGen is not an isolated development. A cluster of multi-agent LLM penetration testing frameworks has emerged concurrently. CurriculumPT, published in 2025, combines curriculum learning with a multi-agent system to enable progressive acquisition of exploitation skills across increasingly complex targets. PentestMCP, using GPT-4.1, achieved average success rates of 87.3% for information gathering, 62.3% for vulnerability discovery, and 56.6% for exploitation across web application targets. AWS Security Agent is described as using a multi-agent architecture for automated penetration testing in cloud environments.

The common pattern across these frameworks is that raw LLM capability is insufficient for reliable exploit generation; each requires RAG, structured tool use, specialized agent roles, or domain-specific knowledge injection to achieve operational effectiveness. uGen's contribution is extending this pattern to the hardware layer, where knowledge specificity requirements are higher and attack surface constraints are tighter.

LLMs without augmentation already present a measurable security challenge in code generation contexts. According to research on LLM-generated code security, between 12% and 65% of code snippets generated by major language models contain vulnerabilities classified under the Common Weakness Enumeration, depending on task, language, and model. GitHub Copilot introduced security vulnerabilities in 32.8% of Python code and 24.5% of JavaScript code in published evaluations. uGen's results represent a different axis of the same problem: LLMs generating code that is precisely designed to exploit hardware-level weaknesses, with high functional correctness when properly augmented.

The dual-use implication is direct and the paper does not obscure it. The same automation that enables scalable defender assessment also reduces the expertise threshold for offensive use. A Spectre-v1 proof-of-concept that previously required deep microarchitectural knowledge to construct now takes under four minutes and costs less than a cup of coffee. Security teams that have not yet developed systematic microarchitectural testing programs are operating with a gap that adversaries can now probe at far lower cost.

Background: What Spectre-v1, Prime+Probe, and Retrieval-Augmented Generation Are

Spectre-v1 (CVE-2017-5753) is a class of vulnerability present in virtually every modern processor that performs speculative execution. When a processor speculatively executes instructions along a predicted branch path that turns out to be incorrect, the speculative computation is discarded from architectural state but leaves traces in microarchitectural state, most commonly the CPU cache. An attacker constructs a code gadget that triggers speculative access to memory outside its intended bounds, then uses a cache timing channel to recover the accessed value. The original Spectre paper demonstrated exploitation against Intel, AMD, and ARM processors.
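
The split between architectural and microarchitectural state can be illustrated with a toy model. The sketch below is a pure simulation and touches no real hardware. The `ToyCPU` class, its one-bit "last outcome" predictor, and the set standing in for the cache are all simplifying assumptions chosen for readability, not a description of any actual processor.

```python
class ToyCPU:
    """Toy model: architectural results roll back, cache state does not."""

    def __init__(self, memory: list[int], array_len: int):
        self.memory = memory            # flat simulated memory
        self.array_len = array_len      # logical bound of the victim array
        self.cached_lines = set()       # simulated probe lines now in cache
        self.predict_in_bounds = False  # one-bit predictor (assumed)

    def flush(self) -> None:
        """Empty the simulated cache."""
        self.cached_lines.clear()

    def victim(self, index: int):
        """Bounds-checked read; may run transiently on a misprediction."""
        in_bounds = index < self.array_len
        result = None
        if in_bounds or self.predict_in_bounds:
            value = self.memory[index]
            self.cached_lines.add(value)  # dependent access leaves a trace
            if in_bounds:
                result = value            # only in-bounds reads are committed
        self.predict_in_bounds = in_bounds  # predictor learns the real outcome
        return result


if __name__ == "__main__":
    cpu = ToyCPU(memory=[1, 2, 3, 4, 42], array_len=4)  # 42 lies out of bounds
    cpu.victim(0)            # in-bounds call trains the predictor
    cpu.flush()
    print(cpu.victim(4))     # None: nothing is returned architecturally
    print(cpu.cached_lines)  # {42}: the trace survives the rollback
```

In this model the out-of-bounds call returns nothing, yet the simulated cache still records which line the transient access touched. On real hardware that trace is recovered through timing, not by reading a set.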

Prime+Probe is a cache side-channel technique that does not require memory sharing between attacker and victim. The attacker first "primes" a set of cache lines with its own data, waits while a victim executes, then "probes" those cache lines to detect which were evicted by the victim's memory accesses. Timing differences between cached and uncached accesses reveal the victim's access pattern. The PAPP paper, published by ACM, documented how CPU memory prefetchers interfere with Prime+Probe and how this affects attack reliability across microarchitectures.
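
The prime, wait, probe sequence can be walked through on a simulated cache set. The sketch below assumes a single set with strict LRU replacement and replaces timing with an explicit hit-or-miss result. The `LRUSet` class, the line tags, and the reverse-order probe are illustrative choices, and prefetcher effects of the kind the PAPP paper studies are not modeled.

```python
from collections import OrderedDict


class LRUSet:
    """One simulated cache set with strict LRU replacement (assumed policy)."""

    def __init__(self, ways: int):
        self.ways = ways
        self.lines = OrderedDict()  # ordered from least to most recently used

    def access(self, tag: str) -> bool:
        """Access a line; return True on a hit, False on a miss."""
        if tag in self.lines:
            self.lines.move_to_end(tag)
            return True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[tag] = None
        return False


def prime_probe(cache_set: LRUSet, victim_accesses: list[str]) -> int:
    """Return how many attacker lines miss after the victim has run."""
    attacker = [f"A{i}" for i in range(cache_set.ways)]
    for tag in attacker:               # prime: fill the set
        cache_set.access(tag)
    for tag in victim_accesses:        # wait: the victim touches its own lines
        cache_set.access(tag)
    # Probe in reverse order so one miss does not evict lines not yet probed.
    return sum(not cache_set.access(tag) for tag in reversed(attacker))


if __name__ == "__main__":
    print(prime_probe(LRUSet(4), []))            # 0
    print(prime_probe(LRUSet(4), ["V1"]))        # 1
    print(prime_probe(LRUSet(4), ["V1", "V2"]))  # 2
```

Under these assumptions the number of probe misses equals the number of distinct victim lines that landed in the set. Real replacement policies are rarely strict LRU, which is part of why the technique is environment-sensitive.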

Retrieval-Augmented Generation (RAG) is an LLM architecture pattern that supplements model weights with external documents retrieved at inference time. The retrieved content is injected into the model's context before generation, enabling the model to produce outputs that depend on knowledge not encoded in its weights. Originally proposed for factual question answering, RAG has been applied to code generation, where retrieved content typically includes API documentation, code examples, and domain-specific specifications. In uGen's case, the retrieved content is microarchitecture-specific attack primitive documentation that fills the knowledge gaps the paper identifies in all three evaluated models.
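
The retrieve-then-inject pattern can be shown in a few lines. The sketch below ranks documents by bag-of-words cosine similarity and prepends the best match to the task text. The documents, the query, and the helper names are hypothetical. Production systems typically use dense embeddings and a vector index, and this article does not say which retrieval method uGen uses.

```python
import math
from collections import Counter

# Hypothetical document store of short hardware notes.
DOCS = [
    "cache line size is 64 bytes on the target core",
    "the branch predictor is shared between sibling threads",
    "timer resolution limits how small a latency difference can be measured",
]


def tokenize(text: str) -> Counter:
    """Lowercase whitespace tokenization into term counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: cosine(q, tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, documents: list[str], k: int = 1) -> str:
    """Inject the retrieved documents into the context ahead of the task."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return f"Context:\n{context}\n\nTask: {query}"


if __name__ == "__main__":
    print(build_prompt("what is the cache line size", DOCS))
```

The model's weights never change here. Only the prompt does, which is what allows a knowledge base to be swapped per target architecture.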

References

  1. uGen: LLM-Driven Microarchitectural Attack PoC Framework (arXiv 2026)
  2. Spectre Attacks: Exploiting Speculative Execution (Kocher et al.)
  3. Spectre (security vulnerability) - Wikipedia
  4. Last-Level Cache Side-Channel Attacks Are Practical (IEEE Xplore)
  5. PAPP: Prefetcher-Aware Prime and Probe Side-Channel Attack (ACM)
  6. cachepc: Prime+Probe on AMD SEV-SNP Virtual Machines (GitHub)
  7. LLM-Generated Code Security Research (Emergent Mind)
  8. CurriculumPT: LLM Multi-Agent Penetration Testing (MDPI, 2025)
  9. PentestMCP: LLM and MCP Based Multi-Agent Framework (Sciety, 2025)
  10. AWS Security Agent: Multi-Agent Architecture for Automated Penetration Testing
  11. Hidden Risks of LLM-Generated Web Application Code (arXiv)