wolfSSL 5.9.0 — LLM-Augmented Fuzzing Campaign
Ongoing security research into the wolfSSL 5.9.0 TLS library (268 C files, 1.29M lines of code) combining AFL++ coverage-guided fuzzing with LLM-assisted code review. Three custom harnesses target the historically most vulnerable attack surfaces: TLS handshake, X.509 certificate parsing, and ASN.1 decoding. After an initial short-run phase exposed coverage limitations, we rebuilt the entire fuzzing pipeline with CMPLOG instrumentation, a proper seed corpus drawn from wolfSSL’s own test suite (59 DER certificates + keys), and ASN.1/TLS dictionaries. An overnight campaign is now running.
Phase 1: short runs (2026-03-30)
The first phase produced 64M+ AFL++ executions, 478+ unique corpus items, and 1,360 LLM code review findings — but no confirmed vulnerabilities. Coverage was extremely low (0.26–1.05%) because the seed corpus consisted of manually crafted minimal inputs rather than real protocol data.
| Method | Executions | Coverage | Findings | Result |
|---|---|---|---|---|
| AFL++ LAF (cert parser) | 23.2M | 0.42% | 100 corpus items | 0 crashes |
| AFL++ LAF (ASN.1) | 20.8M | 0.70% | 162 corpus items | 0 ASAN crashes |
| AFL++ LAF (TLS handshake) | 8.6M | 1.05% | 30 corpus items | 0 ASAN crashes |
| Ollama code review | — | — | 1,360 findings → 210 CRITICAL | High FP rate |
| Claude FP filter | — | — | 210 → 5 candidates | All 5 FP |
Phase 2: overnight campaign with CMPLOG (2026-03-31)
After analyzing Phase 1 results, we rebuilt the fuzzing pipeline from scratch. The key improvements:
- CMPLOG instrumentation — a separate binary that logs all comparisons at runtime, letting AFL++ auto-solve multi-byte checks like TLS version bytes, ASN.1 tag matching, and OID comparisons that previously blocked the fuzzer
- Real seed corpus — 59 DER certificates and keys from wolfSSL’s own test suite (RSA, ECC, PKCS#8, malformed test certs) plus the evolved corpus from Phase 1
- Protocol dictionaries — hand-built ASN.1/X.509 dictionary (OIDs, tags, length encodings, extension identifiers) and TLS dictionary (record types, versions, cipher suites, extensions, alert codes)
- Three binary variants — LAF (main fuzzer, splits comparisons), CMPLOG (comparison logging for auto-dictionary), ASAN (crash verification post-run)
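The pieces in the list above map onto a single afl-fuzz invocation. The sketch below composes that command line in Python without running it. The flags -i, -o, -x, -c and -V are standard AFL++ options; the directory and binary names (seeds/asn1, fuzz_asn1_laf, fuzz_asn1_cmplog) are hypothetical placeholders, and the trailing @@ assumes the harness reads its input from a file path rather than stdin.

```python
import shlex


def build_afl_command(seed_dir, out_dir, dictionary, laf_binary, cmplog_binary,
                      duration_s=4 * 60 * 60):
    """Compose an afl-fuzz argv list: LAF binary as main target, CMPLOG binary via -c."""
    return [
        "afl-fuzz",
        "-i", seed_dir,         # seed corpus directory
        "-o", out_dir,          # output directory (queue, crashes, stats)
        "-x", dictionary,       # protocol dictionary
        "-c", cmplog_binary,    # CMPLOG-instrumented companion binary
        "-V", str(duration_s),  # stop after this many seconds (4 hours here)
        "--", laf_binary, "@@",  # assumption: harness takes the input file path
    ]


# Hypothetical paths for the ASN.1 target; adjust to the real build layout.
cmd = build_afl_command(
    seed_dir="seeds/asn1",
    out_dir="out/asn1",
    dictionary="asn1_x509.dict",
    laf_binary="./fuzz_asn1_laf",
    cmplog_binary="./fuzz_asn1_cmplog",
)
print(shlex.join(cmd))
```

This assumes the LAF binary was compiled with AFL_LLVM_LAF_ALL=1 and the CMPLOG binary with AFL_LLVM_CMPLOG=1. The ASAN build does not appear in the command because it is only used to replay the corpus afterwards.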
A 10-second smoke test confirmed the improvement: 3.16% coverage and 558 corpus items in 10 seconds vs 0.70% and 162 items in 10 minutes from Phase 1 — a 4.5× coverage increase. The campaigns (4 hours per target, 12 hours total) are running on a dedicated Linux fuzzing box with automatic ASAN crash verification at the end of each campaign. In parallel, we’re running an improved LLM code review (v2) covering all 13 key source files (202K lines) with function-level chunking and better prompts to reduce the false positive rate.
| Target | Seeds / files | Dictionary / lines | Duration | Result |
|---|---|---|---|---|
| ASN.1 parser (4h + 8h extended) | 254 | asn1_x509.dict | 12 hours | 794M execs, 3.84% cvg, 1,034 corpus, 0 crashes |
| X.509 cert parser | 192 | asn1_x509.dict | 4 hours | 282M execs, 3.54% cvg, 893 corpus, 0 crashes |
| TLS handshake | 33 | tls.dict | 4 hours | 133M execs, 1.06% cvg, 34 corpus, 0 crashes |
| Code review v2 (qwen2.5:14b) | 13 files | 202K lines | ~3 hours | 913 findings → 1 TP after Claude filter |
| Code review Round 2 (deepseek) | 13 files | 202K lines | 7.9 hours | 2,297 findings → 2 “TP” → both FP |
| Code review Round 3 (dolphin) | 13 files | 202K lines | 12.7 hours | 2,761 findings → 3 “TP” → all FP |
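The automatic ASAN verification step can be approximated by replaying every corpus file through the ASAN build and collecting the inputs that produce a sanitizer report. This is a minimal sketch, not the script used in the campaign: replay_corpus and its parameters are hypothetical names, and it assumes the harness takes the input file path as its last argument.

```python
import subprocess
from pathlib import Path


def replay_corpus(binary_argv, corpus_dir, timeout_s=5):
    """Run each corpus file through the binary and return the suspicious ones.

    An input is reported when the process is killed by a signal, when it
    times out, or when stderr contains an AddressSanitizer report.
    """
    failures = []
    for path in sorted(Path(corpus_dir).iterdir()):
        if not path.is_file():
            continue
        try:
            proc = subprocess.run(
                [*binary_argv, str(path)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            failures.append((path.name, "timeout"))
            continue
        if proc.returncode < 0:
            # Negative return code: terminated by a signal (e.g. SIGSEGV).
            failures.append((path.name, f"signal {-proc.returncode}"))
        elif b"AddressSanitizer" in proc.stderr:
            failures.append((path.name, "asan report"))
    return failures
```

A call such as replay_corpus(["./fuzz_asn1_asan"], "out/asn1/default/queue") would return an empty list for a clean campaign; both paths are placeholders.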
Code review v2 improvements
Phase 1 code review only scanned 2 files with fixed-size 50-line windows, leading to high false positive rates from missing caller context. v2 addresses this:
- Function-level chunking — extracts complete C functions (up to 14K chars) instead of arbitrary line windows, so the LLM sees full validation logic
- 13 files / 202K lines — expanded from 2 files to cover tls13.c, dtls13.c, pkcs7.c, pkcs12.c, ecc.c, rsa.c, ssl.c, ocsp.c, sp_int.c, and more
- wolfSSL-aware prompts — explicitly tell the LLM about WC_SAFE_SUM_WORD32(), free-then-null patterns, and caller-level validation to reduce FPs
- Dual strategy per file — each file gets both a generic memory/integer scan and a domain-specific scan (ASN.1, TLS state machine, or crypto)
- Deduplication — groups findings by (function, CWE, type) before Claude filtering
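The deduplication step can be sketched as a simple group-by on the (function, CWE, type) key. The field names and the sample findings below are hypothetical; the actual output format of the review pipeline is not shown in this write-up.

```python
from collections import defaultdict

# Lower rank means more severe; unknown labels sort last.
SEVERITY_RANK = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}


def deduplicate(findings):
    """Collapse findings that share (function, cwe, type) into one representative."""
    groups = defaultdict(list)
    for finding in findings:
        key = (finding["function"], finding["cwe"], finding["type"])
        groups[key].append(finding)

    representatives = []
    for group in groups.values():
        # Keep the most severe report and record how many it stands for.
        best = min(group, key=lambda f: SEVERITY_RANK.get(f["severity"], 99))
        representatives.append({**best, "duplicates": len(group)})
    return representatives


# Illustrative input only, not real pipeline output.
sample = [
    {"function": "parse_header", "cwe": "CWE-190", "type": "integer-overflow", "severity": "HIGH"},
    {"function": "parse_header", "cwe": "CWE-190", "type": "integer-overflow", "severity": "CRITICAL"},
    {"function": "copy_name", "cwe": "CWE-120", "type": "buffer-overflow", "severity": "MEDIUM"},
]
unique = deduplicate(sample)
print(len(unique), "unique findings from", len(sample))
```

Grouping before the Claude pass means the expensive filter sees each distinct claim once, which matters when a single function is reported by both the generic and the domain-specific scan.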
Why no bugs? wolfSSL's defensive coding
All 5 Phase 1 candidates that survived the Claude filter turned out to be false positives, because wolfSSL employs defensive practices that LLMs struggle to trace across function boundaries:
- WC_SAFE_SUM_WORD32() macro — validates integer arithmetic before every allocation
- Consistent free-then-null pattern — prevents use-after-free
- Length validation before all buffer operations — caller-level bounds checks
- Callback return value bounds checking — e.g., PSK key length capped at MAX_PSK_KEY_LEN
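The idea behind a checked-sum guard such as WC_SAFE_SUM_WORD32() can be modeled in a few lines. The sketch below is a Python model of an overflow-checked unsigned 32-bit addition, not the macro's actual definition; safe_sum_word32 and alloc_size are hypothetical names used only for illustration.

```python
WORD32_MAX = 0xFFFFFFFF


def safe_sum_word32(a, b):
    """Return (ok, total) for an unsigned 32-bit addition.

    ok is False when a + b would not fit in 32 bits; the comparison is
    arranged so that the check itself can never overflow.
    """
    if not (0 <= a <= WORD32_MAX and 0 <= b <= WORD32_MAX):
        raise ValueError("operands must be unsigned 32-bit values")
    if b > WORD32_MAX - a:
        return False, WORD32_MAX
    return True, a + b


def alloc_size(header_len, payload_len):
    """Compute a buffer size, or None if the sum would wrap."""
    ok, total = safe_sum_word32(header_len, payload_len)
    return total if ok else None


print(alloc_size(16, 1024))
print(alloc_size(WORD32_MAX, 1))
```

An LLM that sees only the allocation site and not a guard of this kind tends to report the allocation as an integer overflow, which is the false-positive pattern described above.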
Phase 2 results: 1.2 billion executions, 0 crashes, 1 logic bug
All fuzzing campaigns complete. Every target ran with CMPLOG + dictionary + real seed corpus. ASN.1 got an extended 8-hour run on top of the initial 4 hours.
| Campaign | Runtime | Executions | Coverage | Corpus | Crashes |
|---|---|---|---|---|---|
| ASN.1 (CMPLOG, 4h) | 4h | 316M | 3.82% | 975 | 0 |
| ASN.1 extended (8h) | 8h | 458M | 3.84% | 1,034 | 0 |
| X.509 cert (CMPLOG) | 4h | 248M | 3.54% | 893 | 0 |
| TLS handshake (CMPLOG) | 4h | 133M | 1.06% | 34 | 0 |
| Total (Phase 2) | 20h | 1.155B | — | 2,936 | 0 |
The TLS handshake harness plateaued at 1.06% coverage with only 34 corpus items — the state machine is extremely hard to penetrate even with CMPLOG. The fuzzer can’t get past the initial handshake parsing without producing a valid cryptographic response. ASN.1 and cert parsers reached 3.5–3.8% with ~1,000 corpus items each — respectable for an 8-hour campaign but far from exhaustive.
Code review: 3 models, 6 Claude TPs, 1 real bug
We ran the same 13-file code review pipeline through three different LLM models, then filtered the top 100 candidates from each through Claude Sonnet 4.6. Every “true positive” from Claude was then manually verified against the actual source code.
| Finding | Model | Claude verdict | Manual verdict | Details |
|---|---|---|---|---|
| EncodedDottedForm | qwen2.5:14b | TP (medium) | Real but not exploitable | Off-by-one in OID encoding. Debug-only code behind #ifdef, single caller with outSz=16. |
| wc_oid_sum | deepseek-coder-v2 | TP (high) | False positive | Max sum is 8,160 (255×32) — fits word32. XOR path can’t overflow. |
| GetLength_ex | deepseek-coder-v2 | TP (medium) | False positive | 5 validation checks prevent overflow. Theoretical wrap unreachable (Check 3 caps at INT_MAX). |
| StreamOctetString | dolphin-mistral | TP (medium, CVSS 7.5) | False positive | Bounds check includes input offset i making it stricter, not weaker. No wrap with TLS-sized inputs. |
| EncodeObjectId (overflow) | dolphin-mistral | TP (medium) | False positive | Max multiplication = 2.6M (fits word32). len overflow needs 300M+ elements; callers pass <20. |
| EncodeObjectId (signed/unsigned) | dolphin-mistral | TP (medium) | False positive | Syntactic signed/unsigned mismatch but check works correctly in all reachable scenarios. |
Result: 1 real logic bug out of 5,971 candidates across 3 models. 300 candidates were evaluated by Claude, 6 were flagged as true positives, and only 1 was confirmed after manual verification. The real bug was found by the precision model (qwen), not the noisy ones. dolphin-mistral produced the most findings (2,761) but zero real bugs, including one that Claude scored at CVSS 7.5 as a buffer overflow, which was actually a bounds check that was stricter than necessary.
Claude’s FP filter reduces noise (5,971 → 6) but is not a substitute for reading the code. It reasons about patterns (“OID + multiplication = overflow”) rather than computing actual value ranges. Manual verification caught all 5 false positives that Claude missed.
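Computing the value range is often a one-line calculation. The sketch below redoes the wc_oid_sum check from the table, reading "255×32" as at most 32 terms of at most 255 each (that reading is an assumption); the helper names are hypothetical.

```python
WORD32_MAX = 2**32 - 1


def max_accumulated(max_term, count):
    """Largest value a sum of `count` terms can reach if each is <= max_term."""
    return max_term * count


def min_terms_to_overflow(max_term, limit=WORD32_MAX):
    """Smallest number of maximal terms whose sum exceeds `limit`."""
    return limit // max_term + 1


worst_case = max_accumulated(255, 32)
print("worst-case sum:", worst_case)
print("fits in word32:", worst_case <= WORD32_MAX)
print("terms needed to wrap:", min_terms_to_overflow(255))
```

The worst case sits several orders of magnitude below the 32-bit limit, so the "sum of bytes may overflow" pattern does not apply here. A pattern-matching filter cannot see that without doing the arithmetic.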
Multi-model code review: does a different LLM find different bugs?
One model scanning code is a single opinion. The same code chunk that qwen2.5:14b dismisses might trigger a finding in deepseek-coder-v2 or dolphin-mistral. We’re running the same 13-file pipeline through multiple models to compare blind spots:
| Round | Model | Type | Findings | After Claude filter | Status |
|---|---|---|---|---|---|
| 1 | qwen2.5:14b | Local (Linux + Ollama GPU) | 913 (595 CRIT) | 1 TP / 99 FP | Complete |
| 2 | deepseek-coder-v2:lite | Local (Linux + Ollama GPU) | 2,297 (305 CRIT) | 2 “TP” / 98 FP → both FP after manual review | Complete |
| 3 | dolphin-mistral | Local (Linux + Ollama GPU) | 2,761 (1,769 CRIT) | 3 “TP” / 97 FP → all FP after manual review | Complete |
| 4 | Claude Sonnet 4.6 | Anthropic API | — | — | Skipped — 3 rounds sufficient |
| 5 | qwen2.5:3b | Local (Linux + Ollama GPU) | — | — | Skipped — too small for C code |
What we learned: More noise does not mean more bugs. dolphin-mistral produced 3× the findings of qwen (2,761 vs 913) but zero real bugs. deepseek produced 2.5× more but also zero. The one real bug was found by the precision model (qwen2.5:14b), which had the fewest findings but the best signal-to-noise ratio. Each model flagged different functions — but in a mature codebase, “different” just means “different false positives.”
Honest assessment
Let’s be real: we would be very surprised if a home lab approach — one researcher with a Ryzen 5 fuzzing box, an RTX 3060 for LLM inference, and a week of effort — produced a meaningful security finding in wolfSSL. This is a library that has been:
- Continuously fuzzed by Google’s OSS-Fuzz since 2016 (billions of executions)
- Audited by professional security firms multiple times
- FIPS 140-2/140-3 certified (military-grade validation)
- Deployed in automotive, aerospace, and government systems
- Maintained by a team that clearly understands defensive C coding (WC_SAFE_SUM_WORD32, free-then-null, caller-level validation everywhere)
Our 1.22 billion AFL++ executions at 3.5–3.8% coverage are a rounding error compared to what OSS-Fuzz runs continuously. The one bug we found through code review is a logic error in debug-only code that no fuzzer would ever reach.
That’s actually the point. The value of this research isn’t in finding wolfSSL vulnerabilities — it’s in proving out the methodology. We built and validated an LLM-augmented fuzzing pipeline that:
- Scans 202K lines of C with 3 models producing 5,971 candidates in ~24h total GPU time
- Reduces 5,971 candidates → 300 Claude-evaluated → 6 “TPs” → 1 confirmed real after manual verification
- Achieves 4.5× better fuzzing coverage with CMPLOG + dictionaries + real seed corpus
- Runs entirely on commodity hardware ($300 GPU + $400 NUC)
wolfSSL is a much harder target. Apply this to a less-audited embedded TLS library or an IoT firmware stack, and the results would be very different.
Lessons for LLM-assisted fuzzing
Practical hints for anyone trying to use AI for vulnerability research in compiled C code:
- LLMs can’t trace cross-function validation. Ollama flagged XMALLOC(untrusted_size) patterns without seeing the bounds check 3 functions up the call stack. Always verify caller context manually.
- CMPLOG is a game changer for protocol fuzzing. Adding a CMPLOG binary (-c flag) gave AFL++ visibility into every memcmp/strcmp at runtime. Result: 4.5× coverage improvement (0.70% → 3.16%) in a 10-second smoke test.
- Two-stage fuzzing works. Use AFL_LLVM_LAF_ALL=1 for path discovery, then replay the corpus against ASAN builds for crash detection. Also: run your fuzzing directory on a ramdisk (tmpfs). Kudos to Albert for both of these hints.
- Claude’s FP filter is useful but not reliable. It correctly identified 1 real bug (EncodedDottedForm) but also flagged 5 false positives as medium- or high-confidence true positives (wc_oid_sum and GetLength_ex among them). It reasons about patterns (“OID + integer = overflow”) rather than computing actual value ranges. Always verify manually.
- Seed quality matters more than execution count. Switching from hand-crafted minimal seeds to wolfSSL’s own 59 DER test certificates was the single biggest improvement.
- Pick your targets wisely. LLM-assisted security research works best on targets that have not been fuzzed extensively.
- The 50-line code review window is too small. v2 with function-level chunking and wolfSSL-aware prompts cut noise dramatically. Feed entire functions, tell the LLM about the target’s defensive patterns.
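Function-level chunking can be approximated with a brace-depth scan. The sketch below is a deliberately naive illustration, not the extractor used for v2: extract_functions is a hypothetical name, the 14,000-character cap comes from the figure quoted earlier, and braces inside strings, comments or preprocessor blocks are not handled.

```python
MAX_CHUNK_CHARS = 14_000

SAMPLE = """\
#include <stdio.h>
static int counter = 0;
struct pair { int a; int b; };

int add(int a, int b)
{
    if (a > 0) { return a + b; }
    return b;
}

void noop(void) { }
"""


def extract_functions(source, max_chars=MAX_CHUNK_CHARS):
    """Return top-level brace blocks whose header contains a parameter list."""
    chunks = []
    depth = 0
    start = 0          # where the current top-level block's header begins
    header_start = 0   # position just after the previous top-level ';' or '}'
    for i, ch in enumerate(source):
        if ch == "{":
            if depth == 0:
                start = header_start
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                text = source[start:i + 1].strip()
                header = text.split("{", 1)[0]
                if "(" in header:  # skips struct/enum/array initializers
                    chunks.append(text[:max_chars])
                header_start = i + 1
        elif ch == ";" and depth == 0:
            header_start = i + 1
    return chunks


chunks = extract_functions(SAMPLE)
for chunk in chunks:
    print(chunk.splitlines()[0])
```

A production extractor would more likely rely on a real C parser or on ctags output, since wolfSSL makes heavy use of #ifdef blocks that a brace counter can misread.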
Cross-model analysis: what 1,757 functions tell us
We ran the same 13-file pipeline through 3 different LLMs. Together they flagged 1,757 unique functions. Here’s how they overlap:
| Category | Functions | % of total |
|---|---|---|
| Flagged by all 3 models | 149 | 8.5% |
| Flagged by exactly 2 models | 365 | 20.8% |
| qwen only | 321 | 18.3% |
| deepseek only | 460 | 26.2% |
| dolphin only | 462 | 26.3% |
About 71% of the flagged functions (1,243 of 1,757) are model-unique: each model sees something different. 36 exact (function, CWE) pairs were agreed upon by all 3, including GetASN_Items, DecodeCertInternal, and DecodeGeneralName. These 36 represent the highest-confidence candidates for manual review. All were evaluated by Claude, and none was confirmed as exploitable.
Each model also has unique CWE categories: qwen flagged 83 CWEs the others missed, dolphin had 95 unique CWEs, deepseek had 48. The diversity is real — but on wolfSSL, it’s diversity of false positives.
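The overlap categories in the table can be derived with plain set operations. The sketch below uses toy data with hypothetical function names, since the per-model result files are not part of this write-up; only the final check uses numbers from the table.

```python
from collections import Counter


def overlap_breakdown(flagged):
    """Count functions by how many models flagged them.

    `flagged` maps a model name to the set of function names it reported.
    """
    counts = Counter()
    all_functions = set().union(*flagged.values())
    for function in all_functions:
        hits = sorted(m for m, fns in flagged.items() if function in fns)
        if len(hits) == len(flagged):
            counts["all models"] += 1
        elif len(hits) == 1:
            counts[f"{hits[0]} only"] += 1
        else:
            counts[f"exactly {len(hits)} models"] += 1
    return counts


# Toy input with made-up function names.
toy = {
    "qwen": {"parse_a", "parse_b", "parse_c"},
    "deepseek": {"parse_a", "parse_b", "parse_d"},
    "dolphin": {"parse_a", "parse_d", "parse_e"},
}
breakdown = overlap_breakdown(toy)
print(dict(breakdown))

# Sanity check on the table above: the five rows add up to the stated total.
table_rows = [149, 365, 321, 460, 462]
print(sum(table_rows))
```

Running the same breakdown on (function, CWE) pairs instead of bare function names would give the stricter agreement measure mentioned above.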
Final numbers
| Metric | Value |
|---|---|
| AFL++ total executions | 1.22 billion |
| AFL++ crashes | 0 |
| Code review candidates (3 models) | 5,971 |
| Claude-evaluated candidates | 300 |
| Claude “true positives” | 6 |
| Confirmed real after manual verification | 1 |
| Exploitable vulnerabilities | 0 |
| Total researcher time | ~3 days |
| Hardware cost | $300 GPU + $400 NUC |
Conclusion
wolfSSL is one of the most hardened open-source C libraries in existence. 1.22 billion fuzzer executions produced zero crashes. Three LLM models scanning 202K lines of code produced nearly 6,000 candidates — and after Claude filtering and manual verification, exactly one real bug: a logic error in debug-only code that cannot be reached in production.
Let’s be honest about what we didn’t do. 1.22 billion executions sounds impressive, but with 3.8% coverage on the ASN.1 parser and 1.06% on the TLS handshake, it’s hard to say we even started real fuzzing. The handshake harness never got past the initial ClientHello parsing — it couldn’t produce a cryptographically valid response. The ASN.1 parser plateaued after a few hours and never broke through to deeper code paths. A proper fuzzing campaign against wolfSSL would need weeks of continuous execution, custom harnesses for each TLS extension, grammar-based seed generation for valid handshake sequences, and probably a network-aware fuzzer that can complete a full TLS exchange. We didn’t do any of that — and Google’s OSS-Fuzz has been doing exactly that since 2016.
The main goal was never to break wolfSSL. It was to set up and validate a repeatable process for LLM-augmented security assessment:
- Build instrumented binaries (LAF, CMPLOG, ASAN) from a single Makefile
- Write harnesses that feed untrusted input to the right entry points
- Seed from the project’s own test data, not hand-crafted bytes
- Run overnight campaigns with automatic ASAN verification
- Scan source code with multiple LLMs in parallel, each catching different patterns
- Filter noise with Claude, then verify every “true positive” by reading the actual code
- Document everything in backlogs so the next assessment starts faster
That process now exists, is documented, and is proven to work. wolfSSL was the calibration target. The real value is applying this to codebases that haven’t had the benefit of a decade of OSS-Fuzz and professional security audits.