Complete 2026-04-02

wolfSSL 5.9.0 — LLM-Augmented Fuzzing Campaign

Security research into the wolfSSL 5.9.0 TLS library (268 C files, 1.29M lines of code), combining AFL++ coverage-guided fuzzing with LLM-assisted code review. Three custom harnesses target the historically most vulnerable attack surfaces: TLS handshake, X.509 certificate parsing, and ASN.1 decoding. After an initial short-run phase exposed coverage limitations, we rebuilt the entire fuzzing pipeline with CMPLOG instrumentation, a proper seed corpus from wolfSSL’s own test suite (59 DER certificates + keys), and ASN.1/TLS dictionaries, then ran an overnight campaign.

Fuzzing AFL++ CMPLOG ASAN TLS ASN.1 LLM-assisted

Phase 1: short runs (2026-03-30)

The first phase produced 64M+ AFL++ executions, 478+ unique corpus items, and 1,360 LLM code review findings — but no confirmed vulnerabilities. Coverage was extremely low (0.26–1.05%) because the seed corpus consisted of manually crafted minimal inputs rather than real protocol data.

| Method | Executions | Coverage | Findings | Result |
| --- | --- | --- | --- | --- |
| AFL++ LAF (cert parser) | 23.2M | 0.42% | 100 corpus items | 0 crashes |
| AFL++ LAF (ASN.1) | 20.8M | 0.70% | 162 corpus items | 0 ASAN |
| AFL++ LAF (TLS handshake) | 8.6M | 1.05% | 30 corpus items | 0 ASAN |
| Ollama code review | | | 1,360 findings → 210 CRITICAL | High FP rate |
| Claude FP filter | | | 210 → 5 candidates | All 5 FP |

Phase 2: overnight campaign with CMPLOG (2026-03-31)

After analyzing Phase 1 results, we rebuilt the fuzzing pipeline from scratch. The key improvements:

  • CMPLOG instrumentation — a separate binary that logs all comparisons at runtime, letting AFL++ auto-solve multi-byte checks like TLS version bytes, ASN.1 tag matching, and OID comparisons that previously blocked the fuzzer
  • Real seed corpus — 59 DER certificates and keys from wolfSSL’s own test suite (RSA, ECC, PKCS#8, malformed test certs) plus the evolved corpus from Phase 1
  • Protocol dictionaries — hand-built ASN.1/X.509 dictionary (OIDs, tags, length encodings, extension identifiers) and TLS dictionary (record types, versions, cipher suites, extensions, alert codes)
  • Three binary variants — LAF (main fuzzer, splits comparisons), CMPLOG (comparison logging for auto-dictionary), ASAN (crash verification post-run)
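
To make the dictionary idea concrete, here is a minimal sketch of how such token files could be generated. It assumes the standard AFL++ dictionary syntax (`name="value"` with `\xNN` escapes). The token names and the selection of tokens are illustrative assumptions, not the contents of the campaign's actual asn1_x509.dict or tls.dict.

```python
# Sketch: emit AFL++ dictionary lines of the form name="\xNN\xNN...".
# Token names and selection are illustrative, not the campaign's real files.

def dict_entry(name: str, token: bytes) -> str:
    # Escape every byte as \xNN so the line is unambiguous for any content.
    escaped = "".join(f"\\x{b:02x}" for b in token)
    return f'{name}="{escaped}"'

ASN1_TOKENS = {
    "asn1_sequence": b"\x30",        # SEQUENCE tag
    "asn1_integer": b"\x02",         # INTEGER tag
    "asn1_len_2byte": b"\x82",       # long-form length, two length bytes follow
    "oid_rsa_encryption": bytes.fromhex("2a864886f70d010101"),  # 1.2.840.113549.1.1.1
}

TLS_TOKENS = {
    "tls_record_handshake": b"\x16", # record content type 22
    "tls_alert_record": b"\x15",     # record content type 21
    "tls_version_1_2": b"\x03\x03",  # protocol version bytes for TLS 1.2
}

def render_dictionary(tokens: dict[str, bytes]) -> str:
    # One entry per line, sorted by name so the output is reproducible.
    return "\n".join(dict_entry(n, t) for n, t in sorted(tokens.items())) + "\n"

if __name__ == "__main__":
    print(render_dictionary(ASN1_TOKENS), end="")
    print(render_dictionary(TLS_TOKENS), end="")
```

The rendered text can be saved to a file and passed to afl-fuzz with its `-x` option.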

A 10-second smoke test confirmed the improvement: 3.16% coverage and 558 corpus items in 10 seconds vs 0.70% and 162 items in 10 minutes from Phase 1, a 4.5× coverage increase. The campaigns (4 hours per target, 12 hours total) ran on a dedicated Linux fuzzing box with automatic ASAN crash verification at the end of each campaign. In parallel, we ran an improved LLM code review (v2) covering all 13 key source files (202K lines) with function-level chunking and better prompts to reduce the false positive rate.

| Target | Seeds / files | Dictionary / size | Duration | Status |
| --- | --- | --- | --- | --- |
| ASN.1 parser (4h + 8h extended) | 254 | asn1_x509.dict | 12 hours | 794M execs, 3.84% coverage, 1,034 corpus, 0 crashes |
| X.509 cert parser | 192 | asn1_x509.dict | 4 hours | 282M execs, 3.54% coverage, 893 corpus, 0 crashes |
| TLS handshake | 33 | tls.dict | 4 hours | 133M execs, 1.06% coverage, 34 corpus, 0 crashes |
| Code review v2 (qwen2.5:14b) | 13 files | 202K lines | ~3 hours | 913 findings → 1 TP after Claude filter |
| Code review Round 2 (deepseek) | 13 files | 202K lines | 7.9 hours | 2,297 findings → 2 “TP” → both FP |
| Code review Round 3 (dolphin) | 13 files | 202K lines | 12.7 hours | 2,761 findings → 3 “TP” → all FP |

Code review v2 improvements

Phase 1 code review only scanned 2 files with fixed-size 50-line windows, leading to high false positive rates from missing caller context. v2 addresses this:

  • Function-level chunking — extracts complete C functions (up to 14K chars) instead of arbitrary line windows, so the LLM sees full validation logic
  • 13 files / 202K lines — expanded from 2 files to cover tls13.c, dtls13.c, pkcs7.c, pkcs12.c, ecc.c, rsa.c, ssl.c, ocsp.c, sp_int.c, and more
  • wolfSSL-aware prompts — explicitly tell the LLM about WC_SAFE_SUM_WORD32(), free-then-null patterns, and caller-level validation to reduce FPs
  • Dual strategy per file — each file gets both a generic memory/integer scan and a domain-specific scan (ASN.1, TLS state machine, or crypto)
  • Deduplication — groups findings by (function, CWE, type) before Claude filtering
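
Function-level chunking can be approximated with simple brace matching. The sketch below is not the pipeline's actual extractor. It assumes a regex heuristic for function headers, ignores comments, string literals and preprocessor branches, and truncates oversized functions at the 14K-character limit (the truncation policy is an assumption). `extract_functions` and its constants are hypothetical names.

```python
import re

MAX_CHARS = 14_000  # chunk size limit quoted in the text

# Heuristic header: identifier, flat parameter list, opening brace.
# Nested parentheses (function-pointer parameters) are not handled.
FUNC_HEAD = re.compile(r"\b([A-Za-z_]\w*)\s*\([^;{}()]*\)\s*\{")
CONTROL_KEYWORDS = {"if", "for", "while", "switch"}

def extract_functions(source, max_chars=MAX_CHARS):
    """Return (name, text) pairs for functions found by brace matching."""
    chunks = []
    pos = 0
    while True:
        m = FUNC_HEAD.search(source, pos)
        if m is None:
            break
        if m.group(1) in CONTROL_KEYWORDS:
            pos = m.end()  # a stray control statement, not a function
            continue
        # Start at the beginning of the line so the return type is included.
        start = source.rfind("\n", 0, m.start()) + 1
        depth = 0
        end = None
        for i in range(m.end() - 1, len(source)):  # m.end() - 1 is the "{"
            ch = source[i]
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    end = i + 1
                    break
        if end is None:
            break  # unbalanced braces: stop rather than guess
        chunks.append((m.group(1), source[start:end][:max_chars]))
        pos = end  # continue after this function body
    return chunks
```

Each returned chunk would then be sent to the model as one unit, so validation logic inside the function stays together with the code it protects.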

Why no bugs? wolfSSL's defensive coding

All 5 candidates flagged by Claude turned out to be false positives because wolfSSL employs excellent defensive practices that LLMs struggle to trace across function boundaries:

  • WC_SAFE_SUM_WORD32() macro — validates integer arithmetic before every allocation
  • Consistent free-then-null pattern — prevents use-after-free
  • Length validation before all buffer operations — caller-level bounds checks
  • Callback return value bounds checking — e.g., PSK key length capped at MAX_PSK_KEY_LEN

Phase 2 results: 1.2 billion executions, 0 crashes, 1 logic bug

All fuzzing campaigns complete. Every target ran with CMPLOG + dictionary + real seed corpus. ASN.1 got an extended 8-hour run on top of the initial 4 hours.

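| Campaign | Runtime | Executions | Coverage | Corpus | Crashes |
| --- | --- | --- | --- | --- | --- |
| ASN.1 (CMPLOG, 4h) | 4h | 316M | 3.82% | 975 | 0 |
| ASN.1 extended (8h) | 8h | 458M | 3.84% | 1,034 | 0 |
| X.509 cert (CMPLOG) | 4h | 248M | 3.54% | 893 | 0 |
| TLS handshake (CMPLOG) | 4h | 133M | 1.06% | 34 | 0 |
| Total (Phase 2) | 20h | 1.155B | | 2,936 | 0 |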

The TLS handshake harness plateaued at 1.06% coverage with only 34 corpus items — the state machine is extremely hard to penetrate even with CMPLOG. The fuzzer can’t get past the initial handshake parsing without producing a valid cryptographic response. ASN.1 and cert parsers reached 3.5–3.8% with ~1,000 corpus items each — respectable for an 8-hour campaign but far from exhaustive.

Code review: 3 models, 6 Claude TPs, 1 real bug

We ran the same 13-file code review pipeline through three different LLM models, then filtered the top 100 candidates from each through Claude Sonnet 4.6. Every “true positive” from Claude was then manually verified against the actual source code.
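
The selection step before the Claude filter can be sketched as deduplication by (function, CWE, type) followed by a top-100 cut. How candidates were actually ranked is not stated, so the `severity` score used for ordering below is an assumption, and `Finding`, `dedupe` and `top_candidates` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    function: str
    cwe: str
    kind: str        # e.g. "integer-overflow"; free-form label
    severity: int    # hypothetical 0-10 score assigned by the scanning model

def dedupe(findings):
    """Keep the highest-severity finding per (function, CWE, type) key."""
    best = {}
    for f in findings:
        key = (f.function, f.cwe, f.kind)
        if key not in best or f.severity > best[key].severity:
            best[key] = f
    return list(best.values())

def top_candidates(findings, limit=100):
    """Deduplicate, then return the `limit` highest-severity candidates."""
    ranked = sorted(dedupe(findings), key=lambda f: f.severity, reverse=True)
    return ranked[:limit]
```

Only the candidates returned by `top_candidates` would be forwarded to the more expensive filter model.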

| Finding | Model | Claude verdict | Manual verdict | Details |
| --- | --- | --- | --- | --- |
| EncodedDottedForm | qwen2.5:14b | TP (medium) | Real but not exploitable | Off-by-one in OID encoding. Debug-only code behind #ifdef, single caller with outSz=16. |
| wc_oid_sum | deepseek-coder-v2 | TP (high) | False positive | Max sum is 8,160 (255×32), which fits word32. XOR path can’t overflow. |
| GetLength_ex | deepseek-coder-v2 | TP (medium) | False positive | 5 validation checks prevent overflow. Theoretical wrap unreachable (Check 3 caps at INT_MAX). |
| StreamOctetString | dolphin-mistral | TP (medium, CVSS 7.5) | False positive | Bounds check includes input offset i, making it stricter, not weaker. No wrap with TLS-sized inputs. |
| EncodeObjectId (overflow) | dolphin-mistral | TP (medium) | False positive | Max multiplication = 2.6M (fits word32). len overflow needs 300M+ elements; callers pass <20. |
| EncodeObjectId (signed/unsigned) | dolphin-mistral | TP (medium) | False positive | Syntactic signed/unsigned mismatch, but the check works correctly in all reachable scenarios. |

Result: 1 real logic bug out of 5,971 candidates across 3 models. Claude evaluated 300 candidates and flagged 6 as true positives; only 1 was confirmed after manual verification. The real bug was found by the precision model (qwen), not the noisy ones. dolphin-mistral produced the most findings (2,761) but zero real bugs, including one that Claude scored at CVSS 7.5 as a buffer overflow, which was actually a bounds check that was stricter than necessary.

Claude’s FP filter reduces noise (5,971 candidates → 6 flagged) but is not a substitute for reading the code. It reasons about patterns (“OID + multiplication = overflow”) rather than computing actual value ranges. Manual verification caught all 5 false positives that Claude missed.
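
Computing a value range is often just a few lines of arithmetic. The sketch below redoes the worst-case bounds quoted in the findings table for wc_oid_sum and EncodeObjectId. `fits_word32` is a hypothetical helper, and the sketch makes no claim about what each factor represents in the wolfSSL source.

```python
WORD32_MAX = 2**32 - 1  # largest value an unsigned 32-bit integer can hold

def fits_word32(worst_case: int) -> bool:
    """True if a worst-case value cannot wrap an unsigned 32-bit integer."""
    return 0 <= worst_case <= WORD32_MAX

# Worst-case figures as quoted in the findings table.
oid_sum_worst = 255 * 32        # wc_oid_sum: "Max sum is 8,160 (255×32)"
encode_oid_worst = 2_600_000    # EncodeObjectId: "Max multiplication = 2.6M"

for name, bound in [("wc_oid_sum", oid_sum_worst),
                    ("EncodeObjectId", encode_oid_worst)]:
    verdict = "fits word32" if fits_word32(bound) else "can overflow word32"
    print(f"{name}: worst case {bound:,} -> {verdict}")
```

Both bounds sit several orders of magnitude below the 32-bit limit, which is why the pattern-based "overflow" verdicts did not survive manual review.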

Multi-model code review: does a different LLM find different bugs?

One model scanning code is a single opinion. The same code chunk that qwen2.5:14b dismisses might trigger a finding in deepseek-coder-v2 or dolphin-mistral. We ran the same 13-file pipeline through multiple models to compare blind spots:

| Round | Model | Type | Findings | After Claude filter | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | qwen2.5:14b | Local (Linux + Ollama GPU) | 913 (595 CRIT) | 1 TP / 99 FP | Complete |
| 2 | deepseek-coder-v2:lite | Local (Linux + Ollama GPU) | 2,297 (305 CRIT) | 2 “TP” / 98 FP → both FP after manual review | Complete |
| 3 | dolphin-mistral | Local (Linux + Ollama GPU) | 2,761 (1,769 CRIT) | 3 “TP” / 97 FP → all FP after manual review | Complete |
| 4 | Claude Sonnet 4.6 | Anthropic API | | | Skipped (3 rounds sufficient) |
| 5 | qwen2.5:3b | Local (Linux + Ollama GPU) | | | Skipped (too small for C code) |

What we learned: More noise does not mean more bugs. dolphin-mistral produced 3× the findings of qwen (2,761 vs 913) but zero real bugs. deepseek produced 2.5× as many (2,297) but also zero. The one real bug was found by the precision model (qwen2.5:14b), which had the fewest findings but the best signal-to-noise ratio. Each model flagged different functions, but in a mature codebase, “different” just means “different false positives.”

Honest assessment

Let’s be real: we would be very surprised if a home lab approach — one researcher with a Ryzen 5 fuzzing box, an RTX 3060 for LLM inference, and a week of effort — produced a meaningful security finding in wolfSSL. This is a library that has been:

  • Continuously fuzzed by Google’s OSS-Fuzz since 2016 (billions of executions)
  • Audited by professional security firms multiple times
  • FIPS 140-2/140-3 validated (the U.S. federal standard for cryptographic modules)
  • Deployed in automotive, aerospace, and government systems
  • Maintained by a team that clearly understands defensive C coding (WC_SAFE_SUM_WORD32, free-then-null, caller-level validation everywhere)

Our 1.22 billion AFL++ executions at 3.5–3.8% coverage are a rounding error compared to what OSS-Fuzz runs continuously. The one bug we found through code review is a logic error in debug-only code that no fuzzer would ever reach.

That’s actually the point. The value of this research isn’t in finding wolfSSL vulnerabilities — it’s in proving out the methodology. We built and validated an LLM-augmented fuzzing pipeline that:

  • Scans 202K lines of C with 3 models producing 5,971 candidates in ~24h total GPU time
  • Reduces 5,971 candidates → 300 Claude-evaluated → 6 “TPs” → 1 confirmed real after manual verification
  • Achieves 4.5× better fuzzing coverage with CMPLOG + dictionaries + real seed corpus
  • Runs entirely on commodity hardware ($300 GPU + $400 NUC)

wolfSSL is an unusually hard target. Apply this to a less-audited embedded TLS library or an IoT firmware stack, and the results would be very different.

Lessons for LLM-assisted fuzzing

Practical hints for anyone trying to use AI for vulnerability research in compiled C code:

  • LLMs can’t trace cross-function validation. Ollama flagged XMALLOC(untrusted_size) patterns without seeing the bounds check 3 functions up the call stack. Always verify caller context manually.
  • CMPLOG is a game changer for protocol fuzzing. Adding a CMPLOG binary (-c flag) gave AFL++ visibility into every memcmp/strcmp at runtime. Result: 4.5× coverage improvement (0.70% → 3.16%) in a 10-second smoke test.
  • Two-stage fuzzing works. Use AFL_LLVM_LAF_ALL=1 for path discovery, then replay the corpus against ASAN builds for crash detection. Also: run your fuzzing directory on a ramdisk (tmpfs). Kudos to Albert for both of these hints.
  • Claude’s FP filter is useful but not reliable. It correctly identified 1 real bug (EncodedDottedForm) but also passed 5 false positives as high- or medium-confidence true positives (for example wc_oid_sum and GetLength_ex). It reasons about patterns (“OID + integer = overflow”) rather than computing actual value ranges. Always verify manually.
  • Seed quality matters more than execution count. Switching from hand-crafted minimal seeds to wolfSSL’s own 59 DER test certificates was the single biggest improvement.
  • Pick your targets wisely. LLM-assisted security research works best on targets that have not been fuzzed extensively.
  • The 50-line code review window is too small. v2 with function-level chunking and wolfSSL-aware prompts cut noise dramatically. Feed entire functions, tell the LLM about the target’s defensive patterns.
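
The second stage of the two-stage approach, replaying the corpus against a sanitizer build, can be sketched as a small script. This is an illustration under assumptions: the harness is taken to read its input from a file path given as the last argument, and a sanitizer report is taken to surface as a non-zero or signal exit status. `replay_corpus` is a hypothetical helper, not part of AFL++.

```python
import subprocess
from pathlib import Path

def replay_corpus(cmd_prefix, corpus_dir, timeout=10.0):
    """Run every corpus file through a command; return abnormal exits.

    cmd_prefix is the harness command without the input path, which is
    appended as the final argument (an assumption about the harness).
    """
    failures = []
    for path in sorted(Path(corpus_dir).iterdir()):
        if not path.is_file():
            continue  # skip bookkeeping subdirectories inside the queue
        try:
            proc = subprocess.run(cmd_prefix + [str(path)],
                                  capture_output=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            failures.append((path.name, "timeout"))
            continue
        if proc.returncode != 0:
            # Negative values mean the process was killed by a signal.
            failures.append((path.name, f"exit {proc.returncode}"))
    return failures
```

A call such as `replay_corpus(["./harness_asan"], "out/default/queue")` would then list the inputs worth triaging by hand; both paths here are placeholders.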

Cross-model analysis: what 1,757 functions tell us

We ran the same 13-file pipeline through 3 different LLMs. Together they flagged 1,757 unique functions. Here’s how they overlap:

| Category | Functions | % of total |
| --- | --- | --- |
| Flagged by all 3 models | 149 | 8.5% |
| Flagged by exactly 2 models | 365 | 20.8% |
| qwen only | 321 | 18.3% |
| deepseek only | 460 | 26.2% |
| dolphin only | 462 | 26.3% |

About 71% of flagged functions are model-unique (1,243 of 1,757): each model sees something different. 36 exact (function, CWE) pairs were agreed upon by all 3, including GetASN_Items, DecodeCertInternal, and DecodeGeneralName. These 36 represent the highest-confidence candidates for manual review. All were evaluated by Claude, and none was confirmed as exploitable.

Each model also has unique CWE categories: qwen flagged 83 CWEs the others missed, dolphin had 95 unique CWEs, deepseek had 48. The diversity is real — but on wolfSSL, it’s diversity of false positives.
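
An overlap breakdown of this kind reduces to set arithmetic. The sketch below assumes each model's findings have already been reduced to a set of function names. The sample sets use placeholder names, and `overlap_breakdown` is a hypothetical helper.

```python
from collections import Counter

def overlap_breakdown(flagged):
    """flagged maps model name -> set of flagged function names.

    Returns (by_count, unique): by_count[k] is the number of functions
    flagged by exactly k models, and unique[model] is the number of
    functions flagged by that model alone.
    """
    hits = Counter(fn for funcs in flagged.values() for fn in funcs)
    by_count = Counter(hits.values())
    unique = {model: sum(1 for fn in funcs if hits[fn] == 1)
              for model, funcs in flagged.items()}
    return by_count, unique

if __name__ == "__main__":
    # Placeholder data, not the campaign's results.
    sample = {
        "qwen": {"func_a", "func_b", "func_c"},
        "deepseek": {"func_a", "func_b", "func_d"},
        "dolphin": {"func_a", "func_e"},
    }
    by_count, unique = overlap_breakdown(sample)
    total = sum(by_count.values())
    for k in sorted(by_count, reverse=True):
        print(f"flagged by {k} model(s): {by_count[k]} ({by_count[k] / total:.1%})")
    print(unique)
```

The same function works for exact (function, CWE) agreement if each set holds tuples instead of bare function names.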

Final numbers

| Metric | Value |
| --- | --- |
| AFL++ total executions | 1.22 billion |
| AFL++ crashes | 0 |
| Code review candidates (3 models) | 5,971 |
| Claude-evaluated candidates | 300 |
| Claude “true positives” | 6 |
| Confirmed real after manual verification | 1 |
| Exploitable vulnerabilities | 0 |
| Total researcher time | ~3 days |
| Hardware cost | $300 GPU + $400 NUC |

Conclusion

wolfSSL is one of the most hardened open-source C libraries in existence. 1.22 billion fuzzer executions produced zero crashes. Three LLM models scanning 202K lines of code produced nearly 6,000 candidates — and after Claude filtering and manual verification, exactly one real bug: a logic error in debug-only code that cannot be reached in production.

Let’s be honest about what we didn’t do. 1.22 billion executions sounds impressive, but with 3.8% coverage on the ASN.1 parser and 1.06% on the TLS handshake, it’s hard to say we even started real fuzzing. The handshake harness never got past the initial ClientHello parsing — it couldn’t produce a cryptographically valid response. The ASN.1 parser plateaued after a few hours and never broke through to deeper code paths. A proper fuzzing campaign against wolfSSL would need weeks of continuous execution, custom harnesses for each TLS extension, grammar-based seed generation for valid handshake sequences, and probably a network-aware fuzzer that can complete a full TLS exchange. We didn’t do any of that — and Google’s OSS-Fuzz has been doing exactly that since 2016.

The main goal was never to break wolfSSL. It was to set up and validate a repeatable process for LLM-augmented security assessment:

  • Build instrumented binaries (LAF, CMPLOG, ASAN) from a single Makefile
  • Write harnesses that feed untrusted input to the right entry points
  • Seed from the project’s own test data, not hand-crafted bytes
  • Run overnight campaigns with automatic ASAN verification
  • Scan source code with multiple LLMs in parallel, each catching different patterns
  • Filter noise with Claude, then verify every “true positive” by reading the actual code
  • Document everything in backlogs so the next assessment starts faster

That process now exists, is documented, and is proven to work; wolfSSL was the calibration target. The real value is applying this to codebases that haven’t had the benefit of a decade of OSS-Fuzz and professional security audits.