wolfSSL Fuzzing Campaign

Complete 2026-04-02

wolfSSL 5.9.0 — LLM-Augmented Fuzzing Campaign

Security research into the wolfSSL 5.9.0 TLS library (268 C files, 1.29M lines of code) combining AFL++ coverage-guided fuzzing with LLM-assisted code review. Three custom harnesses target the attack surfaces that have historically been most vulnerable: TLS handshake, X.509 certificate parsing, and ASN.1 decoding. After an initial short-run phase exposed coverage limitations, we rebuilt the entire fuzzing pipeline with CMPLOG instrumentation, a proper seed corpus from wolfSSL’s own test suite (59 DER certificates and keys), and ASN.1/TLS dictionaries. The overnight campaigns that followed are now complete.

Tags: Fuzzing, AFL++, CMPLOG, ASAN, TLS, ASN.1, LLM-assisted

Phase 1: short runs (2026-03-30)

The first phase produced 64M+ AFL++ executions, 478+ unique corpus items, and 1,360 LLM code review findings — but no confirmed vulnerabilities. Coverage was extremely low (0.26–1.05%) because the seed corpus consisted of manually crafted minimal inputs rather than real protocol data.

Method	Executions	Coverage	Findings	Result
AFL++ LAF (cert parser)	23.2M	0.42%	100 corpus items	0 crashes
AFL++ LAF (ASN.1)	20.8M	0.70%	162 corpus items	0 ASAN
AFL++ LAF (TLS handshake)	8.6M	1.05%	30 corpus items	0 ASAN
Ollama code review	—	—	1,360 findings → 210 CRITICAL	High FP rate
Claude FP filter	—	—	210 → 5 candidates	All 5 FP

Phase 2: overnight campaign with CMPLOG (2026-03-31)

After analyzing Phase 1 results, we rebuilt the fuzzing pipeline from scratch. The key improvements:

CMPLOG instrumentation — a separate binary that logs all comparisons at runtime, letting AFL++ auto-solve multi-byte checks like TLS version bytes, ASN.1 tag matching, and OID comparisons that previously blocked the fuzzer
Real seed corpus — 59 DER certificates and keys from wolfSSL’s own test suite (RSA, ECC, PKCS#8, malformed test certs) plus the evolved corpus from Phase 1
Protocol dictionaries — hand-built ASN.1/X.509 dictionary (OIDs, tags, length encodings, extension identifiers) and TLS dictionary (record types, versions, cipher suites, extensions, alert codes)
Three binary variants — LAF (main fuzzer, splits comparisons), CMPLOG (comparison logging for auto-dictionary), ASAN (crash verification post-run)
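A minimal sketch of how the dictionary and the CMPLOG invocation described above might be assembled. The token names, binary names and paths are hypothetical, and the harness is assumed to read its input from stdin. The byte values are the standard DER tags and TLS constants, and `-i`, `-o`, `-x`, `-c` and `-V` are regular `afl-fuzz` options.

```python
# Standard DER tags and TLS constants; the token names are illustrative only.
ASN1_TOKENS = {
    "tag_integer": b"\x02",
    "tag_oid": b"\x06",
    "tag_sequence": b"\x30",
    # DER encoding of OID 1.2.840.113549.1.1.11 (sha256WithRSAEncryption)
    "oid_sha256_rsa": bytes.fromhex("06092a864886f70d01010b"),
}
TLS_TOKENS = {
    "record_handshake": b"\x16",
    "version_tls12": b"\x03\x03",
    "version_tls13": b"\x03\x04",
}


def dict_entry(name, value):
    # One AFL++ dictionary line: a name, then the token as quoted hex escapes.
    escaped = "".join(f"\\x{b:02x}" for b in value)
    return f'{name}="{escaped}"'


def render_dictionary(tokens):
    # Full dictionary file content, one entry per line, sorted by name.
    lines = [dict_entry(name, value) for name, value in sorted(tokens.items())]
    return "\n".join(lines) + "\n"


def afl_command(laf_bin, cmplog_bin, seeds, out_dir, dictionary, seconds):
    # LAF build is the main target; the CMPLOG build is attached with -c.
    # Assumption: the harness reads stdin, so no @@ placeholder is passed.
    return [
        "afl-fuzz",
        "-i", str(seeds),
        "-o", str(out_dir),
        "-x", str(dictionary),
        "-c", str(cmplog_bin),
        "-V", str(seconds),
        "--", str(laf_bin),
    ]
```

Writing `render_dictionary(ASN1_TOKENS)` to a file such as `asn1_x509.dict` and passing the list from `afl_command(...)` to `subprocess.run` would start one campaign; a 4-hour run corresponds to `seconds=14400`.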

A 10-second smoke test confirmed the improvement: 3.16% coverage and 558 corpus items in 10 seconds vs 0.70% and 162 items in 10 minutes from Phase 1 — a 4.5× coverage increase. The campaigns (4 hours per target, 12 hours total) are running on a dedicated Linux fuzzing box with automatic ASAN crash verification at the end of each campaign. In parallel, we’re running an improved LLM code review (v2) covering all 13 key source files (202K lines) with function-level chunking and better prompts to reduce the false positive rate.

Target	Seeds / files	Dictionary / lines	Duration	Result
ASN.1 parser (4h + 8h extended)	254	asn1_x509.dict	12 hours	774M execs, 3.84% cvg, 1,034 corpus, 0 crashes
X.509 cert parser	192	asn1_x509.dict	4 hours	248M execs, 3.54% cvg, 893 corpus, 0 crashes
TLS handshake	33	tls.dict	4 hours	133M execs, 1.06% cvg, 34 corpus, 0 crashes
Code review v2 (qwen2.5:14b)	13 files	202K lines	~3 hours	913 findings → 1 TP after Claude filter
Code review Round 2 (deepseek)	13 files	202K lines	7.9 hours	2,297 findings → 2 “TP” → both FP
Code review Round 3 (dolphin)	13 files	202K lines	12.7 hours	2,761 findings → 3 “TP” → all FP

Code review v2 improvements

Phase 1 code review only scanned 2 files with fixed-size 50-line windows, leading to high false positive rates from missing caller context. v2 addresses this:

Function-level chunking — extracts complete C functions (up to 14K chars) instead of arbitrary line windows, so the LLM sees full validation logic
13 files / 202K lines — expanded from 2 files to cover tls13.c, dtls13.c, pkcs7.c, pkcs12.c, ecc.c, rsa.c, ssl.c, ocsp.c, sp_int.c, and more
wolfSSL-aware prompts — explicitly tell the LLM about WC_SAFE_SUM_WORD32(), free-then-null patterns, and caller-level validation to reduce FPs
Dual strategy per file — each file gets both a generic memory/integer scan and a domain-specific scan (ASN.1, TLS state machine, or crypto)
Deduplication — groups findings by (function, CWE, type) before Claude filtering
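The chunking and deduplication steps above can be sketched as follows. This is not the pipeline's actual code: `extract_functions` is a brace-counting heuristic that ignores braces inside strings, comments and preprocessor blocks, truncating over-long functions at 14K characters is an assumption, and the finding fields (`function`, `cwe`, `type`) are hypothetical names for the grouping key described above.

```python
from collections import defaultdict


def extract_functions(source, max_chars=14_000):
    """Return top-level C function definitions as text chunks (heuristic)."""
    chunks = []
    depth = 0
    unit_start = 0          # where the current top-level declaration began
    is_function = False
    for i, ch in enumerate(source):
        if ch == "{":
            if depth == 0:
                # A body whose header ends with ')' is treated as a function;
                # struct definitions and initializers do not match.
                is_function = source[unit_start:i].rstrip().endswith(")")
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                if is_function:
                    chunks.append(source[unit_start:i + 1].strip()[:max_chars])
                unit_start = i + 1
        elif ch == ";" and depth == 0:
            unit_start = i + 1
    return chunks


def deduplicate(findings):
    """Keep one finding per (function, cwe, type) and count the duplicates."""
    groups = defaultdict(list)
    for finding in findings:
        key = (finding["function"], finding["cwe"], finding["type"])
        groups[key].append(finding)
    return [dict(items[0], duplicates=len(items) - 1) for items in groups.values()]
```

Each chunk would be sent to the model together with the wolfSSL-aware prompt, and only the deduplicated list would go on to the Claude filter.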

Why no bugs? wolfSSL's defensive coding

All 5 Phase 1 candidates that survived the Claude filter turned out to be false positives, because wolfSSL employs defensive practices that LLMs struggle to trace across function boundaries:

WC_SAFE_SUM_WORD32() macro — validates integer arithmetic before every allocation
Consistent free-then-null pattern — prevents use-after-free
Length validation before all buffer operations — caller-level bounds checks
Callback return value bounds checking — e.g., PSK key length capped at MAX_PSK_KEY_LEN
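To illustrate why an allocation sized from untrusted lengths is harmless once the sum has been checked, here is a small Python model of the general checked-addition idiom. It is not wolfSSL's macro or its exact semantics; `safe_sum_u32` and `alloc_record` are hypothetical names.

```python
WORD32_MAX = 0xFFFFFFFF


def safe_sum_u32(a, b):
    """Return a + b, or None if the sum does not fit in an unsigned 32-bit word."""
    if not (0 <= a <= WORD32_MAX and 0 <= b <= WORD32_MAX):
        raise ValueError("operands must be unsigned 32-bit values")
    if a > WORD32_MAX - b:      # same test a C implementation would use
        return None
    return a + b


def alloc_record(header_len, payload_len):
    """Allocate header + payload bytes, refusing sizes that would wrap in C."""
    total = safe_sum_u32(header_len, payload_len)
    if total is None:
        return None             # reject instead of allocating a too-small buffer
    return bytearray(total)
```

In C the unchecked sum `0xFFFFFFFF + 1` wraps to 0, so the buffer would be far too small. An LLM that only sees the allocation line and not the check reports the overflow anyway.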

Phase 2 results: 1.2 billion executions, 0 crashes, 1 logic bug

All fuzzing campaigns complete. Every target ran with CMPLOG + dictionary + real seed corpus. ASN.1 got an extended 8-hour run on top of the initial 4 hours.

Campaign	Runtime	Executions	Coverage	Corpus
ASN.1 (CMPLOG, 4h)	4h	316M	3.82%	975
ASN.1 extended (8h)	8h	458M	3.84%	1,034
X.509 cert (CMPLOG)	4h	248M	3.54%	893
TLS handshake (CMPLOG)	4h	133M	1.06%	34
Total (Phase 2)	20h	1.155B	—	2,936

The TLS handshake harness plateaued at 1.06% coverage with only 34 corpus items: the state machine is extremely hard to penetrate even with CMPLOG, because the fuzzer can’t get past the initial handshake parsing without producing a valid cryptographic response. The ASN.1 and cert parsers reached 3.5–3.8% with roughly 900–1,000 corpus items each, respectable for runs of 4 to 12 hours but far from exhaustive.

Code review: 3 models, 6 Claude TPs, 1 real bug

We ran the same 13-file code review pipeline through three different LLM models, then filtered the top 100 candidates from each through Claude Sonnet 4.6. Every “true positive” from Claude was then manually verified against the actual source code.

Finding	Model	Claude verdict	Manual verdict	Details
`EncodedDottedForm`	qwen2.5:14b	TP (medium)	Real but not exploitable	Off-by-one in OID encoding. Debug-only code behind `#ifdef`, single caller with outSz=16.
`wc_oid_sum`	deepseek-coder-v2	TP (high)	False positive	Max sum is 8,160 (255×32) — fits word32. XOR path can’t overflow.
`GetLength_ex`	deepseek-coder-v2	TP (medium)	False positive	5 validation checks prevent overflow. Theoretical wrap unreachable (Check 3 caps at INT_MAX).
`StreamOctetString`	dolphin-mistral	TP (medium, CVSS 7.5)	False positive	Bounds check includes input offset `i` making it stricter, not weaker. No wrap with TLS-sized inputs.
`EncodeObjectId` (overflow)	dolphin-mistral	TP (medium)	False positive	Max multiplication = 2.6M (fits word32). `len` overflow needs 300M+ elements; callers pass <20.
`EncodeObjectId` (signed/unsigned)	dolphin-mistral	TP (medium)	False positive	Syntactic signed/unsigned mismatch but check works correctly in all reachable scenarios.

Result: 1 real logic bug out of 5,971 candidates across 3 models. Claude evaluated 300 of them and flagged 6 as true positives; only 1 was confirmed after manual verification. The real bug was found by the precision model (qwen), not the noisy ones. dolphin-mistral produced the most findings (2,761) but zero real bugs, including one that Claude scored at CVSS 7.5 as a buffer overflow and that was actually a bounds check stricter than necessary.

Claude’s FP filter reduces noise (300 evaluated candidates → 6 flagged) but is not a substitute for reading the code. It reasons about patterns (“OID + multiplication = overflow”) rather than computing actual value ranges. Manual verification caught all 5 false positives that Claude had flagged as true positives.
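Computing the value range is often a few lines of arithmetic. The sketch below takes the `wc_oid_sum` bound from the table as given and reads it as at most 32 terms of at most 255 each; that reading is an assumption. The helper names are hypothetical.

```python
WORD32_MAX = 2**32 - 1


def worst_case_sum(max_term, max_terms):
    """Largest value a sum of max_terms terms can reach."""
    return max_term * max_terms


def fits_word32(value):
    return 0 <= value <= WORD32_MAX


def terms_needed_to_overflow(max_term):
    """Smallest number of maximal terms whose sum exceeds a 32-bit word."""
    return WORD32_MAX // max_term + 1


oid_sum_bound = worst_case_sum(255, 32)          # 8,160, as in the table
overflow_threshold = terms_needed_to_overflow(255)
```

With terms capped at 255, the sum only exceeds a 32-bit word after more than 16.8 million terms, which is why a bound of 32 terms settles the question.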

Multi-model code review: does a different LLM find different bugs?

One model scanning code is a single opinion. The same code chunk that qwen2.5:14b dismisses might trigger a finding in deepseek-coder-v2 or dolphin-mistral. We ran the same 13-file pipeline through multiple models to compare blind spots:

Round	Model	Type	Findings	After Claude filter	Status
1	qwen2.5:14b	Local (Linux + Ollama GPU)	913 (595 CRIT)	1 TP / 99 FP	Complete
2	deepseek-coder-v2:lite	Local (Linux + Ollama GPU)	2,297 (305 CRIT)	2 “TP” / 98 FP → both FP after manual review	Complete
3	dolphin-mistral	Local (Linux + Ollama GPU)	2,761 (1,769 CRIT)	3 “TP” / 97 FP → all FP after manual review	Complete
4	Claude Sonnet 4.6	Anthropic API	—	—	Skipped — 3 rounds sufficient
5	qwen2.5:3b	Local (Linux + Ollama GPU)	—	—	Skipped — too small for C code

What we learned: More noise does not mean more bugs. dolphin-mistral produced 3× the findings of qwen (2,761 vs 913) but zero real bugs. deepseek produced 2.5× as many (2,297) and also zero. The one real bug was found by the precision model (qwen2.5:14b), which had the fewest findings but the best signal-to-noise ratio. Each model flagged different functions, but in a mature codebase “different” just means “different false positives.”
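The comparison above can be reproduced from the per-round numbers in the table. The dictionary layout and function name are illustrative; the counts are the ones reported for the three completed rounds.

```python
ROUNDS = {
    "qwen2.5:14b": {"findings": 913, "claude_tp": 1, "confirmed": 1},
    "deepseek-coder-v2:lite": {"findings": 2297, "claude_tp": 2, "confirmed": 0},
    "dolphin-mistral": {"findings": 2761, "claude_tp": 3, "confirmed": 0},
}


def summarize(rounds, baseline="qwen2.5:14b"):
    """Per-model finding volume relative to the baseline and confirmed-bug rate."""
    base = rounds[baseline]["findings"]
    rows = []
    for model, r in rounds.items():
        rows.append({
            "model": model,
            "volume_vs_baseline": round(r["findings"] / base, 2),
            "confirmed_per_1000": round(1000 * r["confirmed"] / r["findings"], 2),
        })
    return rows


TOTAL_CANDIDATES = sum(r["findings"] for r in ROUNDS.values())
TOTAL_CLAUDE_TP = sum(r["claude_tp"] for r in ROUNDS.values())
```

`summarize(ROUNDS)` puts deepseek at about 2.5× and dolphin at about 3× the volume of qwen, while only qwen has a non-zero confirmed rate (roughly 1.1 per 1,000 findings).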

Honest assessment

Let’s be real: we would be very surprised if a home lab approach — one researcher with a Ryzen 5 fuzzing box, an RTX 3060 for LLM inference, and a week of effort — produced a meaningful security finding in wolfSSL. This is a library that has been:

Continuously fuzzed by Google’s OSS-Fuzz since 2016 (billions of executions)
Audited by professional security firms multiple times
FIPS 140-2/140-3 validated (the U.S. federal standard for cryptographic modules)
Deployed in automotive, aerospace, and government systems
Maintained by a team that clearly understands defensive C coding (WC_SAFE_SUM_WORD32, free-then-null, caller-level validation everywhere)

Our 1.22 billion AFL++ executions at 3.5–3.8% coverage are a rounding error compared to what OSS-Fuzz runs continuously. The one bug we found through code review is a logic error in debug-only code that no fuzzer would ever reach.

That’s actually the point. The value of this research isn’t in finding wolfSSL vulnerabilities — it’s in proving out the methodology. We built and validated an LLM-augmented fuzzing pipeline that:

Scans 202K lines of C with 3 models producing 5,971 candidates in ~24h total GPU time
Reduces 5,971 candidates → 300 Claude-evaluated → 6 “TPs” → 1 confirmed real after manual verification
Achieves 4.5× better fuzzing coverage with CMPLOG + dictionaries + real seed corpus
Runs entirely on commodity hardware ($300 GPU + $400 NUC)

wolfSSL is an unusually hard target. Apply this to a less-audited embedded TLS library or an IoT firmware stack, and the results would be very different.

Lessons for LLM-assisted fuzzing

Practical hints for anyone trying to use AI for vulnerability research in compiled C code:

LLMs can’t trace cross-function validation. Ollama flagged XMALLOC(untrusted_size) patterns without seeing the bounds check 3 functions up the call stack. Always verify caller context manually.
CMPLOG is a game changer for protocol fuzzing. Adding a CMPLOG binary (-c flag) gave AFL++ visibility into the operands of comparisons at runtime, including memcmp/strcmp-style calls. Result: 4.5× coverage improvement (0.70% → 3.16%) in a 10-second smoke test.
Two-stage fuzzing works. Use AFL_LLVM_LAF_ALL=1 for path discovery, then replay the corpus against ASAN builds for crash detection. Also: run your fuzzing directory on a ramdisk (tmpfs). Kudos to Albert for both of these hints.
Claude’s FP filter is useful but not reliable. It correctly identified 1 real bug (EncodedDottedForm) but also flagged 5 false positives as medium- or high-confidence true positives (among them wc_oid_sum and GetLength_ex). It reasons about patterns (“OID + integer = overflow”) rather than computing actual value ranges. Always verify manually.
Seed quality matters more than execution count. Switching from hand-crafted minimal seeds to wolfSSL’s own 59 DER test certificates was the single biggest improvement.
Pick your targets wisely. LLM-assisted security research works best on targets that have not been fuzzed extensively.
The 50-line code review window is too small. v2 with function-level chunking and wolfSSL-aware prompts cut noise dramatically. Feed entire functions, tell the LLM about the target’s defensive patterns.
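The second half of the two-stage lesson above, replaying the evolved corpus against the ASAN build, could be scripted roughly like this. It is a sketch under assumptions: the harness is assumed to read one input from stdin, a crash is detected by a non-zero exit status or the string `AddressSanitizer` on stderr, and the function name and paths are hypothetical.

```python
import subprocess
from pathlib import Path


def replay_corpus(corpus_dir, harness_cmd, timeout=10):
    """Feed every corpus file to the ASAN harness; return (file, reason) for failures."""
    failures = []
    for path in sorted(Path(corpus_dir).iterdir()):
        if not path.is_file():          # skip AFL++ bookkeeping subdirectories
            continue
        try:
            proc = subprocess.run(
                harness_cmd,
                input=path.read_bytes(),
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            failures.append((path.name, "timeout"))
            continue
        stderr = proc.stderr.decode(errors="replace")
        if "AddressSanitizer" in stderr:
            failures.append((path.name, "asan"))
        elif proc.returncode != 0:
            failures.append((path.name, f"exit {proc.returncode}"))
    return failures
```

A call such as `replay_corpus("out/default/queue", ["./fuzz_asn1_asan"])` would then list any input that the fast LAF build accepted silently but that the sanitizer build rejects.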

Cross-model analysis: what 1,757 functions tell us

We ran the same 13-file pipeline through 3 different LLMs. Together they flagged 1,757 unique functions. Here’s how they overlap:

Category	Functions	% of total
Flagged by all 3 models	149	8.5%
Flagged by exactly 2 models	365	20.8%
qwen only	321	18.3%
deepseek only	460	26.2%
dolphin only	462	26.3%

About 71% of the flagged functions are model-unique (1,243 of 1,757): each model sees something different. 36 exact (function, CWE) pairs were agreed upon by all 3, including GetASN_Items, DecodeCertInternal, and DecodeGeneralName. These 36 represent the highest-confidence candidates for manual review. All were evaluated by Claude; none was confirmed as exploitable.
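An overlap breakdown like the one in the table can be derived from per-model sets of flagged function names. The sketch below uses toy data (only `GetASN_Items` is a real name from this write-up; the others are placeholders) and then recomputes the model-unique share from the table's own counts.

```python
from collections import Counter


def overlap_breakdown(flagged):
    """flagged maps model name -> set of function names that model flagged."""
    owners = {}
    for model, functions in flagged.items():
        for fn in functions:
            owners.setdefault(fn, set()).add(model)
    counts = Counter()
    for models in owners.values():
        if len(models) == len(flagged):
            counts["all models"] += 1
        elif len(models) == 1:
            counts[f"{next(iter(models))} only"] += 1
        else:
            counts[f"exactly {len(models)} models"] += 1
    return counts


# Counts copied from the table above.
TABLE = {"all 3": 149, "exactly 2": 365,
         "qwen only": 321, "deepseek only": 460, "dolphin only": 462}
total_functions = sum(TABLE.values())
unique_share = (TABLE["qwen only"] + TABLE["deepseek only"]
                + TABLE["dolphin only"]) / total_functions
```

The same grouping works on (function, CWE) pairs instead of function names, which is how a set of exact all-model agreements can be isolated for manual review.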

Each model also has unique CWE categories: qwen flagged 83 CWEs the others missed, dolphin had 95 unique CWEs, deepseek had 48. The diversity is real — but on wolfSSL, it’s diversity of false positives.

Final numbers

Metric	Value
AFL++ total executions	1.22 billion
AFL++ crashes	0
Code review candidates (3 models)	5,971
Claude-evaluated candidates	300
Claude “true positives”	6
Confirmed real after manual verification	1
Exploitable vulnerabilities	0
Total researcher time	~3 days
Hardware cost	$300 GPU + $400 NUC

Conclusion

wolfSSL is one of the most hardened open-source C libraries in existence. 1.22 billion fuzzer executions produced zero crashes. Three LLMs scanning 202K lines of code produced nearly 6,000 candidates and, after Claude filtering and manual verification, exactly one real bug: a logic error in debug-only code that cannot be reached in production.

Let’s be honest about what we didn’t do. 1.22 billion executions sounds impressive, but with 3.8% coverage on the ASN.1 parser and 1.06% on the TLS handshake, it’s hard to say we even started real fuzzing. The handshake harness never got past the initial ClientHello parsing — it couldn’t produce a cryptographically valid response. The ASN.1 parser plateaued after a few hours and never broke through to deeper code paths. A proper fuzzing campaign against wolfSSL would need weeks of continuous execution, custom harnesses for each TLS extension, grammar-based seed generation for valid handshake sequences, and probably a network-aware fuzzer that can complete a full TLS exchange. We didn’t do any of that — and Google’s OSS-Fuzz has been doing exactly that since 2016.

The main goal was never to break wolfSSL. It was to set up and validate a repeatable process for LLM-augmented security assessment:

Build instrumented binaries (LAF, CMPLOG, ASAN) from a single Makefile
Write harnesses that feed untrusted input to the right entry points
Seed from the project’s own test data, not hand-crafted bytes
Run overnight campaigns with automatic ASAN verification
Scan source code with multiple LLMs in parallel, each catching different patterns
Filter noise with Claude, then verify every “true positive” by reading the actual code
Document everything in backlogs so the next assessment starts faster

That process now exists, is documented, and is proven to work; wolfSSL was the calibration target. The real value is applying this to codebases that haven’t had the benefit of a decade of OSS-Fuzz and professional security audits.