Manual verification 2026-04-02

BearSSL — Applying the wolfSSL Methodology to an Unfuzzed Target

After validating our LLM-augmented fuzzing pipeline on wolfSSL (a heavily audited, OSS-Fuzz target), we applied the exact same approach to BearSSL — a compact TLS library that has never been through OSS-Fuzz and appears unmaintained since 2022. The hypothesis: less-audited code yields more findings.

Tags: Fuzzing, AFL++, CMPLOG, Code review, Multi-model, TLS

Why BearSSL?

Aspect | wolfSSL (previous) | BearSSL (this project)
Size | 202K lines (13 key files) | 59K lines (294 files)
Author | wolfSSL Inc (team) | Thomas Pornin (solo cryptographer)
License | GPLv2 + commercial | MIT
OSS-Fuzz | Yes (since 2016) | No
Last commit | Active (daily) | ~2022 (unmaintained)
TLS versions | 1.0–1.3 + DTLS | 1.0–1.2 only
Defensive macros | WC_SAFE_SUM_WORD32, ForceZero | None (manual checks only)

BearSSL is written by a cryptographer known for constant-time implementations and security-first design. But it lacks the continuous fuzzing and enterprise-grade defensive patterns that wolfSSL has built up over years of OSS-Fuzz and FIPS certification. The question: does craftsmanship alone prevent the bugs that automated tooling finds?

Fuzzing: 419M executions

Same CMPLOG + LAF + ASAN pipeline as wolfSSL. Two harnesses targeting the X.509 certificate parser and private key decoder — the highest-risk attack surfaces.

Campaign | Duration | Executions | Coverage | Corpus | Crashes
X.509 parser (8h + 9h extended) | 17h | 193M | 0.43% | 204 | 0
Private key decoder (4h + 9h extended) | 13h | 226M | 0.17% | 98 | 0
Total | 30h | 419M | - | 302 | 0

Coverage plateaued early and never increased during extended runs — the harnesses couldn’t break through BearSSL’s input validation. Same limitation as wolfSSL: real TLS fuzzing needs grammar-aware seed generation and multi-message handshake sequences, not just random mutation.
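The totals and average throughput implied by the campaign table can be checked with a few lines of arithmetic. This is a minimal sketch: the dictionary keys and the helper name are ours, the figures are copied from the table, and the rate is an aggregate over each campaign regardless of how many fuzzer instances were running.

```python
# Campaign figures copied from the table above (executions in millions).
campaigns = {
    "x509_parser": {"hours": 17, "execs_m": 193, "corpus": 204, "crashes": 0},
    "private_key_decoder": {"hours": 13, "execs_m": 226, "corpus": 98, "crashes": 0},
}


def execs_per_second(hours, execs_m):
    """Average executions per second over a whole campaign."""
    return execs_m * 1_000_000 / (hours * 3600)


# Column totals, which should match the "Total" row of the table.
total_hours = sum(c["hours"] for c in campaigns.values())
total_execs_m = sum(c["execs_m"] for c in campaigns.values())
total_corpus = sum(c["corpus"] for c in campaigns.values())
total_crashes = sum(c["crashes"] for c in campaigns.values())

for name, c in campaigns.items():
    rate = execs_per_second(c["hours"], c["execs_m"])
    print(f"{name}: {rate:.0f} exec/s on average")
print(f"total: {total_hours}h, {total_execs_m}M execs, "
      f"corpus {total_corpus}, crashes {total_crashes}")
```

Both campaigns average a few thousand executions per second, so the flat coverage reflects inputs being rejected early rather than a slow harness.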

Multi-model code review: 3 models, same approach

We ran the identical multi-model pipeline from the wolfSSL research: function-level chunking, target-specific system prompts, inference forced onto the local GPU only, and Claude false-positive (FP) filtering of each model's top candidates.
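Function-level chunking can be illustrated with a naive brace-depth scanner. This is a sketch under stated assumptions, not the chunker used in this research: `chunk_functions` is a hypothetical helper, and it ignores braces inside comments, string literals, and preprocessor conditionals, which a real implementation would need to handle.

```python
def chunk_functions(source: str) -> list[str]:
    """Split C source into one chunk per top-level function definition.

    Heuristic: a top-level brace block counts as a function when the text
    before its opening brace ends with ")" (the parameter list).
    """
    chunks = []
    depth = 0
    header_start = 0  # where the current top-level declaration began
    body_open = 0     # index of the "{" that opened the current block
    for i, ch in enumerate(source):
        if ch == "{":
            if depth == 0:
                body_open = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                header = source[header_start:body_open].strip()
                if header.endswith(")"):
                    chunks.append(source[header_start:i + 1].strip())
                header_start = i + 1
        elif ch == ";" and depth == 0:
            # End of a top-level declaration (prototype, typedef, struct tail).
            header_start = i + 1
    return chunks


example = """
static int add(int a, int b)
{
    return a + b;
}

struct pair { int x; int y; };

int twice(int v) { if (v) { return 2 * v; } return 0; }
"""

for chunk in chunk_functions(example):
    print(chunk.splitlines()[0])
```

Each chunk would then be sent to a model together with the target-specific system prompt, so that a finding maps back to exactly one function.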

Model | Duration | Findings | CRITICAL | Claude TPs
qwen2.5:14b | 8.3 min | 46 | 33 (72%) | 1
deepseek-coder-v2:lite | 20.7 min | 103 | 5 (5%) | 1
dolphin-mistral | 19.5 min | 113 | 43 (38%) | 1
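The percentages and totals in this table can be reproduced directly from the raw counts. A minimal sketch with figures copied from the table; the tuple layout and the helper name are ours. The durations cover the model scans only, not the Claude filtering step.

```python
# (model, scan minutes, findings, CRITICAL-rated, Claude TPs), from the table above.
results = [
    ("qwen2.5:14b", 8.3, 46, 33, 1),
    ("deepseek-coder-v2:lite", 20.7, 103, 5, 1),
    ("dolphin-mistral", 19.5, 113, 43, 1),
]


def critical_share(critical, findings):
    """Percentage of a model's findings that it rated CRITICAL."""
    return 100 * critical / findings


for model, minutes, findings, critical, tps in results:
    print(f"{model}: {critical_share(critical, findings):.0f}% CRITICAL, "
          f"{tps} of {findings} findings confirmed by Claude")

total_minutes = round(sum(r[1] for r in results), 1)
total_findings = sum(r[2] for r in results)
total_tps = sum(r[4] for r in results)
print(f"all models: {total_minutes} min, {total_findings} findings, "
      f"{total_tps} Claude TPs")
```

The spread in CRITICAL share between models suggests that self-assigned severity is a poor ranking signal on its own, which is one reason a separate filtering pass is useful.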

What the models found

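Each model produced 1 Claude-confirmed true positive. Because two of those three true positives point at the same function, they correspond to two distinct findings. We are currently in manual review to verify exploitability and assess impact.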

  • 1 finding cross-confirmed by 2 models (qwen + deepseek independently flagged the same function) — this is the strongest signal we’ve seen across all our research. In wolfSSL, no finding was cross-confirmed.
  • 1 finding unique to dolphin-mistral — the uncensored model caught a different class of issue that the censored models didn’t flag. This validates the multi-model approach: each model has different blind spots.
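Cross-model confirmation amounts to grouping confirmed findings by location and counting how many distinct models flagged each one. The sketch below assumes a hypothetical record shape of (model, file, function); the paths and function names are placeholders and do not refer to the undisclosed findings.

```python
from collections import defaultdict

# Placeholder records only: (model, file, function).
confirmed = [
    ("qwen", "src/a.c", "func_a"),
    ("deepseek", "src/a.c", "func_a"),
    ("dolphin", "src/b.c", "func_b"),
]


def group_by_location(findings):
    """Map each (file, function) to the set of models that flagged it."""
    by_location = defaultdict(set)
    for model, path, func in findings:
        by_location[(path, func)].add(model)
    return by_location


def cross_confirmed(findings, min_models=2):
    """Locations flagged independently by at least `min_models` models."""
    return {
        location: sorted(models)
        for location, models in group_by_location(findings).items()
        if len(models) >= min_models
    }


print(len(group_by_location(confirmed)), "distinct findings")
print(cross_confirmed(confirmed))
```

Keying on the function rather than on the model's free-text description avoids having to compare differently worded reports of the same issue.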

We are not disclosing specific details until manual verification is complete. BearSSL is unmaintained, which means any confirmed findings cannot be patched upstream. We will publish full technical details after completing our assessment and determining the appropriate disclosure path.

wolfSSL vs BearSSL: what the comparison tells us

Metric | wolfSSL | BearSSL
Source lines scanned | 202K | ~15K
AFL++ executions | 1.22B | 419M
Code review (3 models) | 5,971 findings | 262 findings
Claude TPs (total) | 6 | 3
Cross-model confirmed | 0 | 1
Unique findings | 1 (debug code) | 2 (production code)
OSS-Fuzz | Yes | No

The comparison supports the hypothesis that less-audited code yields more findings, with the caveat that BearSSL's findings are still under manual verification. wolfSSL's one confirmed finding was in debug-only code behind #ifdef. BearSSL's findings are in production code paths. The absence of continuous fuzzing (OSS-Fuzz) and the lack of systematic defensive macros appear to make a measurable difference, even in code written by one of the world's best cryptographers.
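Because the two targets differ greatly in size, the raw counts are easier to compare once normalized. A rough sketch using only the figures from the comparison table; the helper names are ours, the BearSSL line count is approximate (~15K), and with counts this small the ratios are indicative at best.

```python
# Figures copied from the comparison table (source lines in thousands).
comparison = {
    "wolfSSL": {"kloc": 202, "raw_findings": 5971, "claude_tps": 6, "unique": 1},
    "BearSSL": {"kloc": 15, "raw_findings": 262, "claude_tps": 3, "unique": 2},
}


def per_kloc(count, kloc):
    """Count per thousand source lines scanned."""
    return count / kloc


def per_thousand_findings(count, raw_findings):
    """Count per thousand raw model findings."""
    return 1000 * count / raw_findings


for name, m in comparison.items():
    print(
        f"{name}: "
        f"{per_kloc(m['raw_findings'], m['kloc']):.1f} raw findings/KLOC, "
        f"{per_thousand_findings(m['claude_tps'], m['raw_findings']):.1f} "
        f"Claude TPs per 1,000 raw findings, "
        f"{per_kloc(m['unique'], m['kloc']):.3f} unique findings/KLOC"
    )
```

On these numbers BearSSL produces fewer raw findings per line than wolfSSL, but a much larger share of them survive Claude filtering.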

Multi-model validation: confirmed

The wolfSSL research suggested that running multiple models adds noise without adding bugs on mature codebases. BearSSL tells a different story:

  • Cross-model confirmation works. Two models independently flagging the same function is a much stronger signal than a single model’s verdict. In wolfSSL we never saw this. In BearSSL we did.
  • Dolphin finds what censored models miss. The uncensored model caught a different class of issue entirely. For less-audited targets, the multi-model approach is justified.
  • BearSSL scans are fast. 8.3 minutes for qwen on the 59K-line library (about 15K lines scanned). The entire 3-model pipeline (code review + Claude filter) completed in under an hour. For small libraries, multi-model scanning has negligible cost.

Status

Phase | Status
AFL++ fuzzing (X.509 + skey) | Complete (419M execs, 0 crashes)
Code review Round 1 (qwen) | Complete (46 findings, 1 Claude TP)
Code review Round 2 (deepseek) | Complete (103 findings, 1 Claude TP, cross-confirmed)
Code review Round 3 (dolphin) | Complete (113 findings, 1 Claude TP, unique)
Manual verification & exploit development | In progress
Disclosure | Pending exploit development