BearSSL — Applying the wolfSSL Methodology to an Unfuzzed Target
After validating our LLM-augmented fuzzing pipeline on wolfSSL (a heavily audited, OSS-Fuzz target), we applied the exact same approach to BearSSL — a compact TLS library that has never been through OSS-Fuzz and appears unmaintained since 2022. The hypothesis: less-audited code yields more findings.
Why BearSSL?
| Aspect | wolfSSL (previous) | BearSSL (this project) |
|---|---|---|
| Size | 202K lines (13 key files) | 59K lines (294 files) |
| Author | wolfSSL Inc (team) | Thomas Pornin (solo cryptographer) |
| License | GPLv2 + commercial | MIT |
| OSS-Fuzz | Yes (since 2016) | No |
| Last commit | Active (daily) | ~2022 (unmaintained) |
| TLS versions | 1.0–1.3 + DTLS | 1.0–1.2 only |
| Defensive macros | WC_SAFE_SUM_WORD32, ForceZero | None — manual checks only |
BearSSL is written by a cryptographer known for constant-time implementations and security-first design. But it lacks the continuous fuzzing and enterprise-grade defensive patterns that wolfSSL has built up over years of OSS-Fuzz and FIPS certification. The question: does craftsmanship alone prevent the bugs that automated tooling finds?
Fuzzing: 419M executions
Same CMPLOG + LAF (laf-intel) + ASAN pipeline as wolfSSL, with two harnesses targeting the X.509 certificate parser and the private key decoder, the highest-risk attack surfaces.
| Campaign | Duration | Executions | Coverage | Corpus | Crashes |
|---|---|---|---|---|---|
| X.509 parser (8h + 9h extended) | 17h | 193M | 0.43% | 204 | 0 |
| Private key decoder (4h + 9h extended) | 13h | 226M | 0.17% | 98 | 0 |
| Total | 30h | 419M | — | 302 | 0 |
Coverage plateaued early and never increased during extended runs — the harnesses couldn’t break through BearSSL’s input validation. Same limitation as wolfSSL: real TLS fuzzing needs grammar-aware seed generation and multi-message handshake sequences, not just random mutation.
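To make "grammar-aware seed generation" concrete, here is a minimal sketch of a DER encoder that emits structurally valid ASN.1 seeds for the private key harness. It is an illustration, not the generator used in this project: the function names are hypothetical, the field layout follows the RSAPrivateKey structure from RFC 8017 (a SEQUENCE of nine INTEGERs), and the values are random, so the output is well-formed DER but not a usable key. Whether seeds like these actually raise coverage in BearSSL is untested here.

```python
import random


def der_len(n: int) -> bytes:
    """Encode a DER definite length (short form below 128, long form otherwise)."""
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def tlv(tag: int, value: bytes) -> bytes:
    """Build one tag-length-value element."""
    return bytes([tag]) + der_len(len(value)) + value


def der_int(n: int) -> bytes:
    """Encode a non-negative INTEGER; the extra byte keeps the sign bit clear."""
    return tlv(0x02, n.to_bytes(n.bit_length() // 8 + 1, "big"))


def der_seq(*items: bytes) -> bytes:
    """Wrap already-encoded elements in a SEQUENCE."""
    return tlv(0x30, b"".join(items))


def rsa_key_shaped_seed(rng: random.Random, modulus_bits: int = 512) -> bytes:
    """Return DER shaped like RSAPrivateKey: version, n, e, d, p, q, dP, dQ, qInv.

    The numbers are random, so this only satisfies the grammar, not the math.
    """
    half = modulus_bits // 2
    fields = [0, rng.getrandbits(modulus_bits), 65537, rng.getrandbits(modulus_bits)]
    fields += [rng.getrandbits(half) for _ in range(5)]
    return der_seq(*(der_int(v) for v in fields))
```

Writing a few dozen outputs of `rsa_key_shaped_seed` with different `random.Random` seeds into the fuzzer's input directory gives the mutator a starting point that already passes the outer tag and length checks, so mutations are spent on field contents rather than on rediscovering the framing.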
Multi-model code review: 3 models, same approach
We ran the identical multi-model pipeline from the wolfSSL research: function-level chunking, target-specific system prompts, local-GPU-only inference, and Claude false-positive (FP) filtering on each model’s top candidates.
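As an illustration of function-level chunking, the sketch below splits C source into one chunk per top-level function by tracking brace depth. It is a simplified stand-in, not the chunker used in the pipeline: `chunk_functions` is a hypothetical name, and the scanner ignores the possibility of braces inside string literals, comments, or preprocessor macros, which a production chunker would need to handle.

```python
def chunk_functions(source: str) -> list[str]:
    """Split C source into top-level function definitions.

    A chunk runs from the end of the previous top-level item to the brace
    that closes the function body. Blocks whose header has no "(" (for
    example struct definitions) are dropped.
    """
    chunks = []
    depth = 0
    start = 0  # index where the current top-level item begins
    for i, ch in enumerate(source):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                block = source[start:i + 1].strip()
                header = block.split("{", 1)[0]
                if "(" in header:
                    chunks.append(block)
                start = i + 1
        elif ch == ";" and depth == 0:
            # Prototype or global declaration: skip past it.
            start = i + 1
    return chunks
```

Sending one function per prompt keeps each request well inside a small local model's context window and lets every finding be attributed to a single function, which is what makes later cross-model comparison possible.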
| Model | Duration | Findings | CRITICAL | Claude TPs |
|---|---|---|---|---|
| qwen2.5:14b | 8.3 min | 46 | 33 (72%) | 1 |
| deepseek-coder-v2:lite | 20.7 min | 103 | 5 (5%) | 1 |
| dolphin-mistral | 19.5 min | 113 | 43 (38%) | 1 |
What the models found
Each model produced one Claude-confirmed true positive (TP). Because two models flagged the same function, the three TPs collapse to two distinct findings. Manual review to verify exploitability and assess impact is in progress.
- 1 finding cross-confirmed by 2 models (qwen + deepseek independently flagged the same function) — this is the strongest signal we’ve seen across all our research. In wolfSSL, no finding was cross-confirmed.
- 1 finding unique to dolphin-mistral — the uncensored model caught a different class of issue that the censored models didn’t flag. This validates the multi-model approach: each model has different blind spots.
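Cross-model confirmation reduces to grouping findings by location and keeping the locations that more than one model flagged. The sketch below assumes each finding is a `(model, file, function)` tuple; that record shape and the name `cross_confirmed` are assumptions for illustration, and the file and function names are placeholders, not the real findings.

```python
from collections import defaultdict


def cross_confirmed(findings, min_models=2):
    """Return {(file, function): [models]} for locations flagged by >= min_models models."""
    by_location = defaultdict(set)
    for model, path, function in findings:
        by_location[(path, function)].add(model)
    return {
        loc: sorted(models)
        for loc, models in by_location.items()
        if len(models) >= min_models
    }


# Placeholder data mirroring the shape of the result described above.
example_findings = [
    ("qwen2.5:14b", "src/file_a.c", "function_a"),
    ("deepseek-coder-v2:lite", "src/file_a.c", "function_a"),
    ("dolphin-mistral", "src/file_b.c", "function_b"),
]
confirmed = cross_confirmed(example_findings)
```

Using a set per location means a model that reports the same function several times still counts once, so the threshold measures agreement between models rather than repetition within one.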
We are not disclosing specific details until manual verification is complete. BearSSL is unmaintained, which means any confirmed findings cannot be patched upstream. We will publish full technical details after completing our assessment and determining the appropriate disclosure path.
wolfSSL vs BearSSL: what the comparison tells us
| Metric | wolfSSL | BearSSL |
|---|---|---|
| Source lines scanned | 202K | ~15K |
| AFL++ executions | 1.22B | 419M |
| Code review (3 models) | 5,971 findings | 262 findings |
| Claude TPs (total) | 6 | 3 |
| Cross-model confirmed | 0 | 1 |
| Unique findings | 1 (debug code) | 2 (production code) |
| OSS-Fuzz | Yes | No |
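The comparison supports the hypothesis that less-audited code yields more actionable findings, though the counts are small. wolfSSL produced more raw findings and more Claude TPs, but its one confirmed finding was in debug-only code behind #ifdef. BearSSL’s two findings, still pending manual verification, are in production code paths and came from roughly 15K scanned lines rather than 202K. The absence of continuous fuzzing (OSS-Fuzz) and of systematic defensive macros appears to make a difference, even in code written by one of the world’s best cryptographers.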
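Because the two scans differ in size by more than an order of magnitude, raw counts are easier to compare once normalized. The sketch below recomputes three ratios from the comparison table; the metric names are ours, the BearSSL line count is the approximate 15K figure, and with single-digit counts the ratios are indicative only.

```python
# Figures copied from the comparison table (kloc = thousands of source lines scanned).
CAMPAIGNS = {
    "wolfSSL": {"kloc": 202, "raw_findings": 5971, "claude_tps": 6, "unique": 1},
    "BearSSL": {"kloc": 15, "raw_findings": 262, "claude_tps": 3, "unique": 2},
}


def rates(c: dict) -> dict:
    """Normalize one campaign's counts by scan size and by raw finding volume."""
    return {
        "raw_per_kloc": c["raw_findings"] / c["kloc"],
        "claude_tp_rate": c["claude_tps"] / c["raw_findings"],
        "unique_per_100_kloc": 100 * c["unique"] / c["kloc"],
    }


summary = {name: rates(c) for name, c in CAMPAIGNS.items()}
```

On these numbers the models were noisier per line on wolfSSL (about 30 raw findings per thousand lines versus about 17), while the share of raw findings that survived Claude filtering was roughly ten times higher on BearSSL (about 1.1% versus 0.1%).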
Multi-model validation: confirmed
The wolfSSL research suggested that running multiple models adds noise without adding bugs on mature codebases. BearSSL tells a different story:
- Cross-model confirmation works. Two models independently flagging the same function is a much stronger signal than a single model’s verdict. In wolfSSL we never saw this. In BearSSL we did.
- Dolphin finds what censored models miss. The uncensored model caught a different class of issue entirely. For less-audited targets, the multi-model approach is justified.
- BearSSL scans are fast. qwen finished in 8.3 minutes on the 59K-line library (about 15K lines scanned), and the entire 3-model pipeline (code review + Claude filter) completed in under an hour. For small libraries, multi-model scanning has negligible cost.
Status
| Phase | Status |
|---|---|
| AFL++ fuzzing (X.509 + skey) | Complete — 419M execs, 0 crashes |
| Code review Round 1 (qwen) | Complete — 46 findings, 1 Claude TP |
| Code review Round 2 (deepseek) | Complete — 103 findings, 1 Claude TP (cross-confirmed) |
| Code review Round 3 (dolphin) | Complete — 113 findings, 1 Claude TP (unique) |
| Manual verification & exploit development | In progress |
| Disclosure | Pending exploit development |