BearSSL Research — HoneyLens

Manual verification 2026-04-02

BearSSL — Applying the wolfSSL Methodology to an Unfuzzed Target

After validating our LLM-augmented fuzzing pipeline on wolfSSL (a heavily audited, OSS-Fuzz target), we applied the exact same approach to BearSSL — a compact TLS library that has never been through OSS-Fuzz and appears unmaintained since 2022. The hypothesis: less-audited code yields more findings.

Tags: Fuzzing, AFL++, CMPLOG, Code review, Multi-model, TLS

Why BearSSL?

Aspect	wolfSSL (previous)	BearSSL (this project)
Size	202K lines (13 key files)	59K lines (294 files)
Author	wolfSSL Inc (team)	Thomas Pornin (solo cryptographer)
License	GPLv2 + commercial	MIT
OSS-Fuzz	Yes (since 2016)	No
Last commit	Active (daily)	~2022 (unmaintained)
TLS versions	1.0–1.3 + DTLS	1.0–1.2 only
Defensive macros	WC_SAFE_SUM_WORD32, ForceZero	None — manual checks only

BearSSL is written by a cryptographer known for constant-time implementations and security-first design. But it lacks the continuous fuzzing and enterprise-grade defensive patterns that wolfSSL has built up over years of OSS-Fuzz and FIPS certification. The question: does craftsmanship alone prevent the bugs that automated tooling finds?

Fuzzing: 419M executions

We used the same CMPLOG + LAF + ASAN pipeline as for wolfSSL, with two harnesses targeting the X.509 certificate parser and the private key decoder, the highest-risk attack surfaces.

Campaign	Duration	Executions	Coverage	Corpus
X.509 parser (8h + 9h extended)	17h	193M	0.43%	204
Private key decoder (4h + 9h extended)	13h	226M	0.17%	98
Total	30h	419M	—	302
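As a quick sanity check on the table, the totals and average throughput can be recomputed from the per-campaign rows. This is a minimal sketch: the execution counts are the rounded figures from the table, so the exec/s values are approximate averages over each whole campaign, not measured peak rates.

```python
# Figures copied from the campaign table (rounded to the nearest million).
campaigns = {
    "x509_parser": {"hours": 17, "execs": 193_000_000},
    "skey_decoder": {"hours": 13, "execs": 226_000_000},
}


def execs_per_second(execs: int, hours: float) -> float:
    """Average throughput over the whole campaign."""
    return execs / (hours * 3600)


total_execs = sum(c["execs"] for c in campaigns.values())
total_hours = sum(c["hours"] for c in campaigns.values())

for name, c in campaigns.items():
    rate = execs_per_second(c["execs"], c["hours"])
    print(f"{name}: {rate:,.0f} exec/s")
print(f"total: {total_execs:,} execs over {total_hours} h")
```

The two rows sum to the 419M executions and 30 hours reported in the Total row.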

Coverage plateaued early and never increased during the extended runs: the harnesses could not get past BearSSL’s input validation. This is the same limitation we hit on wolfSSL. Real TLS fuzzing needs grammar-aware seed generation and multi-message handshake sequences, not just random mutation.
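To illustrate what grammar-aware seed generation could look like for the private key harness, here is a minimal sketch that emits structurally valid DER in the shape of an RSAPrivateKey (a SEQUENCE of nine INTEGERs, per RFC 8017). It is not the generator used in this project. The function names are hypothetical, and the integers are random, so the output is a well-formed structure rather than a mathematically valid key.

```python
import random


def der_len(n: int) -> bytes:
    """DER definite length: short form below 128, long form otherwise."""
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def tlv(tag: int, value: bytes) -> bytes:
    """Encode one tag-length-value element."""
    return bytes([tag]) + der_len(len(value)) + value


def der_int(v: int) -> bytes:
    """Non-negative INTEGER, padded with 0x00 when the top bit is set."""
    body = v.to_bytes(max(1, (v.bit_length() + 7) // 8), "big")
    if body[0] & 0x80:
        body = b"\x00" + body
    return tlv(0x02, body)


def der_seq(*items: bytes) -> bytes:
    """SEQUENCE wrapping already-encoded elements."""
    return tlv(0x30, b"".join(items))


def rsa_private_key_shaped_seed(rng: random.Random, bits: int = 64) -> bytes:
    """version = 0 followed by eight random integers (hypothetical seed shape)."""
    fields = [0] + [rng.getrandbits(bits) for _ in range(8)]
    return der_seq(*(der_int(f) for f in fields))


if __name__ == "__main__":
    seed = rsa_private_key_shaped_seed(random.Random(1))
    print(seed.hex())
```

Seeds like this start the fuzzer past the outer tag and length checks, so mutation effort goes into field-level parsing instead of being rejected at the first byte.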

Multi-model code review: 3 models, same approach

We ran the identical multi-model pipeline from the wolfSSL research: function-level chunking, target-specific system prompts, inference forced onto the local GPU only, and Claude false-positive (FP) filtering of each model’s top candidates.
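Function-level chunking can be approximated with a brace-depth scanner. The sketch below is an illustration, not the pipeline's actual chunker. It assumes braces never appear inside strings or comments, treats any top-level brace block (including structs and initializers) as a chunk, and attaches preceding top-level lines such as includes to the next block.

```python
def chunk_functions(source: str) -> list[str]:
    """Split C source into top-level brace-delimited blocks plus their headers."""
    chunks: list[str] = []
    current: list[str] = []
    depth = 0
    for line in source.splitlines():
        if depth == 0 and not current and not line.strip():
            continue  # skip blank lines between chunks
        current.append(line)
        depth += line.count("{") - line.count("}")
        if depth == 0 and "}" in line:
            # A top-level block just closed: emit it as one chunk.
            chunks.append("\n".join(current))
            current = []
    if any(l.strip() for l in current):
        chunks.append("\n".join(current))  # trailing top-level lines
    return chunks


if __name__ == "__main__":
    demo = (
        "static int add(int a, int b)\n{\n\treturn a + b;\n}\n\n"
        "int twice(int x)\n{\n\tif (x) {\n\t\treturn 2 * x;\n\t}\n\treturn 0;\n}\n"
    )
    for i, chunk in enumerate(chunk_functions(demo)):
        print(f"--- chunk {i} ---\n{chunk}")
```

Each chunk is then small enough to send to a model with its system prompt. A production chunker would more likely use a real C parser to handle macros and comments correctly.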

Model	Duration	Findings	CRITICAL (share of findings)	Claude-confirmed TPs
qwen2.5:14b	8.3 min	46	33 (72%)	1
deepseek-coder-v2:lite	20.7 min	103	5 (5%)	1
dolphin-mistral	19.5 min	113	43 (38%)	1

What the models found

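Each model produced 1 Claude-confirmed true positive. Because two of the models flagged the same function, the three true positives map to two distinct findings. We are currently in manual review to verify exploitability and assess impact.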

1 finding cross-confirmed by 2 models (qwen + deepseek independently flagged the same function) — this is the strongest signal we’ve seen across all our research. In wolfSSL, no finding was cross-confirmed.
1 finding unique to dolphin-mistral: the uncensored model caught a different class of issue that the censored models didn’t flag. This supports the multi-model approach, since each model appears to have different blind spots.
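The cross-confirmation check described above amounts to grouping candidate findings by location and counting distinct models. A minimal sketch follows, assuming each finding is a record with model, file, and function fields. The record layout and the file and function names are hypothetical placeholders, not the real findings.

```python
from collections import defaultdict


def cross_confirmed(findings: list[dict]) -> dict:
    """Map (file, function) to the set of models, keeping only 2+ model hits."""
    by_location = defaultdict(set)
    for f in findings:
        by_location[(f["file"], f["function"])].add(f["model"])
    return {loc: models for loc, models in by_location.items() if len(models) >= 2}


# Placeholder data: names below are invented for illustration only.
findings = [
    {"model": "qwen2.5:14b", "file": "example_a.c", "function": "func_one"},
    {"model": "deepseek-coder-v2:lite", "file": "example_a.c", "function": "func_one"},
    {"model": "dolphin-mistral", "file": "example_b.c", "function": "func_two"},
]

if __name__ == "__main__":
    for (path, func), models in cross_confirmed(findings).items():
        print(f"{path}:{func} flagged by {sorted(models)}")
```

Matching on function name alone is coarse. Two models can flag the same function for unrelated reasons, so a cross-confirmed location still needs manual review.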

We are not disclosing specific details until manual verification is complete. BearSSL is unmaintained, which means any confirmed findings cannot be patched upstream. We will publish full technical details after completing our assessment and determining the appropriate disclosure path.

wolfSSL vs BearSSL: what the comparison tells us

Metric	wolfSSL	BearSSL
Source lines scanned	202K	~15K
AFL++ executions	1.22B	419M
Code review (3 models)	5,971 findings	262 findings
Claude TPs (total)	6	3
Cross-model confirmed	0	1
Unique findings	1 (debug code)	2 (production code)
OSS-Fuzz	Yes	No

The comparison supports the hypothesis that less-audited code yields more findings, at least per line scanned: wolfSSL produced one unique finding across 202K lines, BearSSL two across roughly 15K. wolfSSL’s one confirmed finding was in debug-only code behind #ifdef, while BearSSL’s candidate findings, still pending manual verification, are in production code paths. The absence of continuous fuzzing (OSS-Fuzz) and of systematic defensive macros appears to make a difference, even in code written by one of the world’s best cryptographers.

Multi-model validation: confirmed

The wolfSSL research suggested that running multiple models adds noise without adding bugs on mature codebases. BearSSL tells a different story:

Cross-model confirmation works. Two models independently flagging the same function is a much stronger signal than a single model’s verdict. In wolfSSL we never saw this. In BearSSL we did.
Dolphin finds what censored models miss. The uncensored model caught a different class of issue entirely. For less-audited targets, the multi-model approach is justified.
BearSSL scans are fast. 8.3 minutes for qwen on 59K lines. The entire 3-model pipeline (code review + Claude filter) completed in under an hour. For small libraries, multi-model scanning has negligible cost.

Status

Phase	Status
AFL++ fuzzing (X.509 + skey)	Complete — 419M execs, 0 crashes
Code review Round 1 (qwen)	Complete — 46 findings, 1 Claude TP
Code review Round 2 (deepseek)	Complete — 103 findings, 1 Claude TP (cross-confirmed)
Code review Round 3 (dolphin)	Complete — 113 findings, 1 Claude TP (unique)
Manual verification & exploit development	In progress
Disclosure	Pending exploit development