LLM-Assisted Security Research — The Throughline
The methodology layer that ties the rest of the projects together. Every other HoneyLens project — the sensor, the fuzzer, the pentest framework, the ADR agent — uses LLMs for some part of the workflow. This track is where we measure how well that actually works: which models are good at what, where the failure modes are, and what the right human-in-the-loop boundaries are.
What We Measure
Five concrete benchmarks running across the project, each producing reproducible numbers we can talk about:
- Crash triage accuracy — 8-model ensemble on real AFL++ / ASAN crashes from the AFA project. Inputs: stacktrace + minimised reproducer + target source snippet. Outputs: severity class, root cause family, suggested CVSS. Ground truth from manual analysis. Measured: precision per class, inter-model agreement, cost-per-correct-classification.
- Honeypot event classification accuracy — the ai_analysis table on each sensor accumulates verdicts on real attacker traffic. Re-scored periodically against analyst spot-checks; novelty-score calibration tracked over time.
- Pentest finding verification — the pentest framework's verification phase uses a different model from the one that found the vulnerability. We track the false-positive rate (the verifier disagrees with the finder) as a function of model pairing.
- Detection rule generation — can an LLM produce a Suricata or YARA rule from a captured payload that catches the same payload class without over-fitting on the specific bytes? Tested against the HoneyLens local SID range and the upstream ET Open ruleset as ground-truth controls.
- Vision model comparison — published as /research/vision-models. How well do vision-capable LLMs handle screenshots of captive portals, CAPTCHA challenges, and dashboard UIs, compared with parsing the raw HTML?
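The three crash-triage measurements named in the list above (precision per class, inter-model agreement, cost-per-correct-classification) could be computed along the following lines. This is a minimal sketch, not the project's scoring code: the record layout, model names, severity labels, and cost figures are all hypothetical, and agreement is taken as plain pairwise agreement rather than a chance-corrected statistic.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records, one per (model, crash). None of these names or
# numbers come from the actual benchmark.
VERDICTS = [
    # (model, crash_id, predicted_severity, cost_usd)
    ("model_a", "crash-1", "high", 0.02),
    ("model_a", "crash-2", "low", 0.02),
    ("model_a", "crash-3", "high", 0.02),
    ("model_b", "crash-1", "high", 0.01),
    ("model_b", "crash-2", "high", 0.01),
    ("model_b", "crash-3", "low", 0.01),
]
# Hypothetical manually assigned ground truth.
GROUND_TRUTH = {"crash-1": "high", "crash-2": "low", "crash-3": "low"}


def precision_per_class(verdicts, truth):
    """Correct predictions of a class divided by all predictions of that class."""
    predicted = defaultdict(int)
    correct = defaultdict(int)
    for _model, crash, label, _cost in verdicts:
        predicted[label] += 1
        if truth[crash] == label:
            correct[label] += 1
    return {label: correct[label] / predicted[label] for label in predicted}


def pairwise_agreement(verdicts):
    """Fraction of shared crashes on which each pair of models gives the same label."""
    by_model = defaultdict(dict)
    for model, crash, label, _cost in verdicts:
        by_model[model][crash] = label
    result = {}
    for a, b in combinations(sorted(by_model), 2):
        shared = by_model[a].keys() & by_model[b].keys()
        if shared:
            same = sum(by_model[a][c] == by_model[b][c] for c in shared)
            result[(a, b)] = same / len(shared)
    return result


def cost_per_correct(verdicts, truth):
    """Total spend per model divided by its number of correct classifications."""
    cost = defaultdict(float)
    correct = defaultdict(int)
    for model, crash, label, usd in verdicts:
        cost[model] += usd
        correct[model] += int(truth[crash] == label)
    # A model with no correct answers has no finite cost per correct answer.
    return {m: cost[m] / correct[m] if correct[m] else float("inf") for m in cost}


if __name__ == "__main__":
    print(precision_per_class(VERDICTS, GROUND_TRUTH))
    print(pairwise_agreement(VERDICTS))
    print(cost_per_correct(VERDICTS, GROUND_TRUTH))
```

The same pairwise grouping would also fit the pentest verification benchmark, where the quantity of interest is how often a given finder and verifier pairing disagrees.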
Published Work
- Vision Model Comparison — vision LLM benchmark for security-relevant image tasks.
- Protocol Recognition Benchmark — how well do LLMs identify protocols from raw byte streams?
- wolfSSL Fuzzing — the multi-harness AFL++ campaign and crash-triage methodology.
- BearSSL Research — companion to the wolfSSL work; small-footprint TLS implementations under fuzz.
- TP-Link Router Pentest — the pentest framework's first published engagement.
Why This Matters
Most public LLM-for-security writing falls into one of two failure modes: vendor breathlessness (“our model found a CVE”) or doomerism (“LLMs will replace security engineers”). The interesting part is the middle ground: real, reproducible measurements of which parts of security work an LLM is good at, which parts it is actively dangerous on, and what the cost/benefit curve looks like for a single-person research operation that uses these models every day.
Everything published here is reproducible: inputs, prompts, model versions, dates, cost data. The point isn't to claim a single model is best; it's to give other researchers numbers they can argue with.
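One way to tie a published number back to the items listed above (inputs, prompts, model versions, dates, cost data) is to store a small manifest next to each result. The sketch below is an assumption about how such a record might look, not the project's actual format: the RunManifest class, its field names, and all sample values are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical per-run record; field names are illustrative only."""
    task: str
    model_version: str
    run_date: str        # ISO 8601 date string
    prompt_sha256: str   # hash of the exact prompt text
    input_sha256: str    # hash of the exact input bytes
    cost_usd: float


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(task, model_version, run_date, prompt, input_bytes, cost_usd):
    """Hash the prompt and input so the record pins them without embedding them."""
    return RunManifest(
        task=task,
        model_version=model_version,
        run_date=run_date,
        prompt_sha256=sha256_hex(prompt.encode("utf-8")),
        input_sha256=sha256_hex(input_bytes),
        cost_usd=cost_usd,
    )


# Placeholder values for illustration; none of them are real benchmark data.
manifest = build_manifest(
    task="crash-triage",
    model_version="example-model-2025-01",
    run_date="2025-01-01",
    prompt="Classify the severity of this crash.",
    input_bytes=b"example stack trace",
    cost_usd=0.02,
)
manifest_json = json.dumps(asdict(manifest), sort_keys=True, indent=2)

if __name__ == "__main__":
    print(manifest_json)
```

Hashing the prompt and input keeps the manifest small while still letting a third party check that they are re-running the benchmark on exactly the same material.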
What’s Next
- Yearly model-rotation benchmark — same task suite, fresh model lineup, see how the curves move.
- Public dataset of redacted honeypot events with human-labeled ground truth for third-party LLM benchmarking.
- Methodology paper covering the cost-bounded human-in-the-loop pattern used across all four implementation projects.