LLM-Assisted Security Research

continuous · methodology track · LLM Research

LLM-Assisted Security Research — The Throughline

This is the methodology layer that ties the other projects together. Every other HoneyLens project (the sensor, the fuzzer, the pentest framework, the ADR agent) uses LLMs for some part of its workflow. This track is where we measure how well that actually works: which models are good at which tasks, where they fail, and where the human-in-the-loop boundaries belong.

methodology · model benchmarking · 8-model comparison · published research

What We Measure

Five concrete benchmarks run across the project, each producing reproducible numbers:

Crash triage accuracy — 8-model ensemble on real AFL++ / ASAN crashes from the AFA project. Inputs: stacktrace + minimised reproducer + target source snippet. Outputs: severity class, root cause family, suggested CVSS. Ground truth from manual analysis. Measured: precision per class, inter-model agreement, cost-per-correct-classification.
Honeypot event classification accuracy — the ai_analysis table on each sensor accumulates verdicts on real attacker traffic. Re-scored periodically against analyst spot-checks; novelty-score calibration tracked over time.
Pentest finding verification — the pentest framework's verification phase uses a different model from the one that found the vulnerability. We track the finder's apparent false-positive rate, measured as the share of findings the verifier rejects, as a function of model pairing.
Detection rule generation — can an LLM produce a Suricata or YARA rule from a captured payload that catches the same payload class without over-fitting on the specific bytes? Tested against the HoneyLens local SID range and the upstream ET Open ruleset as ground-truth controls.
Vision model comparison — published as /research/vision-models. How well do vision-capable LLMs handle screenshots of captive portals, CAPTCHA challenges, and dashboard UIs, compared with parsing the raw HTML?
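
The crash-triage measurements listed above (precision per class, inter-model agreement, cost per correct classification) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's scoring code: the record layout, model names, labels, and dollar figures are all hypothetical, and agreement is computed here as simple pairwise percent agreement, which is only one of several reasonable choices.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: crash id -> severity class from manual analysis.
ground_truth = {"c1": "high", "c2": "low", "c3": "high", "c4": "medium"}

# Hypothetical verdicts: model name -> {crash id -> predicted severity class}.
predictions = {
    "model_a": {"c1": "high", "c2": "low", "c3": "medium", "c4": "medium"},
    "model_b": {"c1": "high", "c2": "high", "c3": "high", "c4": "medium"},
}

# Hypothetical total API spend per model for this batch, in USD.
cost_usd = {"model_a": 0.40, "model_b": 0.90}


def precision_per_class(truth, preds):
    """For each predicted class: correct predictions / all predictions of it."""
    predicted = defaultdict(int)
    correct = defaultdict(int)
    for crash_id, label in preds.items():
        predicted[label] += 1
        if truth[crash_id] == label:
            correct[label] += 1
    return {label: correct[label] / predicted[label] for label in predicted}


def pairwise_agreement(preds_a, preds_b):
    """Fraction of shared crashes on which two models give the same label."""
    shared = preds_a.keys() & preds_b.keys()
    if not shared:
        return 0.0
    return sum(preds_a[c] == preds_b[c] for c in shared) / len(shared)


def cost_per_correct(truth, preds, total_cost):
    """Total spend divided by the number of correct labels (inf if none)."""
    n_correct = sum(truth[c] == label for c, label in preds.items())
    return total_cost / n_correct if n_correct else float("inf")


if __name__ == "__main__":
    for name, preds in predictions.items():
        print(name, precision_per_class(ground_truth, preds),
              round(cost_per_correct(ground_truth, preds, cost_usd[name]), 3))
    for a, b in combinations(predictions, 2):
        print(a, b, pairwise_agreement(predictions[a], predictions[b]))
```

With an 8-model lineup the same `pairwise_agreement` call would run over all 28 model pairs; a chance-corrected statistic could be swapped in if raw percent agreement proves too generous on skewed class distributions.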

Published Work

Vision Model Comparison — vision LLM benchmark for security-relevant image tasks.
Protocol Recognition Benchmark — how well do LLMs identify protocols from raw byte streams?
wolfSSL Fuzzing — the multi-harness AFL++ campaign and crash-triage methodology.
BearSSL Research — companion to the wolfSSL work; small-footprint TLS implementations under fuzz.
TP-Link Router Pentest — the pentest framework's first published engagement.

Why This Matters

Most public LLM-for-security writing falls into one of two failure modes: vendor breathlessness (“our model found a CVE”) or doomerism (“LLMs will replace security engineers”). The interesting ground is in the middle: real, reproducible measurements of which parts of security work an LLM is good at, which parts it is actively dangerous for, and what the cost/benefit curve looks like for a single-person research operation that uses these models every day.

Everything published here is reproducible: each result comes with its inputs, prompts, model versions, dates, and cost data. The point isn't to claim a single model is best; it's to give other researchers numbers they can argue with.
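
One way to make those reproducibility fields concrete is a per-run manifest record. The sketch below is illustrative only: the `RunRecord` type, its field names, and the choice to store SHA-256 hashes of the input and prompt (rather than the full text) are assumptions made for the example, not a description of how HoneyLens stores its runs.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def sha256_hex(text: str) -> str:
    """Hex SHA-256 of a UTF-8 string, used to pin exact inputs and prompts."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical manifest entry for one model call in a benchmark run."""
    task: str            # e.g. "crash-triage"
    model_version: str   # exact model identifier as reported by the provider
    run_date: str        # ISO 8601 date of the run
    input_sha256: str    # hash of the input given to the model
    prompt_sha256: str   # hash of the prompt template used
    cost_usd: float      # spend attributed to this call


def make_record(task, model_version, run_date, input_text, prompt, cost_usd):
    """Build a record, hashing the input and prompt instead of storing them."""
    return RunRecord(task, model_version, run_date,
                     sha256_hex(input_text), sha256_hex(prompt), cost_usd)


def to_json_line(record: RunRecord) -> str:
    """Serialise with sorted keys so identical records give identical lines."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    # All values below are placeholders, not real benchmark data.
    rec = make_record("crash-triage", "example-model-v1", "2025-01-01",
                      "stack trace text", "Classify this crash.", 0.0123)
    print(to_json_line(rec))
```

Appending one such JSON line per call gives a log that a third party can diff against their own rerun; the hashes show whether the inputs and prompts matched without requiring the raw data to be published alongside every result.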

What’s Next

Yearly model-rotation benchmark — same task suite, fresh model lineup, see how the curves move.
Public dataset of redacted honeypot events with human-labeled ground truth for third-party LLM benchmarking.
Methodology paper covering the cost-bounded human-in-the-loop pattern used across all four implementation projects.