Protocol Recognition Benchmark

Complete 2026-03-31

LLM Protocol Recognition — Can AI Identify Network Protocols Better Than Shodan?

Automated security scanners regularly misidentify protocols. Shodan classified our IEC 104 SCADA honeypot as a Remcos RAT trojan, triggering an abuse report that killed our server. We built a benchmark to test whether LLMs can do better: 12 test cases covering standard protocols, tunneling, C2 beaconing, ICS/SCADA, and protocol evasion — tested across 5 models (3 local, 2 Anthropic).

Tags: Protocol analysis, LLM benchmark, ICS/SCADA, C2 detection, DNS tunneling, Ollama

Motivation: the Shodan misclassification

In March 2026, Shodan scanned our HoneyLens sensor and found IEC 60870-5-104 traffic on port 2404 (the standard SCADA telecontrol port). Instead of recognizing it as a legitimate ICS protocol, Shodan’s classifier tagged it as “Remcos Pro RAT trojan”. Netcraft filed an abuse report. OVH killed the server. Four layers of automation, zero verification. (Full incident write-up)

This raised a practical question: can LLMs reason about protocol structure well enough to avoid this kind of misclassification? Not just port-based heuristics, but actual payload analysis.

Test cases

The 12 test cases are designed to span the range from trivial to adversarial. Each case includes port, transport, payload hex/hints, flow statistics, TLS metadata, and behavioral flags. The LLM must identify the actual protocol, assess whether the traffic is normal or anomalous, and determine whether it is malicious.

ID	Category	Difficulty	Protocol	Malicious?
TC-001	Standard	Easy	HTTP/1.1 GET request	No
TC-002	Standard	Easy	DNS A query	No
TC-003	Tunneling	Hard	DNS tunneling (iodine/dnscat2)	Yes
TC-004	Evasion	Hard	HTTP C2 beaconing over TLS	Yes
TC-005	Mismatch	Medium	SSH on port 8080	Yes
TC-006	ICS	Hard	IEC 60870-5-104 (SCADA)	No
TC-007	ICS	Hard	Modbus write attack	Yes
TC-008	Tunneling	Hard	ICMP tunnel (ptunnel)	Yes
TC-009	Standard	Easy	TLS 1.3 (Chrome to Gmail)	No
TC-010	Evasion	Medium	Reverse shell on port 4444	Yes
TC-011	Mismatch	Medium	Plaintext HTTP on port 443	Yes
TC-012	Evasion	Hard	Cobalt Strike HTTPS beacon	Yes
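
As a rough illustration of how one such case could be represented, here is a minimal sketch. The class name, field names, and the example payload are assumptions made for this example, not the benchmark's actual schema. The point it shows is that the expected answers live alongside the observable data but are stripped before anything is sent to a model.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class FlowCase:
    """One benchmark case. Field names are illustrative, not the real schema."""
    case_id: str
    port: int
    transport: str                       # "tcp", "udp" or "icmp"
    payload_hex: str                     # first payload bytes, hex-encoded
    flow_stats: dict = field(default_factory=dict)
    tls: dict = field(default_factory=dict)
    flags: list = field(default_factory=list)
    # Ground truth, never shown to the model
    expected_protocol: str = ""
    expected_anomalous: bool = False
    expected_malicious: bool = False

    def model_input(self) -> str:
        """Serialize only the observable fields, hiding the expected answers."""
        data = asdict(self)
        visible = {k: v for k, v in data.items() if not k.startswith("expected_")}
        return json.dumps(visible, sort_keys=True)

# Hypothetical encoding of TC-006; the payload is an IEC 104 STARTDT act frame
tc006 = FlowCase(
    case_id="TC-006",
    port=2404,
    transport="tcp",
    payload_hex="680407000000",
    flags=["periodic_polling"],
    expected_protocol="IEC 60870-5-104",
)
print(tc006.model_input())
```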

Models tested

We tested 3 local models on DEV2 (RTX 3060 12GB, Ollama) and 2 Anthropic models via API. Every model received the same structured prompt with identical data and was required to answer in JSON.

Model	Type	Size	Notes
qwen2.5:14b	Local	14B	Best local model from code audit benchmarks
dolphin-mistral	Local	7B	High creativity, tested for recall
qwen2.5:3b	Local	3B	Smallest viable model
Claude Haiku 4.5	Anthropic	—	Fast, cheapest cloud option
Claude Sonnet 4.6	Anthropic	—	Mid-tier, best price/performance
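
The "same prompt, JSON required" setup can be sketched as below. This is a minimal sketch under assumptions: the prompt wording, the three key names, and the rule that a reply without those keys counts as a parse failure are all invented for illustration, and the actual model call is left out.

```python
import json
import re

# Assumed output schema; the benchmark's real keys are not published here
REQUIRED_KEYS = {"protocol", "anomalous", "malicious"}

def build_prompt(case_json: str) -> str:
    """Wrap one serialized test case in the shared instruction text."""
    return (
        "You are a network protocol analyst. Given the flow below, reply with "
        "a single JSON object with keys protocol (string), anomalous (boolean) "
        "and malicious (boolean). No prose.\n\nFLOW:\n" + case_json
    )

def parse_reply(reply: str):
    """Return the parsed dict, or None if the reply counts as a parse failure."""
    # Tolerate code fences or a sentence around the object: take the
    # outermost {...} span and try to decode it.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return None
    try:
        obj = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not REQUIRED_KEYS <= obj.keys():
        return None
    return obj
```

A reply made of prose only returns None, which is how a "JSON Parse" column like the one in the results could be counted.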

Results

Model	Protocol ID	Behavior	Malicious	JSON Parse	Avg Time	Verdict
Claude Sonnet 4.6	100%	100%	100%	100%	7.1s	Perfect
Claude Haiku 4.5	92%	100%	100%	100%	3.7s	Excellent
qwen2.5:14b	75%	100%	83%	100%	16.3s	Best local
qwen2.5:3b	58%	42%	42%	67%	54.5s	Unreliable
dolphin-mistral	8%	25%	17%	33%	92.9s	Not viable
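
The percentages in this table are consistent with whole-case counts out of 12 (for example 11/12 rounds to 92%). The sketch below shows that arithmetic; the function names and the per-case boolean keys are assumptions for illustration.

```python
TOTAL_CASES = 12

def pct(correct: int, total: int = TOTAL_CASES) -> int:
    """Percentage of correct cases, rounded to a whole number."""
    return round(100 * correct / total)

def score(rows: list[dict]) -> dict:
    """Aggregate per-case booleans into the four percentage columns."""
    columns = ("protocol_ok", "behavior_ok", "malicious_ok", "parsed")
    return {
        col: pct(sum(bool(r.get(col)) for r in rows), len(rows))
        for col in columns
    }

# Case counts that reproduce the percentages reported in the table
for correct in (12, 11, 10, 9, 8, 7, 5, 4, 3, 2, 1):
    print(f"{correct}/12 -> {pct(correct)}%")
```

With only 12 cases, each case is worth about 8 percentage points, so small differences between models should be read with care.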

The IEC 104 test (TC-006) — would the LLM have saved our server?

This is the test case that mirrors the real incident: IEC 104 traffic on port 2404 with the standard 0x68 start byte and SCADA polling patterns. Shodan classified this as malware.

Model	Protocol ID	Classified as	Correct?
Claude Sonnet 4.6	IEC 60870-5-104	Normal SCADA	Yes
Claude Haiku 4.5	IEC 60870-5-104	Normal SCADA	Yes
qwen2.5:14b	IEC 60870-5-104	Normal SCADA	Yes
qwen2.5:3b	Timeout / parse fail	—	Failed
dolphin-mistral	Timeout	—	Failed
Shodan (reference)	Remcos Pro RAT	Malware	Wrong

Three out of five LLMs correctly identified IEC 104 from the port, payload structure, and polling pattern in the test case. In this test, even a 14B local model running on a $300 GPU would likely have avoided the misclassification that killed our production server.
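
For context, the structural check that the 0x68 start byte enables is small. The sketch below follows the public IEC 60870-5-104 APCI layout (start byte, length octet, four control octets); it is an illustration written for this article, not code from the benchmark or from any scanner, and the function name is made up.

```python
# U-format function codes (first control octet) defined by IEC 60870-5-104
U_FUNCTIONS = {
    0x07: "STARTDT act", 0x0B: "STARTDT con",
    0x13: "STOPDT act",  0x23: "STOPDT con",
    0x43: "TESTFR act",  0x83: "TESTFR con",
}

def classify_iec104_apci(payload: bytes):
    """Return 'I', 'S' or 'U' for a plausible IEC 104 APCI header, else None."""
    if len(payload) < 6 or payload[0] != 0x68:
        return None                      # every APDU starts with 0x68
    apdu_len = payload[1]                # bytes that follow the length octet
    if not 4 <= apdu_len <= 253:
        return None
    ctrl1 = payload[2]                   # first control field octet
    if ctrl1 & 0x01 == 0:
        return "I"                       # information transfer (carries an ASDU)
    if apdu_len != 4:
        return None                      # S and U frames have no ASDU
    if ctrl1 & 0x03 == 0x01:
        return "S"                       # supervisory acknowledgement
    return "U" if ctrl1 in U_FUNCTIONS else None

frame = bytes.fromhex("680407000000")
print(classify_iec104_apci(frame), U_FUNCTIONS.get(frame[2]))
```

A check like this does not prove the traffic is benign, but it is enough to tell a well-formed SCADA session apart from an unrelated binary protocol that happens to share a port or a byte.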

Key findings

Sonnet is a better protocol classifier than Shodan

Perfect 12/12 across all dimensions. Correctly identified DNS tunneling from base64 subdomain patterns, Cobalt Strike from JA3 + beacon timing, IEC 104 from the 0x68 start byte, and ICMP tunneling from the MZ header in echo payloads. It reasons about structure, not just ports.
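
One of the signals named above, encoded-looking subdomain labels, can be approximated without an LLM by measuring label length and character entropy. This is a minimal sketch: the thresholds are illustrative guesses rather than tuned values, and the "last two labels are the registered domain" rule is a simplification.

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(text).values()
    )

def looks_like_tunnel(qname: str, min_len: int = 30,
                      min_entropy: float = 3.5) -> bool:
    """Flag query names whose longest subdomain label is long and high-entropy."""
    labels = qname.rstrip(".").split(".")
    # Naive: treat the last two labels as the registered domain
    subdomain_labels = labels[:-2] or [""]
    longest = max(subdomain_labels, key=len)
    return len(longest) >= min_len and shannon_entropy(longest) >= min_entropy

print(looks_like_tunnel("www.example.com"))
print(looks_like_tunnel("abcdefghijklmnopqrstuvwxyz0123456789.t.example.com"))
```

A heuristic like this only covers one pattern; the appeal of the LLM in this benchmark is that it combined several such signals without a rule being written for each.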

Haiku at 3.7s is fast enough for inline use

92% protocol accuracy at under 4 seconds per flow. The only miss was Cobalt Strike (TC-012) — it identified “suspicious HTTPS” but didn’t name the specific tool. For a pre-filter that triages flows before deeper analysis, Haiku’s speed/accuracy tradeoff is compelling.

qwen2.5:14b is the only viable local model

75% protocol ID with 100% behavior detection: in this benchmark it flagged every anomalous flow, even when it could not name the exact protocol. It missed DNS tunneling naming and reverse shell specifics, and its malicious/benign verdict was correct in 83% of cases. At about 16s per flow on a $300 GPU, it's practical for batch analysis.
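
To put the per-flow latency in batch terms, the average times from the results table convert to throughput as follows. The sketch assumes flows are processed one at a time per worker with no batching or queueing overhead, which is a simplification.

```python
# "Avg Time" column from the results table, in seconds per flow
AVG_SECONDS = {
    "Claude Sonnet 4.6": 7.1,
    "Claude Haiku 4.5": 3.7,
    "qwen2.5:14b": 16.3,
    "qwen2.5:3b": 54.5,
    "dolphin-mistral": 92.9,
}

def flows_per_hour(avg_seconds: float, workers: int = 1) -> float:
    """Sequential throughput for a given average latency per flow."""
    return workers * 3600 / avg_seconds

for model, seconds in AVG_SECONDS.items():
    print(f"{model}: {flows_per_hour(seconds):.0f} flows/hour")
```

For qwen2.5:14b this works out to roughly 220 flows per hour on a single GPU, which is why it only makes sense behind a filter that passes on a small share of traffic.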

dolphin-mistral can’t do structured output

Only 33% JSON parse rate: it produced verbose prose instead of the requested JSON format, then timed out on most cases. The model does well at creative text generation (such as fuzz seed generation) but fails at structured analytical tasks. Model selection must match the task type.

Behavior detection is easier than protocol identification

qwen2.5:14b scored 100% on “normal vs anomalous” but only 75% on naming the specific protocol. The model understands that “SSH on port 8080 is weird” and “periodic beaconing is suspicious” — the hard part is identifying exactly what tool or technique is being used.
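
The "SSH on port 8080 is weird" kind of judgement is the part that is easy to mechanize. A minimal sketch follows, using three well-known payload prefixes (the SSH version banner, the TLS handshake record header, and HTTP request methods); the port table and function names are assumptions for this example.

```python
HTTP_METHODS = {b"GET", b"POST", b"HEAD", b"PUT", b"DELETE", b"OPTIONS", b"PATCH"}

# Illustrative expectations only; a real deployment would configure this
EXPECTED_BY_PORT = {22: "ssh", 80: "http", 443: "tls", 8080: "http"}

def sniff_payload(payload: bytes) -> str:
    """Guess the protocol from the first bytes of the client payload."""
    if payload.startswith(b"SSH-"):
        return "ssh"                      # version banner, e.g. SSH-2.0-...
    if len(payload) >= 3 and payload[0] == 0x16 and payload[1] == 0x03:
        return "tls"                      # handshake record, major version 3
    if payload.split(b" ", 1)[0] in HTTP_METHODS:
        return "http"
    return "unknown"

def port_mismatch(port: int, payload: bytes) -> bool:
    """True when a recognized payload does not match what the port suggests."""
    expected = EXPECTED_BY_PORT.get(port)
    seen = sniff_payload(payload)
    return expected is not None and seen != "unknown" and seen != expected

print(port_mismatch(8080, b"SSH-2.0-OpenSSH_9.6\r\n"))   # like TC-005
print(port_mismatch(443, b"GET / HTTP/1.1\r\n"))          # like TC-011
```

This flags that something is off but says nothing about which tool produced the traffic, which matches the gap between the behavior and protocol ID scores.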

Practical implications

This benchmark suggests a viable architecture for LLM-augmented protocol classification:

Tier 1 (inline): eBPF captures flow metadata + first N payload bytes on selected ports. Rule engine (Suricata/Falco) handles known signatures at wire speed.
Tier 2 (local GPU): qwen2.5:14b on flagged flows — catches anomalies that signature-based tools miss. 16s latency is fine for non-real-time analysis.
Tier 3 (cloud): Haiku or Sonnet for high-confidence classification of ambiguous flows. About 4-7s per flow and 92-100% protocol ID accuracy in this benchmark, but each call costs money.
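
The escalation logic between the three tiers could look like the sketch below. Everything here is hypothetical: the classifier callables stand in for Suricata/Falco, a local model, and a cloud model, and the "confident" flag is an assumed field, since the benchmark did not ask models for a confidence value.

```python
from typing import Callable, Optional

# A classifier takes a flow dict and returns a verdict dict, or None
Classifier = Callable[[dict], Optional[dict]]

def triage(flow: dict, rules: Classifier, local_llm: Classifier,
           cloud_llm: Classifier) -> dict:
    """Escalate a flow through the tiers, stopping at the first usable answer."""
    verdict = rules(flow)                     # Tier 1: known signature or None
    if verdict is not None:
        return {**verdict, "tier": 1}
    verdict = local_llm(flow)                 # Tier 2: local GPU model
    if verdict is not None and verdict.get("confident"):
        return {**verdict, "tier": 2}
    verdict = cloud_llm(flow)                 # Tier 3: cloud model, paid per call
    return {**(verdict or {"protocol": "unknown"}), "tier": 3}

# Stub classifiers for demonstration only
def no_rule(flow):
    return None

def unsure_local(flow):
    return {"protocol": "suspicious HTTPS", "confident": False}

def cloud(flow):
    return {"protocol": "IEC 60870-5-104", "confident": True}

print(triage({"port": 2404}, no_rule, unsure_local, cloud))
```

The design choice this illustrates is that the expensive tiers only see what the cheaper ones could not settle.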

The key insight: LLMs don’t replace Suricata — they fill the gap between “known signature” and “unknown traffic that needs a human analyst”. That gap is exactly where the Shodan misclassification happened.

Next steps

Run with real sensor PCAP data from HoneyLens (not synthetic test cases)
Compare LLM results directly against Suricata/Zeek on the same traffic
Test with 5 runs per case for statistical significance (like vision experiment Round 3)
Build the eBPF selective dissector for chosen ports
Measure cost per flow: local GPU inference vs cloud API vs Suricata (free)
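
The case for repeated runs can be made concrete with a confidence interval. The sketch below computes a 95% Wilson score interval, treating qwen2.5:14b's 75% as 9 of 12 (inferred from the table) and comparing it with a hypothetical 45 of 60 from 5 runs per case. Repeated runs of the same case are not fully independent samples, so the narrower interval is optimistic.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a proportion (z = 1.96 for about 95%)."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(
        p * (1 - p) / total + z * z / (4 * total * total)
    ) / denom
    return center - half, center + half

for correct, total in ((9, 12), (45, 60)):
    low, high = wilson_interval(correct, total)
    print(f"{correct}/{total}: {low:.2f} to {high:.2f}")
```

With 12 cases, a 75% score is compatible with a true rate anywhere from under 50% to about 90%, so the ranking between nearby models is not yet settled.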

Cross-reference: best model per security task

Task	Best Local	Best Anthropic
Code audit (C/PHP)	qwen2.5:14b	Claude Opus 4.6
Fuzz seed generation	qwen2.5:14b + dolphin-mistral	Claude Opus 4.6
Hardware identification	minicpm-v	Claude Sonnet 4.6
Protocol recognition	qwen2.5:14b	Claude Sonnet 4.6
FP filtering / exploit dev	—	Claude Opus 4.6