Complete 2026-03-31

LLM Protocol Recognition — Can AI Identify Network Protocols Better Than Shodan?

Automated security scanners regularly misidentify protocols. Shodan classified our IEC 104 SCADA honeypot as a Remcos RAT trojan, triggering an abuse report that killed our server. We built a benchmark to test whether LLMs can do better: 12 test cases covering standard protocols, tunneling, C2 beaconing, ICS/SCADA, and protocol evasion — tested across 5 models (3 local, 2 Anthropic).

Protocol analysis · LLM benchmark · ICS/SCADA · C2 detection · DNS tunneling · Ollama

Motivation: the Shodan misclassification

In March 2026, Shodan scanned our HoneyLens sensor and found IEC 60870-5-104 traffic on port 2404 (the standard SCADA telecontrol port). Instead of recognizing it as a legitimate ICS protocol, Shodan’s classifier tagged it as “Remcos Pro RAT trojan”. Netcraft filed an abuse report. OVH killed the server. Four layers of automation, zero verification. (Full incident write-up)

This raised a practical question: can LLMs reason about protocol structure well enough to avoid this kind of misclassification? Not just port-based heuristics, but actual payload analysis.

Test cases

The 12 test cases are designed to cover the spectrum from trivial to adversarial. Each case includes port, transport, payload hex/hints, flow statistics, TLS metadata, and behavioral flags. The LLM must identify the actual protocol, assess whether the traffic is normal or anomalous, and determine whether it is malicious.

ID | Category | Difficulty | Protocol | Malicious?
TC-001 | Standard | Easy | HTTP/1.1 GET request | No
TC-002 | Standard | Easy | DNS A query | No
TC-003 | Tunneling | Hard | DNS tunneling (iodine/dnscat2) | Yes
TC-004 | Evasion | Hard | HTTP C2 beaconing over TLS | Yes
TC-005 | Mismatch | Medium | SSH on port 8080 | Yes
TC-006 | ICS | Hard | IEC 60870-5-104 (SCADA) | No
TC-007 | ICS | Hard | Modbus write attack | Yes
TC-008 | Tunneling | Hard | ICMP tunnel (ptunnel) | Yes
TC-009 | Standard | Easy | TLS 1.3 (Chrome to Gmail) | No
TC-010 | Evasion | Medium | Reverse shell on port 4444 | Yes
TC-011 | Mismatch | Medium | Plaintext HTTP on port 443 | Yes
TC-012 | Evasion | Hard | Cobalt Strike HTTPS beacon | Yes
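
A test case of the shape described above could be represented as follows. This is a minimal sketch: the class name, field names, and the model_input helper are assumptions rather than the benchmark's actual schema, and the payload shown is an illustrative IEC 104 STARTDT frame, not the real TC-006 data.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ProtocolTestCase:
    """Hypothetical container for one benchmark case."""
    case_id: str
    category: str
    difficulty: str
    port: int
    transport: str
    payload_hex: str
    flow_stats: dict = field(default_factory=dict)
    tls_metadata: dict = field(default_factory=dict)
    behavior_flags: list = field(default_factory=list)
    # Ground truth, kept out of the prompt so the model cannot read the answer.
    expected_protocol: str = ""
    expected_malicious: bool = False

    def model_input(self) -> str:
        """Serialize only the observable fields that a model should see."""
        data = asdict(self)
        for hidden in ("category", "difficulty",
                       "expected_protocol", "expected_malicious"):
            data.pop(hidden)
        return json.dumps(data, indent=2, sort_keys=True)


tc006 = ProtocolTestCase(
    case_id="TC-006", category="ICS", difficulty="Hard",
    port=2404, transport="tcp",
    payload_hex="680407000000",  # illustrative frame, not the benchmark payload
    expected_protocol="IEC 60870-5-104", expected_malicious=False,
)
print(tc006.model_input())
```

Keeping the ground truth and the difficulty label out of the serialized input is the main design point: the same object can then drive both prompting and scoring.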

Models tested

3 local models on DEV2 (RTX 3060 12GB, Ollama) and 2 Anthropic models via API. All models received the same structured prompt with identical data. JSON output required.

Model | Type | Size | Notes
qwen2.5:14b | Local | 14B | Best local model from code audit benchmarks
dolphin-mistral | Local | 7B | High creativity, tested for recall
qwen2.5:3b | Local | 3B | Smallest viable model
Claude Haiku 4.5 | Anthropic | - | Fast, cheapest cloud option
Claude Sonnet 4.6 | Anthropic | - | Mid-tier, best price/performance
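
For the local models, sending an identical structured prompt with JSON output enforced might look like the sketch below. It uses Ollama's /api/generate endpoint with "stream": false and "format": "json". The prompt wording, the response keys, the default localhost URL, and the helper names are assumptions, not the prompt actually used in the benchmark.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default address

# Hypothetical prompt; the benchmark's real prompt is not reproduced here.
PROMPT_TEMPLATE = (
    "You are a network protocol analyst. Given the flow below, reply with "
    'JSON only, using the keys "protocol", "behavior" ("normal" or '
    '"anomalous") and "malicious" (true or false).\n\nFlow:\n{flow}'
)


def build_request(model: str, flow_json: str) -> urllib.request.Request:
    """Build the HTTP request for one model and one flow, without sending it."""
    body = {
        "model": model,
        "prompt": PROMPT_TEMPLATE.format(flow=flow_json),
        "stream": False,   # return one complete response object
        "format": "json",  # ask Ollama to constrain the output to JSON
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_model(model: str, flow_json: str, timeout_s: float = 120.0) -> str:
    """Send the request and return the raw model text (requires a running Ollama)."""
    request = build_request(model, flow_json)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.loads(resp.read())["response"]
```

Building the request separately from sending it makes it easy to confirm that every model receives byte-identical flow data.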

Results

Model | Protocol ID | Behavior | Malicious | JSON Parse | Avg Time | Verdict
Claude Sonnet 4.6 | 100% | 100% | 100% | 100% | 7.1s | Perfect
Claude Haiku 4.5 | 92% | 100% | 100% | 100% | 3.7s | Excellent
qwen2.5:14b | 75% | 100% | 83% | 100% | 16.3s | Best local
qwen2.5:3b | 58% | 42% | 42% | 67% | 54.5s | Unreliable
dolphin-mistral | 8% | 25% | 17% | 33% | 92.9s | Not viable
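
The percentages above are per-dimension hit rates over the 12 cases. A minimal sketch of that aggregation follows, assuming that a response which fails to parse counts as wrong on every dimension (the write-up does not state how such cases were scored). The CaseResult type and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Hypothetical per-case outcome for one model."""
    parsed: bool                 # response was valid JSON
    protocol_ok: bool = False    # named the right protocol
    behavior_ok: bool = False    # normal vs anomalous call was right
    malicious_ok: bool = False   # malicious vs benign call was right


def percent(hits: int, total: int) -> int:
    """Whole-number percentage, rounded the way the results table is."""
    return round(100 * hits / total)


def summarize(results: list) -> dict:
    """Aggregate per-case outcomes into the four accuracy columns."""
    total = len(results)
    if total == 0:
        raise ValueError("no results to summarize")
    return {
        "protocol_id": percent(sum(r.parsed and r.protocol_ok for r in results), total),
        "behavior": percent(sum(r.parsed and r.behavior_ok for r in results), total),
        "malicious": percent(sum(r.parsed and r.malicious_ok for r in results), total),
        "json_parse": percent(sum(r.parsed for r in results), total),
    }


# Eleven fully correct cases plus one protocol miss gives 11/12, shown as 92%.
example = [CaseResult(True, True, True, True)] * 11 + [CaseResult(True, False, True, True)]
print(summarize(example))
```

With only 12 cases, each miss moves a column by roughly 8 percentage points, which is worth keeping in mind when comparing models.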

The IEC 104 test (TC-006) — would the LLM have saved our server?

This is the test case that mirrors the real incident: IEC 104 traffic on port 2404 with the standard 0x68 start byte and SCADA polling patterns. Shodan classified this as malware.

Model | Protocol ID | Classified as | Correct?
Claude Sonnet 4.6 | IEC 60870-5-104 | Normal SCADA | Yes
Claude Haiku 4.5 | IEC 60870-5-104 | Normal SCADA | Yes
qwen2.5:14b | IEC 60870-5-104 | Normal SCADA | Yes
qwen2.5:3b | Timeout / parse fail | - | Failed
dolphin-mistral | Timeout | - | Failed
Shodan (reference) | Remcos Pro RAT | Malware | Wrong

Three out of five LLMs correctly identified IEC 104 from the payload structure alone. Even a 14B local model running on a $300 GPU would have avoided the misclassification that killed our production server.
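
To make "payload structure" concrete, here is a minimal sketch of a structural check for the IEC 104 APCI header: the 0x68 start byte, a length octet, and a control field whose low bits select the I, S, or U frame format. The function name and return shape are illustrative, and it inspects only the first APDU in a buffer. It is not a full dissector.

```python
from typing import Optional

START_BYTE = 0x68

# U-format control octets defined by IEC 60870-5-104.
U_FUNCTIONS = {
    0x07: "STARTDT act", 0x0B: "STARTDT con",
    0x13: "STOPDT act", 0x23: "STOPDT con",
    0x43: "TESTFR act", 0x83: "TESTFR con",
}


def parse_iec104_apci(data: bytes) -> Optional[dict]:
    """Return basic APCI fields if data starts with a plausible IEC 104 APDU."""
    if len(data) < 6 or data[0] != START_BYTE:
        return None
    apdu_length = data[1]  # counts the octets after the length field
    if not (4 <= apdu_length <= 253) or len(data) < 2 + apdu_length:
        return None
    control1 = data[2]
    if (control1 & 0x01) == 0:
        frame = {"format": "I"}          # numbered information transfer
    elif (control1 & 0x03) == 0x01:
        frame = {"format": "S"}          # supervisory acknowledgement
    else:
        function = U_FUNCTIONS.get(control1)
        if function is None:
            return None
        frame = {"format": "U", "function": function}
    if frame["format"] in ("S", "U") and apdu_length != 4:
        return None                      # S and U frames carry no ASDU
    frame["apdu_length"] = apdu_length
    return frame


print(parse_iec104_apci(bytes.fromhex("680407000000")))
```

The start byte alone is weak evidence, since any payload beginning with the ASCII letter "h" also starts with 0x68. Checking the length and control field together is what makes the match structural.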

Key findings

Sonnet is a better protocol classifier than Shodan

Perfect 12/12 across all dimensions. Correctly identified DNS tunneling from base64 subdomain patterns, Cobalt Strike from JA3 + beacon timing, IEC 104 from the 0x68 start byte, and ICMP tunneling from the MZ header in echo payloads. It reasons about structure, not just ports.
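
The "base64 subdomain pattern" signal can be approximated without an LLM by measuring how long and how random the longest DNS label is. The sketch below uses Shannon entropy. The thresholds (30 characters, 3.5 bits per character) and the function names are illustrative assumptions, not values taken from the benchmark.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Entropy in bits per character of the given string."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_tunnel_label(qname: str, min_len: int = 30,
                            min_entropy: float = 3.5) -> bool:
    """Flag query names whose longest label is both long and high-entropy."""
    labels = qname.rstrip(".").split(".")
    longest = max(labels, key=len)
    return len(longest) >= min_len and shannon_entropy(longest) >= min_entropy


print(looks_like_tunnel_label("www.example.com"))
print(looks_like_tunnel_label(
    "dGhpcyBpcyBhIHR1bm5lbGVkIHBheWxvYWQgZXhhbXBsZQ.t.example.com"))
```

A heuristic like this is cheap but blunt: CDN hostnames and some telemetry domains also use long random labels, which is where structural reasoning over the whole flow adds value.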

Haiku at 3.7s is fast enough for inline use

92% protocol accuracy at under 4 seconds per flow. The only miss was Cobalt Strike (TC-012) — it identified “suspicious HTTPS” but didn’t name the specific tool. For a pre-filter that triages flows before deeper analysis, Haiku’s speed/accuracy tradeoff is compelling.

qwen2.5:14b is the only viable local model

75% protocol ID (9 of 12) with 100% behavior detection: it flagged every anomaly in the test set, even when it couldn’t name the exact protocol. Its misses included DNS tunneling naming and reverse shell specifics, and its malicious/benign call was right in 83% of cases. At 16s per flow on a $300 GPU, it’s practical for batch analysis.

dolphin-mistral can’t do structured output

Only 33% JSON parse rate — it produced verbose prose instead of the requested JSON format, then timed out on most cases. This model excels at creative text generation (fuzz seed creativity) but fails at structured analytical tasks. Model selection must match the task type.
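
A harness for this kind of benchmark needs a rule for what counts as a successful JSON parse. The sketch below is one lenient option: it takes the outermost {...} span from the response, so JSON wrapped in prose still passes, and then requires a fixed set of keys. The key names and the leniency are assumptions. The benchmark's actual parsing rule is not stated and may have been stricter.

```python
import json
from typing import Optional

# Hypothetical response schema.
REQUIRED_KEYS = {"protocol", "behavior", "malicious"}


def parse_verdict(raw: str) -> Optional[dict]:
    """Return the model's verdict as a dict, or None if no usable JSON is found."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        return None  # pure prose, or a response truncated before the closing brace
    try:
        obj = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not REQUIRED_KEYS.issubset(obj):
        return None
    return obj


print(parse_verdict('Sure! {"protocol": "SSH", "behavior": "anomalous", "malicious": true}'))
print(parse_verdict("This traffic appears to be some kind of remote shell."))
```

Whichever rule is chosen, applying it identically to every model is what keeps the JSON Parse column comparable.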

Behavior detection is easier than protocol identification

qwen2.5:14b scored 100% on “normal vs anomalous” but only 75% on naming the specific protocol. The model understands that “SSH on port 8080 is weird” and “periodic beaconing is suspicious” — the hard part is identifying exactly what tool or technique is being used.
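
"Periodic beaconing" can be given a simple numeric form: the coefficient of variation of the gaps between connections. The sketch below is one way to compute it. The 0.1 jitter threshold and the function names are illustrative assumptions, and real C2 frameworks often add deliberate jitter to defeat exactly this kind of check.

```python
import statistics
from typing import Optional


def beacon_jitter(timestamps: list) -> Optional[float]:
    """Coefficient of variation of inter-arrival gaps (0.0 means perfectly regular)."""
    if len(timestamps) < 3:
        return None  # need at least two gaps to talk about regularity
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    mean_gap = statistics.fmean(gaps)
    if mean_gap == 0:
        return None
    return statistics.pstdev(gaps) / mean_gap


def looks_periodic(timestamps: list, max_jitter: float = 0.1) -> bool:
    """True when connections arrive at nearly constant intervals."""
    jitter = beacon_jitter(timestamps)
    return jitter is not None and jitter <= max_jitter


print(looks_periodic([0, 60, 120, 180, 240]))   # one connection every 60 s
print(looks_periodic([0, 3, 50, 52, 300]))      # bursty, human-like timing
```

This captures why the anomaly call is the easier half of the task: regular timing is measurable from metadata alone, while naming the tool behind it requires recognizing fingerprints such as JA3 hashes.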

Practical implications

This benchmark suggests a viable architecture for LLM-augmented protocol classification:

  • Tier 1 (inline): eBPF captures flow metadata + first N payload bytes on selected ports. Rule engine (Suricata/Falco) handles known signatures at wire speed.
  • Tier 2 (local GPU): qwen2.5:14b on flagged flows — catches anomalies that signature-based tools miss. 16s latency is fine for non-real-time analysis.
  • Tier 3 (cloud): Haiku or Sonnet for high-confidence classification of ambiguous flows. 3-7s per flow and 92-100% protocol ID accuracy in this benchmark, but costs money.

The key insight: LLMs don’t replace Suricata — they fill the gap between “known signature” and “unknown traffic that needs a human analyst”. That gap is exactly where the Shodan misclassification happened.
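
The escalation logic between the three tiers can be sketched as a small routing function. Everything here is hypothetical: the FlowState fields, the Tier names, and in particular the rule for what counts as "ambiguous" (flagged anomalous by the local model without a named protocol) are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    LOCAL_LLM = 2   # e.g. qwen2.5:14b on a local GPU
    CLOUD_LLM = 3   # e.g. Haiku or Sonnet via API


@dataclass
class FlowState:
    """What is known about a flow after the stages run so far."""
    signature_match: bool                   # Tier 1 rule engine recognized it
    local_behavior: Optional[str] = None    # "normal" / "anomalous" after Tier 2
    local_protocol: Optional[str] = None    # protocol named by Tier 2, if any


def next_tier(flow: FlowState) -> Optional[Tier]:
    """Return the next tier to consult, or None when no escalation is needed."""
    if flow.signature_match:
        return None                 # known signature, handled at wire speed
    if flow.local_behavior is None:
        return Tier.LOCAL_LLM       # unknown traffic, not yet seen by Tier 2
    if flow.local_behavior == "anomalous" and not flow.local_protocol:
        return Tier.CLOUD_LLM       # flagged but unnamed: ambiguous, escalate
    return None


print(next_tier(FlowState(signature_match=False)))
print(next_tier(FlowState(False, local_behavior="anomalous")))
```

The point of the ordering is cost: each tier sees only what the cheaper tier before it could not settle.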

Next steps

  • Run with real sensor PCAP data from HoneyLens (not synthetic test cases)
  • Compare LLM results directly against Suricata/Zeek on the same traffic
  • Test with 5 runs per case for statistical significance (like vision experiment Round 3)
  • Build the eBPF selective dissector for chosen ports
  • Measure cost per flow: local GPU inference vs cloud API vs Suricata (free)

Cross-reference: best model per security task

Task | Best Local | Best Anthropic
Code audit (C/PHP) | qwen2.5:14b | Claude Opus 4.6
Fuzz seed generation | qwen2.5:14b + dolphin-mistral | Claude Opus 4.6
Hardware identification | minicpm-v | Claude Sonnet 4.6
Protocol recognition | qwen2.5:14b | Claude Sonnet 4.6
FP filtering / exploit dev | - | Claude Opus 4.6