LLM Protocol Recognition — Can AI Identify Network Protocols Better Than Shodan?
Automated security scanners regularly misidentify protocols. Shodan classified our IEC 104 SCADA honeypot as a Remcos RAT trojan, triggering an abuse report that killed our server. We built a benchmark to test whether LLMs can do better: 12 test cases covering standard protocols, tunneling, C2 beaconing, ICS/SCADA, and protocol evasion — tested across 5 models (3 local, 2 Anthropic).
Motivation: the Shodan misclassification
In March 2026, Shodan scanned our HoneyLens sensor and found IEC 60870-5-104 traffic on port 2404 (the standard SCADA telecontrol port). Instead of recognizing it as a legitimate ICS protocol, Shodan’s classifier tagged it as “Remcos Pro RAT trojan”. Netcraft filed an abuse report. OVH killed the server. Four layers of automation, zero verification. (Full incident write-up)
This raised a practical question: can LLMs reason about protocol structure well enough to avoid this kind of misclassification? Not just port-based heuristics, but actual payload analysis.
Test cases
12 test cases designed to cover the spectrum from trivial to adversarial. Each case includes port, transport, payload hex/hints, flow statistics, TLS metadata, and behavioral flags. The LLM must identify the actual protocol, assess whether the traffic is normal or anomalous, and determine if it’s malicious.
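As a rough illustration of what one such case could look like in code, here is a minimal sketch. The class name `FlowCase`, the field names, and the sample flow statistics are assumptions made for this example, not the benchmark's actual schema. The point it shows is that ground-truth labels stay out of the data sent to the model.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class FlowCase:
    # Inputs shown to the model (field names are illustrative assumptions)
    case_id: str
    port: int
    transport: str
    payload_hex: str
    flow_stats: dict = field(default_factory=dict)
    tls_metadata: dict = field(default_factory=dict)
    behavioral_flags: list = field(default_factory=list)
    # Ground truth, used only for scoring and never sent to the model
    expected_protocol: str = ""
    expected_malicious: bool = False

GROUND_TRUTH_FIELDS = ("expected_protocol", "expected_malicious")

def to_model_input(case: FlowCase) -> str:
    """Serialize a case to JSON with the ground-truth fields removed."""
    data = asdict(case)
    for name in GROUND_TRUTH_FIELDS:
        data.pop(name)
    return json.dumps(data, indent=2)

# Hypothetical rendering of TC-006; the flow statistics are made-up values.
tc006 = FlowCase(
    case_id="TC-006",
    port=2404,
    transport="tcp",
    payload_hex="680407000000",  # IEC 104 U-format frame (STARTDT act)
    flow_stats={"packets": 40, "mean_interval_s": 1.0},
    expected_protocol="IEC 60870-5-104",
    expected_malicious=False,
)

print(to_model_input(tc006))
```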
| ID | Category | Difficulty | Protocol | Malicious? |
|---|---|---|---|---|
| TC-001 | Standard | Easy | HTTP/1.1 GET request | No |
| TC-002 | Standard | Easy | DNS A query | No |
| TC-003 | Tunneling | Hard | DNS tunneling (iodine/dnscat2) | Yes |
| TC-004 | Evasion | Hard | HTTP C2 beaconing over TLS | Yes |
| TC-005 | Mismatch | Medium | SSH on port 8080 | Yes |
| TC-006 | ICS | Hard | IEC 60870-5-104 (SCADA) | No |
| TC-007 | ICS | Hard | Modbus write attack | Yes |
| TC-008 | Tunneling | Hard | ICMP tunnel (ptunnel) | Yes |
| TC-009 | Standard | Easy | TLS 1.3 (Chrome to Gmail) | No |
| TC-010 | Evasion | Medium | Reverse shell on port 4444 | Yes |
| TC-011 | Mismatch | Medium | Plaintext HTTP on port 443 | Yes |
| TC-012 | Evasion | Hard | Cobalt Strike HTTPS beacon | Yes |
Models tested
3 local models on DEV2 (RTX 3060 12GB, Ollama) and 2 Anthropic models via API. All models received the same structured prompt with identical data. JSON output required.
| Model | Type | Size | Notes |
|---|---|---|---|
| qwen2.5:14b | Local | 14B | Best local model from code audit benchmarks |
| dolphin-mistral | Local | 7B | High creativity, tested for recall |
| qwen2.5:3b | Local | 3B | Smallest viable model |
| Claude Haiku 4.5 | Anthropic | — | Fast, cheapest cloud option |
| Claude Sonnet 4.6 | Anthropic | — | Mid-tier, best price/performance |
Results
| Model | Protocol ID | Behavior | Malicious | JSON Parse | Avg Time | Verdict |
|---|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 100% | 100% | 100% | 100% | 7.1s | Perfect |
| Claude Haiku 4.5 | 92% | 100% | 100% | 100% | 3.7s | Excellent |
| qwen2.5:14b | 75% | 100% | 83% | 100% | 16.3s | Best local |
| qwen2.5:3b | 58% | 42% | 42% | 67% | 54.5s | Unreliable |
| dolphin-mistral | 8% | 25% | 17% | 33% | 92.9s | Not viable |
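The percentages in the table are consistent with simple per-dimension fractions over 12 cases, rounded to the nearest integer (for example, 11/12 rounds to 92%). The sketch below shows that arithmetic. The result-record layout and the `score` helper are assumptions for illustration, not the benchmark's actual scoring code.

```python
DIMENSIONS = ("protocol", "behavior", "malicious", "parsed")

def score(results):
    """Return the percentage of cases marked correct for each dimension.

    `results` is a list of dicts with one boolean per dimension
    (a hypothetical layout; a missing key counts as a miss).
    """
    total = len(results)
    return {
        dim: round(100 * sum(1 for r in results if r.get(dim)) / total)
        for dim in DIMENSIONS
    }

# Eleven fully correct cases plus one where only the protocol name was missed.
all_correct = {"protocol": True, "behavior": True, "malicious": True, "parsed": True}
one_miss = dict(all_correct, protocol=False)
example = [all_correct] * 11 + [one_miss]

print(score(example))
# {'protocol': 92, 'behavior': 100, 'malicious': 100, 'parsed': 100}
```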
The IEC 104 test (TC-006) — would the LLM have saved our server?
This is the test case that mirrors the real incident: IEC 104 traffic on port 2404 with the standard 0x68 start byte and SCADA polling patterns. Shodan classified this as malware.
| Model | Protocol ID | Classified as | Correct? |
|---|---|---|---|
| Claude Sonnet 4.6 | IEC 60870-5-104 | Normal SCADA | Yes |
| Claude Haiku 4.5 | IEC 60870-5-104 | Normal SCADA | Yes |
| qwen2.5:14b | IEC 60870-5-104 | Normal SCADA | Yes |
| qwen2.5:3b | Timeout / parse fail | — | Failed |
| dolphin-mistral | Timeout | — | Failed |
| Shodan (reference) | Remcos Pro RAT | Malware | Wrong |
Three of the five LLMs correctly identified IEC 104 from the test-case data (port 2404 plus the payload structure). Even a 14B local model running on a $300 GPU would have avoided the misclassification that killed our server.
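For readers unfamiliar with the frame layout the models were reasoning about, here is a minimal sketch of a structural check on the IEC 104 APCI header: start byte 0x68, a length octet, then four control octets whose low bits select the I, S, or U format. The function name `classify_apci` is hypothetical, and this is a simplified structural check under those assumptions, not a full dissector and not what any of the tested models does internally.

```python
# U-format control codes (first control octet) defined by IEC 60870-5-104
U_FUNCTIONS = {
    0x07: "STARTDT act", 0x0B: "STARTDT con",
    0x13: "STOPDT act",  0x23: "STOPDT con",
    0x43: "TESTFR act",  0x83: "TESTFR con",
}

def classify_apci(frame: bytes):
    """Return the APCI format of an IEC 104 frame, or None if it does not fit."""
    if len(frame) < 6 or frame[0] != 0x68:
        return None                      # too short, or wrong start byte
    length = frame[1]                    # octets that follow the length field
    if length < 4 or length > 253 or len(frame) < 2 + length:
        return None
    ctrl1 = frame[2]
    if ctrl1 & 0x01 == 0:
        return "I-format"                # numbered information transfer
    if length != 4:
        return None                      # S and U frames carry no ASDU
    if ctrl1 & 0x03 == 0x01:
        return "S-format"                # supervisory acknowledgement
    return "U-format: " + U_FUNCTIONS.get(ctrl1, "unknown")

print(classify_apci(bytes.fromhex("680407000000")))   # U-format: STARTDT act
print(classify_apci(b"GET / HTTP/1.1\r\n"))           # None
```

A check like this says only that the bytes are shaped like IEC 104. Whether the polling pattern is normal remains a separate question.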
Key findings
Sonnet is a better protocol classifier than Shodan
Perfect 12/12 across all dimensions. Correctly identified DNS tunneling from base64 subdomain patterns, Cobalt Strike from JA3 + beacon timing, IEC 104 from the 0x68 start byte, and ICMP tunneling from the MZ header in echo payloads. It reasons about structure, not just ports.
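To make "reasoning about structure" concrete for the DNS tunneling case, here is a minimal heuristic sketch: long, high-entropy leftmost labels are a common tell for encoded data in subdomains. The thresholds (`min_len=30`, `min_entropy=3.5`) and the sample query name are illustrative assumptions, not values from the benchmark, and a real detector would need tuning against benign traffic.

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def looks_like_tunnel(qname: str, min_len: int = 30, min_entropy: float = 3.5) -> bool:
    """Flag query names whose leftmost label is both long and high-entropy."""
    label = qname.split(".")[0]
    return len(label) >= min_len and shannon_entropy(label) >= min_entropy

# The first label is the base64 encoding of a 30-byte ASCII string.
suspicious = "dGhpcyBpcyBleGZpbHRyYXRlZCBkYXRhIGNodW5r.t.example.com"

print(looks_like_tunnel("www.example.com"))  # False
print(looks_like_tunnel(suspicious))         # True
```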
Haiku at 3.7s is fast enough for inline use
92% protocol accuracy at under 4 seconds per flow. The only miss was Cobalt Strike (TC-012) — it identified “suspicious HTTPS” but didn’t name the specific tool. For a pre-filter that triages flows before deeper analysis, Haiku’s speed/accuracy tradeoff is compelling.
qwen2.5:14b is the only viable local model
75% protocol ID with 100% behavior detection: in this benchmark it flagged every anomalous flow, even when it could not name the exact protocol. It missed the DNS tunneling name and the reverse shell specifics, and its malicious/benign verdict was correct in 83% of cases. At 16s per flow on a $300 GPU, it’s practical for batch analysis.
dolphin-mistral can’t do structured output
Only 33% JSON parse rate: it produced verbose prose instead of the requested JSON format, then timed out on most cases. This model does well at creative text generation (it is one of our picks for fuzz seed generation) but fails at structured analytical tasks. Model selection must match the task type.
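A parse check of this kind can be sketched as follows. This is a minimal example that tolerates code fences or stray text around the JSON object but rejects prose-only replies. The required keys (`protocol`, `behavior`, `malicious`) and the function name `parse_verdict` are assumptions for illustration, not the benchmark's actual output schema.

```python
import json

# Hypothetical output schema; the real prompt may require different keys.
REQUIRED_KEYS = {"protocol", "behavior", "malicious"}

def parse_verdict(raw: str):
    """Return the verdict dict, or None if the reply is not usable JSON."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        return None                      # no JSON object in the reply
    try:
        verdict = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(verdict, dict) or not REQUIRED_KEYS.issubset(verdict):
        return None                      # valid JSON but missing fields
    return verdict

fenced = '```json\n{"protocol": "IEC 60870-5-104", "behavior": "normal", "malicious": false}\n```'
prose = "The traffic appears to be SCADA polling on port 2404."

print(parse_verdict(fenced))  # the parsed dict
print(parse_verdict(prose))   # None
```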
Behavior detection is easier than protocol identification
qwen2.5:14b scored 100% on “normal vs anomalous” but only 75% on naming the specific protocol. The model understands that “SSH on port 8080 is weird” and “periodic beaconing is suspicious” — the hard part is identifying exactly what tool or technique is being used.
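The "SSH on port 8080 is weird" kind of judgment can also be approximated without an LLM. The sketch below compares what the first payload bytes look like against what the port conventionally carries. The port table and function names are illustrative assumptions, and the sniffing covers only three protocols.

```python
# Conventional services per port (a deliberately tiny, illustrative table)
EXPECTED_BY_PORT = {22: "ssh", 80: "http", 443: "tls", 8080: "http"}

HTTP_METHODS = {b"GET", b"POST", b"HEAD", b"PUT", b"DELETE", b"OPTIONS", b"PATCH"}

def sniff_protocol(payload: bytes):
    """Guess the protocol from the first payload bytes, or return None."""
    if payload.startswith(b"SSH-"):
        return "ssh"                     # SSH identification string
    if payload[:2] == b"\x16\x03":
        return "tls"                     # TLS handshake record, version 3.x
    if payload.split(b" ", 1)[0] in HTTP_METHODS:
        return "http"
    return None

def is_mismatch(port: int, payload: bytes) -> bool:
    """True when the payload is recognisable and contradicts the port."""
    observed = sniff_protocol(payload)
    expected = EXPECTED_BY_PORT.get(port)
    return observed is not None and expected is not None and observed != expected

print(is_mismatch(8080, b"SSH-2.0-OpenSSH_9.6\r\n"))   # True  (like TC-005)
print(is_mismatch(443, b"GET / HTTP/1.1\r\n\r\n"))     # True  (like TC-011)
print(is_mismatch(22, b"SSH-2.0-OpenSSH_9.6\r\n"))     # False
```

A check like this only raises the flag. Naming the tool behind the traffic is the part where the models diverged.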
Practical implications
This benchmark suggests a viable architecture for LLM-augmented protocol classification:
- Tier 1 (inline): eBPF captures flow metadata + first N payload bytes on selected ports. Rule engine (Suricata/Falco) handles known signatures at wire speed.
- Tier 2 (local GPU): qwen2.5:14b on flagged flows — catches anomalies that signature-based tools miss. 16s latency is fine for non-real-time analysis.
- Tier 3 (cloud): Haiku or Sonnet for high-confidence classification of ambiguous flows. 3.7-7.1s per flow and 92-100% protocol ID accuracy in this benchmark, but every call costs money.
The key insight: LLMs don’t replace Suricata — they fill the gap between “known signature” and “unknown traffic that needs a human analyst”. That gap is exactly where the Shodan misclassification happened.
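The tiering above could be wired together with a small routing function. The sketch below is one possible reading of the three tiers. The names (`Tier`, `next_tier`), the verdict fields, and the escalation rule (send a flow to the cloud when the local model flags an anomaly but cannot name the protocol) are all assumptions for illustration, not an implemented design.

```python
from enum import Enum

class Tier(Enum):
    RULES = 1       # Suricata/Falco signature match
    LOCAL_GPU = 2   # local model, e.g. qwen2.5:14b
    CLOUD = 3       # Haiku or Sonnet via API

def next_tier(signature_matched: bool, local_verdict=None) -> Tier:
    """Pick the tier that should produce the final verdict for one flow."""
    if signature_matched:
        return Tier.RULES                # known signature: done at tier 1
    if local_verdict is None:
        return Tier.LOCAL_GPU            # flagged but not analysed yet
    anomalous = local_verdict.get("behavior") == "anomalous"
    named = local_verdict.get("protocol") not in (None, "", "unknown")
    if anomalous and not named:
        return Tier.CLOUD                # something is wrong but unnamed
    return Tier.LOCAL_GPU                # local verdict is good enough

print(next_tier(True))                                                   # Tier.RULES
print(next_tier(False))                                                  # Tier.LOCAL_GPU
print(next_tier(False, {"behavior": "anomalous", "protocol": "unknown"}))  # Tier.CLOUD
```

The escalation rule follows the benchmark's observation that the local model flags anomalies reliably but names protocols less reliably, so only the unnamed anomalies incur cloud cost.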
Next steps
- Run with real sensor PCAP data from HoneyLens (not synthetic test cases)
- Compare LLM results directly against Suricata/Zeek on the same traffic
- Test with 5 runs per case for statistical significance (like vision experiment Round 3)
- Build the eBPF selective dissector for chosen ports
- Measure cost per flow: local GPU inference vs cloud API vs Suricata (free)
Cross-reference: best model per security task
| Task | Best Local | Best Anthropic |
|---|---|---|
| Code audit (C/PHP) | qwen2.5:14b | Claude Opus 4.6 |
| Fuzz seed generation | qwen2.5:14b + dolphin-mistral | Claude Opus 4.6 |
| Hardware identification | minicpm-v | Claude Sonnet 4.6 |
| Protocol recognition | qwen2.5:14b | Claude Sonnet 4.6 |
| FP filtering / exploit dev | — | Claude Opus 4.6 |