Vision Model Comparison — Hardware Pentest Tool Identification
Can LLM vision models identify hardware security tools from a photograph? We ran three rounds: Round 1 (9 devices, 4 models), Round 2 (12 devices, 6 models), and Round 3 (controlled retest with identical image for all models). Same prompt, no hints — just “identify what you see.” Round 3 revealed that earlier local model results were inflated by low-resolution input simplifying the scene.
Round 1: 9 devices, 4 models (2026-03-29)
Local models ran on DEV2 (RTX 3060 12GB) via Ollama. Claude Opus 4.6 via API as the reference.
Round 1: Saleae Logic Analyzer, RTL-SDR Blog V3, ACR1252U NFC Reader, TI LaunchPad, Bus Pirate, STM32 Nucleo, USB-UART adapters, SanDisk USB, jumper wires.
| Model | Size | Time | Found | Correct | Halluc. | Verdict |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | — | 9 | 7 | 0 | Best overall |
| minicpm-v | 5.0 GB | 20.2s | 3 | 3 | 1 | Best local |
| llava:7b | 4.0 GB | 20.7s | 19 | 2 | ~15 | Dangerous |
| moondream | 1.0 GB | 4.6s | 1 | 1 partial | 0 | Not viable |
Round 2: 12 devices, 6 models — uncontrolled (2026-03-31)
Better photo, more devices, expanded model lineup. However, this round had a methodological flaw: all models received a compressed 479KB JPEG rather than the original 15MB PNG. Results are included for completeness but Round 3 supersedes them.
Round 2: 12 devices including debuggers (ST-Link, CMSIS-DAP), SOIC test clips, development boards, RTL-SDR, Saleae, NFC reader.
| Model | Type | Time | Found | Correct | Halluc. | OCR | Verdict |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4.6 | Anthropic | 19.1s | 12 | 10 | 1 | Good | Best overall |
| ChatGPT (OpenAI) | OpenAI | — | 13 | 9 | 2 | Good | Competitive |
| Claude Haiku 4.5 | Anthropic | 7.3s | 8 | 4 | 2 | Partial | Fast but imprecise |
| minicpm-v | Local | 31.8s | 9 | 5 | 1 | Partial | Best local |
| moondream | Local | 2.8s | 1 | 0 | 1 | None | Not viable |
| llava:7b | Local | timeout | 0 | 0 | — | — | Crashed |
Round 3: controlled, 5 runs per model (2026-03-31)
Rounds 1–2 each used a single run per model — statistically meaningless for models with high output variance. Round 3 fixes this with 5 runs per model on the exact same 1.1MB JPEG (2400px, q95). This gives us mean accuracy, variance, and eliminates lucky/unlucky single-run bias. llava:7b is excluded (crashed in Round 2).
| Model | Type | Avg Time | Avg Correct /12 | Range | Avg Halluc. | OCR | Verdict |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4.6 | Anthropic | 20.8s | 10.8 | 10–11 | 1.0 | Excellent | Best overall |
| Claude Haiku 4.5 | Anthropic | 5.7s | 6.8 | 5–9 | 0.2 | Partial | Fast triage |
| minicpm-v | Local | 134s | 5.8 | 5–7 | 0 | Partial | Viable for triage |
| moondream | Local | 1.2s | 0 | 0–0 | 5.0 | None | Loop bug |
5 runs each, same image, same prompt. minicpm-v run 5 timed out (300s) and is excluded from its average. ChatGPT excluded (web UI, cannot control image format).
Why 5 runs changed everything
Our single-run Round 3 scored minicpm-v at 0/12 — leading us to conclude local vision models were “not viable”. With 5 runs, minicpm-v actually averages 5.8/12, identifying breadboards, RTL-SDR, ST-Link, SparkFun, and Bluetooth adapters across runs. The single-run result was an unlucky outlier where the model fixated on one device. Single-run benchmarks are unreliable for stochastic models.
Key findings across all rounds
Sonnet is remarkably consistent
10–11 correct devices across all 5 runs with near-zero variance. Reads “RTL-SDR.COM”, “Saleae”, “ST-LINK V2”, “CMSIS-DAP” every single time. The only consistent error: hallucinating a “Packet Squirrel” from the NFC reader’s black case (5/5 runs).
Haiku has high variance
Ranged from 5 to 9 correct devices across runs — nearly 2× variation. Sometimes reads Saleae and ST-Link labels, sometimes misses them entirely. At 5.7s average it’s great for quick checks, but don’t trust any single output.
minicpm-v is better than we thought
The single-run Round 3 was misleading. Across 5 runs, minicpm-v averages 5.8/12 with zero hallucinations. It consistently finds breadboard, RTL-SDR, and SparkFun, and sometimes catches ST-Link, CMSIS-DAP, and Bluetooth. Slow (134s avg) but genuinely useful for local triage. The caveat: one run timed out at 300s, showing reliability issues.
Single-run benchmarks are dangerous
minicpm-v went from “5/12 best local” (Round 2) to “0/12 not viable” (Round 3 single-run) to “5.8/12 viable for triage” (Round 3 ×5). Each conclusion felt definitive at the time. Always run multiple iterations and report the mean ± range.
The Packet Squirrel is a systematic hallucination
Sonnet hallucinated a Hak5 Packet Squirrel in 5 out of 5 runs. This isn’t random — it’s a deterministic misidentification of the NFC reader’s black case. The shape and size match Sonnet’s training data perfectly. Anthropic models need a “confidence threshold” mechanism for hardware ID tasks.
Cross-reference: best model per security task
| Task | Best Local Model | Best Anthropic Model |
|---|---|---|
| Code audit (C/PHP) | qwen2.5:14b | Claude Opus 4.6 |
| Fuzz seed generation | qwen2.5:14b + dolphin-mistral | Claude Opus 4.6 |
| Hardware identification | minicpm-v | Claude Sonnet 4.6 |
| FP filtering / exploit dev | — | Claude Opus 4.6 |