Vision Model Comparison

Complete 2026-03-31

Vision Model Comparison — Hardware Pentest Tool Identification

Can LLM vision models identify hardware security tools from a photograph? We ran three rounds: Round 1 (9 devices, 4 models), Round 2 (12 devices, 6 models), and Round 3 (a controlled retest with an identical image for all models, 5 runs each). Same prompt, no hints, just “identify what you see.” A first single-run pass of Round 3 suggested that earlier local-model results had been inflated by low-resolution input simplifying the scene; repeating each model 5 times showed that pass was an outlier.

Vision LLM benchmark Hardware pentest OCR Ollama

Round 1: 9 devices, 4 models (2026-03-29)

Local models ran on DEV2 (RTX 3060 12GB) via Ollama. Claude Opus 4.6, called via API, served as the reference.
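For readers who want to reproduce the local side, here is a minimal sketch of sending one image to a model through Ollama's `/api/generate` endpoint. It is not the harness used for these results: the endpoint URL assumes a default local install, the prompt string is a paraphrase, and `build_payload`, `ask_model`, and the 300s timeout default are illustrative choices.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is listening on its default local port.
OLLAMA_URL = "http://localhost:11434/api/generate"
# Assumption: paraphrase of the benchmark prompt, not the exact wording used.
PROMPT = "Identify what you see."


def build_payload(model: str, image_bytes: bytes, prompt: str = PROMPT) -> dict:
    """Build a non-streaming generate request carrying one base64-encoded image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def ask_model(model: str, image_bytes: bytes, timeout_s: float = 300.0) -> str:
    """POST the request and return the model's text answer.

    timeout_s is passed to urlopen, so it bounds each blocking socket
    operation rather than the total wall-clock time of the call.
    """
    body = json.dumps(build_payload(model, image_bytes)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.loads(resp.read())["response"]
```

Keeping the payload builder separate from the network call makes it easy to log or inspect exactly what each model was sent.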

9 hardware pentest tools on a white surface

Round 1: Saleae Logic Analyzer, RTL-SDR Blog V3, ACR1252U NFC Reader, TI LaunchPad, Bus Pirate, STM32 Nucleo, USB-UART adapters, SanDisk USB, jumper wires.

Model	Size / Provider	Time	Found	Correct	Halluc.	Verdict
Claude Opus 4.6	Anthropic	—	9	7	0	Best overall
minicpm-v	5.0 GB	20.2s	3	3	1	Best local
llava:7b	4.0 GB	20.7s	19	2	~15	Dangerous
moondream	1.0 GB	4.6s	1	1 partial	0	Not viable

Round 2: 12 devices, 6 models — uncontrolled (2026-03-31)

Better photo, more devices, expanded model lineup. However, this round had a methodological flaw: all models received a compressed 479KB JPEG rather than the original 15MB PNG. Results are included for completeness but Round 3 supersedes them.
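One way to catch this kind of input mismatch early is to fingerprint the bytes actually sent to each model and compare them before scoring anything. The sketch below is a suggestion, not part of the original harness; `image_fingerprint` and `assert_same_input` are hypothetical helper names.

```python
import hashlib


def image_fingerprint(data: bytes) -> str:
    """Return size and SHA-256 so the exact input can be logged per request."""
    return f"{len(data)} bytes sha256={hashlib.sha256(data).hexdigest()}"


def assert_same_input(sent: dict[str, bytes]) -> str:
    """Raise if the models did not all receive byte-identical image data.

    `sent` maps a model name to the image bytes that were sent to it.
    Returns the shared fingerprint when all inputs match.
    """
    fingerprints = {model: image_fingerprint(data) for model, data in sent.items()}
    unique = set(fingerprints.values())
    if len(unique) != 1:
        raise ValueError(f"models received different inputs: {fingerprints}")
    return unique.pop()
```

A check like this only covers what the harness sends. It cannot detect resizing or recompression that a provider applies on its side.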

Hardware pentest toolkit — Round 2 photo

Round 2: 12 devices including debuggers (ST-Link, CMSIS-DAP), SOIC test clips, development boards, RTL-SDR, Saleae, NFC reader.

Model	Type	Time	Found	Correct	Halluc.	OCR	Verdict
Claude Sonnet 4.6	Anthropic	19.1s	12	10	1	Good	Best overall
ChatGPT (OpenAI)	OpenAI	—	13	9	2	Good	Competitive
Claude Haiku 4.5	Anthropic	7.3s	8	4	2	Partial	Fast but imprecise
minicpm-v	Local	31.8s	9	5	1	Partial	Best local
moondream	Local	2.8s	1	0	1	None	Not viable
llava:7b	Local	timeout	0	0	—	—	Crashed

Round 3: controlled, 5 runs per model (2026-03-31)

Rounds 1–2 each used a single run per model, which is statistically meaningless for models with high output variance. Round 3 fixes this with 5 runs per model on the exact same 1.1MB JPEG (2400px, q95). This gives us mean accuracy and spread per model, and reduces lucky/unlucky single-run bias. llava:7b is excluded (it failed to complete Round 2).

Model	Type	Avg Time	Avg Correct /12	Range	Avg Halluc.	OCR	Verdict
Claude Sonnet 4.6	Anthropic	20.8s	10.8	10–11	1.0	Excellent	Best overall
Claude Haiku 4.5	Anthropic	5.7s	6.8	5–9	0.2	Partial	Fast triage
minicpm-v	Local	134s	5.8	5–7	0	Partial	Viable for triage
moondream	Local	1.2s	0	0–0	5.0	None	Loop bug

5 runs each, same image, same prompt. minicpm-v run 5 timed out (300s) and is excluded from its average, which therefore covers the 4 completed runs. ChatGPT is excluded because the web UI gives no control over the image format.
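The aggregation described here can be sketched as follows: timed-out runs are recorded as `None` and dropped before computing the mean and range. The `summarize` helper and the per-run values in the example are illustrative. They are not the recorded per-run data.

```python
from statistics import mean
from typing import Optional


def summarize(correct_per_run: list[Optional[int]]) -> dict:
    """Mean and range over completed runs; None marks a timed-out run."""
    completed = [c for c in correct_per_run if c is not None]
    if not completed:
        raise ValueError("no completed runs to summarize")
    return {
        "runs": len(correct_per_run),
        "completed": len(completed),
        "mean": mean(completed),
        "range": (min(completed), max(completed)),
    }


# Illustrative values only: four completed runs and one timeout.
example = summarize([5, 7, 5, 6, None])
print(example)
```

Reporting `completed` next to `runs` keeps the exclusion visible, so a reader can tell a 4-run average from a 5-run one.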

Why 5 runs changed everything

Our single-run Round 3 scored minicpm-v at 0/12 — leading us to conclude local vision models were “not viable”. With 5 runs, minicpm-v actually averages 5.8/12, identifying breadboards, RTL-SDR, ST-Link, SparkFun, and Bluetooth adapters across runs. The single-run result was an unlucky outlier where the model fixated on one device. Single-run benchmarks are unreliable for stochastic models.

Key findings across all rounds

Sonnet is remarkably consistent

10–11 correct devices across all 5 runs with near-zero variance. Reads “RTL-SDR.COM”, “Saleae”, “ST-LINK V2”, “CMSIS-DAP” every single time. The only consistent error: hallucinating a “Packet Squirrel” from the NFC reader’s black case (5/5 runs).

Haiku has high variance

Ranged from 5 to 9 correct devices across runs — nearly 2× variation. Sometimes reads Saleae and ST-Link labels, sometimes misses them entirely. At 5.7s average it’s great for quick checks, but don’t trust any single output.

minicpm-v is better than we thought

The single-run Round 3 was misleading. Across 5 runs, minicpm-v averages 5.8/12 with zero hallucinations. It consistently finds breadboard, RTL-SDR, and SparkFun, and sometimes catches ST-Link, CMSIS-DAP, and Bluetooth. Slow (134s avg) but genuinely useful for local triage. The caveat: one run timed out at 300s, showing reliability issues.

Single-run benchmarks are dangerous

minicpm-v went from “5/12 best local” (Round 2) to “0/12 not viable” (Round 3 single-run) to “5.8/12 viable for triage” (Round 3 ×5). Each conclusion felt definitive at the time. Always run multiple iterations and report the mean ± range.

The Packet Squirrel is a systematic hallucination

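Sonnet hallucinated a Hak5 Packet Squirrel in 5 out of 5 runs. This isn’t random: it’s a systematic, repeatable misidentification of the NFC reader’s black case. A plausible explanation is that the case’s shape and size resemble Packet Squirrel images in Sonnet’s training data, though we have not verified this. For hardware ID tasks, output from models like Sonnet would benefit from a “confidence threshold” mechanism, since repeated runs alone will not filter out an error this consistent.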
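When a ground-truth device list is available, systematic errors like this can be separated from one-off noise by counting how often each false label recurs across runs. The sketch below assumes each run has already been reduced to a set of normalized device labels; `recurring_false_positives` and the labels in the example are hypothetical.

```python
from collections import Counter


def recurring_false_positives(
    runs: list[set[str]], ground_truth: set[str], min_fraction: float = 1.0
) -> dict[str, int]:
    """Count labels outside the ground truth, keeping those that appear
    in at least min_fraction of the runs."""
    counts = Counter(label for run in runs for label in run - ground_truth)
    threshold = min_fraction * len(runs)
    return {label: n for label, n in counts.items() if n >= threshold}


# Hypothetical labels for illustration, not the recorded model outputs.
truth = {"RTL-SDR", "Saleae", "NFC reader"}
runs = [
    {"RTL-SDR", "Saleae", "Packet Squirrel"},
    {"RTL-SDR", "Packet Squirrel"},
    {"RTL-SDR", "Saleae", "Packet Squirrel", "USB hub"},
    {"Saleae", "Packet Squirrel"},
    {"RTL-SDR", "Saleae", "Packet Squirrel"},
]
systematic = recurring_false_positives(runs, truth)
print(systematic)
```

With the default `min_fraction=1.0`, only labels that appear in every run are flagged. Lowering it surfaces frequent but not universal false positives as well.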

Cross-reference: best model per security task

Task	Best Local Model	Best Anthropic Model
Code audit (C/PHP)	qwen2.5:14b	Claude Opus 4.6
Fuzz seed generation	qwen2.5:14b + dolphin-mistral	Claude Opus 4.6
Hardware identification	minicpm-v	Claude Sonnet 4.6
FP filtering / exploit dev	—	Claude Opus 4.6