Complete 2026-03-31

Vision Model Comparison — Hardware Pentest Tool Identification

Can LLM vision models identify hardware security tools from a photograph? We ran three rounds: Round 1 (9 devices, 4 models), Round 2 (12 devices, 6 models), and Round 3 (controlled retest with identical image for all models). Same prompt, no hints — just “identify what you see.” Round 3 revealed that earlier local model results were inflated by low-resolution input simplifying the scene.

Vision LLM benchmark Hardware pentest OCR Ollama

Round 1: 9 devices, 4 models (2026-03-29)

Local models ran on DEV2 (RTX 3060 12GB) via Ollama. Claude Opus 4.6 via API as the reference.

9 hardware pentest tools on a white surface

Round 1: Saleae Logic Analyzer, RTL-SDR Blog V3, ACR1252U NFC Reader, TI LaunchPad, Bus Pirate, STM32 Nucleo, USB-UART adapters, SanDisk USB, jumper wires.

ModelSizeTimeFoundCorrectHalluc.Verdict
Claude Opus 4.6Anthropic970Best overall
minicpm-v5.0 GB20.2s331Best local
llava:7b4.0 GB20.7s192~15Dangerous
moondream1.0 GB4.6s11 partial0Not viable

Round 2: 12 devices, 6 models — uncontrolled (2026-03-31)

Better photo, more devices, expanded model lineup. However, this round had a methodological flaw: all models received a compressed 479KB JPEG rather than the original 15MB PNG. Results are included for completeness but Round 3 supersedes them.

Hardware pentest toolkit — Round 2 photo

Round 2: 12 devices including debuggers (ST-Link, CMSIS-DAP), SOIC test clips, development boards, RTL-SDR, Saleae, NFC reader.

ModelTypeTimeFoundCorrectHalluc.OCRVerdict
Claude Sonnet 4.6Anthropic19.1s12101GoodBest overall
ChatGPT (OpenAI)OpenAI1392GoodCompetitive
Claude Haiku 4.5Anthropic7.3s842PartialFast but imprecise
minicpm-vLocal31.8s951PartialBest local
moondreamLocal2.8s101NoneNot viable
llava:7bLocaltimeout00Crashed

Round 3: controlled, 5 runs per model (2026-03-31)

Rounds 1–2 each used a single run per model — statistically meaningless for models with high output variance. Round 3 fixes this with 5 runs per model on the exact same 1.1MB JPEG (2400px, q95). This gives us mean accuracy, variance, and eliminates lucky/unlucky single-run bias. llava:7b is excluded (crashed in Round 2).

Model Type Avg Time Avg Correct /12 Range Avg Halluc. OCR Verdict
Claude Sonnet 4.6 Anthropic 20.8s 10.8 10–11 1.0 Excellent Best overall
Claude Haiku 4.5 Anthropic 5.7s 6.8 5–9 0.2 Partial Fast triage
minicpm-v Local 134s 5.8 5–7 0 Partial Viable for triage
moondream Local 1.2s 0 0–0 5.0 None Loop bug

5 runs each, same image, same prompt. minicpm-v run 5 timed out (300s) and is excluded from its average. ChatGPT excluded (web UI, cannot control image format).

Why 5 runs changed everything

Our single-run Round 3 scored minicpm-v at 0/12 — leading us to conclude local vision models were “not viable”. With 5 runs, minicpm-v actually averages 5.8/12, identifying breadboards, RTL-SDR, ST-Link, SparkFun, and Bluetooth adapters across runs. The single-run result was an unlucky outlier where the model fixated on one device. Single-run benchmarks are unreliable for stochastic models.

Key findings across all rounds

Sonnet is remarkably consistent

10–11 correct devices across all 5 runs with near-zero variance. Reads “RTL-SDR.COM”, “Saleae”, “ST-LINK V2”, “CMSIS-DAP” every single time. The only consistent error: hallucinating a “Packet Squirrel” from the NFC reader’s black case (5/5 runs).

Haiku has high variance

Ranged from 5 to 9 correct devices across runs — nearly 2× variation. Sometimes reads Saleae and ST-Link labels, sometimes misses them entirely. At 5.7s average it’s great for quick checks, but don’t trust any single output.

minicpm-v is better than we thought

The single-run Round 3 was misleading. Across 5 runs, minicpm-v averages 5.8/12 with zero hallucinations. It consistently finds breadboard, RTL-SDR, and SparkFun, and sometimes catches ST-Link, CMSIS-DAP, and Bluetooth. Slow (134s avg) but genuinely useful for local triage. The caveat: one run timed out at 300s, showing reliability issues.

Single-run benchmarks are dangerous

minicpm-v went from “5/12 best local” (Round 2) to “0/12 not viable” (Round 3 single-run) to “5.8/12 viable for triage” (Round 3 ×5). Each conclusion felt definitive at the time. Always run multiple iterations and report the mean ± range.

The Packet Squirrel is a systematic hallucination

Sonnet hallucinated a Hak5 Packet Squirrel in 5 out of 5 runs. This isn’t random — it’s a deterministic misidentification of the NFC reader’s black case. The shape and size match Sonnet’s training data perfectly. Anthropic models need a “confidence threshold” mechanism for hardware ID tasks.

Cross-reference: best model per security task

TaskBest Local ModelBest Anthropic Model
Code audit (C/PHP)qwen2.5:14bClaude Opus 4.6
Fuzz seed generationqwen2.5:14b + dolphin-mistralClaude Opus 4.6
Hardware identificationminicpm-vClaude Sonnet 4.6
FP filtering / exploit devClaude Opus 4.6