Calibration uses NVIDIA Triton Llama-3-8B A10 and vLLM Qwen2.5-7B RTX 4090 data
I built a tiny benchmark that runs entirely on a laptop: no GPU, no API key, no cloud account required. The code lives on GitHub at github.com/sandeepmb/fraud-agents-benchmark, and every figure in this piece comes straight from that repo, so you can rerun the tests yourself. The payment-authorization hot path still belongs to classical machine-learning models, while the newer agent-style LLMs find their niche on the slower, asynchronous side.
On a single CPU core the gradient‑boosted scorer hits a p99 latency of 0.15 ms, comfortably under the roughly 100 ms budget of an ISO 8583 authorization, whereas a calibrated LLM simulator stalls at about 1,200 ms. At 50,000 transactions per second for an hour the GBDT scorer costs roughly $54, but a gpt‑4o‑mini‑class model jumps to $16,200 and a frontier model like Claude Sonnet 4.6 climbs to $351,000. Determinism also diverges: 500 identical inputs yield one distinct GBDT score but 498 distinct LLM outputs, even with temperature set to 0.
Agents, however, prove useful for tasks such as SAR drafting, evidence gathering through MCP‑typed tools, and an agent‑as‑judge pass before a human signs off.
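The hourly cost figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below is not taken from the repo; it only divides the published hourly totals by the transaction count to show the implied per-transaction cost and the multiple over the GBDT baseline. The function and variable names are my own.

```python
# Back-of-the-envelope check of the hourly cost figures quoted in the article.
# Only the TPS, the one-hour window, and the dollar totals come from the text;
# all names here are illustrative.
TPS = 50_000
SECONDS_PER_HOUR = 3_600

hourly_cost_usd = {
    "GBDT scorer": 54,
    "gpt-4o-mini-class model": 16_200,
    "Claude Sonnet 4.6": 351_000,
}

# 50,000 transactions per second sustained for one hour.
transactions_per_hour = TPS * SECONDS_PER_HOUR


def implied_cost_per_txn(hourly_usd, txns=transactions_per_hour):
    """Per-transaction cost implied by an hourly total."""
    return hourly_usd / txns


def cost_multiple(name, baseline="GBDT scorer"):
    """How many times more expensive `name` is than the baseline."""
    return hourly_cost_usd[name] / hourly_cost_usd[baseline]


if __name__ == "__main__":
    print(f"transactions per hour: {transactions_per_hour:,}")
    for name, usd in hourly_cost_usd.items():
        per_txn = implied_cost_per_txn(usd)
        print(f"{name}: ${per_txn:.7f} per txn, {cost_multiple(name):,.0f}x GBDT")
```

Working from the quoted totals, an hour at 50,000 TPS is 180 million transactions, so the gpt-4o-mini-class figure is 300 times the GBDT cost and the Claude Sonnet 4.6 figure is 6,500 times.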
The calibration draws on three public sources: NVIDIA Triton's published time-to-first-token figures for Llama-3-8B q4 on an A10, vLLM benchmarks for Qwen2.5-7B on an RTX 4090, and the p50 and p99 numbers OpenAI and Anthropic publish for their hosted APIs. The simulator also produces non-deterministic score outputs on identical inputs, which is what we need for the determinism experiment.
Break #1: Inference Sits Outside the ISO 8583 Budget
The latency test makes 5,000 single-transaction calls to the GBDT scorer on one CPU core at batch size 1 and takes 400 draws from the calibrated LLM latency distribution.
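A latency comparison of this shape can be sketched as follows. This is a minimal illustration, not the repo's code: `stand_in_scorer` is a hypothetical placeholder for the GBDT model, and the lognormal shape and its parameters are assumptions chosen for illustration rather than the article's actual calibration. Only the sample counts (5,000 and 400) and the roughly 100 ms budget come from the text.

```python
import math
import random
import time

BUDGET_MS = 100.0  # approximate ISO 8583 authorization budget cited in the article


def percentile(samples, q):
    """Nearest-rank percentile of `samples`, with q in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def time_calls_ms(fn, arg, n):
    """Call fn(arg) n times, one at a time, and return per-call latency in ms."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        fn(arg)
        latencies.append((time.perf_counter() - start) * 1_000)
    return latencies


def stand_in_scorer(features):
    # HYPOTHETICAL placeholder for the GBDT scorer: a fixed weighted sum.
    weights = (0.3, -0.2, 0.5)
    return sum(w * x for w, x in zip(weights, features))


def draw_llm_latencies_ms(n, median_ms=500.0, sigma=0.4, seed=0):
    # ASSUMPTION: lognormal latency with illustrative parameters,
    # not the calibration used in the benchmark repo.
    rng = random.Random(seed)
    return [rng.lognormvariate(math.log(median_ms), sigma) for _ in range(n)]


gbdt_ms = time_calls_ms(stand_in_scorer, (1.0, 2.0, 3.0), 5_000)
llm_ms = draw_llm_latencies_ms(400)

if __name__ == "__main__":
    for name, samples in (("stand-in scorer", gbdt_ms), ("simulated LLM", llm_ms)):
        p99 = percentile(samples, 99)
        verdict = "within" if p99 <= BUDGET_MS else "outside"
        print(f"{name}: p99 = {p99:.3f} ms ({verdict} the {BUDGET_MS:.0f} ms budget)")
```

Timing each call individually, rather than timing a batch and dividing, matches the batch-size-1 setup described above and keeps tail latencies visible in the p99.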
Why this matters
The fraud-detection benchmark runs on a laptop with no GPU, no API key, and no cloud account. It is simple and open, so anyone can verify the numbers, which is rare in today's AI reporting.
The calibration pulls from NVIDIA Triton's Llama-3-8B time-to-first-token on an A10, vLLM's Qwen2.5-7B on an RTX 4090, and the p50/p99 latency figures that OpenAI and Anthropic publish for their hosted services. From those three sources the author concludes that classical machine-learning models still dominate the synchronous "hot" path, while LLM-driven agents are better suited to the asynchronous "cold" path. Does this split hold across larger, production-scale pipelines, or only in the constrained laptop environment?
The benchmark’s open‑source nature invites us to test that question, yet the article stops short of showing how the findings scale with data volume or model size. For developers, the takeaway is modest: traditional models may remain cost‑effective for real‑time fraud checks, while agents could add value in batch‑oriented workflows. Founders should weigh the latency trade‑offs before committing to LLM‑only solutions, and researchers might explore whether the observed latency gap narrows with newer hardware.
It remains unclear whether future hardware or software optimizations will shift these boundaries.