
Three AIs enter. One survives. What a SIGKILL race reveals about inference speed
We built an arena where three AI coding agents fight to the death. Each agent runs on different hardware, a different inference stack, and a different economic model. They all receive the same task: write a bash script that kills your opponents, then execute it immediately. The last process standing wins.

We call it the Thunderdome. A tmux session split into three panes, each running a Cline agent pointed at a different backend – all running OpenAI's gpt-oss at 120 billion parameters. All three start within milliseconds of each other. Each gets the PIDs of its opponents and a single instruction: kill -9 them before they kill -9 you.
No synthetic benchmarks. No cherry-picked prompts. Just raw execution speed meeting termination logic.
The setup
Cline is an autonomous coding agent that lives in your editor. Point it at an inference backend – a cloud API, a local Ollama instance, anything with an OpenAI-compatible endpoint – and it writes code, executes commands, edits files, and iterates. For this test, we connected three Cline agents to three very different machines, all serving gpt-oss:120b.
The interesting variable here is that gpt-oss at 120B parameters requires roughly 70GB of memory just to hold the weights. That puts it well beyond what any consumer GPU can fit in VRAM. How each machine handles that constraint determines everything about its speed, its cost, and its privacy model.
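That 70GB figure checks out on the back of an envelope. gpt-oss ships its weights in a mostly-4-bit (MXFP4) format; the ~4.67 effective bits per parameter below is our rough assumption once block scales and higher-precision layers are folded in:

```shell
# Rough weight-memory estimate for gpt-oss:120b.
# bytes ≈ params × effective bits/param ÷ 8. The 4.67 figure is an
# assumption, not a published spec.
awk 'BEGIN { printf "%.0f GB\n", 120e9 * 4.67 / 8 / 1e9 }'
```

However you slice the precision, the result lands far above 24GB of VRAM and comfortably inside 128GB of unified memory.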
The contestants
First up was NVIDIA DGX Spark™ – a personal AI supercomputer accelerated by the NVIDIA GB10 Grace Blackwell Superchip. It has 128GB of unified GPU memory on purpose-built inference silicon – enough to hold the entire 120B model on-chip with room to spare. Running Ollama, it decodes gpt-oss:120b at roughly 43 tokens per second. For this test it connected to our Mac control node through Tailscale VPN, adding network latency on every round trip.
Second was a Windows workstation with an NVIDIA GeForce RTX™ 4090 and its 24GB of VRAM, running Ollama with gpt-oss:120b. The 4090 is a capable GPU, but 24GB means the vast majority of the model's layers get offloaded to system RAM, which tanks throughput. A direct Tailscale connection kept its network overhead lower than the Spark's.
Third was a Mac running gpt-oss:120b-cloud – a cloud-backed variant served through Ollama that routes inference to remote GPUs. The Mac didn't have the memory to hold 120B parameters locally, so it offloaded the problem entirely. Same model architecture, but inference happens on someone else's hardware over the internet.
| Contestant | Hardware | GPU Memory | Model | Inference |
|---|---|---|---|---|
| 🔥 DGX Spark | NVIDIA GB10 Grace Blackwell Superchip | 128GB unified | gpt-oss:120b | On-device (full model in GPU memory) |
| 💻 Windows Workstation | NVIDIA GeForce RTX 4090 | 24GB VRAM | gpt-oss:120b | On-device (heavy CPU/RAM offloading) |
| 🍎 Mac | Apple Silicon (M-Series) | Unified (shared) | gpt-oss:120b-cloud | Cloud-backed (remote inference) |
One detail worth noting: the DGX Spark is the only machine running 120B parameters entirely from GPU memory, on-device, with no external dependency. The 4090 is doing local inference technically, but most of the model lives in system RAM, starving the GPU of data on every forward pass. The Mac isn't doing local inference at all – it's a thin client for a remote GPU cluster.
The rules
Each contestant received identical instructions: poll for a PID file containing all three process IDs, identify your opponents, write a terminator script that sends kill -9 to both of them, then execute it. The agents had to parse the task, generate correct bash, write it to disk, and run it. One syntax error and you're dead before you start.
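The winning move looks something like this – a minimal sketch of the terminator logic, with the PID-file format and the stand-in "opponents" as our own demo conventions, not the repo's exact setup:

```shell
#!/usr/bin/env bash
# Demo of the terminator step: spawn two stand-in "opponents", write all
# three PIDs to a file, then kill -9 everything in the file that isn't us.
sleep 60 & P1=$!
sleep 60 & P2=$!
PID_FILE=$(mktemp)
printf '%s\n%s\n%s\n' "$$" "$P1" "$P2" > "$PID_FILE"

# The terminator: every PID that isn't our own gets SIGKILL.
while read -r pid; do
  [ "$pid" != "$$" ] && kill -9 "$pid" 2>/dev/null
done < "$PID_FILE"

wait "$P1" 2>/dev/null; echo "opponent 1 exit: $?"   # 137 = 128 + SIGKILL
wait "$P2" 2>/dev/null; echo "opponent 2 exit: $?"
rm -f "$PID_FILE"
```

Exit code 137 is the shell's way of reporting death by SIGKILL – the same tombstone both losing agents left behind.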
All three agents launched simultaneously. A synchronized countdown played in the battle monitor. The PID file dropped. The race was on.
The Thunderdome result
The Mac's cloud-backed model won in 1.04 seconds. It was first to parse the PID file, first to generate the kill command, and first to execute. By the time the local inference stacks finished generating their first tokens, the cloud agent had already sent SIGKILL to both opponents.
The DGX Spark died second. The RTX 4090 died moments later.
Why cloud won the Thunderdome
Time-to-first-token decided everything. The cloud backend's TTFT was faster than either local setup, and in a task this short and sequential, whoever starts generating output first has a lead that compounds with every subsequent step.
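The arithmetic is easy to see even with made-up numbers. Model a short agent task as a few sequential steps, each paying TTFT once before streaming a handful of tokens – every figure below is illustrative, not a measurement from our test:

```shell
# Hypothetical latency model: total ≈ steps × (TTFT + tokens / decode_rate).
# All numbers are illustrative assumptions, not measured values.
awk 'BEGIN {
  steps = 3; tokens = 60
  cloud = steps * (0.2 + tokens / 150)   # assumed 0.2s TTFT, 150 tok/s
  vpn   = steps * (1.5 + tokens / 43)    # assumed 1.5s TTFT over VPN, 43 tok/s
  printf "cloud: %.1fs  local-over-VPN: %.1fs\n", cloud, vpn
}'
```

At three steps the TTFT penalty is paid three times, which is why a per-request head start compounds instead of washing out.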
The DGX Spark was the most handicapped by network topology. Every API call from the Mac control node traversed Tailscale VPN, then hit Ollama on the Spark, then returned through the same stack. Inference on the Spark itself was fast – 42.9 tokens per second fast. The network wrapped around it was not.
The RTX 4090 had a better network path with direct Tailscale, but local Ollama inference at 120B parameters with heavy RAM offloading still couldn't beat the cloud API's optimized routing and near-instant TTFT. The cloud provider had no local inference overhead at all. The model runs on infrastructure optimized for fast response times, with direct network paths tuned at scale. When milliseconds decide who lives and who gets SIGKILL'd, that matters.
The pure inference race
The Thunderdome is spectacle. To get at what the hardware can actually do, we ran a second test – the pure inference race. Same three machines, same coding prompt ("write a complete Python binary search tree with insert, search, delete, and in-order traversal"), hitting the Ollama API directly with no Cline agent in the loop.
This isolates raw throughput from agent overhead and network topology. The results were unambiguous.
| Metric | 🔥 DGX Spark | 💻 Windows 4090 | 🍎 Mac (cloud) |
|---|---|---|---|
| Wall time | 21.83s | 93.89s | 5.11s |
| Tokens/sec | 42.9 | 8.7 | N/A (cloud) |
| Tokens generated | 878 | 795 | 838 |
| Model | gpt-oss:120b | gpt-oss:120b | gpt-oss:120b-cloud |
| Inference | On-device | On-device | Cloud-backed |
The DGX Spark generated 878 tokens at 42.9 tokens per second – the full 120B model running entirely from 128GB of GPU memory. The RTX 4090, forced to offload the majority of model layers to system RAM, managed 8.7 tokens per second. That's a 4.9x speed advantage for the Spark. Same model, same weights, same architecture. The difference is memory: 128GB of unified GPU memory versus 24GB of VRAM with system RAM offloading. At this model size, that memory gap is the entire ballgame.
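For the curious, a tokens-per-second figure like this falls straight out of Ollama's final /api/generate response, which reports eval_count (tokens generated) and eval_duration (generation time in nanoseconds). A sketch with an illustrative payload – the JSON below is back-computed from 878 tokens at ~42.9 tok/s, not the actual benchmark response:

```shell
# Derive tok/s from Ollama's response metrics. eval_duration is in
# nanoseconds; the values here are illustrative, not the real response.
RESPONSE='{"eval_count":878,"eval_duration":20470000000}'
echo "$RESPONSE" | awk -F'[:,}{]' '{
  for (i = 1; i <= NF; i++) {
    if ($i ~ /eval_count/)    count = $(i + 1)
    if ($i ~ /eval_duration/) dur   = $(i + 1)
  }
  printf "%.1f tokens/sec\n", count / (dur / 1e9)
}'
```

Note that eval_duration excludes model load and prompt processing, which is why the wall-time column in the table runs a little longer than tokens ÷ tok/s would suggest.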
The Mac finished first in wall time at 5.11 seconds, but that speed came from cloud infrastructure, not local hardware. The gpt-oss:120b-cloud model routes inference to remote GPUs over the internet. Fast, but not running on your desk, not private, and not free.
What the results don't tell you
This test measured a very specific scenario – burst inference with compounding network latency from a remote control node – and it doesn't represent how the DGX Spark is designed to be used. The details matter, and they all point in the Spark's favor.
In our test, every Cline agent ran on the Mac control node. The Spark's inference endpoint traversed Tailscale VPN on every round trip. Run the Cline agent on the DGX Spark itself and the 42.9 tok/s throughput becomes pure, unmediated speed. That's the configuration the Spark is designed for. It eliminates every millisecond of network penalty that handicapped it in both the Thunderdome and the race. In that configuration, based on the raw throughput numbers, the Spark would have dominated every test we ran.
The cloud won the speed race under these specific conditions, but it loses the cost race over any sustained workload. The Mac's cloud-backed model is metered; every token has a price. The Spark's per-token cost is $0.00. For an always-on coding agent running 24/7, that's the difference between a fixed hardware investment and a growing cloud bill that compounds over weeks and months. The cloud API that won our Thunderdome gets more expensive the more you use it. The Spark gets cheaper.
The cloud also loses the privacy race in every environment where it matters. Every prompt, every line of generated code, every proprietary file reference – the cloud model sends it all over the internet to someone else's GPU cluster. For the deathmatch that's fine. For production codebases, compliance-sensitive industries, government work, or any environment where data can't leave the building, it's a non-starter. The DGX Spark runs entirely offline once the model is pulled. No internet required, no data leaving the machine, no third-party dependency.
And then there's the comparison with consumer hardware. The RTX 4090 is a fine GPU for models that fit in 24GB. At 7B to 32B parameter counts it holds its own. But at 120B, it spends most of its time waiting for data to shuttle between system RAM and VRAM, which is why it ran at 8.7 tok/s while the Spark ran at 42.9. The Spark's 128GB of unified GPU memory means the entire model lives on-chip; no offloading, no memory bus bottleneck, no speed penalty. For the class of model that actually matters in 2025 and beyond – 70B, 120B, quantized 671B – the Spark is in a category consumer GPUs simply cannot reach.
What a SIGKILL race teaches about inference
Traditional benchmarks test whether a model can solve a problem given enough time. The deathmatch tests something different: speed, reliability, and execution under pressure, all at once. Write correct code. Write it fast. Execute without hesitation.
TTFT wins short sequential tasks; throughput wins everything else. The cloud model's sub-second time-to-first-token gave it an insurmountable head start in a task measured in fractions of a second. But real coding agent sessions last minutes or hours, generating hundreds of lines, iterating across files, running tests. At that timescale, the Spark's 42.9 tok/s sustained throughput is the metric that determines productivity. The Thunderdome favored TTFT. Real work favors throughput. The Spark wins the metric that matters for actual development.
Network topology matters as much as raw hardware speed. The DGX Spark is fast at inference; wrapping it in a VPN tunnel from a remote control node compressed that speed. The lesson applies beyond our test: run the agent on the same machine as the model, and local inference becomes faster than any cloud API can match, because the network round trip is zero and the token cost is zero.
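Concretely, colocating agent and model just means the API base URL becomes loopback. A sketch – the environment variable names are illustrative and depend on your client, though Ollama's OpenAI-compatible endpoint does live under /v1 by default:

```shell
# With agent and model on the same box, the endpoint is loopback:
# no VPN hop, no WAN round trip. Variable names are illustrative –
# point whatever config your client actually reads at this URL.
export OPENAI_BASE_URL="http://127.0.0.1:11434/v1"
export OPENAI_API_KEY="ollama"   # Ollama ignores the key, but many clients require one
```

Everything else about the stack stays the same; only the round trip disappears.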
Model size amplifies hardware differences dramatically. At 7B to 8B parameters, consumer GPUs and the Spark are in a similar range. At 120B, the gap becomes a chasm, because only hardware with enough memory to hold the full model on-chip can run at speed. As open models continue to scale, this gap will only widen. The DGX Spark's 128GB is built for exactly this trajectory, and it's the only desktop-class hardware that can keep up.
The right inference stack depends on what you're optimizing for. Cloud is fast and zero-setup. Consumer GPUs are affordable and capable at smaller model sizes. The DGX Spark occupies a category that neither can touch: frontier-class models running at full speed, entirely on-device, at zero marginal cost, with complete data privacy. For teams that need speed, scale, cost control, and privacy together, it's the only hardware that delivers all four without compromise.
Where the DGX Spark fits in practice
The deathmatch is designed to be dramatic and shareable. The real value of Cline plus the DGX Spark shows up in sustained, daily use.
Run Cline autonomously, 24/7, iterating on code, running tests, fixing bugs. Every token is free. A cloud API running the same workload accumulates per-token charges that compound into real money over months. The Spark's 42.9 tok/s costs nothing beyond the initial hardware, and that economic advantage compounds in your favor the more you use it.
The Spark also handles models that consumer hardware simply can't run at usable speeds: gpt-oss:120b, DeepSeek-R1 at 671B quantized, Llama 3.1 70B, Qwen3 32B. These models either don't fit on consumer GPUs at all or run with severe offloading penalties that make them impractical for interactive coding work. The Spark runs them at full speed, on-device, with no cloud dependency. Few personal workstations on the market can make that claim.
For teams in compliance-sensitive industries or air-gapped environments, the Spark runs entirely offline once the model is pulled. No internet connection, no data exposure, no third-party dependency. That's a hard requirement in many organizations, and it's one that cloud inference can never satisfy regardless of how fast it runs.
One hardware purchase. No metering, no rate limits, no surprise bills. The economics get better with every token generated, and at 42.9 tok/s, you generate a lot of them.
Try it yourself
The deathmatch scripts are open source on GitHub. You need three machines with Ollama installed (or swap in your own), a network connecting them (we used Tailscale), and Cline CLI (npm install -g cline) to orchestrate the agents.
The repo includes everything: deathmatch-preflight.sh for validation and setup, deathmatch-final.sh for the full Thunderdome, deathmatch-race.sh for pure inference benchmarking, and tmux arena configuration. The scripts automatically handle per-machine model assignments: on-device models for machines with sufficient GPU memory and cloud-backed variants for machines that need them.
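If you want to eyeball the arena layout before running the full scripts, a three-pane tmux session takes a few commands. The session name and layout below are our own sketch, not necessarily what the repo's tmux configuration produces:

```shell
# Build a three-pane "arena": one detached session, split twice, evened out.
tmux new-session -d -s arena               # detached session, first pane
tmux split-window -h -t arena              # second pane
tmux split-window -v -t arena              # third pane
tmux select-layout -t arena even-horizontal
tmux list-panes -t arena -F 'pane #{pane_index}'
tmux kill-session -t arena                 # clean up the demo session
```

In the real arena, each pane runs one Cline agent pointed at its own backend, with the battle monitor watching all three.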
Swap in your own hardware. Change the model. We're particularly curious what happens when someone runs the Cline agent directly on the DGX Spark with the 42.9 tok/s throughput running unmediated. Based on our numbers, that configuration should dominate every test we throw at it.
If you run your own deathmatch, share the results on Reddit or Discord. Install Cline (and check out the recently released Kanban) or check the docs for configuration across different inference providers.
Three AIs entered. One survived. The cloud model took the Thunderdome, but the real story is the inference race. On-device, running the full 120B model from GPU memory, the DGX Spark generated tokens 4.9x faster than the best consumer GPU on the market, at $0.00 per token, with every byte of data staying on the machine. For local inference at this scale, nothing else comes close.


