On-Device AI Is the Future. The Unit Economics Just Aren't There Yet.
If frontier AI lands at $1K–$5K per employee per month, a $20K private AI server per employee stops sounding crazy and starts sounding like procurement. The hardware is almost there. The math isn't — yet.
Here's a back-of-the-napkin exercise every CFO is going to run within the next two years. Take a heavy AI user on your team — an engineer running agents all day, an analyst with a research pipeline, anyone whose job has quietly become "directing models." At frontier pricing, that person can burn $1,000 to $5,000 a month in tokens. I wrote about Fable 5's pricing when it landed: ten dollars per million input tokens, fifty per million output. Run serious agentic workloads against those numbers and a four-figure monthly bill per employee isn't an edge case. It's the median.
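To see how quickly those prices add up, here is a minimal sketch of the monthly bill at the rates quoted above. The daily token volumes and the 21 working days per month are illustrative assumptions, not measurements from any real team.

```python
# Prices quoted above: $10 per million input tokens, $50 per million output.
INPUT_USD_PER_MTOK = 10.0
OUTPUT_USD_PER_MTOK = 50.0


def monthly_token_bill(input_mtok_per_day, output_mtok_per_day, working_days=21):
    """Monthly API cost in USD for a daily token volume given in millions."""
    daily = (input_mtok_per_day * INPUT_USD_PER_MTOK
             + output_mtok_per_day * OUTPUT_USD_PER_MTOK)
    return daily * working_days


# Hypothetical agent-heavy user: 10M input and 1M output tokens per working day.
bill = monthly_token_bill(10, 1)
print(f"${bill:,.0f} per month")  # $3,150 per month
```

Agents re-read large contexts on every step, so input volume tends to dominate the token count even though output tokens cost five times as much each.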
Now ask the question the CFO will ask: if that employee costs $3,000 a month in inference, why don't we just buy them a $20,000 private AI server and be done with it?
It's a good question. The answer today is "because you can't." But the interesting part isn't the answer. It's how fast the answer is changing.
The rent-versus-own math is brutal
Do the arithmetic without flinching. An employee spending $3,000 a month on frontier inference costs $36,000 a year, $108,000 over three years, and the meter never stops. A $20,000 box amortized over the same three years is about $550 a month, plus electricity. Even at the bottom of the range — $1,000 a month in tokens — the box pays for itself before the second year is out. At the top of the range it pays for itself in four months.
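The same arithmetic as a short sketch, so you can plug in your own numbers. It assumes the box fully replaces the token spend and ignores electricity, maintenance, and resale value; the function names are hypothetical.

```python
import math

BOX_USD = 20_000


def amortized_monthly(box_usd, months=36):
    """Straight-line monthly cost of the box over its assumed life."""
    return box_usd / months


def breakeven_months(box_usd, monthly_token_usd):
    """First whole month in which cumulative token spend covers the box."""
    return math.ceil(box_usd / monthly_token_usd)


print(f"Amortized: ${amortized_monthly(BOX_USD):,.0f} per month")  # $556
for spend in (1_000, 3_000, 5_000):
    print(f"${spend:,} per month in tokens: "
          f"break-even in {breakeven_months(BOX_USD, spend)} months")
# 20 months at $1,000; 7 months at $3,000; 4 months at $5,000
```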
We have seen this exact movie before. Companies rented time on mainframes until the PC made compute cheap enough to put on every desk. They rented data center racks until the cloud made elasticity worth a premium, and now the same companies are repatriating steady-state workloads because renting a constant is the most expensive way to own it. Inference for a knowledge worker is about as steady-state as workloads get: the same person, hammering models, eight hours a day, every working day. That is precisely the shape of demand that capex was invented for.
And the per-token bill is only half the argument. The box on the desk doesn't rate-limit you. It doesn't send your code, your contracts, or your customer data to someone else's data center. It doesn't change its pricing, deprecate your model, or have an outage during your product launch. For regulated industries — healthcare, defense, finance, legal — "the data never leaves the building" isn't a nice-to-have. It's the whole purchase order.
Walk into Micro Center and look at the shelf
Here's what makes this more than a thought experiment: the hardware is no longer hypothetical. Nvidia is already selling the early versions of the box, at consumer and prosumer prices.
Look at the DGX Spark in particular. Nvidia ships a book-sized machine with 128GB of unified memory that runs models of up to 200 billion parameters at 4-bit precision, locally, for four grand. Wrap an RTX PRO 6000 in a workstation and you're at roughly $12,000 to $15,000, comfortably inside the $20K envelope, serving 70B-class models at 8-bit precision in its 96GB of memory with room for long context. The category the CFO's question assumes doesn't need to be invented. It's on a shelf.
Demand is already outrunning supply, which tells you something about where this is headed. The DGX Spark's price has crept from $3,999 toward $4,700 since launch, largely on memory supply constraints, and 5090s have spent most of their life trading above MSRP. The market is voting.
So why can't you do it yet?
Because of the gap nobody puts on the spec sheet: the model that's worth $3,000 a month doesn't fit in the box, and the models that fit in the box aren't worth $3,000 a month.
The frontier models behind that per-employee bill are vastly larger than anything 128GB of memory can hold, and it wouldn't matter if they weren't — the labs don't sell the weights. What you can actually run locally is the open-weights tier: excellent 30B-to-180B models that land roughly where the frontier was a year or so ago. For a lot of tasks, that's genuinely fine. For the work that justifies a four-figure token bill — long-horizon agents, hard reasoning, the stuff that replaced a workflow rather than autocompleting one — the capability gap is exactly the part you were paying for.
That's the honest state of on-device AI in 2026. The economics scream "buy the box." The capability says "keep renting." The unit economics are just not there yet — and "yet" is the operative word.
Two curves, one inflection point
What makes this a timing question rather than a fantasy is that two curves are converging from opposite directions, and neither shows signs of stopping.
Models are being optimized downward. The cost of inference at a fixed capability level has been falling roughly tenfold per year (a thousandfold over three years), driven by distillation, quantization, mixture-of-experts sparsity, and plain algorithmic progress that cuts the compute needed for a given score by multiples annually. The practical version of that statistic: today's 70B open-weights model does what a frontier model did twelve to eighteen months ago. The frontier keeps moving, but the copy of it that fits in a box keeps arriving about a year later, at roughly a tenth of the price.
Hardware is climbing upward. Consumer cards went from 24GB to 32GB in a generation. The unified-memory desk machines went from zero to 128GB in one product cycle, with native FP4 fitting about four times as many weights into every gigabyte as 16-bit formats do. Each hardware generation raises the ceiling on what "fits locally" by roughly a model class, and Nvidia, Apple, and AMD are all racing to own the category, because they can read the same CFO napkin everyone else can.
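A rough way to check what "fits locally" is to count bytes for the weights alone. The sketch below does that, assuming 1 GB = 10^9 bytes and reserving 20% of memory for the KV cache, activations, and the operating system. That headroom figure is an assumption for illustration, not a vendor number.

```python
def weights_gb(params_billions, bits_per_weight):
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits(params_billions, bits_per_weight, memory_gb, headroom=0.8):
    """True if the weights fit in the usable share of memory (assumed 80%)."""
    return weights_gb(params_billions, bits_per_weight) <= memory_gb * headroom


print(weights_gb(200, 4))   # 100.0 GB: a 200B model at 4-bit
print(fits(200, 4, 128))    # True on a 128GB unified-memory machine
print(fits(200, 16, 128))   # False: the same model at 16-bit needs 400 GB
print(fits(70, 4, 32))      # False: 35 GB of weights exceeds a 32GB card
```

Real deployments also pay for quantization scale factors and long-context caches, so treat these figures as a lower bound on memory needed.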
The inflection point is where those curves cross: the moment a locally-runnable model handles, say, 80% of a given employee's actual workload at quality indistinguishable from the API. Not 100% — that's the wrong bar. The desk box doesn't need to beat the frontier. It needs to be good enough that the cloud becomes the exception instead of the default: local for the daily grind, frontier API for the 20% of problems that genuinely need the best model on Earth. The day that hybrid math works, the $20K box stops being an enthusiast purchase and becomes an IT line item — the way the PC did, the way the workstation did, the way every wave of compute has eventually migrated toward the user once the economics allowed it.
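The hybrid math can be sketched in a few lines. It assumes token spend scales with the share of workload moved off the API, which is optimistic if the hardest 20% of problems is also the most token-hungry, and it reuses the $20K box amortized over three years. The `power_usd` parameter is a hypothetical placeholder for electricity.

```python
def hybrid_monthly_cost(api_bill_usd, local_share,
                        box_usd=20_000, months=36, power_usd=0.0):
    """Monthly cost when `local_share` of the workload runs on the box."""
    box_monthly = box_usd / months
    residual_api = api_bill_usd * (1 - local_share)
    return box_monthly + power_usd + residual_api


api_only = 3_000
hybrid = hybrid_monthly_cost(api_only, local_share=0.8)
print(f"API only: ${api_only:,.0f} per month")
print(f"Hybrid:   ${hybrid:,.0f} per month")  # $1,156 per month
```

At a $3,000 monthly bill, the box's amortized cost exceeds the API savings once the local share falls below roughly 19%, which is why the 80% threshold matters more than the sticker price.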
What to do about it now
Don't buy the hardware. That's the trap. A $20K box purchased today is a depreciating bet on this year's model class, and both curves are moving fast enough that next year's box will embarrass it. Early hardware hedges almost always lose to one more year of renting. Prepare instead:
- Architect for portability. Every AI workflow you build should sit behind a routing layer that doesn't care whether the model is an API in Virginia or a card under the desk. If your agents are hard-wired to one provider's endpoint, the inflection point will arrive and you won't be able to take the discount.
- Route by difficulty, not by habit. Start splitting workloads now into "needs the frontier" and "a strong open-weights model would do." That's the same discipline that saves you money on tokens today, and it's the exact seam where local inference will slot in later. Organizations that already route by difficulty flip a config. Organizations that don't will be rewriting workflows.
- Run the napkin math per role. Token spend per employee is about to be a number your CFO knows. Know it first — which roles are at $200 a month and which are at $4,000 — because the $4,000 roles are where on-device lands first.
- Let privacy lead the business case. If you're in a regulated industry, the inflection point arrives early for you, because the comparison isn't box-versus-API price. It's box-versus-not-being-allowed-to-use-AI-at-all.
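The first two points above, portability and routing by difficulty, can be sketched as a thin routing layer. Everything here is hypothetical: the backend names, the two difficulty tiers, and the stub functions standing in for real model clients. The point is only that callers never name a provider.

```python
from typing import Callable, Dict

# A backend is anything that maps a prompt to a completion.
Backend = Callable[[str], str]


def make_stub(name: str) -> Backend:
    """Stand-in for a real client; it tags the prompt with the backend name."""
    return lambda prompt: f"[{name}] {prompt}"


BACKENDS: Dict[str, Backend] = {
    "frontier_api": make_stub("frontier_api"),
    "hosted_open_weights": make_stub("hosted_open_weights"),
    "desk_box": make_stub("desk_box"),
}

# Routing config: difficulty tier -> backend name.
ROUTES: Dict[str, str] = {
    "hard": "frontier_api",
    "routine": "hosted_open_weights",
}


def complete(prompt: str, difficulty: str) -> str:
    """Send the prompt to whichever backend the config assigns to its tier."""
    return BACKENDS[ROUTES[difficulty]](prompt)


print(complete("Summarize this ticket", "routine"))
# [hosted_open_weights] Summarize this ticket

# When local hardware makes sense, the switch is one line of config.
ROUTES["routine"] = "desk_box"
print(complete("Summarize this ticket", "routine"))
# [desk_box] Summarize this ticket
```

Because workflows call `complete` with a difficulty tier rather than an endpoint, moving routine work onto local hardware later changes the config and nothing else.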
On-device AI is the future for the same reason the PC was: compute migrates toward the user the moment the economics permit it, and the economics are converging from both directions at once. The labs optimize down, the silicon climbs up, and somewhere in the next few years they cross. The companies that win that crossing won't be the ones that bought hardware earliest. They'll be the ones whose architecture, routing, and cost discipline were ready the day the box made sense.
If you're trying to figure out where your organization's inflection point is — what to route where, what your real per-employee AI economics look like, and how to build for the switch without betting early — that's the work I do. Let's talk →
