> A huge number of people are convinced that OpenAI and Anthropic are selling inference tokens at a loss despite the fact that there's no evidence this is true
There's quite a lot of evidence; no proof, I'd agree, but then there's no absolute proof I'm aware of to the contrary either, so I don't know where you're getting this from.
The two pieces of evidence I'm aware of are that 1) Anthropic doesn't want their subsidised plans being used outside of CC, which would imply that the money they're making off it isn't enough, and 2) last time I checked, API spending is capped at $5000 a month.
Like I say, neither of these is proof; you can come up with reasonable arguments against them, but once again the same could be said for evidence to the contrary.
> which would imply that the money they're making off it isn't enough
I don't think this logically follows. An unlimited buffet doesn't let you resell all of the food out the backdoor. At some level of usage any fixed price plan becomes unprofitable.
I agree the 5k cap is interesting as evidence although as you said I suspect there are other reasons for it.
As for evidence against it: The Information reported that OpenAI and Anthropic have had 30%+ gross margins for the last few years. Sam Altman and Dario have both claimed inference is profitable in various scattered interviews. Other experts seem to generally agree too. A quick search found a tweet from former PyTorch team member Horace He: https://x.com/typedfemale/status/1961197802169798775 and a response to it in agreement from Anish Tondwalkar, former researcher at OpenAI and Google Brain.
But a simple assumption that Anthropic runs a normal large MoE LLM (which it almost certainly does) suggests that the actual price of running it (mostly energy) is pretty small.
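For a sense of scale, here is a back-of-the-envelope sketch of the electricity piece of that claim; every number below (per-accelerator power draw, per-GPU decode throughput, electricity price) is an illustrative assumption, not a figure from Anthropic.

    # Rough electricity cost per million generated tokens; all inputs are assumptions.
    gpu_power_kw = 1.0            # assumed: ~1 kW per accelerator, including overhead
    tokens_per_sec_per_gpu = 50   # assumed: per-GPU decode throughput for a large MoE
    price_per_kwh = 0.08          # assumed: industrial electricity price, USD

    seconds_per_mtok = 1_000_000 / tokens_per_sec_per_gpu
    kwh_per_mtok = gpu_power_kw * seconds_per_mtok / 3600
    print(f"~${kwh_per_mtok * price_per_kwh:.2f} of electricity per million tokens")

Even with fairly pessimistic per-GPU assumptions, the electricity comes out well under a dollar per million tokens, far below typical frontier API prices, which is the point the comment is making.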
I think the wafer scale compute is a massive deal. It's already being leveraged for models you can use right now and the reception on HN has been negligible. The entire model lives in SRAM. This is orders of magnitude faster than HBM/DRAM. I can't imagine they couldn't eventually break even using hardware like this in production.
I calculated only last weekend that my team would cost, if we ran Claude Code at retail API prices, around $200k/mo. We pay $1400/month in Max subscriptions. So that's $50k/user... But from the tokens CC reports in its JSON, a lot of this must be cached etc, so I doubt it's anywhere near $50k in actual cost, but I'm not sure how to figure out what it would cost and I'm sure as hell not going to try.
I'm fascinated to know the kind of work that allows you to intelligently allocate so many resources. I use Claude extensively and feel that I get great value out of it, but it seems I reach a limit in terms of what I can do that makes sense relatively quickly.
Yea basically we have an app that’s like Netflix but for dogs, so people can leave on dog oriented shows for their dogs when they get kombucha or coffee
I'm surprised, isn't it forbidden to use the Max plan as part of a company? Just curious, as I thought it was forbidden by the ToS but I'm not sure if I have a good understanding of it
Most companies forbid it though, since you're not covered by any legal protection - for example, Anthropic can use your data or code to train new models and more.
There is nothing in the ToS, last time I checked, forbidding its use with Claude Code. It's only forbidden to utilize it in the running of the business.
So getting Claude Code subscriptions for developers should be permissible and not be against anything... However, if you created a REST endpoint to e.g. run a preconfigured prompt as part of your platform, that'd be against it.
Yeah, that would end that really quickly. I use Pro for personal stuff. If $200 is not allowed for companies I don't think anyone would use it, at all.
I think the main issue I have with the article is that the author's whole argument is based on 'Qwen wouldn't run at a loss'. But why wouldn't it?
Despite it being a business, there might be a number of reasons why they decide to run without profit for now: from trying to expand the user base, to the Chinese government sponsoring Chinese AI businesses.
> Cost remains an ever present challenge. Cursor’s larger rivals are willing to subsidize aggressively. According to a person familiar with the company’s internal analysis, Cursor estimated last year that a $200-per-month Claude Code subscription could use up to $2,000 in compute, suggesting significant subsidization by Anthropic. Today, that subsidization appears to be even more aggressive, with that $200 plan able to consume about $5,000 in compute, according to a different person who has seen analyses on the company’s compute spend patterns.
This is the relevant quote from the original article.
If Anthropic's compute is fully saturated then the Claude Code power users do represent an opportunity cost to Anthropic much closer to $5,000 than $500.
Anthropic's models may be similar in parameter size to models on OpenRouter, but none of the others are in the headlines nearly as much (especially recently), so the comparison is extremely flawed.
The argument in this article is like comparing the cost of a Rolex to a random brand of mechanical watch based on gear count.
But opportunity cost is not actual cost. “If everyone just kept paying but used our service less we would be more profitable” is true, but not in any meaningful way.
Are Anthropic currently unable to sell subscriptions because they don’t have capacity?
> Are Anthropic currently unable to sell subscriptions because they don’t have capacity?
Absolutely! I'm currently paying $170 to Google to use Opus in Antigravity without limit in full agent mode, because I tried Anthropic's $20 subscription and busted my limit within a single prompt. I'm not gonna pay them $200 only to find out I hit the limit after 20 or even 50 prompts.
And after 2 more months my price is going to double to over $300, and I still have no intention of even trying the 20x Max plan, if it's really just 20x more prompts than Pro.
Opportunity cost is not the same thing as actual cost. They might have made more money if they were capable of selling the API instead of CC, but I would never tell my company to use CC all the time if I didn’t have a personal subscription.
> If Anthropic's compute is fully saturated then the Claude Code power users do represent an opportunity cost to Anthropic much closer to $5,000 than $500.
I think it's the other way around? Sparse use of GPU farms should be the more expensive thing. Full saturation means that we can exploit batching effects throughout.
You know who also loves to use the term "opportunity cost"?
The entertainment industry. They still tell you about how much money they're leaving on the table because people pirate stuff.
What would happen in reality for entertainment is people would "consume" far less "content".
And what would happen in reality for Anthropic is people would start asking themselves if the unpredictability is worth the price. Or at best switch to pay as you go and use the API far less.
> The argument in this article is like comparing the cost of a Rolex to a random brand of mechanical watch on gear count
I mean... Rolex is an overpriced brand whose cost to consumers is mainly just marketing in itself. Its production cost is nowhere close to the selling price, and looking at gears is a fair way of evaluating that.
Not directly. But if production cost is above selling price, you typically tend to get less production. And if production cost is (way) below selling price, that tends to invite competition.
I'm using the API directly for software development; I'm on track to pay ~$5k this month per user, some less, some more, and with daily use it's just growing more and more.
How confident are you in the Opus 4.6 model size? I've always assumed it was a beefier model with more active params than Qwen 397B (17B active on the forward pass).
Unlikely. Amazon Bedrock serves Opus at 120 tokens/sec.
If you want to estimate "the actual price to serve Opus", a good rough estimate is to take the highest price among Deepseek, Qwen, Kimi, and GLM and multiply it by 2-3. That would be a pretty close guess at the actual inference cost for Opus.
It's impossible for Opus to have something like 10x the active params of the Chinese models. My guess is something around 50-100B active params, 800-1600B total params. I can be off by a factor of ~2, but I know I am not off by a factor of 10.
In practice, tps is a reflection of vram memory bandwidth during inference. So the tps tells you a lot about the hardware you're running on.
Comparing tps ratios- by saying a model is roughly 2x faster or slower than another model- can tell you a lot about the active param count.
I won't say it'll tell you everything; I have no clue what optimizations Opus may have, which can range from native FP4 experts to spec decoding with MTP to whatever. But considering Chinese models like Deepseek and GLM have MTP layers (no clue if Qwen 3.5 has MTP, I haven't checked since its release), and Kimi is native int4, I'm pretty confident that there is not a 10x difference between Opus and the Chinese models. I would say there's roughly a 2x-3x difference between Opus 4.5/4.6 and the Chinese models at most.
> In practice, tps is a reflection of vram memory bandwidth during inference.
> Comparing tps ratios- by saying a model is roughly 2x faster or slower than another model- can tell you a lot about the active param count.
You sure about that? I thought you could shard between GPUs along layer boundaries during inference (but not training obviously). You just end up with an increasingly deep pipeline. So time to first token increases but aggregate tps also increases as you add additional hardware.
Hint: what's in the kv cache when you start processing the 2nd token?
And that's called layer parallelism (as opposed to tensor parallelism). It allows you to run larger models (pooling vram across gpus) but does not allow you to run models faster.
Tensor parallelism DOES allow you to run models faster across multiple GPUs, but you're limited to how fast you can synchronize the all-reduce. And in general, models would have the same boost on the same hardware- so the chinese models would have the same perf multiplier as Opus.
Note that providers generally use tensor parallelism as much as they can, for all models. That usually means 8x or so.
In reality, tps ends up being a pretty good proxy for active param size when comparing different models at the same inference provider.
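As a rough illustration of that proxy, here is a minimal bandwidth-bound decode model; the bandwidth, efficiency, and parameter figures are made-up assumptions, not measurements of any particular provider.

    # Sketch: when decode is memory-bandwidth bound, single-stream speed is roughly
    # effective_bandwidth / bytes_read_per_token, dominated by the active parameters.
    def decode_tps(active_params_b, bytes_per_param, bandwidth_tb_s, efficiency=0.5):
        bytes_per_token = active_params_b * 1e9 * bytes_per_param
        return bandwidth_tb_s * 1e12 * efficiency / bytes_per_token

    # Two hypothetical models on the same hardware shard (~3 TB/s effective):
    small = decode_tps(active_params_b=32, bytes_per_param=1, bandwidth_tb_s=3.0)
    large = decode_tps(active_params_b=64, bytes_per_param=1, bandwidth_tb_s=3.0)
    print(round(small), round(large), round(small / large, 1))  # tps ratio tracks the param ratio

On the same hardware the shared bandwidth and efficiency terms cancel, which is why comparing tps ratios between models at one provider says something about their relative active parameter counts.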
GPT 4 was rumoured/leaked to be 1.8T. Claude 3.5 Sonnet was supposedly 175B, so around 0.5T-1T seems reasonable for Opus 3.5. Maybe a step up to 1-3T for Opus 4.0
Since then, inference pricing for new models has come down a lot, despite increasing pressure to be profitable. Opus 4.6 costs 1/3rd what Opus 4.0 (and 3.5) cost, and GPT 5.4 costs 1/4th what o1 cost. You could take that as an indication that inference costs have also come down by at least that degree.
My guess would have been that current frontier models like Opus are in the realm of 1T params with 32B active
Opus 4.6 likely has in the order of 100B active parameters. OpenRouter lists the following throughput for Google Vertex:
42 tps for Claude Opus 4.6 https://openrouter.ai/anthropic/claude-opus-4.6
143 tps for GLM 4.7 (32B active parameters) https://openrouter.ai/z-ai/glm-4.7
70 tps for Llama 3.3 70B (dense model) https://openrouter.ai/meta-llama/llama-3.3-70b-instruct
For GLM 4.7, that makes 143 * 32B = 4576B parameters per second, and for Llama 3.3, we get 70 * 70B = 4900B, which makes sense since denser models are easier to optimize. As a lower bound, we get 4576B / 42 ≈ 109B active parameters for Opus 4.6. (This makes the assumption that all three models use the same number of bits per parameter and run on the same hardware.)
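The same arithmetic written out; it assumes the three models run at the same precision and with comparable hardware allocation, which is exactly the caveat in the parenthetical above.

    # Back out Opus active params from throughput, using the figures quoted above.
    reference = {                    # model: (tps on Google Vertex, active params in B)
        "GLM 4.7": (143, 32),
        "Llama 3.3 70B": (70, 70),
    }
    opus_tps = 42
    for name, (tps, active_b) in reference.items():
        implied = tps * active_b / opus_tps   # B-params/s served, divided by Opus tps
        print(f"via {name}: ~{implied:.0f}B active parameters implied for Opus 4.6")
    # via GLM 4.7: ~109B, via Llama 3.3 70B: ~117B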
Try 10s of trillions. These days everyone is running 4-bit at inference (the flagship feature of Blackwell+), with the big flagship models running on recently installed Nvidia 72-GPU Rubin clusters (and equivalent-ish world size for those rented Ironwood TPUs Anthropic also uses). Let's see, Vera Rubin racks come standard with 20 TB (Blackwell NVL72 with 10 TB) of unified memory, and NVFP4 fits 2 parameters per byte...
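Finishing that arithmetic with only the figures given in the comment (and ignoring KV cache, activations, and runtime overhead, which eat into the real budget):

    # Upper bound on NVFP4 parameters that fit in a rack's unified memory.
    params_per_byte = 2                       # NVFP4: 4 bits per parameter
    for name, tb in [("Vera Rubin NVL72", 20), ("Blackwell NVL72", 10)]:
        params_trillions = tb * 1e12 * params_per_byte / 1e12
        print(f"{name}: up to ~{params_trillions:.0f}T parameters")
    # Vera Rubin: ~40T, Blackwell: ~20T -- hence "10s of trillions" as a ceiling, not an estimate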
Of course, intense sparsification via MoE (and other techniques ;) ) lets total model size largely decouple from inference speed and cost (within the limit of world size via NVLink/TPU torus caps)
So the real mystery, as always, is the actual parameter count of the activated head(s). You can do various speed benchmarks and TPS tracking across likely hardware fleets, and while an exact number is hard to compute, let me tell you, it is not 17B or anywhere in that particular OOM :)
Comparing Opus 4.6 or GPT 5.4 Thinking or Gemini 3.1 Pro to any sort of Chinese model (on cost) is just totally disingenuous when China does NOT have Vera Rubin NVL72 GPUs or Ironwood V7 TPUs in any meaningful capacity, and is forced to target 8-GPU Blackwell systems (and worse!) for deployment.
Do you have any clues to guess the total model size? I do not see any limitations to making models ridiculously large (besides training), and the Scaling Law paper showed that more parameters = more better, so it would be a safe bet for companies that have more money than innovative spirit.
> I do not see any limitations to making models ridiculously large (besides training)
From my understanding, the "besides training" is a big issue. As I noted earlier[1], Qwen3 was much better than Qwen2.5, but the main difference was just more and better training data. The Qwen3.5-397B-A17B beat their 1T-parameter Qwen3-Max-Base, again a large change was more and better training data.
I generally agree; back-of-the-napkin math shows an H20 cluster of 8 GPUs * 96 GB = 768 GB = 768B parameters in FP8 (no NVFP4 on Hopper), which lines up pretty nicely with the sizes of recent open-source Chinese models.
However, I'd say it's relatively well assumed in realpolitik land that Chinese labs managed to acquire plenty of H100/H200 clusters and even meaningful numbers of B200 systems semi-illicitly before the regulations and anti-smuggling measures really started to crack down.
This does somewhat raise the question of how nicely the closed-source variants, of undisclosed parameter counts, fit within the 1.1 TB of H200 or 1.5 TB of B200 systems.
tldr: the author argues it is closer to costing 500 USD per month IF a user hits their weekly rate limits every week.
Which is probably a lot more correct than other claims. However, it's also true that anybody who has to use the API might pay that much, creating a real cost-per-token moat for Anthropic's Claude Code vs other models as long as they are so far ahead in terms of productivity.
A Claude subscription is the equivalent of a spot instance,
and the API is the on-demand equivalent.
Priority goes to the API, and leftover compute is used by subscription plans.
When there is no capacity, subscriptions are routed to highly quantized, cheaper models behind the scenes.
Selling subscriptions makes it cheaper to run such inference at scale; otherwise, much of your capacity is just sitting there idle.
Also, these subscriptions help you train your model further on predictable workflows (because the model creators also control the client, like Qwen Code, Claude Code, Antigravity, etc...).
This is probably why they will ban you for violating the ToS if you use their subscription plans with other tools.
They aren't just selling subscriptions; the subscriptions also help them become better at the thing they are selling, which for coding models like Qwen, Claude, etc. is coding.
I've used qwen code, codex and claude.
Codex is 2x better than Qwen Code and Claude is 2x better than Codex.
So I'd hope Claude Opus is at least 4-5x more expensive to run than the flagship Qwen Code model hosted by Alibaba.
Not only that, but since the release of 5.4 and 5.3 codex I've been running them in parallel, and I've been let down by Opus 4.6 with maximum thinking way more than I've been let down by the OpenAI models.
In fact I'm more and more inclined to run my own benchmarks from now on, because I seriously distrust those I see online.
Even if the benchmarks are indeed valid, they just don't reflect my use cases, usages and ability to navigate my projects and my dependencies.
imho they're mostly better at a subset of different tasks. I find codex to be better at reasoning through bugs and reviewing code when compared to Opus, but for writing code I find Claude a lot better.
Maybe that's just CLAUDE.md and memory causing the difference of course.
As a matter of preference however I like the way Claude Code works just a lot better, instructing it to work with parallel subagents in work trees etc. just matches the way I think these things should work I guess.
What people don't realize is that cache is *free*, well not free, but compared to the compute required to recompute it? Relatively free.
If you remove the cached-token cost from the pricing, the overall API usage drops from around $5000 to $800 (or $200 per week) on the $200 Max subscription. Still 4x cheaper than the API, but not costing them money either - if I had to guess, it's break-even, as the compute is most likely going idle otherwise.
Cache definitely isn't free! We're in a global RAM shortage and KV caches sit around consuming RAM in the hope that there will be a hit.
The gamble with caching is to hold a KV cache in the hope that the user will (a) submit a prompt that can use it and (b) that will get routed to the right server which (c) won't be so busy at the time it can't handle the request. KV caches aren't small so if you lose that bet you've lost money (basically, the opportunity cost of using that RAM for something else).
I'm incredibly salty about this - they're essentially monetizing intensely something that allows them to sell their inference at premium prices to more users - without any caching, they'd have much less capacity available.
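To put a rough number on why holding KV caches isn't free, here is a sketch of the per-token cache footprint for a hypothetical large model; the layer count, head count, head size, and precision are all invented for illustration, not Anthropic's architecture.

    # Approximate KV-cache footprint; every architectural number here is an assumption.
    layers = 90
    kv_heads = 8            # assumed grouped-query attention
    head_dim = 128
    bytes_per_value = 2     # assumed bf16 cache
    per_token = layers * 2 * kv_heads * head_dim * bytes_per_value   # K and V per layer
    context = 100_000
    print(f"~{per_token / 1e3:.0f} KB per token, ~{per_token * context / 1e9:.0f} GB "
          f"held for a {context:,}-token conversation")

Tens of gigabytes of accelerator or host memory parked per long-running session is the gamble described above: profitable when the next prompt hits the cache, pure carrying cost when it doesn't.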
Inference compute is vastly different from training compute; also, the model has to stay hot in VRAM, which probably takes up most of it. There is limited use for THAT much compute as well; they are running things like the Claude Code compiler, and even then they're scratching the surface of the amount of compute they have.
Training currently requires Nvidia's latest and greatest for the best models (they also use Google TPUs now, which are also technically the latest and greatest? However, they're more dual-purpose than anything afaik, so that would be a correct assessment in that case).
Inference can run on a hot potato if you really put your mind to it
I think I've heard multiple times that a large % of training compute for SoTA models is inference to generate training tokens; this is bound to happen with RL training.
Electricity is charged whether you use it or not, so very unlikely, but sure, they can find uses for it. Although they are not going to make that much money compared to Claude Code subscriptions.
“The article is right to separate compute cost from retail price — but the retail price baseline itself is arbitrary depending on where you run the model.
The same capability (e.g. Llama 3.3 70B with tool calling and 128K context) runs $3.00/1M tokens at model developer list price and $0.22/1M at Fireworks AI — a 93% gap for identical specs. That spread makes any “it costs Anthropic X” estimate depend entirely on which reference price you anchor to.
We track this live across 1,625 SKUs and 40+ vendors at a7om.com — the variance across the market is larger than most people realise when they back-calculate provider economics.”
I can’t get past all the LLM-isms. Do people really not care about AI-slopifying their writing? It’s like learning about bad kerning, you see it everywhere.
I had a similar reaction to OP for a different post a few weeks back - I think some analysis on the health economy. Initially as I was reading I thought - "Wow, I've never read a financial article written so clearly". Everything in layman's terms. But as I continued to read, I began to notice the LLM-isms. Oversimplified concepts, "the honest truth" "like X for Y", etc.
Maybe the common factor here is not having deep/sufficient knowledge on the topic being discussed? For the article I mentioned, I feel like I was less focused on the strength of the writing and more on just understanding the content.
LLMs are very capable at simplifying concepts and meeting the reader at their level. Personally, I subscribe to the philosophy of - "if you couldn't be bothered to write it, I shouldn't bother to read it".
This happens to non-native English speakers a lot (like me). My style of writing is heavily influenced by everything I read. And since I also do research using LLMs, I'll probably sound more and more like an AI as well, just by reading its responses constantly.
I just don't know what's supposed to be natural writing anymore. It's not in the books, it's disappearing from the internet - what's left? Some old blogs, for now, maybe.
> The real story is actually in the article. … And the real issue for Cursor … They have real "brand awareness", and they are genuinely better than the cheaper open weights models - for now at least. It's a real conundrum for them.
> … - these are genuinely massive expenses that dwarf inference costs.
Popular content is popular because it is above the threshold for average detection.
In a better world, platforms would empower defenders, by granting skilled human noticers flagging priority, and by adopting basic classifiers like Pangram.
Unfortunately, mainstream platforms have thus far not demonstrated strong interest in banning AI slop. This site in particular has actually taken moderation actions to unflag AI slop on certain occasions...
It is certainly very obvious a lot of the time. I wonder if we revisited the automated slop detection problem we’d be more successful now… it feels like there are a lot more tells and models have become more idiosyncratic.
The OpenRouter comparison is interesting because it shows what happens when you have actual supply-side competition: multiple providers, different quantizations, price competition. The spread between the cheapest and priciest provider for the same model can be 3-5x.
Anthropic doesn't have that. Single provider, single pricing decision. Whether or not $5k is accurate, the more interesting question is what happens to inference pricing when the supply side is genuinely open. We're seeing hints of it with OpenRouter, but it's still intermediated.
Not saying this solves Anthropic's cost problem, just that the "what does inference actually cost" question gets a lot more interesting when providers are competing directly.
1. It would be nice to define terms like RSI or at least link to a definition.
2. I found the graph difficult to read. It's a computer font that is made to look hand-drawn and it's a bit low resolution. With some googling I'm guessing the words in parentheses are the clouds the model is running on. You could make that a bit more clear.
What CC costs internally is not public. How efficient it is, is not public.
…You could take efficiency improvement rates from previous model releases (from x -> y) and assume they have already made "improvements" internally. This is likely closer to what their real costs are.
Ed Zitron made that claim (in particular here: [1]). In the same article he admits he's not a programmer, and had to ask someone else to try out Claude Code and ccusage for him. He doesn't have any understanding of how LLMs or caching work. But he's prominent because he's received leaked financial details for Anthropic and OpenAI, e.g. [2]
Maybe I'm misreading it, but I don't see him saying it's just the cost of *inference* alone (which is the strawman that the article in the OP is arguing against). He says:
> this company is wilfully burning 200% to 3000% of each Pro or Max customer that interacts with Claude Code
There is of course this meme that "Anthropic would be profitable today if they stopped training new models and only focused on inference", but people on HN are smart enough to understand that this is not realistic due to model drift, and also due to competition from other models. So training is forever a part of the cost of doing business, until we have some fundamental changes in the underlying technology.
I can only interpret Ed Zitron as saying "the cost of doing business is 200% to 3000% of the price users are paying for their subscriptions", which sounds extremely plausible to me.
I mean, the very first paragraph of TFA is describing who is under that impression. Literally the first sentence:
> My LinkedIn and Twitter feeds are full of screenshots from the recent Forbes article on Cursor claiming that Anthropic's $200/month Claude Code Max plan can consume $5,000 in compute.
That's claiming that worst case, a subscriber _can_ use that much. It's possible that's wrong too, but in any case a lot of services are built on the assumption that the average user doesn't max out the plan.
So the article's title is obviously sensationalized.
I have no problem believing that a Claude Max plan can consume the equivalent of $5000 worth of retail Opus use, but one interesting thing you'll see if you e.g. have Claude write agents for you is that it's pretty aggressive about setting agents to use Sonnet or even Haiku. So not only will most people not exhaust their plans, but a lot of people who do will do so in part using the cheaper models. When you then factor in Anthropic's reported margins, and their ability to prioritise traffic (e.g. I'd assume that if their capacity is maxed out they'd throttle subscribers in favour of pay-by-the-token? Maybe not, but it's what I'd do), I'd expect the real cost to them of a maxed-out plan to be much lower.
Also, while Opus certainly is a lot better than even the best Chinese models, when I max out my Claude plan I make do with Kimi 2.5. When factoring in the re-runs of changes because of the lower quality, I'd spend maybe 2x as much per unit of work if I were to pay token prices for all my monthly use w/Kimi.
I'd still prefer Claude if the price comes down to 1x, as it's less hassle w/the harder changes, but their lead is effectively less than a year.
Claude Code Max obviously doesn't cost 10x more than Kimi. The article even confirms that you can get $5k worth of compute for $200 with Claude Code Max.
So no, Claude would not be getting NEARLY as much usage as it's currently getting if it weren't for the $100/$200 monthly subscription. You're comparing Kimi to the price that most people aren't paying.
Is it fair to say the Open Router models aren't subsidized though? They make the case that companies on there are running a business, but there are free models, and companies with huge AI budgets that want to gather training data and show usage.
Monopoly isn't the only thing that allows you to charge large margins.
API inference access is naturally a lot more costly to provide compared to Chat UI and Claude Code, as there is a lot more load to handle with less latency. In the products they can just smooth over load curves by handling some of the requests slower (which the majority of users in a background Code session won't even notice).
It's a good question. Costs will be lumpy. Inference servers will have a preferred batch size. Once you have a server you can scale number of users up to that batch size for relatively low cost. Then you need to add another server (or rack) for another large cost.
However I think it's fair to say the cost is roughly linear in the number of users other than that.
There may be some aspects which are not quite linear when you see multiple users submitting similar queries... But I don't think this would be significant.
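A toy version of that step-wise cost curve; the per-server cost and preferred batch size below are made-up assumptions purely for illustration.

    import math

    # Illustrative only: cost steps up one server at a time as concurrent users grow.
    server_cost_per_month = 20_000    # assumed fully loaded cost of one inference server
    preferred_batch_size = 64         # assumed concurrent streams a server handles well

    def monthly_cost(concurrent_users: int) -> int:
        servers = math.ceil(concurrent_users / preferred_batch_size)
        return servers * server_cost_per_month

    for users in (10, 64, 65, 640):
        print(f"{users:>4} users -> ${monthly_cost(users):,}/month "
              f"(${monthly_cost(users) / users:,.0f}/user)")
    # Per-user cost falls as each server fills up, then jumps when a new one is needed.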
The comparison with Qwen/Kimi by "comparable architecture size" is doing a lot of heavy lifting. Parameter count doesn't tell you much when the models aren't in the same league quality-wise.
I wonder if a better proxy would be comparing by capability level rather than size. The cost to go from "good" to "frontier" is probably exponential, not linear - so estimating Anthropic's real cost from what it takes to serve Qwen 397B seems off.
This article is hilariously flawed, and it takes all of 5 seconds of research to see why.
Alibaba is the primary comparison point made by the author, but it's a completely unsuitable comparison. Alibaba is closer to AWS than Anthropic in terms of their business model. They make money selling infrastructure, not inference. It's entirely possible they see inference as a loss leader, and are willing to offer it at cost or below to drive people onto the platform.
We also have absolutely no idea if it's anywhere near comparable to Opus 4.6. The author is guessing.
So the article's primary argument is based on a comparison to a company with an entirely different business model, running a model that the author is just making wild guesses about.
Well, IDK, I have used CC with API billing pretty extensively and managed to spend ~$1000 in one month, more or less. I moved to a Max 20x subscription and I'm using it a bit less (I'm still scared), but not THAT much less, and I'm around 10% weekly usage. I'm not counting the tokens, though.
And on top of that, Anthropic does not run their own compute clusters, do they? They probably get completely ripped off by whoever is renting them the processors.
$200 worth of actual computation is an awful lot of computation.
What this doesn't mention is the "cost" to the public: the inevitable bailouts after it all comes crashing down again, the massive subsidies that datacenters get from taxpayers, the fresh water they consume, the electricity price hikes for everyone else, the noise, air and water pollution and the massive health impact on the surrounding population of every datacenter, the jobs it destroys, and the innocent people it kills through the use of the technology in military targeting and autonomous weapons.
> Aren't they losing money on the retail API pricing, too?
No, they aren't, and probably neither is anyone else offering API pricing. And Anthropic's API margins may be higher than anyone else.
For example, DeepSeek released numbers showing that R1 was served at approximately "a cost profit margin of 545%" (meaning 82% of revenue is profit), see my comment https://news.ycombinator.com/item?id=46663852
They're all looking for outside money because they're all looking for outside money, and so need to keep up with their competitors' investments in training. It's a game of chicken. Once their ability to raise more abates, they'll slow down new training runs and fund them out of inference margins instead, but the first one forced to do so will risk losing market share.
> Qwen 3.5 397B-A17B is a good comparison
It is not. It's a terrible comparison. Qwen, deepseek and other Chinese models are known for their 10x or even better efficiency compared to Anthropic's.
That's why the difference between OpenRouter prices and those official providers isn't that big. Plus, who knows what OpenRouter providers do in terms of quantization. They may be getting 100x better efficiency, hence the competitive price.
That being said, not all users max out their plan, so it's not like each user costs Anthropic 5,000 USD. The hemorrhage would be so brutal they would be out of business in months.
That's a tautology. People think Chinese models are 10x more efficient because they're 10x cheaper, and then you use that to claim that they're 10x more efficient.
Opus isn't that expensive to host. Look at Amazon Bedrock's t/s numbers for Opus 4.5 vs other Chinese models. They're around the same order of magnitude - which means that Opus has roughly the same number of active params as the Chinese models.
Also, you can select BF16 or Q8 providers on OpenRouter.
Opus doubled in speed with version 4.5, leading me to speculate that they had promoted a Sonnet-sized model. The new, faster Opus was the same speed as Gemini 3 Flash running on the same TPUs. I think Anthropic's margins are probably the highest in the industry, but they have to split that with Google by renting their TPUs.
This is not a valid argument. TPS is essentially QoS and can be adjusted; more GPUs allocated will result in higher speed.
> That's a tautology. People think Chinese models are 10x more efficient because they're 10x cheaper
They do have different infrastructure / electricity costs and they might not run on nvidia hardware.
It's not just the models.
Except there are providers that serve both the Chinese models AND Opus. On the same hardware.
Namely, Amazon Bedrock and Google Vertex.
That means normalized infrastructure costs, normalized electricity costs, and normalized hardware performance. Normalized inference software stack, even (most likely). It's about as close to a 1-to-1 comparison as you can get.
Both Amazon and Google serve Opus at roughly ~1/2 the speed of the Chinese models. Note that they are not incentivized to slow down the serving of Opus or of the Chinese models! So that tells you the ratio of active params for Opus and for the Chinese models.
And Microsoft's Azure. It's on all 3 major cloud providers. Which tells me, they can make profit from these cloud providers without having to pay for any hardware. They just take a small enough cut.
https://code.claude.com/docs/en/microsoft-foundry
https://www.anthropic.com/news/claude-in-microsoft-foundry
AWS and GCP both have their own custom inference chips, so a better example for hosting Opus on commodity hardware would be Digital Ocean.
> Both Amazon and Google serve Opus at roughly ~1/2 the speed of the chinese models
The claim being responded to was about 10x, not 0.5x.
x86 vs arm64 could have different performance. The Chinese models could be optimized for different hardware so it could show massive differences.
I mean GN has covered the Nvidia black market in China enough that we pretty much know that they run on Nvidia hardware still.
How is this related to inference, may I ask? Except for some very hardware-specific optimizations of model architecture, there's nothing to prevent one from hosting these models on one's own infrastructure. And that's what many OpenRouter providers, at least some of which are based in the US, are actually doing. Because most of the Chinese models mentioned here are open-weight (except for Qwen, which has one proprietary "Max" model), literally anyone can host them, not just someone from China. So it just doesn't really matter.
I mean sure, but in terms of cost per dollar/per watt of inference Nvidia's GPUs are pretty up there - unless China is pumping out domestic chips cheaply enough.
Also with Nvidia you get the efficiency of everything (including inference) built on/for Cuda, even efforts to catch AMD up are still ongoing afaik.
I wouldn't be surprised if things like DS were trained and now hosted on Nvidia hardware.
> unless China is pumping out domestic chips cheaply enough
They are. Nvidia makes A LOT of profit. Hey, top stock for a reason.
> I wouldn't be surprised if things like DS were trained and now hosted on Nvidia hardware
DS is "old". I wouldn't study them. The new ones have a mandate to at least run on local hardware. There are data center requirements.
I agree it could still have been trained on Nvidia GPUs (black market etc.), but not run on them.
> The new ones have a mandate to at least run on local hardware.
They do? Source?
But if that's true, it would explain why Minimax, Z.ai and Moonshot are all organized as Singaporean holding companies, with claimed data center locations (according to OpenRouter) in the US or Singapore and only the devs in China. Can't be forced to use inferior local hardware if you're just a body shop for a "foreign" AI company. ;)
> with claimed data center locations (according to OpenRouter) in the US or Singapore and only the devs in China
They just have a China only endpoint and likely a company under a different name.
Nothing to do with AI. TikTok is similar (global vs China operations).
Agree, but I'd guess Opus 4.6 is 10x larger, rather than the Chinese models being 10x more efficient. It is said that GPT-4 was already a 1.6T model, and Llama 4 Behemoth is also much bigger than the Chinese open-weight models. Chinese tech companies are short of frontier GPUs, but they did a lot of innovation on inference efficiency (Deepseek CEO Liang himself shows up in the author list of the related published papers).
No, Opus cannot be 10x larger than the Chinese models.
If Opus were 10x larger than the Chinese models, then Google Vertex/Amazon Bedrock would serve it 10x slower than Deepseek/Kimi/etc.
That's not the case. They're in the same order of magnitude of speed.
They serve it about 2x slower. So it must have about 2x the active parameters.
It could still be 10x larger overall, though that would not make it 10x more expensive.
I agree that Opus almost definitely isn't anywhere near that big, but AWS throughput might not be a great way to measure model size.
According to OpenRouter, AWS serves the latest Opus and Sonnet at roughly the same speed. It's likely that they simply allocate hardware differently per model.
GPT-4 was likely much larger than any of the SOTA models we have today, at least in terms of active parameters. Sparse models are the new standard, and the price drop that came with Opus 4.5 made it fairly obvious that Anthropic are not an exception.
Actually, Opus might achieve a lower cost with the help of TPUs.
> Plus who knows what open routed providers do in term quantization
The quantisation is shown on the provider section.
>It is not. It's a terrible comparison. Qwen, deepseek and other Chinese models are known for their 10x or even better efficiency compared to Anthropic's.
I find it a good comparison because it is a good baseline, since we have zero insider knowledge of Anthropic. It gives me an idea that a certain size of model has a certain cost associated with it.
I don't buy the 10x efficiency thing: they are just lagging behind the performance of current SOTA models. They perform much worse than the current models while also costing much less - exactly what I would expect. Current Qwen models perform about as well as Sonnet 3, I think. In 2 years, when Chinese models catch up with enough distillation attacks, they will be as good as Sonnet 4.6 and still be profitable.
> I don't buy the 10x efficiency thing: they are just lagging behind the performance of current SOTA models. They perform much worse than the current models while also costing much less - exactly what I would expect.
Define "much worse".
Where are you getting those benchmark figures from? Math-500 should be closer to 98% for both models: https://artificialanalysis.ai/evaluations/math-500?models=de...
Everyone who's used Opus knows it's better than the others in a way that isn't captured by the benchmarks. I would describe it as taste.
Lots of models get really close on benchmarks, but benchmarks only tell us how good they are at solving a defined problem. Opus is far better at solving ill-defined ones.
>Everyone who's used Opus knows it's better than the others in a way that isn't captured by the benchmarks. I would describe it as taste.
Ah, the "trust me bro" advantage. Couldn't it just be brand identity and familiarity?
I have a project where we've had Opus, Sonnet, Deepseek, Kimi and Qwen create and execute an aggregate total of about 350 plans so far, and the quality difference - as measured by plans where the agent failed to complete the tasks on the first run - is high enough that the cheaper models come out several times more expensive than Anthropic's subscription prices, though probably cheaper than the API prices once we have improved the harness further. At present the challenge is that the human intervention the cheaper models need drives up the cost.
My dashboard goes from all green to 50/50 green/red for our agents whenever I switch from Claude to one of the cheaper agents... This is after investing a substantial amount of effort in "dumbing down" the prompts - e.g. adding a lot of extra wording to convince the dumber models to actually follow instructions - that is not necessary for Sonnet or Opus.
I buy the benchmarks. The problem is that a 10% difference in the benchmarks makes the difference between barely usable and something that can consistently deliver working code unilaterally and require few review interventions. Basically, the starting point for "usable" on these benchmarks is already very far up the scale for a lot of tasks.
I do strongly believe the moat is narrow - with 4.6 I switched from defaulting to Opus to defaulting to Sonnet for most tasks. I can fully see myself moving substantial workloads to a future iteration of Kimi, Qwen or Deepseek in 6-12 months once they actually start approaching Sonnet 4.5 level. But for my use at least, currently, they're at best competing with Anthropic's 3.x models in terms of real-world ability.
That said, even now, I think if we were stuck with current models for 12 months, we might well also be able to build our way around this and get to a point where Deepseek and Kimi would be cheaper than Sonnet.
Eventually we'll converge on good enough harnesses to get away with cheaper models for most uses, and the remaining appeal for the frontier models will be complex planning and actual hard work.
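The trade-off described above can be written as a tiny expected-cost model; the prices, success rates, and cost of a human intervention below are invented for illustration, not measured from that dashboard.

    # Expected cost per successfully completed plan, assuming independent retries
    # and a fixed human-intervention cost whenever an attempt fails. Numbers are made up.
    def cost_per_success(api_cost, p_success, human_cost_per_failure):
        attempts = 1 / p_success               # expected attempts (geometric distribution)
        failures = attempts - 1                # expected failed attempts needing a human
        return attempts * api_cost + failures * human_cost_per_failure

    frontier = cost_per_success(api_cost=2.00, p_success=0.95, human_cost_per_failure=30)
    cheap = cost_per_success(api_cost=0.40, p_success=0.50, human_cost_per_failure=30)
    print(f"frontier: ${frontier:.2f} per completed plan, cheap: ${cheap:.2f}")
    # The cheaper model's API bill is 5x lower, but the human time on failures dominates.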
> That being said not all users max out their plan,
These are not cell phone plans which the average joe takes, they are plans purchased with the explicit goal of software development.
I would guess that 99 out of every 100 plans are purchased with the explicit goal of maxing them out.
I’m not maxing them out… I have issues that I need to fix, features I need to develop, and I have things I want to learn.
When I have a feeling that these tools will speed me up, I use them.
My client pays for a couple of these tools in an enterprise deal, and I suspect most of us on the team work like that.
If my goal was to max out every tool my client pays, I’d be working 24hrs a day and see no sunlight ever.
I guess it’s like the all you can eat buffet. Everybody eats a lot, but if you eat so much that you throw up and get sick, you are special.
My employer bought me a Claude Max subscription. On heavy weeks I use 80% of the subscription. And among software engineers that I know, I'm a relatively heavy user.
Why? Because in my experience, the bottleneck is in shareholders approving new features, not my ability to dish out code.
goal? yeah. but in reality just timing it right (starting a session at 7-8am, to get 2 sessions in a workday, or even 3 if you can schedule something at 5am), i rarely hit limits.
if i hit the limit usually i'm not using it well and hunting around. if i'm using it right i'm basically gassed out trying to hit the limit to the max.
There’s absolutely no way that’s true.
In SaaS this is not true. Most SaaS is highly profitable (or was, I suppose) because they knew that most of their customers would never max out their plans.
A huge number of people are convinced that OpenAI and Anthropic are selling inference tokens at a loss despite the fact that there's no evidence this is true and a lot of evidence that it isn't. It's just become a meme uncritically regurgitated.
This sloppy Forbes article has polluted the epistemic environment because now there's a source to point to as "evidence."
So yes this post author's estimation isn't perfect but it is far more rigorous than the original Forbes article which doesn't appear to even understand the difference between Anthropic's API costs and its compute costs.
I'd love to be a fly on the wall when this argument is tried in front of a bankruptcy court. It drives me nuts. Of course there's evidence that they're selling tokens at a loss.
The only thing these companies sell are tokens. That's their entire output. OpenAI is trying to build an ad business but it must be quite small still relative to selling tokens because I've not yet seen a single ad on ChatGPT. It's not like these firms have a huge side business selling Claude-themed baseball caps.
That means the cost of "inference" is all their costs combined. You can't just arbitrarily slice out anything inconvenient and say that's not a part of the cost of generating tokens. The research and training needed to create the models, the salaries of the people who do that, the salaries of the people who build all the serving infrastructure, the loss leader hardcore users - all of it is a part of the cost of generating each token served.
Some people look at the very different prices for serving open weights models and say, see, inference in general is cheap. But those costs are distorted by companies trying to buy mindshare by giving models away for free, and on top of that, both of the top labs keep claiming the Chinese labs are distilling them like crazy, including using many tactics to evade blocks! So apparently the cost of a model like DeepSeek is still partly being subsidized by OpenAI and Anthropic against their will. The cost of those tokens is higher than what's being charged, it's just being shifted onto someone else's books. Nice whilst it lasts, but this situation has been seen many times in the past and eventually people get tired of having costs externalized onto them.
For as long as firms are losing money whilst only selling tokens, that means those tokens are selling at a loss. To not sell tokens at a loss the companies would have to be profitable.
This is all true but it isn't really important for the argument people are making. What is more important is the marginal cost per token. If each token sold were at a marginal loss, their losses would scale with usage, and that simply can't be happening with API pricing. But in general, yes I agree with you and I'm sure they are taking a huge loss on Claude Code.
The article is about compute cost though. By "lose money on inference" I mean the assertion that inference has negative gross margins which a lot of people truly believe. This is important because it's common to reason from this that LLM's are uneconomical and a ticking time bomb where prices will have to be jacked up several orders of magnitude just to cover the compute used for the tokens.
But there's no such thing as compute cost in the abstract. What exactly is compute cost for AI? Does it include:
• Inference used for training? Modern training pipelines aren't just gradient descent, there's a ton of inference used in them too.
• Gradient descent itself?
• The CPUs and disks storing and managing the datasets?
• The web servers?
• The people paid to swap out failed components at the dc?
Let's say you try and define it to mean the same as unit economics - what does it cost you to add an additional customer vs what they bring in. There's still no way to do this calculation. It's like trying to compute the unit economics of a software company. Sure, if you ignore all the R&D costs of building the software in the first place and all the R&D costs of staying competitive with new versions, then the unit economics look amazing, but there's still plenty of loss-making software startups in the world.
Unit economics are a useful heuristic for businesses where there aren't any meaningful base costs required to stay in the game because they let you think about setup costs separately. Manufacturing toys, private education, farming... lots of businesses where your costs are totally dominated by unit economics. AI isn't like that.
You're missing costs.
- Amortized training costs.
- SG&A.
- Capex depreciation.
All the above impact profitability over various time horizons and have to be rolled into present and projected P&L and cash flow analysis.
We have amortized training cost estimates. Inference to training compute over model lifetime is 10:1 or over for major models at major providers.
In part due to base model reuse and all the tricks like distillation. But mainly, due to how much inference the big providers happen to sell.
So, not the massive economic loss you'd need to push models away from being profitable. Capex and R&D take the cake there.
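To make the amortization point concrete with deliberately made-up numbers (a hypothetical training run and lifetime token volume, not anyone's real figures):

```python
# Toy amortization of a training run over lifetime inference volume.
# Both inputs are placeholders, not real figures for any lab.

TRAINING_COST_USD = 300e6     # assumed cost of one frontier training run
LIFETIME_TOKENS = 1_000e12    # assumed tokens served over the model's lifetime

amortized_per_million = TRAINING_COST_USD / (LIFETIME_TOKENS / 1e6)
print(f"~${amortized_per_million:.2f} of training cost per million tokens served")
# ~$0.30/M with these assumptions -- small next to typical per-million API prices,
# which is the point: at high inference volume, amortized training is not what
# makes individual tokens unprofitable.
```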
> A huge number of people are convinced that OpenAI and Anthropic are selling inference tokens at a loss despite the fact that there's no evidence this is true
There's quite a lot of evidence. No proof, I'd agree, but then there's no absolute proof I'm aware of to the contrary either, so I don't know where you're getting this from.
The two pieces of evidence I'm aware of are that 1) Anthropic doesn't want their subsidised plans being used outside of CC, which would imply that the money they're making off it isn't enough, and 2) last time I checked, API spending is capped at $5000 a month
Like I say, neither of these are proof, you can come up with reasonable arguments against them, but once again the same could be said for evidence on the contrary
> which would imply that the money they're making off it isn't enough
I don't think this logically follows. An unlimited buffet doesn't let you resell all of the food out the backdoor. At some level of usage any fixed price plan becomes unprofitable.
I agree the 5k cap is interesting as evidence although as you said I suspect there are other reasons for it.
As for evidence against it: The Information reported that OpenAI and Anthropic have had 30%+ gross margins for the last few years. Sam Altman and Dario have both claimed inference is profitable in various scattered interviews. Other experts seem to generally agree too. A quick search found a tweet from former PyTorch team member Horace He: https://x.com/typedfemale/status/1961197802169798775 and a response to it in agreement from Anish Tondwalkar, former researcher at OpenAI and Google Brain.
I get the other things, but believing Altman's words is not high on the list of things to be considered evidence.
But a simple assumption that Anthropic runs a normal large MoE LLM (which it almost certainly does) suggests that the actual price of running it (mostly energy) is pretty small.
Does this not count as evidence? I would agree that it sounds a little shaky, but I would not say there is no evidence.
https://www.wheresyoured.at/oai_docs/
I think the wafer scale compute is a massive deal. It's already being leveraged for models you can use right now and the reception on HN has been negligible. The entire model lives in SRAM. This is orders of magnitude faster than HBM/DRAM. I can't imagine they couldn't eventually break even using hardware like this in production.
I calculated only last weekend that my team would cost, if we ran Claude Code at retail API prices, around $200k/mo. We pay $1400/month in Max subscriptions. So that's $50k/user... But looking at the tokens CC reports in its JSON, a lot of this must be cached etc., so I doubt the real cost is anywhere near $50k, but I'm not sure how to figure out what it would cost and I'm sure as hell not going to try.
I'm fascinated to know the kind of work that allows you to intelligently allocate that much in resources. I use Claude extensively and feel that I get great value out of it, but I reach a limit in terms of what I can do that makes sense relatively quickly, it seems.
Yea basically we have an app that’s like Netflix but for dogs, so people can leave on dog oriented shows for their dogs when they get kombucha or coffee
Omg, I can't believe that's real
I wanted to believe that you're essentially trolling, but no - that service exists. And it's not an upstart; there is coverage going back several years.
Our societies are seriously fucked.
Never have I read something that screamed Bay Area more than this lmao.
This product wouldn't be needed if they had a Juicero to dispense refreshing fresh-squeezed juice anytime.
Same for me, but I suppose it comes from letting agents loose more, checking the code less, and being willing to throw away lots of generated output.
Gemini CLI shows how much was saved through caching each session, and it's usually somewhere around 90%
I'm surprised, isn't it forbidden to use the Max plan as part of a company? Just curious, as I thought it was forbidden by the ToS but I'm not sure if I have a good understanding of it
?
Claude Code has a Teams plan which includes Max tiers. Why would it be forbidden?
Most companies forbid it though, since you're not covered by any legal protection - for example, Anthropic can use your data or code to train new models and more.
This may have been the case a year+ ago but it is no longer the case; it used to be most, now it is some/few.
There is nothing in the TOS, last time I checked, forbidding its use with Claude Code. It's only forbidden to utilize it in the running of the business.
So getting Claude Code subscriptions for developers should be permissible and not be against anything... However, if you created a REST endpoint to e.g. run a preconfigured prompt as part of your platform, that'd be against it
But I'm neither a lawyer nor work for anthropic
Ah, that makes sense. I hope they mean that then. We are just devs using it to write code; not selling it on.
Surely that can't be true? The expectation would be that people pay $200 a month for building open source and personal hobby software with Claude?
Yeah, that would end that really quickly. I use Pro for personal stuff. If $200 is not allowed for companies I don't think anyone would use it, at all.
If they believe a sufficient number is locked in then they may consider doing this later.
If that were true, then everyone I know is violating that tos
You can use `npx ccusage` to check your local logs and see how much it would have cost through the API.
> but not sure how to figure out what it would cost and I'm sure as hell not going to try.
Ask Opus to figure out how much it would cost. Lol.
I think the main issue I have with the article is that the author's whole argument is based on 'Qwen wouldn't run at a loss'. But why wouldn't it? Despite it being a business, there might be a number of reasons why they'd decide to run without profit for now: from trying to expand the user base, to the Chinese government sponsoring Chinese AI business.
Hi, OP here! Even if Qwen wants to run at a loss, why would Together, DeepInfra, SiliconFlow, etc _all_ also want to run at a similar loss?
To capture market.
Nothing differentiates them. Anything they capture is based only on price and when they raise it, they lose it entirely.
> Cost remains an ever present challenge. Cursor’s larger rivals are willing to subsidize aggressively. According to a person familiar with the company’s internal analysis, Cursor estimated last year that a $200-per-month Claude Code subscription could use up to $2,000 in compute, suggesting significant subsidization by Anthropic. Today, that subsidization appears to be even more aggressive, with that $200 plan able to consume about $5,000 in compute, according to a different person who has seen analyses on the company’s compute spend patterns.
This is the relevant quote from the original article.
If Anthropic's compute is fully saturated then the Claude code power users do represent an opportunity cost to Anthropic much closer to $5,000 than $500.
Anthropic's models may be similar in parameter size to models on OpenRouter, but none of the others are in the headlines nearly as much (especially recently) so the comparison is extremely flawed.
The argument in this article is like comparing the cost of a Rolex to a random brand of mechanical watch based on gear count.
But opportunity cost is not actual cost. “If everyone just kept paying but used our service less we would be more profitable” is true, but not in any meaningful way.
Are Anthropic currently unable to sell subscriptions because they don’t have capacity?
Opportunity costs are real. In many cases they are more real than 'actual costs'. However, I otherwise agree with you.
> Are Anthropic currently unable to sell subscriptions because they don’t have capacity?
Absolutely! I'm currently paying $170 to Google to use Opus in Antigravity without limit in full agent mode, because I tried the Anthropic $20 subscription and busted my limit within a single prompt. I'm not gonna pay them $200 only to find out I hit the limit after 20 or even 50 prompts.
And after 2 more months my price is going to double to over $300, and I still have no intention of even trying the 20x Max plan, if it's really just 20x more prompts than Pro.
This has absolutely nothing to do with whether they're limited by available compute...
What? Wouldn't they give me more than 1 prompt of compute for my $20, if they had spare?
I don't think that logically follows.
They have a business model and are trying to capture more revenue; fully saturating their compute isn't obviously a good business strategy.
If anything, you are confirming that $170 covers heavy Opus use profitably for the provider.
Opportunity cost is not the same thing as actual cost. They might have made more money if they were capable of selling the API instead of CC, but I would never tell my company to use CC all the time if I didn’t have a personal subscription.
You’re looking through the wrong end of the telescope. An investor is buying opportunity and it is a real cost to them.
Still makes no sense as they’d lose revenue, data, and scale if they don’t subsidize.
> If Anthropic's compute is fully saturated then the Claude code power users do represent an opportunity cost to Anthropic much closer to $5,000 than $500.
I think it's the other way around? Sparse use of GPU farms should be the more expensive thing. Full saturation means that we can exploit batching effects throughout.
Don’t give them any ideas, please! I need my 100 USD subscription with generous Opus usage!
Google's Antigravity has Opus access, and I suspect it's subsidised.
You know who also loves to use the term "opportunity cost"?
The entertainment industry. They still tell you about how much money they're leaving on the table because people pirate stuff.
What would happen in reality for entertainment is people would "consume" far less "content".
And what would happen in reality for Anthropic is people would start asking themselves if the unpredictability is worth the price. Or at best switch to pay as you go and use the API far less.
I prefer car analogies
> The argument in this article is like comparing the cost of a Rolex to a random brand of mechanical watch on gear count
I mean... Rolex is an overpriced brand whose cost to consumers is mainly just marketing in itself. Its production cost is nowhere close to the selling price, and looking at gears is a fair way of evaluating that
> production cost is nowhere close to selling price
When has production cost had anything to do with selling price?
Not directly. But if production cost is above selling price, you typically tend to get less production. And if production cost is (way) below selling price, that tends to invite competition.
You can rent the GPUs and everything needed to run the model. Opportunity cost is not a real cost here.
Only thing that matters is if the users would have paid $5000 if they don't have option to buy subscription. And I highly doubt they would have.
I'm using the API directly for software development; I'm on a path to pay ~$5k this month per user, some less, some more, and with daily use it just keeps growing.
How confident are you in the Opus 4.6 model size? I've always assumed it was a beefier model with more active params than Qwen 397B (17B active on the forward pass)
Yeah, that's a massive assumption they're making. I remember Musk revealed Grok was multiple trillion parameters. I find it likely Opus is larger.
I'm sure Anthropic is making money off the API but I highly doubt it's 90% profit margins.
> I find it likely Opus is larger.
Unlikely. Amazon Bedrock serves Opus at 120 tokens/sec.
If you want to estimate "the actual price to serve Opus", a good rough estimate is to find the price max(Deepseek, Qwen, Kimi, GLM) and multiply it by 2-3. That would be a pretty close guess to actual inference cost for Opus.
It's impossible for Opus to be something like 10x the active params as the chinese models. My guess is something around 50-100b active params, 800-1600b total params. I can be off by a factor of ~2, but I know I am not off by a factor of 10.
Are you sure you can use tps as a proxy?
In practice, tps is a reflection of vram memory bandwidth during inference. So the tps tells you a lot about the hardware you're running on.
Comparing tps ratios- by saying a model is roughly 2x faster or slower than another model- can tell you a lot about the active param count.
I won't say it'll tell you everything; I have no clue what optimizations Opus may have, which can range from native FP4 experts to spec decoding with MTP to whatever. But considering chinese models like Deepseek and GLM have MTP layers (no clue if Qwen 3.5 has MTP, I haven't checked since its release), and Kimi is native int4, I'm pretty confident that there is not a 10x difference between Opus and the chinese models. I would say there's roughly a 2x-3x difference between Opus 4.5/4.6 and the chinese models at most.
> In practice, tps is a reflection of vram memory bandwidth during inference.
> Comparing tps ratios- by saying a model is roughly 2x faster or slower than another model- can tell you a lot about the active param count.
You sure about that? I thought you could shard between GPUs along layer boundaries during inference (but not training obviously). You just end up with an increasingly deep pipeline. So time to first token increases but aggregate tps also increases as you add additional hardware.
That doesn't work. Think about it a bit more.
Hint: what's in the kv cache when you start processing the 2nd token?
And that's called layer parallelism (as opposed to tensor parallelism). It allows you to run larger models (pooling vram across gpus) but does not allow you to run models faster.
Tensor parallelism DOES allow you to run models faster across multiple GPUs, but you're limited to how fast you can synchronize the all-reduce. And in general, models would have the same boost on the same hardware- so the chinese models would have the same perf multiplier as Opus.
Note that providers generally use tensor parallelism as much as they can, for all models. That usually means 8x or so.
In reality, tps ends up being a pretty good proxy for active param size when comparing different models at the same inference provider.
Oh I see. I went and confused total aggregate throughput with per-query throughput there didn't I.
You can estimate on tok/second
The Trillions of parameters claim is about the pretraining.
It’s most efficient in pre training to train the biggest models possible. You get sample efficiency increase for each parameter increase.
However those models end up very sparse and incredibly distillable.
And it’s way too expensive and slow to serve models that size so they are distilled down a lot.
GPT 4 was rumoured/leaked to be 1.8T. Claude 3.5 Sonnet was supposedly 175B, so around 0.5T-1T seems reasonable for Opus 3.5. Maybe a step up to 1-3T for Opus 4.0
Since then, inference pricing for new models has come down a lot, despite increasing pressure to be profitable. Opus 4.6 costs 1/3rd what Opus 4.0 (and 3.5) cost, and GPT 5.4 1/4th what o1 cost. You could take that as an indication that inference costs have also come down by at least that degree.
My guess would have been that current frontier models like Opus are in the realm of 1T params with 32B active
Anthropic CEO said 50%+ margins in an interview. I'm guessing 50 - 60% right now.
Even if it's larger, OpenRouter has DeepSeek v3.2 (685B/37B active) at $0.26/0.40 and Kimi K2.5 (1T/32B active) at $0.45/2.25 (mentioned in the post).
Opus 4.6 likely has on the order of 100B active parameters. OpenRouter lists the following throughput for Google Vertex: roughly 42 t/s for Opus 4.6, 143 t/s for GLM 4.7 (32B active), and 70 t/s for Llama 3.3 70B (dense).
For GLM 4.7, that makes 143 * 32B = 4576B parameters per second, and for Llama 3.3, we get 70 * 70B = 4900B, which makes sense since denser models are easier to optimize. As a lower bound, we get 4576B / 42 ≈ 109B active parameters for Opus 4.6. (This makes the assumption that all three models use the same number of bits per parameter and run on the same hardware.)
Yep, you can also get similar analysis from Amazon Bedrock, which serves Opus as well.
I'd say Opus is roughly 2x to 3x the price of the top Chinese models to serve, in reality.
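For anyone who wants to redo the estimate above with their own numbers, here's the same napkin math as a tiny script. The throughputs are the per-request t/s quoted above, the open models' active-param counts are public figures, and the whole thing assumes identical hardware, precision and serving stack, so treat the output as an order-of-magnitude sketch rather than a measurement:

```python
# Napkin math: infer a closed model's active params from decode throughput ratios.
# Assumption: on the same provider/hardware, single-request decode speed is roughly
# inversely proportional to the active parameters touched per token.

def implied_active_params(ref_tps: float, ref_active_b: float, target_tps: float) -> float:
    """Scale a reference model's active params by the throughput ratio."""
    return ref_tps * ref_active_b / target_tps

opus_tps = 42.0  # ~42 t/s for Opus 4.6 on Vertex (quoted above)
references = {
    "GLM 4.7 (32B active, ~143 t/s)": (143.0, 32.0),
    "Llama 3.3 (70B dense, ~70 t/s)": (70.0, 70.0),
}

for name, (tps, active_b) in references.items():
    print(f"vs {name}: ~{implied_active_params(tps, active_b, opus_tps):.0f}B active")
# Roughly 109B and 117B under these assumptions -- consistent with "order of 100B".
```

Quantization, MTP/speculative decoding and batching policy can all skew this, which is why a factor-of-2 error bar is fair but a factor-of-10 claim isn't.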
Also curious if any experts can weigh in on this. I would guess in the 1 trillion to 2 trillion range.
Try 10s of trillions. These days everyone is running 4-bit at inference (the flagship feature of Blackwell+), with the big flagship models running on recently installed Nvidia 72-GPU Rubin clusters (and equivalent-ish world size for those rented Ironwood TPUs Anthropic also uses). Let's see, Vera Rubin racks come standard with 20 TB (Blackwell NVL72 with 10 TB) of unified memory, and NVFP4 fits 2 parameters per byte...
Of course, intense sparsification via MoE (and other techniques ;) ) lets total model size largely decouple from inference speed and cost (within the limits of world size via NVLink/TPU torus caps)
So the real mystery, as always, is the actual parameter count of the activated head(s). You can do various speed benchmarks and TPS tracking across likely hardware fleets, and while an exact number is hard to compute, let me tell you, it is not 17B or anywhere in that particular OOM :)
Comparing Opus 4.6 or GPT 5.4 thinking or Gemini 3.1 pro to any sort of Chinese model (on cost) is just totally disingenuous when China does NOT have Vera Rubin NVL72 GPUs or Ironwood V7 TPUs in any meaningful capacity, and is forced to target 8-GPU Blackwell systems (and worse!) for deployment.
Nobody is running 10s of trillion param models in 2026. That's ridiculous.
Opus is 2T-3T in size at most.
Do you have any clues to guess the total model size? I do not see any limitations to making models ridiculously large (besides training), and the Scaling Law paper showed that more parameters = more better, so it would be a safe bet for companies that have more money than innovative spirit.
> I do not see any limitations to making models ridiculously large (besides training)
From my understanding, the "besides training" is a big issue. As I noted earlier[1], Qwen3 was much better than Qwen2.5, but the main difference was just more and better training data. The Qwen3.5-397B-A17B beat their 1T-parameter Qwen3-Max-Base, again a large change was more and better training data.
[1]: https://news.ycombinator.com/item?id=47089780
China is targeting H20 because that's all they were officially allowed to buy.
I generally agree; back-of-the-napkin math shows an H20 cluster of 8 GPUs * 96 GB = 768 GB = 768B parameters at FP8 (no NVFP4 on Hopper), which lines up pretty nicely with the sizes of recent open source Chinese models.
However, I'd say its relatively well assumed in realpolitik land that Chinese labs managed to acquire plenty of H100/200 clusters and even meaningful numbers of B200 systems semi-illicitly before the regulations and anti-smuggling measures really started to crack down.
This does somewhat raise the question of how nicely the closed source variants, of undisclosed parameter counts, fit within the 1.1 TB of H200 or 1.5 TB of B200 systems.
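Same flavour of napkin math for the memory side, if anyone wants to sanity-check those node sizes; this counts weight storage only and ignores KV cache, activations and replication, so deployable model sizes are lower in practice (the per-GPU memory figures are the usual spec-sheet numbers):

```python
# Weights-only capacity of a single node at a given precision.
# Ignores KV cache, activations, and any redundancy across replicas.

def weight_capacity_b(num_gpus: int, gb_per_gpu: float, bytes_per_param: float) -> float:
    # GB divided by bytes-per-param gives capacity in billions of parameters
    return num_gpus * gb_per_gpu / bytes_per_param

print(weight_capacity_b(8, 96, 1.0))    # 8x H20  @ FP8   -> ~768B params
print(weight_capacity_b(8, 141, 1.0))   # 8x H200 @ FP8   -> ~1128B params
print(weight_capacity_b(8, 192, 0.5))   # 8x B200 @ NVFP4 -> ~3072B params
```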
They do not have enough H200 or Blackwell systems to serve 1.6 billion people and the world, so I doubt it's in any meaningful number.
I assure you, the number of people paying to use Qwen3-Max or other similar proprietary endpoints is far less than 1.6 billion.
You don't need to assure me. It's a theoretical maximum.
tldr: the author argues it is closer to costing 500 USD per month IF a user hits their weekly rate limits every week.
Which is probably a lot more correct than other claims. However it's also true that anybody who has to use the API might pay that much, creating a real cost-per-token moat for Anthropic's Claude Code vs other models as long as they are so far ahead in terms of productivity.
A Claude subscription is the equivalent of a spot instance,
and the APIs are the on-demand service equivalent.
Priority is set to APIs and leftover compute is used by Subscription Plans.
When there is no capacity, subscriptions are routed to Highly Quantized cheaper models behind the scenes.
Selling subscriptions makes it cheaper to run such inference at scale; otherwise, much of the time your capacity is just sitting there idle.
Also, these subscriptions help them train their model further on predictable workflows (because the model creators also control the client, like Qwen Code, Claude Code, Antigravity etc...)
This is probably why they will ban you for violating the TOS if you use their subscription models with other tools.
They aren't just selling a subscription; the subscriptions also help them get better at the thing they are selling, which for coding models like Qwen and Claude is coding.
I've used qwen code, codex and claude.
Codex is 2x better than Qwen code and Claude is 2x better than Codex.
So I'd hope Claude Opus is at least 4-5x more expensive to run than the flagship Qwen Code model hosted by Alibaba.
> Claude is 2x better than Codex
This hasn't been true in a long time.
Not only that, but since the release of 5.4 and 5.3 codex I've been running them in parallel and I've been let down by Opus 4.6 with maximum thinking way more than I've been let down with OpenAI models.
In fact I'm more and more inclined to run my own benchmarks from now on, because I seriously distrust those I see online.
Even if the benchmarks are indeed valid, they just don't reflect my use cases, usages and ability to navigate my projects and my dependencies.
imho they're mostly better at a subset of different tasks. I find codex to be better at reasoning through bugs and reviewing code when compared to Opus, but for writing code I find Claude a lot better.
Maybe that's just CLAUDE.md and memory causing the difference of course.
As a matter of preference however I like the way Claude Code works just a lot better, instructing it to work with parallel subagents in work trees etc. just matches the way I think these things should work I guess.
My impression as well, especially since 5.2 which I felt was on par or better than Opus 4.5
> When there is no capacity, subscriptions are routed to Highly Quantized cheaper models behind the scenes.
Have they announced this?
> Have they announced this?
No and indeed they have said they never do this at all.
What people don't realize is that cache is *free*, well not free, but compared to the compute required to recompute it? Relatively free.
If you remove the cached-token cost from the pricing, the overall API usage drops from around $5,000 to $800 (or $200 per week) on the $200 Max subscription. Still 4x cheaper than the API, but not costing them money either - if I had to guess, it's break-even, as the compute is most likely going idle otherwise.
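Rough illustration of how much the cache discount matters on a Claude-Code-style workload. The per-million rates and the cache-read discount below are placeholders in the spirit of Opus list pricing, not the actual current prices, and the 90% hit rate is the figure quoted elsewhere in this thread:

```python
# How an agentic coding bill changes when most input tokens are cache reads.
# All prices are illustrative placeholders -- check the current pricing page.

PRICE_IN, PRICE_OUT, PRICE_CACHE_READ = 15.00, 75.00, 1.50  # $/M tokens (assumed)

def monthly_bill(input_m: float, output_m: float, cache_hit: float) -> float:
    fresh, cached = input_m * (1 - cache_hit), input_m * cache_hit
    return fresh * PRICE_IN + cached * PRICE_CACHE_READ + output_m * PRICE_OUT

# A heavy month: 300M input tokens (mostly re-read context), 5M output tokens.
print(monthly_bill(300, 5, cache_hit=0.0))   # ~$4875 if every input token billed fresh
print(monthly_bill(300, 5, cache_hit=0.9))   # ~$1230 with a 90% cache-hit rate
```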
Cache definitely isn't free! We're in a global RAM shortage and KV caches sit around consuming RAM in the hope that there will be a hit.
The gamble with caching is to hold a KV cache in the hope that the user will (a) submit a prompt that can use it and (b) that will get routed to the right server which (c) won't be so busy at the time it can't handle the request. KV caches aren't small so if you lose that bet you've lost money (basically, the opportunity cost of using that RAM for something else).
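To put a rough number on that RAM bet: a generic KV-cache size estimate for a hypothetical transformer (every shape parameter below is a made-up placeholder, since nobody outside the labs knows Claude's config):

```python
# Rough KV-cache footprint per in-flight request for a hypothetical transformer.
# All shape parameters are placeholders, not any real model's configuration.

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # 2x for K and V
    return per_token * context_tokens / 1e9

# e.g. 90 layers, 8 KV heads (GQA), head dim 128, a 200K-token agentic context, FP16:
print(f"{kv_cache_gb(90, 8, 128, 200_000):.1f} GB held per request")  # ~73.7 GB
# That's a lot of HBM sitting around waiting for a cache hit, which is the gamble above.
```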
> What people don't realize is that cache is free
I'm incredibly salty about this - they're essentially monetizing intensely something that allows them to sell their inference at premium prices to more users - without any caching, they'd have much less capacity available.
> [...] if I had to guess it's break even as the compute is most likely going idle otherwise.
Why would it go idle? It would go to their next best use. At least they could help with model training or let their researchers run experiments etc.
Inference compute is vastly different from training compute; also, the model has to stay hot in VRAM, which probably takes up most of it. There is limited use for THAT much compute as well; they are running things like the Claude Code compiler and even then they're scratching the surface of the amount of compute they have.
Training currently requires Nvidia's latest and greatest for the best models (they also use Google TPUs now, which are also technically the latest and greatest? However, they're more of a dual-purpose thing than anything, afaik, so that would be a correct assessment in that case).
Inference can run on a hot potato if you really put your mind to it
I think I've heard multiple times that a large % of training compute for SoTA models is inference to generate training tokens; this is bound to happen with RL training.
They can run any number of inference experiments. Like a lot of the alignment work they have going on.
I am not saying this would be a great use of their compute, but idle is far from the only alternative. (Unless electricity is the binding constraint?)
Electricity is charged whether you use it or not, so very unlikely; but sure, they can find uses for it, although they are not going to make that much money compared to Claude Code subscriptions.
> Electricity is charged whether you use it or not, [...]
Huh, what? You know you can turn off unused equipment, and at least my Nvidia GPU draws more or fewer watts depending on load even when turned on?
Or does Anthropic have a flatline deal for electricity and cooling?
“The article is right to separate compute cost from retail price — but the retail price baseline itself is arbitrary depending on where you run the model. The same capability (e.g. Llama 3.3 70B with tool calling and 128K context) runs $3.00/1M tokens at model developer list price and $0.22/1M at Fireworks AI — a 93% gap for identical specs. That spread makes any “it costs Anthropic X” estimate depend entirely on which reference price you anchor to. We track this live across 1,625 SKUs and 40+ vendors at a7om.com — the variance across the market is larger than most people realise when they back-calculate provider economics.”
This is such a well-written essay. Every line revealed the answer to the immediate question I had just thought of
I can’t get past all the LLM-isms. Do people really not care about AI-slopifying their writing? It’s like learning about bad kerning, you see it everywhere.
I had a similar reaction to OP for a different post a few weeks back - I think some analysis on the health economy. Initially as I was reading I thought - "Wow, I've never read a financial article written so clearly". Everything in layman's terms. But as I continued to read, I began to notice the LLM-isms. Oversimplified concepts, "the honest truth" "like X for Y", etc.
Maybe the common factor here is not having deep/sufficient knowledge on the topic being discussed? For the article I mentioned, I feel like I was less focused on the strength of the writing and more on just understanding the content.
LLMs are very capable at simplifying concepts and meeting the reader at their level. Personally, I subscribe to the philosophy of - "if you couldn't be bothered to write it, I shouldn't bother to read it".
Alternate theory... a few months into the LLMism phenomenon, people are starting to copy the LLM writing style without realizing it :(
This happens to non-native English speakers a lot (like me). My style of writing is heavily influenced by everything I read. And since I also do research using LLMs, I'll probably sound more and more like an AI as well, just from reading its responses constantly.
I just don't know what's supposed to be natural writing anymore. It's not in the books, disappears from the internet, what's left? Some old blogs for now maybe.
I think you're just hallucinating because this does not come across as an AI article
I see quite a few:
“what X actually is”
“the X reality check”
Overuse of “real” and “genuine”:
> The real story is actually in the article. … And the real issue for Cursor … They have real "brand awareness", and they are genuinely better than the cheaper open weights models - for now at least. It's a real conundrum for them.
> … - these are genuinely massive expenses that dwarf inference costs.
This style just screams “Claude” to me.
It was almost certainly at least heavily edited with one. Ignoring the content, every single thing about the structure and style screams LLM.
> I think you're just hallucinating because this does not come across as an AI article
It has enough tells in the correct frequency for me to consider it more than 50% generated.
Name checks out
It's really unfortunate that we call well-structured writing 'LLM-isms' now.
I don’t see the usual tells in this essay
People care, when they can tell.
Popular content is popular because it is above the threshold for average detection.
In a better world, platforms would empower defenders, by granting skilled human noticers flagging priority, and by adopting basic classifiers like Pangram.
Unfortunately, mainstream platforms have thus far not demonstrated strong interest in banning AI slop. This site in particular has actually taken moderation actions to unflag AI slop, in certain occasions...
It is certainly very obvious a lot of the time. I wonder if we revisited the automated slop detection problem we’d be more successful now… it feels like there are a lot more tells and models have become more idiosyncratic.
Tons of companies do this already. It's not like this is a problem that nobody is constantly revisiting...
the openrouter comparison is interesting because it shows what happens when you have actual supply-side competition. multiple providers, different quantizations, price competition. the spread between cheapest and priciest for the same model can be 3-5x.
anthropic doesn't have that. single provider, single pricing decision. whether or not $5k is accurate, the more interesting question is what happens to inference pricing when the supply side is genuinely open. we're seeing hints of it with openrouter but it's still intermediated
not saying this solves anthropic's cost problem, just that the "what does inference actually cost" question gets a lot more interesting when providers are competing directly
Good article! Small suggestions:
1. It would be nice to define terms like RSI or at least link to a definition.
2. I found the graph difficult to read. It's a computer font that is made to look hand-drawn and it's a bit low resolution. With some googling I'm guessing the words in parentheses are the clouds the model is running on. You could make that a bit more clear.
These margins are far greater than the ones Dario has indicated during many of his recent podcasts appearances.
What did he say?
What CC costs internally is not public. How efficient it is, is not public.
…You could take efficiency improvement rates from previous model releases (from x -> y) and assume they have already made similar "improvements" internally. This is likely closer to what their real costs are.
Was anyone under the impression that it does? Serious question. I've never heard that, personally.
Ed Zitron made that claim (in particular here: [1]). In the same article he admits he not a programmer, and had to ask someone else to try out Claude Code and ccusage for him. He doesn't have any understanding of how LLMs or caching works. But he's prominent because he's received leaked financial details for Anthropic and OpenAI, eg [2]
[1] https://www.wheresyoured.at/anthropic-is-bleeding-out/ [2] https://www.wheresyoured.at/costs/
Maybe I'm misreading it, but I don't see him saying it's just the cost of *inference* alone (which is the strawman that the article in the OP is arguing against). He says:
> this company is wilfully burning 200% to 3000% of each Pro or Max customer that interacts with Claude Code
There is of course this meme that "Anthropic would be profitable today if they stopped training new models and only focused on inference", but people on HN are smart enough to understand that this is not realistic due to model drift, and also due to competition from other models. So training is forever a part of the cost of doing business, until we have some fundamental changes in the underlying technology.
I can only interpret Ed Zitron as saying "the cost of doing business is 200% to 3000% of the price users are paying for their subscriptions", which sounds extremely plausible to me.
You would be surprised because there are lots of posters here who think that the cost is so enormous that this whole industry is unviable.
I mean, the very first paragraph of TFA is describing who is under that impression. Literally the first sentence:
> My LinkedIn and Twitter feeds are full of screenshots from the recent Forbes article on Cursor claiming that Anthropic's $200/month Claude Code Max plan can consume $5,000 in compute.
That's claiming that worst case, a subscriber _can_ use that much. It's possible that's wrong too, but in any case a lot of services are built on the assumption that the average user doesn't max out the plan.
So the article's title is obviously sensationalized.
I have no problem believing that a Claude Max plan can consume the equivalent of $5000 worth of retail Opus use, but one interesting thing you'll see if you e.g. have Claude write agents for you is that it's pretty aggressive about setting agents to use Sonnet or even Haiku, so not only will most people not exhaust their plans, but a lot of people who do will do so in part using the cheaper models. When you then factor in Anthropic's reported margins, and their ability to prioritise traffic (e.g. I'd assume that if their capacity is maxed out they'd throttle subscribers in favour of pay-by-the-token? Maybe not, but it's what I'd do), I'd expect the real cost to them of a maximised plan to be much lower.
Also, while Opus certainly is a lot better than even the best Chinese models, when I max out my Claude plan, I make do with Kimi 2.5. When factoring in the re-runs of changes because of the lower quality, I'd spend maybe 2x as much per unit of work if I were to pay token prices for all my monthly use w/Kimi.
I'd still prefer Claude if the price comes down to 1x, as it's less hassle w/the harder changes, but their lead is effectively less than a year.
Twitter.
By the way, one of the charts in the article shows that Opus 4.6 is 10x costlier than Kimi K2.5.
I thought there was no moat in AI? Even being 10x costlier, Anthropic still doesn't have enough compute to meet demand.
Those "AI has no moat" opinions are going to be so wrong so soon.
Claude Code Max obviously doesn't cost 10x more than Kimi. The article even confirms that you can get $5k worth of compute for $200 with Claude Code Max.
So no, Claude would not be getting NEARLY as much usage as it's currently getting if it weren't for the $100/$200 monthly subscription. You're comparing Kimi to the price that most people aren't paying.
Is it fair to say the Open Router models aren't subsidized though? They make the case that companies on there are running a business, but there are free models, and companies with huge AI budgets that want to gather training data and show usage.
Why does Claude charge 10x for API, compared to subscriptions? They're not a monopoly, so one would expect margins to be thinner.
Monopoly isn't the only thing that allows you to charge large margins.
API inference access is naturally a lot more costly to provide compared to Chat UI and Claude Code, as there is a lot more load to handle with less latency. In the products they can just smooth over load curves by handling some of the requests slower (which the majority of users in a background Code session won't even notice).
I have a very naive question:
People in the comments are assuming that Anthropic's model is 10 times bigger than the Chinese models, so the calculated cost is 10 times more.
But from the perspective of Big O notation, only a few algorithms give you O(N). The majority of highly optimized things provide O(N log N).
So what is the big O for any open model, for a single request?
It's a good question. Costs will be lumpy. Inference servers will have a preferred batch size. Once you have a server you can scale number of users up to that batch size for relatively low cost. Then you need to add another server (or rack) for another large cost.
However I think it's fair to say the cost is roughly linear in the number of users other than that.
There may be some aspects which are not quite linear when you see multiple users submitting similar queries... But I don't think this would be significant.
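A toy version of that "lumpy but roughly linear" point, with made-up numbers for batch size and server cost:

```python
# Serving cost vs. concurrent users: cost steps up at each new server,
# but per-user cost flattens out quickly. All numbers are made up.
import math

BATCH_SIZE = 64              # concurrent requests one inference server handles (assumed)
SERVER_COST_PER_HOUR = 40.0  # $/hour for a multi-GPU node (assumed)

def cost_per_user_hour(concurrent_users: int) -> float:
    servers = math.ceil(concurrent_users / BATCH_SIZE)
    return servers * SERVER_COST_PER_HOUR / concurrent_users

for users in (10, 64, 65, 640, 6400):
    print(users, round(cost_per_user_hour(users), 2))
# A handful of users on one server is expensive per head; at scale the per-user cost
# flattens toward SERVER_COST_PER_HOUR / BATCH_SIZE, i.e. roughly O(N) overall.
```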
O(N log N) can be approximated as O(N) for most realistic use cases.
As for LLMs, there is probably some constant cost added once the model can fit on a single GPU, but it should otherwise be almost linear.
Did Anthropic do the oldest SaaS sales trick in the 2010s SaaS playbooks ;)
The comparison with Qwen/Kimi by "comparable architecture size" is doing a lot of heavy lifting. Parameter count doesn't tell you much when the models aren't in the same league quality-wise.
I wonder if a better proxy would be comparing by capability level rather than size. The cost to go from "good" to "frontier" is probably exponential, not linear - so estimating Anthropic's real cost from what it takes to serve Qwen 397B seems off.
Nobody gets RSI typing “iterate until tests pass”
Recursive self improvement and Repetitive Strain Injury being the same initialism is really funny to me
Honest questions: have you never heard of hyperbole before, and are you on the spectrum?
This article is hilariously flawed, and it takes all of 5 seconds of research to see why.
Alibaba is the primary comparison point made by the author, but it's a completely unsuitable comparison. Alibaba is closer to AWS than Anthropic in terms of their business model. They make money selling infrastructure, not on inference. It's entirely possible they see inference as a loss leader, and are willing to offer it at cost or below to drive people onto the platform.
We also have absolutely no idea if it's anywhere near comparable to Opus 4.6. The author is guessing.
So the article's primary argument is based on a comparison to a company with an entirely different business model running a model that the author is just making wild guesses about.
What? AWS is a good comparison if you want only infra-level costs, which is what the post is talking about.
Well, IDK, I have used CC with API billing pretty extensively and managed to spend ~$1000 in one month more or less. Moved to a Max 20x subscription and using it a bit less (I'm still scared) but not THAT less and I'm around 10% weekly usage. I'm not counting the tokens, though.
And on top of that, Anthropic does not run their own compute clusters, do they? They probably get completely ripped off by whoever is renting them the processors.
$200 worth of actual computation is an awful lot of computation.
What this doesn't mention is the "cost" to the public: the inevitable bailouts after it all comes crashing down again, the massive subsidies that Datacenters get from tax payers, the fresh water they consume, the electricity price hikes for everyone else, the noise, air and water pollution and the massive health impact on the surrounding population of every datacenter. The jobs that it destroys and the innocent people it kills through use of the technology in military targeting and autonomous weapons usage.
Tl;dr, their guesstimate:
> Anthropic is looking at approximately $500 in real compute cost for the heaviest users.
Ok but so it does cost Cursor $5k per power-Cursor user?? Still seems pretty rough..
Yes, you could turn it around to say that using Anthropic models in Cursor, Copilot, Junie, etc. is 'subsidising' Claude Code users.
$5 = $5
but $5 that I amortize over 7 years might end up being $1.7 maybe if I don't rapidly combust (supply chain risk)
Cursor may be losing money only on $200 sub people who do over $200 of usage (it grants $400)
Everyone else pays them at API prices
No, to use $5k in Cursor you have to pay $5k.
I wonder how they are defining a power user. How many tokens, what could be the size the code base?
The $5k power user is the one that consistently uses all input and output tokens available under the Max subscription
> I'm fairly confident the Forbes sources are confusing retail API prices with actual compute costs
Aren't they losing money on the retail API pricing, too?
> ... comparisons to artificially low priced Chinese providers...
Yeah, no this article does not pass the sniff test.
> Aren't they losing money on the retail API pricing, too?
No, they aren't, and probably neither is anyone else offering API pricing. And Anthropic's API margins may be higher than anyone else.
For example, DeepSeek released numbers showing that R1 was served at approximately "a cost profit margin of 545%" (meaning 82% of revenue is profit), see my comment https://news.ycombinator.com/item?id=46663852
Weird that they're all looking for outside money then
They're all looking for outside money because everyone else is, and so they need to keep up with their competitors' investments in training. It's a game of chicken. Once their ability to raise more abates, they'll slow down new training runs and fund them out of inference margins instead, but the first one forced to do so risks losing market share.
Inference is profitable. No one is selling at a loss. It’s training to keep up with competitors that is causing losses.
> Inference is profitable
Eh. We don't really know that, and the people saying that have an interest in the rest of the world believing it's true.
How are we so sure that deep inside the moon isn't made out of cheese?
I easily go through two pro max $200/m accounts and yesterday got a third pro account when I ran out.
It’s worth it, but I know they aren’t making money on me. But, of course I’m marketing them constantly so…