I work at Google on these systems every day (caveat: these are my own words, not my employer's). So I can simultaneously tell you that it's smart people really thinking about every facet of the problem, and that I can't tell you much more than that.
However I can share this written by my colleagues! You'll find great explanations about accelerator architectures and the considerations made to make things fast.
Edit:
Another great resource to look at is the unsloth guides. These folks are incredibly good at getting deep into various models and finding optimizations, and they're very good at writing it up. Here's the Gemma 3n guide, and you'll find others as well.
Inference is (mostly) stateless. So unlike training where you need to have memory coherence over something like 100k machines and somehow avoid the certainty of machine failure, you just need to route mostly small amounts of data to a bunch of big machines.
I don't know what the specs of their inference machines are, but where I worked the machines research used were all 8-GPU monsters. So long as your model fitted in (combined) VRAM, the job was a good'un.
To scale, the secret ingredient was industrial amounts of cash. Sure we had DGXs (fun fact: Nvidia sent literal gold-plated DGX machines), but they weren't dense, and were very expensive.
Most large companies have robust RPC and orchestration, which means the hard part isn't routing the message, it's making the model fit in the boxes you have. (That's not my area of expertise though.)
> Inference is (mostly) stateless. ... you just need to route mostly small amounts of data to a bunch of big machines.
I think this might just be the key insight. The key advantage of doing batched inference at a huge scale is that once you maximize parallelism and sharding, your model parameters and the memory bandwidth associated with them are essentially free (since at any given moment they're being shared among a huge amount of requests!), you "only" pay for the request-specific raw compute and the memory storage+bandwidth for the activations. And the proprietary models are now huge, highly-quantized extreme-MoE models where the former factor (model size) is huge and the latter (request-specific compute) has been correspondingly minimized - and where it hasn't, you're definitely paying "pro" pricing for it. I think this goes a long way towards explaining how inference at scale can work better than locally.
(There are "tricks" you could do locally to try and compete with this setup, such as storing model parameters on disk and accessing them via mmap, at least when doing token gen on CPU. But of course you're paying for that with increased latency, which you may or may not be okay with in that context.)
> The key advantage of doing batched inference at a huge scale is that once you maximize parallelism and sharding, your model parameters and the memory bandwidth associated with them are essentially free (since at any given moment they're being shared among a huge amount of requests!)
Kind of unrelated, but this comment made me wonder when we will start seeing side channel attacks that force queries to leak into each other.
I don't think it's either useful or particularly accurate to characterize modern disagg racks of inference gear, well-understood RDMA and other low-overhead networking techniques, aggressive MLA and related cache optimizations that are in the literature, and all the other stuff that goes into a system like this as being some kind of mystical thing attended to by a priesthood of people from a different tier of hacker.
This stuff is well understood in public, and where a big name has something highly custom going on? Often as not it's a liability around attachment to some legacy thing. You run this stuff at scale by having the correct institutions and processes in place that it takes to run any big non-trivial system: that's everything from procurement and SRE training to the RTL on the new TPU, and all of the stuff is interesting, but if anyone was 10x out in front of everyone else? You'd be able to tell.
Signed,
Someone Who Also Did Megascale Inference for a TOP-5 For a Decade.
For real. Say it takes 1 machine 5 seconds to reply, and that a machine can only possibly form 1 reply at a time (which I doubt, but for argument).
If the requests were regularly spaced, and they certainly won’t be, but for the sake of argument, then 1 machine could serve about 17,000 requests per day, or roughly 120,000 per week. At that rate, you’d need about 5,800 machines to serve 700M requests. That’s a lot to me, but not to someone who owns a data center.
Yes, those 700M users will issue more than 1 query per week and they won’t be evenly spaced. However, I’d bet most of those queries will take well under 1 second to answer, and I’d also bet each machine can handle more than one at a time.
It’s a large problem, to be sure, but that seems tractable.
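Spelled out, the back-of-envelope math above looks like this (all inputs are the same illustrative assumptions as in the comment, not real OpenAI figures):

```python
# Back-of-envelope capacity math; all inputs are illustrative assumptions.
seconds_per_reply = 5            # assume 5 s of compute per reply
concurrent_per_machine = 1       # assume a machine handles one reply at a time

replies_per_day = 86_400 / seconds_per_reply * concurrent_per_machine   # ~17,000
replies_per_week = replies_per_day * 7                                  # ~121,000
machines_needed = 700_000_000 / replies_per_week                        # ~5,800

print(f"{replies_per_day:,.0f} replies/day, {replies_per_week:,.0f} replies/week")
print(f"~{machines_needed:,.0f} machines for 700M requests/week")
```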
Yes. And batched inference is a thing, where intelligent grouping/bin packing and routing of requests happens. I expect a good amount of "secret sauce" is at this layer.
But that’s not accurate. There are all sorts of tricks around KV cache where different users will have the same first X bytes because they share system prompts, caching entire inputs / outputs when the context and user data is identical, and more.
Not sure if you were just joking or really believe that, but for other peoples’ sake, it’s wildly wrong.
Google will win the LLM game if the LLM game is about compute, which is the common wisdom and maybe true, but not foreordained by God. There's an argument that if compute was the dominant term that Google would never have been anything but leading by a lot.
Personally right now I see one clear leader and one group going 0-99 like a five sigma cosmic ray: Anthropic and the PRC. But this is because I believe/know that all the benchmarks are gamed as hell; it's like asking if a movie star had cosmetic surgery. On quality, Opus 4 is 15x the cost and sold out / backordered. Qwen 3 is arguably in next place.
In both of those cases, extreme quality expert labeling at scale (assisted by the tool) seems to be the secret sauce.
Which is how it would play out if history is any guide: when compute as a scaling lever starts to flatten, you expert-label like it's 1987 and claim it's compute and algorithms until the government wises up and stops treating your success personally as a national security priority. It's the easiest trillion Xi Jinping ever made: pretending to think LLMs are AGI too, fast following for pennies on the dollar, and propping up a stock market bubble to go with the fentanyl crisis? 9-D chess. It's what I would do about AI if I were China.
I believe Google might win the LLM game simply because they have the infrastructure to make it profitable - via ads.
All the LLM vendors are going to have to cope with the fact that they're lighting money on fire, and Google has the paying customers (advertisers) and, with the user-specific context they get from their LLM products, one of the juiciest and most targetable ad audiences of all time.
It would take talent for them to mess up hosting businesses who want to use their TPUs on GCP.
But then again even there, their reputation for abandoning products, lack of customer service, condescension when it came to large enterprises’ “legacy tech” lets Microsoft who is king of hand holding big enterprise and even AWS run rough shod over them.
When I was at AWS ProServe, we didn’t even bother coming up with talking points when competing with GCP except to point out how they abandon services. Was it partially FUD? Probably. But it worked.
> It would take talent for them to mess up hosting businesses who want to use their TPUs on GCP.
> But then again even there, their reputation for abandoning products
What are the chances of abandoning TPU-related projects where the company literally invested billions in infrastructure? Zero.
Well, Search had no chance when the sites also make money from Google ads. Google fucked their own Search by creating incentives for themselves around bounce rate.
Yeah honestly. They could just try selling solutions and SLAs combining their TPU hardware with on-prem SOTA models and practically dominate enterprise. From what I understand, that's GCP's gameplay too for most regulated enterprise clients.
Googles bread and butter is advertising, so they have a huge interest in keeping things in house. Data is more valuable to them than money from hardware sales.
Even then, I think that their primary use case is going to be consumer-grade good AI on phones. I dunno why the Gemma QAT models fly so low on the radar, but you can basically get full-scale Llama 3-like performance from a single 3090 now, at home.
It’s my understanding that google makes bulk of ad money from search ads - sure they harvest a ton of data but it isn’t as valuable to them as you’d think. I suspect they know that could change so they’re hoovering up as much as they can to hedge their bets. Meta on the other hand is all about targeted ads.
Google has already started the process of letting companies self-host Gemini, even on NVidia Blackwell GPUs.
Although imho, they really should bundle it with their TPUs as a turnkey solution for those clients who haven't invested in large scale infra like DCs yet.
My guess is that either Google wants a high level of physical control over their TPUs, or they have one sort of deal or another with Nvidia and don't want to step on their toes.
I'd think no. They have the hardware and software experience, likely have next and next-next plans in place already. The big hurdle is money, which G has a bunch of.
TPUs are accelerators that accelerate the common operations found in neural nets. A big part is simply a massive number of matrix FMA units to process enormous matrix operations, which make up the bulk of doing a forward pass through a model. Caching enhancements and massively growing memory were necessary to facilitate transformers, but on the hardware side not a huge amount has changed and the fundamentals from years ago still power the latest models. The hardware is just getting faster, with more memory and more parallel processing units. Later it also gained more data types to enable hardware-supported quantization.
So it isn't like Google designed a TPU for a specific model or architecture. They're pretty general purpose in a narrow field (oxymoron, but you get the point).
The set of operations Google designed into a TPU is very similar to what nvidia did, and it's about as broadly capable. But Google owns the IP and doesn't pay the premium and gets to design for their own specific needs.
Hasn’t the Inferentia chip been around long enough to make the same argument? AWS and Google probably have the same order of magnitude of their own custom chips
I'm a research person building models, so I can't answer your questions well (save for one part).
That is, as a research person using our GPUs and TPUs I see first hand how choices from the high level python level, through Jax, down to the TPU architecture all work together to make training and inference efficient. You can see a bit of that in the gif on the front page of the book. https://jax-ml.github.io/scaling-book/
I also see how sometimes bad choices by me can make things inefficient. Luckily for me if my code/models are running slow I can ping colleagues who are able to debug at both a depth and speed that is quite incredible.
And because we're on HN I want to preemptively call out my positive bias for Google! It's a privilege to be able to see all this technology first hand, work with great people, and do my best to ship this at scale across the globe.
This caught my attention: "But today even “small” models run so close to hardware limits".
Sounds analogous to the 60's and 70's i.e "even small programs run so close to hardware limits". If optimization and efficiency is dead in software engineering, it's certainly alive and well in LLM development.
Large Model Systems (LMSYS Corp.) is a 501(c)(3) non-profit focused on incubating open-source projects and research. Our mission is to make large AI models accessible to everyone by co-developing open models, datasets, systems, and evaluation tools. We conduct cutting-edge machine learning research, develop open-source software, train large language models for broad accessibility, and build distributed systems to optimize their training and inference.
I have met a lot of people at Google; they have some really good engineers and some mediocre ones. But most importantly they are just normal engineers dealing with normal office politics.
I don't like how the grandparent mystifies this. This problem is just normal engineering. Any good engineer could learn how to do it.
Because most smart people are not generalists. My first boss was really smart and managed to found a university institute in computer science. The 3 other professors he hired were, ahem, strange choices. We 28-year-old assistants could only shake our heads. After fighting a couple of years with his own hires, the founder left in frustration to found another institution.
One of my colleagues was only 25, really smart in his field, and became a professor less than 10 years later. But he was incredibly naive in everyday chores. Buying groceries or filing taxes resulted in major screw-ups regularly.
I have met those supersmart specialists but in my experience there are also a lot of smart people who are more generalists.
The real answer is likely internal company politics and priorities. Google certainly has people with the technical skills to solve it but do they care and if they care can they allocate those skilled people to the task?
My observation is that in general smart generalists are smarter than smart specialists. I work at Google, and it’s just that these generalists folks are extremely fast learners. They can cover breadth and depth of an arbitrary topic in a matter of 15 minutes, just enough to solve a problem at hand.
It’s quite intimidating how fast they can break down difficult concepts into first principles. I’ve witnessed this first hand and it’s beyond intimidating. Makes you wonder what you’re doing at this company… That being said, the caliber of folks I’m talking about is quite rare, like the top 10% of the top 2-3% of teams at Google.
Can you explain what you mean about 'not needing to be solved'? There are versions of that kind of critique that would seem, at least on the surface, to better apply to finance or flash trading.
I ask because scaling a system that a substantial chunk of the population finds incredibly useful, including for the more efficient production of public goods (scientific research, for example), does seem like a problem that a) needs to be solved from a business point of view, and b) should be solved from a civic-minded point of view.
I think the problem I see with this type of response is that it doesn't take into context the waste of resources involved. If the 700M users per week is legitimate then my question to you is: how many of those invocations are worth the cost of resources that are spent, in the name of things that are truly productive?
And if AI were truly the holy grail that it's being sold as, then there wouldn't be 700M users per week wasting all of these resources as heavily as we are, because generative AI would have already solved for something better. It really does seem like these platforms aren't, and won't be, anywhere near as useful as they're continuously claimed to be.
Just like Tesla FSD, we keep hearing about a "breakaway" model and the broken record of AGI. Instead of getting anything exceptionally better we seem to be getting models tuned for benchmarks and only marginal improvements.
I really try to limit what I'm using an LLM for these days. And not simply because of the resource pigs they are, but because it's also often a time sink. I spent an hour today testing out GPT-5 and asking it about a specific problem I was solving for using only 2 well documented technologies. After that hour it had hallucinated about a half dozen assumptions that were completely incorrect. One so obvious that I couldn't understand how it had gotten it so wrong. This particular technology, by default, consumes raw SSE. But GPT-5, even after telling it that it was wrong, continued to give me examples that were in a lot of ways worse and kept resorting to telling me to validate my server responses were JSON formatted in a particularly odd way.
Instead of continuing to waste my time correcting the model I just went back to reading the docs and GitHub issues to figure out the problem I was solving for. And that led me down a dark chain of thought: so what happens when the "teaching" mode rethinks history, or math fundamentals?
I'm sure a lot of people think ChatGPT is incredibly useful. And a lot of people are bought into not wanting to miss the boat, especially those who don't have any clue how it works and what it takes to execute any given prompt. I actually think LLMs have a trajectory that will be similar to social media. The curve is different and, hopefully, we haven't yet seen its most useful aspects come to fruition. But I do think that if OpenAI is serving 700M users per week then, once again, we are the product. Because if AI could actually displace workers en masse today you wouldn't have access to it for $20/month. And they wouldn't offer it to you at 50% off for the next 3 months when you go to hit the cancel button. In fact, if it could do most of the things executives are claiming then you wouldn't have access to it at all. But, again, the users are the product - in very much the same way it played out with social media.
Finally, I'd surmise that of those 700M weekly users, less than 10% of sessions are being used for anything productive of the kind you've mentioned, and I'd place a high wager that even 10% is generous. I could be wrong, but again - we'd know about it if it were the actual truth.
> If the 700M users per week is legitimate then my question to you is: how many of those invocations are worth the cost of resources that are spent, in the name of things that are truly productive?
Is everything you spend resources on truly productive?
Who determines whether something is worth it? Is price/willingness of both parties to transact not an important factor?
I don't think ChatGPT can do most things I do. But it does eliminate drudgery.
> so what happens when the "teaching" mode rethinks history, or math fundamentals?
The person attempting to learn either (hopefully) figures out the AI model was wrong, or sadly learns the wrong material. The level of impact is probably relative to how useful the knowledge is in one's life.
The good or bad news, depending on how you look at it, is that humans are already great at rewriting history and believing wrong facts, so I am not entirely sure an LLM can do that much worse.
Maybe ChatGPT might just kill off the ignorant like it already has? GPT already told a user to combine bleach and vinegar, which produces chlorine gas. [1]
They won’t be honest and explain it to you but I will. Takes like the one you’re responding to are from loathsome pessimistic anti-llm people that are so far detached from reality they can just confidently assert things that have no bearing on truth or evidence. It’s a coping mechanism and it’s basically a prolific mental illness at this point
And what does that make you? A "loathsome clueless pro-llm zealot detached from reality"? LLMs are essentially next word predictors marketed as oracles. And people use them as that. And that's killing them. Because LLMs don't actually "know", they don't "know that they don't know", and won't tell you they are inadequate when they are. And that's a problem left completely unsolved. At the core of very legitimate concerns about the proliferation of LLMs. If someone here sounds irrational and "coping", it very much appears to be you.
> working on problems that don't even really need to be solved
Very, very few problems _need_ to be solved. Feeding yourself is a problem that needs to be solved in order for you to continue living. People solve problems for different reasons. If you don't think LLMs are valuable, you can just say that.
The few problems humanity has that need to be solved:
1. How to identify humanity's needs on all levels, including cosmic ones...(we're in the Space Age so we need to prepare ourselves for meeting beings from other places)
2. How to meet all of humanity's needs
Pointing this out regularly is probably necessary because the issue isn't why people are choosing what they're doing... it's that our systems actively disincentivize collectively addressing these two problems in a way that doesn't sacrifice people's wellbeing/lives... and most people don't even think about it like this.
An H100 is a $20k USD card and has 80GB of VRAM. Imagine a 2U rack server with $100k of these cards in it. Now imagine an entire rack of these things, plus all the other components (CPUs, RAM, passive cooling or water cooling), and you're talking $1 million per rack, not including the costs to run them or the engineers needed to maintain them. Even the "cheaper" configurations are nowhere near consumer budgets.
I don't think people realize the size of these compute units.
When the AI bubble pops is when you're likely to be able to realistically run good local models. I imagine some of these $100k servers going for $3k on eBay in 10 years, and a lot of electricians being asked to install new 240v connectors in makeshift server rooms or garages.
You can pick up a DGX-1 on Ebay right now for less than $10k. 256 GB vRAM (HBM2 nonetheless), NVLink capability, 512 GB RAM, 40 CPU cores, 8 TB SSD, 100 Gbit HBAs. Equivalent non-Nvidia branded machines are around $6k.
They are heavy, noisy like you would not believe, and a single one just about maxes out a 16A 240V circuit. Which also means it produces 13 000 BTU/hr of waste heat.
Fair warning: the BMCs on those suck so bad, and the firmware bundles are painful, since you need a working nvidia-specific container runtime to apply them, which you might not be able to get up and running because of a firmware bug causing almost all the ram to be presented as nonvolatile.
Honestly, unless you *really* need NVLink/IB (meaning that copies and PCIe trips are your bottleneck), you may do better with whatever commodity system with sufficient lanes, slots, and CFM is available at a good price.
Heat pump sure, but how is gas furnace more efficient than resistive load inside the house? Do you mean more economical rather than more efficient (due to gas being much cheaper/unit of energy)?
Depends where your electricity comes from. If you're burning fossil fuels to make electricity, that's only about 40% efficient, so you need to burn 2.5x as much fuel to get the same amount of heat into the house.
Sure. That has nothing to do with the efficiency of your system though. As far as you are concerned this is about your electricity consumption for the home server vs gas consumption. In that sense resistive heat inside the home is 100% efficient compared to gas furnace; the fuel cost might be lower on the latter.
Sure, it's "equally efficient" if you ignore the inefficient thing that is done outside where you draw the system box, directly in proportion to how much you do it.
Heating my house with a giant diesel-powered radiant heater from across the street is infinitely efficient, too, since I use no power in my house.
If you don’t close the box of the system at some point to isolate the input, efficiency would be meaningless. I think in the context of the original post, suggesting running a server in winter would be a zero-waste endeavor if you need the heat anyway, it is perfectly clear that the input is electricity to your home at a certain $/kWh and gas at a certain $/BTU. Under that premise, it is fair to say that would not be true if you have a heat pump deployed but would be true compared to gas furnace in terms of efficiency (energy consumed for unit of heat), although not necessarily true economically.
Generating 1kWh of heat with electric/resistive is more expensive than gas, which itself is more expensive than a heat pump, based on the cost of fuel to go in
If your grid is fossil fuels burning the fuel directly is more efficient. In all cases a heat pump is more efficient.
Did you skip searing it after sous vide? Did you sous vide it to the "instantly kill all bacteria" temperature (145°F for steak) thereby overcooking & destroying it, or did you sous vide to a lower temperature (at most 125°F) so that it'd reach a medium-rare 130°F-140°F after searing & carryover cooking during resting? It should have a nice seared crust, and the inside absolutely shouldn't be mushy.
Please research this. Done right, sous vide is amazing. But it is almost never the only technique used. Just like when you slow roast a prime rib at 200°F, you MUST sear to get the Maillard reaction and a satisfying texture.
Cooling BTUs already take the coefficient of performance of the vapor-compression cycle into account. 4w of heat removed for each 1w of input power is around the max COP for an air cooled condenser, but adding an evaporative cooling tower can raise that up to ~7.
I just looked at a spec sheet for a 230V single-phase 12k BTU mini-split and the minimum circuit ampacity was 3A for the air handler and 12A for the condenser, add those together for 15A, divide by .8 is 18.75A, next size up is 20A. Minimum circuit ampacity is a formula that is (roughly) the sum of the full load amps of the motor(s) inside the piece of equipment times 1.25 to determine the conductor size required to power the equipment.
So the condensing unit likely draws ~9.5-10A max and the air handler around ~2.4A, and both will have variable speed motors that would probably only need about half of that to remove 12k BTU of heat, so ~5-6A or thereabouts should do it, which is around 1/3rd of the 16A server, or a COP of 3.
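Spelled out with the same numbers, the COP estimate works out roughly like this (a rough check using the figures in the comment, not a sizing calculation):

```python
# Rough check of the COP estimate above, using the commenter's own numbers.
btu_per_hr = 12_000                    # heat removed by the mini-split
watts_removed = btu_per_hr / 3.412     # ~3,517 W of heat moved

volts = 230
amps_estimate = 5.5                    # ~5-6 A at part load, per the comment
electrical_watts = volts * amps_estimate   # ~1,265 W of input power

cop = watts_removed / electrical_watts     # ~2.8, i.e. roughly 3
print(f"~{watts_removed:.0f} W removed / {electrical_watts:.0f} W in = COP {cop:.1f}")
```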
Are you talking about the guy in Temecula running two different auctions with some of the same photos (356878140643 and 357146508609, both showing a missing heat sink?) Interesting, but seems sketchy.
How useful is this Tesla-era hardware on current workloads? If you tried to run the full DeepSeek R1 model on it at (say) 4-bit quantization, any idea what kind of TTFT and TPS figures might be expected?
I can’t speak to the Tesla stuff but I run an Epyc 7713 with a single 3090 and creatively splitting the model between GPU/8 channels of DDR4 I can do about 9 tokens per second on a q4 quant.
Even if the AI bubble does not pop, your prediction about those servers being available on eBay in 10 years will likely be true, because some datacenters will simply upgrade their hardware and resell their old machines to third parties.
Sure, datacenters will get rid of the hardware - but only because it's no longer commercially profitable to run it, presumably because compute demands have eclipsed its abilities.
It's kind of like buying a used GeForce 980Ti in 2025. Would anyone buy them and run them besides out of nostalgia or curiosity? Just the power draw makes them uneconomical to run.
Much more likely every single H100 that exists today becomes e-waste in a few years. If you have need for H100-level compute you'd be able to buy it in the form of new hardware for way less money and consuming way less power.
For example if you actually wanted 980Ti-level compute in a desktop today you can just buy a RTX5050, which is ~50% faster, consumes half the power, and can be had for $250 brand new. Oh, and is well-supported by modern software stacks.
> Sure, datacenters will get rid of the hardware - but only because it's no longer commercially profitable to run it, presumably because compute demands have eclipsed its abilities.
I think the existence of a pretty large secondary market for enterprise servers and such kind of shows that this won't be the case.
Sure, if you're AWS and what you're selling _is_ raw compute, then couple generation old hardware may not be sufficiently profitable for you anymore... but there are a lot of other places that hardware could be applied to with different requirements or higher margins where it may still be.
Even if they're only running models a generation or two out of date, there are a lot of use cases today, with today's models, that will continue to work fine going forward.
And that's assuming it doesn't get replaced for some other reason that only applies when you're trying to sell compute at scale. A small uptick in the failure rate may make a big dent at OpenAI but not for a company that's only running 8 cards in a rack somewhere and has a few spares on hand. A small increase in energy efficiency might offset the capital outlay to upgrade at OpenAI, but not for the company that's only running 8 cards.
I think there's still plenty of room in the market in places where running inference "at cost" would be profitable that are largely untapped right now because we haven't had a bunch of this hardware hit the market at a lower cost yet.
Off topic, but I bought my (still in active use) 980ti literally 9 years ago for that price. I know, I know, inflation and stuff, but I really expected more than 50% bang for my buck after 9 whole years…
The 5050 doesn't support 32-bit PhysX. So a bunch of games would be missing a ton of stuff. You'd still need the 980 running with it for older PhysX games, because Nvidia.
I have around a thousand broadwell cores in 4 socket systems that I got for ~nothing from these sorts of sources... pretty useful. (I mean, I guess literally nothing since I extracted the storage backplanes and sold them for more than the systems cost me). I try to run tasks in low power costs hours on zen3/4 unless it's gonna take weeks just running on those, and if it will I crank up the rest of the cores.
And 40 P40 GPUs that cost very little, which are a bit slow but with 24gb per gpu they're pretty useful for memory bandwidth bound tasks (and not horribly noncompetitive in terms of watts per TB/s).
Given highly variable time of day power it's also pretty useful to just get 2x the computing power (at low cost) and just run it during the low power cost periods.
It's interesting to think about scenarios where that hardware would get used only part of the time, like say when the sun is shining and/or when dwelling heat is needed. The biggest sticking point would seem to be all of the capex for connecting them to do something useful. It's a shame that PLX switch chips are so expensive.
Someone's take on AI was that we're collectively investing billions in data centers that will be utterly worthless in 10 years.
Unlike the investments in railways or telephone cables or roads or any other sort of architecture, this investment has a very short lifespan.
Their point was that whatever your take on AI, the present investment in data centres is a ridiculous waste and will always end up as a huge net loss compared to most other investments our societies could spend it on.
Maybe we'll invent AGI and he'll be proven wrong as they'll pay back themselves many times over, but I suspect they'll ultimately be proved right and it'll all end up as land fill.
The servers may well be worthless (or at least worth a lot less), but that's been pretty much true for a long time. Not many people want to run on 10-year-old servers (although I pay $30/month for a dedicated server that's dual Xeon L5640 or something like that, which is about 15 years old).
The servers will be replaced, the networking equipment will be replaced. The building will still be useful, the fiber that was pulled to internet exchanges/etc will still be useful, the wiring to the electric utility will still be useful (although I've certainly heard stories of datacenters where much of the floor space is unusable, because power density of racks has increased and the power distribution is maxed out)
I have a server in my office from 2009 that's still far more economical to run than buying any sort of cloud compute. By at least an order of magnitude.
If it is all a waste and a bubble, I wonder what the long term impact will be of the infrastructure upgrades around these dcs. A lot of new HV wires and substations are being built out. Cities are expanding around clusters of dcs. Are they setting themselves up for a new rust belt?
This isn’t my original take but if it results in more power buildout, especially restarting nuclear in the US, that’s an investment that would have staying power.
Sure, but what about the collective investment in smartphones, digital cameras, laptops, even cars. Not much modern technology is useful and practical after 10 years, let alone 20. AI is probably moving a little faster than normal, but technology depreciation is not limited to AI.
They probably are right, but a counter argument could be how people thought going to the moon was pointless and insanely expensive, but the technology to put stuff in space and have GPS and comms satellites probably paid that back 100x
Reality is that we don’t know how much of a trope this statement is.
I think we would get all this technology without going to the moon or Space Shuttle program. GPS, for example, was developed for military applications initially.
I don’t mean to invalidate your point (about genuine value arising from innovations originating from the Apollo program), but GPS and comms satellites (and heck, the Internet) are all products of nuclear weapons programs rather than civilian space exploration programs (ditto the Space Shuttle, and I could go on…).
Yes, and no. The people working on GPS paid very close attention to the papers from JPL researchers describing their timing and ranging techniques for both Apollo and deep-space probes. There was more cross-pollination than meets the eye.
It's not that going to the Moon was pointless, but stopping after we'd done little more than plant a flag was. Wernher von Braun was the head architect of the Apollo Program, and the Moon was intended as little more than a stepping stone towards setting up a permanent colony on Mars. Incidentally this is also the technical and ideological foundation of what would become the Space Shuttle and ISS, which were both also supposed to be little more than small-scale tools on this mission, as opposed to ends in and of themselves.
Imagine if Columbus verified that the New World existed, planted a flag, came back - and then everything was cancelled. Or similarly for literally any colonization effort ever. That was the one downside of the space race - what we did was completely nonsensical, and made sense only because of the context of it being a 'race' and politicians having no greater vision than beyond the tip of their nose.
My personal sneaking suspicion is that publicly offered models are using way less compute than thought. In modern mixture of experts models, you can do top-k sampling, where only some experts are evaluated, meaning even SOTA models aren't using much more compute than a 70-80b non-MoE model.
To piggyback on this, at enterprise level in modern age, the question is really not about "how are we going to serve all these users", it comes down to the fact that investors believe that eventually they will see a return on investment, and then pay whatever is needed to get the infra.
Even if you didn't have optimizations involved in terms of job scheduling, they would just build as many warehouses as necessary filled with as many racks as necessary to serve the required user base.
What I wonder is what this means for Coreweave, Lambda and the rest, who are essentially just renting out fleets of racks like this. Does it ultimately result in acquisition by a larger player? Severe loss of demand? Can they even sell enough to cover the capex costs?
I wonder if it's feasible to hook up NAND flash with a high bandwidth link necessary for inference.
Each of these NAND chips has hundreds of dies of flash stacked inside, and they are hooked up to the same data line, so only one of them can talk at a time, and they still achieve >1GB/s bandwidth. If you could hook them up in parallel, you could have 100s of GB/s of bandwidth per chip.
NAND is very, very slow relative to RAM, so you'd pay a huge performance penalty there. But maybe more importantly my impression is that memory contents mutate pretty heavily during inference (you're not just storing the fixed weights), so I'd be pretty concerned about NAND wear. Mutating a single bit on a NAND chip a million times over just results in a large pile of dead NAND chips.
No, it's not slow - a single NAND chip in SSDs offers >1GB/s of bandwidth; inside the chip there are 100+ dies actually holding the data, but in SSDs only one of them is active when reading/writing.
You could probably make special NAND chips where all of them can be active at the same time, which means you could get 100GB+ bandwidth out of a single chip.
This would be useless for data storage scenarios, but very useful when you have huge amounts of static data you need to read quickly.
The memory bandwidth on an H100 is 3TB/s, for reference. This number is the limiting factor in the size of modern LLMs. 100GB/s isn't even in the realm of viability.
That bandwidth is for the whole GPU, which has 6 memory chips. But anyway, what I'm proposing isn't for the high end and training, but for making inference cheap.
And I was somewhat conservative with the numbers: a modern budget SSD with a single NAND chip can do more than 5GB/s read speed.
Why stop at 80 H100s for a mere 6.4 terabytes of GPU memory?
Supermicro will sell you a full rack loaded with servers [1] providing 13.4 TB of GPU memory.
And with 132kW of power output, you can heat an olympic-sized swimming pool by 1°C every day with that rack alone. That's almost as much power consumption as 10 mid-sized cars cruising at 50 mph.
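That pool figure roughly checks out, assuming a 50 m x 25 m x 2 m pool and that all 132 kW ends up in the water:

```python
# Sanity check of the swimming-pool claim: 132 kW dumped into an Olympic pool
# (assumed 50 m x 25 m x 2 m = 2,500 m^3 of water) for 24 hours.
power_w = 132_000
energy_j = power_w * 24 * 3600            # ~1.14e10 J per day

pool_kg = 2_500_000                       # 2,500 m^3 of water
specific_heat = 4186                      # J/(kg*K)

delta_t = energy_j / (pool_kg * specific_heat)
print(f"~{delta_t:.1f} °C per day")       # ~1.1 °C
```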
> When the AI bubble pops is when you're likely to be able to realistically run good local models.
After years of “AI is a bubble, and will pop when everyone realizes they’re useless plagiarism parrots” it’s nice to move to the “AI is a bubble, and will pop when it becomes completely open and democratized” phase
Yeah I think the crux of the issue is that chatgpt is serving a huge number of users including paid users and is still operating at a massive operating loss. They are spending truckloads of money on GPUs and selling access at a loss.
You have thousands of dollars, they have tens of billions. $1,000 vs $10,000,000,000. They have 7 more zeros than you, which is one less zero than the scale difference in users: 1 user (you) vs 700,000,000 users (openai). They managed to squeak out at least one or two zeros worth of efficiency at scale vs what you're doing.
Also, you CAN run local models that are as good as GPT 4 was on launch on a macbook with 24 gigs of ram.
You can knock off a zero or two just by time-shifting the 700 million distinct users across a day/week and accounting for the mere minutes of compute time they will actually use in each interaction. So they might not see peaks higher than 10 million active inference sessions at the same time.
Conversely, you can't do the same thing as a self hosted user, you can't really bank your idle compute for a week and consume it all in a single serving, hence the much more expensive local hardware to reach the peak generation rate you need.
During times of high utilization, how do they handle more requests than they have hardware? Is the software granular enough that they can round robin the hardware per token generated? UserA token, then UserB, then UserC, back to UserA? Or is it more likely that everyone goes into a big FIFO processing the entire request before switching to the next user?
I assume the former has massive overhead, but maybe it is worthwhile to keep responsiveness up for everyone.
Inference is essentially a very complex matrix algorithm run repeatedly on itself, each time the input matrix (context window) is shifted and the new generated tokens appended to the end. So, it's easy to multiplex all active sessions over limited hardware, a typical server can hold hundreds of thousands of active contexts in the main system ram, each less than 500KB and ferry them to the GPU nearly instantaneously as required.
The context after application of the algorithm is just text, something like 256k input tokens, each token representing a group of roughly 2-5 characters, encoded into 18-20 bits.
The active context during inference, inside the GPUs, explodes each token into a 12288-dimension vector, so 4 orders of magnitude more VRAM, and is combined with the model weights, gigabytes in size, across multiple parallel attention heads. The final result is just more textual tokens, which you can easily ferry around main system RAM and send to the remote user.
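Putting rough numbers on that "4 orders of magnitude" claim, using the 12288-dimension figure from the comment and assuming 16-bit activations (the fp16 assumption is mine):

```python
# Rough size comparison for one token: as text vs. as an activation vector.
# Uses the 12288-dimension figure from the comment; fp16 activations assumed.
token_as_text_bytes = 20 / 8              # ~18-20 bits per token id, ~2.5 bytes

d_model = 12_288
bytes_per_activation = 2                  # fp16
token_as_activation_bytes = d_model * bytes_per_activation   # ~24.6 KB

ratio = token_as_activation_bytes / token_as_text_bytes
print(f"{token_as_activation_bytes/1024:.1f} KiB vs {token_as_text_bytes} B "
      f"-> ~{ratio:,.0f}x (about 4 orders of magnitude)")
```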
First of all, they never “handle more requests than they have hardware.” That’s impossible (at least as I’m reading it).
The vast majority of usage is via their web app (and free accounts, at that). The web app defaults to “auto” selecting a model. The algorithm for that selection is hidden information.
As load peaks, they can divert requests to different levels of hardware and less resource hungry models.
Only a very small minority of requests actually specify the model to use.
There are a hundred similar product design hacks they can use to mitigate load. But this seems like the easiest one to implement.
> But this seems like the easiest one to implement.
Even easier: Just fail. In my experience the ChatGPT web page fails to display (request? generate?) a response between 5% and 10% of the time, depending on time of day. Too busy? Just ignore your customers. They’ll probably come back and try again, and if not, well, you’re billing them monthly regardless.
I don't usually see responses fail. But what I did see shortly after the GPT-5 release (when servers were likely overloaded) was the model "thinking" for over 8 minutes. It seems like (if you manually select the model) you're simply getting throttled (or put in a queue).
That's all different now with agentic workflows, which were not really a big thing until the end of 2024. Before, they were doing 1 request; now they're doing hundreds for a given task. The reason OpenAI/Azure win over locally run models is the parallelization you can do with a thinking agent: simultaneous processing of multiple steps.
You hit the nail on the head. Just gotta add the up to $10 billion investment from Microsoft to cover pretraining, R&D, and inference. Then, they still lost billions.
One can serve a lot of models if allowed to burn through over a billion dollars with no profit requirement. Classic, VC-style, growth-focused capitalism with an unusual business structure.
I think you maybe have misunderstood the parent (or maybe I did?). They're saying you can't compare an individual's cost to run a model against OpenAI's cost to run it + R&D. Individuals aren't paying for R&D, and that's where most of the cost is.
Inference contributes to their losses. In January 2025, Altman admitted they are losing money on Pro subscriptions, because people are using it more than they expected (sending more inference requests per month than would be offset by the monthly revenue).
Currently, they lose more money per inference than they make for Pro subscriptions, because they are essentially renting out their service each month instead of charging for usage (per token).
At the end of the day, until at least one of the big providers gives us balance sheet numbers, we don't know where they stand. My current bet is that they're losing money whichever way you dice it.
The hope being as usual that costs go down and the market share gained makes up for it. At which point I wouldn't be shocked by pro licenses running into the several hundred bucks per month.
A single node with GPUs has a lot of FLOPs and very high memory bandwidth. When only processing a few requests at a time, the GPUs are mostly waiting on the model weights to stream from the GPU ram to the processing units. When batching requests together, they can stream a group of weights and score many requests in parallel with that group of weights. That allows them to have great efficiency.
Some of the other main tricks - compress the model to 8 bit floating point formats or even lower. This reduces the amount of data that has to stream to the compute unit, also newer GPUs can do math in 8-bit or 4-bit floating point. Mixture of expert models are another trick where for a given token, a router in the model decides which subset of the parameters are used so not all weights have to be streamed. Another one is speculative decoding, which uses a smaller model to generate many possible tokens in the future and, in parallel, checks whether some of those matched what the full model would have produced.
Add all of these up and you get efficiency!
Source - was director of the inference team at Databricks
I think the most direct answer is that at scale, inference can be batched, so that processing many queries together in a parallel batch is more efficient than interactively dedicating a single GPU per user (like your home setup).
If you want a survey of intermediate level engineering tricks, this post we wrote on the Fin AI blog might be interesting. (There's probably a level of proprietary techniques OpenAI etc have again beyond these):
https://fin.ai/research/think-fast-reasoning-at-3ms-a-token/
This is the real answer. I don't know what people above are even discussing when batching is the biggest reduction in costs. If it costs, say, $50k of hardware to serve one request, with batching it also costs $50k to serve 100 at the same time with minimal performance loss. I don't know what the real number of users is before you need to buy new hardware, but I know it's in the hundreds, so going from $50,000 to $500 in effective cost per user is a pretty big deal (assuming you have the users to saturate the hardware).
My simple explanation of how batching works: Since the bottleneck of processing LLMs is in loading the weights of the model onto the GPU to do the computing, what you can do is instead of computing each request separately, you can compute multiple at the same time, ergo batching.
Let's make a visual example: say you have a model with 3 sets of weights that can fit inside the GPU's cache (A, B, C) and you need to serve 2 requests (1, 2). A naive approach would be to serve them one at a time:
(Legend: LA = Load weight set A, CA1 = Compute weight set A for request 1)
LA->CA1->LB->CB1->LC->CC1->LA->CA2->LB->CB2->LC->CC2
But you could instead batch the compute parts together.
LA->CA1->CA2->LB->CB1->CB2->LC->CC1->CC2
Now if you consider that the loading is hundreds if not thousands of times slower than computing the same data, then you'll see the big difference. Here's a "chart" visualizing the two approaches if loading were just 10 times slower. (Consider 1 letter a unit of time.)
Time spent using approach 1 (1 request at a time):
LLLLLLLLLLC LLLLLLLLLLC LLLLLLLLLLC LLLLLLLLLLC LLLLLLLLLLC LLLLLLLLLLC (66 units)
Time spent using approach 2 (batched):
LLLLLLLLLLCC LLLLLLLLLLCC LLLLLLLLLLCC (36 units)
The difference is even more dramatic in the real world because, as I said, loading is many times slower than computing, so you'd have to serve many users before you see a serious difference in speeds. I believe in the real world the restriction is actually that serving more users requires more memory to store the activation state, so you'll end up running out of memory and you'll have to balance how many people per GPU cluster you want to serve at the same time.
TL;DR: It's pretty expensive to get enough hardware to serve an LLM, but once you do have you can serve hundreds of users at the same time with minimal performance loss.
Thanks for the helpful reply! As I wasn't able to fully understand it still, I pasted your reply in chatgpt and asked it some follow up questions and here is what i understand from my interaction:
- Big models like GPT-4 are split across many GPUs (sharding).
- Each GPU holds some layers in VRAM.
- To process a request, weights for a layer must be loaded from VRAM into the GPU's tiny on-chip cache before doing the math.
- Loading into cache is slow, the ops are fast though.
- Without batching: load layer > compute user1 > load again > compute user2.
- With batching: load layer once > compute for all users > send to gpu 2 etc
- This makes cost per user drop massively if you have enough simultaneous users.
- But bigger batches need more GPU memory for activations, so there's a max size.
This does make sense to me, but does it sound accurate to you?
Would love to know if I'm still missing something important.
This seems a bit complicated to me. They don't serve very many models. My assumption is they just dedicate GPUs to specific models, so the model is always in VRAM. No loading per request - it takes a while to load a model in anyway.
The limiting factor compared to local is dedicated VRAM - if you dedicate 80GB of VRAM locally 24 hours/day so response times are fast, you're wasting most of the time when you're not querying.
Loading here refers to loading from VRAM into the GPU cores' cache; loading from VRAM is so slow in terms of GPU time that the cores end up idle most of the time, just waiting for more data to come in.
700M weekly users doesn't say much about how much load they have.
I think the thing to remember is that the majority of chatGPT users, even those who use it every day, are idle 99.9% of the time. Even someone who has it actively processing for an hour a day, seven days a week, is idle 96% of the time. On top of that, many are using less-intensive models. The fact that they chose to mention weekly users implies that there is a significant tail of their user distribution who don't even use it once a day.
So your question factors into a few easier-but-still-not-trivial problems:
- Making individual hosts that can fit their models in memory and run them at acceptable toks/sec.
- Making enough of them to handle the combined demand, as measured in peak aggregate toks/sec.
- Multiplexing all the requests onto the hosts efficiently.
Of course there are nuances, but honestly, from a high level the last problem does not seem so different from running a search engine. All the state is in the chat transcript, so I don't think there is any particular reason that successive interactions on the same chat need be handled by the same server. They could just be load-balanced to whatever server is free.
We don't know, for example, when the chat says "Thinking..." whether the model is running or if it's just queued waiting for a free server.
The short answer is "batch size". These days, LLMs are what we call "Mixture of Experts", meaning they only activate a small subset of their weights at a time.
This makes them a lot more efficient to run at high batch size.
If you try to run GPT4 at home, you'll still need enough VRAM to load the entire model, which means you'll need several H100s (each one costs like $40k). But you will be under-utilizing those cards by a huge amount for personal use.
It's a bit like saying "How come Apple can make iphones for billions of people but I can't even build a single one in my garage"
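To make the MoE point concrete, here's a toy sketch of top-k expert routing. The shapes, the random "experts", and the simple softmax router are all made up for illustration, not any particular model's design; the point is that only k of the E expert matmuls run per token.

```python
# Toy sketch of top-k Mixture-of-Experts routing: only K of the E expert MLPs
# run for each token, so FLOPs per token scale with K, not with E.
import numpy as np

E, K, D = 64, 2, 1024                      # experts, experts used per token, hidden dim
router_w = np.random.randn(D, E) * 0.02    # router projection (illustrative)
experts = [np.random.randn(D, D) * 0.02 for _ in range(E)]

def moe_layer(x: np.ndarray) -> np.ndarray:
    logits = x @ router_w                  # score each expert for this token
    top_k = np.argsort(logits)[-K:]        # pick the K best experts
    weights = np.exp(logits[top_k])
    weights /= weights.sum()               # softmax over just the chosen experts
    # Only K expert matmuls execute; the other E-K experts are never touched.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top_k))

token = np.random.randn(D)
out = moe_layer(token)                     # ~K/E of the dense-layer FLOPs
```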
I'm actually not sure I understand how MoE helps here. If you can route a single request to a specific subnetwork then yes, it saves compute for that request. But if you have a batch of 100 requests, unless they are all routed exactly the same, which feels unlikely, aren't you actually increasing the number of weights that need to be processed? (at least with respect to an individual request in the batch).
I'm sure there are countless tricks, but one that can implemented at home, and I know plays a major part in Cerebras' performance is: speculative decoding.
Speculative decoding uses a smaller draft model to generate tokens with much less compute and memory required. Then the main model will accept those tokens based on the probability it would have generated them. In practice this can easily result in a 3x speedup in inference.
Another trick for structured outputs that I know of is "fast forwarding" where you can skip tokens if you know they are going to be the only acceptable outputs. For example, you know that when generating JSON you need to start with `{ "<first key>": ` etc. This can also lead to a ~3x speedup in when responding in JSON.
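Here is a bare-bones sketch of the speculative decoding loop described above, greedy acceptance only (real systems sample and use a probability-ratio acceptance test). `draft_next` and `target_tokens_for` are hypothetical stand-ins for real models:

```python
# Bare-bones greedy speculative decoding: a cheap draft model proposes N tokens,
# the big model scores all of them in one batched forward pass, and we keep the
# longest prefix the big model agrees with (plus its own correction token).
from typing import Callable, List

def speculative_step(ctx: List[int],
                     draft_next: Callable[[List[int]], int],
                     target_tokens_for: Callable[[List[int], List[int]], List[int]],
                     n_draft: int = 4) -> List[int]:
    # 1. Cheap draft model guesses n_draft tokens sequentially.
    guesses = []
    for _ in range(n_draft):
        guesses.append(draft_next(ctx + guesses))

    # 2. Big model verifies all drafted positions in one parallel pass; it
    #    returns its own preferred token at each of the n_draft positions.
    target_tokens = target_tokens_for(ctx, guesses)

    # 3. Accept the agreeing prefix, then take the big model's token at the
    #    first disagreement. Worst case we still gain one valid token.
    accepted = []
    for guess, target in zip(guesses, target_tokens):
        if guess == target:
            accepted.append(guess)
        else:
            accepted.append(target)
            break
    return ctx + accepted
```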
To measure the performance gains on a local machine (or even a standard cloud GPU setup), since you can't run this in parallel with the same efficiency you could in a high-end data center, you need to compare the number of calls made to each model.
In my experiences I'd seen the calls to the target model reduced to a third of what they would have been without using a draft model.
You'll still get some gains on a local model, but they won't be near what they could be theoretically if everything is properly tuned for performance.
It also depends on the type of task. I was working with pretty structured data with lots of easy to predict tokens.
You're right on ratios, but actually the ratio is much worse than 6:1 since they are MoEs. The 20B has 3.6B active, and the 120B has only 5.1B active, only about 40% more!
At the heart of inference is matrix-vector multiplication. If you have many of these operations to do and only the vector part differs (which is the case when you have multiple queries), you can do matrix-matrix multiplication by stuffing the vectors into a matrix. Computing hardware is able to run the equivalent of dozens of matrix-vector multiplication operations in the same time it takes to do 1 matrix-matrix multiplication operation. This is called batching. That is the main trick.
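A minimal numpy illustration of that main trick - stacking the per-request vectors into a matrix so the weights are streamed through the compute units once per batch instead of once per request (shapes here are arbitrary):

```python
# Sketch of the batching trick: stuffing many per-request vectors into one
# matrix turns N matrix-vector products into a single matrix-matrix product,
# so the (expensive) weight read is paid once instead of N times.
import numpy as np

d_model, batch = 4096, 64
W = np.random.randn(d_model, d_model).astype(np.float32)   # one layer's weights
xs = [np.random.randn(d_model).astype(np.float32) for _ in range(batch)]

# Unbatched: W is streamed once per request.
outs_one_by_one = [W @ x for x in xs]

# Batched: stack the vectors and stream W once for all 64 requests.
X = np.stack(xs, axis=1)          # shape (d_model, batch)
outs_batched = W @ X              # one matrix-matrix multiply

assert np.allclose(np.stack(outs_one_by_one, axis=1), outs_batched, atol=1e-3)
```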
A second trick is to implement something called speculative decoding. Inference has two phases: one is prompt processing and the other is token generation. They actually work the same way, using what is called a forward pass, except prompt processing can do them in parallel by switching from matrix-vector to matrix-matrix multiplication and feeding the prompt's tokens into the forward passes in parallel. Each forward pass will create a new token, but it can be discarded unless it is from the last forward pass, as that will be the first new token generated as part of token generation.

Now, you put that token into the next forward pass to get the token after it, and so on. It would be nice if all of the forward passes could be done in parallel, but you do not know the future, so ordinarily you cannot. However, if you make a draft model that is very fast, runs in a fraction of the time, and guesses the next token correctly most of the time, then you can sequentially run the forward pass for that instead, N times. Now, you can take those N tokens and put them into the prompt processing routine that does N forward passes in parallel. Instead of discarding all tokens except the last one like in prompt processing, you compare them to the input tokens. All tokens that come out of the parallel forward passes, up to and including the first one that differs from the draft, are valid output tokens for the main model. This is guaranteed to always produce at least 1 valid token, since in the worst case the first token does not match, but the output at that position is equal to what the forward pass would have produced without speculative decoding. You can get a 2x to 4x performance increase from this if done right.
Now, I do not work on any of this professionally, but I am willing to guess that beyond these techniques, they have groups of machines handling queries of similar length in parallel (since doing a batch where 1 query is much longer than the others is inefficient) and some sort of dynamic load balancing so that machines do not get stuck with a query size that is not actively being utilized.
Your model keeps the weights on slow memory and needs to touch all of them to make 1 token for you. By batching you make 64 tokens for 64 users in one go. And they use dozens of GPUs in parallel to make 1024 tokens in the time your system makes 1 token. So even though the big system costs more, it is much more efficient when being used by many users in parallel. Also, by using many fast GPUs in series to process parts of the neural net, it produces output much faster for each user compared to your local system. You can't beat that.
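Rough illustrative numbers for that effect, assuming a hypothetical 70B-parameter dense fp16 model and ballpark bandwidth figures (real deployments use MoE and quantization, which shift these numbers a lot):

```python
# Why batching wins: single-stream token generation is bounded by how fast you
# can stream the model weights past the compute units, and everyone in a batch
# shares the same stream. All numbers are illustrative assumptions.
params = 70e9
weight_bytes = params * 2                 # fp16 -> ~140 GB touched per token step

local_bw = 1.0e12                         # ~1 TB/s, a single high-end local GPU
local_tps = local_bw / weight_bytes       # ~7 tok/s, serving exactly 1 user

node_bw = 8 * 3.35e12                     # ~27 TB/s across 8 HBM-class GPUs
batch = 64                                # concurrent requests sharing each weight pass
node_aggregate_tps = batch * node_bw / weight_bytes   # ~12,000 tok/s total

print(f"local: ~{local_tps:.0f} tok/s for one user")
print(f"node:  ~{node_aggregate_tps:,.0f} tok/s aggregate across {batch} users "
      f"(until compute, not bandwidth, becomes the limit)")
```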
The big players use parallel processing of multiple users to keep the GPUs and memory filled as much as possible during the inference they are providing to users. They can make use of the fact that they have a fairly steady stream of requests coming into their data centers at all times. This article describes some of how this is accomplished.
I'm pretty much an AI layperson but my basic understanding of how LLMs usually run on my or your box is:
1. You load all the weights of the model into GPU VRAM, plus the context.
2. You construct a data structure called the "KV cache" representing the context, and it hopefully stays in the GPU cache.
3. For each token in the response, for each layer of the model, you read the weights of that layer out of VRAM and use them plus the KV cache to compute the inputs to the next layer. After all the layers you output a new token and update the KV cache with it.
Furthermore, my understanding is that the bottleneck of this process is usually in step 3 where you read the weights of the layer from VRAM.
As a result, this process is very parallelizable if you have lots of different people doing independent queries at the same time, because you can have all their contexts in cache at once, and then process them through each layer at the same time, reading the weights from VRAM only once.
So once you've got the VRAM, it's much more efficient to serve lots of people's different queries than to be one guy doing one query at a time.
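A rough sanity check of why the weight reads in step 3 dominate, using assumed numbers (an 8B-parameter model at FP16 on a GPU with roughly 1 TB/s of memory bandwidth):

    # Back-of-envelope: at batch size 1, every generated token has to stream all
    # of the weights out of VRAM once, so memory bandwidth caps tokens/sec.
    params = 8e9                  # assumed 8B-parameter model
    bytes_per_param = 2           # FP16
    bandwidth = 1e12              # ~1 TB/s of VRAM bandwidth (ballpark)

    print(bandwidth / (params * bytes_per_param), "tok/s ceiling at batch size 1")  # ~62
    # Serve 32 users in one batch and those same weight reads are shared 32 ways.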
It is not just engineering. There are also huge, very huge, investments into infrastructure.
As already answered, AI companies use extremely expensive setups (servers with professional cards) in large numbers, all concentrated in big datacenters with powerful networking and huge power consumption.
Imagine: the last investment wave of comparable size (~1.2% of GDP, and it's unknown whether AI investment will keep growing) was in telecom infrastructure - mostly wired telephones, but also cable TV, and later the Internet, cell networks and clouds were added (in some countries wired phones never covered the whole country, so they jumped directly to wireless).
An even larger wave went into railroads - roughly 6% of GDP (and I'm also not sure there; some people say AI will surpass that, as the share of tasks suitable for AI keeps growing).
So to conclude, right now the AI boom looks like the main consumer of telecom (Internet) and cloud infrastructure. If you've seen old mainframes in datacenters, extremely thick core network cables (with hundreds of wires or fibers in a single cable), and huge satellite dishes, you can imagine what I'm talking about.
And yes, I'm not sure whether this boom will end like the dot-coms did, or whether this huge use of resources will be sustained. It isn't obvious because the same was true for telecom (the Internet): nobody knew whether people would use phones and other person-to-person communications for leisure, as they do now, or keep them just for work. Even more so, if AI agents become an ordinary thing, one possible scenario is that the number of AI agents will surpass the number of people.
A few people have mentioned looking at the vLLM docs and blog (recommended!). I'd also recommend SGLang's docs and blog as well.
If you're interested in a bit of a deeper dive, I can highly recommend reading some of what DeepSeek has published: https://arxiv.org/abs/2505.09343 (and actually quite a few of their Technical Reports and papers).
I'd also say that while the original GPT-4 was a huge model when it was released (rumored 1.7T-A220B), these days you can get (original release) "GPT-4-class" performance at ~30B dense / ~100B sparse MoE - and almost all the leading MoEs have between 12-37B activations no matter how big they get (Kimi K2, at 1T total weights, has only 32B activations). With basic quants (FP8/INT8) you can easily push 100+ tok/s on pretty bog-standard data center GPUs/nodes. You can quant even lower for even better speeds (token generation speed is basically just memory bandwidth) without much quality loss (although with open source kernels you usually don't get much overall throughput or latency improvement from going lower).
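To put a number on the token-generation-is-just-memory-bandwidth point, with assumed figures (about 32B active parameters, 1 byte per parameter at FP8/INT8, and roughly 3.3 TB/s of HBM bandwidth on a modern data center GPU):

    active_params, bytes_per_param = 32e9, 1     # ~32B active at FP8/INT8
    bandwidth = 3.3e12                           # bytes/s of HBM bandwidth (ballpark)
    print(bandwidth / (active_params * bytes_per_param), "tok/s upper bound")  # ~103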
A few people have mentioned speculative decoding, if you want to learn more, I'd recommend taking a look at the papers for one of the (IMO) best open techniques, EAGLE: https://github.com/SafeAILab/EAGLE
The other thing that is often ignored and that I haven't seen mentioned yet, especially for multi-turn, is better caching: specifically prefix caching (radix-tree or block-level hash) and tiered/offloaded KV caches (LMCache is one example). If you search for those keywords, you'll find lots there as well.
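A minimal sketch of the block-level-hash flavor of prefix caching (names invented; real systems like vLLM or SGLang manage GPU memory blocks, not Python objects): identical prompt prefixes across requests map to the same cached KV blocks, so only the uncached tail needs prompt processing.

    import hashlib

    BLOCK = 16                      # tokens per KV-cache block
    kv_block_store = {}             # hash of token prefix -> cached KV block

    def block_key(tokens, end):
        # The key covers the whole prefix, so a block is only reused when
        # everything before it is identical too.
        return hashlib.sha256(str(tokens[:end]).encode()).hexdigest()

    def cached_prefix(tokens):
        """Return KV blocks for the longest cached prefix and how far it reaches."""
        blocks, covered = [], 0
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            key = block_key(tokens, end)
            if key not in kv_block_store:
                break               # caller computes (and stores) blocks from here on
            blocks.append(kv_block_store[key])
            covered = end
        return blocks, covered      # prompt processing only runs on tokens[covered:]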
Inference runs like a stateless web server. If you have 50K or 100K machines, each with tons of GPUs (usually 8 GPUs per node), then you end up with a massive GPU infrastructure that can run hundreds of thousands, if not millions, of inference instances. They use something like Kubernetes on top for scheduling, scaling and spinning up instances as needed.
For storage, they also have massive amounts of hard disks and SSDs behind planet-scale object stores (like AWS's S3, Tectonic at Meta, or MinIO on-prem), all connected by massive numbers of switches and routers of varying capacity.
So in the end, it's just the good old Cloud, but also with GPUs.
Btw, OpenAI's infrastructure is provided and managed by Microsoft Azure.
And, yes, all of this requires billions of dollars to build and operate.
First off I'd say you can run models locally at good speed; llama3.1:8b runs fine on a MacBook Air M2 with 16GB RAM and much better on an Nvidia RTX 3050, which are fairly affordable.
For OpenAI, I'd assume that a GPU is dedicated to your task from the point you press enter to the point it finishes writing. I would think most of the 700 million barely use ChatGPT, while a small proportion use it a lot and likely need to pay due to the limits. Most of the time you have the website/app open you are either reading what it has written, writing something, or it's just open in the background, so ChatGPT isn't doing anything in that time. If we assume 20 queries a week taking 25 seconds each, that's 8.33 minutes a week. That would mean a single GPU could serve up to 1,209 users, meaning for 700 million users you'd need at least 578,703 GPUs. Sam Altman has said OpenAI is due to have over a million GPUs by the end of the year.
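The arithmetic above, spelled out under the same assumptions (one GPU fully occupied per request, 20 queries per user per week at 25 seconds each):

    seconds_per_user_per_week = 20 * 25                        # 500 s ≈ 8.33 min
    users_per_gpu = 7 * 24 * 3600 / seconds_per_user_per_week  # ≈ 1209
    gpus_needed = 700e6 / users_per_gpu                        # ≈ 578,703
    print(int(users_per_gpu), int(gpus_needed))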
I’ve found that the inference speed on newer GPUs is barely faster than older ones (perhaps it’s memory speed limited?). They could be using older clusters of V100, A100 or even H100 GPUs for inference if they can get the model to fit or multiple GPUs if it doesn’t fit. A100s were available in 40GB and 80GB versions.
I would think they use a queuing system to allocate your message to a GPU. Slurm is widely used in HPC compute clusters, so might use that, though likely they have rolled their own system for inference.
The idea that a GPU is dedicated to a single inference task is just generally incorrect. Inputs are batched, and it’s not a single GPU handling a single request, it’s a handful of GPUs in various parallelism schemes processing a batch of requests at once. There’s a latency vs throughput trade off that operators make. The larger that batch size the greater the latency, but it improves overall cluster throughput.
Almost every trick to run a LLM at OpenAI's scale is a trade secret and may not be easily understood by mere mortals anyways (e.g. bare-metal CUDA optimizations)
Multi-tenancy likely explains the bulk of it. $10k vs. $10b gives them six orders of magnitude more GPU resources, but they have 9 orders of magnitude more users. The average user is probably only running an active ChatGPT query for a few minutes per day, which covers the remaining 3 orders of magnitude.
One trick is that multiple prompts are combined into a batch that is subject to the token prediction simultaneously.
I've had Google Gemini start talking to me in the context of someone else's chat, unrelated to mine.
ME: Does POSIX have any API for creating a filesystem object (such as a file) with a predetermined inode number supplied by the caller?
AI: <reformats my question into LaTeX syntax, out of the blue>
\documentclass{article}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\begin{document}
\section*{Does POSIX have any API for creating a filesystem object with a predetermined inode number supplied by the caller?}
\textbf{No, POSIX does not provide a standard API for creating a filesystem object (like a file or directory) with a caller-specified inode number.}
\subsection*{Inode Allocation in POSIX Filesystems}
[... SNIP]
ME: Why are you spewing LaTeX syntax at me?
AI: You are absolutely right! My apologies. That was a mistake on my part. I got carried away with the instruction to use LaTeX for mathematical and scientific notations and incorrectly applied it to the entire response. [... SNIP]
There was no such instruction. I've never chatted with any AI about LaTeX. it leaked from the tokens of someone else's chat.
> There was no such instruction. I've never chatted with any AI about LaTeX. it leaked from the tokens of someone else's chat.
Nope. That's not how it works. Attention doesn't work across multiple independent prompts queued in the same batch. It's not physically possible for the tokens of another chat to leak.
What most likely happened is that the model glitched out to the instructions in its (hidden) system prompt, which most likely does include instructions about using LaTeX for mathematical and scientific notation.
Maybe not due to attention, but it is certainly possible for chat content to get leaked into other conversations due to bugs in the stack, and in fact it has happened before.
"We took ChatGPT offline earlier this week due to a bug in an open-source library which allowed some users to see titles from another active user’s chat history. It’s also possible that the first message of a newly-created conversation was visible in someone else’s chat history if both users were active around the same time."
You are probably right about this particular LaTeX issue though.
They can have a very even load if they use their nodes for training when customer use is low, which massively helps. If they have 3x as much hardware as they need to serve peak demand (even with throttling), that will cost a lot, unless they have another use for lots of GPUs.
Just illustrative guesses, not real numbers, I underestimate overheads here but anyway ...
Let's assume a $20k expert node can produce 500 tokens per second (15,000 megatokens per year). That's $5k a year for the machine, plus $5k overheads. With 5 experts per token, that's $50k to produce 15,000 megatokens at 100% throughput. Say they charge up to $10 per million tokens ... yeah it's tight, but I can see how it's doable.
Say a user costs them $100 per year. If it's $10 per million tokens (depends on the model), then they are budgeting 10 million tokens per user. That's like 100 books per year. The answer is that users probably don't use anywhere near as much as the API price would imply.
The real question is: what does it actually cost them to produce a megatoken, compared to that $10?
500 tokens per second per node is about 15,000 megatokens per year. So at $10 per megatoken, a 500 tok/s node can bring in up to $150,000 per year.
Call it 5 live experts and a router, so maybe $20k of hardware per expert. If it's a kilowatt power supply per expert at $0.10 per kWh, that's about $1,000 a year for power. The hardware is good for 4 years, so $5k a year for that. Toss in overheads, and it's maybe $10k in costs per expert per year.
So at full capacity they can make $5 off $10 revenue. With uneven loads they make nothing, unless they have some optimisation and very good load balancing (if they can double the tokens per second then they make a decent profit).
Lots of good answers that mention the big things (money, scale, and expertise). But one thing I haven't seen mentioned yet is that the transformer math is probably against your use case. Batch compute on beefy hardware is currently more efficient than computing small sequences for a single user at a time, since these models tend to be memory bound and not compute bound. They have the users that make the beefy hardware make sense: enough people are querying at around the same time to make batching possible.
Well, their huge GPU clusters have "insane VRAM". Once you can actually load the model without offloading, inference isn't all that computationally expensive for the most part.
When I think about serving large-scale LLM inference (like ChatGPT), I see it a lot like high-speed web serving — there are layers to it, much like in the OSI model.
1. Physical/Hardware Layer
At the very bottom is the GPU silicon and its associated high-bandwidth VRAM. The model weights are partitioned, compiled, and efficiently placed so that each GPU chip and its VRAM are used to the fullest (ideally). This is where low-level kernel optimizations, fused operations, and memory access patterns matter so that everything above the chip level tries to play nice with the lowest level.
2. Intra-Node Coordination Layer
Inside a single server, multiple GPUs are connected via NVLink (or equivalent high-speed interconnect). Here you use tensor parallelism (splitting matrices across GPUs), pipeline parallelism (splitting model layers across GPUs), or expert parallelism (only activating parts of the model per request) to make the model fit and run faster. The key is minimizing cross-GPU communication latency while keeping all GPUs running at full load - many low level software tricks here.
3. Inter-Node Coordination Layer
When the model spans multiple servers, high-speed networking like InfiniBand comes into play. Techniques like data parallelism (replicating the model and splitting requests), hybrid parallelism (mixing tensor/pipeline/data/expert parallelism), and careful orchestration of collectives (all-reduce, all-to-all) keep throughput high while hiding model communication (slow) behind model computation (fast).
4. Request Processing Layer
Above the hardware/multi-GPU layers is the serving logic: batching incoming prompts together to maximize GPU efficiency and mold them into ideal shapes to max out compute, offloading less urgent work to background processes, caching key/value attention states (KV cache) to avoid recomputing past tokens, and using paged caches to handle variable-length sequences.
5. User-Facing Serving Layer
At the top are optimizations users see indirectly — multi-layer caching for common or repeated queries, fast serialization protocols like gRPC or WebSockets for minimal overhead, and geo-distributed load balancing to route users to the lowest-latency cluster.
Like the OSI model, each “layer” solves its own set of problems but works together to make the whole system scale. That’s how you get from “this model barely runs on a single high-end GPU” to “this service handles hundreds of millions of users per week with low latency.”
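As a small illustration of the "paged caches" idea in layer 4, here is a toy block-table allocator (illustrative only; real systems such as vLLM's PagedAttention manage GPU memory, not Python lists). KV memory is carved into fixed-size blocks, so variable-length sequences never need one big contiguous allocation, and freed blocks are reusable immediately.

    BLOCK_TOKENS = 16

    class PagedKVCache:
        def __init__(self, num_blocks):
            self.free_blocks = list(range(num_blocks))   # physical block ids
            self.block_tables = {}                       # sequence id -> list of block ids

        def slot_for(self, seq_id, pos):
            """Where token `pos` of this sequence stores its KV entry."""
            table = self.block_tables.setdefault(seq_id, [])
            if pos % BLOCK_TOKENS == 0:                  # sequence grew into a new block
                table.append(self.free_blocks.pop())
            return table[pos // BLOCK_TOKENS], pos % BLOCK_TOKENS

        def release(self, seq_id):
            """Finished sequences hand their blocks straight back to the free pool."""
            self.free_blocks.extend(self.block_tables.pop(seq_id, []))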
If the explanation really is, as many comments here suggest, that prompts can be run in parallel in batches at low marginal additional cost, then that feels like bad news for the democratization and/or local running of LLMs. If it’s only cost-effective to run a model for ~thousands of people at the same time, it’s never going to be cost-effective to run on your own.
Sure, but that's how most of human society works already.
It's more cost effective to farm eggs from a hundred thousand chickens than it is for individuals to have chickens in their yard.
You CAN run a GPT-class model on your own machine right now, for several thousand dollars of machine... but you can get massively better results if you spend those thousands of dollars on API credits over the next five years or so.
Some people will choose to do that. I have backyard chickens, they're really fun! Most expensive eggs I've ever seen in my life.
50 years ago general computers were also time-shared. Then the pendulum swung to desktop, then back to central.
I for one look forward to another 10 years of progress - or less - putting current models running on a laptop. I don’t trust any big company with my data
For fungible things, it's easy to cost out. But not all things can be broken down just in token cost, especially as people start building their lives around specific models.
Even beyond privacy just the availability is out of your control - you can look at r/ChatGPT's collective spasm yesterday when 4o was taken from them, but basically, you have no guarantees to access for services, and for LLM models in particular, "upgrades" can completely change behavior/services that you depend on.
Google has been even worse in the past here, I've seen them deprecate model versions with 1 month notices. It seems a lot of model providers are doing dynamic model switching/quanting/reasoning effort adjustments based on load now.
Well, you can also batch your own queries. Not much use for a chatbot but for an agentic system or offline batch processing it becomes more reasonable.
Consider a system were running a dozen queries at once is only marginally more expensive than running one query. What would you build?
That math determines whether it's cost-effective enough to make training one of these models worth it in the first place. If you're just using someone else's weights, you can afford to predict quite inefficiently.
You and your engineering team might be able to figure it out and purchase enough equipment also if you had received billions of dollars. And billions and billions. And more billions and billions and billions. Then additional billions, and more billions and billions and even more billions and billions of dollars. They have had 11 rounds of funding totaling around $60 billion.
My mental model is: "How can an airline move 100 customers from NY to LA with such low latency, when my car can't even move me without painfully slow speeds".
- the models are not too big for the cards. Specifically, they know the cards they have and they modify the topology of the model to fit their hardware well
- lots of optimisations. Eg the most trivial implementation of transformer-with-attention inference is going to be quadratic in the size of your output but actual implementations are not quadratic. Then there are lots of small things: tracing the specific model running on the specific gpu, optimising kernels, etc
- more costs are amortized. Your hardware is relatively expensive because it is mostly sitting idle. AI company hardware gets much more utilization and therefore can be relatively more expensive hardware, where customers are mostly paying for energy.
I work at a university data center, although not on LLMs. We host state of the art models for a large number of users. As far as I understand, there is no secret sauce. We just have a big GPU cluster with a batch system, where we spin up jobs to run certain models. The tricky part for us is to have the various models available on demand with no waiting time.
But I also have to say that 700M weekly users could mean 100M daily, or about 70k a minute (a lowball estimate assuming no returning users...). That is a lot, but achievable at startup scale. I don't have our current numbers, but we are several orders of magnitude smaller of course :-)
The big difference to home use is the amount of VRAM. Large-VRAM GPUs such as the H100 are gated behind support contracts and cost $20k. Theoretically you could buy a Mac Pro with a ton of RAM as an individual if you wanted to run such models yourself.
TL;DR: It's massively easier to run a few models really fast than it is to run many different models acceptably.
They probably are using some interesting hardware, but there's a strange economy of scale when serving lots of requests for a small number of models. Regardless of if you are running single GPU, clustered GPU, FPGAs, or ASICs, there is a cost with initializing the model that dwarfs the cost of inferring on it by many orders of magnitude.
If you build a workstation with enough accelerator-accessible memory to have "good" performance on a larger model, but only use it with typical user access patterns, that hardware will be sitting idle the vast majority of the time. If you switch between models for different situations, that incurs a load penalty, which might evict other models, which you might have to load in again.
However, if you build an inference farm, you likely have only a few models you are working with (possibly with some dynamic weight shifting[1]) and there are already some number of ready instances of each, so that load cost is only incurred when scaling a given model up or down.
I've had the pleasure to work with some folks around provisioning an FPGA+ASIC based appliance, and it can produce mind-boggling amounts of tokens/sec, but it takes 30m+ to load a model.
[1] there was a neat paper at SC a few years ago about that, but I can't find it now
> The only way to do fast inference here is to pipeline those layers by having one GPU handle the first ten layers, another handle the next ten, and so on. Otherwise you just won’t be able to fit all the weights in a single GPU’s memory, so you’ll spend a ton of time swapping weights in and out of memory and it’ll end up being really slow. During inference, each token (typically in a “micro batch” of a few tens of tokens each) passes sequentially through that pipeline of GPUs
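A minimal sketch of that layer splitting (here `layers` is a hypothetical list of per-layer forward functions already placed on their assigned GPUs, the layer count is assumed to divide evenly into stages, and the overlap of several micro-batches in flight is left out):

    def make_pipeline(layers, num_stages):
        per_stage = len(layers) // num_stages
        stages = [layers[i * per_stage:(i + 1) * per_stage] for i in range(num_stages)]

        def forward(micro_batch):
            x = micro_batch
            for stage in stages:
                # In a real deployment this boundary is a GPU-to-GPU transfer
                # (NVLink / InfiniBand), and several micro-batches are in flight
                # at once so every stage stays busy.
                for layer in stage:
                    x = layer(x)
            return x

        return forward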
I would also point out that 700 million per week is not that much. It probably translates to thousands of QPS, which is "easily" served by thousands of big machines.
The serving infrastructure becomes very efficient when serving requests in parallel.
Look at VLLM. It's the top open source version of this.
But the idea is you can service 5000 or so people in parallel.
You get about 1.5-2x slowdown on per token speed per user, but you get 2000x-3000x throughput on the server.
The main insight is that memory bandwidth is the main bottleneck so if you batch requests and use a clever KV cache along with the batching you can drastically increase parallel throughput.
Look for positron.ai's talks about their tech; they discuss their approach to scaling LLM workloads with their dedicated hardware. It may not be what OpenAI or other vendors do, but you'll get an idea of the underlying problems.
Baseten serves models as a service, at scale. There’s quite a lot of interesting engineering both for inference and infrastructure perf. This is a pretty good deep dive into the tricks they employ: https://www.baseten.co/resources/guide/the-baseten-inference...
I once solved a similar issue in a large application by applying the Flyweight design pattern at massive scale. The architectural details could fill an article, but the result was significant performance improvement.
The first step is to acquire hardware fast enough to run one query quickly (and yes, for some model sizes you are looking at sharding the model and doing distributed runs). The next is to batch requests, which improves GPU utilization significantly.
Take a look at vLLM for an open source solution that is pretty close to the state of the art as far as handling many user queries: https://docs.vllm.ai/en/stable/
Have you looked at what happens to tokens per second when you increase batch size? The cost of serving 128 queries at once is not 128x the cost of serving one query.
This. The main trick, outside of just bigger hardware, is smart batching. E.g. if one user asks why the sky is blue and another asks what to make for dinner, both queries go through the same transformer layers and the same model weights, so they can be answered concurrently for very little extra GPU time. There are also ways to continuously batch requests together so they don't have to be issued at the same time.
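A toy sketch of that continuous batching loop (here the queue holds request ids and `model_step` is a hypothetical batched forward pass returning one `(request_id, next_token, finished)` triple per active request):

    from collections import deque

    def serve(requests: deque, model_step, max_batch: int = 32):
        active = {}                                      # request id -> tokens generated so far
        while requests or active:
            while requests and len(active) < max_batch:  # refill free slots immediately
                active[requests.popleft()] = []
            for req_id, token, finished in model_step(list(active)):
                active[req_id].append(token)
                if finished:                             # finished sequences leave the batch
                    yield req_id, active.pop(req_id)     # now, not when the whole batch ends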
Once you have enough GPUs to have your whole model available in GPU RAM you can do inference pretty fast.
As soon as you have enough users you can let your GPUs burn with a high load constantly, while your home solution would idle most of the time and therefore be way too expensive compared to the value.
Complete guess, but my hunch is that it's in the sharding. When they break apart your input into its components, they send it off to hardware that is optimized to solve for that piece. On that hardware they have insane VRAM and it's already cached in a way that optimizes that sort of problem.
1. They have many machines to split the load over
2. MoE architecture that lets them shard experts across different machines - each token only activates a few experts, and those experts are spread across machines, so no single machine needs to hold all of the weights in VRAM and the per-machine bandwidth requirement drops by roughly 1/N
3. They batch tokens from multiple users to further reduce memory bandwidth (e.g. they apply a given set of weights to many users' tokens at once). This reduces bandwidth requirements significantly as well.
So basically the main tricks are batching (only relevant when you have > 1 query to process) and MoE sharding.
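A toy sketch of the MoE routing part (made-up sizes; in a real deployment each expert's weights live on a different machine and the per-expert call is an RPC/all-to-all, not a local matmul):

    import numpy as np

    n_experts, top_k, d = 8, 2, 512
    experts = [np.random.randn(d, d).astype(np.float32) for _ in range(n_experts)]
    router_w = np.random.randn(d, n_experts).astype(np.float32)

    def moe_layer(x):                                # x: one token's activations, shape (d,)
        logits = x @ router_w
        chosen = np.argsort(logits)[-top_k:]         # router picks top-k experts per token
        z = logits[chosen] - logits[chosen].max()    # softmax over the chosen experts
        gates = np.exp(z) / np.exp(z).sum()
        # Only the chosen experts do any work, so each machine only touches a
        # fraction of the total weights for any given token.
        return sum(g * (experts[e] @ x) for g, e in zip(gates, chosen))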
Basically, if Nvidia sold AI GPUs at consumer prices, OpenAI and others would buy them all up for the lower price, consumers would not be able to buy them, and Nvidia would make less money. So instead, we normies can only get "gaming" cards with pitiful amounts of VRAM.
AI development is for rich people right now. Maybe when the bubble pops and the hardware becomes more accessible, we'll start to see some actual value come out of the tech from small companies or individuals.
What incentive do any of the big LLM providers have to solve this problem? I know there are technical reasons, but SaaS is a lucrative and proven business model and the systems have for years all been built by companies with an incentive to keep that model running, which means taking any possible chance to trade off against the possibility of some paying customer ever actually being able to run the software on their own computer. Just like the phone company used to never let you buy a telephone (you had to rent it from the phone company, which is why all the classic Western Electric telephones were indestructible chunks of steel).
Huge batches to find the perfect balance between compute and memory bandwidth, quantized models, speculative decoding or similar techniques, MoE models, routing of requests to smaller models if acceptable, and batch processing to fill the GPUs when demand is lower (or electricity is cheaper).
Elsewhere in the thread, someone talked about how h100’s each have 80GB of vram and cost 20000 dollars.
The largest chatgpt models are maybe 1-1.5tb in size and all of that needs to load into pooled vram. That sounds daunting, but a company like open ai has countless machines that have enough of these datacenter grade gpus with gobs of vram pooled together to run their big models.
Inference is also pretty cheap, especially when a model can comfortably fit in a pool of VRAM. It's not that a pool of GPUs spools up each time someone sends a request; what's more likely is that there's a queue of requests from ChatGPT's 700 million users, and the multiple (I have no idea how many) pools of VRAM keep the models in memory to chew through that nearly perpetual queue of requests.
I do not have a technical answer, but I have the feeling that the concept of "loss leaders" is useful
IMO outfits like OpenAI are burning metric shit tonnes of cash serving these models. And it pales in comparison to the mega shit tonnes of cash used to train the models.
They hope to gain market share before they start charging customers what it costs.
At the end of the day, the answer is... specialized hardware. No matter what you do on your local system, you don't have the interconnects necessary. Yes, they have special software, but the software would not work locally. NVIDIA sells entire solutions and specialized interconnects for this purpose. They are well out of the reach of the standard consumer.
But software wise, they shard, load balance, and batch. ChatGPT gets 1000s (or something like that) of requests every second. Those are batched and submitted to one GPU. Generating text for 1000 answers is often the same speed as generating for just 1 due to how memory works on these systems.
They also don’t need one system per user. Think of how often you use their system over the week, maybe one hour total? You can shove 100+ people into sharing one system at that rate… so already you’re down to only needing 7 million systems.
not affiliated with them and i might be a little out of date but here are my guesses
1. prompt caching
2. some RAG to save resources
3. of course lots model optimizations and CUDA optimizations
4. lots of throttling
5. offloading parts of the answer that are better served by other approaches (if asked to add numbers, do a system call to a calculator instead of using LLM)
6. a lot of sharding
One thing you should ask is: What does it mean to handle a request with chatgpt? It might not be what you think it is.
I work at Google on these systems everyday (caveat this is my own words not my employers)). So I simultaneously can tell you that its smart people really thinking about every facet of the problem, and I can't tell you much more than that.
However I can share this written by my colleagues! You'll find great explanations about accelerator architectures and the considerations made to make things fast.
https://jax-ml.github.io/scaling-book/
In particular your questions are around inference which is the focus of this chapter https://jax-ml.github.io/scaling-book/inference/
Edit: Another great resource to look at is the unsloth guides. These folks are incredibly good at getting deep into various models and finding optimizations, and they're very good at writing it up. Here's the Gemma 3n guide, and you'll find others as well.
https://docs.unsloth.ai/basics/gemma-3n-how-to-run-and-fine-...
Same explanation but with less mysticism:
Inference is (mostly) stateless. So unlike training where you need to have memory coherence over something like 100k machines and somehow avoid the certainty of machine failure, you just need to route mostly small amounts of data to a bunch of big machines.
I don't know what the specs of their inference machines are, but where I worked the machines research used were all 8gpu monsters. so long as your model fitted in (combined) vram, you could job was a goodun.
To scale the secret ingredient was industrial amounts of cash. Sure we had DGXs (fun fact, nvidia sent literal gold plated DGX machines) but they wernt dense, and were very expensive.
Most large companies have robust RPC, and orchestration, which means the hard part isn't routing the message, its making the model fit in the boxes you have. (thats not my area of expertise though)
> Inference is (mostly) stateless. ... you just need to route mostly small amounts of data to a bunch of big machines.
I think this might just be the key insight. The key advantage of doing batched inference at a huge scale is that once you maximize parallelism and sharding, your model parameters and the memory bandwidth associated with them are essentially free (since at any given moment they're being shared among a huge amount of requests!), you "only" pay for the request-specific raw compute and the memory storage+bandwidth for the activations. And the proprietary models are now huge, highly-quantized extreme-MoE models where the former factor (model size) is huge and the latter (request-specific compute) has been correspondingly minimized - and where it hasn't, you're definitely paying "pro" pricing for it. I think this goes a long way towards explaining how inference at scale can work better than locally.
(There are "tricks" you could do locally to try and compete with this setup, such as storing model parameters on disk and accessing them via mmap, at least when doing token gen on CPU. But of course you're paying for that with increased latency, which you may or may not be okay with in that context.)
mmap is not free. It just moves bandwidth around.
> The key advantage of doing batched inference at a huge scale is that once you maximize parallelism and sharding, your model parameters and the memory bandwidth associated with them are essentially free (since at any given moment they're being shared among a huge amount of requests!)
Kind of unrelated, but this comment made me wonder when we will start seeing side channel attacks that force queries to leak into each other.
I don't think it's either useful or particularly accurate to characterize modern disagg racks of inference gear, well-understood RDMA and other low-overhead networking techniques, aggressive MLA and related cache optimizations that are in the literature, and all the other stuff that goes into a system like this as being some kind of mystical thing attended to by a priesthood of people from a different tier of hacker.
This stuff is well understood in public, and where a big name has something highly custom going on? Often as not it's a liability around attachment to some legacy thing. You run this stuff at scale by having the correct institutions and processes in place that it takes to run any big non-trivial system: that's everything from procurement and SRE training to the RTL on the new TPU, and all of the stuff is interesting, but if anyone was 10x out in front of everyone else? You'd be able to tell.
Signed, Someone Who Also Did Megascale Inference for a TOP-5 For a Decade.
> So I simultaneously can tell you that its smart people really thinking about every facet of the problem, and I can't tell you much more than that.
"we do 1970s mainframe style timesharing"
there, that was easy
For real. Say it takes 1 machine 5 seconds to reply, and that a machine can only possibly form 1 reply at a time (which I doubt, but for argument).
If the requests were regularly spaced, and they certainly won't be, but for the sake of argument, then 1 machine could serve about 17,000 requests per day, or 120,000 per week. At that rate, you'd need roughly 5,800 machines to serve 700M requests. That's a lot to me, but not to someone who owns a data center.
Yes, those 700M users will issue more than 1 query per week and they won’t be evenly spaced. However, I’d bet most of those queries will take well under 1 second to answer, and I’d also bet each machine can handle more than one at a time.
It’s a large problem, to be sure, but that seems tractable.
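The same back-of-envelope, spelled out (5 seconds per reply, one reply at a time per machine, 700M requests per week):

    replies_per_machine_per_week = 7 * 24 * 3600 / 5     # = 120,960
    machines = 700e6 / replies_per_machine_per_week      # ≈ 5,787
    print(int(replies_per_machine_per_week), int(machines))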
Yes. And batched inference is a thing, where intelligent grouping/bin packing and routing of requests happens. I expect a good amount of "secret sauce" is at this layer.
Here's an entry-level link I found quickly on Google, OP: https://medium.com/@wearegap/a-brief-introduction-to-optimiz...
But that's not accurate. There are all sorts of tricks around the KV cache where different users will have the same first X tokens because they share system prompts, caching entire inputs/outputs when the context and user data are identical, and more.
Not sure if you were just joking or really believe that, but for other peoples’ sake, it’s wildly wrong.
Makes perfect sense, completely understand now!
Doesn't Google have TPUs that make inference of their own models much more profitable than, say, having to rent NVIDIA cards?
Doesn't OpenAI depend mostly on its relationship/partnership with Microsoft to get GPUs to inference on?
Thanks for the links, interesting book!
Yes. Google is probably gonna win the LLM game tbh. They had a massive head start with TPUs which are very energy efficient compared to Nvidia Cards.
Google will win the LLM game if the LLM game is about compute, which is the common wisdom and maybe true, but not foreordained by God. There's an argument that if compute was the dominant term that Google would never have been anything but leading by a lot.
Personally right now I see one clear leader and one group going 0-99 like a five sigma cosmic ray: Anthropic and the PRC. But this is because I believe/know that all the benchmarks are gamed as hell, its like asking if a movie star had cosmetic surgery. On quality, Opus 4 is 15x the cost and sold out / backordered. Qwen 3 is arguably in next place.
In both of those cases, extreme quality expert labeling at scale (assisted by the tool) seems to be the secret sauce.
Which is how it would play out if history is any guide: when compute as a scaling lever starts to flatten, you expert-label like it's 1987 and claim it's compute and algorithms until the government wises up and stops treating your success personally as a national security priority. It's the easiest trillion Xi Jinping ever made: pretending to think LLMs are AGI too, fast-following for pennies on the dollar, and propping up a stock market bubble to go with the fentanyl crisis? 9-D chess. It's what I would do about AI if I were China.
Time will tell.
I believe Google might win the LLM game simply because they have the infrastructure to make it profitable - via ads.
All the LLM vendors are going to have to cope with the fact that they're lighting money on fire, and Google has the paying customers (advertisers) and, with the user-specific context they get from their LLM products, one of the juiciest and most targetable ad audiences of all time.
The only one who can stop Google is Google.
They’ll definitely have the best model, but there is a chance they will f*up the product / integration into their products.
There is plenty of time left to fumble the ball.
It would take talent for them to mess up hosting businesses who want to use their TPUs on GCP.
But then again, even there, their reputation for abandoning products, lack of customer service, and condescension toward large enterprises' "legacy tech" lets Microsoft, the king of hand-holding big enterprise, and even AWS run roughshod over them.
When I was at AWS ProServe, we didn’t even bother coming up with talking points when competing with GCP except to point out how they abandon services. Was it partially FUD? Probably. But it worked.
>It would take talent for them to mess up hosting businesses who want to use their TPUs on GCP.
there are few groups as talented at losing a head start as google.
> It would take talent for them to mess up hosting businesses who want to use their TPUs on GCP.
> But then again even there, their reputation for abandoning products
What are the chances of abandoning TPU-related projects where the company literally invested billions in infrastructure? Zero.
Google employees collectively have a lot of talent.
A truly astonishing amount of talent applied to… hosting emails very well, and losing the search battle against SEO spammers.
Well, Search had no chance when the sites also make money from Google ads. Google fucked their Search by creating themselves incentives for bounce rate.
Yeah honestly. They could just try selling solutions and SLAs combining their TPU hardware with on-prem SOTA models and practically dominate enterprise. From what I understand, that's GCP's gameplay too for most regulated enterprise clients.
Googles bread and butter is advertising, so they have a huge interest in keeping things in house. Data is more valuable to them than money from hardware sales.
Even then, I think their primary use case is going to be consumer-grade good AI on phones. I dunno why the Gemma QAT models fly so low under the radar, but you can basically get full-scale Llama 3-like performance from a single 3090 now, at home.
It's my understanding that Google makes the bulk of its ad money from search ads - sure, they harvest a ton of data, but it isn't as valuable to them as you'd think. I suspect they know that could change, so they're hoovering up as much as they can to hedge their bets. Meta, on the other hand, is all about targeted ads.
https://www.cnbc.com/2025/04/09/google-will-let-companies-ru...
Google has already started the process of letting companies self-host Gemini, even on NVidia Blackwell GPUs.
Although imho, they really should bundle it with their TPUs as a turnkey solution for those clients who haven't invested in large scale infra like DCs yet.
My guess is that either Google wants a high level of physical control over their TPUs, or they have one sort of deal or another with NVidia and don't want to step on their toes.
And also, Google's track record with hardware.
Gemma's terms of use?
Renting out hardware like that would be such a cleansing, old-school revenue stream for Google... just imagine...
But they’re ASICs so any big architecture changes will be painful for them right?
I'd think no. They have the hardware and software experience, likely have next and next-next plans in place already. The big hurdle is money, which G has a bunch of.
TPUs are accelerators that accelerate the common operations found in neural nets. A big part is simply a massive number of matrix FMA units to process enormous matrix operations, which comprise the bulk of a forward pass through a model. Caching enhancements and massively growing memory were necessary to facilitate transformers, but on the hardware side not a huge amount has changed, and the fundamentals from years ago still power the latest models. The hardware is just getting faster, with more memory and more parallel processing units, and later more data types to enable hardware-supported quantization.
So it isn't like Google designed a TPU for a specific model or architecture. They're pretty general purpose in a narrow field (oxymoron, but you get the point).
The set of operations Google designed into a TPU is very similar to what nvidia did, and it's about as broadly capable. But Google owns the IP and doesn't pay the premium and gets to design for their own specific needs.
There are plenty of matrix multiplies in the backward pass too. Obviously this is less useful when serving but it's useful for training.
Hasn’t the Inferentia chip been around long enough to make the same argument? AWS and Google probably have the same order of magnitude of their own custom chips
Inferentia has a generally worse stack but yes
I'm a research person building models, so I can't answer your questions well (save for one part).
That is, as a research person using our GPUs and TPUs, I see first hand how choices from the high-level Python code, through JAX, down to the TPU architecture all work together to make training and inference efficient. You can see a bit of that in the gif on the front page of the book. https://jax-ml.github.io/scaling-book/
I also see how sometimes bad choices by me can make things inefficient. Luckily for me if my code/models are running slow I can ping colleagues who are able to debug at both a depth and speed that is quite incredible.
And because we're on HN, I want to preemptively call out my positive bias for Google! It's a privilege to be able to see all this technology first hand, work with great people, and do my best to ship this at scale across the globe.
This caught my attention "But today even “small” models run so close to hardware limits".
Sounds analogous to the 60's and 70's i.e "even small programs run so close to hardware limits". If optimization and efficiency is dead in software engineering, it's certainly alive and well in LLM development.
Why does the unsloth guide for gemma 3n say:
> llama.cpp an other inference engines auto add a <bos> - DO NOT add TWO <bos> tokens! You should ignore the <bos> when prompting the model!
That makes me want to try exactly that. Weird.
No, you just over complicate things.
> Another great resource to look at is the unsloth guides.
And folks at LMSys: https://lmsys.org/blog/
If people at google are so smart why can't google.com get a 100% lighthouse score?
I have met a lot of people at Google; they have some really good engineers and some mediocre ones. But most importantly, they are just normal engineers dealing with normal office politics.
I don't like how the grand parent mystifies this. This problem is just normal engineering. Any good engineer could learn how to do it.
Because most smart people are not generalists. My first boss was really smart and managed to found a university institute in computer science. The 3 other professors he hired were, ahem, strange choices. We 28-year-old assistants could only shake our heads. After fighting a couple of years with his own hires, the founder left in frustration to found another institution.
One of my colleagues was only 25, really smart in his field and became a professor less than 10 years later. But he was incredibly naive in everyday chores. Buying groceries or filing taxes resulted in major screw-ups regularly
I have met those supersmart specialists but in my experience there are also a lot of smart people who are more generalists.
The real answer is likely internal company politics and priorities. Google certainly has people with the technical skills to solve it but do they care and if they care can they allocate those skilled people to the task?
My observation is that in general smart generalists are smarter than smart specialists. I work at Google, and it’s just that these generalists folks are extremely fast learners. They can cover breadth and depth of an arbitrary topic in a matter of 15 minutes, just enough to solve a problem at hand.
It's quite intimidating how fast they can break down difficult concepts into first principles. I've witnessed this first hand and it's beyond intimidating. Makes you wonder what you're doing at this company... That being said, the caliber of folks I'm talking about is quite rare, like the top 10% of the top 2-3% of teams at Google.
A lot of really smart people working on problems that don't even really need to be solved is an interesting aspect of market allocation.
Can you explain what you mean about 'not needing to be solved'? There are versions of that kind of critique that would seem, at least on the surface, to better apply to finance or flash trading.
I ask because scaling a system that a substantial chunk of the population finds incredibly useful, including for the more efficient production of public goods (scientific research, for example), does seem like a problem that a) needs to be solved from a business point of view, and b) should be solved from a civic-minded point of view.
I think the problem I see with this type of response is that it doesn't take into context the waste of resources involved. If the 700M users per week is legitimate then my question to you is: how many of those invocations are worth the cost of resources that are spent, in the name of things that are truly productive?
And if AI was truly the holy grail that it's being sold as then there wouldn't be 700M users per week wasting all of these resources as heavily as we are because generative AI would have already solved for something better. It really does seem like these platforms are, and won't be, anywhere as useful as they're continuously claimed to be.
Just like Tesla FSD, we keep hearing about a "breakaway" model and the broken record of AGI. Instead of getting anything exceptionally better we seem to be getting models tuned for benchmarks and only marginal improvements.
I really try to limit what I'm using an LLM for these days. And not simply because of the resource pigs they are, but because it's also often a time sink. I spent an hour today testing out GPT-5 and asking it about a specific problem I was solving for using only 2 well documented technologies. After that hour it had hallucinated about a half dozen assumptions that were completely incorrect. One so obvious that I couldn't understand how it had gotten it so wrong. This particular technology, by default, consumes raw SSE. But GPT-5, even after telling it that it was wrong, continued to give me examples that were in a lot of ways worse and kept resorting to telling me to validate my server responses were JSON formatted in a particularly odd way.
Instead of continuing to waste my time correcting the model I just went back to reading the docs and GitHub issues to figure out the problem I was solving for. And that led me down a dark chain of thought: so what happens when the "teaching" mode rethinks history, or math fundamentals?
I'm sure a lot of people think ChatGPT is incredibly useful. And a lot of people are bought into not wanting to miss the boat, especially those who don't have any clue how it works and what it takes to execute any given prompt. I actually think LLMs are on a trajectory similar to social media's. The curve is different, and I hope we haven't yet seen its most useful aspects come to fruition. But I do think that if OpenAI is serving 700M users per week then, once again, we are the product. Because if AI could actually displace workers en masse today, you wouldn't have access to it for $20/month. And they wouldn't offer it to you at 50% off for the next 3 months when you go to hit the cancel button. In fact, if it could do most of the things executives are claiming, then you wouldn't have access to it at all. But, again, the users are the product - in very much the same way it played out with social media.
Finally, I'd surmise that of those 700M weekly users, fewer than 10% of the sessions are being used for anything productive of the kind you've mentioned, and I'd place a high wager that even 10% is wildly generous. I could be wrong, but again - we'd know about it if it were the actual truth.
> If the 700M users per week is legitimate then my question to you is: how many of those invocations are worth the cost of resources that are spent, in the name of things that are truly productive?
Is everything you spend resources on truly productive?
Who determines whether something is worth it? Is price/willingness of both parties to transact not an important factor?
I don't think ChatGPT can do most things I do. But it does eliminate drudgery.
> so what happens when the "teaching" mode rethinks history, or math fundamentals?
The person attempting to learn either (hopefully) figures out the AI model was wrong, or sadly learns the wrong material. The level of impact is probably quite relative to how useful the knowledge is one's life.
The good or bad news, depending on how you look at it, is that humans are already great at rewriting history and believing wrong facts, so I am not entirely sure an LLM can do that much worse.
Maybe ChatGPT might just kill off the ignorant, like it already has? GPT already told a user to combine bleach and vinegar, which produces chlorine gas. [1]
[1] https://futurism.com/chatgpt-bleach-vinegar
They won’t be honest and explain it to you but I will. Takes like the one you’re responding to are from loathsome pessimistic anti-llm people that are so far detached from reality they can just confidently assert things that have no bearing on truth or evidence. It’s a coping mechanism and it’s basically a prolific mental illness at this point
And what does that make you? A "loathsome clueless pro-llm zealot detached from reality"? LLMs are essentially next word predictors marketed as oracles. And people use them as that. And that's killing them. Because LLMs don't actually "know", they don't "know that they don't know", and won't tell you they are inadequate when they are. And that's a problem left completely unsolved. At the core of very legitimate concerns about the proliferation of LLMs. If someone here sounds irrational and "coping", it very much appears to be you.
Well, we all thought advertising was the worst thing to come out of the tech industry, someone had to prove us wrong!
Just wait until the two combine.
> working on problems that don't even really need to be solved
Very, very few problems _need_ to be solved. Feeding yourself is a problem that needs to be solved in order for you to continue living. People solve problems for different reasons. If you don't think LLMs are valuable, you can just say that.
The notion that simply pretending not to understand that I was making a value judgment about worth counts as an argument is tiring.
The few problems humanity has that need to be solved:
1. How to identify humanity's needs on all levels, including cosmic ones...(we're in the Space Age so we need to prepare ourselves for meeting beings from other places)
2. How to meet all of humanity's needs
Pointing this out regularly is probably necessary, because the issue isn't why people are choosing what they're doing... it's that our systems actively disincentivize collectively addressing these two problems in a way that doesn't sacrifice people's wellbeing/lives... and most people don't even think about it like this.
An H100 is a $20k USD card and has 80GB of vRAM. Imagine a 2U rack server with $100k of these cards in it. Now imagine an entire rack of these things, plus all the other components (CPUs, RAM, passive cooling or water cooling) and you're talking $1 million per rack, not including the costs to run them or the engineers needed to maintain them. Even the "cheaper"
I don't think people realize the size of these compute units.
When the AI bubble pops is when you're likely to be able to realistically run good local models. I imagine some of these $100k servers going for $3k on eBay in 10 years, and a lot of electricians being asked to install new 240v connectors in makeshift server rooms or garages.
What do you mean 10 years?
You can pick up a DGX-1 on Ebay right now for less than $10k. 256 GB vRAM (HBM2 nonetheless), NVLink capability, 512 GB RAM, 40 CPU cores, 8 TB SSD, 100 Gbit HBAs. Equivalent non-Nvidia branded machines are around $6k.
They are heavy, noisy like you would not believe, and a single one just about maxes out a 16A 240V circuit. Which also means it produces 13 000 BTU/hr of waste heat.
Fair warning: the BMCs on those suck so bad, and the firmware bundles are painful, since you need a working nvidia-specific container runtime to apply them, which you might not be able to get up and running because of a firmware bug causing almost all the ram to be presented as nonvolatile.
Are there better paths you would suggest? Any hardware people have reported better luck with?
Honestly, unless you //really// need nvlink/ib (meaning that copies and pcie trips are your bottleneck), you may do better with whatever commodity system with sufficient lanes, slots, and CFM is available at a good price.
It's not waste heat if you only run it in the winter.
Only if you ignore that both gas furnaces and heat pumps are more efficient than resistive loads.
Heat pump sure, but how is gas furnace more efficient than resistive load inside the house? Do you mean more economical rather than more efficient (due to gas being much cheaper/unit of energy)?
Depends where your electricity comes from. If you're burning fossil fuels to make electricity, that's only about 40% efficient, so you need to burn 2.5x as much fuel to get the same amount of heat into the house.
Sure. That has nothing to do with the efficiency of your system though. As far as you are concerned this is about your electricity consumption for the home server vs gas consumption. In that sense resistive heat inside the home is 100% efficient compared to gas furnace; the fuel cost might be lower on the latter.
Sure, it's "equally efficient" if you ignore the inefficient thing that is done outside where you draw the system box, directly in proportion to how much you do it.
Heating my house with a giant diesel-powered radiant heater from across the street is infinitely efficient, too, since I use no power in my house.
If you don’t close the box of the system at some point to isolate the input, efficiency would be meaningless. I think in the context of the original post, suggesting running a server in winter would be a zero-waste endeavor if you need the heat anyway, it is perfectly clear that the input is electricity to your home at a certain $/kWh and gas at a certain $/BTU. Under that premise, it is fair to say that would not be true if you have a heat pump deployed but would be true compared to gas furnace in terms of efficiency (energy consumed for unit of heat), although not necessarily true economically.
Generating 1kWh of heat with electric/resistive is more expensive than gas, which itself is more expensive than a heat pump, based on the cost of fuel to go in
If your grid is fossil fuels burning the fuel directly is more efficient. In all cases a heat pump is more efficient.
It’d be fun to actually calculate this efficiency. My local power is mostly nuclear so I wonder how that works out.
You accelerate the climate catastrophe so there's less need for heating in the long run.
I'm in the market for an oven right now and 230V/16A is the voltage/current the one I'll probably be getting operates under.
At 90°C you can do sous vide, so basically use that waste heat entirely.
For such temperatures you'd need a CO2 heat pump, which is still expensive. I don't know about gas, as I don't even have a line to my place.
90C for sous vide??? You're going to kill any meal at 90.
How can you bear to eat sous vide though? I've tried it for months and years, and I still find it troublesome. So mushy, nothing to enjoy.
Did you skip searing it after sous vide? Did you sous vide it to the "instantly kill all bacteria" temperature (145°F for steak) thereby overcooking & destroying it, or did you sous vide to a lower temperature (at most 125°F) so that it'd reach a medium-rare 130°F-140°F after searing & carryover cooking during resting? It should have a nice seared crust, and the inside absolutely shouldn't be mushy.
Please research this. Done right, sous vide is amazing. But it is almost never the only technique used. Just like when you slow roast a prime rib at 200f, you MUST sear to get Maillard reaction and a satisfying texture.
Seasonality in git commit frequency
> 13 000 BTU/hr
In sane units: 3.8 kW
You mean 1.083 tons of refrigeration
The choice of BTU/hr was firmly tongue in cheek for our American friends.
> In sane units: 3.8 kW
5.1 Horsepower
> > In sane units: 3.8 kW
> 5.1 Horsepower
0-60 in 1.8 seconds
Again, in sane units:
0-100 in 1.92 seconds
3.8850 poncelet
But ... can it run Crysis?
:D
How many football fields of power?
You’ll need (2) 240V 20A 2P breakers, one for the server and one for the 1-ton mini-split to remove the heat ;)
Matching AC would only need 1/4 the power, right? If you don't already have a method to remove heat.
Cooling BTUs already take the coefficient of performance of the vapor-compression cycle into account. 4w of heat removed for each 1w of input power is around the max COP for an air cooled condenser, but adding an evaporative cooling tower can raise that up to ~7.
I just looked at a spec sheet for a 230V single-phase 12k BTU mini-split and the minimum circuit ampacity was 3A for the air handler and 12A for the condenser, add those together for 15A, divide by .8 is 18.75A, next size up is 20A. Minimum circuit ampacity is a formula that is (roughly) the sum of the full load amps of the motor(s) inside the piece of equipment times 1.25 to determine the conductor size required to power the equipment.
So the condensing unit likely draws ~9.5-10A max and the air handler around ~2.4A, and both will have variable speed motors that would probably only need about half of that to remove 12k BTU of heat, so ~5-6A or thereabouts should do it, which is around 1/3rd of the 16A server, or a COP of 3.
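For anyone who wants to follow that arithmetic, here's the same calculation as a quick Python sketch (the 3A/12A figures are from the spec sheet mentioned above; the 1.25 and 0.8 factors are the simplified rules of thumb described in the comment, not a full NEC calculation):

    # Back-of-the-envelope mirror of the mini-split sizing above.
    # Numbers come from the quoted spec sheet; rules are simplified.

    def next_standard_breaker(amps, sizes=(15, 20, 25, 30, 40, 50)):
        """Smallest standard breaker size that is >= the required amps."""
        return next(s for s in sizes if s >= amps)

    mca_air_handler = 3.0    # A, from the spec sheet
    mca_condenser = 12.0     # A, from the spec sheet

    combined_mca = mca_air_handler + mca_condenser          # 15 A
    breaker = next_standard_breaker(combined_mca / 0.8)     # 18.75 A -> 20 A

    # Rough full-load amps, undoing the 1.25 factor baked into MCA.
    fla_condenser = mca_condenser / 1.25                     # ~9.6 A
    fla_air_handler = mca_air_handler / 1.25                 # ~2.4 A

    print(breaker, round(fla_condenser, 1), round(fla_air_handler, 1))
    # -> 20 9.6 2.4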
Well I don't know why that unit wants so many amps. The first 12k BTU window unit I looked at on amazon uses 12A at 115V.
Just air freight them from 60 degrees North to 60 degrees South and vice versa every 6 months.
Well, get a heat pump with a good COP of 3 or more, and you won't need quite as much power ;)
Are you talking about the guy in Temecula running two different auctions with some of the same photos (356878140643 and 357146508609, both showing a missing heat sink?) Interesting, but seems sketchy.
How useful is this Tesla-era hardware on current workloads? If you tried to run the full DeepSeek R1 model on it at (say) 4-bit quantization, any idea what kind of TTFT and TPS figures might be expected?
I can’t speak to the Tesla stuff but I run an Epyc 7713 with a single 3090 and creatively splitting the model between GPU/8 channels of DDR4 I can do about 9 tokens per second on a q4 quant.
Impressive. Is that a distillation, or the real thing?
Even if the AI bubble does not pop, your prediction about those servers being available on ebay in 10 years will likely be true, because some datacenters will simply upgrade their hardware and resell their old ones to third parties.
Would anybody buy the hardware though?
Sure, datacenters will get rid of the hardware - but only because it's no longer commercially profitable to run them, presumably because compute demands have eclipsed their abilities.
It's kind of like buying a used GeForce 980Ti in 2025. Would anyone buy them and run them besides out of nostalgia or curiosity? Just the power draw makes them uneconomical to run.
Much more likely every single H100 that exists today becomes e-waste in a few years. If you have need for H100-level compute you'd be able to buy it in the form of new hardware for way less money and consuming way less power.
For example if you actually wanted 980Ti-level compute in a desktop today you can just buy a RTX5050, which is ~50% faster, consumes half the power, and can be had for $250 brand new. Oh, and is well-supported by modern software stacks.
> Sure, datacenters will get rid of the hardware - but only because it's no longer commercially profitable to run them, presumably because compute demands have eclipsed their abilities.
I think the existence of a pretty large secondary market for enterprise servers and such kind of shows that this won't be the case.
Sure, if you're AWS and what you're selling _is_ raw compute, then couple generation old hardware may not be sufficiently profitable for you anymore... but there are a lot of other places that hardware could be applied to with different requirements or higher margins where it may still be.
Even if they're only running models a generation or two out of date, there are a lot of use cases today, with today's models, that will continue to work fine going forward.
And that's assuming it doesn't get replaced for some other reason that only applies when you're trying to sell compute at scale. A small uptick in the failure rate may make a big dent at OpenAI but not for a company that's only running 8 cards in a rack somewhere and has a few spares on hand. A small increase in energy efficiency might offset the capital outlay to upgrade at OpenAI, but not for the company that's only running 8 cards.
I think there's still plenty of room in the market in places where running inference "at cost" would be profitable that are largely untapped right now because we haven't had a bunch of this hardware hit the market at a lower cost yet.
Off topic, but I bought my (still in active use) 980ti literally 9 years ago for that price. I know, I know, inflation and stuff, but I really expected more than 50% bang for my buck after 9 whole years…
The 5050 doesn't support 32-bit PhysX. So a bunch of games would be missing a ton of stuff. You'd still need the 980 running with it for older PhysX games because nVidia.
I have around a thousand broadwell cores in 4 socket systems that I got for ~nothing from these sorts of sources... pretty useful. (I mean, I guess literally nothing since I extracted the storage backplanes and sold them for more than the systems cost me). I try to run tasks in low power costs hours on zen3/4 unless it's gonna take weeks just running on those, and if it will I crank up the rest of the cores.
And 40 P40 GPUs that cost very little, which are a bit slow but with 24gb per gpu they're pretty useful for memory bandwidth bound tasks (and not horribly noncompetitive in terms of watts per TB/s).
Given highly variable time of day power it's also pretty useful to just get 2x the computing power (at low cost) and just run it during the low power cost periods.
So I think datacenter scrap is pretty useful.
It's interesting to think about scenarios where that hardware would get used only part of the time, like say when the sun is shining and/or when dwelling heat is needed. The biggest sticking point would seem to be all of the capex for connecting them to do something useful. It's a shame that PLX switch chips are so expensive.
This seems likely. Blizzard even sold off old World of Warcraft servers. You can still get them on ebay
Except their insane electricity demands will still be the same, meaning nobody will buy them. You have plenty of SPARC servers on Ebay.
There is also a community of users known for not making sane financial decisions and keeping older technologies working in their basements.
But we are few, and fewer still who will go for high power consumption devices with esoteric cooling requirements that generate a lot of noise.
Someone's take on AI was that we're collectively investing billions in data centers that will be utterly worthless in 10 years.
Unlike the investments in railways or telephone cables or roads or any other sort of infrastructure, this investment has a very short lifespan.
Their point was that whatever your take on AI, the present investment in data centres is a ridiculous waste and will always end up as a huge net loss compared to most other investments our societies could spend it on.
Maybe we'll invent AGI and he'll be proven wrong as they'll pay back themselves many times over, but I suspect they'll ultimately be proved right and it'll all end up as land fill.
If a coal-powered electric plant is next to the data center, you might be able to get electricity cheap enough to keep it going.
Datacenters could go into the business of making personal PC's or workstations using the older NVIDIA cards and sell them.
The servers may well be worthless (or at least worth a lot less), but that's been pretty much true for a long time. Not many people want to run on 10 year old servers (although I pay $30/month for a dedicated server that's dual Xeon L5640 or something like that, which is about 15 years old).
The servers will be replaced, the networking equipment will be replaced. The building will still be useful, the fiber that was pulled to internet exchanges/etc will still be useful, the wiring to the electric utility will still be useful (although I've certainly heard stories of datacenters where much of the floor space is unusable, because power density of racks has increased and the power distribution is maxed out)
I have a server in my office from 2009 that's still far more economical to run than buying any sort of cloud compute. By at least an order of magnitude.
Perhaps if you only need to run some old PHP app.
What kind of disk and how much memory is in there?
If it is all a waste and a bubble, I wonder what the long term impact will be of the infrastructure upgrades around these dcs. A lot of new HV wires and substations are being built out. Cities are expanding around clusters of dcs. Are they setting themselves up for a new rust belt?
Or early provisioning for massively expanded electric transit and EV charging infrastructure, perhaps.
Maybe the dcs could be turned into some mean cloud gaming servers?
This isn’t my original take but if it results in more power buildout, especially restarting nuclear in the US, that’s an investment that would have staying power.
Sure, but what about the collective investment in smartphones, digital cameras, laptops, even cars. Not much modern technology is useful and practical after 10 years, let alone 20. AI is probably moving a little faster than normal, but technology depreciation is not limited to AI.
They probably are right, but a counter argument could be how people thought going to the moon was pointless and insanely expensive, but the technology to put stuff in space and have GPS and comms satellites probably paid that back 100x
Reality is that we don’t know how much of a trope this statement is.
I think we would get all this technology without going to the moon or Space Shuttle program. GPS, for example, was developed for military applications initially.
I don’t mean to invalidate your point (about genuine value arising from innovations originating from the Apollo program), but GPS and comms satellites (and heck, the Internet) are all products of nuclear weapons programs rather than civilian space exploration programs (ditto the Space Shuttle, and I could go on…).
Yes, and no. The people working on GPS paid very close attention to the papers from JPL researchers describing their timing and ranging techniques for both Apollo and deep-space probes. There was more cross-pollination than meets the eye.
It's not that going to the Moon was pointless, but stopping after we'd done little more than planted a flag was. Wernher von Braun was the head architect of the Apollo Program and the Moon was intended as little more than a stepping stone towards setting up a permanent colony on Mars. Incidentally this is also the technical and ideological foundation of what would become the Space Shuttle and ISS, which were both also supposed to be little more than small scale tools on this mission, as opposed to ends in and of themselves.
Imagine if Columbus verified that the New World existed, planted a flag, came back - and then everything was cancelled. Or similarly for literally any colonization effort ever. That was the one downside of the space race - what we did was completely nonsensical, and made sense only because of the context of it being a 'race' and politicians having no greater vision than beyond the tip of their nose.
Utterly? Moore's law per power requirement is dead; lower-power units can run electric heating for small towns!
My personal sneaking suspicion is that publicly offered models are using way less compute than thought. In modern mixture of experts models, you can do top-k sampling, where only some experts are evaluated, meaning even SOTA models aren't using much more compute than a 70-80b non-MoE model.
To piggyback on this, at enterprise level in modern age, the question is really not about "how are we going to serve all these users", it comes down to the fact that investors believe that eventually they will see a return on investment, and then pay whatever is needed to get the infra.
Even if you didn't have optimizations involved in terms of job scheduling, they would just build as many warehouses as necessary filled with as many racks as necessary to serve the required user base.
As a non-American the 240V thing made me laugh.
What I wonder is what this means for Coreweave, Lambda and the rest, who are essentially just renting out fleets of racks like this. Does it ultimately result in acquisition by a larger player? Severe loss of demand? Can they even sell enough to cover the capex costs?
It means they're likely going to be left holding a very expensive bag.
These are also depreciating assets.
I wonder if it's feasible to hook up NAND flash with a high bandwidth link necessary for inference.
Each of these NAND chips has hundreds of dies of flash stacked inside, and they are hooked up to the same data line, so only one of them can talk at a time, and they still achieve >1GB/s of bandwidth. If you could hook them up in parallel, you could have 100s of GBs of bandwidth per chip.
NAND is very, very slow relative to RAM, so you'd pay a huge performance penalty there. But maybe more importantly my impression is that memory contents mutate pretty heavily during inference (you're not just storing the fixed weights), so I'd be pretty concerned about NAND wear. Mutating a single bit on a NAND chip a million times over just results in a large pile of dead NAND chips.
No, it's not slow - a single NAND chip in SSDs offers >1GB/s of bandwidth - inside the chip there are 100+ dies actually holding the data, but in SSDs only one of them is active when reading/writing.
You could probably make special NAND chips where all of them can be active at the same time, which means you could get 100GB+ bandwidth out of a single chip.
This would be useless for data storage scenarios, but very useful when you have huge amounts of static data you need to read quickly.
The memory bandwidth on an H100 is 3TB/s, for reference. This number is the limiting factor in the size of modern LLMs. 100GB/s isn't even in the realm of viability.
That bandwidth is for the whole GPU, which has 6 memory chips. But anyways, what I'm proposing isn't for the high-end and training, but for making inference cheap.
And I was somewhat conservative with the numbers: a modern budget SSD with a single NAND chip can do more than 5GB/s read speed.
Four H100 in a 2U rack didn't sound impressive, but that is accurate:
>A typical 1U or 2U server can accommodate 2-4 H100 PCIe GPUs, depending on the chassis design.
>In a 42U rack with 20x 2U servers (allowing space for switches and PDU), you could fit approximately 40-80 H100 PCIe GPUs.
Why stop at 80 H100s for a mere 6.4 terabytes of GPU memory?
Supermicro will sell you a full rack loaded with servers [1] providing 13.4 TB of GPU memory.
And with 132kW of power output, you can heat an olympic-sized swimming pool by 1°C every day with that rack alone. That's almost as much power consumption as 10 mid-sized cars cruising at 50 mph.
[1] https://www.supermicro.com/en/products/system/gpu/48u/srs-gb...
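If you want to sanity-check that comparison, a quick back-of-the-envelope in Python (assuming a 50 m x 25 m x 2 m Olympic pool and that all 132 kW ends up in the water):

    # Sanity check on the "1 degree C per day for an Olympic pool" claim.
    rack_power_w = 132_000
    pool_volume_m3 = 2_500
    water_density = 1_000          # kg/m^3
    specific_heat = 4_186          # J/(kg*K)

    energy_per_day_j = rack_power_w * 86_400
    delta_t = energy_per_day_j / (pool_volume_m3 * water_density * specific_heat)
    print(round(delta_t, 2), "deg C per day")   # ~1.09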
> as much power consumption as 10 mid-sized cars cruising at 50 mph
Imperial units are so weird
What about https://www.cerebras.ai/system?
And the big hyperscaler cloud providers are building city-block sized data centers stuffed to the gills with these racks as far as the eye can see
> When the AI bubble pops is when you're likely to be able to realistically run good local models.
After years of “AI is a bubble, and will pop when everyone realizes they’re useless plagiarism parrots” it’s nice to move to the “AI is a bubble, and will pop when it becomes completely open and democratized” phase
It's not even been 3 years. Give it time. The entire boom and bust of the dot-com bubble took 7 years.
They'll be in landfill in 10 years.
Yeah I think the crux of the issue is that chatgpt is serving a huge number of users including paid users and is still operating at a massive operating loss. They are spending truckloads of money on GPUs and selling access at a loss.
This isn’t like how Google was able to buy up dark fiber cheaply and use it.
From what I understand, this hardware has a high failure rate over the long term especially because of the heat they generate.
You have thousands of dollars, they have tens of billions. $1,000 vs $10,000,000,000. They have 7 more zeros than you, which is one less zero than the scale difference in users: 1 user (you) vs 700,000,000 users (openai). They managed to squeak out at least one or two zeros worth of efficiency at scale vs what you're doing.
Also, you CAN run local models that are as good as GPT 4 was on launch on a macbook with 24 gigs of ram.
https://artificialanalysis.ai/?models=gpt-oss-20b%2Cgemma-3-...
You can knock off a zero or two just by time-shifting the 700 million distinct users across a day/week and accounting for the mere minutes of compute time they will actually use in each interaction. So they might not see peaks higher than 10 million active inference sessions at the same time.
Conversely, you can't do the same thing as a self hosted user, you can't really bank your idle compute for a week and consume it all in a single serving, hence the much more expensive local hardware to reach the peak generation rate you need.
During times of high utilization, how do they handle more requests than they have hardware? Is the software granular enough that they can round robin the hardware per token generated? UserA token, then UserB, then UserC, back to UserA? Or is it more likely that everyone goes into a big FIFO processing the entire request before switching to the next user?
I assume the former has massive overhead, but maybe it is worthwhile to keep responsiveness up for everyone.
Inference is essentially a very complex matrix algorithm run repeatedly on itself, each time the input matrix (context window) is shifted and the new generated tokens appended to the end. So, it's easy to multiplex all active sessions over limited hardware, a typical server can hold hundreds of thousands of active contexts in the main system ram, each less than 500KB and ferry them to the GPU nearly instantaneously as required.
I was under the impression that context takes up a lot more VRAM than this.
The context after application of the algorithm is just text, something like 256k input tokens, each token representing a group of roughly 2-5 characters, encoded into 18-20 bits.
The active context during inference, inside the GPUs, explodes each token into a 12,288-dimensional vector, so 4 orders of magnitude more VRAM, and is combined with the model weights, gigabytes in size, across multiple parallel attention heads. The final result is just more textual tokens, which you can easily ferry around main system RAM and send to the remote user.
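To put rough numbers on how much bigger the in-GPU state is than the text, here's a back-of-the-envelope KV-cache estimate for a hypothetical GPT-3-sized dense model (all figures assumed for illustration; production models shrink this a lot with GQA/MLA and quantized caches):

    # Rough KV-cache size per request for an assumed 96-layer, 12288-wide
    # dense model with fp16 K and V stored per layer per token.
    n_layers = 96
    d_model = 12_288
    bytes_per_value = 2            # fp16
    tokens_in_context = 8_192

    kv_per_token = n_layers * 2 * d_model * bytes_per_value   # K and V
    total = kv_per_token * tokens_in_context
    print(kv_per_token / 1e6, "MB per token")    # ~4.7 MB
    print(total / 1e9, "GB for an 8k context")   # ~38.6 GB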
This is great product design at its finest.
First of all, they never “handle more requests than they have hardware.” That’s impossible (at least as I’m reading it).
The vast majority of usage is via their web app (and free accounts, at that). The web app defaults to “auto” selecting a model. The algorithm for that selection is hidden information.
As load peaks, they can divert requests to different levels of hardware and less resource hungry models.
Only a very small minority of requests actually specify the model to use.
There are a hundred similar product design hacks they can use to mitigate load. But this seems like the easiest one to implement.
> But this seems like the easiest one to implement.
Even easier: Just fail. In my experience the ChatGPT web page fails to display (request? generate?) a response between 5% and 10% of the time, depending on time of day. Too busy? Just ignore your customers. They’ll probably come back and try again, and if not, well, you’re billing them monthly regardless.
Is this a common experience for others? In several years of reasonable ChatGPT use I have only experienced that kind of failure a couple of times.
I don't usually see responses fail. But what I did see shortly after the GPT-5 release (when servers were likely overloaded) was the model "thinking" for over 8 minutes. It seems like (if you manually select the model) you're simply getting throttled (or put in a queue).
During peaks they can kick out background jobs like model training or API users doing batch jobs.
One clever ingredient in OpenAI's secret sauce is billions of dollars of losses. About $5 billion dollars lost in 2024. https://www.cnbc.com/2024/09/27/openai-sees-5-billion-loss-t...
That's all different now with agentic workflows, which were not really a big thing until the end of 2024. Before, they were doing 1 request; now they're doing hundreds for a given task. The reason OAI/Azure win over locally run models is the parallelization you can do with a thinking agent: simultaneous processing of multiple steps.
With infinite resources, you can serve infinite users. Until it's gone.
You hit the nail on the head. Just gotta add the up to $10 billion investment from Microsoft to cover pretraining, R&D, and inference. Then, they still lost billions.
One can serve a lot of models if allowed to burn through over a billion dollars with no profit requirement. Classic VC-style, growth-focused capitalism with an unusual business structure.
they would be break-even if all they did was serve existing models and got rid of everything related to R&D
Have they considered replacing their engineers with AI?
Would you have any numbers to back it up ?
they are not the only player so getting rid of R&D would be suicide
An AI lab with no R&D. Truly a hacker news moment
I think you maybe have misunderstood the parent (or maybe I did?). They're saying you can't compare an individual's cost to run a model against OpenAI's cost to run it + R&D. Individuals aren't paying for R&D, and that's where most of the cost is.
The unspoken context there is that the inference isn't the thing causing the losses.
Inference contributes to their losses. In January 2025, Altman admitted they are losing money on Pro subscriptions, because people are using it more than they expected (sending more inference requests per month than would be offset by the monthly revenue).
https://xcancel.com/sama/status/1876104315296968813
So people find more value than they thought so they'll just up the price. Meanwhile, they still make more money per inference than they lose.
Currently, they lose more money per inference than they make for Pro subscriptions, because they are essentially renting out their service each month instead of charging for usage (per token).
Do you have a source for that?
This assumes that the value obtained by customers is high enough to cover any possible actual cost.
Many current AI uses are low value things or one time things (for example CV generation, which is killing online hiring).
True.
At the end of the day, until at least one of the big providers gives us balance sheet numbers, we don't know where they stand. My current bet is that they're losing money whichever way you dice it.
The hope being as usual that costs go down and the market share gained makes up for it. At which point I wouldn't be shocked by pro licenses running into the several hundred bucks per month.
A single node with GPUs has a lot of FLOPs and very high memory bandwidth. When only processing a few requests at a time, the GPUs are mostly waiting on the model weights to stream from the GPU ram to the processing units. When batching requests together, they can stream a group of weights and score many requests in parallel with that group of weights. That allows them to have great efficiency.
Some of the other main tricks - compress the model to 8 bit floating point formats or even lower. This reduces the amount of data that has to stream to the compute unit, also newer GPUs can do math in 8-bit or 4-bit floating point. Mixture of expert models are another trick where for a given token, a router in the model decides which subset of the parameters are used so not all weights have to be streamed. Another one is speculative decoding, which uses a smaller model to generate many possible tokens in the future and, in parallel, checks whether some of those matched what the full model would have produced.
Add all of these up and you get efficiency! Source - was director of the inference team at Databricks
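To put rough numbers on the quantization point, here's a quick sketch (the model size is illustrative, and this ignores KV cache and activations entirely):

    # How much HBM the weights alone need at different precisions, and how
    # many 80 GB GPUs that implies just to hold them. Model size is assumed.
    import math

    params = 671e9               # e.g. a DeepSeek-R1-sized MoE (illustrative)
    gpu_hbm_bytes = 80e9

    for name, bytes_per_param in [("fp16", 2), ("fp8", 1), ("int4", 0.5)]:
        weight_bytes = params * bytes_per_param
        gpus = math.ceil(weight_bytes / gpu_hbm_bytes)
        print(f"{name}: {weight_bytes/1e12:.2f} TB of weights, >= {gpus} x 80GB GPUs")
    # fp16: ~1.34 TB (17 GPUs), fp8: ~0.67 TB (9 GPUs), int4: ~0.34 TB (5 GPUs)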
I think the most direct answer is that at scale, inference can be batched, so that processing many queries together in a parallel batch is more efficient than interactively dedicating a single GPU per user (like your home setup).
If you want a survey of intermediate level engineering tricks, this post we wrote on the Fin AI blog might be interesting. (There's probably a level of proprietary techniques OpenAI etc have again beyond these): https://fin.ai/research/think-fast-reasoning-at-3ms-a-token/
This is the real answer; I don't know what people above are even discussing when batching is the biggest reduction in costs. If it costs, say, $50k to serve one request, with batching it also costs $50k to serve 100 at the same time with minimal performance loss. I don't know what the real number of users is before you need to buy new hardware, but I know it's in the hundreds, so going from $50,000 to $500 in effective cost per request is a pretty big deal (assuming you have the users to saturate the hardware).
My simple explanation of how batching works: Since the bottleneck of processing LLMs is in loading the weights of the model onto the GPU to do the computing, what you can do is instead of computing each request separately, you can compute multiple at the same time, ergo batching.
Let's make a visual example, let's say you have a model with 3 sets of weights that can fit inside the GPU's cache (A, B, C) and you need to serve 2 requests (1, 2). A naive approach would be to serve them one at a time.
(Legend: LA = Load weight set A, CA1 = Compute weight set A for request 1)
LA->CA1->LB->CB1->LC->CC1->LA->CA2->LB->CB2->LC->CC2
But you could instead batch the compute parts together.
LA->CA1->CA2->LB->CB1->CB2->LC->CC1->CC2
Now if you consider that the loading is hundreds if not thousands of times slower than computing the same data, then you'll see the big difference. Here's a "chart" visualizing the difference of the two approaches if loading were just 10 times slower. (Consider 1 letter a unit of time.)
Time spent using approach 1 (1 request at a time):
LLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLC
Time spend using approach 2 (batching):
LLLLLLLLLLCCLLLLLLLLLLCCLLLLLLLLLLCC
The difference is even more dramatic in the real world because, as I said, loading is many times slower than computing; you'd have to serve many users before you see a serious difference in speeds. I believe in the real world the restriction is actually that serving more users requires more memory to store the activation state of the weights, so you'll end up running out of memory and have to balance how many people per GPU cluster you want to serve at the same time.
TL;DR: It's pretty expensive to get enough hardware to serve an LLM, but once you do have you can serve hundreds of users at the same time with minimal performance loss.
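If it helps, here's the same idea as a minimal NumPy sketch; on a CPU the gap is modest because the weights partly stay in cache, but it shows the shape of the trick (sizes are arbitrary):

    # One matrix-matrix multiply serves many requests while streaming the
    # weights once, versus streaming them once per request with matrix-vector
    # multiplies.
    import numpy as np
    import time

    d = 4096
    W = np.random.randn(d, d).astype(np.float32)     # one "layer" of weights
    requests = [np.random.randn(d).astype(np.float32) for _ in range(64)]

    t0 = time.perf_counter()
    out_sequential = [W @ x for x in requests]        # weights re-read 64 times
    t1 = time.perf_counter()

    X = np.stack(requests, axis=1)                    # (d, 64) batch
    out_batched = W @ X                               # weights read once
    t2 = time.perf_counter()

    print(f"sequential: {t1 - t0:.4f}s, batched: {t2 - t1:.4f}s")
    # On a GPU that is bottlenecked on streaming weights from VRAM, the
    # batched version is dramatically cheaper per request.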
Thanks for the helpful reply! As I wasn't able to fully understand it still, I pasted your reply into ChatGPT and asked it some follow-up questions, and here is what I understand from that interaction:
- Big models like GPT-4 are split across many GPUs (sharding).
- Each GPU holds some layers in VRAM.
- To process a request, weights for a layer must be loaded from VRAM into the GPU's tiny on-chip cache before doing the math.
- Loading into cache is slow, the ops are fast though.
- Without batching: load layer > compute user1 > load again > compute user2.
- With batching: load layer once > compute for all users > send to gpu 2 etc
- This makes cost per user drop massively if you have enough simultaneous users.
- But bigger batches need more GPU memory for activations, so there's a max size.
This does make sense to me, but does this sound accurate to you?
Would love to know if I'm still missing something important.
This seems a bit complicated to me. They don't serve very many models. My assumption is they just dedicate GPUs to specific models, so the model is always in VRAM. No loading per request - it takes a while to load a model in anyway.
The limiting factor compared to local is dedicated VRAM - if you dedicate 80GB of VRAM locally 24 hours/day so response times are fast, you're wasting most of the time when you're not querying.
Loading here refers to loading from VRAM into the GPU cores' cache; loading from VRAM is so slow in terms of GPU time that the cores end up idle most of the time, just waiting for more data to come in.
Yeah chatgpt pretty much nailed it.
700M weekly users doesn't say much about how much load they have.
I think the thing to remember is that the majority of chatGPT users, even those who use it every day, are idle 99.9% of the time. Even someone who has it actively processing for an hour a day, seven days a week, is idle 96% of the time. On top of that, many are using less-intensive models. The fact that they chose to mention weekly users implies that there is a significant tail of their user distribution who don't even use it once a day.
So your question factors into a few of easier-but-still-not-trivial problems:
- Making individual hosts that can fit their models in memory and run them at acceptable toks/sec.
- Making enough of them to handle the combined demand, as measured in peak aggregate toks/sec.
- Multiplexing all the requests onto the hosts efficiently.
Of course there are nuances, but honestly, from a high level the last problem does not seem so different from running a search engine. All the state is in the chat transcript, so I don't think there's any particular reason that successive interactions on the same chat need be handled by the same server. They could just be load-balanced to whatever server is free.
We don't know, for example, when the chat says "Thinking..." whether the model is running or if it's just queued waiting for a free server.
The short answer is "batch size". These days, LLMs are what we call "Mixture of Experts", meaning they only activate a small subset of their weights at a time. This makes them a lot more efficient to run at high batch size.
If you try to run GPT4 at home, you'll still need enough VRAM to load the entire model, which means you'll need several H100s (each one costs like $40k). But you will be under-utilizing those cards by a huge amount for personal use.
It's a bit like saying "How come Apple can make iphones for billions of people but I can't even build a single one in my garage"
I'm actually not sure I understand how MoE helps here. If you can route a single request to a specific subnetwork then yes, it saves compute for that request. But if you have a batch of 100 requests, unless they are all routed exactly the same, which feels unlikely, aren't you actually increasing the number of weights that need to be processed? (at least with respect to an individual request in the batch).
Essentially, inference is well-amortized across the many users.
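To make "amortized" concrete, here's a toy sketch with made-up numbers: tokens from many requests get regrouped by the expert they were routed to, so each expert's weights are streamed from memory once per batch rather than once per token:

    import random
    from collections import defaultdict

    n_experts = 64
    top_k = 2
    batch_tokens = 256          # tokens from many different requests

    # Pretend router output: each token picks top-k expert ids.
    assignments = [(t, random.sample(range(n_experts), top_k)) for t in range(batch_tokens)]

    per_expert = defaultdict(list)
    for token_id, experts in assignments:
        for e in experts:
            per_expert[e].append(token_id)

    experts_touched = len(per_expert)
    tokens_per_expert = sum(len(v) for v in per_expert.values()) / experts_touched
    print(f"{experts_touched} experts loaded once each, ~{tokens_per_expert:.1f} tokens per expert")
    # Per token you still only *compute* top-k experts, but the memory cost of
    # loading an expert is shared by every token in the batch that chose it.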
I wonder then if it's possible to load the rarely used parts into main memory, while keeping the more heavily used parts in VRAM.
Great metaphor
I'm sure there are countless tricks, but one that can implemented at home, and I know plays a major part in Cerebras' performance is: speculative decoding.
Speculative decoding uses a smaller draft model to generate tokens with much less compute and memory required. Then the main model will accept those tokens based on the probability it would have generated them. In practice this can easily result in a 3x speedup in inference.
Another trick for structured outputs that I know of is "fast forwarding", where you can skip tokens if you know they are going to be the only acceptable outputs. For example, you know that when generating JSON you need to start with `{ "<first key>": ` etc. This can also lead to a ~3x speedup when responding in JSON.
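A toy sketch of the fast-forwarding idea (everything here is made up for illustration; real implementations use a grammar or JSON-schema-constrained decoder):

    # When the output schema forces exactly one possible next chunk (braces,
    # quotes, known keys), emit it directly and only call the model where a
    # real choice exists (the values).
    import random

    def fake_model_step(prefix):
        """Stand-in for a real LLM forward pass; returns one 'generated' value."""
        return random.choice(['"Alice"', '"Bob"', "42", "true"])

    def generate_json(keys):
        out, model_calls = "{", 0
        for i, key in enumerate(keys):
            out += f'"{key}": '          # forced by the schema: no model call
            out += fake_model_step(out)  # only the value needs the model
            model_calls += 1
            out += ", " if i < len(keys) - 1 else "}"
        return out, model_calls

    result, calls = generate_json(["name", "age", "active"])
    print(result)        # e.g. {"name": "Alice", "age": 42, "active": true}
    print(calls, "model calls instead of one per token")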
gpt-oss-120b can be used with gpt-oss-20b as speculative drafting on LM Studio
I'm not sure it improved the speed much
To measure the performance gains on a local machine (or even a standard cloud GPU setup), since you can't run this in parallel with the same efficiency you could in a high-end data center, you need to compare the number of calls made to each model.
In my experiences I'd seen the calls to the target model reduced to a third of what they would have been without using a draft model.
You'll still get some gains on a local model, but they won't be near what they could be theoretically if everything is properly tuned for performance.
It also depends on the type of task. I was working with pretty structured data with lots of easy to predict tokens.
a 6:1 parameter ratio is too small for specdec to have that much of an effect. You'd really want to see 10:1 or even more for this to start to matter
You're right on ratios, but actually the ratio is much worse than 6:1 since they are MoEs. The 20B has 3.6B active, and the 120B has only 5.1B active, only about 40% more!
At the heart of inference is matrix-vector multiplication. If you have many of these operations to do and only the vector part differs (which is the case when you have multiple queries), you can do matrix-matrix multiplication by stuffing the vectors into a matrix. Computing hardware is able to run the equivalent of dozens of matrix-vector multiplication operations in the same time it takes to do 1 matrix-matrix multiplication operation. This is called batching. That is the main trick.
A second trick is to implement something called speculative decoding. Inference has two phases. One is prompt processing and another is token generation. They actually work the same way using what is called a forward pass, except prompt processing can do them in parallel by switching from matrix-vector to matrix-matrix multiplication and dumping the prompt's tokens into each forward pass in parallel. Each forward pass will create a new token, but it can be discarded unless it is from the last forward pass, as that will be the first new token generated as part of token generation. Now, you put that token into the next forward pass to get the token after it, and so on. It would be nice if all of the forward passes could be done in parallel, but you do not know the future, so ordinarily you cannot. However, if you make a draft model that runs in a fraction of the time and guesses the next token correctly most of the time, you can sequentially run the forward pass for that instead, N times. Now, you can take those N tokens and put them into the prompt processing routine that does N forward passes in parallel. Instead of discarding all tokens except the last one like in prompt processing, we compare them to the input tokens. All tokens that come out of the parallel forward passes, up to and including the first one that differs from the draft, are valid output tokens of the main model. This is guaranteed to always produce at least 1 valid token, since in the worst case the first token does not match, but the output at that position is equal to what the forward pass would have produced without speculative decoding. You can get a 2x to 4x performance increase from this if done right.
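Here's a toy sketch of that acceptance step for the greedy case (names and data made up):

    # The draft proposes N tokens, the main model scores all N positions in one
    # parallel pass, and we keep the matching prefix plus the main model's own
    # token at the first mismatch.

    def verify_draft(main_model_tokens, draft_tokens):
        """main_model_tokens[i] is what the main model would emit at position i,
        given the prompt plus draft_tokens[:i]; both lists have length N."""
        accepted = []
        for main_tok, draft_tok in zip(main_model_tokens, draft_tokens):
            accepted.append(main_tok)          # main model's token is always valid
            if main_tok != draft_tok:
                break                          # everything after this is discarded
        return accepted

    # Example: draft guessed 5 tokens, main model agrees on the first 3.
    draft = ["The", "cat", "sat", "on", "a"]
    main = ["The", "cat", "sat", "in", "the"]
    print(verify_draft(main, draft))   # ['The', 'cat', 'sat', 'in']: 4 tokens from 1 parallel pass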
Now, I do not work on any of this professionally, but I am willing to guess that beyond these techniques, they have groups of machines handling queries of similar length in parallel (since doing a batch where 1 query is much longer than the others is inefficient) and some sort of dynamic load balancing so that machines do not get stuck with a query size that is not actively being utilized.
AFAIK the main trick is batching: the GPU can do the same work on a batch of data, so you can work on many requests at the same time more efficiently.
Batching requests increases latency to first token, so it's a tradeoff, and MoE makes it trickier because the experts are not equally used.
There was a great article somewhere explaining DeepSeek's efficiency in great detail (basically the latency-throughput tradeoff).
Your model keeps the weights on slow memory and needs to touch all of them to make 1 token for you. By batching you make 64 tokens for 64 users in one go. And they use dozens of GPUs in parallel to make 1024 tokens in the time your system makes 1 token. So even though the big system costs more, it is much more efficient when being used by many users in parallel. Also, by using many fast GPUs in series to process parts of the neural net, it produces output much faster for each user compared to your local system. You can't beat that.
The big players use parallel processing of multiple users to keep the GPUs and memory filled as much as possible during the inference they are providing to users. They can make use of the fact that they have a fairly steady stream of requests coming into their data centers at all times. This article describes some of how this is accomplished.
https://www.infracloud.io/blogs/inference-parallelism/
I'm pretty much an AI layperson but my basic understanding of how LLMs usually run on my or your box is:
1. You load all the weights of the model into GPU VRAM, plus the context.
2. You construct a data structure called the "KV cache" representing the context, and it hopefully stays in the GPU cache.
3. For each token in the response, for each layer of the model, you read the weights of that layer out of VRAM and use them plus the KV cache to compute the inputs to the next layer. After all the layers you output a new token and update the KV cache with it.
Furthermore, my understanding is that the bottleneck of this process is usually in step 3 where you read the weights of the layer from VRAM.
As a result, this process is very parallelizable if you have lots of different people doing independent queries at the same time, because you can have all their contexts in cache at once, and then process them through each layer at the same time, reading the weights from VRAM only once.
So once you got the VRAM it's much more efficient for you to serve lots of people's different queries than for you to be one guy doing one query at a time.
It is not just engineering. There are also huge, truly huge, investments in infrastructure.
As already answered, AI companies use extremely expensive setups (servers with professional cards) in large numbers, all concentrated in big datacenters with powerful networking and huge power consumption.
For perspective: the last time investments this large (~1.2% of GDP, and it's unknown whether they will keep growing) were made, it was in telecom infrastructure: mostly wired telephones, but also cable TV, with the Internet, cellular communications, and clouds added later (in some countries wired phones never covered the whole country, so they jumped directly to wireless).
Even larger investments went into railroads, at ~6% of GDP (and I'm not sure, but some people say AI will surpass that, as the share of tasks suitable for AI keeps growing).
So to conclude, right now the AI boom looks like the main consumer of telecom (Internet) and cloud infrastructure. If you've seen old mainframes in datacenters, extremely thick core network cables (with hundreds of wires or fibers in a single cable), and huge satellite dishes, you can imagine what I'm talking about.
And yes, I'm not sure whether this boom will end like the dot-coms (Y2K) or whether such huge resource usage will be sustained. Why it is not obvious: for telecoms (Internet) it was also unknown whether people would use phones and other p2p communications for leisure as they do now, or leave phones just for work. Even more striking, if AI agents become ordinary, a possible scenario is that the number of AI agents will surpass the number of people.
How is the routing to the hardware available? Let's say that a request hit the datacenter, how is it routed to an available GPU in a rack?
A few people have mentioned looking a the vLLM docs and blog (recommended!). I'd also recommend SGLang's docs and blog as well.
If you're interested in a bit of a deeper dive, I can highly recommend reading some of what DeepSeek has published: https://arxiv.org/abs/2505.09343 (and actually quite a few of their Technical Reports and papers).
I'd also say that while the original GPT-4 was a huge model when it was released (rumored 1.7T-A220B), these days you can get (original release) "GPT-4-class" performance at ~30B dense/100B sparse MoE, and almost all the leading MoEs have between 12-37B activations no matter how big they get (Kimi K2, with 1T param weights, has only 32B activations). If you do basic quants (FP8/INT8) you can easily push 100+ tok/s on pretty bog standard data center GPUs/nodes. You can quant even lower for even better speeds (tg is just MBW) without much quality loss (although with open-source kernels, usually without much overall throughput or latency improvement).
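A rough check of the "tg is just MBW" point, with assumed hardware numbers (one 8-GPU node at roughly H100-class HBM bandwidth):

    # Decode speed is bounded by memory bandwidth divided by the bytes of
    # active weights touched per token. All numbers are assumptions.
    active_params = 32e9          # ~32B activations (Kimi K2-scale MoE, per above)
    node_bandwidth = 8 * 3.35e12  # bytes/s: 8 GPUs at roughly H100-SXM HBM bandwidth

    for precision, bytes_per_param in [("fp16", 2), ("fp8", 1)]:
        ceiling = node_bandwidth / (active_params * bytes_per_param)
        print(precision, round(ceiling), "tok/s ideal ceiling (single stream)")
    # Real kernels land well below the ceiling, but it shows why FP8/INT8
    # comfortably clears 100+ tok/s on a standard node.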
A few people have mentioned speculative decoding, if you want to learn more, I'd recommend taking a look at the papers for one of the (IMO) best open techniques, EAGLE: https://github.com/SafeAILab/EAGLE
The other thing that is often ignored, especially for multiturn that I haven't seen mentioned yet is better caching, specifically prefix caching (radix-tree, block-level hash) or tiered/offloaded kvcaches (LMCache as one example). If you search for those keywords, you'll find lots there as well.
Inference runs like a stateless web server. If you have 50K or 100K machines, each with a tons of GPUs (usually 8 GPUs per node), then you end up with a massive GPU infrastructure that can run hundreds of thousands, if not millions, of inference instances. They use something like Kubernetes on top for scheduling, scaling and spinning up instances as needed.
For storage, they also have massive amounts of hard disks and SSDs behind planet-scale object file systems (like AWS's S3, Tectonic at Meta, or MinIO on-prem), all connected by massive amounts of switches and routers of varying capacity.
So in the end, it's just the good old Cloud, but also with GPUs.
Btw, OpenAI's infrastructure is provided and managed by Microsoft Azure.
And, yes, all of this requires billions of dollars to build and operate.
First off I’d say you can run models locally at good speed, llama3.1:8b runs fine a MacBook Air M2 with 16GB RAM and much better on a Nvidia RTX3050 which are fairly affordable.
For OpenAI, I’d assume that a GPU is dedicated to your task from the point you press enter to the point it finishes writing. I would think most of the 700 million barely use ChatGPT and a small proportion use it a lot and likely would need to pay due to the limits. Most of the time you have the website/app open I’d think you are either reading what it has written, writing something or it’s just open in the background, so ChatGPT isn’t doing anything in that time. If we assume 20 queries a week taking 25 seconds each. That’s 8.33 minutes a week. That would mean a single GPU could serve up to 1209 users, meaning for 700 million users you’d need at least 578,703 GPUs. Sam Altman has said OpenAI is due to have over a million GPUs by the end of year.
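Reproducing that back-of-the-envelope in a few lines (every input is the assumption above, not a real OpenAI number):

    queries_per_week = 20
    seconds_per_query = 25
    busy_seconds = queries_per_week * seconds_per_query        # 500 s/week per user
    seconds_per_week = 7 * 24 * 3600                            # 604,800 s

    users_per_gpu = seconds_per_week / busy_seconds             # ~1,209
    gpus_needed = 700_000_000 / users_per_gpu                   # ~578,700
    print(int(users_per_gpu), round(gpus_needed))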
I’ve found that the inference speed on newer GPUs is barely faster than older ones (perhaps it’s memory speed limited?). They could be using older clusters of V100, A100 or even H100 GPUs for inference if they can get the model to fit or multiple GPUs if it doesn’t fit. A100s were available in 40GB and 80GB versions.
I would think they use a queuing system to allocate your message to a GPU. Slurm is widely used in HPC compute clusters, so might use that, though likely they have rolled their own system for inference.
The idea that a GPU is dedicated to a single inference task is just generally incorrect. Inputs are batched, and it’s not a single GPU handling a single request, it’s a handful of GPUs in various parallelism schemes processing a batch of requests at once. There’s a latency vs throughput trade off that operators make. The larger that batch size the greater the latency, but it improves overall cluster throughput.
> Sure, they have huge GPU clusters
That's a really, really big "sure."
Almost every trick to run a LLM at OpenAI's scale is a trade secret and may not be easily understood by mere mortals anyways (e.g. bare-metal CUDA optimizations)
Trade secrets also exist to hide faults and blemishes.
Trade secret?
With all the staff poaching the trade secrets may have now leaked?
That's half the reason tech companies poach.
It's the entire reason.
It's also the reason John Carmack got sued by zenimax when he went to oculus.
Multi-tenancy likely explains the bulk of it. $10k vs. $10b gives them six orders of magnitude more GPU resources, but they have 9 orders of magnitude more users. The average user is probably only running an active ChatGPT query for a few minutes per day, which covers the remaining 3 orders of magnitude.
One trick is that multiple prompts are combined into a batch that is subject to the token prediction simultaneously.
I've had Google Gemini start talking to me in the context of someone else's chat, unrelated to mine.
ME: Does POSIX have any API for creating a filesystem object (such as a file) with a predetermined inode number supplied by the caller?
AI: <reformats my question into LaTeX syntax, out of the blue>
ME: Why are you spewing LaTeX syntax at me?
AI: You are absolutely right! My apologies. That was a mistake on my part. I got carried away with the instruction to use LaTeX for mathematical and scientific notations and incorrectly applied it to the entire response. [... SNIP]
There was no such instruction. I've never chatted with any AI about LaTeX. it leaked from the tokens of someone else's chat.
> There was no such instruction. I've never chatted with any AI about LaTeX. it leaked from the tokens of someone else's chat.
Nope. That's not how it works. Attention doesn't work across multiple independent prompts queued in the same batch. It's not physically possible for the tokens of another chat to leak.
What most likely happened is that the model glitched out to the instructions in its (hidden) system prompt, which most likely does include instructions about using LaTeX for mathematical and scientific notation.
Maybe not due to attention, but it is certainly possible for chat content to get leaked into other conversations due to bugs in the stack, and in fact it has happened before.
https://openai.com/index/march-20-chatgpt-outage/
"We took ChatGPT offline earlier this week due to a bug in an open-source library which allowed some users to see titles from another active user’s chat history. It’s also possible that the first message of a newly-created conversation was visible in someone else’s chat history if both users were active around the same time."
You are probably right about this particular LaTeX issue though.
They can have a very even load if they use their nodes for training when the customer use is low, so that massively helps. If they have 3x as much hardware as they need to serve peak demand (even with throttling) this will cost a lot, unless they have a another use for lots of GPU.
Just illustrative guesses, not real numbers, I underestimate overheads here but anyway ...
Let's assume a $20k expert node can produce 500 tokens per second (about 15,000 megatokens per year). $5k a year for the machine, $5k overheads. 5 experts per token (so $50k to produce 15,000 megatokens at 100% throughput). Say they charge up to $10 per million tokens ... yeah it's tight but I can see how it's doable.
Say they cost $100 per user per year. If it's $10 per million tokens (depends on the model) then they are budgeting 10 million tokens per user. That's like 100 books per year. The answer is that users probably don't use as much as the api would cost.
The real question is, how does it cost $10 per megatoken?
500 tokens per second per node is about 15,000 megatokens per year. So a 500 token/s node can bring in $150,000 per year.
Call it 5 live experts and a router. That's maybe $20k per expert. If it's a kilowatt power supply per expert, at $0.10 per kWh that's about $1,000 a year for power. The hardware is good for 4 years, so $5k a year for that. Toss in overheads, and it's maybe $10k in costs per expert per year.
So at full capacity they can make $5 off $10 revenue. With uneven loads they make nothing, unless they have some optimisation and very good load balancing (if they can double the tokens per second then they make a decent profit).
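Pulling those illustrative guesses into one place (none of these are real numbers):

    # The comment's cost model: 5 expert nodes per token, each $20k amortized
    # over 4 years plus ~$5k/yr overheads, pushing 500 tok/s through the pipeline.
    tok_per_s = 500
    megatok_per_year = tok_per_s * 365 * 24 * 3600 / 1e6     # ~15,800 Mtok/yr
    cost_per_node_year = 20_000 / 4 + 5_000                   # $5k hardware + $5k overheads
    experts_per_token = 5                                      # nodes working on every token
    cost_per_year = cost_per_node_year * experts_per_token     # ~$50k

    cost_per_megatoken = cost_per_year / megatok_per_year
    print(f"~${cost_per_megatoken:.2f} per million tokens at 100% utilization")
    # ~$3.2/Mtok of cost against up to $10/Mtok charged; at a third of the
    # utilization the margin is mostly gone, which is the "it's tight" point.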
Lots of good answers that mention the big things (money, scale, and expertise). But one thing I haven’t seen mentioned yet is that the transformer math is probably against your use case. Batch compute on beefy hardware is currently more efficient than computing small sequences for a single user at a time, since these models tend to be memory bound and not compute bound. They have the users that makes the beefy hardware make sense, enough people are querying around the same time to make some batching possible.
Well, their huge GPU clusters have "insane VRAM". Once you can actually load the model without offloading, inference isn't all that computationally expensive for the most part.
When I think about serving large-scale LLM inference (like ChatGPT), I see it a lot like high-speed web serving — there are layers to it, much like in the OSI model.
1. Physical/Hardware Layer At the very bottom is the GPU silicon and its associated high-bandwidth VRAM. The model weights are partitioned, compiled, and efficiently placed so that each GPU chip and its VRAM are used to the fullest (ideally). This is where low-level kernel optimizations, fused operations, and memory access patterns matter so that everything above the chip level tries to play nice with the lowest level.
2. Intra-Node Coordination Layer Inside a single server, multiple GPUs are connected via NVLink (or equivalent high-speed interconnect). Here you use tensor parallelism (splitting matrices across GPUs), pipeline parallelism (splitting model layers across GPUs), or expert parallelism (only activating parts of the model per request) to make the model fit and run faster. The key is minimizing cross-GPU communication latency while keeping all GPUs running at full load - many low level software tricks here.
3. Inter-Node Coordination Layer When the model spans multiple servers, high-speed networking like InfiniBand comes into play. Techniques like data parallelism (replicating the model and splitting requests), hybrid parallelism (mixing tensor/pipeline/data/expert parallelism), and careful orchestration of collectives (all-reduce, all-to-all) keep throughput high while hiding model communication (slow) behind model computation (fast).
4. Request Processing Layer Above the hardware/multi-GPU layers is the serving logic: batching incoming prompts together to maximize GPU efficiency and mold them into ideal shapes to max out compute, offloading less urgent work to background processes, caching key/value attention states (KV cache) to avoid recomputing past tokens, and using paged caches to handle variable-length sequences.
5. User-Facing Serving Layer At the top are optimizations users see indirectly — multi-layer caching for common or repeated queries, fast serialization protocols like gRPC or WebSockets for minimal overhead, and geo-distributed load balancing to route users to the lowest-latency cluster.
Like the OSI model, each “layer” solves its own set of problems but works together to make the whole system scale. That’s how you get from “this model barely runs on a single high-end GPU” to “this service handles hundreds of millions of users per week with low latency.”
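If it helps to picture layer 4, here's a toy continuous-batching loop with a fake model step (all names and the stop condition are made up; real servers like vLLM or SGLang add paged KV caches, preemption, prefix sharing, and much more):

    import collections, itertools

    MAX_BATCH = 8

    def model_step(batch):
        """Stand-in for one forward pass over a batch of active sequences."""
        return [seq + "x" for seq in batch]   # pretend each sequence gained one token

    pending = collections.deque(f"req{i}:" for i in range(20))
    active = {}                               # request id -> text generated so far
    ids = itertools.count()
    finished = 0

    while pending or active:
        # Admit new requests whenever the batch has room (continuous batching).
        while pending and len(active) < MAX_BATCH:
            active[next(ids)] = pending.popleft()

        # One decode step for everything currently in the batch.
        keys = list(active)
        for k, out in zip(keys, model_step([active[k] for k in keys])):
            active[k] = out
            if len(out.split(":")[1]) >= 4:   # fake stop condition: 4 "tokens"
                del active[k]                 # finished requests free their batch slot
                finished += 1

    print("finished", finished, "requests")   # -> finished 20 requests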
If the explanation really is, as many comments here suggest, that prompts can be run in parallel in batches at low marginal additional cost, then that feels like bad news for the democratization and/or local running of LLMs. If it’s only cost-effective to run a model for ~thousands of people at the same time, it’s never going to be cost-effective to run on your own.
Sure, but that's how most of human society works already.
It's more cost effective to farm eggs from a hundred thousand chickens than it is for individuals to have chickens in their yard.
You CAN run a GPT-class model on your own machine right now, for several thousand dollars of machine... but you can get massively better results if you spend those thousands of dollars on API credits over the next five years or so.
Some people will choose to do that. I have backyard chickens, they're really fun! Most expensive eggs I've ever seen in my life.
50 years ago general computers were also time shared. Then the pendulum swing to desktop, then back to central.
I for one look forward to another 10 years of progress - or less - putting current models running on a laptop. I don’t trust any big company with my data
For fungible things, it's easy to cost out. But not all things can be broken down just in token cost, especially as people start building their lives around specific models.
Even beyond privacy just the availability is out of your control - you can look at r/ChatGPT's collective spasm yesterday when 4o was taken from them, but basically, you have no guarantees to access for services, and for LLM models in particular, "upgrades" can completely change behavior/services that you depend on.
Google has been even worse in the past here, I've seen them deprecate model versions with 1 month notices. It seems a lot of model providers are doing dynamic model switching/quanting/reasoning effort adjustments based on load now.
Well, you can also batch your own queries. Not much use for a chatbot but for an agentic system or offline batch processing it becomes more reasonable.
Consider a system were running a dozen queries at once is only marginally more expensive than running one query. What would you build?
That determines the cost effectiveness to make it worth it to train one of these models in the first place. Using someone else's weights, you can afford to predict quite inefficiently.
You and your engineering team might be able to figure it out and purchase enough equipment also if you had received billions of dollars. And billions and billions. And more billions and billions and billions. Then additional billions, and more billions and billions and even more billions and billions of dollars. They have had 11 rounds of funding totaling around $60 billion.
Isn’t the answer to the question just classic economies of scale?
You can’t run GPT4 for yourself because the fixed costs are high. But the variable costs are low, so OAI can serve a shit ton.
Or equivalently the smallest available unit of “serving a gpt4” is more gpt4 than one person needs.
I think all the inference optimisation answers are plain wrong for the actual question asked?
It’s the same principle as:
https://www.tripadvisor.com/Restaurant_Review-g60763-d477541...
My mental model is: "How can an airline move 100 customers from NY to LA with such low latency, when my car can't even move me without painfully slow speeds".
Different hardware, batching, etc.
I think it’s some combination of:
- the models are not too big for the cards. Specifically, they know the cards they have and they modify the topology of the model to fit their hardware well
- lots of optimisations. Eg the most trivial implementation of transformer-with-attention inference is going to be quadratic in the size of your output but actual implementations are not quadratic. Then there are lots of small things: tracing the specific model running on the specific gpu, optimising kernels, etc
- more costs are amortized. Your hardware is relatively expensive because it is mostly sitting idle. AI company hardware gets much more utilization and therefore can be relatively more expensive hardware, where customers are mostly paying for energy.
I work at a university data center, although not on LLMs. We host state of the art models for a large number of users. As far as I understand, there is no secret sauce. We just have a big GPU cluster with a batch system, where we spin up jobs to run certain models. The tricky part for us is to have the various models available on demand with no waiting time.
But I also have to say that 700M weekly users, which could mean 100M daily or roughly 70k a minute (a lowball estimate with no returning users...), is a lot, but achievable at startup scale. I don't have our current numbers, but we are several orders of magnitude smaller of course :-)
The big difference from home use is the amount of VRAM. Large-VRAM GPUs such as the H100 are gated behind support contracts and cost ~$20k. Theoretically you could buy a Mac Pro with a ton of RAM as an individual if you wanted to run such models yourself.
TL;DR: It's massively easier to run a few models really fast than it is to run many different models acceptably.
They probably are using some interesting hardware, but there's a strange economy of scale when serving lots of requests for a small number of models. Regardless of whether you are running single GPUs, clustered GPUs, FPGAs, or ASICs, there is a cost to initializing the model that dwarfs the cost of a single inference by many orders of magnitude.
If you build a workstation with enough accelerator-accessible memory to have "good" performance on a larger model, but only use it with typical user access patterns, that hardware will be sitting idle the vast majority of the time. If you switch between models for different situations, that incurs a load penalty, which might evict other models, which you might have to load in again.
However, if you build an inference farm, you likely have only a few models you are working with (possibly with some dynamic weight shifting[1]) and there are already some number of ready instances of each, so that load cost is only incurred when scaling a given model up or down.
I've had the pleasure to work with some folks around provisioning an FPGA+ASIC based appliance, and it can produce mind-boggling amounts of tokens/sec, but it takes 30m+ to load a model.
[1] there was a neat paper at SC a few years ago about that, but I can't find it now
How can Google serve 3B users when I can't do one internet search locally? [2001]
I think this article can be interesting:
https://www.seangoedecke.com/inference-batching-and-deepseek...
Here is an example of what happens
> The only way to do fast inference here is to pipeline those layers by having one GPU handle the first ten layers, another handle the next ten, and so on. Otherwise you just won’t be able to fit all the weights in a single GPU’s memory, so you’ll spend a ton of time swapping weights in and out of memory and it’ll end up being really slow. During inference, each token (typically in a “micro batch” of a few tens of tokens each) passes sequentially through that pipeline of GPUs
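To make the quoted idea concrete, here's a toy sketch of pipeline parallelism under very loose assumptions (plain Python stand-ins for layers, no real GPUs, made-up sizes). Each "stage" owns ten consecutive layers and micro-batches flow through the stages in order; a real system overlaps the stages so every GPU is working on a different micro-batch at the same time.

```python
# Toy pipeline-parallel forward pass: 40 stand-in "layers" split into 4 stages
# of 10, with micro-batches pushed through stage by stage.
def make_layer(i):
    return lambda x: x + i                       # stand-in for a transformer layer

layers = [make_layer(i) for i in range(40)]      # hypothetical 40-layer model
stages = [layers[i:i + 10] for i in range(0, 40, 10)]

def run_stage(stage, x):
    for layer in stage:
        x = layer(x)
    return x

micro_batches = [0.0, 1.0, 2.0]                  # stand-ins for micro-batches of tokens
outputs = []
for mb in micro_batches:
    x = mb
    for stage in stages:                         # each stage lives on its own GPU in
        x = run_stage(stage, x)                  # practice; while stage k handles this
    outputs.append(x)                            # micro-batch, the next stage takes the previous one
```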
I would also point out that 700 million per week is not that much. It probably translates to thousands of QPS, which is "easily" served by thousands of big machines.
The serving infrastructure becomes very efficient when serving requests in parallel.
Look at vLLM. It's the top open-source version of this.
But the idea is you can service 5000 or so people in parallel.
You get about 1.5-2x slowdown on per token speed per user, but you get 2000x-3000x throughput on the server.
The main insight is that memory bandwidth is the main bottleneck so if you batch requests and use a clever KV cache along with the batching you can drastically increase parallel throughput.
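A back-of-envelope sketch of that insight, with entirely made-up numbers: if every decode step has to stream the full set of weights out of HBM, that streaming time is fixed per step, so a bigger batch multiplies total throughput almost for free (until compute or KV-cache traffic becomes the new bottleneck).

```python
# Roofline-style arithmetic: weight streaming dominates a decode step, and the
# cost is shared by everyone in the batch. All numbers below are invented.
weight_bytes = 100e9        # hypothetical 100 GB of (quantized) weights
hbm_bandwidth = 3e12        # ~3 TB/s of HBM bandwidth on a big accelerator

seconds_per_step = weight_bytes / hbm_bandwidth   # ~0.033 s to stream the weights once

for batch in (1, 64, 2048):
    tokens_per_second = batch / seconds_per_step  # every request in the batch gets a token
    print(f"batch={batch:5d}  ~{tokens_per_second:,.0f} tokens/s total")
```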
Look for positron.ai talks about their tech, they discuss their approach to scaling LLM workloads with their dedicated hardware. It may not be what is done by OpenAI or other vendors, but you'll get an idea of the underlying problems.
You also can't run a Google search. Some systems are just large!
Baseten serves models as a service, at scale. There’s quite a lot of interesting engineering both for inference and infrastructure perf. This is a pretty good deep dive into the tricks they employ: https://www.baseten.co/resources/guide/the-baseten-inference...
I once solved a similar issue in a large application by applying the Flyweight design pattern at massive scale. The architectural details could fill an article, but the result was a significant performance improvement.
The first step is to acquire hardware fast enough to run one query quickly (and yes, for some model sizes you are looking at sharding the model and distributed runs). The next is to batch requests, improving GPU utilization significantly.
Take a look at vLLM for an open source solution that is pretty close to the state of the art as far as handling many user queries: https://docs.vllm.ai/en/stable/
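For a concrete starting point, this is roughly what the vLLM quickstart looks like (the model name below is just the small placeholder from their docs; swap in whatever you actually want to serve). You hand the engine a whole batch of prompts and it handles continuous batching and paged KV-cache memory behind the scenes.

```python
# Minimal vLLM offline-inference example, along the lines of its quickstart.
from vllm import LLM, SamplingParams

prompts = [
    "Why is the sky blue?",
    "What should I make for dinner?",
    "Explain KV caching in one sentence.",
]
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

llm = LLM(model="facebook/opt-125m")       # placeholder model from the docs
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```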
Have you looked at what happens to tokens per second when you increase batch size? The cost of serving 128 queries at once is not 128x the cost of serving one query.
This. The main trick, outside of just bigger hardware, is smart batching. E.g. if one user asks why the sky is blue and another asks what to make for dinner, both queries go through the same transformer layers and the same model weights, so they can be answered concurrently for very little extra GPU time. There are also ways to continuously batch requests together so they don't have to be issued at the same time.
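On the "don't have to be issued at the same time" part, here's a toy continuous-batching scheduler loop under obvious simplifying assumptions (no real model; a counter stands in for each request's remaining tokens). Requests join the running batch whenever a slot frees up and leave as soon as they finish, instead of waiting for a fixed batch to fill.

```python
# Toy continuous batching: admit new requests between decode steps, retire
# finished ones, and run one decode step for the whole active batch at once.
import random
from collections import deque

random.seed(0)
waiting = deque({"id": i, "remaining": random.randint(2, 5)} for i in range(8))
active, max_batch = [], 4

step = 0
while waiting or active:
    while waiting and len(active) < max_batch:   # fill any free batch slots
        active.append(waiting.popleft())
    for req in active:                           # one decode step = one token for
        req["remaining"] -= 1                    # every active request, all at once
    finished = [r for r in active if r["remaining"] == 0]
    active = [r for r in active if r["remaining"] > 0]
    step += 1
    if finished:
        print(f"step {step}: finished {[r['id'] for r in finished]}")
```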
Not answering, but I appreciate your courage in asking this possibly-stupid-sounding question.
I have had the same question lingering, so I guess there are many more people like me and you benefiting from this thread!
I'd start by watching these lectures:
https://ut.philkr.net/advances_in_deeplearning/
Especially the "Advanced Training" section to get some idea of tricks that are used these days.
Simple answer: they are throwing billions of dollars at infrastructure (GPU) and losing money with every user.
You’re not losing money if money flows in faster than it flows out
Once you have enough GPUs to have your whole model available in GPU RAM you can do inference pretty fast.
As soon as you have enough users you can keep your GPUs running at high load constantly, while your home solution would idle most of the time and therefore be way too expensive compared to the value you get.
Complete guess, but my hunch is that it's in the sharding. When they break apart your input into its components, they send it off to hardware that is optimized to solve for that piece. On that hardware they have insane VRAM and it's already cached in a way that optimizes that sort of problem.
How does a billion dollar company scale in a way that a single person cannot?
I dunno, I ran `ollama run gpt-oss:20b` and it only used 16GB locally, and I got decent enough inference speed on my MacBook.
Now do the 120b model.
I think they just have a philosophers stone that they plug their ethernet cable into
And to think they'll let me use (some of it) for mere pennies!
The marginal value of money is low. So it's not linear. They can buy orders of magnitude more GPUs than you can buy.
ChatGPT uses a horrendous amount of energy. Crazy. It will ruin us all.
1. They have many machines to split the load over.
2. The MoE architecture lets them shard experts across different machines: one machine handles generating one token of context before the whole thing is shipped off to the next expert for the next token. This cuts bandwidth requirements to roughly 1/N, as well as the amount of VRAM needed on any single machine (a toy routing sketch follows this comment).
3. They batch tokens from multiple users to further reduce memory bandwidth (e.g. they compute the math for a given set of weights across multiple users at once). This reduces bandwidth requirements significantly as well.
So basically the main tricks are batching (only relevant when you have > 1 query to process) and MoE sharding.
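Here's a toy sketch of the MoE routing idea under made-up sizes (top-2 routing, random weights; certainly not OpenAI's actual architecture). The point is just that each token only touches a couple of experts, so most of the parameters, and the machines holding them, sit outside any given token's forward pass.

```python
# Toy top-2 mixture-of-experts layer for a single token's activations.
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 16, 8, 2

experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]  # one "FFN" each
router = rng.standard_normal((d, n_experts))

def moe_layer(x):                                   # x: (d,) activations for one token
    logits = x @ router
    chosen = np.argsort(logits)[-top_k:]            # indices of the top-k experts
    gates = np.exp(logits[chosen])
    gates /= gates.sum()
    # Only the chosen experts' weights need to be touched (or fetched) for this token.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, chosen))

out = moe_layer(rng.standard_normal(d))
```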
They have more than 700mX your computing budget?
Basically, if Nvidia sold AI GPUs at consumer prices, OpenAI and others would buy them all up for the lower price, consumers would not be able to buy them, and Nvidia would make less money. So instead, we normies can only get "gaming" cards with pitiful amounts of VRAM.
AI development is for rich people right now. Maybe when the bubble pops and the hardware becomes more accessible, we'll start to see some actual value come out of the tech from small companies or individuals.
What incentive do any of the big LLM providers have to solve this problem? I know there are technical reasons, but SaaS is a lucrative and proven business model and the systems have for years all been built by companies with an incentive to keep that model running, which means taking any possible chance to trade off against the possibility of some paying customer ever actually being able to run the software on their own computer. Just like the phone company used to never let you buy a telephone (you had to rent it from the phone company, which is why all the classic Western Electric telephones were indestructible chunks of steel).
batching & spread of users over time will get you there already
Huge batches to find the perfect balance between compute and memory bandwidth, quantized models, speculative decoding or similar techniques, MoE models, routing requests to smaller models when possible, and batch processing to fill the GPUs when demand is lower (or electricity is cheaper).
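As one concrete item from that list, here's a minimal sketch of symmetric int8 weight quantization (toy layer size, per-row scales; real schemes are fancier). The point is simply that int8 weights are a quarter the bytes of fp32, so there's far less to stream per token when memory bandwidth is the limit.

```python
# Toy symmetric int8 quantization of one weight matrix, with per-row scales.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4096, 4096)).astype(np.float32)     # made-up layer size

scales = np.abs(W).max(axis=1, keepdims=True) / 127.0        # one scale per output row
W_int8 = np.clip(np.round(W / scales), -127, 127).astype(np.int8)

W_restored = W_int8.astype(np.float32) * scales              # dequantize for use
print("bytes fp32:", W.nbytes, " bytes int8:", W_int8.nbytes)
print("mean abs error:", np.abs(W - W_restored).mean())
```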
Elsewhere in the thread, someone talked about how H100s each have 80GB of VRAM and cost $20,000.
The largest ChatGPT models are maybe 1-1.5 TB in size, and all of that needs to be loaded into pooled VRAM. That sounds daunting, but a company like OpenAI has countless machines with enough of these datacenter-grade GPUs, and their gobs of VRAM pooled together, to run the big models.
Inference is also pretty cheap, especially when a model comfortably fits in a pool of VRAM. It's not that a pool of GPUs spools up each time someone sends a request; more likely there's a queue of requests from ChatGPT's 700 million users, and multiple (I have no idea how many) pools of VRAM keep the models resident in memory to chew through that nearly perpetual queue.
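Rough sizing arithmetic behind that (every number below is a guess): even a ~1.5 TB model only needs a few dozen 80 GB cards to hold one copy of the weights, once you leave some headroom on each card for KV cache and activations.

```python
# Back-of-envelope count of GPUs needed just to hold the weights once.
model_bytes = 1.5e12            # hypothetical ~1.5 TB of weights
hbm_per_gpu = 80e9              # H100-class card
kv_and_overhead = 0.3           # reserve ~30% of each card for KV cache, activations, etc.

usable_per_gpu = hbm_per_gpu * (1 - kv_and_overhead)
gpus_needed = -(-model_bytes // usable_per_gpu)              # ceiling division
print(f"~{int(gpus_needed)} GPUs just to hold the weights")  # ~27 with these guesses
```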
I do not have a technical answer, but I have the feeling that the concept of "loss leaders" is useful
IMO outfits like OpenAI are burning metric shit-tonnes of cash serving these models. It pales in comparison to the mega shit-tonnes of cash used to train the models.
They hope to gain market share before they start charging customers what it costs.
Azure servers
Data centers, plus use of client hardware: those 700M clients' machines are being partially used as clusters.
Because they spend billions per year on that.
Money. Don't let them lie to you. Just look at Nvidia.
They are throwing money at this problem hoping you throw more money back.
At the end of the day, the answer is... specialized hardware. No matter what you do on your local system, you don't have the interconnects necessary. Yes, they have special software, but the software would not work locally. NVIDIA sells entire solutions and specialized interconnects for this purpose. They are well out of the reach of the standard consumer.
But software wise, they shard, load balance, and batch. ChatGPT gets 1000s (or something like that) of requests every second. Those are batched and submitted to one GPU. Generating text for 1000 answers is often the same speed as generating for just 1 due to how memory works on these systems.
They also don’t need one system per user. Think of how often you use their system over the week, maybe one hour total? You can shove 100+ people into sharing one system at that rate… so already you’re down to only needing 7 million systems.
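The quick arithmetic behind that, using the comment's own rough assumptions:

```python
# Time-sharing arithmetic; all inputs are the rough guesses from the comment above.
weekly_users = 700e6
hours_of_use_per_user_per_week = 1
hours_in_week = 24 * 7

avg_concurrent = weekly_users * hours_of_use_per_user_per_week / hours_in_week
systems_needed = weekly_users / 100                  # "100+ people sharing one system"
print(f"~{avg_concurrent / 1e6:.1f}M users active on average, "
      f"~{systems_needed / 1e6:.0f}M systems at 100 users/system")
```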
By setting billions of VC money on fire: https://en.wikipedia.org/wiki/OpenAI
No, really. They just have entire datacenters filled with high end GPUs.
redis
Not affiliated with them, and I might be a little out of date, but here are my guesses:
1. prompt caching
2. some RAG to save resources
3. of course lots model optimizations and CUDA optimizations
4. lots of throttling
5. offloading parts of the answer that are better served by other approaches (if asked to add numbers, do a system call to a calculator instead of using the LLM; a tiny sketch of this follows at the end of this comment)
6. a lot of sharding
One thing you should ask is: What does it mean to handle a request with chatgpt? It might not be what you think it is.
source: random workshops over the past year.
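A tiny sketch of the offloading idea from item 5, under loose assumptions (`call_llm` is a placeholder, not any real API): route inputs that plain code can answer to plain code, and only send the rest to the model.

```python
# Toy request router: arithmetic goes to a calculator path, everything else
# goes to the (placeholder) model call.
import re

def call_llm(prompt: str) -> str:
    return f"<answer from the big model for: {prompt!r}>"    # placeholder, not a real API

ARITH = re.compile(r"^\s*\d+(\s*[-+*/]\s*\d+)+\s*$")

def answer(prompt: str) -> str:
    if ARITH.match(prompt):
        # eval() is acceptable here only because the regex restricts input to
        # digits and basic operators; never do this with arbitrary text.
        return str(eval(prompt))
    return call_llm(prompt)

print(answer("2 + 2 * 10"))           # calculator path -> "22"
print(answer("why is the sky blue"))  # model path
```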
https://en.wikipedia.org/wiki/Autoscaling
Finally, some1 with the important questions!
Hint: it's a money thing.
They rewrote it in Rust/Zig; the one you have is written in Ruby. :-p
They are hosted on Microsoft Azure cloud infrastructure, and Microsoft owns 49% of OpenAI.
They are also partnering with rivals like Google for additional capacity https://www.reuters.com/business/retail-consumer/openai-taps...
In fact, when logged out of ChatGPT, I found it is hosted on Azure.