06-reference / transcripts

dwarkesh reiner pope gpt5 claude gemini training transcript

Today I’m interviewing Reiner Pope, who is CEO of MatX, which is a new chip startup. Previously he was doing TPU architecture and many other things at Google. This is a very different format from my usual interviews. This is going to be a blackboard lecture, and we’re going to get up in a second. We in fact built this whole new studio with specifically this format in mind. Um, and so it’s a pleasure to get to inaugurate it with you. We’re going to be talking about model architecture, ML infra, many other things. And the reason I think it’s an important topic is because once you actually understand how training and inference actually work in a cluster, as we’ll see, a lot of things, about why AI is the way it is, why AI architectures are the way they are, why API prices are the way they are, and fundamentally also why AI progress is the way it is, start making sense. And you need to understand the details to get there, and you need a blackboard to understand the details. So Reiner, thank you so much for doing this. >> Yeah, very happy to be here. Okay. Uh, full disclosure, I am an angel investor in MatX, but that’s unrelated to this podcast. Um, Reiner, maybe to kick us off, I’ll ask this question. So, we have

[00:01:02] a couple of companies like Claude and Codex and Cursor are offering something like, uh, fast mode, where for 6x the price, they’ll stream you tokens at 2.5x the speed. Mechanically, I’m curious what’s going on here. Like, why is it the case that you can pay more to get faster latency? And two, could you keep going? Could you pay 100x more and somehow get even faster speeds, or much, much faster speeds? Um, and three, could you go the other way? Could you have something like, uh, Claude Code slow mode, where if you are willing to wait for minutes on end, you could get even cheaper prices? So maybe this will help motivate the kind of analysis that you’ll be doing through the lecture. >> Great. I mean, to jump a little bit to the conclusion, the big effect is batch size, but what we’re going to do now is quantify exactly what that looks like and what its implications are on latency and cost. Uh, there’s going to be another effect, which you can call speculative decoding or multi-token prediction. We can maybe come back to that later, but I think the first thing that we’ll talk through is batch size.

[00:02:00] So what I’d like to introduce is, um, sort of the two principles of analysis. Firstly, we’re going to look at a roofline analysis of how I run a transformer model on a cluster of chips. Um, we’ll take, let’s say, a Blackwell NVL72 cluster, so a rack of 72 GPUs. Um, and so the roofline analysis means we look at memory bandwidth and compute performance. And then the other side of that is that we’re going to look at just two simple factors of the model, which are the time to operate on the weights and then the time to operate on the context, the KV cache. So let’s jump in. What we’re going to try and do is estimate the time that it takes to run an inference of a certain shape. Now, we’re not perfect here. We can’t exactly predict the time. And so instead, we’re going to approximate. And so we’re going to say that the time must be greater than or equal to a certain quantity. And so we’re going to consider two different aspects. We’re going to look at the time it takes to

[00:03:01] do the memory fetches, and then the time it takes to do the compute. And it’ll turn out that this actually gives us very strong predictive power, even with a simple model. So one by one: what is the time that it takes to do the compute? There are really two things I need to do in the compute. I need to multiply by all of the active parameters, and then I need to do some work on the attention. So, multiplying by all the active parameters: I have a certain batch size that I’m running, and then I’ve got a number of active parameters in my model, and then I’m just going to divide this by the compute throughput, which is the flops of the chip. So this is a hardware constant. So this actually accounts for all of the compute time for all of the weight matrix multiplies. Um, there’s a little caveat here: we’ve sort of ignored the time to do any of the attention computation, but that in

[00:04:00] general will be quite small in comparison to this. So we’ll ignore it. >> Maybe I’ll just interject from time to time to ask some very naive questions or to clarify some basic points, but just for the audience: you’re not serving one user at a time. The batch refers to the fact that you’re serving many different users at the same time. >> Yeah. >> Um, and that’s a whole batch. >> Yeah. So I can motivate the batch at least a little bit. So, um, I mean, we will see exactly why batching is such a favorable optimization, but what will turn out to be the case is that if you do not batch together many users, the cost and the economics you get can be like a thousand times worse than if you do batch many users together. Um, and we’ll be able to see that quite explicitly. >> And then, uh, number of active parameters: this is saying, like, if I look at for example a DeepSeek model, the DeepSeek V3 model has about 37 billion active parameters and about 700 billion total parameters. So we’re focusing on just the ones that are active for a single token. Okay, so we’ve modeled the compute time. I’m going to keep writing equals, but in all of these cases, you can think of this

[00:05:00] time as being at least this much, and maybe there will be some terms we ignored. Um, on the memory side, what do we need to do with memory? We need to fetch all of the weights. And so there is some time to fetch all of the total number of parameters, not just the active parameters. So there’s weight fetch time, and then in addition there’s a KV cache fetch time. So this actually depends on batch size. Uh, so for every element of the batch we have to fetch an entire context length worth of tokens, and then there’s a size per token, like, bytes per token. Um, and so this is a model parameter. >> And maybe just backing up, let’s just explain what the KV cache is real quick. >> Yeah. So when I do a forward pass, let me draw, actually, how the autoregressive inference works. So this is during decode. Um, so if I think I have a

[00:06:02] bunch of tokens of text, I’m growing a tensor, because ultimately the tokens are represented as some tensor in some embedding dimension, and then in this direction I have the sequence length. Um, the work of running a decode is: I have to run each token through a whole bunch of matrix multiplies over a bunch of different layers. Um, and in general I’m going to have to do that work over all of these tokens. But then one step of decode is actually to produce just this one additional token here. Yep. And so what I’m going to do there is run a full forwards pass of multiplying by all of the weight matrices in the entire model. Um, but then I’ve got this attention mechanism where this token is sort of, it’s like looking at all of the past tokens, um, in this way. And what is it looking at specifically? It is

[00:07:01] looking at some internal representation that the model has produced of the tokens, and we call that the KV cache. So this process of attending, this single token attending to all of the history of tokens, that’s attention. It is mostly dominated by memory fetches rather than matrix multiplies. >> Mhm. So we’ve got the amount of memory that we’re fetching shown over here, and then this is of course just divided by the memory bandwidth, so the memory bytes per second. So in fact these equations here are actually enough for us to now draw some lines. And so the things that we’d like to look at are sensitivity to batch, and then also, which we’ll draw separately, to context length. So we said that the big effect you can get is some trade-off in latency versus cost in batch size. So let’s draw them out.
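To make the two terms above concrete, here is a rough sketch of the roofline estimate in Python. The model and hardware numbers are illustrative placeholders (roughly a DeepSeek-V3-shaped model on a Blackwell-class rack, aggregated across GPUs), not figures quoted in the conversation; what matters is the structure: step time is at least the max of a compute term and a memory term, where the memory term is weight fetch plus KV-cache fetch.

```python
# Rough roofline lower bound for one decode step, per the discussion above.
# All numbers below are illustrative placeholders, not quoted figures.

def decode_step_time(batch, n_active, n_total, ctx_len, kv_bytes_per_token,
                     flops, mem_bw, bytes_per_param=1.0):
    """Lower-bound time (seconds) for generating one token per sequence."""
    # Compute: every token in the batch is multiplied by the active parameters.
    # (2 FLOPs per parameter per token for the multiply-add; the blackboard
    # version drops this constant.)
    t_compute = 2 * batch * n_active / flops
    # Memory: fetch all the weights once, plus each sequence's KV cache.
    t_weights = n_total * bytes_per_param / mem_bw
    t_kv = batch * ctx_len * kv_bytes_per_token / mem_bw
    return max(t_compute, t_weights + t_kv)

# Illustrative aggregate numbers for a whole rack.
t = decode_step_time(batch=2048, n_active=37e9, n_total=700e9,
                     ctx_len=16_000, kv_bytes_per_token=70_000,
                     flops=1e18, mem_bw=5e14)
print(f"estimated step time: {t * 1e3:.1f} ms")
```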

[00:08:00] I think there are really just two graphs we want to draw. Um, we’ll first just draw batch size versus time here. So when we look at the shape of this, we’ve got a maximum of a sum and then another term. Um, so let’s look at these terms one by one, how they scale, the time for compute and memory, and how they show up. So let’s first look at this compute time. Uh, this is just purely linear in batch size with no offset. So it is some >> curve like this. This is t compute. Um, and then on the memory side we’ve got some portion here that is just a constant, some base offset here, which is the weight fetch.

[00:09:00] And then finally we have this term here, which is the KV fetch, um, which is linear in batch, and so it looks like that. So the sum of this plus this, maxed with this. So let’s at least first draw the sum. Um, so the two memory times in conjunction end up looking like this curved slope here. >> Mhm. >> And then we get the overall maximum, I’ll draw a little figure here, as the maximum of these two curves. Make sense? Okay. So what does this mean actually? So this is a latency plot. Um, so if I grow my batch size, I get initially some not very strong dependence on batch size, and so there’s

[00:10:00] some lower bound on latency here. Um, a latency lower bound. Um, so this already partially answers the question: for a given hardware configuration, and then we can talk about varying the hardware configuration, but for a given hardware configuration there is a lower bound on latency, which is simply that I need to read all of my total parameters from memory into the chips, and that takes a certain amount of time. Uh, if I use all of my memory bandwidth I can’t do any better than that. >> Uh, it seems like the way you’ve drawn the slopes for compute time and how the KV grows, >> and what implication the KV has on memory time, >> what if this were above or below? >> Yeah. >> Or is that necessarily the case? Because if this is always true, then as batch size grows, compute always dominates >> the KV, which suggests that if you

[00:11:00] have a big enough batch size, maybe memory is never an issue. Yeah, this is really sensitive to the context length. Um, so I think we should come back and explore this. As you vary the context length, the KV fetch time will go up and up, and so that’ll cause a transition from compute-limited to memory-limited. >> And is there something especially significant about the slope being exactly the slope of the compute time? >> Yeah, whenever we have balance points, it kind of says that you’re getting it exactly right. Um, and so >> for the particular context length where the slopes match, that says I am equally memory-bound and compute-bound, which is a really desirable place to be. >> But suppose, and this is a very simple algebra problem, but >> suppose, you know, the optimal is 100k context length >> and you go to 200k context length. Does your MFU go down to like 50%? Like, does it have a humongous impact on MFU >> to be slightly outside of the context-length-optimal Goldilocks zone?

[00:12:01] >> That’s right. So that is true as modeled here. Um, there’s a key point here, that I’m modeling this context length, or I’m modeling the memory fetch, as linear in context length, >> and that actually depends on model architecture. It is true for many of, or all of, the model architectures with dense attention. Yeah. Um, sparse attention actually scales much better than that. >> Got it. And is sparse attention what everybody uses in practice? >> I’m pretty excited about sparse attention. Uh, it’s hard to know what the labs are using. DeepSeek has published a sparse attention mechanism. >> I’ll just put a plug in that sparse attention, some of the DeepSeek papers that have published sparse attention, end up putting a square root in this term. >> Okay. So, so far we’ve looked at the latency. Um, it’s kind of hard to read off cost from this. Uh, so if I think, what does cost mean? Um, to run this inference I’m going to use the GPU for a certain number of seconds, like 1 millisecond or 20 milliseconds or something like that. Um, and I have to pay the rental time for that time. So like it’s $2

[00:13:00] an hour per GPU or something like that. Um, so that’s the cost of this inference, but how much value, how many tokens, have I processed during that inference? That is the batch size. And so what we actually want to plot is going to be the cost versus batch size, um, which is like T over B versus batch size. >> Uh, this is the cost per token. Mhm. >> Um, so we have to imagine dividing each of these three curves by B, multiplying by its reciprocal, and so what we end up with there is: the compute curve, which was linear, we divide by B, and that makes it a constant here, and this is t compute. The KV fetch was linear, and now it becomes a constant as well, the KV

[00:14:00] fetch. And then the weight fetch was constant, and now we’ve divided by B, and so it becomes this hyperbola. >> Mhm. >> And so again, we’re going to compute the max of the sum. Um, so the sum of these two terms shifts the hyperbola up: the sum of the KV fetch and the weight fetch gives us a sort of higher hyperbola that’s like this. Mhm. >> And then we’re going to take the max with the compute here. So we end up with this being the overall shape that we care about. So again, we see some limiting behavior. The cost initially starts very high: at batch size of one it almost goes to infinity, like, it’s

[00:15:02] um, because we’ve got so many weight fetches which are not amortized over a large batch size. Um, but then as we increase the batch size, the weight fetches become amortized over so many different batch elements that their cost grows very small, and eventually the compute time ends up driving the cost. >> Mhm. >> So there is a limiting lower bound on cost, um, which is this one here. >> Yeah. Um, >> so Claude Code slow or Codex slow or whatever would just live on this line, and it wouldn’t help much because you’re not able to amortize the KV values over a much bigger batch. >> Yeah. Yeah. They’re unique per batch. The compute is also unique per batch. And so what is the minimum work you can do per batch after amortizing everything else away? Um, so at this point you are no longer memory-bandwidth bound.
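A rough sketch of the cost curve just described: divide the step time by the batch size and price it at an assumed GPU rental rate. The rental price, hardware figures, and model shape below are all placeholders; the point is the roughly thousand-fold gap between batch size one and a large batch.

```python
# Cost per token vs. batch size: the weight fetch amortizes away as the batch
# grows, until the compute term sets a floor. Placeholder numbers throughout.

GPU_HOUR = 2.0                  # assumed rental price, $/GPU-hour
RACK_GPUS = 72
N_ACTIVE, N_TOTAL = 37e9, 700e9        # active / total parameters
CTX, KV_BYTES = 16_000, 70_000         # context length, KV bytes per token
FLOPS, MEM_BW = 1e18, 5e14             # aggregate rack FLOP/s and bytes/s

def cost_per_mtok(batch):
    t_compute = 2 * batch * N_ACTIVE / FLOPS
    t_memory = N_TOTAL / MEM_BW + batch * CTX * KV_BYTES / MEM_BW
    step = max(t_compute, t_memory)                 # seconds per decode step
    rack_dollars_per_s = RACK_GPUS * GPU_HOUR / 3600
    return step * rack_dollars_per_s / batch * 1e6  # $ per million tokens

for b in (1, 64, 1024, 8192):
    print(f"batch {b:5d}: ${cost_per_mtok(b):8.2f} per million tokens")
```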

[00:16:00] >> What, practically, how big a batch do you need? Like, how big are the batches practically for frontier models? >> Um, you can just solve for that actually. Um, and it’s not even particularly sensitive to model architecture. >> So, um, let’s go ahead and do that. So what we’re talking about is: when is the memory time equal to the compute time? That’s what that question is. Um, because we’re focused on what the batch size is, and really there’s a question of when the weights are amortized over the multiplies, I’m going to focus on comparing the weight fetch time to the weight multiply time. I’m going to disregard the KV fetch term, just to simplify the analysis so we can get a kind of clean answer out. Um, so we’re going to equate this portion with these two terms. Yeah. So writing that out, um, we get n, the number

[00:17:00] of total parameters, over memory bandwidth, is equal to batch size times the number of active parameters divided by the compute performance. So looking over here, everything on the top, these are model parameters. Everything on the bottom, these are hardware parameters. Um, it turns out to be nice to rearrange them such that we have the hardware parameters on one side. So this is equivalent to compute performance over memory bandwidth being equal to batch size times the number of active parameters divided by the number of total parameters. So this side is a hardware parameter. Um, actually this ends up being a dimensionless constant. Uh, if you

[00:18:01] look in terms of flops, what are the dimensions of this? This is multiplies per second. This is bytes per second. So that’s not quite dimensionless. But what you do is you say, multiplies per second, let’s say I’m doing FP4, so how many FP4 multiplies per second, times the fact that each FP4 is half a byte. Um, and so I can actually make this end up being dimensionless. Um, and this ends up being, on most GPUs, somewhere around 300. >> And sorry, has that ratio changed over time as we’ve gone from model generation to model generation, where the flops keep increasing? >> So this is a hardware parameter, so to what extent has the hardware changed? So, um, from like A100 to H100 to B100, the flops has increased substantially, the memory bandwidth has also increased substantially, and the ratio has remained reasonably stable. >> Yeah. And we can express this one as well. This is a sparsity parameter. >> Um, and I might even phrase it slightly

[00:19:01] differently. Let’s solve for batch size in total. Um, we end up with, and so we’re just moving this back over to the other side, we end up with: batch size needs to be bigger than approximately 300 times sparsity. So for example, in DeepSeek I activate 32 out of 256 experts, so this would be like eight for DeepSeek. Got it. Okay. So this actually gives you a ballpark which is remarkably accurate to practice. Generally people will go a little bit larger than this. They don’t really want to be exactly at the balance point, because real-world efficiencies aren’t as good as a roofline analysis would say. Um, but like, take this and maybe double it or triple it. >> Okay. So basically it’s like 2 to 3,000 tokens per batch. But then if you included the KV cache, >> yes, >> the implication would be that the optimal batch size should grow larger.
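The same balance-point arithmetic as a quick sketch (ignoring the KV term, as above). The FLOP/s and bandwidth figures are placeholders chosen to land near the roughly-300 ratio mentioned a moment ago; they are not specs being asserted for any particular GPU.

```python
# Balance point: weight-fetch time == weight-multiply time, solved for batch.

fp4_mults_per_s = 5e15     # assumed FP4 multiply throughput of one GPU
bytes_per_fp4   = 0.5
hbm_bw          = 8e12     # assumed HBM bandwidth, bytes/s

hardware_ratio = fp4_mults_per_s * bytes_per_fp4 / hbm_bw   # dimensionless
sparsity = 256 / 32        # total experts / activated experts (DeepSeek-style)

min_batch = hardware_ratio * sparsity
print(f"hardware ratio ~{hardware_ratio:.0f}, minimum batch ~{min_batch:.0f}")
# -> ratio ~312, batch ~2500 with these placeholders; double or triple it in practice.
```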

[00:20:01] So this is, we solved for the equivalence between when compute time is equal to memory time. If I add in something that consumes more memory bandwidth, then I have less available for the weight loads, and so the batch size needs to grow more. >> This seems incredibly small. Like, a batch, this would be like less than one sequence, right? >> Yeah. Okay. So, I guess, keep in mind that I’m talking about the number of tokens that I’m generating one more token for. So it’s actually 2,000 unique sequences. >> Got it. Okay. We’re just talking about a single forward pass on these sequences. So you think of the batch as the number of sequences rather than, like, >> That’s right. Okay. Cool. >> Yeah. >> When I’m prepping for interviews, I often talk to experts in the field. So, for Reiner, I chatted with two of Jane Street’s engineers, Clark and Axel. Clark, who works on low-latency trading systems, walked me through why Jane Street uses FPGAs to make sure that they have predictable nanosecond latencies. You can just build these like giant grids of compute very easily that do exactly what

[00:21:01] you need, touch 100 megabytes of SRAM and then get your response back in tens of nanoseconds, very easily, and that’s basically impossible on a CPU. He then went on to explain why CPUs just wouldn’t work for this kind of thing. And so if you have a clock that’s going every 3 nanoseconds, you actually have several bytes of information at a time to make your decision. That’s as opposed to a CPU where you’ll just collect up a whole packet, you know, let’s say a 1,500-byte packet, and you say, “Okay, this packet is ready. Here you go, CPU. You can start thinking about it now.” FPGAs allow you to react to the earliest part of the packet as it arrives rather than having to wait for the full thing. We also talked about liquid cooling, network design, and many other things. If you’re interested in this stuff, Jane Street is hiring. You can check out their open roles at janestreet.com/bcash. And if you want to watch the full prep conversation, we posted it there, too. If you’ve got a frontier model and you are actually doing inference, surely they must have more than 2,000 concurrent users. >> Yeah. >> Is there any added latency from the fact

[00:22:00] that you need to have the whole batch fill up? Or is it that if you have a reasonable number of users, it’s so unlikely, it would not take you 100 milliseconds to fill up the next 2,000 slots? >> Yeah, the way to think about this, I guess we think of it as, when does the train depart, as a model. So let’s say I’ve picked a batch size that I’m going to run at. Maybe I pick, you know, this batch size. >> Um, and by the way, this intersection point is the same intersection point here. Um, >> so I pick this batch size. I know that it’s going to take, for example, maybe something like 20 milliseconds, which is a common place this ends up landing. >> What I’m going to produce is, so this is a timeline of what is running on the GPU, it’s going to start a new batch every 20 milliseconds regardless. And so each of these is 20, this is 40. You can think of this as a schedule for the train: a new train departs every 20 milliseconds, and any passengers who are waiting board

[00:23:01] the train. Um, if the train is full, then they wait for the next train. Um, if the train is not full, the train’s going to go anyway. Um, and so in terms of what that means for queuing latency, it means that the worst case is that a request arrives just after the train departed. It has to wait for the next train, so that’s up to 20 milliseconds, and then it has to wait for that train to complete. Uh, and so the worst-case latency is 40. >> So how is the 20 milliseconds derived? >> Um, I mean, rule of thumb, but where it comes from is not fully explained yet. Um, so far we’ve focused on memory bandwidth and compute time. Uh, when we look at memory, the other consideration is that we want to use all of the memory capacity we have. Um, and so generally we’re going to use all of that memory capacity to store the weights or the KVs. And so we just want to read, like, in the time of doing a forward pass, maybe we want to read all of the memory capacity into the chip. Um, and so

[00:24:01] that is capacity divided by bandwidth, and that tends to be 20 milliseconds on many different generations of HBM. >> The units make sense. You would have bytes divided by bytes per second. >> Yeah. So for example, I mean, on, I think, the Rubin generation it is something like 288 gigabytes divided by 20 terabytes per second. Um, and this looks like it comes out to about 15 milliseconds. Yeah. Let me make sure I understand what it’s saying. I mean, I understand the unit analysis, but what it’s saying is we can evacuate and replace the HBM in this amount of time. And so we don’t want to be in a situation where the HBM is not big enough that we’re not, you know, actually able to

[00:25:00] keep everything we want in it, or take everything out of it, or a situation where our ability to read it back and forth is too big or too small compared to that. Yeah, there are sort of two scenarios. Why don’t we pick a latency that is bigger than 15 milliseconds? And if I think about what that means, it means I actually have time to read the HBM, like, twice. Yep. >> Um, by the way, most of HBM access is reads, not writes. It’s almost all reads, because the weight matrices are read-only and then almost all of the KV cache accesses are reads. So, um, if, let’s say, I run at 30 milliseconds, I can read all of HBM twice, but what’s the point of that? Like, I don’t want to read the weight matrices twice, um, I don’t want to read the KVs twice. >> Yeah, it makes sense. Makes a ton of sense. Okay, so a couple of actually quick questions. One, if it is the case that the optimal batch size is something like 2,000, and is that actually true, it’s totally dependent on the sparsity? It’s not dependent on the model size or anything. >> I mean, sparsity shows up in model size, but beyond that, it only depends on sparsity, not on scale. But that’s a very interesting result, and that seems to imply that you can

[00:26:02] one question is how much of a push towards centralization is it, that you would have these economies of scale from inference, from batching. >> Yeah. >> But it seems like it’s not that big a deal. Like, I don’t know, is 2,000 users at the same time a lot? It doesn’t seem like a lot. >> We can do a bit of analysis on this. You can think of it in terms of number of users, but maybe a more productive way to think of it is in terms of number of tokens per second. >> Mhm. So what does this batch size mean in terms of tokens per second for the system? So, um, tokens per second is going to be equal to the batch size, we run a batch-many tokens, and then we do that every time interval T, which is, let’s say, that 15-to-20-milliseconds number. So this ends up being the batch size itself times about 60, so, like, 64 times B. Um, and so this ends up being around

[00:27:01] 2,000 times 64, so like 128k tokens per second. Sort of in more digestible units: it’s hard to reason about concurrent users, but what is the global traffic for a system? Um, when you look at some of the announcements, sometimes the API providers will brag about how much traffic they have. Um, the numbers that I’ve remembered from some announcements of Gemini last year were in the hundreds of millions of tokens per second worldwide. So this is about a thousandth of that. >> Yeah. But I mean, Gemini is big, right? Actually, a thousandth of Gemini is a lot, to, like, uh, >> to be competitive at scale, you need to be able to serve at least a thousandth of Gemini. Yeah. >> That’s interesting. >> Um, cool.
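Two quick sanity checks on the numbers in this stretch, using the figures as quoted in the conversation (treat the HBM capacity and bandwidth as the speaker’s recollection, not verified datasheet values):

```python
# Step time from "read all of HBM once", and the resulting token throughput.

hbm_capacity  = 288e9     # bytes per GPU, as quoted above
hbm_bandwidth = 20e12     # bytes per second per GPU, as quoted above
step_time = hbm_capacity / hbm_bandwidth
print(f"time to read all of HBM once: {step_time * 1e3:.1f} ms")   # ~14 ms

batch = 2000                      # sequences each getting one more token per step
tokens_per_s = batch / step_time  # ~64-70 steps/sec at a 15-20 ms step
print(f"throughput: ~{tokens_per_s:,.0f} tokens/sec for one serving instance")
```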

[00:28:00] Um, okay. So the more sparsity you have, the less compute you need. And it does seem that as batch sizes get bigger, compute ends up being the bottleneck. Mhm. >> According to this analysis. So then the question is, how far can you take sparsity? That is to say, as the sparsity ratio increases, as you have fewer and fewer active parameters relative to total parameters, how much is the performance of the model degrading, and is it degrading faster than you’re saving compute by increasing the sparsity factor? >> Yeah. So performance as in quality of the model, rather than speed of the model. Yeah. Yeah. >> So unfortunately we’re not able to answer that analytically. That’s um >> that is an empirical question of model quality. >> Best I can do is pull up a paper and answer that empirically. >> Yeah. >> Okay. Uh, should we pull up the paper now? Does that make sense? >> Yeah. So this paper, this is Unified Scaling Laws for Routed Language Models. It’s a somewhat old paper by this stage, but one of the things that they did is look at: if I keep increasing sparsity, what is the model quality impact? This answer is

[00:29:01] very sensitive to the actual choice of mixture of experts. Mixture of experts has been around for a really long time, I think maybe even back in 2017. Um, but the techniques have changed a lot. The DeepSeek mixture of experts was a big change in how it worked. Um, there have been older papers, which are GShard, uh, Switch Transformer. So the actual empirical results are going to depend on all of that. Um, but on one of the older techniques that is shown here, you can see: if I hold constant the number of active parameters at a certain size and then I increase the sparsity, which they call expert count here, the quality keeps increasing. And then if you imagine, like, drawing a horizontal line from 1.3B dense across, you end up seeing that, for example, in this case the 64-expert, 370-million-activated-parameters model is as good as a dense 1.3 billion model. So in some sense it’s actually not amazing returns, where you need to increase total parameters 100-fold to get the equivalent

[00:30:00] of 10x as many active parameters. >> Yeah, I mean, actually even more so. Yeah, it’s a huge increase in parameter count for a modest increase in quality. >> Yeah, so in this case actually it’s, what is it, 4x? >> 64x for 4x. Yeah. So while it is true, I guess, that you get this benefit of being able to economize on your compute time if you increase sparsity, naively it would seem like, oh, that’s a trade-off worth making. But here you’re decreasing this by 2x and then having this go up by 8x every time you double >> sparsity. So is that good or bad? Actually, um, even from a memory point of view, keep in mind you are doubling this portion of the memory fetches, which is amortized by batch, and so you just keep running a larger batch size. Um, from the point of view of the analysis we’ve done here, this is pure win. Keep

[00:31:00] doing it. Um, keep doing it until you run out of available users, basically. >> Mhm. Um, so there’s actually this equivalence between, if I want to go sparse, or if I have a lot of users, I can go to a much sparser model. So from that point of view, it’s a reasonable trade-off. Um, the other trade-off that shows up here is that it also consumes memory capacity. We’ve only reasoned about memory bandwidth here, but it also consumes memory capacity. >> So let me just make sure I understood. You’re saying we want, um, to spend less time computing, therefore we do more sparsity; to make that work we need bigger batch sizes, >> which means we need more memory capacity, um, >> to have more sparsity. >> Yeah. So I mean, maybe this would be a good point to actually talk about how a mixture of experts layer is typically laid out on, like, a rack of GPUs or something like that. >> Yeah. Yeah, makes sense.

[00:32:00] >> Yeah. Where were we? >> Uh, sparse mixture of experts. Um, maybe how we lay that out on a GPU. >> Yeah. >> So, um, let’s zoom in on the mixture of experts layer first and sort of draw what that looks like. >> So we typically will have some kind of a router layer, >> um, which is making the decision of where we route the tokens to. So we get tokens coming in here. They go through a router layer, and then we have a bunch of different experts. Uh, I’ll draw a few more, um, to line some up. Um, and then the router will make a decision: which experts am I going to route to? And it’ll be a small fraction of them, maybe one in 32. So maybe it’ll make a decision to route to this one, um, maybe this one, and maybe >> Mhm. uh, these experts. So each expert itself is a normal MLP. It has an up

[00:33:00] projection and then a down projection with a nonlinearity in between. Um, and then finally we do the inverse operation. So where we were broadcasting things out here, um, we’re going to bring them back in and sum them up. So bringing them in like this. Uh, and then finally we have our residual connection, so that the token is also passed through here and it gets added to the result of the MoE layer. So this is a normal MoE layer. Um, what I want to talk through is how this is mapped to, like, a GPU rack, um, and what this means for communication, uh, because I think this will start to show some of the limits of how fast we can go. >> Yeah. >> So, um, the standard practice here, and it is the best solution, is to use expert parallelism. So that means different experts go on different GPUs. So if we take something like a DeepSeek model, um, they have 256 experts. Um,

[00:34:00] let’s say we want to run that on a Blackwell rack. Um, so there are 72 GPUs. Um, we have a divisibility problem, this is not a power of two, so we’ll just simplify and say we’re only going to use 64 of them. Um, just ignore the other eight; it’s not a big deal. Um, and so we have four experts per GPU. Uh, very simple. Um, for the sake of the diagram I’ll actually just say, let’s say we have two experts per GPU. So we end up just putting, these are the GPU boundaries, every pair of experts on its own GPU. Um, and then we can look at the communication cost. We had some tokens stored centrally here, they get routed to all of these experts, um, and so there’s some communication cost paid here, and there’s the same communication cost paid on the output. Um, and then the hope is that this does not become communication-bound. Um, now what is the traffic pattern here?

[00:35:02] Um, the traffic pattern here is that any GPU, in fact, will be talking to any other GPU, depending on the decisions made by the model. So this is an all-to-all traffic pattern. >> So when you say any GPU, in this picture, >> yeah, >> the router is more than one GPU? Yeah, the router. So I drew this as one router. Uh, in reality, you would actually have many copies of the router, and so you would have as many routers as GPUs, in fact, >> as the incoming traffic. >> Yeah. So these are 64 GPUs, these are 64 GPUs, and it’s actually the same GPUs. We just draw them as separate because they’re serving different purposes. >> So at this point any GPU can be sending to any other GPU. So this all-to-all pattern of communication that shows up, and how the Blackwell racks are configured, is a perfect fit for the communication pattern that the MoE actually wants to do.
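A toy sketch of that layout and traffic pattern: 256 experts spread across 64 GPUs, with each token fanned out to a handful of them. The routing below is random purely to exercise the communication pattern; a real router is a learned layer, and the expert counts are only the ones discussed above.

```python
# Expert parallelism: experts live on different GPUs, so token activations get
# dispatched all-to-all and then combined back. Random routing for illustration.

import random
from collections import Counter

N_EXPERTS, N_GPUS, EXPERTS_PER_TOKEN = 256, 64, 8
expert_to_gpu = {e: e % N_GPUS for e in range(N_EXPERTS)}   # 4 experts per GPU

def dispatch_targets():
    """GPUs that one token's activations must be sent to for this layer."""
    experts = random.sample(range(N_EXPERTS), EXPERTS_PER_TOKEN)
    return {expert_to_gpu[e] for e in experts}

traffic = Counter()
for src_gpu in range(N_GPUS):           # every GPU holds some tokens...
    for dst_gpu in dispatch_targets():  # ...and sends each one to several GPUs
        traffic[(src_gpu, dst_gpu)] += 1

print(f"{len(traffic)} distinct (src, dst) GPU pairs carried traffic this step")
```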

[00:36:02] However, if you think, maybe one rack is too slow and I want to do two racks, then I have this challenge that, like, maybe I’ve got some sort of rack boundary drawn outside here like this. Um, and I no longer in fact have all-to-all communication between all the GPUs in the two racks. Um, and so the rack-to-rack communication ends up being a substantial bottleneck. So the fundamental thing here is that one rack actually bounds the size of an expert layer you can do. And so this has been part of what’s been driving towards larger and larger interconnect domains. >> Yeah. Um, before we go on, it may be worth you explaining what exactly a rack is, >> the differences in bandwidth between racks >> and within a rack, >> and the all-to-all versus not-all-to-all nature of communication within versus outside. >> Yeah. And this is a place where it starts to be very different, in fact, between Nvidia, for example, and Google,

[00:37:00] and then others, including us. Um, so generally, a rack is a physical structure. Um, it’s a few meters tall, a meter or two wide, depends on configuration. Um, and it stores some number of GPUs or XPUs, which is typically about 64. Um, what constrains it to being a certain size is power delivery, weight, and cooling ability. Uh, it ends up being about this size in many cases because of these physical constraints. Um, so then when I deploy a data center, like, I’ve got a data center that may have thousands of these racks. So I’ve got one of these tall racks, it’s got a bunch of GPUs in it, um, and so on. Um, and then I put another rack next to it. Um, >> you make it sound so easy. >> Yeah. Right. I just, like, drop them in. Um, in Nvidia’s case, um, the communication topology is

[00:38:02] actually that they put the GPUs on the outside of the rack and then they put these switches on the inside of the rack. So what this ends up being is that there’s a set of switches in here. Um, these are the NVSwitches. >> Mhm. >> And then they run a bunch of cables. Um, every single GPU has cables going to the switches in the middle. Um, so every GPU goes to the switches in the middle, and then the switches have connections to all the GPUs. So all of the GPUs can talk to all the other GPUs in just two hops: going to the switch, going to the other GPU. Now when I want to leave the rack, I end up going via a different path. Um, the GPUs also have a much slower connectivity, which is typically about eight times slower. Um, so the green that I drew here, in the GPU case, is NVLink. More generally it’s called the

[00:39:00] scale-up network. Um, this is the scale-up network. Um, you will typically also have a scale-out network, which allows you to connect to, like, some data center switch. Um, so, data center switch, and then all of the GPUs will have some connectivity up to some data center switch somewhere. >> Um, but this is the scale-out, and it tends to be about eight times slower in bandwidth. So the challenge, if you want to, for example, lay out a mixture of experts layer across two racks, is that half of the GPUs here are going to be wanting to talk to the GPUs here. And so, um, like, just on average, when I look at where the tokens on these GPUs want to go, half of the tokens want to go inside the rack, and that’s great, they can

[00:40:01] use the fast scale-up network, but half the tokens are going to want to leave the rack and go to the other rack, and that’s not as good. They’re going to need to use a much slower network. And so that becomes the bottleneck on the all-to-all pattern. Um, a different choice would be, well, why don’t I have a big switch here and connect everything to some much bigger switch that actually combines the two racks together? There are many ideas in this direction, but in general, the reason you have this sort of hierarchy of switches rather than one big switch is to manage the cabling congestion. Uh, you just need to run a large number of cables. >> Sorry, is that question you just asked basically, why isn’t it a bigger scale-up? >> Yeah, exactly. >> Why not just have, like, a million chips in scale-up? >> What has changed that has allowed Nvidia to go from, Hopper was eight, then Blackwell is 72, and now Rubin will be, is it 500 or something? >> 500 and something. Yeah.

[00:41:00] >> Um, what has allowed that to happen? >> Uh, from Hopper to Blackwell it is mostly just the decision to switch from trays as the form factor, or one of these as a tray, to racks as the form factor. That’s a product decision. There wasn’t a substantial technical barrier there. Um, >> uh, switching from the, like, 64 to 500 or so, um, there’s a bit of Jensen math there, but there is at least a genuine 4x increase, um, which is coming from a much more complicated and difficult rack design. And so that is actually, like, new physical design to run more cables. >> And the cabling complication is just the cost of figuring out which cable goes where, or, like, which signal goes from where? >> Let’s sort of zoom in on this and look at the wire density. Um, I’ll draw this diagram just once more, so we have a bit of a cleaner version

[00:42:00] to work with, and a larger version. Um, let’s say I have some switches in the middle. Yep. >> Um, and let’s say initially I’m going to start with just two GPUs on each side, or two trays of GPUs on each side. Um, and let’s say maybe each tray wants to have two cables coming out of it. Um, so I physically run vertical cables that look like this, running to the switches. Um, now if I want to double the number of GPUs in a rack, um, I need to run literally twice the density of cables. So, um, I need to run, yeah, these as well. Um, >> extremely naive question, but if you look at a physical data center, Mhm, it seems like there’s a lot of space within a rack. I don’t know. Just, like, the cables are like really big and >> Yeah. So there is space outside the rack. Inside the rack, like, as they become more optimized, these racks are very tight. So

[00:43:00] um, there’s connector density going from the tray into the rack and the rack’s backplane. Um, and then the backplane itself has a really high density. Um, there are other physical constraints, including, like, bend radius of cables, like, you don’t want to snap them, and so on. >> Yeah. >> It’s literally the physical space to put a cable that’s constraining it. >> Yeah. >> I had no idea. Interesting. >> Uh, that seems surprising, that >> the rack is so big and we just can’t stuff more cables in there. >> Yeah. So I mean, rack design is not my expertise, but when I talk to folks about what constraints they’re up against, it’s a combination of, so what are the big physical things you’re optimizing for? Space, uh, weight of the rack, like, it’s actually really heavy, and so you need enough metal to not sag and fall, but then you add more metal and it’s heavier. Um, and then power and cooling. And so all of those are competing, like, modern racks are pushing all of those to very extreme physical limits.

[00:44:00] >> Deep work is by its nature quite aversive. So even things which seem like work, like Slack and email, can be easy ways to distract yourself. So I often wish that I could just turn the internet off. But if I’m prepping for an interview, even if I have the papers and books on hand, it’s still super useful to be able to do a back and forth with an LLM so I can break down concepts and research follow-ups. Google’s new Gemma 4 is the first open model that allows me to have this kind of fully disconnected focus machine. It’s small enough to run on my laptop, but good enough to actually be useful. So, to prep for this episode, I downloaded Reiner’s scaling book and shut off the internet. I was able to have Gemma help me understand the material and answer my questions. If you want an LLM that you can run locally on your laptop or even your phone, you should check out Gemma 4. >> When was GPT-4 released again? It was 2022 or 2023. >> Three. Okay. And it was rumored to be over one trillion parameters, >> and it seems like only now, and within the last 6 months, have models been getting released that are significantly more parameters than a model released 3

[00:45:00] years ago. >> Yeah. When supposedly there should have been this scaling in the meantime. Is the reason that we were just waiting for racks with enough memory to hold the five trillion parameter model, along with its KV cache, for enough, you know, for a lot of sequences? Or RL, if you’re doing RL, kind of a similar consideration of actually holding the KV cache for all of >> the batch of problems you’re trying to solve. Um, so if you look at, like, Hopper, you had eight Hoppers, and I think the >> that’s 640 gigabytes, uh, as of 2022. Yeah. >> Um >> with Blackwell finally, which was deployed, what, 2024? >> very recently, maybe last year. >> Last year, >> you finally have a scale-up with on the order of, like, 10, 20 terabytes. >> Mhm. >> Which is enough for, like, a 5T model plus KV cache. >> Yeah. Um, deploying in larger scale-up domains is a huge unlock. Um, yeah, I mean >> I’ve drawn here the sort of Nvidia

[00:46:00] Blackwell deployment. Um, the Google deployment has actually had very large scale-up domains for a while. >> And that also explains why Gemini seemed to be ahead. Like, Gemini 2.5 was successful, or it just seems like Gemini has had that successful pre-train for longer than some of the other labs. >> Not having been there at the time, I’m not sure how much is coming from, like, successfully deploying higher sparsity ratios, which it could be. Um, it could also be, I mean, there’s a whole bunch of actual modeling things, of, like, specifically how do you do the mixture of experts. We’ve seen the DeepSeek mixture of experts: activating more experts, but finer-grained experts, was a big innovation. >> Um, I’m sure that there are many other innovations on the model architecture as well as on the training data; it’s kind of hard to disentangle all of them. But what shows up in terms of the limits of what you can do: the active parameters, as we saw, are limited by the compute cost, and then the total parameters are limited by the scale-up size.
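A rough capacity check for that last point: weights plus KV cache have to fit inside one scale-up domain’s HBM. The model size, KV figures, and per-GPU capacities below are illustrative placeholders in the spirit of the numbers discussed, not exact specs.

```python
# Does "weights + KV cache" fit in the HBM of a scale-up domain?

params_total    = 5e12        # a "5T"-parameter model
bytes_per_param = 1           # FP8 weights
weight_bytes    = params_total * bytes_per_param          # 5 TB

batch, ctx, kv_bytes_per_token = 2000, 32_000, 70_000     # assumed serving load
kv_bytes = batch * ctx * kv_bytes_per_token               # ~4.5 TB

domains = {"8x Hopper node": 8 * 80e9, "Blackwell NVL72 rack": 72 * 186e9}
need = weight_bytes + kv_bytes
for name, capacity in domains.items():
    print(f"{name}: {capacity / 1e12:4.1f} TB HBM, "
          f"need {need / 1e12:4.1f} TB -> fits: {need <= capacity}")
```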

[00:47:00] >> Yep. When you’re operating within a single scale-up domain, is that a consideration specifically for either forward or backward, or specifically for prefill versus decode? Or is it preferred to always be within a scale-up, >> whatever kind of workload you have, whether you’re doing a pre-training run, or whether you’re doing RL generation, or whether you’re doing inference for users? Yeah, really interesting. Um, so okay, to answer that question we’re going to need to talk about the communication patterns. Um, so we’ve talked about the mixture of experts communication pattern, that is this all-to-all. Um, all-to-all very strongly favors full connectivity, which is what we’ve kind of just shown here, and it favors being within one rack. Um, there

[00:48:03] are other kinds of parallelism besides expert parallelism, which we just showed here. In the literature there is tensor parallelism. With the trend towards smaller experts, this has become much less relevant, so we can ignore that. Um, but the other two things that we have available are data parallelism and pipeline parallelism. Um, and they can actually be a much better fit for using multiple racks. So let’s focus on pipeline parallelism specifically. Um, this is one layer, and I’m going to have, like, a hundred more layers up above. Um, I could decide at this point, for example, to move to a different rack, change rack. Now, is that going to become a communication bottleneck? So we can actually just solve for when this becomes a communication bottleneck. Um, but before we do that algebraically, let’s just sort of visualize it out and

[00:49:01] sketch the path. So we’re going to have a bunch, this is another layer, and we’re going to have another layer here, and so on. Um, so let’s say I change rack here, and then some number of layers later I change rack here as well. Um, so the methodology that we’re going to use to determine whether we have a communication bottleneck at this point where we change rack is: we’re going to compare the scale-out bandwidth requirements to the scale-up bandwidth requirements. >> Mhm. So let’s try this. And, I mean, the hint is going to be that there are a lot more sends here, like, we’re sending many things here, whereas we’re only sending one thing here, and then we’re also maybe doing it many times. So that’s going to be what makes the difference. >> Uh, can I try to guess, just out of curiosity, to see if I’m actually

[00:50:01] understanding? Um, it seems like you’re sending, like, >> batch size into the rack >> in here. >> Yes. >> But the communication within a rack is sort of batch size times the number of GPUs. >> Yeah. So the number of activated GPUs, right? So, like, I don’t send to this GPU at all, right? So there’s an explosion from one to, like, it’s three times larger here in this diagram. >> Yeah. >> Um, the key thing is that I didn’t even need to send to this GPU at all, and so that’s a big saving. >> I see. Yeah. >> Okay. So we’re going to talk through what the slowdown is, to what extent scale-out is a bottleneck relative to scale-up. So we will directly jump to the ratio of the time spent on scale-up

[00:51:00] over the time spent on scale-out. So this is the quantity we’re talking about. Um, and the first consideration is that scale-up is eight times faster than scale-out, generally. And so, at a baseline, if the amounts of data were the same, we would have this one over eight, which is coming from the bandwidth ratio. But then we have some amount of expansion in how much data we’re sending. So if one token comes in here, >> then this one token gets routed to, in the DeepSeek case, maybe 32 experts, or 16 experts, it gets routed to some number of experts. So this is the number of activated experts. Um, and then

[00:52:03] the same thing also applies on multiple different layers. So maybe I’m going to run two layers. So, um, there’s also a multiplier: the number of layers per stage. >> And you need to multiply the whole thing by two for the, um, >> Yes. Yes. And there’s a factor of two. Thank you. Um, so what we would like is for the scale-up time to be greater than the scale-out time, um, because the scale-up time is the more important and precious resource. And so we would like this number to be greater than or equal to one. Um, and this really doesn’t seem hard: there’s just a factor of eight that we need to overcome. So we need the product of these three things to be bigger than eight. Um, typically we have a fairly large number of activated experts, could be eight, um, by itself. Um, and then we can increase the number of layers per stage a lot, until we satisfy this. >> I see.
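The ratio just written down, as a quick check. The 8x bandwidth gap and the roughly eight activated experts are the figures used in the discussion; the layers-per-stage value is an assumed choice to show how easily the product clears one.

```python
# Scale-up time vs. scale-out time at a pipeline-stage boundary. We want the
# ratio >= 1 so the slower cross-rack link is not the bottleneck.

bandwidth_ratio   = 1 / 8    # scale-out is ~8x slower than scale-up
activated_experts = 8        # in-rack fan-out per token per MoE layer
layers_per_stage  = 2        # assumed layers run before crossing to the next rack

ratio = bandwidth_ratio * activated_experts * layers_per_stage
print(f"scale-up time / scale-out time ~ {ratio:.1f}  (>= 1 is what we want)")
```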

[00:53:00] >> Um, so what this ends up looking like is that I can in fact have an entire pipeline of racks, where one rack does one layer, and then I move on to the next rack and I do another layer, and then I move on to the next rack and I can do another layer. >> It’s interesting to me that the best parallelism strategy in practice ends up being one which physically resembles the actual architecture. It’s not some galaxy-brain thing, you know, it’s like, oh, we have experts, we’re going to put them on different GPUs; oh, we have different layers, we’re just going to put them on different racks. I feel that’s interesting, that the physical and >> the model architecture match, like, the cutting matches the model architecture. >> Yeah, exactly. >> Yeah. I mean, it could have been something wackier with tensor parallelism and whatever. >> Yeah. So I mean, I think a way to think of it is, okay, the galaxy-brain way to think of it is, what are all the different dimensions in which a model is scaled up? Um, and so: it is scaled up by layers, it is scaled up by the model dimension, it is scaled up by the d_ff dimension, it is scaled up by the number

[00:54:00] of experts. Um, every single one of those numbers you can choose to cut along. Um, and if those numbers are big enough, it eventually becomes profitable to cut there. >> Um, and we have selected two of them. The other two, in the way typical models are typically sized, are not profitable. >> So there’s a talk by Ilya where he says, today we know not to do pipeline parallelism. And Horace gave my friends and me, I hate how that sounds, but he gave us a lecture on these different kinds of parallelisms, and he said the problem with pipeline parallelism is that, other than the bubbles, it creates these architectural constraints, yes, >> on, um, like, Kimi for example has these residuals where attention attends to >> a few back or something >> yeah, layers a few back, and so that becomes hard to implement in this way. >> Yeah. Um, so, and I guess we didn’t really fully articulate even what is the benefit that we’re getting from

[00:55:00] pipelining. >> Yeah. >> Um >> uh, and so these complexities are real. Pipelining is a massive hassle. Uh, but it does give you some benefits. Um >> and then you can decide whether those benefits are worth the costs. Um, the biggest benefit that shows up, so it has some benefits in inference, maybe bigger benefits in training. Um, in inference, what are we saving on? Are we saving on memory time or compute time? Not really. We’re just moving the memory time from one chip to another chip, or one rack to a different rack. There’s no actual benefit in runtime. Um, however, what we are saving on is the memory capacity, the amount of memory used per rack. If we think that the memory in a rack is a bottleneck, then there’s a constraint on how fast we can go. Um, pipelining allows us to massively reduce that bottleneck. >> I guess the

[00:56:00] opposite connotation to this, which, actually, before this interview I was chatting with Axel, who’s a GPU performance engineer at Jane Street, and he was explaining, well, to do pipelining you have to do microbatches rather than full batches. >> Mhm. And if you do microbatches, then you’re by definition not able to amortize loading the weights. That’s right. Across >> all the users or all the sequences. And so the positive connotation of that is you don’t have to use as much memory. The negative connotation is that we can’t amortize loading the weights across all those users. Maybe it’s worth explaining why you have to do microbatches. >> So let’s draw the pipeline bubble. Um, yeah. >> Okay. So why do we do this microbatching that shows up in pipeline parallelism? So, um, I’ll focus on inference first, it’s a slightly simpler problem. Um, so I’m going to draw, so this is

[00:57:00] time, um, and then this is which rack we’re on. And so the idea is that maybe I’ll have, like, four racks. So I’ve got an inference that is going to step through these four racks in some time, like this. So this is inference number zero. Um, it runs at a certain batch size and it steps through all the pipeline stages like this. Now, if we were to say, well, we’re going to run inference number one here, like this, that is clearly a massive waste, right? Like, three-quarters of the time each of the racks is doing nothing. So we don’t actually run inference one here. We run it as soon as we can, which is immediately after inference zero finishes, like this. Um, and so we keep going. So if we hadn’t filled this in, we would call this the pipeline bubble. Um, when I’ve drawn it in this inference context, where we’re only going in a forwards pass, it’s, like, obvious: why would you do the stupid

[00:58:01] thing? >> But in a training context uh it’s maybe less obvious. But in the inference context, it’s it’s sort of really natural to to make this change. >> Oh, interesting. So, this sort of obvious, but um the difference between microbash and bash doesn’t matter at all in inference because you can just call whatever you want, whatever. >> Yeah, >> it it only matters in training because there is an optimal batch size. >> Yes. And before you do the backward step, you want to have accumulated before you do a full backward step, you want to have accumulated all the sequences in that bash. And if you want to do pipeline and training in order to avoid that bubble, you need to >> should we draw the the training diagram? Yeah, let’s do that. Let’s do that. Um, >> so so this is the inference diagram and I’ll call this four just so we don’t have the wrong thing showing up there. Um, so let’s do the same thing for training. Now we’ve got a forwards pass, but at some stage we’re going to have to transition to a backwards pass. So we’ll

[00:59:01] do some number of batches in the forwards pass, and then we’re going to transition to the backwards pass for everyone, all in one go. So the inference part is the same here, but then we do a hard stop at this point and then transition everyone to the backwards pass, with similar numbering, like this. >> It may be worth clarifying: the reason there is that hard stop is because you want to do a whole batch at once for the backward step, >> and then there is an optimal size for how big that batch should be. >> Yeah. I mean, smaller is always better, actually, is a way to put it, like, from an ML convergence-rate perspective, smaller is always better, because basically you’re getting the freshest information from the gradient descent. >> But from a total training time perspective? >> From a total training time perspective it’s worse,

[01:00:00] like, smaller is worse from a systems perspective, and so the optimum is the trade-off between those two. >> So you pick a batch size, and then for that batch size you do some amount forwards and then some amount backwards. >> Yeah. You asked why there is even a hard stop there in pipeline parallelism. It's because of this, the fact that you've got this idle time here, which is the bubble. There are so many techniques in the literature for how to lay this out differently and avoid that; there are more complicated schemes called zero-bubble or one-forward-one-backward, which interleave the forwards and the backwards in complicated ways, but >> you can mine bitcoin in that gap. >> Right, right. More usefully you can do the weight-gradient step there. But anyway: in inference, the effect of pipelining on anything you care about, like batch size or latency, is actually neutral. It doesn't improve it, it doesn't make it worse. So if you look at the latency of

[01:01:01] this inference, pipelined versus all on one rack: if it were all on one rack, we would just slide all of the boxes down and still put them in a row, and the latency would be the same. So pipelining is neither better nor worse for latency. But it does mean that you use less memory capacity per rack, because now instead of needing the whole model you only need a quarter of the model. >> Makes a ton of sense. So basically a no-brainer to use pipelining during inference, but there's this hardware trade-off during training. >> Well, even in inference it is in fact not used a ton. It reduces your memory capacity requirements, but there's actually a huge surplus there: a rack of Blackwell has many terabytes, maybe tens of terabytes, of memory, and that's much bigger than, say, a trillion-parameter model, which only needs one terabyte. And so it already fits in

[01:02:00] fact. And so there's not a huge benefit from pipelining, because you're reducing a number that's already pretty small. >> But it does say that theoretically maybe you had too much memory, and maybe you could have built different hardware that has less memory. >> In fact, if you were designing your hardware and you said, I actually don't need that much memory, because I don't need the weights to fit in one rack, I can fit the weights in eight racks, then I could maybe have built hardware that didn't have so much HBM per GPU. >> Last week, Horus was kind enough to give me and my friends a great lecture on large-scale pre-training systems. And there were some concepts that I wanted to animate for a write-up on my blog, like how weights shard and gradients flow depending on the parallelism that you're using. So I gave Cursor my lecture notes and a sketch that I made during the lecture, and I asked it to visualize a specific hierarchical collective that Horus had explained. The first version was already pretty good, and then I was able to use design mode to select and tweak any specific components from

[01:03:00] there. I was able to do all of this without a clear end state in mind. Cursor's Composer model was fast enough that I was able to iterate almost instantaneously. I could try an idea, test the results in the built-in browser, and immediately make any changes. I went through 10 different versions in under 20 minutes. If you want to check out this animation, I published it along with the lecture notes in a blog post. The link is in the description. And if you want to try out this kind of iterative design flow for yourself, go to cursor.com/larch to get started. So, macro question: everybody's talking about the memory wall right now. Memory is getting super expensive. There's not enough memory. Smartphone volume will go down 30% because there's not enough memory. Hyperscalers are spending, and this is shocking, Dylan said they're spending 50% of their capex this year >> on memory. >> On memory. That's believable, yeah. >> So what is hyperscaler capex? It's like high hundreds of billions, maybe a trillion, and they're spending half of that on memory. Okay, so that is a huge constraint. That's why we're not going to get new laptops and phones this

[01:04:00] year. >> But at the same time we have too much memory. Like, people are willing to put too much memory into these systems, right? So why is Jensen shoving all this memory into these racks if >> Yeah, if you don't need it. >> Yeah. So in the equations we had up here before we erased them, we were doing memory time: memory bandwidth and compute bandwidth. Let's now start looking at memory capacity. >> Yeah. >> We'll start off with just memory capacity, without even thinking about the parallelism scheme. So the capacity of memory, or really the demand on memory, is the number of total parameters, which is what we need to fit the weights in whatever system we're using, plus we need to fit the KVs as well. And the KVs go as batch size times the length of the context times the bytes per token.
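(A minimal sketch in Python of the memory-capacity demand just described: weights plus KV cache. All of the numbers below are illustrative assumptions, not figures from the conversation.)

```python
# Total memory demand = weights + KV cache.
n_total_params = 1e12        # total (not active) parameters, ~1T, an assumption
bytes_per_param = 1          # assume fp8 weights
batch = 2048                 # concurrent sequences, an assumption
len_context = 128_000        # tokens of context per sequence, an assumption
kv_bytes_per_token = 2048    # ~2 KB/token; a figure in this range is derived later on

weight_bytes = n_total_params * bytes_per_param
kv_bytes = batch * len_context * kv_bytes_per_token
print(f"weights: {weight_bytes / 1e12:.1f} TB, KV cache: {kv_bytes / 1e12:.2f} TB")
```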

[01:05:04] Okay, so what I was arguing in this context, the case I was making for pipelining, is that there are techniques that allow us to solve this. So let's consider: we're going to run this on some number of GPUs, and we're going to have one extent, E, which is the expert parallelism. When we had this sharding of the expert layer across many GPUs, to what extent do we do that, across how many GPUs? We'll say that this is, for example, 64. And then P is going to be the extent of pipelining, the number of racks, which, who knows, maybe we'll pick four or something like that. So what we want to calculate, this is

[01:06:00] the total memory requirement across the system. But now I'm going to calculate a memory requirement per GPU. So for the per-GPU memory requirement, I guess I'll use a lowercase c_mem. And obviously we just take all these numbers and divide by E times P, really easy. So it's N_total plus the batch size times the length of context times the bytes per token, all of this divided by E times P. Okay, so why is it correct to divide it this way? Well, we're saying we knew that the parameters were perfectly divided amongst all the GPUs in a rack, and the layers are perfectly

[01:07:01] divided amongst the different racks. So that works here, and somehow, I'll handwave exactly how, we can arrange the same perfect sharding of the contexts across GPUs in a rack, and then by layer across racks. >> And sorry, P is the number of racks? >> Yeah, for example. So this is the place where we actually need to go back and analyze this batch size B, and you were making this comment that there's micro-batching versus global batching. So let's come back to this pipelining diagram here. We've got one batch going forward here, and then as I drew it, it kind of just disappeared. That's not really correct. If you think about how decode is working, I have a bunch of tokens that I have generated already. I do one forwards pass where I generate a new token, then I write that to my KV cache, and then I do another forwards

[01:08:01] pass that generates the next token. >> So you're actually going to be running this batch zero in a loop. >> In fact, once I finish going forwards I can start the next iteration of the loop up here. Yeah. So we'll just fill this in; we've got the little twos and threes. So let's split this batch. This B will be the global batch size, so B is going to be the number of micro-batches times the batch size per micro-batch. How many micro-batches do we need? The number of micro-batches in this diagram is four: 0, 1, 2, 3. And then the batch size per

[01:09:03] micro-batch, the micro-batch size, this is still that 2,000-ish number. >> Mhm. >> Sorry, no, this is the 300-times-sparsity number. >> This is how big the train is that departs every 20 milliseconds, right? >> Yes, this is the 20-millisecond train. So the global batch size is the number of micro-batches times the local batch size. The local batch size is set by this hardware parameter. And the number of micro-batches, well, the number of micro-batches is as small as possible such that we can wrap around and not leave any idle time when we wrap around. If we had fewer, we would have this idle time when we wrap around, and so you can just visually see that it is equal to the number of pipeline stages. Sort of proof by visual here: it is four, and it's four this way as well; you can look and see that it goes along here and then it wraps around after the number of pipeline stages.
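(A tiny sketch of that bookkeeping, with assumed round numbers: a 2,048-sequence micro-batch and four pipeline stages.)

```python
# Global batch = number of micro-batches x micro-batch size, where the number of
# micro-batches needed to keep every pipeline stage busy equals the stage count.
pipeline_stages = 4                   # P, e.g. four racks
microbatch_size = 2048                # the "20 ms train", set by the hardware roofline
num_microbatches = pipeline_stages    # just enough to wrap around with no bubble

global_batch = num_microbatches * microbatch_size
print(f"global batch size: {global_batch} sequences")   # 8192
```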

[01:10:00] >> A very basic question: is this what is actually done? Like, a frontier model today will actually have pipelining during inference? >> For sure during massive-scale training this is done. It can be done for inference; I'm actually going to make the case for why it is less attractive there. It is useful for the weights but not so useful for the KVs. The big challenge is, so let's fill this in, that the number of micro-batches here ends up being equal to the number of pipeline stages. >> Yeah. >> When we go back and substitute all of that in here, we get the number of pipeline stages times this little b showing up in here. And then when we

[01:11:00] factor this out, I'm going to split this into two terms. We get the full division by E times P over here. We still have division by E times P over here too, but the P's cancel, this P and this P. And so what we find is that if you increase the number of pipeline stages, the memory footprint for the weights keeps going down and down and down, but the memory footprint for the activations stays constant. So it doesn't actually work: once you do enough pipelining, and it's really not much, even two is often enough, this weight term becomes very small and the KV cache becomes the dominant term.
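(A minimal sketch of that per-GPU accounting, with assumed round numbers: a ~1T-parameter fp8 model, a 2,048-sequence micro-batch, 128k-token contexts, ~2 KB of KV per token, and E = 64. The point is that the weight term shrinks with P while the KV term does not, because the global batch grows with P.)

```python
# Per-GPU memory under expert parallelism E and pipeline parallelism P.
# The global batch is P x microbatch, so the P's cancel in the KV-cache term.
def per_gpu_memory_bytes(n_total_params, microbatch, len_context,
                         kv_bytes_per_token, E, P):
    weight_term = n_total_params / (E * P)                        # fp8: ~1 byte/param
    kv_term = microbatch * len_context * kv_bytes_per_token / E   # independent of P
    return weight_term, kv_term

for P in (1, 2, 4, 8):
    w, kv = per_gpu_memory_bytes(1e12, microbatch=2048, len_context=128_000,
                                 kv_bytes_per_token=2048, E=64, P=P)
    print(f"P={P}: weights {w/1e9:5.1f} GB/GPU, KV cache {kv/1e9:5.1f} GB/GPU")
```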

[01:12:00] >> Yeah, I know this is wrong, and I'm trying to work out why my logic here is wrong. If you're pipelining through many different stages, the KV values are not shared between layers. So why would it not help to be pipelining across multiple layers, because then you don't have to store >> Yeah, you only need to store one layer rather than two layers of KVs, right? So it helps from that perspective, you're right. What's competing with that, though, is that you need to be keeping all of the racks usefully busy at a time, and so the number of sequences that are in flight simultaneously has gone up. >> Ah, yeah, makes sense. >> So those exactly cancel, and you end up not getting a saving per GPU. >> Right. This is going back fundamentally to the point that you're not able to amortize across KV caches. >> Yeah. Well, first we said you can't amortize KV caches across batch size, and now we're saying you also can't shard them across pipeline stages. It sucks from both of those points of view. >> Yeah. Interesting. Okay, so then what is done during inference? >> So the DeepSeek paper reports what they do, which is that they just do a lot of expert parallelism. In effect you

[01:13:01] should increase your expert parallelism up to your scale-up domain size, and then do very little pipelining. Maybe none at all, maybe two, just enough to make the weight storage not too big of an issue. Those are the only two parallelisms that really make sense. In the past there was tensor parallelism, which cuts up within an expert, but the experts are so small now that that is not a profitable optimization. >> So this goes back to the question: does that mean that frontier labs, when they're doing inference, are basically within a single scale-up? >> Yes. I mean, it depends on model size. You could have a very large model, one that exceeds the memory of a rack, and there you should be doing a bit of pipelining; maybe it's extremely sparse, for example, and that would be a reason to do it. >> So I guess this goes back to the promise at the beginning of the lecture, which was that this will actually

[01:14:01] tell you about AI progress as well, to the extent it is the case that model size scaling has been slow until recently. Let me make sure I understand the claim. The claim would not be that you could have trained across more racks; it was just that it would not have made sense before, like we didn't have the ability to do inference for a bigger model easily. >> Actually, so, pipelining doesn't help with context length, but it totally helps with model size. And so, because of the ability to do pipelining, a rack at least should not be a constraint on your ability to fit the model parameters. I guess the other consideration you're asking about is why hasn't it scaled up more, and why did bigger scale-up domains help. We talked through one aspect of that: we said it's not because of memory capacity. We have a solution to memory capacity, at least with respect to model size. Not with respect to KV cache size, but at least with respect to model size we have a solution to memory

[01:15:01] capacity. The other issue that shows up is latency. >> I was just about to ask: what is the latency cost per hop, going from rack to rack? >> This is very much dependent on the hardware. I can't say with a lot of authority; I think it's probably on the order of a few milliseconds, but it could be off by an order of magnitude. >> Is four a realistic number of pipelining stages you might have? >> Yeah. >> Okay, so that's not >> With a small number of pipelining stages, this is not a huge latency impact. >> Wait, I guess it's 10 milliseconds per token, 2 milliseconds times 4-ish, or however many you said. >> Yeah. 10 milliseconds per token is actually a lot. >> Yeah, if it goes from 20 to 30 milliseconds, right? Or something like that. >> Yeah. So just to chart the path that it goes through: here you're going from your GPU or TPU or whatever to a network card,

[01:16:03] which then goes to a top-of-rack switch, and then hops over to the other rack and does the same thing in reverse. So you have to sum up the latencies of these different things. >> So this is the same path as the datacenter network? >> It may in fact go up to a datacenter switch and back; it depends on the deployment configuration. >> Got it. And because decode is sequential, they also stack up across the stages; you can't do them at the same time. >> That's right. >> Okay, so I guess this brings us back to the question then: is the size of the scale-up at all relevant to why AI model sizes have been what they have been over the last few years, whether through training or through inference? >> Yeah. So we talked about the latency of this hop. There is also the t_mem latency, the memory time latency, which is actually substantially, like massively,

[01:17:01] improved by larger scale-up domains. So I'll recall t_mem down here, t_mem for the weights. This was equal to the number of total parameters divided by the memory bandwidth. Which memory bandwidth are we talking about here? Is it just one GPU? It is in fact the bandwidth of all the GPUs that I can use in parallel to load these weights. I can't use different pipeline stages in parallel, because they're not running at the same time, but I can use all the GPUs in my scale-up domain in parallel to load the weights. And this is actually extremely effective. So basically I end up with a term here: this memory bandwidth term itself is equal to

[01:18:00] the scale-up size >> times memory bandwidth per GPU. >> Yeah, times per-GPU bandwidth. And the per-GPU bandwidth term doesn't increase a lot; it maybe increases 1.5 or 2x per generation. But the scale-up size increased by like a factor of eight. >> So the reason the bigger scale-up matters is not the memory capacity of the whole scale-up, but really the memory bandwidth. >> Yeah. Pipelining totally solves the capacity problem, but scale-up size helps solve the bandwidth problem. >> And the bandwidth problem helps you do longer context lengths, which is more and more relevant as these models get more agentic. >> Yeah, and it lets you just run the model at lower latency as a first thing; if I run a very fast model on a little H100 box, the latency will be really high. >> Yeah.
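(A rough sketch of that effect, under assumed hardware numbers: roughly 8 TB/s of HBM bandwidth per GPU and a 1T-parameter fp8 model, both round guesses.)

```python
# Weight-loading floor per decode step:
#   t_mem(weights) = N_total / (scale-up size x per-GPU HBM bandwidth)
n_total_bytes = 1e12        # ~1T parameters at 1 byte each (fp8), an assumption
hbm_bw_per_gpu = 8e12       # ~8 TB/s per GPU, a ballpark figure

for scale_up in (8, 72):    # e.g. an 8-GPU server vs. an NVL72-style rack
    t_mem = n_total_bytes / (scale_up * hbm_bw_per_gpu)
    print(f"scale-up of {scale_up}: ~{t_mem * 1e3:.1f} ms per token just to stream the weights")
```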

[01:19:02] Okay, a super tangential question. There's chinchilla scaling, which tells you how big a model should be relative to the amount of data you're going to train it on. But now obviously you're not just trying to optimize for the highest quality model you can get with training compute; you want the best results a user can get with a mixture of training and inference compute. >> Mhm. >> So then there's a question of how much you should overtrain a model, such that the compute amortized over training and inference is minimized to get a certain performance. But now with RL there's another consideration, which is that you're going to do some amount of pre-training, and that pre-training will be used both for RL generation and then for inference for the final user. And by overtraining here I mean: while it would have been more efficient, just from a training-compute perspective, to have a bigger model that you train for less time because it can learn faster, instead you get a smaller model and spend more compute on it than you otherwise would have, but now it's cheaper to give it to users. So, to make the question more concrete: how much more than chinchilla-optimal are models overtrained,

[01:20:01] >> and has that changed as a result of RL generation? >> This is a place where we have to do a bit of guesswork, because the updated scaling laws and the model traffic are not reported, and so we have to guess. But one way to look at it: let me first make a general heuristic claim. If I have a total cost which is a sum of cost A and cost B, maybe this is the training cost and this is the inference cost, and I want to minimize this sum, then for many curves the minimum tends to be where the costs are equalized. That's something of a heuristic claim, but there are many examples where it's true, like where one is 1/x and the other one is x, for example. They tend to be minimized at the point where they equal each

[01:21:02] other. It's also true for e^x and e^(-x) and all kinds of other things. Basically, I've got some curve that's going down and some other curve that's going up, and the sum tends to be minimized at this equal point. Heuristically, I will conjecture that that is true for the setup you described as well. Actually showing it would require looking at the scaling laws and fitting these weird exponents, but things that follow power laws tend to have this property. So I'll just make that claim and move on. So we're going to say that the cost of training plus the cost of inference, we want to equalize these. We'll do pre-training only first, because, well, actually we can do all of it in general. So we'll cost it as: the cost

[01:22:00] of pre-training. So: the number of active params times the data in pre-training. That's the cost of pre-training, and there's a factor of six out front, which is the number of flops per parameter per token, the famous 6ND formula. And then in RL we have approximately the same thing. We've got the same number of active parameters, but now the amount of data is the RL data, and there's this extra inefficiency multiplier, >> which is the fact that you're not training on all your rollouts. >> Well, yeah, there's that. And then the other, perhaps even bigger, inefficiency is that this involves a substantial amount of decode, and decode often runs at lower MFU than training. >> Okay. So if you're doing a backward pass

[01:23:01] on every single generation in RL, it would be 6ND. >> Yeah. So this could be a smaller number, right? >> It would at least be two, so somewhere in the range of two to six. We'll just say somewhere in the range of two to six and leave it at that. >> Yeah. And then we can add in the inference cost. The inference cost is two times the number of active params times the data in inference. >> I think the way I said that was super garbled, so just for the audience: forward plus backward per parameter is six. >> Forward alone is two. >> That's why RL, where you're definitely going to generate all the trajectories but you might or might not train on all of them, is two to six? >> Yes. Thank you. And then inference is just two. >> Yeah. >> So we're going to solve for, essentially, equality of all three of these terms; that is ballpark where people are going to be. >> Labs have more information on what

[01:24:00] is productive in doing more RL versus doing more pre-training, for example. I don't have that information, but I think a good ballpark is a 33% split between each of them. >> Actually, I'm not sure I understand the intuition for that. Another naive model could have been that RL plus pre-training would be 50% and inference would be 50%. >> Yeah, that's also a valid answer. Because this is a heuristic, I can't really argue for one versus the other; they don't differ by that much, like 33% versus 25% is only a small factor. So let's pick one of them; all equal seems simple enough. And so we're just going to solve for equality of them. It's pretty straightforward. We can immediately see that the number of activated parameters totally disappears, so let's factor that out. And we're going to say that, okay, I decided to do it your way, it's a little bit nicer actually: data in pre-training plus, oh, I didn't

[01:25:03] have the inefficiency over here either. So: data in pre-training, plus some multiple, alpha, times the data in RL, is going to end up equal to some beta times the data in inference. Then let's just roughly size the alpha. It's going to be maybe somewhere in the range of 2 to 6 over 6, from comparing this term to this term. And then we've got an inefficiency term, which I would say is maybe in the range of 30% or something like that. So this alpha is going to be something like 1 in 10, say 1/10, and this beta here is actually

[01:26:02] the same: it's a third times 30%, so it also equals roughly 1 in 10, something like that. >> If both of them are 1 in 10, that kind of implies that there's never a backward pass in RL. >> Yeah, okay, we can make this one 2 in 10, make it a bit bigger. So just to write it out once more: this is 2 over 10, this is 1 over 10. So the number of inference tokens you have, and this is just a function of, say, I've got hundreds of millions of tokens per second, times my model is deployed for, I don't know, two months before I shift to the next version, that should determine the number of tokens in RL and pre-training. And then I guess we didn't do the equivalence between pre-training and RL, so we'll do that here: data in pre-training should be equal to 2 over 10 times the data in RL for them to be cost-equivalent. So,

[01:27:03] sorry, I got this one backwards: we pay more cost when it's inefficient, so this needs to be one over. So, tracing this back through, this thing as written here ends up being like 1.5, and this is one. >> Billions of dollars worth of compute just flowed the other direction. >> Yeah, right. I think if you do it with a spreadsheet and actually work it out, you might notice when the money is going down the drain. So I think all of these end up being close as modeled here; that 30% may have been a little bit too generous, so let's say something like 1.5 here and leave this as a one here. At this point you can almost read it off: the number of inference tokens should be about the same as the number of pre-training tokens, which should be about the same as the number of RL tokens, within factors that we're not able to reason about.
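(A rough sketch of that equal-cost bookkeeping with the 6ND/2ND flop counts. All of the constants below, the active parameter count, the pre-training token count, the 2-to-6 RL factor, and the RL inefficiency, are the ballpark guesses from the conversation, not measured numbers.)

```python
# Equalize pre-training, RL, and inference compute under the 6ND / 2ND accounting.
n_active = 100e9          # assumed active parameters
d_pretrain = 150e12       # rumored pre-training token count

cost_pretrain = 6 * n_active * d_pretrain       # forward + backward on every token

# RL: 2-6 flops per active param per generated token, times a machine-inefficiency
# factor, since decode-heavy generation runs at lower MFU than training.
rl_flops_per_param_token = 4
rl_inefficiency = 1.5
d_rl = cost_pretrain / (rl_flops_per_param_token * rl_inefficiency * n_active)

# Inference is forward-only: 2 flops per active param per token.
d_inference = cost_pretrain / (2 * n_active)

print(f"RL tokens for equal cost:        {d_rl / 1e12:.0f}T")
print(f"inference tokens for equal cost: {d_inference / 1e12:.0f}T")
# Within small factors, pre-training, RL, and lifetime inference token counts end up comparable.
```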

[01:28:00] >> But then, sorry, this is a basic point, but it seems like there should be fewer RL tokens than pre-training tokens? >> Yes, that's in general right, because RL is less efficient in terms of machine time, and so if you're trying to equalize the RL and pre-training time, then you should have fewer tokens in order to have the same wall time. >> This is all quite interesting; I never thought about it in terms of equalizing in terms of data. >> I mean, I think starting with equalizing in cost is right, but depending on how you model the cost, this comes close to equalizing in data. >> So, for GPT to be trained optimally, across every single user who uses GPT-5, the total amount of tokens that they stream should equal the total amount that went into pre-training. >> Yeah. >> And the total amount of tokens that went into pre-training is the sum of all human knowledge. So each model

[01:29:02] should generate as much as the sum of human knowledge on its output, having gotten it on the input. >> Yeah. So which way are people going to err? If you think that people's power of prediction is not perfect, and also that you run the risk of making a model that is not a frontier model and you just throw it away, then that kind of changes the cost trade-off, because there's some probability that applies to the inference, and you should derate the inference tokens by some amount. >> Right. And then can we back out how much more compute than chinchilla-optimal goes into a given-sized model? >> I think we just have to make some real-world assumptions here in order to do that. So, the inference tokens we should totally be able to estimate. Let's say 200 million, I don't know, maybe it's like 500 million tokens a second now, I don't really know. 500 million tokens a second times a model is deployed for two

[01:30:00] months before it becomes obsolete, I don't really know. I can't do this in my head, can you compute it? >> 2.6 x 10^15. >> Okay, 2.6 x 10^15. This number is probably too large, because this is going to be spread over multiple models in a family. So let's make it five or ten times smaller, something like that. Okay, so we're estimating maybe 50 million tokens per second per specific model, and the model is live for two months. That comes out to around 200 trillion tokens. And then we want to compare that to the active parameters on a frontier model. I don't actually know the latest rumors, but

[01:31:00] do you know? >> Somebody told me 150 trillion. >> Active params? >> Sorry, I meant tokens. >> Trained on 150 trillion tokens, interesting. Which is similar. >> Yeah, that's actually similar. So, data in pre-training. >> This is not well cited, but >> you want me to not remove it? Okay. >> And I think the number of active params could be in the range of 100 billion, something like that, maybe a bit larger. So I'm assuming about 100 billion active params, and we multiply by 20 to get the chinchilla token count, so the chinchilla-optimal data would be around two trillion tokens. And we see that we're at 100 times larger than that. >> Actually, what does the chinchilla number mean here? >> The token count for pre-training that the chinchilla scaling law would recommend, I guess. >> Oh, I see, so how much is it overtrained. Got it. >> So yeah, the ratio of this 200

[01:32:02] trillion or 100 trillion tokens over the chinchilla-optimal two trillion, and that's the amount it's overtrained, so effectively 100x overtrained, perhaps. >> Okay, so to the extent this is in the right ballpark: just by thinking, okay, you kind of want everything to be equal in terms of compute, if OpenAI also realizes that and they're serving a certain number of tokens per second, that tells you how much data went into the pre-training of GPT-5. Even if it's like 50% off or something, it is sort of wild that you can first-principles these kinds of numbers. >> This is also why you should just approximate everywhere, because there are such big error bars on this. But yeah, it's kind of empowering to just set A equal to B and figure it out. >> Yeah. That's super cool.
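(The back-of-envelope, using the guessed serving rate and deployment window from the conversation, plus chinchilla's roughly 20-tokens-per-parameter rule of thumb.)

```python
# How overtrained is a frontier model if lifetime inference tokens ~ pre-training tokens?
tokens_per_second = 50e6                    # guessed serving rate for one specific model
deployment_seconds = 2 * 30 * 24 * 3600     # ~2 months before the next version ships
d_inference = tokens_per_second * deployment_seconds

n_active = 100e9                            # assumed active parameters
d_chinchilla = 20 * n_active                # chinchilla-optimal tokens, ~20x params

print(f"lifetime inference tokens:  {d_inference / 1e12:.0f}T")
print(f"chinchilla-optimal tokens:  {d_chinchilla / 1e12:.0f}T")
print(f"overtraining factor:        {d_inference / d_chinchilla:.0f}x")
```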

[01:33:00] Okay, so in this spirit of trying to deduce things: we can publicly look up the prices of the APIs of these models, and maybe we can learn something from that. First, on longer context: Gemini 3.1 is 50% more expensive if you go over 200k tokens than if you're below 200k tokens. At a high level I understand why that might be, but why specifically 50%? >> Yeah, so, why specifically 50%. The high-level reason in the first place is >> that there is some amount of increasing cost with context length. >> Yeah, and we can bring that back up. That was the memory time versus the compute time. Okay, so we've put up the same equations from before: the time for the memory fetches, which is the weights and the KV cache, and then the time for the compute, which is just the matrix

[01:34:00] multiplications for the weights. I will also draw the cost curve, but this time as a function of context length instead of as a function of batch size. So this axis is time, and this is the cost curve as a function of context length. We'll draw the compute first: the cost of the compute is actually constant as a function of context length; there's no dependence here on context length. In reality there is some dependence, but it is very mild, so we'll ignore it. So this flat line is the time for the compute. And then we'll also draw the dependence of the memory fetch on context length, and this starts at a large number for the weights and then grows gradually with the context length. So maybe here, and then

[01:35:03] grows gradually with context length. And so you take the maximum of the two, and you see there is this inflection point here. So this is the cost that, for example, Gemini might be paying. Then you think: how might you put a pricing structure on top of that? You would like to ensure that no matter what the context length is, you are still profitable. >> Interesting. >> And so we've got a two-tier pricing structure; maybe we've got something that looks like this up to some context length. >> That's fascinating. >> So I think it says something: given that the bump is at 200k, it probably means that this is somewhat aligned with the crossover point, maybe not exactly aligned. >> Fascinating. >> So we can probably even complete that calculation just to see where it lands. We can solve for the number of bytes per token, if we make some assumptions about the number of active

[01:36:00] parameters. So, solving for the number of bytes per token: we're going to assume the point where the time for memory and the time for compute equalize is at, let's say, 200k tokens. So we equalize these two. We're also going to assume that the batch size is large enough that the memory time spent on weights is negligible, so we'll forget about that term and focus on the memory time spent on the KV cache. So that ends up saying, copying this term over: batch size times length of context times bytes per token, over memory bandwidth, is going to be equal to the number of activated params over flops. And then we're going to solve for bytes per token.

[01:37:18] The batch size was missing here; it shows up on this side too, and then it cancels out by the time we get here. And I had dropped the length of context. >> So we can plug in numbers. This number, memory bandwidth over flops, is the reciprocal of the number that we saw before; it's like one over 300, which is reasonably stable across many different hardware platforms. We conjecturally said that the number of activated params is maybe 100 billion, and the length of the context we said was 200k. Hmm, something is wrong here: the length of

[01:38:01] the context should be in the denominator, not the numerator. That gives 1667, about one and a half kilobytes, call it almost 2 kilobytes. That is plausible, actually. So we said around 2 kilobytes per token.
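(The algebra, spelled out with those assumed inputs: a 100B-active-parameter model, a flops-to-bandwidth ratio of about 300, and the 200k-token price break treated as the memory/compute crossover.)

```python
# At the crossover: len_ctx * bytes_per_token / mem_bw  =  n_active / flops
#   => bytes_per_token = n_active / ((flops / mem_bw) * len_ctx)
n_active = 100e9          # assumed active parameters
flops_per_byte = 300      # chip flops divided by HBM bandwidth, roughly hardware-invariant
len_context = 200_000     # where the Gemini pricing tier changes

bytes_per_token = n_active / (flops_per_byte * len_context)
print(f"{bytes_per_token:.0f} bytes of KV cache per token")   # ~1667, close to 2 KB
```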

[01:39:01] So let's do a sanity check on what this could be. There are two mechanisms by which people do attention with a small number of bytes per token. One is dense attention with a lot of reuse across layers; Character AI has a blog post talking about that, alternating long and short context. And in the Character-AI kind of model, which also showed up in the Gemma models, the global context, which is really what we're talking about here, was shared across all the layers. And so to get this 2 kilobytes: a d_head of 128 is typical, and then the number of bytes is typically the number of attention layers, times 2 times d_head, times the number of Q heads. The first factor is really the number of unique contexts per layer: do you share the context across many layers or do you use it only once? In Character-AI-like models this number is one. We said d_head is 128. And the last factor is a choice which typically

[01:40:02] ranges from one, sorry, I meant KV heads. >> So there is a Q head and a KV head; what's the difference? >> The KV heads are the heads that are stored in memory; they store the contents of the previous tokens. The Q heads are the retrieval heads; they're only used temporarily, by the attending token. So in this autoregressive context, I've got KV heads associated with all of the context, and then Q heads associated with this new token here. >> But this d_head, the 128? >> Oh, this number is actually the same for both; the d_head is the dimension of the vector. >> Ah, yeah. >> And the number of KV heads is typically in the range of 1 to 8. So it is totally plausible to get this, for example, by having eight KV heads and a d_head of 128; that gives you exactly this number. >> Or you could have fewer KV heads but more layers. Interesting. >> Yeah.
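(The sanity check in numbers, with typical, assumed attention shapes.)

```python
# KV-cache bytes per token ~= (# layers with their own global KV) x 2 (K and V)
#                             x d_head x n_kv_heads x bytes per element
n_unique_kv_layers = 1     # Character-AI-style: one global KV shared across layers
d_head = 128
n_kv_heads = 8
bytes_per_element = 1      # assume an fp8 KV cache

kv_bytes_per_token = n_unique_kv_layers * 2 * d_head * n_kv_heads * bytes_per_element
print(kv_bytes_per_token, "bytes per token")   # 2048, i.e. about 2 KB
```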

[01:41:00] So this is one way to get there, via dense attention. There's also a way to get there via sparse attention, where you increase all of these numbers but then you have a one-over-sparsity term. >> So yeah, I think this number is plausible, if maybe a little bit small. >> It's funny that they would leak so much information through their API pricing. >> I mean, you are incentivized to price close to your costs, because otherwise someone could undercut you. >> Maybe we can learn something from the difference in input versus output prices, and what that tells us about decode versus prefill in these models. I think last I checked it's like 50% more expensive or something like that, or >> I don't remember. What I've seen in the past is like three or five times. Let's say output is five times more expensive. Okay. This is the compute to process the next token in decode. Suppose you're doing prefill. You're not just processing the most recent token; you're processing all the tokens in parallel. So I want to say

[01:42:00] that it would be this times len_prefill, or the length of a pass in general. >> Yeah, if we can think of decode as being a pass of length one and prefill as being a pass of length many. >> Okay, yeah. Or maybe call it a prefix, sure, whatever. >> Okay. Memory. So you're not storing the KV cache for the tokens that are the prefill tokens. I think maybe, let's actually draw how prefill shows up here, if I may clarify. So we do a bit of decode like this, and then we may come back and do more prefill: if you think of this as a chat session, the user says something, the AI generates the response, then the user says something else and we prefill this. So maybe this is the more common, the general case, rather than this. In fact, this is like you read a file or something. >> Read a file, or just the AI is responding to a user input or a tool call or anything that's not generated.

[01:43:00] >> Yep. Okay. So, suppose we're here. You will need to load, well, you will have calculated all of this previously. >> Mhm. >> So, just the KV of everything that came before. But what is the memory cost of this new part? Well, the memory bandwidth cost of this, if you're doing flash attention, is basically temporary; it doesn't even go to main memory. Just ignore it. >> Okay. So then it would just be everything that came before. So is it not just the same then? >> Yeah, there's actually no adjustment at all to the memory time. >> Great. Oh, so it's a very trivial change to accommodate. So this term is making it 5x more expensive. Now why would that be? What are we trying to learn here? What variable does this help us clamp?

[01:44:00] Well, the compute is the only thing that could have changed, so presumably the compute is 5x more expensive as a result. So yeah, this is the time for one pass, but actually the number of tokens processed is that much larger. So I guess we want the cost per token, in fact, or the time per token. >> So I'm not sure I understood: this is for processing the next token in the prefix? >> Well, actually for processing the entire pass: at this cost we have processed this many tokens, the whole prefill. >> Yeah, the length of the pass, not just this prefix, but at this cost. >> Okay, so let's just call it a pass. So this is 5x more expensive. >> Input is 5x more expensive? >> No, output is more expensive. >> Output is 5x more expensive. >> So the result we want to work

[01:45:00] towards is that prefill is compute limited and decode is memory bandwidth limited. >> Why don't we just chart it, with len_pass on the x-axis? >> Yep. >> And T on the y-axis. >> Well, we want the cost per token, so it'll be T over something: T over the length of the pass. >> Mhm, yeah, that'll be right. Okay, so I'm confused about this: it seems like this should be higher when you're doing prefill. >> Prefill has a bigger len_pass, yeah, right. >> But then why is it cheaper?

[01:46:01] >> Why isn't the cost higher, yeah. So it's this division by len_pass that actually does it. This term is going to divide out, this is going to divide out, but then all of this is going to divide by the length of the pass, and that's going to make the memory cost cheaper. >> Okay, let me think about this then. So let's do one line for, basically we'll have four different lines. Let's do prefill first. Actually, let's do decode first. >> Oh, so actually, the length of the pass: when it's one, that is decode; when it is bigger, that is prefill. >> Okay, I see, that makes sense. Okay, getting back to it. So the compute time: basically just this divided by len_pass is this amount, so this actually does not vary based on

[01:47:00] the length of the pass. So it'll just be some flat value like this, and this is t_compute, and this is >> that's decode, right? >> Now t_mem: we have this whole thing divided by len_pass. It doesn't really matter what's up there, it'll just be something that looks like this, declining. Say this is t_mem; at a pass length of one, that's decode again. So as the length of the prefix, or pass, goes up, your memory bandwidth time declines. And that means that to the extent that you were bottlenecked on memory bandwidth before, you can avoid being bottlenecked on memory bandwidth. The fact that they are charging 5x less for

[01:48:02] prefill than for decode does suggest that they are bottlenecked on memory bandwidth to quite a degree, because T is equivalent to cost, right, it's the cost of renting the computer. So on this chart, this would be at one and this would be at five. >> That's right. Yeah. So it is in fact tremendously memory bandwidth bottlenecked. The real graph looks something like that. >> Yeah, I mean it still crosses, but >> yeah, exactly, so let me draw it this way. >> Yeah, that's right. And then this is the gap on decode between the memory time and the compute time. >> Yeah. >> Okay. Interesting.
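(A small sketch of that per-token cost curve, using the simplified roofline from earlier. The constants, active parameters, chip flops, HBM bandwidth, and KV bytes per token, are all assumed round numbers, so only the shape is the point: decode sits on the falling memory line and prefill on the flat compute line.)

```python
# Per-token time for a pass of length L: decode is L = 1, prefill is L large.
n_active = 100e9              # assumed active parameters
flops = 2.4e15                # chip flops/s (assumed)
mem_bw = 8e12                 # HBM bytes/s (assumed), so flops/mem_bw ~ 300
kv_bytes_per_token = 1667     # from the earlier estimate

def time_per_token(len_pass, len_context):
    t_compute = n_active / flops   # flat in len_pass (constant factors folded out,
                                   # matching the simplified equation on the board)
    t_memory = len_context * kv_bytes_per_token / (mem_bw * len_pass)  # amortized over the pass
    return max(t_compute, t_memory)

for ctx in (200_000, 500_000, 1_000_000):
    ratio = time_per_token(1, ctx) / time_per_token(8192, ctx)
    print(f"context {ctx:>9,}: decode costs ~{ratio:.1f}x prefill per token")
```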

[01:49:01] Another interesting one would be why cache hits are so much cheaper. >> Yeah. Okay. >> So I think, if I remember correctly, it's more expensive to write to the cache according to the pricing on all these models, but if you do hit the cache, it's 10x cheaper. So what is going on there? Presumably this is the cost of keeping something in HBM rather than just evacuating it, but if you do keep it in HBM, then it's cheaper to load again, >> right? So there are two ways you can produce the KV cache for a token. You can produce it from scratch, by computing it from the underlying token IDs, which are tiny. Or you can have previously produced it and stored it in memory somewhere. So the cost ratio is really talking about the ratio between those two mechanisms of producing it. A cache miss means you've deleted it from all your memories and you have to recompute it from the tokens directly. >> In fact, you can maybe even take that a step further and think about which memory tier you store it in. So you

[01:50:01] could store it in HBM. There are other slower and cheaper memories than HBM, like DDR on your host, or flash, as well. And so one of the things you can do is a calculation of where it makes sense to be in each memory tier, and this is related to how long you're going to store it for. So we want to look at the cost of storage in a few different memory tiers, and also the cost of rematerialization. Remat means the cost to rebuild all of the KV cache from scratch after you've deleted it; we rematerialize it. Basically this is going to cost in proportion to the length of the context; actually, we'll look at cost per token so that we don't need to carry around this length of context everywhere. So to rematerialize one token of KV cache, I just need to run

[01:51:01] a forward pass on the whole model, and so this is going to be the compute time: I have to rerun the compute at whatever speed my GPU does it, and then I multiply by my GPU dollars per second. >> Sorry, extremely basic question: why is there not a quadratic term? >> Yeah. So there is a quadratic term; it shows up in the compute, and as an approximation I chose to remove it. I'll just show you quickly what that looks like. If you look at the cost per token, or the number of flops per token, there are the flops that come from doing the weight matrix multiplies, which are flat as a function of context length, and then there is the number of multiplies that comes from

[01:52:01] doing the attention over the KV cache, which goes up linearly with the amount of stuff you attend to. The slope on this is so low that when you draw it like this, it's very well approximated by a flat line. You start to notice the effect of the quadratic, or really the linear, term up in the millions of tokens or so. So it's just not super relevant. >> So what is the reason that there's no company which has over a million tokens of context length, if this is true? >> Yeah. So there are two costs of long context. One is the memory bandwidth cost, which we've spent a lot of time analyzing; that's this thing. And then the other one is the compute cost. The compute cost is almost always, and actually is forced by fundamental principles to be, a much smaller slope than the memory bandwidth cost. And so the primary thing that limits you from having really large contexts is the memory bandwidth, and the memory capacity, which is exactly this effect,

[01:53:00] >> And so there's this idea that Dario said on the podcast, and others have said, which is that we don't need continual learning for AGI, in-context learning is enough. And if you believe that, then you have to think that we'd have to get to 100-million, or billion, token context lengths to have an employee that is the equivalent of someone working with you for a month. Now maybe that's no longer true with sparse attention or something. >> Yeah. >> But if you think that, then some ML infra thing would have to change, like the memory bandwidth, to allow for 100-million-token context lengths. >> I mean, sparse attention gives you an out for sure, because it gives you a big improvement, like a square root. But if you look at the history of context lengths of models, from earlier models like GPT-3 maybe to GPT-4, I don't remember when the transition happened exactly, they

[01:54:00] shot up from about 8K to 100K or 200K, and then for the last year or two they've all been hovering around there. I think that actually indicates that that's the reasonably balanced cost point, and going massively beyond it would be cost prohibitive. >> Not because of the compute cost, but because of >> the memory bandwidth cost, yeah. So I actually don't see a very good path to solving that. The HBM is where it is; it's not getting hugely better. >> And why doesn't sparse attention solve that? >> Sparse attention is a big improvement, and maybe that is priced in already, perhaps. But it's not an infinite improvement, because if you go too sparse, you lose too much quality. >> Yeah. >> But the empirical result is that context lengths haven't been increasing that much, and I think it's because there is no solution to the memory wall. >> Interesting. So going too sparse just means

[01:55:01] you're attending to a very small subset of the tokens, and the quality will get worse. >> Yeah. So, what is the cost of these different ways of producing, or re-synthesizing, the KV cache? Computing it from scratch is based on my GPU time: I have to do a certain number of multiplies, a certain amount of GPU time that I spend, in order to produce it. Then, storing it in HBM: this really goes as the bytes per token, the number I had up here. I need some number of bytes per token, and I need to store this in the HBM, so it's going to use up some of my HBM capacity. A way to think of this is that if I have too many of these things sitting in HBM, if I fill up my HBM with just KV caches that I'm not using, I

[01:56:00] can't use that GPU, and so how do I price that? Maybe I say that the cost is proportional to the fraction of the HBM I'm using, times GPU dollars. And then let's do one more memory tier and say something like DDR, store it in DDR instead. The same kind of thing goes for flash and for DDR. Actually, I put these in the wrong columns; I meant to make two columns. The distinction I want to make is that there is the cost to retrieve, and then there's the cost to store, the cost to hold on. This one is a cost per second, whereas this one is an instantaneous cost. So rematerialization has a cost to retrieve and has zero cost to store, because we've deleted it.

[01:57:02] This is the one that I put in the wrong location; this is actually the cost to hold on, so I will rewrite it. Okay. So if we're just storing it in HBM, it has this sort of cost profile. And then if we store it in DDR, it's actually going to take some time, but we get the same kind of thing here: bytes per token over DDR capacity, times DDR cost per second. But now this has a cost to retrieve that is higher than the HBM case, because we need to copy it into the HBM. And so this is bytes per

[01:58:03] token over DDR bandwidth, and then this consumes some amount of the DDR as well. >> And every scale-up has DDR and flash? >> This is really a deployment question, and so you can choose that. Nvidia does deploy in this form; it has both. >> Why isn't the cost to retrieve from HBM the bytes divided by memory bandwidth? >> Yeah, it depends what you define a retrieve to be. Here I'm defining retrieve to be: move it into HBM so that you can start actually doing inference on it, so for HBM it's sort of free by definition. >> And because if it's already in HBM, you can be doing the compute while you're reading it from HBM, unlike from disk, for example. >> Yeah. So these are three things, and I guess I ordered them wrong. In general, if you're balancing two costs and you've got different tiers in the memory hierarchy, you should expect that as this cost goes up, this cost should go down.

[01:59:00] So you can kind of see where the zeros are, and I should have ordered them: this one first, this one second, and this one third. So if you're going to hold on to it for a very short amount of time, >> then all of this is multiplied by the hold time. >> Yeah. This one is, and so is this one. And interestingly, they have different prices to write for, and you specify this in the API: 5 minutes versus an hour. >> Yeah, which suggests that the 5 minutes is HBM and the hour is DDR. >> I think that's a pretty good assumption. If you look at the numbers, it might also turn out that it's one tier down, and it's DDR versus flash. >> Yeah. Okay, interesting. And the price difference, I think, was, I'll look it up. Okay. So the base input tokens are five dollars per million

[02:00:03] tokens for the base case, which means remat. >> Yeah, that's five. >> So it's $5 to "retrieve", quote unquote, and then to write to, presumably, HBM, to write for 5 minutes, it's 6.25. >> So we might actually be able to determine which memory tier it is by the durations. The duration probably tells it, actually: 5 minutes versus 1 hour. >> Yeah, exactly. I think this will probably end up being the drain time of the memory tier that you're in. What that means is: given that I know I'm going to be holding something for 5 minutes, I would like to pick a memory that I can read every 5 minutes, like I can read the whole memory once per 5 minutes, ballpark.

[02:01:01] So that is the drain time of the memory. If I take the storage capacity over the storage bandwidth, I would like this to be equal to 5 minutes or something like that. And actually, we did this calculation for HBM: for HBM we know that this number is 20 milliseconds, so HBM is much too short, much too small. DDR could be about an order of magnitude or two off from this, so it's probably, actually I think it might even be in the seconds, like 1 to 10 seconds. And then, I don't have these numbers memorized, but generally as you go to slower tiers, flash is plausibly on the order of 1 minute, and then spinning disk, which is massively different, I think is on the order of 1 hour. So this might actually identify that the tiers are probably flash and spinning disk. >> Sorry, why is the calculation the storage capacity divided

[02:02:00] by the bandwidth? >> So, you've got a bunch of different memory tiers; we've listed four of them. Your choice of memory tier, well, you want to minimize the cost. And so: what fraction of the device are you using? You're using some fraction of the device for holding onto it, and then you're using some fraction of the device to retrieve it. Let's say I'm using 10% of the device, and I want to equalize those two fractions; that's a sign that I've hit the right point. So let's say I've got some runtime here, like I'm going to hold on for all of this time, so this is the hold time, and then there's going to be some amount of time here which is the time to retrieve. >> Mhm. >> And I basically want to equalize these two costs: I want the retrieval time to be equal to the hold

[02:03:02] time times the fraction of capacity. >> Mhm. >> Because this is the retrieval time, and this is how many other things I can hold simultaneously. Basically: you want to store things in there for long enough that the amount of time it's in there is roughly the time to get all your things in and out. >> Yeah, basically. Makes sense. >> I think that probably indicates that these are the two tiers of flash and spinning disk. >> I'm kind of shocked to see spinning disk being used at all, because it's such an old technology. >> I mean, it's also crazy that it's so slow that it takes an hour to load its full capacity. >> Yeah, it's a really unattractive technology, but it's useful in some places.
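(A sketch of that drain-time rule of thumb. The capacities and bandwidths below are illustrative ballparks for each tier, not measurements; the point is that capacity divided by bandwidth lands in very different ranges.)

```python
# Drain time = capacity / bandwidth: roughly how long a cached KV should live in a tier.
tiers = {
    "HBM (one rack)": (14e12, 72 * 8e12),   # ~14 TB drained by 72 GPUs' worth of bandwidth
    "DDR (host)":     (2e12, 400e9),        # ~2 TB at a few hundred GB/s
    "Flash (NVMe)":   (16e12, 50e9),        # ~16 TB pool at tens of GB/s
}
for name, (capacity, bandwidth) in tiers.items():
    print(f"{name:15s}: drain time ~ {capacity / bandwidth:8.2f} s")
# HBM drains in tens of milliseconds, DDR in seconds, flash in minutes; spinning
# disk (not shown) stretches this to the order of an hour or more.
```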

[02:04:01] So, we're sitting down because I want to ask you some questions that I guess don't need a blackboard. You have this extremely interesting blog post where you talk about how, at a high level, the architecture of cryptographic protocols looks a lot like neural networks, and there's this convergent evolution where they both need to jumble information across all their inputs. For cryptographic protocols, it's to make sure that each new input into a hash function totally scrambles the output. For neural networks, of course, they need to consider how this piece of information changes what you should make of that other piece of information. But there's an extremely interesting difference, I guess, at a high level, in what they're trying to do: in some sense, they're trying to do the inverse thing, right? Cryptographic protocols are trying to take information which has structure and make it look indistinguishable from randomness. >> Yeah. >> And neural networks are trying to take things which look like random protein sequences, DNA, garbled text, and extract higher-level structure from them. So they have similar high-level mechanisms, but

[02:05:01] they're actually kind of trying to do the opposite things. I wonder what you make of that. >> Yeah. So, on the mixing: I tried to look for other examples where this scrambling and mixing shows up. There's almost even a physical example: you're making a cake and you want to stir the batter, and literally the idea of "first stir it this way, then stir it that way" is actually not too bad an approach. But beyond that, back in the digital world, there are some differences, and the one you call out is a pretty strong one. The way it shows up: if you just randomly initialize a neural network, it's maybe a reasonable cipher as well, because the random initialization is going to jumble stuff in a complicated way; it may even do what you want, who knows. The thing that makes it trainable is gradient descent: you can differentiate a neural network and

[02:06:01] get a meaningful derivative, and we do a lot of work to not over-complicate that derivative: the residual connection keeps it contained and simple, and so does the layer norm and the other things we do. One of the biggest attacks against cryptographic ciphers is also to differentiate the cipher. Ciphers run in a different number field: they run in the field of two elements, so just binary, whereas neural nets run, in theory, in the field of real numbers. So you have to differentiate with respect to binary values, but you can absolutely differentiate a cipher, and this is called differential cryptanalysis. Basically what it says is that if you take a small difference in the input, it's quite difficult to make the difference in the output small; the whole job of a well-designed cipher is to make the output difference very large.
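You can see that avalanche behavior directly with a few lines of Python. SHA-256 is a hash rather than a cipher, so it's only a stand-in for the same design goal, but it illustrates the point: flip a single input bit and roughly half of the 256 output bits change.

```python
import hashlib

def as_int(digest: bytes) -> int:
    return int.from_bytes(digest, "big")

msg = b"hello world"
flipped = bytearray(msg)
flipped[0] ^= 0x01            # flip a single bit of the input

h1 = hashlib.sha256(msg).digest()
h2 = hashlib.sha256(bytes(flipped)).digest()

# Count how many output bits differ between the two digests.
changed_bits = bin(as_int(h1) ^ as_int(h2)).count("1")
print(f"{changed_bits} of 256 output bits changed")   # typically around 128
```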

[02:07:00] >> So I guess the distinction is that the optimization goals at that point are about complexifying; they don't have the same residual connections or layer norms that would simplify the derivative. >> Yeah. I mean, I guess a place where the two merge is backdoors. >> Okay. So with a backdoored LLM, you're trying to hide... what do you consider an input? It's not an input into the forward pass, it's an input into the backward pass, and you're trying to hide an input into the backward pass. >> It's like an adversarial... yeah. In fact, this is actually a place where you get exactly the sort of avalanche property that ciphers have as well. Adversarial attacks, typically on image classification models, ask: can I find a very, very small perturbation of the image that totally changes the classification, totally changes the output?

[02:08:01] >> That is the common case in ciphers, whereas it's the undesired case in neural nets, for sure. Yeah. >> Okay. So I was asking you whether neural networks have actually been used for cryptography, and we realized it might be better to just do this on the blackboard. >> Yeah. >> So I'm curious: are they actually being used for cryptography? >> Yeah. So, using neural nets for cryptography: in general, creating a new cipher is a very, very dangerous proposition; almost all of them are broken, like 99% of them. So probably a bad place to start. But the other direction has been, in at least one very clear case, quite productive. There's a construction that exists in ciphers and was imported into neural nets, called a Feistel cipher, or Feistel network. The idea is that you may have some function F which is not invertible,

[02:09:00] but you like the function because it does interesting things: it's an MLP, for example, or it mixes things in an interesting way. You'd like to build something out of it that is invertible. So the construction we're going to make is actually a two-input function rather than a one-input function. We're going to apply F(x), and we need to remember what x was, so we stick x over here so that we can work backwards. We also can't drop y, so we remember y and add the two together, y + F(x), and we form this tuple. The way to invert this: if I have the output and I want to recover x and y, well, I can easily recover x, it's right there, I just read it off. And then to recover y, if the second element was called z, I can recover y as z minus F(x), because I've already recovered x. So that means this construction is invertible.
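Here is a minimal sketch of that construction in Python. The random tanh map used for F is just an assumed stand-in for "some interesting, non-invertible function"; the construction itself is the point: (x, y) maps to (x, y + F(x)), and because x passes through unchanged, y can always be recovered as z − F(x).

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))

def F(x):
    # Some interesting but non-invertible mixing function (a stand-in here).
    return np.tanh(x @ W)

def forward(x, y):
    # (x, y) -> (x, y + F(x)); x is passed through so the step can be undone.
    return x, y + F(x)

def inverse(x, z):
    # Recover y as z - F(x), since x came through unchanged.
    return x, z - F(x)

x, y = rng.normal(size=8), rng.normal(size=8)
x_out, z = forward(x, y)
x_rec, y_rec = inverse(x_out, z)
assert np.allclose(x_rec, x) and np.allclose(y_rec, y)
```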

[02:10:01] This was used in ciphers a ton, and still is; it's one of the main mechanisms for constructing ciphers. Often you want ciphers to be invertible, and especially the layers of a cipher, because that has better cryptographic properties. This has actually been ported over into neural nets. There's a 2017 paper called RevNets, reversible networks, and what it does is make the entire network invertible: you can apply it to any network, like a transformer. I do a forward pass, but then I can actually run the entire pass backwards as well, so the whole neural network is invertible, with exactly this construction. So in this reversible-networks paper, applied to some layer, a transformer layer for example, we've got this function F which is our transformer layer. Now, normally we would have just an input and then a residual

[02:11:01] connection coming out, and it gets added like this, over here. >> Mhm. >> But now the variation is: we've got two inputs, x and y. x goes through the function, gets added to y, and that becomes the new x, the output x. And then the input x becomes the output y. Really, what this is doing, if you think of it, is the thing you mentioned before: it's the residual connection from two layers back. This y came from the previous layer and was the residual connection there. But because of this construction, the whole thing is invertible.

[02:12:01] >> Why do I care? What does invertibility matter for? >> The big thing it can be interesting for is training. If I think of a forward pass in training, let's say I have four layers and I run them in 0, 1, 2, 3 order, I have to write all of the activations to HBM, and so I get an HBM footprint that is linear in the number of layers. >> Yep. >> This can actually be the largest memory footprint during training. So that's normal training: then I run the backwards pass and read the activations back out in reverse; the forward pass goes forward, the backward pass goes backwards. The idea of the RevNets paper is that because the network is invertible, I don't need to store those activations at all; I can completely rematerialize them when I'm running my backwards pass. So I run my forwards pass, and then while running my backwards pass I'm simultaneously, in lock step, undoing all of the forward-pass steps I did, to reconstruct the activations I need.
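As a rough sketch of that idea, here is the reversible layer from above stacked a few times: the forward pass keeps only the final (x, y), and the backward sweep inverts one layer at a time to rematerialize each layer's input instead of storing it. This is a toy illustration of the RevNets idea under the same assumed stand-in F, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers = 4
Ws = [rng.normal(size=(8, 8)) for _ in range(num_layers)]

def F(x, W):
    # Stand-in for a layer body (an attention/MLP block in a real network).
    return np.tanh(x @ W)

def layer_forward(x, y, W):
    # Reversible residual step: new_x = y + F(x), new_y = x.
    return y + F(x, W), x

def layer_inverse(x_out, y_out, W):
    # y_out is the old x; the old y is x_out - F(old x).
    return y_out, x_out - F(y_out, W)

x, y = rng.normal(size=8), rng.normal(size=8)
x0, y0 = x.copy(), y.copy()

# Forward pass: no per-layer activations are stored, only the final pair.
for W in Ws:
    x, y = layer_forward(x, y, W)

# Backward sweep: undo the layers in reverse order, rematerializing each
# layer's input exactly when the gradient computation for it would need it.
for W in reversed(Ws):
    x, y = layer_inverse(x, y, W)

assert np.allclose(x, x0) and np.allclose(y, y0)   # inputs fully recovered
```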

[02:13:01] So this ends up being a memory saving, which is a nice idea. >> Interesting. And in some sense you're spending more compute to save memory. >> That's right. Yeah. >> Huh. Actually, it's kind of the opposite of what you're doing with the KV cache: there, you're spending more memory to save compute. >> Yeah. Spending more memory to save compute is generally profitable given where hardware is today. >> Yeah. Interesting. Cool. That was super fun, Reiner. Thank you so much for doing it. I feel like it really vindicated the vision behind the studio and the blackboard. >> Cool. Thanks so much for doing it. >> Thanks.