Transcript — But how do AI images and videos actually work? (Guest video by Welch Labs)
Over the last few years, AI systems have become astonishingly good at turning text prompts into videos. At the core of how these models operate is a deep connection to physics. This generation of image and video models works using a process known as diffusion, which is remarkably equivalent to the Brownian motion we see as particles diffuse, but with time run backwards, and in high-dimensional space. As we’ll see, this connection to physics is much more than a curiosity. We get real algorithms out of the physics that we can use to generate images and videos. And this perspective will also give us some really nice intuitions for how these models work in practice.
But before we dive into this connection, let’s get hands-on with a real diffusion model. While the best models are closed source, there are some compelling open source models. This video of an astronaut was generated by an open source model called WAN 2.1. We can add to our prompt and have our astronaut hold a flag, hold a laptop, or hold a meeting. If we cut down our prompt to just an astronaut, we get this. And if we cut down our prompt to nothing, we interestingly still get this video of a woman.
If we dig into our WAN model’s source code, we’ll find that the video generation process begins with this call to a random number generator — creating a video where the pixel intensity values are chosen randomly. Here’s what it looks like. From here, this pure noise video is passed into a transformer. This is the same type of AI model used by large language models, like ChatGPT. But instead of outputting text, this transformer outputs another video that now looks like this. Still mostly noise, but with some hints of structure. This new video is added to our pure noise video, and then passed back into the model again, producing a third video that looks like this. This process is repeated again and again. Here’s what the video looks like after 5 iterations, 10, 20, 30, 40, and finally 50. Step by step, our transformer shapes pure noise into incredibly realistic video.
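To make the loop concrete, here is a minimal sketch of that generation process, not WAN's actual code: the `transformer` below is a stand-in for the real video model, and the tensor sizes and step count are just illustrative (the 50 iterations match the description above).

```python
import torch

def transformer(video, step):
    # Stand-in for the real video transformer; in the real model this output
    # contains "hints of structure". Here it just returns a small random
    # update so the loop runs end to end.
    return 0.01 * torch.randn_like(video)

frames, channels, height, width = 16, 3, 64, 64
video = torch.randn(frames, channels, height, width)  # pixel values chosen randomly

for step in range(50):
    update = transformer(video, step)  # model output for the current video
    video = video + update             # fold it back in and repeat
```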
But what exactly is the connection to Brownian motion here? And how is our model able to use text input so expressively to shape noise into what our prompt describes? In this video, we’ll unpack diffusion models in 3 parts. First we’ll look at a 2021 OpenAI paper and model called CLIP. As we’ll see, CLIP is really two models, a language model and a vision model, that are trained using a clever learning objective that allows them to learn this really powerful shared space between words and pictures. Experimenting with this space will help us get a feel for the high dimensional spaces that diffusion models operate in. But learning a shared representation is not enough to generate images. From here we’ll look at the diffusion process itself. At a high level, diffusion models are trained to remove noise from images or videos. However, if you dig into the landmark papers in the field, you’ll find that this naive understanding of diffusion really doesn’t hold up in practice. In this section we’ll dig into the connection between diffusion models and diffusion processes in physics. This connection will help us understand how these models really work in practice and give us some powerful theory for dramatically speeding up image and video generation. Finally, we’ll bring these worlds together and see how approaches like CLIP are combined with diffusion models to condition and guide the generation process towards the videos we ask for in our prompts.
CLIP (3:37)
2020 was a landmark year for language modeling. New results in neural scaling laws and OpenAI’s GPT-3 showed that bigger really was better. Massive models trained on massive datasets had capabilities that simply didn’t exist in smaller models. It didn’t take long for researchers to apply similar ideas to images. In February 2021, a team at OpenAI released a new model architecture called CLIP, trained on a dataset of 400 million image and caption pairs scraped from the internet.
CLIP is composed of two models, one that processes text and one that processes images. The output of each of these models is a vector of length 512, and the central idea is that the vectors for a given image and its captions should be similar. To achieve this, the OpenAI team developed a clever training approach. Given a batch of image-caption pairs (for example a picture of a cat, a dog, and a man, with the captions “a photo of a cat”, “a photo of a dog”, and “a photo of a man”), pass the three images into the image model, and the three captions into the text model. We now have three image vectors and three text vectors, and we would like the vectors for the matching image-caption pairs to be similar.
The clever idea from here is to make use of the similarity not just between the corresponding images and captions, but between all image-caption pairs in the batch. If we arrange the image vectors as the columns of a matrix and the text vectors as the rows, the diagonal entries correspond to matching image-caption pairs, while the off-diagonal entries correspond to non-matching pairs. The CLIP training objective seeks to maximize similarity along the diagonal while simultaneously minimizing similarity off the diagonal. The C in CLIP stands for contrastive, because the model learns to contrast matching and non-matching pairs.
CLIP measures similarity using cosine similarity — the cosine of the angle between vectors in this space. So if our text and image vector point in the same direction, the angle between our vectors will be zero, resulting in a maximum value of 1.
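For readers who want to see the objective concretely, here is a sketch of a CLIP-style contrastive loss, assuming we already have a batch of 512-dimensional image and text embeddings; the temperature value is my own choice and this is not OpenAI's exact code.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity matrix: rows are images, columns are captions.
    logits = image_emb @ text_emb.T / temperature

    # The matching image-caption pairs sit on the diagonal.
    labels = torch.arange(len(image_emb))

    # Maximize diagonal similarity, minimize off-diagonal, in both directions.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

# Example with a batch of 3 random "embeddings".
loss = clip_loss(torch.randn(3, 512), torch.randn(3, 512))
```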
Shared Embedding Space (6:25)
The learned geometry of this shared vector space, known as a latent or embedding space, has some really interesting properties. If I take two pictures of myself, one not wearing a hat and one wearing a hat, and pass both of these into our CLIP image model, we get two vectors in our embedding space. Now if I take the vector corresponding to me wearing a hat, and subtract the vector of me not wearing a hat, we get a new vector in our embedding space. Now what text might this new vector correspond to? Mathematically we took the difference of me wearing a hat and me not wearing a hat. We can search for corresponding text by passing a bunch of different words into our text encoder, and for each computing the cosine similarity between our newly computed difference vector and the text vector. Testing a set of a few hundred common words, the top ranked match with a similarity of 0.165 is the word hat, followed by cap and helmet.
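Here is a sketch of that hat-vector experiment using OpenAI's open-source `clip` package (installable from github.com/openai/CLIP); the image paths and the word list are placeholders standing in for the photos and vocabulary described above.

```python
import clip
import torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32")

def embed_image(path):
    image = preprocess(Image.open(path)).unsqueeze(0)
    with torch.no_grad():
        return model.encode_image(image)[0]

def embed_text(word):
    with torch.no_grad():
        return model.encode_text(clip.tokenize([word]))[0]

# Difference of the two embeddings: "me with a hat" minus "me without a hat".
diff = embed_image("with_hat.jpg") - embed_image("no_hat.jpg")

words = ["hat", "cap", "helmet", "dog", "bicycle", "ocean"]
scores = {w: torch.cosine_similarity(diff, embed_text(w), dim=0).item() for w in words}
print(sorted(scores.items(), key=lambda kv: -kv[1]))  # "hat" should rank near the top
```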
This is a remarkable result. The learned geometry of CLIP’s embedding space allows us to operate mathematically on the pure ideas or concepts in our images and text, translating differences in image content (like wearing a hat) into a literal distance between vectors in our embedding space. The OpenAI team showed that CLIP could produce very impressive image classification results by simply passing an image into our image encoder, and then comparing the resulting vector to a set of possible captions, one for each label, and classifying the image with whatever label resulted in the highest cosine similarity.
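The zero-shot classification idea follows the same recipe; a sketch with the same `clip` package, where the labels and image path are placeholders:

```python
import clip
import torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32")
labels = ["cat", "dog", "man"]
text = clip.tokenize([f"a photo of a {label}" for label in labels])
image = preprocess(Image.open("photo.jpg")).unsqueeze(0)

with torch.no_grad():
    image_vec = model.encode_image(image)   # one vector for the image
    text_vecs = model.encode_text(text)     # one vector per candidate caption

sims = torch.cosine_similarity(image_vec, text_vecs)  # similarity per label
print(labels[sims.argmax().item()])                    # pick the best match
```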
So techniques like CLIP give us a powerful shared representation of image and text — a kind of vector space of pure ideas. However, our CLIP models only go one direction. We can only map image and text to our shared embedding space. We have no way of generating images and text from our embedding vectors.
Diffusion Models & DDPM (8:16)
2020 turned out to be a transformative year not only for language modeling. A few weeks after the GPT-3 paper came out, a team at Berkeley published a paper called Denoising Diffusion Probabilistic Models, now known as DDPM. The paper showed for the first time that it was possible to generate very high quality images using a diffusion process, where pure noise is transformed step by step into realistic images.
The core idea behind diffusion models is pretty straightforward. We take a set of training images and add noise to each image step by step until the image is completely destroyed. From here we train a neural network to reverse this process. When I first learned about diffusion models, I assumed that the models would be trained to remove noise a single step at a time. Our model would be trained to predict the image in step 1 given the noisier image in step 2, and so on. When it came time to generate an image, we would pass pure noise into our model, take its output and pass it back into its input again and again, and after enough steps we would have a nice image.
Now, it turns out that this naive approach to building a diffusion model really does not work well. Virtually no modern models work like this. The first thing that surprised me is that the team added random noise to images not just during training, but also during image generation. Algorithm 2 tells us that when generating new images, at each step, after our neural network predicts a less noisy image, we need to add random noise to this image before passing it back into our model. This added noise turns out to matter a lot in practice. If we take Stable Diffusion 2 and use DDPM sampling, we can get nice images. If we remove the line of code that adds noise at each step, we end up with a tiny sad blurry tree. How is it that adding random noise while generating images leads to better quality, sharper images?
The second thing that surprised me was that the team wasn't training models to reverse a single step in the noise addition process. Instead, they take an initial clean image x0, add scaled random noise epsilon, and train the model to predict the total noise that was added to the original image. So the team is effectively asking the model to skip all the intermediate steps and make a prediction about the original image. Intuitively, this learning task seems much more difficult to me than just learning to make a noisy image slightly less noisy. The Berkeley team's paper and approach were a landmark result that put diffusion on the map.
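A simplified sketch of this training objective is below; the linear beta schedule values come from the DDPM paper, but the rest (the shapes, the hypothetical `model` taking a noisy image and a timestep) is just one minimal way to write it, not the paper's exact code.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # noise schedule from the DDPM paper
alpha_bars = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal fraction

def training_loss(model, x0):
    t = torch.randint(0, T, (x0.shape[0],))        # random step for each image
    eps = torch.randn_like(x0)                     # the total noise we will add
    a = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps     # jump straight to the noisy image
    return F.mse_loss(model(x_t, t), eps)          # train the model to predict the noise
```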
Why does adding random noise while generating images and training the model like this work so well? The DDPM paper draws on some fairly complex theory. It turns out that there’s a different but mathematically equivalent way of understanding what diffusion models are really learning that we can use to get a visual and intuitive sense for why the DDPM algorithms work so well. The key will be thinking of diffusion models as learning a time-varying vector field.
Learning Vector Fields (11:44)
This perspective also leads to a more general approach called flow-based models. To see how diffusion models learn this time-varying vector field, let’s temporarily simplify our learning problem. One way to think about an image is as a point in high-dimensional space, where the intensity value of each pixel controls the position of the point in each dimension. If we reduce the size of our images to only two pixels, we can visualize the distribution of our images on a 2D scatterplot.
Real images have a very specific structure in this high-dimensional space. Let’s create some structure for our points in 2D for our diffusion model to learn. The exact structure doesn’t matter much — let’s start with a spiral.
The core idea of diffusion models — adding more and more noise to an image and then training a neural network to reverse this process — looks really interesting from the perspective of our 2D toy data. When we add random noise to an image, we’re effectively changing each pixel’s value by a random amount. In our toy 2D dataset, where the coordinates of a point correspond to that image’s pixel intensity values, adding random noise is equivalent to taking a step in a randomly chosen direction. As we add more and more noise to our image, our point goes on a random walk. This process is equivalent to the Brownian motion that drives diffusion processes in physics and is where diffusion models get their name.
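Here is a sketch of that forward process on toy 2D data; the spiral construction, number of steps, and step size are my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Spiral "dataset": each 2D point stands in for a two-pixel image.
theta = rng.uniform(0, 4 * np.pi, size=1000)
points = np.stack([theta * np.cos(theta), theta * np.sin(theta)], axis=1) / (4 * np.pi)

num_steps, step_size = 100, 0.03
trajectory = [points]
for _ in range(num_steps):
    points = points + step_size * rng.standard_normal(points.shape)  # random step
    trajectory.append(points)
# After enough steps, the spiral structure is completely destroyed.
```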
From here, it’s pretty wild to think about what we’re asking our diffusion model to do. Our model will see many different random walks from various starting points in our dataset, and we’re effectively asking our model to reverse the clock — removing noise from our images by letting it play these diffusion processes backwards, starting from random locations and recovering the original structure of our dataset.
How can our model learn to reverse these random walks? In the naive single-step training approach, we’d give the model the coordinates of the 100th step and ask it to predict the 99th step. Although the direction of the 100th step is random, there will be some signal in aggregate. Given enough training points, many diffusion paths will go through this neighborhood, and on average our points will be diffusing away from the spiral, so our model can learn to point back towards it.
We can now see why the Berkeley team’s training objective works so well. Instead of training the model to remove noise one step at a time, the team trained the model to predict the total noise added across the entire walk — the vector pointing from the 100th step back to the original starting point. We can prove that learning to predict the noise added in the final step is mathematically equivalent to learning to predict the total noise added, divided by the number of steps. So when our model learns to reverse a single step, although our training data is noisy, we expect our model to ultimately learn to point back towards x0. By instead training our model to directly predict the vector pointing back towards x0, we’re significantly reducing the variance of our training examples, allowing our model to learn much more efficiently, without actually changing our underlying learning objective.
So for each point in our space, our model learns the direction pointing back towards the original data distribution. This is also known as a score function, and the intuition is that the score function points us towards more likely, less noisy data.
In practice, these learned directions depend heavily on how much noise we add. After 100 steps, most points are far from their starting points, so the model learns to move them back in the general direction of the spiral. However, if we train our model on examples after only one diffusion step, we end up with a much more nuanced vector field, pointing to the fine structure of our spiral.
There’s a clever solution: instead of just passing the coordinates of our point into our model f, we can also pass in a time variable t corresponding to the number of steps taken in our random walk. We set t=1 at our 100th step, t=0.99 at the 99th step, and so on. Conditioning our models on time turns out to be essential, allowing our model to learn coarse vector fields for large t and very refined structures as t approaches 0. After training, we see this really interesting behavior as t approaches 0.4 — our learned vector field suddenly transitions from pointing towards the center of the spiral to pointing towards the spiral itself. It feels like a phase change.
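One minimal way to build such a time-conditioned model for the 2D toy problem is simply to append t to the input; this is my own sketch of the idea, not necessarily the architecture used in the video.

```python
import torch
import torch.nn as nn

class TimeConditionedField(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),   # input: (x, y, t)
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),              # output: a 2D direction
        )

    def forward(self, xy, t):
        # xy has shape (batch, 2); t has shape (batch,)
        return self.net(torch.cat([xy, t[:, None]], dim=1))
```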
We’re now in a great position to resolve the final mystery of the DDPM paper. How is it that adding random noise at each step while generating images leads to better quality, sharper images? Let’s follow the path of a single point guided by the DDPM image generation algorithm. Starting at a randomly chosen location of x=-1.6 and y=1.8, our model’s vector field points us back towards our spiral. Following the DDPM algorithm, we take a small step in the direction returned by our model, and add scaled random noise, which effectively moves our point in a random direction. Repeating this process for 64 steps, our particle jumps around quite a bit, but ultimately lands nicely on our spiral. Repeating for 256 points, our reverse diffusion process starts out looking like absolute chaos, but does converge nicely.
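A sketch of that reverse process for the toy data is below; the step size and noise scale are illustrative values rather than the paper's schedule, and `model` is any trained time-conditioned field like the one sketched earlier.

```python
import torch

def sample_2d(model, n_points=256, n_steps=64, step_size=0.05, noise_scale=0.1):
    x = torch.randn(n_points, 2)                         # start from pure noise
    for i in reversed(range(n_steps)):
        t = torch.full((n_points,), (i + 1) / n_steps)   # current time in [0, 1]
        direction = model(x, t)                          # points back towards the spiral
        x = x + step_size * direction                    # small step along the field
        if i > 0:
            x = x + noise_scale * torch.randn_like(x)    # DDPM-style added noise
    return x
```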
What happens if we remove the noise addition steps? All of our points quickly move to the center of our spiral, and then make their way towards a single inside edge. This explains the sad blurry tree. Instead of capturing our full spiral distribution, all of our generated points end up close to the center or average of our spiral. In the space of images, averages look blurry. Conceptually, different parts of our spiral correspond to different images of trees in the desert. When we remove the random noise steps, our generated images end up in the average — a blurry mess.
This prediction of the average is not a coincidence. It turns out we can show mathematically that our model learns to point to the mean or average of our dataset, conditioned on our input point and the time in our diffusion process. Given that the noise we add in the forward process is Gaussian, for sufficiently small step sizes the reverse process will also follow a Gaussian distribution, and our model learns the mean of this distribution. Since our model just predicts the mean of this normal distribution, to actually sample from it we need to add zero-mean Gaussian noise to our model's predicted value, which is precisely what the DDPM image generation process does. So adding random noise during image generation falls nicely out of the theory, and in practice prevents all our points from landing near the center or average of our dataset.
DDIM (22:00)
The DDPM paper put diffusion models on the map, but the approach didn't immediately see widespread adoption due to the high compute cost of the large number of steps required. A few months later, papers from Stanford and Google showed that, remarkably, it's possible to generate high quality images without adding random noise during the generation process, significantly reducing the number of steps required.
The DDPM image generation process can be expressed using a stochastic differential equation — the first term represents motion driven by our model’s vector field, the second term represents random motion. Using the Fokker-Planck equation from statistical mechanics, the Google Brain team showed that there’s another differential equation, this time an ordinary differential equation with no random component, that results in the same exact final distribution of points as our stochastic differential equation. This new algorithm doesn’t require taking random steps along the way.
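For reference, here is a sketch of the two equations being compared, written in the notation of the score-based SDE formulation; the specific symbols (the drift f, the noise scale g, and the score term) are assumptions about that formulation rather than quotes from the video.

```latex
% Reverse-time SDE (random steps at every instant):
dx = \left[ f(x,t) - g(t)^2 \,\nabla_x \log p_t(x) \right] dt + g(t)\, d\bar{W}

% Probability flow ODE (deterministic, same final distribution of points):
dx = \left[ f(x,t) - \tfrac{1}{2}\, g(t)^2 \,\nabla_x \log p_t(x) \right] dt
```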
This approach is generally known as DDIM. The scaling of step sizes, and especially how step sizes vary throughout a reverse diffusion process, matters a lot in practice. Switching to DDIM gives us smaller step sizes that allow our trajectories to better follow the contour lines of our vector field and land nicely on the correct spiral distribution. Compared to DDPM, DDIM remarkably requires no changes to model training, but is able to generate high quality images in significantly fewer steps, completely deterministically. Note that the theory does not tell us individual images will be the same, only that our final distribution will be. The WAN model uses a generalization of DDIM called flow matching.
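A sketch of a single deterministic DDIM update is below; `eps_model` (the noise-prediction network) and `alpha_bars` (the cumulative schedule from training) are hypothetical names standing in for whatever a given implementation uses.

```python
import torch

def ddim_step(eps_model, x_t, t, t_prev, alpha_bars):
    a_t, a_prev = alpha_bars[t], alpha_bars[t_prev]
    eps = eps_model(x_t, torch.tensor([t]))                    # predicted total noise
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # implied clean image
    # Jump directly to the earlier timestep, with no added random noise.
    return a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
```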
DALL-E 2 / Conditioning (25:25 / 26:37)
By early 2021, diffusion models were capable of generating high quality images and could do so without enormous compute. However, our ability to steer the diffusion process using text prompts was still very limited. CLIP was able to learn a powerful shared representation but only goes one way, converting text or images into embedding vectors. These two problems fit together nicely: a diffusion model could potentially reverse the CLIP image encoder, generating high quality images, and the output vector of the CLIP text encoder could be used to guide the diffusion model toward the images we want.
A team at OpenAI did exactly this in 2022, using image and caption pairs to train a diffusion model to invert the CLIP image encoder. Their approach yielded an incredible level of prompt adherence. The team called their method unCLIP, but their model is better known by its commercial name, DALL-E 2.
How do we actually use the embedding vectors to steer the diffusion process? One option is to simply pass our text vector as another input into our diffusion model, and train as we normally would to remove noise. The model will learn to use the text information to more accurately remove noise. This technique is called conditioning. We used a similar approach earlier when conditioning on the time variable. There’s a variety of ways to pass the text vector in: cross-attention, simple concatenation, or multiple input paths.
It turns out that conditioning alone is not enough to achieve the level of prompt adherence that we see in models like DALL-E 2. If we take our Stable Diffusion tree-in-the-desert example and only condition with text inputs, the model gives us a shadow in a desert, but no tree. It turns out that there's one more powerful idea we need to effectively steer our diffusion models.
Guidance (30:02) / Classifier-Free Guidance
Returning to our toy dataset: if our overall spiral corresponds to realistic images, different sections may correspond to different image types. Let’s say the inner part is people, the middle is dogs, and the outer is cats. Now let’s train the same diffusion model but also pass in the point’s class. Running our generation process with class labels, we recover the overall structure but the fit is not great — confusion between people and dogs. Part of the problem is that we’re asking our model to simultaneously learn to point to our overall spiral AND toward specific classes on the spiral. The modeling task of generally matching our overall spiral has overpowered our model’s ability to move our point in the direction of a specific class.
The trick is to leverage the differences between an unconditional model (not trained on a specific class) and a model conditioned on specific classes. We could do this by training two separate models, but it’s more efficient to leave out the class information for a subset of our training examples. We now have the option of effectively passing in no class or text information into our model, and getting back a vector field that points towards our data in general. For large t (when training data is far from the spiral), our two vector fields point in roughly the same direction. As t approaches zero, our vector fields diverge, with our cat-conditioned vector field pointing more towards the outer cat portion of our spiral.
We can use these differences to push our points more in the direction of the class we want. Specifically, take the conditioned vector and subtract the unconditioned vector, giving a new vector that points from the tip of the unconditioned vector to the tip of the conditioned one. Amplify this difference by a scaling factor alpha, add it back to the unconditioned vector, and use the result in place of our original conditioned vector.
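A sketch of this arithmetic at sampling time is below; the model signature is hypothetical (a single network that accepts an optional class or text embedding, with `None` meaning unconditioned), and the alpha value is just an example.

```python
import torch

def guided_direction(model, x, t, cond, alpha=2.0):
    v_cond = model(x, t, cond)      # vector field conditioned on the class/prompt
    v_uncond = model(x, t, None)    # vector field with no conditioning
    # Amplify the difference and add it back to the unconditioned prediction.
    return v_uncond + alpha * (v_cond - v_uncond)
```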
This approach is called classifier-free guidance. Using these new vectors to guide a set of cat points, we see a nice tight fit. Same for dog and people classes. Classifier-free guidance has become an essential part of many modern image and video generation models. If we add classifier-free guidance to Stable Diffusion's tree-in-the-desert example, once we reach a guidance scale alpha of around 2, we start to actually see a tiny tree. The size and detail of our tree improve as we increase alpha. As we use guidance to point our Stable Diffusion model's vector field more in the direction of our prompt, our tree literally grows in size and detail in our images.
Negative Prompts (33:39)
Our WAN video generation model takes this guidance approach one step further. Instead of subtracting the output of an unconditioned model with no text input, the WAN team uses what’s known as a negative prompt, where they specifically write out all the features they don’t want in their video, and then subtract the resulting vector from the model’s conditioned output and amplify the result, steering the diffusion process away from these unwanted features. Their standard negative prompt is fascinating, including features like extra fingers and walking backwards, and interestingly is actually passed into their text encoder in Chinese. Here’s a video generated using the same astronaut on a horse prompt we used earlier, but without the negative prompt — it’s really interesting to see how the parts of the scene get cartoonish and no longer fit together.
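The arithmetic is the same as classifier-free guidance, just with the unconditioned prediction replaced by one conditioned on an embedding of the unwanted features; in this sketch the embeddings and the guidance scale are hypothetical placeholders, not WAN's actual values.

```python
import torch

def negative_prompt_direction(model, x, t, prompt_emb, negative_emb, alpha=5.0):
    v_prompt = model(x, t, prompt_emb)      # conditioned on the prompt we want
    v_negative = model(x, t, negative_emb)  # conditioned on the unwanted features
    # Amplify the difference, steering away from the negative prompt.
    return v_negative + alpha * (v_prompt - v_negative)
```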
Outro (34:27)
Since the publication of the DDPM paper in the summer of 2020, the field has progressed at a blistering pace, leading to the incredible text-to-video models that we see today. Of all the interesting details that make these models tick, the most astounding thing to me is that the pieces fit together at all. The fact that we can take a trained text encoder from CLIP or elsewhere and use its output to actually steer the diffusion process, which itself is highly complex, seems almost too good to be true. And on top of that, many of these core ideas can be built from relatively simple geometric intuitions that somehow hold in the incredibly high dimensional spaces these models operate in. The resulting models feel like a fundamentally new class of machine. To create incredibly lifelike and beautiful images and video, you no longer need a camera, you don’t need to know how to draw or how to paint, or how to use animation software. All you need is language.
About guest videos (35:32)
So this, as you can no doubt tell, was a guest video. It comes from Stephen Welch, who runs the channel Welch Labs. A while back he made this completely iconic series about imaginary numbers. He has since turned it into a book, and consistent with everything he makes, it's just super high quality, with lots of exercises and good stuff like that. More recently he's been doing a lot of machine learning content, so I cannot recommend his stuff highly enough.
The context on why I'm doing guest videos at all is that very recently my wife and I had our first baby, which I'm very excited about. The way I decided to go about my paternity leave was to reach out to a few creators whose work I really enjoy and essentially ask: how would you feel about me pointing some of the Patreon funds that come towards this channel towards you, and commissioning pieces to fill the airtime while I'm away? We've got statistical mechanics, we've got machine learning, even some modern art. The next guest video is going to be about a combination of modern art and group theory. Bye!