OpenAI astounds the world once more with the first look at Sora


Yesterday, OpenAI unleashed their latest monstrosity on humanity, and it’s truly mind-blowing. I hope you enjoy a good existential crisis, because what you’re about to see is one small step for man and one giant leap for artificial kind. We all knew that better AI video models were coming, but OpenAI’s Sora just took things beyond our wildest expectations. It’s the first AI to make realistic videos up to a minute long. In today’s video, we’ll look at what this text-to-video model can actually do, figure out how it works under the hood, and pour one out for all the humans who may become obsolete.

It is February 16th, 2024, and you’re watching The Code Report. When I woke up yesterday, Google announced Gemini 1.5 with a context window of up to 10 million tokens. That was an incredible achievement that was also blowing people’s minds, but Sundar was quickly overshadowed by Sam Altman, who just gave us a preview of his new friend Sora, which comes from the Japanese word for sky. It’s a text-to-video model, and all the video clips you’re seeing in this video were generated by Sora. It’s not the first AI video model; we already have open models like Stable Video Diffusion and private products like Pika. But Sora blows everything out of the water. Not only are the images more realistic, but the videos can be up to a minute long and maintain cohesion between frames. They can also be rendered in different aspect ratios, and they can be created either from a text prompt describing what you want to see or from a starting image that gets brought to life.

My initial thought was that OpenAI cherry-picked all these examples, but it appears that’s not the case, because Sam Altman was taking requests from the crowd on Twitter and returning examples within a few minutes. Like two golden retrievers doing a podcast on top of a mountain – not bad. But this next one’s really impressive – a guy turning a non-profit open-source company into a profit-making closed-source company. Impressive, very nice.

So now you might be wondering how you can get your hands on this thing. Well, not so fast. If a model this powerful were given to some random person, one can only imagine the horrors it would be used for. It would be nice if we could generate video for our AI influencers for additional tips, but that’s never going to happen. It’s highly unlikely this model will ever be open source, and when they do release it, videos will carry C2PA metadata, which is basically a surveillance apparatus that keeps a record of where content came from and how it was modified.

In any case, we do have some details on how the model works. It likely takes a massive amount of computing power, and just a couple of weeks ago, Sam Altman asked the world for $7 trillion to buy a bunch of GPUs. Yeah, that’s trillion with a T. Even Jensen Huang made fun of that number, because it should really only cost around $2 trillion to get the job done. But maybe Jensen is wrong. It’s going to take a lot of GPUs for video models to scale. Let’s find out how they work.

Sora is a diffusion model, like DALL·E and Stable Diffusion, where you start with random noise and gradually update that noise into a coherent image. There’s a ton of data in a single still image. For example, a 1000×1000-pixel image with three color channels comes out to 3 million data points. But what if we have a one-minute video at 60 frames per second? Now we have over 10 billion data points to generate. Just to put that into perspective for your primate brain, 1 million seconds is about 11.5 days, while 10 billion seconds is about 317 years. So there’s a massive difference in scale.
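If you want to sanity-check those numbers yourself, here’s a quick back-of-the-envelope script. It has nothing to do with Sora’s actual pipeline; it just redoes the arithmetic above.

```python
# Rough figures only: raw pixel counts, ignoring any compression or latent encoding.
image_points = 1000 * 1000 * 3          # one 1000x1000 RGB frame -> 3 million values
video_points = image_points * 60 * 60   # 60 fps for 60 seconds -> ~10.8 billion values

seconds_per_day = 60 * 60 * 24
seconds_per_year = seconds_per_day * 365.25

print(f"{image_points:,} values per frame")              # 3,000,000
print(f"{video_points:,} values per minute of video")    # 10,800,000,000
print(f"1 million seconds ~ {1e6 / seconds_per_day:.1f} days")       # ~11.6 days
print(f"10 billion seconds ~ {10e9 / seconds_per_year:.0f} years")   # ~317 years
```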

Plus, video has the added dimension of time. To handle this data, they took an approach similar to large language models, which tokenize text. Sora doesn’t tokenize text, though, but rather visual patches. These are small compressed chunks of video that capture both what things look like and how they move frame by frame through time. What’s also interesting is that video models typically crop their training data and outputs to a fixed duration and resolution, but Sora can train on data at its native resolution and output variable resolutions as well. That’s pretty cool.
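OpenAI hasn’t released Sora’s code, but the general idea of spacetime patches can be sketched in a few lines of NumPy. Everything here is made up for illustration – the toy tensor shape, the patch sizes – and a real system would work on a compressed latent video rather than raw pixels. It just shows how a video gets carved into little blocks of space and time that can be fed to a transformer like tokens.

```python
import numpy as np

T, H, W, C = 16, 64, 64, 3           # frames, height, width, color channels (toy values)
pt, ph, pw = 4, 16, 16               # patch size in time, height, width (assumed)

video = np.random.rand(T, H, W, C)   # stand-in for a real video (or its latent)

# Carve the video into non-overlapping spacetime blocks, then flatten each block
# into a single vector ("patch token") capturing both appearance and motion.
patches = (
    video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
         .transpose(0, 2, 4, 1, 3, 5, 6)       # group block indices together
         .reshape(-1, pt * ph * pw * C)        # one row per spacetime patch
)

print(patches.shape)  # (64, 3072): 4 x 4 x 4 patches, each a 3072-dimensional vector
```

Because the patching just tiles whatever tensor you hand it, nothing in this scheme forces a fixed duration or resolution – which is the same reason Sora can reportedly work with variable aspect ratios and lengths.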

How is this technology going to change the world? Last year, tools like Photoshop got a whole new set of AI editing tools. In the future, we’ll be able to do the same in video. You might have a car driving down the road and want to change the background scenery. Now you can do that in 10 seconds instead of hiring a cameraman and a CGI expert. But another lucrative, high-paying career that’s been put on notice is Minecraft streaming. Sora can simulate artificial movement in Minecraft and potentially turn any idea into a Minecraft world in seconds. Or maybe you want to direct your own indie Pixar movie. AI makes that possible by stealing the artwork of talented humans. But it might not be easy. As impressive as these videos are, you’ll notice a lot of flaws if you look closely. They have that subtle but distinctive AI look about them, and they don’t perfectly model physics or humanoid interactions. But it’s only a matter of time before these limitations are figured out.

Although I’m personally threatened and terrified by Sora, it’s been a privilege and an honor to watch 10,000 years of human culture get devoured by robots. This has been The Code Report. Thanks for watching, and I will see you in the next one.