OpenAI’s Sora Is a Total Mystery

Yesterday afternoon, OpenAI teased Sora, a video-generation model that promises to convert written text prompts into highly realistic videos. Footage released by the company depicts such examples as “a Shiba Inu dog wearing a beret and black turtleneck” and “in an ornate, historical hall, a massive tidal wave peaks and begins to crash.” The excitement from the press has been reminiscent of the buzz surrounding the image creator DALL-E or ChatGPT in 2022: Sora is described as “eye-popping,” “world-changing,” and “breathtaking, yet terrifying.”

The imagery is genuinely impressive. At a glance, one example of an animated “fluffy monster” looks better than Shrek; an “extreme close up” of a woman’s eye, complete with a reflection of the scene in front of her, is startlingly lifelike. But Sora is also shrouded in mystery. Nobody outside a select group of safety testers and artists approved by OpenAI can use the program yet (although Sam Altman, the company’s CEO, has been taking Sora prompt requests on social media and posting the results). The model could very well bring about the fantasies people are already floating. Perhaps it will be an imagination engine, a cinematic revolution, or a misinformation machine. But for now, it’s best viewed as a provocation or an advertising blitz.

Although many of these products are spun as powerful enough to upend our conception of the world—or to destroy it outright—companies such as OpenAI tend not to detail their inner workings. (A recent study gave 10 major tech companies, including OpenAI, a failing grade on an AI-transparency index.) The MIT Technology Review was given a preview of sample videos generated by Sora only after agreeing to what its journalists called the “unusual” condition that they would not seek outside opinions until after OpenAI announced the product; initially, no research paper accompanied the release.

The technical report that OpenAI later published contains brief, generic descriptions that are sparse on, well, technical details. This is far from the first text-to-video model (Meta unveiled one in September 2022, about two months before ChatGPT’s release), but right now, without the ability of people outside the company to study or test Sora, knowing how it builds upon or compares with previous products is impossible. What is apparent from the report is that, similar to the start-up’s language models, the more computing power that OpenAI pumped into Sora, the higher quality its outputs became—a ghoulish blob of fur becomes a photorealistic, adorable pup when generated with 16 times the resources. Beyond any technological breakthrough, Sora may be the latest, and perhaps most spectacular, result of the billions of dollars in OpenAI’s coffers—a victory of scale as much as innovation.

A spokesperson for OpenAI told me in a written statement that the company is “sharing our research progress early to start working with and getting feedback from people outside of OpenAI and to give people a sense of what AI capabilities are on the horizon.” Asked about training data, the spokesperson would only specify that the model is trained on “licensed and publicly available content”; asked about potential harms, she said the company is still working to address “misinformation, hateful content, and bias.”

OpenAI is not alone in its secrecy. Also yesterday, Google announced an updated version of its flagship language model, Gemini 1.5, hailing it as a “breakthrough.” But nobody beyond a small group of developers and major corporate customers would be able to test its most advanced capabilities. Plenty of other AI products are also released without much accompanying information.

We do know, however, that demos of AI products tend to contain flaws, some minor and some embarrassing, and Sora is no exception. By OpenAI’s own admission, it struggles with depicting physics, cause and effect (the company says that you might ask for a video of a person biting into a cookie, only to notice that no bite mark is left behind), and other simple details (a man is shown running the wrong way on a treadmill). Internet sleuths have uncovered still other failures, such as disappearing objects and misshapen hands. Nonetheless, the product appears astonishing—which, for all the excitement, raises exceedingly familiar yet serious concerns over deepfakes, copyright infringement, artists’ livelihoods, hidden biases, and more.

Meanwhile, the internet swirls with paparazzi-esque theories and observations: guesses about how Sora works; insinuations that Sora is not generating new things but copying existing videos; comparisons showing similarities between its videos and the outputs of a leading text-to-image model. These claims, for now, cannot be proved right or wrong. The public still barely understands the inner workings of DALL-E and ChatGPT, but at least we can test those products’ capabilities for ourselves; with Sora’s announcement, OpenAI has entered the realm of mythmaking.


