OpenAI Sora

This is an AI model that can create realistic and imaginative scenes from text instructions.

Feb 21, 2024 - 06:44

Feb 21, 2024 - 06:42


OpenAI, the company behind some of the most powerful AI tools, including ChatGPT and Dall-E 3, has unveiled its first video generator, Sora. I am not exaggerating when I say that my jaw dropped when I saw the first few videos generated by Sora.

What is Sora?

Sora is an AI model that generates videos from simple text prompts. It can produce up to a minute of high-fidelity video.

Image by OpenAI

Sora is a diffusion model, an advanced AI technique with a unique way of “learning.” During training, diffusion models start with clean data, such as images or videos, and gradually add noise until the original content is obscured.

The core of their power lies in reversing this process—learning to remove noise step-by-step until the original data is restored. This creates an AI system that can generate realistic results.
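To make the noising and denoising idea concrete, here is a minimal toy sketch in Python. It is not Sora's implementation, which OpenAI has not published in code form. The linear noise schedule, the function names, and the five-number "signal" are all assumptions chosen for illustration. Most importantly, the sketch cheats: it reuses the true noise where a real diffusion model would use a trained neural network's noise estimate.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Fraction of the clean signal's variance kept after each step
    (assumed linear noise schedule; hypothetical parameter values)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, alpha_bar, rng):
    """Forward process: blend clean data with Gaussian noise."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    keep = math.sqrt(alpha_bar)
    spread = math.sqrt(1.0 - alpha_bar)
    xt = [keep * x + spread * e for x, e in zip(x0, eps)]
    return xt, eps

def remove_noise(xt, predicted_eps, alpha_bar):
    """Reverse direction: undo the blend, given an estimate of the noise."""
    keep = math.sqrt(alpha_bar)
    spread = math.sqrt(1.0 - alpha_bar)
    return [(x - spread * e) / keep for x, e in zip(xt, predicted_eps)]

rng = random.Random(0)
clean = [0.0, 0.5, 1.0, 0.5, 0.0]   # stand-in for pixels of an image or frame
alpha_bars = make_alpha_bars(1000)

# After the final step almost none of the original signal is left.
noisy, true_eps = add_noise(clean, alpha_bars[-1], rng)

# ASSUMPTION: a trained network would predict the noise; we pass the true noise.
restored = remove_noise(noisy, true_eps, alpha_bars[-1])
```

With a perfect noise estimate, `restored` matches `clean` up to floating-point error. A real model's estimate is imperfect, which is why practical systems remove the noise over many small steps rather than in a single jump.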

To guide generation, Sora uses GPT (the technology behind ChatGPT) to expand simple text prompts into detailed descriptions tailored for video generation. This helps even brief ideas translate into visually rich, accurate results.

Here are a few examples

Let’s cut to the chase—here are some prompts and sample videos demonstrating Sora’s remarkable abilities.

Prompt: A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors.

Prompt: The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope, dust kicks up from it’s tires, the sunlight shines on the SUV as it speeds along the dirt road, casting a warm glow over the scene. The dirt road curves gently into the distance, with no other cars or vehicles in sight. The trees on either side of the road are redwoods, with patches of greenery…

We already know that OpenAI’s chatbots can pass the bar exam without going to law school. Now, just in time for the Oscars, a new OpenAI app called Sora hopes to master cinema without going to film school. For now a research product, Sora is going out to a few select creators and a number of security experts who will red-team it for safety vulnerabilities. OpenAI plans to make it available to all wannabe auteurs at some unspecified date, but it decided to preview it in advance.

Other companies, from giants like Google to startups like Runway, have already revealed text-to-video AI projects. But OpenAI says that Sora is distinguished by its striking photorealism—something I haven’t seen in its competitors—and its ability to produce longer clips than the brief snippets other models typically do, up to one minute. The researchers I spoke to won’t say how long it takes to render all that video, but when pressed, they described it as more in the “going out for a burrito” ballpark than “taking a few days off.” If the hand-picked examples I saw are to be believed, the effort is worth it.

OpenAI didn’t let me enter my own prompts, but it shared four instances of Sora’s power. (None approached the purported one-minute limit; the longest was 17 seconds.) The first came from a detailed prompt that sounded like an obsessive screenwriter’s setup: “Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous sakura petals are flying through the wind along with snowflakes.”

AI-generated video made with OpenAI's Sora. COURTESY OF OPENAI

The result is a convincing view of what is unmistakably Tokyo, in that magic moment when snowflakes and cherry blossoms coexist. The virtual camera, as if affixed to a drone, follows a couple as they slowly stroll through a streetscape. One of the passersby is wearing a mask. Cars rumble by on a riverside roadway to their left, and to the right shoppers flit in and out of a row of tiny shops.

It’s not perfect. Only when you watch the clip a few times do you realize that the main characters—a couple strolling down the snow-covered sidewalk—would have faced a dilemma had the virtual camera kept running. The sidewalk they occupy seems to dead-end; they would have had to step over a small guardrail to a weird parallel walkway on their right. Despite this mild glitch, the Tokyo example is a mind-blowing exercise in world-building. Down the road, production designers will debate whether it’s a powerful collaborator or a job killer. Also, the people in this video—who are entirely generated by a digital neural network—aren’t shown in close-up, and they don’t do any emoting. But the Sora team says that in other instances they’ve had fake actors showing real emotions.

The other clips are also impressive, notably one asking for “an animated scene of a short fluffy monster kneeling beside a red candle,” along with some detailed stage directions (“wide eyes and open mouth”) and a description of the desired vibe of the clip. Sora produces a Pixar-esque creature that seems to have DNA from a Furby, a Gremlin, and Sully in Monsters, Inc. I remember when that latter film came out, Pixar made a huge deal of how difficult it was to create the ultra-complex texture of a monster’s fur as the creature moved around. It took all of Pixar’s wizards months to get it right. OpenAI’s new text-to-video machine … just did it.

“It learns about 3D geometry and consistency,” says Tim Brooks, a research scientist on the project, of that accomplishment. “We didn’t bake that in—it just entirely emerged from seeing a lot of data.”

AI-generated video made with the prompt, “animated scene features a close-up of a short fluffy monster kneeling beside a melting red candle. the art style is 3d and realistic, with a focus on lighting and texture. the mood of the painting is one of wonder and curiosity, as the monster gazes at the flame with wide eyes and open mouth. its pose and expression convey a sense of innocence and playfulness, as if it is exploring the world around it for the first time. the use of warm colors and dramatic lighting further enhances the cozy atmosphere of the image.” COURTESY OF OPENAI

While the scenes are certainly impressive, the most startling of Sora’s capabilities are those that it has not been trained for. Powered by a version of the diffusion model used by OpenAI’s Dall-E 3 image generator as well as the transformer-based engine of GPT-4, Sora does not merely churn out videos that fulfill the demands of the prompts, but does so in a way that shows an emergent grasp of cinematic grammar.

That translates into a flair for storytelling. Discussing another video, created from a prompt for “a gorgeously rendered papercraft world of a coral reef, rife with colorful fish and sea creatures,” Bill Peebles, another researcher on the project, notes that Sora created a narrative thrust through its camera angles and timing. “There's actually multiple shot changes; these are not stitched together, but generated by the model in one go,” he says. “We didn’t tell it to do that, it just automatically did it.”

AI-generated video made with the prompt “a gorgeously rendered papercraft world of a coral reef, rife with colorful fish and sea creatures.” COURTESY OF OPENAI

In another example I didn’t view, Sora was prompted to give a tour of a zoo. “It started off with the name of the zoo on a big sign, gradually panned down, and then had a number of shot changes to show the different animals that live at the zoo,” says Peebles. “It did it in a nice and cinematic way that it hadn't been explicitly instructed to do.”

One feature in Sora that the OpenAI team didn’t show, and may not release for quite a while, is the ability to generate videos from a single image or a sequence of frames. “This is going to be another really cool way to improve storytelling capabilities,” says Brooks. “You can draw exactly what you have on your mind and then animate it to life.” OpenAI is aware that this feature also has the potential to produce deepfakes and misinformation. “We’re going to be very careful about all the safety implications for this,” Peebles adds.

Expect Sora to have the same restrictions on content as Dall-E 3: no violence, no porn, no appropriating real people or the style of named artists. Also as with Dall-E 3, OpenAI will provide a way for viewers to identify the output as AI-created. Even so, OpenAI says that safety and veracity are an ongoing problem that's bigger than one company. “The solution to misinformation will involve some level of mitigations on our part, but it will also need understanding from society and for social media networks to adapt as well,” says Aditya Ramesh, lead researcher and head of the Dall-E team.

AI-generated video made with the prompt “several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance, mid afternoon light with wispy clouds and a sun high in the distance creates a warm glow, the low camera view is stunning capturing the large furry mammal with beautiful photography, depth of field.” COURTESY OF OPENAI

Another potential issue is whether the content of the video Sora produces will infringe on the copyrighted work of others. “The training data is from content we’ve licensed and also publicly available content,” says Peebles. Of course, a number of lawsuits against OpenAI hinge on the question of whether “publicly available” copyrighted content is fair game for AI training.

It will be a very long time, if ever, before text-to-video threatens actual filmmaking. No, you can’t make coherent movies by stitching together 120 of the minute-long Sora clips, since the model won’t respond to prompts in the exact same way—continuity isn’t possible. But the time limit is no barrier for Sora and programs like it to transform TikTok, Reels, and other social platforms. “In order to make a professional movie, you need so much expensive equipment,” says Peebles. “This model is going to empower the average person making videos on social media to make very high-quality content.”

As for now, OpenAI is faced with the huge task of making sure that Sora isn’t a misinformation train wreck. But after that, the long countdown begins until the next Christopher Nolan or Celine Song gets a statuette for wizardry in prompting an AI model. The envelope, please!