OpenAI's latest text-to-video AI model, Sora, makes it possible to create videos from nothing more than a text prompt.
OpenAI, the artificial intelligence (AI) company behind ChatGPT, has introduced Sora, an innovative text-to-video generator that can create high-quality, hyper-realistic scenes up to a minute long from text instructions.
According to OpenAI’s recent blog post, Sora is able to “generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background.” The company also emphasises that Sora has the capability to “understand not only what the user has asked for in the prompt, but also how those things exist in the physical world.” It can also “accurately interpret prompts and generate compelling characters that express vibrant emotions.”
Further key capabilities of Sora include the following:
- Generate videos at high resolution (up to 1080p).
- Facilitate a range of video styles, aspect ratios, and resolutions.
- Use existing images and videos to extend or edit material.
- Demonstrate emerging simulation capabilities such as 3D consistency and long-term object permanence.

The Technical Foundations of Sora
Sora leverages two groundbreaking AI techniques that have shown significant success in recent years: diffusion models and transformers.
Diffusion models are generative models, which means they generate new data resembling the data on which they are trained. They work by progressively corrupting real training data with Gaussian noise, then learning to reverse this noising process to recover clean data.
Sora makes use of a specific type of diffusion model known as a denoising diffusion probabilistic model (DDPM). DDPMs divide the image/video generation process into many small denoising steps, making it easier to train the model to reverse the diffusion and produce clean samples.
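The forward (noising) half of a DDPM has a simple closed form. The sketch below is a minimal illustration of that process with NumPy, not Sora's actual implementation; the schedule values (1000 steps, betas from 1e-4 to 0.02) are the commonly used defaults from the DDPM literature, assumed here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear noise schedule: beta_t grows slowly from small to larger values.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # cumulative product used in the closed-form forward step

def forward_diffuse(x0, t):
    """Corrupt clean data x0 with Gaussian noise at step t (0-indexed).

    q(x_t | x_0) = N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)
    """
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return xt, noise

# A toy "clean sample": a 4x4 image of ones.
x0 = np.ones((4, 4))

# Early in the process the sample is barely corrupted...
x_early, _ = forward_diffuse(x0, t=10)
# ...while by the final step it is close to pure Gaussian noise.
x_late, _ = forward_diffuse(x0, t=T - 1)

print(np.sqrt(alpha_bars[10]))     # close to 1: signal mostly preserved
print(np.sqrt(alpha_bars[T - 1]))  # close to 0: signal essentially destroyed
```

Training then amounts to teaching a neural network to predict the added noise at each step, so that generation can run the chain in reverse, from pure noise back to a clean sample.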
Sora specifically employs a video variant of DDPM known as DVD-DDPM, which is tailored to model videos directly in the time domain while maintaining robust temporal consistency across frames. This temporal consistency is crucial to Sora's ability to generate coherent, high-quality videos.
Transformers are a type of neural network architecture designed to take a sequence (such as text) as input and produce a new sequence as output. They achieve this by learning context and tracking relationships between sequence components. For instance, given the input sequence “What is the Portuguese translation of Good Morning?”, a transformer model will use an internal mathematical representation to discern the relevance and connections between the words “translation” and “Portuguese” and generate the response, “The Portuguese translation of Good Morning is Bom Dia.”
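The mechanism that lets a transformer track those relationships is attention: each position in the sequence scores its relevance to every other position, and those scores weight a sum over the values. The toy self-attention sketch below uses random embeddings purely for illustration; it is the standard scaled dot-product formulation, not Sora's specific architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: every query scores every key,
    and the normalised scores weight a sum over the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise relevance between positions
    weights = softmax(scores, axis=-1)   # each row is a probability distribution
    return weights @ V, weights

# Toy example: 3 tokens, each a 4-dimensional embedding (random, for illustration).
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))
out, weights = attention(X, X, X)  # self-attention: Q = K = V = X

print(weights.shape)         # (3, 3): every token attends to every token
print(weights.sum(axis=-1))  # each row sums to 1
```

In a full transformer, Q, K, and V are learned linear projections of the input, and many such attention heads run in parallel across many layers.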
Sora leverages transformers to process visual data by inputting tokenised patches of video rather than textual tokens. This approach enables the model to comprehend spatial and temporal relationships within the video sequence. Moreover, Sora’s transformer architecture facilitates long-range coherence, object permanence, and other emerging simulation capabilities.
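Turning a video into transformer tokens can be pictured as cutting it into non-overlapping spacetime patches and flattening each patch into one token. The sketch below shows that idea with NumPy; the clip dimensions and patch sizes are hypothetical choices for illustration, not Sora's actual configuration (which also compresses video into a latent space first).

```python
import numpy as np

def video_to_patches(video, pt, ph, pw):
    """Split a video tensor (frames, height, width, channels) into
    non-overlapping spacetime patches, each flattened into one token."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)        # group the patch axes together
    return v.reshape(-1, pt * ph * pw * C)      # (num_tokens, token_dim)

# Hypothetical toy clip: 8 frames of 32x32 RGB video.
clip = np.zeros((8, 32, 32, 3))
tokens = video_to_patches(clip, pt=2, ph=8, pw=8)
print(tokens.shape)  # (64, 384): 4*4*4 patches, each 2*8*8*3 values
```

Because each token corresponds to a small block of space *and* time, attention over these tokens lets the model relate what happens in one region of one frame to distant regions and later frames, which is what supports long-range coherence and object permanence.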
Potential Uses of Sora
Social Media Content Creation
OpenAI’s Sora can be used to create short-form videos for social media channels like Instagram, YouTube Shorts, Facebook, X (formerly known as Twitter), and TikTok.
Education
Educational organisations and professionals can use Sora to create informative videos and tutorials, making complex topics more engaging and digestible for their audiences without requiring technical video-editing expertise.
Marketing and Advertising
Marketers can use Sora to craft compelling brand stories, promotional videos, and product demos, which are typically expensive to produce.
Virtual Reality (VR) and Augmented Reality (AR)
Sora’s video generation capabilities can enhance VR and AR experiences by creating lifelike environments, characters, and interactions, providing users with more immersive and engaging virtual worlds.
Architecture and Design
Sora can assist architects and designers in visualising and presenting their projects through realistic video renderings, helping clients better understand and visualise proposed designs.
Filmmaking
Filmmakers can use Sora to create rough visualisations or storyboards of scenes, helping them plan and refine their ideas before production begins.
The Risks of Sora
Sora’s primary strength is its ability to create realistic scenes that may not be achievable in real life. That same strength makes it possible to create “deepfakes,” in which real individuals or scenarios are altered to depict something that is not true. Sora also lacks a thorough understanding of physics, so its output may not always obey the physical rules of the real world.
This raises the risk of misinformation (the accidental dissemination of fake news) or disinformation (the deliberate sharing of incorrect information).
Further, the output of generative AI models is heavily influenced by the data they were trained on. Consequently, the same cultural biases or stereotypes present in the training data can manifest in the resulting videos.
Additionally, without proper control, Sora has the potential to deliver undesirable or inappropriate content, such as videos presenting violence, gory materials, or derogatory portrayals of certain groups.
Is Sora Accessible to the Public?
According to the blog post published by OpenAI, Sora is “becoming available to red teamers to assess critical areas for harms or risks.” The company is also granting access to a number of visual artists, designers, and filmmakers to solicit feedback on how to enhance the model to better serve the needs of creative professionals.
OpenAI also added: “At this time, we don’t have a timeline or additional details to share on Sora’s broader public availability. We’ll be taking several important safety steps, including engaging policymakers, educators, and artists around the world to understand their concerns and identify positive use cases for this new technology.”