Sora! ! !

Recently, there has been a craze about Sora on the Internet. As the latest technology launched by OpenAI, Sora gives the magic of text-generated videos, and the results it demonstrates are impressive.

At present, the appeal of short videos has far exceeded traditional novels and graphic comics. Therefore, the advent of Sora may trigger a revolution in the field of video production.

The charm of Sora is that it can generate up to 60 seconds of video content based on text descriptions, which includes detailed scene settings, lifelike character expressions, and smooth camera transitions.

This technology enables the creation of diverse characters, specific actions, and a high degree of consistency with description in terms of themes and backgrounds. Sora not only accurately understands the user’s instructions, but also has deep insights into how these elements should appear in the real world.

Sora demonstrates a deep understanding of language to accurately capture user intent, creating video content that is both vivid and emotionally charged. It can even present multiple scenes in the same video while maintaining character coherence and visual style unity.

However, Sora is not flawless. It still needs to be improved in terms of simulating physical effects in complex scenarios and understanding specific cause-and-effect relationships. For example, a character in the video might take a bite of a cookie without leaving any noticeable mark on the cookie.

In addition, Sora may also show certain limitations when processing spatial details, such as distinguishing directions, or describing specific events over a period of time, such as the movement trajectory of a camera.

**To put it simply, Sora is a technology that can generate videos of up to 60 seconds using text. It can also be used to generate pictures, because pictures are essentially one frame of video. **

This article will start from Sora’s architecture, then Sora’s ecology, and finally how ordinary people or developers can use Sora to prepare for this AI wave~

Sora’s Architecture and Innovation

Sora represents a major innovation in AI video generation technology. It is significantly different in architecture from previous diffusion model-based systems such as Runway and Stable Diffusion. The core point is that Sora uses the Diffusion Transformer model, which is an advanced architecture that combines the diffusion model and the Transformer model, bringing unprecedented flexibility and quality improvement to video generation.

Architecture comparison

  • Runway/Stable Diffusion: These systems are based on the diffusion model and produce clear images by gradually adding noise to the image and then gradually removing the noise. While this process is capable of producing high-quality images, it has limitations in video generation, especially when it comes to processing long videos and maintaining video consistency.
  • Sora: Sora uses the Diffusion Transformer model to process noisy input images through the Transformer’s encoder-decoder architecture and predict a clearer image version. This not only improves the efficiency of image processing, but also achieves significant progress in video generation. The innovation of Sora is that the basic unit it processes is not a token of text, but a “Patch” of video, that is, a color block that changes over time. This allows Sora to process videos of any size and aspect ratio without pre-cropping or adjustment.

Innovative Applications

Sora’s architecture enables it to use more data and computing resources during training, resulting in higher quality output. This method not only avoids the original composition loss problem that may be caused by video preprocessing, but also because it can receive any video as training input, Sora’s output will not be affected by poor composition of the training input. In addition, Sora demonstrates the ability to simulate complex physical phenomena such as liquid dynamics, thanks to the physical rules contained in the large amounts of video data it uses during training.

Research basis and inspiration

The development of Sora was inspired by two papers, “Scalable Diffusion Models with Transformers” and “Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution”. These studies came from Google and were published shortly after the Sora project was launched. . These studies provide the theoretical basis and technical details of the Sora architecture, laying a solid foundation for the development of Sora and future AI video generation technology.

By combining the diffusion model and the Transformer model, Sora not only achieved a technological breakthrough, but also opened up new possibilities for video production and AI applications, indicating that the future of AI in film and television production, content creation and other fields will be broader and deeper.

What are the upgrades to Sora and previous AI video generation tools

The emergence of Sora in the field of AI video generation marks an important milestone in technological progress. Compared with earlier AI video generation tools, Sora introduces a series of innovations and upgrades that not only improve the quality of video generation, but also greatly expand the possibilities for video creation. The following are the main upgrades and optimizations between Sora and previous AI video generation tools:

Improve the quality and stability of generated videos

Sora’s technological advancements are primarily reflected in its ability to generate high-quality videos. Compared with previous tools, the video generated by Sora can be up to 60 seconds long, while supporting camera switching, ensuring the stability of the characters and background in the picture, and achieving high-quality output. These improvements mean videos generated using Sora are more realistic and provide a better viewing experience, providing users with richer and more dynamic visual content.

Innovative technical architecture: Diffusion Transformer model

Sora is able to achieve the above advantages thanks to its innovative technology architecture based on the Diffusion Transformer model. This architecture combines the advantages of the diffusion model and the Transformer model, allowing Sora to not only generate text content, but also predict and generate so-called “spatio-temporal patches”. These spatio-temporal patches can be understood as a small segment in the video, containing several frames of video content. This method makes Sora not limited by video length and graphics card performance during the training process. The generation process is more flexible and diverse, and it can combine different spatiotemporal patches to create new video content.

Enhanced flexibility and diversity

Compared with tools such as Pika based on the Diffusion model or LLM and ChatGPT based on the Transformer model, Sora’s technical architecture gives it higher flexibility and diversity. Pika is limited by graphics card performance when processing video content, and its main modes focus on video expansion or style transfer based on image keyframes. Sora, through its unique model, can create richer and more varied video content without being limited to specific video resolution or length.

Sora’s computing power requirements

Before discussing the cost and computing power requirements of Sora, we need to understand that the cost and computing power requirements of AI video generation technology, especially advanced models like Sora, are determined by a variety of factors. These factors include, but are not limited to, the complexity of the model, the resolution of the generated content, the length of the video, and the required generation quality. The following is a professional and detailed analysis of the cost and computing power requirements of Sora.

Basics of Cost Estimation

Before estimating the cost of generating a 60-second video with Sora, we looked at the pricing models of existing AI generation technologies. For example, DALL-E 3’s HD image generation costs $0.08 per generation, while Runway Gen-2’s video generation service charges $0.05/second. These prices provide a general range of pricing for AI generation services.


DALL-E 3 is the latest generation of AI image generation model developed by OpenAI, which is a subsequent version of the DALL-E series. This AI uses deep learning to generate high-resolution images. Users only need to provide short text descriptions, and DALL-E 3 can create corresponding images based on these descriptions. This model demonstrates impressive creativity and understanding, able to handle complex concepts and abstract thinking, generating images in a variety of styles and themes. DALL-E 3 has wide application potential in many fields such as art creation, design exploration, education and entertainment.

Runway Gen-2

Runway Gen-2 is an AI video generation tool launched by RunwayML, which enables users to easily create and edit video content through AI technology. Runway Gen-2 provides a series of AI-based video editing functions, such as real-time video synthesis, style conversion, content generation, etc. These tools allow users to convert text descriptions into video scenes, or to stylize and edit existing video footage. Runway Gen-2 is designed to simplify the video creation process and lower the threshold for producing high-quality video content. It is suitable for film and television production, advertising creativity, digital art and other fields.

Sora’s computing power requirements

Sora’s technical documents or promotional materials have not clearly disclosed its computing power requirements. However, based on the technical architecture it adopts - combining the diffusion model and the Transformer model - we can reasonably speculate that Sora’s demand for computing power is relatively high. Assume that Sora requires about 8 NVIDIA A100 GPUs for inference, which are some of the most high-end computing cards in the industry and are designed for deep learning and AI tasks.

Cost Estimate

According to the assumption, if Sora’s inference requires approximately 8 A100 GPUs, we can estimate it by referring to the GPU rental cost of cloud computing services. Assuming a cloud rental cost of $3 per hour per A100 GPU (this is an assumption and actual costs may vary by vendor and region), the Sora runtime costs approximately $24 per hour.

If Sora takes one minute to generate a one-minute video, the direct computing power cost per minute of video is approximately $0.4. However, this does not include other potential costs such as software usage fees, data storage and transfer fees, and any additional processing time.

Comprehensive estimate and market pricing

In summary, if software usage fees and other operating costs are taken into account, we can speculate that the cost of Sora generating a 60-second video may be higher than the direct computing power cost. If we estimate that half an hour costs about $10 (which is a very rough estimate), the video cost per second is about $0.33. This price may be adjusted based on the actual resources used and the service pricing strategy.

Future Generated Music

Currently, DALL-E 3 and Runway Gen-2 mainly focus on visual content generation of images and videos. Although they have not yet been directly applied to music (audio) generation, there are several problems that may be faced in realizing this function in the future:

  1. Matching of environment and object sounds: Each environment and object in the video may make a unique sound. The AI needs to understand the characteristics of these environments and objects, and how they interact (such as the sound of collisions between objects), in order to generate matching sounds.
  2. Sound Source Superposition: Sound in the real world is often the result of the superposition of multiple sound sources. AI needs to be able to handle this complexity and synthesize multi-layered audio landscapes.
  3. Integration of music and scenes: Music or background music not only needs to be of high quality, but also needs to be closely integrated with the scenes, emotions and rhythms in the video, which places higher demands on AI understanding and creativity.
  4. Synchronization of character dialogue: For videos containing character dialogue, AI needs to generate audio that is not only accurate in content, but also closely aligned with the character’s position, mouth shape, and expression. This requires complex models and algorithms. accomplish.

How to use it?

Overview of usage

Similar to ChatGPT, it is expected that users do not need to deploy and set up in the local environment, but can access and use the service in the following two convenient ways:

  1. ChatGPT integration: Users can use this function directly through the ChatGPT interface, such as GPTS, to achieve a seamless video generation experience. This integration method will provide users with a simple and intuitive operation interface, and they can customize and generate video content through text commands.
  2. API call: In order to meet the customized needs of developers and enterprise users, it is expected that API interfaces will also be provided. Through API calls, users can integrate video generation functions into their own applications, services or workflows to achieve a higher degree of automation and personalization.

Costs and Usage Limitations

Due to the high cost and long processing time of video generation, you may encounter the following limitations when using this service:

  • Number of times limit: In order to ensure the sustainability of the service, there may be certain limits on the number of times users can use it. This may be in the form of daily or monthly usage caps to balance user demand and resource consumption.
  • Advanced Subscription Service: In order to meet the needs of some users for higher frequency or higher quality video generation, a higher level subscription service may be launched. Such services may offer higher usage limits, faster processing, or more customization options.

Gradually release the plan

It is expected that the availability and functionality of this service will be gradually released within the next three months to six months.

The market size will be huge, triggering a new wave of AI~

Longer video

As the length of video generation increases, the demand for video memory also increases. However, considering the rapid progress of current technology development, we can optimistically predict that within a year, the technology will be able to support the generation of videos up to 5 to 10 minutes long. For longer videos, such as 30 minutes or 60 minutes, this is expected to be implemented within the next 3 years.

Video generation and the resulting copyright ownership issues are hot topics in today’s technical and legal discussions. When a video is generated based on an image or text, copyright is generally considered to belong to the original content creator who created the video. However, this principle applies only if the resulting work itself does not infringe the copyright of others.

  • Creators Rights: In AI based images orIn the case of text-generated video, if the original input content (image or text) is original by the creator, then the copyright of the generated video should belong to the creator. This is because the generation process is considered a technical means, and the copyright of the creative and original content belongs to the creator.
  • Non-infringement principle: Although the creator owns the copyright to the original input content, the generated video still needs to comply with the basic principles of copyright law, that is, it cannot infringe the copyright of any third party. This means that even if the video is generated by AI, any copyrighted material used in it must be licensed accordingly or comply with fair use principles.

Practical Challenge

In practice, determining the copyright ownership of AI-generated works may encounter a series of challenges, especially when the original input materials or generation algorithms involve the rights of multiple parties. In addition, different countries and regions may have different legal interpretations and practices regarding the copyright ownership of AI-generated works, which brings additional complexity to creators and users.

I personally speculate that copyright issues will be a big direction in the future.

Someone uses AI to defraud and forge?

With the development of AI technology, especially advanced video generation tools like Sora, we are faced with the problem of increasingly blurred boundaries between virtual content and real content. This is not only about how to distinguish which videos were shot for real and which were produced using tools like Sora, but also about the nature of authenticity in the future and how we deal with the potential risks posed by deepfakes.

The difference between virtual and reality

As the quality of AI-generated videos gets higher and higher, it becomes more difficult to distinguish which content was actually shot and which was AI-generated. However, technological advancements also mean that more accurate detection tools will be developed to identify AI-generated videos. Currently, video content is often embedded with watermarks to identify its source, and it is expected that more advanced tagging and verification technologies will be available in the future to help distinguish virtual and real content.

Deepfakes Challenge

The development of deepfake technology makes fake content easier to produce, thereby increasing the risk of being defrauded. However, just like photography and film and television production techniques throughout history, the public’s ability to discern such content continues to improve. Although the current AI technology may not be perfect in some details, such as the generated ants with only four legs, or errors such as deformation of the character’s hands, these illogical places provide clues to identify the content generated by the AI.

Countermeasures and future directions

Faced with the problem of deep forgery, the game between forgery and anti-counterfeiting will be a long-term process. In addition to developing more accurate detection tools, educating the public on how to identify fake content and improving their media literacy are key to meeting this challenge. In addition, as technology develops and laws and regulations improve, we may see more standards and protocols for content authenticity verification being established, aiming to protect consumers from the potential harm of deepfake content.

What is the future direction of Sora?

With the rapid development of artificial intelligence technology, Sora, as a cutting-edge AI video generation tool, has full of expectations for its future development prospects and evolution trends. The following are some imaginations and predictions for Sora’s next development:

A revolution in cost and efficiency

With algorithm optimization and hardware advancement, the cost of generating videos with Sora is expected to be significantly reduced, while the generation speed will be significantly accelerated. This means that the production of high-quality videos will become faster and more economical, providing small and medium-sized enterprises and even individual creators with previously unimaginable video production capabilities. This revolution in cost and efficiency will further democratize the creation of video content, inspiring more innovation and creative expression.

Comprehensive upgrade of quality and functionality

In the future, Sora will not only improve the image quality and video duration, but also achieve a qualitative leap in lens switching, scene consistency, and compliance with physical laws. AI will be able to more accurately understand and simulate the physical laws of the real world, making the generated video content almost indistinguishable from real-life content. In addition, this ability of AI will be further expanded to simulate subtle human expressions and complex natural phenomena, providing audiences with an unprecedented visual experience.

Sound and multi-modal fusion

We can foresee that it will not be limited to the generation of visual content. Combined with advanced sound synthesis technology, Sora will be able to generate sound effects and background music that perfectly match the video, and even achieve natural flow of character dialogue. Furthermore, the deep integration with text generation models such as GPT will unlock complete multi-modal interaction capabilities and realize all-round content generation from text description to visual, auditory and even more sensory dimensions. This multi-modal integration will greatly expand the application prospects of AI in education, entertainment, virtual reality and other fields.

Sora application scenarios

Sora’s application scenarios and practicality cover a wide range of fields, and its commercial application value cannot be underestimated. The following is a comprehensive analysis of Sora’s value and applications:

Enhance personal expression skills

Sora is like a comprehensive expression tool that greatly expands one’s creative and expressive abilities. Just as cars expand people’s mobility, ChatGPT expands people’s writing and communication abilities, Sora expands people’s visual and emotional expression capabilities through the medium of video. It allows ordinary people without professional writing, painting, photography, or video editing skills to express their thoughts and emotions like never before, resulting in richer, more intuitive communication.

Reduce video production costs

As a low-cost video generation tool, Sora provides great value to video creators. It lowers the threshold for video production, allowing more people to produce high-quality video content at a lower cost. This is not only beneficial for individual creators, but also provides small businesses and educational institutions with the possibility to produce professional-grade videos, thus broadening the application field in many aspects such as marketing, teaching and content creation.

Innovative human-computer interaction method

Sora opens up a new human-computer interaction model, especially showing great potential in dynamic video content generation. It can generate game plots, tasks and scenes in real time according to user instructions, providing unlimited content and experience for games and virtual reality. In addition, Sora can also dynamically convert news and articles into videos, providing a more intuitive and attractive form for information consumption, which is of great significance for improving the efficiency and effect of information reception.

Emotional connection and memory retention

Sora has unique value in emotional connection and memory retention.

By generating videos of deceased loved ones, it provides a new way for people to honor and preserve the memory of their loved ones.

As a digital companion, Sora can create avatars with personalized characteristics, provide users with emotional support and companionship, and open up a new dimension of interaction with the digital world.

Sora’s money-making logic

Sora’s future market is very large, involving every industry and every field

  • Emotional sustenance and entertainment services: Sora can provide customized video content, including courses to relieve anxiety, provide entertainment content, and even create memory videos of deceased relatives, all of which have highly personalized needs and emotional value , users are willing to pay for this unique experience.
  • Microfilm Production: Sora can generate microfilm-level content at low cost and high efficiency, providing powerful creative tools for independent film and television producers and artists. Through copyright sales, participation in film festivals, etc., the artistic works generated by Sora can be commercialized.
  • Content Creation and Secondary Creation: Sora can help content creators and novelists transform text content into visual content, providing new narrative methods and viewing experiences. By selling materials, providing teaching content, storytelling videos, etc., Sora can bring new sources of income to the education and entertainment industries.
  • Game content generation and advertising: Sora can dynamically generate game plots and scenes, providing unlimited possibilities for game development. At the same time, the advertising videos generated by Sora can be provided to e-commerce and brand owners to achieve rapid market verification and product promotion.
  • Tools and Platform Ecosystem: By providing easy-to-use prompts and widgets, Sora can build an ecosystem around video generation, attracting developers and creators to participate. This ecosystem can not only bypass existing production restrictions, but also provide users with more creative freedom and possibilities, thereby creating revenue models such as subscription services and platform usage fees.
  • Rapid Prototyping Verification and Commercial Application: Sora can help companies and entrepreneurs quickly verify product and service concepts and reduce initial investment costs by generating prototype videos. In areas such as advertising, e-commerce, and even film shot production, Sora’s application can significantly improve efficiency and reduce costs, creating direct economic value for business users.

How do ordinary people use it well? Use Sora to do a side job

  • Use it, learn how to use it, know what it can do and where its boundaries are.
  • Choose a direction that suits you and prepare relevant materials or development projects in advance
  • Technical staff can prepare to start preparing products and tools: collecting prompts and secondary development based on APIs

Sora Other discussions

Origin of name

Sora’s name is likely derived from the opening song of the anime “Tengen Breakthrough”, “Sora Shiro”, reflecting the project team’s pursuit of creativity and breaking through limitations.

Practicality and Popularity

Sora’s popularity is not only due to the conceptual hype of financing and stock price. It is indeed a technology with practical value and can already be applied to generate high-quality short video content, such as OpenAI’s display on TikTok accounts.

Competitiveness and Development

Sora has strong competitiveness on a global scale, and OpenAI’s technology and model advantages are significant. Although China is developing rapidly in this field, it is currently mainly led by large enterprises. The gap between China and Europe and the United States mainly lies in the in-depth application of computing power and AI technology.

Industrial Revolution

The emergence of Sora is considered an epoch-making technology in the field of text-to-video generation, heralding the possibility of a new round of industrial revolution. Although there have been many highly sought-after technologies in history, such as web3, blockchain, etc., Sora’s practicality and innovation make people optimistic about its epoch-making definition.

Silicon Valley Circle

Sora has received positive reviews in Silicon Valley and the industry. Although this may lead to more cautious investments in certain directions, it also encourages entrepreneurs and developers to explore new application directions and innovative models.

Chip and computing power requirements

With the development of video generation technology, the demand for computing power continues to grow, which is expected to promote more companies to participate in the development and production of graphics cards, promote the diversification of computing resources and improve performance.

Sora’s discussion and analysis reflect its far-reaching potential in technological innovation, commercial applications and social impact, and also remind the industry of the importance of continuous observation and rational evaluation of emerging technologies.

about Us

Welcome to SoraEase, we are an open source community dedicated to simplifying the application of Sora AI video generation technology. SoraEase aims to provide a fast and efficient usage and development platform for Sora enthusiasts and developers to help everyone easily master Sora technology, inspire innovation, and jointly promote the development and application of video generation technology.

At SoraEase we offer:

  • Sharing of the latest Sora application cases and technical research
  • Rapid development tools and resources for Sora Technologies
  • Q&A and discussion on the development and use of Sora
  • Rich Sora technical community activities and online communication opportunities

We believe that through the power of the community, Sora technology can be made more accessible and easier to use, allowing everyone to create stunning AI video content.

Community Resources

  • GitHub address: SoraEase GitHub
  • Join our community: Add Wechat nsddd_top and reply sora to join the group. In our WeChat community, you can get Sora’s latest consultation and technology sharing, and it is also a communication platform for Sora enthusiasts and developers.

We look forward to your joining and exploring the infinite possibilities of Sora technology!