Chinese AI company MiniMax has introduced its latest text-to-speech (TTS) model, Speech-02. The model has quickly risen to the top of public TTS leaderboards, ranking ahead of models from industry leaders such as OpenAI and ElevenLabs. Speech-02 focuses on three areas of voice synthesis: voice cloning, emotional expression, and multilingual support.
Revolutionizing Text-to-Speech with Speech-02
MiniMax’s Speech-02 is an autoregressive Transformer-based TTS model built around a learnable speaker encoder. The encoder extracts timbre features from reference audio without requiring a transcription, which enables high-fidelity zero-shot voice cloning. The model also supports one-shot voice cloning, which achieves very high similarity to the reference voice. A Flow-VAE component further improves the overall quality of the synthesized audio.
Multilingual Mastery and Emotional Intelligence
Speech-02 supports 32 languages and delivers high-quality speech synthesis across them. Its robust, disentangled speaker representations also enable several extensions: arbitrary voice emotion control via Low-Rank Adaptation (LoRA); text-to-voice synthesis, in which timbre features are generated directly from a text description; and professional voice cloning, in which timbre features are fine-tuned with additional data.
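To make the LoRA idea concrete, here is a minimal sketch of the generic low-rank update, not MiniMax's implementation. How Speech-02 applies LoRA to emotion control is not described here, so the layer shapes, the names `W`, `A`, `B`, and `alpha`, and the toy numbers are all assumptions chosen for illustration.

```python
# Generic LoRA forward pass: y = W x + (alpha / r) * B (A x)
# W is a frozen d_out x d_in weight; A (r x d_in) and B (d_out x r) are the
# small trainable matrices. All names and values here are illustrative.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha):
    """Apply the frozen layer plus the scaled low-rank correction."""
    rank = len(A)                      # r = number of rows in A
    base = matvec(W, x)                # output of the frozen layer
    delta = matvec(B, matvec(A, x))    # low-rank path: project down, then up
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]


# Toy example: a 2x2 identity layer with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
x = [1.0, 2.0]

# With B initialised to zero, the adapter leaves the base model unchanged.
B_zero = [[0.0], [0.0]]
y_base = lora_forward(x, W, A, B_zero, alpha=1.0)

# After training, B is non-zero and shifts the output.
B_tuned = [[0.5], [-0.5]]
y_adapted = lora_forward(x, W, A, B_tuned, alpha=1.0)
```

Because only `A` and `B` are trained, one plausible design is a separate small adapter per emotion that can be swapped in at inference time while the base weights stay fixed. Whether Speech-02 is organised this way is an assumption.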
The model’s emotional intelligence capabilities allow it to detect and replicate subtle emotional nuances in speech. Users can opt for automatic emotion detection or manually control expressions, providing flexibility for creators to deliver tailored content that resonates with audiences.
Outperforming Industry Leaders
Speech-02 has achieved state-of-the-art results on objective voice cloning metrics, including Word Error Rate and Speaker Similarity. It has secured the top position on the public TTS Arena leaderboard, surpassing models from established players like OpenAI and ElevenLabs.
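For readers unfamiliar with the two metrics, the sketch below shows how they are commonly computed. It is a simplified illustration, not the evaluation pipeline behind the reported results: in practice the hypothesis text comes from running a speech recognizer on the synthesized audio, and the vectors come from a speaker-verification model. The function names and sample inputs are assumptions.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two speaker-embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")

# Toy two-dimensional "embeddings" pointing in the same direction.
sim = cosine_similarity([1.0, 0.0], [2.0, 0.0])
```

Lower Word Error Rate indicates more intelligible speech, while higher Speaker Similarity indicates that the cloned voice sits closer to the reference speaker.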
By comparison, ElevenLabs’ TTS models are known for their speed and OpenAI’s TTS offerings are robust, but neither has matched the quality, multilingual coverage, and emotional depth demonstrated by MiniMax’s latest model.
Expanding the TTS Frontier with T2A-01-HD
Alongside Speech-02, MiniMax also offers the T2A-01-HD model under its Hailuo Audio HD brand. Its features include voice cloning from just 10 seconds of audio input, a library of more than 300 pre-built voices, and customization options for pitch, speed, and emotional tone. It supports 17+ languages with natural regional accents, which suits applications ranging from dubbing international films to creating region-specific advertisements.
Accessibility and Integration
MiniMax offers access to its TTS models through the Hailuo Audio HD platform, providing free trials and API integration for developers. This approach ensures that a wide range of users, from individual creators to large enterprises, can leverage the advanced capabilities of Speech-02 and T2A-01-HD in their projects.
Conclusion
MiniMax’s introduction of Speech-02 represents a significant leap forward in text-to-speech technology. By combining high-fidelity voice cloning, emotional intelligence, and extensive multilingual support, Speech-02 sets a new standard in the industry. As it continues to outperform established models from OpenAI and ElevenLabs, MiniMax solidifies its position as a leader in the evolving field of voice synthesis.
