IndexTTS2

Emotionally Expressive & Duration-Controlled Zero-Shot TTS by Index SpeechTeam

A breakthrough autoregressive zero-shot text-to-speech system with precise duration control and emotional expression capabilities, perfect for video dubbing and AI voice applications.

Demo Videos

Experience IndexTTS2's breakthrough capabilities through demonstration videos from Index SpeechTeam.

A Case Study on Iconic Scenes from Let the Bullets Fly

A Case Study on Iconic Scenes from Empresses in the Palace

About IndexTTS2

IndexTTS2, developed by Index SpeechTeam, is an autoregressive zero-shot text-to-speech system. It addresses a critical limitation of autoregressive TTS, the difficulty of precisely controlling speech duration, while maintaining speech naturalness and adding emotional expression capabilities.

Revolutionary Duration Control & Emotional Expression

IndexTTS2 introduces a novel, general, and autoregressive-model-friendly method for speech duration control. Traditional autoregressive TTS systems struggle with precise duration control. IndexTTS2 supports two generation modes: one in which the number of generated tokens is specified explicitly for precise duration control, and one in which speech is generated freely in an autoregressive manner while maintaining prosodic characteristics.
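The explicit token mode can be pictured as turning a target clip length into a token budget. The sketch below is only an illustration of that arithmetic: it assumes a fixed number of speech tokens per second, and both the function names and the rate of 25 tokens per second are hypothetical, not values taken from IndexTTS2.

```python
def tokens_for_duration(target_seconds: float, tokens_per_second: float) -> int:
    """Return the token budget that best fills target_seconds.

    tokens_per_second is an assumed, fixed token rate (hypothetical here).
    """
    if target_seconds <= 0 or tokens_per_second <= 0:
        raise ValueError("duration and token rate must be positive")
    # Round to the nearest whole token, but always request at least one.
    return max(1, round(target_seconds * tokens_per_second))


def realized_duration(token_count: int, tokens_per_second: float) -> float:
    """Return the duration in seconds that token_count tokens would occupy."""
    return token_count / tokens_per_second


# Example: a 3.2 second dubbing slot at an assumed 25 tokens per second.
ASSUMED_RATE = 25.0  # hypothetical value, for illustration only
budget = tokens_for_duration(3.2, ASSUMED_RATE)
drift = abs(realized_duration(budget, ASSUMED_RATE) - 3.2)
```

Because the budget is a whole number of tokens, the realized duration can differ from the target by up to half a token period, which is the granularity a dubbing pipeline would have to tolerate under this assumption.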

The system achieves disentanglement between emotional expression and speaker identity, enabling independent control of timbre and emotion. Users can supply an emotion prompt from a different speaker than the timbre prompt, so the output reconstructs the target timbre accurately while conveying the specified emotional tone.
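One way to picture this separation is as a request in which timbre and emotion are independent inputs. The inference code has not been released, so the sketch below is not the IndexTTS2 API: the `SynthesisRequest` class, its field names, the rule that only one emotion source may be given, and the fallback to the timbre prompt are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SynthesisRequest:
    """Hypothetical request shape; not part of any released IndexTTS2 code."""
    text: str
    timbre_prompt: str                    # reference audio that sets speaker identity
    emotion_prompt: Optional[str] = None  # reference audio for emotion, may be another speaker
    emotion_text: Optional[str] = None    # natural language emotion description
    token_count: Optional[int] = None     # None means free autoregressive generation

    def __post_init__(self) -> None:
        # Assumption: at most one emotion source is given per request.
        if self.emotion_prompt and self.emotion_text:
            raise ValueError("give either emotion_prompt or emotion_text, not both")
        if self.token_count is not None and self.token_count <= 0:
            raise ValueError("token_count must be positive when given")

    def emotion_source(self) -> str:
        """Report which input drives the emotion in this sketch."""
        if self.emotion_text:
            return "text"
        if self.emotion_prompt:
            return "audio"
        # Assumption: with no emotion input, emotion follows the timbre prompt.
        return "timbre_prompt"


request = SynthesisRequest(
    text="You came back.",
    timbre_prompt="speaker_a.wav",
    emotion_prompt="speaker_b_angry.wav",
)
```

Here the timbre comes from one recording and the emotion from another, which is the independent control the paragraph above describes.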

Duration Control | Emotional Expression | Zero-Shot | Video Dubbing

Precise duration control with token specification

Emotional expression disentanglement

GPT latent representations for stability

Natural language emotion control

Perfect for video dubbing applications

Key Features

Discover what makes IndexTTS2 a breakthrough in autoregressive zero-shot text-to-speech synthesis

Precise Duration Control

Novel autoregressive-model-friendly method supporting explicit token specification for precise speech duration control, perfect for video dubbing applications.

Emotional Expression

Achieves disentanglement between emotional expression and speaker identity, enabling independent control of timbre and emotion with natural language guidance.

Zero-Shot Capability

Zero-shot text-to-speech synthesis that uses GPT latent representations for enhanced speech stability and Qwen3-based natural language emotion control.

Coming Soon

IndexTTS2 model weights and inference code will be released soon by the official team to support research and practical applications.

Follow the official team's GitHub to stay updated on release announcements and technical updates.