CSS10

Overview

CSS10 (a Collection of Single Speaker Speech Datasets for 10 Languages) is an open-source dataset for training single-speaker Text-to-Speech (TTS) models in ten languages: German, Greek, Spanish, Finnish, French, Hungarian, Japanese, Dutch, Russian, and Chinese. The recordings come from LibriVox audiobooks, and each language sub-dataset pairs the audio with normalized transcriptions. The amount of audio varies by language, from roughly 4 hours to about 24 hours.

Because every sub-dataset uses the same LJSpeech-style layout (audio files plus a pipe-delimited transcript file), one data-loading pipeline can serve all ten languages. This simplifies training for architectures such as Tacotron 2, FastSpeech 2, and VITS.

As of 2026, CSS10 is still used in 'Edge-TTS' applications and alongside Small Language Models (SLMs), particularly for fine-tuning on-device speech interfaces where privacy and low latency take priority over cloud-based synthesis. Its modest size suits transfer learning: developers can build localized voice assets without the compute that foundation models require. Commercial use depends on the license terms described in the FAQ.
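The LJSpeech-style layout mentioned above can be read with a few lines of code. The following is a minimal sketch, assuming each transcript line holds pipe-separated fields in the order audio path, original text, normalized text, with any further fields (such as a duration) ignored. The field order, the `Utterance` class, the `parse_transcript` helper, and the sample lines are assumptions for illustration, so check them against the transcript file you actually download.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Utterance:
    audio_path: str
    text: str
    normalized_text: str


def parse_transcript(lines: Iterable[str]) -> List[Utterance]:
    """Parse pipe-delimited transcript lines into Utterance records.

    Assumed field order (hypothetical, verify against your copy):
    audio path | original text | normalized text | [extra fields...]
    """
    utterances = []
    for line_number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("|")
        if len(fields) < 2:
            raise ValueError(f"line {line_number}: expected at least 2 fields")
        audio_path, text = fields[0], fields[1]
        # Fall back to the original text when no normalized text is present.
        normalized = fields[2] if len(fields) > 2 and fields[2] else text
        utterances.append(Utterance(audio_path, text, normalized))
    return utterances


# Made-up sample lines, not taken from the dataset.
SAMPLE_LINES = [
    "book/utt_0001.wav|Hello, world 1.|hello world one|2.5",
    "",
    "book/utt_0002.wav|Second line.",
]

if __name__ == "__main__":
    for utterance in parse_transcript(SAMPLE_LINES):
        print(utterance.audio_path, "->", utterance.normalized_text)
```

For a real file, the same function can be fed an open file handle, for example `parse_transcript(open(path, encoding="utf-8"))`, since it accepts any iterable of lines.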

Common tasks

Multilingual TTS training
Cross-lingual voice transfer
Speech-to-text alignment validation
Phonetic distribution analysis
Voice cloning
Text-to-speech conversion
Speech dataset creation
TTS model evaluation

FAQ

What is the sampling rate of the audio files?

All audio files are provided at a sampling rate of 22,050 Hz.
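The stated rate is easy to verify before training, which helps catch files that were resampled or replaced along the way. The sketch below assumes the audio is uncompressed PCM WAV, which is the only WAV variant the standard-library `wave` module reads. The helper names `wav_sample_rate` and `find_rate_mismatches` are made up for this example.

```python
import wave
from typing import Iterable, List, Tuple

EXPECTED_RATE = 22050  # Hz, the rate stated in the FAQ answer


def wav_sample_rate(path: str) -> int:
    """Return the sampling rate stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def find_rate_mismatches(
    paths: Iterable[str], expected: int = EXPECTED_RATE
) -> List[Tuple[str, int]]:
    """Return (path, rate) pairs for files whose rate differs from expected."""
    mismatches = []
    for path in paths:
        rate = wav_sample_rate(path)
        if rate != expected:
            mismatches.append((path, rate))
    return mismatches
```

An empty list from `find_rate_mismatches` means every checked file matches the expected rate. Files in other encodings raise `wave.Error` and need a different reader.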

Can I use CSS10 for commercial products?

The dataset is released under CC BY-NC-SA 4.0, meaning it is intended for non-commercial and research use. For commercial use, contact the original authors.

Which languages are included?

German, Greek, Spanish, Finnish, French, Hungarian, Japanese, Dutch, Russian, and Chinese.

Is the dataset pre-aligned?

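No. It provides audio and utterance-level transcripts, but no word- or phoneme-level alignments. Users who need them typically run a forced aligner such as the Montreal Forced Aligner (MFA) as a preprocessing step before training.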
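Forced aligners such as MFA generally expect each audio file to sit next to a transcript file with the same base name (MFA accepts `.lab` or `.txt`). As a rough sketch of that preparation step, the helper below writes one `.lab` file per utterance. The function name `write_lab_files`, the input pairs, and the directory layout are assumptions for illustration, and the alignment run itself is not shown.

```python
from pathlib import Path
from typing import Iterable, Tuple


def write_lab_files(pairs: Iterable[Tuple[str, str]], corpus_dir: str) -> int:
    """Write a .lab transcript beside each audio path under corpus_dir.

    pairs: (relative audio path, transcript text) tuples, for example
           taken from a parsed transcript file.
    Returns the number of .lab files written.
    """
    root = Path(corpus_dir)
    written = 0
    for audio_rel_path, transcript in pairs:
        # Same location and base name as the audio, with a .lab extension.
        lab_path = (root / audio_rel_path).with_suffix(".lab")
        lab_path.parent.mkdir(parents=True, exist_ok=True)
        lab_path.write_text(transcript.strip() + "\n", encoding="utf-8")
        written += 1
    return written
```

Using the normalized transcript rather than the original text is usually the safer input here, because digits and abbreviations are already spelled out and can be matched against a pronunciation dictionary.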
