vits

We host the vits application so that you can run it on our online workstations, either directly or through Wine.

Run vits online

Quick description of vits:

VITS is a foundational research implementation of "VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech," a well-known neural TTS architecture. Unlike traditional two-stage systems that separately train an acoustic model and a vocoder, VITS trains an end-to-end model that maps text directly to waveform using a conditional variational autoencoder combined with normalizing flows and adversarial training. This architecture enables parallel generation (fast inference) while achieving speech quality that rivals or surpasses many two-stage systems. The repository provides training and inference pipelines for common datasets such as LJ Speech (single-speaker) and VCTK (multi-speaker), including filelists, configs, and preprocessing scripts. It also includes monotonic alignment search code and g2p preprocessing, which are crucial components for aligning text and speech in an end-to-end setup.
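To give a feel for what monotonic alignment search does, here is a minimal pure-Python sketch. It is not the repository's code: the function name monotonic_alignment_search, the input layout (one row of scores per audio frame, one column per text token), and the toy scores are assumptions made for illustration. The idea is a dynamic program that finds the highest-scoring alignment in which every frame maps to one token, the token index never decreases, and no token is skipped.

```python
def monotonic_alignment_search(log_probs):
    """Return the best token index for each frame (illustrative sketch).

    log_probs[j][i] is a hypothetical score (e.g. a log-likelihood) for
    aligning audio frame j with text token i.
    """
    n_frames = len(log_probs)
    n_tokens = len(log_probs[0])
    if n_frames < n_tokens:
        raise ValueError("need at least one frame per token")

    neg_inf = float("-inf")
    # best[j][i]: best total score of a valid path ending at frame j, token i.
    best = [[neg_inf] * n_tokens for _ in range(n_frames)]
    best[0][0] = log_probs[0][0]  # the path must start at the first token

    for j in range(1, n_frames):
        # Token i cannot be reached before frame i, hence the upper bound.
        for i in range(min(j + 1, n_tokens)):
            stay = best[j - 1][i]                             # same token
            advance = best[j - 1][i - 1] if i > 0 else neg_inf  # next token
            best[j][i] = max(stay, advance) + log_probs[j][i]

    # Backtrack from the last frame, which must sit on the last token.
    path = [0] * n_frames
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        path[j] = i
        if j > 0 and i > 0 and best[j - 1][i - 1] >= best[j - 1][i]:
            i -= 1
    return path


# Toy example: the first two frames favor token 0, the last two favor token 1.
scores = [[0.0, -5.0], [0.0, -5.0], [-5.0, 0.0], [-5.0, 0.0]]
alignment = monotonic_alignment_search(scores)
print(alignment)
```

The returned list holds one token index per frame, so counting how often each index appears gives a per-token duration. Ties are broken in favor of advancing to the next token, which is an arbitrary choice in this sketch.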

Features:

End-to-end TTS model combining conditional VAE, normalizing flows, and adversarial training
Parallel waveform generation with high naturalness compared to classic two-stage pipelines
Ready-made training recipes for LJ Speech and VCTK datasets (single and multi-speaker)
Monotonic alignment search implementation and phoneme preprocessing scripts
PyTorch-based code suitable for research, modification, and experimental extensions
Widely adopted baseline architecture for many derivative and improved TTS systems

Programming Language: Python.
Categories:

Text to Speech

By OD Group OU – Registry code: 1609791 – VAT number: EE102345621.