Key Takeaways
- Piper is the correct choice for speed and embedded use. It runs entirely on CPU, produces real-time speech on a Raspberry Pi 5, and supports 20+ languages via downloadable voice packs. No GPU, no Python complexity, MIT license.
- XTTS v2 is the best local voice cloning option in 2026. Give it 6 seconds of reference audio and it clones the voice in 17 languages. Requires 4–6 GB of GPU VRAM. The CPML license restricts commercial use; check the license before deploying.
- F5-TTS is the fastest-growing alternative for zero-shot voice cloning. It uses a flow-matching architecture instead of GPT, clones a voice from ~3 seconds of reference audio, and achieves quality competitive with XTTS v2 at faster inference speeds. License: CC-BY-NC-4.0 (non-commercial).
- Coqui TTS is the most flexible open-source TTS toolkit. It supports multiple backends (Tacotron2, VITS, XTTS), voice cloning, and 20+ languages under an MPL 2.0 license. Note: the Coqui company shut down in late 2023; the project is now community-maintained.
- Bark is the only local TTS that generates non-speech audio. It can produce laughter, coughing, sighs, music snippets, and ambient sound effects alongside speech, which makes it useful for creative audio, podcast production, and interactive fiction. Its outputs are slow and non-deterministic.
- StyleTTS 2 achieves the highest mean opinion score (MOS) of any open-source English TTS engine. Its diffusion-based style transfer produces near-human naturalness on English narration. It is English-only and does not support voice cloning.
- License matters significantly for commercial use. Piper (MIT), Bark (MIT), StyleTTS 2 (MIT): freely commercial. Coqui (MPL 2.0): commercial allowed with source disclosure conditions. XTTS v2 (CPML): commercial use requires a license agreement. F5-TTS (CC-BY-NC-4.0): commercial use prohibited without a separate agreement.
- None of these match commercial TTS quality at scale. ElevenLabs, Google Text-to-Speech, and Azure TTS still outperform local engines on consistency, naturalness, and latency across all use cases. Local TTS is the right choice when privacy, cost, or offline operation matters more than absolute quality.
Quick Facts
- Fastest local TTS: Piper. Real-time on Raspberry Pi 5, ~10× faster than real-time on a modern desktop CPU.
- Best voice cloning quality: XTTS v2. 6 seconds of reference audio, cross-lingual cloning in 17 languages.
- Fastest zero-shot voice cloning (newer architecture): F5-TTS. ~3 seconds of audio, flow-matching, ~3–5× real-time on an RTX 4070.
- Most flexible open-source toolkit: Coqui TTS. Supports VITS, Tacotron2, and XTTS backends, 20+ language models.
- Only generative audio (non-speech sounds): Bark. Laughter, sighs, music, ambient sound. Slowest of all.
- Best English narration quality: StyleTTS 2. Diffusion-based style transfer, near-human MOS on the LJSpeech benchmark.
- VRAM requirements: Piper: CPU only. Kokoro: CPU / 1–2 GB. StyleTTS 2: 2–4 GB. Coqui VITS: 2–4 GB. F5-TTS: 3–5 GB. XTTS v2: 4–6 GB. Bark: 4–8 GB.
Why Local TTS Matters
Cloud TTS services (ElevenLabs, Google TTS, Amazon Polly, Azure Speech) are convenient but come with per-character billing, audio data retention policies, and latency from network round-trips. Local TTS eliminates all three.
- Privacy: Your text content never leaves your machine. Critical for medical dictation, legal summaries, private diary narration, or confidential document read-aloud.
- Cost: Cloud TTS pricing is typically $4–$30 per million characters. A developer generating 10 million characters per month saves $40–$300/month with a one-time local setup.
- Latency: No network round-trip. Piper produces first audio in under 50 ms on CPU, faster than any cloud TTS round-trip.
- Customization: Voice cloning (XTTS v2, F5-TTS, Coqui) lets you create a custom voice from a few seconds of audio. Cloud providers charge $10+/month per cloned voice.
- Offline operation: Works on planes, in secure facilities, in remote areas with no internet. Embedded voice UI for kiosks and appliances.
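The cost claim above can be checked with quick arithmetic. In this sketch, the per-character rates come from the range quoted above; the $600 hardware figure is an assumed example, not a quoted price:

```python
# Break-even sketch: cloud per-character billing vs. a one-time local setup.
# Rates ($4-$30 per million characters) are the range quoted above; the
# $600 hardware figure is a hypothetical example.
def monthly_cloud_cost(chars_per_month: int, rate_per_million: float) -> float:
    """Monthly cloud TTS bill in dollars at a given per-million-character rate."""
    return chars_per_month / 1_000_000 * rate_per_million

chars = 10_000_000  # 10 M characters/month, as in the example above
low = monthly_cloud_cost(chars, 4.0)    # cheapest tier
high = monthly_cloud_cost(chars, 30.0)  # most expensive tier
print(f"cloud cost: ${low:.0f}-${high:.0f}/month")

hardware = 600.0  # assumed one-time local hardware spend
print(f"break-even: {hardware / low:.0f} months at the cheapest cloud rate")
```

At the cheapest cloud rate the assumed hardware pays for itself in just over a year; at the expensive end, in two months.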
Head-to-Head Comparison Table
All local TTS engines compared across the metrics that matter most for production deployment.
In One Sentence
Piper is fastest on CPU; XTTS v2 gives the best voice cloning quality; F5-TTS provides zero-shot cloning with a newer architecture; Bark is the only engine producing laughter and music; StyleTTS 2 has the best English narration naturalness.
In Plain Terms
For most offline TTS needs: Piper if you want speed and simplicity, Coqui if you want voice cloning under a commercially usable open-source license, XTTS v2 if you want the best cloning quality and have a GPU, F5-TTS if you want a newer architecture with faster zero-shot cloning.
| Tool | Quality | Speed | Voice Cloning | Multilingual | VRAM | License | MOS (English) |
|---|---|---|---|---|---|---|---|
| Piper | Good | Very fast (CPU) | No | Yes (20+ langs) | CPU only | MIT | ~3.5 |
| Kokoro | Very good | Fast (CPU) | No | English + expanding | CPU / 1–2 GB | Apache 2.0 | ~4.0 |
| Coqui TTS | Very good | Medium | Yes | Yes (20+ langs) | 2–4 GB | MPL 2.0 | ~3.8 |
| XTTS v2 | Excellent | Slow | Yes (best) | Yes (17 langs) | 4–6 GB | CPML (commercial restricted) | ~4.1 |
| F5-TTS | Excellent | Medium-fast | Yes (zero-shot) | Yes (multilingual) | 3–5 GB | CC-BY-NC-4.0 | ~4.1 |
| Bark | Unique / variable | Slow | Limited | Yes (multilingual) | 4–8 GB | MIT | ~3.2–4.0 (variable) |
| StyleTTS 2 | Excellent (English) | Medium | No | English mainly | 2–4 GB | MIT | ~4.3 |
MOS (mean opinion score) on a 1–5 scale where 5 is indistinguishable from human speech. Scores are approximate and based on published benchmarks or community evaluations. MOS varies significantly by test sentence and listener pool. Human reference MOS: ~4.5.
First-Audio Latency Comparison
First-audio latency is the time from text input to first audible output. Critical for voice assistants and interactive applications. For batch processing (audiobooks, podcast production), total throughput matters more than first-audio latency.
| Engine | First audio (RTX 4070) | First audio (CPU) | First audio (M5 Pro) |
|---|---|---|---|
| Piper | ~30 ms | ~50 ms | ~40 ms |
| Kokoro | ~50 ms | ~80 ms | ~60 ms |
| Coqui VITS | ~100 ms | ~300 ms | ~150 ms |
| StyleTTS 2 | ~150 ms | ~500 ms | ~200 ms |
| F5-TTS | ~200 ms | ~800 ms | ~300 ms |
| XTTS v2 | ~300 ms | ~1500 ms | ~500 ms |
| Bark | ~500 ms | ~3000 ms | ~800 ms |
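First-audio latency can also be measured directly by timing how long an engine takes to emit its first audio bytes. A hypothetical harness for Piper, assuming `piper` and the voice model from the Piper section are installed (`--output-raw` streams raw PCM to stdout):

```python
# Time-to-first-audio harness for Piper's raw PCM stream (assumes piper is installed).
import subprocess
import time

def elapsed_ms(start: float, end: float) -> float:
    """Monotonic-clock interval expressed in milliseconds."""
    return (end - start) * 1000.0

def first_audio_latency_ms(text: str, model: str = "en_US-lessac-medium.onnx") -> float:
    """Time from sending text to receiving the first raw PCM bytes from piper."""
    start = time.monotonic()
    proc = subprocess.Popen(
        ["piper", "--model", model, "--output-raw"],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE,
    )
    proc.stdin.write(text.encode() + b"\n")
    proc.stdin.close()
    proc.stdout.read(1024)  # blocks until the first audio bytes arrive
    latency = elapsed_ms(start, time.monotonic())
    proc.kill()
    return latency

# Example (requires piper and the voice model locally):
# print(f"{first_audio_latency_ms('Latency test sentence.'):.0f} ms")
```

The same pattern works for any engine that can stream audio to stdout; batch-oriented engines that only write finished files cannot be measured this way.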
Piper TTS – Fastest Lightweight Option
Piper is a fast, local text-to-speech system developed by Rhasspy for home automation and embedded use. It uses a VITS-based neural architecture with an onnxruntime backend, optimized to run in real time on a Raspberry Pi 4 or 5 without a GPU.
- Architecture: VITS neural TTS with ONNX inference. Designed for single-board computers and embedded Linux.
- Installation: `pip install piper-tts`. Pre-trained voice packs are available from the Piper voices repository on Hugging Face.
- Usage: `echo "Hello, world" | piper --model en_US-lessac-medium.onnx --output_file output.wav`
- Voice packs: 20+ languages, multiple voice options per language. Each voice pack is a 20–200 MB ONNX model file.
- Speed: ~10× faster than real-time on a modern desktop CPU. Real-time on Raspberry Pi 5. Sub-50 ms first-audio latency.
- Apple Silicon: ~15× real-time on M5 Pro (CPU, ARM NEON). Runs natively without a GPU; excellent performance on Mac.
- Listen to samples: Piper voice samples
- Best for: Home assistants, kiosk devices, embedded voice UI, privacy-sensitive read-aloud where GPU is unavailable.
- Limitation: No voice cloning. Quality is "good": natural sounding but clearly synthetic compared to XTTS v2 or StyleTTS 2.
- License: MIT. Fully commercial, no restrictions.
- Kokoro TTS (Piper alternative): Kokoro TTS is an emerging alternative to Piper in the lightweight category. It achieves higher naturalness than Piper while remaining fast on CPU. Licensed under Apache 2.0. If Piper's quality doesn't meet your needs but you can't use a GPU, Kokoro is worth testing.
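For interactive use, the WAV example above can be replaced with streaming playback: `--output-raw` emits 16-bit mono PCM on stdout as it is generated. A sketch assuming `piper`, the voice model, and ALSA's `aplay` are available (most medium Piper voices output at 22050 Hz; check your voice's config if playback sounds wrong):

```shell
# Stream speech as it is generated instead of waiting for a finished WAV file.
echo "Streaming speech with no perceptible startup delay." | \
  piper --model en_US-lessac-medium.onnx --output-raw | \
  aplay -r 22050 -f S16_LE -t raw -
```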
Coqui TTS – Best Open-Source All-Rounder
Coqui TTS is a Python toolkit for text-to-speech supporting multiple model architectures and voice cloning. It was developed by the Coqui company (which shut down in late 2023) and is now maintained by the open-source community. The toolkit supports Tacotron2, VITS, and XTTS backends.
- Installation: `pip install TTS`. Models download automatically on first use.
- Voice cloning: Provide 6+ seconds of reference audio: `tts --text "Hello" --model_name tts_models/en/vctk/vits --speaker_wav sample.wav --out_path output.wav`
- Backend options: VITS (fastest, good quality), Tacotron2 (older, slower), XTTS (best quality, see XTTS v2 section).
- Languages: 20+ language models available via `tts --list_models`.
- VRAM: 2–4 GB for the VITS backend; 4–6 GB for the XTTS backend.
- Apple Silicon: ~8× real-time on M5 Pro (CPU). No Metal GPU acceleration. Usable for batch generation.
- Community status: Coqui Inc. shut down in late 2023. The open-source repo (`coqui-ai/TTS`) is community-maintained; there is no active commercial support.
- License: MPL 2.0. Commercial use is allowed, but the source code of modifications must be disclosed.
- Best for: Developers who want voice cloning with an open-source toolkit under a commercially usable license.
- Listen to samples: The official coqui.ai demo is archived. Community audio examples are linked in the coqui-ai/TTS GitHub repository under the demos section.
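The CLI shown above also has a Python API. A minimal sketch (the model name follows the `tts --list_models` naming; the first call downloads the model, so this needs a network connection and several hundred MB of disk):

```python
# Minimal Coqui TTS Python API sketch (model downloads on first use).
from TTS.api import TTS

# Single-speaker English VITS model: fast, good quality, MPL 2.0-covered backend.
tts = TTS("tts_models/en/ljspeech/vits")
tts.tts_to_file(text="Hello from a local VITS model.", file_path="output.wav")

# Zero-shot cloning requires a model that accepts reference audio, e.g. YourTTS:
# tts = TTS("tts_models/multilingual/multi-dataset/your_tts")
# tts.tts_to_file(text="Hello", speaker_wav="sample.wav", language="en",
#                 file_path="cloned.wav")
```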
XTTS v2 – Best Voice Cloning Quality
XTTS v2 (by Coqui) is the highest-quality voice cloning engine available locally in 2026. It uses a GPT-based architecture with cross-lingual transfer: clone a voice in English and speak it in Spanish, German, French, or 14 other languages from the same 6 seconds of audio.
- Architecture: GPT-style TTS with speaker conditioning from a short reference clip.
- Voice cloning: 6 seconds of reference audio is sufficient for a convincing voice clone. 3 seconds produces passable quality.
- Cross-lingual cloning: Clone voice in one language, generate speech in 17 different languages with the same voice characteristics.
- VRAM: 4–6 GB GPU recommended. Runs on CPU but ~5–10× slower.
- Speed: Slow; ~2× real-time on an RTX 4070. Not suitable for real-time voice assistant pipelines.
- Apple Silicon: ~3× real-time on M5 Pro (CPU, no Metal acceleration). Usable for batch audio generation, not for real-time voice assistant output.
- Listen to samples: XTTS v2 demo on Hugging Face
- License: CPML (Coqui Public Model License). Free for research and personal use. Commercial use requires a license agreement with the Coqui successor.
```python
from TTS.api import TTS

# Load the XTTS v2 model
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# Clone a voice from a 6-second reference clip and synthesize in any of 17 languages
tts.tts_to_file(
    text="Bonjour, je suis votre assistant vocal.",
    speaker_wav="reference_voice.wav",  # 6+ seconds of the target speaker
    language="fr",  # output in French using the cloned voice
    file_path="output.wav",
)
```

Warning: XTTS v2 is covered by the CPML license. Commercial use, including in products, SaaS applications, or services, requires a commercial license agreement. Check the license terms before deploying.
Bark – Generative Audio Beyond Speech
Bark (by Suno AI) is a generative text-to-audio model that produces speech, music, laughter, coughing, sighs, and ambient sounds from text prompts. It is not a traditional TTS engine β it is a generative model that interprets text prompts as audio generation instructions.
- Unique capability: Include `[laughs]`, `[sighs]`, `[clears throat]`, `[music]`, or `[sound effect: wind]` in your text and Bark generates those sounds alongside speech.
- Not controllable like traditional TTS: Output varies between runs for the same input. Quality is inconsistent; some outputs are excellent, others have artifacts or unintelligible segments.
- Speed: Slow; 2–4× slower than real-time even on an RTX 4090. Not suitable for interactive applications.
- Apple Silicon: ~1.5× real-time on M5 Pro (CPU, partial MPS). MPS (Metal Performance Shaders) support is partial; most inference still falls back to CPU.
- Listen to samples: Bark audio examples on GitHub
- Best for: Creative audio, podcast production with sound effects, interactive fiction, experimental voice applications.
- VRAM: 4–8 GB GPU. Runs on CPU with significantly lower quality.
- Installation: `pip install suno-bark`. Models download on first run (~2 GB).
- License: MIT. Fully commercial.
- Limitation: No reliable voice cloning. The "voice presets" bundled with Bark are approximate; not a true voice cloning system.
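A minimal generation sketch using Bark's Python API (the function names follow the Suno repository's README; weights download on the first call, and scipy is used only to write the WAV file):

```python
# Bark sketch: speech plus a non-speech token in one prompt.
from scipy.io.wavfile import write as write_wav
from bark import SAMPLE_RATE, generate_audio, preload_models

preload_models()  # downloads model weights (~2 GB) on first run

# Bracketed tokens such as [laughs] are rendered as sounds, not read aloud.
audio = generate_audio("Well, that was unexpected! [laughs] Anyway, moving on.")
write_wav("bark_out.wav", SAMPLE_RATE, audio)
```

Because output is non-deterministic, expect to generate several takes and keep the best one.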
StyleTTS 2 – Highest Natural Quality
StyleTTS 2 is a diffusion-based TTS model that achieves near-human mean opinion scores (MOS) on the LJSpeech benchmark. It transfers speaking style using diffusion, generating speech that is more natural and expressive than VITS-based models.
- Architecture: Diffusion-based style transfer. Samples from a learned distribution of speaking styles rather than deterministically mapping text to audio.
- Quality: Highest MOS scores of any open-source English TTS engine on the LJSpeech benchmark. Listeners rate it as near-indistinguishable from human narration in controlled tests.
- Best for: Audiobook narration, professional voiceover, podcast production, any application where English quality is more important than voice customization.
- Installation: Clone the GitHub repo, install requirements (`pip install -r requirements.txt`), and download the model checkpoints (~500 MB).
- Language support: Primarily English. Limited multilingual capability; not recommended for non-English use.
- Voice cloning: Not supported. StyleTTS 2 generates in trained speaker voices only.
- VRAM: 2–4 GB GPU. Faster than XTTS v2 at ~5–8× real-time on an RTX 4070.
- Apple Silicon: ~6× real-time on M5 Pro (CPU). No Metal acceleration, but ARM performance is solid for batch audio generation.
- Listen to samples: StyleTTS 2 on GitHub; search "StyleTTS 2 audio samples" for community examples if the demo page is unavailable.
- License: MIT. Fully commercial.
F5-TTS – Zero-Shot Voice Cloning, Fully Open
F5-TTS is a flow-matching-based TTS model with zero-shot voice cloning: it clones any voice from ~3 seconds of reference audio without fine-tuning. It is one of the fastest-growing local TTS projects of 2025–2026, actively developed and rapidly gaining community adoption.
- Architecture: Flow-matching (a diffusion-variant approach) instead of the GPT-based architecture used by XTTS v2. Flow-matching typically offers faster inference with competitive quality.
- Voice cloning: ~3 seconds of reference audio is sufficient for zero-shot voice cloning. No fine-tuning required; it works on any voice at inference time.
- Quality: Competitive with XTTS v2 on English. MOS scores of approximately 4.1 in community evaluations.
- Speed: ~3–5× real-time on an RTX 4070, faster than XTTS v2 (~2× real-time) for comparable voice cloning quality.
- Languages: Multilingual, with strong support for English and Chinese and expanding support for other languages.
- Apple Silicon: ~2× real-time on M5 Pro (CPU). No Metal acceleration currently.
- VRAM: 3–5 GB GPU recommended. Smaller footprint than XTTS v2.
- Installation: `pip install f5-tts`, or clone from GitHub.
- License: CC-BY-NC-4.0. Non-commercial use only; commercial use requires a separate agreement with the authors.
- Why it matters: F5-TTS brings a newer architecture to local voice cloning with an active community. For non-commercial projects where XTTS v2 is too slow for the pipeline or its CPML license is a concern, F5-TTS is the primary alternative to evaluate.
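A command-line cloning sketch using the `f5-tts_infer-cli` entry point installed by the pip package; flag names are taken from the project's README and may shift between releases, so confirm with `f5-tts_infer-cli --help`:

```shell
# Zero-shot cloning: a ~3 s reference clip plus its transcript, then new text.
f5-tts_infer-cli \
  --ref_audio reference_voice.wav \
  --ref_text "Transcript of the reference clip." \
  --gen_text "This sentence is spoken in the cloned voice."
```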
License Breakdown – Important for Commercial Use
License terms are critical for production deployment. A permissive license means you can use the tool in a commercial product without restrictions; a restricted license means you must review the terms carefully before deploying.
| Tool | License | Commercial OK? | Key Condition |
|---|---|---|---|
| Piper | MIT | Yes, no restrictions | Include MIT copyright notice |
| Kokoro | Apache 2.0 | Yes, no restrictions | Include Apache 2.0 notice |
| Coqui TTS | MPL 2.0 | Yes, with conditions | Source code of modifications must be disclosed |
| XTTS v2 | CPML | Research / personal only | Commercial use requires a license agreement |
| F5-TTS | CC-BY-NC-4.0 | Non-commercial only | Commercial use prohibited without separate agreement |
| Bark | MIT | Yes, no restrictions | Include MIT copyright notice |
| StyleTTS 2 | MIT | Yes, no restrictions | Include MIT copyright notice |
Note: Coqui TTS (the toolkit, MPL 2.0) and XTTS v2 (the specific model, CPML) have different licenses. You can use the Coqui TTS toolkit with VITS or Tacotron2 backends under MPL 2.0 in commercial products. The CPML restriction applies specifically to the XTTS v2 model weights.
How Local TTS Compares to ElevenLabs and Cloud TTS
ElevenLabs, Google Text-to-Speech, and Azure Speech remain the quality ceiling for TTS in 2026. This section shows where local engines compete effectively and where cloud still wins.
- Quality ceiling: ElevenLabs > StyleTTS 2 ≈ XTTS v2 > F5-TTS ≈ Coqui TTS > Piper. ElevenLabs is still the quality ceiling in 2026 for consistency and expressiveness.
- Latency: Piper local (~30–50 ms first audio) is faster than any ElevenLabs API round-trip (~300–500 ms). For real-time voice UI, local Piper wins on latency.
- Cost: ElevenLabs charges $5–99/month by tier. Local TTS costs $0 after the one-time hardware. At scale (millions of characters/month), local is significantly cheaper.
- Voice cloning: ElevenLabs Instant Voice Clone ≈ XTTS v2 quality. ElevenLabs Professional Voice Clone (requires a speaker recording session) exceeds any local engine.
- Privacy: Local TTS = no audio data sent anywhere. ElevenLabs = audio processed on their servers. Critical for sensitive content.
- Offline capability: Local = fully offline. ElevenLabs = requires internet. No offline mode available.
- When to use cloud: Professional voiceover production, customer-facing products requiring highest quality, multi-voice projects with dozens of characters.
- When to use local: Privacy-critical audio, embedded devices, cost-sensitive batch processing, offline environments, development and prototyping.
How to Choose
A decision flowchart from your requirement to the right TTS engine:
In One Sentence
Need voice cloning? → XTTS v2 (best quality), F5-TTS (faster, newer architecture), or Coqui TTS (open license). Need CPU speed? → Piper. Need creative audio? → Bark. Need the best English quality? → StyleTTS 2.
In Plain Terms
If you want to clone someone's voice, use XTTS v2 for quality, F5-TTS for faster inference, or Coqui VITS for a commercially usable license. If you're building a Raspberry Pi or kiosk voice UI, use Piper. If you're making a podcast with sound effects, try Bark. If you're narrating audiobooks in English, use StyleTTS 2.
- Need voice cloning? → Yes: XTTS v2 (best quality, CPML license), F5-TTS (newer architecture, faster, CC-BY-NC-4.0), or Coqui VITS (good quality, MPL 2.0). No: Piper (speed), StyleTTS 2 (quality).
- Need to run on CPU only / Raspberry Pi? → Piper only. Kokoro is a higher-quality CPU alternative with an Apache 2.0 license. All other engines require a GPU for acceptable performance.
- Need creative audio with non-speech sounds? → Bark. No other local engine produces laughter, sighs, or music natively.
- Need the best English narration quality? → StyleTTS 2. It outperforms all others on naturalness for English audiobook-style speech.
- Need multilingual support? → XTTS v2 (17 languages, cross-lingual cloning), Coqui (20+ languages), Piper (20+ language packs).
- Need a fully commercial MIT license? → Piper, Bark, or StyleTTS 2. Avoid XTTS v2 for commercial use without checking the CPML. F5-TTS (CC-BY-NC-4.0) also prohibits commercial use without a separate agreement.
- Need voice control via text description? → Parler-TTS. Describe the voice you want ("a calm elderly man speaking slowly") and it generates matching speech. Novel approach: no reference audio needed, no voice cloning. Useful when you need a specific voice character without a sample. GitHub
- Building a voice assistant pipeline? → Piper for low-latency TTS output (see /power-local-llm/build-local-voice-assistant-2026).
FAQ
How much reference audio do I need for voice cloning with XTTS v2?
XTTS v2 requires a minimum of 3 seconds of clean reference audio, with 6+ seconds giving noticeably better results. The audio must be a single speaker with minimal background noise and no music. Higher quality source audio (recorded in a quiet room with a good microphone) produces better clones than compressed audio.
Can I use Piper TTS in a commercial product?
Yes. Piper is licensed under MIT, which permits unlimited commercial use. You must include the MIT license notice in your product. The voice models (ONNX files) may have separate licenses per voice β check the individual voice model's license on the Piper voices repository before deploying.
Is Coqui TTS still maintained after the company shut down?
Yes, but with reduced pace. The Coqui company shut down in late 2023, but the open-source repository (coqui-ai/TTS) is maintained by community contributors. Bug fixes and security patches are applied, but major new model training or features are unlikely without significant community effort. For XTTS v2, expect no new model versions from Coqui.
Which local TTS engine has the best multilingual support?
XTTS v2 supports 17 languages with cross-lingual voice cloning, the most impressive multilingual feature of any local engine. Coqui TTS has 20+ language models but without cross-lingual cloning. Piper has 20+ language voice packs for fast CPU inference. If you need to clone a voice and produce speech in multiple languages from one reference sample, XTTS v2 is the only option.
Can Bark produce music?
Bark can produce simple musical snippets alongside speech when prompted with `[music]` or `[singing]` tokens. It is not a dedicated music generator; outputs are short, inconsistent, and often artifact-laden. For actual music generation, Bark is not the right tool. It is best used for adding emotional non-speech sounds (laughter, coughing, sighs) to speech output rather than for full music tracks.
What is the best free local TTS for voice cloning?
F5-TTS (CC-BY-NC-4.0) for non-commercial use: it clones voices from ~3 seconds of audio with quality competitive with XTTS v2. For commercial use, Coqui TTS with the VITS backend (MPL 2.0) allows commercial deployment with source disclosure conditions. XTTS v2 has the best quality, but its CPML license restricts commercial deployment without a separate agreement.
Can I run XTTS v2 on an Apple Silicon Mac?
Yes, but CPU-only: approximately 3× real-time on M5 Pro. There is no Metal GPU acceleration for TTS engines currently. Unlike whisper.cpp (which has full Metal support), TTS engines run on CPU on Apple Silicon. Performance is usable for batch audio generation but not suitable for real-time voice assistant output.
Which local TTS engine sounds most human?
StyleTTS 2 for English narration: it achieves the highest MOS scores of any open-source English TTS engine (~4.3 vs. a human reference of ~4.5). XTTS v2 and F5-TTS are competitive (~4.1) for cloned-voice naturalness. None match ElevenLabs Turbo v2 at peak quality for production use cases.
Sources
- Piper TTS on GitHub: Source code, voice packs, ONNX model downloads, and Raspberry Pi setup guide.
- Coqui TTS on GitHub: Source code, model list, voice cloning documentation, and Python API reference.
- XTTS v2 documentation: XTTS v2 model card, license (CPML), and voice cloning API.
- Bark on GitHub: Source, audio prompt tokens, model download, and example outputs.
- StyleTTS 2 on GitHub: Architecture paper, model checkpoints, and inference guide.
- F5-TTS on GitHub: Flow-matching TTS with zero-shot voice cloning, installation guide, and multilingual support.
- Kokoro TTS on GitHub: Lightweight high-quality TTS with Apache 2.0 license, CPU-optimized.
- Piper voices on Hugging Face: All available language/voice pack downloads with per-voice license information.
- Piper voice samples: Audio demos for all Piper voices across supported languages.