Local TTS & Voice Cloning Licenses 2026: Which Engines Allow Commercial Use (Piper, XTTS v2, F5-TTS, Coqui)

14 min read · By Hans Kuepper · Founder of PromptQuorum, multi-model AI dispatch tool · PromptQuorum

For commercial use, choose a local TTS engine with a permissive license: Piper, Bark, and StyleTTS 2 are MIT; Kokoro and Tortoise are Apache 2.0. All five are free for commercial products. The Coqui TTS toolkit is MPL 2.0 (commercial use allowed if you disclose changes to the toolkit source). The two best-known voice-cloning models are the catch: XTTS v2 is non-commercial under the Coqui Public Model License (CPML) and F5-TTS is non-commercial under CC-BY-NC-4.0. Both prohibit commercial use without a separate agreement, and because Coqui Inc shut down in January 2024 there is currently no one to sell an XTTS v2 commercial license, so treat it as non-commercial only. On capability: Piper is fastest on CPU (real-time on a Raspberry Pi 5, no GPU); XTTS v2 is the best-quality voice clone (6 seconds of reference audio → 17 languages, 4–6 GB VRAM); F5-TTS clones from ~3 seconds with faster flow-matching inference; Bark uniquely generates laughter, sighs, and ambient sound; StyleTTS 2 has the most natural English narration (no cloning); Tortoise is very high quality but extremely slow. This is factual reference, not legal advice: verify each license yourself before commercial deployment.

Can you use a local text-to-speech engine in a commercial product? It depends entirely on the license, and the licenses differ sharply. Piper, Bark, and StyleTTS 2 ship under MIT, and Kokoro and Tortoise under Apache 2.0; all five are free for commercial use. The Coqui TTS toolkit is MPL 2.0 (commercial with conditions). But the two most popular voice-cloning models are restricted: XTTS v2 uses the Coqui Public Model License (CPML, non-commercial), and F5-TTS uses CC-BY-NC-4.0 (non-commercial). This guide gives the exact license for each engine, a clear "can I use this commercially?" answer per engine, the COQUI_TOS_AGREED environment variable for accepting the CPML non-interactively in Docker and CI, and a head-to-head comparison across quality, speed, VRAM, and voice cloning, so you can pick the right engine without sending audio to the cloud and without a license surprise in production. (Licenses verified June 2026; this is factual reference, not legal advice. Read each license yourself before commercial use.)

Key Takeaways

  • Piper is the correct choice for speed and embedded use. It runs entirely on CPU, produces real-time speech on a Raspberry Pi 5, and supports 20+ languages via downloadable voice packs. No GPU, no Python complexity, MIT license.
  • XTTS v2 is the best local voice cloning option in 2026, but it is non-commercial. Give it 6 seconds of reference audio and it clones the voice in 17 languages (4–6 GB GPU VRAM). The CPML license is non-commercial, and since Coqui shut down (Jan 2024) no commercial license is on sale, so treat XTTS v2 as non-commercial only. Accept the CPML non-interactively in Docker/CI with COQUI_TOS_AGREED=1.
  • F5-TTS is the fastest-growing alternative for zero-shot voice cloning. It uses a flow-matching architecture instead of GPT, clones a voice from ~3 seconds of reference audio, and achieves quality competitive with XTTS v2 at faster inference speeds. License: CC-BY-NC-4.0 (non-commercial).
  • Coqui TTS is the most flexible open-source TTS toolkit. It supports multiple backends (Tacotron2, VITS, XTTS), voice cloning, and 20+ languages under an MPL 2.0 license. Note: the Coqui company shut down in January 2024; the project is now community-maintained.
  • Bark is the only local TTS that generates non-speech audio. It can produce laughter, coughing, sighs, music snippets, and ambient sound effects alongside speech, which is useful for creative audio, podcast production, and interactive fiction. Its outputs are slow and non-deterministic.
  • StyleTTS 2 achieves the highest mean opinion score (MOS) of any open-source English TTS engine. Its diffusion-based style transfer produces near-human naturalness on English narration. It is English-only and does not support voice cloning.
  • License decides commercial use, and the split is clean. Free for commercial products: Piper, Bark, StyleTTS 2 (MIT) and Kokoro, Tortoise (Apache 2.0). Commercial with conditions: Coqui TTS toolkit (MPL 2.0, disclose toolkit modifications). Non-commercial only: XTTS v2 (CPML) and F5-TTS (CC-BY-NC-4.0); both need a separate agreement. For commercial voice cloning, use Tortoise (Apache 2.0) or the Coqui toolkit on a VITS backend (MPL 2.0). Factual reference, not legal advice.
  • None of these match commercial TTS quality at scale. ElevenLabs, Google Text-to-Speech, and Azure TTS still outperform local engines on consistency, naturalness, and latency across all use cases. Local TTS is the right choice when privacy, cost, or offline operation matters more than absolute quality.

Quick Facts

  • Fastest local TTS: Piper. Real-time on Raspberry Pi 5, ~10× faster than real-time on a modern desktop CPU.
  • Best voice cloning quality: XTTS v2. 6 seconds of reference audio, cross-lingual cloning in 17 languages.
  • Fastest zero-shot voice cloning (newer arch): F5-TTS. ~3 seconds of audio, flow-matching, ~3–5× real-time on RTX 4070.
  • Most flexible open-source toolkit: Coqui TTS. Supports VITS, Tacotron2, and XTTS backends, with 20+ language models.
  • Only generative audio (non-speech sounds): Bark. Laughter, sighs, music, ambient sound. Slowest of all.
  • Best English narration quality: StyleTTS 2. Diffusion-based style transfer, near-human MOS on the LJSpeech benchmark.
  • Free for commercial use: Piper, Bark, StyleTTS 2 (MIT); Kokoro, Tortoise (Apache 2.0); Coqui TTS toolkit (MPL 2.0, with conditions). Non-commercial: XTTS v2 (CPML), F5-TTS (CC-BY-NC-4.0).
  • XTTS v2 voices and languages: No fixed voice list; you supply a 6-second reference clip and it clones that voice. Built-in speaker presets ship with the model, and it generates in 17 languages: en, es, fr, de, it, pt, pl, tr, ru, nl, cs, ar, zh-cn, ja, hu, ko, hi.
  • XTTS v2 VRAM: ~2 GB model weights; 4 GB minimum to run, 4–6 GB recommended for real-time inference.
  • Accept the CPML in CI/Docker: export COQUI_TOS_AGREED=1 (no interactive prompt needed).
  • VRAM requirements: Piper: CPU only. Kokoro: CPU / 1–2 GB. StyleTTS 2: 2–4 GB. Coqui VITS: 2–4 GB. F5-TTS: 3–5 GB. XTTS v2: 4–6 GB. Bark: 4–8 GB. Tortoise: 4–8 GB.

Why Local TTS Matters

Cloud TTS services (ElevenLabs, Google TTS, Amazon Polly, Azure Speech) are convenient but come with per-character billing, audio data retention policies, and latency from network round-trips. Local TTS eliminates all three.

  • Privacy: Your text content never leaves your machine. Critical for medical dictation, legal summaries, private diary narration, or confidential document read-aloud.
  • Cost: Cloud TTS pricing is typically $4–$30 per million characters. A developer generating 10 million characters per month saves $40–$300/month with a one-time local setup.
  • Latency: No network round-trip. Piper generates the first audio in under 50 ms on CPU, faster than any cloud TTS round-trip.
  • Customization: Voice cloning (XTTS v2, F5-TTS, Coqui) lets you create a custom voice from a few seconds of audio. Cloud providers charge $10+/month per cloned voice.
  • Offline operation: Works on planes, in secure facilities, in remote areas with no internet. Embedded voice UI for kiosks and appliances.
  • Smart home: Piper is the leading TTS layer for always-on local voice interfaces: real-time on Raspberry Pi, no GPU needed. For a complete offline voice assistant wired into Home Assistant, see local voice assistant for smart home.
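
The arithmetic behind the cost bullet can be checked in a few lines. This is a minimal sketch that assumes a flat per-character rate with no free tier or volume discount; the function name is illustrative, and the $4 and $30 rates are simply the ends of the range quoted above.

```python
def monthly_cloud_cost(chars_per_month: int, usd_per_million_chars: float) -> float:
    """Cloud TTS bill for one month at a flat per-character rate."""
    return chars_per_month / 1_000_000 * usd_per_million_chars


# Range quoted above: $4 to $30 per million characters, at 10 million characters per month.
low = monthly_cloud_cost(10_000_000, 4.0)
high = monthly_cloud_cost(10_000_000, 30.0)
print(f"10M chars/month: ${low:.0f} to ${high:.0f}")  # 10M chars/month: $40 to $300
```

Swap in your own provider's rate and monthly volume to see where the break-even against a one-time hardware purchase falls.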

Head-to-Head Comparison Table

All local TTS engines compared across the metrics that matter most for production deployment.

In One Sentence

Piper is fastest on CPU; XTTS v2 gives the best voice cloning quality; F5-TTS provides zero-shot cloning with a newer architecture; Bark is the only engine producing laughter and music; StyleTTS 2 has the best English narration naturalness.

In Plain Terms

For most offline TTS needs: Piper if you want speed and simplicity, Coqui if you want voice cloning with a permissive license, XTTS v2 if you want the best cloning quality and have a GPU, F5-TTS if you want a newer architecture with faster zero-shot cloning.

| Tool | Quality | Speed | Voice Cloning | Multilingual | VRAM | License | MOS (English) |
|---|---|---|---|---|---|---|---|
| Piper | Good | Very fast (CPU) | No | Yes (20+ langs) | CPU only | MIT | ~3.5 |
| Kokoro | Very good | Fast (CPU) | No | English + expanding | CPU / 1–2 GB | Apache 2.0 | ~4.0 |
| Coqui TTS | Very good | Medium | Yes | Yes (20+ langs) | 2–4 GB | MPL 2.0 | ~3.8 |
| XTTS v2 | Excellent | Slow | Yes (best) | Yes (17 langs) | 4–6 GB | CPML (non-commercial) | ~4.1 |
| F5-TTS | Excellent | Medium-fast | Yes (zero-shot) | Yes (multilingual) | 3–5 GB | CC-BY-NC-4.0 | ~4.1 |
| Bark | Unique / variable | Slow | Limited | Yes (multilingual) | 4–8 GB | MIT | ~3.2–4.0 (variable) |
| StyleTTS 2 | Excellent (English) | Medium | No | English mainly | 2–4 GB | MIT | ~4.3 |
| Tortoise | Excellent | Very slow (minutes/sentence) | Yes | English mainly | 4–8 GB | Apache 2.0 | ~4.2 |

MOS (mean opinion score) on a 1–5 scale where 5 is indistinguishable from human speech. Scores are approximate and based on published benchmarks or community evaluations. MOS varies significantly by test sentence and listener pool. Human reference MOS: ~4.5.

First-Audio Latency Comparison

First-audio latency is the time from text input to first audible output. Critical for voice assistants and interactive applications. For batch processing (audiobooks, podcast production), total throughput matters more than first-audio latency.

| Engine | First audio (RTX 4070) | First audio (CPU) | First audio (M5 Pro) |
|---|---|---|---|
| Piper | ~30 ms | ~50 ms | ~40 ms |
| Kokoro | ~50 ms | ~80 ms | ~60 ms |
| Coqui VITS | ~100 ms | ~300 ms | ~150 ms |
| StyleTTS 2 | ~150 ms | ~500 ms | ~200 ms |
| F5-TTS | ~200 ms | ~800 ms | ~300 ms |
| XTTS v2 | ~300 ms | ~1500 ms | ~500 ms |
| Bark | ~500 ms | ~3000 ms | ~800 ms |
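
To reproduce numbers like these on your own hardware, a small timing harness is enough. The sketch below is engine-agnostic and rests on assumptions that are not from the article: the engine is wrapped in a callable that yields raw 16-bit mono PCM chunks, and `fake_engine` is a hypothetical stand-in so the harness runs without any TTS library installed.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    synthesize: Callable[[str], Iterable[bytes]],
    text: str,
    sample_rate: int,
    bytes_per_sample: int = 2,  # assumption: 16-bit mono PCM
) -> Tuple[float, float]:
    """Return (first_audio_seconds, realtime_factor) for one synthesis call.

    realtime_factor above 1 means audio is produced faster than it plays back.
    """
    start = time.perf_counter()
    first_audio = None
    total_bytes = 0
    for chunk in synthesize(text):
        if first_audio is None and chunk:
            first_audio = time.perf_counter() - start
        total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    if first_audio is None:
        raise ValueError("engine produced no audio")
    audio_seconds = total_bytes / (sample_rate * bytes_per_sample)
    return first_audio, audio_seconds / elapsed


def fake_engine(text: str) -> Iterable[bytes]:
    """Hypothetical stand-in: 0.5 s of silence at 22,050 Hz per word."""
    for _ in text.split():
        time.sleep(0.01)  # pretend to do work
        yield b"\x00\x00" * 11025


latency, rtf = measure_stream(fake_engine, "hello local speech", sample_rate=22050)
print(f"first audio: {latency * 1000:.0f} ms, speed: {rtf:.1f}x real-time")
```

For batch jobs the second number matters more; for voice assistants, the first.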

Piper TTS: Fastest Lightweight Option

Piper is a fast, local text-to-speech system developed by Rhasspy for home automation and embedded use. It uses a VITS-based neural architecture trained on voice datasets with an onnxruntime backend, optimized to run in real time on a Raspberry Pi 4 or 5 without a GPU.

  • Architecture: VITS neural TTS with ONNX inference. Designed for single-board computers and embedded Linux.
  • Installation: pip install piper-tts. Pre-trained voice packs available at the Piper voices repository on Hugging Face.
  • Usage: echo "Hello, world" | piper --model en_US-lessac-medium.onnx --output_file output.wav
  • Voice packs: 20+ languages, multiple voice options per language. Each voice pack is a 20–200 MB ONNX model file.
  • Speed: ~10× faster than real-time on a modern desktop CPU. Real-time on Raspberry Pi 5. Sub-50 ms first-audio latency.
  • Apple Silicon: ~15× real-time on M5 Pro (CPU, ARM NEON). Runs natively without a GPU, with excellent performance on Mac.
  • Listen to samples: Piper voice samples
  • Best for: Home assistants, kiosk devices, embedded voice UI, privacy-sensitive read-aloud where GPU is unavailable.
  • Limitation: No voice cloning. Quality is "good": natural sounding but clearly synthetic compared to XTTS v2 or StyleTTS 2.
  • License: MIT. Fully commercial, no restrictions.
  • Kokoro TTS, a Piper alternative: Kokoro TTS is an emerging alternative to Piper in the lightweight category. It achieves higher naturalness than Piper while remaining fast on CPU. Licensed under Apache 2.0. If Piper's quality doesn't meet your needs but you can't afford GPU VRAM, Kokoro is worth testing.
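
The shell one-liner in the Usage bullet can be driven from Python by piping text to the same CLI. This is a minimal sketch that assumes the `piper` executable is on your PATH and the voice file has already been downloaded; the helper names `piper_command` and `speak_to_file` are hypothetical, while the flags are the ones shown above.

```python
import subprocess
from typing import List


def piper_command(model_path: str, output_path: str) -> List[str]:
    """Build the argument list for the Piper CLI call shown in the Usage bullet."""
    return ["piper", "--model", model_path, "--output_file", output_path]


def speak_to_file(text: str, model_path: str, output_path: str) -> None:
    """Send text to Piper on stdin; raises CalledProcessError if Piper exits non-zero."""
    subprocess.run(
        piper_command(model_path, output_path),
        input=text.encode("utf-8"),
        check=True,
    )


# Example call (needs Piper installed and the voice file present):
# speak_to_file("Hello, world", "en_US-lessac-medium.onnx", "output.wav")
print(" ".join(piper_command("en_US-lessac-medium.onnx", "output.wav")))
```

Passing the arguments as a list rather than a shell string avoids quoting problems when the text or paths contain spaces.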

Coqui TTS: Best Open-Source All-Rounder

Coqui TTS is a Python toolkit for text-to-speech supporting multiple model architectures and voice cloning. It was developed by the Coqui company (which shut down in January 2024) and is now maintained by the open-source community. The toolkit supports Tacotron2, VITS, and XTTS backends.

  • Installation: pip install TTS. Models download automatically on first use.
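  • Voice cloning: Provide 6+ seconds of reference audio to a model that accepts a reference clip, for example the VITS-based YourTTS: tts --text "Hello" --model_name tts_models/multilingual/multi-dataset/your_tts --speaker_wav sample.wav --language_idx en --out_path output.wav. (The tts_models/en/vctk/vits model picks one of its built-in speakers with --speaker_idx and does not clone from a reference clip.)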
  • Backend options: VITS (fastest, good quality), Tacotron2 (older, slower), XTTS (best quality, see XTTS v2 section).
  • Languages: 20+ language models available via tts --list_models.
  • VRAM: 2–4 GB for VITS backend; 4–6 GB for XTTS backend.
  • Apple Silicon: ~8× real-time on M5 Pro (CPU). No Metal GPU acceleration. Usable for batch generation.
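  • Community status: Coqui Inc shut down in January 2024. The original repo (coqui-ai/TTS) no longer has company backing; community maintenance continues, most visibly in the idiap/coqui-ai-TTS fork (published on PyPI as coqui-tts). No active commercial support.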
  • License: MPL 2.0. Commercial use is allowed, but the source code of modifications must be disclosed.
  • Best for: Developers who want voice cloning with an open-source toolkit and a permissive license.
  • Listen to samples: Official coqui.ai demo archived. Community audio examples are linked in the coqui-ai/TTS GitHub repository under the demos section.

XTTS v2: Best Voice Cloning Quality

XTTS v2 (by Coqui) is the highest-quality voice cloning engine available locally in 2026. It uses a GPT-based architecture with cross-lingual transfer: clone a voice in English and speak it in Spanish, German, French, or 13 other languages from the same 6 seconds of audio.

  • Architecture: GPT-based TTS with speaker conditioning. A Perceiver-style conditioning encoder turns the reference clip into speaker latents, and a HiFi-GAN decoder produces the waveform.
  • Voice cloning: 6 seconds of reference audio is sufficient for a convincing voice clone. 3 seconds produces passable quality.
  • Cross-lingual cloning: Clone voice in one language, generate speech in 17 different languages with the same voice characteristics.
  • Languages (17): English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese (zh-cn), Japanese, Hungarian, Korean, and Hindi. Korean and Hindi were added in XTTS v2.0.3.
  • "XTTS v2 voices": There is no fixed catalog of named voices. XTTS v2 is a cloning model: you provide a 6-second reference clip and it reproduces that speaker. The repo ships a handful of built-in speaker presets for quick tests, but the intended workflow is supplying your own speaker_wav.
  • VRAM: Model weights are ~2 GB. 4 GB VRAM is the practical minimum; 4–6 GB is recommended for real-time inference. Runs on CPU but ~5–10× slower.
  • Speed: Slow. Generates ~2× real-time on an RTX 4070. Not suitable for real-time voice assistant pipelines.
  • Apple Silicon: ~3× real-time on M5 Pro (CPU, no Metal acceleration). Usable for batch audio generation, not for real-time voice assistant output.
  • Listen to samples: XTTS v2 demo on Hugging Face
  • License: CPML (Coqui Public Model License), non-commercial. The CPML permits personal, research, and hobby use of the model and its audio outputs, but prohibits commercial use (any paid product, SaaS, ad-supported content, or client work) without a separate commercial agreement. Coqui Inc shut down in January 2024, so there is currently no entity selling XTTS v2 commercial licenses; in practice, treat XTTS v2 as non-commercial only. See the CPML non-interactive acceptance section for the COQUI_TOS_AGREED environment variable.
```python
from TTS.api import TTS

# Load the XTTS v2 model (downloaded on first use) and move it to the GPU
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# Clone the voice from a 6-second reference clip and synthesize in any of the 17 languages
tts.tts_to_file(
    text="Bonjour, je suis votre assistant vocal.",
    speaker_wav="reference_voice.wav",   # 6+ seconds of the target speaker
    language="fr",                        # output in French using the cloned voice
    file_path="output.wav",
)
```

⚠️Warning: XTTS v2 is covered by the CPML (non-commercial) license. Commercial use (products, SaaS, services, or paid client work) requires a separate commercial agreement, and since Coqui Inc shut down in January 2024 no such agreement is currently available to buy. If you need commercial voice cloning, use Tortoise (Apache 2.0) or the Coqui TTS toolkit on a VITS backend (MPL 2.0). This is factual reference, not legal advice: read the CPML yourself before deploying.
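
Because clone quality depends on the reference clip being long enough, it can help to check the clip before loading a multi-gigabyte model. The sketch below uses only the standard library and assumes the reference is an uncompressed PCM WAV file; the function names and the `write_silence` demo helper are hypothetical, and the 6-second threshold is the figure recommended in this section.

```python
import os
import tempfile
import wave

MIN_SECONDS = 6.0  # reference length recommended in this section


def clip_seconds(path: str) -> float:
    """Duration in seconds of an uncompressed PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference(path: str, min_seconds: float = MIN_SECONDS) -> float:
    """Return the clip duration, or raise ValueError if it is too short."""
    seconds = clip_seconds(path)
    if seconds < min_seconds:
        raise ValueError(f"{path}: {seconds:.1f} s is shorter than {min_seconds:.1f} s")
    return seconds


def write_silence(path: str, seconds: int, rate: int = 22050) -> None:
    """Demo helper: write a silent 16-bit mono WAV so the check runs without a recording."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate * seconds)


demo_path = os.path.join(tempfile.mkdtemp(), "reference_voice.wav")
write_silence(demo_path, seconds=7)
print(check_reference(demo_path))  # 7.0
```

The check covers length only. Background noise, music, and multiple speakers still have to be judged by ear.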

Bark: Generative Audio Beyond Speech

Bark (by Suno AI) is a generative text-to-audio model that produces speech, music, laughter, coughing, sighs, and ambient sounds from text prompts. It is not a traditional TTS engine: it is a generative model that interprets text prompts as audio generation instructions.

  • Unique capability: Include `[laughs], [sighs], [clears throat], [music], or [sound effect: wind]` in your text and Bark generates those sounds alongside speech.
  • Not controllable like traditional TTS: Output varies between runs for the same input. Quality is inconsistent: some outputs are excellent, others have artifacts or unintelligible segments.
  • Speed: Slow. 2–4× slower than real-time even on an RTX 4090. Not suitable for interactive applications.
  • Apple Silicon: ~1.5× real-time on M5 Pro (CPU, MPS partial). MPS (Metal Performance Shaders) support is partial, so most inference still falls back to CPU.
  • Listen to samples: Bark audio examples on GitHub
  • Best for: Creative audio, podcast production with sound effects, interactive fiction, experimental voice applications.
  • VRAM: 4–8 GB GPU. Runs on CPU with significantly lower quality.
  • Installation: pip install suno-bark. Models download on first run (~2 GB).
  • License: MIT. Fully commercial.
  • Limitation: No reliable voice cloning. The "voice presets" bundled with Bark are approximate and are not a true voice cloning system.

StyleTTS 2: Highest Natural Quality

StyleTTS 2 is a diffusion-based TTS model that achieves near-human mean opinion scores (MOS) on the LJSpeech benchmark. It transfers speaking style using diffusion, generating speech that is more natural and expressive than VITS-based models.

  • Architecture: Diffusion-based style transfer. Samples from a learned distribution of speaking styles rather than deterministically mapping text to audio.
  • Quality: Highest MOS scores of any open-source English TTS engine on the LJSpeech benchmark. Listeners rate it as near-indistinguishable from human narration in controlled tests.
  • Best for: Audiobook narration, professional voiceover, podcast production, any application where English quality is more important than voice customization.
  • Installation: Clone the GitHub repo, install requirements (pip install -r requirements.txt), download model checkpoints (~500 MB).
  • Language support: Primarily English. Multilingual capability is limited and not recommended for non-English use.
  • Voice cloning: Not supported. StyleTTS 2 generates in trained speaker voices only.
  • VRAM: 2–4 GB GPU. Faster than XTTS v2 at ~5–8× real-time on RTX 4070.
  • Apple Silicon: ~6× real-time on M5 Pro (CPU). No Metal acceleration, but ARM performance is solid for batch audio generation.
  • Listen to samples: StyleTTS 2 on GitHub. Search "StyleTTS 2 audio samples" for community examples if the demo page is unavailable.
  • License: MIT. Fully commercial.

F5-TTS: Zero-Shot Voice Cloning

F5-TTS is a flow-matching-based TTS model with zero-shot voice cloning: clone any voice from ~3 seconds of reference audio without fine-tuning. It is one of the fastest-growing local TTS projects in 2025–2026, actively developed and rapidly gaining community adoption.

  • Architecture: Flow-matching (a diffusion-variant approach) instead of the GPT-based architecture used by XTTS v2. Flow-matching typically offers faster inference with competitive quality.
  • Voice cloning: ~3 seconds of reference audio is sufficient for zero-shot voice cloning. No fine-tuning is required; it works on any voice at inference time.
  • Quality: Competitive with XTTS v2 on English. MOS of approximately 4.1 in community evaluations.
  • Speed: ~3–5× real-time on RTX 4070, faster than XTTS v2 (~2× real-time) for comparable voice cloning quality.
  • Languages: Multilingual, with strong support for English and Chinese and expanding support for other languages.
  • Apple Silicon: ~2× real-time on M5 Pro (CPU). No Metal acceleration currently.
  • VRAM: 3–5 GB GPU recommended. Smaller footprint than XTTS v2.
  • Installation: pip install f5-tts or clone from GitHub.
  • License: CC-BY-NC-4.0. Non-commercial use only; commercial use requires a separate agreement with the authors.
  • Why it matters: F5-TTS brings a newer architecture to local voice cloning with an active community. If XTTS v2 is too slow for your pipeline or its CPML license is a concern for non-commercial projects, F5-TTS is the primary alternative to evaluate.

Licenses & Commercial Use: Can I Use This TTS Engine Commercially?

License is the single most important factor for production use, and it splits these engines cleanly into two groups. Permissively licensed engines (MIT, Apache 2.0) are free to ship in a commercial product. Restricted engines (CPML, CC-BY-NC-4.0) are non-commercial: using them in a paid product, SaaS, ad-supported content, or client work requires a separate agreement. The table below gives the exact license and a direct "can I use this commercially?" answer for each engine.

In One Sentence

For local TTS in a commercial product, Piper, Bark, and StyleTTS 2 (MIT), Kokoro and Tortoise (Apache 2.0), and the Coqui TTS toolkit on a VITS/Tacotron2 backend (MPL 2.0) are all allowed; XTTS v2 (CPML) and F5-TTS (CC-BY-NC-4.0) are non-commercial.

In Plain Terms

The two most popular voice-cloning models, XTTS v2 and F5-TTS, cannot be used commercially without a separate license. For commercial voice cloning, Tortoise (Apache 2.0) or the Coqui toolkit on a VITS backend (MPL 2.0) are the safe choices.

| Tool | License | Commercial OK? | Key Condition |
|---|---|---|---|
| Piper | MIT | Yes, no restrictions | Include MIT notice; check per-voice model license |
| Kokoro | Apache 2.0 | Yes, no restrictions | Include Apache 2.0 notice |
| Coqui TTS (toolkit) | MPL 2.0 | Yes, with conditions | Disclose source of any modifications to the toolkit files |
| XTTS v2 (model) | CPML | No, non-commercial | Commercial needs an agreement; none on sale since Coqui closed (Jan 2024) |
| F5-TTS | CC-BY-NC-4.0 | No, non-commercial | NC carries over even to fine-tunes (Emilia training data) |
| Bark | MIT | Yes, no restrictions | Include MIT copyright notice |
| StyleTTS 2 | MIT | Yes, no restrictions | Include MIT copyright notice |
| Tortoise | Apache 2.0 | Yes, no restrictions | Attribution; obtain consent for any cloned voice |

Note: Coqui TTS (the toolkit, MPL 2.0) and XTTS v2 (the specific model weights, CPML) are licensed differently. You can ship the Coqui TTS toolkit with VITS or Tacotron2 backends in a commercial product under MPL 2.0. The CPML non-commercial restriction applies specifically to the XTTS v2 model weights and their audio outputs, not to the toolkit code.

⚠️Warning: This is factual reference, not legal advice. Licenses change and edge cases (voice consent, dataset terms, per-voice model licenses) matter. Read each engine's license file yourself, and consult a lawyer, before relying on any of these terms for commercial deployment.
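
One way to keep a non-commercial engine from slipping into a commercial build is to encode the table above as data and check it in CI. The following is a minimal sketch: the engine keys and the function name are hypothetical, and the license values are copied from the table rather than read from any license file, so they need re-verifying whenever a project changes its terms.

```python
from typing import Dict, Iterable, List, Tuple

# (license, commercial use allowed) per engine, copied from the table above.
LICENSES: Dict[str, Tuple[str, bool]] = {
    "piper": ("MIT", True),
    "kokoro": ("Apache 2.0", True),
    "coqui-tts-toolkit": ("MPL 2.0", True),  # with conditions: disclose toolkit modifications
    "xtts-v2": ("CPML", False),
    "f5-tts": ("CC-BY-NC-4.0", False),
    "bark": ("MIT", True),
    "styletts2": ("MIT", True),
    "tortoise": ("Apache 2.0", True),
}


def non_commercial_engines(engines: Iterable[str]) -> List[str]:
    """Return the engines the table marks as non-commercial.

    Unknown names raise KeyError so a typo cannot pass silently.
    """
    return [name for name in engines if not LICENSES[name][1]]


blocked = non_commercial_engines(["piper", "xtts-v2", "tortoise"])
print(blocked)  # ['xtts-v2']
```

A CI step could fail the build whenever the returned list is non-empty for the engines a product actually ships.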

Accept the CPML Non-Interactively (COQUI_TOS_AGREED)

The first time you load an XTTS / Coqui model that is covered by the CPML, the library prints the license terms and waits for you to type "y" to accept. That interactive prompt hangs in Docker builds, CI pipelines, and headless servers. To accept the CPML non-interactively, set the COQUI_TOS_AGREED environment variable to 1. This records that you have read and agreed to the Coqui Public Model License before the model loads. It does not change the license: the CPML is still non-commercial, and setting the variable is your agreement to those terms, not a waiver of them.

In One Sentence

Set the environment variable COQUI_TOS_AGREED=1 to accept the Coqui Public Model License (CPML) without the interactive prompt in Docker, CI, or any headless environment.

In Plain Terms

In a shell or Dockerfile use export COQUI_TOS_AGREED=1; in Python set `os.environ["COQUI_TOS_AGREED"] = "1"` before importing or loading the model. Either way the model loads without waiting for keyboard input.

  • Shell / CI: export COQUI_TOS_AGREED=1 before running your script.
  • Docker: add ENV COQUI_TOS_AGREED=1 to your Dockerfile, or pass -e COQUI_TOS_AGREED=1 to docker run.
  • Python (set it before the model loads): `import os; os.environ["COQUI_TOS_AGREED"] = "1"` must run before `TTS(...)` instantiates the XTTS model.
  • What it does: records non-interactive acceptance of the CPML so model load does not block on a y/n prompt. It is not a commercial license and does not remove the non-commercial restriction.
```python
# 1) Shell / CI: accept the CPML once for the session (shell command, not Python)
#    export COQUI_TOS_AGREED=1

# 2) Dockerfile: bake acceptance into the image (Dockerfile instruction, not Python)
#    ENV COQUI_TOS_AGREED=1

# 3) Python: set it before the model is created
import os
os.environ["COQUI_TOS_AGREED"] = "1"   # must be set BEFORE the TTS() call below

from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
# The model now loads without the interactive license prompt
```

⚠️Warning: COQUI_TOS_AGREED=1 only suppresses the interactive prompt. It is your acceptance of the CPML, which remains a non-commercial license, and it does not grant commercial rights to XTTS v2.
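
In a headless job it is usually better to fail immediately than to hang on a prompt nobody can answer. The sketch below is a pre-flight check to run before the model loads. It assumes, as this section describes, that the library treats the exact value 1 as acceptance; the helper names are hypothetical and not part of the Coqui API.

```python
import os
from typing import Mapping


def cpml_accepted(env: Mapping[str, str] = os.environ) -> bool:
    """True when COQUI_TOS_AGREED is set to the exact string "1"."""
    return env.get("COQUI_TOS_AGREED") == "1"


def require_cpml_acceptance(env: Mapping[str, str] = os.environ) -> None:
    """Fail fast in headless jobs instead of blocking on the license prompt."""
    if not cpml_accepted(env):
        raise RuntimeError(
            "COQUI_TOS_AGREED is not set to 1; loading XTTS v2 would wait for keyboard input"
        )


print(cpml_accepted({"COQUI_TOS_AGREED": "1"}))  # True
print(cpml_accepted({}))                          # False
```

Call `require_cpml_acceptance()` at the top of a batch script so that a missing variable shows up as a clear error in the CI log rather than as a timeout.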

How Local TTS Compares to ElevenLabs and Cloud TTS

ElevenLabs, Google Text-to-Speech, and Azure Speech remain the quality ceiling for TTS in 2026. This section shows where local engines compete effectively and where cloud still wins.

  • Quality ceiling: ElevenLabs > StyleTTS 2 ≈ XTTS v2 > F5-TTS ≈ Coqui TTS > Piper. ElevenLabs is still the quality ceiling in 2026 for consistency and expressiveness.
  • Latency: Piper local (~30–50 ms first audio) is faster than any ElevenLabs API round-trip (~300–500 ms). For real-time voice UI, local Piper wins on latency.
  • Cost: ElevenLabs charges $5–99/month by tier. Local TTS costs $0 after one-time hardware. At scale (millions of characters/month), local is significantly cheaper.
  • Voice cloning: ElevenLabs Instant Voice Clone ≈ XTTS v2 quality. ElevenLabs Professional Voice Clone (requires a speaker recording session) exceeds any local engine.
  • Privacy: Local TTS = no audio data sent anywhere. ElevenLabs = audio processed on their servers. Critical for sensitive content.
  • Offline capability: Local = fully offline. ElevenLabs = requires internet. No offline mode available.
  • When to use cloud: Professional voiceover production, customer-facing products requiring highest quality, multi-voice projects with dozens of characters.
  • When to use local: Privacy-critical audio, embedded devices, cost-sensitive batch processing, offline environments, development and prototyping.

How to Choose

A decision flowchart from your requirement to the right TTS engine:

In One Sentence

Need voice cloning? → XTTS v2 (best quality) or F5-TTS (faster, newer arch) or Coqui TTS (open license). Need CPU speed? → Piper. Need creative audio? → Bark. Need best English quality? → StyleTTS 2.

In Plain Terms

If you want to clone someone's voice, use XTTS v2 for quality or F5-TTS for faster inference or Coqui VITS for a permissive license. If you're building a Raspberry Pi or kiosk voice UI, use Piper. If you're making a podcast with sound effects, try Bark. If you're narrating audiobooks in English, use StyleTTS 2.

  • Need voice cloning? → Yes: XTTS v2 (best quality, CPML license) or F5-TTS (newer arch, faster, CC-BY-NC-4.0) or Coqui VITS (good quality, MPL 2.0). No: Piper (speed), StyleTTS 2 (quality).
  • Need to run on CPU only / Raspberry Pi? → Piper. Kokoro is a higher-quality CPU alternative with an Apache 2.0 license. All other engines need a GPU for acceptable performance.
  • Need creative audio with non-speech sounds? → Bark. No other local engine produces laughter, sighs, or music natively.
  • Need the best English narration quality? → StyleTTS 2. It outperforms all others on naturalness for English audiobook-style speech.
  • Need multilingual support? → XTTS v2 (17 languages, cross-lingual cloning), Coqui (20+ languages), Piper (20+ language packs).
  • Need a permissive commercial license? → Piper, Bark, StyleTTS 2 (MIT), Kokoro, Tortoise (Apache 2.0), or the Coqui toolkit on VITS (MPL 2.0). Avoid XTTS v2 (CPML) and F5-TTS (CC-BY-NC-4.0) for commercial use; both are non-commercial without a separate agreement.
  • Need commercial voice cloning (permissive license)? → Tortoise (Apache 2.0) for highest quality if you can tolerate minutes-per-sentence generation, or the Coqui TTS toolkit on a VITS backend (MPL 2.0) for faster cloning. XTTS v2 and F5-TTS are higher quality but non-commercial.
  • Need voice control via text description? → Parler-TTS. Describe the voice you want ("a calm elderly man speaking slowly") and it generates matching speech. This is a novel approach: no reference audio and no voice cloning. Useful when you need a specific voice character without a sample.
  • Building a voice assistant pipeline? → Piper for low-latency TTS output (see /power-local-llm/build-local-voice-assistant-2026).
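
The flowchart above can also be expressed as a filter over the comparison data. This is an illustrative sketch, not a definitive ranking: the `Engine` record and `candidates` function are hypothetical, `min_vram_gb` is the lower bound of each VRAM range quoted in this guide (0 meaning CPU-capable), and Bark's "Limited" cloning is treated as no cloning.

```python
from typing import List, NamedTuple


class Engine(NamedTuple):
    name: str
    cloning: bool
    min_vram_gb: int   # lower bound of the VRAM range in this guide; 0 = runs on CPU
    commercial: bool   # per the license table; MPL 2.0 counts as allowed (with conditions)


ENGINES = [
    Engine("Piper", False, 0, True),
    Engine("Kokoro", False, 0, True),
    Engine("Coqui TTS (VITS)", True, 2, True),
    Engine("XTTS v2", True, 4, False),
    Engine("F5-TTS", True, 3, False),
    Engine("Bark", False, 4, True),
    Engine("StyleTTS 2", False, 2, True),
    Engine("Tortoise", True, 4, True),
]


def candidates(need_cloning: bool, commercial: bool, vram_gb: int) -> List[str]:
    """Engines that satisfy the cloning, license, and VRAM constraints."""
    return [
        e.name
        for e in ENGINES
        if (e.cloning or not need_cloning)
        and (e.commercial or not commercial)
        and e.min_vram_gb <= vram_gb
    ]


print(candidates(need_cloning=True, commercial=True, vram_gb=8))
# ['Coqui TTS (VITS)', 'Tortoise']
print(candidates(need_cloning=False, commercial=True, vram_gb=0))
# ['Piper', 'Kokoro']
```

The filter only narrows the field. Quality, speed, and language coverage from the comparison table still decide among the remaining candidates.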

Frequently Asked Questions

How much reference audio do I need for voice cloning with XTTS v2?

XTTS v2 requires a minimum of 3 seconds of clean reference audio, with 6+ seconds giving noticeably better results. The audio must be a single speaker with minimal background noise and no music. Higher quality source audio (recorded in a quiet room with a good microphone) produces better clones than compressed audio.

Can I use Piper TTS in a commercial product?

Yes. Piper is licensed under MIT, which permits unlimited commercial use. You must include the MIT license notice in your product. The voice models (ONNX files) may have separate licenses per voice β€” check the individual voice model's license on the Piper voices repository before deploying.

Is Coqui TTS still maintained after the company shut down?

Yes, but with reduced pace. The Coqui company shut down in January 2024, but the open-source repository (coqui-ai/TTS) is maintained by community contributors. Bug fixes and security patches are applied, but major new model training or features are unlikely without significant community effort. For XTTS v2, expect no new model versions from Coqui.

Which local TTS engine has the best multilingual support?

XTTS v2 supports 17 languages with cross-lingual voice cloning, the most impressive multilingual feature of any local engine. Coqui TTS has 20+ language models but without cross-lingual cloning. Piper has 20+ language voice packs for fast CPU inference. If you need to clone a voice and produce speech in multiple languages from one reference sample, XTTS v2 is the only option.

Can Bark produce music?

Bark can produce simple musical snippets alongside speech when prompted with `[music]` or `[singing]` tokens. It is not a dedicated music generator: outputs are short, inconsistent, and often artifact-laden. For actual music generation, Bark is not the right tool. It is best used for adding emotional non-speech sounds (laughter, coughing, sighs) to speech output rather than for full music tracks.

What is the best free local TTS for voice cloning?

F5-TTS (CC-BY-NC-4.0) for non-commercial use: it clones voices from ~3 seconds of audio with quality competitive with XTTS v2. For commercial use, Coqui TTS with VITS backend (MPL 2.0) allows commercial deployment with source disclosure conditions. XTTS v2 has the best quality but its CPML license restricts commercial deployment without a separate agreement.

Can I run XTTS v2 on an Apple Silicon Mac?

Yes, but CPU-only, at approximately 3× real-time on M5 Pro. There is no Metal GPU acceleration for TTS engines currently. Unlike whisper.cpp (which has full Metal support), TTS engines run on CPU on Apple Silicon. Performance is usable for batch audio generation but not suitable for real-time voice assistant output.

Which local TTS engine sounds most human?

StyleTTS 2 for English narration: it achieves the highest MOS scores of any open-source English TTS engine (~4.3 vs human reference ~4.5). XTTS v2 and F5-TTS are competitive (~4.1) for cloned voice naturalness. None match ElevenLabs Turbo v2 at peak quality for production use cases.

Can I use XTTS v2 commercially?

No, not without a separate commercial agreement. XTTS v2 is released under the Coqui Public Model License (CPML), which permits personal, research, and hobby use of the model and its audio outputs but prohibits commercial use: any paid product, SaaS, ad-supported content, or client work. Coqui Inc shut down in January 2024, so there is currently no entity selling XTTS v2 commercial licenses; in practice, treat XTTS v2 as non-commercial only. For commercial voice cloning, use Tortoise (Apache 2.0) or the Coqui TTS toolkit on a VITS backend (MPL 2.0). This is factual reference, not legal advice: read the CPML yourself before deploying.

How do I accept the Coqui CPML license non-interactively (Docker / CI)?

Set the environment variable COQUI_TOS_AGREED to 1. The Coqui/XTTS library normally prints the CPML and waits for you to type "y", which hangs in Docker builds, CI, and headless servers. Setting COQUI_TOS_AGREED=1 records your acceptance so the model loads without the prompt. Use export COQUI_TOS_AGREED=1 in a shell or CI step, ENV COQUI_TOS_AGREED=1 in a Dockerfile, or `os.environ["COQUI_TOS_AGREED"] = "1"` in Python before the TTS() call. It only suppresses the prompt: it is your agreement to the CPML and does not grant commercial rights.

How many voices and languages does XTTS v2 support?

XTTS v2 has no fixed catalog of named voices. It is a cloning model, so you supply a 6-second reference clip and it reproduces that speaker (the repo also ships a few built-in speaker presets for quick tests). It generates speech in 17 languages: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese (zh-cn), Japanese, Hungarian, Korean, and Hindi. Cloning is cross-lingual: clone a voice once and generate it in any of the 17 languages.

Kokoro vs Piper: which lightweight CPU TTS should I use?

Both run fast on CPU with no GPU and both are permissively licensed (Piper is MIT, Kokoro is Apache 2.0), so either is safe for commercial use. Choose Piper when you need the lowest latency and the widest language coverage (20+ language voice packs, real-time on a Raspberry Pi 5); it is the standard for embedded and smart-home voice. Choose Kokoro (an 82M-parameter model built on the StyleTTS 2 architecture) when you want higher naturalness than Piper and can accept slightly more compute; its English quality is closer to the heavier GPU engines. For a Raspberry Pi or always-on assistant, Piper; for a desktop/server read-aloud where quality matters more than milliseconds, Kokoro.
