PromptQuorumPromptQuorum
Home/Prompt Engineering/Prompting Across Languages: How to Get Consistent Results
Use Cases by Vertical

Prompting Across Languages: How to Get Consistent Results

Β·12 min readΒ·By Hans Kuepper Β· Founder of PromptQuorum, multi-model AI dispatch tool Β· PromptQuorum

LLMs were trained primarily on English β€” prompting in French, German, Japanese, or Arabic activates a different region of the model's knowledge, with lower accuracy and higher token costs. Use English system prompts for reasoning, target-language instructions for formality, and always declare the output language explicitly.

Key Takeaways

  • LLMs perform best in English; non-English outputs have 5–15% higher error rates in Tier 3+ languages (Ahuja et al., 2023 MEGA benchmark).
  • English system prompts + native-language user input outperforms native system prompts for structured tasks in Tier 2–3 languages.
  • 1,000 English words β‰ˆ 1,300 tokens; the same content in Arabic β‰ˆ 1,900 tokens β€” 46% more expensive in API cost.
  • Mistral models (Mistral Large 2) lead on French/Italian/Spanish; Google Gemini 3.0 Pro leads on Japanese/Korean/Chinese; GPT-4o leads on Arabic.
  • Few-shot examples must be in the target language β€” mismatched examples cut accuracy by ~20% (Shi et al., 2023).
  • Always declare output language explicitly in the system prompt: "Respond in formal German (Sie-form)." β€” never assume the model will match the user's language.

Quick Facts

  • 46% of CommonCrawl training data is English; only 3% is Chinese, 5% is French, 6% is German.
  • 1,900 tokens needed for 1,000 words in Arabic (46% more than English); 900 tokens for Chinese (31% less).
  • 5–12% accuracy improvement by using English chain-of-thought reasoning with native-language output (Tier 3 languages).
  • 15–20% accuracy drop when using English few-shot examples for non-English tasks (Shi et al., 2023).
  • Mistral Large 2 leads on Romance languages; Gemini 3.0 Pro leads on East Asian; GPT-4o leads on Arabic.

Why Language Matters More Than You Think

πŸ’¬ In Plain Terms

Think of it this way: LLMs learned English from billions of books, websites, and articles. They learned French from millions. When you ask a question in French, the model has fewer examples to draw from, so it makes more mistakes β€” just like you would solving math problems in a language you've only studied for a few weeks versus one you've spoken your whole life.

Multilingual prompting is not translation β€” it is activating a different part of the model's learned distribution. LLMs tokenise and represent text in a shared embedding space, but training data is skewed: CommonCrawl (used to train most LLMs) is ~46% English, ~6% German, ~5% French, ~3% Chinese. Languages with <1% training share (e.g., most African languages, many South Asian languages) behave unpredictably.

When you prompt in French, the model relies on patterns from French training data. Since French data is only ~5% of the training corpus, the model has fewer learned associations to draw from compared to English prompts. This manifests as: lower reasoning accuracy, inconsistent instruction following, higher hallucination rates, and unpredictable output quality.

For a deeper dive into how LLMs actually learn language patterns, see how LLMs actually work.

The 4-Tier Language Model

πŸ“ In One Sentence

Higher training data share = more learned patterns = more reliable outputs; Tier 1 (English) is ~46% of training, Tier 2 (European) is ~5–8%, Tier 3 (Asian/Arabic) is ~2–4%, Tier 4 (<1%) requires retrieval-augmented generation.

Language performance in LLMs follows a four-tier hierarchy based on training data share, with Tier 1 (English) performing near-perfectly and Tier 4 (low-resource languages) producing unreliable outputs. Use the tier system to decide which strategies apply to your target language.

TierLanguagesTraining Share (approx.)Recommended Strategy
Tier 1English~46%Prompt directly, any technique works
Tier 2French, German, Spanish, Portuguese, Italian5–8% eachNative-language user prompts, English system prompt for structure
Tier 3Chinese, Japanese, Korean, Arabic, Russian2–4% eachEnglish CoT + native output, test outputs rigorously
Tier 4Most other languages<1%Use RAG with pre-verified content; avoid generative outputs without human review

Token Costs by Script

The same 1,000-word piece of content costs 46% more tokens in Arabic than in English, and 30% more in Japanese β€” directly increasing your API bill. Token efficiency varies dramatically by script and language family. This affects both API costs and context window budgeting.

See tokens, costs, and limits for a detailed breakdown of how to budget tokens in your multilingual workflows.

LanguageScriptTokens (approx.)vs. EnglishAPI Cost Multiplier
EnglishLatin~1,300baseline1.0Γ—
GermanLatin~1,500+15%1.15Γ—
FrenchLatin~1,450+12%1.12Γ—
SpanishLatin~1,400+8%1.08Γ—
RussianCyrillic~1,700+31%1.31Γ—
Chinese (Simplified)CJK~900βˆ’31%0.69Γ—
JapaneseCJK + kana~1,100βˆ’15%0.85Γ—
KoreanHangul~1,400+8%1.08Γ—
ArabicArabic~1,900+46%1.46Γ—

Should Your System Prompt Be in English or the Target Language?

For structural and reasoning tasks, English system prompts outperform native-language system prompts in Tier 2–3 languages. For tone and formality, native-language system prompts perform better. This is the single most important decision in multilingual prompting β€” get it wrong and your outputs suffer.

Why? Most instruction-following capability in LLMs was trained on English RLHF (Reinforcement Learning from Human Feedback) data. Complex system-level instructions (formatting rules, personas, chain-of-thought directives) are more reliably followed when written in English. English instructions are part of the model's core reasoning pathway.

But style instructions (formality register, cultural tone, politeness level) are best written in the target language because they depend on understanding native speakers' expectations for what "formal French" or "polite Japanese" actually means.

Decision tree: Complex reasoning/formatting rules β†’ English system prompt. Formality register (Sie, Vous, keigo) β†’ target language. Persona definition β†’ English + one target-language sample. Output language specification β†’ always explicit in system prompt: "Respond in formal Japanese (丁寧θͺž / です・ます体)."

For the full breakdown, see system prompt vs. user prompt.

❌ System prompt entirely in German: "Du bist ein Kundensupport-Assistent. Antworte auf Deutsch."

Why it hurts: Complex instructions (error handling, structure, logic) get lost in translation. Model struggles to follow formatting rules in low-resource language.

Fix: Use English for system instructions: "You are a customer support assistant. Respond in German using formal Sie-form." Then include tone/register guidance in German.

⚠️ Common Mistake

Writing both system prompt AND user instructions in the target language often reduces reasoning accuracy. Use English for logic, target language for tone.

πŸ’‘ Pro Tip

Test both approaches (English system + English reasoning vs. English system + native reasoning) on your exact use case. Model behavior varies by language tier.

Bad vs. Good: Multilingual System Prompt

Bad prompt β€” assumes model will detect language and register:

"Summarise this German contract."

Result: Mixed English/German output, informal register, may miss legal terminology.

Good prompt β€” explicit language, register, and reasoning path:

"You are a legal analyst. The following document is a German employment contract (Arbeitsvertrag). Summarise its key obligations in formal German (Sie-Form). Structure: Vertragsparteien, Vergütung, Kündigungsfristen, Besondere Klauseln. Maximum 200 words. Flag any clause that is unusual for standard German employment law with PRÜFEN."

Result: Structured, formal German output with domain-appropriate terminology and flagged anomalies.

Which Models Handle Which Languages Best?

No single model leads across all languages. Mistral Large 2 leads on Romance languages; Google Gemini 3.0 Pro leads on East Asian languages; GPT-4o leads on Arabic and multilingual reasoning tasks. This table aggregates model performance from Ahuja et al. (2023) MEGA benchmark.

ModelTier 2 (European)Tier 3 (East Asian)ArabicBest Use Case
GPT-4oβœ… Strongβœ… Strongβœ… BestGeneral multilingual, structured extraction
Claude Opus 4.7βœ… Strongβœ“ Goodβœ“ GoodDocument analysis, nuanced tone
Gemini 3.0 Proβœ“ Goodβœ… Bestβœ“ GoodJapanese/Korean/Chinese, translation
Mistral Large 2βœ… Best⚠ Moderate⚠ ModerateFrench/Spanish/Italian business content
Qwen 3 72B⚠ Moderateβœ… Strongβœ“ GoodChinese-primary workflows (open-source)
Llama 3.3 70Bβœ“ Good⚠ Moderate⚠ ModerateEuropean languages, budget option

πŸ’‘ Pro Tip

Use PromptQuorum to test your exact prompt across all 6 models simultaneously. Side-by-side output comparison reveals which model performs best for your language + task combination.

πŸ“Œ Did You Know?

Model performance varies not just by language, but by domain. A model might excel at Japanese technical translation but struggle with Japanese customer service tone.

Cost by Use Case

The token cost differences above translate directly to your API bill. Here's the real-world impact based on GPT-4o pricing ($5 per 1M input tokens).

Use CaseEnglish CostArabic CostJapanese CostSavings Tip
100 customer emails/day$X$1.46X$0.85XUse Gemini 3.0 Pro for Japanese; budget 46% extra for Arabic
10,000-word report summary$Y$1.46Y$0.85YChunk in English, output in target language
500 product descriptions$Z$1.46Z$0.85ZChinese is cheapest (0.69Γ—)

Chain-of-Thought Prompting Across Languages

For Tier 3 languages, writing your chain-of-thought instruction in English but requesting the final answer in the target language improves reasoning accuracy by 5–12% (Shi et al., 2023). This cross-lingual CoT technique exploits the model's English reasoning strength while preserving output quality in your target language.

When LLMs reason step-by-step, they rely on patterns from their largest training corpus (English). If you force reasoning to occur entirely in a low-resource language like Japanese or Arabic, accuracy drops because the model has fewer learned reasoning patterns in that language. The hybrid approach β€” English CoT, native-language output β€” is best of both worlds.

Template: `Think through this step by step in English, then write your final answer in Japanese. Question: question`

Decision: Use English CoT when β†’ task requires multi-step reasoning, target language is Tier 3+, accuracy matters more than latency. Use native-language CoT when β†’ tone and register matter more than reasoning depth, target language is Tier 1–2.

Deep dive: Chain-of-thought prompting: how to get LLMs to show their work.

⚠️ Caution

Cross-lingual CoT works for Tier 3 languages but may confuse models in Tier 4 languages. Always test on a small sample before committing to the approach.

πŸ› οΈ Best Practice

For maximum accuracy, combine cross-lingual CoT with few-shot examples: show the model a full example (English reasoning β†’ Japanese answer) before giving it a new task.

Few-Shot Examples and Language Matching

Few-shot examples must be in the same language as the task β€” cross-language few-shot examples reduce output accuracy by 15–20% in Tier 2–3 languages (Shi et al., 2023). Few-shot examples teach the model format, tone, and pattern. When examples are in English but the task is in French, the model receives conflicting signals.

Two strategies: (1) Native few-shot β€” all examples in target language (best for quality). (2) Zero-shot + explicit instructions β€” no examples, but clear style/format rules in English (best when native examples are unavailable). Avoid mixing: English examples + French task = worst of both.

See few-shot vs. zero-shot prompting for the full decision framework.

πŸ“Œ Key Point

Source language mismatch matters: English examples train the model on English formatting, then it must simultaneously switch languages and infer format β€” a dual cognitive load that degrades output.

Formality, Register, and Honorifics

LLMs default to informal registers in most languages. If your use case requires formal German (Sie-form), formal Japanese (丁寧θͺž), or French Vous-form, you must explicitly declare the register in your system prompt β€” models will not infer it from context. This is often overlooked and causes outputs to sound wrong to native speakers.

LanguageLLM DefaultFormal OverrideInformal Override
GermanMixed Sie/duVerwende ausschließlich die Sie-Form.Verwende die du-Form.
FrenchInformal tuUtilisez exclusivement le vouvoiement (Vous).Utilise le tutoiement (tu).
Japaneseですます (polite)Use 丁寧θͺž throughout.Use plain form (だ体).
SpanishMixed Usted/tΓΊUtilice exclusivamente el tratamiento de usted.Usa el tuteo (tΓΊ).
KoreanMixed formal/informalUse formal 합쇼체 throughout.Use informal ν•΄μš”μ²΄.

πŸ› οΈ Best Practice

Test register enforcement on 3–5 sample outputs before deploying. Some models may drift to informal mid-response even with explicit instructions; if so, add a reminder: "Do not switch to informal register under any circumstances."

Code-Switching: When Users Mix Languages

When users mix languages in a prompt (e.g., English question with a German brand name or French code comment), most models respond in the dominant language of the query β€” but this is unreliable without explicit instruction. Code-switching is common in multilingual workplaces where technical terms stay in English but surrounding prose is in another language.

Recommended handling: (1) In system prompt: "When the user writes in mixed languages, respond in target language unless the question is explicitly in English." (2) Detect language programmatically (langdetect, FastText, lingua-rs) before routing to the model, rather than relying on the model to detect it. (3) For production multilingual apps: implement a language detection step before the LLM call to route to the correct prompt template.

⚠️ Warning

Do not rely on models to auto-detect the user's intended output language when code-switching occurs. Always include explicit language declaration in the system prompt or detect programmatically.

Reusable Multilingual Prompt Templates

Four template patterns you can adapt for your own multilingual workflows. Copy and customize the target language placeholders for your use case.

  1. 1
    Language-aware system prompt: "You are a role assistant for Company. Respond in target language using formality register. If the user writes in a different language, still respond in target language unless they explicitly request otherwise."
  2. 2
    Cross-lingual CoT (for Tier 3 languages): "Think through this step by step in English. Write your final answer in Japanese/Arabic/Korean."
  3. 3
    Native few-shot header: "Here are 2 examples of the expected output format in language: Example 1: native-language example Example 2: native-language example Now complete the following: task"
  4. 4
    Register enforcement: "Respond in formal language. Use specific register instruction. Do not switch to informal register regardless of how the user writes."

How PromptQuorum Helps Multilingual Workflows

  • One prompt β†’ multiple models β†’ side-by-side language comparison. Send the same French prompt to Mistral Large 2, Claude, and GPT-4o and see which produces the best register, accuracy, and tone in one run.
  • 9 built-in prompt frameworks β€” all support multilingual templates with language-specific placeholders. Examples: CoT, few-shot, persona, register-enforcement patterns.
  • Token count display per model β€” see exactly how many tokens your Arabic or Japanese input consumes before sending, preventing budget surprises.
  • Context overflow alerts for multilingual inputs β€” automatically flags when Arabic or Russian content (which use 30–46% more tokens) approaches your model's context window.
  • Local LLM support via Ollama/LM Studio β€” test Qwen 3 or Llama 4 on Chinese/Japanese tasks without API costs, then compare outputs with cloud models.
  • Side-by-side output comparison β€” see the exact register, accuracy, and tone differences between models in your target language. Identify which model wins for your specific use case.

Common Mistakes

  • Assuming English prompt β†’ native language output works without adjustment: "Just translate your prompt" produces lower-quality results than rewriting it for the target language. Translated prompts often contain awkward phrasing that confuses the model.
  • Using English few-shot examples for non-English tasks: Cross-language examples reduce accuracy 15–20%. Write or source native-language examples.
  • Not declaring output language explicitly: Models will guess from context β€” and sometimes guess wrong. Always include "Respond in language" in the system prompt.
  • Ignoring token cost differences: Arabic and Russian inputs consume 30–46% more tokens than English equivalents. Budget accordingly.
  • Testing only in English then assuming non-English will be equal quality: Non-English outputs require separate evaluation. Use MGSM or XCOPA benchmarks to measure cross-lingual reasoning.
  • Forcing complex reasoning in Tier 4 languages: For languages with <1% training share, generative tasks often produce confident-sounding wrong answers. Use retrieval (RAG) with pre-verified content instead.

How to Set Up a Multilingual Prompt Workflow

  1. 1
    Identify which language tier(s) your target language(s) fall into (Tier 1–4).
  2. 2
    Select the right model for each language (Mistral Large 2 for Romance, Gemini 3.0 Pro for East Asian, GPT-4o for Arabic).
  3. 3
    Write an English system prompt with explicit language instruction: "Respond in formal German (Sie-form)."
  4. 4
    Prepare few-shot examples in the target language (minimum 2, ideally 3).
  5. 5
    For Tier 3+ languages, test CoT: include "Think step by step in English, then respond in language."
  6. 6
    Run PromptQuorum multi-model dispatch to compare model outputs on your specific language task before committing to one model.

Regional Compliance & Data Considerations

European Union (GDPR): If processing French, German, or other EU-language data, ensure your LLM API meets GDPR Article 28 (Data Processing Agreement). Mistral Large 2 and Claude Opus 4.7 both offer EU-compliant deployments with data residency in Frankfurt/Ireland. GPT-4o requires data processing terms via OpenAI's Data Processing Agreement. Never send personally identifiable information (names, email, phone) to models without explicit consent and DPA coverage.

Japan (APPI): Japanese enterprises deploying multilingual LLMs must comply with the Act on Protection of Personal Information (APPI). Gemini 3.0 Pro offers Japan-region deployment with data residency in Tokyo. GPT-4o and Claude Opus 4.7 require DPA terms. Consider local LLMs (Qwen2.5, Llama 3.1) deployed on-premises to guarantee data never leaves Japan.

China (Data Security Law): Prompting in Chinese or Chinese user data triggers the 2021 Data Security Law (DSL). Foreign cloud LLMs (OpenAI, Anthropic, Google) cannot be used for sensitive PII or government workflows. Deploy Qwen2.5 locally via Alibaba Cloud or Baidu Cloud with data residency compliance. For non-sensitive use (marketing, customer chat), foreign APIs are acceptable but must have data transfer agreements in place.

FAQ

Should I write my prompt in English or the target language?

For structural reasoning tasks, write the system prompt in English. For tone and formality, write the user message and register instructions in the target language.

Why does AI perform worse in non-English languages?

LLM training datasets are dominated by English (~46% of CommonCrawl). Languages with <5% training share have fewer patterns for the model to draw on, producing higher error rates.

Which AI model handles Japanese best?

Google Gemini 3.0 Pro consistently leads on Japanese, Korean, and Chinese. GPT-4o is a close second.

How much more do Arabic prompts cost than English prompts?

Arabic text uses approximately 46% more tokens than equivalent English content. Budget accordingly for high-volume Arabic applications.

Do I need to translate my few-shot examples?

Yes. Few-shot examples should be in the same language as your expected output. Cross-language examples reduce accuracy by 15–20%.

What is cross-lingual chain-of-thought prompting?

Cross-lingual CoT uses English for the reasoning steps but requests the final answer in the target language. For Tier 3 languages, this improves reasoning accuracy by 5–12%.

How do I make an LLM use formal German (Sie-form)?

Add to your system prompt: "Verwende ausschließlich die Sie-Form und einen professionellen Ton." Models default to mixed registers; this instruction is required to enforce Sie-form consistently.

What is code-switching in multilingual prompting?

Code-switching occurs when a user writes in a mix of languages. Without explicit instructions, models respond in whatever language they detect as dominant.

Can I use the same prompt template across all languages?

No. Each language tier requires a different strategy. Tier 1 works with any prompt. Tier 2–3 need language-specific CoT and few-shot strategies. Tier 4 requires RAG.

How does PromptQuorum help with multilingual prompting?

PromptQuorum dispatches the same prompt to multiple models simultaneously and returns side-by-side outputs. This lets you identify which model performs better on your specific language and task in one run.

Sources

Apply these techniques across 25+ AI models simultaneously with PromptQuorum.

Try PromptQuorum free β†’

← Back to Prompt Engineering

Multilingual Prompting: Get Consistent AI Results in Any Language