Key Takeaways
- Llama 3.3 70B is the best all-round creative-writing local model in May 2026. Strongest voice consistency, takes direction well, handles dark themes without refusal when the system prompt frames the work as fiction.
- Qwen3 32B is the right 24 GB-rig pick. Nearly Llama 70B prose quality without the 48 GB+ VRAM bill. The default for most laptops and desktops.
- Mistral Large wins long-form continuity. 128K context out of the box; novel-length drafting without losing thread or character voice.
- Command R+ 104B has the cleanest dialogue voice. Most natural conversational beats across characters; the pick when dialogue is the load-bearing part of the work.
- Yi-1.5 34B is the poetry and lyrical-prose specialist. Niche pick for verse, stylised prose, and short-form work where rhythm matters.
- Uncensored derivatives (Hermes 3, Dolphin 3.0) are the right move when instruct-tuned models refuse. Same base models; the RLHF safety layer is removed; the model follows the prompt instead of declining. Mature fiction, conflict scenes, and morally complex characters become writable.
- Sampling matters more than people think. Temperature 0.8–1.1 and top-p 0.9–0.95 are the creative range. Coding-style settings (0.2–0.4) produce flat, predictable prose. Higher (1.2+) is genre/surreal territory.
Quick Facts
- Best overall: Llama 3.3 70B at Q4_K_M, ~42 GB VRAM. Strongest voice consistency in this set.
- Best 24 GB-rig: Qwen3 32B at Q4_K_M, ~20 GB VRAM. The default for most users.
- Best long-form: Mistral Large at Q4_K_M, ~75 GB total VRAM (heavy). 128K context out of the box.
- Best dialogue: Command R+ 104B at Q4_K_M, ~62 GB VRAM. Cleanest character-voice differentiation.
- Best poetry: Yi-1.5 34B at Q4_K_M, ~21 GB VRAM. Lyrical prose specialist.
- Uncensored options: Hermes 3 (Llama 3.3 base, ~42 GB) and Dolphin 3.0 (multiple base sizes, 13–42 GB).
- Sampling default for prose: temperature 0.95, top-p 0.92, repeat penalty 1.1. Adjust per task type.
How We Tested: 50+ Creative Prompts Across Six Models
The test held the prompt set, sampling settings, and frontend constant; only the model varied. Same 50 prompts across fiction, dialogue, poetry, and worldbuilding tasks; output graded by the same rubric per task type.
- Backend: Ollama 0.5+ on macOS and Linux; same context limits per model; Q4_K_M quantization across all six picks (Q5_K_M for the smaller 32B–34B models where VRAM permitted, with no measurable difference in the rubric scores).
- Frontend: Open WebUI for the bulk of the test (chat-style work); SillyTavern for the dialogue-heavy and roleplay subset (matches how creative writers actually use these models).
- Prompt set: 50 prompts split across four task types – fiction (15: short-story openings, scene continuations, descriptive passages), dialogue (15: two-character exchanges, group scenes, voice differentiation), poetry (10: free verse, structured forms, lyrical prose), worldbuilding (10: setting descriptions, factional politics, magic systems). Each prompt was run 3 times per model to capture variance (a minimal harness sketch follows this list).
- Sampling: temperature 0.95, top-p 0.92, repeat penalty 1.1 as the baseline; per-task adjustments noted in the per-model verdicts below.
- Grading rubric: voice consistency (does the character or narrator sound the same across paragraphs?), prompt fidelity (did the model follow the direction or invent its own scene?), prose quality (rhythm, vocabulary, avoiding cliché), and willingness (did the model refuse or sanitise scenes that the prompt explicitly framed as fiction?).
- Honesty constraint: scores reported as relative ranks per task, not invented absolute percentages. "Best dialogue" means consistent first place on the dialogue subset across the 3 runs; "strong" means top-3; "weak" means the model lost to one or more competitors on the rubric.
- For the prompting techniques that improve creative output on any model, see temperature and top-p control and persona prompting.
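For readers who want to reproduce the setup, a minimal harness sketch follows, assuming a local Ollama serving its default REST API on localhost:11434. The model tags and the single example prompt are placeholders; substitute whatever `ollama list` shows and the full 50-prompt set.

```python
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

# Fixed baseline from the methodology above; only the model varies.
BASELINE = {"temperature": 0.95, "top_p": 0.92, "repeat_penalty": 1.1}

# Placeholder tags -- check `ollama list` for the exact names on your machine.
MODELS = ["llama3.3:70b", "qwen3:32b", "mistral-large", "command-r-plus", "yi:34b"]

# One example prompt; the real set has 50 across four task types.
PROMPTS = ["Open a noir scene where the detective enters a rain-soaked diner at 2 AM."]

def generate(model: str, prompt: str) -> str:
    resp = requests.post(OLLAMA_URL, json={
        "model": model,
        "prompt": prompt,
        "options": BASELINE,
        "stream": False,  # return one JSON object instead of a token stream
    }, timeout=600)
    resp.raise_for_status()
    return resp.json()["response"]

# Three runs per prompt per model, saved for rubric grading afterwards.
results = {
    f"{model}|{i}": [generate(model, prompt) for _ in range(3)]
    for model in MODELS
    for i, prompt in enumerate(PROMPTS)
}

with open("outputs.json", "w") as f:
    json.dump(results, f, indent=2)
```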
📝Note: Creative-writing benchmarks are inherently subjective. The rubric above (voice consistency, prompt fidelity, prose quality, willingness) is the closest we got to repeatable scoring, but two readers grading the same outputs will disagree on prose quality more often than they agree. Treat the verdicts as starting hypotheses to test on your own work.
Head-to-Head: Six Local Models on Creative Writing Tasks
Llama 3.3 70B leads on the broadest set of tasks; the smaller and specialised models each win one or two categories. Pick by task type, not by overall ranking.
📌 In One Sentence
Llama 3.3 70B is the strongest all-round creative model; Qwen3 32B is the lighter alternative; Mistral Large wins long-form; Command R+ wins dialogue; Yi-1.5 wins poetry; Hermes/Dolphin handle scenes the others refuse.
💬 In Plain Terms
No single model is best at everything. Llama 3.3 70B is the safe default if you have the hardware. Qwen3 32B is the smart pick on a 24 GB GPU. Pick a specialist (Mistral for novels, Command R+ for dialogue, Yi-1.5 for poetry) when one task type is the load-bearing part of the work. Pick an uncensored derivative when the instruct-tuned model refuses scenes you need to write.
| Model | Size | VRAM (Q4_K_M) | Fiction | Dialogue | Poetry | Worldbuilding | Best for |
|---|---|---|---|---|---|---|---|
| Llama 3.3 70B | 70B | ~42 GB | Best | Strong | Strong | Best | Best all-round; default if hardware fits |
| Qwen3 32B | 32B | ~20 GB | Strong | Strong | OK | Strong | 24 GB-rig default; small loss vs Llama 70B |
| Mistral Large | 123B | ~75 GB | Strong (long-form) | Strong | OK | Strong | Novel-length continuity, 128K context |
| Command R+ | 104B | ~62 GB | Strong | Best | OK | Strong | Dialogue-heavy work, group scenes |
| Yi-1.5 34B | 34B | ~21 GB | OK | OK | Best | OK | Poetry, lyrical prose, stylised work |
| Hermes 3 / Dolphin 3.0 | 13B–70B | ~9–42 GB | Same as base | Same as base | Same as base | Same as base | Scenes instruct-tuned models refuse |
💡Tip: Two-model setup is the common pattern: Llama 3.3 70B (or Qwen3 32B) as the daily driver, plus the Hermes 3 derivative on the same Ollama for scenes the instruct version refuses. Switch between them per scene (a router sketch follows below); both can sit in `ollama list` at the same time without conflict.
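A minimal sketch of that per-scene switch, assuming both models are already pulled into the same Ollama install. The tags and the `mature` flag are illustrative, not a fixed API:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

DAILY_DRIVER = "llama3.3:70b"  # instruct-tuned default (placeholder tag)
UNCENSORED = "hermes3:70b"     # uncensored derivative (placeholder tag)

def write_scene(prompt: str, mature: bool = False) -> str:
    """Route a scene to the instruct model by default, or to the
    uncensored derivative when the instruct version would refuse."""
    model = UNCENSORED if mature else DAILY_DRIVER
    resp = requests.post(OLLAMA_URL, json={
        "model": model,
        "prompt": prompt,
        "options": {"temperature": 0.95, "top_p": 0.92, "repeat_penalty": 1.1},
        "stream": False,
    }, timeout=600)
    resp.raise_for_status()
    return resp.json()["response"]
```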
Per-Model Verdicts
- Llama 3.3 70B – best all-round. Strongest voice consistency in the test set; characters keep the same speech patterns across long scenes. Takes direction well: when the system prompt specifies POV, tone, or genre conventions, Llama 3.3 follows. Handles dark themes (violence, grief, morally grey characters) without refusal when the prompt frames the work as fiction. Where it falls short: long passages occasionally drift into generic "literary" voice; small models in the same family (8B) lose this strength.
- Qwen3 32B – best 24 GB-rig pick. Slightly less voice consistency than Llama 3.3 70B, but the gap is small enough that most writers will not notice on prose-heavy work. Strongest of the smaller models on direction-following. Where it falls short: poetry and stylised prose lag noticeably; default to Yi-1.5 for those.
- Mistral Large – best long-form continuity. The 128K context window means a 50,000-word draft fits without truncation; the model holds character details, plot threads, and world rules across chapters. Where it falls short: the hardware bar is the highest in this set (~75 GB at Q4_K_M); per-token speed slows on long inputs. Use Mistral La Plateforme on EU infrastructure if local hardware is the constraint.
- Command R+ 104B – best dialogue. Distinct character voices that hold across exchanges; group scenes (3+ speakers) stay legible without the "everyone sounds the same" failure mode common to other models. Where it falls short: prose paragraphs between dialogue beats are competent but not lyrical; for purely descriptive passages, Llama 3.3 wins.
- Yi-1.5 34B – best poetry and lyrical prose. Rhythm-aware, comfortable with structured forms (sonnet, villanelle, haiku), and produces verse that holds up better than the larger general models. Where it falls short: long-form fiction is competent but not its strength; pick Llama 3.3 or Qwen3 for narrative work.
- Hermes 3 (Llama 3.3 base) and Dolphin 3.0 – uncensored derivatives. Same architecture and base weights as Llama 3.3 70B; the RLHF safety layer is removed during fine-tuning. The model follows the prompt rather than refusing or sanitising mature content. Quality on non-mature prompts is identical to the base model; the difference is only visible on prompts where instruct-tuned versions refuse. Use these when the work requires it (mature fiction, violent scenes, morally complex characters) and stick with the instruct-tuned base otherwise.
📝Note: Voice consistency is the metric most writers underweight in early model evaluation. A model that produces brilliant prose on a single paragraph but loses the character's speech patterns by paragraph 5 is worse than a model that produces competent prose with consistent voice across 20 paragraphs. Test long-session continuity, not just one-shot quality.
Sample Outputs: Three Models, One Prompt
Shared prompt: "Open a noir scene where the detective enters a rain-soaked diner at 2 AM." Same temperature (0.95), same top-p (0.92), same repeat penalty (1.1); only the model changes. The excerpts below are representative of each model's typical output on this prompt class across three runs.
- Llama 3.3 70B – "The bell above the door chimed twice and went silent. Rain ran down my coat in sheets, pooling on the linoleum before I'd taken three steps. The counter girl didn't look up. Two truck drivers at the far booth studied their coffee. Nobody in a place like this at two in the morning wanted to be noticed, and neither did I." – *Concrete sensory grounding, active voice, no wasted words. The social scene is established in three sentences.*
- Qwen3 32B – "I pushed through the door into a wall of cigarette smoke and griddle grease. The diner was half-empty and fully quiet, the kind of quiet that settles in when people have agreed, without saying so, to leave each other alone. A ceiling fan turned overhead. The waitress refilled a mug at the far end of the counter without looking at me." – *Observational setup before action; efficient atmosphere. Slightly more structured than Llama 3.3 but equally clean.*
- Yi-1.5 34B – "Rain hammered the awning with the regularity of a metronome counting out the last hours of a night no one wanted to remember. Inside, the fluorescents threw their cold arithmetic across the laminate: every stain, every scratch, every year the place had been losing arguments with entropy. I found a stool at the near end and sat down like a man who had already made his peace with something." – *Metaphor-led from the first clause; rhythmic sentence variation; denser imagery. Yi-1.5's poetry background shows even in prose.*
| Model | Typical opening phrasing | Typical descriptor pattern |
|---|---|---|
| Llama 3.3 70B | Action-first, immediate sensory detail | Concrete and physical; avoids abstract nouns; socially grounded |
| Qwen3 32B | Environmental observation before character action | Efficient; social/atmospheric detail; slight structural tell |
| Yi-1.5 34B | Metaphor or simile from the first clause | Abstract imagery; rhythmic variation; denser; occasional purple streak |
| Command R+ 104B | Character voice or dialogue-adjacent opener | Conversational; strong distinct voice; weaker solo description |
| Mistral Large | Scene-setting paragraph; slower start | Even and controlled; consistent across long passages; slightly generic |
📝Note: These excerpts are illustrative of each model's tendencies across multiple runs, not cherry-picked highlights. Yi-1.5 34B's "losing arguments with entropy" landed in one of three runs; the other two were more straightforward. Run any model 2–3 times on the same prompt and take the one that fits your scene, not just the first output (a candidate-picking sketch follows).
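That candidate-picking loop is a few lines against the Ollama REST API; a sketch with a placeholder model tag, printing every take so the writer chooses rather than the script:

```python
import requests

def candidates(model: str, prompt: str, n: int = 3) -> list[str]:
    """Generate n takes on the same prompt so the writer can pick the best fit."""
    takes = []
    for _ in range(n):
        resp = requests.post("http://localhost:11434/api/generate", json={
            "model": model,
            "prompt": prompt,
            "options": {"temperature": 0.95, "top_p": 0.92, "repeat_penalty": 1.1},
            "stream": False,
        }, timeout=600)
        resp.raise_for_status()
        takes.append(resp.json()["response"])
    return takes

noir = "Open a noir scene where the detective enters a rain-soaked diner at 2 AM."
for i, take in enumerate(candidates("llama3.3:70b", noir), 1):
    print(f"--- take {i} ---\n{take}\n")
```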
Temperature and Top-P for Creative Work
Creative writing wants higher sampling temperatures than coding does. The default sampling parameters that ship with most chat UIs are tuned for question-answering, not for prose: temperature 0.7 and top-p 0.9 produce flat, predictable output on creative prompts.
- Baseline for prose: temperature 0.95, top-p 0.92, repeat penalty 1.1. This is the starting point for most fiction, dialogue, and worldbuilding work. Adjust per task from here.
- Tight dialogue: temperature 0.7–0.85, top-p 0.9. Lower temperatures keep character voices consistent across exchanges; higher values produce out-of-character interjections.
- Lyrical prose and poetry: temperature 1.0–1.2, top-p 0.95. Higher temperatures unlock unexpected word choices that make verse work.
- Surreal or genre fiction: temperature 1.1–1.3, top-p 0.95–0.98. Pushes the model to produce less-common combinations of imagery and metaphor.
- Plot-driven scenes (action, mystery, twists): temperature 0.85–0.95, top-p 0.9. Wants direction-following more than novelty.
- Repeat penalty 1.1–1.15 is the right range for most creative work. Higher (1.2+) makes the model avoid repeating words even when repetition is stylistically intentional; lower (1.0–1.05) lets the model fall into loops on long scenes.
- min_p (0.05–0.1): A newer alternative to top-p that dynamically scales the probability cutoff relative to the peak token probability. More permissive on creative prompts than top-p 0.9 without the incoherence risk of very high top-p values. The recommended default for SillyTavern and KoboldCpp users in 2026 when the interface exposes it; Ollama passes it through as-is, and Open WebUI 0.5+ exposes it under Advanced Settings.
- DRY repetition penalty (multiplier 0.8, base 1.75, allowed length 2): Catches phrase-level repetition that the standard repeat_penalty misses. Where repeat_penalty tracks individual tokens, DRY tracks n-gram sequences, so the cliché "shiver down their spine" in scene 1 is suppressed when it would otherwise reappear in scene 4. Useful for long-session creative work where the model has seen its own output and starts pulling from it.
- Modern creative-writing baseline (2026): temperature 0.95, min_p 0.05, DRY multiplier 0.8 (base 1.75, allowed length 2). Top-p 0.92 still works well if your frontend does not expose min_p or DRY; these are incremental improvements over the classic settings, not mandatory replacements (the settings are collected as presets in the sketch after this list).
- For a fuller treatment of why these parameters matter and how they interact, see temperature and top-p control.
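The ranges above collapse into a small presets table. The sketch below picks one value from each range (a judgment call, not canon) and uses the keys of Ollama's `options` field; DRY is typically a frontend-side sampler (SillyTavern, KoboldCpp) rather than an Ollama option, so it is omitted here:

```python
# Per-task sampling presets drawn from the ranges above. Keys follow the
# Ollama REST-API `options` field; min_p is passed through where the
# backend supports it (see the min_p bullet above).
PRESETS = {
    "prose_baseline":  {"temperature": 0.95, "top_p": 0.92, "repeat_penalty": 1.1},
    "tight_dialogue":  {"temperature": 0.8,  "top_p": 0.9,  "repeat_penalty": 1.1},
    "poetry":          {"temperature": 1.1,  "top_p": 0.95, "repeat_penalty": 1.1},
    "surreal_genre":   {"temperature": 1.2,  "top_p": 0.97, "repeat_penalty": 1.1},
    "plot_driven":     {"temperature": 0.9,  "top_p": 0.9,  "repeat_penalty": 1.1},
    # Modern 2026 baseline: min_p instead of top-p where the stack exposes it.
    "modern_baseline": {"temperature": 0.95, "min_p": 0.05, "repeat_penalty": 1.1},
}
```

Pass the chosen preset as the `options` field of a generate request, as in the harness sketch earlier.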
💡Tip: Test sampling settings on a single short scene per model: three runs at each setting, then pick the temperature where the model sounds most alive without losing the prompt. Settings that work on Llama 3.3 70B will not perfectly transfer to Mistral Large or Yi-1.5; calibrate per model.
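A sweep along those lines, sketched against the Ollama API with a placeholder model tag; read the nine outputs and keep the temperature that sounds most alive:

```python
import requests

def sweep(model: str, scene: str, temps=(0.85, 0.95, 1.1), runs=3) -> None:
    """Print len(temps) x runs takes on one short scene for manual comparison."""
    for t in temps:
        for r in range(runs):
            resp = requests.post("http://localhost:11434/api/generate", json={
                "model": model,
                "prompt": scene,
                "options": {"temperature": t, "top_p": 0.92, "repeat_penalty": 1.1},
                "stream": False,
            }, timeout=600)
            resp.raise_for_status()
            print(f"=== temp {t}, run {r + 1} ===\n{resp.json()['response']}\n")

sweep("qwen3:32b", "Open a noir scene where the detective enters a rain-soaked diner at 2 AM.")
```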
Uncensored Models: What They Are and When They Matter
Uncensored does not mean unethical. It means the model has had its instruction-tuning safety layer (RLHF refusals) removed or bypassed, so the model follows the prompt instead of declining or sanitising. The writer is still the author; the tool stops getting in the way.
- What "uncensored" means technically. Models like Hermes 3 and Dolphin 3.0 are fine-tuned variants of base models (Llama 3.3, Qwen3) where the post-training RLHF pass that produces refusals on mature, violent, or morally complex prompts has been replaced with a fine-tune that follows the prompt. Same architecture, same base weights, different post-training.
- When they matter for creative work. Mature fiction (literary novels with sex scenes, crime fiction with graphic violence, horror), historically-accurate writing (war, atrocity, colonial-era brutality), morally complex characters (the model would otherwise refuse to voice a convincing antagonist), and roleplay scenarios that the instruct-tuned models will not engage with.
- Where they fall short. They follow the prompt, including badly written prompts. The instruct-tuned models often soften vague prompts into something publishable; uncensored models give you exactly what you asked for, which is sometimes worse. The writer's direction matters more.
- Ethical boundaries. "The model will write it" is not a creative-writing licence to write content that targets real people, depicts non-consensual scenarios involving real or identifiable individuals, or that is illegal in the writer's jurisdiction. Local hosting does not change the law; it changes who can see the draft.
- Legal context (May 2026, brief and non-exhaustive). EU AI Act and member-state laws (notably German StGB §184/§184c) cover specific content categories regardless of where it was generated. US obscenity law applies to publication, not generation. For commercial publishing, the model that produced a draft is irrelevant; the published artefact is what is regulated.
- For a longer treatment of uncensored model ethics, legal context, and best practices, see Uncensored Local LLMs for Creative Writing: Ethics, Legality & Best Practices.
📝Note: Uncensored is a workflow choice, not an identity. Many writers use the instruct-tuned model for the bulk of a project and switch to an uncensored derivative for specific scenes that the instruct version refuses. Two model installs in the same Ollama setup is the common pattern.
Frontends for Creative Work
The chat UI you write in matters as much as the model. Several frontends are credible picks for creative-writing workflows in 2026; pick by workflow shape.
- Open WebUI – the general-purpose pick. ChatGPT-like interface, model switching in one click, character cards via system prompts, document upload for context. Best for prose-heavy work where the chat shape matches your drafting flow.
- SillyTavern – the roleplay and dialogue pick. Character card ecosystem (Tavern v2 spec), persona management, lore books for worldbuilding, group chat for multi-character scenes. Best for dialogue-driven work and long-running character or world projects. Pairs well with Command R+ and uncensored derivatives.
- Agnai and RisuAI – narrower SillyTavern alternatives. Lighter feature sets, easier first-run, less customisation. Pick when SillyTavern feels overbuilt for your workflow.
- Plain Ollama CLI plus a text editor – the minimal pick. `ollama run llama3.3:70b` and pipe scenes through the terminal into your draft document. Loses the persistent character context but wins on writer focus (a minimal wrapper sketch follows this list).
- For the head-to-head comparison of the roleplay-focused frontends, see SillyTavern vs Agnai vs RisuAI: Best Local Roleplay Frontend.
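That minimal workflow as a thin wrapper (the model tag and draft file name are placeholders): `ollama run MODEL PROMPT` prints one completion and exits, so appending a scene to the draft is one function.

```python
import subprocess

def draft_scene(prompt: str, model: str = "llama3.3:70b",
                draft_path: str = "draft.md") -> str:
    """Run one non-interactive generation via the Ollama CLI and
    append the scene to the draft file for later revision."""
    out = subprocess.run(
        ["ollama", "run", model, prompt],
        capture_output=True, text=True, check=True,
    ).stdout
    with open(draft_path, "a", encoding="utf-8") as f:
        f.write(out.strip() + "\n\n")
    return out
```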
💡Tip: Drafting and editing want different frontends. Use SillyTavern for generation (character voice, scene work), then export the chat to a plain text editor for revision. Editing inside the chat window encourages "ask the model to fix it" instead of writer-driven revision, a long-term skills risk.
Decision: Which Model for Your Work
Five questions, in order, get most writers to the right pick.
📌 In One Sentence
Pick Qwen3 32B as the default if you have a 24 GB GPU; Llama 3.3 70B if you have 48 GB+; Mistral Large for novel-length work; Command R+ for dialogue; Yi-1.5 for poetry; Hermes/Dolphin for scenes the instruct models refuse.
💬 In Plain Terms
Qwen3 32B is the right starting model for most writers. Move to one of the specialists when a specific task type (long-form, dialogue, poetry, mature scenes) becomes the bottleneck. Two installs (instruct + uncensored) on the same machine cost nothing; both can sit in Ollama and you switch per scene.
| Your situation | Pick |
|---|---|
| I have 48 GB+ VRAM and want one model for everything | Llama 3.3 70B (instruct) + Hermes 3 (uncensored) on the same Ollama |
| I have a 24 GB GPU or 32 GB Mac and want a strong default | Qwen3 32B |
| I am drafting a novel β long-form continuity is the priority | Mistral Large (or Mistral La Plateforme on EU hardware if local does not fit) |
| My work is dialogue-heavy β character voices need to stay distinct | Command R+ 104B (or Llama 3.3 70B as a lighter alternative) |
| I write poetry, verse, or lyrical prose | Yi-1.5 34B |
| The instruct model is refusing scenes I need to write | Hermes 3 (Llama 3.3 base) or Dolphin 3.0; keep the instruct version installed for non-mature work |
| I want one model to start with and will iterate | Qwen3 32B: covers most workflows on consumer hardware; switch up when one task type becomes the load-bearing part |
💡Tip: Most writers overthink the model and underthink the prompt. A well-crafted system prompt with character notes, voice samples, and explicit POV does more for the output than switching from Qwen3 to Llama 70B. See persona prompting for the prompt structure that consistently lifts creative output; a minimal sketch follows.
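A sketch of that prompt structure against Ollama's chat endpoint. The character notes are illustrative, not a canonical template (the voice sample reuses the Llama excerpt from the sample-outputs section), and the model tag is a placeholder:

```python
import requests

SYSTEM = """You are drafting literary noir fiction.
POV: first person (the detective). Tense: past. Tone: dry, weary, unsentimental.
Character notes: speaks in short declarative sentences; notices social dynamics
before physical detail; never explains his own jokes.
Voice sample: "Nobody in a place like this at two in the morning wanted to be
noticed, and neither did I."
Do not: summarise, moralise, or end scenes on a neat resolution."""

resp = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama3.3:70b",  # placeholder tag
    "messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "Continue the scene: the counter girl finally looks up."},
    ],
    "options": {"temperature": 0.95, "top_p": 0.92, "repeat_penalty": 1.1},
    "stream": False,
}, timeout=600)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```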
Common Mistakes Picking and Using Local Models for Creative Writing
- Mistake 1: chasing the biggest model on benchmarks. Creative-writing performance correlates poorly with general benchmark leaderboards. Yi-1.5 34B beats Llama 3.3 70B on poetry; Command R+ beats both on dialogue. Pick by task, not by leaderboard rank.
- Mistake 2: using coding-style sampling settings. Temperature 0.2–0.4 produces flat, predictable prose. Creative writing wants 0.8–1.1 with top-p 0.9–0.95. The default settings in most chat UIs are tuned for Q&A, not prose.
- Mistake 3: defaulting to the instruct model and giving up when it refuses. The instruct version refuses scenes you have explicitly framed as fiction; the uncensored derivative of the same base model writes them. Two installs in Ollama is the workaround.
- Mistake 4: thin system prompts. "You are a helpful assistant" is the worst possible prompt for creative work. A system prompt with character notes, voice samples, POV, tense, and tone does more for output quality than any model switch. Pair with negative prompting to specify what NOT to do (no exposition, no purple prose, no "she felt").
- Mistake 5: editing inside the chat window. Generating in chat is fine; editing in chat trains a habit of asking the model to fix prose instead of revising it yourself. Export the draft to a text editor for revision; the writer's voice gets stronger when the model is not in the loop.
⚠️Warning: The biggest skill risk with creative-writing AI is outsourcing the revision pass. Generation is mechanical work that benefits from the model; revision is the part that makes the prose yours. Writers who let the model revise lose voice fast, even when they cannot point to which line changed.
Sources
- Hugging Face model cards for Llama 3.3, Qwen3, Mistral Large, Command R+, Yi-1.5 – official model documentation and licensing.
- Hermes 3 (NousResearch) GitHub and model card – uncensored Llama 3.3-based fine-tunes.
- Dolphin 3.0 (Cognitive Computations) model cards – uncensored fine-tunes across multiple base models.
- Ollama Model Library – available models and quantization options referenced above.
- SillyTavern documentation – character card spec, persona system, group chat features.
FAQ
Which local LLM is best for fiction writing in 2026?
Llama 3.3 70B is the best all-round pick when hardware permits (~42 GB VRAM at Q4_K_M). On 24 GB rigs, Qwen3 32B is the lighter default with a small quality gap on prose-heavy work. For long-form continuity (novels), Mistral Large's 128K context is the differentiator. Pick by task type: most writers benefit more from the right specialist than from chasing the biggest model.
What is an uncensored local LLM and when should I use one?
An uncensored model is a fine-tune of an existing base model (typically Llama 3.3 or Qwen3) where the RLHF safety layer that produces refusals on mature or morally complex prompts has been removed. The model follows the prompt instead of declining. Use uncensored derivatives (Hermes 3, Dolphin 3.0) for mature fiction, conflict scenes, historically accurate writing, or any workflow where the instruct-tuned model refuses scenes you have framed as fiction. The writer is still the author; the model just stops getting in the way.
What temperature should I use for creative writing?
Temperature 0.8–1.1 is the creative-writing range, paired with top-p 0.9–0.95. Tight dialogue wants 0.7–0.85; lyrical prose and poetry want 1.0–1.2; surreal or genre work wants 1.1–1.3. The defaults in most chat UIs (often 0.7 with top-p 0.9) are tuned for question-answering and produce flat prose on creative prompts. Test on a short scene at 3 settings, and pick the one where the model sounds most alive without losing the prompt.
Are local creative-writing models as good as ChatGPT or Claude?
For most prompts, yes: close enough that the privacy and cost advantages dominate. The frontier cloud models still lead on the hardest creative tasks (long-form coherence past 50K tokens, very obscure cultural references, rare languages). For a typical fiction or roleplay session, a writer who has calibrated sampling settings on Llama 3.3 70B or Qwen3 32B will not see consistent quality gaps against GPT-5 or Claude. The local setups that lose are the ones left at a default "0.7 temperature, generic system prompt" configuration; that treatment loses to any cloud model.
Can a local model write a full novel?
It can help draft one. Mistral Large at 128K context can hold a 50,000-word draft in memory; Llama 3.3 70B and Qwen3 32B at 32K context need section-by-section drafting. The bottleneck is not model capability; it is the writer's structure (outline, character bible, lore book) that the model uses to keep continuity. Without those, even Mistral Large drifts. With them, any of the top picks holds together for novel-length work.
Do uncensored models produce illegal content?
No more than instruct-tuned models do. Both produce text the prompt asks for; uncensored models are more willing to engage with mature themes that the instruct-tuned safety layer refuses. Legal liability attaches to the writer and the publication, not the model. The EU AI Act, German StGB §184/§184c, and US obscenity law cover specific content categories regardless of generation method. Local hosting does not change the law; it changes who has visibility into the draft.
Is SillyTavern only for adult roleplay?
No. SillyTavern is a chat-focused frontend with character cards, persona management, and lore books, useful for any dialogue-heavy or character-driven work. Many writers use it for non-roleplay fiction drafting (multi-character scenes, voice consistency across long projects). The character card ecosystem includes adult content but is not limited to it; the same UI works for literary fiction, screenwriting, and game-narrative work.
How is local creative writing different from coding workloads?
Sampling settings and prompt structure. Coding wants temperature 0.2–0.4, deterministic output, structured (JSON, code) output, and explicit constraints in the prompt. Creative writing wants temperature 0.8–1.1, freer output, prose form, and richer system prompts (character voice, POV, tone, genre conventions). The same model (Llama 3.3 70B serves both) produces wildly different output depending on these settings. A coding-style prompt on a creative model produces flat output; a creative-style prompt on a coding model produces hallucinated code.
Which local model has the fewest "AI tells"?
AI tells (phrases like "shiver down their spine," "tapestry," "delve," "navigate," and ChatGPT-style transitional summaries) are more frequent in smaller instruct models. Llama 3.3 70B and Qwen3 32B have fewer tells than models below 20B. Hermes 3 has the fewest in this set: the RLHF refusal-pattern training was also where many formulaic transitions were introduced, and removing it removes both. Yi-1.5 34B is unusual: stronger on rare vocabulary but occasionally over-purple. The highest-impact lever for tell reduction is the system prompt with negative examples ("do not write 'shiver,' 'tapestry,' or 'delve'"), not the model.
How do I avoid the "shiver down their spine" clichΓ©?
A system prompt with negative examples is the highest-impact lever: list 8–12 banned phrases explicitly ("do not write 'shiver,' 'tapestry,' 'delve,' 'masterfully,' or 'she felt'"). Lower the temperature slightly (0.85–0.95 instead of 1.1) to reduce the model's reach for stock language. Repeat penalty 1.1 alone does NOT catch this; the phrases are not exact token repetitions. DRY penalty (multiplier 0.8, base 1.75) catches them at the n-gram level across scenes. A manual revision pass is the final filter. See negative prompting for the prompt structure that consistently kills clichés; a minimal sketch follows.
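A minimal sketch of that banned-phrase block, using only the phrases named above; extend the list with the tells you keep seeing in your own outputs:

```python
# Banned phrases from the answer above; the list is a starting point.
BANNED = ["shiver", "tapestry", "delve", "masterfully", "she felt"]

NEGATIVE_BLOCK = (
    "Never use the following words or phrases, in any inflection: "
    + ", ".join(f"'{p}'" for p in BANNED)
    + ". Show emotion through action and dialogue, not by naming feelings."
)

# Slightly lowered temperature per the answer above; append NEGATIVE_BLOCK
# to the system prompt. A manual revision pass remains the final filter.
options = {"temperature": 0.9, "top_p": 0.92, "repeat_penalty": 1.1}
```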