PromptQuorum

Local LLMs With VS Code and Cursor: Setup and Best Practices

10 min read · By Hans Kuepper · Founder of PromptQuorum, multi-model AI dispatch tool

VS Code and Cursor (an AI-first code editor) can both use local LLMs for code completions and suggestions: VS Code via the Continue.dev extension, Cursor via direct integration. As of April 2026, local code completions are practical with 7B–13B models and require 8–16 GB of RAM. This guide covers setup, the best models, and performance tuning.

Key Takeaways

  • VS Code uses Continue.dev extension to connect to local models (Ollama, LM Studio, vLLM).
  • Cursor is a VS Code fork with built-in local model support. No extension needed.
  • Best local models for code: Qwen2.5-Coder 7B, Code Llama 13B, or Mistral 7B.
  • Expect roughly 0.5–3 second completion latency with 7B models, depending on your GPU.
  • As of April 2026, local code completions are practical for personal use, not yet production-grade for teams.

How to Set Up Continue.dev in VS Code

Continue.dev is a VS Code extension for local and cloud code completions.

bash
# 1. Install Continue from the VS Code marketplace
#    (search "Continue" and click Install)

# 2. Make sure Ollama is running
ollama serve

# 3. Open Continue settings (Ctrl+Shift+P → Continue: Open Settings);
#    config.json opens

# 4. Replace the default model settings in config.json with:

json
{
  "models": [{
    "title": "Ollama",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b",
    "apiBase": "http://localhost:11434"
  }],
  "tabAutocompleteModel": {
    "title": "Ollama",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }
}

5. Start typing code and press Tab to accept completions, or use Ctrl+Shift+\ to trigger a completion manually.

How to Use Local Models in Cursor

Cursor is a VS Code fork optimized for AI-assisted coding. It has built-in support for local models via Ollama.

bash
# 1. Download Cursor from cursor.sh
# 2. Make sure Ollama is running
ollama serve

# 3. Open Cursor Settings (Cmd/Ctrl + ,)
# 4. Search "Model" and set:
#    - Model Provider: "Ollama"
#    - Model: "qwen2.5-coder:7b" (or your choice)
#    - API Base: "http://localhost:11434"

# 5. Type code and press Tab for inline completions
# 6. Ctrl+K for multi-line completions
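A common Cursor pitfall is pointing at a model that was never pulled. One way to check, sketched here against Ollama's `/api/tags` endpoint (which lists locally available models, default base URL assumed):

```python
# Sketch: confirm the model Cursor is configured for is pulled in Ollama.
import json
import urllib.request

def has_model(tags: dict, name: str) -> bool:
    """True if `name` appears in a /api/tags response payload."""
    return any(m.get("name", "").startswith(name)
               for m in tags.get("models", []))

def check(base: str = "http://localhost:11434",
          name: str = "qwen2.5-coder:7b") -> bool:
    with urllib.request.urlopen(f"{base}/api/tags") as resp:
        return has_model(json.load(resp), name)

if __name__ == "__main__":
    print("model available:", check())
```

If it prints `False`, run `ollama pull qwen2.5-coder:7b` first.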

Which Models Are Best for Code?

| Model | HumanEval | VRAM | Speed | Best For |
|---|---|---|---|---|
| Qwen2.5-Coder 7B | 72% | 4.7 GB | Fast | Best balance, fastest |
| Code Llama 7B | 69% | 4.7 GB | Fast | General coding |
| Mistral 7B | 61% | 4.5 GB | Very fast | Lightweight, EU-developed |
| Code Llama 13B | 74% | 8.5 GB | Medium | Better quality on 16 GB machines |
| DeepSeek-Coder 6.7B | 68% | 4 GB | Fast | Lightweight alternative |

What Latency and VRAM Should You Expect?

Completion latency (time to first token) is critical for IDE experience. As of April 2026, here are typical numbers:

| Hardware | Model | Latency (TTFT) | Throughput |
|---|---|---|---|
| RTX 4090 GPU | Qwen2.5-Coder 7B | 0.3–0.5 seconds | 150 tokens/sec |
| RTX 4070 GPU | Qwen2.5-Coder 7B | 0.8–1.5 seconds | 80 tokens/sec |
| M3 MacBook Pro | Qwen2.5-Coder 7B | 2–3 seconds | 20 tokens/sec |
| 8-core CPU only | Qwen2.5-Coder 7B | 5–10 seconds | 3 tokens/sec |
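To measure these numbers yourself, stream a completion and timestamp each chunk. The streaming request itself is hardware-specific; the arithmetic below is what "TTFT" and "tokens/sec" in the table mean, given a request timestamp and per-token arrival times:

```python
# Sketch: compute time-to-first-token and throughput from a streamed
# completion, given the request time and the arrival time of each token.
def stream_stats(request_time: float,
                 token_times: list[float]) -> tuple[float, float]:
    """Return (ttft_seconds, tokens_per_second) for one streamed completion."""
    ttft = token_times[0] - request_time            # latency to first token
    elapsed = token_times[-1] - token_times[0]      # generation window
    tps = (len(token_times) - 1) / elapsed if elapsed > 0 else float("inf")
    return ttft, tps
```

Feed it `time.monotonic()` readings captured while iterating an Ollama response with `"stream": true`.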

Advanced Configuration for Code Completions

Fine-tune the experience with these settings:

json
// config.json — advanced settings (// comments shown for readability;
// remove them if your parser requires strict JSON)
{
  "models": [{
    "title": "Ollama",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b",
    "contextLength": 1024        // smaller context = faster inference
  }],
  "tabAutocompleteModel": {
    "title": "Ollama",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b",
    "contextLength": 2048,       // how much code context to send
    "maxTokens": 50              // max tokens per completion
  },
  "completionOptions": {
    "maxContextTokens": 1024,
    "maxSuggestionsCount": 5,
    "debounceWaitMs": 200        // wait before showing completions (ms)
  }
}

For best speed on 8 GB machines:

  • Use a 7B model (not 13B)
  • Set maxTokens to 30
  • Set debounceWaitMs to 500 (less flickering)

Common Mistakes With Local Code Completions

  • Not tuning debounce latency. If completions feel "laggy", increase debounceWaitMs (e.g., to 400 ms) to avoid showing incomplete suggestions.
  • Using a model too large for your VRAM. A 13B model + editor overhead can use 12+ GB. On 8GB machines, stick with 7B models.
  • Expecting cloud-level code quality. GPT-4o is significantly better at code than any 7B model. Local completions are 70–80% of cloud quality.
  • Running inference on CPU. CPU completions are impractical (5–10 second latency). GPU is required for usable completions.
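To sanity-check the "too large for your VRAM" mistake before downloading anything, a rough back-of-envelope estimate is enough. The bits-per-weight and overhead figures below are approximations (a 4-bit quant costs closer to ~5 bits/weight once quantization scales and unquantized layers are counted), not exact numbers for any specific model:

```python
# Rough VRAM estimate for a quantized model: weights plus a fixed
# allowance for KV cache and runtime overhead. Approximate by design.
def est_vram_gb(params_billion: float,
                bits_per_weight: float = 5.0,
                overhead_gb: float = 0.5) -> float:
    """Estimated GB of VRAM needed to run the model."""
    return params_billion * bits_per_weight / 8 + overhead_gb
```

A 7B-class model lands in the 4–5 GB range that matches the table above, while a 13B model clearly will not fit comfortably on an 8 GB card once the editor and OS claim their share.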

Common Questions About Local Code Completions

Is local code completion faster than cloud?

Usually not. Cloud completions (e.g., GitHub Copilot) are typically faster thanks to optimized inference servers. Local completions have higher latency on most consumer hardware, but they cost nothing per request and no code leaves your machine.

Can I use local completions with other IDEs (PyCharm, Neovim)?

Yes, but setup varies. PyCharm has an Ollama plugin. For Neovim, use cmp-ollama (completion plugin). Always check the IDE community for integrations.

Can I use cloud models in Continue or Cursor?

Yes. Configure Continue to use OpenAI, Claude, or Gemini. You can also mix (local for fast, cloud for complex code).
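For example, a Continue config.json that keeps the local model for tab completions and adds a cloud model for harder chat tasks might look like this (the `apiKey` value is a placeholder you must replace):

```json
{
  "models": [
    {
      "title": "Ollama (local)",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b",
      "apiBase": "http://localhost:11434"
    },
    {
      "title": "GPT-4o (cloud)",
      "provider": "openai",
      "model": "gpt-4o",
      "apiKey": "YOUR_OPENAI_API_KEY"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Ollama",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }
}
```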

Does local code completion work offline?

Yes. If you have pulled the model in Ollama, completions work entirely offline.

Sources

  • Continue.dev β€” continue.dev
  • Cursor Editor β€” cursor.sh
  • Continue GitHub β€” github.com/continuedev/continue
  • Qwen2.5-Coder β€” github.com/QwenLM/Qwen2.5-Coder

Compare your local LLM against 25+ cloud models simultaneously with PromptQuorum.

Try PromptQuorum free β†’

