Key Takeaways
- 7B models: Too weak. Catch only ~45% of real bugs, surface-level feedback only.
- 13B models: Acceptable for style/lint feedback. Miss subtle logic bugs.
- 70B+ models: Excellent for architectural review, security analysis, refactoring suggestions.
- Best 70B model: Llama 3 70B or DeepSeek 67B. Both catch ~85% of real bugs.
- Fastest small models: Mistral 7B or Llama 3 8B. Good for quick feedback, not exhaustive review.
- Setup: vLLM + FastAPI + custom prompt template for multi-file context.
- Latency: 70B takes 2–3 min per 500-line file. Batch processing multiple files in parallel reduces total time.
- Cost: Zero (open source) vs. $50/mo (GitHub Copilot Code Review).
Why Model Size Matters for Code Review
7B models lack reasoning depth. They spot obvious syntax errors but miss:
- Race conditions (concurrency bugs)
- SQL injection vulnerabilities
- Off-by-one errors in loops
- Type confusion in duck-typed languages
13B models understand basic logic but struggle with:
- Architectural anti-patterns
- Performance implications (cache misses, O(n²) algorithms)
- Security edge cases
70B+ models excel at:
- Refactoring suggestions (extract method, reduce cyclomatic complexity)
- Security analysis (injection, XSS, CSRF)
- Performance optimization (caching, indexing, parallelization)
Model Recommendations by Code Type
| Code Type | Best Model | Min Size | Reasoning |
|---|---|---|---|
| Security-sensitive code | Llama 3 70B | 70B | Slightly stronger on injection/XSS/CSRF analysis |
| Algorithm/math-heavy code | DeepSeek 67B | 67B | Slightly better on math and algorithm optimization |
| Architectural/refactoring review | Llama 3 70B or DeepSeek 67B | 70B-class | Catch ~85% of real bugs; spot anti-patterns |
| Style and lint-level issues | Any 13B | 13B | Acceptable for style feedback; misses subtle logic bugs |
| Real-time IDE feedback | Mistral 7B or Llama 3 8B | 7B | ~10 sec per 500-line file; not exhaustive |
Accuracy vs Speed Trade-offs
Speed per file: Mistral 7B ~10 sec/500 lines. Llama 3 70B ~120 sec/500 lines.
Accuracy (bugs caught): Mistral 7B ~45%. Llama 3 70B ~85%.
When to use 7B: Quick feedback during development, non-critical code paths.
When to use 70B: Pre-commit hooks, security-sensitive code, public APIs.
Optimal workflow: Use 7B for real-time feedback (IDE integration), 70B for batch review before merge.
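The two-tier workflow above can be sketched as a simple routing function. The model identifiers and the `stage`/`security_sensitive` inputs are illustrative assumptions, not fixed by this guide:

```python
# Two-tier routing: fast 7B for in-editor feedback, 70B for pre-merge review.
# Model names are assumptions; substitute whatever you serve locally.
FAST_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # ~10 sec per 500-line file
DEEP_MODEL = "meta-llama/Llama-3-70b-instruct"     # ~120 sec, catches ~85% of bugs

def choose_model(stage: str, security_sensitive: bool) -> str:
    """Pick the review model for a given workflow stage."""
    if stage == "pre-merge" or security_sensitive:
        # Thorough review: pre-commit hooks, security-sensitive code, public APIs.
        return DEEP_MODEL
    # Quick feedback during development, non-critical code paths.
    return FAST_MODEL
```

The point of the split is that the 7B tier never blocks typing, while the 70B tier only runs at merge boundaries where a 2–3 minute wait is acceptable.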
Setup: Local Code Review Pipeline
1. Start vLLM with Llama 3 70B: `python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3-70b-instruct`.
2. Write a custom prompt: "Review this code for bugs, security issues, and refactoring suggestions. Focus on [ISSUE_TYPE]."
3. Integrate with a Git hook: the `pre-commit` hook calls the API with the diff/patch.
4. Batch requests: group files by directory and send 5 files at once (vLLM processes them in parallel).
5. Parse the response: extract suggestions by severity (critical, warning, info).
6. Format the output: post results as PR comments or inline suggestions.
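Steps 1–3 can be sketched as a minimal pre-commit script. This is a hedged sketch, not a full implementation: it assumes vLLM's OpenAI-compatible server is listening on its default port 8000, and the helper names are made up for illustration:

```python
# Minimal pre-commit review script (sketch). Assumes the vLLM server from
# step 1 is running locally; API_URL and function names are illustrative.
import json
import subprocess
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"  # vLLM default port

def staged_diff() -> str:
    """Return the staged diff that the hook should review."""
    return subprocess.run(
        ["git", "diff", "--cached"], capture_output=True, text=True, check=True
    ).stdout

def build_payload(diff: str, issue_type: str = "bugs and security issues") -> dict:
    """Build an OpenAI-style chat request using the step-2 prompt template."""
    prompt = (
        "Review this code for bugs, security issues, and refactoring "
        f"suggestions. Focus on {issue_type}.\n\n{diff}"
    )
    return {
        "model": "meta-llama/Llama-3-70b-instruct",
        "messages": [
            {"role": "system", "content": "You are an expert code reviewer."},
            {"role": "user", "content": prompt},
        ],
    }

def post_review(payload: dict) -> str:
    """Send the request and return the model's review text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Wire it up by calling `post_review(build_payload(staged_diff()))` from a `pre-commit` hook and failing the commit on critical findings.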
Common Code Review Failures
- Using 7B for security review. False positives everywhere; developers ignore feedback.
- Reviewing without context. Single-function review misses architectural issues. Always pass related files.
- Not specifying issue type. "Review this code" is vague. Use "Check for SQL injection", "Suggest performance optimizations", etc.
FAQ
Can I use a 13B model for code review?
Yes, for linting-level feedback (style, obvious bugs). For security/performance review, no. 70B+ required.
How many files can I review in parallel?
vLLM default batch=32. On 70B, batch=1 per file is realistic. Process 5–10 files sequentially for full review in 10–15 min.
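The sequential loop that answer describes is trivial to write. A minimal sketch, assuming `review_fn` is whatever function calls your local model (the names here are illustrative):

```python
# Sequential full-repo review: with a 70B model, batch=1 per file is
# realistic, so just iterate. `review_fn` is a stand-in for your API call.
def review_repo(sources: dict, review_fn) -> dict:
    """Review files one at a time and return {path: review_text}.

    At ~2-3 min per 500-line file, 5-10 files lands in the 10-15 min
    range quoted above.
    """
    return {path: review_fn(code) for path, code in sources.items()}
```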
Is Llama 3 70B better than DeepSeek for code review?
Nearly identical. DeepSeek slightly better on math/algorithm optimization. Llama 3 slightly better on security. Pick either.
Can I use code review for pair programming?
Yes. Use Mistral 7B for real-time suggestions (fast). Refresh every 5 min as code changes.
What prompt should I use?
System: "You are an expert code reviewer." User: "Review for: [list issues]. Code: [code] Suggestions:"
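That prompt shape can be packaged as a small template helper; the function name is made up, but the message text matches the FAQ answer:

```python
# Build the system/user message pair from the FAQ's prompt template.
def review_messages(issues: list[str], code: str) -> list[dict]:
    """Return OpenAI-style chat messages for a focused review request."""
    issue_list = ", ".join(issues)
    return [
        {"role": "system", "content": "You are an expert code reviewer."},
        {
            "role": "user",
            "content": f"Review for: {issue_list}. Code:\n{code}\nSuggestions:",
        },
    ]
```

Listing concrete issues ("SQL injection", "O(n²) loops") in `issues` is the fix for the vague "Review this code" failure mode described earlier.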
How do I avoid hallucinated bugs?
Provide full context (imports, types, related functions). Hallucinations decrease with larger models (70B vs. 7B).
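"Provide full context" can be as simple as concatenating related files ahead of the snippet under review. A minimal sketch (the helper name and file-header format are assumptions):

```python
# Prepend related files (imports, types, helpers) so the reviewer sees
# more than the isolated function; reduces hallucinated bugs.
def with_context(snippet: str, related_files: dict) -> str:
    """Join related file contents and the snippet into one review input."""
    parts = [f"# file: {name}\n{text}" for name, text in related_files.items()]
    parts.append("# code under review\n" + snippet)
    return "\n\n".join(parts)
```

Keep the combined input within your model's context window; group by directory (as in the batching step above) so related modules travel together.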
Sources
- Llama 3 model card: accuracy on code understanding benchmarks (HuggingFace)
- DeepSeek technical report: code completion and reasoning evaluation
- Code review bug detection rates: open-source benchmark (OpenRewrite, SonarQube)