Full benchmark table: Whisper performance on Apple Silicon (M1–M5)

| Chip | Tiny | Base | Small | Medium | Large-v3 |
|---|---|---|---|---|---|
| – | 32× | 20× | 12× | 5× | – |
| – | 38× | 24× | 16× | 7× | – |
| – | 45× | 30× | 22× | 10× | – |
| – | 55× | 38× | 28× | 14× | – |
| – | 36× | 23× | 14× | 6× | – |
| – | 42× | 28× | 20× | 9× | – |
| – | 50× | 35× | 26× | 12× | – |
| – | 60× | 42× | 32× | 17× | – |
| – | 40× | 26× | 16× | 7× | – |
| – | 46× | 32× | 22× | 10× | – |
| – | 55× | 40× | 30× | 14× | – |
| – | 44× | 30× | 18× | 8× | – |
| – | 50× | 36× | 26× | 12× | – |
| – | 60× | 44× | 34× | 16× | – |
| – | 48× | 34× | 22× | 10× | – |
| – | 55× | 40× | 30× | 14× | – |
| – | 65× | 48× | 38× | 18× | – |

N× real time means that N seconds of audio are transcribed in 1 second. Benchmarks use whisper.cpp with Metal acceleration. Every chip from the M1 Pro upward can run large-v3 faster than real time.
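
To make the N× figures concrete, the following minimal sketch converts between a real-time factor and wall-clock processing time. The function names are illustrative and are not part of whisper.cpp.

```python
def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Return N such that the audio was transcribed at N x real time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def processing_time(audio_seconds: float, factor: float) -> float:
    """Estimate the wall-clock seconds needed at a given real-time factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return audio_seconds / factor


# A one-hour recording at 12x real time takes five minutes to transcribe.
print(processing_time(3600, 12))  # 300.0
```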

Whisper model sizes: which one should you choose?

| Model | Parameters | Disk size | RAM usage | English WER | Best for |
|---|---|---|---|---|---|
| – | – | – | – | – | – |
| – | – | – | – | – | – |
| – | – | – | – | – | – |
| – | – | – | – | – | – |
| – | – | – | – | – | – |
| – | – | – | – | – | – |
| – | – | – | – | – | – |

WER (word error rate) is measured on the English LibriSpeech test set. Large-v3-turbo and distil-large-v3 offer the best balance for real-time processing on most Macs: close to large-v3 quality at 4–6× the speed.
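
For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration; published benchmarks usually also normalize punctuation and number formatting first, which this version does not attempt.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```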

Metal vs Core ML vs Apple Neural Engine: which backend should you choose?

Apple Silicon offers three acceleration paths for Whisper. Each has trade-offs.

Metal (via whisper.cpp), recommended: uses Apple's Metal GPU framework, works on every M-series chip, runs large-v3 at 10–12× real time on an M5 Pro, and is enabled with `make WHISPER_METAL=1`. Best for: most users who want the simplest setup and proven performance.

Core ML (via Apple Core ML), advanced: uses Apple's machine learning framework, can offload some operations to the Neural Engine (ANE), and is 15–20% faster on some workloads, but requires model conversion (10–15 minutes of setup). Best for: advanced users who want maximum speed.

Apple Neural Engine (ANE), limited use: the dedicated AI accelerator in every M-series chip. It cannot be accessed directly (it must go through Core ML), and because of an architecture mismatch Whisper cannot fully use it. It is most effective with small models (tiny, base). Best for: Whisper tiny/base on battery-powered laptops.

How to choose:

- First-time setup → Metal (whisper.cpp)
- Maximum large-v3 speed → Metal (whisper.cpp)
- Battery-powered laptop, base model → Core ML with ANE
- Production server → Metal (proven, stable)
- Real-time transcription → Metal in streaming mode
- Cloud deployment on Mac instances → Metal (can be containerized)

In short:

- Metal (whisper.cpp): faster, broadly compatible, simplest setup
- Core ML: Neural Engine optimization, 15–20% faster on some workloads (conversion required)
- Apple Neural Engine: limited benefit for large models, best for tiny/base on laptops

Setup: whisper.cpp with Metal acceleration

1. Install dependencies

   ```bash
   xcode-select --install   # Xcode command-line tools
   brew install ffmpeg      # audio conversion
   ```

2. Clone and build whisper.cpp with Metal

   ```bash
   git clone https://github.com/ggerganov/whisper.cpp
   cd whisper.cpp
   make WHISPER_METAL=1
   ./main -h | grep -i metal
   ```

3. Download models

   ```bash
   bash ./models/download-ggml-model.sh small            # 466 MB, real-time
   bash ./models/download-ggml-model.sh large-v3         # 3 GB, best quality
   bash ./models/download-ggml-model.sh large-v3-turbo   # 1.6 GB, balanced
   ```

4. Transcribe an audio file

   ```bash
   ./main -m models/ggml-large-v3.bin -f /path/to/audio.wav
   ./main -m models/ggml-large-v3.bin -f audio.wav -oj      # JSON output
   ./main -m models/ggml-large-v3.bin -f audio.wav -l en    # specify the language
   ```

5. Convert non-WAV audio first

   ```bash
   ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
   ./main -m models/ggml-large-v3.bin -f output.wav
   ```
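
The conversion and transcription steps above can be chained from a script. The following is a minimal sketch that only reuses the commands and flags shown in the steps; the helper names are hypothetical, the added `-y` flag (overwrite without prompting) is my assumption to keep ffmpeg non-interactive, and `transcribe` assumes it is run from the whisper.cpp directory.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def ffmpeg_command(src: str, dst: str) -> List[str]:
    """Build the ffmpeg call that produces 16 kHz mono 16-bit PCM WAV."""
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1",
            "-c:a", "pcm_s16le", dst]


def whisper_command(wav: str,
                    model: str = "models/ggml-large-v3.bin",
                    language: Optional[str] = None,
                    json_output: bool = False) -> List[str]:
    """Build the whisper.cpp call using the flags from step 4."""
    cmd = ["./main", "-m", model, "-f", wav]
    if language:
        cmd += ["-l", language]
    if json_output:
        cmd.append("-oj")
    return cmd


def transcribe(src: str, **options) -> None:
    """Convert non-WAV input first, then run whisper.cpp on the result."""
    wav = src
    if Path(src).suffix.lower() != ".wav":
        wav = str(Path(src).with_suffix(".wav"))
        subprocess.run(ffmpeg_command(src, wav), check=True)
    subprocess.run(whisper_command(wav, **options), check=True)


print(whisper_command("output.wav", language="en", json_output=True))
```

Keeping command construction separate from execution makes the arguments easy to inspect or log before anything runs. Note that an existing WAV file is passed through unchanged, so it still needs to be 16 kHz mono.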

Real-time streaming transcription (live microphone)

Transcribe from the microphone in real time, for voice assistants, meeting transcription, and accessibility tools.

Option 1: whisper.cpp stream mode

```bash
./stream -m models/ggml-small.bin --step 500 --length 5000
# --step 500: process every 500 ms
# --length 5000: keep the most recent 5 s of context
```
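
As a rough mental model of how `--step` and `--length` interact, the sketch below lists the span of audio considered at each processing tick. It is a simplification for illustration, not whisper.cpp's actual buffering logic, and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(total_ms: int, step_ms: int = 500, length_ms: int = 5000):
    """Yield (start_ms, end_ms) of the audio considered at each tick."""
    for end in range(step_ms, total_ms + 1, step_ms):
        start = max(0, end - length_ms)  # never look back more than length_ms
        yield start, end


windows = list(sliding_windows(total_ms=6000))
print(windows[0])   # (0, 500): the first tick only has 500 ms of audio
print(windows[-1])  # (1000, 6000): later ticks cover the most recent 5 s
```

A smaller step lowers the delay before text appears but re-processes overlapping audio more often; a longer length gives the model more context per pass at a higher compute cost.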

Option 2: Python with faster-whisper (see the code block below)

Latency on an M5 Pro: about 200 ms for the small model, about 400–600 ms for large-v3-turbo, and about 800 ms–1.2 s for large-v3.
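
```python
import queue

import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel

# faster-whisper runs on the CPU here; int8 keeps memory use low.
model = WhisperModel("large-v3-turbo", device="cpu", compute_type="int8")

chunk_duration = 3   # seconds of audio per transcription pass
sample_rate = 16000  # Whisper expects 16 kHz mono
audio_queue = queue.Queue()


def callback(indata, frames, time, status):
    # Runs on the audio thread: hand the block off and return quickly.
    if status:
        print(status)
    audio_queue.put(indata.copy())


with sd.InputStream(callback=callback, channels=1,
                    samplerate=sample_rate, dtype="float32"):
    print("Listening... (Ctrl+C to stop)")
    buffer, buffered = [], 0
    try:
        while True:
            block = audio_queue.get()
            buffer.append(block)
            buffered += len(block)  # count real frames, whatever the block size
            if buffered >= chunk_duration * sample_rate:
                audio = np.concatenate(buffer).flatten()
                segments, _ = model.transcribe(audio, beam_size=5)
                for segment in segments:
                    print(segment.text)
                buffer, buffered = [], 0
    except KeyboardInterrupt:
        print("Stopped.")
```

Voice assistant pipeline: Whisper + Ollama + Piper TTS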

The following is the full code for a voice assistant that runs entirely locally on Apple Silicon.
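
```python
import subprocess
import tempfile

import numpy as np
import requests
import sounddevice as sd
from faster_whisper import WhisperModel

WHISPER_MODEL = "large-v3-turbo"
OLLAMA_URL = "http://localhost:11434/api/chat"
LLM_MODEL = "llama3.1:8b"
SAMPLE_RATE = 16000  # Whisper expects 16 kHz mono

whisper = WhisperModel(WHISPER_MODEL, device="cpu", compute_type="int8")


def record_audio(duration=5):
    """Record `duration` seconds from the default microphone."""
    print("Listening...")
    audio = sd.rec(int(duration * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE,
                   channels=1,
                   dtype=np.float32)
    sd.wait()  # block until the recording is complete
    return audio.flatten()


def transcribe(audio):
    """Return the recognized text for one recording."""
    segments, _ = whisper.transcribe(audio, beam_size=5)
    return " ".join(seg.text.strip() for seg in segments)


def llm_respond(user_text):
    """Send one user turn to the local Ollama server and return the reply."""
    response = requests.post(OLLAMA_URL, json={
        "model": LLM_MODEL,
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,
    }, timeout=120)
    response.raise_for_status()
    return response.json()["message"]["content"]


def speak(text):
    """Synthesize speech with Piper into a WAV file, then play it with afplay."""
    with tempfile.NamedTemporaryFile(suffix=".wav") as wav:
        subprocess.run(
            ["piper", "--model", "en_US-amy-medium.onnx",
             "--output_file", wav.name],
            input=text.encode(),
            check=True,
        )
        subprocess.run(["afplay", wav.name], check=True)  # macOS audio player


while True:
    audio = record_audio(duration=5)
    user_text = transcribe(audio)
    print(f"You: {user_text}")
    if not user_text.strip():
        continue  # nothing recognized, listen again
    response = llm_respond(user_text)
    print(f"AI: {response}")
    speak(response)
```

Best Whisper configuration by Mac model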

| Mac configuration | Recommended model | Real-time factor | Use case |
|---|---|---|---|
| – | – | – | – |
| – | – | – | – |
| – | – | – | – |
| – | – | – | – |
| – | – | – | – |
| – | – | – | – |
| – | – | – | – |

For real-time voice assistants, use small or large-v3-turbo for the lowest latency. For meeting or podcast transcription, use large-v3 for the highest accuracy (a 1–2 second delay is acceptable).
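
This rule of thumb can be written as a lookup against a latency budget. The sketch below uses the upper end of the M5 Pro latency figures quoted earlier in this article; `pick_model` is a hypothetical helper, and the numbers should be replaced with measurements from your own machine.

```python
# Worst-case latencies in ms on an M5 Pro, from the figures quoted earlier,
# ordered from most to least accurate.
LATENCY_MS = [
    ("large-v3", 1200),
    ("large-v3-turbo", 600),
    ("small", 200),
]


def pick_model(latency_budget_ms: int) -> str:
    """Return the most accurate model whose worst-case latency fits the budget."""
    for name, worst_case_ms in LATENCY_MS:
        if worst_case_ms <= latency_budget_ms:
            return name
    raise ValueError(f"no listed model fits a {latency_budget_ms} ms budget")


print(pick_model(700))   # large-v3-turbo
print(pick_model(2000))  # large-v3
```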

Local Whisper vs cloud speech recognition services

| Metric | Local Whisper (M5 Pro) | Google Speech-to-Text | OpenAI Whisper API | AssemblyAI |
|---|---|---|---|---|
| – | – | – | – | – |
| – | – | – | – | – |
| – | – | – | – | – |
| – | – | – | – | – |
| – | – | – | – | – |
| – | – | – | – | – |
| – | – | – | – | – |
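
Monthly cost (8 hours per day): local Whisper $0, Google $345, OpenAI $86, AssemblyAI $156. For privacy-sensitive work (medical, legal, journalism), local Whisper is the only one of these options that keeps audio on your own machine. For high-volume transcription (more than $100 per month in cloud fees), a local Mac pays for itself within 12 months.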
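
The monthly totals can be reproduced with simple arithmetic. In the sketch below, the hourly rates are assumptions that are not stated in the text; they were chosen because, with a 30-day month, they reproduce the totals above. The $1,200 hardware price is likewise a hypothetical figure, implied by the 12-month payback at $100 per month.

```python
HOURS_PER_DAY = 8
DAYS_PER_MONTH = 30  # assumption: the totals match a 30-day month

# Assumed list prices in USD per audio hour (not given in the text).
RATE_PER_HOUR = {
    "Google Speech-to-Text": 1.44,  # $0.024 per minute
    "OpenAI Whisper API": 0.36,     # $0.006 per minute
    "AssemblyAI": 0.65,
}


def monthly_cost(rate_per_hour: float) -> float:
    """Cloud cost for transcribing HOURS_PER_DAY of audio every day."""
    return rate_per_hour * HOURS_PER_DAY * DAYS_PER_MONTH


def breakeven_months(hardware_cost: float, cloud_monthly: float) -> float:
    """Months until local hardware costs less than the cloud bill."""
    return hardware_cost / cloud_monthly


for service, rate in RATE_PER_HOUR.items():
    print(f"{service}: ${monthly_cost(rate):,.2f} per month")

# Hypothetical $1,200 Mac against a $100 monthly cloud bill.
print(breakeven_months(1200, 100))  # 12.0
```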

Is Whisper faster than cloud APIs?

Running locally on an M5 Pro: 10× real time (about 100 ms latency). Cloud APIs: 100–500 ms of added network latency. Local is faster and free.

Can Whisper handle multiple speakers?

Yes. It transcribes multi-speaker audio and timestamps each segment, but it does not label who is speaking. To identify speakers, use post-processing or a speaker diarization tool.
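
One common post-processing approach is to match each timestamped transcript segment to the diarization turn it overlaps most. The sketch below assumes you already have both lists; the `Segment` and `Turn` types and the sample data are hypothetical and are not the output format of any particular diarization tool.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    start: float  # seconds
    end: float
    text: str


class Turn(NamedTuple):
    start: float  # seconds
    end: float
    speaker: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_speakers(segments: List[Segment], turns: List[Turn]) -> List[Tuple[str, str]]:
    """Attach to each segment the speaker whose turn overlaps it the most."""
    labeled = []
    for seg in segments:
        best = max(turns,
                   key=lambda t: overlap(seg.start, seg.end, t.start, t.end),
                   default=None)
        if best is None or overlap(seg.start, seg.end, best.start, best.end) == 0:
            labeled.append(("unknown", seg.text))
        else:
            labeled.append((best.speaker, seg.text))
    return labeled


segments = [Segment(0.0, 2.5, "Hello, can you hear me?"),
            Segment(2.7, 4.0, "Yes, loud and clear.")]
turns = [Turn(0.0, 2.6, "SPEAKER_00"), Turn(2.6, 4.2, "SPEAKER_01")]

for speaker, text in label_speakers(segments, turns):
    print(f"{speaker}: {text}")
```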

What about language support?

Whisper supports 99 languages, with automatic language detection. Accuracy varies by language: about 2.5% WER for English and 5–15% WER for other languages.

Which Whisper model has the best quality-to-speed ratio?

Large-v3-turbo or distil-large-v3. Both reach about 95% of large-v3 accuracy at 4–6× the speed, and they are recommended for most real-time use cases.

Can Whisper handle heavily accented English or non-native speakers?

Yes, although WER rises. Native English speakers: about 2.5%. Strong accents or non-native speakers: 5–12%. Large-v3 handles accents better than the smaller models.

Is Whisper suitable for transcribing podcasts and music?

Podcasts: yes, it excels at spoken content. Music with lyrics: not suitable, because Whisper was trained on speech. Use a specialized model for music.

How accurate is Whisper on technical terminology?

It varies. Common technical terms: good. Highly specialized terms: may be mistranscribed. To improve accuracy, pass the expected vocabulary with the `--prompt` flag.
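
As an illustration of the `--prompt` approach, the sketch below joins a list of expected terms into a short prompt and builds the corresponding whisper.cpp command. The helper names, the audio file name, and the example vocabulary are hypothetical; the comma-separated format is one reasonable convention rather than a requirement, and the prompt should be kept short.

```python
from typing import List


def vocabulary_prompt(terms: List[str]) -> str:
    """Join expected terms into a comma-separated prompt, dropping duplicates."""
    cleaned = (term.strip() for term in terms if term.strip())
    return ", ".join(dict.fromkeys(cleaned))  # dict keeps first-seen order


def whisper_command_with_prompt(wav: str, terms: List[str],
                                model: str = "models/ggml-large-v3.bin") -> List[str]:
    """Build a whisper.cpp call that passes the vocabulary via --prompt."""
    return ["./main", "-m", model, "-f", wav,
            "--prompt", vocabulary_prompt(terms)]


cmd = whisper_command_with_prompt(
    "standup.wav", ["Kubernetes", "gRPC", "Kubernetes", "etcd"])
print(cmd[-1])  # Kubernetes, gRPC, etcd
```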

Can I run multiple Whisper instances on one Mac?

Yes, limited by memory. M5 Pro with 36 GB: two large-v3 instances can run concurrently. M5 Max with 128 GB: 4–6 instances, or one instance alongside an LLM and TTS.