# ADR-003: Selection of faster-whisper over whisper.cpp
## Status
Accepted
## Date
2024-06-01
## Context
For local voice transcription, we needed to choose an implementation of OpenAI's Whisper model that would maximize performance on consumer NVIDIA GPUs (RTX 3060-4090).
Options evaluated:
- OpenAI Whisper (original): Reference implementation in PyTorch
- whisper.cpp: Pure C++ implementation with CUDA support
- faster-whisper: Reimplementation built on CTranslate2 (an optimized C++/CUDA inference engine)
Requirements:
- Latency < 500ms for 5 seconds of audio (i.e., at least 10x real time)
- Support for `large-v3` models and `distil` variants
- INT8/FP16 quantization to optimize VRAM
- Python API for backend integration
- Audio streaming/chunking
## Decision
Adopt faster-whisper as the main transcription engine.
Justification:
| Criteria | Whisper (PyTorch) | whisper.cpp | faster-whisper |
|---|---|---|---|
| Speed | 1x (baseline) | 4x | 4-8x |
| VRAM (large-v3) | 10GB | 6GB | 4-5GB |
| Python API | ✅ Native | ⚠️ Immature bindings | ✅ Excellent |
| Quantization | Limited | ✅ | ✅ INT8/FP16 |
| Maintenance | OpenAI | Community | Active (SYSTRAN) |
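For illustration, a minimal sketch of the intended call shape, measured against the latency requirement above. The model size, compute type, clip name, and beam size are illustrative choices, not the final configuration:

```python
import time

from faster_whisper import WhisperModel

# "int8_float16" quantizes weights to INT8 with FP16 activations,
# consistent with the ~4-5 GB VRAM figure for large-v3 in the table.
model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")

start = time.perf_counter()
# transcribe() returns a lazy generator: decoding happens while iterating,
# which is what makes streaming/chunked consumption of segments possible.
segments, info = model.transcribe("clip_5s.wav", beam_size=5)
text = "".join(segment.text for segment in segments)
elapsed = time.perf_counter() - start

# Requirement: 5 s of audio in < 500 ms, i.e. at least 10x real time.
print(f"{info.duration:.1f}s of audio in {elapsed:.2f}s "
      f"({info.duration / elapsed:.1f}x real time)")
```

Because segments are produced lazily, this same call shape covers the streaming/chunking requirement without a separate API.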
## Consequences
### Positive
- ✅ 4-8x faster than the original Whisper at the same accuracy
- ✅ ~50% less VRAM: Allows running `large-v3` on 6GB GPUs
- ✅ Pythonic API: Natural integration with FastAPI async (see the sketch after this list)
- ✅ Distil model support: `distil-large-v3` for minimum latency
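A hedged sketch of that FastAPI integration, assuming a single model loaded at startup; the endpoint path, the `distil-large-v3` choice, and the `_transcribe` helper are illustrative:

```python
import asyncio
import io

from fastapi import FastAPI, UploadFile
from faster_whisper import WhisperModel

app = FastAPI()

# Loaded once at process startup; distil-large-v3 is the low-latency
# variant mentioned above (illustrative choice).
model = WhisperModel("distil-large-v3", device="cuda", compute_type="int8_float16")


def _transcribe(audio: io.BytesIO) -> str:
    # Iterating the generator performs the actual GPU decoding.
    segments, _info = model.transcribe(audio, beam_size=5)
    return " ".join(segment.text.strip() for segment in segments)


@app.post("/transcribe")
async def transcribe(file: UploadFile) -> dict:
    data = await file.read()
    # transcribe() blocks, so run it in a worker thread to keep the
    # async event loop responsive.
    text = await asyncio.to_thread(_transcribe, io.BytesIO(data))
    return {"text": text}
```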
### Negative
- ⚠️ Additional dependency: CTranslate2 binary (~100MB)
- ⚠️ Less portable: Requires a compatible CUDA toolkit
- ⚠️ Lag on new models: New OpenAI releases must first be converted to CTranslate2 format, typically ~2 weeks before they are usable
## Alternatives Considered
### whisper.cpp
- Rejected: Immature Python bindings, more complex debugging.
### OpenAI Whisper
- Rejected: Too slow for a real-time experience without enterprise-grade hardware.
### Whisper JAX
- Rejected: Requires a TPU or a complex JAX-on-CUDA configuration.