ADR-006: Local-first: Processing Without the Cloud

Status

Accepted

Date

2024-03-01

Context

Existing dictation services (Google Speech-to-Text, Whisper API, Dragon) require sending audio to external servers.

Problems with cloud-based dictation:

  1. Privacy: Sensitive audio (medical, legal, personal) leaves the machine
  2. Network latency: 100-500ms additional RTT
  3. Availability: Requires internet connection
  4. Costs: Transcription APIs charge per minute
  5. Rate limits: Throttling on intensive use

User requirements:

  • Absolute privacy: No data should leave the machine
  • Offline operation: System must work without internet
  • Minimum latency: < 500ms end-to-end
  • Zero cost: No subscriptions or usage fees

Decision

Adopt a "local-first" philosophy in which all voice processing occurs on the user's device.

Implementation:

Component        Local Solution
---------------  --------------------------------------
Transcription    faster-whisper on local GPU
LLM (optional)   Ollama with local models
Audio            Processed in RAM
Storage          Only temporary files, deleted post-use
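
A minimal sketch of how these components fit together, assuming faster-whisper's `WhisperModel` API; the model size, the CPU fallback policy, and the `raw_wav_bytes` input are illustrative choices rather than the project's actual configuration:

```python
# Local-first transcription sketch: audio stays in RAM or in a short-lived
# temporary file that is deleted as soon as transcription finishes.
import tempfile

from faster_whisper import WhisperModel  # pip install faster-whisper

def load_model() -> WhisperModel:
    try:
        # Preferred path: float16 inference on a local NVIDIA GPU.
        return WhisperModel("small", device="cuda", compute_type="float16")
    except Exception:
        # Degraded fallback without a GPU (see Negative consequences below).
        return WhisperModel("small", device="cpu", compute_type="int8")

def transcribe(raw_wav_bytes: bytes, model: WhisperModel) -> str:
    # NamedTemporaryFile is removed automatically when the block exits,
    # matching the "deleted post-use" storage rule above.
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        tmp.write(raw_wav_bytes)
        tmp.flush()
        segments, _info = model.transcribe(tmp.name)
        return " ".join(s.text.strip() for s in segments)
```

At no point does audio touch the network: the model weights, the temporary file, and the inference all live on the local machine.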

Configurable exceptions:

Users can opt in to cloud services for text refinement (see the sketch after this list):

  • Google Gemini API (for LLM-based text refinement only)
  • Raw audio is never sent to the cloud
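
A minimal sketch of this opt-in boundary, assuming an Ollama server on its default local port (11434); the `refine_text` function, the `use_cloud` flag, and the Gemini stub are hypothetical names used for illustration, not the project's real API:

```python
# Opt-in cloud gate: only refined TEXT may cross it, never raw audio.
import requests  # pip install requests

def refine_with_gemini(text: str) -> str:
    # Hypothetical wrapper around the Google Gemini API; left as a stub
    # because the cloud path is strictly opt-in.
    raise NotImplementedError("enable only with explicit user consent")

def refine_text(text: str, use_cloud: bool = False) -> str:
    if use_cloud:
        # Explicit opt-in: the transcribed text (not audio) leaves the machine.
        return refine_with_gemini(text)
    # Default: fully local refinement via Ollama's REST API.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3",
              "prompt": f"Fix punctuation only:\n{text}",
              "stream": False},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```

Audio never reaches this function at all; only the already-transcribed text is eligible to cross the opt-in gate.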

Consequences

Positive

  • Guaranteed privacy: Audio never leaves the device
  • No network latency: All processing local
  • Works offline: No internet required for dictation
  • Predictable cost: Only hardware (GPU), no subscriptions
  • Compliance: Easier to meet privacy regulations (HIPAA, GDPR) because audio never leaves the device

Negative

  • ⚠️ Requires GPU: Without an NVIDIA GPU, performance degrades to slower CPU inference
  • ⚠️ Local LLMs: Lower output quality than GPT-4/Gemini Pro
  • ⚠️ Manual updates: Models don't auto-update

Alternatives Considered

Hybrid (local STT + cloud LLM default)

  • Rejected: Violates the privacy-by-default principle.

Cloud-first with local cache

  • Rejected: Adds unnecessary complexity, and audio would still have to be uploaded.

Federated Learning

  • Rejected: Over-engineering for current scope.
