VoiceFlow
System-level voice-to-text that turns speech into polished writing in any app.
§01 · pipeline
§02 · problem
Voice input on desktop is broken. System dictation produces transcripts the writer disowns. Per-app integrations make the universal case impossible. Power users who think faster than they type have no good option.
§03 · approach
A system-level desktop app. Hold a hotkey, speak, release — polished text appears at your cursor in any application. Two-stage AI pipeline: Whisper for raw transcription, then GPToss for intelligent cleanup that removes filler words while preserving intent.
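The two-stage flow can be sketched as below. This is a minimal illustration, not the app's actual code: the `Transcribe` and `Cleanup` function types, the `PipelineResult` shape, and `runPipeline` are hypothetical names, and the real API clients for each stage are assumed to be injected by the caller.

```typescript
// Hypothetical stage signatures. The real clients for Whisper and the
// cleanup model are not shown in this write-up, so they are injected.
type Transcribe = (audio: Uint8Array) => Promise<string>;
type Cleanup = (rawTranscript: string) => Promise<string>;

interface PipelineResult {
  raw: string;      // verbatim transcript from stage 1
  polished: string; // cleaned-up text from stage 2
}

// Runs transcription first, then cleanup. Because each stage is passed in,
// either one can be swapped or tuned without touching the other.
async function runPipeline(
  audio: Uint8Array,
  transcribe: Transcribe,
  cleanup: Cleanup,
): Promise<PipelineResult> {
  const raw = (await transcribe(audio)).trim();

  // Nothing was recognized: skip the second model call entirely.
  if (raw.length === 0) {
    return { raw, polished: "" };
  }

  const polished = (await cleanup(raw)).trim();
  return { raw, polished };
}
```

Returning both `raw` and `polished` keeps the verbatim transcript available for debugging or for falling back to it if the cleanup stage fails.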
§04 · decisions
What was chosen.
What was rejected.
Electron's globalShortcut becomes unreliable when the app loses focus. Rust intercepts at the OS layer (IOKit on macOS, Win32 API on Windows), capturing 100% of hotkey presses regardless of which app is in front.
Whisper is excellent at transcription but outputs verbatim speech, fillers and all. GPToss handles contextual cleanup — it knows when 'like' is a filler vs. meaningful. Separating concerns lets each stage be tuned independently.
Programmatic insertion behaves differently in every app. The clipboard approach (copy, then simulate Cmd+V on macOS or Ctrl+V on Windows and Linux) works in any text field that accepts a paste. Original clipboard contents are saved and restored in <50ms.
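A rough sketch of the save, inject, and restore sequence follows. It makes several assumptions: `TextClipboard`, `SendPaste`, `injectViaClipboard`, and the `settleMs` delay are illustrative names and values, not taken from the project. The interface matches the shape of `readText()` and `writeText()` on Electron's `clipboard` module, so that module could be passed in, and the paste keystroke itself is left to a caller-supplied hook.

```typescript
// Minimal clipboard surface (assumed). Electron's `clipboard` module has
// readText() and writeText() methods compatible with this shape.
interface TextClipboard {
  readText(): string;
  writeText(text: string): void;
}

// Hypothetical hook that simulates the paste keystroke in the focused app.
type SendPaste = () => Promise<void>;

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function injectViaClipboard(
  clipboard: TextClipboard,
  text: string,
  sendPaste: SendPaste,
  settleMs = 30, // assumed pause so the target app reads the clipboard before restore
): Promise<void> {
  // Save the current contents. This sketch preserves plain text only;
  // images and rich formats would need separate handling.
  const saved = clipboard.readText();

  clipboard.writeText(text);
  try {
    await sendPaste();
    await sleep(settleMs);
  } finally {
    // Restore even if the paste step throws.
    clipboard.writeText(saved);
  }
}
```

The `finally` block is the important part of the design: the user's clipboard is put back whether or not the paste succeeds.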
§05 · tradeoffs
What this costs.
- t/01
Electron adds ~150–200MB of memory overhead vs. a native app. That RAM is the price of a single codebase that runs on macOS, Windows, and Linux.
- t/02
API-based Whisper adds ~500ms latency vs. a local whisper.cpp model. The latency buys consistently better accuracy on technical vocabulary and accents — non-negotiable for the writing use case.
- t/03
Clipboard injection briefly overwrites the user's clipboard. This is mitigated with an atomic save-inject-restore that completes in under 50ms, well below human perception.
§06 · impact
What this returned.
§07 · stack