When it comes to automatic speech recognition (ASR), OpenAI's Whisper is the go-to choice. But running it on smartphones or edge devices such as the Raspberry Pi is challenging because of its model size and compute requirements.
Enter Moonshine Voice ASR — a lightweight model that achieves speeds far beyond real-time using only CPU, with less than 1/10 the parameters of Whisper.
This article explains Moonshine's features compared to Whisper and explores its Japanese language capabilities.
What Is Moonshine Voice ASR?
Moonshine Voice is an open-source speech recognition model developed by Moonshine AI (formerly Useful Sensors), co-founded by early members of the TensorFlow team.
Its defining feature is being designed from the ground up for real-time operation on edge devices. The model architecture is optimized to run fast enough on smartphone CPUs alone.
Key Features
- Ultra-lightweight: 61.5M parameters for Base model (about 1/25 of Whisper Large-v3)
- No GPU required: 30-60x real-time processing on CPU only
- Open source: MIT license for English models; Moonshine Community License for non-English (commercial use OK under $1M annual revenue)
- Multilingual: Language-specific models for 6 languages including Japanese (Flavors generation)
- Streaming support: the Streaming generation achieves 320ms low-latency real-time recognition (English only as of March 2026)
Moonshine vs. Whisper
Size and Speed
| Metric | Moonshine Base JA | Whisper Large-v3 |
|---|---|---|
| Parameters | 61.5M | 1,550M |
| Model size | ~135MB | ~3GB |
| GPU required | No | Practically yes |
| RTF (CPU) | 0.016-0.026 | 0.5-2.0+ |
RTF (Real-Time Factor) is the ratio of processing time to audio duration — lower is faster. Moonshine achieves RTF 0.016-0.026 with 16 CPU threads, meaning roughly 38-62x real-time speed. It can transcribe 72 minutes of audio in about 1.2 minutes.
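As a quick sanity check, the RTF and speed-up figures above follow from simple arithmetic (the durations below are the Base JA benchmark numbers from this article):

```python
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-Time Factor: processing time divided by audio duration (lower is faster)."""
    return processing_seconds / audio_seconds

def speedup(processing_seconds: float, audio_seconds: float) -> float:
    """How many times faster than real time the transcription runs (1 / RTF)."""
    return audio_seconds / processing_seconds

# 72 minutes of audio transcribed in ~71 seconds (Base JA benchmark)
audio = 72 * 60       # 4,320 s
processing = 71       # s
print(round(rtf(processing, audio), 3))   # 0.016
print(round(speedup(processing, audio)))  # 61 (i.e., ~61x real time)
```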
Meanwhile, Whisper Large-v3 on CPU has RTF 0.5-2.0+, making even real-time processing difficult.
Accuracy
Whisper wins on accuracy:
- Whisper Large-v3: 99 languages, 1,550M parameters for high accuracy
- Moonshine Base JA: Japanese-specific, official CER 13.62% (FLEURS dataset)
Moonshine has just 1/25 the parameters of Whisper Large-v3 yet still produces practically useful results for everyday conversations and lectures. However, it doesn't match Whisper Large-v3's accuracy.
Architecture Differences
Whisper processes fixed-length (30-second) inputs, padding shorter audio with zeros — so even 5 seconds of audio requires 30 seconds' worth of computation.
Moonshine's computation scales proportionally with audio length. Five seconds of audio means only five seconds of computation. This is why it's dramatically faster for short segments.
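This difference can be illustrated with a toy cost model (an illustration of the padding behavior described above, not the models' actual FLOP counts):

```python
import math

WHISPER_WINDOW = 30.0  # Whisper encodes fixed 30 s windows, zero-padding shorter audio

def whisper_encoder_seconds(audio_seconds: float) -> float:
    """Whisper-style cost: every chunk is padded up to a full 30 s window."""
    chunks = max(1, math.ceil(audio_seconds / WHISPER_WINDOW))
    return chunks * WHISPER_WINDOW

def moonshine_encoder_seconds(audio_seconds: float) -> float:
    """Moonshine-style cost: proportional to the actual audio length."""
    return audio_seconds

# A 5-second clip: Whisper pays for 30 s of encoder work, Moonshine for 5 s.
print(whisper_encoder_seconds(5.0))    # 30.0
print(moonshine_encoder_seconds(5.0))  # 5.0
```

The 6x gap on this 5-second example is exactly why Moonshine pulls ahead on short utterances, while the advantage narrows as clips approach the 30-second window.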

Moonshine Generations
Moonshine has three generations, each with distinct characteristics.
Original Moonshine (October 2024)
The first generation. English only, with Tiny (27.1M) and Base (61.5M) sizes. The paper demonstrated speeds far exceeding Whisper.
Flavors of Moonshine (September 2025)
Added language-specific models for 6 languages including Japanese. Same v1 architecture, but optimized per language to achieve higher accuracy than same-size multilingual models.
Moonshine Streaming (February 2026)
The latest generation. Achieves 320ms low-latency real-time recognition with an entirely new sliding-window self-attention architecture. The English Medium Streaming model (245M) matches Whisper Large-v3 (1,550M) accuracy at about 1/6 the parameters. English only as of March 2026 — Japanese models remain on the Flavors (v1) architecture.
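The sliding-window idea itself is easy to sketch. The snippet below builds a generic attention mask in which each frame attends only to the last few frames — this is what bounds latency and memory in streaming models. It is a textbook illustration, not Moonshine's actual implementation, and the window size is made up:

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True where frame i may attend to frame j:
    only the last `window` frames, and never future frames (causal)."""
    return [[(i - window < j <= i) for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_mask(seq_len=6, window=3)
# Frame 5 can see frames 3, 4, 5 — not frames 0-2, and nothing in the future.
print([j for j in range(6) if mask[5][j]])  # [3, 4, 5]
```

Because each frame's attention cost is constant rather than growing with the audio length, latency stays flat no matter how long the stream runs.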
Japanese Performance
Here are real test results with Moonshine's Japanese model (Flavors generation).
Test Setup
- Audio: 11-minute lecture, 72-minute lecture
- Model: Moonshine Base JA (61.5M params)
- Hardware: Intel Core Ultra 9 285H / 32GB RAM (CPU only, no GPU)
Processing Speed
| Audio | Processing time | RTF | Speed |
|---|---|---|---|
| 11 min (Base JA) | ~18 sec | 0.026 | 38x |
| 11 min (Tiny JA) | ~11 sec | 0.016 | 62x |
| 72 min (Base JA) | ~71 sec | 0.016 | 61x |
Recognition Accuracy
LCS (Longest Common Subsequence)-based comparison against a human-corrected reference:
| Test | F1 Score |
|---|---|
| 11 min Base JA | 96.98% |
| 11 min Tiny JA | 93.60% |
| 72 min Base JA | 87.97% |
The 11-minute audio reaches F1 97%, while the 72-minute long-form audio drops to 88%. The gap comes from the audio itself: the 72-minute recording contains speaker changes, Q&A sections, and longer silence gaps.
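For reference, an LCS-based F1 of this kind can be computed as follows: precision is the LCS length over the hypothesis length, recall is the LCS length over the reference length (character-level, which suits Japanese text with no word boundaries). A minimal sketch:

```python
def lcs_len(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings (O(len(a)*len(b)) DP)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcs_f1(hypothesis: str, reference: str) -> float:
    """Character-level F1 from LCS: harmonic mean of precision and recall."""
    m = lcs_len(hypothesis, reference)
    if m == 0:
        return 0.0
    p, r = m / len(hypothesis), m / len(reference)
    return 2 * p * r / (p + r)

print(lcs_f1("abcd", "abcd"))  # 1.0 (identical strings)
```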
Strengths and Weaknesses
Strengths:
- General Japanese speech (lectures, presentations, conversations)
- Short to medium segments
Weaknesses:
- Proper nouns (names, technical terms) — no context hint feature like Whisper's initial_prompt
- Homophone disambiguation (character conversion without context)
- No punctuation output (post-processing needed if required)
- Very long speech segments (ONNX quantized model input limit of ~10 seconds)
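To work around the ~10-second input limit, long recordings have to be split at silences before transcription. Below is a minimal energy-based splitter as a stand-in for a proper VAD; the frame size and silence threshold are arbitrary illustrative values:

```python
def split_on_silence(samples: list[float], sample_rate: int,
                     max_chunk_s: float = 10.0,
                     frame_s: float = 0.02,
                     silence_threshold: float = 0.01) -> list[tuple[int, int]]:
    """Return (start, end) sample indices of chunks no longer than max_chunk_s,
    preferring to cut at low-energy (silent) frames."""
    frame = int(sample_rate * frame_s)
    max_samples = int(sample_rate * max_chunk_s)
    chunks, start = [], 0
    while start < len(samples):
        end = min(start + max_samples, len(samples))
        if end < len(samples):
            # Scan backwards from the hard limit for a quiet frame to cut at.
            for cut in range(end - frame, start + frame, -frame):
                window = samples[cut:cut + frame]
                energy = sum(x * x for x in window) / frame
                if energy < silence_threshold:
                    end = cut
                    break
        chunks.append((start, end))
        start = end
    return chunks

# 25 s of synthetic 16 kHz audio: loud everywhere except a silence at 8-9 s.
sr = 16000
audio = [0.5] * (25 * sr)
audio[8 * sr:9 * sr] = [0.0] * sr
# The first cut lands inside the 8-9 s silence rather than at the hard 10 s mark.
print([(s / sr, e / sr) for s, e in split_on_silence(audio, sr)])
```

If no silence is found within a chunk, the splitter falls back to a hard cut at the limit — in a real pipeline that is where a trained VAD earns its keep.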

Why It's Perfect for Edge Devices
Here's why Moonshine excels on smartphones and edge devices:
1. CPU-Only Performance
Achieves 30x+ real-time processing without a GPU. Smartphone CPUs provide sufficient speed.
2. Low Memory Footprint
Model size is only ~135MB (Base) / ~60MB (Tiny) — easily fits in smartphone memory.
3. Fully Offline
Transcription completes without internet connectivity. Ideal for confidential audio.
4. Battery Efficient
CPU-only processing consumes less battery than GPU-based alternatives — critical for mobile apps.
Licensing Notes
Moonshine's license varies by language:
- English models: MIT License (fully free)
- Non-English models (including Japanese): Moonshine Community License
  - Research and non-commercial use: unlimited
  - Commercial use: OK if annual revenue is under $1M (registration required)
  - Over $1M annual revenue: enterprise license required
Check Moonshine AI's official website for full license details.
WhisperApp and Moonshine
WhisperApp's desktop version uses OpenAI Whisper as its primary engine, leveraging GPU for high-accuracy transcription.
Meanwhile, the WhisperApp mobile version (Android), currently in development, uses Moonshine as its speech recognition engine. Since smartphones often lack or have limited GPU access, Moonshine's CPU-only high-speed processing makes it the optimal choice.
The strategy is to bring WhisperApp's desktop transcription expertise (speaker diarization, LLM integration, subtitle generation) to mobile while using the engine best suited for each device.
Conclusion
| Aspect | Moonshine | Whisper |
|---|---|---|
| Speed | 30-60x on CPU | GPU recommended |
| Accuracy | Practical (CER 13.62%) | High accuracy |
| Size | 60-135MB | 1.5-3GB |
| Offline | Full support | Full support |
| Best for | Mobile, edge, CPU | Desktop, GPU |
Moonshine isn't a complete replacement for Whisper, but it's extremely powerful for fast offline transcription on edge devices. The best approach is to use Moonshine for speed-focused, mobile, and privacy-conscious scenarios, and Whisper for accuracy-focused desktop use.
For technical details (VAD parameter tuning, pipeline construction, benchmark verification), see our detailed technical article on Zenn (Japanese).