What Is Moonshine Voice ASR? The Edge-First Alternative to Whisper Explained

WhisperApp Team · Published: March 10, 2026 · Reading time: 5 min

When it comes to automatic speech recognition (ASR), OpenAI's Whisper is the go-to choice. But running it on smartphones or on edge devices like the Raspberry Pi is challenging because of its model size and compute requirements.

Enter Moonshine Voice ASR — a lightweight model that achieves speeds far beyond real-time using only CPU, with less than 1/10 the parameters of Whisper.

This article explains Moonshine's features compared to Whisper and explores its Japanese language capabilities.

What Is Moonshine Voice ASR?

Moonshine Voice is an open-source speech recognition model developed by Moonshine AI (formerly Useful Sensors), co-founded by early members of the TensorFlow team.

Its defining feature is being designed from the ground up for real-time operation on edge devices. The model architecture is optimized to run fast enough on smartphone CPUs alone.

Key Features

  • Ultra-lightweight: 61.5M parameters for Base model (about 1/25 of Whisper Large-v3)
  • No GPU required: 30-60x real-time processing on CPU only
  • Open source: MIT license for English models; Moonshine Community License for non-English (commercial use OK under $1M annual revenue)
  • Multilingual: Language-specific models for 6 languages including Japanese (Flavors generation)
  • Streaming support: v2 generation achieves 320ms low-latency real-time recognition (English only as of March 2026)

Moonshine vs. Whisper

Size and Speed

Metric         Moonshine Base JA   Whisper Large-v3
Parameters     61.5M               1,550M
Model size     ~135MB              ~3GB
GPU required   No                  Practically yes
RTF (CPU)      0.016-0.026         0.5-2.0+

RTF (Real-Time Factor) is the ratio of processing time to audio duration — lower is faster. Moonshine achieves RTF 0.016-0.026 with 16 CPU threads, meaning 38-61x real-time speed. It can transcribe 72 minutes of audio in about 1.2 minutes.
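The relationship between RTF and speed-up is simple enough to verify directly. A minimal sketch, using the article's own benchmark figures (72 minutes of audio processed in about 71 seconds):

```python
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-Time Factor: processing time divided by audio duration (lower is faster)."""
    return processing_seconds / audio_seconds

def speedup(processing_seconds: float, audio_seconds: float) -> float:
    """How many times faster than real time the transcription runs."""
    return audio_seconds / processing_seconds

# 72 minutes of audio transcribed in ~71 seconds (Moonshine Base JA, 16 CPU threads)
audio_s = 72 * 60
proc_s = 71
print(f"RTF: {rtf(proc_s, audio_s):.3f}")            # RTF: 0.016
print(f"Speed-up: {speedup(proc_s, audio_s):.0f}x")  # Speed-up: 61x
```

The same two formulas reproduce every row of the benchmark tables below.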

Meanwhile, Whisper Large-v3 on CPU has RTF 0.5-2.0+, making even real-time processing difficult.

Accuracy

Whisper wins on accuracy:

  • Whisper Large-v3: 99 languages, 1,550M parameters for high accuracy
  • Moonshine Base JA: Japanese-specific, official CER 13.62% (FLEURS dataset)

Moonshine has just 1/25 the parameters of Whisper Large-v3 yet still produces practically useful results for everyday conversations and lectures. However, it doesn't match Whisper Large-v3's accuracy.

Architecture Differences

Whisper processes fixed-length (30-second) inputs, padding shorter audio with zeros — so even 5 seconds of audio requires 30 seconds worth of computation.

Moonshine's computation scales proportionally with audio length. Five seconds of audio means only five seconds of computation. This is why it's dramatically faster for short segments.
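The effect of the two designs can be illustrated with a deliberately simplified cost model (an assumption for illustration: compute is treated as proportional to the number of audio seconds the encoder actually processes):

```python
import math

def fixed_window_compute(audio_seconds: float, window: float = 30.0) -> float:
    """Whisper-style: audio is zero-padded up to the next 30 s window,
    so compute is proportional to the padded length."""
    return math.ceil(audio_seconds / window) * window

def proportional_compute(audio_seconds: float) -> float:
    """Moonshine-style: compute scales with the actual audio length."""
    return audio_seconds

clip = 5.0  # a short 5-second utterance
print(fixed_window_compute(clip))   # 30.0 -> six times the work for 5 s of speech
print(proportional_compute(clip))   # 5.0
```

Under this toy model, a 5-second clip costs a fixed-window encoder six times the compute of a length-proportional one, which is exactly the short-segment advantage described above.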

Moonshine Generations

Moonshine has three generations, each with distinct characteristics.

Original Moonshine (October 2024)

The first generation. English only, with Tiny (27.1M) and Base (61.5M) sizes. The paper demonstrated speeds far exceeding Whisper.

Flavors of Moonshine (September 2025)

Added language-specific models for 6 languages including Japanese. Same v1 architecture, but optimized per language to achieve higher accuracy than same-size multilingual models.

Moonshine Streaming (February 2026)

The latest generation. Achieves 320ms low-latency real-time recognition with an entirely new sliding-window self-attention architecture. The English Medium Streaming model (245M) matches Whisper Large-v3 (1,550M) accuracy at about 1/6 the parameters. English only as of March 2026 — Japanese models remain on the Flavors (v1) architecture.

Japanese Performance

Here are real test results with Moonshine's Japanese model (Flavors generation).

Test Setup

  • Audio: 11-minute lecture, 72-minute lecture
  • Model: Moonshine Base JA (61.5M params)
  • Hardware: Intel Core Ultra 9 285H / 32GB RAM (CPU only, no GPU)

Processing Speed

Audio              Processing time   RTF     Speed
11 min (Base JA)   ~18 sec           0.026   38x
11 min (Tiny JA)   ~11 sec           0.016   62x
72 min (Base JA)   ~71 sec           0.016   61x

Recognition Accuracy

LCS (Longest Common Subsequence) based comparison against human-corrected reference:

Test               F1 Score
11 min, Base JA    96.98%
11 min, Tiny JA    93.60%
72 min, Base JA    87.97%

The 11-minute audio reaches F1 ≈ 97%, while the 72-minute long-form audio drops to about 88%. The gap is largely explained by audio complexity: the 72-minute recording contains speaker changes, Q&A sections, and longer silence gaps.
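For readers who want to reproduce this kind of evaluation, here is a minimal sketch of an LCS-based F1 metric. An assumption for illustration: the comparison is done at the character level, with precision = LCS/|hypothesis| and recall = LCS/|reference| (the exact normalization used for the table above may differ).

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, via a rolling DP row."""
    dp = [0] * (len(b) + 1)
    for ch in a:
        prev = 0  # dp value from the previous row, one column to the left
        for j, bj in enumerate(b, start=1):
            cur = dp[j]
            dp[j] = prev + 1 if ch == bj else max(dp[j], dp[j - 1])
            prev = cur
    return dp[len(b)]

def lcs_f1(reference: str, hypothesis: str) -> float:
    """Character-level F1 score derived from the LCS of reference and hypothesis."""
    if not reference or not hypothesis:
        return 0.0
    lcs = lcs_length(reference, hypothesis)
    precision = lcs / len(hypothesis)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# One substituted character out of seven -> F1 = 6/7
print(f"{lcs_f1('こんにちは世界', 'こんにちわ世界'):.3f}")  # 0.857
```

Unlike word-error-rate, an LCS-based score works naturally for Japanese, which has no whitespace word boundaries.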

Strengths and Weaknesses

Strengths:

  • General Japanese speech (lectures, presentations, conversations)
  • Short to medium segments

Weaknesses:

  • Proper nouns (names, technical terms) — no context hint feature like Whisper's initial_prompt
  • Homophone disambiguation (character conversion without context)
  • No punctuation output (post-processing needed if required)
  • Very long speech segments (ONNX quantized model input limit of ~10 seconds)
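Because of that ~10-second input limit, long recordings are normally split into short segments before transcription, ideally at silences found by a VAD. A minimal fixed-window chunker sketch on raw samples (a hypothetical helper for illustration, not part of the Moonshine API):

```python
def chunk_samples(samples: list[float], sample_rate: int = 16000,
                  max_seconds: float = 10.0) -> list[list[float]]:
    """Split raw audio samples into consecutive chunks of at most max_seconds.
    A production pipeline would cut at VAD-detected silences rather than
    at fixed offsets, to avoid splitting words mid-utterance."""
    max_len = int(sample_rate * max_seconds)
    return [samples[i:i + max_len] for i in range(0, len(samples), max_len)]

# 25 seconds of (dummy) 16 kHz audio -> three chunks: 10 s, 10 s, 5 s
audio = [0.0] * (25 * 16000)
chunks = chunk_samples(audio)
print([len(c) / 16000 for c in chunks])  # [10.0, 10.0, 5.0]
```

Each chunk is then transcribed independently and the results are concatenated.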

Why It's Perfect for Edge Devices

Here's why Moonshine excels on smartphones and edge devices:

1. CPU-Only Performance

Achieves 30x+ real-time processing without a GPU. Smartphone CPUs provide sufficient speed.

2. Low Memory Footprint

Model size is only ~135MB (Base) / ~60MB (Tiny) — easily fits in smartphone memory.

3. Fully Offline

Transcription completes without internet connectivity. Ideal for confidential audio.

4. Battery Efficient

CPU-only processing consumes less battery than GPU-based alternatives — critical for mobile apps.

Licensing Notes

Moonshine's license varies by language:

  • English models: MIT License (fully free)
  • Non-English models (including Japanese): Moonshine Community License
    • Research and non-commercial: unlimited
    • Commercial use: OK if annual revenue under $1M (registration required)
    • Over $1M annual revenue: enterprise license required

Check Moonshine AI's official website for full license details.

WhisperApp and Moonshine

WhisperApp's desktop version uses OpenAI Whisper as its primary engine, leveraging GPU for high-accuracy transcription.

Meanwhile, the WhisperApp mobile version (Android), currently in development, uses Moonshine as its speech recognition engine. Since smartphones often lack or have limited GPU access, Moonshine's CPU-only high-speed processing makes it the optimal choice.

The strategy is to bring WhisperApp's desktop transcription expertise (speaker diarization, LLM integration, subtitle generation) to mobile while using the engine best suited for each device.

Conclusion

Aspect     Moonshine                Whisper
Speed      30-60x on CPU            GPU recommended
Accuracy   Practical (CER 13.62%)   High accuracy
Size       60-135MB                 1.5-3GB
Offline    Full support             Full support
Best for   Mobile, edge, CPU        Desktop, GPU

Moonshine isn't a complete replacement for Whisper, but it's extremely powerful for fast offline transcription on edge devices. The best approach is to use Moonshine for speed-focused, mobile, and privacy-conscious scenarios, and Whisper for accuracy-focused desktop use.

For technical details (VAD parameter tuning, pipeline construction, benchmark verification), see our detailed technical article on Zenn (Japanese).

Turn speech into text.

WhisperApp runs high-accuracy AI transcription locally on your PC. Transcribe meetings, interviews, and videos while keeping your data private.

7-day free trial — no credit card required
