Mellon runs a local HTTP server on your Mac that exposes Whisper speech-to-text through a simple API. Whether you're building an AI agent, automating transcription workflows, or integrating voice into your app — everything runs on-device with no API keys, no cloud, and no per-minute billing.
Available in Mellon v1.4.0+. Streaming endpoints available in v1.5.0+.
Getting Started
1. Enable the API Server
Open Mellon's settings and go to API Server. Toggle the server on. It starts on http://localhost:8765 by default.
Verify it's running:
curl http://localhost:8765/health
# {"status": "ok", "model_loaded": true}
2. Send Audio
Pick the endpoint that fits your use case — see the full reference below. The simplest way to test:
curl -X POST http://localhost:8765/v1/audio/transcriptions \
-F "file=@recording.wav" \
-F "model=whisper-1"
# {"text": "Hello, this is a test recording."}
Batch Endpoints
These endpoints accept a complete audio file and return the transcription in one response. Best for short-to-medium recordings (up to a few minutes).
POST /v1/audio/transcriptions
Recommended for most integrations. OpenAI-compatible (multipart/form-data). Runs the full pipeline: Whisper + spellcheck + custom dictionary corrections.
This is a drop-in replacement for https://api.openai.com/v1/audio/transcriptions — tools like OpenClaw can point to Mellon with zero code changes.
curl -X POST http://localhost:8765/v1/audio/transcriptions \
-F "file=@recording.wav" \
-F "model=whisper-1"
# {"text": "I updated ChronoCat and opened Mellon."}
OpenClaw config example:
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "models": [{
          "provider": "openai",
          "model": "whisper-1",
          "baseUrl": "http://127.0.0.1:8765/v1"
        }]
      }
    }
  }
}
POST /transcribe-full
Raw audio body. Same full pipeline as above, but returns detailed correction data and timing information. Useful for debugging or when you need to see what Whisper originally produced vs. what was corrected.
curl -X POST http://localhost:8765/transcribe-full \
--data-binary @recording.wav \
-H "Content-Type: application/octet-stream"
# {
# "success": true,
# "text": "I updated ChronoCat and opened Mellon.",
# "whisper_text": "I updated chrono cat and opened melon.",
# "corrections": [
# {"original": "chrono cat", "corrected": "ChronoCat", "source": "custom"}
# ],
# "timing": {"whisper_ms": 1024, "spellcheck_ms": 2, "total_ms": 1026}
# }
POST /transcribe
Raw audio body. Whisper only — no spellcheck or dictionary corrections. Fastest option when you just need raw transcription.
curl -X POST http://localhost:8765/transcribe \
--data-binary @recording.wav \
-H "Content-Type: application/octet-stream"
# {"success": true, "text": "I updated chrono cat.", "duration_ms": 1025}
Streaming Endpoints
For long-form audio or real-time recording, streaming endpoints let you feed audio in chunks while transcription happens in the background. Instead of waiting for the entire recording to finish, Mellon processes audio as it arrives — using Voice Activity Detection (VAD) to find natural silence boundaries and transcribe completed chunks independently.
This means you can transcribe recordings of any length without hitting memory limits, and receive progress updates while the session is still running.
How It Works
- Start a session — you get back a session_id
- Feed raw PCM audio chunks as they're recorded (16kHz, mono)
- End the session — Mellon transcribes any remaining audio and returns the full text
Behind the scenes, Mellon accumulates samples, runs VAD every ~5 seconds of new audio, and splits at silence boundaries (minimum 30s chunks, hard cap at 2 minutes). Each chunk is transcribed independently, so text accumulates as you record.
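The splitting policy above can be sketched as a small pure function. The thresholds (30-second minimum, 2-minute cap) come straight from the description; the silence offsets are assumed to come from a VAD pass, which is internal to Mellon and only stubbed out here.

```python
SAMPLE_RATE = 16000
MIN_CHUNK_S = 30    # don't split before 30s of audio has accumulated
MAX_CHUNK_S = 120   # hard cap: force a split at 2 minutes

def find_split(buffer_len_samples, silence_starts):
    """Pick a split point (in samples) for the accumulated buffer.

    silence_starts: sample offsets where a VAD pass found silence.
    Returns None if no split should happen yet.
    """
    min_samples = MIN_CHUNK_S * SAMPLE_RATE
    max_samples = MAX_CHUNK_S * SAMPLE_RATE
    if buffer_len_samples >= max_samples:
        # Hard cap reached: split at the cap even without a silence boundary.
        return max_samples
    # Otherwise, split at the first silence boundary past the minimum.
    for s in silence_starts:
        if min_samples <= s <= buffer_len_samples:
            return s
    return None
```

Each chunk produced this way is transcribed independently, which is why text accumulates as the session runs.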
Concurrent Sessions
The streaming API handles multiple transcription sessions simultaneously. Each session independently buffers and chunks its audio, so you can run several recordings in parallel without interference. When two or more sessions produce a chunk at the same time, Whisper model access is serialized — chunks are queued and processed one at a time, ensuring stable performance without contention. This makes the API well-suited for multi-user or multi-source setups where several audio streams need transcription at once.
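Mellon's internal queueing isn't public, but the contract described above can be sketched with a lock around a (stand-in) model call: sessions collect results independently, while only one chunk touches the model at a time.

```python
import threading

class SerializedTranscriber:
    """Many sessions may submit chunks; the model runs one chunk at a time."""

    def __init__(self, transcribe_fn):
        self._transcribe = transcribe_fn   # stand-in for the Whisper call
        self._model_lock = threading.Lock()
        self.results = {}                  # session_id -> list of chunk texts
        self._results_lock = threading.Lock()

    def submit(self, session_id, chunk):
        with self._model_lock:             # model access is serialized
            text = self._transcribe(chunk)
        with self._results_lock:           # per-session results stay independent
            self.results.setdefault(session_id, []).append(text)
```

This is an illustration of the behavior, not Mellon's implementation: the point is that callers never need to coordinate with each other.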
POST /v1/audio/transcriptions/stream/start
Start a new streaming session. Optionally specify a language.
curl -X POST http://localhost:8765/v1/audio/transcriptions/stream/start \
-H "Content-Type: application/json" \
-d '{"language": "en"}'
# {"session_id": "A1B2C3D4-...", "success": true}
Sessions automatically expire after 10 minutes of inactivity.
POST /v1/audio/transcriptions/stream/feed
Feed a chunk of raw PCM audio data. Send 16-bit signed integer PCM by default (16kHz, mono). For float32 samples, include the X-Audio-Format: float32 header.
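If your capture pipeline produces float32 samples, you can either send them with the X-Audio-Format: float32 header or convert to the default 16-bit format first. A stdlib-only sketch of that conversion (clamping to [-1.0, 1.0] before scaling is an assumption about how out-of-range samples should be handled):

```python
import struct

def float32_to_int16_pcm(samples):
    """Convert float samples in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))   # clamp out-of-range samples
        ints.append(int(s * 32767))  # scale to the int16 range
    return struct.pack(f"<{len(ints)}h", *ints)
```

The resulting bytes can be posted directly as the request body, with no extra header needed.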
Returns progress stats so you can show a live indicator:
curl -X POST http://localhost:8765/v1/audio/transcriptions/stream/feed \
-H "X-Session-Id: A1B2C3D4-..." \
-H "Content-Type: application/octet-stream" \
--data-binary @chunk.pcm
# {
# "success": true,
# "words_so_far": 42,
# "minutes_transcribed": 1.5,
# "minutes_recorded": 2.3,
# "bytes_fed": 32000
# }
Response fields:
- words_so_far — number of words transcribed from completed chunks
- minutes_transcribed — how many minutes of audio have been transcribed
- minutes_recorded — total audio duration fed so far
- bytes_fed — size of this particular chunk in bytes
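These numbers are tied together by the wire format: at 16kHz mono with 2 bytes per int16 sample, one second of audio is 32,000 bytes. A client-side cross-check (not part of the API) against the bytes you've fed so far:

```python
BYTES_PER_SECOND = 16000 * 2  # 16 kHz mono, 2 bytes per int16 sample

def minutes_from_bytes(total_bytes_fed):
    """Estimate recorded minutes from cumulative bytes of int16 PCM fed."""
    return total_bytes_fed / BYTES_PER_SECOND / 60
```

Summing bytes_fed across feed calls and passing the total here should roughly match minutes_recorded.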
POST /v1/audio/transcriptions/stream/end
End the session. Mellon transcribes any remaining buffered audio and returns the complete text for the entire session.
curl -X POST http://localhost:8765/v1/audio/transcriptions/stream/end \
-H "X-Session-Id: A1B2C3D4-..."
# {"success": true, "text": "The complete transcription of the entire recording session..."}
Streaming Example (Python)
Here's a complete example that records from the microphone and streams to Mellon:
import requests
import sounddevice as sd
import numpy as np
BASE = "http://localhost:8765/v1/audio/transcriptions/stream"
SAMPLE_RATE = 16000
CHUNK_SECONDS = 3
# Start session
r = requests.post(f"{BASE}/start", json={"language": "en"})
session_id = r.json()["session_id"]
print("Recording... press Ctrl+C to stop")
try:
    while True:
        audio = sd.rec(int(SAMPLE_RATE * CHUNK_SECONDS),
                       samplerate=SAMPLE_RATE, channels=1, dtype="int16")
        sd.wait()
        r = requests.post(f"{BASE}/feed",
                          headers={"X-Session-Id": session_id},
                          data=audio.tobytes())
        print(f"  words: {r.json()['words_so_far']}")
except KeyboardInterrupt:
    pass
# End session and get full text
r = requests.post(f"{BASE}/end",
                  headers={"X-Session-Id": session_id})
print(f"\nFinal: {r.json()['text']}")
Agent Mode Endpoints
These endpoints power Mellon's Agent Mode — voice-activated AI commands. They're primarily used for end-to-end testing but are available if you want to build your own integrations.
POST /e2e/agent-mode
Full agent mode pipeline: Whisper transcription, trigger word detection (e.g., "Hey Mellon"), then AI command execution.
POST /e2e/transcribe-full
Full pipeline with cursor context capture — used for E2E testing the complete transcription flow including accessibility context.
POST /e2e/test-trigger
Test trigger word detection without audio. Send JSON with a text field.
curl -X POST http://localhost:8765/e2e/test-trigger \
-H "Content-Type: application/json" \
-d '{"text": "Hey Mellon summarize this paragraph"}'
Utility Endpoints
GET /health
Returns server status and whether the Whisper model is loaded.
curl http://localhost:8765/health
# {"status": "ok", "model_loaded": true}
GET /
Returns a JSON listing of all available endpoints and their descriptions.
Supported Audio Formats
Batch endpoints accept: WAV, MP3, M4A, FLAC, AIFF, OGG (OGG requires macOS 14+). Audio is automatically converted to 16kHz mono PCM internally — no pre-processing needed.
Streaming endpoints expect raw PCM audio at 16kHz mono — either 16-bit signed integer (default) or float32 (with X-Audio-Format: float32 header).
Custom Dictionary
All endpoints that run the full pipeline (everything except /transcribe) benefit from your custom dictionary. Add terms in Mellon → Settings → Dictionary:
- Product names — brand names, app names, project codenames
- People's names — colleagues, contacts, team members
- Technical jargon — industry-specific terms Whisper might misspell
- Medical terms — enable the medical dictionary toggle for healthcare terminology
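Mellon's matching logic is internal, but the effect shown in the /transcribe-full example (chrono cat becoming ChronoCat) can be approximated with case-insensitive whole-phrase replacement. The regex sketch below is an illustration of the idea, not the actual algorithm:

```python
import re

def apply_dictionary(text, corrections):
    """Replace each misheard phrase with its dictionary form, case-insensitively."""
    for wrong, right in corrections.items():
        # \b anchors keep the match to whole words/phrases only.
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        text = pattern.sub(right, text)
    return text
```

The corrections array in the /transcribe-full response shows which substitutions actually fired, so you can audit them per request.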
Privacy
The API server only listens on localhost. All processing happens on your Mac using Apple Silicon's Neural Engine. No audio data leaves your device. No API keys, no accounts, no usage tracking.
Ready to integrate local speech-to-text? Download Mellon free — the API server is included with every installation.