Mellon runs a local HTTP server on your Mac that exposes Whisper speech-to-text through a simple API. Whether you're building an AI agent, automating transcription workflows, or integrating voice into your app — everything runs on-device with no API keys, no cloud, and no per-minute billing.

Available in Mellon v1.4.0+. Streaming endpoints available in v1.5.0+.

Getting Started

1. Enable the API Server

Open Mellon's settings and go to API Server. Toggle the server on. It starts on http://localhost:8765 by default.

Verify it's running:

curl http://localhost:8765/health
# {"status": "ok", "model_loaded": true}

2. Send Audio

Pick the endpoint that fits your use case — see the full reference below. The simplest way to test:

curl -X POST http://localhost:8765/v1/audio/transcriptions \
  -F "[email protected]" \
  -F "model=whisper-1"

# {"text": "Hello, this is a test recording."}

Batch Endpoints

These endpoints accept a complete audio file and return the transcription in one response. Best for short-to-medium recordings (up to a few minutes).

POST /v1/audio/transcriptions

Recommended for most integrations. OpenAI-compatible (multipart/form-data). Runs the full pipeline: Whisper + spellcheck + custom dictionary corrections.

This is a drop-in replacement for https://api.openai.com/v1/audio/transcriptions — tools like OpenClaw can point to Mellon with zero code changes.

curl -X POST http://localhost:8765/v1/audio/transcriptions \
  -F "[email protected]" \
  -F "model=whisper-1"

# {"text": "I updated ChronoCat and opened Mellon."}

OpenClaw config example:

{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "models": [{
          "provider": "openai",
          "model": "whisper-1",
          "baseUrl": "http://127.0.0.1:8765/v1"
        }]
      }
    }
  }
}
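The same request from Python, sketched with the `requests` library (which the streaming example further below also uses). The `file` and `model` field names follow the curl example above; treat this as a sketch, not an official client:

```python
import requests  # assumed installed: pip install requests

def transcribe_file(path, base_url="http://localhost:8765"):
    """POST an audio file to Mellon's OpenAI-compatible batch endpoint."""
    with open(path, "rb") as f:
        resp = requests.post(
            f"{base_url}/v1/audio/transcriptions",
            files={"file": f},            # multipart/form-data file part
            data={"model": "whisper-1"},  # accepted for compatibility
        )
    resp.raise_for_status()
    return resp.json()["text"]

# text = transcribe_file("audio.wav")
```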

POST /transcribe-full

Raw audio body. Same full pipeline as above, but returns detailed correction data and timing information. Useful for debugging or when you need to see what Whisper originally produced vs. what was corrected.

curl -X POST http://localhost:8765/transcribe-full \
  --data-binary @recording.wav \
  -H "Content-Type: application/octet-stream"

# {
#   "success": true,
#   "text": "I updated ChronoCat and opened Mellon.",
#   "whisper_text": "I updated chrono cat and opened melon.",
#   "corrections": [
#     {"original": "chrono cat", "corrected": "ChronoCat", "source": "custom"}
#   ],
#   "timing": {"whisper_ms": 1024, "spellcheck_ms": 2, "total_ms": 1026}
# }
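A sketch of consuming the correction data from Python, again assuming the `requests` library; the field names are taken from the sample response above:

```python
import requests  # assumed installed

def transcribe_full(path, base_url="http://localhost:8765"):
    """Send raw audio to /transcribe-full and print any corrections applied."""
    with open(path, "rb") as f:
        resp = requests.post(
            f"{base_url}/transcribe-full",
            data=f.read(),
            headers={"Content-Type": "application/octet-stream"},
        )
    resp.raise_for_status()
    result = resp.json()
    # Show what Whisper produced vs. what the pipeline corrected it to.
    for c in result.get("corrections", []):
        print(f'{c["original"]!r} -> {c["corrected"]!r} ({c["source"]})')
    return result["text"]
```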

POST /transcribe

Raw audio body. Whisper only — no spellcheck or dictionary corrections. Fastest option when you just need raw transcription.

curl -X POST http://localhost:8765/transcribe \
  --data-binary @recording.wav \
  -H "Content-Type: application/octet-stream"

# {"success": true, "text": "I updated chrono cat.", "duration_ms": 1025}

Streaming Endpoints

For long-form audio or real-time recording, streaming endpoints let you feed audio in chunks while transcription happens in the background. Instead of waiting for the entire recording to finish, Mellon processes audio as it arrives — using Voice Activity Detection (VAD) to find natural silence boundaries and transcribe completed chunks independently.

This means you can transcribe recordings of any length without hitting memory limits, and get progress updates while recording is still under way.

How It Works

  1. Start a session — you get back a session_id
  2. Feed raw PCM audio chunks as they're recorded (16kHz, mono)
  3. End the session — Mellon transcribes any remaining audio and returns the full text

Behind the scenes, Mellon accumulates samples, runs VAD every ~5 seconds of new audio, and splits at silence boundaries (minimum 30s chunks, hard cap at 2 minutes). Each chunk is transcribed independently, so text accumulates as you record.
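The splitting policy described above can be sketched in plain Python. This is illustrative only, not Mellon's actual implementation; `is_silence` stands in for the VAD decision at a given sample index:

```python
SAMPLE_RATE = 16000   # streaming input is 16 kHz mono
MIN_CHUNK_S = 30      # minimum chunk length before splitting
MAX_CHUNK_S = 120     # hard cap: force a split at 2 minutes

def find_chunk_end(samples, is_silence):
    """Return the sample index to split at, or None if no chunk is ready yet.

    `is_silence(i)` is a stand-in for the VAD decision at sample i.
    """
    n = len(samples)
    if n < MIN_CHUNK_S * SAMPLE_RATE:
        return None  # not enough audio accumulated for a chunk
    hard_cap = MAX_CHUNK_S * SAMPLE_RATE
    # Prefer the first silence boundary after the minimum chunk length.
    for i in range(MIN_CHUNK_S * SAMPLE_RATE, min(n, hard_cap)):
        if is_silence(i):
            return i
    # No silence found before the cap: force a split at 2 minutes.
    return hard_cap if n >= hard_cap else None
```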

Concurrent Sessions

The streaming API handles multiple transcription sessions simultaneously. Each session buffers and chunks its audio independently, so you can run several recordings in parallel without interference. When two or more sessions produce a chunk at the same time, Whisper model access is serialized: chunks are queued and processed one at a time, so sessions never compete for the model. This makes the API well-suited for multi-user or multi-source setups where several audio streams need transcription at once.
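As a sketch, several files could be streamed through parallel sessions from Python (assumes the `requests` library; the endpoint paths are the streaming endpoints documented below, and `stream_file` is an illustrative helper):

```python
import requests  # assumed installed
from concurrent.futures import ThreadPoolExecutor

BASE = "http://localhost:8765/v1/audio/transcriptions/stream"

def stream_file(path, chunk_bytes=32000):
    """Run one full streaming session: start, feed the file in chunks, end."""
    sid = requests.post(f"{BASE}/start", json={"language": "en"}).json()["session_id"]
    with open(path, "rb") as f:
        while chunk := f.read(chunk_bytes):
            requests.post(f"{BASE}/feed",
                          headers={"X-Session-Id": sid},
                          data=chunk)
    return requests.post(f"{BASE}/end",
                         headers={"X-Session-Id": sid}).json()["text"]

# Sessions run concurrently; Mellon serializes model access internally:
# with ThreadPoolExecutor() as ex:
#     texts = list(ex.map(stream_file, ["a.pcm", "b.pcm"]))
```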

POST /v1/audio/transcriptions/stream/start

Start a new streaming session. Optionally specify a language.

curl -X POST http://localhost:8765/v1/audio/transcriptions/stream/start \
  -H "Content-Type: application/json" \
  -d '{"language": "en"}'

# {"session_id": "A1B2C3D4-...", "success": true}

Sessions automatically expire after 10 minutes of inactivity.

POST /v1/audio/transcriptions/stream/feed

Feed a chunk of raw PCM audio data. Send 16-bit signed integer PCM by default (16kHz, mono). For float32 samples, include the X-Audio-Format: float32 header.

Returns progress stats so you can show a live indicator:

curl -X POST http://localhost:8765/v1/audio/transcriptions/stream/feed \
  -H "X-Session-Id: A1B2C3D4-..." \
  -H "Content-Type: application/octet-stream" \
  --data-binary @chunk.pcm

# {
#   "success": true,
#   "words_so_far": 42,
#   "minutes_transcribed": 1.5,
#   "minutes_recorded": 2.3,
#   "bytes_fed": 32000
# }

Response fields:

  • words_so_far — number of words transcribed from completed chunks
  • minutes_transcribed — how many minutes of audio have been transcribed
  • minutes_recorded — total audio duration fed so far
  • bytes_fed — size of this particular chunk in bytes
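If your capture pipeline produces float samples, here is a sketch of packing and feeding them with the X-Audio-Format header (assumes `numpy` and `requests`; `to_float32_bytes` and `feed_float32` are hypothetical helper names):

```python
import numpy as np   # assumed installed
import requests      # assumed installed

def to_float32_bytes(samples):
    """Pack samples (floats in -1.0..1.0, 16 kHz mono) as 32-bit float PCM."""
    return np.asarray(samples, dtype=np.float32).tobytes()

def feed_float32(session_id, samples, base_url="http://localhost:8765"):
    """Feed float32 samples to an open streaming session."""
    r = requests.post(
        f"{base_url}/v1/audio/transcriptions/stream/feed",
        headers={
            "X-Session-Id": session_id,
            "X-Audio-Format": "float32",
            "Content-Type": "application/octet-stream",
        },
        data=to_float32_bytes(samples),
    )
    r.raise_for_status()
    return r.json()
```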

POST /v1/audio/transcriptions/stream/end

End the session. Mellon transcribes any remaining buffered audio and returns the complete text for the entire session.

curl -X POST http://localhost:8765/v1/audio/transcriptions/stream/end \
  -H "X-Session-Id: A1B2C3D4-..."

# {"success": true, "text": "The complete transcription of the entire recording session..."}

Streaming Example (Python)

Here's a complete example that records from the microphone and streams to Mellon:

import requests
import sounddevice as sd
import numpy as np

BASE = "http://localhost:8765/v1/audio/transcriptions/stream"
SAMPLE_RATE = 16000
CHUNK_SECONDS = 3

# Start session
r = requests.post(f"{BASE}/start", json={"language": "en"})
session_id = r.json()["session_id"]

print("Recording... press Ctrl+C to stop")
try:
    while True:
        audio = sd.rec(int(SAMPLE_RATE * CHUNK_SECONDS),
                       samplerate=SAMPLE_RATE, channels=1, dtype="int16")
        sd.wait()
        r = requests.post(f"{BASE}/feed",
                          headers={"X-Session-Id": session_id},
                          data=audio.tobytes())
        print(f"  words: {r.json()['words_so_far']}")
except KeyboardInterrupt:
    pass

# End session and get full text
r = requests.post(f"{BASE}/end",
                  headers={"X-Session-Id": session_id})
print(f"\nFinal: {r.json()['text']}")

Agent Mode Endpoints

These endpoints power Mellon's Agent Mode — voice-activated AI commands. They're primarily used for end-to-end testing but are available if you want to build your own integrations.

POST /e2e/agent-mode

Full agent mode pipeline: Whisper transcription, trigger word detection (e.g., "Hey Mellon"), then AI command execution.

POST /e2e/transcribe-full

Full pipeline with cursor context capture — used for E2E testing the complete transcription flow including accessibility context.

POST /e2e/test-trigger

Test trigger word detection without audio. Send JSON with a text field.

curl -X POST http://localhost:8765/e2e/test-trigger \
  -H "Content-Type: application/json" \
  -d '{"text": "Hey Mellon summarize this paragraph"}'

Utility Endpoints

GET /health

Returns server status and whether the Whisper model is loaded.

curl http://localhost:8765/health
# {"status": "ok", "model_loaded": true}

GET /

Returns a JSON listing of all available endpoints and their descriptions.

Supported Audio Formats

Batch endpoints accept: WAV, MP3, M4A, FLAC, AIFF, OGG (OGG requires macOS 14+). Audio is automatically converted to 16kHz mono PCM internally — no pre-processing needed.

Streaming endpoints expect raw PCM audio at 16kHz mono — either 16-bit signed integer (default) or float32 (with X-Audio-Format: float32 header).

Custom Dictionary

All endpoints that run the full pipeline (everything except /transcribe) benefit from your custom dictionary. Add terms in Mellon → Settings → Dictionary:

  • Product names — brand names, app names, project codenames
  • People's names — colleagues, contacts, team members
  • Technical jargon — industry-specific terms Whisper might misspell
  • Medical terms — enable the medical dictionary toggle for healthcare terminology

Privacy

The API server only listens on localhost. All processing happens on your Mac using Apple Silicon's Neural Engine. No audio data leaves your device. No API keys, no accounts, no usage tracking.

Ready to integrate local speech-to-text? Download Mellon free — the API server is included with every installation.