Mellon runs a local HTTP server on your Mac that exposes Whisper speech-to-text through a simple API. Whether you're building an AI agent, automating transcription workflows, or integrating voice into your app — everything runs on-device with no API keys, no cloud, and no per-minute billing.
Available in Mellon v1.4.0+. Streaming endpoints available in v1.5.0+.
Getting Started
1. Enable the API Server
Open Mellon's settings and go to API Server. Toggle the server on. It starts on http://localhost:8765 by default.
Verify it's running:
curl http://localhost:8765/health
# {"status": "ok", "model_loaded": true}
2. Send Audio
Pick the endpoint that fits your use case — see the full reference below. The simplest way to test:
curl -X POST http://localhost:8765/v1/audio/transcriptions \
-F "file=@recording.wav" \
-F "model=whisper-1"
# {"text": "Hello, this is a test recording."}
Batch Endpoints
These endpoints accept a complete audio file and return the transcription in one response. Best for short-to-medium recordings (up to a few minutes).
POST /v1/audio/transcriptions
Recommended for most integrations. OpenAI-compatible (multipart/form-data). Runs the full pipeline: Whisper + spellcheck + custom dictionary corrections.
This is a drop-in replacement for https://api.openai.com/v1/audio/transcriptions — tools like OpenClaw can point to Mellon with zero code changes.
curl -X POST http://localhost:8765/v1/audio/transcriptions \
-F "file=@recording.wav" \
-F "model=whisper-1"
# {"text": "I updated ChronoCat and opened Mellon."}
OpenClaw config example:
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "models": [{
          "provider": "openai",
          "model": "whisper-1",
          "baseUrl": "http://127.0.0.1:8765/v1"
        }]
      }
    }
  }
}
POST /transcribe-full
Raw audio body. Same full pipeline as above, but returns detailed correction data and timing information. Useful for debugging or when you need to see what Whisper originally produced vs. what was corrected.
curl -X POST http://localhost:8765/transcribe-full \
--data-binary @recording.wav \
-H "Content-Type: application/octet-stream"
# {
# "success": true,
# "text": "I updated ChronoCat and opened Mellon.",
# "whisper_text": "I updated chrono cat and opened melon.",
# "corrections": [
# {"original": "chrono cat", "corrected": "ChronoCat", "source": "custom"}
# ],
# "timing": {"whisper_ms": 1024, "spellcheck_ms": 2, "total_ms": 1026}
# }
POST /transcribe
Raw audio body. Whisper only — no spellcheck or dictionary corrections. Fastest option when you just need raw transcription.
curl -X POST http://localhost:8765/transcribe \
--data-binary @recording.wav \
-H "Content-Type: application/octet-stream"
# {"success": true, "text": "I updated chrono cat.", "duration_ms": 1025}
Streaming Endpoints
For long-form audio or real-time recording, streaming endpoints let you feed audio in chunks while transcription happens in the background. Instead of waiting for the entire recording to finish, Mellon processes audio as it arrives — using Voice Activity Detection (VAD) to find natural silence boundaries and transcribe completed chunks independently.
This means you can transcribe recordings of any length without hitting memory limits, and receive progress updates while the session is still running.
How It Works
- Start a session — you get back a session_id
- Feed raw PCM audio chunks as they're recorded (16kHz, mono)
- End the session — Mellon transcribes any remaining audio and returns the full text
Behind the scenes, Mellon accumulates samples, runs VAD every ~5 seconds of new audio, and splits at silence boundaries (minimum 30s chunks, hard cap at 2 minutes). Each chunk is transcribed independently, so text accumulates as you record.
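The splitting policy above can be sketched as a small pure function. The thresholds (30-second minimum, 2-minute cap) come straight from the description; the silence offsets are assumed to come from a VAD pass, which is internal to Mellon and only stubbed out here.

```python
SAMPLE_RATE = 16000
MIN_CHUNK_S = 30    # don't split before 30s of audio has accumulated
MAX_CHUNK_S = 120   # hard cap: force a split at 2 minutes

def find_split(buffer_len_samples, silence_starts):
    """Pick a split point (in samples) for the accumulated buffer.

    silence_starts: sample offsets where a VAD pass found silence.
    Returns None if no split should happen yet.
    """
    min_samples = MIN_CHUNK_S * SAMPLE_RATE
    max_samples = MAX_CHUNK_S * SAMPLE_RATE
    if buffer_len_samples >= max_samples:
        # Hard cap reached: split at the cap even without a silence boundary.
        return max_samples
    # Otherwise, split at the first silence boundary past the minimum.
    for s in silence_starts:
        if min_samples <= s <= buffer_len_samples:
            return s
    return None
```

Each chunk produced this way is transcribed independently, which is why text accumulates as the session runs.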
Concurrent Sessions
The streaming API handles multiple transcription sessions simultaneously. Each session independently buffers and chunks its audio, so you can run several recordings in parallel without interference. When two or more sessions produce a chunk at the same time, Whisper model access is serialized — chunks are queued and processed one at a time, ensuring stable performance without contention. This makes the API well-suited for multi-user or multi-source setups where several audio streams need transcription at once.
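Mellon's internal queueing isn't public, but the contract described above can be sketched with a lock around a (stand-in) model call: sessions collect results independently, while only one chunk touches the model at a time.

```python
import threading

class SerializedTranscriber:
    """Many sessions may submit chunks; the model runs one chunk at a time."""

    def __init__(self, transcribe_fn):
        self._transcribe = transcribe_fn   # stand-in for the Whisper call
        self._model_lock = threading.Lock()
        self.results = {}                  # session_id -> list of chunk texts
        self._results_lock = threading.Lock()

    def submit(self, session_id, chunk):
        with self._model_lock:             # model access is serialized
            text = self._transcribe(chunk)
        with self._results_lock:           # per-session results stay independent
            self.results.setdefault(session_id, []).append(text)
```

This is an illustration of the behavior, not Mellon's implementation: the point is that callers never need to coordinate with each other.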
POST /v1/audio/transcriptions/stream/start
Start a new streaming session. Optionally specify a language.
curl -X POST http://localhost:8765/v1/audio/transcriptions/stream/start \
-H "Content-Type: application/json" \
-d '{"language": "en"}'
# {"session_id": "A1B2C3D4-...", "success": true}
Sessions automatically expire after 10 minutes of inactivity.
POST /v1/audio/transcriptions/stream/feed
Feed a chunk of raw PCM audio data. Send 16-bit signed integer PCM by default (16kHz, mono). For float32 samples, include the X-Audio-Format: float32 header.
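If your capture pipeline produces float32 samples, you can either send them with the X-Audio-Format: float32 header or convert to the default 16-bit format first. A stdlib-only sketch of that conversion (clamping to [-1.0, 1.0] before scaling is an assumption about how out-of-range samples should be handled):

```python
import struct

def float32_to_int16_pcm(samples):
    """Convert float samples in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))   # clamp out-of-range samples
        ints.append(int(s * 32767))  # scale to the int16 range
    return struct.pack(f"<{len(ints)}h", *ints)
```

The resulting bytes can be posted directly as the request body, with no extra header needed.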
Returns progress stats so you can show a live indicator:
curl -X POST http://localhost:8765/v1/audio/transcriptions/stream/feed \
-H "X-Session-Id: A1B2C3D4-..." \
-H "Content-Type: application/octet-stream" \
--data-binary @chunk.pcm
# {
# "success": true,
# "words_so_far": 42,
# "minutes_transcribed": 1.5,
# "minutes_recorded": 2.3,
# "bytes_fed": 32000
# }
Response fields:
- words_so_far — number of words transcribed from completed chunks
- minutes_transcribed — how many minutes of audio have been transcribed
- minutes_recorded — total audio duration fed so far
- bytes_fed — size of this particular chunk in bytes
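These numbers are tied together by the wire format: at 16kHz mono with 2 bytes per int16 sample, one second of audio is 32,000 bytes. A client-side cross-check (not part of the API) against the bytes you've fed so far:

```python
BYTES_PER_SECOND = 16000 * 2  # 16 kHz mono, 2 bytes per int16 sample

def minutes_from_bytes(total_bytes_fed):
    """Estimate recorded minutes from cumulative bytes of int16 PCM fed."""
    return total_bytes_fed / BYTES_PER_SECOND / 60
```

Summing bytes_fed across feed calls and passing the total here should roughly match minutes_recorded.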
POST /v1/audio/transcriptions/stream/end
End the session. Mellon transcribes any remaining buffered audio and returns the complete text for the entire session.
curl -X POST http://localhost:8765/v1/audio/transcriptions/stream/end \
-H "X-Session-Id: A1B2C3D4-..."
# {"success": true, "text": "The complete transcription of the entire recording session..."}
Streaming Example (Python)
Here's a complete example that records from the microphone and streams to Mellon:
import requests
import sounddevice as sd
import numpy as np
BASE = "http://localhost:8765/v1/audio/transcriptions/stream"
SAMPLE_RATE = 16000
CHUNK_SECONDS = 3
# Start session
r = requests.post(f"{BASE}/start", json={"language": "en"})
session_id = r.json()["session_id"]
print("Recording... press Ctrl+C to stop")
try:
    while True:
        audio = sd.rec(int(SAMPLE_RATE * CHUNK_SECONDS),
                       samplerate=SAMPLE_RATE, channels=1, dtype="int16")
        sd.wait()
        r = requests.post(f"{BASE}/feed",
                          headers={"X-Session-Id": session_id},
                          data=audio.tobytes())
        print(f"  words: {r.json()['words_so_far']}")
except KeyboardInterrupt:
    pass
# End session and get full text
r = requests.post(f"{BASE}/end",
                  headers={"X-Session-Id": session_id})
print(f"\nFinal: {r.json()['text']}")
Agent Mode Endpoints
These endpoints power Mellon's Agent Mode — voice-activated AI commands. They're primarily used for end-to-end testing but are available if you want to build your own integrations.
POST /e2e/agent-mode
Full agent mode pipeline: Whisper transcription, trigger word detection (e.g., "Hey Mellon"), then AI command execution.
POST /e2e/transcribe-full
Full pipeline with cursor context capture — used for E2E testing the complete transcription flow including accessibility context.
POST /e2e/test-trigger
Test trigger word detection without audio. Send JSON with a text field.
curl -X POST http://localhost:8765/e2e/test-trigger \
-H "Content-Type: application/json" \
-d '{"text": "Hey Mellon summarize this paragraph"}'
Utility Endpoints
GET /health
Returns server status and whether the Whisper model is loaded.
curl http://localhost:8765/health
# {"status": "ok", "model_loaded": true}
GET /
Returns a JSON listing of all available endpoints and their descriptions.
Supported Audio Formats
Batch endpoints accept: WAV, MP3, M4A, FLAC, AIFF, OGG (OGG requires macOS 14+). Audio is automatically converted to 16kHz mono PCM internally — no pre-processing needed.
Streaming endpoints expect raw PCM audio at 16kHz mono — either 16-bit signed integer (default) or float32 (with X-Audio-Format: float32 header).
Custom Dictionary
All endpoints that run the full pipeline (everything except /transcribe) benefit from your custom dictionary. Add terms in Mellon → Settings → Dictionary:
- Product names — brand names, app names, project codenames
- People's names — colleagues, contacts, team members
- Technical jargon — industry-specific terms Whisper might misspell
- Medical terms — enable the medical dictionary toggle for healthcare terminology
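Mellon's matching logic is internal, but the effect shown in the /transcribe-full example (chrono cat becoming ChronoCat) can be approximated with case-insensitive whole-phrase replacement. The regex sketch below is an illustration of the idea, not the actual algorithm:

```python
import re

def apply_dictionary(text, corrections):
    """Replace each misheard phrase with its dictionary form, case-insensitively."""
    for wrong, right in corrections.items():
        # \b anchors keep the match to whole words/phrases only.
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        text = pattern.sub(right, text)
    return text
```

The corrections array in the /transcribe-full response shows which substitutions actually fired, so you can audit them per request.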
Privacy
The API server only listens on localhost. All processing happens on your Mac using Apple Silicon's Neural Engine. No audio data leaves your device. No API keys, no accounts, no usage tracking.
Ready to integrate local speech-to-text? Download Mellon free — the API server is included with every installation.