Speech-to-text is one of the highest-signal features you can ship in six hours. Whisper — OpenAI's open-source speech model — is the default choice: multilingual, robust to noise, and available both as a hosted API and as downloadable weights. This guide is the shortest path from an empty repo to a working Whisper integration.
1. Pick your Whisper: API vs self-hosted
You have three practical options. During a hackathon, option 1 wins nine times out of ten.
- Hosted API: whisper-1 or gpt-4o-transcribe via a single HTTP call. ~$0.006/min. Zero infra. Use this by default.
- whisper.cpp / faster-whisper: runs on CPU or a modest GPU. No per-minute cost. Great for batch pipelines or an air-gapped demo, but you pay in setup time.
- Replicate / HF Inference: serverless GPU wrapper around the open weights. A middle ground when you want the large-v3 model without owning a GPU.
2. Node.js: batch transcription in ~15 lines
The hosted API accepts a multipart upload. From a server route, stream the file straight in — no base64, no chunking.
    import OpenAI from "openai";
    import { createReadStream } from "node:fs";

    const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

    export async function transcribe(path: string) {
      const res = await openai.audio.transcriptions.create({
        file: createReadStream(path),
        model: "whisper-1",
        // Optional: response_format: "verbose_json" for segment timestamps
        //   (add timestamp_granularities: ["word"] for word-level ones)
        // Optional: language: "en" to skip auto-detect and cut latency
      });
      return res.text;
    }

3. Python: the same call, plus timestamps
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def transcribe(path: str):
        with open(path, "rb") as f:
            res = client.audio.transcriptions.create(
                file=f,
                model="whisper-1",
                response_format="verbose_json",
                timestamp_granularities=["segment"],
            )
        return res.text, res.segments

For self-hosted, swap the client for faster-whisper. The call shape is similar (load a model, call transcribe, read back the segments), and on a laptop GPU the base model transcribes faster than real time.
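A minimal sketch of the self-hosted variant, assuming the faster-whisper package is installed. The function names `transcribe_local` and `join_segments` are illustrative, and the CPU/int8 settings are an assumption chosen for laptops without a GPU, not a recommendation from the library.

```python
from typing import Iterable, List, Tuple

Segment = Tuple[float, float, str]  # (start_seconds, end_seconds, text)


def join_segments(segments: Iterable[Segment]) -> str:
    """Join segment texts into one transcript string, skipping blanks."""
    return " ".join(text.strip() for _, _, text in segments if text.strip())


def transcribe_local(path: str, model_size: str = "base") -> Tuple[str, List[Segment]]:
    """Transcribe a file with faster-whisper and return (text, segments)."""
    # Imported lazily so this module still loads where the package is absent.
    from faster_whisper import WhisperModel

    # Assumption: CPU with int8 quantization; try device="cuda" on a GPU box.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(path)
    # `segments` is a lazy generator; iterating it is what runs the decoding.
    collected = [(s.start, s.end, s.text) for s in segments]
    return join_segments(collected), collected
```

The return shape mirrors the hosted example (text plus segments), so the rest of an app should not need to care which backend produced the transcript.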
4. Real-time-ish transcription in the browser
True streaming Whisper is not a thing yet — the hosted API is batch. The trick that wins demos is rolling windows: record short WAV chunks on the client, POST each one, and concatenate the deltas.
- Use the Web Audio API to capture PCM, not MediaRecorder with a timeslice: timeslice fragments have no container header on chunks 2+ and Whisper rejects them.
- Encode each ~3-second window as a self-contained 16 kHz mono WAV.
- Upload to your backend server function, not directly to OpenAI — keep the API key server-side.
- Stitch responses in order; render partial text as it arrives.
- Guard against silent windows — a mic that never opens produces a header-only WAV that returns 400.
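The server side of the rolling-window steps above could be sketched as follows. This is an illustration under assumptions, not a fixed recipe: the names `wav_has_speech_energy` and `TranscriptStitcher` are hypothetical, the client is assumed to send a sequence number with each 16-bit PCM WAV window, and the RMS threshold of 100 is an arbitrary starting point to tune.

```python
import io
import sys
import wave
from array import array
from typing import Dict, List


def wav_has_speech_energy(wav_bytes: bytes, min_rms: float = 100.0) -> bool:
    """Return False for invalid, header-only, or near-silent 16-bit WAVs."""
    try:
        with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
            if wav.getsampwidth() != 2:
                return False  # this sketch only handles 16-bit PCM
            frames = wav.readframes(wav.getnframes())
        samples = array("h", frames)
    except (wave.Error, EOFError, ValueError):
        return False
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    if len(samples) == 0:
        return False  # header-only file: skip the API call entirely
    rms = (sum(s * s for s in samples) / len(samples)) ** 0.5
    return rms >= min_rms


class TranscriptStitcher:
    """Reassemble window transcripts that may arrive out of order."""

    def __init__(self) -> None:
        self._pending: Dict[int, str] = {}
        self._next_seq = 0
        self._parts: List[str] = []

    def add(self, seq: int, text: str) -> str:
        """Store one window's text; return the in-order transcript so far."""
        self._pending[seq] = text.strip()
        # Release every window that is now contiguous with what we have.
        while self._next_seq in self._pending:
            part = self._pending.pop(self._next_seq)
            if part:
                self._parts.append(part)
            self._next_seq += 1
        return " ".join(self._parts)
```

Checking energy before calling the API avoids both the 400 on empty windows and paying for silence; windows that fail the check can be passed to the stitcher as an empty string so later windows are not held back.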
If you need sub-second latency (e.g. a voice UI that reacts as the user speaks), reach for gpt-4o-transcribe with SSE streaming, or Deepgram's WebSocket API. Whisper is best when a 2–5 second delay is acceptable.
5. Recipe: voice-controlled interface
A crowd-pleaser demo in about an hour of work:
- Push-to-talk button on the UI. Record while held, stop on release.
- Upload the WAV to a /api/transcribe server function → Whisper → text.
- Feed the text to an LLM with a system prompt like "translate the user's request into a JSON action call from this list".
- Execute the action client-side. Speak the confirmation back with the browser's SpeechSynthesis API or an OpenAI TTS call.
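Before executing anything the LLM returns, it helps to validate the JSON against the action list. The sketch below assumes a reply shaped like {"action": ..., "args": {...}}; that shape, the `parse_action` helper, and the action names are all hypothetical and should be replaced with whatever your prompt actually specifies.

```python
import json
from typing import Any, Dict, Optional, Set

# Hypothetical action names for illustration; use your app's own list.
ALLOWED_ACTIONS: Set[str] = {"open_page", "set_theme", "add_todo"}


def parse_action(
    llm_output: str, allowed: Set[str] = ALLOWED_ACTIONS
) -> Optional[Dict[str, Any]]:
    """Return {"action": ..., "args": {...}} if the reply is valid, else None."""
    # Models sometimes wrap JSON in extra prose; keep the outermost braces.
    start, end = llm_output.find("{"), llm_output.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(llm_output[start : end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    action = data.get("action")
    args = data.get("args", {})
    # Reject anything outside the allowlist instead of guessing.
    if not isinstance(action, str) or action not in allowed:
        return None
    if not isinstance(args, dict):
        return None
    return {"action": action, "args": args}
```

Returning None for anything unexpected lets the UI fall back to "Sorry, I didn't catch that" rather than executing a half-parsed command on stage.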
6. Recipe: meeting summarizer
The most requested hackathon Whisper application — and easy to build well:
- Accept an audio or video upload (mp3, m4a, mp4, wav). Whisper handles them all.
- If the file is larger than 25 MB (the API cap), split it with ffmpeg into 10-minute chunks, transcribe each, and concatenate; do not truncate.
- Pass the full transcript to an LLM with a prompt asking for a structured summary: TL;DR, decisions, action items with owners, open questions.
- Render the summary alongside the timestamped transcript so users can click to jump to any moment.
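The chunk-and-concatenate step can be sketched like this. The helper names are hypothetical, the segments are assumed to be plain dicts with start, end, and text keys (for example after converting the API's segment objects), and the offsets assume each chunk is exactly 10 minutes long, which ffmpeg's segment muxer only approximates.

```python
from typing import Any, Dict, List

CHUNK_SECONDS = 600  # 10-minute chunks, as in the recipe above


def ffmpeg_split_command(src: str, out_pattern: str = "chunk_%03d.mp3") -> List[str]:
    """Build an ffmpeg command that writes 16 kHz mono MP3 chunks."""
    return [
        "ffmpeg", "-i", src,
        "-vn",            # drop any video stream
        "-ac", "1",       # mono
        "-ar", "16000",   # 16 kHz
        "-f", "segment",  # segment muxer: one output file per time slice
        "-segment_time", str(CHUNK_SECONDS),
        out_pattern,
    ]


def merge_chunk_segments(
    chunks: List[List[Dict[str, Any]]], chunk_seconds: float = CHUNK_SECONDS
) -> List[Dict[str, Any]]:
    """Shift each chunk's segment times by its offset in the original file."""
    merged: List[Dict[str, Any]] = []
    for index, segments in enumerate(chunks):
        offset = index * chunk_seconds  # approximate; see the note above
        for seg in segments:
            merged.append(
                {
                    "start": seg["start"] + offset,
                    "end": seg["end"] + offset,
                    "text": seg["text"],
                }
            )
    return merged
```

Run the command with `subprocess.run(ffmpeg_split_command("meeting.mp4"), check=True)`, transcribe the chunks in filename order, then pass the per-chunk segment lists to `merge_chunk_segments` so click-to-jump timestamps refer to the original recording.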
7. Costs, limits, and the traps that will bite you
- 25 MB upload cap on the hosted API. Compress to 16 kHz mono or chunk with ffmpeg.
- File extension must match the container. Safari records mp4, not webm; name uploads accordingly or the API returns "Audio file might be corrupted".
- OGG/Opus from WhatsApp/Telegram is often rejected. Transcode to WAV or MP3 first.
- Never call OpenAI from the browser. The key leaks. Proxy through your backend.
- Set language when you know it. Auto-detect is fine but slower and occasionally wrong on short clips.
- Cache transcripts by file hash. Judges will re-run your demo; billing twice for the same audio is avoidable.
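Caching by file hash takes only a few lines. In this sketch, `cached_transcribe` and the `.transcript_cache` directory are hypothetical names, and `transcribe` stands for any function that maps a file path to text, such as a wrapper around the hosted call.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Union


def file_sha256(path: Path) -> str:
    """Hash the file contents in 1 MiB blocks to keep memory use flat."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def cached_transcribe(
    path: Union[str, Path],
    transcribe: Callable[[str], str],
    cache_dir: Union[str, Path] = ".transcript_cache",
) -> str:
    """Return a cached transcript if these exact bytes were seen before."""
    path = Path(path)
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    entry = cache / f"{file_sha256(path)}.json"
    if entry.exists():
        return json.loads(entry.read_text(encoding="utf-8"))["text"]
    text = transcribe(str(path))  # cache miss: this is the billed call
    entry.write_text(json.dumps({"text": text}), encoding="utf-8")
    return text
```

Because the key is the content hash rather than the filename, a re-uploaded or renamed copy of the same recording is still a cache hit.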