Technical Guide

Implementing OpenAI Whisper in a hackathon build.

Node.js and Python patterns, API vs self-hosted tradeoffs, and two ship-ready recipes: a voice-controlled interface and a meeting summarizer.

Speech-to-text is one of the highest-signal features you can ship in six hours. Whisper — OpenAI's open-source speech model — is the default choice: multilingual, robust to noise, and available both as a hosted API and as downloadable weights. This guide is the shortest path from an empty repo to a working Whisper integration.

1. Pick your Whisper: API vs self-hosted

You have three practical options. During a hackathon, the hosted API wins nine times out of ten.

Hosted API

whisper-1 or gpt-4o-transcribe via a single HTTP call. ~$0.006/min. Zero infra. Use this by default.

whisper.cpp / faster-whisper

Runs on CPU or a modest GPU. No per-minute fee. Great for batch pipelines or an air-gapped demo, but you pay in setup time.

Replicate / HF Inference

Serverless GPU wrapper around the open weights. A middle ground when you want the large-v3 model without owning a GPU.

2. Node.js: batch transcription in ~15 lines

The hosted API accepts a multipart upload. From a server route, stream the file straight in — no base64, no chunking.

import OpenAI from "openai";
import { createReadStream } from "node:fs";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function transcribe(path: string) {
  const res = await openai.audio.transcriptions.create({
    file: createReadStream(path),
    model: "whisper-1",
    // Optional: response_format: "verbose_json" for segment timestamps
    // (add timestamp_granularities: ["word"] for word-level timestamps)
    // Optional: language: "en" to skip auto-detect and cut latency
  });
  return res.text;
}

3. Python: the same call, plus timestamps

from openai import OpenAI
client = OpenAI()

def transcribe(path: str):
    with open(path, "rb") as f:
        res = client.audio.transcriptions.create(
            file=f,
            model="whisper-1",
            response_format="verbose_json",
            timestamp_granularities=["segment"],
        )
    return res.text, res.segments

For self-hosted, swap the client for faster-whisper. The call is similar in shape, though it returns an iterator of segments rather than a single response object. On a laptop GPU the base model transcribes faster than real time.
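A minimal sketch of the self-hosted path, assuming `pip install faster-whisper`. The helper names `load_model` and `transcribe_local`, the "base" model size, and the CPU/int8 settings are illustrative choices, not recommendations from the library. The model is passed in as an argument so the function can be exercised without loading weights.

```python
from typing import Any


def load_model(size: str = "base") -> Any:
    """Load a faster-whisper model (needs `pip install faster-whisper`)."""
    # Imported lazily so the rest of this module works without the package.
    from faster_whisper import WhisperModel

    # int8 on CPU is an illustrative default; try device="cuda" on a GPU.
    return WhisperModel(size, device="cpu", compute_type="int8")


def transcribe_local(model: Any, path: str) -> tuple[str, list[dict]]:
    """Return (full_text, segments), roughly mirroring the hosted call."""
    # transcribe() returns a lazy iterator of segments plus an info object;
    # iterating the segments is what actually runs the model.
    segments, _info = model.transcribe(path)
    out = [{"start": s.start, "end": s.end, "text": s.text} for s in segments]
    text = "".join(seg["text"] for seg in out).strip()
    return text, out
```

Typical use would be `text, segments = transcribe_local(load_model(), "talk.mp3")`, loading the model once at startup rather than per request.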

4. Real-time-ish transcription in the browser

True streaming is not available for whisper-1: the hosted endpoint is batch. The trick that wins demos is rolling windows: record short WAV chunks on the client, POST each one, and concatenate the results in order.

  • Use the Web Audio API to capture PCM, not MediaRecorder with a timeslice — timeslice fragments have no container header on chunks 2+ and Whisper rejects them.
  • Encode each ~3-second window as a self-contained 16 kHz mono WAV.
  • Upload to your backend server function, not directly to OpenAI — keep the API key server-side.
  • Stitch responses in order; render partial text as it arrives.
  • Guard against silent windows — a mic that never opens produces a header-only WAV that returns 400.
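The windowing rules above can be sketched as three small helpers. This is shown in Python for consistency with the rest of the guide; in the browser the encoding step would be written against the Web Audio API instead. The function names and the silence threshold of 200 are assumptions for illustration, and `is_silent` assumes 16-bit mono input.

```python
import io
import struct
import wave

SAMPLE_RATE = 16_000  # 16 kHz mono, as recommended above


def pcm_to_wav(samples: list[int]) -> bytes:
    """Wrap 16-bit PCM samples in a self-contained mono WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample = 16-bit
        w.setframerate(SAMPLE_RATE)
        w.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buf.getvalue()


def is_silent(wav_bytes: bytes, threshold: int = 200) -> bool:
    """True for header-only files or windows whose peak is below threshold."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        n_frames = w.getnframes()
        frames = w.readframes(n_frames)
    if n_frames == 0:
        return True  # the mic never opened: nothing worth uploading
    samples = struct.unpack(f"<{len(frames) // 2}h", frames)
    return max(abs(s) for s in samples) < threshold


def stitch(parts: dict[int, str]) -> str:
    """Join per-window transcripts by sequence number, skipping empty ones."""
    ordered = (parts[i].strip() for i in sorted(parts))
    return " ".join(p for p in ordered if p)
```

Keying responses by a client-assigned sequence number keeps the text in order even when uploads finish out of order, and checking `is_silent` before uploading avoids the 400 on empty windows.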

If you need sub-second latency (e.g. a voice UI that reacts as the user speaks), reach for gpt-4o-transcribe with SSE streaming, or Deepgram's WebSocket API. Whisper is best when a 2–5 second delay is acceptable.

5. Recipe: voice-controlled interface

A crowd-pleaser demo in about an hour of work:

  1. Push-to-talk button on the UI. Record while held, stop on release.
  2. Upload the WAV to a /api/transcribe server function → Whisper → text.
  3. Feed the text to an LLM with a system prompt like "translate the user's request into a JSON action call from this list".
  4. Execute the action client-side. Speak the confirmation back with the browser's SpeechSynthesis API or an OpenAI TTS call.
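Step 3 is where a demo tends to break, because the model's reply has to be treated as untrusted input. A minimal sketch of the prompt and the validation step follows; the action names in `ALLOWED_ACTIONS` and the `{"action": ..., "args": ...}` reply shape are hypothetical and should be replaced with whatever your UI can actually do.

```python
import json
from typing import Optional

# Hypothetical action list: action name -> required argument names.
ALLOWED_ACTIONS = {
    "set_theme": {"theme"},
    "open_page": {"page"},
    "add_todo": {"title"},
}

SYSTEM_PROMPT = (
    "Translate the user's request into a JSON action call from this list: "
    + ", ".join(sorted(ALLOWED_ACTIONS))
    + '. Reply with JSON only, for example '
    + '{"action": "open_page", "args": {"page": "settings"}}.'
)


def parse_action(llm_reply: str) -> Optional[dict]:
    """Validate the model's reply; return None for anything off-list."""
    try:
        data = json.loads(llm_reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    name = data.get("action")
    args = data.get("args", {})
    if not isinstance(name, str) or name not in ALLOWED_ACTIONS:
        return None
    # Require exactly the expected argument names, nothing more or less.
    if not isinstance(args, dict) or set(args) != ALLOWED_ACTIONS[name]:
        return None
    return {"action": name, "args": args}
```

Returning `None` instead of raising lets the UI fall back to a spoken "Sorry, I didn't catch that" without a crash mid-demo.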

6. Recipe: meeting summarizer

The most requested hackathon Whisper application — and easy to build well:

  1. Accept an audio or video upload (mp3, m4a, mp4, wav). Whisper handles them all.
  2. If the file is larger than 25 MB (the API cap), split it with ffmpeg into 10-minute chunks, re-encoding uncompressed audio so each chunk stays under the cap. Transcribe each chunk and concatenate the results. Do not truncate.
  3. Pass the full transcript to an LLM with a prompt asking for a structured summary: TL;DR, decisions, action items with owners, open questions.
  4. Render the summary alongside the timestamped transcript so users can click to jump to any moment.
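Steps 2 and 4 interact: once the audio is chunked, each chunk's timestamps restart at zero and must be shifted back onto one timeline. A sketch under stated assumptions: it uses ffmpeg's segment muxer, re-encodes to 64 kbps mono MP3 (an illustrative bitrate), and expects each chunk result as a dict with `text`, `segments`, and `duration` keys, a shape modelled on the verbose_json response. The helper names are hypothetical.

```python
import os

CHUNK_SECONDS = 600  # 10-minute chunks


def ffmpeg_split_cmd(src: str, out_dir: str) -> list[str]:
    """Build an ffmpeg command that writes 16 kHz mono MP3 chunks."""
    pattern = os.path.join(out_dir, "chunk_%03d.mp3")
    return [
        "ffmpeg", "-i", src,
        "-vn",                       # drop any video stream
        "-ac", "1", "-ar", "16000",  # mono, 16 kHz
        "-b:a", "64k",               # ~4.8 MB per 10 minutes
        "-f", "segment", "-segment_time", str(CHUNK_SECONDS),
        pattern,
    ]


def merge_chunks(chunks: list[dict]) -> dict:
    """Concatenate chunk transcripts, shifting timestamps onto one timeline."""
    offset = 0.0
    texts, segments = [], []
    for chunk in chunks:  # must already be in chunk order
        texts.append(chunk["text"].strip())
        for seg in chunk["segments"]:
            segments.append({
                "start": seg["start"] + offset,
                "end": seg["end"] + offset,
                "text": seg["text"],
            })
        # Advance by the chunk's reported duration, not a fixed 600 s,
        # because the last chunk is usually shorter.
        offset += chunk["duration"]
    return {"text": " ".join(t for t in texts if t), "segments": segments}
```

Run the split with `subprocess.run(ffmpeg_split_cmd("meeting.mp4", "chunks"), check=True)` after creating the output directory, then transcribe the files in sorted order before merging.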

7. Costs, limits, and the traps that will bite you

  • 25 MB upload cap on the hosted API. Compress to 16 kHz mono or chunk with ffmpeg.
  • File extension must match the container. Safari records mp4, not webm; name uploads accordingly or the API returns "Audio file might be corrupted".
  • OGG/Opus voice notes from WhatsApp/Telegram are often rejected. Transcode to WAV or MP3 first.
  • Never call OpenAI from the browser. The key leaks. Proxy through your backend.
  • Set language when you know it. Auto-detect is fine but slower and occasionally wrong on short clips.
  • Cache transcripts by file hash. Judges will re-run your demo; billing twice for the same audio is avoidable.
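The caching tip takes only a few lines. A minimal sketch, assuming a local directory as the cache (the `.transcript_cache` default is a hypothetical location) and any `transcribe(path) -> str` function, such as a wrapper around the calls shown earlier:

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def file_sha256(path: str) -> str:
    """Hash the file contents in 1 MiB blocks so large uploads stay cheap."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


def cached_transcribe(
    path: str,
    transcribe: Callable[[str], str],
    cache_dir: str = ".transcript_cache",  # hypothetical default location
) -> str:
    """Return a cached transcript if this exact audio was seen before."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    entry = cache / f"{file_sha256(path)}.json"
    if entry.exists():
        return json.loads(entry.read_text(encoding="utf-8"))["text"]
    text = transcribe(path)  # only billed on a cache miss
    entry.write_text(json.dumps({"text": text}), encoding="utf-8")
    return text
```

Keying on the content hash rather than the filename means a re-uploaded or renamed copy of the same recording still hits the cache.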