Is Whisper free to use?

The Whisper model weights are open source and free to self-host under the MIT license. The hosted OpenAI Whisper API is paid — currently around $0.006 per minute of audio — but is dramatically faster and simpler to wire up during a 6-hour hackathon than running your own GPU inference.

Should I use the Whisper API or self-host during a hackathon?

Use the hosted API. It removes GPU provisioning, cold starts, and model download from your critical path. Self-hosting whisper.cpp or faster-whisper only wins when you have pre-recorded, high-volume batches or a strict no-cloud constraint from the judges.

Can Whisper do real-time transcription?

Yes, using short rolling windows. Capture 2–5 second audio chunks in the browser with the Web Audio API (MediaRecorder timeslice fragments after the first one lack a container header), ship them to your backend, transcribe each chunk with Whisper, and stitch the results in order. For true streaming with sub-second latency, OpenAI's gpt-4o-transcribe or Deepgram is a better fit.

OpenAI Whisper Implementation Guide — Audio-to-Text for Hackathons

Speech-to-text is one of the highest-signal features you can ship in six hours. Whisper — OpenAI's open-source speech model — is the default choice: multilingual, robust to noise, and available both as a hosted API and as downloadable weights. This guide is the shortest path from an empty repo to a working Whisper integration.

1. Pick your Whisper: API vs self-hosted

You have three practical options. During a hackathon, the first one, the hosted API, wins nine times out of ten.

Hosted API

whisper-1 or gpt-4o-transcribe via a single HTTP call. ~$0.006/min. Zero infra. Use this by default.

whisper.cpp / faster-whisper

Runs on CPU or a modest GPU. Free per-minute. Great for batch pipelines or an air-gapped demo — but you pay in setup time.

Replicate / HF Inference

Serverless GPU wrapper around the open weights. A middle ground when you want the large-v3 model without owning a GPU.
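
To put the hosted price in perspective, here is a rough cost estimate. It is a sketch that assumes the ~$0.006/min rate quoted above and simple linear billing; the function name is illustrative, and you should check current pricing before budgeting.

```python
# Rate quoted in this guide; an assumption, not a guaranteed current price.
PRICE_PER_MINUTE_USD = 0.006


def estimate_cost_usd(audio_seconds: float, rate: float = PRICE_PER_MINUTE_USD) -> float:
    """Approximate hosted-API cost for a given duration of audio."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    # Convert seconds to minutes, then apply the per-minute rate.
    return round(audio_seconds / 60 * rate, 4)


if __name__ == "__main__":
    # A one-hour meeting recording.
    print(estimate_cost_usd(3600))
```

At that rate an hour of audio costs well under a dollar, which is why the hosted API is rarely the budget problem during a hackathon.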

2. Node.js: batch transcription in ~15 lines

The hosted API accepts a multipart upload. From a server route, stream the file straight in — no base64, no chunking.

import OpenAI from "openai";
import { createReadStream } from "node:fs";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function transcribe(path: string) {
  const res = await openai.audio.transcriptions.create({
    file: createReadStream(path),
    model: "whisper-1",
    // Optional: response_format: "verbose_json" to get segment timestamps;
    // add timestamp_granularities: ["word"] for word-level timestamps
    // Optional: language: "en" to skip auto-detect and cut latency
  });
  return res.text;
}

3. Python: the same call, plus timestamps

from openai import OpenAI
client = OpenAI()

def transcribe(path: str):
    with open(path, "rb") as f:
        res = client.audio.transcriptions.create(
            file=f,
            model="whisper-1",
            response_format="verbose_json",
            timestamp_granularities=["segment"],
        )
    return res.text, res.segments

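For self-hosted, swap the client for faster-whisper. The call shape is similar (load a model, call transcribe on a file path), though it returns a lazy generator of segments rather than a single response object. On a laptop GPU the base model transcribes faster than real time.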
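
A minimal sketch of the self-hosted variant, assuming the faster-whisper package is installed. The device and compute_type values are assumptions for a CPU-only laptop, and join_segments is an illustrative helper, not part of the library.

```python
from typing import Iterable, List, Tuple


def join_segments(segments: Iterable) -> Tuple[str, List[tuple]]:
    """Collapse segment objects (anything with .start, .end, .text) into
    the full text plus a list of (start, end, text) rows."""
    rows = [(s.start, s.end, s.text.strip()) for s in segments]
    return " ".join(text for _, _, text in rows), rows


def transcribe_local(path: str, model_size: str = "base"):
    # Imported lazily so join_segments works without the package installed.
    from faster_whisper import WhisperModel

    # Assumed settings for CPU; on a CUDA GPU, device="cuda" with
    # compute_type="float16" is a common choice.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # transcribe() returns (segments, info); segments is a lazy generator,
    # so the actual decoding happens while join_segments iterates.
    segments, _info = model.transcribe(path)
    return join_segments(segments)
```

The first call downloads the model weights, so run it once before the demo rather than on stage.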

4. Real-time-ish transcription in the browser

whisper-1 has no streaming mode: the hosted endpoint takes a complete file and returns a complete transcript. The trick that wins demos is rolling windows: record short WAV chunks on the client, POST each one, and concatenate the results.

Use the Web Audio API to capture PCM, not MediaRecorder with a timeslice — timeslice fragments have no container header on chunks 2+ and Whisper rejects them.
Encode each ~3-second window as a self-contained 16 kHz mono WAV.
Upload to your backend server function, not directly to OpenAI — keep the API key server-side.
Stitch responses in order; render partial text as it arrives.
Guard against silent windows — a mic that never opens produces a header-only WAV that returns 400.
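
The last two steps can be sketched on the server side as follows. This assumes each upload carries a sequence number assigned by the client; has_audio and Stitcher are illustrative names, and the 0.2-second threshold is an arbitrary starting point.

```python
import io
import wave


def has_audio(wav_bytes: bytes, min_seconds: float = 0.2) -> bool:
    """Return False for header-only, truncated, or near-empty WAV uploads,
    so they can be skipped instead of sent to the API."""
    try:
        with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
            duration = wav.getnframes() / float(wav.getframerate())
    except (wave.Error, EOFError, ZeroDivisionError):
        return False
    return duration >= min_seconds


class Stitcher:
    """Reassemble per-window transcripts in sequence order, even when
    responses come back out of order."""

    def __init__(self) -> None:
        self._parts = {}  # sequence number -> transcript text

    def add(self, seq: int, text: str) -> str:
        self._parts[seq] = text.strip()
        return self.text()

    def text(self) -> str:
        ordered = (self._parts[k] for k in sorted(self._parts))
        return " ".join(part for part in ordered if part)
```

Call has_audio before forwarding a window, then pass each response to Stitcher.add and render the returned string as the current partial transcript.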

If you need sub-second latency (e.g. a voice UI that reacts as the user speaks), reach for gpt-4o-transcribe with SSE streaming, or Deepgram's WebSocket API. Whisper is best when a 2–5 second delay is acceptable.
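
A hedged sketch of the streaming path with the OpenAI Python SDK. It assumes the transcription endpoint accepts stream=True for gpt-4o-transcribe and emits events whose type is "transcript.text.delta" with a delta field; verify both against the current API reference before relying on it.

```python
from typing import Iterable, Iterator


def iter_deltas(events: Iterable) -> Iterator[str]:
    """Yield text fragments from transcription stream events,
    ignoring any event that is not a text delta."""
    for event in events:
        if getattr(event, "type", None) == "transcript.text.delta":
            yield event.delta


def stream_transcript(path: str) -> str:
    # Imported lazily so iter_deltas can be used without the SDK installed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    parts = []
    with open(path, "rb") as f:
        events = client.audio.transcriptions.create(
            file=f,
            model="gpt-4o-transcribe",
            response_format="text",
            stream=True,  # assumed to be supported for this model
        )
        for piece in iter_deltas(events):
            print(piece, end="", flush=True)  # show partial text as it arrives
            parts.append(piece)
    return "".join(parts)
```

Note that this streams the output of a finished upload; it does not stream microphone input, so the client still has to deliver audio in short pieces.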

5. Recipe: voice-controlled interface

A crowd-pleaser demo in about an hour of work:

Push-to-talk button on the UI. Record while held, stop on release.
Upload the WAV to a /api/transcribe server function → Whisper → text.
Feed the text to an LLM with a system prompt like "translate the user's request into a JSON action call from this list".
Execute the action client-side. Speak the confirmation back with the browser's SpeechSynthesis API or an OpenAI TTS call.
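
Between the LLM and the action executor, it helps to validate the model's output before running anything. A minimal sketch, assuming the prompt asks for JSON of the form {"action": ..., "args": {...}}; the action names and that JSON shape are hypothetical.

```python
import json

# Hypothetical action list; replace with whatever your UI actually supports.
ALLOWED_ACTIONS = frozenset({"play_music", "set_timer", "toggle_lights"})


def parse_action(llm_output: str, allowed=ALLOWED_ACTIONS):
    """Return (name, args) if the model emitted a valid action, else None."""
    try:
        data = json.loads(llm_output)
    except json.JSONDecodeError:
        return None  # the model replied with prose or malformed JSON
    if not isinstance(data, dict):
        return None
    name = data.get("action")
    args = data.get("args", {})
    # Reject anything outside the allow-list instead of trusting the model.
    if not isinstance(name, str) or name not in allowed:
        return None
    if not isinstance(args, dict):
        return None
    return name, args
```

When parse_action returns None, ask the user to repeat the request rather than guessing.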

6. Recipe: meeting summarizer

The most requested hackathon Whisper application — and easy to build well:

Accept an audio or video upload (mp3, m4a, mp4, wav). Whisper handles them all.
If the file is larger than 25 MB (the API cap), split it with ffmpeg into 10-minute chunks, transcribe each, and concatenate. Do not truncate.
Pass the full transcript to an LLM with a prompt asking for a structured summary: TL;DR, decisions, action items with owners, open questions.
Render the summary alongside the timestamped transcript so users can click to jump to any moment.
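
The split-and-merge step can be sketched like this, assuming ffmpeg is on the PATH. The output naming and the 64 kbit/s mono MP3 encoding are assumptions; at that bitrate a 10-minute chunk is about 4.8 MB, well under the cap. merge_transcripts expects segments as plain dicts, so convert SDK response objects first.

```python
import subprocess
from pathlib import Path

CHUNK_SECONDS = 600  # 10-minute chunks, as suggested above


def build_split_command(src: str, out_dir: str, chunk_seconds: int = CHUNK_SECONDS) -> list:
    """Build an ffmpeg command that drops video, downmixes to 16 kHz mono,
    and writes fixed-length MP3 chunks."""
    pattern = str(Path(out_dir) / "chunk_%03d.mp3")
    return [
        "ffmpeg", "-i", src,
        "-vn", "-ac", "1", "-ar", "16000", "-b:a", "64k",
        "-f", "segment", "-segment_time", str(chunk_seconds),
        pattern,
    ]


def split_audio(src: str, out_dir: str, chunk_seconds: int = CHUNK_SECONDS) -> list:
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_split_command(src, out_dir, chunk_seconds), check=True)
    return sorted(Path(out_dir).glob("chunk_*.mp3"))


def merge_transcripts(chunks, chunk_seconds: int = CHUNK_SECONDS):
    """chunks: list of (text, segments) in chunk order, where each segment is
    a dict with "start", "end", "text". Returns the full text and segments
    shifted to (approximately) absolute timestamps."""
    texts, merged = [], []
    for index, (text, segments) in enumerate(chunks):
        offset = index * chunk_seconds
        texts.append(text.strip())
        for seg in segments:
            merged.append({
                "start": seg["start"] + offset,
                "end": seg["end"] + offset,
                "text": seg["text"],
            })
    return " ".join(t for t in texts if t), merged
```

The shifted timestamps are what make click-to-jump work across chunk boundaries; they are approximate because ffmpeg cuts on frame boundaries.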

7. Costs, limits, and the traps that will bite you

25 MB upload cap on the hosted API. Compress to 16 kHz mono or chunk with ffmpeg.
File extension must match the container. Safari records mp4, not webm; name uploads accordingly or the API returns "Audio file might be corrupted".
OGG/Opus voice notes from WhatsApp/Telegram are often rejected. Transcode to WAV or MP3 first.
Never call OpenAI from the browser. The key leaks. Proxy through your backend.
Set language when you know it. Auto-detect is fine but slower and occasionally wrong on short clips.
Cache transcripts by file hash. Judges will re-run your demo; billing twice for the same audio is avoidable.
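
The caching tip can be sketched as a small wrapper around whatever transcribe function you already have. This assumes a local directory is an acceptable cache; the directory name and JSON layout are illustrative.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def file_sha256(path: str) -> str:
    """Hash the file in 1 MiB blocks so large uploads are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def cached_transcribe(path: str, transcribe: Callable[[str], str],
                      cache_dir: str = ".transcripts") -> str:
    """Return a cached transcript when the same bytes were seen before;
    otherwise call transcribe(path) once and store the result."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    entry = cache / f"{file_sha256(path)}.json"
    if entry.exists():
        return json.loads(entry.read_text(encoding="utf-8"))["text"]
    text = transcribe(path)
    entry.write_text(json.dumps({"text": text}), encoding="utf-8")
    return text
```

Because the key is the content hash rather than the filename, re-uploading the same recording under a different name still hits the cache.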

