Comments on: Running Whisper AI for Real-Time Speech-to-Text on Linux

By: Ravi Saive

Ravi Saive — Tue, 27 Jan 2026 06:13:00 +0000

In reply to irmhild.

@irmhild,

Thanks for the update. You’re actually very close.

By “quasi live,” I meant that Whisper can’t transcribe speech word by word in real time. Instead, you record audio continuously, collect a few seconds of speech, send that chunk to Whisper, show the text, and repeat. Saving to a file was just to confirm that your microphone audio is good, and since that worked, Whisper itself is fine.

The limitations come mainly from Whisper, not Linux. Whisper is designed for full audio segments, so if the audio chunks are too short or mostly silence, it often returns no text. The “streaming logic” is just the code that handles recording and chunking audio before sending it to the model.

If you want something more naturally real-time, you could also try tools like whisper.cpp (streaming versions) or Vosk, which are built more for continuous speech recognition.

So your setup isn’t broken. It’s just the audio buffering part that needs adjustment.
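To make the record, collect, transcribe, repeat cycle concrete, here is a minimal sketch of such a loop. It is not the script from the article: the function name transcribe_in_chunks, the 5-second default, and the injected transcribe callable are all assumptions made for illustration. The audio source is any iterable of small sample frames, so the buffering can be tried without a microphone.

```python
from typing import Callable, Iterable, Iterator, List, Sequence

SAMPLE_RATE = 16_000  # Whisper models work on 16 kHz audio


def transcribe_in_chunks(
    frames: Iterable[Sequence[float]],
    transcribe: Callable[[List[float]], str],
    chunk_seconds: float = 5.0,  # assumed value, tune for your setup
    sample_rate: int = SAMPLE_RATE,
) -> Iterator[str]:
    """Collect small frames into chunk_seconds-long chunks and transcribe each."""
    chunk_size = int(chunk_seconds * sample_rate)
    buffer: List[float] = []
    for frame in frames:
        buffer.extend(frame)
        # Hand over a chunk only once enough audio has accumulated.
        while len(buffer) >= chunk_size:
            chunk, buffer = buffer[:chunk_size], buffer[chunk_size:]
            text = transcribe(chunk).strip()
            if text:  # skip chunks for which the model returned nothing
                yield text
    # Flush whatever is left when the stream ends.
    if buffer:
        text = transcribe(buffer).strip()
        if text:
            yield text


# With Whisper, the callable could wrap the model roughly like this
# (assumes numpy is imported as np and a model is already loaded):
#   lambda chunk: model.transcribe(
#       np.asarray(chunk, dtype=np.float32), language="de")["text"]
```

Passing the transcriber in as a callable keeps the buffering logic separate from the model, so the same loop can be tested with a stand-in function before wiring up the microphone.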

By: irmhild

irmhild — Mon, 26 Jan 2026 10:31:14 +0000

In reply to Ravi Saive.

Dear Ravi,

Thank you for your reply. After returning from holiday, I tried again today. As you suggested, I saved a file, but I’m not sure what to do next to transcribe in a quasi “live” mode. Could you clarify what you meant? It works in transcription mode, but I’m not sure that’s what you were referring to.

I also have another question. You mention several possible limitations of real-time transcription. Are these limitations related to Whisper itself, your script, Python, or Linux? Where does the “streaming logic” come from? Do you know of any alternative solutions for real-time transcription that I could try?

By: Ravi Saive

Ravi Saive — Fri, 09 Jan 2026 06:15:05 +0000

In reply to irmhild.

@irmhild,

Thank you for your kind words, and I’m glad to hear you were able to resolve the initial error and successfully transcribe audio files in German.

Regarding the real-time transcription issue: what you are seeing is a common limitation rather than a configuration mistake. model.transcribe() is designed for complete audio segments, not for continuous real-time streams. If the incoming audio buffer is too short, contains mostly silence, or is not finalized, Whisper may simply return no text without raising an error.

A few points to check:

Make sure audio_data actually contains speech and not just silence. Whisper will output nothing if the audio energy is too low.

Real-time transcription typically requires buffering audio into longer chunks (e.g., several seconds) before calling transcribe(). Calling it too frequently on small frames often results in empty output.

Ensure the audio is sampled at 16 kHz (or properly resampled), mono, and normalized to the expected float range.

For real-time use, many implementations use a loop that accumulates audio, applies a voice-activity check, and only then calls transcribe().
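The format and silence checks above can be sketched as follows. This is only an illustration under stated assumptions: it expects 16-bit PCM input that was already captured at 16 kHz (no resampling is done), the helper names are made up for this example, and the 0.01 RMS threshold is an arbitrary starting value. A plain energy check is also a crude stand-in for a real voice-activity detector.

```python
import numpy as np


def to_mono_float32(pcm: np.ndarray) -> np.ndarray:
    """Convert PCM shaped (samples,) or (samples, channels) to mono float32.

    int16 input is scaled into [-1, 1]; float input is assumed to be
    in that range already.
    """
    audio = pcm.astype(np.float32)
    if audio.ndim == 2:
        audio = audio.mean(axis=1)  # average the channels down to mono
    if pcm.dtype == np.int16:
        audio /= 32768.0
    return audio


def rms(audio: np.ndarray) -> float:
    """Root-mean-square level of the audio; 0.0 for an empty buffer."""
    if audio.size == 0:
        return 0.0
    return float(np.sqrt(np.mean(np.square(audio))))


def looks_like_speech(audio: np.ndarray, threshold: float = 0.01) -> bool:
    """Crude energy gate: True if the buffer is louder than the threshold."""
    return rms(audio) >= threshold
```

In a capture loop, a buffer would go through to_mono_float32() first and be passed to transcribe() only when looks_like_speech() returns True.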

Since file-based transcription works for you in German, language support is not the issue. The problem is almost certainly related to how the live audio is captured, buffered, or passed to the model.

I would recommend testing by saving a few seconds of your “real-time” audio to a file and transcribing that file. If that works, the issue is confirmed to be in the streaming logic rather than Whisper itself.
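One way to run that test is to dump the captured samples to a WAV file with Python’s standard wave module. The sketch below assumes the samples are floats in the range [-1, 1] at 16 kHz mono; the function name save_wav and the file name are placeholders, not part of the original script.

```python
import struct
import wave
from typing import Sequence


def save_wav(path: str, samples: Sequence[float], sample_rate: int = 16_000) -> None:
    """Write float samples in [-1, 1] as a 16-bit mono WAV file."""
    # Clip out-of-range values, then scale to the signed 16-bit range.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


# Example with the buffer from the real-time script (name assumed):
#   save_wav("debug_capture.wav", audio_data.flatten().tolist())
```

The resulting file can then be transcribed the same way as any other audio file, for example with model.transcribe("debug_capture.wav", language="de"). It is also worth playing the file back: if it sounds silent or distorted, the problem is in the capture step.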

I hope this helps, and please feel free to share more details about your audio capture setup if you need further assistance.

By: irmhild

irmhild — Thu, 01 Jan 2026 12:22:57 +0000

Dear Ravi,

First of all, thank you for this great work, and all the best to you in 2026.

My question: I am deaf and would like to use real-time transcription in German. After some initial trouble (specifically the ValueError: need at least one array to concatenate, which I fixed using your suggested check), everything works as expected: whisper --help works, and I can also get a transcript from an audio file in German.

However, when I try real-time transcription (using, of course, result = model.transcribe(audio_data.flatten(), language="de")), I get no output at all: no text, nothing. I have tried waiting for some time, but still nothing happens. Do you have any idea what might be going wrong?

Thank you very much in advance!

By: Ravi Saive

Ravi Saive — Thu, 20 Nov 2025 04:28:27 +0000

In reply to Paolo.

@Paolo,

Thanks for the update!

Really appreciate you adding all these options. I’ll give them a spin and let you know my thoughts.

Great work!