How to Convert Audio to Text with OpenAI Whisper in Python

In this Python tutorial I’m going to show you how to do speech to text in just three lines of code using the Hugging Face Transformers pipeline with OpenAI Whisper. Automatic speech recognition is far from a simple task, but since Whisper came out, many applications have been built on it because it is remarkably accurate and multilingual. I’ll use the Transformers pipeline to download the Whisper medium model from the Hugging Face Hub and run speech to text in a simple Google Colab notebook.

This walkthrough was prompted by an update announcing that OpenAI Whisper had been added to Transformers and the Hugging Face Model Hub. Whisper is a speech recognition Transformer model trained on 680,000 hours of multilingual audio. I put together a quick tutorial to make it easy to try.

Whisper Speech-to-Text with Transformers

I’m starting with a Google Colab notebook. Use a GPU runtime if you can, since inference is faster on a GPU, but a CPU still works fine; I’ll show what to change if you don’t have a GPU.

If you want a complete end-to-end example that mirrors this flow, see a step-by-step project with Whisper in Python.

Environment and installation

First, I check for a GPU with `nvidia-smi` to confirm I have a Tesla T4 machine. That’s good to go for faster inference. If you don’t have a GPU, you can still proceed.

Next, I install the Transformers library directly from the Hugging Face GitHub repository to get the latest version. In a deployment you won’t repeat this step by hand; you’ll pin the dependency in requirements.txt or a similar config file.

Create the Whisper Speech-to-Text pipeline

Import the pipeline with `from transformers import pipeline`. If you’re familiar with the pipeline API, you know you can specify a task such as sentiment analysis, text classification, or summarization. Here the task is `automatic-speech-recognition`.
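Under stated assumptions (transformers is installed, and the model weights download from the Hugging Face Hub on first use), the setup can be sketched like this, wrapped in a helper so the large download only happens when you actually call it:

```python
from transformers import pipeline

def build_asr(model: str = "openai/whisper-medium", device: int = -1):
    """Create a Whisper speech-to-text pipeline.

    device=-1 (the default) runs on CPU; pass device=0 to use the first GPU.
    The model weights are fetched from the Hugging Face Hub on first call.
    """
    return pipeline("automatic-speech-recognition", model=model, device=device)
```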

The Hugging Face pipeline API already supported ASR, but it couldn’t use OpenAI Whisper until recently. Now that Whisper checkpoints such as medium and large are available on the Hub, we can set both the task and the exact model, for example `openai/whisper-medium`.

I’ve found Whisper medium to be a good trade-off between the small models (tiny and base) and large; you don’t always need large. I also set `device=0` because I have a GPU; if you don’t have one, just omit the device argument and the pipeline will default to CPU.
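A small sketch (assuming torch is installed, which it is in Colab) that picks the device automatically instead of hard-coding `device=0`:

```python
import torch

# 0 selects the first GPU; -1 is the pipeline's CPU default.
device = 0 if torch.cuda.is_available() else -1
```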

Input audio for Whisper Speech-to-Text

We need an input audio file to transcribe. I pick a movie dialogue clip, copy the audio address, and download it in Colab with `wget` so it’s saved as `audio.mp3`. Once the clip is saved, I preview it to confirm it works.
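If you’d rather stay in Python than shell out to `wget`, a stdlib equivalent looks like this (the URL here is a hypothetical placeholder; substitute the address you copied):

```python
import urllib.request

# Hypothetical placeholder URL; replace it with the address of your clip.
AUDIO_URL = "https://example.com/movie-dialogue.mp3"

def download_audio(url: str, path: str = "audio.mp3") -> str:
    """Save the remote audio file locally, mirroring the wget step in Colab."""
    urllib.request.urlretrieve(url, path)
    return path
```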

With the file ready, I pass `audio.mp3` to the Whisper pipeline and print the text output. The result says, “Starting tonight people will die, I’m a man of my word,” which matches the clip.
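The pipeline call returns a dict with a `text` key. As a sketch, with a stub standing in for the real pipeline so the return shape is visible without downloading the model:

```python
def transcribe(asr, path: str = "audio.mp3") -> str:
    """Run an ASR pipeline on one file and return the transcript string.

    Assumption: `asr` is a pipeline("automatic-speech-recognition", ...)
    object, which returns a dict of the form {"text": "..."}.
    """
    return asr(path)["text"]

# Stub illustrating the return shape; the real object is built with
# pipeline("automatic-speech-recognition", model="openai/whisper-medium").
fake_asr = lambda path: {"text": "Starting tonight people will die, I'm a man of my word"}
print(transcribe(fake_asr))
```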

For more practical uses, see how this model works for subtitle generation in using Whisper for video subtitle captioning.

A second noisy Whisper Speech-to-Text example

I try another line: “This town deserves a better class of criminal, and I’m going to give it to them.” I download it the same way, run the pipeline again, and check the output. It transcribes “criminal” as “crew” and repeats “to,” which is understandable because the audio is quite noisy and I didn’t do any preprocessing.

The takeaway is that Whisper still does well in tough conditions, but noise can affect certain words. Preprocessing can help if you need higher accuracy in very noisy clips.

Choose the right Whisper Speech-to-Text model

You can switch models based on your use case: change the model name in the pipeline call to `tiny`, `base`, `small`, `medium`, or `large`. Choose based on where you deploy and whether you can run a smaller model on CPU or need the accuracy boost of a larger one.
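One way to keep the switch tidy is a small lookup from size name to Hub checkpoint (these checkpoint ids all live under the `openai/` namespace on the Hugging Face Hub):

```python
WHISPER_CHECKPOINTS = {
    "tiny": "openai/whisper-tiny",
    "base": "openai/whisper-base",
    "small": "openai/whisper-small",
    "medium": "openai/whisper-medium",
    "large": "openai/whisper-large",
}

def checkpoint_for(size: str) -> str:
    """Map a size name to its Hugging Face Hub model id."""
    return WHISPER_CHECKPOINTS[size]
```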

If you specifically need an English workflow with Hugging Face, here’s a focused resource on how to convert English speech to text with Hugging Face.

Three-line Whisper Speech-to-Text summary

Import the pipeline from Transformers. Create the pipeline with the ASR task, set the model to `openai/whisper-medium`, and set `device=0` if you have a GPU. Call the pipeline on your audio file and you have the text output.

Resources

Open the Colab notebook here: Whisper Speech-to-Text Colab.

Final Thoughts

Using the Hugging Face Transformers pipeline with OpenAI Whisper, you can get state-of-the-art speech recognition in just a few lines. It works on CPU, runs faster on GPU, and supports multiple languages. Pick the model size that fits your needs, give it an audio file, and read the transcript.
