[Tutorial] Using Whisper to Transcribe Oral Interviews

Tips for getting started, and some thoughts on how to use it

by Yacine Chitour and Julien Boelaert

Recorded interviews are a classic element in the social scientific toolbox. They may be used to collect oral history, during the course of ethnographic research, or for panel interviews. And while transcription can sometimes yield serendipitous insights, it is also notoriously tedious work.

There are, of course, dozens of transcription services that claim to automatically render audio interviews into text, yet their results are far from perfect. Most yield error rates of over 55% for spontaneous discussions [1], and some come with a hefty price tag.

They perform even worse when the recording is marred by background noise, a quite ordinary situation for ethnographers, who ply their craft in coffee shops, parks, or an interviewee’s living room with the TV playing in the background. In such cases, manual transcription is often hard to avoid.

This tutorial introduces a free tool to transcribe interviews fairly quickly and with remarkable quality, even in such “noisy” scenes. OpenAI’s Whisper is a free automatic speech recognition software. It is based on an artificial neural network trained on several hundred thousand hours of transcribed recordings, in dozens of different languages (details available here). Six model “sizes” are available, from “tiny” to “large-v2”; transcription quality increases with the size of the model, as does computing time.

This tool is not a substitute for careful listening, but overall, it brings down the time spent typing text: once the text has been transcribed by the machine, all that is left to do is listen to the interview and check for transcription errors, or assign each statement to a speaker.

1. How does it work?

Good news: it takes only a few lines of code. 

In this tutorial, we provide a script written in Python. Although it is not necessary to master Python, you will need to have Anaconda installed on your computer (or Python on Linux and MacOS). This will allow you to run the script in applications like Spyder or in a Jupyter notebook. Before running the script in Python, you will have to run a few command lines in the conda terminal (or the default terminal on Linux and MacOS) in order to install whisper and ffmpeg. All the command lines to do so are detailed on the OpenAI GitHub repository [2].
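For reference, the installation typically boils down to two commands. The exact ffmpeg line depends on your operating system, so treat this as a sketch and check the repository for the current instructions:

```shell
# Install Whisper from PyPI (run in the conda terminal, or your default terminal)
pip install -U openai-whisper

# Install ffmpeg -- pick the line that matches your system:
sudo apt install ffmpeg                  # Ubuntu / Debian
# brew install ffmpeg                    # macOS with Homebrew
# conda install -c conda-forge ffmpeg    # any system with conda
```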

We then load the whisper library, and define the location of the recorded interview on the hard drive:            

import whisper

# Define the path of the audio file to transcribe
interview = "/path/to/your/interview.mp3"
whisper_model = "large-v2"  # Size of the transcription model

Once we have specified the location of the interview, we can load the whisper model (which will be downloaded on the first execution), and launch the transcription:

# (Down)load the model
print("Loading the model")
model = whisper.load_model(whisper_model)

# Transcription
print("Transcription started")

transcription = model.transcribe(interview)

print("Transcription completed")

Now we just need to create a text file to save the entire transcription, timestamped at each segment delimited by the model — which makes it easier to identify the speakers, or to focus on certain interview excerpts only:

# A function to timestamp the speech segments:
def convert(seconds):
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

# Path to save the interview transcript in .txt format
interview_transcript = "/path/to/your/interview.txt"

# Save the transcript
with open(interview_transcript, 'w', encoding='utf-8') as f:
    for segment in transcription["segments"]:
        start_time = convert(segment['start'])
        end_time = convert(segment['end'])
        f.write(f"{start_time} - {end_time}: {segment['text']}\n")
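Since Whisper is also used for subtitling, the same segment structure can be exported to the SubRip (.srt) format. Here is a minimal sketch; the `srt_timestamp` and `to_srt` helpers are our own, not part of the whisper library:

```python
def srt_timestamp(seconds):
    """Format a duration in seconds as HH:MM:SS,mmm (SubRip convention)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments):
    """Build the full .srt text from a list of Whisper segments."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```

You would then write `to_srt(transcription["segments"])` to a file with an `.srt` extension, which most video players can load as subtitles.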

The timestamped transcript appears in the chosen folder. The result looks like this:

Transcript of an interview with Bernie Sanders, using the tiny model

2. Does it really work?

To assess Whisper’s efficiency, we report the error rates by model used, for two transcriptions: one of a studio interview of Bernie Sanders by Stephen Colbert (5 min 02 s), the other of a Wall Street Journal report on the protests against Macron’s pension reform (4 min 48 s). We then compare transcription times on two different PCs (one with a rather slow processor, the other with a faster one) and on a Google Colab notebook, where we use a GPU, which greatly speeds up the process.

Despite the loud background noise in the WSJ report, the word error rate [3] of the transcription with the large-v2 model is less than 10% (Table 1). We can also see a significant difference between the studio interview and the WSJ report, where sound is frequently recorded outdoors:

Table 1: Word error rate (WER) by Whisper model, for Bernie Sanders’ interview (3,026 words, studio) and the report on France’s pension reforms (672 words, outdoor)
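For reference, the WER used here is the word-level edit distance between the reference transcript and Whisper’s output, divided by the number of words in the reference. A minimal illustration (this helper is our own, not what the website linked in the footnote runs):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Classic dynamic-programming edit distance, computed over words
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            cur = d[j]
            d[j] = min(d[j] + 1,            # deletion
                       d[j - 1] + 1,        # insertion
                       prev + (r != h))     # substitution (or match)
            prev = cur
    return d[-1] / len(ref)
```

For example, dropping one word out of six from “the cat sat on the mat” gives a WER of 1/6, about 17%.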

Overall, the error rates of Whisper transcriptions are much lower than those observed for existing commercial software or services. On careful examination of the Whisper transcriptions, we can distinguish various kinds of errors or omissions:

  • Whisper ignores certain fillers or hesitations: “Ummm… errr… There were some regulations that were put into place…” is transcribed “There were some regulations that were put into place…”
  • The quick responses of the interviewee/interviewer interrupting the questions are not always correctly rendered: “Could that lead to a class action lawsuit… // Damn right, it should! // … like it was against the tobacco companies? // Of course, exactly!” becomes “Could that lead to a class action lawsuit like it was against the tobacco companies? // Exactly.”
  • Sometimes, the model adds sentences at the beginning or end of the transcription that were simply not present in the audio. For example, using the medium model, we can read “That was great.” at the beginning of Bernie’s interview transcript. In other cases, the transcript includes an entire passage transcribed twice in a row, which accounts in part for the high error rate in the transcription of the second audio.
  • Whisper misses some parts of the audio when there are shifts in the language used. This explains another part of the high error rate in the second audio, where the speech sometimes switches from English to French — which is translated in this case. “Many see the fights over pensions as the last stand in defense of France’s social welfare model. ‘J’y vois une régression des droits sociaux assortie d’une procédure politique qui est épouvantablement anti-démocratique. Et voilà, je proteste.’” is transcribed “Many see the fights over pensions as the last stand in defense of France’s social welfare model. ‘This is a political procedure that is dreadfully anti-democratic. And I protest.’” — the whole first French clause, about a regression of social rights, is simply dropped.

Notably, Whisper seems to make quite different types of errors in English and in French; grammatical and spelling mistakes appear more prevalent in the latter.

Note that if your computer has a slow processor, transcribing your interviews with the more elaborate Whisper models is not worth the effort (Table 2). Unless you have a fast CPU, access to a personal GPU, or access through an institution, transcribing a one-hour interview with the large-v2 model will take your computer several hours… You can still transcribe your interview with the base model, which strikes a good compromise between transcription speed and transcription quality — you will still have to pay close attention to the output, as it will not be devoid of errors.

Model      Slow CPU (Intel Core i5-6300U @ 2.40 GHz)    Faster CPU (Intel Core i7-8665U @ 1.90 GHz × 8)    GPU (Google Colab)
large-v2   8 h 36 min (× 100)                           7 min 30 s (× 1.5)                                 3 min 20 s (< × 1)
medium     3 h 49 min (× 44)                            4 min 30 s                                         2 min
small      1 h 12 min (× 15)                            2 min 20 s                                         1 min
base       20 min (× 4)                                 2 min                                              30 s
tiny       9 min (× 1.8)                                50 s                                               24 s

Table 2: Transcription time of a report in French (4 min 54 s) by processor speed and Whisper model

One workaround is Google Colab: with a personal Google account, you have free access to a GPU on a Colab notebook. This, however, poses major privacy problems regarding the personal data collected during the interview. When you use a Colab notebook, the interview and its transcript are shared with Google, since the notebooks are stored on Drive and their content is collected by Google. In this respect, transcribing an interview collected as part of a social scientific investigation using Google Colab is probably a violation of the European GDPR.

Another option is to use the OpenAI API, which offers online access to its models for a fee. OpenAI’s policy is more ambiguous, but it explicitly states that it reserves the right to “process” users’ “content” (which includes text produced… or audio uploaded online) and personal information (see OpenAI’s Privacy policy, section 9, “International Users”). Obviously, if you use Whisper online to transcribe your interview, then it is stored in OpenAI’s databases, which may not be the most desirable thing if you are working on sensitive material or on closely scrutinized areas.

In any case, OpenAI’s recent problems with the disclosure of personal information call for the utmost caution. However, if you want to transcribe a YouTube video or a press conference, using Colab is not a problem.

Here is an example of how to use it: https://colab.research.google.com/drive/1aHv1JpeqAy7j7f9rLf7dIM_d5jRpbda-?usp=sharing

3. Should I use Whisper? Sure, but…

Whisper is a very efficient transcription tool, which is already used by journalists, and for the automatic subtitling of movies and TV shows. It obviously has several advantages and a few limitations.

On the downside:

  • The most efficient models (medium, large, large-v2) require very long computation times on slow computers. Whisper works very well if you have a powerful processor or a GPU.
  • Whisper does not separate speakers (no speaker diarization). In other words, at the time of writing, the model cannot distinguish between different speakers in the same recording.
  • Moreover, in some languages such as French, Whisper corrects and “smoothes” the respondents’ speech a little, so that the words spoken are systematically rendered in “legitimate French”. In English this seems less frequent, but Whisper sometimes corrects the oral expressions used by the respondents; the “you know…” that punctuate sentences are often deleted.
  • Sometimes, Whisper mistranscribes words that deviate from legitimate language or diction: for example, the transcription of American rap lyrics has not always been successful.

So not only does Whisper “scripturalize” speech [4], but it sometimes ignores certain sociolects. While this effect is quite marked for French speech, it needs to be judged on a case-by-case basis in English.

On the bright side:

  • Whisper recognizes unfinished sentences and marks them with a trailing “…”. In general, Whisper punctuates the interview fairly well. It also recognizes some informal phrases (for instance: “I’ma”, “ain’t”, “kinda”, etc.)
  • Whisper ignores background noise. Even with the TV on next to the microphone, the transcription of an interview is accurate.
  • Moreover, each line of the transcript output corresponds to one speaker and one speaker only, so that the questions are clearly distinct from the respondent’s speech.
  • We have a time-stamped text, which makes it possible to pick up certain interesting sections directly, or to transcribe only certain portions of the interview “by hand”.
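The time-stamped segments also make it easy to pick out a window of interest programmatically. A minimal sketch, assuming the segment dictionaries shown earlier (the `segments_between` helper is ours, not part of whisper):

```python
def segments_between(segments, start_s, end_s):
    """Return the Whisper segments that overlap the [start_s, end_s] window (in seconds)."""
    return [seg for seg in segments
            if seg["end"] > start_s and seg["start"] < end_s]

# Example: keep only what was said between minutes 10 and 15,
# e.g. segments_between(transcription["segments"], 600, 900)
```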

A distinction must therefore be made between two cases:

  1. If you want to transcribe public audio (the recording of a press conference, a podcast, the audio of a YouTube video, or a TV show), then using a Colab notebook along with Whisper is not a problem.
  2. If you are transcribing an interview that contains sensitive data, or information that could compromise you or your respondents, it is best to use the Whisper models on your own computer by running a Python script. With a powerful CPU, you will save the time spent typing the text and going back to listen to particular segments. You will then only have to review the raw text produced by Whisper, attributing each passage to its author, and correcting errors – which means listening to the entire interview once.

4. It also worked on Doja Cat

Transcript of Boss Bitch, Doja Cat, 2min 14, 2020, on Google Colab.

Apart from a few approximations (“high heeled shoes” instead of “high heel shoes”, “Back then till I touch my toes” instead of “Backbend ’til I touch my toes”, or “before” becoming “B-4”, etc.), Whisper handles it well.

You can now transcribe your favorite songs 🙂


[1] Elise Tancoigne, Jean Philippe Corbellini, Gaëlle Deletraz, Laure Gayraud, Sandrine Ollinger and Daniel Valero, « Un mot pour un autre ? Analyse et comparaison de huit plateformes de transcription automatique », Bulletin de Méthodologie Sociologique, vol. 155, no. 1, 2022, p. 61.

[2] Take note that in a Jupyter notebook, you have to run the following lines the very first time: !pip install git+https://github.com/openai/whisper.git, then !pkexec apt install ffmpeg.

[3] We use the Word Error Rate (WER) metric here. On this subject, see Elise Tancoigne, Jean Philippe Corbellini, Gaëlle Deletraz, et al., op. cit., p. 59. The indicator was computed using the following website: https://www.amberscript.com/fr/outil-wer/.

[4] By correcting the “errors” of oral language, the OpenAI product has reignited the debate on the correct way to transcribe an interview in social sciences. In French sociology, see in particular Stéphane Beaud, « Quelques observations relatives au texte de B. Lahire », Critiques sociales, 8, 1996, p. 102-107 and Bernard Lahire’s response, « Du travail d’enquête à l’écriture des paroles des enquêtés : réponse aux interrogations de Stéphane Beaud », ibid., p. 108-114. This discussion provides some elements for making a well-informed choice between the readable “rewriting” of interviews – which is what Whisper seems to be leaning towards – and their “phonetic transcription”, which is more faithful to sociolects.