I can read more quickly than I can listen. I can search a text file but not an audio file. There are many cases when the written word is superior to the spoken word, but sometimes the source data is spoken. Can I extract spoken word data into something written?
Whisper.cpp is a C/C++ port of OpenAI's Whisper speech-to-text model: it takes spoken audio in and writes text out, and it runs comfortably on a Mac laptop. Simon Willison frequently refers to his use of MacWhisper on his blog, so I thought I'd try my hand at a command-line approach to transcribing a YouTube video.
Whisper.cpp is easy to set up, but there's a learning curve. You need the whisper-cpp binary, a model file in a models/ subdirectory, and (on a Mac) the Metal shader file:
$ # Step 1, setup Whisper.cpp
$ brew install whisper-cpp # or install for your machine type
$ mkdir models
$ wget -O models/ggml-base.en.bin https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin?download=true
$ wget -O ggml-metal.metal https://github.com/ggerganov/llama.cpp/raw/master/ggml-metal.metal
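Before moving on, it's worth confirming the model actually downloaded. Here's a small sanity check; check_model is my own helper, not part of whisper.cpp:

```shell
# Sanity check (a sketch): verify the model file exists and is non-empty.
check_model() {
  if [ -s "$1" ]; then
    echo "model ready: $1"
  else
    echo "model missing or empty: $1" >&2
  fi
}
check_model models/ggml-base.en.bin
```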
Now that Whisper.cpp is ready to run, you need to get your input material. In my case, I downloaded a webm video whose audio track was Opus-encoded. Whisper.cpp expects 16 kHz WAV audio, so the file needed a careful but simple conversion:
$ # Step 2, get your input
$ youtube-dl --extract-audio --audio-format best [yt url]
$ ffmpeg -i file_from_youtube_dl_execution.opus -ar 16000 -vn prepared_input.wav
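If you do this often, the conversion is easy to wrap in a shell function. This is a sketch ("prepare" is my own name for it); it echoes the ffmpeg command rather than running it, so you can see how the output filename is derived. Drop the echo to convert for real:

```shell
# Wrap the conversion above; "${1%.*}.wav" swaps the extension for .wav.
prepare() {
  echo ffmpeg -i "$1" -ar 16000 -vn "${1%.*}.wav"
}
prepare file_from_youtube_dl_execution.opus
```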
Now that you have your input and Whisper.cpp is set up, you can generate your transcript. You can run it plainly, like whisper-cpp prepared_input.wav, and copy the output from stdout, or you can dictate how it's output with a format flag: txt, csv, srt, vtt, lrc, or json.
$ # Step 3, extract your transcript
$ whisper-cpp --output-txt prepared_input.wav
$ less prepared_input.wav.txt # read your transcript!
You can supply as many --output-<fmt> flags as you would like and it will output all formats you request. Pretty neat!
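For instance, if you want several formats from one pass, you can build the flag list in a loop and invoke whisper-cpp once. A sketch, echoed as a dry run so you can inspect the command first; remove the echo to execute it:

```shell
# Accumulate one --output-<fmt> flag per desired format, then run once.
flags=""
for fmt in txt srt json; do
  flags="$flags --output-$fmt"
done
echo whisper-cpp$flags prepared_input.wav
```

That single run would leave prepared_input.wav.txt, .srt, and .json beside your input file.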