Council Transcripting CLI

As a productive citizen, I set out to better understand the options for pulling transcripts from City Council meetings on YouTube.

There are a bunch of ways to get the transcripts — you can find them on the City’s website, request them by email, or use YouTube plugins. All of that works. But I wanted something I could run from the command line: just one Python script that lives on my homelab and ideally runs on a cron schedule. A scrappy, minimal setup that helps me actually understand how the YouTube API works.

Result: I built it, and I now use it to transcribe every Council meeting I'm interested in. That's about four a week, each usually two hours long.

At the command line, we deliver one line:

python main.py https://www.youtube.com/watch?v=-VghD-JiO5s --method whisper --rejoin-output 

Let's break it down:

python
First of all, summon Python.

main.py
Call the Python script we wrote.

https://www.youtube.com/watch?v=-VghD-JiO5s
Hand off the URL we want.

--method whisper
This answers the question our script is waiting on: should it use YouTube's own transcripts or OpenAI Whisper to transcribe this video? In this case, we're telling it to use Whisper.

--rejoin-output
Our functions download the audio from the URL, then split it into 15-minute chunks, each of which is transcribed individually. --rejoin-output stitches the chunks back together, with 2-minute overlaps, into one (enormous) text file.
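For the curious, here is a rough sketch of the chunk arithmetic and the stitching step. It is not the code in main.py: the function names plan_chunks and rejoin are made up for illustration, and it assumes that each 15-minute chunk shares its last 2 minutes with the start of the next one, and that the repeated words at a boundary match exactly. In practice two transcriptions of the same audio often differ slightly, so a real rejoin step would probably need fuzzier matching.

```python
def plan_chunks(total_seconds, chunk_seconds=15 * 60, overlap_seconds=2 * 60):
    """Return (start, end) pairs in seconds that cover the whole recording.

    Assumption: consecutive chunks share `overlap_seconds` of audio.
    """
    if overlap_seconds >= chunk_seconds:
        raise ValueError("overlap must be shorter than the chunk length")
    step = chunk_seconds - overlap_seconds  # how far each new chunk advances
    spans = []
    start = 0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end == total_seconds:
            break  # the final chunk reached the end of the audio
        start += step
    return spans


def rejoin(texts, max_overlap_words=400):
    """Join chunk transcripts, dropping words repeated across a boundary.

    Assumption: the overlap is transcribed identically in both chunks,
    so the longest exact suffix/prefix match is the duplicated part.
    """
    merged = []
    for text in texts:
        words = text.split()
        limit = min(len(merged), len(words), max_overlap_words)
        overlap = 0
        for size in range(limit, 0, -1):  # try the longest match first
            if merged[-size:] == words[:size]:
                overlap = size
                break
        merged.extend(words[overlap:])
    return " ".join(merged)
```

Under these assumptions, a two-hour meeting (7,200 seconds) works out to ten chunks, the last one only three minutes long.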

And that's it. Neat and tidy. Easy and clear. Exactly what I set out to do. It works and I use it all the time.

Speaking of "tidy", I soon found that the processing left behind large audio files and processed chunks of transcript. I added a --tidy option to clean things up. Really helpful.
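To make the flags concrete, here is a minimal sketch of how a script like this might declare them with argparse, along with a small cleanup helper. Everything beyond the flag names is an assumption: the "youtube" value and the default for --method, the build_parser and tidy function names, and especially the file patterns, which are placeholders rather than the names main.py actually writes.

```python
import argparse
from pathlib import Path


def build_parser():
    """Declare the command-line options described in this post."""
    parser = argparse.ArgumentParser(
        description="Transcribe a City Council meeting from YouTube."
    )
    parser.add_argument("url", help="YouTube URL of the meeting")
    parser.add_argument(
        "--method",
        choices=["youtube", "whisper"],  # "youtube" is an assumed value name
        default="youtube",               # assumed default
        help="use YouTube's transcript or transcribe the audio with Whisper",
    )
    parser.add_argument(
        "--rejoin-output",
        action="store_true",
        help="stitch the per-chunk transcripts into one text file",
    )
    parser.add_argument(
        "--tidy",
        action="store_true",
        help="delete leftover audio and chunk files when finished",
    )
    return parser


def tidy(work_dir, patterns=("*.mp3", "*.wav", "chunk_*.txt")):
    """Delete leftover files in work_dir and return the paths removed.

    The patterns are hypothetical; adjust them to whatever the
    download and chunking steps really leave behind.
    """
    removed = []
    for pattern in patterns:
        for path in sorted(Path(work_dir).glob(pattern)):
            if path.is_file():
                path.unlink()
                removed.append(path)
    return removed
```

Note that argparse turns --rejoin-output into the attribute args.rejoin_output, so the rest of the script would check that name.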

Improvements are in the works: a much better flow, including, of course, lines broken down by sentence and attributed to the correct speaker, plus timestamps that link to the point in the video being discussed.
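The timestamp links are the easiest piece to picture. YouTube watch URLs accept a t parameter giving the start time in seconds, so one possible approach is to attach it to the meeting's URL. This is only a sketch of the idea, and link_at is a hypothetical helper, not something in the script today.

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def link_at(url, seconds):
    """Return the watch URL with a t= parameter pointing at `seconds`."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    query["t"] = [f"{int(seconds)}s"]  # replaces any existing t value
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))


# Example: jump to 12 minutes 34 seconds into the meeting.
print(link_at("https://www.youtube.com/watch?v=-VghD-JiO5s", 12 * 60 + 34))
```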

I'm looking forward to sharing more as I build this out.