So, you’ve got a pile of audio – let’s say about 10 hours’ worth – and you need to turn it all into text. Not just any text, though. You need the works: who said what (that’s diarization, folks) and maybe even a hint of how emphatically they said it (prominence). Sounds like a job for a beefy speech-to-text (STT) system, right?
My first instinct, like many, was to peek at what the big cloud players offer off-the-shelf. Quick search, and boom, Google’s Speech-to-Text V2 API pops up. The pricing table looks something like the image I’ve been staring at:
(Image note: Google’s STT V2 API pricing table, showing “Standard recognition models” at $0.016 per minute for the 0–500,000-minute tier.)
Let’s do some quick math for our 10 hours of audio. 10 hours = 10 * 60 minutes = 600 minutes. At $0.016 per minute, that’s 600 * $0.016 = $9.60.
Okay, $9.60 might not break the bank for a one-off. But what if this is a regular thing? Or what if those 10 hours are just the tip of the iceberg? That cost can add up, and fast. Especially if you’re an indie hacker or a small team, every dollar counts. This got me thinking: there has to be a more wallet-friendly way for batch jobs where I don’t need instant, on-demand processing for every single file.
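To make that “adds up fast” concrete, here’s a quick back-of-envelope sketch. The $0.016/minute rate comes from the pricing table above; the monthly volume is a made-up example, not a real workload:

```python
API_RATE_PER_MIN = 0.016  # Google STT V2 standard tier, $/minute

def api_cost(hours: float) -> float:
    """Cost of transcribing `hours` of audio at the per-minute API rate."""
    return round(hours * 60 * API_RATE_PER_MIN, 2)

print(api_cost(10))       # one 10-hour batch -> 9.6
print(api_cost(10 * 30))  # hypothetical: that same batch daily for a month -> 288.0
```

A one-off is pocket change; a recurring pipeline is a real line item.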
The “Serverless” Siren Song (and its Complexities)
Naturally, the “serverless” buzzword comes to mind. AWS Lambda, Google Cloud Functions… could we just spin up a function, throw our audio at it, and get text back? For STT, especially with GPU-hungry models like Whisper and diarization libraries like PyAnnote, it’s not quite that simple.
- GPU Access: Getting GPU access in standard serverless functions is tricky or non-existent. Some platforms offer GPU options, but they might come with their own set of limitations or cost models that bring us back towards square one for heavy, long-running tasks.
- Execution Limits: Many serverless platforms have execution time limits (e.g., 15 minutes for Lambda). Processing hours of audio, even chunked, can bump against these.
- Package Sizes: Packing up large ML models and their dependencies (hello PyTorch, CUDA libraries, etc.) into a deployable serverless package can be a nightmare, often exceeding size limits.
- State Management: Diarization benefits from understanding the broader context of the audio. Chopping it into tiny, independent serverless invocations can make maintaining speaker consistency a real headache.
While serverless is fantastic for many things, for this kind of heavy, stateful-ish batch processing, it felt like I’d be wrestling with the infrastructure more than working on the actual STT problem.
The Middle Ground: Your Own (Temporary) GPU Workhorse on EC2
This led me to what I think is a pretty sweet middle ground for batch STT jobs: setting up an AWS EC2 instance with a GPU, running our open-source STT pipeline, and – crucially – only having it switched on when we actually need it.
The idea is straightforward:
- Set up a capable EC2 instance: Something like a `g4dn.xlarge`, which has an NVIDIA T4 GPU. This is a good balance of power and cost.
- Dockerize the STT pipeline: Package up Python, PyTorch, Whisper, PyAnnote, FFmpeg, and all the goodies into a Docker container. This makes it portable and easy to run. (More on this setup in the `HOSTING_STT_OUTLINE.md` I’ve been scribbling in.)
- Automate startup/shutdown: Before I run my local script (like `generateTranscript.ts`, which processes my video files), it pings AWS to wake up the EC2 instance. It waits for it to be ready, then sends the audio files (or S3 paths) to an API running on the EC2 instance. Once all the batch processing is done, the script tells AWS to shut the EC2 instance down.
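The start/wait/shutdown dance can be sketched with boto3. This is a minimal sketch, not my exact script: the instance ID is a placeholder, `submit_audio` is a hypothetical stand-in for whatever protocol the on-instance API speaks, and it assumes boto3 is installed with AWS credentials configured:

```python
import time

def run_batch(instance_id: str, region: str = "us-east-1") -> None:
    """Start the GPU instance, hand off the batch, then shut it down."""
    import boto3  # imported lazily so this file loads without AWS deps

    ec2 = boto3.client("ec2", region_name=region)
    ec2.start_instances(InstanceIds=[instance_id])

    # Block until EC2 reports the instance as running...
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    # ...then give the Docker container a moment to come up.
    time.sleep(30)

    try:
        ip = public_ip(ec2, instance_id)
        submit_audio(ip)  # hypothetical: POST the S3 paths to the API on the box
    finally:
        # Always stop the instance, even if the batch fails --
        # idle GPU-hours are exactly the cost we're trying to avoid.
        ec2.stop_instances(InstanceIds=[instance_id])

def public_ip(ec2, instance_id: str) -> str:
    """Look up the instance's current public IP after it starts."""
    desc = ec2.describe_instances(InstanceIds=[instance_id])
    return desc["Reservations"][0]["Instances"][0]["PublicIpAddress"]

def submit_audio(ip: str) -> None:
    ...  # placeholder: send S3 paths, poll for transcripts, download results
```

The `finally` block is the important part: the whole cost argument collapses if the instance is left running overnight.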
So, what about the cost of this approach?
Let’s assume our 10 hours of audio can be processed by the EC2 instance in about 2 hours of actual run time. This is a reasonable target if the models are optimized and the GPU is doing its job.
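For context, the heavy lifting on the instance is roughly: Whisper transcribes, PyAnnote diarizes, and each transcript segment gets the speaker whose diarization turn overlaps it most. The model names and the merge step below are my sketch of that idea under those assumptions (both libraries installed, a Hugging Face token available for PyAnnote), not a canonical recipe:

```python
def assign_speakers(segments, turns):
    """Label each Whisper segment (start, end, text) with the speaker
    whose diarization turn (start, end, speaker) overlaps it most."""
    labeled = []
    for seg_start, seg_end, text in segments:
        best, best_overlap = "UNKNOWN", 0.0
        for t_start, t_end, speaker in turns:
            overlap = min(seg_end, t_end) - max(seg_start, t_start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        labeled.append((best, text))
    return labeled

def transcribe_file(path: str):
    """Run Whisper + PyAnnote on one audio file; imports are lazy so
    this module loads even without the heavy ML dependencies."""
    import whisper
    from pyannote.audio import Pipeline

    model = whisper.load_model("medium")
    result = model.transcribe(path)
    segments = [(s["start"], s["end"], s["text"]) for s in result["segments"]]

    diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
    turns = [(turn.start, turn.end, speaker)
             for turn, _, speaker in diarizer(path).itertracks(yield_label=True)]
    return assign_speakers(segments, turns)
```

The max-overlap merge is crude but works surprisingly well when segments are short; fancier approaches split segments at speaker boundaries.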
- EC2 `g4dn.xlarge` (On-Demand, N. Virginia region): Roughly $0.526 per hour.
- Total EC2 cost for 2 hours: 2 * $0.526 = **$1.052**
Let’s add a tiny bit for EBS storage (the instance’s disk), but for 2 hours, it’s negligible, maybe a few cents. Data transfer for uploading audio to S3 and downloading transcripts is also usually very cheap or within free tiers for this scale.
So, we’re looking at around $1.10 to process our 10 hours of audio.
Compare that to the $9.60 from the standard cloud API. That’s a pretty significant saving – roughly 9x cheaper!
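Checking that ratio with the numbers above:

```python
api_cost = 9.60       # managed STT API: 600 minutes at $0.016/min
ec2_cost = 2 * 0.526  # 2 hours of on-demand g4dn.xlarge

print(round(api_cost / ec2_cost, 1))  # -> 9.1, i.e. roughly 9x cheaper
```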
If we get more aggressive and use EC2 Spot Instances (which can be up to 70-90% cheaper but can be interrupted), that hourly rate could drop to around $0.15-$0.20/hour.
- EC2 `g4dn.xlarge` (Spot Instance, optimistic): 2 hours * $0.18 = **$0.36**
Now we’re talking! From nearly ten bucks down to less than half a dollar for the same 10 hours of audio, complete with diarization and prominence features.
The Trade-offs
Of course, this isn’t “free.” The main cost is the initial setup time and a bit more complexity in our local scripting to manage the EC2 instance’s lifecycle. But for batch jobs where we can tolerate a little startup delay, the cost savings are compelling. We’re essentially trading a higher operational cost (per-minute API fees) for a bit of upfront engineering and a much lower per-processing-hour cost.
This is the path I’m currently on. It gives control, leverages awesome open-source tools, and keeps the bean counters happy. It’s a bit like building your own race car instead of taking a taxi – more work to get started, but way more fun and economical in the long run if you’re doing a lot of laps!
(Next steps would involve diving deeper into the Docker setup, the API on EC2, and the `generateTranscript.ts` modifications – much of which is captured in the `HOSTING_STT_OUTLINE.md`.)