Audio Chunking Tutorial
Chunking Longer Audio Files for Whisper Models on Groq
By default, our speech endpoints only support audio files up to 25MB via direct file uploads. If you have audio files that are larger than 25MB, you'll need to chunk your audio! In this tutorial, we'll learn how to process long audio files efficiently using chunking methods with Groq API. Breaking down audio files into manageable chunks is essential for reliable transcription of longer recordings while maintaining high accuracy.
Groq is great for processing long audio files thanks to its fast inference speeds: even hours of audio, once split into chunks, can be transcribed in a matter of minutes. As such, we'll use Whisper Large V3 powered by Groq and learn how to:
- Preprocess audio files for optimal transcription
- Split audio files into manageable chunks
- Implement a smart overlap for our chunks
- Transcribe our chunks using Whisper Large V3
- Merge our results while properly handling overlaps
- Save our transcriptions in multiple formats for further handling
Sound exciting? Let's get chunking!
Step 1: Install Required Libraries
First, you'll need FFmpeg installed on your system since we'll need it for format conversion and one of our required libraries depends on it. You can install it with the following:
- Windows: Download from https://ffmpeg.org/download.html
- Mac: brew install ffmpeg
- Linux: sudo apt-get install ffmpeg
Now, let's install the libraries we'll need for audio processing and transcription. Although there are other libraries for audio manipulation, we'll use PyDub since it provides a high-level interface that can handle all the audio formats supported by Groq API through ffmpeg, which is what we'll use for preprocessing our audio:
Step 2: Import Required Libraries
Now that we've installed the libraries we need, let's import them:
Step 3: Preprocess Audio to Downsample (Optional)
Whisper models require audio to be in 16,000 Hz mono format before transcribing, and Groq API will re-encode audio files to these settings after receiving them. You may want to preprocess your audio files client-side if your original file is extremely large and you want to make it smaller without a loss in quality (without chunking, Groq API speech endpoints accept files up to 25MB). We recommend FLAC for lossless compression.
Let's define a function called preprocess_audio that takes our file path for any audio file as input and returns the path to a converted FLAC format file. We'll use Python's subprocess to run ffmpeg with the proper arguments for reducing our audio file quality to 16kHz and converting to mono, or single audio channel.
We'll also use a few extra (but optional) parameters for suppressing FFmpeg version info and build details, only showing errors and suppressing warnings, and automatically overwriting our output file if it already exists to ensure we always get a fresh conversion:
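The steps above can be sketched roughly as follows. This is a minimal sketch, not necessarily the exact function used in this tutorial: the helper name build_ffmpeg_command and the "_16k.flac" output naming are assumptions made for illustration, and FFmpeg is assumed to be on your PATH.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(input_path: Path, output_path: Path) -> list[str]:
    """Build the FFmpeg argument list (hypothetical helper, kept separate for clarity)."""
    return [
        "ffmpeg",
        "-hide_banner",          # suppress version info and build details
        "-loglevel", "error",    # show errors only, no warnings
        "-i", str(input_path),   # input file in any format FFmpeg can read
        "-ar", "16000",          # resample to 16,000 Hz
        "-ac", "1",              # downmix to a single (mono) channel
        "-c:a", "flac",          # encode as lossless FLAC
        "-y",                    # overwrite the output file if it already exists
        str(output_path),
    ]


def preprocess_audio(input_path: Path) -> Path:
    """Convert any audio file to 16 kHz mono FLAC and return the new path."""
    input_path = Path(input_path)
    if not input_path.exists():
        raise FileNotFoundError(f"Input file not found: {input_path}")

    # Assumption: write the converted file next to the original.
    output_path = input_path.with_name(f"{input_path.stem}_16k.flac")

    # check=True raises CalledProcessError if FFmpeg exits with an error.
    subprocess.run(build_ffmpeg_command(input_path, output_path), check=True)
    return output_path
```

Keeping the argument list in its own function makes it easy to inspect or adjust the flags without running FFmpeg.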
Step 4: Create Function for Transcribing a Single Chunk
Now that our audio is downsampled, we can create a dedicated worker function for transcribing individual audio chunks that will be called by our main transcription controller function in Step 7. Our transcribe_single_chunk function uses the Whisper Large V3 model via Groq API to transcribe one chunk at a time. Let's break down how our function handles each chunk:
- Uses Python's tempfile module for safe, automatic cleanup of temporary files
- Uses whisper-large-v3 via Groq API and specifies language as English and verbose_json as the response format
- Times Groq API calls to monitor performance
- Provides detailed progress tracking (current chunk transcribed vs. total chunks)
- Maintains consistent error handling and resource cleanup
We highly recommend specifying language. Whisper analyzes the first 30 seconds of your audio to determine language, but this could result in errors from Whisper possibly choosing the wrong language, especially if your audio has background noise, music, or silence in that timeframe. Specifying language will also help speed up requests since Whisper can forgo audio analysis for determining language.
Tip: Setting response_format to verbose_json for Groq API transcription and translation endpoints provides timestamps for audio segments! It also provides avg_logprob, compression_ratio, and no_speech_prob! See our official docs for more info.
Once the single chunk is transcribed, the function returns a tuple of the transcription result and the processing time.
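A worker along these lines might look like the sketch below. It is an illustration under stated assumptions rather than the definitive implementation: client is assumed to be a groq.Groq instance and chunk a pydub.AudioSegment, and both are passed in as arguments so the block itself needs only the standard library.

```python
import os
import tempfile
import time


def transcribe_single_chunk(client, chunk, chunk_num, total_chunks):
    """Transcribe one chunk and return (result, seconds spent in the API call).

    Assumptions: `client` behaves like groq.Groq() and `chunk` behaves like
    a pydub.AudioSegment (it only needs an export(path, format=...) method).
    """
    # Reserve a temporary path and close the handle right away so that
    # export() can write to it on every platform.
    fd, path = tempfile.mkstemp(suffix=".flac")
    os.close(fd)
    try:
        chunk.export(path, format="flac")

        start = time.time()
        with open(path, "rb") as audio_file:
            result = client.audio.transcriptions.create(
                file=("chunk.flac", audio_file, "audio/flac"),
                model="whisper-large-v3",
                language="en",                   # skip language detection
                response_format="verbose_json",  # include segment timestamps
            )
        elapsed = time.time() - start

        print(f"Chunk {chunk_num}/{total_chunks} transcribed in {elapsed:.2f}s")
        return result, elapsed
    finally:
        # Remove the temporary file whether the request succeeded or failed.
        os.remove(path)
```

Passing the client in, rather than creating it inside the function, also makes the worker easy to exercise with a stand-in object before spending real API calls.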
Step 5: Handle Chunk Overlaps in Audio Transcription
When dealing with chunked audio transcription, one of the biggest challenges is handling the transitions between chunks smoothly. (This challenge is the basis of this entire tutorial, which was inspired by a conversation with one of the developers in our community, Jan Zheng. Thank you for the insightful conversations!) Transitions are difficult because Whisper can sometimes cut words off mid-word at chunk boundaries, transcribe the same word slightly differently in adjacent chunks, and have varying accuracy at the beginning and end of chunks.
To handle these challenges, we'll explore two strategies for handling chunk overlaps:
- The Local Agreement strategy, or longest common prefix (LCP) approach for finding exact matches between chunks
- The longest common sequence algorithm with sliding window alignment for more robust matching
Initially, the LCP approach seemed promising as it can handle varying overlaps, but because of Whisper's nature and the possibility of mid-word cutoffs, this approach is too restrictive since it looks for exact word matches between chunks. Through testing and feedback from one of my teammates (shoutout to Graden), our implementation will be the longest common sequence algorithm that:
- Isn't restricted to just checking chunk boundaries
- Can handle both partial word and character-level matching
- Uses a weighted scoring system that combines the number of matching words/characters, position-based weighting (via an epsilon value), and a minimum threshold of 2 matches for reliability
- Is more fault-tolerant of Whisper's boundary transcription quirks
Let's look at a practical example of what we're dealing with and consider the following two chunks:
Chunk 1: "Hello my name ich"
Chunk 2: "mine name is Jonathan"
This is where our find_longest_common_sequence function comes in. It:
- Tries different alignments by sliding the sequences
- For each position, counts matching elements, calculates a score ((matches/position) + tiny position-based weight), and requires at least 2 matches to consider the alignment
- Finds the best alignment (in this case, "name")
- Takes the left half from Chunk 1 ("Hello my") and the right half from Chunk 2 ("name is Jonathan")
- Combines the sequences while handling variations like "ich/is" and "my/mine" into a clean final result ("Hello my name is Jonathan")
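The sliding alignment described above can be sketched for a single pair of chunks as follows. This is a simplified illustration, not necessarily the tutorial's find_longest_common_sequence: the name align_and_merge is hypothetical, and the epsilon of position/10000 is an assumed value. It works on any sliceable sequence, so it can be run on a list of words or directly on strings (character level).

```python
def align_and_merge(left, right):
    """Merge two overlapping sequences (lists of words or plain strings)."""
    left_len, right_len = len(left), len(right)
    best_score = 0.0
    # Default when no alignment qualifies: keep all of left, then all of right.
    best = (left_len, left_len, 0, 0)

    # Slide the end of `left` across the start of `right`, one step at a time.
    for i in range(1, left_len + right_len + 1):
        eps = i / 10000.0  # tiny position-based weight (assumed value)

        left_start = max(0, left_len - i)
        left_stop = min(left_len, left_len + right_len - i)
        right_start = max(0, i - left_len)
        right_stop = min(right_len, i)

        matches = sum(
            a == b
            for a, b in zip(left[left_start:left_stop], right[right_start:right_stop])
        )
        score = matches / i + eps

        # Require at least 2 matching elements before trusting an alignment.
        if matches >= 2 and score > best_score:
            best_score = score
            best = (left_start, left_stop, right_start, right_stop)

    left_start, left_stop, right_start, right_stop = best
    # Cut both sequences at the midpoint of the aligned window: the first
    # half comes from the left chunk, the second half from the right chunk.
    left_mid = (left_start + left_stop) // 2
    right_mid = (right_start + right_stop) // 2
    return left[:left_mid] + right[right_mid:]


# Character-level run on the example chunks from this section.
print(align_and_merge("Hello my name ich", "mine name is Jonathan"))
# Hello my name is Jonathan
```

Note that in this example only one whole word ("name") matches exactly, so a word-level pass would not reach the two-match threshold; it is the character-level comparison that finds the shared " name i" stretch and stitches the chunks together there.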
Step 5: Merge Audio Chunk Transcriptions
With our sequence alignment function ready, we can now implement the merge_transcripts function that will combine all our chunks into a single coherent transcript. merge_transcripts takes a list of chunk transcription results and processes them based on the available data:
- Processes both segment-level and word-level timestamps when available:
  - Extracts and adjusts all word timestamps based on their chunk's starting position
  - Preserves all word-level timing information regardless of segment presence
  - Combines words from all chunks into a single coherent list
- For segment-level data, the function:
  - Handles overlapping segments by merging them into a single segment with combined text
  - Processes the boundaries between chunks using find_longest_common_sequence to create smooth transitions
  - Maintains detailed segment metadata including temperature, avg_logprob, compression_ratio, and no_speech_prob
- Creates a comprehensive output that includes:
  - The complete transcript text
  - All merged and properly timed segments with their metadata
  - Word-level timestamps when requested
The function works with timestamp granularities containing only segments, only words, or both!
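To make the timestamp adjustment concrete, here is a small sketch of just the word-level part. It assumes each chunk result is a dict with an optional "words" list whose entries carry "word", "start", and "end" (in seconds), paired with that chunk's start position in milliseconds; the function name merge_word_timestamps and this input shape are assumptions for illustration, not the tutorial's exact data structures.

```python
def merge_word_timestamps(chunk_results):
    """Shift each chunk's word timestamps onto the full recording's timeline.

    chunk_results: list of (result_dict, chunk_start_ms) pairs (assumed shape).
    """
    merged = []
    for result, chunk_start_ms in chunk_results:
        offset_s = chunk_start_ms / 1000.0  # chunk start in seconds
        for word in result.get("words", []):
            merged.append(
                {
                    **word,  # keep any other fields untouched
                    "start": word["start"] + offset_s,
                    "end": word["end"] + offset_s,
                }
            )
    return merged


chunks = [
    ({"words": [{"word": "hello", "start": 0.5, "end": 1.0}]}, 0),
    ({"words": [{"word": "world", "start": 0.25, "end": 0.75}]}, 590_000),
]
print(merge_word_timestamps(chunks)[1])
# {'word': 'world', 'start': 590.25, 'end': 590.75}
```

This sketch keeps every word from every chunk, so words spoken inside an overlap region will appear once per chunk that contains them.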
Step 6: Save Transcription Outputs
Now let's implement our helper function that handles our transcription outputs. This save_results function creates a dedicated transcriptions directory to keep our outputs organized, uses timestamped filenames to prevent overwrites, and saves our results in multiple formats for different use cases: plain text, JSON, and segmented JSON for detailed timestamp information.
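A helper of this kind could look roughly like the sketch below. The result is assumed to be a dict with a "text" key and an optional "segments" list, and the exact filename pattern (including the "_full" and "_segments" suffixes) is an assumption chosen for illustration.

```python
import json
from datetime import datetime
from pathlib import Path


def save_results(result: dict, audio_path: str, output_dir: str = "transcriptions") -> Path:
    """Save a transcription as plain text, full JSON, and segments-only JSON.

    Returns the shared base path (without extension) of the files written.
    """
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed

    # A timestamp in the filename prevents overwriting earlier runs.
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    base = out_dir / f"{Path(audio_path).stem}_{timestamp}"

    # 1. Plain text transcript
    Path(f"{base}.txt").write_text(result["text"], encoding="utf-8")

    # 2. Full result as JSON
    Path(f"{base}_full.json").write_text(
        json.dumps(result, indent=2, ensure_ascii=False), encoding="utf-8"
    )

    # 3. Segments only, for detailed timestamp information
    Path(f"{base}_segments.json").write_text(
        json.dumps(result.get("segments", []), indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    return base
```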
Step 7: Create Transcription Engine and Assemble the Pipeline
Now comes the fun part - bringing all the pieces we built together with transcribe_audio_in_chunks, which is our orchestrator function that takes our audio file, splits it into chunks, coordinates the transcription process, combines the chunked transcription outputs, and saves our results! Think of this function as the conductor of our transcription orchestra that makes sure every function, or part, plays its role at the right time.
While Whisper was trained on 30-second segments and the recommended chunk size is 30-60 seconds, this can vary and longer chunks can actually provide better results when using Groq API. For this tutorial, we're using 600-second (10-minute) chunks with a 10-second overlap for an optimal balance of:
- Reduced calls to Groq API (fewer chunks)
- Better transcription accuracy (longer context)
- Reliable word boundary handling
- Staying safely within the current 25MB per-request limit for Groq API transcriptions and translations
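The chunk boundaries for these settings can be sketched as below. This is one reasonable way to compute them, not necessarily the exact loop inside transcribe_audio_in_chunks; the function name chunk_ranges is hypothetical. Each chunk starts one "step" (chunk length minus overlap) after the previous one, so neighbouring chunks share the overlap.

```python
def chunk_ranges(duration_ms: int, chunk_length_s: int = 600, overlap_s: int = 10):
    """Return (start_ms, end_ms) pairs covering the audio with overlapping chunks."""
    chunk_ms = chunk_length_s * 1000
    overlap_ms = overlap_s * 1000
    if overlap_ms >= chunk_ms:
        raise ValueError("overlap must be shorter than the chunk length")

    step_ms = chunk_ms - overlap_ms  # distance between consecutive chunk starts
    ranges = []
    start = 0
    while start < duration_ms:
        end = min(start + chunk_ms, duration_ms)
        ranges.append((start, end))
        if end == duration_ms:  # reached the end of the audio
            break
        start += step_ms
    return ranges


# A 25-minute recording with the tutorial's 600 s chunks and 10 s overlap.
for start, end in chunk_ranges(25 * 60 * 1000):
    print(f"{start / 1000:.0f}s -> {end / 1000:.0f}s")
```

For the 25-minute example this yields three chunks: 0 to 600 seconds, 590 to 1190 seconds, and 1180 to 1500 seconds. With PyDub, each range can then be sliced out of an AudioSegment using millisecond indices.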
Why an Overlap? Overlapping chunks prevents our model from losing context at chunk boundaries and cutting words in half. Without an overlap, we might split right in the middle of a word or sentence, which would cause missing content, increased hallucinations, and transcription errors. By overlapping (typically 5-10 seconds), we ensure that words and context spanning chunk boundaries are captured completely.
Understanding Chunk Overlap and Overhead: When we use overlapping chunks, we're actually processing some audio multiple times. For example, with our settings for this tutorial, each 600-second chunk has 10 seconds of overlap at the start and end. This means we're processing 620 seconds (600 + 10 + 10) for each 600-second chunk, which creates a 3.33% overhead (20 extra seconds). This is much more efficient than shorter chunks. For example, 60-second chunks with 5-second overlaps would have a 16.7% overhead (10 extra seconds for every 60 seconds of audio), which is five times as much!
Overhead matters because more overhead means more API calls, higher costs, and more potential for transcription errors at boundaries. More processing time is also a factor, but since we're using Groq API, the impact there is minimal.
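The overhead arithmetic for this tutorial's settings can be checked with a couple of lines. The helper name overlap_overhead_percent is hypothetical, and the formula simply mirrors the estimate above (one overlap at each end of a chunk); the first and last chunks only overlap on one side, so the true total is slightly lower.

```python
def overlap_overhead_percent(chunk_length_s: float, overlap_s: float) -> float:
    """Extra audio processed per chunk, as a percentage of the chunk length."""
    extra_s = 2 * overlap_s  # one overlap at the start, one at the end
    return extra_s / chunk_length_s * 100


# This tutorial's settings: 600-second chunks with a 10-second overlap.
print(f"{overlap_overhead_percent(600, 10):.2f}%")  # 3.33%
```

Plugging in your own chunk length and overlap is a quick way to compare candidate settings before running a long transcription job.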
You may have to pretend to be Goldilocks and find overlapping chunks that are just right for your typical use case. While too small of a chunk size could result in the model missing important context for transcription, too large of a chunk size could lead to potential degradation in accuracy. For a rapid-fire podcast conversation or interview, shorter chunks might work better. For a slow-paced lecture or meeting, longer chunks could be your answer. You need to find the chunk size that's just right!
Step 8: Run the Pipeline!
After quite the adventure where we've learned about audio chunking, it's time to put our transcription orchestra into action and see how it performs with real audio files. Replace the "path_to_your_audio" below with the path for a long audio file of your choice. Groq API supports flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm audio file formats, but since we are converting to FLAC before sending our request to Groq, you can process any format that FFmpeg can handle.
When you run this code, you'll get several types of output that help you track the transcription process:
- Progress Updates: The code provides real-time feedback about which chunk it's processing as well as the time ranges.
- Transcription Results: The code creates a directory called transcriptions that will have three different output files.
Conclusion
This wraps up our journey through audio chunking and transcription with the lightning-fast Groq API! Once you do get the transcriptions, make sure to review them and remember that audio chunking and transcribing is both an art and a science! While our pipeline above handles the science part well, you might need to adjust the art part (the chunks and overlaps) based on your specific audio. Don't be afraid to experiment further on your own with different parameters.
The following sections are optional learnings for debugging methods and considerations for production. If you enjoyed this tutorial and have other topics you'd like to learn more about, request one from me on X. Happy building!