💬 WhisperSub: Building a Multilingual Subtitle Tool
Hey 👋
I wanted to share WhisperSub, a command-line tool that transcribes audio and generates and merges subtitles for video files. I wanted a way to watch anime with both English translations and Japanese transcriptions, and I built the tool to automate that workflow. This post focuses on the design choices and technical details behind it.
Scope and approach
WhisperSub is designed to be flexible and language-agnostic. I evaluated several Whisper implementations (standard whisper, faster-whisper, whisperX) and chose stable-ts for its accuracy and word-level timestamps. A major part of the work was broadening language support and handling romanization for different scripts:
- Japanese (Hepburn)
- Chinese (Hanyu Pinyin)
- Korean (Revised Romanization)
- Russian (GOST)
- Arabic (ISO 233)
- and others
Each language required specific handling and tests to ensure reasonable results across different inputs.
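To give a feel for what that per-language handling involves, here is a minimal romanization sketch. It uses pykakasi for Hepburn and pypinyin for Hanyu Pinyin purely as illustrative choices; WhisperSub's actual dependencies and function names may differ.

```python
# Illustrative romanization sketch; the libraries and the romanize() helper
# are examples, not necessarily what WhisperSub ships with.
import pykakasi                    # Japanese -> Hepburn romaji
from pypinyin import lazy_pinyin   # Chinese -> Hanyu Pinyin

_kakasi = pykakasi.kakasi()

def romanize(text: str, language: str) -> str:
    """Romanize text for a couple of example languages."""
    if language == "ja":
        # pykakasi tokenizes the text; each token carries a 'hepburn' key.
        return " ".join(tok["hepburn"] for tok in _kakasi.convert(text))
    if language == "zh":
        # lazy_pinyin returns one syllable per character, without tone marks.
        return " ".join(lazy_pinyin(text))
    # Korean, Russian, Arabic, ... would each get their own branch.
    return text

print(romanize("日本語の字幕", "ja"))   # -> "nihongo no jimaku" (roughly)
print(romanize("中文字幕", "zh"))       # -> "zhong wen zi mu"
```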
Technical highlights
The main components are:
- Audio extraction from video (ffmpeg)
- Speech recognition using stable-ts for improved accuracy and word-level timestamps
- Optional speech isolation with Demucs
- Extraction of existing subtitles when present
- Merging and exporting a consolidated subtitle file
Other features include adjustable sync tolerance and flexible I/O options. Optional GPU acceleration is available for faster transcription runs, but enabling and validating CUDA-based performance was not trivial: it requires compatible PyTorch/CUDA/ctranslate2 builds, careful dependency testing, and extra validation on different hardware/configurations.
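To make the flow concrete, here is a stripped-down sketch of the extract-then-transcribe path: ffmpeg pulls a mono 16 kHz WAV out of the video, and stable-ts transcribes it and writes an SRT. File names, the model size, and the language are placeholders; the real tool wires in more configuration (device selection, sync tolerance, output formats, and so on).

```python
# Rough pipeline sketch: extract audio with ffmpeg, transcribe with stable-ts.
# Paths and the model size are placeholders.
import subprocess
import stable_whisper

video = "episode.mkv"
audio = "episode.wav"

# Extract a mono 16 kHz WAV track; Whisper models expect 16 kHz input.
subprocess.run(
    ["ffmpeg", "-y", "-i", video, "-vn", "-ac", "1", "-ar", "16000", audio],
    check=True,
)

# Load a Whisper model through stable-ts and transcribe.
model = stable_whisper.load_model("small")
result = model.transcribe(audio, language="ja")

# stable-ts results can be exported straight to SRT; word_level=False keeps
# one subtitle line per segment instead of per-word timing.
result.to_srt_vtt("episode.srt", word_level=False)
```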
Subtitle synchronization challenges
Subtitle synchronization turned out to be one of the most intricate parts of the project. Ensuring that subtitles align perfectly with audio while maintaining readability required a combination of precise timing adjustments, format compatibility, and layering logic.
- Timing adjustments: WhisperSub uses a sync tolerance parameter to snap subtitles to the nearest matching audio event. This helps reduce drift but requires careful tuning to avoid introducing artifacts (a rough sketch of this, together with layering, follows this list).
- Layering and overlap handling: The tool employs a smart layering algorithm to prevent overlapping subtitles. This is particularly important when merging multiple tracks or adding romanized text alongside translations.
- Format-specific quirks: The ASS format allows for advanced styling and positioning, which WhisperSub leverages to maintain visual clarity. However, converting to simpler formats like SRT strips these features, so most styling and positioning cannot be preserved.
- Iterative testing: The logic for merging and aligning subtitles was refined through extensive testing with real-world media, encountering multiple edge cases (e.g., overlapping events, multi-line subtitles) that required custom handling.
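To illustrate the first two points above, here is a minimal sketch of tolerance-based snapping and layer-based overlap handling using pysubs2. It is an illustration of the idea, not WhisperSub's actual implementation; the tolerance value and the reference timestamps are placeholders.

```python
# Sketch: snap event starts to nearby reference times within a tolerance,
# and bump overlapping events onto a higher ASS layer. Illustrative only.
import pysubs2

TOLERANCE_MS = 300  # maximum distance an event start may be moved

def snap_to_reference(subs: pysubs2.SSAFile, reference_starts_ms: list[int]) -> None:
    """Snap each event to the nearest reference timestamp within tolerance."""
    for event in subs:
        nearest = min(reference_starts_ms, key=lambda t: abs(t - event.start))
        shift = nearest - event.start
        if abs(shift) <= TOLERANCE_MS:
            event.start += shift
            event.end += shift

def layer_overlaps(subs: pysubs2.SSAFile) -> None:
    """Push events that overlap the previous event onto a higher ASS layer."""
    events = sorted(subs, key=lambda e: e.start)
    for prev, cur in zip(events, events[1:]):
        if cur.start < prev.end:
            cur.layer = prev.layer + 1

subs = pysubs2.load("merged.ass")
snap_to_reference(subs, reference_starts_ms=[0, 1800, 4200])  # e.g. detected speech onsets
layer_overlaps(subs)
subs.save("merged_synced.ass")
```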
Lessons learned
The bulk of the engineering effort was about choosing and integrating the pieces that produce consistent results across noisy, real-world media. The two areas that taught me the most were the different Whisper implementations and audio separation tools:
Whisper implementations and trade-offs
- stable-ts: good accuracy and useful word-level timestamps, which simplifies subtitle alignment. It can be heavier on resources for larger models.
- faster-whisper: significantly faster inference on CPU or smaller GPUs, but with trade-offs in some edge-case accuracy and stability across languages.
- whisperX / standard whisper: robust and well-understood, but may require additional tooling for fine-grained timestamps and alignment.
- Practical trade-offs: model size versus latency, GPU memory limits, and whether word-level timestamps are available dictate the best choice for a given workflow. Automatic language detection is convenient but not always perfect; explicit language selection can improve results.
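As a small example of why word-level timestamps matter here: stable-ts exposes per-word timing on each segment, which is what makes aligning romanized text against a translated track tractable. The attribute names below match how I used stable-ts, but treat them as an assumption if you are on a different version.

```python
# Sketch: force the language (auto-detection isn't always right) and read
# word-level timestamps from a stable-ts result.
import stable_whisper

model = stable_whisper.load_model("small")
result = model.transcribe("episode.wav", language="ja")  # explicit language

for segment in result.segments:
    for word in segment.words:
        # Each word carries its own start/end, which the alignment logic can use.
        print(f"{word.start:7.2f} {word.end:7.2f}  {word.word}")
```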
Audio separation (Demucs vs Spleeter and friends)
- Demucs: generally produces higher-quality voice isolation and helps transcription accuracy in noisy scenes, but it is computationally intensive and slower, especially without GPU acceleration.
- Spleeter: faster and lighter-weight, useful for quick experiments, but its separation quality is usually lower than Demucs.
- Separation helps with clarity and downstream transcription quality, but it can introduce artifacts and increases processing time and complexity. I exposed separation as an optional step so users can trade quality for performance.
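For the optional isolation step, Demucs can be driven from its command line before transcription. Here is a hedged sketch of how that might be wired in; the output layout depends on the installed Demucs version and model, so treat the path below as an assumption.

```python
# Sketch: isolate vocals with Demucs before transcription. The output path
# (model name in the directory) depends on the Demucs version in use.
import subprocess
from pathlib import Path

audio = Path("episode.wav")
out_dir = Path("separated")

# --two-stems=vocals asks Demucs for just vocals + accompaniment.
subprocess.run(
    ["demucs", "--two-stems=vocals", "-o", str(out_dir), str(audio)],
    check=True,
)

# With the default htdemucs model, the vocals typically land here:
vocals = out_dir / "htdemucs" / audio.stem / "vocals.wav"
print("Isolated vocals at:", vocals)
```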
Subtitle synchronization
- Syncing subtitles is a non-trivial task, especially when dealing with overlapping events or multi-language tracks. The iterative process of refining the sync logic taught me the importance of balancing automation with user control.
- Tools like pysubs2 simplify subtitle manipulation but require careful handling of edge cases, such as multi-line subtitles or events with inconsistent timing.
- Testing with diverse media files was crucial to uncover and address edge cases, ensuring that the tool performs reliably across different scenarios.
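To show what the merge-and-export step looks like in practice, here is a minimal pysubs2 sketch that combines a translated track and a romanized track into one ASS file, giving each its own style and pinning the romanization to the top of the frame. The file names and style values are placeholders, not WhisperSub's defaults.

```python
# Sketch: merge a translation track and a romanized track into one ASS file.
# File names and style values are placeholders.
import pysubs2

translation = pysubs2.load("episode.en.srt")
romanized = pysubs2.load("episode.romaji.srt")

merged = pysubs2.SSAFile()
merged.styles["Translation"] = pysubs2.SSAStyle(fontsize=22)
merged.styles["Romanized"] = pysubs2.SSAStyle(fontsize=18)

for event in translation:
    event.style = "Translation"
    merged.append(event)

for event in romanized:
    event.style = "Romanized"
    # {\an8} is a standard ASS override tag anchoring the line at the
    # top-center of the frame, keeping it clear of the translation.
    event.text = r"{\an8}" + event.text
    merged.append(event)

merged.sort()
merged.save("episode.merged.ass")
```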
Overall, the project is about balancing accuracy, performance, and complexity. The implementation choices in WhisperSub aim to give users control over those trade-offs.
The project is open-source on GitHub. If you try it and have suggestions or find issues, please open an issue or get in touch.
— Aleix