Quick-start

Installation guide
Requires Python >= 3.12
Since I could not get the software stack to behave properly on my AMD GPU, development has been done inside a Docker container. CPU usage should work bare metal, but anything else is up to sheer luck.
The project currently interfaces with the models through Hugging Face's transformers library.
AMD users should use the Dockerfile-rocm Dockerfile. For active development, a ROCm-compatible devcontainer is also available.
NVIDIA and CPU users can try installing the project on bare metal and see how it goes. For this purpose, install the dependencies from requirements.txt (which is otherwise meant to be used inside a PyTorch container).
flash-attention is optional but comes preinstalled in the ROCm container, where it has been validated to work. NVIDIA users may also want to install this additional dependency for potentially better performance.
If you do not install flash-attention, the tool falls back to PyTorch's integrated paged/SDPA attention backend, which should work on all platforms.
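The fallback described above can be sketched as a small helper that checks whether flash-attention is importable and picks a backend string accordingly. The helper name is illustrative (an assumption, not the project's actual code); the string values match the `attn_implementation` options accepted by transformers.

```python
# Sketch: prefer flash-attention when it is installed, otherwise fall back
# to PyTorch's built-in scaled dot-product attention (SDPA).
import importlib.util


def pick_attn_implementation() -> str:
    """Return the attention backend string to request from transformers."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"  # PyTorch SDPA, available on all platforms
```

The returned string could then be passed as `attn_implementation=` to `AutoModel.from_pretrained`.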
Usage
The script provides a progress bar for each CPU worker launched. If a progress bar appears stalled, it is most likely a visual bug in rich; the process will actually have finished once overall progress bars for N = number of CPU workers are displayed.
You interact with the tool via the CLI, for example:
When files are being saved, existing files can also be overwritten by specifying:
Using -s skips files for which subtitles already exist. Because existing subtitle file names cannot be mapped back to the tracks within a file, no track will be processed, even if the subtitles found belong to only one of multiple tracks in the MKV file.
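The skip check above could look roughly like the following. The helper name and the assumed naming scheme (subtitle files sharing the MKV's stem, `.srt` extension) are illustrative assumptions, not the project's actual code; the point is that the check is per file, not per track.

```python
# Hedged sketch of the -s behaviour: if any .srt whose name starts with the
# MKV's stem sits next to the file, the whole MKV is skipped, because a
# subtitle file name cannot be mapped back to a specific track.
from pathlib import Path


def should_skip(mkv_path: Path) -> bool:
    """True if any subtitle for this MKV already exists in its directory."""
    stem = mkv_path.stem
    return any(
        p.suffix == ".srt" and p.stem.startswith(stem)
        for p in mkv_path.parent.iterdir()
    )
```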
The current architecture lets you launch N OCR-model GPU workers followed by N language-model GPU workers. N = 4 CPU workers each work on a single subtitle track, processing every PGS image found in that track. Each image is processed one by one.
Each worker is launched as a separate process, meaning you will need at least N_cw + N_ow + N_lw + 2 threads available on your system. The default is 6 threads, so a 3-core CPU with 2 threads per core is required at the very least. The extra +2 are managers which handle communication between processes via queues: one manager controls the GPU queues, while the other controls the CPU and progress queues (used for the progress bars).
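The thread requirement above can be written as a one-line formula over the three worker counts plus the two queue managers (function name is illustrative):

```python
def required_threads(cpu_workers: int, ocr_workers: int, lm_workers: int) -> int:
    """Minimum hardware threads: one per worker process plus two managers."""
    return cpu_workers + ocr_workers + lm_workers + 2
```

For example, 4 CPU workers with one OCR and one language-model worker need at least 8 threads.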
All CPU workers push their images onto a global GPU queue. OCR GPU workers then draw items from this queue and process the images. The extracted text is passed through another queue to the language-model workers, which classify the language of the text.
Finally, the language-model workers send the text, along with its language classification, back to the CPU worker that originally submitted the item, ensuring processed tracks remain consistent and ordered.
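The queue topology described above can be sketched as follows, using threads and `queue.Queue` for brevity (the tool itself uses separate processes and multiprocessing queues). All names are illustrative: each submitted item carries the index of its owning worker, so the language stage can route the result back to the queue of the worker that submitted it, preserving per-track order.

```python
# Sketch: CPU-side work feeds one shared "GPU" queue; an OCR stage forwards
# extracted text to a language stage, which routes each result back to a
# per-owner result queue.
import queue
import threading


def ocr_stage(gpu_q, lang_q):
    while True:
        item = gpu_q.get()
        if item is None:            # sentinel: shut down the pipeline
            lang_q.put(None)
            break
        owner, idx, image = item
        lang_q.put((owner, idx, f"text({image})"))  # stand-in for OCR


def lang_stage(lang_q, result_qs):
    while True:
        item = lang_q.get()
        if item is None:
            break
        owner, idx, text = item
        # stand-in for language classification; route back to the owner
        result_qs[owner].put((idx, text, "en"))


def run_pipeline(images_per_worker):
    gpu_q, lang_q = queue.Queue(), queue.Queue()
    result_qs = [queue.Ueue() for _ in images_per_worker] if False else [queue.Queue() for _ in images_per_worker]
    stages = [threading.Thread(target=ocr_stage, args=(gpu_q, lang_q)),
              threading.Thread(target=lang_stage, args=(lang_q, result_qs))]
    for t in stages:
        t.start()
    for owner, images in enumerate(images_per_worker):
        for idx, image in enumerate(images):
            gpu_q.put((owner, idx, image))
    gpu_q.put(None)
    for t in stages:
        t.join()
    # drain each worker's result queue in submission order
    return [[rq.get() for _ in images]
            for rq, images in zip(result_qs, images_per_worker)]
```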
The number of workers can be adjusted with the following arguments:
Additionally, the -b, --batchsize argument exists to batch images for inference; however, this option has not been tested much due to AMD GPU crashes, so use it with caution.
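Conceptually, the `-b/--batchsize` option groups a track's images into fixed-size chunks so each inference call handles several images at once. A minimal sketch of such a helper (name and signature are assumptions, not the project's actual code):

```python
# Group items into batches of at most `batch_size` for batched inference.
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive batches of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])
```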