How to Align Text with Audio

You have an audio or video recording and the script of what is said in it, and you would like to align the text with the speech. Instead of creating the timelines by hand, you can use a computer-aided approach:

First, run speech recognition to get the recognized words together with their timelines.
Then, align the recognized text with the existing script.
Finally, determine the timelines of the script text from that alignment, based on text length.
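
The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Silhouette does it: the function name `align_script`, the `(word, start, end)` input format, and the use of `difflib.SequenceMatcher` for the text alignment are all choices made for this example. Recognition (step one) is assumed to have already produced a non-empty list of words with timestamps in seconds.

```python
from difflib import SequenceMatcher


def normalize(word):
    """Lowercase a word and drop punctuation so that 'World!' matches 'world'."""
    return "".join(ch for ch in word.lower() if ch.isalnum())


def align_script(recognized, script_lines):
    """Give each script line a (start, end) time.

    recognized: non-empty list of (word, start, end) from a speech recognizer
                (hypothetical format, times in seconds, in spoken order).
    script_lines: list of strings, the existing script.
    Returns a list of (line, start, end); empty lines are skipped.
    """
    # Flatten the script into words, remembering which line each came from.
    script_words = []
    for line_index, line in enumerate(script_lines):
        for word in line.split():
            script_words.append((normalize(word), line_index))

    scr_norm = [w for w, _ in script_words]
    rec_norm = [normalize(w) for w, _, _ in recognized]

    # Step 2: align the script with the recognized text.
    # Words that match exactly inherit the recognized word's timing.
    times = [None] * len(scr_norm)
    matcher = SequenceMatcher(None, scr_norm, rec_norm, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            _, start, end = recognized[block.b + k]
            times[block.a + k] = (start, end)

    # Step 3: script words without a match share the time between the
    # neighbouring matched words, in proportion to their text length.
    t_begin, t_end = recognized[0][1], recognized[-1][2]
    i = 0
    while i < len(times):
        if times[i] is not None:
            i += 1
            continue
        j = i
        while j < len(times) and times[j] is None:
            j += 1
        left = times[i - 1][1] if i > 0 else t_begin
        right = times[j][0] if j < len(times) else t_end
        total = sum(max(len(w), 1) for w in scr_norm[i:j])
        cursor = left
        for k in range(i, j):
            share = (right - left) * max(len(scr_norm[k]), 1) / total
            times[k] = (cursor, cursor + share)
            cursor += share
        i = j

    # Collapse word timings into one (start, end) per script line.
    result = []
    for line_index, line in enumerate(script_lines):
        spans = [times[k] for k, (_, li) in enumerate(script_words) if li == line_index]
        if spans:
            result.append((line, spans[0][0], spans[-1][1]))
    return result


if __name__ == "__main__":
    # Made-up recognizer output: "world" was misheard and "a" was dropped.
    recognized = [("hello", 0.0, 0.5), ("word", 0.5, 1.0), ("this", 1.2, 1.5),
                  ("is", 1.5, 1.7), ("test", 2.0, 2.4)]
    for line, start, end in align_script(recognized, ["Hello, world!", "This is a test."]):
        print(f"{start:.2f} - {end:.2f}  {line}")
```

Matched words act as anchors, so a misrecognized or missing word only affects the timing between its two nearest anchors. A real tool would likely also need a manual correction step, since long stretches of bad recognition leave few anchors to work with.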

Silhouette, a computer-aided video/audio translation tool, is made for this purpose.

Recognized result:

[Image: recognized result]

Aligner:

[Image: aligner]

Aligned result:

[Image: aligned result]

PS: If the speech is recognized accurately, none of this is necessary. The approach is for cases where poor audio quality leads to bad recognition results.

Further Reading