awk script to detect repeating segments / hallucination subtitle lines #1025
mrfragger
started this conversation in
Show and tell
I found a good script here and only had to modify it slightly so that it reports lines repeated between 3 and 98 times (the code tests count > 2 and count < 99); the upper limit keeps the blank lines in srt or vtt files out of the report. I also turned it into a one-liner. The script doesn't fix anything; it only detects repetition. Audiobook subtitles that run 30+ or 50+ hours are hard to check by hand, so this should help you locate issues.
Edit: I updated the script to print line numbers only when the repeats are within 3 cues of each other (technically up to 12 lines apart), which makes possible repeating segments / hallucinations much easier to identify. Note that in a vtt file the cue index number is optional, so the next line of subtitle text is either 3 or 4 lines away. Many times the repeat is just music or a repeating chorus.
For example, 11297 is 12 lines away from 11309. The filter only compares the second line number with the first to decide whether to print the line numbers.
Text that repeats farther apart is still printed along with its count, but NOT its line numbers.
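Just put a single vtt or srt file in a directory and run the awk script from there. The script uses arrays of arrays (lines[$0]["count"]), which need GNU awk (gawk) 4.0 or later.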
for WebVTT subtitles
awk '{ x = lines[$0]["count"]++; lines[$0]["NR"][x] = NR; }END {fmt_s = "%sx %" max "-s %s\n\n"; for (i in lines) {if (lines[i]["count"] > 2 && lines[i]["count"] < 99) {for (j = 0; j < lines[i]["count"]; j++) {s = s lines[i]["NR"][j] ", ";} s = substr(s, 1, length(s) - 2); printf(fmt_s, lines[i]["count"], i, "\n" s ); s = "";}}}' *.vtt | awk -F, '$2 < $1+13' | grep -E -A1 '^[0-9]{,2}x' --color=always | less -r
for SRT subtitles
awk '{ x = lines[$0]["count"]++; lines[$0]["NR"][x] = NR; }END {fmt_s = "%sx %" max "-s %s\n\n"; for (i in lines) {if (lines[i]["count"] > 2 && lines[i]["count"] < 99) {for (j = 0; j < lines[i]["count"]; j++) {s = s lines[i]["NR"][j] ", ";} s = substr(s, 1, length(s) - 2); printf(fmt_s, lines[i]["count"], i, "\n" s ); s = "";}}}' *.srt | awk -F, '$2 < $1+13' | grep -E -A1 '^[0-9]{,2}x' --color=always | less -r
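For anyone who finds the one-liner hard to read, here is a minimal sketch of the same idea written out as a commented shell function. It is not the original script. The name find_repeats is made up for illustration, it uses plain POSIX awk instead of gawk arrays of arrays, it skips blank lines explicitly, and it prints only the entries that pass the 12-line check (the one-liner also prints the far-apart ones, just without line numbers). The order of the entries is whatever awk's for-in loop yields.

```shell
# find_repeats FILE  (hypothetical helper name, not part of the original post)
# Prints "<count>x <text>" followed by the line numbers of every occurrence,
# for text that occurs 3 to 98 times and whose second occurrence is at most
# 12 lines after the first.
find_repeats() {
  awk '
    # Assumption: blank lines are never interesting, so skip them outright.
    $0 == "" { next }
    {
      count[$0]++
      if (count[$0] == 1) {
        first[$0] = NR
        where[$0] = NR
      } else {
        if (count[$0] == 2) second[$0] = NR
        where[$0] = where[$0] ", " NR
      }
    }
    END {
      for (text in count) {
        # Same thresholds as the one-liner: count > 2, count < 99,
        # second line number < first line number + 13.
        if (count[text] > 2 && count[text] < 99 && second[text] < first[text] + 13)
          printf "%dx %s\n%s\n\n", count[text], text, where[text]
      }
    }
  ' "$1"
}
```

Usage would be something like find_repeats audiobook.vtt | less, with audiobook.vtt standing in for your own file.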
The output will look like this. Wherever line numbers are printed, that is where the possible unwanted repeats are. You can either try to fix the audio and retranscribe, or simply change the subtitles to say something like "subtitles will resume at 7m 45s" or whatever. Note the line numbers: the first entry shows 6 repeats of [ on lines 181, 184, 187, 196, 205, 214. The run 181, 184, 187 means the text repeats in three consecutive cues; the two lines in between are the timecode and the blank line.
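To make the spacing easier to see, the line-number list can be turned into gaps between consecutive occurrences. This is only a sketch; line_gaps is a hypothetical helper name, and it assumes the list is in the "181, 184, 187" format shown above. A gap of 3 (or 4 when the file has cue index numbers) means the text repeats in the very next cue.

```shell
# line_gaps "N1, N2, N3, ..."  (hypothetical helper, not from the original post)
# Prints the difference between each line number and the one before it.
line_gaps() {
  printf '%s\n' "$1" | awk -F', *' '{
    out = ""
    for (i = 2; i <= NF; i++) {
      gap = $i - $(i - 1)
      out = (i == 2) ? gap : out " " gap
    }
    print out
  }'
}

line_gaps "181, 184, 187, 196, 205, 214"   # prints: 3 3 9 9 9
```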
Showing duplicate timecodes repeating three times.
For this particular problem the repeats run from 3m 48s to 7m 45s, so proper subtitles were missing for about four minutes. Listening to the audio around the 3m 48s mark revealed the culprit: 4 or 5 seconds of silence, then a clicking sound, then a few more seconds of silence, then a book title read in Latin in an otherwise English audiobook. To fix it I re-encoded the audiobook with each chapter as a single file, used LosslessCut on that chapter to cut out the unwanted part and merge the pieces back into one chapter, re-encoded the entire audiobook, and transcribed it with whisper.cpp again. The four minutes that were messed up previously then came out perfectly.
Sidenote: the screenshots show the VSCode extension Highlight Duplicates, but it doesn't really help because you have to look for the highlights manually. Another extension prints a report of counted dupes but doesn't reference the line numbers. Syntax highlighting of SRT or VTT subs is from the extension Subtitles Editor. I found many awk scripts and sort/uniq scripts that print the number of occurrences of dupes, but they didn't reference the line numbers either. Even cat -n audiobook.vtt followed by, say, uniq -f 1 to skip field one was a no go, at least for me.
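If you already know which text is repeating and only want its line numbers, plain grep can list them. This is a small sketch rather than a replacement for the awk script; lines_of is a made-up helper name.

```shell
# lines_of TEXT FILE  (hypothetical helper, not from the original post)
# -n prints line numbers, -F takes TEXT literally (so "[" is not a regex),
# -x matches whole lines only; cut keeps just the number before the first colon.
lines_of() {
  grep -n -F -x -e "$1" "$2" | cut -d: -f1
}
```

For example, lines_of '[music]' audiobook.vtt prints one line number per exact match, with audiobook.vtt standing in for your own file.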