Segment very long! #95

gianfelicevincenzo · 2024-06-12T17:27:15Z

Is it possible to divide the segment into smaller parts?

tkarabela · 2024-06-12T19:31:44Z

If you wish to split a subtitle into shorter ones, there is no built-in function for this, but it can easily be done:

from pysubs2 import SSAFile, SSAEvent
from itertools import chain
from textwrap import wrap

input_srt = """\
1
00:00:00,000 --> 00:10:00,000
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Maecenas sollicitudin. Sed convallis magna eu sem.
Etiam bibendum elit eget erat. Proin mattis lacinia justo. Etiam posuere lacus quis dolor.

2
00:12:00,000 --> 00:20:00,000
Nulla turpis magna, cursus sit amet, suscipit a, interdum id, felis. Aenean placerat. Nullam rhoncus aliquam metus.
Curabitur vitae diam non enim vestibulum interdum. In laoreet, magna id viverra tincidunt, sem odio bibendum justo, vel imperdiet sapien wisi sed libero.

3
00:32:00,000 --> 00:35:00,000
Short subtitle.

"""


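def split_event(e: SSAEvent) -> list[SSAEvent]:
    # Split on whitespace; .plaintext drops any inline formatting tags
    words = e.plaintext.split()
    n = len(words)
    if n > 10:
        # Long subtitle: make two copies and give each half of the words
        e1 = e.copy()
        e2 = e.copy()
        # Midpoint of the original time span, in milliseconds
        t = (e.start + e.end) // 2
        # Re-wrap each half to lines of at most 60 characters
        e1.plaintext = "\n".join(wrap(" ".join(words[:n // 2]), 60))
        e1.end = t
        e2.plaintext = "\n".join(wrap(" ".join(words[n // 2:]), 60))
        e2.start = t
        return [e1, e2]
    else:
        # Short enough already: keep the event unchanged
        return [e]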


subs = SSAFile.from_string(input_srt)
subs.events = list(chain(*map(split_event, subs.events)))
print(subs.to_string("srt"))

# 1
# 00:00:00,000 --> 00:05:00,000
# Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
# Maecenas sollicitudin. Sed convallis magna eu
# 
# 2
# 00:05:00,000 --> 00:10:00,000
# sem. Etiam bibendum elit eget erat. Proin mattis lacinia
# justo. Etiam posuere lacus quis dolor.
# 
# 3
# 00:12:00,000 --> 00:16:00,000
# Nulla turpis magna, cursus sit amet, suscipit a, interdum
# id, felis. Aenean placerat. Nullam rhoncus aliquam metus.
# Curabitur vitae diam
# 
# 4
# 00:16:00,000 --> 00:20:00,000
# non enim vestibulum interdum. In laoreet, magna id viverra
# tincidunt, sem odio bibendum justo, vel imperdiet sapien
# wisi sed libero.
# 
# 5
# 00:32:00,000 --> 00:35:00,000
# Short subtitle.

gianfelicevincenzo · 2024-06-12T21:59:39Z

Yes, that's exactly what I did. Thank you. In your script I noticed that you assign to ".plaintext". Which is the better approach, .text or .plaintext?

I used textwrap.fill instead.
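For reference, the Python documentation describes textwrap.fill(text, width) as shorthand for "\n".join(textwrap.wrap(text, width)), so both forms should give the same line breaks. A quick check, using a made-up sample string:

```python
from textwrap import fill, wrap

# Sample text is made up for illustration.
text = "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Maecenas sollicitudin."

joined = "\n".join(wrap(text, 60))  # form used in split_event above
filled = fill(text, 60)             # shorthand for the same operation

print(joined == filled)
```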

m3t4f1v3 · 2024-06-13T13:22:31Z

Unfortunately, the code you've provided doesn't guarantee that each section will be 10 words or fewer (I tried lowering the limit to 3 words), but applying the split_event function repeatedly does.

I've adjusted the code (in a very hacky way) to keep iterating until it meets the requirement:

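import pysubs2
from itertools import chain

# Assumes split_event() from the comment above, with its threshold
# lowered from 10 to MAX_WORDS; otherwise this loop never terminates.
MAX_WORDS = 3

subs = pysubs2.load("subs.srt")

# Keep splitting until every event has at most MAX_WORDS words
while any(len(e.plaintext.split()) > MAX_WORDS for e in subs):
    subs.events = list(chain(*map(split_event, subs.events)))

print(subs.to_string("srt"))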
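As an alternative to repeated halving, the text can be cut into chunks of at most N words in one pass, with the time span divided in proportion to the word counts. The following is only a sketch of that idea on plain (start, end, text) values; the function name split_timed_text and its parameters are hypothetical and not part of pysubs2.

```python
from textwrap import fill


def split_timed_text(start: int, end: int, text: str,
                     max_words: int = 3, width: int = 60) -> list[tuple[int, int, str]]:
    """Split text into chunks of at most max_words words (max_words >= 1).

    The interval from start to end (milliseconds) is divided in
    proportion to the number of words in each chunk.
    """
    words = text.split()
    n = len(words)
    if n <= max_words:
        return [(start, end, text)]
    pieces = []
    for i in range(0, n, max_words):
        j = min(i + max_words, n)
        # Chunk boundaries sit at the word-count fraction of the duration.
        t0 = start + (end - start) * i // n
        t1 = start + (end - start) * j // n
        pieces.append((t0, t1, fill(" ".join(words[i:j]), width)))
    return pieces


demo = split_timed_text(0, 600_000, "one two three four five six seven", max_words=3)
for piece in demo:
    print(piece)
```

To apply this to a subtitle file, one could call it with e.start, e.end and e.plaintext for each event, then build the new events with e.copy() and assign start, end and plaintext, as split_event does.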

tkarabela · 2024-06-13T23:27:32Z

Yeah, my code was just meant to illustrate how to split a subtitle into multiple ones using the library API. The actual splitting logic can be as sophisticated as needed :)

> What is the best approach between .text and .plaintext?

Splitting .plaintext is straightforward, but it removes any inline formatting (e.g. <i>lorem</i> in the source SRT file).

Correctly splitting .text is tricky. You would want to split the chunks returned by https://pysubs2.readthedocs.io/en/latest/api-reference.html#pysubs2.formats.substation.parse_tags but there is currently no function to recombine them into a single text with override tags (not to mention event-wide SubStation tags like \pos or \t, which parse_tags() does not currently support). But if you only care about specific cases, e.g. preserving italics, you could re-add the tags manually by looking at the computed styles returned by parse_tags().

Input to consider:

1
00:00:00,000 --> 00:10:00,000
<i>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Maecenas sollicitudin. Sed convallis magna eu sem.
Etiam bibendum elit eget erat. Proin mattis lacinia justo. Etiam posuere lacus quis dolor.</i>
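For an input like this, where the whole subtitle is one italic span, re-adding the tags by hand could look roughly like the sketch below. It works on plain strings and does not call parse_tags(). It assumes (not verified here) that after loading the SRT file, .text holds the italics as {\i1}...{\i0} and line breaks as \N. The helper name split_keep_italics is hypothetical.

```python
ITALIC_ON = r"{\i1}"
ITALIC_OFF = r"{\i0}"


def split_keep_italics(text: str) -> list[str]:
    """Split text in half by words, re-adding italics tags to both halves.

    Only handles text wrapped in a single italic span with no other
    override tags inside; anything else is returned unsplit.
    """
    if not (text.startswith(ITALIC_ON) and text.endswith(ITALIC_OFF)):
        return [text]
    inner = text[len(ITALIC_ON):-len(ITALIC_OFF)]
    if "{" in inner:
        return [text]  # nested override tags: too complex for this sketch
    # Treat SubStation hard line breaks as ordinary word separators.
    words = inner.replace(r"\N", " ").split()
    half = len(words) // 2
    if half == 0:
        return [text]
    return [ITALIC_ON + " ".join(part) + ITALIC_OFF
            for part in (words[:half], words[half:])]


halves = split_keep_italics(r"{\i1}Lorem ipsum dolor\Nsit amet elit{\i0}")
print(halves)
```

The resulting strings would be assigned to .text (not .plaintext) of the copied events, since assigning to .plaintext would not interpret the tags.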

tkarabela added the question label Jun 12, 2024
