Segment very long! #95

Open
gianfelicevincenzo opened this issue Jun 12, 2024 · 4 comments

@gianfelicevincenzo

gianfelicevincenzo commented Jun 12, 2024

Is it possible to divide the segment into smaller parts?

@tkarabela
Owner

If you wish to split a subtitle into shorter ones, there is no built-in function for this, but it can easily be done:

from pysubs2 import SSAFile, SSAEvent
from itertools import chain
from textwrap import wrap

input_srt = """\
1
00:00:00,000 --> 00:10:00,000
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Maecenas sollicitudin. Sed convallis magna eu sem.
Etiam bibendum elit eget erat. Proin mattis lacinia justo. Etiam posuere lacus quis dolor.

2
00:12:00,000 --> 00:20:00,000
Nulla turpis magna, cursus sit amet, suscipit a, interdum id, felis. Aenean placerat. Nullam rhoncus aliquam metus.
Curabitur vitae diam non enim vestibulum interdum. In laoreet, magna id viverra tincidunt, sem odio bibendum justo, vel imperdiet sapien wisi sed libero.

3
00:32:00,000 --> 00:35:00,000
Short subtitle.

"""


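def split_event(e: SSAEvent) -> list[SSAEvent]:
    """Split an event with more than 10 words into two halves of equal duration."""
    words = e.plaintext.split()
    n = len(words)
    if n > 10:
        e1 = e.copy()
        e2 = e.copy()
        t = int((e.start + e.end) / 2)  # midpoint of the event, in milliseconds
        # First half of the words, re-wrapped to lines of at most 60 characters
        e1.plaintext = "\n".join(wrap(" ".join(words[:n // 2]), 60))
        e1.end = t
        # Second half of the words starts where the first half ends
        e2.plaintext = "\n".join(wrap(" ".join(words[n // 2:]), 60))
        e2.start = t
        return [e1, e2]
    else:
        return [e]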


subs = SSAFile.from_string(input_srt)
subs.events = list(chain(*map(split_event, subs.events)))
print(subs.to_string("srt"))

# 1
# 00:00:00,000 --> 00:05:00,000
# Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
# Maecenas sollicitudin. Sed convallis magna eu
# 
# 2
# 00:05:00,000 --> 00:10:00,000
# sem. Etiam bibendum elit eget erat. Proin mattis lacinia
# justo. Etiam posuere lacus quis dolor.
# 
# 3
# 00:12:00,000 --> 00:16:00,000
# Nulla turpis magna, cursus sit amet, suscipit a, interdum
# id, felis. Aenean placerat. Nullam rhoncus aliquam metus.
# Curabitur vitae diam
# 
# 4
# 00:16:00,000 --> 00:20:00,000
# non enim vestibulum interdum. In laoreet, magna id viverra
# tincidunt, sem odio bibendum justo, vel imperdiet sapien
# wisi sed libero.
# 
# 5
# 00:32:00,000 --> 00:35:00,000
# Short subtitle.

@gianfelicevincenzo
Author

gianfelicevincenzo commented Jun 12, 2024

Yes, that's exactly what I did. Thank you. I noticed that your script assigns to .plaintext. What is the best approach between .text and .plaintext?

I used textwrap.fill instead.
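
For reference, textwrap.fill is documented as shorthand for joining the lines returned by textwrap.wrap, so the two approaches should produce the same text. A minimal check, using a sample string taken from the input above:

```python
from textwrap import fill, wrap

text = (
    "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. "
    "Maecenas sollicitudin. Sed convallis magna eu sem."
)

# fill() returns one string; wrap() returns a list of lines.
filled = fill(text, 60)
joined = "\n".join(wrap(text, 60))

assert filled == joined
print(filled)
```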

@m3t4f1v3

Unfortunately, the code you've provided doesn't guarantee that each section will be 10 words or fewer (I tried lowering the limit to 3 words), because each event is split only once. Applying split_event repeatedly does guarantee it.

I've adjusted the code (in a very hacky way) to keep iterating until it meets the requirement:

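import pysubs2
from itertools import chain

# Must match the threshold inside split_event (changed there from 10 to 3);
# if the two differ, this loop may never terminate.
MAX_WORDS = 3

subs = pysubs2.load("subs.srt")

# Keep halving until every event is within the word limit.
while any(len(chunk.plaintext.split()) > MAX_WORDS for chunk in subs):
    subs.events = list(chain(*map(split_event, subs.events)))

print(subs.to_string("srt"))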
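
To illustrate why a single pass is not enough, here is a small sketch that tracks only the word counts produced by repeated halving, using the same words[:n // 2] / words[n // 2:] split as split_event. The helper name halve_until is hypothetical, and it assumes max_words is at least 1.

```python
def halve_until(n_words: int, max_words: int) -> list[int]:
    """Word counts of the pieces after halving until all fit max_words."""
    pieces = [n_words]
    while any(n > max_words for n in pieces):
        next_pieces = []
        for n in pieces:
            if n > max_words:
                # Same split point as split_event: first n // 2 words, then the rest.
                next_pieces.extend([n // 2, n - n // 2])
            else:
                next_pieces.append(n)
        pieces = next_pieces
    return pieces


# A 32-word subtitle is still 16 + 16 words after one pass with a limit of 10,
# so a second pass is needed.
print(halve_until(32, 10))  # [8, 8, 8, 8]
```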

@tkarabela
Owner

tkarabela commented Jun 13, 2024

Yeah, my code was just meant to illustrate how to split a subtitle into multiple ones using the library API. The actual splitting logic can be as sophisticated as needed :)
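
As one possible refinement, the sketch below splits a cue into as many chunks as needed in a single pass and gives each chunk a share of the duration proportional to its word count. It is plain Python on (start, end, words) tuples rather than library objects; the function name split_words and the proportional timing are assumptions for illustration, not part of pysubs2.

```python
from math import ceil


def split_words(start: int, end: int, words: list[str],
                max_words: int) -> list[tuple[int, int, list[str]]]:
    """Split one cue into chunks of at most max_words words (max_words >= 1).

    start and end are in milliseconds. Chunks are contiguous in time and
    each one lasts in proportion to the number of words it holds.
    """
    n = len(words)
    if n <= max_words:
        return [(start, end, words)]
    k = ceil(n / max_words)  # smallest number of chunks that can fit the limit
    # Chunk i holds words[bounds[i]:bounds[i + 1]]; sizes differ by at most one.
    bounds = [i * n // k for i in range(k + 1)]
    chunks = []
    for lo, hi in zip(bounds, bounds[1:]):
        t0 = start + (end - start) * lo // n
        t1 = start + (end - start) * hi // n
        chunks.append((t0, t1, words[lo:hi]))
    return chunks


words = "one two three four five six seven".split()
for t0, t1, chunk in split_words(0, 7000, words, 3):
    print(t0, t1, " ".join(chunk))
```

With pysubs2, the inputs would presumably be e.start, e.end and e.plaintext.split(), and each returned chunk would be written into a copy of the event, as split_event does above.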

> What is the best approach between .text and .plaintext?

Splitting .plaintext is straightforward, but it removes any inline formatting (e.g. <i>lorem</i> in the source SRT file).

Correctly splitting .text is tricky. You would want to split the chunks returned by https://pysubs2.readthedocs.io/en/latest/api-reference.html#pysubs2.formats.substation.parse_tags but there is currently no function to recombine them into a single text with override tags. In addition, event-wide SubStation tags like \pos or \t are not currently supported by parse_tags(). If you only care about specific cases, e.g. preserving italics, you could re-add the tags manually by looking at the computed styles returned by parse_tags().

Input to consider:

1
00:00:00,000 --> 00:10:00,000
<i>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Maecenas sollicitudin. Sed convallis magna eu sem.
Etiam bibendum elit eget erat. Proin mattis lacinia justo. Etiam posuere lacus quis dolor.</i>
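
A rough sketch of the "re-add italics manually" idea follows. It is plain Python and does not call pysubs2: it assumes you have already reduced each fragment from parse_tags() to a (text, italic) pair, for example from the fragment text and the italic flag of its computed style, and that line breaks have been replaced by spaces. The helper names are hypothetical, and italics changes are assumed to fall on word boundaries.

```python
def words_with_italics(fragments: list[tuple[str, bool]]) -> list[tuple[str, bool]]:
    """Flatten (text, italic) fragments into (word, italic) pairs."""
    return [(word, italic) for text, italic in fragments for word in text.split()]


def render(words: list[tuple[str, bool]]) -> str:
    """Join words with spaces, adding italic override tags where the flag changes."""
    tokens = []
    current = False  # each rendered piece starts from non-italic
    for word, italic in words:
        if italic != current:
            word = (r"{\i1}" if italic else r"{\i0}") + word
            current = italic
        tokens.append(word)
    text = " ".join(tokens)
    if current:
        text += r"{\i0}"  # close italics so the piece is self-contained
    return text


fragments = [("Lorem ipsum dolor", True), ("sit amet", False)]
words = words_with_italics(fragments)
first, second = render(words[:2]), render(words[2:])
print(first)
print(second)
```

Because every piece is rendered from the per-word flags, the italics survive wherever the split point falls. Other override tags are not handled here.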
