Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timestamp tag regression in 0.26.0 #99

Closed
WithoutPants opened this issue Mar 12, 2024 · 3 comments · Fixed by #102
Closed

Timestamp tag regression in 0.26.0 #99

WithoutPants opened this issue Mar 12, 2024 · 3 comments · Fixed by #102

Comments

@WithoutPants
Copy link
Contributor

Inline timestamp tags are rendered incorrectly.

The following test illustrates this issue:

func TestWebVTTWithTimestampTag(t *testing.T) {
	testData := `WEBVTT

	00:01:00.000 --> 00:02:00.000
	Sentence with a timestamp<00:01:02.000> in the middle`

	s, err := astisub.ReadFromWebVTT(strings.NewReader(testData))
	require.NoError(t, err)

	require.Len(t, s.Items, 1)

	b := &bytes.Buffer{}
	err = s.WriteToWebVTT(b)
	require.NoError(t, err)
	require.Equal(t, `WEBVTT

1
00:01:00.000 --> 00:02:00.000
Sentence with an timestamp<00:01:02.000> in the middle
`, b.String())
}

This test passes in v0.25.1 and fails in 0.26.0.

I can't determine if this is the same bug as #94 as the original file is no longer available. I can verify that #96 does not fix this issue.

I was able to determine that this was introduced by 1e3a211

I was able to fix this with the following diff:

diff --git a/webvtt.go b/webvtt.go
index 3b1f5e4..4f56392 100644
--- a/webvtt.go
+++ b/webvtt.go
@@ -574,7 +574,7 @@ func (li LineItem) webVTTBytes() (c []byte) {
                        c = append(c, []byte(tag.startTag())...)
                }
        }
-       c = append(c, []byte(escapeWebVTT(li.Text))...)
+       c = append(c, []byte(li.Text)...)
        if li.InlineStyle != nil {
                noTags := len(li.InlineStyle.WebVTTTags)
                for i := noTags - 1; i >= 0; i-- {

But of course, this breaks the TestWebVTTEscape test. I'm not familiar enough with the code to determine a proper fix, but happy to do a PR if you have an idea on how to address this.

WithoutPants added a commit to WithoutPants/stash that referenced this issue Mar 12, 2024
WithoutPants added a commit to stashapp/stash that referenced this issue Mar 12, 2024
@asticode
Copy link
Owner

asticode commented Mar 13, 2024

We need to add proper handling of inline timestamp tags indeed and I'd be happy to point you to the proper direction for a PR 👍

Here's what needs to be done:

  • add a StartAt time.Time attribute to LineItem. Add(), Fragment() and similar methods, as well as their tests, would need to be updated as well to process this new attribute, but that would make this PR way more complicated and I wouldn't mind to do that part after merging your changes. It's up to you whether you want to update those methods as well in this PR.
  • in parseTextWebVTT, when processing an html tag, start by determining whether its content is a timestamp. If so, store it in a startAt variable. Later, when creating the LineItem, provide the StartAt value as well.
  • finally, in LineItem.webVTTBytes(), if !StartAt.IsZero() append the proper webvtt tag
  • add a proper test for inline timestamp tags in webvtt_test.go

Let me know whether you feel like updating Add(), Fragment() and similar methods. Again, I don't mind doing it myself after merging your PR, we'll just have to add the proper TODOs in the code if that's the case 👍

@WithoutPants
Copy link
Contributor Author

The html tokenizer doesn't actually recognise the timestamp tag as a start tag token, presumably because the first character after the < is numeric. I'm guessing we would need to split the resultant text or tokenize it further. I've given it a brief go, but I suspect you might have a better idea.

Let me know whether you feel like updating Add(), Fragment() and similar methods. Again, I don't mind doing it myself after merging your PR, we'll just have to add the proper TODOs in the code if that's the case 👍

Having had a quick look, I think it's probably beyond my current knowledge of the codebase, so I'd prefer to leave that as TODOs.

@asticode
Copy link
Owner

Having had a quick look, I think it's probably beyond my current knowledge of the codebase, so I'd prefer to leave that as TODOs.

👍

The html tokenizer doesn't actually recognise the timestamp tag as a start tag token, presumably because the first character after the < is numeric. I'm guessing we would need to split the resultant text or tokenize it further. I've given it a brief go, but I suspect you might have a better idea.

Splitting the input text at the start of parseTextWebVTT based on inline timestamp tags seems like a good idea. We would then have to loop through split items to run the existing logic 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants