There is no way to tell a segment linebreak from a real linebreak in verbose_json #2381

Open
C0rn3j opened this issue Aug 23, 2024 · 0 comments

C0rn3j commented Aug 23, 2024

Regular JSON:

[{"text": " Hello, this message is long enough to have a line break, so it will have a line break.\n"}]

verbose_json:

"text": " Hello, this message is long enough to have a line break, so\n it will have a line break.\n"

Actual full output:

[{"task": "transcribe", "language": "english", "duration": 6.383999824523926, "text": " Hello, this message is long enough to have a line break, so\n it will have a line break.\n", "segments": [{"id": 0, "text": " Hello, this message is long enough to have a line break, so", "start": 0.0, "end": 4.3, "tokens": [2425, 11, 341, 3636, 307, 938, 1547, 281, 362, 257, 1622, 1821, 11, 370], "words": [{"word": " Hello", "start": 0.01, "end": 0.43, "t_dtw": -1, "probability": 0.9765891432762146}, {"word": ",", "start": 0.43, "end": 0.54, "t_dtw": -1, "probability": 0.9967133402824402}, {"word": " this", "start": 0.87, "end": 0.9400000000000001, "t_dtw": -1, "probability": 1.0}, {"word": " message", "start": 0.9400000000000001, "end": 1.55, "t_dtw": -1, "probability": 1.0}, {"word": " is", "start": 1.55, "end": 1.72, "t_dtw": -1, "probability": 1.0}, {"word": " long", "start": 1.72, "end": 2.0, "t_dtw": -1, "probability": 1.0}, {"word": " enough", "start": 2.09, "end": 2.58, "t_dtw": -1, "probability": 1.0}, {"word": " to", "start": 2.58, "end": 2.75, "t_dtw": -1, "probability": 1.0}, {"word": " have", "start": 2.75, "end": 3.09, "t_dtw": -1, "probability": 1.0}, {"word": " a", "start": 3.09, "end": 3.17, "t_dtw": -1, "probability": 1.0}, {"word": " line", "start": 3.17, "end": 3.5, "t_dtw": -1, "probability": 1.0}, {"word": " break", "start": 3.5, "end": 3.83, "t_dtw": -1, "probability": 1.0}, {"word": ",", "start": 4.0, "end": 4.11, "t_dtw": -1, "probability": 0.9934980869293213}, {"word": " so", "start": 4.11, "end": 4.25, "t_dtw": -1, "probability": 1.0}], "temperature": 0.20000000298023224, "avg_logprob": -0.0022336323745548725}, {"id": 1, "text": " it will have a line break.", "start": 4.3, "end": 6.38, "tokens": [309, 486, 362, 257, 1622, 1821, 13], "words": [{"word": " it", "start": 4.3, "end": 4.45, "t_dtw": -1, "probability": 1.0}, {"word": " will", "start": 4.45, "end": 4.79, "t_dtw": -1, "probability": 1.0}, {"word": " have", "start": 4.79, "end": 5.13, "t_dtw": 
-1, "probability": 1.0}, {"word": " a", "start": 5.13, "end": 5.21, "t_dtw": -1, "probability": 1.0}, {"word": " line", "start": 5.21, "end": 5.38, "t_dtw": -1, "probability": 1.0}, {"word": " break", "start": 5.54, "end": 5.98, "t_dtw": -1, "probability": 1.0}, {"word": ".", "start": 5.98, "end": 6.38, "t_dtw": -1, "probability": 1.0}], "temperature": 0.20000000298023224, "avg_logprob": 0.0}]}]

This becomes a problem when trying to identify actual newlines in order to filter out hallucinations:

"text": " Hello, this message is long enough to have a line break, so\n it will have a line break.\n Thank you.\n"
[{"task": "transcribe", "language": "english", "duration": 7.684000015258789, "text": " Hello, this message is long enough to have a line break, so\n it will have a line break.\n Thank you.\n", "segments": [{"id": 0, "text": " Hello, this message is long enough to have a line break, so", "start": 0.0, "end": 3.36, "tokens": [2425, 11, 341, 3636, 307, 938, 1547, 281, 362, 257, 1622, 1821, 11, 370], "words": [{"word": " Hello", "start": 0.32, "end": 0.32, "t_dtw": -1, "probability": 0.9946319460868835}, {"word": ",", "start": 0.35000000000000003, "end": 0.45, "t_dtw": -1, "probability": 0.8468424081802368}, {"word": " this", "start": 0.45, "end": 0.71, "t_dtw": -1, "probability": 1.0}, {"word": " message", "start": 0.71, "end": 0.88, "t_dtw": -1, "probability": 1.0}, {"word": " is", "start": 1.18, "end": 1.29, "t_dtw": -1, "probability": 1.0}, {"word": " long", "start": 1.29, "end": 1.54, "t_dtw": -1, "probability": 1.0}, {"word": " enough", "start": 1.56, "end": 1.94, "t_dtw": -1, "probability": 1.0}, {"word": " to", "start": 1.94, "end": 2.07, "t_dtw": -1, "probability": 1.0}, {"word": " have", "start": 2.07, "end": 2.2600000000000002, "t_dtw": -1, "probability": 1.0}, {"word": " a", "start": 2.37, "end": 2.39, "t_dtw": -1, "probability": 1.0}, {"word": " line", "start": 2.39, "end": 2.65, "t_dtw": -1, "probability": 1.0}, {"word": " break", "start": 2.65, "end": 2.97, "t_dtw": -1, "probability": 1.0}, {"word": ",", "start": 2.97, "end": 3.1, "t_dtw": -1, "probability": 0.9999771118164062}, {"word": " so", "start": 3.1, "end": 3.17, "t_dtw": -1, "probability": 1.0}], "temperature": 0.20000000298023224, "avg_logprob": -0.01144307479262352}, {"id": 1, "text": " it will have a line break.", "start": 3.36, "end": 4.8, "tokens": [309, 486, 362, 257, 1622, 1821, 13], "words": [{"word": " it", "start": 3.36, "end": 3.36, "t_dtw": -1, "probability": 1.0}, {"word": " will", "start": 3.48, "end": 3.62, "t_dtw": -1, "probability": 1.0}, {"word": " have", "start": 3.62, "end": 
3.88, "t_dtw": -1, "probability": 1.0}, {"word": " a", "start": 3.88, "end": 3.94, "t_dtw": -1, "probability": 1.0}, {"word": " line", "start": 3.94, "end": 4.2, "t_dtw": -1, "probability": 1.0}, {"word": " break", "start": 4.2, "end": 4.5200000000000005, "t_dtw": -1, "probability": 1.0}, {"word": ".", "start": 4.5200000000000005, "end": 4.79, "t_dtw": -1, "probability": 1.0}], "temperature": 0.20000000298023224, "avg_logprob": 0.0}, {"id": 2, "text": " Thank you.", "start": 4.8, "end": 6.8, "tokens": [1044, 291, 13], "words": [{"word": " Thank", "start": 4.8, "end": 5.7, "t_dtw": -1, "probability": 0.9997387528419495}, {"word": " you", "start": 5.7, "end": 6.24, "t_dtw": -1, "probability": 1.0}, {"word": ".", "start": 6.24, "end": 6.78, "t_dtw": -1, "probability": 0.9999942779541016}], "temperature": 0.20000000298023224, "avg_logprob": -5.340576171875e-05}]}]

In this case I can no longer split on newlines, because doing so leaves me with broken-up sentences:

Hello, this is a test of verbose JSON. This message is
meant to be really long so it can line break, but it won't.
 Thank you.

And if I read the text of the segments instead, I simply lose the newlines altogether:
Hello, this message is long enough to have a line break, so it will have a line break. Thank you.
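
The two approaches can be sketched in Python roughly as follows. This is only an illustration: the `response` dict is a trimmed, hand-copied subset of the second output shown above (just `text` and the segment `text` fields), and the helper names `split_full_text` and `join_segments` are hypothetical.

```python
# Trimmed copy of the verbose_json output above; only the fields needed here.
response = {
    "text": " Hello, this message is long enough to have a line break, so\n"
            " it will have a line break.\n Thank you.\n",
    "segments": [
        {"id": 0, "text": " Hello, this message is long enough to have a line break, so"},
        {"id": 1, "text": " it will have a line break."},
        {"id": 2, "text": " Thank you."},
    ],
}


def split_full_text(resp):
    """Split the full text on newlines (hypothetical helper)."""
    return [line.strip() for line in resp["text"].splitlines() if line.strip()]


def join_segments(resp):
    """Concatenate the segment texts (hypothetical helper)."""
    return "".join(seg["text"] for seg in resp["segments"]).strip()


lines = split_full_text(response)
joined = join_segments(response)
segment_texts = [seg["text"].strip() for seg in response["segments"]]

print(lines)                   # the first sentence is cut in two
print(joined)                  # one line; the break before "Thank you." is gone
print(lines == segment_texts)  # the newlines only repeat the segment boundaries
```

In this sample, splitting the full text yields exactly the segment texts, so the newlines in `text` carry no information beyond where the segments end.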

I would prefer the full text in verbose_json not to contain the inserted newlines. On top of that, each segment could carry a marker indicating whether it ends a line or whether more text follows on the same line. That way the text would be correct in the full text field and could also be reconstructed correctly from the segment texts.
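
To make the marker idea concrete, here is a minimal sketch of how a client might rebuild lines if such a field existed. The field name `ends_line` is purely hypothetical: nothing like it appears in the output shown above, and the `True`/`False` values below are assumed for illustration (first two segments form one line, "Thank you." is its own line).

```python
def rebuild_lines(segments, marker="ends_line"):
    """Rebuild lines from segments using a HYPOTHETICAL boolean marker field."""
    lines, current = [], ""
    for seg in segments:
        current += seg["text"]
        # A segment flagged as ending a line closes the line built so far.
        if seg.get(marker, False):
            lines.append(current.strip())
            current = ""
    # Keep any trailing text that was never closed by a marker.
    if current.strip():
        lines.append(current.strip())
    return lines


# Segment texts copied from the issue; the marker values are assumptions.
segments = [
    {"text": " Hello, this message is long enough to have a line break, so", "ends_line": False},
    {"text": " it will have a line break.", "ends_line": True},
    {"text": " Thank you.", "ends_line": True},
]

rebuilt = rebuild_lines(segments)
for line in rebuilt:
    print(line)
```

With a marker like this, a client could separate the trailing "Thank you." from the real sentence without the sentence itself being split.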

The relevant code to reproduce this is as follows:

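       # f is the audio file path and cpp_url the server endpoint; needs `import requests`
       files = {'file': (f, open(f, 'rb'))}
       data = {'temperature': '0.2', 'response_format': 'verbose_json'}

       try:
           response = requests.post(cpp_url, files=files, data=data)
           response.raise_for_status()  # treat HTTP error statuses as failures
       except requests.RequestException as e:
           print(f'Request failed: {e}')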