Add intersection of input, reference to overlap #1817
@@ -106,6 +107,20 @@ def create_ngram_index(
ngram_index[n][reference_ngram].add(
EntryDataOverlapKey(stats_key=stats_key, instance_id=id, part=PART_REF)
)

# compute intersection ngrams [defined as n-grams that occur between instance and reference]
It's not clear what "intersection" means exactly, or what "between instance and reference" means. Can this be defined a bit more precisely in natural language?
Yep, the example I give in the PR description is:
input = ["is 2+2 4 true or false"] and reference = ["true"]. The intersection is the 5-gram ["4 true or false true"], which is formed from the input 4-gram [4 true or false] and the reference 1-gram [true]. I'll reword the comment.
@@ -110,7 +110,7 @@ def create_ngram_index(

# concatenate the last n-1 tokens of input and the first n-1 tokens of reference and compute n-grams on this "intersection token" sequence
# for instance: input = ["is 2+2 4 true or false"] reference = ["true"]
# the intersection is the 5-gram ["4 true or false true"]
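The rule in the comment above can be sketched as a standalone function. This is only an illustration of the idea, not the code in this PR: the name `boundary_ngrams` and its signature are hypothetical, and tokens are assumed to be plain whitespace-split strings.

```python
from typing import List, Tuple


def boundary_ngrams(
    input_tokens: List[str], reference_tokens: List[str], n: int
) -> List[Tuple[str, ...]]:
    """Return the n-grams that straddle the input/reference boundary.

    Hypothetical helper: takes the last n-1 input tokens and the first n-1
    reference tokens, concatenates them, and slides a window of size n.
    Each side contributes at most n-1 tokens, so every window of size n
    contains at least one token from each side.
    """
    if n < 2:
        return []  # a 1-gram cannot span two parts
    tail = input_tokens[-(n - 1):]   # last n-1 tokens of the input
    head = reference_tokens[: n - 1]  # first n-1 tokens of the reference
    joined = tail + head
    return [tuple(joined[i : i + n]) for i in range(len(joined) - n + 1)]


example = boundary_ngrams("is 2+2 4 true or false".split(), ["true"], n=5)
print(example)
```

With the example from the comment, the only 5-gram produced is the input 4-gram `4 true or false` followed by the reference 1-gram `true`.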
I think this is good (as we discussed), but (i) do we have to get providers to rerun with this new code (non-trivial cost) and (ii) I wonder how often the question and answer will be juxtaposed. If there's any token that separates the Q and A, then we won't detect overlap.
Yeah, I think for now we can just run this on the Pile and see what sort of insights we get; this is simply an additional metric that doesn't affect existing metrics. If we're concerned about a token gap, we can allow a skip-token budget, especially for this metric, though it may be premature to add at this point.
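To make the token-gap concern concrete, here is a rough sketch of what a skip-token budget could look like. It is not part of this PR; the function `boundary_match_with_skips` and its parameters are hypothetical, and it does a direct scan over a token list rather than using the n-gram index.

```python
from typing import Sequence


def boundary_match_with_skips(
    doc_tokens: Sequence[str],
    input_tail: Sequence[str],
    reference_head: Sequence[str],
    max_skips: int,
) -> bool:
    """Hypothetical check: does doc_tokens contain input_tail, then at most
    max_skips arbitrary tokens, then reference_head?"""
    doc, tail, head = list(doc_tokens), list(input_tail), list(reference_head)
    if not tail or not head:
        return False  # nothing to match on one side of the boundary
    for start in range(len(doc) - len(tail) + 1):
        if doc[start : start + len(tail)] != tail:
            continue
        # Try every allowed gap size between the end of the input tail
        # and the start of the reference head.
        for gap in range(max_skips + 1):
            ref_start = start + len(tail) + gap
            if doc[ref_start : ref_start + len(head)] == head:
                return True
    return False


doc = "q : is 2+2 4 true or false answer : true".split()
tail = "4 true or false".split()
print(boundary_match_with_skips(doc, tail, ["true"], max_skips=0))
print(boundary_match_with_skips(doc, tail, ["true"], max_skips=2))
```

With a budget of 0 this reduces to strict juxtaposition, so the separator tokens `answer :` hide the overlap; a budget of 2 recovers it.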
Closing stale PR. Not deleting the branch, in case it is still being used.
Previously, we computed n-grams for the input and for the references, but not the n-grams that span the boundary between input and reference. E.g., for input = ["is 2+2 4 true or false"] and reference = ["true"], the 5-gram ["4 true or false true"] was not captured; this PR adds it.