Add intersection of input, reference to overlap #1817

andyzorigin · 2023-08-29T03:42:30Z

Previously, we computed n-grams for the input and for the references, but not the n-grams that span the boundary between an input and a reference. For example, with input = ["is 2+2 4 true or false"] and reference = ["true"], the 5-gram ["4 true or false true"] was not captured. This PR adds those boundary n-grams.

percyliang · 2023-08-29T03:59:25Z

scripts/data_overlap/compute_data_overlap_metrics.py

@@ -106,6 +107,20 @@ def create_ngram_index(
                        ngram_index[n][reference_ngram].add(
                            EntryDataOverlapKey(stats_key=stats_key, instance_id=id, part=PART_REF)
                        )
+
+                # compute intersection ngrams [defined as n-grams that occur between instance and reference]


Not clear what intersection means exactly or "between instance and reference"...can this be defined a bit more precisely in natural language?

Yep, the example I give in the PR description is:
input = ["is 2+2 4 true or false"] reference = ["true"]; the intersection is the 5-gram ["4 true or false true"] (formed from the input 4-gram ["4 true or false"] and the reference 1-gram ["true"]). I'll reword the comment.
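To make the definition concrete, here is a minimal sketch of the boundary computation: take the last n-1 tokens of the input, append the first n-1 tokens of the reference, and enumerate n-grams over that joined sequence. The function name `boundary_ngrams` and the whitespace tokenization are assumptions for illustration, not the code in `compute_data_overlap_metrics.py`.

```python
from typing import List, Tuple


def boundary_ngrams(
    input_tokens: List[str], reference_tokens: List[str], n: int
) -> List[Tuple[str, ...]]:
    """Return the n-grams that contain tokens from both input and reference.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if n < 2:
        # A 1-gram cannot span the boundary (and [-0:] would return the whole list).
        return []
    tail = input_tokens[-(n - 1):]      # at most n-1 tokens from the end of the input
    head = reference_tokens[: n - 1]    # at most n-1 tokens from the start of the reference
    joined = tail + head
    # Each side has fewer than n tokens, so every window of n tokens crosses the boundary.
    return [tuple(joined[i : i + n]) for i in range(len(joined) - n + 1)]


# Example from the PR description, tokenized on whitespace (an assumption).
example = boundary_ngrams("is 2+2 4 true or false".split(), ["true"], 5)
print(example)
```

Because each side contributes fewer than n tokens, the sketch cannot return an n-gram that lies entirely inside the input or entirely inside the reference, so it does not duplicate the existing input and reference n-grams.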

percyliang · 2023-08-29T04:14:15Z

scripts/data_overlap/compute_data_overlap_metrics.py

@@ -110,7 +110,7 @@ def create_ngram_index(

                # concatenate the last n-1 tokens of input and the first n-1 tokens of reference and compute n-grams on this "intersection token" sequence
                # for instance: input = ["is 2+2 4 true or false"] reference = ["true"]
-                # the intersection is the 5-gram ["4 true or false true"] 
+                # the intersection is the 5-gram ["4 true or false true"]


I think this is good (as we discussed), but (i) do we have to get providers to rerun with this new code (non-trivial cost) and (ii) I wonder how often the question and answer will be juxtaposed. If there's any token that separates the Q and A, then we won't detect overlap.

Yeah, I think for now we can just run this on the Pile and see what insights we get; this is simply an additional metric that doesn't affect existing metrics. If we're concerned about a token gap, we can allow a skip-token budget, especially for this metric, though it may be premature to add at this point.
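The separator concern raised in this thread can be illustrated with a small sketch. It assumes whitespace tokenization and exact n-gram matching; the names `ngrams`, `doc_adjacent`, and `doc_separated` are hypothetical and the two documents are made up for the example.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


n = 5
question = "is 2+2 4 true or false".split()
answer = ["true"]

# Boundary n-grams: last n-1 input tokens joined with first n-1 reference tokens.
boundary = ngrams(question[-(n - 1):] + answer[: n - 1], n)

# Two made-up training documents: answer directly after the question,
# and answer preceded by a separator token.
doc_adjacent = "is 2+2 4 true or false true".split()
doc_separated = "is 2+2 4 true or false Answer: true".split()

hit_adjacent = bool(boundary & ngrams(doc_adjacent, n))
hit_separated = bool(boundary & ngrams(doc_separated, n))
print(hit_adjacent, hit_separated)
```

With a single separator token between question and answer, none of the document's 5-grams equals the boundary 5-gram, so the exact-match check misses it. A skip-token budget would have to relax that exact match, which this sketch does not attempt.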

yifanmai · 2024-02-06T05:45:54Z

Closing stale PR. Not deleting the branch, in case it is still being used.

Andy Z added 3 commits August 28, 2023 20:41

Add intersection

5cdddb5

Move

0860c9b

Output intersection

1824f5d

andyzorigin changed the title ~~Add intersection~~ Add intersection of input, reference to overlap Aug 29, 2023

andyzorigin requested review from yifanmai and percyliang August 29, 2023 03:57

Black

2995db7

percyliang reviewed Aug 29, 2023

Andy Z added 2 commits August 28, 2023 21:03

Update comment

576655f

Black

b774960

percyliang reviewed Aug 29, 2023
Split comment

08288a5

percyliang reviewed Aug 29, 2023

Andy Z added 2 commits August 28, 2023 21:14

Update comment

003a3c0

black

85fc9ea

yifanmai closed this Feb 6, 2024