Commit

Set output-ngrams to default to true and update README (stanford-crfm…#1776)

Co-authored-by: Andy Z <[email protected]>
2 people authored and danielz02 committed Sep 7, 2023
1 parent 9394a66 commit 8bfaf0f
Showing 3 changed files with 4 additions and 3 deletions.
1 change: 1 addition & 0 deletions scripts/data_overlap/README.md
@@ -32,6 +32,7 @@ For instance, you can call this with The Pile, e.g. have:
output_stats = arbitrary output file name, e.g. "output_stats"
input_format = the_pile

+If you don't want to output the ngrams that are overlapping in test set to a separate "{output_stats}_ngrams" file, you can pass --no-output-ngrams.
There are additional optional args:
--normalization default
2 changes: 1 addition & 1 deletion scripts/data_overlap/common/arguments.py
@@ -13,7 +13,7 @@ def get_data_overlap_args() -> Any:
required=True,
help="The format of your input file for your training data, e.g. raw, custom, the_pile",
)
-parser.add_argument("--output-ngrams", type=bool, default=False, help="Whether to output ngrams")
+parser.add_argument("--no-output-ngrams", type=bool, default=False, help="Pass to not output ngrams")
parser.add_argument(
"--tags",
type=str,
4 changes: 2 additions & 2 deletions scripts/data_overlap/compute_data_overlap_metrics.py
@@ -233,10 +233,10 @@ def compute_document_data_overlap(
stats_key_to_input_ids=stats_key_to_input_ids,
stats_key_to_reference_ids=stats_key_to_reference_ids,
entry_overlap_key_to_ngram_counts=entry_overlap_key_to_ngram_counts,
-output_ngrams=args.output_ngrams,
+output_ngrams=not args.no_output_ngrams,
)

-if args.output_ngrams:
+if not args.no_output_ngrams:
all_entry_overlap_ngrams = []
with open(f"{args.output_stats}_ngrams", "w") as f:
for entry_overlap_key in entry_overlap_key_to_ngram_counts:

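A note on the flag definition in scripts/data_overlap/common/arguments.py: with `type=bool`, argparse expects a value after `--no-output-ngrams` and converts it with `bool(str)`, so any non-empty string (including "False") becomes `True`, and passing the flag with no value is a usage error. The sketch below is not part of this commit; it only illustrates that behavior and shows one possible alternative using `action="store_true"`. The helper names `build_parser_as_committed`, `build_parser_store_true`, and `should_output_ngrams` are hypothetical and model only this single flag.

```python
import argparse
from typing import Sequence


def build_parser_as_committed() -> argparse.ArgumentParser:
    # Hypothetical stand-in that mirrors only the flag line from the diff.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--no-output-ngrams", type=bool, default=False, help="Pass to not output ngrams"
    )
    return parser


def build_parser_store_true() -> argparse.ArgumentParser:
    # Assumed alternative: a value-less switch that is True only when passed.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--no-output-ngrams", action="store_true", help="Pass to not output ngrams"
    )
    return parser


def should_output_ngrams(argv: Sequence[str]) -> bool:
    # Same negation the diff applies: output ngrams unless the switch is given.
    args = build_parser_store_true().parse_args(list(argv))
    return not args.no_output_ngrams


if __name__ == "__main__":
    # type=bool calls bool("False"), which is True for any non-empty string.
    committed = build_parser_as_committed().parse_args(["--no-output-ngrams", "False"])
    print("type=bool with value 'False':", committed.no_output_ngrams)

    print("no flag:", should_output_ngrams([]))
    print("with flag:", should_output_ngrams(["--no-output-ngrams"]))
```

With the `store_true` form, `--no-output-ngrams` can be passed on its own, which matches the usage described in the README change.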