duhaime/detect_reuse

Detecting (crosslingual) text reuse

This repo contains simple Python utilities for identifying crosslingual textual reuse. Quickstart:

git clone https://github.com/duhaime/detect_reuse 
cd detect_reuse 
cd text_cleaning_resources
gunzip normalized_stats_one_million.txt
cd ../
python combinatorial_ngrams.py sample/encyclopedie_volume05_translated.txt sample/goldsmith_animated_nature_full_unsplit.txt 8 4 4

This command looks for textual reuse between "sample/encyclopedie_volume05_translated.txt" and "sample/goldsmith_animated_nature_full_unsplit.txt", using a window size of 8, a step size of 4, and an ngram size of 4.

Translation Utility

translate_texts/translate_text.py uses goslate (pip install goslate) to translate all texts into a common language. Usage:

python translate_text.py encyclopedie_volume05.txt "en" "utf-8"

The arguments are, in order: the text to be translated, the language into which the text should be translated, and the encoding of the input.

Running the command above transforms Volume V of the French Encyclopédie into English: "L'Encyclopédie vient de faire une excellente acquisition en la personne de M. Bourgelat , Ecuyer du Roi, chef de son Académie à Lyon ..." becomes "The Encyclopedia has made a great acquisition in the person of Mr. Bourgelat, Esquire of the King, the captain of his Academy in Lyons ..."
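For readers who want to call the translation step from their own code, the following is a minimal sketch of what such a helper might look like. It is not the contents of translate_text.py: the function name `translate_file` and the optional `translate` parameter are hypothetical, and the only goslate call assumed is `goslate.Goslate().translate(text, target_language)`, which needs network access.

```python
import io


def translate_file(path, target_language, encoding, translate=None):
    """Read a text file and return its translation.

    `translate` is any callable taking (text, target_language). When it is
    omitted, goslate is used (assumption: goslate is installed and online).
    """
    if translate is None:
        import goslate  # pip install goslate
        translate = goslate.Goslate().translate
    # Decode the input with the encoding supplied by the caller.
    with io.open(path, encoding=encoding) as handle:
        text = handle.read()
    return translate(text, target_language)
```

Passing a custom `translate` callable makes it possible to try the helper offline or to swap in a different translation service.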

Detecting Textual Reuse

One can search for textual reuse between two files by running:

python combinatorial_ngrams.py {text_one} {text_two} {window size} {step size} {ngram size}

{window size} = the size of the sliding window to be created,
{step size}   = the number of words to advance the sliding window when it moves, and
{ngram size}  = the number of words to include in each ngram.
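To make the three parameters concrete, here is a small sketch of how a sliding window and per-window ngrams could be built. It is an illustration under stated assumptions, not the code in combinatorial_ngrams.py: the function names are hypothetical, the window size is assumed to be measured in words, and the use of word combinations within each window is only a guess based on the script's name.

```python
from itertools import combinations


def sliding_windows(words, window_size, step_size):
    """Yield successive word windows, advancing step_size words each time."""
    last_start = max(len(words) - window_size, 0)
    for start in range(0, last_start + 1, step_size):
        yield words[start:start + window_size]


def window_ngrams(window, ngram_size):
    """Return every in-order combination of ngram_size words from one window.

    Assumption: 'combinatorial' ngrams need not be contiguous.
    """
    return set(combinations(window, ngram_size))


words = "the quick brown fox jumps over the lazy dog again".split()
windows = list(sliding_windows(words, window_size=8, step_size=2))

# Count the ngrams shared by the first two windows.
shared = window_ngrams(windows[0], 4) & window_ngrams(windows[1], 4)
print(len(windows), len(shared))
```

A smaller step size produces more overlapping windows, and a larger window relative to the ngram size produces more ngrams per window, so both raise the cost of a comparison.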

The output will contain data in the following format:

path_to_text_one {tab} path_to_text_two {tab} number_of_shared_ngrams {tab} sentence_from_text_one {tab} sentence_from_text_two {newline}

Sorting by the third column (number_of_shared_ngrams) gives an estimate of textual similarity between the passages: the more ngrams two passages share, the higher the value.
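The sorting step can be sketched in a few lines of Python. This assumes the output has been captured as tab-separated lines with the five columns listed above; the function name `rank_matches` and the sample rows are made up for illustration.

```python
def rank_matches(lines):
    """Parse five-column, tab-separated rows and sort by shared-ngram count."""
    rows = []
    for line in lines:
        fields = [field.strip() for field in line.rstrip("\n").split("\t")]
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        path_one, path_two, shared, passage_one, passage_two = fields
        rows.append((path_one, path_two, int(shared), passage_one, passage_two))
    # Highest number of shared ngrams first.
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical sample rows, not real output from the script.
sample = [
    "a.txt\tb.txt\t2\tfirst passage\tsecond passage\n",
    "a.txt\tb.txt\t7\tthird passage\tfourth passage\n",
    "not a valid row\n",
]
ranked = rank_matches(sample)
for row in ranked:
    print(row[2], row[3], "|", row[4])
```

To rank real results, pass an open file handle for the saved output in place of `sample`.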
