Implemented a simple Inverted Index for document retrieval for the cranfield document collection.
The code performs the following functions for document retrieval on the cranField documents collection.
- Preprocesses the data in the text collection (stop words removal, stemming, removing punctuations etc.)
- Creates an inverted index for looking up relevant documents for each word in the vocabulary of the document collection
- Assigns weights to each word in the documents/queries by using TF-IDF weighting scheme
- Retrieves a list of relevant documents for the given list of queries specified in queries.txt and ranks the retrieved documents in decreasing order of their cosine similarty with the query
- Calculates average precision and recall for the top (10/50/100/500) retrieved documents by using relevance.txt
Python 3 and nltk are required to run the code
Ensure all the files from the repo are present in the same directory. After checking out the project the code can be run in two ways.
python Inverted_index.py <path_to_cranfieldDocs> <path_to_query_file> <path_to_relevance_file>
Open the project in any preferred IDE of your choice and run the file Inverted_index.py by passing the above arguments at runtime
Query: 1 Pr: 0.0 Re:0.0
Query: 2 Pr: 0.2 Re:0.13333333333333333
Query: 3 Pr: 0.2 Re:0.13333333333333333
Query: 4 Pr: 0.1 Re:0.05555555555555555
Query: 5 Pr: 0.1 Re:0.05263157894736842
Query: 6 Pr: 0.4 Re:0.2222222222222222
Query: 7 Pr: 0.6 Re:0.6666666666666666
Query: 8 Pr: 0.2 Re:0.5
Query: 9 Pr: 0.1 Re:0.125
Query: 10 Pr: 0.2 Re:0.08333333333333333
Query: 1 Pr: 0.0 Re:0.0
Query: 2 Pr: 0.12 Re:0.4
Query: 3 Pr: 0.14 Re:0.4666666666666667
Query: 4 Pr: 0.06 Re:0.16666666666666666
Query: 5 Pr: 0.14 Re:0.3684210526315789
Query: 6 Pr: 0.14 Re:0.3888888888888889
Query: 7 Pr: 0.16 Re:0.8888888888888888
Query: 8 Pr: 0.06 Re:0.75
Query: 9 Pr: 0.12 Re:0.75
Query: 10 Pr: 0.08 Re:0.16666666666666666
Query: 1 Pr: 0.0 Re:0.0
Query: 2 Pr: 0.09 Re:0.6
Query: 3 Pr: 0.09 Re:0.6
Query: 4 Pr: 0.06 Re:0.3333333333333333
Query: 5 Pr: 0.13 Re:0.6842105263157895
Query: 6 Pr: 0.09 Re:0.5
Query: 7 Pr: 0.09 Re:1.0
Query: 8 Pr: 0.03 Re:0.75
Query: 9 Pr: 0.06 Re:0.75
Query: 10 Pr: 0.04 Re:0.16666666666666666
Query: 1 Pr: 0.002 Re:1.0
Query: 2 Pr: 0.03 Re:1.0
Query: 3 Pr: 0.03 Re:1.0
Query: 4 Pr: 0.032 Re:0.8888888888888888
Query: 5 Pr: 0.038 Re:1.0
Query: 6 Pr: 0.036 Re:1.0
Query: 7 Pr: 0.018 Re:1.0
Query: 8 Pr: 0.008 Re:1.0
Query: 9 Pr: 0.016 Re:1.0
Query: 10 Pr: 0.026 Re:0.5416666666666666
The function accepts the text data as a parameter and does the following operations: a. Removes the punctuations from the text data b. Removes numbers from the data c. Converts the data to lower case d. Tokenizes the words from the text data
The function reads the common English stop words from the file stopwords.txt and then returns them as a list
This function takes in the raw data from the SGML files from the cranfieldDocs collection and extracts the data present between the <TITLE> and tags and returns the content
The function takes in tokenized words and removes stop words from it and the words which are smaller than 2 characters in length
The function takes the path of the crafieldDocs directory and reads the contents of the files present in the directory calls the parse_data function and returns a list of tuples of the following format – (document_no, text_data)
The function takes in a list of tokenized words and returns a dictionary with token as the key and the tf_score for that token as the value
This function parses all the words from the document collection and returns a vocabulary of all the unique words from the collection
The function takes in a list of tokens and returns a list of stemmed words using NLTKs PorterStemmer
The function takes a list of tuples generated by read_data and then performs the following functions: a. tokenizes and removes puntuations b. removes stop words c. stems words d. removes stop words after stemming and words shorter than 2 characters
The function takes in the entire data dictionary and calculates the idf score for each word in the vocabulary for the document collection and returns a dictionary with word as key and idf score as the value
The function takes in a the data dictionary of the parsed and preprocessed data and the dictionary for idf_scores and returns a dictionary containing key as the document id and value which is another dictionary with words of that document as keys and their tf-idf scores as values
This function is similar to that of preprocess data it performs the same operations on the queries
This function takes the preprocessed queries and the idf_scores dictionary and then performs calculation for calculating the tfidf scores of the query terms and returns a similar dictionary as returned by calculate_tfidf function but for query terms
This function takes in the data dictionary of preprocessed data, builds and inverted index and then returns a dictionary with key as each unique word from the document vocabulary and value having a list of documents where the word in present
This function takes in the path of the file containing the query ids and their relevant documents and then returns a dictionary with key as document ids and value as a list of documents relevant for it
This function takes the retrieved document list and relevant document list of a query, calculates and returns the precision and recall