Text Mining

This page describes the jobs that constitute the text mining pipeline.

Overview


Preprocessing

Overview


Tables


Wikipedia import

WikipediaImport parses the Wikipedia XML dump into WikipediaEntries.
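
A minimal sketch of the per-page parsing, assuming the scala-xml module is available; WikipediaEntry here is a simplified stand-in for the project's model:

```scala
import scala.xml.XML

// Simplified stand-in for the project's WikipediaEntry model.
case class WikipediaEntry(title: String, text: Option[String])

// Parses a single <page> element of the MediaWiki XML dump.
def parsePage(pageXml: String): WikipediaEntry = {
  val page = XML.loadString(pageXml)
  val text = (page \ "revision" \ "text").text
  WikipediaEntry((page \ "title").text, Option(text).filter(_.nonEmpty))
}

parsePage(
  """<page>
    |  <title>Potsdam</title>
    |  <revision><text>'''Potsdam''' is the capital of [[Brandenburg]].</text></revision>
    |</page>""".stripMargin)
// WikipediaEntry(Potsdam, Some('''Potsdam''' is the capital of [[Brandenburg]].))
```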

TextParser parses all Wikipedia articles from wikitext to raw text and extracts their links.
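
A rough sketch of the link extraction, assuming a simplified Link model (alias plus target page) and ignoring templates, nested links and the other wikitext constructs a full parser has to handle:

```scala
import scala.util.matching.Regex

// Simplified stand-in for the project's Link model.
case class Link(alias: String, page: String)

// Matches [[Page]] and [[Page|alias]].
val wikiLink: Regex = """\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]""".r

def extractLinks(wikitext: String): List[Link] =
  wikiLink.findAllMatchIn(wikitext).map { m =>
    val page = m.group(1).trim
    val alias = Option(m.group(2)).map(_.trim).getOrElse(page)
    Link(alias, page)
  }.toList

// The raw text keeps only the alias of every link.
def stripLinks(wikitext: String): String =
  wikiLink.replaceAllIn(wikitext, m =>
    Regex.quoteReplacement(Option(m.group(2)).getOrElse(m.group(1))))

extractLinks("[[Berlin]] borders [[Brandenburg|the state of Brandenburg]].")
// List(Link(Berlin,Berlin), Link(the state of Brandenburg,Brandenburg))
```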


Wikidata import

WikidataImport parses each entry of the Wikidata JSON dump into a WikidataEntity and imports it into Cassandra.

TagEntities builds the subclass hierarchy of selected Wikidata classes and tags every WikidataEntity that is an instance of one of the subclasses with the top level class.
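
A sketch of how such a tagging could work, assuming the "subclass of" statements have already been extracted into a map from class to direct superclasses; the real job operates on WikidataEntities in Cassandra:

```scala
// Maps every class reachable from a selected top-level class (via "subclass of")
// to that top-level class. If a class is reachable from several top-level
// classes, the last one wins in this simplified version.
def subclassClosure(
    topLevelClasses: Set[String],
    subclassOf: Map[String, Set[String]]
): Map[String, String] = {
  // Invert the edges so we can walk downwards from the top-level classes.
  val children = subclassOf.toSeq
    .flatMap { case (sub, supers) => supers.map(_ -> sub) }
    .groupBy(_._1)
    .mapValues(_.map(_._2).toSet)
    .toMap

  topLevelClasses.flatMap { top =>
    var tagged = Map(top -> top)
    var frontier = Set(top)
    while (frontier.nonEmpty) {
      frontier = frontier
        .flatMap(c => children.getOrElse(c, Set.empty[String]))
        .filterNot(tagged.contains)
      tagged ++= frontier.map(_ -> top)
    }
    tagged
  }.toMap
}

// An entity is tagged with the top-level class of any class it is an instance of.
def instancetype(instanceOf: Set[String], classToTop: Map[String, String]): Option[String] =
  instanceOf.flatMap(classToTop.get).headOption
```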

ResolveEntities resolves the Wikidata IDs in the properties of each WikidataEntity. Entities labeled with an instancetype in the TagEntities job are not resolved.
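
A minimal sketch of the resolution step, assuming the properties are available as a map from property name to Wikidata ID values and that an ID-to-label map has been built beforehand:

```scala
// Replaces Wikidata IDs (e.g. "Q64") in property values with the label of the
// referenced entity; unknown IDs are kept as they are.
def resolveProperties(
    properties: Map[String, List[String]],
    labels: Map[String, String]
): Map[String, List[String]] =
  properties.map { case (property, values) =>
    property -> values.map(id => labels.getOrElse(id, id))
  }

resolveProperties(Map("capital" -> List("Q64")), Map("Q64" -> "Berlin"))
// Map(capital -> List(Berlin))
```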

WikidataDataLakeImport translates each WikidataEntity into a Subject and writes it into a staging table.

FindRelations finds Wikidata relations between Subjects, translates the Wikidata ID relations into Subject UUID relations and replaces the Wikidata IDs with their names.


Link analysis

LinkCleaner removes dead links, i.e., links that do not point to an existing page.
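
A sketch of this step as a Spark job, with simplified stand-ins for the project's ParsedWikipediaEntry and Link models; collecting all titles into a broadcast variable is an assumption made for brevity, the real job may use a join instead:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

case class Link(alias: String, page: String)
case class ParsedWikipediaEntry(title: String, links: List[Link])

// Keeps only those links whose target page actually exists.
def cleanLinks(
    sc: SparkContext,
    articles: RDD[ParsedWikipediaEntry]
): RDD[ParsedWikipediaEntry] = {
  val existingPages = sc.broadcast(articles.map(_.title).collect().toSet)
  articles.map { article =>
    article.copy(links = article.links.filter(link => existingPages.value.contains(link.page)))
  }
}
```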

RedirectResolver resolves all redirects for each ParsedWikipediaEntry by replacing them with the pages they point to and writes the redirects to a Cassandra table.
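
Resolving a single redirect chain might look like this, assuming the redirects have already been collected into a map from redirect page to target page:

```scala
// Follows redirect chains (A -> B -> C) until a non-redirect page or a cycle is reached.
def resolveRedirect(page: String, redirects: Map[String, String]): String = {
  var current = page
  var seen = Set(page)
  while (redirects.contains(current) && !seen.contains(redirects(current))) {
    current = redirects(current)
    seen += current
  }
  current
}

val redirects = Map("Big Apple" -> "NYC", "NYC" -> "New York City")
resolveRedirect("Big Apple", redirects) // "New York City"
```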

LinkAnalysis groups Aliases of Links by page names and vice versa.


Alias analysis

CompanyLinkFilter removes all Wikipedia Links with an alias that never points to a company.

ReducedLinkAnalysis groups the Links once on the aliases and once on the pages and saves them to the columns for reduced links.

LocalTrieBuilder creates a token-based trie from a given list of aliases and serializes it to the filesystem (see the sketch after AliasTrieSearch below).

AliasTrieSearch finds all occurrences of the trie's aliases in all Wikipedia articles and writes them to the foundaliases column.
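
A combined sketch of LocalTrieBuilder and AliasTrieSearch, assuming a plain token trie; the real implementation additionally serializes the trie to the filesystem and runs the search as a Spark job over all articles:

```scala
import scala.collection.mutable

// Minimal token-based trie: each node maps the next token to a child node and
// remembers whether a complete alias ends here.
class TrieNode {
  val children: mutable.Map[String, TrieNode] = mutable.Map.empty
  var isAlias: Boolean = false

  def insert(tokens: List[String]): Unit = tokens match {
    case Nil => isAlias = true
    case head :: tail => children.getOrElseUpdate(head, new TrieNode).insert(tail)
  }
}

// Finds all alias occurrences (as token spans) in a tokenized article.
def findAliases(tokens: IndexedSeq[String], trie: TrieNode): List[String] = {
  val found = List.newBuilder[String]
  for (start <- tokens.indices) {
    var node = trie
    var end = start
    var matching = true
    while (matching && end < tokens.length) {
      node.children.get(tokens(end)) match {
        case Some(child) =>
          if (child.isAlias) found += tokens.slice(start, end + 1).mkString(" ")
          node = child
          end += 1
        case None => matching = false
      }
    }
  }
  found.result()
}

val trie = new TrieNode
List(List("Deutsche", "Bank"), List("Audi")).foreach(trie.insert)
findAliases("A loan from Deutsche Bank".split(" ").toIndexedSeq, trie)
// List("Deutsche Bank")
```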

LinkExtender extends Wikipedia articles with Links of occurrences of aliases that were previously linked in the same article.

AliasCounter counts Alias occurrences and merges them into previously extracted Aliases with their corresponding pages.


Context analysis

DocumentFrequencyCounter counts DocumentFrequencies over all articles.
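
As a Spark sketch, assuming whitespace/punctuation tokenization; the project's actual tokenizer and DocumentFrequency model may differ:

```scala
import org.apache.spark.rdd.RDD

case class DocumentFrequency(word: String, count: Int)

// A word's document frequency is the number of articles it occurs in at least once.
def countDocumentFrequencies(articleTexts: RDD[String]): RDD[DocumentFrequency] =
  articleTexts
    .flatMap(text => text.toLowerCase.split("\\W+").filter(_.nonEmpty).distinct)
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    .map { case (word, count) => DocumentFrequency(word, count) }
```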

TermFrequencyCounter enriches each ParsedWikipediaEntry with the term frequencies of the whole article (called the context) and with the term frequencies of the text surrounding each of its links (called the link contexts).
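
The term frequencies themselves are a plain token count; a minimal sketch, again assuming a simple tokenizer:

```scala
// Counts how often each (lower-cased) token occurs in a text.
def termFrequencies(text: String): Map[String, Int] =
  text.toLowerCase
    .split("\\W+")
    .filter(_.nonEmpty)
    .groupBy(identity)
    .mapValues(_.length)
    .toMap

termFrequencies("Audi builds cars, Audi sells cars.")
// Map(audi -> 2, builds -> 1, cars -> 2, sells -> 1)
```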


Feature extraction

CosineContextComparator calculates the cosine similarity (context score) for each pair of Link alias and page it may refer to. It combines the context score with the link score and entity score into the respective FeatureEntry.
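
A sketch of the context score, assuming a standard tf-idf weighting built from the term and document frequencies of the previous jobs; the exact weighting and smoothing used in the project may differ:

```scala
import scala.math.{log, sqrt}

// tf-idf weights for one context, given its raw term frequencies, the global
// document frequencies and the total number of articles.
def tfidf(
    termFrequencies: Map[String, Int],
    documentFrequencies: Map[String, Int],
    numDocuments: Long
): Map[String, Double] =
  termFrequencies.map { case (term, tf) =>
    term -> tf * log(numDocuments.toDouble / documentFrequencies.getOrElse(term, 1))
  }

// Cosine similarity between two sparse tf-idf vectors.
def cosineSimilarity(a: Map[String, Double], b: Map[String, Double]): Double = {
  val norm = (v: Map[String, Double]) => sqrt(v.values.map(x => x * x).sum)
  val dotProduct = a.keySet.intersect(b.keySet).toSeq.map(term => a(term) * b(term)).sum
  if (norm(a) == 0.0 || norm(b) == 0.0) 0.0 else dotProduct / (norm(a) * norm(b))
}
```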

SecondOrderFeatureGenerator enriches page scores and cosine similarities of given FeatureEntries with their respective second order feature values.

FeatureCsvExport exports a sample of FeatureEntries as a .csv file. (This is just for evaluation and not a necessary part of the pipeline.)

TfidfProfiler compares the performance of the context extraction and tf-idf computation of the Ingestion project with the implementation in Spark MLlib. (This is just for evaluation and not a necessary part of the pipeline.)


Classifier training

ClassifierTraining trains a RandomForestModel with FeatureEntries and evaluates it in terms of precision, recall and f-score.
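
A sketch of the training and evaluation with Spark MLlib, assuming the FeatureEntries have already been converted into LabeledPoints (label 1.0 if the alias actually refers to the page); the hyperparameters are placeholders, not the project's settings:

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.rdd.RDD

// Trains a random forest on labeled feature vectors and reports
// precision, recall and F-score for the positive class.
def trainAndEvaluate(data: RDD[LabeledPoint]): Unit = {
  val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 1L)
  val model = RandomForest.trainClassifier(
    training,
    numClasses = 2,
    categoricalFeaturesInfo = Map[Int, Int](),
    numTrees = 10,
    featureSubsetStrategy = "auto",
    impurity = "gini",
    maxDepth = 5,
    maxBins = 32)

  val predictionsAndLabels = test.map(point => (model.predict(point.features), point.label))
  val metrics = new MulticlassMetrics(predictionsAndLabels)
  println(s"precision: ${metrics.precision(1.0)}")
  println(s"recall:    ${metrics.recall(1.0)}")
  println(s"f-score:   ${metrics.fMeasure(1.0)}")
}
```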


Named entity linking

WikipediaReduction reduces each ParsedWikipediaEntry to its relevant attributes for NEL.

ArticleTrieSearch finds TrieAliases in TrieAliasArticles and writes them back to the same table.

TextNEL classifies the TrieAliases found in TrieAliasArticles by the ArticleTrieSearch and writes the positively classified ones back to the foundentities column of the same table.

HtmlGenerator generates HTML from found links to named entities and their corresponding raw articles.
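
A simplified sketch of the HTML generation, assuming the found entities carry character offsets into the raw text; the FoundEntity model and the link URL scheme are made up for the example:

```scala
// Simplified stand-in for a found entity: where the alias starts in the raw
// text, how long it is, and which page it was linked to.
case class FoundEntity(offset: Int, length: Int, page: String)

// Wraps every found entity of a raw article in an HTML link.
def toHtml(text: String, entities: List[FoundEntity]): String = {
  val builder = new StringBuilder
  var position = 0
  entities.sortBy(_.offset).foreach { entity =>
    builder ++= text.substring(position, entity.offset)
    val alias = text.substring(entity.offset, entity.offset + entity.length)
    builder ++= s"""<a href="/${entity.page}">$alias</a>"""
    position = entity.offset + entity.length
  }
  builder ++= text.substring(position)
  builder.toString
}

toHtml("Audi builds cars.", List(FoundEntity(0, 4, "Audi")))
// <a href="/Audi">Audi</a> builds cars.
```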


Relation extraction

RelationSentenceParser parses all Sentences with at least two entities from each Wikipedia article and writes them to Cassandra.

CooccurrenceCounter counts the cooccurrences of entities in all sentences.
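
A sketch of a pairwise count as a Spark job, assuming each sentence has been reduced to the list of entities found in it (the actual job may count larger entity sets as well):

```scala
import org.apache.spark.rdd.RDD

// A sentence reduced to the entities found in it.
case class Sentence(entities: List[String])

// Counts how often each (unordered) pair of entities appears in the same sentence.
def countCooccurrences(sentences: RDD[Sentence]): RDD[((String, String), Int)] =
  sentences
    .flatMap { sentence =>
      sentence.entities.distinct.sorted.combinations(2).map {
        case List(a, b) => ((a, b), 1)
      }
    }
    .reduceByKey(_ + _)
```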

CooccurrenceExport exports the cooccurrences as CSV files for Neo4j. (This is just for debugging and not a necessary part of the pipeline.)
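
A minimal sketch of such an export, assuming the CSV layout of Neo4j's bulk import (one node file, one relationship file) and entity names without commas or quotes:

```scala
import java.io.PrintWriter

// Writes cooccurrence counts as two CSV files: one for the entity nodes,
// one for the COOCCURRENCE relationships between them.
def exportToNeo4jCsv(cooccurrences: Seq[((String, String), Int)]): Unit = {
  val nodes = new PrintWriter("nodes.csv")
  nodes.println("name:ID")
  cooccurrences.flatMap { case ((a, b), _) => Seq(a, b) }.distinct.foreach(nodes.println)
  nodes.close()

  val relationships = new PrintWriter("relationships.csv")
  relationships.println(":START_ID,:END_ID,count:int,:TYPE")
  cooccurrences.foreach { case ((a, b), count) =>
    relationships.println(s"$a,$b,$count,COOCCURRENCE")
  }
  relationships.close()
}
```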

RelationClassifier trains a classifier on the relations from DBpedia.