Text Mining
This page describes the jobs that constitute the text mining pipeline.
`WikipediaImport` parses the Wikipedia XML dump into `WikipediaEntries`.
`TextParser` parses all Wikipedia articles from wikitext to raw text and extracts their links.
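
As a toy illustration of the link extraction, a regex-based sketch covering only plain `[[page|alias]]` links; the actual `TextParser` has to deal with full wikitext:

```scala
// Minimal sketch: extract [[page]] and [[page|alias]] links from wikitext.
// Only plain link syntax is handled here.
object LinkExtractionSketch {
  case class Link(alias: String, page: String)

  private val linkPattern = """\[\[([^\]|]+)(?:\|([^\]]+))?\]\]""".r

  def extractLinks(wikitext: String): List[Link] =
    linkPattern.findAllMatchIn(wikitext).map { m =>
      val page = m.group(1).trim
      // without an explicit alias, the page name itself is the alias
      val alias = Option(m.group(2)).map(_.trim).getOrElse(page)
      Link(alias, page)
    }.toList

  def main(args: Array[String]): Unit = {
    val text = "Berlin is the capital of [[Germany]] and seat of the [[Bundestag|federal parliament]]."
    extractLinks(text).foreach(println)
  }
}
```
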
`WikidataImport` parses each article from the Wikidata JSON dump into a `WikidataEntity` and imports it into Cassandra.
`TagEntities` builds the subclass hierarchy of selected Wikidata classes and tags every `WikidataEntity` that is an instance of one of the subclasses with the top-level class.
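
A rough sketch of the tagging idea, assuming each entity carries its claims as a property-to-values map; the traversal uses the Wikidata properties P31 ("instance of") and P279 ("subclass of"):

```scala
// Sketch: collect all transitive subclasses of a top-level class via the
// inverted "subclass of" (P279) edges, then tag every entity whose
// "instance of" (P31) value falls into that set.
object TagEntitiesSketch {
  case class WikidataEntity(id: String,
                            data: Map[String, List[String]],
                            instancetype: Option[String] = None)

  // subclassOf: class id -> ids of its parent classes (P279 values)
  def subclassesOf(topClass: String, subclassOf: Map[String, List[String]]): Set[String] = {
    // invert the edges so we can walk from the top-level class downwards
    val children: Map[String, List[String]] = subclassOf.toList
      .flatMap { case (child, parents) => parents.map(p => p -> child) }
      .groupBy(_._1)
      .map { case (parent, edges) => parent -> edges.map(_._2) }
    var found = Set(topClass)
    var frontier = Set(topClass)
    while (frontier.nonEmpty) {
      frontier = frontier.flatMap(c => children.getOrElse(c, Nil)).diff(found)
      found ++= frontier
    }
    found
  }

  // tag every entity whose P31 value lies in the hierarchy of topClass
  def tag(entities: List[WikidataEntity], topClass: String,
          subclassOf: Map[String, List[String]]): List[WikidataEntity] = {
    val classes = subclassesOf(topClass, subclassOf)
    entities.map { e =>
      if (e.data.getOrElse("P31", Nil).exists(classes.contains))
        e.copy(instancetype = Some(topClass))
      else e
    }
  }
}
```
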
`ResolveEntities` resolves the Wikidata IDs in the properties of each `WikidataEntity`. Entities labeled with an `instancetype` in the `TagEntities` job are not resolved.
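
A minimal sketch of the resolution step, assuming a label lookup table built from all entities; the field names are illustrative:

```scala
// Sketch: replace Wikidata ids (e.g. "Q64") appearing in property values
// with the referenced entity's label; entities already tagged with an
// instancetype by TagEntities are left untouched.
object ResolveEntitiesSketch {
  case class WikidataEntity(id: String,
                            label: String,
                            data: Map[String, List[String]],
                            instancetype: Option[String])

  def resolve(entities: List[WikidataEntity]): List[WikidataEntity] = {
    val labels = entities.map(e => e.id -> e.label).toMap
    entities.map { e =>
      if (e.instancetype.isDefined) e // tagged entities are not resolved
      else e.copy(data = e.data.map { case (prop, values) =>
        prop -> values.map(v => labels.getOrElse(v, v))
      })
    }
  }
}
```
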
`WikidataDataLakeImport` translates each `WikidataEntity` into a `Subject` and writes it into a staging table.
`FindRelations` finds Wikidata relations between `Subjects`, translates the Wikidata ID relations into `Subject` UUID relations and replaces the Wikidata IDs with their names.
`LinkCleaner` removes dead links, i.e. links that do not point to an existing page.
`RedirectResolver` resolves all redirects for each `ParsedWikipediaEntry` by replacing them with the pages they point to, and writes the redirects to a Cassandra table.
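
The chain following itself is independent of Spark and Cassandra and can be sketched with a plain redirect map (assumed precomputed), including a guard against redirect cycles:

```scala
// Sketch: follow chains of redirects (page -> target page) until a
// non-redirect page is reached, stopping if a cycle is detected.
object RedirectResolverSketch {
  def resolve(page: String, redirects: Map[String, String]): String = {
    var current = page
    var seen = Set(current)
    while (redirects.contains(current) && !seen.contains(redirects(current))) {
      current = redirects(current)
      seen += current
    }
    current
  }

  def main(args: Array[String]): Unit = {
    val redirects = Map("UK" -> "U.K.", "U.K." -> "United Kingdom")
    println(resolve("UK", redirects)) // United Kingdom
  }
}
```
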
`LinkAnalysis` groups the `Aliases` of `Links` by page names and vice versa.
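
A sketch of the two groupings, assuming `Link` is reduced to its alias and page and that occurrence counts are kept per group:

```scala
// Sketch: group link occurrences once by alias (which pages does an alias
// point to, and how often) and once by page (which aliases point to it).
object LinkAnalysisSketch {
  case class Link(alias: String, page: String)

  def aliasesToPages(links: List[Link]): Map[String, Map[String, Int]] =
    links.groupBy(_.alias).map { case (alias, ls) =>
      alias -> ls.groupBy(_.page).map { case (page, occ) => page -> occ.size }
    }

  def pagesToAliases(links: List[Link]): Map[String, Map[String, Int]] =
    links.groupBy(_.page).map { case (page, ls) =>
      page -> ls.groupBy(_.alias).map { case (alias, occ) => alias -> occ.size }
    }
}
```
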
`CompanyLinkFilter` removes all Wikipedia `Links` with an alias that never points to a company.
`ReducedLinkAnalysis` groups the `Links` once by their aliases and once by their pages and saves them to the columns for reduced links.
`LocalTrieBuilder` creates a token-based trie from a given list of aliases and serializes it to the filesystem (a combined sketch follows the next entry).
`AliasTrieSearch` finds all occurrences of the trie's aliases in all Wikipedia articles and writes them to the `foundaliases` column.
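
A combined sketch of `LocalTrieBuilder` and `AliasTrieSearch`, assuming whitespace tokenization; the node layout is illustrative, and serialization to the filesystem (e.g. via `java.io.ObjectOutputStream`) is omitted:

```scala
// Sketch: build a token-based trie over alias token sequences, then scan a
// tokenized article and report the longest alias match at each offset.
object TrieSketch {
  class TrieNode(var isAlias: Boolean = false,
                 val children: scala.collection.mutable.Map[String, TrieNode] =
                   scala.collection.mutable.Map.empty)

  def build(aliases: List[String]): TrieNode = {
    val root = new TrieNode
    for (alias <- aliases) {
      var node = root
      for (token <- alias.split(" "))
        node = node.children.getOrElseUpdate(token, new TrieNode)
      node.isAlias = true // this path spells a complete alias
    }
    root
  }

  // returns (offset, matched alias) pairs; the longest match per offset wins
  def search(tokens: Array[String], root: TrieNode): List[(Int, String)] =
    tokens.indices.flatMap { start =>
      var node = root
      var i = start
      var longest: Option[Int] = None
      while (i < tokens.length && node.children.contains(tokens(i))) {
        node = node.children(tokens(i))
        i += 1
        if (node.isAlias) longest = Some(i)
      }
      longest.map(end => (start, tokens.slice(start, end).mkString(" ")))
    }.toList

  def main(args: Array[String]): Unit = {
    val trie = build(List("New York", "New York City", "Berlin"))
    println(search("She moved from Berlin to New York City".split(" "), trie))
  }
}
```
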
`LinkExtender` extends Wikipedia articles with `Links` of occurrences of aliases that were previously linked in the same article.
`AliasCounter` counts `Alias` occurrences and merges them into previously extracted `Aliases` with their corresponding pages.
`DocumentFrequencyCounter` counts `DocumentFrequencies` over all articles.
`TermFrequencyCounter` enriches each `ParsedWikipediaEntry` with the term frequencies of the whole article (called context) and the term frequencies for all link contexts (called link contexts).
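
Both counters reduce to simple aggregations once the articles are tokenized; a minimal sketch (tokenization, stemming and stopword removal omitted):

```scala
// Sketch: term frequencies are per-article bags of words; document
// frequencies count in how many articles a term appears at least once.
object FrequencySketch {
  def termFrequencies(tokens: Seq[String]): Map[String, Int] =
    tokens.groupBy(identity).map { case (term, occ) => term -> occ.size }

  def documentFrequencies(articles: Seq[Seq[String]]): Map[String, Int] =
    articles.flatMap(_.distinct).groupBy(identity).map { case (term, occ) => term -> occ.size }

  def main(args: Array[String]): Unit = {
    val docs = Seq("berlin is big".split(" ").toSeq, "berlin is the capital".split(" ").toSeq)
    println(termFrequencies(docs.head))  // Map(berlin -> 1, is -> 1, big -> 1)
    println(documentFrequencies(docs))   // berlin -> 2, is -> 2, big -> 1, ...
  }
}
```
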
`CosineContextComparator` calculates the cosine similarity (context score) for each pair of `Link` alias and page it may refer to. It combines this with the link score and entity score, yielding the respective `FeatureEntry`.
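
A sketch of the context score under the standard tf-idf and cosine definitions; the exact idf variant used by the job is an assumption:

```scala
// Sketch: turn two contexts (term frequency maps) into tf-idf vectors and
// compute their cosine similarity. idf = log(N / df) is assumed here.
object CosineSketch {
  def tfidf(tf: Map[String, Int], df: Map[String, Int], numDocs: Long): Map[String, Double] =
    tf.map { case (term, freq) =>
      term -> freq * math.log(numDocs.toDouble / df.getOrElse(term, 1))
    }

  def cosine(a: Map[String, Double], b: Map[String, Double]): Double = {
    val dot = a.keySet.intersect(b.keySet).toList.map(t => a(t) * b(t)).sum
    val norm = (v: Map[String, Double]) => math.sqrt(v.values.map(x => x * x).sum)
    if (norm(a) == 0 || norm(b) == 0) 0.0 else dot / (norm(a) * norm(b))
  }

  def main(args: Array[String]): Unit = {
    val df = Map("berlin" -> 2, "capital" -> 1)
    val linkContext = tfidf(Map("berlin" -> 2, "capital" -> 1), df, numDocs = 10)
    val pageContext = tfidf(Map("berlin" -> 1), df, numDocs = 10)
    println(cosine(linkContext, pageContext))
  }
}
```
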
`SecondOrderFeatureGenerator` enriches page scores and cosine similarities of given `FeatureEntries` with their respective second-order feature values.
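
The concrete second-order features are not spelled out on this page; one plausible reading, with illustrative names, derives each candidate score's rank and its distance to the best score for the same alias:

```scala
// Hedged sketch: for the candidate scores competing for one alias, derive
// each score's rank and its delta to the top score. The actual feature set
// of SecondOrderFeatureGenerator may differ.
object SecondOrderSketch {
  case class SecondOrder(score: Double, rank: Int, deltaTop: Double)

  // scores: all candidate scores competing for the same alias (non-empty)
  def enrich(scores: List[Double]): List[SecondOrder] = {
    require(scores.nonEmpty)
    val sorted = scores.sorted(Ordering[Double].reverse)
    scores.map(s => SecondOrder(s, sorted.indexOf(s) + 1, sorted.head - s))
  }

  def main(args: Array[String]): Unit =
    enrich(List(0.8, 0.3, 0.5)).foreach(println)
}
```
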
`FeatureCsvExport` exports a sample of `FeatureEntries` as a .csv file for later evaluation. (This is only used for evaluation and is not a necessary part of the pipeline.)
`TfidfProfiler` compares the performance of the Ingestion project's context extraction and tf-idf computation with the implementation in Spark MLlib. (This is only used for evaluation and is not a necessary part of the pipeline.)
`ClassifierTraining` trains a `RandomForestModel` with `FeatureEntries` and evaluates it in terms of precision, recall and F-score.
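
A hedged sketch of such a training run against the Spark MLlib API that `RandomForestModel` comes from; how `FeatureEntries` map to `LabeledPoints`, and all hyperparameters, are illustrative assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest

object ClassifierTrainingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ClassifierTraining").setMaster("local[*]"))
    // each FeatureEntry would become a LabeledPoint: label 1.0 if the
    // candidate page is correct, features e.g. (link score, context score, ...)
    val data = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(0.9, 0.8, 0.7)),
      LabeledPoint(0.0, Vectors.dense(0.1, 0.2, 0.3))
    ))
    val Array(training, test) = data.randomSplit(Array(0.7, 0.3))
    val model = RandomForest.trainClassifier(
      training, numClasses = 2, categoricalFeaturesInfo = Map[Int, Int](),
      numTrees = 10, featureSubsetStrategy = "auto", impurity = "gini",
      maxDepth = 4, maxBins = 32)
    // precision/recall/F-score over the positive class
    val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
    val tp = predictionAndLabel.filter { case (p, l) => p == 1.0 && l == 1.0 }.count.toDouble
    val precision = tp / math.max(predictionAndLabel.filter(_._1 == 1.0).count, 1L)
    val recall = tp / math.max(predictionAndLabel.filter(_._2 == 1.0).count, 1L)
    val fscore = 2 * precision * recall / math.max(precision + recall, 1e-9)
    println(s"precision=$precision recall=$recall f-score=$fscore")
    sc.stop()
  }
}
```
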
`WikipediaReduction` reduces each `ParsedWikipediaEntry` to its relevant attributes for NEL.
`ArticleTrieSearch` finds `TrieAliases` in `TrieAliasArticles` and writes them back to the same table.
`TextNEL` classifies the `TrieAliases` found in `TrieAliasArticles` by the `ArticleTrieSearch` and writes the positives into the `foundentities` column of the same table.
`HtmlGenerator` generates HTML from the found links to named entities and their corresponding raw articles.
`RelationSentenceParser` parses all `Sentences` with at least two entities from each Wikipedia article and writes them to Cassandra.
`CooccurrenceCounter` counts the cooccurrences in all sentences.
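
A minimal sketch, assuming each sentence has been reduced to its list of entity names:

```scala
// Sketch: count, over all sentences, how often each unordered pair of
// entities occurs in the same sentence.
object CooccurrenceSketch {
  def count(sentences: List[List[String]]): Map[(String, String), Int] =
    sentences
      .flatMap(entities => entities.distinct.sorted.combinations(2))
      .map { case List(a, b) => (a, b) }
      .groupBy(identity)
      .map { case (pair, occ) => pair -> occ.size }

  def main(args: Array[String]): Unit = {
    val sentences = List(List("Berlin", "Germany"), List("Berlin", "Germany", "Spree"))
    count(sentences).foreach(println) // (Berlin, Germany) -> 2, ...
  }
}
```
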
`CooccurrenceExport` exports the cooccurrences to CSV files for Neo4j. (This is only used for debugging and is not a necessary part of the pipeline.)
`RelationClassifier` trains a classifier on the relations from DBpedia.