glossterm is a pipeline that extracts, lexes, and parses Wiktionary data.
To generate files for the web app, grab an English Wiktionary dump, put it in `data/`, and run the following commands.
You can run a command directly with e.g. `go run cmd/gtdump/main.go`, or run `make` to install globally available commands that can be run as e.g. `gtdump`.

- `gtdump` downloads the Wiktionary dump to `en.xml.bz2`.
- `gtsplit` splits the Wiktionary dump into N files so it can be parsed in parallel. N is set to the current number of cores.
- `gtparse` parses the split files into `words.gob` and `descendants.gob`. Use `--no-backup` after the initial change to the index to edit the index in place and compare it to the previously committed index.
- `gtresolve` reads `words.gob`, looks up DescendantTrees references in `descendants.gob`, and inlines them.
- `gtquads` generates quads for each word to power graph lookups, e.g. finding all descendants of the Latin roots of a given word.
- `gtbeam` fetches cognates in parallel using the Apache Beam local runner.
- `gtcognates` inlines cognates from `gtbeam` into `words.gob`.
- `gtcompare` compares the new index to the old index. Always use it to manually verify parsing changes.
- `gtindex` incrementally indexes words in Firestore (additions, deletions, updates).
- `gtpage <word>` extracts a single XML page for a given word. Example: `gtpage helado`
- `gtlex <word.xml>` lexes a single XML page for a given word. Example: `gtpage hombre | gtlex`
- `gtparseword <word.xml>` parses a single XML word. Example: `gtpage horno | gtparseword`
- `gtparseetymtree <word.xml>` parses a single etymtree XML page. Example: `gtpage Template:etymtree/la/germanus | gtparseetymtree`
- `gtdescend <word>` shows the descendants from any words mentioned for a given word.
- `gtread <word>` reads a word from `words.gob`. Example: `gtpage pt/nariz`
- `gtsearch <query>` searches the index for a given word.
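
The split step sizes its output to the number of cores so the parse step can work on all pieces at once. The sketch below illustrates that idea only; it is not the `gtsplit` source. The `shard` and `countAll` helpers are hypothetical, pages are distributed round-robin (an assumption, the real tool may split differently), and counting pages stands in for real parsing work.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// shard distributes pages round-robin across n buckets.
// n is clamped to at least 1 so the modulo below is always valid.
func shard(pages []string, n int) [][]string {
	if n < 1 {
		n = 1
	}
	shards := make([][]string, n)
	for i, p := range pages {
		shards[i%n] = append(shards[i%n], p)
	}
	return shards
}

// countAll processes every shard in its own goroutine and returns the
// number of pages seen per shard. Each goroutine writes to a distinct
// slice index, so no extra locking is needed.
func countAll(shards [][]string) []int {
	counts := make([]int, len(shards))
	var wg sync.WaitGroup
	for i, s := range shards {
		wg.Add(1)
		go func(i int, s []string) {
			defer wg.Done()
			counts[i] = len(s) // stand-in for parsing the shard
		}(i, s)
	}
	wg.Wait()
	return counts
}

func main() {
	pages := []string{"helado", "hombre", "horno", "nariz", "germanus"}
	// N follows the machine's core count, as described for gtsplit.
	shards := shard(pages, runtime.NumCPU())
	fmt.Println(len(shards), countAll(shards))
}
```

The printed shard count depends on the machine, since `runtime.NumCPU()` differs from host to host.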
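
Several commands above read and write `.gob` files. Assuming these use Go's standard `encoding/gob` package (the extension suggests it, but this README does not say so), a round trip could look roughly like the following. The `Word` struct and its fields are hypothetical and do not reflect the project's actual types.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Word is a hypothetical entry shape used only for this illustration.
type Word struct {
	Name        string
	Lang        string
	Descendants []string
}

// encodeWords serializes a slice of words into gob-encoded bytes.
func encodeWords(words []Word) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(words); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeWords reverses encodeWords.
func decodeWords(data []byte) ([]Word, error) {
	var words []Word
	err := gob.NewDecoder(bytes.NewReader(data)).Decode(&words)
	return words, err
}

func main() {
	data, err := encodeWords([]Word{{Name: "nariz", Lang: "pt"}})
	if err != nil {
		panic(err)
	}
	words, err := decodeWords(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(words[0].Name, words[0].Lang)
}
```

In a real pipeline the buffer would be replaced by a file handle, for example one opened with `os.Create` for writing and `os.Open` for reading.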
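
Incremental indexing, as described for `gtindex`, comes down to comparing the previous index with the new one and acting only on the differences. The following is a minimal sketch of that comparison, not the project's implementation: the `Changes` type and `diff` function are hypothetical, and plain strings stand in for serialized entries. No Firestore calls are shown.

```go
package main

import (
	"fmt"
	"sort"
)

// Changes groups the keys that differ between two index snapshots.
type Changes struct {
	Added, Deleted, Updated []string
}

// diff compares an old and a new snapshot keyed by word.
// The string values are a stand-in for a serialized entry.
func diff(oldIdx, newIdx map[string]string) Changes {
	var c Changes
	for k, nv := range newIdx {
		ov, ok := oldIdx[k]
		switch {
		case !ok:
			c.Added = append(c.Added, k)
		case ov != nv:
			c.Updated = append(c.Updated, k)
		}
	}
	for k := range oldIdx {
		if _, ok := newIdx[k]; !ok {
			c.Deleted = append(c.Deleted, k)
		}
	}
	// Map iteration order is random, so sort for stable output.
	sort.Strings(c.Added)
	sort.Strings(c.Deleted)
	sort.Strings(c.Updated)
	return c
}

func main() {
	oldIdx := map[string]string{"helado": "v1", "horno": "v1", "nariz": "v1"}
	newIdx := map[string]string{"helado": "v1", "horno": "v2", "hombre": "v1"}
	fmt.Printf("%+v\n", diff(oldIdx, newIdx))
	// prints {Added:[hombre] Deleted:[nariz] Updated:[horno]}
}
```

Only the three resulting lists would need to be sent to the index, which keeps a run cheap when a parsing change touches few words.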