glossterm 0.5

glossterm is a pipeline that extracts, lexes, and parses Wiktionary data.

Pipeline

To generate files for the web app, grab an English Wiktionary dump, put it in data/, and run the following commands.

You can run each command directly with go run (e.g. go run cmd/gtdump/main.go), or run make to install the commands globally and then invoke them by name (e.g. gtdump).

gtdump downloads the Wiktionary dump to en.xml.bz2.
gtsplit splits the Wiktionary dump into N files so they can be parsed in parallel. N is set to the number of cores on the current machine.
gtparse parses the split files into words.gob and descendants.gob. After the initial change to the index, use --no-backup to edit the index in place and compare it to the previously committed index.
gtresolve reads words.gob, looks up DescendantTrees references in descendants.gob, and inlines them.
gtquads generates quads for each word to power graph lookups, e.g. finding all descendants of the Latin roots of a given word.
gtbeam fetches cognates in parallel using the Apache Beam local runner.
gtcognates inlines cognates from gtbeam into words.gob.
gtcompare compares the new index to the old index. Always use it to manually verify parsing changes.
gtindex incrementally indexes words (additions, deletions, updates) in Firestore.
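
The stages above could be chained by a small driver. The following is only a sketch, not part of glossterm: it assumes the commands are run in the order listed, that each one can be invoked without arguments, and that the commands have been installed with make. The runPipeline and execStage helpers and the --exec switch are hypothetical names introduced for this illustration.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// stages lists the pipeline commands in the order the README gives them.
var stages = []string{
	"gtdump", "gtsplit", "gtparse", "gtresolve", "gtquads",
	"gtbeam", "gtcognates", "gtcompare", "gtindex",
}

// runPipeline calls run for each stage in order and stops at the first
// failure. It returns the stages that completed successfully.
func runPipeline(run func(name string) error) ([]string, error) {
	var done []string
	for _, name := range stages {
		if err := run(name); err != nil {
			return done, fmt.Errorf("%s: %w", name, err)
		}
		done = append(done, name)
	}
	return done, nil
}

// execStage runs an installed command with no arguments.
// Assumption: the real commands may need flags this sketch does not know.
func execStage(name string) error {
	cmd := exec.Command(name)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	// Dry run by default so the sketch is safe to execute anywhere.
	run := func(name string) error {
		fmt.Println("would run:", name)
		return nil
	}
	// Hypothetical switch of this sketch, not a glossterm flag.
	if len(os.Args) > 1 && os.Args[1] == "--exec" {
		run = execStage
	}
	if _, err := runPipeline(run); err != nil {
		fmt.Fprintln(os.Stderr, "pipeline stopped:", err)
		os.Exit(1)
	}
}
```

Stopping at the first failing stage matters here because each later stage reads files written by an earlier one, such as words.gob.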

Debugging a single word

gtpage <word> extracts a single XML page for a given word. Example: gtpage helado
gtlex <word.xml> lexes a single XML page for a given word. Example: gtpage hombre | gtlex
gtparseword <word.xml> parses a single XML word. Example: gtpage horno | gtparseword
gtparseetymtree <word.xml> parses a single etymtree XML page. Example: gtpage Template:etymtree/la/germanus | gtparseetymtree
gtdescend <word> shows the descendants from any words mentioned for a given word.
gtread <word> reads a word from words.gob. Example: gtread pt/nariz
gtsearch <query> searches the index for a given word.
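
Several of the examples above pipe gtpage into another command. If you want to do the same from a Go program or test, a minimal sketch using os/exec might look like this. The pipeOutput helper is a hypothetical name, and the sketch assumes the commands read from stdin and write to stdout as the shell examples suggest.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"os/exec"
)

// pipeOutput runs the equivalent of `first | second` and returns what
// second writes to stdout. Each slice is a command name plus arguments.
func pipeOutput(first, second []string) (string, error) {
	producer := exec.Command(first[0], first[1:]...)
	consumer := exec.Command(second[0], second[1:]...)

	// Connect the producer's stdout to the consumer's stdin.
	pipe, err := producer.StdoutPipe()
	if err != nil {
		return "", err
	}
	consumer.Stdin = pipe

	var out bytes.Buffer
	consumer.Stdout = &out
	consumer.Stderr = os.Stderr

	// Start the consumer first so it is ready to read.
	if err := consumer.Start(); err != nil {
		return "", err
	}
	// The consumer now holds its own copy of the read end, so drop ours.
	pipe.Close()

	runErr := producer.Run()
	waitErr := consumer.Wait()
	if runErr != nil {
		return "", fmt.Errorf("%s: %w", first[0], runErr)
	}
	if waitErr != nil {
		return "", fmt.Errorf("%s: %w", second[0], waitErr)
	}
	return out.String(), nil
}

func main() {
	// Mirrors the README example: gtpage hombre | gtlex
	if _, err := exec.LookPath("gtpage"); err != nil {
		fmt.Println("gtpage is not installed; run make first")
		return
	}
	out, err := pipeOutput([]string{"gtpage", "hombre"}, []string{"gtlex"})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```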
