A Python-based CLI tool to set up the database for a dictionary app.
The data file included with the freely licensed Ding dictionary lookup program can be used to populate the database with English ↔ German translations.
- Imports data quickly using bulk insertion and configurable chunking. It takes about 40 seconds to import 367 000 translations (753 000 words) into a new SQLite database.
- Uses Python generators to process large amounts of data efficiently, i.e. without consuming too much memory.
- The import process is atomic: if an error occurs midway through the task, it is aborted and the database remains in its original state (uses SQL transactions).
- Abstraction of SQL/DDL specifics (via SQLAlchemy) allows for easily adding support for new databases (tested with PostgreSQL and SQLite).
- Flexible command line interface (CLI) which supports reading from standard input (via Click).
- Reasonably scalable, OpenAPI-compliant web API (via FastAPI).
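The generator-based chunking mentioned in the list above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the names chunked and read_lines are hypothetical, and skipping lines that start with "#" is an assumption about the input format.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most chunk_size items from any iterable."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(items)
    while True:
        # islice pulls at most chunk_size items, so only one chunk is
        # held in memory at a time.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def read_lines(lines: Iterable[str]) -> Iterator[str]:
    """Lazily yield stripped lines, skipping blanks and "#" comments (assumed format)."""
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            yield line
```

Because both functions are lazy, something like chunked(read_lines(open_file), 10000) would read the file only as fast as the chunks are consumed, for example by one bulk insert per chunk.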
Type dictionarydb to start the CLI tool:
$ dictionarydb
Since no command was specified, it will print a block of usage information:
Usage: dictionarydb [OPTIONS] COMMAND [ARGS]...

  Set up and populate a translation dictionary database.

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  api     Start the API server.
  import  Import new entries into the dictionary database.
  init    Create the database schema for the dictionary database.
Three commands are available:
- dictionarydb init to initialise a new database (see Initialising the database).
- dictionarydb import to import translations into the database (see Importing translations).
- dictionarydb api to run the lookup API server (see Starting the API server).
You can run each command with the --help argument to show the available options. For example, to show usage information for the init command:
$ dictionarydb init --help
This would give you the following output:
Usage: dictionarydb init [OPTIONS]

  Create the database schema for the dictionary database.

Options:
  -u, --database-url TEXT   URL of the database to initialize.
  --confirm / --no-confirm  Whether or not to ask for confirmation before
                            proceeding.
  --help                    Show this message and exit.
Use the init command to create the database schema:
$ dictionarydb init
By default, this will create a new SQLite database in the data/ directory. If you want to use PostgreSQL instead, use the --database-url option:
$ dictionarydb init --database-url='postgresql://localhost:5432/dictionary'
Note: you will need to create the database (dictionary in this example) manually before running the command.
When all is done, the dictionary schema will have been created in your database.
Now you can populate the database with data. First, download the Ding dictionary data file and unpack it:
$ curl --output de-en.txt.xz https://ftp.tu-chemnitz.de/pub/Local/urz/ding/de-en-devel/de-en.txt.xz
$ xz --decompress de-en.txt.xz
This should give you a file named de-en.txt in the current directory.
You can now point the dictionarydb import command at your de-en.txt file to run an import:
$ dictionarydb import ./de-en.txt --source-language="deu" --target-language="eng"
It will give you the following output:
Starting dictionary import from file "./de-en.txt"…
This will remove all existing entries. Continue? [y/N]: y
Removing existing dictionary entries…
Creating languages…
Storing new dictionary entries…
Committing transaction…
Successfully completed dictionary import (0 deleted, 376541 added, 39.67 seconds elapsed).
Once the import is successful, your database should contain about 750 000 words.
Note: if you want to use PostgreSQL instead, use the --database-url option again as described above. Set the DICTIONARYDB_DATABASE_URL environment variable to the same value to make it persistent (see also: Configuration).
The CLI tool also supports reading data from another shell command. Pass - as the input filename (meaning stdin) in this scenario:
$ xzcat ./de-en.txt.xz | dictionarydb import - --source-language="deu" --target-language="eng" --no-confirm
You could even do a streaming import directly over HTTP:
$ curl --silent https://ftp.tu-chemnitz.de/pub/Local/urz/ding/de-en-devel/de-en.txt.xz | xzcat | dictionarydb import - --source-language="deu" --target-language="eng" --no-confirm
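The "-" convention used in the commands above can be sketched in a few lines of Python. This is only an illustration using the standard library; open_input is a hypothetical helper, and how the tool itself resolves "-" is not shown here (Click's File parameter type offers the same convention out of the box).

```python
import sys
from contextlib import contextmanager
from typing import Iterator, TextIO


@contextmanager
def open_input(filename: str) -> Iterator[TextIO]:
    """Yield a readable text stream; the filename "-" means standard input."""
    if filename == "-":
        # stdin belongs to the process, so it is handed out without
        # being closed afterwards.
        yield sys.stdin
    else:
        with open(filename, encoding="utf-8") as stream:
            yield stream
```

Code that consumes the stream line by line then works the same way for a regular file and for piped input such as the xzcat output.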
You could use a cron job to populate the database automatically and keep it up to date in an unattended fashion.
For example, the following command will download the latest source file and then decompress and import it into the database:
$ curl --output de-en.txt.xz https://ftp.tu-chemnitz.de/pub/Local/urz/ding/de-en-devel/de-en.txt.xz && xzcat de-en.txt.xz | dictionarydb import - --source-language="deu" --target-language="eng" --min-entries=370000 --no-confirm
Note: this uses the --min-entries option to enforce that a minimum number of valid entries is imported successfully. If that is not the case, then the import will fail cleanly, meaning the database transaction will be rolled back and the database will remain in its original state.
The --no-confirm option is used to prevent the command from waiting indefinitely for the user's confirmation.
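The fail-cleanly behaviour described above can be sketched with Python's built-in sqlite3 module. This is a simplified stand-in, not the tool's implementation (which goes through SQLAlchemy): the entry table, its columns and the import_entries function are hypothetical names chosen for the example.

```python
import sqlite3
from typing import Iterable, Tuple


def import_entries(
    connection: sqlite3.Connection,
    entries: Iterable[Tuple[str, str]],
    min_entries: int = 0,
) -> int:
    """Replace all rows of the (hypothetical) entry table in one transaction."""
    try:
        # sqlite3 opens a transaction implicitly before the first
        # DELETE/INSERT, so nothing is permanent until commit().
        connection.execute("DELETE FROM entry")
        connection.executemany(
            "INSERT INTO entry (word, translation) VALUES (?, ?)", entries
        )
        added = connection.execute("SELECT COUNT(*) FROM entry").fetchone()[0]
        if added < min_entries:
            raise ValueError(
                f"only {added} entries imported, expected at least {min_entries}"
            )
        connection.commit()
        return added
    except Exception:
        # Any failure, including too few entries, restores the old rows.
        connection.rollback()
        raise
```

If the minimum is not reached, the ValueError propagates after the rollback, so a scheduled job would exit with an error while the previous data stays in place.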
While most of the options can be passed to dictionarydb using command line flags, you might want to make some settings persistent. You can do this by setting one or more of the following environment variables:
- DICTIONARYDB_LOG_LEVEL: The log level (verbosity) to use. Defaults to "INFO".
- DICTIONARYDB_LOG_COLORS: Whether or not to color the log output. Defaults to true.
- DICTIONARYDB_DATABASE_URL: A connection URL to use for connecting to the database. The default is to create a new SQLite database file in the data/ directory.
- DICTIONARYDB_IMPORT_CHUNK_SIZE: Maximum number of entries to hold in memory at once during the import. Data will be sent to the database (and freed from memory) once that many entries have been read. Defaults to 10 000.
- DICTIONARYDB_API_HOST: Network address on which the API server should listen. Defaults to localhost.
- DICTIONARYDB_API_PORT: TCP port number on which the API server should run. Defaults to 8080.
- DICTIONARYDB_API_TRUST_PROXY_IPS: Proxy IP addresses to trust when determining the client's IP, port and protocol. By default, only 127.0.0.1 (i.e. a proxy running locally) is trusted.
Hint: clicking the name of a setting will take you to a more detailed description of the respective setting along with configuration examples.
To make the dictionarydb tool always use a local PostgreSQL database, you could set the DICTIONARYDB_DATABASE_URL environment variable as follows:
export DICTIONARYDB_DATABASE_URL="postgresql://localhost:5432/dictionary"
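To illustrate how such environment variables typically map onto settings with defaults, here is a small sketch. It is not the tool's own configuration code: Settings and load_settings are hypothetical names, the SQLite URL used as the default is a placeholder (the text only says the file lives in data/), and the accepted spellings for boolean values are an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    log_level: str
    log_colors: bool
    database_url: str
    import_chunk_size: int
    api_host: str
    api_port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read DICTIONARYDB_* variables, falling back to the documented defaults."""

    def get(name: str, default: str) -> str:
        return env.get("DICTIONARYDB_" + name, default)

    return Settings(
        log_level=get("LOG_LEVEL", "INFO"),
        # Assumed truthy spellings; the real parser may accept others.
        log_colors=get("LOG_COLORS", "true").strip().lower() in {"1", "true", "yes", "on"},
        # Placeholder file name: only the data/ directory is documented.
        database_url=get("DATABASE_URL", "sqlite:///data/dictionary.db"),
        import_chunk_size=int(get("IMPORT_CHUNK_SIZE", "10000")),
        api_host=get("API_HOST", "localhost"),
        api_port=int(get("API_PORT", "8080")),
    )
```

Passing a plain dictionary instead of os.environ makes the function easy to try out, for example load_settings({"DICTIONARYDB_API_PORT": "9000"}).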
See this query for an example of how you could look up a word and its available translations using SQL.
The dictionarydb api command lets you start the API server:
$ dictionarydb api
If all goes well, the API is now running:
Starting API server on http://localhost:8080…
Hit Ctrl+C if you need to shut it down again.
Point curl to the /lookup endpoint as follows (here, to translate the word conscientious):
$ curl --request GET "http://localhost:8080/lookup?source_language=eng&target_language=deu&search_string=conscientious&max_results=3" --header "Accept: application/json"
Note that you need to pass three GET parameters:
- source_language: ISO 639-3 code of the language you want to translate from.
- target_language: ISO 639-3 code of the language into which you want to translate.
- search_string: the word (or a substring of it) you would like to look up.
The max_results query parameter is optional. It can be used to limit the number of results that are returned.
curl should give you a response as follows:
{
  "results": [
    {
      "word": "conscientious",
      "language": "eng",
      "translation": "gewissenhaft {adj}",
      "translation_language": "deu",
      "relevance": 1.0
    },
    {
      "word": "conscientiously",
      "language": "eng",
      "translation": "gewissenhaft {adv}",
      "translation_language": "deu",
      "relevance": 0.7647058963775635
    },
    {
      "word": "conscientiousness",
      "language": "eng",
      "translation": "Gewissenhaftigkeit {f} [psych.]",
      "translation_language": "deu",
      "relevance": 0.6842105388641357
    }
  ]
}
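The same request can be made from Python with only the standard library. The sketch below assumes the server is running locally as shown above; build_lookup_url and lookup are hypothetical helper names, not part of dictionarydb.

```python
import json
from typing import Any, Dict, List, Optional
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_lookup_url(
    base_url: str,
    source_language: str,
    target_language: str,
    search_string: str,
    max_results: Optional[int] = None,
) -> str:
    """Build the /lookup URL with properly escaped query parameters."""
    params = {
        "source_language": source_language,
        "target_language": target_language,
        "search_string": search_string,
    }
    if max_results is not None:
        params["max_results"] = str(max_results)
    return f"{base_url.rstrip('/')}/lookup?{urlencode(params)}"


def lookup(url: str, timeout: float = 10.0) -> List[Dict[str, Any]]:
    """Send the GET request and return the list under the "results" key."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)["results"]
```

For example, lookup(build_lookup_url("http://localhost:8080", "eng", "deu", "conscientious", 3)) would return the three result objects shown above, provided the API server is up and the data has been imported.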
Consider using a graphical API client like Insomnia for a more comfortable experience.
Finally, there is an OpenAPI documentation site available at http://localhost:8080/docs. Use the "Try it out" button on the /lookup resource to perform dictionary lookups.
A demo deployment of the dictionary lookup API is available at https://dictionarydb.herokuapp.com. It uses the Heroku scheduler add-on to keep the database up to date with the latest translations automatically.
Contributions welcome! See the CONTRIBUTING.md document for an overview of how to set up the project for development.