
Implement building ngrams storage via python #19

Open
KOLANICH opened this issue Sep 15, 2021 · 4 comments

Comments

@KOLANICH
Contributor

Hi. I have written an abstraction layer around multiple libraries doing word splitting (londonisacapitalofgreatbritain must become london is a capital of great britain). All the libs rely on preprocessed n-gram dicts, some on unigrams, some additionally on bigrams. All of them store these very inefficiently: as a text file with one n-gram per line. For bigrams this already causes duplication.

My middleware provides a unified interface to them and also converts their n-gram formats to one another.

I'd like to support your lib's format for n-gram storage too, but that would require some way to convert other formats into your format and back.
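
Roughly, the abstraction layer looks like this (the names below are only illustrative, not my actual code):

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator, Tuple

NGram = Tuple[str, ...]


class NGramStorage(ABC):
    """Illustrative sketch of the abstraction layer, not the real code."""

    @abstractmethod
    def lookup(self, ngram: NGram) -> float:
        """Return the probability (or count) stored for an n-gram."""

    @abstractmethod
    def __iter__(self) -> Iterator[Tuple[NGram, float]]:
        """Iterate over all (n-gram, value) pairs."""

    @classmethod
    @abstractmethod
    def from_ngrams(cls, ngrams: Iterable[Tuple[NGram, float]]) -> "NGramStorage":
        """Build a storage from (n-gram, value) pairs; each backend converts
        between this common representation and its own on-disk format."""
```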

@jermp
Owner

jermp commented Sep 15, 2021

Hi, could you provide a better description of what exactly you need Tongrams to do?
That way I can help you with this problem better.

@KOLANICH
Contributor Author

KOLANICH commented Sep 15, 2021

Currently I plan to use it as an independent compressed storage format for bigrams, and maybe to try to add support for it to wordninja and wordsegment (they are written in Python and internally look up a probability by n-gram, so these are low-hanging fruit; see the sketch below). instant-segment support cannot really be added quickly in a proper way, since this lib is header-only, so one has to build a binary of this lib first and then use the languages' FFIs. So currently I plan to just convert from tongrams to their formats.
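
To illustrate why they are low-hanging fruit: internally those libs essentially just look up a probability-like score in a plain dict keyed by the n-gram text, roughly like this (a simplified sketch, not their actual code):

```python
import math
from typing import Optional

# Simplified sketch of the lookup such segmenters do internally;
# not the actual wordninja/wordsegment code.
unigrams = {"london": 1000.0, "is": 5000.0, "a": 9000.0}
bigrams = {"london is": 200.0, "is a": 800.0}
total = sum(unigrams.values())

def score(word: str, prev: Optional[str] = None) -> float:
    """Return a log-probability-like score for `word` given the previous word."""
    if prev is not None and prev + " " + word in bigrams:
        return math.log(bigrams[prev + " " + word] / unigrams[prev])
    return math.log(unigrams.get(word, 0.5) / total)
```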

I am not very familiar with this lib yet, and currently I expect that only the following will be useful for my use case:

  • lookup
  • something to iterate over all the n-grams with their probabilities
  • something to create a storage from a list of n-grams (unigrams and bigrams are usually stored separately in the datasets accompanying word-splitting libraries)
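
For illustration only, the kind of calls I have in mind (a hypothetical Python API, not something Tongrams exposes today):

```python
# Hypothetical API sketch; none of these names exist in Tongrams' Python bindings.
storage = build_storage([
    (("london",), 0.01),          # unigram with its probability
    (("london", "is"), 0.002),    # bigram with its probability
])                                # 3) create a storage from a list of n-grams

p = storage.lookup(("london", "is"))      # 1) lookup

for ngram, prob in storage:               # 2) iterate all n-grams with probabilities
    print(" ".join(ngram), prob)
```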

@jermp
Owner

jermp commented Sep 16, 2021

If the libraries you want to use store their output in a (rather standard)
Google-like format as the one illustrated in the README, then you can easily
index their content using Tongrams.
The operations are the ones you mentioned, plus perplexity scoring when
probabilities are associated with n-grams.
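
For reference, data in that spirit is essentially tab-separated n-gram/count lines; a minimal sketch of producing such a file is below (the exact layout Tongrams expects, e.g. headers or required sorting, is the one shown in the README):

```python
# Sketch: dump bigrams as "n-gram<TAB>count" lines, one per line.
# Check the README for the exact layout Tongrams expects (headers, sorting).
bigrams = {("london", "is"): 42, ("is", "a"): 17}

with open("2-grams.txt", "w", encoding="utf-8") as f:
    for ngram, count in bigrams.items():
        f.write(" ".join(ngram) + "\t" + str(count) + "\n")
```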

@KOLANICH
Contributor Author

If the libraries you want to use store their output in a (rather standard)
Google-like format as the one illustrated in the README,

Most of them use a format that is somewhat like the one in the README, but a bit different in whitespace.

then you can easily index their content using Tongrams.

Do you mean using the CLI tools? I meant using the Python API, without any subprocess calls. Yes, for my use case it is possible to pre-serialize the dataset and then consume it (this is likely to be the default use case), though it is not the best way to deal with it: I think of the lib as middleware, I have abstract classes for storing n-grams in some "internal" format, and backends have methods to convert models from/to their own formats to/from the abstraction layer's "internal" format. I also meant adding n-grams one by one via the API, without serializing them as text into a file first only to read that file back with the tongrams lib.
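
Concretely, something like the following is what I mean, purely as a hypothetical sketch of the desired API (Tongrams does not currently expose this):

```python
# Hypothetical incremental-building API; illustrative only,
# not existing Tongrams bindings.
builder = NGramStorageBuilder(order=2)
builder.add(("london",), 0.01)
builder.add(("london", "is"), 0.002)
model = builder.finalize()             # build the compressed structure in memory
prob = model.lookup(("london", "is"))  # then query it directly, no temp files involved
```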
