Write paper for JOSS #7

Merged (13 commits) on Dec 7, 2019
84 changes: 84 additions & 0 deletions paper/paper.bib
@@ -0,0 +1,84 @@
@article{Julia,
  title = {{J}ulia: A Fresh Approach to Numerical Computing},
  author = {Jeff Bezanson and Alan Edelman and Stefan Karpinski and Viral B. Shah},
  year = {2014},
  eprint = {1411.1607},
  eprintclass = {cs.MS},
  eprinttype = {arXiv},
  url = {http://arxiv.org/abs/1411.1607}
}

@inproceedings{NLTK1,
  title = {NLTK: the natural language toolkit},
  author = {Bird, Steven and Loper, Edward},
  booktitle = {Proceedings of the ACL 2004 on Interactive poster and demonstration sessions},
  year = {2004},
  organization = {Association for Computational Linguistics},
  pages = {31},
  url = {http://www.aclweb.org/anthology/P04-3031}
}

@book{NLTK2,
  title = {Natural language processing with Python},
  author = {Bird, Steven and Klein, Ewan and Loper, Edward},
  publisher = {O'Reilly Media, Inc.},
  year = {2009},
  url = {http://www.nltk.org/}
}

@electronic{penntok,
  author = {MacIntyre, Robert},
  title = {Sed script to produce Penn Treebank tokenization on arbitrary raw text},
  organization = {Massachusetts Institute of Technology},
  url = {https://web.archive.org/web/20130804202913/http://www.cis.upenn.edu/%7Etreebank/tokenizer.sed},
  urldate = {2018-08-31},
  year = {1995}
}

@electronic{toktok,
  author = {Dehdari, Jonathan},
  title = {tok-tok: A fast, simple, multilingual tokenizer},
  url = {https://github.com/jonsafari/tok-tok},
  urldate = {2018-08-31},
  year = {2015}
}

@phdthesis{toktokpub,
  title = {A Neurophysiologically-Inspired Statistical Language Model},
  author = {Dehdari, Jonathan},
  year = {2014},
  school = {The Ohio State University}
}

@article{reversibletok1,
  author = {Sebastian J. Mielke and Jason Eisner},
  title = {Spell Once, Summon Anywhere: {A} Two-Level Open-Vocabulary Language Model},
  journal = {CoRR},
  volume = {abs/1804.08205},
  year = {2018},
  url = {http://arxiv.org/abs/1804.08205},
  archiveprefix = {arXiv},
  eprint = {1804.08205}
}

@online{reversibletok2,
  author = {Sebastian J. Mielke},
  title = {A simple, reversible, language-agnostic tokenizer},
  year = {2019},
  url = {https://sjmielke.com/papers/tokenize/},
  urldate = {2018-04-22}
}

@online{tweettok,
  author = {Christopher Potts},
  title = {Sentiment Symposium Tutorial: Tokenizing},
  year = {2011},
  url = {http://sentiment.christopherpotts.net/tokenizing.html#sentiment},
  urldate = {2019}
}
88 changes: 88 additions & 0 deletions paper/paper.md
@@ -0,0 +1,88 @@
---
title: 'WordTokenizers.jl: Basic tools for tokenizing natural language in Julia'
tags:
- julialang
- natural language processing (NLP)
- tokenization
- text mining
- information retrieval
authors:
- name: Ayush Kaushal
orcid: 0000-0002-6703-0728
affiliation: 1
- name: Lyndon White
orcid: 0000-0003-1386-1646
affiliation: 2
- name: Mike Innes
orcid: 0000-0003-0788-0242
affiliation: 3
- name: Rohit Kumar
orcid: 0000-0002-6758-8350
affiliation: 4

affiliations:
- name: Indian Institute of Technology, Kharagpur
index: 1
- name: The University of Western Australia
index: 2
- name: Julia Computing
index: 3
- name: ABV-Indian Institute of Information Technology and Management Gwalior
index: 4

date: 1 July 2019
bibliography: paper.bib
---

# Summary

WordTokenizers.jl is a tool to help users of the Julia programming language [@Julia] work with natural language.
In natural language processing (NLP), tokenization refers to breaking a text up into parts -- the tokens.
Generally, tokenization refers to breaking a sentence up into words and other tokens such as punctuation.
Such _word tokenization_ also often includes some normalizing, such as correcting unusual spellings or removing all punctuation.
Complementary to word tokenization is _sentence segmentation_ (sometimes called _sentence tokenization_),
where a document is broken up into sentences, which can then be tokenized into words.
Tokenization and sentence segmentation are among the most fundamental operations to be performed before applying most NLP or information retrieval algorithms.

WordTokenizers.jl provides a flexible API for defining fast tokenizers and sentence segmenters.
Using this API, several standard tokenizers and sentence segmenters have been implemented, allowing researchers and practitioners to focus on the higher-level details of their NLP tasks.
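
As a brief illustration, the package's two top-level entry points can be used as follows
(a minimal sketch; the exact tokens produced depend on the configured default tokenizer and may vary between versions):

```julia
using WordTokenizers

text = "WordTokenizers.jl is fast. It is also simple to use!"

# Segment the document into sentences, then tokenize each sentence into words.
for sentence in split_sentences(text)
    println(tokenize(sentence))
end
```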

WordTokenizers.jl does not implement significant novel tokenizers or sentence segmenters.
Rather, it contains ports/implementations of well-established and commonly used algorithms.
At present, it contains rule-based methods primarily designed for English.
Several of the implementations are sourced from the Python NLTK project [@NLTK1; @NLTK2],
although these were in turn sourced from older pre-existing methods.

WordTokenizers.jl uses a `TokenBuffer` API and its various lexers for fast word tokenization.
`TokenBuffer` turns the input string into a readable stream.
A desired set of `TokenBuffer` lexers is used to read characters from the stream and flush them out into an array of tokens.
The package provides the following tokenizers made using this API (a brief usage sketch follows the list).

- A Tweet Tokenizer [@tweettok] for casual text.
- A general-purpose NLTK Tokenizer [@NLTK1; @NLTK2].
- An improved version of the multilingual Tok-tok tokenizer [@toktok; @toktokpub].
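
Each of these can be called directly.
The sketch below assumes the exported names `nltk_word_tokenize`, `tweet_tokenize`, and `toktok_tokenize` from the package's documentation; the exact names and outputs may differ between versions:

```julia
using WordTokenizers

s = "Don't forget: #JuliaLang is great :-) https://julialang.org"

nltk_word_tokenize(s)  # general-purpose NLTK-style word tokenization
tweet_tokenize(s)      # casual text; aims to keep hashtags, emoticons and URLs intact
toktok_tokenize(s)     # multilingual Tok-tok tokenization
```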

With the various lexers written for the `TokenBuffer` API, users can also create their own high-speed custom tokenizers with ease, as in the sketch below.
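
For instance, a custom tokenizer can be assembled from the provided lexers roughly as follows
(a sketch assuming the `spaces`, `number`, and `character` lexers described in the package's `TokenBuffer` documentation):

```julia
using WordTokenizers

# Hypothetical custom tokenizer: split on whitespace,
# but keep runs of digits together as single number tokens.
function my_tokenize(input)
    ts = TokenBuffer(input)
    while !isdone(ts)
        spaces(ts) && continue  # skip whitespace, ending the current token
        number(ts) ||           # try to consume a whole number ...
            character(ts)       # ... otherwise consume a single character
    end
    return ts.tokens
end
```
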
The package also provides a simple reversible tokenizer [@reversibletok1; @reversibletok2],
which works by leaving behind special merge symbols, so that the tokens can be reassembled into the original string.

WordTokenizers.jl exposes a configurable default interface,
which allows the default tokenizer and sentence segmenter to be configured globally (wherever these defaults are used).
This allows for easy benchmarking and comparison of different methods.
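
For example, the global default tokenizer can be swapped out in one place
(a sketch; `set_tokenizer` and `toktok_tokenize` are assumed from the package's documentation and may differ between versions):

```julia
using WordTokenizers

tokenize("Hello, world!")       # uses the current default word tokenizer

set_tokenizer(toktok_tokenize)  # change the default used everywhere, including by downstream packages
tokenize("Hello, world!")       # now dispatches to the Tok-tok tokenizer
```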

WordTokenizers.jl is currently being used by packages like [TextAnalysis.jl](https://github.com/JuliaText/TextAnalysis.jl), [Transformers.jl](https://github.com/chengchingwen/Transformers.jl) and [CorpusLoaders.jl](https://github.com/JuliaText/CorpusLoaders.jl) for tokenizing text.

## Other similar software

![Speed comparison of Tokenizers on IMDB Movie Review Dataset](speed_compare.png)

There are various NLP libraries and toolkits written in other programming languages that are available to Julia users for tokenization.
The [NLTK](https://github.com/nltk/nltk) and [spaCy](https://github.com/explosion/spaCy) packages provide a variety of tokenizers, accessible to Julia users via `PyCall`.
Shown above is a performance benchmark of some of the WordTokenizers.jl tokenizers against calling the default tokenizers from NLTK and spaCy through `PyCall`.
This was evaluated on the ~127,000 sentences of the IMDB Movie Review Dataset.
It can be seen that the performance of WordTokenizers.jl is very strong.

There are many more packages, such as [Stanford CoreNLP](https://github.com/stanfordnlp/CoreNLP) and [AllenNLP](https://github.com/allenai/allennlp/), that provide a few basic tokenizers.
However, WordTokenizers.jl is [faster](https://github.com/Ayushk4/Tweet_tok_analyse/tree/master/speed) and simpler to use, providing a wider variety of tokenizers and a means to build custom tokenizers.

# References
Binary file added paper/speed_compare.png