
TextSearch or StringSearch package #1

Open
gilesc opened this issue Feb 6, 2014 · 12 comments

Comments

@gilesc

gilesc commented Feb 6, 2014

Continued from JuliaText/TextAnalysis.jl#13 ...

@malmaud , I see two possible scopes for this package. One would be to confine it to in-RAM string search algorithms like Aho-Corasick, Boyer-Moore, etc. This would be fairly straightforward, although one possible complication is that these algorithms really apply to any kind of abstract sequence, not just strings (of characters), so the API might need to be designed as such to be maximally useful. And at that point we are getting a little out of the realm of "text" processing.
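
To make that concrete, here is a rough sketch of what a sequence-generic Horspool search could look like; the name `horspool_search` is made up and the code is illustrative, not an existing API:

```julia
# Sketch of a Boyer-Moore-Horspool search that works over any
# AbstractVector, not just strings. Hypothetical, for illustration.
function horspool_search(haystack::AbstractVector, needle::AbstractVector)
    m, n = length(needle), length(haystack)
    m == 0 && return 1
    # Bad-character shift table keyed on element values, so any
    # element type works, not just Char.
    shift = Dict{eltype(needle), Int}()
    for i in 1:m-1
        shift[needle[i]] = m - i
    end
    pos = 1
    while pos + m - 1 <= n
        j = m
        while j >= 1 && haystack[pos + j - 1] == needle[j]
            j -= 1
        end
        j == 0 && return pos  # match starts at `pos`
        pos += get(shift, haystack[pos + m - 1], m)
    end
    return 0  # no match
end

# The same routine searches characters or arbitrary token sequences:
horspool_search(collect("abracadabra"), collect("cad"))  # => 5
horspool_search([1, 2, 3, 2, 1], [3, 2])                 # => 3
```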

The other direction this package could go is to include both searching of in-RAM strings and of text corpora, which might be in RAM or on disk. Other issues of true text search like lexing, thesauri, inverted indices, etc, would then arise eventually.
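
As a toy illustration of the corpus-search direction, an in-RAM inverted index can be as simple as the following (purely illustrative; a real package would handle tokenization, normalization, and on-disk storage):

```julia
# Build a toy inverted index: term => list of document ids.
function build_index(docs::Vector{String})
    index = Dict{String, Vector{Int}}()
    for (docid, doc) in enumerate(docs)
        for term in unique(split(lowercase(doc)))
            push!(get!(index, String(term), Int[]), docid)
        end
    end
    return index
end

idx = build_index(["the quick fox", "the lazy dog"])
idx["the"]  # => [1, 2]
```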

I think option #2 is the better design (although it is obviously more work). Some of my research interests lie in this direction so I could contribute a fair amount.

So I am open to preferences between these two choices or suggestions that I haven't considered. I'm also open to naming suggestions.

I am also curious about whether there is any kind of (even very rough) roadmap for the various text processing and basic NLP tasks and data structures that will form a common core for JuliaText, including for a potential TextSearch package. For example:

  • Will there be common data structures for corpora, documents, sentences, tokens, part-of-speech tags, parse trees, etc, (which gives the API more structure), or will there just be strings and arrays of strings everywhere (which makes the API more accessible for newcomers)?
  • Where (what package) will "shallow NLP" tasks such as sentence splitting, stemming, and tokenization be handled?

I'd propose that there should be a TextBase package, along the lines of StatsBase, that defines a few of these fundamental data structures and operations. Perhaps TextAnalysis could serve as inspiration for this, but I think things like TFIDF, LSA, and perhaps stemming, could be left to a different package.
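
To make the idea concrete, the kinds of types a TextBase might define could look roughly like this; every name here is hypothetical:

```julia
# Hypothetical TextBase core types; nothing here exists yet.
abstract type AbstractDocument end

struct Token
    text::String
    pos::Symbol  # part-of-speech tag, e.g. :NN or :VB
end

struct Sentence
    tokens::Vector{Token}
end

struct Document <: AbstractDocument
    sentences::Vector{Sentence}
    metadata::Dict{String, Any}
end

struct Corpus
    documents::Vector{Document}
end

# Shallow-NLP operations would then dispatch on these types,
# e.g. tokenize(::String)::Vector{Token} and sentences(::Document).
```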

@johnmyleswhite

I agree with most of what you've proposed here. I think it would be best to keep in-RAM and on-disk search separate at first, so that you can release a completely mature package for in-RAM search while the on-disk solution matures.

I think having advanced data structures is definitely the way to go. I just wanted to make something pretty simple to start with.

Splitting stuff from TextAnalysis into packages seems like a great idea.

@malmaud
Contributor

malmaud commented Feb 7, 2014

I agree with everything John said. I'm not sure John or I have the inclination at the moment for writing something like TextBase ourselves, but I think it's a good idea and would welcome it. I have the time for discussing its organization and pitching in with the coding occasionally, but not for doing the bulk of the work.

I think NLTK is the gold standard here in terms of a consistent set of datatypes that form the basis for a wide range of applications, and is what we should be working towards. In general, I like the scientific Python model of creating an ecosystem that exposes users to just a few very high-quality and fairly broad packages.

Here's what I was thinking for the necessary packages that an end-user would actually interact with. They in turn could be implemented in terms of narrower packages as an organizational aid for us; I'm pretty indifferent about that.

  • A Julia functional equivalent of CoreNLP that handles routine NLP tasks (like parsing, stemming, and so on).
  • Topic modeling (there are some Julia topic modeling packages floating around already)
  • Diffing-related tools
  • Sequence modeling (CRFs, ngrams, ...)
  • Tools for accessing corpora commonly used in empirical NLP
  • Advanced algorithms and datastructures for working with large strings

@johnmyleswhite

Likewise, I agree with everything Jonathan said.

@gilesc
Author

gilesc commented Feb 9, 2014

OK, I'll start on wrapping up my current work into a relatively simple "StringSearch" package, since it doesn't seem like there's enough momentum to start on more serious NLP work just yet.

As for the eventual design of a text mining ecosystem for Julia, I do appreciate (as a user) having an ecosystem with a few, broad packages. It is not entirely clear to me just how that will play out in Julia without nested packages, though. Specifically, for a Julia CoreNLP, that is a LOT of functionality for one namespace. (Note how in both NLTK and CoreNLP that functionality is not in just one namespace.)

Consider just parsing. There will need to be a parser interface (simple), but probably multiple implementations, since parsing is an open research problem. Then there are different kinds of parsers, like HPSG and dependency parsers, which return different data structures for their results.

The same holds true for almost every step of the process. Note in NLTK how you don't just call some stem() function, you call the Porter stemmer or the Lancaster stemmer. A huge mass of functionality is in the nltk.stem module. Obviously it will be a very long time before Julia gets there, but you see my point: what all these NLP steps have in common is a pretty simple interface and many, many complex, partially overlapping, research-grade implementations. So to me it makes sense to define the interface separately (i.e. separate package) from the implementation for "CoreNLP"-type tasks. The implementations will be very slow in coming because there are so many of them.
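
Concretely, the interface/implementation split could look something like this sketch (all type and function names invented for illustration):

```julia
# The interface package would own only the abstract type and the
# generic function...
abstract type Stemmer end
function stem end

# ...while separate implementation packages add concrete methods:
struct PorterStemmer <: Stemmer end
struct LancasterStemmer <: Stemmer end

stem(::PorterStemmer, word::AbstractString) = error("Porter rules would go here")
stem(::LancasterStemmer, word::AbstractString) = error("Lancaster rules would go here")
```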

However, all your other proposed packages besides CoreNLP have good-sized, well-defined scopes.

@johnmyleswhite

You can create nested packages in Julia. The only issue with nesting packages is that, IIRC, you end up paying the memory cost of every nested package whenever the outermost package is loaded. I imagine that could and will be changed at some point in the future, so I wouldn't be too concerned about it as a long-term problem.
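
For instance, nesting via submodules works like this (module names and bodies invented for illustration):

```julia
module CoreNLP

module Stem
    stem(word::AbstractString) = word  # placeholder
end

module Parse
    parse_sentence(s::AbstractString) = split(s)  # placeholder
end

end # module CoreNLP

# Callers reach the nested functionality through the parent:
# CoreNLP.Stem.stem("running")
# CoreNLP.Parse.parse_sentence("the quick fox")
```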

@gilesc
Author

gilesc commented Feb 9, 2014

Yes, I know. But it seems to be very rarely done. I've also read comments by Stefan and others that, even though it is technically supported, it's not considered to be the "Julia way".

@johnmyleswhite

That's true. But there also hasn't been a library big enough for nesting to be a useful way to organize things. Given the scale of most Julia libraries, multiple dispatch has worked out well so far.

@gilesc
Author

gilesc commented Feb 9, 2014

I don't really buy this. Instead of having "Stats.Base" and "Stats.Distributions", a decision was made to put the two in separate packages, even though they are heavily interdependent. There are many similar examples.

I'd argue that the current (small) scale of individual Julia libraries is a consequence of not using nested packages, not the other way around. And this practice promotes fragmentation.

@johnmyleswhite

Distributions is not a great example because that decision was also made with regard to licensing issues.

@gilesc
Author

gilesc commented Feb 9, 2014

MLBase, then. I would rather Pkg.add("Learn") and using Learn.SVM or Learn.GLM, analogous to the way Python's scikit-learn does it, than the current system of hunting down a new package every time I want to try a different ML algorithm.

It is also easier from a documentation and user-understanding standpoint. All ML algos should share a similar interface, and they share the same general concept. Why shouldn't they be documented and packaged together?
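
Something like the following sketch is what I have in mind; `Learn` and the model names are hypothetical:

```julia
# Hypothetical umbrella package exposing one shared interface.
module Learn

abstract type SupervisedModel end

# Two shared verbs define the whole supervised-learning interface;
# each algorithm package would add methods for its own model type.
fit!(model::SupervisedModel, X::AbstractMatrix, y::AbstractVector) =
    error("fit! not implemented for $(typeof(model))")
predict(model::SupervisedModel, X::AbstractMatrix) =
    error("predict not implemented for $(typeof(model))")

end # module Learn

# Every algorithm would then be used the same way:
#   m = Learn.SVM(); Learn.fit!(m, X, y); Learn.predict(m, Xnew)
#   m = Learn.GLM(); Learn.fit!(m, X, y); Learn.predict(m, Xnew)
```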

@gilesc
Author

gilesc commented Feb 9, 2014

(Instead of "all ML algos", I should have said, all supervised learning ML algos).

@johnmyleswhite

Yeah, we all agree with that vision. You're in fact describing our long-term goals.
