TextSearch or StringSearch package #1
I agree with most of what you've proposed here. I think it would be best to keep in-RAM and on-disk search separate at first, just so that you release a package for in-RAM search that is completely mature while you let the on-disk solution mature. I think having advanced data structures is definitely the way to go; I just wanted to make something pretty simple to start with. Splitting stuff from TextAnalysis into packages seems like a great idea.
I agree with everything John said. I'm not sure John or I have the inclination at the moment for writing something like TextBase ourselves, but I think it's a good idea and would welcome it. I have the time for discussing its organization and pitching in with the coding occasionally, but not for doing the bulk of the work. I think NLTK is the gold standard here in terms of a consistent set of datatypes that form the basis for a wide range of applications, and is what we should be working towards. In general, I like the scientific Python model of creating an ecosystem that exposes users to just a few very high-quality and fairly broad packages. Here's what I was thinking for the necessary packages that an end-user would actually interact with. They in turn could be implemented in terms of narrower packages as an organizational aid for us; I'm pretty indifferent about that.
Likewise, I agree with everything Jonathan said.
OK, I'll start on wrapping up my current work into a relatively simple "StringSearch" package, since it doesn't seem like there's enough momentum to start on more serious NLP work just yet.

As for the eventual design of a text mining ecosystem for Julia, I do appreciate (as a user) having an ecosystem with a few, broad packages. It is not entirely clear to me just how that will play out in Julia without nested packages, though. Specifically, for a Julia CoreNLP, that is a LOT of functionality for one namespace. (Note how in both NLTK and CoreNLP that functionality is not in just one namespace.)

Consider just parsing. There will need to be a parser interface (simple), but probably multiple implementations, since parsing is an open research problem. Then there are different kinds of parsers, like HPSG and dependency parsers, that return different data structures for their results. The same holds true for almost every step of the process. Note in NLTK how you don't just call some stem() function; you call the Porter stemmer or the Lancaster stemmer. A huge mass of functionality is in the nltk.stem module. Obviously it will be a very long time before Julia gets there, but you see my point: what all these NLP steps have in common is a pretty simple interface and many, many complex, partially overlapping, research-grade implementations. So to me it makes sense to define the interface separately (i.e. in a separate package) from the implementations for "CoreNLP"-type tasks. The implementations will be very slow in coming because there are so many of them. However, for all your other proposed packages besides CoreNLP, I think each has a good-sized, well-defined scope.
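The interface-vs-implementations point above can be sketched in a few lines. This is a hedged illustration, not any actual JuliaText or NLTK API: the names `Stemmer`, `SuffixStripStemmer`, and `IdentityStemmer` are hypothetical, and the suffix-stripping rule is a deliberately naive stand-in for a real algorithm like Porter's. The structural point is that callers depend only on the one-method interface, while any number of research-grade implementations can coexist behind it.

```python
from abc import ABC, abstractmethod

class Stemmer(ABC):
    """A minimal stemmer interface: one method, many possible implementations."""
    @abstractmethod
    def stem(self, word: str) -> str: ...

class SuffixStripStemmer(Stemmer):
    """Toy implementation: strips a fixed set of English suffixes."""
    SUFFIXES = ("ing", "ed", "s")

    def stem(self, word: str) -> str:
        for suf in self.SUFFIXES:
            # Only strip if a reasonably long stem remains.
            if word.endswith(suf) and len(word) > len(suf) + 2:
                return word[: -len(suf)]
        return word

class IdentityStemmer(Stemmer):
    """Another implementation of the same interface: a no-op baseline."""
    def stem(self, word: str) -> str:
        return word

def stem_all(stemmer: Stemmer, words):
    # Client code is written against the interface, not a concrete stemmer.
    return [stemmer.stem(w) for w in words]
```

Swapping `SuffixStripStemmer` for `IdentityStemmer` (or a Porter or Lancaster implementation) requires no change to `stem_all`, which is exactly the property that makes a separate interface package attractive.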
You can create nested packages in Julia. The only issue with nesting packages is that, IIRC, you end up paying the memory costs for all the packages in the outermost loaded package. I imagine that could and will be changed at some point in the future, so I wouldn't be too concerned with that issue as a long-term problem.
Yes, I know. But it seems to be very rarely done. I've also read comments by Stefan and others that, even though it is technically supported, it's not considered to be the "Julia way".
That's true. But there's also not been any library big enough that it would have been a useful way to organize things. Given the scale of most Julia libraries, multiple dispatch has worked out well so far.
I don't really buy this. Instead of having "Stats.Base" and "Stats.Distributions", a decision was made to put the two in separate packages, even though they are heavily interdependent. There are many similar examples. I'd argue that the current (small) scale of individual Julia libraries is a consequence of not using nested packages, not the other way around. And this practice promotes fragmentation.
Distributions is not a great example because that decision was also made with regard to licensing issues. |
MLBase, then. I would rather see that functionality kept together. It is also easier from a documentation and user-understanding standpoint: all ML algos should share a similar interface, and they share the same general concept, so why shouldn't they be documented and packaged together?
(Instead of "all ML algos", I should have said, all supervised learning ML algos). |
Yeah, we all agree with that vision. You're in fact describing our long-term goals. |
Continued from JuliaText/TextAnalysis.jl#13 ...
@malmaud , I see two possible scopes for this package. One would be to confine it to in-RAM string search algorithms like Aho-Corasick, Boyer-Moore, etc. This would be fairly straightforward, although one possible complication is that these algorithms really apply to any kind of abstract sequence, not just strings (of characters), so the API might need to be designed as such to be maximally useful. And at that point we are getting a little out of the realm of "text" processing.
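To make the "abstract sequence, not just strings" point concrete, here is a sketch (in Python, since the thread uses NLTK as its reference point) of a Boyer-Moore-Horspool search written against any indexable sequence of hashable items. The function name is hypothetical; the point is only that nothing in the algorithm requires the haystack to be a string of characters.

```python
def horspool_search(pattern, sequence):
    """Boyer-Moore-Horspool search over any indexable sequence of hashable
    items (a string, a list of ints, a token list). Returns the index of
    the first occurrence of `pattern`, or -1 if absent."""
    m, n = len(pattern), len(sequence)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Bad-item shift table, built from all but the last pattern position.
    shift = {pattern[i]: m - 1 - i for i in range(m - 1)}
    i = 0
    while i <= n - m:
        # Compare right-to-left against the current window.
        j = m - 1
        while j >= 0 and sequence[i + j] == pattern[j]:
            j -= 1
        if j < 0:
            return i
        # Shift by the last item in the window (default: whole pattern length).
        i += shift.get(sequence[i + m - 1], m)
    return -1
```

Because it only uses indexing, `len`, and equality, the same code finds `"abc"` in a string and `[2, 3]` in a list of integers, which is the generality (and the drift away from purely "text" processing) described above.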
The other direction this package could go is to include both searching of in-RAM strings and of text corpora, which might be in RAM or on disk. Other issues of true text search like lexing, thesauri, inverted indices, etc, would then arise eventually.
I think option #2 is the better design (although it is obviously more work). Some of my research interests lie in this direction so I could contribute a fair amount.
So I am open to preferences between these two choices or suggestions that I haven't considered. I'm also open to naming suggestions.
I am also curious about whether there is any kind of (even very rough) roadmap for the various text processing and basic NLP tasks and data structures that will form a common core for JuliaText, including for a potential TextSearch package. For example:
I'd propose that there should be a TextBase package, along the lines of StatsBase, that defines a few of these fundamental data structures and operations. Perhaps TextAnalysis could serve as inspiration for this, but I think things like TFIDF, LSA, and perhaps stemming, could be left to a different package.
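To make the TextBase proposal slightly more concrete, here is one possible shape for such fundamental datatypes, sketched in Python. The `Document` and `Corpus` names and methods are entirely hypothetical illustrations of "a few fundamental data structures and operations"; they do not describe TextAnalysis or any existing package.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """A hypothetical base type: raw text plus lazily computed tokens."""
    text: str
    _tokens: list = field(default=None, repr=False)

    def tokens(self):
        # Tokenize on first access and cache; naive whitespace tokenizer.
        if self._tokens is None:
            self._tokens = self.text.lower().split()
        return self._tokens

@dataclass
class Corpus:
    """A hypothetical collection type over which higher-level packages
    (TFIDF, LSA, stemming) could operate without redefining the basics."""
    documents: list

    def vocabulary(self):
        return sorted({t for d in self.documents for t in d.tokens()})
```

The design point is the split the proposal describes: a base package owns `Document`/`Corpus`-style types and cheap operations like `tokens()` and `vocabulary()`, while weightier algorithms (TFIDF, LSA) live in packages that merely consume these types.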