
Welcome! #1

Open
hwchen opened this issue Feb 24, 2019 · 17 comments

Comments

@hwchen
Contributor

hwchen commented Feb 24, 2019

Hi! I’m excited to begin discussion of strategies for implementing a dataframe.

I imagine this repo as the main archive of discussions, with perhaps a discord channel for real-time chat.

I think discussion can be in the issues for now. We could do something more formal eventually, whether a wiki or md files, if we want to crystallize some directions.

Some topics I’m interested in:

  • user api (type checking, ergonomics)

  • backend (performance, integration with other data engines)

  • use cases (pain points from other systems, examples of current production systems to switch to Rust)

  • prior art (discussion of design decisions from other data engineering/scientific computing libraries)

  • WIP (post updates about your current attempt, design decisions, etc.)

@LukeMathWalker

Hello!

I want to come back here to lay down some thoughts, but I thought it would also be interesting to collect the scattered pieces I have seen floating around my corner of Rust about dataframes:

@nevi-me

nevi-me commented Feb 25, 2019

Hi! I have https://github.com/nevi-me/rust-dataframe in addition to what @LukeMathWalker mentioned.

@LukeMathWalker

Another interesting conversation concerning DataFrames: rust-ndarray/ndarray#539

@hwchen
Contributor Author

hwchen commented Mar 25, 2019

@LukeMathWalker thanks for continuing to add references. I'd like to start putting together a document with these sources and perhaps some commentary, like an annotated bibliography.

@galuhsahid

Might be interesting to see Go's approach to this: https://github.com/go-gota/gota

@jblondin

Hi everyone! I just wanted to mention my crate that I've been working on lately: https://github.com/jblondin/agnes.

I guess I should be on reddit more, since it's pretty similar to this (and I originally based the structure on frunk's HLists):

Other food for thought on columnar storage: https://www.reddit.com/r/rust/comments/afo4ln/exploring_columnoriented_data_in_rust_with_frunk/

It's still early code, and I've kinda been working on it in a one-person echo chamber (never a great idea -- my cats are decent debuggers but horrible at calling out bad design decisions), but I think it has some potential. It is typesafe (columns are referred to by unit-like marker structs associated with that column's data type), avoids copies as much as possible, and has basic join, print, iteration, and serialization functionality. I wrote a user guide here. I probably need to write up a design document as well.
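The "unit-like marker struct" idea could be sketched roughly like this (all names here are invented for illustration; this is not agnes's actual API):

```rust
use std::marker::PhantomData;

// A column is named by a zero-sized marker type; its associated type
// pins the column's element type at compile time.
trait Column {
    type Elem;
    const NAME: &'static str;
}

// Two illustrative columns.
struct Age;
impl Column for Age {
    type Elem = u32;
    const NAME: &'static str = "age";
}

struct City;
impl Column for City {
    type Elem = String;
    const NAME: &'static str = "city";
}

// A toy typed series: all access goes through the marker type, so the
// compiler knows each column's element type.
struct TypedSeries<C: Column> {
    data: Vec<C::Elem>,
    _marker: PhantomData<C>,
}

impl<C: Column> TypedSeries<C> {
    fn new(data: Vec<C::Elem>) -> Self {
        TypedSeries { data, _marker: PhantomData }
    }
    fn get(&self, i: usize) -> &C::Elem {
        &self.data[i]
    }
}

fn main() {
    let ages = TypedSeries::<Age>::new(vec![31, 45]);
    let cities = TypedSeries::<City>::new(vec!["Oslo".into(), "Lima".into()]);
    // `ages.get(0)` is a &u32; mixing the columns up would not compile.
    println!("{}: {}, {}: {}", Age::NAME, ages.get(0), City::NAME, cities.get(1));
}
```

Because the marker type is zero-sized, this naming scheme costs nothing at runtime.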

I'm most likely planning on replacing the lowest-level data storage with ndarray for ease of interoperability (especially if ndarray is eventually going to interoperate with Apache Arrow, as Luca mentions here).

Let me know if there's anything I can do to help this initiative -- I'd love to see a stable dataframe library in Rust!

@paddyhoran

Just my opinion...

ndarray and its ecosystem are gaining some good momentum, but I don't believe that a dataframe library in Rust should be based on ndarray. This is how pandas is built, on top of numpy, and Apache Arrow is being developed in part to solve some of the issues that design created.

I believe we should build a data frame library as a 'front end' to Apache Arrow. This library would serve the purpose of data access and "data wrangling", and could provide zero-copy conversion to ndarray data structures. ndarray could then focus on the computations you want to apply to "cleaned" data.
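As a rough illustration of that split (a hypothetical sketch in plain Rust, not the real arrow or ndarray API): the front end owns immutable column buffers and does the wrangling, while a computation layer borrows the underlying buffer without copying.

```rust
// The dataframe layer owns immutable column buffers (a stand-in for
// Arrow arrays); a computation layer borrows the raw buffer zero-copy,
// the way an ndarray view could.
struct Float64Column {
    name: String,
    values: Vec<f64>,
}

struct Frame {
    columns: Vec<Float64Column>,
}

impl Frame {
    // "Data wrangling" lives in the front end: here, row selection.
    fn filter_gt(&self, col: &str, threshold: f64) -> Vec<usize> {
        let c = self.columns.iter().find(|c| c.name == col).expect("no such column");
        (0..c.values.len()).filter(|&i| c.values[i] > threshold).collect()
    }

    // Computation layers borrow the buffer without copying it.
    fn as_slice(&self, col: &str) -> &[f64] {
        &self.columns.iter().find(|c| c.name == col).expect("no such column").values
    }
}

fn main() {
    let frame = Frame {
        columns: vec![Float64Column { name: "price".into(), values: vec![1.0, 3.5, 2.0] }],
    };
    let hits = frame.filter_gt("price", 1.5);   // indices of rows > 1.5
    let view: &[f64] = frame.as_slice("price"); // a borrow, not a copy
    println!("{:?} {}", hits, view.iter().sum::<f64>());
}
```

With real Arrow buffers the same shape holds: the front end wraps the arrays, and numeric code views the memory rather than owning a second copy of it.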

Arrow is seeing adoption from a range of projects and adopting this underlying infrastructure would allow us to take advantage of the Arrow ecosystem.

I'm a committer on the Rust Arrow implementation, along with a few others, and we would welcome input regarding the requirements of higher-level libraries. Others are focusing on lower-level details in Arrow; there is already a query execution engine in Arrow called DataFusion, as mentioned above. This group could then focus on API design and feedback to Arrow.

The key thing is to gain consensus on which project is the dataframe library. The Rust community is smaller, and I think we all need to focus on one data frame project and drive it forward.

This probably requires someone to step forward and volunteer to drive such a project. I don't need such a library badly enough to do this myself, but I would contribute to one if it existed.

@jblondin

I believe we should build a data frame library as a 'front end' to Apache Arrow. This library would serve the purpose of data access and "data wrangling", and could provide zero-copy conversion to ndarray data structures. ndarray could then focus on the computations you want to apply to "cleaned" data.

I see your point and can agree with this -- using Rust as a data science language will require a lot of interoperability and Apache Arrow is the best way forward for this that I've seen. ndarray probably should be seen as a computation target (linalg, stats, etc) instead of a baseline data format.

@nevi-me

nevi-me commented Apr 17, 2019

I agree with @paddyhoran; using Arrow also means we don't have to worry about a lot of IO. I created https://github.com/nevi-me/rust-dataframe with the intention of bikeshedding a dataframe that relies on Arrow for both in-memory data and some computation.

Although rust-dataframe looks stagnant, I'm still working on ideas around it on paper. I'm also contributing to Arrow the things that I'd like to be able to do in the library (I'm mainly working on IO support for basic things like CSV and JSON).

I also think that if/when ndarray supports Arrow, it would make for a great UDF interface where one needs multi-dimensional data, and we could use ndarray's stats functionality in dataframes built in Rust.

The other effort I've been trying, though time is a huge constraint as I have a hectic work schedule plus studying, is creating Arrow interfaces to SQL databases in Rust. I've got a simple PostgreSQL one working, but haven't had time to put it on GitHub.

@LukeMathWalker

LukeMathWalker commented Apr 20, 2019

I think that interoperability should be a core principle of whatever we decide to invest in: it's unreasonable to expect anyone to work in a Rust-only environment for domains such as Machine Learning or Data Engineering.
I don't think it's in anyone's interest to create another isolated computational environment; it would just be a waste of time.

On the other hand, though, I'd like to build an API that feels native and first-class in Rust.
One point that I feel strongly about is using the compiler and the type system to their fullest extent.
I'd love to see typed DataFrames, with compile-time checks on common manipulations (e.g. access to columns by name), steering as far away as possible from a "stringly-typed" API.
It should also be possible to use common Rust patterns (enums, newtypes, etc.) as first-class citizens, thus avoiding the "boundary" feeling that I often experience in Python when my pandas code comes into contact with my business-logic code. It's very similar to what I experience when working with databases/ORMs.
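A minimal sketch of what such non-"stringly-typed" column access could look like, with all type and field names invented for illustration: the column is selected by a type, so a typo or a wrong-typed access fails at compile time rather than at runtime.

```rust
// Marker types standing in for column names.
struct Name;
struct Score;

// Selecting a column by type instead of by string.
trait Get<Col> {
    type Out;
    fn get(&self) -> &Self::Out;
}

struct Row {
    name: String,
    score: f64,
}

impl Get<Name> for Row {
    type Out = String;
    fn get(&self) -> &String { &self.name }
}

impl Get<Score> for Row {
    type Out = f64;
    fn get(&self) -> &f64 { &self.score }
}

fn main() {
    let row = Row { name: "ada".into(), score: 9.5 };
    // A misspelled column or a wrong element type is a compile error,
    // not a runtime KeyError.
    let score: &f64 = <Row as Get<Score>>::get(&row);
    println!("{} {}", <Row as Get<Name>>::get(&row), score);
}
```

The pandas-style equivalent, `row["scroe"]`, would only fail when that line runs; here the compiler rejects it outright.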

ndarray can be a good target for computation-heavy workloads, but Apache Arrow looks like a much more apt foundation for what we are trying to build. I don't have much visibility into the project, though. What is its state right now, @paddyhoran?

@nevi-me

nevi-me commented Apr 20, 2019

Hi @LukeMathWalker, I'll answer the question that you asked @paddyhoran.

Arrow is very usable, although we might make minor or breaking changes to the parts of the library that we're still working on (we don't support some data types that the C++ and other implementations support, and some parts might require refactoring).

We have:

The foundational parts that one would rely on when using Arrow are sound and relatively stable.

@jblondin

One possible concern with the Arrow implementation (please correct me if I'm wrong @nevi-me @paddyhoran) is that it currently seems to require the nightly toolchain. Specifically, there's a dependency on packed_simd and use of the specialization feature (perhaps more; this is just what I gathered from a quick look).

I personally don't see this as a huge problem as eventually these things will be stabilized and we're just starting this project, but I thought I'd point it out.

@nevi-me

nevi-me commented Apr 20, 2019

Yes, I suppose we could hide packed_simd behind a feature flag, but we have to wait for specialization to become stable.
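Such a flag could look roughly like the following Cargo.toml fragment (the feature name and version are hypothetical, not the crate's actual manifest), with the SIMD code paths gated behind `#[cfg(feature = "simd")]`:

```toml
# Hypothetical manifest fragment: packed_simd becomes an optional
# dependency, pulled in only when the "simd" feature is enabled.
[features]
default = []
simd = ["packed_simd"]

[dependencies]
packed_simd = { version = "0.3", optional = true }
```

SIMD kernels would then be compiled only under `cargo build --features simd`, so stable toolchains could still use the rest of the crate.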

One thing I'm personally unsure of is what will happen after 0.15 is released in a few months, because the release after that might be 1.0.0.
We don't follow semver, and because we are a multi-language library, some implementations might still be behind when Arrow C++/Python/Java are considered stable.

@LukeMathWalker

LukeMathWalker commented Apr 21, 2019

I am not too worried about using the nightly toolchain to leverage specialization.

What does this versioning strategy imply, @nevi-me? Do we risk having breaking changes without a bump in the major version number?
It wouldn't be a major problem if Arrow were a private dependency, but if we do happen to expose or use its types in our public API, it becomes more troublesome.

@LukeMathWalker LukeMathWalker mentioned this issue Apr 21, 2019
@nevi-me

nevi-me commented Apr 21, 2019

It would likely be a private dependency. The IPC part of the format is versioned, so when reading Arrow data from, say, an external system, that system would declare its version. That helps avoid breakages.

If a library that uses Arrow doesn't fall too far behind the latest version, small changes should in theory be easy to handle.

One significant consideration, though, is that if we publish a crate that depends on Arrow, we'd likely have to either move at Arrow's cadence (we're aiming for a release every 2 months going forward) or fork it, as DataFusion did before it was donated to Arrow.
This depends on how much we'd contribute upstream to Arrow, as I'd imagine some functionality might be better off upstream. A rising tide lifts all boats.

@jesskfullwood

Hi all. I have created a library similar to @jblondin's, here: https://github.com/jesskfullwood/frames. It does maps, joins, groupby, and filter, all in a typesafe manner, and it allows arbitrary field types (e.g. you can have enums and structs in your columns). But while it is functional, it is much less polished, and I somewhat gave up on it when I decided I couldn't get the ergonomics I wanted (something as intuitive as R's data.table but FAST (even for strings) and TYPESAFE). I decided I would revisit it once GATs and specialization have landed.
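As a toy illustration of "enums in your columns" (all types here are invented for illustration, not frames's API): when the element type is a domain enum, filters become ordinary, compiler-checked comparisons instead of string matching.

```rust
// A domain enum as a column's element type.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Status { Active, Churned }

// A toy two-column frame with columnar storage.
struct Frame {
    status: Vec<Status>,
    revenue: Vec<f64>,
}

impl Frame {
    // A typed filter-and-aggregate: the compiler knows every variant,
    // so there is no "active"/"Actve" string-typo failure mode.
    fn revenue_where(&self, wanted: Status) -> f64 {
        self.status
            .iter()
            .zip(&self.revenue)
            .filter(|(s, _)| **s == wanted)
            .map(|(_, r)| r)
            .sum()
    }
}

fn main() {
    let f = Frame {
        status: vec![Status::Active, Status::Churned, Status::Active],
        revenue: vec![10.0, 5.0, 2.5],
    };
    println!("{}", f.revenue_where(Status::Active)); // prints 12.5
}
```

Adding a new variant to `Status` would surface every `match` on it as a compile error, which is exactly the kind of safety a stringly-typed column can't give.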

I am absolutely looking for something to use in production: at work we have an unmanageably complex series of R scripts, and I'm desperate to introduce some type safety. Something like frameless, only... not Spark.


7 participants