General purpose data loaders #133
Comments
I think there are a few common patterns we've seen so far:
It's probably best either to encourage storing data in a certain way and pick a single faster pattern to deal with, or to somehow allow mixing and matching among these. Although, I have found that
This looks like a great start. I took the following approach: I assumed a directory structure like the one mentioned above. In deep learning I typically see images as well as text stored in directories, and most unstructured data takes some form of hierarchical storage layout. I'm assuming you guys could use that to your advantage. The problem you're going to run into (for desirable patterns) is time series data. For example, when working with video encoders (a big part of the problems I typically solve), there are several ways to vectorize an image or audio file, and it's usually desirable to have the data in frames. I'm not sure how far you guys would go with this, but it'd be great to see this done right (and in a more integrated fashion). I personally have to target more platforms than servers (phones are a big one for us), but I'd be happy to share lessons learned or contribute in some way.
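To make the "data in frames" point concrete, here is a minimal sketch in Python of cutting a 1-D signal (e.g. decoded audio samples) into fixed-size, possibly overlapping frames. The function name `split_into_frames` and the `frame_size`/`hop` parameters are made up for illustration and are not part of KeystoneML.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], frame_size: int, hop: int) -> List[List[float]]:
    """Cut a 1-D signal into frames of `frame_size` samples, advancing `hop` samples each time.

    Hypothetical helper for illustration only. A trailing partial frame is dropped.
    """
    if frame_size <= 0 or hop <= 0:
        raise ValueError("frame_size and hop must be positive")
    frames = []
    # The last valid start index is len(samples) - frame_size.
    for start in range(0, len(samples) - frame_size + 1, hop):
        frames.append(list(samples[start:start + frame_size]))
    return frames


if __name__ == "__main__":
    # Frames of 4 samples with 50% overlap (hop of 2).
    print(split_into_frames([0, 1, 2, 3, 4, 5], frame_size=4, hop=2))
```

Whether a partial final frame should be dropped or padded is a design choice that a general purpose loader would probably need to expose.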
In the ImageLoaderUtils class we have a function that takes in a filename and produces a label, which is dataset specific (e.g. VOC and ImageNet have different labelsMap functions). Right now this is built for reading from .tar files with hierarchical layouts embedded in them, but I think we could generalize this to layouts on HDFS. One thing we want to discourage, however, is having lots of tiny files on HDFS, because they really hurt HDFS performance, so the current pattern (one tar file per class - or any other sensible way to get a relatively small number of big files) should be encouraged. Re: time series data/performance - this is probably a separate issue, but we've talked a lot about support for hypercubes as a first-class data structure, both as a local data structure and (eventually) a distributed data structure.
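As a rough illustration of the "filename in, label out" idea over a tar archive with a hierarchical layout, here is a small Python sketch. It is not the ImageLoaderUtils code: the names `label_from_top_dir` and `iter_labeled_files` are hypothetical, and the rule "top-level directory is the class" is an assumption that a real dataset (VOC, ImageNet) would replace with its own mapping.

```python
import io
import tarfile
from typing import BinaryIO, Callable, Iterator, Tuple


def label_from_top_dir(member_name: str) -> str:
    """Hypothetical labeling rule: the first real path component is the class name."""
    parts = [p for p in member_name.split("/") if p not in ("", ".")]
    return parts[0]


def iter_labeled_files(
    fileobj: BinaryIO,
    label_fn: Callable[[str], str] = label_from_top_dir,
) -> Iterator[Tuple[str, bytes]]:
    """Yield (label, raw bytes) for each regular file in a tar archive.

    The labeling function is pluggable so each dataset can supply its own.
    """
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, etc.
            extracted = archive.extractfile(member)
            if extracted is None:
                continue
            yield label_fn(member.name), extracted.read()


if __name__ == "__main__":
    # Build a tiny in-memory archive to exercise the loader.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name, payload in [("cat/a.jpg", b"aa"), ("dog/b.jpg", b"bbb")]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    for label, data in iter_labeled_files(buf):
        print(label, len(data))
```

Reading members sequentially from one large archive matches the "small number of big files" pattern described above, since the storage layer only ever sees a single large file per archive.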
cc @thisisdhaas @sjyk who are also interested in general purpose data loaders for data that comes from SampleClean |
We have included a number of data loaders tailored to standard academic datasets with KeystoneML, but it would be good to include general purpose WAV and image loaders in the project as well.
In particular, much of the work we did with ImageNet involved working around bugs in Java image libraries and some of that work can be repurposed.
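For a sense of what a general purpose WAV loader might involve, here is a minimal Python sketch using only the standard library `wave` and `struct` modules. The function name `load_wav_pcm16` is hypothetical, and it assumes 16-bit PCM input; it illustrates the shape of such a loader and is not a proposal for the KeystoneML API.

```python
import struct
import wave
from typing import List, Tuple


def load_wav_pcm16(path_or_file) -> Tuple[int, int, List[int]]:
    """Read a 16-bit PCM WAV file.

    Returns (sample_rate, channel_count, samples). For multi-channel audio
    the samples are interleaved, exactly as stored in the file.
    Hypothetical helper for illustration only.
    """
    with wave.open(path_or_file, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        rate = wav.getframerate()
        channels = wav.getnchannels()
        raw = wav.readframes(wav.getnframes())
    count = len(raw) // 2
    # WAV PCM data is little-endian; "h" is a signed 16-bit integer.
    samples = list(struct.unpack("<%dh" % count, raw[: count * 2]))
    return rate, channels, samples
```

A fuller loader would also need to handle other sample widths and de-interleave channels, which this sketch leaves out.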