
Convert this asset to use current TS processor API #26

Open
kstaken opened this issue Mar 7, 2019 · 1 comment

Comments

@kstaken (Member) commented Mar 7, 2019

This is still using the old-style processor APIs and should be updated at some point.

@macgyver603 (Contributor) commented:

This is mostly done, with the compressed_file_reader being the only processor left to convert. Its slicer is substantially different from the file_reader's, so I think the biggest question here is whether to modernize the compressed_file_reader on its own or to fold its functionality into the file_reader.

Currently, the processor uncompresses files to a separate working directory before slicing them for processing. Once the last slice of a file is processed, an archive mechanism (subject to the known slice-order issue in #17) moves the file to an "archive" directory. If the job is persistent, a timer in the slicer also checks the specified directory for new files at some interval. Finally, the processor maintains on-disk state for each file being processed (I think this should be removed in favor of just logging file statuses where applicable).
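
For reference, a rough TypeScript sketch of that kind of directory polling; the interval value, the enqueue callback, and the dedupe-by-name approach are illustrative assumptions, not the actual slicer code:

import { promises as fs } from 'fs';
import * as path from 'path';

// Illustrative only: poll a directory on an interval and hand any
// file we haven't seen before to an enqueue callback. The real
// slicer's timer mechanism may track state differently.
function watchDirectory(
    dir: string,
    intervalMs: number,
    enqueue: (filePath: string) => void
): NodeJS.Timeout {
    const seen = new Set<string>();
    return setInterval(async () => {
        for (const name of await fs.readdir(dir)) {
            if (!seen.has(name)) {
                seen.add(name);
                enqueue(path.join(dir, name));
            }
        }
    }, intervalMs);
}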


If compression were added as an option for the file_reader, I imagine it would have a compression_type setting with a schema like this:

compression_type: {
    doc: 'Determines whether or not to uncompress files',
    default: 'uncompressed',
    format: ['uncompressed', 'lz4', ...]
}
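
As a sketch of how the slicer might consume that option when opening a file (openReadStream is a hypothetical helper, not part of the file_reader; gzip is shown only because Node's built-in zlib supports it, while lz4 would need a third-party streaming decoder):

import * as fs from 'fs';
import { createGunzip } from 'zlib';

// Hypothetical helper: map the compression_type setting to a readable
// stream over the file. 'lz4' would need a streaming decoder from a
// third-party package, so only the built-in gzip case is shown here.
function openReadStream(filePath: string, compressionType: string): NodeJS.ReadableStream {
    const raw = fs.createReadStream(filePath);
    switch (compressionType) {
        case 'uncompressed':
            return raw;
        case 'gzip':
            return raw.pipe(createGunzip());
        default:
            throw new Error(`unsupported compression_type: ${compressionType}`);
    }
}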

For decompression jobs, the slicer could just decompress the files in place and add both the compressed path and the uncompressed path as metadata for each record. For now, the files would be left on disk as-is, to be cleaned up after the job by an operator or some other process. Adding this to the file_reader should be fairly straightforward, since it would just be a matter of adding the compression utilities to the slicer. The next question is whether or not to preserve the persistent-job logic. I think all of the file reader jobs I have encountered so far have been once (non-persistent) jobs, but if there is a need for persistent file reader jobs, this functionality should at least be extended to the file_reader as well.
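
For example, attaching both paths as record metadata might look something like this sketch; DataEntity.make comes from @terascope/job-components, but the metadata key names here are just placeholders, not an established convention:

import { DataEntity } from '@terascope/job-components';

// Sketch: tag each parsed record with where it came from. The
// compressed_path/uncompressed_path keys are illustrative only.
function attachSourcePaths(
    records: Record<string, unknown>[],
    compressedPath: string,
    uncompressedPath: string
): DataEntity[] {
    return records.map((record) => DataEntity.make(record, {
        compressed_path: compressedPath,
        uncompressed_path: uncompressedPath,
    }));
}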
