Create new transform to ingest markdown (.md) files and convert to parquet format #364

Open
wants to merge 1 commit into
base: dev

bogdanscode

Why are these changes needed?

Convert .md files to Parquet files so that they can be processed by the data prep pipeline.
This is the preferred input format for InstructLab.

Related issue number (if any).

178
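
As a rough illustration of the conversion described above, here is a minimal sketch that reads .md files into one row per document and writes them to a Parquet file. It is not the code from this PR and does not use the project's transform API: the function names and the column names (`title`, `document_id`, `contents`, `size`) are hypothetical, and the Parquet step assumes `pyarrow` is installed.

```python
import hashlib
from pathlib import Path


def markdown_to_rows(folder: str) -> list[dict]:
    """Read every .md file under `folder` into one row per document."""
    rows = []
    for path in sorted(Path(folder).rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        rows.append(
            {
                # Column names are hypothetical, chosen for illustration only.
                "title": path.name,
                "document_id": hashlib.sha256(text.encode("utf-8")).hexdigest(),
                "contents": text,
                "size": len(text),
            }
        )
    return rows


def write_parquet(rows: list[dict], out_file: str) -> None:
    """Write the rows to a Parquet file (assumes pyarrow is installed)."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.Table.from_pylist(rows), out_file)
```

Calling `write_parquet(markdown_to_rows("docs"), "docs.parquet")` would produce a single file. This approach loads everything into memory, so it only suits small inputs.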

Member

@daw3rd daw3rd left a comment

Thank you for the contribution. The ingest2parquet tool is slated for deprecation in favor of the new code2parquet transform, which allows for more scalable conversions (think terabytes of data). I suggest using code2parquet as a template to create a new transform: name it markdown2parquet and put it in a new directory, transforms/language/markdown2parquet. Let me know if you have questions.

daw3rd commented Jul 2, 2024

Also, in the future can you sign your commits?

daw3rd commented Jul 3, 2024

Oh, and the code2parquet transform is in transforms/code/code2parquet.
