Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrote simple streaming dataloader #10

Open
wants to merge 3 commits into
base: typing
Choose a base branch
from
Open

Conversation

lennart-finke
Copy link
Collaborator

Description

Added a streaming dataloader to avoid storing large datasets.

Related Issue

Progress towards #8

@danbraunai-apollo
Copy link
Collaborator

Does this still work with distributed data parallel? I'm a bit concerned that this setup is more complex than needed. In general I think we'll need to support:

  1. The dataset and tokenizer will be hosted on huggingface.
  2. The pre-tokenized dataset will be hosted on huggingface (so we don't have to tokenize it on the fly everytime we train).

I think we can just get away with using huggingface's load_dataset with streaming=True. An example is here, which supports loading tokenized or untokenized datasets. Then we would just need to set it up to work for DDP. Not sure of the easiest way, there's probably standard setups here, maybe using a distributed sampler.

I've added the above to #8 (sorry I should have done this earlier).

@lennart-finke
Copy link
Collaborator Author

When writing it I was thinking about distributed data parallel, and it could work, but I have not been able to test that.

Currently the code assumes that the the tokenizer comes from tiktoken and that the streamed dataset is not tokenized.

Am open to closing the PR if there is an easier solution; that decision I would leave to you.

@lennart-finke
Copy link
Collaborator Author

lennart-finke commented Aug 15, 2024

After looking into this a bit more, there is also a builtin in Huggingface for this, the function split_dataset_by_node . If that makes sense, next I'll adapt the file you mentioned to use this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants