Empty Dataset in distributed mode #331

Closed
Jason3900 opened this issue Feb 4, 2023 · 3 comments

Jason3900 commented Feb 4, 2023

Hi, I'm trying to fully fine-tune CPM-ANT+. I followed the instructions in the README and used preprocess_dataset.py to generate the binary data file. However, when world_size > 1 (distributed mode), the read() method of DistributedDataset raises an "Empty Dataset" error, whereas the same data is read successfully in single-node mode. Could you help me fix this? Thanks. This is how I construct the dataset:

DistributedDataset("path/to/binary/file", bmt.rank(), bmt.world_size()),

@Jason3900 Jason3900 reopened this Feb 4, 2023

Jason3900 (Author) commented Feb 4, 2023

BTW, the input to preprocess_dataset.py follows the format you provided: each line is a JSON object with "task" and "text" as keys.

zh-zheng (Collaborator) commented Feb 4, 2023

This is probably because the amount of data is small. Try passing a smaller block_size when initializing the DistributedDataset.
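
To illustrate why a small dataset can come up empty on some ranks, here is a minimal toy model in Python. It is not the library's actual code: the function `blocks_per_rank`, the round-robin assignment, and the byte sizes are all assumptions chosen for illustration. It only shows the general idea that data split into fixed-size blocks may yield fewer blocks than there are ranks.

```python
import math
from typing import List


def blocks_per_rank(dataset_bytes: int, block_size: int, world_size: int) -> List[int]:
    """Toy model (hypothetical, not the DistributedDataset implementation):
    cut the data into fixed-size blocks and deal them round-robin to ranks."""
    num_blocks = math.ceil(dataset_bytes / block_size)
    counts = [0] * world_size
    for block_id in range(num_blocks):
        counts[block_id % world_size] += 1
    return counts


# Illustrative numbers only: a 20 MiB dataset shared by 4 ranks.
small_dataset = 20 << 20

# With 16 MiB blocks there are only 2 blocks, so two ranks get nothing.
large_blocks = blocks_per_rank(small_dataset, 16 << 20, 4)

# With 1 MiB blocks there are 20 blocks, so every rank gets some data.
small_blocks = blocks_per_rank(small_dataset, 1 << 20, 4)

print(large_blocks)
print(small_blocks)
```

Under this model, a rank with zero blocks has nothing to read, which matches the symptom of an error that appears only when world_size > 1. Shrinking the block size increases the number of blocks until every rank receives at least one.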

Jason3900 (Author) commented

Thanks, it works. I hope this will be mentioned in the README.

@zh-zheng zh-zheng pinned this issue Feb 17, 2023