Support incremental downloads of large files somehow. #18

Open
crlf0710 opened this issue Aug 23, 2023 · 6 comments

Comments

@crlf0710

For larger models (maybe >= 3.0 GB), there's a higher chance that downloads get aborted by network failures on slower networks. I wonder if it's possible to make the downloads incremental, so that restarting from the beginning is not needed?

@LaurentMazare
Collaborator

Sounds like a nice thing to have (I would also enjoy it). I have no clue how hard that would be, but it seems more like an issue for the upstream hf-hub crate. @Narsil what do you think?

@Narsil
Collaborator

Narsil commented Aug 23, 2023

Definitely for upstream.

The async version already has a retry parameter (it's not resumable downloads; it just retries when some part of the file failed).

The sync version is much simpler for now.
We could easily add something like this: https://docs.rs/ureq/latest/ureq/enum.Error.html#examples
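
A minimal sketch of what such a retry wrapper could look like, using only the standard library. This is not the hf-hub or ureq API: `retry_with_backoff` and its parameters are hypothetical names, and the closure stands in for whatever request the sync code would actually make. It retries the whole operation from scratch, so like the async retry it is not a resumable download.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper: runs `op` up to `max_attempts` times.
/// After each failure it sleeps, doubling the wait every time.
/// Returns the first success, or the last error once attempts run out.
pub fn retry_with_backoff<T, E, F>(
    max_attempts: u32,
    initial_wait: Duration,
    mut op: F,
) -> Result<T, E>
where
    // The closure receives the 1-based attempt number.
    F: FnMut(u32) -> Result<T, E>,
{
    assert!(max_attempts > 0, "max_attempts must be at least 1");
    let mut wait = initial_wait;
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: hand the last error back to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                sleep(wait);
                wait = wait.saturating_mul(2);
                attempt += 1;
            }
        }
    }
}
```

In real code the closure would probably also inspect the error and give up immediately on failures that a retry cannot fix.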

Resumable downloads would be harder, as they would require using temporary files for parts and validating those parts during resume. huggingface_hub does not do that: it merely resumes assuming all bytes already written are correct and uses the size of the file as the hint for where to resume, which might be very tricky given the current async code.
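
For illustration, the "use the file size as the resume hint" approach can be sketched with the standard library alone. The names `ResumePlan`, `plan_from_len` and `plan_resume` are hypothetical and not part of hf-hub. The sketch assumes the server honours HTTP range requests and, as described above, trusts every byte already on disk without validating it.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical outcome of inspecting a partial download.
#[derive(Debug, PartialEq)]
pub enum ResumePlan {
    /// Nothing usable on disk: request the whole file.
    Full,
    /// Value to send as the HTTP `Range` header to fetch the remainder.
    Range(String),
    /// The partial file already holds every byte.
    Complete,
}

/// Picks a plan from the partial file length and the expected total length.
pub fn plan_from_len(partial_len: u64, total_len: u64) -> ResumePlan {
    if partial_len == 0 || partial_len > total_len {
        // Empty, or larger than expected and therefore not trustworthy.
        ResumePlan::Full
    } else if partial_len == total_len {
        ResumePlan::Complete
    } else {
        // Open-ended range: from the first missing byte to the end.
        ResumePlan::Range(format!("bytes={}-", partial_len))
    }
}

/// Reads the partial file's size from disk; a missing file means a full download.
pub fn plan_resume(partial: &Path, total_len: u64) -> io::Result<ResumePlan> {
    match fs::metadata(partial) {
        Ok(meta) => Ok(plan_from_len(meta.len(), total_len)),
        Err(err) if err.kind() == io::ErrorKind::NotFound => Ok(ResumePlan::Full),
        Err(err) => Err(err),
    }
}
```

This only works when bytes are written sequentially, so that the file length equals the number of valid bytes. A downloader that writes chunks out of order cannot rely on the file size alone.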

@Narsil
Collaborator

Narsil commented Aug 23, 2023

Would you be willing to write a PR for it @crlf0710 ?

@crlf0710
Author

crlf0710 commented Aug 24, 2023

> Would you be willing to write a PR for it @crlf0710 ?

I'd be glad to if there's some mentoring... I haven't familiarized myself with the code base yet.

@Narsil transferred this issue from huggingface/candle Aug 24, 2023
@Narsil
Collaborator

Narsil commented Aug 24, 2023

Transferred issue.

The code here is relatively simple; everything should be in src/api/sync.rs.

@benedikt-schaber

@crlf0710 If you have not already begun working on this and would not mind, I should be able to create a PR this weekend to introduce the same retry capabilities to sync that the tokio version already has.

@Narsil Regarding resumable downloads, could we create our own version of .incomplete (perhaps in the tmp folder)? We would use the etag for identification and create two files for each download: the partial file itself and a meta file containing the chunk size and the successfully downloaded chunks (as a 'checklist' to keep it async). We could then also assume that all successfully downloaded chunks are correct and just add the missing chunks.
We could rechunk the contiguous chunk sequences or require users to reuse the same chunk size.
Does this sound somewhat reasonable? Would such a feature be welcome? I could then also write the PR for this.
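
A rough sketch of the in-memory side of that checklist, assuming completed chunks are trusted as correct. `ChunkChecklist` and its methods are hypothetical names rather than existing hf-hub types, and reading or writing the meta file on disk is left out.

```rust
use std::ops::Range;

/// Hypothetical metadata kept next to a partial download.
#[derive(Debug, Clone, PartialEq)]
pub struct ChunkChecklist {
    /// Identifies which remote file the partial data belongs to.
    pub etag: String,
    pub total_len: u64,
    pub chunk_size: u64,
    /// `done[i]` is true once chunk `i` has been fully written.
    pub done: Vec<bool>,
}

impl ChunkChecklist {
    pub fn new(etag: &str, total_len: u64, chunk_size: u64) -> Self {
        assert!(chunk_size > 0, "chunk_size must be non-zero");
        // Round up so a short final chunk still gets its own slot.
        let chunks = ((total_len + chunk_size - 1) / chunk_size) as usize;
        Self {
            etag: etag.to_string(),
            total_len,
            chunk_size,
            done: vec![false; chunks],
        }
    }

    pub fn mark_done(&mut self, index: usize) {
        self.done[index] = true;
    }

    /// Byte ranges (end exclusive) still to fetch; adjacent missing chunks are merged.
    pub fn missing_ranges(&self) -> Vec<Range<u64>> {
        let mut ranges: Vec<Range<u64>> = Vec::new();
        for (i, finished) in self.done.iter().enumerate() {
            if *finished {
                continue;
            }
            let start = i as u64 * self.chunk_size;
            let end = (start + self.chunk_size).min(self.total_len);
            // Extend the previous range when this chunk directly follows it.
            let merged = match ranges.last_mut() {
                Some(last) if last.end == start => {
                    last.end = end;
                    true
                }
                _ => false,
            };
            if !merged {
                ranges.push(start..end);
            }
        }
        ranges
    }
}
```

Merging adjacent missing chunks into one range is one way to handle the rechunking question: a resumed download can then split each missing range with whatever chunk size it prefers.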

This was referenced Sep 8, 2023