Download multiple urls with download timeout #703

Open
vodkaslime opened this issue Sep 27, 2024 · 1 comment
Labels
documentation Docs in need of update or extension enhancement New feature or request

Comments

@vodkaslime

I'm trying to download multiple URLs with a download timeout.

I can download single URLs one by one with fetch_url while setting a download timeout (I'm not sure whether this is the best way to set it):

from trafilatura import fetch_url
from trafilatura.settings import use_config

config = use_config()
config.set("DEFAULT", "DOWNLOAD_TIMEOUT", "5")
downloaded = fetch_url(url, config=config)

However, when following the tutorial at https://trafilatura.readthedocs.io/en/latest/downloads.html:

from trafilatura.downloads import add_to_compressed_dict, buffered_downloads, load_download_buffer

# list of URLs
mylist = ['https://www.example.org', 'https://www.httpbin.org/html']
# number of threads to use
threads = 4

# convert the input list to the internal format
url_store = add_to_compressed_dict(mylist)
# processing loop
while url_store.done is False:
    bufferlist, url_store = load_download_buffer(url_store, sleep_time=5)
    # process downloads
    for url, result in buffered_downloads(bufferlist, threads):
        # do something here
        print(url)
        print(result)

I'm not sure how to add DOWNLOAD_TIMEOUT to each connection in this code. It would be great if anyone could help out.

Thanks

@adbar adbar added enhancement New feature or request documentation Docs in need of update or extension labels Oct 1, 2024
@adbar
Owner

adbar commented Oct 1, 2024

Hi @vodkaslime, indeed. It is not currently possible to pass a suitable argument to buffered_downloads, because there is a missing link between the config (older) and options (newer) formats.

The code and the docs are both impacted and both need to be updated.
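
Until buffered_downloads accepts such an argument, one possible workaround is to run the single-URL call from the question in a thread pool. The following is only a minimal sketch, not the library's own method: download_all and its fetch parameter are hypothetical names introduced here, and the function relies only on the Python standard library, so any callable that takes a URL and returns the downloaded content can be plugged in.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Optional


def download_all(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[str]],
    threads: int = 4,
) -> Dict[str, Optional[str]]:
    """Call fetch(url) for every URL in a thread pool and collect the results.

    `fetch` is a hypothetical parameter: any callable taking one URL.
    """
    results: Dict[str, Optional[str]] = {}
    with ThreadPoolExecutor(max_workers=threads) as executor:
        # map each pending future back to the URL it was submitted for
        future_to_url = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception:
                # treat any error raised by the fetch callable as a failed download
                results[url] = None
    return results
```

With the config object from the question, the callable could be built as functools.partial(fetch_url, config=config) and passed as fetch, so that each call uses the configured DOWNLOAD_TIMEOUT. Note that this sketch does not include the per-domain waiting that load_download_buffer applies through sleep_time, so it is probably only appropriate for short lists of URLs spread over different hosts.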
