
add task: download_from_url() to download a URL to a file #562

Merged: 10 commits into master from ct-add-task-get-from-web, Nov 6, 2024

Conversation

tomkinsc (Member) commented Oct 14, 2024

This adds a task to tasks_utils.wdl, download_from_url(), to download content from an individual URL to a file via wget. This task exists as a workaround until Terra supports this functionality natively (Cromwell already does).

This has the following inputs:

  • url_to_download: The URL to download; this is passed to wget. (required)
  • output_filename: The filename to use for the downloaded file. This is optional, though it can be helpful when the server does not suggest a filename via the 'Content-Disposition' HTTP response header. (optional)
  • additional_wget_opts: Additional options passed to wget as part of the download command. (optional)
  • request_method: The request method (GET, POST, etc.) passed to wget. (default: GET)
  • request_max_retries: The maximum number of (additional) retries to attempt in the event of a failed download. (optional)
  • md5_hash_expected: The (binary-mode) md5 hash expected for the downloaded file. If provided and the value does not match the md5 hash of the downloaded file, the task fails. Mutually exclusive with md5_hash_expected_file_url. (optional)
  • md5_hash_expected_file_url: The URL of a file containing the (binary-mode) md5 hash expected for the downloaded file. If provided and the value does not match the md5 hash of the downloaded file, the task fails. Mutually exclusive with md5_hash_expected. (optional)
  • save_response_header_to_file: If true, HTTP response headers will be saved to a separate output file. Only applicable for http[s] URLs. (optional)
  • disk_size: The size of the disk used for the instance downloading the file. (default: 50 GB)
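To make the mapping from inputs to the underlying download command concrete, here is a hedged sketch of how these inputs might translate into a wget invocation. The variable names and the use of `--method`/`--tries`/`-O` are illustrative assumptions, not the task's actual WDL bindings (note that wget's `--tries` counts the first attempt, so "max retries" becomes retries + 1):

```shell
#!/bin/bash
# Hypothetical sketch: build a wget command line from inputs resembling
# the task's. Names and flag choices are illustrative, not the real task.
URL_TO_DOWNLOAD="https://example.com/data.tar.gz"
OUTPUT_FILENAME="data.tar.gz"     # optional; empty lets the server name the file
REQUEST_METHOD="GET"
REQUEST_MAX_RETRIES=3
ADDITIONAL_WGET_OPTS="--no-verbose"

CMD=(wget --method="${REQUEST_METHOD}"
     --tries="$((REQUEST_MAX_RETRIES + 1))"
     ${ADDITIONAL_WGET_OPTS})
# Only pass -O when an explicit filename was given; otherwise the server's
# Content-Disposition header (or the URL basename) determines the filename.
if [ -n "${OUTPUT_FILENAME}" ]; then
  CMD+=(-O "${OUTPUT_FILENAME}")
fi
CMD+=("${URL_TO_DOWNLOAD}")
echo "${CMD[@]}"
```

The command is only echoed here so the sketch can be inspected without network access.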

Note: at present, this task only downloads a single file from a single URL. This is a design decision made for a few reasons:

  1. The output file name (and any name collision thereof) is not necessarily known in advance, since the server can specify it via the Content-Disposition response header. Returning a single file takes advantage of the separation of task outputs at runtime to avoid collisions.
  2. Parallel execution of the task in multiple jobs on multiple compute instances can, when bandwidth is limited on the requesting side, afford faster completion time.
  3. Parallel execution of the task in multiple jobs on multiple compute instances spreads request load across multiple originating IPs (sometimes helpful to avoid throttling).

Download a URL to a file. This task exists as a workaround until Terra supports this functionality natively
cromwell already supports this: https://cromwell.readthedocs.io/en/stable/filesystems/HTTP/
… to save http response headers

add file integrity checking to download_from_url by comparing against an md5 checksum provided as a string or via an additional URL, as well as an option to save http response headers
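The integrity check this commit describes can be sketched as follows. This is a minimal illustration of comparing an expected md5 string against the binary-mode hash of the downloaded file; file names are hypothetical, and in the real task the expected hash arrives via the md5_hash_expected input (or is fetched from md5_hash_expected_file_url):

```shell
#!/bin/bash
# Hypothetical sketch of the md5 integrity check; file names are illustrative.
printf 'hello world\n' > downloaded_file

# Stand-in for the expected hash that the task would receive as an input
# string or fetch from a URL; derived here from an identical reference copy.
printf 'hello world\n' > reference_copy
MD5_HASH_EXPECTED="$(md5sum -b reference_copy | cut -d' ' -f1)"

# Binary-mode hash of the downloaded file, compared against the expected value.
md5_actual="$(md5sum -b downloaded_file | cut -d' ' -f1)"
if [ "${md5_actual}" != "${MD5_HASH_EXPECTED}" ]; then
  echo "md5 mismatch: expected ${MD5_HASH_EXPECTED}, got ${md5_actual}" >&2
  exit 1
fi
echo "md5 OK"
```

Failing the task on mismatch (rather than merely warning) ensures a truncated or corrupted download never silently flows into downstream workflow steps.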
add read failure timeout to wget call in download_from_url; this causes wget to retry in the event a download hangs. This has been found to resolve an issue with some ftp downloads stalling at 100% without finalizing.
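The stall-handling behavior described above relies on wget's own timeout and retry flags. A hedged sketch, with illustrative values that are not necessarily the task's defaults: `--read-timeout` aborts a transfer whose data stream stalls for the given number of seconds, `--tries` re-attempts it, and `--continue` resumes from the partial file rather than starting over:

```shell
#!/bin/bash
# Hypothetical sketch of flags that make wget recover from a stalled
# transfer (e.g. ftp downloads hanging at 100%). Values are illustrative.
READ_TIMEOUT_SEC=60
TRIES=3
STALL_OPTS="--read-timeout=${READ_TIMEOUT_SEC} --tries=${TRIES} --continue"
# Echoed rather than executed so the sketch needs no network access.
echo "wget ${STALL_OPTS} <url>"
```

With these flags, a hang is converted into a timeout-triggered retry instead of an indefinitely stuck task.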
@tomkinsc tomkinsc changed the title add task: download_from_web() to download a URL to a file add task: download_from_url() to download a URL to a file Oct 28, 2024
@tomkinsc tomkinsc requested a review from dpark01 October 28, 2024 22:43
@tomkinsc tomkinsc merged commit e498352 into master Nov 6, 2024
14 checks passed
@tomkinsc tomkinsc deleted the ct-add-task-get-from-web branch November 6, 2024 22:03