From fc38a0d20685a3e371325521c7f7171b0a9cef6a Mon Sep 17 00:00:00 2001
From: Marcel Friedrichs
Date: Thu, 21 Jun 2018 12:12:16 +0200
Subject: [PATCH] docs: Add documentation and changelog from wiki

---
 CHANGELOG.md |  65 +++++++++++++++++
 README.md    | 195 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 260 insertions(+)
 create mode 100644 CHANGELOG.md

diff --git a/CHANGELOG.md b/CHANGELOG.md
new file mode 100644
index 0000000..fef7e5d
--- /dev/null
+++ b/CHANGELOG.md
@@ -0,0 +1,65 @@
+# Changelog
+
+## 1.7.0
+- TODO
+
+## 1.6.1
+- Dependency update: AWS SDK for Java 1.9.27 -> 1.10.16 (includes joda-time 2.8.1).
+
+## 1.6.0
+- Added mode `-c` to clean up the remains of interrupted multipart uploads.
+- Added option `--reduced-redundancy` to store uploads with reduced redundancy.
+- Dependency update: AWS SDK for Java 1.9.6 -> 1.9.27.
+
+## 1.5.2
+- Fix: As S3 has no concept of real folders, the AWS S3 Web Console simulates an empty folder by creating a zero-length file whose name ends in a trailing slash. These empty folders are now created on the receiving end when using recursive downloads.
+
+## 1.5.1
+- Dependency update: AWS SDK for Java 1.8.9.1 -> 1.9.6 (support for eu-central-1).
+- Dependency update: logback 1.0.13 -> 1.1.2.
+
+## 1.5.0
+- Added the ability to attach additional metadata to all uploads by using `-m/--metadata`.
+
+## 1.4.2
+- Important bug fix: Downloads using a Grid Download Feature Flag were using wrong offsets when writing to the output file.
+- AWS SDK for Java update: 1.4.5 -> 1.8.9.1.
+- FASTQ file format detection for FASTQ split downloads.
+
+## 1.4.1
+- Raised the logging threshold for streaming mode from WARN to ERROR.
+
+## 1.4.0
+- Important bug fix: Automatically resume interrupted chunk downloads.
+- Added retry for interrupted single-file downloads.
+- Added retry for failed S3 region list requests.
+- Default connection timeout is now 30 sec (was 5 min).
+
+## 1.3.3
+- Updated dependency 'logback' from 1.0.11 to 1.0.13 because of [LOGBACK-749](http://jira.qos.ch/browse/LOGBACK-749).
+- Enabled automatic retries for the initial metadata request that is issued before a download starts.
+
+## 1.3.2
+- Improved feedback for ambiguous S3 URLs (missing trailing slash).
+- Added option `--trace` for easier debugging.
+
+## 1.3.0
+- Added support for downloads via a pre-signed S3 HTTP URL. This works for both standard and grid downloads. (Thanks to Thomas!)
+
+## 1.2.1
+- Fixed a bug where the minimum chunk upload size restriction of 5 MB was also applied to downloads, although minimum download chunk sizes are not required by the S3 service.
+
+## 1.2.0
+- Added `--grid-download-feature-fastq`. This feature alters the offsets of the grid download splits so that FASTQ file parts remain valid. It also preserves paired-ends/mate-pairs in interleaved FASTQ files containing Illumina sequence identifiers.
+- AWS SDK for Java update: 1.4.3 -> 1.4.5.
+
+## 1.1.0
+- Added `--grid-download-feature-split`, which enables the grid download to save the data to independent smaller files (all identically named, but on different nodes).
+- Added `--session-token` for temporary authentication support.
+
+## 1.0.3
+- Added automatic retries for all chunk transfer operations to compensate for connection-related S3 failures (default: 6 retries).
+- Disabled INFO logging for the HTTP client logger of the AWS SDK for Java.
+
+## 1.0.2
+- Fixed a bug where the use of the STDIN file list would result in an attempt to upload an empty filename.
diff --git a/README.md b/README.md
index 967d775..395d605 100644
--- a/README.md
+++ b/README.md
@@ -2,6 +2,35 @@
 
 [![](https://jitpack.io/v/BiBiServ/bibis3.svg)](https://jitpack.io/#BiBiServ/bibis3)
 
+Amazon Simple Storage Service (S3) is a cloud storage service with some very interesting characteristics for storing
+large amounts of data. It offers virtually infinite scalability both in terms of storage space and transfer speed.
+Bioinformatics pipelines running on EC2 compute instances need fast access to the data stored in S3, but existing
+tools* for S3 data transfer do not exploit the full potential S3 has to offer. (*Evaluated tools were `s3cmd` and
+`boto` (original and modified versions, Mar 2013).)
+
+***BiBiS3*** is a command-line tool that attempts to close this gap by pushing transfer speeds both from and to S3
+to the limits of the underlying hardware. Additionally, ***BiBiS3*** is itself as scalable as Amazon S3, as it is
+capable of downloading different chunks of the same data to an arbitrary number of machines simultaneously. The key
+to maximum speed with S3 is massive, data-agnostic parallelization.
+
+The target can be a single machine, a Network File System (NFS) shared between multiple nodes, or the local
+filesystems of all nodes. Directories can be copied recursively, and ***BiBiS3*** maintains a stable number of
+parallel transfer threads regardless of the directory structure.
+
+In another scenario, where the parts of a single file are to be evenly distributed across multiple machines,
+***BiBiS3*** performs a split of the data (see the sketch below). For the FASTQ file format this split is even
+content-aware and preserves all FASTQ entries. A distributed download can be invoked e.g. via the Oracle Grid
+Engine (OGE), which is part of [BiBiGrid](https://github.com/BiBiServ/bibigrid).
+
+***Features***
+- ***Parallel*** transfer of multiple chunks of data.
+- ***Recursive*** transfer of directories with ***parallelization of multiple files***.
+- Simultaneous download ***via a cluster*** to e.g. a shared NFS, where each node only downloads a portion of the data.
+
+***Performance***
+On a single AWS instance we have seen download speeds of over 300 MByte/sec from S3. Using the distributed cluster
+download mode, ***BiBiS3*** achieves an aggregate throughput of more than 22 GByte/sec on 80 c3.8xlarge instances.
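+
+As a rough sketch of the distributed download idea described above (illustrative only; the actual chunk sizes and
+offsets are computed internally by ***BiBiS3***), a grid download assigns each node a distinct byte range of the
+same object:
+~~~BASH
+# Hypothetical example: divide a 10 GiB object evenly among 4 nodes.
+FILE_SIZE=$((10 * 1024 * 1024 * 1024))
+NODES=4
+CHUNK=$(( (FILE_SIZE + NODES - 1) / NODES ))  # ceiling division
+for NODE in $(seq 1 "$NODES"); do
+  START=$(( (NODE - 1) * CHUNK ))
+  END=$(( NODE * CHUNK - 1 ))
+  [ "$END" -ge "$FILE_SIZE" ] && END=$((FILE_SIZE - 1))  # clamp the last range
+  echo "node $NODE downloads bytes $START-$END"
+done
+~~~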
+
 ## Compile, Build & Package
 
 *Requirements: Java >= 8, Maven >= 3.3.9*
 
@@ -11,3 +40,169 @@
 > cd bibis3
 > mvn clean package
 ~~~
+
+## Setup
+***Credentials File***
+
+To get access to buckets that need proper authentication, create a `.properties` file called
+`.aws-credentials.properties` in your user home directory with the following content:
+~~~
+# your AWS access key
+accessKey=XXXXXXXXXXXXXXXX
+# your AWS secret key
+secretKey=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+~~~
+
+***Please note***: The Access Key and the Secret Key are highly sensitive information! Please make sure that the
+configuration file can only be read by you, e.g. by using the following command:
+`chmod 600 ~/.aws-credentials.properties`
+
+Alternatively, the credentials can be supplied via command-line parameters.
+
+## Command-Line Usage
+The basic commands follow the behavior of the Unix command `cp` as closely as possible.
+
+~~~BASH
+usage: java -jar bibis3-1.7.0.jar -u|d|g|c <SRC> <DEST>
+    --access-key <arg>               AWS Access Key.
+ -c,--clean-up-parts                 Clean up all unfinished parts of
+                                     previous multipart uploads that were
+                                     initiated on the specified bucket
+                                     over a week ago. BUCKET has to be an
+                                     S3 URL.
+    --chunk-size <arg>               Multipart chunk size in bytes.
+    --create-bucket                  Create bucket if nonexistent.
+ -d,--download                       Download files. SRC has to be an S3
+                                     URL.
+    --debug                          Debug mode.
+    --endpoint <arg>                 Endpoint for client authentication
+                                     (default: standard AWS endpoint).
+ -g,--download-url                   Download a file with HTTP GET from a
+                                     (pre-signed) S3 HTTP URL. SRC has to
+                                     be an HTTP URL with Range support
+                                     for HTTP GET.
+    --grid-current-node <arg>        Identifier of the node that is
+                                     running this program (must be
+                                     1 <= i <= grid-nodes).
+    --grid-download                  Download only a subset of all
+                                     chunks. This is useful for
+                                     downloading e.g. to a shared
+                                     filesystem via different machines
+                                     simultaneously.
+    --grid-download-feature-fastq    Download separate parts of a FASTQ
+                                     file to different nodes into
+                                     different files and make sure the
+                                     file splits preserve the FASTQ file
+                                     format.
+    --grid-download-feature-split    Download separate parts of a single
+                                     file to different nodes into
+                                     different files, all with the same
+                                     name. (--grid-download required)
+    --grid-nodes <arg>               Number of grid nodes.
+ -h,--help                           Help.
+ -m,--metadata <arg>                 Adds metadata to all uploads. Can be
+                                     specified multiple times for
+                                     additional metadata.
+ -q,--quiet                          Disable all log messages.
+ -r,--recursive                      Enable recursive transfer of a
+                                     directory.
+    --reduced-redundancy             Set the storage class for uploads to
+                                     Reduced Redundancy instead of
+                                     Standard.
+    --region <arg>                   S3 region. For AWS has to be one of:
+                                     ap-south-1, eu-west-3, eu-west-2,
+                                     eu-west-1, ap-northeast-2,
+                                     ap-northeast-1, ca-central-1,
+                                     sa-east-1, cn-north-1,
+                                     us-gov-west-1, ap-southeast-1,
+                                     ap-southeast-2, eu-central-1,
+                                     us-east-1, us-east-2, us-west-1,
+                                     cn-northwest-1, us-west-2
+                                     (default: us-east-1).
+    --secret-key <arg>               AWS Secret Key.
+    --session-token <arg>            AWS Session Token.
+    --streaming-download             Run single-threaded download and
+                                     send special progress info to
+                                     STDOUT.
+ -t,--threads <arg>                  Number of parallel threads to use
+                                     (default: 50).
+    --trace                          Extended debug mode.
+ -u,--upload                         Upload files. DEST has to be an S3
+                                     URL.
+    --upload-list-stdin              Take the list of files to upload
+                                     from STDIN. In this case the SRC
+                                     argument has to be omitted.
+ -v,--version                        Version.
+S3 URLs have to be in the form 's3://<bucket>/<path>', e.g.
+'s3://mybucket/mydatafolder/data.txt'. When using recursive transfer (-r)
+the trailing slash of the directory is mandatory, e.g.
+'s3://mybucket/mydatafolder/'.
+~~~
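+
+***Download via a pre-signed URL (sketch):*** the `-g` mode fetches a file through a plain HTTP URL. The URL below
+is a placeholder; one way to create a pre-signed URL is the AWS CLI (a separate tool, not part of ***BiBiS3***):
+~~~BASH
+# Create a pre-signed URL valid for one hour (placeholder bucket and key):
+URL=$(aws s3 presign s3://mybucket/somedir/myfile.tgz --expires-in 3600)
+# Download it with BiBiS3 via ranged HTTP GET requests:
+java -jar bibis3.jar -g "$URL" myfile.tgz
+~~~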
+
+## Basic Examples
+***Upload of a single file from the local directory to S3:***
+~~~BASH
+java -jar bibis3.jar -u myfile.tgz s3://mybucket/somedir/
+~~~
+
+***Download of a single file from S3 to the current directory:***
+~~~BASH
+java -jar bibis3.jar -d s3://mybucket/somedir/myfile.tgz .
+~~~
+
+***Download of a directory from S3 to a local directory called 'mydir' using 20 threads:***
+~~~BASH
+java -jar bibis3.jar -t 20 -r -d s3://mybucket/somedir/ mydir
+~~~
+
+Note the trailing slash of the S3 URL, which is required in addition to the `-r` option for the recursive transfer
+of a directory.
+
+***Example shell script for the simultaneous download of all the contents of an S3 directory via a cluster:***
+
+***simultaneous-download.sh:***
+~~~BASH
+#!/bin/bash
+java -jar bibis3.jar \
+--access-key "XXXXXXX" \
+--secret-key "XXXXXXXXXXXX" \
+--region eu-west-1 \
+--grid-download \
+--grid-nodes "$1" \
+--grid-current-node "$SGE_TASK_ID" \
+-d s3://mybucket/mydir/ targetdir
+~~~
+which could be run within an SGE/OGE cluster with 5 nodes (4 cores each) as follows:
+~~~BASH
+qsub -pe multislot 4 -t 1-5 simultaneous-download.sh 5
+~~~
+
+The parameter `-pe multislot 4` ensures that the array job is equally distributed among the nodes (leading to
+exactly one task per node).
+
+The targetdir is usually located inside a shared filesystem (e.g. NFS). However, if `--grid-download-feature-split`
+is enabled, the targetdir has to be local to each node.
+
+## Grid Download Feature Flags
+***Grid Download Feature Flags*** can be used in addition to `--grid-download`. When one of these flags is supplied,
+the file parts are saved to different files on different machines. Additionally, these flags can be used to force
+specific split positions for individual file types. Grid Download Feature Flags ***cannot be combined***; only the
+last one supplied takes effect.
+
+***Split:***
+
+`--grid-download-feature-split`
+
+Splits at arbitrary positions.
+
+***FASTQ:***
+
+`--grid-download-feature-fastq`
+
+Preserves FASTQ entries as well as paired-ends/mate-pairs for files using Illumina sequence identifiers.
+
+## Cleanup
+The Amazon S3 documentation says:
+
+> "Once you initiate a multipart upload, Amazon S3 retains all the parts until you either complete or abort the upload. Throughout its lifetime, you are billed for all storage, bandwidth, and requests for this multipart upload and its associated parts."
+
+When an upload encounters a fatal error, the upload is neither completed nor aborted. Already uploaded multipart
+chunks remain in S3 but are invisible to the user. Therefore, it is recommended to clean up interrupted multipart
+uploads periodically.
+
+***Clean up the remains of interrupted multipart uploads for the bucket 'mybucket' that were initiated more than 7 days ago:***
+~~~BASH
+java -jar bibis3.jar -c s3://mybucket/
+~~~
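+
+To inspect which unfinished multipart uploads exist before or after a cleanup, the AWS CLI (a separate tool, not
+part of ***BiBiS3***) can list them for a bucket; `mybucket` is a placeholder:
+~~~BASH
+# List all multipart uploads that were initiated but never completed or aborted:
+aws s3api list-multipart-uploads --bucket mybucket
+~~~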