Commit fc38a0d: "docs: Add documentation and changelog from wiki"
AstrorEnales committed Jun 21, 2018 (1 parent: 1dd58d2)
Showing 2 changed files with 260 additions and 0 deletions.

CHANGELOG.md
# Changelog

## 1.7.0
- TODO

## 1.6.1
- Dependency update: AWS SDK for Java 1.9.27 -> 1.10.16 (includes joda time 2.8.1)

## 1.6.0
- Added mode `-c` to clean up the remains of interrupted multipart uploads.
- Added option `--reduced-redundancy` to store uploads with reduced redundancy.
- Dependency update: AWS SDK for Java 1.9.6 -> 1.9.27

## 1.5.2
- Fix: As S3 has no concept of real folders, AWS S3 Web Console simulates an empty folder by creating a zero-length file with the filename ending on a trailing slash. From now on these empty folders are created on the receiving end when using recursive downloads.

## 1.5.1
- Dependency update: AWS SDK for Java 1.8.9.1 -> 1.9.6 (support for eu-central-1)
- Dependency update: logback 1.0.13 -> 1.1.2

## 1.5.0
- Added the ability to add additional metadata to all uploads by using `-m/--metadata <key> <value>`.

## 1.4.2
- Important bug-fix: Downloads using a Grid Download Feature Flag used wrong offsets when writing to the output file.
- AWS SDK for Java update: 1.4.5 -> 1.8.9.1
- FASTQ file format detection for FASTQ split downloads.

## 1.4.1
- Raised the logging threshold for streaming mode from WARN to ERROR.

## 1.4.0
- Important bug-fix: Automatically resume interrupted chunk downloads.
- Added retry for interrupted single file downloads.
- Added retry for failed S3 region list request.
- Default connection timeout is now 30 sec (was 5 min).

## 1.3.3
- Updated dependency 'logback' from 1.0.11 -> 1.0.13 because of bug [LOGBACK-749](http://jira.qos.ch/browse/LOGBACK-749).
- Enabled automatic retries for initial metadata request that is issued before download action occurs.

## 1.3.2
- Improved feedback for ambiguous S3 URL (missing trailing slash).
- Added option `--trace` for easier debugging.

## 1.3.0
- Added support for the download via a pre-signed S3 HTTP URL. This works for both standard and grid downloads. (Thanks to Thomas!)

## 1.2.1
- Fixed a bug where the minimum chunk size restriction of 5 MB for multipart uploads was also applied to downloads, even though the S3 service does not require a minimum download chunk size.

## 1.2.0
- Added `--grid-download-feature-fastq`. This feature alters the offsets of the grid download splits so that FASTQ file parts remain valid. It also preserves paired-ends/mate-pairs in interleaved FASTQ files containing Illumina sequence identifiers.
- AWS SDK for Java update: 1.4.3 -> 1.4.5

## 1.1.0
- Added `--grid-download-feature-split` which enables the grid download to save the data to independent smaller files (all identically named but on different nodes)
- Added `--session-token` for temporary authentication support

## 1.0.3
- Added automatic retries for all chunk transfer operations to compensate for connection-related S3 failures (default: 6 retries)
- Disabled INFO logging for the HTTP Client logger of the AWS Java SDK

## 1.0.2
- Fixed a bug where the use of the STDIN file list would result in the attempt to upload an empty filename
README.md

[![](https://jitpack.io/v/BiBiServ/bibis3.svg)](https://jitpack.io/#BiBiServ/bibis3)

Amazon Simple Storage Service (S3) is a cloud storage service with some very interesting characteristics for storing
large amounts of data. It has virtually infinite scalability both in terms of storage space and transfer speed.
Bioinformatics pipelines running on EC2 compute instances need fast access to the data stored in S3 but existing tools*
for S3 data transfer do not use the full potential S3 has to offer. (*Evaluated tools were `s3cmd` and `boto` (original
and modified versions, Mar 2013))

***BiBiS3*** is a command line tool that attempts to close this gap by pushing the transfer speeds both from and to S3
to the limits of the underlying hardware. Additionally, ***BiBiS3*** is itself as scalable as Amazon's S3, since it can
download different chunks of the same data to an arbitrary number of machines simultaneously. The key
to maximum speed using S3 is massive, data-agnostic parallelization.

The targets can either be a single machine, a shared Network File System (NFS) between multiple nodes or the
filesystems of all the nodes. Directories can be copied recursively while ***BiBiS3*** maintains a stable count of
parallel transfer threads regardless of the directory structure.

In another scenario, where the parts of a single file are to be evenly distributed across multiple machines,
***BiBiS3*** performs a split of the data. In case of the FASTQ file format this split is even content-aware
and preserves all FASTQ entries. A distributed download can be invoked e.g. via the Oracle Grid Engine (OGE) which
is part of [BiBiGrid](https://github.com/BiBiServ/bibigrid).

***Features***
- ***Parallel*** transfer of multiple chunks of data.
- ***Recursive*** transfer of directories with ***parallelization of multiple files***.
- Simultaneous download ***via a cluster*** to e.g. a shared NFS where each node only downloads a portion of the data.

***Performance***
On a single AWS instance we have seen download speeds of over 300 MByte/sec from S3. Using the distributed cluster
download mode ***BiBiS3*** downloads show an aggregate throughput of more than 22 GByte/sec on 80 c3.8xlarge instances.

## Compile, Build & Package

*Requirements: Java >= 8, Maven >= 3.3.9*
~~~
> cd bibis3
> mvn clean package
~~~

## Setup
***Credentials File***

To get access to buckets that need proper authentication, create a .properties file called
`.aws-credentials.properties` in your user home directory with the following content:
~~~
# your AWS access key
accessKey=XXXXXXXXXXXXXXXX
# your AWS secret key
secretKey=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
~~~

***Please note***: The Access Key and Secret Key are very sensitive information! Please make sure that the configuration file can only be read by you! E.g. use the following command: `chmod 600 ~/.aws-credentials.properties`

Alternatively the credentials can also be supplied via command-line parameters.
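A hedged sketch of supplying the credentials via the `--access-key` and `--secret-key` options instead (the keys, bucket, and file name are placeholders):
~~~BASH
# Placeholder credentials and S3 URL: replace with your own values.
java -jar bibis3.jar \
    --access-key "XXXXXXXXXXXXXXXX" \
    --secret-key "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" \
    -d s3://mybucket/somedir/myfile.tgz .
~~~

Note that keys passed on the command line may end up in your shell history, so the credentials file is the safer option.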

## Command-Line Usage
The basic commands follow the behavior of the Unix command `cp` as closely as possible.

~~~BASH
usage: java -jar bibis3-1.7.0.jar -u|d|g|c SRC DEST
--access-key <arg> AWS Access Key.
-c,--clean-up-parts Clean up all unfinished parts of
previous multipart uploads that were
initiated on the specified bucket over
a week ago. BUCKET has to be an S3
URL.
--chunk-size <arg> Multipart chunk size in Bytes.
--create-bucket Create bucket if nonexistent.
-d,--download Download files. SRC has to be an S3
URL.
--debug Debug mode.
--endpoint <arg> Endpoint for client authentication
(default: standard AWS endpoint).
-g,--download-url Download a file with Http GET from
(pre-signed) S3-Http-Url. SRC has to
be an Http-URL with Range support for
Http GET.
--grid-current-node <arg> Identifier of the node that is running
this program (must be 1 <= i <=
grid-nodes).
--grid-download Download only a subset of all chunks.
This is useful for downloading e. g.
to a shared filesystem via different
machines simultaneously.
--grid-download-feature-fastq Download separate parts of a fastq
file to different nodes into different
files and make sure the file splits
conserve the fastq file format.
--grid-download-feature-split Download separate parts of a single
file to different nodes into different
files all with the same name.
(--grid-download required)
--grid-nodes <arg> Number of grid nodes.
-h,--help Help.
-m,--metadata <key> <value> Adds metadata to all uploads. Can be
specified multiple times for
additional metadata.
-q,--quiet Disable all log messages.
-r,--recursive Enable recursive transfer of a
directory.
--reduced-redundancy Set the storage class for uploads to
Reduced Redundancy instead of
Standard.
--region <arg> S3 region. For AWS has to be one of:
ap-south-1, eu-west-3, eu-west-2,
eu-west-1, ap-northeast-2,
ap-northeast-1, ca-central-1,
sa-east-1, cn-north-1, us-gov-west-1,
ap-southeast-1, ap-southeast-2,
eu-central-1, us-east-1, us-east-2,
us-west-1, cn-northwest-1, us-west-2
(default: us-east-1).
--secret-key <arg> AWS Secret Key.
--session-token <arg> AWS Session Token.
--streaming-download Run single threaded download and send
special progress info to STDOUT.
-t,--threads <arg> Number of parallel threads to use
(default: 50).
--trace Extended debug mode.
-u,--upload Upload files. DEST has to be an S3
URL.
--upload-list-stdin Take list of files to upload from
STDIN. In this case the SRC argument
has to be omitted.
-v,--version Version.
S3 URLs have to be in the form of: 's3://<bucket>/<key>', e.g.
's3://mybucket/mydatafolder/data.txt'. When using recursive transfer (-r)
the trailing slash of the directory is mandatory, e.g.
's3://mybucket/mydatafolder/'.
~~~

## Basic Examples
***Upload of a single file from the local directory to S3:***
~~~BASH
java -jar bibis3.jar -u myfile.tgz s3://mybucket/somedir/
~~~
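***Upload of a single file with additional metadata:*** the `-m/--metadata <key> <value>` option can be given multiple times; the keys and values below are hypothetical placeholders:
~~~BASH
# Attaches two metadata key/value pairs (placeholder names) to the upload.
java -jar bibis3.jar -u -m project demo -m sample run42 myfile.tgz s3://mybucket/somedir/
~~~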

***Download of a single file from S3 to the current directory:***
~~~BASH
java -jar bibis3.jar -d s3://mybucket/somedir/myfile.tgz .
~~~

***Download of a directory from S3 to a local directory called 'mydir' using 20 threads:***
~~~BASH
java -jar bibis3.jar -t 20 -r -d s3://mybucket/somedir/ mydir
~~~

Attention should be paid to the trailing slash of the S3 URL which in addition to the `-r` option is required
for the recursive transfer of a directory.
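***Upload of a file list taken from STDIN:*** a sketch using `--upload-list-stdin` (the `find` pattern and bucket are placeholder assumptions). Note that in this mode the SRC argument is omitted:
~~~BASH
# Pipe a list of local files to upload; only the DEST S3 URL is given.
find data/ -name '*.fastq' | java -jar bibis3.jar --upload-list-stdin -u s3://mybucket/data/
~~~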

***Example shell script for the simultaneous download of all the contents of an S3 directory via a cluster:***

***simultaneous-download.sh:***
~~~BASH
#!/bin/bash
java -jar bibis3.jar \
--access-key "XXXXXXX" \
--secret-key "XXXXXXXXXXXX" \
--region eu-west-1 \
--grid-download \
--grid-nodes "$1" \
--grid-current-node "$SGE_TASK_ID" \
-d s3://mybucket/mydir/ targetdir
~~~
which could be run within an SGE/OGE with 5 nodes (4 cores each) as follows:
~~~BASH
qsub -pe multislot 4 -t 1-5 simultaneous-download.sh 5
~~~

The parameter `-pe multislot 4` ensures that the array job is equally distributed among the nodes (leading to only
one task per node).

The `targetdir` is usually located inside a shared filesystem (e.g. NFS). However, if `--grid-download-feature-split`
is enabled, then the `targetdir` has to be local for each node.

## Grid Download Feature Flags
***Grid Download Feature Flags*** can be used in addition to `--grid-download`. When supplying one of these flags, the
file parts are saved to different files on different machines. Additionally these flags can be used to force a specific
split position for individual file types. Grid Download Feature Flags ***cannot be combined***. Only the last one
supplied will be in effect.

***Split:***

`--grid-download-feature-split`

Splits arbitrarily.

***FASTQ:***

`--grid-download-feature-fastq`

Preserves FASTQ entries as well as paired-ends/mate-pairs for files using Illumina sequence identifiers.
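As a hedged sketch (bucket, file name, and node count are placeholders), a FASTQ-preserving split download could be invoked on each node like this, analogous to the SGE script above:
~~~BASH
# Each node downloads its share of the FASTQ file into a local targetdir;
# split positions are adjusted so every part remains a valid FASTQ file.
java -jar bibis3.jar \
    --grid-download \
    --grid-download-feature-fastq \
    --grid-nodes 5 \
    --grid-current-node "$SGE_TASK_ID" \
    -d s3://mybucket/reads/sample.fastq targetdir
~~~

Since a feature flag is in effect, `targetdir` must be local to each node rather than on a shared filesystem.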

## Cleanup
The Amazon S3 documentation says:

> "Once you initiate a multipart upload, Amazon S3 retains all the parts until you either complete or abort the upload. Throughout its lifetime, you are billed for all storage, bandwidth, and requests for this multipart upload and its associated parts."

When an upload encounters a fatal error, the upload gets neither completed nor aborted. Already uploaded multipart
chunks remain in S3 but are invisible to the user. Therefore, it is recommended to clean up interrupted multipart
uploads periodically.

***Clean up the remains of interrupted multipart uploads for the bucket 'mybucket' that were initiated more than 7 days ago:***
~~~BASH
java -jar bibis3.jar -c s3://mybucket/
~~~
