Dario Mapelli edited this page May 24, 2024 · 17 revisions

Using CERN S3 for CRAB

This is a high-level description of the plan for integrating the CERN S3 storage service into CRAB.

Initially we will use S3 to replace CRABCache, as described in the wiki page: CRABCache replacement with S3

Later on we can look into extending its use to store and serve all logs which are now accessed on schedds via HTTP. Other uses may come up later.

Generalities

S3 allows us to create a storage container via OpenStack. We ask for a storage quota in our OpenStack project; the size has to be negotiated with CERN of course, but several TBs are not an issue. This storage is part of CEPH and I/O is limited since there is some kind of gateway. It is not like EOS, where the server runs on the same host that has the disks and can therefore provide GB/s (as Dan VanDerSteer said). Access to this storage is via the same interface as Amazon Web Services S3 (of course).

NOTE: with OpenStack we create storage containers and give them a name. We can later access those with the boto client, where those containers are mapped to "buckets" with the same name.

For our OpenStack project we can get a set of keys that we need to keep secret. Using those we can create buckets inside our storage container and manage objects (files) in there. The key holder can decide if a bucket (or an object in a bucket) is public or private. The key holder can also create pre-signed URLs with an expiration time to make it easy for clients to e.g. upload objects to a bucket.
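A minimal sketch of this, assuming boto3 is installed and the keys are available via AWS_SHARED_CREDENTIALS_FILE; the bucket and object names here are made up for illustration:

```python
def object_url(endpoint: str, bucket: str, key: str) -> str:
    """The plain HTTP URL an object gets on the S3 gateway (usable
    without authentication only if the object is public)."""
    return f"{endpoint.rstrip('/')}/{bucket}/{key.lstrip('/')}"


def main():
    import boto3  # imported here so the helper above works without boto3

    conn = boto3.client("s3", endpoint_url="https://s3.cern.ch")
    # Create a (hypothetical) bucket inside our storage container,
    # then upload an object; ACL decides public vs private.
    conn.create_bucket(Bucket="crabcache-dev")
    conn.put_object(Bucket="crabcache-dev", Key="test/hello.txt",
                    Body=b"hello", ACL="public-read")  # or ACL="private"
    print(object_url("https://s3.cern.ch", "crabcache-dev", "test/hello.txt"))


if __name__ == "__main__":
    main()
```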

Note: even if we create multiple buckets, we manage them all with the same set of keys and they must all fit into the same overall storage quota. We cannot set size limits on individual buckets unless we put them in different OpenStack projects.

Objects can be stored as /dir1/dir2/../filename and those dirs can be used to list and count things, so we can keep track of usage.
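For example, object keys can be grouped by their first path component to track usage. A sketch, assuming boto3 and a hypothetical bucket name:

```python
from collections import Counter


def count_by_top_dir(keys):
    """Group object keys like 'dir1/dir2/file' by their first path
    component, returning a Counter of objects per top-level 'dir'."""
    return Counter(k.split("/", 1)[0] for k in keys)


def main():
    import boto3  # kept here so the helper above is importable without boto3

    conn = boto3.client("s3", endpoint_url="https://s3.cern.ch")
    # Paginate over the bucket listing (list_objects_v2 returns at most
    # 1000 keys per call) and count objects per top-level "directory".
    keys = []
    for page in conn.get_paginator("list_objects_v2").paginate(Bucket="crabcache-dev"):
        keys += [obj["Key"] for obj in page.get("Contents", [])]
    print(count_by_top_dir(keys))


if __name__ == "__main__":
    main()
```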

More details and how-tos tailored to our case are in this document from Prajesh (originally on Google Docs here).

We decided to use S3 via the boto3 Python client, with Python 3 on SL7. There are already RPMs for this in cmsdist/comp, and it also comes as a dependency of the Rucio client.

Example:

$ # prepare a file with the secrets:
$ cat credentials
[default]
aws_access_key_id = <put your key here, without quotes>
aws_secret_access_key = <put your secret key here, without quotes>
$ export AWS_SHARED_CREDENTIALS_FILE=credentials

Then in python

import boto3
endpoint='https://s3.cern.ch'
conn = boto3.client('s3', endpoint_url=endpoint)
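With that client, a basic upload/download round trip might look like the following sketch; the bucket name and key layout are hypothetical examples:

```python
import posixpath


def make_key(*parts: str) -> str:
    """Join path components into an S3 object key like 'dir1/dir2/file'."""
    return posixpath.join(*(p.strip("/") for p in parts))


def main():
    import boto3  # imported here so make_key is usable without boto3

    conn = boto3.client("s3", endpoint_url="https://s3.cern.ch")
    key = make_key("logs", "task123", "job.log")  # -> "logs/task123/job.log"
    # Upload a small object, then read it back.
    conn.put_object(Bucket="crabcache-dev", Key=key, Body=b"some log text")
    obj = conn.get_object(Bucket="crabcache-dev", Key=key)
    print(obj["Body"].read())


if __name__ == "__main__":
    main()
```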

Permissions/Protections

AWS has a sophisticated Identity and Access Management (IAM) tool, but the CERN implementation has nothing like that at the moment. We only have:

  • public objects: they are so public that not even CERN SSO is required to access them
  • private objects: they are so private that the master keys are needed
  • and for anything else, there are pre-signed URLs
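A pre-signed URL can be generated by the key holder and handed to a client, which can then upload (or download) without any keys until the URL expires. A sketch, assuming boto3 and hypothetical bucket/key names:

```python
from datetime import timedelta


def expiry_seconds(td: timedelta) -> int:
    """S3 pre-signed URLs take an integer lifetime in seconds."""
    return int(td.total_seconds())


def main():
    import boto3  # imported here so the helper above works without boto3

    conn = boto3.client("s3", endpoint_url="https://s3.cern.ch")
    # A time-limited upload URL: anyone holding it can PUT this one
    # object until the URL expires, with no keys needed on their side.
    url = conn.generate_presigned_url(
        "put_object",
        Params={"Bucket": "crabcache-dev", "Key": "uploads/sandbox.tar.gz"},
        ExpiresIn=expiry_seconds(timedelta(hours=1)),
    )
    print(url)


if __name__ == "__main__":
    main()
```

Using "get_object" instead of "put_object" yields a time-limited download URL in the same way.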

Contact and Support

CERN IT Storage group has opened a dedicated Mattermost channel for us

A few useful links: