HADOOP-19057. S3A: Landsat bucket used in tests no longer accessible (#6515)


The AWS landsat data previously used in some S3A tests is no
longer accessible

This PR moves to the new external file
s3a://noaa-cors-pds/raw/2024/001/akse/AKSE001x.24_.gz

* Large enough file for scale tests
* Bucket supports anonymous access
* Ends in .gz to keep codec tests happy
* No spaces in path to keep bucket-info happy

Test Code Changes
* Leaves the test key name alone: fs.s3a.scale.test.csvfile
* Renames all methods and fields to remove "csv" from their names, moving
  to "external file"; we no longer require the file to be CSV.
* Path definition and helper methods have been moved to PublicDatasetTestUtils
* Improves error reporting in ITestS3AInputStreamPerformance if the file
  is too short
  
With S3 Select removed, there is no need for the file to be
a CSV file; there is a test which tries to unzip it; other
tests have a minimum file size.

Consult the JIRA for the settings to add to auth-keys.xml
to switch earlier builds to this same file.
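
For reference, a sketch of what those settings are likely to look like
(the JIRA is authoritative; the region value is taken from the updated
testing docs below):

```xml
<!-- a sketch only; consult HADOOP-19057 for the authoritative settings -->
<property>
  <name>fs.s3a.scale.test.csvfile</name>
  <value>s3a://noaa-cors-pds/raw/2024/001/akse/AKSE001x.24_.gz</value>
</property>
<property>
  <name>fs.s3a.bucket.noaa-cors-pds.endpoint.region</name>
  <value>us-east-1</value>
</property>
```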

Contributed by Steve Loughran
steveloughran authored Feb 13, 2024
1 parent 5cbe52f commit 7651afd
Showing 30 changed files with 362 additions and 298 deletions.
@@ -585,7 +585,7 @@ If an operation fails with an `AccessDeniedException`, then the role does not have
the permission for the S3 Operation invoked during the call.

```
-> hadoop fs -touch s3a://landsat-pds/a
+> hadoop fs -touch s3a://noaa-isd-pds/a
java.nio.file.AccessDeniedException: a: Writing Object on a:
software.amazon.awssdk.services.s3.model.S3Exception: Access Denied
@@ -111,9 +111,9 @@ Specific buckets can have auditing disabled, even when it is enabled globally.

```xml
<property>
-<name>fs.s3a.bucket.landsat-pds.audit.enabled</name>
+<name>fs.s3a.bucket.noaa-isd-pds.audit.enabled</name>
<value>false</value>
-<description>Do not audit landsat bucket operations</description>
+<description>Do not audit bucket operations</description>
</property>
```

@@ -342,9 +342,9 @@ either globally or for specific buckets:
</property>

<property>
-<name>fs.s3a.bucket.landsat-pds.audit.referrer.enabled</name>
+<name>fs.s3a.bucket.noaa-isd-pds.audit.referrer.enabled</name>
<value>false</value>
-<description>Do not add the referrer header to landsat operations</description>
+<description>Do not add the referrer header to operations</description>
</property>
```

@@ -747,7 +747,7 @@ For example, for any job executed through Hadoop MapReduce, the Job ID can be used
### `Filesystem does not have support for 'magic' committer`

```
-org.apache.hadoop.fs.s3a.commit.PathCommitException: `s3a://landsat-pds': Filesystem does not have support for 'magic' committer enabled
+org.apache.hadoop.fs.s3a.commit.PathCommitException: `s3a://noaa-isd-pds': Filesystem does not have support for 'magic' committer enabled
in configuration option fs.s3a.committer.magic.enabled
```

@@ -760,42 +760,15 @@ Remove all global/per-bucket declarations of `fs.s3a.bucket.magic.enabled` or set them to `true`:

```xml
<property>
-<name>fs.s3a.bucket.landsat-pds.committer.magic.enabled</name>
+<name>fs.s3a.bucket.noaa-isd-pds.committer.magic.enabled</name>
<value>true</value>
</property>
```

Tip: you can verify that a bucket supports the magic committer through the
-`hadoop s3guard bucket-info` command:
+`hadoop s3guard bucket-info` command.


-```
-> hadoop s3guard bucket-info -magic s3a://landsat-pds/
-Location: us-west-2
-S3A Client
-Signing Algorithm: fs.s3a.signing-algorithm=(unset)
-Endpoint: fs.s3a.endpoint=s3.amazonaws.com
-Encryption: fs.s3a.encryption.algorithm=none
-Input seek policy: fs.s3a.experimental.input.fadvise=normal
-Change Detection Source: fs.s3a.change.detection.source=etag
-Change Detection Mode: fs.s3a.change.detection.mode=server
-S3A Committers
-The "magic" committer is supported in the filesystem
-S3A Committer factory class: mapreduce.outputcommitter.factory.scheme.s3a=org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
-S3A Committer name: fs.s3a.committer.name=magic
-Store magic committer integration: fs.s3a.committer.magic.enabled=true
-Security
-Delegation token support is disabled
-Directory Markers
-The directory marker policy is "keep"
-Available Policies: delete, keep, authoritative
-Authoritative paths: fs.s3a.authoritative.path=
-```
### Error message: "File being created has a magic path, but the filesystem has magic file support disabled"

A file is being written to a path which is used for "magic" files,
@@ -284,14 +284,13 @@ a bucket.
The up-to-date list of regions is [available online](https://docs.aws.amazon.com/general/latest/gr/s3.html).

This list can be used to specify the endpoint of individual buckets, for example
-for buckets in the central and EU/Ireland endpoints.
+for buckets in the us-west-2 and EU/Ireland endpoints.


```xml
<property>
-<name>fs.s3a.bucket.landsat-pds.endpoint.region</name>
+<name>fs.s3a.bucket.us-west-2-dataset.endpoint.region</name>
<value>us-west-2</value>
-<description>The region for s3a://landsat-pds URLs</description>
</property>

<property>
@@ -354,9 +353,9 @@ The boolean option `fs.s3a.endpoint.fips` (default `false`) switches the S3A connector to using FIPS endpoints.
For a single bucket:
```xml
<property>
-<name>fs.s3a.bucket.landsat-pds.endpoint.fips</name>
+<name>fs.s3a.bucket.noaa-isd-pds.endpoint.fips</name>
<value>true</value>
-<description>Use the FIPS endpoint for the landsat dataset</description>
+<description>Use the FIPS endpoint for the NOAA dataset</description>
</property>
```

@@ -188,7 +188,7 @@ If it was deployed unbonded, the DT Binding is asked to create a new DT.

It is up to the binding what it includes in the token identifier, and how it obtains them.
This new token identifier is included in a token which has a "canonical service name" of
the URI of the filesystem (e.g "s3a://landsat-pds").
the URI of the filesystem (e.g "s3a://noaa-isd-pds").

The issued/reissued token identifier can be marshalled and reused.
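
As an illustration of that flow (a sketch, not part of this change; the
bucket and renewer below are placeholders), the canonical service name
and a token can be requested through the public `FileSystem` API:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.token.Token;

public class ShowCanonicalServiceName {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new URI("s3a://noaa-isd-pds/"), conf);
    // The token's "canonical service name" is the filesystem URI.
    System.out.println("service = " + fs.getCanonicalServiceName());
    // Returns null unless an S3A delegation token binding is configured.
    Token<?> token = fs.getDelegationToken("yarn");
    System.out.println("token   = " + token);
  }
}
```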

@@ -481,8 +481,8 @@ This will fetch the token and save it to the named file (here, `tokens.bin`),
even if Kerberos is disabled.

```bash
-# Fetch a token for the AWS landsat-pds bucket and save it to tokens.bin
-$ hdfs fetchdt --webservice s3a://landsat-pds/ tokens.bin
+# Fetch a token for the AWS noaa-isd-pds bucket and save it to tokens.bin
+$ hdfs fetchdt --webservice s3a://noaa-isd-pds/ tokens.bin
```

If the command fails with `ERROR: Failed to fetch token` it means the
@@ -498,11 +498,11 @@ host on which it was created.
```bash
$ bin/hdfs fetchdt --print tokens.bin

-Token (S3ATokenIdentifier{S3ADelegationToken/Session; uri=s3a://landsat-pds;
+Token (S3ATokenIdentifier{S3ADelegationToken/Session; uri=s3a://noaa-isd-pds;
timestamp=1541683947569; encryption=EncryptionSecrets{encryptionMethod=SSE_S3};
Created on vm1.local/192.168.99.1 at time 2018-11-08T13:32:26.381Z.};
Session credentials for user AAABWL expires Thu Nov 08 14:02:27 GMT 2018; (valid))
-for s3a://landsat-pds
+for s3a://noaa-isd-pds
```
The "(valid)" annotation means that the AWS credentials are considered "valid":
there is both a username and a secret.
@@ -513,11 +513,11 @@ If delegation support is enabled, it also prints the current
hadoop security level.

```bash
-$ hadoop s3guard bucket-info s3a://landsat-pds/
+$ hadoop s3guard bucket-info s3a://noaa-isd-pds/

-Filesystem s3a://landsat-pds
+Filesystem s3a://noaa-isd-pds
Location: us-west-2
-Filesystem s3a://landsat-pds is not using S3Guard
+Filesystem s3a://noaa-isd-pds is not using S3Guard
The "magic" committer is not supported

S3A Client
@@ -314,9 +314,8 @@ All releases of Hadoop which have been updated to be marker aware will support this option.
Example: `s3guard bucket-info -markers aware` on a compatible release.

```
-> hadoop s3guard bucket-info -markers aware s3a://landsat-pds/
-Filesystem s3a://landsat-pds
-Location: us-west-2
+> hadoop s3guard bucket-info -markers aware s3a://noaa-isd-pds/
+Filesystem s3a://noaa-isd-pds
...
@@ -326,13 +325,14 @@ Directory Markers
Authoritative paths: fs.s3a.authoritative.path=
The S3A connector is compatible with buckets where directory markers are not deleted
...
```

The same command will fail on older releases, because the `-markers` option
is unknown

```
-> hadoop s3guard bucket-info -markers aware s3a://landsat-pds/
+> hadoop s3guard bucket-info -markers aware s3a://noaa-isd-pds/
Illegal option -markers
Usage: hadoop bucket-info [OPTIONS] s3a://BUCKET
provide/check information about a specific bucket
@@ -354,9 +354,8 @@ Generic options supported are:
A specific policy check verifies that the connector is configured as desired

```
-> hadoop s3guard bucket-info -markers keep s3a://landsat-pds/
-Filesystem s3a://landsat-pds
-Location: us-west-2
+> hadoop s3guard bucket-info -markers keep s3a://noaa-isd-pds/
+Filesystem s3a://noaa-isd-pds
...
@@ -371,9 +370,8 @@ When probing for a specific policy, the error code "46" is returned if the active policy
does not match that requested:

```
-> hadoop s3guard bucket-info -markers delete s3a://landsat-pds/
-Filesystem s3a://landsat-pds
-Location: us-west-2
+> hadoop s3guard bucket-info -markers delete s3a://noaa-isd-pds/
+Filesystem s3a://noaa-isd-pds
S3A Client
Signing Algorithm: fs.s3a.signing-algorithm=(unset)
@@ -398,7 +396,7 @@ Directory Markers
Authoritative paths: fs.s3a.authoritative.path=
2021-11-22 16:03:59,175 [main] INFO util.ExitUtil (ExitUtil.java:terminate(210))
--Exiting with status 46: 46: Bucket s3a://landsat-pds: required marker polic is
+-Exiting with status 46: 46: Bucket s3a://noaa-isd-pds: required marker polic is
"keep" but actual policy is "delete"
```
@@ -450,10 +448,10 @@ Audit the path and fail if any markers were found.


```
-> hadoop s3guard markers -limit 8000 -audit s3a://landsat-pds/
+> hadoop s3guard markers -limit 8000 -audit s3a://noaa-isd-pds/
-The directory marker policy of s3a://landsat-pds is "Keep"
-2020-08-05 13:42:56,079 [main] INFO tools.MarkerTool (DurationInfo.java:<init>(77)) - Starting: marker scan s3a://landsat-pds/
+The directory marker policy of s3a://noaa-isd-pds is "Keep"
+2020-08-05 13:42:56,079 [main] INFO tools.MarkerTool (DurationInfo.java:<init>(77)) - Starting: marker scan s3a://noaa-isd-pds/
Scanned 1,000 objects
Scanned 2,000 objects
Scanned 3,000 objects
@@ -463,8 +461,8 @@ Scanned 6,000 objects
Scanned 7,000 objects
Scanned 8,000 objects
Limit of scan reached - 8,000 objects
-2020-08-05 13:43:01,184 [main] INFO tools.MarkerTool (DurationInfo.java:close(98)) - marker scan s3a://landsat-pds/: duration 0:05.107s
-No surplus directory markers were found under s3a://landsat-pds/
+2020-08-05 13:43:01,184 [main] INFO tools.MarkerTool (DurationInfo.java:close(98)) - marker scan s3a://noaa-isd-pds/: duration 0:05.107s
+No surplus directory markers were found under s3a://noaa-isd-pds/
Listing limit reached before completing the scan
2020-08-05 13:43:01,187 [main] INFO util.ExitUtil (ExitUtil.java:terminate(210)) - Exiting with status 3:
```
@@ -616,15 +616,14 @@ header.x-amz-version-id="KcDOVmznIagWx3gP1HlDqcZvm1mFWZ2a"
A file with no-encryption (on a bucket without versioning but with intelligent tiering):

```
-bin/hadoop fs -getfattr -d s3a://landsat-pds/scene_list.gz
+bin/hadoop fs -getfattr -d s3a://noaa-cors-pds/raw/2024/001/akse/AKSE001x.24_.gz
-# file: s3a://landsat-pds/scene_list.gz
-header.Content-Length="45603307"
-header.Content-Type="application/octet-stream"
-header.ETag="39c34d489777a595b36d0af5726007db"
-header.Last-Modified="Wed Aug 29 01:45:15 BST 2018"
-header.x-amz-storage-class="INTELLIGENT_TIERING"
-header.x-amz-version-id="null"
+# file: s3a://noaa-cors-pds/raw/2024/001/akse/AKSE001x.24_.gz
+header.Content-Length="524671"
+header.Content-Type="binary/octet-stream"
+header.ETag=""3e39531220fbd3747d32cf93a79a7a0c""
+header.Last-Modified="Tue Jan 02 00:15:13 GMT 2024"
+header.x-amz-server-side-encryption="AES256"
```

###<a name="changing-encryption"></a> Use `rename()` to encrypt files with new keys
@@ -503,7 +503,7 @@ explicitly opened up for broader access.
```bash
hadoop fs -ls \
-D fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider \
-s3a://landsat-pds/
+s3a://noaa-isd-pds/
```

1. Allowing anonymous access to an S3 bucket compromises
@@ -1630,11 +1630,11 @@ a session key:
</property>
```

-Finally, the public `s3a://landsat-pds/` bucket can be accessed anonymously:
+Finally, the public `s3a://noaa-isd-pds/` bucket can be accessed anonymously:

```xml
<property>
-<name>fs.s3a.bucket.landsat-pds.aws.credentials.provider</name>
+<name>fs.s3a.bucket.noaa-isd-pds.aws.credentials.provider</name>
<value>org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider</value>
</property>
```
@@ -447,7 +447,8 @@ An example of this is covered in [HADOOP-13871](https://issues.apache.org/jira/browse/HADOOP-13871).

1. For public data, use `curl`:

-curl -O https://landsat-pds.s3.amazonaws.com/scene_list.gz
+curl -O https://noaa-cors-pds.s3.amazonaws.com/raw/2023/001/akse/AKSE001a.23_.gz

1. Use `nettop` to monitor a process's connections.


@@ -696,7 +697,7 @@ via `FileSystem.get()` or `Path.getFileSystem()`.
The cache, `FileSystem.CACHE`, will, for each user, cache one instance of a filesystem
for a given URI.
All calls to `FileSystem.get` for a cached FS for a URI such
-as `s3a://landsat-pds/` will return that single instance.
+as `s3a://noaa-isd-pds/` will return that single instance.

FileSystem instances are created on-demand for the cache,
and will be done in each thread which requests an instance.
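
To illustrate (a sketch with a placeholder bucket, not from the original
page), the cache behaviour is directly observable:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FsCacheDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    URI uri = new URI("s3a://noaa-isd-pds/");
    // Both calls go through FileSystem.CACHE: the same object comes back.
    FileSystem cached1 = FileSystem.get(uri, conf);
    FileSystem cached2 = FileSystem.get(uri, conf);
    System.out.println("cached:   " + (cached1 == cached2)); // true
    // newInstance() deliberately bypasses the cache.
    FileSystem fresh = FileSystem.newInstance(uri, conf);
    System.out.println("uncached: " + (cached1 == fresh));   // false
  }
}
```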
@@ -720,7 +721,7 @@ can be created simultaneously for different object stores/distributed
filesystems.

For example, a value of four would put an upper limit on the number
-of wasted instantiations of a connector for the `s3a://landsat-pds/`
+of wasted instantiations of a connector for the `s3a://noaa-isd-pds/`
bucket.

```xml
@@ -260,22 +260,20 @@ define the target region in `auth-keys.xml`.
### <a name="csv"></a> CSV Data Tests

The `TestS3AInputStreamPerformance` tests require read access to a multi-MB
-text file. The default file for these tests is one published by amazon,
-[s3a://landsat-pds.s3.amazonaws.com/scene_list.gz](http://landsat-pds.s3.amazonaws.com/scene_list.gz).
-This is a gzipped CSV index of other files which amazon serves for open use.
+text file. The default file for these tests is a public one:
+`s3a://noaa-cors-pds/raw/2023/001/akse/AKSE001a.23_.gz`
+from the [NOAA Continuously Operating Reference Stations (CORS) Network (NCN)](https://registry.opendata.aws/noaa-ncn/).

+Historically it was required to be a `csv.gz` file to validate S3 Select
+support. Now that S3 Select support has been removed, other large files
+may be used instead.
+However, future versions may want to read a CSV file again, so testers
+should still reference one.

The path to this object is set in the option `fs.s3a.scale.test.csvfile`,

```xml
<property>
<name>fs.s3a.scale.test.csvfile</name>
-<value>s3a://landsat-pds/scene_list.gz</value>
+<value>s3a://noaa-cors-pds/raw/2023/001/akse/AKSE001a.23_.gz</value>
</property>
```

@@ -285,21 +283,21 @@ is hosted in Amazon's US-east datacenter.
1. If the data cannot be read for any reason then the test will fail.
1. If the property is set to a different path, then that data must be readable
and "sufficiently" large.
1. If a `.gz` file, expect decompression-related test failures.

(the reason the space or newline is needed is to add "an empty entry"; an empty
`<value/>` would be considered undefined and pick up the default)
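
As a sketch of that workaround (grounded only in the note above), the
tests can be disabled by setting the option to a single space:

```xml
<property>
  <name>fs.s3a.scale.test.csvfile</name>
  <!-- a single space: "an empty entry", not an undefined value -->
  <value> </value>
</property>
```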


If using a test file in a different AWS S3 region then
a bucket-specific region must be defined.
-For the default test dataset, hosted in the `landsat-pds` bucket, this is:
+For the default test dataset, hosted in the `noaa-cors-pds` bucket, this is:

```xml
-<property>
-<name>fs.s3a.bucket.landsat-pds.endpoint.region</name>
-<value>us-west-2</value>
-<description>The region for s3a://landsat-pds</description>
-</property>
+<property>
+<name>fs.s3a.bucket.noaa-cors-pds.endpoint.region</name>
+<value>us-east-1</value>
+</property>
```

### <a name="access"></a> Testing Access Point Integration
@@ -857,7 +855,7 @@ the tests become skipped, rather than fail with a trace which is really a false positive.
The ordered test case mechanism of `AbstractSTestS3AHugeFiles` is probably
the most elegant way of chaining test setup/teardown.
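
(A sketch of that pattern with hypothetical test names, not the actual
class: JUnit 4 runs the methods in name order, so each stage can build
on the output of the previous one.)

```java
import org.junit.FixMethodOrder;
import org.junit.Test;
import org.junit.runners.MethodSorters;

// Chained test stages in the style of AbstractSTestS3AHugeFiles:
// expensive setup runs once, later stages reuse its output.
@FixMethodOrder(MethodSorters.NAME_ASCENDING)
public class ITestChainedStages {
  @Test public void test_010_createHugeFile() { /* create the file once */ }
  @Test public void test_020_readHugeFile()   { /* reuse it for reads */ }
  @Test public void test_999_cleanup()        { /* delete it at the end */ }
}
```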

-Regarding reusing existing data, we tend to use the landsat archive of
+Regarding reusing existing data, we tend to use the noaa-cors-pds archive of
AWS US-East for our testing of input stream operations. This doesn't work
against other regions, or with third party S3 implementations. Thus the
URL can be overridden for testing elsewhere.