[fs] basic sync tool #14248
base: main
Conversation
Force-pushed from e4ad2c5 to c4d1a62
I made some pretty substantial changes over the weekend to allow me to copy our giant annotation database buckets. Let me clean those up before we review.
Force-pushed from b42f926 to 5363d1f
CHANGELOG: Introduce `hailctl fs sync`, which robustly transfers one or more files between Amazon S3, Azure Blob Storage, and Google Cloud Storage.

There are really two distinct conceptual changes remaining here. Given my waning available time, I am not going to split them into two pull requests. The changes are:

1. `basename` always agrees with the [`basename` UNIX utility](https://en.wikipedia.org/wiki/Basename). In particular, the folder `/foo/bar/baz/`'s basename is *not* `''`, it is `'baz'`. The only folders or objects whose basename is `''` are objects whose name literally ends in a slash, e.g. an *object* named `gs://foo/bar/baz/`.
2. `hailctl fs sync`, a robust copying tool with a user-friendly CLI. `hailctl fs sync` comprises two pieces: `plan.py` and `sync.py`. The latter, `sync.py`, is simple: it delegates to our existing copy infrastructure. That copy infrastructure has been lightly modified to support this use-case. The former, `plan.py`, is a concurrent file system `diff`. `plan.py` generates and `sync.py` consumes a "plan folder" containing these files:
   1. `matches`: files whose names and sizes match. Two columns: source URL, destination URL.
   2. `differs`: files or folders whose names match but either differ in size or differ in type. Four columns: source URL, destination URL, source state, destination state. The states are either `file`, `dir`, or a size. If either state is a size, both states are sizes.
   3. `srconly`: files only present in the source. One column: source URL.
   4. `dstonly`: files only present in the destination. One column: destination URL.
   5. `plan`: a proposed set of object-to-object copies. Two columns: source URL, destination URL.
   6. `summary`: a one-line file containing the total number of copies in the plan and the total number of bytes which would be copied.

As described in the CLI documentation, the intended use of these commands is:

```
hailctl fs sync --make-plan plan1 --copy-to gs://gcs-bucket/a s3://s3-bucket/b
hailctl fs sync --use-plan plan1
```

The first command generates a plan folder and the second command executes the plan. Separating this process into two commands allows the user to verify exactly what will be copied, including the exact destination URLs. Moreover, if `hailctl fs sync --use-plan` fails, the user can re-run `hailctl fs sync --make-plan` to generate a new plan which will avoid copying already successfully copied files. Likewise, the user can re-run `hailctl fs sync --make-plan` to verify that every file was indeed successfully copied.

Testing. This change has a few sync-specific tests but largely reuses the tests for `hailtop.aiotools.copy`.

Future Work. Propagating a consistent kind of hash across all clouds and using that to detect differences would be a better solution than the file-size-based difference used here. If all the clouds always provided the same type of hash value, this would be trivial to add. Alas, at time of writing, S3 and Google both support CRC32C for every blob (though, in S3, you must explicitly request it at object creation time), but *Azure Blob Storage does not*. ABS only supports MD5 sums, which Google does not support for multi-part uploads.
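To make the first change concrete, here is a small sketch contrasting the described semantics with Python's `os.path.basename`, which returns the empty string for a path that ends in a slash. The `unix_like_basename` helper is illustrative only; it is not the PR's implementation and only covers the trailing-slash-on-a-folder case discussed above.

```python
import os.path

# Python's stdlib basename keeps the behavior this PR moves away from:
# a path ending in '/' has an empty basename.
assert os.path.basename('/foo/bar/baz/') == ''

def unix_like_basename(path: str) -> str:
    """Illustrative sketch of UNIX `basename` semantics for slash-terminated paths."""
    stripped = path.rstrip('/')
    if not stripped:          # the path was '/' (or all slashes)
        return '/'
    return stripped.split('/')[-1]

# The UNIX `basename` utility, and the semantics described above, yield 'baz'.
assert unix_like_basename('/foo/bar/baz/') == 'baz'
```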
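For a quick sanity check before running `--use-plan`, one could inspect the plan folder directly. Below is a minimal sketch that prints the `summary` line and the first few proposed copies from `plan`; the tab-separated column layout and the `preview_plan` helper are assumptions for illustration, not details taken from this PR.

```python
from pathlib import Path


def preview_plan(plan_folder: str, limit: int = 10) -> None:
    folder = Path(plan_folder)

    # `summary` is described as a one-line file with the number of copies
    # and the number of bytes that would be copied.
    print('summary:', (folder / 'summary').read_text().strip())

    # `plan` is described as two columns: source URL, destination URL.
    # Assumption: the columns are tab-separated.
    with open(folder / 'plan') as f:
        for i, line in enumerate(f):
            if i >= limit:
                print('...')
                break
            src, dst = line.rstrip('\n').split('\t', 1)
            print(f'{src} -> {dst}')


preview_plan('plan1')
```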
Force-pushed from 31b850f to 44c9364
This is a good start at a user-friendly cross-cloud sync tool and I feel that it should be merged as is.
There's a lot of opportunity to take advantage of cloud-specific APIs, like the GCS storage transfer service, to make this and the more basic copier tool more robust.
Force-pushed from bed4625 to 20b3fce
Force-pushed from 20b3fce to 44ef794
Dismissing for now, as our CI currently thinks the approval makes this mergeable.
Re-approving after my previous review was dismissed to unblock the merge queue.
I think this is about ready to merge. Just one question.
import click
Is this a new dependency? Should we add it to requirements.txt?
Resolves #14654