Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better double asterisks ** support #1329

Merged
merged 31 commits into from
Aug 22, 2023

Conversation

john-jam
Copy link
Contributor

@john-jam john-jam commented Aug 6, 2023

Description

This PR:

  • makes the glob method compliant with posix rules and add better support for double asterisks ** in globs.
  • supports maxdepth for the glob method.
  • fixes the copy, get and put methods to not create unwanted directories on some scenarios.

I added some tests to check the posix rules against the python glob.glob implementation and the bash globstar support. Both do not support double asterisks with other prefix/suffix than / so for example **.json is the same as *.json and not the same as **/*.json. But since this repository already supported that well, this feature is kept and patterns like ab**cd are still working properly (returning paths like ab/ef/cd).

New rules:

  • the root dir is returned when using find. It is because of the posix ** rule compliance. For example, it should return an empty string among the other paths when using ** or return ab among the other paths when using ab/**.
  • when using a trailing slash in globs, it returns only directories.
  • the double asterisks ** matches 0+ directories.
  • the maxdepth option is applied on the first ** when using a glob.

References

@john-jam john-jam marked this pull request as ready for review August 8, 2023 08:15
@john-jam
Copy link
Contributor Author

john-jam commented Aug 8, 2023

@martindurant @ianthomas23 I can't run all the tests on my machine locally (errors with dask and ftp). Could you trigger a CI run?

@martindurant
Copy link
Member

You are failing on the range of things that we typically see break whenever you change anything about how paths work :|. In particular, you will find it difficult to reconcile what happens on windows without having a system of your own to run tests on.

@@ -65,10 +65,6 @@ def ls(self, path, detail=False, **kwargs):
else:
return [posixpath.join(path, f) for f in os.listdir(path)]

def glob(self, path, **kwargs):
path = self._strip_protocol(path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing _strip_protocol from the flow here is likely the cause of all the windows errors, because for LocalFileSystem, it also converts windows-style paths to posix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what caused this issue: #1322
How about using make_posix_path helper here then?

Copy link
Contributor Author

@john-jam john-jam Aug 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed the windows tests by using make_path_posix on the expected outputs in the tests when required: https://github.com/john-jam/filesystem_spec/actions/runs/5807521471/job/15742517288

@martindurant
Copy link
Member

@ianthomas23 , do you think you will have time to go through this?

@ianthomas23
Copy link
Collaborator

@ianthomas23 , do you think you will have time to go through this?

Yes I will, but later in the week.

@john-jam john-jam force-pushed the better-double-asterisk-support branch from e332f05 to 8680aef Compare August 8, 2023 15:12
@john-jam
Copy link
Contributor Author

john-jam commented Aug 8, 2023

You are failing on the range of things that we typically see break whenever you change anything about how paths work

Yes, there's a lot of updates to force glob compliant with posix rules but maybe some of them are not wanted? For example the following posix rules could be reverted if you think it's not worth it:

  • the root dir is returned when using double asterisks **.
  • when using a trailing slash in globs, it returns only directories.

.github/workflows/main.yaml Outdated Show resolved Hide resolved
Copy link
Collaborator

@ianthomas23 ianthomas23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good to merge now as far as I am concerned. I'll give @martindurant the final say.

When this is merged we can re-run the CI of the s3fs and gcsfs PRs, and I will review them properly. They look fine at initial glance. Ideally we would want to merge both of those quite quickly after this so that everything is consistent before the next releases.

I've closed my gcsfs abstract tests PR fsspec/gcsfs#535 as fsspec/gcsfs#572 is definitely a better solution.

@john-jam
Copy link
Contributor Author

@ianthomas23 I changed the posix tests with python glob and bash glob to use relative paths instead of absolute paths.
This allowed me to detect an incoherence with the ** use case. fsspec was returning an empty string "" for the current directory but the python glob and bash glob, both don't return an empty string "" for the current dir when using relative paths.

When using a prefix though (e.g. dir/**), the three implementations return the root folder (e.g. dir).

I fixed it in this commit.

@martindurant Please let me know if there's any work still to be done in this PR for it to be merged soon?

@john-jam
Copy link
Contributor Author

@ianthomas23 @martindurant I found another edge case with gcsfs. I created two PRs (here and here) but need to merge this PR before. I'll change the PR targets accordingly once this one is merged.

@martindurant
Copy link
Member

Merging now, and then we can deal with the small leftovers.

@martindurant martindurant merged commit c3b4bc3 into fsspec:master Aug 22, 2023
9 of 11 checks passed
@john-jam john-jam deleted the better-double-asterisk-support branch August 28, 2023 00:57
@john-jam
Copy link
Contributor Author

@martindurant @ianthomas23 Thanks for the merge of this PR and the ones in s3fs and gcsfs!

Do you know approx. when will you release a new version of those 3 repos? And also if you plan to include those changes in gcsfs?

@martindurant
Copy link
Member

Releases should be out this week, so long as lingering PRs are cleaned up.

seanses added a commit to xetdata/pyxet that referenced this pull request Sep 8, 2023
`fsspec` upgrade in fsspec/filesystem_spec#1329
makes
1. when using a trailing slash in globs, it returns only directories.
2. the double asterisks ** matches 0+ directories.

This PR fix the tests accordingly.
AlenkaF added a commit to apache/arrow that referenced this pull request Sep 14, 2023
…tory (#37558)

### Rationale for this change

There has been some changes in the way fsspec lists the directories with new version 2023.9.0, see fsspec/filesystem_spec#1329, which caused our tests to start failing.

### What changes are included in this PR?

This PR updates the `get_file_info_selector` in [FSSpecHandler](https://arrow.apache.org/docs/_modules/pyarrow/fs.html#FSSpecHandler) class to keep the behaviour of our spec.

### Are there any user-facing changes?

No.

* Closes: #37555

Authored-by: AlenkaF <[email protected]>
Signed-off-by: AlenkaF <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
… directory (apache#37558)

### Rationale for this change

There has been some changes in the way fsspec lists the directories with new version 2023.9.0, see fsspec/filesystem_spec#1329, which caused our tests to start failing.

### What changes are included in this PR?

This PR updates the `get_file_info_selector` in [FSSpecHandler](https://arrow.apache.org/docs/_modules/pyarrow/fs.html#FSSpecHandler) class to keep the behaviour of our spec.

### Are there any user-facing changes?

No.

* Closes: apache#37555

Authored-by: AlenkaF <[email protected]>
Signed-off-by: AlenkaF <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
… directory (apache#37558)

### Rationale for this change

There has been some changes in the way fsspec lists the directories with new version 2023.9.0, see fsspec/filesystem_spec#1329, which caused our tests to start failing.

### What changes are included in this PR?

This PR updates the `get_file_info_selector` in [FSSpecHandler](https://arrow.apache.org/docs/_modules/pyarrow/fs.html#FSSpecHandler) class to keep the behaviour of our spec.

### Are there any user-facing changes?

No.

* Closes: apache#37555

Authored-by: AlenkaF <[email protected]>
Signed-off-by: AlenkaF <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Avoid destination directory creation when using glob double asterisk(**)
3 participants