Failed to read folder #1665

Open
FergusChen opened this issue Aug 23, 2024 · 1 comment

Comments

@FergusChen

I noticed that fsspec handles folders containing parquet files differently depending on how the folder was created.
Call: pd.read_parquet("s3://xxx/test_dir/")
Normally, if there is a parquet file under test_dir, this call reads the contents of the parquet file without error. The problem is:
If test_dir is a folder created implicitly by aws s3 cp a.parquet s3://xxx/test_dir/, it can be read normally.
But if test_dir is a folder created by clicking the "Create Folder" button in the S3 console (https://console.amazonaws.cn/), and a.parquet is then uploaded to this folder, an exception is thrown: raise FileNotFoundError(path)
The reason is that, at line 339 of fs.py in pyarrow, get_file_info_selector() makes this call:

selected_files = self.fs.find(
    selector.base_dir, maxdepth=maxdepth, withdirs=True, detail=True
)

This call goes through wrapper() in fsspec's asyn.py, which returns sync(self.loop, func, *args, **kwargs).
The result contains two entries: s3://xxx/test_dir/ and s3://xxx/test_dir/a.parquet
After that, pyarrow tries to read s3://xxx/test_dir/ as a file, which triggers the raise FileNotFoundError(path) in fs.py.

However, if test_dir is a folder created implicitly by aws s3 cp a.parquet s3://xxx/test_dir/, the result of wrapper() does not contain s3://xxx/test_dir/, so the data can be read normally.
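
A possible workaround on the caller's side, as a rough sketch only: filter a find(detail=True)-style listing down to real data files before handing paths to a reader. The function name select_parquet_files and the sample listings are hypothetical, and the exact shape of the placeholder entry (trailing slash, type "file", size 0) is an assumption based on the two results described above, not verified s3fs output.

```python
from typing import Dict, List


def select_parquet_files(listing: Dict[str, dict], suffix: str = ".parquet") -> List[str]:
    """Keep only non-empty data files from a {path: info} mapping.

    `listing` is assumed to look like the dict returned by an fsspec
    find(..., detail=True) call: each info dict carries "type" and "size".
    """
    selected = []
    for path, info in listing.items():
        if path.endswith("/"):
            continue  # zero-length "folder" placeholder key
        if info.get("type") != "file" or info.get("size", 0) == 0:
            continue  # directories and empty marker objects
        if not path.endswith(suffix):
            continue  # anything that does not look like a data file
        selected.append(path)
    return sorted(selected)


# Hypothetical listings: folder created in the console vs. created by `aws s3 cp`.
console_listing = {
    "xxx/test_dir/": {"name": "xxx/test_dir/", "type": "file", "size": 0},
    "xxx/test_dir/a.parquet": {"name": "xxx/test_dir/a.parquet", "type": "file", "size": 1024},
}
cli_listing = {
    "xxx/test_dir/a.parquet": {"name": "xxx/test_dir/a.parquet", "type": "file", "size": 1024},
}

console_files = select_parquet_files(console_listing)
cli_files = select_parquet_files(cli_listing)
print(console_files)
print(cli_files)
```

With this filter both listings reduce to the same single path, which could then be passed to a reader that accepts an explicit list of files (for example pyarrow.parquet.ParquetDataset together with the filesystem object) instead of the bare directory path.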

version info:
fsspec 2024.6.1
pyarrow 13.0.0
s3fs 2024.6.1
s3transfer 0.10.2
awscli 1.34.4
aiobotocore 2.13.3
boto3 1.35.4
botocore 1.35.4

@martindurant
Member

I'm not sure what you think correct behaviour on fsspec's part would be here. find() is supposed to get all the files in a directory tree, so it stands to reason that the zero-length directory placeholder would be there. I would say that arrow is in the wrong to try to read this, since it doesn't match the convention for parquet datafile names. Datasets produced by spark would, for instance, include "_SUCCESS" files, and I assume these do not cause a problem.

As to whether your "test_dir/" is a file or a directory - unfortunately, the backend isn't posix, and we have to do the best we can. So ls() on that path works as it would for a directory, but info() will get the file details. I don't think there's a behaviour that would satisfy all possible use cases.
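
To illustrate the file-or-directory ambiguity, here is a toy model of a flat key store. It is not s3fs or the S3 API; ToyObjectStore and its methods are made up for this sketch, and they only mimic the behaviour described above, where ls() treats the path as a directory while info() finds an object at that exact key.

```python
class ToyObjectStore:
    """Flat key -> bytes mapping; directories exist only as key prefixes."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data=b""):
        self.objects[key] = data

    def info(self, key):
        # An exact key match is an object ("file"), even if the key ends in "/".
        if key in self.objects:
            return {"name": key, "type": "file", "size": len(self.objects[key])}
        # Otherwise the path is a "directory" only if some key lives under it.
        prefix = key.rstrip("/") + "/"
        if any(k.startswith(prefix) for k in self.objects):
            return {"name": key, "type": "directory", "size": 0}
        raise FileNotFoundError(key)

    def ls(self, path):
        # List keys under the prefix, excluding the placeholder key itself.
        prefix = path.rstrip("/") + "/"
        return sorted(k for k in self.objects if k.startswith(prefix) and k != prefix)


# "Create Folder" in the console: a zero-length placeholder plus the data file.
console = ToyObjectStore()
console.put("test_dir/")
console.put("test_dir/a.parquet", b"PAR1")

# `aws s3 cp a.parquet s3://xxx/test_dir/`: only the data file exists.
cli = ToyObjectStore()
cli.put("test_dir/a.parquet", b"PAR1")

print(console.ls("test_dir/"), console.info("test_dir/")["type"])
print(cli.ls("test_dir/"), cli.info("test_dir/")["type"])
```

In this model both stores list the same child, but only the console-style store reports "test_dir/" as a zero-length file, which is the kind of entry a reader can trip over when it tries to open it.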
