Failed to read folder #1665

FergusChen · 2024-08-23T03:35:00Z

I noticed a difference in how fsspec handles folders containing parquet files.
The call is: pd.read_parquet("s3://xxx/test_dir/")
Normally, if there is a parquet file under test_dir, this call reads its contents without problems. The problem is:
If test_dir is a folder created implicitly by aws s3 cp a.parquet s3://xxx/test_dir/, it can be read normally.
But if test_dir was created by clicking the "Create Folder" button in the S3 console (https://console.amazonaws.cn/) and a.parquet was then uploaded into it, an exception is thrown: raise FileNotFoundError(path)
The reason is that, at line 339 of fs.py in pyarrow, get_file_info_selector() makes this call:

selected_files = self.fs.find(
    selector.base_dir, maxdepth=maxdepth, withdirs=True, detail=True
)

This call goes through wrapper() in fsspec's asyn.py, which returns sync(self.loop, func, *args, **kwargs).
The result contains two entries: s3://xxx/test_dir/ and s3://xxx/test_dir/a.parquet
After that, pyarrow tries to read s3://xxx/test_dir/ as a file, which hits raise FileNotFoundError(path) in fs.py.

However, if test_dir was created implicitly by aws s3 cp a.parquet s3://xxx/test_dir/, the result returned by wrapper() does not contain s3://xxx/test_dir/, so the data can be read normally.
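
To make the difference concrete, here is a minimal sketch of filtering the placeholder out of a find(detail=True)-style result before handing paths to a reader. The helper name drop_dir_placeholders is hypothetical, and the listing below is made-up sample data: the trailing slash, size 0 and type "file" on the placeholder entry are assumptions based on the paths quoted above, not output captured from s3fs.

```python
def drop_dir_placeholders(listing):
    """Return a copy of a {path: info} mapping without zero-length
    entries whose path ends in "/" (assumed directory placeholders)."""
    return {
        path: info
        for path, info in listing.items()
        if not (path.endswith("/") and info.get("size", 0) == 0)
    }


# Hypothetical listing shaped like the two results described above.
listing = {
    "xxx/test_dir/": {"name": "xxx/test_dir/", "size": 0, "type": "file"},
    "xxx/test_dir/a.parquet": {
        "name": "xxx/test_dir/a.parquet",
        "size": 1024,
        "type": "file",
    },
}

data_files = sorted(drop_dir_placeholders(listing))
print(data_files)  # ['xxx/test_dir/a.parquet']
```

With the placeholder removed, only a.parquet remains, which matches what the listing looks like in the aws s3 cp case.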

version info:
fsspec 2024.6.1
pyarrow 13.0.0
s3fs 2024.6.1
s3transfer 0.10.2
awscli 1.34.4
aiobotocore 2.13.3
boto3 1.35.4
botocore 1.35.4

martindurant · 2024-08-28T15:39:58Z

I'm not sure what you think correct behaviour on fsspec's part would be here. find() is supposed to get all the files in a directory tree, so it stands to reason that the zero-length directory placeholder would be there. I would say that arrow is in the wrong to try to read this, since it doesn't match the convention for parquet datafile names. Datasets produced by spark would, for instance, include "_SUCCESS" files, and I assume these do not cause a problem.
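
As a rough illustration of the naming convention I mean, a name-based filter might look like the sketch below. This is not arrow's actual logic; looks_like_data_file and IGNORE_PREFIXES are hypothetical names, and treating "." and "_" prefixes as non-data is an assumption modelled on how markers like "_SUCCESS" are usually skipped.

```python
import posixpath

# Assumption: basenames starting with these prefixes are not data files.
IGNORE_PREFIXES = (".", "_")


def looks_like_data_file(path):
    """True if the last path component is non-empty and not an ignored marker."""
    name = posixpath.basename(path)
    # A key ending in "/" has an empty basename, so the placeholder is skipped.
    return bool(name) and not name.startswith(IGNORE_PREFIXES)


paths = [
    "xxx/test_dir/",
    "xxx/test_dir/_SUCCESS",
    "xxx/test_dir/a.parquet",
]
kept = [p for p in paths if looks_like_data_file(p)]
print(kept)  # ['xxx/test_dir/a.parquet']
```

Under that convention the zero-length "test_dir/" entry would be skipped in the same way as "_SUCCESS".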

As to whether your "test_dir/" is a file or a directory: unfortunately, the backend isn't POSIX, and we have to do the best we can. So ls() on that path works as it would for a directory, but info() will return the details of the zero-length file. I don't think there's a behaviour that would satisfy all possible use cases.
