I noticed a difference in fsspec's handling of folders containing parquet files.

Call: `pd.read_parquet("s3://xxx/test_dir/")`

Normally, if there is a parquet file under `test_dir`, this call reads the file's contents fine. The problem is:

If `test_dir` is a folder created implicitly by `aws s3 cp a.parquet s3://xxx/test_dir/`, it can be read normally. But if `test_dir` is a folder created by clicking the "Create Folder" button in the S3 console (https://console.amazonaws.cn/), with `a.parquet` then uploaded into it, an exception is thrown: `raise FileNotFoundError(path)`.

The reason is that at line 339 of pyarrow's `fs.py`, `get_file_info_selector()` calls into fsspec, where `wrapper()` in `asyn.py` returns `sync(self.loop, func, *args, **kwargs)`. The result contains two entries: `s3://xxx/test_dir/` and `s3://xxx/test_dir/a.parquet`. pyarrow then tries to read `s3://xxx/test_dir/` as a file, which raises the `FileNotFoundError(path)` exception in `fs.py`.

But if `test_dir` is a folder created implicitly by `aws s3 cp a.parquet s3://xxx/test_dir/`, the result of `wrapper()` does not contain `s3://xxx/test_dir/`, so the data can be read normally.
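To make the difference concrete, here is a small self-contained sketch (plain Python, no S3 access; the listings are hard-coded to mirror the two cases described above, and the bucket name `xxx` is the placeholder from this issue, not a real bucket):

```python
# Folder created implicitly by `aws s3 cp`: only real objects exist.
cli_listing = ["xxx/test_dir/a.parquet"]

# Folder created via the S3 console's "Create Folder" button: the
# console writes a zero-byte object "xxx/test_dir/" as a placeholder,
# and find() reports it alongside the data file.
console_listing = ["xxx/test_dir/", "xxx/test_dir/a.parquet"]

def placeholder_keys(listing):
    # A key ending in "/" is a directory placeholder, not a readable
    # file; pyarrow trying to open it as a file is what ends in
    # FileNotFoundError.
    return [k for k in listing if k.endswith("/")]

print(placeholder_keys(cli_listing))      # []
print(placeholder_keys(console_listing))  # ['xxx/test_dir/']
```

The helper name `placeholder_keys` is illustrative, not part of fsspec's API.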
I'm not sure what you think correct behaviour on fsspec's part would be here. find() is supposed to return all the files in a directory tree, so it stands to reason that the zero-length directory placeholder is included. I would say that arrow is in the wrong to try to read it, since it doesn't match the naming convention for parquet data files. Datasets produced by Spark would, for instance, include "_SUCCESS" files, and I assume those do not cause a problem.

As to whether your "test_dir/" is a file or a directory - unfortunately, the backend isn't POSIX, and we have to do the best we can. So ls() on that path works as it would for a directory, but info() will return the file details. I don't think there's a behaviour that would satisfy all possible use cases.
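Following that reasoning, a consumer could filter the listing down to actual parquet data files before reading, dropping both the console placeholder and marker files such as "_SUCCESS". A minimal sketch (the filter function and the example keys are illustrative, not part of fsspec's API):

```python
def parquet_data_files(listing, suffix=".parquet"):
    # Keep only keys that look like parquet data files; this skips
    # directory placeholders (keys ending in "/") and marker files
    # like "_SUCCESS" that Spark writes next to its output.
    return [k for k in listing if k.endswith(suffix)]

listing = [
    "xxx/test_dir/",           # console-created placeholder
    "xxx/test_dir/_SUCCESS",   # Spark-style marker file
    "xxx/test_dir/a.parquet",  # actual data
]
print(parquet_data_files(listing))  # ['xxx/test_dir/a.parquet']
```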
version info:
fsspec 2024.6.1
pyarrow 13.0.0
s3fs 2024.6.1
s3transfer 0.10.2
awscli 1.34.4
aiobotocore 2.13.3
boto3 1.35.4
botocore 1.35.4