What filesystem to use with parquet files #1648

Open
kthyng opened this issue Jul 14, 2024 · 1 comment

@kthyng

kthyng commented Jul 14, 2024

This might be a naive question but I have spent a bit of time trying to figure it out and haven't made much progress.

I'm trying to do this workflow for a parquet file:

import fsspec

fs = fsspec.filesystem().open(path_to_file)

Opening the path without specifying a protocol treats the parquet file as a directory and raises IsADirectoryError, so I am trying to figure out which protocol to use. Looking through the docs, two built-in implementations mention parquet files, but both seem aimed at kerchunk files specifically, and I'm not sure whether they can be used for anything else. I tried protocol="reference", but wasn't sure what to pass for fo. Since I am using a local parquet file, I passed its path as fo, something like this:

fs = fsspec.filesystem("reference", fo=path_to_file).open(path_to_file)

but then it couldn't find my file, even though it sits in the same directory and "path_to_file" was just the file name. I am using local files for now, but that won't always be the case.

Am I taking the wrong approach altogether? Any idea for how to approach this? Thanks.

@martindurant
Member

I am a little confused about what you want to do. As you say, a parquet dataset is (usually) a collection of files in a directory or tree. fsspec is for reading bytes or doing filesystem manipulations, so it makes no sense to "open" a directory.

import fsspec

fs = fsspec.filesystem("file")  # name the protocol; "file" is the local filesystem
fs.find(path)  # list all files
fsspec.open_files(path + "/**/*.parquet", "rb")  # "open" all matching data files
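To make the directory-versus-file distinction concrete, here is a minimal sketch on the local filesystem. The directory layout, the `year=...` folder names and the placeholder bytes are assumptions for illustration only; the files are not real Parquet data, since only the layout matters for this point.

```python
import os
import tempfile

import fsspec

# Build a stand-in "dataset" directory. The files hold placeholder bytes,
# not real Parquet data; only the directory layout matters here.
root = tempfile.mkdtemp()
dataset = os.path.join(root, "data.parquet")
for part in ("year=2023", "year=2024"):
    os.makedirs(os.path.join(dataset, part))
    with open(os.path.join(dataset, part, "part-0.parquet"), "wb") as f:
        f.write(b"placeholder")

fs = fsspec.filesystem("file")  # local filesystem; the protocol is given explicitly

is_dir = fs.isdir(dataset)    # the dataset path is a directory, not a single file
all_files = fs.find(dataset)  # recursive listing of every file under it

# open_files expands the glob and returns one OpenFile object per match
handles = fsspec.open_files(dataset + "/**/*.parquet", "rb")
contents = []
for handle in handles:
    with handle as f:
        contents.append(f.read())
```

Because the dataset path is a directory, there is no single byte stream to open; listing or globbing yields the individual data files instead.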

However, the parquet libraries understand the layout of parquet files, so you don't need to do this.

import pandas as pd
pd.read_parquet(path)

will call fsspec as needed (via arrow, which also has a concept of filesystems, or via fastparquet). Same goes for dask, polars, etc.

And, of course, Intake:

data = intake.readers.datatypes.Parquet(path)
reader = data.to_reader("pandas")
# or
reader = intake.auto_pipeline(path, "pandas:DataFrame") # works if path matches *.parquet
