Very slow performance indexing into netcdf with a vector #135

Closed
rafaqz opened this issue Jan 21, 2024 · 1 comment

Comments

@rafaqz
Collaborator

rafaqz commented Jan 21, 2024

Indexing with a vector can be incredibly slow with the current implementation of batchgetindex.

From this comment by @Alexander-Barth: #131 (comment)

NetCDF.jl + DiskArrays.jl

julia> using NetCDF

julia> v = NetCDF.open("http://tds.hycom.org/thredds/dodsC/GLBy0.08/expt_93.0","water_temp");

julia> size(v)
(4500, 4251, 40, 14969)

julia> v1 = @time v[1000,1000,1,1:10];
  1.478774 seconds (3.71 M allocations: 250.669 MiB, 3.42% gc time, 59.32% compilation time)

julia> v1 = @time v[1000,1000,1,1:10];
  0.467219 seconds (45 allocations: 1.781 KiB)

julia> v1 = @time v[1000,1000,1,collect(1:10)];
739.847348 seconds (1.32 M allocations: 819.944 MiB, 0.03% gc time, 0.10% compilation time)

From this comment by @Alexander-Barth: https://github.com/meggart/DiskArrays.jl/issues/131#issuecomment-1902759762

NCDatasets 0.12 without DiskArrays.jl

julia> using NCDatasets

julia> ds = NCDataset("http://tds.hycom.org/thredds/dodsC/GLBy0.08/expt_93.0"); v = ds["water_temp"];

julia> v1 = @time v[1000,1000,1,1:10];
  0.726571 seconds (134.15 k allocations: 8.849 MiB, 14.90% compilation time)

julia> v1 = @time v[1000,1000,1,1:10];
  0.218793 seconds (62 allocations: 1.984 KiB)

julia> v1 = @time v[1000,1000,1,collect(1:10)];
  0.879592 seconds (982.55 k allocations: 66.829 MiB, 9.49% gc time, 65.79% compilation time)

So 12 minutes versus about 1 second for this example. Yes, I know that I can bypass DiskArrays by providing my own batchgetindex (roughly as sketched below). But I don't think that DiskArrays should make the current assumptions if it aims to be generic (if this is the goal).
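For context, such a bypass might look roughly like the sketch below. This is a minimal illustration, not working code: DAPVariable and load_selection are hypothetical names, and the exact batchgetindex method signature should be checked against the DiskArrays version in use.

import DiskArrays

# Hypothetical remote variable backed by an OPeNDAP-style server.
struct DAPVariable{T,N} <: DiskArrays.AbstractDiskArray{T,N}
    url::String
    varname::String
    dims::NTuple{N,Int}
end

Base.size(v::DAPVariable) = v.dims

# Stand-in for a real server request; a real implementation would
# encode the whole index selection in a single query.
load_selection(url, varname, i) = fill(0.0f0, length(i))

# Bypass the generic chunk-by-chunk path for vector indices on a
# 1-d variable: one server round trip instead of one read per chunk.
function DiskArrays.batchgetindex(v::DAPVariable{T,1},
                                  i::AbstractVector{<:Integer}) where T
    load_selection(v.url, v.varname, i)
end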

Can we have the current assumptions (i.e. that reading a whole chunk is about as fast as reading a subset of a chunk) documented?

Concerning the API, another thing I am wondering is whether we need a function named batchgetindex at all. Why not have getindex simply pass the indices to DiskArrays.readblock! and let DiskArrays.readblock! figure out how best to load the data? DiskArrays.readblock! is already specific to the storage format.
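To make that concrete, here is a rough sketch of what a generic fallback could look like if readblock! (which currently only receives AbstractUnitRange indices) were also allowed to handle integer vectors. The name readblock_vec! and the whole fallback are hypothetical, shown for a 1-d array; a backend like a DAP client could override it with a single server-side selection instead.

import DiskArrays

# Hypothetical fallback for vector indices on a 1-d disk array:
# read the covering unit range once, then subset in memory.
function readblock_vec!(a, aout, i::AbstractVector{<:Integer})
    r = minimum(i):maximum(i)              # smallest covering range
    tmp = similar(aout, length(r))
    DiskArrays.readblock!(a, tmp, r)       # one contiguous read
    aout .= view(tmp, i .- first(r) .+ 1)  # keep requested elements
    return aout
end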

@rafaqz
Collaborator Author

rafaqz commented Feb 19, 2024

This is fixed in #146. The slowdown was mostly wait time from the server blocking repeated calls.

rafaqz closed this as completed on Feb 19, 2024