Consider pulling real data from cloud in tutorials with zarr #662

Open
nicholasloveday opened this issue Sep 6, 2024 · 5 comments

@nicholasloveday
Collaborator

Some of the tutorials use real data. These don't work as well in Binder because you need to download the data and save it to disk first.

An alternative approach would be to pull the data from the cloud into memory.

E.g.,

import pandas as pd
import xarray as xr
hres = xr.open_zarr('gs://weatherbench2/datasets/hres/2016-2022-0012-1440x721.zarr')
hres["2m_temperature"].sel(time="2020-01-01T00:00:00", prediction_timedelta=pd.Timedelta("1 days")).plot()

takes 1.5 seconds to pull the ECMWF forecast down from the cloud and plot it.

The dependencies that would need to be added for the tutorials are zarr and gcsfs.

There are also reanalysis data and data-driven models that can be pulled down from the cloud (see https://weatherbench2.readthedocs.io/en/latest/data-guide.html).

You can get data on the same grid so it makes verification with scores super easy!

Something to discuss

@nicholasloveday
Collaborator Author

Okay, it looks like establishing the initial connection to Google Cloud Storage is slow (~20 s), but pulling data down after that is fast.
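One way to check where the time goes is to time the connection step and the data pull separately. The following is only a minimal sketch: `timed` and `time_cloud_access` are hypothetical helper names, not part of scores, xarray or WeatherBench2, and the two callables are supplied by whoever runs it.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def time_cloud_access(open_store, load_slice):
    """Time opening the store and loading a slice as two separate steps.

    open_store and load_slice are hypothetical callables supplied by the caller.
    """
    results = {}
    with timed("connect", results):
        store = open_store()
    with timed("load", results):
        load_slice(store)
    return results


# Example use in a notebook (needs network access, zarr and gcsfs):
# import pandas as pd
# import xarray as xr
# timings = time_cloud_access(
#     lambda: xr.open_zarr("gs://weatherbench2/datasets/hres/2016-2022-0012-1440x721.zarr"),
#     lambda ds: ds["2m_temperature"].sel(
#         time="2020-01-01T00:00:00", prediction_timedelta=pd.Timedelta("1 days")
#     ).compute(),
# )
# print(timings)
```

Running this in Binder and locally would show whether the ~20 s start-up cost is similar in both environments.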

@nicholasloveday
Collaborator Author

Here's an example of verifying GraphCast against ERA5 with scores:

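from scores.continuous import mse
import xarray as xr
import pandas as pd

# Open the GraphCast forecasts; data is only downloaded when compute() is called.
graphcast = xr.open_zarr(
    "gs://weatherbench2/datasets/graphcast/2020/date_range_2019-11-16_2021-02-01_12_hours_derived.zarr"
)
# 5-day forecast initialised at 2020-01-01 00Z, so it is valid at 2020-01-06 00Z.
fcst = graphcast["2m_temperature"].sel(
    time="2020-01-01T00:00:00", prediction_timedelta=pd.Timedelta("5 days")
)
era5 = xr.open_zarr("gs://weatherbench2/datasets/era5-forecasts/2020-1440x721.zarr")
# Use the 0-day lead time at the forecast's valid time as the observation.
obs = era5["2m_temperature"].sel(
    time="2020-01-06T00:00:00", prediction_timedelta=pd.Timedelta("0 days")
)
# Pull both fields into memory.
fcst = fcst.compute()
obs = obs.compute()
# Match the dimension names used by the GraphCast dataset.
obs = obs.rename({"latitude": "lat", "longitude": "lon"})
# preserve_dims="all" keeps every dimension, giving the squared error at each grid point.
result = mse(fcst, obs, preserve_dims="all")
result.plot(vmax=100)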

This also takes ~1.5 seconds once the initial connection to the cloud storage has been established.
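When pairing a forecast with an analysis, the analysis should be valid at the forecast's initialisation time plus its lead time. A small sketch of that arithmetic, assuming the WeatherBench2 convention that `time` is the initialisation time and `prediction_timedelta` is the lead time (`valid_time` is a hypothetical helper, not a scores or xarray function):

```python
from datetime import datetime, timedelta


def valid_time(init_time, lead):
    """Return the time a forecast is valid for: initialisation time plus lead time."""
    return init_time + lead


init = datetime(2020, 1, 1, 0)  # forecast initialised at 2020-01-01 00Z
lead = timedelta(days=5)        # 5-day lead time
print(valid_time(init, lead).isoformat())  # 2020-01-06T00:00:00
```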

@tennlee
Collaborator

tennlee commented Sep 7, 2024

If that works well in Binder, that's good to know. I'd like some testing to be done before we proceed, and it might be a few weeks before I could do this myself. I'd be happy to see a new notebook created on a branch which we can then develop and test. It might be nice to have an ML-focused notebook which goes into some new areas, possibly looking at evaluation more than the use of individual scores. Thanks very much for putting your example together.

@nicholasloveday
Collaborator Author

Yes - I agree that we need to test this in Binder. It worked well on my laptop, so it may be an improvement for people running it locally.

An ML-focused notebook sounds like a great idea. A scores + weatherbench2 tutorial would be quite nice.

@tennlee tennlee self-assigned this Sep 11, 2024
@tennlee
Collaborator

tennlee commented Sep 11, 2024

I've started a branch for this on my fork, based on your sample code. I won't be able to do this very quickly, so if you want to get something done sooner, feel free. It required the packages "zarr" and "gcsfs" to be installed. Zarr is a common format to want, so that's fine to add to the tutorial requirements, and I think there will be enough interest in these datasets to justify adding gcsfs to the tutorial requirements as well. Another option would be to simply document those requirements in the notebook itself. But if the goal is to have this work nicely in Binder, it's probably more reliable to add them to the requirements. If you'd like, I'm happy to add you to my fork so you can push directly to the feature branch.
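If the requirements end up being documented in the notebook rather than added to the tutorial requirements, a first cell could check for them and fail with a readable message. This is only a sketch under that assumption: `missing_packages` and `check_cloud_requirements` are hypothetical names, and the package list is taken from this thread.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level packages in names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def check_cloud_requirements(names=("zarr", "gcsfs")):
    """Raise a readable ImportError if any package needed for cloud access is absent."""
    missing = missing_packages(names)
    if missing:
        raise ImportError(
            "This notebook needs: " + ", ".join(missing)
            + ". Install with: pip install " + " ".join(missing)
        )


# In the first notebook cell:
# check_cloud_requirements()
```

This does not replace adding the packages to the requirements for Binder; it only makes a missing dependency easier to diagnose for people running the notebook locally.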
