Update and reorganize apply_ufunc material #180

Merged
merged 40 commits on Jun 28, 2023
Commits (40)
a425211
Update and reorganize apply_ufunc material
dcherian Jun 14, 2023
1291a5c
More complex output
dcherian Jun 16, 2023
799c94a
Fix syntax
dcherian Jun 16, 2023
fccc15f
fix syntax?
dcherian Jun 16, 2023
627fcce
Add more solutions
dcherian Jun 16, 2023
20c4537
Remove nested code-cell
dcherian Jun 16, 2023
3fe7152
Updates
dcherian Jun 21, 2023
5f7bf8e
fix toc
dcherian Jun 21, 2023
a31526a
WIP
dcherian Jun 22, 2023
ad40947
small edits
dcherian Jun 23, 2023
7884868
finish automatic vectorizing
dcherian Jun 23, 2023
2e100c4
spellcheck
dcherian Jun 23, 2023
62ddb3b
Expect exception raising
dcherian Jun 23, 2023
e7a82a0
fixes
dcherian Jun 23, 2023
d076f21
cleanup
dcherian Jun 23, 2023
d9a093d
small updates
dcherian Jun 26, 2023
7b908be
Add numba vectorize notebook
dcherian Jun 26, 2023
2befb26
fix
dcherian Jun 26, 2023
748eab0
Small edits
dcherian Jun 26, 2023
f961e0e
Merge branch 'main' into apply-ufunc-reorg
dcherian Jun 26, 2023
7540f2c
add numba logo
dcherian Jun 26, 2023
87968ba
small text update
dcherian Jun 26, 2023
1fe8d87
fix spelling
dcherian Jun 26, 2023
68b9103
Tweaks to dask material
dcherian Jun 26, 2023
2401c14
more tweaks
dcherian Jun 26, 2023
67f973d
Review comments
dcherian Jun 27, 2023
7f8170d
Migrate to exercise syntax.
dcherian Jun 27, 2023
630e851
Merge branch 'main' into apply-ufunc-reorg
dcherian Jun 27, 2023
4a55bcb
Add sphinx_exercise + review feedback
dcherian Jun 27, 2023
4b92ebf
Break out core dimensions material
dcherian Jun 27, 2023
0756860
fix
dcherian Jun 27, 2023
742231b
fix link
dcherian Jun 27, 2023
89a37c9
one more use of exercise syntax
dcherian Jun 27, 2023
95e0717
More review comments.
dcherian Jun 27, 2023
4cf9e9d
Significant dask updates.
dcherian Jun 27, 2023
e36fde5
last fix?
dcherian Jun 27, 2023
6fefa59
more improvements.
dcherian Jun 27, 2023
0c5fb88
lint
dcherian Jun 27, 2023
e2c77b0
suppress gufunc error formatting warning
dcherian Jun 27, 2023
34e34c0
remove dask output spam
dcherian Jun 27, 2023
4 changes: 0 additions & 4 deletions .pre-commit-config.yaml
@@ -28,10 +28,6 @@ repos:
     hooks:
       - id: flake8
 
-  - repo: https://github.com/asottile/seed-isort-config
-    rev: v2.2.0
-    hooks:
-      - id: seed-isort-config
   - repo: https://github.com/PyCQA/isort
     rev: 5.12.0
     hooks:
3 changes: 2 additions & 1 deletion _config.yml
@@ -65,9 +65,10 @@ sphinx:
     # maintain old paths and redirect them (so google results dont go to 404)
     # https://github.com/wpilibsuite/sphinxext-rediraffe
     - sphinxext.rediraffe
+    - sphinx_exercise
   config:
     # application/vnd.holoviews_load.v0+json, application/vnd.holoviews_exec.v0+json
-    suppress_warnings: ['mystnb.unknown_mime_type']
+    suppress_warnings: ['mystnb.unknown_mime_type', 'misc.highlighting_failure']
     notfound_context:
       body: "<h1>Whoops! 404 Page Not Found</h1>\n\n<p>Sorry, this page doesn't exist. Many sections of this book have been updated recently.</p><p> Try the search box 🔎 to find what you're looking for!</p>"
     notfound_urls_prefix: /
8 changes: 6 additions & 2 deletions _toc.yml
@@ -47,8 +47,12 @@ parts:
   - file: advanced/apply_ufunc/apply_ufunc.md
     sections:
       - file: advanced/apply_ufunc/simple_numpy_apply_ufunc
-      - file: advanced/apply_ufunc/simple_dask_apply_ufunc
-      - file: advanced/apply_ufunc/vectorize_1d
+      - file: advanced/apply_ufunc/core-dimensions
+      - file: advanced/apply_ufunc/complex-output-numpy
+      - file: advanced/apply_ufunc/automatic-vectorizing-numpy
+      - file: advanced/apply_ufunc/dask_apply_ufunc
+      - file: advanced/apply_ufunc/numba-vectorization
+      - file: advanced/apply_ufunc/example-interp
   - file: advanced/map_blocks/map_blocks.md
     sections:
       - file: advanced/map_blocks/simple_map_blocks
359 changes: 359 additions & 0 deletions advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb
@@ -0,0 +1,359 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "6849dcdc-3484-4f41-8b23-51613d36812f",
"metadata": {
"tags": []
},
"source": [
"(vectorize)=\n",
"# Automatic Vectorization"
]
},
{
"cell_type": "markdown",
"id": "afc56d28-6e55-4967-b27d-28e2cc539cc7",
"metadata": {
"tags": []
},
"source": [
"Previously we looked at [applying functions](gentle-intro) on numpy arrays, and the concept of [core dimensions](core-dimensions).\n",
"We learned that functions commonly support specifying \"core dimensions\" through the `axis` keyword\n",
"argument. \n",
"\n",
"However many functions exist, that implicitly have core dimensions, but do not provide an `axis` keyword\n",
"argument. Applying such functions to a nD array usually involves one or multiple loops over the other dimensions\n",
"--- termed \"loop dimensions\" or \"broadcast dimensions\".\n",
"\n",
"\n",
"A good example is numpy's 1D interpolate function `numpy.interp`:\n",
"\n",
"```\n",
" Signature: np.interp(x, xp, fp, left=None, right=None, period=None)\n",
" Docstring:\n",
" One-dimensional linear interpolation.\n",
"\n",
" Returns the one-dimensional piecewise linear interpolant to a function\n",
" with given discrete data points (`xp`, `fp`), evaluated at `x`.\n",
"```\n",
"\n",
"This function expects 1D arrays as input, so there is one core dimension and we cannot easily apply \n",
"it to a nD array since there is no `axis` keyword argument. \n",
"\n",
"\n",
"Our goal here is to \n",
"1. Understand the difference between core dimensions and loop dimensions\n",
"1. Understand vectorization\n",
"1. Learn how to apply such functions without loops using `apply_ufunc` by providing the `vectorize` keyword argument.\n",
"\n",
"## Core dimensions and looping\n",
"\n",
"Let's say we want to\n",
"interpolate an array with two dimensions (`space`, `time`) over the `time` dimension, we might \n",
"1. loop over the `space` dimension, \n",
"1. subset the array to a 1D array at that `space` location, \n",
"1. Interpolate the 1D arrays to the new `time` vector, and\n",
"1. Assign that new interpolated 1D array to the appropriate location of a 2D output array\n",
"\n",
"In pseudo-code this might look like\n",
"\n",
"```python\n",
"for index in range(size_of_space_axis):\n",
" out[index, :] = np.interp(..., array[index, :], ...)\n",
"```\n",
"\n",
"\n",
"```{exercise}\n",
":label: coreloopdims\n",
"\n",
"Consider the example problem of interpolating a 2D array with dimensions `space` and `time` along the `time` dimension.\n",
"Which dimension is the core dimension, and which is the \"loop dimension\"?\n",
"```\n",
"```{solution} coreloopdims\n",
":class: dropdown\n",
"\n",
"`time` is the core dimension, and `space` is the loop dimension.\n",
"```\n",
"\n",
"## Vectorization\n",
"\n",
"The pattern of looping over any number of \"loop dimensions\" and applying a function along \"core dimensions\" \n",
"is so common that numpy provides wrappers that automate these steps: \n",
"1. [numpy.apply_along_axis](https://numpy.org/doc/stable/reference/generated/numpy.apply_along_axis.html)\n",
"1. [numpy.apply_over_axes](https://numpy.org/doc/stable/reference/generated/numpy.apply_over_axes.html)\n",
"1. [numpy.vectorize](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html)\n",
"\n",
"\n",
"`apply_ufunc` provides an easy interface to `numpy.vectorize` through the keyword argument `vectorize`. Here we see how to use\n",
"that to automatically apply `np.interp` along a single axis of a nD array\n",
"\n",
"## Load data\n",
"\n",
"First lets load an example dataset\n",
"\n",
"```{tip}\n",
"We'll reduce the length of error messages using `%xmode minimal` See the [ipython documentation](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-xmode) for details.\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "76aa13b8-5ced-4468-a72e-6b0a29172d6d",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%xmode minimal\n",
"\n",
"import xarray as xr\n",
"import numpy as np\n",
"\n",
"xr.set_options(display_expand_data=False)\n",
"\n",
"air = (\n",
" xr.tutorial.load_dataset(\"air_temperature\")\n",
" .air.sortby(\"lat\") # np.interp needs coordinate in ascending order\n",
" .isel(time=slice(4), lon=slice(3)) # choose a small subset for convenience\n",
")\n",
"air"
]
},
{
"cell_type": "markdown",
"id": "81356724-6c1a-4d4a-9a32-bb906a9419b2",
"metadata": {
"tags": []
},
"source": [
"## Review\n",
"\n",
"\n",
"We'll work with the `apply_ufunc` call from the section on [handling dimensions that change size](complex-output-change-size). See the \"Handling Complex Output\" section for how to get here.\n",
"\n",
"This version only works with 1D vectors. We will expand that to work with inputs of any number of dimensions."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cb286fa0-deba-4929-b18a-79af5acb0b5b",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"newlat = np.linspace(15, 75, 100)\n",
"\n",
"xr.apply_ufunc(\n",
" np.interp, # first the function\n",
" newlat,\n",
" air.lat,\n",
" air.isel(lon=0, time=0), # this version only works with 1D vectors\n",
" input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n",
" output_core_dims=[[\"lat\"]],\n",
" exclude_dims={\"lat\"},\n",
")"
]
},
{
"cell_type": "markdown",
"id": "e3382658-14e1-4842-a618-ce7a27948c31",
"metadata": {
"tags": []
},
"source": [
"## Try nD input\n",
"\n",
"Our goal is to interpolate latitude at every longitude and time, such that we go from a dataset with dimensions `(time: 4, lat: 25, lon: 3)` to `(time: 4, lat: 100, lon: 3)`. \n",
"\n",
"If we blindly try passing `air` (a 3D DataArray), we get a hard-to-understand error"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1476bcce-cc7b-4252-90dd-f45502dffb09",
"metadata": {
"tags": [
"raises-exception"
]
},
"outputs": [],
"source": [
"newlat = np.linspace(15, 75, 100)\n",
"\n",
"xr.apply_ufunc(\n",
" np.interp, # first the function\n",
" newlat,\n",
" air.lat,\n",
" air,\n",
" input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n",
" output_core_dims=[[\"lat\"]],\n",
" exclude_dims={\"lat\"},\n",
")"
]
},
{
"cell_type": "markdown",
"id": "1d1da9c2-a634-4920-890c-74d9bec9eab9",
"metadata": {
"tags": []
},
"source": [
"We will use a \"wrapper\" function `debug_interp` to examine what gets passed to `numpy.interp`.\n",
"\n",
"```{tip}\n",
"Such wrapper functions are a great way to understand and debug `apply_ufunc` use cases.\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fa306dcf-eec3-425c-b278-42d15bbc0e4f",
"metadata": {
"tags": [
"raises-exception"
]
},
"outputs": [],
"source": [
"def debug_interp(xi, x, data):\n",
" print(f\"data: {data.shape} | x: {x.shape} | xi: {xi.shape}\")\n",
" return np.interp(xi, x, data)\n",
"\n",
"\n",
"interped = xr.apply_ufunc(\n",
" debug_interp, # first the function\n",
" newlat,\n",
" air.lat,\n",
" air,\n",
" input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n",
" output_core_dims=[[\"lat\"]],\n",
" exclude_dims={\"lat\"}, # dimensions allowed to change size. Must be set!\n",
")"
]
},
{
"cell_type": "markdown",
"id": "6f5c928b-f8cb-4016-9d6d-39743f9c2976",
"metadata": {
"tags": []
},
"source": [
"That's a hard-to-interpret error from NumPy but our `print` call helpfully printed the shapes of the input data: \n",
"\n",
" data: (4, 3, 25) | x: (25,) | xi: (100,)\n",
"\n",
"We see that `apply_ufunc` passes the full 3D array to `interp1d_np` which in turn passes that on to `numpy.interp`. But `numpy.interp` requires a 1D input, and thus the error.\n",
"\n",
"Instead of passing the full 3D array we want loop over all combinations of `lon` and `time`; and apply our function to each corresponding vector of data along `lat`."
]
},
{
"cell_type": "markdown",
"id": "737cc6b4-522f-488c-9124-524cc42ebef3",
"metadata": {
"tags": []
},
"source": [
"## Vectorization with `np.vectorize`\n"
]
},
{
"cell_type": "markdown",
"id": "b6dac8da-8420-4fc4-9aeb-29b8999d4b37",
"metadata": {
"tags": []
},
"source": [
"`apply_ufunc` makes it easy to loop over the loop dimensions by specifying `vectorize=True`:\n",
"\n",
" vectorize : bool, optional\n",
" If True, then assume ``func`` only takes arrays defined over core\n",
" dimensions as input and vectorize it automatically with\n",
" :py:func:`numpy.vectorize`. This option exists for convenience, but is\n",
" almost always slower than supplying a pre-vectorized function.\n",
" Using this option requires NumPy version 1.12 or newer.\n",
" \n",
"\n",
"```{warning}\n",
"Also see the numpy documentation for [numpy.vectorize](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html). Most importantly\n",
"\n",
" The vectorize function is provided primarily for convenience, not for performance. \n",
" The implementation is essentially a for loop.\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d72fdd8c-44d2-4f6e-9fc4-7084e0e49986",
"metadata": {
"tags": [],
"user_expressions": []
},
"outputs": [],
"source": [
"interped = xr.apply_ufunc(\n",
" debug_interp, # first the function\n",
" newlat,\n",
" air.lat,\n",
" air,\n",
" input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n",
" output_core_dims=[[\"lat\"]],\n",
" exclude_dims={\"lat\"}, # dimensions allowed to change size. Must be set!\n",
" vectorize=True,\n",
")\n",
"interped"
]
},
{
"cell_type": "markdown",
"id": "d81f399e-1649-4d4b-ad28-81cba8403210",
"metadata": {
"tags": []
},
"source": [
"Wow that worked!\n",
"\n",
"Notice that \n",
"1. the printed input shapes are all 1D and correspond to one vector of size 25 along the `lat` dimension.\n",
"2. `debug_interp` was called 4x3 = 12 times which is the total number `lat` vectors since the size along `time` is 4, and the size along `lon` is 3.\n",
"3. The result `interped` is now an xarray object with coordinate values copied over from `data`. \n",
"\n",
"\n",
"```{note}\n",
"`lat` is now the *last* dimension in `interped`. This is a \"property\" of core dimensions: they are moved to the end before being sent to `interp1d_np` as noted in the docstring for `input_core_dims`\n",
"\n",
" Core dimensions are automatically moved to the last axes of input\n",
" variables before applying ``func``, which facilitates using NumPy style\n",
" generalized ufuncs [2]_.\n",
"```\n",
"\n",
"## Conclusion\n",
"This is why `apply_ufunc` is so convenient; it takes care of a lot of code necessary to apply functions that consume and produce numpy arrays to xarray objects.\n",
"\n",
"The `vectorize` keyword argument, when set to True, will use `numpy.vectorize` to apply the function by looping over the \"loop dimensions\" --- dimensions that are not the core dimensions for the applied function."
]
}
],
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}