From 7884868cef39a28c2f02ebb95b9e7cf5560182dd Mon Sep 17 00:00:00 2001 From: dcherian Date: Fri, 23 Jun 2023 11:51:01 -0600 Subject: [PATCH] finish automatic vectorizing --- .../automatic-vectorizing-numpy.ipynb | 385 +++++------------- 1 file changed, 112 insertions(+), 273 deletions(-) diff --git a/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb b/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb index 67f6c113..5ce61891 100644 --- a/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb +++ b/advanced/apply_ufunc/automatic-vectorizing-numpy.ipynb @@ -21,15 +21,14 @@ "source": [ "Previously we looked at applying functions on numpy arrays, and the concept of core dimensions.\n", "We learned that functions commonly support specifying \"core dimensions\" through the `axis` keyword\n", - "argument. However many functions exist, that implicitly have core dimensions, but do not provide an `axis` keyword\n", - "argument. Applying such functions to a nD array usually involves one or multiple loops.\n", - "Our goal here is to learn how to apply such functions without loops using `apply_ufunc` by providing the `vectorize`\n", - "keyword argument.\n", + "argument. \n", "\n", + "However many functions exist, that implicitly have core dimensions, but do not provide an `axis` keyword\n", + "argument. Applying such functions to a nD array usually involves one or multiple loops over the other dimensions\n", + "--- termed \"loop dimensions\" or \"broadcast dimensions\".\n", "\n", - "## Introduction\n", "\n", - "A good example is numpy's 1D interpolate function :py:func:`numpy.interp`:\n", + "A good example is numpy's 1D interpolate function `numpy.interp`:\n", "\n", "```\n", " Signature: np.interp(x, xp, fp, left=None, right=None, period=None)\n", @@ -44,13 +43,19 @@ "it to a nD array since there is no `axis` keyword argument. \n", "\n", "\n", + "Our goal here is to \n", + "1. Understand the difference between core dimensions and loop dimensions\n", + "1. Understand vectorization\n", + "1. Learn how to apply such functions without loops using `apply_ufunc` by providing the `vectorize` keyword argument.\n", + "\n", "### Core dimensions and looping\n", - "A common pattern to solve this problem is to loop over all the other dimensions. Let's say we want to\n", + "\n", + "Let's say we want to\n", "interpolate an array with two dimensions (`space`, `time`) over the `time` dimension, we might \n", - "1. loop over the space dimension, \n", - "2. subset the array to a 1D array at that `space` loction, \n", - "3. Interpolate the 1D arrays to the new time vector, and\n", - "4. Assign that new interpolated 1D array to the appropriate location of a 2D output array\n", + "1. loop over the `space` dimension, \n", + "1. subset the array to a 1D array at that `space` loction, \n", + "1. Interpolate the 1D arrays to the new `time` vector, and\n", + "1. Assign that new interpolated 1D array to the appropriate location of a 2D output array\n", "\n", "In pseudo-code this might look like\n", "\n", @@ -60,11 +65,16 @@ "```\n", "\n", "\n", - "#### Exercise\n", + "```{tip} Exercise\n", "\n", "Consider the example problem of interpolating a 2D array with dimensions `space` and `time` along the `time` dimension.\n", "Which dimension is the core dimension, and which is the \"loop dimension\"?\n", + "```\n", + "```{tip} Solution\n", + ":class: dropdown\n", "\n", + "`time` is the core dimension, and `space` is the loop dimension.\n", + "```\n", "\n", "## Vectorization\n", "\n", @@ -101,6 +111,8 @@ "import xarray as xr\n", "import numpy as np\n", "\n", + "xr.set_options(display_expand_data=False)\n", + "\n", "air = (\n", " xr.tutorial.load_dataset(\"air_temperature\")\n", " .air.sortby(\"lat\") # np.interp needs coordinate in ascending order\n", @@ -111,153 +123,132 @@ }, { "cell_type": "markdown", - "id": "ef161fb5-15d7-4d13-831e-6407d5293bc1", + "id": "81356724-6c1a-4d4a-9a32-bb906a9419b2", "metadata": { + "tags": [], "user_expressions": [] }, "source": [ - "The function we will apply is `np.interp` which expects 1D numpy arrays. This functionality is already implemented in xarray so we use that capability to make sure we are not making mistakes." + "## Review\n", + "\n", + "\n", + "We'll work with the `apply_ufunc` call from the section on [handling dimensions that change size](complex-output-change-size). See the \"Handling Complex Output\" section for how to get here.\n", + "\n", + "This version only works with 1D vectors. We will expand that to work with inputs of any number of dimensions." ] }, { "cell_type": "code", "execution_count": null, - "id": "bf37f81b-f6b5-42d4-b20e-0d22e3d124d6", + "id": "cb286fa0-deba-4929-b18a-79af5acb0b5b", "metadata": { "tags": [] }, "outputs": [], "source": [ "newlat = np.linspace(15, 75, 100)\n", - "air.interp(lat=newlat)" + "\n", + "xr.apply_ufunc(\n", + " np.interp, # first the function\n", + " newlat,\n", + " air.lat,\n", + " air.isel(lon=0, time=0), # this version only works with 1D vectors\n", + " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", + " output_core_dims=[[\"lat\"]],\n", + " exclude_dims={\"lat\"},\n", + ")" ] }, { "cell_type": "markdown", - "id": "426f5211-031e-4096-8bdc-78a0cb4c1766", + "id": "e3382658-14e1-4842-a618-ce7a27948c31", "metadata": { + "tags": [], "user_expressions": [] }, "source": [ - "Let's define a function that works with one vector of data along `lat` at a time." + "## Try nD input\n", + "\n", + "If we blindly try passing `air` (a 3D DataArray), we get a hard-to-understand error" ] }, { "cell_type": "code", "execution_count": null, - "id": "cb286fa0-deba-4929-b18a-79af5acb0b5b", - "metadata": {}, - "outputs": [], - "source": [ - "def interp1d_np(data, x, xi):\n", - " return np.interp(xi, x, data)\n", - "\n", - "\n", - "interped = interp1d_np(air.isel(time=0, lon=0), air.lat, newlat)\n", - "expected = air.interp(lat=newlat)\n", - "\n", - "# no errors are raised if values are equal to within floating point precision\n", - "np.testing.assert_allclose(expected.isel(time=0, lon=0).values, interped)" - ] - }, - { - "cell_type": "markdown", - "id": "454a9887-948a-45f0-9b7c-dc07bb542726", + "id": "1476bcce-cc7b-4252-90dd-f45502dffb09", "metadata": { - "user_expressions": [] + "tags": [] }, + "outputs": [], "source": [ - "## `exclude_dims`" + "newlat = np.linspace(15, 75, 100)\n", + "\n", + "xr.apply_ufunc(\n", + " np.interp, # first the function\n", + " newlat,\n", + " air.lat,\n", + " air,\n", + " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", + " output_core_dims=[[\"lat\"]],\n", + " exclude_dims={\"lat\"},\n", + ")" ] }, { "cell_type": "markdown", - "id": "42dad673-7987-4dff-870b-c2bdff440993", + "id": "1d1da9c2-a634-4920-890c-74d9bec9eab9", "metadata": { + "tags": [], "user_expressions": [] }, "source": [ - "\n", - "```\n", - "exclude_dims : set, optional\n", - " Core dimensions on the inputs to exclude from alignment and\n", - " broadcasting entirely. Any input coordinates along these dimensions\n", - " will be dropped. Each excluded dimension must also appear in\n", - " ``input_core_dims`` for at least one argument. Only dimensions listed\n", - " here are allowed to change size between input and output objects.\n", - "```" + "We will use a \"wrapper\" function `interp1d_np` to examine what gets passed to `numpy.interp`" ] }, { "cell_type": "code", "execution_count": null, - "id": "0ce03d38-db86-49c2-a60d-46d312ad60f8", + "id": "fa306dcf-eec3-425c-b278-42d15bbc0e4f", "metadata": { - "tags": [ - "raises-exception" - ] + "tags": [] }, "outputs": [], "source": [ - "xr.apply_ufunc(\n", + "def interp1d_np(xi, x, data):\n", + " print(f\"data: {data.shape} | x: {x.shape} | xi: {xi.shape}\")\n", + " return np.interp(xi, x, data)\n", + "\n", + "\n", + "interped = xr.apply_ufunc(\n", " interp1d_np, # first the function\n", - " air.isel(time=0, lon=0), # now arguments in the order expected by 'interp1_np'\n", - " air.lat,\n", " newlat,\n", - " exclude_dims=set((\"lat\",)), # dimensions allowed to change size. Must be set!\n", + " air.lat,\n", + " air,\n", + " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", + " output_core_dims=[[\"lat\"]],\n", + " exclude_dims={\"lat\"}, # dimensions allowed to change size. Must be set!\n", ")" ] }, { "cell_type": "markdown", - "id": "79db343e-efcc-4fa5-a2ec-1da91bc93225", - "metadata": { - "user_expressions": [] - }, - "source": [ - "## Core dimensions\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "id": "e7c81b20-83a1-4575-8e9f-bf929ecc66ba", + "id": "6f5c928b-f8cb-4016-9d6d-39743f9c2976", "metadata": { + "tags": [], "user_expressions": [] }, "source": [ - "Core dimensions are central to using `apply_ufunc`. In our case, our function expects to receive a 1D vector along `lat` — this is the dimension that is \"core\" to the function's functionality. Multiple core dimensions are possible. `apply_ufunc` needs to know which dimensions of each variable are core dimensions.\n", - "\n", - " input_core_dims : Sequence[Sequence], optional\n", - " List of the same length as ``args`` giving the list of core dimensions\n", - " on each input argument that should not be broadcast. By default, we\n", - " assume there are no core dimensions on any input arguments.\n", + "That's a hard-to-interpret error but our `print` call helpfully printed the shapes of the input data: \n", "\n", - " For example, ``input_core_dims=[[], ['time']]`` indicates that all\n", - " dimensions on the first argument and all dimensions other than 'time'\n", - " on the second argument should be broadcast.\n", + " data: (4, 3, 25) | x: (25,) | xi: (100,)\n", "\n", - " Core dimensions are automatically moved to the last axes of input\n", - " variables before applying ``func``, which facilitates using NumPy style\n", - " generalized ufuncs [2]_.\n", - " \n", - " output_core_dims : List[tuple], optional\n", - " List of the same length as the number of output arguments from\n", - " ``func``, giving the list of core dimensions on each output that were\n", - " not broadcast on the inputs. By default, we assume that ``func``\n", - " outputs exactly one array, with axes corresponding to each broadcast\n", - " dimension.\n", - "\n", - " Core dimensions are assumed to appear as the last dimensions of each\n", - " output in the provided order.\n", - " \n", - "Next we specify `\"lat\"` as `input_core_dims` on both `air` and `air.lat`" + "We see that `apply_ufunc` passes the full 3D array to `interp1d_np` which in turn passes that on to `numpy.interp`. This cannot work." ] }, { "cell_type": "code", "execution_count": null, - "id": "2ced88c6-84ca-4294-9ed6-479ee353b3c8", + "id": "0ce03d38-db86-49c2-a60d-46d312ad60f8", "metadata": { "tags": [ "raises-exception" @@ -270,142 +261,30 @@ " air.isel(time=0, lon=0), # now arguments in the order expected by 'interp1_np'\n", " air.lat,\n", " newlat,\n", - " input_core_dims=[[\"lat\"], [\"lat\"], []],\n", " exclude_dims=set((\"lat\",)), # dimensions allowed to change size. Must be set!\n", ")" ] }, { "cell_type": "markdown", - "id": "cb20872d-06ae-400d-ada7-443ce6fa02a4", - "metadata": { - "user_expressions": [] - }, - "source": [ - "xarray is telling us that it expected to receive back a numpy array with 0 dimensions but instead received an array with 1 dimension corresponding to `newlat`. We can fix this by specifying `output_core_dims`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b3957ac5-c622-41f9-b774-5527693d370e", - "metadata": {}, - "outputs": [], - "source": [ - "xr.apply_ufunc(\n", - " interp1d_np, # first the function\n", - " air.isel(time=0, lon=0), # now arguments in the order expected by 'interp1_np'\n", - " air.lat,\n", - " newlat,\n", - " input_core_dims=[[\"lat\"], [\"lat\"], []], # list with one entry per arg\n", - " output_core_dims=[[\"lat\"]],\n", - " exclude_dims=set((\"lat\",)), # dimensions allowed to change size. Must be set!\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "60cc2c23-bdf9-41f1-a09a-a6df8531fd21", - "metadata": { - "user_expressions": [] - }, - "source": [ - "Finally we get some output! Let's check that this is right\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d6778fe6-5433-4f8d-bb40-2fb41b78b574", - "metadata": {}, - "outputs": [], - "source": [ - "interped = xr.apply_ufunc(\n", - " interp1d_np, # first the function\n", - " air.isel(time=0, lon=0), # now arguments in the order expected by 'interp1_np'\n", - " air.lat,\n", - " newlat,\n", - " input_core_dims=[[\"lat\"], [\"lat\"], []], # list with one entry per arg\n", - " output_core_dims=[[\"lat\"]],\n", - " exclude_dims=set((\"lat\",)), # dimensions allowed to change size. Must be set!\n", - ")\n", - "interped[\"lat\"] = newlat # need to add this manually\n", - "xr.testing.assert_allclose(expected.isel(time=0, lon=0), interped)" - ] - }, - { - "cell_type": "markdown", - "id": "3c7ea6dd-260e-4315-9d6d-3dbe2af2538e", - "metadata": { - "user_expressions": [] - }, - "source": [ - "No errors are raised so it is right!" - ] - }, - { - "cell_type": "markdown", - "id": "3441f26d-8076-47a4-82c1-108c4c870cb9", + "id": "737cc6b4-522f-488c-9124-524cc42ebef3", "metadata": { + "tags": [], "user_expressions": [] }, "source": [ - "## Vectorization with `np.vectorize`" + "## Vectorization with `np.vectorize`\n" ] }, { "cell_type": "markdown", - "id": "121aeb07-d444-4394-a4ef-9199dc2725a5", + "id": "b6dac8da-8420-4fc4-9aeb-29b8999d4b37", "metadata": { + "tags": [], "user_expressions": [] }, "source": [ - "Now our function currently only works on one vector of data which is not so useful given our 3D dataset.\n", - "Let's try passing the whole dataset. We add a `print` statement so we can see what our function receives." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9097c318-5114-4f00-a8e8-f85cdae39bf3", - "metadata": { - "tags": [ - "raises-exception" - ] - }, - "outputs": [], - "source": [ - "def interp1d_np(data, x, xi):\n", - " print(f\"data: {data.shape} | x: {x.shape} | xi: {xi.shape}\")\n", - " return np.interp(xi, x, data)\n", - "\n", - "\n", - "interped = xr.apply_ufunc(\n", - " interp1d_np, # first the function\n", - " air.isel(lon=slice(3), time=slice(4)), # now arguments in the order expected by 'interp1_np'\n", - " air.lat,\n", - " newlat,\n", - " input_core_dims=[[\"lat\"], [\"lat\"], []], # list with one entry per arg\n", - " output_core_dims=[[\"lat\"]],\n", - " exclude_dims=set((\"lat\",)), # dimensions allowed to change size. Must be set!\n", - ")\n", - "interped[\"lat\"] = newlat # need to add this manually\n", - "xr.testing.assert_allclose(expected.isel(time=0, lon=0), interped)" - ] - }, - { - "cell_type": "markdown", - "id": "f4873ac9-7000-4a0a-9d8f-e425e2bf2671", - "metadata": { - "tags": [] - }, - "source": [ - "That's a hard-to-interpret error but our `print` call helpfully printed the shapes of the input data: \n", - "\n", - " data: (10, 53, 25) | x: (25,) | xi: (100,)\n", - "\n", - "We see that `air` has been passed as a 3D numpy array which is not what `np.interp` expects. Instead we want loop over all combinations of `lon` and `time`; and apply our function to each corresponding vector of data along `lat`.\n", + "Instead of passing the full 3D array we want loop over all combinations of `lon` and `time`; and apply our function to each corresponding vector of data along `lat`.\n", "`apply_ufunc` makes this easy by specifying `vectorize=True`:\n", "\n", " vectorize : bool, optional\n", @@ -427,71 +306,22 @@ { "cell_type": "code", "execution_count": null, - "id": "7f8d8f74-3282-403d-8cce-7434780b29a3", - "metadata": { - "tags": [ - "raises-exception" - ] - }, - "outputs": [], - "source": [ - "def interp1d_np(data, x, xi):\n", - " print(f\"data: {data.shape} | x: {x.shape} | xi: {xi.shape}\")\n", - " return np.interp(xi, x, data)\n", - "\n", - "\n", - "interped = xr.apply_ufunc(\n", - " interp1d_np, # first the function\n", - " air, # now arguments in the order expected by 'interp1_np'\n", - " air.lat, # as above\n", - " newlat, # as above\n", - " input_core_dims=[[\"lat\"], [\"lat\"], []], # list with one entry per arg\n", - " output_core_dims=[[\"lat\"]], # returned data has one dimension\n", - " exclude_dims=set((\"lat\",)), # dimensions allowed to change size. Must be set!\n", - " vectorize=True, # loop over non-core dims\n", - ")\n", - "interped[\"lat\"] = newlat # need to add this manually\n", - "xr.testing.assert_allclose(expected, interped)" - ] - }, - { - "cell_type": "markdown", - "id": "a0e1347f-3473-4fb6-9506-de46e314ba8c", + "id": "d72fdd8c-44d2-4f6e-9fc4-7084e0e49986", "metadata": { + "tags": [], "user_expressions": [] }, - "source": [ - "This unfortunately is another cryptic error from numpy. \n", - "\n", - "Notice that `newlat` is not an xarray object. Let's add a dimension name `new_lat` and modify the call. Note this cannot be `lat` because xarray expects dimensions to be the same size (or broadcastable) among all inputs. `output_core_dims` needs to be modified appropriately. We'll manually rename `new_lat` back to `lat` for easy checking." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "add52a3e-6c10-4761-ad39-101a58cda921", - "metadata": {}, "outputs": [], "source": [ - "def interp1d_np(data, x, xi):\n", - " print(f\"data: {data.shape} | x: {x.shape} | xi: {xi.shape}\")\n", - " return np.interp(xi, x, data)\n", - "\n", - "\n", "interped = xr.apply_ufunc(\n", " interp1d_np, # first the function\n", - " air, # now arguments in the order expected by 'interp1_np'\n", - " air.lat, # as above\n", - " newlat, # as above\n", - " input_core_dims=[[\"lat\"], [\"lat\"], [\"new_lat\"]], # list with one entry per arg\n", - " output_core_dims=[[\"new_lat\"]], # returned data has one dimension\n", - " exclude_dims=set((\"lat\",)), # dimensions allowed to change size. Must be a set!\n", - " vectorize=True, # loop over non-core dims\n", - ")\n", - "interped = interped.rename({\"new_lat\": \"lat\"})\n", - "interped[\"lat\"] = newlat # need to add this manually\n", - "xr.testing.assert_allclose(\n", - " expected.transpose(*interped.dims), interped # order of dims is different\n", + " newlat,\n", + " air.lat,\n", + " air,\n", + " input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n", + " output_core_dims=[[\"lat\"]],\n", + " exclude_dims={\"lat\"}, # dimensions allowed to change size. Must be set!\n", + " vectorize=True,\n", ")\n", "interped" ] @@ -504,15 +334,24 @@ "user_expressions": [] }, "source": [ - "Notice that the printed input shapes are all 1D and correspond to one vector along the `lat` dimension.\n", + "Wow that worked!\n", + "\n", + "Notice that \n", + "1. the printed input shapes are all 1D and correspond to one vector of size 25 along the `lat` dimension.\n", + "2. `interp1d_np` was called 4x3 = 12 times which is the total number `lat` vectors since the size along `time` is 4, and the size along `lon` is 3.\n", + "3. The result `interped` is now an xarray object with coordinate values copied over from `data`. \n", "\n", - "The result is now an xarray object with coordinate values copied over from `data`. This is why `apply_ufunc` is so convenient; it takes care of a lot of boilerplate necessary to apply functions that consume and produce numpy arrays to xarray objects.\n", "\n", - "One final point: `lat` is now the *last* dimension in `interped`. This is a \"property\" of core dimensions: they are moved to the end before being sent to `interp1d_np` as was noted in the docstring for `input_core_dims`\n", + "```{note}\n", + "`lat` is now the *last* dimension in `interped`. This is a \"property\" of core dimensions: they are moved to the end before being sent to `interp1d_np` as noted in the docstring for `input_core_dims`\n", "\n", " Core dimensions are automatically moved to the last axes of input\n", " variables before applying ``func``, which facilitates using NumPy style\n", - " generalized ufuncs [2]_." + " generalized ufuncs [2]_.\n", + "```\n", + "\n", + "## Conclusion\n", + "This is why `apply_ufunc` is so convenient; it takes care of a lot of code necessary to apply functions that consume and produce numpy arrays to xarray objects.\n" ] } ],