Merge pull request #131 from lsst-sqre/tickets/DM-37516
DM-37516: Add Google Cloud Storage utilities and testing support
rra authored Jan 12, 2023
2 parents d141a7a + ab7ac20 commit 5e3cb8d
Showing 15 changed files with 751 additions and 7 deletions.
6 changes: 5 additions & 1 deletion CHANGELOG.md
@@ -5,8 +5,12 @@ Headline template:
X.Y.Z (YYYY-MM-DD)
-->

-## 3.5.0 (unreleased)
+## 3.5.0 (2023-01-12)

- Add new helper class `safir.gcs.SignedURLService` to generate signed URLs to Google Cloud Storage objects using workload identity.
To use this class, depend on `safir[gcs]`.
- Add the `safir.testing.gcs` module, which can be used to mock the Google Cloud Storage API for testing.
To use this module, depend on `safir[gcs]`.
- Add new helper class `safir.pydantic.CamelCaseModel`, which is identical to `pydantic.BaseModel` except that it accepts camel-case keys via the `safir.pydantic.to_camel_case` alias generator and overrides `dict` and `json` to export in camel case by default (see the sketch below).
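
  A minimal sketch of the intended usage (the model and field names here are illustrative, not part of Safir):

  ```python
  from safir.pydantic import CamelCaseModel


  class Example(CamelCaseModel):
      some_field: str


  # Camel-case keys are accepted on input, and dict exports camel case
  # by default.
  example = Example.parse_obj({"someField": "value"})
  assert example.some_field == "value"
  assert example.dict() == {"someField": "value"}
  ```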

## 3.4.0 (2022-11-29)
2 changes: 1 addition & 1 deletion Makefile
@@ -1,6 +1,6 @@
.PHONY: init
init:
	pip install --upgrade pip tox tox-docker pre-commit
-	pip install --upgrade -e ".[arq,db,dev,kubernetes]"
+	pip install --upgrade -e ".[arq,db,dev,gcs,kubernetes]"
	pre-commit install
	rm -rf .tox
6 changes: 6 additions & 0 deletions docs/api.rst
@@ -28,6 +28,9 @@ API reference
.. automodapi:: safir.dependencies.logger
   :include-all-objects:

.. automodapi:: safir.gcs
   :include-all-objects:

.. automodapi:: safir.kubernetes
   :include-all-objects:

@@ -49,5 +52,8 @@ API reference
.. automodapi:: safir.pydantic
   :include-all-objects:

.. automodapi:: safir.testing.gcs
   :include-all-objects:

.. automodapi:: safir.testing.kubernetes
   :include-all-objects:
2 changes: 2 additions & 0 deletions docs/documenteer.toml
@@ -16,6 +16,8 @@ nitpick_ignore_regex = [
    ['py:.*', 'starlette.*'],
]
nitpick_ignore = [
    ['py:class', 'unittest.mock.Base'],
    ['py:class', 'unittest.mock.CallableMixin'],
    ["py:obj", "JobMetadata.id"],
]

1 change: 0 additions & 1 deletion docs/user-guide/arq.rst
@@ -50,7 +50,6 @@ If your app uses a configuration system like ``pydantic.BaseSettings``, this example
   class Config(BaseSettings):
       arq_queue_url: RedisDsn = Field(
           "redis://localhost:6379/1", env="APP_ARQ_QUEUE_URL"
       )
113 changes: 113 additions & 0 deletions docs/user-guide/gcs.rst
@@ -0,0 +1,113 @@
##################################
Using the Google Cloud Storage API
##################################

Safir-based applications are encouraged to use the `google-cloud-storage <https://cloud.google.com/python/docs/reference/storage/latest>`__ Python module.
It provides both a sync and async API and works well with `workload identity <https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity>`__.

Google Cloud Storage support in Safir is optional.
To use it, depend on ``safir[gcs]``.

Generating signed URLs
======================

The preferred way to generate signed URLs for Google Cloud Storage objects is to use workload identity for the running pod, assign it a Kubernetes service account bound to a Google Cloud service account, and set appropriate permissions on that Google Cloud service account.

The credentials provided by workload identity cannot be used to sign URLs directly.
Instead, one first has to get impersonation credentials for the same service account, and then use those to sign the URL.
`safir.gcs.SignedURLService` automates this process.

To use this class, the workload identity of the running pod must have ``roles/iam.serviceAccountTokenCreator`` for a Google service account, and that service account must have appropriate GCS permissions for the object for which one wants to create a signed URL.
Then, do the following:

.. code-block:: python

   from datetime import timedelta

   from safir.gcs import SignedURLService

   url_service = SignedURLService("service-account")
   url = url_service.signed_url("s3://bucket/path/to/file", "application/fits")

The argument to the constructor is the name of the Google Cloud service account that will be used to sign the URLs.
This should be the one for which the workload identity has impersonation permissions.
(Generally, this should be the same service account to which the workload identity is bound.)

Optionally, you can specify the lifetime of the signed URLs as a second argument, which should be a `datetime.timedelta`.
If not given, the default is one hour.

The path to the Google Cloud Storage object for which to create a signed URL must be an S3 URL.
The second argument to `~safir.gcs.SignedURLService.signed_url` is the MIME type of the underlying object, which will be encoded in the signed URL.
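
For example, to issue URLs that expire after fifteen minutes instead of the default hour (a sketch; the bucket and object names are made up):

.. code-block:: python

   from datetime import timedelta

   from safir.gcs import SignedURLService

   url_service = SignedURLService("service-account", timedelta(minutes=15))
   url = url_service.signed_url("s3://bucket/path/to/file", "application/fits")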

Testing with mock Google Cloud Storage
======================================

The `safir.testing.gcs` module provides a limited, mock Google Cloud Storage (GCS) API suitable for testing.
By default, this mock provides just enough functionality to allow retrieving a bucket, retrieving a blob from the bucket, and creating a signed URL for the blob.
If a path to a tree of files is given, it can also mock some other blob attributes and methods based on the underlying files.

Testing signed URLs
-------------------

Applications that want to run tests with the mock GCS API should define a fixture (in ``conftest.py``) as follows:

.. code-block:: python

   from datetime import timedelta
   from typing import Iterator

   import pytest

   from safir.testing.gcs import MockStorageClient, patch_google_storage


   @pytest.fixture
   def mock_gcs() -> Iterator[MockStorageClient]:
       yield from patch_google_storage(
           expected_expiration=timedelta(hours=1), bucket_name="some-bucket"
       )

The ``expected_expiration`` argument is optional and tells the mock object what expiration the application is expected to request for its signed URLs.
If this option is given and the application, when tested, requests a signed URL with a different expiration, the mock will raise an assertion failure.

The ``bucket_name`` argument is optional.
If given, an attempt by the tested application to request a bucket of any other name will raise an assertion failure.

When this fixture is in use, the tested application can use Google Cloud Storage as normal, as long as it only makes the method calls supported by the mock object.
Some parameters to the method requesting a signed URL will be checked for expected values.
The returned signed URL will always be :samp:`https://example.com/{name}`, where the last component will be the requested blob name.
This can then be checked via assertions in tests.
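
For example, a test using the fixture above might look like the following sketch (the bucket and blob names are made up, and in a real test the URL would normally be produced by the application code under test rather than requested directly):

.. code-block:: python

   from datetime import timedelta

   from google.cloud import storage

   from safir.testing.gcs import MockStorageClient


   def test_signed_url(mock_gcs: MockStorageClient) -> None:
       # With the fixture active, storage.Client() returns the mock client.
       client = storage.Client()
       bucket = client.bucket("some-bucket")
       blob = bucket.blob("path/to/file")
       url = blob.generate_signed_url(
           version="v4",
           expiration=timedelta(hours=1),
           method="GET",
           response_type="application/fits",
       )
       assert url == "https://example.com/path/to/file"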

To ensure that the mocking is done correctly, be sure not to import ``Client``, ``Credentials``, or similar symbols from ``google.cloud.storage`` or ``google.auth`` directly into a module.
Instead, use:

.. code-block:: python

   from google.cloud import storage

and then use, for example, ``storage.Client``.
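
For instance, application code structured like this hypothetical helper resolves ``storage.Client`` at call time and so picks up the mock automatically when the fixture is active:

.. code-block:: python

   from datetime import timedelta

   from google.cloud import storage


   def signed_url_for(bucket_name: str, blob_name: str) -> str:
       # Looking up storage.Client here, rather than importing Client
       # directly, is what allows the test fixture to substitute the mock.
       client = storage.Client()
       blob = client.bucket(bucket_name).blob(blob_name)
       return blob.generate_signed_url(
           version="v4",
           expiration=timedelta(hours=1),
           method="GET",
       )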

Testing with a tree of files
----------------------------

To mock additional blob attributes and methods, point the test fixture at a tree of files with the ``path`` parameter.

.. code-block:: python
   :emphasize-lines: 1, 7

   from pathlib import Path


   @pytest.fixture
   def mock_gcs() -> Iterator[MockStorageClient]:
       yield from patch_google_storage(
           path=Path(__file__).parent / "data" / "files",
           expected_expiration=timedelta(hours=1),
           bucket_name="some-bucket",
       )

The resulting blobs will then correspond to the files on disk and will support the additional attributes ``size``, ``updated``, and ``etag``, and the additional methods ``download_as_bytes``, ``exists``, ``open``, and ``reload`` (which does nothing).
The Etag value of the blob will be the string version of its inode number.

Mock signed URLs will continue to work exactly the same as when a path is not provided.
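
For example, with the ``path``-based fixture above, a test can exercise the file-backed attributes and methods (a sketch; ``example.fits`` stands in for whatever file actually exists in the test data tree):

.. code-block:: python

   from google.cloud import storage

   from safir.testing.gcs import MockStorageClient


   def test_blob_contents(mock_gcs: MockStorageClient) -> None:
       client = storage.Client()
       bucket = client.bucket("some-bucket")
       blob = bucket.blob("example.fits")
       assert blob.exists()
       contents = blob.download_as_bytes()
       assert blob.size == len(contents)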
1 change: 1 addition & 0 deletions docs/user-guide/index.rst
@@ -23,3 +23,4 @@ User guide
   ivoa
   kubernetes
   pydantic
   gcs
10 changes: 7 additions & 3 deletions pyproject.toml
@@ -34,6 +34,9 @@ dependencies = [
dynamic = ["version"]

[project.optional-dependencies]
+arq = [
+    "arq>=0.23"
+]
db = [
    "asyncpg",
    "sqlalchemy[asyncio]",
@@ -52,12 +55,13 @@ dev = [
# documentation
"documenteer[guide]>=0.7.0b2",
]
+gcs = [
+    "google-auth",
+    "google-cloud-storage"
+]
kubernetes = [
    "kubernetes_asyncio"
]
-arq = [
-    "arq>=0.23"
-]

[[project.authors]]
name = "Association of Universities for Research in Astronomy, Inc. (AURA)"
100 changes: 100 additions & 0 deletions src/safir/gcs.py
@@ -0,0 +1,100 @@
"""Utilities for interacting with Google Cloud Storage."""

from __future__ import annotations

from datetime import timedelta
from typing import Optional
from urllib.parse import urlparse

import google.auth
from google.auth import impersonated_credentials
from google.cloud import storage

__all__ = ["SignedURLService"]


class SignedURLService:
    """Generate signed URLs for Google Cloud Storage blobs.

    Uses default credentials plus credential impersonation to generate
    signed URLs for Google Cloud Storage blobs. This is the correct approach
    when running as a Kubernetes pod using workload identity.

    Parameters
    ----------
    service_account
        The service account to use to sign the URLs. The workload identity
        must have access to generate service account tokens for that service
        account.
    lifetime
        Lifetime of the generated signed URLs.

    Notes
    -----
    The workload identity (or other default credentials) under which the
    caller is running must have ``roles/iam.serviceAccountTokenCreator`` on
    the service account given in the ``service_account`` parameter. This is
    how a workload identity can retrieve a key that can be used to create a
    signed URL.

    See `gcs_signedurl <https://github.com/salrashid123/gcs_signedurl>`__ for
    additional details on how this works.
    """

    def __init__(
        self, service_account: str, lifetime: timedelta = timedelta(hours=1)
    ) -> None:
        self._lifetime = lifetime
        self._service_account = service_account
        self._gcs = storage.Client()
        self._credentials, _ = google.auth.default()

    def signed_url(self, uri: str, mime_type: Optional[str]) -> str:
        """Generate a signed URL for a given storage object.

        Parameters
        ----------
        uri
            URI for the storage object. This must start with ``s3://`` and
            use the S3 URI syntax to specify bucket and blob of a Google
            Cloud Storage object.
        mime_type
            MIME type of the object, for encoding in the signed URL.

        Returns
        -------
        str
            New signed URL, which will be valid for the lifetime given when
            this object was created.

        Raises
        ------
        ValueError
            The ``uri`` parameter is not an S3 URI.

        Notes
        -----
        This is inefficient, since it gets new signing credentials each time
        it generates a signed URL. Doing better will require figuring out
        the lifetime and refreshing the credentials when the lifetime has
        expired.
        """
        parsed_uri = urlparse(uri)
        if parsed_uri.scheme != "s3":
            raise ValueError(f"URI {uri} is not an S3 URI")
        bucket = self._gcs.bucket(parsed_uri.netloc)
        blob = bucket.blob(parsed_uri.path[1:])
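
        # Default workload identity credentials cannot sign URLs directly,
        # so impersonate the target service account and sign with
        # short-lived credentials for it. The lifetime here is in seconds;
        # the credentials are needed only long enough to sign the URL.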
        signing_credentials = impersonated_credentials.Credentials(
            source_credentials=self._credentials,
            target_principal=self._service_account,
            # target_scopes takes a sequence of OAuth scope strings.
            target_scopes=(
                "https://www.googleapis.com/auth/devstorage.read_only",
            ),
            lifetime=2,
        )
        return blob.generate_signed_url(
            version="v4",
            expiration=self._lifetime,
            method="GET",
            response_type=mime_type,
            credentials=signing_credentials,
        )