
Adds OIDC authentication support for pubsub Apache Pulsar #3023

Merged
merged 7 commits into dapr:master from the pubsub-pulsar-authentication-oidc branch on Aug 1, 2023

Conversation

@JoshVanL (Contributor) commented on Jul 31, 2023

Closes #2591

This PR adds support for OIDC authentication to the Pulsar pubsub component. It is configured via the following metadata fields:

- `oidcTokenCAPEM`: CA PEM certificate bundle used to connect to the OIDC issuer. If not defined, the system certificate pool is used.
- `oidcTokenURL`: URL from which to request the OIDC client_credentials token.
- `oidcClientID`: token client ID.
- `oidcClientSecret`: token client secret.
- `oidcAudiences`: comma-separated list of audiences to request. Must not be empty.
- `oidcScopes`: comma-separated list of scopes to request. If empty, defaults to `"openid"`. If defined, `"openid"` must be present.

Another field, `authType`, has been added, which accepts `none`, `token`, and `oidc`. It defaults to `none`; however, if it is undefined and the `token` metadata field is defined, it defaults to `token` to maintain backwards compatibility.
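For illustration, a minimal sketch of that backwards-compatible defaulting (the function name and raw metadata map are assumptions, not the actual implementation):

```go
package main

import "fmt"

// resolveAuthType is a hypothetical helper showing the defaulting rules
// described above: an unset authType falls back to "token" when a token is
// present, and to "none" otherwise.
func resolveAuthType(meta map[string]string) (string, error) {
	switch meta["authType"] {
	case "":
		if meta["token"] != "" {
			// Backwards compatibility: an existing token implies token auth.
			return "token", nil
		}
		return "none", nil
	case "none", "token", "oidc":
		return meta["authType"], nil
	default:
		return "", fmt.Errorf("invalid authType %q: expected none, token, or oidc", meta["authType"])
	}
}
```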


Adds a certification test which runs the current Pulsar suite, but with OIDC authentication enabled. It uses Pulsar v3, since that is the minimum version with OIDC authentication support, and runs a mock OIDC issuer server during the test. The CA PEM bundle is templated at test runtime from the self-signed certificate generated by the mock server.
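For context, a rough sketch of what that templating step might look like; the helper name, token endpoint path, and credential values are placeholders, not the actual test code (only the metadata field names come from the PR description above):

```go
package pulsar_test

import (
	"fmt"
	"os"
)

// buildOIDCMetadata illustrates injecting the mock issuer's self-signed CA,
// read at test runtime, into the component metadata. All values here are
// hypothetical placeholders.
func buildOIDCMetadata(caPath, issuerURL string) (map[string]string, error) {
	caPEM, err := os.ReadFile(caPath)
	if err != nil {
		return nil, fmt.Errorf("reading mock issuer CA: %w", err)
	}
	return map[string]string{
		"authType":         "oidc",
		"oidcTokenURL":     issuerURL + "/oauth2/token", // assumed token endpoint path
		"oidcClientID":     "dapr",
		"oidcClientSecret": "dapr-secret",
		"oidcAudiences":    "pulsar",
		"oidcScopes":       "openid",
		"oidcTokenCAPEM":   string(caPEM),
	}, nil
}
```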


The PR doesn't move the Kafka OIDC implementation to use the shared internal package, since Kafka OIDC auth isn't tested. Once a certification test is written, we can move Kafka to use the internal package.

Docs PR: dapr/docs#3655


Question: because the Pulsar certification tests now take some time, running the suite in both non-OIDC and OIDC auth modes, do we want to split them into two separate workflows?

/cc @yaron2

@JoshVanL requested review from a team as code owners on July 31, 2023 14:16
@JoshVanL force-pushed the pubsub-pulsar-authentication-oidc branch from 601a58e to dfbce6c on August 1, 2023 17:33
@yaron2 merged commit af5bace into dapr:master on Aug 1, 2023
84 checks passed
@ItalyPaleAle (Contributor) left a comment

This PR should not have been merged this quickly…

return
case <-ctx.Done():
return
case <-c.clock.After(renewDuration):
Contributor

This is an anti-pattern for using OAuth, and we should not refresh tokens in the background. Instead, tokens should be fetched lazily and refreshed only if necessary. This is especially true when the consumer of the tokens is something like Pulsar, where connections are long-lived, so tokens don't need to be refreshed constantly.

Contributor Author

Why is this an anti-pattern? I have never heard of this. Fetching tokens lazily is strictly worse from the Dapr perspective, increasing latency for the pubsub app, which is particularly acute for publishing. I don't think we should be concerned with reducing load on the OIDC provider; signing JWTs is not resource intensive and is basically free for all intents and purposes. Regardless of long-lived connections etc., we can assume that lazily requesting tokens will produce roughly the same number of requests as renewing in the background for a publishing app (bar the renewal time we agree on), but fetching tokens lazily will increase the latency on the unlucky publish(es).

Contributor

AFAICT Dapr maintains persistent connections to Pulsar. I see that for publishing connections too: https://github.dev/JoshVanL/components-contrib/blob/19e17528a2c2a4a3098e53fad019548402a62e46/pubsub/pulsar/pulsar.go#L254-L255

Since connections are persisted, it's possible that we may never need to refresh tokens for the lifetime of the sidecar, unless the connection is dropped. So refreshing automatically "just in case" seems really wasteful.

Yes, signing a JWT is "free" for the IdP, but doing it when we know we don't need to is not polite. It may contribute to hitting rate limits, and it can create additional noise in the audit logs that are sent to SIEMs, etc.

Contributor Author

> So refreshing automatically "just in case" seems really wasteful.

Fair enough; I still disagree with this statement, though. I maintain that the latency improvements for daprd far outweigh the small amount of resources used to fetch more tokens than perhaps needed.

> It may contribute to hitting rate limits, and it can create additional noise in the audit logs that are sent to SIEMs, etc.

An IdP which rate-limits clients for requesting tokens at half expiry is misconfigured, IMO, and audit logs are built around the concept of clients renewing their identity documents. They will be indexed and filtered around this exact scenario.

c.lock.RLock()
defer c.lock.RUnlock()

if c.closed.Load() {
Contributor

Here is where you check whether the token in the cache is still valid, and if not, lazily request a new one.

Since multiple goroutines could be calling Token() at the same time, I recommend implementing some synchronization (or using the "promise" library!).
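As an illustration of the lazy approach suggested here, a minimal sketch using a double-checked lock. It assumes the existing `lock`, `clock`, and `currentToken` fields (with `currentToken` as a `*oauth2.Token`), plus a hypothetical `fetchToken` helper, a one-minute expiry skew, and this `Token` signature; it also needs `context`, `fmt`, and `time` imports. Not the actual implementation.

```go
// Token lazily returns the cached token, refreshing it under the write lock
// only when it is missing or close to expiry.
func (c *ClientCredentials) Token(ctx context.Context) (string, error) {
	c.lock.RLock()
	tok := c.currentToken
	c.lock.RUnlock()
	if tok != nil && c.clock.Now().Add(time.Minute).Before(tok.Expiry) {
		return tok.AccessToken, nil
	}

	c.lock.Lock()
	defer c.lock.Unlock()
	// Re-check under the write lock: another goroutine may have already refreshed.
	if c.currentToken != nil && c.clock.Now().Add(time.Minute).Before(c.currentToken.Expiry) {
		return c.currentToken.AccessToken, nil
	}
	tok, err := c.fetchToken(ctx) // hypothetical helper performing the client_credentials request
	if err != nil {
		return "", fmt.Errorf("failed to fetch OIDC token: %w", err)
	}
	c.currentToken = tok
	return tok.AccessToken, nil
}
```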

func (c *ClientCredentials) tokenRenewDuration() time.Duration {
c.lock.RLock()
defer c.lock.RUnlock()
return c.currentToken.Expiry.Sub(c.clock.Now()) / 2
Contributor

No need to halve the time; that's wasteful. Maybe subtract 1 minute, but since tokens are used right away there's no need to halve this.

Contributor Author

How about two thirds? I think 1 minute is far too short; node migrations in Kubernetes can take longer than a minute, for example.

Contributor

Most IdPs I'm familiar with issue tokens with a lifetime of 1h by default. Cutting 30 or 20 minutes out of that seems unnecessary.

5 mins?

Why are Kubernetes node migrations a concern here?

Contributor Author

It's very common to be running an internal provider in the same (or another) cluster. A node migration is an example scenario where the issuer server becomes unavailable for a non-trivial amount of time; the system should be able to cope with that as best it can.

I like keeping the renewal time a function of the token lifetime, as the operator will have chosen the expiry based on their deployment constraints.
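Purely to illustrate the trade-off being discussed, a sketch of a lifetime-proportional renewal window: two thirds is the fraction floated above, while the ten-second floor is an assumption to avoid a tight refresh loop on very short-lived tokens.

```go
package main

import "time"

// renewDuration returns how long to wait before renewing a token, using two
// thirds of the remaining lifetime with a lower bound for very short tokens.
func renewDuration(now, expiry time.Time) time.Duration {
	d := expiry.Sub(now) * 2 / 3
	if d < 10*time.Second {
		d = 10 * time.Second
	}
	return d
}
```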

Contributor

> It's very common to be running an internal provider in the same (or another) cluster. A node migration is an example scenario where the issuer server becomes unavailable for a non-trivial amount of time; the system should be able to cope with that as best it can.

That is an operational thing that should not be a concern of Dapr's. If an ops person runs their own IdP, they are responsible for ensuring its availability.

@JoshVanL (Contributor Author) commented on Aug 1, 2023

> If an ops person runs their own IdP, they are responsible for ensuring its availability.

I agree, but it is standard practice as a client to keep enough buffer time in the validity of your identity documents to maximize uptime if the provider goes down. Using the document validity duration is a great metric for choosing that buffer time, rather than using an arbitrary static value. The ops person will have chosen that duration for a reason, and it is a great signal to act on.

RedeliveryDelay time.Duration `mapstructure:"redeliveryDelay"`
internalTopicSchemas map[string]schemaMetadata `mapstructure:"-"`
PublicKey string `mapstructure:"publicKey"`
PrivateKey string `mapstructure:"privateKey"`
Keys string `mapstructure:"keys"`

AuthType string `mapstructure:"authType"`
Contributor

Generally when a component has multiple auth mechanisms, we do not force users to specify that unless that's strictly necessary.

In this case, it is not necessary.

When the component is initialized, you can check whether token is present and, if so, assume the auth type is shared token. Otherwise, check whether the OIDC credentials are present. If neither is present, return an error.

This is more aligned with what we do for almost all other components, and saves users from having another metadata option.
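A rough sketch of the implicit selection being proposed; the function and parameter names are assumed for illustration only (the merged PR keeps an explicit authType field instead):

```go
package main

import "errors"

// detectAuthType picks an auth mechanism from whichever credentials are
// present, as suggested above: a shared token wins, then OIDC, else error.
func detectAuthType(token, oidcTokenURL, oidcClientID string) (string, error) {
	switch {
	case token != "":
		return "token", nil
	case oidcTokenURL != "" && oidcClientID != "":
		return "oidc", nil
	default:
		return "", errors.New("no authentication credentials provided")
	}
}
```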

Member

What makes this component different from how we handle auth on Kafka, for example? If we don't specify the auth type explicitly, there may be additional token auth mechanisms in the future that will conflict.

Contributor

I am not sure about Kafka, but, for example, all Azure components automatically select shared token auth vs. AzureAD depending on the metadata that's passed.

In the rare case of conflicts, we have added one-off metadata keys such as useAzureAD for Azure SQL.

Contributor Author

My fear with not having an explicit auth field is that we end up with "magic config" schemas, which are both confusing and surprising to users. It wouldn't be clear to me what the behavior would be if I defined multiple auth types, for example.

This is also particularly important for security folks who are responsible for auditing or writing policy over config. A clear auth type field gives them confidence that things are set up correctly and that the software is going to do what they expect.

Audiences: m.OIDCAudiences,
})
if err != nil {
return fmt.Errorf("could not instantiate oidc token provider: %v", err)
Contributor

Total nit:

Suggested change:
- return fmt.Errorf("could not instantiate oidc token provider: %v", err)
+ return fmt.Errorf("could not instantiate oidc token provider: %w", err)
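(Context for the nit: `%w` wraps the underlying error so callers can still match it with `errors.Is`/`errors.As`, whereas `%v` only includes its message in the new error string.)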

@ItalyPaleAle (Contributor)

@JoshVanL please re-open this PR after addressing the feedback above, thanks

Development

Successfully merging this pull request may close these issues.

Implement OIDC Authentication against Keycloak for Apache Pulsar