Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add workerStats as worker options to collect stats on poller start/stop events #1356

Merged
merged 17 commits into from
Aug 22, 2024

Conversation

ketsiambaku
Copy link
Contributor

@ketsiambaku ketsiambaku commented Jul 9, 2024


Detailed Description
[In-depth description of the changes made to the interfaces, specifying new fields, removed fields, or modified data structures]

This PR introduces a new worker option EventMonitor which wraps a LifeCycle interface to collect stats on poller start and stop (in this PR) and running activities and workflows (in a follow-up PR) for debugging purposes.

The change is backward compatible with older versions of the client because a default implementation is created when a worker is instantiated without this option being provided. A constructor for the LifeCycle interface is also added to the public API to serve as basic guidelines for usage by the OS community.

Impact Analysis

  • Backward Compatibility: [Analysis of backward compatibility]
  • Forward Compatibility: [Analysis of forward compatibility]

Testing Plan

  • Unit Tests: [Do we have unit test covering the change?] The change is covered by unit tests
  • Local tests with worker service

Rollout Plan
The new exposed types are marked as deprecated at the moment.
The plan is to pre-release and experiment in Uber internal for sometime before removing the deprecated annotation
Revert plan: The change can be safely reverted. Need coordination with consumers code using the introduced worker option


Copy link

codecov bot commented Jul 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 76.48%. Comparing base (59a175b) to head (25d074d).
Report is 1 commits behind head on master.

Additional details and impacted files
Files Coverage Δ
internal/common/debug/workerstats_noop.go 100.00% <100.00%> (ø)
internal/internal_worker.go 80.33% <100.00%> (+0.24%) ⬆️
internal/internal_worker_base.go 66.66% <100.00%> (+0.16%) ⬆️
internal/worker.go 14.28% <ø> (ø)

... and 1 file with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 59a175b...25d074d. Read the comment docs.

internal/internal_worker.go Outdated Show resolved Hide resolved
internal/internal_worker.go Outdated Show resolved Hide resolved
internal/internal_worker.go Outdated Show resolved Hide resolved
internal/internal_worker_base.go Outdated Show resolved Hide resolved
@Groxx
Copy link
Contributor

Groxx commented Jul 9, 2024

A whole bunch of stuff from IRL chat, but to put it in writing and try to make it a bit less meandering and aimless:

  1. Let's name this highly specific to what it's collecting, not "stats collector". We're pretty much guaranteed to want to make changes to it (adding more things at the least) and that'll require some boilerplate to deal with every change being a breaking change, so it's best to make it sound both unstable and as specific as it truly is.
  2. For API stability purposes, we should mark this as "in development and unstable" (it'll go in the "/x/" package in v2 to be a bit clearer about this), and bundle it into a worker-options-field with very specific sub-fields for each thing. We'll definitely have more eventually.
  3. I kinda don't think it should just be start() and stop() with no args tbh, it's a bit hard to use for anything but ++ and --. I'm not sure what info would be useful, but using at least the worker's identity seems like a better starting point (and very unlikely to need to change later). Maybe also tasklist, domain, ...almost all config is potentially useful...
    • I guess this is a request to try to build something useful on top of this PR, and see what's missing, before actually merging it. It's hard to guess at stuff like this without concrete examples, and APIs are hard to change once exposed.
    • If doing that finds that "no args" is fine, then "no args" is fine. We can just make another collector-thing-interface later when we build something else, "no args" is at least very easy to maintain.
  4. PR description needs some work, as it's mostly placeholders :) Just delete whatever isn't relevant.

In a general sense, I'm 100% in favor of "useful but inherently unstable" APIs like this, so I say 👍 let's build it. It doesn't have to be perfect because we can add more things later. We just need to make it very clear that it's not (yet?) stable so nobody tries to write a library around it that we would constantly be breaking.
/x/ in v2 will hopefully make this even clearer, and will let us lock it down in our internal monorepo to prevent Hyrum's Law from wreaking havoc.


None of the extra verifying / samples-repo-trying / etc is a blocker for this or needs a PR (unless you want), it's just a "this is probably worth doing first because we'll need it at some point, and there's no time like the present". Treat it as a crash-course in library API design, and feel it out from both library and user side - it's very much worth learning, and learning how to do, and this feature is a fairly isolated playground to figure things out.

(... and then when it all starts feeling natural you'll probably start getting mad at API lack-of-design-s everywhere. it's a blessing and a curse.)

For building something with this and testing it out, I'd recommend:

  • go work init somewhere, and use ./relative/path/to/go.uber.org/cadence and use ./relative/path/to/cadence-samples in it.
    • Go doesn't care where things are, but IDEs get kinda cranky if the go.work isn't in the repo they're open to.
    • replace lib => ../relative/path works fine too, go.work is just a bit nicer because it doesn't require a bunch of go mod tidy-ing and is easier to keep PRs clean.
  • git checkout this PR and try to use it in https://github.com/uber-common/cadence-samples
    • Specifically, make sure you have go env GOWORK pointing to the right file, and go run github.com/uber-common/cadence-samples/the/thing/to/try should take care of everything. IDEs might need something completely separate, sadly.
  • Apparently we have Prometheus and Grafana in our main docker-compose.yml... though I suspect it won't be able to collect client-library-metrics like this, if I'm remembering right and Prometheus is a pull-based system.
    • docker something --network host might take care of that. I'd be curious if there's another option, but I think it'd be a totally fine fallback either way.
    • Whatever you figure out: let's get that documented somewhere, it'd be very useful to have a reliable set of steps for the future. Unit tests are rather hard to use to check complicated behavior like we sometimes track with metrics.

I think we'll likely want an API like:

WorkerOptions struct {
  ...
  UnstableEventMonitoringOrSomething struct {
    PollerLifecycle PollerLifecycle
  }
}
PollerLifecycle interface {
  Start(identity string)
  Stop(identity string)
}

so we can make additions and breaking changes like this:

WorkerOptions struct {
  ...
   UnstableEventMonitoringOrSomething struct {
     PollerLifecycle   PollerLifecycle
+    PollerLifecycleV2 PollerLifecycleV2
   }
 }
 PollerLifecycle interface {
   Start(identity string)
   Stop(identity string)
 }
+PollerLifecycleV2 interface {
+  PollerLifecycle
+  Poll(identity)
+}

and support old versions + new versions + more kinds of collectors in the future.

Another pattern that's fairly common in open source is to upcast a general interface:

WorkerOptions struct {
  ...
  UnstableEventMonitoringOrSomething StatsCollector
}

type StatsCollector interface {
  // does nothing / is not called, just excludes accidental random objects
  CollectsStats()
}
type PollStats interface {
  StatsCollector
  Start(..)
  Stop(..)
}
Type PollStatsV2 interface {
  PollStats
  Poll(..)
}

// upcast to use the various impls / sub-interfaces
if pollv2, ok := thing.(PollStatsV2); ok {
  pollv2.Start(..)
} else if poll, ok := thing.(PollStats); ok {
  poll.Start(..)
}
// etc

but I think we shouldn't do that.
It has some benefits, but the semantics can get extremely complicated and impossible to know perfectly if we start adding a bunch of things in there.

E.g. if we have "poll stats v1, v2, and workflow stats v1" and we want to remove support for "poll stats v1", anyone providing poll-v1 and workflow-v1 will just lose their poll stats, and we have no way to warn them mechanically. They won't have provided "nothing" that we can call an error (workflow-v1 works fine), we just won't notice that they had a poll-v1 and didn't do the poll-v2 upgrade.

With the structy setup, if we do this:

   UnstableEventMonitoringOrSomething struct {
-    PollerLifecycle   PollerLifecycle
     PollerLifecycleV2 PollerLifecycleV2
     WorkflowStuff     WorkflowStuff
   }

poll-v1-users get a compile-time failure and a very clear upgrade path: use v2 or get rid of it. We can add disjoint and breaking changes any time we need (it's just a new field), we can warn on deprecating things (it's just a // deprecated field), and semantics are easy to keep trivial and obvious even in cases where e.g. they provide both v1 and v2: we always call everything that was given to us, no attempting to be smart about only calling the newest APIs or similar.

The structy setup isn't great for "I want to write a library that provides a few of these" as it doesn't have an easy way to auto-upgrade poll-v1 to poll-v2 (users would need to wire that up), but there are ways to deal with that too (e.g. hand the config struct to the library and have it fill the fields).


For internal implementation purposes: noop impls are fine IMO. Checking for nils everywhere is worse in a lot of practical ways. As long as it's truly internal we can always change it, so no design concerns there at all.

@ketsiambaku ketsiambaku changed the title Add stats collector in worker options Add event monitoring worker options to collect stats on poller lifecycle Aug 15, 2024
@@ -187,6 +187,10 @@ func ensureRequiredParams(params *workerExecutionParameters) {
if params.UserContext == nil {
params.UserContext = context.Background()
}
if params.EventMonitoring.PollerLifeCycle == nil {
params.EventMonitoring.PollerLifeCycle = newNoopPollerLifeCycle()
params.Logger.Info("No PollerLifeCycle configured for cadence worker. Use noop PollerLifeCycle as default.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this should be a debug; nobody is forced to implement it.
However, we should have guidance on how we use it and how it could be helpful for OS customers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to debug.
I'm proposing a default implementation which just count the running pollers (which is basically what I was planning for the Uber internal implementation) and exposed a NewLifeCycle() constructor so as a start the OS users can at least have a wrapper around it

internal/worker.go Show resolved Hide resolved
internal/worker.go Outdated Show resolved Hide resolved
internal/internal_worker.go Outdated Show resolved Hide resolved
internal/internal_worker.go Outdated Show resolved Hide resolved
Copy link
Contributor

@jakobht jakobht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments, I'm not sure I completely understand what we want to achieve, so some of the comments might reflect that.

internal/debug/lifecycle.go Outdated Show resolved Hide resolved
internal/debug/lifecycle_test.go Outdated Show resolved Hide resolved
internal/debug/lifecycle_test.go Outdated Show resolved Hide resolved
internal/internal_worker.go Outdated Show resolved Hide resolved
Comment on lines 32 to 35
runImpl struct {
lifeCycle *lifeCycleImpl
stopped atomic.Bool
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we need this struct. The lifecycle holds the whole state - and where it is call there is no risk of calling stop twice.

So we could call .Stop() on the LifeCycle itself.

Copy link
Contributor Author

@ketsiambaku ketsiambaku Aug 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thoughts on this were:

Users can create an instance of LifeCycle by using debug.NewLifeCycle() and this internal implementation is what’s returned. So if for whatever crazy reason someone uses this and calls Stop() and Start() in any order the internal atomic.Int32 will no longer reflect the actual number of running pollers. So LifeCycle.LoadPollerCount() will no longer be accurate. This is also why we are tracking if Stop() has been called which is your question in the comment below

Btw That’s how I am planning to use it internally. Create an instance of LifeCycle and pass it as a worker option and then call LifeCycle.LoadPollerCount() to load the number on the debug page.

Also the way we use it to record poller stopped
defer LifeCycle.Start(...).Stop() kind of explains that we expect this to be in order

Let me know if this makes sense and happy to discuss if you have another approach on how to handle this

Comment on lines 81 to 83
if r.stopped.CompareAndSwap(false, true) {
r.lifeCycle.pollerCount.Dec()
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we are tracking if it's stopped we call this function exactly once, when the poller stops, so I do not think this is nessesary.

}
)

func (lc *lifeCycleImpl) PollerStart(workerID string) Run {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (lc *lifeCycleImpl) PollerStart(workerID string) Run {
func (lc *lifeCycleImpl) PollerStart() Run {

We are not using the argument here? Maybe we should have a map or something? But currently it's not used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The initial plan was to define the interface only and the implementation would need to be provided by users. Internally, we’d have only a noop implementation for backward compatibility. Having only a Start() and Stop() with no args would limit the users on their implementation. That’s why I added at least workerID as an argument.
Eventually I added a default implementation and though it is not used in the default workerID just seemed like a good starting point for consumers implementation


// LifeCycle contains a set of methods to collect information on a running worker
// Deprecated: in development and very likely to change
LifeCycle interface {
Copy link
Contributor

@3vilhamster 3vilhamster Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
LifeCycle interface {
PollerTracker interface {
Start() Stopper
Stats() int
}
// WokrflowInfo -> Domain and Tasklist can be extracted from the worker.
WokrflowInfo struct {
ID, RunID, Type string
}
WorkflowTracker interface {
Start(id WokrflowInfo) Stopper
Stats() [{WokrflowInfo, int}]
}
ActivitInfo struct {
ID string
WfInfo WokrflowInfo
}
ActivityTracker interface {
Start(id ActivityInfo) Stopper
Stats() [{ActivityInfo, int}]
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made the changes per yesterday’s discussion. Per @jakobht 's point Now that we removed the ID argument from Start() we don’t have to return a Stopper and can instead have Stop() directly in the PollerTracker. I don’t have a strong opinion or preference on this.

Comment on lines 57 to 59
EventMonitor struct {
LifeCycle LifeCycle
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
EventMonitor struct {
LifeCycle LifeCycle
}
WorkerStats struct {
PollerTracker PollerTracker
WorkflowTracker WorkflowTracker
ActivityTracker ActivityTracker
}

@ketsiambaku ketsiambaku changed the title Add event monitoring worker options to collect stats on poller lifecycle Add workerStats as worker options to collect stats on poller start/stop events Aug 21, 2024
@ketsiambaku ketsiambaku merged commit ed58224 into uber-go:master Aug 22, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants