Add workerStats as worker options to collect stats on poller start/stop events #1356

ketsiambaku · 2024-07-09T13:29:56Z

Detailed Description
[In-depth description of the changes made to the interfaces, specifying new fields, removed fields, or modified data structures]

This PR introduces a new worker option EventMonitor which wraps a LifeCycle interface to collect stats on poller start and stop (in this PR) and running activities and workflows (in a follow-up PR) for debugging purposes.

The change is backward compatible with older versions of the client because a default implementation is created when a worker is instantiated without this option being provided. A constructor for the LifeCycle interface is also added to the public API to serve as basic guidelines for usage by the OS community.

Impact Analysis

Backward Compatibility: [Analysis of backward compatibility]
Forward Compatibility: [Analysis of forward compatibility]

Testing Plan

Unit Tests: [Do we have unit test covering the change?] The change is covered by unit tests
Local tests with worker service

Rollout Plan
The new exposed types are marked as deprecated at the moment.
The plan is to pre-release and experiment in Uber internal for sometime before removing the deprecated annotation
Revert plan: The change can be safely reverted. Need coordination with consumers code using the introduced worker option

codecov · 2024-07-09T16:31:43Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 76.48%. Comparing base (59a175b) to head (25d074d).
Report is 1 commits behind head on master.

Additional details and impacted files

Files	Coverage Δ
internal/common/debug/workerstats_noop.go	`100.00% <100.00%> (ø)`
internal/internal_worker.go	`80.33% <100.00%> (+0.24%)`	⬆️
internal/internal_worker_base.go	`66.66% <100.00%> (+0.16%)`	⬆️
internal/worker.go	`14.28% <ø> (ø)`

... and 1 file with indirect coverage changes

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 59a175b...25d074d. Read the comment docs.

internal/internal_worker.go

internal/internal_worker_base.go

Groxx · 2024-07-09T20:07:21Z

A whole bunch of stuff from IRL chat, but to put it in writing and try to make it a bit less meandering and aimless:

Let's name this highly specific to what it's collecting, not "stats collector". We're pretty much guaranteed to want to make changes to it (adding more things at the least) and that'll require some boilerplate to deal with every change being a breaking change, so it's best to make it sound both unstable and as specific as it truly is.
For API stability purposes, we should mark this as "in development and unstable" (it'll go in the "/x/" package in v2 to be a bit clearer about this), and bundle it into a worker-options-field with very specific sub-fields for each thing. We'll definitely have more eventually.
I kinda don't think it should just be start() and stop() with no args tbh, it's a bit hard to use for anything but ++ and --. I'm not sure what info would be useful, but using at least the worker's identity seems like a better starting point (and very unlikely to need to change later). Maybe also tasklist, domain, ...almost all config is potentially useful...
- I guess this is a request to try to build something useful on top of this PR, and see what's missing, before actually merging it. It's hard to guess at stuff like this without concrete examples, and APIs are hard to change once exposed.
- If doing that finds that "no args" is fine, then "no args" is fine. We can just make another collector-thing-interface later when we build something else, "no args" is at least very easy to maintain.
PR description needs some work, as it's mostly placeholders :) Just delete whatever isn't relevant.

In a general sense, I'm 100% in favor of "useful but inherently unstable" APIs like this, so I say 👍 let's build it. It doesn't have to be perfect because we can add more things later. We just need to make it very clear that it's not (yet?) stable so nobody tries to write a library around it that we would constantly be breaking.
/x/ in v2 will hopefully make this even clearer, and will let us lock it down in our internal monorepo to prevent Hyrum's Law from wreaking havoc.

None of the extra verifying / samples-repo-trying / etc is a blocker for this or needs a PR (unless you want), it's just a "this is probably worth doing first because we'll need it at some point, and there's no time like the present". Treat it as a crash-course in library API design, and feel it out from both library and user side - it's very much worth learning, and learning how to do, and this feature is a fairly isolated playground to figure things out.

(... and then when it all starts feeling natural you'll probably start getting mad at API lack-of-design-s everywhere. it's a blessing and a curse.)

For building something with this and testing it out, I'd recommend:

go work init somewhere, and use ./relative/path/to/go.uber.org/cadence and use ./relative/path/to/cadence-samples in it.
- Go doesn't care where things are, but IDEs get kinda cranky if the go.work isn't in the repo they're open to.
- replace lib => ../relative/path works fine too, go.work is just a bit nicer because it doesn't require a bunch of go mod tidy-ing and is easier to keep PRs clean.
git checkout this PR and try to use it in https://github.com/uber-common/cadence-samples
- Specifically, make sure you have go env GOWORK pointing to the right file, and go run github.com/uber-common/cadence-samples/the/thing/to/try should take care of everything. IDEs might need something completely separate, sadly.
Apparently we have Prometheus and Grafana in our main docker-compose.yml... though I suspect it won't be able to collect client-library-metrics like this, if I'm remembering right and Prometheus is a pull-based system.
- docker something --network host might take care of that. I'd be curious if there's another option, but I think it'd be a totally fine fallback either way.
- Whatever you figure out: let's get that documented somewhere, it'd be very useful to have a reliable set of steps for the future. Unit tests are rather hard to use to check complicated behavior like we sometimes track with metrics.

I think we'll likely want an API like:

WorkerOptions struct {
  ...
  UnstableEventMonitoringOrSomething struct {
    PollerLifecycle PollerLifecycle
  }
}
PollerLifecycle interface {
  Start(identity string)
  Stop(identity string)
}

so we can make additions and breaking changes like this:

WorkerOptions struct {
  ...
   UnstableEventMonitoringOrSomething struct {
     PollerLifecycle   PollerLifecycle
+    PollerLifecycleV2 PollerLifecycleV2
   }
 }
 PollerLifecycle interface {
   Start(identity string)
   Stop(identity string)
 }
+PollerLifecycleV2 interface {
+  PollerLifecycle
+  Poll(identity)
+}

and support old versions + new versions + more kinds of collectors in the future.

Another pattern that's fairly common in open source is to upcast a general interface:

WorkerOptions struct {
  ...
  UnstableEventMonitoringOrSomething StatsCollector
}

type StatsCollector interface {
  // does nothing / is not called, just excludes accidental random objects
  CollectsStats()
}
type PollStats interface {
  StatsCollector
  Start(..)
  Stop(..)
}
Type PollStatsV2 interface {
  PollStats
  Poll(..)
}

// upcast to use the various impls / sub-interfaces
if pollv2, ok := thing.(PollStatsV2); ok {
  pollv2.Start(..)
} else if poll, ok := thing.(PollStats); ok {
  poll.Start(..)
}
// etc

but I think we shouldn't do that.
It has some benefits, but the semantics can get extremely complicated and impossible to know perfectly if we start adding a bunch of things in there.

E.g. if we have "poll stats v1, v2, and workflow stats v1" and we want to remove support for "poll stats v1", anyone providing poll-v1 and workflow-v1 will just lose their poll stats, and we have no way to warn them mechanically. They won't have provided "nothing" that we can call an error (workflow-v1 works fine), we just won't notice that they had a poll-v1 and didn't do the poll-v2 upgrade.

With the structy setup, if we do this:

   UnstableEventMonitoringOrSomething struct {
-    PollerLifecycle   PollerLifecycle
     PollerLifecycleV2 PollerLifecycleV2
     WorkflowStuff     WorkflowStuff
   }

poll-v1-users get a compile-time failure and a very clear upgrade path: use v2 or get rid of it. We can add disjoint and breaking changes any time we need (it's just a new field), we can warn on deprecating things (it's just a // deprecated field), and semantics are easy to keep trivial and obvious even in cases where e.g. they provide both v1 and v2: we always call everything that was given to us, no attempting to be smart about only calling the newest APIs or similar.

The structy setup isn't great for "I want to write a library that provides a few of these" as it doesn't have an easy way to auto-upgrade poll-v1 to poll-v2 (users would need to wire that up), but there are ways to deal with that too (e.g. hand the config struct to the library and have it fill the fields).

For internal implementation purposes: noop impls are fine IMO. Checking for nils everywhere is worse in a lot of practical ways. As long as it's truly internal we can always change it, so no design concerns there at all.

3vilhamster · 2024-08-15T12:08:48Z

internal/internal_worker.go

@@ -187,6 +187,10 @@ func ensureRequiredParams(params *workerExecutionParameters) {
 	if params.UserContext == nil {
 		params.UserContext = context.Background()
 	}
+	if params.EventMonitoring.PollerLifeCycle == nil {
+		params.EventMonitoring.PollerLifeCycle = newNoopPollerLifeCycle()
+		params.Logger.Info("No PollerLifeCycle configured for cadence worker. Use noop PollerLifeCycle as default.")


nit: this should be a debug; nobody is forced to implement it.
However, we should have guidance on how we use it and how it could be helpful for OS customers.

changed to debug.
I'm proposing a default implementation which just count the running pollers (which is basically what I was planning for the Uber internal implementation) and exposed a NewLifeCycle() constructor so as a start the OS users can at least have a wrapper around it

internal/worker.go

internal/internal_worker.go

jakobht

Added some comments, I'm not sure I completely understand what we want to achieve, so some of the comments might reflect that.

internal/debug/lifecycle.go

internal/debug/lifecycle_test.go

internal/internal_worker.go

jakobht · 2024-08-19T09:29:30Z

internal/debug/lifecycle.go

+	runImpl struct {
+		lifeCycle *lifeCycleImpl
+		stopped   atomic.Bool
+	}


I don't understand why we need this struct. The lifecycle holds the whole state - and where it is call there is no risk of calling stop twice.

So we could call .Stop() on the LifeCycle itself.

My thoughts on this were:

Users can create an instance of LifeCycle by using debug.NewLifeCycle() and this internal implementation is what’s returned. So if for whatever crazy reason someone uses this and calls Stop() and Start() in any order the internal atomic.Int32 will no longer reflect the actual number of running pollers. So LifeCycle.LoadPollerCount() will no longer be accurate. This is also why we are tracking if Stop() has been called which is your question in the comment below

Btw That’s how I am planning to use it internally. Create an instance of LifeCycle and pass it as a worker option and then call LifeCycle.LoadPollerCount() to load the number on the debug page.

Also the way we use it to record poller stopped
defer LifeCycle.Start(...).Stop() kind of explains that we expect this to be in order

Let me know if this makes sense and happy to discuss if you have another approach on how to handle this

jakobht · 2024-08-19T09:30:30Z

internal/debug/lifecycle.go

+	if r.stopped.CompareAndSwap(false, true) {
+		r.lifeCycle.pollerCount.Dec()
+	} else {


I don't understand why we are tracking if it's stopped we call this function exactly once, when the poller stops, so I do not think this is nessesary.

jakobht · 2024-08-19T09:32:09Z

internal/debug/lifecycle.go

+	}
+)
+
+func (lc *lifeCycleImpl) PollerStart(workerID string) Run {


Suggested change

func (lc *lifeCycleImpl) PollerStart(workerID string) Run {

func (lc *lifeCycleImpl) PollerStart() Run {

We are not using the argument here? Maybe we should have a map or something? But currently it's not used.

The initial plan was to define the interface only and the implementation would need to be provided by users. Internally, we’d have only a noop implementation for backward compatibility. Having only a Start() and Stop() with no args would limit the users on their implementation. That’s why I added at least workerID as an argument.
Eventually I added a default implementation and though it is not used in the default workerID just seemed like a good starting point for consumers implementation

3vilhamster · 2024-08-20T10:58:31Z

internal/common/debug/lifecycle.go

+
+	// LifeCycle contains a set of methods to collect information on a running worker
+	// Deprecated: in development and very likely to change
+	LifeCycle interface {


Suggested change

LifeCycle interface {

PollerTracker interface {

Start() Stopper

Stats() int

}

// WokrflowInfo -> Domain and Tasklist can be extracted from the worker.

WokrflowInfo struct {

ID, RunID, Type string

}

WorkflowTracker interface {

Start(id WokrflowInfo) Stopper

Stats() [{WokrflowInfo, int}]

}

ActivitInfo struct {

ID string

WfInfo WokrflowInfo

}

ActivityTracker interface {

Start(id ActivityInfo) Stopper

Stats() [{ActivityInfo, int}]

}

made the changes per yesterday’s discussion. Per @jakobht 's point Now that we removed the ID argument from Start() we don’t have to return a Stopper and can instead have Stop() directly in the PollerTracker. I don’t have a strong opinion or preference on this.

3vilhamster · 2024-08-20T10:58:46Z

internal/common/debug/lifecycle.go

+	EventMonitor struct {
+		LifeCycle LifeCycle
+	}


Suggested change

EventMonitor struct {

LifeCycle LifeCycle

}

WorkerStats struct {

PollerTracker PollerTracker

WorkflowTracker WorkflowTracker

ActivityTracker ActivityTracker

}

ketsiambaku requested review from Groxx, shijiesheng, agautam478, jakobht, 3vilhamster, dkrotx, taylanisikdemir and demirkayaender as code owners July 9, 2024 13:29

taylanisikdemir reviewed Jul 9, 2024

View reviewed changes

internal/internal_worker.go Outdated Show resolved Hide resolved

internal/internal_worker.go Outdated Show resolved Hide resolved

internal/internal_worker.go Outdated Show resolved Hide resolved

internal/internal_worker_base.go Outdated Show resolved Hide resolved

ketsiambaku force-pushed the stats-collector branch from d5acb7c to 6bedb01 Compare August 13, 2024 08:21

ketsiambaku changed the title ~~Add stats collector in worker options~~ Add event monitoring worker options to collect stats on poller lifecycle Aug 15, 2024

3vilhamster reviewed Aug 15, 2024

View reviewed changes

ketsiambaku added 10 commits August 18, 2024 22:48

Add stats collector in worker options

1df8f1d

add comment for type definition

e3399a4

update comments

b4506c7

add tests assertions

864ff92

add pollerlifecycle interface

22c0399

lint

f83435a

fix typo

b214c09

add implementation and tests

d359245

run lint

f219e1c

update debug log msg & internal pkg

a7c4fe9

ketsiambaku force-pushed the stats-collector branch from 2495592 to a7c4fe9 Compare August 18, 2024 20:48

3vilhamster approved these changes Aug 19, 2024

View reviewed changes

internal/internal_worker.go Outdated Show resolved Hide resolved

jakobht reviewed Aug 19, 2024

View reviewed changes

ketsiambaku added 3 commits August 19, 2024 14:59

address comments

b074895

address more comments

48cfb9b

address more comments

a70cf38

lint

3ff2962

3vilhamster reviewed Aug 20, 2024

View reviewed changes

ketsiambaku added 3 commits August 21, 2024 13:24

rewrite poller tracker

993da0a

fix typo in test

ad95a31

add tests for noop impl

25d074d

ketsiambaku changed the title ~~Add event monitoring worker options to collect stats on poller lifecycle~~ Add workerStats as worker options to collect stats on poller start/stop events Aug 21, 2024

ketsiambaku merged commit ed58224 into uber-go:master Aug 22, 2024
22 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add workerStats as worker options to collect stats on poller start/stop events #1356

Add workerStats as worker options to collect stats on poller start/stop events #1356

ketsiambaku commented Jul 9, 2024 •

edited

Loading

codecov bot commented Jul 9, 2024 •

edited

Loading

Groxx commented Jul 9, 2024 •

edited

Loading

3vilhamster Aug 15, 2024

ketsiambaku Aug 19, 2024

jakobht left a comment

jakobht Aug 19, 2024

ketsiambaku Aug 19, 2024 •

edited

Loading

jakobht Aug 19, 2024

jakobht Aug 19, 2024

ketsiambaku Aug 19, 2024

3vilhamster Aug 20, 2024 •

edited

Loading

ketsiambaku Aug 21, 2024

3vilhamster Aug 20, 2024

	func (lc *lifeCycleImpl) PollerStart(workerID string) Run {
	func (lc *lifeCycleImpl) PollerStart() Run {

-	LifeCycle interface {
+	PollerTracker interface {
+	    Start() Stopper
+	    Stats() int
+	}
+	// WokrflowInfo -> Domain and Tasklist can be extracted from the worker.
+	WokrflowInfo struct {
+	   ID, RunID, Type string
+	}
+	WorkflowTracker interface {
+	   Start(id WokrflowInfo) Stopper
+	    Stats() [{WokrflowInfo, int}]
+	}
+	ActivitInfo struct {
+	   ID string
+	   WfInfo WokrflowInfo
+	}
+	ActivityTracker interface {
+	    Start(id ActivityInfo) Stopper
+	    Stats() [{ActivityInfo, int}]
+	}

-	EventMonitor struct {
-		LifeCycle LifeCycle
-	}
+	WorkerStats struct {
+		PollerTracker PollerTracker
+		WorkflowTracker WorkflowTracker
+		ActivityTracker ActivityTracker
+	}

Add workerStats as worker options to collect stats on poller start/stop events #1356

Add workerStats as worker options to collect stats on poller start/stop events #1356

Conversation

ketsiambaku commented Jul 9, 2024 • edited Loading

codecov bot commented Jul 9, 2024 • edited Loading

Codecov Report

Groxx commented Jul 9, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jakobht left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ketsiambaku Aug 19, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

3vilhamster Aug 20, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ketsiambaku commented Jul 9, 2024 •

edited

Loading

codecov bot commented Jul 9, 2024 •

edited

Loading

Groxx commented Jul 9, 2024 •

edited

Loading

ketsiambaku Aug 19, 2024 •

edited

Loading

3vilhamster Aug 20, 2024 •

edited

Loading