-
Notifications
You must be signed in to change notification settings - Fork 131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add workerStats as worker options to collect stats on poller start/stop events #1356
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files
... and 1 file with indirect coverage changes Continue to review full report in Codecov by Sentry.
|
A whole bunch of stuff from IRL chat, but to put it in writing and try to make it a bit less meandering and aimless:
In a general sense, I'm 100% in favor of "useful but inherently unstable" APIs like this, so I say 👍 let's build it. It doesn't have to be perfect because we can add more things later. We just need to make it very clear that it's not (yet?) stable so nobody tries to write a library around it that we would constantly be breaking. None of the extra verifying / samples-repo-trying / etc is a blocker for this or needs a PR (unless you want), it's just a "this is probably worth doing first because we'll need it at some point, and there's no time like the present". Treat it as a crash-course in library API design, and feel it out from both library and user side - it's very much worth learning, and learning how to do, and this feature is a fairly isolated playground to figure things out. (... and then when it all starts feeling natural you'll probably start getting mad at API lack-of-design-s everywhere. it's a blessing and a curse.) For building something with this and testing it out, I'd recommend:
I think we'll likely want an API like: WorkerOptions struct {
...
UnstableEventMonitoringOrSomething struct {
PollerLifecycle PollerLifecycle
}
}
PollerLifecycle interface {
Start(identity string)
Stop(identity string)
} so we can make additions and breaking changes like this: WorkerOptions struct {
...
UnstableEventMonitoringOrSomething struct {
PollerLifecycle PollerLifecycle
+ PollerLifecycleV2 PollerLifecycleV2
}
}
PollerLifecycle interface {
Start(identity string)
Stop(identity string)
}
+PollerLifecycleV2 interface {
+ PollerLifecycle
+ Poll(identity)
+} and support old versions + new versions + more kinds of collectors in the future. Another pattern that's fairly common in open source is to upcast a general interface: WorkerOptions struct {
...
UnstableEventMonitoringOrSomething StatsCollector
}
type StatsCollector interface {
// does nothing / is not called, just excludes accidental random objects
CollectsStats()
}
type PollStats interface {
StatsCollector
Start(..)
Stop(..)
}
Type PollStatsV2 interface {
PollStats
Poll(..)
}
// upcast to use the various impls / sub-interfaces
if pollv2, ok := thing.(PollStatsV2); ok {
pollv2.Start(..)
} else if poll, ok := thing.(PollStats); ok {
poll.Start(..)
}
// etc but I think we shouldn't do that. E.g. if we have "poll stats v1, v2, and workflow stats v1" and we want to remove support for "poll stats v1", anyone providing poll-v1 and workflow-v1 will just lose their poll stats, and we have no way to warn them mechanically. They won't have provided "nothing" that we can call an error (workflow-v1 works fine), we just won't notice that they had a poll-v1 and didn't do the poll-v2 upgrade. With the structy setup, if we do this: UnstableEventMonitoringOrSomething struct {
- PollerLifecycle PollerLifecycle
PollerLifecycleV2 PollerLifecycleV2
WorkflowStuff WorkflowStuff
} poll-v1-users get a compile-time failure and a very clear upgrade path: use v2 or get rid of it. We can add disjoint and breaking changes any time we need (it's just a new field), we can warn on deprecating things (it's just a The structy setup isn't great for "I want to write a library that provides a few of these" as it doesn't have an easy way to auto-upgrade poll-v1 to poll-v2 (users would need to wire that up), but there are ways to deal with that too (e.g. hand the config struct to the library and have it fill the fields). For internal implementation purposes: noop impls are fine IMO. Checking for nils everywhere is worse in a lot of practical ways. As long as it's truly internal we can always change it, so no design concerns there at all. |
d5acb7c
to
6bedb01
Compare
internal/internal_worker.go
Outdated
@@ -187,6 +187,10 @@ func ensureRequiredParams(params *workerExecutionParameters) { | |||
if params.UserContext == nil { | |||
params.UserContext = context.Background() | |||
} | |||
if params.EventMonitoring.PollerLifeCycle == nil { | |||
params.EventMonitoring.PollerLifeCycle = newNoopPollerLifeCycle() | |||
params.Logger.Info("No PollerLifeCycle configured for cadence worker. Use noop PollerLifeCycle as default.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this should be a debug; nobody is forced to implement it.
However, we should have guidance on how we use it and how it could be helpful for OS customers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to debug.
I'm proposing a default implementation which just count the running pollers (which is basically what I was planning for the Uber internal implementation) and exposed a NewLifeCycle() constructor so as a start the OS users can at least have a wrapper around it
2495592
to
a7c4fe9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments, I'm not sure I completely understand what we want to achieve, so some of the comments might reflect that.
internal/debug/lifecycle.go
Outdated
runImpl struct { | ||
lifeCycle *lifeCycleImpl | ||
stopped atomic.Bool | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why we need this struct. The lifecycle holds the whole state - and where it is call there is no risk of calling stop twice.
So we could call .Stop() on the LifeCycle itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thoughts on this were:
Users can create an instance of LifeCycle by using debug.NewLifeCycle()
and this internal implementation is what’s returned. So if for whatever crazy reason someone uses this and calls Stop()
and Start()
in any order the internal atomic.Int32
will no longer reflect the actual number of running pollers. So LifeCycle.LoadPollerCount()
will no longer be accurate. This is also why we are tracking if Stop()
has been called which is your question in the comment below
Btw That’s how I am planning to use it internally. Create an instance of LifeCycle and pass it as a worker option and then call LifeCycle.LoadPollerCount()
to load the number on the debug page.
Also the way we use it to record poller stopped
defer LifeCycle.Start(...).Stop()
kind of explains that we expect this to be in order
Let me know if this makes sense and happy to discuss if you have another approach on how to handle this
internal/debug/lifecycle.go
Outdated
if r.stopped.CompareAndSwap(false, true) { | ||
r.lifeCycle.pollerCount.Dec() | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why we are tracking if it's stopped we call this function exactly once, when the poller stops, so I do not think this is nessesary.
internal/debug/lifecycle.go
Outdated
} | ||
) | ||
|
||
func (lc *lifeCycleImpl) PollerStart(workerID string) Run { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func (lc *lifeCycleImpl) PollerStart(workerID string) Run { | |
func (lc *lifeCycleImpl) PollerStart() Run { |
We are not using the argument here? Maybe we should have a map or something? But currently it's not used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The initial plan was to define the interface only and the implementation would need to be provided by users. Internally, we’d have only a noop implementation for backward compatibility. Having only a Start() and Stop() with no args would limit the users on their implementation. That’s why I added at least workerID as an argument.
Eventually I added a default implementation and though it is not used in the default workerID just seemed like a good starting point for consumers implementation
internal/common/debug/lifecycle.go
Outdated
|
||
// LifeCycle contains a set of methods to collect information on a running worker | ||
// Deprecated: in development and very likely to change | ||
LifeCycle interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LifeCycle interface { | |
PollerTracker interface { | |
Start() Stopper | |
Stats() int | |
} | |
// WokrflowInfo -> Domain and Tasklist can be extracted from the worker. | |
WokrflowInfo struct { | |
ID, RunID, Type string | |
} | |
WorkflowTracker interface { | |
Start(id WokrflowInfo) Stopper | |
Stats() [{WokrflowInfo, int}] | |
} | |
ActivitInfo struct { | |
ID string | |
WfInfo WokrflowInfo | |
} | |
ActivityTracker interface { | |
Start(id ActivityInfo) Stopper | |
Stats() [{ActivityInfo, int}] | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made the changes per yesterday’s discussion. Per @jakobht 's point Now that we removed the ID argument from Start() we don’t have to return a Stopper and can instead have Stop() directly in the PollerTracker. I don’t have a strong opinion or preference on this.
internal/common/debug/lifecycle.go
Outdated
EventMonitor struct { | ||
LifeCycle LifeCycle | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EventMonitor struct { | |
LifeCycle LifeCycle | |
} | |
WorkerStats struct { | |
PollerTracker PollerTracker | |
WorkflowTracker WorkflowTracker | |
ActivityTracker ActivityTracker | |
} |
Detailed Description
[In-depth description of the changes made to the interfaces, specifying new fields, removed fields, or modified data structures]
This PR introduces a new worker option EventMonitor which wraps a LifeCycle interface to collect stats on poller start and stop (in this PR) and running activities and workflows (in a follow-up PR) for debugging purposes.
The change is backward compatible with older versions of the client because a default implementation is created when a worker is instantiated without this option being provided. A constructor for the LifeCycle interface is also added to the public API to serve as basic guidelines for usage by the OS community.
Impact Analysis
Testing Plan
Rollout Plan
The new exposed types are marked as deprecated at the moment.
The plan is to pre-release and experiment in Uber internal for sometime before removing the deprecated annotation
Revert plan: The change can be safely reverted. Need coordination with consumers code using the introduced worker option