This repository has been archived by the owner on May 18, 2021. It is now read-only.

Useful metrics #106

Open
jan-g opened this issue Nov 9, 2017 · 5 comments

Comments

@jan-g
Contributor

jan-g commented Nov 9, 2017

Operationally, there are some obvious things to measure per flow node. These should be exposed via /metrics if they aren't already:

DB connectivity:

  • number of active pool connections (vs. idle)
  • sql span histograms for journalling

One upper limit on how many stage operations we can sustain per second is (max pool connections) / <sql query span>, assuming each operation holds one pooled connection for the duration of one query.
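
As a rough illustration of that bound, here is a minimal sketch. The function and parameter names are made up for this example, and it assumes one pooled connection per stage operation and one query per operation, so a real node would sit below this figure.

```python
def max_stage_ops_per_second(max_pool_connections: int, sql_query_span_seconds: float) -> float:
    """Upper bound on stage operations per second.

    Assumption: every stage operation holds exactly one pooled connection
    for one query of the given duration.
    """
    if max_pool_connections <= 0:
        raise ValueError("max_pool_connections must be positive")
    if sql_query_span_seconds <= 0:
        raise ValueError("sql_query_span_seconds must be positive")
    return max_pool_connections / sql_query_span_seconds
```

For instance, 20 pooled connections with a 5 ms journalling query would cap out at about 4000 stage operations per second, which is why both the pool gauge and the span histogram are worth exposing together.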

Executor connectivity:

  • number of active fn invocations the executor is waiting on.

Error counts:

  • fn failures
  • db errors
  • lower-level errors, e.g. socket availability (we might conceivably bump into this if we have a naive HTTP/1.1 connection to the fn api).
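
To make the error counts above concrete, the following is a minimal sketch of labelled counters rendered in the Prometheus text exposition format. It is not taken from the flow codebase: the class, the metric name `flow_errors_total`, and the `kind` label values are all hypothetical.

```python
import threading
from collections import defaultdict


class ErrorCounters:
    """Thread-safe, monotonically increasing counters keyed by error kind."""

    def __init__(self, metric_name="flow_errors_total"):  # hypothetical metric name
        self._metric_name = metric_name
        self._lock = threading.Lock()
        self._counts = defaultdict(int)

    def inc(self, kind):
        """Record one more error of the given kind (e.g. "fn_failure")."""
        with self._lock:
            self._counts[kind] += 1

    def value(self, kind):
        """Return the current count for a kind, 0 if never seen."""
        with self._lock:
            return self._counts.get(kind, 0)

    def render(self):
        """Render all counters in the Prometheus text exposition format."""
        with self._lock:
            items = sorted(self._counts.items())
        lines = [f"# TYPE {self._metric_name} counter"]
        lines += [f'{self._metric_name}{{kind="{kind}"}} {count}' for kind, count in items]
        return "\n".join(lines) + "\n"
```

A /metrics handler could return `render()` as its body; keeping one metric with a `kind` label lets fn failures, db errors and socket-level errors be compared on the same graph.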
@hhexo
Contributor

hhexo commented Nov 9, 2017

For DB connectivity we have spans of the time taken by sql persistence operations (by operation). These are then collected into histograms by the prometheus mapper. I would argue that we also need quantiles (and therefore prometheus Summaries instead of Histograms), so I can add those.
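
To show the difference between the two views of the same span data, here is a small sketch. It is only an illustration: a real prometheus Summary estimates quantiles from a stream rather than sorting raw samples, and the helper names and bucket bounds are invented for this example.

```python
import bisect
import math


def bucket_counts(samples, upper_bounds):
    """Cumulative counts per upper bound, the way a histogram records durations."""
    ordered = sorted(samples)
    return [bisect.bisect_right(ordered, bound) for bound in upper_bounds]


def quantile(samples, q):
    """Nearest-rank quantile computed directly from the raw samples."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]
```

With spans of 2, 3, 4, 5 and 50 ms, the buckets only say how many operations finished under each bound, while the quantile answers "how slow is the slowest 1%" directly. The bucket view depends on picking bounds that bracket the interesting latencies.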

Counters (number of connections, number of fn invocations, ...) are supported in prometheus but I'm not sure if opentracing has a concept for those (still, I'm a bit ignorant when it comes to opentracing).

@jan-g
Contributor Author

jan-g commented Nov 10, 2017

I don't mind using raw prometheus (or something wrapped around it) if it means we can get counters out for useful things.

@zootalures
Member

I don't think opentracing concerns itself with metrics/gauge stuff, and retconning numbers from the events is a bad idea. I assume we'll need to generate prometheus metrics from internal gauges/counters alongside the event metrics.
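
One possible shape for such an internal gauge, kept separate from the tracing events, is sketched below. The `Gauge` class, `active_fn_invocations` and `invoke_fn` are hypothetical names for illustration only, not part of the flow codebase.

```python
import threading
from contextlib import contextmanager


class Gauge:
    """Thread-safe value that can go up and down, e.g. in-flight invocations."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def inc(self):
        with self._lock:
            self._value += 1

    def dec(self):
        with self._lock:
            self._value -= 1

    def value(self):
        with self._lock:
            return self._value

    @contextmanager
    def track(self):
        """Count the enclosed block as in flight, even if it raises."""
        self.inc()
        try:
            yield
        finally:
            self.dec()


active_fn_invocations = Gauge()  # hypothetical gauge name


def invoke_fn(call):
    """Run a fn invocation (any zero-argument callable) while it is counted."""
    with active_fn_invocations.track():
        return call()
```

Because the gauge is updated at the call site, its value never has to be reconstructed from span start/finish events.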

@hhexo
Contributor

hhexo commented Nov 20, 2017

Note: #114 adds a few of the mentioned metrics.

  • DB timings (already there)
  • API call timings
  • Number of currently active Flows
  • Number of currently active Fn invocations
  • Duration of individual flows (aggregated as histogram / quantiles)

@hhexo
Contributor

hhexo commented Nov 21, 2017

#114 is closed now, because it will be done as part of #84 since that changes the api.
