Investigate Observability Options #132

Open
datamel opened this issue Sep 14, 2021 · 3 comments

Comments

@datamel
Contributor

datamel commented Sep 14, 2021

Cylc comprises a set of distributed systems, so a bottleneck anywhere in it can be difficult to pinpoint. A key objective of observability is not only to show that there is a problem, but also to help discover where the problem has occurred.

Observability offers a detailed view of the internals of a software system, and OpenTelemetry offers a standardised way of looking at traces.

By using a standard that is not tied to any language or platform, we can send traces from all parts of the system, e.g. to see the flow from cylc-flow to the ui-server and through to the ui itself.
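One way components written in different languages can share a single trace is by passing a trace ID along with each request. The sketch below illustrates this with the W3C Trace Context `traceparent` header format, which OpenTelemetry supports for propagation. It uses only the Python standard library and is not the OpenTelemetry SDK; the function names and the ui / ui-server / cylc-flow hops are assumptions for illustration only.

```python
import re
import secrets

# W3C Trace Context header: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent():
    """Start a new trace (hypothetical helper, not an OpenTelemetry call)."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"  # flags 01 = sampled


def child_traceparent(incoming):
    """Continue an incoming trace: keep the trace ID, mint a new span ID.

    Falls back to starting a new trace if the header is malformed.
    """
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Hypothetical request path: ui -> ui-server -> cylc-flow.
ui_header = new_traceparent()
uis_header = child_traceparent(ui_header)
flow_header = child_traceparent(uis_header)

for hop, header in [("ui", ui_header), ("ui-server", uis_header), ("cylc-flow", flow_header)]:
    print(f"{hop:<10} {header}")
```

Every hop prints the same trace ID with a different span ID, which is what lets a tracing backend stitch the three components into one trace.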

This would be independent of our current logging. We could also look at Open Logging and Open Metrics in the future, once those standards are finalised.

By using OpenTelemetry rather than a proprietary logging method, users are free to send all telemetry to the trace aggregation tools of their choice, as these increasingly support the OpenTelemetry framework; for example, Zipkin and Jaeger.

So, for example, we may be able to set up spans such that, for a given request, we can display a detailed view of how time is spent in each process: a Gantt chart that you could drill down into to spot any bottlenecks.
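As a rough illustration of the span idea, here is a minimal stdlib-only sketch of nested, timed spans rendered as a text Gantt chart. It is not the OpenTelemetry API; `ToyTracer`, the span names, and the sleep calls standing in for real work are all hypothetical. A real setup would create spans with the OpenTelemetry SDK and view them in a backend such as Jaeger or Zipkin.

```python
import time
from contextlib import contextmanager


class ToyTracer:
    """Records nested, timed spans. Illustrative only; not the OpenTelemetry API."""

    def __init__(self):
        self.spans = []   # all spans, in the order they were started
        self._stack = []  # spans that are currently open

    @contextmanager
    def span(self, name):
        record = {"name": name, "depth": len(self._stack),
                  "start": time.perf_counter(), "end": None}
        self.spans.append(record)
        self._stack.append(record)
        try:
            yield record
        finally:
            record["end"] = time.perf_counter()
            self._stack.pop()

    def gantt(self, width=40):
        """Return one text row per span, scaled to `width` columns.

        Call only after all spans have finished.
        """
        t0 = min(s["start"] for s in self.spans)
        total = (max(s["end"] for s in self.spans) - t0) or 1e-9
        rows = []
        for s in self.spans:
            offset = int((s["start"] - t0) / total * width)
            length = max(1, int((s["end"] - s["start"]) / total * width))
            label = "  " * s["depth"] + s["name"]
            rows.append(f"{label:<24}|{' ' * offset}{'#' * length}")
        return rows


tracer = ToyTracer()
with tracer.span("request"):
    with tracer.span("ui-server: resolve"):
        time.sleep(0.01)  # stand-in for real work
    with tracer.span("cylc-flow: query"):
        time.sleep(0.03)  # stand-in for real work

for row in tracer.gantt():
    print(row)
```

The child rows are indented under the parent and their bars sit inside the parent's bar, so a slow step shows up as the longest bar at its level.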

Although not a current priority, this may be worth some investigation once things settle.

@kinow
Member

kinow commented Sep 14, 2021

Big +1 for this @datamel. I have used Prometheus and Grafana before. Lately more people have moved from Prometheus to OpenTelemetry. I think we can keep this one and close cylc/cylc-flow#2904 in favor of it. It is probably also related to #72.

@datamel
Contributor Author

datamel commented Sep 14, 2021

Thanks @kinow. Yes, I have seen Grafana and Prometheus in action and they look great. I wasn't aware of those issues, so it's great to have more thoughts. I do a spot of technical writing and there is a lot of interest in OpenTelemetry at the moment. I also spent a training day last month learning a bit more about it, and it looks fairly straightforward to implement (famous last words 😄), and quite powerful if you get it right.

@hjoliver
Member

Sounds good!
