Distributed tracing support in Openwhisk #2192

sandeep-paliwal · 2017-04-27T08:08:35Z

I am creating this enhancement to discuss and add distributed tracing support to the OpenWhisk project.
This will help gather latency data for the various components (controller, invoker, etc.) in various usage contexts (e.g. action invocation). This data can help troubleshoot performance bottlenecks.

One option is to use Zipkin - http://zipkin.io/
There is already an instrumentation library that supports the Akka and Spray frameworks - https://github.com/levkhomich/akka-tracing

I have done some initial work on this and I can contribute it back once it's complete.
It would be nice to get thoughts, concerns, and suggestions on this.

markusthoemmes · 2017-04-27T08:49:47Z

Hi, thank you very much for working on this, sounds like an awesome idea.

One important historic bit though: we've kinda built our own tracing using our logs (see the markers we attach to our logs). Those are parseable by, for example, Elasticsearch.

It would be nice to integrate into this and generate as little code overhead as possible as we've been very cautious with instrumenting our code that way.

Other than that: Looking forward to your pull-request. 👍

jthomas · 2017-04-27T10:42:41Z

I did experiment with a client-side wrapper for actions to enable zipkin support:
https://github.com/jthomas/zipkin-instrumentation-openwhisk

It does work but there are performance issues. This has also been discussed on the mailing list.
👍 for adding this into the core platform.

jthomas · 2017-04-27T10:43:48Z

cc @adriancole

JonathanMace · 2017-04-27T15:37:00Z

Damn, small world @jthomas.

One option would be to consider slightly more general-purpose context propagation and deploy Zipkin on top of that, see (work in progress / under submission): https://github.com/JonathanMace/tracingplane

codefromthecrypt · 2017-04-27T15:48:18Z

@JonathanMace nice to see you! Here or elsewhere, we should revisit brave-tracingplane. I have some examples of swapping its in-process propagation innards here: https://github.com/openzipkin/brave/blob/master/brave/src/test/java/brave/features/log4j2_context/Log4JThreadContextTest.java

Otherwise, the most recent scala-brave stuff is here, though this is not to presume what is the best fit for this project: https://github.com/bizreach/play-zipkin-tracing. Mainly, the underlying tracer API is better now, so it's worth a look.

codefromthecrypt · 2017-04-27T15:53:17Z

fyi I bumped the existing akka project in case they are interested in updating to latest greatest levkhomich/akka-tracing#95

jthomas · 2017-04-27T16:32:52Z

Hey @JonathanMace - good to hear from you. Long time no see :)

sandeep-paliwal · 2017-04-28T07:57:33Z

Thanks, everyone, for the feedback on this.
@adriancole Can I use this https://github.com/openzipkin/zipkin-reporter-java for span propagation via Kafka?
Something like this: a producer (an Akka-based actor) starts a span and sends it over Kafka to a consumer (also an Akka-based actor). The consumer then creates a span with the producer's span as its parent.

codefromthecrypt · 2017-04-28T09:04:45Z

Hi, Sandeep. Reporting is out-of-band, and I think what you are asking about is in-band propagation. Propagating across Kafka is a bit hairy, as Jonathan Mace will tell you. We do have folks doing it, but there is no standard, as the choices tend to be not great (abuse keys or coordinate an envelope): https://gist.github.com/adriancole/76d94054b77e3be338bd75424ca8ba30
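
The "coordinate an envelope" option mentioned above could look roughly like the following. This is only a sketch under stated assumptions: the `SpanContext` type, the `wrap`/`unwrap` helpers, and the JSON layout are hypothetical and are not part of Zipkin, Brave, or OpenWhisk. No Kafka client is used; the bytes returned by `wrap` stand in for the message value.

```python
import json
import secrets
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SpanContext:
    """Hypothetical trace context carried alongside a message."""
    trace_id: str
    span_id: str
    parent_id: Optional[str] = None


def new_id() -> str:
    # 64-bit random identifier rendered as 16 hex characters.
    return secrets.token_hex(8)


def wrap(payload: dict, ctx: SpanContext) -> bytes:
    """Producer side: put the trace context next to the payload in one envelope."""
    envelope = {
        "trace": {"trace_id": ctx.trace_id, "span_id": ctx.span_id},
        "payload": payload,
    }
    return json.dumps(envelope).encode("utf-8")


def unwrap(message: bytes) -> Tuple[SpanContext, dict]:
    """Consumer side: recover the payload and derive a child of the producer's span."""
    envelope = json.loads(message.decode("utf-8"))
    parent = envelope["trace"]
    child = SpanContext(
        trace_id=parent["trace_id"],
        span_id=new_id(),
        parent_id=parent["span_id"],
    )
    return child, envelope["payload"]


producer_ctx = SpanContext(trace_id=new_id(), span_id=new_id())
message = wrap({"action": "hello"}, producer_ctx)  # bytes that would go to a topic
consumer_ctx, payload = unwrap(message)
```

The cost of this approach is that every producer and consumer has to agree on the envelope format, which is the coordination problem the comment refers to.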

style95 · 2017-04-28T09:43:13Z

It looks like many people in this thread are big fans of Zipkin.
However, one more option could be Pinpoint (https://github.com/naver/pinpoint), which is also based on Google Dapper.
Since the core OW components run on the JVM, Pinpoint could be a good option, I think.

AFAIK, the main differences between Zipkin and Pinpoint are as follows:

Zipkin requires code changes in core components.
It keeps a "context" throughout all call stacks.
That means we would have to add that "context" argument to every call.
This makes the tracing code tightly coupled with the core logic.

Pinpoint, however, uses "bytecode instrumentation", so it does not require any code changes in core modules. It intercepts code to modify the bytecode at class-loading time.

This brings several advantages.

First, it hides the tracing API from core modules. Developers of core modules do not need to care about the tracing code, which is decoupled from the core logic.

Second, it's easy to enable/disable tracing.
We can enable or disable Pinpoint tracing just by adding or removing JVM startup options:

-javaagent:$AGENT_PATH/pinpoint-bootstrap-$VERSION.jar
-Dpinpoint.agentId=<Agent's UniqueId>
-Dpinpoint.applicationName=<The name indicating a same service (AgentId collection)>

Many languages do not support bytecode instrumentation, but OW is written in Scala and runs on the JVM, so it can take advantage of it.
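
As a loose analogy for the contrast drawn above (explicit "context" arguments versus instrumentation attached from outside), here is a small Python sketch. It uses runtime method wrapping and `contextvars`, which is not JVM bytecode instrumentation and is not how Pinpoint works internally; the `Invoker` class and the `instrument` helper are hypothetical names used only for illustration.

```python
import contextvars
import functools

# Implicit per-request context, so call signatures stay unchanged.
current_trace = contextvars.ContextVar("current_trace", default=None)
recorded = []  # (trace id, function name) pairs collected by the wrapper


class Invoker:
    """Stand-in for a core component that knows nothing about tracing."""

    def run(self, action):
        return f"ran {action}"


def instrument(cls, method_name):
    """Replace cls.method_name with a traced wrapper without editing the class source."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def traced(self, *args, **kwargs):
        recorded.append((current_trace.get(), f"{cls.__name__}.{method_name}"))
        return original(self, *args, **kwargs)

    setattr(cls, method_name, traced)


# Done once at startup, roughly like attaching an agent; omit this call to disable tracing.
instrument(Invoker, "run")

current_trace.set("trace-1")
result = Invoker().run("hello")
```

The core class keeps its original signature, and tracing is switched on or off in one place, which is the decoupling argument made in this comment.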

Pinpoint provides plugins for many Java libraries such as httpclient, jetty, log4j, logback, thrift, cassandra, gson and so on. For Scala libraries, however, we would need to develop plugins.

Even though we would have to develop plugins for Scala libraries, I believe Pinpoint is still a good option.

You can refer to this for more details: The Value of Bytecode Instrumentation (https://github.com/naver/pinpoint/wiki/Technical-Overview-Of-Pinpoint#the-value-of-bytecode-instrumentation)

PS: I have not kept up with Zipkin for a few months, so if I am wrong, kindly correct me : )

codefromthecrypt · 2017-04-28T10:13:49Z

I like Pinpoint, and Hyun and folks are earnest and welcome OSS folks. I would say that there's a false dichotomy though. There's nothing that says Zipkin instrumentation must not use bytecode instrumentation. For example, we have tossed around some sort of integration between Pinpoint and Zipkin, and there is already progress from stagemonitor, which also does bytecode instrumentation. Regardless, it's a fair and useful chat to see if OpenWhisk prefers to do bytecode instrumentation and/or whether Pinpoint is a better choice for that or other reasons. That's a call folks here should make. Thanks for asking the hard questions!

style95 · 2017-05-02T01:12:16Z

@adriancole
I also like Zipkin; a few years ago I tried to make a go-zipkin library.
Anyway, I am curious about the integration between Pinpoint and Zipkin.
In which parts could Pinpoint and Zipkin be integrated?
Could you share the details?

I just thought that, even though Zipkin has many advantages, OW runs on the JVM and it would be great to take advantage of Pinpoint's bytecode instrumentation.
If the effort to make Zipkin use bytecode instrumentation is less than the effort to write Pinpoint plugins, there is no reason not to use Zipkin.

If you could share your experience with the integration, or with making bytecode instrumentation available in Zipkin, it would help folks in this thread figure out which is better for OW.

codefromthecrypt · 2017-05-02T01:45:57Z

> In which parts pinpoint and zipkin could be integrated? Could you share the details?

Hyun and I spoke about trying to get their collector to emit to zipkin as an alternative. Since the pinpoint model is more rich than zipkin's, it should be possible. We haven't sat down and tried, yet.

> I just thought even though zipkin has many advantages, ow is running on jvm and it is great to take advantage of bytecode instrumentation in pinpoint.

By running your own JVM, you have abilities beyond normal. For example, normally you can install whatever tracing you want when a platform is initialized. What you are hinting at is that there may be some third-party code you are running and are not otherwise able to configure. Can you enumerate concretely what is otherwise unconfigurable that makes you want to primarily use bytecode? This will help keep the discussion from wandering hypothetically.

> If effort to make zipkin to uses bytecode instrumentation is lesser than the one to make pinpoint plugin, there is no reason not to use zipkin. If you share your experience on integration or making bytecode instrumentation available in zipkin, it would help for folks in this thread to figure it out which is better for OW.

Presuming bytecode instrumentation is a requirement, you probably would need to see bytebuddy code examples using brave or similar libraries to instrument things, right?

codefromthecrypt · 2017-05-02T01:47:47Z

FWIW, here's one example of bytecode instrumentation approach (stagemonitor which uses brave) https://github.com/stagemonitor/boot-zipkin

style95 · 2017-05-02T03:03:50Z

If we can get the bytecode instrumentation feature of Pinpoint along with Zipkin's rich Scala library support, it would be great.

> By running your own JVM, you have abilities beyond normal. For example, normal you can install whatever tracing you want when a platform is initialized. What you are hinting at is that there may be some third party code you are running and not otherwise able to configure. Can you enumerate concretely what is otherwise unconfigurable that make you want to primarily use bytecode? This will help the discussion from wandering hypothetically.

My preference for bytecode instrumentation is about code changes rather than configurable components. I am not as much of an expert on distributed tracing as you, so if I am wrong, kindly correct me.

To introduce distributed tracing, we may need the following.
In the abstract, we need an interceptor that intercepts a normal remote call and manipulates/transports some information, such as the span, along with the call. Accordingly, we need to initialize the interceptor and keep the context for each remote call. But these steps have to be done manually at the code level.
That means OW core code, such as the controller and invoker, has to change to some extent. In some cases we may use a wrapper for libraries such as akka-http, which Zipkin provides.

With bytecode instrumentation, the above steps are done by the framework, which inserts the logic automatically at class-loading time. Ideally, no code changes in core modules are required. But if there is no plugin available for a library, we may need to implement a new one to apply bytecode instrumentation to the library code.

So if we can use Pinpoint along with Zipkin, that would be best.
AFAIK, Zipkin already supports many libraries, such as akka-http. If we use Pinpoint's bytecode instrumentation, there would "ideally" be no code changes in core modules, and we could still benefit from Zipkin's rich library support.

But since at first I had no idea how to integrate the two frameworks, I preferred Pinpoint.

codefromthecrypt · 2017-05-02T04:20:32Z

If you are unable to control the libraries OpenWhisk uses, and there is an agent that somehow does, it sounds like you should do what you prefer, which is to use an agent. Just know that you don't live in isolation: make sure whatever your agent uses for propagation can interop with others. Either that, or tell all consumers that they too need to use the same agent. Most frameworks are aware of the libraries they use and so don't need to rely on bytecode instrumentation. Black-box instrumentation is usually done when frameworks haven't chosen a path for tracing. Frameworks that employ instrumentation directly can easily unit test their tracing code and guarantee that things like remote in and out are traced. This is actually the first conversation I have had with a framework that is in control of its code, yet prefers to rely on agents to do tracing. Your call of course, so go with whatever you like, knowing the pros and cons!

codefromthecrypt · 2017-05-02T05:39:48Z

fyi all @Xylus is the person on pinpoint I've chatted with most. He's pretty aware of differences between it and Zipkin, too, at least from a high level. Although subject to hands available, I know both of us are happy to facilitate things that can make interop smoother.

Regardless, it would be cool to let him know if you end up using pinpoint, and I'm sure he'd answer any questions you have. Best wishes!

ddragosd · 2017-05-09T17:54:04Z

the question that I have if we are to use pinpoint is: can we track 1 activation through controller -> kafka -> invoker and back ?

codefromthecrypt · 2017-05-10T00:03:36Z

> the question that I have if we are to use pinpoint is: can we track 1 activation through controller -> kafka -> invoker and back ?

In either approach, you'd need to either hijack the message key or wrap the body to propagate the trace through Kafka. Brave folks use the former sometimes. There's also this, which will make propagation very easy, likely for either approach, but which version it goes into is unknown: https://cwiki.apache.org/confluence/display/KAFKA/KIP-82+-+Add+Record+Headers

PS: is your use of Kafka SPSC (one producer sends a message and only one consumer gets it)?
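
If record headers along the lines of the KIP-82 proposal linked above were available, propagation could avoid touching the key or the body. The sketch below models a record as a plain dataclass rather than using any real Kafka client, and the header names `x-trace-id` and `x-parent-span-id` are hypothetical, not a Zipkin or Kafka convention.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Record:
    """Minimal model of a message with headers; not a real Kafka client type."""
    key: Optional[bytes]
    value: bytes
    headers: List[Tuple[str, bytes]] = field(default_factory=list)


TRACE_HEADER = "x-trace-id"          # hypothetical header name
PARENT_HEADER = "x-parent-span-id"   # hypothetical header name


def inject(record: Record, trace_id: str, span_id: str) -> Record:
    """Producer side: attach trace context as headers, leaving key and value untouched."""
    record.headers.append((TRACE_HEADER, trace_id.encode("ascii")))
    record.headers.append((PARENT_HEADER, span_id.encode("ascii")))
    return record


def extract(record: Record) -> Optional[Tuple[str, str]]:
    """Consumer side: return (trace_id, parent_span_id), or None if the record is untraced."""
    headers = dict(record.headers)
    if TRACE_HEADER not in headers or PARENT_HEADER not in headers:
        return None
    return headers[TRACE_HEADER].decode("ascii"), headers[PARENT_HEADER].decode("ascii")


traced = inject(Record(key=b"activation-1", value=b"{}"), "abc123", "def456")
untraced = Record(key=b"activation-2", value=b"{}")
```

Because the key and value are unchanged, consumers that know nothing about tracing can keep reading the same messages.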

rabbah · 2017-05-10T00:12:11Z

What's the desired goal? Is it the tracing of individual transactions through the system? If you look through the current system logs you can see we are already doing this. Log messages are all time stamped and some carry special markers which are deltas since a previous marker. You can trace how long an activation request for example spent in the database query vs Kafka vs invoker etc. Does it make sense to tie into the existing instrumentation vs adding new instrumentation?

ddragosd · 2017-05-10T02:50:42Z

> Does it make sense to tie into the existing instrumentation vs adding new instrumentation?

I assume you're talking about whisk.common package, right ?
Your question makes me think of the Logging trait with a ZipkinStreamLogging implementation, reusing the existing markers. In this way, whoever wants to print to disk and consume disk I/O can do it, and whoever wants to send those markers into Zipkin could in theory do that; or both. By reusing the transactionId as the parent SpanID, we could build a simple 1-level hierarchy and log those markers. A distributed tracing system makes sense here because, after a certain throughput, disk I/O becomes a bottleneck for logging, or we may not want to capture everything but just sample a subset of all requests flowing through the system.

@rabbah is this in the same line with what you were thinking ?
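
The idea of one logging interface with two sinks can be sketched as follows. This is an illustration in Python, not OpenWhisk's Scala code: the `emit` signature, the log line format, and the `SpanLogging` class are assumptions made for the example, and the spans are only collected in memory instead of being sent to Zipkin.

```python
import io
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Span:
    trace_id: str      # reuses the transaction id, as suggested above
    name: str          # marker name
    start_ms: int
    duration_ms: int


class Logging(ABC):
    @abstractmethod
    def emit(self, transaction_id: str, marker: str, state: str, timestamp_ms: int) -> None:
        ...


class PrintStreamLogging(Logging):
    """Formats each marker as a text line (the line format here is made up)."""

    def __init__(self, stream):
        self.stream = stream

    def emit(self, transaction_id, marker, state, timestamp_ms):
        self.stream.write(f"[{timestamp_ms}] [#tid_{transaction_id}] [marker:{marker}_{state}]\n")


class SpanLogging(Logging):
    """Pairs start/finish markers of the same transaction into spans."""

    def __init__(self):
        self.open: Dict[Tuple[str, str], int] = {}
        self.spans: List[Span] = []

    def emit(self, transaction_id, marker, state, timestamp_ms):
        key = (transaction_id, marker)
        if state == "start":
            self.open[key] = timestamp_ms
        elif state == "finish" and key in self.open:
            start = self.open.pop(key)
            self.spans.append(Span(transaction_id, marker, start, timestamp_ms - start))


text_log, span_log = PrintStreamLogging(io.StringIO()), SpanLogging()
for sink in (text_log, span_log):
    sink.emit("42", "controller_activation", "start", 1000)
    sink.emit("42", "controller_activation", "finish", 1035)
```

Both sinks receive the same marker events, so the instrumentation points in the core code would not need to change when a tracing backend is added or removed.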

codefromthecrypt · 2017-05-10T03:24:36Z

I think the key thing is that distributed tracing adds causality (vs correlation which could be off due to clock skew etc). Here's some details on the usual suspects https://speakerdeck.com/adriancole/observability-3-ways-logging-metrics-and-tracing

ddragosd · 2017-05-10T23:36:40Z

@adriancole ❤️ your slide deck ! thanks for sharing !

@sandeep-paliwal it looks like the TransactionId class has the duration and timestamp information needed for a span. So it's possible that, if the logging.emit method called from this class is updated to include this information, PrintStreamLogging would format a String message to be printed to the console, while a new ZipkinLogging implementation would send a Span. In this case the createMessageWithMarker method would move inside the PrintStreamLogging impl. @rabbah feel free to keep me honest. The remaining problem is how to generate child spans for invoker, DB ops, docker, etc.

style95 · 2017-05-11T01:30:09Z

@rabbah JFYI, many tracing systems provide a rich UI with additional information such as JVM heap size, PermGen, CPU usage and so on.

Since all this information, including the transaction trace, can be seen together, it is easy to find the problem and see what was happening in the system at that time.

As @adriancole shared, we can also populate tracing information from logs.
(That's a cool slide deck; I got a clear view of logs, metrics and tracing ❤️)

Generally there might not be a big difference if we manually deploy a system that collects all the logs and a few required metrics and creates a combined view of them.
(Though we would have to clear many hurdles, such as timing issues among distributed servers.)

JFYI, the following screenshots show what Pinpoint provides:

Server map with request latency map
Call stack with exception
Heap, PermGen, CPU usage

codefromthecrypt · 2017-05-11T01:39:11Z

> @rabbah JFYI, many tracing systems provide rich UI and additional information such as JVM heap size, PermGen usage, cpu usage and so on.

I think a good analogy for this is that some APMs are tracing systems and some tracing systems are APMs, but not all tracing systems are also APMs. Pinpoint aims to replace things like new relic, and has a vastly larger feature set than distributed tracing. I actually intend to make a deck about this point, as it is also something usually confused!

rabbah · 2017-05-11T01:59:12Z

Hence the reason for my question. If the goal is to trace a transaction through the system, we have a mechanism for that. As was suggested you can build an adaptor to the existing logger (one or more) and consume these in your favorite stack. The metrics you describe (heap, cpu, etc) serve a different purpose. I've used, and built, tools for this. And you won't get that from the logs of course. But for the latter I would rather we made it plugable so that developers and operators can deploy their preferred tools and we don't have to pick and choose one.

codefromthecrypt · 2017-05-11T02:13:34Z

> Hence the reason for my question. If the goal is to trace a transaction through the system, we have a mechanism for that.

you have a means to stain logs with transaction ids, but that won't give causality, right? It won't be able to model any parent-child relationships as they occur. Any system doing this would need to have a way to indicate parent/child relationships. So, I would say you have a correlation system in place, but not tracing. If that's what was meant by this issue, then maybe close?
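
The difference between correlation and causality described here can be illustrated with a short sketch. The `Span` shape and the span names below are hypothetical; the point is only that explicit parent ids let the call structure be rebuilt without relying on timestamps, which a shared transaction id alone cannot do.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str


def render_tree(spans: List[Span]) -> List[str]:
    """Return one indented line per span, with children listed under their parent."""
    children: Dict[Optional[str], List[Span]] = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)

    lines: List[str] = []

    def visit(parent_id: Optional[str], depth: int) -> None:
        for span in children[parent_id]:
            lines.append("  " * depth + span.name)
            visit(span.span_id, depth + 1)

    visit(None, 0)
    return lines


# Spans arrive in arbitrary order; no timestamps are needed to recover the structure.
spans = [
    Span("c", "b", "docker run"),
    Span("a", None, "controller: invoke action"),
    Span("b", "a", "invoker: activation"),
    Span("d", "a", "db: write activation record"),
]
tree = render_tree(spans)
```

With only a transaction id on each log line, the same four events could be grouped together but would have to be ordered by timestamps, which is where clock skew becomes a problem.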

rabbah · 2017-05-11T04:29:43Z

We shouldn't close the issue because it's not clear, from the discussion, that there's agreement on the goal or desired outcome.

My point is that we should not pick one platform or another. Instead we should make it possible to use any of the tools described here, or others, because, for example, an organization might already have a standard platform and policies, and we should not force a particular choice. For operators of the platform, there are many metrics that are useful to generate, monitor, and establish alerts for.

Often, tracing a transaction through the system (which I've described above) serves a different purpose from the causal analysis I think you're alluding to.

codefromthecrypt · 2017-05-11T04:39:42Z

In situations like these, where there isn't a clear goal or direction from the project itself, I usually defer to what end users ask for and see if you can make that possible. Arbitrary portability isn't likely to predict what users want. E.g. if users are clamoring for X and Y, see how to make those work together. If no one is clamoring except people who are not end users, wait until that is not the case. My 2p

sandeep-paliwal · 2017-05-16T12:56:56Z

Thanks @ddragosd and @rabbah. I am working along the lines suggested in the previous comments to integrate tracing with the existing logging. I have made progress in getting the trace working with Zipkin in the context of a given action invocation. Next, I need to get the markers used in Logging to work with tracing and, in general, fit the tracing changes into the existing logging.

sandeep-paliwal · 2017-05-22T12:49:21Z

I am close to finishing the tracing changes and raising a PR. I thought I would share some tracing screenshots that make use of the existing OpenWhisk LogMarkers.

Action invocation with Cold container

Action Invocation with warm container

jthomas · 2017-05-22T13:39:50Z

Wow, very cool to see this coming along.

ddragosd · 2017-05-22T17:32:07Z

looking awesome @sandeep-paliwal. Besides seeing it in action I'm looking forward to seeing how you've managed to setup a depth of 3 with child spans from the ZipkinLogging class.

sandeep-paliwal · 2017-05-24T10:15:49Z

Hi,
I have created a Pull request for this - #2282

ddragosd · 2017-05-26T15:10:07Z

@adriancole I'd be interested to get your thoughts on whether Tracing should be used in Prod ( with sampling ) vs Non-Prod environments.

codefromthecrypt · 2017-06-06T05:51:03Z

> @adriancole I'd be interested to get your thoughts on whether Tracing should be used in Prod ( with sampling ) vs Non-Prod environments.

tracing is for troubleshooting production, but it can also be used for non-prod :) typical concerns are volume of trace data, which implies sampling policy

Enables Tracing support via Zipkin and OpenTracer. It can be enabled via config:

    tracing {
      zipkin {
        url = "http://localhost:9411" // URL to connect to the Zipkin server
        // sample-rate decides whether a request is sampled or not
        sample-rate = "0.01" // sample 1% of requests by default
      }
    }

Tracing enables tracking of a request from Controller to Invoker.
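
To make the effect of a `sample-rate` of `0.01` concrete, here is a minimal sketch of a sampling decision. It assumes a decision derived from the trace id so that every component reaches the same answer for the same request; this is an illustrative technique, not a description of how the merged change or Zipkin implements sampling, and `is_sampled` is a hypothetical helper.

```python
import secrets


def is_sampled(trace_id: str, sample_rate: float) -> bool:
    """Decide from the trace id alone, so every component reaches the same answer."""
    if sample_rate <= 0.0:
        return False
    if sample_rate >= 1.0:
        return True
    # Map the 64-bit hex id onto [0, 1) and compare against the rate.
    return int(trace_id, 16) / 2**64 < sample_rate


# With random 64-bit trace ids, roughly 1% of requests are kept at a rate of 0.01.
trace_ids = [secrets.token_hex(8) for _ in range(100_000)]
kept = sum(is_sampled(t, 0.01) for t in trace_ids)
```

Lowering the rate reduces the volume of trace data sent to the collector, which is the production concern raised in the comment above.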

rabbah · 2018-07-08T03:31:00Z

The PR is now merged 🎉

rabbah added enhancement monitoring labels Apr 27, 2017

rabbah closed this as completed Jul 8, 2018
