Distributed tracing support in Openwhisk #2192

Closed
sandeep-paliwal opened this issue Apr 27, 2017 · 37 comments

Comments

@sandeep-paliwal
Contributor

I am creating this enhancement to discuss and add distributed tracing support to the OpenWhisk project.
It will help gather latency data for the various components (controller, invoker, etc.) in different usage contexts (e.g. action invocation). This data can help troubleshoot performance bottlenecks.

One option is to use Zipkin - http://zipkin.io/
There is already an instrumented library that supports the Akka and Spray frameworks - https://github.com/levkhomich/akka-tracing

I have done some initial work around this and I can contribute it back once it's complete.
It would be nice to get thoughts/concerns/suggestions around this.

@markusthoemmes
Contributor

Hi, thank you very much for working on this, sounds like an awesome idea.

One important historic bit though: We've kinda built our own tracing using our logs (see the markers we attach to our logs). Those are parseable by, for example, Elasticsearch.

It would be nice to integrate with this and add as little code overhead as possible, since we've been very cautious about instrumenting our code that way.

Other than that: Looking forward to your pull-request. 👍

@jthomas
Member

jthomas commented Apr 27, 2017

I did experiment with a client-side wrapper for actions to enable zipkin support:
https://github.com/jthomas/zipkin-instrumentation-openwhisk

It does work but there are performance issues. This has also been discussed on the mailing list.
👍 for adding this into the core platform.

@jthomas
Member

jthomas commented Apr 27, 2017

cc @adriancole

@JonathanMace

Damn, small world @jthomas.

One option would be to consider slightly more general-purpose context propagation and deploy Zipkin on top of that, see (work in progress / under submission): https://github.com/JonathanMace/tracingplane

@codefromthecrypt

@JonathanMace nice to see you! here or elsewhere we should revisit brave-tracingplane I have some examples of swapping its in-process propagation innards here https://github.com/openzipkin/brave/blob/master/brave/src/test/java/brave/features/log4j2_context/Log4JThreadContextTest.java

else - most recent scala-brave stuff here, though this is not to presume what is best fit for this project https://github.com/bizreach/play-zipkin-tracing mainly the underlying tracer api is better now, so worth a look

@codefromthecrypt

fyi I bumped the existing akka project in case they are interested in updating to latest greatest levkhomich/akka-tracing#95

@jthomas
Member

jthomas commented Apr 27, 2017

Hey @JonathanMace - good to hear from you. Long time no see :)

@sandeep-paliwal
Contributor Author

Thanks everyone for the feedback on this.
@adriancole Can I use this https://github.com/openzipkin/zipkin-reporter-java for span propagation via Kafka?
Something like: a Producer (Akka-based actor) starts a span and sends it over Kafka to a Consumer (Akka-based actor). The Consumer then creates its own span with the Producer span as its parent.
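
A minimal sketch of that producer/consumer flow, assuming the trace context travels in message headers. The header names follow the B3 convention used by Zipkin; `SpanContext`, `inject`, `extract_child` and the in-memory `topic` list are hypothetical stand-ins for illustration, not part of any OpenWhisk or Zipkin API.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


def new_id() -> str:
    """Return a random 64-bit id as 16 lowercase hex characters."""
    return secrets.token_hex(8)


@dataclass(frozen=True)
class SpanContext:
    trace_id: str
    span_id: str
    parent_id: Optional[str] = None


def start_root_span() -> SpanContext:
    """Producer side: begin a new trace."""
    return SpanContext(trace_id=new_id(), span_id=new_id())


def inject(ctx: SpanContext) -> dict:
    """Producer side: serialize the context into message headers."""
    return {"X-B3-TraceId": ctx.trace_id, "X-B3-SpanId": ctx.span_id}


def extract_child(headers: dict) -> SpanContext:
    """Consumer side: continue the trace with the producer span as parent."""
    return SpanContext(
        trace_id=headers["X-B3-TraceId"],
        span_id=new_id(),
        parent_id=headers["X-B3-SpanId"],
    )


# A plain list stands in for a Kafka topic (assumption for illustration).
topic = []
producer_span = start_root_span()
topic.append({"headers": inject(producer_span), "value": b"invoke action"})

message = topic.pop(0)
consumer_span = extract_child(message["headers"])
```

Both spans share one trace id, and the consumer span points at the producer span as its parent, which is what lets a tracing backend stitch the two sides together.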

@codefromthecrypt

codefromthecrypt commented Apr 28, 2017 via email

@style95
Member

style95 commented Apr 28, 2017

It looks like many people in this thread are big fans of Zipkin.
However, one more option could be Pinpoint, which is also based on Google Dapper.
Since the core OW components run on the JVM, Pinpoint could be a good option, I think.

AFAIK, the main differences between Zipkin and Pinpoint are as follows:

Zipkin requires code changes in core components.
It keeps a "context" throughout all call stacks.
That means we have to add that "context" argument to every call.
This makes the tracing code tightly coupled with the core logic.

Pinpoint, in contrast, uses "bytecode instrumentation", so it does not require any code changes in core modules. It intercepts classes and changes their bytecode at class loading time.

image

This approach brings several advantages.

First, it hides the tracing API from core modules. Developers of core modules do not need to care about the tracing code. Tracing code is decoupled from core logic.

Second, it's easy to enable/disable the tracing.
We can enable/disable the Pinpoint tracing logic just by adding/removing JVM startup options.

-javaagent:$AGENT_PATH/pinpoint-bootstrap-$VERSION.jar
-Dpinpoint.agentId=<Agent's UniqueId>
-Dpinpoint.applicationName=<The name indicating a same service (AgentId collection)>

Many languages do not support bytecode instrumentation, but OW is written in Scala and runs on the JVM, so it can take advantage of it.

Pinpoint provides plugins for many Java libraries such as httpclient, jetty, log4j, logback, thrift, cassandra, gson and so on. But for Scala libraries, we would need to develop plugins.

Even though we would have to develop plugins for Scala libraries, I believe Pinpoint is still a good option.

You can refer to this for more details: The Value of Bytecode Instrumentation

ps. I have not kept up with Zipkin for a few months, so if I am wrong, kindly correct me : )

@codefromthecrypt

codefromthecrypt commented Apr 28, 2017 via email

@style95
Member

style95 commented May 2, 2017

@adriancole
I also like Zipkin; a few years ago I tried to write a go-zipkin library.
Anyway, I am curious about the integration between Pinpoint and Zipkin.
In which parts could Pinpoint and Zipkin be integrated?
Could you share the details?

I just thought that, even though Zipkin has many advantages, OW runs on the JVM, so it would be great to take advantage of Pinpoint's bytecode instrumentation.
If the effort to make Zipkin use bytecode instrumentation is less than the effort to write Pinpoint plugins, there is no reason not to use Zipkin.

If you could share your experience with the integration, or with making bytecode instrumentation available in Zipkin, it would help folks in this thread figure out which is better for OW.

@codefromthecrypt

codefromthecrypt commented May 2, 2017 via email

@codefromthecrypt

codefromthecrypt commented May 2, 2017 via email

@style95
Member

style95 commented May 2, 2017

If we can get the bytecode instrumentation feature of Pinpoint along with Zipkin's rich Scala library support, it would be great.

By running your own JVM, you have abilities beyond normal. For example, normally you can install whatever tracing you want when a platform is initialized. What you are hinting at is that there may be some third party code you are running and not otherwise able to configure. Can you enumerate concretely what is otherwise unconfigurable that makes you want to primarily use bytecode? This will help keep the discussion from wandering hypothetically.

My preference for bytecode instrumentation is about code changes rather than configurable components. I am not as much of an expert on distributed tracing as you are, so if I am wrong, kindly correct me.

To introduce distributed tracing, we may need the following.
In the abstract, we need an interceptor that intercepts a normal remote call and manipulates/transports some information, such as the span, along with the call. Accordingly, we need to initialize the interceptor and keep the context for each remote call. But these steps have to be done manually at the code level.
That means the OW core code, such as the controller and invoker, has to change to some degree. In some cases we may use a wrapper for libraries such as akka-http, which Zipkin provides.

With bytecode instrumentation, the above steps are done by the framework. The framework inserts the above logic automatically at class loading time. Ideally no code changes in core modules are required. But if no plugin is available for a library, we may need to implement a new one to apply bytecode instrumentation to the library code.
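
To illustrate the interceptor idea, here is a rough sketch in Python. Python has no class-load-time bytecode weaving like a JVM agent, so a wrapper patched in once at startup is used as a stand-in; `HttpClient`, `traced`, `install_tracing` and `recorded_spans` are hypothetical names made up for the example.

```python
import functools
import time

recorded_spans = []  # stand-in for a span reporter (assumption)


def traced(name):
    """Build an interceptor that times a call and records a span for it."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the span even if the wrapped call raises.
                recorded_spans.append(
                    {"name": name, "duration_s": time.monotonic() - start}
                )
        return wrapper
    return decorator


class HttpClient:
    """Hypothetical library class; it knows nothing about tracing."""

    def get(self, url):
        return f"response from {url}"


def install_tracing():
    """Patch the library once at startup, roughly what an agent does at load time."""
    HttpClient.get = traced("http.get")(HttpClient.get)


install_tracing()
result = HttpClient().get("http://invoker/health")
```

The calling code and the `HttpClient` class stay unchanged; only `install_tracing()` knows about tracing, which is the decoupling argued for above.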

So if we can use Pinpoint along with Zipkin, that would be best.
AFAIK, Zipkin already supports many libraries, such as akka-http. If we use Pinpoint's bytecode instrumentation, there would "ideally" be no code changes in core modules, and we could still take advantage of Zipkin's rich library support.

But since at first I had no idea how to integrate the two frameworks, I preferred Pinpoint.

@codefromthecrypt

codefromthecrypt commented May 2, 2017 via email

@codefromthecrypt

fyi all @Xylus is the person on pinpoint I've chatted with most. He's pretty aware of differences between it and Zipkin, too, at least from a high level. Although subject to hands available, I know both of us are happy to facilitate things that can make interop smoother.

Regardless, it would be cool to let him know if you end up using pinpoint, and I'm sure he'd answer any questions you have. Best wishes!

@ddragosd
Contributor

ddragosd commented May 9, 2017

the question that I have if we are to use pinpoint is: can we track 1 activation through controller -> kafka -> invoker and back ?

@codefromthecrypt

codefromthecrypt commented May 10, 2017 via email

@rabbah
Member

rabbah commented May 10, 2017 via email

@ddragosd
Contributor

Does it make sense to tie into the existing instrumentation vs adding new instrumentation?

I assume you're talking about the whisk.common package, right?
Your question makes me think of the Logging trait with a ZipkinStreamLogging implementation that reuses the existing markers. That way, whoever wants to print to disk and pay the disk I/O cost can do so, whoever wants to send those markers to Zipkin could in theory do that, or both. By reusing the transactionId as the parent SpanID, we could build a simple one-level hierarchy and log those markers. A distributed tracing system makes sense here because, beyond a certain throughput, disk I/O becomes a bottleneck for logging, or because we don't want to capture everything but only sample a subset of the requests flowing through the system.
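
A rough sketch of that pluggable-logging idea, written in Python rather than Scala. The class names mirror the ones mentioned above, but the method signature, the log-line format and `CompositeLogging` are assumptions made for illustration and do not reflect the actual whisk.common code.

```python
from abc import ABC, abstractmethod


class Logging(ABC):
    """Common interface: every backend receives the same marker events."""

    @abstractmethod
    def emit(self, transaction_id: str, marker: str, timestamp_ms: int) -> None:
        ...


class PrintStreamLogging(Logging):
    """Formats the marker as a text line (illustrative format only)."""

    def __init__(self):
        self.lines = []

    def emit(self, transaction_id, marker, timestamp_ms):
        line = f"[{timestamp_ms}] [#tid_{transaction_id}] [marker:{marker}]"
        self.lines.append(line)
        print(line)


class ZipkinStreamLogging(Logging):
    """Turns the same marker into a span-like record instead of a log line."""

    def __init__(self):
        self.spans = []

    def emit(self, transaction_id, marker, timestamp_ms):
        self.spans.append({
            "traceId": transaction_id,
            "parentId": transaction_id,      # transaction id reused as parent span id
            "name": marker,
            "timestamp": timestamp_ms * 1000,  # microseconds
        })


class CompositeLogging(Logging):
    """Fans one event out to several backends ("or both")."""

    def __init__(self, *backends):
        self.backends = backends

    def emit(self, transaction_id, marker, timestamp_ms):
        for backend in self.backends:
            backend.emit(transaction_id, marker, timestamp_ms)


printer, zipkin = PrintStreamLogging(), ZipkinStreamLogging()
logging = CompositeLogging(printer, zipkin)
logging.emit("42", "controller_activation_start", 1494400000000)
```

Because every span uses the transaction id as its parent, the result is the flat one-level hierarchy described above; deeper nesting would need extra parent information per marker.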

@rabbah is this in the same line with what you were thinking ?

@codefromthecrypt

codefromthecrypt commented May 10, 2017 via email

@ddragosd
Contributor

@adriancole ❤️ your slide deck ! thanks for sharing !

@sandeep-paliwal it looks like the TransactionId class has the duration and timestamp information needed for a span. So if the logging.emit method called from this class is updated to include that information, PrintStreamLogging would format a String message to be printed to the console, while a new ZipkinLogging implementation would send a Span. In that case the createMessageWithMarker method would move inside the PrintStreamLogging impl. @rabbah feel free to keep me honest. The remaining problem is how to generate child spans for the invoker, DB ops, docker, etc.
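
As a sketch of how timestamped markers could become spans with a duration, assuming markers arrive as `<name>_start` / `<name>_finish` pairs per transaction: the marker names, the `Span` fields and `SpanAssembler` below are hypothetical and are not taken from the OpenWhisk code base.

```python
from dataclasses import dataclass


@dataclass
class Span:
    name: str
    timestamp_us: int  # start time in microseconds
    duration_us: int


class SpanAssembler:
    """Pairs start/finish markers of one transaction into finished spans."""

    def __init__(self):
        self._open = {}     # (transaction_id, name) -> start time in ms
        self.finished = []

    def on_marker(self, transaction_id: str, marker: str, timestamp_ms: int) -> None:
        # Split "invoker_activationRun_start" into name and state.
        name, _, state = marker.rpartition("_")
        key = (transaction_id, name)
        if state == "start":
            self._open[key] = timestamp_ms
        elif state in ("finish", "error") and key in self._open:
            started = self._open.pop(key)
            self.finished.append(
                Span(name, started * 1000, (timestamp_ms - started) * 1000)
            )


assembler = SpanAssembler()
assembler.on_marker("tid1", "invoker_activationRun_start", 1000)
assembler.on_marker("tid1", "invoker_activationRun_finish", 1250)
```

This only covers pairing and timing; it does not address the open question above of how child spans for the invoker, DB operations or docker would be related to each other.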

@style95
Member

style95 commented May 11, 2017

@rabbah JFYI, many tracing systems provide a rich UI with additional information such as JVM heap size, PermGen, CPU usage and so on.

Since all of this information, including the transaction traces, can be seen together, it is easier to find a problem and to see what was happening in the system at that time.

As @adriancole shared, we can also populate tracing information from logs.
(That's a cool slide deck, it gave me a clear view of logs, metrics and tracing ❤️)

In general there might not be a big difference if we manually deploy a system that collects all the logs and a few required metrics and builds a combined view of them.
(Though we would have to clear many hurdles, such as timing issues among distributed servers, etc.)

JFYI, the following screenshots show what Pinpoint provides.

  1. Server map with request latency map
    ss_server-map

  2. Call stack with exception
    ss_call-stack

  3. Heap, PermGen, CPU usage
    ss_inspector

@codefromthecrypt

codefromthecrypt commented May 11, 2017 via email

@rabbah
Member

rabbah commented May 11, 2017 via email

@codefromthecrypt

codefromthecrypt commented May 11, 2017 via email

@rabbah
Member

rabbah commented May 11, 2017

We shouldn't close the issue because it's not clear, from the discussion, that there's agreement on the goal or desired outcome.

My point is that we should not pick one platform or another. Instead we should make it possible to use any of the tools described here, or others, because, for example, an organization might already have a standard platform and policies, and we should not force a particular choice. For operators of the platform, there are many metrics that are useful to generate, monitor and set up alerts for.

Often, tracing a transaction through the system (which I've described above) serves a different purpose from the causal analysis I think you're alluding to.

@codefromthecrypt

codefromthecrypt commented May 11, 2017 via email

@sandeep-paliwal
Contributor Author

Thanks @ddragosd and @rabbah. I am working along the lines suggested in the previous comments to integrate tracing with the existing logging. I have made progress in getting the trace working with Zipkin in the context of a given action invocation. Next is getting the markers used in Logging to work with tracing and, in general, fitting the tracing changes into the existing logging.

@sandeep-paliwal
Contributor Author

I am close to finishing the tracing changes and raising a PR. I thought I would share some tracing screens that make use of the existing OpenWhisk LogMarkers.

  1. Action invocation with Cold container

screen1

  2. Action invocation with warm container

screen2

@jthomas
Member

jthomas commented May 22, 2017

Wow, very cool to see this coming along.

@ddragosd
Contributor

looking awesome @sandeep-paliwal. Besides seeing it in action, I'm looking forward to seeing how you've managed to set up a depth of 3 with child spans from the ZipkinLogging class.

@sandeep-paliwal
Contributor Author

Hi,
I have created a Pull request for this - #2282

@ddragosd
Contributor

@adriancole I'd be interested to get your thoughts on whether Tracing should be used in Prod ( with sampling ) vs Non-Prod environments.

@codefromthecrypt

codefromthecrypt commented Jun 6, 2017 via email

chetanmeh pushed a commit that referenced this issue Jun 29, 2018
Enables Tracing support via Zipkin and OpenTracer. 

It can be enabled via config

    tracing {
        zipkin {
            url = "http://localhost:9411" // URL used to connect to the Zipkin server
            // sample-rate decides whether a request is sampled or not
            sample-rate = "0.01" // sample 1% of requests by default
        }
    }

Tracing enables tracking of request from Controller to Invoker
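
For intuition about what a `sample-rate` of 0.01 means, here is a small sketch of one common way to make a sampling decision: derive it deterministically from the trace id so that every component reaches the same answer for the same request. This is an illustrative assumption, not how the merged change decides sampling; `is_sampled` is a hypothetical helper.

```python
def is_sampled(trace_id_hex: str, sample_rate: float) -> bool:
    """Decide whether a trace is kept, using only its id and the rate.

    The id is mapped to one of 10,000 buckets; the trace is sampled when
    its bucket falls below the rate's share of those buckets.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    bucket = int(trace_id_hex, 16) % 10_000
    return bucket < round(sample_rate * 10_000)


# With a rate of 0.01, buckets 0..99 out of 10,000 are sampled.
low_bucket = is_sampled("0000000000000063", 0.01)   # bucket 99
high_bucket = is_sampled("0000000000000064", 0.01)  # bucket 100
```

A deterministic decision like this avoids the case where the controller samples a request but the invoker does not, which would leave a trace with missing spans.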
@rabbah
Member

rabbah commented Jul 8, 2018

The PR is now merged 🎉

@rabbah rabbah closed this as completed Jul 8, 2018
BillZong pushed a commit to BillZong/openwhisk that referenced this issue Nov 18, 2019