- Author(s): Noah Eisen ([email protected])
- Approver: markdroth
- Status: Draft
- Implemented in: N/A
- Last updated: 2018-07-01
- Discussion at: https://groups.google.com/forum/#!topic/grpc-io/WFDj3KeHYTI
This design uses the same terminology discussed in the channelz design. Channels and subchannels will be handled by the same channel tracing code, so for this design, a channel means either a channel or subchannel from the channelz terminology.
In addition, the following terminology will be used:
- a "Trace event" is an interesting thing that happens to a channel. Examples include creation, address resolution, subchannel creation, and connectivity state changes. Some Trace events (like a new subchannel being created) will refer to the ChannelData of the relevant channel or subchannel.
- a "Channel trace" is a data structure responsible for holding all trace data for a single channel. This includes a list of Trace events, as well as metadata like the timestamp at which the channel was created.
This document proposes adding a dedicated Channel trace for every channel and subchannel. The Channel trace will record important events in the life of a channel, like address resolution, subchannel creation, channel state changes, etc. The data from this Channel trace will be made available through the channelz service.
Channel connectivity issues are the root cause of a significant portion of user-reported gRPC bugs. Channel tracing will be helpful for getting live channel data from a misbehaving program.
The data from the Channel trace will be exposed via the channelz service.
Because tracing objects can consume a large amount of memory, care must be taken to prevent the Channel trace from using too many resources. Implementations MUST provide some control for limiting the amount of memory used for channel trace events, such as a maximum number of trace events per node or a maximum amount of memory. Once that maximum is reached, adding a new Trace event will cause the oldest Trace events to be removed until the limit is satisfied again.
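The eviction rule above can be sketched as a bounded buffer. This is a minimal illustration, not the implementation of any gRPC language: the class names and the `max_events` parameter are hypothetical, and it uses the "max number of trace events per node" option rather than a byte limit. The `num_events_logged` counter mirrors the field of the same name in the ChannelTrace proto.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Deque


@dataclass
class TraceEvent:
    description: str
    severity: str = "CT_INFO"
    timestamp: float = field(default_factory=time.time)


class ChannelTrace:
    """Holds at most max_events Trace events; the oldest are evicted first."""

    def __init__(self, max_events: int) -> None:
        if max_events < 1:
            raise ValueError("max_events must be at least 1")
        self.max_events = max_events        # hypothetical per-node limit
        self.creation_timestamp = time.time()
        self.num_events_logged = 0          # counts every event ever added
        self.events: Deque[TraceEvent] = deque()

    def add_event(self, description: str, severity: str = "CT_INFO") -> None:
        self.events.append(TraceEvent(description, severity))
        self.num_events_logged += 1
        # Remove the oldest events until the limit holds again.
        while len(self.events) > self.max_events:
            self.events.popleft()


trace = ChannelTrace(max_events=3)
for text in ["Channel created", "Address resolved", "Address picked",
             "Starting TCP connection", "TCP connection established"]:
    trace.add_event(text)
retained = [event.description for event in trace.events]
```

After five events with a limit of three, `retained` holds only the three newest descriptions while `trace.num_events_logged` is 5, which is why the two values can differ when the trace is reported.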
The Channel trace for a given channel or subchannel must be maintained as long as there are any Trace events that refer to the channel or subchannel. After the last Trace event that refers to the channel or subchannel is removed, the Channel trace for that channel or subchannel may be cleaned up.
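One way to picture this lifetime rule is with an explicit count of the Trace events that refer to each node. The sketch below is an assumption-laden illustration: `TraceRegistry`, its methods, and the string node ids are all hypothetical, and a real implementation would more likely rely on its language's own reference counting or garbage collection.

```python
from collections import deque


class TraceRegistry:
    """Hypothetical owner of the Channel trace of every channel and subchannel."""

    def __init__(self, max_events):
        self.max_events = max_events
        self.traces = {}       # node id -> deque of (description, child id or None)
        self.ref_counts = {}   # node id -> number of Trace events referring to it
        self.destroyed = set() # nodes whose channel or subchannel is gone

    def create(self, node_id):
        self.traces[node_id] = deque()
        self.ref_counts[node_id] = 0

    def add_event(self, node_id, description, child_id=None):
        events = self.traces[node_id]
        events.append((description, child_id))
        if child_id is not None:
            self.ref_counts[child_id] += 1
        # Evict the oldest events, releasing any child they referred to.
        while len(events) > self.max_events:
            _, evicted_child = events.popleft()
            if evicted_child is not None:
                self._release(evicted_child)

    def destroy(self, node_id):
        self.destroyed.add(node_id)
        self._maybe_clean_up(node_id)

    def _release(self, node_id):
        self.ref_counts[node_id] -= 1
        self._maybe_clean_up(node_id)

    def _maybe_clean_up(self, node_id):
        # A trace may be dropped only once its node is gone AND unreferenced.
        if node_id in self.destroyed and self.ref_counts[node_id] == 0:
            for _, child in self.traces.pop(node_id):
                if child is not None:
                    self._release(child)
            del self.ref_counts[node_id]
            self.destroyed.discard(node_id)


registry = TraceRegistry(max_events=2)
registry.create("channel-1")
registry.create("subchannel-7")
registry.add_event("channel-1", "Subchannel created", child_id="subchannel-7")
registry.destroy("subchannel-7")
kept_while_referenced = "subchannel-7" in registry.traces
registry.add_event("channel-1", "Address resolved")
registry.add_event("channel-1", "Entering idle mode")  # evicts "Subchannel created"
cleaned_up_after_eviction = "subchannel-7" not in registry.traces
```

In this run the subchannel's trace survives the subchannel itself because the parent still holds a "Subchannel created" event, and it is cleaned up only when that event is evicted from the parent's trace.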
Trace events should only be added for events that happen relatively infrequently (not on a per-RPC basis). An example list of Trace events from a healthy channel might look like:
... Channel created
... Address resolved: 8.8.8.8:443
... Address picked: 8.8.8.8:443
... Starting TCP connection
... TCP connection established
... Auth handshake complete
... Entering idle mode
We define a minimal set of events that must be traced, in order to consider the channel tracing feature as complete.
All of these events must be traced:
- Channel creation and deletion
- Subchannel creation and deletion
- Channel connectivity state changes
- Subchannel connectivity state changes
- Interesting address resolution events (see below for details)
Address resolution is a special case. We want to track resolution events, but we do not want to flood the trace buffer with them. Such flooding can occur in environments with push-based resolvers and dynamically scheduled backends (as is the case internally at Google). So we define several types of "interesting" resolution events that must be traced:
- Address resolution resulting in service config change
- Address resolution that causes the number of backends to go from zero to non-zero
- Address resolution that causes the number of backends to go from non-zero to zero
- Address resolution that causes a new LB policy to be created
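The four cases above amount to a filter that compares each resolver result with the previous one. The sketch below shows one possible shape for that filter. `ResolverResult`, its fields, and `interesting_events` are hypothetical names, and treating a change of LB policy name as "a new LB policy is created" is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ResolverResult:
    addresses: Tuple[str, ...]
    service_config: Optional[str]  # serialized service config, if any
    lb_policy: str                 # name of the LB policy this result selects


def interesting_events(previous: Optional[ResolverResult],
                       current: ResolverResult) -> List[str]:
    """Return trace descriptions for the resolution changes worth recording."""
    prev_addresses = previous.addresses if previous else ()
    prev_config = previous.service_config if previous else None
    prev_policy = previous.lb_policy if previous else None

    events = []
    if current.service_config != prev_config:
        events.append("Service config changed")
    if not prev_addresses and current.addresses:
        events.append("Backends went from zero to non-zero")
    if prev_addresses and not current.addresses:
        events.append("Backends went from non-zero to zero")
    if current.lb_policy != prev_policy:
        events.append(f"New LB policy created: {current.lb_policy}")
    return events


first = ResolverResult(("8.8.8.8:443",), None, "pick_first")
churn = ResolverResult(("8.8.4.4:443",), None, "pick_first")
empty = ResolverResult((), None, "pick_first")

on_first = interesting_events(None, first)
on_churn = interesting_events(first, churn)
on_empty = interesting_events(churn, empty)
```

Here a routine address swap (`on_churn`) produces no Trace events at all, which is the point of the filter: frequent pushes from the resolver do not crowd out older, more informative events.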
Language-specific implementations may add any additional Trace events that are deemed useful, as long as those events are not expected to happen at a per-RPC frequency.
The data will be accessed via the channelz service, which sends a ChannelTrace proto as part of a larger message concerning a channel or subchannel. The following is a relevant excerpt, taken directly from the proto definition in channelz.
// the definitions of these protos can be found in A14-channelz.md
message ChannelRef {}
message SubchannelRef {}
// A trace event is an interesting thing that happened to a channel or
// subchannel, such as creation, address resolution, subchannel creation, etc.
message ChannelTraceEvent {
  // High level description of the event.
  string description = 1;
  // The supported severity levels of trace events.
  enum Severity {
    CT_UNKNOWN = 0;
    CT_INFO = 1;
    CT_WARNING = 2;
    CT_ERROR = 3;
  }
  // the severity of the trace event
  Severity severity = 2;
  // When this event occurred.
  google.protobuf.Timestamp timestamp = 3;
  // ref of referenced channel or subchannel.
  // Optional, only present if this event refers to a child object. For example,
  // this field would be filled if this trace event was for a subchannel being
  // created.
  oneof child_ref {
    ChannelRef channel_ref = 4;
    SubchannelRef subchannel_ref = 5;
  }
}
message ChannelTrace {
  // Number of events ever logged in this tracing object. This can differ from
  // events.size() because events can be overwritten or garbage collected by
  // implementations.
  int64 num_events_logged = 1;
  // Time that this channel was created.
  google.protobuf.Timestamp creation_timestamp = 2;
  // List of events that have occurred on this channel.
  repeated ChannelTraceEvent events = 3;
}