Skip to content

Commit

Permalink
[PROF-9983] Leverage local symbols for otel-profiling-agent (#23)
Browse files Browse the repository at this point in the history
* symbolication: support uploading local symbols remotely

* readme: document experimental local symbol upload

* symbolication: remove noisy error log

when we only filter out fs.ErrNotExist, we get frequent failures
in the local development environment with the following error:
/proc/<pid>/root// is a directory

we seem to think that /proc/<pid>/root// is a executable path,
this happens for short-lived runc processes where /proc/<pid>/maps
has "/" as the pathname for the runc executable itself.

so when we try to open the elf file, we first try to read it from
/proc/<pid>/map_files and (as the process is short lived) when
this fails, we try to open /proc/<pid>/root// which fails with
a "is a directory" error instead of fs.ErrNotExist

to fix this, we just ignore all errors returned by GetElf() and
only log them for debug purposes.

https://github.com/elastic/otel-profiling-agent/blob/0945fe6/libpf/process/process.go#L221-L239

* uploader: fix segfault on failure to init datadog uploader (#26)

when we fail to initialize the datadog symbol uploader, we set the
uploader to nil, which will cause a panic later on when handling
a new executable for symbol upload.

this sets the uploader to symbolication.NewNoopUploader() when the
datadog uploader is set up, but fails to be initialized.
  • Loading branch information
Gandem committed Jun 26, 2024
1 parent 767d031 commit a28575a
Show file tree
Hide file tree
Showing 16 changed files with 442 additions and 30 deletions.
13 changes: 13 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,18 @@ sudo otel-profiling-agent -tags 'service:myservice;remote_symbols:yes' -collecti

For this to work you need to run a Datadog agent that listens for APM traffic at `localhost:8126`. If your agent is reachable under a different address, you can modify the `-collection-agent` parameter accordingly.

## Configuration

### Local symbol upload (Experimental)

For compiled languages (C/C++/Rust/Go), the profiling-agent can upload local symbols (when available) to Datadog for symbolication. Symbols need to be available locally (unstripped binaries).

To enable local symbol upload:
1. Set the `DD_EXPERIMENTAL_LOCAL_SYMBOL_UPLOAD` environment variable to `true`.
2. Provide a Datadog API key through the `DD_API_KEY` environment variable.
3. Set the `DD_SITE` environment variable to [your Datadog site](https://docs.datadoghq.com/getting_started/site/#access-the-datadog-site) (e.g. `datadoghq.com`).


## Development

A `docker-compose.yml` file is provided to help run the agent in a container for local development.
Expand All @@ -45,6 +57,7 @@ DD_API_KEY=your-api-key # required
DD_SITE=datadoghq.com # optional, defaults to "datadoghq.com"
OTEL_PROFILING_AGENT_SERVICE=my-service # optional, defaults to "otel-profiling-agent-dev"
OTEL_PROFILING_AGENT_REPORTER_INTERVAL=10s # optional, defaults to 60s
DD_EXPERIMENTAL_LOCAL_SYMBOL_UPLOAD=true # optional, defaults to false
```

Then, you can run the agent with the following command:
Expand Down
9 changes: 7 additions & 2 deletions docker-compose.yml
Original file line number Diff line number Diff line change
Expand Up @@ -8,10 +8,15 @@ services:
- arch=${ARCH:?error}
privileged: true
pid: "host"
environment:
DD_SITE: ${DD_SITE:-datadoghq.com}
DD_EXPERIMENTAL_LOCAL_SYMBOL_UPLOAD: ${DD_EXPERIMENTAL_LOCAL_SYMBOL_UPLOAD:-false}
volumes:
- .:/agent
- /var/run/docker.sock:/var/run/docker.sock:ro
command: bash -c "sudo mount -t debugfs none /sys/kernel/debug && make && sudo /agent/otel-profiling-agent -tags 'service:${OTEL_PROFILING_AGENT_SERVICE:-otel-profiling-agent-dev};remote_symbols:yes' -collection-agent "http://datadog-agent:8126" -reporter-interval ${OTEL_PROFILING_AGENT_REPORTER_INTERVAL:-60s} -samples-per-second 20 -save-cpuprofile"
secrets:
- dd-api-key
command: ['/bin/sh', '-c', 'export DD_API_KEY=$$(cat /run/secrets/dd-api-key); sudo mount -t debugfs none /sys/kernel/debug && make && sudo -E /agent/otel-profiling-agent -tags "service:${OTEL_PROFILING_AGENT_SERVICE:-otel-profiling-agent-dev};remote_symbols:yes" -collection-agent "http://datadog-agent:8126" -reporter-interval ${OTEL_PROFILING_AGENT_REPORTER_INTERVAL:-60s} -samples-per-second 20 -save-cpuprofile']

datadog-agent:
image: gcr.io/datadoghq/agent:7
Expand All @@ -24,7 +29,7 @@ services:
- /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
secrets:
- dd-api-key
entrypoint: [ '/bin/sh', '-c', 'export DD_API_KEY=$$(cat /run/secrets/dd-api-key) ; /bin/entrypoint.sh' ]
entrypoint: ['/bin/sh', '-c', 'export DD_API_KEY=$$(cat /run/secrets/dd-api-key) ; /bin/entrypoint.sh']

secrets:
dd-api-key:
Expand Down
66 changes: 62 additions & 4 deletions libpf/pfelf/file.go
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,7 @@ import (
"io"
"os"
"path/filepath"
"slices"
"sort"
"syscall"
"unsafe"
Expand All @@ -49,8 +50,12 @@ const (
// parsed sections (e.g. symbol tables and string tables; libxul
// has about 4MB .dynstr)
maxBytesLargeSection = 16 * 1024 * 1024
buildIDSectionName = ".note.gnu.build-id"
)

var debugStrSectionNames = []string{".debug_str", ".zdebug_str", ".debug_str.dwo"}
var debugInfoSectionNames = []string{".debug_info", ".zdebug_info"}

// ErrSymbolNotFound is returned when requested symbol was not found
var ErrSymbolNotFound = errors.New("symbol not found")

Expand Down Expand Up @@ -109,6 +114,11 @@ type File struct {
// bias is the load bias for ELF files inside core dump
bias libpf.Address

// filePath is the path of the ELF file as opened by os.Open()
// This can be a path to a mapping file, or a path to the original ELF binary.
// This is empty is the file is opened from a coredump.
filePath string

// InsideCore indicates that this ELF is mapped from a coredump ELF
InsideCore bool

Expand Down Expand Up @@ -168,7 +178,7 @@ func Open(name string) (*File, error) {
return nil, err
}

ff, err := newFile(buffered, f, 0, false)
ff, err := newFile(name, buffered, f, 0, false)
if err != nil {
f.Close()
return nil, err
Expand All @@ -186,13 +196,15 @@ func (f *File) Close() (err error) {
}

// NewFile creates a new ELF file object that borrows the given reader.
func NewFile(r io.ReaderAt, loadAddress uint64, hasMusl bool) (*File, error) {
return newFile(r, nil, loadAddress, hasMusl)
func NewFile(path string, r io.ReaderAt, loadAddress uint64, hasMusl bool) (*File, error) {
return newFile(path, r, nil, loadAddress, hasMusl)
}

func newFile(r io.ReaderAt, closer io.Closer, loadAddress uint64, hasMusl bool) (*File, error) {
func newFile(path string, r io.ReaderAt, closer io.Closer,
loadAddress uint64, hasMusl bool) (*File, error) {
f := &File{
elfReader: r,
filePath: path,
InsideCore: loadAddress != 0,
closer: closer,
}
Expand Down Expand Up @@ -971,3 +983,49 @@ func (f *File) DynString(tag elf.DynTag) ([]string, error) {
func (f *File) IsGolang() bool {
return f.Section(".go.buildinfo") != nil || f.Section(".gopclntab") != nil
}

func (f *File) FilePath() (string, error) {
if f.InsideCore {
return "", errors.New("file path not available for ELF inside coredump")
}
return f.filePath, nil
}

// HasDWARFData is a copy of pfelf.HasDWARFData, but for the libpf.File interface.
func (f *File) HasDWARFData() bool {
hasBuildID := false
hasDebugStr := false
for _, section := range f.Sections {
// NOBITS indicates that the section is actually empty, regardless of the size in the
// section header.
if section.Type == elf.SHT_NOBITS {
continue
}

if section.Name == buildIDSectionName {
hasBuildID = true
}

if slices.Contains(debugStrSectionNames, section.Name) {
hasDebugStr = section.Size > 0
}

// Some files have suspicious near-empty, partially stripped sections; consider them as not
// having DWARF data.
// The simplest binary gcc 10 can generate ("return 0") has >= 48 bytes for each section.
// Let's not worry about executables that may not verify this, as they would not be of
// interest to us.
if section.Size < 32 {
continue
}

if slices.Contains(debugInfoSectionNames, section.Name) {
return true
}
}

// Some alternate debug files only have a .debug_str section. For these we want to return true.
// Use the absence of program headers and presence of a Build ID as heuristic to identify
// alternate debug files.
return len(f.Progs) == 0 && hasBuildID && hasDebugStr
}
18 changes: 17 additions & 1 deletion main.go
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,7 @@ import (
"github.com/elastic/otel-profiling-agent/metrics/agentmetrics"
"github.com/elastic/otel-profiling-agent/reporter"

"github.com/elastic/otel-profiling-agent/symbolication"
"github.com/elastic/otel-profiling-agent/tracer"

log "github.com/sirupsen/logrus"
Expand Down Expand Up @@ -337,8 +338,23 @@ func mainWithExitCode() exitCode {
// Start reporter metric reporting with 60 second intervals.
defer reportermetrics.Start(mainCtx, rep, 60*time.Second)()

uploader := symbolication.NewNoopUploader()

ddSymbolUpload := os.Getenv("DD_EXPERIMENTAL_LOCAL_SYMBOL_UPLOAD")
if ddSymbolUpload == "true" {
log.Infof("Enabling Datadog local symbol upload")
uploader, err = symbolication.NewDatadogUploader()
if err != nil {
log.Errorf(
"Failed to create Datadog symbol uploader, symbol upload will be disabled: %v",
err,
)
uploader = symbolication.NewNoopUploader()
}
}

// Load the eBPF code and map definitions
trc, err := tracer.NewTracer(mainCtx, rep, times, includeTracers, !argSendErrorFrames)
trc, err := tracer.NewTracer(mainCtx, rep, uploader, times, includeTracers, !argSendErrorFrames)
if err != nil {
msg := fmt.Sprintf("Failed to load eBPF tracer: %s", err)
log.Error(msg)
Expand Down
2 changes: 1 addition & 1 deletion process/coredump.go
Original file line number Diff line number Diff line change
Expand Up @@ -564,5 +564,5 @@ func (cf *CoredumpFile) ReadAt(p []byte, addr int64) (int, error) {
// The returned `pfelf.File` is borrowing the coredump file. Closing it will not close the
// underlying CoredumpFile.
func (cf *CoredumpFile) OpenELF() (*pfelf.File, error) {
return pfelf.NewFile(cf, cf.Base, cf.parent.hasMusl)
return pfelf.NewFile(cf.Name, cf, cf.Base, cf.parent.hasMusl)
}
2 changes: 1 addition & 1 deletion process/process.go
Original file line number Diff line number Diff line change
Expand Up @@ -248,7 +248,7 @@ func (sp *systemProcess) OpenELF(file string) (*pfelf.File, error) {
if err != nil {
return nil, fmt.Errorf("failed to extract VDSO: %v", err)
}
return pfelf.NewFile(vdso, 0, false)
return pfelf.NewFile(file, vdso, 0, false)
}
return pfelf.Open(sp.getMappingFile(m))
}
Expand Down
43 changes: 31 additions & 12 deletions processmanager/execinfomanager/manager.go
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@
package execinfomanager

import (
"context"
"errors"
"fmt"
"os"
Expand Down Expand Up @@ -35,6 +36,7 @@ import (
sdtypes "github.com/elastic/otel-profiling-agent/nativeunwind/stackdeltatypes"
pmebpf "github.com/elastic/otel-profiling-agent/processmanager/ebpf"
"github.com/elastic/otel-profiling-agent/support"
"github.com/elastic/otel-profiling-agent/symbolication"
"github.com/elastic/otel-profiling-agent/tpbase"
"github.com/elastic/otel-profiling-agent/util"
)
Expand Down Expand Up @@ -90,6 +92,9 @@ type ExecutableInfoManager struct {
// sdp allows fetching stack deltas for executables.
sdp nativeunwind.StackDeltaProvider

// uploader is used to upload symbolication data.
uploader symbolication.Uploader

// state bundles up all mutable state of the manager.
state xsync.RWMutex[executableInfoManagerState]

Expand All @@ -101,6 +106,7 @@ type ExecutableInfoManager struct {
// NewExecutableInfoManager creates a new instance of the executable info manager.
func NewExecutableInfoManager(
sdp nativeunwind.StackDeltaProvider,
uploader symbolication.Uploader,
ebpf pmebpf.EbpfHandler,
includeTracers config.IncludedTracers,
) (*ExecutableInfoManager, error) {
Expand Down Expand Up @@ -138,7 +144,8 @@ func NewExecutableInfoManager(
deferredFileIDs.SetLifetime(deferredFileIDTimeout)

return &ExecutableInfoManager{
sdp: sdp,
sdp: sdp,
uploader: uploader,
state: xsync.NewRWMutex(executableInfoManagerState{
interpreterLoaders: interpreterLoaders,
executables: map[host.FileID]*entry{},
Expand All @@ -154,9 +161,9 @@ func NewExecutableInfoManager(
//
// The return value is copied instead of returning a pointer in order to spare us the use
// of getters and more complicated locking semantics.
func (mgr *ExecutableInfoManager) AddOrIncRef(fileID host.FileID,
func (mgr *ExecutableInfoManager) AddOrIncRef(hostFileID host.FileID, fileID libpf.FileID,
elfRef *pfelf.Reference) (ExecutableInfo, error) {
if _, exists := mgr.deferredFileIDs.Get(fileID); exists {
if _, exists := mgr.deferredFileIDs.Get(hostFileID); exists {
return ExecutableInfo{}, ErrDeferredFileID
}
var (
Expand All @@ -169,7 +176,7 @@ func (mgr *ExecutableInfoManager) AddOrIncRef(fileID host.FileID,

// Fast path for executable info that is already present.
state := mgr.state.WLock()
info, ok := state.executables[fileID]
info, ok := state.executables[hostFileID]
if ok {
defer mgr.state.WUnlock(&state)
info.rc++
Expand All @@ -180,10 +187,11 @@ func (mgr *ExecutableInfoManager) AddOrIncRef(fileID host.FileID,
// so we release the lock before doing this.
mgr.state.WUnlock(&state)

if err = mgr.sdp.GetIntervalStructuresForFile(fileID, elfRef, &intervalData); err != nil {
if err = mgr.sdp.GetIntervalStructuresForFile(hostFileID, elfRef, &intervalData); err != nil {
if !errors.Is(err, os.ErrNotExist) {
mgr.deferredFileIDs.Add(fileID, libpf.Void{})
mgr.deferredFileIDs.Add(hostFileID, libpf.Void{})
}

return ExecutableInfo{}, fmt.Errorf("failed to extract interval data: %w", err)
}

Expand All @@ -197,21 +205,20 @@ func (mgr *ExecutableInfoManager) AddOrIncRef(fileID host.FileID,
// Re-take the lock and check whether another thread beat us to
// inserting the data while we were waiting for the write lock.
state = mgr.state.WLock()
defer mgr.state.WUnlock(&state)
if info, ok = state.executables[fileID]; ok {
if info, ok = state.executables[hostFileID]; ok {
info.rc++
return info.ExecutableInfo, nil
}

// Load the data into BPF maps.
ref, gaps, err = state.loadDeltas(fileID, intervalData.Deltas)
ref, gaps, err = state.loadDeltas(hostFileID, intervalData.Deltas)
if err != nil {
mgr.deferredFileIDs.Add(fileID, libpf.Void{})
mgr.deferredFileIDs.Add(hostFileID, libpf.Void{})
return ExecutableInfo{}, fmt.Errorf("failed to load deltas: %w", err)
}

// Create the LoaderInfo for interpreter detection
loaderInfo := interpreter.NewLoaderInfo(fileID, elfRef, gaps)
loaderInfo := interpreter.NewLoaderInfo(hostFileID, elfRef, gaps)

// Insert a corresponding record into our map.
info = &entry{
Expand All @@ -222,7 +229,19 @@ func (mgr *ExecutableInfoManager) AddOrIncRef(fileID host.FileID,
mapRef: ref,
rc: 1,
}
state.executables[fileID] = info
state.executables[hostFileID] = info
mgr.state.WUnlock(&state)

// Processing symbols for upload can take a while, so we release the lock
// before doing this.
// We also use a timeout to avoid blocking the process manager for too long.
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
defer cancel()

err = mgr.uploader.HandleExecutable(ctx, elfRef, fileID)
if err != nil {
log.Errorf("Failed to handle executable %v: %v", elfRef.FileName(), err)
}

return info.ExecutableInfo, nil
}
Expand Down
6 changes: 4 additions & 2 deletions processmanager/manager.go
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,7 @@ import (
pmebpf "github.com/elastic/otel-profiling-agent/processmanager/ebpf"
eim "github.com/elastic/otel-profiling-agent/processmanager/execinfomanager"
"github.com/elastic/otel-profiling-agent/reporter"
"github.com/elastic/otel-profiling-agent/symbolication"
"github.com/elastic/otel-profiling-agent/tracehandler"
"github.com/elastic/otel-profiling-agent/traceutil"
"github.com/elastic/otel-profiling-agent/util"
Expand Down Expand Up @@ -68,7 +69,8 @@ var (
// implementation.
func New(ctx context.Context, includeTracers config.IncludedTracers, monitorInterval time.Duration,
ebpf pmebpf.EbpfHandler, fileIDMapper FileIDMapper, symbolReporter reporter.SymbolReporter,
sdp nativeunwind.StackDeltaProvider, filterErrorFrames bool) (*ProcessManager, error) {
uploader symbolication.Uploader, sdp nativeunwind.StackDeltaProvider,
filterErrorFrames bool) (*ProcessManager, error) {
if fileIDMapper == nil {
var err error
fileIDMapper, err = newFileIDMapper(lruFileIDCacheSize)
Expand All @@ -84,7 +86,7 @@ func New(ctx context.Context, includeTracers config.IncludedTracers, monitorInte
}
elfInfoCache.SetLifetime(elfInfoCacheTTL)

em, err := eim.NewExecutableInfoManager(sdp, ebpf, includeTracers)
em, err := eim.NewExecutableInfoManager(sdp, uploader, ebpf, includeTracers)
if err != nil {
return nil, fmt.Errorf("unable to create ExecutableInfoManager: %v", err)
}
Expand Down
4 changes: 4 additions & 0 deletions processmanager/manager_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,7 @@ import (
pmebpf "github.com/elastic/otel-profiling-agent/processmanager/ebpf"
"github.com/elastic/otel-profiling-agent/remotememory"
"github.com/elastic/otel-profiling-agent/reporter"
"github.com/elastic/otel-profiling-agent/symbolication"
"github.com/elastic/otel-profiling-agent/traceutil"
"github.com/elastic/otel-profiling-agent/util"

Expand Down Expand Up @@ -322,6 +323,7 @@ func TestInterpreterConvertTrace(t *testing.T) {
nil,
&symbolReporterMockup{},
nil,
nil,
true)
require.NoError(t, err)

Expand Down Expand Up @@ -416,6 +418,7 @@ func TestNewMapping(t *testing.T) {
ebpfMockup,
NewMapFileIDMapper(),
symRepMockup,
symbolication.NewNoopUploader(),
&dummyProvider,
true)
require.NoError(t, err)
Expand Down Expand Up @@ -606,6 +609,7 @@ func TestProcExit(t *testing.T) {
ebpfMockup,
NewMapFileIDMapper(),
repMockup,
symbolication.NewNoopUploader(),
&dummyProvider,
true)
require.NoError(t, err)
Expand Down
Loading

0 comments on commit a28575a

Please sign in to comment.