Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable retry by default for OTLP exporters #6588

Merged
merged 3 commits into from
Jul 19, 2024

Conversation

jack-berg
Copy link
Member

Resolves #5641.

  • Changes all OtlpHttp{Signal}Exporters and OtlpGrpc{Signal}Exporters to enable retry by default.
  • Adds new otel.java.exporter.otlp.retry.disabled (defaults to false) property to superseed otel.experimental.java.exporter.otlp.retry.enabled (defaults to true). The old property will be retained for some time for backwards compatibility.

Context:

  • OTLP retry has been around years in this project and the programmatic configuration has been stable since july 2023.
  • My employer (New Relic) has recommended it for production use for that entire time without any reports of issues
  • The spec mandates retrying transient errors with a retry strategy. By defaulting to not retrying, I believe we've technically been non-compliant this whole time. This was justified by the desire to wait for some additional emerging standard describing more details about how that retry strategy should work. I now believe its impossible to unify the implementations with a common algorithm and options, since there are many stable implementations with different designs.
  • The moratorium on new env vars and the progress of file config means its unlikely that a new env var is going to be introduced to replace out experimental otel.experimental.java.exporter.otlp.retry.enabled. We can't remove this property because users rely on it, so we might as well embrace it and introduce a stable java-specific property. The new name otel.java.exporter.otlp.retry.disabled aligns with spec naming for booleans "All Boolean environment variables SHOULD be named and defined such that false is the expected safe default behavior".

@jack-berg jack-berg requested a review from a team July 16, 2024 21:56
Copy link

codecov bot commented Jul 16, 2024

Codecov Report

Attention: Patch coverage is 70.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 90.58%. Comparing base (3fa57f9) to head (088b49f).
Report is 1 commits behind head on main.

Files Patch % Lines
...lemetry/exporter/otlp/internal/OtlpConfigUtil.java 62.50% 1 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #6588      +/-   ##
============================================
- Coverage     90.62%   90.58%   -0.05%     
+ Complexity     6261     6260       -1     
============================================
  Files           689      689              
  Lines         18704    18704              
  Branches       1844     1847       +3     
============================================
- Hits          16951    16943       -8     
- Misses         1198     1200       +2     
- Partials        555      561       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@jkwatson jkwatson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve of making this the default

@trask
Copy link
Member

trask commented Jul 17, 2024

I'm wondering about the relationship between OTEL_BSP_EXPORT_TIMEOUT (default 30s) and OTEL_EXPORTER_OTLP_TIMEOUT (default 10s).

It seems good for the total exporter time across retries to be less than OTEL_BSP_EXPORT_TIMEOUT, otherwise the BatchSpanProcessor will report timeout on something which may still go through(?)

I think currently though the Java implementation applies OTEL_EXPORTER_OTLP_TIMEOUT per retry? (related open-telemetry/opentelemetry-specification#2346)

@jack-berg
Copy link
Member Author

I think currently though the Java implementation applies OTEL_EXPORTER_OTLP_TIMEOUT per retry?

The timeout applies to the collective set of all requests (original + any retries).

For okhttp, we set callTimeout with the configured timeout, which has javadoc:

The call timeout spans the entire call: resolving DNS, connecting, writing the request body, server processing, and reading the response body. If the call requires redirects or retries all must complete within one timeout period.

For jdk, we set the timeout to the configured timeout less the duration spent on requests up till that point.

So we're safe on your point about exporter timeout needing to be shorter than bsp timeout.

@jack-berg jack-berg merged commit 8fd45a7 into open-telemetry:main Jul 19, 2024
17 of 18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider stabilizing autoconfigure retry
3 participants