Network Error Log Collection (Discussion #209)
-
To keep the discussion productive, I propose focusing initially on "Consumption of reports: analysis and visualization". The aim is to arrive at a good format that covers most of the common use cases while capturing the essential information. Per @fortuna's suggestion, the report should at least contain:
Other considerations
@gghazinouri @ainghazal @hellais your input and participation are much appreciated.
-
Food for thought: how do we answer the questions "Is Shadowsocks working in Iran?" and "What makes it work or not work?" It would be great to have something like MAT, but with a few differences:
-
By relay model, do you mean collector servers relaying reports to other collector servers? I think the collector server should expose a REST API endpoint that returns reports based on GET query params. This way, other collectors can pull reports from it. We can think about it a bit more and flesh out the details while we focus on a minimal PoC.
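Something like this rough Go sketch (the in-memory store and all names are illustrative, not the actual `report` package API):

// Minimal sketch of a collector exposing reports over a REST GET
// endpoint with query params. Store and field names are illustrative.
package main

import (
	"encoding/json"
	"net/http"
)

type Report struct {
	Transport string `json:"transport"`
	ServerIP  string `json:"server_ip"`
	Time      string `json:"time"`
}

var reports []Report // in-memory store, just for the sketch

func getReports(w http.ResponseWriter, r *http.Request) {
	transport := r.URL.Query().Get("transport") // e.g. ?transport=shadowsocks
	var out []Report
	for _, rep := range reports {
		if transport == "" || rep.Transport == transport {
			out = append(out, rep)
		}
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(out)
}

func main() {
	http.HandleFunc("/reports", getReports)
	http.ListenAndServe(":8080", nil)
}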
The current `report` package implementation is a minimal PoC, and there are definitely several areas for improvement. Yes, NEL recommends an expiry date for reports. I believe the collector could also disregard old reports, as reports include a time/date. If the client app has access to persistent storage, it can, for example, save unsent reports to the file system and retry when the tunnel/VPN app successfully connects to the Internet.
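For example, a minimal sketch of the queue-and-retry idea, assuming the app can write to a directory (paths and names are hypothetical):

// Sketch of queueing unsent reports on disk and retrying later.
package reportqueue

import (
	"bytes"
	"net/http"
	"os"
	"path/filepath"
)

const queueDir = "/var/lib/myapp/report-queue" // hypothetical location

// saveForRetry writes a serialized report to the queue directory.
func saveForRetry(name string, data []byte) error {
	if err := os.MkdirAll(queueDir, 0o700); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(queueDir, name+".json"), data, 0o600)
}

// flushQueue tries to send queued reports once connectivity is back,
// e.g. after the tunnel/VPN connects. Sent reports are deleted.
func flushQueue(collectorURL string) {
	entries, err := os.ReadDir(queueDir)
	if err != nil {
		return
	}
	for _, e := range entries {
		path := filepath.Join(queueDir, e.Name())
		data, err := os.ReadFile(path)
		if err != nil {
			continue
		}
		resp, err := http.Post(collectorURL, "application/json", bytes.NewReader(data))
		if err != nil {
			return // still offline; keep the queue and try later
		}
		resp.Body.Close()
		os.Remove(path)
	}
}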
Yes, we should definitely have a
That one is a bit tricky. If we assume aggregation and analysis events are much lower frequency than receiving reports (for example, 99% of reports arrive at the server within a day of capture on the client, and aggregation is done daily), then the late-arriving reports probably don't have any significant impact on the aggregate result. Although aggregates can help us paint the big picture, there's a lot of value in obtaining the results of individual experiments and connection logs. This is especially helpful for service providers and tool/protocol designers who are more interested in a specific connection failure report than in a (daily) aggregate of results. For example, a service provider wants to know why customer X cannot connect to server Y with protocol Z while other customers can.
Yes, I agree it is better to decouple server IP & port, and perhaps transport params, into separate fields. This can make querying the reports easier with a given set of query params.
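To illustrate, a hypothetical shape for such a decoupled record (field names are made up):

// Sketch of decoupled report fields, so queries can filter on each
// one independently rather than parsing a combined address string.
package report

import "time"

type ConnectionReport struct {
	Transport  string            `json:"transport"`   // e.g. "shadowsocks"
	ServerIP   string            `json:"server_ip"`   // separate from port
	ServerPort int               `json:"server_port"`
	Params     map[string]string `json:"params"`      // transport-specific, e.g. {"prefix": "..."}
	Time       time.Time         `json:"time"`
	ErrorCode  string            `json:"error_code,omitempty"`
}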
-
@ainghazal The collector server needs to inject the client IP information. To make matters more complicated, I can imagine many reports would arrive at the collector server through a tunnel server or relay, which would change the observed source IP.
-
@ainghazal I had a meeting with @fortuna and we discussed some of the points on this thread. I am going to summarize them below:
With these use cases, we basically cover queries based on transport and transport-specific parameters. The list of possible queries can help us define the UI and visualization for reports and better understand how the reports are being consumed. For example, we can turn the query params into filters in the UI.
-
Report Format
The report format must match the test. Each test is possibly measuring a different thing.
DNS Test
For Shadowsocks, we have a test that uses DNS resolution over UDP and TCP to test both protocols (we should change the code to take a dialer instead). That is superior to an HTTP fetch because it's simpler and quicker, with less data transferred and fewer things to go wrong. The request and response are often one packet each. So the report should align with that test, which does:
That can be the top-level structure. But things can get complicated.
Transport
The "connect" step is dialer-dependent. Typically the connect will start with host name resolution for both TCP and UDP, and, in the case of TCP, if using Happy Eyeballs, the connection attempts. Note that the dialer may or may not implement Happy Eyeballs. Should we report each connection attempt? That seems like overkill to me. I'd be happy to just report the overall Happy Eyeballs result, but we should be able to report on the DNS resolution. Or perhaps we report each connection attempt, in addition to the overall connect. Note that host name resolution actually applies to IP addresses as well: some stacks will map IPv4 to IPv6 or IPv6 to IPv4. Some protocols, like Multipath TCP, may involve multiple connections. Protocols over TLS may want to report on the TLS handshake. This is dialer-dependent. In the case of our Shadowsocks implementation, the client establishes a connection to the proxy on connect, but the initialization vector and the connection request are only sent in the "send" operation.
We need a format that can represent a fixed test, but also allow for dialer-specific behavior, while making it possible to analyze in a reasonable way. I like the idea of representing a tree of operations as spans, with start and end times, and their results. So you can imagine a generic format like this:
{
  op: "dns_test"
  start_time: ...
  end_time: ...
  status: { code: "OK" }
  spans: [
    {op: "connect", ..., status: "OK", spans: [
      {op: "resolve", query: "...", answers: [...], status: "OK"}
      {op: "happy_eyeballs", ..., selected_address: "...", status: "OK"}
    ]},
    {op: "send", ..., status: "OK"},
    {op: "receive", ..., status: "OK"},
  ]
}
This is actually a bit complicated, but it lets someone query for "dns_test" specifically, and inspect each of the spans in a transport-agnostic way. And if you are interested in a specific protocol, you can still dig deeper. We can consider an easier-to-query format, but it may depend on how databases handle dynamic schemas:
{
  dns_test: {
    start_time: ...
    end_time: ...
    status: { code: "OK" }
    connect: {
      resolve: {query: "...", answers: [...], status: "OK"}
      happy_eyeballs: {..., selected_address: "...", status: "OK"}
    }
    send: {..., status: "OK"}
    receive: {..., status: "OK"}
  }
}
It's not clear to me where exactly we would put the protocol information. A possibility is to have an explicit dialer type:
{
  dns_test: {
    dialer: {shadowsocks: {prefix: "..."}} // Or whatever representation we decide. TBD
    ...
  }
}
If you are analyzing Shadowsocks disguises, you can get the prefix from dns_test.dialer.shadowsocks.prefix, and use dns_test.connect.happy_eyeballs.selected_address to get the port. There's probably a better way to do that. For instance, we may want to know the TCP connection without caring about Happy Eyeballs, though you kind of need to know what your stream dialer is. I just wanted to write some ideas down to explore some of the concerns.
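For what it's worth, the generic span format could map to Go types along these lines (a sketch only; names and fields are not a settled schema):

// Sketch of the span-tree report format as Go types.
package report

import "time"

type Status struct {
	Code string `json:"code"` // "OK" or an error code
	Msg  string `json:"msg,omitempty"`
}

type Span struct {
	Op        string         `json:"op"` // "dns_test", "connect", "resolve", ...
	StartTime time.Time      `json:"start_time"`
	EndTime   time.Time      `json:"end_time"`
	Status    Status         `json:"status"`
	Attrs     map[string]any `json:"attrs,omitempty"` // op-specific fields, e.g. selected_address
	Spans     []Span         `json:"spans,omitempty"` // child operations
}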
-
I realized that we need different formats for stream-based and packet-based tests. For stream tests I'm trying:
That's the implementation in #223. However, that doesn't work for packet-based tests because there's no connection establishment. Instead, we need:
With stream, we can have a list of connection attempt results, a selected address, and a single transport result. They are essentially different tests, so it can make sense for them to have different formats. In more detail, we need:
PR #223 will need to be adjusted to reflect that approach.
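To make the two shapes concrete, a rough sketch in Go (field names are illustrative and would need to track #223; the packet-side fields in particular are my guess):

// Sketch of separate stream and packet test report types.
package report

type AttemptResult struct {
	Address string `json:"address"`
	Error   string `json:"error,omitempty"`
}

// StreamTestResult: a list of connection attempt results, the
// selected address, and a single transport result over the stream.
type StreamTestResult struct {
	ConnectionAttempts []AttemptResult `json:"connection_attempts"`
	SelectedAddress    string          `json:"selected_address"`
	Transport          *AttemptResult  `json:"transport,omitempty"`
}

// PacketTestResult: no connection establishment, so each datagram
// exchange stands on its own (exact fields TBD in the discussion).
type PacketTestResult struct {
	Sends    []AttemptResult `json:"sends"`
	Receives []AttemptResult `json:"receives"`
}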
-
In order to address some of the concerns from the discussion today, I would like to propose two types of reports: Client Reports and Third-Party (3P) Reports. They should help us better separate some of the concerns.

A Client Report is, by definition, reported by the client directly to a Client Report Collector. The collector can, in turn, derive a new report, annotated with information about the user or IP. Client Reports are always reported directly to a Client Report Collector.

A 3P Report is reported by a third party to a 3P Report Collector or aggregator. The 3P Report should have a different format, with sensitive information scrubbed. Sensitive IP addresses can be replaced with AS and country information. 3P Reports can be safely transmitted over tunnels, since the IP information has already been determined and sanitized.

Conceptually, a client app might run a local Client Report Collector that receives a Client Report, derives a 3P Report with IP information, and sends it to a 3P Report Collector. A third party like OONI would receive 3P Reports only. Thoughts?
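A sketch of the derivation step, with a stub in place of a real IP-to-ASN/country lookup (all names illustrative):

// Sketch of deriving a sanitized 3P Report from a Client Report.
package report

type ClientReport struct {
	ClientIP string // observed by the Client Report Collector
	Payload  []byte // the client-submitted report body
}

type ThirdPartyReport struct {
	ClientASN     string // replaces the sensitive IP
	ClientCountry string
	Payload       []byte
}

// lookupASN is a placeholder for a real IP-to-ASN/country database.
func lookupASN(ip string) (asn, country string) {
	return "AS0", "ZZ" // stub values for the sketch
}

// Derive3P scrubs the client IP and annotates AS and country, so the
// result can be safely forwarded to a 3P Report Collector.
func Derive3P(cr ClientReport) ThirdPartyReport {
	asn, country := lookupASN(cr.ClientIP)
	return ThirdPartyReport{ClientASN: asn, ClientCountry: country, Payload: cr.Payload}
}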
-
I am opening another thread to prevent this from getting buried. I think we are focusing too much here on the format, but the topic of how VPN applications are meant to send reports bypassing the VPN is critical to the design, as mentioned in this comment. I understand your design is quite impacted by the Outline use case, but I think it would be suboptimal not to consider how other providers might technically adopt such a solution.
Does this mean that VPN applications are expected to implement connectivity testing inside their client? How does connectivity testing work in the context of a transport that is not session-oriented? For example, if you look at the JNI for boringtun, which implements WireGuard and is used in the WARP VPN client: https://github.com/cloudflare/boringtun/blob/master/boringtun/src/jni.rs, you can see that there is no concept of a connection. They state in the docs:
This is because each platform basically just implements the calls to create the sockets and set up the send and receive threads on them, and then it optimistically expects things to work once the builder sets the route. In this context, I think it would be very useful to have telemetry in the form of stats saying "how much traffic has been sent and received, how many errors were logged during a session, etc.". This is because when you use something like UDP (which is prevalent in most VPN clients today), you don't have the concept of a "connection" as you do in TCP. I hope this better explains the issue.
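For illustration, session-level counters could look something like this Go sketch (names are made up, not boringtun's API):

// Sketch of session-level telemetry for a connectionless (UDP-based)
// transport like WireGuard, where there is no TCP-style connection.
package telemetry

import "sync/atomic"

type SessionStats struct {
	BytesSent     atomic.Uint64
	BytesReceived atomic.Uint64
	SendErrors    atomic.Uint64
	RecvErrors    atomic.Uint64
	Handshakes    atomic.Uint64 // e.g. WireGuard handshake completions
}

// Snapshot returns a plain copy suitable for serializing into a
// report at the end of a session.
func (s *SessionStats) Snapshot() map[string]uint64 {
	return map[string]uint64{
		"bytes_sent":     s.BytesSent.Load(),
		"bytes_received": s.BytesReceived.Load(),
		"send_errors":    s.SendErrors.Load(),
		"recv_errors":    s.RecvErrors.Load(),
		"handshakes":     s.Handshakes.Load(),
	}
}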
-
Network Error Log Collection
This document aims to discuss high-level software requirements for collecting and consuming network error logs captured on a client.
Problem statement
When a VPN app (or any other generic client that makes network calls) cannot access the remote destination server over the internet, it is difficult to understand the cause of failure.
Modern web browsers implement NEL (Network Error Logging), which can collect and send reports to a remote collector.
However, NEL is specifically designed for standard web traffic and is not available to other standalone client applications (such as mobile apps).
Separation of concerns
An end-to-end report collection system encompasses several components, described below.
The client application can log and capture a set of error logs that reflect various types of connection failure or success. The client needs to pick a data format that encapsulates this information. Ideally, apps can conform to existing data formats or be allowed to define their own.
As an example, NEL uses a JSON data format that adheres to this spec.
The collector server receives this data and stores it for later consumption. The server should expose an API to the client; some level of protection is required to prevent endpoints from being spammed with junk data, such as issuing API tokens to the clients or using a long secret URL. Storage limits can be enforced with auto-purge to keep the log size constant for each URL.
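A minimal sketch of such an endpoint in Go, assuming a shared bearer token and a per-report size cap (token handling and limits are illustrative):

// Sketch of a collector POST endpoint with an API token check and a
// request size cap as basic spam protection.
package main

import (
	"io"
	"net/http"
)

const apiToken = "replace-with-issued-token" // hypothetical client token
const maxReportBytes = 64 << 10              // 64 KiB per report

func collect(w http.ResponseWriter, r *http.Request) {
	if r.Header.Get("Authorization") != "Bearer "+apiToken {
		http.Error(w, "unauthorized", http.StatusUnauthorized)
		return
	}
	body, err := io.ReadAll(io.LimitReader(r.Body, maxReportBytes))
	if err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	store(body) // append to storage; auto-purge old rows to cap size
	w.WriteHeader(http.StatusNoContent)
}

func store(report []byte) { /* storage backend elided */ }

func main() {
	http.HandleFunc("/report", collect)
	http.ListenAndServe(":8080", nil)
}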
It must also be resilient, and easy to set up and use, as I will discuss below in more detail.
The last, but perhaps most important, piece is to analyze and make sense of the reports to gain insight into the underlying root causes and to find a workaround if, for example, blocking is taking place. Consumption of the log data requires that the report data adhere to some known format (either user-defined or standard). That way, target information can be easily extracted from the report data and analyzed. Enforcing a universal format is, however, challenging and not practical. A good format must offer flexibility to define new fields (inject new information) while including more rigid sections that capture protocol-agnostic or generic information.
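One way to get that balance is a rigid generic section plus a free-form extension field, sketched here in Go (names are illustrative):

// Sketch of a format with a rigid, protocol-agnostic section plus a
// free-form extension field for app- or protocol-defined data.
package report

import "encoding/json"

type GenericReport struct {
	// Rigid, protocol-agnostic fields every consumer can rely on.
	Time      string `json:"time"`
	Transport string `json:"transport"`
	Status    string `json:"status"`
	ErrorCode string `json:"error_code,omitempty"`
	// App- or protocol-defined fields, carried through untouched.
	Extra json.RawMessage `json:"extra,omitempty"`
}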
Target audience and experience
This system can be used by the following target groups:
1. Developers & users of VPN and networking apps:
Developers can incorporate this functionality into their apps to offer a facility to collect logs and send them to a remote collector.
The developers may also allow their app users to specify a custom collector address to which logs are submitted. The address can also be incorporated into the server access key as a parameter.
For example, I implemented this concept in the Outline connectivity tester app and the Blazer proxy app, as well as the Outline connectivity CLI. In all of these applications, the end-user can input a URL to indicate the address of the remote collector server.
Also, I implemented a report package in Go that collects a report and submits it to a remote destination. The idea is to just import a package in the client app and call a function to collect and submit a report.
2. Service providers (service managers)
Network error logging from client vantage points can assist service providers in troubleshooting and addressing potential blocking issues and improving their service offering. In theory, they could use the reports to adaptively adjust the transport to bypass blocking.
Service providers may prefer to set up and utilize their own private collectors, and potentially set up a redirect to relay reports to a public collector after some post-processing to redact PII and other sensitive information.
Depending on client support, the address of the collector can be embedded into the access key URL shared with the end-user.
3. Internet Freedom Community
The community at large can benefit from reports, and from analysis and aggregation of results based on such reports, to gain insights into common blocking techniques and their impact.
Public collectors can play an important role here. Private collectors could potentially opt-in to share their findings with a public collector.
There are privacy considerations here: any PII or credentials must be redacted in such reports, and client & server IP addresses must be mapped to ASNs and not included in reports.
System attributes
A winning system design should satisfy the following high-level requirements as much as possible:
Resilience: Blocking report collector destinations is by nature more challenging since (1) the traffic does not have the characteristics of tunneling traffic; it's legitimate HTTPS, and (2) the amount of data is small, so it can be stored and sent whenever a connection is made (whether through a VPN or not). It also helps if collectors are decentralized. Centralized aggregators can potentially pull logs from various collectors, then analyze and visualize the results. Collector servers can also be proxied behind Cloudflare or other CDNs to increase resilience.
Easy to integrate into Apps:
The clients should be able to collect and send the reports in a few lines of code. It should be easy to do in any programming language using common design patterns. For example, I have opted to use JSON to encapsulate the log information and send it via a simple HTTP POST request. Other options such as gRPC with mTLS are possible but could impose unnecessary friction in terms of integration and use.
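For illustration, the client side can be as small as this sketch (the collector URL and report fields are made up):

// Sketch of the "few lines of code" client side: marshal a report to
// JSON and POST it to the collector.
package main

import (
	"bytes"
	"encoding/json"
	"net/http"
)

func main() {
	report := map[string]any{
		"transport":  "shadowsocks",
		"error_code": "ECONNREFUSED",
		"time":       "2024-01-01T00:00:00Z",
	}
	body, _ := json.Marshal(report)
	resp, err := http.Post("https://collector.example.com/report", "application/json", bytes.NewReader(body))
	if err != nil {
		return // queue for retry, as discussed above
	}
	resp.Body.Close()
}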
Easy to setup a collector:
Ideally, anyone should be able to spin up their own report collector server with a few clicks.
For example, I have demonstrated the use of Google Apps Script to set up a report collector that uses a Google spreadsheet to collect data, and possibly analyze and visualize it with spreadsheet formulas. Service providers such as OONI can play a crucial role in offering their infrastructure to help accomplish this objective.
What information should the log contain:
At the base level, POSIX socket error codes and messages can provide insight into TCP/UDP-level issues. Transport- and application-specific error messages can capture errors related to DNS, HTTP(S), and TLS.
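For example, in Go the socket-level code can be recovered from a failed dial like this (a sketch; the exact code printed is platform-dependent):

// Sketch of pulling the POSIX error code out of a failed dial, which
// yields the socket-level detail (e.g. ECONNREFUSED) a report can carry.
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
)

func main() {
	_, err := net.Dial("tcp", "127.0.0.1:1") // likely to fail
	var errno syscall.Errno
	if errors.As(err, &errno) {
		fmt.Printf("code=%d msg=%q\n", int(errno), errno.Error())
		// e.g. code=111 msg="connection refused" on Linux
	}
}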
Section 6 of the NEL specification provides a list of application-level and socket-level errors:
The client can possibly run custom connectivity experiments and include their results.
Additional thoughts
I have looked into various approaches to setting up a remote report collector. Below is a quick summary of my findings:
Other possible options: