Network Error Log Collection (Discussion #209)
-
To keep the discussion productive, I propose focusing initially on "Consumption of reports: analysis and visualization". The aim is to arrive at a good format that covers most of the common use cases while capturing the essential information. Per @fortuna's suggestion, the report should at least contain:
Other considerations
@gghazinouri @ainghazal @hellais your input and participation are much appreciated.
-
Food for thought: how do we answer the questions "Is Shadowsocks working in Iran?" and "What makes it work or not work?" It would be great to have something like MAT, but with a few differences:
-
By relay model, do you mean collector servers relaying reports to other collector servers? I think the collector server should expose a REST API endpoint that returns reports based on GET query params. This way, other collectors can pull reports from it. We can think about it a bit more and flesh out the details while we focus on a minimal PoC.
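Something like this rough Go sketch (the in-memory store and all names are illustrative, not the actual `report` package API):

// Minimal sketch of a collector exposing reports over a REST GET
// endpoint with query params. Store and field names are illustrative.
package main

import (
	"encoding/json"
	"net/http"
)

type Report struct {
	Transport string `json:"transport"`
	ServerIP  string `json:"server_ip"`
	Time      string `json:"time"`
}

var reports []Report // in-memory store, just for the sketch

func getReports(w http.ResponseWriter, r *http.Request) {
	transport := r.URL.Query().Get("transport") // e.g. ?transport=shadowsocks
	var out []Report
	for _, rep := range reports {
		if transport == "" || rep.Transport == transport {
			out = append(out, rep)
		}
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(out)
}

func main() {
	http.HandleFunc("/reports", getReports)
	http.ListenAndServe(":8080", nil)
}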
The current `report` package implementation is a minimal PoC, and there are definitely several areas for improvement. Yes, NEL recommends an expiry date for reports. I believe the collector could also disregard old reports, as reports include a time/date. If the client app has access to persistent storage, it can, for example, save unsent reports to the file system and retry when the tunnel/VPN app successfully connects to the Internet.
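For example, a minimal sketch of the queue-and-retry idea, assuming the app can write to a directory (paths and names are hypothetical):

// Sketch of queueing unsent reports on disk and retrying later.
package reportqueue

import (
	"bytes"
	"net/http"
	"os"
	"path/filepath"
)

const queueDir = "/var/lib/myapp/report-queue" // hypothetical location

// saveForRetry writes a serialized report to the queue directory.
func saveForRetry(name string, data []byte) error {
	if err := os.MkdirAll(queueDir, 0o700); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(queueDir, name+".json"), data, 0o600)
}

// flushQueue tries to send queued reports once connectivity is back,
// e.g. after the tunnel/VPN connects. Sent reports are deleted.
func flushQueue(collectorURL string) {
	entries, err := os.ReadDir(queueDir)
	if err != nil {
		return
	}
	for _, e := range entries {
		path := filepath.Join(queueDir, e.Name())
		data, err := os.ReadFile(path)
		if err != nil {
			continue
		}
		resp, err := http.Post(collectorURL, "application/json", bytes.NewReader(data))
		if err != nil {
			return // still offline; keep the queue and try later
		}
		resp.Body.Close()
		os.Remove(path)
	}
}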
Yes, we should definitely have a
That one is a bit tricky. If we assume aggregation and analysis events are much lower frequency than receiving reports (for example, 99% of reports arrive at the server within a day of capture on the client, and aggregation is done daily), then the late-arriving reports probably don't have any significant impact on the aggregate result. Although aggregates can help us paint the big picture, there's a lot of value in obtaining the results of individual experiments and connection logs. This is especially helpful for service providers and tool/protocol designers who are more interested in a specific connection failure report than in a (daily) aggregate of results. For example, a service provider wants to know why customer X cannot connect to server Y with protocol Z while other customers can.
Yes, I agree it is better to decouple server IP & port, and perhaps transport params, into separate fields. This can make querying the reports easier with a given set of query params.
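To illustrate, a hypothetical shape for such a decoupled record (field names are made up):

// Sketch of decoupled report fields, so queries can filter on each
// one independently rather than parsing a combined address string.
package report

import "time"

type ConnectionReport struct {
	Transport  string            `json:"transport"`   // e.g. "shadowsocks"
	ServerIP   string            `json:"server_ip"`   // separate from port
	ServerPort int               `json:"server_port"`
	Params     map[string]string `json:"params"`      // transport-specific, e.g. {"prefix": "..."}
	Time       time.Time         `json:"time"`
	ErrorCode  string            `json:"error_code,omitempty"`
}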
-
@ainghazal The collector server needs to inject the client IP information. To make matters more complicated, I can imagine many reports would arrive at the collector server through a tunnel server or relay, which would change the observed source IP.
-
@ainghazal I had a meeting with @fortuna and we discussed some of the points on this thread. I am going to summarize them below:
With these use cases, we basically cover queries based on transport and transport-specific parameters. The list of possible queries can help us define the UI and visualization for reports and better understand how the reports are being consumed. For example, we can turn the query params into filters in the UI.
-
Report Format
The report format must match the test. Each test is possibly measuring a different thing.
DNS Test
For Shadowsocks, we have a test that uses DNS resolution over UDP and TCP to test both protocols (we should change the code to take a dialer instead). That is superior to an HTTP fetch because it's simpler and quicker, with less data transferred and fewer things to go wrong. The request and response are often one packet each. So the report should align with that test, which does:
That can be the top-level structure. But things can get complicated.
Transport
The "connect" step is dialer-dependent. Typically the connect will start with host name resolution for both TCP and UDP, and, in the case of TCP, if using Happy Eyeballs, the connection attempts. Note that the dialer may or may not implement Happy Eyeballs. Should we report each connection attempt? That seems like overkill to me. I'd be happy to just report the overall Happy Eyeballs result, but we should be able to report on the DNS resolution. Or perhaps we report each connection attempt, in addition to the overall connect. Note that host name resolution actually applies to IP addresses as well: some stacks will map IPv4 to IPv6 or IPv6 to IPv4. Some protocols, like Multipath TCP, may involve multiple connections. Protocols over TLS may want to report on the TLS handshake. This is dialer-dependent. In the case of our Shadowsocks implementation, the client establishes a connection to the proxy on connect, but the initialization vector and the connection request are only sent in the "send" operation.
We need a format that can represent a fixed test, but also allow for dialer-specific behavior, while making it possible to analyze in a reasonable way. I like the idea of representing a tree of operations as spans, with start and end times, and their results. So you can imagine a generic format like this:
{
  op: "dns_test"
  start_time: ...
  end_time: ...
  status: { code: "OK" }
  spans: [
    {op: "connect", ..., status: "OK", spans: [
      {op: "resolve", query: "...", answers: [...], status: "OK"}
      {op: "happy_eyeballs", ..., selected_address: "...", status: "OK"}
    ]},
    {op: "send", ..., status: "OK"},
    {op: "receive", ..., status: "OK"},
  ]
}
This is actually a bit complicated, but it lets someone query for "dns_test" specifically, and inspect each of the spans in a transport-agnostic way. And if you are interested in a specific protocol, you can still dig deeper. We can consider an easier-to-query format, but it may depend on how databases handle dynamic schemas:
{
  dns_test: {
    start_time: ...
    end_time: ...
    status: { code: "OK" }
    connect: {
      resolve: {query: "...", answers: [...], status: "OK"}
      happy_eyeballs: {..., selected_address: "...", status: "OK"}
    }
    send: {..., status: "OK"}
    receive: {..., status: "OK"}
  }
}
It's not clear to me where exactly we would put the protocol information. A possibility is to have an explicit dialer type:
{
  dns_test: {
    dialer: {shadowsocks: {prefix: "..."}} // Or whatever representation we decide. TBD
    ...
  }
}
If you are analyzing Shadowsocks disguises, you can get the prefix from dns_test.dialer.shadowsocks.prefix, and use dns_test.connect.happy_eyeballs.selected_address to get the port. There's probably a better way to do that. For instance, we may want to know the TCP connection without caring about Happy Eyeballs, though you kind of need to know what your stream dialer is. I just wanted to write some ideas down to explore some of the concerns.
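For what it's worth, the generic span format could map to Go types along these lines (a sketch only; names and fields are not a settled schema):

// Sketch of the span-tree report format as Go types.
package report

import "time"

type Status struct {
	Code string `json:"code"` // "OK" or an error code
	Msg  string `json:"msg,omitempty"`
}

type Span struct {
	Op        string         `json:"op"` // "dns_test", "connect", "resolve", ...
	StartTime time.Time      `json:"start_time"`
	EndTime   time.Time      `json:"end_time"`
	Status    Status         `json:"status"`
	Attrs     map[string]any `json:"attrs,omitempty"` // op-specific fields, e.g. selected_address
	Spans     []Span         `json:"spans,omitempty"` // child operations
}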
-
I realized that we need different formats for stream-based and packet-based tests. For stream tests I'm trying:
That's the implementation in #223. However, that doesn't work for packet-based tests because there's no connection establishment. Instead, we need:
With stream, we can have a list of connection attempt results, a selected address, and a single transport result. They are essentially different tests, so it can make sense for them to have different formats. In more detail, we need:
PR #223 will need to be adjusted to reflect that approach.
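To make the two shapes concrete, a rough sketch in Go (field names are illustrative and would need to track #223; the packet-side fields in particular are my guess):

// Sketch of separate stream and packet test report types.
package report

type AttemptResult struct {
	Address string `json:"address"`
	Error   string `json:"error,omitempty"`
}

// StreamTestResult: a list of connection attempt results, the
// selected address, and a single transport result over the stream.
type StreamTestResult struct {
	ConnectionAttempts []AttemptResult `json:"connection_attempts"`
	SelectedAddress    string          `json:"selected_address"`
	Transport          *AttemptResult  `json:"transport,omitempty"`
}

// PacketTestResult: no connection establishment, so each datagram
// exchange stands on its own (exact fields TBD in the discussion).
type PacketTestResult struct {
	Sends    []AttemptResult `json:"sends"`
	Receives []AttemptResult `json:"receives"`
}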
-
In order to address some of the concerns from the discussion today, I would like to propose two types of reports: Client Reports and Third-Party (3P) Reports. They should help us better separate some of the concerns.

A Client Report is, by definition, reported by the client directly to a Client Report Collector. The collector can, in turn, derive a new report, annotated with information about the user or IP. Client Reports are always reported directly to a Client Report Collector.

A 3P Report is reported by a third party to a 3P Report Collector or aggregator. The 3P Report should have a different format, with sensitive information scrubbed. Sensitive IP addresses can be replaced with AS and country information. 3P Reports can be safely transmitted over tunnels, since the IP information has already been determined and sanitized.

Conceptually, a client app might run a local Client Report Collector that receives a Client Report, derives a 3P Report with IP information, and sends it to a 3P Report Collector. A third party like OONI would receive 3P Reports only. Thoughts?
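A sketch of the derivation step, with a stub in place of a real IP-to-ASN/country lookup (all names illustrative):

// Sketch of deriving a sanitized 3P Report from a Client Report.
package report

type ClientReport struct {
	ClientIP string // observed by the Client Report Collector
	Payload  []byte // the client-submitted report body
}

type ThirdPartyReport struct {
	ClientASN     string // replaces the sensitive IP
	ClientCountry string
	Payload       []byte
}

// lookupASN is a placeholder for a real IP-to-ASN/country database.
func lookupASN(ip string) (asn, country string) {
	return "AS0", "ZZ" // stub values for the sketch
}

// Derive3P scrubs the client IP and annotates AS and country, so the
// result can be safely forwarded to a 3P Report Collector.
func Derive3P(cr ClientReport) ThirdPartyReport {
	asn, country := lookupASN(cr.ClientIP)
	return ThirdPartyReport{ClientASN: asn, ClientCountry: country, Payload: cr.Payload}
}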
-
I am opening another thread to prevent this from getting buried. I think we are focusing too much here on the format, but the topic of how VPN applications are meant to send reports bypassing the VPN is critical to the design, as mentioned in this comment. I understand your design is quite impacted by the Outline use case, but I think it would be suboptimal not to consider how other providers might technically adopt such a solution.
Does this mean that VPN applications are expected to implement connectivity testing inside their client? How does connectivity testing work in the context of a transport that is not session-oriented? For example, if you look at the JNI for boringtun, which implements WireGuard and is used in the WARP VPN client: https://github.com/cloudflare/boringtun/blob/master/boringtun/src/jni.rs, you can see that there is no concept of a connection. They state in the docs:
This is because each platform basically just implements the calls to create the sockets and set up the send and receive threads on them, and then it optimistically expects things to work once the builder sets the route. In this context, I think it would be very useful to have telemetry in the form of stats saying "how much traffic has been sent and received, how many errors were logged during a session, etc.". This is because when you use something like UDP (which is prevalent in most VPN clients today), you don't have the concept of a "connection" as you do in TCP. I hope this better explains the issue.
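For illustration, session-level counters could look something like this Go sketch (names are made up, not boringtun's API):

// Sketch of session-level telemetry for a connectionless (UDP-based)
// transport like WireGuard, where there is no TCP-style connection.
package telemetry

import "sync/atomic"

type SessionStats struct {
	BytesSent     atomic.Uint64
	BytesReceived atomic.Uint64
	SendErrors    atomic.Uint64
	RecvErrors    atomic.Uint64
	Handshakes    atomic.Uint64 // e.g. WireGuard handshake completions
}

// Snapshot returns a plain copy suitable for serializing into a
// report at the end of a session.
func (s *SessionStats) Snapshot() map[string]uint64 {
	return map[string]uint64{
		"bytes_sent":     s.BytesSent.Load(),
		"bytes_received": s.BytesReceived.Load(),
		"send_errors":    s.SendErrors.Load(),
		"recv_errors":    s.RecvErrors.Load(),
		"handshakes":     s.Handshakes.Load(),
	}
}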
-
Network Error Log Collection
This document aims to discuss high-level software requirements for collecting and consuming network error logs captured on a client.
Problem statement
When a VPN app (or any other generic client that makes network calls) cannot access the remote destination server over the internet, it is difficult to understand the cause of failure.
Modern web browsers implement NEL (Network Error Logging), which can collect and send reports to a remote collector.
However, NEL is specifically designed for standard web traffic and is not available to other standalone client applications (such as mobile apps).
Separation of concerns
An end-to-end report collection system encompasses several components, described below.
The client application can log and capture a set of error logs that reflect various types of connection failure or success. The client needs to pick a data format that encapsulates this information. Ideally, apps can conform to existing data formats or be allowed to define their own.
As an example, NEL uses a JSON data format that adheres to this spec.
The collector server receives this data and stores it for later consumption. The server should expose an API to the client; some level of protection is required to prevent endpoints from being spammed with junk data, such as issuing API tokens to the clients or using a long secret URL. Storage limits can be enforced with auto-purge to keep the log size constant for each URL.
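A minimal sketch of such an endpoint in Go, assuming a shared bearer token and a per-report size cap (token handling and limits are illustrative):

// Sketch of a collector POST endpoint with an API token check and a
// request size cap as basic spam protection.
package main

import (
	"io"
	"net/http"
)

const apiToken = "replace-with-issued-token" // hypothetical client token
const maxReportBytes = 64 << 10              // 64 KiB per report

func collect(w http.ResponseWriter, r *http.Request) {
	if r.Header.Get("Authorization") != "Bearer "+apiToken {
		http.Error(w, "unauthorized", http.StatusUnauthorized)
		return
	}
	body, err := io.ReadAll(io.LimitReader(r.Body, maxReportBytes))
	if err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	store(body) // append to storage; auto-purge old rows to cap size
	w.WriteHeader(http.StatusNoContent)
}

func store(report []byte) { /* storage backend elided */ }

func main() {
	http.HandleFunc("/report", collect)
	http.ListenAndServe(":8080", nil)
}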
It must also be resilient, and easy to set up and use, as I will discuss below in more detail.
The last, but perhaps most important, piece is to analyze and make sense of the reports to gain insight into the underlying root causes and to find a workaround if, for example, blocking is taking place. Consumption of the log data requires that the report data adhere to some known format (either user-defined or standard). That way, target information can be easily extracted from the report data and analyzed. Enforcing a universal format is, however, challenging and not practical. A good format must offer flexibility to define new fields (inject new information) while including more rigid sections that capture protocol-agnostic or generic information.
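One way to get that balance is a rigid generic section plus a free-form extension field, sketched here in Go (names are illustrative):

// Sketch of a format with a rigid, protocol-agnostic section plus a
// free-form extension field for app- or protocol-defined data.
package report

import "encoding/json"

type GenericReport struct {
	// Rigid, protocol-agnostic fields every consumer can rely on.
	Time      string `json:"time"`
	Transport string `json:"transport"`
	Status    string `json:"status"`
	ErrorCode string `json:"error_code,omitempty"`
	// App- or protocol-defined fields, carried through untouched.
	Extra json.RawMessage `json:"extra,omitempty"`
}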
Target audience and experience
This system can be used by the following target groups:
1. Developers & users of VPN and networking apps:
Developers can incorporate this functionality into their apps to offer a facility to collect logs and send them to a remote collector.
The developers may also allow their app users to specify a custom collector address to which logs are submitted. The address can also be incorporated into the server access key as a parameter.
For example, I implemented this concept in the Outline connectivity tester app and the Blazer proxy app, as well as the Outline connectivity CLI. In all of these applications, the end-user can input a URL to indicate the address of the remote collector server.
Also, I implemented a report package in Go that collects a report and submits it to a remote destination. The idea is to just import a package in the client app and call a function to collect and submit a report.
2. Service providers (service managers)
Network error logging from client vantage points can assist service providers in troubleshooting and addressing potential blocking issues and improving their service offering. In theory, they could use the reports to adaptively adjust the transport to bypass blocking.
Service providers may prefer to set up and utilize their own private collectors, and potentially set up a redirect to relay reports to a public collector after some post-processing to redact PII and other sensitive information.
Depending on client support, the address of the collector can be embedded into the access key URL shared with the end-user.
3. Internet Freedom Community
The community at large can benefit from reports, and from analysis and aggregation of results based on such reports, to gain insights into common blocking techniques and their impact.
Public collectors can play an important role here. Private collectors could potentially opt-in to share their findings with a public collector.
There are privacy considerations here: any PII or credentials must be redacted in such reports, and client & server IP addresses must be mapped to ASNs and not included in reports.
System attributes
A winning system design should satisfy the following high-level requirements as much as possible:
Resilience: Blocking report collector destinations is by nature more challenging since (1) the traffic does not have the characteristics of tunneling traffic; it's legitimate HTTPS, and (2) the amount of data is small, so it can be stored and sent whenever a connection is made (whether through a VPN or not). It also helps if collectors are decentralized. Centralized aggregators can potentially pull logs from various collectors, then analyze and visualize the results. Collector servers can also be proxied behind Cloudflare or other CDNs to increase resilience.
Easy to integrate into Apps:
The clients should be able to collect and send the reports in a few lines of code. It should be easy to do in any programming language using common design patterns. For example, I have opted to use JSON to encapsulate the log information and send it via a simple HTTP POST request. Other options such as gRPC with mTLS are possible but could impose unnecessary friction in terms of integration and use.
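For illustration, the client side can be as small as this sketch (the collector URL and report fields are made up):

// Sketch of the "few lines of code" client side: marshal a report to
// JSON and POST it to the collector.
package main

import (
	"bytes"
	"encoding/json"
	"net/http"
)

func main() {
	report := map[string]any{
		"transport":  "shadowsocks",
		"error_code": "ECONNREFUSED",
		"time":       "2024-01-01T00:00:00Z",
	}
	body, _ := json.Marshal(report)
	resp, err := http.Post("https://collector.example.com/report", "application/json", bytes.NewReader(body))
	if err != nil {
		return // queue for retry, as discussed above
	}
	resp.Body.Close()
}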
Easy to setup a collector:
Ideally, anyone should be able to spin up their own report collector server with a few clicks.
For example, I have demonstrated the use of Google Apps Script to set up a report collector that uses a Google spreadsheet to collect data, and possibly analyze and visualize it with spreadsheet formulas. Service providers such as OONI can play a crucial role in offering their infrastructure to help accomplish this objective.
What information should the log contain:
At the base level, POSIX socket error codes and messages can provide insight into TCP/UDP-level issues. Transport- and application-specific error messages can capture errors related to DNS, HTTP(S), and TLS.
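For example, in Go the socket-level code can be recovered from a failed dial like this (a sketch; the exact code printed is platform-dependent):

// Sketch of pulling the POSIX error code out of a failed dial, which
// yields the socket-level detail (e.g. ECONNREFUSED) a report can carry.
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
)

func main() {
	_, err := net.Dial("tcp", "127.0.0.1:1") // likely to fail
	var errno syscall.Errno
	if errors.As(err, &errno) {
		fmt.Printf("code=%d msg=%q\n", int(errno), errno.Error())
		// e.g. code=111 msg="connection refused" on Linux
	}
}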
Section 6 of the NEL specification provides a list of application-level and socket-level errors:
The client can possibly run custom connectivity experiments and include their results.
Additional thoughts
I have looked into various approaches to setting up a remote report collector. Below is a quick summary of my findings:
Other possible options: