Characterization

Overview

We need to characterize Ziti performance so that we can compare it against the plain internet, against other technologies and against itself, so we can tell whether we are improving, maintaining or degrading performance over time.

Characterization scenarios will be run across three axes; a sketch of how one such scenario might be captured follows the list.

  • The model
    • This includes the numbers and interactions of services, identities and policies
  • The deployment
    • This includes the number and type of instances and the regions they are deployed in. It also includes whether we are using tunnelers or native Ziti applications
  • The traffic
    • This includes the number of concurrent sessions, the amount of data sent and the number of iterations.
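
Each run can be thought of as one point in a matrix over these three axes. As a rough sketch (the type names, field names and example values below are assumptions, not an existing Ziti test harness), a scenario might be captured like this:

```go
package main

import "fmt"

// ModelSpec captures the model axis: how many of each entity exist.
type ModelSpec struct {
	Services           int
	Identities         int
	EdgeRouters        int
	ServicePolicies    int
	EdgeRouterPolicies int
	ServiceERPolicies  int
}

// DeploymentSpec captures where and how the pieces run.
type DeploymentSpec struct {
	InstanceType string   // e.g. "t2.medium" (placeholder)
	Regions      []string // regions the routers are deployed into
	UseTunnelers bool     // tunnelers vs. native (SDK-embedded) Ziti applications
}

// TrafficSpec captures the load applied during the run.
type TrafficSpec struct {
	ConcurrentSessions int
	BytesPerSession    int64
	Iterations         int
}

// Scenario is one point in the characterization matrix.
type Scenario struct {
	Model      ModelSpec
	Deployment DeploymentSpec
	Traffic    TrafficSpec
}

func main() {
	// Example: the baseline model, one region, tunneler-based traffic.
	s := Scenario{
		Model:      ModelSpec{Services: 1, Identities: 1, EdgeRouters: 1, ServicePolicies: 1, EdgeRouterPolicies: 1, ServiceERPolicies: 1},
		Deployment: DeploymentSpec{InstanceType: "t2.medium", Regions: []string{"us-east-1"}, UseTunnelers: true},
		Traffic:    TrafficSpec{ConcurrentSessions: 10, BytesPerSession: 1 << 30, Iterations: 5},
	}
	fmt.Printf("%+v\n", s)
}
```

Sweeping the matrix then just means iterating over lists of model, deployment and traffic values rather than hand-writing each combination.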

Models

Baseline

  • 1 service
  • 1 identity
  • 1 edge router
  • 1 of each policy

For models with multiple edge routers, do we need to set the runtime up so that only one is active, for consistency in test results (and to keep testing costs down)?

For each policy type relating entity types A <-> B, ensure we have at least:

  1. an A with a policy which has all Bs
  2. a B with a policy which has all As
  3. an A with all policies
  4. a B with all policies
  5. Ensure that the A and B we test with are the worst case: they have access to the maximum number of entities on both sides and sort lexically last, to expose slowdowns in scans (see the sketch after this list)
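
For point 5, the harness can pick its A and B deliberately rather than at random. A minimal sketch, with hypothetical entity names:

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	// Hypothetical entity names pulled from the model under test.
	identities := []string{"alice", "bob", "zz-worst-case-identity"}
	services := []string{"billing", "echo", "zz-worst-case-service"}

	// Sort and take the lexically last entry on each side, so any
	// name-ordered scan has to walk the full list before finding it.
	sort.Strings(identities)
	sort.Strings(services)
	testIdentity := identities[len(identities)-1]
	testService := services[len(services)-1]

	fmt.Println("dial with:", testIdentity, "->", testService)
}
```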

Small

  • 20 services
  • 100 identities
  • 10 edge routers
  • 10 Service Policies
  • 10 Edge Router Policies
  • 10 Service Edge Router Policies

Medium

  • 100 services
  • 5,000 identities
  • 100 edge routers
  • 50 Service Policies
  • 50 Edge Router Policies
  • 10 Service Edge Router Policies

Large

  • 200 services
  • 100,000 identities
  • 500 edge routers
  • 250 Service Policies
  • 250 Edge Router Policies
  • 100 Service Edge Router Policies

Pure Model Tests

We can test the model in isolation, outside the context of full deployment/throughput/scale testing, to ensure that the queries we need to do for the SDK will scale well. Ideally permission checks would be O(1), so that the only non-constant cost would be service look-ups (since as an identity has access to more services, listing them will naturally take more time).

This testing can be done locally, just exercising the APIs used by the SDK. If we can eliminate poor performance here, that will let us focus on performance in the edge routers for the throughput and connection-scale testing.
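
A minimal sketch of such a local exercise, assuming a plain HTTP call against the controller (the URL, port and lack of authentication are placeholders; a real run would authenticate first and send the resulting session token with each request). It times each call and reports the same min/max/mean/95th statistics shown in the results below.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"sort"
	"time"
)

// measure runs op n times and reports the statistics used in the
// results table below: min, max, mean and 95th percentile.
func measure(name string, n int, op func() error) {
	durations := make([]time.Duration, 0, n)
	for i := 0; i < n; i++ {
		start := time.Now()
		if err := op(); err != nil {
			fmt.Printf("%s: iteration %d failed: %v\n", name, i, err)
			continue
		}
		durations = append(durations, time.Since(start))
	}
	if len(durations) == 0 {
		return
	}
	sort.Slice(durations, func(i, j int) bool { return durations[i] < durations[j] })
	var total time.Duration
	for _, d := range durations {
		total += d
	}
	p95 := durations[len(durations)*95/100]
	fmt.Printf("%s:\n    Min  : %v\n    Max  : %v\n    Mean : %v\n    95th : %v\n",
		name, durations[0], durations[len(durations)-1], total/time.Duration(len(durations)), p95)
}

func main() {
	// Placeholder URL: point this at your controller's edge API. The
	// authentication step is omitted to keep the sketch short.
	const servicesURL = "https://localhost:1280/services"

	// Test-only: accept the controller's self-signed certificate.
	client := &http.Client{
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
		Timeout:   30 * time.Second,
	}

	measure("Get Services", 20, func() error {
		resp, err := client.Get(servicesURL)
		if err != nil {
			return err
		}
		return resp.Body.Close()
	})
}
```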

Results

| Operation           | Stat | baseline | small    | medium  | large     |
|---------------------|------|----------|----------|---------|-----------|
| Create API Session  | Min  | 6ms      | 8ms      | 8ms     | 15ms      |
|                     | Max  | 46ms     | 53ms     | 66ms    | 58ms      |
|                     | Mean | 23.3ms   | 20.45ms  | 24.4ms  | 28.85ms   |
|                     | 95th | 45.9ms   | 52.39ms  | 65.6ms  | 57.24ms   |
| Refresh API Session | Min  | 0ms      | 0ms      | 0ms     | 0ms       |
|                     | Max  | 0ms      | 0ms      | 0ms     | 0ms       |
|                     | Mean | 0ms      | 0ms      | 0ms     | 0ms       |
|                     | 95th | 0ms      | 0ms      | 0ms     | 0ms       |
| Get Services        | Min  | 14ms     | 156ms    | 785ms   | 3521ms    |
|                     | Max  | 17ms     | 187ms    | 848ms   | 3705ms    |
|                     | Mean | 16ms     | 169.6ms  | 805.4ms | 3620.5ms  |
|                     | 95th | 17ms     | 187ms    | 848ms   | 3705ms    |
| Create Session      | Min  | 6ms      | 8ms      | 18ms    | 2033ms    |
|                     | Max  | 36ms     | 49ms     | 38ms    | 4951ms    |
|                     | Mean | 15.75ms  | 20.35ms  | 24.05ms | 3386.95ms |
|                     | 95th | 35.9ms   | 48.95ms  | 37.9ms  | 4944.65ms |
| Refresh Session     | Min  | 0ms      | 0ms      | 0ms     | 0ms       |
|                     | Max  | 0ms      | 0ms      | 0ms     | 0ms       |
|                     | Mean | 0ms      | 0ms      | 0ms     | 0ms       |
|                     | 95th | 0ms      | 0ms      | 0ms     | 0ms       |

Follow Up Work

We will likely denormalize the models and improve lookups to allow permission checks to be O(1). That should keep session create time low regardless of the number of services and identities, and let service list time scale linearly with the number of services.
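
One way to picture that denormalization, purely as a sketch and not Ziti's actual storage layer: keep an identity-to-service index that is updated whenever policies change, so a session-create permission check becomes a couple of map lookups and listing services only walks the entries the identity can actually see.

```go
package main

import "fmt"

// permIndex is a denormalized identity -> service index, rebuilt (or
// incrementally updated) whenever policies change. With it, a permission
// check is O(1) and listing services is linear in the services the
// identity can see, independent of the total model size.
type permIndex struct {
	servicesByIdentity map[string]map[string]struct{}
}

// grant records that an identity may dial a service (driven by policy changes).
func (p *permIndex) grant(identityID, serviceID string) {
	if p.servicesByIdentity == nil {
		p.servicesByIdentity = map[string]map[string]struct{}{}
	}
	svcs, ok := p.servicesByIdentity[identityID]
	if !ok {
		svcs = map[string]struct{}{}
		p.servicesByIdentity[identityID] = svcs
	}
	svcs[serviceID] = struct{}{}
}

// canDial answers a session-create permission check in constant time.
func (p *permIndex) canDial(identityID, serviceID string) bool {
	_, ok := p.servicesByIdentity[identityID][serviceID]
	return ok
}

// services lists only what the identity can see.
func (p *permIndex) services(identityID string) []string {
	var out []string
	for svc := range p.servicesByIdentity[identityID] {
		out = append(out, svc)
	}
	return out
}

func main() {
	var idx permIndex
	idx.grant("identity-1", "service-a")
	fmt.Println(idx.canDial("identity-1", "service-a")) // true
	fmt.Println(idx.canDial("identity-1", "service-b")) // false
	fmt.Println(idx.services("identity-1"))
}
```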

Deployments

We should test with a variety of instance types, from t2 on up. Until we start testing, it will be hard to say what is needed. High-bandwidth applications often need bigger instance types even when the extra CPU and memory aren't required, since available network throughput tends to scale with instance size.

The controller should require smaller instances than the router, at least in terms of network use.

We shouldn't need to test deployment variations, such as tunneler vs. SDK-enabled application, for all scenarios. We can pick one or two scenarios to find out whether there are noticeable differences.

Traffic

There are a few different traffic types we should test:

  1. iperf, for sustained throughput testing. This can be run with various degrees of parallelism.
  2. Something like a web service or HTTP server, for lots of concurrent, short-lived connections, to get a feel for connection setup/teardown overhead (see the sketch below).
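
For the second traffic type, a small load generator that forces a new connection per request gives a feel for setup/teardown cost. A sketch, with the target URL and counts as placeholders (the target would be the tunneler's intercept address or whatever the SDK-hosted service exposes):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
	"time"
)

func main() {
	const (
		workers  = 50                      // concurrent clients (placeholder)
		requests = 200                     // requests per client (placeholder)
		target   = "http://localhost:8080/" // service under test (placeholder)
	)

	// Disable keep-alives so every request pays full connection
	// setup/teardown, which is what we want to characterize here.
	client := &http.Client{
		Transport: &http.Transport{DisableKeepAlives: true},
		Timeout:   10 * time.Second,
	}

	start := time.Now()
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < requests; i++ {
				resp, err := client.Get(target)
				if err != nil {
					continue
				}
				io.Copy(io.Discard, resp.Body)
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()

	elapsed := time.Since(start)
	total := workers * requests
	fmt.Printf("%d short-lived connections in %v (%.1f conn/s)\n",
		total, elapsed, float64(total)/elapsed.Seconds())
}
```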