Release 1.2.0

What's New

  • New Router Metrics
  • Changes to identity connect status
  • HA Bootstrap Changes
  • Connect Events
  • SDK Events
  • Bug fixes and other HA work

New Router Metrics

The following new metrics are available for edge routers:

  1. edge.connect.failures - meter tracking failed connect attempts from SDKs. This tracks failures due to not having a valid token. Other failures which happen earlier in the connection process may not be tracked here.
  2. edge.connect.successes - meter tracking successful connect attempts from SDKs
  3. edge.disconnects - meter tracking disconnects of previously successfully connected SDKs
  4. edge.connections - gauge tracking count of currently connected SDKs

Identity Connect Status

Ziti tracks whether an identity is currently connected to an edge router. This is the hasEdgeRouterConnection field on Identity.

Identity connection status used to be driven off of heartbeats from the edge router. This feature doesn't work correctly when running with controller HA.

To address this, while also providing more operational insight, connect events were added (see below for more details on the events themselves).

The controller can be configured to use status from heartbeats, connect events, or both. If both are used as sources, the identity is shown as connected when either source reports it as connected. This is intended for networks with a mix of routers that don't all support connect events yet.

The controller now also aims to be more precise about identity state. There is a new field on Identity: edgeRouterConnectionStatus. This field can have one of three values:

  • offline
  • online
  • unknown

If the identity is reported as connected to any ER, it will be marked as online. If the identity has been reported as connected, but the reporting ER is now offline, the identity may still be connected to the ER. While in this state it will be marked as 'unknown'. After a configurable interval, it will be marked as offline.

New controller config options:

identityStatusConfig:
  # valid values ['heartbeats', 'connect-events', 'hybrid']
  # defaults to 'hybrid' for now
  source: connect-events 

  # determines how often we scan for disconnected routers
  # defaults to 1 minute
  scanInterval: 1m

  # determines how long an identity will stay in unknown status before it's marked as offline
  # defaults to 5m
  unknownTimeout: 5m

HA Bootstrapping Changes

Previously, bootstrapping the Raft cluster and initializing the controller with a default administrator were separate operations. Now, the Raft cluster will be bootstrapped whenever the controller is initialized.

The controller can be initialized as follows:

  1. Using ziti agent controller init
  2. Using ziti agent controller init-from-db
  3. Specifying a db: entry in the config file. This is equivalent to using ziti agent controller init-from-db.

Additionally:

  1. minClusterSize has been removed. The cluster will always be initialized with a size of 1.
  2. bootstrapMembers has been renamed to initialMembers. If initialMembers are specified, the bootstrapping controller will attempt to add them after bootstrap has completed. If they are invalid, they will be ignored. If they can't be reached (because they're not running yet), the controller will continue to retry until they are reached, or it is restarted.

Connect Events

These are events generated when a successful connection is made to a controller, from any of:

  1. Identity, using the REST API
  2. Router
  3. Controller (peer in an HA cluster)

They are also generated when an SDK connects to a router.

Controller Configuration

events:
  jsonLogger:
    subscriptions:
      - type: connect
    handler:
      type: file
      format: json
      path: /tmp/ziti-events.log

Router Configuration

connectEvents:
  # defaults to true. 
  # If set to false, minimal information about which identities are connected will still be 
  # sent to the controller, so the `edgeRouterConnectionStatus` field can be populated, 
  # but connect events will not be generated.
  enabled: true

  # The interval at which connect information will be batched up and sent to the controller. 
  # Shorter intervals will improve data resolution on the controller. Longer intervals could
  # be more efficient.
  batchInterval: 3s

  # The router will also periodically send the full state to the controller, to ensure that 
  # it's in sync. It will do this automatically if the router gets disconnected from the 
  # controller, or if the router is unable to send a connect events message to the controller.
  # This controls how often the full state will be sent under ordinary conditions.
  fullSyncInterval: 5m

  # If enabled is set to true, the router will collect connect events and send them out
  # at the configured batch interval. If there are a huge number of connecting identities
  # or if the router is disconnected from the controller for a time, it may be unable to
  # send events. In order to prevent queued events from exhausting memory, a maximum 
  # queue size is configured. 
  # Default value 100,000
  maxQueuedEvents: 100000
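
To illustrate why maxQueuedEvents bounds memory use, here is a minimal Go sketch of a capped queue that is drained once per batch interval. The `connectEvent` and `boundedQueue` types are hypothetical, and dropping new events when the queue is full is an assumption of this sketch; the release notes do not say how the router handles overflow.

```go
package main

// connectEvent is a hypothetical, simplified event record.
type connectEvent struct {
	identityID string
}

// boundedQueue holds events until the next batch is sent. It never grows past
// max entries. Dropping on overflow is an assumption made for this sketch.
type boundedQueue struct {
	max     int
	events  []connectEvent
	dropped int
}

// push queues an event and reports whether it was accepted.
func (q *boundedQueue) push(e connectEvent) bool {
	if len(q.events) >= q.max {
		q.dropped++
		return false
	}
	q.events = append(q.events, e)
	return true
}

// drain returns everything queued so far and empties the queue. In a real
// router this would run once per batchInterval.
func (q *boundedQueue) drain() []connectEvent {
	batch := q.events
	q.events = nil
	return batch
}
```

The sketch is not safe for concurrent use; a real implementation would need a mutex or a channel around the queue.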
  

Example Events

{
  "namespace": "connect",
  "src_type": "identity",
  "src_id": "ji2Rt8KJ4",
  "src_addr": "127.0.0.1:59336",
  "dst_id": "ctrl_client",
  "dst_addr": "localhost:1280/edge/management/v1/edge-routers/2L7NeVuGBU",
  "timestamp": "2024-10-02T12:17:39.501821249-04:00"
}
{
  "namespace": "connect",
  "src_type": "router",
  "src_id": "2L7NeVuGBU",
  "src_addr": "127.0.0.1:42702",
  "dst_id": "ctrl_client",
  "dst_addr": "127.0.0.1:6262",
  "timestamp": "2024-10-02T12:17:40.529865849-04:00"
}
{
  "namespace": "connect",
  "src_type": "peer",
  "src_id": "ctrl2",
  "src_addr": "127.0.0.1:40056",
  "dst_id": "ctrl1",
  "dst_addr": "127.0.0.1:6262",
  "timestamp": "2024-10-02T12:37:04.490859197-04:00"
}
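
For consumers of the event log, the following Go sketch decodes one connect event into a struct. The field names in the JSON tags are taken from the example events above; the `ConnectEvent` type and `parseConnectEvent` function are names invented for this example and are not part of any OpenZiti package.

```go
package main

import (
	"encoding/json"
	"time"
)

// ConnectEvent maps the JSON fields shown in the example events.
type ConnectEvent struct {
	Namespace string    `json:"namespace"`
	SrcType   string    `json:"src_type"` // "identity", "router" or "peer" in the examples
	SrcID     string    `json:"src_id"`
	SrcAddr   string    `json:"src_addr"`
	DstID     string    `json:"dst_id"`
	DstAddr   string    `json:"dst_addr"`
	Timestamp time.Time `json:"timestamp"` // RFC 3339 with fractional seconds
}

// parseConnectEvent decodes a single JSON-encoded connect event.
func parseConnectEvent(data []byte) (ConnectEvent, error) {
	var ev ConnectEvent
	err := json.Unmarshal(data, &ev)
	return ev, err
}
```

Decoding the timestamp into time.Time keeps the original UTC offset, so events from controllers in different time zones can be compared directly.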

SDK Events

Building off of the connect events, there are events generated when an identity/sdk comes online or goes offline.

events:
  jsonLogger:
    subscriptions:
      - type: sdk
    handler:
      type: file
      format: json
      path: /tmp/ziti-events.log

{
  "namespace": "sdk",
  "event_type" : "sdk-online",
  "identity_id": "ji2Rt8KJ4",
  "timestamp": "2024-10-02T12:17:39.501821249-04:00"
}

{
  "namespace": "sdk",
  "event_type" : "sdk-status-unknown",
  "identity_id": "ji2Rt8KJ4",
  "timestamp": "2024-10-02T12:17:40.501821249-04:00"
}

{
  "namespace": "sdk",
  "event_type" : "sdk-offline",
  "identity_id": "ji2Rt8KJ4",
  "timestamp": "2024-10-02T12:17:41.501821249-04:00"
}
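
As a sketch of how these events might be consumed, the Go function below reads a sequence of JSON objects and keeps the most recent event type per identity. The `SdkEvent` type and `latestSdkStatus` function are hypothetical names; the sketch assumes events appear in chronological order and that other event namespaces may share the same log.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"io"
)

// SdkEvent maps the JSON fields shown in the example SDK events.
type SdkEvent struct {
	Namespace  string `json:"namespace"`
	EventType  string `json:"event_type"`
	IdentityID string `json:"identity_id"`
	Timestamp  string `json:"timestamp"`
}

// latestSdkStatus decodes consecutive JSON objects from data and returns the
// last event_type seen for each identity_id.
func latestSdkStatus(data []byte) (map[string]string, error) {
	status := map[string]string{}
	dec := json.NewDecoder(bytes.NewReader(data))
	for {
		var ev SdkEvent
		if err := dec.Decode(&ev); err != nil {
			if errors.Is(err, io.EOF) {
				return status, nil
			}
			return status, err
		}
		if ev.Namespace != "sdk" {
			continue // skip other event types written to the same log
		}
		status[ev.IdentityID] = ev.EventType
	}
}
```

Fed the three example events above, the resulting map would hold sdk-offline for identity ji2Rt8KJ4, since that is the last event in the sequence.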

Component Updates and Bug Fixes

Release 1.1.15

What's New

  • Bug fixes, enhancements and continuing progress on controller HA

Component Updates and Bug Fixes

Release 1.1.14

What's New

  • Bug fixes, enhancements and continuing progress on controller HA

Component Updates and Bug Fixes

Release 1.1.13

This release will not be promoted, as a test binary was unintentionally released in the release archives.

Release 1.1.12

What's New

  • Bug fixes, enhancements and continuing progress on controller HA
  • Data corruption Fix

Data Corruption Fix

Prior to version 1.1.12, the controller did not handle changes to the policy type of a service policy. Specifically, if the type was changed from Bind -> Dial, or Dial -> Bind, a set of denormalized data would be left behind, leaving the permissions with the old policy type.

Example:

  1. Identity A has Bind access to service B via Bind service policy C.
  2. The policy type of service policy C is changed from Bind to Dial.
  3. The service list would now likely show that Identity A has Dial and Bind access to service B, instead of just Dial access.

Mitigation/Fixing Bad Data

If you encounter this problem, the easiest and safest way to solve it is to delete and recreate the affected service policy.

If changing policy types is something you do on a regular basis, and you can't upgrade to a version with the fix, you can work around the issue by deleting and recreating policies, instead of updating them.

If you're not sure whether you have ever changed a policy type, there is a database integrity check tool that looks for data integrity errors. It is run against a running system.

Start the check using:

ziti fabric db start-check-integrity

This kicks off the operation in the background. The status of the check can be seen using:

ziti fabric db check-integrity-status 

By default this is a read-only operation. If the read-only run reports errors, it can be run with the -f flag, which will have it try to fix errors. The data integrity errors caused by this bug should all be fixable by the integrity checker.

ziti fabric db start-check-integrity -f

WARNINGS:

  • Always make a database snapshot before running the integrity checker: ziti fabric db snapshot <optional path>
  • The integrity checker can be very resource intensive, depending on the size of your data model. It is recommended that you run the integrity checker when the system is otherwise not busy.

Component Updates and Bug Fixes

Release 1.1.11

What's New

  • This release updates to Go v1.23
  • Updates to the latest version of golangci-lint, to allow it to work with the new version of Go
  • Linter fixes to address issues caught by updated linter

Release 1.1.10

What's New

  • Bug fixes, enhancements and continuing progress on controller HA

Component Updates and Bug Fixes

Release 1.1.9

What's New

  • Bug fixes, enhancements and continuing progress on controller HA
  • Includes a performance update (Issue #2340) which should improve connection ramp times. Previously circuits would start with a relatively low window size and ramp slowly. Circuits will now start with a large initial window size and scale back if they can't keep up
  • Added ziti ops verify-network. A command to aid when configuring the overlay network, this command will check config files for obvious mistakes
  • Added ziti ops verify-traffic. Another command to aid when configuring the overlay network, this command will ensure the overlay network is able to pass traffic

Component Updates and Bug Fixes

Release 1.1.8

What's New

  • Bug fixes, enhancements and continuing progress on controller HA

Component Updates and Bug Fixes

  • github.com/openziti/edge-api: v0.26.20 -> v0.26.23

    • Issue #120 - Add API for retrieving services referencing a config
    • Issue #121 - Add API for retrieving the set of attribute roles used by posture checks
  • github.com/openziti/sdk-golang: v0.23.38 -> v0.23.39

    • Issue #596 - SDK should submit selected config types to auth and service list APIs
    • Issue #593 - SDK Golang OIDC Get API Session Returns Wrong Value
  • github.com/openziti/storage: v0.2.47 -> v0.3.0

    • Issue #80 - Set indexes aren't cleaned up when referenced entities are deleted, only when they change
    • Issue #78 - Allow searching for things without case sensitivity
  • github.com/openziti/ziti: v1.1.7 -> v1.1.8

    • Issue #2121 - Use router data model for edge router tunnel
    • Issue #2245 - Add ability to retrieve a list of services that reference a config
    • Issue #2089 - Enhance Management API to list Posture Check Roles
    • Issue #2209 - /edge/v1/external-jwt-signers needs to be open
    • Issue #2010 - Add config information to router data model
    • Issue #1990 - Implement subscriber model for identity/service events in router
    • Issue #2240 - Secondary ext-jwt Auth Policy check incorrectly requires primary ext-jwt auth to be enabled

Release 1.1.7

What's New

  • Release actions fixes
  • Fix for a flaky acceptance test

Release 1.1.6

What's New

  • Trust Domain Configuration
  • Controller HA Beta 2

Trust Domain Configuration

OpenZiti controllers from this release forward will now require a trust domain to be configured. High Availability (HA) controllers already have this requirement. HA Controllers configure their trust domain via SPIFFE ids that are embedded in x509 certificates.

For feature parity, non-HA controllers will now have this same requirement. However, re-issuing certificates is not always easily done. To help with the transition, non-HA controllers can have their trust domain sourced from the controller configuration file through the root configuration value trustDomain. This field takes a string that must be URI hostname compatible (see: https://github.com/spiffe/spiffe/blob/main/standards/SPIFFE-ID.md). If this value is not defined, a trust domain will be generated from the root CA certificate of the controller.

For networks that will be deployed after this change, it is highly suggested that a SPIFFE id is added to certificates. The ziti pki create ... tooling supports the --spiffe-id option to help handle this scenario.

Generated Trust Domain Log Messages

The following log messages are examples of warnings produced when a controller is using a generated trust domain:

WARNING this environment is using a default generated trust domain [spiffe://d561decf63d229d66b07de627dbbde9e93228925], 
  it is recommended that a trust domain is specified in configuration via URI SANs or the 'trustDomain' field

WARNING this environment is using a default generated trust domain [spiffe://d561decf63d229d66b07de627dbbde9e93228925], 
  it is recommended that if network components have enrolled that the generated trust domain be added to the 
  configuration field 'additionalTrustDomains'

Trust domain resolution:

  • Non-HA controllers

    • Prefers SPIFFE ids in x509 certificate URI SANs, looking at the leaf up the signing chain
    • Falls back to trustDomain in the controller configuration file if not found
    • Falls back to generating a trust domain from the server certificate's root CA, if the above do not resolve
  • HA Controllers

    • Requires x509 SPIFFE ids in x509 certificate URI SANs
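
The non-HA resolution order above can be sketched in Go as follows. The function names and parameters are hypothetical. In particular, hashing the root CA with SHA-1 is only a stand-in: the release notes do not state how the generated trust domain is derived, only that it comes from the root CA certificate.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"net/url"
)

// trustDomainFromSANs returns the host part of the first spiffe:// URI in
// sans. The caller is expected to order sans from the leaf certificate up
// the signing chain.
func trustDomainFromSANs(sans []string) (string, bool) {
	for _, san := range sans {
		u, err := url.Parse(san)
		if err == nil && u.Scheme == "spiffe" && u.Host != "" {
			return u.Host, true
		}
	}
	return "", false
}

// resolveTrustDomain applies the non-HA fallback order: URI SANs first, then
// the configured trustDomain value, then a value generated from the root CA.
func resolveTrustDomain(uriSANs []string, configured string, rootCA []byte) string {
	if td, ok := trustDomainFromSANs(uriSANs); ok {
		return td
	}
	if configured != "" {
		return configured
	}
	// ASSUMPTION: a SHA-1 fingerprint stands in for the real generation scheme.
	sum := sha1.Sum(rootCA)
	return hex.EncodeToString(sum[:])
}
```

An HA controller would stop after the first step and report an error rather than fall back, since it requires SPIFFE ids in its certificates.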

Additional Trust Domains

When moving between trust domains (i.e. from the default generated to a new named one), the controller supports having other trust domains. The trust domains do not replace certificate chain validation, which is still checked and enforced.

Additional trust domains are configured in the controller configuration file under the root field additionalTrustDomains. This field is an array of hostname safe strings.

The most common use case for this field is a network that has issued certificates using the generated trust domain and now wants to transition to an explicitly defined one.

Controller HA Beta 2

This release can be run in HA mode. The code is still beta, as we're still finding and fixing bugs. Several bugs have been fixed since Beta 1, and C-based SDKs and tunnelers now work in HA mode. The smoketest can now be run with HA controllers and clients.

For more information:

Component Updates and Bug Fixes

  • github.com/openziti/storage: v0.2.45 -> v0.2.46

    • Issue #76 - Add support for non-boltz symbols to the boltz stores
  • github.com/openziti/ziti: v1.1.5 -> v1.1.6

    • Issue #2171 - Routers should consider control channels unresponsive if they are not connected
    • Issue #2219 - Add inspection for router connections
    • Issue #2195 - cached data model file set to
    • Issue #2222 - Add way to get read-only status from cluster nodes
    • Issue #2191 - Change raft list cluster members element name from values to data to match rest of REST api
    • Issue #785 - ziti edge update service-policy to empty/no posture checks fails
    • Issue #2205 - Merge fabric and edge model code
    • Issue #2165 - Add network id

Release 1.1.5

What's New

  • Bug fixes

Component Updates and Bug Fixes

Release 1.1.4

What's New

  • Controller HA Beta 1
  • Bug fixes

Controller HA Beta 1

This release can be run in HA mode. The code is still beta, as we're still finding and fixing bugs. Several bugs have been fixed since Alpha 3, and C-based SDKs and tunnelers now work in HA mode. The smoketest can now be run with HA controllers and clients.

For more information:

Component Updates and Bug Fixes

Release 1.1.3

What's New

  • Sticky Terminator Selection
  • Linux and Docker deployments log formats no longer default to the simplified format option and now use logging library defaults: json for non-interactive, text for interactive.

NOTE: This release is the first since 1.0.0 to be marked promoted from pre-release. Be sure to check the release notes for the rest of the post-1.0.0 releases to get the full set of changes.

Sticky Terminator Strategy

This release introduces a new terminator selection strategy, sticky. On every dial it will return a token to the dialer, which represents the terminator used in the dial. This token may be passed in on subsequent dials. If no token is passed in, the strategy will work the same as the smartrouting strategy. If a token is passed in, and the terminator is still valid, the same terminator will be used for the dial. A terminator will be considered valid if it still exists and there are no terminators with a higher precedence.
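
The selection rule described above can be sketched in Go as follows. This is an illustration under stated assumptions, not the controller's code: the `terminator` type is hypothetical, the token is modeled as a plain terminator id (the real token format is not described here), and `pickDefault` is a simplified stand-in for the smartrouting strategy.

```go
package main

// terminator is a hypothetical, simplified view of a service terminator.
type terminator struct {
	id         string
	precedence int // ASSUMPTION: a larger value means higher precedence
	cost       int
}

// pickDefault stands in for smartrouting: highest precedence, then lowest cost.
func pickDefault(terms []terminator) (terminator, bool) {
	if len(terms) == 0 {
		return terminator{}, false
	}
	best := terms[0]
	for _, t := range terms[1:] {
		if t.precedence > best.precedence ||
			(t.precedence == best.precedence && t.cost < best.cost) {
			best = t
		}
	}
	return best, true
}

// hasHigherPrecedence reports whether any terminator outranks precedence.
func hasHigherPrecedence(terms []terminator, precedence int) bool {
	for _, t := range terms {
		if t.precedence > precedence {
			return true
		}
	}
	return false
}

// pickSticky reuses the terminator named by token when it still exists and
// nothing has a higher precedence. Otherwise it falls back to pickDefault.
func pickSticky(terms []terminator, token string) (terminator, bool) {
	if token != "" {
		for _, t := range terms {
			if t.id == token && !hasHigherPrecedence(terms, t.precedence) {
				return t, true
			}
		}
	}
	return pickDefault(terms)
}
```

With two equal-precedence terminators, a dial carrying the token of the more expensive one still returns to it, while a dial without a token picks the cheaper one.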

This is currently only supported in the Go SDK.

Go SDK Example

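ziti edge create service test --terminator-strategy sticky

	// The first dial carries no token, so the terminator is chosen as usual.
	conn, err := clientContext.Dial("test")
	if err != nil {
		panic(err)
	}
	token := conn.Conn.GetStickinessToken()
	_ = conn.Close()

	// Pass the token back in to request the same terminator.
	dialOptions := &ziti.DialOptions{
		ConnectTimeout:  time.Second,
		StickinessToken: token,
	}
	conn, err = clientContext.DialWithOptions("test", dialOptions)
	if err != nil {
		panic(err)
	}
	nextToken := conn.Conn.GetStickinessToken() // use nextToken on the following dial
	_ = conn.Close()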

Component Updates and Bug Fixes

Release 1.1.2

What's New

  • Bug fixes and minor enhancements

Component Updates and Bug Fixes

  • github.com/openziti/sdk-golang: v0.23.32 -> v0.23.35
  • github.com/openziti/ziti: v1.1.1 -> v1.1.2
    • Issue #2032 - Auto CA Enrollment Fails w/ 400 Bad Request
    • Issue #2026 - Root Version Endpoint Handling 404s
    • Issue #2002 - JWKS endpoints may not refresh on new KID
    • Issue #2007 - Identities for edge routers with tunneling enabled sometimes show hasEdgeRouterConnection=false even though everything is OK
    • Issue #1983 - delete of non-existent entity causes panic when run on follower controller

Release 1.1.1

What's New

HA Alpha 3

This release can be run in HA mode. The code is still alpha, as we're still finding and fixing bugs.

For more information:

New Contributors

Thanks to new contributors

  • @Vrashabh-Sontakke

Component Updates and Bug Fixes

Release 1.1.0

What's New

  • HA Alpha2
  • Deployments Alpha
    • Linux packages provide systemd services for controller and router. Both depend on existing package openziti which provides the ziti command line tool.
      • openziti-controller provides ziti-controller.service
      • openziti-router provides ziti-router.service
    • Container images for controller and router now share the bootstrapping logic with the packages, so they support the same configuration options.

HA Alpha2

This release can be run in HA mode. The code is still alpha, so there are still some bugs and missing features; however, basic functionality works, with the exceptions noted below. See the HA Documentation for instructions on setting up an HA cluster.

Known Issues

  • JWT Session exchange isn't working with Go SDK clients
    • This means Go clients will need to be restarted once their sessions expire
  • Service/service policy changes might not be reflected in routers
    • Changes to policy may not yet properly sync to the routers, causing unexpected behavior with ER/Ts running in HA mode

More information can be found on the HA Project Board

Component Updates and Bug Fixes

Release 1.0.0

About 1.0

What does marking OpenZiti as 1.0 mean?

Backwards Compatibility

We've guaranteed API stability for SDK clients for years and worked hard to ensure that routers and controllers would be backwards and forward compatible. However, we have had a variety of management API changes and CLI changes. For post 1.0 releases we expect to make additions to the APIs and CLI, but won't remove anything until it's been first marked as deprecated and then only with a major version bump.

Stability and Scale

Recent releases have seen additional testing using chaos testing techniques. These tests involve setting up relatively large scale environments, knocking out various components and then verifying that the network is able to return to a stable state. These tests are run for hours to try to eliminate race conditions and distributed state machine problems.

OpenZiti is also being used as underlying infrastructure for the zrok public service. Use of this network has grown quickly and proven that it's possible to build ziti native apps that can scale up.

Backward Incompatible Changes to pre-1.0 releases

Administrators no longer have access to dial/bind all services by default. See below for details.

What's New

  • Administrators no longer have access to dial/bind all services by default.
  • TLS Handshakes can now be rate limited in the controller
  • TLS Handshake timeouts can now be set on the controller when using ALPN
  • Bugfixes

DEFAULT Bind/Dial SERVICE PERMISSIONS FOR Admin IDENTITIES HAVE CHANGED

Admin identities were able to Dial and Bind all services regardless of the effective service policies prior to this release. This could lead to a confusing situation where a tunneler that was assuming an Admin identity would put itself into an infinite connect-loop when a service's host.v1 address overlapped with any addresses in its intercept configuration.

Please create service policies to grant Bind or Dial permissions to Admin identities as needed.

TLS Handshake

A TLS handshake rate limiter can be enabled. This is useful in cases where there's a flood of TLS requests and the controller can't handle them all. It can get into a state where it can't respond to TLS handshakes quickly enough, so the clients time out. They then retry, adding to the load. The controller ends up wasting time doing work that isn't used.

This uses the same rate limiting as the auth rate limiter.

Additionally the server side handshake timeout can now be configured.

Configuration:

tls: 
  handshakeTimeout: 15s

  rateLimiter:
    # if disabled, no tls handshake rate limiting will be enforced
    enabled: true
    # the smallest window size for tls handshakes
    minSize: 5
    # the largest allowed window size for tls handshakes
    maxSize: 5000
    # after how long to consider a handshake abandoned if neither success nor failure was reported
    timeout: 30s

New metrics:

  • tls_handshake_limiter.in_process - number of TLS handshakes in progress
  • tls_handshake_limiter.window_size - number of TLS handshakes allowed concurrently
  • tls_handshake_limiter.work_timer - timer tracking how long TLS handshakes are taking
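
To make the window settings and metrics concrete, here is a minimal Go sketch of a window-based limiter. It is not the controller's algorithm: the `handshakeLimiter` type is hypothetical, and starting at maxSize, growing the window by one on success and halving it on timeout are all assumptions of this sketch. Only the minSize and maxSize bounds and the two gauges come from the text above.

```go
package main

import "errors"

var errTooManyHandshakes = errors.New("too many TLS handshakes in progress")

// handshakeLimiter caps concurrent handshakes at windowSize, which is kept
// between minSize and maxSize.
type handshakeLimiter struct {
	minSize    int
	maxSize    int
	windowSize int // corresponds to tls_handshake_limiter.window_size
	inProcess  int // corresponds to tls_handshake_limiter.in_process
}

func newHandshakeLimiter(minSize, maxSize int) *handshakeLimiter {
	return &handshakeLimiter{minSize: minSize, maxSize: maxSize, windowSize: maxSize}
}

// begin admits a handshake if the window has room.
func (l *handshakeLimiter) begin() error {
	if l.inProcess >= l.windowSize {
		return errTooManyHandshakes
	}
	l.inProcess++
	return nil
}

// finish records the outcome of an admitted handshake.
// ASSUMPTION: shrink by half on timeout, grow by one on success.
func (l *handshakeLimiter) finish(timedOut bool) {
	l.inProcess--
	if timedOut {
		l.windowSize /= 2
		if l.windowSize < l.minSize {
			l.windowSize = l.minSize
		}
		return
	}
	if l.windowSize < l.maxSize {
		l.windowSize++
	}
}
```

The sketch is single-threaded; a real limiter would guard these fields with a mutex or atomics, and would treat a handshake with no reported result after the configured timeout as abandoned.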

Component Updates and Bug Fixes