docs: worker v2 public docs. (#42873)
Update docs to prep for Worker V2 OSS launch.

The main focus is a brief explanation of the what/why/how of the feature. We have a blog post in the works that I'm going to link to for more explanation once it's published.

- Remove old *_WORKER configs.
- Update the diagram to reflect the new flow. Viewing it now is not great due to the rendering; will likely follow up with a different diagram.
- Write Worker V2 explanation. Note that I left all the Docker pieces in place and tried to make things 'flow'. Will follow up here when we deprecate Docker officially.

Co-authored-by: Jimmy Ma <[email protected]>
davinchia and gosusnp authored Jul 31, 2024
1 parent 502e9b7 commit aa5d6dd
Showing 3 changed files with 52 additions and 45 deletions.
10 changes: 6 additions & 4 deletions docs/operator-guides/configuring-airbyte.md
@@ -114,10 +114,12 @@ The following variables are relevant to both Docker and Kubernetes.

#### Worker

1. `MAX_SPEC_WORKERS` - Defines the maximum number of Spec workers each Airbyte Worker container can support. Defaults to 5.
2. `MAX_CHECK_WORKERS` - Defines the maximum number of Check workers each Airbyte Worker container can support. Defaults to 5.
3. `MAX_SYNC_WORKERS` - Defines the maximum number of Sync workers each Airbyte Worker container can support. Defaults to 5.
4. `MAX_DISCOVER_WORKERS` - Defines the maximum number of Discover workers each Airbyte Worker container can support. Defaults to 5.
1. `MAX_CHECK_WORKERS` - Defines the maximum number of Non-Sync workers each Airbyte Worker container can support. Defaults to 5.
2. `MAX_SYNC_WORKERS` - Defines the maximum number of Sync workers each Airbyte Worker container can support. Defaults to 10.

#### Launcher

1. `WORKLOAD_LAUNCHER_PARALLELISM` - Defines the number of jobs that can be started at once. Defaults to 10.
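
For illustration, here is a minimal Python sketch of how a process might read these knobs with the defaults documented above. It is an illustrative sketch for seeing how the values relate, not Airbyte's actual implementation.

```python
import os

# Illustrative sketch only -- not Airbyte's implementation. It reads the
# variables documented above and applies the documented defaults.
max_check_workers = int(os.getenv("MAX_CHECK_WORKERS", "5"))    # non-Sync tasks per worker container
max_sync_workers = int(os.getenv("MAX_SYNC_WORKERS", "10"))     # Sync tasks per worker container
launcher_parallelism = int(os.getenv("WORKLOAD_LAUNCHER_PARALLELISM", "10"))  # jobs started at once

print(
    f"Each worker container handles up to {max_sync_workers} Sync and "
    f"{max_check_workers} non-Sync tasks concurrently; the launcher "
    f"starts at most {launcher_parallelism} jobs at a time."
)
```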

#### Data Retention

29 changes: 20 additions & 9 deletions docs/understanding-airbyte/high-level-view.md
@@ -15,33 +15,44 @@ A more concrete diagram can be seen below:
```mermaid
---
title: Architecture Overview
config:
  theme: neutral
---
%%{init: {"flowchart": {"defaultRenderer": "elk"}} }%%
flowchart LR
W[fa:fa-display WebApp/UI]
S[fa:fa-server Server/Config API]
S[fa:fa-server Config API Server]
D[(fa:fa-table Config & Jobs)]
T(fa:fa-calendar Temporal)
L[(fa:fa-server Launcher)]
O[(fa:fa-superpowers Orchestrator)]
Q[(fa:fa-superpowers Queue)]
T(fa:fa-calendar Temporal/Scheduler)
W2[1..n Airbyte Workers]
WL[fa:fa-server Workload API Server]
W -->|sends API requests| S
S -->|store data| D
S -->|create workflow| T
T -->|launch task| W2
W2 -->|return job| T
W2 -->|launches| Source
W2 -->|launches| Destination
W2 -->|return status| T
W2 -->|creates job| WL
WL -->|queues workload| Q
Q -->|reads from| L
L -->|launches| O
O -->|launches/reads from| Source
O -->|launches/reads from/writes to| Destination
O -->|reports status to| WL
```

- **Web App/UI** [`airbyte-webapp`, `airbyte-proxy`]: An easy-to-use graphical interface for interacting with the Airbyte API.
- **Server/Config API** [`airbyte-server`, `airbyte-server-api`]: Handles connection between UI and API. Airbyte's main control plane. All operations in Airbyte such as creating sources, destinations, connections, managing configurations, etc.. are configured and invoked from the API.
- **Config API Server** [`airbyte-server`, `airbyte-server-api`]: Handles the connection between the UI and the API. Airbyte's main control plane. All operations in Airbyte, such as creating sources, destinations, and connections and managing configurations, are configured and invoked from the API.
- **Database Config & Jobs** [`airbyte-db`]: Stores all the connection information \(credentials, frequency...\).
- **Temporal Service** [`airbyte-temporal`]: Manages the task queue and workflows.
- **Worker** [`airbyte-worker`]: The worker connects to a source connector, pulls the data and writes it to a destination.
- **Workload API** [`airbyte-workload-api-server`]: Manages workloads, Airbyte's internal job abstraction.
- **Launcher** [`airbyte-workload-launcher`]: Launches workloads.

The diagram shows the steady-state operation of Airbyte. There are additional components, not described above, that you'll see in your deployment:

- **Cron** [`airbyte-cron`]: Clean the server and sync logs (when using local logs)
- **Cron** [`airbyte-cron`]: Cleans up server and sync logs (when using local logs). Regularly updates connector definitions and sweeps old workloads.
- **Bootloader** [`airbyte-bootloader`]: Upgrades and migrates the database tables and confirms the environment is ready to work.

This is a holistic, high-level description of each component. For Airbyte deployed on Kubernetes, the structure is very similar, with a few changes.
58 changes: 26 additions & 32 deletions docs/understanding-airbyte/jobs.md
Expand Up @@ -244,7 +244,7 @@ There are 2 flavors of workers:
The worker extracts data from the connector and reports it to the scheduler. It does this by listening to the connector's STDOUT.
These jobs are synchronous as they are part of the configuration process and need to run immediately to provide a good user experience. These are also all lightweight operations.

2. **Asynchronous Job Worker** - Workers that interact with 2 connectors \(e.g. sync, reset\)
2. **Asynchronous Job Worker** - Workers that interact with 2 connectors \(e.g. sync, clear\)

The worker passes data \(via record messages\) from the source to the destination. It does this by listening on STDOUT of the source and writing to STDIN of the destination.
These jobs are asynchronous as they are often long-running, resource-intensive processes. They are decoupled from the rest of the platform to simplify development and operation.
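
As a rough illustration of the record-passing described above, the sketch below pipes a source connector's STDOUT into a destination connector's STDIN. The connector commands, config/catalog file names, and the crude string filter are assumptions for illustration; a real worker parses Airbyte protocol messages and also handles state, logging, heartbeats, and failures.

```python
import subprocess

# Hypothetical connector invocations -- illustrative only.
source = subprocess.Popen(
    ["source-connector", "read", "--config", "source_config.json", "--catalog", "catalog.json"],
    stdout=subprocess.PIPE, text=True,
)
destination = subprocess.Popen(
    ["destination-connector", "write", "--config", "dest_config.json", "--catalog", "catalog.json"],
    stdin=subprocess.PIPE, text=True,
)

# Forward record messages from the source's STDOUT to the destination's STDIN.
for line in source.stdout:
    if '"type"' in line and '"RECORD"' in line:  # crude filter; a real worker parses each JSON message
        destination.stdin.write(line)

destination.stdin.close()  # signal end-of-input so the destination can flush and finish
source.wait()
destination.wait()
```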
@@ -312,33 +312,31 @@ The Cloud Storage store is treated as the source-of-truth of execution state.

The Container Orchestrator is only available for Airbyte Kubernetes today and is automatically enabled when deploying via the Airbyte Helm charts.

```mermaid
---
title: Start a new Sync
---
sequenceDiagram
%% participant API
participant Temporal as Temporal Queues
participant Sync as Sync Workflow
participant ReplicationA as Replication Activity
participant ReplicationP as Replication Process
participant PersistA as Persistent Activity
participant AirbyteDB
Sync->>Temporal: Start a replication Activity
Temporal->>Sync: Pick up a new Sync
Temporal->>ReplicationA: Pick up a new task
ReplicationA->>ReplicationP: Starts a process
ReplicationP->>ReplicationA: Replication Summary with State message and stats
ReplicationA->>Temporal: Return Output (States and Summary)
Temporal->>Sync: Read results from Replication Activity
Sync->>Temporal: Start Persistent State Activity
Temporal->>PersistA: Pick up new task
PersistA->>AirbyteDB: Persist States
PersistA->>Temporal: Return output
```

Users running Airbyte Docker should be aware of the above pitfalls.

## Workloads

Workloads is Airbyte's next-generation Worker architecture. It is designed to be more scalable, reliable, and maintainable than the current Worker architecture, and it performs particularly
well in low-resource environments.

One big flaw of the pre-Workloads architecture was that scheduling a job was coupled with starting it. This complicated configuration and created thundering-herd situations in
resource-constrained environments with spiky job scheduling.

Workloads is an Airbyte-internal job abstraction that decouples the number of running jobs (including those in the queue) from the number of jobs that can be started. Jobs stay queued
until resources become available or until they are canceled. This allows for better back pressure and self-healing in resource-constrained environments.

Workers now communicate with the Workload API Server to create a Workload instead of directly starting jobs.

The **Workload API Server** places the job in a queue. The **Launcher** picks up the job and launches the resources needed to run it, e.g. Kubernetes pods. It throttles
job creation based on available resources, minimizing deadlock situations.
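
The queue-then-launch behavior can be sketched as a small producer/consumer loop. This is an in-memory stand-in under simplifying assumptions: the real Workload API Server persists workloads and the real Launcher starts Kubernetes pods, but the back-pressure idea is the same.

```python
import queue
import threading
import time

workload_queue: "queue.Queue[str]" = queue.Queue()  # stand-in for the workload queue
launcher_slots = threading.Semaphore(10)            # cf. WORKLOAD_LAUNCHER_PARALLELISM

def create_workload(workload_id: str) -> None:
    """Worker side: enqueue a workload instead of starting the job directly."""
    workload_queue.put(workload_id)

def run_job(workload_id: str) -> None:
    try:
        time.sleep(1)  # placeholder for launching pods and running the job
    finally:
        launcher_slots.release()  # free the slot so the next queued workload can start

def launcher_loop() -> None:
    """Launcher side: jobs wait in the queue until a launch slot is free."""
    while True:
        workload_id = workload_queue.get()
        launcher_slots.acquire()
        threading.Thread(target=run_job, args=(workload_id,)).start()
```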

With this setup, Airbyte now supports:
- configuring the maximum number of concurrent jobs via the `MAX_CHECK_WORKERS` and `MAX_SYNC_WORKERS` environment variables.
- configuring the maximum number of jobs that can be started at once via `WORKLOAD_LAUNCHER_PARALLELISM`.
- differentiating between job schedule time and job start time via the Workload API, though this is not exposed in the UI.

This also unlocks future work to turn Workers asynchronous, which allows for more efficient steady-state resource usage.

## Configuring Jobs & Workers

Details on configuring jobs & workers can be found [here](../operator-guides/configuring-airbyte.md).
@@ -348,9 +346,5 @@ Details on configuring jobs & workers can be found [here](../operator-guides/configuring-airbyte.md).
Airbyte exposes the following environment variables to change the maximum number of each type of worker allowed to run in parallel.
Tweaking these values might help you run more jobs in parallel and increase the workload of your Airbyte instance:

- `MAX_SPEC_WORKERS`: Maximum number of _Spec_ workers allowed to run in parallel.
- `MAX_CHECK_WORKERS`: Maximum number of _Check connection_ workers allowed to run in parallel.
- `MAX_DISCOVERY_WORKERS`: Maximum number of _Discovery_ workers allowed to run in parallel.
- `MAX_SYNC_WORKERS`: Maximum number of _Sync_ workers allowed to run in parallel.

The current default value for these environment variables is currently set to **5**.
- `MAX_CHECK_WORKERS`: Maximum number of _Non-Sync_ workers allowed to run in parallel. Defaults to **5**.
- `MAX_SYNC_WORKERS`: Maximum number of _Sync_ workers allowed to run in parallel. Defaults to **10**.
