Commit

Update doc
mymeiyi committed Nov 2, 2023
1 parent 7c3eef8 commit e2aeb7b
Showing 2 changed files with 36 additions and 6 deletions.
21 changes: 18 additions & 3 deletions docs/en/docs/data-operate/import/import-way/group-commit-manual.md
@@ -32,6 +32,18 @@ In Doris, all methods of data loading are independent jobs which initiate a new

Note that a group commit load returns as soon as the data has been written to the WAL. At that point the data is not yet visible to users; by default it becomes visible after 10 seconds.

## Fundamentals

### Write process
1. The user starts a group commit load, and FE generates a plan fragment.
2. BE executes the plan. Unlike a non-group-commit load, the processed data is not sent to each tablet; instead it is put into an in-memory queue shared by multiple group commit loads.
3. BE starts an internal load that consumes the data in the queue, writes it to the WAL, and notifies the corresponding loads that they have finished.
4. After that, the data is processed in the same way as in a non-group-commit load: it is sent to each tablet, written to the memtable, and flushed to segment files.
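
The write process can be illustrated with a toy model. The sketch below is not Doris code: the class `GroupCommitQueue`, its methods, and the in-memory lists standing in for the WAL and the tablets are all hypothetical names chosen for illustration. It only shows the ordering described in the steps: loads enqueue their rows, an internal load drains the shared queue, writes to the WAL, acknowledges each load, and then passes the rows downstream.

```python
import queue
import threading


class GroupCommitQueue:
    """Toy model of the shared in-memory queue (hypothetical, not Doris code)."""

    def __init__(self):
        self._queue = queue.Queue()
        self.wal = []      # stands in for the WAL file
        self.tablets = []  # stands in for rows sent on to the tablets

    def submit(self, rows):
        """Called by a group commit load: enqueue rows and return an event
        that is set once the rows have been written to the WAL."""
        done = threading.Event()
        self._queue.put((rows, done))
        return done

    def run_internal_load(self):
        """Internal load: drain the queue, write each batch to the WAL,
        notify the originating load, then pass the rows downstream."""
        while True:
            try:
                rows, done = self._queue.get_nowait()
            except queue.Empty:
                break
            self.wal.extend(rows)      # step 3: write to the WAL
            done.set()                 # step 3: notify the load that it finished
            self.tablets.extend(rows)  # step 4: continue as in a normal load


gc = GroupCommitQueue()
ack_a = gc.submit([(1, "a")])
ack_b = gc.submit([(2, "b"), (3, "c")])
gc.run_internal_load()
print(ack_a.is_set(), ack_b.is_set(), len(gc.wal))
```

In this model a load is acknowledged after the WAL write but before the rows reach the tablets, which mirrors why data from a group commit load is not immediately visible.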

### WAL Introduction

Each group commit load generates a corresponding WAL file, which is used to recover failed load jobs. If BE restarts or the group commit load fails during the writing process, BE replays the WAL file through a stream load in the background to re-import the data, which ensures that no data is lost. If the group commit load job completes normally, the WAL file is deleted directly to reduce disk space usage.
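
The WAL lifecycle can be sketched as follows. This is a minimal illustration under stated assumptions, not the BE implementation: the file naming (`<load_id>.wal`), the JSON-lines format, and the functions `write_wal`, `finish_load`, and `replay_wals` are all hypothetical, and the `reimport` callback merely stands in for the background stream load.

```python
import json
import os
import tempfile


def write_wal(wal_dir, load_id, rows):
    """Persist the rows of one load to its own WAL file and return the path."""
    path = os.path.join(wal_dir, f"{load_id}.wal")
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return path


def finish_load(wal_path, succeeded):
    """Delete the WAL after a successful load; keep it for replay otherwise."""
    if succeeded:
        os.remove(wal_path)


def replay_wals(wal_dir, reimport):
    """Re-import every leftover WAL file, then delete it.

    `reimport` is a hypothetical callback standing in for the stream load.
    """
    replayed = []
    for name in sorted(os.listdir(wal_dir)):
        if not name.endswith(".wal"):
            continue
        path = os.path.join(wal_dir, name)
        with open(path, encoding="utf-8") as f:
            rows = [json.loads(line) for line in f]
        reimport(rows)
        os.remove(path)
        replayed.append(name)
    return replayed


with tempfile.TemporaryDirectory() as wal_dir:
    ok = write_wal(wal_dir, "load_1", [[1, "a"]])
    failed = write_wal(wal_dir, "load_2", [[2, "b"], [3, "c"]])
    finish_load(ok, succeeded=True)       # completed normally: WAL removed
    finish_load(failed, succeeded=False)  # failed: WAL kept for replay
    recovered = []
    print(replay_wals(wal_dir, recovered.extend), recovered)
```

Only the WAL of the failed load survives to be replayed; the successful load leaves nothing behind on disk.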

## Basic operations

If the table schema is:
@@ -244,6 +256,9 @@ See [Synchronize Data Using Insert Method](../import-scenes/jdbc-load.md) for mo
The interval after which the internal group commit load job stops and a new internal job starts. The default value is 10000 milliseconds.

+ group_commit_replay_wal_dir
+ group_commit_sync_wal_batch
+ group_commit_replay_wal_retry_num
+ group_commit_replay_wal_retry_interval_seconds

The directory for storing WAL files. Multiple directories can be configured. By default, a directory named `wal` is created under Doris Home. It is recommended that users configure it separately. Configuration example:

```
group_commit_replay_wal_dir=/data1/storage/wal,/data2/storage/wal,/data3/storage/wal
```
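
As a quick sanity check of such a setting, the value can be split into individual directories. The helper below is a hypothetical sketch, not Doris's own parser: the function name `parse_wal_dirs` is invented here, and it assumes only what the example shows, namely a `key=value` line whose value is a comma-separated list of paths.

```python
def parse_wal_dirs(config_line):
    """Split a `group_commit_replay_wal_dir=...` line into a list of directories.

    Hypothetical helper: assumes a comma-separated value, as in the example.
    """
    key, _, value = config_line.partition("=")
    if key.strip() != "group_commit_replay_wal_dir":
        raise ValueError(f"unexpected config key: {key.strip()!r}")
    # Ignore surrounding whitespace and empty entries such as a trailing comma.
    return [d.strip() for d in value.split(",") if d.strip()]


line = "group_commit_replay_wal_dir=/data1/storage/wal,/data2/storage/wal,/data3/storage/wal"
for wal_dir in parse_wal_dirs(line):
    print(wal_dir)
```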
@@ -32,6 +32,18 @@ under the License.

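Note that a group commit write returns as soon as the data has been written to the WAL. The data cannot be read immediately at that point; by default it can be read after 10 seconds.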

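## Principles

### Write process
1. The user starts a group commit write, and FE generates an execution plan;
2. BE executes the plan. Unlike a non-group-commit load, the processed data is not sent to each tablet but is put into an in-memory queue that multiple group commits share;
3. BE internally starts a load plan that consumes the data in the queue, writes it to the WAL, and notifies that the write corresponding to that data has finished;
4. After that, the consumed data follows the same processing flow as a normal write: it is sent to each tablet, written to the memtable, flushed to segment files, and so on;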

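### WAL introduction

Each group commit generates a corresponding WAL file, which is used to recover failed group commit jobs. If BE restarts or the group commit job fails during writing, BE can replay the WAL file by starting a stream load in the background to re-import the data, which ensures that group commit data is not lost. If the group commit job completes normally, the WAL is deleted directly.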

## Basic operations

Suppose the table schema is:
@@ -244,6 +256,9 @@ private static void groupCommitInsertBatch() throws Exception {
How long a group commit write runs after it starts before it ends. The default is 10000, that is, 10 seconds.

+ group_commit_replay_wal_dir
+ group_commit_sync_wal_batch
+ group_commit_replay_wal_retry_num
+ group_commit_replay_wal_retry_interval_seconds

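The directory for storing WAL files. Multiple directories can be configured. By default, a directory named wal is created under Doris Home. Users are advised to configure it separately. Configuration example: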

```
group_commit_replay_wal_dir=/data1/storage/wal,/data2/storage/wal,/data3/storage/wal
```
