[Feature][Connector-V2] Supports the transfer of any file (apache#6826)
Hisoka-X authored Jun 4, 2024
1 parent 1da9bd6 commit c140178
Showing 30 changed files with 520 additions and 22 deletions.
3 changes: 2 additions & 1 deletion docs/en/connector-v2/sink/CosFile.md
@@ -30,6 +30,7 @@ By default, we use 2PC commit to ensure `exactly-once`
- [x] json
- [x] excel
- [x] xml
- [x] binary

## Options

@@ -115,7 +116,7 @@ When the format in the `file_name_expression` parameter is `xxxx-${now}` , `file

The following file types are supported:

`text` `json` `csv` `orc` `parquet` `excel` `xml`
`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`

Please note that the final file name will end with the suffix of the chosen `file_format_type`; the suffix of a text file is `txt`.

3 changes: 2 additions & 1 deletion docs/en/connector-v2/sink/FtpFile.md
@@ -28,6 +28,7 @@ By default, we use 2PC commit to ensure `exactly-once`
- [x] json
- [x] excel
- [x] xml
- [x] binary

## Options

@@ -120,7 +121,7 @@ When the format in the `file_name_expression` parameter is `xxxx-${now}` , `file

The following file types are supported:

`text` `json` `csv` `orc` `parquet` `excel` `xml`
`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`

Please note that the final file name will end with the suffix of the chosen `file_format_type`; the suffix of a text file is `txt`.

3 changes: 2 additions & 1 deletion docs/en/connector-v2/sink/HdfsFile.md
@@ -22,6 +22,7 @@ By default, we use 2PC commit to ensure `exactly-once`
- [x] json
- [x] excel
- [x] xml
- [x] binary
- [x] compress codec
- [x] lzo

@@ -46,7 +47,7 @@ Output data to hdfs file
| custom_filename | boolean | no | false | Whether you need custom the filename |
| file_name_expression | string | no | "${transactionId}" | Only used when `custom_filename` is `true`.`file_name_expression` describes the file expression which will be created into the `path`. We can add the variable `${now}` or `${uuid}` in the `file_name_expression`, like `test_${uuid}_${now}`,`${now}` represents the current time, and its format can be defined by specifying the option `filename_time_format`.Please note that, If `is_enable_transaction` is `true`, we will auto add `${transactionId}_` in the head of the file. |
| filename_time_format | string | no | "yyyy.MM.dd" | Only used when `custom_filename` is `true`.When the format in the `file_name_expression` parameter is `xxxx-${now}` , `filename_time_format` can specify the time format of the path, and the default value is `yyyy.MM.dd` . The commonly used time formats are listed as follows:[y:Year,M:Month,d:Day of month,H:Hour in day (0-23),m:Minute in hour,s:Second in minute] |
| file_format_type | string | no | "csv" | We supported as the following file types:`text` `json` `csv` `orc` `parquet` `excel` `xml`.Please note that, The final file name will end with the file_format's suffix, the suffix of the text file is `txt`. |
| file_format_type | string | no | "csv" | We supported as the following file types:`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`.Please note that, The final file name will end with the file_format's suffix, the suffix of the text file is `txt`. |
| field_delimiter | string | no | '\001' | Only used when file_format is text,The separator between columns in a row of data. Only needed by `text` file format. |
| row_delimiter | string | no | "\n" | Only used when file_format is text,The separator between rows in a file. Only needed by `text` file format. |
| have_partition | boolean | no | false | Whether you need processing partitions. |
3 changes: 2 additions & 1 deletion docs/en/connector-v2/sink/LocalFile.md
@@ -28,6 +28,7 @@ By default, we use 2PC commit to ensure `exactly-once`
- [x] json
- [x] excel
- [x] xml
- [x] binary

## Options

@@ -94,7 +95,7 @@ When the format in the `file_name_expression` parameter is `xxxx-${now}` , `file

The following file types are supported:

`text` `json` `csv` `orc` `parquet` `excel` `xml`
`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`

Please note that the final file name will end with the suffix of the chosen `file_format_type`; the suffix of a text file is `txt`.

3 changes: 2 additions & 1 deletion docs/en/connector-v2/sink/OssFile.md
@@ -33,6 +33,7 @@ By default, we use 2PC commit to ensure `exactly-once`
- [x] json
- [x] excel
- [x] xml
- [x] binary

## Data Type Mapping

@@ -166,7 +167,7 @@ When the format in the `file_name_expression` parameter is `xxxx-${Now}` , `file

The following file types are supported:

`text` `json` `csv` `orc` `parquet` `excel` `xml`
`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`

Please note that the final file name will end with the suffix of the chosen `file_format_type`; the suffix of a text file is `txt`.

3 changes: 2 additions & 1 deletion docs/en/connector-v2/sink/OssJindoFile.md
@@ -34,6 +34,7 @@ By default, we use 2PC commit to ensure `exactly-once`
- [x] json
- [x] excel
- [x] xml
- [x] binary

## Options

@@ -119,7 +120,7 @@ When the format in the `file_name_expression` parameter is `xxxx-${now}` , `file

The following file types are supported:

`text` `json` `csv` `orc` `parquet` `excel` `xml`
`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`

Please note that the final file name will end with the suffix of the chosen `file_format_type`; the suffix of a text file is `txt`.

3 changes: 2 additions & 1 deletion docs/en/connector-v2/sink/S3File.md
@@ -23,6 +23,7 @@ By default, we use 2PC commit to ensure `exactly-once`
- [x] json
- [x] excel
- [x] xml
- [x] binary

## Description

@@ -172,7 +173,7 @@ When the format in the `file_name_expression` parameter is `xxxx-${now}` , `file

The following file types are supported:

`text` `json` `csv` `orc` `parquet` `excel` `xml`
`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`

Please note that the final file name will end with the suffix of the chosen `file_format_type`; the suffix of a text file is `txt`.

3 changes: 2 additions & 1 deletion docs/en/connector-v2/sink/SftpFile.md
@@ -28,6 +28,7 @@ By default, we use 2PC commit to ensure `exactly-once`
- [x] json
- [x] excel
- [x] xml
- [x] binary

## Options

@@ -113,7 +114,7 @@ When the format in the `file_name_expression` parameter is `xxxx-${now}` , `file

The following file types are supported:

`text` `json` `csv` `orc` `parquet` `excel` `xml`
`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`

Please note that the final file name will end with the suffix of the chosen `file_format_type`; the suffix of a text file is `txt`.

41 changes: 40 additions & 1 deletion docs/en/connector-v2/source/CosFile.md
@@ -27,6 +27,7 @@ Read all the data in a split in a pollNext call. What splits are read will be sa
- [x] json
- [x] excel
- [x] xml
- [x] binary

## Description

@@ -76,7 +77,7 @@ The source file path.

File type. The following file types are supported:

`text` `csv` `parquet` `orc` `json` `excel` `xml`
`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`

If you set the file type to `json`, you should also set the `schema` option to tell the connector how to parse the data into the rows you want.

@@ -160,6 +161,11 @@ connector will generate data as the following:
|---------------|-----|--------|
| tyrantlucifer | 26 | male |

If you set the file type to `binary`, SeaTunnel can synchronize files in any format,
such as compressed archives and images. In short, any file can be copied to the target location.
For this to work, both the source and the sink must set `file_format_type` to `binary`.
You can find the specific usage in the example below.

### bucket [string]

The bucket address of Cos file system, for example: `Cos://tyrantlucifer-image-bed`
@@ -321,6 +327,39 @@ Source plugin common parameters, please refer to [Source Common Options](common-
```

### Transfer Binary File

```hocon
env {
  parallelism = 1
  job.mode = "BATCH"
}
source {
  CosFile {
    bucket = "cosn://seatunnel-test-1259587829"
    secret_id = "xxxxxxxxxxxxxxxxxxx"
    secret_key = "xxxxxxxxxxxxxxxxxxx"
    region = "ap-chengdu"
    path = "/seatunnel/read/binary/"
    file_format_type = "binary"
  }
}
sink {
  // The sink does not have to match the source; files can also be written to S3, HDFS, OSS, etc.
  CosFile {
    bucket = "cosn://seatunnel-test-1259587829"
    secret_id = "xxxxxxxxxxxxxxxxxxx"
    secret_key = "xxxxxxxxxxxxxxxxxxx"
    region = "ap-chengdu"
    path = "/seatunnel/read/binary2/"
    file_format_type = "binary"
  }
}
```

## Changelog

### next version
41 changes: 40 additions & 1 deletion docs/en/connector-v2/source/FtpFile.md
@@ -22,6 +22,7 @@
- [x] json
- [x] excel
- [x] xml
- [x] binary

## Description

@@ -86,7 +87,7 @@ The source file path.

File type. The following file types are supported:

`text` `csv` `parquet` `orc` `json` `excel` `xml`
`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`

If you set the file type to `json`, you should also set the `schema` option to tell the connector how to parse the data into the rows you want.

@@ -159,6 +160,11 @@ connector will generate data as the following:
|---------------|-----|--------|
| tyrantlucifer | 26 | male |

If you set the file type to `binary`, SeaTunnel can synchronize files in any format,
such as compressed archives and images. In short, any file can be copied to the target location.
For this to work, both the source and the sink must set `file_format_type` to `binary`.
You can find the specific usage in the example below.

### connection_mode [string]

The target ftp connection mode , default is active mode, supported as the following modes:
@@ -288,6 +294,39 @@ Source plugin common parameters, please refer to [Source Common Options](common-
```

### Transfer Binary File

```hocon
env {
  parallelism = 1
  job.mode = "BATCH"
}
source {
  FtpFile {
    host = "192.168.31.48"
    port = 21
    user = tyrantlucifer
    password = tianchao
    path = "/seatunnel/read/binary/"
    file_format_type = "binary"
  }
}
sink {
  // The sink does not have to match the source; files can also be written to S3, HDFS, OSS, etc.
  FtpFile {
    host = "192.168.31.48"
    port = 21
    user = tyrantlucifer
    password = tianchao
    path = "/seatunnel/read/binary2/"
    file_format_type = "binary"
  }
}
```
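
The source and sink do not have to be the same connector. As a minimal sketch (not part of the commit), assuming an FTP server reachable with placeholder credentials and a hypothetical local target directory `/tmp/seatunnel/binary/`, a job that pulls files from FTP onto the local file system might look like this:

```hocon
env {
  parallelism = 1
  job.mode = "BATCH"
}
source {
  FtpFile {
    // Placeholder connection details; replace with your own.
    host = "192.168.31.48"
    port = 21
    user = "your_user"
    password = "your_password"
    path = "/seatunnel/read/binary/"
    file_format_type = "binary"
  }
}
sink {
  LocalFile {
    // Hypothetical target directory; adjust to your environment.
    path = "/tmp/seatunnel/binary/"
    // Must also be "binary", matching the source.
    file_format_type = "binary"
  }
}
```

Both sides set `file_format_type = "binary"`, which is the requirement stated in the `binary` description above.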

## Changelog

### 2.2.0-beta 2022-09-26
3 changes: 2 additions & 1 deletion docs/en/connector-v2/source/HdfsFile.md
@@ -27,6 +27,7 @@ Read all the data in a split in a pollNext call. What splits are read will be sa
- [x] json
- [x] excel
- [x] xml
- [x] binary

## Description

@@ -43,7 +44,7 @@ Read data from hdfs file system.
| Name | Type | Required | Default | Description |
|---------------------------|---------|----------|---------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| path | string | yes | - | The source file path. |
| file_format_type | string | yes | - | We supported as the following file types:`text` `json` `csv` `orc` `parquet` `excel` `xml`.Please note that, The final file name will end with the file_format's suffix, the suffix of the text file is `txt`. |
| file_format_type | string | yes | - | We supported as the following file types:`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`.Please note that, The final file name will end with the file_format's suffix, the suffix of the text file is `txt`. |
| fs.defaultFS | string | yes | - | The hadoop cluster address that start with `hdfs://`, for example: `hdfs://hadoopcluster` |
| read_columns | list | yes | - | The read column list of the data source, user can use it to implement field projection.The file type supported column projection as the following shown:[text,json,csv,orc,parquet,excel,xml].Tips: If the user wants to use this feature when reading `text` `json` `csv` files, the schema option must be configured. |
| hdfs_site_path | string | no | - | The path of `hdfs-site.xml`, used to load ha configuration of namenodes |
33 changes: 32 additions & 1 deletion docs/en/connector-v2/source/LocalFile.md
@@ -27,6 +27,7 @@ Read all the data in a split in a pollNext call. What splits are read will be sa
- [x] json
- [x] excel
- [x] xml
- [x] binary

## Description

@@ -71,7 +72,7 @@ The source file path.

File type. The following file types are supported:

`text` `csv` `parquet` `orc` `json` `excel` `xml`
`text` `csv` `parquet` `orc` `json` `excel` `xml` `binary`

If you set the file type to `json`, you should also set the `schema` option to tell the connector how to parse the data into the rows you want.

@@ -155,6 +156,11 @@ connector will generate data as the following:
|---------------|-----|--------|
| tyrantlucifer | 26 | male |

If you set the file type to `binary`, SeaTunnel can synchronize files in any format,
such as compressed archives and images. In short, any file can be copied to the target location.
For this to work, both the source and the sink must set `file_format_type` to `binary`.
You can find the specific usage in the example below.

### read_columns [list]

The read column list of the data source, user can use it to implement field projection.
@@ -363,6 +369,31 @@ LocalFile {
```

### Transfer Binary File

```hocon
env {
  parallelism = 1
  job.mode = "BATCH"
}
source {
  LocalFile {
    path = "/seatunnel/read/binary/"
    file_format_type = "binary"
  }
}
sink {
  // you can transfer local file to s3/hdfs/oss etc.
  LocalFile {
    path = "/seatunnel/read/binary2/"
    file_format_type = "binary"
  }
}
```
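
The comment in the sink block notes that local files can also be sent to other storage systems. As a hedged sketch (not part of the commit), assuming an HDFS cluster reachable at the example address `hdfs://hadoopcluster` and a hypothetical target directory, a local-to-HDFS binary transfer might be configured like this:

```hocon
env {
  parallelism = 1
  job.mode = "BATCH"
}
source {
  LocalFile {
    path = "/seatunnel/read/binary/"
    file_format_type = "binary"
  }
}
sink {
  HdfsFile {
    // Assumed cluster address; replace with your own.
    fs.defaultFS = "hdfs://hadoopcluster"
    // Hypothetical target directory on HDFS.
    path = "/seatunnel/write/binary/"
    // Must also be "binary", matching the source.
    file_format_type = "binary"
  }
}
```

Only the sink block changes; the source stays the same as in the local-to-local example.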

## Changelog

### 2.2.0-beta 2022-09-26