[#4984] improvement(core, doris): Add the random distribution strategy #4985

Open
wants to merge 1 commit into
base: main
@@ -62,6 +62,17 @@ public static Distribution hash(int number, Expression... expressions) {
return new DistributionImpl(Strategy.HASH, number, expressions);
}

/**
* Create a distribution by randomly distributing the data across the number of buckets.
*
* @param number The number of buckets
* @param expressions The expressions to distribute by
* @return The created random distribution
*/
public static Distribution random(int number, Expression... expressions) {
return new DistributionImpl(Strategy.RANDOM, number, expressions);
}

/**
* Create a distribution by the given strategy.
*
@@ -50,7 +50,10 @@ public enum Strategy {
RANGE,

/** Distributes data evenly across partitions. */
- EVEN;
+ EVEN,
+
+ /** Distributes data randomly across partitions or table. */
+ RANDOM;

Contributor

What is the difference between EVEN and RANDOM?

AFAIK, RANDOM is one way of implementing EVEN.

Contributor Author

Both aim to balance the distribution of data to optimize performance. "Random" puts more emphasis on the randomness of where the data is placed, while "Even" focuses on keeping the distribution uniform.

They are slightly different.

Contributor Author

@FANNG1, do you have any comments on this issue?

Contributor

IMO, HASH, RANGE, and RANDOM are implementations of how we do the distribution, while EVEN is more like a distribution result. Both HASH and RANDOM are implementations of EVEN.
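
To make the distinction in this thread concrete, here is a minimal, self-contained sketch (not Gravitino or Doris code). It assumes a toy model in which HASH picks a bucket from the distribution column value and RANDOM picks a bucket without looking at the row at all. The class and method names (`BucketAssignmentSketch`, `hashBucket`, `randomBucket`, `histogram`) are hypothetical and exist only for illustration.

```java
import java.util.Arrays;
import java.util.Random;

public class BucketAssignmentSketch {

  /** HASH: the bucket is a deterministic function of the distribution column value. */
  static int hashBucket(Object key, int numBuckets) {
    // floorMod keeps the result non-negative even for negative hash codes.
    return Math.floorMod(key.hashCode(), numBuckets);
  }

  /** RANDOM: the bucket is chosen without looking at the row's content. */
  static int randomBucket(Random rng, int numBuckets) {
    return rng.nextInt(numBuckets);
  }

  /** Counts how many of the integer keys 0..rows-1 land in each bucket. */
  static int[] histogram(int rows, int numBuckets, boolean useHash, long seed) {
    Random rng = new Random(seed);
    int[] counts = new int[numBuckets];
    for (int key = 0; key < rows; key++) {
      int bucket = useHash ? hashBucket(key, numBuckets) : randomBucket(rng, numBuckets);
      counts[bucket]++;
    }
    return counts;
  }

  public static void main(String[] args) {
    // Both strategies spread rows over all buckets; only HASH can later
    // locate a row's bucket from its key alone.
    System.out.println("HASH:   " + Arrays.toString(histogram(10_000, 4, true, 42L)));
    System.out.println("RANDOM: " + Arrays.toString(histogram(10_000, 4, false, 42L)));
  }
}
```

Under this toy model both strategies end up with a roughly uniform spread, which is the sense in which they could be called implementations of an "even" result. Whether EVEN should stay as a separate strategy name is the open question in this thread, and the sketch does not settle it.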

Contributor

So is the recommendation to remove EVEN? I remember that @yuqi1129 did some research and found that a certain kind of table uses EVEN as a distribution name.

Contributor

@yuqi1129, do you remember which kind of table uses EVEN distribution? Could it be replaced by round-robin or random?

Contributor Author

@jerryshao any thoughts on this point? #4991 depends on this one.

Contributor

You guys have more background on this. You can have an offline discussion and work out a solution.

Contributor Author

We have not reached an agreement so far, so I postponed it to 0.7.0 since it's not a bugfix.


/**
* Get the distribution strategy by name.
@@ -150,7 +150,7 @@ protected String generateCreateTableSql(
.map(column -> BACK_QUOTE + column.toString() + BACK_QUOTE)
.collect(Collectors.joining(", ")));
sqlBuilder.append(")");
- } else if (distribution.strategy() == Strategy.EVEN) {
+ } else if (distribution.strategy() == Strategy.RANDOM) {
sqlBuilder.append(NEW_LINE).append(" DISTRIBUTED BY ").append("RANDOM");
}
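
For readers who want to see the effect of this hunk in isolation, the following is a minimal standalone sketch of the clause it produces. It is not the actual `generateCreateTableSql` implementation: the class `DorisDistributionClauseSketch`, the method `distributedBy`, and the local `Strategy` enum are hypothetical, and the trailing `BUCKETS n` part is an assumption based on Doris DDL syntax, since the diff does not show how the bucket count is appended.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class DorisDistributionClauseSketch {

  /** Local stand-in for the Strategy enum touched by this PR. */
  enum Strategy { HASH, RANGE, EVEN, RANDOM }

  /** Builds the DISTRIBUTED BY clause for a Doris CREATE TABLE statement. */
  static String distributedBy(Strategy strategy, List<String> columns, int buckets) {
    // Same rule and message as the updated validateDistribution check in this PR.
    if (strategy != Strategy.HASH && strategy != Strategy.RANDOM) {
      throw new IllegalArgumentException(
          "Doris only supports HASH or RANDOM distribution strategy");
    }
    StringBuilder sql = new StringBuilder(" DISTRIBUTED BY ");
    if (strategy == Strategy.HASH) {
      // HASH needs the back-quoted distribution columns.
      sql.append("HASH(")
          .append(columns.stream().map(c -> "`" + c + "`").collect(Collectors.joining(", ")))
          .append(")");
    } else {
      // RANDOM takes no columns, so any expressions passed in are ignored here.
      sql.append("RANDOM");
    }
    // Assumption: the bucket count is rendered as "BUCKETS n".
    return sql.append(" BUCKETS ").append(buckets).toString();
  }

  public static void main(String[] args) {
    System.out.println(distributedBy(Strategy.HASH, Arrays.asList("col_1"), 2));
    System.out.println(distributedBy(Strategy.RANDOM, Collections.emptyList(), 2));
  }
}
```

One thing the sketch makes visible: the RANDOM branch never reads the column list, so the `expressions` argument accepted by `Distributions.random(...)` has no effect on the generated SQL in this model.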

@@ -220,8 +220,8 @@ private static void validateDistribution(Distribution distribution, JdbcColumn[]
Preconditions.checkArgument(null != distribution, "Doris must set distribution");

Preconditions.checkArgument(
- Strategy.HASH == distribution.strategy() || Strategy.EVEN == distribution.strategy(),
- "Doris only supports HASH or EVEN distribution strategy");
+ Strategy.HASH == distribution.strategy() || Strategy.RANDOM == distribution.strategy(),
+ "Doris only supports HASH or RANDOM distribution strategy");

if (distribution.strategy() == Strategy.HASH) {
// Check if the distribution column exists
@@ -94,6 +94,43 @@ private static Map<String, String> createProperties() {
return properties;
}

@Test
void testRandomDistribution() {
String tableName = GravitinoITUtils.genRandomName("doris_basic_test_table");
String tableComment = "test_comment";
List<JdbcColumn> columns = new ArrayList<>();
JdbcColumn col_1 =
JdbcColumn.builder().withName("col_1").withType(INT).withComment("id").build();
columns.add(col_1);
JdbcColumn col_2 =
JdbcColumn.builder().withName("col_2").withType(VARCHAR_255).withComment("col_2").build();
columns.add(col_2);
JdbcColumn col_3 =
JdbcColumn.builder().withName("col_3").withType(VARCHAR_255).withComment("col_3").build();
columns.add(col_3);
Map<String, String> properties = new HashMap<>();

Distribution distribution =
Distributions.random(DEFAULT_BUCKET_SIZE, NamedReference.field("col_1"));
Index[] indexes = new Index[] {};

// create table
TABLE_OPERATIONS.create(
databaseName,
tableName,
columns.toArray(new JdbcColumn[0]),
tableComment,
createProperties(),
null,
distribution,
indexes);
List<String> listTables = TABLE_OPERATIONS.listTables(databaseName);
assertTrue(listTables.contains(tableName));
JdbcTable load = TABLE_OPERATIONS.load(databaseName, tableName);
assertionsTableInfo(
tableName, tableComment, columns, properties, indexes, Transforms.EMPTY_TRANSFORM, load);
}

@Test
public void testBasicTableOperation() {
String tableName = GravitinoITUtils.genRandomName("doris_basic_test_table");