
[#5019] feat: (hadoop-catalog): Add a framework to support multi-storage in a pluginized manner for fileset catalog #5020

Open
wants to merge 14 commits into main

Conversation

yuqi1129
Contributor

What changes were proposed in this pull request?

Add a framework to support multiple storage systems within the Hadoop catalog.

Why are the changes needed?

Some users want Gravitino to manage file systems like S3 or GCS.

Fix: #5019

Does this PR introduce any user-facing change?

N/A.

How was this patch tested?

Existing tests.

@yuqi1129
Contributor Author

@xiaozcy
Please take a look and check whether this PR meets your requirements, thanks.

build.gradle.kts Outdated
@@ -744,7 +744,8 @@ tasks {
     if (!it.name.startsWith("catalog") &&
       !it.name.startsWith("authorization") &&
       !it.name.startsWith("client") && !it.name.startsWith("filesystem") && !it.name.startsWith("spark") && !it.name.startsWith("iceberg") && it.name != "trino-connector" &&
-      it.name != "integration-test" && it.name != "hive-metastore-common" && !it.name.startsWith("flink")
+      it.name != "integration-test" && it.name != "hive-metastore-common" && !it.name.startsWith("flink") &&
+      it.name != "hadoop-gcs-bundle" && it.name != "hadoop-s3-bundle"
Contributor

Better to rename these to "gcs-bundle" and "aws-bundle"; the bundles may not only be about storage, they could also include security and other things.

@Override
public Configuration getConfiguration(Map<String, String> conf) {
  Configuration configuration = new Configuration();
  conf.forEach(configuration::set);
Contributor

Why do you pass all of the configurations into the Hadoop conf? I think it's enough to pass only the configurations that start with gravitino.bypass in the default implementation.
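The filtering the reviewer suggests could be sketched with plain Java collections. The `gravitino.bypass.` prefix comes from the discussion above; the class and method names here are hypothetical, not part of the PR:

```java
import java.util.HashMap;
import java.util.Map;

public class BypassConfigFilter {

  static final String BYPASS_PREFIX = "gravitino.bypass.";

  // Keep only entries whose key starts with the bypass prefix,
  // stripping the prefix so the remainder is a plain Hadoop key.
  static Map<String, String> filterBypassConfigs(Map<String, String> conf) {
    Map<String, String> result = new HashMap<>();
    conf.forEach((key, value) -> {
      if (key.startsWith(BYPASS_PREFIX)) {
        result.put(key.substring(BYPASS_PREFIX.length()), value);
      }
    });
    return result;
  }
}
```

A key like `gravitino.bypass.fs.defaultFS` would thus reach the Hadoop configuration as `fs.defaultFS`, while unprefixed keys are dropped.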

Contributor Author

It seems that the current implementation puts in all configurations, including keys without the gravitino.bypass prefix.

Contributor

I don't think that's appropriate; it might be an earlier mistake. Can you confirm this?

Contributor Author

Yeah, I have changed the logic so that only keys with the gravitino.bypass prefix are passed to the Hadoop configuration; it's indeed more reasonable.

Contributor

I found that line 137 of HadoopCatalogOperations still passes all the configurations into the Hadoop conf; this should be removed as well.

@@ -132,6 +132,20 @@ public void initialize(
Map.Entry::getValue));
bypassConfigs.forEach(hadoopConf::set);
Contributor

The initialization of the Hadoop conf here seems to be useless. It's better to move this code into the DefaultConfigurationProvider.

Contributor Author

Yeah, you are right; I will do it as you suggested.

* @return The FileSystem instance.
* @throws IOException If the FileSystem instance cannot be created.
*/
FileSystem getFileSystem(Configuration configuration, Path path) throws IOException;
Contributor Author

@jerryshao
Please take a look at this PR, thanks.

Contributor

Why do we need to provide a Path here?

Contributor Author

I noticed that most of the methods that create a FileSystem in https://github.com/apache/hadoop/blob/a9b7913d5689cae804fbd072a03cd2d2f771ab6a/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L536-L571 support specifying the path explicitly, so I added it.

Indeed, we could also use the fs.defaultFS configuration instead. If you feel the parameter is unnecessary, I can remove it.

Contributor

I think the current interface design depends heavily on Hadoop's implementation classes, such as Configuration and Path, which is not a good design and also loses the flexibility to support non-Hadoop file systems; a good interface should rely on third-party libraries as little as possible.

I would be inclined to only rely on FileSystem, like:

interface FileSystemProvider {

  FileSystem getFileSystem(Map<String, String> conf);

  String scheme();
}

Using "conf" should be enough to create the filesystem you want.
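The pluggable lookup implied by `scheme()` could be sketched as a registry keyed by URI scheme. This is illustrative only: `FileSystemProvider` below is a stand-in with plain Java types (returning nothing Hadoop-specific), and the `ProviderRegistry` class is hypothetical, not code from the PR:

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class ProviderRegistry {

  // Stand-in for the proposed interface; a real provider would also
  // expose getFileSystem(Map<String, String> conf).
  interface FileSystemProvider {
    String scheme();
  }

  private final Map<String, FileSystemProvider> providers = new HashMap<>();

  void register(FileSystemProvider provider) {
    providers.put(provider.scheme(), provider);
  }

  // Pick the provider whose scheme matches the path's URI scheme,
  // treating scheme-less paths as "file".
  FileSystemProvider lookup(String path) {
    String scheme = URI.create(path).getScheme();
    FileSystemProvider provider = providers.get(scheme == null ? "file" : scheme);
    if (provider == null) {
      throw new IllegalArgumentException("No provider for scheme: " + scheme);
    }
    return provider;
  }
}
```

With such a registry, adding a new storage backend would only mean registering another provider under its scheme (e.g. "s3a" or "gs"), which matches the pluginized design this PR aims for.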

  public String getScheme() {
    return "s3a";
  }
}
Contributor

Can you please remove the gcs and s3 related code? We can add it later on.

Contributor Author

I see.

@@ -742,4 +778,27 @@ private boolean checkSingleFile(Fileset fileset) {
fileset.name());
}
}

static FileSystem getFileSystem(Path path, Configuration conf) throws IOException {
Contributor

I don't think we need a Path to create a FileSystem; the Configuration itself should be enough.

public class HDFSFileSystemProvider implements FileSystemProvider {

  @Override
  public FileSystem getFileSystem(Configuration configuration, Path path) throws IOException {
Contributor Author

@jerryshao
Can you please help me verify whether this is an acceptable way to initialize an HDFS file system? There is little I can change here compared to the default method.

*
* @return The scheme of this FileSystem provider.
*/
default String getScheme() {
Contributor

It doesn't make sense to return "file" as default, as there's no default filesystem implementation.

Contributor Author

Yeah, I removed the default implementation.


public FileSystem getFileSystem(Map<String, String> config) throws IOException {
  Configuration configuration = new Configuration();
  config.forEach(configuration::set);
  return FileSystem.get(FileSystem.getDefaultUri(configuration), configuration);
Contributor

I don't think this will return a local filesystem if fs.defaultFS is not set to the local filesystem.
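One way to surface this kind of mismatch early, regardless of what fs.defaultFS points at, is to validate the requested path's scheme against the scheme the provider declares before creating anything. A minimal sketch using only `java.net.URI`; the class and method names are hypothetical, not part of the PR:

```java
import java.net.URI;

public class SchemeCheck {

  // Fail fast if the path's scheme doesn't match what this provider serves,
  // instead of silently falling back to whatever fs.defaultFS resolves to.
  static void requireScheme(String expectedScheme, String path) {
    String actual = URI.create(path).getScheme();
    if (actual == null) {
      actual = "file"; // treat scheme-less paths as local
    }
    if (!expectedScheme.equals(actual)) {
      throw new IllegalArgumentException(
          "Expected scheme '" + expectedScheme + "' but got '" + actual + "' for " + path);
    }
  }
}
```

A local-filesystem provider would then reject an hdfs:// path outright rather than returning whichever filesystem the default URI happens to resolve to.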

@jerryshao
Contributor

I'm OK with the interface design. Please polish the code to make it ready for review, and also refactor the client-side GravitinoVirtualFileSystem to leverage this interface so that the fs is pluggable.

@yuqi1129
Contributor Author

yuqi1129 commented Oct 9, 2024

refactor the client side GravitinoVirtualFileSystem to also leverage this interface to make the fs pluggable

I will refactor the Java client for the fileset in another PR.

@yuqi1129 yuqi1129 self-assigned this Oct 9, 2024
@yuqi1129
Contributor Author

yuqi1129 commented Oct 9, 2024

I'm OK with the interface design. Please polish the code to make it ready for review, and also refactor the client-side GravitinoVirtualFileSystem to leverage this interface so that the fs is pluggable.

@jerryshao
I have updated the PR; please help review it again, thanks.

Successfully merging this pull request may close these issues.

[FEATURE] Add a framework to support multi-storage in a pluginized manner for fileset catalog.