[#5019] feat: (hadoop-catalog): Add a framework to support multi-storage in a pluginized manner for fileset catalog #5020
base: main
Conversation
@xiaozcy
build.gradle.kts
@@ -744,7 +744,8 @@ tasks {
if (!it.name.startsWith("catalog") &&
  !it.name.startsWith("authorization") &&
  !it.name.startsWith("client") && !it.name.startsWith("filesystem") && !it.name.startsWith("spark") && !it.name.startsWith("iceberg") && it.name != "trino-connector" &&
  it.name != "integration-test" && it.name != "hive-metastore-common" && !it.name.startsWith("flink") &&
  it.name != "hadoop-gcs-bundle" && it.name != "hadoop-s3-bundle"
Better to rename these to "gcs-bundle" and "aws-bundle"; the bundles are not only about storage, they also include security and other things.
@Override
public Configuration getConfiguration(Map<String, String> conf) {
  Configuration configuration = new Configuration();
  conf.forEach(configuration::set);
Why do you pass all of the configurations into the Hadoop conf? I think it's enough to pass in only the configurations that start with gravitino.bypass in the default implementation.
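The prefix filtering suggested here can be sketched in plain Java. This is a hypothetical, self-contained version (the class name and the exact prefix constant are assumptions, not Gravitino's actual code): it keeps only entries whose keys start with the bypass prefix, strips the prefix, and returns a map whose keys can then be set directly on a Hadoop Configuration.

```java
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper sketching the gravitino.bypass filtering discussed above.
public class BypassConfigFilter {

  // Assumed prefix; the review thread refers to it as "gravitino.bypass".
  private static final String BYPASS_PREFIX = "gravitino.bypass.";

  /**
   * Keeps only entries whose keys start with the bypass prefix and strips the
   * prefix, so the remaining keys can be applied to a Hadoop Configuration.
   */
  public static Map<String, String> extractBypassConfigs(Map<String, String> conf) {
    return conf.entrySet().stream()
        .filter(e -> e.getKey().startsWith(BYPASS_PREFIX))
        .collect(Collectors.toMap(
            e -> e.getKey().substring(BYPASS_PREFIX.length()),
            Map.Entry::getValue));
  }
}
```

Keys without the prefix (catalog-level properties, for example) are dropped rather than leaked into the Hadoop configuration, which is the behavior the reviewer is asking for.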
It seems that the current implementation is putting in all configurations, including keys without the prefix gravitino.bypass.
I don't think that's appropriate; it might be a mistake from before. Can you confirm this?
Yeah, I have changed the logic so that only entries with the prefix gravitino.bypass are passed to the Hadoop configuration; it's indeed more reasonable.
I found that line 137 of HadoopCatalogOperations still passes all of the configuration into the Hadoop conf; this should be removed as well.
@@ -132,6 +132,20 @@ public void initialize(
        Map.Entry::getValue));
    bypassConfigs.forEach(hadoopConf::set);
The initialization of the Hadoop conf here seems to be useless. It's better to move this code into the DefaultConfigurationProvider.
Yeah, you are right, I will do it according to your suggestion.
 * @return The FileSystem instance.
 * @throws IOException If the FileSystem instance cannot be created.
 */
FileSystem getFileSystem(Configuration configuration, Path path) throws IOException;
@jerryshao
Please take a look at this PR, thanks.
Why do we need to provide a Path here?
I noticed that most of the methods that obtain a FileSystem in https://github.com/apache/hadoop/blob/a9b7913d5689cae804fbd072a03cd2d2f771ab6a/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L536-L571 support specifying the path explicitly, so I added it.
Indeed, we could also use the fs.defaultFS configuration instead. If you feel the parameter is excessive, I can remove it.
I think the current interface design depends heavily on Hadoop's implementation classes, like Configuration and Path, which is not a good design; it also loses the flexibility to support some non-Hadoop filesystems. A good interface should rely on third-party libraries as little as possible.
I would be inclined to rely only on FileSystem, like:
interface FileSystemProvider {
  FileSystem getFileSystem(Map<String, String> conf);
  String scheme();
}
Using "conf" should be enough to create the filesystem you want.
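The main payoff of adding a scheme() method is that the catalog can pick a provider by the scheme of a fileset location. The following is a hypothetical, Hadoop-free sketch of that lookup; the registry class, method names, and the String return type standing in for org.apache.hadoop.fs.FileSystem are all illustrative assumptions, not Gravitino's actual API.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for the suggested interface; in the real proposal
// getFileSystem would return org.apache.hadoop.fs.FileSystem.
interface FileSystemProvider {
  String getFileSystem(Map<String, String> conf);

  String scheme();
}

// Hypothetical registry showing how providers could be selected by scheme.
class FileSystemProviderRegistry {
  private final Map<String, FileSystemProvider> providers = new HashMap<>();

  void register(FileSystemProvider provider) {
    providers.put(provider.scheme(), provider);
  }

  // Pick the provider whose scheme matches the fileset location.
  FileSystemProvider lookup(String location) {
    String scheme = URI.create(location).getScheme();
    FileSystemProvider provider = providers.get(scheme);
    if (provider == null) {
      throw new IllegalArgumentException("No provider for scheme: " + scheme);
    }
    return provider;
  }
}
```

With this shape, adding S3 or GCS support later is just registering another provider for "s3a" or "gs", without changing the catalog code.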
public String getScheme() {
  return "s3a";
}
}
Can you please remove the GCS and S3 related code? We can add it later on.
I see.
@@ -742,4 +778,27 @@ private boolean checkSingleFile(Fileset fileset) {
      fileset.name());
  }
}

static FileSystem getFileSystem(Path path, Configuration conf) throws IOException {
I don't think we have to use a Path to create a FileSystem; the Configuration itself should be enough.
public class HDFSFileSystemProvider implements FileSystemProvider {

  @Override
  public FileSystem getFileSystem(Configuration configuration, Path path) throws IOException {
@jerryshao
Can you please help me verify whether this is an acceptable way to initialize an HDFS file system? There is little I can do differently here compared to the default method.
 *
 * @return The scheme of this FileSystem provider.
 */
default String getScheme() {
It doesn't make sense to return "file" as the default, as there's no default filesystem implementation.
Yeah, I removed the default implementation.
public FileSystem getFileSystem(Map<String, String> config) throws IOException {
  Configuration configuration = new Configuration();
  config.forEach(configuration::set);
  return FileSystem.get(FileSystem.getDefaultUri(configuration), configuration);
I don't think this will return a local filesystem if the default filesystem setting is not the local filesystem.
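The pitfall being pointed out is that FileSystem.get(FileSystem.getDefaultUri(configuration), configuration) resolves to whatever fs.defaultFS is set to, which on a real cluster is typically HDFS rather than the local filesystem. A hypothetical, Hadoop-free sketch of that resolution logic (the class and method names are illustrative assumptions):

```java
import java.net.URI;
import java.util.Map;

// Hypothetical helper illustrating why relying on the default URI is fragile:
// a scheme-less path resolves against fs.defaultFS, not against "file".
public class SchemeResolver {

  static final String DEFAULT_FS_KEY = "fs.defaultFS"; // standard Hadoop key

  /** Returns the scheme a path would resolve to under the given configuration. */
  public static String effectiveScheme(String path, Map<String, String> conf) {
    String scheme = URI.create(path).getScheme();
    if (scheme != null) {
      return scheme;
    }
    // No explicit scheme: fall back to fs.defaultFS, defaulting to local only
    // when the key is entirely absent (mirroring Hadoop's core-default value).
    String defaultFs = conf.getOrDefault(DEFAULT_FS_KEY, "file:///");
    return URI.create(defaultFs).getScheme();
  }
}
```

So a provider meant to always return the local filesystem should request it explicitly (for example via Hadoop's FileSystem.getLocal) instead of going through the default URI.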
I'm OK with the interface design. Please polish the code to make it ready for review, and also refactor the client-side GravitinoVirtualFileSystem to leverage this interface so the fs is pluggable.
I will use another PR to refactor the Java client for the fileset.
@jerryshao
What changes were proposed in this pull request?
Add a framework to support multiple storage systems within the Hadoop catalog.
Why are the changes needed?
Some users want Gravitino to manage file systems like S3 or GCS.
Fix: #5019
Does this PR introduce any user-facing change?
N/A.
How was this patch tested?
Existing tests.