Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add Read Support for Presto-MaxCompute Connector #4

Open
wants to merge 26 commits into
base: main
Choose a base branch
from

Conversation

dingxin-tech
Copy link

This PR introduces the read support for the Presto-MaxCompute Connector, enabling users to perform queries on tables within Alibaba's MaxCompute.
While the current state of the connector does not support write operations, DDL operations, or querying partitioned tables, it is a significant step forward in integrating MaxCompute with Presto.

Key features included in this PR:

  • Read support for MaxCompute standard tables.
  • Configuration setup to connect Presto with MaxCompute projects.
  • Documentation updates for setting up and using the connector.

The following are yet to be implemented in future PRs:

  • The ability to query partitioned tables.
  • Write support for MaxCompute tables.
  • Support for DDL operations within MaxCompute from Presto.

@tdcmeehan tdcmeehan self-assigned this May 9, 2024
@tdcmeehan
Copy link
Contributor

@dingxin-tech thank you for your contribution. Regarding testing: can you describe how to add tests for this plugin? Is it possible to add integration tests that run against a mocked version of MaxCompute, or through some other means?

@dingxin-tech
Copy link
Author

@tdcmeehan I have no idea how to add integration tests since MaxCompute currently does not provide external testing accounts. Do you know how other similar databases handle this? Maybe I should implement something like a MaxCompute emulator similar to the BigQuery emulator?

@tdcmeehan
Copy link
Contributor

@dingxin-tech ideally, Alibaba would give us a testing token to just run some minimal workload. If you could create an emulator, that would be awesome.

@dingxin-tech
Copy link
Author

@dingxin-tech ideally, Alibaba would give us a testing token to just run some minimal workload. If you could create an emulator, that would be awesome.

Yes, I am currently working on the development of the MaxCompute-Emulator, and I am pleased to report that we have made some significant incremental progress. With the current pace, I anticipate that we will be able to complete support for Presto regression within the next two to three weeks.

@tdcmeehan
Copy link
Contributor

@dingxin-tech I see you've added an emulator, is this ready to review?

@dingxin-tech
Copy link
Author

@dingxin-tech I see you've added an emulator, is this ready to review?

Thank you for your attention.

Yes, I believe this PR is ready for review.

Although some critical features (such as reading the partition table, writing to the table) are not supported yet (in fact, the emulator also does not support these functionalities at the moment), I believe the current functionalities form a good initial whole for the first commit in this repository. These new features can be added in subsequent PRs to further improve the connector.

Copy link
Contributor

@tdcmeehan tdcmeehan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really great work. Can you please add integration style tests and more unit tests for the various utilities? For example, I think we should have decent unit test coverage over the Arrow to Page conversions. I also think we can add a dependency on presto-tests, and run AbstractTestIntegrationSmokeTest using your new query runner. Can you give it a try?

dingxin-tech and others added 2 commits July 9, 2024 19:51
Co-authored-by: Timothy Meehan <[email protected]>
…rated smoke testing

Update dependencies to support the new version of the MaxCompute simulator image, and introduce new test classes to verify basic integration behavior. At the same time, Maven configuration is enhanced to better support the Presto plug-in development process.
@dingxin-tech
Copy link
Author

I have made changes based on your suggestions:

  1. I removed the configuration from META-INF and added a dependency on the presto-maven-plugin.
  2. I upgraded the emulator and added TestIntegrationSmokeTest.
  3. Regarding the additional tests for ArrowToPage that you mentioned, I currently have no idea on how to determine if a Page meets expectations, but I believe the SmokeTest has already covered that part of the cases.

Could you please review it again? @tdcmeehan

@tdcmeehan
Copy link
Contributor

Can you please add a Github action to trigger the tests to run in CI? See here for reference: https://github.com/prestodb/airlift/pull/77/files

Copy link

@elharo elharo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is our current thinking about adding new connectors to the core repo as opposed to third party repos? Is there anything preventing this from being a separate project?

pom.xml Outdated
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>26.0-jre</version>
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can push this to the current version?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure.

@@ -0,0 +1,222 @@
<?xml version="1.0" encoding="UTF-8"?>
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this file belongs in this PR

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should add it ?

}
return new SimpleStruct((StructTypeInfo) structTypeInfo, values);
}
catch (Exception e) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you be more specific about the exception?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I realized that this method doesn't actually throw a checked exception, so I've removed the try-catch block. I'm not sure why I implemented it that way :)


private CommonUtils() {}

public static String serialize(Serializable object)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is object serialization used at all? It's mostly avoided in java these days as it's been a common source of security issues

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The serialize and deserialize methods are specifically utilized for serializing and deserializing TableBatchReadSession objects. This is particularly important for MaxComputeSplit, where JSON format is employed for message transmission within Presto. Given that TableBatchReadSession contains numerous variables and has been well-implemented with the Serializable interface, I believe utilizing these methods is a sound approach.

throw new IOException("Failed after retries", e);
}
try {
LOG.warn("Retrying after exception: " + e.getMessage(), e);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this actionable? Will anyone read it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method was actually not being used, so I've removed it.

@tdcmeehan
Copy link
Contributor

What is our current thinking about adding new connectors to the core repo as opposed to third party repos? Is there anything preventing this from being a separate project?

There's nothing preventing this from being a separate project, however we would like to move more of our modules in presto to third party modules, and given that MaxCompute is a widely used and publicly available offering, I'm OK adding this to the Presto org. I also think it will be a good model for other connectors to follow.

Copy link

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the documentation! I commented with several nits about phrasing, and a couple of suggestions about formatting. If any of my suggestions don't work for you, I'm happy to discuss them if you like and explain my reasoning.

Also, I suggest you add a new page to the Presto documentation for the new connector, by creating a new file in https://github.com/prestodb/presto/tree/master/presto-docs/src/main/sphinx/connector and adding that new file to the Connectors index file.

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
- `analytics.properties` for the analysis project.
With this setup, `sales` and `analytics` directories will be created in Presto respectively.

### Data Type Mapping

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Data Type Mapping section would be ideal to move into the connector documentation for this new connector, in a new file in https://github.com/prestodb/presto/tree/master/presto-docs/src/main/sphinx/connector.

dingxin-tech and others added 8 commits September 27, 2024 13:56
Co-authored-by: Steve Burnett <[email protected]>
Co-authored-by: Steve Burnett <[email protected]>
Co-authored-by: Steve Burnett <[email protected]>
Co-authored-by: Steve Burnett <[email protected]>
Co-authored-by: Steve Burnett <[email protected]>
Co-authored-by: Steve Burnett <[email protected]>
Co-authored-by: Steve Burnett <[email protected]>
@dingxin-tech
Copy link
Author

Thanks for the documentation! I commented with several nits about phrasing, and a couple of suggestions about formatting. If any of my suggestions don't work for you, I'm happy to discuss them if you like and explain my reasoning.

Also, I suggest you add a new page to the Presto documentation for the new connector, by creating a new file in https://github.com/prestodb/presto/tree/master/presto-docs/src/main/sphinx/connector and adding that new file to the Connectors index file.

Thank you for your input; all of your suggestions have been incorporated into the changes. I agree that it would be beneficial to add a link to the MaxCompute connector in the Presto documentation repository. However, this should be done in a separate PR, potentially initiated after this current PR has been merged.

@dingxin-tech
Copy link
Author

Really great work. Can you please add integration style tests and more unit tests for the various utilities? For example, I think we should have decent unit test coverage over the Arrow to Page conversions. I also think we can add a dependency on presto-tests, and run AbstractTestIntegrationSmokeTest using your new query runner. Can you give it a try?

Sure, I've added. :)

@steveburnett
Copy link

Thank you for your input; all of your suggestions have been incorporated into the changes. I agree that it would be beneficial to add a link to the MaxCompute connector in the Presto documentation repository. However, this should be done in a separate PR, potentially initiated after this current PR has been merged.

Thank you! Some of my comments may have been hidden by GitHub. When you have time, would you please recheck? They may be hidden under "8 hidden conversations".
Screenshot 2024-09-27 at 11 15 31 AM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants