[Improvement] Handle the scenario where the metadata between underlying sources and Graviton is inconsistent #250

Closed
5 tasks done
jerryshao opened this issue Aug 17, 2023 · 0 comments

jerryshao commented Aug 17, 2023

What would you like to be improved?

Graviton is a federated metadata lake: it manages metadata from the underlying sources as well as additional metadata stored in Graviton itself. Users get the complete metadata by combining the two parts. This design can lead to several problems:

  1. When creating a metadata object, Graviton needs to write metadata to two places. If one of the writes fails, half-completed metadata is left in the system. Some sources also don't support rollback, so we cannot clean up the metadata after a failure.
  2. Loading a metadata object can hit the same inconsistency. The scenarios are:
    1. The metadata exists in the underlying source but not in Graviton. How do we handle this?
    2. The metadata exists in Graviton but not in the underlying source. How do we handle this?
  3. Altering a metadata object requires updating the metadata in both the underlying storage and Graviton itself, which makes the problem more complicated.
  4. Dropping a metadata object requires deleting the metadata in both the underlying source and Graviton, which may leave half-deleted metadata.

In short, because users can manipulate the metadata both through Graviton and directly in the underlying sources, inconsistency is unavoidable.

How should we improve?

How do we handle this inconsistency?

As mentioned above, inconsistency is unavoidable, whether it is caused by an operation failure, by operating on the different systems separately, or by starting from scratch.

Here, I think:

  1. As a prerequisite, we should treat one place as the single source of truth (SSOT) and the other as its complement. We cannot guarantee that both places are equally the SSOT.
  2. We should try our best to keep the metadata consistent in both places. Where we cannot, we need a mechanism to tolerate the inconsistency.
  3. We should semantically separate the operations on the underlying sources from our own operations.
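
A minimal sketch of what a tolerant load could look like, assuming the underlying source is chosen as the SSOT and the Graviton store is the complement. `TolerantLoadSketch` and `load` are hypothetical names used only for illustration; they are not Graviton's API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: underlying source = SSOT, Graviton store = complement. */
final class TolerantLoadSketch {

  /**
   * Combines both sides into one view of the metadata.
   * Returns empty when the object is missing from the SSOT.
   */
  static Optional<Map<String, String>> load(
      Optional<Map<String, String>> fromSource,  // metadata read from the underlying source
      Optional<Map<String, String>> fromStore) { // additional metadata kept by Graviton
    if (!fromSource.isPresent()) {
      // Exists only in the store (or nowhere): the SSOT says the object is gone.
      return Optional.empty();
    }
    Map<String, String> merged = new HashMap<>();
    // Add the complement first so that the SSOT wins on conflicting keys.
    fromStore.ifPresent(merged::putAll);
    merged.putAll(fromSource.get());
    return Optional.of(merged);
  }
}
```

Under this assumption, metadata that exists only in the underlying source is still returned (without the additional fields), while metadata that exists only in the store is treated as not existing.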

Further details will be posted here as the work progresses.

The related issues are:

Subtasks:

@jerryshao jerryshao added the improvement Improvements on everything label Aug 17, 2023
@jerryshao jerryshao added this to the Graviton v0.2.0 milestone Aug 17, 2023
@jerryshao jerryshao self-assigned this Aug 21, 2023
jerryshao added a commit that referenced this issue Sep 6, 2023
…rwriteable (#330)

### What changes were proposed in this pull request?

This PR proposes to change the requirements of the `AuditInfo` fields to
make them optional and overwriteable.

### Why are the changes needed?

This is the first change for #250. It addresses two problems:

1. If the `AuditInfo` does not exist in either the Graviton store or the
underlying source, we should support an empty `AuditInfo`, or one with
only some of its fields set.
2. If the `AuditInfo` is set in both the Graviton store and the underlying
source, we should support merging the two.
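
As an illustration only, an optional and mergeable audit record might be sketched as follows. `AuditSketch`, its four fields, and the "keep my own value, otherwise fall back" precedence are assumptions made for this example; the PR text does not specify the actual field set or merge order.

```java
import java.time.Instant;

/** Hypothetical stand-in for an audit record whose fields are all optional (nullable). */
final class AuditSketch {
  final String creator;
  final Instant createTime;
  final String lastModifier;
  final Instant lastModifiedTime;

  AuditSketch(String creator, Instant createTime, String lastModifier, Instant lastModifiedTime) {
    this.creator = creator;
    this.createTime = createTime;
    this.lastModifier = lastModifier;
    this.lastModifiedTime = lastModifiedTime;
  }

  /** An audit record with no field set, for the case where neither side has one. */
  static final AuditSketch EMPTY = new AuditSketch(null, null, null, null);

  /** Field-by-field merge: keep this record's value when set, otherwise use the fallback's. */
  AuditSketch mergeWith(AuditSketch fallback) {
    return new AuditSketch(
        creator != null ? creator : fallback.creator,
        createTime != null ? createTime : fallback.createTime,
        lastModifier != null ? lastModifier : fallback.lastModifier,
        lastModifiedTime != null ? lastModifiedTime : fallback.lastModifiedTime);
  }
}
```

For example, `fromSource.mergeWith(fromStore)` would prefer the source's values and fill any gaps from the store.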

Fix: #317

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Modified the existing UTs and added new ones.
jerryshao added a commit that referenced this issue Sep 8, 2023
… id for this field (#348)

### What changes were proposed in this pull request?

This PR proposes to make some changes to entity's ID field.

1. Assign a unique ID for each entity when creating.
2. Remove the parent ID field for catalog, schema and table entity
(since it is useless currently).

### Why are the changes needed?

This is a subtask of #250. With a unique ID assigned to each entity,
we can use this ID as a "record" or "watermark" that binds the
underlying source to the Graviton store, which helps guarantee the
SSOT of entities.
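
One possible reading of this binding, sketched under loud assumptions: the ID is stamped into the source-side properties at creation time and compared on load. The property key `example.entity-id`, the counter-based ID generator, and the class `IdBindingSketch` are all hypothetical and are not taken from this PR.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of using a unique ID as a "watermark" between source and store. */
final class IdBindingSketch {
  // Assumed property key; the real key (if any) is not given in the PR text.
  static final String ID_PROPERTY = "example.entity-id";

  // Simple process-local generator, used here only to keep the example deterministic.
  private static final AtomicLong NEXT_ID = new AtomicLong(1);

  static long newId() {
    return NEXT_ID.getAndIncrement();
  }

  /** Returns a copy of the source-side properties with the entity ID stamped in. */
  static Map<String, String> stamp(Map<String, String> sourceProps, long id) {
    Map<String, String> copy = new HashMap<>(sourceProps);
    copy.put(ID_PROPERTY, String.valueOf(id));
    return copy;
  }

  /** True only if the source-side properties carry the same ID as the stored entity. */
  static boolean isBound(Map<String, String> sourceProps, long storedId) {
    return String.valueOf(storedId).equals(sourceProps.get(ID_PROPERTY));
  }
}
```

With such a check, a store entry whose ID does not match the source object (for example, after the table was dropped and recreated directly in the source) could be recognized as stale instead of being merged.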

Fix: #168

### Does this PR introduce _any_ user-facing change?

1. This change removes the parent ID field from the proto definitions of
the catalog, schema, and table entities.

### How was this patch tested?

With the existing UTs.
xunliu pushed a commit that referenced this issue Sep 24, 2023
…guarantee SSOT (#403)

### What changes were proposed in this pull request?

This is the final piece of work for #250. This PR contains several major
refactorings:
1. Remove all the entity store operations from HiveCatalogOperation, so
that each CatalogOperation focuses only on its own logic.
2. Process all the additional metadata in CatalogOperationDispatcher,
which also guarantees the SSOT.
3. Refactor the BaseXXX classes (BaseTable, BaseSchema and BaseColumn) to
separate the metadata logic from the entity information.
4. Update the UTs to match the changes above.

### Why are the changes needed?

This PR brings several advantages:

1. There is no need to handle entity store operations in each catalog;
they are all unified in the core module.
2. The complex transaction semantics are removed in favor of a
best-effort SSOT mechanism.
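
A rough sketch of what "best effort" could mean in a central dispatcher, assuming the underlying source is the SSOT: the source operation runs first and decides success or failure, while a failure in the entity store is recorded and tolerated. `BestEffortDispatcherSketch` and `dispatch` are hypothetical names; this is not the actual `CatalogOperationDispatcher` code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch of a best-effort dispatcher; not the real CatalogOperationDispatcher. */
final class BestEffortDispatcherSketch {
  // Collected instead of logged so the example has no logging dependency.
  final List<String> warnings = new ArrayList<>();

  /**
   * Runs the operation on the underlying source (assumed SSOT) first, then the
   * complementary store operation. A store failure does not fail the call.
   */
  <T> T dispatch(Supplier<T> sourceOp, Runnable storeOp) {
    // If this throws, the exception propagates and the store is never touched.
    T result = sourceOp.get();
    try {
      storeOp.run();
    } catch (RuntimeException e) {
      // Tolerate the inconsistency: the SSOT already reflects the change.
      warnings.add("store operation failed, continuing: " + e.getMessage());
    }
    return result;
  }
}
```

The trade-off in this sketch is that additional metadata may be missing after a store failure, which the load path then has to tolerate, in exchange for not needing a cross-system transaction.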

Fix: #318 

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added new UTs to cover the code.