
Dot-ETL Project Proposal #1716

Merged
merged 7 commits into w3f:master on Jun 21, 2023

Conversation

liangjh
Contributor

@liangjh liangjh commented May 3, 2023

Project Abstract

Polkadot as an ecosystem has nurtured a significant developer community and hosts a number of well-known parachains spanning a diverse set of domains, including DeFi lending / liquidity, DEXs, NFTs, RWAs and securitization, as well as identity and privacy applications. While there has been a great deal of interest in developing on Polkadot, there has thus far been no simple means to query and visualize transaction-level data and aggregates.

Dot-ETL will be similar in functionality to the Ethereum ETL project. In the same way that ETH-ETL, which offers Ethereum transaction data as a public dataset from Google, has helped establish higher TVL and adoption of the Ethereum network, the goal is that making Polkadot transactional data easily accessible - without the bulk of the data engineering work currently needed to extract usable data from the blockchain - will lead to greater development of and interest in the protocol among mainstream users of platforms such as Google Cloud. Once data is supported and provided in this format, there are other potential use cases that can expand adoption of Polkadot data across the blockchain industry, such as hosting Chainlink oracles for this data and providing it in a readily available form for a number of different cross-chain applications.

Grant level

  • Level 1: Up to $10,000, 2 approvals
  • Level 2: Up to $30,000, 3 approvals
  • Level 3: Unlimited, 5 approvals (for >$100k: Web3 Foundation Council approval)

Application Checklist

  • The application template has been copied and aptly renamed (project_name.md).
  • I have read the application guidelines.
  • Payment details have been provided (bank details via email or BTC, Ethereum (USDC/DAI) or Polkadot/Kusama (USDT) address in the application).
  • The software delivered for this grant will be released under an open-source license specified in the application.
  • The initial PR contains only one commit (squash and force-push if needed).
  • The grant will only be announced once the first milestone has been accepted (see the announcement guidelines).
  • I prefer the discussion of this application to take place in a private Element/Matrix channel. My username is: @_______:matrix.org (change the homeserver if you use a different one)

@CLAassistant

CLAassistant commented May 3, 2023

CLA assistant check
All committers have signed the CLA.

Collaborator

@Noc2 Noc2 left a comment

Thanks a lot for your interest in our grants program. Looks like an interesting application. This is just a reminder to sign our terms and conditions (see the message above) and update the payment address, level etc. I will also share the application with Keegan, who is currently ooo, but wrote the RFP. Apart from this, could you integrate as many technical details as possible into the milestone tables? For example, that you plan on using Python for it, or what kind of pallet or even smart contract functionality you will be able to support.

@Noc2 Noc2 added the changes requested The team needs to clarify a few things first. label May 4, 2023
@Noc2 Noc2 requested a review from keeganquigley May 4, 2023 07:57
Contributor

@keeganquigley keeganquigley left a comment

Thanks for the application @liangjh, a few initial comments:

  • Are you aware of Substrate ETL which is maintained by Polkaholic? They are treasury funded now, but have already published two BigQuery public datasets. I wonder if this project might be duplicating their work; could you take a look and briefly explain how your implementation would be different? (For example, we already had a team terminate their grant to work on the Substrate ETL)
  • Also, could Google Cloud Composer be used for the existing public datasets?
  • Are you planning on publishing datasets for Kusama too, or just Polkadot?
  • Are you able to add a Total Estimated Duration? You can always file an amendment PR to extend the timeline if you need more time.

| **0c.** | Testing and Testing Guide | Core functions will be fully covered by comprehensive unit tests to ensure functionality and robustness. In the guide, we will describe how to run these tests. |
| **0d.** | Docker | We will provide a Dockerfile(s) that can be used to test all the functionality delivered with this milestone. |
| 0e. | Article | We will publish articles
| 1. | Outreach to RWA / Defi-focused Parachains | With milestone 1 completed, prioritize parachain data that is related to RWA given the more relevant use of analytics for that data in DeFi applications in order to help further grow interest in the PolkaDot ecosystem |
Contributor

Same issue here: it's hard to evaluate, but also RWA parachains are being prioritized. The foundation aims to remain objective and to not show preference to any individual parachain.

Member

I agree with Keegan here, this is impossible to verify and typically not covered by the grants program. Please remove or rephrase.

@liangjh
Contributor Author

liangjh commented May 9, 2023

Hi Keegan / David - thanks for the review & comments here. We're going to regroup / discuss on our end and circle back.

@Noc2 Noc2 added the on hold There is an external blocker, such as another grant in progress. label May 22, 2023
@liangjh
Contributor Author

liangjh commented May 24, 2023

Hi Keegan - again, thanks for the feedback / comments. We'll update the proposal shortly.

> Are you aware of Substrate ETL which is maintained by Polkaholic?

Thanks for pointing out the Polkaholic substrate-etl project. Our project's aim would be very similar, with similar outcomes, i.e. creating a full-stack ETL into BigQuery. Our approach will differ in the infrastructural components: we will utilize the SubQuery indexer, and we will create Airflow DAGs that run within, and are coordinated by, Google Cloud Composer. While similar to Polkaholic, this will create some useful redundancy.

> Also, could Google Cloud Composer be used for the existing public datasets?

Yes, we will write the Airflow DAGs that coordinate updates to the dataset on a periodic basis (they can be set to run at any timeframe / frequency). We'll provide the code and tests for anyone to stand up the ETL, as well as guides on how to run and modify the components of the ETL.
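
To make the scheduling point concrete, here is a minimal sketch of what one of these DAGs could look like; the DAG id, task names, and callables are hypothetical, not part of the proposal:

```python
# Hypothetical sketch only: DAG id, task names, and callables are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_blocks(**context):
    # Query the SubQuery indexer (via GraphQL) for blocks/extrinsics/events
    # produced since the last successful run.
    ...


def load_bigquery(**context):
    # Write the extracted rows into the public BigQuery dataset.
    ...


with DAG(
    dag_id="dot_etl_update",          # hypothetical name
    start_date=datetime(2023, 6, 1),
    schedule_interval="@daily",       # can be set to any timeframe / frequency
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_blocks", python_callable=extract_blocks)
    load = PythonOperator(task_id="load_bigquery", python_callable=load_bigquery)
    extract >> load
```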

> Are you planning on publishing datasets for Kusama too, or just Polkadot?

Yes, Kusama can be included. Kusama is based on Substrate, so it will be possible to point the same components at it.

> Are you able to add a Total Estimated Duration? You can always file an amendment PR to extend the timeline if you need more time.

I believe we had set a 3-4 month duration as a conservative estimate (I have a day job), but we can make this more explicit in the proposal.

@liangjh
Contributor Author

liangjh commented May 24, 2023

Thanks for the comments / feedback. We will add more detail to the timeline / proposal. I had in mind only the main components (i.e. blocks, extrinsics, balances, etc.) rather than pallet and smart contract specifics. The goal is to create a foundation that can be expanded and extended incrementally over time (by us or the community at large) to accommodate additional pieces of functionality (pallets / smart contract features, etc.).

> Apart from this, could you integrate as many technical details as possible into the milestone tables? For example, that you plan on using Python for it, or what kind of pallet or even smart contract functionality you will be able to support.

@johncandido
Contributor

Thanks so much for the feedback, particularly with regard to the potential commonalities between this project and what Colorful Notion has been funded to do through the treasury. Having read through the details of their efforts, here are some reasons we feel this proposal enhances the interests of the ecosystem in addition to the work the Colorful Notion team is doing:

  • The Colorful Notion team makes use of a custom indexer that they have created and support, polkaholic.io, as you mention. The fundamental difference with this proposal is that we intend to utilize the SubQuery Network as the indexer in order to provide access to the data in GCP, as Jonathan has mentioned. There are a few reasons why this is advantageous to the ecosystem:
  1. A speed assessment of a technical implementation based on the SubQuery Network relative to Polkaholic as an indexer. Having both methods is useful for this analysis, and once the performance of both methods is assessed, this can significantly improve the time to get tables available for public use on a platform like GCP.

  2. Redundancy of the data provided for public consumption and use. Given that this project makes use of the SubQuery Network as opposed to Polkaholic, we build in a necessary accuracy check on the data being produced and made publicly available. This is of particular importance given the eventual desire to proliferate this data while maintaining accuracy that can be verified via different indexing methodologies/sources (see the sketch after this list). The most significant application of this is to provide multiple data sources to Chainlink so that Polkadot data can be accessed through oracles. Providing multiple instances of the data from different sources, and finding alignment between them, will only strengthen the reference and use of this data outside of the ecosystem. It will also allow for a comparison of the potential drawbacks of one methodology versus another.

  • Given the large scope of potential applications of ecosystem data, particularly across the many parachains and Kusama implementations for which data tables can be provided, multiple teams can capture different niche interests within the ecosystem and pivot in an agile way to address the priority uses of data that may exist, based on demand for consumption in GCP. This is made clear in this quote from the Colorful Notion proposal:

"While Colorful Notion is the initial and primary child bounty earner of this SXBD initiative and lead the 2 starting projects, it is not realistic nor desirable for CN to be the sole implementer of all SXBD projects, because domain expertise can never be best assembled by just one team. We instead envision a collaborative set of SXBD projects as part of a SXBD/ Polkadot Data Alliance that can form in the not distant future, assisted by curators, where diverse teams with different domain expertise can contribute data deemed valuable by the community within the same success metric bounty structure."

Instead of relying on one team's prioritization of providing tables for different niche ecosystem activities, multiple teams with different prioritizations, perhaps even coordinated, can lead to a "divide and conquer" approach that provides the most public datasets to GCP as quickly as possible.

  • Having two projects working in parallel to provide the same public data is an insurance policy against any one group abandoning a particular project or losing momentum in the scope of work for providing the data in a timely manner.

  • Differences in third-party partnerships that add to the proliferation and use of the data in GCP form may exist between projects, given the domain expertise of the teams. For instance, this team has a clear focus on expanding the use of these tables by Nansen, given its focus on machine learning and AI implementations, which are skill sets particular to this team and of interest to Nansen.
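
As a rough illustration of the cross-indexer accuracy check described in point 2 above, a query along these lines could flag days where the two datasets disagree. The dataset, table, and column names below are placeholders, not confirmed published schemas:

```python
# Hypothetical sketch: dataset/table/column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# Compare daily block counts between the proposed Dot-ETL dataset and the
# existing substrate-etl dataset; any mismatch is a row worth investigating.
query = """
SELECT a.day, a.n AS dot_etl, b.n AS substrate_etl
FROM (
  SELECT DATE(block_timestamp) AS day, COUNT(*) AS n
  FROM `dot-etl.polkadot.blocks` GROUP BY day
) a
JOIN (
  SELECT DATE(block_time) AS day, COUNT(*) AS n
  FROM `substrate-etl.polkadot.blocks` GROUP BY day
) b USING (day)
WHERE a.n != b.n
ORDER BY day
"""

for row in client.query(query).result():
    print(f"{row.day}: dot_etl={row.dot_etl} substrate_etl={row.substrate_etl}")
```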

@keeganquigley
Contributor

Thanks @liangjh and @johncandido for the thorough research and for defending your positions. I agree that having multiple teams with different prioritizations is a better approach, and creates valuable accuracy checks, since more sources are always better than one. I would personally support a level 2 PoC in this regard. I will remove the "on hold" status.

@liangjh feel free to ping me once you have made all applicable changes, and I will mark it as ready for review. If you could also add a team name and payment address/currency that would be great as well. Thanks!

@keeganquigley keeganquigley removed the on hold There is an external blocker, such as another grant in progress. label May 24, 2023
changes w/ more detail to accommodate feedback
@keeganquigley
Contributor

Thanks for the changes all, I will mark the application as ready for review and ping the rest of the committee for any additional questions/comments.

@keeganquigley keeganquigley added ready for review The project is ready to be reviewed by the committee members. and removed changes requested The team needs to clarify a few things first. labels Jun 1, 2023
@sourabhniyogi
Contributor

Hi @liangjh @johncandido -- I am from Colorful Notion, and we are assembling a Polkadot Data Alliance to work on data problems across the Substrate/Polkadot Web3 ecosystem. Very happy to see this. I think it would be wonderful to have a large-scale index of Subquery + Subsquid done by a competent team, and I would specifically like to be able to compare them in a data warehouse (Google Cloud BigQuery) and have our collective efforts get stronger. The genesis of this can be dashboards comparing different indexed datasets, but in the end it has to be about projects getting the tools, and data-driven teams using the tools, to grow the Web3 ecosystem.

To do this seriously for every parachain requires serious amounts of compute, storage, and devops (thus funding, but also love, pain, loss of sleep, and everything in between for seriously committed teams), and within W3F's "no maintenance" mandate it is necessary to have a serious plan for MAINTENANCE -- see my comment on #1768 for specific ideas on this. We think a "Polkadot Data Alliance" parent bounty can support maintenance as well as research. I think having around 3 and at most 5 teams doing comparable work is ideal, while also leaving space for paid business models. With { Subquery + Subsquid }, 1 team makes sense to me.

In particular, one strategy is to have two child bounties organized around { Subquery + Subsquid } activity pick up wherever this leaves off. The child bounty needs a child bounty curator (e.g. James), and your team would be a recipient.

If you would like to define the first child bounty in a Subquery group, email [email protected] (curator) + [email protected] (original proponent) your draft and one of us will put it there. If you want to work with me to share your first Subquery-based index result and compare the data to others in what I think can be a community-supported Superset, my Telegram handle is @sourabhniyogi --

Co-authored-by: Sebastian Müller <[email protected]>
Contributor

@dsm-w3f dsm-w3f left a comment

@johncandido and @liangjh thanks for the application. I understood that you are proposing a different indexing approach. Could you please take a look at this RFP as well and let us know whether your proposal overlaps with it, or whether it could be considered future work on top of the scope of this grant proposal?

@liangjh
Contributor Author

liangjh commented Jun 13, 2023

Thanks for the comments / feedback. We will add more detail to the timeline / proposal. I had in mind only the main components (i.e. blocks, extrinsics, events, etc.) rather than pallet specifics. The goal is to create a foundation that can be expanded and extended incrementally over time (either by us or the community at large) to accommodate more specific pieces of functionality.

> Apart from this, could you integrate as many technical details as possible into the milestone tables? For example, that you plan on using Python for it, or what kind of pallet or even smart contract functionality you will be able to support.

Hi, I'm not sure if I answered sufficiently in the body of the proposal. We intend to pull in all blocks / extrinsics / events. There are some common attributes across all extrinsics, which we will pull out as separate fields, but I imagine the majority of pallet-specific properties will be stored in generalized JSON fields; the base tables can then be further parsed by extensions to the Airflow ETLs to provide more specific tables customized to particular pallets / parachains.
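
As a rough sketch of what such a base table could look like (the field names are illustrative, not the final schema), the common attributes become typed columns while pallet-specific arguments land in a generalized JSON field:

```python
# Hypothetical sketch: field names are illustrative, not the final schema.
from google.cloud import bigquery

extrinsics_schema = [
    bigquery.SchemaField("block_number", "INTEGER", mode="REQUIRED"),
    bigquery.SchemaField("extrinsic_index", "INTEGER", mode="REQUIRED"),
    bigquery.SchemaField("pallet", "STRING"),    # e.g. "balances"
    bigquery.SchemaField("call", "STRING"),      # e.g. "transfer"
    bigquery.SchemaField("signer", "STRING"),
    bigquery.SchemaField("success", "BOOLEAN"),
    # Pallet-specific arguments, kept as a generalized JSON payload that
    # downstream ETLs can parse into pallet-customized tables.
    bigquery.SchemaField("args", "JSON"),
]

client = bigquery.Client()
table = bigquery.Table("my-project.dot_etl.extrinsics", schema=extrinsics_schema)
client.create_table(table, exists_ok=True)
```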

Our current thinking is to stand up a managed SubQuery indexer as well as several Airflow DAGs (i.e. Python) orchestrated in Google Cloud Composer. The DAGs / ETLs would query the SubQuery indexer(s) using GraphQL and save to BigQuery. From there, any number of ETLs could be created to cater to the structures of specific pallets, etc. - those could live within the DAGs we've created, or could run elsewhere (based on the core tables produced here).

The main benefit is that downstream teams could operate purely in Python using Airflow or an ETL tool of their choice, without having to bridge Substrate internals via SubQuery or run their own node.
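
A minimal sketch of that extract-and-load path, assuming a hypothetical SubQuery endpoint and placeholder table/field names (SubQuery's GraphQL API paginates with cursor-based connections):

```python
# Hypothetical sketch: the endpoint URL, GraphQL fields, and table id are
# placeholders for whatever the deployed SubQuery project actually exposes.
import requests
from google.cloud import bigquery

ENDPOINT = "https://api.subquery.network/sq/example/polkadot"  # placeholder

QUERY = """
query ($after: Cursor) {
  blocks(first: 100, after: $after) {
    nodes { id number timestamp parentHash }
    pageInfo { hasNextPage endCursor }
  }
}
"""

client = bigquery.Client()
cursor, has_next = None, True
while has_next:
    resp = requests.post(ENDPOINT, json={"query": QUERY, "variables": {"after": cursor}})
    resp.raise_for_status()
    page = resp.json()["data"]["blocks"]
    # Stream this page of rows into the base table; downstream ETLs build on it.
    errors = client.insert_rows_json("my-project.dot_etl.blocks", page["nodes"])
    assert not errors, errors
    cursor = page["pageInfo"]["endCursor"]
    has_next = page["pageInfo"]["hasNextPage"]
```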

@liangjh
Contributor Author

liangjh commented Jun 13, 2023

Hi @dsm-w3f, thanks for pointing out that RFP and for your review. Just reading through the RFP, I believe it could be an extension on top of this work. It looks like they are seeking to create data tools to answer very specific questions that currently require a manual workflow across several tools. I would imagine that if we export all of the available primitives in Substrate, subsequent ETLs / flows built on top of these primitives could answer, in whole or in part, some of the example user-driven questions they raised. A number of analytic frameworks / tools already have built-in integrations with BigQuery out of the box (e.g. Looker + others), so an export to BigQuery could open up the capability to field various questions / views of the data.

keeganquigley
keeganquigley previously approved these changes Jun 13, 2023
Contributor

@keeganquigley keeganquigley left a comment

Thanks @liangjh and @johncandido for the discussion and the thorough answers. I'm generally happy to go ahead with it. Cool to see a potential collaboration with Colorful Notion and the Polkadot Data Alliance as well.

Collaborator

@Noc2 Noc2 left a comment

Thanks for the updates. I'm generally happy to go ahead with it, but could you update the "Total Estimated Duration: TBD"? Maybe just add eight months here. We can always update it later, but it would be nice to have, because it's also mentioned in our terms and conditions, as far as I remember.

updated total estimated duration
@liangjh
Contributor Author

liangjh commented Jun 14, 2023

> Thanks for the updates. I'm generally happy to go ahead with it, but could you update the "Total Estimated Duration: TBD"? Maybe just add eight months here. We can always update it later, but it would be nice to have, because it's also mentioned in our terms and conditions, as far as I remember.

Sounds good, thanks for the comment. Updated to an 8-month total duration.

@takahser
Collaborator

@liangjh thanks for answering my questions.
Regarding BigQuery, the data types, and the removal of the double-entry ledger format, it LGTM now.
Regarding the frontend, are you planning to build one in the future or are you solely interested in implementing the SDK/backend code?

@dsm-w3f
Contributor

dsm-w3f commented Jun 15, 2023

@liangjh and @johncandido thank you for the answers. The only point that still draws my attention is the tight coupling to cloud providers. For a Web3 company, depending on cloud providers for analytics is something to think about. Is it possible to add one more abstraction layer to your solution, or a way to write the data not only to one cloud provider but with extensibility to others or to self-hosted solutions? I understand that working with BigQuery is OK for now, but if the solution is well-architected, it could be easier to extend to other, non-proprietary solutions. Data access layers and a plugin architecture could help with this task. Let me know if it is possible for you to do it this way.

@liangjh
Contributor Author

liangjh commented Jun 15, 2023

Hi @takahser - only the backend code is in scope.

> Regarding the frontend, are you planning to build one in the future or are you solely interested in implementing the SDK/backend code?

@liangjh
Contributor Author

liangjh commented Jun 15, 2023

Hi @dsm-w3f - sure, I think it's reasonable to explore that - or at the very least, to have the right programming constructs / interfaces to allow for extensibility to other formats / providers, even if we don't implement a whole series of other providers and databases. It'll be open source, so others will be able to write the underlying implementations and plug into the main framework as needed.

> ...write the data not only for one cloud provider but to allow extensibility to others or self-hosted solutions?
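
A minimal sketch of the kind of interface this could take, with BigQuery as one pluggable sink among others; the class and method names are illustrative, not a committed design:

```python
# Hypothetical sketch: class and method names are illustrative, not a final design.
from abc import ABC, abstractmethod
from typing import Iterable, Mapping


class DataSink(ABC):
    """Destination-agnostic writer; concrete sinks plug into the ETL framework."""

    @abstractmethod
    def write(self, table: str, rows: Iterable[Mapping]) -> None:
        ...


class BigQuerySink(DataSink):
    """BigQuery as one instance of a sink, per the discussion above."""

    def __init__(self, project: str, dataset: str):
        from google.cloud import bigquery
        self._client = bigquery.Client(project=project)
        self._dataset = dataset

    def write(self, table: str, rows: Iterable[Mapping]) -> None:
        errors = self._client.insert_rows_json(f"{self._dataset}.{table}", list(rows))
        if errors:
            raise RuntimeError(errors)


class PostgresSink(DataSink):
    """A self-hosted alternative; implementation left to the plugin author."""

    def write(self, table: str, rows: Iterable[Mapping]) -> None:
        raise NotImplementedError
```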

@dsm-w3f
Contributor

dsm-w3f commented Jun 15, 2023

@liangjh thank you for the answer. Would you mind incorporating this design approach into your application document? That way, BigQuery becomes one instance of a tool that we could use. I think this adds value to your proposal.

Updates to accommodate common interface to allow for extensibility to other platforms, etc.
@liangjh
Contributor Author

liangjh commented Jun 15, 2023

> ...incorporating this design approach in your application document?

Done! Let me know what you think, thanks -

dsm-w3f
dsm-w3f previously approved these changes Jun 15, 2023
Contributor

@dsm-w3f dsm-w3f left a comment

I'm happy to support your project.

Noc2
Noc2 previously approved these changes Jun 15, 2023
updated milestone specific steps
@liangjh liangjh dismissed stale reviews from Noc2 and dsm-w3f via 219f750 June 16, 2023 01:37
Member

@semuelle semuelle left a comment

Thanks, looks good to me.

@semuelle semuelle requested a review from takahser June 16, 2023 09:26
Collaborator

@takahser takahser left a comment

@liangjh thanks for your patience, and please forgive the delay - I'm still trying to digest all the information being discussed in this thread. As I understand it, there seem to be two critical distinctions between Substrate-ETL and your SDK from a technical standpoint:

  1. The base for your SDK is Subscan.
  2. The utilization of airflow DAGs within your SDK.

Could you elaborate on how each of these aspects represents an advantage or enhancement over Substrate-ETL?

@liangjh
Contributor Author

liangjh commented Jun 16, 2023

Hi @takahser - sure, thanks for the questions, see below -

> 1. The base for your SDK is Subscan.

The proposal is to utilize SubQuery to access blocks/extrinsics/events on the network. It simplifies / shortcuts having to set up a node and parse its internals. It also makes the data accessible via GraphQL, which allows us to query it from external systems (i.e. Airflow). Note there are a lot of paths to getting here; SubQuery isn't the only way. I just found SubQuery to be straightforward to set up, plus they have the means to create managed indexers, and there is an effort underway to decentralize as well (much like The Graph).

> 2. The utilization of Airflow DAGs within your SDK.

The Airflow DAGs will help orchestrate the execution of steps in the process, but the main logic will be in Python modules that can be extended, built on top of, or deployed anywhere.

> Could you elaborate on how each of these aspects represents an advantage or enhancement over Substrate-ETL?

It's another path to arriving at similar answers (i.e. queryable tables in BigQuery), to Keegan's point; both substrate-etl and Parity have integrations in the same manner. Additionally, if someone wanted to take the framework and instead write to an internal database and deploy internal Airflow DAGs for a more specific purpose / use case, they could utilize this as well. So, two main points: (a) use of a known indexer within the community (SubQuery), and (b) an extensible Python framework (+ Airflow orchestration) to allow others to build more advanced views / ETLs on top of the base tables.

keeganquigley
keeganquigley previously approved these changes Jun 20, 2023
Contributor

@keeganquigley keeganquigley left a comment

Thanks for your updated answers @liangjh, still looks good to me as well.

@keeganquigley keeganquigley dismissed their stale review June 20, 2023 22:39

unresolved conversations

@semuelle semuelle requested a review from takahser June 21, 2023 15:44
Collaborator

@takahser takahser left a comment

@liangjh thanks for answering my questions.
I'm happy to approve as well. I think it makes sense to have an alternative, more lightweight implementation here that leverages SubQuery.

@takahser takahser merged commit f1ece0d into w3f:master Jun 21, 2023
1 check failed
@github-actions
Contributor

Congratulations and welcome to the Web3 Foundation Grants Program! Please refer to our Milestone Delivery repository for instructions on how to submit milestones and invoices, our FAQ for frequently asked questions and the support section of our README for more ways to find answers to your questions.

Before you start, take a moment to read through our announcement guidelines for all communications related to the grant or make them known to the right person in your organisation. In particular, please don't announce the grant publicly before at least the first milestone of your project has been approved. At that point or shortly before, you can get in touch with us at [email protected] and we'll be happy to collaborate on an announcement about the work you’re doing.

Lastly, please remember to let us know in case you run into any delays or deviate from the deliverables in your application. You can either leave a comment here or directly request to amend your application via PR. We wish you luck with your project! 🚀

@keeganquigley
Contributor

Hi @liangjh @johncandido, hope you had wonderful holidays. How is M1 coming along? If the project is still delayed, please consider filing an amendment pull request to extend the timeline. Thanks!

ainhoa-a pushed a commit to ainhoa-a/Grants-Program that referenced this pull request Jan 26, 2024
* Create / update dot_etl.md, + minor changes and feedback

* Update dot_etl.md

changes w/ more detail to accommodate feedback

* Update applications/dot_etl.md

Co-authored-by: Sebastian Müller <[email protected]>

* Update dot_etl.md

updated total estimated duration

* Update dot_etl.md

update - remove double entry ledger from M1 description

* Update dot_etl.md

Updates to accommodate common interface to allow for extensibility to other platforms, etc.

* Update dot_etl.md

updated milestone specific steps

---------

Co-authored-by: johncandido <[email protected]>
Co-authored-by: Sebastian Müller <[email protected]>
@Polkadot-Forum

This pull request has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/decentralized-futures-fidi-polkadot-s-code-free-intelligence/7475/1
