Adding application: Polkadot Analytics Platform #1883
Conversation
The Polkadot Analytics Platform aims to build a comprehensive data analysis and visualization tool for the Polkadot ecosystem. This is a follow-up grant application for the project: w3f#1420
Thanks for the application, @rbrandao. I have some questions. Feel free to amend the application accordingly.
- As you state in the application, we recently signed three projects addressing the same RFP. You have listed some technical differences between your project and the others, but what I don't understand is how an average user benefits from your approach.
- You are proposing building a platform funded by subscriptions. Have you done any market research on the size of the target audience and potential revenue? Given that there are already a number of existing platforms, this seems ambitious.
- Do you have experience with ETLs and the technologies you propose using? I was under the assumption that your background was largely academic and I'm worried that we are funding too many datasets and analytics platforms that stop being maintained. Going by the conversations we had with people working on such products, keeping up to date with the data and changes in the ecosystem is a challenge in itself.
Thank you for the comments, @semuelle!
For the average user, having a CNL (Controlled Natural Language) to perform queries is key. CNLs preserve most of the natural properties of their base language, so users can intuitively and easily specify their intent. In contrast to the other three projects addressing the same RFP, in our project users do not need to know or learn programming languages such as SQL, GraphQL, or any other. In addition to the CNL, we designed the concept of informative artifacts to allow dashboard composability. This adds to the benefits for the average user, as they will be able to reuse and adapt existing artifacts through a visual and interactive interface. We believe that by leveraging the ontological framework and the controlled natural language querying engine, users can easily perform complex cross-chain data analysis without in-depth technical knowledge or familiarity with specific programming languages. This empowers the average user within the Polkadot ecosystem to efficiently retrieve and analyze blockchain data across multiple parachains, contributing to improved decision-making, research, and understanding of cross-chain effects and dynamics.
In the application, we mentioned that the platform could be leveraged and monetized via additional funding applications or a SaaS subscription model (monthly fee payments and a free tier with limited capabilities). However, this is an early-stage technical project application. As the roadmap evolves, we will look for partnerships, sponsorships, or other supporting programs that might be suitable, e.g. builders' programs, VC funding, or treasury funding. At that stage, market research and potential revenue analysis will indeed be key to securing the required funding.
Both my partner and I at MOBR have been working with AI for roughly the last eight years. We worked on different industry R&D projects at IBM for seven years, specifically with knowledge engineering (KE). The computational KE field comprises, among other activities, structuring domain knowledge in a way that it can be queried. Commonly, information in expert domains is rather unstructured and dynamic. In other words, data has to be extracted, cleansed, transformed, and injected into knowledge bases continuously. As a matter of fact, in our previous work we designed and deployed a graph database technology (named Hyperknowledge) to address this specific issue of maintaining the dynamicity of domain knowledge and data. More details about the projects we have been engaged in can be found on our LinkedIn profiles [1, 2]. I'm glad to answer any questions you may have about them. Concerning the other technologies mentioned in the application, we also have experience with them from these previous projects. [1] Dr. Moreno's LinkedIn profile: https://linkedin.com/in/marcio-moreno-phd-598a459a/
Thanks for the clarifications, @rbrandao. I think this CNL approach might be useful. However, I would prefer to test this in a smaller context, e.g. by extending existing querying engines, to see what the actual benefit is. Or, if you have some concrete examples of how the CNL simplifies certain queries, feel free to add them.
In any case, I will share your application with the rest of the committee.
Thanks again for your comments, @semuelle. Our idea for the query engine is to extend an existing SPARQL query engine rather than create a whole new query engine from scratch. As described in Milestone 4, deliverable 1, in this component we will implement the logic for executing CNL queries by translating them into SPARQL queries to fetch data from the Knowledge Base. Concerning concrete examples of how CNL may simplify certain queries, please consider Fig. 1 in the grant application, which gives an overview of the process. In it, we specifically illustrate competency questions (CQ5 and CQ6) in a controlled natural language, along with their structured equivalents in SPARQL. More examples and details are available in the published paper.

CQ5 (CNL): How many transactions happened between July 4th 2023 and July 8th 2023 specifically in the Moonbeam parachain?

CQ6 (CNL): What are the top 5 parachains by pull requests in the last 7 days?
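To make the CNL-to-SPARQL idea more concrete, here is a minimal sketch of what a translation for CQ5 could emit. The `ponto:` class and property names (`ponto:Transaction`, `ponto:parachain`, `ponto:timestamp`, `ponto:name`) and the prefix IRI are illustrative assumptions, not the actual POnto vocabulary or the project's real query engine:

```python
# Hypothetical sketch of a CNL -> SPARQL translation for CQ5.
# All ponto: terms and the prefix IRI below are illustrative assumptions.

PREFIXES = """
PREFIX ponto: <https://www.mobr.ai/ontologies/ponto#>
PREFIX xsd:   <http://www.w3.org/2001/XMLSchema#>
"""

def cq5_to_sparql(parachain: str, start: str, end: str) -> str:
    """Build a COUNT query for: 'How many transactions happened between
    <start> and <end> specifically in the <parachain> parachain?'"""
    return PREFIXES + f"""
SELECT (COUNT(?tx) AS ?txCount) WHERE {{
  ?tx a ponto:Transaction ;
      ponto:parachain ?chain ;
      ponto:timestamp ?ts .
  ?chain ponto:name "{parachain}" .
  FILTER (?ts >= "{start}"^^xsd:dateTime && ?ts <= "{end}"^^xsd:dateTime)
}}"""

print(cq5_to_sparql("Moonbeam", "2023-07-04T00:00:00Z", "2023-07-08T23:59:59Z"))
```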
@rbrandao thank you for the grant application. I think this topic is interesting, but it seems that we already have some grants related to data analysis. The initiatives should work in synergy and ideally reuse each other's work. See this comment from Karim, one of the team leads of the data team at Parity, in another grant application. With that in mind, I have some doubts about your project. How could this grant application reuse the efforts that are already in progress? Could this lead to a reduction in the scope and price of your application? How? From what I saw in the other applications in the same area, this one is the most expensive, and my feeling is that we are already supporting part of it through other projects. Would it be possible for you to follow Karim's recommendations and focus on the frontend (query part) or on capabilities that other grants cannot provide?
Thanks for your comments, @dsm-w3f. We are excited to see this surge of analytics project proposals in the ecosystem. Check out our points below regarding your and Karim's comments.
In terms of reusing current efforts, in the proposed architecture we already foresee the reuse of available ETLs in the Data Layer (see Fig. 2 in our application). The idea is that the "Semantic ETL workflows" proposed in Milestone 2 will fetch data using available ETLs. If the current ETLs like Substrate-ETL and dot-etl (when available) are not suitable for our use cases, we could propose extensions to existing projects, but this is not in the scope of our current application. Nevertheless, for the proposed platform to succeed we need to transform the fetched data and align it with the ontology, creating knowledge representations that will be the basis for the upper layers in our architecture. So, the scope of work cannot be reduced, since there is no other project offering a knowledge-oriented solution with the features this platform requires.

One point I'd like to mention is that our previous grant application was submitted before the publication of this RFP. That previous grant focused on devising a domain ontology, which would be the stepping stone toward the envisioned analytics platform for the ecosystem. This platform is the focus of the current application. Our vision for it is based on our past experience as AI researchers and developers in different industries. In our approach, we combine a holistic view of data and domain knowledge to create a queryable knowledge base that can be leveraged to meet the demands of experts, developers, and average users.

In addition to providing a knowledge-oriented solution, we anticipate that the proposed platform could also be valuable in terms of creating AI research opportunities, including exploring semantic reasoning in the KB to provide answers that are not explicitly represented, and research initiatives to support rich insights based on predictive models (e.g., link prediction, concept2vec).
Technically, we are already focusing on the frontend and query features, as we do not plan to implement an ETL from scratch. However, our solution demands the implementation of intermediary layers to process, transform, and align information in order to maintain triples as knowledge representations in a triplestore database. In addition, it is necessary to build on top of triplestore query engines to support CNL queries, and to create the necessary endpoints, including those for consuming the knowledge properly. As far as we know, there is no alternative to the backend features we need. Below are some reflections regarding Karim's comment on the dot-etl proposal.
As opposed to using relational databases, our approach uses triplestores to maintain a knowledge base, where ledger data is transformed and injected, and can be further enriched through other data sources and domain knowledge. In our perspective, a triplestore database offers distinct advantages over a relational database due to its semantic model. It excels in capturing and querying complex relationships by utilizing RDF triples that facilitate efficient representation of diverse and evolving data structures. Unlike the rigid schema of relational databases, triplestores allow schema-less data integration, making them adaptable to evolving data sources. They naturally support semantic reasoning and inferencing, enabling advanced querying for deriving new insights. This enables applications such as knowledge graphs, semantic web, and linked data, fostering enhanced data integration, flexibility, and semantic querying capabilities that are often challenging to achieve with traditional relational databases.
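As a small illustration of the schema-less triple model argued for above, the following rdflib sketch models a single transaction as triples and then attaches a new property without any schema migration. The `ponto:` terms and IRIs are hypothetical placeholders, not the actual ontology or data model:

```python
# Illustrative only: modelling one transaction as RDF triples with rdflib.
# The ponto: terms and IRIs are hypothetical placeholders for the POnto ontology.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, XSD

PONTO = Namespace("https://www.mobr.ai/ontologies/ponto#")
DATA = Namespace("https://example.org/polkadot/")

g = Graph()
g.bind("ponto", PONTO)

tx = DATA["tx/0xabc123"]
g.add((tx, RDF.type, PONTO.Transaction))
g.add((tx, PONTO.parachain, DATA["parachain/moonbeam"]))
g.add((tx, PONTO.timestamp, Literal("2023-07-05T12:00:00Z", datatype=XSD.dateTime)))

# A new property can be attached later without a schema migration,
# which is the flexibility described above.
g.add((tx, PONTO.fee, Literal("0.012", datatype=XSD.decimal)))

print(g.serialize(format="turtle"))
```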
Considering the charting possibilities, we propose the concept of informative artifacts as query results that can be composed into custom dashboards. We plan to reuse different libraries specifically to deal with rendering and dashboard composition. We will definitely consider Apache Superset for that matter as well.
On the one hand, in our proposal the platform will provide average users with a UI offering querying capabilities. Their interaction will be facilitated through a CNL-based query specification supported by autocomplete and contextualized suggestion features. Query results will be presented as visual content that users will be able to interact with and customize. In the future, these artifacts can be leveraged through social engagement, e.g. sharing, ranking, bragging, etc. Dune Analytics has a similar social engagement approach. On the other hand, Dune Analytics is built on top of a relational database and requires users to learn its custom SQL query language, which is a technical query language, while also requiring users to learn and understand its data model. There is no support for query building features or user-friendly mechanisms such as autocomplete or contextual suggestions based on the entities that compose the domain knowledge. In our approach, the proposed platform will be built on top of a knowledge base, comprising a custom triplestore database, which is commonly leveraged to provide semantics and additional information to support such features. CNL is key for average users to specify their queries, and it meets the requirement stated in the RFP, i.e. "the tools should NOT demand that users need to know or learn technical query languages such as SQL, GraphQL, or any other."
We plan to use existing ETLs (Substrate-ETL, dot-etl). However, we will still have to cover the costs of computing power for keeping the data in the knowledge base synchronized, as well as storage costs. The data and information extraction will be carried out by the planned Semantic ETL workflows, which are a series of Airflow tasks that use existing ETLs to transform and align data and inject knowledge into the triplestore. These pipelines may be triggered on a schedule that can be adjusted depending on the affordable costs and acceptable delay. As commented before, in the application we mentioned that the platform could be leveraged and monetized via additional funding applications or a SaaS subscription model (monthly fee payments and a free tier with limited capabilities). However, this is an early-stage technical project application. As the roadmap evolves, we will look for partnerships, sponsorships, or other supporting programs that might be suitable, e.g. builders' programs, VC funding, or treasury funding. At that stage, market research and potential revenue analysis will indeed be key to securing the required funding.
Regarding sharing query results, each query will have an informative artifact associated with it. This artifact will provide not only the results, but also a reference to an endpoint so results can be accessed and polled dynamically. As for direct access to the KB representation, we will provide endpoints for performing queries. Another possibility is to have dumps from the triplestore in Turtle format.
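For instance, a client could consume such a query endpoint with a standard SPARQL client library; the sketch below uses SPARQLWrapper, and the endpoint URL and `ponto:` terms are hypothetical placeholders rather than anything the platform currently exposes:

```python
# Hypothetical usage sketch: querying a SPARQL endpoint exposed by the platform.
# The endpoint URL and the ponto: terms are placeholders/assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://analytics.example.org/sparql")  # assumed endpoint
sparql.setQuery("""
    PREFIX ponto: <https://www.mobr.ai/ontologies/ponto#>
    SELECT (COUNT(?tx) AS ?txCount)
    WHERE { ?tx a ponto:Transaction . }
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["txCount"]["value"])
```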
Yes, we will provide a Dockerfile in the project's GitHub repository. We specified Docker images as deliverables in our milestones.
As illustrated in the application, we designed a systematic process to perform information extraction from the ecosystem (reusing existing ETLs and existing Substrate interfaces). Specifically, this process will comprise workflows, i.e. streamlined Airflow tasks that continuously (on a configured schedule) fetch, transform, cleanse, and align the data, so that structured knowledge can be injected or backfilled by a specific component in the architecture that leverages the flexibility of the triplestore data model. Going forward, the extraction processes, as well as the domain ontology, should be regularly reviewed to ensure that data is accurately represented and handled.
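A minimal sketch of what one such scheduled Semantic ETL workflow could look like as an Airflow DAG follows. The DAG id, task names, schedule, and stubbed task bodies are assumptions for illustration only, not the project's actual pipeline code (written against the Airflow 2.x Python API):

```python
# Illustrative Airflow DAG for a "Semantic ETL workflow": fetch via an existing
# ETL, transform/align to the ontology, then inject triples into the triplestore.
# Names, schedule, and task bodies are assumptions, not the project's actual code.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_from_existing_etl(**context):
    ...  # e.g. pull the latest block range from Substrate-ETL / dot-etl output

def transform_and_align(**context):
    ...  # cleanse records and map them to ontology classes/properties as triples

def inject_into_triplestore(**context):
    ...  # write or backfill the resulting triples into the knowledge base

with DAG(
    dag_id="semantic_etl_workflow",
    start_date=datetime(2023, 7, 1),
    schedule="@hourly",   # adjustable depending on the cost/latency trade-off
    catchup=False,
) as dag:
    fetch = PythonOperator(task_id="fetch", python_callable=fetch_from_existing_etl)
    align = PythonOperator(task_id="transform_align", python_callable=transform_and_align)
    inject = PythonOperator(task_id="inject", python_callable=inject_into_triplestore)

    fetch >> align >> inject
```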
@rbrandao thank you for the answer. I now understand that the project already considers the databases being developed by other projects as sources of information. A remaining concern is the price, since US$500 per hour is very expensive, and not all of the scope requires PhDs to develop it. Would you consider giving us a discount? One option to be more cost-efficient might be to hire less expensive professionals for at least some parts of the development. Another concern is the scope of the questions that would be answered by the tool. As far as I remember, the ones from the last grant were somewhat generic and not very tightly tied to the Polkadot ecosystem. Are the questions that the tool can answer the same as those proposed in the previous grant? Has anything changed there?
Thanks for the prompt comments @dsm-w3f. Indeed, we already foresee the use of existing ETLs, Substrate interfaces, and datasets.
The specified rate is US$500 per day, not per hour. This is a reasonable daily rate for a PhD with the required expertise; honestly, it is below commonly observed market rates. Note that it is an 8-month project roadmap.
The work requires a deep understanding of specialized technologies, demanding skills in NLP, AI/knowledge engineering, HCC (human-centered computing), and UI/UX. I don't see how hiring less specialized professionals could work here, at least not without compromising the results.
The scope of the questions will be broad, supported by the combination of concepts, properties, and individuals aligned with the POnto domain ontology that are represented in the Knowledge Base. To illustrate the scope of the questions the platform will be able to answer, consider the following categories:
Concerning the scope of the questions, this is exactly the kind of discussion we tried to foster in the last milestone of our previous research grant, through the discussions over the Mural and the questionnaire. We are addressing all suggestions from the feedback we received, as well as the types of queries stated in the RFP. Note that, after deploying an initial MVP of the platform, there will be many opportunities for evolution, including expanding the ontology, which would further expand the query answering capabilities as well.
As described in Milestone 4, deliverable 1, in this component we will implement the logic for executing CNL queries by translating them into SPARQL queries to fetch data from the Knowledge Base.
Thanks for the updates, @rbrandao. I agree that this concept could be quite useful and is worth pursuing. However, looking at the complexity of starting, running and maintaining an analytics platform, as people in the ecosystem have shared with us, I would not recommend tying it to a complete platform. I'd be happy to give my +1 for a library that helps convert CNL to SPARQL, and possibly other tools that might be useful in this regard. That would also be a better fit for the grants program as a whole.
Thanks for the comments and suggestions, @semuelle. Indeed, an analytics platform is a complex asset to develop and maintain, but when a robust software platform is not available off the shelf, there is no alternative to building it. Regarding your recommendation of developing a library to help convert CNL to SPARQL: as far as we know, there is no other project considering a knowledge base to support analytics in the Polkadot ecosystem. How could such a library be useful as a standalone solution without the required backend layers (a knowledge base with a proper knowledge representation, endpoints, ontology and information alignment, etc.)? The set of open source libraries that compose the platform would be good contributions to the community. However, as isolated components, these libraries would not bring as much value as they would in a cohesive solution such as the envisioned platform.
@rbrandao thank you for the answers. The price charged per day is more reasonable than per hour. However, I agree with @semuelle that the core part of the app is the CNL-to-SPARQL conversion, and that maintaining ETL and other infrastructure at this moment would not be appropriate. Maybe we could try the proposed technology on a small scale and understand if and how it could generate value for our ecosystem before proceeding to fund a full product. Furthermore, I'd like to ask whether the proposed technology will be based on an existing one such as Sparklis (docs here) or other available tools, or whether you plan to develop a new tool from scratch?
Hi @dsm-w3f, it's hard to cherry-pick a core component of the proposed approach. In our understanding, a foundational aspect is the structuring of a domain ontology, and that's why we proposed it beforehand. For us, it makes no sense to focus specifically on the CNL-to-SPARQL aspects without considering the big picture of the platform (as commented before regarding @semuelle's suggestion). Concerning the design of the query building, our initial interface proposal aims to provide a textual interface, i.e. an "omnibox" text input tool that suggests terms and autocompletes expressions based on contextual information (a rough sketch of this behavior follows below). This is in contrast to the referenced tool (https://github.com/sebferre/sparklis), which requires users to go through a series of clicks and selections over widgets and visual components. Indeed, we plan to develop our own tool, but that doesn't mean building everything from scratch. We will definitely consider open-source solutions that may help, including the aforementioned work and others discussed in the comparative survey. We have co-authored a number of papers and patents published throughout the years, specifically dealing with and advancing the state of the art in querying technologies. Check out, for example, "An Extensible Approach for Query-Driven Multimodal Knowledge Graph Completion" and our US patents. If you want, we can talk about them or share further details about our vision of the query building aspects.
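As a rough illustration of the omnibox suggestion behavior mentioned above (prefix-based autocomplete over ontology terms and known individuals), here is a minimal sketch under the assumption of a small in-memory vocabulary; in the platform, the vocabulary would instead be derived from the POnto ontology and the knowledge base contents:

```python
# Rough sketch of prefix-based autocomplete over ontology terms for the omnibox.
# The vocabulary below is a hypothetical subset, chosen only for illustration.
VOCABULARY = {
    "parachain": ["Moonbeam", "Acala", "Astar"],
    "transaction": [],
    "pull request": [],
    "validator": [],
}

def suggest(partial: str, limit: int = 5) -> list[str]:
    """Return ontology terms (and known individuals) matching a typed prefix."""
    partial = partial.lower().strip()
    terms = list(VOCABULARY) + [v for values in VOCABULARY.values() for v in values]
    return sorted(t for t in terms if t.lower().startswith(partial))[:limit]

print(suggest("par"))   # ['parachain']
print(suggest("moo"))   # ['Moonbeam']
```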
@rbrandao thank you for the answers. Would translating the CNL to SPARQL without selecting options, as other tools do, cause precision problems in the translation? How do you plan to deal with that? Furthermore, let me know whether you plan to make changes to the application document in light of our discussion, or whether this is the final version, so that we can reach a decision on it.
Hi @dsm-w3f. If by "selecting options" you mean interactivity through a visual UI to support the translation, then not having it would not cause "precision problems". As specified in M3 of our proposal, we included the deliverables 1) CNL grammar, 2) CNL syntax definition, and 3) CNL semantics definition. These deliverables will support parsing and validating any valid specification of queries in the controlled natural language. In addition, our textual interface for query building includes autocomplete and contextualized suggestion features, which would also assist in the validation and "precision" of the query translation.
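For illustration only, here is a toy fragment of what such a CNL grammar could look like, expressed with Lark and covering just the CQ5-style counting question. The rule names, terminal definitions, and truncated month list are assumptions for the sketch, not the actual M3 grammar deliverable:

```python
# Toy fragment of a CNL grammar in Lark, covering only a CQ5-style counting
# question. This is an illustrative assumption of what the M3 grammar
# deliverable might contain; the real grammar/syntax/semantics will differ.
from lark import Lark

CNL_GRAMMAR = r"""
    query: HOW_MANY entity HAPPENED_BETWEEN date AND date IN_PARACHAIN NAME PARACHAIN
    entity: TRANSACTIONS | TRANSFERS
    date: MONTH DAY YEAR

    HOW_MANY.2: "How many"
    HAPPENED_BETWEEN.2: "happened between"
    AND.2: "and"
    IN_PARACHAIN.2: "specifically in the"
    PARACHAIN.2: "parachain?"
    TRANSACTIONS.2: "transactions"
    TRANSFERS.2: "transfers"
    MONTH.2: "June" | "July" | "August"   // truncated month list for brevity
    DAY.2: /\d{1,2}(st|nd|rd|th)/
    YEAR.2: /\d{4}/
    NAME: /[A-Z][A-Za-z]+/

    %import common.WS
    %ignore WS
"""

parser = Lark(CNL_GRAMMAR, start="query")
tree = parser.parse(
    "How many transactions happened between July 4th 2023 and July 8th 2023 "
    "specifically in the Moonbeam parachain?"
)
print(tree.pretty())
```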
Concerning the focus of this initiative, we think it is key to follow the currently proposed milestone order, since there are dependencies among the proposed assets; that is, each asset depends on the outcome of the previous milestone. An alternative to this application would be breaking down each of the proposed milestones into separate L1/L2 grant applications. This would address your and @semuelle's concerns regarding a complete platform as a single application. Note that by proposing all of the deliverables together we reduce development costs, since we can plan the roadmap and scope of work ahead; if we break it down into small projects, the roadmap can change considerably with further discussion, and things may take a different route with different costs. Currently we have the following milestones on our radar: M1 (10k in 1 mo), M2 (18k in 2 mo), M3 (18k in 2 mo), M4 (15.5k in 1.5 mo), and M5 (18k in 2 mo). What do you think about this alternative?
@rbrandao thank you for the answer. It is not clear to me what the scope of the L1/L2 grant applications would be. Can you detail the scope of each one together with its budget? That way, we would be able to analyze them and give an opinion. Although the scope separation could lead to a higher overall price, it reduces our risk of funding large projects that might not go in a direction that provides value to our ecosystem. Furthermore, data-related projects are eligible for treasury funding through other initiatives, such as the Data Alliance, if they are aligned with them. Having a working prototype that shows the value of your tool for our ecosystem and does not overlap with projects that are already part of the Data Alliance could be a good way to ask the treasury for funds. Usually, treasury proposals and bounties provide more funds than grants.
@dsm-w3f, we have just changed our proposal to an L1 grant application. The scope of work is now limited to the first milestone of the previous application. This is the final version of our application. Thank you and @semuelle for your time and relevant feedback.
Thanks for the updates, @rbrandao. I'm happy to support this.
Congratulations and welcome to the Web3 Foundation Grants Program! Please refer to our Milestone Delivery repository for instructions on how to submit milestones and invoices, our FAQ for frequently asked questions, and the support section of our README for more ways to find answers to your questions.
Project Abstract
The Polkadot Analytics Platform aims to build a comprehensive data analysis and visualization tool for the Polkadot ecosystem. The platform will allow users to retrieve and analyze data from various Polkadot-related sources (e.g., different parachains and components such as browser wallets), aligned with the POnto ontology [1, 2, 3]. Users will be able to specify their queries using a controlled natural language (CNL), and the platform will provide a query engine to process these queries. Additionally, the platform will provide a UI to support constructing queries and visualizing informative artifacts that represent query results, as well as support for composing customizable dashboards using these artifacts.
This is only the first stage in the roadmap to build the platform; it comprises a subset of the platform components.
[1] POnto source code: https://github.com/mobr-ai/POnto
[2] POnto documentation: https://www.mobr.ai/ponto
[3] POnto scientific paper: https://github.com/mobr-ai/POnto/raw/main/deliverables/milestone3/article.pdf
This is a follow-up grant application for the project "A Knowledge-Oriented Approach to Enhance Integration and Communicability in the Polkadot Ecosystem".