ink!jet Grant Application #2154
Conversation
CLA Assistant Lite bot: All contributors have signed the CLA ✍️ ✅
I have read and hereby sign the Contributor License Agreement.
Hi @yu-jeffy, thanks for the application.
- Pinecone doesn't appear to be open-source, which we usually require for all components. Does this mean that some trust in Pinecone, Inc. would be needed to verify the security and functionality?
- How will you deal with the maintenance of updating OpenAI models/versions?
- When it comes to ink! data collection, I worry that the pool of existing ink! contracts isn't large enough yet for an effective training set. Can you elaborate on your approach to finding vulnerabilities?
- Would you look to integrate with other tools, such as CoinFabrik's Scout vulnerability detector?
- Do you plan to just stick with the browser version, or would you be open to eventually integrating with an IDE, such as via a VS Code extension?
A local ChromaDB, Redis, or Postgres (with pgvector) open-source database can be used instead, meaning that the data is retained within the stack. However, this may mean re-generating the embeddings and storing them in the local database every time the container is initialized, which incurs cost in terms of using the embedding model. These options are also available on cloud hosting services; however, this introduces the same issue of trust in a third-party company. Pinecone lets us create the vectors once and access them as we iterate through deployments and containers. A single Pinecone vectorstore also allows multiple deployed containers to access the same data from one place. While we would prefer to stick with Pinecone, we can migrate to an open-source solution, either locally or in the cloud.
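For illustration, a minimal sketch of the local open-source option using ChromaDB's Python client — the collection name, sample document, and query are placeholders, not taken from the application:

```python
import chromadb

# Persistent local store: the vectors live inside the stack, so no
# third-party service needs to be trusted with the data.
client = chromadb.PersistentClient(path="./ink_vectorstore")
collection = client.get_or_create_collection(name="ink_contracts")

# Embeddings are generated on insert; if a fresh container starts without
# the persisted directory, this step must be repeated (the cost noted above).
collection.add(
    ids=["flipper"],
    documents=["#[ink::contract] mod flipper { /* ... */ }"],
    metadatas=[{"category": "example"}],
)

# Retrieve the nearest contracts for prompt injection.
results = collection.query(query_texts=["simple boolean storage contract"], n_results=3)
print(results["documents"])
```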
To complete the dataset, we will use a barebones RAG-LLM pipeline with both sets of documentation in the vectorstore to create generated examples. We are dividing our approach into categories and subcategories of smart contract purposes, including payments, transfers, lending, borrowing, vesting, escrow, NFTs, tokens, supply chain management, invoicing, real estate ownership, DAOs, decentralized identity, gaming mechanics, auctions, reputation systems, etc. We will generate thousands of examples with guided prompts that cover all these use cases. Before adding each contract to the dataset, we will compile it and deploy it to a local or testnet node to ensure it is functional. Each contract will also be run through CoinFabrik Scout to ensure it is not vulnerable. This will fall under Milestone 2 of our project timeline.

Yesterday, we did a proof of concept of this method and were able to produce dozens of functional smart contracts, which shows this approach is feasible. This does raise the question of whether our full RAG-LLM pipeline is necessary if this barebones version is already effective. We believe that having the full dataset will improve efficacy further, without the heavy prompt guidance and iteration we are employing for dataset generation. Additionally, our platform still fulfills the core objective of improving access and ease for new and existing developers. To our knowledge, ink! playground is the only existing in-browser ink! IDE, and its deployment is not live. Our platform allows for in-browser development with CoinFabrik Scout bundled in (see 4.). Additionally, it integrates generative AI into the IDE, with responses tailored to ink! out of the box.
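A rough sketch of the generation-and-validation loop described above, assuming the OpenAI Python client and Scout's `cargo scout-audit` subcommand; the category list, prompts, and helper names are illustrative:

```python
import subprocess
from openai import OpenAI

client = OpenAI()
CATEGORIES = ["payments", "vesting", "escrow", "NFTs", "auctions"]  # subset for illustration

def generate_contract(category: str) -> str:
    """Ask the model for one guided example contract in the given category."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You write idiomatic, secure ink! smart contracts."},
            {"role": "user", "content": f"Write a complete ink! contract for: {category}"},
        ],
    )
    return resp.choices[0].message.content

def is_valid(crate_dir: str) -> bool:
    """Keep only contracts that compile and pass CoinFabrik Scout."""
    build = subprocess.run(["cargo", "contract", "build"], cwd=crate_dir)
    scout = subprocess.run(["cargo", "scout-audit"], cwd=crate_dir)  # assumed invocation
    return build.returncode == 0 and scout.returncode == 0
```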
This simplifies our vectorstore as well, as we originally would have had to use metadata tags and code comments in the smart contracts to label vulnerabilities as separate from functional code, and additional testing would have been needed to ensure our annotations work. For collecting vulnerable smart contracts, we were looking to pull the source code from reported DeFi attacks and to use examples from vulnerability detection training material. We did a quick survey of available sources, and the amount was insufficient for a comprehensive dataset. We could create mutations from the available examples; however, this would carry a large time cost and may dilute the data through repetitiveness. Integrating CoinFabrik Scout instead simplifies the vulnerability detection in our platform.
Regarding infrastructure, at this time we are implementing a Docker container per user. When a user visits the application, a new container is initialized from our Docker image. For the scale of this grant, without a major anticipated userbase, this approach should remain cost-effective. However, we may migrate towards an IDE extension after this project, which would eliminate the ongoing costs of Docker containerization and deployment. In an IDE extension, the smart contract and CoinFabrik Scout would simply compile and run locally in the user's environment.
Thanks for your thorough answers @yu-jeffy. Let me check with the team internally regarding Pinecone to see if it is something we could accept in this context. In the meantime, I have marked the application as ready for review and will ping the rest of the committee to take a look. Also, just curious: have you looked into any alternative open-source vectorstore solutions such as Annoy, Faiss, Milvus, or Weaviate? Or would it make more sense to use one of the generic databases you mentioned, if Pinecone isn't an option?
We took a look at the suggested databases. Annoy seems to have a different structure and might not be compatible. We have some experience with FAISS, which is bundled in the LangChain library; however, it may need to be loaded locally and re-initialized for each deployment. Milvus and Weaviate look promising, and both offer cloud-hosted options. We are going to switch to these instead of Pinecone. Since we don't have past experience with them, we will use both up until Milestone 3 to test them out, and depending on performance, we will choose one for the final two milestones.
Thanks for being willing to switch, @yu-jeffy, and since you've demonstrated your experience with generative AI, I'm happy to go ahead with it. Hopefully this could help bring a competitive edge to ink!
While we wait for other committee members to take a look, we just started requiring KYC/KYB checks for all potential grantees. Could you please provide the information outlined under this link? Let me know if you have any questions or issues!
Sounds good, thank you! I have also filled out the KYC; it is processing now.
Hi @yu-jeffy, sorry for the long wait.
I had a quick look at the studies you referenced, and I don't know if I'm misreading this, but it sounds like the work you reference as "similar" to yours still generates tons of vulnerable code. Here is the quote:
> our experiments in sections III-A4 and III-B3 show that above 70% of the code is vulnerable. Of this 70%, we are able to correctly label 62%, giving us a total labeling performance of 43%. Of this 43%, we can avoid 67%. This gives us a total 30% reduction of vulnerabilities. Combined with the secure baseline percentage of 30%, our approach can increase the amount of secure code from 30% to 60%
In other words, 29% of all generated code contained vulnerabilities. And that's based on a much larger dataset than we have available. Am I reading this right? Do you have any targets set for yourself?
Hi @semuelle, we are looking to approach this issue in a different way than the study's experimental design. The study uses two datasets: they fine-tune a model on the first dataset, then fine-tune the resulting model on the second. They created the first dataset themselves via:
This first dataset includes many vulnerable contracts, which they acknowledge in the paper. The model at this point learns to generate smart contract code, but since it was trained with vulnerable code, it generates vulnerabilities.
They then add in a second dataset, smart contracts with labeled vulnerabilities, and fine-tune the model again. This tries to teach the model that certain code is or is not vulnerable. They find that even with this training, the model still generates vulnerable code, likely because the original training data contained these vulnerabilities. For our approach, we are using GPT-4 with RAG and performing in-context learning, instead of fine-tuning a GPT-J model as in the study. We are curating a dataset of only secure smart contracts, so our model is only provided secure code. Also, we are writing our system prompts to instruct the model to respond according to our code only, which guides it to follow secure coding practices and learn the ink! syntax. We are creating the dataset from three sources:
For smart contracts from any of these sources, we are running them through CoinFabrik Scout to check for vulnerabilities. If they pass, they are added to the dataset. If they do not pass, we will attempt to manually debug them until they do, or else exclude them from the dataset. Therefore, when a user prompts our model, the model is only provided retrieved code that has passed these checks, which the model performs in-context learning on before responding. Additionally, the web IDE will have an analyze button, where the user can send their current contract to be compiled and run through CoinFabrik Scout. In the event that the user writes vulnerable code themselves, or the model does produce vulnerable code, this step identifies the vulnerability for the user to debug.

For targets, it may be hard to get an exact metric for our application. In the study, they gather vulnerable contracts, cut them off right before a vulnerability, and see what the model autocompletes at that point. Since we aren't working with a dataset of vulnerable contracts, we can't perform this same evaluation. Also, our application works iteratively, rather than as a single autocomplete of the whole contract: the user starts with either a template contract or a blank contract and uses our model to help build parts of it, such as small functions or parts of a larger function at each step.

What we can do is create a dataset of a couple hundred prompts acting as a user building a contract, and in Milestone 3, test these prompts to see how the model performs. We can have the prompts ask specifically for a complete function, such as a basic withdraw function for a vault smart contract. Since these will be complete functions, we can run them through CoinFabrik Scout to test for vulnerabilities. We can then mark how many results are vulnerable and derive an efficacy percentage. If this seems like a necessary step, I can update our Milestone 3 to include it. It will take time to create this testing dataset, so we will have to balance our development effort against how rigorous we want this testing to be.
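A sketch of what that efficacy measurement could look like; `generate` and `materialize` are hypothetical helpers (the RAG pipeline call, and a function that writes the generated function into a compilable harness crate), and the Scout invocation and exit-code behavior are assumptions:

```python
import subprocess

def scout_passes(crate_dir: str) -> bool:
    """True if CoinFabrik Scout reports no findings (assumed exit-code behavior)."""
    return subprocess.run(["cargo", "scout-audit"], cwd=crate_dir).returncode == 0

def efficacy(test_prompts, generate, materialize) -> float:
    """Fraction of generated functions that pass Scout.

    `generate` maps a prompt to code; `materialize` writes that code into
    a harness crate and returns the crate directory. Both are hypothetical.
    """
    passed = sum(scout_passes(materialize(generate(p))) for p in test_prompts)
    return passed / len(test_prompts)

# e.g. efficacy(prompts, generate=rag_pipeline, materialize=write_harness_crate)
```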
Regarding this system design, we have decided that we will start with one front-end Docker container and one back-end Docker container. The front-end will serve our React app and RAG-LLM pipeline, and the back-end will host our Rust environment. This is easier to scale for our scope at this time, and the user won't have to wait for a new container to initialize when they start a session. Our design will look like this:
If usage grows, we can scale horizontally by adding containers, with a load balancer across them to distribute the requests. We can scale in this way, instead of spinning up a new container per user. For the queue, we'll use RabbitMQ (open-source) to manage the requests coming in and out.
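For illustration, a minimal pika sketch of how compile requests could flow between the two containers via RabbitMQ — the queue name and payload shape are assumptions:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.queue_declare(queue="compile_jobs", durable=True)

# Front-end container: enqueue a user's contract for compilation.
channel.basic_publish(
    exchange="",
    routing_key="compile_jobs",
    body=b'{"session": "abc123", "source": "..."}',
    properties=pika.BasicProperties(delivery_mode=2),  # persist across broker restarts
)

# Back-end (Rust environment) container: consume and acknowledge jobs.
def handle(ch, method, properties, body):
    # ...compile the contract, run Scout, publish the result...
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="compile_jobs", on_message_callback=handle)
channel.start_consuming()
```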
@yu-jeffy have you considered developing a VS Code extension, rather than a new, fully-fledged IDE? My main concern here is that devs won't use it, because they won't be able to use their current extensions, shortcuts, and other custom setup related to their IDE.
@takahser
@yu-jeffy IMHO GitHub Copilot has proven that AI-driven IDE plugins are highly popular amongst devs, while I'm not sure the same can be said for a completely web-based alternative. I don't understand the need to build a PoC as a web app first, when the same tech stack can be used to write VS Code extensions.
Hi @takahser, we agree that adoption will be higher for something integrated directly into the IDE, and many devs are already familiar with using Copilot/Copilot Chat, so this will be easy to adopt. Additionally, we believe this will simplify our development process: with the extension, we won't need to spend time on a front-end web UI, the IDE itself, Dockerizing the application, or deploying to the cloud (which brings ongoing costs). We also do not need to worry about scaling, where we would need to spin up more containers and implement load balancing under high demand on the web app. Also, since the user will be running it locally, the extension can use their local environment to compile the contract and run CoinFabrik Scout, as opposed to our previous plan of having each user send their code to a Docker container to compile. I have updated the application to reflect these changes, and updated the UI mockup as well.
@yu-jeffy thx for the update. I think a VS Code extension makes much more sense. However, there are still a lot of open questions (see inline comments).
On top of that, I think it'd be good to focus a bit on the features that the extension will have. For example, GitHub Copilot comes with a variety of features, such as:
- explaining code
- translating code from one language to another
- learning from the dev's past inputs
- refactor code / making it more readable
- auto-detect and fix bugs
- clean up code (e.g. remove unused symbols)
- give step-by-step instructions for a specific task
- making the code more robust by adding error handling
- refactor code into smaller chunks to improve readability, reusability, and cohesion
- produce code comments that explain the code
I think it'd be good to add at least some of these or similar features.
| **0b.** | Documentation | Code comments. Documentation for the prototype architecture and the setup process, explaining how a user can run a local instance of the prototype RAG system with our initial data. |
| **0c.** | Testing Guide | Unit tests will be provided for the prototype, with a testing guide explaining how to run them. |
| **0d.** | Docker | N/A |
| 1. | Initial Prototype | Development of a basic LlamaIndex RAG system prototype integrated with `GPT-4`, using sentence embeddings. User can interact with the pipeline through the command line, interfacing with `GPT-4` with fetched documents from `Milvus and Weaviate`. |
I see a lot of GPT-related dependencies here. I just want to emphasise that any deliverable should be open-source and must NOT depend on any non-open-source technology. Quote from our README:
> All code produced as part of a grant must be open-sourced, and it must also not rely on closed-source software for full functionality. We prefer Apache 2.0, but GPLv3, MIT, or Unlicense are also acceptable.
Are there any proprietary components involved with this project?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are interfacing with the OpenAI API to utilize GPT-4. The rest of our pipeline, such as the dataset and vectorstore, is open source. Any code interfacing with the OpenAI API will be open source.
Would the usage of GPT-4 be an issue?
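For context, a sketch of how the single proprietary touchpoint might be isolated in an otherwise open-source LlamaIndex pipeline — the paths and query are illustrative, and the project's actual wiring may differ:

```python
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms import OpenAI

# Loader, index, and retrieval are all open source; only the LLM object
# dispatches to the proprietary GPT-4 API.
service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-4"))
documents = SimpleDirectoryReader("./ink_dataset").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

query_engine = index.as_query_engine()
print(query_engine.query("Write an ink! ERC-20 transfer function."))
```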
| **0c.** | Testing Guide | Unit tests will be provided for the prototype, with a testing guide explaining how to run them. |
| **0d.** | Docker | N/A |
| 1. | Initial Prototype | Development of a basic LlamaIndex RAG system prototype integrated with `GPT-4`, using sentence embeddings. User can interact with the pipeline through the command line, interfacing with `GPT-4` with fetched documents from `Milvus and Weaviate`. |
| 2. | Data Collection | Collection of a small set of `ink!` smart contracts for initial embedding and retrieval testing. Smart contracts will be converted from `.rs` files to `JSON`, with identifying metadata. |
> Smart contracts will be converted from `.rs` files to `JSON`, with identifying metadata.

How is this going to work? `.rs` files usually contain logic while `JSON` files are limited to storing data. How are you going to convert this, and how are you going to make sure that no information is lost?
We plan to use two escape sequences that replace line breaks and indentation, so that the `.rs` file can be converted into plaintext in a JSON field and restored to its original form if needed. For example, we can use `\n` for newlines and `\ind` for indentation, so that an indented line of code on a new line becomes:

`line one of code here \n \ind line two of code here`

This lets us parse the smart contracts into our vectorstore as JSON, as we have not found any Rust/ink! data loaders.
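A small round-trip sketch of this encoding as described — it assumes space-based indentation in fixed units, which real `.rs` files may not always follow:

```python
import json

NEWLINE, INDENT = r"\n", r"\ind"  # the two escape sequences described above

def encode_rs(source: str, indent_unit: str = "    ") -> str:
    """Flatten Rust source into a single-line string for a JSON field."""
    encoded = []
    for line in source.splitlines():
        stripped = line.lstrip(" ")
        depth = (len(line) - len(stripped)) // len(indent_unit)
        encoded.append(f"{INDENT} " * depth + stripped)
    return f" {NEWLINE} ".join(encoded)

def decode_rs(flat: str, indent_unit: str = "    ") -> str:
    """Restore the original .rs layout from the encoded field."""
    decoded = []
    for line in flat.split(f" {NEWLINE} "):
        depth = 0
        while line.startswith(f"{INDENT} "):
            line = line[len(INDENT) + 1:]
            depth += 1
        decoded.append(indent_unit * depth + line)
    return "\n".join(decoded)

source = "fn flip(&mut self) {\n    self.value = !self.value;\n}"
record = {"name": "flipper", "code": encode_rs(source)}
assert decode_rs(json.loads(json.dumps(record))["code"]) == source
```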
| **0c.** | Testing Guide | Guide on how to run tests on the VS Code extension. |
| **0d.** | Docker 1 | N/A |
| 1. | UI/UX Design | Design and development of UI for VS Code extension. Extension will reside in primary sidebar, with multiple resizable sections. |
| 2. | VS Code Extension | Creation of the core VS Code extension functionality, including integration with our RAG-LLM pipeline, code selection access, and local file system access. This includes Chat and Chat Settings functionality. Analysis and Templates functionality will be scaffolded (see Milestone 5). |
It'd be good to already have the extension in M1, so we have something to test. The ML-related deliverables in M1 and M2 are not very tangible by themselves.
| **0b.** | Documentation | Code comments. VS Code Extension UI components and functionality documented. Instructions for use documented. |
| **0c.** | Testing Guide | Guide on how to run tests on the VS Code extension. |
| **0d.** | Docker 1 | N/A |
| 1. | UI/UX Design | Design and development of UI for VS Code extension. Extension will reside in primary sidebar, with multiple resizable sections. |
I'd like to remind you that we don't support design efforts.
| **0d.** | Docker 1 | N/A |
| 1. | UI/UX Design | Design and development of UI for VS Code extension. Extension will reside in primary sidebar, with multiple resizable sections. |
| 2. | VS Code Extension | Creation of the core VS Code extension functionality, including integration with our RAG-LLM pipeline, code selection access, and local file system access. This includes Chat and Chat Settings functionality. Analysis and Templates functionality will be scaffolded (see Milestone 5). |
| 3. | Usage Testing | Comprehensive testing to ensure the VS Code extension is responsive and stable. All parts of functionality will be tested. |
We usually expect testing to be part of the development process of each component. Hence, it's not necessary to list it as a separate deliverable, since it's also difficult for us to verify.
Regarding the edits on the milestones, we'll change our timeline to have the extension completed first, then work on the ML/AI aspects after. We'll also fix the design and testing parts. When our changes are finished, we'll update the application again. For functionality of the extension, we plan to have:
We can add the chunking feature as well; we need to brainstorm how to break the contract up and how to show dividers between chunks to the user. Also, I wanted to reiterate a question from one of the comments: we are planning to use GPT-4 as the LLM in our pipeline. The entire pipeline will be open source, and the only proprietary part is calling the OpenAI API to interface with GPT-4. Would this be an issue in terms of the guidelines?
Hi @yu-jeffy. I would strongly recommend that you apply with this at the Decentralized Futures Program for two reasons: one, it avoids the issue of having closed-source dependencies (the program is a lot more flexible in that regard), and two, I believe this project only makes sense if it is maintained over a significant period of time. Substrate, ink!, and GPT change so quickly that I'm afraid the project would be useless if not properly maintained in about three months' time. The DF program is specifically for establishing new entities, and it would give you a chance to fund ink!jet's development and maintenance until the end of '24.
Hi @semuelle, in order for the project to have a smooth user experience, it has to depend on a closed-source LLM; running an open-source one ourselves would introduce too much compute cost and latency to be feasible. The longer timeline would let us update the system in the event of dependency changes, and also gather community feedback and implement new features as we go. We'll go ahead and apply for the Decentralized Futures Program instead. If there are no more updates for this application, the pull request can be closed. Thank you for your help!
@yu-jeffy thanks for the update. I'll go ahead and close this PR, since you're applying at the DFP instead.
Project Abstract
`Ink!jet` is a platform designed to use augmented generative AI with the intent of improving the development lifecycle of `ink!` smart contracts within the Polkadot ecosystem. Recognizing the technically intricate nature of smart contracts and the high level of expertise they demand, our project aims to simplify these complexities, thereby democratizing the creation process.

Existing generative AI models have limitations in the amount of both `Rust` and `ink!` code in their training data. Our platform uses a retrieval-augmented generation pipeline with datasets of existing `ink!` smart contracts to bridge this knowledge gap. By injecting vectorstore-retrieved code into prompts, the system utilizes in-context learning to improve response quality. Our goal is to enhance the productivity of existing developers through bootstrapping and assisted code iteration, while simultaneously lowering the barrier of entry for new developers.