generated from aboutcode-org/skeleton
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Signed-off-by: Philippe Ombredanne <[email protected]>
- Loading branch information
1 parent
7b5b7f8
commit 4152e61
Showing
5 changed files
with
186 additions
and
49 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,3 @@ | ||
The following organizations or individuals have contributed to this repo: | ||
|
||
- | ||
- AboutCode |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,62 +1,203 @@ | ||
A Simple Python Project Skeleton | ||
================================ | ||
This repo attempts to standardize the structure of the Python-based project's | ||
repositories using modern Python packaging and configuration techniques. | ||
Using this `blog post`_ as inspiration, this repository serves as the base for | ||
all new Python projects and is mergeable in existing repositories as well. | ||
AI-Generated Code Search | ||
================================ | ||
|
||
``Search, detect, and identify AI-generated code.`` | ||
|
||
The AI-Generated Code Search project provides open source tools to find code that may have been | ||
generated using LLMs and GPT tools. | ||
|
||
Generative AI engines and Large Language Models (LLMs) are emerging as viable tools for software | ||
developers to automate writing code. These engines and LLMs are trained on publicly available, free | ||
and open source (FOSS) code. | ||
|
||
AI-generated code can inherit the license and vulnerabilities of the FOSS code used for its | ||
training. It is essential and urgent to identify AI-generated source code, as it threatens the | ||
foundation of open source development and software development and raises major ethical, legal and | ||
security questions. | ||
|
||
Based on the AbouCode track record of creating industry-leading FOSS code origin analysis tools for | ||
license and security, this project delivers a new approach to identify and detect if AI-generated | ||
code is derived from existing FOSS with a new code fragments approximate similarity search. | ||
|
||
We believe that AI-generated code identification is essential to ensure responsible use of that code | ||
while enjoying the productivity gains from Generative AI for code. There is a massive potential for | ||
misuse, malignant or illegal use of such code, and identifying AI-generated code will enable safer, | ||
efficient and responsible use of GenAI to help build better software for the next generation | ||
internet, faster and more efficiently. | ||
|
||
|
||
Problem | ||
------------- | ||
|
||
|
||
FOSS is the foundation of all modern software. It is imperative to know the origin of reused FOSS | ||
code and its vulnerabilities and licenses, especially for software supply chain security and | ||
cybersecurity regulations, like SBOM mandates and Europe's upcoming CRA. | ||
|
||
This applies to AI-generated code derived from FOSS code. The problem is so acute that several large | ||
companies have issued policies prohibiting their programmers from using AI code generation tools. | ||
|
||
Identifying AI-generated code requires matching using a large index of FOSS code. Scale is a | ||
significant problem because incumbents need ever larger code indexes for accurate results, leading | ||
to slower search queries and wasteful energy infrastructures. | ||
|
||
|
||
Existing approaches identify code fragments only exactly and cannot work with AI-generated code as | ||
each generation may yield different code, resulting in false negatives and false positives. All of | ||
this results in serious concerns for organizations to use AI-generated code. | ||
|
||
|
||
Goals | ||
------------ | ||
|
||
This project goal is to deliver a reusable open source library and the indexing code to create an | ||
open dataset to identify if source code is AI-generated and report which FOSS project it derives | ||
from. | ||
|
||
The intent is that AI-Generated Code Search will enhance trust for users searching for code on the | ||
internet. This project aims to provide information and validation on the derivative open source | ||
origin of AI-generated code to identify its vulnerabilities and licenses to mitigate any software | ||
supply chain integrity or security risks associated with using AI-generated code. | ||
|
||
We hope that this solution will increase users' trust of using both AI-generated code found on the | ||
web and code created by a Generative AI engine or LLM. | ||
|
||
|
||
Solution | ||
--------- | ||
|
||
We need to provide quality and trustworthy approximate code fragments matching on smaller indexes to | ||
accommodate the variety of AI-generated code and the growing volume of open source code used to | ||
train the backing LLMs. | ||
|
||
The ambition of this project is to provide a new approach to code matching with locality-sensitive | ||
hashing (LSH) using random projections tunable for precision and recall, and avoid an ever | ||
increasing code index size. This will make it practical for users to identify AI-generated code at | ||
scale. | ||
|
||
Providing a practical solution to discover AI-generated code fragments will improve the | ||
trustworthiness of both AI-generated and non-generated code - and open source code in particular - | ||
by reporting its AI-generated status and ultimate FOSS origin. | ||
|
||
|
||
This project will enable trustworthy, safer and efficient usage of LLM-based code generation with | ||
this critical knowledge, improving overall programmer productivity and adoption of responsible AI- | ||
code generation tools. | ||
|
||
|
||
|
||
|
||
Approach and Design | ||
----------------------- | ||
|
||
Existing code matching tools only match fragments exactly with low recall results, because exact | ||
matching requires indexing the largest possible number of fragments to avoid false negative matches, | ||
and this volume leads to more false positives. Exact matching does not work for AI-generated code | ||
where each generation may have small variations given the same input prompt. | ||
|
||
Our approach for matching code fragments consists of these three elements: | ||
|
||
1. A content-defined chunking algorithm to split code into fragments - an improved alternative to | ||
the common winnowing. | ||
|
||
2. The fingerprinting of code with a locality sensitive hashing (LSH) function for approximate | ||
matching of fragments. This fingerprint scheme embeds multiple precision in a single bit string to | ||
tune the search precision and organize the search in rounds of progressively increasing precision. | ||
|
||
3. An indexing-time matching with each new fragment only added to the index if it is not matchable | ||
in the index, avoiding large duplications of FOSS code. | ||
|
||
|
||
Implementation | ||
---------------------------- | ||
|
||
|
||
Existing code fragment search solutions are either closed data, expensive proprietary closed source | ||
solutions, or use code indexes that are too big to share. None are designed or adapted to support | ||
AI-generated code search. Because of their costs, none are practical or accessible for SMEs or | ||
individuals because they are too expensive to acquire and operate. | ||
|
||
This project integrates with the open AboutCode stack's existing extensive and open code analysis | ||
capabilities to enable: | ||
|
||
1. Holistic and comprehensive knowledge of software code origin, including AI-generated code | ||
|
||
2. Confidence that the licensing and known vulnerabilities of the whole code is tracked and managed | ||
|
||
3. Direct support for CRA compliance, and the implied mandate to track code origin and report | ||
security issues for software code | ||
|
||
|
||
These tools are designed to be used either: | ||
|
||
- As the building blocks for a larger solution, or | ||
- As an integrated solution with the open source AboutCode stack. | ||
|
||
|
||
Roadmap | ||
-------------- | ||
|
||
The high level plan for this project is to: | ||
|
||
.. _blog post: https://blog.jaraco.com/a-project-skeleton-for-python-projects/ | ||
- Design and implement core fingerprinting and content-defined chunking algorithms | ||
- Execute evaluation and tuning of these algorithms | ||
- Package these algorithms in a reusable library | ||
- Design and implement index storage data structures | ||
- Implement efficient hamming distance fingerprint matching, e.g., the core search | ||
- Create AI-generated code test dataset. Reuse existing dataset where relevant | ||
- Create reference dataset for indexing (reusing PurlDB and SWH) | ||
- Create indexing pipeline and REST API with index-time matching | ||
- Implement search results ranking procedure | ||
- Create searching pipeline and REST API | ||
- Create search query client to search a whole codebase | ||
- Execute at-scale evaluation and tuning campaigns of end-to-end solution | ||
- Package and document library and whole solution for easy deployment and reuse | ||
- Deploy public demo system | ||
- Present at FOSDEM and webinars for community dissemination | ||
|
||
|
||
Usage | ||
===== | ||
|
||
A brand new project | ||
------------------- | ||
.. code-block:: bash | ||
Acknowledgements, Funding, Support and Sponsoring | ||
-------------------------------------------------------- | ||
|
||
git init my-new-repo | ||
cd my-new-repo | ||
git pull [email protected]:nexB/skeleton | ||
|europa| | ||
|
||
|ngisearch| | ||
|
||
# Create the new repo on GitHub, then update your remote | ||
git remote set-url origin [email protected]:nexB/your-new-repo.git | ||
Funded by the European Union. Views and opinions expressed are however those of the author(s) only | ||
and do not necessarily reflect those of the European Union or European Commission. Neither the | ||
European Union nor the granting authority can be held responsible for them. Funded within the | ||
framework of the NGI Search project under grant agreement No 101069364 | ||
|
||
From here, you can make the appropriate changes to the files for your specific project. | ||
|
||
Update an existing project | ||
--------------------------- | ||
.. code-block:: bash | ||
This project is also supported and sponsored by: | ||
|
||
cd my-existing-project | ||
git remote add skeleton [email protected]:nexB/skeleton | ||
git fetch skeleton | ||
git merge skeleton/main --allow-unrelated-histories | ||
- Generous support and contributions from users like you! | ||
- Microsoft and Microsoft Azure | ||
- AboutCode ASBL | ||
|
||
This is also the workflow to use when updating the skeleton files in any given repository. | ||
|
||
More usage instructions can be found in ``docs/skeleton-usage.rst``. | ||
|aboutcode| | ||
|
||
|
||
Release Notes | ||
============= | ||
.. |ngisearch| image:: https://www.ngisearch.eu/download/FlamingoThemes/NGISearch2/NGISearch_logo_tag_icon.svg?rev=1.1 | ||
:target: https://www.ngisearch.eu/ | ||
:height: 50 | ||
:alt: NGI logo | ||
|
||
- 2023-07-18: | ||
- Add macOS-13 job in azure-pipelines.yml | ||
|
||
- 2022-03-04: | ||
- Synchronize configure and configure.bat scripts for sanity | ||
- Update CI operating system support with latest Azure OS images | ||
- Streamline utility scripts in etc/scripts/ to create, fetch and manage third-party dependencies | ||
There are now fewer scripts. See etc/scripts/README.rst for details | ||
.. |ngi| image:: https://ngi.eu/wp-content/uploads/thegem-logos/logo_8269bc6efcf731d34b6385775d76511d_1x.png | ||
:target: https://www.ngi.eu/ngi-projects/ngi-search/ | ||
:height: 37 | ||
:alt: NGI logo | ||
|
||
- 2021-09-03: | ||
- ``configure`` now requires pinned dependencies via the use of ``requirements.txt`` and ``requirements-dev.txt`` | ||
- ``configure`` can now accept multiple options at once | ||
- Add utility scripts from scancode-toolkit/etc/release/ for use in generating project files | ||
- Rename virtual environment directory from ``tmp`` to ``venv`` | ||
- Update README.rst with instructions for generating ``requirements.txt`` and ``requirements-dev.txt``, | ||
as well as collecting dependencies as wheels and generating ABOUT files for them. | ||
.. |europa| image:: etc/eu.funded.png | ||
:target: http://ec.europa.eu/index_en.htm | ||
:height: 120 | ||
:alt: Europa logo | ||
|
||
- 2021-05-11: | ||
- Adopt new configure scripts from ScanCode TK that allows correct configuration of which Python version is used. | ||
.. |aboutcode| image:: https://aboutcode.org/wp-content/uploads/2023/10/AboutCode.svg | ||
:target: https://aboutcode.org/ | ||
:height: 30 | ||
:alt: AboutCode logo |
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
This file was deleted.
Oops, something went wrong.
This file was deleted.
Oops, something went wrong.