A training dataset generator for Guesslang's deep learning model.
The purpose of GuesslangTools is to find and download a million source code files. These files are used to train, evaluate and test Guesslang, a deep learning programming language detection tool.
The files are retrieved from more than 100k public open source GitHub repositories.
The million source code files used to feed Guesslang are generated as follows:
- Download information about open source GitHub repositories from the Libraries.io Open Source Repository and Dependency Metadata.
- Randomly select the repositories that will be used to create Guesslang's training, validation and test datasets.
- Download each selected repository.
- Extract some source code files from the downloaded repositories.
This workflow is fully automated but takes several hours to complete, especially the download part. Fortunately, it can be stopped and resumed at any moment.
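The random selection step of the workflow above can be illustrated with a minimal sketch. This is not GuesslangTools' actual code: the function name `split_repositories`, the dataset sizes, the seed parameter and the repository names are all hypothetical and chosen only for illustration.

```python
import random
from typing import Dict, List


def split_repositories(
    repositories: List[str],
    sizes: Dict[str, int],
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Randomly pick repositories and assign each one to exactly one dataset."""
    total = sum(sizes.values())
    if total > len(repositories):
        raise ValueError("not enough repositories for the requested sizes")

    # Assumption: a fixed seed is used so that a rerun picks the same repositories.
    rng = random.Random(seed)
    selected = rng.sample(repositories, total)  # sampling without replacement

    datasets = {}
    start = 0
    for name, size in sizes.items():
        # The slices never overlap, so no repository is shared between datasets.
        datasets[name] = selected[start:start + size]
        start += size
    return datasets


# Hypothetical repository names, for illustration only.
repos = [f"user{i}/project{i}" for i in range(1000)]
split = split_repositories(repos, {"training": 80, "validation": 10, "test": 10})
```

Sampling once without replacement and then slicing the result is one simple way to guarantee that the three groups are disjoint.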
GuesslangTools ensures that:
- Each source code file in the datasets is unique.
- There are no empty files.
- Only text files are retrieved; binary files are skipped.
- All the files are converted to UTF-8 encoding.
- Each selected repository is associated with only one dataset (training, validation or test). Files from a training repository can therefore only appear in the training dataset, and the same holds for the validation and test datasets.
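The file-level guarantees listed above could be checked along the following lines. This is a sketch under stated assumptions, not the way GuesslangTools implements them: the function name `clean_file`, the NUL-byte test for binary content, the SHA-256 duplicate check and the list of fallback encodings are all illustrative choices.

```python
import hashlib
from typing import Optional, Set

# Assumption: encodings to try, in order. The real tool may detect encodings differently.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252")


def clean_file(raw: bytes, seen_hashes: Set[str]) -> Optional[bytes]:
    """Return the content re-encoded as UTF-8, or None if the file must be skipped."""
    # Skip empty files (assumption: whitespace-only files count as empty).
    if not raw.strip():
        return None

    # Heuristic: a NUL byte is treated as a sign of binary content.
    if b"\x00" in raw:
        return None

    # Decode with the first candidate encoding that accepts the bytes.
    text = None
    for encoding in CANDIDATE_ENCODINGS:
        try:
            text = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    if text is None:
        return None

    utf8_content = text.encode("utf-8")

    # Skip files whose content has already been seen.
    digest = hashlib.sha256(utf8_content).hexdigest()
    if digest in seen_hashes:
        return None
    seen_hashes.add(digest)
    return utf8_content


seen: Set[str] = set()
first = clean_file(b"print('hi')\n", seen)      # kept
duplicate = clean_file(b"print('hi')\n", seen)  # skipped, same content
binary = clean_file(b"\x00\x01\x02", seen)      # skipped, contains a NUL byte
converted = clean_file(b"caf\xe9", seen)        # cp1252 input, returned as UTF-8
```

Hashing the normalized UTF-8 content rather than the raw bytes means two copies of the same text stored in different encodings are still treated as duplicates.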
- GuesslangTools requires Python 3.6 or later.
- At least 16GB of total system memory is recommended.
- At least 150GB of free storage space is recommended.
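Before starting a long run, the Python version and the free storage space can be checked with a short script. This is only a convenience sketch: `check_requirements` is a hypothetical helper, not part of GuesslangTools, and it reads "150GB" as decimal gigabytes. The memory recommendation is not checked because the standard library offers no portable way to query total system memory.

```python
import shutil
import sys
from typing import List

MIN_PYTHON = (3, 6)
MIN_FREE_BYTES = 150 * 10**9  # assumption: 150GB read as decimal gigabytes


def check_requirements(path: str = ".") -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    if sys.version_info < MIN_PYTHON:
        problems.append("Python 3.6 or later is required")

    # Free space on the filesystem that contains `path`.
    free_bytes = shutil.disk_usage(path).free
    if free_bytes < MIN_FREE_BYTES:
        free_gb = free_bytes / 10**9
        problems.append(
            "only {:.0f}GB of free storage, at least 150GB is recommended".format(free_gb)
        )

    return problems


for problem in check_requirements("."):
    print("warning:", problem)
```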
You can install GuesslangTools from the source code by running:
pip install .
You can run GuesslangTools in a terminal as follows:
gltool /path/to/generated_datasets/
Several options and hacks are available to fine-tune the size and the diversity of the generated datasets. To list all the options, run:
gltool --help
- Guesslang icon created with AndroidAssetStudio
- Repository dataset downloaded from Libraries.io Open Source Repository and Dependency Metadata
- SQL repositories dataset retrieved from The Public Git Archive
- GuesslangTools - Copyright (c) 2020 Y. SOMDA, MIT License