- Prepare a Python environment and install the package.
- For development, use:

```bash
pip install -r requirements.dev.txt
```
- Copy `sample.env` to `dev.env`, fill it with the required variables, and source it:
```bash
export GITHUB_TOKEN="<your GitHub token>"
export LANGUAGES_DIR="<directory with cloned language repos>"
export MIRROR_CRAWL_INTERVAL_SECONDS=1
export MIRROR_CRAWL_MIN_RATE_LIMIT=500  # for search, a lower value such as 5 is better
export MIRROR_CRAWL_BATCH_SIZE="<how many records to crawl before saving a batch>"
export MIRROR_CRAWL_DIR="<where to save crawled data>"
export MIRROR_LANGUAGES_FILE="<json file with languages>"
export SNIPPETS_DIR="<dir for the snippets dataset>"
```
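Before running anything, it can help to check that all of the variables above are actually set. A minimal sketch; the names are exactly the ones listed above:

```python
import os

# Names taken from the variable list above.
REQUIRED = [
    "GITHUB_TOKEN",
    "LANGUAGES_DIR",
    "MIRROR_CRAWL_INTERVAL_SECONDS",
    "MIRROR_CRAWL_MIN_RATE_LIMIT",
    "MIRROR_CRAWL_BATCH_SIZE",
    "MIRROR_CRAWL_DIR",
    "MIRROR_LANGUAGES_FILE",
    "SNIPPETS_DIR",
]

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```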
- To avoid being blocked by GitHub, keep a rate-limit watcher running:

```bash
watch -d -n 5 'curl -s https://api.github.com/rate_limit -H "Authorization: Bearer $GITHUB_TOKEN" -H "Accept: application/vnd.github.v3+json"'
```
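The same check can be done from Python. A minimal sketch using `requests`; the endpoint and the `resources.core.remaining` field are GitHub's documented rate-limit API, while the helper itself is not part of `mirror`:

```python
import os

import requests

def remaining_core_calls() -> int:
    """Return the number of remaining core API calls for this token."""
    resp = requests.get(
        "https://api.github.com/rate_limit",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github.v3+json",
        },
        timeout=10,
    )
    resp.raise_for_status()
    # The documented response has resources.core.remaining/limit/reset.
    return resp.json()["resources"]["core"]["remaining"]

print(remaining_core_calls())
```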
```bash
python -m mirror.cli --help
```

```
clone              Clone repos from search api to output dir.
commits            Read repos json file and upload all commits for that...
crawl              Processes arguments as parsed from the command line and...
generate_snippets  Create snippets dataset from cloned repos
nextid             Prints ID of most recent repository crawled and written...
sample             Writes repositories sampled from a crawl directory to...
search             Crawl via search api.
validate           Prints ID of most recent repository crawled and written...
```
Run the `crawl` command to extract metadata for all repositories and save it in `.json` files:
```bash
python -m mirror.cli crawl \
    --crawldir $MIRROR_CRAWL_DIR \
    --interval $MIRROR_CRAWL_INTERVAL_SECONDS \
    --min-rate-limit $MIRROR_CRAWL_MIN_RATE_LIMIT \
    --batch-size $MIRROR_CRAWL_BATCH_SIZE
```
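The exact layout of the saved batches is up to the crawler; as a minimal sketch, assuming each batch is a `.json` file holding a list of repository metadata objects, the output can be inspected like this:

```python
import json
import os
from pathlib import Path

crawl_dir = Path(os.environ["MIRROR_CRAWL_DIR"])

# Count repositories across all saved batch files.
total = 0
for batch_file in sorted(crawl_dir.glob("*.json")):
    with batch_file.open() as f:
        repos = json.load(f)  # assumed: a list of repo metadata objects
    print(f"{batch_file.name}: {len(repos)} repositories")
    total += len(repos)
print(f"total: {total}")
```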
If you only need a small pool of repositories for analysis, you can set more precise criteria via the `search` command:

```bash
python -m mirror.cli search --crawldir "$MIRROR_CRAWL_DIR/search" -L "python" -s ">500" -l 5
```
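For reference, the query above roughly corresponds to GitHub's repository search API. A minimal direct query for comparison; the mapping of `-L`/`-s`/`-l` onto the `language:`/`stars:`/`per_page` parameters is an assumption, not taken from `mirror`:

```python
import os

import requests

# "language:python stars:>500" mirrors -L "python" -s ">500" above;
# per_page=5 stands in for -l 5.
resp = requests.get(
    "https://api.github.com/search/repositories",
    params={"q": "language:python stars:>500", "per_page": 5},
    headers={
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github.v3+json",
    },
    timeout=10,
)
resp.raise_for_status()
for repo in resp.json()["items"]:
    print(repo["full_name"], repo["stargazers_count"])
```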
The `clone` command uses the standard `git clone` to clone the repositories from a search (or crawl) result to the local machine.

Clone from search:

```bash
python -m mirror.cli clone -d $LANGUAGES_DIR -r "$MIRROR_CRAWL_DIR/search"
```

Clone from crawl:

```bash
python -m mirror.cli clone -d $LANGUAGES_DIR -r "$MIRROR_CRAWL_DIR"
```
Structure of the `$LANGUAGES_DIR` directory:

```
$LANGUAGES_DIR
  > language 1
    > repo 1
    > repo 2
    ...
  > language 2
    > repo 1
    > repo 2
    ...
  ...
```
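A small sketch for walking this layout; it assumes only the two-level language/repo structure shown above:

```python
import os
from pathlib import Path

languages_dir = Path(os.environ["LANGUAGES_DIR"])

# First level: one directory per language; second level: one per repo.
for language_dir in sorted(p for p in languages_dir.iterdir() if p.is_dir()):
    repos = [p.name for p in language_dir.iterdir() if p.is_dir()]
    print(f"{language_dir.name}: {len(repos)} repos")
```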
It is also possible to fetch popular repositories with Python code; see the example in `ex_clone.py`.
The `commits` command extracts all commits from the repositories and saves a `.json` file with the commits of each repository:

```bash
python -m mirror.cli commits -d "$MIRROR_CRAWL_DIR/commits" -l 5 -r "$MIRROR_CRAWL_DIR/search"
```
The following utility creates a `.csv` file with a flattened JSON structure:

```bash
python -m mirror.github.utils --json-files-folder "$MIRROR_CRAWL_DIR" --output-csv "$MIRROR_CRAWL_DIR/output.csv" --command commits
```
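For reference, this kind of flattening can be reproduced with `pandas.json_normalize`. The sketch below only illustrates the idea, it is not the code that `mirror.github.utils` runs, and the file layout it assumes (one list of records per `.json` file under `commits/`) is a guess:

```python
import json
import os
from pathlib import Path

import pandas as pd

# Collect the records from the commit .json files produced above.
records = []
for json_file in Path(os.environ["MIRROR_CRAWL_DIR"]).glob("commits/*.json"):
    with json_file.open() as f:
        records.extend(json.load(f))  # assumed: each file holds a list

# json_normalize flattens nested objects into dotted column names,
# e.g. {"commit": {"author": {...}}} becomes "commit.author.name".
df = pd.json_normalize(records)
df.to_csv("output.csv", index=False)
```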
Snippets can also be generated by invoking the module directly:

```bash
python -m mirror.github.generate_snippets -r "$OUTPUT_DIR" -f "examples/languages.json" -L "$LANGUAGES_DIR"
```
- Create a search result:

```bash
python -m mirror.cli search -d "$MIRROR_CRAWL_DIR/search" -f $MIRROR_LANGUAGES_FILE -s ">500" -l 5
```
- Clone the repos from the search result. This takes time, and it is a good idea not to send the stdout of `git clone` to the terminal:

```bash
python -m mirror.cli clone -d $LANGUAGES_DIR -r "$MIRROR_CRAWL_DIR/search"
```
- Generate snippets:

```bash
python -m mirror.cli generate_snippets -d $SNIPPETS_DIR -r $LANGUAGES_DIR
```

This returns an SQLite database with the snippets and their metadata.
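To inspect the result, the database can be opened with Python's built-in `sqlite3`. The file name `snippets.db` below is an assumption; list the tables via `sqlite_master` rather than guessing the schema:

```python
import os
import sqlite3

# Assumed database file name inside $SNIPPETS_DIR; adjust to what
# generate_snippets actually produced.
db_path = os.path.join(os.environ["SNIPPETS_DIR"], "snippets.db")
conn = sqlite3.connect(db_path)

# List the real tables first; the schema is defined by generate_snippets.
for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type='table'"):
    print(name)

conn.close()
```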
To process only part of the repositories from an allrepos result, the `clone` and `commits` commands accept the optional `--start-id` and `--end-id` arguments; both parameters must be set together. These IDs make it possible to process a slice of the repositories from the allrepos result at a time, as in the sketch below.
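A minimal sketch of such slicing, here driving the `commits` command from Python; the ID boundaries and the per-slice loop are illustrative, only `--start-id`/`--end-id` come from the note above:

```python
import os
import subprocess

crawl_dir = os.environ["MIRROR_CRAWL_DIR"]

# Process an allrepos result in two slices; --start-id and --end-id
# must always be passed together. The ID boundaries are illustrative.
for start_id, end_id in [(1, 50_000), (50_001, 100_000)]:
    subprocess.run(
        [
            "python", "-m", "mirror.cli", "commits",
            "-d", f"{crawl_dir}/commits",
            "-r", crawl_dir,
            "--start-id", str(start_id),
            "--end-id", str(end_id),
        ],
        check=True,
    )
```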