A tool for querying the Common Crawl CDX Index. This repository includes both a Python and a Rust version. The command-line syntax is identical in both.
- Clone this repository
- To run the Rust version, compile and run via:
$ cargo build --release
$ cd target/release
$ chmod +x scdx
$ ./scdx --sleep 2 --domain commoncrawl.org --crawls CC-MAIN-2021-04 CC-MAIN-2024-10
$ ./scdx -s 10 -d '*.wikipedia.org' -c CC-MAIN-2023-50
$ ./scdx -l -d apple.com
The program will display a progress bar and write an output file named with a timestamp (e.g. 2024-02-27_18-34-50_output.jsonl) to the working directory, unless the -o or --output option is used.
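As an illustration of the default file name, the timestamp pattern in the example above can be reproduced with a few lines of Python. This is a minimal sketch, not the code used by either version: the function name default_output_name is hypothetical, and the format string is inferred from the example file name only.

```python
from datetime import datetime
from typing import Optional


def default_output_name(now: Optional[datetime] = None) -> str:
    """Build a name shaped like 2024-02-27_18-34-50_output.jsonl.

    Hypothetical helper: the pattern is inferred from the README example.
    """
    # Fall back to the current local time when no timestamp is supplied.
    now = now or datetime.now()
    return now.strftime("%Y-%m-%d_%H-%M-%S") + "_output.jsonl"


if __name__ == "__main__":
    print(default_output_name())
```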
The default sleep time is 2 seconds. Please be polite! Polling multiple times a second will make the index server sad. See the CCF system status here.
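The idea behind the sleep option can be sketched as a loop that pauses between consecutive requests. This is an assumption about the general approach rather than the actual implementation: query_crawls, fetch, and sleeper are hypothetical names, and fetch stands in for whatever function performs one index request.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def query_crawls(
    crawls: Iterable[str],
    fetch: Callable[[str], T],
    sleep_seconds: float = 2.0,
    sleeper: Callable[[float], None] = time.sleep,
) -> List[T]:
    """Call fetch once per crawl, pausing between consecutive calls."""
    results: List[T] = []
    for i, crawl in enumerate(crawls):
        if i > 0:
            # Wait before every request except the first one.
            sleeper(sleep_seconds)
        results.append(fetch(crawl))
    return results
```

Passing the sleep function as a parameter keeps the loop easy to exercise without real delays or network access.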
If no crawls are specified, all crawls will be queried. Use the -l or --latest flag to only query the latest crawl.
The API used supports two methods of wildcarding, like the (more advanced and mature) cdx-toolkit by Greg Lindahl.
- Prefixed asterisk
  The query *.example.com, in CDX jargon, sets matchType='domain', and will return captures for blog.example.com, support.example.com, etc.
- Appended asterisk
  The query example.com/* will return captures for any page on example.com.
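The two wildcard forms can be illustrated with a short Python sketch that classifies a pattern and builds a query URL. Several things here are assumptions rather than details taken from this tool: the helper names match_type and cdx_query_url are hypothetical, the 'prefix' and 'exact' labels follow the usual CDX server convention, and the endpoint layout https://index.commoncrawl.org/<crawl>-index is assumed.

```python
from urllib.parse import urlencode


def match_type(pattern: str) -> str:
    """Return the match type a wildcard pattern implies (assumed convention)."""
    if pattern.startswith("*."):
        return "domain"  # *.example.com: the domain and its subdomains
    if pattern.endswith("*"):
        return "prefix"  # example.com/*: every URL under that prefix
    return "exact"       # no wildcard: that URL only


def cdx_query_url(crawl: str, pattern: str) -> str:
    """Build a query URL for one crawl (endpoint layout is an assumption)."""
    params = {"url": pattern, "output": "json"}
    # urlencode percent-encodes the asterisk and slash in the pattern.
    return f"https://index.commoncrawl.org/{crawl}-index?" + urlencode(params)


if __name__ == "__main__":
    for p in ("*.example.com", "example.com/*", "example.com"):
        print(p, match_type(p), cdx_query_url("CC-MAIN-2023-50", p))
```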
The Python version uses tqdm to display a progress bar, and the Rust version uses indicatif.