add usage harness in Python
monperrus committed May 18, 2024
1 parent af575b9 commit 923130b
Showing 2 changed files with 38 additions and 7 deletions.
23 changes: 16 additions & 7 deletions README.md
@@ -6,13 +6,15 @@ This repository contains a list of HTTP user-agents used by robots, crawlers,
* Go package: <https://pkg.go.dev/github.com/monperrus/crawler-user-agents>
* PyPi package: <https://pypi.org/project/crawler-user-agents/>

Each `pattern` is a regular expression. It should work out-of-the-box with your favorite regex library.

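For instance, a minimal standalone check in Python (the two entries below are an illustrative subset in the same shape as `crawler-user-agents.json`, not the full list):

```python
import re

# Illustrative subset of entries, in the same shape as crawler-user-agents.json
entries = [
    {"pattern": "Googlebot\\/"},
    {"pattern": "bingbot"},
]

def is_crawler_ua(user_agent):
    """Return True if any crawler pattern matches the User-Agent string."""
    return any(re.search(entry["pattern"], user_agent) for entry in entries)

print(is_crawler_ua("Mozilla/5.0 (compatible; Googlebot/2.1)"))    # True
print(is_crawler_ua("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # False
```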
## Install

### Direct download

Download the [`crawler-user-agents.json` file](https://raw.githubusercontent.com/monperrus/crawler-user-agents/master/crawler-user-agents.json) from this repository directly.

### Npm / Yarn
### JavaScript

crawler-user-agents is deployed on npmjs.com: <https://www.npmjs.com/package/crawler-user-agents>

@@ -31,14 +33,21 @@ const crawlers = require('crawler-user-agents');
console.log(crawlers);
```

## Usage
### Python

Each `pattern` is a regular expression. It should work out-of-the-box with your favorite regex library:
Install with `pip install crawler-user-agents`.

Then:

```python
import crawleruseragents

if crawleruseragents.is_crawler("googlebot/"):
    ...  # do something
```
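When screening many requests, one option is to compile all patterns into a single alternation so each User-Agent string is scanned once. A minimal sketch, using an illustrative subset of patterns (wrapping each in a non-capturing group keeps alternations inside a pattern from leaking into the combined regex):

```python
import re

# Illustrative subset of patterns from crawler-user-agents.json
patterns = ["Googlebot\\/", "bingbot", "DuckDuckBot"]

# One compiled alternation: a single scan instead of N separate searches
combined = re.compile("|".join(f"(?:{p})" for p in patterns), re.IGNORECASE)

def is_crawler_ua(user_agent):
    return combined.search(user_agent) is not None
```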

### Go

* JavaScript: `if (RegExp(entry.pattern).test(req.headers['user-agent'])) { ... }`
* PHP: add a slash before and after the pattern: `if (preg_match('/'.$entry['pattern'].'/', $_SERVER['HTTP_USER_AGENT'])): ...`
* Python: `if re.search(entry['pattern'], ua): ...`
* Go: use [this package](https://pkg.go.dev/github.com/monperrus/crawler-user-agents),
Go: use [this package](https://pkg.go.dev/github.com/monperrus/crawler-user-agents),
it provides global variable `Crawlers` (it is synchronized with `crawler-user-agents.json`),
functions `IsCrawler` and `MatchingCrawlers`.

22 changes: 22 additions & 0 deletions __init__.py
@@ -0,0 +1,22 @@
import json
import re
from importlib import resources

import crawleruseragents


def load_json():
    # Read crawler-user-agents.json bundled with the package
    return json.loads(resources.read_text(crawleruseragents, "crawler-user-agents.json"))


DATA = load_json()


def is_crawler(s):
    # Check the User-Agent string against each pattern in turn
    for entry in DATA:
        if re.search(entry["pattern"], s, re.IGNORECASE):
            return True
    return False


def is_crawler2(s):
    # Variant: compile all patterns into one alternation and scan once
    regexp = re.compile("|".join(entry["pattern"] for entry in DATA), re.IGNORECASE)
    return regexp.search(s) is not None
