crawler fails on content-type
I have a website that returns the following header:

 'Content-Type': 'text/html; charset=utf-8'
orangewise authored and bigadsoleiman committed Jan 23, 2024
1 parent 5b1b997 commit 1aa3280
Showing 1 changed file with 1 addition and 1 deletion.
@@ -101,7 +101,7 @@ def parse_url(url: str):
     base_url = f"{root_url_parse.scheme}://{root_url_parse.netloc}"

     response = requests.get(url, timeout=20)
-    if response.headers["Content-Type"] != "text/html":
+    if "text/html" not in response.headers["Content-Type"]:
         raise Exception(f"Invalid content type {response.headers['Content-Type']}")
     soup = BeautifulSoup(response.content, "html.parser")
     content = soup.text
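
The substring check in this change accepts headers such as `text/html; charset=utf-8`, but it is still case-sensitive and raises `KeyError` when the header is missing. As a possible follow-up, and not part of this commit, a stricter check could compare only the media type. The helper names `media_type` and `is_html` below are hypothetical and do not exist in the repository; this is a minimal sketch using only the standard library.

```python
from typing import Optional


def media_type(content_type_header: Optional[str]) -> str:
    """Return the lowercase media type without parameters such as charset.

    Hypothetical helper: "text/html; charset=utf-8" -> "text/html".
    A missing or empty header yields an empty string.
    """
    if not content_type_header:
        return ""
    # Everything after the first ";" is a parameter (e.g. charset).
    return content_type_header.split(";", 1)[0].strip().lower()


def is_html(content_type_header: Optional[str]) -> bool:
    """True only when the media type is exactly text/html (case-insensitive)."""
    return media_type(content_type_header) == "text/html"
```

Inside `parse_url`, this could be used as `if not is_html(response.headers.get("Content-Type")):`, which assumes the header may be absent and treats that case as non-HTML rather than raising `KeyError`.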