-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
added scrape meta for palm_springs_pd #88 #93
base: dev
Are you sure you want to change the base?
Conversation
sample meta json |
tagging @stucka to take a look when you get a chance and see if this fits with your ideas on request.post helper to get JSON responses with asset urls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small change
clean/ca/config/palm_springs_pd.py
Outdated
"sec-fetch-site": "none", | ||
"sec-fetch-user": "?1", | ||
"upgrade-insecure-requests": "1", | ||
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this is overwriting the default user-agent
set by the method. Does removing this header break the scraper? We want to maintain the user agent Big Local News (biglocalnews.org)
to avoid getting blocked by their servers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case also I think user-agent
is getting set in a Python dictionary, and then it gets thrown to utils, which is trying to set a User-Agent
. Looks like the actual HTTP(s) request is case-insensitive, but the Python dictionary could conceivably hold values for both user-agent
and User-Agent
and then leave requests
to figure out the ambiguity when it converts the Python dict into the actual POST header format.
I'd suggest we should require User-Agent
with the caps to resolve any ambiguities.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed the user-agent and it still works, but in utils it is setting up an other Cap case User-Agent
,
If we add the indent at line 388 then it will only add the user-agent where the headers are not passed, I think the better way of doing it is that when passing the headers
do-not pass the user-agent
if possible and change the utils to set the default user-agent
even if the headers are available and user-agent
is not in headers dict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds reasonable to me. Maybe since we already know the capitalization can be unpredictable, that section in utils can test for both and only include it if neither is set:
if "headers" not in kwargs:
kwargs["headers"] = {}
if "user-agent" not in kwargs["headers"] and "User-Agent" not in kwargs["headers"]:
kwargs["headers"]["User-Agent"] = user_agent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stucka yes I agree.
clean/utils.py
Outdated
# Set the headers | ||
if "headers" not in kwargs: | ||
kwargs["headers"] = {} | ||
kwargs["headers"]["User-Agent"] = user_agent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without an indent on line 338, User-Agent
will always be overridden, and that may break scrapers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the mirror image of #93 (comment) =)
Added The scrape meta for the palm spring police department