Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added scrape meta for palm_springs_pd #88 #93

Open
wants to merge 5 commits into
base: dev
Choose a base branch
from

Conversation

naumansharifwork
Copy link
Contributor

Added The scrape meta for the palm spring police department

@naumansharifwork
Copy link
Contributor Author

sample meta json
ca_palm_springs_pd.json

@newsroomdev
Copy link
Member

tagging @stucka to take a look when you get a chance and see if this fits with your ideas on request.post helper to get JSON responses with asset urls

Copy link
Member

@newsroomdev newsroomdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small change

"sec-fetch-site": "none",
"sec-fetch-user": "?1",
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this is overwriting the default user-agent set by the method. Does removing this header break the scraper? We want to maintain the user agent Big Local News (biglocalnews.org) to avoid getting blocked by their servers

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case also I think user-agent is getting set in a Python dictionary, and then it gets thrown to utils, which is trying to set a User-Agent. Looks like the actual HTTP(s) request is case-insensitive, but the Python dictionary could conceivably hold values for both user-agent and User-Agent and then leave requests to figure out the ambiguity when it converts the Python dict into the actual POST header format.

I'd suggest we should require User-Agent with the caps to resolve any ambiguities.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed the user-agent and it still works, but in utils it is setting up an other Cap case User-Agent,
If we add the indent at line 388 then it will only add the user-agent where the headers are not passed, I think the better way of doing it is that when passing the headers do-not pass the user-agent if possible and change the utils to set the default user-agent even if the headers are available and user-agent is not in headers dict.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds reasonable to me. Maybe since we already know the capitalization can be unpredictable, that section in utils can test for both and only include it if neither is set:

    if "headers" not in kwargs:
        kwargs["headers"] = {}
    if "user-agent" not in kwargs["headers"] and "User-Agent" not in kwargs["headers"]:
        kwargs["headers"]["User-Agent"] = user_agent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stucka yes I agree.

clean/utils.py Outdated
# Set the headers
if "headers" not in kwargs:
kwargs["headers"] = {}
kwargs["headers"]["User-Agent"] = user_agent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without an indent on line 338, User-Agent will always be overridden, and that may break scrapers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the mirror image of #93 (comment) =)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants