added scrape meta for chula_vista_pd #94 #95
Conversation
Another user-agent change. Tagging @stucka for a second set of eyes.
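The user-agent change being discussed can be illustrated with a minimal stdlib sketch. The header value and helper name here are illustrative placeholders, not the scraper's actual values:

```python
import urllib.request

# Hypothetical custom headers; the real scraper defines its own values.
index_request_headers = {
    "User-Agent": "Mozilla/5.0 (compatible; clean-scraper-example)",
}

def build_index_request(url: str) -> urllib.request.Request:
    """Attach the custom User-Agent so the site serves the index page."""
    return urllib.request.Request(url, headers=index_request_headers)

req = build_index_request("https://www.chulavistaca.gov/departments/police-department")
```

`urllib.request.Request` normalizes stored header names with `str.capitalize()`, which is why the lookup key below is `"User-agent"`.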
clean/ca/chula_vista_pd.py (outdated)
```python
# save the index page url to cache (sensible name)
base_name = f"{self.base_url.split('/')[-1]}.html"
filename = f"{self.agency_slug}/{base_name}"
self.cache.download(filename, self.base_url, headers=index_request_headers)
```
There's only the one index page here -- I think this should include `force=True` to force a rescrape on each run.
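The suggested `force=True` semantics can be sketched with a toy cache: a download is skipped when a cached copy exists, unless the caller forces a re-fetch. This is a stand-in for illustration only, not the real clean-scraper cache API:

```python
import tempfile
from pathlib import Path

class MiniCache:
    """Toy stand-in for the scraper's cache: skips a download unless
    the file is missing or force=True. Not the real clean-scraper API."""

    def __init__(self, root: Path):
        self.root = root
        self.fetches = 0  # count simulated "network" fetches

    def download(self, filename: str, url: str, force: bool = False) -> Path:
        target = self.root / filename
        if target.exists() and not force:
            return target  # cached copy is good enough
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(f"<html>fetched from {url}</html>")  # pretend fetch
        self.fetches += 1
        return target

cache = MiniCache(Path(tempfile.mkdtemp()))
cache.download("chula_vista_pd/index.html", "https://example.com")
cache.download("chula_vista_pd/index.html", "https://example.com")              # skipped, cached
cache.download("chula_vista_pd/index.html", "https://example.com", force=True)  # re-fetched
```

Without `force=True`, the second and every later run would keep serving the stale index page forever.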
```python
for content_area in content_areas:
    previous_h2 = content_area.find_previous("h2")
    if previous_h2 and previous_h2.text == "Documents":
        desired_element = content_area
        break

if desired_element:
```
I like this methodology a lot, except we should probably log an error if there is no `desired_element`. If the underlying HTML has changed in a way that breaks this part of the scraper, as written this will fail quietly, I think.
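One way to avoid the quiet failure, sketched with stdlib logging. The function and logger here are hypothetical, and the `(area, heading_text)` pairs stand in for the BeautifulSoup `find_previous("h2")` lookups in the real code:

```python
import logging

logger = logging.getLogger(__name__)

def pick_documents_section(content_areas):
    """Return the content area whose preceding h2 reads 'Documents',
    logging loudly if the page layout changed and nothing matched."""
    for area, heading_text in content_areas:
        if heading_text == "Documents":
            return area
    logger.error("No 'Documents' content area found; page HTML may have changed")
    return None
```

The scraper can still return an empty payload afterward, but the error line makes the breakage visible in logs instead of silently producing nothing.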
Good idea
Nice work! I found a couple little things in the scraper proper, but I think maybe the metadata might need some work. Or I'm hopelessly confused. =) I'm looking at the contrib docs: https://github.com/biglocalnews/clean-scraper/blob/dev/docs/contributing.md

I think in these the title is going to be the human-friendly name of the document, which is going to be the a.string or a.text or a.content, whatever that is -- the text of the anchor tag. Where you've got the case_id as "officer-involved shootings" or whatever, I'd maybe put that in ['details']['case_type'] or some such.

The case_id I think should be coming from the hunk of text that's not part of the links, things like "Officer-Involved Shooting | 700 Monterey Avenue". This is ... really ugly and maybe there's a better way to do it, but it's trying to split the cases up into chunks that have that <p> before going into the links.
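Reading the comment above, a single metadata entry might look roughly like this. The field values and the `details.case_type` placement are this reviewer's suggestion as I understand it, not a confirmed schema from the contributing docs:

```python
# Hypothetical metadata entry following the suggestion above:
# title = the anchor text, the case type tucked under details, and
# case_id taken from the non-link text that precedes the links.
entry = {
    "title": "Body-Worn Camera Footage",  # the text of the anchor tag
    "case_id": "Officer-Involved Shooting | 700 Monterey Avenue",
    "details": {"case_type": "Officer-Involved Shooting"},
}

# The case type could even be derived from the case_id text itself:
case_type = entry["case_id"].split("|")[0].strip()
```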
OK, a much, much easier way to get the case ID: where you're processing your links, you can throw in a find_previous("p").text.
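BeautifulSoup's `find_previous("p")` pairs each link with the nearest preceding paragraph. A stdlib approximation with `html.parser` shows the same idea (the sample HTML below is made up to match the page's apparent shape):

```python
from html.parser import HTMLParser

class LinkCaseParser(HTMLParser):
    """Track the text of the most recent <p>, so each <a> can be tagged
    with it -- roughly what find_previous("p") gives you in BeautifulSoup."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self._p_text = ""
        self.links = []  # (case_id, href) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._p_text = ""
        elif tag == "a":
            href = dict(attrs).get("href", "")
            self.links.append((self._p_text.strip(), href))

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._p_text += data

parser = LinkCaseParser()
parser.feed(
    "<p>Officer-Involved Shooting | 700 Monterey Avenue</p>"
    "<a href='/video1'>Video 1</a><a href='/video2'>Video 2</a>"
)
```

In the actual scraper the one-liner `link.find_previous("p")` does this lookup, so no extra chunking pass over the page should be needed.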
@naumansharifwork, can you take another look at this? I did a few things you might not be happy with, and we can back them out if desired. You did excellent work and I've likely mangled it! =(

The splash / redirect URLs weren't actually working for me when I dropped 'em in the browser, so I put in some code to extract them out. And that seemed succinct enough that I killed off your function to process them. But ... maybe I'm wrong here. I also incorporated the splash / redirect workflow into the other workflow.

I put in some generic handling to detect whether a link was relative and, if so, add in the base URL. Otherwise it remains intact. (The last hunk of code there I think ignored the possibility of a third kind of URL -- like, whatever, a link to an attorney general's server or something.)

There was a \somethingsomething making it through in, I think, title, so I set up a text replacement in a couple spots, even where it's not needed.

Updated JSON is here
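The generic relative-link handling described above can be done with `urllib.parse.urljoin`, which also leaves fully qualified URLs (say, a link off to an attorney general's server) intact. The base URL and function name here are illustrative, not the scraper's actual values:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.chulavistaca.gov/departments/police-department"  # example base

def absolutize(href: str) -> str:
    """Prepend the base URL to relative links; absolute links pass through."""
    return urljoin(BASE_URL + "/", href)

absolutize("records.html")                 # relative: base URL is prepended
absolutize("https://oag.ca.gov/some/doc")  # already absolute: unchanged
```

Because `urljoin` inspects the scheme of the second argument, the "third kind of URL" case is handled for free rather than needing a separate branch.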
@stucka I have tested the YouTube links from the previous meta file; they were also working fine for me, but if you think this is a better method, then it's fine as well.
Added scrape meta for chula_vista_pd.
Sample meta JSON is attached:
ca_chula_vista_pd.json