
added scrape meta for chula_vista_pd #94 #95

Merged
6 commits merged into biglocalnews:dev on Sep 3, 2024

Conversation

naumansharifwork (Contributor)

added scrape meta for chula_vista_pd.

Sample Meta Json is attached

ca_chula_vista_pd.json

newsroomdev (Member) left a comment


Another user-agent change

clean/ca/config/chula_vista_pd.py (outdated, resolved)
@newsroomdev newsroomdev linked an issue Aug 28, 2024 that may be closed by this pull request
newsroomdev (Member)

Tagging @stucka for a second set of eyes

@newsroomdev newsroomdev self-requested a review August 30, 2024 16:44
# save the index page url to cache (sensible name)
base_name = f"{self.base_url.split('/')[-1]}.html"
filename = f"{self.agency_slug}/{base_name}"
self.cache.download(filename, self.base_url, headers=index_request_headers)
Contributor


There's only the one index page here -- I think this should include force = True to force a rescrape on each run.
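Something like this, presumably (a sketch only; it assumes the cache helper's download method accepts a force keyword, as the suggestion implies):

    # re-download the lone index page on every run rather than trusting the cached copy
    self.cache.download(
        filename,
        self.base_url,
        headers=index_request_headers,
        force=True,
    )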

Comment on lines +57 to +63
for content_area in content_areas:
previous_h2 = content_area.find_previous("h2")
if previous_h2 and previous_h2.text == "Documents":
desired_element = content_area
break

if desired_element:
Contributor


I like this methodology a lot, except we should probably log an error if there is no desired_element. If the underlying HTML has changed in a way that breaks this part of the scraper, as written this will fail quietly, I think.
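Something along these lines, for instance (a rough sketch; the logger name and the fallback behavior are guesses, not what the PR actually does):

    if not desired_element:
        # the "Documents" accordion wasn't found -- the page layout has probably changed
        logger.error("Could not locate the Documents section on %s", self.base_url)
        return []  # or raise, depending on how other scrapers in the repo handle this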

Contributor Author


Good idea

stucka (Contributor) commented Aug 31, 2024

Nice work! I found a couple little things in the scraper proper, but I think maybe the metadata might need some work. Or I'm hopelessly confused. =)

I'm looking at the contrib docs: https://github.com/biglocalnews/clean-scraper/blob/dev/docs/contributing.md

I think in these the title is going to be the human-friendly name of the document, which is going to be the a.string or a.text or a.content, whatever that is. The text of the anchor tag.

Where you've got the case_id as "officer-involved shootings" or whatever, I'd maybe put that in ['details']['case_type'] or some such.

The case_id I think should be coming from this hunk of text that's not part of the links, things like "Officer-Involved Shooting | 700 Monterey Avenue"
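In other words, a single asset entry would end up looking roughly like this (a Python-dict sketch with made-up values, just to illustrate the mapping; the exact schema is whatever the contrib docs specify):

    {
        "title": "Body Worn Camera Footage",  # the anchor tag's text
        "case_id": "Officer-Involved Shooting | 700 Monterey Avenue",  # the <p> text above the links
        "details": {"case_type": "Officer-Involved Shootings"},
    }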

This is ... really ugly and maybe there's a better way to do it, but it's trying to split the cases up into chunks that have that <p before going into the links.

            sections = desired_element.find_all("div", class_="accordion-item")
            for section in sections:
                case_type = section.find("div", class_="title").get_text(strip=True)
                psplit = "<p "
                for i, case in enumerate(str(section).split(psplit)):
                    if i > 0:
                        case = psplit + case
                    case_holder = BeautifulSoup(case, "html.parser")
                    if psplit in str(case_holder):
                        case_id = case_holder.find("p").text
                        links = case_holder.find_all("a")
                        for link in links:
                            link_href = link.get("href", None)

stucka (Contributor) commented Sep 1, 2024

OK, much much much easier way to get the case ID. Where you're processing your links, you can throw in a find_previous("p").text:

    for link in links:
        link_href = link.get("href", None)
        link_case_id = link.find_previous("p").text

stucka (Contributor) commented Sep 3, 2024

@naumansharifwork , can you take another look at this? I did a few things you might not be happy with and we can back them out if desired. You did excellent work and I've likely mangled it! =(

The splash/redirect URLs weren't actually working for me when I dropped 'em in the browser, so I put in some code to extract the real URLs. That seemed succinct enough that I killed off your function to process them. But ... maybe I'm wrong here.

I also incorporated the splash / redirect workflow into the other workflow.

I put in some generic handling to detect whether a link was relative and if so add in the base URL. Otherwise, it remains intact. (The last hunk of code there I think ignored the possibility of a third kind of URL, like, whatever, a link to an attorney general's server or something.)
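For reference, one generic way to do that kind of relative/absolute handling is the standard library's urljoin (a sketch only; the code actually merged in the PR may do it differently):

    from urllib.parse import urljoin

    # relative hrefs get anchored to the site's base URL; absolute hrefs pass through unchanged
    link_url = urljoin(self.base_url, link_href)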

There was a \somethingsomething making it through in, I think, the title, so I set up a text replacement in a couple of spots, even where it's not strictly needed.

Updated JSON is here

ca_chula_vista_pd.json

naumansharifwork (Contributor, Author)

@stucka I have tested the YouTube links from the previous meta file, and they were also working fine for me, but if you think this is a better method, then it's fine as well.

stucka merged commit b139e7c into biglocalnews:dev on Sep 3, 2024
7 checks passed
Successfully merging this pull request may close: Create clean/ca/chula_vista_pd.py
3 participants