
added scrape meta for chula_vista_pd #94 #95

Merged
6 commits merged into biglocalnews:dev on Sep 3, 2024

Conversation

naumansharifwork (Contributor)

added scrape meta for chula_vista_pd.

Sample Meta Json is attached

ca_chula_vista_pd.json

newsroomdev (Member) left a comment


Another user-agent change

clean/ca/config/chula_vista_pd.py (outdated, resolved)
@newsroomdev newsroomdev linked an issue Aug 28, 2024 that may be closed by this pull request
newsroomdev (Member)

Tagging @stucka for a second set of eyes

@newsroomdev newsroomdev self-requested a review August 30, 2024 16:44
# save the index page url to cache (sensible name)
base_name = f"{self.base_url.split('/')[-1]}.html"
filename = f"{self.agency_slug}/{base_name}"
self.cache.download(filename, self.base_url, headers=index_request_headers)
Contributor


There's only the one index page here -- I think this should include force = True to force a rescrape on each run.
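Something like this, presumably (a sketch only; it assumes the cache helper's download method accepts a force keyword, as the suggestion implies):

    # re-download the lone index page on every run rather than trusting the cached copy
    self.cache.download(
        filename,
        self.base_url,
        headers=index_request_headers,
        force=True,
    )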

Comment on lines +57 to +63
for content_area in content_areas:
previous_h2 = content_area.find_previous("h2")
if previous_h2 and previous_h2.text == "Documents":
desired_element = content_area
break

if desired_element:
Contributor


I like this methodology a lot, except we should probably log an error if there is no desired_element. If the underlying HTML has changed in a way that breaks this part of the scraper, as written this will fail quietly, I think.
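Something along these lines, for instance (a rough sketch; the logger name and the fallback behavior are guesses, not what the PR actually does):

    if not desired_element:
        # the "Documents" accordion wasn't found -- the page layout has probably changed
        logger.error("Could not locate the Documents section on %s", self.base_url)
        return []  # or raise, depending on how other scrapers in the repo handle this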

Contributor Author


Good idea

stucka (Contributor) commented Aug 31, 2024

Nice work! I found a couple little things in the scraper proper, but I think maybe the metadata might need some work. Or I'm hopelessly confused. =)

I'm looking at the contrib docs: https://github.com/biglocalnews/clean-scraper/blob/dev/docs/contributing.md

I think in these the title is going to be the human-friendly name of the document, which is going to be the a.string or a.text or a.content, whatever that is. The text of the anchor tag.

Where you've got the case_id as "officer-involved shootings" or whatever, I'd maybe put that in ['details']['case_type'] or some such.

The case_id I think should be coming from this hunk of text that's not part of the links, things like "Officer-Involved Shooting | 700 Monterey Avenue"
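In other words, a single asset entry would end up looking roughly like this (a Python-dict sketch with made-up values, just to illustrate the mapping; the exact schema is whatever the contrib docs specify):

    {
        "title": "Body Worn Camera Footage",  # the anchor tag's text
        "case_id": "Officer-Involved Shooting | 700 Monterey Avenue",  # the <p> text above the links
        "details": {"case_type": "Officer-Involved Shootings"},
    }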

This is ... really ugly and maybe there's a better way to do it, but it's trying to split the cases up into chunks that have that <p before going into the links.

            sections = desired_element.find_all("div", class_="accordion-item")
            for section in sections:
                case_type = section.find("div", class_="title").get_text(strip=True)
                psplit = "<p "
                for i, case in enumerate(str(section).split(psplit)):
                    if i > 0:
                        case = psplit + case
                    case_holder = BeautifulSoup(case, "html.parser")
                    if psplit in str(case_holder):
                        case_id = case_holder.find("p").text
                        links = case_holder.find_all("a")
                        for link in links:
                            link_href = link.get("href", None)

stucka (Contributor) commented Sep 1, 2024

OK, much much much easier way to get the case ID. Where you're processing your links, you can throw in a find_previous("p").text:

    for link in links:
        link_href = link.get("href", None)
        link_case_id = link.find_previous("p").text

stucka (Contributor) commented Sep 3, 2024

@naumansharifwork , can you take another look at this? I did a few things you might not be happy with and we can back them out if desired. You did excellent work and I've likely mangled it! =(

The splash/redirect URLs weren't actually working for me when I dropped 'em in the browser, so I put in some code to extract the real URLs. That seemed succinct enough that I killed off your function to process them. But ... maybe I'm wrong here.

I also incorporated the splash / redirect workflow into the other workflow.

I put in some generic handling to detect whether a link was relative and if so add in the base URL. Otherwise, it remains intact. (The last hunk of code there I think ignored the possibility of a third kind of URL, like, whatever, a link to an attorney general's server or something.)
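For reference, one generic way to do that kind of relative/absolute handling is the standard library's urljoin (a sketch only; the code actually merged in the PR may do it differently):

    from urllib.parse import urljoin

    # relative hrefs get anchored to the site's base URL; absolute hrefs pass through unchanged
    link_url = urljoin(self.base_url, link_href)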

There was a \somethingsomething making it through in, I think, the title, so I set up a text replacement in a couple of spots, even where it's not strictly needed.

Updated JSON is here

ca_chula_vista_pd.json

naumansharifwork (Contributor, Author)

@stucka I have tested the YouTube links from the previous meta file, and they were also working fine for me, but if you think this is a better method, then it's fine as well.

stucka merged commit b139e7c into biglocalnews:dev on Sep 3, 2024
7 checks passed
Successfully merging this pull request may close: Create clean/ca/chula_vista_pd.py
3 participants