Added The scraper for Humboldt with successful pre-commit run #48

Merged
4 commits merged into biglocalnews:dev on Jul 29, 2024

Conversation

naumansharifwork
Contributor

No description provided.

@naumansharifwork
Contributor Author

Hey @zstumgoren, please review this one as well. Thanks!

@zstumgoren
Member

@naumansharifwork Checks look good! I should have asked on the last go-round, but could you post a snippet in this thread showing the directory tree structure produced in the cache directory (e.g. using the tree command), along with a small sample of records from the final JSON output of scrape_meta? A main goal for us is to retain information about the linkage between assets (videos, audio, PDFs, etc.) and the associated case information, and a visual inspection of the output is one easy way to verify that everything works as expected.

@naumansharifwork
Contributor Author

@zstumgoren Here is a screenshot of the tree output:
[screenshot of the cache directory tree]

I am also uploading the metadata JSON:
ca_humboldt_pd.json

@zstumgoren
Member

@naumansharifwork This is a great start. A few things we'll need to change are documented below; admittedly, these are areas where our documentation lacks clarity.

panel_code clarification

I'm unclear on the meaning of the panel_code field. Is this supposed to be penal_code? And if so, it appears to be used inconsistently: some early records contain information about the state criminal penal code, which would make sense if this is indeed capturing that type of data. But in other spots the field appears to contain the date-of-release under which these case files appear to be listed. Could you clarify the usage and intent?

"panel_code": "832.7 (b)(1)(A)(i)",
vs
"panel_code": "201604167"

Parent page

The parent_page should provide the relative path to a local copy of the HTML (in cache) where the asset_url was harvested. For example, the parent_page for the below entry from ca_humboldt_pd.json ...

    {
        "title": "201604167_143761 OUTSIDE AGENCY Redacted.mp3 - NextRequest - Modern FOIA & Public Records Request Software",
        "panel_code": "201604167",
        "parent_page": "https://humboldtgov.nextrequest.com/documents/13365423",
        "asset_url": "https://humboldtgov.nextrequest.com/documents/13365423/download",
        "name": "201604167_143761 OUTSIDE AGENCY Redacted.mp3"
    },

...should be something like ca_humboldt_pd/SB-1421-AB-748-Information.html, which appears to contain the links to this and other NextRequest URLs on the AUDIO/VIDEO RELEASES tab.

[Screenshot of the AUDIO/VIDEO RELEASES tab, 2024-07-24]

There's some gray area here since, technically speaking, the download link in asset_url lives on a dedicated page over on NextRequest. But again, it's important for us to preserve a connection between the link-to-download and the HTML page (where available) that lists all files associated with a case.

I'd be open to expanding our schema to handle this situation, which I imagine may crop up with other agencies. For example, you could update parent_page as described above, but also add a download_page URL. Feel free to do that if it's not too much trouble.
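For illustration only (reusing values from the entry above; the exact field names are open to discussion), such a record might look roughly like this:

    record = {
        "title": "201604167_143761 OUTSIDE AGENCY Redacted.mp3 - NextRequest - Modern FOIA & Public Records Request Software",
        "parent_page": "ca_humboldt_pd/SB-1421-AB-748-Information.html",  # local cached copy of the index page
        "download_page": "https://humboldtgov.nextrequest.com/documents/13365423",  # dedicated NextRequest page
        "asset_url": "https://humboldtgov.nextrequest.com/documents/13365423/download",
        "name": "201604167_143761 OUTSIDE AGENCY Redacted.mp3",
    }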

File tree

When organizing files, the most helpful strategy is to save them in a directory tree associated with their parent page + case number or other unique identifying information (as opposed to, say, the penal code related to the case). Humboldt appears to list cases by their release date (or perhaps it's the date of the incident?). In any event, that appears to be the ideal way to organize related case files/assets together.

Here's an example of an alternate way to organize the files for Humboldt that would greatly help with our downstream processing:

ca_humboldt_pd/
ca_humboldt_pd/SB-1421-AB-748-Information.html # parent_page
ca_humboldt_pd/201504289/ # This folder contains files listed on the above "parent_page"
ca_humboldt_pd/201504289/201504289-085-4 Redacted.wav

There's a bit of redundancy above since the folder (201504289) is repeated in the file name 201504289-085-4 Redacted.wav, but that may be necessary if the agency doesn't consistently prefix file names with that date. Or even if all files are named consistently with that prefix, it's a good hedge against the possibility of a future change.
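Here's a rough Python sketch of that layout; the variable names are illustrative and not part of the existing scraper:

    from pathlib import Path

    cache_dir = Path("ca_humboldt_pd")            # agency cache root
    case_id = "201504289"                         # case/release number parsed from the index page
    file_name = "201504289-085-4 Redacted.wav"    # file name as served by the agency

    # ca_humboldt_pd/201504289/201504289-085-4 Redacted.wav
    asset_path = cache_dir / case_id / file_name
    asset_path.parent.mkdir(parents=True, exist_ok=True)
    # the downloaded bytes would then be written to asset_path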

@naumansharifwork
Contributor Author

Hey @zstumgoren, thanks for the feedback.
For Panel Code:
I was grouping the records into their respective folders. For the PDF documents we have a penal code, e.g.
832.7 (b)(1)(A)(i)
but for the audio/video files no penal code is mentioned. They have a case code like the example below, but it does not appear to be a date:
[screenshot of a case code]
So I thought it might serve as the panel code as well. I can name it case ID instead, but that case ID will not be available for the PDF-type documents.

Parent Page:
I understand the first part, that I should give the cache file location.
For the download page: since the download link and the document name are available on the next page, we are also caching that download page first, so should I provide its relative path?

File Tree:
For saving the files, we are already grouping the documents by their respective penal code:
[screenshot of the penal-code folders]
The audio/video files are also grouped into their respective case folders, e.g.
[screenshot of a case folder]
All of these files go into the 17-3004 folder.
File names are preserved exactly as we get them when downloading.
Let me know if we need to change anything in the file tree.

@zstumgoren
Member

@naumansharifwork A few follow-ups to your most recent questions:

Panel Code

In the JSON, does this refer to Penal Code? If so, we should replace it with the correct spelling -- penal_code -- and only fill it out when that information is available. Or you can drop the field, since I'm not certain we'd use it downstream in our processing.

Parent page vs. download page

For the download page: since the download link and the document name are available on the next page, we are also caching that download page first, so should I provide its relative path?

Sure, that'd be a good strategy; we could handle it as follows (see the record sketch after the list):

  • download_page - relative local path in cache
  • parent_page - higher-level index page that leads to the download_page
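As a minimal sketch (the download_page value below is hypothetical and depends on how the cached NextRequest pages end up being named):

    record = {
        "asset_url": "https://humboldtgov.nextrequest.com/documents/13365423/download",
        "parent_page": "ca_humboldt_pd/SB-1421-AB-748-Information.html",
        "download_page": "ca_humboldt_pd/documents/13365423.html",  # hypothetical cache path
    }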

File tree with case-number folders

After reviewing the page a bit more, the unique "ID" for each case does indeed appear to be the case number. Each case number appears to have a PDF report on the Documents tab and, where available, corresponding audio, video, and other assets on the Audio/Video tab.

So the key thing for us would be to use this unique ID to organize the files in separate folders. To expand on my earlier example, for 201504289, you might have the below structure:

ca_humboldt_pd/
ca_humboldt_pd/SB-1421-AB-748-Information.html # parent_page

ca_humboldt_pd/201504289/ # This folder contains the case file and related audio/video listed on the "parent_page"
ca_humboldt_pd/201504289/201504289.pdf # NOTE: THIS IS NEW and corresponds to the file on the Documents tab
ca_humboldt_pd/201504289/201504289-085-4 Redacted.wav

It sounds like you're already doing this to group audio/video by case, but grouping the police reports (i.e. the PDFs) under separate penal code folders. Is that correct?

If so, we should get rid of the penal code directories and just save the case file PDF inside the appropriate case folder as illustrated above, using its case number + .pdf suffix (or whatever the actual file name of the PDF is based on the URL). That type of structure will greatly help with our downstream processing.

For the penal code, it's fine to add that as a key in the JSON file for each record, where available, and leave blank where it's not available. Or as mentioned above, also fine to drop the field for now. If you decide to keep it in the JSON, we should correct the spelling from panel_code to penal_code.

@naumansharifwork
Contributor Author

ca_humboldt_pd.json
Hey, please check the JSON now.
I am also attaching the tree screenshot:
[screenshot of the updated directory tree]

@zstumgoren
Member

@naumansharifwork Folder structure looks spot on! Thanks! JSON looks good overall, just one minor note: It appears the download_page and parent_page are identical:

        "parent_page": "ca_humboldt_pd/SB-1421-AB-748-Information.html",
        "download_page": "ca_humboldt_pd/SB-1421-AB-748-Information.html",

If we keep download_page, it should be the relative path to a local copy of the HTML from the NextRequest page where the file can be downloaded (i.e. based on the links harvested from the Audio/Video tab).

@naumansharifwork
Contributor Author

@zstumgoren The download page is the same as the parent page for the cases where we actually get the download link from the parent page (for document types). Either we can keep it the same, or, if you want, I can remove the download_page key for the records where it's not required.

@zstumgoren
Member

@zstumgoren The download page is the same as the parent page for the cases where we actually get the download link from the parent page (for document types). Either we can keep it the same, or, if you want, I can remove the download_page key for the records where it's not required.

Gotcha. Sure, let's remove it for the cases where parent_page and download_page are identical. Once that change is done, I think we should be ready to merge. Thanks!
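A minimal sketch of that cleanup, assuming the metadata is held as a list of dicts before being written out (the sample values are illustrative):

    metadata = [
        {
            "parent_page": "ca_humboldt_pd/SB-1421-AB-748-Information.html",
            "download_page": "ca_humboldt_pd/SB-1421-AB-748-Information.html",
        },
    ]

    # Drop download_page wherever it merely repeats parent_page
    for record in metadata:
        if record.get("download_page") == record.get("parent_page"):
            del record["download_page"]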

@naumansharifwork
Contributor Author

ca_humboldt_pd.json
Hey @zstumgoren done.

@zstumgoren
Member

Looks great! Merging. Huge thanks!

@zstumgoren zstumgoren merged commit dc24b8e into biglocalnews:dev Jul 29, 2024
1 check passed
newsroomdev pushed a commit that referenced this pull request Jul 31, 2024
* Added The scraper for Humboldt with successful pre-commit run
* Required Changes done
* removed download page where identical
newsroomdev added a commit that referenced this pull request Aug 1, 2024
* feat: sacramento pd scraper

* fix: isort

* scrape most child pages; todo: get sub-sub pages

* more recursively grab child pages

* inline comments

* fix: fn names, py type

* feat: collect zip & pdfs; todo: handle dupe assets

* chore: ci

* feat: download youtube videos & playlists; remove print stmts

* style: naming

* ops: clean-prefect import clean

* ops: fix runner test (#44)

* ops: fix runner test

* ops: avoid redundant gha runs on prs

---------

Co-authored-by: Gerald Rich <[email protected]>

* ops: current reqs

* naming

* refactor: move around methods

* refactor: add case_num

* Tiny typo fixs

* Ca 43 santa rosa scraper (#45)

* added santa rosa

* Added The scraper for Humboldt with successful pre-commit run (#48)

* Added The scraper for Humboldt with successful pre-commit run
* Required Changes done
* removed download page where identical

* docs: metadata spec (#49)

* docs: metadata spec

* docs: remove refs to scrape

---------

Co-authored-by: Gerald Rich <[email protected]>

* Update contributing.md

* fix: metadata dict types

* fix: import typing_extensions

---------

Co-authored-by: Gerald Rich <[email protected]>
Co-authored-by: Mike Stucka <[email protected]>
Co-authored-by: naumansharifwork <[email protected]>
newsroomdev linked an issue Aug 6, 2024 that may be closed by this pull request
naumansharifwork added a commit to naumansharifwork/clean-scraper that referenced this pull request Oct 23, 2024
Added The scraper for Humboldt with successful pre-commit run (biglocalnews#48)

* Added The scraper for Humboldt with successful pre-commit run
* Required Changes done
* removed download page where identical
naumansharifwork added a commit to naumansharifwork/clean-scraper that referenced this pull request Oct 23, 2024
Successfully merging this pull request may close these issues.

Create clean/ca/humboldt_pd.py