Added The scraper for Humboldt with successful pre-commit run #48

Merged
4 commits merged into biglocalnews:dev on Jul 29, 2024

Conversation

naumansharifwork
Contributor

No description provided.

@naumansharifwork
Contributor Author

Hey @zstumgoren, please review this one as well. Thanks!

@zstumgoren
Member

@naumansharifwork Checks look good! I should have asked on the last go-round, but could you post a snippet in this thread showing the directory tree structure produced in the cache directory (e.g. using the tree command), along with a small sample of records from the final JSON output of scrape_meta? A main goal for us is to retain information about the linkage between assets (videos, audio, PDFs, etc.) and the associated case information, and a visual inspection of the output is one easy way to verify that everything works as expected.

@naumansharifwork
Contributor Author

@zstumgoren Here is a screenshot of the tree output:
[screenshot of the cache directory tree]

I am also uploading the metadata JSON:
ca_humboldt_pd.json

@zstumgoren
Member

@naumansharifwork This is a great start. A few things we'll need to change are documented below; admittedly, these are areas where our documentation lacks clarity.

panel_code clarification

I'm unclear on the meaning of the panel_code field. Is this supposed to be penal_code? And if so, it appears to be used inconsistently: some early records contain information about the state criminal penal code, which would make sense if this is indeed capturing that type of data. But in other spots the field appears to contain the date-of-release under which these case files appear to be listed. Could you clarify the usage and intent?

"panel_code": "832.7 (b)(1)(A)(i)",
vs
"panel_code": "201604167"

Parent page

The parent_page should provide the relative path to a local copy of the HTML (in cache) where the asset_url was harvested. For example, the parent_page for the below entry from ca_humboldt_pd.json ...

    {
        "title": "201604167_143761 OUTSIDE AGENCY Redacted.mp3 - NextRequest - Modern FOIA & Public Records Request Software",
        "panel_code": "201604167",
        "parent_page": "https://humboldtgov.nextrequest.com/documents/13365423",
        "asset_url": "https://humboldtgov.nextrequest.com/documents/13365423/download",
        "name": "201604167_143761 OUTSIDE AGENCY Redacted.mp3"
    },

...should be something like ca_humboldt_pd/SB-1421-AB-748-Information.html, which appears to contain the links to this and other NextRequest URLs on the AUDIO/VIDEO RELEASES tab.

[Screenshot of the AUDIO/VIDEO RELEASES tab, 2024-07-24]

There's some gray area here since, technically speaking, the download link in asset_url lives on a dedicated page over on NextRequest. But again, it's important for us to preserve a connection between the link-to-download and the HTML page (where available) that lists all files associated with a case.

I'd be open to expanding our schema to handle this situation, which I imagine may crop up with other agencies. For example, you could update parent_page as described above, but also add a download_page URL. Feel free to do that if it's not too much trouble.
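For illustration only (reusing values from the entry above; the exact field names are open to discussion), such a record might look roughly like this:

    record = {
        "title": "201604167_143761 OUTSIDE AGENCY Redacted.mp3 - NextRequest - Modern FOIA & Public Records Request Software",
        "parent_page": "ca_humboldt_pd/SB-1421-AB-748-Information.html",  # local cached copy of the index page
        "download_page": "https://humboldtgov.nextrequest.com/documents/13365423",  # dedicated NextRequest page
        "asset_url": "https://humboldtgov.nextrequest.com/documents/13365423/download",
        "name": "201604167_143761 OUTSIDE AGENCY Redacted.mp3",
    }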

File tree

When organizing files, the most helpful strategy is to save them in a directory tree associated with their parent page + case number or other unique identifying information (as opposed to, say, the penal code related to the case). Humboldt appears to list cases by their release date (or perhaps it's the date of the incident?). In any event, that appears to be the ideal way to organize related case files/assets together.

Here's an example of an alternate way to organize the files for Humboldt that would greatly help with our downstream processing:

ca_humboldt_pd/
ca_humboldt_pd/SB-1421-AB-748-Information.html # parent_page
ca_humboldt_pd/201504289/ # This folder contains files listed on the above "parent_page"
ca_humboldt_pd/201504289/201504289-085-4 Redacted.wav

There's a bit of redundancy above since the folder (201504289) is repeated in the file name 201504289-085-4 Redacted.wav, but that may be necessary if the agency doesn't consistently prefix file names with that date. Or even if all files are named consistently with that prefix, it's a good hedge against the possibility of a future change.
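Here's a rough Python sketch of that layout; the variable names are illustrative and not part of the existing scraper:

    from pathlib import Path

    cache_dir = Path("ca_humboldt_pd")            # agency cache root
    case_id = "201504289"                         # case/release number parsed from the index page
    file_name = "201504289-085-4 Redacted.wav"    # file name as served by the agency

    # ca_humboldt_pd/201504289/201504289-085-4 Redacted.wav
    asset_path = cache_dir / case_id / file_name
    asset_path.parent.mkdir(parents=True, exist_ok=True)
    # the downloaded bytes would then be written to asset_path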

@naumansharifwork
Contributor Author

Hey @zstumgoren, thanks for the feedback.
For Panel Code:
I was grouping the records into their respective folders. For the PDF documents we have a penal code, e.g.
832.7 (b)(1)(A)(i)
but for the audio/video files no penal code is mentioned. They have a case code like the example below, but it does not appear to be a date:
[screenshot of a case code]
So I thought it might serve as the panel code as well. I can name it case ID instead, but that case ID will not be available for the PDF-type documents.

Parent Page:
I understand the first part, that I should give the cache file location.
For the download page: since the download link and the document name are available on the next page, we are also caching that download page first, so should I provide its relative path?

File Tree:
For saving the files, we are already grouping the documents by their respective penal code:
[screenshot of the penal-code folders]
The audio/video files are also grouped into their respective case folders, e.g.
[screenshot of a case folder]
All of these files go into the 17-3004 folder.
File names are preserved exactly as we get them when downloading.
Let me know if we need to change anything in the file tree.

@zstumgoren
Member

@naumansharifwork A few follow-ups to your most recent questions:

Panel Code

In the JSON, does this refer to Penal Code? If so, we should replace it with the correct spelling -- penal_code -- and only fill it out when that information is available. Or you can drop the field, since I'm not certain we'd use it downstream in our processing.

Parent page vs. download page

For the download page: since the download link and the document name are available on the next page, we are also caching that download page first, so should I provide its relative path?

Sure, that'd be a good strategy; we could handle it as follows (see the record sketch after the list):

  • download_page - relative local path in cache
  • parent_page - higher-level index page that leads to the download_page
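As a minimal sketch (the download_page value below is hypothetical and depends on how the cached NextRequest pages end up being named):

    record = {
        "asset_url": "https://humboldtgov.nextrequest.com/documents/13365423/download",
        "parent_page": "ca_humboldt_pd/SB-1421-AB-748-Information.html",
        "download_page": "ca_humboldt_pd/documents/13365423.html",  # hypothetical cache path
    }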

File tree with case-number folders

After reviewing the page a bit more, the unique "ID" for each case does indeed appear to be the case number. Each case number appears to have a PDF report on the Documents tab and, where available, corresponding audio, video, and other assets on the Audio/Video tab.

So the key thing for us would be to use this unique ID to organize the files in separate folders. To expand on my earlier example, for 201504289, you might have the below structure:

ca_humboldt_pd/
ca_humboldt_pd/SB-1421-AB-748-Information.html # parent_page

ca_humboldt_pd/201504289/ # This folder contains the case file and related audio/video listed on the "parent_page"
ca_humboldt_pd/201504289/201504289.pdf # NOTE: THIS IS NEW and corresponds to the file on the Documents tab
ca_humboldt_pd/201504289/201504289-085-4 Redacted.wav

It sounds like you're already doing this to group audio/video by case, but grouping the police reports (i.e. the PDFs) under separate penal code folders. Is that correct?

If so, we should get rid of the penal code directories and just save the case file PDF inside the appropriate case folder as illustrated above, using its case number + .pdf suffix (or whatever the actual file name of the PDF is based on the URL). That type of structure will greatly help with our downstream processing.

For the penal code, it's fine to add that as a key in the JSON file for each record, where available, and leave blank where it's not available. Or as mentioned above, also fine to drop the field for now. If you decide to keep it in the JSON, we should correct the spelling from panel_code to penal_code.

@naumansharifwork
Contributor Author

ca_humboldt_pd.json
Hey, please check the JSON now.
I am also attaching the tree screenshot:
[screenshot of the updated directory tree]

@zstumgoren
Member

@naumansharifwork Folder structure looks spot on! Thanks! JSON looks good overall, just one minor note: It appears the download_page and parent_page are identical:

        "parent_page": "ca_humboldt_pd/SB-1421-AB-748-Information.html",
        "download_page": "ca_humboldt_pd/SB-1421-AB-748-Information.html",

If we keep download_page, it should be the relative path to a local copy of the HTML from the NextRequest page where the file can be downloaded (i.e. based on the links harvested from the Audio/Video tab).

@naumansharifwork
Contributor Author

@zstumgoren The download page is the same as the parent page for the cases where we actually get the download link from the parent page (for document types). Either we can keep it the same, or, if you want, I can remove the download_page key for the records where it's not required.

@zstumgoren
Member

@zstumgoren The download page is the same as the parent page for the cases where we actually get the download link from the parent page (for document types). Either we can keep it the same, or, if you want, I can remove the download_page key for the records where it's not required.

Gotcha. Sure, let's remove it for the cases where parent_page and download_page are identical. Once that change is done, I think we should be ready to merge. Thanks!
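A minimal sketch of that cleanup, assuming the metadata is held as a list of dicts before being written out (the sample values are illustrative):

    metadata = [
        {
            "parent_page": "ca_humboldt_pd/SB-1421-AB-748-Information.html",
            "download_page": "ca_humboldt_pd/SB-1421-AB-748-Information.html",
        },
    ]

    # Drop download_page wherever it merely repeats parent_page
    for record in metadata:
        if record.get("download_page") == record.get("parent_page"):
            del record["download_page"]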

@naumansharifwork
Contributor Author

ca_humboldt_pd.json
Hey @zstumgoren done.

@zstumgoren
Member

Looks great! Merging. Huge thanks!

@zstumgoren zstumgoren merged commit dc24b8e into biglocalnews:dev Jul 29, 2024
1 check passed
newsroomdev pushed a commit that referenced this pull request Jul 31, 2024
* Added The scraper for Humboldt with successful pre-commit run
* Required Changes done
* removed download page where identical
newsroomdev added a commit that referenced this pull request Aug 1, 2024
* feat: sacramento pd scraper

* fix: isort

* scrape most child pages; todo: get sub-sub pages

* more recursively grab child pages

* inline comments

* fix: fn names, py type

* feat: collect zip & pdfs; todo: handle dupe assets

* chore: ci

* feat: download youtube videos & playlists; remove print stmts

* style: naming

* ops: clean-prefect import clean

* ops: fix runner test (#44)

* ops: fix runner test

* ops: avoid redundant gha runs on prs

---------

Co-authored-by: Gerald Rich <[email protected]>

* ops: current reqs

* naming

* refactor: move around methods

* refactor: add case_num

* Tiny typo fixs

* Ca 43 santa rosa scraper (#45)

* added santa rosa

* Added The scraper for Humboldt with successful pre-commit run (#48)

* Added The scraper for Humboldt with successful pre-commit run
* Required Changes done
* removed download page where identical

* docs: metadata spec (#49)

* docs: metadata spec

* docs: remove refs to scrape

---------

Co-authored-by: Gerald Rich <[email protected]>

* Update contributing.md

* fix: metadata dict types

* fix: import typing_extensions

---------

Co-authored-by: Gerald Rich <[email protected]>
Co-authored-by: Mike Stucka <[email protected]>
Co-authored-by: naumansharifwork <[email protected]>
newsroomdev linked an issue Aug 6, 2024 that may be closed by this pull request
naumansharifwork added a commit to naumansharifwork/clean-scraper that referenced this pull request Oct 23, 2024
Added The scraper for Humboldt with successful pre-commit run (biglocalnews#48)

* Added The scraper for Humboldt with successful pre-commit run
* Required Changes done
* removed download page where identical
naumansharifwork added a commit to naumansharifwork/clean-scraper that referenced this pull request Oct 23, 2024
Successfully merging this pull request may close these issues.

Create clean/ca/humboldt_pd.py