Added The scraper for Humboldt with successful pre-commit run #48
Conversation
Hey @zstumgoren, please review this one as well. Thanks.
@naumansharifwork Checks look good! I should have asked on the last go-round, but could you post a snippet in this thread showing the directory tree structure produced in the cache directory (e.g. using the tree command), along with a small sample of records in the final JSON output?
@zstumgoren Here is a screenshot of the tree. Also, I am uploading the meta JSON.
@naumansharifwork This is a great start. A few things we'll need to change are documented below; admittedly, these points lack clarity in our own documentation.
Hey @zstumgoren, thanks for the feedback.
Parent Page:
File Tree:
@naumansharifwork A few follow-ups to your most recent questions:
Panel Code
In the JSON, does this refer to Penal Code? If so, we should replace it with the correct spelling.
Parent page vs. download page
Sure, that'd be a good strategy to handle as:
File tree with case-number folders
After reviewing the page a bit more, it appears that the unique "ID" for each case is indeed the case number. Each case number appears to have a PDF report on the Documents tab and, where available, corresponding audio, video, and other assets on the Audio/Video tab. So the key thing for us would be to use this unique ID to organize the files in separate folders. To expand on my earlier example:
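A hypothetical sketch of that layout (the case number and file names below are invented for illustration):

```
ca_humboldt_pd/
└── 23-001234/
    ├── 23-001234.pdf          # case report from the Documents tab
    ├── 23-001234_audio.mp3    # Audio/Video tab assets, where available
    └── 23-001234_video.mp4
```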
It sounds like you're already doing this to group audio/video by case, but grouping the police reports (i.e. the PDFs) under separate penal code folders. Is that correct? If so, we should get rid of the penal code directories and just save the case file PDF inside the appropriate case folder as illustrated above, using its case number + .pdf extension.

For the penal code, it's fine to add that as a key in the JSON file for each record, where available, and leave it blank where it's not available. Or, as mentioned above, it's also fine to drop the field for now. If you decide to keep it in the JSON, we should correct the spelling from panel_code to penal_code.
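For instance, a record in the output JSON might then look something like this (the values, and the case_number key name, are illustrative guesses; only penal_code, parent_page, and download_page are discussed in this thread):

```json
{
  "case_number": "23-001234",
  "penal_code": "",
  "parent_page": "ca_humboldt_pd/SB-1421-AB-748-Information.html",
  "download_page": "ca_humboldt_pd/SB-1421-AB-748-Information.html"
}
```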
ca_humboldt_pd.json
@naumansharifwork Folder structure looks spot on! Thanks! JSON looks good overall, just one minor note: it appears the parent_page and download_page are identical for some records:

```json
"parent_page": "ca_humboldt_pd/SB-1421-AB-748-Information.html",
"download_page": "ca_humboldt_pd/SB-1421-AB-748-Information.html",
```

If we keep
@zstumgoren The download page is the same as the parent page for the cases where we actually get the download link from the parent page (for document types). Either we can keep it the same, or if you want, I can remove the download_page key for the records where it's not required.
Gotcha. Sure, let's remove it for the cases where it's identical to the parent page.
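A minimal sketch of that cleanup step, assuming the metadata is a list of dicts shaped like the records above (the function and variable names here are invented, not taken from the actual scraper):

```python
def drop_redundant_download_page(records):
    """Remove download_page from records where it matches parent_page.

    Assumes each record is a dict that may contain parent_page and
    download_page keys, as in the ca_humboldt_pd.json sample.
    """
    for record in records:
        if record.get("download_page") == record.get("parent_page"):
            record.pop("download_page", None)
    return records
```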
ca_humboldt_pd.json
Looks great! Merging. Huge thanks!
* Added The scraper for Humboldt with successful pre-commit run
* Required Changes done
* removed download page where identical