NextRequest #84 ; BART #96 ; LAPD #18 #105
Conversation
This is shaping up nicely! I left some notes below. Overall, don't forget to remove the Jupyter bits. I've opened a GitHub Discussion to help create a place to link back to and document development across branches and pull requests as we add more agencies.
The rebase made it show changes to files I didn't touch. At least some of these are current things already in dev.
ca_bay_area_rapid_transit_pd.json: Sample metadata here
I'm liking this, but one significant area of concern is the deeply nested code blocks. It's clear you've put a lot of thought into the various logic branches, so I've tried to leave a few helpful suggestions on how to reframe the code flows for this shared library.
clean/platforms/nextrequest.py
Outdated
line = {
    "site_type": "bartish",  # Bartish type
    "base_url": f"{parsed_url.scheme}://{parsed_url.netloc}",
    "folder_id": urlparse(start_url).path.split("/")[2],
    "doc_limit": 9950,  # Max number of accessible docs in a folder
    "page_size": 25,
    "tally_field": "total_documents_count",
    # "document_path": "document_scan['document_path']",
}
line["json_url"] = (
    f"{line['base_url']}/client/request_documents?request_id={line['folder_id']}&page_number="
)
line["details"] = {
    "document_path": "ds!document_path",
    "bogus_asset_url": "asset_url",
    "review_state": "review_state",
    "review_status": "ds!review_status",
    "severity": "ds!severity",
    "findings": "ds!findings",
    "file_extension": "file_extension",
    "file_size": "ds!file_size",
    "file_type": "ds!file_type",
    "visibility1": "visibility",
    "visibility2": "ds!visibility",
    "upload_date1": "upload_date",
    "upload_date2": "ds!upload_date",
    "pretty_id": "ds!pretty_id",
    "id1": "id",
    "id2": "ds!id",
    "document_id": "ds!document_id",
    "request_id": "request_id",
    "folder_name": "folder_name",
    "subfolder_name": "subfolder_name",
    "exempt_from_retention": "exempt_from_retention",
    "bln_page_url": "bln_page_url",
    "bln_total_documents": "bln_total_documents",
}
For readability, don't break up the line dictionaries.
Suggested change:
return {
    "site_type": "bartish",
    "base_url": f"{parsed_url.scheme}://{parsed_url.netloc}",
    "folder_id": parsed_url.path.split("/")[2],
    "doc_limit": BARTISH_DOC_LIMIT,
    "page_size": BARTISH_PAGE_SIZE,
    "tally_field": "total_documents_count",
    "json_url": f"{parsed_url.scheme}://{parsed_url.netloc}/client/request_documents?request_id={parsed_url.path.split('/')[2]}&page_number=",
    "details": {
        "document_path": "ds!document_path",
        "bogus_asset_url": "asset_url",
        "review_state": "review_state",
        "review_status": "ds!review_status",
        "severity": "ds!severity",
        "findings": "ds!findings",
        "file_extension": "file_extension",
        "file_size": "ds!file_size",
        "file_type": "ds!file_type",
        "visibility1": "visibility",
        "visibility2": "ds!visibility",
        "upload_date1": "upload_date",
        "upload_date2": "ds!upload_date",
        "pretty_id": "ds!pretty_id",
        "id1": "id",
        "id2": "ds!id",
        "document_id": "ds!document_id",
        "request_id": "request_id",
        "folder_name": "folder_name",
        "subfolder_name": "subfolder_name",
        "exempt_from_retention": "exempt_from_retention",
        "bln_page_url": "bln_page_url",
        "bln_total_documents": "bln_total_documents",
    },
}
I've got those dictionaries back together by shifting one bit of code out, which somehow also required more mypy ignores.
Can you take a look and see if this works?
I think if we break this down further, we're just making it harder to find code. Here the fingerprint type you want is identified by indentation, and you can just scroll down. Otherwise we're basically repeating the structure with more if statements in a profile-generator function, or we're creating multiple profile functions that do nothing except scatter responsibility further.
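For reference, a rough sketch of what that could look like (illustrative only, not the actual diff on this branch; the constant names BARTISH_DETAILS, BARTISH_DOC_LIMIT, and BARTISH_PAGE_SIZE are assumptions): the shared field mapping moves to a module-level constant so each fingerprint branch can return one intact dictionary.

# Illustrative sketch only; constant names are assumptions, not the branch's actual code.
from urllib.parse import urlparse

BARTISH_DOC_LIMIT = 9950  # Max number of accessible docs in a folder
BARTISH_PAGE_SIZE = 25
BARTISH_DETAILS = {
    "document_path": "ds!document_path",
    "bogus_asset_url": "asset_url",
    # ... remaining fields from the mapping shown above ...
    "bln_page_url": "bln_page_url",
    "bln_total_documents": "bln_total_documents",
}


def bartish_profile(start_url: str) -> dict:
    """Build the 'bartish' fingerprint profile as a single, unbroken dictionary."""
    parsed_url = urlparse(start_url)
    base_url = f"{parsed_url.scheme}://{parsed_url.netloc}"
    folder_id = parsed_url.path.split("/")[2]
    return {
        "site_type": "bartish",
        "base_url": base_url,
        "folder_id": folder_id,
        "doc_limit": BARTISH_DOC_LIMIT,
        "page_size": BARTISH_PAGE_SIZE,
        "tally_field": "total_documents_count",
        "json_url": f"{base_url}/client/request_documents?request_id={folder_id}&page_number=",
        "details": BARTISH_DETAILS,
    }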
clean/platforms/nextrequest.py
Outdated
if not force and local_cache.exists(filename):
    logger.debug(f"File found in cache: {filename}")
    returned_json = None
    file_needs_write = False
else:
    # Remember pagination here!
    page_number = 1
    page_url = f"{json_url}{page_number}"
    r = utils.get_url(page_url)
    if not r.ok:
        logger.error(f"Problem downloading {page_url}: {r.status_code}")
        returned_json: Dict = {}  # type: ignore
        file_needs_write = False
    else:
        returned_json = r.json()
        # local_cache.write_json(filename,
        file_needs_write = True
        total_documents = returned_json[profile["tally_field"]]
        for i, _entry in enumerate(returned_json["documents"]):
            returned_json["documents"][i]["bln_page_url"] = page_url
            returned_json["documents"][i]["bln_total_documents"] = total_documents
        page_size = profile["page_size"]
        max_pages = find_max_pages(total_documents, page_size)
        sleep(throttle)
        if total_documents > profile["doc_limit"]:
            message = f"Request found with {total_documents:,} documents, exceeding limits. "
            message += f"This is probably a bad URL that can't be properly scraped: {page_url}. "
            message += "Dropping record."
            logger.warning(message)
            returned_json = {}
            file_needs_write = False
            return (filename, returned_json, file_needs_write)
        if max_pages > 1:
            logger.debug(f"Need to download {max_pages - 1:,} more JSON files.")
            for page_number in range(2, max_pages + 1):
                page_url = f"{json_url}{page_number}"
                if page_number >= 200:
                    message = "NextRequest at least on some sites appears to have a hard limit of "
                    message += f"199 pages. Not trying to scrape {page_url}."
                    logger.warning(message)
                else:
                    r = utils.get_url(page_url)
                    if not r.ok:
                        logger.error(
                            f"Problem downloading {page_url}: {r.status_code}"
                        )
                        returned_json = {}
                        file_needs_write = False
                    else:
                        additional_json = r.json()
                        if "documents" not in additional_json:
                            logger.error(
                                f"Missing 'documents' section from {page_url}"
                            )
                            returned_json = {}
                            file_needs_write = False
                        else:
                            for i, _entry in enumerate(
                                additional_json["documents"]
                            ):
                                additional_json["documents"][i][
                                    "bln_page_url"
                                ] = page_url
                                additional_json["documents"][i][
                                    "bln_total_documents"
                                ] = total_documents
                            returned_json["documents"].extend(
                                additional_json["documents"]
                            )
                sleep(throttle)
            documents_found = len(returned_json["documents"])
            if documents_found != total_documents:
                message = f"Expected {total_documents:,} documents "
                message += f"but got {documents_found:,} instead for "
                message += f"{start_url}."
                logger.warning(message)

return (filename, returned_json, file_needs_write)
Nested conditionals can be rather gnarly and buggy to maintain over time, and they are often the first platform code to be rewritten.
I let gen-AI take a swing at an initial refactor for the purposes of pseudo-code. In other words, the pseudo-code is an untested suggestion, not necessarily for adoption, meant to help rethink the logic flow differently.
# UNTESTED GEN-AI PSEUDO-CODE FOR ILLUSTRATION
def fetch_nextrequest(
    base_directory: Path, start_url: str, force: bool = False, throttle: int = 2
) -> Tuple[Path, Union[None, Dict], bool]:
    """Given a link to a NextRequest documents folder, return a proposed filename and the JSON contents."""
    profile = fingerprint_nextrequest(start_url)
    folder_id = profile["folder_id"]
    json_url = profile["json_url"]
    local_cache = Cache(path=None)
    filename = base_directory / f"{folder_id}.json"
    if not force and local_cache.exists(filename):
        logger.debug(f"File found in cache: {filename}")
        return filename, None, False
    returned_json = download_json_data(json_url, profile, throttle)
    if not returned_json:
        return filename, {}, False
    file_needs_write = True
    return filename, returned_json, file_needs_write


def download_json_data(
    json_url: str, profile: Dict, throttle: int
) -> Union[None, Dict]:
    """Download JSON data from the given URL with pagination handling."""
    page_number = 1
    page_url = f"{json_url}{page_number}"
    r = utils.get_url(page_url)
    if not r.ok:
        logger.error(f"Problem downloading {page_url}: {r.status_code}")
        return None
    returned_json = r.json()
    total_documents = returned_json[profile["tally_field"]]
    add_metadata_to_documents(returned_json["documents"], page_url, total_documents)
    if total_documents > profile["doc_limit"]:
        logger.warning(
            f"Request found with {total_documents:,} documents, exceeding limits. Dropping record."
        )
        return {}
    max_pages = find_max_pages(total_documents, profile["page_size"])
    if max_pages > 1:
        logger.debug(f"Need to download {max_pages - 1:,} more JSON files.")
        for page_number in range(2, max_pages + 1):
            if page_number >= MAX_PAGES_LIMIT:
                logger.warning(
                    f"Not trying to scrape {json_url}{page_number} due to page limit."
                )
                break
            page_url = f"{json_url}{page_number}"
            r = utils.get_url(page_url)
            if not r.ok:
                logger.error(f"Problem downloading {page_url}: {r.status_code}")
                return None
            additional_json = r.json()
            if "documents" not in additional_json:
                logger.error(f"Missing 'documents' section from {page_url}")
                return None
            add_metadata_to_documents(
                additional_json["documents"], page_url, total_documents
            )
            returned_json["documents"].extend(additional_json["documents"])
            sleep(throttle)
    documents_found = len(returned_json["documents"])
    if documents_found != total_documents:
        logger.warning(
            f"Expected {total_documents:,} documents but got {documents_found:,} instead for {json_url}."
        )
    return returned_json


def add_metadata_to_documents(
    documents: List[Dict], page_url: str, total_documents: int
) -> None:
    """Add metadata to each document."""
    for doc in documents:
        doc["bln_page_url"] = page_url
        doc["bln_total_documents"] = total_documents
Some minor nits I'll need to clean up later for automation, but this is in a working state for a one-off, and there are other PRs.
Description
This is intended to build out a library enabling easier scraping of NextRequest FOIA portal data, fulfilling #84. The library supports two versions of NextRequest sites.
The Bay Area Rapid Transit Police Department (BART / BART PD) scraper is prepared as an example of the library and fulfills #96.
The library has also been tested against a different version of a NextRequest site used by the Los Angeles Police Department, fulfilling #18.
Summary of Changes
-- Creates an easy-to-use NextRequest library that works on two versions of NextRequest document folders.
-- Creates a BART PD scraper as an exemplar of this library.
-- Creates a LAPD scraper as a cautionary tale of how weird scraping index files can get.
Related Issues
closes Build a NextRequest platform library #84
closes Create ca_bay_area_rapid_transit_pd #96
closes Create clean/ca/los_angeles_pd.py #18
How to Review
However you'd like. This PR includes the library, an implementation for one format, and an implementation for the other format I've found.
The BART PD scraper is significantly simpler, as the entirety of their records is in a single NextRequest folder (a rough sketch of that shape is included below).
LAPD required recursion and also surfaced a number of edge cases.
The NextRequest code has grown in complexity to handle edge cases; readability suffered, but it should be more robust.
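For a sense of that shape, here is a rough, untested sketch of a single-folder scraper built on the library. This is not the actual ca_bay_area_rapid_transit_pd.py; the folder URL, the scrape_meta signature, and the assumption that fetch_nextrequest is exposed from clean/platforms/nextrequest.py as shown above are all illustrative.

# Rough sketch only, not the actual scraper. The import path follows
# clean/platforms/nextrequest.py; the folder URL and function shape are assumptions.
from pathlib import Path

from clean.platforms.nextrequest import fetch_nextrequest

BART_FOLDER_URL = "https://example.nextrequest.com/requests/placeholder-id"  # placeholder


def scrape_meta(base_directory: Path, throttle: int = 2):
    """Fetch the document index for BART PD's single NextRequest folder."""
    return fetch_nextrequest(base_directory, BART_FOLDER_URL, throttle=throttle)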
Notes