
NextRequest #84 ; BART #96 ; LAPD #18 #105

Merged: 47 commits merged into dev from platform-84 on Oct 23, 2024

Conversation

@stucka (Contributor) commented on Sep 9, 2024

Description

This is intended to build out a library enabling easier scraping of NextRequest FOIA portal data, fulfilling #84. The library supports two versions of NextRequest sites.

The Bay Area Rapid Transit Police Department (BART / BART PD) scraper is prepared as an example of the library and fulfills #96 .

The library has been tested against a different version of a NextRequest site used by the Los Angeles Police Department and fulfills #18.

Summary of Changes

-- Creates an easy-to-use NextRequest library that works with two versions of NextRequest document folders (see the usage sketch after this list).

-- Creates a BART PD scraper as an exemplar of this library.

-- Creates a LAPD scraper as a cautionary tale of how weird scraping index files can get.
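
To make the hand-off concrete, here is a minimal usage sketch. It is not taken from the PR: it assumes fetch_nextrequest is importable from clean/platforms/nextrequest.py with the (base_directory, start_url, force, throttle) signature that appears later in this review, and the folder URL and cache directory are placeholders.

# Minimal sketch, not the BART or LAPD scraper; the URL and cache path are
# placeholders, and the fetch_nextrequest signature is assumed from the
# review discussion below.
from pathlib import Path

from clean.platforms.nextrequest import fetch_nextrequest

start_url = "https://example.nextrequest.com/requests/21-000"  # placeholder folder URL
filename, payload, needs_write = fetch_nextrequest(
    Path("cache/ca_example_pd"), start_url, force=False, throttle=2
)
if needs_write and payload:
    print(f"{len(payload['documents']):,} document records would be written to {filename}")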

Related Issues

How to Review

However you'd like. This PR includes the library, an implementation for one NextRequest format, and an implementation for the other format I've found.

The BART PD scraper is significantly simpler, as the entirety of their records is in a single NextRequest folder.

LAPD required recursion and also identified a number of edge cases.

The NextRequest code has grown in complexity to handle edge cases; readability has suffered, but it should be more robust.

Notes

  • It's possible, perhaps even likely, that other compatibility problems in the NextRequest library will be found the more it is used.
  • It's possible, perhaps even likely, that other versions of NextRequest sites will be discovered, and that will require some rework. The BART template is used on more than one site, at least.

stucka marked this pull request as draft on September 9, 2024 at 15:41
@newsroomdev (Member) left a comment

This is shaping up nicely! I left some notes below. Overall, don't forget to remove the Jupyter bits. I've opened a GitHub Discussion to help create a place to link back to and document development across branches and pull requests as we add more agencies.

Review threads: clean/platforms/nextrequest.py (resolved), clean/ca/los_angeles_pd.py (fixed)
stucka marked this pull request as ready for review on September 20, 2024 at 14:42
@stucka (Contributor, Author) commented on Sep 20, 2024

The rebase made the diff show changes to files I didn't touch. At least some of these are changes already in dev.

stucka changed the title from "NextRequest #84 BART #96" to "NextRequest #84 ; BART #96 ; LAPD #18" on Sep 20, 2024
@stucka (Contributor, Author) commented on Sep 20, 2024

ca_bay_area_rapid_transit_pd.json
ca_los_angeles_pd.json

Sample metadata here

@newsroomdev (Member) left a comment

I'm liking this, but one significant area of concern is the deeply nested code blocks. It's clear you've put a lot of thought into the various logic branches, so I've tried to leave a few suggestions on how to reframe the code flow for this shared library.

Review threads: clean/platforms/nextrequest.py (resolved)
Comment on lines 315 to 351
line = {
"site_type": "bartish", # Bartish type
"base_url": f"{parsed_url.scheme}://{parsed_url.netloc}",
"folder_id": urlparse(start_url).path.split("/")[2],
"doc_limit": 9950, # Max number of accessible docs in a folder
"page_size": 25,
"tally_field": "total_documents_count",
# "document_path": "document_scan['document_path']",
}
line["json_url"] = (
f"{line['base_url']}/client/request_documents?request_id={line['folder_id']}&page_number="
)
line["details"] = {
"document_path": "ds!document_path",
"bogus_asset_url": "asset_url",
"review_state": "review_state",
"review_status": "ds!review_status",
"severity": "ds!severity",
"findings": "ds!findings",
"file_extension": "file_extension",
"file_size": "ds!file_size",
"file_type": "ds!file_type",
"visibility1": "visibility",
"visibility2": "ds!visibility",
"upload_date1": "upload_date",
"upload_date2": "ds!upload_date",
"pretty_id": "ds!pretty_id",
"id1": "id",
"id2": "ds!id",
"document_id": "ds!document_id",
"request_id": "request_id",
"folder_name": "folder_name",
"subfolder_name": "subfolder_name",
"exempt_from_retention": "exempt_from_retention",
"bln_page_url": "bln_page_url",
"bln_total_documents": "bln_total_documents",
}
@newsroomdev (Member) commented:

For readability, don't break up the line dictionaries.

Suggested change (the original block, followed by the proposed replacement):
line = {
"site_type": "bartish", # Bartish type
"base_url": f"{parsed_url.scheme}://{parsed_url.netloc}",
"folder_id": urlparse(start_url).path.split("/")[2],
"doc_limit": 9950, # Max number of accessible docs in a folder
"page_size": 25,
"tally_field": "total_documents_count",
# "document_path": "document_scan['document_path']",
}
line["json_url"] = (
f"{line['base_url']}/client/request_documents?request_id={line['folder_id']}&page_number="
)
line["details"] = {
"document_path": "ds!document_path",
"bogus_asset_url": "asset_url",
"review_state": "review_state",
"review_status": "ds!review_status",
"severity": "ds!severity",
"findings": "ds!findings",
"file_extension": "file_extension",
"file_size": "ds!file_size",
"file_type": "ds!file_type",
"visibility1": "visibility",
"visibility2": "ds!visibility",
"upload_date1": "upload_date",
"upload_date2": "ds!upload_date",
"pretty_id": "ds!pretty_id",
"id1": "id",
"id2": "ds!id",
"document_id": "ds!document_id",
"request_id": "request_id",
"folder_name": "folder_name",
"subfolder_name": "subfolder_name",
"exempt_from_retention": "exempt_from_retention",
"bln_page_url": "bln_page_url",
"bln_total_documents": "bln_total_documents",
}
return {
"site_type": "bartish",
"base_url": f"{parsed_url.scheme}://{parsed_url.netloc}",
"folder_id": parsed_url.path.split("/")[2],
"doc_limit": BARTISH_DOC_LIMIT,
"page_size": BARTISH_PAGE_SIZE,
"tally_field": "total_documents_count",
"json_url": f"{parsed_url.scheme}://{parsed_url.netloc}/client/request_documents?request_id={parsed_url.path.split('/')[2]}&page_number=",
"details": {
"document_path": "ds!document_path",
"bogus_asset_url": "asset_url",
"review_state": "review_state",
"review_status": "ds!review_status",
"severity": "ds!severity",
"findings": "ds!findings",
"file_extension": "file_extension",
"file_size": "ds!file_size",
"file_type": "ds!file_type",
"visibility1": "visibility",
"visibility2": "ds!visibility",
"upload_date1": "upload_date",
"upload_date2": "ds!upload_date",
"pretty_id": "ds!pretty_id",
"id1": "id",
"id2": "ds!id",
"document_id": "ds!document_id",
"request_id": "request_id",
"folder_name": "folder_name",
"subfolder_name": "subfolder_name",
"exempt_from_retention": "exempt_from_retention",
"bln_page_url": "bln_page_url",
"bln_total_documents": "bln_total_documents",
},
}

@stucka (Contributor, Author) replied:

I've got those dictionaries back together by shifting one bit of code out, which somehow also required more mypy ignores.

Can you take a look and see if this works?

I think if we break this down further we're just making it harder to find code. Here the fingerprint type you want is identified by indents, and you can just scroll down. Otherwise we're basically repeating the structure with more if statements in a profile generator function, or we're creating multiple profile functions that do nothing except scatter responsibility further.
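
For readers following the design argument, here is a purely hypothetical contrast between the two shapes being debated. Neither snippet is from clean/platforms/nextrequest.py; the "/requests/" discriminator and the "lapdish" label are invented for illustration.

# Hypothetical illustration only; not code from the library. The "/requests/"
# check and the "lapdish" site type are invented for this example.
from urllib.parse import urlparse


def fingerprint_single(start_url: str) -> dict:
    """One function with a branch per site type: the shape being defended,
    where you scroll to the indent level you need."""
    parsed_url = urlparse(start_url)
    if "/requests/" in parsed_url.path:
        return {"site_type": "bartish", "page_size": 25}
    return {"site_type": "lapdish", "page_size": 25}


def _bartish_profile(parsed_url) -> dict:
    return {"site_type": "bartish", "page_size": 25}


def _lapdish_profile(parsed_url) -> dict:
    return {"site_type": "lapdish", "page_size": 25}


def fingerprint_split(start_url: str) -> dict:
    """The alternative: per-type profile builders plus a dispatcher, which the
    comment above argues just scatters responsibility."""
    parsed_url = urlparse(start_url)
    builder = _bartish_profile if "/requests/" in parsed_url.path else _lapdish_profile
    return builder(parsed_url)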

Comment on lines 77 to 154
if not force and local_cache.exists(filename):
logger.debug(f"File found in cache: {filename}")
returned_json = None
file_needs_write = False
else:
# Remember pagination here!
page_number = 1
page_url = f"{json_url}{page_number}"
r = utils.get_url(page_url)
if not r.ok:
logger.error(f"Problem downloading {page_url}: {r.status_code}")
returned_json: Dict = {} # type: ignore
file_needs_write = False
else:
returned_json = r.json()
# local_cache.write_json(filename,
file_needs_write = True
total_documents = returned_json[profile["tally_field"]]
for i, _entry in enumerate(returned_json["documents"]):
returned_json["documents"][i]["bln_page_url"] = page_url
returned_json["documents"][i]["bln_total_documents"] = total_documents
page_size = profile["page_size"]
max_pages = find_max_pages(total_documents, page_size)
sleep(throttle)
if total_documents > profile["doc_limit"]:
message = f"Request found with {total_documents:,} documents, exceeding limits. "
message += f"This is probably a bad URL that can't be properly scraped: {page_url}. "
message += "Dropping record."
logger.warning(message)
returned_json = {}
file_needs_write = False
return (filename, returned_json, file_needs_write)
if max_pages > 1:
logger.debug(f"Need to download {max_pages - 1:,} more JSON files.")
for page_number in range(2, max_pages + 1):
page_url = f"{json_url}{page_number}"
if page_number >= 200:
message = "NextRequest at least on some sites appears to have a hard limit of "
message += f"199 pages. Not trying to scrape {page_url}."
logger.warning(message)
else:
r = utils.get_url(page_url)
if not r.ok:
logger.error(
f"Problem downloading {page_url}: {r.status_code}"
)
returned_json = {}
file_needs_write = False
else:
additional_json = r.json()
if "documents" not in additional_json:
logger.error(
f"Missing 'documents' section from {page_url}"
)
returned_json = {}
file_needs_write = False
else:
for i, _entry in enumerate(
additional_json["documents"]
):
additional_json["documents"][i][
"bln_page_url"
] = page_url
additional_json["documents"][i][
"bln_total_documents"
] = total_documents
returned_json["documents"].extend(
additional_json["documents"]
)
sleep(throttle)
documents_found = len(returned_json["documents"])
if documents_found != total_documents:
message = f"Expected {total_documents:,} documents "
message += f"but got {documents_found:,} instead for "
message += f"{start_url}."
logger.warning(message)

return (filename, returned_json, file_needs_write)
@newsroomdev (Member) replied:

Nested conditionals can be rather gnarly and buggy to maintain over time, and they are often the first platform code to be rewritten.

I let gen-AI take a swing at an initial refactor as pseudo-code. In other words, the pseudo-code below is an untested suggestion, not necessarily for adoption, meant to help rethink the logic flow.

# UNTESTED GEN-AI PSEUDO-CODE FOR ILLUSTRATION
# Assumes the module-level names used elsewhere in the library (utils, logger,
# Cache, sleep, fingerprint_nextrequest, find_max_pages, plus the typing and
# pathlib imports) and a MAX_PAGES_LIMIT constant, e.g. 200, matching the hard
# page limit noted in the original code.
def fetch_nextrequest(
    base_directory: Path, start_url: str, force: bool = False, throttle: int = 2
) -> Tuple[Path, Union[None, Dict], bool]:
    """Given a link to a NextRequest documents folder, return a proposed filename and the JSON contents."""
    profile = fingerprint_nextrequest(start_url)
    folder_id = profile["folder_id"]
    json_url = profile["json_url"]

    local_cache = Cache(path=None)
    filename = base_directory / f"{folder_id}.json"
    if not force and local_cache.exists(filename):
        logger.debug(f"File found in cache: {filename}")
        return filename, None, False

    returned_json = download_json_data(json_url, profile, throttle)
    if not returned_json:
        return filename, {}, False

    file_needs_write = True
    return filename, returned_json, file_needs_write


def download_json_data(
    json_url: str, profile: Dict, throttle: int
) -> Union[None, Dict]:
    """Download JSON data from the given URL with pagination handling."""
    page_number = 1
    page_url = f"{json_url}{page_number}"
    r = utils.get_url(page_url)
    if not r.ok:
        logger.error(f"Problem downloading {page_url}: {r.status_code}")
        return None

    returned_json = r.json()
    total_documents = returned_json[profile["tally_field"]]
    add_metadata_to_documents(returned_json["documents"], page_url, total_documents)

    if total_documents > profile["doc_limit"]:
        logger.warning(
            f"Request found with {total_documents:,} documents, exceeding limits. Dropping record."
        )
        return {}

    max_pages = find_max_pages(total_documents, profile["page_size"])
    if max_pages > 1:
        logger.debug(f"Need to download {max_pages - 1:,} more JSON files.")
        for page_number in range(2, max_pages + 1):
            if page_number >= MAX_PAGES_LIMIT:
                logger.warning(
                    f"Not trying to scrape {json_url}{page_number} due to page limit."
                )
                break
            page_url = f"{json_url}{page_number}"
            r = utils.get_url(page_url)
            if not r.ok:
                logger.error(f"Problem downloading {page_url}: {r.status_code}")
                return None
            additional_json = r.json()
            if "documents" not in additional_json:
                logger.error(f"Missing 'documents' section from {page_url}")
                return None
            add_metadata_to_documents(
                additional_json["documents"], page_url, total_documents
            )
            returned_json["documents"].extend(additional_json["documents"])
            sleep(throttle)

    documents_found = len(returned_json["documents"])
    if documents_found != total_documents:
        logger.warning(
            f"Expected {total_documents:,} documents but got {documents_found:,} instead for {json_url}."
        )
    return returned_json


def add_metadata_to_documents(
    documents: List[Dict], page_url: str, total_documents: int
) -> None:
    """Add metadata to each document."""
    for doc in documents:
        doc["bln_page_url"] = page_url
        doc["bln_total_documents"] = total_documents

Review thread: clean/ca/los_angeles_pd.py (resolved)
@newsroomdev (Member) commented on Oct 22, 2024

@stucka one small merge conflict here, but if you can update this I can get it merged this week and unblock some more work. Merge conflict resolved in 6945996.

@newsroomdev (Member) left a comment

Some minor nits I'll need to clean up later for automation, but this is in a working state for a one-off, and there are other PRs.

newsroomdev merged commit 08c3e5f into dev on Oct 23, 2024 (7 checks passed)
newsroomdev deleted the platform-84 branch on October 23, 2024 at 18:35
newsroomdev added a commit that referenced this pull request on Oct 23, 2024