
Commit

docs: metadata spec (#49)
* docs: metadata spec

* docs: remove refs to scrape

---------

Co-authored-by: Gerald Rich <[email protected]>
newsroomdev authored Jul 30, 2024
1 parent dc24b8e commit f689f6b
Showing 3 changed files with 69 additions and 17 deletions.
44 changes: 31 additions & 13 deletions docs/contributing.md
@@ -168,21 +168,37 @@ The file should be saved to the cache folder's `exports/` directory. In the case

The metadata file should contain an array of one or more objects with the below attributes:

- **required** `asset_url`: The URL where a file can be downloaded.
- `case_number`: A string to associate the asset with a specific incident.
- `name`: The base name of the file, minus prior path components.
- `parent_page`: The local file path in cache to the HTML page containing the `asset_url`.
- `title`: (optional) If available, this will typically be a human-friendly title for the file.
- `details`: (optional) Additional info relevant for detecting changes, e.g. modified or creation timestamps, filesize, etc.
- `filesize`: (optional) Integer file size in bytes, used to detect changes to existing assets at the same URL.
- `date`: (optional) Use an [ISO 8601](https://www.w3.org/TR/NOTE-datetime) `str`:
  - Complete date: YYYY-MM-DD (e.g. 1997-07-16)
  - Complete date plus hours and minutes: YYYY-MM-DDThh:mmTZD (e.g. 1997-07-16T19:20+01:00)
  - Complete date plus hours, minutes and seconds: YYYY-MM-DDThh:mm:ssTZD (e.g. 1997-07-16T19:20:30+01:00)

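As a quick illustration of a compliant `date` value, here is a minimal Python sketch; the specific datetime values are arbitrary placeholders, not part of the spec:

```python
# Producing a spec-compliant ISO 8601 string for the optional `date` field.
# The concrete date below is an arbitrary example value.
from datetime import datetime, timezone

date_str = datetime(1997, 7, 16, 19, 20, 30, tzinfo=timezone.utc).isoformat()
print(date_str)  # -> 1997-07-16T19:20:30+00:00
```
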
Below is an example from the `ca_san_diego_pd.json` metadata file.

##### JSON Example

```json
[
  {
    "asset_url": "https://sdpdsb1421.sandiego.gov/Sustained Findings/2022/11-21-2022 IA 2022-013/Audio/November+21%2C+2022+IA+%232022-013_Audio_Interview+Complainant_Redacted_KM.wav",
    "name": "November 21, 2022 IA #2022-013_Audio_Interview Complainant_Redacted_KM.wav",
    "parent_page": "/ca_san_diego_pd/sb16-sb1421-ab748/11-21-2022_IA_2022-013.html",
    "title": "11-21-2022 IA 2022-013",
    "case_number": "abc123",
    "details": {
      "filesize": 9999,
      "date_modified": "2024-01-01T19:20:00+01:00"
      ...
    }
  }
]
```
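For illustration only, here is a rough Python sketch of how a scraper's `scrape_meta` might assemble and save such metadata; the placeholder URL, values, and helper structure are assumptions, not this repo's actual implementation:

```python
# Hedged sketch: assembling metadata per the spec above and saving it to the
# cache folder's exports/ directory. All values here are placeholders.
import json
from pathlib import Path

CACHE_DIR = Path("~/.clean-scraper/cache").expanduser()

def scrape_meta(slug: str = "ca_san_diego_pd") -> Path:
    """Write a metadata JSON array for the given agency slug."""
    metadata = [
        {
            "asset_url": "https://example.com/records/interview.wav",  # placeholder
            "case_number": "abc123",
            "name": "interview.wav",
            "parent_page": f"/{slug}/sb16-sb1421-ab748/11-21-2022_IA_2022-013.html",
            "title": "11-21-2022 IA 2022-013",
            "details": {"filesize": 9999, "date_modified": "2024-01-01T19:20:00+01:00"},
        }
    ]
    outfile = CACHE_DIR / slug / "exports" / f"{slug}.json"
    outfile.parent.mkdir(parents=True, exist_ok=True)
    outfile.write_text(json.dumps(metadata, indent=2))
    return outfile
```
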

#### Assets
@@ -205,29 +221,31 @@ Below is an example of the folder structure we used to organize HTML pages and f

**But please use a sensible strategy. If in doubt, ping the maintainers to discuss.**

##### Filetree Example

```bash
/Users/someuser/.clean-scraper
├── cache/
│   └── ca_san_diego_pd/
│       ├── assets/
│       │   └── sb16-sb1421-ab748/
│       │       ├── 08-30-2021_IA_2021-0651/
│       │       │   ├── August_30,_2021_IA_#2021-0651_Audio_Civilian_Witness_Statement_RedactedBK_mb.wav
│       │       │   └── August_30,_2021_IA_#2021-0651_Audio_Complainant_Interview_RedactedBK_mb.wav
│       │       └── 11-21-2022_IA_2022-013/
│       │           ├── November_21,_2022_IA_#2022-013_Audio_Interview_Complainant_Redacted_KM.wav
│       │           ├── November_21,_2022_IA_#2022-013_Audio_Interview_Subject_Officer_Redacted_KM.wav
│       │           ├── November_21,_2022_IA_#2022-013_Audio_Interview_Witness_Redacted_KM.wav
│       │           ├── November_21,_2022_IA_#2022-013_Discipline_Documents_Redacted_KM.pdf
│       │           └── November_21,_2022_IA_#2022-013_Documents_Redacted_KM.pdf
│       ├── sb16-sb1421-ab748/
│       │   ├── 01-10-2022_3100_Imperial_Avenue.html
│       │   ├── 01-11-2020_IA_2020-003.html
│       │   ├── 01-13-2022_IA_2022-002.html
│       │   ├── 01-27-2021_IA_2021-001.html
│       │   ├── 02-11-2022_4900_University_Avenue.html
│       ├── exports/
│       │   └── san_diego_pd.json
```
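As a hedged sketch, path helpers like the ones below could derive this layout; the function names are hypothetical, and only the directory structure comes from the example above:

```python
# Hypothetical helpers mirroring the cache layout shown above.
from pathlib import Path

CACHE = Path("~/.clean-scraper/cache").expanduser()

def asset_path(slug: str, subdir: str, case_dir: str, filename: str) -> Path:
    """cache/<slug>/assets/<subdir>/<case_dir>/<filename> — downloaded files."""
    return CACHE / slug / "assets" / subdir / case_dir / filename

def page_path(slug: str, subdir: str, page: str) -> Path:
    """cache/<slug>/<subdir>/<page> — cached HTML pages."""
    return CACHE / slug / subdir / page
```
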

## Running the CLI
37 changes: 37 additions & 0 deletions docs/decisions/00-deprecate-scrape.md
@@ -0,0 +1,37 @@
# Deprecate `scrape` method

Removes the `scrape` method from `clean` modules in a single PR, in favor of `scrape_meta`, to focus development and testing around a JSON spec that serves as a source of truth for other scrapers and downstream analysis.

## Problems

- How do we scrape/download assets related to law enforcement accountability in a consistent manner?
- How do we associate multiple assets with a single incident reference ID?

## Proposal

This is a two-part proposal. By deprecating one method, we can focus development cycles on a reliable schema for consumers.

1. Delete the `scrape` method in individual scrapers with a single PR (GitHub will record the code)
2. Add stricter tests/types for `scrape_meta` to ensure it produces consistent results (see the sketch below)
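
One possible shape for the stricter types in step 2, sketched with `TypedDict`; the class and field names follow the metadata spec in `docs/contributing.md`, but nothing here is this repo's actual code:

```python
# Sketch of stricter typing for scrape_meta output. Field names mirror the
# metadata spec; the validation logic is an illustrative assumption.
from typing import TypedDict

class AssetDetails(TypedDict, total=False):
    filesize: int
    date_modified: str  # ISO 8601 string

class AssetMetadata(TypedDict, total=False):
    asset_url: str  # required by the spec
    case_number: str
    name: str
    parent_page: str
    title: str
    details: AssetDetails

def validate(records: list[AssetMetadata]) -> None:
    """Fail fast if a scraper emits a record missing the required field."""
    for i, record in enumerate(records):
        if not record.get("asset_url"):
            raise ValueError(f"record {i} is missing required 'asset_url'")
```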

## Implications

Please refer back to this document in discussions, code reviews, or new proposals if further implications arise.

### Pros

- Consistent outputs
- Test coverage strategy

### Cons

- Additional cognitive and testing overhead
- Upfront costs for onboarding new contributors

### Risks

- More documentation and more friction in development

## Outcome

Adopted
5 changes: 1 addition & 4 deletions docs/usage.md
@@ -29,9 +29,6 @@ You can then run a scraper for an agency using its slug:
```bash
# Scrape metadata about available files
clean-scraper scrape-meta ca_san_diego_pd
```

> **NOTE**: Always run `scrape-meta` at least once initially. It generates output required by the `scrape` subcommand.
@@ -60,5 +57,5 @@ Options:

Commands:
list List all available agencies and their slugs.
scrape-meta Command-line interface for downloading CLEAN files.
```
