CivicPlusSite needs better handling for names of downloaded files #96

zstumgoren · 2021-08-07T02:48:32Z

On a test scrape of Belvedere, CA covering roughly June through early August, the scraper generated unhelpful names for downloaded files:

/tmp/civic_scraper/
├── assets
│   ├── civicplus_www_06142021-577_agenda.html
│   ├── civicplus_www_06142021-577_agenda.pdf
│   ├── civicplus_www_06142021-577_agenda_packet.pdf
│   ├── civicplus_www_06142021-577_minutes.pdf
│   ├── civicplus_www_06152021-578_agenda.html
<<< snipped >>>

This appears to stem from our handling of the meeting_id variable, which is used in Asset.download to generate the file name.

We need to debug this for this locale and/or adopt an alternate convention for standardizing file names in CivicPlusSite (and generally).

An ideal solution would be storing file artifacts based on a combination of place, agency, date of meeting, committee type, document type and document format (i.e. the file suffix). For example:

# Note, place may need more careful handling
/tmp/civic_scraper/assets/ca_belvedere/20210604_city_council_agenda_packet.pdf
/tmp/civic_scraper/assets/ca_belvedere/20210604_city_council_agenda_packet.html
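
A minimal sketch of how such a path could be assembled, assuming the state, place, meeting date, committee name and document type are all available for a given asset. The `slugify` and `standardized_asset_path` helpers and their parameters are hypothetical and are not part of the existing civic-scraper API:

```python
import re
from datetime import date
from pathlib import Path


def slugify(text: str) -> str:
    """Lowercase the text and collapse runs of non-alphanumerics into underscores."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def standardized_asset_path(base_dir, state, place, meeting_date, committee, asset_type, suffix):
    """Build <base_dir>/<state>_<place>/<YYYYMMDD>_<committee>_<asset_type>.<suffix>.

    Hypothetical helper: every argument is assumed to be known for the asset,
    which may not hold on every platform.
    """
    place_dir = f"{slugify(state)}_{slugify(place)}"
    extension = suffix.lstrip(".").lower()
    file_name = f"{meeting_date:%Y%m%d}_{slugify(committee)}_{slugify(asset_type)}.{extension}"
    return Path(base_dir) / place_dir / file_name


path = standardized_asset_path(
    "/tmp/civic_scraper/assets",
    state="CA",
    place="Belvedere",
    meeting_date=date(2021, 6, 4),
    committee="City Council",
    asset_type="Agenda Packet",
    suffix=".pdf",
)
print(path.as_posix())
```

Slugifying each component separately keeps the underscore-delimited fields predictable, though two committees whose names differ only in punctuation would collide under this scheme.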

We likely won't have all of this information for every platform, so we may need platform-specific solutions.

Or we can go in a totally different direction and just generate unique names based on a file hash, and then use asset metadata (e.g. stored in the metadata CSV) to link given files with their unique names.
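
The hash-based alternative could look roughly like the sketch below, assuming the file content is available as bytes at download time. The `hashed_file_name` and `metadata_row` helpers and the `file_name` column are hypothetical; SHA-256 is one reasonable choice of hash, not something the project has settled on:

```python
import hashlib


def hashed_file_name(content: bytes, suffix: str) -> str:
    """Name a file after the SHA-256 digest of its content, keeping the suffix."""
    digest = hashlib.sha256(content).hexdigest()
    return f"{digest}.{suffix.lstrip('.').lower()}"


def metadata_row(content: bytes, suffix: str, **metadata) -> dict:
    """Return a metadata record that links descriptive fields to the hashed name.

    The "file_name" key is an assumed column; a real metadata CSV may differ.
    """
    return {**metadata, "file_name": hashed_file_name(content, suffix)}


row = metadata_row(
    b"%PDF-1.4 example bytes",
    ".pdf",
    place="belvedere",
    state="ca",
    meeting_date="2021-06-04",
    asset_type="agenda_packet",
)
print(row["file_name"])
```

Identical content always maps to the same name, so re-downloading an unchanged document would not create a duplicate file, at the cost of names that are meaningless without the metadata.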

zstumgoren added the bug label Aug 7, 2021