CA scraper hopelessly inefficient #652

Open
stucka opened this issue Jul 3, 2024 · 2 comments

@stucka
Contributor

stucka commented Jul 3, 2024

No description provided.

@stucka stucka self-assigned this Jul 3, 2024
@stucka
Contributor Author

stucka commented Jul 4, 2024

The scraper is ... strange. Right now, every run downloads an Excel file of recent filings. Every run also re-downloads a bunch of historic PDFs, one of which isn't parsing correctly now but whose data is already in the Excel file.

We can't possibly know whether the scraper will fail when the next generation of PDF is released.

New workaround to at least bypass some of that:
ZIP up BLN's existing cache archive (through ca branch) and include the output CSV in the ZIP.

Throw that up in Google Cloud Storage.

Take the last good CSV and put that separately in Google Cloud Storage.

Rework the scraper to not download any of those existing PDFs dated through today.

Rework the scraper to download and import the processed data from Google Cloud Storage. Import it into the scraper before writing the final output CSV.
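
The last step above, importing the archived data before writing the final CSV, could look roughly like the sketch below. It is not the scraper's actual code: the function names, the column names and the sample rows are all hypothetical, and fetching the archived CSV from Google Cloud Storage is left out (the sketch assumes the CSV text is already in hand).

```python
import csv
import io


def read_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def merge_rows(historical, fresh):
    """Combine archived and freshly scraped rows, dropping exact duplicates.

    Archived rows come first, so their order is preserved in the output.
    """
    seen = set()
    merged = []
    for row in historical + fresh:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            merged.append(row)
    return merged


def write_rows(rows, fieldnames):
    """Serialize rows back to CSV text for the final output file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data standing in for the archived CSV and a new scrape.
archived_csv = "company,notice_date\nAcme,2014-01-02\nBeta,2015-03-04\n"
fresh_csv = "company,notice_date\nBeta,2015-03-04\nGamma,2024-07-01\n"

merged = merge_rows(read_rows(archived_csv), read_rows(fresh_csv))
output = write_rows(merged, ["company", "notice_date"])
```

Deduplicating on the whole row is the simplest choice here; if the archived and fresh data format the same filing differently, a narrower key would be needed.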

This may break as soon as the scraper detects the first July 2024 layoffs and tries to parse that PDF, but at least some of the runtime, data transfer and senseless processing can be avoided.

@stucka stucka changed the title from "CA scraper down" to "CA scraper hopelessly inefficient" Jul 8, 2024
@stucka
Contributor Author

stucka commented Jul 8, 2024

The scraper's PDF parsing code contained a fatal flaw that would have blocked any future success, and I believe I have that patched up with #653.

The inefficiencies, however, remain. We're downloading PDFs that haven't changed in a decade.
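
One way to avoid re-fetching unchanged PDFs would be to check a local cache before downloading. This is a minimal sketch under assumptions: the function name, the file names and the flat cache directory layout are hypothetical, since the scraper's real cache structure is not described in this issue.

```python
import tempfile
from pathlib import Path


def pdfs_to_download(pdf_names, cache_dir):
    """Return only the PDF names that are not already present in cache_dir."""
    cache_dir = Path(cache_dir)
    return [name for name in pdf_names if not (cache_dir / name).exists()]


# Demo with a throwaway cache directory and hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "warn_2014.pdf").write_bytes(b"%PDF-")
    pending = pdfs_to_download(["warn_2014.pdf", "warn_2024.pdf"], tmp)
```

A check like this only helps if the cache persists between runs, which is what the Google Cloud Storage archive proposed earlier would provide.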
