
Users should be able to initiate a COPA Scraping job #56

Open · 9 of 13 tasks
adesca opened this issue Aug 9, 2019 · 3 comments

adesca commented Aug 9, 2019

Overarching goal: A user should be able to trigger a process on the server that pulls data from the COPA website and imports new Allegations into the database.

Things to keep in mind:

Goals:

  • From the UI, a user should be able to initiate a COPA job and be redirected to the live status page after it starts
    • Stages of a COPA job:
      • Initial data - download data from the COPA site and store it in Google Cloud Storage under initial_data/
      • Phantom rows:
        • Clean - Split the data based on assignment (see the first sketch after this list).
          • All rows with assignment 'copa' should be saved under cleaned/copa.csv
          • All rows without assignment 'copa' should be saved under cleaned/other-assignment.csv
        • Transform - Create database rows from the raw data (see the second sketch after this list).
          • All rows with the assignment 'copa' should be transformed into data_allegation rows and saved under transformed/copa.csv
          • A separate csv should be made that ties log_no from copa to all information in the original dataset that is not part of the data_allegation row, and saved under transformed/misc-data.csv
          • Note: not all of the columns will be filled
          • Note: If any rows cannot be transformed, they should be saved under errors/transform_error.csv and shown in the UI. NEEDS WORK TO DETECT A FAILED TRANSFORM AND PRODUCE THE ERROR FILE. THE ONLY WAY THE TRANSFORM CAN FAIL IS IF API ENDPOINTS ARE REMOVED OR CHANGED; NO ROWS WILL BE RETURNED IN THOSE CASES, SO OUR ERROR FILE SHOULD DESCRIBE THE API ERROR GIVEN
        • Augment - Replace columns with foreign key references
          • All transformed copa rows should replace the current category column with a reference to the data_allegationcategory table for that particular category
          • Note: A row failing to augment should not end the pipeline
          • Note: if any rows cannot be augmented, they should be saved under errors/augment_failures.csv and shown in the UI
        • Load - Load augmented rows into the database (see the third sketch after this list).
          • Check if there already exists a row with that log_no.
            • If there is, verify that all the data matches, including:
              • finding_code should match against data_officerallegation.final_finding. REQUIRES A NEW ENTITY TYPE (data_officerallegation) PLUS LOGIC TO MATCH the scraped copa column final_finding with data_officerallegation.final_finding
              • All data fields that are in data_allegation and also in the copa response (log_no, current_category, beat)
            • If any data does not match then save it as the file "changed-allegation.csv" under errors/.
              • This should appear in the UI - able to use loader.changed_allegations. NEEDS UI PAGE TO DISPLAY DATA SAVED ON THE loader OBJECT
            • If all data matches, disregard this row
          • If no row with that log_no exists, add the row
          • Note: if any rows cannot be turned into an entity object, they should be saved under errors/entity_failures.csv and shown in the UI
      • Data validation:
        • Check for missing records
          • Check if any allegations present in copa are missing from the original database
          • Display a list of these in the UI - able to use loader.db_rows_added. NEEDS UI PAGE TO DISPLAY DATA SAVED ON THE loader OBJECT
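
A minimal sketch of the Clean stage, assuming the scraped CSV has an assignment column and using plain pandas with local paths as a stand-in for the project's real storage helpers (the commit log mentions a store_string function, which may differ):

```python
# Hypothetical Clean stage: split the initial COPA download by assignment.
# The column name "assignment" and the local file paths are assumptions; the
# real pipeline writes to Google Cloud Storage.
import pandas as pd

def clean(initial_csv_path: str, out_dir: str = "cleaned") -> None:
    df = pd.read_csv(initial_csv_path)
    is_copa = df["assignment"].str.lower() == "copa"
    df[is_copa].to_csv(f"{out_dir}/copa.csv", index=False)
    df[~is_copa].to_csv(f"{out_dir}/other-assignment.csv", index=False)
```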
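
A sketch of the Transform stage under the same assumptions; the data_allegation column list is a guess based on the fields named in this issue (log_no, current_category, beat), and the API-failure check follows the note above that a failed transform shows up as an empty response:

```python
# Hypothetical Transform stage: shape cleaned COPA rows into data_allegation
# rows, split the leftover columns into misc-data.csv keyed by log_no, and
# record an empty API response as a transform error.
import pandas as pd

DATA_ALLEGATION_COLS = ["log_no", "current_category", "beat"]  # assumed subset

def transform(cleaned_csv_path: str, out_dir: str = "transformed",
              err_dir: str = "errors") -> None:
    df = pd.read_csv(cleaned_csv_path)
    if df.empty:
        # Per the issue, the realistic failure mode is a removed or changed
        # API endpoint, which returns no rows at all.
        error = pd.DataFrame([{"error": "COPA API returned no rows"}])
        error.to_csv(f"{err_dir}/transform_error.csv", index=False)
        return
    df[DATA_ALLEGATION_COLS].to_csv(f"{out_dir}/copa.csv", index=False)
    misc_cols = ["log_no"] + [c for c in df.columns if c not in DATA_ALLEGATION_COLS]
    df[misc_cols].to_csv(f"{out_dir}/misc-data.csv", index=False)
```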
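
A sketch of the Load stage and its data-validation hooks, again with assumed names: the dict-based existing_rows stands in for real database queries, and the loader attributes mirror the loader.changed_allegations and loader.db_rows_added fields referenced above:

```python
# Hypothetical Load stage: collect new allegations, disregard exact matches,
# and write mismatches to errors/changed-allegation.csv for the UI.
import csv

COMPARED_FIELDS = ["log_no", "current_category", "beat"]  # fields in both sources

class Loader:
    def __init__(self, existing_rows):
        # existing_rows: dicts already in data_allegation, keyed by log_no here
        self.existing = {r["log_no"]: r for r in existing_rows}
        self.changed_allegations = []  # mismatched rows, shown in the UI
        self.db_rows_added = []        # new rows, used by the validation screen

    def load(self, augmented_rows):
        for row in augmented_rows:
            current = self.existing.get(row["log_no"])
            if current is None:
                self.db_rows_added.append(row)  # real code would INSERT here
            elif any(current.get(f) != row.get(f) for f in COMPARED_FIELDS):
                self.changed_allegations.append(row)
            # existing rows whose fields all match are disregarded
        if self.changed_allegations:
            with open("errors/changed-allegation.csv", "w", newline="") as f:
                writer = csv.DictWriter(f, fieldnames=COMPARED_FIELDS)
                writer.writeheader()
                for r in self.changed_allegations:
                    writer.writerow({f: r.get(f) for f in COMPARED_FIELDS})
```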

The business need:

From Rajiv:
The primary purpose of this COPA Data Portal data capture step is to create incomplete/phantom complaint records in our database (for new complaints since our last successful FOIA response) so that we can have some matching data for the new documents that are being picked up by our crawlers/scrapers (https://cpdp.co/crawlers and https://cpdp.co/documents).

The second purpose is to compare against the data that we have received via FOIA responses to see whether we are missing any records (i.e., were any responsive complaint records omitted from our original dataset and if so which ones).

The third purpose is to compare different versions/snapshots of the portal data over time and see what’s changing (is it just new records being added on to the end, or are older records being added, or removed, or altered).

From Basecamp:
The Civilian Office of Police Accountability (COPA) has just posted a new live data feed to the City's Open Data Portal that goes back 10 years. Here are a few early questions to investigate.

  • Are there CRs that appear here during the comparable time period (i.e., before October 2016) that don't appear in our FOIA'd datasets (which were produced in October 2016)? If so, how many and are there any revealing common characteristics amongst them to suggest why they may have been excluded from the dataset we received in response to our FOIA requests but not excluded from this public release on the City's public data portal. More likely is the inverse, i.e., complaints that we know of through our FOIA request, but that were excluded from the City's public data portal even during the overlapping time period of November 2007 – November 2016.
  • For all the CRs that exist both in the City Data Portal and in our FOIA'd datasets, how many rows have conflicting values for the dynamic data fields, such as CURRENT_STATUS (which we expect to change over time for open cases), and for data fields that we might not expect to change, such as COMPLAINT_DATE? What can we learn from any patterns amongst these kinds of unexpected discrepancies, particularly when they occur in cases that are already closed?
  • Are there any reasons not to import all these data and overwrite the conflicting fields in our existing dataset with more "up-to-date" information from the City's data portal (of course, any new CRs would be missing all officer-identifying data and other fields that are not being published to the data portal, until our next FOIA request)? The City Data Portal has a relatively robust API and supports numerous open standards for public APIs. Can we do all this importing and merging programmatically and run it on the Civis Platform on a routine basis? Is there any equivalent to cron built into the Platform? Apart from sanity checks, what kinds of issues will we run into that require human intervention/judgment (no officer-identifying data also means no officer profile matching challenges)?
colin-parsons commented
Status indicators

  • Stage names should be unordered list items
  • Stage names should be green when the stage successfully completed for all rows
  • Stage names should be red when the stage experienced an error for one or more rows
    • Stage names should show the number of errored rows next to the stage name in the format "([errors]/[total rows])"
    • When a stage errors, a button should appear beside the name that says "show errors"; when clicked, the button should show an error summary right beneath the stage name
  • Stage names should flash when the stage is in progress for one or more rows
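
A tiny sketch of the labeling rules above, written in Python for consistency with the pipeline code (the real implementation would live in the React front end, and the function name is hypothetical):

```python
# Hypothetical helper implementing the status-indicator rules: flashing while
# any row is still in progress, red with an "(errors/total)" count when any
# row errored, green when every row succeeded.

def stage_label(name: str, errors: int, done: int, total: int) -> str:
    if done + errors < total:
        return f"{name} [flashing]"          # stage still in progress
    if errors > 0:
        return f"{name} ({errors}/{total}) [red]"
    return f"{name} [green]"

print(stage_label("Transform", errors=2, done=98, total=100))  # Transform (2/100) [red]
```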

colin-parsons commented Aug 9, 2019

Tasks done in e2e:

  • A user can start the copa process
  • A user can show errors on errored transformations
  • Check that the files are stored with the proper data
  • Check the data validation screen

j0vanka pushed a commit that referenced this issue Aug 26, 2019
tw-jeff-burroughs added a commit that referenced this issue Aug 26, 2019
* [56] WIP: transform complete and tested, ready for augmentation step

* [56] WIP: return None if there is no file to read

* [56] WIP: use try except for trying to open file for reading, return None otherwise

* [56] WIP: mock.call_count too high in build, attempt to unpatch functions between tests to ensure correct count

* [56] WIP: Update transform test to ensure correct data is being passed to store_string calls

* Update .travis.yml

* Update .travis.yml

* Update .travis.yml

* [#56] completed transformation

* [56] fixed flake8 issues

* Update .travis.yml, fix parsing issue on travis-ci.com
tw-jeff-burroughs commented
AUGMENTATION:

Table data_allegationcategory

| Id | category_name |
|----|---------------|
| 1  | category1     |
| 2  | category2     |
| 3  | category3     |

Table data_allegation (pre-augment)

| cr_id  | … | current_category |
|--------|---|------------------|
| 123123 | … | category2        |
| 123124 | … | category3        |
| 123125 | … | category1        |

augment()
For each row in data_allegation, look up the id of the category listed under current_category, then replace the value of current_category with the looked-up id.

Table data_allegation (post-augment)

| cr_id  | … | current_category |
|--------|---|------------------|
| 123123 | … | 2                |
| 123124 | … | 3                |
| 123125 | … | 1                |
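
A minimal runnable sketch of augment() as described above, with dict-based inputs standing in for the real database tables; the failures list echoes the errors/augment_failures.csv requirement from the task list:

```python
# Hypothetical augment(): swap each row's current_category name for the
# matching data_allegationcategory id; collect failed lookups without
# stopping the pipeline (per the issue notes).

def augment(allegation_rows, category_rows, failures=None):
    ids_by_name = {c["category_name"]: c["id"] for c in category_rows}
    augmented = []
    for row in allegation_rows:
        category_id = ids_by_name.get(row["current_category"])
        if category_id is None:
            # Real code would write these rows to errors/augment_failures.csv.
            if failures is not None:
                failures.append(row)
            continue
        augmented.append({**row, "current_category": category_id})
    return augmented

categories = [{"id": 1, "category_name": "category1"},
              {"id": 2, "category_name": "category2"},
              {"id": 3, "category_name": "category3"}]
allegations = [{"cr_id": 123123, "current_category": "category2"},
               {"cr_id": 123124, "current_category": "category3"},
               {"cr_id": 123125, "current_category": "category1"}]
print(augment(allegations, categories))
# [{'cr_id': 123123, 'current_category': 2}, {'cr_id': 123124, 'current_category': 3}, ...]
```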

tw-jeff-burroughs added a commit that referenced this issue Sep 23, 2019
j0vanka pushed a commit that referenced this issue Sep 23, 2019
j0vanka pushed a commit that referenced this issue Sep 23, 2019
j0vanka pushed a commit that referenced this issue Sep 24, 2019
tw-jeff-burroughs added a commit that referenced this issue Oct 1, 2019
…object, copa scrape transformer add api error handling and update tests
colin-parsons added a commit that referenced this issue Oct 4, 2019
…ts in relation to the added functions in copa_scrape_transformer
j0vanka pushed a commit that referenced this issue Oct 7, 2019
KyleDolezal added a commit that referenced this issue Oct 8, 2019
…not-copa data; and storing errors. Removed commented code. Added tests for the above.
KyleDolezal added a commit that referenced this issue Oct 10, 2019
… (1) copa scrape yields error; (2) not-copa scrape yields error; (3) both scrapes yield errors; and (4) no scrape contains errors
KyleDolezal added a commit that referenced this issue Oct 10, 2019
colin-parsons reopened this Oct 15, 2019
colin-parsons added a commit that referenced this issue Oct 15, 2019
…sts fails while the other succeeds"

This reverts commit e1b000d.
j0vanka added a commit that referenced this issue Nov 25, 2019
* [#56] WIP: Add test for adding augmented copa record to db

* Reformatted test within test_augment.py

* [Daisy] Debugging commit

* [#56][Thalia/Everyone] Fix mypy commit error

* [#56A] [Clari and Daisy] added react components header, and tab

* [#56A][Clari, Thalia and Daisy] Added CSS style sheet for tabs and header.

* [Daisy Octavia and Jovanka][#56A] Cleaned up CSS and finished applying proper front end design to header

* [Octavia, Jovanka, and Daisy][#56A] Added components and css styling for button

* [#56A][JK]WIP: add bg image and footer

* [#56A][JK] WIP: bg image fix; styling

* [#56A][Jole] WIP: add FOIA tab/placeholder; route; header included in status page and FOIA placeholder page for navigation

* [#56][Jole] fix: failing test due to unwrapped link in Tab component