Skip to content

Commit

Permalink
Add scripts for nyc-open-data-catalog
Browse files Browse the repository at this point in the history
Merge pull request #17 from qri-io/nyc-open-data-catalog
  • Loading branch information
chriswhong authored Apr 17, 2020
2 parents 1ade79b + afb12e5 commit 55281b5
Show file tree
Hide file tree
Showing 10 changed files with 314 additions and 7 deletions.
3 changes: 3 additions & 0 deletions nyc-open-data-catalog/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
/tmp
/node_modules
.DS_Store
63 changes: 63 additions & 0 deletions nyc-open-data-catalog/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,63 @@
# NYC Open Data Catalog

Scripts to generate a table of metadata for all datasets on the NYC Open Data Portal.

## Background

The simple question to answer is "What the most viewed/downloaded" datasets published on the NYC Open Data Portal. It's possible to sort the catalog website by most viewed, but the download count is elusive. It's available on each dataset's landing page, but there's no quick way to see them all at once. This script gathers the download count (and other platform metadata) for each dataset and compiles them into a qri dataset.

## Approach

Use `data.json` as the list of all datasets. `data.json` is a catalog feed, and contains an abbreviated set of metadata for all of the datasets. We are ignoring just about all of it, and are only interested in the dataset ids which we can use to get each dataset's detailed metadata.

`curl https://data.cityofnewyork.us/data.json > ./tmp/nyc.json`

Now that we know all of the dataset ids, we can call the metadata API for each one:

`https://data.cityofnewyork.us/api/views/:id.json`

The following fields will be added to our new dataset:

```
id
name
attribution
averageRating
category
createdAt
description
displayType
downloadCount
hideFromCatalog
hideFromDataJson
indexUpdatedAt
newBackend
numberOfComments
oid
provenance
publicationAmmendEnabled
publicationDate
publicationGroup
publicationStage
rowClass
rowsUpdatedAt
rowsUpdatedBy
tableId
totalTimesRated
viewCount
viewLastModified
viewType
automated
dataMadePublic
updateFrequency
agency
tags
```

## Scripts

`process-datasets.js` iterates over the datasets in `data.json`, calls the metadata API, processes the response, and writes a new line to the CSV at `tmp/output.csv`

It takes only a few minutes to fetch metadata for the 2,712 datasets listed in `data.json`

`create-and-publish.js` creates a new qri dataset in the local qri store, and publishes it.
13 changes: 13 additions & 0 deletions nyc-open-data-catalog/create-qri-dataset.js
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
// creates and publishes a qri dataset from the csv in tmp/output.csv
const qri = require(`${__dirname}/../../../qri/node-qri`)

qri.save('me/catalog-metadata', {
body: `${__dirname}/tmp/output.csv`,
file: [
`${__dirname}/tmp/meta.json`,
`${__dirname}/tmp/readme.md`,
`${__dirname}/tmp/structure.json`
]
})

qri.publish('me/catalog-metadata')
11 changes: 11 additions & 0 deletions nyc-open-data-catalog/package.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
{
"name": "nyc-open-data-catalog",
"version": "1.0.0",
"main": "index.js",
"license": "MIT",
"dependencies": {
"csv-string": "^3.2.0",
"moment": "^2.24.0",
"node-fetch": "^2.6.0"
}
}
148 changes: 148 additions & 0 deletions nyc-open-data-catalog/process-datasets.js
Original file line number Diff line number Diff line change
@@ -0,0 +1,148 @@
const fs = require('fs')
const fetch = require('node-fetch')
const moment = require('moment')

const toISO8601 = (unix) => {
return moment.unix(unix).format()
}

// from https://stackoverflow.com/questions/46637955/write-a-string-containing-commas-and-double-quotes-to-csv
const sanitizeString = (desc) => {
var itemDesc;
if (desc) {
itemDesc = desc.replace(/(\r\n|\n|\r|\s+|\t| )/gm,' ');
itemDesc = itemDesc.replace(/,/g, '\,');
itemDesc = itemDesc.replace(/"/g, '""');
itemDesc = itemDesc.replace(/'/g, '\'');
itemDesc = itemDesc.replace(/ +(?= )/g,'');
} else {
itemDesc = '';
}
return `"${itemDesc}"`;
}

const fetchMetaData = async (datasetId) => {
const metadataUrl = `https://data.cityofnewyork.us/api/views/${datasetId}.json`
// get the metadata json
console.log('getting metadata...', metadataUrl)
const raw = await fetch(metadataUrl).then(d => d.json())
const {
id,
name,
attribution,
averageRating,
category,
createdAt,
description,
displayType,
downloadCount,
hideFromCatalog,
hideFromDataJson,
indexUpdatedAt,
newBackend,
numberOfComments,
oid,
provenance,
publicationAmmendEnabled,
publicationDate,
publicationGroup,
publicationStage,
rowClass,
rowsUpdatedAt,
rowsUpdatedBy,
tableId,
totalTimesRated,
viewCount,
viewLastModified,
viewType,
metadata,
tags
} = raw

// clean up the metadata

const { custom_fields } = metadata
const { Update, 'Dataset Information': datasetInformation } = custom_fields
const {
Automation: automation,
'Date Made Public': dateMadePublic,
'Update Frequency': updateFrequency
} = Update

let agency = ''
if (datasetInformation && datasetInformation.Agency) {
agency = datasetInformation.Agency
}

const tagsAsString = tags ? tags.join(';') : ''

return {
id,
name: sanitizeString(name),
attribution: sanitizeString(attribution),
averageRating,
category,
createdAt: toISO8601(createdAt),
description: sanitizeString(description),
displayType,
downloadCount,
hideFromCatalog,
hideFromDataJson,
indexUpdatedAt: toISO8601(indexUpdatedAt),
newBackend,
numberOfComments,
oid,
provenance,
publicationAmmendEnabled,
publicationDate: toISO8601(publicationDate),
publicationGroup,
publicationStage,
rowClass,
rowsUpdatedAt: toISO8601(rowsUpdatedAt),
rowsUpdatedBy,
tableId,
totalTimesRated,
viewCount,
viewLastModified: toISO8601(viewLastModified),
viewType,
automation,
dateMadePublic: sanitizeString(dateMadePublic),
updateFrequency,
agency: sanitizeString(agency),
tags: sanitizeString(tagsAsString)
}
}

(async () => {
const { dataset: catalog } = require('./tmp/nyc.json')

// const subset = catalog.slice(0, 50)
const subset = catalog

const ids = subset.map(d => d.landingPage.split('/')[4])

console.log(ids)

// const ids = ['v2kq-qrx6']

const output = fs.createWriteStream('./tmp/output.csv')
console.log(ids.length)
for (let i = 0; i < ids.length; i++) {
const id = ids[i]
try {
const metadataRow = await fetchMetaData(id)
console.log(metadataRow.description)

// add header on first result
if (i === 0) output.write(Object.keys(metadataRow).join(','))

const row = Object.keys(metadataRow).map(d => metadataRow[d]).join(',')
output.write(`\n${row}`)


} catch(e) {
console.log('SCRIPT ERROR', e)
}

}
})();
51 changes: 51 additions & 0 deletions nyc-open-data-catalog/yarn-error.log
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
Arguments:
/Users/chriswhong/.nvm/versions/node/v12.1.0/bin/node /Users/chriswhong/.yarn/bin/yarn.js init

PATH:
/Users/chriswhong/opt/anaconda3/bin:/Users/chriswhong/opt/anaconda3/condabin:/Users/chriswhong/.yarn/bin:/Users/chriswhong/.config/yarn/global/node_modules/.bin:/Users/chriswhong/google-cloud-sdk/bin:/Users/chriswhong/.nvm/versions/node/v12.1.0/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/go/bin:/Applications/qri

Yarn version:
1.21.1

Node version:
12.1.0

Platform:
darwin x64

Trace:
Error: canceled
at Interface.<anonymous> (/Users/chriswhong/.yarn/lib/cli.js:136925:13)
at Interface.emit (events.js:196:13)
at Interface._ttyWrite (readline.js:877:16)
at ReadStream.onkeypress (readline.js:189:10)
at ReadStream.emit (events.js:196:13)
at emitKeys (internal/readline.js:424:14)
at emitKeys.next (<anonymous>)
at ReadStream.onData (readline.js:1145:36)
at ReadStream.emit (events.js:196:13)
at addChunk (_stream_readable.js:290:12)

npm manifest:
{
"name": "nyc-open-data-catalog",
"version": "1.0.0",
"main": "index.js",
"license": "MIT",
"dependencies": {
"moment": "^2.24.0"
}
}

yarn manifest:
No manifest

Lockfile:
# THIS IS AN AUTOGENERATED FILE. DO NOT EDIT THIS FILE DIRECTLY.
# yarn lockfile v1


moment@^2.24.0:
version "2.24.0"
resolved "https://registry.yarnpkg.com/moment/-/moment-2.24.0.tgz#0d055d53f5052aa653c9f6eb68bb5d12bf5c2b5b"
integrity sha512-bV7f+6l2QigeBBZSM/6yTNq4P2fNpSWj/0e7jQcy87A8e7o2nAfP/34/2ky5Vw4B9S446EtIhodAzkFCcR4dQg==
18 changes: 18 additions & 0 deletions nyc-open-data-catalog/yarn.lock
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
# THIS IS AN AUTOGENERATED FILE. DO NOT EDIT THIS FILE DIRECTLY.
# yarn lockfile v1


csv-string@^3.2.0:
version "3.2.0"
resolved "https://registry.yarnpkg.com/csv-string/-/csv-string-3.2.0.tgz#d034b62dfcd10b95ff7e584401d15355805673bd"
integrity sha512-JN3iAuFJ+r7+CwF6UtP3U8ryorRkQp8NT+9VufeiRV+Xyv+Q8HPPBHGm4LAq7YihTQYmUnIeYy5CPQ8Y2GhMkg==

moment@^2.24.0:
version "2.24.0"
resolved "https://registry.yarnpkg.com/moment/-/moment-2.24.0.tgz#0d055d53f5052aa653c9f6eb68bb5d12bf5c2b5b"
integrity sha512-bV7f+6l2QigeBBZSM/6yTNq4P2fNpSWj/0e7jQcy87A8e7o2nAfP/34/2ky5Vw4B9S446EtIhodAzkFCcR4dQg==

node-fetch@^2.6.0:
version "2.6.0"
resolved "https://registry.yarnpkg.com/node-fetch/-/node-fetch-2.6.0.tgz#e633456386d4aa55863f676a7ab0daa8fdecb0fd"
integrity sha512-8dG4H5ujfvFiqDmVu9fQ5bOHUC15JMjMY/Zumv26oOvvVJjM67KF8koCWIabKQ1GJIa9r2mMZscBq/TbdOcmNA==
2 changes: 1 addition & 1 deletion socrata-etl/datasets-from-ids.js
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@ const processDataset = require('./process-dataset')

// helper function to fetch a dataset's row count
const getRowCount = async (id) => {
const res = await fetch(`https://data.cityofchicago.org/api/id/${id}.json?$select=count(*)%20as%20count`)
const res = await fetch(`https://data.cityofnewyork.us/api/id/${id}.json?$select=count(*)%20as%20count`)
.then(d => d.json())

const [{ count }] = res
Expand Down
4 changes: 2 additions & 2 deletions socrata-etl/process-dataset.js
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ const buildComponent = require('./util/build-component')

// helper function to fetch socrata metadata
const fetchMetaData = (datasetId) => {
const metadataUrl = `https://data.cityofchicago.org/api/views/${datasetId}.json`
const metadataUrl = `https://data.cityofnewyork.us/api/views/${datasetId}.json`
// get the metadata json
console.log('getting metadata...', metadataUrl)
return fetch(metadataUrl).then(d => d.json())
Expand All @@ -25,7 +25,7 @@ const processDataset = async (id) => {

const readme = buildComponent.readme(metadata)

const bodyBuffer = await fetch(`https://data.cityofchicago.org/api/views/${id}/rows.csv?accessType=DOWNLOAD`).then(d => d.arrayBuffer())
const bodyBuffer = await fetch(`https://data.cityofnewyork.us/api/views/${id}/rows.csv?accessType=DOWNLOAD`).then(d => d.arrayBuffer())


await writer(`tmp/${id}/readme.md`, readme)
Expand Down
8 changes: 4 additions & 4 deletions socrata-etl/util/build-component.js
Original file line number Diff line number Diff line change
Expand Up @@ -27,13 +27,13 @@ const meta = (sourceMetadata, datasetId) => {
email: undefined
}
],
accessUrl: `https://data.cityofchicago.org/api/views/${datasetId}/rows.csv?accessType=DOWNLOAD
accessUrl: `https://data.data.cityofnewyork.us/api/views/${datasetId}/rows.csv?accessType=DOWNLOAD
`,
createdAt,
downloadCount,
rowsUpdatedAt,
// accrualPeriodicity: metadata.custom_fields.Update['Update Frequency'].trim(),
// agency: metadata.custom_fields['Dataset Information']['Agency']
accrualPeriodicity: metadata.custom_fields.Update['Update Frequency'].trim(),
agency: metadata.custom_fields['Dataset Information']['Agency']
}
}

Expand All @@ -44,7 +44,7 @@ ${description}
## Import Details
This qri dataset was programmatically created from a dataset published on the Chicago Open Data Portal. [Original Dataset on data.cityofchicago.org/](https://data.cityofchicago.org/d/d/${id})
This qri dataset was programmatically created from a dataset published on the Chicago Open Data Portal. [Original Dataset on data.data.cityofnewyork.us/](https://data.data.cityofnewyork.us/d/d/${id})
The latest update ran on ${Date(Date.now()).toString()}`
}
Expand Down

0 comments on commit 55281b5

Please sign in to comment.