This repository has been archived by the owner on Jul 15, 2023. It is now read-only.

json_normalize is slow #41

Open
jonathancstroud opened this issue Oct 1, 2018 · 2 comments
Labels
help wanted Extra attention is needed

Comments

@jonathancstroud
Contributor

Most of the data-loading time is spent in calls to pd.io.json.json_normalize. Some timing tests:

debug run
train read_csv:  0.09 seconds
test read_csv:  0.08 seconds
json normalize device:  0.14 seconds
json normalize device:  0.13 seconds
json merge train device:  0.00 seconds
json merge test device:  0.00 seconds
json normalize geoNetwork:  0.10 seconds
json normalize geoNetwork:  0.10 seconds
json merge train geoNetwork:  0.00 seconds
json merge test geoNetwork:  0.00 seconds
json normalize totals:  0.05 seconds
json normalize totals:  0.04 seconds
json merge train totals:  0.00 seconds
json merge test totals:  0.00 seconds
json normalize trafficSource:  0.07 seconds
json normalize trafficSource:  0.07 seconds
json merge train trafficSource:  0.00 seconds
json merge test trafficSource:  0.00 seconds

The output is a bit cryptic, but in this debug run each call to json_normalize takes about as long as reading the data off disk (0.04 to 0.14 seconds per call, versus 0.08 to 0.09 seconds for read_csv). Merging the normalized columns back into the rest of the dataset is essentially free here, and reading the data off disk is unavoidable. Therefore, we should try to make json_normalize faster or avoid calling it entirely.
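One way to avoid json_normalize could look like the sketch below. It assumes each cell in a column such as device holds a flat JSON object encoded as a string (no nested objects); that is an assumption about the data, not something confirmed in this issue. The helper name flatten_json_column and the `column.key` naming scheme are made up for illustration. Whether this is actually faster would need to be measured on the real data.

```python
import json

import pandas as pd


def flatten_json_column(df, column):
    """Expand a column of flat JSON strings into one column per key.

    Hypothetical helper: assumes every cell is a JSON object string
    without nested objects. Nested values would stay as Python dicts.
    """
    # Parse each cell once with the standard-library JSON decoder.
    records = [json.loads(cell) for cell in df[column]]
    # Build the frame directly from the list of dicts, keeping the
    # original index so the join below lines rows up correctly.
    flat = pd.DataFrame(records, index=df.index)
    flat.columns = [f"{column}.{key}" for key in flat.columns]
    # Replace the raw JSON column with the expanded columns.
    return df.drop(columns=[column]).join(flat)


if __name__ == "__main__":
    # Tiny made-up example, not taken from the real dataset.
    sample = pd.DataFrame(
        {
            "id": [1, 2],
            "device": [
                '{"browser": "Chrome", "isMobile": false}',
                '{"browser": "Safari", "isMobile": true}',
            ],
        }
    )
    print(flatten_json_column(sample, "device"))
```

Because the expanded frame keeps the original index, the join replaces the separate merge step for that column.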

jonathancstroud added the help wanted label Oct 1, 2018
@jonathancstroud
Contributor Author

Timing with the full dataset on a MacBook:

train read_csv: 239.25 seconds
test read_csv: 438.94 seconds
json normalize device: 392.48 seconds
json normalize device: 352.39 seconds
json merge train device: 122.27 seconds
json merge test device: 143.13 seconds
json normalize geoNetwork: 569.97 seconds
json normalize geoNetwork: 307.62 seconds
json merge train geoNetwork: 246.24 seconds
json merge test geoNetwork: 187.50 seconds
json normalize totals: 418.41 seconds
json normalize totals: 95.98 seconds
json merge train totals: 189.61 seconds
json merge test totals: 164.21 seconds
json normalize trafficSource: 246.83 seconds
json normalize trafficSource: 412.85 seconds
json merge train trafficSource: 226.35 seconds
[cut off because my laptop froze]
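Both logs print two identical "json normalize" lines per column, presumably one for train and one for test. A small timer that puts the split name in every label would make the output easier to read. This is a minimal sketch using only the standard library; the name timed and its arguments are hypothetical and not taken from this repository.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(step, split, column=""):
    """Print how long the enclosed block took, labeled by step, split and column."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        # Skip empty parts so "read_csv" lines need no column name.
        label = " ".join(part for part in (step, split, column) if part)
        print(f"{label}: {elapsed:.2f} seconds")


if __name__ == "__main__":
    # Placeholder work; in practice the body would be the real call.
    with timed("json normalize", "train", "device"):
        sum(range(1000))
    with timed("read_csv", "test"):
        sum(range(1000))
```

Each line then reads like "json normalize train device: ... seconds", so train and test timings can be told apart at a glance.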
