-
Notifications
You must be signed in to change notification settings - Fork 0
/
chiPy_blog2.html
470 lines (391 loc) · 19.2 KB
/
chiPy_blog2.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="generator" content="pandoc" />
<meta http-equiv="X-UA-Compatible" content="IE=EDGE" />
<title>In Which the Statistician Sets Out to Clean Data in Python and Encounters a Parallel Universe</title>
<script src="site_libs/jquery-1.11.3/jquery.min.js"></script>
<meta name="viewport" content="width=device-width, initial-scale=1" />
<link href="site_libs/bootstrap-3.3.5/css/journal.min.css" rel="stylesheet" />
<script src="site_libs/bootstrap-3.3.5/js/bootstrap.min.js"></script>
<script src="site_libs/bootstrap-3.3.5/shim/html5shiv.min.js"></script>
<script src="site_libs/bootstrap-3.3.5/shim/respond.min.js"></script>
<script src="site_libs/navigation-1.1/tabsets.js"></script>
<link href="site_libs/highlightjs-9.12.0/textmate.css" rel="stylesheet" />
<script src="site_libs/highlightjs-9.12.0/highlight.js"></script>
<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
pre:not([class]) {
background-color: white;
}
</style>
<script type="text/javascript">
if (window.hljs) {
hljs.configure({languages: []});
hljs.initHighlightingOnLoad();
if (document.readyState && document.readyState === "complete") {
window.setTimeout(function() { hljs.initHighlighting(); }, 0);
}
}
</script>
<style type="text/css">
h1 {
font-size: 34px;
}
h1.title {
font-size: 38px;
}
h2 {
font-size: 30px;
}
h3 {
font-size: 24px;
}
h4 {
font-size: 18px;
}
h5 {
font-size: 16px;
}
h6 {
font-size: 12px;
}
.table th:not([align]) {
text-align: left;
}
</style>
<style type = "text/css">
.main-container {
max-width: 940px;
margin-left: auto;
margin-right: auto;
}
code {
color: inherit;
background-color: rgba(0, 0, 0, 0.04);
}
img {
max-width:100%;
}
.tabbed-pane {
padding-top: 12px;
}
.html-widget {
margin-bottom: 20px;
}
button.code-folding-btn:focus {
outline: none;
}
summary {
display: list-item;
}
</style>
<style type="text/css">
/* padding for bootstrap navbar */
body {
padding-top: 61px;
padding-bottom: 40px;
}
/* offset scroll position for anchor links (for fixed navbar) */
.section h1 {
padding-top: 66px;
margin-top: -66px;
}
.section h2 {
padding-top: 66px;
margin-top: -66px;
}
.section h3 {
padding-top: 66px;
margin-top: -66px;
}
.section h4 {
padding-top: 66px;
margin-top: -66px;
}
.section h5 {
padding-top: 66px;
margin-top: -66px;
}
.section h6 {
padding-top: 66px;
margin-top: -66px;
}
.dropdown-submenu {
position: relative;
}
.dropdown-submenu>.dropdown-menu {
top: 0;
left: 100%;
margin-top: -6px;
margin-left: -1px;
border-radius: 0 6px 6px 6px;
}
.dropdown-submenu:hover>.dropdown-menu {
display: block;
}
.dropdown-submenu>a:after {
display: block;
content: " ";
float: right;
width: 0;
height: 0;
border-color: transparent;
border-style: solid;
border-width: 5px 0 5px 5px;
border-left-color: #cccccc;
margin-top: 5px;
margin-right: -10px;
}
.dropdown-submenu:hover>a:after {
border-left-color: #ffffff;
}
.dropdown-submenu.pull-left {
float: none;
}
.dropdown-submenu.pull-left>.dropdown-menu {
left: -100%;
margin-left: 10px;
border-radius: 6px 0 6px 6px;
}
</style>
<script>
// manage active state of menu based on current page
$(document).ready(function () {
// active menu anchor
href = window.location.pathname
href = href.substr(href.lastIndexOf('/') + 1)
if (href === "")
href = "index.html";
var menuAnchor = $('a[href="' + href + '"]');
// mark it active
menuAnchor.parent().addClass('active');
// if it's got a parent navbar menu mark it active as well
menuAnchor.closest('li.dropdown').addClass('active');
});
</script>
<!-- tabsets -->
<style type="text/css">
.tabset-dropdown > .nav-tabs {
display: inline-table;
max-height: 500px;
min-height: 44px;
overflow-y: auto;
background: white;
border: 1px solid #ddd;
border-radius: 4px;
}
.tabset-dropdown > .nav-tabs > li.active:before {
content: "";
font-family: 'Glyphicons Halflings';
display: inline-block;
padding: 10px;
border-right: 1px solid #ddd;
}
.tabset-dropdown > .nav-tabs.nav-tabs-open > li.active:before {
content: "";
border: none;
}
.tabset-dropdown > .nav-tabs.nav-tabs-open:before {
content: "";
font-family: 'Glyphicons Halflings';
display: inline-block;
padding: 10px;
border-right: 1px solid #ddd;
}
.tabset-dropdown > .nav-tabs > li.active {
display: block;
}
.tabset-dropdown > .nav-tabs > li > a,
.tabset-dropdown > .nav-tabs > li > a:focus,
.tabset-dropdown > .nav-tabs > li > a:hover {
border: none;
display: inline-block;
border-radius: 4px;
background-color: transparent;
}
.tabset-dropdown > .nav-tabs.nav-tabs-open > li {
display: block;
float: none;
}
.tabset-dropdown > .nav-tabs > li {
display: none;
}
</style>
<!-- code folding -->
</head>
<body>
<div class="container-fluid main-container">
<div class="navbar navbar-default navbar-fixed-top" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="index.html">Kaelen L. Medeiros</a>
</div>
<div id="navbar" class="navbar-collapse collapse">
<ul class="nav navbar-nav">
<li>
<a href="contact.html">Contact</a>
</li>
<li>
<a href="CV.html">CV</a>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-expanded="false">
Blog
<span class="caret"></span>
</a>
<ul class="dropdown-menu" role="menu">
<li class="dropdown-header">May 2017</li>
<li>
<a href="chiPy_blog2.html">ChiPy Blog 2: Alternate Universe Definitions</a>
</li>
<li class="dropdown-header">April 2017</li>
<li>
<a href="chiPy_blog1.html">ChiPy Blog 1: Accidental Python</a>
</li>
<li class="dropdown-header">March 2017</li>
<li>
<a href="coming_soon.html">To be determined...</a>
</li>
</ul>
</li>
<li>
<a href="hire_me.html">Consulting</a>
</li>
</ul>
<ul class="nav navbar-nav navbar-right">
</ul>
</div><!--/.nav-collapse -->
</div><!--/.container -->
</div><!--/.navbar -->
<div class="fluid-row" id="header">
<h1 class="title toc-ignore">In Which the Statistician Sets Out to Clean Data in Python and Encounters a Parallel Universe</h1>
</div>
<p>Full disclosure, fellow ChiPys: I’m a statistician by trade. I have an MS in Biostatistics, which means I have a statistics degree but was trained in a school of public health on health data and problems, and have worked for the last year as a statistician, with a transition at the end of May planned to work as a data scientist. I’ve used SAS and R extensively in my work both in graduate school and career thus far, but very little Python. <br></p>
<p>As a result of knowing SAS and R, both cleaning the data and writing this blog were exercises in futility, in a way: much of my time was spent knowing the name of what I wanted to do, how to do it in SAS and/or R, and Googling to figure out how to do it in Python. As my mentor Arpit put it, I knew the “parallel universe definition” of most everything I wanted to do, and had to figure out the best way to implement it in Python. I initially wrote this code in a Jupyter Notebook.<br></p>
<p>As a reminder, I’m dealing with the <a href="http://www.nuforc.org/webreports.html"><strong>National UFO Reporting Center Database</strong></a> data, as downloaded from <a href="https://www.kaggle.com/NUFORC/ufo-sightings"><strong>Kaggle</strong></a>. The NUFORC got a mention on the <a href="https://www.nytimes.com/2017/04/24/science/ufo-sightings-book.html?_r=0"><strong>New York Times</strong></a> last month, so obviously I’m in the right place. Our goal for the project is to ultimately take the data out into TensorFlow and play around with doing some prediction.<br></p>
<p>On Kaggle, the data is available both as ‘Complete’ and ‘Scrubbed’. The scrubbed dataset has any entry where location not found or blank plus those with blank or erroneous time removed, and I elected to use that as I worked through the cleaning.<br></p>
<p>The biggest issue I ran into early on in cleaning: the sightings in the NUFORC database go back to the 1800s, and when using the dt.year function below on the ‘scrubbed’ dataset as downloaded, it converted many dates to ones that might seem…a little odd, like 2025, which to my knowledge has not yet happened in this particular slice of multiverse. To be totally honest–and this is the kind of thing I’ve done in my professional life as well–I cheated and used Excel to split the dates and times apart. When I did that, dt.year converted the years correctly, and I was good to go from there. One thing I’ve learned in the last year is that it isn’t always worth it to spend hours on a solution in code when a quick fix is available. This might not be the best example of a good situation to apply that principle, but I did it, and I’m okay with it.<br></p>
<p>I wanted to go through my cleaning step-by-step, but in R, you have to keep re-importing packages and the data in each code block, so I’ll put a large portion of it below, with comments in the code. This blog is intended to be a high-level overview of my cleaning, so not all of the code will be here. <br></p>
<pre class="python"><code>import pandas as pd
import numpy as np
#import the dataset, using read_csv from pandas
nuforc = pd.read_csv('./scrubbedB.csv', na_values=['.'], low_memory=False)
#data cleaning, dates
nuforc['date'] = pd.to_datetime(nuforc['date'], errors='coerce')
nuforc.insert(1, 'year', nuforc['date'].dt.year)
nuforc['year'] = nuforc['year'].fillna(0).astype(int)
nuforc.insert(2, 'month', nuforc['date'].dt.month)
nuforc.insert(3, 'day', nuforc['date'].dt.day)
#clean up other columns
nuforc['city'] = nuforc['city'].str.title()
nuforc['state'] = nuforc['state'].str.upper()
nuforc['country'] = nuforc['country'].str.upper()
nuforc['latitude'] = pd.to_numeric(nuforc['latitude'], errors='coerce')
nuforc = nuforc.rename(columns={'longitude ':'longitude'})
nuforc['date posted'] = pd.to_datetime(nuforc['date posted'], errors='coerce')
#now examine datatypes
print(nuforc.dtypes)</code></pre>
<p>I first learned, very quickly, that I should import pandas <em>as</em> pd and numpy <em>as</em> np. My first run-through cleaning, I didn’t do this, and it made life complicated to impossible. I cleaned dates and times as described above, and broke apart day, month, and year to make some graphs, which I’ll show later. <br></p>
<p>For the next chunk, I used a technique called ‘binning’ to deal with the duration in seconds of the UFO spot. The information on binning came from an excellent resource, the blog of a data scientist Chris Albon, found <a href="https://chrisalbon.com/python/pandas_binning_data.html"><strong>here</strong></a>. I based my bins in this code on the quartiles of the duration (seconds) variable, but plan to make a few different versions with this method for use in modeling, utilizing bins of different sizes. <br></p>
<pre class="python"><code>import pandas as pd
#import the dataset again
nuforc = pd.read_csv('./scrubbedB.csv', na_values=['.'], low_memory=False)
nuforc['duration (seconds)'].describe()
#HUGE range here of times.
#Define times for the bins:
bins = [0, 30, 180, 600, 97836000]
#Create group names:
group_names = ['Q1', 'Q2', 'Q3', 'Q4']
#Cut the duration (seconds) variable
duration_cats = pd.cut(nuforc['duration (seconds)'], bins, labels=group_names)
nuforc['duration_cats'] = pd.cut(nuforc['duration (seconds)'], bins, labels=group_names)
#display a few, to demonstrate:
print(duration_cats[1])
print(duration_cats[10000])</code></pre>
<p>Another interesting piece of cleaning I performed on this dataset was on the ‘shape’ variable. In the raw data, this variable has 29 unique categories (too many!), as displayed below, with frequency counts attained from the value_counts() function: <br> light 16565* <br> triangle 7865* <br> circle 7608* <br> fireball 6208* <br> other 5649* <br> unknown 5584* <br> sphere 5387* <br> disk 5213* <br> oval 3733* <br> formation 2457* <br> cigar 2057* <br> changing 1962* <br> flash 1328* <br> rectangle 1297* <br> cylinder 1283* <br> diamond 1178* <br> chevron 952* <br> egg 759* <br> teardrop 750* <br> cone 316* <br> cross 233* <br> delta 7* <br> crescent 2* <br> round 2* <br> flare 1* <br> pyramid 1* <br> dome 1* <br> hexagon 1* <br> changed 1* <br></p>
<p>As you can imagine, this isn’t helpful in any way. I decided to regroup these, because a triangle is sort of like a pyramid, changed can safely (in my opinion) be equated with changing, an egg is like a teardrop, etc. I took it down to 10 types, as listed below: <br> changing = changed, changing<br> unknown = unknown<br> other = other, cross, hexagon (I stuck shapes I didn’t know what else to do with them here)<br> light_related =light, flare flash (these are related to quality of light, and not technically a shape, but I’m rolling with it)<br> formation = formation (I think this means “multiple UFOs spotted” but I’m not totally sure)<br> triangular = triangle, delta, pyramid, cone<br> ovular = oval, cigar, teardrop, egg, fireball<br> round = circle, sphere, disk, round, dome, cylinder<br> bent = chevron, crescent<br> rectangular = rectangle, diamond <br></p>
<p>Since this is a bit of a fancy replacement, I did so by mapping values with a dictionary, as below:</p>
<pre class="python"><code>#is a lot more complicated replace than usual, so we're going to get fancy:
import pandas as pd
#import the dataset again
nuforc = pd.read_csv('./scrubbedB.csv', na_values=['.'], low_memory=False)
#dict must include every single value, even if mapping to the same category!!
dict = {'changed':'changing', 'changing':'changing', 'cross':'other', 'hexagon':'other',
'light':'light_related', 'flare':'light_related', 'flash':'light_related',
'triangle':'triangular', 'delta':'triangular', 'pyramid': 'triangular',
'cone': 'triangular', 'oval':'ovular',
'cigar':'ovular','teardrop':'ovular','egg':'ovular','circle':'round',
'sphere':'round','disk':'round','dome':'round','cylinder':'round','chevron':'bent',
'crescent':'bent', 'rectangle':'rectangular','diamond':'rectangular',
'fireball':'ovular','unknown':'unknown', 'other':'other', 'formation':'formation', 'round':'round'}
#actual mapping below:
nuforc['clean_shape']=nuforc['shape'].map(dict)
print(nuforc['clean_shape'].value_counts())</code></pre>
<p>These look a lot cleaner and easier to work with, so I hope to use these moving foward, but may also make a second with further collapsed categories. <br></p>
<p>One of the visualization-based cleaning exercises I undertook was to make some basic frequency plots in matplotlib. I wanted to see frequencies of UFO spots across both the months of the years and the days of the month to try and spot patterns. I used the documentation for Seaborn to build these, which I found very helpful and thorough.<br></p>
<pre class="python"><code>#import again, plus seaborn
import pandas as pd
import seaborn as sns
#import dataset again
nuforc = pd.read_csv('./scrubbedB.csv', na_values=['.'], low_memory=False)
#data cleaning, dates, part 2
nuforc['date'] = pd.to_datetime(nuforc['date'], errors='coerce')
nuforc.insert(1, 'year', nuforc['date'].dt.year)
nuforc['year'] = nuforc['year'].fillna(0).astype(int)
nuforc.insert(2, 'month', nuforc['date'].dt.month)
nuforc.insert(3, 'day', nuforc['date'].dt.day)
#make a histogram of months
sns.countplot(x="month", data=nuforc, palette="Greens_d");
sns.plt.show()
#histogram of days
sns.countplot(x="day", data=nuforc, palette="Greens_d");
sns.plt.show()</code></pre>
<center>
<img src="images/month.png" />
</center>
<center>
<img src="images/day.png" />
</center>
<p>I was not surprised to see the results of the Month plot, which showed the highest number of UFO spots during the United States’ summer and fall months of June through October, when I’d guess more people are outside. The low number of spots from December through February are also pretty standard. I did find it sort of strange on the Day plot that spots peak on the 1st and 15th of the month, but remain pretty constant at all other days of the month. I don’t have a working theory for that yet, but expect to have one as I go.<br></p>
<p>Again, this wasn’t all of the cleaning I conducted on this data, but is instead a highlight of the more interesting bits of cleaning for me as someone who knows SAS and R. One trend I noticed as I was working is that across nearly all the variables there are so many UFO spots of so many different types that everything is fairly uniform. This is going to prove to be a very interesting challenge when it comes to modeling, and specifically it’s a challenge that means we may not find any “useful” or “meaningful” or even “reasonable” results. We will, however, learn a lot along the way!<br></p>
<p>Next steps include more visualization, basic models (GLMs), then some machine learning models using scikitlearn and hopefuly TensorFlow. Arpit and I both hope to learn more about using TF, so that’s a joint goal we have moving into the last month of the Mentorship program. As always, remember, <a href="https://www.youtube.com/watch?v=rbBX6aEzEz8"><strong>The Truth is Out There</strong></a>.</p>
</div>
<script>
// add bootstrap table styles to pandoc tables
function bootstrapStylePandocTables() {
$('tr.header').parent('thead').parent('table').addClass('table table-condensed');
}
$(document).ready(function () {
bootstrapStylePandocTables();
});
</script>
<!-- tabsets -->
<script>
$(document).ready(function () {
window.buildTabsets("TOC");
});
$(document).ready(function () {
$('.tabset-dropdown > .nav-tabs > li').click(function () {
$(this).parent().toggleClass('nav-tabs-open')
});
});
</script>
<!-- code folding -->
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement("script");
script.type = "text/javascript";
script.src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
document.getElementsByTagName("head")[0].appendChild(script);
})();
</script>
</body>
</html>