<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="pandoc" />
<title>The many flavors of R objects</title>
<script src="libs/jquery-1.11.3/jquery.min.js"></script>
<meta name="viewport" content="width=device-width, initial-scale=1" />
<link href="libs/bootstrap-3.3.5/css/bootstrap.min.css" rel="stylesheet" />
<script src="libs/bootstrap-3.3.5/js/bootstrap.min.js"></script>
<script src="libs/bootstrap-3.3.5/shim/html5shiv.min.js"></script>
<script src="libs/bootstrap-3.3.5/shim/respond.min.js"></script>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-68219208-1', 'auto');
ga('send', 'pageview');
</script>
<style type="text/css">code{white-space: pre;}</style>
<link rel="stylesheet"
href="libs/highlight/default.css"
type="text/css" />
<script src="libs/highlight/highlight.js"></script>
<style type="text/css">
pre:not([class]) {
background-color: white;
}
</style>
<script type="text/javascript">
if (window.hljs && document.readyState && document.readyState === "complete") {
window.setTimeout(function() {
hljs.initHighlighting();
}, 0);
}
</script>
<style type="text/css">
h1 {
font-size: 34px;
}
h1.title {
font-size: 38px;
}
h2 {
font-size: 30px;
}
h3 {
font-size: 24px;
}
h4 {
font-size: 18px;
}
h5 {
font-size: 16px;
}
h6 {
font-size: 12px;
}
.table th:not([align]) {
text-align: left;
}
</style>
<link rel="stylesheet" href="libs/local/main.css" type="text/css" />
<link rel="stylesheet" href="libs/local/nav.css" type="text/css" />
<link rel="stylesheet" href="//netdna.bootstrapcdn.com/font-awesome/4.0.3/css/font-awesome.css" type="text/css" />
</head>
<body>
<style type = "text/css">
.main-container {
max-width: 940px;
margin-left: auto;
margin-right: auto;
}
code {
color: inherit;
background-color: rgba(0, 0, 0, 0.04);
}
img {
max-width:100%;
height: auto;
}
.tabbed-pane {
padding-top: 12px;
}
button.code-folding-btn:focus {
outline: none;
}
</style>
<div class="container-fluid main-container">
<!-- tabsets -->
<script src="libs/navigation-1.1/tabsets.js"></script>
<script>
$(document).ready(function () {
window.buildTabsets("TOC");
});
</script>
<!-- code folding -->
<header>
<div class="nav">
<a class="nav-logo" href="index.html">
<img src="static/img/stat545-logo-s.png" width="70px" height="70px"/>
</a>
<ul>
<li class="home"><a href="index.html">Home</a></li>
<li class="faq"><a href="faq.html">FAQ</a></li>
<li class="syllabus"><a href="syllabus.html">Syllabus</a></li>
<li class="topics"><a href="topics.html">Topics</a></li>
<li class="people"><a href="people.html">People</a></li>
</ul>
</div>
</header>
<div class="fluid-row" id="header">
<h1 class="title toc-ignore">The many flavors of R objects</h1>
</div>
<div id="TOC">
<ul>
<li><a href="#vectors-are-everywhere">Vectors are everywhere</a></li>
<li><a href="#indexing-a-vector">Indexing a vector</a></li>
<li><a href="#lists-hold-just-about-anything">lists hold just about anything</a></li>
<li><a href="#creating-a-data.frame-explicitly">Creating a data.frame explicitly</a></li>
<li><a href="#indexing-arrays-e.g.matrices">Indexing arrays, e.g. matrices</a></li>
<li><a href="#creating-arrays-e.g.matrices">Creating arrays, e.g. matrices</a></li>
<li><a href="#putting-it-all-together-implications-for-data.frames">Putting it all together … implications for data.frames</a></li>
<li><a href="#table-of-atomic-r-object-flavors">Table of atomic R object flavors</a></li>
</ul>
</div>
<blockquote>
<p>“Rigor and clarity are not synonymous” – Larry Wasserman</p>
</blockquote>
<blockquote>
<p>“Never hesitate to sacrifice truth for clarity.” – Greg Wilson’s dad</p>
</blockquote>
<div id="vectors-are-everywhere" class="section level3">
<h3>Vectors are everywhere</h3>
<p>Your garden variety R object is a vector. A single piece of info that you regard as a scalar is just a vector of length 1 and R will cheerfully let you add stuff to it. Square brackets are used for isolating elements of a vector for inspection, modification, etc. This is often called <strong>indexing</strong>. Go through the following code carefully, as it’s really rather surprising. BTW, indexing begins at 1 in R, unlike many other languages that index from 0.</p>
<pre class="r"><code>x <- 3 * 4
x
#> [1] 12
is.vector(x)
#> [1] TRUE
length(x)
#> [1] 1
x[2] <- 100
x
#> [1] 12 100
x[5] <- 3
x
#> [1] 12 100 NA NA 3
x[11]
#> [1] NA
x[0]
#> numeric(0)</code></pre>
<p>R is built to work with vectors. Many operations are <em>vectorized</em>, i.e. by default they will happen component-wise when given a vector as input. Novices often don’t internalize or exploit this and they write lots of unnecessary <code>for</code> loops.</p>
<pre class="r"><code>x <- 1:4
## which would you rather write and read?
## the vectorized version ...
(y <- x^2)
#> [1] 1 4 9 16
## or the for loop version?
z <- vector(mode = mode(x), length = length(x))
for(i in seq_along(x)) {
z[i] <- x[i]^2
}
identical(y, z)
#> [1] TRUE</code></pre>
<p>When reading function documentation, keep your eyes peeled for arguments that can be vectors. You’ll be surprised how common they are. For example, the mean and standard deviation of random normal variates can be provided as vectors.</p>
<pre class="r"><code>set.seed(1999)
rnorm(5, mean = 10^(1:5))
#> [1] 10.73267 99.96217 1001.20301 10001.46980 100000.13369
round(rnorm(5, sd = 10^(0:4)), 2)
#> [1] 0.52 -5.49 -118.56 -1147.28 11607.42</code></pre>
<p>This could be awesome in some settings, but dangerous in others, i.e. if you exploit this by mistake and get no warning. This is one of the reasons it’s so important to keep close tabs on your R objects: are they what you expect in terms of their flavor and length or dimensions? Check early and check often.</p>
<p>Notice that R also recycles vectors if they are not the necessary length. You will get a warning if R suspects recycling is unintended, i.e. when one length is not an integer multiple of another, but recycling is silent if it seems like you know what you’re doing. This can be a beautiful thing when you’re doing it deliberately, but devastating when you’re not.</p>
<blockquote>
<p>Question: is there a way to turn recycling off? Not that I know of.</p>
</blockquote>
<pre class="r"><code>(y <- 1:3)
#> [1] 1 2 3
(z <- 3:7)
#> [1] 3 4 5 6 7
y + z
#> Warning in y + z: longer object length is not a multiple of shorter object
#> length
#> [1] 4 6 8 7 9
(y <- 1:10)
#> [1] 1 2 3 4 5 6 7 8 9 10
(z <- 3:7)
#> [1] 3 4 5 6 7
y + z
#> [1] 4 6 8 10 12 9 11 13 15 17</code></pre>
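<p>Recycling itself can’t be switched off, as far as I know, but you can guard against it. Here is a minimal sketch, assuming a hypothetical helper called <code>strict_add()</code> (my invention, not a base R function) that refuses to add two vectors unless their lengths match or one of them has length 1.</p>
<pre class="r"><code>## hypothetical helper, not part of base R: add two vectors only when
## their lengths match or one of them is a "scalar" of length 1
strict_add <- function(a, b) {
  ok <- length(a) == length(b) || length(a) == 1L || length(b) == 1L
  if (!ok) {
    stop("lengths differ: ", length(a), " vs ", length(b))
  }
  a + b
}
strict_add(1:3, 10)          # length 1 recycling is still allowed
#> [1] 11 12 13
strict_add(1:3, 4:6)
#> [1] 5 7 9
res <- try(strict_add(1:10, 3:7), silent = TRUE)
inherits(res, "try-error")   # the silent recycling case is now an error
#> [1] TRUE</code></pre>
<p>Whether length 1 inputs should be allowed through is a design choice; tighten the check if even that is too permissive for your analysis.</p>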
<p>The catenate function <code>c()</code> is your go-to function for making vectors.</p>
<pre class="r"><code>str(c("hello", "world"))
#> chr [1:2] "hello" "world"
str(c(1:3, 100, 150))
#> num [1:5] 1 2 3 100 150</code></pre>
<p>Plain vanilla R objects are called “atomic vectors” and an absolute requirement is that all the bits of info they hold are of the same flavor, i.e. all numeric or logical or character. If that’s not already true upon creation, the elements will be coerced to the same flavor, using a “lowest common denominator” approach (usually character). This is another stellar opportunity for you to create an object of one flavor without meaning to do so and to remain ignorant of that for a long time. Check early, check often.</p>
<pre class="r"><code>(x <- c("cabbage", pi, TRUE, 4.3))
#> [1] "cabbage" "3.14159265358979" "TRUE"
#> [4] "4.3"
str(x)
#> chr [1:4] "cabbage" "3.14159265358979" "TRUE" "4.3"
length(x)
#> [1] 4
mode(x)
#> [1] "character"
class(x)
#> [1] "character"</code></pre>
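<p>For the common flavors, the “lowest common denominator” ordering runs logical, then integer, then double, then character. The small sketch below, using made-up values, shows each step of that ladder with <code>typeof()</code>.</p>
<pre class="r"><code>typeof(c(TRUE, 2L))          # logical + integer  -> integer
#> [1] "integer"
typeof(c(2L, 3.5))           # integer + double   -> double
#> [1] "double"
typeof(c(3.5, "cabbage"))    # double + character -> character
#> [1] "character"
c(TRUE, 2L, 3.5)             # TRUE becomes 1
#> [1] 1.0 2.0 3.5
c(TRUE, "cabbage")           # TRUE becomes the string "TRUE"
#> [1] "TRUE" "cabbage"</code></pre>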
<p>The most important atomic vector types are</p>
<ul>
<li>logical: TRUEs and FALSEs, easily coerced into 1s and 0s</li>
<li>numeric: numbers and, yes, integers and double-precision floating point numbers are different but you can live happily for a long time without worrying about this</li>
<li>character</li>
</ul>
<p>Let’s create some simple vectors for more demos below.</p>
<pre class="r"><code>n <- 8
set.seed(1)
(w <- round(rnorm(n), 2)) # numeric floating point
#> [1] -0.63 0.18 -0.84 1.60 0.33 -0.82 0.49 0.74
(x <- 1:n) # numeric integer
#> [1] 1 2 3 4 5 6 7 8
## by hand, x <- c(1, 2, 3, 4, 5, 6, 7, 8) gives the same values, but stored as double
(y <- LETTERS[1:n]) # character
#> [1] "A" "B" "C" "D" "E" "F" "G" "H"
(z <- runif(n) > 0.3) # logical
#> [1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE</code></pre>
<p>Use <code>str()</code> and any other functions you wish to inspect these objects, such as <code>length()</code>, <code>mode()</code>, <code>class()</code>, <code>is.numeric()</code>, <code>is.logical()</code>, etc. Like the <code>is.xxx()</code> family of functions, there are also <code>as.xxx()</code> functions you can experiment with.</p>
<pre class="r"><code>str(w)
#> num [1:8] -0.63 0.18 -0.84 1.6 0.33 -0.82 0.49 0.74
length(x)
#> [1] 8
is.logical(y)
#> [1] FALSE
as.numeric(z)
#> [1] 1 1 1 1 1 0 1 0</code></pre>
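<p>Two of the claims above are easy to check for yourself: logicals behave like 1s and 0s in arithmetic, and integers and doubles really are different under the hood even when the values agree. A small sketch, with a made-up vector named <code>flags</code>:</p>
<pre class="r"><code>flags <- c(TRUE, TRUE, FALSE, TRUE)
sum(flags)                    # number of TRUEs
#> [1] 3
mean(flags)                   # proportion of TRUEs
#> [1] 0.75
typeof(1:3)                   # the colon operator gives integers
#> [1] "integer"
typeof(c(1, 2, 3))            # numbers typed by hand are doubles
#> [1] "double"
identical(1:3, c(1, 2, 3))    # so the two objects are not identical ...
#> [1] FALSE
all(1:3 == c(1, 2, 3))        # ... even though the values are equal
#> [1] TRUE
is.numeric(1:3) && is.numeric(c(1, 2, 3))
#> [1] TRUE</code></pre>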
</div>
<div id="indexing-a-vector" class="section level3">
<h3>Indexing a vector</h3>
<p>We’ve said, and even seen, that square brackets are used to index a vector. There is great flexibility in what one can put inside the square brackets and it’s worth understanding the many options. They are all useful, just in different contexts.</p>
<p>Most common, useful ways to index a vector</p>
<ul>
<li>logical vector: keep elements associated with TRUEs, ditch the FALSEs</li>
<li>vector of positive integers specifying the keepers</li>
<li>vector of negative integers specifying the losers</li>
<li>character vector, naming the keepers</li>
</ul>
<pre class="r"><code>w
#> [1] -0.63 0.18 -0.84 1.60 0.33 -0.82 0.49 0.74
names(w) <- letters[seq_along(w)]
w
#> a b c d e f g h
#> -0.63 0.18 -0.84 1.60 0.33 -0.82 0.49 0.74
w < 0
#> a b c d e f g h
#> TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
which(w < 0)
#> a c f
#> 1 3 6
w[w < 0]
#> a c f
#> -0.63 -0.84 -0.82
seq(from = 1, to = length(w), by = 2)
#> [1] 1 3 5 7
w[seq(from = 1, to = length(w), by = 2)]
#> a c e g
#> -0.63 -0.84 0.33 0.49
w[-c(2, 5)]
#> a c d f g h
#> -0.63 -0.84 1.60 -0.82 0.49 0.74
w[c('c', 'a', 'f')]
#> c a f
#> -0.84 -0.63 -0.82</code></pre>
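<p>The same four styles of index also work on the receiving end of an assignment, which is how you modify part of a vector. A short sketch with a made-up named vector <code>v</code>; note that positive and negative integers cannot be mixed in one index.</p>
<pre class="r"><code>v <- c(a = -0.5, b = 2, c = -1.5, d = 4)
v[v < 0] <- 0                 # logical index: zero out the negatives
v
#> a b c d
#> 0 2 0 4
v["b"] <- 20                  # character index: replace by name
v[-1]                         # negative index: everything but the first
#> b c d
#> 20 0 4
res <- try(v[c(-1, 2)], silent = TRUE)
inherits(res, "try-error")    # mixing negative and positive is an error
#> [1] TRUE</code></pre>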
</div>
<div id="lists-hold-just-about-anything" class="section level3">
<h3>lists hold just about anything</h3>
<p>Lists are basically über-vectors in R. It’s like a vector, but with no requirement that the elements be of the same flavor. In data analysis, you won’t make lists very often, at least not consciously, but you should still know about them. Why?</p>
<ul>
<li>data.frames are lists! They are a special case where each element is an atomic vector, all having the same length.</li>
<li>many functions will return lists to you and you will want to extract goodies from them, such as the p-value for a hypothesis test or the estimated error variance in a regression model</li>
</ul>
<p>Here we repeat an assignment from above, using <code>list()</code> instead of <code>c()</code> to combine things, and you’ll notice that the different flavors of the constituent parts are retained this time.</p>
<pre class="r"><code>## earlier: x <- c("cabbage", pi, TRUE, 4.3)
(a <- list("cabbage", pi, TRUE, 4.3))
#> [[1]]
#> [1] "cabbage"
#>
#> [[2]]
#> [1] 3.141593
#>
#> [[3]]
#> [1] TRUE
#>
#> [[4]]
#> [1] 4.3
str(a)
#> List of 4
#> $ : chr "cabbage"
#> $ : num 3.14
#> $ : logi TRUE
#> $ : num 4.3
length(a)
#> [1] 4
mode(a)
#> [1] "list"
class(a)
#> [1] "list"</code></pre>
<p>List components can also have names. You can create or change names after a list already exists or this can be integrated into the initial assignment.</p>
<pre class="r"><code>names(a)
#> NULL
names(a) <- c("veg", "dessert", "myAim", "number")
a
#> $veg
#> [1] "cabbage"
#>
#> $dessert
#> [1] 3.141593
#>
#> $myAim
#> [1] TRUE
#>
#> $number
#> [1] 4.3
a <- list(veg = "cabbage", dessert = pi, myAim = TRUE, number = 4.3)
names(a)
#> [1] "veg" "dessert" "myAim" "number"</code></pre>
<p>Indexing a list is similar to indexing a vector but it is necessarily more complex. The fundamental issue is this: if you request a single element from the list, do you want a list of length 1 containing only that element or do you want the element itself? For the former (desired return value is a list), we use single square brackets, <code>[</code> and <code>]</code>, just like indexing a vector. For the latter (desired return value is a single element), we use a dollar sign <code>$</code>, which you’ve already used to get one variable from a data.frame, or double square brackets, <code>[[</code> and <code>]]</code>.</p>
<p>The <a href="http://r4ds.had.co.nz/vectors.html#lists-of-condiments">“pepper shaker photos” in R for Data Science</a> are a splendid visual explanation of the different ways to get stuff out of a list. Highly recommended.</p>
<blockquote>
<p>Warning: the rest of this section might make your eyes glaze over. Skip to the next section if you need to; come back later when some list is ruining your day.</p>
</blockquote>
<p>A slightly more complicated list will make our demos more educational. Now we really see that the elements can differ in flavor and length.</p>
<pre class="r"><code>(a <- list(veg = c("cabbage", "eggplant"),
tNum = c(pi, exp(1), sqrt(2)),
myAim = TRUE,
joeNum = 2:6))
#> $veg
#> [1] "cabbage" "eggplant"
#>
#> $tNum
#> [1] 3.141593 2.718282 1.414214
#>
#> $myAim
#> [1] TRUE
#>
#> $joeNum
#> [1] 2 3 4 5 6
str(a)
#> List of 4
#> $ veg : chr [1:2] "cabbage" "eggplant"
#> $ tNum : num [1:3] 3.14 2.72 1.41
#> $ myAim : logi TRUE
#> $ joeNum: int [1:5] 2 3 4 5 6
length(a)
#> [1] 4
class(a)
#> [1] "list"
mode(a)
#> [1] "list"</code></pre>
<p>Here are ways to get a single list element:</p>
<pre class="r"><code>a[[2]] # index with a positive integer
#> [1] 3.141593 2.718282 1.414214
a$myAim # use dollar sign and element name
#> [1] TRUE
str(a$myAim) # we get myAim itself, a length 1 logical vector
#> logi TRUE
a[["tNum"]] # index with length 1 character vector
#> [1] 3.141593 2.718282 1.414214
str(a[["tNum"]]) # we get tNum itself, a length 3 numeric vector
#> num [1:3] 3.14 2.72 1.41
iWantThis <- "joeNum" # indexing with length 1 character object
a[[iWantThis]] # we get joeNum itself, a length 5 integer vector
#> [1] 2 3 4 5 6
a[[c("joeNum", "veg")]] # does not work! can't get > 1 elements! see below
#> Error in a[[c("joeNum", "veg")]]: subscript out of bounds</code></pre>
<p>One case where you must use double brackets, as opposed to the dollar sign, is when the name of the element you want is itself stored in an R object, as with <code>iWantThis</code> above.</p>
<p>What if you want more than one element? You must index vector-style with single square brackets. Note that the return value will always be a list, unlike the return value from double square brackets, even if you only request 1 element.</p>
<pre class="r"><code>names(a)
#> [1] "veg" "tNum" "myAim" "joeNum"
a[c("tNum", "veg")] # indexing by length 2 character vector
#> $tNum
#> [1] 3.141593 2.718282 1.414214
#>
#> $veg
#> [1] "cabbage" "eggplant"
str(a[c("tNum", "veg")]) # returns list of length 2
#> List of 2
#> $ tNum: num [1:3] 3.14 2.72 1.41
#> $ veg : chr [1:2] "cabbage" "eggplant"
a["veg"] # indexing by length 1 character vector
#> $veg
#> [1] "cabbage" "eggplant"
str(a["veg"])# returns list of length 1
#> List of 1
#> $ veg: chr [1:2] "cabbage" "eggplant"
length(a["veg"]) # really, it does!
#> [1] 1
length(a["veg"][[1]]) # contrast with length of the veg vector itself
#> [1] 2</code></pre>
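<p>Back to the motivation for this section: getting goodies out of the lists that functions return. The sketch below uses simulated data (the seed, sample sizes, and object names are arbitrary choices of mine) to pull the p-value out of a t test and the estimated error variance out of a regression summary. I don’t show the numeric results, since they depend on the simulated draws.</p>
<pre class="r"><code>set.seed(545)
g1 <- rnorm(20, mean = 0)
g2 <- rnorm(20, mean = 1)
tt <- t.test(g1, g2)
is.list(tt)                       # the result is a list with a class attribute
#> [1] TRUE
names(tt)[1:3]
#> [1] "statistic" "parameter" "p.value"
pval <- tt$p.value                # dollar sign and element name
identical(pval, tt[["p.value"]])  # double brackets give the same thing
#> [1] TRUE
fit <- lm(g2 ~ g1)
errVar <- summary(fit)$sigma^2    # estimated error variance
is.numeric(errVar) && length(errVar) == 1
#> [1] TRUE</code></pre>
<p>Use <code>str()</code> or <code>names()</code> on any return value to find out what is available before you start extracting.</p>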
</div>
<div id="creating-a-data.frame-explicitly" class="section level3">
<h3>Creating a data.frame explicitly</h3>
<p>In data analysis, we often import data into data.frame via <code>read.table()</code>. But one can also construct a data.frame directly using <code>data.frame()</code>.</p>
<pre class="r"><code>n <- 8
set.seed(1)
(jDat <- data.frame(w = round(rnorm(n), 2),
x = 1:n,
y = I(LETTERS[1:n]),
z = runif(n) > 0.3,
v = rep(LETTERS[9:12], each = 2)))
#> w x y z v
#> 1 -0.63 1 A TRUE I
#> 2 0.18 2 B TRUE I
#> 3 -0.84 3 C TRUE J
#> 4 1.60 4 D TRUE J
#> 5 0.33 5 E TRUE K
#> 6 -0.82 6 F FALSE K
#> 7 0.49 7 G TRUE L
#> 8 0.74 8 H FALSE L
str(jDat)
#> 'data.frame': 8 obs. of 5 variables:
#> $ w: num -0.63 0.18 -0.84 1.6 0.33 -0.82 0.49 0.74
#> $ x: int 1 2 3 4 5 6 7 8
#> $ y:Class 'AsIs' chr [1:8] "A" "B" "C" "D" ...
#> $ z: logi TRUE TRUE TRUE TRUE TRUE FALSE ...
#> $ v: Factor w/ 4 levels "I","J","K","L": 1 1 2 2 3 3 4 4
mode(jDat)
#> [1] "list"
class(jDat)
#> [1] "data.frame"</code></pre>
<blockquote>
<p>Sidebar: What is <code>I()</code>, used when creating the variable <span class="math inline">\(y\)</span> in the above data.frame? Short version: it tells R to do something <em>quite literally</em>. Here we are protecting the letters from being coerced to factor. We are ensuring we get a character vector. Note we let character-to-factor conversion happen in creating the <span class="math inline">\(v\)</span> variable above. (From R 4.0.0 onward, <code>data.frame()</code> no longer converts character to factor by default, so the output shown here reflects older versions or <code>stringsAsFactors = TRUE</code>.) More about (foiling) R’s determination to convert character data to factor can be found <a href="http://www.stat.ubc.ca/~jenny/STAT545A/block08_bossYourFactors.html">here</a> (Note: links to page from 2013).</p>
</blockquote>
<p>data.frames really are lists! Double square brackets can be used to get individual variables. Single square brackets can be used to get one or more variables, returned as a data.frame (though <code>subset(..., select = ...)</code> is what I would more likely use in a data analysis).</p>
<pre class="r"><code>is.list(jDat) # data.frames are lists
#> [1] TRUE
jDat[[5]] # this works but I prefer ...
#> [1] I I J J K K L L
#> Levels: I J K L
jDat$v # using dollar sign and name, when possible
#> [1] I I J J K K L L
#> Levels: I J K L
jDat[c("x", "z")] # get multiple variables
#> x z
#> 1 1 TRUE
#> 2 2 TRUE
#> 3 3 TRUE
#> 4 4 TRUE
#> 5 5 TRUE
#> 6 6 FALSE
#> 7 7 TRUE
#> 8 8 FALSE
str(jDat[c("x", "z")]) # returns a data.frame
#> 'data.frame': 8 obs. of 2 variables:
#> $ x: int 1 2 3 4 5 6 7 8
#> $ z: logi TRUE TRUE TRUE TRUE TRUE FALSE ...
identical(subset(jDat, select = c(x, z)), jDat[c("x", "z")])
#> [1] TRUE</code></pre>
<blockquote>
<p>Question: How do I make a data.frame from a list? It is an absolute requirement that the constituent vectors have the same length, although they can be of different flavors. Assuming you meet that requirement, use <code>as.data.frame()</code> to convert.</p>
</blockquote>
<pre class="r"><code>## note difference in the printing of a list vs. a data.frame
(qDat <- list(w = round(rnorm(n), 2),
x = 1:(n-1), ## <-- LOOK HERE! I MADE THIS VECTOR SHORTER!
y = I(LETTERS[1:n])))
#> $w
#> [1] -0.62 -2.21 1.12 -0.04 -0.02 0.94 0.82 0.59
#>
#> $x
#> [1] 1 2 3 4 5 6 7
#>
#> $y
#> [1] "A" "B" "C" "D" "E" "F" "G" "H"
as.data.frame(qDat) ## does not work! elements don't have same length!
#> Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 8, 7
qDat$x <- 1:n ## fix the short variable x
(qDat <- as.data.frame(qDat)) ## we're back in business
#> w x y
#> 1 -0.62 1 A
#> 2 -2.21 2 B
#> 3 1.12 3 C
#> 4 -0.04 4 D
#> 5 -0.02 5 E
#> 6 0.94 6 F
#> 7 0.82 7 G
#> 8 0.59 8 H</code></pre>
<p>You will encounter weirder situations in which you want to make a data.frame out of a list and there are many tricks. Ask me and we’ll beef up this section.</p>
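<p>One of the weirder situations you might meet is a list with one element per <em>record</em> rather than one element per variable. Here is a minimal sketch of one possible trick, using made-up data: convert each record to a one-row data.frame, then stack the rows with <code>rbind()</code>. I pass <code>stringsAsFactors = FALSE</code> explicitly so the result is the same in old and new versions of R.</p>
<pre class="r"><code>## hypothetical example data: one list per record
recs <- list(list(id = 1L, fruit = "apple",  price = 0.50),
             list(id = 2L, fruit = "banana", price = 0.25),
             list(id = 3L, fruit = "cherry", price = 3.00))
## convert each record to a one-row data.frame, then stack the rows
rows <- lapply(recs, as.data.frame, stringsAsFactors = FALSE)
(fDat <- do.call(rbind, rows))
#> id fruit price
#> 1 1 apple 0.50
#> 2 2 banana 0.25
#> 3 3 cherry 3.00
sapply(fDat, class)
#> id fruit price
#> "integer" "character" "numeric"</code></pre>
<p>This is fine for a handful of records but can be slow for thousands of them; treat it as a starting point, not the last word.</p>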
</div>
<div id="indexing-arrays-e.g.matrices" class="section level3">
<h3>Indexing arrays, e.g. matrices</h3>
<p>Though data.frames are recommended as the default receptacle for rectangular data, there are times when one will store rectangular data as a matrix instead. A matrix is a generalization of an atomic vector and the requirement that all the elements be of the same flavor still holds. General arrays are available in R, where a matrix is an important special case with two dimensions.</p>
<p>Let’s make a simple matrix and give it decent row and column names, which we know is a good practice. You’ll see familiar or self-explanatory functions below for getting to know a matrix.</p>
<pre class="r"><code>## don't worry if the construction of this matrix confuses you; just focus on
## the product
jMat <- outer(as.character(1:4), as.character(1:4),
function(x, y) {
paste0('x', x, y)
})
jMat
#> [,1] [,2] [,3] [,4]
#> [1,] "x11" "x12" "x13" "x14"
#> [2,] "x21" "x22" "x23" "x24"
#> [3,] "x31" "x32" "x33" "x34"
#> [4,] "x41" "x42" "x43" "x44"
str(jMat)
#> chr [1:4, 1:4] "x11" "x21" "x31" "x41" "x12" "x22" ...
class(jMat)
#> [1] "matrix"
mode(jMat)
#> [1] "character"
dim(jMat)
#> [1] 4 4
nrow(jMat)
#> [1] 4
ncol(jMat)
#> [1] 4
rownames(jMat)
#> NULL
rownames(jMat) <- paste0("row", seq_len(nrow(jMat)))
colnames(jMat) <- paste0("col", seq_len(ncol(jMat)))
dimnames(jMat) # also useful for assignment
#> [[1]]
#> [1] "row1" "row2" "row3" "row4"
#>
#> [[2]]
#> [1] "col1" "col2" "col3" "col4"
jMat
#> col1 col2 col3 col4
#> row1 "x11" "x12" "x13" "x14"
#> row2 "x21" "x22" "x23" "x24"
#> row3 "x31" "x32" "x33" "x34"
#> row4 "x41" "x42" "x43" "x44"</code></pre>
<p>Indexing a matrix is very similar to indexing a vector or a list: use square brackets and index with logical, integer numeric (positive or negative), or character vectors. Combine those approaches if you like! The main new wrinkle is the use of a comma <code>,</code> to distinguish rows and columns. The <span class="math inline">\(i,j\)</span>-th element is the element at the intersection of row <span class="math inline">\(i\)</span> and column <span class="math inline">\(j\)</span> and is obtained with <code>jMat[i, j]</code>. Request an entire row or an entire column by simply leaving the associated index empty. The <code>drop =</code> argument controls whether the return value should be an atomic vector (<code>drop = TRUE</code>, the default) or a matrix with a single row or column (<code>drop = FALSE</code>). Notice how row and column names persist and can help you stay oriented.</p>
<pre class="r"><code>jMat[2, 3]
#> [1] "x23"
jMat[2, ] # getting row 2
#> col1 col2 col3 col4
#> "x21" "x22" "x23" "x24"
is.vector(jMat[2, ]) # we get row 2 as an atomic vector
#> [1] TRUE
jMat[ , 3, drop = FALSE] # getting column 3
#> col3
#> row1 "x13"
#> row2 "x23"
#> row3 "x33"
#> row4 "x43"
dim(jMat[ , 3, drop = FALSE]) # we get column 3 as a 4 x 1 matrix
#> [1] 4 1
jMat[c("row1", "row4"), c("col2", "col3")]
#> col2 col3
#> row1 "x12" "x13"
#> row4 "x42" "x43"
jMat[-c(2, 3), c(TRUE, TRUE, FALSE, FALSE)] # wacky but possible
#> col1 col2
#> row1 "x11" "x12"
#> row4 "x41" "x42"</code></pre>
<p>Under the hood, of course, matrices are just vectors with some extra facilities for indexing. R is a <a href="http://en.wikipedia.org/wiki/Row-major_order">column-major order</a> language, in contrast to C and to NumPy arrays in Python, which use row-major order by default. What this means is that in the underlying vector storage of a matrix, the columns are stacked up one after the other. Matrices can be indexed <em>exactly</em> like a vector, i.e. with no comma <span class="math inline">\(i,j\)</span> business, like so:</p>
<pre class="r"><code>jMat[7]
#> [1] "x32"
jMat
#> col1 col2 col3 col4
#> row1 "x11" "x12" "x13" "x14"
#> row2 "x21" "x22" "x23" "x24"
#> row3 "x31" "x32" "x33" "x34"
#> row4 "x41" "x42" "x43" "x44"</code></pre>
<p>How to understand this: start counting in the upper left corner, move down the column, continue from the top of column 2 and you’ll land on the element “x32” when you get to 7.</p>
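<p>The counting rule can be written down as arithmetic: in a matrix with <code>nrow</code> rows, the element in row <span class="math inline">\(i\)</span> and column <span class="math inline">\(j\)</span> sits at position <span class="math inline">\((j - 1) \times \text{nrow} + i\)</span> of the underlying vector. A small sketch with a made-up 4 x 4 matrix <code>m</code>; <code>arrayInd()</code> goes the other way, from position back to row and column.</p>
<pre class="r"><code>m <- matrix(101:116, nrow = 4)    # 4 x 4, filled column by column
i <- 3; j <- 2
k <- (j - 1) * nrow(m) + i        # position of m[i, j] in the underlying vector
k
#> [1] 7
m[k] == m[i, j]
#> [1] TRUE
arrayInd(7, dim(m))               # and back: position 7 is row 3, column 2
#> [,1] [,2]
#> [1,] 3 2</code></pre>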
<p>If you have meaningful, systematic row or column names, there are many possibilities for indexing via regular expressions. Maybe we will talk about <code>grep</code> later….</p>
<pre class="r"><code>jMat[1, grepl("[24]", colnames(jMat))]
#> col2 col4
#> "x12" "x14"</code></pre>
<p>Note also that one can put an indexed matrix on the receiving end of an assignment operation and, as long as your replacement values have valid shape or extent, you can change the matrix.</p>
<pre class="r"><code>jMat["row1", 2:3] <- c("HEY!", "THIS IS NUTS!")
jMat
#> col1 col2 col3 col4
#> row1 "x11" "HEY!" "THIS IS NUTS!" "x14"
#> row2 "x21" "x22" "x23" "x24"
#> row3 "x31" "x32" "x33" "x34"
#> row4 "x41" "x42" "x43" "x44"</code></pre>
<p>Note that R can also work with vectors and matrices in the proper mathematical sense, i.e. perform matrix algebra. That is a separate topic. To get you started, read the help on <code>%*%</code> for matrix multiplication …</p>
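<p>As a taste of that separate topic, here is a tiny sketch with a made-up 2 x 2 matrix <code>A</code>. The thing to notice is that <code>*</code> multiplies element by element, while <code>%*%</code> is the true matrix product.</p>
<pre class="r"><code>A <- matrix(1:4, nrow = 2)
A
#> [,1] [,2]
#> [1,] 1 3
#> [2,] 2 4
A * A                 # element-wise product
#> [,1] [,2]
#> [1,] 1 9
#> [2,] 4 16
A %*% A               # matrix product
#> [,1] [,2]
#> [1,] 7 15
#> [2,] 10 22
t(A)                  # transpose
#> [,1] [,2]
#> [1,] 1 2
#> [2,] 3 4
## the inverse times A gives the identity, up to floating point error
all.equal(solve(A) %*% A, diag(2))
#> [1] TRUE</code></pre>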
</div>
<div id="creating-arrays-e.g.matrices" class="section level3">
<h3>Creating arrays, e.g. matrices</h3>
<p>There are 3 main ways to create a matrix. It goes without saying that the inputs must comply with the requirement that all matrix elements are the same flavor. If that’s not true, you risk an error or, worse, silent conversion to character.</p>
<ul>
<li>Filling a matrix with a vector</li>
<li>Glueing vectors together as rows or columns</li>
<li>Conversion of a data.frame</li>
</ul>
<p>Let’s demonstrate. Here we fill a matrix with a vector, explore filling by rows, and give row and column names at creation. Notice that recycling happens here too, so if the input vector is not large enough, R will recycle it.</p>
<pre class="r"><code>matrix(1:15, nrow = 5)
#> [,1] [,2] [,3]
#> [1,] 1 6 11
#> [2,] 2 7 12
#> [3,] 3 8 13
#> [4,] 4 9 14
#> [5,] 5 10 15
matrix("yo!", nrow = 3, ncol = 6)
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] "yo!" "yo!" "yo!" "yo!" "yo!" "yo!"
#> [2,] "yo!" "yo!" "yo!" "yo!" "yo!" "yo!"
#> [3,] "yo!" "yo!" "yo!" "yo!" "yo!" "yo!"
matrix(c("yo!", "foo?"), nrow = 3, ncol = 6)
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] "yo!" "foo?" "yo!" "foo?" "yo!" "foo?"
#> [2,] "foo?" "yo!" "foo?" "yo!" "foo?" "yo!"
#> [3,] "yo!" "foo?" "yo!" "foo?" "yo!" "foo?"
matrix(1:15, nrow = 5, byrow = TRUE)
#> [,1] [,2] [,3]
#> [1,] 1 2 3
#> [2,] 4 5 6
#> [3,] 7 8 9
#> [4,] 10 11 12
#> [5,] 13 14 15
matrix(1:15, nrow = 5,
dimnames = list(paste0("row", 1:5),
paste0("col", 1:3)))
#> col1 col2 col3
#> row1 1 6 11
#> row2 2 7 12
#> row3 3 8 13
#> row4 4 9 14
#> row5 5 10 15</code></pre>
<p>Here we create a matrix by glueing vectors together. Watch the vector names propagate as row or column names.</p>
<pre class="r"><code>vec1 <- 5:1
vec2 <- 2^(1:5)
cbind(vec1, vec2)
#> vec1 vec2
#> [1,] 5 2
#> [2,] 4 4
#> [3,] 3 8
#> [4,] 2 16
#> [5,] 1 32
rbind(vec1, vec2)
#> [,1] [,2] [,3] [,4] [,5]
#> vec1 5 4 3 2 1
#> vec2 2 4 8 16 32</code></pre>
<p>Here we create a matrix from a data.frame.</p>
<pre class="r"><code>vecDat <- data.frame(vec1 = 5:1,
vec2 = 2^(1:5))
str(vecDat)
#> 'data.frame': 5 obs. of 2 variables:
#> $ vec1: int 5 4 3 2 1
#> $ vec2: num 2 4 8 16 32
vecMat <- as.matrix(vecDat)
str(vecMat)
#> num [1:5, 1:2] 5 4 3 2 1 2 4 8 16 32
#> - attr(*, "dimnames")=List of 2
#> ..$ : NULL
#> ..$ : chr [1:2] "vec1" "vec2"</code></pre>
<p>Here we create a matrix from a data.frame, but experience the “silently convert everything to character” fail. As an added bonus, I’m also allowing the “convert character to factor” thing to happen when we create the data.frame initially. Let this be a reminder to take control of your objects!</p>
<pre class="r"><code>multiDat <- data.frame(vec1 = 5:1,
vec2 = paste0("hi", 1:5))
str(multiDat)
#> 'data.frame': 5 obs. of 2 variables:
#> $ vec1: int 5 4 3 2 1
#> $ vec2: Factor w/ 5 levels "hi1","hi2","hi3",..: 1 2 3 4 5
(multiMat <- as.matrix(multiDat))
#> vec1 vec2
#> [1,] "5" "hi1"
#> [2,] "4" "hi2"
#> [3,] "3" "hi3"
#> [4,] "2" "hi4"
#> [5,] "1" "hi5"
str(multiMat)
#> chr [1:5, 1:2] "5" "4" "3" "2" "1" "hi1" "hi2" "hi3" ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : NULL
#> ..$ : chr [1:2] "vec1" "vec2"</code></pre>
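<p>One way to take control is to decide explicitly which variables go into the matrix. The sketch below, with a made-up data.frame <code>mixDat</code>, keeps only the numeric variables before converting and then checks the result. It’s one possible guard, not the only one.</p>
<pre class="r"><code>mixDat <- data.frame(vec1 = 5:1,
                     vec2 = 2^(1:5),
                     vec3 = paste0("hi", 1:5),
                     stringsAsFactors = FALSE)
isNum <- vapply(mixDat, is.numeric, logical(1))  # which variables are numeric?
isNum
#> vec1 vec2 vec3
#> TRUE TRUE FALSE
numMat <- as.matrix(mixDat[isNum])               # convert only those
mode(numMat)
#> [1] "numeric"
dim(numMat)
#> [1] 5 2
stopifnot(is.numeric(numMat))                    # check early, check often</code></pre>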
</div>
<div id="putting-it-all-together-implications-for-data.frames" class="section level3">
<h3>Putting it all together … implications for data.frames</h3>
<p>This behind-the-scenes tour is still aimed at making you a better data analyst. Hopefully the slog through vectors, matrices, and lists will be redeemed by greater prowess at manipulating data.frames. Why should this be true?</p>
<ul>
<li>a data.frame is a <em>list</em></li>
<li>the list elements are the variables and they are <em>atomic vectors</em></li>
<li>data.frames are rectangular, like their matrix friends, so your intuition – and even some syntax – can be borrowed from the matrix world</li>
</ul>
<p>A data.frame is a list that quacks like a matrix.</p>
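<p>Both sides of that claim can be checked at the console. A minimal sketch, using a small hypothetical data.frame <code>toyDat</code> that is not part of the tutorial's data:</p>
<pre class="r"><code># hypothetical data.frame used only for this check
toyDat <- data.frame(x = 1:3, y = c("a", "b", "c"))
is.list(toyDat)     # the list side
#> [1] TRUE
typeof(toyDat)
#> [1] "list"
length(toyDat)      # number of list elements, i.e. variables
#> [1] 2
dim(toyDat)         # the matrix-like side: rows and columns
#> [1] 3 2
is.matrix(toyDat)   # it quacks like a matrix but is not one
#> [1] FALSE</code></pre>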
<p>Reviewing list-style indexing of a data.frame:</p>
<pre class="r"><code>jDat
#>       w x y     z v
#> 1 -0.63 1 A  TRUE I
#> 2  0.18 2 B  TRUE I
#> 3 -0.84 3 C  TRUE J
#> 4  1.60 4 D  TRUE J
#> 5  0.33 5 E  TRUE K
#> 6 -0.82 6 F FALSE K
#> 7  0.49 7 G  TRUE L
#> 8  0.74 8 H FALSE L
jDat$z
#> [1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE
iWantThis <- "z"
jDat[[iWantThis]]
#> [1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE
str(jDat[[iWantThis]]) # we get an atomic vector
#> logi [1:8] TRUE TRUE TRUE TRUE TRUE FALSE ...</code></pre>
<p>Reviewing vector-style indexing of a data.frame:</p>
<pre class="r"><code>jDat["y"]
#>   y
#> 1 A
#> 2 B
#> 3 C
#> 4 D
#> 5 E
#> 6 F
#> 7 G
#> 8 H
str(jDat["y"]) # we get a data.frame with one variable, y
#> 'data.frame': 8 obs. of 1 variable:
#> $ y:Class 'AsIs' chr [1:8] "A" "B" "C" "D" ...
iWantThis <- c("w", "v")
jDat[iWantThis] # index with a vector of variable names
#>       w v
#> 1 -0.63 I
#> 2  0.18 I
#> 3 -0.84 J
#> 4  1.60 J
#> 5  0.33 K
#> 6 -0.82 K
#> 7  0.49 L
#> 8  0.74 L
str(jDat[c("w", "v")])
#> 'data.frame': 8 obs. of 2 variables:
#> $ w: num -0.63 0.18 -0.84 1.6 0.33 -0.82 0.49 0.74
#> $ v: Factor w/ 4 levels "I","J","K","L": 1 1 2 2 3 3 4 4
str(subset(jDat, select = c(w, v))) # using subset() function
#> 'data.frame': 8 obs. of 2 variables:
#> $ w: num -0.63 0.18 -0.84 1.6 0.33 -0.82 0.49 0.74
#> $ v: Factor w/ 4 levels "I","J","K","L": 1 1 2 2 3 3 4 4</code></pre>
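<p>Vector-style indexing also accepts negative integers and logical vectors, just as it does for an ordinary vector, and the result is still a data.frame. A small sketch with a hypothetical data.frame <code>toyDat</code>:</p>
<pre class="r"><code># hypothetical data.frame with three variables
toyDat <- data.frame(x = 1:3,
                     y = c("a", "b", "c"),
                     z = c(TRUE, FALSE, TRUE))
names(toyDat[-2])                     # drop the second variable
#> [1] "x" "z"
names(toyDat[c(TRUE, FALSE, TRUE)])   # keep variables flagged TRUE
#> [1] "x" "z"
class(toyDat["x"])                    # one variable, still a data.frame
#> [1] "data.frame"</code></pre>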
<p>Demonstrating matrix-style indexing of a data.frame:</p>
<pre class="r"><code>jDat[ , "v"]
#> [1] I I J J K K L L
#> Levels: I J K L
str(jDat[ , "v"])
#> Factor w/ 4 levels "I","J","K","L": 1 1 2 2 3 3 4 4
jDat[ , "v", drop = FALSE]
#>   v
#> 1 I
#> 2 I
#> 3 J
#> 4 J
#> 5 K
#> 6 K
#> 7 L
#> 8 L
str(jDat[ , "v", drop = FALSE])
#> 'data.frame': 8 obs. of 1 variable:
#> $ v: Factor w/ 4 levels "I","J","K","L": 1 1 2 2 3 3 4 4
jDat[c(2, 4, 7), c(1, 4)] # awful and arbitrary but syntax works
#>      w    z
#> 2 0.18 TRUE
#> 4 1.60 TRUE
#> 7 0.49 TRUE
jDat[jDat$z, ]
#>       w x y    z v
#> 1 -0.63 1 A TRUE I
#> 2  0.18 2 B TRUE I
#> 3 -0.84 3 C TRUE J
#> 4  1.60 4 D TRUE J
#> 5  0.33 5 E TRUE K
#> 7  0.49 7 G TRUE L
subset(jDat, subset = z)
#>       w x y    z v
#> 1 -0.63 1 A TRUE I
#> 2  0.18 2 B TRUE I
#> 3 -0.84 3 C TRUE J
#> 4  1.60 4 D TRUE J
#> 5  0.33 5 E TRUE K
#> 7  0.49 7 G TRUE L</code></pre>
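<p>Logical row indexing with <code>[</code> and with <code>subset()</code> agree above, but they part ways when the condition contains <code>NA</code>. The sketch below uses a hypothetical data.frame <code>naDat</code>, made up to show the difference.</p>
<pre class="r"><code># hypothetical data.frame with a missing value in the logical variable
naDat <- data.frame(id = 1:4,
                    keep = c(TRUE, NA, FALSE, TRUE))
nrow(naDat[naDat$keep, ])            # the NA yields an extra all-NA row
#> [1] 3
nrow(subset(naDat, subset = keep))   # subset() treats NA as FALSE
#> [1] 2
nrow(naDat[which(naDat$keep), ])     # which() also drops the NA
#> [1] 2</code></pre>
<p>If missing values are possible in the condition, wrapping it in <code>which()</code> or using <code>subset()</code> avoids the phantom rows.</p>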
</div>
<div id="table-of-atomic-r-object-flavors" class="section level3">
<h3>Table of atomic R object flavors</h3>
<blockquote>
<p>This table will be hideous unless Pandoc is used to compile.</p>
</blockquote>
<table style="width:72%;">
<colgroup>
<col width="16%" />
<col width="22%" />
<col width="16%" />
<col width="16%" />
</colgroup>
<thead>
<tr class="header">
<th>“flavor”</th>
<th>type reported by typeof()</th>
<th>mode()</th>
<th>class()</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><p>character</p></td>
<td><p>character</p></td>
<td><p>character</p></td>
<td><p>character</p></td>
</tr>
<tr class="even">
<td><p>logical</p></td>
<td><p>logical</p></td>
<td><p>logical</p></td>
<td><p>logical</p></td>
</tr>
<tr class="odd">
<td><p>numeric</p></td>
<td><p>integer or double</p></td>
<td><p>numeric</p></td>
<td><p>integer or numeric</p></td>
</tr>
<tr class="even">
<td><p>factor</p></td>
<td><p>integer</p></td>
<td><p>numeric</p></td>
<td><p>factor</p></td>
</tr>
</tbody>
</table>
<blockquote>
<p>This should be legible no matter what.</p>
</blockquote>
<pre><code>+-----------+---------------+-----------+------------+
| "flavor"  | type reported | mode()    | class()    |
|           | by typeof()   |           |            |
+===========+===============+===========+============+
| character | character     | character | character  |
+-----------+---------------+-----------+------------+
| logical   | logical       | logical   | logical    |
+-----------+---------------+-----------+------------+
| numeric   | integer       | numeric   | integer    |
|           | or double     |           | or numeric |
+-----------+---------------+-----------+------------+
| factor    | integer       | numeric   | factor     |
+-----------+---------------+-----------+------------+</code></pre>
<p>Thinking about objects according to the flavors above will work fairly well for most purposes most of the time, at least when you’re first getting started. Notice that most rows in the table are quite homogeneous, i.e. a logical vector is a logical vector is a logical vector. But the row pertaining to factors is an exception, which highlights the special nature of factors (for more, go <a href="block08_bossYourFactors.html">here</a>).</p>
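<p>A quick way to see what <code>typeof()</code>, <code>mode()</code>, and <code>class()</code> report for each flavor is to ask R directly. This is a minimal sketch; the list <code>flavors</code> and the data.frame <code>report</code> are hypothetical helpers created only for this check.</p>
<pre class="r"><code># one small example vector per flavor (hypothetical values)
flavors <- list(character = c("a", "b"),
                logical   = c(TRUE, FALSE),
                integer   = 1:2,
                double    = c(1.5, 2.5),
                factor    = factor(c("a", "b")))
# collect the three reports, one row per example vector
report <- data.frame(
  typeof = vapply(flavors, typeof, character(1)),
  mode   = vapply(flavors, mode, character(1)),
  class  = vapply(flavors, function(x) class(x)[1], character(1)),
  stringsAsFactors = FALSE
)
report
#>              typeof      mode     class
#> character character character character
#> logical     logical   logical   logical
#> integer     integer   numeric   integer
#> double       double   numeric   numeric
#> factor      integer   numeric    factor</code></pre>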
</div>
<div class="footer">
This work is licensed under the <a href="http://creativecommons.org/licenses/by-nc/3.0/">CC BY-NC 3.0 Creative Commons License</a>.
</div>
</div>
<script>
// add bootstrap table styles to pandoc tables
$(document).ready(function () {
$('tr.header').parent('thead').parent('table').addClass('table table-condensed');
});
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement("script");
script.type = "text/javascript";
script.src = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
document.getElementsByTagName("head")[0].appendChild(script);
})();
</script>
</body>
</html>