# Data structures
Learning objectives:
- Basic concepts of R programming: functions and data objects
- Looking for help documents within R and using Google
- Common data objects: scalars, vectors, lists, data frames, and matrices
R scripts mostly consist of function calls and data. There are many different types of data objects. Common **data structures** in R include **scalars, vectors, factors, matrices, data frames**, and **lists**. These data structures can contain one or more individual data elements of several types, namely **numeric** (2.5), **character** (“Go Jacks”), or **logical** (TRUE or FALSE).
## Basic concepts
### Expressions
Type anything at the prompt, and R will evaluate it and print the answer.
```{r}
1 + 1
```
There's your result, **2**. It's printed on the console right after your entry.
Type the string **"Go Jacks"**. (Don't forget the quotes!)
```{r}
"Go Jacks"
```
>
```{exercise}
>
Now try multiplying 45.6 by 78.9.
```
### Logical Values
Some expressions return a "logical value" that takes the value of either **TRUE** or **FALSE**. Many programming languages refer to these as "Boolean" values. Let's try typing an expression that gives us a logical value:
```{r}
3 < 4
```
And another logical value:
```{r}
2 + 2 == 5
```
Note that you need a double-equal sign to check whether two values are equal - a single-equal sign won't work.
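As a small illustration of comparisons, the sketch below contrasts `==` with `!=` and shows one caveat that is easy to miss: decimal numbers are stored with limited precision, so an exact `==` test can fail. Using the base R function `all.equal()` here is a suggestion rather than part of the original example.

```{r}
3 == 3                              # TRUE: the two values are equal
3 != 4                              # TRUE: "!=" means "not equal"
0.1 + 0.2 == 0.3                    # FALSE, because of floating-point rounding
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: comparison within a small tolerance
```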
### Variables
As in other programming languages, you can store a value into a variable to access it later. Type **x = 42** to store a value in **x**. x is a **scalar that contains only one data element**.
```{r}
x = 42
```
You can also use the following. This is a conventional, safer way to assign values.
```{r}
x <- 42
```
This is the recommended way of assignment, according to the [Google R Style Guide](https://google.github.io/styleguide/Rguide.xml). But `x = 42` works fine the vast majority of the time.
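One place where the two forms behave differently is inside a function call. The sketch below is only an illustration, and `tmp.v` is a hypothetical name chosen so that it does not clash with anything else in this chapter. Inside a call, `=` names an argument, while `<-` creates a variable.

```{r}
if (exists("tmp.v")) rm(tmp.v)   # start from a clean state
sum(tmp.v = 1:3)                 # "=" only labels the argument; the result is 6
exists("tmp.v")                  # FALSE: no variable was created
sum(tmp.v <- 1:3)                # "<-" assigns first, then sums; the result is 6
tmp.v                            # the variable now exists and holds 1 2 3
```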
x can now be used in expressions in place of the original result. Try dividing **x** by **2** ('**/**' is the division operator), and other calculations.
```{r}
x / 2
log(x)
x^2
sqrt(x)
x > 1
```
You can re-assign any value to a variable at any time. Try assigning **"Go Jacks!"** to **x**.
```{r}
x <- "Go Jacks!"
```
You can print the value of a variable at any time just by typing its name in the console. Try printing the current value of x.
```{r}
x
```
Now try assigning the **TRUE** logical value to **x**.
```{r}
x <- TRUE
```
You can store multiple values in a variable or object. That is called a **vector**, which is explained later. An object can also contain a table with rows and columns, like an Excel spreadsheet, as a **matrix**, or **data frame**. These are explained in the next section.
### Functions
Functions are fundamental building blocks of R. Most of the time when we run R commands, we are calling and executing functions. You call a function by typing its name, followed by one or more arguments to that function in parentheses. Let's try using the **sum** function to add up a few numbers. For example:
```{r}
sum(1, 3, 5)
```
The following code shows how to assign the result of a function to a new variable:
```{r}
y <- sum(1, 3, 5)
y
```
Some arguments have "names". For example, to repeat a value 3 times, you would call the **rep** function and provide its **times** argument:
```{r}
rep("Yo ho!", times = 3)
```
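To show why argument names matter, here is a short sketch comparing two named arguments of **rep**, **times** and **each**. Both are documented arguments of the base function; the input vector is made up for illustration.

```{r}
rep(c("a", "b"), times = 2)   # repeat the whole vector:  "a" "b" "a" "b"
rep(c("a", "b"), each = 2)    # repeat each element:      "a" "a" "b" "b"
rep("Yo ho!", 3)              # without a name, the second argument is taken as times
```

Naming the argument makes the intent explicit, which helps when a function accepts several optional arguments.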
>
```{exercise}
>
Suppose a vector is defined as x <- c(12, 56, 31, -5, 7).
a. Calculate the mean of all elements in x and assign it to *y*.
b. Square each element in x and assign the results to a new vector *z*.
```
>
```{exercise}
>
Use Google to find functions which set and get the current working directory in R, respectively.
```
>
```{exercise}
>
Use Google to find the function which lists all the files in the current working folder in R.
```
Often we want to re-use a chunk of code. The most efficient way is to wrap the code in a function and clearly define its input and output. Tested and documented R functions are often available as R packages. You can also define your own functions, which will be explained later.
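As a preview, a minimal sketch of a user-defined function might look like the following. The name `spread` and its behavior (the largest value minus the smallest) are made up for illustration and are not part of any package.

```{r}
# 'spread' is a hypothetical helper: input is a numeric vector, output is one number
spread <- function(v, na.rm = FALSE) {
  max(v, na.rm = na.rm) - min(v, na.rm = na.rm)
}
spread(c(5, 2, 22, 11, 5))   # 22 - 2 = 20
```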
### Looking for help and example Code
```{r, eval=F}
? sum
```
A document will pop up in the Help Window. See Figure \@ref(fig:ch3-1).
(ref:ch3-1) Help information on the sum function.
```{r ch3-1, echo=FALSE, out.width='60%', fig.cap='(ref:ch3-1)', fig.align='center'}
knitr::include_graphics("images//Ch3_1.png")
```
At the bottom of the page is some example code. See Figure \@ref(fig:ch3-2).
(ref:ch3-2) Examples of sum code
```{r ch3-2, echo=FALSE, out.width='60%', fig.cap='(ref:ch3-2)', fig.align='center'}
knitr::include_graphics("images//Ch3_2.png")
```
The quickest way to learn an R function is to run the example code and see the input and output. You can easily copy, paste, and tweak the example code to do your analysis.
**example()** brings up examples of usage for the given function. Try the examples for the min function:
```{r fig.keep='none'}
example(min)
```
```{r}
min(5:1, pi) # -> one number
```
Example commands and plots will show up automatically when you press Return in RStudio.
In R, you need to click on the plots.
```{r message=FALSE, results='hide', warning=FALSE, fig.keep='last'}
example(boxplot) # bring example of boxplot
```
I found a lot of help information about R through Google. Google tolerates typos, grammar errors, and different notations. Also, most (99 %) of your questions have been asked and answered on various forums. Many R gurus have answered a ton of questions on web sites like **[stackoverflow.com](stackoverflow.com)**, with example code! I also use Google as a reference.
It is important to add comments to your code. Everything after the “#” will be ignored by R when running. Comments help because we often recycle and re-purpose our code.
```{r}
max(1, 3, 5) # return the maximum value of a vector
```
## Data structures
### Vectors
A vector is an object that holds a sequence of values of the same type. A vector's values can be numbers, strings, logical values, or any other type, as long as they're all the same type. They can come from a column of a data frame. Suppose we have a vector x:
```{r}
x <- c(5, 2, 22, 11, 5)
x
```
Here **c** stands for *concatenate*. Do not use it as a variable name. It is as special as you!
Vectors cannot hold values with different modes (types). Try mixing modes and see what happens:
```{r}
c(1, TRUE, "three")
```
All the values were converted to a single mode (characters) so that the vector can hold them all. To hold diverse types of values, you will need a **list**, which is explained later in this chapter.
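The conversion follows a fixed order: logical values can become numbers, and numbers can become characters. A quick way to see this is to check the result with the base function `class()`, as in this small sketch:

```{r}
class(c(1, 2.5))             # "numeric"
class(c(TRUE, FALSE))        # "logical"
class(c(1, TRUE))            # "numeric": TRUE is converted to 1
class(c(1, TRUE, "three"))   # "character": everything is converted to text
```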
If you need a vector with a sequence of numbers you can create it with **start:end** notation. This is often used in loops and operations on the indices of vectors etc. Let's make a vector with values from 5 through 9:
```{r}
5:9
```
A more versatile way to make sequences is to call the **seq** function. Let's do the same thing with **seq**:
```{r}
seq(from = 5, to = 9)
```
seq also allows you to use increments other than 1. Try it with steps of 0.5:
```{r}
seq(from = 5, to = 9, by = .5)
```
Create a sequence from 5 to 9 with length 15:
```{r}
seq(from = 5, to = 9, length = 15)
```
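A few related helpers are worth knowing. In the call above, `length` is matched to the full argument name `length.out`; the sketch below spells it out and also shows `seq_len()` and `seq_along()`, two base R functions that are handy for generating indices. The inputs are made up for illustration.

```{r}
seq(from = 5, to = 9, length.out = 5)   # 5 6 7 8 9
seq_len(4)                              # 1 2 3 4
seq_along(c("a", "b", "c"))             # 1 2 3: one index per element
```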
>
```{exercise}
>
Compute 1+2+3… +1000 with one line of R code. Hint: examine the example code for sum( ) function in the R help document.
```
#### Vectors operations
First let's find out what is the 4th element of our vector x <- c(5, 2, 22, 11, 5), or the elements from 2 to 4.
```{r}
x[4]
x[2:4]
```
If you define the vector as y,
```{r}
y <- x[2:4]
```
No result is returned but you "captured" the result in a new vector, which holds 3 numbers. You can type y and hit enter to see the results. Or do some computing with it.
```{r}
y <- x[2:4]; y
```
This does exactly the same in one line. A semicolon separates multiple commands.
Now suppose we want to know the number of elements in the vector:
```{r}
length(x)
```
It's also easy to know about the maximum, minimum, sum, mean and median individually or together. We can get standard deviation too.
```{r}
max(x)
min(x)
sum(x)
mean(x)
median(x)
summary(x)
sd(x)
```
The rank() function ranks the elements. Tied elements each receive the average of the ranks they would otherwise occupy. The sort() function sorts from the smallest to the biggest by default; adding decreasing = T makes it sort from the biggest to the smallest.
```{r}
rank(x)
sort(x)
sort(x, decreasing = T)
```
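To make the tie handling concrete, here is a small sketch using a separate vector, `scores` (a hypothetical name), with the same values as x. It also shows the base function `order()`, which returns the positions that would sort the vector.

```{r}
scores <- c(5, 2, 22, 11, 5)
rank(scores)            # 2.5 1.0 5.0 4.0 2.5: the two 5s share ranks 2 and 3
order(scores)           # 2 1 5 4 3: positions of the elements in sorted order
scores[order(scores)]   # 2 5 5 11 22, the same as sort(scores)
```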
diff() computes the differences between consecutive elements of vector x.
```{r}
diff(x)
```
rev() will reverse the position of the elements in the vector.
```{r}
rev(x)
```
Operations are performed element by element. Same for log, sqrt, x^2, etc. They return vectors too.
```{r}
log(x)
sqrt(x)
x^2
2*x + 1
```
If we want to drop the second element and save the result as y:
```{r}
y <- x[-2]
y
```
Add an element 100 to the vector x between the second and the third element:
```{r}
x <- c(5, 2, 22, 11, 5)
x <- c(x[1:2], 100, x[3:5]) # add an element 100 to x between elements 2 and 3
x # The value 100 is added to the previous vector x
```
The length of the newly created x is:
```{r}
length(x)
```
To add a new element to the end, we can use either of the two commands below. They generate the same result.
```{r}
x <- c(5, 2, 22, 11, 5)
c(x, 7)
append(x, 7)
```
Create an empty vector y:
```{r}
y <- c()
y
```
```{r}
length(y)
```
Sometimes we are interested in unique elements:
```{r}
x <- c(5, 2, 22, 11, 5)
unique(x)
```
And the frequencies of the unique elements:
```{r}
x <- c(5, 2, 22, 11, 5)
table(x)
```
If we are interested in the index of the maximum or minimum:
```{r}
x <- c(5, 2, 22, 11, 5)
which.max(x)
which.min(x)
```
Or we need to look for the location of a special value:
```{r}
x <- c(5, 2, 22, 11, 5)
which(x == 11)
```
Or more complicated, we want to find the locations where $x^2>100$:
```{r}
x <- c(5, 2, 22, 11, 5)
x^2
```
```{r}
which (x^2 > 100)
```
We can randomly select some elements from the vector. Run the following code more than once. Do you always get the same results? The answer is "No", because the 3 elements are randomly selected.
```{r}
x <- c(5, 2, 22, 11, 5)
sample(x, 3)
```
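If you need the same "random" selection every time, for example to make a report reproducible, one common approach is to fix the random seed with `set.seed()` before sampling. The sketch below uses hypothetical variable names and an arbitrary seed value.

```{r}
draw.from <- c(5, 2, 22, 11, 5)
set.seed(123)                          # any fixed number works as a seed
first.draw <- sample(draw.from, 3)
set.seed(123)                          # reset to the same seed
second.draw <- sample(draw.from, 3)
identical(first.draw, second.draw)     # TRUE: same seed, same selection
```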
Elements in the vector can have names. Type "x" in the command window to see the difference.
```{r}
x <- c(5, 2, 22, 11, 5)
names(x) <- c("David", "Beck", "Zach", "Amy", "John")
x
```
Now we can refer to the elements by their names.
```{r}
x["Amy"]
```
The *any()* and *all()* functions produce logical values. They report whether any or all of their arguments are TRUE.
```{r}
x <- c(5, 2, 22, 11, 5)
any(x < 10)
```
```{r}
x <- c(5, 2, 22, 11, 5)
any(x < 0)
```
```{r}
x <- c(5, 2, 22, 11, 5)
all(x < 10)
```
```{r}
x <- c(5, 2, 22, 11, 5)
any(x > 0)
```
```{r}
x>10
```
If we want to get a subset of a vector, there are multiple methods we can use. Here are some examples:
```{r}
x <- c(NA, 2, -4, NA, 9, -1, 5)
x
```
```{r}
y <- x[x < 0]
y
```
There are annoying NAs in the vectors x and y. We can remove the NAs by using **is.na()** function.
```{r}
y<-y[!is.na(y)] #Remove all NAs in y. Or equivalently, keep all NOT NAs in y.
# The exclamation '!' in front of is.na() means NOT.
y
```
Now the updated variable y contains only numerical values. All NAs have been removed. If we want to assign the new vector to a different variable, we can use the code below.
```{r}
z<-y[!is.na(y)] # Assign the new vector without NAs to z.
z # Print the new variable.
```
We can also use the *subset()* function to get "clean" data without NAs. For example:
```{r}
x <- c(NA, 2, -4, NA, 9, -1, 5)
y <- subset(x, x < 0)
y
```
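Two other ways to avoid the NAs while subsetting are sketched below, using a separate vector `with.na` (a hypothetical name) so that x and y above are left untouched. The base function `which()` returns only the positions where the test is TRUE, so positions where the test is NA are skipped.

```{r}
with.na <- c(NA, 2, -4, NA, 9, -1, 5)
with.na[with.na < 0]                      # NA -4 NA -1: NAs leak through
with.na[which(with.na < 0)]               # -4 -1
with.na[!is.na(with.na) & with.na < 0]    # -4 -1: explicit NA test combined with "&"
```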
Let's talk more about the function **is.na()**, which checks whether each element of a vector is NA or not. Note that the result is a vector holding logical values. Do we have missing values in the vector x?
```{r}
is.na(x)
```
Sometimes, when working with sample data, a given value isn't available. But it's not a good idea to just throw those values out. R has a value that explicitly indicates a sample was not available: **NA**. Many functions that work with vectors treat this value specially.
For our x vector, try to get the sum of its values, and see what the result is:
```{r}
sum(x)
```
The sum is considered "not available" by default because one of the vector's values is **NA**. This is the responsible thing to do; R won't just blithely add up the numbers without warning you about the incomplete data. We can explicitly tell **sum** (and many other functions) to remove **NA** values before they do their calculations, however.
Bring up documentation for the **sum** function:
```{r, eval=F}
? sum
```
As you see in the documentation, **sum** can take an optional named argument, **na.rm**. It's set to **FALSE** by default, but if you set it to **TRUE**, all **NA** values will be removed from the vector before the calculation is performed. In other words, we can use the argument **na.rm = T** or **na.rm = TRUE** to ignore all NAs during calculations.
Try calling **sum** again, with the **na.rm** argument set to **TRUE**:
```{r}
x <- c(NA, 2, -4, NA, 9, -1, 5)
sum(x) # The NAs are involved during the calculation.
sum(x, na.rm = TRUE) # All NAs are ignored during the calculation.
```
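Before dropping missing values, it is often useful to know how many there are. Because TRUE counts as 1 in arithmetic, summing the result of **is.na()** gives that count. The sketch below uses a separate vector `obs` (a hypothetical name) with the same values as x.

```{r}
obs <- c(NA, 2, -4, NA, 9, -1, 5)
sum(is.na(obs))          # 2 missing values
sum(!is.na(obs))         # 5 available values
sum(obs, na.rm = TRUE)   # 11
max(obs, na.rm = TRUE)   # 9: many other functions accept na.rm too
```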
>
```{exercise}
>
Suppose a vector var1 <- c(NA, 334, 566, 319, NA, -307).
>
a. Obtain a new vector var2 which removes all NAs from var1.
b. Use the argument na.rm to calculate the mean of var1. Make sure you ignore all NAs.
```
Let's use examples to show the differences between NULL and NA.
```{r}
# build up a vector of numbers greater than 10 in vector vec.x
vec.x <- c(40, 3, 11, 0, 9)
z1 <- NULL
for (i in vec.x) {
  if (i > 10) z1 <- c(z1, i)
}
z1
```
```{r}
length(z1)
```
```{r}
# build up a vector of numbers greater than 10 in vector vec.x
vec.x <- c(40, 3, 11, 0, 9)
z2 <- NA
for (i in vec.x) {
  if (i > 10) z2 <- c(z2, i)
}
z2
```
```{r}
length(z2)
```
Comparing the lengths of z1 and z2, we see that NULL is treated as nonexistent and adds nothing to the length, while NA is counted as a missing value and occupies one position.
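For this particular task, the loop is not strictly necessary. As a sketch, the same selection can be written with logical indexing, which needs neither NULL nor NA as a starting value. The names `vals` and `big` are hypothetical.

```{r}
vals <- c(40, 3, 11, 0, 9)
big <- vals[vals > 10]   # keep only the elements greater than 10
big                      # 40 11
length(big)              # 2
length(NULL)             # 0: NULL holds nothing
length(NA)               # 1: NA is a placeholder that takes up one position
```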
Let's do some operations on vectors. First, we start with an operation between a vector and a scalar.
```{r}
# Operation between x and a scalar
x <- c(1, 4, 8, 9, 10)
y <- 1
x+y
```
As you can see, 1 is added to each element in x. The operation is equivalent to:
```{r}
x <- c(1, 4, 8, 9, 10)
y <- c(1, 1, 1, 1, 1)
x+y
```
The operation between vectors with the same length is element-wise. For example:
```{r}
# two vectors with the same length
x <- c(1, 4, 8, 9, 10)
y <- c(1, 2, 0, 3, 15)
x+y
```
```{r}
x*y
```
If vectors have different lengths, then R will automatically *recycle* the shorter one until it has the same length as the longer one. For example:
```{r warning=FALSE}
x <- c(1, 4, 8, 9, 10)
y <- c(1, 2)
x+y
```
y was *recycled*. In fact, the real operation is shown below:
```{r}
x <- c(1, 4, 8, 9, 10)
y <- c(1, 2, 1, 2, 1)
x+y
```
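R issues a warning when the longer length is not a multiple of the shorter one, as with 5 and 2 above. When the lengths divide evenly, recycling happens silently, which is worth keeping in mind. A small sketch with made-up numbers:

```{r}
c(1, 2, 3, 4) + c(10, 20)        # 11 22 13 24: the shorter vector is used twice, no warning
rep(c(10, 20), length.out = 5)   # 10 20 10 20 10: recycling written out explicitly
```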
The *ifelse()* function allows us to do conditional element selection. The usage is *ifelse(test, yes, no)*. For each element, the result is taken from *yes* if the test is true and from *no* if it is false. Here are two examples.
```{r}
x <- c(2, -3, 4, -1, -5, 6)
y <- ifelse(x > 0, 'Positive', 'Negative')
y
```
In this example, each element in y is either 'Positive' or 'Negative', depending on whether the corresponding element in x is greater than 0 or not.
```{r}
x <- c(3, 4, -6, 1, -2)
y <- ifelse (x < 0, abs(x), 2 * x + 1)
y
```
In this example, if an element in x is less than 0, then take the absolute value of the element. Otherwise, multiply the element by 2 then add 1.
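When there are more than two outcomes, calls to *ifelse()* can be nested. The following sketch, with a made-up vector `vals3`, labels zeros separately:

```{r}
vals3 <- c(2, 0, -3)
ifelse(vals3 > 0, "Positive",
       ifelse(vals3 < 0, "Negative", "Zero"))   # "Positive" "Zero" "Negative"
```

The inner *ifelse()* is only consulted for elements where the outer test is FALSE.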
>
```{exercise}
>
Use the sample function to randomly select 10 integers from 1 to 100. Create a vector y which satisfies the following conditions: if a selected integer is an even number, then y returns 'even', otherwise y returns 'odd'.
```
If we have two vectors and try to compare them with each other, we can use the match() function. It returns the positions of the first matches of its first argument in its second. For example:
```{r}
x <- c(5, 5, 22, 11, 3)
y <- c(5, 11, 8)
z <- match(y, x)
z
```
The elements 5 and 11 in y are located at the first and fourth positions of x, respectively. The function ignores the other 5, which is located at the second position of x. The last element of y, 8, is not found in x, so the function returns an "NA" for it.
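If you only need to know whether each element is present, not where, the `%in%` operator is a convenient companion to match(). The sketch below uses hypothetical names `pool` and `wanted` with the same values as x and y.

```{r}
pool <- c(5, 5, 22, 11, 3)
wanted <- c(5, 11, 8)
match(wanted, pool)        # 1 4 NA: positions of the first matches
wanted %in% pool           # TRUE TRUE FALSE: present or not
wanted[wanted %in% pool]   # 5 11: keep only the values found in pool
```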
#### An example usage of vectors
Once upon a time, Tom, Jerry, and Mickey went fishing and they caught 7, 3, and 9 fishes, respectively. This information can be stored in a vector, like this:
```{r}
c(7, 3, 9)
```
The **c()** function creates a new vector by combining a set of values. If we want to continue to use the vector, we hold it in an object and give it a name:
```{r}
fishes <- c(7, 3, 9)
fishes
```
**fishes** is a vector with 3 data elements. There are many functions that operate on vectors. You can plot the vector:
```{r fig.keep='none'}
barplot(fishes) # see figure 3.3A
```
You can compute the total:
```{r}
sum(fishes)
```
We can access the individual elements by indices:
```{r}
fishes[3]
```
>
```{exercise}
>
Did Mickey catch more fishes than Tom and Jerry combined? Write R code to verify this statement using the **fishes** vector and return a TRUE or FALSE value.
```
Jerry protested that the ¼ inch long fish he caught and released per fishing rules was not counted properly. We can change the value of the 2nd element directly by:
```{r}
fishes[2] <- fishes[2] + 1
```
On the right side, we take the current value of the 2nd element, which is 3, and add 1 to it. The result (4) is assigned back to the 2nd element itself. As a result, the 2nd element is increased by 1. This is not a math equation, but a value assignment operation, which is why it is written with the assignment operator **<-**: fishes[2] **<-** fishes[2] + 1
We can also directly overwrite the values.
```{r}
fishes[2] <- 4
fishes
```
They started a campfire, and each ate 1 fish for dinner. Now the fishes left are:
```{r}
fishes2 <- fishes - 1
fishes2
```
Most arithmetic operations work just as well on vectors as they do on single values. R subtracts 1 from each individual element. If you add a scalar (a single value) to a vector, the scalar will be added to each value in the vector, returning a new vector with the results.
While they were sleeping at their campsite, a fox stole 3 fishes from Jerry’s bucket, and 4 fishes from Mickey’s bucket. How many are left?
```{r}
stolen <- c(0, 3, 4) # a new vector
fishes2 - stolen
```
If you add or subtract two vectors of the same length, R will take the corresponding values from each vector and add or subtract them. The 0 is necessary to keep the vector length the same.
Proud of himself, Mickey wanted to make a 5ft x 5ft poster to show he is the best fisherman. Knowing that **a picture is worth a thousand words**, he learned R and started plotting. He absolutely needs his name on the plots. The data elements in a vector can have names or labels.
```{r}
names(fishes) <- c("Tom", "Jerry", "Mickey")
```
The right side is a vector, holding 3 character values. These values are assigned as the names of the 3 elements in the fishes vector. names is a built-in function. Our vector looks like:
```{r fig.keep='none'}
fishes
barplot(fishes) # see figure 3.3B
```
(ref:6-1) Simple Bar plot
```{r 6-1, echo=FALSE, out.width='80%', fig.cap='(ref:6-1)', fig.align='center'}
knitr::include_graphics("images/img0601_fish.png")
```
Assigning names for a vector also enables us to use labels to access each element. Try getting the value for Jerry:
```{r}
fishes["Jerry"]
```
>
```{exercise}
>
Using the name rather than the index in the vector fishes, assign a character 'Ten' to **Tom**.
```
Tom proposes that their goal for their next fishing trip is to double their catches.
```{r}
2 * fishes
```
Hopelessly optimistic, Jerry proposed that next time each should “square” their catches, so that together they may feed the entire school.
```{r}
sum(fishes ^ 2)
```
Note that two operations are nested. You can obviously do it in two steps.
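Written out in two steps, the calculation might look like this. The sketch uses a separate named vector `catches` (a hypothetical name) holding the counts after Jerry's correction, so that the fishes vector itself is not modified.

```{r}
catches <- c(Tom = 7, Jerry = 4, Mickey = 9)
squared <- catches ^ 2   # step 1: square each element, giving 49 16 81
sum(squared)             # step 2: add them up, giving 146
```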
>
```{exercise}
>
Create a vector representing the prices of groceries, bread $2.5, milk $3.1, jam $5.3, beer $9.1. And create a bar plot to represent this information.
```
#### Scatter Plots of two vectors
The **plot** function takes two vectors, one for X values and one for Y values, and draws a graph of them. Let's draw a graph showing the relationship between numbers and their square roots.
```{r}
x <- seq(1, 20, 0.1)
y <- sqrt(x)
```
Then simply call plot with your two vectors:
```{r }
plot(x, y)
```
Great job! Notice on the graph that values from the first argument (x) are used for the horizontal axis, and values from the second (y) for the vertical.
>
```{exercise}
>
Create a vector with 21 integers from -10 to 10, and store it in the x variable. Then create a scatter plot of x^2 against x.
```
### Matrices
**A matrix** is a two-dimensional data structure in R programming. Internally, a matrix is also a vector, but with two additional attributes: the number of rows and the number of columns. A matrix has rows and columns, but it can only contain one type of values, i.e. numbers, characters, or logical values.
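We can create a matrix by using the *rbind* or *cbind* function. *rbind* stacks vectors as rows, and *cbind* places them side by side as columns. Both functions can also combine matrices.
Here are three examples: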
```{r}
m <- rbind(c(3, 4, 5), c(10, 13, 15)) # combine vectors by row
m
```
```{r}
n <- cbind(c(3, 4, 5), c(10, 13, 15), c(3, 2, 1)) # combine vectors by column
n
```
```{r}
s <- rbind(m,n) #combine two matrices m and n by row
s
```
To combine matrices by row with *rbind()*, the matrices must have the same number of columns. Similarly, to combine matrices by column with *cbind()*, they must have the same number of rows.
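A quick way to check these requirements beforehand is to compare the dimensions with `dim()`, `nrow()`, and `ncol()`. The sketch below rebuilds the two matrices under hypothetical names, `top` and `bottom`, so that m and n above stay unchanged.

```{r}
top <- rbind(c(3, 4, 5), c(10, 13, 15))                      # 2 rows, 3 columns
bottom <- cbind(c(3, 4, 5), c(10, 13, 15), c(3, 2, 1))       # 3 rows, 3 columns
dim(top)                       # 2 3
dim(bottom)                    # 3 3
ncol(top) == ncol(bottom)      # TRUE, so rbind(top, bottom) works
dim(rbind(top, bottom))        # 5 3
nrow(top) == nrow(bottom)      # FALSE, so cbind(top, bottom) would fail
```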
We can also create a matrix by using the *matrix()* function:
```{r}
x <- matrix(seq(1:12), nrow = 4, ncol = 3)
x
```
Here *seq(1:12)* creates a sequence from 1 to 12, the *nrow* argument defines the number of rows in the matrix, and the *ncol* argument defines the number of columns. We don't have to give both *nrow* and *ncol*: if one is provided, the other is inferred from the length of the data.
```{r}
y <- matrix(seq(1:12), nrow = 4)
y
```
As we can see, the matrix is filled column-wise by default. If you want to fill a matrix row-wise, add the argument *byrow = TRUE*:
```{r}
z <- matrix(seq(1:12), nrow = 4, byrow = TRUE) # fill matrix row-wise
z
```
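One way to convince yourself of the relationship between the two filling orders is sketched below: filling a 4 x 3 matrix by row gives the same values as filling a 3 x 4 matrix by column and then transposing it with `t()`. The variable names are hypothetical.

```{r}
by.row <- matrix(1:12, nrow = 4, byrow = TRUE)   # fill 4 x 3 row by row
by.col.t <- t(matrix(1:12, nrow = 3))            # fill 3 x 4 by column, then transpose
dim(by.col.t)                                    # 4 3
all(by.row == by.col.t)                          # TRUE: every cell agrees
```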
The following code will create an empty matrix w. Since no data are supplied, every cell is filled with NA:
```{r}
w <- matrix(nrow = 4, ncol = 3)
w
```
We can assign values to the matrix. For example, let's assign the value 3 to the position at first row and first column and value 100 to the position of the second row and third column:
```{r}
w[1,1] <- 3
w[2,3] <- 100
w
```
We can also create a matrix from a vector by setting its dimensions with the function *dim()*.
```{r}
x <- c(1, 5, 6, 9, 8, 10, 21, 15, 76)
x
```
```{r}
class(x)
```
```{r}
dim(x) <- c(3, 3)
x
```
```{r}
class(x)
```
We can convert a non-matrix data set to a matrix using *as.matrix()* function. Take the data *iris* as an example.
```{r}
subset.iris <- iris[1:10, 1:4]
class(subset.iris)
```
The data structure of *subset.iris* is a data frame. The function *as.matrix* will convert a data frame to a matrix.
```{r}
x <- as.matrix(subset.iris)
class(x)
```
Various matrix operations can be applied in R. For example:
```{r}
x <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
x
```
```{r}
x^2 # Square each element in x
```
You can transpose the matrix if you want, for convenience of viewing and analysis.
```{r}
y <- t(x) # transpose of x
y
```
```{r}
x %*% y # Matrix Multiplication
```
```{r eval=F}
x - y # Matrix subtraction
```
Error in x - y : non-conformable arrays
The error reminds us that matrices must have the same dimensions for subtraction.
```{r}
y <- matrix(rep(1, 6), nrow = 2)
y
```
```{r}
x - y
```
We can produce a new matrix in which each element is doubled and then increased by 5:
```{r}
z <- 2 * x + 5
z
```
We can also get a logical matrix by using a logical expression like:
```{r}
x <- matrix(c(12, 34, 51, 27, 26, 10), ncol = 2)
x > 20
```
We can extract all *TRUE* results from x by using *x[x>20]*.
```{r}
x[x > 20]
```
Similarly, we can define a vector with logical values, then apply it to x to get the elements at all TRUE positions.
```{r}
log.vau <- c(FALSE, TRUE, TRUE, TRUE, TRUE, FALSE)
x[log.vau]
```
Remember that a matrix is a vector that is filled column-wise. Therefore the vector of logical values is applied to x in column-wise order.
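The column-wise ordering can be made visible with the sketch below, which uses a separate matrix `grid` (a hypothetical name) with the same values as x. The `arr.ind = TRUE` argument of the base function `which()` reports row and column positions instead of positions in the underlying vector.

```{r}
grid <- matrix(c(12, 34, 51, 27, 26, 10), ncol = 2)
as.vector(grid)                    # 12 34 51 27 26 10: the underlying column-wise order
which(grid > 20)                   # 2 3 4 5: positions in that vector
which(grid > 20, arr.ind = TRUE)   # the same four cells as (row, col) pairs
```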
Since a matrix is a vector with two dimensions, all operations for vectors also apply to matrices. For example:
```{r}
x[1, ] # Get the first row of x
```
```{r}
a <- as.matrix(iris[, 1:4]) #Take out the first 4 columns of iris and convert it to matrix.
sub.a <- a[5:10, 2:4] # Extract a subset
sub.a
```
```{r}
x <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
sub.a[1:2, ] <- x # Replace the first two rows of sub.a with x
sub.a
```
Now if we want to know the mean and sum of these rows and columns, try *rowMeans()*, *colMeans()*, *rowSums()*, *colSums()*.
```{r}
x <- as.matrix(iris[1:10, 1:4])
rowMeans(x)
```
```{r}
colMeans(x)
```
```{r}
rowSums(x)
```
```{r}
colSums(x)
```
Here we compute the standard deviation of each column using the *apply()* function. The second argument of *apply()*, 1 or 2, indicates whether the function is applied to rows (1) or columns (2).
```{r}
apply(x, 2, sd) # Calculate the standard deviation of each column in x
```
Or median by rows, using 1 for rows.
```{r}
apply(x, 1, median) # Calculate the median of each row in x
```
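The third argument of *apply()* does not have to be an existing function. As a sketch, you can also pass a small function written on the spot. The example below computes the spread (maximum minus minimum) of each column. The names `iris.mat` and `col.spread` are hypothetical.

```{r}
iris.mat <- as.matrix(iris[1:10, 1:4])
# For each column (margin 2), subtract the smallest value from the largest
col.spread <- apply(iris.mat, 2, function(col) max(col) - min(col))
col.spread
```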
A heat map is my favorite type of graph for visualizing a large data matrix.
```{r}
heatmap(x, scale = "column", margins = c(10,5))
```
>
```{exercise}
>
Let *subset.iris <- as.matrix(iris[1:10, 1:4])*. Use the apply function to calculate the mean of *subset.iris* by column.
```
**Example:**
Define a function to find the positions of minimal value in each column of *subset.iris <- as.matrix(iris[1:10, 1:4])*.
```{r}
find.min.posi <- function(x){
  y <- function(xcol){
    return(which.min(xcol))
  }
  return(apply(x, 2, y))
}
```
```{r}
subset.iris <- as.matrix(iris[1:10, 1:4])
subset.iris
```
```{r}
find.min.posi(subset.iris)
```
As you can see, the minimal value for Sepal.Length is 4.4, which is located at the 9th row. The minimal values for Sepal.Width, Petal.Length and Petal.Width are located at the 9th, 3rd and 10th rows, respectively.
>
```{exercise}
>
Let *subset.iris.2 <- as.matrix(iris[, 1:4])*, fill blanks in the *find.max* function defined below to find the maximal value in each column of *subset.iris.2*.
>
find.max <- function(x){
y <- function(xcol){
return(_________)
}
return(_______(x, ____, y))
}
>
________(subset.iris.2)
```
To make a matrix more easily readable, we can use functions *colnames()* and *rownames()* to assign names to the columns and rows of the matrix. For example:
```{r}
y <- rbind(c(1, 3, 5), c(2, 4, 6))
y
```
```{r}
colnames(y) <- c('First.Col', 'Second.Col', 'Third.Col')
rownames(y) <- c('odd.number', 'even.number')
y
```
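Once a matrix has row and column names, they can be used for indexing in place of numbers. A small sketch, using a separate matrix `named.mat` (a hypothetical name) built the same way as y:

```{r}
named.mat <- rbind(c(1, 3, 5), c(2, 4, 6))
colnames(named.mat) <- c("First.Col", "Second.Col", "Third.Col")
rownames(named.mat) <- c("odd.number", "even.number")
named.mat["odd.number", "Second.Col"]   # 3: one cell, selected by row and column name
named.mat["even.number", ]              # the whole second row: 2 4 6
dimnames(named.mat)                     # both sets of names as a list
```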
Another interesting application of the matrix is the manipulation of images. You may have heard of a pixel matrix, where pixels are "picture elements". They are the small dots that make up images. Each image is a matrix with thousands or even millions of pixels. Each pixel can only be one color at a time. If we have a gray-scale image, the brightness of each pixel is determined by the value assigned to it. In other words, the pixel value is a single number that represents the brightness of the pixel. For example, if the value 0.2941 is assigned to the pixel located at the 3rd row and 4th column, then the dot at the 3rd row and the 4th column is pretty dark. The range of colors from black to white corresponds to a scale that varies from 0% to 100%.
Sometimes we need to blur images or add mosaic to a picture for various purposes. Let's use one example to demonstrate how to add mosaic to a gray-scale image.
**Example: Add mosaic to Einstein image**
Firstly, let's read a gray-scale image of Einstein into R and view the image.
```{r warning=FALSE}
library(pixmap)
EINSTEIN <- read.pnm("images/EINSTEIN.pgm", cellres = 1)
plot(EINSTEIN)
```
Then let us look at the structure of this image:
```{r}
str(EINSTEIN)
```
Here we get a new class, which is of the S4 type. We won't go into depth on this for now. One fact we need to pay attention to is that we must use the "@" sign instead of the "$" sign to designate the components.
```{r}
class(EINSTEIN@ grey)
```
The class of *EINSTEIN@ grey* is a matrix. The output *..@ grey : num [1:512, 1:512]* shows that the dimensions of the matrix are 512 x 512. The values "0.596 0.541 0.522 0.529 0.561 ..." represent the brightness values in the matrix. For example, the value of the pixel at the 3rd row and the 4th column is 0.2941176 as shown below:
```{r}
EINSTEIN@ grey[3,4]
```
If we change the value 0.2941176 to 0 by *EINSTEIN@ grey[3,4] <- 0*, then that pixel will become pure black. If we assign a random number between 0 and 1 to the value, then the brightness of the pixel will be set randomly by that number. Using this idea, we define a mosaic function which will be used to blur the image.
```{r}
mosaic.plot <- function(image, yrange, xrange){
  length.y <- length(yrange)
  length.x <- length(xrange)
  image2 <- image
  whitenoise <- matrix(nrow = length.y, ncol = length.x, runif(length.y * length.x))
  image2@grey[yrange, xrange] <- whitenoise
  return(image2)
}
```
The argument *image* is the original image, *yrange* is the range of rows that you want to blur, and *xrange* is the range of columns that you want to blur. Together, *xrange* and *yrange* define the mosaic region. Since we don't want to change the original image, we copy it to image2. The *whitenoise* object is a matrix filled with random numbers following a uniform distribution. Its dimensions are determined by the mosaic region. Finally, the values in the region that you want to blur, *image2@grey[yrange,xrange]*, are replaced by the *whitenoise* matrix.
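The same idea can be tried on an ordinary matrix, without the *pixmap* package. The sketch below is a simplified stand-in, not the function above: `toy.img` is a made-up 6 x 8 "image" of uniform gray, and all variable names are hypothetical.

```{r}
set.seed(1)                                   # make the random noise reproducible
toy.img <- matrix(0.5, nrow = 6, ncol = 8)    # every pixel has brightness 0.5
blur.rows <- 2:4                              # rows of the region to cover
blur.cols <- 3:6                              # columns of the region to cover
noise <- matrix(runif(length(blur.rows) * length(blur.cols)),
                nrow = length(blur.rows), ncol = length(blur.cols))
toy.mosaic <- toy.img                         # work on a copy, keep the original
toy.mosaic[blur.rows, blur.cols] <- noise     # overwrite only the selected block
toy.mosaic
```

Only the 3 x 4 block is replaced, and every cell outside it keeps its original value of 0.5.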
```{r}
EINSTEIN.mosaic <- mosaic.plot(EINSTEIN, 175:249, 118:295)
plot(EINSTEIN.mosaic)
```
Here, we take *yrange=175:249* and *xrange=118:295* to select a sub-matrix from the 175th row to the 249th row, and from the 118th column to the 295th column. This sub-matrix stores the brightness values of the region around Einstein's eyes. The sub-matrix is replaced by the whitenoise matrix. Therefore the image around the eyes is replaced by an image of random dots.
The function *locator()* allows us to find the relevant rows and columns. Type **locator()** in the *Console* window, and R will wait for you to click a point within the image. Press *esc* on your keyboard to exit the function, and it will return the coordinates of that point in the *Console* window. If you click more than one point, the function returns the coordinates of all the points in the order you clicked them. You must be careful about the y-coordinate. The row numbers in *pixmap* objects increase from the top of the image to the bottom, so you need to flip the y-coordinates by subtracting them from the number of rows in the original image. For example, the y-coordinates that I obtained from the *locator()* function are 337 and 263. After subtracting them from 512, I got 175 and 249. These define the yrange used in the mosaic function.
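The conversion from clicked y-coordinates to row numbers is simple arithmetic, sketched below with the numbers from the text. The variable names are hypothetical.

```{r}
n.rows <- 512                           # number of rows in the image matrix
clicked.y <- c(337, 263)                # y-coordinates reported by locator()
y.rows <- n.rows - clicked.y            # flip the axis: 175 249
row.range <- min(y.rows):max(y.rows)    # 175:249, the rows to blur
length(row.range)                       # 75 rows
```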
>
```{exercise}
>
Use the *mosaic.plot()* function to blur the image of mona_lisa.pgm by adding a mosaic to the region around her eyes. Your output should look like the graph below.
```
>
```{r echo = FALSE}
# Mona lisa