\documentclass[smallextended]{svjour3} % onecolumn (second format)
\RequirePackage{fix-cm}
\smartqed% flush right qed marks, e.g. at end of proof
\usepackage{graphicx}
\usepackage[round]{natbib}
\usepackage{amsmath}
\begin{document}
\journalname{Computational Statistics}
\title{Community engagement and subgroup meta-knowledge: Some factors in the soul of a community}
\titlerunning{Community engagement and subgroup meta-knowledge} % if too long for running head
\author{Amelia A McNamara}
\institute{A. McNamara \at
University of California, Los Angeles \\
Los Angeles, CA 90095 \\
\email{[email protected]}
}
\date{Received: date / Accepted: date}
\maketitle
\begin{abstract} The Knight Foundation collects data to determine what factors impact community satisfaction, local GDP growth, and interest in Knight news publications. For the 2013 Data Expo at the Joint Statistical Meetings, participants created graphical explorations of these data. This article focuses on the idea of community meta-knowledge, which is essentially majority group empathy or understanding of how minorities experience their community. For example, the survey asks participants to rate their community ``as a place for senior citizens,'' on a 5-point Likert scale. A city where seniors rated their community in the same way as non-seniors is defined as a community with high meta-knowledge about conditions for seniors. Three minority groups were explored: seniors, families with young children, and racial minorities. In most communities, people outside the minority group tended to under-rate their community, compared to those in the minority group. However, meta-knowledge about racial minorities stood out as an exception.
\keywords{2013 Data Exposition, R, ggplot2, Likert scales, meta-knowledge}
\end{abstract}
\section{Introduction}
\label{intro}
Studies have shown that increasing empathy is the best way to improve intergroup relations \citep{SteFin1999}. Therefore, it is of interest to quantify the typical level of empathy in communities across the United States. The Knight Foundation data provide a window into something which could be thought of as a proxy for empathy, namely community meta-knowledge. We define meta-knowledge as awareness by those outside a specific subgroup of the conditions for people inside the subgroup. The primary aim of this article is to answer the question ``Are people outside a specific subgroup aware of the quality of their community for people in that subgroup?'' and then, ``Do communities with high meta-knowledge (those where people outside the subgroup understood conditions for minorities) have higher community satisfaction rates than those with low meta-knowledge?''
This article is one of several related to the Knight Foundation community data from the 2013 American Statistical Association Data Expo. For more information on the Expo and the data sets, see \cite{Hof2019}.
This article is fully reproducible. The paper was written using \LaTeX\, and R~\citep{R2014}. The R package knitr was used to interweave writing and R code~\citep{Xie2014}. If you inspect the source code, available on Github~\citep{McN2019}, you will find that each of the figures in the paper is generated when the document is compiled (``knit''). Likewise, any statistic mentioned in the text is calculated using R and automatically inserted into the text using knitr. All graphics were generated with the package ggplot2, an R implementation of The Grammar of Graphics~\citep{Wic2009, Wil2005}. The paper was written in the time period between the development of the R package plyr~\citep{Wic2011} and the R package dplyr, which superseded it~\citep{WicFra2015}. Therefore, it uses functions from both packages.
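As a minimal sketch of the build step (it assumes the repository root, containing the .Rnw file and the data/ folder, is the working directory):
<<knitsketch, echo=TRUE, eval=FALSE>>=
# Sketch only: knitting runs the embedded R code and writes
# PaperFinalDraft.tex, which can then be compiled with pdflatex/bibtex.
knitr::knit("PaperFinalDraft.Rnw")
@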
One challenge with reproducibility is that as packages develop, they are not always backwards-compatible. To help combat this issue, the code available on Github includes a packrat library. The packrat package allows a user to snapshot the state of their R packages in order to freeze them for future reproducibility~\citep{UshMcP2014}. For example, although the current version of dplyr is 0.7.8, this paper uses version 0.4.1. In order to run the code hosted on Github, you will need to ``hydrate'' the library of packages using packrat. The Github repository contains the underlying data, the \LaTeX/knitr file which produces the paper (including R code), a packrat archive of the necessary R packages, and \LaTeX\, style files to create the specific journal style.
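As a rough sketch of that restore workflow (assuming the repository has been cloned and R is started in its root directory):
<<packratrestore, echo=TRUE, eval=FALSE>>=
# Sketch of the restore step, not part of the paper's analysis:
# rebuild the frozen package library recorded in packrat/packrat.lock.
install.packages("packrat")  # only needed if packrat is not yet installed
packrat::restore()
@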
The remainder of this paper is an exploration of the idea of community meta-knowledge. It begins with a discussion of the data in Section \ref{data}, then moves to focus on community satisfaction in Section \ref{communittsatsec}, and community engagement through local behaviors in Section \ref{behaviorsec}. Finally, it discusses meta-knowledge in Section \ref{metasec} and wraps up with conclusions in Section \ref{conclusion}.
\section{The Data}
\label{data}
The Knight Foundation collects survey data on 26 communities where the Knight brothers own newspapers, including San Jose, CA, State College, PA, Palm Beach, FL, and St. Paul, MN. The foundation provided data for three years, starting in 2008. Each data set includes approximately 20 demographic questions and 50-80 survey questions, depending on how distinct questions are defined. The aim of the survey is to gauge what factors are important to community attachment, and it includes questions on a variety of subjects, from ``how satisfied are you with this community as a place to live?" to ``how many minutes is your daily commute?"
The survey is conducted over the phone by Gallup Poll, and can take place in either English or Spanish. Gallup also performs data analysis for the Knight Foundation, and their yearly reports are available on the Knight Foundation website \citep{KF2008, KF2009, KF2010}.
The existing data analysis from Gallup is related to a metric they call ``Community Attachment." It is difficult to pin down what this variable means, but it's a composite metric composed of Community Loyalty and Community Passion. Both of those metrics, in turn, are composed of several variables. Community Loyalty includes how likely a person says they are to stay in that particular area, how much they would recommend it to friends, and their outlook for the community's future \citep{KF2010}. Community Passion is composed of variables on connectedness and community pride. So, Community Attachment is already a model of what Gallup believes is important to strong communities. The Gallup team has discovered that this composite variable is positively correlated with local Gross Domestic Product (GDP) growth \citep{KF2010b}. Because of this relationship, the analysis from Gallup is focused on what other factors correlate with Community Attachment (and therefore, with local GDP growth).
While the Gallup Poll analysis is interesting, it does raise the question of multicollinearity, as factors that are correlated with Community Attachment may simply be correlated with one of the variables that were used to compose it, and may not actually have an impact on local GDP growth.
<<packages, echo=FALSE>>=
require(knitr)
@
<<chunkoptions, echo=FALSE, include=FALSE>>=
opts_chunk$set(echo=FALSE, cache=FALSE, message=FALSE, warning=FALSE, error=FALSE, fig.align='center')
require(dplyr)
require(ggplot2)
require(scales)
require(RColorBrewer)
require(ggmap)
require(xtable)
require(reshape2)
require(forcats)
@
<<readdata>>=
sotc10 <- read.csv("data/sotc10.csv")
sotc09 <- read.csv("data/sotc09.csv")
sotc08 <- read.csv("data/sotc08.csv")
@
<<renaming>>=
renameData <- function(datasetname){
plyr::rename(datasetname,c("INTRO1" = "gender", "QSB" = "citystate", "QS3"="county", "QS3_02"="county2", "QS3A"="zipcode", "QD1"="age", "QS4"="area_descrip", "QS5"="further", "QS5_2"="further2", "QN1A"="current_life", "QN1B"="future_life", "QCE1"="community_sat","QCE2"="recommend","Q3A"="proud", "Q3B"="perfect", "Q3C"="reputation", "Q4_1"="problem_1", "Q4_2"="problem_2", "Q4_3"="problem_3", "Q4_1_1"="problem_1", "Q4_1_2"="problem_2", "Q4_1_3"="problem_3","Q4_2_1"="problem_4", "Q4_2_2"="problem_5", "Q4_2_3"="problem_6","Q4_3_1"="problem_7", "Q4_3_2"="problem_8", "Q4_3_3"="problem_9","Q5"="rather_live", "Q6"= "five_years_ago", "Q6A" = "five_years_future", "Q7A"= "parks", "Q7B"= "beauty", "Q7C"= "highways", "Q7D"= "affordable_housing", "Q7E" = "job_opportunites", "Q7F"= "public_schools", "Q7G" = "colleges", "Q7H"= "nightlife", "Q7I"= "friends", "Q7J" = "raise_kids", "Q7K" = "heathcare", "Q7L" ="leadership", "Q7M" = "care", "Q7N"= "police", "Q7O" = "arts", "Q7P" = "community_events", "Q8A" = "talented_grads", "Q8B" = "immigrants", "Q8C"= "minorities", "Q8D"= "families_kids", "Q8E" = "gay_lesbian", "Q8F" = "seniors", "Q8G" = "young_adults", "Q9" = "economy", "Q10" = "econ_better", "Q10A" = "trust_government", "Q11" = "employment", "Q12" = "commute", "Q13" = "job_satisfaction", "Q14" = "company_workforce", "Q15" = "enough_income", "Q15AA" = "area_hiring", "Q15AB" = "representative_leaders", "Q16A" = "treated_respectfully", "Q16B" = "well_rested", "Q16C" = "high_stress", "Q16D" = "learned_yest", "Q17" = "stress_source", "Q18" = "night_safety", "Q19" = "crime", "Q20" = "crime_increasing", "Q21"= "registered_vote", "Q22A" = "volunteer", "Q22B" = "public_meeting", "Q22C"= "voted_local", "Q22D" = "worked_change", "Q22E" = "church", "Q22F" = "festival", "Q22G" = "donated_org", "Q22H" = "donated_individual", "Q22I" = "donated_shelter", "Q22_A" = "impact", "Q23" = "clubs", "Q24" = "friends_nearby", "Q24A" = "friends_friends", "Q25"= "family_nearby", "Q26"= "talk_neighbors", "QD2" = "years_lived", "QD2A" ="permanent_resident", "QD3" = "work_cat", "QD3A" = "household_adults", "QD4" = "children", "QD5A" = "under6", "QD5B" = "six_12", "QD5C" = "thirteen_17", "QD6" = "married", "QD7"= "education", "QD8" = "own_home", "QD9"= "income", "QD10" = "hispanic", "QD111" = "race", "QD112" = "race2", "QD113" = "race3", "QD12" = "white_hispanic", "QD13" = "phone_numbers", "QD13A" = "cell_only", "QD13B" = "have_cell", "QD13C" = "cell_calls"))
}
sotc10 <- renameData(sotc10)
sotc09 <- renameData(sotc09)
sotc08 <- renameData(sotc08)
@
<<usedalot>>=
n1 <- c("citystate", "InGroup", "Response", "n", "Freq", "year")
colors2 <- c("#f1a340", "#998ec3")
colors1 <- brewer.pal(5, "Spectral")
colorsA <- colors1[c(2,3,4,1,5)]
colorsB <- colors1[c(2,3,4,5,1)]
@
<<LikertDataFunction>>=
# Split Likert responses into positive and negative halves for a diverging
# stacked bar chart. Any neutral category is split evenly between the two
# halves, and negative frequencies are negated so they plot left of zero.
LikertData <- function(positive, negative, neutral=NULL, dataset){
  pos <- dataset[dataset$Response %in% positive, ]
  neg <- dataset[dataset$Response %in% negative, ]
  if (!is.null(neutral)){
    neu <- dataset[dataset$Response %in% neutral, ]
    neu$Freq <- neu$Freq/2
    pos <- rbind(pos, neu)
    neg <- rbind(neg, neu)
  } else {
    neu <- NULL
  }
  neg$Freq <- -neg$Freq
  # Reverse the factor order on the negative side so the colors mirror
  # the positive side around zero.
  neg$Response <- factor(neg$Response)
  neg$Response <- factor(neg$Response, levels=rev(levels(neg$Response)))
  pos$Response <- factor(pos$Response)
  return(list(pos=pos, neu=neu, neg=neg))
}
@
<<calculatingInGroupFeelings>>=
FeelingsC <- function(dataset){
Feelings <- dataset %>%
group_by(citystate, InGroup) %>%
summarise(percents= sum(Freq))
Feelings <- data.frame(Feelings)
newDF <- data.frame(InGroup=NA, OutGroup=NA, citystate=levels(Feelings$citystate))
# ratio: in-subgroup positive share minus out-of-subgroup positive share
# (the reciprocal of this difference is the MK score used in the text)
newDF$ratio <- (Feelings[Feelings$InGroup=="TRUE" | Feelings$InGroup=="No",]$percents) - (Feelings[Feelings$InGroup=="FALSE" | Feelings$InGroup=="Yes",]$percents)
newDF$InGroup <- Feelings[Feelings$InGroup=="TRUE" | Feelings$InGroup=="Yes",]$percents
newDF$OutGroup <- Feelings[Feelings$InGroup=="FALSE" | Feelings$InGroup=="No",]$percents
return(newDF)
}
@
\subsection{Community Survey Rates}
\label{maprates}
<<gettinggeodata>>=
placesPop <- dget("data/popdata.robj")
percent1 <- summary(sotc10$citystate)/placesPop$CENSUS2010POP
percent1 <-c(percent1, summary(sotc09$citystate)/placesPop$POPESTIMATE2009)
percent1 <- c(percent1, summary(sotc08$citystate)/placesPop$POPESTIMATE2008)
popData <- data.frame(year=c(rep(2010, 26), rep(2009, 26), rep(2008, 26)), percent=percent1)
popData$percent <- popData$percent*100
popData$citystate <- c(rep(levels(sotc10$citystate),3))
@
<<surveyrates, fig.align="left", fig.cap="Yearly survey percentages, displayed on a log scale. Communities are ordered by median survey rate. Notice that some communities are always over-surveyed (for example, Palm Beach, FL) and some always appear under-surveyed (for example, Long Beach, CA). Population data come from the Census Bureau intercensal population estimates.", out.width="0.99\\linewidth",fig.width = 9.5,fig.height=7.5>>=
popData <- popData %>%
group_by(citystate) %>%
mutate(medianpercent = median(percent)) %>%
ungroup() %>%
mutate(citystate = fct_reorder(citystate, desc(medianpercent)))
p <- ggplot(aes(x=year, y=percent), data=popData) +
facet_wrap(~citystate, ncol=5) +
geom_line() +
scale_x_continuous(
breaks=c(2008, 2008.5, 2009, 2009.5, 2010),
labels=c("2008","", "2009", "", "2010")
) +
scale_y_log10(breaks=c(0.04, 0.4, 4)) +
xlab("") +
ylab("Percent of community surveyed")
p
@
As mentioned above, the data were collected by Gallup through telephone surveys in 2008, 2009, and 2010. Participants were a random sample of adults living in 26 ``communities" (cities or metro areas of the United States). The data from 2008 and 2009 had \Sexpr{dim(sotc08)[1]} and \Sexpr{dim(sotc09)[1]} responses, respectively, while the data from 2010 contained \Sexpr{dim(sotc10)[1]} observations. Because the data sets surveyed the same 26 communities, we can calculate the average number of survey participants in each community. In 2008, that average was \Sexpr{floor(dim(sotc08)[1]/26)} people, in 2009, \Sexpr{floor(dim(sotc09)[1]/26)} people and in 2010, \Sexpr{floor(dim(sotc10)[1]/26)} people. The difference in average number of survey participants will be discussed further in Section \ref{missingdata}.
In most communities, approximately 400 people were interviewed, but certain communities were surveyed much more. It appears that the Knight Foundation was trying to survey places at an approximately similar rate, which is why Philadelphia (for example) was surveyed 1633 times in 2010. To see which places were over- or under-represented in the survey, see Figure \ref{fig:surveyrates} for plots showing the percentage of the community that was polled for each polling year.
Figure \ref{fig:surveyrates} shows the percentage of the community that was polled; percentages hover around a mean of 0.07\%, with substantial variation. Palm Beach, FL always looks over-represented because the community has a small number of residents and the minimum sample size of 400 was always used, leading to a polling rate around 4\%. Large communities like Philadelphia, PA, look under-represented, with a rate around 0.01\%. In addition, there is some variation over time, especially in the eastern United States. For example, Akron, OH begins with a polling rate of 0.01\%, which rises to 0.07\% and then 0.09\%, as polling increased from around 400 residents to more than 1700. It is not clear why Gallup made these polling decisions.
\subsection{Scale Lengths}
The majority of the Knight data is in the form of responses to survey questions, and most survey questions were answered on a Likert scale~\citep{Lik1932}. However, there was little consistency in the number of levels for the scales. The most common scale was a five-point scale, as in ``Not at all satisfied, 2, 3, 4, Extremely satisfied'' or ``Very bad, 2, 3, 4, Very good.'' Beyond this, many other scale wordings (and scale lengths) were used. For the yearly distribution of scale lengths, see Figure \ref{fig:LikertSizes}.
<<numlevels>>=
pullLevels <- function(dataset, range){
numLevels <- NA
for (i in range){
numLevels[i] <- length(levels(as.factor(dataset[,i])))
}
return(numLevels)
}
#Removing the ones that don't make sense-- listing problems, in particular
# For sotc10, this works:
numLevels10 <- pullLevels(sotc10, c(11:80))
numLevels10[which(numLevels10>25)] <- NA
#For sotc09
numLevels09 <- pullLevels(sotc09, c(11:70))
numLevels09[16:24] <- NA
#For sotc08
numLevels08 <- pullLevels(sotc08, c(9:64))
# This takes out the problem questions as well as the commute question
numLevels08[which(numLevels08>20)] <- NA
numLevels <- data.frame(levels=c(numLevels08, numLevels09, numLevels10), year=c(rep(2008, times=length(numLevels08)), rep(2009, times=length(numLevels09)), rep(2010, times=length(numLevels10))))
@
The varied lengths of the response scales, and the different phrasings used even for scales of the same length, suggest that this survey was quite long and complex to complete. And though the survey maintains the scale lengths for individual questions over the years, Gallup rescales all the questions down to a 3-point scale to make their analysis simpler. The complete data set provided by the Knight Foundation includes between \Sexpr{dim(sotc08)[2]} and \Sexpr{dim(sotc10)[2]} variables, depending on the year, but fully half of them are rescaled versions of the original questions. While some researchers have suggested that a 3-point scale is enough \citep{JacMat1971}, discarding data seems wasteful, especially if participants have gone to the trouble of rating on a 5- or 7-point scale. So, the remainder of this analysis works on the unscaled variables.
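For illustration, that kind of rescaling could look like the sketch below. This is a guess at the operation rather than Gallup's actual procedure; the level labels are taken from the community satisfaction question, and the new variable name is invented.
<<rescalesketch, echo=TRUE, eval=FALSE>>=
# Illustration only: collapse the 5-point community satisfaction scale to
# three levels, roughly what the rescaled variables in the data set contain.
sotc10$community_sat3 <- forcats::fct_collapse(
  factor(sotc10$community_sat),
  negative = c("Not at all satisfied", "2"),
  neutral  = "3",
  positive = c("4", "Extremely satisfied")
)
@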
<<LikertSizes, fig.cap="Scale length distributions for each year of the survey. Notice that the distribution from 2008 is centered around 7, and the 2009 and 2010 distributions are centered around 5. All three distributions have large variation, suggesting that the Knight surveys were quite complex.",out.width="0.99\\linewidth",fig.width = 9.5,fig.height=4.5>>=
resp<- ggplot(numLevels, aes(x=levels)) +
geom_histogram(breaks=seq(from=0, to=15, by=1)) +
facet_grid(.~year) +
labs(x="Number of levels on the scale", y="Number of questions using the scale") +
theme(strip.text=element_text(size=14))
resp
@
\subsection{Missing Data}
\label{missingdata}
While the Gallup reports claim the telephone surveys only took 15 minutes, the number of variables collected and the wide range of response scales seem to indicate a much larger time commitment. This raises the question of whether everyone who began the survey completed it. And, as mentioned in Section \ref{maprates}, the 2010 data contained many more observations than previous years. The explanation for this difference is missing data, presumably related to surveys that were not fully completed.
<<missingmatch>>=
matchup1 <- sum(is.na(sotc10$recommend) & is.na(sotc10$colleges))
matchup2 <- sum(is.na(sotc10$colleges) & is.na(sotc10$registered_vote))
@
In order to explore this, we study the pattern of missing data within the 2010 data set. This data set includes a large number of responses that are extremely sparse, containing almost no completed survey questions, but generally completed demographic information. For example, the question ``How likely are you to recommend this community to a friend or associate as a place to live?'' has \Sexpr{sum(is.na(sotc10$recommend))} missing values, and the question ``The overall quality of the colleges and universities'' has \Sexpr{sum(is.na(sotc10$colleges))} missing values. The overlap of missing values between those two questions is \Sexpr{matchup1}. In other words, almost all the entries that are missing a response to the question about recommending the community are also missing a response to the question about colleges. Similarly, there are \Sexpr{matchup2} respondents missing both the question about colleges and the question ``Are you registered to vote?''
These questions were chosen arbitrarily, but the pattern is almost the same no matter which pair of survey questions was selected. As a result, we chose to use the question about recommending the community to a friend or associate as a proxy for the overall missing data. That is, if the response was missing an answer to that question, we considered it to be a ``missing'' response. The particular question was selected as an indicator of missing data because it was the first question on the survey (in order of the survey script) to have this scale of missing data. Better methods could have been used, for example setting a cutoff number of missing questions within a single respondent, but this seemed to be fairly accurate for our purposes.
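For reference, such a cutoff rule could be sketched as follows. The cutoff of ten unanswered questions and the column range 11--80 (the survey questions in the 2010 data) are assumptions made for illustration, not the rule used in this paper.
<<cutoffsketch, echo=TRUE, eval=FALSE>>=
# Illustrative alternative: flag a respondent as "missing" when more than
# some cutoff of survey questions are unanswered.
n_unanswered <- rowSums(is.na(sotc10[, 11:80]))
sotc10$missing_flag <- n_unanswered > 10
mean(sotc10$missing_flag)  # proportion of respondents flagged as missing
@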
In order to be consistent, we have considered data missing in 2008 and 2009 if there was no response to the question ``How likely are you to recommend this community to a friend or associate as a place to live?'' even though the 2008 and 2009 data do not show the same correspondence between that question being missing and the rest of the responses being almost completely sparse. Table \ref{missingtable} shows the percentage of data designated as missing for each survey year. While the 2008 and 2009 data sets are almost complete, the 2010 data set has about 25\% missing data.
Interestingly, with the incomplete responses removed, the 2010 data set is reduced to approximately 15,000 observations, which is much closer to the approximately 14,000 observations in each of the two prior years. This suggests that incomplete responses were removed in previous years, or that some new survey methodology (e.g., a ``short form'') was introduced in 2010.
<<missingdatatable, results="asis">>=
count_na <- function(x) sum(is.na(x))
missing10 <- count_na(sotc10$recommend)
missing09 <- count_na(sotc09$recommend)
missing08 <- sotc08 %>%
group_by(recommend == "(Refused)") %>%
summarise(n = n())
missing <- c(missing08$n[2], missing09, missing10)
missingtable <- as.matrix(data.frame(Year=c(2008, 2009, 2010), Missing=missing, Total=c(dim(sotc08)[1], dim(sotc09)[1], dim(sotc10)[1]), Percent=missing/c(dim(sotc08)[1], dim(sotc09)[1], dim(sotc10)[1])*100))
rownames(missingtable) <- NULL
print(xtable(x=missingtable,caption="Percent of missing data from surveys. The 2010 survey has almost 25 percent missing data. The total number of entries in 2010 is also much larger, suggesting that the 2008 and 2009 datasets used a different criterion for including entries.", display=c("d", "d", "d","d", "f"), label="missingtable"),include.rownames=FALSE)
@
\section{Community Satisfaction}
\label{communittsatsec}
Knowing that the question about community satisfaction was the only survey question answered by all respondents, it made sense to see which communities reported the highest levels of community satisfaction.
To visualize this, a set of stacked distribution graphs were created \citep{RobHei2011, HeiRob2014}. These stacked distribution graphs are centered around zero and use a diverging color scale to give an overall graphical sense of the amount of positive and negative responses across groups. Figure \ref{fig:communitysatplot} shows the distribution of responses to the question, ``Taking everything into account, how satisfied are you with this community as a place to live?'' and is ordered by the communities with the largest total positive responses in 2008, which highlights the changes in 2009 and 2010.
<<communitysat>>=
tmp1 <- tbl_df(sotc08) %>%
group_by(citystate, community_sat) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2008)
tmp2 <- tbl_df(sotc09) %>%
group_by(citystate, community_sat) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2009)
tmp3 <- tbl_df(sotc10) %>%
group_by(citystate, community_sat) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2010)
tmp <- rbind(tmp1, tmp2, tmp3)
names(tmp) <- c("citystate", "Response", "n", "Freq", "year")
#Now all I have to do is make the plot
neutral <- c("3")
positive <- c("Extremely satisfied", "4")
negative <- c("2", "Not at all satisfied")
trial2 <- LikertData(positive, negative, neutral, tmp)
extremelysat08 <- tbl_df(trial2$pos) %>%
filter(year==2008) %>%
group_by(citystate) %>%
summarise(tot=sum(Freq))
extremelysat <- tbl_df(trial2$pos) %>%
group_by(citystate, year) %>%
summarise(tot=sum(Freq))
mostsat <- order(extremelysat08$tot)
masterData <- data.frame(citystate=extremelysat$citystate, year=extremelysat$year, communitysat=extremelysat$tot)
trial2$pos$citystate <- factor(trial2$pos$citystate, levels=levels(trial2$pos$citystate)[mostsat])
trial2$neg$citystate <- factor(trial2$neg$citystate, levels=levels(trial2$neg$citystate)[mostsat])
trial2$neg$Response <- factor(trial2$neg$Response)
trial2$pos$Response <- factor(trial2$pos$Response)
trial2$neg$Response <- factor(trial2$neg$Response, levels=levels(trial2$neg$Response)[c(2,3,1)])
posform <- trial2$pos
maconsat <- tbl_df(posform) %>%
filter(citystate=="Macon, GA") %>%
group_by(year) %>%
summarise(totalper=sum(Freq))
brandetonsat <- tbl_df(posform) %>%
filter(citystate=="Bradenton, FL") %>%
group_by(year) %>%
summarise(totalper=sum(Freq))
@
<<communitysatplot, fig.cap="Responses to the question, ``Taking everything into account, how satisfied are you with this community as a place to live?'' Communities are ordered by percentage of positive responses in 2008, which makes the changes in distribution in 2009 and 2010 easy to see.", out.width="0.99\\linewidth",fig.width = 9.5,fig.height=15>>=
baseplot <- ggplot(mapping = aes(x=citystate, y=Freq, fill = Response, order=Response)) +
facet_wrap(~year, nrow=3) +
geom_bar(data = trial2$neg, stat = "identity") +
scale_fill_manual(
breaks=c("Not at all satisfied", "2", "3", "4", "Extremely satisfied"),
values=colorsB,
name="Response"
) +
geom_bar(data = trial2$pos, stat = "identity") +
coord_flip() +
ggtitle("Community satisfaction") +
xlab("") +
ylab("") +
scale_y_continuous(
limits=c(-0.5, 1),
breaks=seq(from=-0.5, to=0.75, by=0.25),
labels=c("50%", "25%", "0", "25%", "50%", "75%")
) +
theme(
legend.text=element_text(size=14),
legend.title=element_text(size=16),
axis.text=element_text(size=14),
strip.text=element_text(size=14))
baseplot
@
Overall, the total percentage of positive responses in 2008 ranges between \Sexpr{round(extremelysat08[extremelysat08$citystate=="Gary, IN",]$tot*100)}\% on the low end, and \Sexpr{round(extremelysat08[extremelysat08$citystate=="State College, PA",]$tot*100)}\% at the high end. These numbers are the overall percentage of people in each community that are answering the question, ``Taking everything into account, how satisfied are you with this community as a place to live?'' positively (this includes half of the responses labeled 3).
Communities are ordered by overall positive responses in 2008, which allows for comparisons between communities and across years. The overall trend is fairly stable, but there is some variation, both positive and negative. Looking at Figure \ref{fig:communitysatplot}, we can see that people in State College, PA typically report much greater levels of community satisfaction than people in Detroit, MI or Gary, IN. After 2008, Macon, GA sees a decrease in overall community satisfaction, moving from \Sexpr{round(maconsat$totalper[1]*100)}\% in 2008 to \Sexpr{round(maconsat$totalper[2]*100)}\% in 2009 and slightly up to \Sexpr{round(maconsat$totalper[3]*100)}\% in 2010. Over the same time period, Bradenton, FL shows a slight increase in community satisfaction, from \Sexpr{round(brandetonsat$totalper[1]*100)}\% in 2008 to about \Sexpr{round(brandetonsat$totalper[2]*100)}\% in 2009 and 2010.
\section{Community Engagement through Local Behaviors}
\label{behaviorsec}
Another point of interest was the most common behaviors reported by participants. Figure \ref{fig:YNall} shows the percentage of participants engaging in a variety of behaviors over the three years of the survey. An additional set of questions was introduced in 2010, so those questions are necessarily blank in the previous years.
Behaviors are arranged by percentage of survey respondents who reported the behavior.
Over all three years, the most common behavior was being registered to vote, followed by voting in a local election. The least common responses to the behavior questions (considering all three years) were ``worked with other residents to make change in the local community'', and ``attended a local public meeting in which local issues were discussed.'' When the additional questions were added in 2010, an even-less-common behavior was added, ``gave money or food to an individual in need in your community who is not related to you.''
<<twolevel>>=
twolevel <- c("registered_vote", "volunteer", "public_meeting", "voted_local", "worked_change", "church", "festival", "donated_org", "donated_individual", "donated_shelter")
df10 <- sotc10 %>%
select(one_of(twolevel)) %>%
mutate(id = 1:n()) %>%
tidyr::gather("Question", "Response", -id) %>%
group_by(Question, Response) %>%
filter(!is.na(Response)) %>%
summarise(n = n()) %>%
mutate(Freq = n/sum(n))
df10$Year <- 2010
df09 <- sotc09 %>%
select(one_of(twolevel[c(1:5)])) %>%
mutate(id = 1:n()) %>%
tidyr::gather("Question", "Response", -id) %>%
group_by(Question, Response) %>%
filter(!is.na(Response)) %>%
summarise(n = n()) %>%
mutate(Freq = n/sum(n))
df09$Year <- 2009
df08 <- sotc08 %>%
select(one_of(twolevel[c(1:5)])) %>%
mutate(id = 1:n()) %>%
tidyr::gather("Question", "Response", -id) %>%
group_by(Question, Response) %>%
filter(!is.na(Response) & Response != "(DK)" & Response != "(Refused)") %>%
summarise(n = n()) %>%
mutate(Freq = n/sum(n))
df08$Year <- 2008
dfall <- rbind(df10, df09, df08)
dfall$Question <- factor(dfall$Question, levels=levels(dfall$Question)[order(df10$Freq[df10$Response=="Yes"])])
positive <- "Yes"
negative <- "No"
trial1 <- LikertData(positive, negative, neutral=NULL, dataset=dfall)
@
<<YNall, out.width="0.99\\linewidth",fig.width = 15,fig.height=9.5,fig.cap="Responses to yes/no questions about participants' behaviors, comparing all three survey years.">>=
baseplot <- ggplot(mapping = aes(Question, Freq, fill = Response)) +
facet_wrap(~Year) +
geom_bar(data = trial1$neg, stat = "identity") +
geom_bar(data = trial1$pos, stat = "identity") +
scale_y_continuous(
breaks=seq(from=-0.75, to=0.75, length.out = 7),
labels=c("", "-50%", "", "0","", "50%", "")
) +
coord_flip()+
xlab("") +
scale_x_discrete(
labels=rev(c("Registered to vote \n", "Voted in local election \n", "Donated money to a \n local organization", "Attended a local \n festival or event", "Gave money or food \n to an individual", "Participated in a \n church event", "Performed local \n volunteer work", "Worked with other residents \n to make change", "Attended a local \n public meeting", "Provided free shelter \n to an individual"))
) +
ylab("") +
theme(
legend.text=element_text(size=14),
legend.title=element_text(size=16),
axis.text.y=element_text(size=20),
title=element_text(size=24),
axis.text.x=element_text(size=14),
strip.text=element_text(size=18))
baseplot
@
<<YN2010>>=
dfcities10 <- sotc10 %>%
select(citystate, one_of(twolevel)) %>%
mutate(id = 1:n()) %>%
group_by(citystate) %>%
tidyr::gather("Question", "Response", -id, -citystate) %>%
group_by(citystate, Question, Response) %>%
filter(!is.na(Response)) %>%
summarise(n = n()) %>%
mutate(Freq = n/sum(n)) %>%
filter(Response=="Yes") %>%
merge(select(filter(df10, Response=="Yes"), Freq), by="Question") %>%
mutate(diff=Freq.x-Freq.y) %>%
select(-Freq.y, -n, Freq=Freq.x) %>%
arrange(citystate) %>%
mutate(year="2010")
dfcitiesYes <- dfcities10 %>%
filter(Response=="Yes") %>%
mutate(col = diff>0)
dfcitiesYes$Question <- factor(dfcitiesYes$Question, levels=levels(dfcitiesYes$Question)[order(dfall[dfall$Response=="Yes",]$Freq)])
@
<<startingoverall>>=
dfcities09 <- sotc09 %>%
select(citystate, one_of(twolevel[1:5])) %>%
mutate(id = 1:n()) %>%
group_by(citystate) %>%
tidyr::gather("Question", "Response", -id, -citystate) %>%
group_by(citystate, Question, Response) %>%
filter(!is.na(Response)) %>%
summarise(n = n()) %>%
mutate(Freq = n/sum(n)) %>%
filter(Response=="Yes") %>%
merge(select(filter(df10, Response=="Yes"), Freq), by="Question") %>%
mutate(diff=Freq.y-Freq.x) %>%
select(-Freq.y, -n, Freq=Freq.x) %>%
arrange(citystate) %>%
mutate(year="2009")
dfcities08 <- sotc08 %>%
select(citystate, one_of(twolevel[1:5])) %>%
mutate(id = 1:n()) %>%
group_by(citystate) %>%
tidyr::gather("Question", "Response", -id, -citystate) %>%
group_by(citystate, Question, Response) %>%
filter(!is.na(Response)) %>%
summarise(n = n()) %>%
mutate(Freq = n/sum(n)) %>%
filter(Response=="Yes") %>%
merge(select(filter(df10, Response=="Yes"), Freq), by="Question") %>%
mutate(diff=Freq.y-Freq.x) %>%
select(-Freq.y, -n, Freq=Freq.x) %>%
arrange(citystate) %>%
mutate(year="2008")
md1 <-rbind(dfcities10, dfcities09, dfcities08) %>%
select(-Response, -diff) %>%
filter(Question %in% c("registered_vote", "voted_local", "volunteer", "worked_change", "public_meeting"))
md1 <- melt(md1)
md1 <- dcast(md1, citystate + year ~ Question, mean)
md1$year <- as.numeric(md1$year)
masterData <- left_join(masterData, md1)
@
<<YN2010plot, out.width="0.99\\linewidth",fig.width = 10, fig.height=18, fig.cap="Percentage-point difference from overall survey rates (2010 data). The letters A-J represent the activities listed in Figure \\ref{fig:YNall}. A: Registered to vote, B: Voted in a local election, C: Donated money to a local organization, D: Attended a local event, E: Gave money or food to an individual, F: Participated in a church event, G: Performed local volunteer work, H: Worked with other residents to make change, I: Attended a local public meeting, J: Provided free shelter to an individual.">>=
baseplot <- ggplot(dfcitiesYes) +
aes(Question, diff, fill = col) +
facet_wrap(~citystate, ncol=3) +
geom_bar(stat = "identity") +
coord_flip() +
guides(fill=FALSE) +
scale_y_continuous(
breaks=seq(from=-.15, to=.15, length.out = 5),
labels=c("", "-7.5", "0","+7.5", "")
) +
scale_x_discrete(labels=LETTERS[10:1]) +
ylab("") +
xlab("") +
theme(
axis.text.y=element_text(size=12),
title=element_text(size=18),
axis.text.x=element_text(size=11),
strip.text=element_text(size=16))
baseplot
@
A follow-up to this question is whether all communities performed these actions at similar rates, or if there were local variations in behavior. Figure \ref{fig:YN2010plot} shows the percentage-point difference from the overall rate across all 26 communities and 10 behaviors in 2010. For example, in 2010, approximately \Sexpr{round(trial1[["pos"]][trial1[["pos"]]$Year=="2010" & trial1[["pos"]]$Question=="registered_vote",]$Freq, digits=2)*100}\% of people across all communities were registered to vote. In comparison, in Aberdeen, SD, \Sexpr{round(dfcitiesYes[dfcitiesYes$citystate=="Aberdeen, SD" & dfcitiesYes$Question=="registered_vote",]$Freq, digits=2)*100}\% of people reported being registered to vote, so Figure \ref{fig:YN2010plot} shows a percentage-point difference of +\Sexpr{round(dfcitiesYes[dfcitiesYes$citystate=="Aberdeen, SD" & dfcitiesYes$Question=="registered_vote",]$Freq-trial1[["pos"]][trial1[["pos"]]$Year=="2010" & trial1[["pos"]]$Question=="registered_vote",]$Freq, digits=2)*100}. Of course, all the communities' rates are included in the calculation of the average, so it makes sense that some communities are above the overall rate and some are below. More study is required to determine if the visual differences between communities represent true differences or just random variation, but there are certainly communities that stand out from the rest.
For example, people in Boulder, CO, Long Beach, CA, and San Jose, CA participated in church events at a much lower rate than the overall rate, while people in Georgia (both Macon and Milledgeville) were more likely to participate in church events.
It is also interesting to note the communities that performed above or below national rates on most questions. For example, in Biloxi, MS, respondents engage in almost all of the behaviors asked about at higher-than-average rates, with the exception of being registered to vote and performing local volunteer work. On the other hand, Palm Beach, FL shows lower-than-average rates for almost all behaviors, except for giving money or food to an individual. Long Beach, CA shows a similar pattern. However, since this array of plots represents 26 facets of the same plot, it is highly possible that these visual trends occurred by chance. Humans have a tendency to see patterns in noise, even when no true pattern exists. This tendency is known as apophenia or pareidolia, and can be combated by the use of visual inference techniques~\citep{WicCoo2010}. However, that was outside the scope of this Data Expo entry.
\section{Meta-Knowledge}
\label{metasec}
The primary aim of this article is to address whether communities hold meta-knowledge about their city being a good place for subgroups or minorities.
The survey asks a number of questions related to rating the community as a place for subgroups, including: ``young, talented college graduates,'' ``immigrants from other countries,'' ``racial and ethnic minorities,'' ``families with young children,'' ``gay and lesbian people,'' ``senior citizens,'' and ``young adults without children.'' Not all these subgroups were asked to identify themselves in the demographic questions (particularly ``gay and lesbian people'') so it was not possible to address them all. Instead, we focus on racial and ethnic minorities, senior citizens, and families with young children.
Community meta-knowledge is essentially majority group empathy or understanding of how minorities experience their community. For example, the survey asks participants to rate their community as a place for families with young children on a 5-point Likert scale. A city where participants with children rated their community in the same way as participants without children is defined as a community with high meta-knowledge about conditions for families with young children.
Ideally, we could use the information gathered about community meta-knowledge on subgroup experiences to determine more about the community itself. It's possible that meta-knowledge would correlate with other measures we are interested in (for example, overall community satisfaction).
We define meta-knowledge for a particular dimension (e.g., seniors) as the reciprocal of the difference between the total percentage of positive responses to the question, ``How is your community as a place for [particular subgroup]?'' given by people in the subgroup and by people outside it. This is expressed in equation (\ref{MK}).
\begin{equation}
\label{MK}
MK = \frac{1}{\sum R^{+}_{\text{in subgroup}} - \sum R^{+}_{\text{out of subgroup}}}
\end{equation}
This means that communities where people inside and outside the subgroup give similar scores will receive high absolute $MK$ values, while those with very different scores will receive low absolute $MK$ values. A positive $MK$ score means that people outside the subgroup are under-rating the community relative to those inside it, while a negative $MK$ score means they are over-rating it.
We can calculate $MK$ for each subgroup we are interested in, denoting the group with a subscript. Here, we will consider the questions ``How is your community as a place for racial and ethnic minorities?'' ($MK_R$), ``How is your community as a place for seniors?'' ($MK_S$), and ``How is your community as a place for families with young children?'' ($MK_C$).
Using this definition, communities where Whites highly over-rate their community as a place for minorities will have a negative $MK_R$ score, while communities where Whites under-rate their community will have a positive $MK_R$ score. Communities where the ratings by both groups are roughly equal will have a large $MK_R$ score, whether positive or negative. Generally, we are more interested in the magnitude of the score than the sign, but it is sometimes interesting to consider which group is under- or over-rating.
For a visual explanation of this rating system, see Figure \ref{fig:anotherlookplot}. Points close to the $y=x$ line have large absolute $MK$ scores; points below the line have positive scores and points above the line have negative scores, with the magnitude shrinking as points fall farther from the line.
This definition puts emphasis on communities where out-of-subgroup participants rated their city the same way that in-subgroup participants did, based on the assumption that empathy is important to communities \citep{SteFin1999}.
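To make the arithmetic concrete, the sketch below computes $MK$ for two hypothetical communities; the helper function and the positive-response shares in it are invented for illustration and are not part of the survey analysis.
<<MKexample, echo=TRUE, eval=FALSE>>=
# Illustration only: MK = 1 / (in-subgroup positive share minus
# out-of-subgroup positive share). The shares below are made up.
mk <- function(pos_in, pos_out) 1 / (pos_in - pos_out)
mk(pos_in = 0.70, pos_out = 0.69)  # ratings nearly agree: MK = 100 (high meta-knowledge)
mk(pos_in = 0.55, pos_out = 0.75)  # out-group over-rates: MK = -5 (low meta-knowledge)
@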
\subsection{Meta-knowledge about the community as a place for racial and ethnic minorities}
\label{minorities}
To begin, we investigate meta-knowledge about racial and ethnic minorities. This proved somewhat difficult because each year of data collection used a slightly different set of possible answer choices to the question ``Which of these groups best describes your racial background?'' and because there were so many participants who refused to answer (particularly in 2010).
<<minorities>>=
races10 <- sotc10 %>%
filter(race != " " & race != "4" & race != "Don't Know") %>%
group_by(race) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2010)
races10$race <- factor(races10$race)
levels(races10$race) <- c(levels(races10$race)[1:5], "Some other race", levels(races10$race)[7])
races09 <- sotc09 %>%
filter(race != " " & race != "4" & race != "Don't know") %>%
group_by(race) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2009)
races09$race <- factor(races09$race)
races08 <- sotc08 %>%
filter(race != " " & race != "4" & race != "(DK)" & race != "None") %>%
group_by(race) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2008)
races08$race <- factor(races08$race)
levels(races08$race) <- c("Hispanic", "More than one", "Refused", "American Indian or Alaskan", "Asian", "Black or African-American", "Native Hawaiian or other Pacific Islander", "Some other race", "White")
races <- rbind(races08, races09, races10)
@
<<overallRaceplot, out.width="0.99\\linewidth",fig.width = 15,fig.height=10, fig.cap="Responses to the question, ``Which of these groups best describes your racial background?'' Notice that 2010 shows an overrepresentation of Refused responses, compared to the other years.">>=
raceOver <- ggplot(races, mapping = aes(race, n)) +
geom_bar(stat="identity") +
facet_wrap(~year,nrow=3) +
ylab("Number of respondents") +
xlab("") +
theme(
axis.text.y=element_text(size=20),
title=element_text(size=24),
axis.text.x=element_text(size=18),
strip.text=element_text(size=25)) +
coord_flip()
raceOver
@
The overall distribution of responses to the demographic race question is shown in Figure \ref{fig:overallRaceplot}. This figure is shown with absolute numbers of participants, rather than fractions, to underscore how different the 2010 data is. Notice that in 2010, the largest category was ``Refused,'' and without that category the rest of the responses are not on the same scale as previous years. As the plot shows, the sample sizes for individual minority race responses were somewhat small each year, so for the investigation on subgroup meta-knowledge, we combined all the minority responses into one group that we refer to as Non-white. This category contains all participants who reported a race that was not White, but does not include participants who declined to give a response to the question. For a condensed summary of what the White/Non-white criteria means for the overall percentages, see Figure \ref{fig:samplesize}.
<<minoritiesOverYearsE>>=
minoritiesYN08 <- sotc08 %>%
filter(race != " " & race != "4" & race != "(DK)" & race != "None" & race != "(Refused)") %>%
filter(minorities != "(DK)" & minorities != "(Refused)") %>%
group_by(citystate, race != "White", minorities) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2008)
names(minoritiesYN08) <- n1
minoritiesYN09 <- sotc09 %>%
filter(race != " " & race != "4" & race != "Don't know" & race != "Refused") %>%
filter(!is.na(minorities)) %>% group_by(citystate, race != "White", minorities) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2009)
names(minoritiesYN09) <- n1
minoritiesYN10 <- sotc10 %>%
filter(race != " " & race != "4" & race != "Don't Know" & race != "Refused") %>%
filter(!is.na(minorities)) %>%
group_by(citystate, race != "White", minorities) %>% summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2010)
names(minoritiesYN10) <- n1
whiteYN <- rbind(minoritiesYN08, minoritiesYN09, minoritiesYN10) %>%
group_by(year, InGroup, Response) %>%
summarise(tot = sum(n)) %>%
mutate(newFreq=prop.table(tot))
whiteYN <- rename(whiteYN, Freq=newFreq, n=tot)
neutral <- c("3")
positive <- c("Very good", "4")
negative <- c("2", "Very bad")
whiteYears <- LikertData(positive, negative, neutral, whiteYN)
whiteYears$neg$Response <- factor(whiteYears$neg$Response, levels=levels(whiteYears$neg$Response)[c(2,3,1)])
@
<<samplesize, out.width="0.99\\linewidth",fig.width = 8.5,fig.height=4.5, fig.cap="Sample sizes for plots about meta-knowledge regarding the community as a place for minorities.">>=
whiteYN1 <- tbl_df(whiteYN) %>%
group_by(year, InGroup) %>%
summarise(howmany=sum(n)) %>%
mutate(frac=prop.table(howmany))
whiteplot <- ggplot(aes(x=year, y=howmany), data=whiteYN1) +
geom_bar(aes(fill=InGroup), position="dodge", stat="identity") +
xlab("Year") +
ylab("Sample size") +
guides(fill=guide_legend(title=NULL)) +
scale_fill_manual(values=colors2, labels=c("White", "Non-white"))
whiteplot
@
<<overallRaceResponsePlot, out.width="0.99\\linewidth",fig.width = 8.5, fig.height=4.5, fig.cap="Responses to the question, ``How is your community as a place for racial and ethnic minorities?'' White denotes survey respondents who listed their race as White, and Non-white is all other race responses (not including survey participants who refused to report a race). Notice the difference between the 2008/2009 responses and the 2010 responses, but also refer to Figure \\ref{fig:samplesize} for the absolute sample sizes for each year-- 2010 has a much smaller sample of responses to the question overall.">>=
baseplot <- ggplot(mapping = aes(InGroup, Freq, fill = Response, order=Response)) +
facet_wrap(~year, nrow=3) +
geom_bar(data = whiteYears$neg, stat = "identity") +
geom_bar(data = whiteYears$pos, stat = "identity") +
scale_fill_manual(
breaks=c("Very bad", "2", "3", "4", "Very good"),
values=colorsA,
name="Response"
) +
coord_flip() +
# TRUE means minority, and in 2008 minorities were under-rating
scale_y_continuous(
breaks=seq(from=-0.25, to=0.75, by=0.25),
labels=percent
) +
scale_x_discrete(labels=c("White", "Non-white")) +
theme(
legend.text=element_text(size=14),
legend.title=element_text(size=16),
axis.text.y=element_text(size=16),
title=element_text(size=16),
axis.text.x=element_text(size=14),
strip.text=element_text(size=14)
) +
xlab("") +
ylab("")
baseplot
@
<<facetcalculations>>=
minGroups10 <- LikertData(positive, negative, neutral, minoritiesYN10)
minGroups10$neg$Response <- factor(minGroups10$neg$Response, levels=levels(minGroups10$neg$Response)[c(2,3,1)])
minGroups08 <- LikertData(positive, negative, neutral, minoritiesYN08)
minGroups08$neg$Response <- factor(minGroups08$neg$Response, levels=levels(minGroups08$neg$Response)[c(2,3,1)])
minGroups09 <- LikertData(positive, negative, neutral, minoritiesYN09)
minGroups09$neg$Response <- factor(minGroups09$neg$Response, levels=levels(minGroups09$neg$Response)[c(2,3,1)])
totalMin09 <- minoritiesYN09 %>%
group_by(citystate, InGroup) %>%
summarize(totalN = sum(n))
@
<<allMinorities,out.width="0.99\\linewidth",fig.width = 10.5, fig.height=12.5, fig.cap="Responses to the question, ``How is your community as a place for racial and ethnic minorities?'' faceted by community (2009 data). Numbers in parentheses indicate sample sizes." >>=
baseplot <- ggplot() +
facet_wrap(~citystate, ncol=3) +
geom_bar(mapping = aes(InGroup, Freq, fill = Response, order=Response), data = minGroups09$neg, stat = "identity") +
geom_bar(mapping = aes(InGroup, Freq, fill = Response, order=Response), data = minGroups09$pos, stat = "identity") +
scale_fill_manual(
values=colorsA,
breaks=c("Very bad", "2", "3", "4", "Very good"),
name="Response"
) +
coord_flip() +
#TRUE means minorities, in Aberdeen they are under-rating
scale_y_continuous(
breaks=seq(from=-0.5, to=0.75, by=0.25),
labels=percent
) +
scale_x_discrete(labels=c("White", "Non-white")) +
xlab("") +
ylab("") +
theme(
legend.text=element_text(size=14),
legend.title=element_text(size=16),
axis.text.y=element_text(size=16),
title=element_text(size=16),
axis.text.x=element_text(size=11),
strip.text=element_text(size=14)) +
geom_text(data=totalMin09, aes(label=paste0("(",totalN, ")"), x=InGroup, y=1.1), size=3, hjust=1) +
expand_limits(y=1)
baseplot
@
We want to study the overall difference between in-group and out-group responses to the question, ``how is your community as a place for racial and ethnic minorities?'' To do this, the data is split into two groups, one called White and one called Non-white. For comparison of the distribution of responses between groups, see Figure \ref{fig:overallRaceResponsePlot}. Interestingly, Whites were rating their communities as better places for minorities than Non-whites in 2008 and 2009, but in 2010 Whites began under-rating their communities.
As in Section \ref{behaviorsec}, we want to see the individual variation between communities. We chose to look at the 2009 data to study community-level variation. The community-level responses are shown in Figure \ref{fig:allMinorities}. There is a lot of variation between the communities for this particular type of meta-knowledge. Some communities saw the White respondents over-scoring their community as a place for minorities, while some were under-scoring.
For the most part, however, Whites over-rated their communities as a place for racial and ethnic minorities, compared to Non-whites. Grand Forks, ND, was a particularly bad offender-- Whites over-rated it as a place for minorities (negative $MK$ score), and minorities themselves rated it as one of the worst communities in 2009.
There was a lot of additional variation in response distribution. For example, San Jose, CA is highly rated as a place for minorities by people both inside and outside the subgroup, and it was slightly under-rated by Whites, which is probably a good sign of meta-knowledge and empathy. Going in the opposite direction are Grand Forks, ND, Myrtle Beach, SC, and Macon, GA.
For another view of the relationship between ratings (as seen in Figure \ref{fig:allMinorities}) see Figure \ref{fig:anotherlookplot}, which shows the relationship between total positive responses by Whites ($\sum R^+_{\text{out of subgroup}}$) versus total positive responses by Non-whites ($\sum R^+_{\text{in subgroup}}$) to the question, ``How is your community as a place for racial and ethnic minorities?'' Figure \ref{fig:anotherlookplot} makes it clear that while there are some communities that are under-rated by Whites (positive $MK$ score), the majority of communities are over-rated by Whites as a place for minorities (negative $MK$ score). Very few communities are close to the 1-1 (or y=x) line (large $|MK|$ score).
<<anotherlook>>=
newDF09 <- FeelingsC(minGroups09$pos)
newDF09$year <- "2009"
names(newDF09) <- c("nonwhite", "white", "citystate", "racediffs", "year")
newDF08 <- FeelingsC(minGroups08$pos)
newDF08$year <- "2008"
names(newDF08) <- c("nonwhite", "white", "citystate", "racediffs", "year")
newDF <- rbind(newDF09, newDF08)
# Too much missing data to use 2010
newDF$year <- as.numeric(newDF$year)
masterData <- left_join(masterData, newDF)
horiz <- mean(newDF09$white)
vert <- mean(newDF09$nonwhite)
@
However, even if a community is on the 1-1 line in Figure \ref{fig:anotherlookplot}, it may fall below the mean rating for communities overall. Gary, IN is a good example. It has almost the same overall positive rating from Whites and Non-whites (around \Sexpr{round(newDF09[newDF09$citystate=="Gary, IN",]$nonwhite*100)}\% positive ratings), so it has an $MK$ score of \Sexpr{round(1/newDF09[newDF09$citystate=="Gary, IN",]$racediffs)}, but it falls below the mean ratings of communities by both Whites and Non-whites. So, while there is high meta-knowledge in Gary, it is agreement that Gary is a worse place than average for minorities. Of course, looking at Figure \ref{fig:allMinorities}, we can see that using the overall positive rating ($\sum R^+$) to compute $MK$ obscures some aspects of the overall distribution. Although the positive end of the scale matches up quite nicely, the distribution of negative responses by Non-whites is much heavier-tailed. As with all simple measures, focusing just on the total positive percentage loses information.
<<anotherlookplot, out.width="0.76\\linewidth", fig.cap=paste0("Relationship between positive responses to the question, ``How is your community as a place for minorities?'' comparing ratings of Whites and Non-whites. Each community is represented, and the plot uses 2009 data. The darker grey region corresponds to positive meta-knowledge scores. Some communities, like State College, PA, are under-rated by Whites, some are over-rated, like Grand Forks, ND, and some are rated the same by both groups, like Gary, IN. The black line shows y=x, for comparison. Grey lines at x=", round(vert, digits=2), " and y=", round(horiz, digits=2), " show the mean ratings by each group.") >>=
df_poly <- data.frame(
x=c(-Inf, Inf, Inf),
y=c(-Inf, Inf, -Inf)
)
al <- ggplot(newDF09) +
geom_polygon(data=df_poly, aes(x, y), fill="grey", alpha=0.6) +
geom_vline(aes(xintercept=vert), color="grey") +
geom_hline(aes(yintercept=horiz), color="grey") +
geom_abline(intercept=0, slope=1) +
geom_point(aes(y=white, x=nonwhite)) +
geom_text(
data = newDF09[newDF09$citystate %in% c("Duluth, MN", "Gary, IN", "Grand Forks, ND", "State College, PA"),],
aes(y=white, x=nonwhite, label = paste(citystate, "\nMK =", round(1/racediffs))),
vjust=c(-0.4, 1.1, 1.3, -0.4),
hjust=0,
size=4
) +
xlab("Non-whites rating") +
ylab("Whites rating") +
xlim(0.4, 1) +
ylim(0.4, 1)
al
@
There is clearly some relationship between the ratings by Whites and Non-whites, although it is not a perfect $y=x$ relationship. The correlation between the two sets of ratings is $r = \Sexpr{round(cor(newDF09$white, newDF09$nonwhite), digits=2)}$.
<<diffscor>>=
# Correlation between community satisfaction and |MK_R|, using 2008 and 2009 data only
diffscor <- cor(masterData[masterData$year != 2010,]$communitysat,
                abs(1/masterData[masterData$year != 2010,]$racediffs))
@
To explore the effect of $MK_R$ on community satisfaction, we plotted the raw $MK_R$ score against community satisfaction (from Section \ref{communittsatsec}) and did not find a clear trend. Instead, it appeared that extreme $MK_R$ scores, whether positive or negative, were associated with higher community satisfaction. So, we calculated the correlation between the absolute value of the meta-knowledge score, $|MK_R|$, and community satisfaction, which is \Sexpr{round(diffscor, digits=2)}. This includes data from both 2008 and 2009 (2010 data was still excluded because of sample size issues).
The correlation suggests a weak negative relationship between the two variables, which is not what we would expect: communities with the lowest $|MK_R|$ scores had the highest community satisfaction scores on average, while communities with high $|MK_R|$ scores had lower community satisfaction. This seems counter-intuitive, since we would expect communities with more self-awareness to be more satisfied, but that is not what we see. Clearly, there is more to community satisfaction than $|MK_R|$.
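One way to see why the single correlation is not very informative is to look at the scatter directly. The chunk below (shown but not evaluated) sketches a diagnostic plot of community satisfaction against $|MK_R|$ for the same 2008--2009 rows that enter the correlation; it is a quick check rather than a figure we rely on.
<<mkSatSketch, echo=TRUE, eval=FALSE>>=
# Illustrative sketch only (not evaluated): community satisfaction plotted
# against |MK_R|, using the same 2008 and 2009 rows that enter diffscor.
mk_sat <- masterData %>%
  filter(year != 2010) %>%
  mutate(absMK = abs(1 / racediffs))
ggplot(mk_sat, aes(x = absMK, y = communitysat)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # linear trend for reference
  xlab("|MK_R|") +
  ylab("Community satisfaction")
@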
\subsection{Meta-knowledge about the community as a place for seniors}
\label{mseniors}
<<OverallSeniors>>=
# Define the senior subgroup by splitting respondents at age 62 and older
seniorGroup10 <- sotc10 %>%
filter(!is.na(seniors)) %>%
group_by(citystate, age >= 62, seniors) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2010)
names(seniorGroup10) <- n1
seniorGroup09 <- sotc09 %>%
filter(!is.na(age)) %>%
group_by(citystate, age >= 62, seniors) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2009)
names(seniorGroup09) <- n1
seniorGroup08 <- sotc08 %>%
group_by(citystate, age >=62, seniors) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2008)
names(seniorGroup08) <- n1
seniorYears <- rbind(seniorGroup08, seniorGroup09, seniorGroup10)
seniorYears$Response <- factor(seniorYears$Response, levels=levels(seniorYears$Response)[c(6,3:5, 7,1:2)])
@
<<seniorSS, out.width= "0.99\\linewidth", fig.width = 8.5, fig.height= 4.5, fig.cap = "Sample sizes for plots about meta-knowledge regarding the community as a place for seniors.">>=
seniorSS <- tbl_df(seniorYears) %>%
group_by(year, InGroup) %>%
summarise(number=sum(n)) %>%
mutate(percentage=(100*number)/sum(number))
seniorSS$InGroup[seniorSS$InGroup=="FALSE"] <- "Non-seniors"
seniorSS$InGroup[seniorSS$InGroup=="TRUE"] <- "Seniors"
seniorplot <- ggplot(aes(x=year, y=number), data=seniorSS) +
geom_bar(aes(fill=InGroup), position="dodge", stat="identity") +
xlab("Year") +
ylab("Sample size") +
guides(fill=guide_legend(title=NULL)) +
scale_fill_manual(values=colors2)
seniorplot
@
<<OverallSeniorRatings>>=
seniorOverall <- LikertData(positive, negative, neutral, seniorYears)
seniorGroup09 <- LikertData(positive, negative, neutral, seniorGroup09)
seniorGroup09$neg$Response <- factor(seniorGroup09$neg$Response, levels=levels(seniorGroup09$neg$Response)[c(2,3,1)])
totPos <- FeelingsC(seniorGroup09$pos)
names(totPos) <- c("Seniors", "Nonseniors","citystate", "seniordiffs")
horiz <- mean(totPos$Nonseniors)
vert <- mean(totPos$Seniors)
@
<<seniorOverallData>>=
forM <- filter(seniorOverall$pos, year==2008)
forM <- tbl_df(forM)
comp08 <- FeelingsC(forM) %>%
mutate(year="2008")
forM <- filter(seniorOverall$pos, year==2009)
forM <- tbl_df(forM)
comp09 <- FeelingsC(forM) %>%
mutate(year="2009")
forM <- filter(seniorOverall$pos, year==2010)
forM <- tbl_df(forM)
comp10 <- FeelingsC(forM) %>%
mutate(year="2010")
comp <- rbind(comp08, comp09, comp10)
names(comp) <- c("seniors", "nonseniors", "citystate", "seniordiffs", "year")
comp$year <- as.numeric(comp$year)
masterData <- left_join(masterData, comp)
@
To continue our exploration of meta-knowledge, we consider meta-knowledge about communities as a place for seniors, or $MK_S$. To determine whether non-seniors understood how good their community was for seniors, the data were split into two groups: participants aged 62 and older, and participants under 62. For the sample sizes of the two groups, see Figure \ref{fig:seniorSS}.
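As a quick check that the age-62 split behaves sensibly community by community (and not only in aggregate), the chunk below (shown but not evaluated) sketches how the share of senior respondents in each community and year can be tabulated from the grouped counts in \texttt{seniorYears}.
<<seniorShareSketch, echo=TRUE, eval=FALSE>>=
# Illustrative sketch only (not evaluated): share of respondents aged 62+
# in each community and survey year, from the grouped counts in seniorYears.
seniorYears %>%
  ungroup() %>%
  group_by(citystate, year, InGroup) %>%
  summarise(n = sum(n)) %>%
  mutate(share = n / sum(n)) %>%    # share within each community-year
  filter(InGroup == TRUE) %>%       # keep the senior rows
  arrange(desc(share))
@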
The yearly rating distributions are shown in Figure \ref{fig:seniorOverallPlot}. It appears that non-seniors typically underestimate how good a place their community is for seniors (positive $MK$ score): people 62 and older rated their community as a better place for seniors than did people under 62, in all three survey years.
<<seniorOverallPlot,out.width="0.99\\linewidth",fig.width = 8.5, fig.height=5.5, fig.cap="Responses to the question, \`\`How is your community as a place for seniors?\" Seniors are defined as survey participants aged 62 and older, non-seniors are those under 62. Notice that seniors consistently rated their community more highly than did non-seniors over all three survey years.">>=
seniorYears2 <- seniorYears %>%
group_by(year, InGroup, Response) %>%
summarise(n3 = sum(n)) %>%
mutate(newFreq = prop.table(n3))
names(seniorYears2) <- c("year", "InGroup", "Response", "n", "Freq")
seniorYears2 <- LikertData(positive, negative, neutral, seniorYears2)
baseplot <- ggplot(mapping = aes(InGroup, Freq, fill = Response, order=Response)) +
facet_wrap(~year, nrow=3) +
geom_bar(data = seniorYears2$neg, stat = "identity") +
geom_bar(data = seniorYears2$pos, stat = "identity") +
scale_fill_manual(breaks=c("Very bad", "2", "3", "4", "Very good"), values=colorsA, name="Response") +
coord_flip() +
scale_y_continuous(breaks=seq(from=-0.25, to=0.75, by=0.25), labels=percent) +
scale_x_discrete(labels=c("Non-seniors", "Seniors")) +
xlab(" ") +
ylab("") +
theme(
legend.text=element_text(size=14),
legend.title=element_text(size=16),
axis.text.y=element_text(size=12),
title=element_text(size=16),
axis.text.x=element_text(size=12),
strip.text=element_text(size=14))
baseplot
@
<<seniorPlot, out.width="0.99\\linewidth",fig.width = 10.5, fig.height=12.5, fig.cap="Responses to the question, \`\`How is your community as a place for seniors?\" faceted by community (2009 data). Numbers in parentheses are sample sizes. Every community follows the pattern of over-rating by seniors, but some communities have a smaller discrepancy between ratings of seniors and non-seniors.">>=
totalSeniors09 <- seniorYears %>%
ungroup() %>%
filter(year==2009) %>%
group_by(citystate, InGroup) %>%
summarize(totalN = sum(n))
baseplot <- ggplot() +
facet_wrap(~citystate, ncol=3) +
geom_bar(mapping = aes(InGroup, Freq, fill = Response, order = Response), data = seniorGroup09$neg, stat = "identity") +
geom_bar(mapping = aes(InGroup, Freq, fill = Response, order = Response), data = seniorGroup09$pos, stat = "identity") +
scale_fill_manual(
breaks=c("Very bad", "2", "3", "4", "Very good"),
values=colorsA,
name="Response"
) +
coord_flip() +
scale_y_continuous(
breaks=seq(from=-0.25, to=0.75, by=0.25),
labels=percent
) +
scale_x_discrete(labels=c("Non-seniors", "Seniors")) +
xlab("") +
ylab("") +
theme(
legend.text=element_text(size=14),
legend.title=element_text(size=16),
axis.text.y=element_text(size=16),
title=element_text(size=16),
axis.text.x=element_text(size=11),
strip.text=element_text(size=14)) +
geom_text(data=totalSeniors09, aes(label=paste0("(",totalN, ")"), x=InGroup, y=1.2), size=3, hjust=1) +
expand_limits(y=1)
baseplot
@
<<anotherlookseniors, out.width="0.76\\linewidth", fig.cap=paste0("Relationship between positive responses to the question, \`\`How is your community as a place for seniors?\" comparing ratings of non-seniors and seniors. Each community is represented, and the plot uses 2009 data. The darker grey region corresponds to positive meta-knowledge scores. The black line shows y=x, for comparison. Grey lines at x=", round(mean(totPos$Seniors), digits=2), " and y=", round(mean(totPos$Nonseniors), digits=2), " show the mean ratings by each group.") >>=
sr <- ggplot(totPos) +
geom_polygon(data=df_poly, aes(x, y), fill="grey", alpha=0.6) +
geom_vline(aes(xintercept=vert), color="grey") +
geom_hline(aes(yintercept=horiz), color="grey") +
geom_abline(intercept=0, slope=1) +
geom_point(aes(y=Nonseniors, x=Seniors)) +
geom_text(
data = totPos[totPos$citystate %in% c("Wichita, KS", "Gary, IN", "Bradenton, FL"),],
aes(y=Nonseniors, x=Seniors, label = paste(citystate, "\nMK =", round(1/seniordiffs))
),
vjust=c(-0.4,-0.4,1),
hjust=c(1,0,0),
size=4
) +
xlab("Senior rating") +
ylab("Non-senior rating") +
ylim(0.5, 1) +
xlim(0.5, 1)
sr
@
As in Section \ref{minorities}, we wanted to investigate local variation in responses, so we created a faceted plot of the response distributions for the two groups in each community. This plot can be seen in Figure \ref{fig:seniorPlot}, and it uses 2009 data. While the plot shows the overall distribution of responses across communities, it makes it difficult to compare the two groups (seniors and non-seniors) overall.
To see the relationship between total positive responses by non-seniors and total positive responses by seniors to the question ``How is your community as a place for seniors?'', see Figure \ref{fig:anotherlookseniors}. Interestingly, every community followed the pattern of non-seniors underestimating how good their community was for seniors (or, equivalently, of seniors boosting their own ratings); in other words, unlike the $MK_R$ scores, all of the $MK_S$ scores were positive. The only community that came close to being an exception was Wichita, KS. There is a strong correlation between the ratings from inside and outside the group ($r=\Sexpr{round(cor(totPos$Seniors, totPos$Nonseniors), digits=2)}$), which suggests good meta-knowledge about communities as places for seniors, even though seniors consistently rate their communities more highly than non-seniors do.
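The claim that every $MK_S$ score has the same sign can be verified directly from the totals behind Figure \ref{fig:anotherlookseniors}. The chunk below (shown but not evaluated) sketches that check using \texttt{totPos}, counting communities on each side of the $y=x$ line and pulling out the community with the smallest gap, which the discussion above identifies as Wichita, KS.
<<seniorSignSketch, echo=TRUE, eval=FALSE>>=
# Illustrative sketch only (not evaluated): confirming that seniors out-rate
# non-seniors everywhere, and finding the community with the smallest gap.
totPos %>%
  mutate(gap = Seniors - Nonseniors) %>%
  summarise(n_seniors_higher = sum(gap > 0),
            n_nonseniors_higher = sum(gap < 0))
totPos %>%
  mutate(gap = Seniors - Nonseniors) %>%
  arrange(abs(gap)) %>%
  head(1)   # expected to be Wichita, KS, per the discussion above
@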
The question then becomes whether the $MK_S$ score is related to community satisfaction. The correlation between $|MK_S|$ and community satisfaction is \Sexpr{round(cor(masterData$communitysat, abs(1/masterData$seniordiffs)), digits=2)}. This is the relationship we would expect between community satisfaction and meta-knowledge: communities with higher meta-knowledge also showed higher community satisfaction.
\subsection{Meta-knowledge about the community as a place for families with young children}
The last subgroup to study in this exploration is families with young children. For this section, we split participants between those who reported having dependent children under the age of 18 living in their household and those who did not. Intuitively, it makes sense that a parent of an older child (say, a teenager) would have higher meta-knowledge about the community as a place for families with young children than a participant who never had children or whose children have grown up and moved away. While the data included more granular demographic details about the ages of the children in the households, splitting the data into participants with children and those without made the groups closer in size. To see the sample sizes and percentages, see Figure \ref{fig:kidsSS}.
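As a quick illustration of the group-size argument, the chunk below (shown but not evaluated) sketches how the relative sizes of the with-children and without-children groups can be tabulated for the 2009 survey, restricting to respondents with a definite Yes/No answer, as in the chunk that follows.
<<kidGroupSizeSketch, echo=TRUE, eval=FALSE>>=
# Illustrative sketch only (not evaluated): relative size of the two groups
# in the 2009 survey, restricted to respondents with a definite Yes/No answer.
sotc09 %>%
  filter(children %in% c("Yes", "No")) %>%
  count(children) %>%
  mutate(percentage = 100 * n / sum(n))
@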
<<kidsbyYear>>=
# Split respondents by whether they report dependent children under 18 in the household
kidGroup10 <- sotc10 %>%
filter(children == "Yes" | children == "No", !is.na(families_kids)) %>%
group_by(citystate, children, families_kids) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2010)
names(kidGroup10) <- n1
kidGroup09 <- sotc09 %>%
filter(children == "Yes" | children == "No", !is.na(families_kids)) %>%
group_by(citystate, children, families_kids) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2009)
names(kidGroup09) <- n1
kidGroups <- LikertData(positive, negative, neutral, kidGroup09)
kidGroups$neg$Response <- factor(kidGroups$neg$Response, levels=levels(kidGroups$neg$Response)[c(2,3,1)])
kidGroup08 <- sotc08 %>%
filter(children == "Yes" | children == "No", !is.na(families_kids) & families_kids != "(DK)" & families_kids != "(Refused)") %>%
group_by(citystate, children, families_kids) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2008)
names(kidGroup08) <- n1
kidYears <- rbind(kidGroup08, kidGroup09, kidGroup10)
forMaster <- LikertData(positive, negative, neutral, kidYears)
forMaster$neg$Response <- factor(forMaster$neg$Response, levels=levels(forMaster$neg$Response)[c(2,3,1)])
forM <- filter(forMaster$pos, year==2008)
forM <- tbl_df(forM)
comp08 <- FeelingsC(forM) %>%
mutate(year="2008")
forM <- filter(forMaster$pos, year==2009)
forM <- tbl_df(forM)
comp09 <- FeelingsC(forM) %>%
mutate(year="2009")
forM <- filter(forMaster$pos, year==2010)
forM <- tbl_df(forM)
comp10 <- FeelingsC(forM) %>%
mutate(year="2010")
comp <- rbind(comp08, comp09, comp10)
names(comp) <- c("withKids", "noKids", "citystate", "kiddiffs", "year")
comp$year <- as.numeric(comp$year)