\documentclass[smallextended]{svjour3} % onecolumn (second format)
\RequirePackage{fix-cm}
\smartqed% flush right qed marks, e.g. at end of proof
\usepackage{graphicx}
\usepackage[round]{natbib}
\usepackage{amsmath}
\begin{document}
\journalname{Computational Statistics}
\title{Community engagement and subgroup meta-knowledge: Some factors in the soul of a community}
\titlerunning{Community engagement and subgroup meta-knowledge} % if too long for running head
\author{Amelia A McNamara}
\institute{A. McNamara \at
University of California, Los Angeles \\
Los Angeles, CA 90095 \\
\email{[email protected]}
}
\date{Received: date / Accepted: date}
\maketitle
\begin{abstract} The Knight Foundation collects data to determine what factors impact community satisfaction, local GDP growth, and interest in Knight news publications. For the 2013 Data Expo at the Joint Statistical Meetings, participants created graphical explorations of these data. This article focuses on the idea of community meta-knowledge, which is essentially majority group empathy or understanding of how minorities experience their community. For example, the survey asks participants to rate their community ``as a place for senior citizens,'' on a 5-point Likert scale. A city where seniors rated their community in the same way as non-seniors is defined as a community with high meta-knowledge about conditions for seniors. Three minority groups were explored: seniors, families with young children, and racial minorities. In most communities, people outside the minority group tended to under-rate their community, compared to those in the minority group. However, meta-knowledge about racial minorities stood out as an exception.
\keywords{2013 Data Exposition, R, ggplot2, Likert scales, meta-knowledge}
\end{abstract}
\section{Introduction}
\label{intro}
Studies have shown that increasing empathy is the best way to improve intergroup relations \citep{SteFin1999}. Therefore, it is of interest to quantify the typical level of empathy in communities across the United States. The Knight Foundation data provide a window into something which could be thought of as a proxy for empathy, namely community meta-knowledge. We define meta-knowledge as awareness by those outside a specific subgroup of the conditions for people inside the subgroup. The primary aim of this article is to answer the question ``Are people outside a specific subgroup aware of the quality of their community for people in that subgroup?'' and then, ``Do communities with high meta-knowledge (those where people outside the subgroup understood conditions for minorities) have higher community satisfaction rates than those with low meta-knowledge?''
This article is one of several related to the Knight Foundation community data from the 2013 American Statistical Association Data Expo. For more information on the Expo and the data sets, see \cite{Hof2019}.
This article is fully reproducible. The paper was written using \LaTeX\, and R~\citep{R2014}. The R package knitr was used to interweave writing and R code~\citep{Xie2014}. If you inspect the source code, available on Github~\citep{McN2019}, you will find that each of the figures in the paper is generated when the document is compiled (``knit''). Likewise, any statistic mentioned in the text is calculated using R and automatically inserted into the text using knitr. All graphics were generated with the package ggplot2, an R implementation of The Grammar of Graphics~\citep{Wic2009, Wil2005}. The paper was written in the time period between the development of the R package plyr~\citep{Wic2011} and the R package dplyr, which superseded it~\citep{WicFra2015}. Therefore, it uses functions from both packages.
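As a minimal sketch of the build step (it assumes the repository root, containing the .Rnw file and the data/ folder, is the working directory):
<<knitsketch, echo=TRUE, eval=FALSE>>=
# Sketch only: knitting runs the embedded R code and writes
# PaperFinalDraft.tex, which can then be compiled with pdflatex/bibtex.
knitr::knit("PaperFinalDraft.Rnw")
@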
One challenge with reproducibility is that as packages develop, they are not always backwards-compatible. To help combat this issue, the code available on Github includes a packrat library. The packrat package allows a user to snapshot the state of their R packages in order to freeze them for future reproducibility~\citep{UshMcP2014}. For example, although the current version of dplyr is 0.7.8, this paper uses version 0.4.1. In order to run the code hosted on Github, you will need to ``hydrate'' the library of packages using packrat. The Github repository contains the underlying data, the \LaTeX/knitr file which produces the paper (including R code), a packrat archive of the necessary R packages, and \LaTeX\, style files to create the specific journal style.
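As a rough sketch of that restore workflow (assuming the repository has been cloned and R is started in its root directory):
<<packratrestore, echo=TRUE, eval=FALSE>>=
# Sketch of the restore step, not part of the paper's analysis:
# rebuild the frozen package library recorded in packrat/packrat.lock.
install.packages("packrat")  # only needed if packrat is not yet installed
packrat::restore()
@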
The remainder of this paper is an exploration of the idea of community meta-knowledge. It begins with a discussion of the data in Section \ref{data}, then moves to focus on community satisfaction in Section \ref{communittsatsec}, and community engagement through local behaviors in Section \ref{behaviorsec}. Finally, it discusses meta-knowledge in Section \ref{metasec} and wraps up with conclusions in Section \ref{conclusion}.
\section{The Data}
\label{data}
The Knight Foundation collects survey data on 26 communities where the Knight brothers own newspapers, including San Jose, CA, State College, PA, Palm Beach, FL, and St. Paul, MN. The foundation provided data for three years, starting in 2008. Each data set includes approximately 20 demographic questions and 50-80 survey questions, depending on how distinct questions are defined. The aim of the survey is to gauge what factors are important to community attachment, and it includes questions on a variety of subjects, from ``how satisfied are you with this community as a place to live?" to ``how many minutes is your daily commute?"
The survey is conducted over the phone by Gallup Poll, and can take place in either English or Spanish. Gallup also performs data analysis for the Knight Foundation, and their yearly reports are available on the Knight Foundation website \citep{KF2008, KF2009, KF2010}.
The existing data analysis from Gallup is related to a metric they call ``Community Attachment." It is difficult to pin down what this variable means, but it's a composite metric composed of Community Loyalty and Community Passion. Both of those metrics, in turn, are composed of several variables. Community Loyalty includes how likely a person says they are to stay in that particular area, how much they would recommend it to friends, and their outlook for the community's future \citep{KF2010}. Community Passion is composed of variables on connectedness and community pride. So, Community Attachment is already a model of what Gallup believes is important to strong communities. The Gallup team has discovered that this composite variable is positively correlated with local Gross Domestic Product (GDP) growth \citep{KF2010b}. Because of this relationship, the analysis from Gallup is focused on what other factors correlate with Community Attachment (and therefore, with local GDP growth).
While the Gallup Poll analysis is interesting, it does raise the question of multicollinearity, as factors that are correlated with Community Attachment may simply be correlated with one of the variables that were used to compose it, and may not actually have an impact on local GDP growth.
<<packages, echo=FALSE>>=
require(knitr)
@
<<chunkoptions, echo=FALSE, include=FALSE>>=
opts_chunk$set(echo=FALSE, cache=FALSE, message=FALSE, warning=FALSE, error=FALSE, fig.align='center')
require(dplyr)
require(ggplot2)
require(scales)
require(RColorBrewer)
require(ggmap)
require(xtable)
require(reshape2)
require(forcats)
@
<<readdata>>=
sotc10 <- read.csv("data/sotc10.csv")
sotc09 <- read.csv("data/sotc09.csv")
sotc08 <- read.csv("data/sotc08.csv")
@
<<renaming>>=
renameData <- function(datasetname){
plyr::rename(datasetname,c("INTRO1" = "gender", "QSB" = "citystate", "QS3"="county", "QS3_02"="county2", "QS3A"="zipcode", "QD1"="age", "QS4"="area_descrip", "QS5"="further", "QS5_2"="further2", "QN1A"="current_life", "QN1B"="future_life", "QCE1"="community_sat","QCE2"="recommend","Q3A"="proud", "Q3B"="perfect", "Q3C"="reputation", "Q4_1"="problem_1", "Q4_2"="problem_2", "Q4_3"="problem_3", "Q4_1_1"="problem_1", "Q4_1_2"="problem_2", "Q4_1_3"="problem_3","Q4_2_1"="problem_4", "Q4_2_2"="problem_5", "Q4_2_3"="problem_6","Q4_3_1"="problem_7", "Q4_3_2"="problem_8", "Q4_3_3"="problem_9","Q5"="rather_live", "Q6"= "five_years_ago", "Q6A" = "five_years_future", "Q7A"= "parks", "Q7B"= "beauty", "Q7C"= "highways", "Q7D"= "affordable_housing", "Q7E" = "job_opportunites", "Q7F"= "public_schools", "Q7G" = "colleges", "Q7H"= "nightlife", "Q7I"= "friends", "Q7J" = "raise_kids", "Q7K" = "heathcare", "Q7L" ="leadership", "Q7M" = "care", "Q7N"= "police", "Q7O" = "arts", "Q7P" = "community_events", "Q8A" = "talented_grads", "Q8B" = "immigrants", "Q8C"= "minorities", "Q8D"= "families_kids", "Q8E" = "gay_lesbian", "Q8F" = "seniors", "Q8G" = "young_adults", "Q9" = "economy", "Q10" = "econ_better", "Q10A" = "trust_government", "Q11" = "employment", "Q12" = "commute", "Q13" = "job_satisfaction", "Q14" = "company_workforce", "Q15" = "enough_income", "Q15AA" = "area_hiring", "Q15AB" = "representative_leaders", "Q16A" = "treated_respectfully", "Q16B" = "well_rested", "Q16C" = "high_stress", "Q16D" = "learned_yest", "Q17" = "stress_source", "Q18" = "night_safety", "Q19" = "crime", "Q20" = "crime_increasing", "Q21"= "registered_vote", "Q22A" = "volunteer", "Q22B" = "public_meeting", "Q22C"= "voted_local", "Q22D" = "worked_change", "Q22E" = "church", "Q22F" = "festival", "Q22G" = "donated_org", "Q22H" = "donated_individual", "Q22I" = "donated_shelter", "Q22_A" = "impact", "Q23" = "clubs", "Q24" = "friends_nearby", "Q24A" = "friends_friends", "Q25"= "family_nearby", "Q26"= "talk_neighbors", "QD2" = "years_lived", "QD2A" ="permanent_resident", "QD3" = "work_cat", "QD3A" = "household_adults", "QD4" = "children", "QD5A" = "under6", "QD5B" = "six_12", "QD5C" = "thirteen_17", "QD6" = "married", "QD7"= "education", "QD8" = "own_home", "QD9"= "income", "QD10" = "hispanic", "QD111" = "race", "QD112" = "race2", "QD113" = "race3", "QD12" = "white_hispanic", "QD13" = "phone_numbers", "QD13A" = "cell_only", "QD13B" = "have_cell", "QD13C" = "cell_calls"))
}
sotc10 <- renameData(sotc10)
sotc09 <- renameData(sotc09)
sotc08 <- renameData(sotc08)
@
<<usedalot>>=
n1 <- c("citystate", "InGroup", "Response", "n", "Freq", "year")
colors2 <- c("#f1a340", "#998ec3")
colors1 <- brewer.pal(5, "Spectral")
colorsA <- colors1[c(2,3,4,1,5)]
colorsB <- colors1[c(2,3,4,5,1)]
@
<<LikertDataFunction>>=
# Split Likert responses into positive and negative halves for a diverging
# stacked bar chart. Any neutral category is split evenly between the two
# halves, and negative frequencies are negated so they plot left of zero.
LikertData <- function(positive, negative, neutral=NULL, dataset){
  pos <- dataset[dataset$Response %in% positive, ]
  neg <- dataset[dataset$Response %in% negative, ]
  if (!is.null(neutral)){
    neu <- dataset[dataset$Response %in% neutral, ]
    neu$Freq <- neu$Freq/2
    pos <- rbind(pos, neu)
    neg <- rbind(neg, neu)
  } else {
    neu <- NULL
  }
  neg$Freq <- -neg$Freq
  # Reverse the factor order on the negative side so the colors mirror
  # the positive side around zero.
  neg$Response <- factor(neg$Response)
  neg$Response <- factor(neg$Response, levels=rev(levels(neg$Response)))
  pos$Response <- factor(pos$Response)
  return(list(pos=pos, neu=neu, neg=neg))
}
@
<<calculatingInGroupFeelings>>=
FeelingsC <- function(dataset){
Feelings <- dataset %>%
group_by(citystate, InGroup) %>%
summarise(percents= sum(Freq))
Feelings <- data.frame(Feelings)
newDF <- data.frame(InGroup=NA, OutGroup=NA, citystate=levels(Feelings$citystate))
# ratio: in-subgroup positive share minus out-of-subgroup positive share
# (the reciprocal of this difference is the MK score used in the text)
newDF$ratio <- (Feelings[Feelings$InGroup=="TRUE" | Feelings$InGroup=="No",]$percents) - (Feelings[Feelings$InGroup=="FALSE" | Feelings$InGroup=="Yes",]$percents)
newDF$InGroup <- Feelings[Feelings$InGroup=="TRUE" | Feelings$InGroup=="Yes",]$percents
newDF$OutGroup <- Feelings[Feelings$InGroup=="FALSE" | Feelings$InGroup=="No",]$percents
return(newDF)
}
@
\subsection{Community Survey Rates}
\label{maprates}
<<gettinggeodata>>=
placesPop <- dget("data/popdata.robj")
percent1 <- summary(sotc10$citystate)/placesPop$CENSUS2010POP
percent1 <-c(percent1, summary(sotc09$citystate)/placesPop$POPESTIMATE2009)
percent1 <- c(percent1, summary(sotc08$citystate)/placesPop$POPESTIMATE2008)
popData <- data.frame(year=c(rep(2010, 26), rep(2009, 26), rep(2008, 26)), percent=percent1)
popData$percent <- popData$percent*100
popData$citystate <- c(rep(levels(sotc10$citystate),3))
@
<<surveyrates, fig.align="left", fig.cap="Yearly survey percentages, displayed on a log scale. Communities are ordered by median survey rate. Notice that some communities are always over-surveyed (for example, Palm Beach, FL) and some always appear under-surveyed (for example, Long Beach, CA). Population data come from the Census Bureau intercensal population estimates.", out.width="0.99\\linewidth",fig.width = 9.5,fig.height=7.5>>=
popData <- popData %>%
group_by(citystate) %>%
mutate(medianpercent = median(percent)) %>%
ungroup() %>%
mutate(citystate = fct_reorder(citystate, desc(medianpercent)))
p <- ggplot(aes(x=year, y=percent), data=popData) +
facet_wrap(~citystate, ncol=5) +
geom_line() +
scale_x_continuous(
breaks=c(2008, 2008.5, 2009, 2009.5, 2010),
labels=c("2008","", "2009", "", "2010")
) +
scale_y_log10(breaks=c(0.04, 0.4, 4)) +
xlab("") +
ylab("Percent of community surveyed")
p
@
As mentioned above, the data were collected by Gallup through telephone surveys in 2008, 2009, and 2010. Participants were a random sample of adults living in 26 ``communities" (cities or metro areas of the United States). The data from 2008 and 2009 had \Sexpr{dim(sotc08)[1]} and \Sexpr{dim(sotc09)[1]} responses, respectively, while the data from 2010 contained \Sexpr{dim(sotc10)[1]} observations. Because the data sets surveyed the same 26 communities, we can calculate the average number of survey participants in each community. In 2008, that average was \Sexpr{floor(dim(sotc08)[1]/26)} people, in 2009, \Sexpr{floor(dim(sotc09)[1]/26)} people and in 2010, \Sexpr{floor(dim(sotc10)[1]/26)} people. The difference in average number of survey participants will be discussed further in Section \ref{missingdata}.
In most communities, approximately 400 people were interviewed, but certain communities were surveyed much more. It appears that the Knight Foundation was trying to survey places at an approximately similar rate, which is why Philadelphia (for example) was surveyed 1633 times in 2010. To see which places were over- or under-represented in the survey, see Figure \ref{fig:surveyrates} for plots showing the percentage of the community that was polled for each polling year.
Figure \ref{fig:surveyrates} shows the percentage of the community that was polled; percentages hover around a mean of 0.07\%, with substantial variation. Palm Beach, FL always looks over-represented because the community has a small number of residents and the minimum sample size of 400 was always used, leading to a polling rate around 4\%. Large communities like Philadelphia, PA, look under-represented, with a rate around 0.01\%. In addition, there is some variation over time, especially in the eastern United States. For example, Akron, OH begins with a polling rate of 0.01\%, which rises to 0.07\% and then 0.09\%, as polling increased from around 400 residents to more than 1700. It is not clear why Gallup made these polling decisions.
\subsection{Scale Lengths}
The majority of the Knight data is in the form of responses to survey questions, and most survey questions were answered on a Likert scale~\citep{Lik1932}. However, there was little consistency in the number of levels for the scales. The most common scale was a five-point scale, as in ``Not at all satisfied, 2, 3, 4, Extremely satisfied'' or ``Very bad, 2, 3, 4, Very good.'' Beyond this, many other scale wordings (and scale lengths) were used. For the yearly distribution of scale lengths, see Figure \ref{fig:LikertSizes}.
<<numlevels>>=
pullLevels <- function(dataset, range){
numLevels <- NA
for (i in range){
numLevels[i] <- length(levels(as.factor(dataset[,i])))
}
return(numLevels)
}
#Removing the ones that don't make sense-- listing problems, in particular
# For sotc10, this works:
numLevels10 <- pullLevels(sotc10, c(11:80))
numLevels10[which(numLevels10>25)] <- NA
#For sotc09
numLevels09 <- pullLevels(sotc09, c(11:70))
numLevels09[16:24] <- NA
#For sotc08
numLevels08 <- pullLevels(sotc08, c(9:64))
# This takes out the problem questions as well as the commute question
numLevels08[which(numLevels08>20)] <- NA
numLevels <- data.frame(levels=c(numLevels08, numLevels09, numLevels10), year=c(rep(2008, times=length(numLevels08)), rep(2009, times=length(numLevels09)), rep(2010, times=length(numLevels10))))
@
The varied lengths of the response scales, and the different phrasings used even for scales of the same length, suggest that this survey was quite long and complex to complete. And though the survey maintains the scale lengths for individual questions over the years, Gallup rescales all the questions down to a 3-point scale to make their analysis simpler. The complete data set provided by the Knight Foundation includes between \Sexpr{dim(sotc08)[2]} and \Sexpr{dim(sotc10)[2]} variables, depending on the year, but fully half of them are rescaled versions of the original questions. While some researchers have suggested that a 3-point scale is enough \citep{JacMat1971}, discarding data seems wasteful, especially if participants have gone to the trouble of rating on a 5- or 7-point scale. So, the remainder of this analysis works on the unscaled variables.
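For illustration, that kind of rescaling could look like the sketch below. This is a guess at the operation rather than Gallup's actual procedure; the level labels are taken from the community satisfaction question, and the new variable name is invented.
<<rescalesketch, echo=TRUE, eval=FALSE>>=
# Illustration only: collapse the 5-point community satisfaction scale to
# three levels, roughly what the rescaled variables in the data set contain.
sotc10$community_sat3 <- forcats::fct_collapse(
  factor(sotc10$community_sat),
  negative = c("Not at all satisfied", "2"),
  neutral  = "3",
  positive = c("4", "Extremely satisfied")
)
@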
<<LikertSizes, fig.cap="Scale length distributions for each year of the survey. Notice that the distribution from 2008 is centered around 7, and the 2009 and 2010 distributions are centered around 5. All three distributions have large variation, suggesting that the Knight surveys were quite complex.",out.width="0.99\\linewidth",fig.width = 9.5,fig.height=4.5>>=
resp<- ggplot(numLevels, aes(x=levels)) +
geom_histogram(breaks=seq(from=0, to=15, by=1)) +
facet_grid(.~year) +
labs(x="Number of levels on the scale", y="Number of questions using the scale") +
theme(strip.text=element_text(size=14))
resp
@
\subsection{Missing Data}
\label{missingdata}
While the Gallup reports claim the telephone surveys only took 15 minutes, the number of variables collected and the wide range of response scales seem to indicate a much larger time commitment. This raises the question of whether everyone who began the survey completed it. And, as mentioned in Section \ref{maprates}, the 2010 data contained many more observations than previous years. The explanation for this difference is missing data, presumably related to surveys that were not fully completed.
<<missingmatch>>=
matchup1 <- sum(is.na(sotc10$recommend) & is.na(sotc10$colleges))
matchup2 <- sum(is.na(sotc10$colleges) & is.na(sotc10$registered_vote))
@
In order to explore this, we study the pattern of missing data within the 2010 data set. This data set includes a large number of responses that are extremely sparse, containing almost no completed survey questions, but generally completed demographic information. For example, the question ``How likely are you to recommend this community to a friend or associate as a place to live?'' has \Sexpr{sum(is.na(sotc10$recommend))} missing values, and the question ``The overall quality of the colleges and universities'' has \Sexpr{sum(is.na(sotc10$colleges))} missing values. The overlap of missing values between those two questions is \Sexpr{matchup1}. In other words, almost all the entries that are missing a response to the question about recommending the community are also missing a response to the question about colleges. Similarly, there are \Sexpr{matchup2} respondents missing both the question about colleges and the question ``Are you registered to vote?''
These questions were chosen arbitrarily, but the pattern is almost the same no matter which pair of survey questions was selected. As a result, we chose to use the question about recommending the community to a friend or associate as a proxy for the overall missing data. That is, if the response was missing an answer to that question, we considered it to be a ``missing'' response. The particular question was selected as an indicator of missing data because it was the first question on the survey (in order of the survey script) to have this scale of missing data. Better methods could have been used, for example setting a cutoff number of missing questions within a single respondent, but this seemed to be fairly accurate for our purposes.
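For reference, such a cutoff rule could be sketched as follows. The cutoff of ten unanswered questions and the column range 11--80 (the survey questions in the 2010 data) are assumptions made for illustration, not the rule used in this paper.
<<cutoffsketch, echo=TRUE, eval=FALSE>>=
# Illustrative alternative: flag a respondent as "missing" when more than
# some cutoff of survey questions are unanswered.
n_unanswered <- rowSums(is.na(sotc10[, 11:80]))
sotc10$missing_flag <- n_unanswered > 10
mean(sotc10$missing_flag)  # proportion of respondents flagged as missing
@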
In order to be consistent, we have considered data missing in 2008 and 2009 if there was no response to the question ``How likely are you to recommend this community to a friend or associate as a place to live?'' even though the 2008 and 2009 data do not show the same correspondence between that question being missing and the rest of the responses being almost completely sparse. Table \ref{missingtable} shows the percentage of data designated as missing for each survey year. While the 2008 and 2009 data sets are almost complete, the 2010 data set has about 25\% missing data.
Interestingly, with the incomplete responses removed, the 2010 data set is reduced to approximately 15,000 observations, which is much closer to the approximately 14,000 observations in each of the two prior years. This suggests that incomplete responses were removed in previous years, or that some new survey methodology (e.g., a ``short form'') was introduced in 2010.
<<missingdatatable, results="asis">>=
count_na <- function(x) sum(is.na(x))
missing10 <- count_na(sotc10$recommend)
missing09 <- count_na(sotc09$recommend)
missing08 <- sotc08 %>%
group_by(recommend == "(Refused)") %>%
summarise(n = n())
missing <- c(missing08$n[2], missing09, missing10)
missingtable <- as.matrix(data.frame(Year=c(2008, 2009, 2010), Missing=missing, Total=c(dim(sotc08)[1], dim(sotc09)[1], dim(sotc10)[1]), Percent=missing/c(dim(sotc08)[1], dim(sotc09)[1], dim(sotc10)[1])*100))
rownames(missingtable) <- NULL
print(xtable(x=missingtable,caption="Percent of missing data from surveys. The 2010 survey has almost 25 percent missing data. The total number of entries in 2010 is also much larger, suggesting that the 2008 and 2009 datasets used a different criterion for including entries.", display=c("d", "d", "d","d", "f"), label="missingtable"),include.rownames=FALSE)
@
\section{Community Satisfaction}
\label{communittsatsec}
Knowing that the question about community satisfaction was the only survey question answered by all respondents, it made sense to see which communities reported the highest levels of community satisfaction.
To visualize this, a set of stacked distribution graphs were created \citep{RobHei2011, HeiRob2014}. These stacked distribution graphs are centered around zero and use a diverging color scale to give an overall graphical sense of the amount of positive and negative responses across groups. Figure \ref{fig:communitysatplot} shows the distribution of responses to the question, ``Taking everything into account, how satisfied are you with this community as a place to live?'' and is ordered by the communities with the largest total positive responses in 2008, which highlights the changes in 2009 and 2010.
<<communitysat>>=
tmp1 <- tbl_df(sotc08) %>%
group_by(citystate, community_sat) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2008)
tmp2 <- tbl_df(sotc09) %>%
group_by(citystate, community_sat) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2009)
tmp3 <- tbl_df(sotc10) %>%
group_by(citystate, community_sat) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2010)
tmp <- rbind(tmp1, tmp2, tmp3)
names(tmp) <- c("citystate", "Response", "n", "Freq", "year")
#Now all I have to do is make the plot
neutral <- c("3")
positive <- c("Extremely satisfied", "4")
negative <- c("2", "Not at all satisfied")
trial2 <- LikertData(positive, negative, neutral, tmp)
extremelysat08 <- tbl_df(trial2$pos) %>%
filter(year==2008) %>%
group_by(citystate) %>%
summarise(tot=sum(Freq))
extremelysat <- tbl_df(trial2$pos) %>%
group_by(citystate, year) %>%
summarise(tot=sum(Freq))
mostsat <- order(extremelysat08$tot)
masterData <- data.frame(citystate=extremelysat$citystate, year=extremelysat$year, communitysat=extremelysat$tot)
trial2$pos$citystate <- factor(trial2$pos$citystate, levels=levels(trial2$pos$citystate)[mostsat])
trial2$neg$citystate <- factor(trial2$neg$citystate, levels=levels(trial2$neg$citystate)[mostsat])
trial2$neg$Response <- factor(trial2$neg$Response)
trial2$pos$Response <- factor(trial2$pos$Response)
trial2$neg$Response <- factor(trial2$neg$Response, levels=levels(trial2$neg$Response)[c(2,3,1)])
posform <- trial2$pos
maconsat <- tbl_df(posform) %>%
filter(citystate=="Macon, GA") %>%
group_by(year) %>%
summarise(totalper=sum(Freq))
brandetonsat <- tbl_df(posform) %>%
filter(citystate=="Bradenton, FL") %>%
group_by(year) %>%
summarise(totalper=sum(Freq))
@
<<communitysatplot, fig.cap="Responses to the question, ``Taking everything into account, how satisfied are you with this community as a place to live?'' Communities are ordered by percentage of positive responses in 2008, which makes the changes in distribution in 2009 and 2010 easy to see.", out.width="0.99\\linewidth",fig.width = 9.5,fig.height=15>>=
baseplot <- ggplot(mapping = aes(x=citystate, y=Freq, fill = Response, order=Response)) +
facet_wrap(~year, nrow=3) +
geom_bar(data = trial2$neg, stat = "identity") +
scale_fill_manual(
breaks=c("Not at all satisfied", "2", "3", "4", "Extremely satisfied"),
values=colorsB,
name="Response"
) +
geom_bar(data = trial2$pos, stat = "identity") +
coord_flip() +
ggtitle("Community satisfaction") +
xlab("") +
ylab("") +
scale_y_continuous(
limits=c(-0.5, 1),
breaks=seq(from=-0.5, to=0.75, by=0.25),
labels=c("50%", "25%", "0", "25%", "50%", "75%")
) +
theme(
legend.text=element_text(size=14),
legend.title=element_text(size=16),
axis.text=element_text(size=14),
strip.text=element_text(size=14))
baseplot
@
Overall, the total percentage of positive responses in 2008 ranges between \Sexpr{round(extremelysat08[extremelysat08$citystate=="Gary, IN",]$tot*100)}\% on the low end, and \Sexpr{round(extremelysat08[extremelysat08$citystate=="State College, PA",]$tot*100)}\% at the high end. These numbers are the overall percentage of people in each community that are answering the question, ``Taking everything into account, how satisfied are you with this community as a place to live?'' positively (this includes half of the responses labeled 3).
Communities are ordered by overall positive responses in 2008, which allows for comparisons between communities and across years. The overall trend is fairly stable, but there is some variation, both positive and negative. Looking at Figure \ref{fig:communitysatplot}, we can see that people in State College, PA typically report much greater levels of community satisfaction than people in Detroit, MI or Gary, IN. After 2008, Macon, GA sees a decrease in overall community satisfaction, moving from \Sexpr{round(maconsat$totalper[1]*100)}\% in 2008 to \Sexpr{round(maconsat$totalper[2]*100)}\% in 2009 and slightly up to \Sexpr{round(maconsat$totalper[3]*100)}\% in 2010. Over the same time period, Bradenton, FL shows a slight increase in community satisfaction, from \Sexpr{round(brandetonsat$totalper[1]*100)}\% in 2008 to about \Sexpr{round(brandetonsat$totalper[2]*100)}\% in 2009 and 2010.
\section{Community Engagement through Local Behaviors}
\label{behaviorsec}
Another point of interest was the most common behaviors reported by participants. Figure \ref{fig:YNall} shows the percentage of participants engaging in a variety of behaviors over the three years of the survey. An additional set of questions was introduced in 2010, so those questions are necessarily blank in the previous years.
Behaviors are arranged by percentage of survey respondents who reported the behavior.
Over all three years, the most common behavior was being registered to vote, followed by voting in a local election. The least common responses to the behavior questions (considering all three years) were ``worked with other residents to make change in the local community'', and ``attended a local public meeting in which local issues were discussed.'' When the additional questions were added in 2010, an even-less-common behavior was added, ``gave money or food to an individual in need in your community who is not related to you.''
<<twolevel>>=
twolevel <- c("registered_vote", "volunteer", "public_meeting", "voted_local", "worked_change", "church", "festival", "donated_org", "donated_individual", "donated_shelter")
df10 <- sotc10 %>%
select(one_of(twolevel)) %>%
mutate(id = 1:n()) %>%
tidyr::gather("Question", "Response", -id) %>%
group_by(Question, Response) %>%
filter(!is.na(Response)) %>%
summarise(n = n()) %>%
mutate(Freq = n/sum(n))
df10$Year <- 2010
df09 <- sotc09 %>%
select(one_of(twolevel[c(1:5)])) %>%
mutate(id = 1:n()) %>%
tidyr::gather("Question", "Response", -id) %>%
group_by(Question, Response) %>%
filter(!is.na(Response)) %>%
summarise(n = n()) %>%
mutate(Freq = n/sum(n))
df09$Year <- 2009
df08 <- sotc08 %>%
select(one_of(twolevel[c(1:5)])) %>%
mutate(id = 1:n()) %>%
tidyr::gather("Question", "Response", -id) %>%
group_by(Question, Response) %>%
filter(!is.na(Response) & Response != "(DK)" & Response != "(Refused)") %>%
summarise(n = n()) %>%
mutate(Freq = n/sum(n))
df08$Year <- 2008
dfall <- rbind(df10, df09, df08)
dfall$Question <- factor(dfall$Question, levels=levels(dfall$Question)[order(df10$Freq[df10$Response=="Yes"])])
positive <- "Yes"
negative <- "No"
trial1 <- LikertData(positive, negative, neutral=NULL, dataset=dfall)
@
<<YNall, out.width="0.99\\linewidth",fig.width = 15,fig.height=9.5,fig.cap="Responses to yes/no questions about participants' behaviors, comparing all three survey years.">>=
baseplot <- ggplot(mapping = aes(Question, Freq, fill = Response)) +
facet_wrap(~Year) +
geom_bar(data = trial1$neg, stat = "identity") +
geom_bar(data = trial1$pos, stat = "identity") +
scale_y_continuous(
breaks=seq(from=-0.75, to=0.75, length.out = 7),
labels=c("", "-50%", "", "0","", "50%", "")
) +
coord_flip()+
xlab("") +
scale_x_discrete(
labels=rev(c("Registered to vote \n", "Voted in local election \n", "Donated money to a \n local organization", "Attended a local \n festival or event", "Gave money or food \n to an individual", "Participated in a \n church event", "Performed local \n volunteer work", "Worked with other residents \n to make change", "Attended a local \n public meeting", "Provided free shelter \n to an individual"))
) +
ylab("") +
theme(
legend.text=element_text(size=14),
legend.title=element_text(size=16),
axis.text.y=element_text(size=20),
title=element_text(size=24),
axis.text.x=element_text(size=14),
strip.text=element_text(size=18))
baseplot
@
<<YN2010>>=
dfcities10 <- sotc10 %>%
select(citystate, one_of(twolevel)) %>%
mutate(id = 1:n()) %>%
group_by(citystate) %>%
tidyr::gather("Question", "Response", -id, -citystate) %>%
group_by(citystate, Question, Response) %>%
filter(!is.na(Response)) %>%
summarise(n = n()) %>%
mutate(Freq = n/sum(n)) %>%
filter(Response=="Yes") %>%
merge(select(filter(df10, Response=="Yes"), Freq), by="Question") %>%
mutate(diff=Freq.x-Freq.y) %>%
select(-Freq.y, -n, Freq=Freq.x) %>%
arrange(citystate) %>%
mutate(year="2010")
dfcitiesYes <- dfcities10 %>%
filter(Response=="Yes") %>%
mutate(col = diff>0)
dfcitiesYes$Question <- factor(dfcitiesYes$Question, levels=levels(dfcitiesYes$Question)[order(dfall[dfall$Response=="Yes",]$Freq)])
@
<<startingoverall>>=
dfcities09 <- sotc09 %>%
select(citystate, one_of(twolevel[1:5])) %>%
mutate(id = 1:n()) %>%
group_by(citystate) %>%
tidyr::gather("Question", "Response", -id, -citystate) %>%
group_by(citystate, Question, Response) %>%
filter(!is.na(Response)) %>%
summarise(n = n()) %>%
mutate(Freq = n/sum(n)) %>%
filter(Response=="Yes") %>%
merge(select(filter(df10, Response=="Yes"), Freq), by="Question") %>%
mutate(diff=Freq.y-Freq.x) %>%
select(-Freq.y, -n, Freq=Freq.x) %>%
arrange(citystate) %>%
mutate(year="2009")
dfcities08 <- sotc08 %>%
select(citystate, one_of(twolevel[1:5])) %>%
mutate(id = 1:n()) %>%
group_by(citystate) %>%
tidyr::gather("Question", "Response", -id, -citystate) %>%
group_by(citystate, Question, Response) %>%
filter(!is.na(Response)) %>%
summarise(n = n()) %>%
mutate(Freq = n/sum(n)) %>%
filter(Response=="Yes") %>%
merge(select(filter(df10, Response=="Yes"), Freq), by="Question") %>%
mutate(diff=Freq.y-Freq.x) %>%
select(-Freq.y, -n, Freq=Freq.x) %>%
arrange(citystate) %>%
mutate(year="2008")
md1 <-rbind(dfcities10, dfcities09, dfcities08) %>%
select(-Response, -diff) %>%
filter(Question %in% c("registered_vote", "voted_local", "volunteer", "worked_change", "public_meeting"))
md1 <- melt(md1)
md1 <- dcast(md1, citystate + year ~ Question, mean)
md1$year <- as.numeric(md1$year)
masterData <- left_join(masterData, md1)
@
<<YN2010plot, out.width="0.99\\linewidth",fig.width = 10, fig.height=18, fig.cap="Percentage-point difference from overall survey rates (2010 data). The letters A-J represent the activities listed in Figure \\ref{fig:YNall}. A: Registered to vote, B: Voted in a local election, C: Donated money to a local organization, D: Attended a local event, E: Gave money or food to an individual, F: Participated in a church event, G: Performed local volunteer work, H: Worked with other residents to make change, I: Attended a local public meeting, J: Provided free shelter to an individual.">>=
baseplot <- ggplot(dfcitiesYes) +
aes(Question, diff, fill = col) +
facet_wrap(~citystate, ncol=3) +
geom_bar(stat = "identity") +
coord_flip() +
guides(fill=FALSE) +
scale_y_continuous(
breaks=seq(from=-.15, to=.15, length.out = 5),
labels=c("", "-7.5", "0","+7.5", "")
) +
scale_x_discrete(labels=LETTERS[10:1]) +
ylab("") +
xlab("") +
theme(
axis.text.y=element_text(size=12),
title=element_text(size=18),
axis.text.x=element_text(size=11),
strip.text=element_text(size=16))
baseplot
@
A follow-up to this question is whether all communities performed these actions at similar rates, or if there were local variations in behavior. Figure \ref{fig:YN2010plot} shows the percentage-point difference from the overall rate across all 26 communities and 10 behaviors in 2010. For example, in 2010, approximately \Sexpr{round(trial1[["pos"]][trial1[["pos"]]$Year=="2010" & trial1[["pos"]]$Question=="registered_vote",]$Freq, digits=2)*100}\% of people across all communities were registered to vote. In comparison, in Aberdeen, SD, \Sexpr{round(dfcitiesYes[dfcitiesYes$citystate=="Aberdeen, SD" & dfcitiesYes$Question=="registered_vote",]$Freq, digits=2)*100}\% of people reported being registered to vote, so Figure \ref{fig:YN2010plot} shows a percentage-point difference of +\Sexpr{round(dfcitiesYes[dfcitiesYes$citystate=="Aberdeen, SD" & dfcitiesYes$Question=="registered_vote",]$Freq-trial1[["pos"]][trial1[["pos"]]$Year=="2010" & trial1[["pos"]]$Question=="registered_vote",]$Freq, digits=2)*100}. Of course, all the communities' rates are included in the calculation of the average, so it makes sense that some communities are above the overall rate and some are below. More study is required to determine if the visual differences between communities represent true differences or just random variation, but there are certainly communities that stand out from the rest.
For example, people in Boulder, CO, Long Beach, CA, and San Jose, CA participated in church events at a much lower rate than the overall rate, while people in Georgia (both Macon and Milledgeville) were more likely to participate in church events.
It is also interesting to note the communities that performed above or below national rates on most questions. For example, in Biloxi, MS, respondents engage in almost all of the behaviors asked about at higher-than-average rates, with the exception of being registered to vote and performing local volunteer work. On the other hand, Palm Beach, FL shows lower-than-average rates for almost all behaviors, except for giving money or food to an individual. Long Beach, CA shows a similar pattern. However, since this array of plots represents 26 facets of the same plot, it is highly possible that these visual trends occurred by chance. Humans have a tendency to see patterns in noise, even when no true pattern exists. This tendency is known as apophenia or pareidolia, and can be combated by the use of visual inference techniques~\citep{WicCoo2010}. However, that was outside the scope of this Data Expo entry.
\section{Meta-Knowledge}
\label{metasec}
The primary aim of this article is to address whether communities hold meta-knowledge about their city being a good place for subgroups or minorities.
The survey asks a number of questions related to rating the community as a place for subgroups, including: ``young, talented college graduates,'' ``immigrants from other countries,'' ``racial and ethnic minorities,'' ``families with young children,'' ``gay and lesbian people,'' ``senior citizens,'' and ``young adults without children.'' Not all these subgroups were asked to identify themselves in the demographic questions (particularly ``gay and lesbian people'') so it was not possible to address them all. Instead, we focus on racial and ethnic minorities, senior citizens, and families with young children.
Community meta-knowledge is essentially majority group empathy or understanding of how minorities experience their community. For example, the survey asks participants to rate their community as a place for families with young children on a 5-point Likert scale. A city where participants with children rated their community in the same way as participants without children is defined as a community with high meta-knowledge about conditions for families with young children.
Ideally, we could use the information gathered about community meta-knowledge on subgroup experiences to determine more about the community itself. It's possible that meta-knowledge would correlate with other measures we are interested in (for example, overall community satisfaction).
We define meta-knowledge for a particular dimension (e.g., seniors) as the reciprocal of the difference between the total percentage of positive responses to the question, ``How is your community as a place for [particular subgroup]?'' given by people in the subgroup and by people outside it. This is expressed in equation (\ref{MK}).
\begin{equation}
\label{MK}
MK = \frac{1}{\sum R^{+}_{\text{in subgroup}} - \sum R^{+}_{\text{out of subgroup}}}
\end{equation}
This means that communities where people inside and outside the subgroup give similar scores will receive high absolute $MK$ values, while those with very different scores will receive low absolute $MK$ values. A positive $MK$ score means that people outside the subgroup are under-rating the community relative to those inside it, while a negative $MK$ score means they are over-rating it.
We can calculate $MK$ for each subgroup we are interested in, denoting the group with a subscript. Here, we will consider the questions ``How is your community as a place for racial and ethnic minorities?'' ($MK_R$), ``How is your community as a place for seniors?'' ($MK_S$), and ``How is your community as a place for families with young children?'' ($MK_C$).
Using this definition, communities where Whites highly over-rate their community as a place for minorities will have a negative $MK_R$ score, while communities where Whites under-rate their community will have a positive $MK_R$ score. Communities where the ratings by both groups are roughly equal will have a large $MK_R$ score, whether positive or negative. Generally, we are more interested in the magnitude of the score than the sign, but it is sometimes interesting to consider which group is under- or over-rating.
For a visual explanation of this rating system, see Figure \ref{fig:anotherlookplot}. Points close to the $y=x$ line have large absolute $MK$ scores; points below the line have positive scores and points above the line have negative scores, with the magnitude shrinking as points fall farther from the line.
This definition puts emphasis on communities where out-of-subgroup participants rated their city the same way that in-subgroup participants did, based on the assumption that empathy is important to communities \citep{SteFin1999}.
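To make the arithmetic concrete, the sketch below computes $MK$ for two hypothetical communities; the helper function and the positive-response shares in it are invented for illustration and are not part of the survey analysis.
<<MKexample, echo=TRUE, eval=FALSE>>=
# Illustration only: MK = 1 / (in-subgroup positive share minus
# out-of-subgroup positive share). The shares below are made up.
mk <- function(pos_in, pos_out) 1 / (pos_in - pos_out)
mk(pos_in = 0.70, pos_out = 0.69)  # ratings nearly agree: MK = 100 (high meta-knowledge)
mk(pos_in = 0.55, pos_out = 0.75)  # out-group over-rates: MK = -5 (low meta-knowledge)
@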
\subsection{Meta-knowledge about the community as a place for racial and ethnic minorities}
\label{minorities}
To begin, we investigate meta-knowledge about racial and ethnic minorities. This proved somewhat difficult because each year of data collection used a slightly different set of possible answer choices to the question ``Which of these groups best describes your racial background?'' and because there were so many participants who refused to answer (particularly in 2010).
<<minorities>>=
races10 <- sotc10 %>%
filter(race != " " & race != "4" & race != "Don't Know") %>%
group_by(race) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2010)
races10$race <- factor(races10$race)
levels(races10$race) <- c(levels(races10$race)[1:5], "Some other race", levels(races10$race)[7])
races09 <- sotc09 %>%
filter(race != " " & race != "4" & race != "Don't know") %>%
group_by(race) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2009)
races09$race <- factor(races09$race)
races08 <- sotc08 %>%
filter(race != " " & race != "4" & race != "(DK)" & race != "None") %>%
group_by(race) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2008)
races08$race <- factor(races08$race)
levels(races08$race) <- c("Hispanic", "More than one", "Refused", "American Indian or Alaskan", "Asian", "Black or African-American", "Native Hawaiian or other Pacific Islander", "Some other race", "White")
races <- rbind(races08, races09, races10)
@
<<overallRaceplot, out.width="0.99\\linewidth",fig.width = 15,fig.height=10, fig.cap="Responses to the question, ``Which of these groups best describes your racial background?'' Notice that 2010 shows an overrepresentation of Refused responses, compared to the other years.">>=
raceOver <- ggplot(races, mapping = aes(race, n)) +
geom_bar(stat="identity") +
facet_wrap(~year,nrow=3) +
ylab("Number of respondents") +
xlab("") +
theme(
axis.text.y=element_text(size=20),
title=element_text(size=24),
axis.text.x=element_text(size=18),
strip.text=element_text(size=25)) +
coord_flip()
raceOver
@
The overall distribution of responses to the demographic race question is shown in Figure \ref{fig:overallRaceplot}. This figure is shown with absolute numbers of participants, rather than fractions, to underscore how different the 2010 data is. Notice that in 2010, the largest category was ``Refused,'' and without that category the rest of the responses are not on the same scale as previous years. As the plot shows, the sample sizes for individual minority race responses were somewhat small each year, so for the investigation on subgroup meta-knowledge, we combined all the minority responses into one group that we refer to as Non-white. This category contains all participants who reported a race that was not White, but does not include participants who declined to give a response to the question. For a condensed summary of what the White/Non-white criteria means for the overall percentages, see Figure \ref{fig:samplesize}.
<<minoritiesOverYearsE>>=
minoritiesYN08 <- sotc08 %>%
filter(race != " " & race != "4" & race != "(DK)" & race != "None" & race != "(Refused)") %>%
filter(minorities != "(DK)" & minorities != "(Refused)") %>%
group_by(citystate, race != "White", minorities) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2008)
names(minoritiesYN08) <- n1
minoritiesYN09 <- sotc09 %>%
filter(race != " " & race != "4" & race != "Don't know" & race != "Refused") %>%
filter(!is.na(minorities)) %>% group_by(citystate, race != "White", minorities) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2009)
names(minoritiesYN09) <- n1
minoritiesYN10 <- sotc10 %>%
filter(race != " " & race != "4" & race != "Don't Know" & race != "Refused") %>%
filter(!is.na(minorities)) %>%
group_by(citystate, race != "White", minorities) %>% summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2010)
names(minoritiesYN10) <- n1
whiteYN <- rbind(minoritiesYN08, minoritiesYN09, minoritiesYN10) %>%
group_by(year, InGroup, Response) %>%
summarise(tot = sum(n)) %>%
mutate(newFreq=prop.table(tot))
whiteYN <- rename(whiteYN, Freq=newFreq, n=tot)
neutral <- c("3")
positive <- c("Very good", "4")
negative <- c("2", "Very bad")
whiteYears <- LikertData(positive, negative, neutral, whiteYN)
whiteYears$neg$Response <- factor(whiteYears$neg$Response, levels=levels(whiteYears$neg$Response)[c(2,3,1)])
@
<<samplesize, out.width="0.99\\linewidth",fig.width = 8.5,fig.height=4.5, fig.cap="Sample sizes for plots about meta-knowledge regarding the community as a place for minorities.">>=
whiteYN1 <- tbl_df(whiteYN) %>%
group_by(year, InGroup) %>%
summarise(howmany=sum(n)) %>%
mutate(frac=prop.table(howmany))
whiteplot <- ggplot(aes(x=year, y=howmany), data=whiteYN1) +
geom_bar(aes(fill=InGroup), position="dodge", stat="identity") +
xlab("Year") +
ylab("Sample size") +
guides(fill=guide_legend(title=NULL)) +
scale_fill_manual(values=colors2, labels=c("White", "Non-white"))
whiteplot
@
<<overallRaceResponsePlot, out.width="0.99\\linewidth",fig.width = 8.5, fig.height=4.5, fig.cap="Responses to the question, ``How is your community as a place for racial and ethnic minorities?'' White denotes survey respondents who listed their race as White, and Non-white is all other race responses (not including survey participants who refused to report a race). Notice the difference between the 2008/2009 responses and the 2010 responses, but also refer to Figure \\ref{fig:samplesize} for the absolute sample sizes for each year-- 2010 has a much smaller sample of responses to the question overall.">>=
baseplot <- ggplot(mapping = aes(InGroup, Freq, fill = Response, order=Response)) +
facet_wrap(~year, nrow=3) +
geom_bar(data = whiteYears$neg, stat = "identity") +
geom_bar(data = whiteYears$pos, stat = "identity") +
scale_fill_manual(
breaks=c("Very bad", "2", "3", "4", "Very good"),
values=colorsA,
name="Response"
) +
coord_flip() +
# TRUE means minority, and in 2008 minorities were under-rating
scale_y_continuous(
breaks=seq(from=-0.25, to=0.75, by=0.25),
labels=percent
) +
scale_x_discrete(labels=c("White", "Non-white")) +
theme(
legend.text=element_text(size=14),
legend.title=element_text(size=16),
axis.text.y=element_text(size=16),
title=element_text(size=16),
axis.text.x=element_text(size=14),
strip.text=element_text(size=14)
) +
xlab("") +
ylab("")
baseplot
@
<<facetcalculations>>=
minGroups10 <- LikertData(positive, negative, neutral, minoritiesYN10)
minGroups10$neg$Response <- factor(minGroups10$neg$Response, levels=levels(minGroups10$neg$Response)[c(2,3,1)])
minGroups08 <- LikertData(positive, negative, neutral, minoritiesYN08)
minGroups08$neg$Response <- factor(minGroups08$neg$Response, levels=levels(minGroups08$neg$Response)[c(2,3,1)])
minGroups09 <- LikertData(positive, negative, neutral, minoritiesYN09)
minGroups09$neg$Response <- factor(minGroups09$neg$Response, levels=levels(minGroups09$neg$Response)[c(2,3,1)])
totalMin09 <- minoritiesYN09 %>%
group_by(citystate, InGroup) %>%
summarize(totalN = sum(n))
@
<<allMinorities,out.width="0.99\\linewidth",fig.width = 10.5, fig.height=12.5, fig.cap="Responses to the question, ``How is your community as a place for racial and ethnic minorities?'' faceted by community (2009 data). Numbers in parentheses indicate sample sizes." >>=
baseplot <- ggplot() +
facet_wrap(~citystate, ncol=3) +
geom_bar(mapping = aes(InGroup, Freq, fill = Response, order=Response), data = minGroups09$neg, stat = "identity") +
geom_bar(mapping = aes(InGroup, Freq, fill = Response, order=Response), data = minGroups09$pos, stat = "identity") +
scale_fill_manual(
values=colorsA,
breaks=c("Very bad", "2", "3", "4", "Very good"),
name="Response"
) +
coord_flip() +
#TRUE means minorities, in Aberdeen they are under-rating
scale_y_continuous(
breaks=seq(from=-0.5, to=0.75, by=0.25),
labels=percent
) +
scale_x_discrete(labels=c("White", "Non-white")) +
xlab("") +
ylab("") +
theme(
legend.text=element_text(size=14),
legend.title=element_text(size=16),
axis.text.y=element_text(size=16),
title=element_text(size=16),
axis.text.x=element_text(size=11),
strip.text=element_text(size=14)) +
geom_text(data=totalMin09, aes(label=paste0("(",totalN, ")"), x=InGroup, y=1.1), size=3, hjust=1) +
expand_limits(y=1)
baseplot
@
We want to study the overall difference between in-group and out-group responses to the question, ``how is your community as a place for racial and ethnic minorities?'' To do this, the data is split into two groups, one called White and one called Non-white. For comparison of the distribution of responses between groups, see Figure \ref{fig:overallRaceResponsePlot}. Interestingly, Whites were rating their communities as better places for minorities than Non-whites in 2008 and 2009, but in 2010 Whites began under-rating their communities.
As in Section \ref{behaviorsec}, we want to see the individual variation between communities. We chose to look at the 2009 data to study community-level variation. The community-level responses are shown in Figure \ref{fig:allMinorities}. There is a lot of variation between the communities for this particular type of meta-knowledge. Some communities saw the White respondents over-scoring their community as a place for minorities, while some were under-scoring.
For the most part, however, Whites over-rated their communities as a place for racial and ethnic minorities, compared to Non-whites. Grand Forks, ND, was a particularly bad offender-- Whites over-rated it as a place for minorities (negative $MK$ score), and minorities themselves rated it as one of the worst communities in 2009.
There was a lot of additional variation in response distribution. For example, San Jose, CA is highly rated as a place for minorities by people both inside and outside the subgroup, and it was slightly under-rated by Whites, which is probably a good sign of meta-knowledge and empathy. Going in the opposite direction are Grand Forks, ND, Myrtle Beach, SC, and Macon, GA.
For another view of the relationship between ratings (as seen in Figure \ref{fig:allMinorities}) see Figure \ref{fig:anotherlookplot}, which shows the relationship between total positive responses by Whites ($\sum R^+_{\text{out of subgroup}}$) versus total positive responses by Non-whites ($\sum R^+_{\text{in subgroup}}$) to the question, ``How is your community as a place for racial and ethnic minorities?'' Figure \ref{fig:anotherlookplot} makes it clear that while there are some communities that are under-rated by Whites (positive $MK$ score), the majority of communities are over-rated by Whites as a place for minorities (negative $MK$ score). Very few communities are close to the 1-1 (or y=x) line (large $|MK|$ score).
<<anotherlook>>=
newDF09 <- FeelingsC(minGroups09$pos)
newDF09$year <- "2009"
names(newDF09) <- c("nonwhite", "white", "citystate", "racediffs", "year")
newDF08 <- FeelingsC(minGroups08$pos)
newDF08$year <- "2008"
names(newDF08) <- c("nonwhite", "white", "citystate", "racediffs", "year")
newDF <- rbind(newDF09, newDF08)
# Too much missing data to use 2010
newDF$year <- as.numeric(newDF$year)
masterData <- left_join(masterData, newDF)
horiz <- mean(newDF09$white)
vert <- mean(newDF09$nonwhite)
@
However, even if a community is on the 1-1 line in Figure \ref{fig:anotherlookplot}, it may fall below the mean rating for communities overall. Gary, IN is a good example. It has almost the same overall positive rating from Whites and Non-whites (around \Sexpr{round(newDF09[newDF09$citystate=="Gary, IN",]$nonwhite*100)}\% positive ratings), so it has an $MK$ score of \Sexpr{round(1/newDF09[newDF09$citystate=="Gary, IN",]$racediffs)}, but it falls below the mean ratings of communities by both Whites and Non-whites. So, while there is high meta-knowledge in Gary, it is agreement that Gary is a worse place than average for minorities. Of course, looking at Figure \ref{fig:allMinorities}, we can see that using the overall positive rating ($\sum R^+$) to compute $MK$ obscures some aspects of the overall distribution. Although the positive end of the scale matches up quite nicely, the distribution of negative responses by Non-whites is much heavier-tailed. As with all simple measures, focusing just on the total positive percentage loses information.
<<anotherlookplot, out.width="0.76\\linewidth", fig.cap=paste0("Relationship between positive responses to the question, ``How is your community as a place for minorities?'' comparing ratings of Whites and Non-whites. Each community is represented, and the plot uses 2009 data. The darker grey region corresponds to positive meta-knowledge scores. Some communities, like State College, PA, are under-rated by Whites, some are over-rated, like Grand Forks, ND, and some are rated the same by both groups, like Gary, IN. The black line shows y=x, for comparison. Grey lines at x=", round(vert, digits=2), " and y=", round(horiz, digits=2), " show the mean ratings by each group.") >>=
df_poly <- data.frame(
x=c(-Inf, Inf, Inf),
y=c(-Inf, Inf, -Inf)
)
al <- ggplot(newDF09) +
geom_polygon(data=df_poly, aes(x, y), fill="grey", alpha=0.6) +
geom_vline(aes(xintercept=vert), color="grey") +
geom_hline(aes(yintercept=horiz), color="grey") +
geom_abline(intercept=0, slope=1) +
geom_point(aes(y=white, x=nonwhite)) +
geom_text(
data = newDF09[newDF09$citystate %in% c("Duluth, MN", "Gary, IN", "Grand Forks, ND", "State College, PA"),],
aes(y=white, x=nonwhite, label = paste(citystate, "\nMK =", round(1/racediffs))),
vjust=c(-0.4, 1.1, 1.3, -0.4),
hjust=0,
size=4
) +
xlab("Non-whites rating") +
ylab("Whites rating") +
xlim(0.4, 1) +
ylim(0.4, 1)
al
@
There is clearly some relationship between the ratings by Whites and Non-whites, although it is not a perfect $y=x$ relationship. The correlation between the two sets of ratings is $r = \Sexpr{round(cor(newDF09$white, newDF09$nonwhite), digits=2)}$.
<<diffscor>>=
# Correlation between community satisfaction and |MK_R|, using 2008 and 2009 data only
diffscor <- cor(masterData[masterData$year != 2010,]$communitysat,
                abs(1/masterData[masterData$year != 2010,]$racediffs))
@
To explore the effect of $MK_R$ on community satisfaction, we plotted the raw $MK_R$ score against community satisfaction (from Section \ref{communittsatsec}) and did not find a clear trend. Instead, it appeared that extreme $MK_R$ scores, whether positive or negative, were associated with higher community satisfaction. So, we calculated the correlation between the absolute value of the meta-knowledge score, $|MK_R|$, and community satisfaction, which is \Sexpr{round(diffscor, digits=2)}. This includes data from both 2008 and 2009 (2010 data was still excluded because of sample size issues).
The correlation suggests a weak negative relationship between the two variables, which is not what we would expect: communities with the lowest $|MK_R|$ scores had the highest community satisfaction scores on average, while communities with high $|MK_R|$ scores had lower community satisfaction. This seems counter-intuitive, since we would expect communities with more self-awareness to be more satisfied, but that is not what we see. Clearly, there is more to community satisfaction than $|MK_R|$.
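One way to see why the single correlation is not very informative is to look at the scatter directly. The chunk below (shown but not evaluated) sketches a diagnostic plot of community satisfaction against $|MK_R|$ for the same 2008--2009 rows that enter the correlation; it is a quick check rather than a figure we rely on.
<<mkSatSketch, echo=TRUE, eval=FALSE>>=
# Illustrative sketch only (not evaluated): community satisfaction plotted
# against |MK_R|, using the same 2008 and 2009 rows that enter diffscor.
mk_sat <- masterData %>%
  filter(year != 2010) %>%
  mutate(absMK = abs(1 / racediffs))
ggplot(mk_sat, aes(x = absMK, y = communitysat)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # linear trend for reference
  xlab("|MK_R|") +
  ylab("Community satisfaction")
@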
\subsection{Meta-knowledge about the community as a place for seniors}
\label{mseniors}
<<OverallSeniors>>=
# Define the senior subgroup by splitting respondents at age 62 and older
seniorGroup10 <- sotc10 %>%
filter(!is.na(seniors)) %>%
group_by(citystate, age >= 62, seniors) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2010)
names(seniorGroup10) <- n1
seniorGroup09 <- sotc09 %>%
filter(!is.na(age)) %>%
group_by(citystate, age >= 62, seniors) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2009)
names(seniorGroup09) <- n1
seniorGroup08 <- sotc08 %>%
group_by(citystate, age >=62, seniors) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2008)
names(seniorGroup08) <- n1
seniorYears <- rbind(seniorGroup08, seniorGroup09, seniorGroup10)
seniorYears$Response <- factor(seniorYears$Response, levels=levels(seniorYears$Response)[c(6,3:5, 7,1:2)])
@
<<seniorSS, out.width= "0.99\\linewidth", fig.width = 8.5, fig.height= 4.5, fig.cap = "Sample sizes for plots about meta-knowledge regarding the community as a place for seniors.">>=
seniorSS <- tbl_df(seniorYears) %>%
group_by(year, InGroup) %>%
summarise(number=sum(n)) %>%
mutate(percentage=(100*number)/sum(number))
seniorSS$InGroup[seniorSS$InGroup=="FALSE"] <- "Non-seniors"
seniorSS$InGroup[seniorSS$InGroup=="TRUE"] <- "Seniors"
seniorplot <- ggplot(aes(x=year, y=number), data=seniorSS) +
geom_bar(aes(fill=InGroup), position="dodge", stat="identity") +
xlab("Year") +
ylab("Sample size") +
guides(fill=guide_legend(title=NULL)) +
scale_fill_manual(values=colors2)
seniorplot
@
<<OverallSeniorRatings>>=
seniorOverall <- LikertData(positive, negative, neutral, seniorYears)
seniorGroup09 <- LikertData(positive, negative, neutral, seniorGroup09)
seniorGroup09$neg$Response <- factor(seniorGroup09$neg$Response, levels=levels(seniorGroup09$neg$Response)[c(2,3,1)])
totPos <- FeelingsC(seniorGroup09$pos)
names(totPos) <- c("Seniors", "Nonseniors","citystate", "seniordiffs")
horiz <- mean(totPos$Nonseniors)
vert <- mean(totPos$Seniors)
@
<<seniorOverallData>>=
forM <- filter(seniorOverall$pos, year==2008)
forM <- tbl_df(forM)
comp08 <- FeelingsC(forM) %>%
mutate(year="2008")
forM <- filter(seniorOverall$pos, year==2009)
forM <- tbl_df(forM)
comp09 <- FeelingsC(forM) %>%
mutate(year="2009")
forM <- filter(seniorOverall$pos, year==2010)
forM <- tbl_df(forM)
comp10 <- FeelingsC(forM) %>%
mutate(year="2010")
comp <- rbind(comp08, comp09, comp10)
names(comp) <- c("seniors", "nonseniors", "citystate", "seniordiffs", "year")
comp$year <- as.numeric(comp$year)
masterData <- left_join(masterData, comp)
@
To continue our exploration of meta-knowledge, we consider meta-knowledge about communities as a place for seniors, or $MK_S$. To determine whether non-seniors understood how good their community was for seniors, the data were split into two groups: participants aged 62 and older, and participants under 62. For the sample sizes of the two groups, see Figure \ref{fig:seniorSS}.
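As a quick check that the age-62 split behaves sensibly community by community (and not only in aggregate), the chunk below (shown but not evaluated) sketches how the share of senior respondents in each community and year can be tabulated from the grouped counts in \texttt{seniorYears}.
<<seniorShareSketch, echo=TRUE, eval=FALSE>>=
# Illustrative sketch only (not evaluated): share of respondents aged 62+
# in each community and survey year, from the grouped counts in seniorYears.
seniorYears %>%
  ungroup() %>%
  group_by(citystate, year, InGroup) %>%
  summarise(n = sum(n)) %>%
  mutate(share = n / sum(n)) %>%    # share within each community-year
  filter(InGroup == TRUE) %>%       # keep the senior rows
  arrange(desc(share))
@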
The yearly rating distributions are shown in Figure \ref{fig:seniorOverallPlot}. It appears that non-seniors typically underestimate how good a place their community is for seniors (positive $MK$ score): people 62 and older rated their community as a better place for seniors than did people under 62, in all three survey years.
<<seniorOverallPlot,out.width="0.99\\linewidth",fig.width = 8.5, fig.height=5.5, fig.cap="Responses to the question, \`\`How is your community as a place for seniors?\" Seniors are defined as survey participants aged 62 and older, non-seniors are those under 62. Notice that seniors consistently rated their community more highly than did non-seniors over all three survey years.">>=
seniorYears2 <- seniorYears %>%
group_by(year, InGroup, Response) %>%
summarise(n3 = sum(n)) %>%
mutate(newFreq = prop.table(n3))
names(seniorYears2) <- c("year", "InGroup", "Response", "n", "Freq")
seniorYears2 <- LikertData(positive, negative, neutral, seniorYears2)
baseplot <- ggplot(mapping = aes(InGroup, Freq, fill = Response, order=Response)) +
facet_wrap(~year, nrow=3) +
geom_bar(data = seniorYears2$neg, stat = "identity") +
geom_bar(data = seniorYears2$pos, stat = "identity") +
scale_fill_manual(breaks=c("Very bad", "2", "3", "4", "Very good"), values=colorsA, name="Response") +
coord_flip() +
scale_y_continuous(breaks=seq(from=-0.25, to=0.75, by=0.25), labels=percent) +
scale_x_discrete(labels=c("Non-seniors", "Seniors")) +
xlab(" ") +
ylab("") +
theme(
legend.text=element_text(size=14),
legend.title=element_text(size=16),
axis.text.y=element_text(size=12),
title=element_text(size=16),
axis.text.x=element_text(size=12),
strip.text=element_text(size=14))
baseplot
@
<<seniorPlot, out.width="0.99\\linewidth",fig.width = 10.5, fig.height=12.5, fig.cap="Responses to the question, \`\`How is your community as a place for seniors?\" faceted by community (2009 data). Numbers in parentheses are sample sizes. Every community follows the pattern of over-rating by seniors, but some communities have a smaller discrepancy between ratings of seniors and non-seniors.">>=
totalSeniors09 <- seniorYears %>%
ungroup() %>%
filter(year==2009) %>%
group_by(citystate, InGroup) %>%
summarize(totalN = sum(n))
baseplot <- ggplot() +
facet_wrap(~citystate, ncol=3) +
geom_bar(mapping = aes(InGroup, Freq, fill = Response, order = Response), data = seniorGroup09$neg, stat = "identity") +
geom_bar(mapping = aes(InGroup, Freq, fill = Response, order = Response), data = seniorGroup09$pos, stat = "identity") +
scale_fill_manual(
breaks=c("Very bad", "2", "3", "4", "Very good"),
values=colorsA,
name="Response"
) +
coord_flip() +
scale_y_continuous(
breaks=seq(from=-0.25, to=0.75, by=0.25),
labels=percent
) +
scale_x_discrete(labels=c("Non-seniors", "Seniors")) +
xlab("") +
ylab("") +
theme(
legend.text=element_text(size=14),
legend.title=element_text(size=16),
axis.text.y=element_text(size=16),
title=element_text(size=16),
axis.text.x=element_text(size=11),
strip.text=element_text(size=14)) +
geom_text(data=totalSeniors09, aes(label=paste0("(",totalN, ")"), x=InGroup, y=1.2), size=3, hjust=1) +
expand_limits(y=1)
baseplot
@
<<anotherlookseniors, out.width="0.76\\linewidth", fig.cap=paste0("Relationship between positive responses to the question, \`\`How is your community as a place for seniors?\" comparing ratings of non-seniors and seniors. Each community is represented, and the plot uses 2009 data. The darker grey region corresponds to positive meta-knowledge scores. The black line shows y=x, for comparison. Grey lines at x=", round(mean(totPos$Seniors), digits=2), " and y=", round(mean(totPos$Nonseniors), digits=2), " show the mean ratings by each group.") >>=
sr <- ggplot(totPos) +
geom_polygon(data=df_poly, aes(x, y), fill="grey", alpha=0.6) +
geom_vline(aes(xintercept=vert), color="grey") +
geom_hline(aes(yintercept=horiz), color="grey") +
geom_abline(intercept=0, slope=1) +
geom_point(aes(y=Nonseniors, x=Seniors)) +
geom_text(
data = totPos[totPos$citystate %in% c("Wichita, KS", "Gary, IN", "Bradenton, FL"),],
aes(y=Nonseniors, x=Seniors, label = paste(citystate, "\nMK =", round(1/seniordiffs))
),
vjust=c(-0.4,-0.4,1),
hjust=c(1,0,0),
size=4
) +
xlab("Senior rating") +
ylab("Non-senior rating") +
ylim(0.5, 1) +
xlim(0.5, 1)
sr
@
As in Section \ref{minorities}, we wanted to investigate local variation in responses, so we created a faceted plot of the response distributions for the two groups in each community. This plot can be seen in Figure \ref{fig:seniorPlot}, and it uses 2009 data. While the plot shows the overall distribution of responses across communities, it makes it difficult to compare the two groups (seniors and non-seniors) overall.
To see the relationship between total positive responses by non-seniors and total positive responses by seniors to the question ``How is your community as a place for seniors?'', see Figure \ref{fig:anotherlookseniors}. Interestingly, every community followed the pattern of non-seniors underestimating how good their community was for seniors (or, equivalently, of seniors boosting their own ratings); in other words, unlike the $MK_R$ scores, all of the $MK_S$ scores were positive. The only community that came close to being an exception was Wichita, KS. There is a strong correlation between the ratings from inside and outside the group ($r=\Sexpr{round(cor(totPos$Seniors, totPos$Nonseniors), digits=2)}$), which suggests good meta-knowledge about communities as places for seniors, even though seniors consistently rate their communities more highly than non-seniors do.
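The claim that every $MK_S$ score has the same sign can be verified directly from the totals behind Figure \ref{fig:anotherlookseniors}. The chunk below (shown but not evaluated) sketches that check using \texttt{totPos}, counting communities on each side of the $y=x$ line and pulling out the community with the smallest gap, which the discussion above identifies as Wichita, KS.
<<seniorSignSketch, echo=TRUE, eval=FALSE>>=
# Illustrative sketch only (not evaluated): confirming that seniors out-rate
# non-seniors everywhere, and finding the community with the smallest gap.
totPos %>%
  mutate(gap = Seniors - Nonseniors) %>%
  summarise(n_seniors_higher = sum(gap > 0),
            n_nonseniors_higher = sum(gap < 0))
totPos %>%
  mutate(gap = Seniors - Nonseniors) %>%
  arrange(abs(gap)) %>%
  head(1)   # expected to be Wichita, KS, per the discussion above
@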
The question then becomes whether the $MK_S$ score is related to community satisfaction. The correlation between $|MK_S|$ and community satisfaction is \Sexpr{round(cor(masterData$communitysat, abs(1/masterData$seniordiffs)), digits=2)}. This is the relationship we would expect between community satisfaction and meta-knowledge: communities with higher meta-knowledge also showed higher community satisfaction.
\subsection{Meta-knowledge about the community as a place for families with young children}
The last subgroup to study in this exploration is families with young children. For this section, we split participants between those who reported having dependent children under the age of 18 living in their household and those who did not. Intuitively, it makes sense that a parent of an older child (say, a teenager) would have higher meta-knowledge about the community as a place for families with young children than a participant who never had children or whose children have grown up and moved away. While the data included more granular demographic details about the ages of the children in the households, splitting the data into participants with children and those without made the groups closer in size. To see the sample sizes and percentages, see Figure \ref{fig:kidsSS}.
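As a quick illustration of the group-size argument, the chunk below (shown but not evaluated) sketches how the relative sizes of the with-children and without-children groups can be tabulated for the 2009 survey, restricting to respondents with a definite Yes/No answer, as in the chunk that follows.
<<kidGroupSizeSketch, echo=TRUE, eval=FALSE>>=
# Illustrative sketch only (not evaluated): relative size of the two groups
# in the 2009 survey, restricted to respondents with a definite Yes/No answer.
sotc09 %>%
  filter(children %in% c("Yes", "No")) %>%
  count(children) %>%
  mutate(percentage = 100 * n / sum(n))
@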
<<kidsbyYear>>=
# Split respondents by whether they report dependent children under 18 in the household
kidGroup10 <- sotc10 %>%
filter(children == "Yes" | children == "No", !is.na(families_kids)) %>%
group_by(citystate, children, families_kids) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2010)
names(kidGroup10) <- n1
kidGroup09 <- sotc09 %>%
filter(children == "Yes" | children == "No", !is.na(families_kids)) %>%
group_by(citystate, children, families_kids) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2009)
names(kidGroup09) <- n1
kidGroups <- LikertData(positive, negative, neutral, kidGroup09)
kidGroups$neg$Response <- factor(kidGroups$neg$Response, levels=levels(kidGroups$neg$Response)[c(2,3,1)])
kidGroup08 <- sotc08 %>%
filter(children == "Yes" | children == "No", !is.na(families_kids) & families_kids != "(DK)" & families_kids != "(Refused)") %>%
group_by(citystate, children, families_kids) %>%
summarise(n = n()) %>%
mutate(frac=prop.table(n), year=2008)
names(kidGroup08) <- n1
kidYears <- rbind(kidGroup08, kidGroup09, kidGroup10)
forMaster <- LikertData(positive, negative, neutral, kidYears)
forMaster$neg$Response <- factor(forMaster$neg$Response, levels=levels(forMaster$neg$Response)[c(2,3,1)])
forM <- filter(forMaster$pos, year==2008)
forM <- tbl_df(forM)
comp08 <- FeelingsC(forM) %>%
mutate(year="2008")
forM <- filter(forMaster$pos, year==2009)
forM <- tbl_df(forM)
comp09 <- FeelingsC(forM) %>%
mutate(year="2009")
forM <- filter(forMaster$pos, year==2010)
forM <- tbl_df(forM)
comp10 <- FeelingsC(forM) %>%
mutate(year="2010")
comp <- rbind(comp08, comp09, comp10)
names(comp) <- c("withKids", "noKids", "citystate", "kiddiffs", "year")
comp$year <- as.numeric(comp$year)