
Product: Population Factfinder


Welcome to the db-factfinder wiki!

In the following sections, you will find documentation about the methodology used to create the data driving the Population Fact Finder application. The package we've created for this purpose serves other purposes, too, such as calculating ACS data for various community district geographies for the community profiles dataset.

Converting 2010 to 2020 Geographies

Overview

In order to facilitate time-series analysis of demographic trends, data originally released in 2010 census geographies need to be converted to 2020 census geographies. This involves allocating count values in 2010 census tracts to 2020 tracts, accounting for tract splits or merges. In cases of tract splits, counts from the 2010 tract-level data are distributed to the multiple 2020 tracts in a way that is proportional to the 2010 population distribution within the tract. DCP uses a one-to-one relationship between 2010 blocks and 2020 blocks in order to estimate the proportion of 2010 population contained within each new tract.

For example:

  • 2010 tract 1 (containing 8 blocks) split into 2020 tracts 1.1 (containing blocks 1, 2, 3 and 4) and 1.2 (containing blocks 5, 6, 7, and 8)
  • In 2010, tract 1 had a total population of 4000, made up of:
    • Block 1: 1000
    • Block 2: 500
    • Block 3: 1000
    • Block 4: 500
    • Block 5: 200
    • Block 6: 200
    • Block 7: 500
    • Block 8: 100
  • The 2010 population contained in the blocks now associated with each of the 2020 tracts is:
    • Tract 1.1 (blocks 1-4): 1000 + 500 + 1000 + 500 = 3000
    • Tract 1.2 (blocks 5-8): 200 + 200 + 500 + 100 = 1000
  • The proportion of total 2010 population in the blocks now associated with each of the 2020 tracts is:
    • Tract 1.1: 75%
    • Tract 1.2: 25%

These proportions are contained in ratio.csv, in the following format (using the above example for demonstration):

2020 Tract   2010 Tract   ratio
1.1          1            .75
1.2          1            .25

For cases of merges, ratios are 1 if the entireties of multiple 2010 tracts are combined into a new, larger 2020 tract.

These ratios are used to proportionately allocate count values from 2010 to 2020 tracts. Tract-to-tract conversion is the first step before higher-level spatial aggregation. For more information about aggregating census tracts into larger geographies, see the vertical aggregation documentation page.

Calculating 2020 tract-level estimates & MOEs in cases of splits

Conversion of 2010 tract-level estimates and MOEs occurs in the AggregateGeography class for year 2010_to_2020. This class contains a method ct2010_to_ct2020, which takes a DataFrame of 2010 tract-level data and returns a DataFrame of 2020 tract-level data (as estimated using the proportional allocation of total population).

Consider the example tract split described above, along with the following example 2010 tract-level estimates:

2010 Tract   Workers Under 16 Estimate   Workers Under 16 MOE
1            1000                        100

In order to estimate the number of workers under 16 in 2020 tracts 1.1 and 1.2, we assume that the spatial distribution of workers under 16 is well-approximated by the spatial distribution of total people within the tract.

First, we merge the 2010 data with the ratios described in the previous section, yielding:

2020 Tract   2010 Tract   Ratio   Estimate (2010)   MOE (2010)
1.1          1            .75     1000              100
1.2          1            .25     1000              100

2020 tract-level estimates are simply the 2010 estimate multiplied by the ratio:

2020 Tract   2010 Tract   Ratio   Estimate (2010)   MOE (2010)   Estimate (2020)
1.1          1            .75     1000              100          1000 * .75 = 750
1.2          1            .25     1000              100          1000 * .25 = 250

Calculating 2020 MOEs depends on an empirically-derived formula, convert_moe:

  • If the ratio is 1 (not a tract split), 2020 MOE is the same as 2010 MOE
  • If the 2020 estimate is 0 (prior to any rounding), the 2020 MOE is NULL
  • If ((ratio * 100)^(0.56901)) * 7.96309 >= 100, the 2020 MOE is the same as the 2010 MOE
  • Otherwise, the 2020 MOE is equal to: ((((ratio * 100)^(0.56901)) * 7.96309) / 100) * (2010 MOE)

This formula comes from an empirical model capturing the relationships between published block group MOEs as a percent of published tract MOEs and block group estimates as a percent of tract estimates, with R-squared of 0.81:

(block group MOEs as a percent of tract MOEs) = 7.96309 * (block group estimates as a percent of tract estimates)^0.56901

This formula is based on 10 selected variables, observed across 314 randomly selected NYC block groups:

  • Males 85 years and older
  • Non-hispanic of 2 or more races
  • Single female household with children
  • 65 years and older living alone
  • Household income $200,000 or more
  • Worked from home
  • Employed civilians 16 years and older
  • Occupied housing with a mortgage
  • Vacant housing units
  • GRAPI 30% to 34.9%

The nested relationship of block groups within tracts mimics the relationship of 2020 tracts within 2010 tracts in cases of a tract split.

Using the example above, MOE is calculated as follows:

2020 Tract   2010 Tract   Ratio   Estimate (2010)   MOE (2010)   Estimate (2020)   MOE (2020)
1.1          1            .75     1000              100          750               (7.96309 * 75^0.56901) / 100 * 100 = 92.8988
1.2          1            .25     1000              100          250               (7.96309 * 25^0.56901) / 100 * 100 = 49.7191

MOEs and estimates are rounded in the final cleaning and rounding step.
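
Taken together, these rules can be sketched in a few lines of pandas. The following is an illustrative reimplementation of convert_moe and the allocation step, not the package's exact code; the column names in the small example frame are assumptions for demonstration.

    import numpy as np
    import pandas as pd

    def convert_moe(ratio, moe_2010, estimate_2020):
        # Illustrative sketch of the convert_moe rules described above
        if ratio == 1:
            return moe_2010                    # not a split: MOE carries over unchanged
        if estimate_2020 == 0:
            return np.nan                      # zero 2020 estimate: MOE is NULL
        pct = ((ratio * 100) ** 0.56901) * 7.96309
        if pct >= 100:
            return moe_2010                    # empirical percentage capped at 100
        return pct / 100 * moe_2010

    # Reproducing the worked example above (assumed column names)
    ratios = pd.DataFrame(
        {"tract_2020": ["1.1", "1.2"], "tract_2010": ["1", "1"], "ratio": [0.75, 0.25]}
    )
    data_2010 = pd.DataFrame({"tract_2010": ["1"], "e": [1000.0], "m": [100.0]})

    df = ratios.merge(data_2010, on="tract_2010")
    df["e_2020"] = df["e"] * df["ratio"]
    df["m_2020"] = [
        convert_moe(r, m, e) for r, m, e in zip(df["ratio"], df["m"], df["e_2020"])
    ]
    # e_2020 -> [750.0, 250.0]; m_2020 -> [~92.90, ~49.72]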

Calculating 2020 tract-level estimates and MOEs in cases of merges

Cases of tract merges are much simpler, and generally follow the same logic as other small-to-large spatial aggregation.

Consider the following example, representing a complete merge of 2.1 and 2.2 into 2:

2020 Tract   2010 Tract   ratio
2            2.1          1
2            2.2          1

In this case, joining with an example 2010 tract-level dataset would produce:

2020 Tract   2010 Tract   ratio   Estimate (2010)   MOE (2010)   Estimate (2020)   MOE (2020)
2            2.1          1       100               10           100 * 1 = 100     10
2            2.2          1       200               20           200 * 1 = 200     20

At this point, rows of the joined table are aggregated to produce 2020 tract-level data: estimates are summed, and MOEs are aggregated using the square root of a sum of squares, in agg_moe.

2020 Tract   Estimate (2020)   MOE (2020)
2            100 + 200 = 300   SQRT(10^2 + 20^2) = 22.3607

Converting from 2010 tracts to other 2020 geographies

If the requested geography type is not a tract, but is instead another 2020 geography type (NTA, CDTA, etc.), other methods in the AggregateGeography class first call ct2010_to_ct2020 to estimate 2020 tract-level data. From there, aggregation proceeds using the same techniques as other data years. For example, the method tract_to_nta:

  1. Converts 2010 tracts to 2020 tracts using the workflow described above
  2. Joins the resulting 2020 tract-level data with the 2010_to_2020 lookup_geo (for more information about spatial lookups, see here)
  3. Groups by 2020 NTA field, nta_geoid, using aggregation techniques defined in create_output
  4. Renames nta_geoid as geoid and sets geotype to "NTA" to standardize format

Entire PFF workflow in cases of converting 2010 to 2020 data

The following image shows the entire workflow for converting 2010 to 2020 geographies. The example shown is transforming 2019 ACS data into 2020 NTA-level data for the PFF variable capturing the population of South Asian origin, or asnsouth.

[Image: factfinder_convert workflow diagram]

Downloading Census Data from the API

Overview

Input data for all population factfinder calculations come from the US Census Bureau's API, as accessed using the census python wrapper package.

Download Class

The Download class accesses input data for all population factfinder calculations by formatting a geoquery and calling the appropriate US Census Bureau API endpoint. When initialized, this class contains the following properties, all necessary for selecting endpoints and creating queries:

  • The census API access key, contained in a .env file
  • The year of data to access (In the case of 5-year ACS data, this is the final year. For example, 5-year data from the 2015-2019 rolling sample would correspond with year = 2019)
  • The source type (i.e. acs, decennial)
  • Necessary state and county FIPS codes, set by default to the five NYC counties within NY state

The geoqueries method uses state and county FIPS codes to generate an appropriate query for the requested spatial unit. For example, calling geoqueries('tract') will return the string query expected by the US Census Bureau API (via the census python wrapper) to download all tracts within the five NYC counties.

The download_variable method then calls either download_e_m or download_e_m_p_z. These methods set the census client based on the specified source, identify the census variable codes associated with the pff_variable name using Metadata, identify the appropriate geoquery for the requested geotype, then call client.get to store data in a pandas DataFrame. Upon download, types are enforced (set to float64), outliers are replaced with NULLs, and MOEs for zero estimates are set to zero.

In order to improve performance, the Download class writes results of each call to a cache (via utils.write_to_cache). Prior to re-downloading, the Download class checks the cache for previously-stored results.
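
As a rough sketch of this read-through pattern (illustrative only; the package's actual cache paths and helpers, such as write_to_cache, may differ):

    import os
    import pandas as pd

    def cached_download(cache_path, download_fn):
        # Illustrative read-through cache: reuse previously stored results when present
        if os.path.exists(cache_path):
            return pd.read_pickle(cache_path)
        df = download_fn()          # otherwise call the Census API via the Download class
        df.to_pickle(cache_path)    # and write the result for future calls
        return df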

Calculating Estimate and MOE of PFF variables

The Calculate class method calculate_c_e_m_p_z is the entry point for all logic calculating non-rounded c, e, m, p, and z for a given PFF variable.

This method first creates an instance of the Variable class, so that all of the metadata associated with the PFF variable is easily accessible. For more information about metadata, see the metadata documentation page.

Basic estimate and MOE workflow

The most straightforward workflow for calculating estimates and MOEs of a given PFF variable is defined in the calculate_e_m method. This is the workflow for non-median, non-special variables.

  1. Determine from & to geography types: In cases where the requested geography type is not a standard Census geography, block group- or tract-level data get aggregated to produce the requested estimates and MOEs. Logic and lookups necessary for geographic aggregation are year-specific python files in the geography directory. Each of these files defines an AggregatedGeography class, which contains an options property of the form:
    {
        "decennial": {
            "tract": {"NTA": self.tract_to_nta, "cd": self.tract_to_cd},
            "block": {
                "cd_fp_500": self.block_to_cd_fp500,
                "cd_fp_100": self.block_to_cd_fp100,
                "cd_park_access": self.block_to_cd_park_access,
            },
        },
        "acs": {
            "tract": {"NTA": self.tract_to_nta, "cd": self.tract_to_cd},
            "block group": {
                "cd_fp_500": self.block_group_to_cd_fp500,
                "cd_fp_100": self.block_group_to_cd_fp100,
                "cd_park_access": self.block_group_to_cd_park_access,
            },
        },
   }

These lookups determine the necessary geography to download from the Census API in order to produce output for the requested geotype. For example, calculating ACS data at the NTA level (the "to geography") requires raw data at the tract level (the "from geography"), while calculating ACS data for the irregular park access region within each community district (cd_park_access) requires raw data at the block group level.

  2. Download input data: Once the necessary raw data geography format is identified, all necessary census variables are downloaded using the Download class. For more information on downloading data from the Census API using this class, see the "Downloading data from the API" documentation page.

  3. Aggregate horizontally: If a pff_variable is a sum of multiple, mutually-exclusive census variables, the data downloaded in step 2 gets aggregated "horizontally." For example, if PFF Variable = Input 1 + Input 2, horizontal aggregation first combines the two input census variables to calculate a PFF variable estimate and MOE for each row of the input data. For more information on this form of aggregation, see the "Horizontal aggregation" documentation page.

  4. Aggregate vertically: In cases where the requested geography is not a Census geography, the results of step 3 undergo "vertical" aggregation. For example, rows containing tract-level estimates and MOEs for a given PFF variable get combined to produce NTA-level estimates and MOEs. For more information on this form of aggregation, see the "Vertical aggregation" documentation page.

Estimate & MOE workflow exceptions

Several variables require slight modifications to the workflow above.

  • Medians: For medians, estimate and MOE calculations occur in the calculate_e_m_median method, rather than calculate_e_m. When downloading data (step 2 above), all necessary variables of counts within bins get downloaded. Horizontal and vertical aggregation (steps 3 & 4) are handled by the Median class. For more information about medians, see the "Median calculation" documentation page.

  • Special variables: Several PFF variables are combinations of census variables, but are not simple sums. In these cases, horizontal aggregation relies on variable-specific formulas contained in special.py. For more information about special variable calculation, which occurs in calculate_e_m_special, see the "Special variables" section of the Horizontal aggregation documentation page.

  • Profile-only variables: For some PFF variables, estimates and MOEs are available both in reference to a count, and a percent of the larger population. In these cases, the downloading step also includes the download of associated percent estimate and percent MOE data. Estimate and MOE calculations for profile-only variables occur in calculate_e_m_p_z. There is no vertical aggregation associated with these cases. For more information about profile-only calculations for non-aggregated geography types, see the exceptions section of the "Percent Estimate and Percent MOE" documentation page.

Performance enhancements with caching and multiprocessing

In order to improve performance, both raw data and calculated estimate and MOE data get cached locally. When downloading data from the Census API, the Download class first checks to see if the same variables for the same geographies exist in the local cache, implemented here. If so, the raw data is read from the cache and is not re-downloaded. Otherwise, raw data is obtained via the API and saved to the cache for future calls, using the write_to_cache utility function.

Caching also occurs after raw data is transformed into PFF variable estimates and MOEs. The method calculate_e_m, described above, first checks to see if previously calculated data are saved in the cache, in these lines. If so, estimate and MOE data are read from local files rather than being recalculated. First-time calculations (ones not already in the cache) are added to the cache here.

In cases where a single PFF variable is a combination of multiple input PFF variables (such as the binned data used to calculate medians), inputs are calculated in parallel. The method calculate_e_m_multiprocessing is a wrapper function that calls either calculate_e_m or calculate_e_m_special over a list of input PFF variables.

Calculating coefficient of variation

After e and m are calculated in calculate_c_e_m_p_z, c is calculated, using the function get_c.

If the estimate is 0 or the MOE is NULL, then c is NULL. Otherwise, c = m / 1.645 / e * 100.
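
A minimal sketch of this rule, mirroring the formula above rather than the package's exact implementation:

    import numpy as np

    def get_c(e, m):
        # Coefficient of variation; NULL when the estimate is 0 or the MOE is NULL
        if e == 0 or m is None or np.isnan(m):
            return np.nan
        return m / 1.645 / e * 100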

Final Output Formatting

After c, e, m, p, and z are calculated using calculate_c_e_m_p_z, values are rounded and cleaned based on the rules contained in the methods cleaning and rounding.

Rounding

The utility function rounding rounds estimates and MOEs to the number of digits specified in the metadata. All c, p, and z values are rounded to a single decimal place, regardless of the number of digits specified in the metadata. Note that the logic used to clean data (described in the next section) refers to the rounded values rather than the raw values.
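
A sketch of this rounding behavior, assuming the values are floats (or None for NULL) and that digits comes from the variable's metadata:

    def round_output(e, m, c, p, z, digits):
        # e and m use the variable-specific digit count from the metadata;
        # c, p, and z are always rounded to one decimal place
        def rnd(value, n):
            return None if value is None else round(value, n)
        return rnd(e, digits), rnd(m, digits), rnd(c, 1), rnd(p, 1), rnd(z, 1)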

Cleaning

The following rules modify the rounded results of calculate_c_e_m_p_z, in the order listed. The purpose of these cleaning steps is to remove invalid values.

Invalid values

  • If c, e, m, p, or z are negative, they are overwritten by NULL
  • If p is greater than 100, it is overwritten by NULL
  • If p is 100 or NULL (including values overwritten by the above rule), z is set to NULL

Zero estimates

  • If e is 0, then c, m, p, and z are set to NULL

Base variables

  • If the variable is a base variable, the geography type is either borough or city, and c is NULL, c is set to 0
  • If the variable is a base variable, the geography type is either borough or city, and m is NULL, m is set to 0
  • If the variable is a base variable and the variable is not a median variable, p is set to 100
  • If the variable is a base variable and the variable is not a median variable, z is set to NULL

Inputs to median variables (binned variables)

  • If the variable is an input to a median (with the exception of median rooms inputs), m, p, z, and c are set to NULL

Special variables

  • If the variable is a special variable, p and z are set to NULL

GEOID Formatting

The method labs_geoid translates census geoids into the format displayed in the PFF application. Because the list of geography types changed in 2020, this method relies on year-specific functions, format_geoid, contained in the AggregatedGeography classes.

These translations primarily involve replacing FIPS county codes with borough abbreviations or codes.

Combining census variables to create pff-variables

Sums of census variables

In order to calculate estimates and margins of error for pff variables, input census variables undergo up to two forms of aggregation. We describe these as "horizontal" and "vertical" aggregation, referring to summing tables over either columns or rows. The two input variables in the simplified table below represent data as downloaded from the Census API. If PFF Variable = Input 1 + Input 2, the estimate columns need to be combined to derive the PFF variable estimate, and the MOE columns need to be combined to derive the PFF MOE. Each of these steps is described in the following sections.

geoid     Input Estimate 1   Input MOE 1   Input Estimate 2   Input MOE 2
tract 1   10                 1             20                 2
tract 2   30                 3             40                 4

Horizontal aggregation (excluding aggregations described in the "Exceptions" section below) happens in the aggregate_horizontal method of the Calculate class.

Calculating estimates of sums

Several population factfinder variables are sums of more granular, mutually exclusive inputs. For example, counts representing a population under 18 might come from the aggregated counts of several childhood age bins (e.g. 0-4, 5-9, 10-14, 15-17). Or, a variable might reflect a sum over male- and female-specific counts. This form of aggregation is "horizontal." In our simplified case, the tract-level counts for a variable comprised of two inputs would be:

geoid     PFF Variable Estimate
tract 1   10 + 20 = 30
tract 2   30 + 40 = 70

In general, PFF variable estimate (for a row) = Sum of Input Estimates (for that row)

Calculating MOEs of sums

The margin of error for an aggregation is a simple root sum of squares of the input margins of error. This is based on an assumption that the input variables are independent.

geoid     PFF Variable MOE
tract 1   sqrt(1^2 + 2^2) = sqrt(5)
tract 2   sqrt(3^2 + 4^2) = sqrt(25)

In general, PFF variable MOE (for a row) = Square root of the sum of squared input MOEs (for that row)
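
Applied to the simplified two-input example above, horizontal aggregation might look like the following (column names are assumptions for demonstration):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(
        {
            "geoid": ["tract 1", "tract 2"],
            "e_1": [10, 30], "m_1": [1, 3],
            "e_2": [20, 40], "m_2": [2, 4],
        }
    )
    df["e"] = df["e_1"] + df["e_2"]                     # estimates are summed
    df["m"] = np.sqrt(df["m_1"] ** 2 + df["m_2"] ** 2)  # MOEs: root sum of squares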

Exceptions

Not all pff variables are simple sums of census variables. There are two types of non-sum combinations of census variables: medians and special variables.

Special variables

PFF variables that are non-sum, non-median combinations of census variables are referred to as "special variables". These include:

  • hovacrtm
  • percapinc
  • mntrvtm
  • mnhhinc
  • avghhsooc
  • avghhsroc
  • avghhsz
  • avgfmsz
  • hovacrt
  • rntvacrt
  • wrkrnothm

Estimate and MOE calculation for special variables occurs in the method calculate_e_m_special of the Calculate class. After downloading the estimates and MOEs of necessary input variables, this function then calls one of the pff variable-specific functions in special.py to combine inputs.


Medians

Several PFF variables are medians, rather than counts. These include:

  • mdage
  • mdhhinc
  • mdfaminc
  • mdnfinc
  • mdewrk
  • mdemftwrk
  • mdefftwrk
  • mdrms
  • mdvl
  • mdgr

Estimate and MOE calculation for medians occurs in the method calculate_e_m_median of the Calculate class. This method calculates medians by:

  1. Extracting ranges, design factors, and booleans indicating whether top and bottom coding are appropriate from the metadata class (see metadata documentation for more information).
  2. Downloading and calculating the estimate and MOE for all input variables. For medians, input variables are counts within a given bin. For example, a count of people ages 5 to 9 is an input for median age.
  3. Pivoting the outputs of step 2 to create a table with each row representing a geoid, where each input pff variable corresponds with two columns (one estimate column and one MOE column).
  4. Combining columns (a form of horizontal aggregation) using formulas contained in the Median class.

For more detail on median calculation, as implemented in the Median class, see the median calculation documentation page.

Median Calculations

Methods for calculating the median estimate and MOE for a given geography are in the Median class.

Estimate calculation

Median estimates are calculated from count estimates of binned data. For example, median household income estimates are calculated from estimated counts of households with incomes in various ranges (under 10k, 10-14k, 15-19k, etc.).

Below is an example of tract-level input variable estimates for median household income:

Input variable (range) Estimate of count in range
mdhhiu10 (0 to 9999) 20.0
mdhhi10t14 (10000 to 14999) 0.0
mdhhi15t19 (15000 to 19999) 12.0
mdhhi20t24 (20000 to 24999) 9.0
mdhhi25t29 (25000 to 29999) 0.0
mdhhi30t34 (30000 to 34999) 0.0
mdhhi35t39 (35000 to 39999) 0.0
mdhhi40t44 (40000 to 44999) 0.0
mdhhi45t49 (45000 to 49999) 0.0
mdhhi50t59 (50000 to 59999) 0.0
mdhhi60t74 (60000 to 74999) 0.0
mdhhi75t99 (75000 to 99999) 0.0
mdhi100t124 (100000 to 124999) 0.0
mdhi125t149 (125000 to 149999) 0.0
mdhi150t199 (150000 to 199999) 0.0
mdhhi200pl (200000 to 9999999) 0.0

This, in turn, corresponds with a cumulative count distribution of:

Input variable (range) Cumulative count
mdhhiu10 (0 to 9999) 20.0
mdhhi10t14 (10000 to 14999) 20.0
mdhhi15t19 (15000 to 19999) 32.0
mdhhi20t24 (20000 to 24999) 41.0
mdhhi25t29 (25000 to 29999) 41.0
mdhhi30t34 (30000 to 34999) 41.0
mdhhi35t39 (35000 to 39999) 41.0
mdhhi40t44 (40000 to 44999) 41.0
mdhhi45t49 (45000 to 49999) 41.0
mdhhi50t59 (50000 to 59999) 41.0
mdhhi60t74 (60000 to 74999) 41.0
mdhhi75t99 (75000 to 99999) 41.0
mdhi100t124 (100000 to 124999) 41.0
mdhi125t149 (125000 to 149999) 41.0
mdhi150t199 (150000 to 199999) 41.0
mdhhi200pl (200000 to 9999999) 41.0

And a cumulative percent distribution of:

Input variable (range) Cumulative percent
mdhhiu10 (0 to 9999) 48.78048780487805
mdhhi10t14 (10000 to 14999) 48.78048780487805
mdhhi15t19 (15000 to 19999) 78.04878048780488
mdhhi20t24 (20000 to 24999) 100.0
mdhhi25t29 (25000 to 29999) 100.0
mdhhi30t34 (30000 to 34999) 100.0
mdhhi35t39 (35000 to 39999) 100.0
mdhhi40t44 (40000 to 44999) 100.0
mdhhi45t49 (45000 to 49999) 100.0
mdhhi50t59 (50000 to 59999) 100.0
mdhhi60t74 (60000 to 74999) 100.0
mdhhi75t99 (75000 to 99999) 100.0
mdhi100t124 (100000 to 124999) 100.0
mdhi125t149 (125000 to 149999) 100.0
mdhi150t199 (150000 to 199999) 100.0
mdhhi200pl (200000 to 9999999) 100.0

Calculating median estimates from binned data occurs in the median method of the Median class. This method calculates the median estimate by:

  1. Calculating the sum of all counts (N) within all input bins
  2. Using the cumulative distribution of counts within bins, identifying which bin contains N/2
  3. Using linear interpolation to estimate where within the bin identified in step 2 the median lies
    • First, the difference between N/2 and the total count in all lower bins represents how far into the median-containing bin the median lies.
    • The median is the lower boundary of that bin, plus that difference times the width of the bin divided by the count in that bin.
Median = (Lower boundary of the median-containing bin)
          + (N/2 - (Total count in all bins below the median-containing bin))
            * (Width of the median-containing bin) / (Count within the median-containing bin)

For a video demonstration of median linear interpolation, see here
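
A minimal sketch of this interpolation, treating each input bin as a (lower, upper, count) tuple ordered from lowest to highest (an illustrative helper, not the Median class itself):

    def interpolated_median(ordered_bins):
        # ordered_bins: list of (lower, upper, count), lowest bin first
        N = sum(count for _, _, count in ordered_bins)
        if N == 0:
            return None
        cumulative = 0
        for lower, upper, count in ordered_bins:
            if cumulative + count >= N / 2 and count > 0:
                # N/2 falls in this bin; interpolate assuming a uniform distribution
                return lower + (N / 2 - cumulative) * (upper - lower) / count
            cumulative += count
        return None

    # Using the household income example above:
    # interpolated_median([(0, 9999, 20), (10000, 14999, 0), (15000, 19999, 12),
    #                      (20000, 24999, 9)])  ->  approximately 15208.3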

Top and bottom coding

Some medians undergo top- or bottom-coding, as described in the top_coding and bottom_coding sections of the median metadata.

If top_coding is True, medians falling within the bottom bin are set to the max value of the bottom bin. For example, if a geography's median household income is between 0 and 9999 based on the calculations described in the previous section, the median gets set to 9999. Similarly, if bottom_coding is True, medians falling within the top bin are set to the min value of the top bin. For example, if a geography's median household income is above 200000 based on the calculations described in the previous section, the median gets set to 200000.

MOE calculation

Calculating 1 standard error interval around a 50% proportion

Margins of errors for medians are estimated by calculating a 1 standard error interval around a 50% proportion estimate. First, the Median class calculates the standard error of a 50% proportion (se_50) as:

(Design Factor) * ((93 / (7 * Base)) * 2500) ^ .5

where the Base is the sum of counts in all bins. Design factors are values that account for the fact that the ACS does not use a simple random sample. These values are a ratio of observed standard errors for a variable to the standard errors that would be obtained from a simple random sample of the same size, and come from the Census Bureau.

This standard error is added to and subtracted from 50, creating a 1SE interval around a 50% estimate (with boundaries p_lower and p_upper).

Comparing confidence interval boundaries to the cumulative distribution

Then, p_lower and p_upper are compared to a cumulative percent distribution (see above), cumm_dist, to determine which bins contain the boundaries for a 1SE interval around a 50% proportion. These bins are saved as lower_bin and upper_bin.

For both lower_bin and upper_bin, the next step is to get the following values using the cumulative percent distribution of all input bins:

  • A1: The min value of the bin
  • A2: The min value of the next highest bin
  • C1: The cumulative percentage of counts strictly less than A1 (the percent of total counts falling in bins below the one containing the boundary)
  • C2: The cumulative percentage of counts strictly less than A2 (the percent of total counts falling in bins up to and including the one containing the boundary)

Calculation of A1, A2, C1, C2 for a given p occurs in the method base_case.

A1, A2, C1, and C2 get calculated relative to both lower_bin and upper_bin by calling base_case where _bin = lower_bin and then _bin = upper_bin. These calls happen in lower_bound and upper_bound methods, respectively.

There are several exceptions in which A1, A2, C1, and C2 do not follow the base case. To account for exceptions, the methods lower_bound and upper_bound subsequently modify the base case results according to the following:

Calculate a confidence interval around the median

Once A1, A2, C1, and C2 are set, the method get_bound converts these values into a boundary for the confidence interval around the median.

CI boundary = (p - C1) * (A2 - A1) / (C2 - C1) + A1

This equation is similar to the linear interpolation used in estimate calculation, but uses percent cumulative distributions rather than count cumulative distributions. In estimate calculation, we determined where within a given bin an estimate lies, assuming that all frequencies within that bin are uniformly distributed between the min and max values of the bin. If the median was in the bin 1000 to 1499, which contained 45 counts, we assumed that these 45 counts were evenly distributed between 1000 and 1499.

Estimating where within a bin the boundary for a median confidence interval lies is similar. We first identified which bin contains the percent 1SE away from 50%. From here, we assume that the cumulative percentage of counts contained within that bin is evenly distributed between its two extremes, i.e. if the bin 1000 to 1499 accounts for 30% to 40% of the cumulative counts, we assume that those 10% of total counts are evenly distributed between 1000 and 1499.

The various components of the CI boundary calculation are:

  • (p - C1): The difference between the 1SE boundary for the 50% proportion and the percent of counts that are in all bins below the one containing this boundary
  • (A2 - A1): The width of the bin containing the 1SE boundary for the 50% proportion CI
  • (C2 - C1): The percent of counts that are in the bin containing the boundary for the 50% proportion CI
  • (A1): The lowest value of the bin containing the 1SE boundary for the 50% proportion

The method get_bound is used to calculate both the upper and lower boundaries of the confidence interval. When calculating the lower boundary of the median confidence interval, p refers to p_lower, and A1, A2, C1, C2 are all in reference to p_lower. Similarly, when calculating the upper boundary of the median confidence interval, p refers to p_upper.

Use the confidence interval to calculate the median MOE

The median MOE is calculated from the median confidence interval determined above. This occurs in the median_moe class property.

MOE of the median = (Width of CI around the median) * 1.645 / 2
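
A sketch tying these pieces together (illustrative; the Median class's actual method signatures may differ):

    import math

    def se_50(base, design_factor):
        # Standard error of a 50 percent proportion, given the total count (Base)
        return design_factor * math.sqrt((93 / (7 * base)) * 2500)

    def get_bound(p, A1, A2, C1, C2):
        # Interpolate within the bin containing a 1-SE boundary around 50 percent
        return (p - C1) * (A2 - A1) / (C2 - C1) + A1

    def median_moe(lower_bound, upper_bound):
        # Half the confidence interval width, scaled to a 90 percent confidence level
        return (upper_bound - lower_bound) * 1.645 / 2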

In the following exceptions, the median MOE is set to NULL:

Metadata

Overview

The pff-factfinder package relies on a series of json metadata files. The primary function of these files is to relate a given fact finder variable to input census variables. Because ACS and census tables can change slightly from year-to-year as the Census Bureau adds, drops, or modifies included variables, the metadata files are specific to a release year. The metadata also contains any other pff-variable level information necessary for calculations.

Metadata structure

Metadata for a given factfinder variable are structured as follows

{
    "pff_variable": "lgoenlep1",
    "base_variable": "lgbase",
    "census_variable": [
      "C16001_005",
      "C16001_008",
      "C16001_011",
      "C16001_014",
      "C16001_017",
      "C16001_020",
      "C16001_023",
      "C16001_026",
      "C16001_029",
      "C16001_032",
      "C16001_035",
      "C16001_038"
    ],
    "domain": "community_profiles",
    "rounding": 0,
    "category": "Language Spoken at Home"
  }

pff_variable: The field name, as it appears in the final Population FactFinder data files (not a display name) and as it gets called by the main Calculate class

base_variable: The pff_variable name of the associated base variable. This is the denominator when calculating percent estimates and percent MOEs

census_variable: A list containing all input census variables for a given factfinder variable. These are listed without an "E" or "M" suffix (these suffixes are included in the Census API variable documentation and column headings of downloaded ACS data).

domain: Used for filtering of final outputs. For variables used in Population FactFinder, these are "housing", "demographic", "economic", or "social". For variables used only in the Community Profiles datasets, the domain is "community_profiles".

rounding: Variable-specific number of digits for rounding final output estimates and MOEs

category: The category label assigned to the variable (e.g. "Language Spoken at Home"), used by the front-end application to group variables for display

Median metadata

Variables that are medians have additional metadata, as follows.

"mdage": {
        "design_factor": 1.1,
        "top_coding": true,
        "bottom_coding": true,
        "ranges": {
            "mdpop0t4": [
                0,
                4.9999
            ],
            "mdpop5t9": [
                5,
                9.9999
            ],
            ...
            "mdpop85pl": [
                85,
                115
            ]
        }

design_factor: design factor values that account for the fact that the ACS does not use a simple random sample. These values are a ratio of observed standard errors for a variable to the standard errors that would be obtained from a simple random sample of the same size, and come from the Census Bureau.

top_coding: if True, medians falling within the bottom bin are set to the upper bound of the bottom bin. For example, if a geography's median age is between 0 and 4.999 based on the example above, the median gets set to 4.999.

bottom_coding: if True, medians falling within the top bin are set to the lower bound of the top bin. For example, if a geography's median age is between 85 and 115 based on the example above, the median gets set to 85.

ranges: The upper and lower values associated with each input pff_variable, where the inputs are counts of either people or households with a characteristic falling in a particular range

Metadata class

The Metadata class parses and reads the metadata json files as a whole. This class contains properties differentiating different types of population factfinder variables. These lists inform which methodology is appropriate when aggregating census variables (either horizontally or vertically) to calculate a pff_variable.

  • median_variables is a list of all pff_variable names referring to medians
  • median_inputs is a list of all pff_variables that are inputs to median calculations
  • median_ranges is a dictionary containing the value ranges associated with each median input variable
  • special_variables is a list of all pff_variables that require special calculations upon aggregation. These calculations are variable-specific functions contained in special.py.
  • profile_only_variables is a list of pff_variables for which percent and percent MOE are available from the Census API, and so are downloaded rather than calculated for geotypes available in the census API (i.e. not NYC-specific geographies). These variables are ones pulled from the census profile tables. The census variable names of profile variables have the prefix "DP".
  • profile_only_exceptions is a list of pff_variables pulled from profile tables (their census variable names have a "DP" prefix), but for which percent and percent MOE are calculated for all geotypes.
  • base_variables is a list of all pff_variables that serve as the base (denominator) for the percent and percent MOE calculation of another pff_variable.

Other methods include:

Variable class

The Variable class reads and parses the metadata files with reference to a particular pff_variable. The census_variables method returns the census variable names associated with estimate (E), margin of error (M), percent estimate (PE), and percent margin of error (PM) of a given pff_variable. The create_census_variables method splits a given list of partial census variable names (e.g. ["B01001_044", "B01001_045"]) into a tuple of estimate and MOE census variable names (i.e. ["B01001_044E", "B01001_045E"], ["B01001_044M", "B01001_045M"]). Other Variable properties include the domain of the variable (i.e. "economic"), the base_variable (the name of the pff_variable serving as a denominator when calculating percent and percent MOE, where applicable), the number of decimal places to retain in the final rounded estimate and margin of error, and the category assigned to a variable by Labs' front-end application.
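
For example, the splitting behavior described for create_census_variables could be sketched as:

    def create_census_variables(census_variables):
        # Append "E" for estimates and "M" for margins of error
        e_variables = [v + "E" for v in census_variables]
        m_variables = [v + "M" for v in census_variables]
        return e_variables, m_variables

    # create_census_variables(["B01001_044", "B01001_045"])
    # -> (["B01001_044E", "B01001_045E"], ["B01001_044M", "B01001_045M"])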

Metadata maintenance

Maintaining the metadata files is a largely manual process. Metadata undergoes several updates between yearly releases of data. These include:

  • If a Census Bureau table has a change in schema resulting in shifted columns, the census_variable portion of metadata likely needs updates to reflect new column numbers
  • If Census Bureau tables containing median inputs change to include either more or fewer binned counts, the ranges portion of median metadata will need to get updated
  • If the Census Bureau releases new design factors associated with median input tables, the design_factor portion of the median metadata will need to get updated
  • If PFF variables are either discontinued or introduced (due to upstream Census Bureau changes or otherwise), these variables will need to get either added or removed from the metadata

Calculating Percent Estimates and MOEs

The Calculate class method calculate_c_e_m_p_z first calculates the estimate and margin of error of a variable, then calculates percent and percent MOE as appropriate.

Calculating pff_variable and base_variable estimate and MOEs

Calculating percent estimates and MOEs requires the estimate and MOE of the numerator (the PFF variable of interest), as well as the estimate and MOE of the denominator (the variable representing the population of which the PFF variable is a subset). For example, if we were to calculate the estimated percent of workers 16+ who commute by walking, the PFF variable is the estimated count of workers 16+ who commute by walking, while the denominator is the estimated count of workers 16+. In our metadata and code, the denominator variables are referred to as "base variables". Base variables for each PFF variable are stored in the metadata, and are accessible as the base_variable property of an instance of the Variable class.

In the most basic case, the percent and percent MOE are calculated by first calling calculate_e_m to calculate the estimate and MOE of the pff_variable. Then, calculate_e_m gets called again, this time calculating the estimate and MOE of the associated base variable.

As with other pff_variables, a base variable can be a special variable or a median. If the base variable is a special variable, calculate_e_m_special gets called in place of calculate_e_m. Similarly, if the base variable is a median, calculate_e_m_median gets called in place of calculate_e_m. For more information on exceptions when calculating estimates and MOEs, see here.

Combining pff_variable and base_variable results to calculate p and z

Once estimates and MOEs are calculated for both the pff_variable and its base_variable, the two resulting DataFrames get merged, with e and m of the base variable renamed as e_agg and m_agg.

Once merged, the percent estimate is calculated using get_p:

Percent Estimate = (Estimate of PFF variable) / (Estimate of Base Variable) * 100, if (Estimate of Base Variable) is not NULL

The percent MOE is calculated using get_z, which is based on the methodology outlined in the Census Bureau's guidance on calculating MOEs of derived estimates.

If the PFF percent estimate is 0 or 100, percent MOE is NULL
If the base variable estimate is 0, percent MOE is NULL
Otherwise,
    Percent MOE = (Square root of 
                      (Squared PFF MOE - Squared (PFF estimate * Base MOE / Base Estimate))
                  ) / Base Estimate * 100

In cases where the value under the square root is negative, the ratio MOE formula is used instead:
    Percent MOE = (Square root of 
                      (Squared PFF MOE + Squared (PFF estimate * Base MOE / Base Estimate))
                  ) / Base Estimate * 100
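
A sketch of both calculations for a single row (illustrative; the package applies the same logic column-wise to the merged DataFrame):

    import math

    def get_p(e, e_agg):
        # Percent estimate: PFF estimate as a percent of the base (denominator) estimate
        if e_agg is None or e_agg == 0:
            return None
        return e / e_agg * 100

    def get_z(e, m, e_agg, m_agg, p):
        # Percent MOE, following the derived-estimate guidance described above
        if p is None or p in (0, 100) or e_agg == 0:
            return None
        under_root = m ** 2 - (e * m_agg / e_agg) ** 2
        if under_root < 0:
            # Fall back to the ratio MOE formula when the term under the root is negative
            under_root = m ** 2 + (e * m_agg / e_agg) ** 2
        return math.sqrt(under_root) / e_agg * 100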

Exceptions to calculating base estimate and MOE

Profile-only variables: There are several variables where percent and percent MOE are available directly from the Census API. The Census Bureau variable documentation and table column headers indicate these with suffixes "PE" and "PM". If available, we do not calculate the base estimate and MOE, and instead pull estimate, MOE, percent estimate, and percent MOE from the API directly using calculate_e_m_p_z, as called by calculate_c_e_m_p_z. This is only possible for census geography types (i.e. non-aggregated geography types), and for variables from the DP-prefixed profile tables. The variables that have percent and percent MOEs available directly from the Census API are listed in the profile_only_variables property of the Metadata class. There are 10 DP-prefixed profile variables that do not have percent estimates and MOEs available, which are listed in the profile_only_exceptions property of the Metadata class. For aggregated geography types, the percent and percent MOE are calculated using a base variable as described above.

Poverty variables: There are three poverty-related variables where the percent and percent MOE are available directly from the Census API (after 2010): "pbwpv", "pu18bwpv", and "p65plbwpv". Unlike the profile-only variables above, the percent and percent MOE are not indicated with suffixes "PE" and "PM", but are instead stored as estimates ("E" suffix) and MOEs ("M" suffix) of a separate variable. These separate, percent variables are in the PFF metadata as {pff_variable}_pct. The function calculate_poverty_p_z calls calculate_e_m on {pff_variable}_pct, renames e and m of the results as p and z respectively. The function calculate_poverty_p_z is called by calculate_c_e_m_p_z in place of calling calculate_e_m for the base variable. For aggregated geography types, the percent and percent MOE are calculated using a base variable as described above.

Variables without p and z

Several variables do not have p or z values. This is indicated in the metadata where base_variable is "nan". These include variables that are means, rent/cost values or burdens, and variables that already represent a percent of a population. When calculate_c_e_m_p_z is called for these variables, p and z are set to NULL.

For base variables, p is set to 100 if the geography type is either a city or borough. Otherwise, p is NULL. In both cases, z is NULL.

Aggregating small areas to larger areas

If the requested geography type is not a Census geography, tract- or block-group level PFF data are aggregated to calculate e, m, p, and z for larger geographies. For example, rows containing tract-level estimates and MOEs for a given PFF variable get combined to produce NTA-level estimates and MOEs.

In general, aggregate geography types include:

  • NTA
  • CDTA (post 2020)
  • Portion of Community Districts within 100 year floodplain
  • Portion of Community Districts within 500 year floodplain
  • Portion of Community Districts within walking distance of a park

Note: when converting data published in 2010 geographies to 2020 geographies, 2020 census tracts function as an aggregate geography type.

Geographic relationships

Relationships between geographic areas are maintained in the directory data/lookup_geo. Lookups are specific to a decennial census year, since tract boundaries change each decade. When the Calculate class is initialized, the specified geography year determines which spatial lookup is referenced.

AggregateGeography Class

Each decennial year corresponds with a different version of the AggregateGeography class. These classes are defined in year-specific python files in the geography directory.

Class properties and methods

lookup_geo: The AggregateGeography class (for both years) contains a property lookup_geo. This property is a DataFrame with parsed columns from the geographic lookups in the data directory.

options: This property contains a lookup between Census geography types, aggregate geography types they can be combined into, and the function necessary for converting raw geography types to aggregate geography types. For example, one record in the 2010 lookup is:

"tract": {"NTA": self.tract_to_nta, "cd": self.tract_to_cd}

Both NTA- and CD-level data are built from tract-level raw data. The function for aggregating tract-level data into NTA-level data is tract_to_nta, and the function for aggregating tract-level data into CD-level data is tract_to_cd.

aggregated_geography: This property is a list of all aggregated geography types for a given year, e.g. ["nta", "cd", "cd_fp_500", "cd_fp_100", "cd_park_access"].

format_geoid and format_geotype: These methods convert FIPS census geoids and types into the format displayed in Planning Labs' application, as implemented in labs_geotype. See the final output cleaning documentation page for more info.

Functions to convert smaller to larger geography types

The majority of methods in the AggregateGeography class aggregate tract- or block group-level data into larger geographies. While the methods are specific to the geography type, they follow a similar structure. Consider the example:

Tract-level data

geoid     Estimate   MOE
tract 1   1          2
tract 2   3          4
tract 3   5          6
tract 4   7          8

Geo-lookup

tract_geoid   nta_geoid   ...
tract 1       NTA 1       ...
tract 2       NTA 1       ...
tract 3       NTA 2       ...
tract 4       NTA 3       ...

  1. Join tract/block-group data with lookup_geo (which defines how small geographies nest within larger ones) on geoid_tract or geoid_block_group. Following the example data above, this would produce:

     geoid     Estimate   MOE   nta_geoid
     tract 1   1          2     NTA 1
     tract 2   3          4     NTA 1
     tract 3   5          6     NTA 2
     tract 4   7          8     NTA 3

  2. Call the function create_output in order to group by the aggregate geography geoid. Within each group, estimates are summed, and MOEs are aggregated using the square root of the sum of squares, defined in agg_moe (see the sketch after these steps).

     nta_geoid   Estimate      MOE
     NTA 1       (1 + 3) = 4   SQRT(2^2 + 4^2) = SQRT(20)
     NTA 2       5             6
     NTA 3       7             8

  3. Rename the geoid column to standardize output.

     census_geoid   Estimate      MOE
     NTA 1          (1 + 3) = 4   SQRT(2^2 + 4^2) = SQRT(20)
     NTA 2          5             6
     NTA 3          7             8
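
A sketch of steps 1 and 2 in pandas (illustrative; column names follow the example tables above):

    import numpy as np
    import pandas as pd

    tracts = pd.DataFrame(
        {
            "geoid": ["tract 1", "tract 2", "tract 3", "tract 4"],
            "e": [1, 3, 5, 7],
            "m": [2, 4, 6, 8],
        }
    )
    lookup_geo = pd.DataFrame(
        {
            "geoid_tract": ["tract 1", "tract 2", "tract 3", "tract 4"],
            "nta_geoid": ["NTA 1", "NTA 1", "NTA 2", "NTA 3"],
        }
    )

    def agg_moe(moes):
        # Square root of the sum of squared MOEs
        return np.sqrt((moes ** 2).sum())

    joined = tracts.merge(lookup_geo, left_on="geoid", right_on="geoid_tract")
    ntas = joined.groupby("nta_geoid").agg(e=("e", "sum"), m=("m", agg_moe)).reset_index()
    # NTA 1: e = 4, m = sqrt(20); NTA 2: e = 5, m = 6; NTA 3: e = 7, m = 8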

Special case: Converting 2010 tracts to 2020 aggregate geographies

When converting 2010 input geographies to 2020 outputs, the methods described in the previous section contain an additional step. Prior to step one, in which tract-level data are joined with lookup_geo, 2010 tracts are converted to 2020 tracts using the method ct2010_to_ct2020. The following steps proceed as described above, using 2020 tract-level data as the input to further vertical aggregation.

For more information about tract-to-tract conversion, see the 2010 to 2020 geography conversion documentation page.
