Fetching US census data with R: a short tutorial

In this short, self-contained post, I show how to download and tidy a cool data set of US census data. The post is meant to be the first of a series showcasing data scraping, wrangling, visualization and modelling.

library(acs)
library(magrittr)
library(dplyr)

Step 1: Get your free api key

Here’s the link
Install your api key:

api.key.install('your-api-key-here')

Step 2: Select a survey and a table to explore.

There is an excellent variety of data you can play with !
Here’s the link

Step 3: Fetch the data.

In the example below, I’ve chosen to explore the table number B15001 in the 2010-2014 American Community Survey 5-Year Estimates, which contains data related to educational level, by age and gender.

First you need to call the geo.make function. The call below simply means “give me all states and counties”

county_geo <- geo.make(state = '*', county = '*')

Then we actually get the data. It’s important to set the col.names parameter to “pretty” to get understandable variable names.

education_acs <- acs.fetch(geography = county_geo, table.number = 'B15001', 
          col.names = 'pretty', endyear = 2014, span = 5)

What you have at this point is an acs object.
Two slots are of particular interest to us. First, the @geography slot which contains the state/county name in the $NAME attribute.
Let’s go ahead and extract those right away:

state_county <- education_acs@geography$NAME %>% str_split(",")

county <- state_county  %>%
            sapply(`[[`, 1) %>%
            str_replace(' County', '')

state <-  state_county %>%
            sapply(`[[`, 2)

And the @estimate slot which contains the actual census values.
The @estimate slot is actually a matrix:

str(education_acs@estimate)
##  num [1:3220, 1:83] 40922 148019 21257 17750 43768 ...
##  - attr(*, 'dimnames')=List of 2
##   ..$ : chr [1:3220] 'Autauga County, Alabama' 'Baldwin County, Alabama' 'Barbour County, Alabama' 'Bibb County, Alabama' ...
##   ..$ : chr [1:83] 'SEX BY AGE BY EDUCATIONAL ATTAINMENT FOR THE POPULATION 18 YEARS AND OVER: Total:' 'SEX BY AGE BY EDUCATIONAL ATTAINMENT FOR THE POPULATION 18 YEARS AND OVER: Male:' 'SEX BY AGE BY EDUCATIONAL ATTAINMENT FOR THE POPULATION 18 YEARS AND OVER: Male: 18 to 24 years:' 'SEX BY AGE BY EDUCATIONAL ATTAINMENT FOR THE POPULATION 18 YEARS AND OVER: Male: 18 to 24 years: Less than 9th grade' ...

Step 4: Tidy your data

As you can see, there is a separate column for every level of drilldown.
So We have a little bit of work in order to get a tidy dataset. Come on then, let’s get to it !

In the end, what we really want, is a dataframe in long format, with the state and county variables, then one variable for the education level, one for the age group, one for the gender, and finally the census value.

Because there are so many columns, and also because in R we trust,
there is no way in hell we are going to do this manually.

You can check the code below if you’re curious, suffice it to say that the expand.grid base function was super useful !

df_education = NULL

education <- c('Total', '< 9th grade', '< 12th grade', 'High school', 'College', 'Associate', 'Bachelor', 'Graduate')
age <- c('18-24', '25-34', '35-44', '45-64', '65+')
sex = c('Male', 'Female')

columns = c(3:42, 44:83)
df_exp <- expand.grid(education=education, age=age, sex=sex)

for(i in 1:length(columns)){
  df_education <- 
    rbind(df_education, 
        data.frame(
            county=county, 
                state=state,
                sex=(df_exp$sex)[i], 
                age=(df_exp$age)[i],                                                education=(df_exp$education)[i],                                                            value=education_acs@estimate[,columns[i]])
        )
}

I had to include the ‘Total’ level of education in the loop because of the way the columns of the matrix are enumerated, but I don’t actually want to keep it:

df_education %>% filter(education != 'Total')

And guess what my friends…we now have a freakin cool dataset with 5 variables, 1 value and 225400 observations:

head(df_education)
##    county    state  sex   age   education value
## 1 Autauga  Alabama Male 18-24 < 9th grade    36
## 2 Baldwin  Alabama Male 18-24 < 9th grade   172
## 3 Barbour  Alabama Male 18-24 < 9th grade   123
## 4    Bibb  Alabama Male 18-24 < 9th grade    68
## 5  Blount  Alabama Male 18-24 < 9th grade    39
## 6 Bullock  Alabama Male 18-24 < 9th grade     0

That’s it for today ! We’re gonna keep it short & sweet. Our data is ready, we can take a break and come back later to play with it.

In the meantime, if you want to get a headstart, or have any suggestions, please feel free to leave a comment.

Next time, we will be doing some vizualisations using this dataset, and maybe we’ll do some webscraping and merge interesting information onto it.

Come back soon !

The full, reproductible and ready to spin code can be found here.
Do remember to install your api key first !

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s