
Fetching US census data with R: a short tutorial

In this short, self-contained post, I show how to download and tidy a cool data set of US census data. The post is meant to be the first of a series showcasing data scraping, wrangling, visualization and modelling.

library(acs)
library(stringr)
library(magrittr)
library(dplyr)

Step 1: Get your free API key

Here’s the link
Install your API key:

api.key.install('your-api-key-here')

Step 2: Select a survey and a table to explore.

There is an excellent variety of data you can play with!
Here’s the link

Step 3: Fetch the data.

In the example below, I’ve chosen to explore the table number B15001 in the 2010-2014 American Community Survey 5-Year Estimates, which contains data related to educational level, by age and gender.

First you need to call the geo.make function. The call below simply means “give me all states and counties”

county_geo <- geo.make(state = '*', county = '*')

Then we actually get the data. It’s important to set the col.names parameter to “pretty” to get understandable variable names.

education_acs <- acs.fetch(geography = county_geo, table.number = 'B15001', 
          col.names = 'pretty', endyear = 2014, span = 5)

What you have at this point is an acs object.
Two slots are of particular interest to us. First, the @geography slot which contains the state/county name in the $NAME attribute.
Let’s go ahead and extract those right away:

state_county <- education_acs@geography$NAME %>% str_split(",")

county <- state_county  %>%
            sapply(`[[`, 1) %>%
            str_replace(' County', '')

state <-  state_county %>%
            sapply(`[[`, 2)

Second, the @estimate slot, which contains the actual census values. It is actually a matrix:

str(education_acs@estimate)
##  num [1:3220, 1:83] 40922 148019 21257 17750 43768 ...
##  - attr(*, 'dimnames')=List of 2
##   ..$ : chr [1:3220] 'Autauga County, Alabama' 'Baldwin County, Alabama' 'Barbour County, Alabama' 'Bibb County, Alabama' ...
##   ..$ : chr [1:83] 'SEX BY AGE BY EDUCATIONAL ATTAINMENT FOR THE POPULATION 18 YEARS AND OVER: Total:' 'SEX BY AGE BY EDUCATIONAL ATTAINMENT FOR THE POPULATION 18 YEARS AND OVER: Male:' 'SEX BY AGE BY EDUCATIONAL ATTAINMENT FOR THE POPULATION 18 YEARS AND OVER: Male: 18 to 24 years:' 'SEX BY AGE BY EDUCATIONAL ATTAINMENT FOR THE POPULATION 18 YEARS AND OVER: Male: 18 to 24 years: Less than 9th grade' ...

Step 4: Tidy your data

As you can see, there is a separate column for every level of drilldown, so we have a little bit of work to do in order to get a tidy dataset. Come on then, let’s get to it!

In the end, what we really want is a data frame in long format, with the state and county variables, then one variable for the education level, one for the age group, one for the gender, and finally the census value.

Because there are so many columns, and also because in R we trust,
there is no way in hell we are going to do this manually.

You can check the code below if you’re curious; suffice it to say that the base function expand.grid was super useful!
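Just to make that concrete, here is a tiny illustration (an aside of mine, not output from the post): expand.grid varies its first argument fastest, which lines up with the way the B15001 columns drill down (education within age within sex).

# expand.grid cycles its first factor fastest: education within age (within sex)
head(expand.grid(education = c('Total', '< 9th grade'), age = c('18-24', '25-34')), 4)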

df_education = NULL

education <- c('Total', '< 9th grade', '< 12th grade', 'High school', 'College', 'Associate', 'Bachelor', 'Graduate')
age <- c('18-24', '25-34', '35-44', '45-64', '65+')
sex = c('Male', 'Female')

columns = c(3:42, 44:83)
df_exp <- expand.grid(education=education, age=age, sex=sex)

for(i in 1:length(columns)){
  df_education <-
    rbind(df_education,
          data.frame(county = county,
                     state = state,
                     sex = df_exp$sex[i],
                     age = df_exp$age[i],
                     education = df_exp$education[i],
                     value = education_acs@estimate[, columns[i]]))
}

I had to include the ‘Total’ level of education in the loop because of the way the columns of the matrix are enumerated, but I don’t actually want to keep it:

df_education <- df_education %>% filter(education != 'Total')

And guess what my friends…we now have a freakin cool dataset with 5 variables, 1 value and 225400 observations:

head(df_education)
##    county    state  sex   age   education value
## 1 Autauga  Alabama Male 18-24 < 9th grade    36
## 2 Baldwin  Alabama Male 18-24 < 9th grade   172
## 3 Barbour  Alabama Male 18-24 < 9th grade   123
## 4    Bibb  Alabama Male 18-24 < 9th grade    68
## 5  Blount  Alabama Male 18-24 < 9th grade    39
## 6 Bullock  Alabama Male 18-24 < 9th grade     0

That’s it for today! We’re gonna keep it short & sweet. Our data is ready; we can take a break and come back later to play with it.

In the meantime, if you want to get a head start, or have any suggestions, please feel free to leave a comment.

Next time, we will do some visualizations using this dataset, and maybe some web scraping to merge interesting information onto it.

Come back soon!

The full, reproducible and ready-to-spin code can be found here.
Do remember to install your API key first!

USA – Ethnicity & Income

Following up on my previous post, and inspired by datausa.io, it hit me that I had a pretty cool dataset I could explore, thanks to the choroplethr package.

The dataset I’m referring to comes from the US 2013 Census data. Let’s take a look:

require(choroplethr)
require(dplyr)
require(magrittr)
require(ggplot2)
api.key.install('your-api-key-here')
demographics=get_county_demographics()
head(demographics)
##   region total_population percent_white percent_black percent_asian
## 1   1001            54907            76            18             1
## 2   1003           187114            83             9             1
## 3   1005            27321            46            46             0
## 4   1007            22754            75            22             0
## 5   1009            57623            88             1             0
## 6   1011            10746            22            71             0
##   percent_hispanic per_capita_income median_rent median_age
## 1                2             24571         668       37.5
## 2                4             26766         693       41.5
## 3                5             16829         382       38.3
## 4                2             17427         351       39.4
## 5                8             20730         403       39.6
## 6                6             18628         276       39.6

One thing in particular I was curious about was this: how can I show, on a map, how hard a particular group of the population (e.g. the black population) is hit by poverty?

We do have a percentage of black population as well as a per-capita income for each county. Let us merge this data onto the county.regions data from the choroplethrMaps package:

library(choroplethrMaps)
data(county.regions)

county_data  <- county.regions %>% left_join(demographics)
## Joining by: 'region'
head(county_data)
##   region county.fips.character county.name state.name state.fips.character
## 1   1001                 01001     autauga    alabama                   01
## 2   1003                 01003     baldwin    alabama                   01
## 3   1005                 01005     barbour    alabama                   01
## 4   1007                 01007        bibb    alabama                   01
## 5   1009                 01009      blount    alabama                   01
## 6   1011                 01011     bullock    alabama                   01
##   state.abb total_population percent_white percent_black percent_asian
## 1        AL            54907            76            18             1
## 2        AL           187114            83             9             1
## 3        AL            27321            46            46             0
## 4        AL            22754            75            22             0
## 5        AL            57623            88             1             0
## 6        AL            10746            22            71             0
##   percent_hispanic per_capita_income median_rent median_age
## 1                2             24571         668       37.5
## 2                4             26766         693       41.5
## 3                5             16829         382       38.3
## 4                2             17427         351       39.4
## 5                8             20730         403       39.6
## 6                6             18628         276       39.6

The idea will be to rely on the relatively unknown yet uber-cool harmonic mean. This mean has the property that it is high only when both components (in my case I will have only two) are high.

Here’s the formula for the harmonic mean of two numbers:

harmonic_mean = function(x, y){
  (2*x*y)/(x+y)
}
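To see why it fits here, a quick check: the harmonic mean is dragged down by the smaller of its two inputs, so it is only high when both are high.

harmonic_mean(0.9, 0.9)   # 0.90: both high -> high
harmonic_mean(0.9, 0.1)   # 0.18: one low   -> low
harmonic_mean(0.5, 0.5)   # 0.50: both middling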

So I will be using the harmonic mean of the percentage of black population and the per-capita income. But to make those measures play nice together, let us first transform them into ranks and map those ranks to the [0,1] interval:

county_data$percent_black_ranked <- rank(county_data$percent_black, ties.method='first')
county_data$percent_black_ranked <- county_data$percent_black_ranked / max(county_data$percent_black_ranked)

county_data$per_capita_income_ranked  <- rank(county_data$per_capita_income, ties.method='first')
county_data$per_capita_income_ranked <- 1 - county_data$per_capita_income_ranked / max(county_data$per_capita_income_ranked)

Notice that I took the complement of per-capita income rank to show poverty.

We are now ready to create our map:

county_data$value  <- harmonic_mean(county_data$percent_black_ranked, county_data$per_capita_income_ranked)

county_choropleth(county_data) + scale_fill_brewer(palette=3) + ggtitle('Map of (black % of population & poverty) index')

[Figure: map of the (black % of population & poverty) index]

The dark areas are where there is a both a relatively high percentage of black population and a low per-capita income.

The same exercise with the hispanic population yields:

[Figure: map of the (Hispanic % of population & poverty) index]

Out of curiosity, let us look at white % of population and wealth:

county_data$percent_white_ranked <- rank(county_data$percent_white, ties.method='first')
county_data$percent_white_ranked <- county_data$percent_white_ranked / max(county_data$percent_white_ranked)

county_data$per_capita_income_ranked  <- rank(county_data$per_capita_income, ties.method='first')
county_data$per_capita_income_ranked <- county_data$per_capita_income_ranked / max(county_data$per_capita_income_ranked)
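A minimal sketch of the remaining step, assuming the same harmonic-mean recipe as before:

county_data$value <- harmonic_mean(county_data$percent_white_ranked,
                                   county_data$per_capita_income_ranked)

county_choropleth(county_data) + scale_fill_brewer(palette=3) + ggtitle('Map of (white % of population & wealth) index')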

[Figure: map of the (white % of population & wealth) index]

US 2016 elections

This post is my contribution to the challenge set up by Ari Lamstein. I wish I had more time to spend on it, but it did at least give me the opportunity to learn to use dplyr, reshape2, ggplot, Shiny, and the choroplethr package. I’m also making heavy use of rvest, which I have explored already in some of my previous posts.

This post is divided in three sections. First, using ggplot, we chart a time view of the race, for both parties. Then we use the choroplethr package to try and display the performance of the candidates in relation to a demographic variable. Finally, a simple Shiny app provides us another way to visualize and slice through the data.

Excited? Let’s get started!

Section 1: Time view of the race

I’ve decided not to explicitly show the code in this post, but if you are interested, you can find the complete code for this section here. Should you have any questions or comments, don’t hesitate to leave them below.

My data is scraped from: http://www.politico.com/2016-election/results/map/president. At the time of writing of this post, the website contained election results up to April 9th.

So, here’s how the race is shaping up so far for the Republican party:

[Figure: Republican race, cumulative pledged delegates over time]

The grey solid horizontal line is the finish line: it’s the number of pledged delegates a candidate must have in order to be declared the winner (in theory – that is, assuming the pledged delegates don’t change their minds!). The dashed grey line represents 50% of the running total number of delegates: it’s a way to quickly visualize how well the leading candidate (in this case Trump) is doing compared to where he needs to be come the end of the race.

An important observation here is that the red line is below the dashed grey line, which means that even though he is way ahead of his main rival, Trump still needs to pick up the pace if he wants a guaranteed win. And it is no secret that if he doesn’t get the guaranteed win, he is toast. Which is why you’ll often hear Trump say that Kasich should quit, because he actually needs those votes.

Compare this to the picture in the Democratic party:

[Figure: Democratic race, cumulative pledged delegates over time]

Here you can see that not only does Clinton have a comfortable lead over her opponent, she is also well poised to have the majority by the end of the race, if the trend continues.
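For the curious, here is a sketch of how such a running-total chart can be assembled (the actual code is linked above; delegate_counts and its column names are illustrative assumptions, not the real objects):

library(dplyr)
library(ggplot2)

# hypothetical input: one row per contest date and candidate,
# with the number of pledged delegates won on that date
race <- delegate_counts %>%
  arrange(date) %>%
  group_by(candidate) %>%
  mutate(cumulative = cumsum(delegates)) %>%
  ungroup()

# the dashed grey line: 50% of the running total of delegates awarded so far
half_line <- delegate_counts %>%
  group_by(date) %>%
  summarise(awarded = sum(delegates)) %>%
  mutate(half = cumsum(awarded) / 2)

ggplot(race, aes(date, cumulative, colour = candidate)) +
  geom_line() +
  geom_hline(yintercept = 1237, colour = 'grey') +   # 1237 = Republican majority threshold in 2016
  geom_line(data = half_line, aes(date, half),
            inherit.aes = FALSE, colour = 'grey', linetype = 'dashed')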

Section 2: Map view

We now turn our attention to the county-by-county election results. In this section we will focus on the two leading candidates of each party: Clinton and Sanders for the Democrats, Trump and Cruz for the Republicans. We will be counting the votes this time (instead of the delegate counts).

For this purpose, I have normalized the vote counts to be as if only the two leading candidates were competing (so that their combined vote percentages would sum to 1). This does introduce some bias in the results, but it is an easy and reasonable way to deal with the fact that the number of candidates changed through time due to some candidates dropping from the race.
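A minimal sketch of that normalization with dplyr (county_results and its column names are illustrative assumptions, not the actual objects):

library(dplyr)

# hypothetical columns: state, county, candidate, votes
two_way <- county_results %>%
  filter(candidate %in% c('D. Trump', 'T. Cruz')) %>%
  group_by(state, county) %>%
  mutate(vote_pct = votes / sum(votes)) %>%   # the two shares now sum to 1 in each county
  ungroup()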

I tried scraping the same website as in section 1 (you can click on the Detailed Results link to get the county results)… unsuccessfully, because the page content is dynamically generated via JavaScript.
So I resorted to:

  1. manually scrolling down the page to force-generate the HTML
  2. saving the HTML code (thanks to the Google Chrome inspector) to my hard drive
  3. scraping the code locally with rvest

I’ve copied all the files (each file corresponds to a state) to Google Drive. Please note, if you want to run the code, that the files must be in a “states” subfolder of your working directory.
Also note that I haven’t used the results of Kansas and Minnesota… because they are reported by district instead of by county, and I couldn’t find a decent mapping between the two (if you do, please let me know!).
After that, I used the choroplethr package to download US census data (5 years, as of 2013). It is ridiculously easy. If you’re interested in knowing more, Ari has written a tutorial just for that.

A few mutate, filter, dcast and melt later, I had my desired results: the number of votes by state, county and candidate, merged with the US census demographics data. If you are interested in playing with this dataset, I am also making it available here.

I was particularly interested in the relationship between Trump’s popularity and the proportion of the Hispanic population in a given county.
In the map below, I have used hue to represent the popularity of Trump: counties where he was more popular (his vote percentage is higher than his country-wide median) are in green, and counties where he is less popular are in purple.
The same idea was applied to the Hispanic population percentage, using saturation this time: a saturated color for counties with a percentage of Hispanics in the top 50%, and a pale color for the others.
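Here is a sketch of how such a two-way encoding can be built, assuming a merged data frame county_df with region, trump_pct and percent_hispanic columns (illustrative names, not the actual objects); with a factor value column, choroplethr should use a discrete scale:

# above / below the country-wide medians: the hue and saturation splits described above
popular  <- county_df$trump_pct        > median(county_df$trump_pct, na.rm = TRUE)
top_hisp <- county_df$percent_hispanic > median(county_df$percent_hispanic, na.rm = TRUE)

county_df$value <- factor(paste(
  ifelse(popular,  'Trump above median', 'Trump below median'),
  ifelse(top_hisp, 'top-50% Hispanic',   'bottom-50% Hispanic'),
  sep = ' / '))

# a manual fill scale can then map the four levels to green/purple, saturated/pale
county_choropleth(county_df[, c('region', 'value')])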
And here is the result:

[Figure: map of Trump's popularity vs. the Hispanic population share, by county]

Based on this, would you say that Trump is unpopular among Hispanics? I guess it depends! Looking for example at Florida and Texas, which both have most counties in the top 50% when it comes to percent Hispanic, one could say that the Hispanics of Florida love Trump, while those of Texas, not so much… could it have something to do with the infamous Mexican wall…

Another relationship I was curious about is that between Sanders’s popularity and the per capita income in a county:

[Figure: map of Sanders's popularity vs. per-capita income, by county]

Here the relationship is pretty obvious! A lot of saturated green, and a lot of pale purple, as well as a clear north/south division. Sanders is popular in the North and North-West, and in top-50% counties with regard to per-capita income.

If you’re interested, you can easily map other relationships. Get the code here.
The function call is as easy as:

plot_map("B. Sanders", "per_capita_income")

Section 3: Shiny

In this section, I wanted to be able to slice through the data to get an overall feel of how well a candidate is doing with regard to a particular demographic variable. Whereas in the preceding section we could quickly see the differences between states, here the focus is more on the demographic variable.

You can find the app here, and the code here.

For example, let us take a second look at the relationship between Trump’s popularity and the proportion of Hispanic population:

[Figure: Shiny app, Trump's popularity vs. percent Hispanic]

What about Sanders and per capita income:

[Figure: Shiny app, Sanders's popularity vs. per capita income]

And, just for fun, Ted Cruz with median age:

[Figure: Shiny app, Cruz's popularity vs. median age]

Finally, let us look at all candidates, with the percentage of black population:

[Figure: Shiny app, all candidates vs. percent black]

Whoa! The difference between Clinton and Sanders there is astonishing!

That’s it for today. I hope you enjoyed this post, and that you will have fun playing with the Shiny app.

Loss ratio by variable

In my line of work (property & casualty insurance), one ratio we are constantly monitoring is the loss ratio, which is defined as the ratio of the loss amounts paid to the insured to the premium earned by the insurance company.

Ignoring non-claim-specific expenses (e.g. general expenses, marketing, etc.) and investment returns, a loss ratio of 100% means that the average premium we charge the insured covers exactly the average loss amount, i.e. we break even (the premium corresponding to a loss ratio of 100% is called the pure premium). Note that, everything else being equal, a lower loss ratio translates into higher profits for the insurance company.
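For example, a trivial numeric illustration of the definition:

# e.g. $650 of losses paid on $1,000 of earned premium:
650 / 1000   # loss ratio of 65%; at 100% the premium charged would be the pure premium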

Sometimes, we want to look at this loss ratio conditionally on the values of a specific variable. That is, for a particular segmenting variable (age, gender, territory, etc.), we are interested in knowing the loss ratio for each category in this variable, along with the earned premium as a percentage of the total earned premium.

For example, such a report for the Age variable could look like this:

age_group = c("A. [16 - 25]", "B. [25 - 39]", "C. [40 - 64]", "D. [65+ ]   ")
weights = c(0.25, 0.1, 0.35, 0.3)
loss.ratios = c(0.6, 0.9, 0.55, 0.4)

df = data.frame(age_group=age_group, weight=weights, loss.ratio=loss.ratios)
df
##      age_group weight loss.ratio
## 1 A. [16 - 25]   0.25       0.60
## 2 B. [25 - 39]   0.10       0.90
## 3 C. [40 - 64]   0.35       0.55
## 4 D. [65+ ]      0.30       0.40

The global loss ratio for this portfolio is:

sum(df$weight * df$loss.ratio)
## [1] 0.5525

Here’s the question I’ve been asking myself lately: if I could select only one category of this variable (i.e. one age group) to try and take business measures to improve its profitability, which one should it be? In other words, which of these categories has the biggest negative impact on our overall loss ratio?

One possible answer would be to find the category which, if we improved its loss ratio by x%, would improve the global loss ratio the most. But if x is fixed, then this approach simply selects the category with the biggest weight…
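A quick check of that claim, using the weights and loss ratios defined above: improving category i by a fixed x improves the global loss ratio by weights[i] * x, so the criterion just picks the largest weight.

# improvement in the overall loss ratio when category i's loss ratio drops by a fixed x
x = 0.05
sapply(seq_along(weights), function(i){
  lr.improved = loss.ratios
  lr.improved[i] = lr.improved[i] - x
  sum(weights * loss.ratios) - sum(weights * lr.improved)
})
# equals weights * x, i.e. 0.0125 0.0050 0.0175 0.0150 -- maximised by the largest weight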

A better solution would be to consider, for each age group, what the loss ratio of the portfolio would be if that age group was removed from consideration. For example, to calculate the impact of age group “A. [16 – 25]”, one can calculate the overall loss ratio of the portfolio consisting of age groups B. to D., and subtract that value from our original (entire portfolio, including group A.) loss ratio.

impacts = function(weights, loss.ratios){

  overall.lr = sum(weights * loss.ratios)

  v = numeric()

  for(i in 1:length(weights)){
    w.without = weights[-i]/sum(weights[-i])
    lrs.without = loss.ratios[-i]
    lr = sum(w.without * lrs.without)
    v = c(v, overall.lr - lr)
  }

  paste0(round(v*100, 1), "%")
}

df$lr.impact = impacts(weights, loss.ratios)
df
##      age_group weight loss.ratio lr.impact
## 1 A. [16 - 25]   0.25       0.60      1.6%
## 2 B. [25 - 39]   0.10       0.90      3.9%
## 3 C. [40 - 64]   0.35       0.55     -0.1%
## 4 D. [65+ ]      0.30       0.40     -6.5%

What this tells us is that the age group “B. [25 – 39]” has the biggest upward impact on our overall loss ratio: if we didn’t insure this group (or equivalently, if that group’s loss ratio was equal to the loss ratio of the rest of the portfolio), our loss ratio would be 3.9 points lower.
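As a quick sanity check, removing group B and renormalizing the remaining weights gives back those 3.9 points:

# overall loss ratio minus the loss ratio of the portfolio without group B
overall   = sum(df$weight * df$loss.ratio)
without_b = sum(df$weight[-2] / sum(df$weight[-2]) * df$loss.ratio[-2])
overall - without_b   # about 0.039, i.e. 3.9 points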

Fun with barplots

I love barplots! Here’s a simple function that allows me to do some quick 2-way exploration of my data when I have categorical variables.

Let’s give ourselves some data to play with:

gender = c('Male', 'Female')
age_band = c('[16-24]', '[25-49]', '50+')
buying_ratio = c('1.Low', '2.Medium', '3.High')

n = 100

gender_v = sample(gender, n, replace=T)
age_band_v = sample(age_band, n, replace=T)
buying_ratio_v = sample(buying_ratio, n, replace=T)

df = data.frame(gender=gender_v, age_band=age_band_v, buying_ratio=buying_ratio_v)

df
## gender age_band buying_ratio
## 1 Male [16-24] 1.Low
## 2 Male [25-49] 2.Medium
## 3 Female [25-49] 2.Medium
## 4 Male [25-49] 3.High
## 5 Male [16-24] 1.Low
## 6 Female [25-49] 1.Low
## 7 Female 50+ 1.Low
## 8 Male [16-24] 1.Low
## 9 Female 50+ 3.High
## 10 Female 50+ 3.High
## 11 Female 50+ 1.Low
## 12 Female 50+ 2.Medium
## 13 Male [25-49] 2.Medium
## 14 Male [25-49] 2.Medium
## 15 Male [25-49] 1.Low
## 16 Male [16-24] 2.Medium
## 17 Female 50+ 2.Medium
## 18 Female [16-24] 2.Medium
## 19 Male [25-49] 2.Medium
## 20 Female [16-24] 2.Medium
## 21 Male [25-49] 3.High
## 22 Male [25-49] 2.Medium
## 23 Male 50+ 1.Low
## 24 Female [25-49] 2.Medium
## 25 Female [25-49] 3.High
## 26 Male 50+ 2.Medium
## 27 Female 50+ 1.Low
## 28 Male 50+ 3.High
## 29 Female [16-24] 2.Medium
## 30 Female [16-24] 3.High
## 31 Male 50+ 3.High
## 32 Male 50+ 3.High
## 33 Female [16-24] 3.High
## 34 Female [16-24] 2.Medium
## 35 Female 50+ 3.High
## 36 Male [25-49] 3.High
## 37 Female 50+ 1.Low
## 38 Male [16-24] 2.Medium
## 39 Male [25-49] 3.High
## 40 Female [25-49] 3.High
## 41 Male [25-49] 3.High
## 42 Female [25-49] 2.Medium
## 43 Female [25-49] 3.High
## 44 Female 50+ 1.Low
## 45 Female [25-49] 2.Medium
## 46 Male 50+ 3.High
## 47 Male [16-24] 3.High
## 48 Female 50+ 3.High
## 49 Female [25-49] 1.Low
## 50 Male [16-24] 1.Low
## 51 Female [16-24] 3.High
## 52 Male [16-24] 3.High
## 53 Male [25-49] 1.Low
## 54 Female [16-24] 1.Low
## 55 Male [16-24] 1.Low
## 56 Male 50+ 2.Medium
## 57 Female 50+ 3.High
## 58 Female [25-49] 1.Low
## 59 Female [25-49] 2.Medium
## 60 Female [16-24] 3.High
## 61 Male 50+ 2.Medium
## 62 Male [25-49] 3.High
## 63 Female [16-24] 2.Medium
## 64 Female 50+ 2.Medium
## 65 Male [16-24] 2.Medium
## 66 Male [25-49] 3.High
## 67 Male [16-24] 1.Low
## 68 Male 50+ 3.High
## 69 Male 50+ 2.Medium
## 70 Male [25-49] 2.Medium
## 71 Female [25-49] 2.Medium
## 72 Female [25-49] 1.Low
## 73 Male [16-24] 3.High
## 74 Female 50+ 3.High
## 75 Female [25-49] 3.High
## 76 Female [16-24] 2.Medium
## 77 Female 50+ 1.Low
## 78 Female [16-24] 1.Low
## 79 Male [16-24] 1.Low
## 80 Female [25-49] 3.High
## 81 Female [25-49] 1.Low
## 82 Male [16-24] 2.Medium
## 83 Female [25-49] 1.Low
## 84 Male [16-24] 2.Medium
## 85 Male 50+ 3.High
## 86 Female [25-49] 3.High
## 87 Female [16-24] 1.Low
## 88 Male 50+ 3.High
## 89 Female [25-49] 3.High
## 90 Male 50+ 2.Medium
## 91 Male [25-49] 3.High
## 92 Male [16-24] 1.Low
## 93 Female [25-49] 2.Medium
## 94 Male 50+ 1.Low
## 95 Female [16-24] 2.Medium
## 96 Female 50+ 3.High
## 97 Female [16-24] 3.High
## 98 Male 50+ 3.High
## 99 Male [25-49] 3.High
## 100 Male [25-49] 2.Medium

and a function to visualize conditional distributions:

plot_var = function(varname, varname2, col=brewer.pal(9, "Oranges")){
  # contingency table: rows = levels of varname2, columns = levels of varname
  var_data = t(table(df[,varname], df[,varname2]))
  var_data_ordered = var_data[order(rownames(var_data)),]

  # cumulative counts within each bar, and the height of each segment
  bar_heights = sapply(colnames(var_data_ordered), function(x) cumsum(var_data_ordered[,x]))
  bar_incr = rbind(bar_heights[1,], diff(bar_heights))

  # segment percentages (the labels) and their vertical positions (segment midpoints)
  percentages = apply(bar_incr, 2, function(x) paste(round(x/sum(x), 2)*100, '%'))
  ypos = bar_heights - bar_incr/2

  # bar widths proportional to the number of observations in each category
  bar_widths = apply(var_data, 2, sum)

  bp = barplot(var_data_ordered, main=paste(varname2, 'by', varname),
               names=paste(colnames(var_data), '(', bar_widths, ')'),
               beside=F, col=col,
               legend=rownames(var_data), args.legend=list(x='topleft', cex=0.6),
               width=bar_widths)

  # write the percentage labels inside each bar segment
  i=1
  for(xpos in bp){
    text(xpos, ypos[,i], percentages[,i])
    i = i + 1
  }
}

We can call the function like so:

library(RColorBrewer)
plot_var('gender', 'buying_ratio', brewer.pal(3, 'Oranges'))

[Figure: stacked barplot of buying_ratio by gender]

and the other way around:

plot_var('buying_ratio', 'gender', c('indianred1', 'lightblue2'))

[Figure: stacked barplot of gender by buying_ratio]

Note that the widths of the bars are proportional to the number of observations for each value.

Montreal Restaurants

I’m currently looking for a new apartment. One important criterion for me is proximity to services and entertainment, in particular restaurants. Or maybe I was just looking for a pretext to do some R…

Anyhow, I thought I would do some web scraping + mapping in R and see what neighborhoods in Montreal are the best when it comes to dining out.

The first step was to decide where to get the data from. Well, that decision was easy, because the first sites I looked at had JavaScript-generated pages (instead of static HTML), which meant standard web scraping would not work. Then I tried Yelp, and fortunately enough they use static HTML (although they limit any of their views to 1000 results).

Below is the code to scrape all 1000 results for Montreal and put them in a data frame. I then calculate points for each neighborhood as the sum-product of ratings and numbers of reviews:

library(rvest)
library(stringr)
library(magrittr)

scrape_page = function(start){

  restos = read_html(paste0('http://www.yelp.ca/search?find_loc=Montr%C3%A9al,+QC,+Canada&start=',
                            start, '&cflt=food'))

  n1 = restos %>% html_nodes('.main-attributes')
  n2 = restos %>% html_nodes('.secondary-attributes')

  df = data.frame(neighborhood=factor(), rating=numeric(), votes=numeric())

  #we have to extract the nodes one by one because sometimes the address field can be missing
  for(i in 1:10){

    main_attr = n1 %>% extract2(i)
    sec_attr = n2 %>% extract2(i)

    if(length(sec_attr %>% html_nodes('.neighborhood-str-list')) == 1
       && length(main_attr %>% html_nodes('.star-img')) == 1
       && length(main_attr %>% html_nodes('.review-count')) == 1){

      neighborhood = iconv(sec_attr %>% html_nodes('.neighborhood-str-list') %>% html_text() %>% str_trim(),
                            from='UTF-8', to='UTF-8')

      # rename the neighborhoods so the names match those found in the spatial data (see below)
      if(str_detect(neighborhood, 'Notre-Dame') || str_detect(neighborhood, 'des-Neiges')){
        neighborhood = 'Côte-des-Neiges-Notre-Dame-de-Grâce'
      }
      else if(str_detect(neighborhood, 'Plateau')){
        neighborhood = 'Le Plateau-Mont-Royal'
      }
      else if(str_detect(neighborhood, 'Sud')){
        neighborhood = 'Le Sud-Ouest'
      }
      else if(str_detect(neighborhood, 'Ville-Marie')){
        neighborhood = 'Ville-Marie'
      }
      else if(str_detect(neighborhood, 'Villeray')){
        neighborhood = 'Villeray-Saint-Michel-Parc-Extension'
      }
      else if(str_detect(neighborhood, 'Saint-Luc')){
        neighborhood = 'Côte-Saint-Luc'
      }
      else if(str_detect(neighborhood, 'Trembles')){
        neighborhood = 'Rivière-des-Prairies-Pointe-aux-Trembles'
      }
      else if(str_detect(neighborhood, 'onard')){
        neighborhood = 'Saint-Léonard'
      }
      else if(str_detect(neighborhood, 'Montr')){
        neighborhood = 'Montréal-Ouest'
      }
      else if(str_detect(neighborhood, 'Ahun')){
        neighborhood = 'Ahuntsic-Cartierville'
      }

      rating = as.numeric(main_attr %>% html_nodes('.star-img') %>%
                          html_attr('title') %>% str_extract('[0-9]\\.[0-9]'))

      votes = as.numeric(main_attr %>% html_nodes('.review-count') %>% html_text() %>% str_extract('[0-9]+'))

      df = rbind(df, data.frame(neighborhood=neighborhood, rating=rating, votes=votes))
    }
  }

  df
}

all_ratings = lapply(seq(from=0, to=990, by=10), scrape_page)
all_ratings_df = do.call(rbind, all_ratings)

Now we tally up the points by neighborhood, and we’re gonna score each neighborhood according to the average quality of its restaurants AND to its total number of restaurants, using the square root function on the latter to give it less weight (the idea being I would prefer to live in a neighborhood with 10 good restaurants rather than 100 crappy ones).

all_ratings_df$points = all_ratings_df$rating * all_ratings_df$votes

points = aggregate(points ~ neighborhood, data=all_ratings_df, sum)

restaurant_count = aggregate(rating ~ neighborhood, data=all_ratings_df, length)
colnames(restaurant_count)[2] = 'restaurant_count'

votes = aggregate(votes ~ neighborhood, data=all_ratings_df, sum)

avg_ratings = data.frame(neighborhood=points[,1],
                         avg_rating=points[,2]/votes[,2])

scores = data.frame(neighborhood=points[,1],
                    score=avg_ratings[,2] * sqrt(restaurant_count[,2]))
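For readers who prefer dplyr (used in the earlier posts), the same aggregation can be written more compactly; a sketch that reproduces avg_ratings and scores in one pass:

library(dplyr)

scores_dplyr = all_ratings_df %>%
  group_by(neighborhood) %>%
  summarise(restaurant_count = n(),
            avg_rating = sum(rating * votes) / sum(votes)) %>%
  mutate(score = avg_rating * sqrt(restaurant_count))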

Here are the neighborhood average ratings:

## neighborhood avg_rating
## 1 Ahuntsic-Cartierville 4.103774
## 2 Brossard 5.000000
## 3 Côte-des-Neiges-Notre-Dame-de-Grâce 4.112676
## 4 Côte-Saint-Luc 5.000000
## 5 Le Plateau-Mont-Royal 4.182097
## 6 Mercier-Hochelaga-Maisonneuve 4.043605
## 7 Mont-Royal 3.545455
## 8 Outremont 4.004098
## 9 Rosemont-La Petite-Patrie 4.265391
## 10 Saint-Laurent 3.836735
## 11 Saint-Léonard 4.240000
## 12 Sud-Ouest 4.015648
## 13 Verdun 4.094203
## 14 Ville-Marie 3.936800
## 15 Villeray-Saint-Michel-Parc-Extension 4.342500

and scores (which is the average rating times the square root of the number of restaurants in the neighborhood):

scores
## neighborhood score
## 1 Ahuntsic-Cartierville 18.352633
## 2 Brossard 7.071068
## 3 Côte-des-Neiges-Notre-Dame-de-Grâce 33.913996
## 4 Côte-Saint-Luc 5.000000
## 5 Le Plateau-Mont-Royal 69.478213
## 6 Mercier-Hochelaga-Maisonneuve 20.218023
## 7 Mont-Royal 7.927877
## 8 Outremont 12.662071
## 9 Rosemont-La Petite-Patrie 51.184692
## 10 Saint-Laurent 9.398042
## 11 Saint-Léonard 7.343895
## 12 Sud-Ouest 28.957252
## 13 Verdun 10.028708
## 14 Ville-Marie 63.356803
## 15 Villeray-Saint-Michel-Parc-Extension 27.464381

So now that we have the scores aggregated by neighborhood, we need a map to draw them on:

library(maptools)

tmp_dir = tempdir()
url_data = 'http://donnees.ville.montreal.qc.ca/dataset/00bd85eb-23aa-4669-8f1b-ba9a000e3dd8/resource/62f7ce10-36ce-4bbd-b419-8f0a10d3b280/download/limadmin-shp.zip'
zip_file = paste0(tmp_dir, '/montreal.zip')
download.file(url_data, zip_file)
unzip(zip_file, exdir=tmp_dir)

montreal = readShapeSpatial(paste0(tmp_dir, '/LIMADMIN.shp'))

The next step is to merge the points from the web scraping step onto the shape data. Note that unfortunately, not all neighborhoods are represented on the Yelp website.

df = merge(montreal@data, scores, by.x='NOM', by.y='neighborhood', all.x=TRUE)

#reorder df to match montreal@data (merge() sorts by NOM, so rank() restores the original polygon order)
df = df[rank(montreal@data$NOM),]

We can assign a category to each neighborhood based on its points (I chose to make 3 categories), and attribute a color to each category.

library(RColorBrewer)
library(classInt)

ci = classIntervals(df$score, n=3, style='quantile')

palette = brewer.pal(8, 'YlOrRd')[c(2,4,6)]

colors = findColours(ci, palette)

Finally, we draw the map, add a legend, and write the names of the neighborhoods using the gCentroid function from the rgeos package:

library(rgeos)

plot(montreal, col=colors)
legend('topleft', legend=c('Starvation area', 'Survivable', "Foodie's heaven"), fill=palette)

centroids = gCentroid(montreal, byid=TRUE)[!is.na(df$score)]
text(centroids$x, centroids$y, labels=montreal@data$NOM[!is.na(df$score)], cex=.7)

[Figure: map of Montreal neighborhoods colored by restaurant score]

Voilà! It is quite unfortunate that we only have the central neighborhoods of Montreal. To be continued…

Text analysis of Republican presidential debates

Here we analyze the Republican presidential debate transcripts, focusing on three candidates in particular: Trump, Cruz and Rubio.

First let’s answer the following question:

How often does Trump get mentioned by name by the other two candidates during a debate, versus how often do the other two candidates mention each other’s name?

The transcripts were downloaded from: http://www.presidency.ucsb.edu/debates.php, from August 6th, 2015 to March 21st, 2016.

Here’s the code:


library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)

ref_matrix = function (date){

  #read character data
  text = scan(paste0('debate_', date, '.txt'), what='x', quote=NULL)

  speakers =
    c(  CRUZ    = '',
        TRUMP   = '',
        RUBIO   = ''
    )

  #assign text to the right speaker
  currentSpeaker = NA
  for(word in text){
    #if word ends with :
    if(substr(word,nchar(word),nchar(word))==':'){
      #if word corresponds to one of the speakers of interest
      if(word %in% paste0(names(speakers), ':')){
        #set current speaker
        currentSpeaker = substr(word,1,nchar(word)-1)
      }
      else{
        #if the current speaker is not one of the speakers of interest, set it to NA
        currentSpeaker = NA
      }
    }
    else if(!is.na(currentSpeaker)){
      #if the current speaker is of interest, save what he is saying
      speakers[currentSpeaker] = paste(speakers[currentSpeaker], word)
    }
  }

  #preprocess text
  prez = Corpus(VectorSource(speakers))
  prez = tm_map(prez, tolower)
  prez = tm_map(prez, removeWords, stopwords('english'))
  #remove additional unwanted words
  prez = tm_map(prez, removeWords, c('will', 'going', 'applause', 'get', 'say', 'want', 'let', 'can', 'well', 'know',
                                    'got', 'one', 'two', 'also', 'ever', 'even', 'need', 'every', 'let', 'many'))
  prez = tm_map(prez, removePunctuation, preserve_intra_word_dashes = FALSE)
  prez = tm_map(prez, stemDocument)
  prez = tm_map(prez, stripWhitespace)
  prez = tm_map(prez, removeNumbers)
  prez = tm_map(prez, PlainTextDocument)

  #make document term matrix
  dtm <- DocumentTermMatrix(prez)

  #reassign row names (each row is a speaker)
  rownames(dtm) = names(speakers)

  #how many times was donald trump referred to by other candidates
  names = character()
  if('donald' %in% colnames(dtm)){names = c(names, 'donald')}
  if('trump' %in% colnames(dtm)){names = c(names, 'trump')}
  dtm_trump = dtm[,names]
  TRUMP = apply(dtm_trump, 1, sum)

  #how many times was ted cruz referred to by other candidates
  names = character()
  if('ted' %in% colnames(dtm)){names = c(names, 'ted')}
  if('cruz' %in% colnames(dtm)){names = c(names, 'cruz')}
  dtm_cruz = dtm[,names]
  CRUZ = apply(dtm_cruz, 1, sum)

  #how many times was marco rubio referred to by other candidates
  names = character()
  if('marco' %in% colnames(dtm)){names = c(names, 'marco')}
  if('rubio' %in% colnames(dtm)){names = c(names, 'rubio')}
  dtm_rubio = dtm[,names]
  RUBIO = apply(dtm_rubio, 1, sum)

  #summary matrix
  data.frame(TRUMP=TRUMP, CRUZ=CRUZ, RUBIO=RUBIO)
}

dates = c(20150806, 20150916, 20151028, 20151110, 20151215,
          20160114, 20160128, 20160206, 20160213, 20160225,
          20160303)

ref_list = lapply(dates, ref_matrix)

names(ref_list) = dates

trump = sapply(ref_list, function(df) sum(df[rownames(df) != 'TRUMP', 'TRUMP']))
cruz_rubio = sapply(ref_list, function(df) sum(df[rownames(df) == 'CRUZ', 'RUBIO'], df[rownames(df) == 'RUBIO', 'CRUZ']))

m = t(as.matrix(data.frame(trump, cruz_rubio)))
barplot(m, main='Number of times Cruz/Rubio mention Trump vs each other', beside=TRUE, col=c('blue', 'red'),
        legend=c('# times Cruz/Rubio mention Trump', '# times Cruz/Rubio mention each other'),
        args.legend=list(x='topleft', cex=0.75), cex.names=0.75)

We can see that at the beginning of the race, the candidates really didn’t refer to each other much at all.

Things change around the debate of December 15th, 2015, where Cruz and Rubio refer to each other significantly more.
Then in the next debate and every other debate afterwards, Cruz and Rubio collectively refer to Trump more often than they refer to each other.

In particular, in the last two debates, of February 25th, 2016 and March 3rd, 2016, they mention each other 10 times in total, whereas they mention Trump 137 times!

Let’s now turn our attention to the words themselves.

We’re gonna change the code a little bit in order to collect all the transcripts in a single character string:

  dates = c(20150806, 20150916, 20151028, 20151110, 20151215,
          20160114, 20160128, 20160206, 20160213, 20160225,
          20160303)
  read_transcript = function(date){scan(paste0('debate_', date, '.txt'), what='x', quote=NULL)}

  #read and collate all transcripts
  text = unlist(sapply(dates, read_transcript))

After reusing the same bit of code as in the beginning to preprocess the text, we can answer a few interesting questions:

For each candidate, what are the top 50 most frequent words?

# get the indexes of each speaker's 50 most frequent words
indexes = apply(dtm, 1, function(v) head(order(v, decreasing=TRUE), 50))

# find the 50 most frequent words
freq_words = apply(indexes, 2, function(v) colnames(dtm)[v])
freq_words

 

##         CRUZ          TRUMP       RUBIO
##    [1,] 'donald'      'people'    'people'
##    [2,] 'president'   'country'   'president'
##    [3,] 'people'      'just'      'country'
##    [4,] 'now'         'think'     'now'
##    [5,] 'tax'         'said'      'america'
##    [6,] 'obama'       'now'       'states'
##    [7,] 'country'     'tell'      'united'
##    [8,] 'said'        'right'     'just'
##    [9,] 'question'    'great'     'world'
##   [10,] 'right'       'like'      'first'
##   [11,] 'back'        'way'       'issue'
##   [12,] 'just'        'look'      'think'
##   [13,] 'america'     'back'      'like'
##   [14,] 'washington'  'take'      'american'
##   [15,] 'years'       'lot'       'way'
##   [16,] 'court'       'much'      'important'
##   [17,] 'american'    'come'      'make'
##   [18,] 'clinton'     'make'      'years'
##   [19,] 'hillary'     'thing'     'immigration'
##   [20,] 'tell'        'never'     'money'
##   [21,] 'fight'       'years'     'tax'
##   [22,] 'law'         'china'     'isis'
##   [23,] 'amnesty'     'world'     'see'
##   [24,] 'isis'        'first'     'said'
##   [25,] 'day'         'talking'   'someone'
##   [26,] 'first'       'good'      'time'
##   [27,] 'look'        'money'     'believe'
##   [28,] 'think'       'win'       'government'
##   [29,] 'like'        'everybody' 'never'
##   [30,] 'state'       'trade'     'back'
##   [31,] 'crosstalk'   'care'      'barack'
##   [32,] 'new'         'something' 'military'
##   [33,] 'barack'      'deal'      'things'
##   [34,] 'flat'        'wall'      'right'
##   [35,] 'keep'        'really'    'americans'
##   [36,] 'percent'     'problem'   'economy'
##   [37,] 'plan'        'believe'   'made'
##   [38,] 'women'       'billion'   'obama'
##   [39,] 'bill'        'saying'    'today'
##   [40,] 'radical'     'crosstalk' 'fact'
##   [41,] 'stage'       'big'       'able'
##   [42,] 'government'  'president' 'hillary'
##   [43,] 'business'    'time'      'question'
##   [44,] 'john'        'wrong'     'donald'
##   [45,] 'marco'       'done'      'clinton'
##   [46,] 'everyone'    'excuse'    'plan'
##   [47,] 'islamic'     'jobs'      'point'
##   [48,] 'millions'    'laughter'  'support'
##   [49,] 'immigration' 'far'       'better'
##   [50,] 'men'         'jeb'       'place'

It would actually be interesting to see, for each candidate, the words in his top-50 that are unique to him, i.e. that are not in the other candidates’ top-50.

# count how many times each word appears across the freq_words matrix
word_count = apply(freq_words, c(1,2), function(x) sum(x == freq_words))

#keep those words that appear only once
unique_words = word_count == 1

l = lapply(colnames(freq_words), function(name) freq_words[unique_words[,name], name])
names(l) = colnames(freq_words)
l

 

## $CRUZ
##  [1] 'washington' 'court'      'fight'      'law'        'amnesty'
##  [6] 'day'        'state'      'new'        'flat'       'keep'
## [11] 'percent'    'women'      'bill'       'radical'    'stage'
## [16] 'business'   'john'       'marco'      'everyone'   'islamic'
## [21] 'millions'   'men'
##
## $TRUMP
##  [1] 'great'     'take'      'lot'       'much'      'come'
##  [6] 'thing'     'china'     'talking'   'good'      'win'
## [11] 'everybody' 'trade'     'care'      'something' 'deal'
## [16] 'wall'      'really'    'problem'   'billion'   'saying'
## [21] 'big'       'wrong'     'done'      'excuse'    'jobs'
## [26] 'laughter'  'far'       'jeb'
##
## $RUBIO
##  [1] 'states'    'united'    'issue'     'important' 'see'
##  [6] 'someone'   'military'  'things'    'americans' 'economy'
## [11] 'made'      'today'     'fact'      'able'      'point'
## [16] 'support'   'better'    'place'

Let’s go ahead and make word clouds out of that:

Trump:
freq = dtm['TRUMP', l$TRUMP]
wordcloud(l$TRUMP, as.vector(freq), colors=brewer.pal(9,'Blues'))

[Figure: word cloud of Trump's distinctive top-50 words]

Cruz:
freq = dtm['CRUZ', l$CRUZ]
wordcloud(l$CRUZ, as.vector(freq), colors=brewer.pal(9,'Greens'))

[Figure: word cloud of Cruz's distinctive top-50 words]

Rubio:
freq = dtm['RUBIO', l$RUBIO]
wordcloud(l$RUBIO, as.vector(freq), colors=brewer.pal(9,'Oranges'))

[Figure: word cloud of Rubio's distinctive top-50 words]

For each candidate, what is the average word length?

#number of words by speaker
nb_words = apply(dtm, 1, sum)
#word lengths
word_lengths = sapply(colnames(dtm), nchar)
#transform word count into total character count matrix
character_counts = t(apply(dtm, 1, function(v) v * word_lengths))
#total character count by speaker
total_character_counts = apply(character_counts, 1, function(v) sum(v))
#divide total character count by number of words
round(total_character_counts / nb_words, digits=1)
##  CRUZ TRUMP RUBIO
##   6.3   5.9   6.3

Finally, for each candidate, how diversified is their vocabulary?

To quantify this, we are gonna count the number of unique words per 1000 words:

apply(dtm, 1, function(v) round(sum(v != 0)/sum(v)*1000, digits=1))

 

##  CRUZ TRUMP RUBIO
## 248.8 176.3 218.0

We see that Trump uses a less diverse vocabulary and shorter words than his opponents.