Month: March 2016

Loss ratio by variable

In my line of work (property & casualty insurance), one ratio we are constantly monitoring is the loss ratio, which is defined as the ratio of the loss amounts paid to the insured to the premium earned by the insurance company.

Ignoring non-claim-specific expenses (e.g. general expenses, marketing, etc.) and investment returns, a loss ratio of 100% means that the average premium we charge the insured exactly covers the average loss amount, i.e. we break even (the premium corresponding to a loss ratio of 100% is called the pure premium). Note that, by definition, everything else being equal, a lower loss ratio translates into higher profits for the insurance company.
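
For instance, a minimal illustration with made-up figures:

losses.paid = 80e6      #losses paid to insureds over the period (hypothetical)
premium.earned = 100e6  #premium earned over the same period (hypothetical)
losses.paid / premium.earned
## [1] 0.8

i.e. a loss ratio of 80%, below the break-even point of 100%.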

Sometimes, we want to look at this loss ratio conditionally on the values of a specific variable. That is, for a particular segmenting variable (age, gender, territory, etc.), we are interested in knowing the loss ratio for each category in this variable, along with the earned premium as a percentage of the total earned premium.

For example, such a report for the Age variable could look like this:

age_group = c("A. [16 - 25]", "B. [25 - 39]", "C. [40 - 64]", "D. [65+ ]   ")
weights = c(0.25, 0.1, 0.35, 0.3)
loss.ratios = c(0.6, 0.9, 0.55, 0.4)

df = data.frame(age_group=age_group, weight=weights, loss.ratio=loss.ratios)
df
##      age_group weight loss.ratio
## 1 A. [16 - 25]   0.25       0.60
## 2 B. [25 - 39]   0.10       0.90
## 3 C. [40 - 64]   0.35       0.55
## 4 D. [65+ ]      0.30       0.40

The global loss ratio for this portfolio is:

sum(df$weight * df$loss.ratio)
## [1] 0.5525

Here’s the question I’ve been asking myself lately: if I could select only one category of this variable (i.e. one age group) and take business measures to improve its profitability, which one should it be? In other words, which of these categories has the biggest negative impact on our overall loss ratio?

One possible answer would be to find the category which, if we improved its loss ratio by x points, would improve the global loss ratio the most. But if x is fixed, this approach simply selects the category with the biggest weight, since lowering category i’s loss ratio by x lowers the global loss ratio by exactly weight_i * x (see the quick check below)…
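
Here is that quick check, using the weights defined above and an arbitrary improvement of 5 points:

x = 0.05
#reduction in the global loss ratio if category i's loss ratio drops by x
weights * x
## [1] 0.0125 0.0050 0.0175 0.0150

The biggest reduction always comes from the group with the biggest weight (here group C), regardless of how bad its loss ratio actually is.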

A better solution would be to consider, for each age group, what the loss ratio of the portfolio would be if that age group was removed from consideration. For example, to calculate the impact of age group “A. [16 - 25]”, one can calculate the overall loss ratio of the portfolio consisting of age groups B. to D., and subtract that value from our original (entire portfolio, including group A.) loss ratio.

impacts = function(weights, loss.ratios){

  overall.lr = sum(weights * loss.ratios)

  v = numeric()

  for(i in 1:length(weights)){
    #loss ratio of the portfolio with category i removed (weights renormalized)
    w.without = weights[-i]/sum(weights[-i])
    lrs.without = loss.ratios[-i]
    lr = sum(w.without * lrs.without)
    #impact of category i = overall loss ratio minus loss ratio without it
    v = c(v, overall.lr - lr)
  }

  paste0(round(v*100, 1), "%")
}

df$lr.impact = impacts(weights, loss.ratios)
df
##      age_group weight loss.ratio lr.impact
## 1 A. [16 - 25]   0.25       0.60      1.6%
## 2 B. [25 - 39]   0.10       0.90      3.9%
## 3 C. [40 - 64]   0.35       0.55     -0.1%
## 4 D. [65+ ]      0.30       0.40     -6.5%

What this tells us is that the age group “B. [25 - 39]” has the biggest upward impact on our overall loss ratio: if we didn’t insure this group (or equivalently, if that group’s loss ratio were equal to the loss ratio of the rest of the portfolio), our loss ratio would be 3.9 points lower.
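
As a quick sanity check of the equivalence above, we can replace group B’s loss ratio with the loss ratio of the rest of the portfolio and recompute the overall figure:

#loss ratio of the portfolio excluding group B (weights renormalized)
lr.rest = sum(weights[-2]/sum(weights[-2]) * loss.ratios[-2])
#overall loss ratio if group B performed like the rest of the portfolio
sum(weights * replace(loss.ratios, 2, lr.rest))
## [1] 0.5138889

which is indeed about 3.9 points below the original 0.5525.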


Why smart people are loners

I was thinking about an article (in French) a friend of mine sent me. Basically, the article says that the correlation between happiness and social life depends on your IQ: the smarter you are, the less your happiness is correlated with how much time you spend with friends.

At the same time, somewhere on Quora, someone had suggested that people who are significantly smarter than average are less likely to meet like-minded people.

This led me to wonder about the average distance between two points inside a disk (notice the smooth transition). I’m sure this question has been asked before (see here for example), and I’m sure one could integrate their way to the answer, but hey, I’m not about to pass up an opportunity to rdoodle!

(Before looking at the result below, try to guess what the average distance, as a function of the radius, should look like – linear, polynomial..? To be honest I had no clue)


N = 1000

distances = function(rmax){

  #draw N points with radius uniform in [0, rmax] and angle uniform in [0, 2*pi)
  #(note: this puts more points near the centre than a uniform-over-area sample would)
  r = runif(N, 0, rmax)
  theta = runif(N, 0, 2*pi)
  x = r * cos(theta)
  y = r * sin(theta)

  distances = sqrt(outer(x, x, function(x1, x2) (x1-x2)^2) + outer(y, y, function(y1, y2) (y1 - y2)^2))

  #plot(x, y, xlim=c(-rmax, rmax), ylim=c(-rmax, rmax), col="blue", pch=19, cex=0.75)

  mean(distances)
}

r = seq(1,10,0.25)

means = sapply(r, distances)

plot(r, means, col="blue", type="l")

[Plot: average distance between points vs. radius r]

Looks like the relationship is linear.
Let’s get the value of the coefficient:

#linear regression with 0 intercept
lm(means ~ -1 + r)

#Call:
#lm(formula = means ~ -1 + r)

#Coefficients:
#     r
#0.7223  

There you go. Average distance = 0.7223r

Which is the right answer to the wrong question, because what I really wanted to compute is the average distance between a person on the edge of the disk (i.e. a smart person) and all the others – how about now, still linear?

The adjustment to the code is actually very small (love you R!):

N = 1000

distances = function(rmax){

  r = runif(N, 0, rmax)
  theta = runif(N, 0, 2*pi)
  x = r * cos(theta)
  y = r * sin(theta)

  #the exact position of the person on the edge doesn't matter due to symmetry
  x0 = 0
  y0 = rmax

  distances = sqrt(outer(x0, x, function(x1, x2) (x1-x2)^2) + outer(y0, y, function(y1, y2) (y1 - y2)^2))

  mean(distances)
}

r = seq(1,10,0.25)

means = sapply(r, distances)

lm(means ~ -1 + r)

#Call:
#lm(formula = means ~ -1 + r)

#Coefficients:
#    r
#1.087

Average distance = 1.087r
See? I told you smart people were less gregarious…

SAS PROC SQL – missing values

Here’s a little trick to elegantly default your missing values when you do a merge using PROC SQL.

DATA A;
	CATEGORY = 'A'; NAME = 'CAT-A'; OUTPUT;
	CATEGORY = 'B'; NAME = 'CAT-B'; OUTPUT;
RUN;

DATA B;
	CATEGORY = 'A'; OUTPUT;
	CATEGORY = 'B'; OUTPUT;
	CATEGORY = 'C'; OUTPUT;
RUN;

We want to get the category name from dataset A. But in case we don’t find it, we want to output ‘Unknown’ (instead of the default blank).

The trick is to use the COALESCE function, which returns the first non-missing value among its arguments:

PROC SQL;
	CREATE TABLE TEST AS
	SELECT T0.*, COALESCE(T1.NAME, 'UNKNOWN') AS NAME FORMAT=$7.
	FROM
	B AS T0
	LEFT JOIN A AS T1
	ON T0.CATEGORY = T1.CATEGORY;
QUIT;

[Output: the merged table, where categories A and B keep their names and category C gets 'UNKNOWN']

Fun with barplots

I love barplots! Here’s a simple function that allows me to do some quick 2-way exploration of my data when I have categorical variables.

Let’s give ourselves some data to play with:

gender = c('Male', 'Female')
age_band = c('[16-24]', '[25-49]', '50+')
buying_ratio = c('1.Low', '2.Medium', '3.High')

n = 100

gender_v = sample(gender, n, replace=T)
age_band_v = sample(age_band, n, replace=T)
buying_ratio_v = sample(buying_ratio, n, replace=T)

df = data.frame(gender=gender_v, age_band=age_band_v, buying_ratio=buying_ratio_v)

head(df)
##   gender age_band buying_ratio
## 1   Male  [16-24]        1.Low
## 2   Male  [25-49]     2.Medium
## 3 Female  [25-49]     2.Medium
## 4   Male  [25-49]       3.High
## 5   Male  [16-24]        1.Low
## 6 Female  [25-49]        1.Low

and a function to visualize conditional distributions:

plot_var = function(varname, varname2, col=brewer.pal(9, "Oranges")){
  #contingency table: rows = levels of varname2, columns = levels of varname
  var_data = t(table(df[,varname], df[,varname2]))
  var_data_ordered = var_data[order(rownames(var_data)),]

  #cumulative counts give the top of each stacked segment
  bar_heights = sapply(colnames(var_data_ordered), function(x) cumsum(var_data_ordered[,x]))
  bar_incr = rbind(bar_heights[1,], diff(bar_heights))

  #percentage labels for each segment within each bar
  percentages = apply(bar_incr, 2, function(x) paste(round(x/sum(x), 2)*100, '%'))

  #vertical position of each label: the middle of its segment
  ypos = bar_heights - bar_incr/2

  #bar widths proportional to the number of observations in each column
  bar_widths = apply(var_data, 2, sum)

  bp = barplot(var_data_ordered, main=paste(varname2, 'by', varname),
               names=paste(colnames(var_data), '(', bar_widths, ')'),
               beside=F, col=col,
               legend=rownames(var_data), args.legend=list(x='topleft', cex=0.6),
               width=bar_widths)

  #write the percentage labels on the bars
  i=1
  for(xpos in bp){
    text(xpos, ypos[,i], percentages[,i])
    i = i + 1
  }
}

We can call the function like so:

library(RColorBrewer)
plot_var('gender', 'buying_ratio', brewer.pal(3, 'Oranges'))

[Stacked barplot: buying_ratio by gender, with percentage labels]

and the other way around

plot_var('buying_ratio', 'gender', c('indianred1', 'lightblue2'))

[Stacked barplot: gender by buying_ratio, with percentage labels]

Note that the width of each bar is proportional to the number of observations for that value.

Montreal Restaurants

I’m currently looking for a new apartment. One important criterion for me is proximity to services and entertainment, in particular restaurants. Or maybe I was just looking for a pretext to do some R…

Anyhow, I thought I would do some web scraping + mapping in R and see what neighborhoods in Montreal are the best when it comes to dining out.

The first step was to decide where to get the data from. Well, that decision was easy, because the first sites I looked at rendered their pages with JavaScript (instead of serving static HTML), which meant standard web scraping would not work. Then I tried Yelp, and fortunately enough they use static HTML (although they limit any of their views to 1000 results).

Below is the code to scrape all 1000 results for Montreal, and put the result in a dataframe. I then calculate points for each neighborhood as the sumproduct of ratings and number of reviews:

library(rvest)
library(stringr)
library(magrittr)

scrape_page = function(start){

  restos = read_html(paste0('http://www.yelp.ca/search?find_loc=Montr%C3%A9al,+QC,+Canada&start=',
                            start, '&cflt=food'))

  n1 = restos %>% html_nodes('.main-attributes')
  n2 = restos %>% html_nodes('.secondary-attributes')

  df = data.frame(neighborhood=factor(), rating=numeric(), votes=numeric())

  #we have to extract the nodes one by one because sometimes the address field can be missing
  for(i in 1:10){

    main_attr = n1 %>% extract2(i)
    sec_attr = n2 %>% extract2(i)

    if(length(sec_attr %>% html_nodes('.neighborhood-str-list')) == 1
       && length(main_attr %>% html_nodes('.star-img')) == 1
       && length(main_attr %>% html_nodes('.review-count')) == 1){

      neighborhood = iconv(sec_attr %>% html_nodes('.neighborhood-str-list') %>% html_text() %>% str_trim(),
                            from='UTF-8', to='UTF-8')

      # rename the neighborhoods so the names match those found in the spatial data (see below)
      if(str_detect(neighborhood, 'Notre-Dame') || str_detect(neighborhood, 'des-Neiges')){
        neighborhood = 'Côte-des-Neiges-Notre-Dame-de-Grâce'
      }
      else if(str_detect(neighborhood, 'Plateau')){
        neighborhood = 'Le Plateau-Mont-Royal'
      }
      else if(str_detect(neighborhood, 'Sud')){
        neighborhood = 'Le Sud-Ouest'
      }
      else if(str_detect(neighborhood, 'Ville-Marie')){
        neighborhood = 'Ville-Marie'
      }
      else if(str_detect(neighborhood, 'Villeray')){
        neighborhood = 'Villeray-Saint-Michel-Parc-Extension'
      }
      else if(str_detect(neighborhood, 'Saint-Luc')){
        neighborhood = 'Côte-Saint-Luc'
      }
      else if(str_detect(neighborhood, 'Trembles')){
        neighborhood = 'Rivière-des-Prairies-Pointe-aux-Trembles'
      }
      else if(str_detect(neighborhood, 'onard')){
        neighborhood = 'Saint-Léonard'
      }
      else if(str_detect(neighborhood, 'Montr')){
        neighborhood = 'Montréal-Ouest'
      }
      else if(str_detect(neighborhood, 'Ahun')){
        neighborhood = 'Ahuntsic-Cartierville'
      }

      rating = as.numeric(main_attr %>% html_nodes('.star-img') %>%
                          html_attr('title') %>% str_extract('[0-9]\\.[0-9]'))

      votes = as.numeric(main_attr %>% html_nodes('.review-count') %>% html_text() %>% str_extract('[0-9]+'))

      df = rbind(df, data.frame(neighborhood=neighborhood, rating=rating, votes=votes))
    }
  }

  df
}

all_ratings = lapply(seq(from=0, to=990, by=10), scrape_page)
all_ratings_df = do.call(rbind, all_ratings)

Now we tally up the points by neighborhood. We score each neighborhood according to the average quality of its restaurants AND its total number of restaurants, applying the square root to the latter to give it less weight (the idea being that I would rather live in a neighborhood with 10 good restaurants than in one with 100 crappy ones).

all_ratings_df$points = all_ratings_df$rating * all_ratings_df$votes

points = aggregate(points ~ neighborhood, data=all_ratings_df, sum)

restaurant_count = aggregate(rating ~ neighborhood, data=all_ratings_df, length)
colnames(restaurant_count)[2] = 'restaurant_count'

votes = aggregate(votes ~ neighborhood, data=all_ratings_df, sum)

avg_ratings = data.frame(neighborhood=points[,1],
                         avg_rating=points[,2]/votes[,2])

scores = data.frame(neighborhood=points[,1],
                    score=avg_ratings[,2] * sqrt(restaurant_count[,2]))

Here are the neighborhood average ratings:

##                             neighborhood avg_rating
## 1                  Ahuntsic-Cartierville   4.103774
## 2                               Brossard   5.000000
## 3    Côte-des-Neiges-Notre-Dame-de-Grâce   4.112676
## 4                         Côte-Saint-Luc   5.000000
## 5                  Le Plateau-Mont-Royal   4.182097
## 6          Mercier-Hochelaga-Maisonneuve   4.043605
## 7                             Mont-Royal   3.545455
## 8                              Outremont   4.004098
## 9              Rosemont-La Petite-Patrie   4.265391
## 10                         Saint-Laurent   3.836735
## 11                         Saint-Léonard   4.240000
## 12                             Sud-Ouest   4.015648
## 13                                Verdun   4.094203
## 14                           Ville-Marie   3.936800
## 15  Villeray-Saint-Michel-Parc-Extension   4.342500

and scores (which is the average rating times the square root of the number of restaurants in the neighborhood):

scores
##                             neighborhood     score
## 1                  Ahuntsic-Cartierville 18.352633
## 2                               Brossard  7.071068
## 3    Côte-des-Neiges-Notre-Dame-de-Grâce 33.913996
## 4                         Côte-Saint-Luc  5.000000
## 5                  Le Plateau-Mont-Royal 69.478213
## 6          Mercier-Hochelaga-Maisonneuve 20.218023
## 7                             Mont-Royal  7.927877
## 8                              Outremont 12.662071
## 9              Rosemont-La Petite-Patrie 51.184692
## 10                         Saint-Laurent  9.398042
## 11                         Saint-Léonard  7.343895
## 12                             Sud-Ouest 28.957252
## 13                                Verdun 10.028708
## 14                           Ville-Marie 63.356803
## 15  Villeray-Saint-Michel-Parc-Extension 27.464381

So now that we have the scores aggregated by neighborhood, we need a map to draw them on:

library(maptools)

tmp_dir = tempdir()
url_data = 'http://donnees.ville.montreal.qc.ca/dataset/00bd85eb-23aa-4669-8f1b-ba9a000e3dd8/resource/62f7ce10-36ce-4bbd-b419-8f0a10d3b280/download/limadmin-shp.zip'
zip_file = paste0(tmp_dir, '/montreal.zip')
download.file(url_data, zip_file)
unzip(zip_file, exdir=tmp_dir)

montreal = readShapeSpatial(paste0(tmp_dir, '/LIMADMIN.shp'))

The next step is to merge the points from the web scraping step onto the shape data. Note that unfortunately, not all neighborhoods are represented on the Yelp website.

df = merge(montreal@data, scores, by.x='NOM', by.y='neighborhood', all.x=TRUE)

#reorder the merged rows to match the polygon order in montreal@data (merge() sorts by NOM)
df = df[rank(montreal@data$NOM),]

We can assign a category to each neighborhood based on its score (I chose to make 3 categories), and attribute a color to each category.

library(RColorBrewer)
library(classInt)

ci = classIntervals(df$score, n=3, style='quantile')

palette = brewer.pal(8, 'YlOrRd')[c(2,4,6)]

colors = findColours(ci, palette)

Finally, we draw the map, add a legend, and label the neighborhoods using the gCentroid function from the rgeos package:

library(rgeos)

plot(montreal, col=colors)
legend('topleft', legend=c('Starvation area', 'Survivable', "Foodie's heaven"), fill=palette)

centroids = gCentroid(montreal, byid=TRUE)[!is.na(df$score)]
text(centroids$x, centroids$y, labels=montreal@data$NOM[!is.na(df$score)], cex=.7)

[Map: Montreal boroughs shaded by restaurant score, with neighborhood labels]

Voilà! It is quite unfortunate that we only have data for the central neighborhoods of Montreal. To be continued…

Text analysis of Republican presidential debates

Here we analyze the Republican presidential debate transcripts, focusing on three candidates in particular: Trump, Cruz, and Rubio.

First let’s answer the following question:

How often does Trump get mentioned by name by the other two candidates during a debate, versus how often do the other two candidates mention each other’s name?

The transcripts were downloaded from http://www.presidency.ucsb.edu/debates.php and cover the period from August 6th, 2015 to March 21st, 2016.

Here’s the code:


library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)

ref_matrix = function (date){

  #read character data
  text = scan(paste0('debate_', date, '.txt'), what='x', quote=NULL)

  speakers =
    c(  CRUZ    = '',
        TRUMP   = '',
        RUBIO   = ''
    )

  #assign text to the right speaker
  currentSpeaker = NA
  for(word in text){
    #if word ends with :
    if(substr(word,nchar(word),nchar(word))==':'){
      #if word corresponds to one of the speakers of interest
      if(word %in% paste0(names(speakers), ':')){
        #set current speaker
        currentSpeaker = substr(word,1,nchar(word)-1)
      }
      else{
        #if the current speaker is not one of the speakers of interest, set it to NA
        currentSpeaker = NA
      }
    }
    else if(!is.na(currentSpeaker)){
      #if the current speaker is of interest, save what he is saying
      speakers[currentSpeaker] = paste(speakers[currentSpeaker], word)
    }
  }

  #preprocess text
  prez = Corpus(VectorSource(speakers))
  prez = tm_map(prez, tolower)
  prez = tm_map(prez, removeWords, stopwords('english'))
  #remove additional unwanted words
  prez = tm_map(prez, removeWords, c('will', 'going', 'applause', 'get', 'say', 'want', 'let', 'can', 'well', 'know',
                                    'got', 'one', 'two', 'also', 'ever', 'even', 'need', 'every', 'let', 'many'))
  prez = tm_map(prez, removePunctuation, preserve_intra_word_dashes = FALSE)
  prez = tm_map(prez, stemDocument)
  prez = tm_map(prez, stripWhitespace)
  prez = tm_map(prez, removeNumbers)
  prez = tm_map(prez, PlainTextDocument) 

  #make document term matrix
  dtm <- DocumentTermMatrix(prez) 

  #reassign row names (each row is a speaker)
  rownames(dtm) = names(speakers)

  #how many times was donald trump referred to by other candidates
  names = character()
  if('donald' %in% colnames(dtm)){names = c(names, 'donald')}
  if('trump' %in% colnames(dtm)){names = c(names, 'trump')}
  dtm_trump = dtm[,names]
  TRUMP = apply(dtm_trump, 1, sum)

  #how many times was ted cruz referred to by other candidates
  names = character()
  if('ted' %in% colnames(dtm)){names = c(names, 'ted')}
  if('cruz' %in% colnames(dtm)){names = c(names, 'cruz')}
  dtm_cruz = dtm[,names]
  CRUZ = apply(dtm_cruz, 1, sum)

  #how many times was marco rubio referred to by other candidates
  names = character()
  if('marco' %in% colnames(dtm)){names = c(names, 'marco')}
  if('rubio' %in% colnames(dtm)){names = c(names, 'rubio')}
  dtm_rubio = dtm[,names]
  RUBIO = apply(dtm_rubio, 1, sum)

  #summary matrix
  data.frame(TRUMP=TRUMP, CRUZ=CRUZ, RUBIO=RUBIO)
}

dates = c(20150806, 20150916, 20151028, 20151110, 20151215,
          20160114, 20160128, 20160206, 20160213, 20160225,
          20160303)

ref_list = lapply(dates, ref_matrix)

names(ref_list) = dates

trump = sapply(ref_list, function(df) sum(df[rownames(df) != 'TRUMP', 'TRUMP']))
cruz_rubio = sapply(ref_list, function(df) sum(df[rownames(df) == 'CRUZ', 'RUBIO'], df[rownames(df) == 'RUBIO', 'CRUZ']))

m = t(as.matrix(data.frame(trump, cruz_rubio)))
barplot(m, main='Number of times Cruz/Rubio mention Trump vs each other', beside=TRUE, col=c('blue', 'red'),
        legend=c('# times Cruz/Rubio mention Trump', '# times Cruz/Rubio mention each other'),
        args.legend=list(x='topleft', cex=0.75), cex.names=0.75)

We can see that at the beginning of the race, the candidates really didn’t refer to each other much at all.

Things change around the debate of December 15th, 2015, where Cruz and Rubio refer to each other significantly more.
Then in the next debate and every other debate afterwards, Cruz and Rubio collectively refer to Trump more often than they refer to each other.

In particular, in the last two debates (February 25th, 2016 and March 3rd, 2016), they mention each other 10 times in total, whereas they mention Trump 137 times!
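
For reference, since the elements of the trump and cruz_rubio vectors computed above are named by debate date, those two totals can be pulled out directly:

#mentions over the last two debates
sum(cruz_rubio[c('20160225', '20160303')])  #Cruz/Rubio mentioning each other
sum(trump[c('20160225', '20160303')])       #Cruz/Rubio mentioning Trump

which should return the 10 and 137 quoted above.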

Let’s now turn our attention to the words themselves.

We’re gonna change the code a little bit in order to collect all the transcripts in a single character string:

dates = c(20150806, 20150916, 20151028, 20151110, 20151215,
          20160114, 20160128, 20160206, 20160213, 20160225,
          20160303)

read_transcript = function(date){scan(paste0('debate_', date, '.txt'), what='x', quote=NULL)}

#read and collate all transcripts
text = unlist(sapply(dates, read_transcript))

After reusing the same bit of code as in the beginning to preprocess the text, we can answer a few interesting questions.
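
Concretely, that reuse is just the speaker-assignment loop and tm preprocessing steps from ref_matrix(), applied this time to the combined text (a sketch, with the calls copied from above):

speakers = c(CRUZ = '', TRUMP = '', RUBIO = '')

#assign text to the right speaker
currentSpeaker = NA
for(word in text){
  if(substr(word, nchar(word), nchar(word)) == ':'){
    if(word %in% paste0(names(speakers), ':')){
      currentSpeaker = substr(word, 1, nchar(word)-1)
    }
    else{
      currentSpeaker = NA
    }
  }
  else if(!is.na(currentSpeaker)){
    speakers[currentSpeaker] = paste(speakers[currentSpeaker], word)
  }
}

#preprocess text and build the document term matrix (one row per speaker)
prez = Corpus(VectorSource(speakers))
prez = tm_map(prez, tolower)
prez = tm_map(prez, removeWords, stopwords('english'))
prez = tm_map(prez, removeWords, c('will', 'going', 'applause', 'get', 'say', 'want', 'let', 'can', 'well', 'know',
                                   'got', 'one', 'two', 'also', 'ever', 'even', 'need', 'every', 'let', 'many'))
prez = tm_map(prez, removePunctuation, preserve_intra_word_dashes = FALSE)
prez = tm_map(prez, stemDocument)
prez = tm_map(prez, stripWhitespace)
prez = tm_map(prez, removeNumbers)
prez = tm_map(prez, PlainTextDocument)

dtm = DocumentTermMatrix(prez)
rownames(dtm) = names(speakers)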

For each candidate, what are the top 50 most frequent words?

# get indexes of the 50 most frequent words
indexes = apply(dtm, 1, function(v) head(order(v, decreasing=TRUE), 50))

# find the 50 most frequent words
freq_words = apply(indexes, 2, function(v) colnames(dtm)[v])
freq_words

 

##         CRUZ          TRUMP       RUBIO
##    [1,] 'donald'      'people'    'people'
##    [2,] 'president'   'country'   'president'
##    [3,] 'people'      'just'      'country'
##    [4,] 'now'         'think'     'now'
##    [5,] 'tax'         'said'      'america'
##    [6,] 'obama'       'now'       'states'
##    [7,] 'country'     'tell'      'united'
##    [8,] 'said'        'right'     'just'
##    [9,] 'question'    'great'     'world'
##   [10,] 'right'       'like'      'first'
##   [11,] 'back'        'way'       'issue'
##   [12,] 'just'        'look'      'think'
##   [13,] 'america'     'back'      'like'
##   [14,] 'washington'  'take'      'american'
##   [15,] 'years'       'lot'       'way'
##   [16,] 'court'       'much'      'important'
##   [17,] 'american'    'come'      'make'
##   [18,] 'clinton'     'make'      'years'
##   [19,] 'hillary'     'thing'     'immigration'
##   [20,] 'tell'        'never'     'money'
##   [21,] 'fight'       'years'     'tax'
##   [22,] 'law'         'china'     'isis'
##   [23,] 'amnesty'     'world'     'see'
##   [24,] 'isis'        'first'     'said'
##   [25,] 'day'         'talking'   'someone'
##   [26,] 'first'       'good'      'time'
##   [27,] 'look'        'money'     'believe'
##   [28,] 'think'       'win'       'government'
##   [29,] 'like'        'everybody' 'never'
##   [30,] 'state'       'trade'     'back'
##   [31,] 'crosstalk'   'care'      'barack'
##   [32,] 'new'         'something' 'military'
##   [33,] 'barack'      'deal'      'things'
##   [34,] 'flat'        'wall'      'right'
##   [35,] 'keep'        'really'    'americans'
##   [36,] 'percent'     'problem'   'economy'
##   [37,] 'plan'        'believe'   'made'
##   [38,] 'women'       'billion'   'obama'
##   [39,] 'bill'        'saying'    'today'
##   [40,] 'radical'     'crosstalk' 'fact'
##   [41,] 'stage'       'big'       'able'
##   [42,] 'government'  'president' 'hillary'
##   [43,] 'business'    'time'      'question'
##   [44,] 'john'        'wrong'     'donald'
##   [45,] 'marco'       'done'      'clinton'
##   [46,] 'everyone'    'excuse'    'plan'
##   [47,] 'islamic'     'jobs'      'point'
##   [48,] 'millions'    'laughter'  'support'
##   [49,] 'immigration' 'far'       'better'
##   [50,] 'men'         'jeb'       'place'

It would actually be interesting to see, for each candidate, the words in his top 50 that are unique to him, i.e. that are not in the other candidates’ top 50:

# find the number of times each word appears in the matrix
word_count = apply(freq_words, c(1,2), function(x) sum(x == freq_words))

#keep those words that appear only once
unique_words = word_count == 1

l = lapply(colnames(freq_words), function(name) freq_words[unique_words[,name], name])
names(l) = colnames(freq_words)
l

 

## $CRUZ
##  [1] 'washington' 'court'      'fight'      'law'        'amnesty'
##  [6] 'day'        'state'      'new'        'flat'       'keep'
## [11] 'percent'    'women'      'bill'       'radical'    'stage'
## [16] 'business'   'john'       'marco'      'everyone'   'islamic'
## [21] 'millions'   'men'
##
## $TRUMP
##  [1] 'great'     'take'      'lot'       'much'      'come'
##  [6] 'thing'     'china'     'talking'   'good'      'win'
## [11] 'everybody' 'trade'     'care'      'something' 'deal'
## [16] 'wall'      'really'    'problem'   'billion'   'saying'
## [21] 'big'       'wrong'     'done'      'excuse'    'jobs'
## [26] 'laughter'  'far'       'jeb'
##
## $RUBIO
##  [1] 'states'    'united'    'issue'     'important' 'see'
##  [6] 'someone'   'military'  'things'    'americans' 'economy'
## [11] 'made'      'today'     'fact'      'able'      'point'
## [16] 'support'   'better'    'place'

Let’s go ahead and make word clouds out of that:

Trump:
freq = dtm['TRUMP', l$TRUMP]
wordcloud(l$TRUMP, as.vector(freq), colors=brewer.pal(9,'Blues'))

[Word cloud: words unique to Trump's top 50]

Cruz:
freq = dtm['CRUZ', l$CRUZ]
wordcloud(l$CRUZ, as.vector(freq), colors=brewer.pal(9,'Greens'))

[Word cloud: words unique to Cruz's top 50]

Rubio:
freq = dtm['RUBIO', l$RUBIO]
wordcloud(l$RUBIO, as.vector(freq), colors=brewer.pal(9,'Oranges'))

[Word cloud: words unique to Rubio's top 50]

For each candidate, what is the average word length?

#number of words by speaker
nb_words = apply(dtm, 1, sum)
#word lengths
word_lengths = sapply(colnames(dtm), nchar)
#transform word count into total character count matrix
character_counts = t(apply(dtm, 1, function(v) v * word_lengths))
#total character count by speaker
total_character_counts = apply(character_counts, 1, function(v) sum(v))
#divide total character count by number of words
round(total_character_counts / nb_words, digits=1)
##  CRUZ TRUMP RUBIO
##   6.3   5.9   6.3

Finally, for each candidate, how diverse is their vocabulary?

To quantify this, we are gonna count the number of unique words per 1000 words:

apply(dtm, 1, function(v) round(sum(v != 0)/sum(v)*1000, digits=1))

 

##  CRUZ TRUMP RUBIO
## 248.8 176.3 218.0

We see that Trump uses both a smaller vocabulary and shorter words than his opponents.