Fun with barplots

I love barplots! Here’s a simple function that allows me to do some quick 2-way exploration of my data when I have categorical variables.

Let’s give ourselves some data to play with:

gender = c('Male', 'Female')
age_band = c('[16-24]', '[25-49]', '50+')
buying_ratio = c('1.Low', '2.Medium', '3.High')

n = 100

# use separate *_v names so the level vectors above are not overwritten
gender_v = sample(gender, n, replace=TRUE)
age_band_v = sample(age_band, n, replace=TRUE)
buying_ratio_v = sample(buying_ratio, n, replace=TRUE)

df = data.frame(gender=gender_v, age_band=age_band_v, buying_ratio=buying_ratio_v)

df
## gender age_band buying_ratio
## 1 Male [16-24] 1.Low
## 2 Male [25-49] 2.Medium
## 3 Female [25-49] 2.Medium
## 4 Male [25-49] 3.High
## 5 Male [16-24] 1.Low
## 6 Female [25-49] 1.Low
## 7 Female 50+ 1.Low
## 8 Male [16-24] 1.Low
## 9 Female 50+ 3.High
## 10 Female 50+ 3.High
## 11 Female 50+ 1.Low
## 12 Female 50+ 2.Medium
## 13 Male [25-49] 2.Medium
## 14 Male [25-49] 2.Medium
## 15 Male [25-49] 1.Low
## 16 Male [16-24] 2.Medium
## 17 Female 50+ 2.Medium
## 18 Female [16-24] 2.Medium
## 19 Male [25-49] 2.Medium
## 20 Female [16-24] 2.Medium
## 21 Male [25-49] 3.High
## 22 Male [25-49] 2.Medium
## 23 Male 50+ 1.Low
## 24 Female [25-49] 2.Medium
## 25 Female [25-49] 3.High
## 26 Male 50+ 2.Medium
## 27 Female 50+ 1.Low
## 28 Male 50+ 3.High
## 29 Female [16-24] 2.Medium
## 30 Female [16-24] 3.High
## 31 Male 50+ 3.High
## 32 Male 50+ 3.High
## 33 Female [16-24] 3.High
## 34 Female [16-24] 2.Medium
## 35 Female 50+ 3.High
## 36 Male [25-49] 3.High
## 37 Female 50+ 1.Low
## 38 Male [16-24] 2.Medium
## 39 Male [25-49] 3.High
## 40 Female [25-49] 3.High
## 41 Male [25-49] 3.High
## 42 Female [25-49] 2.Medium
## 43 Female [25-49] 3.High
## 44 Female 50+ 1.Low
## 45 Female [25-49] 2.Medium
## 46 Male 50+ 3.High
## 47 Male [16-24] 3.High
## 48 Female 50+ 3.High
## 49 Female [25-49] 1.Low
## 50 Male [16-24] 1.Low
## 51 Female [16-24] 3.High
## 52 Male [16-24] 3.High
## 53 Male [25-49] 1.Low
## 54 Female [16-24] 1.Low
## 55 Male [16-24] 1.Low
## 56 Male 50+ 2.Medium
## 57 Female 50+ 3.High
## 58 Female [25-49] 1.Low
## 59 Female [25-49] 2.Medium
## 60 Female [16-24] 3.High
## 61 Male 50+ 2.Medium
## 62 Male [25-49] 3.High
## 63 Female [16-24] 2.Medium
## 64 Female 50+ 2.Medium
## 65 Male [16-24] 2.Medium
## 66 Male [25-49] 3.High
## 67 Male [16-24] 1.Low
## 68 Male 50+ 3.High
## 69 Male 50+ 2.Medium
## 70 Male [25-49] 2.Medium
## 71 Female [25-49] 2.Medium
## 72 Female [25-49] 1.Low
## 73 Male [16-24] 3.High
## 74 Female 50+ 3.High
## 75 Female [25-49] 3.High
## 76 Female [16-24] 2.Medium
## 77 Female 50+ 1.Low
## 78 Female [16-24] 1.Low
## 79 Male [16-24] 1.Low
## 80 Female [25-49] 3.High
## 81 Female [25-49] 1.Low
## 82 Male [16-24] 2.Medium
## 83 Female [25-49] 1.Low
## 84 Male [16-24] 2.Medium
## 85 Male 50+ 3.High
## 86 Female [25-49] 3.High
## 87 Female [16-24] 1.Low
## 88 Male 50+ 3.High
## 89 Female [25-49] 3.High
## 90 Male 50+ 2.Medium
## 91 Male [25-49] 3.High
## 92 Male [16-24] 1.Low
## 93 Female [25-49] 2.Medium
## 94 Male 50+ 1.Low
## 95 Female [16-24] 2.Medium
## 96 Female 50+ 3.High
## 97 Female [16-24] 3.High
## 98 Male 50+ 3.High
## 99 Male [25-49] 3.High
## 100 Male [25-49] 2.Medium

and a function to visualize conditional distributions:

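# Stacked barplot of varname2 within each level of varname.
# Reads from the global data frame df; the default palette needs RColorBrewer loaded.
plot_var = function(varname, varname2, col=brewer.pal(9, "Oranges")){
  # rows = levels of varname2, columns = levels of varname
  var_data = t(table(df[,varname], df[,varname2]))
  var_data_ordered = var_data[order(rownames(var_data)),]

  # top of each stacked segment, and the height of each segment
  bar_heights = sapply(colnames(var_data_ordered), function(x) cumsum(var_data_ordered[,x]))
  bar_incr = rbind(bar_heights[1,], diff(bar_heights))

  # share of each segment within its bar, as a text label
  percentages = apply(bar_incr, 2, function(x) paste(round(x/sum(x), 2)*100, '%'))

  # labels go at the vertical midpoint of each segment
  ypos = bar_heights - bar_incr/2

  # number of observations per bar, used as the bar width
  bar_widths = apply(var_data, 2, sum)

  bp = barplot(var_data_ordered, main=paste(varname2, 'by', varname),
               names.arg=paste(colnames(var_data), '(', bar_widths, ')'),
               beside=FALSE, col=col,
               legend.text=rownames(var_data_ordered), args.legend=list(x='topleft', cex=0.6),
               width=bar_widths)

  # bp holds the x midpoint of each bar
  for(i in seq_along(bp)){
    text(bp[i], ypos[,i], percentages[,i])
  }
}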
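
If you want to sanity-check the percentages that end up on the bars, prop.table() gives the same conditional distribution directly. This is only a sketch on a tiny made-up data frame (toy is hypothetical, not the random sample above), so the numbers are easy to verify by hand:

```r
# Hypothetical toy data, small enough to check by eye
toy = data.frame(
  gender       = c('Male', 'Male', 'Male', 'Female', 'Female'),
  buying_ratio = c('1.Low', '3.High', '3.High', '1.Low', '2.Medium')
)

# rows = buying_ratio, columns = gender (same layout plot_var stacks)
counts = table(toy$buying_ratio, toy$gender)

# margin = 2 divides every column by its own total,
# i.e. the distribution of buying_ratio within each gender
cond = prop.table(counts, margin = 2)

print(round(cond * 100))
```

Each column of cond sums to 1, which is why the labels inside any single bar should add up to roughly 100 % (give or take rounding).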

We can call the function like so:

library(RColorBrewer)
plot_var('gender', 'buying_ratio', brewer.pal(3, 'Oranges'))

[Figure: stacked barplot "buying_ratio by gender"]

and the other way around

plot_var('buying_ratio', 'gender', c('indianred1', 'lightblue2'))

[Figure: stacked barplot "gender by buying_ratio"]

Note that the widths of the bars are proportional to the number of observations for each value.
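
The width trick on its own is short. Here is a minimal sketch, again on a made-up data frame (toy and its values are assumptions for illustration), that passes the per-category counts to the width argument of barplot():

```r
# Hypothetical toy data: 3 Male rows, 2 Female rows
toy = data.frame(
  gender       = c('Male', 'Male', 'Male', 'Female', 'Female'),
  buying_ratio = c('1.Low', '3.High', '3.High', '1.Low', '2.Medium')
)

counts = table(toy$buying_ratio, toy$gender)

# one width per bar: the number of observations in that column
bar_widths = colSums(counts)

# barplot() returns the x midpoint of each bar, handy for adding text later
bp = barplot(counts, width = bar_widths,
             names.arg = paste0(colnames(counts), ' (', bar_widths, ')'))
```

With these counts the Male bar is drawn one and a half times as wide as the Female bar, so a thin bar is a visual hint that its percentages rest on few observations.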

2 comments

  1. Nice one,

    Here are a couple of points/suggestions:

    1. On my screen, the top part of the legend gets cut off. I am not sure if this happens to others.
    2. You might want to add a default color to the args.
    plot_var = function(varname, varname2, col=brewer.pal(9, 'Oranges')){
