Montreal Restaurants

I’m currently looking for a new apartment. One important criterion for me is proximity to services and entertainment, in particular restaurants. Or maybe I was just looking for a pretext to do some R…

Anyhow, I thought I would do some web scraping + mapping in R and see what neighborhoods in Montreal are the best when it comes to dining out.

The first step was to decide where to get the data from. That decision turned out to be easy: the first sites I looked at served JavaScript-rendered pages (instead of static HTML), which meant standard web scraping would not work. Then I tried Yelp, which fortunately serves static HTML (although any search view is capped at 1000 results).
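A quick way to check whether a page can be scraped this way is to fetch it and count the listing nodes. Here is a minimal sanity check, using the same URL and the '.main-attributes' selector that appear in the full script below:

library(rvest)

# a static-HTML results page should contain 10 listings;
# a JavaScript-rendered page would yield 0 nodes here
page = read_html('http://www.yelp.ca/search?find_loc=Montr%C3%A9al,+QC,+Canada&start=0&cflt=food')
length(html_nodes(page, '.main-attributes'))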

Below is the code to scrape all 1000 results for Montreal (100 pages of 10 results each) and put them in a data frame. I then calculate points for each neighborhood as the sum-product of its restaurants’ ratings and review counts:

library(rvest)
library(stringr)
library(magrittr)

scrape_page = function(start){

  restos = read_html(paste0('http://www.yelp.ca/search?find_loc=Montr%C3%A9al,+QC,+Canada&start=',
                            start, '&cflt=food'))

  n1 = restos %>% html_nodes('.main-attributes')
  n2 = restos %>% html_nodes('.secondary-attributes')

  df = data.frame(neighborhood=factor(), rating=numeric(), votes=numeric())

  #we have to extract the nodes one by one because sometimes the address field can be missing
  for(i in 1:10){

    main_attr = n1 %>% extract2(i)
    sec_attr = n2 %>% extract2(i)

    if(length(sec_attr %>% html_nodes('.neighborhood-str-list')) == 1
       && length(main_attr %>% html_nodes('.star-img')) == 1
       && length(main_attr %>% html_nodes('.review-count')) == 1){

      neighborhood = iconv(sec_attr %>% html_nodes('.neighborhood-str-list') %>% html_text() %>% str_trim(),
                            from='UTF-8', to='UTF-8')

      # rename the neighborhoods so the names match those found in the spatial data (see below)
      if(str_detect(neighborhood, 'Notre-Dame') || str_detect(neighborhood, 'des-Neiges')){
        neighborhood = 'Côte-des-Neiges-Notre-Dame-de-Grâce'
      }
      else if(str_detect(neighborhood, 'Plateau')){
        neighborhood = 'Le Plateau-Mont-Royal'
      }
      else if(str_detect(neighborhood, 'Sud')){
        neighborhood = 'Le Sud-Ouest'
      }
      else if(str_detect(neighborhood, 'Ville-Marie')){
        neighborhood = 'Ville-Marie'
      }
      else if(str_detect(neighborhood, 'Villeray')){
        neighborhood = 'Villeray-Saint-Michel-Parc-Extension'
      }
      else if(str_detect(neighborhood, 'Saint-Luc')){
        neighborhood = 'Côte-Saint-Luc'
      }
      else if(str_detect(neighborhood, 'Trembles')){
        neighborhood = 'Rivière-des-Prairies-Pointe-aux-Trembles'
      }
      else if(str_detect(neighborhood, 'onard')){
        neighborhood = 'Saint-Léonard'
      }
      else if(str_detect(neighborhood, 'Montr')){
        neighborhood = 'Montréal-Ouest'
      }
      else if(str_detect(neighborhood, 'Ahun')){
        neighborhood = 'Ahuntsic-Cartierville'
      }

      rating = as.numeric(main_attr %>% html_nodes('.star-img') %>%
                          html_attr('title') %>% str_extract('[0-9]\\.[0-9]'))

      votes = as.numeric(main_attr %>% html_nodes('.review-count') %>% html_text() %>% str_extract('[0-9]+'))

      df = rbind(df, data.frame(neighborhood=neighborhood, rating=rating, votes=votes))
    }
  }

  df
}

all_ratings = lapply(seq(from=0, to=990, by=10), scrape_page)
all_ratings_df = do.call(rbind, all_ratings)
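Note that this fires 100 requests in quick succession (Yelp serves 10 results per page, so start runs from 0 to 990). If you rerun the scrape, a short pause between pages is kinder to the server; a minimal wrapper (the one-second delay is an arbitrary choice):

scrape_page_politely = function(start){
  Sys.sleep(1)  # pause between requests
  scrape_page(start)
}

all_ratings = lapply(seq(from=0, to=990, by=10), scrape_page_politely)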

Now we tally up the points by neighborhood. We’ll score each neighborhood according to both the average quality of its restaurants AND its total number of restaurants, applying the square root to the latter so it carries less weight (the idea being that I would rather live in a neighborhood with 10 good restaurants than in one with 100 crappy ones).

all_ratings_df$points = all_ratings_df$rating * all_ratings_df$votes

points = aggregate(points ~ neighborhood, data=all_ratings_df, sum)

restaurant_count = aggregate(rating ~ neighborhood, data=all_ratings_df, length)
colnames(restaurant_count)[2] = 'restaurant_count'

votes = aggregate(votes ~ neighborhood, data=all_ratings_df, sum)

avg_ratings = data.frame(neighborhood=points[,1],
                         avg_rating=points[,2]/votes[,2])

scores = data.frame(neighborhood=points[,1],
                    score=avg_ratings[,2] * sqrt(restaurant_count[,2]))
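To get a feel for how much the square root dampens the count term, compare a tenfold difference in restaurant count under both weightings (toy numbers, not the scraped data):

# with the square root, a 10x advantage in restaurant count
# only multiplies the score by ~3.16...
sqrt(100) / sqrt(10)  # 3.162278
# ...whereas a linear weight would multiply it by 10
100 / 10              # 10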

Here are the neighborhood average ratings (weighted by the number of reviews):

##                            neighborhood avg_rating
## 1                 Ahuntsic-Cartierville   4.103774
## 2                              Brossard   5.000000
## 3   Côte-des-Neiges-Notre-Dame-de-Grâce   4.112676
## 4                        Côte-Saint-Luc   5.000000
## 5                 Le Plateau-Mont-Royal   4.182097
## 6         Mercier-Hochelaga-Maisonneuve   4.043605
## 7                            Mont-Royal   3.545455
## 8                             Outremont   4.004098
## 9             Rosemont-La Petite-Patrie   4.265391
## 10                        Saint-Laurent   3.836735
## 11                        Saint-Léonard   4.240000
## 12                            Sud-Ouest   4.015648
## 13                               Verdun   4.094203
## 14                          Ville-Marie   3.936800
## 15 Villeray-Saint-Michel-Parc-Extension   4.342500

and scores (which is the average rating times the square root of the number of restaurants in the neighborhood):

scores
##                            neighborhood     score
## 1                 Ahuntsic-Cartierville 18.352633
## 2                              Brossard  7.071068
## 3   Côte-des-Neiges-Notre-Dame-de-Grâce 33.913996
## 4                        Côte-Saint-Luc  5.000000
## 5                 Le Plateau-Mont-Royal 69.478213
## 6         Mercier-Hochelaga-Maisonneuve 20.218023
## 7                            Mont-Royal  7.927877
## 8                             Outremont 12.662071
## 9             Rosemont-La Petite-Patrie 51.184692
## 10                        Saint-Laurent  9.398042
## 11                        Saint-Léonard  7.343895
## 12                            Sud-Ouest 28.957252
## 13                               Verdun 10.028708
## 14                          Ville-Marie 63.356803
## 15 Villeray-Saint-Michel-Parc-Extension 27.464381

So now that we have the scores aggregated by neighborhood, we need a map to draw them on:

library(maptools)

tmp_dir = tempdir()
url_data = 'http://donnees.ville.montreal.qc.ca/dataset/00bd85eb-23aa-4669-8f1b-ba9a000e3dd8/resource/62f7ce10-36ce-4bbd-b419-8f0a10d3b280/download/limadmin-shp.zip'
zip_file = paste0(tmp_dir, '/montreal.zip')
download.file(url_data, zip_file)
unzip(zip_file, exdir=tmp_dir)

montreal = readShapeSpatial(paste0(tmp_dir, '/LIMADMIN.shp'))
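A side note if you try to reproduce this: maptools (and readShapeSpatial with it) has since been retired. The same shapefile can be read with sf and coerced to the sp class used in the rest of the post; a sketch, assuming sf and sp are installed:

library(sf)
montreal = as(st_read(paste0(tmp_dir, '/LIMADMIN.shp')), 'Spatial')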

The next step is to merge the points from the web scraping step onto the shape data. Note that unfortunately, not all neighborhoods are represented on the Yelp website.

df = merge(montreal@data, scores, by.x='NOM', by.y='neighborhood', all.x=TRUE)

# reorder df to match the polygon order in montreal@data (merge() sorts its
# output by NOM, so rank() maps each name back to its row in the sorted result)
df = df[rank(montreal@data$NOM),]
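A quick sanity check that the reordering worked (the name columns should now line up row by row):

stopifnot(all(as.character(df$NOM) == as.character(montreal@data$NOM)))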

We can assign a category to each neighborhood based on its score (I chose to make 3 categories), and attribute a color to each category.

library(RColorBrewer)
library(classInt)

ci = classIntervals(df$score, n=3, style='quantile')

palette = brewer.pal(8, 'YlOrRd')[c(2,4,6)]

colors = findColours(ci, palette)
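It can be worth peeking at the quantile cut points before mapping; classIntervals stores them in the brks element:

ci$brks  # the break values separating the three categories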

Finally, we draw the map, add a legend, and write in the neighborhood names using the gCentroid function from the rgeos package:

library(rgeos)

plot(montreal, col=colors)
legend('topleft', legend=c('Starvation area', 'Survivable', "Foodie's heaven"), fill=palette)

centroids = gCentroid(montreal, byid=TRUE)[!is.na(df$score)]
text(centroids$x, centroids$y, labels=montreal@data$NOM[!is.na(df$score)], cex=.7)
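To write the map to a file instead of the screen, the drawing calls can be wrapped in a graphics device (standard grDevices usage; the file name and dimensions are arbitrary):

png('montreal_restaurants.png', width=900, height=700)
# ... the plot(), legend() and text() calls above ...
dev.off()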

[Map: Montreal boroughs shaded by restaurant score, with legend and neighborhood labels]

Voilà! It is quite unfortunate that we only get scores for the central neighborhoods of Montreal. To be continued…

Comments

  1. Nice! One of my first little R programs used httr, jsonlite, XML and ggmap to toss Yelp info onto a map (below). I generally did the smallest projects possible for a while, with the best APIs I could find, just to learn various capabilities of R. So it’s tiny.

    But anyway, the “short” approach didn’t turn out to be possible: I found that the Yelp web service limited me to 20 results and, disappointed (and interested in other things), I just moved on for a while. So it’s great to see this; I’ll try it out.

    —————-

    # install.packages("httr")
    # install.packages("httpuv")
    # install.packages("jsonlite")
    # install.packages("XML")

    # setwd("")

    #load packages
    library(httr)
    library(httpuv)
    library(jsonlite)
    library(XML)

    # set keys
    consumerKey = ""
    consumerSecret = ""
    token = ""
    token_secret = ""

    # authorization
    application = oauth_app("YELP", key = consumerKey, secret = consumerSecret)
    signature = sign_oauth1.0(application, token = token, token_secret = token_secret)

    # PDX restaurants – limit is 20!!! What!?
    locationdataList = fromJSON( toJSON( content(GET("http://api.yelp.com/v2/search/?limit=20&location=Portland&term=restaurant", signature)) ))

    yelpDF = as.data.frame(locationdataList)
    names(yelpDF)
    View(yelpDF)

    library(ggmap)

    # the data.frame from Yelp has lat & lon but let's do addresses
    # yelp JSON stuff is embedded so get DF from inside DF
    yelpLoc <- as.data.frame(yelpDF$businesses.location)

    # use geocode to get a list of lat & lon
    locations <- geocode( as.character(yelpLoc$display_address) )
    View(locations)

    # dump it into map
    qmap(location = "Portland, Oregon", maptype = "roadmap", zoom = 12) +
    geom_point(data = locations, mapping = aes(x = lon, y = lat), color = "red", size = 3)

