Montreal Restaurants

I’m currently looking for a new apartment. One important criteria for me is the proximity to services and entertainment, in particular restaurants. Or maybe I was just looking for a pretext to do some R…

Anyhow, I thought I would do some web scraping + mapping in R and see what neighborhoods in Montreal are the best when it comes to dining out.

The first step was to decide where to get the data from. Well that decision was easy, because the first sites I looked at had javascript pages (instead of static html), which meant standard web scraping would not work. Then I tried Yelp, and fortunately enough they use static html (although they limit any of their views to 1000 results).

Below is the code to scrape all 1000 results for Montreal, and put the result in a dataframe. I then calculate points for each neighborhood as the sumproduct of ratings and number of reviews:

library(rvest)
library(stringr)
library(magrittr)

scrape_page = function(start){

  restos = read_html(paste0('http://www.yelp.ca/search?find_loc=Montr%C3%A9al,+QC,+Canada&start=',
                            start, '&cflt=food'))

  n1 = restos %>% html_nodes('.main-attributes')
  n2 = restos %>% html_nodes('.secondary-attributes')

  df = data.frame(neighborhood=factor(), rating=numeric(), votes=numeric())

  #we have to extract the nodes one by one because sometimes the address field can be missing
  for(i in 1:10){

    main_attr = n1 %>% extract2(i)
    sec_attr = n2 %>% extract2(i)

    if(length(sec_attr %>% html_nodes('.neighborhood-str-list')) == 1
       && length(main_attr %>% html_nodes('.star-img')) == 1
       && length(main_attr %>% html_nodes('.review-count')) == 1){

      neighborhood = iconv(sec_attr %>% html_nodes('.neighborhood-str-list') %>% html_text() %>% str_trim(),
                            from='UTF-8', to='UTF-8')

      # rename the neighborhoods so the names match those found in the spatial sata (see below)
      if(str_detect(neighborhood, 'Notre-Dame') || str_detect(neighborhood, 'des-Neiges')){
        neighborhood = 'Côte-des-Neiges-Notre-Dame-de-Grâce'
      }
      else if(str_detect(neighborhood, 'Plateau')){
        neighborhood = 'Le Plateau-Mont-Royal'
      }
      else if(str_detect(neighborhood, 'Sud')){
        neighboorhood = 'Le Sud-Ouest'
      }
      else if(str_detect(neighborhood, 'Ville-Marie')){
        neighboorhood = 'Ville-Marie'
      }
      else if(str_detect(neighborhood, 'Villeray')){
        neighborhood = 'Villeray-Saint-Michel-Parc-Extension'
      }
      else if(str_detect(neighborhood, 'Saint-Luc')){
        neighborhood = 'Côte-Saint-Luc'
      }
      else if(str_detect(neighborhood, 'Trembles')){
        neighborhood = 'Rivière-des-Prairies-Pointe-aux-Trembles'
      }
      else if(str_detect(neighborhood, 'onard')){
        neighborhood = 'Saint-Léonard'
      }
      else if(str_detect(neighborhood, 'Montr')){
        neighborhood = 'Montréal-Ouest'
      }
      else if(str_detect(neighborhood, 'Ahun')){
        neighborhood = 'Ahuntsic-Cartierville'
      }

      rating = as.numeric(main_attr %>% html_nodes('.star-img') %>%
                          html_attr('title') %>% str_extract('[0-9]\\.[0-9]'))

      votes = as.numeric(main_attr %>% html_nodes('.review-count') %>% html_text() %>% str_extract('[0-9]+'))

      df = rbind(df, data.frame(neighborhood=neighborhood, rating=rating, votes=votes))
    }
  }

  df
}

all_ratings = lapply(seq(from=0, to=990, by=10), scrape_page)
all_ratings_df = do.call(rbind, all_ratings)

Now we tally up the points by neighborhood, and we’re gonna score each neighborhood according to the average quality of its restaurants AND to its total number of restaurants, using the square root function on the latter to give it less weight (the idea being I would prefer to live in a neighborhood with 10 good restaurants rather than 100 crappy ones).

all_ratings_df$points = all_ratings_df$rating * all_ratings_df$votes

points = aggregate(points ~ neighborhood, data=all_ratings_df, sum)

restaurant_count = aggregate(rating ~ neighborhood, data=all_ratings_df, length)
colnames(restaurant_count)[2] = 'restaurant_count'

votes = aggregate(votes ~ neighborhood, data=all_ratings_df, sum)

avg_ratings = data.frame(neighborhood=points[,1],
avg_rating=points[,2]/votes[,2])

scores = data.frame(neighborhood=points[,1],
score=avg_ratings[,2] * sqrt(restaurant_count[,2]))

Here are the neighborhood average ratings:

## neighborhood avg_rating
## 1 Ahuntsic-Cartierville 4.103774
## 2 Brossard 5.000000
## 3 Côte-des-Neiges-Notre-Dame-de-Grâce 4.112676
## 4 Côte-Saint-Luc 5.000000
## 5 Le Plateau-Mont-Royal 4.182097
## 6 Mercier-Hochelaga-Maisonneuve 4.043605
## 7 Mont-Royal 3.545455
## 8 Outremont 4.004098
## 9 Rosemont-La Petite-Patrie 4.265391
## 10 Saint-Laurent 3.836735
## 11 Saint-Léonard 4.240000
## 12 Sud-Ouest 4.015648
## 13 Verdun 4.094203
## 14 Ville-Marie 3.936800
## 15 Villeray-Saint-Michel-Parc-Extension 4.342500

and scores (which is the average rating times the square root of the number of restaurants in the neighborhood):

scores

## neighborhood score
## 1 Ahuntsic-Cartierville 18.352633
## 2 Brossard 7.071068
## 3 Côte-des-Neiges-Notre-Dame-de-Grâce 33.913996
## 4 Côte-Saint-Luc 5.000000
## 5 Le Plateau-Mont-Royal 69.478213
## 6 Mercier-Hochelaga-Maisonneuve 20.218023
## 7 Mont-Royal 7.927877
## 8 Outremont 12.662071
## 9 Rosemont-La Petite-Patrie 51.184692
## 10 Saint-Laurent 9.398042
## 11 Saint-Léonard 7.343895
## 12 Sud-Ouest 28.957252
## 13 Verdun 10.028708
## 14 Ville-Marie 63.356803
## 15 Villeray-Saint-Michel-Parc-Extension 27.464381

So now that we have the scores aggregated by neighborhood, we need a map to draw them on:

library(maptools)

tmp_dir = tempdir()
url_data = 'http://donnees.ville.montreal.qc.ca/dataset/00bd85eb-23aa-4669-8f1b-ba9a000e3dd8/resource/62f7ce10-36ce-4bbd-b419-8f0a10d3b280/download/limadmin-shp.zip'
zip_file = paste0(tmp_dir, '/montreal.zip')
download.file(url_data, zip_file)
unzip(zip_file, exdir=tmp_dir)

montreal = readShapeSpatial(paste0(tmp_dir, '/LIMADMIN.shp'))

The next step is to merge the points from the web scraping step onto the shape data. Note that unfortunately, not all neighborhoods are represented on the Yelp website.

df = merge(montreal@data, scores, by.x='NOM', by.y='neighborhood', all.x=TRUE)

#reorder df to match montreal@data
df = df[rank(montreal@data$NOM),]

We can assign a category to each neighborhoods based on their points (I chose to make 3 categories), and attribute a color to each category.

library(RColorBrewer)
library(classInt)

ci = classIntervals(df$score, n=3, style='quantile')

palette = brewer.pal(8, 'YlOrRd')[c(2,4,6)]

colors = findColours(ci, palette)

Finally we draw the map, add a legend and write the names of the neighborhoods as well using the gCentroid function from the rgeos package

library(rgeos)

plot(montreal, col=colors)
legend('topleft', legend=c('Starvation area', 'Survivable', 'Foodie's heaven'), fill=palette)

centroids = gCentroid(montreal, byid=TRUE)[!is.na(df$score)]
text(centroids$x, centroids$y, labels=montreal@data$NOM[!is.na(df$score)], cex=.7)

Rplot

Voila ! It is quite unfortunate that we only have the central neighborhoods of Montreal. To be continued…

Nice! One of my first little R programs was trying HTTR, JSONLITE, XML and GGMAP to toss Yelp info onto a map (below). I generally did the smallest projects possible for a while with the best APIs I could find just to learn various capabilities of R. SO it’s tiny.

But anyway, the “short” approach didn’t turn out to be possible, as I found that the Yelp web service limited me to 20 results and disappointed (and interested in other things) just moved on for a while. So it’s great to see this, I’ll try it out.

—————-

# install.packages(“httr”)
# install.packages(“httpuv”)
# install.packages(“jsonlite”)
# install.packages(“XML”)

# setwd(“)

#load packages
library(httr)
library(httpuv)
library(jsonlite)
library(XML)

# set keys
consumerKey = “”
consumerSecret = “”
token = “”
token_secret = “”

# authorization
application = oauth_app(“YELP”, key = consumerKey, secret = consumerSecret)
signature = sign_oauth1.0(application, token = token, token_secret = token_secret)

# PDX restaurants – limit is 20!!! What!?
locationdataList = fromJSON( toJSON( content(GET(“http://api.yelp.com/v2/search/?limit=20&location=Portland&term=restaurant”, signature)) ))

yelpDF = as.data.frame(locationdataList)
names(yelpDF)
View(yelpDF)

library(ggmap)

# the data.frame from Yelp has lat & lon but let’s do addresses
# yelp JSON stuff is embedded so get DF from inside DF
yelpLoc <= as.data.frame(yelpDF$businesses.location)

# use geocode to get a list of lat & lon
locations <- geocode( as.character(yelpLoc$display_address) )
View(locations)

# dump it into map
qmap(location = "Portland, Oregon", maptype = "roadmap", zoom = 12) +
geom_point(data = locations, mapping = aes(x = lon, y = lat), color = "red", size = 3)

LikeLike

6 comments

D says:

March 29, 2016 at 4:03 pm

This is well done!

LikeLike

1. Karim L says:
  
  March 30, 2016 at 7:52 am
  
  Thank you so much ! It’s in my plans to revisit and improve on this mini-project actually…
  
  LikeLike
  
Monkey King (@majesticprimate) says:

April 24, 2016 at 8:28 pm

Nice! One of my first little R programs was trying HTTR, JSONLITE, XML and GGMAP to toss Yelp info onto a map (below). I generally did the smallest projects possible for a while with the best APIs I could find just to learn various capabilities of R. SO it’s tiny.

But anyway, the “short” approach didn’t turn out to be possible, as I found that the Yelp web service limited me to 20 results and disappointed (and interested in other things) just moved on for a while. So it’s great to see this, I’ll try it out.

—————-

# install.packages(“httr”)
# install.packages(“httpuv”)
# install.packages(“jsonlite”)
# install.packages(“XML”)

# setwd(“)

#load packages
library(httr)
library(httpuv)
library(jsonlite)
library(XML)

# set keys
consumerKey = “”
consumerSecret = “”
token = “”
token_secret = “”

# authorization
application = oauth_app(“YELP”, key = consumerKey, secret = consumerSecret)
signature = sign_oauth1.0(application, token = token, token_secret = token_secret)

# PDX restaurants – limit is 20!!! What!?
locationdataList = fromJSON( toJSON( content(GET(“http://api.yelp.com/v2/search/?limit=20&location=Portland&term=restaurant”, signature)) ))

yelpDF = as.data.frame(locationdataList)
names(yelpDF)
View(yelpDF)

library(ggmap)

# the data.frame from Yelp has lat & lon but let’s do addresses
# yelp JSON stuff is embedded so get DF from inside DF
yelpLoc <= as.data.frame(yelpDF$businesses.location)

# use geocode to get a list of lat & lon
locations <- geocode( as.character(yelpLoc$display_address) )
View(locations)

# dump it into map
qmap(location = "Portland, Oregon", maptype = "roadmap", zoom = 12) +
geom_point(data = locations, mapping = aes(x = lon, y = lat), color = "red", size = 3)

LikeLike

1. Monkey King (@majesticprimate) says:
  
  April 24, 2016 at 8:33 pm
  
  Oh sorry it’s been too long … as I think you mentioned it limited to 1,000. I cut it to 20 to learn. Now I remember the problem … it did not give me access to my own profile just the Yelp general web site … hence not handy for me.
  
  LikeLike
  
2. Karim L says:
  
  April 25, 2016 at 1:53 am
  
  Hey Monkey King! Have you heard about the Yelp academic dataset ? https://www.yelp.ca/dataset_challenge
  It’s in JSON format as well and the amount of information in there is staggering.
  It’s not on the top of my to-do list but definitely something I want to look at in the future, you might be interested too.
  
  LikeLike
  
3. Karim L says:
  
  April 25, 2016 at 3:16 pm
  
  Interesting approach (using the API). I wonder how long it takes for the 20 hard-limit to be reset…
  
  LikeLike

KLonData

Montreal Restaurants

6 comments

Leave a comment Cancel reply

Share this:

Related

6 comments

Leave a comment Cancel reply