I’m currently looking for a new apartment. One important criteria for me is the proximity to services and entertainment, in particular restaurants. Or maybe I was just looking for a pretext to do some R…
Anyhow, I thought I would do some web scraping + mapping in R and see what neighborhoods in Montreal are the best when it comes to dining out.
The first step was to decide where to get the data from. Well that decision was easy, because the first sites I looked at had javascript pages (instead of static html), which meant standard web scraping would not work. Then I tried Yelp, and fortunately enough they use static html (although they limit any of their views to 1000 results).
Below is the code to scrape all 1000 results for Montreal, and put the result in a dataframe. I then calculate points for each neighborhood as the sumproduct of ratings and number of reviews:
library(rvest) library(stringr) library(magrittr) scrape_page = function(start){ restos = read_html(paste0('http://www.yelp.ca/search?find_loc=Montr%C3%A9al,+QC,+Canada&start=', start, '&cflt=food')) n1 = restos %>% html_nodes('.main-attributes') n2 = restos %>% html_nodes('.secondary-attributes') df = data.frame(neighborhood=factor(), rating=numeric(), votes=numeric()) #we have to extract the nodes one by one because sometimes the address field can be missing for(i in 1:10){ main_attr = n1 %>% extract2(i) sec_attr = n2 %>% extract2(i) if(length(sec_attr %>% html_nodes('.neighborhood-str-list')) == 1 && length(main_attr %>% html_nodes('.star-img')) == 1 && length(main_attr %>% html_nodes('.review-count')) == 1){ neighborhood = iconv(sec_attr %>% html_nodes('.neighborhood-str-list') %>% html_text() %>% str_trim(), from='UTF-8', to='UTF-8') # rename the neighborhoods so the names match those found in the spatial sata (see below) if(str_detect(neighborhood, 'Notre-Dame') || str_detect(neighborhood, 'des-Neiges')){ neighborhood = 'Côte-des-Neiges-Notre-Dame-de-Grâce' } else if(str_detect(neighborhood, 'Plateau')){ neighborhood = 'Le Plateau-Mont-Royal' } else if(str_detect(neighborhood, 'Sud')){ neighboorhood = 'Le Sud-Ouest' } else if(str_detect(neighborhood, 'Ville-Marie')){ neighboorhood = 'Ville-Marie' } else if(str_detect(neighborhood, 'Villeray')){ neighborhood = 'Villeray-Saint-Michel-Parc-Extension' } else if(str_detect(neighborhood, 'Saint-Luc')){ neighborhood = 'Côte-Saint-Luc' } else if(str_detect(neighborhood, 'Trembles')){ neighborhood = 'Rivière-des-Prairies-Pointe-aux-Trembles' } else if(str_detect(neighborhood, 'onard')){ neighborhood = 'Saint-Léonard' } else if(str_detect(neighborhood, 'Montr')){ neighborhood = 'Montréal-Ouest' } else if(str_detect(neighborhood, 'Ahun')){ neighborhood = 'Ahuntsic-Cartierville' } rating = as.numeric(main_attr %>% html_nodes('.star-img') %>% html_attr('title') %>% str_extract('[0-9]\\.[0-9]')) votes = as.numeric(main_attr %>% html_nodes('.review-count') %>% html_text() %>% str_extract('[0-9]+')) df = rbind(df, data.frame(neighborhood=neighborhood, rating=rating, votes=votes)) } } df } all_ratings = lapply(seq(from=0, to=990, by=10), scrape_page) all_ratings_df = do.call(rbind, all_ratings)
Now we tally up the points by neighborhood, and we’re gonna score each neighborhood according to the average quality of its restaurants AND to its total number of restaurants, using the square root function on the latter to give it less weight (the idea being I would prefer to live in a neighborhood with 10 good restaurants rather than 100 crappy ones).
all_ratings_df$points = all_ratings_df$rating * all_ratings_df$votes points = aggregate(points ~ neighborhood, data=all_ratings_df, sum) restaurant_count = aggregate(rating ~ neighborhood, data=all_ratings_df, length) colnames(restaurant_count)[2] = 'restaurant_count' votes = aggregate(votes ~ neighborhood, data=all_ratings_df, sum) avg_ratings = data.frame(neighborhood=points[,1], avg_rating=points[,2]/votes[,2]) scores = data.frame(neighborhood=points[,1], score=avg_ratings[,2] * sqrt(restaurant_count[,2]))
Here are the neighborhood average ratings:
## neighborhood avg_rating ## 1 Ahuntsic-Cartierville 4.103774 ## 2 Brossard 5.000000 ## 3 Côte-des-Neiges-Notre-Dame-de-Grâce 4.112676 ## 4 Côte-Saint-Luc 5.000000 ## 5 Le Plateau-Mont-Royal 4.182097 ## 6 Mercier-Hochelaga-Maisonneuve 4.043605 ## 7 Mont-Royal 3.545455 ## 8 Outremont 4.004098 ## 9 Rosemont-La Petite-Patrie 4.265391 ## 10 Saint-Laurent 3.836735 ## 11 Saint-Léonard 4.240000 ## 12 Sud-Ouest 4.015648 ## 13 Verdun 4.094203 ## 14 Ville-Marie 3.936800 ## 15 Villeray-Saint-Michel-Parc-Extension 4.342500
and scores (which is the average rating times the square root of the number of restaurants in the neighborhood):
scores
## neighborhood score ## 1 Ahuntsic-Cartierville 18.352633 ## 2 Brossard 7.071068 ## 3 Côte-des-Neiges-Notre-Dame-de-Grâce 33.913996 ## 4 Côte-Saint-Luc 5.000000 ## 5 Le Plateau-Mont-Royal 69.478213 ## 6 Mercier-Hochelaga-Maisonneuve 20.218023 ## 7 Mont-Royal 7.927877 ## 8 Outremont 12.662071 ## 9 Rosemont-La Petite-Patrie 51.184692 ## 10 Saint-Laurent 9.398042 ## 11 Saint-Léonard 7.343895 ## 12 Sud-Ouest 28.957252 ## 13 Verdun 10.028708 ## 14 Ville-Marie 63.356803 ## 15 Villeray-Saint-Michel-Parc-Extension 27.464381
So now that we have the scores aggregated by neighborhood, we need a map to draw them on:
library(maptools) tmp_dir = tempdir() url_data = 'http://donnees.ville.montreal.qc.ca/dataset/00bd85eb-23aa-4669-8f1b-ba9a000e3dd8/resource/62f7ce10-36ce-4bbd-b419-8f0a10d3b280/download/limadmin-shp.zip' zip_file = paste0(tmp_dir, '/montreal.zip') download.file(url_data, zip_file) unzip(zip_file, exdir=tmp_dir) montreal = readShapeSpatial(paste0(tmp_dir, '/LIMADMIN.shp'))
The next step is to merge the points from the web scraping step onto the shape data. Note that unfortunately, not all neighborhoods are represented on the Yelp website.
df = merge(montreal@data, scores, by.x='NOM', by.y='neighborhood', all.x=TRUE) #reorder df to match montreal@data df = df[rank(montreal@data$NOM),]
We can assign a category to each neighborhoods based on their points (I chose to make 3 categories), and attribute a color to each category.
library(RColorBrewer) library(classInt) ci = classIntervals(df$score, n=3, style='quantile') palette = brewer.pal(8, 'YlOrRd')[c(2,4,6)] colors = findColours(ci, palette)
Finally we draw the map, add a legend and write the names of the neighborhoods as well using the gCentroid function from the rgeos package
library(rgeos) plot(montreal, col=colors) legend('topleft', legend=c('Starvation area', 'Survivable', 'Foodie's heaven'), fill=palette) centroids = gCentroid(montreal, byid=TRUE)[!is.na(df$score)] text(centroids$x, centroids$y, labels=montreal@data$NOM[!is.na(df$score)], cex=.7)
Voila ! It is quite unfortunate that we only have the central neighborhoods of Montreal. To be continued…
This is well done!
LikeLike
Thank you so much ! It’s in my plans to revisit and improve on this mini-project actually…
LikeLike
Nice! One of my first little R programs was trying HTTR, JSONLITE, XML and GGMAP to toss Yelp info onto a map (below). I generally did the smallest projects possible for a while with the best APIs I could find just to learn various capabilities of R. SO it’s tiny.
But anyway, the “short” approach didn’t turn out to be possible, as I found that the Yelp web service limited me to 20 results and disappointed (and interested in other things) just moved on for a while. So it’s great to see this, I’ll try it out.
—————-
# install.packages(“httr”)
# install.packages(“httpuv”)
# install.packages(“jsonlite”)
# install.packages(“XML”)
# setwd(“)
#load packages
library(httr)
library(httpuv)
library(jsonlite)
library(XML)
# set keys
consumerKey = “”
consumerSecret = “”
token = “”
token_secret = “”
# authorization
application = oauth_app(“YELP”, key = consumerKey, secret = consumerSecret)
signature = sign_oauth1.0(application, token = token, token_secret = token_secret)
# PDX restaurants – limit is 20!!! What!?
locationdataList = fromJSON( toJSON( content(GET(“http://api.yelp.com/v2/search/?limit=20&location=Portland&term=restaurant”, signature)) ))
yelpDF = as.data.frame(locationdataList)
names(yelpDF)
View(yelpDF)
library(ggmap)
# the data.frame from Yelp has lat & lon but let’s do addresses
# yelp JSON stuff is embedded so get DF from inside DF
yelpLoc <= as.data.frame(yelpDF$businesses.location)
# use geocode to get a list of lat & lon
locations <- geocode( as.character(yelpLoc$display_address) )
View(locations)
# dump it into map
qmap(location = "Portland, Oregon", maptype = "roadmap", zoom = 12) +
geom_point(data = locations, mapping = aes(x = lon, y = lat), color = "red", size = 3)
LikeLike
Oh sorry it’s been too long … as I think you mentioned it limited to 1,000. I cut it to 20 to learn. Now I remember the problem … it did not give me access to my own profile just the Yelp general web site … hence not handy for me.
LikeLike
Hey Monkey King! Have you heard about the Yelp academic dataset ? https://www.yelp.ca/dataset_challenge
It’s in JSON format as well and the amount of information in there is staggering.
It’s not on the top of my to-do list but definitely something I want to look at in the future, you might be interested too.
LikeLike
Interesting approach (using the API). I wonder how long it takes for the 20 hard-limit to be reset…
LikeLike