With the pandemic I have found myself stuck at home instead of doing the usual pub-crawling. What better way to use the extra time than to pull some APIs, analyze where the best pubs are, and generate a plan of attack with machine learning, in anticipation of when things are more normal again?

We will use Singapore as a case study for this approach, and subset to only the best venues.

Data Required

In order to complete the above, we need the following:

  • Foursquare API – Explore Endpoint for list of venues
  • Foursquare API – Venues Endpoint for details
  • Google Maps to find the initial lat/long coordinates

We acquire the data by pulling on latnum="1.3521", longnum="103.8198", the center of Singapore, with a radius of 50000 m to cover all areas.

Pulling the data

categoryId="4d4b7105d754a06376d81259"
latnum="1.3521"
longnum="103.8198"
radiusnum="50000"

offsetnum=0

def pullExplorerdata(clientid,clientsecret,latnum,longnum,radiusnum,categoryId,offsetnum):
    explorerUrl=F"https://api.foursquare.com/v2/venues/explore?&client_id={clientid}&client_secret={clientsecret}&ll={latnum},{longnum}&radius={radiusnum}&categoryId={categoryId}&v=20200815&limit=50&offset={offsetnum}"
    response = requests.get(explorerUrl)
    return response.json()
  
# first pull, then keep paging in blocks of 50 until the API stops returning items
dat=pullExplorerdata(clientid,clientsecret,latnum,longnum,radiusnum,categoryId,offsetnum)
results=[]
items=dat["response"]["groups"][0]['items']
results+=items

while dat['meta']['code']==200 and len(items)>0:
    offsetnum+=50
    print(offsetnum)
    dat=pullExplorerdata(clientid,clientsecret,latnum,longnum,radiusnum,categoryId,offsetnum)
    items=dat["response"]["groups"][0]['items']
    results+=items

Next, because most of the data is nested in the JSON response, we use pandas to extract the details.

import pandas as pd
import numpy as np

# flatten the nested venue JSON into a DataFrame, guarding against missing location fields
df=pd.DataFrame()
df['id']=[i['venue']['id'] for i in results]
df['name']=[i['venue']['name'] for i in results]
df['location_lat']=[i['venue']['location']['lat'] for i in results]
df['location_lng']=[i['venue']['location']['lng'] for i in results]
df['location_post']=[i['venue']['location']['postalCode'] if 'postalCode' in i['venue']['location'] else np.nan for i in results]
df['location_address']=[i['venue']['location']['address'] if 'address' in i['venue']['location'] else np.nan for i in results]
df['location_neighborhood']=[i['venue']['location']['neighborhood'] if 'neighborhood' in i['venue']['location'] else np.nan for i in results]
df['location_city']=[i['venue']['location']['city'] if 'city' in i['venue']['location'] else np.nan for i in results]
df['location_country']=[i['venue']['location']['country'] if 'country' in i['venue']['location'] else np.nan for i in results]
df['cat_name']=[i['venue']['categories'][0]['name'] for i in results]
df['catid']=[i['venue']['categories'][0]['id'] for i in results]

# keep only venues located in Singapore (a 50 km radius also picks up neighbouring countries)
df=df[df['location_country']=='Singapore']

Next we enrich the data by making secondary calls to the Venues endpoint.

def pullVenueData(venueId):
    # Venues endpoint: full details (rating, price, likes, tips) for a single venue id
    venueUrl=F"https://api.foursquare.com/v2/venues/{venueId}?&client_id={clientid}&client_secret={clientsecret}&v=20200815"
    response = requests.get(venueUrl)
    return response.json()

df["full_json"]=df["id"].apply(pullVenueData)

Here we check whether any calls returned a non-200 (failed) status and rerun those as required.

df['stat_code']=df["full_json"].apply(lambda x:x['meta']['code'])
#clean error
elemswitherror=df['stat_code']!=200
df[elemswitherror]
#df.iloc[elemswitherror,"fulljson"]=df[elemswitherror]["id"].apply(pullVenueData)
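The commented-out line reruns the failed ids once. As a minimal sketch (the retry count and pause are my own arbitrary choices, not from the original post), the failed calls could also be retried in a small loop that backs off between rounds:

import time

# assumption: a simple retry loop, not part of the original post
# re-pull any venue whose last call did not return HTTP 200, pausing between rounds
for attempt in range(3):
    failed = df['stat_code'] != 200
    if not failed.any():
        break
    time.sleep(5)  # arbitrary pause to ease off any rate limiting
    df.loc[failed, "full_json"] = df.loc[failed, "id"].apply(pullVenueData)
    df['stat_code'] = df["full_json"].apply(lambda x: x['meta']['code'])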

We now extract the additional fields.

df['url']=df["full_json"].apply(lambda x:x['response']['venue']['canonicalUrl'])
df['tips']=df["full_json"].apply(lambda x:x['response']['venue']['tips'])
df['price_tier']=df["full_json"].apply(lambda x: x['response']['venue']['price']['tier'] if 'price' in x['response']['venue'].keys() else np.nan ) 
df['price_message']=df["full_json"].apply(lambda x: x['response']['venue']['price']['message'] if 'price' in x['response']['venue'].keys() else np.nan ) 
df['rating']=df["full_json"].apply(lambda x: x['response']['venue']["rating"] ) 
df['likes']=df["full_json"].apply(lambda x: x['response']['venue']["likes"]['count'] ) 

Exploratory Analysis

We check out the scores:

Each venue is scored on a scale of 1-10. The median score is 8.1, and from the histogram, a score of 8.6 and above probably signals a superior experience compared with an average venue.
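The histogram itself isn't reproduced here, but a minimal sketch of how it can be drawn from the DataFrame (the bin count and styling are my own choices, not necessarily those of the original chart) would be:

import matplotlib.pyplot as plt

# assumption: a plain matplotlib histogram of the Foursquare ratings
df['rating'].plot(kind='hist', bins=20, title='Distribution of venue ratings')
plt.axvline(df['rating'].median(), color='red', linestyle='--')  # median sits around 8.1
plt.xlabel('Foursquare rating (1-10)')
plt.show()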

Mapping out the points

Mapping out the full data set, we see how the venues are scattered:

import folium

# drop a marker for every venue on a map centered on Singapore
m = folium.Map(location=[float(latnum), float(longnum)], zoom_start=12)
mlatlong=[list(a) for a in zip(df["location_lat"],df["location_lng"]) ]
for i in mlatlong:
    folium.Marker(
        location=i
    ).add_to(m)
m

Now there are a few out-of-the-way places, like Yishun. Every savvy Singaporean knows you should avoid Yishun (I joke). We now subset to only the high-scoring venues.
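The post doesn't show how the high-scoring subset and the feature matrix X fed to DBSCAN were built, so the snippet below is a hedged reconstruction: I assume the cutoff is the 8.6 rating discussed above and that the lat/long pairs are standardised before clustering (the StandardScaler import in the next block suggests as much), but the original may have differed.

from sklearn.preprocessing import StandardScaler

# assumptions: keep venues rated 8.6 or better (the "superior" threshold from the histogram)
# and standardise the coordinates so that the eps used below is in scaled units
subdf = df[df["rating"] >= 8.6].copy()
X = StandardScaler().fit_transform(subdf[["location_lat", "location_lng"]])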

from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn import metrics

# cluster the high-scoring venues (X as built above)
db=DBSCAN(eps=0.24,min_samples=2).fit(X)
labels = db.labels_

# number of clusters found, excluding the noise label (-1)
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters_)

# number of venues DBSCAN flagged as noise
n_noise_ = list(labels).count(-1)
print(n_noise_)

# attach the cluster labels back onto the venue DataFrame
subdf["labels"]=labels
subdf=subdf.reset_index()

Plotting the output, we get the following:

# one color per cluster label; noise points (label -1) wrap around to the last color in the list
colors= ['green', 'red', 'orange', 'pink', 'darkgreen', 'darkpurple', 'darkblue', 'lightgreen', 'lightgray']

mlatlong=[list(a) for a in zip(subdf["location_lat"],subdf["location_lng"]) ]
for i in range(len(mlatlong)):
    folium.Marker(
            location=mlatlong[i],
            icon=folium.Icon(color=colors[subdf.loc[i,"labels"]])
        ).add_to(m)
m

Conclusion

Here we see a few good night-out enclaves:

One in the Orchard belt, another in Somerset, and a few clustered around Fort Canning. The mother lode is downtown, where the offices are.

We could run this approach again on the orange clusters, further subdividing the venues, as sketched below.
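A minimal sketch of that idea (the label number 2 is purely illustrative, since the post doesn't say which label the orange cluster carries, and the tighter eps is my own choice):

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# assumption: label 2 stands in for the orange cluster
orange = subdf[subdf["labels"] == 2]
X_orange = StandardScaler().fit_transform(orange[["location_lat", "location_lng"]])

# a tighter eps breaks the big cluster into smaller, more walkable groups
sub_labels = DBSCAN(eps=0.1, min_samples=2).fit(X_orange).labels_
print(set(sub_labels))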

There we have it: a DBSCAN-powered pub crawl.

*Please be responsible during Covid times and avoid endangering others. This exercise is purely a way for me to live out pre-Covid times vicariously.