This winter, we're planning to take a trip to New York City! Everyone knows the cost of living there is sky-high, so naturally, we wanted to see if there was a way to find bargains. One popular option: Airbnb!
Airbnb is a sharing-economy platform where people offer their own housing to travelers. Since 2008 it has grown steadily in popularity, becoming a ubiquitous travel option and a major competitor to the hotel industry. Competing with hotels and other Airbnbs makes pricing challenging for hosts. Many features can factor into the price: proximity to popular locations, amenities, size, and so on.
In this tutorial, we aim to see whether certain features contribute to price more than others. We also want to see if we can find outliers among the Airbnbs (bargains or ripoffs). We hope this exploration is useful for travelers looking for a place in New York City, and for homeowners trying to price their property competitively while still making a profit.
For this tutorial, we will be using 2019 New York City Airbnb data, published by dgomonov on Kaggle. This data includes information about the hosts, geographical data, and other potential predictors of price.
We'll be using Python 3 and its standard libraries for this tutorial, along with the following libraries:
Note: dgomonov also has a great notebook about this dataset! Be sure to check it out: https://www.kaggle.com/dgomonov/data-exploration-on-nyc-airbnb/
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from IPython.display import HTML, display # displaying maps in the notebook
import seaborn as sns; sns.set() # graphing data
To start, let's take a look at what the data looks like. Open the .csv file.
main_df = pd.read_csv('nyc.csv')
main_df.head()
Each column in the dataframe gives us information about the property:

- `name` of the property is set by the host
- `host_id` and `host_name` are identification ids of the host on Airbnb
- `neighbourhood_group` is the borough the property is in, shown above
- `neighbourhood` tells us which specific neighbourhood in the group the property belongs to
- `latitude` and `longitude` give us the coordinates of the location. We can use these with plotly later on to map all the locations
- `room_type` indicates the type of room the property is
- `price` will be the attribute we will try to predict
- `minimum_nights` is the minimum number of nights the property has to be booked for
- `number_of_reviews`, `last_review`, and `reviews_per_month` give us information about the reviews of each property. Unfortunately, we don't have the actual reviews or ratings
- `calculated_host_listings_count` and `availability_365` are additional features that tell us how many total properties the host has, and how many days of the year this property is available

Let's examine some basic stats:
There are close to 50k entries, so we'll want to sample the data for plotting. Data seems to be missing for some fields, most noticeably those relating to reviews.
print(main_df.info())
Let's examine the categorical variables.
print("Neighbourhood Groups:", main_df['neighbourhood_group'].unique().tolist())
print("Room Types:", main_df['room_type'].unique().tolist())
We are going to be working with Airbnbs in the five boroughs of New York City. There are three types of rooms: shared rooms, private rooms, and entire homes or apartments. We can expect room type to have a significant impact on price, as people generally prefer an entire home to themselves over a shared room, so hosts will likely charge more for it. We will explore this later.
How about price? Let's see what a typical Airbnb is priced at, as well as what the outliers look like.
print(main_df['price'].describe(percentiles=[.25, .50, .75, .95]))
We find that there are some very large outliers, so for visualization purposes we trim away roughly the top 5% of prices, those above about $400.
Now that we have seen the data that we are working with, let's visualize our data in order to get a better understanding of it.
We'll start by looking at some geographical data:
First we'll look at the neighbourhood groups, and where Airbnbs are most commonly found. We'll also look at how they are priced in each neighbourhood group. Using seaborn, we can make some nice visuals!
# using countplot to visualize the number of Airbnbs in each borough
ax = sns.countplot(x='neighbourhood_group', data=main_df, order=main_df['neighbourhood_group'].value_counts().index)
ax.set_title('Share of Neighborhood')
plt.show()
Manhattan and Brooklyn have the highest number of listings, at around 20K each. This can be attributed to the fact that both boroughs contain most of the tourist attractions, so people typically want to stay close to what they are seeing.
# we can see from our statistical table that we have some extreme values, therefore we need to remove them for the sake of a better visualization
# creating a sub-dataframe with no extreme values / less than 400
winsorized_df=main_df[main_df.price < 400]
# using violinplot to showcase density and distribution of prices
viz_2=sns.violinplot(data=winsorized_df, x='neighbourhood_group', y='price')
viz_2.set_title('Price distribution for each neighbourhood')
plt.show()
Here we can see the distribution of property prices based on which neighbourhood group they belong to. We can see that Manhattan seems to have more of the higher-priced properties. The Bronx, Staten Island, and Queens have much more reasonable prices compared to Brooklyn and Manhattan. All distributions have positive skew.
Now that we've looked at the locations for the Airbnbs, let's explore the room types and how the price distribution differs between them.
# using countplot once again
ax = sns.countplot(x='room_type', data=main_df, order=main_df['room_type'].value_counts().index)
ax.set_title('Share of Room Type')
plt.show()
We also see that entire homes and private rooms are the most common, which may be because the demand for shared rooms is typically lower.
# make sure to use the winsorized_df to ignore the outliers!
# using violinplot to showcase density and distribution of prices
viz_2=sns.violinplot(data=winsorized_df, x='room_type', y='price')
viz_2.set_title('Price distribution by room type')
plt.show()
As expected, shared rooms have the lowest mean price, while entire homes have the highest. All room types seem to have a similar spread; however, private rooms and shared rooms are more centered around their means, while there is more disparity in price for entire homes.
Now let's see how the prices look on a map of New York City, so we can really visualize our data. Using plotly, we can display a map of New York City with the Airbnbs colored in based on price. This way, we can visualize where exactly the more expensive or super cheap Airbnbs are located.
# Geographic plot using Plotly
import plotly.graph_objects as go
import plotly.offline as py
py.init_notebook_mode(connected=False)
# randomly sampling 10000 data points from our dataframe for rendering purposes
sample_df = winsorized_df.sample(10000)
mapbox_access_token = "pk.eyJ1IjoibWFvc2VmIiwiYSI6ImNrMTNzYXY5dDBjcHIzbW51d2J1ZjJweHoifQ.CA3fec_PEoQHxf9jr7yGaA"
# adding points from our sample_df and coloring them based on the price
# the colorscale we use is magma, where higher prices are a darker shade of red and lower prices a lighter shade
fig = go.Figure(go.Scattermapbox(
lon = sample_df['longitude'],
lat = sample_df['latitude'],
mode='markers',
marker=go.scattermapbox.Marker(
size=5,
color = sample_df['price'],
colorscale="Magma",
reversescale=True,
colorbar=dict(
title="Price ($)"
),
opacity=0.7
),
))
# setting the map title and the center coordinates
fig.update_layout(
title="Airbnb prices in New York City",
hovermode='closest',
mapbox=go.layout.Mapbox(
accesstoken=mapbox_access_token,
bearing=0,
center=go.layout.mapbox.Center(
lat=40.7,
lon=-74
),
pitch=0,
zoom=9
)
)
py.iplot(fig)
We see there is definitely clustering of higher prices in downtown Manhattan. There are also noticeable clusters in Upper Brooklyn and Upper Manhattan. Location could provide a good signal of price.
Natural Language Processing (NLP) deals with analyzing text to draw insights into how and why humans say the things they do. We can use it to look at the names that homeowners gave their Airbnbs, and see how people advertise their homes using text features.
(If you want to learn more about NLP, check out this article by Towards Data Science)
The first thing we want to look at is word frequency in names. This will allow us to see how hosts are naming their properties, and what words we can expect to see when browsing for an Airbnb.
We start by getting the names and creating a list of all words appearing in each name.
# getting list of all names
names = main_df['name'].tolist()
name_words = []
for name in names:
name = str(name).split()
for word in name:
name_words.append(word.lower())
Now we have a list of all the words in the names. The problem is that a lot of the words we'll see don't have much meaning to us. Take a look at the first 30 we encounter, for example.
print(name_words[:30])
Here we see tokens like "by" or "the" that don't really tell us much. Luckily, the NLTK library for Python has a way for us to get rid of these words (stopwords).
import nltk # natural language toolkit
from nltk.corpus import stopwords
nltk.download('stopwords') # the stopword list needs to be downloaded once
stop_words = set(stopwords.words('english')) # set membership checks are much faster than a list
filtered_words = [word for word in name_words if word not in stop_words]
print(filtered_words[:30])
We can see that our filtered words no longer contain the stopwords we wanted to avoid.
Next, we'll use Python's collections library, which allows us to quickly count the words in our list. We can then put each word and its frequency into a dataframe for visualization.
from collections import Counter # to count words in our list
# plotting top 25 words used by the host in naming their home
words_count = Counter(filtered_words).most_common()
words_count = words_count[:25]
# converting the data into a dataframe so we can plot it using Seaborn
words_df = pd.DataFrame(words_count)
words_df.head()
To make this easier to understand, let's rename our columns. The first column is showing the words from most to least common. The second column shows the count in the entire list of names.
words_df.rename(columns={0:"Words", 1:"Count"}, inplace=True)
# Now let's plot!
fig, ax = plt.subplots()
fig.set_size_inches(11.7, 8.27)
fig.suptitle('Top 25 Words Used in NYC Airbnb Names', fontsize=20)
sns.barplot(x='Words', y='Count', data=words_df, ax=ax)
plt.xticks(rotation=80)
plt.show()
Looking at the top 25 words, we can see some clear trends. For one, hosts use simple, specific words that let users find their property with a quick search. Words like "bedroom", "private" and "apartment" help homeowners get found, and then they can try to pull travelers in with nice pictures and descriptions. Another thing we can see is the use of words that draw users in. "Cozy", "Spacious" and "Beautiful" are all words that appear often, and for good reason!
Speaking of nice-sounding words, let's take a look at how hosts name their property, in particular the positivity behind the name. Hosts who advertise their home more glowingly might also charge a higher price, for having more amenities, being in a better location, etc., all of which they can hint at in the name.
We will use the NLTK library and its VADER lexicon, a built-in dictionary of words and sentiments that we can use to calculate sentiment scores for our text. In order to use the lexicon, you first need to download it through nltk.
nltk.download('vader_lexicon')
Now we can use the SentimentIntensityAnalyzer, which does the work for us! We can give it pieces of text, in our case the names of the Airbnbs, and it will return the sentiment scores to us.
# initialize the analyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
# preparing an array that we will use as data for our dataframe
data = []
for i,row in main_df.iterrows():
# we want to create a dictionary that stores the name and price of a property, as well as the neutral, positive, and
# compound scores that were returned by NLTK.
dic = {}
ss = sid.polarity_scores(str(row['name']))
dic['name'] = row['name']
dic['sentiment_neu'] = ss['neu']
dic['sentiment_pos'] = ss['pos']
dic['sentiment_compound'] = ss['compound']
dic['price'] = row['price']
data.append(dic)
# building our dataframe from the data
sentiment_df = pd.DataFrame(data)
sentiment_df.head()
sns.lmplot(x='sentiment_compound', y='price', data=sentiment_df, fit_reg=False, height=8, aspect=2)
We look at the compound sentiment score, which is determined by the nltk SentimentIntensityAnalyzer as the normalized aggregated score (more information here), and see how it affects the price that the hosts set for their Airbnb. We actually see something a little interesting: a majority of the Airbnbs that are offered at a higher price seem to have a higher compound score. This could be because hosts make their property seem better in the name (with richer vocabulary), and then charge a higher price.
However, because the right side of the plot is dense across the full range of prices, the compound score doesn't separate prices cleanly, so we decide not to use it as a feature when regressing on price.
Note: because we are looking at just the names of the properties (where we don't expect much extra sentiment information), many of the Airbnb names returned a compound score of 0.0.
We now look at a correlation plot among the numerical variables. We don't see any strong correlations between meaningful variables, except between number_of_reviews and reviews_per_month.
# numeric_only=True restricts the correlation to the numeric columns (required in newer pandas versions)
main_df.corr(numeric_only=True).style.background_gradient(cmap='coolwarm')
Our problem now is to regress price on the remaining features.
Let's try a multiple linear regression on the features. We drop the columns we won't use as predictors (name, id, host_name, and last_review). We transform the categorical variables (neighbourhood_group, neighbourhood, room_type) into numeric labels using scikit-learn's LabelEncoder.
We use Ordinary Least Squares (OLS) Regression. We hold out 20% of the data for testing.
'''Machine Learning'''
import sklearn
from sklearn import preprocessing
from sklearn import metrics
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
# Preparing the main_df
main_df.drop(['name','id','host_name','last_review'],axis=1,inplace=True)
main_df['reviews_per_month']=main_df['reviews_per_month'].replace(np.nan, 0)
'''Encode labels with value between 0 and n_classes-1.'''
le = preprocessing.LabelEncoder() # Fit label encoder
le.fit(main_df['neighbourhood_group'])
main_df['neighbourhood_group']=le.transform(main_df['neighbourhood_group']) # Transform labels to normalized encoding.
le = preprocessing.LabelEncoder()
le.fit(main_df['neighbourhood'])
main_df['neighbourhood']=le.transform(main_df['neighbourhood'])
le = preprocessing.LabelEncoder()
le.fit(main_df['room_type'])
main_df['room_type']=le.transform(main_df['room_type'])
main_df.sort_values(by='price',ascending=True,inplace=True)
main_df.head()
'''Train LRM'''
lm = LinearRegression()
X = main_df[['neighbourhood_group','neighbourhood','latitude','longitude','room_type','minimum_nights','number_of_reviews','reviews_per_month','calculated_host_listings_count','availability_365']]
y = main_df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
lm.fit(X_train,y_train)
For evaluation, we calculate Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and the coefficient of determination (R^2).
MAE is the average of the absolute errors. MSE is the average of the squared errors, which penalizes larger errors more heavily; taking the square root to get RMSE returns the error to our original units.
R^2 is the proportion of the variance in the dependent variable that is predictable from the independent variable.
From Scikit's documentation for R^2:
The coefficient $R^2$ is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse).
Here's a more thorough explanation of evaluation metrics for regression.
We use MAE because we believe price outliers exist and don't want them to dominate the error. According to MAE, our model is off by about $72 on average. That is better than the spread you'd face just guessing the mean (the standard deviation of price is roughly $240), but realistically not that great.
'''Get Predictions & Print Metrics'''
predicts = lm.predict(X_test)
print("""
Mean Absolute Error: {}
Root Mean Squared Error: {}
R2 Score: {}
""".format(
mean_absolute_error(y_test,predicts),
np.sqrt(metrics.mean_squared_error(y_test, predicts)),
r2_score(y_test,predicts),
))
We plot the regressor price predictions against the actual ones. This is to visually check if our regression estimates look good, as well as test if the assumptions of a linear relationship are satisfied. The assumptions are explained in detail here.
Some core assumptions:
- Linearity: the response is a linear function of the predictors
- Independence: the residuals are independent of one another
- Equal variance (homoscedasticity): the residuals have roughly constant spread across fitted values
- Normality: the residuals are approximately normally distributed
plt.figure(figsize=(16,8))
sns.regplot(x=predicts, y=y_test)
plt.xlabel('Predictions')
plt.ylabel('Actual')
plt.title("Linear Model Predictions")
plt.grid(False)
plt.show()
We notice some large outliers in the positive direction. These represent listings that were much more expensive than expected. Perhaps this is an indication that they're a ripoff! Or there are more features that account for their price, such as room quality and amenities.
This plot shows some problems: we notice our regressor is relatively conservative in predicting price (it doesn't go above $400), even though there are listings in the thousands. We also notice it predicts negative prices for some listings, which is nonsensical. Unfortunately, the residuals don't seem to be evenly distributed around the regression line, which may indicate an issue with the model assumptions.
It seems as though there is a significant relationship between Airbnb price and a variety of the features included in the dataset. However, considering the number of features in the present model, it is likely that we could find a more parsimonious model and improve our $R^2$ value. Unfortunately, scikit-learn makes it difficult to analyze the significance of each of the predictors in a model. Instead, we can temporarily make use of Python's StatsModels library, which has some very powerful statistical tools, including robust model summary information.
Below are a few methods to generate diagnostic plots which can be used to check the assumptions for linearity.
# Residuals vs. Fitted
def r_v_fit(m):
ax = sns.residplot(x=m.fittedvalues, y=m.resid)
plt.title("Residuals vs. Fitted")
plt.ylabel("Residuals")
plt.xlabel("Fitted Values")
plt.show()
# Residuals vs. Order
def r_v_order(m):
ax = plt.scatter(m.resid.index, m.resid)
plt.title("Residuals vs. Order")
plt.ylabel("Residuals")
plt.xlabel("Order")
plt.show()
# Histogram
def r_hist(m, binwidth):
resid = m.resid
plt.hist(m.resid, bins=np.arange(min(resid), max(resid) + binwidth, binwidth))
plt.title("Histogram of Residuals")
plt.show()
# Get separate dataframe for statsmodels analysis
sm_df = pd.read_csv('nyc.csv')
# Add a log-transformed price column; log(1 + price) avoids taking log(0) for free listings
sm_df['logprice'] = np.log(1 + sm_df['price'])
# Split data for training and testing
train_data, test_data = train_test_split(sm_df, test_size=0.2)
import statsmodels.formula.api as smf
# Create the model
model = smf.ols(
'price ~ neighbourhood_group + latitude + longitude \
+ room_type + minimum_nights + number_of_reviews + reviews_per_month \
+ calculated_host_listings_count + availability_365',
data=train_data).fit()
print("P-Value:\t{}".format(model.pvalues[0]))
print("R_Squared:\t{}".format(model.rsquared))
print("R_Squared Adj:\t{}".format(model.rsquared_adj))
# Diagnostic Plots for model
r_v_fit(model)
r_v_order(model)
r_hist(model, 100)
The above plots show a few issues with the current model. The residuals vs. fitted plot shows mostly positive residuals and a number of outliers. It also appears as though the plot has a cone shape, and therefore does not meet the equal spread condition for linear regression. Residuals vs. Order also shows a number of outliers and, like the vs. fitted plot, takes on mostly positive values. Lastly, the histogram of residuals is very skewed. This is likely due to the outliers as seen above. Also, the R-Squared value of the model (listed above the residual plots) is very low.
Because of the skew in distribution of residuals/number of outliers, it may be useful to attempt a log transformation on the response (price). Let's try the new model out to see how it compares:
# Fitting a new model with a log-transformed price
log_model = smf.ols(
'logprice ~ neighbourhood_group + latitude + longitude \
+ room_type + minimum_nights + number_of_reviews + reviews_per_month \
+ calculated_host_listings_count + availability_365',
data=train_data).fit()
print("P-Value:\t{}".format(log_model.pvalues[0]))
print("R_Squared:\t{}".format(log_model.rsquared))
print("R_Squared Adj:\t{}".format(log_model.rsquared_adj))
# Diagnostic Plots for new, transformed model
r_v_fit(log_model)
r_v_order(log_model)
r_hist(log_model, 0.1)
Although there are still some outliers, this model immediately appears to be improved! All three diagnostic plots come much closer to meeting the assumptions necessary for linear regression. There is still a slight cone shape to the residuals vs. fitted plot, but it looks much better. Residuals vs. order has better spread around the x-axis, which indicates independence of the data. Lastly, the histogram of residuals is much closer to a normal distribution than in the original model. The R-squared value is significantly higher than that of the original model, so this new model is much better for explaining price.
With this new nonlinear relationship, our interpretation of the model coefficients also changes. If we have a coefficient m, then an increase of X by 1 corresponds to an increase of log(Y) by m; this is equivalent to saying Y increases by a factor of e^m.
Now that we have a better model, it may be worth examining whether any predictors can be removed from it. The StatsModels library has great summary statistics, so we can look at the p-value of each of the predictors to see how significant they are:
print(log_model.pvalues)
With a significance level of $\alpha=0.05$, we can see from the above output that only two of the predictors, Queens Borough and Reviews Per Month, are not significant predictors of price. Through backwards elimination we can remove these predictors one-by-one to see if our model improves. First we can start by eliminating reviews per month because it has the higher p-value.
log_model_1 = smf.ols(
'logprice ~ neighbourhood_group + latitude + longitude \
+ room_type + minimum_nights + number_of_reviews \
+ calculated_host_listings_count + availability_365',
data=train_data).fit()
print("P-Value:\t{}".format(log_model_1.pvalues[0]))
print("R_Squared:\t{}".format(log_model_1.rsquared))
print("R_Squared Adj:\t{}".format(log_model_1.rsquared_adj))
Because both $R^2$ and $R^2_{adj}$ decreased with the removal of Reviews per Month, we should keep the original log-transformed model and not continue to eliminate predictors. Therefore, we can settle on the following model:
$\widehat{\log(1+price)}=b_0 + b_1*n\_brooklyn + b_2*n\_manhattan + b_3*n\_queens + b_4*n\_staten + b_5*latitude + b_6*longitude + b_7*room\_private + b_8*room\_shared + b_9*minimum\_nights + b_{10}*number\_of\_reviews + b_{11}*reviews\_per\_month + b_{12}*listings\_count + b_{13}*availability$
We print the coefficients of the model:
log_model.params
The largest-magnitude coefficients are longitude, latitude, neighbourhood_group[T.Staten Island], and room_type[T.Shared room], all of which are negative, which makes intuitive sense. The largest positive coefficients belong to neighbourhood_group[T.Manhattan] and neighbourhood_group[T.Queens].
We perform exploratory data analysis, examining listings across boroughs, room types, and location. We also use natural language processing to look at how hosts name their properties, looking at the top words as well as sentiment. We then perform ordinary least squares regression on price and find some predictive power; however, the assumptions of linear regression aren't satisfied, indicating a nonlinear relationship. We then perform a log-linear regression on price and find the assumptions reasonably satisfied, with predictive power improving dramatically. Finally, we analyze the significance of each individual feature.
Plans for the future:
And there you have it! For cheap Airbnbs, look to the northeast, shared rooms, or Staten Island. Happy Airbnb hunting!