airbnb_nyc

Exploring Airbnb Prices in New York City

Aman Jaiman, Andrew Mao, Adam Howard


Introduction


This winter, we're planning to take a trip to New York City! Everyone knows the cost of living there is sky-high, so naturally, we wanted to see if there was a way to find bargains. One popular option: Airbnb!

Airbnb is a shared economy platform for people to offer their own housing for travellers. Since 2008, it has grown in popularity and has become ubiquitous in travelling options, becoming a large competitor in the hotel industry. Competing with hotels and other Airbnbs makes pricing challenging for sellers. There are many features that can factor into the price - its proximity to popular locations, amenities, size, etc.

In this tutorial, we are aiming to see if there are certain features that contribute to price more than others. We also want to see if we can find outliers for the Airbnbs (bargains or ripoffs). We hope that this exploration can be useful for travelers looking to find a place in New York City, or for homeowners to be able to price their property at a competitive price to make a profit.

Data Collection


For this tutorial, we will be using 2019 New York City Airbnb data, published by dgomonov on Kaggle. This data includes information about the hosts, geographical data, and other potential predictors of price.

We'll be using Python 3 and its standard libraries for this tutorial, along with the following libraries:

  • Numpy - for linear algebra
  • Pandas - for manipulating and preprocessing the data
  • Scikit-Learn - for machine learning
  • Seaborn - making pretty plots (uses matplotlib)!
  • Plotly - creating the map of New York City
  • Matplotlib - more pretty plots (also necessary for seaborn)!
  • StatsModels - generating stats summary for our models
  • NLTK - doing sentiment analysis

Note: dgomonov also has a great notebook about this dataset! Be sure to check it out: https://www.kaggle.com/dgomonov/data-exploration-on-nyc-airbnb/

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

from IPython.display import HTML, display # displaying maps in the notebook
import seaborn as sns; sns.set() # graphing data

To start, let's take a look at what the data looks like. Open the .csv file.

In [2]:
main_df = pd.read_csv('nyc.csv')
main_df.head()
Out[2]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
0 2539 Clean & quiet apt home by the park 2787 John Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 2018-10-19 0.21 6 365
1 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 2019-05-21 0.38 2 355
2 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth Manhattan Harlem 40.80902 -73.94190 Private room 150 3 0 NaN NaN 1 365
3 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 2019-07-05 4.64 1 194
4 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 2018-11-19 0.10 1 0

Each column in the dataframe gives us information about the property.

  • The name of the property is set by the host
  • host_id and host_name are identification ids of the host for Airbnb
  • There are five groups in neighbourhood_group, shown above
  • The neighbourhood tells us which specific neighbourhood in the group the property belongs to
  • latitude and longitude give us the coordinates of the location. We can use this with folium to map all the locations
  • room_type indicates the type of room the property is
  • price will be the attribute we will try to predict
  • minimum_nights are the minimum number of nights the property has to be booked for
  • number_of_reviews, last_review, and reviews_per_month give us information about the reviews of each property. Unfortunately, we don't have the actual reviews or rating
  • calculated_host_listings_count and availability_365 are additional features that tell us how many total properties the host has, and how long this property is available in a year

Let's examine some basic stats:

There are close to 50k entries, so we want to sample data for plotting. Data seems to missing for some fields, most noticeably those relating to the number of reviews.

In [3]:
print(main_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
id                                48895 non-null int64
name                              48879 non-null object
host_id                           48895 non-null int64
host_name                         48874 non-null object
neighbourhood_group               48895 non-null object
neighbourhood                     48895 non-null object
latitude                          48895 non-null float64
longitude                         48895 non-null float64
room_type                         48895 non-null object
price                             48895 non-null int64
minimum_nights                    48895 non-null int64
number_of_reviews                 48895 non-null int64
last_review                       38843 non-null object
reviews_per_month                 38843 non-null float64
calculated_host_listings_count    48895 non-null int64
availability_365                  48895 non-null int64
dtypes: float64(3), int64(7), object(6)
memory usage: 6.0+ MB
None

Let's examine the categorical variables.

In [4]:
print("Neighbourhood Groups:", main_df['neighbourhood_group'].unique().tolist())
print("Room Types:", main_df['room_type'].unique().tolist())
Neighbourhood Groups: ['Brooklyn', 'Manhattan', 'Queens', 'Staten Island', 'Bronx']
Room Types: ['Private room', 'Entire home/apt', 'Shared room']

We are going to be working with Airbnbs in the five boroughs in New York City. There are three types of rooms: shared rooms, private rooms, or the entire home or apartment. We can expect the rooms to have a significant impact on the price, as people would prefer to have an entire home to themselves rather than a shared room, so homeowners will likely charge more. We will explore this later.

How about price? Let's see what our typically Airbnb is priced at, as well as what the outliers look like.

In [5]:
print(main_df['price'].describe(percentiles=[.25, .50, .75, .95]))
count    48895.000000
mean       152.720687
std        240.154170
min          0.000000
25%         69.000000
50%        106.000000
75%        175.000000
95%        355.000000
max      10000.000000
Name: price, dtype: float64

We find that there are some very large outliers, so for visualization purposes, we winsorize (ignore) the top 5% of data, about $400.

Data Exploration


Now that we have seen the data that we are working with, let's visualize our data in order to get a better understanding of it.

We'll start by looking at some geographical data:

  • what boroughs have the most rooms?
  • what is the price distribution of rooms per borough?
  • what is the frequency and price distribution of rooms per room type?

Neighbourhood Groups

First we'll look at the neighbourhood groups, and where Airbnbs are most commonly found. We'll also look at how they are priced in each neighbourhood group. Using seaborn, we can make some nice visuals!

In [6]:
# using countplot to visualize the number of Airbnbs in each borough
ax = sns.countplot('neighbourhood_group',data=main_df,order=main_df['neighbourhood_group'].value_counts().index)
ax.set_title('Share of Neighborhood')
plt.show()

Manhattan and Brooklyn have the highest number of listings, at around 20K each. This can be attributed to the fact that both of those neighbourhoods have more of the tourist attractions, so people would typically want to stay close to what they are seeing.

In [7]:
# we can see from our statistical table that we have some extreme values, therefore we need to remove them for the sake of a better visualization

# creating a sub-dataframe with no extreme values / less than 400
winsorized_df=main_df[main_df.price < 400]

# using violinplot to showcase density and distribtuion of prices 
viz_2=sns.violinplot(data=winsorized_df, x='neighbourhood_group', y='price')
viz_2.set_title('Price distribution for each neighbourhood')
plt.show()

Here we can see the distribution of prices of properties, based on which neighbourhood group they belong to. We can see that Manhattan sems to have more of the higher priced properties. Bronx, Staten Island, and Queens have much more reasonable prices compared to Brooklyn and Manhattan. All distributions have positive skew.

Room Type

Now that we've looked at the locations for the Airbnbs, let's explore the room types and how the price distribution differs between them.

In [8]:
# using countplot once again
ax = sns.countplot('room_type',data=main_df,order=main_df['room_type'].value_counts().index)
ax.set_title('Share of Room Type')
plt.show()

We also see that entire homes and private rooms are the most common, which may be because the demand for shared rooms is typically lower.

In [9]:
# make sure to use the winsorized_df to ignore the outliers!
# using violinplot to showcase density and distribtuion of prices
viz_2=sns.violinplot(data=winsorized_df, x='room_type', y='price')
viz_2.set_title('Price distribution by room type')
plt.show()

As expected, shared rooms have the lowest mean price, while entire homes have the highest. All room types seem to have a similar spread, however private rooms and shared rooms seemd to be more centered around their mean. There is more disparity of price with entire homes.

Now let's see how the prices look on a map of New York City, so we can really visualize our data. Using plotly, we can display a map of New York City with the Airbnbs colored in based on price. This way, we can visualize where exactly the more expensive or super cheap Airbnbs are located.

In [10]:
# Geographic plot using Plotly

import plotly.graph_objects as go
import plotly.offline as py
py.init_notebook_mode(connected=False)

# randomly sampling 10000 data points from our dataframe for rendering purposes
sample_df = winsorized_df.sample(10000)

mapbox_access_token = "pk.eyJ1IjoibWFvc2VmIiwiYSI6ImNrMTNzYXY5dDBjcHIzbW51d2J1ZjJweHoifQ.CA3fec_PEoQHxf9jr7yGaA"

# adding points from our sample_df and coloring them based on the price
# the colorscale we use is magma, where higher prices are a darker shade of red, and lower prices area lighter shade
fig = go.Figure(go.Scattermapbox(
        lon = sample_df['longitude'],
        lat = sample_df['latitude'],
        mode='markers',
        marker=go.scattermapbox.Marker(
            size=5,
            color = sample_df['price'],
            colorscale="Magma",
            reversescale=True,
            colorbar=dict(
                title="Price ($)"
            ),
            opacity=0.7
        ),
    ))

# settings the map title and lat and long
fig.update_layout(
    title="Airbnb prices in New York City",
    hovermode='closest',
    mapbox=go.layout.Mapbox(
        accesstoken=mapbox_access_token,
        bearing=0,
        center=go.layout.mapbox.Center(
            lat=40.7,
            lon=-74
        ),
        pitch=0,
        zoom=9
    )
)

py.iplot(fig)

nyc_price_map

We see there is definitely clustering of higher prices in downtown Manhattan. There are also noticeable clusters in Upper Brooklyn and Upper Manhattan. Location could provide a good signal of price.

Natural Language Processing


Natural Language Processing deals with analyzing text to draw insights how and why humans say the things they do. We can use this to look at the names that homeowners gave their Airbnbs, and see how people advertise their homes using text features.

(If you want to learn more about NLP, check out this article by Towards Data Science)

Word Count

The first thing we want to look at is word frequency in names. This will allows us to see how hosts are naming their property, and what words we can expect to see when browsing for an Airbnb.

We start by getting the names and creating a list of all words appearing in each name.

In [11]:
# getting list of all names
names = main_df['name'].tolist()
name_words = []
for name in names:
    name = str(name).split()
    for word in name:
        name_words.append(word.lower())

Now we have a list of all the words in the names. The problem is a lot of the words that we will see are words that don't have much meaning to us. Take a look at the first 30 we encounter, for example.

In [12]:
print(name_words[:30])
['clean', '&', 'quiet', 'apt', 'home', 'by', 'the', 'park', 'skylit', 'midtown', 'castle', 'the', 'village', 'of', 'harlem....new', 'york', '!', 'cozy', 'entire', 'floor', 'of', 'brownstone', 'entire', 'apt:', 'spacious', 'studio/loft', 'by', 'central', 'park', 'large']

Here we see tokens like "by" or "the" that don't really tell us much. Luckily, the NLTK library for Python has a way for us to get rid of these words (stopwords).

In [13]:
import nltk # natural language toolkit
from nltk.corpus import stopwords

filtered_words = [word for word in name_words if word not in stopwords.words('english')]
print(filtered_words[:30])
['clean', '&', 'quiet', 'apt', 'home', 'park', 'skylit', 'midtown', 'castle', 'village', 'harlem....new', 'york', '!', 'cozy', 'entire', 'floor', 'brownstone', 'entire', 'apt:', 'spacious', 'studio/loft', 'central', 'park', 'large', 'cozy', '1', 'br', 'apartment', 'midtown', 'east']

We can see that our filtered words no longer contain the stopwords we wanted to avoid.

Next, we'll use the collections library in Python that allows us to quickly count the words in our list. We can then add the word and it's frequency into a dataframe for visualization.

In [14]:
from collections import Counter # to count words in our list

# plotting top 25 words used by the host in naming their home
words_count = Counter(filtered_words).most_common()
words_count = words_count[:25]
In [15]:
# converting the data into a dataframe so we can plot it using Seaborn
words_df = pd.DataFrame(words_count)
words_df.head()
Out[15]:
0 1
0 room 9389
1 bedroom 7231
2 private 6985
3 apartment 6113
4 cozy 4634

To make this easier to understand, let's rename our columns. The first column is showing the words from most to least common. The second column shows the count in the entire list of names.

In [16]:
words_df.rename(columns={0:"Words", 1:"Count"}, inplace=True)
In [17]:
# Now let's plot!
fig, ax = plt.subplots()
fig.set_size_inches(11.7, 8.27)
fig.suptitle('Top 25 Words Used in NYC Airbnb Names', fontsize=20)
sns.barplot(x='Words', y='Count', data=words_df, ax=ax)
plt.xticks(rotation=80)
plt.show()

Looking at the top 25 words, we can see come clear trends. For one, hosts are using simple and specific words that allow users to find their property quicker with a quick search. Words like "bedroom", "private" and "apartment" help homeowners get found, and then they can try to pull travelers in with nice pictures and descriptions. Another thing we can see is the use of words that draw users in. "Cozy", "Spacious" and "Beautiful" are all words that appear often, and for good reason!

Sentiment Analysis

Speaking of nice sounding words, let's take a look at how hosts are naming their property, in particular, the positivity behind the name. If hosts tend to advertise their home better, they might want to charge a higher price for having more amenities, being in a better location, etc., which is all something they can list in the name.

We will use the NLTK library, and their VADER lexicon, a built-in dictionary of sentiments and words that we can use to calculate sentiment scores of our text. In order to use the lexicon, you need to download the lexicon from nltk inline.

In [18]:
nltk.download('vader_lexicon')
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
Out[18]:
True

Now we can use the SentimentIntensityAnalyzer that does the work for us! We can give it pieces of text, in our case the names of the Airbnb, and it will return to use the sentiment scores.

In [19]:
# initialize the analyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
In [20]:
# preparing an array that we will use as data for our dataframe
data = []
for i,row in main_df.iterrows():
    # we want to create a dictionary that stores the name and price of a property, as well as the neutral, positive, and
    # compound scores that were returned by NLTK.
    dic = {}
    ss = sid.polarity_scores(str(row['name']))
    
    dic['name'] = row['name']
    dic['sentiment_neu'] = ss['neu']
    dic['sentiment_pos'] = ss['pos']
    dic['sentiment_compound'] = ss['compound']
    dic['price'] = row['price']
    data.append(dic)
In [21]:
# building our dataframe from the data
sentiment_df = pd.DataFrame(data)
sentiment_df.head()
Out[21]:
name sentiment_neu sentiment_pos sentiment_compound price
0 Clean & quiet apt home by the park 0.69 0.31 0.4019 149
1 Skylit Midtown Castle 1.00 0.00 0.0000 225
2 THE VILLAGE OF HARLEM....NEW YORK ! 1.00 0.00 0.0000 150
3 Cozy Entire Floor of Brownstone 1.00 0.00 0.0000 89
4 Entire Apt: Spacious Studio/Loft by central park 1.00 0.00 0.0000 80
In [22]:
sns.lmplot('sentiment_compound', 'price', data=sentiment_df, fit_reg=False, height=8, aspect=2)
Out[22]:
<seaborn.axisgrid.FacetGrid at 0x7f4b7e4575f8>

We look at the compound sentiment score, which is determined by the nltk SentimentIntensityAnalyzer as the normalized aggregated score (more information here), and see how it affects the price that the hosts set for their Airbnb. We actually see something a little interesting: a majority of the Airbnbs that are offered at a higher price seem to have a higher compound score. This could be because hosts make their property seem better in the name (with richer vocabulary), and then charge a higher price.

Because we see that the right side of our plot is very dense, we decide not to use this as a feature when regressing for price.

Note: because we are looking at just the names of the property (where we expect not too much extra sentiment information), many of the Airbnb names returned a compound score of 0.0

Variable Correlations


We now look at a correlation plot among the numerical variables. We don't see any strong correlations between meaningful variables, except number_of_reviews vs reviews_per_month

In [23]:
main_df.corr().style.background_gradient(cmap='coolwarm')
# plt.show()
Out[23]:
id host_id latitude longitude price minimum_nights number_of_reviews reviews_per_month calculated_host_listings_count availability_365
id 1 0.58829 -0.00312529 0.0909085 0.0106187 -0.0132245 -0.31976 0.291828 0.133272 0.0854676
host_id 0.58829 1 0.0202242 0.127055 0.0153091 -0.0173643 -0.140106 0.296417 0.15495 0.203492
latitude -0.00312529 0.0202242 1 0.0847884 0.0339387 0.0248693 -0.0153888 -0.0101416 0.0195174 -0.0109835
longitude 0.0909085 0.127055 0.0847884 1 -0.150019 -0.0627471 0.0590943 0.145948 -0.114713 0.0827307
price 0.0106187 0.0153091 0.0339387 -0.150019 1 0.0427993 -0.0479542 -0.0306083 0.0574717 0.0818288
minimum_nights -0.0132245 -0.0173643 0.0248693 -0.0627471 0.0427993 1 -0.0801161 -0.121702 0.12796 0.144303
number_of_reviews -0.31976 -0.140106 -0.0153888 0.0590943 -0.0479542 -0.0801161 1 0.549868 -0.0723761 0.172028
reviews_per_month 0.291828 0.296417 -0.0101416 0.145948 -0.0306083 -0.121702 0.549868 1 -0.00942116 0.185791
calculated_host_listings_count 0.133272 0.15495 0.0195174 -0.114713 0.0574717 0.12796 -0.0723761 -0.00942116 1 0.225701
availability_365 0.0854676 0.203492 -0.0109835 0.0827307 0.0818288 0.144303 0.172028 0.185791 0.225701 1

Predicting Price

The problem is of regressing price.

Let's try a multiple linear regression on the features. We drop the features (name, id, host name, and last review). We transform the categorical variables (neighbourhood_group, neighbourhood, room_type) into labels using Scikit-Learn's label transformer.

We use Ordinary Least Squares (OLS) Regression. We hold out 20% of the data for testing.

In [24]:
'''Machine Learning'''
import sklearn
from sklearn import preprocessing
from sklearn import metrics
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor


# Preparing the main_df 
main_df.drop(['name','id','host_name','last_review'],axis=1,inplace=True)
main_df['reviews_per_month']=main_df['reviews_per_month'].replace(np.nan, 0)

'''Encode labels with value between 0 and n_classes-1.'''
le = preprocessing.LabelEncoder() # Fit label encoder
le.fit(main_df['neighbourhood_group'])
main_df['neighbourhood_group']=le.transform(main_df['neighbourhood_group']) # Transform labels to normalized encoding.

le = preprocessing.LabelEncoder()
le.fit(main_df['neighbourhood'])
main_df['neighbourhood']=le.transform(main_df['neighbourhood'])

le = preprocessing.LabelEncoder()
le.fit(main_df['room_type'])
main_df['room_type']=le.transform(main_df['room_type'])

main_df.sort_values(by='price',ascending=True,inplace=True)

main_df.head()
Out[24]:
host_id neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews reviews_per_month calculated_host_listings_count availability_365
25796 86327101 1 13 40.68258 -73.91284 1 0 1 95 4.35 6 222
25634 15787004 1 28 40.69467 -73.92433 1 0 2 16 0.71 5 0
25433 131697576 0 62 40.83296 -73.88668 1 0 2 55 2.56 4 127
25753 1641537 1 91 40.72462 -73.94072 1 0 2 12 0.53 2 0
23161 8993084 1 13 40.69023 -73.95428 1 0 4 1 0.05 4 28
In [25]:
'''Train LRM'''
lm = LinearRegression()

X = main_df[['neighbourhood_group','neighbourhood','latitude','longitude','room_type','minimum_nights','number_of_reviews','reviews_per_month','calculated_host_listings_count','availability_365']]
y = main_df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

lm.fit(X_train,y_train)
Out[25]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

For evaluation, we calculate Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and the coefficient of determination (R^2).

MAE is the average of the absolute errors. MSE is the average of the squared errors. This penalizes larger errors by more. Taking the square root to get RMSE returns to our original units.

R^2 is the proportion of the variance in the dependent variable that is predictable from the independent variable.

From Scikit's documentation for R^2:

The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse).

Here's a more thorough explanation of evaluation metrics for regression.

We use MAE, because we believe price outliers exist and don't want them to impact the error. According to MAE, on average, our model is off by 72$. This is better than 1 standard deviation of guessing the mean (240), but realistically not that great.

In [26]:
'''Get Predictions & Print Metrics'''
predicts = lm.predict(X_test)

print("""
        Mean Absolute Error: {}
        Root Mean Squared Error: {}
        R2 Score: {}
     """.format(
        mean_absolute_error(y_test,predicts),
        np.sqrt(metrics.mean_squared_error(y_test, predicts)),
        r2_score(y_test,predicts),
        ))
        Mean Absolute Error: 75.26679580308493
        Root Mean Squared Error: 225.61535777968695
        R2 Score: 0.090924502614535
     

We plot the regressor price predictions against the actual ones. This is to visually check if our regression estimates look good, as well as test if the assumptions of a linear relationship are satisfied. The assumptions are explained in detail here.

Some core assumptions:

  • error terms (residuals) are normally distributed around the regression line, and homoscedastic (spread doesn't grow or shrink).
  • no multicollinearity - this is when a feature is itself linearly dependent on other features.
In [27]:
plt.figure(figsize=(16,8))
sns.regplot(predicts,y_test)
plt.xlabel('Predictions')
plt.ylabel('Actual')
plt.title("Linear Model Predictions")
plt.grid(False)
plt.show()

We notice some large outliers in the positive direction. These represent listings that were much more expensive than expected. Perhaps this is an indication that they're a ripoff! Or there are more features that account for their price, such as room quality and amenities.

This plot shows some problems: we notice our regressor is relatively conservative in predicting price (it doesn't go above 400), when there are listings in the thousands. We also notice it predicts negative prices for some listings, which is nonsensical. Unfortunately, the residuals don't seem to be evenly distributed around the regression line - this may indicate an issue with the model assumptions.

Finding a Better Model

It seems as though there is a significant relationship between Airbnb price and and a variety of the features included in the dataset. However, considering the number of features in the present model, it is likely that we could find a more parsimonious model and improve our $R^2$ value. Unfortunately, SKLearn makes it difficult to analyze the significance of each of the predictors in a model. Instead, we can temporarily make use of Python's StatsModels library. This library in particular has some very powerful statistical tools including robust model summary information.

Diagnostic Plots

Below are a few methods to generate diagnostic plots which can be used to check the assumptions for linearity.

  • Plot of residual size against the fitted (predicted) value. We expect an even (homoscedastic), Gaussian distribution around y=0.
  • Plot of residuals against the order the data was presented. If we see trends, then that indicates a serious problem.
  • Histogram of residual sizes.
In [28]:
# Residuals vs. Fitted
def r_v_fit(m):
    ax = sns.residplot(m.fittedvalues, m.resid)
    plt.title("Residuals vs. Fitted")
    plt.ylabel("Residuals")
    plt.xlabel("Fitted Values")
    plt.show()
    
# Residuals vs. Order
def r_v_order(m):
    ax = plt.scatter(m.resid.index, m.resid)
    plt.title("Residuals vs. Order")
    plt.ylabel("Residuals")
    plt.xlabel("Order")
    plt.show()

# Histogram
def r_hist(m, binwidth):
    resid = m.resid
    plt.hist(m.resid, bins=np.arange(min(resid), max(resid) + binwidth, binwidth))
    plt.title("Histogram of Residuals")
    plt.show()
In [29]:
# Get separate dataframe for statsmodels analysis
sm_df = pd.read_csv('nyc.csv')

# Split data for training and testing
sm_df['logprice'] = np.log(1 + sm_df['price'])
train_data, test_data = train_test_split(sm_df, test_size=0.2)
In [30]:
import statsmodels.formula.api as smf

# Create the model
model = smf.ols(
    'price ~ neighbourhood_group + latitude + longitude \
     + room_type + minimum_nights + number_of_reviews + reviews_per_month \
     + calculated_host_listings_count + availability_365',
    data=train_data).fit()

print("P-Value:\t{}".format(model.pvalues[0]))
print("R_Squared:\t{}".format(model.rsquared))
print("R_Squared Adj:\t{}".format(model.rsquared_adj))

# Diagnostic Plots for model
r_v_fit(model)
r_v_order(model)
r_hist(model, 100)
P-Value:	2.7230705088860967e-16
R_Squared:	0.11050193903353478
R_Squared Adj:	0.11013023069213623

The above plots show a few issues with the current model. The residuals vs. fitted plot shows mostly positive residuals and a number of outliers. It also appears as though the plot has a cone shape, and therefore does not meet the equal spread condition for linear regression. Residuals vs. Order also shows a number of outliers and, like the vs. fitted plot, takes on mostly positive values. Lastly, the histogram of residuals is very skewed. This is likely due to the outliers as seen above. Also, the R-Squared value of the model (listed above the residual plots) is very low.

Because of the skew in distribution of residuals/number of outliers, it may be useful to attempt a log transformation on the response (price). Let's try the new model out to see how it compares:

In [31]:
# Fitting a new model with a log-transformed price
log_model = smf.ols(
    'logprice ~ neighbourhood_group + latitude + longitude \
     + room_type + minimum_nights + number_of_reviews + reviews_per_month \
     + calculated_host_listings_count + availability_365',
    data=train_data).fit()

print("P-Value:\t{}".format(log_model.pvalues[0]))
print("R_Squared:\t{}".format(log_model.rsquared))
print("R_Squared Adj:\t{}".format(log_model.rsquared_adj))

# Diagnostic Plots for new, transformed model
r_v_fit(log_model)
r_v_order(log_model)
r_hist(log_model, 0.1)
P-Value:	1.6528354271293651e-137
R_Squared:	0.5121880744786157
R_Squared Adj:	0.5119842249485189

Although there are still some outliers, this model immediately appears to be improved! All three diagnostic plots meet the assumptions necessary for using linear regression. There is still a slight cone shape to the residuals vs. fitted plot, but it looks much better. Residuals vs. order has better spread around the x-axis, which indicates independence of the data. Lastly, the histogram of residuals has a much more normal distribution than the original model. The R-Squared value is significantly higher than that of the original model, so this new model is much better for explaining price.

With this new nonlinear relationship, our interpretation of the model coefficients also changes. if we have a coefficient m, then an increase of X by 1 corresponds to an increase of log(Y) by m; this is equivalent to saying Y increases by a factor of e^m.

Reducing the Model

Now that we have a better model, it may be worth examining to see if any predictors may be removed from the model. The Statsmodels library has great summary statistics, so we can look at the p-value of each of the predictors to see how significant they are:

In [32]:
print(model.pvalues)
Intercept                               2.723071e-16
neighbourhood_group[T.Brooklyn]         1.130309e-01
neighbourhood_group[T.Manhattan]        4.413060e-05
neighbourhood_group[T.Queens]           3.810153e-01
neighbourhood_group[T.Staten Island]    5.329573e-16
room_type[T.Private room]               0.000000e+00
room_type[T.Shared room]                2.735771e-70
latitude                                1.597978e-05
longitude                               3.969655e-33
minimum_nights                          2.206316e-03
number_of_reviews                       3.200351e-13
reviews_per_month                       9.651902e-01
calculated_host_listings_count          2.448303e-02
availability_365                        1.477321e-80
dtype: float64

With a significance level of $\alpha=0.05$, we can see from the above output that only two of the predictors, Queens Borough and Reviews Per Month, are not significant predictors of price. Through backwards elimination we can remove these predictors one-by-one to see if our model improves. First we can start by eliminating reviews per month because it has the higher p-value.

In [33]:
log_model_1 = smf.ols(
    'logprice ~ neighbourhood_group + latitude + longitude \
     + room_type + minimum_nights + reviews_per_month \
     + calculated_host_listings_count + availability_365',
    data=train_data).fit()

print("P-Value:\t{}".format(log_model_1.pvalues[0]))
print("R_Squared:\t{}".format(log_model_1.rsquared))
print("R_Squared Adj:\t{}".format(log_model_1.rsquared_adj))
P-Value:	6.980357943140676e-134
R_Squared:	0.5112530940419264
R_Squared Adj:	0.5110645706452213

Because both $R^2$ and $R^2_{adj}$ decreased with the removal of Reviews per Month, we should keep the original log-transformed model and not continue to eliminate predictors. Therefore, we can settle on the following model:

$\widehat{price}=b_0 + b_1*n\_brooklyn + b_2*n\_manhattan + b_3*n\_queens + b_4*n\_staten + b_5*latitude + b_6*longitude + b_7*room\_private + b_8*room\_shared + b_9*minimum\_nights + b_{10}*reviews\_per\_month + b_{11}*listings\_count + b_{12}*availability$

We print the coefficients of the model:

In [34]:
model.params
Out[34]:
Intercept                              -27412.393773
neighbourhood_group[T.Brooklyn]           -14.573270
neighbourhood_group[T.Manhattan]           34.046948
neighbourhood_group[T.Queens]               7.751601
neighbourhood_group[T.Staten Island]     -139.733422
room_type[T.Private room]                -100.763362
room_type[T.Shared room]                 -134.615304
latitude                                 -141.892331
longitude                                -451.143297
minimum_nights                             -0.186459
number_of_reviews                          -0.198474
reviews_per_month                           0.034372
calculated_host_listings_count             -0.096340
availability_365                            0.171311
dtype: float64

The most sensitive coefficients are longitude, latitude, neighbourhood_group[T.Staten Island], and room_type[T.Shared room]. Coincidentally, they are all negatively correlated with price, which intuitively makes sense. The features most positively correlated with price are neighbourhood_group[T.Manhattan] and neighbourhood_group[T.Queens].

Summary


We perform exploratory data analysis, examining listings across boroughs, room types, and location. We also use natural language processing to look at how hosts name their property, looking at the top words as well as sentiment. We then perform ordinary least squares regression on price, and find that there is some predictive power, however the assumptions of linearity aren't followed, indicating a nonlinear relationship. We perform a log-linear regression on price, and find the assumptions are satisfied, in addition to predictive power improving dramatically. We analyze the predictive power of each individual feature.

Plans for the future:

  • Interpret the model, and what features have the most explanatory power
  • Try a more complex model (Gradient Boosted Regressor)
  • Individually examine regressor outliers, both high and low. Can we find ripoffs or bargains?
  • Normalize the data.
  • Run our analysis/model on other cities.

And there you have it! For cheap Airbnbs, look to the northeast, shared rooms, or Staten Island. Happy Airbnb hunting!