Working as a data scientist at a start-up, I have learned that new projects and new demands are something we should all be prepared for. I keep learning every day at the company, and I thought it would be nice to write down some of my personal work. What better way to do that than publishing it on Medium?
Exploratory Data Analysis is the process of investigating data in order to summarize its main characteristics, find patterns, spot anomalies, test hypotheses and check assumptions, with the goal of answering business or real-world questions about how the data could be used.
In this post, I’ll be analyzing historical data from the Airbnb website for the city of Rio de Janeiro, Brazil.
One of the datasets contains information on more than 34,000 host properties, such as price, location, host neighborhood, host response rate, review scores rating, number of guests accommodated and many others. The second dataset contains more than 330,000 guest reviews, from January 2010 to January 2020.
Part I: How do prices change based on location (geographical pricing)?
When I looked through the data, this was the first question that came to mind. Just like any other city, Rio de Janeiro has some neighborhoods, like Leblon, Ipanema and Copacabana, that are more touristy than others and therefore more expensive. If so, location and price should be correlated. Let’s check it with data!
In the dataset listings.csv, the column price has values ranging from $29 up to $42,000. As expected, these values are extremely scattered, as can be observed in the image below.
We can observe that most of the prices are below $1,000. More precisely, 75% are below $600.
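A quick way to confirm this kind of skew is to look at a few percentiles of the column. The snippet below is a minimal sketch with pandas on a handful of made-up prices: the values are illustrative, not taken from listings.csv, and it assumes the price column has already been converted to numbers.

```python
import pandas as pd

# Illustrative prices only (NOT from listings.csv); assumes a numeric column.
prices = pd.Series([29, 80, 120, 150, 200, 350, 600, 1000, 5000, 42000], name="price")

# Percentiles show where the bulk of the values sit.
summary = prices.quantile([0.25, 0.50, 0.75])
print(summary)

# Min and max expose how far the extremes are from that bulk.
print("min:", prices.min(), "max:", prices.max())

# A mean far above the median is a typical sign of a right-skewed column.
print("mean:", prices.mean(), "median:", prices.median())
```

On the real column, the 0.75 quantile is the figure to compare with the 75% statement above.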
To make the data easier to analyze, I used a technique called binning. This method reduces the effect of minor observation errors. I ended up with 9 different intervals for this column.
From Wikipedia, Data binning: “The original data values which fall into a given small interval, a bin, are replaced by a value representative of that interval, often the central value”
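As a rough illustration of binning, here is a minimal sketch using pandas. The bin edges are hypothetical (the post does not list the ones used in the analysis), the prices are made up, and the column name price_bin is introduced only for this example.

```python
import pandas as pd

# Illustrative prices only; not taken from listings.csv.
df = pd.DataFrame({"price": [29, 95, 180, 260, 420, 590, 900, 2500, 42000]})

# Hypothetical bin edges: 10 edges give 9 right-closed intervals.
# The edges used in the original analysis are not given in this post.
edges = [0, 100, 200, 300, 400, 600, 1000, 2000, 5000, float("inf")]

# pd.cut replaces each price with the interval it falls into.
df["price_bin"] = pd.cut(df["price"], bins=edges)

# Count how many listings land in each interval, in interval order.
print(df["price_bin"].value_counts(sort=False))
```

Narrow bins at the low end and wide bins at the high end are one way to deal with a column where most values are small and a few are very large.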
Once the column price was preprocessed, it was time to retrieve the location of each host. The column host_neighborhood had more than 160 different, unevenly distributed values, which made it very hard to analyze, so the best solution was a visual approach. Plotting the columns latitude, longitude and price produced the following graph:
It’s possible to observe that prices tend to be higher closer to the beaches. For those who are more familiar with the geography of Rio de Janeiro, we can also observe that prices are higher, on average, in the most famous neighborhoods: Leblon, Ipanema, Lagoa and Barra da Tijuca, which are also the most touristy areas.
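A map like the one discussed above can be approximated with a scatter plot of longitude against latitude, colored by price interval. This is only a sketch under assumptions: the coordinates and prices are made up, the three bin edges are hypothetical, and matplotlib is my choice of plotting library, not necessarily the one used in the original analysis.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up coordinates and prices; not real listings.
df = pd.DataFrame({
    "latitude": [-22.985, -22.984, -22.971, -22.910, -22.880],
    "longitude": [-43.223, -43.198, -43.184, -43.230, -43.300],
    "price": [900, 750, 500, 180, 90],
})

# Hypothetical edges, chosen only to keep the example small.
df["price_bin"] = pd.cut(df["price"], bins=[0, 200, 600, float("inf")])

fig, ax = plt.subplots(figsize=(8, 6))
points = ax.scatter(
    df["longitude"],
    df["latitude"],
    c=df["price_bin"].cat.codes,  # integer code of each bin drives the color
    cmap="viridis",
    s=10,
    alpha=0.6,
)
ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
fig.colorbar(points, ax=ax, label="price bin (low to high)")
```

In a notebook, you would drop the matplotlib.use("Agg") line and call plt.show() to display the figure.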
Part II: Does the host response rate affect the host’s review scores?
The longer it takes for a host to reply, the more unsatisfied the guests get. Does this statement make sense? Maybe, but first, let’s look through the data for help.
The review_scores_rating column had almost the same problem as the price column: the score 100% alone accounts for more than 40% of the data. To reduce the effect of errors from minor classes, the column was divided into two intervals: below and above 96%. The host_response_rate column was then discretized into 10 intervals, which produced the following aggregation table.
The table shows that review_scores_rating doesn’t appear to be affected by the host response rate. This result goes against the premise that these two variables would be directly proportional. The conclusion isn’t absurd, since not everyone cares about the host response rate. There are more important factors that influence review_scores_rating: location, price, hospitality, cleanliness, etc.
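The aggregation described above could be sketched as follows. This is a minimal example on made-up values. It assumes both columns are numeric percentages between 0 and 100, that a rating of exactly 96 falls in the lower class, and that the 10 intervals are equal-width. The names high_rating, response_bin, listings and share_above_96 are hypothetical.

```python
import pandas as pd

# Made-up values; assumes both columns are numeric percentages (0-100).
df = pd.DataFrame({
    "host_response_rate":   [5, 15, 35, 55, 75, 95, 98, 100, 100, 100],
    "review_scores_rating": [97, 90, 100, 95, 100, 98, 92, 100, 96, 99],
})

# Two classes for the rating: at most 96 (False) and above 96 (True).
df["high_rating"] = df["review_scores_rating"] > 96

# Ten equal-width intervals for the response rate: 0-10, 10-20, ..., 90-100.
rate_edges = list(range(0, 101, 10))
df["response_bin"] = pd.cut(df["host_response_rate"], bins=rate_edges, include_lowest=True)

# For each response-rate interval: number of listings and share rated above 96.
table = (
    df.groupby("response_bin", observed=True)["high_rating"]
      .agg(listings="size", share_above_96="mean")
)
print(table)
```

If the share of high ratings stays roughly flat from the lowest to the highest interval, the response rate is not doing much to explain the scores.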
Part III: Airbnb growth through the years
How has Airbnb been doing in Rio de Janeiro through the years? This is a very interesting question, and to answer it I had to find a way to measure the number of Airbnb guests per month in the city.
Of the datasets available, review.csv is the one with the best features for retrieving this information. It contains the date of each review, from 2010 up to 2020, which serves as an indirect measure of the number of guests per month. Therefore, this analysis is based on the number of reviews per month.
The assumption made here is that most people leave a review after a stay. After some quick research, I found that approximately 60%-70% of guests leave a review.
After grouping the dataset by month and counting the reviews, I obtained the following graph:
From the time series above it’s possible to observe an upward trend, showing the growth in the number of reviews through the years.
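The monthly grouping behind this time series can be sketched in a few lines of pandas. The example assumes the review dates live in a column named date (an assumption about the file layout) and uses made-up dates instead of the real dataset.

```python
import pandas as pd

# Made-up review dates, one row per review; the column name "date" is assumed.
reviews = pd.DataFrame({
    "date": ["2016-07-30", "2016-08-06", "2016-08-15",
             "2016-08-21", "2016-09-02", "2017-01-10"]
})
reviews["date"] = pd.to_datetime(reviews["date"])

# Map each date to its calendar month, then count the rows in each month.
monthly = (
    reviews.groupby(reviews["date"].dt.to_period("M"))
           .size()
           .rename("total_reviews")
)
print(monthly)
```

Note that months without any review simply do not appear in the result, so a real analysis may want to reindex over the full date range before plotting.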
We can also look at the month in which the maximum number of total_reviews per year occurred, represented by the peaks in the graph. Some months had a high number of reviews for no explicit reason, but for August 2016 we can explain the total of 9,062 reviews by the Rio Olympic Games, which were held from 5 to 21 August 2016.
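Finding the peak month of each year is a small groupby on the monthly series. In the sketch below, only the 9,062 figure for August 2016 comes from the text above. Every other count is made up for illustration.

```python
import pandas as pd

# Monthly counts: only 9062 (August 2016) comes from the article,
# the remaining values are made up for illustration.
monthly = pd.Series(
    [4100, 9062, 5200, 6800, 6100, 7300],
    index=pd.PeriodIndex(
        ["2016-07", "2016-08", "2016-09", "2017-01", "2017-02", "2017-12"],
        freq="M",
    ),
    name="total_reviews",
)

# Group the months by year; idxmax returns the month holding the largest count.
by_year = monthly.groupby(monthly.index.year)
peak_month = by_year.idxmax()
peak_value = by_year.max()

print(pd.DataFrame({"peak_month": peak_month, "total_reviews": peak_value}))
```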
To check for seasonality, I did a visual analysis using a box plot.
From the box plot above, it’s possible to see that January has a higher mean value than the other months, which can be explained by the fact that this period corresponds to the summer season in Rio de Janeiro, Brazil.
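The same seasonality check can be done numerically by grouping the monthly counts by calendar month across years, which gives roughly the statistics a box plot summarizes. This is a sketch on made-up counts for two years, not the real review numbers.

```python
import pandas as pd

# Made-up monthly review counts for 2018 and 2019 (24 values).
idx = pd.period_range("2018-01", "2019-12", freq="M")
counts = [900, 700, 650, 600, 550, 500, 620, 580, 560, 640, 700, 850,
          1100, 820, 760, 700, 650, 610, 720, 690, 660, 750, 810, 980]
monthly = pd.Series(counts, index=idx, name="total_reviews")

# Group by calendar month (1 = January, ..., 12 = December) across all years.
by_month = monthly.groupby(monthly.index.month).agg(["median", "mean", "min", "max"])
print(by_month)

# The month with the highest average count hints at the high season.
print("month with highest mean:", by_month["mean"].idxmax())
```

With a strong upward trend, later years weigh more in these averages, so it can help to compare months within each year as well.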
In this article, we took a quick look at a real-world dataset, going through some techniques for analyzing variables with imbalanced distributions and for checking seasonality in a time series.
From the data wrangling, we were able to retrieve some important information about Airbnb in the city of Rio de Janeiro:
- Price tends to be higher for houses near the beaches
- Host response rate isn’t one of the main factors in guest review scores
- Airbnb has been growing through the years
- January is the high season in Rio de Janeiro
Hope you enjoyed the read! If you want to check the full technical analysis and the code, the project is on GitHub: https://github.com/DanielDaCosta/airbnb-analysis.
The findings here are observational, not the result of a formal study. Now it’s your turn:
What other information can we retrieve from these datasets? What other techniques can we use to get more insights?