MS5103 Business Analytics Project
An Analysis of the First Time Bookings of Airbnb Users
Group Members (ID): Patrick Leddy (08370231), Brian O Conghaile (11311151), Níamh Ryan (11307801)
Supervisor: Michael Lang
Declaration of Originality
Project Details
Module Code: MS5103
Assignment Title: M.Sc. Business Analytics - Major Project
Group Members: (please use BLOCK CAPITALS)
Student ID / Student Name / Contact Details (Email, Telephone)
08370231 Patrick Leddy 0834819255
11311151 Brian O Conghaile 0870694753
11307801 Níamh Ryan 0873119544
I/We hereby declare that this project is my/our own original work. I/We have read the University Code of Practice for Dealing with Plagiarism* and am/are aware that the possible penalties for plagiarism include expulsion from the University.
Signature Date *
Abstract
The intention of our research project is to analyse the Airbnb datasets acquired from Kaggle, with the aim of uncovering interesting patterns and trends in the data. Among the minor areas we focused on were social media trends and seasonal patterns. However, our main goal was to attempt to predict the destination that the next new user will book to travel to on the Airbnb website. We wanted to gain insight into the booking patterns of new users, including as many variables as possible to aid our analysis. We used a wide variety of tools and techniques for this project, including RStudio for decision trees and XGBoost, and Minitab for some time series analysis.
Table of Contents
1. Company Background
2. Outline of Problem
  2.1 Objectives
  2.2 Defining the Lifecycle
  2.3 Selecting the Model
3. Description of Datasets
4. Data Preparation
  4.1 Merging the Datasets
  4.2 Missing Values
  4.3 Creating the Dummy Variables
5. Initial Understanding of Data
  5.1 Social Media Trends
  5.2 Seasonal Trends
6. Explanation of Tools and Techniques
  6.1 R Studio
  6.2 Minitab
  6.3 Excel
7. Findings
  7.1 Association Between Variables
  7.2 The Earlier Models
  7.3 XGBoost
  7.4 The Final Model
  7.5 Testing the Variables
  7.6 The Results
8. Conclusion
9. Appendices
1. Company Background
Founded in 2008, Airbnb is still a relatively new company, but one which has experienced major growth over the last number of years and is now a market leader in the hospitality sector. The idea of Airbnb is simple: it provides a platform and a service whereby its customers, people seeking places to stay, can connect with its clients, hosts who are looking to rent out their property. Customers are matched to clients based on their own preferences, such as room type, price, host language and other factors, all in a timely and efficient manner (Airbnb, 2016). According to the Airbnb website, there are over 2 million listings worldwide, more than 60 million guests and more than 191 countries on offer to visit. They also boast more than 34,000 cities on offer, as well as more than 1,400 castles to stay in (Airbnb, 2016).
2. Outline of Problem
A vast amount of data is available to us in this project. As with all big data projects, the biggest issue we encountered before beginning was how to organise, understand and manage the data we were given in a suitable manner to try to achieve our objectives.
2.1 Objectives
A real and worrying issue in organisations attempting projects involving large data and prediction models is that they often worry about what type of model they are trying to build, or what data they have available. Instead, they should be focusing on what the business problem is that they are trying to solve, and whether or not they are asking the right questions to begin with. We outlined four major objectives to reach by the end of our project, which we felt would be beneficial to Airbnb for the sake of better understanding their customers and those customers' behavioural patterns.
2.1.1 Location of Booking (Main Objective)
As the data was retrieved from a competition run by Airbnb, with the goal of predicting where a customer would book as their first holiday destination, the main aim of this project was to imagine ourselves working for Airbnb and to decide why this information would be of value to our employer and how we would go about achieving this goal. Naturally, one of the main ways of using data to create value is personalisation, or even a reward strategy. In order to create value from the large volume and variety of data available with Big Data, there are different types of value creation, such as performance management, data exploration, social analysis or decision science. For the kind of data collected by Airbnb, the value creation method that best suits them is social analysis. Airbnb have collected information on everything their customers have done on their website to date, and as such they could use this greatly to their advantage.
(Parise, Iyer, and Vesset, 2015) Seeing as Airbnb acts as an intermediary between its buyers and sellers, knowing where a customer is likely to book their accommodation allows the company to act more efficiently, and as such there is an opportunity for better forecasting of future income. Having this sort of knowledge will allow Airbnb to interact with their customers, for example by providing them with individually customised offers. This could easily reduce the time between the creation of an account and the customer's first booking.
One downside of this sort of objective is questioning whether or not we really have all the necessary data available to us. In something as personal as choosing a holiday destination, there are many different factors that influence the decision. People don't simply choose holiday destinations based on their age and gender alone, or on the time they created their account. There could be social, economic and trending issues, as well as the cost of flights to and from a particular location. We do not necessarily have all this information to hand.
2.1.2 Social Media Trends
Another objective of our project was to assess how social media affected different aspects of the Airbnb customer data. There is potential to yield some very interesting trends and results from analysing the data in relation to the method of sign up, be it via Facebook, Google+, or so on.
2.1.3 Seasonal Trends
Seasonality is something we would all think is very closely linked to the choice of holiday destination. One would expect seasonal trends to have perhaps the greatest influence on the main objective of predicting the location of users' bookings, so it is worth looking into this on a more individualised basis for the sake of finding interesting patterns. Time series analysis could potentially be run to determine how seasonal trends affect different aspects of our datasets, not only the location but the sign up method and others. With this variety and volume of data, there is room for potential seasonal trends in all areas, not just location alone. We also need to take into consideration that all customers in these datasets are from America, so they may have different ideas of holiday trends from our own Irish perspective.
2.2 Defining the Lifecycle In order to focus our time and ensure rigor and completeness, it was important that we clearly defined the analytics lifecycle to approach the problem correctly and follow a certain framework. Well-defined processes can help guide any analytic project and with an analytics lifecycle, the focus is on Data Science rather than Business Intelligence. Data Science projects differ in the fact that they require more due diligence in the discovery phase, tend to lack shape or structure and contain less predictable data.
Figure 2.2.1: The Analytics Lifecycle
● Discovery - Learn the business domain, assess resources available, frame the problem and begin learning the data.
● Data Preparation - Create the analytic sandbox (conduct discovery and situational analytics), ETL to get data into the sandbox so the team can work with it and analyse it, increase data familiarisation, data conditioning.
● Model Planning - Determine methods, techniques and workflow, data exploration, select key variables and the most suitable model.
● Model Building - Develop datasets for testing and training, build and execute the model, review existing tools and environment.
● Communicate Results - Determine success of results, identify key findings, quantify business value, develop a narrative to summarise and convey findings.
● Operationalise - Deliver final report including code used and technical specifications; no pilot run as there is no production environment.
Once models have been run and findings produced, it is critical to frame these results in a way that is tailored to the audience that engages the project and demonstrates clear value. If a team performs a technically accurate analysis but fails to translate the results into a language that resonates with the audience, people will not see the value, and much of the time and effort on the project will have been wasted.
2.3 Selecting the Model
As described by Finlay (2014), whether or not a model is of any use to the organisation can be answered by three simple questions:
● Does the model improve efficiency?
● Does the model result in better decision making?
● Does the model enable you to do something new?
When deciding whether the model is sufficient, or of any use to the organisation, it is useful to apply the above questions to the model, as well as determining whether the model is in line with the objectives outlined in section 2.1.
3. Description of Datasets
For our project we have a number of interconnected datasets given to us by Airbnb for the purpose of the challenge. In total we were supplied with four datasets. The first dataset we interact with is the training dataset. This main dataset consists of 213,451 × 16 items of data. The 16 columns are:
ID, Date account was created, Timestamp of first activity, Date of first booking, Gender, Age, Signup method, Signup flow, Language, Affiliate channel, Affiliate provider, First affiliate tracked, Signup app, First device type, First browser, Country destination.
The next dataset is the age gender dataset, which contains the following columns:
Age bracket, Country destination, Gender, Population (in thousands), Year.
The countries dataset contains the following columns:
Country destination, Latitude of destination, Longitude of destination, Distance from US in km, Area of destination,
Destination language, Language Levenshtein distance.
The test dataset is very similar to the training set, except that it lacks the country destination column; this is because that location is what must be predicted after training. The last dataset was the Sessions dataset, which contained the following columns:
User ID, Action, Action type, Action detail, Device type, Seconds elapsed.
4. Data Preparation
4.1 Merging of the Datasets
The first step was to cleanse the datasets and merge them together. The data sets utilised were of the following dimensions:
Sessions: 10,567,737 × 6
Training_Users: 213,451 × 16
Test_Users: 62,096 × 15
Figure 4.1.1: Reading and Loading the Datasets into RStudio
Particular packages of use in R are dplyr and ggplot2. They are installed and loaded below.
dplyr: This package is useful for data manipulation, using "verb" functions to perform the manipulation. In particular, the subset, group_by, summarize and arrange functions will be of use for preparing the data for analysis.
ggplot2: This package is primarily used for creating complex, customisable graphs in a clear and stylish manner, as opposed to the default graphing options in R, which have many limitations.
Figure 4.1.2: Loading the Packages into RStudio
The main goal of the data preparation phase is to combine the information available in the Sessions dataset with the Training and Test datasets. The Sessions dataset consists of the following information.
Figure 4.1.3: Sessions Variables
The first issue that arises with the Sessions dataset is that it contains multiple rows of data per user, as opposed to the single row per user in the Training and Test datasets. In order to combine the Sessions dataset with the others, only one row of data per
user would need to be created. This row of data would be a summary of the user's actions. The new variables created for each user were: total time on the site, average time spent per action, standard deviation (sd) of time per action, total number of actions taken, and number of key actions taken. The first three (total time on the site, average time per action and sd of time per action) were calculated using the secs_elapsed variable.
Figure 4.1.4: Creating Summary Statistics of the Sessions Dataset
As for the number of actions variable, the original action variable was used to get a count of how many actions each user took.
Figure 4.1.5: Creating the Count for the Actions Variables
Finally, the number of key actions builds on the original action variable and gives a count of how many key actions the user takes.
Figure 4.1.6: Creating the Count for Important Actions Variables
All of these data sets are then combined into one data frame.
Figure 4.1.7: Combining the Newly Created Variables
This creates a new data frame, made up of both Training and Test user data, with the following dimensions:
Sessions_New: 135,484 × 6
The next step in the process is to filter the training and test datasets based on the Sessions information. The Sessions information provided only dates back as far as 1/1/14, whereas the Training data dates back to 1/1/10. With regards to tailoring the dataset to answer the main objective, the Sessions information was deemed important for the analysis; therefore only Training users who recorded session activity from 1/1/14 onwards were considered.
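The figures cited in this section refer to code screenshots that are not reproduced in this transcript. A minimal dplyr sketch of the summary step described above might look as follows; the column names (user_id, secs_elapsed, action) follow the dataset description, while key_actions is an assumed vector of the actions the team deemed important:

```r
library(dplyr)

# Illustrative assumption: the set of actions treated as "key" actions.
key_actions <- c("booking_request", "payment_instruments")

# Collapse the multi-row Sessions data into one summary row per user.
sessions_new <- sessions %>%
  group_by(user_id) %>%
  summarise(
    total_time    = sum(secs_elapsed, na.rm = TRUE),   # total time on site
    mean_time     = mean(secs_elapsed, na.rm = TRUE),  # average time per action
    sd_time       = sd(secs_elapsed, na.rm = TRUE),    # sd of time per action
    n_actions     = n(),                               # total number of actions
    n_key_actions = sum(action %in% key_actions)       # count of key actions
  )
```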
  14. 14. 13 The Sessions_New data frame was split up between Training and Testing. Training: Figure 4.1.8: Splitting Sessions into Training Dataset Testing: Figure 4.1.9: Splitting Sessions into Testing Datasets The original Training Set was then filtered, keeping only users with session information. Figure 4.1.10: Filtering the Training and the Training Sessions Sets The original Test Set is similarly filtered to users with session information. Figure 4.1.11: Filtering the Test and the Test Sessions Sets Users with no session information are added to a different data set, making use of the Hmisc package in R. Figure 4.1.12: Set of Users Sessions Information not in the Training or Test Sets The appropriate data sets were then combined column wise in R. Figure 4.1.13: Creating the New Training and Test Sets
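The splitting, filtering and column-wise combination just described could be sketched in R as follows; this is a reconstruction under assumed object and column names, not the project's actual code, which the figures show only as screenshots:

```r
library(dplyr)

# Keep only users who recorded session activity, and attach the per-user
# session summaries to the Training and Test sets in one step.
training_final <- training_users %>%
  inner_join(sessions_new, by = c("id" = "user_id"))

test_final <- test_users %>%
  inner_join(sessions_new, by = c("id" = "user_id"))

# Users with session information but no Training/Test record are set aside.
orphan_sessions <- sessions_new %>%
  filter(!(user_id %in% c(training_users$id, test_users$id)))
```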
As for the users in the Test Set with no session activity, a value of 0 was given for all variables featuring session information, and these users were added to the Test Set via Excel. It was then decided to drop some variables that would not have any impact on the overall objective: the date of account creation and the timestamp of first activity. Also, as the date of first booking variable was null in the test set, it was removed as well. With appropriate filtering, new variables created and unneeded variables dropped, the Test and Training sets were of the following dimensions:
Training Set: 73,815 × 18
Test Set: 62,096 × 17
4.2 Missing Values
The next step of the data preparation process was dealing with missing values. This was an issue within the age variable in particular. As users signed up there was no obligation to give an age, which left a large portion of users with an undefined age. For example, within the training set there were 32,362 users with no age, which is nearly half of the users. As this was such a large portion of the data set, we could not simply remove the users with no age. There were several alternative options for dealing with the missing values. The first option was to turn age into a dummy variable, with the value 0 indicating no age specified and the value 1 indicating an age was specified. The second option was to impute the missing values by running a regression model to determine the age of a user based on the other variables. The third and final option was to impute the missing values with the mean value for age (≈ 35). Ideally, the second option of imputing the missing values would have been the leading candidate; however, the results of the regression analysis run in Minitab were disappointing, with no relationship found between the variables and the age of the user.
Both the first and third options seemed viable, so it was decided to try both in our final model, with the third option producing more accurate results.
4.3 Creation of Dummy Variables
In order to perform any prediction models or analysis on our datasets, we needed to convert our character variables into respective dummy variables to be used in their place. By doing this we can run our models without issues converting character data to numeric form.
The following variables required the creation of dummy variables:
● Gender: four levels, which included male, female, -unknown- and other.
● Signup Method: four levels, which included google, facebook, basic and weibo.
● Language: 25 levels, which included english, chinese, italian and spanish, to name but a few.
● Affiliate Channel: eight levels, which included seo, sem-non-brand, sem-brand, remarketing, other, direct, content, and api.
● First Affiliate Tracked: eight levels, which included linked, local ops, marketing, omg, product, tracked-other, untracked, and empty cells.
● Signup App: four levels, which included android, iOS, moweb, and web.
● Affiliate Provider: 17 levels, which included yahoo, padmapper, craigslist, google and many more.
● First Device Type: nine levels, which included android phone, android tablet, desktop (other), iPad, iPhone, mac desktop, other/unknown, smartphone (other), and windows desktop.
● First Browser: 39 levels, which included chrome, IE, opera, safari, firefox and many more.
The dummy variables were created using nested IF statements in Microsoft Excel. Following the creation of these variables, their respective character (original) versions were removed from the dataset, as they no longer served any purpose in the analysis.
Figure 4.3.1: Creation of Dummy Variables
The IF statement method was used for the majority of the dummy variable creations. However, certain columns had too many levels to use the IF statement method. For those, we sorted the data alphabetically on the column and manually input the dummy value for each distinct level. Although this was time consuming, it was a guaranteed way to ensure the test set and the training set had the same dummy values for the same variable levels.
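As an illustration of the nested IF approach (the exact formulas used are not reproduced in this transcript, and the level-to-number mapping shown is an assumption), a gender value in cell B2 could be encoded as:

```
=IF(B2="MALE",0,IF(B2="FEMALE",1,IF(B2="-unknown-",2,3)))
```

The formula is then filled down the column, and the numeric result replaces the original character column.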
Figure 4.3.2: Dummy Variables References
5. Initial Understanding of Data
5.1 Social Media Trends
Social media is a huge part of the modern era. Airbnb offers a few different methods of signing up for the website: Google+ (0), Facebook (1), the basic website signup method (2), and Weibo (3). To get a sense of how many sign-ups were actually made via each method, and to see whether they were evenly enough dispersed for there to be other patterns in booking based on the choice of social media, we used RStudio to create a bar chart of each type of sign up method, one for the training set and one for the test set.
Figure 5.1.1: Code for the Social Media Trends Graph
From the graphs below, it is very clear that the basic method is the most popular sign-up option. The data was so skewed that there is clearly no particular connection between the sign up method and the destination travelled to: too much of the data falls into one or two categories, and the training set is clearly missing one of the methods. As such, there isn't a strong connection between the signup method and the destination.
Figure 5.1.2: Training and Test Set Bar Charts for Signup Methods
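Figure 5.1.1 refers to a code screenshot not reproduced here; a minimal ggplot2 sketch of the described bar chart, assuming a signup_method column, might be:

```r
library(ggplot2)

# Bar chart of sign-up methods for the training set; the same call on the
# test set data frame produces the second chart.
ggplot(training_users, aes(x = factor(signup_method))) +
  geom_bar() +
  labs(title = "Signup Methods (Training Set)",
       x = "Signup method", y = "Number of users")
```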
5.2 Seasonal Trends
When we tidied up the data, we had removed all seasonal factors, so in order to perform this analysis we had to go back to the original datasets to find booking patterns. A big issue with this kind of analysis is that there are so many variables in the datasets that it is difficult to focus on just one. In trying to discover trends or patterns in the dataset, we quickly realised we had too many variables for the system to handle the analysis correctly, or even to plot a seasonal graph for us. So, in order to create any sort of seasonal plot, we decided to use R to subset our variables and create a new dataset which simply contains the total number of bookings per month. From this we could get a better understanding of the times of year at which bookings were being made.
Figure 5.2.1: Code for Separating the Data by the Month
The above code is a sample of the method used to break down our large dataset into a more basic format, which we then used in Minitab to create a more sensible and understandable time series graphic. Once we analysed the different months, we discovered that the patterns for the individual months didn't differ much from each other. As such, we combined the vectors together into a different dataset, exported it, and used it in Minitab to display a time series analysis graphic.
Figure 5.2.2: Minitab Time Series Analysis
As seen above, in Minitab we decided to begin by creating a simple time series plot to give us an initial understanding of the task at hand. The series was centred on the bookings per month, and we created a basic line graph, seen below, to highlight the booking patterns of Airbnb customers for the year 2014.
Figure 5.2.3: Time Series Analysis Plot of Total Bookings for 2014
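The monthly subsetting described in Figure 5.2.1 (a screenshot not reproduced here) could be sketched as follows, assuming a date_first_booking column; the monthly totals are then exported for plotting in Minitab:

```r
# Total first bookings per month for 2014, from the original unfiltered data.
booked <- training_users[!is.na(training_users$date_first_booking), ]
dates  <- as.Date(booked$date_first_booking)
b2014  <- dates[format(dates, "%Y") == "2014"]

monthly <- table(format(b2014, "%m"))  # counts indexed "01".."12"

# Export for the Minitab time series plot.
write.csv(as.data.frame(monthly), "monthly_bookings_2014.csv", row.names = FALSE)
```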
6. Explanation of Tools and Techniques
6.1 R Studio
RStudio is free software used mainly for statistics and visualisations. This is twofold for our project, as we need not only the ability to apply data mining and analytics, but also the opportunity to display information in graphs and charts in such a way that the average person could understand what we are trying to convey. RStudio allows the user to analyse data in a number of different ways, including linear and nonlinear modelling, traditional statistical tests (regression, correlation), data manipulation and data handling. This software offers a wide range of opportunities for analysing our datasets, in particular when it comes to the predictive aspects of where the next person may book. (Authorimanuel, 2016) We can also use RStudio as a visualisation tool. We can easily create maps displaying travel trends and the different locations users travel to. The maps allow us to show areas of greater popularity amongst travellers and also to display the results of our various area and social media trends.
6.2 Minitab
A very important thing to remember about our datasets is that they contain information on US citizens who have made their first booking on the Airbnb website. The information we have may not necessarily follow the trends normally expected of holiday-goers. Minitab is a statistical software package used in many areas, but in particular it offers a wide selection of methods that can be used for time series analysis, such as:
● Simple forecasting and smoothing methods
● Trend analysis
● Decomposition
● Moving average
● Single exponential smoothing
● Double exponential smoothing
● Winters' method
● Correlation analysis and ARIMA modelling
For what we intend to do in relation to the time series analysis, the method we feel may best fit our intentions is ARIMA modelling. ARIMA modelling not only
makes use of the patterns in the data, but is specifically tailored to find patterns that may not be visible in a simple visualisation. (Inc, 2016)
6.3 Excel and XLMiner
XLMiner is an analytical tool used in Excel as part of the Analytic Solver Platform. This software is used for both predictive and prescriptive analytics. It has many features, such as identifying key features, whereby the software uses feature selection to automatically locate the variable with the best explanatory power for your classification; methods for prediction, whereby the software offers options such as multiple linear regression, ensembles of regression trees and neural networks; and finally affinity analysis, which uses market basket analysis and a system of recommendations with specific rules. (Systems, 2016) The Excel add-in ArcGIS Maps was also used to give a geographic representation of the results generated.
7. Findings
7.1 Association Between Variables
Before proceeding with various models and techniques, it was important to first understand the variables we had at our disposal and the relationships they had with each other and with the target output variable. Within our data we had two types of variables, categorical and continuous. With categorical data the numbers simply distinguish between different groups, whereas with continuous data the value of the number is a measure of size. With different types of data, finding relationships between them can be difficult. Analysing the continuous variables at our disposal was the more straightforward task. However, there was a complication in that the data for the variables failed the normality test in Minitab; therefore we decided to use a Spearman's rho test. A Spearman's rank correlation matrix was created in Minitab based on the five variables created from the Sessions dataset. The variables were transformed using the rank function in Minitab and produced the following correlation matrix.

                    Time Spent  Mean Time  Sd Time  Number of Actions  Important Actions
Time Spent          1
Mean Time           0.59        1
Sd Time             0.72        0.94       1
Number of Actions   0.80        0.09       0.25     1
Important Actions   0.24        0.27       0.26     0.11               1

Table 7.1.1: Correlation Table
With p-values less than the significance level of 0.05, the correlations are statistically significant. In most cases there is a medium to strong relationship between the variables, indicating the variables should be a good foundation for a model.
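Although the matrix above was produced in Minitab, the same Spearman correlations can be reproduced in R; the variable names here are assumptions based on the Sessions summary described in section 4.1:

```r
# Spearman's rank correlations between the five Sessions-derived variables.
vars <- sessions_new[, c("total_time", "mean_time", "sd_time",
                         "n_actions", "n_key_actions")]
round(cor(vars, method = "spearman", use = "pairwise.complete.obs"), 2)

# Significance test for an individual pair:
cor.test(sessions_new$total_time, sessions_new$n_actions, method = "spearman")
```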
Figure 7.1.2: Summary Table
To determine the relationships between the categorical variables, a chi-squared test of independence was conducted. The data must first be summarised; for example, a count is taken of each gender type (4 levels) by whether they make a booking or not. For the purpose of this analysis, a new two-level categorical variable, "Booking", was created to simplify the analysis. A chi-squared test of independence was run in Minitab to test the association between the two categorical variables, gender and booking. For this example a significant interaction was found (χ²(3) = 4281, p < 0.05), thus rejecting the null hypothesis of independence between the variables. A similar approach was taken for the remaining categorical variables, with results similar to this example, indicating dependence between the categorical variables and the output variable "Booking".
7.2 The Earlier Models
The techniques we used in XLMiner were multiple linear regression and neural networks. However, there were certain unknown limitations to using XLMiner. The biggest was the limit on training data, which allows only 10,000 rows for training. Unfortunately, that was far too small an amount to train our extensive data. This meant that XLMiner was no longer a viable option for our analysis, and it was shelved in the end. The techniques we used in RStudio were plentiful in this project. We felt RStudio was a major factor in completing this project, as it had a range of techniques we could use to understand and solve our problems. Some of the ideas we had fell flat, while others succeeded more than expected. When it came to the machine learning aspect of our project, we thought the use of decision trees might allow for a relatively accurate result. We decided to try two different types of decision trees to determine which would give us the more accurate answer.
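The chi-squared test of independence described earlier in this section reduces to a few lines in R; the booking column, the derived two-level variable, is an assumed name:

```r
# Cross-tabulate gender (4 levels) against whether a booking was made,
# then test for independence.
tab <- table(training_users$gender, training_users$booking)
chisq.test(tab)  # reports the chi-squared statistic, df and p-value
```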
Firstly, we attempted to use the party package. Through this technique we managed to create our own model for the decision tree, using the variables we felt would best predict the destination of the customers. Unfortunately, the accuracy of this particular decision tree was very weak, yielding only 30% accuracy. Although we
tried several different variable combinations, the accuracy of our answers did not improve by much.
Figure 7.2.1: Code for the 'Party' Package Decision Tree
Next we tried the rpart package. This method had more detail in the code than the previous party option, and yielded a much more accurate result: 73% accuracy. The training is more detailed, as outlined below, with a control element added to the decision tree setting various parameters of the rpart fit. This allows for improved accuracy of the results.
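Figure 7.2.2 below is a screenshot not reproduced in this transcript; a hedged sketch of an rpart fit with a control element, using illustrative predictor names rather than the project's actual variable list, might look like:

```r
library(rpart)

# Classification tree for the destination, with tuned control parameters.
fit <- rpart(country_destination ~ age + gender + n_actions + total_time,
             data = training_final, method = "class",
             control = rpart.control(minsplit = 20,   # min obs to attempt a split
                                     cp = 0.001,      # complexity parameter
                                     maxdepth = 10))  # limit on tree depth

pred <- predict(fit, newdata = test_final, type = "class")
```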
Figure 7.2.2: Code for the 'RPart' Package Decision Tree
7.3 XGBoost
The XGBoost package available in R, which stands for Extreme Gradient Boosting, is a machine learning algorithm used for supervised learning problems. The idea of supervised learning is to use training data xi to predict the target variable yi. The model used for evaluating the prediction variable in the case of XGBoost is tree ensembles, which are sets of classification and regression trees (CART), where each output variable is classified into different leaves depending on the inputs. CART differs from decision trees in that it gives a score for each leaf within the tree. Similar to random forests, the prediction scores for each tree are combined to give an overall prediction. The main difference between tree ensembles and random forests is the way the model is trained. Training the model involves determining an objective function and then optimising it, with tree ensembles using an additive training approach.
7.4 The Final Model
The xgboost package and associated packages required for the prediction model are installed and loaded in R.
Figure 7.4.1: Loading the Packages in RStudio
The country destination column of the training set is assigned to its own data frame called labels.
Figure 7.4.2: Assigning the Destination Variable
Then, in order to run XGBoost, the destination column must be removed from the training set.
Figure 7.4.3: Eliminating the Destination Column
The next step involves assigning a numeric value to each country destination.
Figure 7.4.4: Assigning the Destination
Once this is completed, the process of training the model can begin. There are many different parameters involved in the xgboost function. Many of the values given for each parameter are either defaults or are widely used and acceptable. Some parameters that influence the output of the model and are worth mentioning are:
Eta (default = 0.3) is used to prevent overfitting, whereby it shrinks the weights at each step, giving a more conservative boosting process.
Max_depth indicates the maximum depth of a tree; the higher the value, the more complex the model becomes.
Subsample is the ratio of training instances, i.e. the fraction of data sampled to grow each tree. It is used to prevent overfitting.
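The steps above might be assembled and the model trained along these lines; this is a sketch with illustrative parameter values and assumed object names (labels, train_x), not the tuned values actually used:

```r
library(xgboost)

# 0-based integer class labels for the destination countries.
y <- as.integer(as.factor(labels$country_destination)) - 1

params <- list(objective = "multi:softprob",
               num_class = length(unique(y)),
               eta       = 0.3,   # shrinkage: guards against overfitting
               max_depth = 6,     # tree complexity
               subsample = 0.8)   # fraction of rows sampled per tree

model <- xgboost(data = as.matrix(train_x), label = y,
                 params = params, nrounds = 50)
```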
  28. 28. 27 Figure 7.4.5: XGBoost Parameters As with many machine learning algorithms, a major consideration in understanding the accuracy of the output is the fit of the model. A model that is underfit fails to identify relationships between the variables and the target output variable. On the other hand, a model is overfit when the relationships it has learned are too specific to the training data set and cannot be generalised to the wider population. Both cases lead to poor predictive accuracy. Parameter tuning within the xgboost model was performed to find a balanced model that identifies genuine relationships and can be applied to the wider population. Figure 7.4.6: Underfitting and Overfitting The next step involves using the model created to classify the test data. This is done using the predict function in R. Figure 7.4.7: Predicting the Destinations using the XGBoost Model The final steps relate to organising the results generated into a clear and manageable format. For the purpose of the Kaggle competition, the top 5 most likely destinations for each user were specified in descending order of likelihood.
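The overfitting risk described above can be made concrete with a small synthetic sketch (illustrative only, not the project's model): a maximally flexible predictor can score perfectly on training data yet fail to generalise.

```python
import random

# Illustrative sketch (synthetic data): an overly flexible model can fit
# its training data perfectly yet still generalise poorly. A 1-nearest-
# neighbour predictor memorises noisy training labels, so its training
# error is zero while its test error reflects the memorised noise.

random.seed(1)

def true_signal(x):
    return 2.0 * x

# training labels carry random noise; test labels are the clean signal
train = [(i / 10, true_signal(i / 10) + random.gauss(0, 0.5)) for i in range(10)]
test = [(i / 10 + 0.05, true_signal(i / 10 + 0.05)) for i in range(10)]

def predict_1nn(x):
    # return the label of the nearest training point (pure memorisation)
    return min(train, key=lambda p: abs(p[0] - x))[1]

def mse(data):
    return sum((predict_1nn(x) - y) ** 2 for x, y in data) / len(data)

print(mse(train))  # 0.0: the model reproduces every training label
print(mse(test))   # strictly positive: the noise does not generalise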
  29. 29. 28 Figure 7.4.8: Generating the Output Excel Files 7.5 Testing the Variables Once we had found the model that seemed to yield the best accuracy for our predictions, we proceeded to look at which factors were affecting our results negatively. We looked very closely at the evaluation metric and experimented with a few different ideas. The competition outline on Kaggle suggested the use of the NDCG (Normalised Discounted Cumulative Gain) metric. Figure 7.5.1: Mathematical Formula for Normalised Discounted Cumulative Gain This metric measures the performance of a system that produces recommendations ranked in order of relevance. The resulting values range from 0.0 to 1.0, with higher values indicating that relevant items appear higher in the ranking. This metric is widely used in evaluating the performance of web search engines such as Google. (Solera, F., 2015)
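Because in this competition each user has exactly one true destination and five ranked guesses, the metric collapses to a simple form; a minimal sketch (illustrative, not our R implementation):

```python
import math

# Illustrative sketch of NDCG@5 for this competition: each user has one
# true destination and five ranked guesses. With a single relevant item
# the ideal DCG is 1, so NDCG reduces to 1 / log2(position + 1) when the
# truth appears at that (1-based) position, and 0 when it is missed.

def ndcg_at_5(ranked_guesses, truth):
    for i, guess in enumerate(ranked_guesses, start=1):
        if guess == truth:
            return 1.0 / math.log2(i + 1)
    return 0.0

print(ndcg_at_5(["US", "FR", "NDF", "IT", "ES"], "US"))  # 1.0: truth ranked first
print(ndcg_at_5(["US", "FR", "NDF", "IT", "ES"], "FR"))  # lower: truth ranked second
```

This is why the submission lists the five destinations in descending order of likelihood: placing the true destination higher earns a larger score.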
  30. 30. 29 Seeing as this was the recommended metric, it was natural that we would attempt to use it. Our only issue was building the formula. We did succeed in building a function, but ran into an error in RStudio that we were ultimately unable to solve: there were issues with the size of certain objects and the incompatibility of that size with other objects. As a result we decided to move forward with other evaluation metrics that did not require us to create the function ourselves, which would reduce the risk of error. Figure 7.5.2: R Code to Generate the NDCG Evaluation Metric Another metric we tried was RMSE (root mean square error), xgboost's default evaluation metric for regression (with classification error being the default for classification). Mean square error measures the closeness of a fitted line to its data points; root mean square error is simply its square root. As statistics go, it is one of the more easily understood ones. However, it did not yield a decent accuracy, and we felt we could do better. (Vernier, 2016) Finally we looked into the idea of using multiclass log loss, or mlogloss as it is referred to in R. This metric was widely discussed on forums and blogs about xgboost, with some even calling it the best metric to use. For each observation, the model predicts a probability for each of the possible classes. Mlogloss is the negative log likelihood of the specified model, which assumes that each observation in the test set is chosen independently from a distribution that assigns a relative probability to the corresponding class for every observation in the set. (Kaggle, 2016) Figure 7.5.3: Mathematical Formula for Multiclass Log Loss For our final attempt at improving the model we looked at the different variables we were using. Ideally we wanted to reduce the model to as few variables as possible
  31. 31. 30 because we were certain there were a few variables in our model that contributed nothing to our predictions, or worse, were actually reducing our prediction accuracy. We began by simply altering our datasets and deleting the columns we felt were not necessary for the analysis. We tried removing the different affiliate variables and the device variables, generally the things we thought were redundant. Once we saw that this made a difference, but not a good one, we tried removing the variables we thought were hurting the model. In the end we discovered that all our variables actually mattered in the model: surprisingly, every little detail the model can get about a customer helps shape the prediction of their next destination. 7.6 The Results To convey the results generated, we make use of the country_pred data frame created using the predict function in R. This data frame outputs a probability associated with each destination for each user, as seen in the example figure below. Figure 7.6.1: Probability of a Random User Booking in Each Destination From the table below we have outlined how many people we anticipate will visit each location as their first booking destination, or in most cases no destination at all. We used an expected value method to calculate the number of people travelling to each of the 11 destinations or to no destination at all: EUB = Σ P(x), where EUB is the Expected User Bookings for country destination x, and P(x) is a user's predicted probability of booking x, summed over all users. This formula was executed through Excel.
  32. 32. 31
Country | Expected User Bookings
Australia | 552
Canada | 802
Germany | 662
Spain | 883
France | 1359
Great Britain | 951
Italy | 1041
Netherlands | 723
Portugal | 514
United States of America | 13885
Other | 3008
No Booking | 37716
Table 7.6.2: The Expected Number of Airbnb Users Travelling to Each Destination
In order to get a clearer image of the exact results we obtained for those Airbnb users for whom we were able to predict a destination, we designed the following graphic. As neither the 'other' destination nor NDF (no destination found) can be represented geographically, they were omitted from the graphic below. The size of the orange dot on each location is proportional to the share of users with that definitive location. It is clear from the graph that the USA attracted the largest percentage of users, which makes sense given that all the users were from the USA. Australia and Portugal have much smaller dots, as they were the least popular travel destinations.
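The expected value method used for the table above can be sketched as follows (with made-up probabilities for three hypothetical users; the real calculation ran over all test users in Excel):

```python
# Illustrative sketch (made-up probabilities): the expected number of
# bookings for a destination is the sum, over all users, of each user's
# predicted probability of booking that destination.

user_probs = [
    {"US": 0.6, "FR": 0.3, "NDF": 0.1},
    {"US": 0.2, "FR": 0.5, "NDF": 0.3},
    {"US": 0.7, "FR": 0.1, "NDF": 0.2},
]

def expected_bookings(probs, destination):
    # sum each user's probability for this destination
    return sum(p[destination] for p in probs)

print(expected_bookings(user_probs, "US"))
```

Summing per-user probabilities rather than counting only each user's single most likely destination is what allows fractional contributions from uncertain users to accumulate into the totals shown in the table.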
  33. 33. 32 Figure 7.6.3: Map of Results So how accurate was our analysis overall? When it came to checking the accuracy of the model we had built, we were lucky enough to have Kaggle do it for us. Once we created the datasets of our top 5 predicted countries per user, we simply uploaded them to Kaggle and received a result indicating how accurate our findings actually were. In the end, after much editing of the model, we came to a final result of 87.248% accuracy for our predictions. Figure 7.6.4: The Evaluated Accuracy of the Final Model via Kaggle
  34. 34. 33 8. Conclusion From the outset of this project our main aim was to predict the destination of the next Airbnb user's booking. Although we aimed to discover a few other items of information along the way, that has always been the main objective of this project. From the beginning we clearly understood the limitations we would face with a project such as this. Predicting people's travel patterns was always going to be difficult, considering we cannot possibly know everything that a person weighs up when booking a holiday. We were limited by the fact that we cannot predict the likes of world disasters, people's personal aversions to certain locations, or even how fickle any one particular human being can be when making a decision like this. However, we feel we overcame those limitations throughout this project and produced a result that may not be one hundred percent accurate, but then again no model ever is. We felt our final model yielded a very accurate result when compared to the winner of the Kaggle competition, whose model was 88.697% accurate. In the end, our model was strong and our results showed it: even though we did not achieve a perfectly accurate result, we got one that was close, given the information that these kinds of decisions require. 9. Appendices Referencing Airbnb (2016) About Us. Available at: (Accessed: 27 May 2016) Amazon Web Services. Model Fit: Underfitting vs. Overfitting. Available at: overfitting.html (Accessed: 19 June 2016) Authorimanuel, T. (2013) Top 20 Predictive Analytics Freeware Software. Available at: (Accessed: 31 March 2016) Finlay, S. (2014) Predictive Analytics, Data Mining and Big Data: Myths, Misconceptions and Methods. Graphics with ggplot2. He, T. (2016) An Introduction to XGBoost R Package. Available at: (Accessed: 14 June 2016)
  35. 35. 34 Minitab Inc. (2016) Methods for Analyzing Time Series. Available at: us/minitab/17/topic-library/modeling-statistics/time-series/basics/methods-for-analyzing-time-series/ (Accessed: 31 March 2016) Introduction to Boosted Trees. Available at: (Accessed: 14 June 2016) Introduction to dplyr (2015). (Accessed: 20 June 2016) Jain, A. (2016) Complete Guide to Parameter Tuning in XGBoost. Available at: with-codes-python/ (Accessed: 19 June 2016) Kaggle (2016) Multi Class Log Loss. Available at: (Accessed: 16 June 2016) Parise, S., Iyer, B. and Vesset, D. (2015) Four Strategies to Capture and Create Value from Big Data. Available at: create-value-from-big-data/ (Accessed: 14 June 2016) PennState. Performing a Chi-Square Test of Independence from Summarized Data in Minitab. (Accessed: 20 June 2016) Solera, F. (2015) Normalized Discounted Cumulative Gain. Available at: (Accessed: 16 June 2016) Vernier (2016) What Are Mean Squared Error and Root Mean Squared Error? Vernier Software & Technology. Available at: (Accessed: 19 June 2016) XGBoost R Tutorial. Available at: (Accessed: 14 June 2016)
  36. 36. 35 EXTRA EARLIER WORKINGS The following section reflects our initial understanding of the information provided and the situation we were dealing with. We decided to use RStudio to produce a series of graphs to gain a better insight into what is happening with some of our more major variables. We started by looking at the sign-up methods, i.e. through Airbnb (basic), Facebook, or Google+. From our initial look at the data it is clear that most users sign up either directly via the website or through Facebook; Google+ sign-ups are almost non-existent. Following on from that, we had a look at the different countries and how many people in our dataset visited them. From a quick glance at the histogram it is clear that
  37. 37. 36 the US is the most popular destination, which makes sense: since all our customers are from the US, you would naturally expect the most popular holiday destination to be the US itself. Outside the US, however, the most popular destinations are 'other', France, Italy and Spain. The least popular destinations are Portugal, Australia and the Netherlands. Finally, we looked at the first device type used by the customers. The most popular among these were the Mac desktop and the Windows desktop. Smartphones other than Android or iPhone, and the Android tablet, were the least popular among the customers.
  38. 38. 37
  39. 39. 38
  40. 40. 39 These graphics were created from one of our earlier datasets. We went through a few versions of the datasets before we were happy with them, and along the way we created different graphics to understand the data. We decided to keep these graphics as a record of that earlier work on the earlier datasets.
  41. 41. 40 Code Used
Merging the datasets
## Reading in the data sets
Training_Set <- read.csv(file = "train_users_2.csv")
View(Training_Set)
Test_Set <- read.csv(file = "test_users.csv")
View(Test_Set)
Sessions <- read.csv(file = "sessions.csv")
View(Sessions)
## Loading Packages
install.packages("dplyr", dependencies = TRUE)
library(dplyr)
install.packages("ggplot2", dependencies = TRUE)
library(ggplot2)
## Replace missing values (NAs) with 0
## (the original listing was truncated here; NA replacement is assumed)
Sessions[is.na(Sessions)] <- 0
Sessions_Time <- Sessions %>% group_by(user_id) %>%
  summarize(Time_Spent = sum(secs_elapsed), Mean_Time = mean(secs_elapsed), Sd_Time = sd(secs_elapsed))
Sessions_Actions <- data.frame(table(Sessions$user_id))
Sessions_Actions <- Sessions_Actions[apply(Sessions_Actions[2], 1, function(z) any(z != 0)), ]
Important_Actions <- c("pending", "booking request", "at_checkpoint")
Sessions <- transform(Sessions, c = ifelse(action %in% Important_Actions, 1, 0))
Sessions_Important_Actions <- Sessions %>% group_by(user_id) %>%
  summarize(Important_Actions = sum(c))
Sessions_New <- cbind(Sessions_Time, Sessions_Actions, Sessions_Important_Actions)
Training_Set_ID <- Training_Set$id
Sessions_Training <- Sessions_New %>% subset(user_id %in% Training_Set_ID)
  42. 42. 41
Test_Set_ID <- Test_Set$id
Sessions_Test <- Sessions_New %>% subset(user_id %in% Test_Set_ID)
Training_Set_ID_2 <- Sessions_Training$user_id
Training_Set <- Training_Set %>% subset(id %in% Training_Set_ID_2) %>% arrange(id)
Test_Set_ID_2 <- Sessions_Test$user_id
Test_Set_Main <- Test_Set %>% subset(id %in% Test_Set_ID_2) %>% arrange(id)
install.packages("Hmisc", dependencies = TRUE)
library(Hmisc)
Test_Set_Others <- Test_Set %>% subset(id %nin% Test_Set_ID_2) %>% arrange(id)
Training_Set <- cbind(Training_Set, Sessions_Training)
Test_Set <- cbind(Test_Set_Main, Sessions_Test)
Running the Xgboost
## Loading csv files. Must be in the folder of the R project, and saved with the right names.
Training_Set <- read.csv(file = "Training_Set_Final.csv")
Test_Set <- read.csv(file = "Test_Set_Final.csv")
View(Training_Set)
View(Test_Set)
## Loading packages required for xgboost.
library(xgboost)
library(readr)
library(stringr)
library(caret)
library(car)
## Assign the destination variable to labels.
labels = Training_Set['country_destination']
  43. 43. 42
## Then remove the destination column.
Training_Set$country_destination <- NULL
## country_num is the country destination as numeric.
country_num <- recode(labels$country_destination, "'NDF'=0; 'US'=1; 'other'=2; 'FR'=3; 'CA'=4; 'GB'=5; 'ES'=6; 'IT'=7; 'PT'=8; 'NL'=9; 'DE'=10; 'AU'=11")
country_num <- as.numeric(as.character(country_num))
## Train the xgboost model
xgb <- xgboost(data = data.matrix(Training_Set[,-1]),
               label = country_num,
               eta = 0.1,
               max_depth = 8,
               nround = 40,
               subsample = 0.7,
               colsample_bytree = .8,
               seed = 1,
               eval_metric = "mlogloss",
               objective = "multi:softprob",
               num_class = 12,
               nthread = 3)
## Predict values in the test set
country_pred <- predict(xgb, data.matrix(Test_Set[,-1]))
## Extract the 5 classes with the highest probabilities
## (the original listing was truncated here; reshaping the prediction
## vector into a 12-row matrix, one row per class, is assumed)
predictions <- as.data.frame(matrix(country_pred, nrow = 12))
rownames(predictions) <- c('NDF','US','other','FR','CA','GB','ES','IT','PT','NL','DE','AU')
predictions_top5 <- as.vector(apply(predictions, 2, function(x) names(sort(x)[12:8])))
## Create the prediction data frame
Test_Set$id <- as.character(Test_Set$id)
ids <- NULL
for (i in 1:NROW(Test_Set)) {
  idx <- Test_Set$id[i]
  ids <- append(ids, rep(idx, 5))
}
  44. 44. 43
country_prediction_top5 <- NULL
country_prediction_top5$id <- ids
country_prediction_top5$country <- predictions_top5
# Create the final prediction file
# (the original listing was truncated here; converting the list to a
# data frame before writing it out is assumed)
final_prediction <- as.data.frame(country_prediction_top5)
write.csv(final_prediction, "final_prediction.csv")