This document summarizes a project that analyzes and predicts flight prices using historical pricing data from Amadeus, the largest global distribution system in the travel industry. The goals were to construct a classifier that distinguishes expensive from cheap tickets, use it to predict future prices, and determine which factors have the greatest impact on price. Exploratory analysis of the 27.2-billion-record dataset found most activity in Europe. Two classifiers, a support vector machine and L1-regularized linear regression, were implemented on Hadoop, with preprocessing done in MapReduce jobs and training done via parallelized stochastic gradient descent. Overall accuracy improved from 65% to 75%, and per-airline classifiers yielded even better accuracy.
14.05.12 Analysis and Prediction of Flight Prices using Historical Pricing Data with Hadoop (Jérémie Miserez, ETH Zürich)
1. Analysis and Prediction of Flight Prices
using historical pricing data
1st Swiss Hadoop User Group meeting – May 14, 2012
Jérémie Miserez - miserezj@student.ethz.ch
2012-05-14
2. Overview
Project setup
Goals
Exploratory data analysis (Hadoop)
Classification & prediction methods
Processing pipeline (Hadoop)
Results
This project was done as part of my Bachelor’s thesis at the Systems Group,
ETH Zürich, in collaboration with Amadeus IT Group SA.
3. Project setup
Airline tickets can be bought up to ~1 year in advance.
Prices change from day to day.
Amadeus CRS is the largest global distribution system in
the travel/tourism industry:
sells tickets for 435 airlines (also hotels, cruises, etc.)
processes ~850 million billable transactions per year
Amadeus provided us with a dataset containing buyable
tickets for each day from May 2008 – Jan 2011.
4. Goals
1. Construct and train a general classifier so that it can
distinguish between expensive and cheap tickets.
2. Use this classifier to predict the prices of future tickets.
3. Determine which factors have the greatest impact on price
by analyzing the trained classifier.
But first: Need to understand dataset!
5. Exploratory data analysis
Extent of the dataset:
27.2 billion records
132.2 GiB (uncompressed)
63 departure airports, 428 destinations, 4387 routes, 117 airlines
7. Exploratory data analysis
Lots of fields:
“Buy” date: When was this price current?
“Fly” date: When does the flight leave?
…
Price & currency
…
Cabin class: Economy/Business/First (98% economy tickets)
Booking class: A–Z
…
Airline: the airline selling the ticket.
…
Not a time series: tickets are not linked over time.
8. Exploratory data analysis
Visualizing small subsets of the data helps in understanding it.
Many simple Hadoop jobs were used to preprocess the data, with
multiple visualizations done in MATLAB.
Can we see some patterns already?
9. Exploratory data analysis
For ZRH-BKK, plot the prices of the cheapest tickets available every day:
[Plot: buy date vs. fly date; the cheapest available price ranges from 600 EUR to 2400 EUR, with visible peaks for July and December fly dates]
10. Classification & Prediction methods
Implemented two different classifiers:
Support vector machine (SVM)
L1-regularized linear regression
Both are convex minimization problems that can be solved
online by employing the stochastic gradient descent (SGD)
method.
The online algorithm uses constant memory, independent of the
size of the dataset.
“Stochastic”: Select order of training points at random from dataset.
SGD can be parallelized (parallelized SGD)* with almost
no overhead, and is very suitable for use with MapReduce.
* M. Zinkevich, M. Weimer, A. Smola, and L. Li. “Parallelized Stochastic Gradient Descent”, 24th Annual Conference on Neural Information Processing Systems, 2010.
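The parallelization idea can be sketched in a few lines of plain Python (the helper names and the toy hinge-loss update are illustrative, not the thesis code): each worker runs ordinary SGD over its own chunk, and the resulting weight vectors are simply averaged.

```python
import random

def sgd(chunk, w, lr=0.1):
    """One SGD pass over a chunk: hinge-loss subgradient updates for a
    linear classifier (bias folded into the feature vector)."""
    for x, y in chunk:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        if margin < 1:  # point is misclassified or inside the margin
            w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w

def parallel_sgd(data, k, dim):
    """Zinkevich-style PSGD: shuffle, split into k chunks, train each chunk
    independently (the parallel 'map' step), then average the models."""
    random.shuffle(data)  # "stochastic": random order of training points
    chunks = [data[i::k] for i in range(k)]
    models = [sgd(chunk, [0.0] * dim) for chunk in chunks]  # runs in parallel
    return [sum(ws) / k for ws in zip(*models)]  # averaging = 'reduce' step
```

Each worker only ever holds its current weight vector, which is why memory usage stays constant regardless of dataset size.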
11. Classification & Prediction methods
SVM: binary linear classifier
Goal: Find maximum-margin hyperplane
that divides the points with label “+1” from
those with label “-1”.
After training:
Hyperplane parameters: weight vector w and offset b.
Get the label for a data point x as: ŷ = sign(w·x + b)
Training:
Generate training label yᵢ ∈ {−1, +1} for the i-th data point xᵢ.
Choose hyperplane parameters (w, b) so the margin is maximal and the training data
is still correctly classified: minimize ‖w‖² subject to yᵢ(w·xᵢ + b) ≥ 1 for all i.
12. Classification & Prediction methods
Implementation uses:
Hinge loss function: L(x, y) = max(0, 1 − y(w·x + b))
Takes “outliers” into account (soft margin).
Regularization parameter λ:
Bounds the length of w, i.e. a large λ increases generalization.
Preprocess data for zero mean, unit variance.
For training points:
Margin: yᵢ(w·xᵢ + b), with lower bound: 1 − L(xᵢ, yᵢ)
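A minimal numeric sketch of the regularized update described above (variable names are my own, and the offset b is omitted for brevity):

```python
import numpy as np

def hinge_loss(w, X, y):
    """max(0, 1 - y * (w.x)): zero outside the margin, linear inside it."""
    return np.maximum(0.0, 1.0 - y * (X @ w))

def sgd_step(w, x, y, lr=0.1, lam=0.01):
    """One regularized SGD step: the lambda term shrinks w every step
    (bounding its length, which improves generalization); the hinge term
    fires only for points whose margin is below the lower bound of 1."""
    w = (1.0 - lr * lam) * w
    if y * (x @ w) < 1.0:
        w = w + lr * y * x
    return w
```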
13. Hadoop: Preprocessing
Generate training labels (y) from dataset:
Convert currencies using historical exchange rates.
For each route r, calculate the arithmetic mean (and standard
deviation) of the price over all tickets.
Assign labels:
Label +: “Above mean price for this route”
Label -: “Below mean price for this route”
Only store mean/std-dev; do not actually store labels in HDFS.
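The per-route statistics pass can be sketched as follows (plain Python standing in for the Hadoop job; all names are illustrative). Count, sum and sum of squares are enough to recover mean and standard deviation, which is why only these small aggregates need to be stored:

```python
from collections import defaultdict
from math import sqrt

def route_stats(records):
    """One pass over (route, price) pairs -> per-route (mean, std-dev)."""
    acc = defaultdict(lambda: [0, 0.0, 0.0])  # route -> [n, sum, sum_sq]
    for route, price in records:
        a = acc[route]
        a[0] += 1
        a[1] += price
        a[2] += price * price
    stats = {}
    for route, (n, s, sq) in acc.items():
        mean = s / n
        stats[route] = (mean, sqrt(max(sq / n - mean * mean, 0.0)))
    return stats

def label(route, price, stats):
    """+1: above the mean price for this route; -1: below."""
    return 1 if price > stats[route][0] else -1
```

Labels are then derived on the fly from the stored statistics rather than materialized in HDFS.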
14. Hadoop: Preprocessing
Extract features from plaintext records (x).
Each plaintext record is transformed into a 930-dimensional vector.
Each dimension contains a numerical value corresponding to a
feature such as:
Number of days between “Buy” and “Fly” dates
Day of week (for all dates)
Is the day on a weekend? (for all dates)
Is the currency CHF?
etc.
Each dimension is normalized to zero mean and unit variance.
(per route r)
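A miniature of the feature extraction step, with just a few of the 930 dimensions and record field names invented for the sketch:

```python
import numpy as np
from datetime import date

def extract_features(record):
    """Map one record to a numeric vector (tiny stand-in for the
    930-dimensional version; field names here are hypothetical)."""
    buy, fly = record["buy_date"], record["fly_date"]
    return np.array([
        float((fly - buy).days),                 # days between buy and fly
        float(fly.isoweekday()),                 # day of week of the flight
        1.0 if fly.isoweekday() >= 6 else 0.0,   # is the flight on a weekend?
        1.0 if record["currency"] == "CHF" else 0.0,
    ])

def normalize(X):
    """Zero mean / unit variance per dimension (done per route r)."""
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant dimensions untouched
    return (X - X.mean(axis=0)) / std
```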
15. Hadoop: Processing pipeline
Shuffle the data
(P)SGD demands random selection of
data points
Partition the data into n (=1200)
chunks
Train using PSGD:
Parallel training on k (=40) chunks
Average hyperplane coefficients after
all 1200 chunks have been
processed (= after 30 iterations).
We can get intermediate results
by calculating the accuracy every
time 40 chunks have been
processed.
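The rounds above can be sketched end-to-end; `train_chunk` and `accuracy` are placeholders standing in for the real Hadoop jobs:

```python
import random

def train_pipeline(data, model0, train_chunk, accuracy, n=1200, k=40):
    """Shuffle, partition into n chunks, train k chunks per MapReduce round
    and average the resulting models; evaluating accuracy after each round
    gives the intermediate results (n/k = 30 rounds for the defaults)."""
    random.shuffle(data)                   # PSGD needs a random order
    chunks = [data[i::n] for i in range(n)]
    model, history = model0, []
    for start in range(0, n, k):
        part = chunks[start:start + k]
        models = [train_chunk(c, model) for c in part]          # parallel map
        model = [sum(ws) / len(models) for ws in zip(*models)]  # average
        history.append(accuracy(model))    # intermediate accuracy per round
    return model, history
```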
16. Extensions to the basic algorithms:
Hierarchical classification:
Train 7 classifiers in parallel.
Increases runtime by a factor of 3.
Per-airline classification:
Train 1+21 classifiers in parallel.
Increases runtime by a factor of 2.
General classifier
1 – Airline A classifier (21%)
2 – Airline B classifier (9%)
3 – Airline C classifier (7%)
4 – Airline D classifier (6%)
… …
21 – “Other” airlines (15.4%)