Analysis and Prediction of Flight Prices
using historical pricing data

1st Swiss Hadoop User Group meeting – May 14, 2012

Jérémie Miserez - miserezj@student.ethz.ch




Overview

   Project setup
   Goals
   Exploratory data analysis (Hadoop)
   Classification & prediction methods
   Processing pipeline (Hadoop)
   Results

   This project was done as part of my Bachelor’s thesis at the Systems Group,
    ETH Zürich, in collaboration with Amadeus IT Group SA.




Project setup

 Airline tickets can be bought up to ~1 year in advance.
   Prices change from day to day.

 Amadeus CRS is the largest global distribution system in
  the travel/tourism industry:
   sells tickets for 435 airlines (also hotels, cruises, etc.)
   processes ~850 million billable transactions per year

 Amadeus provided us with a dataset containing buyable
  tickets for each day from May 2008 – Jan 2011.



Goals

1. Construct and train a general classifier so that it can
   distinguish between expensive and cheap tickets.

2. Use this classifier to predict the prices of future tickets.

3. Determine which factors have the greatest impact on price
   by analyzing the trained classifier.

 But first: we need to understand the dataset!



Exploratory data analysis

 Extent of the dataset:
   27.2 billion records
   132.2 GiB (uncompressed)
   63 departure airports, 428 destinations, 4387 routes, 117 airlines




Exploratory data analysis

 The majority of activity is concentrated in Europe:




Exploratory data analysis

 Lots of fields:
      “Buy” date:        When was this price current?
      “Fly” date:        When does the flight leave?
      …
      Price & currency
      …
      Cabin class        Economy/Business/First       (98% economy tickets)
      Booking class      A-Z
      …
      Airline            The airline selling the ticket.
      …

 Not a time series: tickets are not linked over time.

Exploratory data analysis

 Visualizing small subsets of the data helps to understand it.

 Many simple Hadoop jobs were used to preprocess the data,
  followed by multiple visualizations in MATLAB.

 Can we see some patterns already?
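
A minimal sketch of what one such simple preprocessing job could look like, written as plain Python in map/reduce style rather than as an actual Hadoop job. It keeps the cheapest price per (route, buy date, fly date). The record layout and the names map_record and reduce_min are assumptions made for illustration; they are not taken from the original project.

```python
# Hypothetical, simplified records: (route, buy_date, fly_date, price_eur).
records = [
    ("ZRH-BKK", "2010-01-01", "2010-07-01", 950.0),
    ("ZRH-BKK", "2010-01-01", "2010-07-01", 820.0),
    ("ZRH-BKK", "2010-01-02", "2010-07-01", 870.0),
    ("ZRH-LHR", "2010-01-01", "2010-07-01", 120.0),
]


def map_record(record):
    """Map step: emit ((route, buy_date, fly_date), price)."""
    route, buy_date, fly_date, price = record
    return (route, buy_date, fly_date), price


def reduce_min(pairs):
    """Reduce step: keep the minimum price seen for each key."""
    cheapest = {}
    for key, price in pairs:
        if key not in cheapest or price < cheapest[key]:
            cheapest[key] = price
    return cheapest


cheapest = reduce_min(map_record(r) for r in records)
for key in sorted(cheapest):
    print(key, cheapest[key])
```

The reduced output is small enough to be loaded into a plotting tool, since it holds one value per (buy date, fly date) pair and route.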




Exploratory data analysis
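        For ZRH-BKK, plot the prices of the cheapest tickets available every day:

        [Figure: cheapest ticket price by fly date and buy date (axes), on a scale from 600 EUR to 2400 EUR; July and December are labeled along the fly-date axis.]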
Classification & Prediction methods

 Implemented two different classifiers:
     Support vector machine (SVM)
     L1-regularized linear regression

 Both are convex minimization problems that can be solved
  online by employing the stochastic gradient descent (SGD)
  method.
     Online algorithm results in constant memory usage, does not depend
      on size of dataset.
     “Stochastic”: Select order of training points at random from dataset.
 SGD can be parallelized (parallelized SGD)* with almost
  no overhead, and is very suitable for use with MapReduce.
   * M. Zinkevich, M. Weimer, A. Smola, and L. Li. “Parallelized stochastic gradient descent”, 24th Annual Conference on Neural Information Processing Systems, 2010.


Classification & Prediction methods

 SVM: binary linear classifier
   Goal: Find maximum-margin hyperplane
    that divides the points with label “+1” from
    those with label “-1”.

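    After training:
       Hyperplane parameters: $w$ (normal vector) and $b$ (offset)
       Get label for a data point $x$ as $y = \operatorname{sign}(w^\top x + b)$

    Training:
       Generate training label $y_i \in \{+1, -1\}$ for i-th data point $x_i$
       Choose hyperplane parameters so the margin $2 / \lVert w \rVert$ is maximal and the training data
        is still correctly classified: $y_i (w^\top x_i + b) \ge 1$ for all $i$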



Classification & Prediction methods

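 Implementation uses:
    Hinge loss function: $\ell(w, b; x_i, y_i) = \max(0,\ 1 - y_i (w^\top x_i + b))$
       Takes into account “outliers”.
    Regularization parameter $\lambda$
       Bounds length of $w$, i.e. a large $\lambda$ increases generalization.
    Preprocess data for zero mean, unit variance

 For $n$ training points:

         $\min_{w, b}\ \frac{\lambda}{2} \lVert w \rVert^2 + \frac{1}{n} \sum_{i=1}^{n} \max(0,\ 1 - y_i (w^\top x_i + b))$

         Margin: $2 / \lVert w \rVert$, with a lower bound determined by $\lambda$.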
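
A minimal sketch of how a regularized hinge-loss SVM can be trained online with SGD, in pure Python. Several things here are assumptions and not the project's implementation: the step size 1/(lam * t) (a Pegasos-style schedule), the omission of the offset (which presumes zero-mean features, as in the preprocessing step), the toy data, and the names train_svm_sgd and predict.

```python
import random


def predict(w, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    score = sum(wj * xj for wj, xj in zip(w, x))
    return 1 if score >= 0 else -1


def train_svm_sgd(points, labels, lam=0.01, epochs=20, seed=0):
    """SGD on lam/2 * ||w||^2 + mean hinge loss (no offset term).

    Memory use is constant in the number of points: only w is kept.
    """
    rng = random.Random(seed)
    w = [0.0] * len(points[0])
    order = list(range(len(points)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)  # "stochastic": visit points in random order
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # assumed step-size schedule
            x, y = points[i], labels[i]
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            # Gradient step on the regularizer shrinks w.
            w = [(1.0 - eta * lam) * wj for wj in w]
            # Gradient step on the hinge loss, active only if margin < 1.
            if margin < 1.0:
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w


# Toy, linearly separable data (hypothetical).
points = [(2.0, 1.0), (1.0, 2.0), (1.5, 1.5),
          (-2.0, -1.0), (-1.0, -2.0), (-1.5, -1.5)]
labels = [1, 1, 1, -1, -1, -1]

w = train_svm_sgd(points, labels)
print([predict(w, x) for x in points])
```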

Hadoop: Preprocessing

 Generate training labels (y) from dataset:

   Convert currencies using historical exchange rates.

   For each route r, calculate the arithmetic mean (and standard
    deviation) of the price over all tickets.

   Assign labels:
      Label +: “Above mean price for this route”
      Label -: “Below mean price for this route”


   Only store mean/std-dev, do not actually store labels in the HDFS.
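
The labeling step above can be sketched as follows: per-route mean and standard deviation are computed in one pass from running sums, and labels are derived on the fly from the stored mean instead of being stored themselves. This is an illustrative sketch only. The input layout (route, price in EUR after currency conversion), the function names, and the choice to label a price exactly equal to the mean as "-" are assumptions.

```python
import math
from collections import defaultdict


def route_statistics(tickets):
    """Return {route: (mean, std_dev)} from (route, price_eur) pairs."""
    sums = defaultdict(lambda: [0, 0.0, 0.0])  # count, sum, sum of squares
    for route, price in tickets:
        acc = sums[route]
        acc[0] += 1
        acc[1] += price
        acc[2] += price * price
    stats = {}
    for route, (n, s, sq) in sums.items():
        mean = s / n
        variance = max(sq / n - mean * mean, 0.0)  # guard against rounding
        stats[route] = (mean, math.sqrt(variance))
    return stats


def label(route, price, stats):
    """+1: above the mean price for this route, -1: otherwise."""
    mean, _ = stats[route]
    return 1 if price > mean else -1


# Hypothetical tickets, prices already converted to EUR.
tickets = [("ZRH-BKK", 600.0), ("ZRH-BKK", 1000.0),
           ("ZRH-LHR", 100.0), ("ZRH-LHR", 300.0)]
stats = route_statistics(tickets)
print(stats["ZRH-BKK"], label("ZRH-BKK", 1000.0, stats))
```

Running sums of count, sum, and sum of squares combine associatively, which is why this kind of statistic fits a MapReduce combiner/reducer.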



Hadoop: Preprocessing

 Extract features from plaintext records (x).

   Each plaintext record is transformed into a 930-dimensional vector.

   Each dimension contains a numerical value corresponding to a
    feature such as:
      Number of days between “Buy” and “Fly” dates
       Day of week (for all dates)
       Is the day on a weekend? (for all dates)
       Is the currency CHF?
      etc.


   Each dimension is normalized to zero mean and unit variance.
      (per route r)
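
A small sketch of the feature extraction and normalization described above. It builds only a handful of the features named on this slide (18 dimensions, not 930), and the one-hot encoding of the day of week, the function names, and the handling of constant columns are assumptions for illustration.

```python
import math
from datetime import date


def extract_features(buy, fly, currency):
    """Turn one record into a numeric feature vector (18 dimensions here)."""
    features = [float((fly - buy).days)]  # days between "Buy" and "Fly"
    for d in (buy, fly):
        one_hot = [0.0] * 7
        one_hot[d.weekday()] = 1.0  # Monday = 0 ... Sunday = 6
        features.extend(one_hot)
        features.append(1.0 if d.weekday() >= 5 else 0.0)  # weekend flag
    features.append(1.0 if currency == "CHF" else 0.0)
    return features


def standardize(rows):
    """Scale each column to zero mean and unit variance.

    Constant columns are mapped to 0.0 to avoid division by zero.
    """
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(dims)]
    return [[(r[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0
             for j in range(dims)] for r in rows]


features = extract_features(date(2010, 1, 1), date(2010, 7, 3), "CHF")
scaled = standardize([[1.0, 5.0], [3.0, 5.0]])
print(len(features), features[0], scaled)
```

In the per-route setting, standardize would be applied separately to the rows of each route r.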

Hadoop: Processing pipeline
 Shuffle the data
    (P)SGD demands random selection of
     data points
 Partition the data into n (=1200)
  chunks

 Train using PSGD:
    Parallel training on k (=40) chunks
    Average hyperplane coefficients after
     all 1200 chunks have been
     processed (= after 30 iterations).
 We can get intermediate results
  by calculating the accuracy every
  time 40 chunks have been
  processed.
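
The pipeline above can be sketched in pure Python as shown below, with the parallel step simulated sequentially. This is one possible reading of the slide and not the original implementation: it assumes that each of the k workers carries its own weight vector from round to round and that the k vectors are averaged once at the end. The names sgd_on_chunk and train_psgd, the step size 1/(lam * t), the missing offset term, and the toy data are likewise assumptions.

```python
import random


def sgd_on_chunk(w, t, chunk, lam):
    """Continue hinge-loss SGD from weights w over one chunk."""
    for x, y in chunk:
        t += 1
        eta = 1.0 / (lam * t)  # assumed step-size schedule
        margin = y * sum(wj * xj for wj, xj in zip(w, x))
        w = [(1.0 - eta * lam) * wj for wj in w]
        if margin < 1.0:
            w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w, t


def train_psgd(data, n_chunks, k, lam=0.01, seed=0):
    """Shuffle, partition into n_chunks, train k chunks per round, average."""
    rng = random.Random(seed)
    data = list(data)
    rng.shuffle(data)  # SGD needs randomly ordered data points
    chunks = [data[i::n_chunks] for i in range(n_chunks)]
    dim = len(data[0][0])
    workers = [([0.0] * dim, 0) for _ in range(k)]
    for start in range(0, n_chunks, k):  # n_chunks / k rounds
        # The k updates within a round are independent of each other,
        # so on a cluster each one could run as a separate map task.
        for j, chunk in enumerate(chunks[start:start + k]):
            w, t = workers[j]
            workers[j] = sgd_on_chunk(w, t, chunk, lam)
    # Average the hyperplane coefficients of all workers.
    return [sum(w[d] for w, _ in workers) / k for d in range(dim)]


positives = [(2.0, 1.0), (1.0, 2.0), (1.5, 1.5),
             (3.0, 0.5), (0.5, 3.0), (2.0, 2.0)]
data = [(x, 1) for x in positives] + \
       [((-x[0], -x[1]), -1) for x in positives]

w_avg = train_psgd(data, n_chunks=6, k=3)
correct = sum(
    (1 if sum(wj * xj for wj, xj in zip(w_avg, x)) >= 0 else -1) == y
    for x, y in data
)
print(w_avg, correct / len(data))
```

With the values from the slide, n_chunks would be 1200 and k would be 40, giving 30 rounds; the accuracy computed at the end here could equally be evaluated after every round to obtain intermediate results.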
Extensions done to the basic algorithms:
 Hierarchical classification:
    Train 7 classifiers in parallel
    Increases runtime by a factor of 3.
 Per airline classification:
    Train 1+21 classifiers in parallel
    Increases runtime by a factor of 2.
       General classifier
       1 – Airline A classifier (21%)
       2 – Airline B classifier (9%)
       3 – Airline C classifier (7%)
       4 – Airline D classifier (6%)
       …
       21 – “Other” airlines (15.4%)
Results: Overall accuracy

 Dataset: 10% subsample of all records (economy class)




Results: Overall accuracy

 Dataset: All records ZRH -> * (economy)




Results: Overall accuracy

 Dataset: All records ZRH -> BKK (economy)




Results: Analyzing a single airline X

 SVM classifier 0, for airline X, dataset 10% full subsample




Questions!





How to write a Business Continuity PlanDatabarracks
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 

Recently uploaded (20)

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 

14.05.12 Analysis and Prediction of Flight Prices using Historical Pricing Data with Hadoop (Jérémie Miserez, ETH Zürich)

  • 1. Analysis and Prediction of Flight Prices using historical pricing data
      1st Swiss Hadoop User Group meeting – May 14, 2012
      Jérémie Miserez - miserezj@student.ethz.ch
  • 2. Overview
      - Project setup
      - Goals
      - Exploratory data analysis (Hadoop)
      - Classification & prediction methods
      - Processing pipeline (Hadoop)
      - Results
      - This project was done as part of my Bachelor’s thesis at the Systems Group, ETH Zürich, in collaboration with Amadeus IT Group SA.
  • 3. Project setup
      - Airline tickets can be bought up to ~1 year in advance.
          - Prices change from day to day.
      - Amadeus CRS is the largest global distribution system in the travel/tourism industry:
          - sells tickets for 435 airlines (also hotels, cruises, etc.)
          - processes ~850 million billable transactions per year
      - Amadeus provided us with a dataset containing buyable tickets for each day from May 2008 to Jan 2011.
  • 4. Goals
      1. Construct and train a general classifier so that it can distinguish between expensive and cheap tickets.
      2. Use this classifier to predict the prices of future tickets.
      3. Determine which factors have the greatest impact on price by analyzing the trained classifier.
      - But first: Need to understand dataset!
  • 5. Exploratory data analysis
      - Extent of the dataset:
          - 27.2 billion records
          - 132.2 GiB (uncompressed)
          - 63 departure airports, 428 destinations, 4387 routes, 117 airlines
  • 6. Exploratory data analysis
      - The majority of activity is concentrated in Europe.
  • 7. Exploratory data analysis
      - Lots of fields:
          - “Buy” date: When was this price current?
          - “Fly” date: When does the flight leave?
          - …
          - Price & currency
          - …
          - Cabin class: Economy/Business/First (98% economy tickets)
          - Booking class: A-Z
          - …
          - Airline: The airline selling the ticket.
          - …
      - Not a time series: tickets are not linked over time.
  • 8. Exploratory data analysis
      - Visualizing small subsets helps to understand the data.
      - Many simple Hadoop jobs were used to preprocess the data; the visualizations were made in Matlab.
      - Can we see some patterns already?
  • 9. Exploratory data analysis
      - For ZRH-BKK, plot the prices of the cheapest tickets available every day:
      - [Figure: cheapest price by buy date and fly date; price scale from 600 EUR to 2400 EUR; axis marks: December, July]
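
The aggregation behind such a plot can be sketched as a simple group-by-minimum, which is the kind of work a small Hadoop job would do per (buy date, fly date) key. The record layout `(buy, fly, price)` and the sample values below are hypothetical, not taken from the dataset.

```python
def cheapest_per_day(records):
    """Return {(buy_date, fly_date): lowest price} for the records of one route."""
    best = {}
    for buy, fly, price in records:
        key = (buy, fly)
        # Keep only the minimum price seen so far for this key.
        if key not in best or price < best[key]:
            best[key] = price
    return best

# Hypothetical sample records: (buy date, fly date, price in EUR).
records = [
    ("2010-12-01", "2011-07-01", 900.0),
    ("2010-12-01", "2011-07-01", 650.0),
    ("2010-12-02", "2011-07-01", 700.0),
]
cheapest = cheapest_per_day(records)
```

Because taking a minimum is associative, the same function could serve as both combiner and reducer in a MapReduce setting.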
  • 10. Classification & Prediction methods
      - Implemented two different classifiers:
          - Support vector machine (SVM)
          - L1-regularized linear regression
      - Both are convex minimization problems that can be solved online by employing the stochastic gradient descent (SGD) method.
          - The online algorithm results in constant memory usage, independent of the size of the dataset.
          - “Stochastic”: Select order of training points at random from dataset.
      - SGD can be parallelized (parallelized SGD)* with almost no overhead, and is very suitable for use with MapReduce.
      - * M. Zinkevich, M. Weimer, A. Smola, and L. Li. “Parallelized stochastic gradient descent”, 24th Annual Conference on Neural Information Processing Systems, 2010.
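
A minimal sketch of the online property, assuming a plain subgradient step for squared loss with an L1 penalty. The step size `lr`, the penalty weight `lam`, and the toy data generator are illustrative assumptions; the slides do not give the actual update rule.

```python
import random

def sgd_l1_regression(stream, dim, lr=0.01, lam=0.001):
    """One pass of SGD for squared loss with an L1 penalty (subgradient form).

    `stream` yields (x, y) pairs one at a time, so memory use depends only
    on `dim`, not on how many points the stream contains.
    """
    w = [0.0] * dim
    for x, y in stream:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j in range(dim):
            sign = (w[j] > 0) - (w[j] < 0)      # subgradient of |w_j|
            w[j] -= lr * (err * x[j] + lam * sign)
    return w

def noisy_line(n, seed=0):
    """Hypothetical toy data: y = 3*x0 - 2*x1 plus small noise, in random order."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, 3 * x[0] - 2 * x[1] + rng.gauss(0, 0.01)

weights = sgd_l1_regression(noisy_line(20000), dim=2, lr=0.05)
```

The generator never materializes the dataset, which mirrors the constant-memory claim: only the weight vector is kept between points.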
  • 11. Classification & Prediction methods
      - SVM: binary linear classifier
          - Goal: Find maximum-margin hyperplane that divides the points with label “+1” from those with label “-1”.
      - After training:
          - Hyperplane parameters:
          - Get label for a data point as
      - Training:
          - Generate training label for i-th data point
          - Choose hyperplane parameters so the margin is maximal and the training data is still correctly classified:
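
For reference, the standard textbook form of a linear SVM matching these bullets is shown here. The symbols $\mathbf{w}$, $b$, $\mathbf{x}_i$ and $y_i$ are assumed notation and may differ from what the slides used.

```latex
% Prediction for a data point x, with labels y_i in {+1, -1}:
\hat{y}(\mathbf{x}) = \operatorname{sign}\left(\mathbf{w}^\top \mathbf{x} + b\right),
\qquad y_i \in \{+1, -1\}

% Hard-margin training problem (the margin width is 2 / ||w||):
\min_{\mathbf{w},\, b} \ \tfrac{1}{2} \lVert \mathbf{w} \rVert^2
\quad \text{subject to} \quad
y_i \left(\mathbf{w}^\top \mathbf{x}_i + b\right) \ge 1 \quad \text{for all } i
```

Minimizing $\lVert \mathbf{w} \rVert$ is equivalent to maximizing the margin, since the distance between the two supporting hyperplanes is $2 / \lVert \mathbf{w} \rVert$.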
  • 12. Classification & Prediction methods
      - Implementation uses:
          - Hinge loss function:
              - Takes into account “outliers”.
          - Regularization parameter
              - Bounds the length of the weight vector, i.e. a large value increases generalization.
          - Preprocess data for zero mean, unit variance
      - For training points: Margin: , with lower bound:
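
A minimal sketch of hinge-loss SGD with regularization. The slides do not state the update rule, so this assumes a Pegasos-style step (step size `1/(lam*t)` and projection onto the ball of radius `1/sqrt(lam)`); `lam`, `epochs` and the toy data are illustrative choices, and the bias term is omitted for brevity.

```python
import math
import random

def hinge_loss(w, x, y):
    """Hinge loss max(0, 1 - y * <w, x>) for a label y in {+1, -1}."""
    return max(0.0, 1.0 - y * sum(wi * xi for wi, xi in zip(w, x)))

def train_svm_sgd(points, dim, lam=0.01, epochs=20, seed=0):
    """SGD on the L2-regularized hinge loss (assumed Pegasos-style update)."""
    rng = random.Random(seed)
    limit = 1.0 / math.sqrt(lam)                 # bound on the length of w
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        order = list(range(len(points)))
        rng.shuffle(order)                       # visit points in random order
        for i in order:
            x, y = points[i]
            t += 1
            eta = 1.0 / (lam * t)
            violated = hinge_loss(w, x, y) > 0.0
            w = [(1.0 - eta * lam) * wi for wi in w]       # regularization shrink
            if violated:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
            norm = math.sqrt(sum(wi * wi for wi in w))
            if norm > limit:                     # project back into the ball
                w = [wi * limit / norm for wi in w]
    return w

# Hypothetical toy data: label is the sign of x0 + x1, with a small gap.
rng = random.Random(1)
toy = []
while len(toy) < 300:
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    if abs(x[0] + x[1]) > 0.2:
        toy.append((x, 1 if x[0] + x[1] > 0 else -1))

w_svm = train_svm_sgd(toy, dim=2)
```

In this variant a larger `lam` shrinks the allowed length of `w`, which is one way a regularization parameter can bound the weight vector.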
  • 13. Hadoop: Preprocessing
      - Generate training labels (y) from dataset:
          - Convert currencies using historical exchange rates.
          - For each route r, calculate the arithmetic mean (and standard deviation) of the price over all tickets.
          - Assign labels:
              - Label +: “Above mean price for this route”
              - Label -: “Below mean price for this route”
          - Only store mean/std-dev, do not actually store labels in the HDFS.
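
The labeling step can be sketched as follows, assuming prices have already been converted to a common currency. The `(route, price)` record layout and the sample values are hypothetical, and the handling of a price exactly equal to the mean (labeled -1 here) is an arbitrary choice that the slides do not specify.

```python
import math
from collections import defaultdict

def route_stats(tickets):
    """Return {route: (mean, std_dev)} of the price over all tickets per route."""
    sums = defaultdict(lambda: [0, 0.0, 0.0])    # count, sum, sum of squares
    for route, price in tickets:
        s = sums[route]
        s[0] += 1
        s[1] += price
        s[2] += price * price
    stats = {}
    for route, (n, total, total_sq) in sums.items():
        mean = total / n
        var = max(total_sq / n - mean * mean, 0.0)   # guard against rounding
        stats[route] = (mean, math.sqrt(var))
    return stats

def label(route, price, stats):
    """+1 if the price is above the route's mean price, otherwise -1."""
    return 1 if price > stats[route][0] else -1

# Hypothetical sample tickets: (route, price in a common currency).
tickets = [("ZRH-BKK", 600.0), ("ZRH-BKK", 2400.0),
           ("ZRH-LHR", 100.0), ("ZRH-LHR", 300.0)]
stats = route_stats(tickets)
```

Since a label follows directly from the stored mean, it can be recomputed on the fly during training, which is consistent with storing only mean and standard deviation.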
  • 14. Hadoop: Preprocessing
      - Extract features from plaintext records (x).
          - Each plaintext record is transformed into a 930-dimensional vector.
          - Each dimension contains a numerical value corresponding to a feature such as:
              - Number of days between “Buy” and “Fly” dates
              - Day of week (for all dates)
              - Is the day on a weekend? (for all dates)
              - Is the currency CHF?
              - etc.
          - Each dimension is normalized to zero mean and unit variance (per route r).
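
A minimal sketch of the feature extraction and normalization for a handful of the features listed above. The function names, the six-element layout and the plain numeric encoding of the weekday are assumptions for illustration; the real pipeline produces 930 dimensions whose exact encoding is not given in the slides.

```python
from datetime import date

def extract_features(buy, fly, currency):
    """Map one record to a few of the numerical features named above."""
    return [
        float((fly - buy).days),               # days between "Buy" and "Fly"
        float(buy.weekday()),                  # day of week, Monday = 0
        float(fly.weekday()),
        1.0 if buy.weekday() >= 5 else 0.0,    # buy date on a weekend?
        1.0 if fly.weekday() >= 5 else 0.0,    # fly date on a weekend?
        1.0 if currency == "CHF" else 0.0,     # is the currency CHF?
    ]

def standardize(rows):
    """Scale each column to zero mean and unit variance (constant columns become 0)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [(sum((v - m) ** 2 for v in c) / n) ** 0.5 for c, m in zip(cols, means)]
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]

features = extract_features(date(2010, 12, 1), date(2011, 1, 1), "CHF")
scaled = standardize([[1.0, 5.0], [3.0, 5.0]])
```

To normalize per route, `standardize` would be applied separately to the rows of each route.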
  • 15. Hadoop: Processing pipeline
      - Shuffle the data
          - (P)SGD demands random selection of data points
      - Partition the data into n (=1200) chunks
      - Train using PSGD:
          - Parallel training on k (=40) chunks
          - Average hyperplane coefficients after all 1200 chunks have been processed (= after 30 iterations).
          - We can get intermediate results by calculating the accuracy every time 40 chunks have been processed.
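
The pipeline above can be sketched in a single process as follows. How weights are carried between iterations is not stated in the slides; this sketch assumes `k` independent workers that each process one chunk per iteration and are averaged only for reporting. The learning rate, the unregularized hinge update and the toy data are likewise assumptions, and the loop over workers stands in for parallel map tasks.

```python
import random

def sgd_chunk(w, chunk, lr=0.05):
    """Run hinge-loss SGD over one chunk, starting from weights w."""
    w = list(w)
    for x, y in chunk:
        if y * sum(wi * xi for wi, xi in zip(w, x)) < 1.0:   # margin violated
            w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w

def average(vectors):
    """Component-wise mean of equally long weight vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def psgd(points, dim, n_chunks, k, seed=0):
    """Shuffle, partition into n_chunks, let k workers train one chunk per iteration."""
    rng = random.Random(seed)
    points = list(points)
    rng.shuffle(points)
    chunks = [points[i::n_chunks] for i in range(n_chunks)]
    workers = [[0.0] * dim for _ in range(k)]
    snapshots = []                                # averaged weights per iteration
    for start in range(0, n_chunks, k):
        for j, chunk in enumerate(chunks[start:start + k]):
            workers[j] = sgd_chunk(workers[j], chunk)   # independent of other workers
        snapshots.append(average(workers))
    return snapshots[-1], snapshots

# Hypothetical toy data: label is the sign of x0 + x1, with a small gap.
rng = random.Random(1)
data = []
while len(data) < 600:
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    if abs(x[0] + x[1]) > 0.2:
        data.append((x, 1 if x[0] + x[1] > 0 else -1))

w, snapshots = psgd(data, dim=2, n_chunks=12, k=4)
```

With `n_chunks=1200` and `k=40` the loop would run 30 times, matching the 30 iterations on the slide; each entry of `snapshots` corresponds to one point where an intermediate accuracy could be computed.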
  • 16. Extensions done to the basic algorithms:
      - Hierarchical classification:
          - Train 7 classifiers in parallel
          - Increases runtime by a factor of 3.
      - Per airline classification:
          - Train 1+21 classifiers in parallel
          - Increases runtime by a factor of 2.
          - Classifiers:
              - General classifier
              - 1 – Airline A classifier (21%)
              - 2 – Airline B classifier (9%)
              - 3 – Airline C classifier (7%)
              - 4 – Airline D classifier (6%)
              - …
              - 21 – “Other” airlines (15.4%)
  • 17. Results: Overall accuracy
      - Dataset: 10% subsample of all records (class economy)
  • 18. Results: Overall accuracy
      - Dataset: All records ZRH -> * (economy)
  • 19. Results: Overall accuracy
      - Dataset: All records ZRH -> BKK (economy)
  • 20. Results: Analyzing a single airline X
      - SVM classifier 0, for airline X, dataset 10% full subsample
  • 21. Results: Analyzing a single airline X
      - SVM classifier 0, for airline X, dataset 10% full subsample