Learning Cheap and Novel Flight Itineraries

We consider the problem of efficiently constructing cheap and novel round-trip flight itineraries by combining legs from different airlines. We analyse the factors that contribute to the price of such itineraries and find that many result from combinations of just 30% of airlines, and that the closer the departure of such an itinerary is to the user's search date, the more likely it is to be cheaper than tickets from a single airline. We use these insights to formulate the problem as a trade-off between the recall of cheap itinerary constructions and the costs associated with building them. We propose a supervised learning solution with location embeddings which achieves an AUC=80.48, a substantial improvement over simpler baselines. We discuss various practical considerations for dealing with the staleness and the stability of the model and present the design of the machine learning pipeline. Finally, we present an analysis of the model's performance in production and its impact on Skyscanner's users.

  1. Learning Cheap and Novel Flight Itineraries. Dima Karamshuk, David Matthews, Skyscanner. Applied Data Science Track
  2. Planning a Trip. How much time do you spend choosing a flight?
  3. Planning a Trip. European travellers spend 3.5h on average to find a perfect flight, often longer than the flight itself. https://goo.gl/74CivT
  4. Planning a Trip. How many of you choose an airline by price?
  5. Planning a Trip. 37% of users choose airlines by competitive price; more want to see the cheapest price for comparison. https://goo.gl/8UX3vx
  6. Combination Itineraries. Potentially cheaper itineraries in over half of all search results. (Diagram labels: Dublin, Barcelona, New York; Virgin, Delta, Ryanair, Aer Lingus, American.)
  7. Competitive Combinations. Tips for booking your next flight: – good for last-minute booking – average savings of 9% on a return ticket – 90% of competitive combinations are from the top 30% of airlines – good deals when flying from the US, UK, Spain, Germany, Italy and other origins
  8. Problem. (Diagram labels: Traveler, Online Travel Agencies, Global Distribution Systems, Airlines.) • Combinations require more queries to ticket providers • Most of the variants are not competitive. Solution: only choose combinations which are likely to be competitive
  9. Supervised Learning. Classify whether, for a query Q, a combination of partners (X and Y) is going to be in the top-10 search results. Metrics: • Coverage: how many of all possible cheap itineraries we recall • Cost: how many queries for flight quotes are required. Dataset: • sample all possible combinations for a share of searches • collect examples of competitive and non-competitive combinations
  10. Supervised Learning. Use your favorite classifier. (Plot of coverage against cost, showing a perfect predictor and a heuristic-based baseline.)
  11. Supervised Learning. Tree ensembles (Random Forest) achieve good performance. (Plot of coverage against cost, with 5% and 45% marked as "In practice".)
  12. Feature Engineering. Can we improve performance with smart feature engineering? One-hot encoding: London Gatwick [1 0 0 … 0], London Stansted [0 1 0 … 0], Barcelona [0 0 1 … 0]. Better encoding (dimensions labelled London, European, Trans-Atlantic): London Gatwick [1.0 0.9 0.9 ...], London Stansted [1.0 0.9 0.1 …], Barcelona [0.0 1.0 0.5 …]
  13. Location Embeddings. Perozzi et al., KDD, 2014. • Option N1: every user's history is a sentence (think of Word2Vec), e.g. [London, Barcelona, Frankfurt am Main, New York, ….], where each location is a word • Option N2: learn embeddings on graphs of locations • Option N3: train embeddings for the target problem (inputs: origin, destination, …; output: competitive or not)
  14. Location Embeddings • Capture geographical proximity (Europe vs. Asia) • Learn the function of the airport (Heathrow and Gatwick vs. Stansted) • Produce a slight improvement in prediction performance
  15. Model Staleness. The model's performance goes stale over time, so it needs to be updated regularly. (Plot comparing every-day re-training with one-off training.)
  16. Production Pipeline. (Diagram labels: Training Component (AWS CF + AWS Data Pipeline); Data Collection; Data Archive, AWS S3; Data Querying, AWS Athena; Pre-processing; Training Data, 7 recent days; Validation Data, 5% of the last day; Model Training, scikit-learn; Model Validation; Passed?; Update Model; Report Failure; Apache Kafka; Serving Component; Skyscanner Traffic, split 5% / 5% / 90%; Current Model; Experiments with Challenger Model.)
  17. Production Pipeline (same diagram as slide 16) • re-train the model every day against model drift • run on a single large machine vs. a distributed cluster
  18. Production Pipeline (same diagram as slide 16) • sample all possible combinations on 5% of users' traffic
  19. Production Pipeline (same diagram as slide 16) • update the model if it passes the tests and serve it to 90% of the users • leave 5% for A/B experiments with better models
  20. Model Stability. (Origin, Destination, Provider) rules • combine the models trained on consecutive days to control for stability
  21. Lessons From the Trenches – bootstrapping ML projects requires 20% modeling and 80% engineering; in the long run it should be vice versa – many interesting ML problems arise in production (e.g., model staleness and stability) – simple solutions are often good enough
  22. Join our Team! @SkyscannerEng
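
Slide 9 frames the task as a trade-off between coverage and cost. The sketch below shows one way these two metrics could be computed for a scored set of candidate combinations. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy scores, and the choice to express cost as the fraction of candidate combinations that get queried are all hypothetical.

```python
from typing import List, Sequence, Tuple


def cost_and_coverage(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float]:
    """Return (cost, coverage) when only candidates scoring >= threshold are queried.

    cost     = share of all candidate combinations that are queried (assumed definition)
    coverage = share of truly competitive combinations that are recalled
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        return 0.0, 0.0

    selected = [s >= threshold for s in scores]
    queried = sum(selected)
    competitive = sum(labels)
    recalled = sum(1 for sel, y in zip(selected, labels) if sel and y == 1)

    cost = queried / len(scores)
    coverage = recalled / competitive if competitive else 0.0
    return cost, coverage


def cost_coverage_curve(
    scores: Sequence[float], labels: Sequence[int], thresholds: Sequence[float]
) -> List[Tuple[float, float]]:
    """Sweep thresholds to trace out points on a cost-coverage curve."""
    return [cost_and_coverage(scores, labels, t) for t in thresholds]


# Toy data: classifier scores and whether each combination was competitive (1) or not (0).
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0, 0]

curve = cost_coverage_curve(scores, labels, thresholds=[0.85, 0.35, 0.0])
```

Lowering the threshold queries more combinations (higher cost) and recalls more of the competitive ones (higher coverage), which is the trade-off plotted on slides 10 and 11.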
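
Slide 12 contrasts one-hot encoding with a denser encoding in which similar airports get similar vectors. A minimal sketch of why that matters, using cosine similarity: the vectors below use only the first three components shown on the slide (the remaining components are elided there and are not reconstructed here), and the helper name `cosine_similarity` is hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# One-hot encoding: every pair of distinct airports is equally dissimilar.
one_hot = {
    "London Gatwick": [1.0, 0.0, 0.0],
    "London Stansted": [0.0, 1.0, 0.0],
    "Barcelona": [0.0, 0.0, 1.0],
}

# Denser encoding, truncated to the three components visible on the slide.
dense = {
    "London Gatwick": [1.0, 0.9, 0.9],
    "London Stansted": [1.0, 0.9, 0.1],
    "Barcelona": [0.0, 1.0, 0.5],
}

one_hot_sim = cosine_similarity(one_hot["London Gatwick"], one_hot["London Stansted"])
gatwick_stansted = cosine_similarity(dense["London Gatwick"], dense["London Stansted"])
gatwick_barcelona = cosine_similarity(dense["London Gatwick"], dense["Barcelona"])
```

With one-hot vectors the two London airports look unrelated, while in the denser encoding Gatwick is closer to Stansted than to Barcelona, so a classifier can share what it learns across similar locations.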
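
Slides 18 and 19 describe splitting traffic 5% / 5% / 90% between sampling all combinations, experiments with a challenger model, and the current model. The slides do not say how users are assigned to each share; the sketch below assumes a deterministic hash of a user identifier, which is one common approach. The arm names and the functions `percentile_for`, `arm_for_percentile` and `arm_for_user` are hypothetical.

```python
import hashlib

SAMPLE_ALL = "sample_all_combinations"  # 5%: collect training data on all combinations
CHALLENGER = "challenger_model"         # 5%: A/B experiments with candidate models
CURRENT = "current_model"               # 90%: served by the validated model


def percentile_for(user_id: str) -> int:
    """Map a user id to a stable integer in [0, 100) using SHA-256."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def arm_for_percentile(percentile: int) -> str:
    """Translate a percentile in [0, 100) into one of the three traffic arms."""
    if not 0 <= percentile < 100:
        raise ValueError("percentile must be in [0, 100)")
    if percentile < 5:
        return SAMPLE_ALL
    if percentile < 10:
        return CHALLENGER
    return CURRENT


def arm_for_user(user_id: str) -> str:
    """Assign a user to an arm; the same id always lands in the same arm."""
    return arm_for_percentile(percentile_for(user_id))


# Rough check of the split on synthetic user ids.
counts = {SAMPLE_ALL: 0, CHALLENGER: 0, CURRENT: 0}
for i in range(20000):
    counts[arm_for_user(f"user-{i}")] += 1
```

Hashing keeps the assignment stable across requests without storing per-user state, so a user does not flip between the current and challenger models mid-session.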
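
Slide 20 suggests combining the models trained on consecutive days to control for stability, without specifying how they are combined. The sketch below assumes the simplest option, averaging the predicted scores of the most recent daily models, so that one unusual training day moves the final score only partially. The type aliases, the stand-in daily models and the function `averaged_score` are hypothetical.

```python
from typing import Callable, Dict, Sequence

Features = Dict[str, str]
Model = Callable[[Features], float]  # returns a score in [0, 1]


def averaged_score(models: Sequence[Model], features: Features) -> float:
    """Average the scores of several daily models for one candidate combination."""
    if not models:
        raise ValueError("at least one model is required")
    return sum(model(features) for model in models) / len(models)


# Stand-ins for models trained on three consecutive days (constant scores for illustration).
def model_day_1(features: Features) -> float:
    return 0.25


def model_day_2(features: Features) -> float:
    return 0.5


def model_day_3(features: Features) -> float:
    return 0.75


candidate = {"origin": "Dublin", "destination": "Barcelona", "provider": "Ryanair"}
combined = averaged_score([model_day_1, model_day_2, model_day_3], candidate)
latest_only = averaged_score([model_day_3], candidate)
```

Here the latest model alone scores the candidate at 0.75, while the three-day average gives 0.5, which illustrates how averaging dampens day-to-day swings in the resulting (Origin, Destination, Provider) decisions.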
