SlideShare una empresa de Scribd logo
1 de 47
Descargar para leer sin conexión
Using Hadoop and Hive to Optimize Travel Search
       Jonathan Seidman and Ramesh Venkataramaiah
Contributors


•  Robert Lancaster, Orbitz Worldwide
•  Wai Gen Yee, Orbitz Worldwide
•  Andrew Yates, Intern - Orbitz Worldwide




                                             page 2
Agenda


   •  Orbitz Worldwide
   •  Hadoop for Big Data
   •  Hive for Queries
   •  Web Analytics data as input
   •  Applications of Hadoop/Hive at Orbitz:
          –  Hotel Sort
          –  Data Cubes
   •  Sample analysis and data trends




Orbitz!       Hadoop!     Hive!   Input!   Applications!   Analysis!   page 3
Launched: 2001, Chicago, IL




                              page 4
Data Challenges

   •  Orbitz.com generates ~1.5 million air searches and ~1 million
      hotel searches every day.
   •  All of this activity generates massive amounts of data – over
      500 GB/day of log data, and even this doesn’t capture all of the
      data we want.
   •  Expensive and difficult to use existing data infrastructure for
      storing and processing this data.
   •  Need an infrastructure that provides:
          –  Long term storage of very large data sets.
          –  Open access to developers and analysts.
          –  Allows for ad-hoc querying of data and rapid deployment of
             reporting applications.


Orbitz!      Hadoop!     Hive!    Input!   Applications!   Analysis!      page 5
Hadoop Overview

   •  Open source framework providing reliable and scalable storage
      and processing of data on inexpensive commodity hardware.
   •  Two primary components: The Hadoop distributed file system
      and MapReduce.




Orbitz!   Hadoop!   Hive!    Input!   Applications!   Analysis!       page 6
Hadoop Overview – Hadoop Distributed File System

   •  HDFS provides reliable, fault tolerant and scalable storage of
      very large datasets across machines in a cluster.




Orbitz!   Hadoop!    Hive!     Input!   Applications!   Analysis!      page 7
Hadoop Overview – MapReduce

   •  Programming model for efficient distributed processing.
      Designed to reliably perform computations on large volumes of
      data in parallel.
   •  Removes much of the burden of writing distributed
      computations.




Orbitz!   Hadoop!    Hive!   Input!   Applications!   Analysis!       page 8
The Problem with MapReduce


   •  Requires experienced developers to write MapReduce jobs
      which can be difficult to maintain and re-use.




Orbitz!   Hadoop!   Hive!   Input!   Applications!   Analysis!   page 9
Hive Overview

  •  Hive is an open-source data warehousing solution built on top
     of Hadoop which allows for easy data summarization, adhoc
     querying and analysis of large datasets stored in Hadoop.
  •  Developed at Facebook to provide a structured data model
     over Hadoop data.
  •  Simplifies Hadoop data analysis – users can use a familiar
     SQL model rather than writing low level custom code.
  •  Hive queries are compiled into Hadoop MapReduce jobs.
  •  Designed for scalability, not low latency.




Orbitz!   Hadoop!    Hive!      Input!   Applications!   Analysis!   page 10
Hive Overview – Comparison to Traditional DBMS Systems


  •  Although Hive uses a model familiar to database users, it does
     not support a full relational model and only supports a subset of
     SQL.
  •  What Hadoop/Hive offers is highly scalable and fault-tolerant
     processing of very large data sets.




Orbitz!   Hadoop!    Hive!     Input!   Applications!   Analysis!        page 11
Hive - Data Model

  •  Databases – provide namespace for Hive objects, prevent
     naming conflicts.
  •  Tables – analogous to tables in a standard RDBMS.
  •  Partitions and buckets – Allow Hive to prune data during query
     processing.




Orbitz!   Hadoop!   Hive!     Input!   Applications!   Analysis!      page 12
Hive – Data Types

  •  Supports primitive types such as int, double, and string.
  •  Also supports complex types such as structs, maps (key/value
     tuples), and arrays (indexable lists).




Orbitz!   Hadoop!    Hive!     Input!   Applications!   Analysis!   page 13
Hive – Hive Query Language

  •  HiveQL – Supports basic SQL-like operations such as select,
     join, aggregate, union, sub-queries, etc.
  •  HiveQL queries are compiled into MapReduce processes.
  •  Supports embedding custom MapReduce scripts.
  •  Built in support for standard relational, arithmetic, and boolean
     operators.




Orbitz!   Hadoop!    Hive!     Input!   Applications!   Analysis!        page 14
Hive MapReduce

  •  Allows analysis not possible through standard HiveQL queries.
  •  Can be implemented in any language.




Orbitz!   Hadoop!   Hive!    Input!   Applications!   Analysis!      page 15
Hive – User Defined Functions

  •  HiveQL is extensible through user defined functions
     implemented in Java.
  •  Also supports aggregation functions (sum, avg).
  •  Provides table functions when more than one value needs to
     be returned.




Orbitz!   Hadoop!   Hive!     Input!   Applications!   Analysis!   page 16
Hive – User Defined Functions
  Example UDF – Find hotel’s position in an impression list:

  package com.orbitz.hive;!
  import org.apache.hadoop.hive.ql.exec.UDF;!
  import org.apache.hadoop.io.Text;!


  /**!
   * returns hotel_id's position given a hotel_id and impression list!
   */!
  public final class GetPos extends UDF {!
         public Text evaluate(final Text hotel_id, final Text impressions) {!
              if (hotel_id == null || impressions == null)!
                  return null;!


              String[] hotels = impressions.toString().split(";");!
              String position;!
              String id = hotel_id.toString();!
              int begin=0, end=0;!


              for (int i=0; i<hotels.length; i++) {!
                   begin = hotels[i].indexOf(",");!
                   end = hotels[i].lastIndexOf(",");!
                   position = hotels[i].substring(begin+1,end);!
                   if (id.equals(hotels[i].substring(0,begin)))!
                       return new Text(position);!
              }!
              return null;!
         }!
  }!



Orbitz!            Hadoop!               Hive!            Input!      Applications!   Analysis!   page 17
Hive – User Defined Functions

  hive> add jar path-to-jar/pos.jar; !
  hive> create temporary function getpos as
   'com.orbitz.hive.GetPos';!
  hive> select getpos(‘1’,
   ‘1,3,100.00;2,1,100.00’);!
  …!
  hive> 3 !




Orbitz!   Hadoop!   Hive!   Input!   Applications!   Analysis!   page 18
Hive – Client Access

  •  Hive Command Line Interface (CLI)
  •  Hive Web UI
  •  JDBC, ODBC, Thrift




Orbitz!   Hadoop!   Hive!   Input!   Applications!   Analysis!   page 19
Hive – Lessons Learned


  •  Job scheduling – Default Hadoop scheduling is FIFO. Consider
     using something like the fair scheduler.
  •  Multi-user Hive – Default install is single user. Multi-user
     installs require Derby network server.




Orbitz!   Hadoop!     Hive!     Input!   Applications!   Analysis!   page 20
Data Infrastructure




                      page 21
Input Data – Webtrends

   •  Web analytics software providing information about user
      behavior.
   •  Raw Webtrends log files are used as input to much of our
      processing. Logs have the following format:
          –  date time c-ip cs-username cs-host cs-method cs-
             uri-stem cs-uri-query sc-status sc-bytes cs-
             version cs(User-Agent) cs(Cookie) cs(Referer)
             dcs-geo dcs-dns origin-id dcs-id!




Orbitz!      Hadoop!   Hive!   Input!   Applications!   Analysis!   page 22
Input Data – Webtrends Cont’d

   •  Storing raw data in HDFS provides access to data not available
      elsewhere, for example “hotel impression” data:
          –  115004,1,70.00;35217,2,129.00;239756,3,99.00;83389,4,99.00!




Orbitz!      Hadoop!      Hive!     Input!   Applications!   Analysis!     page 23
page 24
Improve Hotel Sort

   •  Extract data from raw Webtrends logs for input to a trained
      classification process.
   •  Logs provide input to MapReduce processing which extracts
      required fields.
   •  Previous process used a series of Perl and Bash scripts to
      extract data serially.
   •  Comparison of performance
          –  Months worth of data
          –  Manual process took 109m14s
          –  MapReduce process took 25m58s




Orbitz!      Hadoop!     Hive!      Input!   Applications!   Analysis!   page 25
Improve Hotel Sort – Components




Orbitz!   Hadoop!   Hive!   Input!   Applications!   Analysis!   page 26
Improve Hotel Sort – Processing Flow




Orbitz!   Hadoop!   Hive!   Input!   Applications!   Analysis!   page 27
Webtrends Analysis in Hive

   •  Extract data is loaded into two Hive tables:
   DROP TABLE wt_extract;
   CREATE TABLE wt_extract(
      session_id STRING, visitor_tracking_id STRING, host STRING, visitors_ip STRING, booking_date STRING, booking_time
      STRING, dept_date STRING, ret_date STRING, booked_hotel_id STRING, sort_type STRING, destination STRING, location_id
      STRING, number_of_guests INT, number_of_rooms INT, page_number INT, matrix_interaction STRING, impressions STRING,
      areacode STRING, city STRING, region_code STRING, country STRING, country_code STRING, continent STRING, company STRING,
      tzone STRING)
   CLUSTERED BY(booked_hotel_id) INTO 256 BUCKETS
   ROW FORMAT DELIMITED
   FIELDS TERMINATED BY 't’
   STORED AS TEXTFILE;


   load data inpath ’/extract-output/part-00000' into table wt_extract;


   DROP TABLE hotel_impressions;
   CREATE TABLE hotel_impressions(
      session_id STRING, hotel_id STRING, position INT, rate FLOAT )
   CLUSTERED BY(hotel_id) INTO 256 BUCKETS
   ROW FORMAT DELIMITED
   FIELDS TERMINATED BY 't' STORED AS TEXTFILE;


   load data inpath ’/impressions-output/part-00000' into table hotel_extract;




Orbitz!           Hadoop!                    Hive!                 Input!        Applications!   Analysis!                       page 28
Webtrends Analysis in Hive Cont’d

   •  Allows us to easily derive metrics not previously possible.
   •  Example - Find the Position of Each Booked Hotel in Search
      Results:
          CREATE TABLE positions(!
             session_id STRING,!
             booked_hotel_id STRING,!
             position INT);!
          set mapred.reduce.tasks = 17;!
          INSERT OVERWRITE TABLE

            positions!
          SELECT

            e.session_id, e.booked_hotel_id, i.position!
          FROM

            hotel_impressions i JOIN wt_extract e!
          ON

                (e.booked_hotel_id = i.hotel_id and e.session_id = i.session_id);!




Orbitz!          Hadoop!        Hive!      Input!    Applications!   Analysis!       page 29
Webtrends Analysis in Hive Cont’d

   •  Example - Aggregate Booking Position by Location by Day:
          CREATE TABLE position_aggregate_by_day(!
            location_id STRING,!
            booking_date STRING,!
            position INT,!
            pcount INT);!


          INSERT OVERWRITE TABLE!
            position_aggregate_by_day!
          SELECT!
            e.location_id, e.booking_date, i.position, count(1)!
          FROM!
            wt_extract e JOIN hotel_impressions i!
          ON!
           (i.hotel_id = e.booked_hotel_id and i.session_id = e.session_id)!
          GROUP BY!
            e.location_id,e.booking_date,i.position!




Orbitz!        Hadoop!          Hive!         Input!     Applications!   Analysis!   page 30
Hotel Data Cube

   •  Goal is to provide users with data not available in existing hotel cube.
   •  Problem is lack of Hive support in existing visualization tools.




Orbitz!   Hadoop!       Hive!      Input!   Applications!   Analysis!            page 31
Hotel Data Cube – Components




Orbitz!   Hadoop!   Hive!   Input!   Applications!   Analysis!   page 32
Statistical Analysis of Hive Data

   Explore statistical trends and long-tails…
            to help machine learning algorithms…
                      by choosing well understood input datasets.


   •  What approximations and biases exist?
   •  How is the Data distributed?
   •  Are 25+ variables pair-wise correlated?
   •  Are there built-in data bias? Any Lurking variables?
   •  Are there outlier effects on the distribution?
   •  Which segment should be used for ML training?




Orbitz!   Hadoop!       Hive!        Input!   Applications!   Analysis!   page 33
Statistical Analysis – Components




Orbitz!   Hadoop!   Hive!    Input!    Applications!   Analysis!   page 34
Statistical Analysis: Infrastructure and Dataset

   •  Hive + R platform for query processing and statistical analysis.
   •  R - Open-source stat package with visualization.
   •  R - Steep learning curve but worth it!
   •  R – Vibrant community support.
   •  Hive Dataset used:
          –  Customer hotel booking on our sites over 6 months.
          –  Derived from web analytics extract files from Hive.
          –  User rating of hotels.
          –  Captured major holiday travel bookings but not summer
             peak.
   •  Costs of cleaning and processing data is non-trivial.


Orbitz!      Hadoop!      Hive!       Input!   Applications!   Analysis!   page 35
Some R code…
HIVE> Insert overwrite local directory ‘/home/user/…/HiveExtractTable’!
Select * from . . .!
----------------------------------------------------------------------------------------------------------------!
R> WTData <- data.frame(scan("/user/.../HiveExtractTable", what = list(0,"",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"",
   0,"",""),na.strings="NA"))!
R> names(CHI) <- c("locationID”,"hotelID","rate","distance","amenities","checkin", "cleanliness", "comfort",
   "dining", "location”, "bookposition", "guests", "competitiveness”, "departuredate", "returndate", "stay”)!
CHI <- WTData[which(WTData[,1] == '7840'|'35416'|'37961'|'38422'|'37795'| '37769'|'37730'|'33728'|'38123'),]!
R> summary(CHI)!
----------------------------------------------------------------------------------------------------------------!
R> library(corrgram)!
R> corrgram(CHI,order=NULL,lower.panel=panel.ellipse)!
----------------------------------------------------------------------------------------------------------------!
R> pdf(file="/user/.../himp.pdf") !
R> par(mfrow=c(2,1))!
R> bin1 <-hexbin(CHI[,15],CHI[,4], xbins=25)!
R> hsmooth(bin1,c(48,24,0))!
R> plot(bin1,xlab="Position", ylab="Price($)", main="Booked Hotel Impressions", colramp=function(n){plinrain
   (n,beg=35,end=225)})!
R> bin2 <-hexbin(CHI[,15], CHI[,22], xbins=25)!
----------------------------------------------------------------------------------------------------------------!
R> png(”/user/. . ./HM_agg_CHI.png",width=480,height=480,units="px",pointsize=12)!
R> CHI <- aggdata[which(aggdata$position <=10),]!
R> par(mfrow=c(2,2))!
R> calendarHeat(CHI$Time,CHI$rate, varname="Aggregated Check-in Date vs. Avg. Daily Rate”)!



    Orbitz!        Hadoop!        Hive!           Input!   Applications!   Analysis!                    page 36
Statistical Analysis - Positional Bias



•  Lurking variable is…
   Positional Bias.
•  Top positions invariably
   picked the most.
•  Aim to position Best Ranked
   Hotels at the top based on
   customer search criteria and
   user ratings.
•  If website originated data,
   watch for inherent hidden
   bias.


   Orbitz!   Hadoop!      Hive!   Input!   Applications!   Analysis!   page 37
Statistical Analysis - Kernel Density



•  User Ratings of Hotels
•  Strongly affected by the number
   of bins used.
•  Kernel density plots are usually
   a much more effective way to
   overcome the limitations of
   histograms.




   Orbitz!   Hadoop!     Hive!        Input!   Applications!   Analysis!   page 38
Statistical Analysis - Normality tests

•  How normal is our data?
•  Plots the quantiles of the data set
   against the theoretical quantiles.
•  If the distributions are the same,
   then the plot will be
   approximately a straight line.
•  An "S" shape implies that one
   distribution has longer tails than
   the other.
•  Avoid tendency to create
   stories out of noise.




    Orbitz!    Hadoop!       Hive!       Input!   Applications!   Analysis!   page 39
Statistical Analysis - Macro level regression




Orbitz!   Hadoop!   Hive!   Input!   Applications!   Analysis!   page 40
Statistical Analysis - Exploratory correlation




Orbitz!   Hadoop!   Hive!   Input!   Applications!   Analysis!   page 41
Statistical Analysis - Visual Analytics
•  Show daily average rate based on booked hotels.
•  Show seasonal dip in hotel rates.
•  Outliers removed.
•  “Median is not the message”; Find patterns first.




  Orbitz!   Hadoop!    Hive!     Input!   Applications!   Analysis!   page 42
Statistical Analysis - More seasonal variations
•  Customer hotel stay gets longer during summer months
•  Could help in designing search based on seasons.
•  Outliers removed.
•  Understand data boundaries.




  Orbitz!   Hadoop!    Hive!   Input!   Applications!   Analysis!   page 43
Statistical Analysis - Linear Filtering


•  Decompose into a trend, a
   seasonal component and
   remainder.
•  Moving average linear filters
   with equal weights.




•  Filters Coefficients:



•  Let macro patterns guide
   further modeling.
   Orbitz!   Hadoop!       Hive!   Input!   Applications!   Analysis!   page 44
References


•  Hadoop project: http://hadoop.apache.org/
•  Hive project: http://hadoop.apache.org/hive/
•  Hive – A Petabyte Scale Data Warehouse Using Hadoop:
   http://i.stanford.edu/~ragho/hive-icde2010.pdf
•  Hadoop The Definitive Guide, Tom White, O’Reilly Press, 2009
•  Why Model, J. Epstein, 2008
•  Beautiful Data, T. Segaran & J. Hammerbacher, 2009




                                                                  page 45
Contact


•  Jonathan Seidman:
  –  jseidman@orbitz.com
  –  @jseidman
•  Ramesh Venkataramaiah:
  –  rvenkataramaiah@orbitz.com




                                  page 46
Questions?




             page 47

Más contenido relacionado

La actualidad más candente

Getting Started with Amazon QuickSight
Getting Started with Amazon QuickSightGetting Started with Amazon QuickSight
Getting Started with Amazon QuickSightAmazon Web Services
 
Deep Dive on Amazon Athena - AWS Online Tech Talks
Deep Dive on Amazon Athena - AWS Online Tech TalksDeep Dive on Amazon Athena - AWS Online Tech Talks
Deep Dive on Amazon Athena - AWS Online Tech TalksAmazon Web Services
 
RedisConf17 - Lyft - Geospatial at Scale - Daniel Hochman
RedisConf17 - Lyft - Geospatial at Scale - Daniel HochmanRedisConf17 - Lyft - Geospatial at Scale - Daniel Hochman
RedisConf17 - Lyft - Geospatial at Scale - Daniel HochmanRedis Labs
 
seminar presentation on apache-spark
seminar presentation on apache-sparkseminar presentation on apache-spark
seminar presentation on apache-sparkJawhar Ali
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?James Serra
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase强 王
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop TechnologyManish Borkar
 
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 MillionHow One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 MillionDataWorks Summit
 
Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Vinoth Chandar
 
Real time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafkaReal time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafkaTimothy Spann
 
Big Data: Getting started with Big SQL self-study guide
Big Data:  Getting started with Big SQL self-study guideBig Data:  Getting started with Big SQL self-study guide
Big Data: Getting started with Big SQL self-study guideCynthia Saracco
 
Data profiling in Apache Calcite
Data profiling in Apache CalciteData profiling in Apache Calcite
Data profiling in Apache CalciteDataWorks Summit
 

La actualidad más candente (20)

Getting Started with Amazon QuickSight
Getting Started with Amazon QuickSightGetting Started with Amazon QuickSight
Getting Started with Amazon QuickSight
 
Deep Dive on Amazon Athena - AWS Online Tech Talks
Deep Dive on Amazon Athena - AWS Online Tech TalksDeep Dive on Amazon Athena - AWS Online Tech Talks
Deep Dive on Amazon Athena - AWS Online Tech Talks
 
Apache hive
Apache hiveApache hive
Apache hive
 
RedisConf17 - Lyft - Geospatial at Scale - Daniel Hochman
RedisConf17 - Lyft - Geospatial at Scale - Daniel HochmanRedisConf17 - Lyft - Geospatial at Scale - Daniel Hochman
RedisConf17 - Lyft - Geospatial at Scale - Daniel Hochman
 
Hive(ppt)
Hive(ppt)Hive(ppt)
Hive(ppt)
 
Big Data Architectural Patterns
Big Data Architectural PatternsBig Data Architectural Patterns
Big Data Architectural Patterns
 
seminar presentation on apache-spark
seminar presentation on apache-sparkseminar presentation on apache-spark
seminar presentation on apache-spark
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Apache spark
Apache sparkApache spark
Apache spark
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
 
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 MillionHow One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million
 
Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived
 
Real time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafkaReal time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafka
 
Big Data: Getting started with Big SQL self-study guide
Big Data:  Getting started with Big SQL self-study guideBig Data:  Getting started with Big SQL self-study guide
Big Data: Getting started with Big SQL self-study guide
 
Data profiling in Apache Calcite
Data profiling in Apache CalciteData profiling in Apache Calcite
Data profiling in Apache Calcite
 
Apache PIG
Apache PIGApache PIG
Apache PIG
 
Snowflake Overview
Snowflake OverviewSnowflake Overview
Snowflake Overview
 

Destacado

How Salesforce.com uses Hadoop
How Salesforce.com uses HadoopHow Salesforce.com uses Hadoop
How Salesforce.com uses HadoopNarayan Bharadwaj
 
APAC Big Data Strategy RadhaKrishna Hiremane
APAC Big Data  Strategy RadhaKrishna  HiremaneAPAC Big Data  Strategy RadhaKrishna  Hiremane
APAC Big Data Strategy RadhaKrishna HiremaneIntelAPAC
 
Hive ICDE 2010
Hive ICDE 2010Hive ICDE 2010
Hive ICDE 2010ragho
 
Hadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce DetailsHadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce DetailsAnju Singh
 
Surprising failure factors when implementing eCommerce and Omnichannel eBusiness
Surprising failure factors when implementing eCommerce and Omnichannel eBusinessSurprising failure factors when implementing eCommerce and Omnichannel eBusiness
Surprising failure factors when implementing eCommerce and Omnichannel eBusinessDivante
 
Magento scalability from the trenches (Meet Magento Sweden 2016)
Magento scalability from the trenches (Meet Magento Sweden 2016)Magento scalability from the trenches (Meet Magento Sweden 2016)
Magento scalability from the trenches (Meet Magento Sweden 2016)Divante
 
Omnichannel Customer Experience
Omnichannel Customer ExperienceOmnichannel Customer Experience
Omnichannel Customer ExperienceDivante
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start TutorialCarl Steinbach
 
White Paper: Turning Anonymous Shoppers into Known Customers
White Paper: Turning Anonymous Shoppers into Known CustomersWhite Paper: Turning Anonymous Shoppers into Known Customers
White Paper: Turning Anonymous Shoppers into Known CustomersGigya
 
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...Amazon Web Services
 
Real-time Analytics in Financial: Use Case, Architecture and Challenges
Real-time Analytics in Financial: Use Case, Architecture and ChallengesReal-time Analytics in Financial: Use Case, Architecture and Challenges
Real-time Analytics in Financial: Use Case, Architecture and ChallengesDataWorks Summit/Hadoop Summit
 
How Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAXHow Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAXBMC Software
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadooproyans
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecasesudhakara st
 
Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010Jonathan Seidman
 
Scaling Ancestry DNA with the Hadoop Ecosystem
Scaling Ancestry DNA with the Hadoop EcosystemScaling Ancestry DNA with the Hadoop Ecosystem
Scaling Ancestry DNA with the Hadoop EcosystemDataWorks Summit
 
Sparkflows - Build E2E Data Analytics Use Cases in less than 30 mins
Sparkflows - Build E2E Data Analytics Use Cases in less than 30 minsSparkflows - Build E2E Data Analytics Use Cases in less than 30 mins
Sparkflows - Build E2E Data Analytics Use Cases in less than 30 minssparkflows
 

Destacado (20)

How Salesforce.com uses Hadoop
How Salesforce.com uses HadoopHow Salesforce.com uses Hadoop
How Salesforce.com uses Hadoop
 
APAC Big Data Strategy RadhaKrishna Hiremane
APAC Big Data  Strategy RadhaKrishna  HiremaneAPAC Big Data  Strategy RadhaKrishna  Hiremane
APAC Big Data Strategy RadhaKrishna Hiremane
 
Hive ICDE 2010
Hive ICDE 2010Hive ICDE 2010
Hive ICDE 2010
 
Hadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce DetailsHadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce Details
 
Surprising failure factors when implementing eCommerce and Omnichannel eBusiness
Surprising failure factors when implementing eCommerce and Omnichannel eBusinessSurprising failure factors when implementing eCommerce and Omnichannel eBusiness
Surprising failure factors when implementing eCommerce and Omnichannel eBusiness
 
Magento scalability from the trenches (Meet Magento Sweden 2016)
Magento scalability from the trenches (Meet Magento Sweden 2016)Magento scalability from the trenches (Meet Magento Sweden 2016)
Magento scalability from the trenches (Meet Magento Sweden 2016)
 
Omnichannel Customer Experience
Omnichannel Customer ExperienceOmnichannel Customer Experience
Omnichannel Customer Experience
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
White Paper: Turning Anonymous Shoppers into Known Customers
White Paper: Turning Anonymous Shoppers into Known CustomersWhite Paper: Turning Anonymous Shoppers into Known Customers
White Paper: Turning Anonymous Shoppers into Known Customers
 
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent...
 
Real-time Analytics in Financial: Use Case, Architecture and Challenges
Real-time Analytics in Financial: Use Case, Architecture and ChallengesReal-time Analytics in Financial: Use Case, Architecture and Challenges
Real-time Analytics in Financial: Use Case, Architecture and Challenges
 
How Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAXHow Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAX
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010Hadoop and Hive at Orbitz, Hadoop World 2010
Hadoop and Hive at Orbitz, Hadoop World 2010
 
Scaling Ancestry DNA with the Hadoop Ecosystem
Scaling Ancestry DNA with the Hadoop EcosystemScaling Ancestry DNA with the Hadoop Ecosystem
Scaling Ancestry DNA with the Hadoop Ecosystem
 
Hug Milano September 2014: Hadoop Summit Europe Impressions
Hug Milano September 2014: Hadoop Summit Europe ImpressionsHug Milano September 2014: Hadoop Summit Europe Impressions
Hug Milano September 2014: Hadoop Summit Europe Impressions
 
Hug Italy- 30 Sept 2014, Milan
Hug Italy- 30 Sept 2014, MilanHug Italy- 30 Sept 2014, Milan
Hug Italy- 30 Sept 2014, Milan
 
Sparkflows - Build E2E Data Analytics Use Cases in less than 30 mins
Sparkflows - Build E2E Data Analytics Use Cases in less than 30 minsSparkflows - Build E2E Data Analytics Use Cases in less than 30 mins
Sparkflows - Build E2E Data Analytics Use Cases in less than 30 mins
 
Big Data Infrastructures - Hadoop ecosystem, M. E. Piras
Big Data Infrastructures - Hadoop ecosystem, M. E. PirasBig Data Infrastructures - Hadoop ecosystem, M. E. Piras
Big Data Infrastructures - Hadoop ecosystem, M. E. Piras
 

Similar a Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010

Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
hive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptxhive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptxvishwasgarade1
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Andrew Brust
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big DataAndrew Brust
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsAndrew Brust
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosLester Martin
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsAndrew Brust
 
Yahoo! Hack Europe Workshop
Yahoo! Hack Europe WorkshopYahoo! Hack Europe Workshop
Yahoo! Hack Europe WorkshopHortonworks
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2tcloudcomputing-tw
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesCorley S.r.l.
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft PlatformJesus Rodriguez
 

Similar a Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010 (20)

Hadoop intro
Hadoop introHadoop intro
Hadoop intro
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
hive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptxhive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptx
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big Data
 
מיכאל
מיכאלמיכאל
מיכאל
 
Hive
HiveHive
Hive
 
Getting started big data
Getting started big dataGetting started big data
Getting started big data
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI Pros
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Hive_Pig.pptx
Hive_Pig.pptxHive_Pig.pptx
Hive_Pig.pptx
 
SQL Server 2012 and Big Data
SQL Server 2012 and Big DataSQL Server 2012 and Big Data
SQL Server 2012 and Big Data
 
1. Apache HIVE
1. Apache HIVE1. Apache HIVE
1. Apache HIVE
 
2014 08-20-pit-hug
2014 08-20-pit-hug2014 08-20-pit-hug
2014 08-20-pit-hug
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI Pros
 
Yahoo! Hack Europe Workshop
Yahoo! Hack Europe WorkshopYahoo! Hack Europe Workshop
Yahoo! Hack Europe Workshop
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
 
Apache Hadoop Hive
Apache Hadoop HiveApache Hadoop Hive
Apache Hadoop Hive
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft Platform
 

Más de Jonathan Seidman

Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019Jonathan Seidman
 
Foundations strata sf-2019_final
Foundations strata sf-2019_finalFoundations strata sf-2019_final
Foundations strata sf-2019_finalJonathan Seidman
 
Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018Jonathan Seidman
 
Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018Jonathan Seidman
 
Architecting a Next Generation Data Platform – Strata Singapore 2017
Architecting a Next Generation Data Platform – Strata Singapore 2017Architecting a Next Generation Data Platform – Strata Singapore 2017
Architecting a Next Generation Data Platform – Strata Singapore 2017Jonathan Seidman
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Jonathan Seidman
 
Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Jonathan Seidman
 
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Jonathan Seidman
 
Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011Jonathan Seidman
 
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Jonathan Seidman
 
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011Jonathan Seidman
 
Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Jonathan Seidman
 
Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011Jonathan Seidman
 
Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011Jonathan Seidman
 

Más de Jonathan Seidman (14)

Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019Foundations for Successful Data Projects – Strata London 2019
Foundations for Successful Data Projects – Strata London 2019
 
Foundations strata sf-2019_final
Foundations strata sf-2019_finalFoundations strata sf-2019_final
Foundations strata sf-2019_final
 
Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018Architecting a Next Gen Data Platform – Strata New York 2018
Architecting a Next Gen Data Platform – Strata New York 2018
 
Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018Architecting a Next Gen Data Platform – Strata London 2018
Architecting a Next Gen Data Platform – Strata London 2018
 
Architecting a Next Generation Data Platform – Strata Singapore 2017
Architecting a Next Generation Data Platform – Strata Singapore 2017Architecting a Next Generation Data Platform – Strata Singapore 2017
Architecting a Next Generation Data Platform – Strata Singapore 2017
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014
 
Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013
 
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
 
Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011Extending the Data Warehouse with Hadoop - Hadoop world 2011
Extending the Data Warehouse with Hadoop - Hadoop world 2011
 
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
 
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
 
Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011Distributed Data Analysis with Hadoop and R - OSCON 2011
Distributed Data Analysis with Hadoop and R - OSCON 2011
 
Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011Extending the EDW with Hadoop - Chicago Data Summit 2011
Extending the EDW with Hadoop - Chicago Data Summit 2011
 
Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011Real World Machine Learning at Orbitz, Strata 2011
Real World Machine Learning at Orbitz, Strata 2011
 

Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010

  • 1. Using Hadoop and Hive to Optimize Travel Search Jonathan Seidman and Ramesh Venkataramaiah
  • 2. Contributors •  Robert Lancaster, Orbitz Worldwide •  Wai Gen Yee, Orbitz Worldwide •  Andrew Yates, Intern - Orbitz Worldwide page 2
  • 3. Agenda •  Orbitz Worldwide •  Hadoop for Big Data •  Hive for Queries •  Web Analytics data as input •  Applications of Hadoop/Hive at Orbitz: –  Hotel Sort –  Data Cubes •  Sample analysis and data trends Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 3
  • 5. Data Challenges •  Orbitz.com generates ~1.5 million air searches and ~1 million hotel searches every day. •  All of this activity generates massive amounts of data – over 500 GB/day of log data, and even this doesn’t capture all of the data we want. •  Expensive and difficult to use existing data infrastructure for storing and processing this data. •  Need an infrastructure that provides: –  Long term storage of very large data sets. –  Open access to developers and analysts. –  Allows for ad-hoc querying of data and rapid deployment of reporting applications. Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 5
  • 6. Hadoop Overview •  Open source framework providing reliable and scalable storage and processing of data on inexpensive commodity hardware. •  Two primary components: The Hadoop distributed file system and MapReduce. Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 6
  • 7. Hadoop Overview – Hadoop Distributed File System •  HDFS provides reliable, fault tolerant and scalable storage of very large datasets across machines in a cluster. Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 7
  • 8. Hadoop Overview – MapReduce •  Programming model for efficient distributed processing. Designed to reliably perform computations on large volumes of data in parallel. •  Removes much of the burden of writing distributed computations. Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 8
  • 9. The Problem with MapReduce •  Requires experienced developers to write MapReduce jobs which can be difficult to maintain and re-use. Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 9
  • 10. Hive Overview •  Hive is an open-source data warehousing solution built on top of Hadoop which allows for easy data summarization, adhoc querying and analysis of large datasets stored in Hadoop. •  Developed at Facebook to provide a structured data model over Hadoop data. •  Simplifies Hadoop data analysis – users can use a familiar SQL model rather than writing low level custom code. •  Hive queries are compiled into Hadoop MapReduce jobs. •  Designed for scalability, not low latency. Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 10
  • 11. Hive Overview – Comparison to Traditional DBMS Systems •  Although Hive uses a model familiar to database users, it does not support a full relational model and only supports a subset of SQL. •  What Hadoop/Hive offers is highly scalable and fault-tolerant processing of very large data sets. Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 11
  • 12. Hive - Data Model •  Databases – provide namespace for Hive objects, prevent naming conflicts. •  Tables – analogous to tables in a standard RDBMS. •  Partitions and buckets – Allow Hive to prune data during query processing. Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 12
  • 13. Hive – Data Types •  Supports primitive types such as int, double, and string. •  Also supports complex types such as structs, maps (key/value tuples), and arrays (indexable lists). Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 13
  • 14. Hive – Hive Query Language •  HiveQL – Supports basic SQL-like operations such as select, join, aggregate, union, sub-queries, etc. •  HiveQL queries are compiled into MapReduce processes. •  Supports embedding custom MapReduce scripts. •  Built in support for standard relational, arithmetic, and boolean operators. Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 14
  • 15. Hive MapReduce •  Allows analysis not possible through standard HiveQL queries. •  Can be implemented in any language. Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 15
  • 16. Hive – User Defined Functions •  HiveQL is extensible through user defined functions implemented in Java. •  Also supports aggregation functions (sum, avg). •  Provides table functions when more than one value needs to be returned. Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 16
  • 17. Hive – User Defined Functions Example UDF – Find hotel’s position in an impression list: package com.orbitz.hive;! import org.apache.hadoop.hive.ql.exec.UDF;! import org.apache.hadoop.io.Text;! /**! * returns hotel_id's position given a hotel_id and impression list! */! public final class GetPos extends UDF {! public Text evaluate(final Text hotel_id, final Text impressions) {! if (hotel_id == null || impressions == null)! return null;! String[] hotels = impressions.toString().split(";");! String position;! String id = hotel_id.toString();! int begin=0, end=0;! for (int i=0; i<hotels.length; i++) {! begin = hotels[i].indexOf(",");! end = hotels[i].lastIndexOf(",");! position = hotels[i].substring(begin+1,end);! if (id.equals(hotels[i].substring(0,begin)))! return new Text(position);! }! return null;! }! }! Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 17
  • 18. Hive – User Defined Functions hive> add jar path-to-jar/pos.jar; ! hive> create temporary function getpos as 'com.orbitz.hive.GetPos';! hive> select getpos(‘1’, ‘1,3,100.00;2,1,100.00’);! …! hive> 3 ! Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 18
  • 19. Hive – Client Access •  Hive Command Line Interface (CLI) •  Hive Web UI •  JDBC, ODBC, Thrift Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 19
  • 20. Hive – Lessons Learned •  Job scheduling – Default Hadoop scheduling is FIFO. Consider using something like the fair scheduler. •  Multi-user Hive – Default install is single user. Multi-user installs require Derby network server. Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 20
  • 22. Input Data – Webtrends •  Web analytics software providing information about user behavior. •  Raw Webtrends log files are used as input to much of our processing. Logs have the following format: –  date time c-ip cs-username cs-host cs-method cs- uri-stem cs-uri-query sc-status sc-bytes cs- version cs(User-Agent) cs(Cookie) cs(Referer) dcs-geo dcs-dns origin-id dcs-id! Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 22
  • 23. Input Data – Webtrends Cont’d •  Storing raw data in HDFS provides access to data not available elsewhere, for example “hotel impression” data: –  115004,1,70.00;35217,2,129.00;239756,3,99.00;83389,4,99.00! Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 23
  • 25. Improve Hotel Sort •  Extract data from raw Webtrends logs for input to a trained classification process. •  Logs provide input to MapReduce processing which extracts required fields. •  Previous process used a series of Perl and Bash scripts to extract data serially. •  Comparison of performance –  Months worth of data –  Manual process took 109m14s –  MapReduce process took 25m58s Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 25
  • 26. Improve Hotel Sort – Components Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 26
  • 27. Improve Hotel Sort – Processing Flow Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 27
  • 28. Webtrends Analysis in Hive •  Extract data is loaded into two Hive tables: DROP TABLE wt_extract; CREATE TABLE wt_extract( session_id STRING, visitor_tracking_id STRING, host STRING, visitors_ip STRING, booking_date STRING, booking_time STRING, dept_date STRING, ret_date STRING, booked_hotel_id STRING, sort_type STRING, destination STRING, location_id STRING, number_of_guests INT, number_of_rooms INT, page_number INT, matrix_interaction STRING, impressions STRING, areacode STRING, city STRING, region_code STRING, country STRING, country_code STRING, continent STRING, company STRING, tzone STRING) CLUSTERED BY(booked_hotel_id) INTO 256 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY 't’ STORED AS TEXTFILE; load data inpath ’/extract-output/part-00000' into table wt_extract; DROP TABLE hotel_impressions; CREATE TABLE hotel_impressions( session_id STRING, hotel_id STRING, position INT, rate FLOAT ) CLUSTERED BY(hotel_id) INTO 256 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY 't' STORED AS TEXTFILE; load data inpath ’/impressions-output/part-00000' into table hotel_extract; Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 28
  • 29. Webtrends Analysis in Hive Cont’d •  Allows us to easily derive metrics not previously possible. •  Example - Find the Position of Each Booked Hotel in Search Results: CREATE TABLE positions(! session_id STRING,! booked_hotel_id STRING,! position INT);! set mapred.reduce.tasks = 17;! INSERT OVERWRITE TABLE
 positions! SELECT
 e.session_id, e.booked_hotel_id, i.position! FROM
 hotel_impressions i JOIN wt_extract e! ON
 (e.booked_hotel_id = i.hotel_id and e.session_id = i.session_id);! Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 29
  • 30. Webtrends Analysis in Hive Cont’d •  Example - Aggregate Booking Position by Location by Day: CREATE TABLE position_aggregate_by_day(! location_id STRING,!   booking_date STRING,!   position INT,!   pcount INT);! INSERT OVERWRITE TABLE! position_aggregate_by_day! SELECT! e.location_id, e.booking_date, i.position, count(1)! FROM! wt_extract e JOIN hotel_impressions i! ON! (i.hotel_id = e.booked_hotel_id and i.session_id = e.session_id)! GROUP BY! e.location_id,e.booking_date,i.position! Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 30
  • 31. Hotel Data Cube •  Goal is to provide users with data not available in existing hotel cube. •  Problem is lack of Hive support in existing visualization tools. Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 31
  • 32. Hotel Data Cube – Components Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 32
  • 33. Statistical Analysis of Hive Data Explore statistical trends and long-tails… to help machine learning algorithms… by choosing well understood input datasets. •  What approximations and biases exist? •  How is the Data distributed? •  Are 25+ variables pair-wise correlated? •  Are there built-in data bias? Any Lurking variables? •  Are there outlier effects on the distribution? •  Which segment should be used for ML training? Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 33
  • 34. Statistical Analysis – Components Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 34
  • 35. Statistical Analysis: Infrastructure and Dataset •  Hive + R platform for query processing and statistical analysis. •  R - Open-source stat package with visualization. •  R - Steep learning curve but worth it! •  R – Vibrant community support. •  Hive Dataset used: –  Customer hotel booking on our sites over 6 months. –  Derived from web analytics extract files from Hive. –  User rating of hotels. –  Captured major holiday travel bookings but not summer peak. •  Costs of cleaning and processing data is non-trivial. Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 35
  • 36. Some R code… HIVE> Insert overwrite local directory ‘/home/user/…/HiveExtractTable’! Select * from . . .! ----------------------------------------------------------------------------------------------------------------! R> WTData <- data.frame(scan("/user/.../HiveExtractTable", what = list(0,"",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"", 0,"",""),na.strings="NA"))! R> names(CHI) <- c("locationID”,"hotelID","rate","distance","amenities","checkin", "cleanliness", "comfort", "dining", "location”, "bookposition", "guests", "competitiveness”, "departuredate", "returndate", "stay”)! CHI <- WTData[which(WTData[,1] == '7840'|'35416'|'37961'|'38422'|'37795'| '37769'|'37730'|'33728'|'38123'),]! R> summary(CHI)! ----------------------------------------------------------------------------------------------------------------! R> library(corrgram)! R> corrgram(CHI,order=NULL,lower.panel=panel.ellipse)! ----------------------------------------------------------------------------------------------------------------! R> pdf(file="/user/.../himp.pdf") ! R> par(mfrow=c(2,1))! R> bin1 <-hexbin(CHI[,15],CHI[,4], xbins=25)! R> hsmooth(bin1,c(48,24,0))! R> plot(bin1,xlab="Position", ylab="Price($)", main="Booked Hotel Impressions", colramp=function(n){plinrain (n,beg=35,end=225)})! R> bin2 <-hexbin(CHI[,15], CHI[,22], xbins=25)! ----------------------------------------------------------------------------------------------------------------! R> png(”/user/. . ./HM_agg_CHI.png",width=480,height=480,units="px",pointsize=12)! R> CHI <- aggdata[which(aggdata$position <=10),]! R> par(mfrow=c(2,2))! R> calendarHeat(CHI$Time,CHI$rate, varname="Aggregated Check-in Date vs. Avg. Daily Rate”)! Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 36
  • 37. Statistical Analysis - Positional Bias •  Lurking variable is… Positional Bias. •  Top positions invariably picked the most. •  Aim to position Best Ranked Hotels at the top based on customer search criteria and user ratings. •  If website originated data, watch for inherent hidden bias. Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 37
  • 38. Statistical Analysis - Kernel Density •  User Ratings of Hotels •  Strongly affected by the number of bins used. •  Kernel density plots are usually a much more effective way to overcome the limitations of histograms. Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 38
  • 39. Statistical Analysis - Normality tests •  How normal is our data? •  Plots the quantiles of the data set against the theoretical quantiles. •  If the distributions are the same, then the plot will be approximately a straight line. •  An "S" shape implies that one distribution has longer tails than the other. •  Avoid tendency to create stories out of noise. Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 39
  • 40. Statistical Analysis - Macro level regression Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 40
  • 41. Statistical Analysis - Exploratory correlation Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 41
  • 42. Statistical Analysis - Visual Analytics •  Show daily average rate based on booked hotels. •  Show seasonal dip in hotel rates. •  Outliers removed. •  “Median is not the message”; Find patterns first. Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 42
  • 43. Statistical Analysis - More seasonal variations •  Customer hotel stay gets longer during summer months •  Could help in designing search based on seasons. •  Outliers removed. •  Understand data boundaries. Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 43
  • 44. Statistical Analysis - Linear Filtering •  Decompose into a trend, a seasonal component and remainder. •  Moving average linear filters with equal weights. •  Filters Coefficients: •  Let macro patterns guide further modeling. Orbitz! Hadoop! Hive! Input! Applications! Analysis! page 44
  • 45. References •  Hadoop project: http://hadoop.apache.org/ •  Hive project: http://hadoop.apache.org/hive/ •  Hive – A Petabyte Scale Data Warehouse Using Hadoop: http://i.stanford.edu/~ragho/hive-icde2010.pdf •  Hadoop The Definitive Guide, Tom White, O’Reilly Press, 2009 •  Why Model, J. Epstein, 2008 •  Beautiful Data, T. Segaran & J. Hammerbacher, 2009 page 45
  • 46. Contact •  Jonathan Seidman: –  jseidman@orbitz.com –  @jseidman •  Ramesh Venkataramaiah: –  rvenkataramaiah@orbitz.com page 46
  • 47. Questions? page 47