Big Data Analytics: Applications & Opportunities in On-line Predictive Modeling

Usama Fayyad, Ph.D.
Chairman & CTO, ChoozOn Corporation
Twitter: @usamaf

August 12, 2012
BigMine: BigData Mining Workshop
KDD-2012, Beijing, China

Outline
•  Big Data all around us
•  Introduction to Data Mining and Predictive Analytics
•  On-line data and facts
•  Case studies from multiple verticals:
      1.  Yahoo! Big Data
      2.  Social Network Data
      3.  Case Study from nPario Applications
      4.  ChoozOn Big Data from offers
•  High-level view: don’t forget the basics
•  Summary and conclusions

What Matters in the Age of Analytics?
1.  Being able to exploit all the data that is available
      •  not just what you've got available
      •  what you can acquire and use to enhance your actions
2.  Proliferating analytics throughout your organization
      •  make every part of your business smarter
3.  Driving significant business value
      •  embedding analytics into every area of your business can help you drive top line revenues and/or bottom line cost efficiencies

What Organizations Are Struggling With
•  Data Strategy - how much data? why data? how does it impact my business?
•  Prioritization conducted based on business need, not IT
      –  Business justifications for Big Data
      –  Demonstrating value of data in impacting the business
      –  Looking at specialized stores to reduce TCO
      –  File systems for grid computing (Hadoop)
•  We do need to stay on top of our basic business ops
      –  billing, monitoring, inventory management, etc.
      –  Most can be handled by stream processing and traditional BI
•  But a new generation of requirements is becoming a priority for data-driven business
      –  Predictive analytics, advanced forecasting, automated detection of events of interest

Why Big Data?
A new term, with associated “Data Scientist” positions:
•  Big Data is a mix of structured, semi-structured, and unstructured data:
     –  Typically breaks barriers for traditional RDB storage
     –  Typically breaks limits of indexing by “rows”
     –  Typically requires intensive pre-processing before each query to extract “some structure”, usually using Map-Reduce type operations
•  The above leads to “messy” situations with no standard recipes or architecture: hence the need for “data scientists”
     –  conduct “Data Expeditions”
     –  Discovery and learning on the spot

What Makes Data “Big Data”?
•  Big Data is characterized by the 3-V’s:
    –  Volume: larger than “normal”, challenging to load/process
         •  Expensive to do ETL
         •  Expensive to figure out how to index and retrieve
         •  Multiple dimensions that are “key”
    –  Velocity: Rate of arrival poses real-time constraints on what are typically “batch ETL” operations
         •  If you fall behind, catching up is extremely expensive (replicate very expensive systems)
         •  Must keep up with rate and service queries on-the-fly
    –  Variety: Mix of data types and varying degrees of structure
         •  Non-standard schema
         •  Lots of BLOB’s and CLOB’s
         •  DB queries don’t know what to do with semi-structured and unstructured data.

The Distinction between “Data” and “Big Data” is fast disappearing
•  Most real data sets nowadays come with a serious mix of semi-structured and unstructured components:
        –  Images
        –  Video
        –  Text descriptions and news, blogs, etc.
        –  User and customer commentary
        –  Reactions on social media: e.g. Twitter is a mix of data anyway
•  Using standard transforms, entity extraction, and new generation tools to transform unstructured raw data into semi-structured analyzable data
•  Hadoop vs. Not Hadoop - when to use what kind of techniques requiring Map-Reduce and grid computing
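
As a rough illustration of turning unstructured raw text into a semi-structured record, the sketch below pulls a few entity types out of a short message with regular expressions. The patterns, the field names, and the sample message are assumptions made for this example; production entity extraction would normally rely on far richer tooling.

```python
import re

# Illustrative patterns only; real entity extractors are much more thorough.
MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")
URL = re.compile(r"https?://\S+")

def extract_entities(text):
    """Turn one raw text message into a semi-structured record."""
    return {
        "mentions": MENTION.findall(text),
        "hashtags": HASHTAG.findall(text),
        "urls": URL.findall(text),
        "n_tokens": len(text.split()),
    }

# Hypothetical sample message.
record = extract_entities(
    "@usamaf great talk at #KDD2012 on #BigData http://example.com/slides"
)
```

Once every message is reduced to a record like this, standard counting, joining, and modeling tools can work on the fields directly.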
  


Text Data: The Big Driver
•  While we speak of “big data” and the “Variety” in 3-V’s
•  Reality: biggest driver of growth of Big Data has been text data
•  In fact, Map-Reduce became popularized by Google to address the problem of processing large amounts of text data:
       –  Indexing a full copy of the web
       –  Frequent re-indexing
       –  Many operations, each being a simple operation but done at large scale
•  Most work on analysis of “images” and “video” data has really been reduced to analysis of surrounding text

To Hadoop or not to Hadoop?
When to use techniques requiring Map-Reduce and grid computing?
•  Typically organizations try to use Map-Reduce for everything to do with Big Data
     –  This is actually very inefficient and often irrational
     –  Certain operations require specialized storage
           •  Updating segment memberships over large numbers of users
           •  Defining new segments on user or usage data
•  Map-Reduce is useful when a very simple operation is to be applied on a large body of unstructured data
     –  Typically this is during entity and attribute extraction
     –  Still need Big Data analysis post Hadoop
•  Map-Reduce is not efficient or effective for tasks involving deeper statistical modeling
     –  good for gathering counts and simple (sufficient) statistics
           •  E.g. how many times a keyword occurs, quick aggregation of simple facts in unstructured data, estimates of variances, density, etc.
     –  Mostly pre-processing for Data Mining
Hadoop Use Cases by Data Type
[Pie chart: share of Hadoop use cases by data type]
•  Content and Preference Data: 24%
•  IT Log Data: 19%
•  Text and Language Data: 16%
•  Social Data: 11%
•  Advertising Data: 10%
•  Science Data: 7%
•  CRM Data: 4%
•  Financial Trading Data: 4%
•  Sensor Data: 2%
•  Supply Chain Data: 2%
•  ERP Financial Data: 1%

Big Data Applications and Uses
[Diagram: application areas and their associated uses]
•  IT Log & Security Forensics & Analytics: 100% Capture, Find New Signal, Predict Events
•  Automated Device Data Analytics: Product Planning, Failure Analysis, Proactive Fixes
•  Advertising Analytics: Recommendation, Segmentation, Social Media
•  Big Data Warehouse Analytics: Hadoop + MPP + EDW, Cost Reduction, Ad Hoc Insight, Predictive Analytics

Analysis & Programming Software
[Figure: logos of analysis and programming tools, including PIG and HIPI]

From Basic Dashboards to Advanced Analytics
•  Data Reduction to get
    –  Advanced views oriented by customer or product
    –  Segmentation
    –  Pattern analysis and summaries
•  Predictive Analytics
    –  Data Mining
    –  Statistical analysis
    –  Optimization of processes and spend

The same analytics techniques apply across many industries: fraud detection is fraud detection is fraud detection

What is Data Mining?
Finding interesting structure in data
•  Structure: refers to statistical patterns, predictive models, hidden relationships
•  Interesting: accurate predictions, associated with new revenue potential, associated with cost savings, enables optimization

•  Examples of tasks addressed by Data Mining
    –  Predictive Modeling (classification, regression)
    –  Segmentation (Data Clustering)
    –  Affinity (Summarization)
          •  relations between fields, associations, visualization

Data Mining and Databases
Many interesting analysis queries are difficult to state precisely

•  Examples:
     –  Which records represent fraudulent transactions?
     –  Which households are likely to prefer a Ford over a Toyota?
     –  Who is a good credit risk in my customer DB?
     –  Why are these automobiles in need of unusual repairs?

•  Yet the database contains the information
     –  good/bad customer, profitability
     –  did/didn’t respond to mailout/survey/campaign/...
     –  automobile repair and warranty records

Many Business Uses
Analytic technique                    Uses in business
Marketing and sales                   Identify potential customers; establish the effectiveness of a campaign
Understanding customer behavior       Model churn, affinities, propensities, …
Web analytics & metrics               Model user preferences from data, collaborative filtering, targeting, etc.
Fraud detection                       Identify fraudulent transactions
Credit scoring                        Establish credit worthiness of a customer requesting a loan
Manufacturing process analysis        Identify the causes of manufacturing problems
Portfolio trading                     Optimize a portfolio of financial instruments by maximizing returns & minimizing risks
Healthcare application                Fraud detection, cost optimization, detection of events like epidemics, etc.
Insurance                             Fraudulent claim detection, risk assessment
Security and surveillance             Intrusion detection, sensor data analysis, remote sensing, object/person detection, link analysis, etc.

So this Internet thing is going to be big!
Big opportunity, Big Data, Big Challenges!

Stats about on-line usage
•  How many people are on-line today?
     –  2.1 Billion (per Comscore estimates)
     –  30% of world population
•  How much time is spent on-line per month by the whole world?
     –  4M person-years per month
•  How many hours per month per Internet user?
     –  16 hours (global average)
     –  32 hours (U.S. average)
*Sources: Feb. 2012 - from Go-Gulf.com compiled from Comscoredatamine.com, Nielsen.com, thisDigitalLife.com, PewInternet.org
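
The person-years figure can be checked against the other two numbers on the slide. The short calculation below is a sanity check only; it assumes a person-year is counted as 8,760 clock hours (24 × 365), which the slide does not state.

```python
USERS = 2.1e9                # people on-line (slide figure)
HOURS_PER_USER_MONTH = 16    # global average hours per user per month (slide figure)
HOURS_PER_YEAR = 24 * 365    # assumption: one person-year = 8,760 clock hours

total_hours = USERS * HOURS_PER_USER_MONTH
person_years_per_month = total_hours / HOURS_PER_YEAR

print(f"{person_years_per_month / 1e6:.2f}M person-years per month")
```

Under that assumption the result comes out a little under 4 million, in line with the rounded figure quoted above.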

How Are Users Distributed Geographically
[Figure: geographic distribution of Internet users]
*Sources: Feb. 2012 - from Go-Gulf.com compiled from Comscoredatamine.com, Nielsen.com, thisDigitalLife.com, PewInternet.org

How Do People Spend Their On-line Time?

•  On-line Shopping?                5%
•  Searches?                      21%
•  Email/Communication?           19%
•  Reading Content?               20%
•  Social Networking?             22%
•  Multimedia Sites?              13%

Most Popular Activities On-Line?
[Figure: most popular on-line activities]
*Sources: Feb. 2012 - from Go-Gulf.com compiled from Comscoredatamine.com, Nielsen.com, thisDigitalLife.com, PewInternet.org

Top 10 Sites Visited?
•  Google: 153.4M visitors/month
    –  each spending 1h 47mins
•  Facebook: 137.6M visitors per month
    –  each spending 7h 50mins

Interesting Events
•  Google: How many queries per day?
       –  More than 1 Billion
•  Twitter: How many Tweets/day?
       –  More than 250M
•  Facebook: Updates per day?
       –  More than 800M
•  YouTube: Views/day
       –  4 Billion views
       –  60 hours of video uploaded every minute!
•  Social Networks: users who have used sites for spying on their partners?
       –  56%
*Sources: Feb. 2012 - from Go-Gulf.com compiled from Comscoredatamine.com, Nielsen.com, thisDigitalLife.com, PewInternet.org

Interesting Events
•  Country with the highest number of online friends per user?
       –  Brazil
       –  481 friends per user
       –  Japan has the least at 29
•  Country with maximum time spent shopping on-line?
       –  China: 5 hours/week
*Sources: Feb. 2012 - from Go-Gulf.com compiled from Comscoredatamine.com, Nielsen.com, thisDigitalLife.com, PewInternet.org

So Internet is a big place with lots happening?
Do we understand what each individual is trying to achieve?
•  What is user intent?
•  Critical in monetization, advertising, etc.
Do we understand what a community’s sentiment is?
•  What is the emotion?
•  Is it negative or positive?
•  What is the health of my brand online?
Do we understand context and content?
•  What are appropriate ads?
•  Is it OK to associate my brand with this content?
•  Is content sad? happy? serious? informative?

Case Studies
Yahoo! Big Data

Yahoo! - One of Largest Destinations on the Web
80% of the U.S. Internet population uses Yahoo!
–  Over 600 million users per month globally!
•  Global network of content, commerce, media, search and access products
•  100+ properties including mail, TV, news, shopping, finance, autos, travel, games, movies, health, etc.
•  25+ terabytes of data collected each day
     •  Representing 1000’s of cataloged consumer behaviors

More people visited Yahoo! in the past month than:
•  Use coupons
•  Vote
•  Recycle
•  Exercise regularly
•  Have children living at home
•  Wear sunscreen regularly

Data is used to develop content, consumer, category and campaign insights for our key content partners and large advertisers

Sources: Mediamark Research, Spring 2004 and comScore Media Metrix, February 2005.

Yahoo! Big Data - A league of its own…
[Bar charts; values as labelled on the bars]
Millions of Events Processed Per Day
•  SABRE: 50
•  VISA: 120
•  NYSE: 225
•  YSM: 2,000
•  Y! Global: 14,000
Terabytes of Warehoused Data
•  Amazon: 25
•  Korea Telecom: 49
•  AT&T: 94
•  Y! LiveStor: 100
•  Y! Panama Warehouse: 500
•  Walmart: 1,000
•  Y! Main warehouse: 5,000

GRAND CHALLENGE PROBLEMS OF DATA PROCESSING
TRAVEL, CREDIT CARD PROCESSING, STOCK EXCHANGE, RETAIL, INTERNET

Y! Data Challenge Exceeds others by 2 orders of magnitude

Behavioral Targeting (BT)
[Diagram: BT at the centre, fed by Content, Search, Search Clicks, and Ad Clicks]
Targeting your ads to consumers whose recent behaviors online indicate that your product category is relevant to them

Yahoo! User DNA
[Diagram: example user profile; attributes are colour-coded by source: Registration, Campaign, Behavior, Unknown]
•  Lives in SF
•  Lawyer
•  Male, age 32
•  Clicked on Sony Plasma TV SS ad
•  Searched on: “Hillary Clinton”
•  Searched on: “Italian restaurant Palo Alto”
•  Searched on from London last week
•  Checks Yahoo! Mail daily via PC & Phone
•  Spends 10 hours/week on the internet
•  Purchased Da Vinci Code from Amazon
•  Has 25 IM Buddies, moderates 3 Y! Groups, and hosts a 360 page viewed by 10k people

•  On a per consumer basis: maintain a behavioral/interests profile and profitability (user value and LTV) metrics

How it works | Network + Interests + Modelling
1.  Varying Product Purchase Cycles: Analyze predictive patterns for purchase cycles in over 100 product categories
2.  Rewarding Good Behaviour: In each category, build models to describe behaviour most likely to lead to an ad response (i.e. click).
3.  Match Users to the Models: Score each user for fit with every category…daily.
4.  Identify Most Relevant Users: Target ads to users who get highest ‘relevance’ scores in the targeting categories
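
The score-then-target loop described above might be sketched as follows. This is not Yahoo!'s model: the linear weighting, the feature names, the category names, and every number are hypothetical, chosen only to show the shape of a daily batch that scores each user against each category and then selects the highest scorers.

```python
# Hypothetical per-category weights on behaviour counts (illustrative values only).
CATEGORY_WEIGHTS = {
    "autos":  {"page_views": 0.2, "searches": 0.5, "ad_clicks": 1.0},
    "travel": {"page_views": 0.1, "searches": 0.8, "ad_clicks": 1.0},
}

def score_user(behaviour, weights):
    """Relevance score as a weighted sum of recent behaviour counts."""
    return sum(w * behaviour.get(feature, 0) for feature, w in weights.items())

def daily_scores(users):
    """Score every user against every category (the daily batch step)."""
    return {
        uid: {cat: score_user(b.get(cat, {}), w) for cat, w in CATEGORY_WEIGHTS.items()}
        for uid, b in users.items()
    }

def top_users(scores, category, k):
    """Pick the k users with the highest relevance score in one category."""
    ranked = sorted(scores, key=lambda uid: scores[uid][category], reverse=True)
    return ranked[:k]

# Hypothetical behaviour logs, already grouped by user and category.
users = {
    "u1": {"autos": {"page_views": 5, "searches": 2, "ad_clicks": 1}},
    "u2": {"travel": {"page_views": 3, "searches": 4}},
    "u3": {"autos": {"page_views": 1}},
}
scores = daily_scores(users)
targets = top_users(scores, "autos", k=2)
```

In practice the per-category model would be learned from observed ad responses rather than set by hand, which is what step 2 of the slide describes.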



Recency Matters, So Does Intensity
[Figure: two panels captioned “Active now…” and “…and with feeling”]

Differentiation | Category specific modelling
[Charts: intensity score over time, with the Intense Click Zone marked, for two categories]
•  Example 1: Category Automotive
•  Example 2: Category Travel/Last Minute
•  Alt Behaviour 1 (shown for both examples): 5 pages, 2 search keywords, 1 search click, 1 ad click

Different models allow us to weight and determine intensity and recency

Differentiation | Category specific modelling
[Chart: intensity score over time for Example 1: Category Automotive, with the Intense Click Zone marked]
•  User is in the Intense Click Zone
•  With no further activity, decay takes effect
•  Alt Behaviour 1: 5 pages, 2 search keywords, 1 search click, 1 ad click

Different models allow us to weight and determine intensity and recency
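
One plausible way to combine intensity and recency is to weight each event and discount it by how long ago it happened. The sketch below uses exponential decay with a category-specific half-life; the decay form, the event weights, and the half-lives are all assumptions for illustration, since the slides show only the shape of the curves.

```python
import math

# Hypothetical event weights and half-lives; the slides do not give these values.
EVENT_WEIGHTS = {"page": 1.0, "search_keyword": 2.0, "search_click": 3.0, "ad_click": 5.0}
HALF_LIFE_DAYS = {"automotive": 14.0, "travel_last_minute": 2.0}

def intensity(events, category, now):
    """Sum event weights, each discounted by exponential decay since it occurred.

    events: list of (event_type, day) pairs; now: the current day.
    """
    rate = math.log(2) / HALF_LIFE_DAYS[category]
    return sum(EVENT_WEIGHTS[kind] * math.exp(-rate * (now - day)) for kind, day in events)

# "Alt Behaviour 1" from the slide, with all events placed on day 0:
# 5 pages, 2 search keywords, 1 search click, 1 ad click.
events = (
    [("page", 0)] * 5
    + [("search_keyword", 0)] * 2
    + [("search_click", 0), ("ad_click", 0)]
)

today = intensity(events, "automotive", now=0)
later_auto = intensity(events, "automotive", now=14)            # one half-life later
later_travel = intensity(events, "travel_last_minute", now=14)  # seven half-lives later
```

The same behaviour fades slowly in a long purchase cycle such as automotive and quickly in a short one such as last-minute travel, which is the motivation for keeping a separate model per category.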


Automobile Purchase Intender Example
•  A test ad-campaign with a major Euro automobile manufacturer
     –  Designed a test that served the same ad creative to test and control groups on Yahoo
     –  Success metric: performing specific actions on Jaguar website
•  Test results: 900% conversion lift vs. control group
     –  Purchase Intenders were 9 times more likely to configure a vehicle, request a price quote or locate a dealer than consumers in the control group
     –  ~3x higher click through rates vs. control group




Mortgage Intender Example
Results from a client campaign on Yahoo! Network (Date: March 2006)
Example: Mortgages
•  We found: 1,900,000 people looking for mortgage loans.
•  +122% CTR Lift
•  +626% Conv Lift
Example search terms qualified for this target:
•  Mortgages
•  Home Loans
•  Refinancing
•  Ditech
Example Yahoo! Pages visited:
•  Financing section in Real Estate
•  Mortgage Loans area in Finance
•  Real Estate section in Yellow Pages
Source: Campaign Click thru Rate lift is determined by Yahoo! Internal research. Conversion is the number of qualified leads from clicks over number of impressions served. Audience size represents the audience within this behavioral interest category that has the highest propensity to engage with a brand or product and to click on an offer.

Experience summary at Yahoo!
•  Dealing with the largest data source in the world (25
   terabytes per day)
•  BT business grew from $20M to about $500M
   in 3 years of investment!
•  Building the largest database systems:
   –    World’s largest Oracle data warehouse
   –    World’s largest single DB
   –    Over 300 data marts
   –    Analytics with thousands of KPIs
   –    Over 5000 users of reports
   –    Largest targeting system in the world
•  Big demands for grid computing (Hadoop)
                                     40	
  
Social Network

Social Graph Analysis (no time)
  Social Network Marketing
Understanding Context for Ads
  

                     41	
  
Case Study: TWITTER Social Marketing?
        Viacom’s VH1 Twitter campaign on ANVIL (the movie)
        (week of May 14th, 2009 – see AdAge article)

 Marketing ANVIL movie
 Very niche audience, how do you reach them?

 Diffusion
 Give 20 free DVDs to major related artists/groups, ask them
 to notify Twitter groups – reached over 2M people

 Social Identity: power of word of mouth...
 What is the Cost to VH-1?
 Compare with traditional approach: TV commercials to promote a
 documentary film?

                                    42
  
The Display Ads Challenge Today

                      What Ad would you place here?
  




                      43	
  
The Display Ads Challenge Today
                               Damaging to Brand?
  




                      44	
  
The Display Ads Challenge Today

                      What Ad would you place here?
  




                      45	
  
The Display Ads Challenge Today
                               Irrelevant and Damaging to Brand

                                 Completely Irrelevant
  




                      46	
  
NetSeer: Intent for Display
•  Currently Processing 4 Billion Impressions per Day
  




                                    47	
  
Problem: Hard to Understand User Intent
  Contextual Ad served by Google              What NetSeer Sees:
  




                                                   48	
  
Case Studies

  nPario – Data Management Platform
  ChoozOn – Big Data over Offers Universe
  



                           49	
  
Example of a Big Data DMP
  

Scale
nPario builds an infinitely scalable data management platform (DMP) that
allows advertisers and marketers to manage, understand, and monetize their
data. Their technology has been proven at companies such as Yahoo and EA.

Applications
nPario applications include segmentation of audiences for increasing the value
of advertising, reporting/analytics for examining performance, attribution to
show which advertising works, and experimentation to test ideas.

Access
nPario emphasizes putting access to data in the hands of marketing,
advertising and other business users.




                                          50	
  
Powerful Technology.
540m Users, 8+ Petabytes, & 16 Patents

nPario has the only commercially available Big Data management
technology built for one of the “Big Five”.

In Production at Yahoo
nPario technology manages the world’s largest data system. Used for Yahoo’s
Marketing and Advertising business. Used across Yahoo’s platforms and 120
online properties. Used by hundreds of analysts.

Significant Investment
nPario’s technology is the result of more than $50m investment in
development and 16 issued patents.




                                                                    51	
  
DMP
[Diagram: Data Management Platform]
Data Sources
    First-Party: CRM, Offline Data, Events
    Multiple Channels: Search, eMail, Mobile, Video, Display, Site
    Third-Party
Self-Service Data Intake & ETL
    Tag Mgmt; Integration & APIs
    Data Processing, Enrichment and Normalization
    Real-Time Data Availability
Deep Insights & Big Data Analytics
    Insights; Analytics; Modeling & Discovery
Self-Service Applications
    User DNA, Segment, Target, Attribution, Experimentation

                                                                           52
  
nPario Case Study
Marketing to 100+ million Gamers

Challenge: Provide cross-platform campaign insights for advertisers and
enable audience discovery across channels.

Result: Unified view of gamers across multiple cross-platform data sources.

Pogo (online), sponsored content, console game interaction and ad
interaction (Xbox, PS3), Mobile, Playfish (Facebook), external sources
(Collective Media, Comscore, Omniture, Dynamic Logic)

“EA increased its worldwide audience reach by 30% this year […]. Combining
that major jump in reach with the launch of EA Legend puts us in perfect
position to compete”
Dave Madden, Senior Vice President of Global Media Solutions,
Electronic Arts.




                                                         53	
  
Big Data in Action: EA Legend
  




                      54	
  
Specialized Search through Big Data
 Analytics over the Offers Universe

                       55
  
Chaos for Consumers
[Diagram: the Consumer surrounded by Deep Discount Sites, Coupon Sites,
Flash Sale Sites, Loyalty Programs, Daily Deals Sites, Online Loyalty
Programs, Social Networks and Search Engines]

                                  56	
  
Chaos for Marketers
[Diagram: Deep Discount Sites, Coupon Sites, Flash Sale Sites, Loyalty
Programs, Daily Deals Sites, Online Loyalty Programs, Social Networks and
Search Engines]

                                   57	
  
What Consumers & Marketers Want

        Consumers
        •  Value from brands they love
        •  Tame the deal chaos

        Marketers
        •  Reach targeted consumers
        •  Build loyalty
        •  Create brand evangelists

                       58
  
Keys to Being THE Consumer Network

        •  Comprehensive Deal Coverage
        •  Personalization
        •  Intelligent Matching
        •  Permission-based Targeting Solution
        •  Loyalty Solution
        •  Multi-Channel Reach

10/18/2010            ChoozOn Confidential
                                      59
  
Solution for Consumers
[Diagram, top to bottom]
 Daily Deals Sites, Flash Sale Sites, Deep Discounters, Affiliate Deals & Coupon Sites, Loyalty Programs
 Choozer Interests & Preferences
 Intelligent Matching: Machine-based and Social (Pals, Clubs)
 Web App, Mobile App, Email, Digital Media

                                          60
  
Carol’s Consumer Network
[Diagram: “Chozen” Brands, Interests (Intent), Shopping Pals, Deal Clubs]
                61
  
What Carol Gets
Carol’s Consumer Network, matched against The Universe of Deals
(1,500+ brands, 100,000+ offers):
    Affiliate Deals
    Loyalty Programs
    Daily Deals
    Flash Sales
    Deep Discounters
    PLUS her Inbox™

Intelligent Matching

A Personal Shopper for Deals
  
                                     62	
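
The slides name "Intelligent Matching" but do not describe how it works. The sketch below is therefore only an illustration of the general idea: scoring offers against a consumer's chosen brands and interests. The class names, fields and weights are all hypothetical and should not be read as ChoozOn's method.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    brand: str
    category: str
    source: str  # e.g. "daily_deal", "flash_sale", "loyalty"


@dataclass
class Consumer:
    chosen_brands: set = field(default_factory=set)
    interests: set = field(default_factory=set)


def score(offer: Offer, consumer: Consumer) -> float:
    """Hypothetical weights: a chosen brand counts more than a category match."""
    s = 0.0
    if offer.brand in consumer.chosen_brands:
        s += 2.0
    if offer.category in consumer.interests:
        s += 1.0
    return s


def match(offers, consumer, top_n=3):
    """Return the top_n offers with a positive score, best first."""
    scored = [(score(o, consumer), o) for o in offers]
    relevant = [pair for pair in scored if pair[0] > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)  # stable sort
    return [o for _, o in relevant[:top_n]]


carol = Consumer(chosen_brands={"BrandA"}, interests={"shoes"})
offers = [
    Offer("BrandB", "shoes", "flash_sale"),
    Offer("BrandA", "shoes", "daily_deal"),
    Offer("BrandC", "garden", "loyalty"),
]
matched = match(offers, carol)
```

Here `matched` holds the BrandA offer first (brand and category both match), then the BrandB offer (category only); the BrandC offer is filtered out.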
  
Big Picture on Big Data Analytics

                   Key points
  




                         63	
  
Don’t Forget The Basics
•  Metrics and Scorecards are the first steps to awareness
•  They play a huge role in deploying predictive models and
   monitoring and proving their effectiveness
•  Often scorecards require
    –    Going through huge amounts of data to produce the required metrics
    –    Ability to get to the metrics in low latency
    –    Ability to modify metrics and update quickly
    –    Integration with data warehouse
  




                                                    64	
  
Focus On The Right Measure
[Chart: Referral Site Metrics. Response, Conversion and Margin for Site 1 and Site 2; vertical axis 0.0 to 2.5]
•  Total traffic is not a good performance measure
•  High-traffic referral sites often produce poorer quality click-throughs
•  Ads with the best response are not the most effective
•  Target the message
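
A small sketch of why the choice of measure matters: the same two referral sites rank differently depending on whether they are compared by traffic, response or margin. All numbers are made up for illustration and are not read from the chart; the field names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class SiteStats:
    name: str
    visits: int    # referred visits (assumed nonzero)
    clicks: int    # ad responses (assumed nonzero)
    orders: int    # conversions
    margin: float  # total margin earned, in dollars

    @property
    def response_rate(self) -> float:
        return self.clicks / self.visits

    @property
    def conversion_rate(self) -> float:
        return self.orders / self.clicks

    @property
    def margin_per_visit(self) -> float:
        return self.margin / self.visits


# Hypothetical figures: Site 1 has five times the traffic of Site 2.
sites = [
    SiteStats("Site 1", visits=100_000, clicks=5_000, orders=50, margin=1_000.0),
    SiteStats("Site 2", visits=20_000, clicks=600, orders=60, margin=1_500.0),
]

best_by_traffic = max(sites, key=lambda s: s.visits).name
best_by_response = max(sites, key=lambda s: s.response_rate).name
best_by_margin = max(sites, key=lambda s: s.margin_per_visit).name

for s in sites:
    print(f"{s.name}: response={s.response_rate:.3f} "
          f"conversion={s.conversion_rate:.3f} "
          f"margin/visit=${s.margin_per_visit:.3f}")
```

With these invented figures, Site 1 wins on traffic and response while Site 2 wins on margin per visit, which is the pattern the bullets warn about.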
  




                                      65	
  
Sometimes, Simple is Very Powerful!

  Retaining New Yahoo! Mail Registrants
  
Integrating Mail and News
•  Data showed that users often check their mail and news in the
   same session
     –  But no easy way to navigate to Y! News from Y! Mail

•  Mail users who also visit Y! News are 3X more active on Yahoo
     –  Higher retention, repeat visits and time-spent on Yahoo
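
A comparison like "3X more active" can be derived from session logs by splitting Mail users into those who also visit News and those who do not. The sketch below assumes a hypothetical log format of (user id, set of properties visited in the session) and uses session count as the activity measure; the slides do not say how activity was actually measured.

```python
from collections import defaultdict


def activity_ratio(sessions):
    """Ratio of mean sessions per user: Mail+News users vs Mail-only users.

    sessions: iterable of (user_id, set_of_properties_visited).
    """
    session_count = defaultdict(int)
    visited = defaultdict(set)
    for user, props in sessions:
        session_count[user] += 1
        visited[user] |= set(props)

    mail_users = [u for u, p in visited.items() if "mail" in p]
    both = [session_count[u] for u in mail_users if "news" in visited[u]]
    mail_only = [session_count[u] for u in mail_users if "news" not in visited[u]]
    if not both or not mail_only:
        raise ValueError("need at least one user in each group")
    return (sum(both) / len(both)) / (sum(mail_only) / len(mail_only))


# Hypothetical log: user "a" reads mail and news, user "b" reads mail only.
log = [("a", {"mail", "news"})] * 6 + [("b", {"mail"})] * 2
ratio = activity_ratio(log)
print(f"Mail+News users are {ratio:.1f}x as active as Mail-only users")
```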
  



                                                  69	
  
“In the news” Module on Mail Welcome Page

•  Increased retention on Mail for light users by 40%!
     –  Est. incremental revenue of $16m a year on Y! Mail alone
  




                                                       70	
  
Benefits of Advanced Analytics
•  Advanced Analytics brings out the real value of data
•  The business begins to understand the true value and role of
   data in moving the big needles
•  Focus on useful requirements from data, rather than “data
   acrobatics”
•  Value creation from data leads to proper investment scoping
     –  Many are realizing predictive analytics and data mining are much
        more useful than reporting
     –  Integration of the analytics story with data storage is very critical
•  Big data makes analytics even more essential and more useful
     –  Avoiding the challenges of separating analytics from big data is
        increasingly important
  


                                                     71	
  
Thank You! & Questions?

                      Twitter: @usamaf
                      Usama@choozOn.net
                      www.ChoozOn.com
  

"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 

Big Data Analytics: Applications and Opportunities in On-line Predictive Modeling by Usama Fayyad

  • 1. Big Data Analytics: Applications & Opportunities in On-line Predictive Modeling. Usama Fayyad, Ph.D., Chairman & CTO, ChoozOn Corporation. Twitter: @usamaf. August 12, 2012. BigMine: BigData Mining Workshop, KDD-2012 – Beijing, China
  • 2. Outline • Big Data all around us • Introduction to Data Mining and Predictive Analytics • On-line data and facts • Case studies from multiple verticals: 1. Yahoo! Big Data 2. Social Network Data 3. Case Study from nPario Applications 4. ChoozOn Big Data from offers • High-level view: don’t forget the basics • Summary and conclusions
  • 3. What Matters in the Age of Analytics? 1. Being able to exploit all the data that is available • not just what you've got available • what you can acquire and use to enhance your actions 2. Proliferating analytics throughout your organization • make every part of your business smarter 3. Driving significant business value • embedding analytics into every area of your business can help you drive top line revenues and/or bottom line cost efficiencies
  • 4. What Organizations Are Struggling With • Data Strategy - how much data? why data? how does it impact my business? • Prioritization conducted based on business need, not IT – Business justifications for Big Data – Demonstrating value of data in impacting the business – Looking at specialized stores to reduce TCO – File systems for grid computing (Hadoop) • We do need to stay on top of our basic business ops – billing, monitoring, inventory management, etc... – Most can be handled by stream processing and traditional BI • But a new generation of requirements is becoming a priority for data-driven business – Predictive analytics, advanced forecasting, automated detection of events of interest
  • 5. Why Big Data? A new term, with associated “Data Scientist” positions: • Big Data: is a mix of structured, semi-structured, and unstructured data: – Typically breaks barriers for traditional RDB storage – Typically breaks limits of indexing by “rows” – Typically requires intensive pre-processing before each query to extract “some structure” – usually using Map-Reduce type operations • Above leads to “messy” situations with no standard recipes or architecture: hence the need for “data scientists” – conduct “Data Expeditions” – Discovery and learning on the spot
  • 6. What Makes Data “Big Data”? • Big Data is characterized by the 3-V’s: – Volume: larger than “normal” – challenging to load/process • Expensive to do ETL • Expensive to figure out how to index and retrieve • Multiple dimensions that are “key” – Velocity: Rate of arrival poses real-time constraints on what are typically “batch ETL” operations • If you fall behind, catching up is extremely expensive (replicate very expensive systems) • Must keep up with rate and service queries on-the-fly – Variety: Mix of data types and varying degrees of structure • Non-standard schema • Lots of BLOB’s and CLOB’s • DB queries don’t know what to do with semi-structured and unstructured data.
  • 7. The Distinction between “Data” and “Big Data” is fast disappearing • Most real data sets nowadays come with a serious mix of semi-structured and unstructured components: – Images – Video – Text descriptions and news, blogs, etc… – User and customer commentary – Reactions on social media: e.g. Twitter is a mix of data anyway • Using standard transforms, entity extraction, and new generation tools to transform unstructured raw data into semi-structured analyzable data • Hadoop vs. Not Hadoop - when to use what kind of techniques requiring Map-Reduce and grid computing
  • 8. Text Data: The Big Driver • While we speak of “big data” and the “Variety” in 3-V’s • Reality: biggest driver of growth of Big Data has been text data • In fact Map-Reduce was popularized by Google to address the problem of processing large amounts of text data: – Indexing a full copy of the web – Frequent re-indexing – Many operations, each being a simple operation but done at large scale • Most work on analysis of “images” and “video” data has really been reduced to analysis of surrounding text
  • 9. To Hadoop or not to Hadoop? When to use techniques requiring Map-Reduce and grid computing? • Typically organizations try to use Map-Reduce for everything to do with Big Data – This is actually very inefficient and often irrational – Certain operations require specialized storage • Updating segment memberships over large numbers of users • Defining new segments on user or usage data • Map-Reduce is useful when a very simple operation is to be applied on a large body of unstructured data – Typically this is during entity and attribute extraction – Still need Big Data analysis post Hadoop • Map-Reduce is not efficient or effective for tasks involving deeper statistical modeling – good for gathering counts and simple (sufficient) statistics • E.g. how many times a keyword occurs, quick aggregation of simple facts in unstructured data, estimates of variances, density, etc… – Mostly pre-processing for Data Mining
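The "counts and simple statistics" use of Map-Reduce described on slide 9 can be sketched in a few lines. This is a minimal single-machine illustration in plain Python, not Hadoop code; the function names and the two sample documents are assumptions made up for the example.

```python
from collections import Counter
from functools import reduce


def map_phase(document: str) -> Counter:
    """Map step: turn one document into partial keyword counts."""
    return Counter(document.lower().split())


def reduce_phase(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial count tables into one."""
    return left + right


def keyword_counts(documents) -> Counter:
    """Apply the map step to every document, then fold the partial results."""
    return reduce(reduce_phase, map(map_phase, documents), Counter())


# Hypothetical sample input, standing in for a large body of unstructured text.
docs = ["big data is big", "data mining on big data"]
counts = keyword_counts(docs)
print(counts.most_common(2))
```

Each step is a very simple operation, which is why this pattern distributes well; deeper statistical modeling does not decompose this cleanly.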
  • 10.
  • 11.
  • 12. Hadoop Use Cases by Data Type [Chart] • ERP Financial Data / Supply Chain Data / Sensor Data / Financial Trading Data: 1% / 2% / 2% / 4% • CRM Data: 4% • Science Data: 7% • Content and Preference Data: 24% • Advertising Data: 10% • IT Log Data: 19% • Social Data: 11% • Text and Language Data: 16%
  • 13. Big Data Applications and Uses [Diagram labels, in extracted order] 100% Capture; IT Log & Security Forensics & Analytics; Find New Signal; Predict Events; Product Planning; Automated Device Data; Failure Analysis; Analytics; Proactive Fixes; Recommendation; Advertising; Segmentation; Analytics; Social Media; Hadoop + MPP + EDW; Big Data; Cost Reduction; Ad Hoc Insight; Warehouse Analytics; Predictive Analytics
  • 14. Analysis & Programming Software: PIG, HIPI
  • 15.
  • 16. From Basic Dashboards to Advanced Analytics • Data Reduction to get – Advanced views oriented by customer or product – Segmentation – Pattern analysis and summaries • Predictive Analytics – Data Mining – Statistical analysis – Optimization of processes and spend. The same analytics techniques apply across many industries: fraud detection is fraud detection, is fraud detection
  • 17. What is Data Mining? Finding interesting structure in data • Structure: refers to statistical patterns, predictive models, hidden relationships • Interesting: Accurate predictions, associated with new revenue potential, associated with cost savings, enables optimization • Examples of tasks addressed by Data Mining – Predictive Modeling (classification, regression) – Segmentation (Data Clustering) – Affinity (Summarization) • relations between fields, associations, visualization
  • 18. Data Mining and Databases. Many interesting analysis queries are difficult to state precisely • Examples: – Which records represent fraudulent transactions? – Which households are likely to prefer a Ford over a Toyota? – Who is a good credit risk in my customer DB? – Why are these automobiles in need of unusual repairs? • Yet database contains the information – good/bad customer, profitability – did/didn’t respond to mailout/survey/campaign/... – automobile repair and warranty records
  • 19. Many Business Uses (analytic technique: uses in business) • Marketing and sales: identify potential customers; establish the effectiveness of a campaign • Understanding customer behavior: model churn, affinities, propensities, … • Web analytics & metrics: model user preferences from data, collaborative filtering, targeting, etc. • Fraud detection: identify fraudulent transactions • Credit scoring: establish credit worthiness of a customer requesting a loan • Manufacturing process analysis: identify the causes of manufacturing problems • Portfolio trading: optimize a portfolio of financial instruments by maximizing returns & minimizing risks • Healthcare application: fraud detection, cost optimization, detection of events like epidemics, etc... • Insurance: fraudulent claim detection, risk assessment • Security and surveillance: intrusion detection, sensor data analysis, remote sensing, object/person detection, link analysis, etc...
  • 20. So this Internet thing is going to be big! Big opportunity, Big Data, Big Challenges!
  • 21. Stats about on-line usage • How many people are on-line today? – 2.1 Billion (per Comscore estimates) – 30% of world population • How much time is spent on-line per month by the whole world? – 4M person-years per month • How many hours per month per Internet user? – 16 hours (global average) – 32 hours (U.S. average) *Sources: Feb. 2012 - from Go-Gulf.com compiled from Comscoredatamine.com, Nielsen.com, thisDigitalLife.com, PewInternet.org
  • 22. How Are Users Distributed Geographically *Sources: Feb. 2012 - from Go-Gulf.com compiled from Comscoredatamine.com, Nielsen.com, thisDigitalLife.com, PewInternet.org
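The figures on slide 21 can be cross-checked with simple arithmetic: users times hours per user, converted to years. The sketch below assumes a 365-day year of 24-hour days; the constant names are illustrative only.

```python
# Figures quoted on slide 21.
ONLINE_USERS = 2.1e9              # people on-line
HOURS_PER_USER_PER_MONTH = 16     # global average
HOURS_PER_YEAR = 24 * 365         # assumption: 365-day year

total_hours = ONLINE_USERS * HOURS_PER_USER_PER_MONTH
person_years = total_hours / HOURS_PER_YEAR

print(f"{person_years / 1e6:.1f}M person-years per month")
```

The result is a little under 4 million person-years per month, which agrees with the rounded "4M" on the slide.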
  • 23. How Do People Spend Their On-line Time? • On-line Shopping? 5% • Searches? 21% • Email/Communication? 19% • Reading Content? 20% • Social Networking? 22% • Multimedia Sites? 13%
  • 24. Most Popular Activities On-Line? *Sources: Feb. 2012 - from Go-Gulf.com compiled from Comscoredatamine.com, Nielsen.com, thisDigitalLife.com, PewInternet.org
  • 25. Top 10 Sites Visited? • Google: 153.4M visitors/month – each spending 1h 47mins • Facebook: 137.6M visitors per month – each spending 7h 50mins
  • 26. Interesting Events • Google: How many queries per day? – More than 1 Billion • Twitter: How many Tweets/day? – More than 250M • Facebook: Updates per day? – More than 800M • YouTube: Views/day – 4 Billion views – 60 hours of video uploaded every minute! • Social Networks: users who have used sites for spying on their partners? – 56% *Sources: Feb. 2012 - from Go-Gulf.com compiled from Comscoredatamine.com, Nielsen.com, thisDigitalLife.com, PewInternet.org
  • 27. Interesting Events • Country with highest number of online friends? – Brazil – 481 friends per user – Japan has least at 29 • Country with maximum time spent shopping on-line? – China: 5 hours/week *Sources: Feb. 2012 - from Go-Gulf.com compiled from Comscoredatamine.com, Nielsen.com, thisDigitalLife.com, PewInternet.org
  • 28. So Internet is a big place with lots happening? Do we understand what each individual is trying to achieve? • What is user intent? • Critical in monetization, advertising, etc… Do we understand what a community’s sentiment is? • What is the emotion? • Is it negative or positive? • What is the health of my brand online? Do we understand context and content? • What are appropriate ads? • Is it OK to associate my brand with this content? • Is content sad? happy? serious? informative?
  • 29. Case Studies: Yahoo! Big Data
  • 30. Yahoo! – One of Largest Destinations on the Web. 80% of the U.S. Internet population uses Yahoo! – Over 600 million users per month globally! • Global network of content, commerce, media, search and access products • 100+ properties including mail, TV, news, shopping, finance, autos, travel, games, movies, health, etc. • 25+ terabytes of data collected each day • Representing 1000’s of cataloged consumer behaviors. More people visited Yahoo! in the past month than: • Use coupons • Vote • Recycle • Exercise regularly • Have children living at home • Wear sunscreen regularly. Data is used to develop content, consumer, category and campaign insights for our key content partners and large advertisers. Sources: Mediamark Research, Spring 2004 and comScore Media Metrix, February 2005.
  • 31. Yahoo! Big Data – A league of its own… [Two bar charts: “Terabytes of Warehoused Data” and “Millions of Events Processed Per Day”. Values as extracted: 14,000; 5,000; 1,000; 500; 2,000; 25; 49; 94; 100; 50; 120; 225. Labels as extracted: Amazon; Walmart; warehouse; Telecom Warehouse; AT&T; Y! Panama; Y! LiveStor; Korea; Y! Main; SABRE; VISA; NYSE; YSM; Y! Global] GRAND CHALLENGE PROBLEMS OF DATA PROCESSING: TRAVEL, CREDIT CARD PROCESSING, STOCK EXCHANGE, RETAIL, INTERNET. Y! Data Challenge exceeds others by 2 orders of magnitude
  • 32. Behavioral Targeting (BT) [Diagram labels: Search, Content, Search Clicks, Ad Clicks, BT] Targeting your ads to consumers whose recent behaviors online indicate that your product category is relevant to them
  • 33. Yahoo! User DNA [Profile diagram with attribute groups Registration, Campaign, Behavior, Unknown] Male, age 32; Lawyer; Lives in SF; Clicked on Sony Plasma TV SS ad; Checks Yahoo! Mail daily via PC & Phone; Searched on: “Hillary Clinton”; Searched on: “Italian restaurant Palo Alto”; Spends 10 hours/week on the internet; Purchased Da Vinci Code from Amazon; Has 25 IM Buddies, moderates 3 Y! Groups, and hosts a 360 page viewed by 10k people (remaining extracted fragments: “Searched on”, “from London”, “last week”) • On a per consumer basis: maintain a behavioral/interests profile and profitability (user value and LTV) metrics
  • 34. How it works | Network + Interests + Modelling [Column headings: Varying Product Purchase Cycles; Rewarding Good Behaviour; Match Users to the Models; Identify Most Relevant Users] • Analyze predictive patterns for purchase cycles in over 100 product categories • In each category, build models to describe behaviour most likely to lead to an ad response (i.e. click) • Score each user for fit with every category…daily • Target ads to users who get highest ‘relevance’ scores in the targeting categories
  • 35. Recency Matters, So Does Intensity: Active now… …and with feeling
  • 36. Differentiation | Category specific modelling. Example 1: Category Automotive. Example 2: Category Travel/Last Minute. [Two charts: intensity score over time, each marking an Intense Click Zone] Alt Behaviour 1: 5 pages, 2 search keywords, 1 search click, 1 ad click (same behaviour shown for both categories). Different models allow us to weight and determine intensity and recency
  • 37. Differentiation | Category specific modelling. Example 1: Category Automotive. [Chart: intensity score over time; user is in the Intense Click Zone; with no further activity, decay takes effect] Alt Behaviour 1: 5 pages, 2 search keywords, 1 search click, 1 ad click. Different models allow us to weight and determine intensity and recency
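Slides 36 and 37 describe a per-category intensity score that weights recent events and decays when activity stops. The deck does not give the actual model, so the following is only a sketch under stated assumptions: the event weights, the exponential-decay form, and the half-life parameter are all hypothetical.

```python
import math

# Hypothetical per-event weights; the slides do not state actual values.
EVENT_WEIGHTS = {
    "page_view": 1.0,
    "search_keyword": 2.0,
    "search_click": 3.0,
    "ad_click": 5.0,
}


def intensity_score(events, now, half_life_days):
    """Sum event weights, each discounted exponentially by the event's age.

    events: iterable of (event_type, day) pairs; now: current day number.
    half_life_days: assumed category-specific decay parameter.
    """
    decay_rate = math.log(2) / half_life_days
    return sum(
        EVENT_WEIGHTS[kind] * math.exp(-decay_rate * (now - day))
        for kind, day in events
    )


# "5 pages, 2 search keywords, 1 search click, 1 ad click", all on day 0.
events = (
    [("page_view", 0)] * 5
    + [("search_keyword", 0)] * 2
    + [("search_click", 0), ("ad_click", 0)]
)
fresh = intensity_score(events, now=0, half_life_days=7)
stale = intensity_score(events, now=7, half_life_days=7)
print(fresh, stale)
```

In this sketch a category with a short purchase cycle (such as last-minute travel) would use a shorter half-life than one with a long cycle (such as automotive), so the same behaviour yields different scores per category.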
  • 38. Automobile Purchase Intender Example • A test ad-campaign with a major Euro automobile manufacturer – Designed a test that served the same ad creative to test and control groups on Yahoo – Success metric: performing specific actions on Jaguar website • Test results: 900% conversion lift vs. control group – Purchase Intenders were 9 times more likely to configure a vehicle, request a price quote or locate a dealer than consumers in the control group – ~3x higher click through rates vs. control group
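The slide does not define how lift is computed. One common convention, assumed here, is the relative increase of the test group's rate over the control group's rate; the function name and the two conversion rates below are hypothetical values chosen for illustration.

```python
def lift_percent(test_rate: float, control_rate: float) -> float:
    """Relative lift of the test group over the control group, in percent."""
    if control_rate <= 0:
        raise ValueError("control_rate must be positive")
    return (test_rate / control_rate - 1.0) * 100.0


# Hypothetical conversion rates, not taken from the campaign.
control_rate = 0.002
test_rate = 0.020

lift = lift_percent(test_rate, control_rate)
print(f"conversion lift: {lift:.0f}%")
```

Under this convention a 900% lift means the test group converts at ten times the control rate, that is, nine times the control rate on top of the baseline.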
  • 39. Mortgage Intender Example. Results from a client campaign on Yahoo! Network. Example: Mortgages. +626% Conv Lift; +122% CTR Lift. We found: 1,900,000 people looking for mortgage loans. Example search terms qualified for this target: Mortgages; Home Loans; Refinancing; Ditech. Example Yahoo! pages visited: Financing section in Real Estate; Mortgage Loans area in Finance; Real Estate section in Yellow Pages. Source: Campaign Click thru Rate lift is determined by Yahoo! Internal research. Conversion is the number of qualified leads from clicks over number of impressions served. Audience size represents the audience within this behavioral interest category that has the highest propensity to engage with a brand or product and to click on an offer. Date: March 2006
  • 40. Experience summary at Yahoo! • Dealing with the largest data source in the world (25 Terabyte per day) • BT business was grown from $20M to about $500M in 3 years of investment! • Building the largest database systems: – World’s largest Oracle data warehouse – World’s largest single DB – Over 300 data mart data – Analytics with thousands of KPI’s – Over 5000 users of reports – Largest targeting system in the world • Big demands for grid computing (Hadoop)
  • 41. Social Network • Social Graph Analysis (no time) • Social Network Marketing • Understanding Context for Ads
  • 42. Case Study: TWITTER Social Marketing? Viacom’s VH1 Twitter campaign on ANVIL (the movie) (week of May 14th, 2009 – see AdAge article). Marketing ANVIL movie: very niche audience, how do you reach them? Diffusion: give 20 free DVD’s to major related artists/groups, ask them to notify Twitter groups – reached over 2M people. Social Identity: power of word of mouth... What is the cost to VH-1? Compare with traditional approach: TV commercials to promote a documentary film?
  • 43. The Display Ads Challenge Today: What Ad would you place here?
  • 44. The Display Ads Challenge Today: Damaging to Brand?
  • 45. The Display Ads Challenge Today: What Ad would you place here?
  • 46. The Display Ads Challenge Today: Irrelevant and Damaging to Brand; Completely Irrelevant
  • 47. NetSeer: Intent for Display • Currently Processing 4 Billion Impressions per Day
  • 48. Problem: Hard to Understand User Intent. Contextual Ad served by Google. What NetSeer Sees:
  • 49. Case Studies: nPario – Data Management Platform; ChoozOn – Big Data over Offers Universe
  • 50. Example of a Big Data DMP. Scale: nPario builds an infinitely scalable data management platform (DMP) that allows advertisers and marketers to manage, understand, and monetize their data. Their technology has been proven at companies such as Yahoo and EA. Applications: nPario applications include segmentation of audiences for increasing the value of advertising, reporting/analytics for examining performance, attribution to show which advertising works, and experimentation to test ideas. Access: nPario emphasizes putting access to data in the hands of marketing, advertising and other business users
  • 51. Powerful Technology. 540m Users, 8+ Petabytes, & 16 Patents. nPario has the only commercially available Big Data management technology built for one of the “Big Five”. In Production at Yahoo: nPario technology manages the world’s largest data system. Used for Yahoo’s Marketing and Advertising business. Used across Yahoo’s platforms and 120 online properties. Used by hundreds of analysts. Significant Investment: nPario’s technology is the result of more than $50m investment in development and 16 issued patents.
  • 52. DMP [Architecture diagram labels] Data Sources: First-Party; Offline Data; CRM Data; Tag Mgmt; Events; Multiple Channels (Search, eMail, Mobile, Video, Display, Site); Third-Party. Data Management Platform: Self-Service Data Intake & ETL (Processing, Enrichment, Normalization); Deep Insights & Big Data Analytics (Insights and Analytics, Modeling & Discovery); Self-Service Applications (User DNA, Segment, Target, Attribution, Experimentation); Integration & APIs; Real-Time Data Availability
  • 53. nPario Case Study: Marketing to 100+ million Gamers. Challenge: Provide cross-platform campaign insights for advertisers and enable audience discovery across channels. Result: Unified view of gamers across multiple cross platform data sources: Pogo (online), sponsored content, Console game interaction and ad interaction (Xbox, PS3), Mobile, Playfish (facebook), external sources (Collective Media, Comscore, Omniture, Dynamic Logic). “EA increased its worldwide audience reach by 30% this year […]. Combining that major jump in reach with the launch of EA Legend puts us in perfect position to compete” Dave Madden, Senior Vice President of Global Media Solutions, Electronic Arts.
  • 54. Big Data in Action: EA Legend
  • 55. Specialized Search through Big Data Analytics over the Offers Universe
  • 56. Chaos for Consumers [Diagram: Consumer surrounded by] Deep Discount Sites; Coupon Sites; Flash Sale Sites; Loyalty Programs; Daily Deals Sites; Online Loyalty Programs; Social Networks; Search Engines
  • 57. Chaos for Marketers [Diagram labels] Deep Discount Sites; Flash Sale Sites; Coupon Sites; Loyalty Programs; Daily Deals Sites; Online Loyalty Programs; Social Networks; Search Engines
  • 58. What Consumers & Marketers Want. Consumers • Value from brands they love • Tame the deal chaos. Marketers • Reach targeted consumers • Build loyalty • Create brand evangelists
  • 59. Keys to Being THE Consumer Network: Comprehensive Deal Coverage; Personalization; Intelligent Matching; Permission-based Targeting Solution; Loyalty Solution; Multi-Channel Reach. 10/18/2010 ChoozOn Confidential
  • 60. Solution for Consumers. Sources: Daily Deals Sites; Flash Sale Sites; Deep Discounters; Affiliate Deals & Coupon Sites; Loyalty Programs. Choozer Interests & Preferences: Machine-based Intelligent Matching; Social (Pals, Clubs). Channels: Web App; Mobile App; Email; Digital Media
  • 61. Carol’s Consumer Network: “Chozen” Brands; Interests (Intent); Shopping Pals; Deal Clubs
  • 62. What Carol Gets: Carol’s Consumer Network; The Universe of Deals (Affiliate Deals, Loyalty Programs, Daily Deals, Flash Sales, Deep Discounters; 1,500+ brands; 100,000+ offers); PLUS her Inbox™; Intelligent Matching; A Personal Shopper for Deals
  • 63. Big Picture on Big Data Analytics: Key points
  • 64. Don’t Forget The Basics • Metrics and Scorecards are the first steps to awareness • Plays a huge role in deploying predictive models and monitoring and proving their effectiveness • Often scorecards require – Going through huge amounts of data to produce the required metrics – Ability to get to the metrics in low latency – Ability to modify metrics and update quickly – Integration with data warehouse
  • 65. Focus On The Right Measure [Chart: Referral Site Metrics; Response, Conversion and Margin for Site 1 and Site 2] • Total traffic not a good performance measure • High-traffic referral sites often produce poorer quality click throughs • Ads with best response not most effective • Target the message
  • 66. Focus On The Right Measure [Chart: Referral Site Metrics; Response, Conversion and Margin for Site 1 and Site 2] • Total traffic not a good performance measure • High-traffic referral sites often produce poorer quality click throughs • Ads with best response not most effective • Target the message
  • 67. Focus On The Right Measure [Chart: Referral Site Metrics; Response, Conversion and Margin for Site 1 and Site 2] • Total traffic not a good performance measure • High-traffic referral sites often produce poorer quality click throughs • Ads with best response not most effective • Target the message
  • 68. Sometimes, Simple is Very Powerful! Retaining New Yahoo! Mail Registrants
  • 69. Integrating Mail and News • Data showed that users often check their mail and news in the same session – But no easy way to navigate to Y! News from Y! Mail • Mail users who also visit Y! News are 3X more active on Yahoo – Higher retention, repeat visits and time-spent on Yahoo
  • 70. “In the news” Module on Mail Welcome Page • Increased retention on Mail for light users by 40%! – Est. incremental revenue of $16m a year on Y! Mail alone
  • 71. Benefits of Advanced Analytics • Advanced Analytics brings out the real value of data • The business begins to understand the true value and role of data in moving the big needles • Focus on useful requirements from data, rather than “data acrobatics” • Value creation from data leads to proper investment scoping – Many are realizing predictive analytics and data mining are much more useful than reporting – Integration of analytics story with data storage very critical • Big data makes analytics even more essential and more useful – Avoiding the challenges of separating analytics from big data is increasingly important
  • 72. Thank You! & Questions? Twitter: @usamaf Usama@choozOn.net www.ChoozOn.com