Hadoop vs. RDBMS for Advanced Analytics

  1. Hadoop vs. RDBMS for Advanced Analytics (Josh Wills, April 26th, 2012)
  2. About Me
     • jwills@cloudera.com
     • Formerly of Google (2008 – 2011)
       • Worked on the ad auction
       • Led the team that built the data infrastructure for Google+
     • Before that: a bunch of startups
       • Sometimes as a software engineer, sometimes as a statistician
     • Math degree from Duke and a half-finished PhD from The University of Texas at Austin
     • Now: Director of Data Science at Cloudera
     Copyright 2012 Cloudera Inc. All rights reserved
  3. Getting Started with Hadoop: Apache Hive
     • Stick with the relational models that you are used to working with
     • Great for the common starter use cases
       • Logs processing
       • Online data archival
       • ETL/ELT
  4. Hadoop for Advanced Analytics: When Should I Use Hadoop instead of an RDBMS?
  5. First Symptom: COUNT DISTINCT
  6. Second Symptom: Cursors
  7. Third Symptom: ALTER TABLE OF_DOOM
  8. The Unit of Analysis Problem
     • Data warehouses are optimized to analyze transactions
     • Awesome for finance and ERP
     • Not ideal for product and marketing
     • A function of what databases are good at
  9. What Are You Trying to Analyze?
     • Simple Entities
       • Static attributes
       • Flat data structure
       • Transient
       • Examples: SKUs, line items from an invoice, log messages
     • Complex Entities
       • Evolving attributes
       • Hierarchical data structure
       • Persistent
       • Examples: customers, suppliers, website visitors
  10. Rods and Cones vs. Facial Recognition
  11. Structure the Data to Fit the Problem
     • HDFS lets us store our data however we want
     • We can choose storage schemas that are:
       • Flexible
       • Evolvable
       • Compact
       • Fast to serialize and deserialize
  12. Advanced Analytics: Use Cases
  13. Simple Counts on Complex Objects
  14. Self-Self-Self-Joins
  15. Matching Problems
  16. We’re Hiring. jwills@cloudera.com
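
The contrast between simple and complex entities (slide 9) and the idea of simple counts on complex objects (slide 13) can be sketched in a few lines. This is a minimal illustration in plain Python, not Hadoop or Hive code and not taken from the talk: the field names (`visitor_id`, `ts`, `page`), the helper `build_visitors`, and the sample data are all assumptions made up for the example. Flat log messages are grouped into one nested record per website visitor, after which a question that would need COUNT DISTINCT over flat rows becomes a plain count over objects.

```python
from collections import defaultdict

# Hypothetical flat log messages (simple entities); field names are assumptions.
log_messages = [
    {"visitor_id": "v1", "ts": 1, "page": "/home"},
    {"visitor_id": "v2", "ts": 2, "page": "/home"},
    {"visitor_id": "v1", "ts": 3, "page": "/pricing"},
    {"visitor_id": "v1", "ts": 4, "page": "/home"},
]


def build_visitors(messages):
    """Group flat log messages into one nested record per visitor."""
    visitors = defaultdict(lambda: {"events": []})
    for msg in messages:
        visitors[msg["visitor_id"]]["events"].append(
            {"ts": msg["ts"], "page": msg["page"]}
        )
    # Keep each visitor's events in time order.
    for record in visitors.values():
        record["events"].sort(key=lambda event: event["ts"])
    return dict(visitors)


visitors = build_visitors(log_messages)

# A simple count on complex objects: visitors who viewed /pricing at least once.
pricing_visitors = sum(
    1
    for record in visitors.values()
    if any(event["page"] == "/pricing" for event in record["events"])
)
```

Because each visitor is a single hierarchical record, the count touches every visitor once and needs no de-duplication step.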

Editor's notes

  • How do you know you have a unit of analysis problem? You’re doing a bunch of COUNT DISTINCT queries. You’re doing LAG/LEAD-style queries, or using a cursor.
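
As a rough illustration of what a LAG/LEAD-style question looks like once the data is organized around the entity, the sketch below computes the gap between consecutive events for each visitor. It is plain Python rather than SQL or Hadoop code, and the row layout `(visitor_id, timestamp)`, the function name `gaps_per_visitor`, and the sample values are assumptions invented for the example.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical flat rows: (visitor_id, timestamp). Values are made up.
rows = [("v1", 10), ("v2", 12), ("v1", 25), ("v1", 31), ("v2", 40)]


def gaps_per_visitor(rows):
    """Return, for each visitor, the time gaps between consecutive events."""
    # Sort by visitor, then by timestamp, so groupby sees each visitor once.
    ordered = sorted(rows, key=itemgetter(0, 1))
    result = {}
    for visitor_id, group in groupby(ordered, key=itemgetter(0)):
        times = [ts for _, ts in group]
        # Pair each event with the next one: the equivalent of a LEAD lookup.
        result[visitor_id] = [later - earlier for earlier, later in zip(times, times[1:])]
    return result


gaps = gaps_per_visitor(rows)
```

Once all of a visitor's events sit together in order, the "previous row" and "next row" lookups reduce to walking a list, with no cursor or window function involved.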
