
SQL on Hadoop for Enterprise Analytics

Hear why SQL on Hadoop is the future of analytics. Mike Xu, Looker Data Architect and Evangelist, shares how recent updates to SQL query engines like Spark and Presto are finally allowing companies to harness Hadoop's processing power for analytics. Looker Data Analyst Eric Feinstein shares how a top-10 health insurance company built an in-cluster data platform using Looker to make all their data in Hadoop accessible to thousands of analysts and business users across the company every day.

In this webinar you will:
• Learn the fundamentals of modeling data out of Hadoop, including Spark
• Hear best practices for navigating joins
• See case study demonstrations

Published in: Technology


  1. A LITTLE BIT OF HISTORY: Everything old is new again. SQL forever.
  2. The story so far: Why hasn't SQL died yet? It's 2016 and we're still using it?!
  3. Everything old is new again: Existing architecture keeps reappearing. It takes time to figure out which tools are right for which jobs. SQL is still the best tool for business analytics.
  4. A long long time ago…
  5. Growing pains (late 1990s)
  6. Database problems: database outages, data integrity issues, data latency (late 1990s)
  7. Master/slave (late 1990s)
  8. Transactions (late 1990s)
  9. Performance (late 1990s)
  10. By the time I graduated, SQL was on its last legs (2009)
  11. Cache all the things! (2009)
  12. Stop copying Twitter! (2009)
  13. SQL golden age ends, NoSQL takes off (2010): column, graph, key-value, document
  14. NoSQL (2010)
  15. Awesome things about NoSQL: no SQL, normal languages as APIs! Non-relational! FAST! (2010)
  16. Remember ORMs? (~2000)
  17. Active Record (~2000)
  18. ORMs 👎 (2011)
  19. Remember EAV (Entity-Attribute-Value)? (1968)
  20. Kind of looks like columns… (1968)
  21. Modern EAV (2010)
  22. Tedious to query (2010)
  23. Voila! (2010)
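The "tedious to query" complaint about EAV can be made concrete with a toy sketch using Python's built-in sqlite3. The table and attribute names here are invented for illustration; the point is that every attribute you want back as a column needs its own conditional aggregate (or self-join) to pivot the triples.

```python
import sqlite3

# Hypothetical EAV table: one row per (entity, attribute, value) triple.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eav (entity TEXT, attribute TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO eav VALUES (?, ?, ?)",
    [
        ("user:1", "name", "Ada"),
        ("user:1", "plan", "gold"),
        ("user:2", "name", "Max"),
        ("user:2", "plan", "free"),
    ],
)

# Pivoting triples back into one row per entity: one MAX(CASE ...) per
# attribute. With dozens of attributes this gets tedious fast.
rows = conn.execute("""
    SELECT entity,
           MAX(CASE WHEN attribute = 'name' THEN value END) AS name,
           MAX(CASE WHEN attribute = 'plan' THEN value END) AS plan
    FROM eav
    GROUP BY entity
    ORDER BY entity
""").fetchall()
print(rows)  # [('user:1', 'Ada', 'gold'), ('user:2', 'Max', 'free')]
```

A plain columnar table would return the same result with `SELECT entity, name, plan FROM users`, which is the slide's point.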
  24. No joins is a feature! (2010)
  25. NoSQL has some rough bumps (2010)
  26. NoSQL has A LOT of rough bumps… (2011)
  27. Throwback Thursday! (2011)
  28. Lock the doors (2011)
  29. MPP columnar DBs! Wait... SQL is back?! (2015)
  30. SQL on Hadoop (2016)
  31. A long long time ago…
  32. What's next? (~2020?)
  33. What's next? (~2020?) "If you have an architecture where you're periodically trying to dump from one system to the other and synchronize, you can simplify your life quite a bit by just putting your data in this storage system called Kudu." – Todd Lipcon
  34. SQL is far past hype
  35. Fin: "If it ain't broke, don't fix it"
  36. CUSTOMER STORY: Building an event analytics pipeline using Hadoop and Spark
  37. Why Consider a Big Data Pipeline? You are rapidly exceeding the limits of your existing database. Everything on your website can be analyzed, and waiting until the next day isn't for you. Data comes and goes to many places, and you want one process for it.
  38. BIG DATA CULTURE: Summary data is not good enough. The company is mandating new technologies. You want to build a data-driven culture, and big SQL is the heart of a data-driven culture.
  39. CASE STUDY: A major healthcare provider wants to create a web event pipeline that: scales massively during periods of healthcare registration and new coverage starts, and can dial back the rest of the year; handles large data volumes (10-15M customers' worth of data) and provides data for analysis in under 1 minute; utilizes existing in-house technologies (such as Cloudera Impala); and processes all events: page loads, registrations, logins, errors.
  40. Solution: Build an event processing framework. Events → Event Collector → Hadoop → ?
  41. High-level process: Events → Event Collector → Message Processing → HDFS → Looker (message processing still to be designed)
  42. Why is Hadoop so hard? Need to write in Java and Scala. We don't have structure. Not easy to get data out into BI tools. Event collectors don't tend to feed to HDFS out of the box. Typically follows a batch processing framework.
  43. Ingestion mechanism: Our ideal ingestion would have three key aspects: low latency; in-flight transformation and processing; ability to populate multiple destinations.
  44. Spark vs. Storm: two of the major players in data streaming/processing. Spark: own master server; runs on HDFS; micro-batching; exactly-once delivery (eliminates vulnerability). Storm: not native to Hadoop; less developed; one-at-a-time processing; ETL in flight; sub-second latency.
  45. Flume: Source → Interceptor → Selector → Channel → Sinks, all managed by the Flume agent; web servers feed the channel and the sink writes to HDFS. No in-flight transformation, so this just needs to meet the workload.
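The Source → Channel → Sink flow on this slide maps directly onto Flume's agent configuration format. A minimal sketch follows; the agent and component names (a1, r1, c1, k1), ports, and the HDFS path are invented for illustration, not taken from the deck.

```properties
# Sketch of a Flume agent matching the slide's flow.
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: receives events forwarded from the web-server agents.
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
a1.sources.r1.channels = c1

# Channel: no in-flight transformation, so it is sized only to the workload.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: lands the raw events in HDFS.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /events/%Y-%m-%d
a1.sinks.k1.channel = c1
```

Interceptors and channel selectors, when needed, are attached to the source in the same property-file style.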
  46. Kafka: Producers → Brokers (coordinated by ZooKeeper) → Consumers (Spark Streaming, others)
  47. Flume vs. Kafka: Use both, out of the box, with Flafka and native connectors: Flume → Kafka source → Spark, instead of custom connectors on each side.
  48. Storing the output: Our streaming Spark cluster consumes messages from Kafka; we batch these every minute into an HDFS cluster. We chose this because: data can be queried via Hive, Impala, or Spark SQL; Cloudera is our enterprise choice; we can process a subset in-stream with MLlib or other machine learning algorithms; and we can output summaries to other RDBMS systems.
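The "batch every minute into HDFS" step can be sketched in plain Python. The real pipeline does this with Spark Streaming's micro-batch intervals; the `MicroBatcher` class and its names here are invented stand-ins that just show the batching logic (buffer messages, flush one batch per interval).

```python
import time

class MicroBatcher:
    """Toy sketch: buffer incoming messages and flush one batch per interval.
    self.batches stands in for "write a batch to HDFS"."""

    def __init__(self, interval_s=60, clock=time.monotonic):
        self.interval_s = interval_s
        self.clock = clock          # injectable so the example runs instantly
        self.buffer = []
        self.batches = []
        self.last_flush = clock()

    def add(self, message):
        self.buffer.append(message)
        if self.clock() - self.last_flush >= self.interval_s:
            self.flush()

    def flush(self):
        if self.buffer:
            self.batches.append(self.buffer)
            self.buffer = []
        self.last_flush = self.clock()

# Simulated clock: five events arriving 20 "seconds" apart.
now = [0.0]
b = MicroBatcher(interval_s=60, clock=lambda: now[0])
for i in range(5):
    b.add(f"event-{i}")
    now[0] += 20
b.flush()  # final drain
print(b.batches)
# [['event-0', 'event-1', 'event-2', 'event-3'], ['event-4']]
```

The one-minute interval is the slide's trade-off: small enough that data is queryable in under a minute, large enough to avoid the many-small-files problem in HDFS.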
  49. Final result: Events → Event Collector → Kafka/Flume → Spark SQL → Cloudera, plus other storage (RDBMS, logs)
  50. Pipeline summary: Our pipeline provides several points of flexibility as well as meeting our key priorities: add data at any point of the pipeline; Kafka, Flume, Impala, and Looker work without many custom connectors; the pipeline includes additional sources like Teradata and Oracle; add in-flight predictive model training and execution without significant additional processing time.
  51. Priority #1: Scale. Kafka is easy to scale: as more volume comes in, adding new brokers can be automated using the Partition Reassignment Tool. By monitoring batch times in Looker on Spark SQL, we can alert when we need to scale up the cluster, using Scheduled Looks.
  52. Priority #2: Flexibility. Different events can be parsed out to different Spark Streaming applications with Kafka topics (or another type of consumer). Add more data at any point (Flume, a Kafka producer, or directly to Spark). Looker connects to wherever the data lands, as long as we can query it, and performs analysis IN CLUSTER.
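The per-event-type fan-out described on slide 52 amounts to routing each event to its own Kafka topic so each Spark Streaming application subscribes only to what it cares about. A minimal sketch, with invented topic names (the deck does not list its actual topics):

```python
# Hypothetical event-type → Kafka-topic routing table.
TOPIC_BY_EVENT = {
    "pageload": "web.pageloads",
    "registration": "web.registrations",
    "login": "web.logins",
    "error": "web.errors",
}

def route(event):
    # Unknown event types fall through to a catch-all topic so nothing is dropped.
    return TOPIC_BY_EVENT.get(event["type"], "web.misc")

events = [{"type": "login"}, {"type": "error"}, {"type": "click"}]
print([route(e) for e in events])  # ['web.logins', 'web.errors', 'web.misc']
```

In the real pipeline this lookup would sit in the producer (or a Flume interceptor), and each downstream consumer simply subscribes to its topic.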
  53. Priority #3: Speed. Analyzing the stream: events per hour; identify missing batches; volume and timing; right-sizing hardware; duplicate events and missing information.
  54. Priority #4: In-house technologies. Provide access to Hadoop/Impala via a centralized data hub: a single place to access web-based reports, explores, BI tools, and code libraries. Enable users to ask questions and query web data without writing SQL or knowing about the pipeline.
  55. Analyzing the stream: looking for lost data (source ≠ destination)
  56. Analyzing the stream: By connecting Looker to various points in the stream (Impala SQL, source logs, summary reports), we can verify complete loads. We also mask the location of the information; one dashboard may show a variety of reliable sources.
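The load verification on slide 56 is, at its core, count reconciliation: compare the event counts reported at different points in the stream and flag any point that disagrees. A minimal sketch, with invented source names and made-up numbers (the deck runs this comparison through Looker, not code like this):

```python
def verify_loads(counts_by_source):
    """Treat the raw source logs as ground truth and flag each
    downstream count as matching (True) or not (False)."""
    baseline = counts_by_source["source_logs"]
    return {name: n == baseline for name, n in counts_by_source.items()}

counts = {"source_logs": 10_000, "impala": 10_000, "summary_reports": 9_950}
print(verify_loads(counts))
# {'source_logs': True, 'impala': True, 'summary_reports': False}
```

A `False` at one hop but not the next localizes where in the pipeline data was lost (or duplicated), which is exactly what the dashboards are for.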
  57. Other uses and benefits: match data in flight to find bad user accounts; in-flight alerts for missing data; analysis without needing to know the location in the stream; a SQL-on-Hadoop BI solution doesn't require a new skillset.
  58. THANK YOU!
  59. Sources: Security.pdf