
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop

17,889 views


http://bit.ly/1BTaXZP – Apache Spark is currently one of the most active projects in the Hadoop ecosystem, and as such, there’s been plenty of hype about it in recent months. But how much of the discussion is marketing spin, and what are the facts? MapR and Databricks, the company that created and led the development of the Spark stack, cut through the noise to uncover the practical advantages of having the full set of Spark technologies at your disposal and reveal the benefits of running Spark on Hadoop.

This presentation was given at a webinar hosted by Data Science Central and co-presented by MapR + Databricks.

To see the webinar, please go to: http://www.datasciencecentral.com/video/let-spark-fly-advantages-and-use-cases-for-spark-on-hadoop

Published in: Technology

Let Spark Fly: Advantages and Use Cases for Spark on Hadoop

  1. 1. Let Spark Fly: Advantages and Use Cases for Spark on Hadoop – adawar@mapr.com, pat.mcdonough@databricks.com (© 2014 MapR Technologies)
  2. 2. About MapR and Databricks. Databricks: • Project leads for Spark, formerly with UC Berkeley’s AMPLab • Founded in June 2013 and backed by Andreessen Horowitz • Strong engineering focus. MapR: • Top-ranked distribution for Hadoop* • Hundreds of deployments – 17 of the Fortune 100 – Largest deployment in FSI (1000+ nodes) • Strong focus on making Hadoop resilient and enterprise grade • Worldwide presence. (* Forrester Wave Big Data Hadoop Solutions, Q1 2014)
  3. 3. Hadoop Evolves. Make it solid: • HA: eliminate SPOFs • Data Protection: recover from application/user errors • Disaster Recovery: data center outages • Enterprise Integration: breaking the wall that separates Hadoop from the rest • Security & Multi-tenancy: sharing the cluster and meeting SLAs, secure authorization, data governance. Make it do more (easily): • Interactive apps (i.e. SQL) • Iterative programs • Streaming apps • Medium/small data • Architecture: using memory efficiently • How many different tools should it take? – It’s hard to get interoperability among different data-parallel models right – Learning curves and operational costs increase with each new tool
  4. 4. MapR – Top-ranked Hadoop distribution* [platform diagram: the MapR Data Platform with management, security, YARN, and the MapReduce v1 & v2 execution engines, plus the Apache Hadoop and OSS ecosystem – Pig, Cascading, Storm**, Tez**, HBase, Solr, Accumulo**, Hive, Impala, Drill**, Mahout, Sentry**, Oozie, ZooKeeper, Sqoop, Knox**, Whirr, Falcon**, Flume, HttpFS, Hue, Juju, Savannah** – spanning batch, streaming, NoSQL & search, SQL, ML/graph, workflow & data governance, data integration & access, and provisioning/coordination; ** certification/support planned for 2014]. Enterprise-grade: • High availability • Data protection • Disaster recovery. Interoperability: • Standard file access • Standard database access • Pluggable services • Broad developer support. Security: • Enterprise security authorization • Wire-level authentication • Data governance. Operational: • Ability to support predictive analytics, real-time database operations, and high-arrival-rate data. Multi-tenancy: • Ability to logically divide a cluster to support different use cases, job types, user groups, and administrators. Performance: • 2X to 7X higher performance • Consistent, low latency. (* Forrester Wave Big Data Hadoop Solutions, Q1 2014)
  5. 5. MapR – The Only Distribution to Integrate the Complete Apache Spark Stack [same platform diagram as the previous slide, with the complete Spark stack integrated as execution engines: Spark (general execution engine), Spark Streaming (streaming), Shark (SQL), MLLib (machine learning), and GraphX (graph computation); * certification/support planned for 2014]
  6. 6. Spark on MapR • High Performance: world-record performance on disk coupled with in-memory processing advantages • Enterprise-grade dependability for Spark: industry-leading enterprise-grade High Availability, Data Protection, and Disaster Recovery • 24/7 Best-in-class Global Support: strategic partnership with Databricks to ensure enterprise support for the entire stack • Can Run Natively on MapR: the Spark stack can also be deployed natively as an independent standalone service on the MapR cluster
  7. 7. Apache Spark
  8. 8. Apache Spark spark.apache.org github.com/apache/spark user@spark.apache.org • Originally developed in 2009 in UC Berkeley’s AMP Lab • Fully open sourced in 2010 • Top-level Apache Project as of 2014
  9. 9. The Spark Community
  10. 10. Spark is the Most Active Open Source Project in Big Data [bar chart: project contributors in the past year for Spark, Giraph, Storm, and Tez, on a scale of 0–140]
  11. 11. Spark: Easy and Fast Big Data. Easy to Develop > Rich APIs in Java, Scala, Python > Interactive shell. Fast to Run > General execution graphs > In-memory storage
  12. 12. Spark: Easy and Fast Big Data. Easy to Develop > Rich APIs in Java, Scala, Python > Interactive shell (2-5× less code). Fast to Run > General execution graphs > In-memory storage (up to 10× faster on disk, 100× in memory)
  13. 13. Easy: Get Started Immediately • Multi-language support • Interactive Shell
  Python:
    lines = sc.textFile(...)
    lines.filter(lambda s: "ERROR" in s).count()
  Scala:
    val lines = sc.textFile(...)
    lines.filter(x => x.contains("ERROR")).count()
  Java:
    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("ERROR"); }
    }).count();
  14. 14. Easy: Get Started Immediately • Multi-language support • Interactive Shell
  Python:
    lines = sc.textFile(...)
    lines.filter(lambda s: "ERROR" in s).count()
  Scala:
    val lines = sc.textFile(...)
    lines.filter(x => x.contains("ERROR")).count()
  Java:
    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("ERROR"); }
    }).count();
  Java 8 (Coming Soon):
    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(x -> x.contains("ERROR")).count();
  15. 15. Easy: Clean API Resilient Distributed Datasets • Collections of objects spread across a cluster, stored in RAM or on Disk • Built through parallel transformations • Automatically rebuilt on failure Operations • Transformations (e.g. map, filter, groupBy) • Actions (e.g. count, collect, save) Write programs in terms of transformations on distributed datasets
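For instance, a minimal Scala sketch of this model, assuming an existing SparkContext `sc` and placeholder HDFS paths: transformations only describe the computation, and nothing runs until an action is called.

  val lines  = sc.textFile("hdfs://...")                    // base RDD built from a file
  val errors = lines.filter(_.contains("ERROR"))            // transformation (lazy)
  val pairs  = errors.map(line => (line.split(" ")(0), 1))  // transformation (lazy)
  val counts = pairs.reduceByKey(_ + _).cache()             // keep the result in RAM for reuse
  println(counts.count())                                   // action: triggers execution
  counts.saveAsTextFile("hdfs://.../counts")                // action: writes results out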
  16. 16. Easy: Expressive API map reduce
  17. 17. Easy: Expressive API map filter groupBy sort union join leftOuterJoin rightOuterJoin reduce count fold reduceByKey groupByKey cogroup cross zip sample take first partitionBy mapWith pipe save ...
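A hedged sketch of a few of these operators working together on pair RDDs (the data here is made up for illustration, and `sc` is an assumed SparkContext):

  val visits = sc.parallelize(Seq(("index.html", "1.2.3.4"), ("index.html", "1.3.3.1"), ("about.html", "3.4.5.6")))
  val pages  = sc.parallelize(Seq(("index.html", "Home"), ("about.html", "About")))
  val counts = visits.map { case (url, _) => (url, 1) }.reduceByKey(_ + _)  // hits per URL
  val joined = counts.join(pages)                                           // (url, (hits, title))
  joined.collect().foreach(println)                                         // action: bring results to the driver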
  18. 18. Easy: Example – Word Count
  Hadoop MapReduce:
    public static class WordCountMapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();
      public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          output.collect(word, one);
        }
      }
    }
    public static class WordCountReduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }
  Spark:
    val spark = new SparkContext(master, appName, [sparkHome], [jars])
    val file = spark.textFile("hdfs://...")
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")
  20. 20. Easy: Works Well With Hadoop. Data Compatibility • Access your existing Hadoop data • Use the same data formats • Adheres to data locality for efficient processing. Deployment Models • “Standalone” deployment • YARN-based deployment • Mesos-based deployment • Deploy on an existing Hadoop cluster or side-by-side
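A minimal sketch of the data-compatibility point, assuming placeholder HDFS paths and data written by earlier MapReduce jobs; the same program can then run under the standalone, YARN, or Mesos deployment modes depending on the master URL supplied at launch time.

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf().setAppName("HadoopCompat")
  val sc   = new SparkContext(conf)

  // Plain text already sitting in HDFS: same InputFormats, same data locality as MapReduce
  val logs = sc.textFile("hdfs:///data/logs")

  // A SequenceFile produced by an existing MapReduce job
  val counters = sc.sequenceFile[String, Int]("hdfs:///data/counts")

  println(logs.count() + " log lines, " + counters.count() + " counter records")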
  21. 21. Easy: User-Driven Roadmap Language support > Improved Python support > SparkR > Java 8 > Integrated Schema and SQL support in Spark’s APIs Better ML > Sparse Data Support > Model Evaluation Framework > Performance Testing
  22. 22. Example: Logistic Regression
    data = spark.textFile(...).map(readPoint).cache()
    w = numpy.random.rand(D)
    for i in range(iterations):
        gradient = data \
            .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x) \
            .reduce(lambda x, y: x + y)
        w -= gradient
    print "Final w: %s" % w
  23. 23. Fast: Logistic Regression Performance [chart: running time (s) vs. number of iterations (1–30) for Hadoop and Spark; Hadoop takes ~110 s per iteration, while Spark takes ~80 s for the first iteration and ~1 s for further iterations]
  24. 24. Fast: Using RAM, Operator Graphs. In-memory Caching • Data partitions read from RAM instead of disk. Operator Graphs • Scheduling optimizations • Fault tolerance [diagram: an operator graph of RDDs A–F linked by map, join, filter, and groupBy, scheduled into three pipelined stages, with cached partitions marked]
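For example, a small sketch of explicit caching (assuming `sc` and a placeholder path); the second and third jobs reuse the in-memory partitions instead of re-reading the file.

  import org.apache.spark.storage.StorageLevel

  val events = sc.textFile("hdfs:///data/events")
                 .map(_.split(","))
                 .persist(StorageLevel.MEMORY_ONLY)   // same effect as .cache()

  val total     = events.count()                                          // first job materializes the cache
  val perUser   = events.map(f => (f(0), 1)).reduceByKey(_ + _).count()   // served from RAM
  val bigEvents = events.filter(_.length > 10).count()                    // served from RAM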
  25. 25. Fast: Scaling Down [chart: execution time (s) vs. % of working set in cache – 69 s with the cache disabled, 58 s at 25%, 41 s at 50%, 30 s at 75%, and 12 s fully cached]
  26. 26. Easy: Fault Recovery. RDDs track lineage information that can be used to efficiently recompute lost data:
    msgs = textFile.filter(lambda s: s.startswith("ERROR"))
                   .map(lambda s: s.split("\t")[2])
  [diagram: HDFS file → filtered RDD via filter(func = startswith(…)) → mapped RDD via map(func = split(…))]
  27. 27. Easy: Unified Platform Spark SQL (SQL) Spark Streaming (Streaming) MLlib (Machine learning) Spark (General execution engine) GraphX (Graph computation) Continued innovation bringing new functionality, e.g.: • BlinkDB (Approximate Queries) • SparkR (R wrapper for Spark) • Tachyon (off-heap RDD caching)
  28. 28. Spark SQL (SQL) Spark Streaming (Streaming) MLlib (Machine learning) Spark (General execution engine) GraphX (Graph computation)
  29. 29. Hive Compatibility • Interfaces to access data and code in the Hive ecosystem: o Support for writing queries in HQL o Catalog that interfaces with the Hive MetaStore o Tablescan operator that uses Hive SerDes o Wrappers for Hive UDFs, UDAFs, UDTFs
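As a hedged sketch (the table name `src` is a placeholder, and the entry point has shifted across releases; around Spark 1.0 it looked roughly like this), HiveQL queries run against existing Hive tables through a HiveContext:

  val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
  val rows = hiveCtx.hql("SELECT key, value FROM src WHERE key > 100")  // HiveQL, Hive MetaStore, Hive SerDes
  rows.collect().foreach(println)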
  30. 30. Parquet Support Native support for reading data stored in Parquet: • Columnar storage avoids reading unneeded data. • Currently only supports flat structures (nested data on short-term roadmap). • RDDs can be written to parquet files, preserving the schema.
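A short sketch of the Parquet round trip with the early Spark SQL API (paths are placeholders; method names roughly as of Spark 1.0):

  val sqlCtx = new org.apache.spark.sql.SQLContext(sc)

  val people = sqlCtx.parquetFile("hdfs:///data/people.parquet")   // columnar read; schema comes from the file
  people.registerAsTable("people")
  val adults = sqlCtx.sql("SELECT name FROM people WHERE age >= 18")
  adults.saveAsParquetFile("hdfs:///data/adults.parquet")          // schema preserved on write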
  31. 31. Mixing SQL and Machine Learning
    val trainingDataTable = sql("""
      SELECT e.action, u.age, u.latitude, u.longitude
      FROM Users u
      JOIN Events e ON u.userId = e.userId""")

    // Since `sql` returns an RDD, the results can easily be used in MLlib
    val trainingData = trainingDataTable.map { row =>
      val features = Array[Double](row(1), row(2), row(3))
      LabeledPoint(row(0), features)
    }

    val model = new LogisticRegressionWithSGD().run(trainingData)
  32. 32. Relationship to Shark. Borrows: • Hive data loading code / in-memory columnar representation • hardened Spark execution engine. Adds: • RDD-aware optimizer / query planner • execution engine • language interfaces. Catalyst/Spark SQL is a nearly from-scratch rewrite that leverages the best parts of Shark
  33. 33. Shark (SQL) Spark Streaming (Streaming) MLlib (Machine learning) Spark (General execution engine) GraphX (Graph computation)
  34. 34. Spark Streaming. Run a streaming computation as a series of very small, deterministic batch jobs [diagram: a live data stream is chopped by Spark Streaming into batches of X seconds, which Spark processes into batches of results] • Chop up the live stream into batches of ½ second or more, and leverage RDDs for micro-batch processing • Use the same familiar Spark APIs to process streams • Combine your batch and online processing in a single system • Guarantee exactly-once semantics
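A minimal streaming word count in this style, assuming an existing SparkContext `sc` and a socket source on a placeholder host/port; each 1-second micro-batch is processed with the ordinary RDD operations:

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(sc, Seconds(1))            // 1-second micro-batches
  val lines = ssc.socketTextStream("localhost", 9999)
  val counts = lines.flatMap(_.split(" "))
                    .map(word => (word, 1))
                    .reduceByKey(_ + _)
  counts.print()                                            // emit the counts computed in each batch
  ssc.start()
  ssc.awaitTermination()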
  35. 35. Window-based Transformations
    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status))
    val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()
  [diagram: a sliding window operation over a DStream of data, parameterized by window length and sliding interval]
  36. 36. Shark (SQL) Spark Streaming (Streaming) MLlib (Machine learning) Spark (General execution engine) GraphX (Graph computation)
  37. 37. MLlib – Machine Learning library. Classification: Logistic Regression, Linear SVM (+L1, L2), Decision Trees, Naive Bayes. Regression: Linear Regression (+Lasso, Ridge). Collaborative Filtering: Alternating Least Squares. Clustering / Exploration: K-Means, SVD. Optimization Primitives: SGD, Parallel Gradient. Interoperability: Scala, Java, PySpark (0.9)
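For example, a hedged sketch of clustering with MLlib's K-Means (the input path is a placeholder; shown with the Vector-based API that MLlib moved to around Spark 1.0):

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  val points = sc.textFile("hdfs:///data/points")
                 .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
                 .cache()                                  // iterative algorithm: keep the data in RAM

  val model = KMeans.train(points, 5, 20)                  // 5 clusters, 20 iterations
  println("Within-set sum of squared errors: " + model.computeCost(points))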
  38. 38. Shark (SQL) Spark Streaming (Streaming) MLlib (Machine learning) Spark (General execution engine) GraphX (Graph computation)
  39. 39. The GraphX Unified Approach: enabling users to easily and efficiently express the entire graph analytics pipeline. New API: blurs the distinction between tables and graphs. New System: combines data-parallel and graph-parallel systems
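As an illustrative sketch (the tiny graph below is made up), GraphX builds a property graph from two RDDs and then mixes graph-parallel computation with ordinary RDD operations:

  import org.apache.spark.graphx.{Edge, Graph}

  val users   = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
  val follows = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
  val graph   = Graph(users, follows)                  // tables in, graph out

  val ranks = graph.pageRank(0.001).vertices           // graph-parallel: PageRank to a convergence tolerance of 0.001
  ranks.join(users).collect().foreach(println)         // back to a data-parallel (table) view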
  40. 40. Easy: Unified Platform Shark (SQL) Spark Streaming (Streaming) MLlib (Machine learning) Spark (General execution engine) GraphX (Graph computation) Continued innovation bringing new functionality, e.g.: • BlinkDB (Approximate Queries) • SparkR (R wrapper for Spark) • Tachyon (off-heap RDD caching)
  41. 41. Use Cases
  42. 42. Interactive Exploratory Analytics • Leverage Spark’s in-memory caching and efficient execution to explore large distributed datasets • Use Spark’s APIs to explore any kind of data (structured, unstructured, semi-structured, etc.) and combine programming models • Execute arbitrary code using a fully-functional interactive programming environment • Connect external tools via SQL Drivers
  43. 43. Machine Learning • Improve performance of iterative algorithms by caching frequently accessed datasets • Develop programs that are easy to reason about using a fully-capable functional programming style • Refine algorithms using the interactive REPL • Use carefully-curated algorithms out-of-the-box with MLlib
  44. 44. Power Real-time Dashboards • Use Spark Streaming to perform low-latency window- based aggregations • Combine offline models with streaming data for online clustering and classification within the dashboard • Use Spark’s core APIs and/or Spark SQL to give users large-scale, low-latency drill-down capabilities in exploring dashboard data
  45. 45. Faster ETL • Leverage Spark’s optimized scheduling for more efficient I/O on large datasets, and in-memory processing for aggregations, shuffles, and more • Use Spark SQL to perform ETL using a familiar SQL interface • Easily port Pig scripts to Spark’s API • Run existing Hive queries directly on Spark SQL or Shark
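A hedged sketch of such an ETL flow (paths, the record schema, and the query are made up; uses the early Spark SQL API, where an RDD of case classes can be registered as a table):

  case class LogRecord(userId: String, url: String, bytes: Long)

  val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
  import sqlCtx._                                       // implicit conversion from case-class RDDs to tables

  val records = sc.textFile("hdfs:///raw/access_logs")  // extract
                  .map(_.split("\t"))
                  .map(f => LogRecord(f(0), f(1), f(2).toLong))
  records.registerAsTable("logs")

  val perUser = sqlCtx.sql("SELECT userId, SUM(bytes) AS totalBytes FROM logs GROUP BY userId")  // transform
  perUser.saveAsParquetFile("hdfs:///warehouse/bytes_per_user.parquet")                          // load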
  46. 46. San Francisco June 30 – July 2 • Use Cases • Tech Talks • Training http://spark-summit.org/
  47. 47. © 2014 MapR Technologies 47 Q&A @mapr maprtech adawar@mapr.com Engage with us! MapR maprtech mapr-technologies
