
Spark SQL Deep Dive @ Melbourne Spark Meetup


  1. 1. Spark SQL Deep Dive Michael Armbrust Melbourne Spark Meetup – June 1st 2015
  2. 2. What is Apache Spark?
      Fast and general cluster computing system, interoperable with Hadoop, included in all major distros.
      Improves efficiency through:
      >  In-memory computing primitives
      >  General computation graphs
      Improves usability through:
      >  Rich APIs in Scala, Java, Python
      >  Interactive shell
      Up to 100× faster (2-10× on disk), 2-5× less code.
  3. 3. Spark Model
      Write programs in terms of transformations on distributed datasets.
      Resilient Distributed Datasets (RDDs):
      >  Collections of objects that can be stored in memory or disk across a cluster
      >  Parallel functional transformations (map, filter, …)
      >  Automatically rebuilt on failure
  4. 4. More than Map & Reduce map filter groupBy sort union join leftOuterJoin rightOuterJoin reduce count fold reduceByKey groupByKey cogroup cross zip sample take first partitionBy mapWith pipe save ...
  5. 5. On-Disk Sort Record: Time to sort 100TB
      2013 Record: Hadoop, 2100 machines, 72 minutes
      2014 Record: Spark, 207 machines, 23 minutes
      Also sorted 1PB in 4 hours.
      Source: Daytona GraySort benchmark, sortbenchmark.org
  6. 6. Spark “Hall of Fame”
      LARGEST SINGLE-DAY INTAKE: Tencent (1PB+ /day)
      LONGEST-RUNNING JOB: Alibaba (1 week on 1PB+ data)
      LARGEST SHUFFLE: Databricks PB Sort (1PB)
      MOST INTERESTING APP: Jeremy Freeman, Mapping the Brain at Scale (with lasers!)
      LARGEST CLUSTER: Tencent (8000+ nodes)
      Based on Reynold Xin’s personal knowledge
  7. 7. Example: Log Mining
      Load error messages from a log into memory, then interactively search for various patterns:

          val lines = spark.textFile("hdfs://...")           // base RDD
          val errors = lines.filter(_ startsWith "ERROR")    // transformed RDD
          val messages = errors.map(_.split("\t")(2))        // third tab-separated field
          messages.cache()

          messages.filter(_ contains "foo").count()          // action
          messages.filter(_ contains "bar").count()
          // . . .

      [Diagram: the driver sends tasks to three workers and collects results; each worker reads one block of lines (Block 1-3) and keeps one partition of messages in its cache (Cache 1-3).]
      Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
      Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
  8. 8. A General Stack
      Spark, with Spark Streaming (real-time), Spark SQL, GraphX (graph), MLlib (machine learning), …
  9. 9. Powerful Stack – Agile Development
      [Bar chart: non-test, non-example source lines, axis 0 to 140,000, for Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph), and Spark]
  10. 10. Powerful Stack – Agile Development
      [Same chart; label added for Spark: Streaming]
  11. 11. Powerful Stack – Agile Development
      [Same chart; labels added for Spark: Streaming, SparkSQL]
  12. 12. Powerful Stack – Agile Development
      [Same chart; labels added for Spark: Streaming, SparkSQL, GraphX]
  13. 13. Powerful Stack – Agile Development
      [Same chart; labels added for Spark: Streaming, SparkSQL, GraphX, Your App?]
  14. 14. About SQL
      Spark SQL
      >  Part of the core distribution since Spark 1.0 (April 2014)
      [Charts: # of commits per month and # of contributors, 2014-03 through 2015-06]
  15. 15. About SQL
      Spark SQL
      >  Part of the core distribution since Spark 1.0 (April 2014)
      >  Runs SQL / HiveQL queries, including UDFs, UDAFs, and SerDes

          SELECT COUNT(*)
          FROM hiveTable
          WHERE hive_udf(data)

  16. 16. About SQL
      Spark SQL
      >  Part of the core distribution since Spark 1.0 (April 2014)
      >  Runs SQL / HiveQL queries, including UDFs, UDAFs, and SerDes
      >  Connect existing BI tools to Spark through JDBC
  18. 18. The not-so-secret truth… Spark SQL is not about SQL.
  19. 19. Spark SQL: The whole story
      Create and Run Spark Programs Faster:
      >  Write less code
      >  Read less data
      >  Let the optimizer do the hard work
  20. 20. DataFrame noun – [dey-tuh-freym] 1.  A distributed collection of rows organized into named columns. 2.  An abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas). 3.  Archaic: Previously SchemaRDD (cf. Spark < 1.3).
  21. 21. Write Less Code: Input & Output
      Spark SQL’s Data Source API can read and write DataFrames using a variety of formats.
      [Logos: built-in and external data sources, including JSON, JDBC, and more…]
  22. 22. Write Less Code: Input & Output
      Unified interface to reading/writing data in a variety of formats:

          df = (sqlContext.read
                .format("json")
                .option("samplingRatio", "0.1")
                .load("/home/michael/data.json"))

          (df.write
             .format("parquet")
             .mode("append")
             .partitionBy("year")
             .saveAsTable("fasterData"))

  23. 23. Write Less Code: Input & Output (same code as slide 22)
      Callout: the read and write functions create new builders for doing I/O.
  24. 24. Write Less Code: Input & Output (same code as slide 22)
      Callout: builder methods are used to specify:
      •  Format
      •  Partitioning
      •  Handling of existing data
      •  and more
  25. 25. Write Less Code: Input & Output (same code as slide 22)
      Callout: load(…), save(…) or saveAsTable(…) end the builder chain and perform the I/O.
  26. 26. ETL Using Custom Data Sources

          sqlContext.read
            .format("com.databricks.spark.git")
            .option("url", "https://github.com/apache/spark.git")
            .option("numPartitions", "100")
            .option("branches", "master,branch-1.3,branch-1.2")
            .load()
            .repartition(1)
            .write
            .format("json")
            .save("/home/michael/spark.json")

  27. 27. Write Less Code: Powerful Operations
      Common operations can be expressed concisely as calls to the DataFrame API:
      •  Selecting required columns
      •  Joining different data sources
      •  Aggregation (count, sum, average, etc.)
      •  Filtering
  28. 28. Write Less Code: Compute an Average
      Hadoop MapReduce (Java):

          // Mapper
          private IntWritable one = new IntWritable(1);
          private IntWritable output = new IntWritable();

          protected void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              String[] fields = value.toString().split("\t");
              output.set(Integer.parseInt(fields[1]));
              context.write(one, output);
          }

          // Reducer
          IntWritable one = new IntWritable(1);
          DoubleWritable average = new DoubleWritable();

          protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              int count = 0;
              for (IntWritable value : values) {
                  sum += value.get();
                  count++;
              }
              average.set(sum / (double) count);
              context.write(key, average);
          }

      Spark RDDs (Python):

          data = sc.textFile(...).map(lambda line: line.split("\t"))
          (data.map(lambda x: (x[0], [int(x[1]), 1]))                 # (key, [value, 1])
               .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]])  # [sum, count] per key
               .map(lambda x: [x[0], x[1][0] / float(x[1][1])])       # float() avoids integer division
               .collect())

  29. 29. Write Less Code: Compute an Average
      Using RDDs:

          data = sc.textFile(...).map(lambda line: line.split("\t"))
          (data.map(lambda x: (x[0], [int(x[1]), 1]))
               .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]])
               .map(lambda x: [x[0], x[1][0] / float(x[1][1])])
               .collect())

      Using DataFrames:

          from pyspark.sql.functions import avg

          (sqlCtx.table("people")
                 .groupBy("name")
                 .agg(avg("age"))     # the grouping column is kept in the result (Spark 1.4+)
                 .collect())

      Using SQL:

          SELECT name, avg(age)
          FROM people
          GROUP BY name

  30. 30. Not Just Less Code: Faster Implementations
      [Bar chart: time to aggregate 10 million int pairs (secs), axis 0 to 10, for RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame SQL]
  31. 31. Demo: Data Frames
      Using Spark SQL to read, write, slice and dice your data using simple functions
  32. 32. Read Less Data
      Spark SQL can help you read less data automatically:
      •  Converting to more efficient formats
      •  Using columnar formats (e.g., Parquet)
      •  Using partitioning (e.g., /year=2014/month=02/…)
      •  Skipping data using statistics (e.g., min, max)
      •  Pushing predicates into storage systems (e.g., JDBC)
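
As a rough illustration of the partitioning and statistics bullets on slide 32, the sketch below uses plain Python rather than Spark. The `FileMeta` record, its fields, and `files_to_scan` are hypothetical names invented for this example; it only shows the idea of ruling files out from metadata before reading any rows, not how Spark SQL's data sources are implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical per-file metadata: partition values plus min/max of one column."""
    path: str
    year: int
    month: int
    min_id: int
    max_id: int


def files_to_scan(files, year, wanted_id):
    """Return only the files that could contain rows with the given year and id."""
    selected = []
    for f in files:
        if f.year != year:
            continue  # partition pruning: the directory name rules the file out
        if not (f.min_id <= wanted_id <= f.max_id):
            continue  # statistics skipping: the id cannot fall in this file's range
        selected.append(f.path)
    return selected


files = [
    FileMeta("/year=2013/month=12/part-0", 2013, 12, 1, 500),
    FileMeta("/year=2014/month=01/part-0", 2014, 1, 1, 400),
    FileMeta("/year=2014/month=02/part-0", 2014, 2, 401, 900),
]

print(files_to_scan(files, year=2014, wanted_id=450))
```

Only one of the three files survives both checks here, so a query for that id in 2014 would read a third of the data.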
  33. 33. Optimization happens as late as possible, therefore Spark SQL can optimize across functions.
  34. 34. Example:

          def add_demographics(events):
              u = sqlCtx.table("users")                        # Load Hive table
              return (events
                      .join(u, events.user_id == u.user_id)    # Join on user_id
                      .withColumn("city", zipToCity(u.zip)))   # udf adds city column

          events = add_demographics(sqlCtx.load("/data/events", "json"))
          training_data = (events.where(events.city == "Palo Alto")
                                 .select(events.timestamp).collect())

      Logical Plan: filter over a join of the events file and the users table (expensive).
      Physical Plan: join of scan (events) with a filter over scan (users), so only relevant users are joined.
  35. 35. Same add_demographics function as slide 34, now loading a partitioned Hive table and reading the events as Parquet:

          events = add_demographics(sqlCtx.load("/data/events", "parquet"))
          training_data = (events.where(events.city == "Palo Alto")
                                 .select(events.timestamp).collect())

      Logical Plan: filter over a join of the events file and the users table.
      Physical Plan: join of scan (events) with a filter over scan (users).
      Physical Plan with Predicate Pushdown and Column Pruning: join of optimized scan (events) and optimized scan (users).
  36. 36. Machine Learning Pipelines

          tokenizer = Tokenizer(inputCol="text", outputCol="words")
          hashingTF = HashingTF(inputCol="words", outputCol="features")
          lr = LogisticRegression(maxIter=10, regParam=0.01)
          pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

          df = sqlCtx.load("/path/to/data")
          model = pipeline.fit(df)

      [Diagram: df0, df1, df2, df3 flowing through tokenizer, hashingTF, and lr / lr.model; labels: Pipeline, Model]
  37. 37. So how does it all work?
  38. 38. Plan Optimization & Execution
      SQL AST / DataFrame → Unresolved Logical Plan → [Analysis, using the Catalog] → Logical Plan → [Logical Optimization] → Optimized Logical Plan → [Physical Planning] → Physical Plans → [Cost Model] → Selected Physical Plan → [Code Generation] → RDDs
      DataFrames and SQL share the same optimization/execution pipeline.
  39. 39. An example query

          SELECT name
          FROM (
              SELECT id, name
              FROM People) p
          WHERE p.id = 1

      Logical Plan (top to bottom): Project name → Filter id = 1 → Project id,name → People
  40. 40. Naïve Query Planning (same query as slide 39)
      Logical Plan: Project name → Filter id = 1 → Project id,name → People
      Physical Plan: Project name → Filter id = 1 → Project id,name → TableScan People
  41. 41. Optimized Execution
      Writing imperative code to optimize all possible patterns is hard.
      Logical Plan: Project name → Filter id = 1 → Project id,name → People
      Physical Plan: IndexLookup id = 1, return: name
      Instead write simple rules:
      •  Each rule makes one change
      •  Run many rules together to fixed point.
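
The idea on slide 41 of small rules run together to a fixed point can be sketched in a few lines of plain Python. This is not Catalyst code: the toy expression format (nested tuples such as `("add", left, right)`), the two rules, and the helper names `apply_everywhere` and `run_to_fixed_point` are all assumptions made for this illustration.

```python
def fold_constants(node):
    """Rule 1: replace ("add", 2, 3) with 5 when both operands are numbers."""
    if (isinstance(node, tuple) and node[0] == "add"
            and isinstance(node[1], int) and isinstance(node[2], int)):
        return node[1] + node[2]
    return node


def drop_add_zero(node):
    """Rule 2: replace ("add", x, 0) with x."""
    if isinstance(node, tuple) and node[0] == "add" and node[2] == 0:
        return node[1]
    return node


def apply_everywhere(rule, node):
    """Apply one rule to a node, then to each of its children."""
    node = rule(node)
    if isinstance(node, tuple):
        return (node[0],) + tuple(apply_everywhere(rule, c) for c in node[1:])
    return node


def run_to_fixed_point(rules, tree, max_iterations=100):
    """Run every rule over the tree repeatedly until a full pass changes nothing."""
    for _ in range(max_iterations):
        new_tree = tree
        for rule in rules:
            new_tree = apply_everywhere(rule, new_tree)
        if new_tree == tree:
            return tree
        tree = new_tree
    return tree


expr = ("add", ("add", "x", ("add", 1, -1)), ("add", 2, 3))
print(run_to_fixed_point([fold_constants, drop_add_zero], expr))
```

Neither rule knows about the other, yet folding `1 + -1` to `0` is what lets the second rule remove the addition; repeating the passes until nothing changes is what lets such interactions play out.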
  42. 42. Prior Work: Optimizer Generators
      Volcano / Cascades:
      •  Create a custom language for expressing rules that rewrite trees of relational operators.
      •  Build a compiler that generates executable code for these rules.
      Cons: Developers need to learn this custom language. Language might not be powerful enough.
  43. 43. TreeNode Library
      Easily transformable trees of operators
      •  Standard collection functionality: foreach, map, collect, etc.
      •  transform function: recursive modification of tree fragments that match a pattern.
      •  Debugging support: pretty printing, splicing, etc.
  44. 44. Tree Transformations
      Developers express tree transformations as PartialFunction[TreeType,TreeType]
      1.  If the function does apply to an operator, that operator is replaced with the result.
      2.  When the function does not apply to an operator, that operator is left unchanged.
      3.  The transformation is applied recursively to all children.
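
The three steps on slide 44 can be mimicked in Python as a minimal sketch. Python has no `PartialFunction`, so here a rule that returns `None` stands in for "the function does not apply"; the `Node` class, its `transform` method, and `rename_scan` are hypothetical names for this example and are not Catalyst's TreeNode API.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Node:
    """A minimal immutable tree node (hypothetical, for illustration only)."""
    name: str
    children: Tuple["Node", ...] = ()

    def transform(self, rule: Callable[["Node"], Optional["Node"]]) -> "Node":
        # Steps 1 and 2: use the rule's result if it applies, else keep this node.
        result = rule(self)
        current = self if result is None else result
        # Step 3: apply the same rule recursively to all children.
        new_children = tuple(c.transform(rule) for c in current.children)
        return replace(current, children=new_children)


def rename_scan(node: Node) -> Optional[Node]:
    """Applies only to nodes named "Scan"; returns None everywhere else."""
    if node.name == "Scan":
        return replace(node, name="TableScan")
    return None


plan = Node("Project", (Node("Filter", (Node("Scan"),)),))
new_plan = plan.transform(rename_scan)
print(new_plan)
```

Because the nodes are immutable, `transform` builds a new tree and leaves `plan` untouched, which is the same property the Scala version gets from case classes.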
  45. 45. Writing Rules as Tree Transformations
      1.  Find filters on top of projections.
      2.  Check that the filter can be evaluated without the result of the project.
      3.  If so, switch the operators.
      Original Plan: Project name → Filter id = 1 → Project id,name → People
      Filter Push-Down: Project name → Project id,name → Filter id = 1 → People
  46. 46. Filter Push Down Transformation

          val newPlan = queryPlan transform {
            case f @ Filter(_, p @ Project(_, grandChild))
                if f.references subsetOf grandChild.output =>
              p.copy(child = f.copy(child = grandChild))
          }

  47. 47. Filter Push Down Transformation (same code as slide 46)
      Callouts: queryPlan is the Tree; the { case … } block is the Partial Function.
  48. 48. Filter Push Down Transformation (same code as slide 46)
      Callout on the case pattern: Find Filter on Project.
  49. 49. Filter Push Down Transformation (same code as slide 46)
      Callout on the if guard: Check that the filter can be evaluated without the result of the project.
  50. 50. Filter Push Down Transformation (same code as slide 46)
      Callout on the result expression: If so, switch the order.
  51. 51. Filter Push Down Transformation (same code as slide 46)
      Callout: Scala: Pattern Matching
  52. 52. Filter Push Down Transformation (same code as slide 46)
      Callout: Catalyst: Attribute Reference Tracking
  53. 53. Filter Push Down Transformation (same code as slide 46)
      Callout: Scala: Copy Constructors
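
For readers who do not write Scala, the filter push-down rule can be approximated in Python. This is a sketch under simplifying assumptions, not Catalyst: the `Relation`, `Project`, and `Filter` classes are hypothetical, a filter references exactly one column, and `isinstance` checks stand in for Scala's pattern matching while `dataclasses.replace` stands in for `copy`.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Relation:
    name: str
    output: FrozenSet[str]           # columns the relation produces


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: object

    @property
    def output(self):
        return frozenset(self.columns)


@dataclass(frozen=True)
class Filter:
    column: str                      # the single column the predicate reads
    value: int
    child: object

    @property
    def references(self):
        return frozenset({self.column})

    @property
    def output(self):
        return self.child.output


def push_filter_below_project(node):
    """Filter(Project(grandChild)) becomes Project(Filter(grandChild)) when safe."""
    if isinstance(node, Filter) and isinstance(node.child, Project):
        project, grand_child = node.child, node.child.child
        # Guard: the filter must be computable from the grandchild's columns alone.
        if node.references <= grand_child.output:
            return replace(project, child=replace(node, child=grand_child))
    return None                      # rule does not apply


def transform(node, rule):
    """Apply the rule at this node if it matches, then recurse into the child."""
    result = rule(node)
    node = node if result is None else result
    if isinstance(node, Relation):
        return node
    return replace(node, child=transform(node.child, rule))


people = Relation("People", frozenset({"id", "name", "age"}))
plan = Project(("name",), Filter("id", 1, Project(("id", "name"), people)))
print(transform(plan, push_filter_below_project))
```

The guard is the part that keeps the rewrite safe: a filter on a column that only exists after the projection would fail the subset check and stay where it is.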
  54. 54. Optimizing with Rules
      Original Plan: Project name → Filter id = 1 → Project id,name → People
      Filter Push-Down: Project name → Project id,name → Filter id = 1 → People
      Combine Projection: Project name → Filter id = 1 → People
      Physical Plan: IndexLookup id = 1, return: name
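
The logical part of the sequence on slide 54 can be replayed with a toy optimizer in Python. Everything here is an assumption made for illustration: plans are nested tuples whose last element is the child, `push_filter` skips the column check for brevity (it assumes the filter column exists below the projection), and the final IndexLookup physical plan is not modelled.

```python
def push_filter(plan):
    """Filter over Project becomes Project over Filter (column check omitted)."""
    if plan[0] == "Filter" and plan[-1][0] == "Project":
        _, condition, (_, columns, grand_child) = plan
        return ("Project", columns, ("Filter", condition, grand_child))
    return plan


def combine_projects(plan):
    """Project over Project collapses when the outer columns are a subset."""
    if plan[0] == "Project" and plan[-1][0] == "Project":
        _, outer_cols, (_, inner_cols, grand_child) = plan
        if set(outer_cols) <= set(inner_cols):
            return ("Project", outer_cols, grand_child)
    return plan


def apply_rule(rule, plan):
    """Apply a rule at this node, then recurse into its only child."""
    plan = rule(plan)
    if plan[0] == "Relation":
        return plan
    return plan[:-1] + (apply_rule(rule, plan[-1]),)


def optimize(plan, rules):
    """Repeat all rules until a full pass leaves the plan unchanged."""
    while True:
        new_plan = plan
        for rule in rules:
            new_plan = apply_rule(rule, new_plan)
        if new_plan == plan:
            return plan
        plan = new_plan


original = ("Project", ("name",),
            ("Filter", "id = 1",
             ("Project", ("id", "name"),
              ("Relation", "People"))))

print(optimize(original, [push_filter, combine_projects]))
```

The second rule only fires because the first one moved the filter out from between the two projections, which is the interaction the slide's sequence of plans is showing.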
  55. 55. Future Work – Project Tungsten
      Consider “abcd”: 4 bytes with UTF8 encoding.
      java.lang.String object internals:

          OFFSET  SIZE  TYPE    DESCRIPTION      VALUE
               0     4          (object header)  ...
               4     4          (object header)  ...
               8     4          (object header)  ...
              12     4  char[]  String.value     []
              16     4  int     String.hash      0
              20     4  int     String.hash32    0
          Instance size: 24 bytes (reported by Instrumentation API)

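
To put numbers on slide 55, here is a small Python calculation. The 24-byte figure is taken from the slide; the UTF-16 line reflects that a JVM `char` is a 2-byte code unit; the variable names are only illustrative, and the size of the separate `char[]` object is left out because the slide does not give it.

```python
text = "abcd"

payload_utf8 = len(text.encode("utf-8"))       # bytes of actual data in UTF-8
payload_utf16 = len(text.encode("utf-16-be"))  # bytes as 2-byte chars, without a byte-order mark
string_object_bytes = 24                       # java.lang.String instance size quoted on the slide

# The String object alone, before counting its char[], is several times the payload.
overhead_ratio = string_object_bytes / payload_utf8

print(payload_utf8, payload_utf16, overhead_ratio)
```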
  56. 56. Project Tungsten
      Overcome JVM limitations:
      •  Memory Management and Binary Processing: leveraging application semantics to manage memory explicitly and eliminate the overhead of JVM object model and garbage collection
      •  Cache-aware computation: algorithms and data structures to exploit memory hierarchy
      •  Code generation: using code generation to exploit modern compilers and CPUs
  57. 57. Questions? Learn more at: http://spark.apache.org/docs/latest/ Get Involved: https://github.com/apache/spark
