
ETL with SPARK - First Spark London meetup

Supercharging ETL with Spark
Slides from the first Spark Meetup London



  1. Supercharging ETL with Spark - Rafal Kwasny, First Spark London Meetup, 2014-05-28
  2. Who are you?
  3. About me • Sysadmin/DevOps background • Worked as DevOps @Visualdna • Now building game analytics platform @Sony Computer Entertainment Europe
  4. Outline • What is ETL • How do we do it in the standard Hadoop stack • How can we supercharge it with Spark • Real-life use cases • How to deploy Spark • Lessons learned
  5. Standard technology stack: Get the data
  6. Standard technology stack: Load into HDFS / S3
  7. Standard technology stack: Extract & Transform & Load
  8. Standard technology stack: Query, analyze, train ML models
  9. Standard technology stack: Real-time pipeline
  10. Hadoop • Industry standard • Have you ever looked at Hadoop code and tried to fix something?
  11. How simple is simple? "Simple YARN application to run n copies of a unix command - deliberately kept simple (with minimal error handling etc.)"
      ➜ $ git clone https://github.com/hortonworks/simple-yarn-app.git (…)
      ➜ $ find simple-yarn-app -name "*.java" | xargs cat | wc -l
      232
  12. ETL Workflow • Get some data from S3/HDFS • Map • Shuffle • Reduce • Save to S3/HDFS
  13. ETL Workflow • Get some data from S3/HDFS • Map • Shuffle • Reduce • Save to S3/HDFS. Repeat 10 times
  14. Issue: Test run time • Job startup time ~20s to run a job that does nothing • Hard to test the code without a cluster (cascading simulation mode != real life)
  15. Issue: new applications - MapReduce is awkward for key big data workloads: • Low-latency dispatch (e.g. quick queries) • Iterative algorithms (e.g. ML, graph…) • Streaming data ingest
  16. Issue: hardware is moving on - hardware has advanced since Hadoop started: • Very large RAM, faster networks (10Gb+) • Bandwidth to disk not keeping up • 1 GB of RAM ~ $0.75/month* (*based on the spot price of an AWS r3.8xlarge instance)
  17. How can we supercharge our ETL?
  18. Use Spark • Fast and Expressive Cluster Computing Engine • Compatible with Apache Hadoop • In-memory storage • Rich APIs in Java, Scala, Python
  19. Why Spark? • Up to 40x faster than Hadoop MapReduce (for some use cases, see: https://amplab.cs.berkeley.edu/benchmark/) • Jobs can be scheduled and run in <1s • Typically less code (2-5x) • Seamless Hadoop/HDFS integration • REPL • Accessible source code in terms of LOC and modularity
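To give a feel for the "less code" and REPL points, a minimal spark-shell session of this shape might look as follows; the S3 path and field layout are invented for illustration, not taken from the talk.

    // Count error events per day from tab-separated logs on S3 (path and fields are hypothetical)
    val logs = sc.textFile("s3n://my-bucket/logs/2014-05/*")
    val errorsPerDay = logs
      .map(_.split("\t"))
      .filter(fields => fields(2) == "ERROR")    // keep only error events
      .map(fields => (fields(1).take(10), 1))    // key by the date part of the timestamp
      .reduceByKey(_ + _)
    errorsPerDay.collect().foreach(println)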
  20. Why Spark? • Berkeley Data Analytics Stack ecosystem: Spark, Spark Streaming, Shark, BlinkDB, MLlib • Deep integration into Hadoop ecosystem • Read/write Hadoop formats • Interoperability with other ecosystem components • Runs on Mesos & YARN, also MR1 • EC2, EMR • HDFS, S3
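As a sketch of the Hadoop-format interoperability mentioned above (the paths and Writable types here are assumptions):

    // Read a Hadoop SequenceFile and write its values back out as plain text on S3
    import org.apache.hadoop.io.{LongWritable, Text}
    val events = sc.sequenceFile("hdfs:///data/events", classOf[LongWritable], classOf[Text])
      .map { case (_, value) => value.toString }
    events.saveAsTextFile("s3n://my-bucket/events-as-text")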
  21. Why Spark?
  22. Using RAM for in-memory caching
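Roughly what that looks like in code (the dataset path is a placeholder): load once, cache, then run several actions against the same in-memory RDD without re-reading from disk.

    val events = sc.textFile("hdfs:///data/events")
      .map(_.split("\t"))
      .cache()                      // keep the parsed events in executor memory

    // both actions reuse the cached RDD instead of re-reading and re-parsing the input
    val totalEvents = events.count()
    val uniqueUsers = events.map(fields => fields(0)).distinct().count()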
  23. Fault recovery
  24. Stack - Also: • SHARK (Hive on Spark) • Tachyon (off-heap caching) • SparkR (R wrapper) • BlinkDB (approximate queries)
  25. Real-life use
  26. Spark use-cases • Next-generation ETL platform • No more “multiple chained MapReduce jobs” architecture • Fewer jobs to worry about • Better sleep for your DevOps team
  27. Sessionization: Add session_id to events
  28. Why add session id? Combine all user activity into user sessions
  29. Adding session ID
      user_id  timestamp   Referrer              URL
      user1    1401207490  http://fb.com         http://webpage/
      user2    1401207491  http://twitter.com    http://webpage/
      user1    1401207543  http://webpage/       http://webpage/login
      user1    140120841   http://webpage/login  http://webpage/add_to_cart
      user2    1401207491  http://webpage/       http://webpage/product1
  30. Group by user
      user_id  timestamp   Referrer              URL
      user1    1401207490  http://fb.com         http://webpage/
      user1    1401207543  http://webpage/       http://webpage/login
      user1    140120841   http://webpage/login  http://webpage/add_to_cart
      user2    1401207491  http://twitter.com    http://webpage/
      user2    1401207491  http://webpage/       http://webpage/product1
  31. Add unique session id
      user_id  timestamp   session_id                        Referrer              URL
      user1    1401207490  8fddc743bfbafdc45e071e5c126ceca7  http://fb.com         http://webpage/
      user1    1401207543  8fddc743bfbafdc45e071e5c126ceca7  http://webpage/       http://webpage/login
      user1    140120841   8fddc743bfbafdc45e071e5c126ceca7  http://webpage/login  http://webpage/add_to_cart
      user2    1401207491  c00e7421525008584d9d1ff4201cbf65  http://twitter.com    http://webpage/
      user2    1401207491  c00e7421525008584d9d1ff4201cbf65  http://webpage/       http://webpage/product1
  32. Join with external data
      user_id  timestamp   session_id                        new_user  Referrer              URL
      user1    1401207490  8fddc743bfbafdc45e071e5c126ceca7  TRUE      http://fb.com         http://webpage/
      user1    1401207543  8fddc743bfbafdc45e071e5c126ceca7  TRUE      http://webpage/       http://webpage/login
      user1    140120841   8fddc743bfbafdc45e071e5c126ceca7  TRUE      http://webpage/login  http://webpage/add_to_cart
      user2    1401207491  c00e7421525008584d9d1ff4201cbf65  FALSE     http://twitter.com    http://webpage/
      user2    1401207491  c00e7421525008584d9d1ff4201cbf65  FALSE     http://webpage/       http://webpage/product1
  33. Sessionize user clickstream • Filter interesting events • Group by user • Add unique sessionId • Join with external data sources • Write output
  34. val input = sc.textFile("file:///tmp/input")
      val rawEvents = input
        .map(line => line.split("\t"))              // events are tab-separated

      val userInfo = sc.textFile("file:///tmp/userinfo")
        .map(line => line.split("\t"))
        .map(user => (user(0), user))               // key user info by user_id

      val processedEvents = rawEvents
        .map(arr => (arr(0), arr))                  // key events by user_id
        .cogroup(userInfo)                          // (user_id, (events, user info))
        .flatMapValues(k => {
          val new_user = k._2.length match {        // flag derived from presence of a userInfo record
            case x if x > 0 => "true"
            case _ => "false"
          }
          val session_id = java.util.UUID.randomUUID.toString
          k._1.map(line =>
            line.slice(0, 3) ++ Array(session_id) ++ Array(new_user) ++ line.drop(3)
          )
        })
        .map(k => k._2)
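The "write output" step from the previous slide's list is not shown above; a minimal sketch (the output path is a placeholder) would be:

    processedEvents
      .map(fields => fields.mkString("\t"))   // back to tab-separated lines
      .saveAsTextFile("file:///tmp/output")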
  35. Why is it better? • Single Spark job • Easier to maintain than 3 consecutive MapReduce stages • Can be unit tested (see the sketch below)
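A sketch of the "unit tested" point, assuming the job logic is factored into a function that takes RDDs; the function name, test data and plain-assert style are all assumptions. The point is that it runs on a local SparkContext with no cluster.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    object SessionizeTest {
      // Simplified stand-in for the sessionization logic: adds a session_id column per user
      def sessionize(events: RDD[Array[String]]): RDD[Array[String]] =
        events.map(e => (e(0), e)).groupByKey().flatMap { case (_, evs) =>
          val sessionId = java.util.UUID.randomUUID.toString
          evs.map(e => e.slice(0, 3) ++ Array(sessionId) ++ e.drop(3))
        }

      def main(args: Array[String]) {
        val sc = new SparkContext("local[2]", "sessionize-test")   // no cluster needed
        val events = sc.parallelize(Seq(
          Array("user1", "1401207490", "http://fb.com", "http://webpage/")))
        val out = sessionize(events).collect()
        assert(out.length == 1 && out(0).length == 5)               // session_id column was added
        sc.stop()
      }
    }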
  36. From the DevOps perspective
  37. v1.0 - running on EC2 • Start with the EC2 script: ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> --instance-type=c3.xlarge launch <cluster-name> • If it does not work for you, modify it - it's just simple Python + boto
  38. v2.0 - Autoscaling on spot instances • 1x master - on-demand (c3.large) • XX slaves - spot instances, depending on usage patterns (r3.*) • No HDFS • Persistence in memory + S3
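One way to read the last two bullets (bucket names are placeholders): the working set lives only in executor memory, and anything that has to outlive the cluster goes straight to S3.

    import org.apache.spark.storage.StorageLevel

    val events = sc.textFile("s3n://my-bucket/raw/2014-05-28/*")
      .map(_.split("\t"))
      .persist(StorageLevel.MEMORY_ONLY)    // no HDFS: working set stays in RAM

    // ... run the day's jobs against `events` ...

    events.map(_.mkString("\t")).saveAsTextFile("s3n://my-bucket/processed/2014-05-28")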
  39. Other options • Mesos • YARN • MR1
  40. Lessons learned
  41. JVM issues
      • java.lang.OutOfMemoryError: GC overhead limit exceeded
      • Add more memory?
        val sparkConf = new SparkConf()
          .set("spark.executor.memory", "120g")
          .set("spark.storage.memoryFraction", "0.3")
          .set("spark.shuffle.memoryFraction", "0.3")
      • Increase parallelism:
        sc.textFile("s3://..path", 10000)
        groupByKey(10000)
  42. Full GC
      2014-05-21T10:15:23.203+0000: 200.710: [Full GC 109G->45G(110G), 79.3771030 secs]
      2014-05-21T10:16:42.580+0000: 280.087: Total time for which application threads were stopped: 79.3773830 seconds
      We want to avoid this:
      • Use G1GC + Java 8
      • Store data serialized
        set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        set("spark.kryo.registrator", "scee.SceeKryoRegistrator")
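The registrator class itself is not shown in the slides; a minimal sketch of what a class like scee.SceeKryoRegistrator could contain (the registered classes are an assumption):

    package scee

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator

    class SceeKryoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        // register the classes that actually flow through shuffles and caches
        kryo.register(classOf[Array[String]])
        kryo.register(classOf[Array[Array[String]]])
      }
    }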
  43. Bugs • For example: CDH5 does not work with Amazon S3 out of the box (thanks to Sean it will be fixed in the next release) • If in doubt, use the provided ec2/spark-ec2 script: ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> --instance-type=c3.xlarge launch <cluster-name>
  44. Tips & Tricks
      • You do not need to package the whole of Spark with your app, just mark the dependencies as "provided" in sbt:
        libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-cdh5.0.1" % "provided"
        libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0-cdh5.0.1" % "provided"
        assembly jar size goes from 120MB -> 5MB
      • Always ensure you are compiling against the same version of artifacts, if not "bad things will happen"™
  45. Future - Spark 1.0 • Voting in progress to release Spark 1.0.0 RC11 • Spark SQL • History server • Job Submission Tool • Java 8 support
  46. Spark - Hadoop done right • Faster to run, less code to write • Deploying Spark can be easy and cost-effective • Still rough around the edges, but improving quickly
  47. Thank you for listening :)
