
Big Data made easy with a Spark


"Big Data made easy with a Spark" is the presentation I gave for ATO (AllThingsOpen) 2018.

In this hands-on session, you will learn how to run a full Big Data scenario, from ingestion to publication. You will see how to use Java and Apache Spark to ingest data, perform some transformations, and save the data. You will then run a second lab where you will execute your very first machine learning algorithm!


Big Data made easy with a Spark

  1. 1. Big Data made easy with a Spark — All Things Open 2018, Raleigh, NC, October 22, 2018
  2. 2. Jean Georges Perrin — software whatever since 1983 — x10 — @jgperrin — http://jgp.net [blog]
  3. 3. Who art thou? ๏ Experience with Spark? ๏ Experience with Hadoop? ๏ Experience with Scala? ๏ Java? ๏ PHP guru? ๏ Front-end developer?
  4. 4. But most importantly… ๏ … who is not a developer?
  5. 5. Agenda ๏ What is Big Data? ๏ What is Spark? ๏ What can I do with Spark? ๏ What is a Spark app, anyway? ๏ Install a bunch of software ๏ A first example ๏ Understand what just happened ๏ Another example, slightly more complex, because you are now ready ๏ But now, sincerely, what just happened? ๏ Let’s do AI! ๏ Going further
  6. 6. Caution: hands-on tutorial, tons of content, unknown crowd, unknown setting.
  7. 7. Biiiiiiiig Data — the 3 (or 4, or 5) Vs: ๏ volume ๏ variety ๏ velocity ๏ variability ๏ value. Sources: https://en.wikipedia.org/wiki/Big_data, https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data
  8. 8. Data is considered big when it needs more than one computer to be processed. Sources: https://en.wikipedia.org/wiki/Big_data, https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data
  9. 9. Analytics operating system
  10. 10. An analytics operating system? (diagram: apps running on a single machine’s OS and hardware, then apps on a distributed OS spanning several hardware/OS nodes, then apps on an analytics OS layered on top of that distributed OS)
  11. 11. An analytics operating system? (diagram: Spark as that analytics OS, sitting between the apps and the distributed hardware/OS layer)
  12. 12. Some use cases ๏ NCEatery.com: restaurant analytics, 1.57×10^21 datapoints analyzed ๏ Lumeris: general compute, distributed data transfer ๏ IBM: DSX (Data Science Experience), Watson Data Studio, Event Store — http://jgp.net/2017/06/22/spark-boosts-ibm-event-store/ ๏ CERN: analysis of the science experiments in the LHC (Large Hadron Collider)
  13. 13. What does a typical app look like? ๏ Connect to the cluster ๏ Load data ๏ Do something with the data ๏ Share the results (see the sketch below)
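  In Java, a minimal sketch of those four steps could look like the following — the file names, formats, and the filtered column are assumptions for illustration, not from the deck:

  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.SparkSession;

  public class TypicalSparkApp {
    public static void main(String[] args) {
      // 1. Connect to the cluster (a local master here, for simplicity)
      SparkSession spark = SparkSession.builder()
          .appName("Typical Spark app")
          .master("local")
          .getOrCreate();

      // 2. Load data (hypothetical input file)
      Dataset<Row> df = spark.read().format("csv")
          .option("header", "true")
          .load("data/input.csv");

      // 3. Do something with the data (hypothetical column name)
      df = df.filter(df.col("price").isNotNull());

      // 4. Share the results (hypothetical output path and format)
      df.write().mode("overwrite").parquet("data/output");

      spark.stop();
    }
  }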
  14. 14. Convinced? Let’s go!
  15. 15. http://bit.ly/spark-clego
  16. 16. Get all the S T U F F ๏ Go to http://jgp.net/ato2018 ๏ Install the software ๏ Access the source code
  17. 17. Download some tools ๏ Java JDK 1.8 — http://bit.ly/javadk8 ๏ Eclipse Oxygen or later — http://bit.ly/eclipseo2 ๏ Other nice-to-haves: Maven, SourceTree or git (command line). Full links: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html, http://www.eclipse.org/downloads/eclipse-packages/
  18. 18. Aren’t you glad we are using Java?
  19. 19. Lab #1 - ingestion
  20. 20. Lab #1 - ingestion ๏ Goal: in a Big Data project, ingestion is the first operation; you get the data “in.” ๏ Source code: https://github.com/jgperrin/net.jgp.books.sparkWithJava.ch01
  21. 21. Getting deeper ๏ Go to net.jgp.books.sparkWithJava.ch01 ๏ Open CsvToDataframeApp.java ๏ Right click, Run As, Java Application
  22. 22.
  +---+--------+--------------------+-----------+--------------------+
  | id|authorId|               title|releaseDate|                link|
  +---+--------+--------------------+-----------+--------------------+
  |  1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
  |  2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|
  |  3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|
  |  4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|
  |  5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|
  +---+--------+--------------------+-----------+--------------------+
  only showing top 5 rows
  23. 23.
  package net.jgp.books.sparkWithJava.ch01;

  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.SparkSession;

  public class CsvToDataframeApp {

    public static void main(String[] args) {
      CsvToDataframeApp app = new CsvToDataframeApp();
      app.start();
    }

    private void start() {
      // Creates a session on a local master
      SparkSession spark = SparkSession.builder()
          .appName("CSV to Dataset")
          .master("local")
          .getOrCreate();

      // Reads a CSV file with header, called books.csv, stores it in a dataframe
      Dataset<Row> df = spark.read().format("csv")
          .option("header", "true")
          .load("data/books.csv");

      // Shows at most 5 rows from the dataframe
      df.show(5);
    }
  }

  /jgperrin/net.jgp.books.sparkWithJava.ch01
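  The lab stops at df.show(5). To close the ingestion-to-publication loop the abstract mentions, here is a hedged sketch of the natural next step — the Parquet format and the output path are assumptions, not part of ch01:

  // Hypothetical continuation: persist the ingested dataframe.
  df.write()
      .mode("overwrite")               // replace the output of a previous run
      .parquet("data/books-parquet");  // assumed format and path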
  24. 24. So what happened? Let’s try to understand a little more
  25. 25. Apache Spark and its libraries: Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (graph)
  26. 26. (diagram: your application uses a unified API — Spark SQL, Spark Streaming, machine learning & deep learning & artificial intelligence, GraphX — running on top of eight nodes, each with its own OS and hardware)
  27. 27. (diagram: the same stack, with the dataframe sitting between your application and the unified API; the nodes’ hardware is no longer shown)
  28. 28. (diagram: the dataframe at the center of Spark SQL, Spark Streaming, machine learning & deep learning & artificial intelligence, and GraphX)
  29. 29. Lab #2 - a bit of analytics. But really, just a bit.
  30. 30. Lab #2 - a little bit of analytics ๏ Goal: from two datasets, one containing books and the other authors, list the authors ordered by their number of books. ๏ Source code: https://github.com/jgperrin/net.jgp.labs.spark
  31. 31. If it were in a relational database: authors.csv (id: integer, name: string, link: string, wikipedia: string) and books.csv (id: integer, authorId: integer, title: string, releaseDate: string, link: string)
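  Since the lab’s join and aggregation (slides 34–35) mirror a relational query, here is a hedged sketch of a close equivalent through Spark SQL, using the authorsDf and booksDf dataframes loaded in the lab — the view names are assumptions:

  // Register the dataframes as temporary views so SQL can reach them.
  authorsDf.createOrReplaceTempView("authors");
  booksDf.createOrReplaceTempView("books");

  // Authors ranked by number of books, expressed in SQL.
  Dataset<Row> libraryDf = spark.sql(
      "SELECT a.id, a.name, a.link, COUNT(b.id) AS count "
          + "FROM authors a "
          + "LEFT JOIN books b ON a.id = b.authorId "
          + "GROUP BY a.id, a.name, a.link "
          + "ORDER BY count DESC");
  libraryDf.show();

  One subtlety: COUNT(b.id) counts matched books, so an author with no books shows 0, whereas the lab’s groupBy().count() counts rows and would show 1 for that author.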
  32. 32. Basic analytics ๏ Go to net.jgp.labs.spark.l200_join.l030_count_books ๏ Open AuthorsAndBooksCountBooksApp.java ๏ Right click, Run As, Java Application
  33. 33.
  +---+-------------------+--------------------+-----+
  | id|               name|                link|count|
  +---+-------------------+--------------------+-----+
  |  1|      J. K. Rowling|http://amzn.to/2l...|    4|
  | 12|William Shakespeare|http://amzn.to/2j...|    3|
  |  4|      Denis Diderot|http://amzn.to/2i...|    2|
  |  6|        Craig Walls|http://amzn.to/2A...|    2|
  |  2|Jean Georges Perrin|http://amzn.to/2w...|    2|
  |  3|         Mark Twain|http://amzn.to/2v...|    2|
  | 11|       Alan Mycroft|http://amzn.to/2A...|    1|
  | 10|        Mario Fusco|http://amzn.to/2A...|    1|
  …
  +---+-------------------+--------------------+-----+

  root
   |-- id: integer (nullable = true)
   |-- name: string (nullable = true)
   |-- link: string (nullable = true)
   |-- count: long (nullable = false)
  34. 34.
  package net.jgp.labs.spark.l200_join.l030_count_books;

  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.SparkSession;

  public class AuthorsAndBooksCountBooksApp {

    public static void main(String[] args) {
      AuthorsAndBooksCountBooksApp app = new AuthorsAndBooksCountBooksApp();
      app.start();
    }

    private void start() {
      SparkSession spark = SparkSession.builder()
          .appName("Authors and Books")
          .master("local")
          .getOrCreate();

      String filename = "data/authors.csv";
      Dataset<Row> authorsDf = spark.read()
          .format("csv")
          .option("inferSchema", "true")
          .option("header", "true")
          .load(filename);

  /jgperrin/net.jgp.labs.spark
  35. 35.
      filename = "data/books.csv";
      Dataset<Row> booksDf = spark.read()
          .format("csv")
          .option("inferSchema", "true")
          .option("header", "true")
          .load(filename);

      Dataset<Row> libraryDf = authorsDf
          .join(
              booksDf,
              authorsDf.col("id").equalTo(booksDf.col("authorId")),
              "left")
          .withColumn("bookId", booksDf.col("id"))
          .drop(booksDf.col("id"))
          .groupBy(
              authorsDf.col("id"),
              authorsDf.col("name"),
              authorsDf.col("link"))
          .count();

      libraryDf = libraryDf
          .orderBy(libraryDf.col("count").desc());
      libraryDf.show();
      libraryDf.printSchema();
    }
  }

  /jgperrin/net.jgp.labs.spark
  36. 36. The art of delegating
  37. 37. (diagram: your app talks to the driver; the master / cluster manager coordinates slaves (workers), each hosting executors that run tasks)
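  The labs use .master("local"), which puts the driver and a single executor in one JVM. A hedged sketch of pointing the very same app at a standalone cluster — the host name is an assumption; 7077 is the usual standalone master port:

  SparkSession spark = SparkSession.builder()
      .appName("Authors and Books")
      // hypothetical cluster manager host; the rest of the app is unchanged
      .master("spark://cluster-manager-host:7077")
      .getOrCreate();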
  38. 38. Lab #3 - an even smaller bit of AI But really just a bit
  39. 39. What’s AI, anyway?
  40. 40. Popular beliefs (general AI): robots with human-like behavior, HAL from 2001, Isaac Asimov, potential ethical problems. Current state of the art (narrow AI): lots of mathematics, heavy calculations, algorithms, self-driving cars.
  41. 41. “I am an expert in general AI” (diagram: Machine Learning as a subset of ARTIFICIAL INTELLIGENCE)
  42. 42. Machine learning ๏ Common algorithms: linear and logistic regressions, classification and regression trees, K-nearest neighbors (KNN) ๏ Deep learning: a subset of ML, artificial neural networks (ANNs), super CPU-intensive, uses GPUs
  43. 43. There are two kinds of data scientists: 1) Those who can extrapolate from incomplete data.
  44. 44. Data Engineer vs. Data Scientist (adapted from https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer). Data engineer: develop, build, test, and operationalize datastores and large-scale processing systems (DataOps is the new DevOps); match architecture with business needs; develop processes for data modeling, mining, and pipelines; improve data reliability and quality. Data scientist: clean, massage, and organize data; perform statistics and analysis to develop insights, build models, and search for innovative correlations; prepare data for predictive models; explore data to find hidden gems and patterns; tell stories to key stakeholders.
  45. 45. (diagram: Data Engineer and Data Scientist, with SQL shared between the two roles; adapted from https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer)
  46. 46. All over again. As the old adage goes: Garbage In, Garbage Out. (illustration: xkcd)
  47. 47. Lab #3 - correcting and extrapolating data
  48. 48. Lab #3 - projecting data ๏ Goal: as a restaurant manager, I want to predict how much revenue a party of 40 will bring. ๏ Source code: https://github.com/jgperrin/net.jgp.labs.sparkdq4ml
  49. 49. If everything were as simple… (chart: dinner revenue per number of guests)
  50. 50. …as a visual representation (chart: the same data with Anomaly #1 and Anomaly #2 highlighted)
  51. 51. I love it when a plan comes together
  52. 52. Load & Format +-----+-----+ |guest|price| +-----+-----+ | 1| 23.1| | 2| 30.0| … +-----+-----+ only showing top 20 rows ---- 1st DQ rule +-----+-----+------------+ |guest|price|price_no_min| +-----+-----+------------+ | 1| 23.1| 23.1| | 2| 30.0| 30.0| … | 25| 3.0| -1.0| | 26| 10.0| -1.0| … +-----+-----+------------+ … +-----+-----+-----+--------+ |guest|price|label|features| +-----+-----+-----+--------+ | 1| 23.1| 23.1| [1.0]| | 2| 30.0| 30.0| [2.0]| … +-----+-----+-----+--------+ only showing top 20 rows … RMSE: 2.802192495300457 r2: 0.9965340953376102 Intersection: 20.979190460591575 Regression parameter: 1.0 Tol: 1.0E-6 Prediction for 40.0 guests is 218.00351106373822
  53. 53. Using existing data quality rules
  package net.jgp.labs.sparkdq4ml.dq.udf;

  import org.apache.spark.sql.api.java.UDF1;
  import net.jgp.labs.sparkdq4ml.dq.service.*;

  public class MinimumPriceDataQualityUdf
      implements UDF1<Double, Double> {
    public Double call(Double price) throws Exception {
      return MinimumPriceDataQualityService.checkMinimumPrice(price);
    }
  }

  If the price is OK, the rule returns the price; if not, it returns -1.
  /jgperrin/net.jgp.labs.sparkdq4ml
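  The service behind the UDF lives in the repo; here is a hypothetical sketch of what it might look like — the threshold value is an assumption inferred from the sample data, not the repo’s actual constant:

  package net.jgp.labs.sparkdq4ml.dq.service;

  public class MinimumPriceDataQualityService {
    // Assumed threshold: prices at or below this value are anomalies.
    private static final double MINIMUM_PRICE = 15.0;

    public static Double checkMinimumPrice(Double price) {
      if (price == null || price <= MINIMUM_PRICE) {
        return -1.0; // flag the anomaly so it can be filtered downstream
      }
      return price;
    }
  }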
  54. 54. Telling Spark to use my DQ rules
  SparkSession spark = SparkSession.builder()
      .appName("DQ4ML")
      .master("local")
      .getOrCreate();

  spark.udf().register(
      "minimumPriceRule",
      new MinimumPriceDataQualityUdf(),
      DataTypes.DoubleType);

  spark.udf().register(
      "priceCorrelationRule",
      new PriceCorrelationDataQualityUdf(),
      DataTypes.DoubleType);

  /jgperrin/net.jgp.labs.sparkdq4ml
  55. 55. Loading my dataset
  String filename = "data/dataset.csv";
  Dataset<Row> df = spark.read().format("csv")
      .option("inferSchema", "true")
      .option("header", "false")
      .load(filename);
  df = df.withColumn("guest", df.col("_c0")).drop("_c0");
  df = df.withColumn("price", df.col("_c1")).drop("_c1");
  df = df.withColumn(
      "price_no_min",
      callUDF("minimumPriceRule", df.col("price")));
  df.createOrReplaceTempView("price");
  df = spark.sql(
      "SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0");

  Using CSV, but it could be Hive, JDBC, you name it…
  /jgperrin/net.jgp.labs.sparkdq4ml
  56. 56.
  +-----+-----+
  |guest|price|
  +-----+-----+
  |    1|23.24|
  |    2|30.89|
  |    2|33.74|
  |    3|34.89|
  |    3|29.91|
  |    3| 38.0|
  |    4| 40.0|
  |    5|120.0|
  |    6| 50.0|
  |    6|112.0|
  |    8| 60.0|
  |    8|127.0|
  |    8|120.0|
  |    9|130.0|
  +-----+-----+

  Raw data, contains the anomalies
  57. 57. Apply the rules
  String filename = "data/dataset.csv";
  Dataset<Row> df = spark.read().format("csv")
      .option("inferSchema", "true")
      .option("header", "false")
      .load(filename);
  df = df.withColumn("guest", df.col("_c0")).drop("_c0");
  df = df.withColumn("price", df.col("_c1")).drop("_c1");
  df = df.withColumn(
      "price_no_min",
      callUDF("minimumPriceRule", df.col("price")));
  df.createOrReplaceTempView("price");
  df = spark.sql(
      "SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0");

  /jgperrin/net.jgp.labs.sparkdq4ml
  58. 58.
  +-----+-----+------------+
  |guest|price|price_no_min|
  +-----+-----+------------+
  |    1| 23.1|        23.1|
  |    2| 30.0|        30.0|
  |    2| 33.0|        33.0|
  |    3| 34.0|        34.0|
  |   24|142.0|       142.0|
  |   24|138.0|       138.0|
  |   25|  3.0|        -1.0|
  |   26| 10.0|        -1.0|
  |   25| 15.0|        -1.0|
  |   26|  4.0|        -1.0|
  |   28| 10.0|        -1.0|
  |   28|158.0|       158.0|
  |   30|170.0|       170.0|
  |   31|180.0|       180.0|
  +-----+-----+------------+

  Anomalies are clearly identified by -1, so they can easily be filtered out.
  59. 59. Filtering out anomalies
  String filename = "data/dataset.csv";
  Dataset<Row> df = spark.read().format("csv")
      .option("inferSchema", "true")
      .option("header", "false")
      .load(filename);
  df = df.withColumn("guest", df.col("_c0")).drop("_c0");
  df = df.withColumn("price", df.col("_c1")).drop("_c1");
  df = df.withColumn(
      "price_no_min",
      callUDF("minimumPriceRule", df.col("price")));
  df.createOrReplaceTempView("price");
  df = spark.sql(
      "SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0");

  /jgperrin/net.jgp.labs.sparkdq4ml
  60. 60.
  +-----+-----+
  |guest|price|
  +-----+-----+
  |    1| 23.1|
  |    2| 30.0|
  |    2| 33.0|
  |    3| 34.0|
  |    3| 30.0|
  |    4| 40.0|
  |   19|110.0|
  |   20|120.0|
  |   22|131.0|
  |   24|142.0|
  |   24|138.0|
  |   28|158.0|
  |   30|170.0|
  |   31|180.0|
  +-----+-----+

  Usable data
  61. 61. Format the data for ML ๏ Convert/adapt the dataset to Features and Label ๏ Required for Linear Regression in MLlib: ๏ needs a column called label of type double ๏ needs a column called features of type VectorUDT
  62. 62. Format the data for ML
  spark.udf().register(
      "vectorBuilder",
      new VectorBuilder(),
      new VectorUDT());

  df = df.withColumn("label", df.col("price"));
  df = df.withColumn("features",
      callUDF("vectorBuilder", df.col("guest")));

  // ... Lots of complex ML code goes here ...

  double p = model.predict(features);
  System.out.println("Prediction for " + feature + " guests is " + p);

  /jgperrin/net.jgp.labs.sparkdq4ml
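  VectorBuilder itself is not shown on the slides; here is a hedged sketch of what such a UDF could look like, assuming the guest column was inferred as an integer — it simply wraps the guest count in a dense vector:

  import org.apache.spark.ml.linalg.Vector;
  import org.apache.spark.ml.linalg.Vectors;
  import org.apache.spark.sql.api.java.UDF1;

  public class VectorBuilder implements UDF1<Integer, Vector> {
    @Override
    public Vector call(Integer value) throws Exception {
      // One feature only: the number of guests, as a 1-element dense vector.
      return Vectors.dense(value.doubleValue());
    }
  }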
  63. 63. Prediction for 40 guests
  +-----+-----+-----+--------+------------------+
  |guest|price|label|features|        prediction|
  +-----+-----+-----+--------+------------------+
  |    1| 23.1| 23.1|   [1.0]|24.563807596513133|
  |    2| 30.0| 30.0|   [2.0]|29.595283312577884|
  |    2| 33.0| 33.0|   [2.0]|29.595283312577884|
  |    3| 34.0| 34.0|   [3.0]| 34.62675902864264|
  |    3| 30.0| 30.0|   [3.0]| 34.62675902864264|
  |    3| 38.0| 38.0|   [3.0]| 34.62675902864264|
  |    4| 40.0| 40.0|   [4.0]| 39.65823474470739|
  |   14| 89.0| 89.0|  [14.0]| 89.97299190535493|
  |   16|102.0|102.0|  [16.0]|100.03594333748444|
  |   20|120.0|120.0|  [20.0]|120.16184620174346|
  |   22|131.0|131.0|  [22.0]|130.22479763387295|
  |   24|142.0|142.0|  [24.0]|140.28774906600245|
  +-----+-----+-----+--------+------------------+

  Prediction for 40.0 guests is 220.79136052303852
  64. 64. (the complex ML code)
  LinearRegression lr = new LinearRegression()
      .setMaxIter(40)
      .setRegParam(1)
      .setElasticNetParam(1);
  LinearRegressionModel model = lr.fit(df);

  Double feature = 40.0;
  Vector features = Vectors.dense(40.0);
  double p = model.predict(features);

  Define the algorithm and its (hyper)parameters. Create a model from our data. Apply the model to new data: predict.
  /jgperrin/net.jgp.labs.sparkdq4ml
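  For reference, the RMSE and r2 figures in the slide-52 output can be read from the model’s training summary — a hedged sketch (the lab’s own logging appears to label the intercept “Intersection”):

  import org.apache.spark.ml.regression.LinearRegressionTrainingSummary;

  LinearRegressionTrainingSummary summary = model.summary();
  System.out.println("RMSE: " + summary.rootMeanSquaredError());
  System.out.println("r2: " + summary.r2());
  System.out.println("Intersection: " + model.intercept());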
  65. 65. It’s all about the base model. (diagram — step 1, learning phase: a trainer fits a model on Dataset #1; steps 2..n, predictive phase: the same model is applied to Dataset #2 to produce predicted data)
  66. 66. Conclusion
  67. 67. A (Big) Data Scenario: Raw Data → Ingestion → Data Quality → Pure Data → Transformation → Rich Data → Load/Publish
  68. 68. Key takeaways ๏ Big Data is easier than one might think ๏ Java is the way to go (or Python) ๏ New vocabulary for using Spark ๏ You have a friend to help (ok, me) ๏ Spark is fun ๏ Spark is easily extensible
  69. 69. Going further ๏ Contact me @jgperrin ๏ Join the Spark User mailing list ๏ Get help from Stack Overflow ๏ fb.com/TriangleSpark
  70. 70. Going further: Spark in Action (Second edition, MEAP) by Jean Georges Perrin, published by Manning — http://jgp.net/sia. Codes: sparkjava-65CE, ctwato18 (one free book, 40% off).
  71. 71. Thanks @jgperrin
  72. 72. Backup
  73. 73. Spark in Action (Second edition, MEAP) by Jean Georges Perrin, published by Manning — http://jgp.net/sia
  74. 74. Credits ๏ Photos by Pexels ๏ IBM PC XT by Ruben de Rijcke — http://dendmedia.com/vintage/ — own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=3610862 ๏ Illustrations © Jean Georges Perrin
  75. 75. No more slides You’re on your own!
