
Learning spark ch01 - Introduction to Data Analysis with Spark

References to Spark Course

Course : Introduction to Big Data with Apache Spark : http://ouo.io/Mqc8L5
Course : Spark Fundamentals I : http://ouo.io/eiuoV
Course : Functional Programming Principles in Scala : http://ouo.io/rh4vv

Published in: Education

  1. Chapter 01: Introduction to Data Analysis with Spark (Learning Spark by Holden Karau et al.)
  2. Overview: Introduction to Data Analysis with Spark
       • What Is Apache Spark?
       • A Unified Stack
           • Spark Core
           • Spark SQL
           • Spark Streaming
           • MLlib
           • GraphX
           • Cluster Managers
       • Who Uses Spark, and for What?
           • Data Science Tasks
           • Data Processing Applications
       • A Brief History of Spark
       • Spark Versions and Releases
       • Storage Layers for Spark
  3. 1.1 What Is Apache Spark?
       • Apache Spark is a cluster computing platform
       • Spark extends the MapReduce model to support different kinds of computation
           • batch applications
           • iterative algorithms
           • interactive queries
           • streaming
       • Runs computations in memory
       • Highly accessible
           • simple APIs in Python, Java, Scala, and SQL
           • rich built-in libraries
           • access to Hadoop clusters and data sources
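To make "the MapReduce model" concrete, here is a minimal single-process sketch in plain Python. It is not Spark code and does not use any Spark API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample lines are hypothetical and chosen only for illustration. It shows the three conceptual steps that a framework such as MapReduce or Spark distributes across a cluster.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Group the emitted values by key, as a shuffle step would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine each key's values into a single count."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical sample input, held entirely in memory.
sample = ["spark extends mapreduce", "spark runs in memory"]
counts = word_count(sample)
print(counts["spark"])  # 2
```

In a real cluster each phase runs in parallel on many machines; the sketch only illustrates the data flow, not the distribution or the fault handling.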
  4. edX and Coursera Courses
       • Introduction to Big Data with Apache Spark
       • Spark Fundamentals I
       • Functional Programming Principles in Scala
  5. 1.2 A Unified Stack
  6. 1.2.1 A Unified Stack: Core, SQL, Streaming
       • Spark Core
           • Task scheduling
           • Memory management
           • Fault recovery
           • Storage system interaction
           • API that defines resilient distributed datasets (RDDs)
       • Spark SQL
           • Provides a SQL interface to Spark
           • Allows programmatic data manipulation to be mixed with SQL
       • Spark Streaming
           • Enables processing of live streams of data, e.g. web logs
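The RDD idea in Spark Core can be sketched with a toy class in plain Python. This is an illustration only, not Spark's API or implementation: `ToyRDD`, `load`, and the sample numbers are hypothetical. The sketch assumes two well-known properties of RDDs: transformations are recorded lazily, and a result can be recomputed from the recorded chain of steps (the lineage), which is the basis of fault recovery.

```python
class ToyRDD:
    """Toy stand-in for an RDD: a data source plus a recorded chain of steps."""

    def __init__(self, source, lineage=()):
        self._source = source    # callable that returns a fresh iterable
        self._lineage = lineage  # tuple of (kind, function) steps

    def map(self, fn):
        # Transformation: only records the step, computes nothing.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: only records the step, computes nothing.
        return ToyRDD(self._source, self._lineage + (("filter", pred),))

    def collect(self):
        # Action: replays the whole lineage against the source.
        data = iter(self._source())
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


calls = []  # records every time the source is actually read


def load():
    """Hypothetical data source; a real job would read from storage."""
    calls.append(1)
    return range(1, 6)


squares_of_evens = ToyRDD(load).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
reads_before_action = len(calls)      # 0: nothing has been computed yet
result = squares_of_evens.collect()   # [4, 16]
recomputed = squares_of_evens.collect()  # replays the lineage a second time
```

Because every `ToyRDD` carries its full lineage, a lost result can always be rebuilt by calling `collect()` again, at the cost of rereading the source. Real Spark applies the same idea per partition and can also keep results in memory to avoid the recomputation.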
  7. 1.2.2 A Unified Stack: MLlib, GraphX, Cluster Managers
       • MLlib
           • Contains common machine learning (ML) functionality
           • Classification, regression, clustering, collaborative filtering
           • Model evaluation, data import, lower-level ML primitives
       • GraphX
           • Extends the Spark RDD API, just like Spark SQL and Spark Streaming
           • Contains graph algorithms
       • Cluster Managers
           • Hadoop YARN, Apache Mesos
           • Default: Standalone Scheduler
  8. 1.3 Who Uses Spark, and for What?
       • General-purpose framework for cluster computing, used by data scientists and engineers
       • Data Scientists
           • Analyze and model data
           • SQL, statistics, predictive modeling (ML) using Python, R
           • Use cases: interactive shells for Python and Scala, Spark SQL, and MLlib libraries, with the option of calling out to MATLAB or R
       • Engineers
           • Data processing applications
           • Principles of software engineering (encapsulation, OOP, interface design)
  9. 1.4 A Brief History of Spark
       • 2009: started as a research project in the UC Berkeley RAD Lab, which later became the AMPLab
           • Motivation: Hadoop MapReduce was inefficient for iterative and interactive computing jobs
           • Designed for fast interactive queries and iterative algorithms
           • In-memory storage
           • Efficient fault recovery
           • 10-20x faster than MapReduce for certain jobs
       • Early adopters
           • Spark PoweredBy page
           • Spark Meetups
           • Spark Summit
       • 2011: Berkeley Data Analytics Stack (BDAS)
  10. 1.5 Spark Versions and Releases
       • May 2014: Spark 1.0.0
       • April 2015: Spark 1.3.1
       • Spark documentation
  11. 1.6 Storage Layers for Spark
       • Spark can create distributed datasets from HDFS and from other storage systems supported by the Hadoop APIs
           • Local filesystem
           • Amazon S3
           • Cassandra
           • Hive
           • HBase, etc.
       • Supported file formats
           • Text files
           • SequenceFiles
           • Avro
           • Parquet
           • Any other Hadoop InputFormat
  12. Learn More about Apache Spark
