3. Free online webinar events
Free 1-day local training events
Local user groups around the world
Online special interest user groups
Business analytics training
Free Online Resources
PASS Blog
White Papers
Session Recordings
Newsletter
www.pass.org
Explore everything PASS has to offer
PASS Connector
BA Insights
Get involved
4. Session evaluations
Your feedback is important and valuable. Submit by 5pm Friday, November 10th to win prizes.
3 ways to access:
Download the GuideBook App and search: PASS Summit 2017
Follow the QR code link displayed on session signage throughout the conference venue and in the program guide
Go to passSummit.com
5. Jen Stirrup
Data Whisperer, Data Relish UK
Postgraduate studies in Artificial Intelligence at universities in the UK and Paris
AI and BI consultant for 20 years
Global delivery of projects
Published author on Business Intelligence technology books
/jenstirrup @jenstirrup jenstirrup
10. Apache Spark™ is a fast and general engine for large-scale data processing.
11. Apache Spark
It is the largest open source project in data processing.
Since its release, Apache Spark has seen rapid adoption by enterprises across a wide range of industries.
Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning, and SQL workloads that require fast, iterative access to datasets.
15. Apache Spark
Apache Spark consists of Spark Core and a set of libraries.
The core is the distributed execution engine, and the Java, Scala, and Python APIs offer a platform for distributed ETL application development.
Quickly achieve success by writing applications in Java, Scala, or Python.
16. Resilient Distributed Datasets (RDDs)
Resilient Distributed Datasets (RDDs) are the fundamental
object used in Apache Spark.
RDDs are immutable collections representing datasets.
Any operation (a transformation or an action) creates a new RDD rather than modifying an existing one.
Each RDD also stores its lineage, which is used to recover from failures.
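The immutability and lineage ideas above can be sketched in plain Python. This is a toy single-machine analogy, not Spark's implementation or the PySpark API: each "RDD" wraps a snapshot of the data plus a record of the operations that produced it, and every transformation returns a new object.

```python
# Toy analogy of RDD immutability and lineage (NOT the Spark API).
class ToyRDD:
    def __init__(self, data, lineage=None):
        self.data = tuple(data)          # immutable snapshot of the dataset
        self.lineage = lineage or []     # the steps that derived this dataset

    def map(self, func):
        # A transformation never mutates self; it returns a NEW ToyRDD
        # and records the step in the lineage, which could be replayed
        # to recompute the data after a failure.
        return ToyRDD((func(x) for x in self.data),
                      self.lineage + [f"map({func.__name__})"])

    def filter(self, func):
        return ToyRDD((x for x in self.data if func(x)),
                      self.lineage + [f"filter({func.__name__})"])

def double(x): return x * 2
def is_even(x): return x % 2 == 0

base = ToyRDD([1, 2, 3, 4])
derived = base.map(double).filter(is_even)

print(base.data)        # (1, 2, 3, 4) -- the original is untouched
print(derived.data)     # (2, 4, 6, 8)
print(derived.lineage)  # ['map(double)', 'filter(is_even)']
```

The key point mirrored here is that `base` is never changed: recovery in Spark works by replaying the recorded lineage against the source data rather than by mutating state in place.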
18. Apache Spark
Spark comes with a built-in set of over 80 high-level operators, and you can use it interactively to query data within the shell.
In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning, and graph data processing.
19. Apache Spark
Developers can use these capabilities stand-alone or combine them in a single data pipeline.
20. Spark Components on HDInsight
Apache Spark is an open-source parallel processing
framework that supports in-memory processing to boost
the performance of big-data analytic applications.
A Spark cluster on HDInsight is compatible with Azure Storage (WASB) as well as Azure Data Lake Store.
22. Apache Spark
When you create a Spark cluster on HDInsight, you create
Azure compute resources with Spark installed and
configured.
It only takes about 10 minutes to create a Spark cluster in
HDInsight. The data to be processed is stored in Azure
Storage or Azure Data Lake Store.
23. Apache Spark
Spark provides primitives for in-memory cluster
computing.
A Spark job can load and cache data into memory and
query it repeatedly, much more quickly than disk-based
systems.
Spark also integrates into the Scala programming
language to let you manipulate distributed data sets like
local collections.
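The benefit of caching described above can be illustrated with a plain-Python analogy (again not the Spark API): without caching, every query pays the expensive load again; with a cached in-memory copy, the load happens once.

```python
# Toy analogy of Spark's in-memory caching: pay the expensive load
# once, then answer repeated queries from memory.
load_count = 0

def expensive_load():
    # Stands in for reading and parsing a large dataset from disk.
    global load_count
    load_count += 1
    return list(range(1_000))

# Uncached: each query triggers a fresh load from "disk".
sum(expensive_load())
max(expensive_load())
assert load_count == 2

# "Cached": load once, reuse the in-memory copy for every query,
# analogous to calling rdd.cache() before repeated actions.
load_count = 0
cached = expensive_load()
total = sum(cached)
maximum = max(cached)
assert load_count == 1
```

In real Spark the same shape applies: `rdd.cache()` marks the dataset for in-memory retention, so repeated actions reuse it instead of re-reading from disk.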
24. What does Spark give you?
Apache Spark is a powerful open source processing engine for Hadoop data, built around speed, ease of use, and sophisticated analytics.
When it comes to Big Data processing, speed always matters: we always want to process our huge datasets as fast as possible.
25. What does Spark give you?
Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and 10x faster even when running on disk.
Spark makes this possible by reducing the number of reads and writes to disk: it stores intermediate processing data in memory.
26. Why Spark?
Easy: Built on Spark’s lightweight yet powerful APIs, Spark Streaming
lets you rapidly develop streaming applications
Fault tolerant: Unlike other streaming solutions (e.g. Storm), Spark
Streaming recovers lost work and delivers exactly-once semantics out
of the box with no extra code or configuration
Integrated: Reuse the same code for batch and stream processing,
even joining streaming data to historical data
27. Why Spark?
It uses the concept of the Resilient Distributed Dataset (RDD), which allows it to transparently store data in memory and persist it to disk only when needed.
This helps to eliminate most of the disk reads and writes, the main time-consuming factors in data processing.
28. YARN: Data Operating System
YARN is one of the key features in the second-generation
Hadoop 2 version of the Apache Software Foundation's
open source distributed processing framework.
Originally described by Apache as a redesigned resource
manager, YARN is now characterized as a large-scale,
distributed operating system for big data applications.
29. YARN: Data Operating System
YARN is a software rewrite that decouples MapReduce's
resource management and scheduling capabilities from the
data processing component, enabling Hadoop to support
more varied processing approaches and a broader array of
applications.
30. Spark Deployment Modes:
Two deployment modes can be used to launch Spark applications:
In cluster mode, jobs are managed by the YARN cluster. The Spark driver runs
inside an Application Master (AM) process that is managed by YARN. This
means that the client can go away after initiating the application.
In client mode, the Spark driver runs in the client process, and the Application
Master is used only to request resources from YARN.
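The two modes above map directly onto the `--deploy-mode` flag of `spark-submit`. A sketch of both invocations follows; the application file name `my_app.py` is a placeholder.

```shell
# Cluster mode: the driver runs inside a YARN Application Master,
# so the submitting client can disconnect after launching the job.
spark-submit --master yarn --deploy-mode cluster my_app.py

# Client mode: the driver runs in the local client process; YARN is
# used only to request executor resources. Suited to interactive work,
# since the driver's output comes back to the client's console.
spark-submit --master yarn --deploy-mode client my_app.py
```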
31. Resilient Distributed Datasets
Resilient Distributed Datasets (RDD) is a fundamental data
structure of Spark. It is an immutable distributed collection
of objects.
Each dataset in RDD is divided into logical partitions, which
may be computed on different nodes of the cluster. RDDs
can contain any type of Python, Java, or Scala objects,
including user-defined classes.
32. Resilient Distributed Datasets
There are two ways to create RDDs:
Parallelizing an existing collection in your driver program, or
referencing a dataset in an external storage system, such as a shared
filesystem, HDFS, HBase, or any data source offering a Hadoop
InputFormat.
33. Resilient Distributed Datasets
Parallelized Collections
Parallelized collections are created by calling SparkContext’s
parallelize method on an existing collection in your driver
program (a Scala Seq). The elements of the collection are
copied to form a distributed dataset that can be operated
on in parallel.
34. Resilient Distributed Datasets
External Datasets
Spark can create distributed datasets from any storage
source supported by Hadoop, including your local file
system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark
supports text files, SequenceFiles, and any other Hadoop
InputFormat
35. Transformations
map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
distinct([numTasks]): Return a new dataset that contains the distinct elements of the source dataset.
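The per-element semantics of these three transformations can be mimicked with Python built-ins. This is an analogy for what each operator computes, not for Spark's distributed execution.

```python
# map / filter / distinct semantics, mimicked on a plain Python list.
source = [1, 2, 2, 3, 4, 4]

mapped = [x * 10 for x in source]             # map: transform every element
filtered = [x for x in source if x % 2 == 0]  # filter: keep elements where the predicate is true
distinct = sorted(set(source))                # distinct: unique elements
                                              # (Spark does not guarantee order; sorted here for display)

print(mapped)    # [10, 20, 20, 30, 40, 40]
print(filtered)  # [2, 2, 4, 4]
print(distinct)  # [1, 2, 3, 4]
```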
38. Session evaluations
Today, CIOs and other business decision-makers are increasingly recognizing the value of open source software and Azure cloud computing for the enterprise as a way of driving down costs whilst delivering enterprise capabilities. For the Business Intelligence professional, how can you introduce open source for analytics into the enterprise in a robust way, whilst also creating an architecture that accommodates cloud, on-premise, and hybrid architectures?
We will examine strategies for using open source technologies to improve common Business Intelligence issues, using Apache Spark as our backdrop to delivering open source Big Data analytics:
- incorporating Apache Spark into your existing projects
- your choices for parallelizing Apache Spark computations across the nodes of a Hadoop cluster
- how ScaleR works with Spark
- using sparklyr and SparkR within a ScaleR workflow
Join this session to learn more about open source with Azure for Business Intelligence.
Image credit: https://pixabay.com/en/users/Seanbatty-5097598/
No attribution required.
In this information age, we drive information to create value (Skok, 2013). But the tools which create this value have always required substantial economic capital.
Intelligence systems that learn and suggest what we need to know, based on:
History
Your colleague’s actions
Data behaviour
AI can make sense of data
Learn and predict what you need to see.
https://pixabay.com/en/directory-away-wisdom-education-229117/
We need something to prioritize the data for us.
Insights come in the form of KPIs, but they are automatic, suggestive, predictive, and drive value.
Gone are the days of reports and complicated dashboards.
We will see more focused, targeted information that can be consumed by users.
On mobile and the Apple Watch, we can react immediately.
https://pixabay.com/en/laptop-prezi-3d-presentation-mockup-2411303/
Augmented reality.
If we think we are in trouble over the three Vs….
Too much data.
Automated data integration
Blending data is essential to insights
Automated data
Blockchain – people are talking about currencies and a financial world that we don’t even use.
https://pixabay.com/en/block-chain-data-records-concept-2850277/
Generality: combine SQL, streaming, and complex analytics.
Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
Speed: run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Apache Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing.
Resilient Distributed Datasets (RDDs) are the fundamental object used in Apache Spark.
RDDs are immutable collections representing datasets and have the inbuilt capability of reliability and failure recovery.
By nature, RDDs create new RDDs upon any operation such as transformation or action. They also store the lineage, which is used to recover from failures.