Despite heavy investments in big data lakes, enterprises still rely on expensive proprietary products for data ingestion, integration, and transformation (ETL) when bringing data onto the lake and processing it there.
Enterprises have successfully tested Apache Spark and proven its versatility and strength as a distributed computing framework that can handle data processing, analytics, and machine learning workloads end to end.
Since Hadoop distributions and the public clouds already include Apache Spark, there is nothing new to procure. However, the skills required to put Spark to good use remain scarce.
In this webinar, we will discuss how Apache Spark can serve as an inexpensive enterprise backbone for all types of data processing workloads. We will also demo how a visual framework on top of Apache Spark makes it far more viable.
The following scenarios will be covered:
On-Prem
Data quality and ETL with Apache Spark using pre-built operators
Advanced monitoring of Spark pipelines
On Cloud
Visual interactive development of Apache Spark Structured Streaming pipelines
IoT use case with event time, late arrivals, and watermarks
Python-based predictive analytics running on Spark
5. It’s a role play!
Anand Venugopal “AV”
Key Influencer, Enterprise Data
Satisfied with the current setup
Prefers traditional vendors
Open to learning about and considering new technologies
6. Punit Shah
Apache Spark user and believer
Understands enterprise needs and legacy products
Up to date and hands-on with the latest in Apache Spark
Likes to build it for real and show it rather than talk about it
7. Big Data Solutions Architect
Just finished an Apache Spark project
Data platform for cyber security at a major bank
8. Vendor and technology selection, evaluation, POCs
Data storage and data processing
Ingest, integration, wrangling, predictive analytics, machine learning
Head of Enterprise Data Platforms
9. Head of Enterprise Data Platforms
6 vendor products
Matika - Big_data_edition
Allend
Fakta
Rakkle - Streams
SOS - Analytics
Rakkle - Big_data_appliance
10. Head of Enterprise Data Platforms
More overlapping vendors and products for similar tasks in other groups / departments
11. Head of Enterprise Data Platforms
3 years and a few million dollars spent on these products
12. Head of Enterprise Data Platforms
We are a 24x7 operation
Nothing can go down
Enterprise vendors are proven
This is no open source game!
13. Customer 360 / Churn
Predictive Maintenance
Fraud and Security
Personalized Recommendation Engine
Real-time Dashboards
The business stalls for a long time, then suddenly demands results
Integrated data silos, single source of truth
Ubiquitous, fast, self-service access to the data
“Big data enabled” use-cases
Head of Enterprise Data Platforms
14. Open source, especially Apache Spark, is becoming the de facto choice
Widely deployed in Fortune 500 enterprises
We see near-100% usage across our customer base
Big Data Solutions Architect
15. Apache Spark - Distributed in-memory computation framework
Originally created to massively speed up ML jobs on Hadoop (30X)
Versatile!
Big Data Solutions Architect
[Diagram: Spark workload modes – high-speed batch, micro-batch, streaming, interactive, iterative, graph – sitting on Hadoop and/or cloud]
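To make the versatility point concrete, here is a minimal PySpark sketch: the same DataFrame API drives batch SQL and an unbounded streaming source. The path, broker, and topic are hypothetical, and the Kafka source additionally requires the spark-sql-kafka package on the classpath.

```python
# Minimal sketch: one engine, one API for batch and streaming.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("versatility-demo").getOrCreate()

# Batch: read a Parquet dataset and query it with SQL
events = spark.read.parquet("/data/events")   # hypothetical path
events.createOrReplaceTempView("events")
spark.sql("SELECT device, count(*) AS n FROM events GROUP BY device").show()

# Streaming: the same DataFrame abstraction over an unbounded Kafka source
clicks = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical
          .option("subscribe", "clickstream")
          .load())
```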
16. Fault Tolerant
Exactly Once Semantics
Back Pressure and Dynamic Scaling
Elastic performance and throughput
Is Apache Spark Enterprise ready?
Big Data Solutions Architect
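As a rough sketch of where the exactly-once guarantee comes from in Structured Streaming: the query checkpoints its source offsets and committed sink state, so a restarted job resumes without loss or duplication. The paths below are hypothetical; the rate source is Spark's built-in test source.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("exactly-once-demo").getOrCreate()

# Built-in test source that emits rows at a fixed rate
events = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 100)
          .load())

# The checkpoint directory records offsets and committed files, giving
# end-to-end exactly-once semantics for the file sink across restarts.
query = (events.writeStream
         .format("parquet")
         .option("path", "/tmp/demo/out")                # hypothetical sink path
         .option("checkpointLocation", "/tmp/demo/chk")  # hypothetical checkpoint dir
         .start())
```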
17. Major US Airline – 3 nodes: 4 TB/day ingested, indexed, and rapidly queried – CX use case
Major US Bank – 4 nodes: ~200 million records/day – complex event processing
Tier 1 US Telco – 4 nodes: ~100 million records/day – contact center analytics
Larger deployments of 20, 50, and 100+ nodes – all stable over years
Is Apache Spark Enterprise ready?
Big Data Solutions Architect
18. Data Challenges to Implement Any Use Case
Establish Big Data Lake
Ingest – Batch and Streaming sources
Data Quality
Transformation
Blend & Enrich
Analytics – Rules, Statistical, Predictive, Prescriptive
Loading – Various target data stores
Visualization
Secure "Self-Service" Data Access
Governance
Head of Enterprise Data Platforms
19. End to End Data Processing with Apache Spark
Establish Big Data Lake
Ingest – Batch and Streaming sources
Data Quality - Cleanse
Transformation
Blend & Enrich
Analytics – Rules, Statistical, Predictive, Prescriptive
Loading – Various target data stores
Visualization
Secure "Self-Service" Data Access
Governance
Data 360
Big Data Solutions Architect
20. Data Processing Task: Ingest – Apache Spark API:
File systems and databases: HDFS, S3, Hive, RDBMS, ORC, Parquet (with partitioning support), TextFile, CSV, JSON, and more
Streaming sources: Kafka, RabbitMQ, JMS, AWS IoT Hub, Azure Event Hub, and more
Other sources: Redis, Couchbase, Apache Ignite, Elastic, Sqoop
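A minimal PySpark sketch of the ingest APIs listed above; the bucket, JDBC URL, broker, and topic are hypothetical, and the JDBC driver and spark-sql-kafka package must be on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest-demo").getOrCreate()

# Batch: partition-aware Parquet read from S3 (or HDFS)
orders = spark.read.parquet("s3a://datalake/orders/")   # hypothetical bucket

# Batch: RDBMS table over JDBC
customers = (spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://db:5432/crm")  # hypothetical
             .option("dbtable", "customers")
             .load())

# Streaming: a Kafka topic exposed as an unbounded DataFrame
clicks = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical
          .option("subscribe", "clickstream")
          .load())
```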
21. Data Processing Task: Cleanse (Data Quality) – Apache Spark API:
Filter with expressions
Deduplication
Time-based filtering using the watermark feature
Select queries with out-of-the-box column comparison operators such as gt, lt, where
DataFrame APIs such as drop, fill, distinct
Column-based filtering such as isNaN, isNull, like, etc.
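A minimal data-quality sketch using these DataFrame APIs; the input path and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cleanse-demo").getOrCreate()
raw = spark.read.json("/data/raw_events")   # hypothetical path

clean = (raw
         .filter(F.col("amount") > 0)                                # expression filter (gt)
         .filter(~F.isnan("amount") & F.col("user_id").isNotNull())  # isNaN / isNull checks
         .dropDuplicates(["event_id"])                               # deduplication
         .na.fill({"country": "unknown"})                            # fill missing values
         .drop("debug_payload"))                                     # drop unwanted columns

# On a streaming DataFrame, a watermark bounds how late events may arrive:
# deduped = stream.withWatermark("event_time", "10 minutes").dropDuplicates(["event_id"])
```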
22. Data Processing Task: Blend – Apache Spark API:
Stream – data-at-rest joins
Stream – stream joins (Spark 2.3)
Data at rest: cross joins, inner joins, conditional joins, broadcast joins, and more
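A minimal blend/enrich sketch over data at rest; paths and column names are hypothetical. The same join syntax applies to stream-static and, from Spark 2.3, stream-stream joins.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("blend-demo").getOrCreate()
orders = spark.read.parquet("/data/orders")        # hypothetical
customers = spark.read.parquet("/data/customers")  # hypothetical

# Inner join on a key, broadcasting the smaller side to avoid a shuffle
enriched = orders.join(F.broadcast(customers), on="customer_id", how="inner")

# Conditional join with an arbitrary predicate
flagged = orders.join(
    customers,
    (orders["customer_id"] == customers["customer_id"]) &
    (orders["amount"] > customers["credit_limit"]))
```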
23. Data Processing Task: Transform – Apache Spark API:
Core API functions
SQL functions
UDFs
Aggregations and group functions, state-based functions
Custom functions using foreach and foreachPartition
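A minimal transform sketch combining a built-in SQL function, a UDF, and a grouped aggregation; column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("transform-demo").getOrCreate()
df = spark.read.parquet("/data/orders")   # hypothetical

# Built-in SQL function
df = df.withColumn("order_date", F.to_date("order_ts"))

# A UDF for logic the built-ins do not cover (prefer built-ins where possible)
tier = F.udf(lambda amt: "high" if amt and amt > 1000 else "low", StringType())
df = df.withColumn("tier", tier(F.col("amount")))

# Grouped aggregation
daily = (df.groupBy("order_date", "tier")
           .agg(F.count("*").alias("orders"),
                F.sum("amount").alias("revenue")))
```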
24. Data Processing Task: Analytics – Apache Spark API:
Feature extraction: TF-IDF, Word2Vec, CountVectorizer, FeatureHasher
Feature transformers: OneHotEncoder, Binarizer, PCA, IndexToString, Interaction, SQLTransformer, StopWordsRemover, VectorAssembler, and more
Feature selectors: VectorSlicer, RFormula, ChiSqSelector
ML models: ClassificationModel, RegressionModel, RandomForestRegressionModel, and more
Dataset APIs: cube
Third-party integrations: H2O, notebooks, and more
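A minimal Spark ML pipeline sketch along these lines, chaining a feature transformer, an assembler, and a model; the dataset and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("analytics-demo").getOrCreate()
df = spark.read.parquet("/data/churn_features")   # hypothetical

indexer = StringIndexer(inputCol="churned", outputCol="label")
assembler = VectorAssembler(inputCols=["tenure", "spend", "tickets"],
                            outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

model = Pipeline(stages=[indexer, assembler, rf]).fit(df)
scored = model.transform(df)   # adds prediction and probability columns
```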
25. Data Processing Task: Load – Apache Spark API:
Custom sinks: foreach sink
Files: ORC, JSON, CSV, Parquet, with various compression options
Hive and RDBMS
NoSQL databases: HBase, Cassandra, AWS DynamoDB, and more
Indexing stores: Elastic, Solr
In-memory distributed caching: Redis, Ignite, Couchbase, and more
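A minimal load sketch for the file, JDBC, and Hive targets; paths, the JDBC URL, and table names are hypothetical, and the JDBC driver must be on the classpath.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("load-demo")
         .enableHiveSupport().getOrCreate())
df = spark.read.parquet("/data/curated")   # hypothetical

# File sink, partitioned on write, with compression
(df.write.mode("overwrite")
   .partitionBy("dt")
   .option("compression", "snappy")
   .parquet("/data/out"))

# RDBMS over JDBC
(df.write.format("jdbc")
   .option("url", "jdbc:postgresql://db:5432/mart")   # hypothetical
   .option("dbtable", "facts")
   .mode("append")
   .save())

# Hive table
df.write.mode("overwrite").saveAsTable("analytics.facts")
```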
26. Enterprise-Grade Hand-Coded Apache Spark?
Different programming model – will take a lot of re-training
Scalable platform and applications
Monitoring and DevOps challenges (debugging and diagnostics at scale?)
Version management of Spark pipelines
Promoting from Dev to Test to Production
Multi-tenancy
Manual Apache Spark coding strategy doesn’t scale
Head of Enterprise Data Platforms
27. Demo: A Visual IDE for Apache Spark
• ETL and Predictive Analytics
• Connected Car IoT Use Case
28. RECAP:
Apache Spark – the New Enterprise backbone for ETL, Batch and Real-time Streaming
Having too many point-solution vendors is a problem
Apache Spark - Great candidate for consolidating all data prep and compute workloads
Increase ROI on the big data lake investment and save further costs
Recommended approach - Visual Enterprise Grade Spark
Provided by StreamAnalytix from Impetus Technologies Inc.
Ingest, Cleanse, Blend, Transform, Analyze, Load, Visualize – All on one UI
29. Poll and Feedback – Please Respond
Do you agree that Apache Spark is a strong candidate to be the enterprise data processing backbone, as described in this webinar?
Would you be interested in a deeper dive into StreamAnalytix, a visual platform for Apache Spark, as shown in this webinar?
Webinar rating and feedback
30. Thank You
Questions?
Visit www.StreamAnalytix.com for a download or a cloud-based trial
Contact us at inquiry@streamanalytix.com for a proof of concept
Meet us at the Spark Summit and DataWorks Summit in June