1© Cloudera, Inc. All rights reserved.
Hadoop for the Data Scientist:
Spark in Cloudera 5.5
Anand Iyer | Senior Product Manager | Cloudera
Sandy Ryza | Senior Data Scientist | Cloudera
Agenda
• Apache Spark Overview
• Machine Learning with Hadoop and Spark
• Machine Learning Use Cases
• What’s Next
Cloudera Enterprise
Making Hadoop Fast, Easy, and Secure
A new kind of data platform:
• One place for unlimited data
• Unified, multi-framework data access
Cloudera makes it:
• Fast for business
• Easy to manage
• Secure without compromise
[Platform diagram layers: OPERATIONS; DATA MANAGEMENT; INTEGRATE (STRUCTURED, UNSTRUCTURED); PROCESS, ANALYZE, SERVE (BATCH, STREAM, SQL, SEARCH, SDK); UNIFIED SERVICES (RESOURCE MANAGEMENT, SECURITY); STORE (FILESYSTEM, RELATIONAL, NoSQL)]
One Platform, Many Workloads
Batch, Interactive, and Real-Time.
Leading performance and usability in one platform.
• End-to-end analytic workflows
• Access more data
• Work with data in new ways
• Enable new users
Platform components:
• OPERATIONS: Cloudera Manager, Cloudera Director
• DATA MANAGEMENT: Cloudera Navigator, Encrypt and KeyTrustee, Optimizer
• INTEGRATE: STRUCTURED (Sqoop), UNSTRUCTURED (Kafka, Flume)
• PROCESS, ANALYZE, SERVE: BATCH (Spark, Hive, Pig, MapReduce), STREAM (Spark), SQL (Impala), SEARCH (Solr), SDK (Kite)
• UNIFIED SERVICES: RESOURCE MANAGEMENT (YARN), SECURITY (Sentry, RecordService)
• STORE: FILESYSTEM (HDFS), RELATIONAL (Kudu), NoSQL (HBase)
Apache Spark
Flexible, in-memory data processing for Hadoop
Easy Development
• Rich APIs for Scala, Java, and Python
• Interactive shell
Flexible Extensible API
• APIs for different types of workloads: Batch, Streaming, Machine Learning, Graph
Fast Batch & Stream Processing
• In-Memory processing and caching
The Spark Ecosystem & Hadoop
[Platform diagram, with Spark as the BATCH & STREAM engine]
• INTEGRATE: STRUCTURED (Sqoop), UNSTRUCTURED (Kafka, Flume)
• BATCH & STREAM: Spark (Spark Streaming, Spark SQL, DataFrames, MLlib, …)
• SQL: Impala
• SEARCH: Solr
• SDK: Kite
• UNIFIED SERVICES: RESOURCE MANAGEMENT (YARN), SECURITY (Sentry, RecordService)
• STORE: FILESYSTEM (HDFS), RELATIONAL (Kudu), NoSQL (HBase)
Easy Machine Learning
on data distributed over a large cluster of machines
Process Flow for ML Development
Phases: Traditional Data Management → Model Development Phase → Production Modeling → Production Scoring*
Supporting layers: Metadata Management; Development Tools (IDEs, source control, notebooks); Scheduling, Workflow, Publishing
Steps shown in the diagram:
• Data Ingest
• Data Prep
• Feature Engineering
• Visualization
• Modeling (incl. hyperparameter search & model validation)
• Feature Generation + Model Building
• Model Quality, Usage, Perf. Metrics
• Experiments
• Batch Scorer
• Online Model Update Server + Scoring
*There may be further steps after scoring, such as aggregations, visualizations, reporting, etc.
What is MLlib?
Library of machine learning and data mining algorithms and utilities
• Implemented in Spark
• Invoked within Java, Scala, or Python Spark applications
MLlib applications are Spark applications
• Running them effectively requires Spark knowledge
• Recommended deployment on YARN
• MLlib apps require the same set of parameters that Spark applications require (number of executors, memory per executor, etc.)
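As a rough illustration of those parameters, the sketch below assembles a `spark-submit` command line for an MLlib job on YARN. The script name `train_model.py` and the resource values are hypothetical placeholders, not recommendations from the talk; `--master`, `--deploy-mode`, `--num-executors`, `--executor-memory`, and `--executor-cores` are standard `spark-submit` options.

```python
import shlex


def build_spark_submit(app, num_executors, executor_memory, executor_cores):
    """Return the argv list for submitting a Spark (and therefore MLlib) app to YARN."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--num-executors", str(num_executors),   # number of executors
        "--executor-memory", executor_memory,    # memory per executor
        "--executor-cores", str(executor_cores), # cores per executor
        app,
    ]


# Hypothetical application and sizing, for illustration only.
cmd = build_spark_submit("train_model.py", num_executors=10,
                         executor_memory="4g", executor_cores=2)
print(" ".join(shlex.quote(part) for part in cmd))
```

An MLlib job is sized exactly like any other Spark job; nothing in the command is specific to machine learning.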
What Does MLlib Contain?
• Machine learning models for classification and regression
• Recommender System
• Clustering Algorithms
• Feature Engineering Algorithms and Utilities
• Data Mining Algorithms & Basic Statistical Analysis Utilities
Classification & Regression
Traditional Models
• Linear and Logistic Regression
• Naïve Bayes
• Decision Trees
• Support Vector Machines
Next-Gen Models
• Gradient Boosted Trees
• Random Forests
Clustering Algorithms
• K-Means
• Power Iteration Clustering (PIC)
• Gaussian Mixture Model
• Streaming K-Means
Textual data clustering, i.e., identifying “topics” in a corpus of documents:
• Latent Dirichlet Allocation (LDA)
Collaborative Filtering
For Building Recommender Systems
• Predicting the interests of a user by collecting a partial list of preferences from many users
• Predicting missing items of a user-item association matrix
• Algorithm used: Alternating Least Squares
• Admittedly limited choice of algorithms
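To make the idea of filling in a user-item matrix concrete, here is a toy, single-machine sketch of Alternating Least Squares with one latent factor per user and item. It is not MLlib's distributed implementation; the function name `als_rank1`, the regularization value, and the ratings are all made up for illustration.

```python
def als_rank1(ratings, lam=0.01, iterations=100):
    """Fit rating ≈ user_factor * item_factor on the observed entries only.

    ratings maps (user, item) -> observed rating.
    """
    users = {u for u, _ in ratings}
    items = {i for _, i in ratings}
    user_f = {u: 1.0 for u in users}
    item_f = {i: 1.0 for i in items}
    for _ in range(iterations):
        # Hold item factors fixed and solve the least-squares problem per user.
        for u in users:
            num = sum(r * item_f[i] for (uu, i), r in ratings.items() if uu == u)
            den = lam + sum(item_f[i] ** 2 for (uu, i) in ratings if uu == u)
            user_f[u] = num / den
        # Hold user factors fixed and solve per item.
        for i in items:
            num = sum(r * user_f[u] for (u, ii), r in ratings.items() if ii == i)
            den = lam + sum(user_f[u] ** 2 for (u, ii) in ratings if ii == i)
            item_f[i] = num / den
    return user_f, item_f


# Made-up ratings; user "c" has not rated item "y".
ratings = {("a", "x"): 1.0, ("a", "y"): 2.0,
           ("b", "x"): 2.0, ("b", "y"): 4.0,
           ("c", "x"): 3.0}
user_f, item_f = als_rank1(ratings)
prediction = user_f["c"] * item_f["y"]  # estimate for the missing entry
print(round(prediction, 2))
```

Each half-step is an ordinary least-squares solve, which is why the method alternates between users and items instead of optimizing both at once.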
Feature Engineering & Modeling Utilities
• Feature Scaling & Normalization
• Statistical Correlation Functions (Pearson & Spearman’s)
• Tests of Statistical Significance
• Chi-Squared independence test for feature selection
• Evaluation metrics: Precision, Recall, AUROC, F1-Measure, etc
Dimensionality Reduction:
• Principal Component Analysis (PCA)
• Singular Value Decomposition (SVD)
Textual Feature Generation:
• Word2Vec
• Term Frequency – Inverse Document Frequency (TF-IDF)
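As a small illustration of textual feature generation, the sketch below computes TF-IDF weights in plain Python. It assumes raw term counts for TF and one common smoothed form of IDF, log((N + 1) / (df + 1)); other libraries use slightly different formulas, and the function name and documents are made up.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    idf = {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}
    return [{term: count * idf[term] for term, count in Counter(doc).items()}
            for doc in documents]


docs = [["spark", "hadoop"], ["spark", "mllib"], ["spark"]]
weights = tf_idf(docs)
print(weights)
```

A term that appears in every document, such as "spark" here, gets weight zero, while rarer terms are weighted up.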
Data Mining: Frequent Pattern Mining
Data Mining Urban Legend:
A Frequent Pattern Mining algorithm run on supermarket purchase data revealed: “Men who buy diapers have a very high likelihood of buying beer!”
Algorithms in MLlib:
• Frequent Pattern-Growth
• Association Rule Mining
• PrefixSpan
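The diapers-and-beer story boils down to two quantities that association rule mining reports: support and confidence. The sketch below computes both by brute force over a handful of made-up baskets. It does not implement FP-Growth or PrefixSpan, and the function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Made-up baskets, for illustration only.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]
rule_support = support(baskets, {"diapers", "beer"})
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})
print(rule_support, rule_confidence)
```

Algorithms such as FP-Growth exist because enumerating every candidate itemset this way does not scale to large datasets.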
What about “Deep Learning”?
Deep Learning is an umbrella term for large, complex Multi-Layer Neural Networks
• MLlib contains a robust Multilayer Neural Network implementation
Pipeline API: Hooking the pieces together
• Inspired by scikit-learn pipelines
• ML involves running multiple sequential steps
E.g., Text Classification Pipeline:
Tokenize → Bag of Words → TF-IDF → LDA → Scale & Normalize Features → Train Classifier
Sequence is repeated during Training and Scoring
Hyper-Parameter Tuning → Repeat Sequence with different parameter values
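Repeating the sequence with different parameter values amounts to a grid search. The sketch below enumerates parameter combinations and keeps the best one. The parameter names `reg` and `num_topics` and the scoring function `toy_score` are hypothetical stand-ins for refitting and evaluating the whole pipeline.

```python
from itertools import product


def param_grid(options):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = sorted(options)
    return [dict(zip(names, values))
            for values in product(*(options[name] for name in names))]


def tune(options, fit_and_score):
    """Run the full sequence once per combination and keep the best-scoring one."""
    return max(param_grid(options), key=fit_and_score)


def toy_score(params):
    # Stand-in for "fit the pipeline, then evaluate on held-out data".
    return -(params["reg"] - 0.1) ** 2 - (params["num_topics"] - 20) ** 2 / 1000


options = {"reg": [0.01, 0.1, 1.0], "num_topics": [10, 20, 50]}
grid = param_grid(options)
best = tune(options, toy_score)
print(len(grid), best)
```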
Overview of Pipeline API
• Create Pipeline as a sequence of Stages:
  • Transformers: Transform or augment features
  • Estimators: Fit a model
• Re-use Pipeline
  • Basic save and load functionality available
  • Invoke Pipeline with a different set of parameters passed as a ParamMap
Process Flow for ML Development
Traditional Data
Management
Model Development
Phase
Production
Modeling
Production
Scoring*
Metadata Management
Development Tools (IDEs, source
control, notebooks) Scheduling, Workflow, Publishing
Data
Ingest
Data
Prep Feature
Engineering
Visualization
Modeling (incl.
hyperparameter
search & model
validation)
Feature
Generation +
Model Building
Model
Quality,
Usage,
Perf.
Metrics
Experiments
Batch
Scorer
Online
Model
Update
Server +
Scoring
*There may be further steps after scoring such as
aggregations, visualizations, reporting, etc
27© Cloudera, Inc. All rights reserved.
Process Flow for ML Development
Traditional Data
Management
Model Development
Phase
Production
Modeling
Production
Scoring*
Metadata Management
Development Tools (IDEs, source
control, notebooks) Scheduling, Workflow, Publishing
Data
Ingest
Data
Prep Feature
Engineering
Visualization
Modeling (incl.
hyperparameter
search & model
validation)
Feature
Generation +
Model Building
Model
Quality,
Usage,
Perf.
Metrics
Experiments
Batch
Scorer
Online
Model
Update
Server +
Scoring
*There may be further steps after scoring such as
aggregations, visualizations, reporting, etc
28© Cloudera, Inc. All rights reserved.
Process Flow for ML Development
Traditional Data
Management
Model Development
Phase
Production
Modeling
Production
Scoring*
Metadata Management
Development Tools (IDEs, source
control, notebooks) Scheduling, Workflow, Publishing
Data
Ingest
Data
Prep Feature
Engineering
Visualization
Modeling (incl.
hyperparameter
search & model
validation)
Feature
Generation +
Model Building
Model
Quality,
Usage,
Perf.
Metrics
Experiments
Batch
Scorer
Online
Model
Update
Server +
Scoring
*There may be further steps after scoring such as
aggregations, visualizations, reporting, etc
29© Cloudera, Inc. All rights reserved.
Process Flow for ML Development
Traditional Data
Management
Model Development
Phase
Production
Modeling
Production
Scoring*
Metadata Management
Development Tools (IDEs, source
control, notebooks) Scheduling, Workflow, Publishing
Data
Ingest
Data
Prep Feature
Engineering
Visualization
Modeling (incl.
hyperparameter
search & model
validation)
Feature
Generation +
Model Building
Model
Quality,
Usage,
Perf.
Metrics
Experiments
Batch
Scorer
Online
Model
Update
Server +
Scoring
*There may be further steps after scoring such as
aggregations, visualizations, reporting, etc
30© Cloudera, Inc. All rights reserved.
Process Flow for ML Development
Traditional Data
Management
Model Development
Phase
Production
Modeling
Production
Scoring*
Metadata Management
Development Tools (IDEs, source
control, notebooks) Scheduling, Workflow, Publishing
Data
Ingest
Data
Prep Feature
Engineering
Visualization
Modeling (incl.
hyperparameter
search & model
validation)
Feature
Generation +
Model Building
Model
Quality,
Usage,
Perf.
Metrics
Experiments
Batch
Scorer
Online
Model
Update
Server +
Scoring
*There may be further steps after scoring such as
aggregations, visualizations, reporting, etc
31© Cloudera, Inc. All rights reserved.
Process Flow for ML Development
[Same process flow diagram as shown earlier in the deck]
Score streaming events in Spark Streaming.
Machine Learning Use Case
Predicting Influencers at a Large Telco
• Customer loyalty difficult and expensive
• Aggressive competition
Social Churn
• Churn is not an isolated event!
• When influential subscribers leave, they take their friends with them
Casting This as a Data Science Problem
• Can we quantify: Which lost users were the most influential?
• Can we predict: Which current subscribers have as much influence?
The Challenge: Lots of Customers, Lots of Data
• Over 100 million customers
• Over 1 billion connections
Calculating Influencer Scores
• Connection: pair of users with communication both ways
• Influencer score: number of connected users that churn after user X churns
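The two definitions above can be sketched on a toy call log. The slide gives no time window or data schema, so this sketch assumes a list of (caller, callee) pairs and a churn day per churned user, and it counts any later churn among connections. All names and data are made up.

```python
from collections import defaultdict


def connections(calls):
    """Map each user to the users they have two-way communication with."""
    directed = set(calls)
    graph = defaultdict(set)
    for caller, callee in directed:
        if caller != callee and (callee, caller) in directed:
            graph[caller].add(callee)
    return graph


def influencer_scores(calls, churn_day):
    """For each churned user X, count connections that churned after X did."""
    graph = connections(calls)
    return {
        user: sum(1 for other in graph[user]
                  if other in churn_day and churn_day[other] > day)
        for user, day in churn_day.items()
    }


# Made-up call log and churn days; "eve" never called "bob" back.
calls = [("ann", "bob"), ("bob", "ann"), ("ann", "cat"), ("cat", "ann"),
         ("ann", "dan"), ("dan", "ann"), ("bob", "eve")]
churn_day = {"ann": 10, "bob": 12, "cat": 30, "dan": 5}
scores = influencer_scores(calls, churn_day)
print(scores)
```

At the scale described in the talk, the same logic would be expressed as distributed joins over the user and connection tables instead of in-memory dictionaries.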
Predicting Influencer Scores
MLlib!
• Regression model
  • Linear regression
  • Random forests
• Features
  • # of connections, # of calls to connections
  • Internal vs. External
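As a minimal picture of the regression step, the sketch below fits an ordinary least-squares line that predicts an influencer score from a single feature, the number of connections. The data is fabricated to lie exactly on a line, and the function name is hypothetical. The model described in the talk uses several features and MLlib's implementations.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ≈ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up training data: number of connections vs. historical influencer score.
num_connections = [1, 2, 3, 4, 5]
historical_score = [3, 5, 7, 9, 11]
slope, intercept = fit_line(num_connections, historical_score)
predicted = slope * 6 + intercept  # prediction for a user with 6 connections
print(slope, intercept, predicted)
```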
Breaking Down the Work
• Building User and Connection Tables
• Computing Historical Influencer Scores
• Feature Generation
• Model Fitting
• Model Evaluation
What’s Next
Roadmap Update
MANAGEMENT
• Initial Spark-on-YARN integration for shared resource management
• New metrics for easier diagnosis
• Improved Spark-on-YARN for better multi-tenancy, performance, ease of use
• Automated configurations to optimize over time
• Visibility into resource utilization
• Improved PySpark integration for Python access
SECURITY
• Kerberos-based authorization
• Fine-grained access control
• Auditing and lineage (Governance)
• Integration with Intel’s Advanced Encryption libraries
• Full PCI compliance
SCALE
• Improved integration with HDFS to enable scheduling
• Reduced memory pressure on larger jobs
• Dynamic resource utilization and prioritization
• Stress test at scale with mixed multi-tenant workloads
STREAMING
• Spark Streaming resiliency for zero data loss
• Data ingest integration for Kafka and Flume
• Improved state management for better performance
• Higher-level language extensions
Download Cloudera 5.5
cloudera.com/downloads
Data Science & Spark Training Courses
university.cloudera.com
Thank You
Spark Resources
• Learn Spark
  • O’Reilly Advanced Analytics with Spark eBook (written by Clouderans)
  • Cloudera Developer Blog: blog.cloudera.com/spark
  • Spark Page: cloudera.com/spark
• Get Trained
  • Cloudera Spark Training: university.cloudera.com
• Try it Out
  • Cloudera Live Spark Tutorial: cloudera.com/live
  • Download Cloudera 5.5: cloudera.com/downloads

Más contenido relacionado

La actualidad más candente

Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...Cloudera, Inc.
 
Building Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache SparkBuilding Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache SparkJeremy Beard
 
One Hadoop, Multiple Clouds - NYC Big Data Meetup
One Hadoop, Multiple Clouds - NYC Big Data MeetupOne Hadoop, Multiple Clouds - NYC Big Data Meetup
One Hadoop, Multiple Clouds - NYC Big Data MeetupAndrei Savu
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache KuduAndriy Zabavskyy
 
Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?Cloudera, Inc.
 
Successes, Challenges, and Pitfalls Migrating a SAAS business to Hadoop
Successes, Challenges, and Pitfalls Migrating a SAAS business to HadoopSuccesses, Challenges, and Pitfalls Migrating a SAAS business to Hadoop
Successes, Challenges, and Pitfalls Migrating a SAAS business to HadoopDataWorks Summit/Hadoop Summit
 
Machine Learning Loves Hadoop
Machine Learning Loves HadoopMachine Learning Loves Hadoop
Machine Learning Loves HadoopCloudera, Inc.
 
Low latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduLow latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduDataWorks Summit
 
Running Hadoop as Service in AltiScale Platform
Running Hadoop as Service in AltiScale PlatformRunning Hadoop as Service in AltiScale Platform
Running Hadoop as Service in AltiScale PlatformInMobi Technology
 
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...DataWorks Summit
 
Apache Flink & Kudu: a connector to develop Kappa architectures
Apache Flink & Kudu: a connector to develop Kappa architecturesApache Flink & Kudu: a connector to develop Kappa architectures
Apache Flink & Kudu: a connector to develop Kappa architecturesNacho García Fernández
 
Impala use case @ Zoosk
Impala use case @ ZooskImpala use case @ Zoosk
Impala use case @ ZooskCloudera, Inc.
 
Enabling the Active Data Warehouse with Apache Kudu
Enabling the Active Data Warehouse with Apache KuduEnabling the Active Data Warehouse with Apache Kudu
Enabling the Active Data Warehouse with Apache KuduGrant Henke
 
Provisioning Big Data Platform using Cloudbreak & Ambari
Provisioning Big Data Platform using Cloudbreak & AmbariProvisioning Big Data Platform using Cloudbreak & Ambari
Provisioning Big Data Platform using Cloudbreak & AmbariDataWorks Summit/Hadoop Summit
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 
Solr consistency and recovery internals
Solr consistency and recovery internalsSolr consistency and recovery internals
Solr consistency and recovery internalsCloudera, Inc.
 
Intel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data SuccessIntel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data SuccessCloudera, Inc.
 
Hadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluationHadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluationmattlieber
 

La actualidad más candente (20)

Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
 
Building Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache SparkBuilding Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache Spark
 
One Hadoop, Multiple Clouds - NYC Big Data Meetup
One Hadoop, Multiple Clouds - NYC Big Data MeetupOne Hadoop, Multiple Clouds - NYC Big Data Meetup
One Hadoop, Multiple Clouds - NYC Big Data Meetup
 
Cloudbreak - Technical Deep Dive
Cloudbreak - Technical Deep DiveCloudbreak - Technical Deep Dive
Cloudbreak - Technical Deep Dive
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache Kudu
 
Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?Hadoop on Cloud: Why and How?
Hadoop on Cloud: Why and How?
 
Successes, Challenges, and Pitfalls Migrating a SAAS business to Hadoop
Successes, Challenges, and Pitfalls Migrating a SAAS business to HadoopSuccesses, Challenges, and Pitfalls Migrating a SAAS business to Hadoop
Successes, Challenges, and Pitfalls Migrating a SAAS business to Hadoop
 
Machine Learning Loves Hadoop
Machine Learning Loves HadoopMachine Learning Loves Hadoop
Machine Learning Loves Hadoop
 
Low latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduLow latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache Kudu
 
Running Hadoop as Service in AltiScale Platform
Running Hadoop as Service in AltiScale PlatformRunning Hadoop as Service in AltiScale Platform
Running Hadoop as Service in AltiScale Platform
 
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
 
Apache Flink & Kudu: a connector to develop Kappa architectures
Apache Flink & Kudu: a connector to develop Kappa architecturesApache Flink & Kudu: a connector to develop Kappa architectures
Apache Flink & Kudu: a connector to develop Kappa architectures
 
Impala use case @ Zoosk
Impala use case @ ZooskImpala use case @ Zoosk
Impala use case @ Zoosk
 
Enabling the Active Data Warehouse with Apache Kudu
Enabling the Active Data Warehouse with Apache KuduEnabling the Active Data Warehouse with Apache Kudu
Enabling the Active Data Warehouse with Apache Kudu
 
Provisioning Big Data Platform using Cloudbreak & Ambari
Provisioning Big Data Platform using Cloudbreak & AmbariProvisioning Big Data Platform using Cloudbreak & Ambari
Provisioning Big Data Platform using Cloudbreak & Ambari
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Solr consistency and recovery internals
Solr consistency and recovery internalsSolr consistency and recovery internals
Solr consistency and recovery internals
 
Intel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data SuccessIntel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data Success
 
Hybrid is the New Normal
Hybrid is the New NormalHybrid is the New Normal
Hybrid is the New Normal
 
Hadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluationHadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluation
 

Similar a Hadoop for the Data Scientist: Spark in Cloudera 5.5

Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Uri Laserson
 
MySQL Enterprise Portfolio
MySQL Enterprise PortfolioMySQL Enterprise Portfolio
MySQL Enterprise PortfolioAbel Flórez
 
Consolidate your data marts for fast, flexible analytics 5.24.18
Consolidate your data marts for fast, flexible analytics 5.24.18Consolidate your data marts for fast, flexible analytics 5.24.18
Consolidate your data marts for fast, flexible analytics 5.24.18Cloudera, Inc.
 
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...Cloudera, Inc.
 
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...Data Con LA
 
Using MySQL in the Cloud
Using MySQL in the CloudUsing MySQL in the Cloud
Using MySQL in the CloudMatt Lord
 
MySQL London Tech Tour March 2015 - MySQL Fabric
MySQL London Tech Tour March 2015 - MySQL FabricMySQL London Tech Tour March 2015 - MySQL Fabric
MySQL London Tech Tour March 2015 - MySQL FabricMark Swarbrick
 
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiWhither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiFelicia Haggarty
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Cloudera, Inc.
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache SparkMiklos Christine
 
MySQL in oracle_public_cloud
MySQL in oracle_public_cloudMySQL in oracle_public_cloud
MySQL in oracle_public_cloudOracleMySQL
 
Amazon Redshift with Full 360 Inc.
Amazon Redshift with Full 360 Inc.Amazon Redshift with Full 360 Inc.
Amazon Redshift with Full 360 Inc.Amazon Web Services
 
MySQL in oracle public cloud
MySQL in oracle public cloudMySQL in oracle public cloud
MySQL in oracle public cloudMandy Ang
 
Choosing the Right Business Intelligence Tools for Your Data and Architectura...
Choosing the Right Business Intelligence Tools for Your Data and Architectura...Choosing the Right Business Intelligence Tools for Your Data and Architectura...
Choosing the Right Business Intelligence Tools for Your Data and Architectura...Victor Holman
 
Oracle super cluster for oracle e business suite
Oracle super cluster for oracle e business suiteOracle super cluster for oracle e business suite
Oracle super cluster for oracle e business suiteOTN Systems Hub
 
Latest Innovations in Database as a Service Enabled by Oracle Enterprise Manager
Latest Innovations in Database as a Service Enabled by Oracle Enterprise ManagerLatest Innovations in Database as a Service Enabled by Oracle Enterprise Manager
Latest Innovations in Database as a Service Enabled by Oracle Enterprise ManagerHari Srinivasan
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...DataStax
 
Adobe Spark Meetup - 9/19/2018 - San Jose, CA
Adobe Spark Meetup - 9/19/2018 - San Jose, CAAdobe Spark Meetup - 9/19/2018 - San Jose, CA
Adobe Spark Meetup - 9/19/2018 - San Jose, CAJaemi Bremner
 
Stream Analytics in the Enterprise
Stream Analytics in the EnterpriseStream Analytics in the Enterprise
Stream Analytics in the EnterpriseJesus Rodriguez
 

Similar a Hadoop for the Data Scientist: Spark in Cloudera 5.5 (20)

Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)
 
MySQL Enterprise Portfolio
MySQL Enterprise PortfolioMySQL Enterprise Portfolio
MySQL Enterprise Portfolio
 
Consolidate your data marts for fast, flexible analytics 5.24.18
Consolidate your data marts for fast, flexible analytics 5.24.18Consolidate your data marts for fast, flexible analytics 5.24.18
Consolidate your data marts for fast, flexible analytics 5.24.18
 
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
 
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering b...
 
Using MySQL in the Cloud
Using MySQL in the CloudUsing MySQL in the Cloud
Using MySQL in the Cloud
 
MySQL London Tech Tour March 2015 - MySQL Fabric
MySQL London Tech Tour March 2015 - MySQL FabricMySQL London Tech Tour March 2015 - MySQL Fabric
MySQL London Tech Tour March 2015 - MySQL Fabric
 
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiWhither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache Spark
 
MySQL Fabric
MySQL FabricMySQL Fabric
MySQL Fabric
 
MySQL in oracle_public_cloud
MySQL in oracle_public_cloudMySQL in oracle_public_cloud
MySQL in oracle_public_cloud
 
Amazon Redshift with Full 360 Inc.
Amazon Redshift with Full 360 Inc.Amazon Redshift with Full 360 Inc.
Amazon Redshift with Full 360 Inc.
 
MySQL in oracle public cloud
MySQL in oracle public cloudMySQL in oracle public cloud
MySQL in oracle public cloud
 
Choosing the Right Business Intelligence Tools for Your Data and Architectura...
Choosing the Right Business Intelligence Tools for Your Data and Architectura...Choosing the Right Business Intelligence Tools for Your Data and Architectura...
Choosing the Right Business Intelligence Tools for Your Data and Architectura...
 
Oracle super cluster for oracle e business suite
Oracle super cluster for oracle e business suiteOracle super cluster for oracle e business suite
Oracle super cluster for oracle e business suite
 
Latest Innovations in Database as a Service Enabled by Oracle Enterprise Manager
Latest Innovations in Database as a Service Enabled by Oracle Enterprise ManagerLatest Innovations in Database as a Service Enabled by Oracle Enterprise Manager
Latest Innovations in Database as a Service Enabled by Oracle Enterprise Manager
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
 
Adobe Spark Meetup - 9/19/2018 - San Jose, CA
Adobe Spark Meetup - 9/19/2018 - San Jose, CAAdobe Spark Meetup - 9/19/2018 - San Jose, CA
Adobe Spark Meetup - 9/19/2018 - San Jose, CA
 
Stream Analytics in the Enterprise
Stream Analytics in the EnterpriseStream Analytics in the Enterprise
Stream Analytics in the Enterprise
 

Hadoop for the Data Scientist: Spark in Cloudera 5.5

  • 1. 1© Cloudera, Inc. All rights reserved. Hadoop for the Data Scientist: Spark in Cloudera 5.5 Anand Iyer | Senior Product Manager | Cloudera Sandy Ryza | Senior Data Scientist | Cloudera
  • 2. Agenda • Apache Spark Overview • Machine Learning with Hadoop and Spark • Machine Learning Use Cases • What’s Next
  • 3. Cloudera Enterprise: Making Hadoop Fast, Easy, and Secure. A new kind of data platform: • One place for unlimited data • Unified, multi-framework data access. Cloudera makes it: • Fast for business • Easy to manage • Secure without compromise. [Platform diagram: OPERATIONS; DATA MANAGEMENT; PROCESS, ANALYZE, SERVE (BATCH, STREAM, SQL, SEARCH, SDK); UNIFIED SERVICES (RESOURCE MANAGEMENT, SECURITY); STORE (FILESYSTEM, RELATIONAL, NoSQL); INTEGRATE (STRUCTURED, UNSTRUCTURED)]
  • 4. One Platform, Many Workloads: Batch, Interactive, and Real-Time. Leading performance and usability in one platform. • End-to-end analytic workflows • Access more data • Work with data in new ways • Enable new users. [Platform diagram: OPERATIONS (Cloudera Manager, Cloudera Director); DATA MANAGEMENT (Cloudera Navigator, Encrypt and KeyTrustee, Optimizer); PROCESS, ANALYZE, SERVE: BATCH (Spark, Hive, Pig, MapReduce), STREAM (Spark), SQL (Impala), SEARCH (Solr), SDK (Kite); UNIFIED SERVICES: RESOURCE MANAGEMENT (YARN), SECURITY (Sentry, RecordService); STORE: FILESYSTEM (HDFS), RELATIONAL (Kudu), NoSQL (HBase); INTEGRATE: STRUCTURED (Sqoop), UNSTRUCTURED (Kafka, Flume)]
  • 5. Apache Spark: Flexible, in-memory data processing for Hadoop • Easy Development: rich APIs for Scala, Java, and Python; interactive shell • Flexible Extensible API: APIs for different types of workloads (Batch, Streaming, Machine Learning, Graph) • Fast Batch & Stream Processing: in-memory processing and caching
  • 6. The Spark Ecosystem & Hadoop. [Platform diagram: BATCH & STREAM: Spark (Spark Streaming, Spark SQL, DataFrames, MLlib, …); SQL (Impala); SEARCH (Solr); SDK (Kite); UNIFIED SERVICES: RESOURCE MANAGEMENT (YARN), SECURITY (Sentry, RecordService); STORE: FILESYSTEM (HDFS), RELATIONAL (Kudu), NoSQL (HBase); INTEGRATE: STRUCTURED (Sqoop), UNSTRUCTURED (Kafka, Flume)]
  • 7. Easy Machine Learning on data distributed over a large cluster of machines
  • 8. Process Flow for ML Development. [Diagram: Traditional Data Management; Model Development Phase; Production Modeling; Production Scoring*; Metadata Management; Development Tools (IDEs, source control, notebooks); Scheduling, Workflow, Publishing; Data Ingest; Data Prep; Feature Engineering; Visualization; Modeling (incl. hyperparameter search & model validation); Feature Generation + Model Building; Model Quality, Usage, Perf. Metrics; Experiments; Batch Scorer; Online Model Update Server + Scoring] *There may be further steps after scoring, such as aggregations, visualizations, reporting, etc.
  • 9. What is MLlib? Library of machine learning and data mining algorithms and utilities • Implemented in Spark • Invoked within Java, Scala, or Python Spark applications. MLlib applications are Spark applications • Requires Spark knowledge to run effectively • Recommended deployment on YARN • MLlib apps require the same set of parameters that Spark applications require (number of executors, memory per executor, etc.)
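
Because MLlib applications take the same resource parameters as any other Spark application, launching one on YARN could be assembled roughly as in the sketch below. The script name `train_model.py` and the sizes are hypothetical; the flags shown are standard `spark-submit` options.

```python
def build_spark_submit(app, num_executors, executor_memory, executor_cores):
    """Assemble a spark-submit command line for a YARN deployment."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--num-executors", str(num_executors),   # number of executors
        "--executor-memory", executor_memory,    # memory per executor
        "--executor-cores", str(executor_cores),
        app,
    ]


# Hypothetical application and sizes, for illustration only.
cmd = build_spark_submit("train_model.py", num_executors=10,
                         executor_memory="4g", executor_cores=2)
print(" ".join(cmd))
```

The right values depend on the cluster and the data volume; the point is only that an MLlib job is sized the same way as any Spark job.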
  • 10. What Does MLlib Contain? • Machine learning models for classification and regression • Recommender System • Clustering Algorithms • Feature Engineering Algorithms and Utilities • Data Mining Algorithms & Basic Statistical Analysis Utilities
  • 11. Classification & Regression. Traditional Models: • Linear and Logistic Regression • Naïve Bayes • Decision Trees • Support Vector Machines
  • 12. Classification & Regression. Traditional Models: • Linear and Logistic Regression • Naïve Bayes • Decision Trees • Support Vector Machines. Next-Gen Models: • Gradient Boosted Trees • Random Forests
  • 13. Clustering Algorithms • K-Means • Power Iteration Clustering (PIC) • Gaussian Mixture Model • Streaming K-Means
  • 14. Clustering Algorithms • K-Means • Power Iteration Clustering (PIC) • Gaussian Mixture Model • Streaming K-Means. Textual data clustering, i.e. identifying “topics” from a corpus of documents: • Latent Dirichlet Allocation (LDA)
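
To make the K-Means entry concrete, here is a minimal single-machine sketch of Lloyd's algorithm in plain Python. It is not MLlib's distributed implementation, and the sample points and starting centers are made up for illustration.

```python
import math


def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm: alternate between assigning points and moving centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers


# Made-up 2-D data with two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
```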
  • 15. Collaborative Filtering for Building Recommender Systems • Predicting the interests of a user by collecting a partial list of preferences from many users • Predicting missing items of a user-item association matrix • Algorithm used: Alternating Least Squares • Admittedly limited choice of algorithms
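
The idea behind Alternating Least Squares can be sketched with a single latent factor per user and per item, where each update has a closed form. This is a toy, single-machine illustration rather than MLlib's ALS; the ratings, the regularization value, and the function name `als_rank1` are assumptions made for the example.

```python
def als_rank1(ratings, n_users, n_items, reg=0.1, iterations=20):
    """Alternating least squares with one latent factor per user and item."""
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iterations):
        # Fix the item factors and solve each user's regularized least squares.
        for i in range(n_users):
            obs = [(j, r) for (ui, j), r in ratings.items() if ui == i]
            u[i] = (sum(r * v[j] for j, r in obs)
                    / (reg + sum(v[j] ** 2 for j, _ in obs)))
        # Fix the user factors and solve for each item in the same way.
        for j in range(n_items):
            obs = [(i, r) for (i, ij), r in ratings.items() if ij == j]
            v[j] = (sum(r * u[i] for i, r in obs)
                    / (reg + sum(u[i] ** 2 for i, _ in obs)))
    return u, v


# Made-up (user, item) -> rating entries; user 2 has not rated item 1.
ratings = {(0, 0): 5.0, (0, 1): 4.0, (1, 0): 5.0, (1, 1): 4.0, (2, 0): 2.5}
u, v = als_rank1(ratings, n_users=3, n_items=2)
predicted = u[2] * v[1]  # fills in the missing entry of the matrix
print(round(predicted, 2))
```

Users 0 and 1 rate item 1 at about 0.8 times their rating of item 0, so the prediction for user 2 lands near 2, slightly shrunk by the regularization term.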
  • 16. Feature Engineering & Modeling Utilities • Feature Scaling & Normalization • Statistical Correlation Functions (Pearson & Spearman’s) • Tests of Statistical Significance • Chi-Squared independence test for feature selection • Evaluation metrics: Precision, Recall, AUROC, F1-Measure, etc.
  • 17. Feature Engineering & Modeling Utilities • Feature Scaling & Normalization • Statistical Correlation Functions (Pearson & Spearman’s) • Tests of Statistical Significance • Chi-Squared independence test for feature selection • Evaluation metrics: Precision, Recall, AUROC, F1-Measure, etc. Dimensionality Reduction: • Principal Component Analysis (PCA) • Singular Value Decomposition (SVD)
  • 18. Feature Engineering & Modeling Utilities • Feature Scaling & Normalization • Statistical Correlation Functions (Pearson & Spearman’s) • Tests of Statistical Significance • Chi-Squared independence test for feature selection • Evaluation metrics: Precision, Recall, AUROC, F1-Measure, etc. Dimensionality Reduction: • Principal Component Analysis (PCA) • Singular Value Decomposition (SVD). Textual Feature Generation: • Word2Vec • Term Frequency-Inverse Document Frequency (TF-IDF)
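
As an illustration of textual feature generation, the following plain-Python sketch computes TF-IDF weights for a few tokenized documents. It uses a smoothed IDF, log((N + 1) / (df + 1)), and made-up sentences; it is meant to show the calculation, not to reproduce MLlib's API.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: a term that occurs in every document gets weight 0.
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    return [{t: count * idf[t] for t, count in Counter(doc).items()}
            for doc in docs]


# Made-up corpus for illustration.
docs = [
    "spark makes machine learning easy".split(),
    "spark runs on hadoop".split(),
    "spark streaming scores events".split(),
]
weights = tf_idf(docs)
print(weights[1])
```

Here "spark" appears in every document and so carries no weight, while terms unique to one document are weighted most heavily.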
  • 19. Data Mining: Frequent Pattern Mining. Data Mining Urban Legend: Frequent Pattern Mining algorithm on supermarket purchase data revealed “Men who buy diapers have a very high likelihood of buying beer!”
  • 20. Data Mining: Frequent Pattern Mining. Data Mining Urban Legend: Frequent Pattern Mining algorithm on supermarket purchase data revealed “Men who buy diapers have a very high likelihood of buying beer!” Algorithms in MLlib: • Frequent Pattern-Growth • Association Rule Mining • PrefixSpan
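
The diapers-and-beer story comes down to two quantities: the support of an itemset and the confidence of a rule. The sketch below computes both by brute-force counting over made-up baskets. It deliberately does not implement FP-Growth, which reaches the same frequent itemsets without enumerating every candidate.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(baskets, min_support):
    """Brute-force count of every 1- and 2-item set; keep the frequent ones."""
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        for size in (1, 2):
            counts.update(combinations(items, size))
    n = len(baskets)
    return {itemset: c / n for itemset, c in counts.items()
            if c / n >= min_support}


def confidence(supports, antecedent, consequent):
    """confidence(A -> B) = support(A and B) / support(A)."""
    pair = tuple(sorted((antecedent, consequent)))
    return supports[pair] / supports[(antecedent,)]


# Made-up purchase data for illustration.
baskets = [
    ["diapers", "beer", "chips"],
    ["diapers", "beer"],
    ["diapers", "milk"],
    ["beer", "chips"],
    ["milk", "bread"],
]
supports = frequent_itemsets(baskets, min_support=0.4)
rule_conf = confidence(supports, "diapers", "beer")
print(round(rule_conf, 3))
```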
  • 21. What about “Deep Learning”? Deep Learning is an umbrella term for large, complex Multi-Layer Neural Networks • MLlib contains a robust Multilayer Neural Network implementation
  • 22. Pipeline API: Hooking the Pieces Together • Inspired by scikit-learn pipelines • ML involves running multiple sequential steps. E.g., Text Classification Pipeline: Bag of Words, Tokenize, TF-IDF, LDA, Scale & Normalize Features, Train Classifier
  • 23. Pipeline API: Hooking the Pieces Together • Inspired by scikit-learn pipelines • ML involves running multiple sequential steps. E.g., Text Classification Pipeline: Bag of Words, Tokenize, TF-IDF, LDA, Scale & Normalize Features, Train Classifier. Sequence is repeated during Training and Scoring
  • 24. Pipeline API: Hooking the Pieces Together • Inspired by scikit-learn pipelines • ML involves running multiple sequential steps. E.g., Text Classification Pipeline: Bag of Words, Tokenize, TF-IDF, LDA, Scale & Normalize Features, Train Classifier. Sequence is repeated during Training and Scoring. Hyper-Parameter Tuning: Repeat Sequence with different parameter values
  • 25. Overview of Pipeline API • Create Pipeline as a sequence of Stages: • Transformers: Transform or augment features • Estimators: Fit a model • Re-use Pipeline • Basic save and load functionality available • Invoke Pipeline with a different set of parameters passed as a ParamMap
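
The transformer/estimator split can be illustrated without Spark. The sketch below is a plain-Python imitation of the concept, not the Spark ML API: the classes `Tokenizer`, `KeywordClassifier`, `KeywordModel`, and `Pipeline` are hypothetical stand-ins, and a simple dict plays the role of a ParamMap.

```python
import copy


class Tokenizer:
    """Transformer: splits each text into lowercase tokens."""

    def transform(self, rows):
        return [text.lower().split() for text in rows]


class KeywordModel:
    """A fitted model is itself a transformer that outputs predictions."""

    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [int(any(t in self.keywords for t in tokens)) for tokens in rows]


class KeywordClassifier:
    """Estimator: fit() learns tokens seen only in positive examples."""

    def __init__(self, min_count=1):
        self.min_count = min_count

    def fit(self, rows, labels):
        pos, neg = {}, set()
        for tokens, label in zip(rows, labels):
            for t in set(tokens):
                if label:
                    pos[t] = pos.get(t, 0) + 1
                else:
                    neg.add(t)
        keywords = {t for t, c in pos.items()
                    if c >= self.min_count and t not in neg}
        return KeywordModel(keywords)


class Pipeline:
    """Runs stages in order; estimators are fitted, then used as transformers."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows, labels, params=None):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                estimator = copy.copy(stage)  # leave the original stage untouched
                for name, value in (params or {}).items():
                    if hasattr(estimator, name):
                        setattr(estimator, name, value)
                stage = estimator.fit(rows, labels)
            rows = stage.transform(rows)
            fitted.append(stage)
        return Pipeline(fitted)

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


# Made-up training data for illustration.
train = ["Spark is fast", "Spark scales", "cats are cute", "dogs are cute"]
labels = [1, 1, 0, 0]
pipeline = Pipeline([Tokenizer(), KeywordClassifier()])
model = pipeline.fit(train, labels)
strict = pipeline.fit(train, labels, params={"min_count": 2})
predictions = model.transform(["Spark on YARN", "cute cats"])
print(predictions)
```

The same `pipeline` object is reused for both fits, and the identical stage sequence runs at training and at scoring time; hyperparameter tuning amounts to repeating `fit` with different parameter values and comparing the resulting models.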
  • 31. Process Flow for ML Development. [Diagram: Traditional Data Management; Model Development Phase; Production Modeling; Production Scoring*; Metadata Management; Development Tools (IDEs, source control, notebooks); Scheduling, Workflow, Publishing; Data Ingest; Data Prep; Feature Engineering; Visualization; Modeling (incl. hyperparameter search & model validation); Feature Generation + Model Building; Model Quality, Usage, Perf. Metrics; Experiments; Batch Scorer; Online Model Update Server + Scoring] *There may be further steps after scoring, such as aggregations, visualizations, reporting, etc. Score streaming events in Spark Streaming.
  • 32. Machine Learning Use Case
  • 33. Predicting Influencers at a Large Telco • Customer loyalty difficult and expensive • Aggressive competition
  • 34. Social Churn • Churn is not an isolated event! • When influential subscribers leave, they take their friends with them
  • 35. Casting This as a Data Science Problem • Can we quantify: Which lost users were the most influential? • Can we predict: Which current subscribers have as much influence?
  • 36. The Challenge: Lots of Customers, Lots of Data • Over 100 million customers • Over 1 billion connections
  • 38. Calculating Influencer Scores • Connection: pair of users with communication both ways • Influencer score: number of connected users that churn after user X churns
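
The two definitions above translate fairly directly into code. The following single-machine sketch assumes a list of directed call records and a map from user to churn day, both made up; the deck does not say whether a time window applies, so this version counts any later churn.

```python
def connections(calls):
    """Keep only pairs where each user has contacted the other."""
    directed = set(calls)
    return {frozenset((a, b)) for a, b in directed
            if a != b and (b, a) in directed}


def influencer_scores(calls, churn_day):
    """For each churned user, count connected users who churned later."""
    scores = {user: 0 for user in churn_day}
    for pair in connections(calls):
        a, b = tuple(pair)
        if a in churn_day and b in churn_day:
            if churn_day[a] < churn_day[b]:
                scores[a] += 1
            elif churn_day[b] < churn_day[a]:
                scores[b] += 1
    return scores


# Made-up call records (caller, callee) and churn days.
calls = [("ann", "bob"), ("bob", "ann"), ("ann", "cat"), ("cat", "ann"),
         ("ann", "dan"), ("dan", "eve"), ("eve", "dan")]
churn_day = {"ann": 10, "bob": 15, "cat": 40, "eve": 5}
scores = influencer_scores(calls, churn_day)
print(scores)
```

The one-way call from ann to dan does not count as a connection, and eve scores zero because her only connection, dan, never churned.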
  • 39. Predicting Influencer Scores: MLlib! • Regression model: linear regression, random forests • Features: # of connections, # of calls to connections, internal vs. external
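
As a rough picture of the regression step, the sketch below fits a one-feature linear model by ordinary least squares in plain Python. The training numbers are invented, and a real model would use several features and MLlib's distributed solvers rather than this closed form.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


# Hypothetical training data: number of connections -> historical influencer score.
num_connections = [2, 4, 6, 8]
influencer_score = [1, 2, 3, 4]
slope, intercept = fit_line(num_connections, influencer_score)

# Predicted score for a current subscriber with 10 connections.
predicted = slope * 10 + intercept
print(predicted)
```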
  • 40. Breaking Down the Work • Building User and Connection Tables • Computing Historical Influencer Scores • Feature Generation • Model Fitting • Model Evaluation
  • 41. What’s Next
  • 42. Roadmap Update • MANAGEMENT: Initial Spark-on-YARN integration for shared resource management; New metrics for easier diagnosis; Improved Spark-on-YARN for better multi-tenancy, performance, ease of use; Automated configurations to optimize over time; Visibility into resource utilization; Improved PySpark integration for Python access • SECURITY: Kerberos-based authorization; Fine-grained access control; Auditing and lineage (Governance); Integration with Intel’s Advanced Encryption libraries; Full PCI compliance • SCALE: Improved integration with HDFS to enable scheduling; Reduced memory pressure on larger jobs; Dynamic resource utilization and prioritization; Stress test at scale with mixed multi-tenant workloads • STREAMING: Spark Streaming resiliency for zero data loss; Data ingest integration for Kafka and Flume; Improved state management for better performance; Higher-level language extensions
  • 43. Download Cloudera 5.5: cloudera.com/downloads
  • 44. Data Science & Spark Training Courses: university.cloudera.com
  • 45. Thank You
  • 46. Spark Resources • Learn Spark: O’Reilly Advanced Analytics with Spark eBook (written by Clouderans); Cloudera Developer Blog: blog.cloudera.com/spark; Spark Page: cloudera.com/spark • Get Trained: Cloudera Spark Training: university.cloudera.com • Try it Out: Cloudera Live Spark Tutorial: cloudera.com/live; Download Cloudera 5.5: cloudera.com/downloads