Building machine learning applications locally with Spark — Joel Pinho Lucas (Tailtarget) @PAPIs Connect — São Paulo 2017

With huge amounts of heterogeneous data now available, processing it and extracting knowledge demands ever more effort in building complex software architectures. In this context, Apache Spark provides a powerful and efficient approach to large-scale data processing. This talk briefly introduces Spark's machine learning library (MLlib), gives a general overview of the Spark framework, and describes how to launch applications within a cluster. A demo then shows how to simulate a Spark cluster on a local machine using images available in a public Docker Hub repository. Finally, another demo shows how to save time by using unit tests to validate jobs before running them in a cluster.

Building Machine Learning applications locally with Spark
21/06/2017
Joel Pinho Lucas

Agenda
• Problems and Motivation
• Spark and MLlib overview
• Launching applications in a Spark cluster
• Simulating a Spark cluster using Docker
• Demo: deploying a Spark cluster in a local machine
• Unit tests for Spark jobs

• How to set up a Spark cluster (infrastructure + configuration)?
• How to test and/or debug a Spark job?
• The whole team should have the same environment

Run Spark Locally with Docker
• Lightweight cluster
• One machine
• Same environment for the whole team
• Easily deployed on any platform

• Easy to develop (APIs in Java, Scala, Python, R)
• High-quality algorithms
• Fast to run
• Lazy evaluation
• In-memory storage
http://spark.apache.org/mllib/
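
The lazy-evaluation point can be illustrated without Spark. The sketch below is a plain-Python analogy, not Spark code: the hypothetical `LazyPipeline` class only records transformations (`map`, `filter`) and does no work until an action (`collect`) is called, which is roughly how Spark treats transformations and actions.

```python
from typing import Callable, Iterable, List, Optional, Tuple


class LazyPipeline:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not a Spark class)."""

    def __init__(
        self,
        source: Iterable[int],
        steps: Optional[List[Tuple[str, Callable]]] = None,
        log: Optional[List[str]] = None,
    ):
        self._source = source
        self._steps = steps or []  # recorded transformations, not yet executed
        self.log = log if log is not None else []  # filled only when work happens

    def map(self, fn: Callable[[int], int]) -> "LazyPipeline":
        # Transformation: remember the function, do not apply it yet.
        return LazyPipeline(self._source, self._steps + [("map", fn)], self.log)

    def filter(self, pred: Callable[[int], bool]) -> "LazyPipeline":
        # Transformation: remember the predicate, do not apply it yet.
        return LazyPipeline(self._source, self._steps + [("filter", pred)], self.log)

    def collect(self) -> List[int]:
        # Action: only now are the recorded steps applied to the data.
        out = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                self.log.append(kind)
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


pipeline = LazyPipeline(range(5)).map(lambda x: x * 10).filter(lambda x: x >= 20)
work_before_action = len(pipeline.log)  # nothing has been computed so far
result = pipeline.collect()
print(work_before_action, result)  # prints: 0 [20, 30, 40]
```

Deferring the work until an action lets an engine see the whole chain of steps before running it, which is what makes optimizations such as pipelining several transformations in one pass possible.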

Spark Execution Model
http://spark.apache.org/docs/2.1.0/cluster-overview.html

Cluster Types
• Standalone
• Apache Mesos
• Hadoop YARN
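
In practice the cluster type is selected through the master URL passed to Spark. As a rough sketch, the hypothetical helper `cluster_manager_for` below maps the master URL forms used by Spark 2.x to the cluster types listed above; the host names are made up for illustration.

```python
def cluster_manager_for(master: str) -> str:
    """Name the cluster manager implied by a Spark master URL (hypothetical helper)."""
    if master.startswith("local"):
        return "local mode (no cluster manager)"
    if master.startswith("spark://"):
        return "Standalone"
    if master.startswith("mesos://"):
        return "Apache Mesos"
    if master == "yarn":
        return "Hadoop YARN"
    raise ValueError(f"unrecognized master URL: {master!r}")


# Example master URLs; host names and ports are illustrative only.
examples = ["local[*]", "spark://master-host:7077", "mesos://mesos-host:5050", "yarn"]
for url in examples:
    print(url, "->", cluster_manager_for(url))
```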

Starting a Cluster Manually
Manually Submitting an Application
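
The slide's commands are not preserved in this transcript, so the following is only a sketch of what the two manual steps typically look like for a standalone cluster in Spark 2.1.0. It builds the command lines rather than running them. The scripts `sbin/start-master.sh`, `sbin/start-slave.sh` and `bin/spark-submit` ship with Spark; the install path, host, class name and jar path are assumptions made up for illustration.

```python
import shlex
from typing import List, Sequence


def start_master_cmd(spark_home: str) -> List[str]:
    """Command that starts a standalone master."""
    return [f"{spark_home}/sbin/start-master.sh"]


def start_worker_cmd(spark_home: str, master_url: str) -> List[str]:
    """Command that starts one worker and registers it with the master.

    Spark 2.1.0 names this script start-slave.sh.
    """
    return [f"{spark_home}/sbin/start-slave.sh", master_url]


def spark_submit_cmd(
    spark_home: str,
    master_url: str,
    main_class: str,
    app_jar: str,
    app_args: Sequence[str] = (),
) -> List[str]:
    """Command that submits a packaged application to the cluster."""
    return [
        f"{spark_home}/bin/spark-submit",
        "--class", main_class,
        "--master", master_url,
        app_jar,
        *app_args,
    ]


# Hypothetical values, chosen only for illustration.
SPARK_HOME = "/opt/spark"
MASTER_URL = "spark://localhost:7077"  # 7077 is the standalone master's default port

commands = [
    start_master_cmd(SPARK_HOME),
    start_worker_cmd(SPARK_HOME, MASTER_URL),
    spark_submit_cmd(SPARK_HOME, MASTER_URL, "com.example.FPGrowthJob", "target/app.jar"),
]
for cmd in commands:
    print(" ".join(shlex.quote(part) for part in cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine where Spark is installed at the assumed path.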

Choose your Docker Image
(or build your own and share)

Some available Spark Docker Images
• https://github.com/big-data-europe/docker-spark
• https://hub.docker.com/r/internavenue/centos-spark/
• https://github.com/sequenceiq/docker-spark
• https://github.com/epahomov/docker-spark
• https://www.anchormen.nl/spark-docker/
• https://github.com/gettyimages/docker-spark
• https://hub.docker.com/r/bigdatauniversity/spark/

http://github.com/joelplucas/docker-spark

Example to Run
• MLlib's FP-Growth algorithm
• Data from the digital publishing domain
• Problem: find frequent patterns in navigation profiles
• Write results to MongoDB
http://github.com/joelplucas/fpgrowth-spark-example
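
To make the problem concrete, here is a small sketch of what "frequent patterns" means. It is not FP-Growth and not the code from the linked repository: it counts every item combination by brute force, which only works on tiny inputs, but it yields the same kind of result (itemsets whose support meets a minimum threshold). The navigation profiles and the function `frequent_itemsets` are invented for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, List


def frequent_itemsets(
    transactions: List[List[str]], min_support: float
) -> Dict[FrozenSet[str], int]:
    """Return every itemset present in at least min_support of the transactions.

    Naive enumeration: exponential in transaction length, for illustration only.
    """
    counts: Counter = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # each item counts once per transaction
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    min_count = min_support * len(transactions)
    return {itemset: n for itemset, n in counts.items() if n >= min_count}


# Hypothetical navigation profiles: site sections visited by four users.
profiles = [
    ["sports", "news", "tech"],
    ["sports", "news"],
    ["news", "tech"],
    ["sports", "news", "finance"],
]

patterns = frequent_itemsets(profiles, min_support=0.75)
for itemset, n in sorted(patterns.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), n)
```

With a minimum support of 0.75, only `news`, `sports`, and the pair `news` + `sports` appear in at least three of the four profiles. FP-Growth reaches the same answer without enumerating every combination, which is what makes it usable on large datasets.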

The Dataset

Unit Testing using Spark Testing Base
• Launched at Strata NYC 2015 by Holden Karau (and maintained by the community)
• Supports unit tests in Java, Scala and Python
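
The idea of validating job logic before it reaches a cluster can be sketched without any Spark dependency. The example below does not use Spark Testing Base, whose API is not shown in the slides. It uses Python's standard `unittest` on a hypothetical pure function, `to_transaction`, standing in for the kind of per-record logic a job might apply before mining patterns.

```python
import unittest
from typing import List


def to_transaction(raw_line: str) -> List[str]:
    """Turn a comma-separated navigation record into unique, normalized sections.

    Hypothetical per-record logic; order of first appearance is preserved.
    """
    seen: List[str] = []
    for token in raw_line.split(","):
        section = token.strip().lower()
        if section and section not in seen:
            seen.append(section)
    return seen


class ToTransactionTest(unittest.TestCase):
    def test_deduplicates_and_normalizes(self):
        self.assertEqual(to_transaction("News, sports,news"), ["news", "sports"])

    def test_blank_line_gives_empty_transaction(self):
        self.assertEqual(to_transaction("  "), [])


def run_tests() -> unittest.TestResult:
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToTransactionTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_tests()
print("all passed:", result.wasSuccessful())
```

Keeping record-level logic in plain functions like this means most bugs can be caught in milliseconds on a laptop. A library such as Spark Testing Base then covers the part that needs a Spark context.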

Q&A - Contact
‣ LinkedIn: http://br.linkedin.com/in/joelplucas/
‣ Email: joelpl@gmail.com
