Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfaces to Large-scale Data


You like to use R, and you need to use big data. dplyr, one of the most popular packages for R, makes it easy to query large data sets in scalable processing engines like Apache Spark and Apache Impala. But there can be pitfalls: dplyr works differently with different data sources, and those differences can bite you if you don’t know what you’re doing.

Ian Cook is a data scientist, an R contributor, and a curriculum developer at Cloudera University. In this webinar, Ian will show you exactly what you need to know about sparklyr (from RStudio) and the package implyr (from Cloudera). He will show you how to write dplyr code that works across these different interfaces. And he will solve mysteries:
• Do I need to know SQL to use dplyr?
• When is a “tbl” not a “tibble”?
• Why is 1 not always equal to 1?
• When should you collect(), collapse(), and compute()?
• How can you use dplyr to combine data stored in different systems?

3 things to learn:
• Do I need to know SQL to use dplyr?
• When should you collect(), collapse(), and compute()?
• How can you use dplyr to combine data stored in different systems?


© Cloudera, Inc. All rights reserved.
dplyr Interfaces to Large-Scale Data
Ian Cook
@ianmcook
ian@cloudera.com

Context
Mission for Cloudera: Provide a platform for data analysts and data scientists to
efficiently query, analyze, and model large-scale data in clusters and cloud storage
• By distributing Apache Spark, Apache Impala, and other tools
• By enabling productive use of these tools
Python and R users often have difficulty moving from smaller data to large-scale
distributed data
• Familiar packages and methods don’t work the same way on distributed data

Poll question

[Figure: diagram with the labels SQL, PySpark, SparkR, and "SQL or DataFrame API" (repeated six times)]

Poll question

[Figure: diagram with the labels SQL, PySpark, SparkR, and dplyr]

dplyr
dplyr provides a set of verbs that perform common data manipulation steps
• select() to select columns
• filter() to filter rows
• arrange() to order rows
• mutate() to create new columns
• summarise() to aggregate
• group_by() to perform operations by group
dplyr works on local data and with remote data sources
• For remote sources, dplyr commands are translated into SQL
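
The verbs listed above can be sketched on a small local data frame. This is only an illustration: the `flights` table, its columns, and its values are invented here and do not come from the webinar.

```r
library(dplyr)

# Hypothetical local data, made up for this sketch
flights <- tibble(
  carrier  = c("AA", "AA", "UA", "UA", "DL"),
  distance = c(500, 1500, 800, 2400, 300),
  air_time = c(80, 200, 120, 310, 50)
)

result <- flights %>%
  filter(distance > 400) %>%                     # filter rows
  select(carrier, distance, air_time) %>%        # select columns
  mutate(speed = distance / air_time * 60) %>%   # create a new column
  group_by(carrier) %>%                          # perform operations by group
  summarise(avg_speed = mean(speed)) %>%         # aggregate
  arrange(desc(avg_speed))                       # order rows

print(result)
```

The same pipeline shape is what gets translated to SQL when the table is a remote source instead of a local tibble.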

Poll question

Demonstration
Example code at
github.com/ianmcook/dplyr-examples

dplyr SQL backends
dplyr
↕
dbplyr
↕
dplyr SQL backend package*
↕
DBI
↕
DBI-compatible interface package
↕
database driver or connector
↕
database/engine
* optional
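
A minimal sketch of how the layers in this stack fit together, under a loud assumption: RSQLite plays the part of the DBI-compatible interface package, and an in-memory SQLite database stands in for the database/engine. With Spark or Impala, the connection step would use sparklyr or implyr instead; that code is not shown here.

```r
library(dplyr)    # the verbs
library(dbplyr)   # translates dplyr verbs into SQL
library(DBI)      # generic database interface

# Assumption: in-memory SQLite as a stand-in for the real engine
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# A reference to the remote table; the rows stay in the database
cars_tbl <- tbl(con, "mtcars")
tbl_classes <- class(cars_tbl)
print(tbl_classes)   # includes "tbl_lazy": a tbl, but not a local tibble

# The verbs are translated to SQL and run by the engine;
# collect() brings the small result back into R
heavy <- cars_tbl %>%
  filter(wt > 3) %>%
  summarise(n = n()) %>%
  collect()

dbDisconnect(con)
```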

sparklyr
• Provides a SQL backend to dplyr for Spark
• Also exposes the MLlib API and a subset of the Spark DataFrames API
• Developed by RStudio
spark.rstudio.com

implyr
• Provides a SQL backend to dplyr for Impala
• Uses ODBC or JDBC to connect to Impala
• Developed at Cloudera
tiny.cloudera.com/implyr

Five tips for using dplyr with SQL data sources

1. Use show_query()
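
A small sketch of show_query() in use. It assumes dbplyr's lazy_frame() with a simulated connection, so no database is needed; the table and column names are hypothetical.

```r
library(dplyr)
library(dbplyr)

# lazy_frame() creates a table reference with no real database behind it.
# Column names are invented for this sketch.
flights <- lazy_frame(carrier = "AA", dep_delay = 5, con = simulate_dbi())

query <- flights %>%
  group_by(carrier) %>%
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE))

show_query(query)                              # prints the generated SQL
sql_text <- as.character(sql_render(query))    # the same SQL as a string
```

Reading the generated SQL is a quick way to check what the engine will actually be asked to run.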

2. filter() early, arrange() late
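
One way to read this tip, sketched with a simulated connection and hypothetical columns: filter before the heavier steps so the engine works on fewer rows, and sort only the final result, since an ordering applied in an intermediate step is generally not guaranteed to survive later SQL operations.

```r
library(dplyr)
library(dbplyr)

# Hypothetical table reference; no real database is involved
flights <- lazy_frame(carrier = "AA", distance = 500, air_time = 80,
                      con = simulate_dbi())

query <- flights %>%
  filter(distance > 400) %>%                          # early: shrink the data first
  mutate(speed = distance / air_time * 60) %>%
  group_by(carrier) %>%
  summarise(avg_speed = mean(speed, na.rm = TRUE)) %>%
  arrange(desc(avg_speed))                            # late: sort only the final result

sql_text <- as.character(sql_render(query))
cat(sql_text, "\n")
```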

3. Check your data types
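
One example of a data type difference, offered as an illustration rather than as the case covered in the talk: in R, `1` is a double and `1L` is an integer, and dbplyr carries that distinction into the SQL it generates. The sketch uses a simulated connection; real backends may render literals differently.

```r
library(dbplyr)

con <- simulate_dbi()   # generic simulated connection; no real database

# The same comparison written with a double and with an integer literal
double_sql  <- as.character(translate_sql(x == 1,  con = con))
integer_sql <- as.character(translate_sql(x == 1L, con = con))

print(double_sql)
print(integer_sql)
```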

4. Know your SQL engine

5. Know when to collect()
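
A minimal sketch contrasting collect() with compute() and collapse(). Assumption: dbplyr's memdb_frame() (an in-memory SQLite table, which requires RSQLite) stands in for a Spark or Impala table, and the data is invented.

```r
library(dplyr)
library(dbplyr)

# Assumption: in-memory SQLite table as a stand-in for a remote table
remote <- memdb_frame(carrier  = c("AA", "AA", "UA"),
                      distance = c(500, 1500, 800))

summary_query <- remote %>%
  group_by(carrier) %>%
  summarise(total = sum(distance, na.rm = TRUE))

cached   <- compute(summary_query)    # runs the query; result stays in a temporary remote table
nested   <- collapse(summary_query)   # does not run anything; wraps the query as a subquery
local_df <- collect(summary_query)    # runs the query; result comes back as a local tibble

print(class(local_df))
```

Collecting after the summary means only the aggregated rows travel from the engine into R's memory.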

Questions?
Ian Cook
@ianmcook
ian@cloudera.com

Cloudera Data Science Workbench
• More information: tiny.cloudera.com/cdsw
• OnDemand training: tiny.cloudera.com/cdsw-training
