Koalas: Unifying Spark and pandas APIs


An introduction to Koalas and its current development status. Slides for Spark Meetup Tokyo #1 (Spark+AI Summit 2019).


Koalas: Unifying Spark and pandas APIs
Takuya UESHIN
Spark Meetup Tokyo #1, Jun 2019

About Me
- Software Engineer @databricks
- Apache Spark Committer
- Twitter: @ueshin
- GitHub: github.com/ueshin

Typical journey of a data scientist
Education (MOOCs, books, universities) → pandas
Analyze small data sets → pandas
Analyze big data sets → DataFrame in Spark

pandas
Authored by Wes McKinney in 2008
The standard tool for data manipulation and analysis in Python
Deeply integrated into the Python data science ecosystem, e.g. NumPy, matplotlib
Can deal with a lot of different situations, including:
- basic statistical analysis
- handling missing data
- time series, categorical variables, strings

Apache Spark
De facto unified analytics engine for large-scale data processing (Streaming, ETL, ML)
Created at UC Berkeley by Databricks’ founders
PySpark API for Python; also API support for Scala, R and SQL

Pandas DataFrame vs Spark DataFrame

                 Pandas DataFrame              Spark DataFrame
Column           df['col']                     df['col']
Mutability       Mutable                       Immutable
Add a column     df['c'] = df['a'] + df['b']   df.withColumn('c', df['a'] + df['b'])
Rename columns   df.columns = ['a', 'b']       df.select(df['c1'].alias('a'), df['c2'].alias('b'))
Value count      df['col'].value_counts()      df.groupBy(df['col']).count().orderBy('count', ascending=False)
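
The pandas side of this comparison can be tried on a small in-memory frame. This is a minimal sketch, assuming pandas is installed; the data values are made up for illustration, and the Spark side is left out because it needs a running SparkSession.

```python
import pandas as pd

# Hypothetical toy data; the column names c1 and c2 follow the comparison above.
df = pd.DataFrame({"c1": [1, 2, 2, 3], "c2": [10, 20, 20, 30]})

# Rename columns: pandas assigns new labels in place (the frame is mutable).
df.columns = ["a", "b"]

# Add a column: plain item assignment, again modifying df in place.
df["c"] = df["a"] + df["b"]

# Value count: occurrences of each distinct value, most frequent first.
counts = df["a"].value_counts()

print(df)
print(counts)
```

Each statement changes `df` directly, which is the "Mutable" row in practice. The Spark equivalents return a new DataFrame instead, so their result has to be assigned back to a name.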

A short example
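# pandas
import pandas as pd
df = pd.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x

# PySpark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = (spark.read
      .option("inferSchema", "true")
      .option("header", "true")  # pandas reads the first row as the header by default
      .csv("my_data.csv"))
df = df.toDF('x', 'y', 'z1')
df = df.withColumn('x2', df.x * df.x)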

Koalas
Announced April 24, 2019
Pure Python library
Aims at providing the pandas API on top of Apache Spark:
- unifies the two ecosystems with a familiar API
- seamless transition between small and large data

Quickly gaining traction
Weekly releases!
More than 100 patches merged since the announcement at Spark Summit (April 24)
More than 10 significant contributors outside of Databricks
More than 2.5k daily downloads

A short example
# pandas
import pandas as pd
df = pd.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x

# Koalas: same code, only the import and the reader change
import databricks.koalas as ks
df = ks.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x

Key Differences
Spark is lazier by nature:
- most operations only run when a DataFrame is displayed or written
Spark DataFrames have no implicit row ordering
Performance when working at scale
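
The laziness point can be illustrated without a cluster. The sketch below is a toy model, not Spark's implementation: the hypothetical `LazyFrame` class only records transformations and runs them when `collect()` is called, which mirrors how Spark defers work until a DataFrame is displayed or written.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (hypothetical, not a Spark API)."""

    def __init__(self, rows, ops=()):
        self._rows = rows
        self._ops = tuple(ops)  # pending transformations, not yet applied

    def with_column(self, name, fn):
        # Record the operation and return a new frame; nothing is computed here.
        return LazyFrame(self._rows, self._ops + ((name, fn),))

    def collect(self):
        # The action: apply every recorded transformation, row by row.
        result = []
        for row in self._rows:
            row = dict(row)  # copy so the source rows stay untouched
            for name, fn in self._ops:
                row[name] = fn(row)
            result.append(row)
        return result


calls = []  # records when the column function actually runs


def square_x(row):
    calls.append(row["x"])
    return row["x"] * row["x"]


lazy = LazyFrame([{"x": 1}, {"x": 2}, {"x": 3}]).with_column("x2", square_x)
calls_before_collect = len(calls)  # nothing has run yet: only the plan was built
rows = lazy.collect()              # the work happens here
print(calls_before_collect, rows)
```

One practical consequence is that an error inside a transformation surfaces only at the action, not at the line where the transformation was declared.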

Current status
Weekly releases, very active community with daily changes
The most common functions have been implemented:
- 30% of the DataFrame API
- 30% of the Series API
- to_datetime, get_dummies, …
Special thanks to Hyukjin Kwon and Takuya Ueshin for major contributions to the project
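
For readers who have not used the two top-level functions named above, here is what they do in pandas. This is a sketch with made-up data, assuming pandas is installed; Koalas aims to offer the same functions on Spark-backed data, but this example only exercises pandas.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "day": ["2019-04-24", "2019-06-01"],
    "kind": ["announce", "meetup"],
})

# to_datetime parses strings into datetime64 values.
df["day"] = pd.to_datetime(df["day"])

# get_dummies one-hot encodes a categorical column: one indicator column per value.
dummies = pd.get_dummies(df["kind"])

print(df.dtypes)
print(dummies)
```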

What to expect soon?
Performance enhancements, e.g. caching
Plotting
Better indexing support
More data manipulation (string and date manipulations)

Getting started
pip install koalas
conda install koalas
Look for docs and updates on github.com/databricks/koalas

Do you have suggestions or requests?
Submit requests to github.com/databricks/koalas/issues
Very easy to contribute
github.com/databricks/koalas/blob/master/CONTRIBUTING.md

Thank you
Takuya UESHIN (ueshin@databricks.com)
