Koalas: Making an Easy Transition from Pandas to Apache Spark

Koalas: Unifying Spark and
pandas APIs
1
Spark + AI Summit Europe 2019
Tim Hunter
Takuya Ueshin

About
Takuya Ueshin, Software Engineer at Databricks
• Apache Spark committer and PMC member
• Focusing on Spark SQL and PySpark
• Koalas committer
Tim Hunter, Software Engineer at Databricks
• Co-creator of the Koalas project
• Contributes to Apache Spark MLlib, GraphFrames, TensorFrames and
Deep Learning Pipelines libraries
• Ph.D Machine Learning from Berkeley, M.S. Electrical Engineering from
Stanford

Outline
● pandas vs Spark at a high level
● why Koalas (combine everything in one package)
○ key differences
● current status & new features
● demo
● technical topics
○ InternalFrame
○ Operations on different DataFrames
○ Default Index
● roadmap

Typical journey of a data scientist
Education (MOOCs, books, universities) → pandas
Analyze small data sets → pandas
Analyze big data sets → DataFrame in Spark
4

pandas
Authored by Wes McKinney in 2008
The standard tool for data manipulation and analysis in Python
Deeply integrated into Python data science ecosystem, e.g. numpy, matplotlib
Can deal with a lot of different situations, including:
- basic statistical analysis
- handling missing data
- time series, categorical variables, strings
5

Apache Spark
De facto unified analytics engine for large-scale data processing
(Streaming, ETL, ML)
Originally created at UC Berkeley by Databricks’ founders
PySpark API for Python; also API support for Scala, R and SQL
6

7
pandas DataFrame Spark DataFrame
Column df['col'] df['col']
Mutability Mutable Immutable
Add a column df['c'] = df['a'] + df['b'] df.withColumn('c', df['a'] + df['b'])
Rename columns df.columns = ['a','b'] df.select(df['c1'].alias('a'),
df['c2'].alias('b'))
Value count df['col'].value_counts() df.groupBy(df['col']).count().order
By('count', ascending = False)
pandas DataFrame vs Spark DataFrame

A short example
8
import pandas as pd
df = pd.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x
df = (spark.read
.option("inferSchema", "true")
.option("comment", True)
.csv("my_data.csv"))
df = df.toDF('x', 'y', 'z1')
df = df.withColumn('x2', df.x*df.x)

Koalas
Announced April 24, 2019
Pure Python library
Aims at providing the pandas API on top of Apache Spark:
- unifies the two ecosystems with a familiar API
- seamless transition between small and large data
9

Quickly gaining traction
10
Bi-weekly releases!
> 500 patches merged since
announcement
> 20 significant contributors
outside of Databricks
> 8k daily downloads

A short example
11
import pandas as pd
df = pd.read_csv("my_data.csv")
import databricks.koalas as ks
df = ks.read_csv("my_data.csv")

Koalas
- Provide discoverable APIs for common data science
tasks (i.e., follows pandas)
- Unify pandas API and Spark API, but pandas first
- pandas APIs that are appropriate for distributed
dataset
- Easy conversion from/to pandas DataFrame or
numpy array.
12

Key Differences
Spark is more lazy by nature:
- most operations only happen when displaying or writing a
DataFrame
Spark does not maintain row order
Performance when working at scale
13

Current status
Bi-weekly releases, very active community with daily changes
The most common functions have been implemented:
- 60% of the DataFrame / Series API
- 60% of the DataFrameGroupBy / SeriesGroupBy API
- 15% of the Index / MultiIndex API
- to_datetime, get_dummies, …
14

New features
- 80% of the plot functions (0.16.0-)
- Spark related functions (0.8.0-)
- IO: to_parquet/read_parquet, to_csv/read_csv,
to_json/read_json, to_spark_io/read_spark_io,
to_delta/read_delta, ...
- SQL
- cache
- Support for multi-index columns (90%) (0.16.0-)
- Options to configure Koalas’ behavior (0.17.0-)
15

Challenge: increasing scale
and complexity of data
operations
Struggling with the “Spark
switch” from pandas
More than 10X faster with
less than 1% code changes
How Virgin Hyperloop One reduced processing
time from hours to minutes with Koalas

Internal Frame
Internal immutable metadata.
- holds the current Spark DataFrame
- manages mapping from Koalas column names to Spark
column names
- manages mapping from Koalas index names to Spark column
names
- converts between Spark DataFrame and pandas DataFrame
18

InternalFrame
19
Koalas
DataFrame
InternalFrame
- column_index
- index_map
Spark
DataFrame

InternalFrame
20
Koalas
DataFrame
InternalFrame
- column_index
- index_map
Spark
DataFrame
InternalFrame
- column_index
- index_map
Spark
DataFrame
Koalas
DataFrame
API call copy with new state

InternalFrame
21
Koalas
DataFrame
InternalFrame
- column_index
- index_map
Spark
DataFrame
InternalFrame
- column_index
- index_map
Koalas
DataFrame
Only updates metadata
kdf.set_index(...)

InternalFrame
22
Koalas
DataFrame
InternalFrame
- column_index
- index_map
Spark
DataFrame
InternalFrame
- column_index
- index_map
Spark
DataFrame
e.g., inplace=True
kdf.dropna(..., inplace=True)

Operations on different DataFrames
We only allow Series derived from the same DataFrame by default.
23
OK
- df.a + df.b
- df['c'] = df.a * df.b
Not OK
- df1.a + df2.b
- df1['c'] = df2.a * df2.b

Equivalent Spark code?
24
OK
- df.a + df.b
- df['c'] = df.a * df.b
sdf.select(
sdf['a'] + sdf['b'])
Not OK
- df1.a + df2.b
- df1['c'] = df2.a * df2.b
???
sdf1.select(
sdf1['a'] + sdf2['b'])

25
OK
- df.a + df.b
- df['c'] = df.a * df.b
sdf.select(
Not OK
- df1.a + df2.b
- df1['c'] = df2.a * df2.b
???
sdf1.select(
sdf1['a'] + sdf2['b'])AnalysisException!!

ks.set_option('compute.ops_on_diff_frames', True)
26
OK
- df.a + df.b
- df['c'] = df.a * df.b
sdf.select(
OK
- df1.a + df2.b
- df1['c'] = df2.a * df2.b
sdf1.join(sdf2,
on="_index_")
.select('a * b')

Default Index
Koalas manages a group of columns as index.
The index behaves the same as pandas’.
If no index is specified when creating a Koalas DataFrame:
it attaches a “default index” automatically.
Each “default index” has Pros and Cons.
27

Default Indexes
Configurable by the option “compute.default_index_type”
See also: https://koalas.readthedocs.io/en/latest/user_guide/options.html#default-index-type
28
requires to collect data
into single node
requires shuffle continuous increments
sequence YES YES YES
distributed-sequence NO YES / NO YES / NO
distributed NO NO NO

What to expect?
• Improve pandas API coverage
- rolling/expanding
• Support categorical data types
• More time-series related functions
• Improve performance
- Minimize the overhead at Koalas layer
- Optimal implementation of APIs
29

Getting started
pip install koalas
conda install koalas
Look for docs on https://koalas.readthedocs.io/en/latest/
and updates on github.com/databricks/koalas
10 min tutorial in a Live Jupyter notebook is available from the docs.
30

Do you have suggestions or requests?
Submit requests to github.com/databricks/koalas/issues
Very easy to contribute
koalas.readthedocs.io/en/latest/development/contributing.html
31

Koalas Sessions
Koalas: Pandas on Apache Spark (Tutorial)
- 14:30 - @ROOM: G104
AMA: Koalas
- 16:00 - @EXPO HALL
32

Thank you! Q&A?
Get Started at databricks.com/try
33

Koalas: Making an Easy Transition from Pandas to Apache Spark

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Similar a Koalas: Making an Easy Transition from Pandas to Apache Spark

Similar a Koalas: Making an Easy Transition from Pandas to Apache Spark (20)

Más de Databricks

Más de Databricks (20)

Último

Último (20)

Koalas: Making an Easy Transition from Pandas to Apache Spark