
Koalas: Pandas on Apache Spark

411 views

Published on

In this hands-on tutorial we present Koalas, a new open source project. Koalas is an open source Python package that implements the pandas API on top of Apache Spark, making the pandas API scalable to big data. Using Koalas, data scientists can make the transition from a single machine to a distributed environment without needing to learn a new framework.

Published in: Data & Analytics

Koalas: Pandas on Apache Spark

  1. Feedback: Your feedback is important to us. Don’t forget to rate and review the sessions.
  2. Koalas: pandas on Apache Spark. Niall Turbitt, Data Scientist @ Databricks
  3. About Niall Turbitt: Data Scientist at Databricks ● Professional Services and Training ● Previously: ○ e-Commerce ○ Supply Chain and Logistics ○ Recommender Systems and Personalization ● MS Statistics, University College Dublin ● BA Mathematics & Economics, Trinity College Dublin
  4. Outline ▪ pandas vs Spark ▪ Why Koalas? ▪ Koalas under the hood ▪ Demo ▪ Koalas Roadmap
  5. Typical Journey of a Data Scientist ▪ Education (MOOCs, Books, Universities) -> pandas ▪ Analyze Small Datasets -> pandas ▪ Analyze Big Datasets -> Apache Spark
  6. pandas ▪ Authored by Wes McKinney in 2008 ▪ The standard tool for data manipulation and analysis in Python ▪ Deeply integrated into the Python data science ecosystem ▪ numpy ▪ matplotlib ▪ scikit-learn ▪ Can deal with a lot of different situations, including: ▪ Basic statistical analysis ▪ Handling missing data ▪ Time series, categorical variables, strings (see the first sketch after the transcript)
  7. Apache Spark ▪ De facto unified analytics engine for large-scale data processing ▪ Streaming ▪ ETL ▪ ML ▪ Originally created at UC Berkeley by Databricks’ founders ▪ PySpark API for Python; also API support for Scala, R, Java and SQL
  8. A short example
     pandas:
       import pandas as pd
       df = pd.read_csv("my_data.csv")
       df.columns = ['x', 'y', 'z1']
       df['x2'] = df.x * df.x
     PySpark:
       df = (spark.read
             .option("inferSchema", "true")
             .csv("my_data.csv"))
       df = df.toDF('x', 'y', 'z1')
       df = df.withColumn('x2', df.x * df.x)
  9. Koalas ▪ Announced April 2019 ▪ Pure Python library ▪ Aims to provide the pandas API on top of Apache Spark ▪ Unifies the two ecosystems with a familiar API ▪ Seamless transition between small and large data
  10. Koalas Growth ▪ > 30,000 daily downloads ▪ > 2,000 GitHub stars
  11. A short example
      pandas:
        import pandas as pd
        df = pd.read_csv("my_data.csv")
        df.columns = ['x', 'y', 'z1']
        df['x2'] = df.x * df.x
      Koalas:
        import databricks.koalas as ks
        df = ks.read_csv("my_data.csv")
        df.columns = ['x', 'y', 'z1']
        df['x2'] = df.x * df.x
      (See the second sketch after the transcript for moving between pandas, Koalas and PySpark DataFrames.)
  12. Current Status ▪ Bi-weekly releases, very active community with daily changes ▪ Most common pandas functions have been implemented in Koalas: ▪ Series: 70% ▪ DataFrame: 74% ▪ Index: 56% ▪ MultiIndex: 51% ▪ DataFrameGroupBy: 67% ▪ SeriesGroupBy: 69% ▪ Plotting: 80%
  13. Spark vs pandas - Key Differences
      Spark: ▪ DataFrame is immutable ▪ Lazy evaluation ▪ Distributed ▪ Does not maintain row order ▪ Performant when working at scale
      pandas: ▪ DataFrame is mutable ▪ Eager execution ▪ Single-machine ▪ Maintains row order ▪ Restricted to a single machine
      (See the third sketch after the transcript for lazy vs eager execution.)
  14. Koalas - Architecture: Koalas is a lean API layer (Koalas Core and Data Source Connectors, interoperating with pandas) on top of Spark's SQL and DataFrame APIs, which run on Catalyst optimization & Tungsten execution.
  15. Koalas - Under the hood ▪ InternalFrame ▪ Holds the current Spark DataFrame ▪ Internal immutable metadata ▪ Manages mappings from Koalas column names to Spark column names ▪ Manages mappings from Koalas index names to Spark column names ▪ Converts between Spark DataFrame and pandas DataFrame
  16. InternalFrame: an API call on a Koalas DataFrame creates a copy with new state (a new InternalFrame holding column_labels and index_map, backed by a new Spark DataFrame) and returns it as a new Koalas DataFrame.
  17. InternalFrame - Metadata update only: calls such as kdf.set_index(...) also copy with new state, but only the metadata (column_labels, index_map) is updated; the underlying Spark DataFrame is reused.
  18. InternalFrame - inplace: with inplace=True, e.g. kdf.dropna(..., inplace=True), the new InternalFrame replaces the state of the existing Koalas DataFrame rather than being returned as a copy. (See the fourth sketch after the transcript.)
  19. Koalas - Under the hood ▪ Default index ▪ Koalas manages a group of columns as an index ▪ Behaves in the same manner as pandas ▪ If no index is specified when creating a Koalas DataFrame, a “default index” is attached automatically ▪ Each default index type has pros and cons
  20. Default Index Type (configurable via the option compute.default_index_type; see the fifth sketch after the transcript)
      sequence: ▪ Used by default ▪ Implements a sequence that increases one by one ▪ Uses a PySpark Window function without specifying a partition ▪ Can end up with the whole data in a single partition on a single node ▪ Avoid when data is large
      distributed-sequence: ▪ Implements a sequence that increases one by one ▪ Uses group-by and group-map in a distributed manner ▪ Recommended if the index must be a sequence that increases one by one for a large dataset
      distributed: ▪ Implements a monotonically increasing sequence ▪ Uses PySpark’s monotonically_increasing_id in a distributed manner ▪ Values are non-deterministic ▪ Recommended if the index does not have to be a sequence that increases one by one
  21. Demo: bit.ly/koalas_summit_2020
  22. Koalas - On the roadmap ▪ June/July 2020: Koalas 1.0 release, which supports Spark 3.0 ▪ July/Aug 2020: DBR/MLR 7.1 release will pre-install Koalas 1.x
  23. Koalas - Getting started ▪ pip install koalas ▪ conda install koalas ▪ Docs: https://koalas.readthedocs.io ▪ GitHub: https://github.com/databricks/koalas
  24. Thank You!
  25. Feedback: Your feedback is important to us. Don’t forget to rate and review the sessions.
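
A few illustrative code sketches for the slides above. First sketch (slide 6): the pandas capabilities listed on the slide map to a few everyday calls. A minimal, hypothetical example; the column names and data are made up, not from the talk:

    import pandas as pd
    import numpy as np

    # Toy, made-up data: a daily time series with a missing value and a categorical column
    pdf = pd.DataFrame({
        "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
        "category": pd.Categorical(["a", "b", "a"]),
        "value": [1.0, np.nan, 3.0],
    })

    print(pdf.describe())                                        # basic statistical analysis
    pdf["value"] = pdf["value"].fillna(0)                        # handling missing data
    print(pdf.set_index("date")["value"].resample("D").sum())    # time series resampling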
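
Second sketch (slides 9 and 11): the "seamless transition between small and large data" also applies to moving between pandas, Koalas and PySpark DataFrames. A minimal sketch, assuming Koalas is installed and a SparkSession is already running (as on Databricks):

    import pandas as pd
    import databricks.koalas as ks

    pdf = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

    kdf = ks.from_pandas(pdf)        # pandas -> Koalas (distributed)
    kdf["x2"] = kdf.x * kdf.x        # same syntax as in pandas

    sdf = kdf.to_spark()             # Koalas -> PySpark DataFrame
    kdf2 = sdf.to_koalas()           # PySpark -> Koalas (method added by importing databricks.koalas)
    pdf2 = kdf2.to_pandas()          # Koalas -> pandas (collects the data to the driver)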
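
Third sketch (slide 13): the eager/lazy and mutable/immutable differences in code. A sketch that assumes an existing SparkSession named spark:

    import pandas as pd

    # pandas: eager and mutable; each statement runs immediately and modifies pdf in place
    pdf = pd.DataFrame({"x": [3, 1, 2]})
    pdf["y"] = pdf.x * 2                       # computed right away

    # PySpark: immutable and lazy; transformations only build up a query plan
    sdf = spark.createDataFrame(pdf)           # assumes an existing SparkSession `spark`
    sdf2 = sdf.withColumn("y", sdf.x * 2)      # returns a new DataFrame, nothing executes yet
    print(sdf2.count())                        # an action triggers the distributed computation
    # Note: without an explicit orderBy, Spark does not guarantee row order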
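
Fourth sketch (slides 16-18): what the InternalFrame behaviour described above looks like at the API level. A sketch only; kdf stands for any Koalas DataFrame:

    import databricks.koalas as ks

    kdf = ks.DataFrame({"x": [1.0, None, 3.0], "y": [4, 5, 6]})

    # Default: the call returns a new Koalas DataFrame backed by a new InternalFrame
    kdf2 = kdf.dropna()

    # set_index(...) copies the InternalFrame but only updates metadata
    # (column_labels / index_map); the underlying Spark DataFrame is reused
    kdf3 = kdf.set_index("y")

    # inplace=True: the existing Koalas DataFrame is repointed to the new
    # InternalFrame instead of a copy being returned
    kdf.dropna(inplace=True)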
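
Fifth sketch (slide 20): choosing the default index type via the compute.default_index_type option. The option values come from the slide; the CSV path is hypothetical:

    import databricks.koalas as ks

    # Set the default index type globally
    ks.set_option("compute.default_index_type", "distributed-sequence")

    # Or limit it to a block of code
    with ks.option_context("compute.default_index_type", "distributed"):
        kdf = ks.read_csv("my_data.csv")   # no index given, so a default index is attached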
