SlideShare una empresa de Scribd logo
1 de 25
Descargar para leer sin conexión
pandas: a Foundational Python library for Data Analysis
                    and Statistics

                                   Wes McKinney


                            PyHPC 2011, 18 November 2011




Wes McKinney (@wesmckinn)          Data analysis with pandas   PyHPC 2011   1 / 25
An alternate title




       High Performance Structured Data
            Manipulation in Python




 Wes McKinney (@wesmckinn)   Data analysis with pandas   PyHPC 2011   2 / 25
My background



     Former quant hacker at AQR Capital, now entrepreneur
     Background: math, statistics, computer science, quant finance.
     Shaken, not stirred
     Active in scientific Python community
     My blog: http://blog.wesmckinney.com
     Twitter: @wesmckinn
     Book! “Python for Data Analysis”, to hit the shelves later next year
     from O’Reilly




 Wes McKinney (@wesmckinn)   Data analysis with pandas       PyHPC 2011   3 / 25
Structured data



      cname              year   agefrom      ageto             ls     lsc    pop   ccode
0     Australia          1950   15           19                64.3   15.4   558   AUS
1     Australia          1950   20           24                48.4   26.4   645   AUS
2     Australia          1950   25           29                47.9   26.2   681   AUS
3     Australia          1950   30           34                44     23.8   614   AUS
4     Australia          1950   35           39                42.1   21.9   625   AUS
5     Australia          1950   40           44                38.9   20.1   555   AUS
6     Australia          1950   45           49                34     16.9   491   AUS
7     Australia          1950   50           54                29.6   14.6   439   AUS
8     Australia          1950   55           59                28     12.9   408   AUS
9     Australia          1950   60           64                26.3   12.1   356   AUS




    Wes McKinney (@wesmckinn)      Data analysis with pandas                 PyHPC 2011   4 / 25
Structured data



     A familiar data model
           Heterogeneous columns or hyperslabs
           Each column/hyperslab is homogeneously typed
           Relational databases (SQL, etc.) are just a special case
     Need good performance in row- and column-oriented operations
     Support for axis metadata
     Data alignment is critical
     Seamless integration with Python data structures and NumPy




 Wes McKinney (@wesmckinn)       Data analysis with pandas            PyHPC 2011   5 / 25
Structured data challenges



     Table modification: column insertion/deletion
     Axis indexing and data alignment
     Aggregation and transformation by group (“group by”)
     Missing data handling
     Pivoting and reshaping
     Merging and joining
     Time series-specific manipulations
     Fast IO: flat files, databases, HDF5, ...




 Wes McKinney (@wesmckinn)    Data analysis with pandas     PyHPC 2011   6 / 25
Not all fun and games




     We care nearly equally about
           Performance
           Ease-of-use (syntax / API fits your mental model)
           Expressiveness
     Clean, consistent API design is hard and underappreciated




 Wes McKinney (@wesmckinn)      Data analysis with pandas     PyHPC 2011   7 / 25
The big picture



     Build a foundation for data analysis and statistical computing
     Craft the most expressive / flexible in-memory data manipulation tool
     in any language
           Preferably also one of the fastest, too
     Vastly simplify the data preparation, munging, and integration process
     Comfortable abstractions: master data-fu without needing to be a
     computer scientist
     Later: extend API with distributed computing backend for
     larger-than-memory datasets




 Wes McKinney (@wesmckinn)       Data analysis with pandas   PyHPC 2011   8 / 25
pandas: a brief history




     Starting building April 2008 back at AQR
     Open-sourced (BSD license) mid-2009
     29075 lines of Python/Cython code as of yesterday, and growing fast
     Heavily tested, being used by many companies (inc. lots of financial
     firms) in production




 Wes McKinney (@wesmckinn)   Data analysis with pandas      PyHPC 2011   9 / 25
Cython: getting good performance



     My choice tool for writing performant code
     High level access to NumPy C API internals
     Buffer syntax/protocol abstracts away striding details of
     non-contiguous arrays, very low overhead vs. working with raw C
     pointers
     Reduce/remove interpreter overhead associated with working with
     Python data structures
     Interface directly with C/C++ code when necessary




 Wes McKinney (@wesmckinn)   Data analysis with pandas     PyHPC 2011   10 / 25
Axis indexing




     Key pandas feature
     The axis index is a data structure itself, which can be customized to
     support things like:
           1-1 O(1) indexing with hashable Python objects
           Datetime indexing for time series data
           Hierarchical (multi-level) indexing
     Use Python dict to support O(1) lookups and O(n) realignment ops.
     Can specialize to get better performance and memory usage




 Wes McKinney (@wesmckinn)      Data analysis with pandas    PyHPC 2011   11 / 25
Axis indexing



     Every axis has an index
     Automatic alignment between differently-indexed objects: makes it
     nearly impossible to accidentally combine misaligned data
     Hierarchical indexing provides an intuitive way of structuring and
     working with higher-dimensional data
     Natural way of expressing “group by” and join-type operations
     As good or in many cases much more integrated/flexible than
     commercial or open-source alternatives to pandas/Python




 Wes McKinney (@wesmckinn)     Data analysis with pandas     PyHPC 2011   12 / 25
The trouble with Python dicts...




     Python dict memory footprint can be quite large
           1MM key-value pairs: something like 70mb on a 64-bit system
           Even though sizeof(PyObject*) == 8
     Python dict is great, but should use a faster, threadsafe hash table for
     primitive C types (like 64-bit integer)
     BUT: using a hash table only necessary in the general case. With
     monotonic indexes you don’t need one for realignment ops




 Wes McKinney (@wesmckinn)     Data analysis with pandas       PyHPC 2011   13 / 25
Some alignment numbers



     Hardware: Macbook Pro Core i7 laptop, Python 2.7.2
     Outer-join 500k-length indexes chosen from 1MM elements
           Dict-based with random strings: 2.2 seconds
           Sorted strings: 400ms (5.5x faster)
           Sorted int64: 19ms (115x faster)
     Fortunately, time series data falls into this last category
     Alignment ops with C primitives could be fairly easily parallelized with
     OpenMP in Cython




 Wes McKinney (@wesmckinn)      Data analysis with pandas          PyHPC 2011   14 / 25
DataFrame, the pandas workhorse




     A 2D tabular data structure with row and column indexes
     Hierarchical indexing one way to support higher-dimensional data in a
     lower-dimensional structure
     Simplified NumPy type system: float, int, boolean, object
     Rich indexing operations, SQL-like join/merges, etc.
     Support heterogeneous columns WITHOUT sacrificing performance in
     the homogeneous (e.g. floating point only) case




 Wes McKinney (@wesmckinn)   Data analysis with pandas      PyHPC 2011   15 / 25
DataFrame, under the hood




 Wes McKinney (@wesmckinn)   Data analysis with pandas   PyHPC 2011   16 / 25
Supporting size mutability



     In order to have good row-oriented performance, need to store
     like-typed columns in a single ndarray
     “Column” insertion: accumulate 1 × N × . . . homogeneous columns,
     later consolidate with other like-typed into a single block
     I.e. avoid reallocate-copy or array concatenation steps as long as
     possible
     Column deletions can be no-copy events (since ndarrays support
     views)




 Wes McKinney (@wesmckinn)    Data analysis with pandas       PyHPC 2011   17 / 25
Hierarchical indexing




     New this year, but really should have done long ago
     Natural result of multi-key groupby
     An intuitive way to work with higher-dimensional data
     Much less ad hoc way of expressing reshaping operations
     Once you have it, things like Excel-style pivot tables just “fall out”




 Wes McKinney (@wesmckinn)     Data analysis with pandas       PyHPC 2011   18 / 25
Reshaping




 Wes McKinney (@wesmckinn)   Data analysis with pandas   PyHPC 2011   19 / 25
Reshaping

In [5]: df.unstack(’agefrom’).stack(’year’)




 Wes McKinney (@wesmckinn)   Data analysis with pandas   PyHPC 2011   20 / 25
Reshaping implementation nuances




     Must deal with unbalanced group sizes / missing data
     Play vectorization tricks with the NumPy C-contiguous memory
     layout: no Python for loops allowed
     Care must be taken to handle heterogeneous and homogeneous data
     cases




 Wes McKinney (@wesmckinn)   Data analysis with pandas      PyHPC 2011   21 / 25
GroupBy




     High level process
           split data set into groups
           apply function to each group (an aggregation or a transformation)
           combine results intelligently into a result data structure
     Can be used to emulate SQL GROUP BY operations




 Wes McKinney (@wesmckinn)      Data analysis with pandas        PyHPC 2011    22 / 25
GroupBy



     Grouping closely related to indexing
     Create correspondence between axis labels and group labels using one
     of:
           Array of group labels (like a DataFrame column)
           Python function to be applied to each axis tick
     Can group by multiple keys
     For a hierarchically indexed axis, can select a level and group by that
     (or some transformation thereof)




 Wes McKinney (@wesmckinn)      Data analysis with pandas     PyHPC 2011   23 / 25
GroupBy implementation challenges


     Computing the group labels from arbitrary Python objects is very
     expensive
           77ms for 1MM strings with 1K groups
           107ms for 1MM strings with 10K groups
           350ms for 1MM strings with 100K groups
     To sort or not to sort (for iteration)?
           Once you have the labels, can reorder the data set in O(n) (with a
           much smaller constant than computing the labels)
           Roughly 35ms to reorder 1MM float64 data points given the labels
     (By contrast, computing the mean of 1MM elements takes 1.4ms)
     Python function call overhead is significant in cases with lots of small
     groups; much better (orders of magnitude speedup) to write
     specialized Cython routines


 Wes McKinney (@wesmckinn)      Data analysis with pandas         PyHPC 2011    24 / 25
Demo, time permitting




Wes McKinney (@wesmckinn)   Data analysis with pandas   PyHPC 2011   25 / 25

Más contenido relacionado

La actualidad más candente

Data engineering and analytics using python
Data engineering and analytics using pythonData engineering and analytics using python
Data engineering and analytics using pythonPurna Chander
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDatabricks
 
Dynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisationDynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisationOri Reshef
 
Introduction Data warehouse
Introduction Data warehouseIntroduction Data warehouse
Introduction Data warehouseAmin Choroomi
 
Data Virtualization: An Introduction
Data Virtualization: An IntroductionData Virtualization: An Introduction
Data Virtualization: An IntroductionDenodo
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiDatabricks
 
Python Pandas
Python PandasPython Pandas
Python PandasSunil OS
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasWes McKinney
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Wes McKinney
 
Introduction to pandas
Introduction to pandasIntroduction to pandas
Introduction to pandasPiyush rai
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
 
Data pre processing
Data pre processingData pre processing
Data pre processingpommurajopt
 
Python Seaborn Data Visualization
Python Seaborn Data Visualization Python Seaborn Data Visualization
Python Seaborn Data Visualization Sourabh Sahu
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Julian Hyde
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkPatrick Wendell
 

La actualidad más candente (20)

Data engineering and analytics using python
Data engineering and analytics using pythonData engineering and analytics using python
Data engineering and analytics using python
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Dynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisationDynamic filtering for presto join optimisation
Dynamic filtering for presto join optimisation
 
Introduction Data warehouse
Introduction Data warehouseIntroduction Data warehouse
Introduction Data warehouse
 
Pandas
PandasPandas
Pandas
 
Data Virtualization: An Introduction
Data Virtualization: An IntroductionData Virtualization: An Introduction
Data Virtualization: An Introduction
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
Python Pandas
Python PandasPython Pandas
Python Pandas
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandas
 
Tableau Desktop Material
Tableau Desktop MaterialTableau Desktop Material
Tableau Desktop Material
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
 
Introduction to pandas
Introduction to pandasIntroduction to pandas
Introduction to pandas
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Data pre processing
Data pre processingData pre processing
Data pre processing
 
Python Seaborn Data Visualization
Python Seaborn Data Visualization Python Seaborn Data Visualization
Python Seaborn Data Visualization
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
 
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 

Similar a pandas: a Foundational Python Library for Data Analysis and Statistics

Structured Data Challenges in Finance and Statistics
Structured Data Challenges in Finance and StatisticsStructured Data Challenges in Finance and Statistics
Structured Data Challenges in Finance and StatisticsWes McKinney
 
Slides 111017220255-phpapp01
Slides 111017220255-phpapp01Slides 111017220255-phpapp01
Slides 111017220255-phpapp01Ken Mwai
 
From flat files to deconstructed database
From flat files to deconstructed databaseFrom flat files to deconstructed database
From flat files to deconstructed databaseJulien Le Dem
 
Polyglot metadata for Hadoop
Polyglot metadata for HadoopPolyglot metadata for Hadoop
Polyglot metadata for HadoopJim Dowling
 
President Election of Korea in 2017
President Election of Korea in 2017President Election of Korea in 2017
President Election of Korea in 2017Jongwook Woo
 
Bridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly DetectionBridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly DetectionDataWorks Summit
 
Strata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseStrata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseJulien Le Dem
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonMOHITKUMAR1379
 
xldb2012_wed_0950_TimFrazier
xldb2012_wed_0950_TimFrazierxldb2012_wed_0950_TimFrazier
xldb2012_wed_0950_TimFrazierTim Frazier
 
Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817
Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817
Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817Ben Busby
 
Data Structures for Statistical Computing in Python
Data Structures for Statistical Computing in PythonData Structures for Statistical Computing in Python
Data Structures for Statistical Computing in PythonWes McKinney
 
Big data & hadoop framework
Big data & hadoop frameworkBig data & hadoop framework
Big data & hadoop frameworkTu Pham
 
PDS Unit - 1 Introdiction to DS.ppt
PDS Unit - 1 Introdiction to DS.pptPDS Unit - 1 Introdiction to DS.ppt
PDS Unit - 1 Introdiction to DS.pptssuser52a19e
 
Modernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APSModernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APSStéphane Fréchette
 
Pentaho Data Integration Introduction
Pentaho Data Integration IntroductionPentaho Data Integration Introduction
Pentaho Data Integration Introductionmattcasters
 
Minimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data VirtualizationMinimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data VirtualizationDenodo
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsTakuya UESHIN
 

Similar a pandas: a Foundational Python Library for Data Analysis and Statistics (20)

Structured Data Challenges in Finance and Statistics
Structured Data Challenges in Finance and StatisticsStructured Data Challenges in Finance and Statistics
Structured Data Challenges in Finance and Statistics
 
Slides 111017220255-phpapp01
Slides 111017220255-phpapp01Slides 111017220255-phpapp01
Slides 111017220255-phpapp01
 
From flat files to deconstructed database
From flat files to deconstructed databaseFrom flat files to deconstructed database
From flat files to deconstructed database
 
Polyglot metadata for Hadoop
Polyglot metadata for HadoopPolyglot metadata for Hadoop
Polyglot metadata for Hadoop
 
President Election of Korea in 2017
President Election of Korea in 2017President Election of Korea in 2017
President Election of Korea in 2017
 
Bridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly DetectionBridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly Detection
 
Strata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseStrata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed database
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using Python
 
xldb2012_wed_0950_TimFrazier
xldb2012_wed_0950_TimFrazierxldb2012_wed_0950_TimFrazier
xldb2012_wed_0950_TimFrazier
 
Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817
Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817
Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817
 
Data Structures for Statistical Computing in Python
Data Structures for Statistical Computing in PythonData Structures for Statistical Computing in Python
Data Structures for Statistical Computing in Python
 
No sql databases
No sql databasesNo sql databases
No sql databases
 
Big data & hadoop framework
Big data & hadoop frameworkBig data & hadoop framework
Big data & hadoop framework
 
PDS Unit - 1 Introdiction to DS.ppt
PDS Unit - 1 Introdiction to DS.pptPDS Unit - 1 Introdiction to DS.ppt
PDS Unit - 1 Introdiction to DS.ppt
 
Modernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APSModernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APS
 
Pentaho Data Integration Introduction
Pentaho Data Integration IntroductionPentaho Data Integration Introduction
Pentaho Data Integration Introduction
 
Minimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data VirtualizationMinimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data Virtualization
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
 
Dc python meetup
Dc python meetupDc python meetup
Dc python meetup
 

Más de Wes McKinney

Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkWes McKinney
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache ArrowWes McKinney
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportWes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesWes McKinney
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Wes McKinney
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future Wes McKinney
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackWes McKinney
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackWes McKinney
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Wes McKinney
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"Wes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataWes McKinney
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data ScienceWes McKinney
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Wes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningWes McKinney
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceWes McKinney
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityWes McKinney
 

Más de Wes McKinney (20)

Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
 

Último

Scenario Library et REX Discover industry- and role- based scenarios
Scenario Library et REX Discover industry- and role- based scenariosScenario Library et REX Discover industry- and role- based scenarios
Scenario Library et REX Discover industry- and role- based scenariosErol GIRAUDY
 
UiPath Studio Web workshop series - Day 4
UiPath Studio Web workshop series - Day 4UiPath Studio Web workshop series - Day 4
UiPath Studio Web workshop series - Day 4DianaGray10
 
How to release an Open Source Dataweave Library
How to release an Open Source Dataweave LibraryHow to release an Open Source Dataweave Library
How to release an Open Source Dataweave Libraryshyamraj55
 
Oracle Database 23c Security New Features.pptx
Oracle Database 23c Security New Features.pptxOracle Database 23c Security New Features.pptx
Oracle Database 23c Security New Features.pptxSatishbabu Gunukula
 
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENTSIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENTxtailishbaloch
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationKnoldus Inc.
 
Top 10 Squarespace Development Companies
Top 10 Squarespace Development CompaniesTop 10 Squarespace Development Companies
Top 10 Squarespace Development CompaniesTopCSSGallery
 
EMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? WebinarEMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? WebinarThousandEyes
 
Graphene Quantum Dots-Based Composites for Biomedical Applications
Graphene Quantum Dots-Based Composites for  Biomedical ApplicationsGraphene Quantum Dots-Based Composites for  Biomedical Applications
Graphene Quantum Dots-Based Composites for Biomedical Applicationsnooralam814309
 
UiPath Studio Web workshop series - Day 1
UiPath Studio Web workshop series  - Day 1UiPath Studio Web workshop series  - Day 1
UiPath Studio Web workshop series - Day 1DianaGray10
 
20140402 - Smart house demo kit
20140402 - Smart house demo kit20140402 - Smart house demo kit
20140402 - Smart house demo kitJamie (Taka) Wang
 
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024Alkin Tezuysal
 
IT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced ComputingIT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced ComputingMAGNIntelligence
 
My key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAIMy key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAIVijayananda Mohire
 
AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024Brian Pichman
 
Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...DianaGray10
 
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdfQ4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdfTejal81
 
Patch notes explaining DISARM Version 1.4 update
Patch notes explaining DISARM Version 1.4 updatePatch notes explaining DISARM Version 1.4 update
Patch notes explaining DISARM Version 1.4 updateadam112203
 
LF Energy Webinar - Unveiling OpenEEMeter 4.0
LF Energy Webinar - Unveiling OpenEEMeter 4.0LF Energy Webinar - Unveiling OpenEEMeter 4.0
LF Energy Webinar - Unveiling OpenEEMeter 4.0DanBrown980551
 
Stobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through TokenizationStobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through TokenizationStobox
 

Último (20)

Scenario Library et REX Discover industry- and role- based scenarios
Scenario Library et REX Discover industry- and role- based scenariosScenario Library et REX Discover industry- and role- based scenarios
Scenario Library et REX Discover industry- and role- based scenarios
 
UiPath Studio Web workshop series - Day 4
UiPath Studio Web workshop series - Day 4UiPath Studio Web workshop series - Day 4
UiPath Studio Web workshop series - Day 4
 
How to release an Open Source Dataweave Library
How to release an Open Source Dataweave LibraryHow to release an Open Source Dataweave Library
How to release an Open Source Dataweave Library
 
Oracle Database 23c Security New Features.pptx
Oracle Database 23c Security New Features.pptxOracle Database 23c Security New Features.pptx
Oracle Database 23c Security New Features.pptx
 
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENTSIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its application
 
Top 10 Squarespace Development Companies
Top 10 Squarespace Development CompaniesTop 10 Squarespace Development Companies
Top 10 Squarespace Development Companies
 
EMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? WebinarEMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? Webinar
 
Graphene Quantum Dots-Based Composites for Biomedical Applications
Graphene Quantum Dots-Based Composites for  Biomedical ApplicationsGraphene Quantum Dots-Based Composites for  Biomedical Applications
Graphene Quantum Dots-Based Composites for Biomedical Applications
 
UiPath Studio Web workshop series - Day 1
UiPath Studio Web workshop series  - Day 1UiPath Studio Web workshop series  - Day 1
UiPath Studio Web workshop series - Day 1
 
20140402 - Smart house demo kit
20140402 - Smart house demo kit20140402 - Smart house demo kit
20140402 - Smart house demo kit
 
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
 
IT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced ComputingIT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced Computing
 
My key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAIMy key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAI
 
AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024
 
Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...
 
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdfQ4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
 
Patch notes explaining DISARM Version 1.4 update
Patch notes explaining DISARM Version 1.4 updatePatch notes explaining DISARM Version 1.4 update
Patch notes explaining DISARM Version 1.4 update
 
LF Energy Webinar - Unveiling OpenEEMeter 4.0
LF Energy Webinar - Unveiling OpenEEMeter 4.0LF Energy Webinar - Unveiling OpenEEMeter 4.0
LF Energy Webinar - Unveiling OpenEEMeter 4.0
 
Stobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through TokenizationStobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
 

pandas: a Foundational Python Library for Data Analysis and Statistics

  • 1. pandas: a Foundational Python library for Data Analysis and Statistics Wes McKinney PyHPC 2011, 18 November 2011 Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 1 / 25
  • 2. An alternate title High Performance Structured Data Manipulation in Python Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 2 / 25
  • 3. My background Former quant hacker at AQR Capital, now entrepreneur Background: math, statistics, computer science, quant finance. Shaken, not stirred Active in scientific Python community My blog: http://blog.wesmckinney.com Twitter: @wesmckinn Book! “Python for Data Analysis”, to hit the shelves later next year from O’Reilly Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 3 / 25
  • 4. Structured data cname year agefrom ageto ls lsc pop ccode 0 Australia 1950 15 19 64.3 15.4 558 AUS 1 Australia 1950 20 24 48.4 26.4 645 AUS 2 Australia 1950 25 29 47.9 26.2 681 AUS 3 Australia 1950 30 34 44 23.8 614 AUS 4 Australia 1950 35 39 42.1 21.9 625 AUS 5 Australia 1950 40 44 38.9 20.1 555 AUS 6 Australia 1950 45 49 34 16.9 491 AUS 7 Australia 1950 50 54 29.6 14.6 439 AUS 8 Australia 1950 55 59 28 12.9 408 AUS 9 Australia 1950 60 64 26.3 12.1 356 AUS Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 4 / 25
  • 5. Structured data A familiar data model Heterogeneous columns or hyperslabs Each column/hyperslab is homogeneously typed Relational databases (SQL, etc.) are just a special case Need good performance in row- and column-oriented operations Support for axis metadata Data alignment is critical Seamless integration with Python data structures and NumPy Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 5 / 25
  • 6. Structured data challenges Table modification: column insertion/deletion Axis indexing and data alignment Aggregation and transformation by group (“group by”) Missing data handling Pivoting and reshaping Merging and joining Time series-specific manipulations Fast IO: flat files, databases, HDF5, ... Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 6 / 25
  • 7. Not all fun and games We care nearly equally about Performance Ease-of-use (syntax / API fits your mental model) Expressiveness Clean, consistent API design is hard and underappreciated Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 7 / 25
  • 8. The big picture Build a foundation for data analysis and statistical computing Craft the most expressive / flexible in-memory data manipulation tool in any language Preferably also one of the fastest, too Vastly simplify the data preparation, munging, and integration process Comfortable abstractions: master data-fu without needing to be a computer scientist Later: extend API with distributed computing backend for larger-than-memory datasets Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 8 / 25
  • 9. pandas: a brief history Starting building April 2008 back at AQR Open-sourced (BSD license) mid-2009 29075 lines of Python/Cython code as of yesterday, and growing fast Heavily tested, being used by many companies (inc. lots of financial firms) in production Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 9 / 25
  • 10. Cython: getting good performance My choice tool for writing performant code High level access to NumPy C API internals Buffer syntax/protocol abstracts away striding details of non-contiguous arrays, very low overhead vs. working with raw C pointers Reduce/remove interpreter overhead associated with working with Python data structures Interface directly with C/C++ code when necessary Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 10 / 25
  • 11. Axis indexing Key pandas feature The axis index is a data structure itself, which can be customized to support things like: 1-1 O(1) indexing with hashable Python objects Datetime indexing for time series data Hierarchical (multi-level) indexing Use Python dict to support O(1) lookups and O(n) realignment ops. Can specialize to get better performance and memory usage Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 11 / 25
  • 12. Axis indexing Every axis has an index Automatic alignment between differently-indexed objects: makes it nearly impossible to accidentally combine misaligned data Hierarchical indexing provides an intuitive way of structuring and working with higher-dimensional data Natural way of expressing “group by” and join-type operations As good or in many cases much more integrated/flexible than commercial or open-source alternatives to pandas/Python Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 12 / 25
  • 13. The trouble with Python dicts... Python dict memory footprint can be quite large 1MM key-value pairs: something like 70mb on a 64-bit system Even though sizeof(PyObject*) == 8 Python dict is great, but should use a faster, threadsafe hash table for primitive C types (like 64-bit integer) BUT: using a hash table only necessary in the general case. With monotonic indexes you don’t need one for realignment ops Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 13 / 25
  • 14. Some alignment numbers Hardware: Macbook Pro Core i7 laptop, Python 2.7.2 Outer-join 500k-length indexes chosen from 1MM elements Dict-based with random strings: 2.2 seconds Sorted strings: 400ms (5.5x faster) Sorted int64: 19ms (115x faster) Fortunately, time series data falls into this last category Alignment ops with C primitives could be fairly easily parallelized with OpenMP in Cython Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 14 / 25
  • 15. DataFrame, the pandas workhorse A 2D tabular data structure with row and column indexes Hierarchical indexing one way to support higher-dimensional data in a lower-dimensional structure Simplified NumPy type system: float, int, boolean, object Rich indexing operations, SQL-like join/merges, etc. Support heterogeneous columns WITHOUT sacrificing performance in the homogeneous (e.g. floating point only) case Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 15 / 25
  • 16. DataFrame, under the hood Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 16 / 25
  • 17. Supporting size mutability In order to have good row-oriented performance, need to store like-typed columns in a single ndarray “Column” insertion: accumulate 1 × N × . . . homogeneous columns, later consolidate with other like-typed into a single block I.e. avoid reallocate-copy or array concatenation steps as long as possible Column deletions can be no-copy events (since ndarrays support views) Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 17 / 25
  • 18. Hierarchical indexing New this year, but really should have done long ago Natural result of multi-key groupby An intuitive way to work with higher-dimensional data Much less ad hoc way of expressing reshaping operations Once you have it, things like Excel-style pivot tables just “fall out” Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 18 / 25
  • 19. Reshaping Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 19 / 25
  • 20. Reshaping In [5]: df.unstack(’agefrom’).stack(’year’) Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 20 / 25
  • 21. Reshaping implementation nuances Must deal with unbalanced group sizes / missing data Play vectorization tricks with the NumPy C-contiguous memory layout: no Python for loops allowed Care must be taken to handle heterogeneous and homogeneous data cases Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 21 / 25
  • 22. GroupBy High level process split data set into groups apply function to each group (an aggregation or a transformation) combine results intelligently into a result data structure Can be used to emulate SQL GROUP BY operations Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 22 / 25
  • 23. GroupBy Grouping closely related to indexing Create correspondence between axis labels and group labels using one of: Array of group labels (like a DataFrame column) Python function to be applied to each axis tick Can group by multiple keys For a hierarchically indexed axis, can select a level and group by that (or some transformation thereof) Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 23 / 25
  • 24. GroupBy implementation challenges Computing the group labels from arbitrary Python objects is very expensive 77ms for 1MM strings with 1K groups 107ms for 1MM strings with 10K groups 350ms for 1MM strings with 100K groups To sort or not to sort (for iteration)? Once you have the labels, can reorder the data set in O(n) (with a much smaller constant than computing the labels) Roughly 35ms to reorder 1MM float64 data points given the labels (By contrast, computing the mean of 1MM elements takes 1.4ms) Python function call overhead is significant in cases with lots of small groups; much better (orders of magnitude speedup) to write specialized Cython routines Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 24 / 25
  • 25. Demo, time permitting Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 25 / 25