pandas: Powerful data
analysis tools for Python
Wes McKinney
Lambda Foundry, Inc.
@wesmckinn
PhillyPUG 3/27/2012
Me
• Recovering mathematician
• 3 years in the quant finance industry
• Last 2: statistics + freelance + open source
• My new company: Lambda Foundry
• High productivity data analysis and
research tools for quant finance
Me
• Blog: http://blog.wesmckinney.com
• GitHub: http://github.com/wesm
• Twitter: @wesmckinn
In the works
Python for Data Analysis: Agile Tools for Real World Data
Wes McKinney
• Pragmatic intro to scientific Python
• pandas
• Case studies
• ETA: Late 2012
pandas?
• http://pandas.pydata.org
• Rich relational data tool built on top of
NumPy
• Like R’s data.frame on steroids
• Excellent performance
• Easy-to-use, highly consistent API
• A foundation for data analysis in Python
pandas
• In heavy production use in the financial
industry, among others
• Generally much better performance than
other open source alternatives (e.g. R)
• Hope: basis for the “next generation”
statistical computing and analysis environment
Simplifying data wrangling
• Data munging / preparation / cleaning /
integration is slow, error prone, and time
consuming
• Everyone already <3’s Python for data
wrangling: pandas takes it to the next level
Explosive pandas growth
• 10 significant releases since 9/2011
• Hugely increased user base
Battle tested
• > 98% line coverage as measured by
coverage.py
• v0.3.0 (2/19/2011): 533 test functions
• v0.7.3dev (3/27/2012): >1500 test functions
IPython
• Simply put: one of the hottest Python
projects out there
• Tab completion, introspection, interactive
debugger, command history
• Designed to enhance your productivity in
every way. I can’t live without it
• IPython HTML notebook is #winning
Series
• Subclass of numpy.ndarray
• Data: any type
• Index labels need not be ordered
• Duplicates are possible (but
result in reduced functionality)
index   values
A       5
B       6
C       12
D       -5
E       6.7
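The Series described above can be sketched as follows. This is a minimal illustration that reuses the values and labels from the diagram and assumes a current pandas release, whose API may differ from the 0.7-era version shown in the talk. The names `s`, `dup`, and `both_b` are made up for the example.

```python
import pandas as pd

# Values and index labels copied from the slide's diagram.
s = pd.Series([5, 6, 12, -5, 6.7], index=["A", "B", "C", "D", "E"])

# Label-based selection: one label, then a list of labels in any order.
c_value = s["C"]
subset = s[["E", "A"]]

# Index labels need not be ordered, and duplicates are allowed
# (hypothetical data, not from the slides).
dup = pd.Series([1, 2, 3], index=["b", "a", "b"])
both_b = dup["b"]  # a duplicated label selects several values, not a scalar
```

Selecting a duplicated label returns another Series rather than a scalar, which is one example of the "reduced functionality" the slide mentions.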
DataFrame
• NumPy array-like
• Each column can have a
different type
• Row and column index
• Size mutable: insert and delete
columns
index   foo   bar   baz   qux
A       0     x     2.7   True
B       4     y     6     True
C       8     z     10    False
D       -12   w     NA    False
E       16    a     18    False
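A rough sketch of the DataFrame from the diagram, assuming a current pandas release and assuming that the diagram's values map to the columns in the order foo, bar, baz, qux. The derived column name `foo_doubled` is invented for illustration.

```python
import numpy as np
import pandas as pd

# One column per type: integers, strings, floats with a missing value, booleans.
df = pd.DataFrame(
    {
        "foo": [0, 4, 8, -12, 16],
        "bar": ["x", "y", "z", "w", "a"],
        "baz": [2.7, 6, 10, np.nan, 18],
        "qux": [True, True, False, False, False],
    },
    index=["A", "B", "C", "D", "E"],
)

# Size mutable: insert a derived column, then delete an existing one.
df["foo_doubled"] = df["foo"] * 2
del df["qux"]
```

Each column keeps its own dtype, so the missing value in `baz` does not force the integer column `foo` to become floating point.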
DataFrame
In [10]: tips[:10]
Out[10]:
total_bill tip sex smoker day time size
1 16.99 1.01 Female No Sun Dinner 2
2 10.34 1.66 Male No Sun Dinner 3
3 21.01 3.50 Male No Sun Dinner 3
4 23.68 3.31 Male No Sun Dinner 2
5 24.59 3.61 Female No Sun Dinner 4
6 25.29 4.71 Male No Sun Dinner 4
7 8.770 2.00 Male No Sun Dinner 2
8 26.88 3.12 Male No Sun Dinner 4
9 15.04 1.96 Male No Sun Dinner 2
10 14.78 3.23 Male No Sun Dinner 2
DataFrame
• Axis indexing enables rich data alignment,
joins / merges, reshaping, selection, etc.
day              Fri    Sat    Sun   Thur
sex    smoker
Female No      3.125  2.725  3.329  2.460
       Yes     2.683  2.869  3.500  2.990
Male   No      2.500  3.257  3.115  2.942
       Yes     2.741  2.879  3.521  3.058
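A table shaped like the one above could be produced with a pivot table. The sketch below is an assumption about how such a result might be built, not the code used in the talk: the rows are made up (only the column names come from the tips slide), and `index=` / `columns=` are the keyword names in current pandas, which differ from the 2012 API.

```python
import pandas as pd

# Made-up rows reusing the slide's column names; this is NOT the real tips data.
tips = pd.DataFrame(
    {
        "sex": ["Female", "Female", "Male", "Male", "Male", "Female"],
        "smoker": ["No", "Yes", "No", "Yes", "No", "No"],
        "day": ["Sun", "Sat", "Sun", "Sat", "Sat", "Sun"],
        "tip": [1.0, 2.0, 3.0, 4.0, 5.0, 3.0],
    }
)

# Mean tip for each (sex, smoker) row label and each day column label.
table = tips.pivot_table(
    values="tip", index=["sex", "smoker"], columns="day", aggfunc="mean"
)
```

The result has a two-level row index (sex, smoker) and one column per day; combinations with no rows come out as missing values.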
Axis indexing, the special
pandas-flavored sauce
• Enables “alignment-free” programming
• Prevents major source of data munging
frustration and errors
• Fast data selection
• Powerful way of describing reshape / join /
merge / pivot-table operations
Data alignment
• Binary operations are joins!
B  1       A  0       A  NA
C  2   +   B  1   =   B  2
D  3       C  2       C  4
E  4       D  3       D  6
                      E  NA
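The alignment example above, written out as a small sketch with a current pandas release. The variable names are made up; the labels and values are those of the diagram.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=["B", "C", "D", "E"])
s2 = pd.Series([0, 1, 2, 3], index=["A", "B", "C", "D"])

# Addition aligns on labels first: the result covers the union of both
# indexes, with NaN wherever a label is missing from either operand.
total = s1 + s2

# Treat a label missing from one side as 0 instead of propagating NaN.
filled = s1.add(s2, fill_value=0)
```

No manual reindexing or joining is needed before the arithmetic, which is the sense in which binary operations behave like joins.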
GroupBy
Split / apply / combine, with sum as the applied function:
Input (key, data): A 0, B 5, C 10, A 5, B 10, C 15, A 10, B 15, C 20
Split:   A: 0, 5, 10    B: 5, 10, 15    C: 10, 15, 20
Apply:   sum each group
Combine: A 15, B 30, C 45
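The split / apply / combine diagram can be sketched with `groupby`, assuming a current pandas release. The column names `key` and `data` are chosen for the example; the values are those from the diagram.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key": ["A", "B", "C", "A", "B", "C", "A", "B", "C"],
        "data": [0, 5, 10, 5, 10, 15, 10, 15, 20],
    }
)

# Split: group the "data" column by the values in "key".
grouped = df.groupby("key")["data"]
group_a = grouped.get_group("A")  # the rows that ended up in group A

# Apply and combine: sum each group, collected into one result keyed by group.
sums = grouped.sum()
```

The combined result is indexed by the group keys, so it can be aligned or joined with other data like any other Series.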
Hierarchical indexes
• Semantics: a tuple at each tick
• Enables easy group selection
• Terminology: “multiple levels”
• Natural part of GroupBy and
reshape operations
A  1
   2
   3
B  1
   2
   3
   4
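A hierarchical index with the same shape as the diagram might look like this in a current pandas release. The level names `outer` and `inner` and the data values are invented for the sketch.

```python
import pandas as pd

# A tuple at each tick: (outer label, inner label).
index = pd.MultiIndex.from_tuples(
    [("A", 1), ("A", 2), ("A", 3), ("B", 1), ("B", 2), ("B", 3), ("B", 4)],
    names=["outer", "inner"],
)
s = pd.Series(range(7), index=index)

group_b = s.loc["B"]       # group selection: everything under outer label "B"
single = s.loc[("A", 2)]   # a full tuple selects a single value
wide = s.unstack("inner")  # reshape: move the inner level into the columns
```

Unstacking leaves a missing value at ("A", 4), since group A has no fourth entry.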
Let’s have a little fun
To the IPython Notebook!
What’s in pandas?
• A big library: 40k SLOC
Tests!
• Huge accumulation of use cases originating
in real world applications
• 68 lines of tests for every 100 lines of code
pandas.core
• Data structures
• Series (1D)
• DataFrame (2D)
• Panel (3D)
• NA-friendly statistics
• Index implementations / label-indexing
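"NA-friendly statistics" can be illustrated with a short sketch, assuming a current pandas release. The values are arbitrary.

```python
import numpy as np
import pandas as pd

s = pd.Series([2.7, 6.0, 10.0, np.nan, 18.0])

total = s.sum()    # missing values are skipped by default
mean = s.mean()    # mean over the four non-missing values
count = s.count()  # number of non-missing values

# Opting out: with skipna=False a single NaN makes the result NaN.
strict = s.sum(skipna=False)
```

Skipping missing values is the default, so summary statistics work on incomplete data without a separate cleaning step.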
pandas.core
• GroupBy engine
• Time series tools
• Date range generation
• Extensible date offsets
• Hierarchical indexing stuff
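Date range generation and date offsets, sketched with the current pandas API (which may not match the 0.7-era functions from the talk). The dates are chosen around the talk date purely for illustration.

```python
import pandas as pd

# Five consecutive calendar days.
days = pd.date_range("2012-03-26", periods=5, freq="D")

# Three business days starting on a Friday: the weekend is skipped.
business_days = pd.date_range("2012-03-30", periods=3, freq="B")

# Date offsets can also be added directly to a timestamp.
next_bday = pd.Timestamp("2012-03-30") + pd.offsets.BDay(1)
```

Offsets such as `BDay` are ordinary objects, which is what makes the offset system extensible.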
Elsewhere
• Join / concatenation algorithms
• Sparse versions of Series, DataFrame...
• IO tools: CSV files, HDF5, Excel 2003/2007
• Moving window statistics (rolling mean, ...)
• Pivot tables
• High level matplotlib interface
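Two of the items above, a key-based join and a moving-window statistic, as a hedged sketch. It uses the current pandas API (`pd.merge` and `Series.rolling`), which is an assumption about today's library rather than the 2012 interface; all data is made up.

```python
import pandas as pd

# Join: combine two tables on a shared key column.
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})
joined = pd.merge(left, right, on="key", how="inner")

# Moving window statistic: mean over a sliding window of three values.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
rolling_mean = s.rolling(window=3).mean()
```

The first two entries of the rolling mean are missing because a full three-value window is not yet available at those positions.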
Hmm, pandas/src
• ~6000 lines of mostly Cython code
• Fast data algorithms that power the library
and make it fast
• pandas in PyPy?
Ok, so why Python?
• Look around you!
• Build a superior data analysis and statistical
computing environment
• Build mission-critical, data-driven
production systems
Trolling #rstats
Hash tables, anyone?
The pandas roadmap
• Improved time series capabilities
• Port GroupBy engine to NumPy only
• Better integration with statsmodels and
scikit-learn
• R integration via rpy2
The pandas roadmap
• Integration with JavaScript visualization
frameworks: D3, Flot, others
• Alternate DataFrame “backends”
• Memory maps
• HDF5 / PyTables
• SQL or NoSQL-backed
• Tighter IPython Notebook integration
ggplot2 for Python
• We need to build a better interface
for creating statistical graphics in Python
• Use pandas as the base layer!
• Upcoming project from Peter Wang: bokeh
pandas for “Big Data”
• Quite common to need to process
larger-than-RAM data sets
• Alternate DataFrame backends are the
likely solution
• Ripe for integration with MapReduce
frameworks
Better time series
• Integration of scikits.timeseries codebase
• NumPy datetime64 dtype
• Higher performance, less memory
Better time series
• Fixed frequency handling
• Time zones
• Multiple time concepts
• Intervals: 1984, or “1984 Q4”
• Timestamps: moment in time, to micro-
or nanosecond resolution
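The slide describes these time concepts as roadmap items. As a sketch of how they look in a current pandas release (an assumption about today's API, not something the talk shows), intervals correspond to `Period` and moments in time to `Timestamp`. The time of day and the time zone below are arbitrary.

```python
import pandas as pd

# Intervals: a whole year, or one quarter of it.
year = pd.Period("1984")
quarter = pd.Period("1984Q4")

# Timestamps: a moment in time, here with nanosecond resolution.
ts = pd.Timestamp("2012-03-27 19:00:00.000000001")

# Time zones: the same instant expressed in two zones.
ts_local = pd.Timestamp("2012-03-27 19:00", tz="America/New_York")
ts_utc = ts_local.tz_convert("UTC")
```

A period knows where it starts and ends (`quarter.start_time`, `quarter.end_time`), while a time-zone conversion changes only how an instant is displayed, not the instant itself.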
Thanks!
• Follow me on Twitter: @wesmckinn
• pydata/pandas on GitHub!
