11. April 2008 - Avant-garde PyData
● Socializing Python inside AQR, a quantitative hedge fund
● scipy.stats.models enabled some R -> Python workload migration
12. Dec 2009 - pandas 0.1
● First open source release after ~18 months of internal-only use
13. May 2011 - “PyData” core dev meetings
"Need a toolset that is robust, fast, and suitable
for a production environment..."
14. May 2011
"... but also good for interactive research... "
15. May 2011
"... and easy / intuitive for non-software
engineers to use"
16. May 2011
* also, we need to fix packaging
17. July 2011 - Concerns
"... the current state of affairs has me rather
anxious … these tools [e.g. pandas] have
largely not been integrated with any other tools
because of the community's collective
commitment anxiety"
http://wesmckinney.com/blog/a-roadmap-for-rich-scientific-data-structures-in-python/
20. Python for Data Analysis book - 2012
● A primer in data manipulation in Python
● Focus: NumPy, IPython/Jupyter, pandas, matplotlib
● 2 editions (2012, 2017)
● 8 translations so far
21. 2013-2014 - An Entrepreneurial Detour
DataPad: Python-powered Business Analytics
● Backend built with PyData stack + custom analytics
● Goal to contribute tech back to OSS ecosystem
23. PyData NYC 2013: 10 Things I Hate About pandas
● November 2013
● Summary: “pandas is not designed like, or intended to be used as, a database query engine”
26. Fall 2014: Python in a Big Data World
Task: Helping Python become a first-class technology for Big Data
Some Problems
● File formats
● JVM interop
● Non-array-oriented interfaces
28. Apache Arrow: Defragmenting data systems
● Language-independent open standard in-memory representation for columnar data (i.e. data frames)
● Easily reuse code targeting Arrow memory
● Efficient memory interchange
[Diagram: Arrow memory shared by the JVM Data Ecosystem, Database Systems, and Data Science Libraries]
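The idea of a columnar in-memory layout with copy-free interchange can be sketched in a few lines. This is a toy illustration using only the Python standard library, not the Arrow format or any Arrow API: the records, field names, and the `to_columns` helper are all hypothetical, and the real Arrow specification covers much more (for example null handling and variable-length data).

```python
from array import array

# Hypothetical row-oriented records, as an application might hold them.
rows = [
    {"symbol": "AAA", "price": 10.5, "volume": 100},
    {"symbol": "BBB", "price": 20.0, "volume": 250},
    {"symbol": "CCC", "price": 30.5, "volume": 75},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    names = list(records[0])
    return {name: [rec[name] for rec in records] for name in names}


columns = to_columns(rows)

# Store the numeric column in one contiguous, typed buffer of doubles.
prices = array("d", columns["price"])

# A memoryview exposes the same buffer to a second consumer without copying.
shared = memoryview(prices)
shared[0] = 11.0  # a write through the view is visible to the owner

mean_price = sum(shared) / len(shared)
```

Because `shared` refers to the same memory as `prices`, the two "systems" exchange the column without serializing or copying it, which is the property the slide calls efficient memory interchange.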
29. Apache Arrow: Defragmenting data systems
● https://github.com/apache/arrow
● Over 200 unique contributors
● Some level of support for 11 programming languages