Python in the future of “Big
Data” analytics
Travis Oliphant, PhD
Continuum Analytics, Inc
September 30, 2013
London, UK
Beginnings
Before / After
$$\rho_0 (2\pi f)^2 \, U_i(a, f) = \left[ C_{ijkl}(a, f) \, U_{k,l}(a, f) \right]_{,j}$$
Python origins.
Version Date
0.9.0 Feb. 1991
0.9.4 Dec. 1991
0.9.6 Apr. 1992
0.9.8 Jan. 1993
1.0.0 Jan. 1994
1.2 Apr. 1995
1.4 Oct. 1996
1.5.2 Apr. 1999
http://python-history.blogspot.com/2009/01/brief-timeline-of-python.html
A sample of users
Why Python
License
Community
Readable Syntax
Modern Constructs
Batteries Included
Free and Open Source, Permissive License
• Broad and friendly community
• Over 34,000 packages on PyPI
• Commercial Support
• Many conferences (PyData, SciPy, PyCons...)
• Executable pseudo-code
• Can understand and edit code a year later
• Fun to develop
• Use of Indentation
IPython
• Interactive prompt on steroids
• Allows less working memory
• Allows failing quickly for exploration
• List comprehensions
• Iterator protocol and generators
• Meta-programming
• Introspection
• (JIT Compiler and Concurrency)
• Internet (FTP, HTTP, SMTP, XMLRPC)
• Compression and Databases
• Logging, unit-tests
• Glue for other languages
• Distribution has much, much more....
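A few of the modern constructs named above (list comprehensions, the iterator protocol and generators, introspection) can be sketched in plain Python. The function and variable names here are illustrative only and do not come from the slides.

```python
import inspect


def squares(limit):
    """Generator: yields squares lazily instead of building a full list."""
    for n in range(limit):
        yield n * n


# List comprehension: build a filtered, transformed list in one expression.
even_squares = [n * n for n in range(10) if n % 2 == 0]

# Iterator protocol: next() pulls one value at a time from the generator.
gen = squares(4)
first = next(gen)
rest = list(gen)  # whatever the generator has not yielded yet

# Introspection: examine the function object at runtime.
params = list(inspect.signature(squares).parameters)
doc = inspect.getdoc(squares)
```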
Python supports a developer spectrum
Occasional
• Cut and paste
• Modify a few variables
• Call some functions
• Typical Quant or Engineer who doesn’t become programmer
Scientist Developer
• Extend frameworks
• Builds new objects
• Wraps code
• Quant / Engineer with decent developer skill
Developer
• Creates frameworks
• Creates compilers
• Typical CS grad
• Knows multiple languages
Unique aspect of Python
1999: Early SciPy emerges
Discussions on the matrix-sig from 1997 to 1999 called for a complete data analysis
environment: Paul Barrett, Joe Harrington, Perry Greenfield, Paul Dubois, Konrad Hinsen,
and others. Activity in 1998 led to increased interest in 1999.
In response, on 15 Jan 1999 I posted to matrix-sig a list of routines I felt needed to be
present and began wrapping / writing in earnest. On 6 April 1999, I announced I would
be creating this uber-package, which eventually became SciPy in 2001.
Gaussian quadrature 5 Jan 1999
cephes 1.0 30 Jan 1999
sigtools 0.40 23 Feb 1999
Numeric docs March 1999
cephes 1.1 9 Mar 1999
multipack 0.3 13 Apr 1999
Helper routines 14 Apr 1999
multipack 0.6 (leastsq, ode, fsolve, quad) 29 Apr 1999
sparse plan described 30 May 1999
multipack 0.7 14 Jun 1999
SparsePy 0.1 5 Nov 1999
cephes 1.2 (vectorize) 29 Dec 1999
Plotting??
Gist
XPLOT
DISLIN
Gnuplot
Helping with f2py
Brief History
Person                                     Package                   Year
Jim Fulton                                 Matrix Object in Python   1994
Jim Hugunin                                Numeric                   1995
Perry Greenfield, Rick White, Todd Miller  Numarray                  2001
Travis Oliphant                            NumPy                     2005
Community effort many, many others!
• Chuck Harris
• Pauli Virtanen
• Nathaniel Smith
• Warren Weckesser
• Ralf Gommers
• Robert Kern
• David Cournapeau
• Stefan van der Walt
• Jake Vanderplas
• Josef Perktold
• Anne Archibald
• Dag Sverre Seljebotn
• Joe Harrington --- Documentation effort
• Andrew Straw --- www.scipy.org





About 2,000,000 users of NumPy!
Scientific Stack
NumPy
SciPy Pandas Matplotlib
scikit-learn scikit-image statsmodels
PyTables
OpenCV
Cython
Numba SymPy NumExpr
astropy BioPython GDAL PySAL
... many many more ...
Now What?
After watching NumPy and SciPy get used all over
Science and Technology (including Finance) --- what
would I do differently?
Blaze
Numba
Conda (Anaconda)
Continuum began operations in January of 2012
Python
Travis Oliphant Peter Wang
(Most of) Our Team
Scientists Developers Business
NumFOCUS
expertise
Big Picture
We are big backers of NumFOCUS and
organizers of PyData
Spyder
How we pay the bills
Enterprise
Python
Scientific
Computing
Data Processing
Data Analysis
Visualisation
Scalable
Computing
• Products
• Training
• Support
• Consulting
“Big Data” and the Hype Cycle
Advanced Analytics and HPC
HPC
Supercomputing
HSC
Fault Tolerance
Erasure Coding
Hadoop / Disco
MPI
Big-Compute
Scalapack
Trilinos
PETSc
GPUs
?
Python
Python and Science
Python is the “language of Science”
(Lots of R users might disagree)
IPython notebook is quickly becoming the way
scientists communicate about their work
Pandas has recently started converting even R users to
Python
The problem of Hadoop
Hadoop wants to be the OS for “big-data”. Advanced
analytics and Hadoop don’t blend well.
Many people (led by hype) use Hadoop when they don’t
need to --- and it slows them down and costs them $$.
Scale up first. Then, scale-out.
http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html
“Don’t use Hadoop --- your data is not that big”
Options if you do need Hadoop
• Give Disco a try
• Try a non Java-specific emerging alternative to
HDFS (OrangeFS, GlusterFS, CephFS, Swift)
• Use Python wrapper to HDFS (snakebite,
webHDFS) and interface to map-reduce (luigi,
mrjob, MortarData CPython UDF etc.)
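For data that fits on one machine, the map-reduce pattern those wrappers expose can be tried in plain Python first. The sketch below is a single-process word count with explicit map, shuffle, and reduce steps. It does not use the API of luigi, mrjob, or any other library named above, and all function names are made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit (word, 1) pairs for one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key."""
    return key, sum(values)


def word_count(lines):
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data is big", "data has mass"])
```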
“Data Has Mass”
http://blog.mccrory.me/2010/12/07/data-gravity-in-the-clouds/
Workflow
Perspective
Workflow
Perspective
Data-centric
Perspective
The largest data analysis gap is in this
man-machine interface. How can we put
the scientist back in control of his data?
How can we build analysis tools that are
intuitive and that augment the scientist’s
intellect rather than adding to the
intellectual burden with a forest of arcane
user tools? The real challenge is building
this smart notebook that unlocks the data
and makes it easy to capture, organize,
analyze, visualize, and publish.
-- Jim Gray et al, 2005
Why Don’t Scientists Use DBs?
• Do not support scientific data types, or access
patterns particular to a scientific problem
• Scientists can handle their existing data volumes
using programming tools
• Once data was loaded, could not manipulate it
with standard/familiar programs
• Poor visualization and plotting integration
• Require an expensive guru to maintain
“If one takes the controversial view that HDF,
NetCDF, FITS, and Root are nascent database
systems that provide metadata and portability but
lack non-procedural query analysis, automatic
parallelism, and sophisticated indexing, then one
can see a fairly clear path that integrates these
communities.”
Convergence
Key Question
How do we move code to data, while
avoiding data silos?
Continuum key OS technologies
Conda: Cross-platform package manager (with environments)
Numba: Array-oriented Python Compiler for CPUs and GPUs (speed target is Fortran)
Blaze: NumPy and Pandas for out-of-core and distributed data (general
data-base execution engine for data-flow subset of Python)
Bokeh: Browser-based interactive visualization for Python users
CDX: Continuum Data Explorer
Ashiba: New web-app building with only Python and a little HTML
Our Emerging Platform
Rapid App Platform for SMEs
Wakari
Anaconda
Binstar
What is Conda
• Full package management (like yum or apt-get)
but cross-platform
• Control over environments (using link farms) --- better than virtualenv.
virtualenv today is like distutils and setuptools of several years ago
(great at first, but you end up hating it)
• Architected to be able to manage any packages
(R, Scala, Clojure, Haskell, Ruby, JS)
• SAT solver to manage dependencies
• User-definable repositories
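Treating dependency management as a satisfiability problem can be illustrated with a toy resolver. This is only a brute-force sketch over a made-up package index. conda uses a real SAT solver, and the package names, versions, and the `solve` function below are hypothetical.

```python
from itertools import product

# Hypothetical index: package -> version -> {dependency: allowed versions}
INDEX = {
    "app": {"1.0": {"numlib": {"1.1", "1.2"}, "plot": {"2.0"}}},
    "plot": {"2.0": {"numlib": {"1.1"}}, "1.0": {"numlib": {"1.0"}}},
    "numlib": {"1.0": {}, "1.1": {}, "1.2": {}},
}


def satisfies(assignment):
    """True if every installed package has all of its dependencies met."""
    for name, version in assignment.items():
        if version is None:  # package not installed
            continue
        for dep, allowed in INDEX[name][version].items():
            if assignment.get(dep) not in allowed:
                return False
    return True


def solve(requested):
    """Try every combination of versions; keep the smallest valid install."""
    names = sorted(INDEX)
    # Newest version first, then None meaning "leave uninstalled".
    options = [sorted(INDEX[n], reverse=True) + [None] for n in names]
    best = None
    for choice in product(*options):
        assignment = dict(zip(names, choice))
        if assignment[requested] is None or not satisfies(assignment):
            continue
        installed = {n: v for n, v in assignment.items() if v is not None}
        if best is None or len(installed) < len(best):
            best = installed
    return best


plan = solve("app")
```

Here `app` accepts two versions of `numlib`, but `plot 2.0` pins one of them, so only a single combination satisfies every constraint. Exhaustive search stops scaling quickly, which is the reason to hand the same constraints to a SAT solver.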
Binstar
Packaging and Distribution Solved
• conda and binstar solve most of the problems that
we have seen people encounter in managing
Python installations (especially in large-scale
institutions).
• They are supported solutions that can remove the
technology pain of managing Python
• Allow focus on software architecture and
separation of components (not just whatever
makes packaging convenient)
Anaconda
Free enterprise-ready Python distribution of open-
source tools for large-scale data processing,
predictive analytics, and scientific computing
Anaconda Add-Ons (paid-for)
•Revolutionary Python to GPU compiler
•Extends Numba to take a subset of Python
to the GPU (program CUDA in Python)
•CUDA FFT / BLAS interfaces
Fast, memory-efficient Python interface for
SQL databases, NoSQL stores, Amazon S3,
and large data files.
NumPy, SciPy, scikit-learn, NumExpr compiled
against Intel’s Math Kernel Library (MKL)
Launcher
Why Numba?
•Python is too slow for loops
•Most people are not learning C/C++/Fortran today
•Cython is an improvement (but still verbose and
needs a C compiler)
•NVIDIA using LLVM for the GPU
•Many people working with large typed-containers
(NumPy arrays)
•We want to take high-level, array-oriented
expressions and compile them to fast code
NumPy + Mamba = Numba
LLVM Library
Intel Nvidia Apple AMD
OpenCL ISPC CUDA CLANG OpenMP
LLVMPY
Python Function Machine Code
ARM
Example
Numba
from numba import jit

# Signature: three 2-D float64 arrays in, nothing returned.
@jit('void(f8[:,:],f8[:,:],f8[:,:])')
def filter(image, filt, output):
    M, N = image.shape
    m, n = filt.shape
    # Visit every pixel where the m x n kernel fits inside the image.
    for i in range(m//2, M-m//2):
        for j in range(n//2, N-n//2):
            result = 0.0
            for k in range(m):
                for l in range(n):
                    result += image[i+k-m//2,j+l-n//2]*filt[k, l]
            output[i,j] = result
~1500x speed-up
Numba changes the game!
LLVM IR
x86
C++
ARM
PTX
C
Fortran
Python
Numba turns (a subset of) Python into a
“compiled language” as fast as C (but much more
flexible). You don’t have to reach for C/C++
Laplace Example
from numba import jit

@jit('void(double[:,:], double, double)')
def numba_update(u, dx2, dy2):
    nx, ny = u.shape
    # In-place sweep over interior points (range replaces Python 2's xrange).
    for i in range(1, nx-1):
        for j in range(1, ny-1):
            u[i,j] = ((u[i+1,j] + u[i-1,j]) * dy2 +
                      (u[i,j+1] + u[i,j-1]) * dx2) / (2*(dx2+dy2))
Adapted from http://www.scipy.org/PerformancePython
originally by Prabhu Ramachandran
@jit('void(double[:,:], double, double)')
def numbavec_update(u, dx2, dy2):
    u[1:-1,1:-1] = ((u[2:,1:-1]+u[:-2,1:-1])*dy2 +
                    (u[1:-1,2:] + u[1:-1,:-2])*dx2) / (2*(dx2+dy2))
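One detail worth checking when comparing the looped and vectorized updates: a nested loop that writes in place sees neighbours already updated earlier in the same sweep, while a whole-array slice expression in NumPy evaluates its right-hand side from the old values before assigning. The dependency-free sketch below uses nested lists in place of NumPy arrays to show that a single sweep of each gives different numbers. The function names and the 4 x 4 test grid are assumptions for illustration, not part of the original benchmark.

```python
import copy


def loop_update(u, dx2, dy2):
    """In-place nested-loop sweep: reads values updated earlier in the sweep."""
    nx, ny = len(u), len(u[0])
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            u[i][j] = ((u[i + 1][j] + u[i - 1][j]) * dy2 +
                       (u[i][j + 1] + u[i][j - 1]) * dx2) / (2 * (dx2 + dy2))


def snapshot_update(u, dx2, dy2):
    """Sweep that reads only old values, like a whole-array slice expression."""
    old = copy.deepcopy(u)
    nx, ny = len(u), len(u[0])
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            u[i][j] = ((old[i + 1][j] + old[i - 1][j]) * dy2 +
                       (old[i][j + 1] + old[i][j - 1]) * dx2) / (2 * (dx2 + dy2))


def make_grid(n):
    """n x n grid of zeros with the top edge held at 1.0."""
    grid = [[0.0] * n for _ in range(n)]
    grid[0] = [1.0] * n
    return grid


a, b = make_grid(4), make_grid(4)
loop_update(a, 1.0, 1.0)
snapshot_update(b, 1.0, 1.0)
```

Both schemes approach the same steady state when iterated, but they need different numbers of sweeps to get there, so timings are best compared at equal accuracy rather than equal sweep count.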
Results of Laplace example
Version Time Speed Up
NumPy 3.19 1.0
Numba 2.32 1.38
Vect. Numba 2.33 1.37
Cython 2.38 1.34
Weave 2.47 1.29
Numexpr 2.62 1.22
Fortran Loops 2.30 1.39
Vect. Fortran 1.50 2.13
https://github.com/teoliphant/speed.git
LLVMPy worth looking at
LLVM (via
LLVMPy)
has done
much heavy
lifting
LLVMPy =
Compilers for
everybody
New Project
Blaze
NumPy
Out of Core,
Distributed and Optimized
NumPy
Blaze Objectives
• Flexible descriptor for tabular and semi-structured data
• Seamless handling of:
• On-disk / Out of core
• Streaming data
• Distributed data
• Uniform treatment of:
• “arrays of structures” and
“structures of arrays”
• missing values
• “ragged” shapes
• categorical types
• computed columns
Blaze Deferred Arrays
A + B*C
(Figure: expression graph with + at the root and children A and *; the * node has children B and C)
• Symbolic objects which build a graph
• Represents deferred computation
Usually what you have when
you have a Blaze Array
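The idea of symbolic objects that build a graph and defer computation can be sketched with operator overloading. This is not Blaze's implementation or API. The `Node`, `Leaf`, and `Op` classes are hypothetical and use scalars where Blaze would use arrays.

```python
import operator


class Node:
    """A symbolic node; arithmetic builds a graph instead of computing."""

    def __add__(self, other):
        return Op("+", self, other)

    def __mul__(self, other):
        return Op("*", self, other)


class Leaf(Node):
    """A named input whose value is supplied only at evaluation time."""

    def __init__(self, name):
        self.name = name

    def evaluate(self, env):
        return env[self.name]

    def __repr__(self):
        return self.name


class Op(Node):
    """An interior node holding an operator and two child nodes."""

    FUNCS = {"+": operator.add, "*": operator.mul}

    def __init__(self, symbol, left, right):
        self.symbol, self.left, self.right = symbol, left, right

    def evaluate(self, env):
        return self.FUNCS[self.symbol](self.left.evaluate(env),
                                       self.right.evaluate(env))

    def __repr__(self):
        return f"({self.left!r} {self.symbol} {self.right!r})"


A, B, C = Leaf("A"), Leaf("B"), Leaf("C")
expr = A + B * C  # builds the graph; no arithmetic happens yet
result = expr.evaluate({"A": 1, "B": 2, "C": 3})
```

Because the whole expression is available as a graph before anything runs, an engine is free to reorder, fuse, or distribute the work before evaluating it.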
DataShape Type System


• A data description language
• A super-set of NumPy’s dtype
• Provides more flexibility
• Integration with PADS coming
Shape DType
DataShape
Blaze
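(Diagram: a Blaze client presents a synthesized array/table view over several
array servers, which front a database, a GPU node, and NFS. Sources are addressed
by URIs such as array+sql://, array://, and file://. Consumers include the Python
REPL and scripts, a viz data server, C, C++, FORTRAN, and JVM languages.)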
Progress
• Basic calculations work out-of-core (via Numba and
LLVM)
• Hard dependency on dynd and dynd-python (a dynamic
C++-only multi-dimensional library like NumPy but with
many improvements)
• Persistent arrays from BLZ
• Basic array-server functionality for layering over CSV
files
• 0.2 release in 1-2 weeks. 0.3 within a month after that
(first usable release)
Querying BLZ
In [15]: from blaze import blz
In [16]: t = blz.open("TWITTER_LOG_Wed_Oct_31_22COLON22COLON28_EDT_2012-lvl9.blz")
In [17]: t['(latitude>7) & (latitude<10) & (longitude >-10 ) & (longitude < 10) '] # query
Out[17]:
array([ (263843037069848576L, u'Cossy set to release album:http://t.co/Nijbe9GgShared via
Nigeria News for Android. @', datetime.datetime(2012, 11, 1, 3, 20, 56), 'moses_peleg', u'kaduna',
9.453095, 8.0125194, ''),
...
dtype=[('tid', '<u8'), ('text', '<U140'), ('created_at', '<M8[us]'), ('userid', 'S16'), ('userloc', '<U64'),
('latitude', '<f8'), ('longitude', '<f8'), ('lang', 'S2')])
In [18]: t[1000:3000] # get a range of tweets
Out[18]:
array([ (263829044892692480L, u'boa noite? ;( \ue058\ue41d', datetime.datetime(2012, 11, 1, 2,
25, 20), 'maaribeiro_', u'', nan, nan, ''),
(263829044875915265L, u"Nah but I'm writing a gym journal... Watch it last 2 days!",
datetime.datetime(2012, 11, 1, 2, 25, 20), 'Ryan_Shizzle', u'Shizzlesville', nan, nan, ''),
...
Kiva:Array Server
DataShape + Raw JSON = Web Service
type KivaLoan = {
    id: int64;
    name: string;
    description: {
        languages: var, string(2);
        texts: json; # map<string(2), string>
    };
    status: string; # LoanStatusType;
    funded_amount: float64;
    basket_amount: json; # Option(float64);
    paid_amount: json; # Option(float64);
    image: {
        id: int64;
        template_id: int64;
    };
    video: json;
    activity: string;
    sector: string;
    use: string;
    delinquent: bool;
    location: {
        country_code: string(2);
        country: string;
        town: json; # Option(string);
        geo: {
            level: string; # GeoLevelType
            pairs: string; # latlong
            type: string; # GeoTypeType
        }
    };
    ....
{"id":200533,"name":"Miawand Group","description":{"languages":
["en"],"texts":{"en":"Ozer is a member of the Miawand Group. He lives in
the 16th district of Kabul, Afghanistan. He lives in a family of eight
members. He is single, but is a responsible boy who works hard and
supports the whole family. He is a carpenter and is busy working in his
shop seven days a week. He needs the loan to purchase wood and
needed carpentry tools such as tape measures, rulers and so on.\r\n\r\nHe
hopes to make progress through the loan and he is confident that
will make his repayments on time and will join for another loan cycle as
well. \r\n\r\n"}},"status":"paid","funded_amount":
925,"basket_amount":null,"paid_amount":925,"image":{"id":
539726,"template_id":
1},"video":null,"activity":"Carpentry","sector":"Construction","use":"He
wants to buy tools for his carpentry shop","delinquent":null,"location":
{"country_code":"AF","country":"Afghanistan","town":"Kabul
Afghanistan","geo":{"level":"country","pairs":"33
65","type":"point"}},"partner_id":
34,"posted_date":"2010-05-13T20:30:03Z","planned_expiration_date":
null,"loan_amount":
925,"currency_exchange_loss_amount":null,"borrowers":
[{"first_name":"Ozer","last_name":"","gender":"M","pictured":true},
{"first_name":"Rohaniy","last_name":"","gender":"M","pictured":true},
{"first_name":"Samem","last_name":"","gender":"M","pictured":true}],"terms":
{"disbursal_date":"2010-05-13T07:00:00Z","disbursal_currency":"AFN",
"disbursal_amount":42000,"loan_amount":925,"local_payments":
[{"due_date":"2010-06-13T07:00:00Z","amount":4200},
{"due_date":"2010-07-13T07:00:00Z","amount":4200},
{"due_date":"2010-08-13T07:00:00Z","amount":4200},
{"due_date":"2010-09-13T07:00:00Z","amount":4200},
{"due_date":"2010-10-13T07:00:00Z","amount":4200},
{"due_date":"2010-11-13T08:00:00Z","amount":4200},
{"due_date":"2010-12-13T08:00:00Z","amount":4200},
{"due_date":"2011-01-13T08:00:00Z","amount":4200},
{"due_date":"2011-02-13T08:00:00Z","amount":4200},
{"due_date":"2011-03-13T08:00:00Z","amount":
4200}],"scheduled_payments": ...
2.9 GB of JSON => network-queryable array: ~5 minutes
Kiva Array Server Demo
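Pairing a type declaration with raw JSON can be sketched with the standard library alone. The schema dictionary below stands in for a DataShape declaration and covers only a handful of fields from the sample record. The `LOAN_SCHEMA` name, the `load_record` helper, and the nullable flag are assumptions for illustration, not the Blaze array-server API.

```python
import json

# Hypothetical schema: field name -> (Python type, nullable)
LOAN_SCHEMA = {
    "id": (int, False),
    "name": (str, False),
    "status": (str, False),
    "funded_amount": (float, False),
    "basket_amount": (float, True),  # Option(float64) in the declaration
}


def load_record(raw, schema):
    """Parse one JSON document and coerce the fields named in the schema."""
    doc = json.loads(raw)
    record = {}
    for field, (kind, nullable) in schema.items():
        value = doc.get(field)
        if value is None:
            if not nullable:
                raise ValueError(f"missing required field: {field}")
            record[field] = None
        else:
            record[field] = kind(value)
    return record


raw = ('{"id":200533,"name":"Miawand Group","status":"paid",'
       '"funded_amount":925,"basket_amount":null}')
loan = load_record(raw, LOAN_SCHEMA)
```

With the types known up front, each record can be stored in a fixed layout and queried as an array rather than re-parsed as text on every request.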
DARPA providing help
DARPA-BAA-12-38: XDATA
TA-1: Scalable analytics and data processing technology	
  
TA-2: Visual user interface technology
Bokeh Plotting Library
• Interactive graphics for the web
• Designed for large datasets
• Designed for streaming data
• Native interface in Python
• Fast JavaScript component
• DARPA funded
• v0.1 release imminent
Reasons for Bokeh
1. Plotting must happen near the data too
2. Quick iteration is essential => interactive visualization
3. Interactive visualization on remote-data => use the browser
4. Almost all web plotting libraries are either:
1. Designed for javascript programmers
2. Designed to output static graphs
5. We designed Bokeh to be dynamic graphing in the web for
Python programmers
6. Will include “Abstract” or “synthetic” rendering (working on
Hadoop and Spark compatibility)
Abstract Rendering
Pixels are Bins… and always have been
(Figure: geometry elements A, B, C, D projected through a Z-view onto pixels,
with per-pixel counts 1 2 2 3 4 4 3 2 2 1)
Hi-def Alpha
Abstract Rendering
Basic AR can identify trouble spots in standard plots, and also
offer automatic tone mapping, taking perception into
account.
37 mil elements, showing adjacency between entities in Kiva dataset
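The "pixels are bins" idea can be sketched as a two-dimensional histogram: each data point is mapped to the pixel it lands in, and the renderer works from per-pixel counts rather than from the raw points. The function name, the tiny 2 x 2 grid, and the sample points below are made up for illustration. This is not Bokeh's abstract rendering code.

```python
def rasterize_counts(points, width, height, x_range, y_range):
    """Count how many points fall into each pixel of a width x height grid."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    counts = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue  # point lies outside the view
        col = int((x - x_min) / (x_max - x_min) * width)
        row = int((y - y_min) / (y_max - y_min) * height)
        counts[row][col] += 1
    return counts


points = [(0.1, 0.1), (0.2, 0.3), (0.9, 0.9), (0.6, 0.1), (0.7, 0.2)]
grid = rasterize_counts(points, width=2, height=2,
                        x_range=(0.0, 1.0), y_range=(0.0, 1.0))
```

The counts grid has a fixed size no matter how many points go in, which is what makes tone mapping and over-plotting detection practical on very large datasets.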
Wakari
• Browser-based data analysis and
visualization platform
• WordPress / YouTube / GitHub
for data analysis
• Full Linux environment with
Anaconda Python
• Can be installed on internal
clusters & servers
Why Wakari?
• Data is too big to fit on your desktop
• You need compute power but don’t have easy access to a
large cluster (cloud is sitting there with lots of power)
• Configuration of software on a new system stinks
(especially a cluster).
• Collaborative Data Analytics --- you want to build a
complex technical workflow and then share it with others
easily (without requiring they do painful configuration to
see your results)
• IPython Notebook is awesome --- let’s share it (but we
also need the dependencies and data).
Wakari
• Free account has 512 MB RAM / 2 GB disk and shared
multi-core CPU
• Easily spin-up map-reduce (Disco and Hadoop clusters)
• Use IPython Parallel on many-nodes in the cloud
• Develop GUI apps (possibly in Anaconda) and publish
them easily to Wakari, based on the full power of scientific
Python --- complex technical workflows (IPython
notebook for now)
Basic Data Explorer
Continuum Data Explorer (CDX)
• Open Source
• Goal is interactivity
• Combination of IPython REPL, Bokeh, and tables
• Tight integration between GUI elements and REPL
• Current features
- Namespace viewer (mapped to IPython namespace)
- DataTable widget with group-by, computed columns, advanced-
filters
- Interactive Plots connected to tables
CDX
Conclusion
Projects circle around giving tools to experts
(occasional programmers or domain
experts) to enable them to move their
expertise to the data to get insights --- keep
data where it is and move high-level but
performant code
Join us or ask how we can help you!


  • 1. Python in the future of “Big Data” analytics Travis Oliphant, PhD Continuum Analytics, Inc September 30, 2013 London, UK
  • 2. Beginnings. Before / After. $\rho_0 (2\pi f)^2 U_i(a, f) = [C_{ijkl}(a, f)\, U_{k,l}(a, f)]_{,j}$
  • 3. Python origins.
      Version   Date
      0.9.0     Feb. 1991
      0.9.4     Dec. 1991
      0.9.6     Apr. 1992
      0.9.8     Jan. 1993
      1.0.0     Jan. 1994
      1.2       Apr. 1995
      1.4       Oct. 1996
      1.5.2     Apr. 1999
      Source: http://python-history.blogspot.com/2009/01/brief-timeline-of-python.html
  • 4. A sample of users
  • 5. Why Python. License: Free and Open Source, Permissive License. Community: • Broad and friendly community • Over 34,000 packages on PyPI • Commercial Support • Many conferences (PyData, SciPy, PyCons...). Readable Syntax: • Executable pseudo-code • Can understand and edit code a year later • Fun to develop • Use of Indentation. IPython: • Interactive prompt on steroids • Allows less working memory • Allows failing quickly for exploration. Modern Constructs: • List comprehensions • Iterator protocol and generators • Meta-programming • Introspection • (JIT Compiler and Concurrency). Batteries Included: • Internet (FTP, HTTP, SMTP, XMLRPC) • Compression and Databases • Logging, unit-tests • Glue for other languages • Distribution has much, much more....
  • 6. Python supports a developer spectrum. Occasional: • Cut and paste • Modify a few variables • Call some functions • Typical Quant or Engineer who doesn’t become programmer. Scientist: • Extend frameworks • Builds new objects • Wraps code • Quant / Engineer with decent developer skill. Developer: • Creates frameworks • Creates compilers • Typical CS grad • Knows multiple languages. Unique aspect of Python
  • 7. 1999 : Early SciPy emerges. Discussions on the matrix-sig from 1997 to 1999 wanting a complete data analysis environment: Paul Barrett, Joe Harrington, Perry Greenfield, Paul Dubois, Konrad Hinsen, and others. Activity in 1998 led to increased interest in 1999. In response, on 15 Jan 1999, I posted to matrix-sig a list of routines I felt needed to be present and began wrapping / writing in earnest. On 6 April 1999, I announced I would be creating this uber-package, which eventually became SciPy in 2001. Timeline: Gaussian quadrature (5 Jan 1999); cephes 1.0 (30 Jan 1999); sigtools 0.40 (23 Feb 1999); Numeric docs (March 1999); cephes 1.1 (9 Mar 1999); multipack 0.3 (13 Apr 1999); Helper routines (14 Apr 1999); multipack 0.6 (leastsq, ode, fsolve, quad) (29 Apr 1999); sparse plan described (30 May 1999); multipack 0.7 (14 Jun 1999); SparsePy 0.1 (5 Nov 1999); cephes 1.2 (vectorize) (29 Dec 1999). Plotting?? Gist, XPLOT, DISLIN, Gnuplot. Helping with f2py
  • 8. Brief History
      Person                                      Package                   Year
      Jim Fulton                                  Matrix Object in Python   1994
      Jim Hugunin                                 Numeric                   1995
      Perry Greenfield, Rick White, Todd Miller   Numarray                  2001
      Travis Oliphant                             NumPy                     2005
  • 9. Community effort many, many others! • Chuck Harris • Pauli Virtanen • Nathaniel Smith • Warren Weckesser • Ralf Gommers • Robert Kern • David Cournapeau • Stefan van der Walt • Jake Vanderplas • Josef Perktold • Anne Archibald • Dag Sverre Seljebotn • Joe Harrington --- Documentation effort • Andrew Straw --- www.scipy.org
  • 11. Scientific Stack: NumPy, SciPy, Pandas, Matplotlib, scikit-learn, scikit-image, statsmodels, PyTables, OpenCV, Cython, Numba, SymPy, NumExpr, astropy, BioPython, GDAL, PySAL ... many many more ...
  • 12. Now What? After watching NumPy and SciPy get used all over Science and Technology (including Finance) --- what would I do differently? Blaze Numba Conda (Anaconda)
  • 13. Continuum began operations in January of 2012 Python Travis Oliphant Peter Wang
  • 14. (Most of) Our Team Scientists Developers Business NumFOCUS
  • 16. We are big backers of NumFOCUS and organizers of PyData Spyder
  • 17. How we pay the bills Enterprise Python Scientific Computing Data Processing Data Analysis Visualisation Scalable Computing • Products • Training • Support • Consulting
  • 18. “Big Data” and the Hype Cycle
  • 19. Advanced Analytics and HPC HPC Supercomputing HSC Fault Tolerance Erasure Coding Hadoop / Disco MPI Big-Compute Scalapack Trilinos PETSc GPUs ? Python
  • 20. Python and Science Python is the “language of Science” (Lots of R users might disagree) IPython notebook is quickly becoming the way scientists communicate about their work Pandas has recently started converting even R users to Python
  • 21. The problem of Hadoop Hadoop wants to be the OS for “big-data”. Advanced analytics and Hadoop don’t blend well. Many people (led by hype) use Hadoop when they don’t need to --- and it slows them down and costs them $$. Scale up first. Then, scale-out. http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html “Don’t use Hadoop --- your data is not that big”
  • 22. Options if you do need Hadoop • Give Disco a try • Try a non Java-specific emerging alternative to HDFS (OrangeFS, GlusterFS, CephFS, Swift) • Use Python wrapper to HDFS (snakebite, webHDFS) and interface to map-reduce (luigi, mrjob, MortarData CPython UDF etc.)
  • 26. The largest data analysis gap is in this man-machine interface. How can we put the scientist back in control of his data? How can we build analysis tools that are intuitive and that augment the scientist’s intellect rather than adding to the intellectual burden with a forest of arcane user tools? The real challenge is building this smart notebook that unlocks the data and makes it easy to capture, organize, analyze, visualize, and publish. -- Jim Gray et al, 2005
  • 27. Why Don’t Scientists Use DBs? • Do not support scientific data types, or access patterns particular to a scientific problem • Scientists can handle their existing data volumes using programming tools • Once data was loaded, could not manipulate it with standard/familiar programs • Poor visualization and plotting integration • Require an expensive guru to maintain
  • 28. “If one takes the controversial view that HDF, NetCDF, FITS, and Root are nascent database systems that provide metadata and portability but lack non-procedural query analysis, automatic parallelism, and sophisticated indexing, then one can see a fairly clear path that integrates these communities.” Convergence
  • 29. Key Question How do we move code to data, while avoiding data silos?
  • 30. Continuum key OS technologies. Conda: Cross-platform package manager (with environments). Numba: Array-oriented Python Compiler for CPUs and GPUs (speed target is Fortran). Blaze: NumPy and Pandas for out-of-core and distributed data (general data-base execution engine for data-flow subset of Python). Bokeh: Browser-based interactive visualization for Python users. CDX: Continuum Data Explorer. Ashiba: New web-app building with only Python and a little HTML
  • 31. Our Emerging Platform Rapid App Platform for SMEs Wakari Anaconda Binstar
  • 32. What is Conda • Full package management (like yum or apt-get) but cross-platform • Control over environments (using link farms) --- better than virtual-env. virtualenv today is like distutils and setuptools of several years ago (great at first but will end up hating it) • Architected to be able to manage any packages (R, Scala, Clojure, Haskell, Ruby, JS) • SAT solver to manage dependencies • User-definable repositories
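The "SAT solver to manage dependencies" bullet can be made concrete with a toy resolver. This is a minimal sketch and not conda's actual algorithm: it swaps a real SAT solver for exhaustive search, and the package index, version numbers, and function names below are all hypothetical.

```python
from itertools import product

# Hypothetical index: package -> {version: {dependency: allowed versions}}
INDEX = {
    "app":   {"1.0": {"numpy": {"1.7", "1.8"}, "scipy": {"0.13"}}},
    "scipy": {"0.12": {"numpy": {"1.6", "1.7"}}, "0.13": {"numpy": {"1.7"}}},
    "numpy": {"1.6": {}, "1.7": {}, "1.8": {}},
}


def consistent(choice):
    """True if every chosen package finds each dependency at an allowed version."""
    for pkg, ver in choice.items():
        for dep, allowed in INDEX[pkg][ver].items():
            if choice.get(dep) not in allowed:
                return False
    return True


def solve():
    """Try every version combination, newest first, and return the first that fits.

    Versions are compared as plain strings here, which is only adequate
    for this toy index.
    """
    names = sorted(INDEX)
    candidates = (sorted(INDEX[name], reverse=True) for name in names)
    for versions in product(*candidates):
        choice = dict(zip(names, versions))
        if consistent(choice):
            return choice
    return None


solution = solve()
```

Exhaustive search grows exponentially with the number of packages, which is the reason a real package manager encodes the same constraints as a boolean satisfiability problem instead.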
  • 34. Packaging and Distribution Solved • conda and binstar solve most of the problems that we have seen people encounter in managing Python installations (especially in large-scale institutions). • They are supported solutions that can remove the technology pain of managing Python • Allow focus on software architecture and separation of components (not just whatever makes packaging convenient)
  • 35. Anaconda Free enterprise-ready Python distribution of open-source tools for large-scale data processing, predictive analytics, and scientific computing
  • 36. Anaconda Add-Ons (paid-for) • Revolutionary Python to GPU compiler • Extends Numba to take a subset of Python to the GPU (program CUDA in Python) • CUDA FFT / BLAS interfaces. Fast, memory-efficient Python interface for SQL databases, NoSQL stores, Amazon S3, and large data files. NumPy, SciPy, scikit-learn, NumExpr compiled against Intel’s Math Kernel Library (MKL)
  • 38. Why Numba? • Python is too slow for loops • Most people are not learning C/C++/Fortran today • Cython is an improvement (but still verbose and needs a C compiler) • NVIDIA is using LLVM for the GPU • Many people are working with large typed containers (NumPy arrays) • We want to take high-level, array-oriented expressions and compile them to fast code
  • 39. NumPy + Mamba = Numba [Diagram labels: Python Function, LLVMPY, LLVM Library, Machine Code; Intel, Nvidia, Apple, AMD, ARM; OpenCL, ISPC, CUDA, CLANG, OpenMP]
  • 41. Numba (~1500x speed-up)

      from numba import jit

      # 2-D filter: slide filt over image, writing interior points of output
      @jit('void(f8[:,:],f8[:,:],f8[:,:])')
      def filter(image, filt, output):
          M, N = image.shape
          m, n = filt.shape
          for i in range(m//2, M-m//2):
              for j in range(n//2, N-n//2):
                  result = 0.0
                  for k in range(m):
                      for l in range(n):
                          result += image[i+k-m//2,j+l-n//2]*filt[k, l]
                  output[i,j] = result
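To see what the kernel on this slide computes, the same loop nest can be run undecorated on a tiny input. This is a minimal sketch: the name `filter2d` and the 5x5 / 3x3 test arrays are illustrative choices, not from the talk, and no speed-up is measured here.

```python
import numpy as np


def filter2d(image, filt, output):
    """Pure-Python copy of the slide's loop nest (no @jit decorator)."""
    M, N = image.shape
    m, n = filt.shape
    # Only interior points are written; a border of width m//2, n//2 is skipped.
    for i in range(m // 2, M - m // 2):
        for j in range(n // 2, N - n // 2):
            result = 0.0
            for k in range(m):
                for l in range(n):
                    result += image[i + k - m // 2, j + l - n // 2] * filt[k, l]
            output[i, j] = result


image = np.ones((5, 5))
filt = np.ones((3, 3))
output = np.zeros_like(image)
filter2d(image, filt, output)
print(output)
```

With an all-ones image and an all-ones 3x3 filter, each interior output value is the sum of nine ones, while the untouched border stays at zero. Four nested loops of scalar arithmetic like this are the case the talk says plain Python handles too slowly.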
  • 42. Numba changes the game! LLVM IR x86 C++ ARM PTX C Fortran Python Numba turns (a subset of) Python into a “compiled language” as fast as C (but much more flexible). You don’t have to reach for C/C++
  • 43. Laplace Example (adapted from http://www.scipy.org/PerformancePython, originally by Prabhu Ramachandran)

      from numba import jit

      # Explicit loops over the interior grid points
      @jit('void(double[:,:], double, double)')
      def numba_update(u, dx2, dy2):
          nx, ny = u.shape
          for i in range(1, nx-1):
              for j in range(1, ny-1):
                  u[i,j] = ((u[i+1,j] + u[i-1,j]) * dy2 +
                            (u[i,j+1] + u[i,j-1]) * dx2) / (2*(dx2+dy2))

      # The update written with whole-array slices
      @jit('void(double[:,:], double, double)')
      def numbavec_update(u, dx2, dy2):
          u[1:-1,1:-1] = ((u[2:,1:-1]+u[:-2,1:-1])*dy2 +
                          (u[1:-1,2:] + u[1:-1,:-2])*dx2) / (2*(dx2+dy2))
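The slice-based update above can be exercised with plain NumPy to see it relax toward a solution of Laplace's equation. This is a minimal sketch under assumptions that are not from the talk: a 21x21 grid, the top edge held at 1 with the other edges at 0, and a fixed 2000 sweeps.

```python
import numpy as np


def laplace_step(u, dx2, dy2):
    """One sweep using the same arithmetic as the slice-based update on the slide."""
    u[1:-1, 1:-1] = ((u[2:, 1:-1] + u[:-2, 1:-1]) * dy2 +
                     (u[1:-1, 2:] + u[1:-1, :-2]) * dx2) / (2 * (dx2 + dy2))


def solve(n=21, sweeps=2000):
    """Relax an n-by-n grid with a hypothetical boundary condition."""
    u = np.zeros((n, n))
    u[0, :] = 1.0  # assumed boundary: top edge held at 1, other edges at 0
    dx2 = dy2 = (1.0 / (n - 1)) ** 2
    for _ in range(sweeps):
        laplace_step(u, dx2, dy2)
    return u


u = solve()
print(u[10, 10])
```

By symmetry the centre of a square with one edge at 1 and three edges at 0 settles at one quarter, which makes a convenient sanity check. NumPy evaluates the right-hand side fully before assigning, so each sweep uses only old values, whereas the explicit-loop version updates in place and reuses values already changed in the same sweep.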
  • 44. Results of Laplace example
      Version          Time   Speed Up
      NumPy            3.19   1.0
      Numba            2.32   1.38
      Vect. Numba      2.33   1.37
      Cython           2.38   1.34
      Weave            2.47   1.29
      Numexpr          2.62   1.22
      Fortran Loops    2.30   1.39
      Vect. Fortran    1.50   2.13
      Source: https://github.com/teoliphant/speed.git
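The Speed Up column appears to be the NumPy time divided by each version's time. A short check of that reading, using only the times from the slide (the variable names are illustrative):

```python
# Times copied from the slide; the slide does not state the unit.
times = {
    "NumPy": 3.19,
    "Numba": 2.32,
    "Vect. Numba": 2.33,
    "Cython": 2.38,
    "Weave": 2.47,
    "Numexpr": 2.62,
    "Fortran Loops": 2.30,
    "Vect. Fortran": 1.50,
}

baseline = times["NumPy"]
speedups = {name: baseline / t for name, t in times.items()}

for name, ratio in speedups.items():
    print("%-14s %.2f" % (name, ratio))
```

Every ratio agrees with the slide's column to within rounding, which supports reading the NumPy row as the baseline.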
  • 45. LLVMPy worth looking at LLVM (via LLVMPy) has done much heavy lifting LLVMPy = Compilers for everybody
  • 46. New Project Blaze NumPy Out of Core, Distributed and Optimized NumPy
  • 47. Blaze Objectives • Flexible descriptor for tabular and semi-structured data • Seamless handling of: • On-disk / Out of core • Streaming data • Distributed data • Uniform treatment of: • “arrays of structures” and “structures of arrays” • missing values • “ragged” shapes • categorical types • computed columns
  • 48. Blaze Deferred Arrays. Example: A + B*C, drawn as a graph with "+" at the root over A and "*", and "*" over B and C • Symbolic objects which build a graph • Represents deferred computation. Usually what you have when you have a Blaze Array
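The deferred-array idea can be illustrated in a few lines. This is a minimal sketch and not Blaze's API: the `Deferred` class, the `leaf` helper, and the `evaluate` method are hypothetical names chosen for illustration.

```python
import numpy as np


class Deferred:
    """Node in an expression graph; nothing is computed until evaluate()."""

    def __init__(self, op, args):
        self.op = op      # "leaf", "add", or "mul"
        self.args = args  # (data, name) for a leaf, (left, right) otherwise

    def __add__(self, other):
        return Deferred("add", (self, other))

    def __mul__(self, other):
        return Deferred("mul", (self, other))

    def evaluate(self):
        """Walk the graph bottom-up and compute the result."""
        if self.op == "leaf":
            return self.args[0]
        left, right = (arg.evaluate() for arg in self.args)
        return left + right if self.op == "add" else left * right

    def __repr__(self):
        if self.op == "leaf":
            return self.args[1]
        symbol = "+" if self.op == "add" else "*"
        return "(%r %s %r)" % (self.args[0], symbol, self.args[1])


def leaf(data, name):
    """Wrap concrete data as a named leaf node."""
    return Deferred("leaf", (np.asarray(data), name))


A = leaf([1.0, 2.0], "A")
B = leaf([3.0, 4.0], "B")
C = leaf([5.0, 6.0], "C")

expr = A + B * C          # builds a graph, computes nothing yet
print(repr(expr))
print(expr.evaluate())
```

Because `expr` is only a description of the computation, a system built this way is free to inspect the graph and decide where and how to run it before touching the data.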
  • 49. DataShape Type System   • A data description language • A super-set of NumPy’s dtype • Provides more flexibility • Integration with PADS coming Shape DType DataShape
  • 50. Blaze Database GPU Node Array Server NFS Array Server Array Server Blaze Client Synthesized Array/Table view array+sql:// array:// file:// array:// Python REPL, Scripts Viz Data Server C, C++, FORTRAN JVM languages
  • 51. Progress • Basic calculations work out-of-core (via Numba and LLVM) • Hard dependency on dynd and dynd-python (a dynamic C++-only multi-dimensional library like NumPy but with many improvements) • Persistent arrays from BLZ • Basic array-server functionality for layering over CSV files • 0.2 release in 1-2 weeks. 0.3 within a month after that (first usable release)
  • 52. Querying BLZ

      In [15]: from blaze import blz
      In [16]: t = blz.open("TWITTER_LOG_Wed_Oct_31_22COLON22COLON28_EDT_2012-lvl9.blz")
      In [17]: t['(latitude>7) & (latitude<10) & (longitude >-10 ) & (longitude < 10) '] # query
      Out[17]:
      array([ (263843037069848576L, u'Cossy set to release album:http://t.co/Nijbe9GgShared via Nigeria News for Android. @', datetime.datetime(2012, 11, 1, 3, 20, 56), 'moses_peleg', u'kaduna', 9.453095, 8.0125194, ''),
             ...
            dtype=[('tid', '<u8'), ('text', '<U140'), ('created_at', '<M8[us]'), ('userid', 'S16'), ('userloc', '<U64'), ('latitude', '<f8'), ('longitude', '<f8'), ('lang', 'S2')])
      In [18]: t[1000:3000] # get a range of tweets
      Out[18]:
      array([ (263829044892692480L, u'boa noite? ;( \ue058\ue41d', datetime.datetime(2012, 11, 1, 2, 25, 20), 'maaribeiro_', u'', nan, nan, ''),
             (263829044875915265L, u"Nah but I'm writing a gym journal... Watch it last 2 days!", datetime.datetime(2012, 11, 1, 2, 25, 20), 'Ryan_Shizzle', u'Shizzlesville', nan, nan, ''),
             ...
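For comparison, the same latitude/longitude filter can be written against an in-memory NumPy structured array with a boolean mask. This is a minimal sketch using plain NumPy rather than the BLZ API; the three rows and the reduced set of fields are made-up sample data.

```python
import numpy as np

# Made-up rows using a subset of the field names shown in the BLZ output above.
tweets = np.array(
    [
        (1, "user_a", 9.45, 8.01),
        (2, "user_b", 51.50, -0.12),
        (3, "user_c", 8.20, 30.00),
    ],
    dtype=[("tid", "<u8"), ("userid", "U16"),
           ("latitude", "<f8"), ("longitude", "<f8")],
)

# Equivalent of '(latitude>7) & (latitude<10) & (longitude>-10) & (longitude<10)'
mask = ((tweets["latitude"] > 7) & (tweets["latitude"] < 10) &
        (tweets["longitude"] > -10) & (tweets["longitude"] < 10))

hits = tweets[mask]
window = tweets[1:3]   # a range of rows, like t[1000:3000] on the slide
print(hits)
```

The mask approach needs every column in memory at once, whereas the slide's string query is handed to a persistent on-disk container.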
  • 53. Kiva: Array Server. DataShape + Raw JSON = Web Service

      type KivaLoan = {
        id: int64;
        name: string;
        description: {
          languages: var, string(2);
          texts: json # map<string(2), string>;
        };
        status: string; # LoanStatusType;
        funded_amount: float64;
        basket_amount: json; # Option(float64);
        paid_amount: json; # Option(float64);
        image: {
          id: int64;
          template_id: int64;
        };
        video: json;
        activity: string;
        sector: string;
        use: string;
        delinquent: bool;
        location: {
          country_code: string(2);
          country: string;
          town: json; # Option(string);
          geo: {
            level: string; # GeoLevelType
            pairs: string; # latlong
            type: string; # GeoTypeType
          }
        };
        ....

      {"id":200533,"name":"Miawand Group","description":{"languages":["en"],"texts":{"en":"Ozer is a member of the Miawand Group. He lives in the 16th district of Kabul, Afghanistan. He lives in a family of eight members. He is single, but is a responsible boy who works hard and supports the whole family. He is a carpenter and is busy working in his shop seven days a week. He needs the loan to purchase wood and needed carpentry tools such as tape measures, rulers and so on.\r\n\r\nHe hopes to make progress through the loan and he is confident that will make his repayments on time and will join for another loan cycle as well. \r\n\r\n"}},
       "status":"paid","funded_amount":925,"basket_amount":null,"paid_amount":925,"image":{"id":539726,"template_id":1},"video":null,"activity":"Carpentry","sector":"Construction","use":"He wants to buy tools for his carpentry shop","delinquent":null,
       "location":{"country_code":"AF","country":"Afghanistan","town":"Kabul Afghanistan","geo":{"level":"country","pairs":"33 65","type":"point"}},
       "partner_id":34,"posted_date":"2010-05-13T20:30:03Z","planned_expiration_date":null,"loan_amount":925,"currency_exchange_loss_amount":null,
       "borrowers":[{"first_name":"Ozer","last_name":"","gender":"M","pictured":true},{"first_name":"Rohaniy","last_name":"","gender":"M","pictured":true},{"first_name":"Samem","last_name":"","gender":"M","pictured":true}],
       "terms":{"disbursal_date":"2010-05-13T07:00:00Z","disbursal_currency":"AFN","disbursal_amount":42000,"loan_amount":925,
       "local_payments":[{"due_date":"2010-06-13T07:00:00Z","amount":4200},{"due_date":"2010-07-13T07:00:00Z","amount":4200},{"due_date":"2010-08-13T07:00:00Z","amount":4200},{"due_date":"2010-09-13T07:00:00Z","amount":4200},{"due_date":"2010-10-13T07:00:00Z","amount":4200},{"due_date":"2010-11-13T08:00:00Z","amount":4200},{"due_date":"2010-12-13T08:00:00Z","amount":4200},{"due_date":"2011-01-13T08:00:00Z","amount":4200},{"due_date":"2011-02-13T08:00:00Z","amount":4200},{"due_date":"2011-03-13T08:00:00Z","amount":4200}],
       "scheduled_payments": ...

      2.9 GB of JSON => network-queryable array: ~5 minutes
      Kiva Array Server Demo
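The "DataShape + Raw JSON" idea, a declared type laid over untyped JSON, can be approximated with the standard library and a NumPy structured dtype. This is a minimal sketch and not the Blaze array server: the dtype, the string widths, and the `to_record` helper are hypothetical, and the record is a trimmed copy of the slide's sample loan.

```python
import json

import numpy as np

# Trimmed record; the values are copied from the sample loan on the slide.
raw = ('{"id": 200533, "name": "Miawand Group", "status": "paid", '
       '"funded_amount": 925, "basket_amount": null, '
       '"location": {"country_code": "AF"}}')

# Hypothetical flat dtype standing in for part of the KivaLoan datashape.
loan_dtype = np.dtype([
    ("id", "<i8"),
    ("name", "U64"),
    ("status", "U16"),
    ("funded_amount", "<f8"),
    ("basket_amount", "<f8"),   # Option(float64): null is stored as NaN here
    ("country_code", "U2"),
])


def to_record(text):
    """Parse one JSON loan and coerce it into a typed one-row array."""
    doc = json.loads(text)
    basket = doc["basket_amount"]
    row = (
        doc["id"],
        doc["name"],
        doc["status"],
        doc["funded_amount"],
        np.nan if basket is None else basket,
        doc["location"]["country_code"],
    )
    return np.array([row], dtype=loan_dtype)


loans = to_record(raw)
print(loans["funded_amount"])
```

Once the records carry a declared type, column access and filtering work as on any array. Representing a missing optional value as NaN is one possible choice among several and is an assumption of this sketch.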
  • 54. DARPA providing help DARPA-BAA-12-38: XDATA TA-1: Scalable analytics and data processing technology   TA-2: Visual user interface technology
  • 55. Bokeh Plotting Library • Interactive graphics for the web • Designed for large datasets • Designed for streaming data • Native interface in Python • Fast JavaScript component • DARPA funded • v0.1 release imminent
  • 56. Reasons for Bokeh 1. Plotting must happen near the data too 2. Quick iteration is essential => interactive visualization 3. Interactive visualization on remote-data => use the browser 4. Almost all web plotting libraries are either: 1. Designed for javascript programmers 2. Designed to output static graphs 5. We designed Bokeh to be dynamic graphing in the web for Python programmers 6. Will include “Abstract” or “synthetic” rendering (working on Hadoop and Spark compatibility)
  • 57. Abstract Rendering: Pixels are Bins… and always have been. [Figure: items A, B, C, D binned into pixels with per-pixel counts; stages labeled Geometry, Z-View, Counts, Pixels]
  • 59. Abstract Rendering Basic AR can identify trouble spots in standard plots, and also offer automatic tone mapping, taking perception into account. 37 mil elements, showing adjacency between entities in Kiva dataset
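The "pixels are bins" idea can be sketched with a 2-D histogram: count the points falling in each pixel, then map counts to intensities. This is a minimal sketch and not Bokeh's abstract rendering implementation. The function names are illustrative, and log scaling is an assumed stand-in for the perceptual tone mapping the slides mention.

```python
import numpy as np


def bin_points(x, y, width, height, extent):
    """Count how many points fall in each pixel of a width-by-height grid."""
    xmin, xmax, ymin, ymax = extent
    counts, _, _ = np.histogram2d(
        y, x, bins=(height, width), range=((ymin, ymax), (xmin, xmax)))
    return counts  # shape (height, width): rows follow y, columns follow x


def tone_map(counts):
    """Compress counts to [0, 1] with a log scale (an assumed choice)."""
    scaled = np.log1p(counts)
    peak = scaled.max()
    return scaled / peak if peak > 0 else scaled


x = np.array([0.1, 0.1, 0.9])
y = np.array([0.1, 0.1, 0.9])
counts = bin_points(x, y, width=2, height=2, extent=(0.0, 1.0, 0.0, 1.0))
image = tone_map(counts)
print(counts)
```

Keeping the raw counts separate from the final colour step is what lets overplotted regions be detected: a pixel holding thousands of points looks identical to one holding a single point in an ordinary scatter plot, but not in the count array.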
  • 60. Wakari • Browser-based data analysis and visualization platform • Wordpress /YouTube / Github for data analysis • Full Linux environment with Anaconda Python • Can be installed on internal clusters & servers
  • 61. Why Wakari? • Data is too big to fit on your desktop • You need compute power but don’t have easy access to a large cluster (cloud is sitting there with lots of power) • Configuration of software on a new system stinks (especially a cluster). • Collaborative Data Analytics --- you want to build a complex technical workflow and then share it with others easily (without requiring they do painful configuration to see your results) • IPython Notebook is awesome --- let’s share it (but we also need the dependencies and data).
  • 62. Wakari • Free account has 512 MB RAM / 2 GB disk and shared multi-core CPU • Easily spin up map-reduce (Disco and Hadoop) clusters • Use IPython Parallel on many nodes in the cloud • Develop GUI apps (possibly in Anaconda) and publish them easily to Wakari, based on the full power of scientific Python for complex technical workflows (IPython notebook for now)
  • 64. Continuum Data Explorer (CDX) • Open Source • Goal is interactivity • Combination of IPython REPL, Bokeh, and tables • Tight integration between GUI elements and REPL • Current features - Namespace viewer (mapped to IPython namespace) - DataTable widget with group-by, computed columns, advanced- filters - Interactive Plots connected to tables
  • 65. CDX
  • 66. Conclusion Projects circle around giving tools to experts (occasional programmers or domain experts) to enable them to move their expertise to the data to get insights (keep data where it is and move high-level but performant code). Join us or ask how we can help you!