Python’s Role in the
Future of Data Analysis
Peter Wang
Continuum Analytics
pwang@continuum.io
@pwang
About Peter
• Co-founder & President at Continuum
• Author of several Python libraries & tools
• Scientific, financial, engineering HPC using
Python, C, C++, etc.
• Interactive Visualization of “Big Data”
• Organizer of Austin Python
• Background in Physics (BA Cornell ’99)
Continuum Analytics
Domains
Data Analysis · Visualisation · Data Processing · Scalable Computing · Scientific Computing · Enterprise Python
• Finance
• Defense, government data
• Advertising metrics & data analysis
• Engineering simulation
• Scientific computing
Technologies
• Array/Columnar data processing
• Distributed computing, HPC
• GPU and new vector hardware
• Machine learning, predictive analytics
• Interactive Visualization
Overview
• Deconstructing “big data” from a physics perspective
• Deconstructing “computer” from an EE perspective
• Deconstructing “programming language” from a human perspective
Massive Data - A Relativistic Approach
Big Data: Hype Cycle

So, “deconstructing” big data seems like an easy thing to do.
Everyone loves to hate on the term now, but everyone still uses it, because it’s evocative. It
means something to most people.
There’s a lot of hype around this stuff, but I am a “data true believer”.
Data Revolution
“Internet Revolution” True Believer, 1996:
Businesses that build network-oriented capability
into their core will fundamentally outcompete and
destroy their competition.
“Data Revolution” True Believer, 2013:
Businesses that build data comprehension into
their core will destroy their competition over the
next 5-10 years

And what I mean by that term is this.
If you think back to 1996, Internet True Believer:
- use network to connect to customer, supply chain, telemetry on market and competition
- business needs network like a fish needs water
Data true believer:
- Having seen the folks on the vanguard, and seeing what is starting to become possible by
people that have access to a LOT of data (finance; DoD; internet ad companies)
Big Data: Opportunities
• Storage disruption: plummeting HDD costs, cloud-based storage
• Computation disruption: Burst into clouds
• There is actually more data.
• Traditional BI tools fall short.
• Demonstrated, clear value in large datasets

There are some core technology trends that are enabling this revolution.
Many businesses *can* actually store everything by default. In fact many have to have
explicit data destruction policies to retire old data.
Being able to immediately turn on tens of thousands of cores to run big problems, and then
spin them down - that level of dynamic provisioning was simply not available before a few
years ago.
Our devices and our software are generating much more data.
Big Data: Mature/Aging Players
Age (years):

SAS            ~45        R          20
SPSS            45        S          37
Informatica     20        NumPy       8
SAP          23-40        Numeric    18
Cognos         ~30        Python     22

IBM PC: 32
C Programming Language: 41

And if we look at the existing “big players” in business intelligence, they are actually all quite
old. They are very mature, but they are getting hit with really new needs and fundamentally
different kinds of analytical workloads than they were designed for.
The Fundamental Physics
Moving/copying data (and managing copies)
is more expensive than computation.
True for various definitions of “expense”:

• Raw electrical & cooling power
• Time
• Human factors

So, these are all indicators and symptoms, but as a student of physics, I like to look for
underlying, simplifying, unifying concepts. And I think the core issue is, really, an inversion: the core challenge of “big data” is that moving data is
more costly than computing on data. It used to be that the computation on data was the
bottleneck. But now the I/O is actually the real bottleneck.
This cost shows up both at the underlying physical level, as raw hardware power, and at a higher, more human-facing level.
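To make the inversion concrete, here is a back-of-envelope sketch (my own illustration, not from the slides; the throughput figures are assumptions, roughly typical of commodity hardware) of how long it takes just to touch a terabyte through different channels:

```python
# Rough, assumed throughput numbers -- adjust for your own hardware.
TB = 1e12  # bytes

channels = {
    "HDD (sequential)": 150e6,     # ~150 MB/s
    "10 GbE link":      10e9 / 8,  # ~1.25 GB/s
    "RAM scan":         20e9,      # ~20 GB/s per socket
}

for name, bw in channels.items():
    print(f"{name:18s}: {TB / bw / 3600:5.2f} hours to move/scan 1 TB")
```

Even with generous assumptions, shipping the bytes dwarfs the cost of a single computational pass over them.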
Business Data Processing

If you look at a traditional view of data processing and enterprise data management, it’s
really many steps that move data from one stage to another, transforming it in a variety of
ways.
Business Data Processing

source: wikipedia.org

In the business data world, the processing shown in the previous slide happens in what is
commonly called a “data warehouse”, where they manage the security and provenance of
data, build catalogs of denormalized and rolled-up views, manage user access to “data
marts”, etc.
When you have large data, every single one of these arrows is a liability.
Scientific Data Processing

source: http://cnx.org/content/m32861/1.3/

In science, we do very similar things. We have workflows and dataflow programming
environments. We have this code-centric view, because code is the hard part, right? We pay
developers lots of money to write code and fix bugs and that’s the expensive part. Data is
just, whatever - we just stream all that through once the code is done.
But this inversion of “data movement” being expensive now means that this view is at odds
with the real costs of computing.
“Data Has Mass”

http://blog.mccrory.me/2010/12/07/data-gravity-in-the-clouds/
Uh...

http://datagravity.org/2012/06/26/a-formula-for-data-gravity/

Of course, you know he’s not really serious about this equation because he didn’t typeset it
in LaTeX.
Data-centric
Perspective

Workflow
Perspective

But there is something to this. Instead of trying to come up with a Theory of Universal
Data Gravitation, I’d just like to extend this concept of “massive data” with another metaphor.
So if we think about the workflow/dataflow perspective of data processing, it views each
piece of software as a station on a route, from raw source data to finished analytical product,
and the data is a train that moves from one station to the next.
But if data is massive, and moving that train gets harder and harder, then a relativistic
perspective would be to get on the train, and see things from the point of view of the data.
Data-centric Warehouse

source: Master Data Management and Data Governance, 2e

This is actually not *that* new of a perspective. In fact, the business analytics world already
has a lot of discipline around this. But usually in these contexts, the motivation or driver for
keeping the data in one place and building functional/transformation views on top, is for
data provenance or data privacy reasons, and it does not have to do with the tractability of
dealing with “Big Data”.
The largest data analysis gap is in this
man-machine interface. How can we put
the scientist back in control of his data?
How can we build analysis tools that are
intuitive and that augment the scientist’s
intellect rather than adding to the
intellectual burden with a forest of arcane
user tools? The real challenge is building
this smart notebook that unlocks the data
and makes it easy to capture, organize,
analyze, visualize, and publish.

-- Jim Gray et al, 2005

If we change gears a little bit... if you think about scientific computing - which is where many
of the tools in the PyData ecosystem come from - they don’t really use databases very much.
They leave the data in files on disk, and then they write a bunch of scripts that transform that
data or do operations on that data.
Jim Gray and others wrote a great paper 8 years ago that addressed - from a critical
perspective - this question of “Why don’t scientists use databases?” He was considering this
problem of computation and reproducibility of scientific results, when scientists are faced
with increasing data volumes.
Science centers: "...it is much more economical
to move the end-user’s programs to the data
and only communicate questions and answers
rather than moving the source data and its
applications to the user‘s local system."
Metadata enables access: "Preserving
and augmenting this metadata as part of
the processing (data lineage) will be a key
benefit of the next-generation tools."
"Metadata enables data independence": "The separation of data and
programs is artificial – one cannot see the data without using a
program and most programs are data driven. So, it is paradoxical that
the data management community has worked for 40 years to achieve
something called data independence – a clear separation of programs
from data."

He has this great phrase in the paper: “metadata will set you free”. I need a shirt with that on
it.
"Set-oriented data access gives parallelism": "The scientific file-formats of HDF, NetCDF, and FITS
can represent tabular data but they provide minimal tools for searching and analyzing tabular
data. Their main focus is getting the tables and sub-arrays into your Fortran/C/Java/Python
address space where you can manipulate the data using the programming language... This
Fortran/C/Java/Python file-at-a-time procedural data analysis is nearing the breaking point."

Actually, this entire paper is full of awesome. Basically, Gray & co-authors are just
completely spot-on about what is needed for scientific data processing. If you want to
understand why we’re building what we’re building at Continuum, this paper explains a lot of
the deep motivation and rationale.
Why Don’t Scientists Use DBs?
• Do not support scientific data types, or access patterns particular to a scientific problem
• Scientists can handle their existing data volumes using programming tools
• Once data was loaded, could not manipulate it with standard/familiar programs
• Poor visualization and plotting integration
• Require an expensive guru to maintain

So, there *are* data-centric computing systems, for both business and for science as well.
After all, that’s what a database is.
In the Gray paper, they identified a few key reasons why scientists don’t use databases.
Convergence
“If one takes the controversial view that HDF,
NetCDF, FITS, and Root are nascent database
systems that provide metadata and portability but
lack non-procedural query analysis, automatic
parallelism, and sophisticated indexing, then one
can see a fairly clear path that integrates these
communities.”
Convergence
"Semantic convergence: numbers to objects"
“While the commercial world has standardized on the relational
data model and SQL, no single standard or tool has critical
mass in the scientific community. There are many parallel and
competing efforts to build these tool suites – at least one per
discipline. Data interchange outside each group is problematic.
In the next decade, as data interchange among scientific
disciplines becomes increasingly important, a common HDF-like format and package for all the sciences will likely emerge."

One thing they kind of didn’t foresee, however, is that there is now a convergence between
the analytical needs of business, and the traditional domain of scientific HPC. For the kinds
of advanced data analytics businesses are now interested in, e.g. recommender systems,
clustering and graph analytics, machine learning... all of these are rooted in being able to do
big linear algebra and big statistical simulation.
So just as scientific computing is hitting database-like needs in their big data processing, the
business world is hitting scalable computation needs which have been scientific computing’s
bread and butter for decades.
Key Question

How do we move code to data, while
avoiding data silos?

But before we can answer this question, let’s think a little more deeply about what code and
data actually are.
Representation & Transformation:
A Continuum Model
What is a Computer?

計 算 機
Memory · Calculate · Machine

This is the Chinese term for “computer”. (Well, one of them.)
And this is really the essence of a computer, right? The memory is some state that it retains,
and we impart meaning to that state via representations. A computer is fundamentally about
transforming those states via well-defined semantics. It’s a machine, which means it does
those transformations with greater accuracy or fidelity than a human.
[Diagram: CPU connected to Disk, Memory, Net]
This is kind of the model of a PC workstation that we’ve had since the 1980s. There’s a CPU
which does the “calculation”, and then the RAM, disk, and network are the “memory”.
[Diagram: CPU connected to Disk, SAN, Memory, Net, and the Interwebs]
Move into the 1990s, and you get the internet and SAN also representing areas of storage.
[Diagram: CPU connected to Disk, SAN, Memory, Net, the Interwebs, and a GPU over PCIe]
Nowadays, you’ve got GPUs that can be 100x more powerful than the CPU for some
problems. And they have several gigabytes of storage on them.
[Diagram: CPU connected to Disk, SAN, Memory, Net, the Interwebs, GPUs over PCIe, and other CPUs over a NUMA fabric]
And maybe instead of 1 GPU, maybe there’s a whole bunch of them in the same chassis?
Or maybe this one system board is actually part of a NUMA fabric in a rack full of other CPUs
interconnected with a super low latency bus? Where is the storage and where is the compute?
Then, if you look inside the CPU itself, there are all kinds of caches and pipelines, carefully
coordinated.
[Diagram: schematic of the POWER5 processor]
This is a schematic of POWER5, which is nearly 10 years old now. Where is the memory, and
where is the calculation? Even deep in the bowels of a CPU there are different stages of
storage and transformation.
"Scripts"

HLLs: macros, DSLs, query
APIs

Apps

VMs

records,
objects, tables

App langs
OS "runtime"

files, dirs, pipes

Systems langs
OS Kernel

pages, blkdev

ISA, asm
Hardware

bits, bytes

Let’s try again, and take an architectural view.
We can look at the computer as layers of abstraction. The OS kernel and device drivers
abstract away the differences in hardware, and present unified programming models to
applications.
But each layer of execution abstraction also offers a particular kind of data representation.
These abstractions let programmers model more complex things than the boolean
relationship between 1s and 0s.
And the combination of execution and representation give rise to particular kinds of
programming languages.
Programming Language

• Provide coherent set of data representations and operations (i.e. easier to reason about)
• Typically closer to some desired problem domain to model
• Requires a runtime (underlying execution model)
• Is an illusion

But what exactly is a programming language? We have, at the bottom, hardware with specific
states it can be in. It’s actually all just APIs on top of that. But when APIs create new data
representations with coherent semantics, then it results in an explosion in the number of
possible states and state transitions of the system.
The entire point of a language is to give the illusion of a higher level of abstraction.
The promise made by a language is: “If you use these primitives and operations, then the
runtime will effect state transformation in a deterministic, well-defined way.” Usually
languages give you primitives that operate on bulk primitives of the lower-level runtime.
This helps you reach closer to domain problems that you’re actually trying to model.
But it is all still an illusion. If a compiler cannot generate valid low-level programs from
expressions at this higher level, then the illusion breaks down, and the user now has to
understand the low-level runtime to debug what went wrong. At the lowest level of
abstraction, even floating point numbers are abstractions that leak (subnormals, 56-bit vs
80-bit FPUs, etc).
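As a tiny illustration of that last point (my own example, not from the slides), even ordinary Python floats leak the abstraction:

```python
# Floating-point numbers behave like reals only until the abstraction leaks.
print(0.1 + 0.2 == 0.3)   # False: 0.1 and 0.2 have no exact binary representation
print(0.1 + 0.2)          # 0.30000000000000004

import sys
tiny = sys.float_info.min  # smallest *normal* double
print(tiny / 2)            # a subnormal: nonzero, but stored with reduced precision
```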
Correctness / Robustness

Curve of Human Finitude

Complexity

So either you limit the number of possible states and state transitions (i.e. what the
programmer can express), or you have to live with less robust programs. The falloff is
ultimately because of the limits of human cognition: both on the part of the programmers
using a language, and the compiler or interpreter developers of that language. We can only
fit so much complexity and model so much state transition in our heads.
The flat area is the stuff that is closest to the core, primitive operations of the language.
Those are usually very well tested and very likely to result in correct execution. The more
complexity you introduce via loops, conditionals, tapping into external state, etc., the
buggier your code is.
Encapsulation & Abstraction
[Correctness-vs-complexity curves: "Function libraries shift right." / "User-defined abstractions extend the slope."]

So to tackle harder problems, we have to deal with complexity, and this means shifting the
curve.
Simple libraries of functions shift the “easy correctness” up. But they don’t really change the
shape of the tail of the curve, because they do not intrinsically decrease the complexity of
hard programs. (Sometimes they increase it!)
A language that supports user-defined abstractions via OOP and metaprogramming extends
the slope of the tail, because those actually do manage complexity.
Static & Dynamic Types
[Correctness-vs-complexity curves: "Static typesystems with rich capability shift the curve up, but not by much." / "Dynamic types trade off low-end correctness for expressiveness."]

So I said before that a language consists of primitive representations and operations. Types
are a way of indicating that to the runtime. But we differentiate static vs. dynamic typing.
Of course, with things like template metaprogramming and generics bolted on to
traditionally statically-typed languages like C++ and Java, the proponents of static typing
might argue that they’ve got the best of both worlds.
Bad News
• Distributed computing
• GPU
• DSPs & FPGAs
• NUMA
• Tuning: SSD / HDD / FIO / 40gE

Heterogeneous hardware architectures, distributed computing, GPUs... runtime abstraction is
now very leaky. Just adding more libraries to handle this merely shifts the curve up, but
doesn’t increase the reach of our language.
Language Innovation = Diagonal Shift

You come up with not just new functions, and not just a few objects layered on top of the
existing syntax... but rather, you spend the hard engineering time to actually build a new layer
of coherent abstraction. That puts you on a new curve. This is why people make new
languages - to reach a different optimization curve of the expressivity/correctness trade-off.
Of course, this is hard to do well. There are just a handful of really successful languages in
use today, and they literally take decades to mature.
Domain-Specific Languages
[Correctness-vs-complexity curves; labels: Relational Algebra, Matrix algebra, File operations, Web apps, Network comm., "??"]

But keep in mind that “complexity” is dependent on problem domain. Building a new general
purpose programming language that is much more powerful than existing ones is hard work.
But if you just tackle one specific problem, you can generally pull yourself up into a nicer
complexity curve. But then your language has no projection into expressing other operations
someone might want to do.
Domain-Specific Compiler

Recall this picture of runtimes and languages. I think the runtime/language split and
compiler/library split is becoming more and more of a false dichotomy as runtimes shift:
OSes, distributed computing, GPU, multicore, etc. Configuration & tuning is becoming as
important as just execution. The default scheduler in the OS, the default memory allocator in
libc, etc. are all becoming harder to do right “in generality”.
If data is massive, and expensive to move, then we need to rethink the approach for how we
cut up the complexity between hardware and domain-facing code. The tiers of runtimes
should be driven by considerations of bandwidth and latency.
We think of Python as a "high level idea language" that can express concepts in the classical
programming language modes: imperative, functional, dataflow; and is "meta-programmable
enough" to make these not completely terrible.
As lines between hardware, OS, configuration, and software blur, we need to revisit the
classical hierarchies of complexity and capability.
So, extensible dynamic runtimes, transparent and instrumentable static runtimes. And fast
compilers to dynamically generate code.
It’s not just me saying this: Look at GPU shaders. Look at the evolution of Javascript runtime
optimization, which has settled on asm.js as an approach. Everyone is talking about
compilers now.
Blaze & Numba
• Shift the curve of an existing language
• Not just using types to extend user code
• Use dynamic compilation to also extend the runtime!
• Not a DSL: falls back to Python
• Both a representation and a compilation problem: use types to allow for dynamic compilation & scheduling

So this is really the conceptual reasoning for Blaze and Numba.
So rather than going from bottom up to compose static primitives in a runtime, the goal is to
do a double-ended optimization process: at the highest level, we have a statement of domain-related algorithmic intent, and at the low level, via Blaze datashapes, we have a rich
description of underlying data. Numba, and the Blaze execution engine, are then responsible
for meeting up in the middle and dynamically generating fast code.
Blaze Objectives
• Flexible descriptor for tabular and semi-structured data
• Seamless handling of:
  • On-disk / out of core
  • Streaming data
  • Distributed data
• Uniform treatment of:
  • “arrays of structures” and “structures of arrays”
  • missing values
  • “ragged” shapes
  • categorical types
  • computed columns
• Storage-agnostic
[Architecture diagram: a Blaze Client (Python REPL, scripts) talks to a synthesized array/table view backed by Array Servers over array:// and array+sql://, fronting a database, a GPU node, a Viz Data Server, files (file://), NFS, and kernels in C, C++, FORTRAN, and JVM languages]
Blaze Status
• DataShape type grammar
• NumPy-compatible C++ calculation engine (DyND)
• Synthesis of array function kernels (via LLVM)
• Fast timeseries routines (dynamic time warping for
pattern matching)

• Array Server prototype
• BLZ columnar storage format
• 0.3 released a couple of weeks ago
BLZ ETL Process
• Ingested in Blaze binary format for doing efficient queries:
  - Dataset 1: 13 hours / 70 MB RAM / 1 core in a single machine
  - Dataset 2: ~3 hours / 560 MB RAM / 8 cores in parallel
• The binary format is compressed by default and achieves different compression ratios depending on the dataset:

       CSV Size   CSV.gz Size   CR     BLZ Size   CR
DS 1   232 GB     70 GB         3.3x   136 GB     1.7x
DS 2   146 GB     69 GB         2.1x   93 GB      1.6x
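As a sanity check (my own trivial recomputation, using only the sizes reported on the slide), the compression ratios follow directly from the sizes:

```python
# Recompute the compression ratios (CR) from the sizes on the slide, in GB.
datasets = {"DS 1": (232, 70, 136), "DS 2": (146, 69, 93)}  # (CSV, CSV.gz, BLZ)

for name, (csv, gz, blz) in datasets.items():
    print(f"{name}: CSV.gz CR = {csv/gz:.1f}x, BLZ CR = {csv/blz:.1f}x")
# DS 1: CSV.gz CR = 3.3x, BLZ CR = 1.7x
# DS 2: CSV.gz CR = 2.1x, BLZ CR = 1.6x
```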
Querying BLZ
In [15]: from blaze import blz
In [16]: t = blz.open("TWITTER_LOG_Wed_Oct_31_22COLON22COLON28_EDT_2012-lvl9.blz")
In [17]: t['(latitude>7) & (latitude<10) & (longitude >-10 ) & (longitude < 10) '] # query
Out[17]:
array([ (263843037069848576L, u'Cossy set to release album:http://t.co/Nijbe9GgShared via
Nigeria News for Android. @', datetime.datetime(2012, 11, 1, 3, 20, 56), 'moses_peleg', u'kaduna',
9.453095, 8.0125194, ''),
...
dtype=[('tid', '<u8'), ('text', '<U140'), ('created_at', '<M8[us]'), ('userid', 'S16'), ('userloc', '<U64'),
('latitude', '<f8'), ('longitude', '<f8'), ('lang', 'S2')])
In [18]: t[1000:3000] # get a range of tweets
Out[18]:
array([ (263829044892692480L, u'boa noite? ;( \ue058\ue41d', datetime.datetime(2012, 11, 1, 2,
25, 20), 'maaribeiro_', u'', nan, nan, ''),
(263829044875915265L, u"Nah but I'm writing a gym journal... Watch it last 2 days!",
datetime.datetime(2012, 11, 1, 2, 25, 20), 'Ryan_Shizzle', u'Shizzlesville', nan, nan, ''),
...
Kiva: Array Server
DataShape + Raw JSON = Web Service

DataShape:
type KivaLoan = {
id: int64;
name: string;
description: {
languages: var, string(2);
texts: json # map<string(2), string>;
};
status: string; # LoanStatusType;
funded_amount: float64;
basket_amount: json; # Option(float64);
paid_amount: json; # Option(float64);
image: {
id: int64;
template_id: int64;
};
video: json;
activity: string;
sector: string;
use: string;
delinquent: bool;
location: {
country_code: string(2);
country: string;
town: json; # Option(string);
geo: {
level: string; # GeoLevelType
pairs: string; # latlong
type: string; # GeoTypeType
}
};
....

Raw JSON:
{"id":200533,"name":"Miawand Group","description":{"languages":
["en"],"texts":{"en":"Ozer is a member of the Miawand Group. He lives in the
16th district of Kabul, Afghanistan. He lives in a family of eight members. He
is single, but is a responsible boy who works hard and supports the whole
family. He is a carpenter and is busy working in his shop seven days a week.
He needs the loan to purchase wood and needed carpentry tools such as tape
measures, rulers and so on.\r\n \r\nHe hopes to make progress through the
loan and he is confident that will make his repayments on time and will join
for another loan cycle as well. \r\n\r\n"}},"status":"paid","funded_amount":
925,"basket_amount":null,"paid_amount":925,"image":{"id":
539726,"template_id":
1},"video":null,"activity":"Carpentry","sector":"Construction","use":"He wants
to buy tools for his carpentry shop","delinquent":null,"location":
{"country_code":"AF","country":"Afghanistan","town":"Kabul
Afghanistan","geo":{"level":"country","pairs":"33
65","type":"point"}},"partner_id":
34,"posted_date":"2010-05-13T20:30:03Z","planned_expiration_date":null,"loa
n_amount":925,"currency_exchange_loss_amount":null,"borrowers":
[{"first_name":"Ozer","last_name":"","gender":"M","pictured":true},
{"first_name":"Rohaniy","last_name":"","gender":"M","pictured":true},
{"first_name":"Samem","last_name":"","gender":"M","pictured":true}],"terms":
{"disbursal_date":"2010-05-13T07:00:00Z","disbursal_currency":"AFN","disbur
sal_amount":42000,"loan_amount":925,"local_payments":
[{"due_date":"2010-06-13T07:00:00Z","amount":4200},
{"due_date":"2010-07-13T07:00:00Z","amount":4200},
{"due_date":"2010-08-13T07:00:00Z","amount":4200},
{"due_date":"2010-09-13T07:00:00Z","amount":4200},
{"due_date":"2010-10-13T07:00:00Z","amount":4200},
{"due_date":"2010-11-13T08:00:00Z","amount":4200},
{"due_date":"2010-12-13T08:00:00Z","amount":4200},
{"due_date":"2011-01-13T08:00:00Z","amount":4200},
{"due_date":"2011-02-13T08:00:00Z","amount":4200},
{"due_date":"2011-03-13T08:00:00Z","amount":
4200}],"scheduled_payments": ...

2.9 GB of JSON => network-queryable array: ~5 minutes
Kiva Array Server Demo
Remote Arrays
Remote Computed Fields
Numba
[Diagram: Python, C, C++, and Fortran feed into LLVM IR, which targets x86, ARM, and PTX]
Numba turns Python into a “compiled language”
Example
Numba: LLVM-based architecture
Image Processing: ~1500x speed-up

from numba import jit

@jit('void(f8[:,:],f8[:,:],f8[:,:])')
def filter(image, filt, output):
    M, N = image.shape
    m, n = filt.shape
    for i in range(m//2, M-m//2):
        for j in range(n//2, N-n//2):
            result = 0.0
            for k in range(m):
                for l in range(n):
                    result += image[i+k-m//2, j+l-n//2] * filt[k, l]
            output[i, j] = result
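For context, here is a hedged sketch of how the jitted kernel above might be driven; the image and kernel arrays are made up for illustration, and only the `filter` function itself comes from the slide:

```python
import numpy as np

# Hypothetical inputs: a random float64 "image" and a 5x5 averaging kernel.
image = np.random.rand(512, 512)
filt = np.full((5, 5), 1.0 / 25.0)
output = np.zeros_like(image)

filter(image, filt, output)  # first call compiles; subsequent calls run the native kernel
```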
Glue 2.0
• Python’s legacy as a powerful glue language
• manipulate files (instead of shell scripts)
• call fast libraries (instead of using Matlab)
• Next-gen Glue:
• Link data silos
• Link disjoint memory & compute
• Unify disparate runtime models
• Transcend legacy models of computers

Instead of gluing disparate things together via a common API or ABI, it's about giving an end
user the capability to treat things as a fluid, continuous whole. And I want to reiterate:
Numba and Blaze are not just about speed. It’s about moving domain expertise to data.
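To ground the “legacy glue” bullets above, here is a minimal sketch of Python's classic glue role; the "data" directory and CSV files are hypothetical:

```python
# Classic "glue 1.0": drive files and fast native libraries from one script.
from pathlib import Path
import numpy as np

for csv in Path("data").glob("*.csv"):     # instead of a shell script...
    arr = np.loadtxt(csv, delimiter=",")   # ...call a fast compiled library
    print(csv.name, arr.mean(axis=0))      # per-column means of each file
```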
Blurred Lines
• Compile time, run time, JIT, asm.js
• Imperative code vs. configuration
• App, OS, lightweight virtualization, hardware, virtual hardware
• Dev, dev ops, ops
• Clouds: IaaS, PaaS, SaaS, DBaaS, AaaS...
We have entered the post-PC era.

This gluing also extends beyond just the application or code layer.
- So much tech innovation happening right now
- A lot of churn but some real gems as well
- Not just software, but hardware, human roles, and business models
- Much of this can be really confusing to track and follow, but it all results from the fact that
we are entering a post-PC era
- Single unified, “random access memory”; single serial stream of instructions
Instead of figuring out how to glue things together, we think that using a high-level language
like Python helps people transcend to the level of recognizing that *There is no spoon*.
There is no computer - OSes are a lie. VMs and runtimes are a lie. Compilers are a lie.
There are just bits, and useful lies on top of the bits. Thus far, we've been able to get away
with these because we can build coherent lies. But as the underlying reality gets more
complex, the cost of abstraction is too high - or the abstractions will necessarily need to be
very leaky.
There’s an old joke that computers are bad because they do exactly what we tell them to do.
Computers would be better if they had a “do what I want” command, right? Well, with the
challenge of scalable computing over big data, figuring out “what I want”, at a low level, is
itself a challenge. We instead need the “do whatever *you* want” command.
Bokeh
• Language-based (instead of GUI) visualization system
  • High-level expressions of data binding, statistical transforms, interactivity and linked data
  • Easy to learn, but expressive depth for power users
• Interactive
  • Data space configuration as well as data selection
  • Specified from high-level language constructs
• Web as first-class interface target
• Support for large datasets via intelligent downsampling (“abstract rendering”)

Switch gears a bit and talk about Bokeh
Bokeh
Inspirations:
• Chaco: interactive, viz pipeline for large data
• Protovis & Stencil: binding visual Glyphs to data and expressions
• ggplot2: faceting, statistical overlays
Design goal:
Accessible, extensible, interactive plotting for the web...
... for non-Javascript programmers

It’s not exclusively for the web, though - we can target rich client UIs, and I’m excited about
the vispy work.
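For a flavor of what “language-based visualization” means in practice, here is a minimal sketch using the present-day bokeh.plotting interface; the API has changed considerably since the 2013-era examples in this talk, so treat it as illustrative rather than as the API being demoed here:

```python
import numpy as np
from bokeh.plotting import figure, output_file, show

x = np.linspace(0, 4 * np.pi, 200)

p = figure(title="sin(x)", width=600, height=300)  # the plot is described in Python...
p.line(x, np.sin(x), line_width=2)

output_file("sine.html")  # ...and rendered by BokehJS in the browser
show(p)
```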
Bokeh & BokehJS Demos
• BokehJS demos
• Audio spectrogram
• Bokeh Examples
  - Low-level Python interface
  - IPython Notebook integration
  - ggplot example
Continuum Data Explorer (CDX)

Bokeh is a library and a tool, but it’s also a component that can be used as part of a larger
application.
Data Summary Explorer
Conclusion

Despite the temptation to ignore or dismiss the hype machine, the actual data revolution is
happening. But you cannot understand this revolution by focusing on technology *alone*.
The technology has to be considered in light of the human factors. You're not going to see
the shape of this revolution just by following the traditional industry blogs and trade journals
and web sites.
The human factors are: what do people really want to do with their data? The people who are
getting the most value from their data - what are their backgrounds, and what kinds of
companies do they work for, or are they building? How are those companies becoming data-driven?
In the business world, the flood of data has triggered a rapid evolution - a Cambrian
explosion, if you will. It's like the sun just came out, and all these businesses are struggling
to evolve retinae and eyeballs, and avoid getting eaten by other businesses that grew eyeballs
first.
What about scientific computing, then? Scientists have been working out in the daylight for a
long time now, and their decades-long obsession with performance and efficiency is
suddenly relevant for the rest of the world.. I think, in a way that they had not imagined.
I think in this metaphor, Python can be seen as the visual cortex. It connects the raw data-ingest machinery of the eyes to the actual "smarts" of the rest of the brain.
And Python itself will need to evolve. It will certainly have to play well with a lot of legacy
systems, and integrate with foreign technology stacks. The reason we're so excited about
LLVM-based interop, and memory-efficient compatibility with things like the JVM, is because
these things give us a chance to at least be on the same carrier wave as other parts of the
brain.
But Python - or whatever Python evolves into - definitely has a central role to play in the data-enabled future, because of the human factors at the heart of the data revolution, and that
have also guided the development of the language so far.
So it’s a really simple syllogism. Given that:
- Analysis of large, complex datasets is about Data Exploration, an iterative process of
structuring, slicing, querying data to surface insights
- Really insightful hypotheses have to originate in the mind of a domain expert; they cannot
be outsourced, and an air gap between two brains leads to a massive loss of context
Therefore: Domain experts need to be empowered to directly manipulate, transform, and see
their massive datasets. They need a way to accurately express these operations to the
computer system, not merely select them from a fixed menu of options: exploration of a
conceptual space requires expressiveness.
Python was designed to be an easy-to-learn language. It has gained mindshare because it
fits in people's brains. It's a tool that empowers them, and bridges their minds with the
computer, so the computer is an extension of their exploratory capability. For data analysis,
this is absolutely its key feature. As a community, I think that if we keep sight of this, we will
ensure that Python has a long and healthy future to become as fundamental as mathematics
for the future of analytics.

 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 

Python's Role in the Future of Data Analysis

  • 1. Python’s Role in the Future of Data Analysis Peter Wang Continuum Analytics pwang@continuum.io @pwang
  • 2. About Peter • Co-founder & President at Continuum • Author of several Python libraries & tools • Scientific, financial, engineering HPC using Python, C, C++, etc. • Interactive Visualization of “Big Data” • Organizer of Austin Python • Background in Physics (BA Cornell ’99)
  • 3. Continuum Analytics Domains Data Analysis Visualisation Data Processing Scalable Computing Scientific Computing Enterprise Python • Finance • Defense, government data • Advertising metrics & data analysis • Engineering simulation • Scientific computing Technologies • Array/Columnar data processing • Distributed computing, HPC • GPU and new vector hardware • Machine learning, predictive analytics • Interactive Visualization
  • 4. Overview • Deconstructing “big data” from a physics perspective • Deconstructing “computer” from a EE perspective • Deconstructing “programming language” from a human perspective
  • 5. Massive Data - A Relativistic Approach
  • 6. Big Data: Hype Cycle So, “deconstructing” big data seems like an easy thing to do. Everyone loves to hate on the term now, but everyone still uses it, because it’s evocative. It means something to most people. There’s a lot of hype around this stuff, but I am a “data true believer”.
  • 7. Data Revolution “Internet Revolution” True Believer, 1996: Businesses that build network-oriented capability into their core will fundamentally outcompete and destroy their competition. “Data Revolution” True Believer, 2013: Businesses that build data comprehension into their core will destroy their competition over the next 5-10 years And what I mean by that term is this. If you think back to 1996, Internet True Believer: - use network to connect to customer, supply chain, telemetry on market and competition - business needs network like a fish needs water Data true believer: - Having seen the folks on the vanguard, and seeing what is starting to become possible by people that have access to a LOT of data (finance; DoD; internet ad companies)
  • 8. Big Data: Opportunities • Storage disruption: plummeting HDD costs, cloud-based storage • • • • Computation disruption: Burst into clouds There is actually more data. Traditional BI tools fall short. Demonstrated, clear value in large datasets There are some core technology trends that are enabling this revolution. Many businesses *can* actually store everything by default. In fact many have to have explicit data destruction policies to retire old data. Being able to immediately turn on tens of thousands of cores to run big problems, and then spin them down - that level of dynamic provisioning was simply not available before a few years ago. Our devices and our software are generating much more data.
  • 9. Big Data: Mature/Aging Players SAS ~45 R 20 SPSS 45 S 37 Informatica 20 NumPy 8 SAP 23-40 Numeric 18 Cognos ~30 Python 22 IBM PC: 32 C Programming Language: 41 And if we look at the existing “big players” in business intelligence, they are actually all quite old. They are very mature, but they are getting hit with really new needs and fundamentally different kinds of analytical workloads than they were designed for.
  • 10. The Fundamental Physics Moving/copying data (and managing copies) is more expensive than computation. True for various definitions of “expense”: • • • Raw electrical & cooling power Time Human factors So, these are all indicators and symptoms, but as a student of physics, I like to look for underlying, simplifying, unifying concepts. And what I think the core issue is about is the fact that, really, there is an inversion: The core challenge of "big data" is that moving data is more costly than computing on data. It used to be that the computation on data was the bottleneck. But now the I/O is actually the real bottleneck. This cost is both from an underlying physical, hardware power cost, as well as a higher level, more human-facing.
  • 11. Business Data Processing If you look at a traditional view of data processing and enterprise data management, it’s really many steps that move data from one stage to another, transforming it in a variety of ways.
  • 12. Business Data Processing source: wikipedia.org In the business data world, the processing shown in the previous slide happens in what is commonly called a “data warehouse”, where they manage the security and provenance of data, build catalogs of denormalized and rolled-up views, manage user access to “data marts”, etc. When you have large data, every single one of these arrows is a liability.
  • 13. Scientific Data Processing source: http://cnx.org/content/m32861/1.3/ In science, we do very similar things. We have workflows and dataflow programming environments. We have this code-centric view, because code is the hard part, right? We pay developers lots of money to write code and fix bugs and that’s the expensive part. Data is just, whatever - we just stream all that through once the code is done. But this inversion of “data movement” being expensive now means that this view is at odds with the real costs of computing.
  • 15. Uh... http://datagravity.org/2012/06/26/a-formula-for-data-gravity/ Of course, you know he’s not really serious about this equation because he didn’t typeset it in LaTeX.
  • 16. Data-centric Perspective vs. Workflow Perspective But there is something to this. But instead of trying to come up with a Theory of Universal Data Gravitation, I’d just like to extend this concept of “massive data” with another metaphor. So if we think about the workflow/dataflow perspective of data processing, it views each piece of software as a station on a route, from raw source data to finished analytical product, and the data is a train that moves from one station to the next. But if data is massive, and moving that train gets harder and harder, then a relativistic perspective would be to get on the train, and see things from the point of view of the data.
  • 17. Data-centric Warehouse source: Master Data Management and Data Governance, 2e This is actually not *that* new of a perspective. In fact, the business analytics world already has a lot of discipline around this. But usually in these contexts, the motivation or driver for keeping the data in one place and building functional/transformation views on top, is for data provenance or data privacy reasons, and it does not have to do with the tractability of dealing with “Big Data”.
  • 18. The largest data analysis gap is in this man-machine interface. How can we put the scientist back in control of his data? How can we build analysis tools that are intuitive and that augment the scientist’s intellect rather than adding to the intellectual burden with a forest of arcane user tools? The real challenge is building this smart notebook that unlocks the data and makes it easy to capture, organize, analyze, visualize, and publish. -- Jim Gray et al, 2005 If we change gears a little bit... if you think about scientific computing - which is where many of the tools in the PyData ecosystem come from - they don’t really use databases very much. They leave the data in files on disk, and then they write a bunch of scripts that transform that data or do operations on that data. Jim Gray and others wrote a great paper 8 years ago that addressed - from a critical perspective - this question of “Why don’t scientists use databases?” He was considering this problem of computation and reproducibility of scientific results, when scientists are faced with increasing data volumes.
  • 19. Science centers: "...it is much more economical to move the end-user’s programs to the data and only communicate questions and answers rather than moving the source data and its applications to the user‘s local system." Metadata enables access: "Preserving and augmenting this metadata as part of the processing (data lineage) will be a key benefit of the next-generation tools." "Metadata enables data independence": "The separation of data and programs is artificial – one cannot see the data without using a program and most programs are data driven. So, it is paradoxical that the data management community has worked for 40 years to achieve something called data independence – a clear separation of programs from data." He has this great phrase in the paper: “metadata will set you free”. I need a shirt with that on it.
  • 20. Science centers: "...it is much more economical to move the end-user’s programs to the data and only communicate questions and answers rather than moving the source data and its applications to the user’s local system." "Set-oriented data access gives parallelism": "The scientific file-formats of HDF, NetCDF, and FITS can represent tabular data but they provide minimal tools for searching and analyzing tabular data. Their main focus is getting the tables and sub-arrays into your Fortran/C/Java/Python address space where you can manipulate the data using the programming language... This Fortran/C/Java/Python file-at-a-time procedural data analysis is nearing the breaking point." Metadata enables access: "Preserving and augmenting this metadata as part of the processing (data lineage) will be a key benefit of the next-generation tools." "Metadata enables data independence": "The separation of data and programs is artificial – one cannot see the data without using a program and most programs are data driven. So, it is paradoxical that the data management community has worked for 40 years to achieve something called data independence – a clear separation of programs from data."
  • 21. (Same excerpts as the previous slide.) Actually, this entire paper is full of awesome. Basically, Gray & co-authors are just completely spot-on about what is needed for scientific data processing. If you want to understand why we’re building what we’re building at Continuum, this paper explains a lot of the deep motivation and rationale.
  • 22. Why Don’t Scientists Use DBs? • Do not support scientific data types, or access patterns particular to a scientific problem • Scientists can handle their existing data volumes using programming tools • Once data was loaded, could not manipulate it with standard/familiar programs • Poor visualization and plotting integration • Require an expensive guru to maintain So, there *are* data-centric computing systems, for both business and for science as well. After all, that’s what a database is. In the Gray paper, they identified a few key reasons why scientists don’t use databases.
  • 23. Convergence “If one takes the controversial view that HDF, NetCDF, FITS, and Root are nascent database systems that provide metadata and portability but lack non-procedural query analysis, automatic parallelism, and sophisticated indexing, then one can see a fairly clear path that integrates these communities.”
  • 24. Convergence "Semantic convergence: numbers to objects" “While the commercial world has standardized on the relational data model and SQL, no single standard or tool has critical mass in the scientific community. There are many parallel and competing efforts to build these tool suites – at least one per discipline. Data interchange outside each group is problematic. In the next decade, as data interchange among scientific disciplines becomes increasingly important, a common HDF-like format and package for all the sciences will likely emerge." One thing they kind of didn’t foresee, however, is that there is now a convergence between the analytical needs of business, and the traditional domain of scientific HPC. For the kinds of advanced data analytics businesses are now interested in, e.g. recommender systems, clustering and graph analytics, machine learning... all of these are rooted in being able to do big linear algebra and big statistical simulation. So just as scientific computing is hitting database-like needs in their big data processing, the business world is hitting scalable computation needs which have been scientific computing’s bread and butter for decades.
  • 25. Key Question How do we move code to data, while avoiding data silos? But before we can answer this question, let’s think a little more deeply about what code and data actually are.
  • 27. What is a Computer? 計 算 機 Memory Calculate Machine This is the Chinese term for “computer”. (Well, one of them.) And this is really the essence of a computer, right? The memory is some state that it retains, and we impart meaning to that state via representations. A computer is fundamentally about transforming those states via well-defined semantics. It’s a machine, which means it does those transformations with greater accuracy or fidelity than a human.
  • 28. Disk CPU Memory Net This is kind of the model of a PC workstation that we’ve had since the 1980s. There’s a CPU which does the “calculation”, and then the RAM, disk, and network are the “memory”.
  • 29. Disk SAN CPU Interwebs Memory Net Move into the 1990s, and you get the internet and SAN also representing areas of storage.
  • 30. PCIe Disk SAN CPU Interwebs Memory Net Nowadays, you’ve got GPUs that can be 100x more powerful than the CPU for some problems. And they have several gigabytes of storage on them.
  • 31. PCIe Disk SAN CPU Memory Net NUMA Interwebs And maybe instead of 1 GPU, maybe there’s a whole bunch of them in the same chassis? Or maybe this one system board is actually part of a NUMA fabric in a rack full of other CPUs interconnected with a super low latency bus? Where is the storage and where is the compute? Then, if you look inside the CPU itself, there are all kinds of caches and pipelines, carefully coordinated.
  • 32. PCIe Disk SAN CPU Memory Net NUMA Interwebs This is a schematic of POWER5, which is nearly 10 years old now. Where is the memory, and where is the calculation? Even deep in the bowels of a CPU there are different stages of storage and transformation.
  • 33. "Scripts" HLLs: macros, DSLs, query APIs Apps VMs records, objects, tables App langs OS "runtime" files, dirs, pipes Systems langs OS Kernel pages, blkdev ISA, asm Hardware bits, bytes Let’s try again, and take an architectural view. We can look at the computer as layers of abstraction. The OS kernel and device drivers abstract away the differences in hardware, and present unified programming models to applications. But each layer of execution abstraction also offers a particular kind of data representation. These abstractions let programmers model more complex things than the boolean relationship between 1s and 0s. And the combination of execution and representation give rise to particular kinds of programming languages.
  • 34. Programming Language • Provide coherent set of data representations and operations (i.e. easier to reason about) • Typically closer to some desired problem domain to model • Requires a runtime (underlying execution model) • Is an illusion But what exactly is a programming language? We have, at the bottom, hardware with specific states it can be in. It’s actually all just APIs on top of that. But when APIs create new data representations with coherent semantics, then it results in an explosion in the number of possible states and state transitions of the system. The entire point of a language is to give the illusion of a higher level of abstraction. The promise made by a language is: “If you use these primitives and operations, then the runtime will effect state transformation in a deterministic, well-defined way.” Usually languages give you primitives that operate on bulk primitives of the lower-level runtime. This helps you reach closer to domain problems that you’re actually trying to model. But it is all still an illusion. If a compiler cannot generate valid low-level programs from expressions at this higher level, then the illusion breaks down, and the user now has to understand the low-level runtime to debug what went wrong. At the lowest level of abstraction, even floating point numbers are abstractions that leak (subnormals, 56-bit vs 80-bit FPUs, etc).
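That last point about floating point is easy to see from Python itself. A minimal sketch (mine, not from the talk) of the float abstraction leaking:

    # Floating point as a leaky abstraction over IEEE-754 hardware behavior
    print(0.1 + 0.2 == 0.3)    # False: binary rounding shows through the decimal notation
    print(5e-324)              # smallest positive subnormal double
    print(5e-324 / 2 == 0.0)   # True: gradual underflow finally rounds to zero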
  • 35. Curve of Human Finitude (a correctness/robustness vs. complexity plot). So either you limit the number of possible states and state transitions (i.e. what the programmer can express), or you have to live with less robust programs. The falloff is ultimately because of the limits of human cognition: both on the part of the programmers using a language, and the compiler or interpreter developers of that language. We can only fit so much complexity and model so much state transition in our heads. The flat area is the stuff that is closest to the core, primitive operations of the language. Those are usually very well tested and very likely to result in correct execution. The more complexity you introduce via loops, conditionals, tapping into external state, etc., the buggier your code is.
  • 36. Encapsulation & Abstraction (two correctness vs. complexity plots: “Function libraries shift right.” and “User-defined abstractions extend the slope.”) So to tackle harder problems, we have to deal with complexity, and this means shifting the curve. Simple libraries of functions shift the “easy correctness” up. But they don’t really change the shape of the tail of the curve, because they do not intrinsically decrease the complexity of hard programs. (Sometimes they increase it!) A language that supports user-defined abstractions via OOP and metaprogramming extends the slope of the tail because those actually do manage complexity.
  • 37. Static & Dynamic Types Correctness Static typesystems with rich capability shift the curve up, but not by much. Correctness Complexity Dynamic types trade off low-end correctness for expressiveness. Complexity So I said before that a language consists of primitive representations and operations. Types are a way of indicating that to the runtime. But we differentiate static vs. dynamic typing. Of course, with things like template metaprogramming and generics added bolted on to traditionally statically-typed languages like C++ and Java, the proponents of static typing might argue that they’ve got the best of both world.
  • 38. Bad News (on the correctness vs. complexity plot): • Distributed computing • GPU • DSPs & FPGAs • NUMA • Tuning: SSD / HDD / FIO / 40gE Heterogeneous hardware architectures, distributed computing, GPUs... runtime abstraction is now very leaky. Just adding more libraries to handle this merely shifts the curve up, but doesn’t increase the reach of our language.
  • 39. Language Innovation = Diagonal Shift (on the correctness vs. complexity plot). You come up with not just new functions, and not just a few objects layered on top of the existing syntax... but rather, you spend the hard engineering time to actually build a new layer of coherent abstraction. That puts you on a new curve. This is why people make new languages - to reach a different optimization curve of the expressivity/correctness trade-off. Of course, this is hard to do well. There are just a handful of really successful languages in use today, and they literally take decades to mature.
  • 40. Domain-Specific Languages (correctness vs. complexity plots for domains such as relational algebra, file operations, web apps, matrix algebra, and network comm.). But keep in mind that “complexity” is dependent on problem domain. Building a new general purpose programming language that is much more powerful than existing ones is hard work. But if you just tackle one specific problem, you can generally pull yourself up into a nicer complexity curve. But then your language has no projection into expressing other operations someone might want to do.
  • 41. Domain-Specific Compiler Recall this picture of runtimes and languages. I think the runtime/language split and compiler/library split is becoming more and more of a false dichotomy as runtimes shift: OSes, distributed computing, GPU, multicore, etc. Configuration & tuning is becoming as important as just execution. The default scheduler in the OS, the default memory allocator in libc, etc. are all becoming harder to do right “in generality”. If data is massive, and expensive to move, then we need to rethink the approach for how we cut up the complexity between hardware and domain-facing code. The tiers of runtimes should be driven by considerations of bandwidth and latency. We think of Python as a "high level idea language" that can express concepts in the classical programming language modes: imperative, functional, dataflow; and is "meta-programmable enough" to make these not completely terrible. As lines between hardware, OS, configuration, and software blur, we need to revisit the classical hierarchies of complexity and capability. So, extensible dynamic runtimes, transparent and instrumentable static runtimes. And fast compilers to dynamically generate code. It’s not just me saying this: Look at GPU shaders. Look at the evolution of Javascript runtime optimization, which has settled on asm.js as an approach. Everyone is talking about compilers now.
  • 42. Blaze & Numba • Shift the curve of an existing language • • • • Not just using types to extend user code Use dynamic compilation to also extend the runtime! Not a DSL: falls back to Python Both a representation and a compilation problem: use types to allow for dynamic compilation & scheduling So this is really the conceptual reasoning for Blaze and Numba. So rather than going from bottom up to compose static primitives in a runtime, the goal is to do a double-ended optimization process: at the highest level, we have statement of domainrelated algorithmic intent, and at the low level, via Blaze datashapes, we have a rich description of underlying data. Numba, and the Blaze execution engine, are then responsible for meeting up in the middle and dynamically generating fast code.
  • 43. Blaze Objectives • Flexible descriptor for tabular and semi-structured data • Seamless handling of: • On-disk / Out of core • Streaming data • Distributed data • Uniform treatment of: • “arrays of structures” and “structures of arrays” • missing values • “ragged” shapes • categorical types • computed columns
  • 44. Storage-agnostic (architecture diagram): Blaze clients (a Python REPL/scripts, C, C++, FORTRAN, and JVM languages) talk to Array Servers over array://, array+sql://, and file:// endpoints; the Array Servers front a database, a GPU node, NFS, and a viz data server, and present a synthesized array/table view.
  • 45. Blaze Status • DataShape type grammar • NumPy-compatible C++ calculation engine (DyND) • Synthesis of array function kernels (via LLVM) • Fast timeseries routines (dynamic time warping for pattern matching; see the sketch below) • Array Server prototype • BLZ columnar storage format • 0.3 released a couple of weeks ago
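As a rough illustration of the timeseries bullet above, the quantity that dynamic time warping computes can be sketched in a few lines of NumPy. This is a naive O(n*m) version for clarity, not Blaze's optimized routine:

    import numpy as np

    def dtw_distance(a, b):
        # cost[i, j] = best alignment cost of a[:i] against b[:j]
        n, m = len(a), len(b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = abs(a[i - 1] - b[j - 1])
                cost[i, j] = d + min(cost[i - 1, j],      # skip a point in a
                                     cost[i, j - 1],      # skip a point in b
                                     cost[i - 1, j - 1])  # align the two points
        return cost[n, m]

    print(dtw_distance(np.array([0.0, 1.0, 2.0, 1.0]),
                       np.array([0.0, 1.0, 1.0, 2.0, 1.0])))   # small distance: similar shapes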
  • 46. BLZ ETL Process • Ingested in Blaze binary format for doing efficient queries: - Dataset 1: 13 hours / 70 MB RAM / 1 core on a single machine - Dataset 2: ~3 hours / 560 MB RAM / 8 cores in parallel • The binary format is compressed by default and achieves different compression ratios (CR, relative to the raw CSV) depending on the dataset: DS 1: CSV 232 GB, CSV.gz 70 GB (CR 3.3x), BLZ 136 GB (CR 1.7x); DS 2: CSV 146 GB, CSV.gz 69 GB (CR 2.1x), BLZ 93 GB (CR 1.6x)
  • 47. Querying BLZ
    In [15]: from blaze import blz
    In [16]: t = blz.open("TWITTER_LOG_Wed_Oct_31_22COLON22COLON28_EDT_2012-lvl9.blz")
    In [17]: t['(latitude>7) & (latitude<10) & (longitude >-10 ) & (longitude < 10) ']  # query
    Out[17]: array([ (263843037069848576L, u'Cossy set to release album:http://t.co/Nijbe9GgShared via Nigeria News for Android. @', datetime.datetime(2012, 11, 1, 3, 20, 56), 'moses_peleg', u'kaduna', 9.453095, 8.0125194, ''), ... dtype=[('tid', '<u8'), ('text', '<U140'), ('created_at', '<M8[us]'), ('userid', 'S16'), ('userloc', '<U64'), ('latitude', '<f8'), ('longitude', '<f8'), ('lang', 'S2')])
    In [18]: t[1000:3000]  # get a range of tweets
    Out[18]: array([ (263829044892692480L, u'boa noite? ;( ue058ue41d', datetime.datetime(2012, 11, 1, 2, 25, 20), 'maaribeiro_', u'', nan, nan, ''), (263829044875915265L, u"Nah but I'm writing a gym journal... Watch it last 2 days!", datetime.datetime(2012, 11, 1, 2, 25, 20), 'Ryan_Shizzle', u'Shizzlesville', nan, nan, ''), ...
  • 48. Kiva: Array Server DataShape + type KivaLoan = { id: int64; name: string; description: { languages: var, string(2); texts: json # map<string(2), string>; }; status: string; # LoanStatusType; funded_amount: float64; basket_amount: json; # Option(float64); paid_amount: json; # Option(float64); image: { id: int64; template_id: int64; }; video: json; activity: string; sector: string; use: string; delinquent: bool; location: { country_code: string(2); country: string; town: json; # Option(string); geo: { level: string; # GeoLevelType pairs: string; # latlong type: string; # GeoTypeType } }; .... Raw JSON = Web Service {"id":200533,"name":"Miawand Group","description":{"languages": ["en"],"texts":{"en":"Ozer is a member of the Miawand Group. He lives in the 16th district of Kabul, Afghanistan. He lives in a family of eight members. He is single, but is a responsible boy who works hard and supports the whole family. He is a carpenter and is busy working in his shop seven days a week. He needs the loan to purchase wood and needed carpentry tools such as tape measures, rulers and so on.rn rnHe hopes to make progress through the loan and he is confident that will make his repayments on time and will join for another loan cycle as well. rnrn"}},"status":"paid","funded_amount": 925,"basket_amount":null,"paid_amount":925,"image":{"id": 539726,"template_id": 1},"video":null,"activity":"Carpentry","sector":"Construction","use":"He wants to buy tools for his carpentry shop","delinquent":null,"location": {"country_code":"AF","country":"Afghanistan","town":"Kabul Afghanistan","geo":{"level":"country","pairs":"33 65","type":"point"}},"partner_id": 34,"posted_date":"2010-05-13T20:30:03Z","planned_expiration_date":null,"loa n_amount":925,"currency_exchange_loss_amount":null,"borrowers": [{"first_name":"Ozer","last_name":"","gender":"M","pictured":true}, {"first_name":"Rohaniy","last_name":"","gender":"M","pictured":true}, {"first_name":"Samem","last_name":"","gender":"M","pictured":true}],"terms": {"disbursal_date":"2010-05-13T07:00:00Z","disbursal_currency":"AFN","disbur sal_amount":42000,"loan_amount":925,"local_payments": [{"due_date":"2010-06-13T07:00:00Z","amount":4200}, {"due_date":"2010-07-13T07:00:00Z","amount":4200}, {"due_date":"2010-08-13T07:00:00Z","amount":4200}, {"due_date":"2010-09-13T07:00:00Z","amount":4200}, {"due_date":"2010-10-13T07:00:00Z","amount":4200}, {"due_date":"2010-11-13T08:00:00Z","amount":4200}, {"due_date":"2010-12-13T08:00:00Z","amount":4200}, {"due_date":"2011-01-13T08:00:00Z","amount":4200}, {"due_date":"2011-02-13T08:00:00Z","amount":4200}, {"due_date":"2011-03-13T08:00:00Z","amount": 4200}],"scheduled_payments": ... 2.9gb of JSON => network-queryable array: ~5 minutes Kiva Array Server Demo
  • 51. Numba (diagram: C++, C, Fortran, and Python front ends feed LLVM IR, which targets x86, ARM, and PTX back ends). Numba turns Python into a “compiled language”
  • 54. Image Processing ~1500x speed-up
    @jit('void(f8[:,:],f8[:,:],f8[:,:])')
    def filter(image, filt, output):
        M, N = image.shape
        m, n = filt.shape
        for i in range(m//2, M-m//2):
            for j in range(n//2, N-n//2):
                result = 0.0
                for k in range(m):
                    for l in range(n):
                        result += image[i+k-m//2, j+l-n//2] * filt[k, l]
                output[i, j] = result
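A hedged usage sketch for the jitted filter above; the array sizes and the box-blur kernel are illustrative, not from the talk:

    import numpy as np

    image = np.random.rand(512, 512)   # f8[:,:] input, matching the declared signature
    filt = np.ones((5, 5)) / 25.0      # simple box-blur kernel
    output = np.zeros_like(image)

    filter(image, filt, output)        # runs the compiled native loop nest
    print(output[256, 256])

Because the decorator supplies an explicit signature, Numba compiles the function eagerly at decoration time, so even the first call runs the native code.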
  • 55. Glue 2.0 • Python’s legacy as a powerful glue language • manipulate files (instead of shell scripts) • call fast libraries (instead of using Matlab) • Next-gen Glue: • Link data silos • Link disjoint memory & compute • Unify disparate runtime models • Transcend legacy models of computers Instead of gluing disparate things together via a common API or ABI, it's about giving an end user the capability to treat things as a fluid, continuous whole. And I want to re-iterate: Numba and Blaze are not just about speed. It’s about moving domain expertise to data.
  • 56. Blurred Lines • Compile time, run time, JIT, asm.js • Imperative code vs. configuration • App, OS, lightweight virtualization, hardware, virtual hardware • Dev, dev ops, ops • Clouds: IaaS, PaaS, SaaS, DBaaS, AaaS... We have entered the post-PC era. This gluing also extends beyond just the application or code layer. - So much tech innovation happening right now - A lot of churn but some real gems as well - Not just software, but hardware, human roles, and business models - Much of this can be really confusing to track and follow, but it all results from the fact that we are entering a post-PC era - Single unified, “random access memory”; single serial stream of instructions
  • 57. Instead of figuring out how to glue things together, we think that using a high-level language like Python helps people transcend to the level of recognizing that *There is no spoon*. There is no computer - OSes are a lie. VMs and runtimes are a lie. Compilers are a lie. There are just bits, and useful lies on top of the bits. Thus far, we've been able to get away with these because we can build coherent lies. But as the underlying reality gets more complex, the cost of abstraction is too high - or the abstractions will necessarily need to be very leaky. There’s an old joke that computers are bad because they do exactly what we tell them to do. Computers would be better if they had a “do what I want” command, right? Well, with the challenge of scalable computing over big data, figuring out “what I want”, at a low level, is itself a challenge. We instead need the “do whatever *you* want” command.
  • 58. Bokeh • Language-based (instead of GUI) visualization system • High-level expressions of data binding, statistical transforms, interactivity and linked data • Easy to learn, but expressive depth for power users • Interactive • Data space configuration as well as data selection • Specified from high-level language constructs • Web as first class interface target • Support for large datasets via intelligent downsampling (“abstract rendering”) Switch gears a bit and talk about Bokeh
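As a minimal sketch of what language-based, web-first visualization looks like in practice, using today's bokeh.plotting interface (which has evolved since this talk; the names below are illustrative):

    import numpy as np
    from bokeh.io import output_file
    from bokeh.plotting import figure, show

    x = np.linspace(0, 4 * np.pi, 200)
    p = figure(title="sin(x)", tools="pan,wheel_zoom,box_select")  # interactivity declared, not hand-coded
    p.line(x, np.sin(x), legend_label="sin(x)")
    output_file("sine.html")   # the browser is the first-class render target
    show(p)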
  • 59. Bokeh Inspirations: • Chaco: interactive, viz pipeline for large data • Protovis & Stencil: Binding visual Glyphs to data and expressions • ggplot2: faceting, statistical overlays Design goal: Accessible, extensible, interactive plotting for the web... for non-Javascript programmers It’s not exclusively for the web, though - we can target rich client UIs, and I’m excited about the vispy work.
  • 60. Bokeh & BokehJS Demos • BokehJS demos • Audio spectrogram • Bokeh Examples - Low-level Python interface - IPython Notebook integration - ggplot example
  • 61. Continuum Data Explorer (CDX) Bokeh is a library and a tool, but it’s also a component that can be used as part of a larger application.
  • 63. Conclusion Despite the temptation to ignore or dismiss the hype machine, the actual data revolution is happening. But you cannot understand this revolution by focusing on technology *alone*. The technology has to be considered in light of the human factors. You're not going to see the shape of this revolution just by following the traditional industry blogs and trade journals and web sites. The human factors are: what do people really want to do with their data? The people who are getting the most value from their data - what are their backgrounds, and what kinds of companies do they work for, or are they building? How are those companies becoming data-driven?
  • 64. In the business world, the flood of data has triggered a rapid evolution - a Cambrian explosion, if you will. It's like the sun just came out, and all these businesses are struggling to evolve retinae and eyeballs, and avoid getting eaten by other businesses that grew eyeballs first. What about scientific computing, then? Scientists have been working out in the daylight for a long time now, and their decades-long obsession with performance and efficiency is suddenly relevant for the rest of the world... I think, in a way that they had not imagined.
  • 65. I think in this metaphor, Python can be seen as the visual cortex. It connects the raw data-ingest machinery of the eyes to the actual "smarts" of the rest of the brain. And Python itself will need to evolve. It will certainly have to play well with a lot of legacy systems, and integrate with foreign technology stacks. The reason we're so excited about LLVM-based interop, and memory-efficient compatibility with things like the JVM, is because these things give us a chance to at least be on the same carrier wave as other parts of the brain. But Python - or whatever Python evolves into - definitely has a central role to play in the data-enabled future, because of the human factors at the heart of the data revolution, which have also guided the development of the language so far.
  • 66. So it’s a really simple syllogism. Given that: - Analysis of large, complex datasets is about Data Exploration, an iterative process of structuring, slicing, querying data to surface insights - Really insightful hypotheses have to originate in the mind of a domain expert; they cannot be outsourced, and an air gap between two brains leads to a massive loss of context Therefore: Domain experts need to be empowered to directly manipulate, transform, and see their massive datasets. They need a way to accurately express these operations to the computer system, not merely select them from a fixed menu of options: exploration of a conceptual space requires expressiveness. Python was designed to be an easy-to-learn language. It has gained mindshare because it fits in people's brains. It's a tool that empowers them, and bridges their minds with the computer, so the computer is an extension of their exploratory capability. For data analysis, this is absolutely its key feature. As a community, I think that if we keep sight of this, we will ensure that Python has a long and healthy future to become as fundamental as mathematics for the future of analytics.