Srivatsan Ramanujam
Senior Data Scientist
Greenplum

© Copyright 2011 EMC Corporation. All rights reserved.

Agenda
• Greenplum UAP overview
– Products: GPDB, GPHD, Chorus, Analytics Labs, Data Computing Appliance
– GPDB Architecture

• MADlib
– Overview
– Algorithms
– Working Mechanism
– Performance Comparison with Mahout

• PyMADlib
– Overview
– Demo in IPython Notebook

• Future Directions
– GPHD and HAWQ

Greenplum Overview

Products

Greenplum Database - Architecture
• MPP (Massively Parallel Processing), Shared-Nothing Architecture
• Master Servers (SQL, MapReduce): query planning & dispatch
• Network Interconnect
• Segment Servers: query processing & data storage
• External Sources: loading, streaming, etc.
MADlib

MADlib: The Origin

UrbanDictionary.com:
mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills.

• First mention of MAD analytics was at VLDB’09
– MAD Skills: New Analysis Practices for Big Data
– Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph Hellerstein, Caleb Welton
– http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf

• MADlib project initiated in late 2010
– Maintained by Greenplum/EMC with significant contributions
from UW Madison, UFlorida and UC Berkeley.

Current Modules
Data Modeling
• Supervised Learning
– Naive Bayes Classification
– Linear Regression
– Logistic Regression
– Multinomial Logistic Regression
– Decision Tree
– Random Forest
– Support Vector Machines
– Cox-Proportional Hazards Regression
– Conditional Random Field
• Unsupervised Learning
– Association Rules
– k-Means Clustering
– Low-rank Matrix Factorization
– SVD Matrix Factorization
– Parallel Latent Dirichlet Allocation

Descriptive Statistics
• Sketch-based Estimators
– CountMin (Cormode-Muthukrishnan)
– FM (Flajolet-Martin)
– MFV (Most Frequent Values)
• Profile
• Quantile

Support
• Array Operations
• Conjugate Gradient
• Sparse Vectors
• Probability Functions
• Random Sampling

Inferential Statistics
• Hypothesis tests
MADlib – User Doc
• Check out the user guide with examples at: http://doc.madlib.net

How does it work? A Linear Regression Example
• Finding linear dependencies between variables
– y ≈ c0 + c1 · x1 + c2 · x2 ?

# select y, x1, x2 from unm limit 6;

   y   |  x1  | x2
-------+------+-----
 10.14 |    0 | 0.3
 11.93 | 0.69 | 0.6
 13.57 |  1.1 | 0.9
 14.17 | 1.39 | 1.2
 15.25 | 1.61 | 1.5
 16.15 | 1.79 | 1.8

• Column y: vector of dependent variables y
• Columns x1, x2: design matrix X
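Reminder: Linear-Regression Model
• Model: $y_i = c^T x_i + \varepsilon_i$, i.e. $y = Xc + \varepsilon$, with $x_i$ the i-th row of the design matrix X
• If residuals i.i.d. Gaussians with standard deviation σ:
– max likelihood ⇔ min sum of squared residuals

• First-order conditions for the following quadratic objective (in c)

$$f(c) = \sum_i (y_i - c^T x_i)^2 = \lVert y - Xc \rVert^2$$

yield the minimizer

$$\hat{c} = (X^T X)^{-1} X^T y$$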
Linear Regression: Streaming Algorithm
• How to compute with a single table scan?
– $\hat{c} = (X^T X)^{-1} X^T y$
– [Diagram: the expression decomposed into the two products $X^T X$ and $X^T y$]
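The single-scan idea can be sketched in plain Python: each row updates two running sums, X^T X and X^T y, and the small linear system is solved once the scan ends. This is a minimal sketch, not MADlib's implementation. The names `solve` and `streaming_ols` are hypothetical, the intercept column of ones is an assumption taken from the c0 term on the example slide, and a hand-written Gaussian elimination stands in for whatever solver the library uses. The data are the six `unm` rows from the example.

```python
# Illustrative sketch only: these helpers are not part of MADlib.

ROWS = [  # (y, x1, x2), copied from the `unm` sample
    (10.14, 0.0, 0.3),
    (11.93, 0.69, 0.6),
    (13.57, 1.1, 0.9),
    (14.17, 1.39, 1.2),
    (15.25, 1.61, 1.5),
    (16.15, 1.79, 1.8),
]


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def streaming_ols(rows):
    """Fit y ~ c0 + c1*x1 + ... while reading each row exactly once."""
    xtx, xty = None, None
    for y, *xs in rows:
        x = [1.0, *xs]  # assumed intercept column for c0
        if xtx is None:
            xtx = [[0.0] * len(x) for _ in x]
            xty = [0.0] * len(x)
        for i, xi in enumerate(x):
            xty[i] += xi * y           # running X^T y
            for j, xj in enumerate(x):
                xtx[i][j] += xi * xj   # running X^T X
    if xtx is None:
        raise ValueError("no rows to fit")
    return solve(xtx, xty)


coef = streaming_ols(ROWS)  # [c0, c1, c2]
print(coef)
```

Only the two accumulators are kept in memory, so storage depends on the number of columns rather than the number of rows.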
Linear Regression: Parallel Computation
• $X^T y$ splits across segments:
– Segment 1 computes $X_1^T y_1$
– Segment 2 computes $X_2^T y_2$
– Master combines them: $X^T y = X_1^T y_1 + X_2^T y_2$
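The segment/master split can be mimicked with ordinary lists. In this sketch the two "segments" are just slices of the sample rows, `partial_xty` and `merge` are hypothetical names, and the round-robin split is an assumption for illustration (the slides do not say how rows are distributed).

```python
# Illustrative sketch: "segments" here are plain Python lists, not GPDB segments.

ROWS = [  # (y, x1, x2), copied from the `unm` sample
    (10.14, 0.0, 0.3),
    (11.93, 0.69, 0.6),
    (13.57, 1.1, 0.9),
    (14.17, 1.39, 1.2),
    (15.25, 1.61, 1.5),
    (16.15, 1.79, 1.8),
]


def partial_xty(rows):
    """X_k^T y_k for one block of rows, with an assumed intercept column."""
    acc = [0.0, 0.0, 0.0]
    for y, x1, x2 in rows:
        for i, xi in enumerate((1.0, x1, x2)):
            acc[i] += xi * y
    return acc


def merge(a, b):
    """Master-side step: add two partial results element by element."""
    return [u + v for u, v in zip(a, b)]


segment_1 = ROWS[0::2]  # assumed round-robin distribution
segment_2 = ROWS[1::2]

on_master = merge(partial_xty(segment_1), partial_xty(segment_2))
print(on_master)
```

The same pattern applies to X^T X, since it is also a sum of per-row terms; each segment only ever ships a small fixed-size result to the master.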
Performance Comparison: Test Setup on AWB
• AWB (Analytics Workbench)
– 1000-node cluster located in Las Vegas
– Over 24,000 processors, 48 TB of Memory, and 24 PB of raw disk
storage
– 8000+ Map Task Capacity, 5000+ Reduce Task Capacity
– GPHD 1.1, GPDB 4.2.3

• Mahout v0.7
• MADlib v0.5
– With small LMF change to allow 4-byte integer values

• Test matrix
– Data size (# rows/records, # columns/features)
– Algorithms
– Algorithm parameters (e.g. convergence threshold, # iterations)
– GPDB segment / MR (Map-Reduce) task configurations
Performance & Scalability Results (summary)

• Whitepaper coming out shortly!

Logistic Regression
• Mahout only has sequential (i.e. single node) IGD implementation

[Figure: MADlib & Mahout Logistic Regression Scalability Across Number of Attributes. Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]. x-axis: log(Number of Rows), 1E+06 to 1E+09; y-axis: Time in Minutes, 0 to 700.]
Logistic Regression
[Figure: MADlib Scalability Across Number of GPDB Segments. x-axis: Number of GPDB Segments, 0 to 300; y-axis: Time in Minutes, 0 to 18.]
K-Means Clustering
[Figure: MADlib & Mahout K-means Scalability Across Number of Rows. Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]. x-axis: log(Number of Rows), 1E+06 to 1E+09; y-axis: Time in Min, 0 to 350.]
K-Means Clustering
[Figure: MADlib K-means Scalability Across Number of GPDB Segments. x-axis: Number of GPDB Segments, 0 to 300; y-axis: Time in Min, 0 to 10.]
PyMADlib : Python + MADlib = Awesome!

Motivation
• SQL is great for many things, but it’s not nearly enough

• Undeniably the most straightforward way to query data

• But not necessarily designed for data science

MADlib is a godsend!
• Empowers data scientists to run canned machine learning
routines – focus less on coding, more on science
• In-database, explicitly parallel.

• So why do we need anything else?
– UI is still all in SQL
– Need to tap into rich visualization libraries

Then which interface is favored by and familiar
to data scientists?

• Depends on who you ask
• Left survey is for “higher level languages,” and right survey is for “lower level languages”

Wait, don’t we already have this (PL/R,
PL/Python, SAS HPA)?
• PL/X’s are wonderful, but:
– It still requires non-trivial knowledge of SQL to use effectively
– Mostly limited to explicitly parallel jobs
– Primarily a SQL interface to the end user

• Need an interface that is:
– Less SQL, more R/Python/SAS
– Implicitly parallelized
– More scalable

• SAS HPA = $$$$$

The challenge
• MADlib
– Open source
– Extremely powerful/scalable
– Growing algorithm breadth
– SQL

• Python/R
– Open source
– Memory limited
– High algorithm breadth
– Language/interface purpose-designed for data science

• SAS
– High user loyalty
– Non-HPA is memory limited, HPA requires investment
– High algorithm breadth
– Language/interface purpose-designed for data science

• Want to leverage both the performance benefits of MADlib and the
usability of languages like Python, SAS, and R

Simple solution: Translate Python code into SQL

[Diagram: Python → SQL over ODBC/JDBC]
– To the database: SQL to execute MADlib
– Back to Python: model output

• All data stays in the DB, and all model estimation and heavy lifting is done in the DB by
MADlib

• Only strings of SQL and model output are transferred across ODBC/JDBC
• Best of both worlds: the number-crunching power of MADlib along with the rich
visualizations of Matplotlib, NetworkX and all your other favorite Python
libraries. Let MADlib do the heavy lifting on your Greenplum/PostgreSQL
database while you program in your favorite language, Python.
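To make the "Python in, SQL out" idea concrete, here is a minimal sketch of a translation step. It is not PyMADlib's actual API: `linregr_sql` and `_check` are hypothetical helpers, and the generated statement assumes the aggregate form `madlib.linregr(y, x)` used by early MADlib releases (newer releases expose a `linregr_train` function instead, so check the documentation for the installed version). Only the resulting string would cross ODBC/JDBC; no database connection is opened here.

```python
import re

# Illustrative sketch only: not PyMADlib's real interface.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check(name):
    """Accept only plain identifiers so user input cannot inject SQL."""
    if not _IDENT.match(name):
        raise ValueError(f"not a plain SQL identifier: {name!r}")
    return name


def linregr_sql(table, dependent, independents):
    """Build SQL asking MADlib to fit y ~ 1 + x1 + ... inside the database."""
    cols = ", ".join(_check(c) for c in independents)
    # Assumed MADlib call: the linregr(y, x) aggregate of early releases.
    return (
        f"SELECT (madlib.linregr({_check(dependent)}, ARRAY[1, {cols}])).* "
        f"FROM {_check(table)}"
    )


sql = linregr_sql("unm", "y", ["x1", "x2"])
print(sql)
```

In a real session the string would be handed to a database driver's cursor, and only the single result row (coefficients and fit statistics) would travel back to Python for plotting.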

Demo

PyMADlib Tutorial –
IPython Notebook Viewer Link

http://nbviewer.ipython.org/5275846

Where do I get it?

$ pip install pymadlib
I don’t have GPDB or MADlib – What do I do ?
• Greenplum Database Community Edition is freely
available for single node installations on multiple
platforms
– Written permission may be requested from EMC/Greenplum
for research use for multi-node installations

• MADlib is free and open-source
– Downloadable for multiple platforms from
https://github.com/madlib/madlib

• PyMADlib is also free and open-source
– Downloadable from https://github.com/vatsan/pymadlib

Future Directions

Greenplum HD
• HAWQ – Parallel SQL query engine that combines the key
technological advantages of industry-leading Greenplum
Database with scalability and convenience of Hadoop

• SQL Standards Compliant
– Supports Correlated Sub-queries, Window Functions, Roll-ups, Cubes
+ range of scalar and aggregate functions

• ACID Compliant

HAWQ – Architecture

Performance: HAWQ [1] vs. Hive vs. Impala [2]

All experiments were run on a 60 node deployment with Analytics Workbench [3]

[1] http://www.greenplum.com/sites/default/files/2013_0301_hawq_sql_engine_hadoop_1.pdf
[2] https://github.com/cloudera/impala/
[3] http://www.analyticsworkbench.com/
HAWQ: Deep Scalable Analytics
What’s inside the box?

• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• K-Means

• Association Rules
• Latent Dirichlet Allocation
• Users can connect to HAWQ via popular programming languages and it also
supports JDBC and ODBC.
• Most tools will work out of the box with HAWQ, including PyMADlib

Questions?
@being_bayesian
vatsan.cs@utexas.edu
https://github.com/vatsan/pymadlib

Appendix

Datasets
The following datasets were used in comparing the performance of
MADlib with Mahout
– KDD Cup 2009 Orange marketing churn data (16.5 MB)
• About 500,000 records and 15,000 numerical and categorical attributes
– Census 2000 data (1.7 GB)
• About 14 million records and 48 numerical and categorical attributes
– Enron data (1.9 GB)
• About 700,000 documents with a vocabulary size of 200,000
– KDD Cup 2011 Yahoo! Music Webscope data (4.16 GB)
• About 1 million users, 600,000 songs, and 250 million ratings
– Netflix Prize 2009 data (52.7 MB)
• About 400,000 users, 900 movies, and 4.5 million ratings


PyMADlib - A Python wrapper for MADlib: an in-database, parallel machine learning library.

  • 1. Srivatsan Ramanujam, Senior Data Scientist, Greenplum. © Copyright 2011 EMC Corporation. All rights reserved.
  • 2. Agenda
      • Greenplum UAP overview
        – Products: GPDB, GPHD, Chorus, Analytics Labs, Data Computing Appliance
        – GPDB Architecture
      • MADlib
        – Overview
        – Algorithms
        – Working Mechanism
        – Performance Comparison with Mahout
      • PyMADlib
        – Overview
        – Demo in IPython Notebook
      • Future Directions
        – GPHD and HAWQ
  • 3. Greenplum Overview
  • 4. Products
  • 5. Greenplum Database - Architecture
      • MPP (Massively Parallel Processing), shared-nothing architecture
      • [Diagram: master servers (SQL and MapReduce; query planning & dispatch), network interconnect, segment servers (query processing & data storage), external sources (loading, streaming, etc.)]
  • 6. MADlib
  • 7. MADlib: The Origin
      UrbanDictionary.com: mad (adj.): an adjective used to enhance a noun.
        1- dude, you got skills.
        2- dude, you got mad skills.
      • First mention of MAD analytics was at VLDB’09
        – MAD Skills: New Analysis Practices for Big Data
        – Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph Hellerstein, Caleb Welton: http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf
      • MADlib project initiated in late 2010
        – Maintained by Greenplum/EMC with significant contributions from UW Madison, UFlorida and UC Berkeley.
  • 8. Current Modules
      • Data Modeling
        – Supervised Learning: Naive Bayes Classification, Linear Regression, Logistic Regression, Multinomial Logistic Regression, Decision Tree, Random Forest, Support Vector Machines, Cox-Proportional Hazards Regression, Conditional Random Field
        – Unsupervised Learning: Association Rules, k-Means Clustering, Low-rank Matrix Factorization, SVD Matrix Factorization, Parallel Latent Dirichlet Allocation
      • Descriptive Statistics
        – Sketch-based Estimators: CountMin (Cormode-Muthukrishnan), FM (Flajolet-Martin), MFV (Most Frequent Values)
        – Profile
        – Quantile
      • Support
        – Array Operations
        – Conjugate Gradient
        – Sparse Vectors
        – Probability Functions
        – Random Sampling
      • Inferential Statistics
        – Hypothesis tests
  • 9. MADlib – User Doc
      • Check out the user guide with examples at: http://doc.madlib.net
  • 10. How does it work? A Linear Regression Example
      • Finding linear dependencies between variables: y ≈ c0 + c1 · x1 + c2 · x2 ?

          # select y, x1, x2 from unm limit 6;
             y   |  x1  | x2
          -------+------+-----
           10.14 |    0 | 0.3
           11.93 | 0.69 | 0.6
           13.57 |  1.1 | 0.9
           14.17 | 1.39 | 1.2
           15.25 | 1.61 | 1.5
           16.15 | 1.79 | 1.8

      • The y column is the vector of dependent variables y; the x1 and x2 columns form the design matrix X.
  • 11. Reminder: Linear-Regression Model
      • If the residuals are i.i.d. Gaussian with standard deviation σ:
        – maximum likelihood ⇔ minimum sum of squared residuals
      • First-order conditions for the sum-of-squared-residuals objective, which is quadratic in c, yield the minimizer $\hat{c} = (X^T X)^{-1} X^T y$
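The equations on this slide did not survive in the transcript. As a hedged reconstruction using standard least-squares notation (an assumption, not the slide's own typesetting), with X taken to include a leading column of ones so that c = (c0, c1, c2), the first-order condition works out as follows:

```latex
% Assumed notation: X is the n-by-(k+1) design matrix with a leading column of ones,
% y is the n-vector of observations, c = (c_0, ..., c_k)^T are the coefficients.
\begin{align*}
f(c) &= \lVert y - Xc \rVert_2^2 = \sum_{i=1}^{n} \bigl(y_i - x_i^T c\bigr)^2 \\
\nabla f(c) &= -2\,X^T (y - Xc) = 0 \\
X^T X\, c &= X^T y \\
\hat{c} &= (X^T X)^{-1} X^T y \qquad \text{(if $X^T X$ is invertible)}
\end{align*}
```

Only the two quantities $X^T X$ and $X^T y$ are needed to obtain the coefficients, which is what makes a single-pass computation possible.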
  • 12. Linear Regression: Streaming Algorithm
      • How to compute $(X^T X)^{-1} X^T y$ with a single table scan? Accumulate $X^T X$ and $X^T y$ in one pass over the rows.
  • 13. Linear Regression: Parallel Computation
      • Each segment computes its partial product (Segment 1: $X_1^T y_1$; Segment 2: $X_2^T y_2$), and the master combines them: $X^T y = X_1^T y_1 + X_2^T y_2$
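The single-scan and per-segment ideas on slides 12 and 13 can be sketched in plain Python. This is a minimal illustration, not MADlib's actual code: the function names (`new_state`, `transition`, `merge`, `final`, `scan`) are hypothetical and only loosely mirror the transition/merge/final shape of a database aggregate. The rows are the six sample rows from slide 10, and a constant 1 is prepended to each row as an assumed intercept term.

```python
from typing import Iterable, List, Sequence, Tuple

State = Tuple[List[List[float]], List[float]]  # running (X^T X, X^T y)


def new_state(k: int) -> State:
    """Empty accumulator for k coefficients."""
    return [[0.0] * k for _ in range(k)], [0.0] * k


def transition(state: State, y: float, x: Sequence[float]) -> State:
    """Fold one row into the running sums (one step of the table scan)."""
    xtx, xty = state
    for i, xi in enumerate(x):
        xty[i] += xi * y
        for j, xj in enumerate(x):
            xtx[i][j] += xi * xj
    return state


def merge(a: State, b: State) -> State:
    """Combine the partial sums of two segments (the master's job)."""
    k = len(a[1])
    xtx = [[a[0][i][j] + b[0][i][j] for j in range(k)] for i in range(k)]
    xty = [a[1][i] + b[1][i] for i in range(k)]
    return xtx, xty


def final(state: State) -> List[float]:
    """Solve (X^T X) c = X^T y by Gaussian elimination with partial pivoting."""
    xtx, xty = state
    k = len(xty)
    aug = [row[:] + [rhs] for row, rhs in zip(xtx, xty)]  # augmented matrix
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    coef = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(aug[i][j] * coef[j] for j in range(i + 1, k))
        coef[i] = (aug[i][k] - tail) / aug[i][i]
    return coef


def scan(rows: Iterable[Tuple[float, float, float]]) -> State:
    """One pass over (y, x1, x2) rows, adding an intercept column of ones."""
    state = new_state(3)
    for y, x1, x2 in rows:
        state = transition(state, y, [1.0, x1, x2])
    return state


# Sample rows (y, x1, x2) from slide 10.
ROWS = [(10.14, 0.0, 0.3), (11.93, 0.69, 0.6), (13.57, 1.1, 0.9),
        (14.17, 1.39, 1.2), (15.25, 1.61, 1.5), (16.15, 1.79, 1.8)]

single = final(scan(ROWS))                               # one scan over all rows
parallel = final(merge(scan(ROWS[:3]), scan(ROWS[3:])))  # two "segments" + master
print(single)
print(parallel)
```

Because the accumulated sums are plain additions, splitting the rows across segments and merging gives the same coefficients as a single scan, up to floating-point rounding. The state size depends only on the number of columns, not on the number of rows.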
  • 14. Performance Comparison: Test Setup on AWB
      • AWB (Analytics Workbench)
        – 1000-node cluster located in Las Vegas
        – Over 24,000 processors, 48 TB of memory, and 24 PB of raw disk storage
        – 8000+ map task capacity, 5000+ reduce task capacity
        – GPHD 1.1, GPDB 4.2.3
      • Mahout v0.7
      • MADlib v0.5
        – With a small LMF (low-rank matrix factorization) change to allow 4-byte integer values
      • Test matrix
        – Data size (# rows/records, # columns/features)
        – Algorithms
        – Algorithm parameters (e.g. convergence threshold, # iterations)
        – GPDB segment / MR (Map-Reduce) task configurations
  • 15. Performance & Scalability Results (summary)
      • Whitepaper coming out shortly!
  • 16. Logistic Regression
      • Mahout only has a sequential (i.e. single-node) IGD implementation
      • [Chart: MADlib & Mahout Logistic Regression Scalability Across Number of Attributes. Series: Census data, 48 attributes, for Mahout and for MADlib. x-axis: log(Number of Rows); y-axis: Time in Minutes]
  • 17. Logistic Regression
      • [Chart: MADlib Scalability Across Number of GPDB Segments. x-axis: Number of GPDB Segments; y-axis: Time in Minutes]
  • 18. K-Means Clustering
      • [Chart: MADlib & Mahout K-means Scalability Across Number of Rows. Series: Census data, 48 attributes, for Mahout and for MADlib. x-axis: log(Number of Rows); y-axis: Time in Min]
  • 19. K-Means Clustering
      • [Chart: MADlib K-means Scalability Across Number of GPDB Segments. x-axis: Number of GPDB Segments; y-axis: Time in Min]
  • 20. PyMADlib: Python + MADlib = Awesome!
  • 21. Motivation
      • SQL is great for many things, but it’s not nearly enough
      • Undeniably the most straightforward way to query data
      • But not necessarily designed for data science
  • 22. MADlib is a godsend!
      • Empowers data scientists to run canned machine learning routines
        – Focus less on coding, more on science
      • In-database, explicitly parallel
      • So why do we need anything else?
        – UI is still all in SQL
        – Need to tap into rich visualization libraries
  • 23. Then which interface is favored by and familiar to data scientists?
      • Depends on who you ask
      • Left survey is for “higher level languages,” and right survey is for “lower level languages”
  • 24. Wait, don’t we already have this (PL/R, PL/Python, SAS HPA)?
      • PL/X’s are wonderful, but:
        – It still requires non-trivial knowledge of SQL to use effectively
        – Mostly limited to explicitly parallel jobs
        – Primarily a SQL interface to the end user
      • Need an interface that is:
        – Less SQL, more R/Python/SAS
        – Implicitly parallelized
        – More scalable
      • SAS HPA = $$$$$
  • 25. The challenge
      • MADlib
        – Open source
        – Extremely powerful/scalable
        – Growing algorithm breadth
        – SQL
      • Python/R
        – Open source
        – Memory limited
        – High algorithm breadth
        – Language/interface purpose-designed for data science
      • SAS
        – High user loyalty
        – Non-HPA is memory limited, HPA requires investment
        – High algorithm breadth
        – Language/interface purpose-designed for data science
      • Want to leverage both the performance benefits of MADlib and the usability of languages like Python, SAS, and R
  • 26. Simple solution: Translate Python code into SQL
      • [Diagram: Python → SQL. The SQL to execute MADlib is sent over ODBC/JDBC, and the model output is returned the same way]
      • All data stays in the DB, and all model estimation and heavy lifting is done in the DB by MADlib
      • Only strings of SQL and model output are transferred across ODBC/JDBC
      • Best of both worlds: the number-crunching power of MADlib along with the rich set of visualizations of Matplotlib, NetworkX and all your other favorite Python libraries. Let MADlib do all the heavy lifting on your Greenplum/PostgreSQL database, while you program in your favorite language: Python.
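The translation step described on this slide can be sketched as a function that turns Python arguments into a SQL string. This is an illustrative sketch only and not PyMADlib's actual API: `build_linregr_sql` and `_check` are hypothetical names. The generated query assumes the aggregate-style call `madlib.linregr(y, x)` from MADlib releases of that era; the exact function name and signature should be checked against the user guide for the installed version. The table `unm` and columns `y`, `x1`, `x2` come from the slide 10 example.

```python
import re
from typing import Sequence

# Only plain identifiers are accepted, so user input cannot inject SQL.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _check(name: str) -> str:
    """Return name unchanged if it is a plain SQL identifier, else raise."""
    if not _IDENT.fullmatch(name):
        raise ValueError(f"not a plain SQL identifier: {name!r}")
    return name


def build_linregr_sql(table: str, dependent: str,
                      independents: Sequence[str]) -> str:
    """Hypothetical helper: render SQL asking MADlib for a linear fit.

    Assumes an aggregate of the form madlib.linregr(y, x); the leading 1
    in the array is the intercept term.
    """
    cols = ", ".join(_check(c) for c in independents)
    return (
        f"SELECT (madlib.linregr({_check(dependent)}, ARRAY[1, {cols}])).* "
        f"FROM {_check(table)};"
    )


sql = build_linregr_sql("unm", "y", ["x1", "x2"])
print(sql)
# prints: SELECT (madlib.linregr(y, ARRAY[1, x1, x2])).* FROM unm;
```

In a setup like the one on the slide, only this string would travel to the database over the driver connection, and only the small result row with the fitted coefficients would come back; the table rows themselves never leave the database.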
  • 27. Demo
      • PyMADlib Tutorial, IPython Notebook Viewer link: http://nbviewer.ipython.org/5275846
  • 28. Where do I get it?
      $ pip install pymadlib
  • 29. I don’t have GPDB or MADlib – What do I do?
      • Greenplum Database Community Edition is freely available for single-node installations on multiple platforms
        – Written permission may be requested from EMC/Greenplum for research use for multi-node installations
      • MADlib is free and open-source
        – Downloadable for multiple platforms from https://github.com/madlib/madlib
      • PyMADlib is also free and open-source
        – Downloadable from https://github.com/vatsan/pymadlib
  • 30. Future Directions
  • 31. Greenplum HD
      • HAWQ
        – Parallel SQL query engine that combines the key technological advantages of industry-leading Greenplum Database with the scalability and convenience of Hadoop
      • SQL Standards Compliant
        – Supports correlated sub-queries, window functions, roll-ups, cubes, plus a range of scalar and aggregate functions
      • ACID Compliant
  • 32. HAWQ – Architecture
  • 33. Performance: HAWQ [1] vs. Hive vs. Impala [2]
      • All experiments were run on a 60-node deployment with Analytics Workbench [3]
        [1] http://www.greenplum.com/sites/default/files/2013_0301_hawq_sql_engine_hadoop_1.pdf
        [2] https://github.com/cloudera/impala/
        [3] http://www.analyticsworkbench.com/
  • 34. HAWQ: Deep Scalable Analytics
      • What’s inside the box?
        – Linear Regression
        – Logistic Regression
        – Multinomial Logistic Regression
        – K-Means
        – Association Rules
        – Latent Dirichlet Allocation
      • Users can connect to HAWQ via popular programming languages, and it also supports JDBC and ODBC.
      • Most tools will work out of the box with HAWQ, including PyMADlib
  • 36. Appendix
  • 37. Datasets
      The following datasets were used in comparing the performance of MADlib with Mahout:
        – KDD Cup 2009 Orange marketing churn data (16.5 MB)
            • About 500,000 records and 15,000 numerical and categorical attributes
        – Census 2000 data (1.7 GB)
            • About 14 million records and 48 numerical and categorical attributes
        – Enron data (1.9 GB)
            • About 700,000 documents with a vocabulary size of 200,000
        – KDD Cup 2011 Yahoo! Music Webscope data (4.16 GB)
            • About 1 million users, 600,000 songs, and 250 million ratings
        – Netflix Prize 2009 data (52.7 MB)
            • About 400,000 users, 900 movies, and 4.5 million ratings

Editor's notes

  1. Special thanks to Grace Gee (Engineer, SOAR Program, Greenplum)