MAD Skills: New
                          Analysis Practices for
                                Big Data
                                   Presented By:
                                  Christan Grant
                                cgrant@cise.ufl.edu




Thursday, August 25, 11                              1
MAD Skills: New Analysis
                               Practices for Big Data
                     •    Authors

                          •   Jeff Cohen – Greenplum

                          •   Brian Dolan – Fox Audience Network

                          •   Mark Dunlap – Evergreen Technologies

                          •   Joseph M Hellerstein – UC Berkeley

                          •   Caleb Welton – Greenplum

                     •    Presented at the Very Large Data Bases (VLDB) Conference 2009 in Lyon, France




If you are looking for a career where your services will be
                          in high demand, you should find something where you
                          provide a scarce, complementary service to something
                          that is getting ubiquitous and cheap.

                          So what’s getting ubiquitous and cheap? Data.

                          And what is complementary to data? Analysis.

                          – Prof. Hal Varian, Chief Economist at Google




• Storage is Cheap
                      • The world's largest data warehouse of
                          ~15 years ago can be stored on disks for
                          about $2000.




• Traditionally, a lot of time is spent building
                          well-structured data warehouses.

                          • They contain summaries of data
                          • They are the main location for analytics
                          • They are jealously guarded by IT

• Data analysis is common culture
                       • The culture is to collect and analyze data in
                            different business units.

                     • “In God we trust; all others must bring data”
                     • Analytics are crucial – Analysts need more
                          access and more permissions

                          • Hence M.A.D.
• Magnetic – Attract data and practitioners.
                          The Database must be painless and efficient.

                     • Agile – Analysts need to ingest and analyze
                          on the fly.

                     • Deep – Sophisticated analytics at scale.
                          Don’t force analysts to work on samples
                          (where the long tail is lost).

                     • Library of tools
Outline
                     •    Background
                     •    FOX Audience Network
                     •    Being MAD
                     •    In-Database Operations
                     •    MAD DBMS
                     •    Conclusions
                     •    Questions


Background
                     • OLAP and Data Cubes
                          • Data structures that
                             provide descriptive
                             statistics: simple
                             summaries (sum,
                             average, standard deviation)

                          • We want inferential/inductive
                             statistics: predictions,
                             causality analysis,
                             distributional comparison



Background
                     •    Databases and Statistical Packages
                          •   Many analysts download data to use in Excel/SAS/
                              Matlab/R or their favorite programming language?
                              FORTRAN??


                          •   Use matrix/vector operations
                          •   Most of these stat packages require data to fit in RAM
                              •   Taking samples from the full data to fit into RAM
                                  results in loss of precision
                          •   External toolkits may also lack parallelism



Background
                     • MapReduce and Parallel Programming
                      • Strengths of MR are scalable batch
                            parallelism and fault tolerance.
                          • There is work on putting ML algorithms
                            in the MapReduce framework (Mahout)
                          • MR is just a programming framework and
                            works well with MAD


Background
                     • Data Mining and Analytics in the DB
                      • Plenty of work in this area
                      • Methods are typically implemented as a
                            black box
                          • Algorithms are not as flexible as in
                            programmable packages



FOX Audience Network
                     • Served MySpace.com, IGN.com, Scout.com
                          etc.
                     • About 150 Million users
                     • Ad network, bought by Rubicon in Nov
                          2010




Fox Audience Network
                     •    42 Nodes (2 Masters, 40 Workers)
                     •    Sun X4500 Thumper
                     •    48 500GB drives
                     •    16GB RAM
                     •    ~ 5TB Daily
                     •    1 Table 1.5 Trillion Rows
                     •    Many different types of user workloads
                     •    Dynamic query ecosystem


Fox Audience Network
                     •    Use Cases
                     •    Queries to the system are vastly different
                     •    Example queries:
                          •   How many female WWF enthusiasts under the age of 30 visited
                              the Toyota community over the last four days and saw a medium
                              rectangle?
                              •   Expensive
                          •   How are these people similar to those that visited Nissan?
                              •   Open-ended, requires some statistics and the analyst to
                                  be in the loop.



Being MAD
                     • Analysts (Data Scientists) trump DBAs
                      • They crave data
                      • They can work with dirty data
                      • They like all data (not samples)
                      • Have a belt of tools


Being MAD

                     • Get data to the warehouse as soon as
                          possible
                     • Cleaning and integration should be done
                          intelligently




Being MAD
                     • Three logical layers for data stores, plus a sandbox:
                      • Reporting – Specialized static aggregates
                      • Production – Aggregates for most users
                      • Staging – Raw tables and logs
                      • Sandboxes – Playground for analysts
                     • The logical layers encourage creativity
Being MAD

                     • Data scientists need power tools: math +
                          stats + ML
                     • The MADLib approach is to develop a
                          hierarchy of mathematical concepts in the
                          database.




In-Database Operations
                     • User Defined Functions (UDF)
                       • SQL or PL/pgSQL
                       • Internal (C)
                       • Dynamically loaded (C, C++, Java, JavaScript,
                            Perl, Python, R, Ruby, Scheme, etc.)
                          • SELECT f1(city) FROM states
                            WHERE f2(capital)


In-Database Operations
                     •    Value abstraction levels

                     •    Scalar:    1.0

                     •    Vector:    [0.3, 0.3, 0.4]

                     •    Matrix:    [[0.3, 0.7], [0.6, 0.4]]

                     •    Function:  f( • ) → value




In-Database Operations


                          Scalar Arithmetic



In-Database Operations

                     • Scalar Arithmetic
                      • SELECT 5*4;
                      • SELECT sqrt(64);
                      • SELECT cos(-3.14159 * sqrt(2) / 2 );

In-Database Operations


                          Vector Arithmetic



In-Database Operations

                     • Vector
                      • array objects: float8[]
                      • array operations are implemented as
                          UDFs.
                      • storing arrays in a column technically violates 1st normal form



In-Database Operations


                          Matrix Arithmetic



In-Database Operations

                • Matrix Representation #1
                          CREATE TABLE B (
                               row_number integer,
                               vector numeric[]
                          );


   http://www.postgresql.org/docs/current/static/arrays.html

In-Database Operations

                • Matrix addition:
                          SELECT A.row_number, A.vector + B.vector
                          FROM A, B
                          WHERE A.row_number = B.row_number



       This can be easily implemented as a primitive
       operator:
       SELECT A ** B FROM A, B
In-Database Operations
                • Matrix and vector multiplication: Av

                          SELECT 1, array_accum(row_number, vector*v)
                          FROM A


                          array_accum(x,v) is a custom function. It takes the
                          value v and puts it in the row indexed by x.


In-Database Operations
                •         Matrix Representation #2
                          CREATE TABLE A (
                               row_number INTEGER,
                               column_number INTEGER,
                               value DECIMAL
                          );
                • Great sparse representation!
   http://www.postgresql.org/docs/current/static/arrays.html

In-Database Operations
                • Matrix Multiplication (using rep #2):
                          SELECT A.row_number, B.column_number, SUM(A.value * B.value)

                          FROM A, B

                          WHERE A.column_number = B.row_number

                          GROUP BY A.row_number, B.column_number
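
The join-and-aggregate query above can be mirrored outside the database. The following is a minimal Python sketch, not the authors' code: the function name sparse_matmul and the in-memory lists of (row, column, value) triples are assumptions made for illustration.

```python
from collections import defaultdict

def sparse_matmul(a, b):
    """Multiply two sparse matrices stored as (row, column, value) triples."""
    # Index B by row so each A entry finds its join partners
    # (the WHERE A.column_number = B.row_number condition).
    b_by_row = defaultdict(list)
    for row, col, val in b:
        b_by_row[row].append((col, val))
    # Accumulate products per output cell, like SUM(...) with GROUP BY.
    sums = defaultdict(float)
    for row, col, val in a:
        for b_col, b_val in b_by_row[col]:
            sums[(row, b_col)] += val * b_val
    return sorted((r, c, v) for (r, c), v in sums.items())

# Two small 2x2 matrices in the sparse triple format.
A = [(1, 1, 1), (1, 2, 2), (2, 1, 3), (2, 2, 4)]
B = [(1, 1, 5), (1, 2, 6), (2, 1, 7), (2, 2, 8)]
product = sparse_matmul(A, B)
```

Cells that are absent from either input contribute nothing, which is why this representation suits sparse matrices.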




In-Database Operations
                • These operations require one pass over the
                          data and can be done in parallel.
                • What about Matrix Division, and Matrix
                          Inverse?
                • SQL is awkward because of its lack of loops for
                          iteration.




In-Database Operations
                     • Document similarity for fraud detection
                       • Check to see if several advertisers are
                            linking to pages with the same content.
                          • If several advertisers have outbound links to
                            similar documents, it is possible they are
                            using stolen credit cards.
                          • Especially if the outbound documents are
                            malicious.


In-Database Operations
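                • For document similarity we use tf-idf. This is a
                          score for a term in a given document.
                          For a term t in document d:
                             $\text{tf-idf}_{t,d} = \text{tf}_{t,d} \cdot \text{idf}_t$
                               $\text{tf}_{t,d}$ = count of term t in document d
                               $\text{idf}_t = \log\left( |\text{docs}| \,/\, |\text{docs containing } t| \right)$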
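
As a rough illustration of the formula above, here is a minimal Python sketch. The function name tf_idf, the dictionary layout, and the sample documents are assumptions for this example; it uses the natural logarithm, and a different base only rescales every score by the same constant.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return {doc_id: {term: score}} for docs given as {doc_id: [terms]}."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for terms in docs.values() for term in set(terms))
    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)  # raw term count within this document
        scores[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return scores

# Hypothetical two-document corpus.
docs = {
    "d1": ["cheap", "pills", "cheap"],
    "d2": ["cheap", "flights"],
}
scores = tf_idf(docs)
```

A term that appears in every document ("cheap" here) scores zero, so only the distinguishing terms carry weight.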



In-Database Operations
                 •        Table docs contains triples (document, term, count)

                 • tf UDF in SQL takes a term and document as parameters
                          SELECT sum("count")
                          FROM docs
                          WHERE term = t AND document = d



                 • idf UDF in SQL takes a term as a parameter
                          SELECT log (
                                  (SELECT count(DISTINCT document) FROM docs)::float8 /
                                  (SELECT count(DISTINCT document) FROM docs WHERE term = t))




In-Database Operations
                •         Each doc will have a vector of tf-idf scores;
                          we can compare documents using cosine similarity.
                          $\cos\theta = \dfrac{v \cdot u}{\|v\| \, \|u\|}$
                          small θ means v and u are similar


                • The SQL is straightforward using the vector
                          operators
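
The cosine formula above can be sketched in a few lines of Python. The function name cosine_similarity and the sample vectors are illustrative assumptions; in the database the same arithmetic would go through the vector UDFs.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Parallel vectors give a cosine near 1; orthogonal vectors give 0.
same = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
```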



In-Database Operations
                •         Ordinary Least Squares is used to fit a line to a collection of points.
                •         This helps to describe seasonal trends; it's a first look at the data.
                          Solve $y = X\beta$ where X is an n × k matrix.
                          Estimate β using $\beta^* = (X^T X)^{-1} X^T y$
                          We can parallelize the computation:
                                Distribute:    $A \leftarrow X^T X$, $b \leftarrow X^T y$
                                Collect:      A and b.
                                Calculate:    $A^{-1}$
                                Multiply:     $A^{-1} b$
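
The distribute/collect steps above can be sketched for the simplest case, a line with an intercept (k = 2, rows of X are [1, x]). This is a minimal Python sketch under that assumption; the function names and the hand-made partitions are hypothetical and stand in for per-node work.

```python
def partial_sums(points):
    """Per-partition step: accumulate the entries of X^T X and X^T y."""
    n = sx = sxx = sy = sxy = 0.0
    for x, y in points:
        n += 1
        sx += x
        sxx += x * x
        sy += y
        sxy += x * y
    return (n, sx, sxx, sy, sxy)

def fit_line(partitions):
    """Collect the partial sums, then solve the 2x2 normal equations."""
    totals = zip(*map(partial_sums, partitions))
    n, sx, sxx, sy, sxy = (sum(col) for col in totals)
    det = n * sxx - sx * sx  # determinant of A = X^T X
    if det == 0:
        raise ValueError("X^T X is singular; need two distinct x values")
    # beta = A^{-1} b with A = [[n, sx], [sx, sxx]] and b = [sy, sxy].
    intercept = (sxx * sy - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope

# Points on y = 2x + 1, split across two partitions.
partitions = [[(0, 1), (1, 3)], [(2, 5), (3, 7)]]
intercept, slope = fit_line(partitions)
```

Only the five running sums leave each partition, so the data itself needs a single pass and never moves.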




In-Database Operations


                          Functional Arithmetic



In-Database Operations
                     • A/B Testing (Multivariate testing)
                      • Web companies show people different
                            versions of their website to see which
                            one is the best.
                          • Ad companies compare different ad
                            campaigns and may want to find the one
                            with the higher click-through rate.



In-Database Operations
                     •    Mann-Whitney U Test
                          •   This is a popular test for comparing
                              non-parametric data.
                          •   non-parametric def= a data set that does not fit
                              a well-known distribution
                          •   Combine the values from both samples, rank them, and
                              compute the sum of ranks for each sample.
                          •   $Z = \dfrac{U - n_1 n_2 / 2}{\sqrt{n_1 n_2 (n_1 + n_2 + 1) / 12}}$
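
To make the ranking and the Z formula concrete, here is a minimal Python sketch. It assumes there are no tied values and ranks in ascending order; the function name mann_whitney_z and the sample data are illustrative, not taken from the paper.

```python
import math

def mann_whitney_z(sample1, sample2):
    """Return (U, z) for sample1 via the normal approximation (no ties)."""
    n1, n2 = len(sample1), len(sample2)
    combined = sorted([(v, 1) for v in sample1] + [(v, 2) for v in sample2])
    # Sum of sample1's ranks in the combined ordering (ranks start at 1).
    rank_sum = sum(
        rank
        for rank, (_, label) in enumerate(combined, start=1)
        if label == 1
    )
    u = rank_sum - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    std_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u, (u - mean_u) / std_u

# Every value in the first sample is below every value in the second.
u, z = mann_whitney_z([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

Ranking in descending order instead yields the U statistic of the opposite direction, so only the sign of z changes.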


In-Database Operations
                •         Mann-Whitney U Test
                           •   Given Table T with (sample_id, value)
                          CREATE VIEW R AS
                          SELECT sample_id, avg(value) AS sample_avg,
                             sum(rown) AS rank_sum, count(*) AS sample_n,
                             sum(rown) - count(*) * (count(*) + 1) / 2 AS sample_u
                           FROM ( SELECT sample_id,
                             row_number() OVER (ORDER BY value DESC) AS
                             rown, value FROM T) AS ordered
                           GROUP BY sample_id



In-Database Operations
                •         Mann-Whitney U Test
                          •   With a large sample count we can get the z_score
                          SELECT r.sample_u, r.sample_avg, r.sample_n,
                              (r.sample_u - a.sum_u / 2) / sqrt(a.sum_u *
                              (a.sum_n + 1) / 12) AS z_score
                          FROM R AS r, (SELECT sum(sample_u) AS
                          sum_u, sum(sample_n) AS sum_n FROM R) AS a
                          GROUP BY r.sample_u, r.sample_avg, r.sample_n,
                          a.sum_n, a.sum_u
 This can be easily implemented as a stored procedure:
 SELECT mann_whitney(value) FROM table
In-Database Operations
                     • Resampling
                       • Sampling several subsets instead of the
                            whole population.
                          • Bootstrapping def= sample into buckets
                            (with replacement), then combine the buckets
                          • Jackknifing def= leave some data out and
                            sample; then combine the runs



In-Database Operations
                     •    Bootstrapping
                          •   Pre-select which trials contain which rows
                          •    Suppose we have population N = 100 and we
                              have subsamples of M = 3
                          CREATE VIEW design AS
                          SELECT a.trial_id, floor(100 * random()) AS
                          row_id
                          FROM generate_series(1, 10000) AS a (trial_id),
                          generate_series(1, 3) AS b (subsample_id)


In-Database Operations
                     •    Bootstrapping
                     •    Calculate the sampling distribution
                          CREATE VIEW trials AS
                          SELECT d.trial_id, AVG(T.value) AS avg_value
                          FROM design d,T
                          WHERE d.row_id = T.row_id
                          GROUP BY d.trial_id



In-Database Operations
                     • Bootstrapping
                     • Final distribution parameters
                          SELECT AVG(avg_value), STDDEV(avg_value)
                          FROM trials
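
The three bootstrapping steps (design, trials, final parameters) can be sketched in plain Python. This is a minimal illustration, assuming an in-memory list in place of table T; the function name bootstrap_means and the fixed seed are choices made for this example.

```python
import random
import statistics

def bootstrap_means(values, n_trials, subsample_size, seed=0):
    """Return the mean of each trial, sampling rows with replacement."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    means = []
    for _ in range(n_trials):
        # Like the design view: pick random row ids, repeats allowed.
        rows = [rng.randrange(len(values)) for _ in range(subsample_size)]
        means.append(statistics.fmean([values[i] for i in rows]))
    return means

# Hypothetical population standing in for table T, with N = 100.
population = [float(i) for i in range(100)]
trial_means = bootstrap_means(population, n_trials=10000, subsample_size=3)

# Final distribution parameters, like AVG and STDDEV over the trials view.
estimate = statistics.fmean(trial_means)
spread = statistics.stdev(trial_means)
```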




MAD DBMS
                     • Loading/Unloading
                      • Scatter/Gather Streaming – fast!
                      • Extract-Transform-Load vs Extract-Load-
                            Transform
                          • Greenplum has MapReduce support!


MAD DBMS
                     •    Storage
                          •   Tunable table types:
                              •   external tables (e.g. files)
                              •   heap tables (frequent updates)
                              •   append-only tables (rare updates, compressible)
                              •   column-stores flexibility
                          •   All tables have a distribution policy


MAD DBMS
                     •    Partitioning
                          •   Partition by a range of values or by a list of values
                              •   e.g. partition by timestamp: old data goes to
                                  compressed tables, new data goes to heap
                                  storage.
                          •   Query optimizer knows the partitioning
                              scheme
                          •   This is useful for staging data


MAD DBMS
                     •    MAD Programming
                          •   Use your favorite programming language extension
                          •   Programmers must think through how the code works
                              without shared memory (data-parallel).
                          •   Map and Reduce functions can be written in Python,
                              Perl, or R.
                          •   They run using the Scatter/Gather technique. (Any
                              table type)
                          •   SQL is not necessary!
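
As a rough sketch of the data-parallel style described above, the following plain-Python example separates a map step from a reduce step with no shared state between partitions. It does not use Greenplum's MapReduce interface; the function names and the sample partitions are assumptions for illustration.

```python
from collections import defaultdict
from functools import reduce

def map_terms(line):
    """Map step: emit (term, 1) pairs using only the line it is given."""
    return [(term.lower(), 1) for term in line.split()]

def reduce_counts(acc, pair):
    """Reduce step: fold one (term, count) pair into the running totals."""
    term, count = pair
    acc[term] += count
    return acc

def run(partitions):
    # Each partition is mapped independently, as it would be on its own node.
    mapped = [
        pair
        for part in partitions
        for line in part
        for pair in map_terms(line)
    ]
    return dict(reduce(reduce_counts, mapped, defaultdict(int)))

# Two hypothetical partitions of input lines.
counts = run([["big data", "Big analysis"], ["data data"]])
```

Because map_terms never reads anything outside its own input, the partitions can be processed in any order or on different machines.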


MAD DBMS
                     • Future
                      • Need a better repository for the library
                            of functions. (Morpheus SIGMOD06)
                          • Automatically choose data and table
                            layouts
                          • Use online aggregation for faster answers
                            to queries.


Questions

                     • Why would one be opposed to the M.A.D.
                          paradigm?
                     • How do MADLib and Greenplum fit into
                          the NoSQL movement?
                     • Is there a market for an AppStore for
                          functions/utilities for MADLib?



Conclusion
                     • This paper describes the M.A.D. (Magnetic
                          Agile Deep) mindset and the M.A.D. library.
                     • It demonstrated how to perform analytics
                          inside the DB using SQL and stored
                          procedures.
                     • It also described the Greenplum database.
                          A general data-parallel database for simple
                          and complex queries from flexible tables.


Thank You!
MADLib Installation
                              (ubuntu)
                     •    sudo apt-get install postgresql flex bison doxygen
                          libboost1.42-* git graphviz libarmadillo-dev
                          libarmadillo-doc
                     •    sudo apt-get install lapack++ blas++
                     •    git clone https://github.com/madlib/madlib.git
                     •    cd madlib
                     •    ./configure
                     •    cd build
                     •    make
  https://github.com/madlib/madlib/wiki/Building-MADlib-from-Source



Matrix Multiplication Demo
                          -- Create tables in the new matrix format
                          CREATE TABLE A (
                              row_number INTEGER,
                              column_number INTEGER,
                              value DECIMAL
                          );
                          INSERT INTO A VALUES
                              (1, 1, 1),
                              (1, 2, 2),
                              (2, 1, 3),
                              (2, 2, 4);

                          CREATE TABLE B (
                              row_number INTEGER,
                              column_number INTEGER,
                              value DECIMAL
                          );
                          INSERT INTO B VALUES
                              (1, 1, 5),
                              (1, 2, 6),
                              (2, 1, 7),
                              (2, 2, 8);

                          -- Check to see that the values are multiplying correctly
                          SELECT A.row_number, B.column_number,
                                 A.value::text || '*' || B.value::text
                          FROM A, B
                          WHERE A.column_number = B.row_number;

                          -- Add the GROUP BY to see the sum work
                          SELECT A.row_number, B.column_number, SUM(A.value * B.value)
                          FROM A, B
                          WHERE A.column_number = B.row_number
                          GROUP BY A.row_number, B.column_number;

                          DROP TABLE A;
                          DROP TABLE B;


Moodle is dead... 	Iain Bruce, James Blair, Michael O'LoughlinMoodle is dead... 	Iain Bruce, James Blair, Michael O'Loughlin
Moodle is dead... Iain Bruce, James Blair, Michael O'Loughlin
 
Sintesis informativa 22 de marzo 2017
Sintesis informativa 22 de marzo 2017Sintesis informativa 22 de marzo 2017
Sintesis informativa 22 de marzo 2017
 
Announcements- Thursday, March 23, 2017
Announcements- Thursday, March 23, 2017Announcements- Thursday, March 23, 2017
Announcements- Thursday, March 23, 2017
 

Similar a MAD Skills: New Analysis Practices for Big Data

Needs for Data Management & Citation Throughout the Information Lifecycle
Needs for Data Management & Citation Throughout  the Information LifecycleNeeds for Data Management & Citation Throughout  the Information Lifecycle
Needs for Data Management & Citation Throughout the Information LifecycleMicah Altman
 
Realtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBaseRealtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBaselarsgeorge
 
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012Gigaom
 
Supporting Libraries in Leading the Way in Research Data Management
Supporting Libraries in Leading the Way in Research Data ManagementSupporting Libraries in Leading the Way in Research Data Management
Supporting Libraries in Leading the Way in Research Data ManagementMarieke Guy
 
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...Cloudera, Inc.
 
MapR LucidWorks Joint Webinar 121211
MapR LucidWorks Joint Webinar 121211MapR LucidWorks Joint Webinar 121211
MapR LucidWorks Joint Webinar 121211MapR Technologies
 
Ml pluss ejan2013
Ml pluss ejan2013Ml pluss ejan2013
Ml pluss ejan2013CS, NcState
 
ESI Supplemental Webinar 2 - DataONE presentation slides
ESI Supplemental Webinar 2 - DataONE presentation slides ESI Supplemental Webinar 2 - DataONE presentation slides
ESI Supplemental Webinar 2 - DataONE presentation slides DuraSpace
 
e-Science, Research Data and Libaries
e-Science, Research Data and Libariese-Science, Research Data and Libaries
e-Science, Research Data and LibariesRob Grim
 
The architecture of data analytics PaaS on AWS
The architecture of data analytics PaaS on AWSThe architecture of data analytics PaaS on AWS
The architecture of data analytics PaaS on AWSTreasure Data, Inc.
 
How Pacific Northwest National Laboratory Uses AWS to Enable Data Science & R...
How Pacific Northwest National Laboratory Uses AWS to Enable Data Science & R...How Pacific Northwest National Laboratory Uses AWS to Enable Data Science & R...
How Pacific Northwest National Laboratory Uses AWS to Enable Data Science & R...Amazon Web Services
 
Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
Large-Scale Search Discovery Analytics with Hadoop, Mahout, SolrLarge-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
Large-Scale Search Discovery Analytics with Hadoop, Mahout, SolrDataWorks Summit
 
Working with Instrument Data (GlobusWorld Tour - UMich)
Working with Instrument Data (GlobusWorld Tour - UMich)Working with Instrument Data (GlobusWorld Tour - UMich)
Working with Instrument Data (GlobusWorld Tour - UMich)Globus
 
MapR lucidworks joint webinar
MapR lucidworks joint webinarMapR lucidworks joint webinar
MapR lucidworks joint webinarTed Dunning
 
Hadoop: A Hands-on Introduction
Hadoop: A Hands-on IntroductionHadoop: A Hands-on Introduction
Hadoop: A Hands-on IntroductionClaudio Martella
 
Gilbane Boston 2012: XML and SQL: Not Dead Yet
Gilbane Boston 2012: XML and SQL: Not Dead YetGilbane Boston 2012: XML and SQL: Not Dead Yet
Gilbane Boston 2012: XML and SQL: Not Dead YetPeter O'Kelly
 
What Is GDS and Neo4j’s GDS Library
What Is GDS and Neo4j’s GDS LibraryWhat Is GDS and Neo4j’s GDS Library
What Is GDS and Neo4j’s GDS LibraryNeo4j
 

Similar a MAD Skills: New Analysis Practices for Big Data (20)

NISO Forum, Denver, Sept. 24, 2012: Needs for Data Management & Citation Thro...
NISO Forum, Denver, Sept. 24, 2012: Needs for Data Management & Citation Thro...NISO Forum, Denver, Sept. 24, 2012: Needs for Data Management & Citation Thro...
NISO Forum, Denver, Sept. 24, 2012: Needs for Data Management & Citation Thro...
 
Needs for Data Management & Citation Throughout the Information Lifecycle
Needs for Data Management & Citation Throughout  the Information LifecycleNeeds for Data Management & Citation Throughout  the Information Lifecycle
Needs for Data Management & Citation Throughout the Information Lifecycle
 
Realtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBaseRealtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBase
 
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012
THE 3V's OF BIG DATA: VARIETY, VELOCITY, AND VOLUME from Structure:Data 2012
 
Supporting Libraries in Leading the Way in Research Data Management
Supporting Libraries in Leading the Way in Research Data ManagementSupporting Libraries in Leading the Way in Research Data Management
Supporting Libraries in Leading the Way in Research Data Management
 
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
 
MapR LucidWorks Joint Webinar 121211
MapR LucidWorks Joint Webinar 121211MapR LucidWorks Joint Webinar 121211
MapR LucidWorks Joint Webinar 121211
 
Ml pluss ejan2013
Ml pluss ejan2013Ml pluss ejan2013
Ml pluss ejan2013
 
ESI Supplemental Webinar 2 - DataONE presentation slides
ESI Supplemental Webinar 2 - DataONE presentation slides ESI Supplemental Webinar 2 - DataONE presentation slides
ESI Supplemental Webinar 2 - DataONE presentation slides
 
e-Science, Research Data and Libaries
e-Science, Research Data and Libariese-Science, Research Data and Libaries
e-Science, Research Data and Libaries
 
251 carpenter
251 carpenter251 carpenter
251 carpenter
 
The architecture of data analytics PaaS on AWS
The architecture of data analytics PaaS on AWSThe architecture of data analytics PaaS on AWS
The architecture of data analytics PaaS on AWS
 
How Pacific Northwest National Laboratory Uses AWS to Enable Data Science & R...
How Pacific Northwest National Laboratory Uses AWS to Enable Data Science & R...How Pacific Northwest National Laboratory Uses AWS to Enable Data Science & R...
How Pacific Northwest National Laboratory Uses AWS to Enable Data Science & R...
 
Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
Large-Scale Search Discovery Analytics with Hadoop, Mahout, SolrLarge-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
 
Working with Instrument Data (GlobusWorld Tour - UMich)
Working with Instrument Data (GlobusWorld Tour - UMich)Working with Instrument Data (GlobusWorld Tour - UMich)
Working with Instrument Data (GlobusWorld Tour - UMich)
 
MapR lucidworks joint webinar
MapR lucidworks joint webinarMapR lucidworks joint webinar
MapR lucidworks joint webinar
 
Hawaii Pacific GIS Conference 2012: LiDAR for Intrastructure and Terrian Mapp...
Hawaii Pacific GIS Conference 2012: LiDAR for Intrastructure and Terrian Mapp...Hawaii Pacific GIS Conference 2012: LiDAR for Intrastructure and Terrian Mapp...
Hawaii Pacific GIS Conference 2012: LiDAR for Intrastructure and Terrian Mapp...
 
Hadoop: A Hands-on Introduction
Hadoop: A Hands-on IntroductionHadoop: A Hands-on Introduction
Hadoop: A Hands-on Introduction
 
Gilbane Boston 2012: XML and SQL: Not Dead Yet
Gilbane Boston 2012: XML and SQL: Not Dead YetGilbane Boston 2012: XML and SQL: Not Dead Yet
Gilbane Boston 2012: XML and SQL: Not Dead Yet
 
What Is GDS and Neo4j’s GDS Library
What Is GDS and Neo4j’s GDS LibraryWhat Is GDS and Neo4j’s GDS Library
What Is GDS and Neo4j’s GDS Library
 

Último

Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 

Último (20)

Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 

MAD Skills: New Analysis Practices for Big Data

  • 1. MAD Skills: New Analysis Practices for Big Data Presented By: Christan Grant cgrant@cise.ufl.edu Thursday, August 25, 11 1
  • 2. MAD Skills: New Analysis Practices for Big Data • Authors • Jeff Cohen – Greenplum • Brian Dolan – Fox Audience Network • Mark Dunlap – Evergreen Technologies • Joseph M. Hellerstein – UC Berkeley • Caleb Welton – Greenplum • Presented at the Very Large Data Bases (VLDB) Conference 2009 in Lyon, France
  • 3. If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis. – Prof. Hal Varian, Chief Economist at Google
  • 4. • Storage is cheap • The world's largest data warehouse of ~15 years ago can be stored on disks for about $2000.
  • 5. • Traditionally, a lot of time is spent building well-structured data warehouses. • Contains summaries of data • Main analytics location • Jealously guarded by IT
  • 8. • Data analysis is common culture • Culture is to collect and analyze data in different business units. • “In God we trust; all others must bring data” • Analytics are crucial – Analysts need more access and more permissions • Hence M.A.D.
  • 9. • Magnetic – Attract data and practitioners. The database must be painless and efficient. • Agile – Analysts need to ingest and analyze on the fly. • Deep – Sophisticated analytics at scale. Don’t force analysts to work on samples (where the long tail is lost). • Library of tools
  • 10. Outline • Background • FOX Audience Network • Being MAD • In-Database Operations • MAD DBMS • Conclusions • Questions
  • 11. Background • OLAP and Data Cubes • Data structure that provides descriptive statistics – Simple summaries (sum, average, std) • We want inferential/inductive statistics – Predictions, causality analysis, distributional comparison
  • 12. Background • Databases and Statistical Packages • Many analysts download data to use in Excel/SAS/Matlab/R or their favorite programming language (FORTRAN?) • Use matrix/vector operations • Most of these stat packages require data to fit in RAM • Taking samples from the full data to fit into RAM results in loss of precision • External toolkits may also lack parallelism
  • 13. Background • MapReduce and Parallel Programming • Strengths of MR are scalable batch parallelism and fault tolerance. • There is work on putting ML algorithms in the MapReduce framework (Mahout) • MR is just a programming framework and works well with MAD
  • 14. Background • Data Mining and Analytics in the DB • Plenty of work in this area • Methods are typically implemented as a black box • Algorithms are not as flexible as in programmed packages
  • 15. FOX Audience Network • Served MySpace.com, IGN.com, Scout.com etc. • About 150 million users • Ad network, bought by Rubicon in Nov 2010
  • 16. Fox Audience Network • 42 nodes (2 masters, 40 workers) • Sun X4500 Thumper • 48 500 GB drives • 16 GB RAM • ~5 TB daily • One table has 1.5 trillion rows • Many different types of user workloads • Dynamic query ecosystem
  • 17. Fox Audience Network • Use Cases • Queries to the system are vastly different • Example queries: • How many female WWF enthusiasts under the age of 30 visited the Toyota community over the last four days and saw a medium rectangle? • Expensive • How are these people similar to those that visited Nissan? • Open-ended, requires some statistics and the analyst to be in the loop.
  • 18. Being MAD • Analysts (Data Scientists) trump DBAs • They crave data • They can work with dirty data • They like all data (not samples) • Have a belt of tools
  • 19. Being MAD • Get data to the warehouse as soon as possible • Cleaning and integration should be done intelligently
  • 20. Being MAD • Three logical layers for data stores, plus sandboxes: • Reporting – Specialized static aggregates • Production – Aggregates for most users • Staging – Raw tables and logs • Sandboxes – Playground for analysts • The logical levels encourage creativity
  • 21. Being MAD • Data Scientists need power tools: Math + Stats + ML • The MADLib approach is to develop a hierarchy of mathematical concepts in the database.
  • 22. In-Database Operations • User Defined Functions (UDF) • SQL or PL/pgSQL • Internal (C) • Dynamically loaded (C, C++, Java, JavaScript, Perl, Python, R, Ruby, Scheme, etc.) • SELECT f1(city) FROM states WHERE f2(capital)
  • 23. In-Database Operations • Value abstraction levels • Scalar: 1.0 • Vector: [0.3, 0.3, 0.4] • Matrix: [[0.3, 0.7], [0.6, 0.4]] • Function: f(•) → value
  • 24. In-Database Operations Scalar Arithmetic
  • 25. In-Database Operations • Scalar Arithmetic • SELECT 5*4; • SELECT sqrt(64); • SELECT cos(-3.14159 * sqrt(2) / 2 );
  • 26. In-Database Operations Vector Arithmetic
  • 27. In-Database Operations • Vector • Array objects: float8[] • Array operations are implemented as UDFs • Technically violates first normal form
  • 29. In-Database Operations Matrix Arithmetic
  • 30. In-Database Operations • Matrix Representation #1 CREATE TABLE B ( row_number integer, vector numeric[] ); http://www.postgresql.org/docs/current/static/arrays.html
  • 31. In-Database Operations • Matrix addition: SELECT A.row_number, A.vector + B.vector FROM A, B WHERE A.row_number = B.row_number • This can be easily implemented as a primitive operator: SELECT A ** B FROM A, B
  • 32. In-Database Operations • Matrix and vector multiplication (Av): SELECT 1, array_accum(row_number, vector*v) FROM A • array_accum(x, v) is a custom function. It takes the value v and puts it in the row indexed by x.
  • 33. In-Database Operations • Matrix Representation #2 CREATE TABLE A ( row_number INTEGER, column_number INTEGER, value DECIMAL ); • Great sparse representation! http://www.postgresql.org/docs/current/static/arrays.html
  • 34. In-Database Operations • Matrix Multiplication (using rep #2): SELECT A.row_number, B.column_number, SUM(A.value * B.value) FROM A, B WHERE A.column_number = B.row_number GROUP BY A.row_number, B.column_number
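The sparse matrix multiplication on slide 34 can be tried without a Greenplum or PostgreSQL server. The sketch below is an illustration only: it assumes SQLite (through Python's standard `sqlite3` module) as a stand-in database, and the helper name `sparse_matmul` is hypothetical, not something from the slides. The query itself is the one from the slide, with an `ORDER BY` added so the output order is predictable.

```python
import sqlite3


def sparse_matmul(a_entries, b_entries):
    """Multiply two sparse matrices given as (row, column, value) triples.

    Hypothetical helper: SQLite stands in for the parallel database here.
    Entries that are zero are simply not stored, as in representation #2.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE A (row_number INTEGER, column_number INTEGER, value REAL)"
    )
    conn.execute(
        "CREATE TABLE B (row_number INTEGER, column_number INTEGER, value REAL)"
    )
    conn.executemany("INSERT INTO A VALUES (?, ?, ?)", a_entries)
    conn.executemany("INSERT INTO B VALUES (?, ?, ?)", b_entries)
    # Join on the shared inner index, then sum the products per output cell.
    rows = conn.execute(
        """
        SELECT A.row_number, B.column_number, SUM(A.value * B.value)
        FROM A, B
        WHERE A.column_number = B.row_number
        GROUP BY A.row_number, B.column_number
        ORDER BY A.row_number, B.column_number
        """
    ).fetchall()
    conn.close()
    return rows


# A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]], zeros omitted.
product = sparse_matmul(
    [(1, 1, 1), (1, 2, 2), (2, 2, 3)],
    [(1, 1, 4), (2, 1, 5), (2, 2, 6)],
)
print(product)
```

Output cells whose products are all zero never appear in the join, so the result stays sparse as well.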
  • 35. In-Database Operations • These operations require one pass over the data and can be done in parallel. • What about matrix division and matrix inverse? • SQL is awkward because of its lack of loops for iteration.
  • 36. In-Database Operations • Document similarity for fraud detection • Check whether several advertisers are linking to pages with the same content. • If several advertisers have outbound links to similar documents, they may be using stolen credit cards. • Especially if the outbound documents are malicious.
  • 37. In-Database Operations • For document similarity we use tf-idf. This is a score for a term in a given document. For a term $t$ in document $d$: $\text{tf-idf}_{t,d} = \text{tf}_{t,d} \cdot \text{idf}_t$, where $\text{tf}_{t,d}$ is the count of term $t$ in document $d$ and $\text{idf}_t = \log\left( |\text{docs}| / |\text{docs containing } t| \right)$
  • 38. In-Database Operations • Table docs contains triples (document, term, count) • The tf UDF in SQL takes a term and a document as parameters: SELECT count FROM docs WHERE term = t AND document = d • The idf UDF in SQL takes a term as a parameter: SELECT log((SELECT count(DISTINCT document) FROM docs)::float8 / (SELECT count(DISTINCT document) FROM docs WHERE term = t))
  • 39. In-Database Operations • Each doc will have a vector with the tf-idf scores; we can compare the vectors using cosine similarity. $\cos\theta = \frac{v \cdot u}{\lVert v \rVert \, \lVert u \rVert}$; a small $\theta$ means $v$ and $u$ are similar • The SQL is straightforward using the vector operators
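To make the tf-idf and cosine-similarity slides concrete, here is a minimal sketch in plain Python rather than SQL. It assumes the same (document, term, count) triples as the `docs` table, uses the natural logarithm for idf (the slides do not say which base), and the function names `tfidf_vectors` and `cosine_similarity` are hypothetical.

```python
import math
from collections import defaultdict


def tfidf_vectors(triples):
    """Build {document: {term: tf-idf score}} from (document, term, count) triples."""
    all_docs = {doc for doc, _, _ in triples}
    docs_with_term = defaultdict(set)
    for doc, term, _ in triples:
        docs_with_term[term].add(doc)

    vectors = defaultdict(dict)
    for doc, term, count in triples:
        # idf_t = log(|docs| / |docs containing t|); natural log is an assumption.
        idf = math.log(len(all_docs) / len(docs_with_term[term]))
        vectors[doc][term] = count * idf
    return dict(vectors)


def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (||u|| * ||v||) for sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # An all-zero vector has no direction; treat as dissimilar.
    return dot / (norm_u * norm_v)


# Made-up example data: d1 and d2 share their terms, d3 does not.
triples = [
    ("d1", "ad", 2), ("d1", "car", 1),
    ("d2", "ad", 2), ("d2", "car", 1),
    ("d3", "loan", 3),
]
vectors = tfidf_vectors(triples)
print(cosine_similarity(vectors["d1"], vectors["d2"]))
print(cosine_similarity(vectors["d1"], vectors["d3"]))
```

A term that occurs in every document gets an idf of zero, so it contributes nothing to the similarity, which is the intended effect of the weighting.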
  • 40. In-Database Operations • Ordinary Least Squares is used to fit a line to a collection of points. • This helps to describe seasonal trends; it's a first look at the data. • Solve $y = X\beta$ where $X$ is an $n \times k$ matrix. • Estimate $\beta$ using $\beta^* = (X^T X)^{-1} X^T y$ • We can parallelize the computation: • Distribute: $A \leftarrow X^T X$, $b \leftarrow X^T y$ • Collect: $A$ and $b$ • Calculate: $A^{-1}$ • Multiply: $A^{-1} b$
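The distribute/collect steps for least squares can be sketched as follows, assuming each "partition" is a list of (x, y) pairs held by one worker. The function names are hypothetical, and one detail differs from the slide: instead of forming $A^{-1}$ explicitly, the sketch solves $A\beta = b$ by Gauss-Jordan elimination, which gives the same $\beta^*$ when $A$ is invertible.

```python
def partial_sums(rows):
    """One worker's share: accumulate X^T X and X^T y over its own rows."""
    k = len(rows[0][0])
    A = [[0.0] * k for _ in range(k)]
    b = [0.0] * k
    for x, y in rows:
        for i in range(k):
            b[i] += x[i] * y
            for j in range(k):
                A[i][j] += x[i] * x[j]
    return A, b


def solve(A, b):
    """Solve A beta = b by Gauss-Jordan elimination with partial pivoting.

    Assumes A is invertible; a singular A raises ZeroDivisionError.
    """
    k = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [m_r - factor * m_c for m_r, m_c in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]


def ols(partitions):
    """Collect the per-partition sums, then solve the small k x k system."""
    parts = [partial_sums(p) for p in partitions]
    k = len(parts[0][1])
    A = [[sum(p[0][i][j] for p in parts) for j in range(k)] for i in range(k)]
    b = [sum(p[1][i] for p in parts) for i in range(k)]
    return solve(A, b)


# Points on the line y = 1 + 2x, split across two workers.
# Each x vector is [1, x] so the first coefficient is the intercept.
partitions = [
    [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0)],
    [([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)],
]
print(ols(partitions))  # approximately [1.0, 2.0]
```

Only the k × k matrix and the length-k vector travel between workers and the collector, so the data moved is independent of the number of rows n.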
  • 41. In-Database Operations Functional Arithmetic
  • 42. In-Database Operations • A/B Testing (Multivariate testing) • Web companies show people different versions of their website to see which one is the best. • Ad companies compare different ad campaigns and may want to find the one with the higher click-through rate.
  • 43. In-Database Operations • Mann-Whitney U Test • This is a popular test to compare non-parametric data. • Non-parametric: a data set that does not fit a well-known distribution • Combine the data values, rank them, and find the average position in the list for each sample. • $z = \frac{U - n_1 n_2 / 2}{\sqrt{n_1 n_2 (n_1 + n_2 + 1) / 12}}$
  • 44. In-Database Operations • Mann-Whitney U Test • Given table T with (sample_id, value): CREATE VIEW R AS SELECT sample_id, avg(value) AS sample_avg, sum(rown) AS rank_sum, count(*) AS sample_n, sum(rown) - count(*) * (count(*) + 1) / 2 AS sample_u FROM ( SELECT sample_id, row_number() OVER (ORDER BY value DESC) AS rown, value FROM T) AS ordered GROUP BY sample_id
  • 46. In-Database Operations • Mann-Whitney U Test • With a large sample count we can get the z_score: SELECT r.sample_u, r.sample_avg, r.sample_n, (r.sample_u - a.sum_u / 2) / sqrt(a.sum_u * (a.sum_n + 1) / 12) AS z_score FROM R AS r, (SELECT sum(sample_u) AS sum_u, sum(sample_n) AS sum_n FROM R) AS a GROUP BY r.sample_u, r.sample_avg, r.sample_n, a.sum_n, a.sum_u • This can be easily implemented as a stored procedure: SELECT mann_whitney(value) FROM table
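The same computation can be followed step by step in a few lines of Python. This is a sketch under stated assumptions: exactly two samples, ranks assigned in descending order of value as in the `row_number()` query, ties broken arbitrarily with no tie correction, and a hypothetical function name `mann_whitney_z`.

```python
import math


def mann_whitney_z(sample_a, sample_b):
    """Return (U_a, U_b, z_a) using the normal approximation, ignoring ties."""
    labelled = [(v, "a") for v in sample_a] + [(v, "b") for v in sample_b]
    # Rank 1 goes to the largest value, mirroring ORDER BY value DESC.
    labelled.sort(key=lambda pair: pair[0], reverse=True)

    rank_sum = {"a": 0, "b": 0}
    for rank, (_, label) in enumerate(labelled, start=1):
        rank_sum[label] += rank

    n_a, n_b = len(sample_a), len(sample_b)
    # U = rank sum minus the smallest rank sum the sample could have had.
    u_a = rank_sum["a"] - n_a * (n_a + 1) / 2
    u_b = rank_sum["b"] - n_b * (n_b + 1) / 2

    mean_u = n_a * n_b / 2
    std_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    return u_a, u_b, (u_a - mean_u) / std_u


u_a, u_b, z = mann_whitney_z([1, 2, 3], [4, 5, 6])
print(u_a, u_b, z)
```

Because $U_a + U_b = n_1 n_2$, the SQL on the slide can use `sum_u / 2` for the mean and `sum_u * (sum_n + 1) / 12` under the square root without ever naming the two sample sizes separately.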
  • 47. In-Database Operations • Resampling • Sampling several subsets instead of the whole population. • Bootstrapping def= Sampling to buckets (w/ replacement) then combine buckets • Jackknifing def= Leave some data out and sample; then combine runs Thursday, August 25, 11 44
  • 48. In-Database Operations • Bootstrapping • Pre-select which trials contain which rows • Suppose we have population N = 100 and we have subsamples of M = 3 CREATE VIEW design AS SELECT a.trial_id, floor(100 * random()) AS row_id FROM generate_series(1, 10000) AS a (trial_id), generate_series(1, 3) AS b (subsample_id) Thursday, August 25, 11 45
  • 49. In-Database Operations • Bootstrapping • Calculate the sampling distribution CREATE VIEW trials AS SELECT d.trial_id, AVG(T.value) AS avg_value FROM design d,T WHERE d.row_id = T.row_id GROUP BY d.trial_id Thursday, August 25, 11 46
  • 50. In-Database Operations
      • Bootstrapping
      • Final distribution parameters
          SELECT AVG(avg_value), STDDEV(avg_value)
          FROM trials
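The three bootstrapping steps (design, trials, final parameters) can be mirrored in a few lines of Python. This is a sketch under stated assumptions rather than the presenters' code: the function name bootstrap_means and the seed parameter are hypothetical, and the population is held in an in-memory list instead of a table.

```python
import random
from statistics import mean, stdev


def bootstrap_means(values, trials=10000, subsample=3, seed=0):
    """Return one average per trial, sampling row ids with replacement."""
    rng = random.Random(seed)
    n = len(values)
    # "design": pre-select which rows each trial contains.
    design = [[rng.randrange(n) for _ in range(subsample)]
              for _ in range(trials)]
    # "trials": average the sampled values for each trial.
    return [mean(values[i] for i in rows) for rows in design]


def distribution_parameters(trial_means):
    """Final step: mean and standard deviation of the sampling distribution."""
    return mean(trial_means), stdev(trial_means)
```

With a population of 100 rows, distribution_parameters(bootstrap_means(population)) corresponds to the final AVG/STDDEV query over the trials view.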
  • 51. MAD DBMS
      • Loading/Unloading
      • Scatter/Gather Streaming: fast!
      • Extract-Transform-Load vs. Extract-Load-Transform
      • Greenplum has MapReduce support!
  • 52. MAD DBMS
      • Storage
      • Tunable table types (for flexibility):
          • external tables (e.g. files)
          • heap tables (frequent updates)
          • append-only tables (rare updates; compressible)
          • column-stores
      • All tables have a distribution policy
  • 53. MAD DBMS
      • Partitioning
      • Partition by range of values or by list of column values
      • E.g., partition by timestamp: old data goes to a compressed table, new data goes to heap storage.
      • Query optimizer knows the partitioning scheme
      • This is useful for staging data
  • 54. MAD DBMS
      • MAD Programming
      • Use your favorite programming language extension
      • Programmers must think through how the code works without shared memory (data-parallel).
      • Map and Reduce functions can be written in Python, Perl, or R.
      • They run using the Scatter/Gather technique. (Any table type)
      • SQL is not necessary!
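To make the "no shared memory" constraint concrete, here is a small single-process Python sketch of a map function and a reduce function that compute a per-sample average. It only simulates the data-parallel pattern and does not use Greenplum's actual MapReduce interface; map_fn, reduce_fn, and run_map_reduce are hypothetical names.

```python
from collections import defaultdict


def map_fn(record):
    """Emit (key, value) pairs from one record; touches no shared state."""
    sample_id, value = record
    yield sample_id, (value, 1)


def reduce_fn(key, values):
    """Combine all (sum, count) pairs for one key into (key, average)."""
    total = sum(v for v, _ in values)
    count = sum(c for _, c in values)
    return key, total / count


def run_map_reduce(partitions):
    """Map each partition independently, group by key, then reduce."""
    grouped = defaultdict(list)
    for part in partitions:          # each partition could live on its own node
        for record in part:
            for key, value in map_fn(record):
                grouped[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in grouped.items())
```

Because map_fn sees only its own record and reduce_fn sees only the values for one key, the partitions can be processed in any order or on separate machines.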
  • 55. MAD DBMS
      • Future
      • Need a better repository for the library of functions. (Morpheus SIGMOD06)
      • Automatically choose data and table layouts.
      • Use online aggregation for faster answers to queries.
  • 59. Questions
      • Why would one be opposed to the M.A.D. paradigm?
      • How do MADLib and Greenplum fit into the NoSQL movement?
      • Is there a market for an AppStore for functions/utilities for MADLib?
  • 60. Conclusion
      • This paper describes the M.A.D. (Magnetic, Agile, Deep) mindset and the M.A.D. library.
      • It demonstrates how to perform analytics inside the DB using SQL and stored procedures.
      • It also describes the Greenplum database, a general data-parallel database for simple and complex queries over flexible tables.
  • 62. MADLib Installation (Ubuntu)
      • sudo apt-get install postgresql flex bison doxygen libboost1.42-* git graphviz libarmadillo-dev libarmadillo-doc
      • sudo apt-get install lapack++ blas++
      • git clone https://github.com/madlib/madlib.git
      • cd madlib
      • ./configure
      • cd build
      • make
      • https://github.com/madlib/madlib/wiki/Building-MADlib-from-Source
  • 63. Matrix Multiplication Demo
          -- Create tables in the new matrix format
          CREATE TABLE A (
            row_number INTEGER,
            column_number INTEGER,
            value DECIMAL
          );
          INSERT INTO A VALUES
            (1, 1, 1),
            (1, 2, 2),
            (2, 1, 3),
            (2, 2, 4);

          CREATE TABLE B (
            row_number INTEGER,
            column_number INTEGER,
            value DECIMAL
          );
          INSERT INTO B VALUES
            (1, 1, 5),
            (1, 2, 6),
            (2, 1, 7),
            (2, 2, 8);

          -- Check to see that the values are multiplying correctly
          SELECT A.row_number, B.column_number,
                 A.value::text || '*' || B.value::text
          FROM A, B
          WHERE A.column_number = B.row_number;

          -- Add the GROUP BY to see the sum work
          SELECT A.row_number, B.column_number, SUM(A.value * B.value)
          FROM A, B
          WHERE A.column_number = B.row_number
          GROUP BY A.row_number, B.column_number;

          DROP TABLE A;
          DROP TABLE B;
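The same join-then-aggregate logic can be cross-checked with a short Python sketch over (row_number, column_number, value) triples. The function name sparse_matmul is hypothetical, and the data below are the values inserted into A and B in the demo.

```python
from collections import defaultdict


def sparse_matmul(a, b):
    """Multiply two matrices stored as (row, column, value) triples."""
    # Index B by row so the join A.column_number = B.row_number is a lookup.
    b_by_row = defaultdict(list)
    for row, col, val in b:
        b_by_row[row].append((col, val))

    # GROUP BY (A.row_number, B.column_number) with SUM(A.value * B.value).
    result = defaultdict(int)
    for row, col, val in a:
        for b_col, b_val in b_by_row.get(col, []):
            result[(row, b_col)] += val * b_val
    return dict(result)


A = [(1, 1, 1), (1, 2, 2), (2, 1, 3), (2, 2, 4)]
B = [(1, 1, 5), (1, 2, 6), (2, 1, 7), (2, 2, 8)]
product = sparse_matmul(A, B)
```

Here product maps each (row, column) pair to its entry, for example product[(1, 1)] is 1*5 + 2*7 = 19.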