12. Greenplum 6 Postgres
v8.4 – 2314 commits
v9.0 – 1859 commits
v9.1 – 2035 commits
v9.2 – 1945 commits
v9.3 – 1603 commits
v9.4 – 1964 commits
TOTAL: 11,720 Commits Merged
Code Quality via Open Source
Optimized for Big Data in Greenplum
“Customers frequently called out the open-source alignment with PostgreSQL as a strong and cost-effective positive”
-- Gartner MQ 2019
13. Greenplum 6 OLTP
● 24,448 TPS for update transactions in GP6
● 46,570 TPS for single-row inserts in GP6
● 140,000 TPS for select-only queries in GP6
Real-world analytical database and data warehouse use cases require a mixed workload of long and short queries, as well as updates and deletes.
15. Greenplum 6 Replicated Tables

With replicated tables:

create table table_replicated (a int , b text)
distributed replicated;

insert into table_replicated
select id, 'val ' || id
from generate_series (1,10000) id;

select pg_relation_size('table_replicated');
 pg_relation_size
------------------
           917504

With non-replicated tables:

create table table_non_replicated (a int , b text)
distributed randomly;

insert into table_non_replicated
select id, 'val ' || id
from generate_series (1,10000) id;

select pg_relation_size('table_non_replicated');
 pg_relation_size
------------------
           458752

The size of a replicated table is multiplied by the number of primary segments.

select gp_segment_id, count(*) from table_replicated
group by 1;
ERROR: column "gp_segment_id" does not exist
LINE 1: select gp_segment_id, count(*) from ...
               ^

select gp_segment_id, count(*) from
table_non_replicated group by 1;
 gp_segment_id | count
---------------+-------
             0 |  5011
             1 |  4989

The gp_segment_id system column does not exist in replicated tables.
16. Greenplum 6 Replicated Tables Query Plan
explain select count(*) from table_fact f inner join table_replicated d on f.a = d.a;
QUERY PLAN
----------------------------------------------------------------------------------------------------
Aggregate (cost=0.00..874.73 rows=1 width=8)
-> Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..874.73 rows=1 width=8)
-> Aggregate (cost=0.00..874.73 rows=1 width=8)
-> Hash Join (cost=0.00..874.73 rows=50000 width=1)
Hash Cond: (table_fact.a = table_replicated.a)
-> Seq Scan on table_fact (cost=0.00..432.15 rows=50000 width=4)
-> Hash (cost=431.23..431.23 rows=10000 width=4)
-> Seq Scan on table_replicated (cost=0.00..431.23 rows=10000 width=4)
Optimizer: PQO version 3.29.0
explain select count(*) from table_fact f inner join table_non_replicated d on f.a = d.a;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Aggregate (cost=0.00..874.31 rows=1 width=8)
-> Gather Motion 2:1 (slice3; segments: 2) (cost=0.00..874.31 rows=1 width=8)
-> Aggregate (cost=0.00..874.31 rows=1 width=8)
-> Hash Join (cost=0.00..874.31 rows=50000 width=1)
Hash Cond: (table_fact.a = table_non_replicated.a)
-> Redistribute Motion 2:2 (slice1; segments: 2) (cost=0.00..433.15 rows=50000 width=4)
Hash Key: table_fact.a
-> Seq Scan on table_fact (cost=0.00..432.15 rows=50000 width=4)
-> Hash (cost=431.22..431.22 rows=5000 width=4)
-> Redistribute Motion 2:2 (slice2; segments: 2) (cost=0.00..431.22 rows=5000 width=4)
Hash Key: table_non_replicated.a
-> Seq Scan on table_non_replicated (cost=0.00..431.12 rows=5000 width=4)
Optimizer: PQO version 3.29.0
With the replicated table: a single slice and no redistribution.
With the non-replicated table: three slices, with both join inputs moved by Redistribute Motions.
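Both plans scan table_fact, which the deck never defines. A plausible reconstruction, assuming a randomly distributed 50,000-row table (name, distribution, and row count inferred from the plans above):

create table table_fact (a int , b text)
distributed randomly;

insert into table_fact
select id % 10000 + 1, 'val ' || id
from generate_series (1,50000) id;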
20. ETL: Writable CTE
A data-modifying (writable) CTE allows several different DML operations in the same query.
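A minimal sketch of the idea, assuming hypothetical orders and orders_archive tables: one statement both deletes stale rows and archives them.

-- move rows older than a year into the archive, in one query
with moved as (
    delete from orders
    where order_date < now() - interval '1 year'
    returning *
)
insert into orders_archive
select * from moved;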
21. Unlogged Tables
● Writes to unlogged tables bypass the WAL
● Loads are therefore faster
● Unlogged tables are emptied after a DB crash

create unlogged table
table_unlogged
(a int , b text)
distributed randomly;
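A minimal way to see the effect, assuming a hypothetical WAL-logged twin named table_logged; the unlogged load of the same rows typically runs noticeably faster:

create table table_logged (a int , b text)
distributed randomly;

\timing on
-- WAL-logged load
insert into table_logged
select id, 'val ' || id from generate_series (1,1000000) id;
-- same load, no WAL
insert into table_unlogged
select id, 'val ' || id from generate_series (1,1000000) id;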
23. Run Greenplum in Any Environment: Bare Metal, Private Cloud, Public Cloud

Greenplum Building Blocks
• The most performant way to run Greenplum on premise
• Pivotal Blueprint for Dell reference hardware configs
• Superior price/performance; no expensive proprietary hardware
• Certified and supported by Pivotal

Greenplum for Kubernetes: runs on Google Container Engine and other Kubernetes distributions (on VMs or not), including Enterprise & Essentials (OSS K8s).
26. Greenplum on Public Cloud IaaS
● pgBouncer for DB connection pooling
● gpsnap/gpcronsnap for snapshot-based backup and restore
● Automated deployment via Azure Resource Group Deployment, AWS CloudFormation, and GCP Deployment Manager
[Diagram: the data volume of a failed VM is restored from a snapshot]
32. Demo: one query finds people whose names sound like ‘Pavan’ or ‘Peter’, who work at Pivotal, are directly linked, and withdrew more than $200 at an ATM within 2 km of a reference point in the past 24 hours.
drop function if exists get_people(text,text,integer,integer,float,float);
CREATE FUNCTION get_people(text,text,integer,integer,float,float) RETURNS integer
AS $$
-- $1/$2: person names, $3: minimum amount, $4: time window in hours,
-- $5/$6: reference longitude/latitude
declare
  linkchk integer; v1 record; v2 record;
begin
  execute 'truncate table results;';
  -- candidates whose name sounds like $1, who work at 'Pivotal' (GPText),
  -- and who withdrew more than $3 within the last $4 hours at an ATM
  -- less than 2 km from the reference point
  for v1 in
    select distinct a.id, a.firstname, a.lastname, amount, tran_date,
           c.lat, c.lng, address, a.description, d.score
    from people a, transactions b, location c,
         (SELECT w.id, q.score
          FROM people w,
               gptext.search(TABLE(SELECT 1 SCATTER BY 1),
                             'gpadmin.public.people', 'Pivotal', null) q
          WHERE (q.id::integer) = w.id order by 2 desc) d
    where soundex(firstname) = soundex($1)
      and a.id = b.id
      and amount > $3
      and (extract(epoch from now()) - extract(epoch from tran_date))/3600 < $4
      and st_distance_sphere(st_makepoint($5, $6), st_makepoint(c.lng, c.lat))/1000.0 <= 2.0
      and b.locid = c.locid and a.id = d.id
  loop
    -- the same filters for names sounding like $2
    for v2 in
      select distinct a.id, a.firstname, a.lastname, amount, tran_date,
             c.lat, c.lng, address, a.description, d.score
      from people a, transactions b, location c,
           (SELECT w.id, q.score
            FROM people w,
                 gptext.search(TABLE(SELECT 1 SCATTER BY 1),
                               'gpadmin.public.people', 'Pivotal', null) q
            WHERE (q.id::integer) = w.id order by 2 desc) d
      where soundex(firstname) = soundex($2)
        and a.id = b.id
        and amount > $3
        and (extract(epoch from now()) - extract(epoch from tran_date))/3600 < $4
        and st_distance_sphere(st_makepoint($5, $6), st_makepoint(c.lng, c.lat))/1000.0 <= 2.0
        and b.locid = c.locid and a.id = d.id
    loop
      -- MADlib breadth-first search from v1; keep the pair only if v2 is
      -- a direct neighbour (dist = 1)
      execute 'DROP TABLE IF EXISTS out, out_summary;';
      execute 'SELECT madlib.graph_bfs(''people'',''id'',''links'',NULL,'||v1.id||',''out'');';
      select 1 into linkchk from out where dist=1 and id=v2.id;
      if linkchk is not null then
        insert into results values (v1.id,v1.firstname,v1.lastname,v1.amount,v1.tran_date,v1.lat,v1.lng,v1.address,v1.description,v1.score);
        insert into results values (v2.id,v2.firstname,v2.lastname,v2.amount,v2.tran_date,v2.lat,v2.lng,v2.address,v2.description,v2.score);
      end if;
    end loop;
  end loop;
  return 0;
end
$$ LANGUAGE plpgsql;
-- person 1, person 2, amount, duration in hours, longitude, latitude (in question)
select get_people('Pavan','Peter',200,24,103.912680, 1.309432);
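The function truncates and fills a results table that the deck never shows; a plausible definition matching the columns inserted above (all column types inferred):

create table results (
    id int, firstname text, lastname text,
    amount numeric, tran_date timestamp,
    lat float, lng float,
    address text, description text, score float
) distributed randomly;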
● Greenplum PostGIS functions st_distance_sphere() and st_makepoint() calculate the distance between the ATM location and the reference lat/long (< 2 km)
● The GPText gptext.search() function checks whether both people work at ‘Pivotal’
● Greenplum and Apache MADlib BFS search checks whether there are direct or indirect links between people
● The Greenplum fuzzy string match function soundex() checks whether a name sounds like ‘Pavan’ or ‘Peter’
● Greenplum time functions restrict withdrawals to the last 24 hours
● Amount > $200
33. Lines of Code: 3,000+ vs 34
“Investigate a crime suspect whose name sounds like ‘Pavan’, who knows Peter directly, and who withdrew Peter’s $500 at an ATM located 2 km from Changi yesterday.”

Using a Hadoop ecosystem: 10 steps, 3,000+ lines of code across 4 different systems
1. LOAD customer data from HDFS and put it into HIVE
2. The DESCRIPTION column needs to be indexed
3. SEARCH in the column & WRITE the result to HDFS
4. WRITE CODE: pull data into a Spark Data Frame
5. WRITE CODE: check Soundex
6. WRITE CODE: match the SOLR result
7. WRITE CODE: graph link analysis
8. WRITE CODE: PostGIS distance calculation
9. WRITE CODE: graph link analysis
10. WRITE CODE: write the results to a HIVE table
Using Greenplum: 1 step, 1 query – 34 Lines of Code
One query – using built-in functions: Soundex (sounds like), NLP (work at same company),
Machine Learning MADlib (know directly), Time (yesterday), PostGIS (within 2km)
36. In-DB Machine Learning with Apache MADlib
• Open source: https://github.com/apache/madlib
• Downloads and docs: http://madlib.apache.org/
• Wiki: https://cwiki.apache.org/confluence/display/MADLIB/
Apache MADlib: SQL machine learning for PostgreSQL & Greenplum
37. Functions (May 2018)

Data Types and Transformations
• Array and Matrix Operations
• Matrix Factorization: Low Rank, Singular Value Decomposition (SVD)
• Norms and Distance Functions
• Sparse Vectors
• Encoding Categorical Variables
• Path Functions
• Pivot
• Sessionize
• Stemming

Graph
• All Pairs Shortest Path (APSP)
• Breadth-First Search
• Hyperlink-Induced Topic Search (HITS)
• Average Path Length
• Closeness Centrality
• Graph Diameter
• In-Out Degree
• PageRank and Personalized PageRank
• Single Source Shortest Path (SSSP)
• Weakly Connected Components

Model Selection
• Cross Validation
• Prediction Metrics
• Train-Test Split

Statistics
• Descriptive Statistics: Cardinality Estimators, Correlation and Covariance, Summary
• Inferential Statistics: Hypothesis Tests
• Probability Functions

Supervised Learning
• Neural Networks
• Support Vector Machines (SVM)
• Conditional Random Field (CRF)
• Regression Models: Clustered Variance, Cox-Proportional Hazards Regression, Elastic Net Regularization, Generalized Linear Models, Linear Regression, Logistic Regression, Marginal Effects, Multinomial Regression, Naïve Bayes, Ordinal Regression, Robust Variance
• Tree Methods: Decision Tree, Random Forest
• Time Series Analysis: ARIMA

Unsupervised Learning
• Association Rules (Apriori)
• Clustering (k-Means)
• Principal Component Analysis (PCA)
• Topic Modelling (Latent Dirichlet Allocation)

Utility Functions
• Columns to Vector / Vector to Columns
• Conjugate Gradient
• Linear Solvers: Dense Linear Systems, Sparse Linear Systems
• Mini-Batching
• PMML Export
• Term Frequency for Text

Nearest Neighbors
• k-Nearest Neighbors

Sampling
• Balanced, Random, Stratified
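Every item above is called as a SQL function. A minimal sketch using linear regression, following MADlib's documented linregr_train interface (the houses table and its columns are hypothetical):

drop table if exists houses_model, houses_model_summary;
select madlib.linregr_train(
    'houses',                        -- source table
    'houses_model',                  -- output model table
    'price',                         -- dependent variable
    'ARRAY[1, size_sqft, num_beds]'  -- independent variables (1 = intercept)
);
-- fitted coefficients and goodness of fit
select coef, r2 from houses_model;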
38. Greenplum

[Diagram: SQL enters through the Master Host (with a Standby Master); the Interconnect links Segment Hosts on Nodes 1..N, each equipped with GPUs 1..N; in-database functions for machine learning, statistics, math, graph, and utilities run with massively parallel processing]

Best of both worlds: GPU-focused and CPU-focused data science workloads
● Unified platform for the full range of data science workloads
● Higher productivity due to no data movement
● Persistent data storage and management integrated with the core machine learning & API compute engine

Supporting the full spectrum of data science workloads: data preparation, feature generation, machine learning, geospatial, deep learning, etc.