@martin_loetzsch
Dr. Martin Loetzsch
Data Natives 2017
Reducing pain in data engineering
Data Engineering
Which technology?
Avoid click-tools
  hard to debug
  hard to change
  hard to scale with team size / data complexity / data volume

Data pipelines as code
  SQL files, Python & shell scripts
  Structure & content of the data warehouse are the result of running code
  Easy to debug & inspect
  Develop locally, test on a staging system, then deploy to production

Start with scripts


    unzip -p data.csv \
      | python mapper_script.py \
      | PGOPTIONS=--client-min-messages=warning psql --no-psqlrc \
          --set ON_ERROR_STOP=on etl_db \
          --command="COPY s.target_table FROM STDIN"

    cat query.sql \
      | PGOPTIONS=--client-min-messages=warning psql --no-psqlrc \
          --set ON_ERROR_STOP=on etl_db
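
The slides do not show `mapper_script.py`. As a minimal sketch of what such a stdin-to-stdout mapper could look like, the Python below converts CSV rows into the tab-separated text format that PostgreSQL's `COPY ... FROM STDIN` reads by default. It assumes that the input has a header row and that empty fields should become NULL; the function names and sample data are made up for illustration.

```python
import csv
import io


def escape_copy_text(value):
    """Escape one field for PostgreSQL's default COPY text format."""
    if value == "":
        return r"\N"  # assumption: an empty CSV field means NULL
    return (value.replace("\\", "\\\\")  # backslash must be escaped first
                 .replace("\t", "\\t")
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def map_rows(source, target):
    """Read CSV rows from `source`, write tab-separated COPY rows to `target`."""
    reader = csv.reader(source)
    next(reader, None)  # assumption: the first line is a header row
    for row in reader:
        cleaned = [field.strip() for field in row]
        target.write("\t".join(escape_copy_text(f) for f in cleaned) + "\n")


# Small demonstration with in-memory data (hypothetical sample rows).
demo_output = io.StringIO()
map_rows(io.StringIO("id,name\n1, Berlin \n2,\n"), demo_output)
print(repr(demo_output.getvalue()))
```

Inside the shell pipeline the script would call `map_rows(sys.stdin, sys.stdout)` instead of the in-memory demonstration, so that it stays a plain filter that can be tested on its own.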

Make changing and testing things easy
Apply standard software engineering best practices
Target of computation

    CREATE TABLE m_dim_next.region (
      region_id    SMALLINT PRIMARY KEY,
      region_name  TEXT NOT NULL UNIQUE,
      country_id   SMALLINT NOT NULL,
      country_name TEXT NOT NULL,
      _region_name TEXT NOT NULL
    );

Do computation and store result in table

    WITH raw_region AS (
      SELECT DISTINCT country, region
      FROM m_data.ga_session
      ORDER BY country, region)

    INSERT INTO m_dim_next.region
      SELECT
        row_number() OVER (ORDER BY country, region) AS region_id,
        CASE WHEN (SELECT count(DISTINCT country)
                   FROM raw_region r2
                   WHERE r2.region = r1.region) > 1
          THEN region || ' / ' || country
          ELSE region END AS region_name,
        dense_rank() OVER (ORDER BY country) AS country_id,
        country AS country_name,
        region AS _region_name
      FROM raw_region r1;

    INSERT INTO m_dim_next.region
    VALUES (-1, 'Unknown', -1, 'Unknown', 'Unknown');

Speed up subsequent transformations

    SELECT util.add_index(
      'm_dim_next', 'region',
      column_names := ARRAY ['_region_name', 'country_name', 'region_id']);

    SELECT util.add_index(
      'm_dim_next', 'region',
      column_names := ARRAY ['country_id', 'region_id']);

    ANALYZE m_dim_next.region;
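
To make the naming rule in the query above concrete, here is a small Python sketch of the same logic on in-memory tuples. It is an illustration only and not part of the original pipeline: the function name `build_region_dimension` and the sample rows are made up, and Python's default string ordering stands in for the database collation.

```python
from collections import defaultdict


def build_region_dimension(sessions):
    """sessions: iterable of (country, region) pairs, duplicates allowed."""
    # SELECT DISTINCT country, region ... ORDER BY country, region
    raw_regions = sorted(set(sessions))

    # Which countries does each region name occur in?
    countries_per_region = defaultdict(set)
    for country, region in raw_regions:
        countries_per_region[region].add(country)

    # dense_rank() OVER (ORDER BY country)
    countries = sorted({country for country, _ in raw_regions})
    country_ids = {country: i for i, country in enumerate(countries, start=1)}

    rows = []
    # row_number() OVER (ORDER BY country, region)
    for region_id, (country, region) in enumerate(raw_regions, start=1):
        ambiguous = len(countries_per_region[region]) > 1
        rows.append({
            "region_id": region_id,
            # Only region names shared by several countries get a suffix.
            "region_name": f"{region} / {country}" if ambiguous else region,
            "country_id": country_ids[country],
            "country_name": country,
            "_region_name": region,
        })
    # Catch-all row for facts whose region is unknown.
    rows.append({"region_id": -1, "region_name": "Unknown", "country_id": -1,
                 "country_name": "Unknown", "_region_name": "Unknown"})
    return rows


# Hypothetical sample data: "La Rioja" exists in two countries.
dimension = build_region_dimension([
    ("Spain", "La Rioja"), ("Argentina", "La Rioja"),
    ("Germany", "Berlin"), ("Germany", "Berlin"),
])
for row in dimension:
    print(row["region_id"], row["region_name"], row["country_id"])
```

The suffix keeps `region_name` unique, which is what the `UNIQUE` constraint on the target table then verifies on every run.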
SQL as data processing language
Tables as (intermediate) results of processing steps
Recommended: your own, Apache Airflow, Mara (Project A)
Transformations are transparent to stakeholders
Task orchestration
Invest in transparency, parallel execution
Consistency & correctness
It’s easy to make mistakes during ETL

    DROP SCHEMA IF EXISTS s CASCADE;
    CREATE SCHEMA s;

    CREATE TABLE s.city (
      city_id      SMALLINT,
      city_name    TEXT,
      country_name TEXT
    );

    INSERT INTO s.city VALUES
      (1, 'Berlin', 'Germany'),
      (2, 'Budapest', 'Hungary');

    CREATE TABLE s.customer (
      customer_id BIGINT,
      city_fk     SMALLINT
    );

    INSERT INTO s.customer VALUES
      (1, 1),
      (1, 2),
      (2, 3);

Customers per country?

    SELECT
      country_name,
      count(*) AS number_of_customers
    FROM s.customer JOIN s.city
      ON customer.city_fk = s.city.city_id
    GROUP BY country_name;
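
The query runs without error, which is exactly the problem. The sketch below replays the same rows in an in-memory SQLite database (a stand-in for PostgreSQL, chosen only so that the example is self-contained; the schema prefix `s.` is dropped) and adds two hypothetical check queries that expose what the join hides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE city (city_id INTEGER, city_name TEXT, country_name TEXT);
    INSERT INTO city VALUES (1, 'Berlin', 'Germany'), (2, 'Budapest', 'Hungary');
    CREATE TABLE customer (customer_id INTEGER, city_fk INTEGER);
    INSERT INTO customer VALUES (1, 1), (1, 2), (2, 3);
""")

# The query from the slide: looks plausible, returns one customer per country.
per_country = conn.execute("""
    SELECT country_name, count(*) AS number_of_customers
    FROM customer JOIN city ON customer.city_fk = city.city_id
    GROUP BY country_name
    ORDER BY country_name
""").fetchall()

# Check 1: customer ids that appear more than once (customer 1 is counted twice).
duplicated_customers = conn.execute("""
    SELECT customer_id FROM customer
    GROUP BY customer_id HAVING count(*) > 1
""").fetchall()

# Check 2: customers whose city does not exist (silently dropped by the join).
lost_customers = conn.execute("""
    SELECT customer_id FROM customer
    LEFT JOIN city ON customer.city_fk = city.city_id
    WHERE city.city_id IS NULL
""").fetchall()

print(per_country)
print(duplicated_customers)
print(lost_customers)
```

The per-country counts add up to two, which happens to equal the number of distinct customers, so even a total-count sanity check would not notice that one customer is counted twice and another is missing.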



Back up all assumptions about data by constraints

    ALTER TABLE s.city ADD PRIMARY KEY (city_id);
    ALTER TABLE s.city ADD UNIQUE (city_name);
    ALTER TABLE s.city ADD UNIQUE (city_name, country_name);

    ALTER TABLE s.customer ADD PRIMARY KEY (customer_id);
    [23505] ERROR: could not create unique index "customer_pkey"
    Detail: Key (customer_id)=(1) is duplicated.

    ALTER TABLE s.customer ADD FOREIGN KEY (city_fk)
      REFERENCES s.city (city_id);
    [23503] ERROR: insert or update on table "customer" violates
    foreign key constraint "customer_city_fk_fkey"
    Detail: Key (city_fk)=(3) is not present in table "city"
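
The same effect can be tried out locally. The following sketch uses an in-memory SQLite database as a stand-in for PostgreSQL; because SQLite cannot add primary or foreign keys with `ALTER TABLE`, the constraints are declared when the tables are created, and foreign key enforcement has to be switched on explicitly. The helper `try_insert` is a made-up name for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
    CREATE TABLE city (
        city_id      INTEGER PRIMARY KEY,
        city_name    TEXT NOT NULL UNIQUE,
        country_name TEXT NOT NULL
    );
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        city_fk     INTEGER NOT NULL REFERENCES city (city_id)
    );
    INSERT INTO city VALUES (1, 'Berlin', 'Germany'), (2, 'Budapest', 'Hungary');
""")


def try_insert(customer_id, city_fk):
    """Insert one customer row; return 'ok' or the constraint error message."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO customer VALUES (?, ?)",
                         (customer_id, city_fk))
        return "ok"
    except sqlite3.IntegrityError as error:
        return str(error)


# The three customer rows from the slide.
results = [try_insert(1, 1), try_insert(1, 2), try_insert(2, 3)]
for result in results:
    print(result)
```

Only the first row is accepted: the second violates the primary key and the third references a city that does not exist, so both bad rows are rejected at load time instead of distorting a report later.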
Referential consistency
Only very little overhead, will save your ass
[Figure: data warehouse schema with three tables. customer (customer_id, first_order_fk, favourite_product_fk, lifetime_revenue); product (product_id, revenue_last_6_months); order (order_id, processed_order_id, customer_fk, product_fk, revenue)]
Never repeat “business logic”

    SELECT sum(total_price) AS revenue
    FROM os_data.order
    WHERE status IN ('pending', 'accepted', 'completed',
                     'proposal_for_change');

    SELECT CASE WHEN (status <> 'started'
                      AND payment_status = 'authorised'
                      AND order_type <> 'backend')
           THEN o.order_id END AS processed_order_fk
    FROM os_data.order o;

    SELECT (last_status = 'pending') :: INTEGER AS is_unprocessed
    FROM os_data.order;

Refactor pipeline
  Create a separate task that computes everything we know about an order
  Usually difficult in real life

Load → preprocess → transform → flatten-fact
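
As a rough sketch of what "one task that computes everything we know about an order" can mean, the Python below gathers the three status rules from the queries above into a single function, so that later steps read the derived flags instead of repeating the conditions. The `Order` class, the function name and the output keys are hypothetical; only the status values and conditions come from the slides.

```python
from dataclasses import dataclass

# Statuses that count towards revenue (taken from the first query).
REVENUE_STATUSES = {"pending", "accepted", "completed", "proposal_for_change"}


@dataclass(frozen=True)
class Order:
    order_id: int
    status: str
    last_status: str
    payment_status: str
    order_type: str
    total_price: float


def derive_order_facts(order):
    """Compute every derived order attribute in exactly one place."""
    is_processed = (order.status != "started"
                    and order.payment_status == "authorised"
                    and order.order_type != "backend")
    return {
        "order_id": order.order_id,
        "revenue": order.total_price if order.status in REVENUE_STATUSES else 0.0,
        "processed_order_fk": order.order_id if is_processed else None,
        "is_unprocessed": int(order.last_status == "pending"),
    }


# Hypothetical sample orders.
facts = [derive_order_facts(o) for o in (
    Order(7, "pending", "pending", "authorised", "shop", 10.0),
    Order(8, "started", "started", "authorised", "shop", 5.0),
)]
for fact in facts:
    print(fact)
```

If the definition of a processed order changes, only this one function (or, in the warehouse, the one SQL task it stands for) needs to change.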
Computational consistency
Requires discipline
load-product load-order load-customer
preprocess-product preprocess-order preprocess-customer
transform-product transform-order transform-customer
flatten-product-fact flatten-order-fact flatten-customer-fact
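
For the "your own" orchestration option, the task grid above can be sketched as a dependency graph that is executed in parallel. This is a minimal illustration using only the Python standard library (`graphlib` needs Python 3.9 or newer), not how Airflow or Mara work internally. It assumes that each task depends only on the previous stage of the same entity; real pipelines usually add cross-entity dependencies. The function names are made up.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

STAGES = ["load", "preprocess", "transform", "flatten"]
ENTITIES = ["product", "order", "customer"]


def task_name(stage, entity):
    return f"flatten-{entity}-fact" if stage == "flatten" else f"{stage}-{entity}"


def build_graph():
    """Map each task name to the set of tasks it depends on."""
    graph = {}
    for entity in ENTITIES:
        previous = None
        for stage in STAGES:
            name = task_name(stage, entity)
            graph[name] = {previous} if previous else set()
            previous = name
    return graph


def run_pipeline(graph, run_task, max_workers=3):
    """Run tasks in parallel, starting each one once its dependencies finished."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    finished = []
    running = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            for name in sorter.get_ready():
                running[pool.submit(run_task, name)] = name
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raises the task's exception, if any
                name = running.pop(future)
                finished.append(name)
                sorter.done(name)
    return finished


# A real run_task would execute a SQL file or shell command; here it only logs.
order_of_completion = run_pipeline(build_graph(), lambda name: print("running", name))
```

Returning the completion order gives a first bit of transparency; a fuller version would also record run times and output per task.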
Check for “lost” rows

    SELECT util.assert_equal(
      'The order items fact table should contain all order items',
      'SELECT count(*) FROM os_dim.order_item',
      'SELECT count(*) FROM os_dim.order_items_fact');

Check consistency across cubes / domains

    SELECT util.assert_almost_equal(
      'The number of first orders should be the same in '
      || 'orders and marketing touchpoints cube',
      'SELECT count(net_order_id)
       FROM os_dim.order
       WHERE _net_order_rank = 1;',
      'SELECT (SELECT sum(number_of_first_net_orders)
               FROM m_dim.acquisition_performance)
              / (SELECT count(*)
                 FROM m_dim.performance_attribution_model)',
      1.0
    );

Check completeness of source data

    SELECT util.assert_not_found(
      'Each adwords campaign must have the attribute "Channel"',
      'SELECT DISTINCT campaign_name, account_name
       FROM aw_tmp.ad
         JOIN aw_dim.ad_performance ON ad_fk = ad_id
       WHERE attributes->>''Channel'' IS NULL
             AND impressions > 0
             AND _date > now() - INTERVAL ''30 days''');

Check correctness of redistribution transformations

    SELECT util.assert_almost_equal_relative(
      'The cost of non-converting touchpoints must match the '
      || 'redistributed customer acquisition and reactivation cost',
      'SELECT sum(cost)
       FROM m_tmp.cost_of_non_converting_touchpoints;',
      'SELECT
         (SELECT sum(cost_per_touchpoint * number_of_touchpoints)
          FROM m_tmp.redistributed_customer_acquisition_cost)
         + (SELECT sum(cost_per_touchpoint * number_of_touchpoints)
            FROM m_tmp.redistributed_customer_reactivation_cost);',
      0.00001);
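
The `util.assert_*` functions are the author's own helpers and their implementation is not shown in the slides. The Python sketch below is a guess at the general idea rather than that implementation: run two queries that each return a single value, compare the results, and fail loudly with a description. It uses SQLite and hypothetical table contents so that it runs on its own.

```python
import math
import sqlite3


def single_value(conn, query):
    """Run a query that returns exactly one row with one column."""
    return conn.execute(query).fetchone()[0]


def assert_equal(conn, description, query_1, query_2):
    value_1, value_2 = single_value(conn, query_1), single_value(conn, query_2)
    if value_1 != value_2:
        raise AssertionError(f"{description}: {value_1} != {value_2}")


def assert_almost_equal_relative(conn, description, query_1, query_2,
                                 max_relative_error):
    value_1, value_2 = single_value(conn, query_1), single_value(conn, query_2)
    if not math.isclose(value_1, value_2, rel_tol=max_relative_error):
        raise AssertionError(f"{description}: {value_1} vs. {value_2}")


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE order_item (order_item_id INTEGER);
    CREATE TABLE order_items_fact (order_item_id INTEGER);
    INSERT INTO order_item VALUES (1), (2), (3);
    INSERT INTO order_items_fact VALUES (1), (2);  -- one row got lost
""")

try:
    assert_equal(conn,
                 "The order items fact table should contain all order items",
                 "SELECT count(*) FROM order_item",
                 "SELECT count(*) FROM order_items_fact")
    outcome = "passed"
except AssertionError as error:
    outcome = str(error)
print(outcome)
```

Run as the last step of each pipeline, a failing check like this stops the run before inconsistent numbers reach a dashboard.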
Data consistency
Makes changing things easy
Contribution margin 3a
SELECT order_item_id,
((((((COALESCE(item_net_price, 0)::REAL
+ COALESCE(net_shipping_revenue, 0)::REAL)
- ((COALESCE(item_net_purchase_price, 0)::REAL
+ COALESCE(alcohol_tax, 0)::REAL)
+ COALESCE(import_tax, 0)::REAL))
- (COALESCE(net_fulfillment_costs, 0)::REAL
+ COALESCE(net_payment_costs, 0)::REAL))
- COALESCE(net_return_costs, 0)::REAL)
- ((COALESCE(item_net_price, 0)::REAL
+ COALESCE(net_shipping_revenue, 0)::REAL)
- ((((COALESCE(item_net_price, 0)::REAL
+ COALESCE(item_tax_amount, 0)::REAL)
+ COALESCE(gross_shipping_revenue, 0)::REAL)
- COALESCE(voucher_gross_amount, 0)::REAL)
* (1 - ((COALESCE(item_tax_amount, 0)::REAL
+ (COALESCE(gross_shipping_revenue, 0)::REAL
- COALESCE(net_shipping_revenue, 0)::REAL))
/ NULLIF(((COALESCE(item_net_price, 0)::REAL
+ COALESCE(item_tax_amount, 0)::REAL)
+ COALESCE(gross_shipping_revenue,
0)::REAL), 0))))))
- COALESCE(goodie_cost_per_item, 0)::REAL) :: DOUBLE PRECISION
AS "Contribution margin 3a"
FROM dim.sales_fact;
Use schemas between reporting and database
  Mondrian
  LookML
  your own
Or: Pre-compute metrics in database
Semantic consistency
Changing the meaning of metrics across all dashboards needs to be easy
Focus on the complexity of data
rather than the complexity of technology
We are open sourcing our BI infrastructure
ETL part released end of 2017
Meet us here at DN
Refer us a data person, earn 200€
Also analysts, BI managers
Thank you

DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A

  • 1. Reducing pain in data engineering. Dr. Martin Loetzsch (@martin_loetzsch), Data Natives 2017
  • 5. Make changing and testing things easy: apply standard software engineering best practices
 Avoid click-tools
    hard to debug
    hard to change
    hard to scale with team size / data complexity / data volume
 Data pipelines as code
    SQL files, Python & shell scripts
    Structure & content of the data warehouse are the result of running code
 Easy to debug & inspect
    Develop locally, test on staging system, then deploy to production
 Start with scripts
 unzip -p data.csv \
 | python mapper_script.py \
 | PGOPTIONS=--client-min-messages=warning psql --no-psqlrc \
 --set ON_ERROR_STOP=on etl_db \
 --command="COPY s.target_table FROM STDIN"

 cat query.sql \
 | PGOPTIONS=--client-min-messages=warning psql --no-psqlrc \
 --set ON_ERROR_STOP=on etl_db
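A minimal sketch of what a script in the position of mapper_script.py might do. The slide does not show the script, so the function names, the header-row assumption, and the rule that empty fields load as NULL are all hypothetical. It turns CSV rows into the tab-separated text format that COPY ... FROM STDIN reads by default.

```python
import csv
import io


def to_copy_field(value):
    """Escape one field for PostgreSQL's COPY text format (empty string -> NULL)."""
    if value == "":
        return r"\N"  # assumption: empty CSV fields should become NULL
    return (value.replace("\\", "\\\\")
                 .replace("\t", "\\t")
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def map_csv(source, target):
    """Read CSV from `source`, write tab-separated COPY rows to `target`."""
    reader = csv.reader(source)
    next(reader, None)  # assumption: the first row is a header and is skipped
    for row in reader:
        target.write("\t".join(to_copy_field(value) for value in row) + "\n")


# Demo with in-memory streams; a real script would pass sys.stdin and sys.stdout.
demo_out = io.StringIO()
map_csv(io.StringIO("id,name\n1,Berlin\n2,\n"), demo_out)
print(demo_out.getvalue(), end="")
```

Keeping the mapping in a plain function makes it easy to test locally before it is wired into the shell pipeline.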
  • 6. SQL as data processing language: tables as (intermediate) results of processing steps
 Target of computation
 CREATE TABLE m_dim_next.region (
   region_id SMALLINT PRIMARY KEY,
   region_name TEXT NOT NULL UNIQUE,
   country_id SMALLINT NOT NULL,
   country_name TEXT NOT NULL,
   _region_name TEXT NOT NULL
 );

 Do computation and store result in table
 WITH raw_region AS (
   SELECT DISTINCT country, region
   FROM m_data.ga_session
   ORDER BY country, region)
 INSERT INTO m_dim_next.region
 SELECT
   row_number() OVER (ORDER BY country, region) AS region_id,
   CASE WHEN (SELECT count(DISTINCT country)
              FROM raw_region r2
              WHERE r2.region = r1.region) > 1
        THEN region || ' / ' || country
        ELSE region END AS region_name,
   dense_rank() OVER (ORDER BY country) AS country_id,
   country AS country_name,
   region AS _region_name
 FROM raw_region r1;

 INSERT INTO m_dim_next.region
 VALUES (-1, 'Unknown', -1, 'Unknown', 'Unknown');

 Speed up subsequent transformations
 SELECT util.add_index(
   'm_dim_next', 'region',
   column_names := ARRAY ['_region_name', 'country_name', 'region_id']);

 SELECT util.add_index(
   'm_dim_next', 'region',
   column_names := ARRAY ['country_id', 'region_id']);

 ANALYZE m_dim_next.region;
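The region-naming rule above (append the country only when a region name occurs in more than one country) can be tried out locally. This is a sketch under assumptions: it uses Python's built-in sqlite3 instead of PostgreSQL, unqualified table names, made-up sample rows, and it leaves out the numeric IDs and indexes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ga_session (country TEXT, region TEXT);
INSERT INTO ga_session VALUES
  ('Belgium', 'Limburg'), ('Netherlands', 'Limburg'),
  ('Germany', 'Berlin'), ('Germany', 'Berlin');

-- Intermediate result: one row per (country, region) pair
CREATE TABLE raw_region AS
SELECT DISTINCT country, region FROM ga_session;

-- Target of computation; UNIQUE makes the load fail if names still collide
CREATE TABLE region (
  region_name  TEXT NOT NULL UNIQUE,
  country_name TEXT NOT NULL,
  _region_name TEXT NOT NULL
);

INSERT INTO region
SELECT CASE WHEN (SELECT count(DISTINCT r2.country)
                  FROM raw_region r2
                  WHERE r2.region = r1.region) > 1
            THEN r1.region || ' / ' || r1.country
            ELSE r1.region END,
       r1.country,
       r1.region
FROM raw_region r1;
""")

region_names = [row[0] for row in conn.execute(
    "SELECT region_name FROM region ORDER BY region_name")]
print(region_names)
```

Every step leaves a table behind, so each intermediate result can be inspected with a plain SELECT when something looks wrong.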
  • 7. Task orchestration: invest in transparency, parallel execution
 Recommended: your own, Apache Airflow, Mara (Project A)
 Transformations are transparent to stakeholders
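For the "your own" option, a minimal sketch of dependency-aware, parallel task execution using only the Python standard library (graphlib needs Python 3.9+). The function run_pipeline and the task names are hypothetical; this is not how Airflow or Mara work internally.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies, max_workers=4):
    """Run zero-argument callables in dependency order, independent ones in parallel.

    tasks: dict of task name -> callable
    dependencies: dict of task name -> set of task names that must finish first
    Returns the task names in the order they finished.
    """
    sorter = TopologicalSorter(
        {name: dependencies.get(name, set()) for name in tasks})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    finished = []
    running = {}  # future -> task name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            for name in sorter.get_ready():
                running[pool.submit(tasks[name])] = name
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                future.result()  # re-raise any task error and stop the run
                finished.append(name)
                sorter.done(name)
    return finished


log = []
tasks = {name: (lambda n=name: log.append(n))
         for name in ["load", "preprocess", "transform"]}
dependencies = {"preprocess": {"load"}, "transform": {"preprocess"}}
finished = run_pipeline(tasks, dependencies)
print(finished)
```

Returning the finish order is a small step toward transparency: the same list can be logged or shown to stakeholders.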
  • 9. Referential consistency: only very little overhead, will save your ass
 It’s easy to make mistakes during ETL
 DROP SCHEMA IF EXISTS s CASCADE;
 CREATE SCHEMA s;

 CREATE TABLE s.city (city_id SMALLINT, city_name TEXT, country_name TEXT);
 INSERT INTO s.city VALUES (1, 'Berlin', 'Germany'), (2, 'Budapest', 'Hungary');

 CREATE TABLE s.customer (customer_id BIGINT, city_fk SMALLINT);
 INSERT INTO s.customer VALUES (1, 1), (1, 2), (2, 3);

 Customers per country?
 SELECT country_name, count(*) AS number_of_customers
 FROM s.customer JOIN s.city ON customer.city_fk = s.city.city_id
 GROUP BY country_name;

 Back up all assumptions about data by constraints
 ALTER TABLE s.city ADD PRIMARY KEY (city_id);
 ALTER TABLE s.city ADD UNIQUE (city_name);
 ALTER TABLE s.city ADD UNIQUE (city_name, country_name);

 ALTER TABLE s.customer ADD PRIMARY KEY (customer_id);
 [23505] ERROR: could not create unique index "customer_pkey"
 Detail: Key (customer_id)=(1) is duplicated.

 ALTER TABLE s.customer ADD FOREIGN KEY (city_fk) REFERENCES s.city (city_id);
 [23503] ERROR: insert or update on table "customer" violates foreign key constraint "customer_city_fk_fkey"
 Detail: Key (city_fk)=(3) is not present in table "city"
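The same effect can be reproduced locally. This sketch assumes Python's built-in sqlite3 instead of PostgreSQL (so the error messages differ) and declares the constraints up front, so that the bad customer rows are rejected at insert time. The helper try_insert is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE city (
  city_id      INTEGER PRIMARY KEY,
  city_name    TEXT UNIQUE,
  country_name TEXT
);
INSERT INTO city VALUES (1, 'Berlin', 'Germany'), (2, 'Budapest', 'Hungary');

CREATE TABLE customer (
  customer_id INTEGER PRIMARY KEY,
  city_fk     INTEGER REFERENCES city (city_id)
);
""")


def try_insert(row):
    """Insert one customer row; return 'ok' or the constraint error message."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO customer VALUES (?, ?)", row)
        return "ok"
    except sqlite3.IntegrityError as error:
        return str(error)


# The same three rows as on the slide: a duplicate customer and an unknown city
results = [try_insert(row) for row in [(1, 1), (1, 2), (2, 3)]]
for result in results:
    print(result)
```

Without the constraints, all three rows load silently and the per-country count is wrong in two ways: customer 1 is counted twice and customer 2 disappears in the join.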
  • 10. Computational consistency: requires discipline
 [Figure: DWH schema with tables customer (customer_id, first_order_fk, favourite_product_fk, lifetime_revenue), product (product_id, revenue_last_6_months), and order (order_id, processed_order_id, customer_fk, product_fk, revenue)]

 Never repeat “business logic”
 SELECT sum(total_price) AS revenue
 FROM os_data.order
 WHERE status IN ('pending', 'accepted', 'completed', 'proposal_for_change');

 SELECT CASE WHEN (status <> 'started'
                   AND payment_status = 'authorised'
                   AND order_type <> 'backend')
             THEN o.order_id END AS processed_order_fk
 FROM os_data.order o;

 SELECT (last_status = 'pending') :: INTEGER AS is_unprocessed
 FROM os_data.order;

 Refactor pipeline
    Create separate task that computes everything we know about an order
    Usually difficult in real life

 Load → preprocess → transform → flatten-fact
    load-product, load-order, load-customer
    preprocess-product, preprocess-order, preprocess-customer
    transform-product, transform-order, transform-customer
    flatten-product-fact, flatten-order-fact, flatten-customer-fact
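One way to read "create separate task that computes everything we know about an order" is a preprocessing step that evaluates each rule once and stores the result as a column. The sketch below assumes Python's built-in sqlite3, hypothetical table and column names (raw_order, preprocessed_order, counts_as_revenue, is_processed), and made-up sample rows; only two of the rules from the slide are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_order (
  order_id       INTEGER PRIMARY KEY,
  status         TEXT,
  payment_status TEXT,
  order_type     TEXT,
  total_price    REAL
);
INSERT INTO raw_order VALUES
  (1, 'completed', 'authorised', 'shop',    10.0),
  (2, 'started',   'open',       'shop',    20.0),
  (3, 'pending',   'authorised', 'backend',  5.0);

-- The only place where the order rules are written down
CREATE TABLE preprocessed_order AS
SELECT order_id,
       total_price,
       status IN ('pending', 'accepted', 'completed',
                  'proposal_for_change') AS counts_as_revenue,
       (status <> 'started'
        AND payment_status = 'authorised'
        AND order_type <> 'backend') AS is_processed
FROM raw_order;
""")

# Downstream steps read the flags instead of repeating the conditions
revenue = conn.execute(
    "SELECT sum(total_price) FROM preprocessed_order WHERE counts_as_revenue"
).fetchone()[0]
processed_ids = [row[0] for row in conn.execute(
    "SELECT order_id FROM preprocessed_order WHERE is_processed ORDER BY order_id")]
print(revenue, processed_ids)
```

When a rule changes, only the preprocessing step has to be edited; every downstream transformation picks up the new meaning on the next run.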
  • 11. Data consistency: makes changing things easy
 Check for “lost” rows
 SELECT util.assert_equal(
   'The order items fact table should contain all order items',
   'SELECT count(*) FROM os_dim.order_item',
   'SELECT count(*) FROM os_dim.order_items_fact');

 Check consistency across cubes / domains
 SELECT util.assert_almost_equal(
   'The number of first orders should be the same in '
   || 'orders and marketing touchpoints cube',
   'SELECT count(net_order_id) FROM os_dim.order WHERE _net_order_rank = 1;',
   'SELECT (SELECT sum(number_of_first_net_orders) FROM m_dim.acquisition_performance) / (SELECT count(*) FROM m_dim.performance_attribution_model)',
   1.0);

 Check completeness of source data
 SELECT util.assert_not_found(
   'Each adwords campaign must have the attribute "Channel"',
   'SELECT DISTINCT campaign_name, account_name
    FROM aw_tmp.ad JOIN aw_dim.ad_performance ON ad_fk = ad_id
    WHERE attributes->>''Channel'' IS NULL
      AND impressions > 0
      AND _date > now() - INTERVAL ''30 days''');

 Check correctness of redistribution transformations
 SELECT util.assert_almost_equal_relative(
   'The cost of non-converting touchpoints must match the '
   || 'redistributed customer acquisition and reactivation cost',
   'SELECT sum(cost) FROM m_tmp.cost_of_non_converting_touchpoints;',
   'SELECT (SELECT sum(cost_per_touchpoint * number_of_touchpoints) FROM m_tmp.redistributed_customer_acquisition_cost) + (SELECT sum(cost_per_touchpoint * number_of_touchpoints) FROM m_tmp.redistributed_customer_reactivation_cost);',
   0.00001);
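The util.assert_* functions are part of the author's PostgreSQL setup and their implementation is not shown. As a rough Python analogue, assuming sqlite3 and hypothetical helper names that only mirror the calls on the slide, such checks could look like this:

```python
import sqlite3


def assert_equal(conn, description, query1, query2):
    """Raise AssertionError unless both queries return the same rows."""
    result1 = conn.execute(query1).fetchall()
    result2 = conn.execute(query2).fetchall()
    if result1 != result2:
        raise AssertionError(f"{description}: {result1} != {result2}")


def assert_almost_equal_relative(conn, description, query1, query2,
                                 max_relative_difference):
    """Compare two single-value queries, tolerating a small relative difference."""
    value1 = conn.execute(query1).fetchone()[0]
    value2 = conn.execute(query2).fetchone()[0]
    scale = max(abs(value1), abs(value2))
    if abs(value1 - value2) > max_relative_difference * scale:
        raise AssertionError(f"{description}: {value1} vs {value2}")


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_item (order_item_id INTEGER PRIMARY KEY);
CREATE TABLE order_items_fact (order_item_id INTEGER PRIMARY KEY);
INSERT INTO order_item VALUES (1), (2), (3);
INSERT INTO order_items_fact VALUES (1), (2);  -- one row got "lost" on the way
""")

try:
    assert_equal(conn,
                 "The order items fact table should contain all order items",
                 "SELECT count(*) FROM order_item",
                 "SELECT count(*) FROM order_items_fact")
    check_result = "passed"
except AssertionError as error:
    check_result = str(error)
print(check_result)
```

Run at the end of each pipeline step, checks like these turn a silent data problem into a failed run.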
  • 12. Semantic consistency: changing the meaning of metrics across all dashboards needs to be easy
 Contribution margin 3a
 SELECT order_item_id,
   ((((((COALESCE(item_net_price, 0)::REAL + COALESCE(net_shipping_revenue, 0)::REAL)
   - ((COALESCE(item_net_purchase_price, 0)::REAL + COALESCE(alcohol_tax, 0)::REAL) + COALESCE(import_tax, 0)::REAL))
   - (COALESCE(net_fulfillment_costs, 0)::REAL + COALESCE(net_payment_costs, 0)::REAL))
   - COALESCE(net_return_costs, 0)::REAL)
   - ((COALESCE(item_net_price, 0)::REAL + COALESCE(net_shipping_revenue, 0)::REAL)
      - ((((COALESCE(item_net_price, 0)::REAL + COALESCE(item_tax_amount, 0)::REAL) + COALESCE(gross_shipping_revenue, 0)::REAL) - COALESCE(voucher_gross_amount, 0)::REAL)
         * (1 - ((COALESCE(item_tax_amount, 0)::REAL + (COALESCE(gross_shipping_revenue, 0)::REAL - COALESCE(net_shipping_revenue, 0)::REAL))
                 / NULLIF(((COALESCE(item_net_price, 0)::REAL + COALESCE(item_tax_amount, 0)::REAL) + COALESCE(gross_shipping_revenue, 0)::REAL), 0))))))
   - COALESCE(goodie_cost_per_item, 0)::REAL) :: DOUBLE PRECISION AS "Contribution margin 3a"
 FROM dim.sales_fact;

 Use schemas between reporting and database
    Mondrian
    LookML
    your own
 Or: Pre-compute metrics in database
  • 13. Focus on the complexity of data rather than the complexity of technology
  • 14. We are open sourcing our BI infrastructure: ETL part released end of 2017
  • 16. Refer us a data person, earn 200€. Also analysts, BI managers