SlideShare una empresa de Scribd logo
1 de 12
Descargar para leer sin conexión
Apache	Beam:
The	road	to	visits	re-assessment	
and	ETL	victory
Konstantinos	Papadopoulos
Analytics	Implementation	(among	others)	@	Debenhams
3/17/18 MeasureCamp	#12 1
Sounds	familiar?
• Is	network	connectivity	affecting	the	way	we	assess	visits	for	commuters?
• How	do	different	cut-off	periods	other	than	30	minutes	of	inactivity	affect	Marketing	Channels?
• How	many	visits	do	our	kiosks	have	and	what	is	their	conversion	rate?
• In	busy	stores,	customers	do	not	wait	30	minutes	before	using	them
• Can	we	do	ETL	at	scale	without	worrying	about	VMs,	servers,	clusters	etc.?
3/17/18 MeasureCamp	#12 2
Beam	Programming	
model
• Powerful	framework	for	ETL	and	data	
transformation	pipelines
• Batch	and	Stream processing	with	the	
same	code
• Java and	Python	SKDs
• Executed	in	multiple	engines	(Apache	
Spark,	Apache	Flink,	Apache	Hadoop	
MapReduce	etc.)
• Most	importantly,	on	Google	Dataflow.	
Fully	managed	service,	a.k.a.	no	DevOps	
involved.
Material	in	the	presentation	taken	from	either	https://beam.apache.org or	https://cloud.google.com/dataflow/
3/17/18 MeasureCamp	#12 3
Session-based	
Windowing
• Group	by	key	(i.e.	user	ID)	AND
window
• Define	custom	Minimum	Gap	
Duration
• “Subdivides	a	collection	of	objects	
according	to	the	timestamps	of	
those	objects	and	groups	them	by	a	
unique	key”
3/17/18 MeasureCamp	#12 4
In	other	words:	build	your	own	visits!
Timestamp	(in	
seconds)
Key	
(i.e.	Client	ID)
1 A
2 A
3 A
2 B
3 B
10 B
Key Timestamp
A [1,2,3]
B [2,3,10]
Key Timestamp
A [1,2,3]
B [2,3]
B [10]
30	seconds	of	
inactivity
5	seconds	of	
inactivity
3/17/18 MeasureCamp	#12 5
Executing	Beam	on	
Dataflow
>	python	main.py	--runner=Dataflow
• Initial	lines	processed:	115	Million
• Initial	data	size:	50	GB
• Total	processing	time:	75	minutes
3/17/18 MeasureCamp	#12 6
Less	than	250	lines	of	code
Windowing	
magic!
3/17/18 MeasureCamp	#12 7
Straight	into	BigQuery
SELECT	visit_key, visitor_id, start_of_visit, end_of_visit
FROM	[visits_analysis.visits]	
WHERE visit_key = '1000040030_1566002138_1511211346'
SELECT	visit_key, timestamp, page	
FROM	[visits_analysis.hits]
WHERE visit_key = '1000040030_1566002138_1511211346'
3/17/18 MeasureCamp	#12 8
Impact	on	visits	– Volume	of	visits	(5%	drop)
3/17/18 MeasureCamp	#12 9
Impact	on	visits	– Avg.	Visit	Duration	(~12%	increase)
3/17/18 MeasureCamp	#12 10
Cool	but	why?
• Adobe’s	Virtual	Report	suites	support	custom	cut-offs	only	in	Workspace	à NOT	practical
• What	about	data	outside	Adobe	Analytics?
• Server	/	Network	logs
• Kiosks	/	POS	logs
• Powerful	ETL	framework		- Alternative	to	Spark
• There	is	no	other	system	in	existence	which	provides	this	degree	of	flexibility	and	power,	period
…according	to	Google*.
• No	integrated	Machine	Learning	library	like	Spark’s	MLlib,	however… you	have	TensorFlow/Google	
Cloud	ML	or	can	write	separate	ML	applications	in	Spark
*:	https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison
3/17/18 MeasureCamp	#12 11
Thank	you!
Feedback	time:
• https://www.linkedin.com/in/papadopoulosk/
• https://measure.slack.com/ - konstantinos.pap
Reading	time:
• Dataflow	model:	https://research.google.com/pubs/pub43864.html
• https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
• Web	app	to	build	ETL	jobs:	https://cloud.google.com/dataprep/
• Python	Project:	https://bitbucket.org/digitalanalytics/visits-analysis
3/17/18 MeasureCamp	#12 12

Más contenido relacionado

Similar a London measure camp12-konstantinos-papadopoulos

Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Precisely
 

Similar a London measure camp12-konstantinos-papadopoulos (20)

Cloud nativecomputingtechnologysupportinghpc cognitiveworkflows
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflowsCloud nativecomputingtechnologysupportinghpc cognitiveworkflows
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflows
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache Flink
 
TIBCO vs MuleSoft Differentiators
TIBCO vs MuleSoft DifferentiatorsTIBCO vs MuleSoft Differentiators
TIBCO vs MuleSoft Differentiators
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slides
 
Introduction To Apache Mesos
Introduction To Apache MesosIntroduction To Apache Mesos
Introduction To Apache Mesos
 
Migration of a high-traffic E-commerce website from Legacy Monolith to Micros...
Migration of a high-traffic E-commerce website from Legacy Monolith to Micros...Migration of a high-traffic E-commerce website from Legacy Monolith to Micros...
Migration of a high-traffic E-commerce website from Legacy Monolith to Micros...
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to Production
 
Resume2015
Resume2015Resume2015
Resume2015
 
EAI (Integration) and Mulesoft
EAI (Integration) and MulesoftEAI (Integration) and Mulesoft
EAI (Integration) and Mulesoft
 
The Download: Tech Talks by the HPCC Systems Community, Episode 11
The Download: Tech Talks by the HPCC Systems Community, Episode 11The Download: Tech Talks by the HPCC Systems Community, Episode 11
The Download: Tech Talks by the HPCC Systems Community, Episode 11
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache Flink
 
A Taste of Open Fabrics Interfaces
A Taste of Open Fabrics InterfacesA Taste of Open Fabrics Interfaces
A Taste of Open Fabrics Interfaces
 
Neo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael MooreNeo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael Moore
 
Google Cloud Platform Solutions for DevOps Engineers
Google Cloud Platform Solutions  for DevOps EngineersGoogle Cloud Platform Solutions  for DevOps Engineers
Google Cloud Platform Solutions for DevOps Engineers
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
Toyota Financial Services Digital Transformation - Think 2019
Toyota Financial Services Digital Transformation - Think 2019Toyota Financial Services Digital Transformation - Think 2019
Toyota Financial Services Digital Transformation - Think 2019
 
Dubbo and Weidian's practice on micro-service architecture
Dubbo and Weidian's practice on micro-service architectureDubbo and Weidian's practice on micro-service architecture
Dubbo and Weidian's practice on micro-service architecture
 

Último

Indian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
Indian Escort in Abu DHabi 0508644382 Abu Dhabi EscortsIndian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
Indian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
Monica Sydney
 
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
ayvbos
 
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
ydyuyu
 
Russian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Russian Escort Abu Dhabi 0503464457 Abu DHabi EscortsRussian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Russian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Monica Sydney
 
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
ydyuyu
 
75539-Cyber Security Challenges PPT.pptx
75539-Cyber Security Challenges PPT.pptx75539-Cyber Security Challenges PPT.pptx
75539-Cyber Security Challenges PPT.pptx
Asmae Rabhi
 
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdfpdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
JOHNBEBONYAP1
 
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
ayvbos
 

Último (20)

Indian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
Indian Escort in Abu DHabi 0508644382 Abu Dhabi EscortsIndian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
Indian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
 
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
 
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
 
Microsoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck MicrosoftMicrosoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck Microsoft
 
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
 
best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...
best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...
best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...
 
Power point inglese - educazione civica di Nuria Iuzzolino
Power point inglese - educazione civica di Nuria IuzzolinoPower point inglese - educazione civica di Nuria Iuzzolino
Power point inglese - educazione civica di Nuria Iuzzolino
 
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
 
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
 
Best SEO Services Company in Dallas | Best SEO Agency Dallas
Best SEO Services Company in Dallas | Best SEO Agency DallasBest SEO Services Company in Dallas | Best SEO Agency Dallas
Best SEO Services Company in Dallas | Best SEO Agency Dallas
 
Russian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Russian Escort Abu Dhabi 0503464457 Abu DHabi EscortsRussian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Russian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
 
APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53
 
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
 
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac RoomVip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
 
75539-Cyber Security Challenges PPT.pptx
75539-Cyber Security Challenges PPT.pptx75539-Cyber Security Challenges PPT.pptx
75539-Cyber Security Challenges PPT.pptx
 
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdfpdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
 
Real Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtReal Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirt
 
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
 
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
 
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
 

London measure camp12-konstantinos-papadopoulos