IKERLAN.
WHERE
TECHNOLOGY IS
AN ATTITUDE
Software reliability on the Big Data ERA
with an Industry minded focus
Ángel Conde
aconde@ikerlan.es
© 2021. IKERLAN. All rights reserved
About Me
@Neuw84
@IKERLANofficial
Ángel Conde Manjón
Data Analytics & Artificial Intelligence Team Lead @ IKERLAN
Big Data
Artificial Intelligence
Distributed Systems
Cloud
BIG DATA “RELIABILITY” OR “FAILURE SURVIVAL”
Distributed systems vs reliability
• Big Data means a distributed processing system.
But…
“Can a distributed system be reliable?”
• Not really.
- Network partitions.
- Node failures (hardware, software, etc.).
- Clock drift (related to consensus).
* Google nowadays says otherwise…
The starting paradigm shift
• HPC clusters are too expensive (and they fail too).
“How can we process large amounts of data in a cheap and reliable way?”
• Google makes it possible: “MapReduce: Simplified Data Processing on Large Clusters” (J. Dean and S. Ghemawat, 2004).
• Yahoo open-sources its implementation: Hadoop is born.
The rest is history…
The Map Reduce model
* Word Count is the Hello World in the Big Data Paradigm.
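As a rough illustration of the word count mentioned above, the sketch below runs the map, shuffle and reduce phases in a single Python process. It is not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical, and a real framework would spread each phase over many machines and persist intermediate results.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group values by key (in a cluster this step moves data over the network)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return dict(groups)


def reduce_phase(word, counts):
    """Reduce: sum all the counts collected for a single word."""
    return word, sum(counts)


def word_count(lines):
    """Chain the three phases over an iterable of text lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


result = word_count(["hello big data", "hello world"])
print(result)
```

Keeping the three phases as separate functions mirrors the model: only the shuffle needs to see all the data for a key, which is why it tends to dominate network load.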
All fits in memory
• MapReduce is somewhat “slow”: every step is persisted to disk.
• Memory gets cheaper and cheaper…
• Let's do in-memory computing!
“Spark: Cluster Computing with Working Sets” (M. Zaharia et al., 2010).
Spark Lineage Model
• Everything is immutable.
• Data is partitioned into replicated chunks (RDDs).
• Before execution, a DAG is computed.
• DAG execution is checkpointed to failure-tolerant storage.
• In case of node failure, it is recomputed from the last checkpoint.
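The lineage idea can be sketched in a few lines of plain Python. This is a toy, single-process stand-in and not the Spark API: `ToyRDD`, `compute` and `lose_partition` are hypothetical names, and the root `source` list plays the role of the durable storage (or last checkpoint) from which lost partitions are rebuilt.

```python
class ToyRDD:
    """Hypothetical immutable dataset that remembers how it was derived."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source   # list of partitions; only set on the root dataset
        self.parent = parent   # lineage: the dataset this one derives from
        self.fn = fn           # lineage: function applied to every element
        self.cache = {}        # partition index -> data held "in memory"

    def map(self, fn):
        # A transformation never mutates; it returns a new dataset plus lineage.
        return ToyRDD(parent=self, fn=fn)

    def compute(self, idx):
        """Return partition idx, rebuilding it from the lineage if it is missing."""
        if idx in self.cache:
            return self.cache[idx]
        if self.parent is None:
            data = list(self.source[idx])
        else:
            data = [self.fn(x) for x in self.parent.compute(idx)]
        self.cache[idx] = data
        return data

    def lose_partition(self, idx):
        """Simulate a node failure that wipes one in-memory partition."""
        self.cache.pop(idx, None)


base = ToyRDD(source=[[1, 2], [3, 4]])
squared = base.map(lambda x: x * x)
plus_one = squared.map(lambda x: x + 1)

first = plus_one.compute(1)
plus_one.lose_partition(1)   # the node holding these partitions fails
squared.lose_partition(1)
again = plus_one.compute(1)  # rebuilt by replaying the lineage
print(first, again)
```

Because every dataset is immutable, replaying the same transformations on the same input gives the same partition, which is what makes recomputation a safe recovery strategy.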
Orchestrators
• An important piece.
• They abstract the resources of the cluster (CPUs, GPUs, memory).
“I want my Big Data process to run on 200 CPUs and 512 GB of RAM.”
• They coordinate all the jobs running in the cluster.
• They relaunch jobs on other nodes in case of failure.
• Like DBs, they have consensus capabilities (e.g., for leader election).
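The resource abstraction and relaunch behaviour listed above can be sketched with a toy first-fit scheduler. Everything here is an assumption made for illustration (`Node`, `ToyOrchestrator`, `submit`, `fail`); real orchestrators add queues, priorities, health checks and leader election on top.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cpus: int
    mem_gb: int
    jobs: dict = field(default_factory=dict)  # job name -> (cpus, mem_gb)

    def fits(self, cpus, mem_gb):
        """Check whether the request fits in the resources still free on this node."""
        used_cpus = sum(c for c, _ in self.jobs.values())
        used_mem = sum(m for _, m in self.jobs.values())
        return used_cpus + cpus <= self.cpus and used_mem + mem_gb <= self.mem_gb


class ToyOrchestrator:
    """Hypothetical first-fit scheduler over a pool of nodes."""

    def __init__(self, nodes):
        self.nodes = {n.name: n for n in nodes}

    def submit(self, job, cpus, mem_gb):
        """Place the job on the first node with enough free resources."""
        for node in self.nodes.values():
            if node.fits(cpus, mem_gb):
                node.jobs[job] = (cpus, mem_gb)
                return node.name
        return None  # no capacity anywhere

    def fail(self, node_name):
        """Remove a failed node and relaunch its jobs on the surviving nodes."""
        dead = self.nodes.pop(node_name)
        return {job: self.submit(job, c, m) for job, (c, m) in dead.jobs.items()}


orch = ToyOrchestrator([Node("a", cpus=8, mem_gb=32), Node("b", cpus=8, mem_gb=32)])
placed_etl = orch.submit("etl", cpus=6, mem_gb=16)
placed_train = orch.submit("train", cpus=2, mem_gb=8)
relaunched = orch.fail("a")
print(placed_etl, placed_train, relaunched)
```

The caller only states how many CPUs and how much memory a job needs; which machine runs it, and where it moves after a failure, is the orchestrator's decision.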
DISTRIBUTED DATABASES
The CAP Theorem
* Pick two
All about consensus
https://jvns.ca/blog/2016/11/19/a-critique-of-the-cap-theorem/
The Rise of NoSQL
• The internet became what it is some years ago (aka internet-size problems).
• Lots of NoSQL solutions appeared to solve internet-scale problems.
o Key-Value
o Document
o Time series
o Graph
• Remember, usually YOU do not have those problems.
• They avoid manual sharding and offer multi-master approaches.
• No ACID transaction support (in most of them).
A new approach
• Again, Google did it: “Spanner: Google's Globally-Distributed Database” (J. C. Corbett et al., 2012).
• Complete control of the backbone network, which makes it tolerant to failures.
• Global time synchronization using atomic clocks.
• Advanced consensus protocols.
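To give a feel for why synchronized clocks matter, the Spanner paper describes a clock API that returns an uncertainty interval instead of a single instant, and a "commit wait" rule that holds a commit until its timestamp has definitely passed. The sketch below is a toy simulation of that rule under stated assumptions: `SimulatedTrueTime`, `commit`, the fixed `epsilon` and the manually advanced clock are all hypothetical and only serve to keep the example deterministic.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    earliest: float
    latest: float


class SimulatedTrueTime:
    """Hypothetical clock whose real time is only known within +/- epsilon."""

    def __init__(self, epsilon, start=0.0):
        self.epsilon = epsilon
        self.true_time = start  # advanced by hand instead of reading a real clock

    def now(self):
        return Interval(self.true_time - self.epsilon, self.true_time + self.epsilon)

    def after(self, t):
        # True only once t has passed even for the most pessimistic clock reading.
        return self.now().earliest > t


def commit(tt, tick=1.0):
    """Pick a commit timestamp, then wait out the clock uncertainty."""
    started = tt.true_time
    timestamp = tt.now().latest        # no earlier than any clock could currently read
    while not tt.after(timestamp):
        tt.true_time += tick           # stands in for sleeping
    return timestamp, tt.true_time - started


clock = SimulatedTrueTime(epsilon=4.0, start=100.0)
timestamp, waited = commit(clock)
print(timestamp, waited)
```

In this toy model the wait is a little over twice the uncertainty, which suggests why tighter clock synchronization translates directly into lower commit latency.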
The Open Source alternatives
* Nowadays, multi-model databases are on the rise.
INDUSTRIAL INTERNET OF THINGS (INDUSTRY 4.0)
The Industrial Internet of Things (IIoT)
• General Electric: investment is expected to top $60 trillion during the next 15 years.
• Accenture: could add $14.2T to the global economy by 2030.
• McKinsey: will touch 43% of the global economy by 2025.
• Gartner: 20 billion IoT things installed by 2024.
Use cases & Key Benefits
+ efficiency, - costs
• Supply-demand matching and reduction of time-to-market.
• Human resource optimization.
• Optimization of energy and raw material consumption.
• Manufacturing asset optimization and OEE improvement.
• Quality maximization.
• After-sales service optimization.
• Environment, health & security maximization.
Key Issues in IIoT
REAL TIME PROCESSING APPLIED TO IIOT
Big Data & Real Time Processing
• Streaming data can be seen as an unbounded table; a regular table is a snapshot of it.
• Streaming aggregations usually require windows.
• Results are produced at some point (e.g. when a window closes), and we make a “snapshot table”.
• Those snapshots are usually stored in a failure-tolerant storage system.
However… how do we deal with late-arriving data?
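The windowing idea above can be sketched as a tumbling-window count in plain Python. This is only an illustration, not the API of any streaming engine: `window_start`, `windowed_counts` and the sample events are hypothetical, and late data is deliberately ignored here.

```python
from collections import Counter
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def window_start(event_time, size):
    """Floor an event time to the start of its tumbling window."""
    whole_windows = (event_time - EPOCH) // size  # integer number of windows since the epoch
    return EPOCH + whole_windows * size


def windowed_counts(events, size=timedelta(minutes=5)):
    """Count words per (window start, word); each result row is one 'snapshot table' entry."""
    counts = Counter()
    for event_time, word in events:
        counts[(window_start(event_time, size), word)] += 1
    return dict(counts)


events = [
    (datetime(2021, 1, 1, 12, 3), "hello"),
    (datetime(2021, 1, 1, 12, 4), "hello"),
    (datetime(2021, 1, 1, 12, 7), "hello"),
]
snapshot = windowed_counts(events)
print(snapshot)
```

Grouping by event time rather than arrival time is what makes the question of late-arriving data unavoidable: an old event can still land in a window that was already reported.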
Watermarking
• The (in)famous word count example.
[Figure: event time (11:55 to 12:15) plotted against processing time, with triggers every 5 minutes (12:00, 12:05, 12:10, 12:15).]
• 5-minute watermark = max event time seen - 5 min.
• Events shown, each (“hello”,1), with event times 11:58, 12:03, 12:08, 12:05, 12:03 and 12:14.
• Word count for “hello” shown at successive triggers: 1, 2, 4.
• An event that falls behind the watermark is not written to the sink.
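The watermark rule can be sketched as follows. This is a simplified model, not the behaviour of a specific engine: `WatermarkedCounter` is a hypothetical class, the watermark is frozen at the start of each trigger, late events are dropped per event rather than per window, and the sample data is invented rather than taken from the slide's figure.

```python
from datetime import datetime, timedelta


class WatermarkedCounter:
    """Toy event-time word counter with a fixed-delay watermark."""

    def __init__(self, delay):
        self.delay = delay
        self.max_event_time = None
        self.counts = {}
        self.dropped = []

    @property
    def watermark(self):
        """Max event time seen so far minus the allowed delay."""
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.delay

    def process_batch(self, batch):
        """Process the (event_time, word) pairs that arrived for one trigger."""
        current = self.watermark  # frozen for the whole trigger
        for event_time, word in batch:
            if current is not None and event_time < current:
                self.dropped.append((event_time, word))  # too late: not written to the sink
                continue
            self.counts[word] = self.counts.get(word, 0) + 1
        for event_time, _ in batch:
            if self.max_event_time is None or event_time > self.max_event_time:
                self.max_event_time = event_time
        return dict(self.counts)


def at(hour, minute):
    return datetime(2021, 1, 1, hour, minute)


counter = WatermarkedCounter(delay=timedelta(minutes=5))
counter.process_batch([(at(11, 58), "hello")])
counter.process_batch([(at(12, 3), "hello")])
counter.process_batch([(at(12, 14), "hello")])                       # watermark moves to 12:09
final = counter.process_batch([(at(12, 3), "hello"), (at(12, 10), "hello")])
print(final, counter.dropped)
```

The delay is a trade-off: a longer one accepts more late data but forces the engine to keep state for old windows around for longer.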
HANDS ON DEMO
Demo
Overview
[Figure: architecture overview. Labels: Digital Platform (PaaS); MQTT - JSON; Filter and routing; Aggregates & raw data; Real-time processing; Cloud; UI.]
IKERLAN
P.º José María Arizmendiarrieta, 2 - 20500 Arrasate-Mondragón
T. +34 943712400 F. +34 943796944
THANK YOU
https://github.com/Neuw84/ada_2021/
aconde@ikerlan.es
@neuw84


Editor's notes

  1. Good afternoon, everybody. I'm Ángel Conde from IKERLAN Technology Centre. The talk I'm presenting here is called “Software Reliability on the Big Data Era, with an Industry-Minded Focus”.
  2. I will give a brief introduction about myself. I lead the Data Analytics & Artificial Intelligence team at IKERLAN. IKERLAN is a research centre and a member of the Basque Research & Technology Alliance. These are some of the topics that I work on day to day.
  3. Let's start the talk with an introduction to how Big Data started with reliability in mind.
  4. The first thing we need to take into account is that a Big Data system is a distributed system. However, we should ask ourselves this question: can a distributed system be reliable? Not really; we have all kinds of failures. That led to the famous eight fallacies of distributed computing.
  5. One can say that we have High Performance Computing clusters, but they are too expensive to process the amount of data gathered by internet companies. Moreover, such systems fail too. Then, how can we process large amounts of data in a cheap and reliable way? In 2004, Google published a paper about an approach to processing data on large clusters. Some years later, Yahoo open-sourced its implementation and Hadoop was born. The rest is history.
  6. In the MapReduce model we usually have some map steps chained with reduce steps. In this figure we can see the diagram for a word count. Word count is the “Hello World” of the Big Data paradigm. A lot of use cases can be ported to this approach, more than you may think at first sight. We can see here that the network load in the shuffle steps is important for the performance of the approach. Moreover, for each step the intermediate results are stored on a failure-tolerant storage system.
  7. Memory gets cheaper, and therefore the in-memory computing approach is born. Berkeley published a paper on one approach using this paradigm, and later on many frameworks were born using the in-memory paradigm.
  8. In Spark, tolerance to failures is achieved as follows. The first thing is that everything is immutable. The data is stored in a replicated way in memory. Before execution, a DAG is computed, trying to optimize the different steps of the computation. Moreover, the DAG steps are checkpointed as needed in order to be reliable. If no checkpoint exists, the whole DAG is recomputed. RDDs are immutable distributed collections of elements of your data that can be stored in memory or on disk across a cluster of machines. The data is partitioned across the machines in your cluster, so it can be operated on in parallel with a low-level API that offers transformations and actions. RDDs are fault tolerant because they track data lineage information to rebuild lost data automatically on failure.
  9. Next we are going to talk about the orchestrators. They are in charge of job scheduling, abstracting cluster resources, etc. In case of node failure they try to reschedule the jobs onto other nodes. Like distributed databases, they need consensus capabilities (e.g. who is the leader).
  10. Well, we are going to change our focus into distributed databases, those databases are distributed by nature and therefore we are going to make a brief introduction about their design.
  11. For distributed databases the CAP theorem is famous. This theorem says that in a distributed system you can't have all three of these features (consistency, availability and partition tolerance) at once. For example, you can have consistency and availability, but then you are not tolerant to network partitions.
  12. This theorem seems to provide an easy way to reason about these systems. However, for some combinations… it does not mean very much.
  13. But how did this trend start? The rise of distributed databases was meant to solve internet-size problems, and there are a lot of NoSQL databases for that. Those approaches provide multi-master capabilities, avoid sharding… However, the majority of them offer no ACID support (consistency) (*this can be solved by developers on the client side). And the developers wanted their SQL back (e.g. CQL) and companies wanted ACID.
  14. Google changed the landscape again in 2012 with another paper. The thing is that they have complete control of the backbone network, with multiple physical paths that provide tolerance to failures. They have an atomic clock in each datacenter in order to have a global time sync protocol. And with advanced protocols…
  15. After the famous paper, again… some open source databases have already implemented some of the paper's tricks.
  16. Well let’s move into the next point. Now I will introduce
  17. Let's start with some numbers related to the IoT and the IIoT to show why this is important. General Electric says that IIoT investment is expected to top… Accenture predicts that IIoT could add… McKinsey estimates that it will touch 43% of the global economy. About the number of things, Gartner says that 20 billion things will be installed by 2020.
  18. Let's see some of the benefits for the industry. Well, the benefits apply to the whole product life cycle, from its development to its end-of-life support. E.g.: supply and demand matching and reduction time; human resource optimization; optimization of energy and raw material consumption; manufacturing asset optimization; Overall Equipment Effectiveness; quality maximization; after sales… All of these concepts are closely related to Industry 4.0.
  19. Next, let's speak about real-time processing of IIoT data. Late data and ordering: we can have connectivity issues such as wireless mobile telecommunications, low signal, etc. Protocols: most MQTT brokers do not implement QoS 2!! CoAP is UDP based, so there is no ordering!! There are also badly designed local acquisition systems. Therefore, if we are doing real-time processing of IIoT data, we need a tool that enables us to work easily on unordered incoming data and to build filters for duplicates easily.
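To make the duplicate filter and the reordering concrete, here is a small sketch in plain Python. The message fields (`device_id`, `seq`, `event_time`, `temp`) are assumptions made for this example and are not taken from the demo; the idea is only that a message redelivered by an at-least-once transport can be dropped by remembering a key per message, and that ordering can be restored from the timestamp set at the source.

```python
def deduplicate(messages):
    """Keep the first copy of each (device_id, seq) pair and drop redeliveries."""
    seen = set()
    unique = []
    for msg in messages:
        key = (msg["device_id"], msg["seq"])
        if key in seen:
            continue  # duplicate delivery: skip it
        seen.add(key)
        unique.append(msg)
    return unique


# Messages as they might arrive: out of order and with one redelivery.
incoming = [
    {"device_id": "m1", "seq": 2, "event_time": 20, "temp": 71.0},
    {"device_id": "m1", "seq": 1, "event_time": 10, "temp": 70.5},
    {"device_id": "m1", "seq": 2, "event_time": 20, "temp": 71.0},
    {"device_id": "m2", "seq": 1, "event_time": 15, "temp": 64.2},
]

# Restore event-time order after removing duplicates.
ordered = sorted(deduplicate(incoming), key=lambda m: m["event_time"])
print([(m["device_id"], m["seq"]) for m in ordered])
```

On an unbounded stream the `seen` set would grow forever, so a real pipeline has to expire old keys at some point.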
  20. Next I am going to explain the concept of event time and watermarking for late data. The watermark is a moving threshold in event time that trails behind the maximum event time seen by the query in the processed data.
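The definition above can be sketched in a few lines of plain Python. This is a per-event simplification with hypothetical names (`WatermarkTracker`, `allowed_lateness`); a real streaming engine typically moves the watermark once per batch rather than on every event, but the arithmetic is the same: watermark = maximum event time seen minus the allowed lateness.

```python
class WatermarkTracker:
    """Tracks the maximum event time seen and derives the watermark from it."""

    def __init__(self, allowed_lateness):
        self.allowed_lateness = allowed_lateness
        self.max_event_time = None

    @property
    def watermark(self):
        # The watermark trails the maximum event time by the allowed lateness.
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.allowed_lateness

    def accept(self, event_time):
        """Return True if the event is processed, False if it is dropped as too late."""
        current = self.watermark
        if current is not None and event_time < current:
            return False
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return True


tracker = WatermarkTracker(allowed_lateness=10)
event_times = [100, 105, 97, 120, 109, 111]
decisions = [tracker.accept(t) for t in event_times]
print(list(zip(event_times, decisions)), tracker.watermark)
```

Here the event at time 97 is late but still inside the threshold, so it is kept, while the event at time 109 arrives after the maximum has moved to 120 and falls behind the watermark of 110.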
  21. Well, in this demo we are using some of the Big Data open source tools. For example, we are using NiFi for ingestion and routing, Kafka for messaging and decoupling, Spark for real-time processing, Cassandra as backend storage, Zeppelin as our web interface, and the open source MQTT broker called Mosquitto.
  22. The architecture is the following: fake sensor data from two machines is sent to an MQTT broker running on the cloud. This data contains machine status, temperature, etc. From there, the MQTT data is ingested via NiFi and sent to one of two topics depending on the machine status. Then we have the real-time processing engine, Spark. This component makes it possible to do real-time analytics on incoming data and store the results in Cassandra. For the demo we will use Zeppelin as a way to interact with Spark and Cassandra, providing a useful user interface for our analytics. This kind of architecture or digital platform can run on any cloud or on-premises.
  23. We have come to the end of the demo. I'd just like to thank you for listening and let you know that all the code of this demo is already on GitHub. Now I would be pleased to take your comments and questions.