Big Data Applications
Juan Pablo Paz Grau, PhD, PMP.
Juan Pablo Paz Grau, PhD, PMP
Systems Engineer
Specialist in Information Systems Management
PhD in Software Engineering
Certified in ITIL Foundation, PMP
I currently work at LG CNS Colombia
LG CNS Colombia is the IT partner of the SIRCI operation
The SIRCI Operation = Transmilenio Operation
Transmilenio is the world-renowned reference for BRT systems
The biggest public traffic system operation in Colombia
Presentation Agenda
1. What is Big Data?
2. Large Dataset Management Techniques
3. Hadoop Cluster Architecture
4. Closing the Loop: Real Time Cluster Architecture
5. The Development Process for Big Data Systems
6. Showcase of Big Data Tools for Public Traffic Systems
What is Big Data?
The DIKW Triangle
What is Big Data?
Information displayed
to final users
Data generated to
provide information
displayed to final
users
…
What is Big Data?
• Organizations produce lots of
data while they operate their
Information Systems
• Log files
• Access log files
• Debug log files
• Temporal, transient data
• Transactional data
• Usually, this data is stored
temporarily only for debugging
or incident analysis purposes
• With the increasing capacity to
store data, this data is being
reviewed and considered a
valuable source of information
Large Dataset Management Techniques
Very small intro to Hadoop
Cheap, reliable storage of
big datasets in commodity
hardware
A framework to parallelize
big data processing and
analysis
What is Hadoop?
Large Dataset
Large Dataset Management Techniques
Very small intro to Hadoop: Hadoop Distributed File System (HDFS)
File is split into
data blocks
File metadata and block
locations are stored in the
name node
Data blocks are physically
stored in data nodes
Block B:
• If Data Node 0 fails, there is another
copy in the same rack at Data Node 1
• If the rack fails, there is still another
copy in another rack at Data Node 2
Rack 1 Rack 2
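The block placement described on this slide can be sketched in a few lines of Python. This is a toy model, not the HDFS API: the class names, the 4-byte block size, and the placement rule (one node, a second node in the same rack, a third in another rack) are assumptions taken from the slide's diagram, and a real cluster's placement policy may differ.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # bytes; tiny on purpose (real HDFS blocks are far larger)


@dataclass
class DataNode:
    """Hypothetical data node: physically stores block contents."""
    name: str
    rack: str
    blocks: dict = field(default_factory=dict)  # block id -> bytes
    alive: bool = True


class NameNode:
    """Hypothetical name node: stores only file metadata and block locations."""

    def __init__(self, data_nodes):
        self.data_nodes = data_nodes
        self.file_blocks = {}      # file name -> list of block ids
        self.block_locations = {}  # block id -> list of DataNode replicas

    def _pick_replicas(self, index):
        # Placement as drawn on the slide (an assumption of this sketch):
        # one node, a second node in the same rack, a third in another rack.
        first = self.data_nodes[index % len(self.data_nodes)]
        same_rack = next(n for n in self.data_nodes
                         if n.rack == first.rack and n is not first)
        other_rack = next(n for n in self.data_nodes if n.rack != first.rack)
        return [first, same_rack, other_rack]

    def write(self, filename, data):
        block_ids = []
        for offset in range(0, len(data), BLOCK_SIZE):
            index = offset // BLOCK_SIZE
            block_id = f"{filename}#{index}"
            replicas = self._pick_replicas(index)
            for node in replicas:
                node.blocks[block_id] = data[offset:offset + BLOCK_SIZE]
            self.block_locations[block_id] = replicas
            block_ids.append(block_id)
        self.file_blocks[filename] = block_ids

    def read(self, filename):
        out = b""
        for block_id in self.file_blocks[filename]:
            # Use the first replica that is still reachable; this raises
            # StopIteration if every replica of a block is down.
            node = next(n for n in self.block_locations[block_id] if n.alive)
            out += node.blocks[block_id]
        return out
```

With two nodes in each of two racks, marking one node (or both nodes of one rack) as not alive still lets `read` return the full file, which is the failure scenario the slide illustrates for Block B.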
Large Dataset Management Techniques
• Very small intro to Hadoop: Map Reduce
Map: Select data that
matches a given criterion
(Status = Trip). The map
function returns a set of
{Key,Value} pairs
Shuffle: Collect and
sort the mapped pairs
Reduce: Apply a
reduce function (Sum
distance) for each key
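The three phases above can be illustrated with a single-process Python sketch. It only mirrors the logic; it does not use Hadoop, and the record fields (`bus`, `status`, `distance_km`) and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical input records; the field names are assumptions of this sketch.
records = [
    {"bus": "B1", "status": "Trip", "distance_km": 12.5},
    {"bus": "B2", "status": "Idle", "distance_km": 0.0},
    {"bus": "B1", "status": "Trip", "distance_km": 7.5},
    {"bus": "B2", "status": "Trip", "distance_km": 3.0},
]


def map_phase(record):
    """Map: keep records with Status = Trip and emit (key, value) pairs."""
    if record["status"] == "Trip":
        yield record["bus"], record["distance_km"]


def shuffle_phase(pairs):
    """Shuffle: collect the mapped pairs by key and sort them by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reduce: apply the reduce function (sum of distance) for one key."""
    return key, sum(values)


def run_job(records):
    mapped = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle_phase(mapped))


totals = run_job(records)  # {"B1": 20.0, "B2": 3.0}
```

In a real cluster the map and reduce functions run in parallel on many nodes, and the shuffle moves data between them over the network; here everything runs in one process for clarity.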
Large Dataset Management Techniques
Very small intro to Hadoop: The Hadoop ecosystem
• Currently, there is a plethora of tools to work
with Big Data on top of Hadoop.
• The tools and frameworks selection will vary
depending on the implementation of the cluster.
Hadoop Cluster Architecture
The Lambda Architecture
Application
Data Access
Batch | Speed
Data
• Data layer: A data model and a set of data stored
following the data model. The data model should
be designed for the targeted subsystem.
• Batch layer: The computation layer that
processes data to turn facts into views for
querying the underlying stored data.
• Speed layer: A real-time computation layer that
compensates for the latency of the batch layer.
• Data Access layer: The engines, tools and
drivers that expose views to applications and
manage queries.
• Application layer: The front-end application or
applications that present information to users of
the Big Data system.
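How the batch and speed layers combine at query time can be sketched as follows. This is a deliberately simplified, in-memory illustration: the class and method names are hypothetical, the "facts" are just station names, and a real batch run takes time during which new facts keep arriving.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda architecture that counts events per station (hypothetical)."""

    def __init__(self):
        self.master = []              # data layer: append-only facts
        self.batch_view = Counter()   # batch layer output, recomputed from all facts
        self.speed_view = Counter()   # speed layer: facts not yet in the batch view

    def ingest(self, station):
        self.master.append(station)
        self.speed_view[station] += 1  # incremental, low-latency update

    def run_batch(self):
        # Recompute the view from the full history, then drop the speed view,
        # because every fact it covered is now part of the batch view.
        self.batch_view = Counter(self.master)
        self.speed_view.clear()

    def query(self, station):
        # Data access layer: merge both views to answer the application.
        return self.batch_view[station] + self.speed_view[station]
```

After `ingest("A")`, `ingest("A")`, `run_batch()`, `ingest("A")`, the call `query("A")` returns 3: two events come from the batch view and one from the speed view.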
Hadoop Cluster Architecture
Data Serialization
[Diagram: several Source Systems, each with a Data Serialization step, and a Data Lake holding Raw Data]
Data Access: Hive, Hadoop Data Warehouse
Hadoop Cluster Architecture
• Built on top of Hadoop
• Eases the tasks of managing data in Hadoop
• Manage files and schemas as tables
• Internal tables: Files managed by Hive
• External tables: Files located outside
of Hive but which can be analyzed with
Hive
• Provides a SQL-like language to query data
stored in files
• Translates HiveQL language requests
into MapReduce jobs
HiveQL
Load Transform Dump
Data Access: Pig, Data Processing Language
Hadoop Cluster Architecture
• Built on top of Hadoop
• Eases the tasks of data processing and
analysis
• Capable of working with any type of data
source
• Provides a scripting language to process and
transform data
Pig Latin
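The load, transform, dump flow that Pig scripts follow can be approximated in plain Python. This is an analogy only, not Pig Latin and not a Pig API: the CSV layout, the field names, and the function names are assumptions of this sketch.

```python
import csv
import io

# Hypothetical raw input; in a cluster this would be a file stored in HDFS.
RAW = "bus,status,distance_km\nB1,Trip,12.5\nB2,Idle,0\nB1,Trip,7.5\n"


def load(text):
    """Load: read raw delimited text into records."""
    return csv.DictReader(io.StringIO(text))


def transform(rows):
    """Transform: filter the records and convert field types."""
    for row in rows:
        if row["status"] == "Trip":
            yield {"bus": row["bus"], "distance_km": float(row["distance_km"])}


def dump(rows):
    """Dump: materialize the result so it can be inspected or stored."""
    return list(rows)


result = dump(transform(load(RAW)))
```

Each stage consumes the previous stage's output lazily, which loosely resembles how a data-flow script describes a pipeline before any job is actually run.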
Hadoop Cluster Architecture
Hive
• Works with structured data
• Can index data
• HiveQL, a SQL-like access language
• Turns the HiveQL input into MapReduce
jobs
Pig
• Works with structured/unstructured data
• Cannot index data
• Pig Latin, a scripting language
• Turns the Pig Latin input into MapReduce
jobs
Hive / Pig Comparison
Closing the Loop: Real Time Cluster Architecture
Why?
1. Hadoop is intended to store history, not changing data (write
once, read many times)
2. Batch processing of data usually takes a long time to produce
summarized output data
3. Capability to provide real time processing of Big Data is also
desirable in the Lambda architecture
4. There is a need for a solution that covers the gap
between the data already in the Hadoop cluster and the new data being
generated
Data available
in Hadoop
New data
being created
New data
stored in
Hadoop
Data
Gap
Time
Closing the Loop: Real Time Cluster Architecture
Cassandra: Accessing the Cluster
CQL Driver
CQL
1. Access used to be through a Thrift client; now it is through a CQL client
2. CQL (Cassandra QL), a small SQL-like query language
3. Driver is not JDBC-like!
Cassandra: Data Model
1. Row-oriented, instead of column-oriented
2. Each row is identified by a key
3. Each key accesses a collection of columns
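The data model in the list above (a key that identifies a row, and a row that holds a collection of columns) maps naturally onto nested dictionaries. The sketch below is an in-memory illustration only; it does not use a Cassandra driver, and the class name and column names are hypothetical.

```python
class TripTable:
    """Toy keyed-row store: row key -> {column name: value} (hypothetical)."""

    def __init__(self):
        self._rows = {}

    def insert(self, key, columns):
        # Rows are created on first write; later writes add or overwrite columns.
        self._rows.setdefault(key, {}).update(columns)

    def get(self, key, column=None):
        row = self._rows.get(key, {})
        # Return a copy of the whole row, or a single column value.
        return dict(row) if column is None else row.get(column)


table = TripTable()
table.insert("bus-B1", {"status": "Trip", "distance_km": 12.5})
table.insert("bus-B2", {"status": "Idle"})   # rows need not share columns
table.insert("bus-B1", {"station": "Portal Norte"})
```

Note that the two rows end up with different sets of columns, which is the main practical difference from a fixed-schema relational table.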
The Development Process for Big Data Systems
Development Process: System Implementation
Hadoop Cluster Architecture
Master Node
• Resource Manager
• Name Node
• Hive Server
• Sqoop
• Apache Tomcat
• MySQL Server
Worker Nodes (4), each running:
• Data Node
• Node Manager
• Cassandra Node
Now, we have the cluster services up and running,
and data is flowing into our Big Data repository.
What's next?
Showcase of Big Data Tools for Public Traffic Systems