A summarized version of a presentation on Big Data architecture, covering topics from the Big Data concept to Hadoop and tools such as Hive, Pig, and Cassandra
2. Juan Pablo Paz Grau, PhD, PMP
Systems Engineer
Specialist in Information Systems Management
PhD in Software Engineering
Certified in ITIL Foundation, PMP
Currently, I work at LG CNS Colombia
LG CNS Colombia is the IT partner of the SIRCI operation
The SIRCI Operation = Transmilenio Operation
Transmilenio is the world-renowned reference for BRT systems
The biggest public traffic system operation in Colombia
3. Presentation Agenda
1. What is Big Data?
2. Large Dataset Management Techniques
3. Hadoop Cluster Architecture
4. Closing the Loop: Real Time Cluster Architecture
5. The Development Process for Big Data Systems
6. Showcase of Big Data Tools for Public Traffic Systems
5. What is Big Data?
Information displayed to final users
Data generated to provide information displayed to final users
…
6. What is Big Data?
• Organizations produce lots of data while they operate their Information Systems
• Log files
• Access log files
• Debug log files
• Temporary, transient data
• Transactional data
• Usually, this data is stored temporarily only for debugging or incident analysis purposes
• With the increasing capacity to store data, this data is being reviewed and considered a valuable source of information
7. Large Dataset Management Techniques
Very small intro to Hadoop
What is Hadoop?
• Cheap, reliable storage of big datasets on commodity hardware
• A framework to parallelize big data processing and analysis
8. Large Dataset Management Techniques
Very small intro to Hadoop: Hadoop Distributed File System (HDFS)
• A file is split into data blocks
• File metadata and block locations are stored in the name node
• Data blocks are physically stored in data nodes
Block B:
• If Data Node 0 fails, there is another copy in the same rack at Data Node 1
• If the rack fails, there is still another copy in another rack at Data Node 2
Rack 1 | Rack 2
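The block splitting and rack-aware replication described on this slide can be sketched in a few lines of Python. This is a toy model under stated assumptions, not HDFS code: the `DataNode` class, the function names, the tiny block size, and the assignment of nodes to "Rack 1" and "Rack 2" are all hypothetical, and real HDFS picks replica locations with its own placement policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataNode:
    name: str
    rack: str


def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(nodes: list) -> list:
    """Pick three nodes: two on one rack and a third on a different rack."""
    first = nodes[0]
    same_rack = next(n for n in nodes if n.rack == first.rack and n != first)
    other_rack = next(n for n in nodes if n.rack != first.rack)
    return [first, same_rack, other_rack]


def surviving_replicas(replicas, failed_nodes=(), failed_racks=()):
    """Return the replicas still reachable after node or rack failures."""
    return [
        n for n in replicas
        if n.name not in failed_nodes and n.rack not in failed_racks
    ]


# Hypothetical cluster layout matching the slide's example.
nodes = [
    DataNode("Data Node 0", "Rack 1"),
    DataNode("Data Node 1", "Rack 1"),
    DataNode("Data Node 2", "Rack 2"),
]

blocks = split_into_blocks(b"ABCDEFGHIJ", block_size=4)

# Plays the role of the name node: block index -> replica locations.
block_locations = {i: place_replicas(nodes) for i in range(len(blocks))}

print(surviving_replicas(block_locations[1], failed_nodes=("Data Node 0",)))
print(surviving_replicas(block_locations[1], failed_racks=("Rack 1",)))
```

In this sketch the dictionary stands in for the name node's metadata, while the block bytes themselves would live on the data nodes.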
9. Large Dataset Management Techniques
Very small intro to Hadoop: MapReduce
• Map: Select data that matches given criteria (Status = Trip). The map function returns a set of {Key,Value} pairs
• Shuffle: Collect and sort the mapped pairs
• Reduce: Apply a reduce function (Sum distance) for each key
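The three phases can be imitated on a single machine with plain Python. This is only a sketch of the idea, not Hadoop's API: the record fields (`bus`, `status`, `distance`) and the sample values are made up for illustration, and the bus identifier is assumed to be the key.

```python
from collections import defaultdict

# Hypothetical input records; field names and values are illustrative only.
records = [
    {"bus": "B1", "status": "Trip", "distance": 12.5},
    {"bus": "B2", "status": "Idle", "distance": 0.0},
    {"bus": "B1", "status": "Trip", "distance": 7.5},
    {"bus": "B2", "status": "Trip", "distance": 3.0},
]


def map_phase(rows):
    """Map: keep rows with Status = Trip and emit (key, value) pairs."""
    for row in rows:
        if row["status"] == "Trip":
            yield row["bus"], row["distance"]


def shuffle_phase(pairs):
    """Shuffle: collect the values of each key and sort by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Reduce: sum the distances collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


totals = reduce_phase(shuffle_phase(map_phase(records)))
print(totals)
```

In a real cluster the map and reduce functions run in parallel on many nodes, and the framework performs the shuffle between them.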
10. Large Dataset Management Techniques
Very small intro to Hadoop: The Hadoop ecosystem
• Currently, there is a plethora of tools to work with Big Data on top of Hadoop.
• The selection of tools and frameworks will vary depending on the implementation of the cluster.
11. Hadoop Cluster Architecture
The Lambda Architecture
Application
Data Access
Batch | Speed
Data
• Data layer: A data model and a set of data stored following the data model. The data model should be designed for the targeted subsystem.
• Batch layer: The computation layer that processes data to turn facts into views for querying the underlying stored data.
• Speed layer: A real-time computation layer that compensates for the latency of the batch layer.
• Data Access layer: The engines, tools and drivers that expose views to applications and manage queries.
• Application layer: The front-end application or applications that present information to users of the Big Data system.
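How the batch and speed layers complement each other can be shown with a small Python sketch. It assumes a simple counting use case; the event tuples, `build_view`, and `query` are hypothetical names chosen for illustration and do not come from any specific Lambda Architecture framework.

```python
from collections import Counter

# Data layer: facts already stored (hypothetical passenger counts per route).
master_dataset = [("route_A", 3), ("route_B", 5), ("route_A", 2)]

# Facts that arrived after the last batch run and are not yet in a batch view.
recent_events = [("route_A", 1), ("route_C", 4)]


def build_view(events):
    """Turn raw facts into a queryable view: total count per key."""
    view = Counter()
    for key, count in events:
        view[key] += count
    return view


batch_view = build_view(master_dataset)  # batch layer: complete but stale
speed_view = build_view(recent_events)   # speed layer: recent data only


def query(key):
    """Data access layer: merge both views to answer with fresh totals."""
    return batch_view[key] + speed_view[key]


print(query("route_A"), query("route_B"), query("route_C"))
```

When the next batch run absorbs the recent events, the speed view for that period can be discarded, since the batch view then covers it.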
12. Hadoop Cluster Architecture
Data Serialization
[Diagram: Source Systems, Data Serialization, Raw Data, Data Lake]
13. Hadoop Cluster Architecture
Data Access: Hive, Hadoop Data Warehouse
• Built on top of Hadoop
• Eases the tasks of managing data in Hadoop
• Manages files and schemas as tables
• Internal tables: Files managed by Hive
• External tables: Files located outside of Hive but which can be analyzed with Hive
• Provides a SQL-like language (HiveQL) to query data stored in files
• Translates HiveQL requests into MapReduce jobs
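To give a feel for HiveQL, the following Python sketch builds the text of two table definitions and one query. It does not connect to Hive; the table name `trips`, its columns, and the path `/data/trips` are hypothetical, and `create_table_ddl` is a helper invented for this example.

```python
def create_table_ddl(name, columns, location=None):
    """Build a HiveQL CREATE TABLE statement as a string.

    With a location, the table is declared EXTERNAL: Hive only records the
    schema and reads the files where they already are. Without one, Hive
    manages the table's files itself (an internal table).
    """
    cols = ", ".join(f"{col} {col_type}" for col, col_type in columns)
    kind = "EXTERNAL TABLE" if location else "TABLE"
    ddl = (
        f"CREATE {kind} {name} ({cols}) "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','"
    )
    if location:
        ddl += f" LOCATION '{location}'"
    return ddl


# Hypothetical schema for trip records.
columns = [("bus_id", "STRING"), ("status", "STRING"), ("distance_km", "DOUBLE")]

internal_ddl = create_table_ddl("trips", columns)
external_ddl = create_table_ddl("trips", columns, location="/data/trips")

# A SQL-like query that Hive would translate into MapReduce jobs.
query = (
    "SELECT bus_id, SUM(distance_km) FROM trips "
    "WHERE status = 'Trip' GROUP BY bus_id"
)

print(internal_ddl)
print(external_ddl)
print(query)
```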
14. Hadoop Cluster Architecture
Data Access: Pig, Data Processing Language
• Built on top of Hadoop
• Eases the tasks of data processing and analysis
• Capable of working with any type of data source
• Provides a scripting language (Pig Latin) to process and transform data
Load → Transform → Dump
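The load, transform, dump flow of a Pig script can be mimicked in Python as a chain of small steps. This is an analogy rather than Pig itself: the sample data and field names are invented, and the Pig Latin statements in the comments are only rough equivalents of each step.

```python
import csv
import io

# Hypothetical raw input: bus id, status, distance in km.
RAW = "B1,Trip,12.5\nB2,Idle,0.0\nB1,Trip,7.5\n"


def load(text):
    """Load: parse delimited text into records.

    Rough Pig Latin equivalent:
    trips = LOAD 'trips.csv' USING PigStorage(',')
            AS (bus_id:chararray, status:chararray, distance_km:double);
    """
    for bus_id, status, distance in csv.reader(io.StringIO(text)):
        yield {"bus_id": bus_id, "status": status, "distance_km": float(distance)}


def transform(rows):
    """Transform: keep only trip records.

    Rough Pig Latin equivalent: moving = FILTER trips BY status == 'Trip';
    """
    return (row for row in rows if row["status"] == "Trip")


def dump(rows):
    """Dump: materialize and print the result (Pig Latin: DUMP moving;)."""
    result = list(rows)
    for row in result:
        print(row)
    return result


result = dump(transform(load(RAW)))
```

As in Pig, nothing is computed until the final step asks for output: the generators are only consumed when `dump` runs.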
15. Hadoop Cluster Architecture
Hive / Pig Comparison
Hive
• Works with structured data
• Can index data
• HiveQL, a SQL-like access language
• Turns the HiveQL input into MapReduce jobs
Pig
• Works with structured/unstructured data
• Cannot index data
• Pig Latin, a scripting language
• Turns the Pig Latin input into MapReduce jobs
16. Closing the Loop: Real Time Cluster Architecture
Why?
1. Hadoop is intended to store history, not changing data (write once, read many times)
2. Batch processing of data usually takes a long time to produce summarized output data
3. The capability to provide real-time processing of Big Data is also desirable in the Lambda architecture
4. A solution is needed to cover the time gap between the data already in the Hadoop cluster and the new data being generated
[Timeline diagram: Data available in Hadoop; New data being created; New data stored in Hadoop; Data Gap; Time]
17. Closing the Loop: Real Time Cluster Architecture
Cassandra: Accessing the Cluster
CQL Driver
CQL
1. Access used to be through a Thrift client; now it is through a CQL client
2. CQL (Cassandra Query Language), a very small subset of SQL
3. The driver is not JDBC-like!
Cassandra: Data Model
1. Row-oriented, instead of column-oriented
2. Each row is identified by a key
3. Each key accesses a collection of columns
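The data model above can be pictured as a dictionary of dictionaries. The Python sketch below is a toy in-memory model only: it does not use a Cassandra driver, and the keys, column names, and the `upsert` and `get` helpers are hypothetical.

```python
# Row key -> collection of columns (column name -> value).
rows = {}


def upsert(key, **columns):
    """Write columns for a row key, creating the row if it does not exist."""
    rows.setdefault(key, {}).update(columns)


def get(key, column=None):
    """Read a whole row by key, or a single column of that row."""
    columns = rows[key]
    return columns if column is None else columns[column]


upsert("bus:B1", route="A", status="Trip")
upsert("bus:B1", status="Idle")   # overwrites one column, keeps the others
upsert("bus:B2", route="C")       # rows need not share the same columns

print(get("bus:B1"))
print(get("bus:B2", "route"))
```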
18. The Development Process for Big Data Systems
Development Process: System Implementation
Hadoop Cluster Architecture
Master Node
• Resource Manager
• Name Node
• Hive Server
• Sqoop
• Apache Tomcat
• MySQL Server
Worker Nodes (×4), each running:
• Data Node
• Node Manager
• Cassandra Node
19. Now we have the cluster services up and running, and data is flowing into our Big Data repository.
What's next?
Showcase of Big Data Tools for Public Traffic Systems