SlideShare una empresa de Scribd logo
1 de 74
Architecting Your First Big
Data Implementation
Bob Wakefield
Principal
bob@MassStreet.net
Twitter:
@BobLovesData
Who is Mass Street?
• Boutique data consultancy
• We work on data problems big or “small”
• We have a focus on helping organizations
move from a batch to a real time paradigm.
• Free Big Data training
Mass Street Partnerships and
Capability
•Hortonworks Partner
•Confluent Partner
•ARG Back Office
Bob’s Background
• IT professional 16 years
• Currently working as a Data Engineer
• Education
• BS Business Admin (MIS) from KState
• MBA (finance concentration) from KU
• Coursework in Mathematics at Washburn
• Graduate certificate Data Science from Rockhurst
•Addicted to everything data
Follow Me!
•Personal Twitter: @BobLovesData
•Company Twitter: @MassStreet
•Blog: DataDrivenPerspectives.com
•Website: www.MassStreet.net
•Facebook: @MassStreetAnalyticsLLC
Upcoming MeetUps
October: Joint MeetUp with Data Science KC. Reproducible
Research with R, The Tidyverse, Notebooks, and Spark
November: Joint MeetUp with Kansas City Apache Spark
MeetUp. Practicing IoT skills with Satori. (Still in the works.)
December: Survey of common data science algorithms.
This Evening’s Learning
Objectives
Nerd alert!
Resources For This Evening’s
Material: Books
•Hadoop
Application
Architectures
•(it’s a little dated)
Resources For This Evening’s
Material: Online Classes
•Architect and Build Big Data
Applications
•Analytic Data Storage in
Hadoop
Buzzword definitions
• Distributed – many computers working on the
same problem
• Commodity Hardware – This does not mean
cheap. Means Cheaper.
• Schema-on-read – structure of data imposed at
processing time.
Buzzword definitions
• Linear Scale – Adding more boxes gets you performance
= mx+b vs. diminishing returns
What is Big Data?
Definition v1:
volume, velocity, variety
What is Big Data?
Definition v2:
When you have so much data
you can’t analyze it by
traditional means.
What is Big Data?
SELECT address, COUNT(*)
FROM customers
GROUP BY address
What is Big Data?
Definition v3:
When you have so much
data, you can’t get things
done in a reasonable amount
of time.
What Is Hadoop?
1. Distributed computing for the masses.
2. Is the backbone to many distributed
applications.
3. Can mean a lot of things depending on
context.
1. Could mean Hadoop core.
2. Could mean the entire set of Big Data
tools.
3. Could mean the MapReduce framework.
What is Hadoop?
• Provides distributed fault tolerant data
storage
• Provides linear scalability on commodity
hardware
• Translation: Take all of your data, throw it
across a bunch of cheap machines, and
analyze it. Get more data? Add more
machines
It’s like a child’s erector set!
What kind of problems does
Hadoop help me solve?
• Easier analysis of big data
• Lower TCO of data
• Offloading ETL workloads
• Transition from batch to real time
• It lets you see the future.
What do I need to build big data
competency?
• Engineers
• Things aren’t very point and clicky right now.
• Java vs. Python vs. Scala
• Linux experts
• Data Analyst
• These folks should be internal.
• You can train a data scientist.
• Python vs. R
Some questions to ask before you start
your first big data project.
• Use these questions to drive your architecture.
• What kind of data do I have?
• Highly organized vs. super messy.
• Critical data that needs to move fast vs. data that can
wait.
• How fast do I want to implement a solution?
• Is there low hanging fruit?
• Who do I have vs. who do I need to hire?
• Who is trainable?
• Should I get outside help?
Some questions to ask before you start
your first big data project.
• What is my current tech stack?
• Windows vs. Linux
• On Prim Vs. Cloud
• If you’re on prim do you have rack space?
• What are my current BI Tools?
• Excel vs. Microstrategy (free now!) vs. Bime vs.
Tableau
• What is managing my application security?
• LDAP vs. Active Directory
Big Data Software and Tools
• Most of it is free open source.
• Despite being open source, it’s high quality.
• A lot of projects are part of the Apache Software
Foundation.
• Several projects have dozens of committers.
• A cottage industry has sprung up around open
source.
Cloud Solutions
• AWS (S3)
• MS Azure
• Google Cloud (Big Query, Big
Table)
• Databricks
• Confluent (Kafka Saas)
Cloud Solutions Bob’s Thoughts
• Great for tactical solutions.
• If you’re a smaller organization, cloud
offers a managed solution.
• Can be a hassle. You need top flight
network engineers.
• Moving to the cloud is not something to do
lightly. It requires long term thinking.
• I have no idea why anybody waste time with
on prim servers anymore.
NoSQL Databases
If you’re interested in
learning more about
NoSQL, I suggest:
NoSQL Distilled by
Martin Fowler and
Pramod J. Sadalage
NoSQL Databases
• Easy ETL
• Easy schema evolution
• Impedance mismatch solved
Types of NoSQL Databases
• Key Value – A simple has table, primary used when all access to
the database is via primary key.
• Document – The database stores and retrieves documents, which
can be XML, JSON, BSON
• Column-family - Data is stored in columns
• Graph – allows you to store relationships between entities
Issues with NoSQL Databases
• They do NOT work like relational databases
• The guarantees that you are used to aren’t there
• CAP theory (consistency, availability, partition
tolerance)
• Each database has it’s own query language
• That language will frequently look NOTHING like
SQL
Examples of NoSQL Databases
Document Key Value Column Store Graph
Mongo DB Apache Accumulo Apache Accumulo Neo4J
Redis (in memory) Druid
Hive?
Cassandra (DataStax)
Hbase
New SQL Databases
• Distributed versions of databases you’re used to
• New concept Hybrid Transactional/Analytical
Processing (HTAP)
• Generally two types
• In Memory
• Massively Parallel Processing (MPP)
New SQL Databases
• In Memory Databases
• MemSQL
• In Memory Databases: A Real Time
Analytics Solution
• MassStreet.net -> YouTube
• VoltDB
• NuoDB
New SQL Databases
• MPP Databases
• More suited to analytics
• Examples
• Greenplumb (has gone open source!)
• SQL Server PDW
• MySQL Cluster CGE
Hadoop Distributions
• Get. A. Distro!
• What is a distro?
• Popular big data software bundled together and installed as a
total package.
• Popular distros:
• Hortonworks
• Cloudera
• MapR
• Various Others
• Most important difference is the support package
• You can download and use distros without buying a support
package.
Hadoop Core
• Hadoop Distributed File System (HDFS) – a
distributed file-system that stores data on commodity
machines, providing very high aggregate bandwidth across
the cluster.
• Hadoop YARN – a resource-management platform
responsible for managing computing resources in clusters
and using them for scheduling of users’ applications.
• Hadoop MapReduce – a programming model for large
scale data processing.
Hadoop Core
• Apache ZooKeeper – A highly available system for
coordinating distributed processes. Distributed
applications use ZooKeeper to store and mediate updates
to important configuration information.
• Apache Tez - Generalizes the MapReduce paradigm to a
more powerful framework for executing a complex DAG
(directed acyclic graph) of tasks for near real-time big data
processing.
File Formats
• You can’t just drop CSVs onto HDFS
• That’s super inefficient
• Pick a file format with the following requirements
• Compressible
• Splitable
• You have options on compression codecs
• Snappy is popular
• Ultimately an engineering decision
File Formats
• Apache Avro
• Language neutral data serialization
• Serializes data in a compact binary format
• Fairly powerful tool
• Useful for schema evolution
• Apache Parquet
• General purpose storage format
• Column oriented
• Stores metadata thus self documenting
File Formats
• Avro is good for storing transactional workloads
• Parquet is good for storing analytical workloads
• When in doubt, use Avro
• Both can be used
• Read and write Parquet with Avro APIs
IMPORTANT!
• Data in HDFS is IMMUTABLE
• There is no random access
• If you need that, you’ll have to use a database technology
• A lot of big data databases use HDFS to store the actual
data file.
Batch Data Flow Tools
• Apache Pig – A platform for processing and analyzing large data sets. Pig consists of a high-
level language (Pig Latin) for expressing data analysis programs paired with the MapReduce
framework for processing these programs.
• Cascading - software abstraction layer for Apache Hadoop and Apache Flink. Cascading is
used to create and execute complex data processing workflows on a Hadoop cluster using any
JVM-based language (Java, JRuby, Clojure, etc.), hiding the underlying complexity of
MapReduce jobs. It is open source and available under the Apache License. Commercial
support is available from Driven, Inc.
• Apache Sqoop – Sqoop is a tool that speeds and eases movement of data in and out of
Hadoop. It provides a reliable parallel load for various, popular enterprise data sources.
Real Time Processing
• THIS IS THE
QUANTUM LEAP
GAME CHANGER IN
YOUR DATA
MANAGEMENT
STRATEGY!!!!!
• As a first step, read I
(*heart*) Logs by Jay
Kreps
Central Concepts
• The log as a unifying abstraction
• Everything in your org is an event that
can be logged, captured, and
transported
Real Time Processing Considerations
• Consider a radically different data architecture
• Microservices?
• Apache Kafka is critical in any streaming
architecture
• True Streaming vs. Microbatch
• I don’t think microbatch is a thing anymore
• Delivery guarantees
• At least once
• At most once
• Exactly once
Real Time Processing
• Apache Kafka – Kafka is a fast and scalable
publish-subscribe messaging system that is often
used in place of traditional message brokers
because of its higher throughput, replication, and
fault tolerance.
• Commercially backed by Confluent
• Open source and Enterprise Versions
Radically Different Architecture
Radically Different Architecture
Real Time Processing
• Apache Spark Structured Streaming -
scalable and fault-tolerant stream processing
engine built on the Spark SQL engine.
• Apache Storm – Storm is a distributed real-time
computation system for processing fast, large
streams of data adding reliable real-time data
processing capabilities to Apache Hadoop 2.x.
• Apache Flume – Flume allows you to efficiently
aggregate and move large amounts of log data from
many different sources to Hadoop.
Real Time Processing
• Other real time processing frameworks
• Apache Flink
• Apache Samza
• They all have benefits and drawbacks
Interesting Point and Click Tools
• Talend
• Kafka/Spark streaming product
• Streamsets
• Connector to SQL Server CDC
• NiFi
Data Access and Processing
• Apache Spark – Spark is ideal for in-memory data processing. It allows data
scientists to implement fast, iterative algorithms for advanced analytics such as
clustering and classification of datasets.
• Apache Mahout – Mahout provides scalable machine learning algorithms for
Hadoop which aids with data science for clustering, classification and batch based
collaborative filtering.
• Hue – Hadoop User Experience. Open source interface to the Hadoop ecosystem.
• Provides interfaces and code editors for various projects.
• Prefer it for interacting with data over Ambari.
• If you’re not using Cloudera, it’s a little hard to install.
• Apache Zeppelin – really powerful notebook
• Comes with Hortonworks
• Jupyter Notebooks – less powerful but more universal
BI Tools
• Apache Superset
• Microstrategy
• Things Board
• Grafana
Orchestration (Job Scheduler)
• Apache Oozie
• Apache Airflow
Security and Governance
• Apache Knox – The Knox Gateway (“Knox”) provides a single
point of authentication and access for Apache Hadoop services in
a cluster. The goal of the project is to simplify Hadoop security
for users who access the cluster data and execute jobs, and for
operators who control access to the cluster.
• Apache Ranger – Apache Ranger delivers a comprehensive
approach to security for a Hadoop cluster. It provides central
security policy administration across the core enterprise security
requirements of authorization, accounting and data protection.
Use Cases
• The use cases are only limited by your imagination
and creativity.
• There are a few common use cases.
• We’ll discuss two types:
• Use cases you can find in a book.
• My personal pet use cases
Use Case Assumptions
• All use cases require a cluster of machines
• On prim
• Or in the cloud
• All use cases will require a security/access plan
• You’ve had conversations with your engineers about
the best approach for your specific situation
Use Case 1: Get started analyzing big data
Problem: We have large amounts of data and it takes a
long time to analyze it.
Solution: Move that data into Hive.
Use Case 1: Get started analyzing big data
• Required tech:
• Hadoop Core
• Apache Hive
• ETL tool of your choice
• Data access tool of your choice
• Recommend Hue
• Ambari interface not that impressive for data
access
Use Case 1: Get started analyzing big data
• Can be as simple as an ad-hoc process.
• Can be as complicated as a nightly run.
• Design
• Identify data sets that need analyzed.
• Think of it as creating views.
• Create tables in Hive.
• Dump data into those tables as necessary.
Use Case 2: Real Time Data Warehouse
Problem: Executives need faster access to historical
data.
Solution: Re-engineer your ETL for a real time
paradigm
Use Case 2: Real Time Data Warehouse
• Required tech:
• Hadoop Core
• Streaming ETL tool of your choice
• Recommend StreamSets.
• Apache Kafka
• Recommend Enterprise Kafka through
Confluent.
• MPP database
• Recommend Greenplumb
Use Case 2: Real Time Data Warehouse
• Required tech:
• BI Tool of your choice
• You should be able to plug your current BI
tools right into Greenplumb
Use Case 2: Real Time Data Warehouse
• Does not require “Big Data”
• Will require a total rethink about how you move
data around your organization.
• I highly recommend you read I (*heart*) Logs by
Jay Kreps before you attempt this.
• I recommend that you create a POC.
• Use cloud resources and the cost of your POC will
be low.
Use Case 3: Breaking down data silos
Problem: We have data located in many different data
sources being used differently by different parts of the
enterprise.
Solution: Create a data lake/data hub.
Use Case 3: Breaking down data silos
• Required tech:
• Hadoop Core
• ETL tool of your choice
• Data access tool of your choice
• BI tool of your choice
Use Case 3: Breaking down data silos
• This is a batch process.
• You can stream in data if you want.
• The goal is to drop everything into Hadoop in its
NATIVE FORMAT.
• Still use a file format
• Store it now. Figure out what to do with it later.
Use Case 3: Breaking down data silos
• Implementing a data lake is a presentation all its
own.
• Sounds simple. Actually pretty complicated.
• Recommend: Architecting Data Lakes by Alice
LaPlante and Ben Sharma
Use Case 3: Breaking down data silos
• Data lakes solve a lot of problems at once.
• Really cuts down the time on ETL development
• Removes need to put some virtualization solution
in place
• Removes the need to put some federated data
solution in place.
• Implements a self serve model
• Gives everybody access to all the data
• Facilitates 360 degree analysis
Use Case 3: Breaking down data silos
• Data governance here is key
• Data dictionary has to be on point.
• Stuff needs to be clean.
• All datasets need clear provenance.
Other Use Cases
• Clickstream Analysis
• Basically analyze weblogs
• Can be done in batch
• More interesting to do it in real time
• https://hortonworks.com/tutorial/visualize-
website-clickstream-data/
Other Use Cases
• Fraud Detection
• Network Intrusion detection
• Same approach
• Pull in data in real time and make decisions on it.
• Requires data science tools to create models of
behavior.
• Aren’t there commercial solutions?
• Pretty generic use case.
• Build vs. buy.
Obligatory Q & A Period

Más contenido relacionado

La actualidad más candente

Big Data on azure
Big Data on azureBig Data on azure
Big Data on azureDavid Giard
 
Relational to Big Graph
Relational to Big GraphRelational to Big Graph
Relational to Big GraphNeo4j
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataDataWorks Summit/Hadoop Summit
 
From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...
From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...
From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...Mark Rittman
 
PyData Texas 2015 Keynote
PyData Texas 2015 KeynotePyData Texas 2015 Keynote
PyData Texas 2015 KeynotePeter Wang
 
Big Data in the Real World
Big Data in the Real WorldBig Data in the Real World
Big Data in the Real WorldMark Kromer
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemMd. Hasan Basri (Angel)
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureRoman Nikitchenko
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amirydatastack
 
The Big Data Ecosystem at LinkedIn
The Big Data Ecosystem at LinkedInThe Big Data Ecosystem at LinkedIn
The Big Data Ecosystem at LinkedInOSCON Byrum
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonDremio Corporation
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Ranjith Sekar
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless DatabasesDan Gunter
 
Introduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopIntroduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopSavvycom Savvycom
 
Big Data Use Cases
Big Data Use CasesBig Data Use Cases
Big Data Use Casesboorad
 
Big Data on the Microsoft Platform
Big Data on the Microsoft PlatformBig Data on the Microsoft Platform
Big Data on the Microsoft PlatformAndrew Brust
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMark Kromer
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Thanh Nguyen
 

La actualidad más candente (20)

Big Data on azure
Big Data on azureBig Data on azure
Big Data on azure
 
Relational to Big Graph
Relational to Big GraphRelational to Big Graph
Relational to Big Graph
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch data
 
From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...
From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...
From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...
 
PyData Texas 2015 Keynote
PyData Texas 2015 KeynotePyData Texas 2015 Keynote
PyData Texas 2015 Keynote
 
Big Data in the Real World
Big Data in the Real WorldBig Data in the Real World
Big Data in the Real World
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-System
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructure
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiry
 
The Big Data Ecosystem at LinkedIn
The Big Data Ecosystem at LinkedInThe Big Data Ecosystem at LinkedIn
The Big Data Ecosystem at LinkedIn
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in London
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless Databases
 
Introduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopIntroduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & Hadoop
 
Big Data Use Cases
Big Data Use CasesBig Data Use Cases
Big Data Use Cases
 
Big Data on the Microsoft Platform
Big Data on the Microsoft PlatformBig Data on the Microsoft Platform
Big Data on the Microsoft Platform
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data Analytics
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 

Similar a Architecting Your First Big Data Implementation

Big Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersBig Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersAdaryl "Bob" Wakefield, MBA
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...Institute of Contemporary Sciences
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSatish Mohan
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
Big Data Open Source Technologies
Big Data Open Source TechnologiesBig Data Open Source Technologies
Big Data Open Source Technologiesneeraj rathore
 
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Perficient, Inc.
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvewKunal Khanna
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft PlatformJesus Rodriguez
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...Rittman Analytics
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDYVenneladonthireddy1
 

Similar a Architecting Your First Big Data Implementation (20)

Big Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersBig Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R Users
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform Concept
 
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Big Data Open Source Technologies
Big Data Open Source TechnologiesBig Data Open Source Technologies
Big Data Open Source Technologies
 
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvew
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Apache Hadoop Hive
Apache Hadoop HiveApache Hadoop Hive
Apache Hadoop Hive
 
Hands On: Introduction to the Hadoop Ecosystem
Hands On: Introduction to the Hadoop EcosystemHands On: Introduction to the Hadoop Ecosystem
Hands On: Introduction to the Hadoop Ecosystem
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft Platform
 
Apache drill
Apache drillApache drill
Apache drill
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
 
Big data applications
Big data applicationsBig data applications
Big data applications
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
 
Hadoop HDFS.ppt
Hadoop HDFS.pptHadoop HDFS.ppt
Hadoop HDFS.ppt
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 

Más de Adaryl "Bob" Wakefield, MBA

The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need BothThe Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need BothAdaryl "Bob" Wakefield, MBA
 
Perfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT DataPerfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT DataAdaryl "Bob" Wakefield, MBA
 
Reproducible Research with R, The Tidyverse, Notebooks, and Spark
Reproducible Research with R, The Tidyverse, Notebooks, and SparkReproducible Research with R, The Tidyverse, Notebooks, and Spark
Reproducible Research with R, The Tidyverse, Notebooks, and SparkAdaryl "Bob" Wakefield, MBA
 
In Memory Databases: A Real Time Analytics Solution
In Memory Databases: A Real Time Analytics SolutionIn Memory Databases: A Real Time Analytics Solution
In Memory Databases: A Real Time Analytics SolutionAdaryl "Bob" Wakefield, MBA
 

Más de Adaryl "Bob" Wakefield, MBA (7)

The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need BothThe Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
 
Getting Started With Sparklyr
Getting Started With SparklyrGetting Started With Sparklyr
Getting Started With Sparklyr
 
Perfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT DataPerfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT Data
 
Reproducible Research with R, The Tidyverse, Notebooks, and Spark
Reproducible Research with R, The Tidyverse, Notebooks, and SparkReproducible Research with R, The Tidyverse, Notebooks, and Spark
Reproducible Research with R, The Tidyverse, Notebooks, and Spark
 
Hands On: Linux survival skills
Hands On: Linux survival skillsHands On: Linux survival skills
Hands On: Linux survival skills
 
In Memory Databases: A Real Time Analytics Solution
In Memory Databases: A Real Time Analytics SolutionIn Memory Databases: A Real Time Analytics Solution
In Memory Databases: A Real Time Analytics Solution
 
Not Just Another Overview of Apache Hadoop
Not Just Another Overview of Apache HadoopNot Just Another Overview of Apache Hadoop
Not Just Another Overview of Apache Hadoop
 

Último

Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 

Último (20)

Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 

Architecting Your First Big Data Implementation

  • 1. Architecting Your First Big Data Implementation Bob Wakefield Principal bob@MassStreet.net Twitter: @BobLovesData
  • 2. Who is Mass Street? • Boutique data consultancy • We work on data problems big or “small” • We have a focus on helping organizations move from a batch to a real time paradigm. • Free Big Data training
  • 3. Mass Street Partnerships and Capability •Hortonworks Partner •Confluent Partner •ARG Back Office
  • 4. Bob’s Background • IT professional 16 years • Currently working as a Data Engineer • Education • BS Business Admin (MIS) from KState • MBA (finance concentration) from KU • Coursework in Mathematics at Washburn • Graduate certificate Data Science from Rockhurst •Addicted to everything data
  • 5. Follow Me! •Personal Twitter: @BobLovesData •Company Twitter: @MassStreet •Blog: DataDrivenPerspectives.com •Website: www.MassStreet.net •Facebook: @MassStreetAnalyticsLLC
  • 6. Upcoming MeetUps October: Joint MeetUp with Data Science KC. Reproducible Research with R, The Tidyverse, Notebooks, and Spark November: Joint MeetUp with Kansas City Apache Spark MeetUp. Practicing IoT skills with Satori. (Still in the works.) December: Survey of common data science algorithms.
  • 9. Resources For This Evening’s Material: Books •Hadoop Application Architectures •(it’s a little dated)
  • 10. Resources For This Evening’s Material: Online Classes •Architect and Build Big Data Applications •Analytic Data Storage in Hadoop
  • 11. Buzzword definitions • Distributed – many computers working on the same problem • Commodity Hardware – This does not mean cheap. Means Cheaper. • Schema-on-read – structure of data imposed at processing time.
  • 12. Buzzword definitions • Linear Scale – Adding more boxes gets you performance = mx+b vs. diminishing returns
  • 13. What is Big Data? Definition v1: volume, velocity, variety
  • 14. What is Big Data? Definition v2: When you have so much data you can’t analyze it by traditional means.
  • 15. What is Big Data? SELECT address, COUNT(*) FROM customers GROUP BY address
  • 16. What is Big Data? Definition v3: When you have so much data, you can’t get things done in a reasonable amount of time.
  • 17. What Is Hadoop? 1. Distributed computing for the masses. 2. Is the backbone to many distributed applications. 3. Can mean a lot of things depending on context. 1. Could mean Hadoop core. 2. Could mean the entire set of Big Data tools. 3. Could mean the MapReduce framework.
  • 18. What is Hadoop? • Provides distributed fault tolerant data storage • Provides linear scalability on commodity hardware • Translation: Take all of your data, throw it across a bunch of cheap machines, and analyze it. Get more data? Add more machines
  • 19.
  • 20. It’s like a child’s erector set!
  • 21. What kind of problems does Hadoop help me solve? • Easier analysis of big data • Lower TCO of data • Offloading ETL workloads • Transition from batch to real time • It lets you see the future.
  • 22. What do I need to build big data competency? • Engineers • Things aren’t very point and clicky right now. • Java vs. Python vs. Scala • Linux experts • Data Analyst • These folks should be internal. • You can train a data scientist. • Python vs. R
  • 23. Some questions to ask before you start your first big data project. • Use these questions to drive your architecture. • What kind of data do I have? • Highly organized vs. super messy. • Critical data that needs to move fast vs. data that can wait. • How fast do I want to implement a solution? • Is there low hanging fruit? • Who do I have vs. who do I need to hire? • Who is trainable? • Should I get outside help?
  • 24. Some questions to ask before you start your first big data project. • What is my current tech stack? • Windows vs. Linux • On Prim Vs. Cloud • If you’re on prim do you have rack space? • What are my current BI Tools? • Excel vs. Microstrategy (free now!) vs. Bime vs. Tableau • What is managing my application security? • LDAP vs. Active Directory
  • 25. Big Data Software and Tools • Most of it is free open source. • Despite being open source, it’s high quality. • A lot of projects are part of the Apache Software Foundation. • Several projects have dozens of committers. • A cottage industry has sprung up around open source.
  • 26. Cloud Solutions • AWS (S3) • MS Azure • Google Cloud (Big Query, Big Table) • Databricks • Confluent (Kafka Saas)
  • 27. Cloud Solutions Bob’s Thoughts • Great for tactical solutions. • If you’re a smaller organization, cloud offers a managed solution. • Can be a hassle. You need top flight network engineers. • Moving to the cloud is not something to do lightly. It requires long term thinking. • I have no idea why anybody waste time with on prim servers anymore.
  • 28. NoSQL Databases If you’re interested in learning more about NoSQL, I suggest: NoSQL Distilled by Martin Fowler and Pramod J. Sadalage
  • 29. NoSQL Databases • Easy ETL • Easy schema evolution • Impedance mismatch solved
  • 30. Types of NoSQL Databases • Key Value – A simple has table, primary used when all access to the database is via primary key. • Document – The database stores and retrieves documents, which can be XML, JSON, BSON • Column-family - Data is stored in columns • Graph – allows you to store relationships between entities
  • 31. Issues with NoSQL Databases • They do NOT work like relational databases • The guarantees that you are used to aren’t there • CAP theory (consistency, availability, partition tolerance) • Each database has it’s own query language • That language will frequently look NOTHING like SQL
  • 32. Examples of NoSQL Databases Document Key Value Column Store Graph Mongo DB Apache Accumulo Apache Accumulo Neo4J Redis (in memory) Druid Hive? Cassandra (DataStax) Hbase
  • 33. New SQL Databases • Distributed versions of databases you’re used to • New concept Hybrid Transactional/Analytical Processing (HTAP) • Generally two types • In Memory • Massively Parallel Processing (MPP)
  • 34. New SQL Databases • In Memory Databases • MemSQL • In Memory Databases: A Real Time Analytics Solution • MassStreet.net -> YouTube • VoltDB • NuoDB
  • 35. New SQL Databases • MPP Databases • More suited to analytics • Examples • Greenplumb (has gone open source!) • SQL Server PDW • MySQL Cluster CGE
  • 36. Hadoop Distributions • Get. A. Distro! • What is a distro? • Popular big data software bundled together and installed as a total package. • Popular distros: • Hortonworks • Cloudera • MapR • Various Others • Most important difference is the support package • You can download and use distros without buying a support package.
  • 37. Hadoop Core • Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster. • Hadoop YARN – a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users’ applications. • Hadoop MapReduce – a programming model for large scale data processing.
  • 38. Hadoop Core • Apache ZooKeeper – A highly available system for coordinating distributed processes. Distributed applications use ZooKeeper to store and mediate updates to important configuration information. • Apache Tez - Generalizes the MapReduce paradigm to a more powerful framework for executing a complex DAG (directed acyclic graph) of tasks for near real-time big data processing.
  • 39. File Formats • You can’t just drop CSVs onto HDFS • That’s super inefficient • Pick a file format with the following requirements • Compressible • Splitable • You have options on compression codecs • Snappy is popular • Ultimately an engineering decision
  • 40. File Formats • Apache Avro • Language neutral data serialization • Serializes data in a compact binary format • Fairly powerful tool • Useful for schema evolution • Apache Parquet • General purpose storage format • Column oriented • Stores metadata thus self documenting
  • 41. File Formats • Avro is good for storing transactional workloads • Parquet is good for storing analytical workloads • When in doubt, use Avro • Both can be used • Read and write Parquet with Avro APIs
  • 42. IMPORTANT! • Data in HDFS is IMMUTABLE • There is no random access • If you need that, you’ll have to use a database technology • A lot of big data databases use HDFS to store the actual data file.
  • 43. Batch Data Flow Tools • Apache Pig – A platform for processing and analyzing large data sets. Pig consists of a high- level language (Pig Latin) for expressing data analysis programs paired with the MapReduce framework for processing these programs. • Cascading - software abstraction layer for Apache Hadoop and Apache Flink. Cascading is used to create and execute complex data processing workflows on a Hadoop cluster using any JVM-based language (Java, JRuby, Clojure, etc.), hiding the underlying complexity of MapReduce jobs. It is open source and available under the Apache License. Commercial support is available from Driven, Inc. • Apache Sqoop – Sqoop is a tool that speeds and eases movement of data in and out of Hadoop. It provides a reliable parallel load for various, popular enterprise data sources.
  • 44. Real Time Processing • THIS IS THE QUANTUM LEAP GAME CHANGER IN YOUR DATA MANAGEMENT STRATEGY!!!!! • As a first step, read I (*heart*) Logs by Jay Kreps
  • 45. Central Concepts • The log as a unifying abstraction • Everything in your org is an event that can be logged, captured, and transported
  • 46. Real Time Processing Considerations • Consider a radically different data architecture • Microservices? • Apache Kafka is critical in any streaming architecture • True Streaming vs. Microbatch • I don’t think microbatch is a thing anymore • Delivery guarantees • At least once • At most once • Exactly once
  • 47. Real Time Processing • Apache Kafka – Kafka is a fast and scalable publish-subscribe messaging system that is often used in place of traditional message brokers because of its higher throughput, replication, and fault tolerance. • Commercially backed by Confluent • Open source and Enterprise Versions
  • 50. Real Time Processing • Apache Spark Structured Streaming - scalable and fault-tolerant stream processing engine built on the Spark SQL engine. • Apache Storm – Storm is a distributed real-time computation system for processing fast, large streams of data adding reliable real-time data processing capabilities to Apache Hadoop 2.x. • Apache Flume – Flume allows you to efficiently aggregate and move large amounts of log data from many different sources to Hadoop.
  • 51. Real Time Processing • Other real time processing frameworks • Apache Flink • Apache Samza • They all have benefits and drawbacks
  • 52. Interesting Point and Click Tools • Talend • Kafka/Spark streaming product • Streamsets • Connector to SQL Server CDC • NiFi
  • 53. Data Access and Processing • Apache Spark – Spark is ideal for in-memory data processing. It allows data scientists to implement fast, iterative algorithms for advanced analytics such as clustering and classification of datasets. • Apache Mahout – Mahout provides scalable machine learning algorithms for Hadoop which aids with data science for clustering, classification and batch based collaborative filtering. • Hue – Hadoop User Experience. Open source interface to the Hadoop ecosystem. • Provides interfaces and code editors for various projects. • Prefer it for interacting with data over Ambari. • If you’re not using Cloudera, it’s a little hard to install. • Apache Zeppelin – really powerful notebook • Comes with Hortonworks • Jupyter Notebooks – less powerful but more universal
  • 54. BI Tools • Apache Superset • Microstrategy • Things Board • Grafana
  • 55. Orchestration (Job Scheduler) • Apache Oozie • Apache Airflow
  • 56. Security and Governance • Apache Knox – The Knox Gateway (“Knox”) provides a single point of authentication and access for Apache Hadoop services in a cluster. The goal of the project is to simplify Hadoop security for users who access the cluster data and execute jobs, and for operators who control access to the cluster. • Apache Ranger – Apache Ranger delivers a comprehensive approach to security for a Hadoop cluster. It provides central security policy administration across the core enterprise security requirements of authorization, accounting and data protection.
  • 57. Use Cases • The use cases are only limited by your imagination and creativity. • There are a few common use cases. • We’ll discuss two types: • Use cases you can find in a book. • My personal pet use cases
  • 58. Use Case Assumptions • All use cases require a cluster of machines • On prim • Or in the cloud • All use cases will require a security/access plan • You’ve had conversations with your engineers about the best approach for your specific situation
  • 59. Use Case 1: Get started analyzing big data Problem: We have large amounts of data and it takes a long time to analyze it. Solution: Move that data into Hive.
  • 60. Use Case 1: Get started analyzing big data • Required tech: • Hadoop Core • Apache Hive • ETL tool of your choice • Data access tool of your choice • Recommend Hue • Ambari interface not that impressive for data access
  • 61. Use Case 1: Get started analyzing big data • Can be as simple as an ad-hoc process. • Can be as complicated as a nightly run. • Design • Identify data sets that need analyzed. • Think of it as creating views. • Create tables in Hive. • Dump data into those tables as necessary.
  • 62. Use Case 2: Real Time Data Warehouse Problem: Executives need faster access to historical data. Solution: Re-engineer your ETL for a real time paradigm
  • 63. Use Case 2: Real Time Data Warehouse • Required tech: • Hadoop Core • Streaming ETL tool of your choice • Recommend StreamSets. • Apache Kafka • Recommend Enterprise Kafka through Confluent. • MPP database • Recommend Greenplumb
  • 64. Use Case 2: Real Time Data Warehouse • Required tech: • BI Tool of your choice • You should be able to plug your current BI tools right into Greenplumb
  • 65. Use Case 2: Real Time Data Warehouse • Does not require “Big Data” • Will require a total rethink about how you move data around your organization. • I highly recommend you read I (*heart*) Logs by Jay Kreps before you attempt this. • I recommend that you create a POC. • Use cloud resources and the cost of your POC will be low.
  • 66. Use Case 3: Breaking down data silos Problem: We have data located in many different data sources being used differently by different parts of the enterprise. Solution: Create a data lake/data hub.
  • 67. Use Case 3: Breaking down data silos • Required tech: • Hadoop Core • ETL tool of your choice • Data access tool of your choice • BI tool of your choice
  • 68. Use Case 3: Breaking down data silos • This is a batch process. • You can stream in data if you want. • The goal is to drop everything into Hadoop in its NATIVE FORMAT. • Still use a file format • Store it now. Figure out what to do with it later.
  • 69. Use Case 3: Breaking down data silos • Implementing a data lake is a presentation all its own. • Sounds simple. Actually pretty complicated. • Recommend: Architecting Data Lakes by Alice LaPlante and Ben Sharma
  • 70. Use Case 3: Breaking down data silos • Data lakes solve a lot of problems at once. • Really cuts down the time on ETL development • Removes need to put some virtualization solution in place • Removes the need to put some federated data solution in place. • Implements a self serve model • Gives everybody access to all the data • Facilitates 360 degree analysis
  • 71. Use Case 3: Breaking down data silos • Data governance here is key • Data dictionary has to be on point. • Stuff needs to be clean. • All datasets need clear provenance.
  • 72. Other Use Cases • Clickstream Analysis • Basically analyze weblogs • Can be done in batch • More interesting to do it in real time • https://hortonworks.com/tutorial/visualize- website-clickstream-data/
  • 73. Other Use Cases • Fraud Detection • Network Intrusion detection • Same approach • Pull in data in real time and make decisions on it. • Requires data science tools to create models of behavior. • Aren’t there commercial solutions? • Pretty generic use case. • Build vs. buy.
  • 74. Obligatory Q & A Period