The Elephant in the Room
A DBA’s Guide to Hadoop & Big Data
Purpose
Rosetta Stone presentation
High level overview of Hadoop & Big Data
NOT a deep dive
NOT a demo session
Mostly theory & vocabulary
Where to learn more
About Me
Manage DBAs for a financial services company
Former Data Architect, DBA, developer
Linchpin People TeamMate
AtlantaMDF Chapter Leader
Infrequent blogger: http://codegumbo.com
About You
Assume that you:
● are mostly developers
● have SQL experience
● have exposure to database admin & architecture
● have little to no experience with Big Data
“Big” Data
Big Data is like teenage sex...
Everyone talks about it,
Nobody really knows how to do it,
Everyone thinks everyone else is doing it,
So everyone claims they are doing it…
-Dan Ariely
The Four V’s of Big Data
Volume - data is too big to scale up on a single server
Velocity - decision window is small
Variety - multiple formats challenge integration
Variability - same data, different interpretations
http://goo.gl/6icouZ
RDBMS versus Big Data
RDBMS
Primarily Scale-Up
Strong Typing
Normalization
Default Mutable
Mature
Big Data
Primarily Scale-Out
Schemaless
Default Immutable
Evolving
Big Data Use Cases
Massive Size
PB of info
Data Warehouse
Large clusters
High Cost
Complex Analytics
Schemaless
Investigational
Single-node
Low Cost
Foundations
“Gentlemen, this is a football…”
- Vince Lombardi
Hadoop Ecosystem (Hortonworks)
Hortonworks
Hadoop
Scalable, distributed processing framework
open-source
Hortonworks*
Cloudera
proprietary components
Facebook
Yahoo
HDFS
Hadoop Distributed File System
Inspired by the Google File System (2002-2003)
Cluster storage of large files across servers
Yahoo - 10,000 core Hadoop cluster(s)
Facebook - 100 PB+ (June, 2012)
http://goo.gl/SpSN
HDFS
File permissions and authentication.
Rack aware
fsck: find missing files or blocks.
Scheduled Rebalancing
Redundancy & Replication
Built around MapReduce
MapReduce
“Developed” by Google; paper published in 2004 (patent filed in 2004, granted in 2010)
Map - filtering and sorting
Reduce - summarization
Inherently distributed
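A minimal single-process sketch of the map, shuffle, and reduce steps described above. It is not Hadoop code: the function names and the (symbol, volume) record layout are assumptions made for illustration, and a real cluster would run the map and reduce steps on many nodes in parallel.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups, reducer):
    """Summarize the values collected for each key into one result."""
    return {key: reducer(values) for key, values in groups.items()}


def trade_mapper(record):
    # Hypothetical record layout: (symbol, volume).
    # The map step filters: trades with zero volume are dropped.
    symbol, volume = record
    if volume > 0:
        yield symbol, volume


# Invented sample data, for illustration only.
trades = [("MSFT", 50), ("IBM", 100), ("IBM", 25), ("MSFT", 0)]
totals = reduce_phase(shuffle(map_phase(trades, trade_mapper)), sum)
print(totals)  # {'IBM': 125, 'MSFT': 50}
```

Because each key's values are reduced independently of every other key, the work can be split across machines, which is what makes the model inherently distributed.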
Hive
HiveQL - SQL-like syntax
DDL scripts define tables
Query transformed into MapReduce jobs
Performance improves as the cluster scales out
Stinger initiative - Microsoft / Hortonworks
Hive
create external table price_data (
    stock_exchange string,
    symbol string,
    trade_date string,
    open float,
    high float,
    low float,
    close float,
    volume int,
    adj_close float)
row format delimited fields terminated by ','
stored as textfile
location '/user/hue/nyse/nyse_prices';

select * from price_data where symbol = 'IBM';
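One way to picture what the external table does: the data stays in plain comma-delimited text files, and the column names and types are applied only when the rows are read. The Python sketch below mimics that idea on two invented sample rows. The Price type and read_prices helper are hypothetical names, and this is not how Hive itself executes the query.

```python
import csv
import io
from typing import NamedTuple


class Price(NamedTuple):
    # Mirrors the columns declared in the price_data DDL.
    stock_exchange: str
    symbol: str
    trade_date: str
    open: float
    high: float
    low: float
    close: float
    volume: int
    adj_close: float


def read_prices(lines):
    """Apply the schema to each comma-delimited line as it is read."""
    for row in csv.reader(lines):
        yield Price(row[0], row[1], row[2],
                    float(row[3]), float(row[4]), float(row[5]),
                    float(row[6]), int(row[7]), float(row[8]))


# Invented rows, for illustration only (not real market data).
sample = io.StringIO(
    "NYSE,IBM,2000-01-03,10.0,11.0,9.0,10.5,200,10.5\n"
    "NYSE,GE,2000-01-03,5.0,6.0,4.0,5.5,300,5.5\n"
)

# Equivalent in spirit to: select * from price_data where symbol = 'IBM';
ibm_rows = [p for p in read_prices(sample) if p.symbol == "IBM"]
print(ibm_rows)
```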
HCatalog
Tight integration with Hive, but supports all
Hadoop data access protocols
Define relational view into data (DDL)
“Tables” can be reused by Hive, Pig, Storm...
Tutorial
Pig
Data abstraction language; Yahoo (2006)
Based on Java; supports Python & Ruby
Procedural (SQL is declarative)
Allows for ETL
Lazy evaluation
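Lazy evaluation means that Pig statements only describe a pipeline; nothing runs until the result is actually requested (for example by a store). Python generators behave in a similar way, so the sketch below uses them as an analogy. The load and scrub functions are hypothetical stand-ins, not Pig APIs.

```python
def load(lines, log):
    """Stand-in for a load step; records each row it hands downstream."""
    for line in lines:
        log.append(f"load {line!r}")
        yield line


def scrub(rows, log):
    """Stand-in for a cleanup step: trim whitespace and lowercase."""
    for row in rows:
        log.append(f"scrub {row!r}")
        yield row.strip().lower()


log = []
# Building the pipeline does no work yet, it only describes the steps.
pipeline = scrub(load(["  Alpha ", "BETA"], log), log)
steps_before_store = len(log)

# Consuming the pipeline plays the role of a store: now the steps execute.
stored = list(pipeline)

print(steps_before_store)  # 0
print(stored)              # ['alpha', 'beta']
print(len(log))            # 4
```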
Pig
ETL service; useful as “duct tape”
Typical scenario:
Load data into HDFS
Use Pig to scrub data, and
Pump to another “db” (e.g., MongoDB)
Web service reads from destination
Hadoop                  SQL Server
HDFS                    Windows Cluster, Database
MapReduce               Query Optimizer
Master Web Interface    SQL Server Management Studio
Hive                    SQL
HCatalog                Views
Pig                     PowerShell, SSIS
Big Data Administration
“The possession of facts is knowledge, the use of them is wisdom.”
- Thomas Jefferson
[Figures: performance plotted against application growth, one chart for RDBMS and one for Big Data]
Scale-Up Costs (SQL Server)
Single Server
Maximum RAM
SAN
Licenses
Windows
SQL Server
Microsoft Support
Personnel
Developers
DBA
SAN Admin
Network Admin
Facilities
Minimum Footprint
Scale-Out Costs (Hortonworks HDP)
Multiple Servers
Commodity
Licenses
Windows ($$$)
Linux ($)
HDP Support
Personnel
Developer
HDP Admin
Network Admin
Facilities
Power
Space
Air
Performance Tuning
[Figure: system tuning versus code tuning, for RDBMS and for Hadoop]
Performance Tuning Tips
Performance Architecture
Nathan Marz - Twitter, Storm
Lambda Architecture
Getting Started (Massive Size)
1. Lab Environment (Virtualized)
2. Set up OS (Windows or Linux)
3. Download (MSI or RPM)
4. Deploy Prereqs (Python, Java, C++)
5. Set up Master Node(s)
6. Set up Data Node(s)
Windows Installation Tutorial
Word Count
Problem: count the number of times a word appears in a specific record.
e.g. “Lorem ipsum dolor sit amet, consectetur adipiscing elit.”...
Word Count
SQL Server: create a UDF to parse strings
Hadoop: Pig script to parse strings
Word Count - SQL Server
-- Counts non-overlapping occurrences of @TargetWord in @SourceString.
-- Matching is by substring, so 'it' is also counted inside 'sit'.
CREATE FUNCTION WordRepeatedNumTimes
    (@SourceString varchar(max), @TargetWord varchar(8000))
RETURNS int
AS
BEGIN
    DECLARE @NumTimesRepeated int
        ,@CurrentStringPosition int
        ,@LengthOfString int
        ,@PatternStartsAtPosition int
        ,@LengthOfTargetWord int
        ,@NewSourceString varchar(max)

    SET @LengthOfTargetWord = len(@TargetWord)
    SET @LengthOfString = len(@SourceString)
    SET @NumTimesRepeated = 0
    SET @CurrentStringPosition = 0
    SET @PatternStartsAtPosition = 0
    SET @NewSourceString = @SourceString

    -- Guard: with an empty target the loop condition below stays true forever.
    IF @LengthOfTargetWord = 0
        RETURN 0

    -- Keep searching while the remaining text could still hold the target.
    WHILE len(@NewSourceString) >= @LengthOfTargetWord
    BEGIN
        SET @PatternStartsAtPosition = CHARINDEX(@TargetWord, @NewSourceString)
        IF @PatternStartsAtPosition <> 0
        BEGIN
            SET @NumTimesRepeated = @NumTimesRepeated + 1
            SET @CurrentStringPosition = @CurrentStringPosition +
                @PatternStartsAtPosition + @LengthOfTargetWord
            -- Drop everything up to and including this match.
            SET @NewSourceString = substring(@NewSourceString,
                @PatternStartsAtPosition + @LengthOfTargetWord, @LengthOfString)
        END
        ELSE
        BEGIN
            -- No further match: empty the string to end the loop.
            SET @NewSourceString = ''
        END
    END
    RETURN @NumTimesRepeated
END
Word Count (Hadoop)
a = load '/user/hue/word_count_text.txt';
b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
c = group b by word;
d = foreach c generate COUNT(b), group;
store d into '/user/hue/pig_wordcount';
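For readers who do not know Pig, the same steps (load lines, tokenize, group by word, count) can be sketched in plain Python. This is only an illustration of the logic, not something that runs on Hadoop. The delimiter set is an assumption meant to approximate TOKENIZE's default separators (space, double quote, comma, parentheses, asterisk), and the function names and second sample line are invented.

```python
import re
from collections import Counter

# Assumed approximation of Pig's default TOKENIZE separators.
DELIMITERS = r'[ ",()*]+'


def tokenize(line):
    """Split a line into words, dropping empty tokens."""
    return [token for token in re.split(DELIMITERS, line) if token]


def word_count(lines):
    """Count how often each word occurs across all lines."""
    counts = Counter()
    for line in lines:                  # a = load ...
        counts.update(tokenize(line))   # b (tokenize), c (group), d (count)
    return counts


text = [
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
    "lorem ipsum",
]
counts = word_count(text)
print(counts["ipsum"])  # 2
print(counts["amet"])   # 1
```

Like the Pig script, this counts every word in a single pass, whereas the SQL Server function counts one target word per call. Matching is case-sensitive, and a period is not treated as a separator here, so "elit." is counted with its trailing dot.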
Getting Started (Complex Analysis)
On a local machine:
1. Lab Environment (Virtualized)
2. Install Hortonworks Sandbox
In the cloud:
1. Set up Azure account
2. HDInsight
Theoretically, this can scale to PB, but there is no telling what that will cost you.
Note that the interface highlights Hive (with Stinger); Pig commands are run through PowerShell.
In Conclusion
Lots of vocabulary
HDFS, Pig, Hive, MapReduce
Map to SQL Server (RDBMS) vocabulary
Different Use Cases
Massive Data
Complex Analysis
Questions & Feedback
Contact Me
Stuart R. Ainsworth
Twitter: @codegumbo
Email: stuart@codegumbo.com
SpeakerRate: http://spkr8.com/t/33521
Big Data - Dangerous
http://www.thefacehawk.com/

BIG DATA TITLE

  • 1. The Elephant in the Room A DBA’s Guide to Hadoop & Big Data
  • 2.
  • 3.
  • 4. Purpose Rosetta Stone presentation High level overview of Hadoop & Big Data NOT a deep dive NOT a demo session Mostly theory & vocabulary Where to learn more
  • 5. About Me Manage DBA’s for financial services company Former Data Architect, DBA, developer Linchpin People TeamMate AtlantaMDF Chapter Leader Infrequent blogger: http://codegumbo.com
  • 6. About You Assume that ● mostly developers ● SQL experience ● exposure to database admin & architecture ● little to no experience with Big Data
  • 8. Big Data is like teenage sex... Everyone talks about it, Nobody really knows how to do it, Everyone thinks everyone else is doing it, So everyone claims they are doing it… -Dan Ariely
  • 9. The Four V’s of Big Data Volume - data is too big to scale out Velocity - decision window is small Variety - multiple formats challenge integration Variability - same data, different interpretations http://goo.gl/6icouZ
  • 10. RDBMS versus Big Data RDBMS Primarily Scale-Up Strong Typing Normalization Default Mutable Mature Big Data Primarily Scale-Out Schemaless Default Immutable Evolving
  • 11. Big Data Use Cases Massive Size PB of info Data Warehouse Large clusters High Cost Complex Analytics Schemaless Investigational Single-node Low Cost
  • 12. Foundations “Gentlemen, this is a football…” - Vince Lombardi
  • 14. Hadoop Scaleable, distributed processing framework open-source Hortonworks* Cloudera proprietary components Facebook Yahoo
  • 15. HDFS Hadoop Distributed File System Inspired by Google FileSystem (2002-2003) Cluster storage of large files across servers Yahoo - 10,000 core Hadoop cluster(s) Facebook - 100 PB+ (June, 2012) http://goo.gl/SpSN
  • 16. HDFS
  • 17. HDFS File permissions and authentication. Rack aware fsck: find missing files or blocks. Scheduled Rebalancing Redundancy & Replication Built around MapReduce
  • 18. MapReduce “Developed” by Google; patent issued in 2004 Map - filtering and sorting Reduce - summarization Inherently distributed
  • 20. Hive HiveQL - SQL like syntax DDL scripts define tables Query transformed into MapReduce jobs Performance increases with scalability Stinger initiative - MicrosoftHortonworks
  • 21. Hive
  • 22. Hive create external table price_data (stock_exchange string, symbol string, trade_date string, open float, high float, low float, close float, volume int, adj_close float) row format delimited fields terminated by ',' stored as textfile location '/user/hue/nyse/nyse_prices'; select * from price_data where symbol = 'IBM';
  • 23. Hive
  • 24. HCatalog Tight integration with Hive, but supports all Hadoop data access protocols Define relational view into data (DDL) “Tables” can be reused by Hive, Pig, Storm... Tutorial
  • 25. Pig: Data abstraction language; Yahoo (2006). Based on Java; supports Python & Ruby. Procedural (SQL is declarative). Allows for ETL. Lazy evaluation.
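Lazy evaluation is the least familiar item on this slide for SQL developers. As a loose analogy (not Pig itself), Python generators behave similarly: defining the steps runs nothing, and work happens only when a result is demanded, much as a Pig script does its work when it reaches a STORE or DUMP. The function names and sample rows below are made up.

```python
log = []  # records which steps actually ran, and in what order


def load(rows):
    for row in rows:
        log.append(f"load {row!r}")
        yield row


def scrub(rows):
    for row in rows:
        log.append(f"scrub {row!r}")
        yield row.strip().lower()


# Defining the pipeline only describes the steps; nothing executes yet.
pipeline = scrub(load(["  IBM ", " msft"]))
print(len(log))  # 0: nothing has run so far

# Asking for the result triggers the whole chain.
result = list(pipeline)
print(result)    # ['ibm', 'msft']
print(len(log))  # 4: two rows passed through two steps
```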
  • 26. Pig
  • 27. Pig
  • 28. Pig: ETL service; useful as “duct tape”. Typical scenario: load data into HDFS; use Pig to scrub data and pump it to another “db” (e.g., MongoDB); web service reads from destination.
  • 31. Hadoop SQL Server HDFS Windows Cluster Database MapReduce Query Optimizer Master Web Interface SQL Server Management Studio Hive SQL HCatalog Views Pig Powershell SSIS
  • 32. Big Data Administration The possession of facts is knowledge, the use of them is wisdom. – Thomas Jefferson
  • 33. Big Data Use Cases. Massive Size: PB of info; Data Warehouse; large clusters; high cost. Complex Analytics: schemaless; investigational; single-node; low cost.
  • 37. Scale-Up Costs (SQL Server). Single Server: maximum RAM, SAN. Licenses: Windows, SQL Server, Microsoft Support. Personnel: developers, DBA, SAN admin, network admin. Facilities: minimum footprint.
  • 38. Scale-Out Costs (Hortonworks HDP). Multiple Servers: commodity. Licenses: Windows ($$$), Linux ($), HDP Support. Personnel: developer, HDP admin, network admin. Facilities: power, space, air.
  • 41. Performance Architecture: Nathan Marz - Twitter, Storm; Lambda Architecture
  • 43. Getting Started (Massive Size) 1. Lab Environment (Virtualized) 2. Setup OS (Windows or Linux) 3. Download (MSI or RPM) 4. Deploy Prereqs (Python, Java, C++) 5. Setup Master Node(s) 6. Setup Data Node(s)
  • 45. Big Data Use Cases. Massive Size: PB of info; Data Warehouse; large clusters; high cost. Complex Analytics: schemaless; investigational; single-node; low cost.
  • 46. Word Count. Problem: count the number of times a word appears in a specific record, e.g. “Lorem ipsum dolor sit amet, consectetur adipiscing elit.”...
  • 47. Word Count. SQL Server: create a UDF to parse strings. Hadoop: Pig script to parse strings.
  • 48. Word Count - SQL Server
      CREATE function WordRepeatedNumTimes
        (@SourceString varchar(max), @TargetWord varchar(8000))
      RETURNS int
      AS
      BEGIN
        DECLARE @NumTimesRepeated int
               ,@CurrentStringPosition int
               ,@LengthOfString int
               ,@PatternStartsAtPosition int
               ,@LengthOfTargetWord int
               ,@NewSourceString varchar(max)
  • 49. Word Count - SQL Server
        SET @LengthOfTargetWord = len(@TargetWord)
        SET @LengthOfString = len(@SourceString)
        SET @NumTimesRepeated = 0
        SET @CurrentStringPosition = 0
        SET @PatternStartsAtPosition = 0
        SET @NewSourceString = @SourceString
        WHILE len(@NewSourceString) >= @LengthOfTargetWord
        BEGIN
          SET @PatternStartsAtPosition = CHARINDEX(@TargetWord, @NewSourceString)
          IF @PatternStartsAtPosition <> 0
          BEGIN
  • 50. Word Count - SQL Server
            SET @NumTimesRepeated = @NumTimesRepeated + 1
            SET @CurrentStringPosition = @CurrentStringPosition + @PatternStartsAtPosition + @LengthOfTargetWord
            SET @NewSourceString = substring(@NewSourceString, @PatternStartsAtPosition + @LengthOfTargetWord, @LengthOfString)
          END
          ELSE
          BEGIN
            SET @NewSourceString = ''
          END
        END
        RETURN @NumTimesRepeated
      END
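To make the loop in the UDF on slides 48-50 easier to follow, here is a rough Python rendering of the same logic: find the target, count it, then resume searching just past the match. It is a sketch rather than a line-for-line port. The empty-target guard is an addition of this sketch, and Python's `find` is case-sensitive, whereas `CHARINDEX` follows the database collation.

```python
def word_repeated_num_times(source, target):
    """Count non-overlapping occurrences of target in source, left to right."""
    count = 0
    remaining = source
    # The "> 0" guard against an empty target is an addition in this sketch.
    while len(remaining) >= len(target) > 0:
        pos = remaining.find(target)  # like CHARINDEX, but 0-based, -1 if absent
        if pos == -1:
            break                     # the T-SQL version blanks the string instead
        count += 1
        remaining = remaining[pos + len(target):]  # resume after the match
    return count


text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
print(word_repeated_num_times(text, "it"))  # 2: inside "sit" and "elit"
```

The example shows that this approach matches substrings, not whole words: neither hit is the standalone word "it".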
  • 51. Word Count (Hadoop)
      a = load '/user/hue/word_count_text.txt';
      b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
      c = group b by word;
      d = foreach c generate COUNT(b), group;
      store d into '/user/hue/pig_wordcount';
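The Pig script above can be read step by step against a plain Python equivalent. This is a single-machine sketch for intuition only. It assumes Pig's default `TOKENIZE` delimiters are whitespace, double quote, comma, parentheses and asterisk, so a trailing period stays attached to its word. The sample lines are made up.

```python
import re
from collections import Counter


def word_count(lines):
    counts = Counter()
    for line in lines:  # a = load ...
        # b = foreach a generate flatten(TOKENIZE(...)) as word
        # Delimiter set is an assumption about TOKENIZE's defaults.
        words = [w for w in re.split(r'[\s",()*]+', line) if w]
        # c = group b by word; d = foreach c generate COUNT(b), group
        counts.update(words)
    return counts


sample = ["Lorem ipsum dolor sit amet, consectetur adipiscing elit.", "lorem ipsum"]
counts = word_count(sample)
print(counts["ipsum"])  # 2
print(counts["elit."])  # 1: the period is not a delimiter here
```

Note that "Lorem" and "lorem" are counted separately, because no step lowercases the text.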
  • 52. Getting Started (Complex Analysis): 1. Lab Environment (Virtualized) 2. Install Hortonworks Sandbox; or 1. Set up Azure account 2. HDInsight
  • 53. Theoretically, it can scale to petabytes, though the cost at that scale is unknown. Note that the interface highlights Hive (with Stinger); Pig commands are run through PowerShell.
  • 55. In Conclusion. Lots of vocabulary: HDFS, Pig, Hive, MapReduce; map to SQL Server (RDBMS) vocabulary. Different use cases: massive data; complex analysis.
  • 57. Contact Me Stuart R. Ainsworth Twitter: @codegumbo Email: stuart@codegumbo.com SpeakerRate: http://spkr8.com/t/33521
  • 58. Big Data - Dangerous http://www.thefacehawk.com/