MAKING BIG DATA, SMALL
Using distributed systems for processing, analysing and managing
huge data sets


    Marcin Jedyk
    Software Professional’s Network, Cheshire Datasystems Ltd
WARM-UP QUESTIONS
 How many of you have heard about Big Data before?
 How many about NoSQL?

 Hadoop?
AGENDA
 Intro – motivation, goal and ‘not about…’
 What is Big Data?
 NoSQL and systems classification
 Hadoop & HDFS
 MapReduce & live demo
 HBase
AGENDA
 Pig
 Building Hadoop cluster

 Conclusions

 Q&A
MOTIVATION
 Data is everywhere – why not analyse it?
 With Hadoop and NoSQL systems, building
  distributed systems is easier than before
 Relying on software & cheap hardware rather
  than expensive hardware works better!
GOAL
 To explain basic ideas behind Big Data
 To present different approaches towards BD

 To show that Big Data systems are easy to build

 To show you where to start with such systems
WHAT IS IT NOT ABOUT?
 Not a detailed lecture on a single system
 Not about advanced techniques in Big Data

 Not only about technology – but also about its
  application
WHAT IS BIG DATA?
   Data characterised by 3 Vs:
     Volume

     Variety

     Velocity

   The interesting ones: variety & velocity
WHAT IS BIG DATA?
 Data of high velocity: can’t store it? Process it on
  the fly!
 Data of high variety: doesn’t fit into a relational
  schema? Don’t use a schema, use NoSQL!
 Data which is impractical to process on a single
  server
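One way to read ‘process on the fly’ is to keep only a running aggregate and drop each event once it has been counted. The sketch below is a minimal Python illustration of that idea; the event tuples and the name running_totals are made up for this example and do not come from the talk.

```python
def running_totals(events):
    """Consume (host, size) events one at a time, keeping only per-host totals.

    The raw events are never stored; only the small dictionary of totals is.
    """
    totals = {}
    for host, size in events:
        totals[host] = totals.get(host, 0) + size
        yield host, totals[host]


# Hypothetical event stream; in practice this could be a socket or a log tail.
stream = iter([("10.0.0.1", 100), ("10.0.0.2", 50), ("10.0.0.1", 25)])
updates = list(running_totals(stream))
```

Because running_totals is a generator, it works the same way on an unbounded stream as on this three-event list.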
NO-SQL
 Goes hand in hand with Big Data
 NoSQL – an umbrella term for non-relational
  databases or data stores
 It’s not always possible to replace an RDBMS with
  NoSQL! (the opposite is also true)
NO-SQL
   NoSQL DBs are built around different principles
     Key-value stores: Redis, Riak
     Document stores: e.g. MongoDB – a record is a
      document; each entry carries its own metadata (JSON-like,
      BSON)
     Table stores: e.g. HBase – data persisted in multiple
      columns (even millions), billions of rows and multiple
      versions of records
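The three families above differ mainly in the shape of what they store. The Python sketch below models each shape with plain dictionaries. It is only an illustration of the data models, not the API of Redis, Riak, MongoDB or HBase, and the sample record and the helper latest are hypothetical.

```python
# Key-value store: the value is opaque to the store; lookup is by key only.
key_value = {"user:42": '{"name": "Ada", "visits": 3}'}

# Document store: each record is a self-describing document, so two
# documents in the same collection may carry different fields.
documents = [
    {"_id": 42, "name": "Ada", "visits": 3},
    {"_id": 43, "name": "Bob", "email": "bob@example.org"},
]

# Table store: row key -> column -> {timestamp: value}, so a single cell
# can keep several versions of its value.
table = {
    "user:42": {
        "info:name": {1: "Ada"},
        "stats:visits": {1: 2, 2: 3},
    }
}


def latest(table, row, column):
    """Return the most recent version of a cell (highest timestamp wins)."""
    versions = table[row][column]
    return versions[max(versions)]


current_visits = latest(table, "user:42", "stats:visits")
```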
HADOOP
 Existed before the ‘Big Data’ buzzword emerged
 A simple idea – MapReduce

 A primary purpose – to crunch tera- and
  petabytes of data
 HDFS as the underlying distributed file system
HADOOP – ARCHITECTURE BY EXAMPLE
 Imagine you need to process 1TB of logs
 What would you need?

 A server!
HADOOP – ARCHITECTURE BY EXAMPLE
 But 1TB is quite a lot of data… we want it
  quicker!
 OK, what about a distributed environment?
HADOOP – ARCHITECTURE BY EXAMPLE
   So what about that Hadoop stuff?
     Each node can: store data & process it (DataNode
      & TaskTracker)
HADOOP – ARCHITECTURE BY EXAMPLE
   How about allocating jobs to slaves? We need a
    JobTracker!
HADOOP – ARCHITECTURE BY EXAMPLE
 How about HDFS: how are data blocks
  assembled into files?
 The NameNode keeps track of that.
HADOOP – ARCHITECTURE BY EXAMPLE
 NameNode – manages HDFS metadata, doesn’t
  handle file contents directly
 JobTracker – schedules, allocates and monitors
  job execution on slaves – TaskTrackers
 TaskTracker – runs MapReduce operations
 DataNode – stores blocks of HDFS – default
  replication level for each block: 3
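The split between NameNode metadata and DataNode storage can be sketched in a few lines of Python. This is a toy model under stated assumptions: it uses the 64MB block size and replication level 3 from this talk, but places replicas round-robin, whereas real HDFS placement also takes racks into account. The function plan_blocks and the node names are hypothetical.

```python
import itertools

BLOCK_SIZE = 64 * 1024 * 1024  # 64MB, the block size used in this talk
REPLICATION = 3                # default replication level per block


def plan_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return NameNode-style metadata: block index -> DataNodes holding a copy.

    Round-robin placement is an assumption made for simplicity; it yields
    distinct nodes per block as long as len(datanodes) >= replication.
    """
    n_blocks = -(-file_size // block_size)  # ceiling division
    ring = itertools.cycle(datanodes)
    return {
        block: [next(ring) for _ in range(replication)]
        for block in range(n_blocks)
    }


# A 200MB file on a hypothetical four-node cluster needs four blocks.
metadata = plan_blocks(200 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
```

The dictionary is all the NameNode side needs to know; the block contents themselves would live only on the listed DataNodes.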
HADOOP - LIMITATIONS
 DataNodes & TaskTrackers are fault tolerant
 NameNode & JobTracker are NOT! (workarounds
  exist for this problem)
 HDFS deals nicely with large files, doesn’t do
  well with billions of small files
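A rough way to see why small files hurt: the NameNode keeps an entry for every file and every block, so the same volume of data costs far more metadata when it is split into many small files. The Python sketch below counts those entries. The figure of about 150 bytes per entry is an often-quoted rule of thumb used here as an assumption, not a measurement, and namenode_objects is a hypothetical helper.

```python
BYTES_PER_OBJECT = 150  # assumed rule of thumb, not a measured value


def namenode_objects(total_bytes, file_size, block_size=64 * 2**20):
    """Count metadata entries: one per file plus one per block of each file."""
    files = -(-total_bytes // file_size)            # ceiling division
    blocks_per_file = -(-file_size // block_size)   # ceiling division
    return files * (1 + blocks_per_file)


one_tib = 2**40
small = namenode_objects(one_tib, 2**20)  # 1TB stored as 1MB files
large = namenode_objects(one_tib, 2**30)  # 1TB stored as 1GB files

# Estimated NameNode memory under the assumed per-object size.
small_bytes = small * BYTES_PER_OBJECT
large_bytes = large * BYTES_PER_OBJECT
```

With these inputs, small is 2097152 entries and large is 17408, roughly a 120-fold difference for the same terabyte.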
MAP_REDUCE
 MapReduce – parallelisation approach
 Two main stages:
     Map – does the actual bit of work, e.g. extracts info
     Reduce – summarises, aggregates or filters outputs from
      the Map operations
   For each job, multiple Map and Reduce operations
    – each may run on different node = parallelism
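The two stages can be sketched as a single-process Python function. This is a minimal illustration of the programming model only: everything runs in one process here, whereas Hadoop would run the Map and Reduce calls on different nodes. The names map_reduce, count_words and total are made up for this example.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper on every record, group emitted values by key, then reduce."""
    groups = defaultdict(list)
    for record in records:                 # Map stage: each call is independent
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce stage: each key is independent, so keys could be spread over nodes.
    return {key: reducer(key, values) for key, values in groups.items()}


def count_words(line):
    """Mapper: emit (word, 1) for every word in a line."""
    return [(word, 1) for word in line.split()]


def total(key, values):
    """Reducer: add up the values collected for one key."""
    return sum(values)


counts = map_reduce(["big data", "big cluster"], count_words, total)
```

Because no Map call depends on another, and no key depends on another key during Reduce, both stages parallelise naturally.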
MAP_REDUCE – AN EXAMPLE
 Let’s process 1TB of raw logs and extract traffic by
  host.
 After a job is submitted, the JobTracker allocates tasks
  to slaves – if the input is divided into 64MB packs, that is
  16384 Map operations!
 Map – analyses the logs and returns results as a set of
  <key,value> pairs
 Reduce – merges the output of the Map operations
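The figure of 16384 follows from dividing the input size by the pack size, reading 1TB and 64MB as binary units. A short Python check, with map_task_count as a hypothetical helper name:

```python
TIB = 2**40  # 1TB read as a binary terabyte
MIB = 2**20  # 1MB read as a binary megabyte


def map_task_count(input_bytes, split_bytes):
    """Number of Map operations if every split gets exactly one Map task."""
    return -(-input_bytes // split_bytes)  # ceiling division


tasks = map_task_count(1 * TIB, 64 * MIB)
```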
MAP_REDUCE – AN EXAMPLE
  Take a look at a mock log extract:
[IP – bandwidth]
10.0.0.1 – 1234
10.0.0.1 – 900
10.0.0.2 – 1230
10.0.0.3 – 999
MAP_REDUCE – AN EXAMPLE
 It’s important to define the key, in this case the IP. Map returns:
<10.0.0.1;2134>
<10.0.0.2;1230>
<10.0.0.3;999>
 Now, assume another Map operation returned:
<10.0.0.1;1500>
<10.0.0.3;1000>
<10.0.0.4;500>
MAP_REDUCE – AN EXAMPLE
Now, Reduce will merge those results:
<10.0.0.1;3634>
<10.0.0.2;1230>
<10.0.0.3;1999>
<10.0.0.4;500>
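The example can be replayed in Python to check the arithmetic. The sketch below parses the mock log extract in a Map step, takes the second Map output as given, and adds the per-IP totals in a Reduce step. The function names map_split and reduce_outputs are hypothetical, and the parsing assumes the ‘IP – bandwidth’ layout of the mock extract.

```python
from collections import Counter


def map_split(lines):
    """Map: parse 'IP – bandwidth' lines and emit per-IP totals for one split."""
    totals = Counter()
    for line in lines:
        ip, _, bandwidth = line.partition(" – ")
        totals[ip] += int(bandwidth)
    return totals


def reduce_outputs(map_outputs):
    """Reduce: add up the per-IP totals coming from all Map operations."""
    merged = Counter()
    for output in map_outputs:
        merged.update(output)  # Counter.update adds counts per key
    return dict(merged)


first = map_split([
    "10.0.0.1 – 1234",
    "10.0.0.1 – 900",
    "10.0.0.2 – 1230",
    "10.0.0.3 – 999",
])
second = Counter({"10.0.0.1": 1500, "10.0.0.3": 1000, "10.0.0.4": 500})
result = reduce_outputs([first, second])
```

Here result holds 3634 for 10.0.0.1, 1230 for 10.0.0.2, 1999 for 10.0.0.3 and 500 for 10.0.0.4.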
MAP_REDUCE
 Selecting a key is important
 It’s possible to define a composite key, e.g.
  IP+date
 For more complex tasks, it’s possible to chain
  MapReduce jobs
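Composite keys and chaining can be shown together in a small Python sketch. The first job keys on (IP, date) to get traffic per host per day; the second job takes the first job's output as input and finds each host's busiest day. The helper run_job, the record layout and the dates are assumptions made for this illustration.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """One MapReduce job: map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [(key, reducer(values)) for key, values in groups.items()]


# Hypothetical log records: (ip, date, bytes).
logs = [
    ("10.0.0.1", "2012-06-01", 1234),
    ("10.0.0.1", "2012-06-01", 900),
    ("10.0.0.1", "2012-06-02", 1500),
    ("10.0.0.2", "2012-06-01", 1230),
]

# Job 1: composite key (ip, date) -> total bytes per host per day.
daily = run_job(logs, lambda r: [((r[0], r[1]), r[2])], sum)

# Job 2, chained on job 1's output: key = ip -> traffic on its busiest day.
peak = run_job(daily, lambda kv: [(kv[0][0], kv[1])], max)
```

The choice of key decides what gets grouped together, which is why the same run_job helper answers two different questions here.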
HBASE
 Another layer on top of Hadoop/HDFS
 A distributed data store

 Not a replacement for RDBMS!

 Can be used with MapReduce

 Good for unstructured data – no need to worry
  about exact schema in advance
PIG – HBASE ENHANCEMENT
 HBase – lacks a proper query language
 Pig – makes life easier for HBase users

 Translates queries into MapReduce jobs

 When working with Pig or HBase, forget what
  you know about SQL – it makes your life easier
BUILDING HADOOP CLUSTER
 Ex-production servers are OK
 Don’t take ‘cheap hardware’ too literally
 Good connection between nodes is a must!
 >=1Gbps between nodes
 >=10Gbps between racks
 1 disk per CPU core
 More RAM, more caching!
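To get a feel for why the link speed matters, the Python sketch below estimates how long it takes to move a given amount of data over a link. It assumes decimal units (1TB = 10^12 bytes, 1Gbps = 10^9 bits per second) and an otherwise idle link with no protocol overhead, so real transfers will be slower. The helper transfer_seconds is hypothetical.

```python
def transfer_seconds(num_bytes, link_gbps):
    """Idealised transfer time: payload bits divided by link bits per second."""
    return num_bytes * 8 / (link_gbps * 1e9)


one_tb = 1e12  # decimal terabyte, an assumption for round numbers

node_link = transfer_seconds(one_tb, 1)   # 1Gbps between nodes
rack_link = transfer_seconds(one_tb, 10)  # 10Gbps between racks
```

Under these assumptions, 1TB takes about 8000 seconds (over two hours) at 1Gbps and about 800 seconds at 10Gbps.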
FINAL CONCLUSIONS
 Hadoop and NoSQL-like DB/DS scale very well
 Hadoop is ideal for crunching huge data sets

 Does very well in production environments

 Cluster of slaves is fault tolerant, NameNode
  and JobTracker are not!
EXTERNAL RESOURCES
 Trending Topic – built on Wikipedia access logs:
  http://goo.gl/BWWO1
 Building web crawler with Hadoop:
  http://goo.gl/xPTlJ
 Analysing adverse drug events:
  http://goo.gl/HFXAx
 Moving average for large data sets:
  http://goo.gl/O4oml
EXTERNAL RESOURCES – USEFUL LINKS
http://www.slideshare.net/fullscreen/jpatanooga/la-hug-dec-2011-recommendation-talk/1
https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation+Guide
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
http://hstack.org/hbase-performance-testing/
http://www.theregister.co.uk/2012/06/12/hortonworks_data_platform_one/
http://wiki.apache.org/hadoop/MachineScaling
http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
http://www.cloudera.com/resource-types/video/
http://hstack.org/why-were-using-hbase-part-2/
QUESTIONS?
