Bigdata and Hadoop overview with Hadoop cluster setup

•

1 recomendación•1,246 vistas

This document provides an introduction to big data and Hadoop. It discusses how distributed systems can scale to handle large data volumes and discusses Hadoop's architecture. It also provides instructions on setting up a Hadoop cluster on a laptop and summarizes Hadoop's MapReduce programming model and YARN framework. Finally, it announces an upcoming workshop on Spark and Pyspark.

Datos y análisis

Bigdata and Hadoop
Haridas Narayanaswamy
https://haridas.in
06/Mar/2019

Quick reminder
1. HYD and BLR: Online Attendance registration link
https://docs.google.com/spreadsheets/d/1DAuclSqS6xdv4QtixUQIs6
OjpxXRLVEEmnCeAkgD9z8/edit#gid=0
2.
docker pull haridasn/hadoop-2.8.5
docker pull haridasn/hadoop-cli

Agenda
• Introduction to the big-data problems
• How we can scale systems.
• Hadoop introduction
• Setup a Hadoop cluster on your laptop.

IO bound vs CPU bound problems
• Downloading a file
• Bitcoin mining ( proof of work )
• Watching a movie
• ETL jobs
• ML training

Latency Numbers
• nano second - unit
speed

Application Architectures
Online
Standalone
Distributed

Monolithic
• All on single machine
• Every application starts here

N-tier systems
• Multiple services in your application
• Separation of concerns
• Master-slave database systems
• Load balancer
• Caching layer
• Pretty good for standard application
workloads.

SOA and Microservices
• Service Oriented Architecture
• Each service will do only one task – fine
grained separation, hence named as
micro-services
• API Gateways
• Messaging Systems
• Scalable Databases accessible to micro
services
• Caching services
• Etc.

CAP Theorem
• Consistency - Reads from all
machines on the cluster would give
same data.
• Availability - All operation would
produce some result, even though
they aren’t consistent.
• Partition tolerance - System
perform normally even after some
nodes got disconnected from
network.
• Eventual Consistent systems
• Split brain problem

Bigdata platforms
Where they fit in ?
Offline vs Online systems

Count number of unique hits from India
• Consider this scenario for Facebook
• Daily total log file comes is in Petabytes
• One machine can’t save it
• Other scenarios
• User click tracking
• Tracking effectiveness of an Ad
• All flavours of ETL jobs

Hadoop
• Every node can store the data - HDFS
• Every node can do processing on data it
holds locally.
• Easily scalable
• Redundant and fault tolerant
• Main Components
• Name Node
• Data Node
• Resource Manager
• Node Manager

Map-Reduce compute model
• Main stages of map-reduce
• Copy code not data.
• Data locality
• How we can use map-reduce to count
unique hits from a region ?

Yarn and HDFS
• Main stages of map-reduce
• Dynamic programming ?
• Bring code to data location
• Data locality
• How we can use mapreduce to count
unique hits from a region ?

Yarn Framework
• Client Submits jobs into cluster
• AM handles application orchestration,
once it has been started by RM
• Containers are actual resources (
CPU/Ram/IO) allocated to AM.
• Job progress tracking

Hybrid environments
• Storage on cloud storages s3/azure storage
• Execution engine on our cloud or Kubernetes.

Next: Spark on Hadoop Yarn and Pyspark
ON 13-March-2019

Más contenido relacionado

La actualidad más candente

RedisConf17 - Home Depot - Turbo charging existing applications with RedisRedis Labs

Hoodie: How (And Why) We built an analytical datastore on SparkVinoth Chandar

Zabbix at scale with ElasticsearchLeandro Totino Pereira

Apache Cassandra Lunch #72: Databricks and CassandraAnant Corporation

A Collaborative Data Science Development WorkflowDatabricks

Scylla Summit 2022: New AWS Instances Perfect for ScyllaDBScyllaDB

Scylla Summit 2018: Adventures in AdTech: Processing 50 Billion User Profiles...ScyllaDB

Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...DataStax Academy

Scylla Summit 2018: Getting the Most Out of Scylla on KubernetesScyllaDB

HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and ...Michael Stack

Running Scylla on Kubernetes with Scylla OperatorScyllaDB

Scylla Summit 2019 Keynote - Avi KivityScyllaDB

Scylla Summit 2022: Rakuten’s Catalog Platform Migration from Cassandra to Sc...ScyllaDB

Real-time Fraud Detection for Southeast Asia’s Leading Mobile PlatformScyllaDB

HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseMichael Stack

Presto Summit 2018 - 02 - LinkedInkbajda

Introducing Scylla Open Source 4.0ScyllaDB

FireEye & Scylla: Intel Threat Analysis Using a Graph DatabaseScyllaDB

Disney+ Hotstar: Scaling NoSQL for Millions of Video On-Demand UsersScyllaDB

Introduction to AWS Big Data Omid Vahdaty

La actualidad más candente (20)

RedisConf17 - Home Depot - Turbo charging existing applications with Redis

Hoodie: How (And Why) We built an analytical datastore on Spark

Zabbix at scale with Elasticsearch

Apache Cassandra Lunch #72: Databricks and Cassandra

A Collaborative Data Science Development Workflow

Scylla Summit 2022: New AWS Instances Perfect for ScyllaDB

Scylla Summit 2018: Adventures in AdTech: Processing 50 Billion User Profiles...

Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ...

Scylla Summit 2018: Getting the Most Out of Scylla on Kubernetes

HBaseConAsia2018 Track2-6: Scaling 30TB's of data lake with Apache HBase and ...

Running Scylla on Kubernetes with Scylla Operator

Scylla Summit 2019 Keynote - Avi Kivity

Scylla Summit 2022: Rakuten’s Catalog Platform Migration from Cassandra to Sc...

Real-time Fraud Detection for Southeast Asia’s Leading Mobile Platform

HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase

Presto Summit 2018 - 02 - LinkedIn

Introducing Scylla Open Source 4.0

FireEye & Scylla: Intel Threat Analysis Using a Graph Database

Disney+ Hotstar: Scaling NoSQL for Millions of Video On-Demand Users

Introduction to AWS Big Data

Similar a Bigdata and Hadoop overview with Hadoop cluster setup

project--2 nd review_2Aswini Ashu

project--2 nd review_2aswini pilli

Operational Intelligence Using HadoopDataWorks Summit

Apache Tez : Accelerating Hadoop Query ProcessingBikas Saha

YARN Ready: Integrating to YARN with Tez Hortonworks

Practical introduction to hadoopinside-BigData.com

How @twitterhadoop chose google cloudlohitvijayarenu

Big data - Online TrainingLearntek1

Apache Tez - Accelerating Hadoop Data Processinghitesh1892

Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit

Apache Tez: Accelerating Hadoop Query ProcessingHortonworks

Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju

Webinar: Enterprise Trends for Database-as-a-ServiceMongoDB

What's the Hadoop-la about Kubernetes?DataWorks Summit

Tez big datacamp-la-bikas_sahaData Con LA

How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuYahoo Developer Network

eHarmony in the CloudCraig Dickson

Hadoop-Quick introductionSandeep Singh

Apache Tez -- A modern processing enginebigdatagurus_meetup

HdInsight essentials Hadoop on Microsoft Platformnvvrajesh

Similar a Bigdata and Hadoop overview with Hadoop cluster setup (20)

project--2 nd review_2

Operational Intelligence Using Hadoop

Apache Tez : Accelerating Hadoop Query Processing

YARN Ready: Integrating to YARN with Tez

Practical introduction to hadoop

How @twitterhadoop chose google cloud

Big data - Online Training

Apache Tez - Accelerating Hadoop Data Processing

Apache Tez - A New Chapter in Hadoop Data Processing

Apache Tez: Accelerating Hadoop Query Processing

Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Webinar: Enterprise Trends for Database-as-a-Service

What's the Hadoop-la about Kubernetes?

Tez big datacamp-la-bikas_saha

How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu

eHarmony in the Cloud

Hadoop-Quick introduction

Apache Tez -- A modern processing engine

HdInsight essentials Hadoop on Microsoft Platform

Último

Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson

English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml

Networking Case Study prepared by teacher.pptxHimangsuNath

modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx

Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ

FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone

Insurance Churn Prediction Data Analysis ProjectBoston Institute of Analytics

Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics

Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics

Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter

convolutional neural network and its applications.pdfSubhamKumar3239

Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics

6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)

INTRODUCTION TO Natural language processingsocarem879

Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Milind Agarwal

wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1

Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann

Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy

What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17

Principles and Practices of Data VisualizationKianJazayeri1

Bigdata and Hadoop overview with Hadoop cluster setup

1. Bigdata and Hadoop Haridas Narayanaswamy https://haridas.in 06/Mar/2019

2. Quick reminder 1. HYD and BLR: Online Attendance registration link https://docs.google.com/spreadsheets/d/1DAuclSqS6xdv4QtixUQIs6 OjpxXRLVEEmnCeAkgD9z8/edit#gid=0 2. docker pull haridasn/hadoop-2.8.5 docker pull haridasn/hadoop-cli

3. Agenda • Introduction to the big-data problems • How we can scale systems. • Hadoop introduction • Setup a Hadoop cluster on your laptop.

4. IO bound vs CPU bound problems • Downloading a file • Bitcoin mining ( proof of work ) • Watching a movie • ETL jobs • ML training

5. Latency Numbers • nano second - unit speed

6. Application Architectures Online Standalone Distributed

7. Monolithic • All on single machine • Every application starts here

8. N-tier systems • Multiple services in your application • Separation of concerns • Master-slave database systems • Load balancer • Caching layer • Pretty good for standard application workloads.

9. SOA and Microservices • Service Oriented Architecture • Each service will do only one task – fine grained separation, hence named as micro-services • API Gateways • Messaging Systems • Scalable Databases accessible to micro services • Caching services • Etc.

10. CAP Theorem • Consistency - Reads from all machines on the cluster would give same data. • Availability - All operation would produce some result, even though they aren’t consistent. • Partition tolerance - System perform normally even after some nodes got disconnected from network. • Eventual Consistent systems • Split brain problem

11. Bigdata platforms Where they fit in ? Offline vs Online systems

12. Count number of unique hits from India • Consider this scenario for Facebook • Daily total log file comes is in Petabytes • One machine can’t save it • Other scenarios • User click tracking • Tracking effectiveness of an Ad • All flavours of ETL jobs

13. Hadoop • Every node can store the data - HDFS • Every node can do processing on data it holds locally. • Easily scalable • Redundant and fault tolerant • Main Components • Name Node • Data Node • Resource Manager • Node Manager

14. Map-Reduce compute model • Main stages of map-reduce • Copy code not data. • Data locality • How we can use map-reduce to count unique hits from a region ?

15. Yarn and HDFS • Main stages of map-reduce • Dynamic programming ? • Bring code to data location • Data locality • How we can use mapreduce to count unique hits from a region ?

16. Yarn Framework • Client Submits jobs into cluster • AM handles application orchestration, once it has been started by RM • Containers are actual resources ( CPU/Ram/IO) allocated to AM. • Job progress tracking

17. Hybrid environments • Storage on cloud storages s3/azure storage • Execution engine on our cloud or Kubernetes.

18. Workshop

19. Next: Spark on Hadoop Yarn and Pyspark ON 13-March-2019

20. Thank you Haridas N <hn@haridas.in>

Bigdata and Hadoop overview with Hadoop cluster setup

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Similar a Bigdata and Hadoop overview with Hadoop cluster setup

Similar a Bigdata and Hadoop overview with Hadoop cluster setup (20)

Último

Último (20)

Bigdata and Hadoop overview with Hadoop cluster setup