BIG DATA
  Enlightening Big Data

               - Jayant
What is BIG Data?

How BIG is BIG Data?
How to define BIG Data?




             Defined by the 3 Vs (Volume, Velocity, Variety), introduced by Gartner’s Doug Laney in a 2001 research report.
Volume
Velocity
Facebook
  • 300m photos uploaded / day
  • 2.5b content items shared / day
  • 70K queries executed / day
  • 500+ TB / day

Twitter
  • 340m tweets / day
  • 140m active users

Google
  • 4.7b search queries / day
  • 20 PB of data processed / day

Walmart
  • 1m transactions / hour
  • 2.5 petabytes of data / hour
Variety




Structured Analysis
  Responses to the Pledge and to multiple-choice questions

Unstructured Analysis
  Responses to the following questions:
  • Share your story
  • Ask a question to Aamir
  • Send a message of hope
  • Share your solution
  Content Filtering Rating Tagging System (CFRTS)
  L0, L1, L2 phased analytics

Impact Analysis
  Crawling the general internet to measure the before-and-after scenario
  on a particular topic
Value




It is a capital mistake to theorize
              before one has data.

                -Sherlock Holmes
Variability
Charts shown (source: http://www.google.com/publicdata/directory):
• Who enjoys the fastest internet?
• Where does our energy come from?
• Living longer with fewer children
Veracity
Other Effect – Geo, Event …
3 I’s for Big Data
Ill-Defined
  • “data that’s an order of magnitude greater than data you’re accustomed to.”
    - Gartner analyst Doug Laney
  • “data that exceeds the processing capacity of conventional database systems. The data is too big,
    moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from
    this data, you must choose an alternative way to process it.”
    - Edd Dumbill, program chair for the O’Reilly Strata Conference

Intimidating
  • How do you make Big Data approachable?
  • There are many challenges in leveraging Big Data, from managing the data to having the right
    tools to get the insights that matter.
  • Companies like Splunk and Sumo Logic build Big Data apps for machine data.
  • Marketing relevance company BloomReach processes more than 100 million web pages,
    generating 94% average annual incremental traffic as a result.

Immediate
  • What’s actionable about Big Data?
  • “the analytic value of data decays rapidly.”
    - Andrew Rogers, founder and CTO of SpaceCurve
  • Being able to analyze your data as fast as possible is therefore critical to gaining competitive
    advantage: “strike while the iron is hot.”
Managing BIG Data
• Distributed Computing
• Multiprocessing Unit
• Parallel processing

• SMP (Symmetric Multiprocessing):
  SMP systems use multiple processors that share a common operating system
  (OS) and memory.
  e.g. Microsoft SQL Server 2008 R2 Fast Track Data Warehouse platform

• MPP (Massively Parallel Processing):
  MPP systems harness numerous processors, each with its own OS and memory,
  working on different parts of an operation in a coordinated way.
  e.g. Microsoft’s Parallel Data Warehouse solution

• NoSQL Platforms:
  They increase performance at a lower cost, with linear scalability, true
  commodity hardware, a schema-free structure, and more relaxed
  data-consistency validation.
  e.g. Hadoop
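
The difference between SMP and MPP can be sketched in a few lines of plain Java. This is only an illustration, not how any of the products named above work: the class and method names are hypothetical, threads stand in for SMP processors, and private array copies stand in for MPP nodes that each own their memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.LongStream;

final class SmpVsMppSketch {

    // SMP style: all worker threads read the same array in shared memory.
    static long sharedMemorySum(long[] data) {
        return Arrays.stream(data).parallel().sum();
    }

    // MPP style: the data is partitioned; each hypothetical "node" owns a
    // private copy of its partition, computes a partial result, and a
    // coordinator combines the partial results.
    static long sharedNothingSum(long[] data, int nodes) {
        int chunk = (data.length + nodes - 1) / nodes; // ceiling division
        List<Long> partials = new ArrayList<>();
        for (int n = 0; n < nodes; n++) {
            int from = Math.min(n * chunk, data.length);
            int to = Math.min(from + chunk, data.length);
            long[] local = Arrays.copyOfRange(data, from, to); // node-local memory
            partials.add(Arrays.stream(local).sum());
        }
        return partials.stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        long[] data = LongStream.rangeClosed(1, 1_000).toArray();
        System.out.println(sharedMemorySum(data));
        System.out.println(sharedNothingSum(data, 4));
    }
}
```

Both calls return the same total; what differs is where the data lives while the work is done, which is the distinction the SMP and MPP bullets draw.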
Evolution – Distributed System
ACID
•    Atomicity
•    Consistency
•    Isolation
•    Durability
For internet workloads on distributed systems, ACID properties are too strong.

BASE
•    Basically Available
•    Soft state
•    Eventual consistency
Rather than requiring consistency after every transaction, it is enough
for the database to eventually reach a consistent state.

Brewer’s CAP Theorem for Distributed Systems
•    Consistent – Reads always pick up the latest write.
•    Available – The system can always read and write.
•    Partition tolerant – The system can be split across
     multiple machines and datacenters.
A distributed system can provide at most two of these three.
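
Eventual consistency can be illustrated with a toy in-memory store. This is a minimal sketch under stated assumptions, not a real database: the class name, the single replica, and the explicit `replicate()` call (standing in for asynchronous replication) are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

final class EventualConsistencySketch {
    private final Map<String, String> primary = new HashMap<>();
    private final Map<String, String> replica = new HashMap<>();
    // Writes accepted by the primary but not yet shipped to the replica.
    private final Queue<String[]> pending = new ArrayDeque<>();

    // A write is acknowledged as soon as the primary has it.
    void write(String key, String value) {
        primary.put(key, value);
        pending.add(new String[] {key, value});
    }

    // Reads served by the replica stay available but may be stale.
    String readFromReplica(String key) {
        return replica.get(key);
    }

    // Replication catches up; afterwards both copies agree.
    void replicate() {
        while (!pending.isEmpty()) {
            String[] kv = pending.remove();
            replica.put(kv[0], kv[1]);
        }
    }

    public static void main(String[] args) {
        EventualConsistencySketch store = new EventualConsistencySketch();
        store.write("cart", "1 item");
        System.out.println(store.readFromReplica("cart")); // stale read before replication
        store.replicate();
        System.out.println(store.readFromReplica("cart")); // consistent after replication
    }
}
```

The store answers every read (availability) and tolerates the replica lagging behind, at the price of a window in which a read does not see the latest write.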
Path to DataStack 3.0
                                                                                Must support Variety, Volume and Velocity


Data Stack 1.0                            Data Stack 2.0                    Data Stack 3.0
Relational Database Systems               Enterprise Data Warehouse         Dynamic Data Platform

Recording Business Events                 Support for Decision Making       Uncovering Key Insights

Highly Normalized Data                    Denormalized Dimensional Model    Schema-less Approach

GBs of Data                               TBs of Data                       PBs of Data

End User Access Through Enterprise Apps   End User Access Through Reports   End User Direct Access

Structured                                Structured                        Structured + Semi-Structured
Hadoop
• A scalable fault-tolerant grid operating system for data
  storage and processing
• Its scalability comes from the marriage of:
   • HDFS: Self-Healing High-Bandwidth Clustered Storage
   • MapReduce: Fault-Tolerant Distributed Processing
• Operates on unstructured and structured data
• A large and active ecosystem (many developers and additions
  like HBase, Hive, Pig, …)
• Open source under the friendly Apache License
• http://wiki.apache.org/hadoop/

Hadoop Design Axioms:
• System Shall Manage and Heal Itself
• Performance Shall Scale Linearly
• Compute Should Move to Data
• Simple Core, Modular and Extensible
Hadoop
• Hadoop’s inspiration – Google’s MapReduce
• Google’s GFS & GMR → Hadoop’s HDFS & HMR
• Hadoop was created by Doug Cutting and Michael J. Cafarella.
• Hadoop is written in the Java programming language and is a top-level Apache project
  being built and used by a global community of contributors.

Timeline:
• 2002-2004: Doug Cutting and Mike Cafarella start working on Nutch
• 2003-2004: Google publishes GFS and MapReduce papers
• 2004: Cutting adds DFS & MapReduce support to Nutch
• 2006: Yahoo! hires Cutting, Hadoop spins out of Nutch
• 2007: NY Times converts 4TB of archives over 100 EC2 instances
• 2008: Web-scale deployments at Y!, Facebook, Last.fm
• April 2008: Yahoo does fastest sort of a TB, 3.5 mins over 910 nodes
• May 2009:
     Yahoo does fastest sort of a TB, 62 secs over 1460 nodes
     Yahoo sorts a PB in 16.25 hours over 3658 nodes
• June 2009, Oct 2009: Hadoop Summit (750), Hadoop World (500)
HDFS           Hadoop Distributed File System
   Block Size = 64MB
  Replication Factor = 3




Cost/GB is a few ¢/month vs $/month
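
To make the block size and replication factor concrete, the arithmetic for a single file can be sketched as follows. The 64MB block size and replication factor of 3 come from the slide; the class name, method names, and the 1GB example file are assumptions for illustration only.

```java
final class HdfsBlockSketch {
    static final long MB = 1024L * 1024L;

    // Number of blocks a file is split into: ceiling of fileBytes / blockBytes.
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Raw bytes stored across the cluster once every block is replicated.
    static long rawBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long file = 1024 * MB;   // hypothetical 1GB file
        long block = 64 * MB;    // block size from the slide
        int replication = 3;     // replication factor from the slide

        long blocks = blockCount(file, block);
        System.out.println("Blocks: " + blocks);
        System.out.println("Block replicas: " + blocks * replication);
        System.out.println("Raw storage (MB): " + rawBytes(file, replication) / MB);
    }
}
```

With these values a 1GB file becomes 16 blocks and 48 block replicas, occupying 3GB of raw disk across the cluster.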
MapReduce   Distributed Processing
Working of Hadoop – I (Map Reduce)
Working of Hadoop – I (Map Reduce)
Working of Hadoop – I (MR Code)

   // Map phase: emit (word, 1) for every token in the input line.
   public void map(Object key, Text value, … ) {
     StringTokenizer itr = new StringTokenizer(value.toString());
     while (itr.hasMoreTokens()) {
       word.set(itr.nextToken());
       context.write(word, one);
     }
   }

   // Reduce phase: sum the counts emitted for each word.
   public void reduce(Text key, Iterable<IntWritable> values, … ) {
     int sum = 0;
     for (IntWritable val : values) {
       sum += val.get();
     }
     result.set(sum);
     context.write(key, result);
   }
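
The word-count snippet above depends on Hadoop classes and elides part of each signature, so it cannot run on its own. The following plain-Java sketch simulates the same map, shuffle, and reduce steps in a single process. It is an illustration only: it does not use the Hadoop API, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

final class LocalWordCountSketch {

    // Map phase: emit a (word, 1) pair for every token in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(Map.entry(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle phase: group the emitted values by key, sorted by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Drives the three phases over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        shuffle(emitted).forEach((word, ones) -> counts.put(word, reduce(ones)));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("big data", "big hadoop")));
        // prints {big=2, data=1, hadoop=1}
    }
}
```

In a real cluster the map calls run on the nodes that hold the input blocks and the shuffle moves data over the network; here everything happens in one JVM so the data flow is easy to follow.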
Working of Hadoop - II
Working of Hadoop - III
Hadoop Layout
Hadoop - Economics
• Typical Hardware:
   •   Two quad-core Nehalem CPUs
   •   24GB RAM
   •   12 * 1TB SATA disks (JBOD mode, no need for RAID)
   •   1 Gigabit Ethernet card
• Cost/node: $5K
• Effective HDFS Space:
   • ¼ of the 12TB raw disk is reserved for temp shuffle space, which leaves 9TB/node
   • 3-way replication leads to 3TB effective HDFS space/node
   • Assuming 7x compression, that becomes ~20TB/node
• Effective cost per user TB: $5K / 20TB = $250/TB
• Other solutions cost in the range of $5K to $100K per user TB
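
The per-node arithmetic above can be reproduced step by step. All input figures are taken from the slide; the class and method names are hypothetical. Note that 3TB times 7 is 21TB, which the slide rounds down to about 20TB before dividing.

```java
final class HadoopNodeEconomicsSketch {

    // Effective user-visible capacity of one node, in TB.
    static double effectiveTbPerNode(int disks, double tbPerDisk, double shuffleReserve,
                                     int replication, double compression) {
        double raw = disks * tbPerDisk;                       // 12 TB raw
        double afterShuffle = raw * (1.0 - shuffleReserve);   // 9 TB after temp space
        double afterReplication = afterShuffle / replication; // 3 TB after 3 copies
        return afterReplication * compression;                // 21 TB with 7x compression
    }

    // Node cost spread over the effective capacity.
    static double costPerUserTb(double nodeCostUsd, double effectiveTb) {
        return nodeCostUsd / effectiveTb;
    }

    public static void main(String[] args) {
        double exact = effectiveTbPerNode(12, 1.0, 0.25, 3, 7.0);
        System.out.printf("Effective capacity: %.0f TB/node%n", exact);
        System.out.printf("Cost at the slide's rounded 20 TB: $%.0f/TB%n",
                costPerUserTb(5000, 20));
        System.out.printf("Cost at the unrounded capacity: $%.0f/TB%n",
                costPerUserTb(5000, exact));
    }
}
```

Using the unrounded 21TB gives roughly $238 per user TB, so the slide's $250 figure is a slightly conservative estimate.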
   Powered by Hadoop:
        • Facebook
            • 1100-node cluster with 8800 cores
            • Stores copies of internal log and dimension data sources and uses them as a
              source for reporting/analytics and machine learning
        • Yahoo
            • Biggest cluster: 4000 nodes
            • Search Marketing, People you may know, Search Assist, and many more…
        • eBay
            • 532-node cluster (8 * 532 cores, 5.3PB)
            • Used for search optimization and research

                                                               http://wiki.apache.org/hadoop/PoweredBy
RDBMS and Hadoop
            RDBMS                   MapReduce
Data size   Gigabytes               Petabytes
Access      Interactive and batch   Batch
Structure   Fixed schema            Unstructured schema
Language    SQL                     Procedural (Java, C++, Ruby, etc.)
Integrity   High                    Low
Scaling     Nonlinear               Linear
Updates     Read and write          Write once, read many times
Latency     Low                     High
Choose Right Tool
BIG Data Landscape
Hadoopable Problem Types
              1                Batchable
• They are batchable into the two-phase Map/Reduce sequence(s)



              2                Massive Volume
• There is a need to analyze massive data volumes, which precludes solving them on more traditional platforms.



              3                No Data Dependency
• They exhibit little or no data dependence, meaning that work being done by one computational node is largely done on
  data locally accessible to that computational node.


              4                No Process Dependency
• They are amenable to massive parallelism in that there is little process dependence across computations. The tasks do
  not have to be “sequentialized,” meaning that those tasks really can be executed at the same time without having to
  wait for each other to provide interim results, except during the transition between the map and reduce phases.

              5                Unstructure++
• They are not limited to data managed within a structured environment, and in fact unstructured data analysis and
  analyzing combinations of structured and unstructured data are suitable.


              6                No Inter-Process Communication
• Individually-assigned tasks require limited inter-process communication, reducing any latency delays associated with
  injecting data into and pulling data out of a network.

                                                                     6 Super Scale Hadoop Deployments
Myths
              1                 Big Data is Only About Massive Data Volume
• Volume is just one key element in defining Big Data, and it is arguably the least important of three elements. The other
  two are variety and velocity.
• Experts consider PBs of data volume as the starting point for Big Data, although this volume indicator is a moving target.

              2                 Big Data Means Hadoop
• Hadoop is the Apache open-source software framework for working with Big Data. It was derived from Google
  technology and put to practice by Yahoo and others.
• Big Data is too varied and complex for a one-size-fits-all solution.

              3                 Big Data Means Unstructured Data
• The term “unstructured” is imprecise and doesn’t account for the many varying and subtle structures typically
  associated with Big Data types. Big Data is probably better termed “multi-structured” as it could include text strings,
  documents of all types, audio and video files, metadata, web pages, email messages, social media feeds, form data, etc.


              4                 Big Data is for Social Media Feeds and Sentiment Analysis
• The early pioneers of Big Data were the largest web-based and social media companies (Google, Yahoo, Facebook),
  but it was the volume, variety, and velocity of data generated by their services that required a radically new
  solution, rather than the need to analyze social feeds or gauge audience sentiment.


              5                 NoSQL means No SQL
• NoSQL means “not only SQL”: these types of data stores offer domain-specific access methods.
• Technologies in this NoSQL category include key value stores, document-oriented databases, graph databases, big table
  structures, and caching data stores.
Where/How it’s used

Business
• Behavioral analysis
• Targeting marketing offers
• Analyzing marketing effectiveness
• Root cause analysis
• Sentiment analysis
• Fraud analysis
• Risk mitigation

Technical
• Staging area for data warehouse / analytics
• Analytics sandbox
• Unstructured / semi-structured content storage and analysis
• Total data analysis
• Commodity-based storage
Applications
Case Study
Rigorous weekly operation cycle producing instant analytics
Killer combo of human + software to analyze the data efficiently

Weekly cycle:
• Topic opens on Sunday
• Live Analytics report is sent during the show
• Data capture from SMS, phone calls, social media, website,
• System runs L0 Analysis; L1, L2 Analysts continue
• JSONs are created for the external and internal dashboards
• Featured content is delivered thrice a day throughout the week
• Episode tags are refined and messages are re-ingested for another pass
Road Ahead…
“With too little data, you won’t be able to make any conclusions that you trust.
                  With loads of data you will find relationships that aren’t real…

                                      Big data isn’t about bits, it’s about talent”

                                                                – Douglas Merrill




                                                                    Q&A
Torture the data, and it will confess to anything.
          -Ronald Coase, Economics, Nobel Prize Laureate




                                               Thank You

Big Data & Hadoop Introduction

  • 1. BIG DATA Enlightening Big Data - Jayant
  • 2. What is BIG Data ? ? ? How BIG is BIG Data ?
  • 3. How to define BIG Data ? ? ? Gartner’s Doug Laney in a 2001 research report.
  • 5. Velocity
      • Facebook: 300m photos uploaded / day; 2.5b content items shared / day; 70K queries executed / day; 500+TB / day
      • Twitter: 340m tweets / day; 140m active users
      • Google: 4.7b search queries / day; processing 20 PB data / day
      • Walmart: 1m transactions / hour; 2.5 petabytes of data / hour
  • 6. Variety
      • Structured Analysis: Responses to Pledge, multiple choice questions
      • Unstructured Analysis: Responses to following questions
          • Share your story
          • Ask a question to Aamir
          • Send a message of hope
          • Share your solution
          • Content Filtering Rating Tagging System (CFRTS): L0, L1, L2 phased analytics
      • Impact Analysis: Crawling general internet for measuring the before & after scenario on a particular topic
  • 7. Value It is a capital mistake to theorize before one has data. -Sherlock Holmes
  • 8. Variability Who enjoys the fastest internet? Where does our energy come from? Living longer with fewer children http://www.google.com/publicdata/directory
  • 10. Other Effect – Geo, Event …
  • 11. 3 I’s for Big Data
      • Ill-Defined
          • “data that’s an order of magnitude greater than data you’re accustomed to.” - Gartner analyst Doug Laney
          • “data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.” - Edd Dumbill, program chair for the O’Reilly Strata Conference
      • Intimidating
          • How do you make Big Data approachable?
          • There are lots of challenges in leveraging Big Data, from managing the data to having the right tools to get you the insights that matter.
          • Companies like Splunk and Sumo Logic provide Big Data apps for machine data. Marketing relevance company BloomReach processes more than 100 million web pages, generating 94% average annual incremental traffic as a result.
      • Immediate
          • What’s actionable about big data?
          • “the analytic value of data decays rapidly.” - Andrew Rogers, founder and CTO of SpaceCurve
          • That means being able to analyze your data as fast as possible is critical to gaining competitive advantage: “strike while the iron is hot”.
  • 12. Managing BIG Data
      • Distributed Computing
      • Multiprocessing Unit
      • Parallel processing
      • SMP (Symmetric Multiprocessing): SMP systems use multiple processors that share a common operating system (OS) and memory, e.g. Microsoft SQL Server 2008 R2 Fast Track Data Warehouse platform.
      • MPP (Massively Parallel Processing): MPP systems harness numerous processors, each with its own OS and memory, working on different parts of an operation in a coordinated way, e.g. Microsoft’s Parallel Data Warehouse solution.
      • NoSQL Platforms: They increase performance at a lower cost, with linear scalability, true commodity hardware, a schema-free structure, and more relaxed data-consistency validation, e.g. Hadoop.
  • 13. Evolution – Distributed System
      • ACID: Atomicity, Consistency, Isolation, Durability. For the internet workload, with distributed computing, ACID properties are too strong.
      • BASE: Basically Available, Soft-state, Eventual consistency. Rather than requiring consistency after every transaction, it is enough for the database to eventually be in a consistent state.
      • Brewer’s CAP Theorem for Distributed Systems: a system can provide at most two of these three.
          • Consistent – Reads always pick up the latest write.
          • Available – Can always read and write.
          • Partition tolerant – The system can be split across multiple machines and datacenters.
  • 14. Path to DataStack 3.0 (must support Variety, Volume and Velocity)
      • Data Stack 1.0: Relational Database Systems; recording business events; highly normalized data; GBs of data; end-user access through enterprise apps; structured data
      • Data Stack 2.0: Enterprise Data Warehouse; support for decision making; denormalized dimensional model; TBs of data; end-user access through reports; structured data
      • Data Stack 3.0: Dynamic Data Platform; uncovering key insights; schema-less approach; PBs of data; end-user direct access; structured + semi-structured data
  • 15. Hadoop
      • A scalable fault-tolerant grid operating system for data storage and processing
      • Its scalability comes from the marriage of:
          • HDFS: Self-Healing High-Bandwidth Clustered Storage
          • MapReduce: Fault-Tolerant Distributed Processing
      • Operates on unstructured and structured data
      • A large and active ecosystem (many developers and additions like HBase, Hive, Pig, …)
      • Open source under the friendly Apache License
      • http://wiki.apache.org/hadoop/
      Hadoop Design Axioms:
      • System Shall Manage and Heal Itself
      • Performance Shall Scale Linearly
      • Compute Should Move to Data
      • Simple Core, Modular and Extensible
  • 16. Hadoop
      • Hadoop’s Inspiration – Google’s MapReduce
      • Google’s GFS & GMR → Hadoop’s HDFS & HMR
      • Hadoop was created by Doug Cutting and Michael J. Cafarella.
      • Hadoop is written in the Java programming language and is a top-level Apache project being built and used by a global community of contributors.
      Timeline:
      • 2002-2004: Doug Cutting and Mike Cafarella started working on Nutch
      • 2003-2004: Google publishes GFS and MapReduce papers
      • 2004: Cutting adds DFS & MapReduce support to Nutch
      • 2006: Yahoo! hires Cutting, Hadoop spins out of Nutch
      • 2007: NY Times converts 4TB of archives over 100 EC2s
      • 2008: Web-scale deployments at Y!, Facebook, Last.fm
      • April 2008: Yahoo does fastest sort of a TB, 3.5mins over 910 nodes
      • May 2009: Yahoo does fastest sort of a TB, 62secs over 1460 nodes; Yahoo sorts a PB in 16.25hours over 3658 nodes
      • June 2009, Oct 2009: Hadoop Summit (750), Hadoop World (500)
  • 17. HDFS Hadoop Distributed File System Block Size = 64MB Replication Factor = 3 Cost/GB is a few ¢/month vs $/month
  • 18. MapReduce Distributed Processing
  • 19. Working of Hadoop – I (Map Reduce)
  • 20. Working of Hadoop – I (Map Reduce)
  • 21. Working of Hadoop – I (MR Code)
      public void map(Object key, Text value, …) {
          // Emit (word, 1) for every token in the input line.
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
              word.set(itr.nextToken());
              context.write(word, one);
          }
      }
      public void reduce(Text key, Iterable<IntWritable> values, …) {
          // Sum the counts collected for this word.
          int sum = 0;
          for (IntWritable val : values) {
              sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
      }
  • 25. Hadoop - Economics
      • Typical Hardware:
          • Two Quad Core Nehalems
          • 24GB RAM
          • 12 * 1TB SATA disks (JBOD mode, no need for RAID)
          • 1 Gigabit Ethernet card
      • Cost/node: $5K/node
      • Effective HDFS Space:
          • ¼ reserved for temp shuffle space, which leaves 9TB/node
          • 3 way replication leads to 3TB effective HDFS space/node
          • But assuming 7x compression that becomes ~ 20TB/node
      • Effective Cost per user TB: $250/TB
      • Other solutions cost in the range of $5K to $100K per user TB
      Powered by Hadoop:
      • Facebook: 1100-nodes cluster with 8800 cores; store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning
      • Yahoo: Biggest cluster: 4000 nodes; Search Marketing, People you may know, Search Assist, and many more…
      • eBay: 532 nodes cluster (8 * 532 cores, 5.3PB); using it for Search optimization and Research
      http://wiki.apache.org/hadoop/PoweredBy
  • 26. RDBMS and Hadoop (RDBMS vs MapReduce)
      • Data size: Gigabytes vs Petabytes
      • Access: Interactive and batch vs Batch
      • Structure: Fixed schema vs Unstructured schema
      • Language: SQL vs Procedural (Java, C++, Ruby, etc)
      • Integrity: High vs Low
      • Scaling: Nonlinear vs Linear
      • Updates: Read and write vs Write once, read many times
      • Latency: Low vs High
  • 29. Hadoopable Problem Types
      1 Batchable • They are batchable into the two-phase Map/Reduce sequence(s)
      2 Massive Volume • There is a need to analyze massive data volumes, which precluded their solution using more traditional platforms.
      3 No Data Dependency • They exhibit little or no data dependence, meaning that work being done by one computational node is largely done on data locally accessible to that computational node.
      4 No Process Dependency • They are amenable to massive parallelism in that there is little process dependence across computations. The tasks do not have to be “sequentialized,” meaning that those tasks really can be executed at the same time without having to wait for each other to provide interim results, except during the transition between the map and reduce phases.
      5 Unstructure++ • They are not limited to data managed within a structured environment, and in fact unstructured data analysis and analyzing combinations of structured and unstructured data are suitable.
      6 No Inter-Process Communication • Individually-assigned tasks require limited inter-process communication, reducing any latency delays associated with injecting data into and pulling data out of a network.
      6 Super Scale Hadoop Deployments
  • 30. Myths
      1 Big Data is Only About Massive Data Volume
          • Volume is just one key element in defining Big Data, and it is arguably the least important of three elements. The other two are variety and velocity.
          • Experts consider PBs of data volume as the starting point for Big Data, although this volume indicator is a moving target.
      2 Big Data Means Hadoop
          • Hadoop is the Apache open-source software framework for working with Big Data. It was derived from Google technology and put to practice by Yahoo and others.
          • Big Data is too varied and complex for a one-size-fits-all solution.
      3 Big Data Means Unstructured Data
          • The term “unstructured” is imprecise and doesn’t account for the many varying and subtle structures typically associated with Big Data types. Big Data is probably better termed “multi-structured” as it could include text strings, documents of all types, audio and video files, metadata, web pages, email messages, social media feeds, form data, etc.
      4 Big Data is for Social Media Feeds and Sentiment Analysis
          • Early pioneers of Big Data have been the largest web-based and social media companies (Google, Yahoo, Facebook). It was the volume, variety, and velocity of data generated by their services that required a radically new solution, rather than the need to analyze social feeds or gauge audience sentiment.
      5 NoSQL means No SQL
          • NoSQL means “not only” SQL because these types of data stores offer domain-specific access
          • Technologies in this NoSQL category include key value stores, document-oriented databases, graph databases, big table structures, and caching data stores.
  • 31. Where/How it’s used
      Business
      • Behavioral analysis
      • Targeting marketing offers
      • Analyzing marketing effectiveness
      • Root cause analysis
      • Sentiment Analysis
      • Fraud Analysis
      • Risk Mitigation
      Technical
      • Staging area for Data warehouse / analytics
      • Analytics Sandbox
      • Unstructured / semi-structured content storage and analysis
      • Total data analysis
      • Commodity based Storage
  • 33. Case Study
      Rigorous Weekly Operation Cycle producing instant analytics
      Killer combo of Human + Software to analyze the data efficiently
      • Topic opens on Sunday Episode
      • Tags are refined and messages are re-ingested for another pass
      • Live Analytics report is sent during the show
      • Featured content is delivered thrice a day throughout the week.
      • Data capture from SMS, phone calls, social media, website
      • JSONs are created for the external and internal dashboards
      • System runs L0 Analysis; L1, L2 Analysts continue
  • 35. “With too little data, you won’t be able to make any conclusions that you trust. With loads of data you will find relationships that aren’t real… Big data isn’t about bits, it’s about talent” – Douglas Merrill Q&A
  • 36. Torture the data, and it will confess to anything. -Ronald Coase, Economics, Nobel Prize Laureate Thank You
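
The map and reduce fragments on slide 21 depend on Hadoop's classes and elide their parameters, so they cannot run on their own. As a rough illustration only, the sketch below mimics the same word-count flow in plain Java with no Hadoop dependency: a map step emits (word, 1) pairs, a grouping step stands in for the shuffle, and a reduce step sums each group. The class name `WordCountSketch`, the method names, and the sample input are assumptions made for this example; they are not part of the deck or of the Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical, single-process imitation of the word-count job on slide 21.
public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every token in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(Map.entry(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Reduce phase: sum all the counts collected for one word.
    static int reduce(String word, Iterable<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    // Driver: map every line, group pairs by key (a stand-in for the
    // shuffle that the real framework performs), then reduce each group.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            counts.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample input is made up for the illustration.
        System.out.println(wordCount(List.of("big data is big", "data is data")));
        // prints {big=2, data=3, is=2}
    }
}
```

On a real cluster the framework, not a local loop, runs the map calls in parallel across nodes and performs the grouping between the two phases.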

Editor's notes

  1. Veracity is defined as “conformity with truth or fact,” or in short, Accuracy or Certainty. Things that can cause us to question the data are inconsistencies, model approximations, ambiguities, deception, fraud, duplication, spam and latency. Variability: Say you go to an ice cream parlor that has 20 flavors of ice cream. That is Variety. Now, say you go there three days in a row and order strawberry, but each time it looks and tastes different. That is Variability: the different meanings/contexts associated with a given piece of data. Value: How fast & accurately you analyze and provide analytics to make sense (business sense) out of it.
  2. India alone generated about 40,000 PB of Data in 2010. – EMC & IDC data. Volume: Whether they deal with incoming or outgoing requests, companies with exceptionally large amounts of data always look for faster, more efficient, and lower-cost solutions for data storage and access requirements.
  3. 90% of Data was generated in last 2 years. Velocity: A high rate of data arriving from multiple, disparate sources in various formats requires solutions that rapidly process query requests for large data, and also support the acquisition and retention of data just as quickly.
  4. Variety: Traditionally, companies have only analyzed data in structured formats and have either fought to generate value from unstructured data or have confined their analysis to a structured part of the overall picture. Today’s technology, such as “Not Only SQL” (NoSQL) platforms, lets businesses combine structured data with unstructured and semi-structured data to answer questions spanning all of their managed data.
  5. “Data is the new oil!” - Clive Humby, ANA Senior marketer’s summit, 2006. “Data is the new oil? No: Data is the new soil.” - David McCandless, TEDGlobal, 2010. Value: IT departments have had to make tough decisions about which data to keep and how long to keep it, and the processing power required to perform large and complex ad hoc analysis often has been beyond the department’s capacity and budget. Big-data solutions can provide value through insights gained by combining larger sets of data than were previously possible to manage. Now, companies can harvest more external data on market conditions, customer satisfaction, and competitive analysis, performing what-if scenarios for new insights.
  6. Variability: The variability in data structure and how users want to interpret that data in the short and long term are considerations that may help a solution provider steer an organization toward a big data solution. Often the initial structure and content of data can change over time, and similar data from different sources can exhibit wide variability in structure and format. Big data solutions allow data to be stored in its original form and transformed for in-depth analysis when a user queries the data. China introduced 2-child @ 1970 & 1-child policy @ 1979.
  7. SMPs are limited by the capacity of the OS to manage the architecture, limiting solutions to 16 to 32 processors. MPPs often contain 50 to 200 processors or more. MPP systems can grow horizontally simply by adding more processors.
  8. Challenges with Distributed Computing
      • Cheap nodes fail, especially if you have many. Mean time between failures for 1 node = 3 years. Mean time between failures for 1000 nodes = 1 day. Solution: Build fault-tolerance into system.
      • Commodity network = low bandwidth. Solution: Push computation to the data.
      • Programming distributed systems is hard. Solution: Data-parallel programming model: users write “map” & “reduce” functions, system distributes work and handles faults.
  9. Confronted with a data explosion, Google engineers Jeff Dean and Sanjay Ghemawat architected (and published!) two seminal systems: the Google File System (GFS) and Google MapReduce (GMR). GFS was a brilliantly pragmatic solution to exabyte-scale data management using commodity hardware. GMR was an equally brilliant implementation of a long-standing design pattern applied to massively parallel processing of said data on said commodity machines. GFS and GMR became the core of the processing engine used to crawl, analyze, and rank web pages into the giant inverted index that we all use daily at google.com. Enter reverse engineering in the open source world, and, voila, Apache Hadoop (comprised of the Hadoop Distributed File System and Hadoop MapReduce) was born in the image of GFS and GMR. Doug, who was working at Yahoo at the time, named it after his son's toy elephant. Read: http://www.slideshare.net/mlmilleratmit/gluecon-miller-horizon
  10. Pool commodity servers in a single hierarchical namespace. Designed for large files that are written once and read many times. Example here shows what happens with a replication factor of 3: each data block is present in at least 3 separate data nodes. Typical Hadoop node is eight cores with 16GB RAM and four 1TB SATA disks. Default block size is 64MB, though most folks now set it to 128MB.
  11. Differentiate between MapReduce the platform and MapReduce the programming model. The analogy is similar to the RDBMS which executes the queries, and SQL which is the language for the queries. MapReduce can run on top of HDFS or a selection of other storage systems. Intelligent scheduling algorithms for locality, sharing, and resource optimization.
  12. RDBMS and Hadoop: Apples and Oranges? Choose right tools for the right job.
  13. Facebook, Yahoo, eBay, GE (Sentiment Analysis), Orbitz, Infochimps
  14. Notes on the five myths:
      1. Variety refers to the many different data and file types that are important to manage and analyze more thoroughly, but for which traditional relational databases are poorly suited. Some examples of this variety include sound and movie files, images, documents, geo-location data, web logs, and text strings. Velocity is about the rate of change in the data and how quickly it must be used to create real value. Traditional technologies are especially poorly suited to storing and using high-velocity data, so new approaches are needed. The more quickly the data in question is created and aggregated, and the more swiftly it must be used to uncover patterns and problems, the greater the velocity and the more likely that you have a Big Data opportunity.
      2. While Hadoop has surely captured the greatest name recognition, it is just one of three classes of technologies well suited to storing and managing Big Data. The other two classes are NoSQL and Massively Parallel Processing (MPP) data stores. Examples of MPP data stores include EMC’s Greenplum, IBM’s Netezza, and HP’s Vertica.
      3. The consistent trait of these varied data types is that the data schema isn’t known or defined when the data is captured and stored. Rather, a data model is often applied at the time the data is used.
      4. Now, thanks to rapidly increasing computer power (often cloud-based), open source software (e.g., the Apache Hadoop distribution), and a modern onslaught of data that could generate economic value if properly utilized, there is an endless stream of Big Data uses and applications.
      5. The specific native access methods to stored data provide a rich, low-latency approach, typically through a proprietary interface. SQL access has the advantage of familiarity and compatibility with many existing tools, although this usually comes at some expense of latency, driven by the translation of the query into the native “language” of the underlying system.
  15. “President Obama’s campaign ran an extremely sophisticated and relentless digital operation that threw out the rule book and took no assumption for granted, it was masterminded by data analysts who left nothing to chance.”
      • http://www.thebigdatainsightgroup.com/site/article/how-big-data-influenced-us-presidential-election
      • http://aws.typepad.com/aws/2012/11/aws-in-action-behind-the-scenes-of-a-presidential-campaign.html
  16. Human Profiling
      • Related Video: http://www.youtube.com/watch?v=DS310JMdu2s
      Healthcare
      • http://gigaom.com/2012/07/15/better-medicine-brought-to-you-by-big-data/
      • http://www.bbc.co.uk/news/health-21045594
      • http://www.blog.telecomfuturecentre.it/2013/02/06/reshaping-medicine-through-big-data/
      US Presidential Election: powered by Hadoop
      • http://www.thebigdatainsightgroup.com/site/article/how-big-data-influenced-us-presidential-election
      • http://aws.typepad.com/aws/2012/11/aws-in-action-behind-the-scenes-of-a-presidential-campaign.html
      Crime/Forensic
      • http://www.digitalreasoning.com/2012/industry-news/nypd-fights-crime-through-big-data/
      • http://www.digitalreasoning.com/2012/industry-news/big-data-fights-financial-crime/
      • http://blogs.unisys.com/eurovoices/index.php/2012/06/28/data-analysis-using-big-data-tools-for-financial-crime-prevention/
      Astronomy
      • http://escience.washington.edu/get-help-now/astronomical-image-processing-hadoop
      • http://escience.washington.edu/get-help-now/astronomy-large-scale-data-processing
      • http://www.theatlantic.com/technology/archive/2012/04/how-big-data-is-changing-astronomy-again/255917/
      Weather
      • http://www.forbes.com/sites/toddwoody/2012/03/21/meet-the-scientists-mining-big-data-to-predict-the-weather/
      • http://www.nasdaq.com/article/big-data-delivers-fewer-hunches-more-facts-to-weather-channel-20130211-00763#.URpje6XC3kM
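
The back-of-envelope figures in notes 8 and 10 (a 3-year node MTBF shrinking to about a day across 1000 nodes, and 64MB or 128MB blocks stored with a replication factor of 3) can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes independent node failures with identical MTBF, which is a simplification, and it uses a hypothetical 1 GB file. The class and method names are made up for this example.

```java
// Hypothetical helper for checking the arithmetic quoted in notes 8 and 10.
public class ClusterArithmeticSketch {

    // Expected time between failures somewhere in the cluster, in days.
    // Assumption: nodes fail independently and share the same MTBF.
    static double clusterMtbfDays(double nodeMtbfYears, int nodes) {
        return nodeMtbfYears * 365.0 / nodes;
    }

    // Number of blocks a file is split into (the last block may be partial).
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Raw bytes stored across the cluster once every block is replicated.
    static long rawBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long gb = 1024L * mb;

        // 3-year MTBF per node, 1000 nodes: roughly one failure per day.
        System.out.println(clusterMtbfDays(3.0, 1000));

        // A hypothetical 1 GB file with 64MB and 128MB blocks.
        System.out.println(blockCount(gb, 64 * mb));   // 16
        System.out.println(blockCount(gb, 128 * mb));  // 8

        // Replication factor 3 triples the raw storage used.
        System.out.println(rawBytes(gb, 3) / gb);      // 3
    }
}
```

The first figure is why the deck's design axioms call for a system that manages and heals itself: at this scale a node failure is a routine daily event.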