SlideShare una empresa de Scribd logo
1 de 80
Descargar para leer sin conexión
Cloud Computing Era
Trend Micro
Three Major Trends to Chang the World



                 Cloud Computing




           Big Data           Mobile
What is Cloud Computing?
 National Institute of Standards and Technology (NIST) definition :


                                                             Essential
                                                             Characteristics




                                                             Service Models




                                                             Deployment Models




 Provide scalable and elastic IT relative functions to users via as-a-
 service business models and internet technologies
It’s About the Ecosystem

                        Structured, Semi-structured

                                                      Cloud Computing

                                                            SaaS
   Enterprise Data Warehouse
                                                            PaaS

                                                            IaaS




                                                                                 Generate




                                                                             Big Data

                                                                                  Lead




                                                                          Business Insights

                                                                                  create




                                                                        Competition, Innovation,
                                                                             Productivity
What is Big Data?
What is the problem

• Getting the data to the processors
  becomes the bottleneck


• Quick calculation
   – Typical disk data transfer rate:
      • 75MB/sec
   – Time taken to transfer 100GB of data
     to the processor:
      • approx. 22   minutes!
The Era of Big Data – Are You Ready




        Data for business commercial analysis
        • 2011: multi-terabyte (TB)
        • 2020: 35.2 ZB (1 ZB = 1 billion TB)
Who Needs It?
           Enterprise Database                          Hadoop




When to use?                          When to use?
•   Ad-hoc Reporting (<1sec)          •   Affordable Storage/Compute
•   Multi-step Transactions           •   Unstructured or Semi-structured
•   Lots of Inserts/Updates/Deletes   •   Resilient Auto Scalability
Hadoop!
– inspired by
• Apache Hadoop project
  – inspired by Google's MapReduce and Google File System
    papers.
• Open sourced, flexible and available architecture for
  large scale computation and data processing on a
  network of commodity hardware
• Open Source Software + Hardware Commodity
  – IT Costs Reduction
Hadoop Core




              MapReduce
              HDFS




              © 2011 Cloudera, Inc. All Rights Reserved.
HDFS

• Hadoop Distributed File System
• Redundancy
• Fault Tolerant
• Scalable
• Self Healing
• Write Once, Read Many Times
• Java API
• Command Line Tool



                   © 2011 Cloudera, Inc. All Rights Reserved.
MapReduce

 • Two Phases of Functional Programming
 • Redundancy
 • Fault Tolerant
 • Scalable
 • Self Healing
 • Java API




13
                    © 2011 Cloudera, Inc. All Rights Reserved.
Hadoop Core

                                                            Java
Java

               MapReduce
               HDFS
                                                             Java
      Java

 14
               © 2011 Cloudera, Inc. All Rights Reserved.
Word Count Example




       Key: offset
       Value: line

                            Key: word      Key: word
                            Value: count   Value: sum of count



0:The cat sat on the mat
22:The aardvark sat on the sofa
The Hadoop Ecosystems
The Ecosystem is the System

• Hadoop has become the kernel of the distributed
  operating system for Big Data
• No one uses the kernel alone
• A collection of projects at Apache
Relation Map

                                                Hue                                   Mahout
                                            (Web Console)                         (Data Mining)


                                                                  Oozie
                                                        (Job Workflow & Scheduling)
    (Coordination)
                     Zookeeper




                                           Sqoop/Flume
                                                                          Pig/Hive (Analytical Language)
                                          (Data integration)



                                 MapReduce Runtime
                                 (Dist. Programming Framework)                         Hbase
                                                                                (Column NoSQL DB)


                                              Hadoop Distributed File System (HDFS)
Zookeeper – Coordination Framework

                                                Hue                                   Mahout
                                            (Web Console)                         (Data Mining)


                                                                  Oozie
                                                        (Job Workflow & Scheduling)
    (Coordination)
                     Zookeeper




                                           Sqoop/Flume
                                                                          Pig/Hive (Analytical Language)
                                          (Data integration)



                                 MapReduce Runtime
                                 (Dist. Programming Framework)                         Hbase
                                                                                (Column NoSQL DB)


                                              Hadoop Distributed File System (HDFS)
What is ZooKeeper

• A centralized service for maintaining
  – Configuration information
  – Providing distributed synchronization
• A set of tools to build distributed applications that can
  safely handle partial failures
• ZooKeeper was designed to store coordination data
  – Status information
  – Configuration
  – Location information
Flume / Sqoop – Data Integration Framework

                                                Hue                                   Mahout
                                            (Web Console)                         (Data Mining)


                                                                  Oozie
                                                        (Job Workflow & Scheduling)
    (Coordination)
                     Zookeeper




                                           Sqoop/Flume
                                                                          Pig/Hive (Analytical Language)
                                          (Data integration)



                                 MapReduce Runtime
                                 (Dist. Programming Framework)                         Hbase
                                                                                (Column NoSQL DB)


                                              Hadoop Distributed File System (HDFS)
What’s the problem for data collection

• Data collection is currently a priori and ad hoc
• A priori – decide what you want to collect ahead of time
• Ad hoc – each kind of data source goes through its own
  collection path
(and how can it help?)



• A distributed data collection service
• It efficiently collecting, aggregating, and moving large
  amounts of data
• Fault tolerant, many failover and recovery mechanism
• One-stop solution for data collection of all formats
Flume Architecture


Log                                                        Log
                                         ...


Flume Node                                                 Flume Node


HDFS


              © 2011 Cloudera, Inc. All Rights Reserved.
Flume Sources and Sinks

• Local Files
• HDFS
• Stdin, Stdout
• Twitter
• IRC
• IMAP




                  © 2011 Cloudera, Inc. All Rights Reserved.
Sqoop

• Easy, parallel database import/export
• What you want do?
  – Insert data from RDBMS to HDFS
  – Export data from HDFS back into RDBMS
Sqoop


        HDFS


        Sqoop


        RDBMS


28
        © 2011 Cloudera, Inc. All Rights Reserved.
Sqoop Examples

 $ sqoop import --connect jdbc:mysql://localhost/world --
 username root --table City
 ...

 $ hadoop fs -cat City/part-m-00000
 1,Kabul,AFG,Kabol,17800002,Qandahar,AFG,Qandahar,2375003,He
 rat,AFG,Herat,1868004,Mazar-e-
 Sharif,AFG,Balkh,1278005,Amsterdam,NLD,Noord-Holland,731200
 ...




29
                    © 2011 Cloudera, Inc. All Rights Reserved.
Pig / Hive – Analytical Language

                                                Hue                                   Mahout
                                            (Web Console)                         (Data Mining)


                                                                  Oozie
                                                        (Job Workflow & Scheduling)
    (Coordination)
                     Zookeeper




                                           Sqoop/Flume
                                                                          Pig/Hive (Analytical Language)
                                          (Data integration)



                                 MapReduce Runtime
                                 (Dist. Programming Framework)                         Hbase
                                                                                (Column NoSQL DB)


                                              Hadoop Distributed File System (HDFS)
Why Hive and Pig?

• Although MapReduce is very powerful, it can also be
  complex to master
• Many organizations have business or data analysts who
  are skilled at writing SQL queries, but not at writing Java
  code
• Many organizations have programmers who are skilled
  at writing code in scripting languages
• Hive and Pig are two projects which evolved separately
  to help such people analyze huge amounts of data via
  MapReduce
  – Hive was initially developed at Facebook, Pig at Yahoo!
Hive     – Developed by

• What is Hive?
  – An SQL-like interface to Hadoop
• Data Warehouse infrastructure that provides data
  summarization and ad hoc querying on top of Hadoop
  – MapRuduce for execution
  – HDFS for storage
• Hive Query Language
  – Basic-SQL : Select, From, Join, Group-By
  – Equi-Join, Muti-Table Insert, Multi-Group-By
  – Batch query
 SELECT * FROM purchases WHERE price > 100 GROUP BY storeid
Hive


       SQL


       Hive


       MapReduce


33
       © 2011 Cloudera, Inc. All Rights Reserved.
Pig                 – Initiated by




• A high-level scripting language (Pig Latin)
• Process data one step at a time
• Simple to write MapReduce program
• Easy understand
• Easy debug         A = load ‘a.txt’ as (id, name, age, ...)
                     B = load ‘b.txt’ as (id, address, ...)
                     C = JOIN A BY id, B BY id;STORE C into ‘c.txt’
Pig


      Script


      Pig


      MapReduce


       © 2011 Cloudera, Inc. All Rights Reserved.
Hive vs. Pig

                  Hive                   Pig
Language          HiveQL (SQL-like)      Pig Latin, a scripting language
Schema            Table definitions      A schema is optionally defined
                  that are stored in a   at runtime
                  metastore
Programmait Access JDBC, ODBC            PigServer
WordCount Example

• Input
  Hello World Bye World
  Hello Hadoop Goodbye Hadoop
• For the given sample input the map emits
  < Hello, 1>
  < World, 1>
  < Bye, 1>
  < World, 1>
  < Hello, 1>
  < Hadoop, 1>
  < Goodbye, 1>
  < Hadoop, 1>

• the reduce just sums up the values
   < Bye, 1>
  < Goodbye, 1>
  < Hadoop, 2>
  < Hello, 2>
  < World, 2>
WordCount Example In MapReduce
public class WordCount {
 public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
         word.set(tokenizer.nextToken());
         context.write(word, one);
      }
    }
}

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
         sum += val.get();
      }
      context.write(key, new IntWritable(sum));
  }
}

public static void main(String[] args) throws Exception {
 Configuration conf = new Configuration();

    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
}
WordCount Example By Pig


A = LOAD 'wordcount/input' USING PigStorage as (token:chararray);

B = GROUP A BY token;

C = FOREACH B GENERATE group, COUNT(A) as count;

DUMP C;
WordCount Example By Hive

CREATE TABLE wordcount (token STRING);

LOAD DATA LOCAL INPATH ’wordcount/input'
OVERWRITE INTO TABLE wordcount;


SELECT count(*) FROM wordcount GROUP BY token;
The Story So Far


    SQL         Hive                               Pig        Script

    Java        MapReduce
    Java        HDFS
                Sqoop Flume
    SQL         RDBMS FS                                      Posix


4
1                © 2011 Cloudera, Inc. All Rights Reserved.
Hbase – Column NoSQL DB

                                               Hue                                   Mahout
                                           (Web Console)                         (Data Mining)


                                                                 Oozie
                                                       (Job Workflow & Scheduling)
   (Coordination)
                    Zookeeper




                                          Sqoop/Flume
                                                                         Pig/Hive (Analytical Language)
                                         (Data integration)



                                MapReduce Runtime
                                (Dist. Programming Framework)                         Hbase
                                                                               (Column NoSQL DB)


                                             Hadoop Distributed File System (HDFS)
Structured-data vs Raw-data
I – Inspired by

• Coordinated by Zookeeper
• Low Latency
• Random Reads And Writes
• Distributed Key/Value Store
• Simple API
  –   PUT
  –   GET
  –   DELETE
  –   SCAN
Hbase – Data Model

• Cells are “versioned”
• Table rows are sorted by row key
• Region – a row range [start-key:end-key]
HBase Examples

hbase>   create 'mytable', 'mycf‘
hbase>   list
hbase>   put 'mytable', 'row1', 'mycf:col1', 'val1‘
hbase>   put 'mytable', 'row1', 'mycf:col2', 'val2‘
hbase>   put 'mytable', 'row2', 'mycf:col1', 'val3‘
hbase>   scan 'mytable‘
hbase>   disable 'mytable‘
hbase>   drop 'mytable'




                     © 2011 Cloudera, Inc. All Rights Reserved.
Oozie – Job Workflow & Scheduling

                                                Hue                                   Mahout
                                            (Web Console)                         (Data Mining)


                                                                  Oozie
                                                        (Job Workflow & Scheduling)
    (Coordination)
                     Zookeeper




                                           Sqoop/Flume
                                                                          Pig/Hive (Analytical Language)
                                          (Data integration)



                                 MapReduce Runtime
                                 (Dist. Programming Framework)                         Hbase
                                                                                (Column NoSQL DB)


                                              Hadoop Distributed File System (HDFS)
What is                       ?

• A Java Web Application
• Oozie is a workflow scheduler for Hadoop
• Crond for Hadoop
• Triggered            Job 1 Job 2
  – Time
  – Data
                            Job 3

                       Job 4 Job 5
Oozie Features

• Component Independent
  –   MapReduce
  –   Hive
  –   Pig
  –   SqoopStreaming




                       © 2011 Cloudera, Inc. All Rights Reserved.
Mahout – Data Mining

                                                Hue                                   Mahout
                                            (Web Console)                         (Data Mining)


                                                                  Oozie
                                                        (Job Workflow & Scheduling)
    (Coordination)
                     Zookeeper




                                           Sqoop/Flume
                                                                          Pig/Hive (Analytical Language)
                                          (Data integration)



                                 MapReduce Runtime
                                 (Dist. Programming Framework)                         Hbase
                                                                                (Column NoSQL DB)


                                              Hadoop Distributed File System (HDFS)
What is

• Machine-learning tool
• Distributed and scalable machine learning algorithms on
  the Hadoop platform
• Building intelligent applications easier and faster
Mahout Use Cases

• Yahoo: Spam Detection
• Foursquare: Recommendations
• SpeedDate.com: Recommendations
• Adobe: User Targetting
• Amazon: Personalization Platform




                  © 2011 Cloudera, Inc. All Rights Reserved.
Hue – developed by

• Hadoop User Experience
• Apache Open source project
• HUE is a web UI for Hadoop
• Platform for building custom applications with a nice UI
  library
Hue

• HUE comes with a suite of applications
  – File Browser: Browse HDFS; change permissions and
    ownership; upload, download, view and edit files.
  – Job Browser: View jobs, tasks, counters, logs, etc.
  – Beeswax: Wizards to help create Hive tables, load data, run and
    manage Hive queries, and download results in Excel format.
Hue: File Browser UI
Hue: Beewax UI
Use case Example

• Predict what the user likes based on
  – His/Her historical behavior
  – Aggregate behavior of people similar to him
Conclusion

Today, we introduced:
• Why Hadoop is needed
• The basic concepts of HDFS and MapReduce
• What sort of problems can be solved with Hadoop
• What other projects are included in the Hadoop
  ecosystem
Recap – Hadoop Ecosystem

                                               Hue                                   Mahout
                                           (Web Console)                         (Data Mining)


                                                                 Oozie
                                                       (Job Workflow & Scheduling)
   (Coordination)
                    Zookeeper




                                          Sqoop/Flume
                                                                         Pig/Hive (Analytical Language)
                                         (Data integration)



                                MapReduce Runtime
                                (Dist. Programming Framework)                         Hbase
                                                                               (Column NoSQL DB)


                                             Hadoop Distributed File System (HDFS)
Trend Micro Smart Protection
Network (SPN) Case Study
Collaboration in the underground
Network Threats Shows Explosive Growth
Threats on the network like variants of the virus, spams, unknown download
sources escapes the detection of the traditional security system and continues to
show explosive growth.

                        New Unique Malware Discovered
                                                                      1M
                                                                    unique
                                                                   Malwares
                                                                    every
                                                                    month
Cloud computing era
New Design Concept for Threat Intelligence
                 CDN / xSP                             Human
                                                       Intelligence




 Honeypot


                                                                      Web Crawler




       Trend Micro
       Mail Protection
                                                 Trend Micro
                             Trend Micro         Endpoint Protection
                             Web Protection

                   150M+ Worldwide Endpoints/Sensors
Challenges We Are Faced
             The Concept is Great but ….
  6TB of data and 15B lines of logs received daily by




           It becomes the Big Data Challenge!
Issues to Address




                                          Threat
 Raw Data           Information   Intelligence/Solution


 Volume: Infinite
 Time: No Delay
 Target: Keep Changing Threats
SPN
                     Feedback
                                              SPAM
                                                                  CDN Log              SPN High Level Architecture
                             HTTP POST

                                   L4


                        Log               Log
                      Receiver          Receiver

                                   L4


                       Log Post          Log Post           Log Post
                      Processing        Processing         Processing
SPN infrastructure




                                                   Adhoc-Query (Pig)


                         MapReduce                      HBase
                                                                             Circus
                          Hadoop Distributed File System                    (Ambari)
                                     (HDFS)

                     Feedback Information

                                                                                         Message Bus
Application




                       Email Reputation Service                                        Web Reputation     File Reputation
                                                                                          Service              Service
Trend Micro Big Data process capacity

Daily amount of SPN data to be processed
• 8.5 billions Web Reputation queries
• 3 billions Email Reputation queries
• 7 billions File Reputation queries
• Process 6 TB worldwide raw logs
• 150 millions End-point connections
Trend Micro: Web Reputation Services
 Technology                         Process                           Operation


    Trend Micro               User Traffic | Honeypot
Products / Technology
                                                              8 billions/day
                                      Akamai                             40% filtered
CDN Cache
                                                             4.8 billions/day
                               Rating Server for Known




                                                                                           15 Minutes
High Throughput Web Service            Threats                           82% filtered

                                Unknown & Prefilter
Hadoop Cluster
                                                             860 millions/day
                                  Page Download
Web Crawling
                                                                         99.98% filtered
                                       Threat
                                      Analysis
Machine Learning
Data Mining
                                                         25,000 malicious URL /day


Block malicious URL within 15 minutes once it goes online!
Big Data Cases
Google vs. Hadoop Ecosystem


Chubby vs. Zookeeper

MapReduce vs. MapReduce

BigTable vs. HBase

GFS vs. HDFS

 Ref: Google Cluster
Pioneer of Big Data Infrastructure – Google
Hbase use Case@Facebook - Messages
 HBase Use Cases @ Facebook

            Facebook Insights   Operational Data Store
            Self-service        More Analytics/Hashout apps
 Messages   Hashout             Site Integrity




    2010        2011                 2012            2013

                                                  Social Graph Search Indexing
                                                  Realtime Hive Updates
                                                  Cross-system Tracing
                                                  … and more
Flagship App:Facebook Messages
Monthly data volume prior to launch
• Monthly data volume prior to launch



                 15B x 1,024 bytes = 14TB



                 120B x 100 bytes = 11TB
Facebook Messages Now
book Messages NOW
    •
StatsQuick Stats
                                Messages   Chats
       – 11B+ messages/day
         • 90B+ data accesses
+ messages/day
         • Peak:1.5M ops/sec
0B+ data~55% Read, 45% Write
         •
            accesses
eak: 1.5M ops/sec
55%Rd, 45% Wr data
      – 20PB+ of total
                                 Emails    SMS
         • Grows 400TB/month

B+ of total data
rows 400TB/month
Facebook Messages:Requirements

• Very High Write Volume
  – Previously, chat was not persisted to disk
• Ever-growing data sets(Old data rarely gets
  accessed)
• Elasticity & automatic failover
• Strong consistency within a single data center
• Large scans/map-reduce support for migrations &
  schema conversions
• Bulk import data
Physical Multi-tenancy

• Real-time Ads Insights
  – Real-time analytics for social plugins on top of Hbase
  – Publishers get real-time distribution/engagement metrics:
     • # of impressions, likes
     • analytics by domain/URL/demographics and time periods
  – Uses HBase capabilities:
     • Efficient counters (single-RPC increments)
     • TTL for purging old data
  – Needs massive write throughput & low latencies
     • Billions of URLs
     • Millions of counter increments/second

• Operational Data Store
Facebook Open Source Stack



• Memcached --> App Server Cache
• ZooKeeper --> Small Data Coordination Service
• HBase --> Database Storage Engine
• HDFS --> Distributed FileSystem
• Hadoop --> Asynchronous Map-Reduce Jobs
Cloud computing era
Questions?
Thank you!

Más contenido relacionado

La actualidad más candente

Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopBrock Noland
 
Hadoop World 2011: Mike Olson Keynote Presentation
Hadoop World 2011: Mike Olson Keynote PresentationHadoop World 2011: Mike Olson Keynote Presentation
Hadoop World 2011: Mike Olson Keynote PresentationCloudera, Inc.
 
A Survey of Petabyte Scale Databases and Storage Systems Deployed at Facebook
A Survey of Petabyte Scale Databases and Storage Systems Deployed at FacebookA Survey of Petabyte Scale Databases and Storage Systems Deployed at Facebook
A Survey of Petabyte Scale Databases and Storage Systems Deployed at FacebookBigDataCloud
 
Moving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMoving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMongoDB
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop EcosystemJ Singh
 
Oct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and Deployment
Oct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and DeploymentOct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and Deployment
Oct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and DeploymentYahoo Developer Network
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwielerlucenerevolution
 
Geo-based content processing using hbase
Geo-based content processing using hbaseGeo-based content processing using hbase
Geo-based content processing using hbaseRavi Veeramachaneni
 
Compaction and Splitting in Apache Accumulo
Compaction and Splitting in Apache AccumuloCompaction and Splitting in Apache Accumulo
Compaction and Splitting in Apache AccumuloHortonworks
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuningVitthal Gogate
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryCloudera, Inc.
 
Data Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big DataData Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big DataCloudera, Inc.
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephantsOvidiu Dimulescu
 
Storage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook MessagesStorage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook Messagesyarapavan
 
2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalog2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalogAdam Muise
 
Hadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big DataHadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big DataWANdisco Plc
 
Petabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructurePetabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructureelliando dias
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldRichard McDougall
 

La actualidad más candente (20)

Hadoop on VMware
Hadoop on VMwareHadoop on VMware
Hadoop on VMware
 
Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache Hadoop
 
Hadoop World 2011: Mike Olson Keynote Presentation
Hadoop World 2011: Mike Olson Keynote PresentationHadoop World 2011: Mike Olson Keynote Presentation
Hadoop World 2011: Mike Olson Keynote Presentation
 
A Survey of Petabyte Scale Databases and Storage Systems Deployed at Facebook
A Survey of Petabyte Scale Databases and Storage Systems Deployed at FacebookA Survey of Petabyte Scale Databases and Storage Systems Deployed at Facebook
A Survey of Petabyte Scale Databases and Storage Systems Deployed at Facebook
 
Moving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMoving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDB
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 
Oct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and Deployment
Oct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and DeploymentOct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and Deployment
Oct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and Deployment
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwieler
 
Geo-based content processing using hbase
Geo-based content processing using hbaseGeo-based content processing using hbase
Geo-based content processing using hbase
 
Compaction and Splitting in Apache Accumulo
Compaction and Splitting in Apache AccumuloCompaction and Splitting in Apache Accumulo
Compaction and Splitting in Apache Accumulo
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuning
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster Recovery
 
Data Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big DataData Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big Data
 
Drill njhug -19 feb2013
Drill njhug -19 feb2013Drill njhug -19 feb2013
Drill njhug -19 feb2013
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephants
 
Storage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook MessagesStorage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook Messages
 
2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalog2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalog
 
Hadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big DataHadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big Data
 
Petabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructurePetabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructure
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworld
 

Destacado

Zh tw introduction_to_hadoop and hdfs
Zh tw introduction_to_hadoop and hdfsZh tw introduction_to_hadoop and hdfs
Zh tw introduction_to_hadoop and hdfsTrendProgContest13
 
Zh tw introduction_to_map_reduce
Zh tw introduction_to_map_reduceZh tw introduction_to_map_reduce
Zh tw introduction_to_map_reduceTrendProgContest13
 
Zh tw introduction_to_cloud_computing
Zh tw introduction_to_cloud_computingZh tw introduction_to_cloud_computing
Zh tw introduction_to_cloud_computingTrendProgContest13
 
Social Business =Cloud + Big Data + Social Media + Mobile Computing
Social Business =Cloud + Big Data + Social Media + Mobile ComputingSocial Business =Cloud + Big Data + Social Media + Mobile Computing
Social Business =Cloud + Big Data + Social Media + Mobile ComputingWilliam Tanenbaum
 
Healthcare Statitsics - a Market Research Report by RapidValue Solutions
Healthcare Statitsics - a Market Research Report by RapidValue SolutionsHealthcare Statitsics - a Market Research Report by RapidValue Solutions
Healthcare Statitsics - a Market Research Report by RapidValue SolutionsRapidValue
 
Fabricio Silva: Cloud Computing Technologies for Genomic Big Data Analysis
Fabricio  Silva: Cloud Computing Technologies for Genomic Big Data AnalysisFabricio  Silva: Cloud Computing Technologies for Genomic Big Data Analysis
Fabricio Silva: Cloud Computing Technologies for Genomic Big Data AnalysisFlávio Codeço Coelho
 
Mobile cloud Computing
Mobile cloud ComputingMobile cloud Computing
Mobile cloud ComputingPooja Sharma
 
Mobile Cloud Computing: Big Picture
Mobile Cloud Computing: Big PictureMobile Cloud Computing: Big Picture
Mobile Cloud Computing: Big PictureReza Rahimi
 
Future challenges in computer science
Future challenges in computer scienceFuture challenges in computer science
Future challenges in computer scienceSeminar Links
 
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summitAnalysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summitSlim Baltagi
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHanborq Inc.
 
Startups and Bollywood
Startups and BollywoodStartups and Bollywood
Startups and BollywoodP J
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Designsudhakara st
 
Characteristics of gifted students
Characteristics of gifted studentsCharacteristics of gifted students
Characteristics of gifted studentsTrinette Atri
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 

Destacado (19)

Zh tw introduction_to_hadoop and hdfs
Zh tw introduction_to_hadoop and hdfsZh tw introduction_to_hadoop and hdfs
Zh tw introduction_to_hadoop and hdfs
 
Zh tw introduction_to_map_reduce
Zh tw introduction_to_map_reduceZh tw introduction_to_map_reduce
Zh tw introduction_to_map_reduce
 
Zh tw introduction_to_cloud_computing
Zh tw introduction_to_cloud_computingZh tw introduction_to_cloud_computing
Zh tw introduction_to_cloud_computing
 
Zh tw introduction_to_h_base
Zh tw introduction_to_h_baseZh tw introduction_to_h_base
Zh tw introduction_to_h_base
 
Social Business =Cloud + Big Data + Social Media + Mobile Computing
Social Business =Cloud + Big Data + Social Media + Mobile ComputingSocial Business =Cloud + Big Data + Social Media + Mobile Computing
Social Business =Cloud + Big Data + Social Media + Mobile Computing
 
Healthcare Statitsics - a Market Research Report by RapidValue Solutions
Healthcare Statitsics - a Market Research Report by RapidValue SolutionsHealthcare Statitsics - a Market Research Report by RapidValue Solutions
Healthcare Statitsics - a Market Research Report by RapidValue Solutions
 
Fabricio Silva: Cloud Computing Technologies for Genomic Big Data Analysis
Fabricio  Silva: Cloud Computing Technologies for Genomic Big Data AnalysisFabricio  Silva: Cloud Computing Technologies for Genomic Big Data Analysis
Fabricio Silva: Cloud Computing Technologies for Genomic Big Data Analysis
 
Mobile cloud Computing
Mobile cloud ComputingMobile cloud Computing
Mobile cloud Computing
 
Mobile Cloud Computing: Big Picture
Mobile Cloud Computing: Big PictureMobile Cloud Computing: Big Picture
Mobile Cloud Computing: Big Picture
 
Future challenges in computer science
Future challenges in computer scienceFuture challenges in computer science
Future challenges in computer science
 
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summitAnalysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
 
Bollywood quiz answers
Bollywood quiz answersBollywood quiz answers
Bollywood quiz answers
 
Gifted Presentation
Gifted PresentationGifted Presentation
Gifted Presentation
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed Introduction
 
Startups and Bollywood
Startups and BollywoodStartups and Bollywood
Startups and Bollywood
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
 
Hadoop Report
Hadoop ReportHadoop Report
Hadoop Report
 
Characteristics of gifted students
Characteristics of gifted studentsCharacteristics of gifted students
Characteristics of gifted students
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 

Similar a Cloud computing era

Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandRichard McDougall
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsRichard McDougall
 
Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Hortonworks
 
Hadoop - Now, Next and Beyond
Hadoop - Now, Next and BeyondHadoop - Now, Next and Beyond
Hadoop - Now, Next and BeyondTeradata Aster
 
Apache hadoop bigdata-in-banking
Apache hadoop bigdata-in-bankingApache hadoop bigdata-in-banking
Apache hadoop bigdata-in-bankingm_hepburn
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop StoryMichael Rys
 
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Cloudera, Inc.
 
Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...yaevents
 
HDP-1 introduction for HUG France
HDP-1 introduction for HUG FranceHDP-1 introduction for HUG France
HDP-1 introduction for HUG FranceSteve Loughran
 
NYC-Meetup- Introduction to Hadoop Echosystem
NYC-Meetup- Introduction to Hadoop EchosystemNYC-Meetup- Introduction to Hadoop Echosystem
NYC-Meetup- Introduction to Hadoop EchosystemAL500745425
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft PlatformJesus Rodriguez
 
Commonanduniqueusecases 110831113310-phpapp01
Commonanduniqueusecases 110831113310-phpapp01Commonanduniqueusecases 110831113310-phpapp01
Commonanduniqueusecases 110831113310-phpapp01eimhee
 
Hadoop for shanghai dev meetup
Hadoop for shanghai dev meetupHadoop for shanghai dev meetup
Hadoop for shanghai dev meetupRoby Chen
 
Hadoop and Big Data: Revealed
Hadoop and Big Data: RevealedHadoop and Big Data: Revealed
Hadoop and Big Data: RevealedSachin Holla
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)outstanding59
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)outstanding59
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Ranjith Sekar
 

Similar a Cloud computing era (20)

Hadoop Family and Ecosystem
Hadoop Family and EcosystemHadoop Family and Ecosystem
Hadoop Family and Ecosystem
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure Considerations
 
Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011
 
Hadoop - Now, Next and Beyond
Hadoop - Now, Next and BeyondHadoop - Now, Next and Beyond
Hadoop - Now, Next and Beyond
 
Apache hadoop bigdata-in-banking
Apache hadoop bigdata-in-bankingApache hadoop bigdata-in-banking
Apache hadoop bigdata-in-banking
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop Story
 
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
 
Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...
 
HDP-1 introduction for HUG France
HDP-1 introduction for HUG FranceHDP-1 introduction for HUG France
HDP-1 introduction for HUG France
 
NYC-Meetup- Introduction to Hadoop Echosystem
NYC-Meetup- Introduction to Hadoop EchosystemNYC-Meetup- Introduction to Hadoop Echosystem
NYC-Meetup- Introduction to Hadoop Echosystem
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft Platform
 
Big data
Big dataBig data
Big data
 
Commonanduniqueusecases 110831113310-phpapp01
Commonanduniqueusecases 110831113310-phpapp01Commonanduniqueusecases 110831113310-phpapp01
Commonanduniqueusecases 110831113310-phpapp01
 
Hadoop for shanghai dev meetup
Hadoop for shanghai dev meetupHadoop for shanghai dev meetup
Hadoop for shanghai dev meetup
 
Hadoop and Big Data: Revealed
Hadoop and Big Data: RevealedHadoop and Big Data: Revealed
Hadoop and Big Data: Revealed
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 

Cloud computing era

  • 2. Three Major Trends to Chang the World Cloud Computing Big Data Mobile
  • 3. What is Cloud Computing? National Institute of Standards and Technology (NIST) definition : Essential Characteristics Service Models Deployment Models Provide scalable and elastic IT relative functions to users via as-a- service business models and internet technologies
  • 4. It’s About the Ecosystem Structured, Semi-structured Cloud Computing SaaS Enterprise Data Warehouse PaaS IaaS Generate Big Data Lead Business Insights create Competition, Innovation, Productivity
  • 5. What is Big Data?
  • 6. What is the problem • Getting the data to the processors becomes the bottleneck • Quick calculation – Typical disk data transfer rate: • 75MB/sec – Time taken to transfer 100GB of data to the processor: • approx. 22 minutes!
  • 7. The Era of Big Data – Are You Ready Data for business commercial analysis • 2011: multi-terabyte (TB) • 2020: 35.2 ZB (1 ZB = 1 billion TB)
  • 8. Who Needs It? Enterprise Database Hadoop When to use? When to use? • Ad-hoc Reporting (<1sec) • Affordable Storage/Compute • Multi-step Transactions • Unstructured or Semi-structured • Lots of Inserts/Updates/Deletes • Resilient Auto Scalability
  • 10. – inspired by • Apache Hadoop project – inspired by Google's MapReduce and Google File System papers. • Open sourced, flexible and available architecture for large scale computation and data processing on a network of commodity hardware • Open Source Software + Hardware Commodity – IT Costs Reduction
  • 11. Hadoop Core MapReduce HDFS © 2011 Cloudera, Inc. All Rights Reserved.
  • 12. HDFS • Hadoop Distributed File System • Redundancy • Fault Tolerant • Scalable • Self Healing • Write Once, Read Many Times • Java API • Command Line Tool © 2011 Cloudera, Inc. All Rights Reserved.
  • 13. MapReduce • Two Phases of Functional Programming • Redundancy • Fault Tolerant • Scalable • Self Healing • Java API 13 © 2011 Cloudera, Inc. All Rights Reserved.
  • 14. Hadoop Core Java Java MapReduce HDFS Java Java 14 © 2011 Cloudera, Inc. All Rights Reserved.
  • 15. Word Count Example Key: offset Value: line Key: word Key: word Value: count Value: sum of count 0:The cat sat on the mat 22:The aardvark sat on the sofa
  • 17. The Ecosystem is the System • Hadoop has become the kernel of the distributed operating system for Big Data • No one uses the kernel alone • A collection of projects at Apache
  • 18. Relation Map Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 19. Zookeeper – Coordination Framework Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 20. What is ZooKeeper • A centralized service for maintaining – Configuration information – Providing distributed synchronization • A set of tools to build distributed applications that can safely handle partial failures • ZooKeeper was designed to store coordination data – Status information – Configuration – Location information
  • 21. Flume / Sqoop – Data Integration Framework Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 22. What’s the problem for data collection • Data collection is currently a priori and ad hoc • A priori – decide what you want to collect ahead of time • Ad hoc – each kind of data source goes through its own collection path
  • 23. (and how can it help?) • A distributed data collection service • It efficiently collecting, aggregating, and moving large amounts of data • Fault tolerant, many failover and recovery mechanism • One-stop solution for data collection of all formats
  • 24. Flume Architecture Log Log ... Flume Node Flume Node HDFS © 2011 Cloudera, Inc. All Rights Reserved.
  • 25. Flume Sources and Sinks • Local Files • HDFS • Stdin, Stdout • Twitter • IRC • IMAP © 2011 Cloudera, Inc. All Rights Reserved.
  • 26. Sqoop • Easy, parallel database import/export • What you want do? – Insert data from RDBMS to HDFS – Export data from HDFS back into RDBMS
  • 27. Sqoop HDFS Sqoop RDBMS 28 © 2011 Cloudera, Inc. All Rights Reserved.
  • 28. Sqoop Examples $ sqoop import --connect jdbc:mysql://localhost/world -- username root --table City ... $ hadoop fs -cat City/part-m-00000 1,Kabul,AFG,Kabol,17800002,Qandahar,AFG,Qandahar,2375003,He rat,AFG,Herat,1868004,Mazar-e- Sharif,AFG,Balkh,1278005,Amsterdam,NLD,Noord-Holland,731200 ... 29 © 2011 Cloudera, Inc. All Rights Reserved.
  • 29. Pig / Hive – Analytical Language Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 30. Why Hive and Pig? • Although MapReduce is very powerful, it can also be complex to master • Many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code • Many organizations have programmers who are skilled at writing code in scripting languages • Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce – Hive was initially developed at Facebook, Pig at Yahoo!
  • 31. Hive – Developed by • What is Hive? – An SQL-like interface to Hadoop • Data Warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop – MapRuduce for execution – HDFS for storage • Hive Query Language – Basic-SQL : Select, From, Join, Group-By – Equi-Join, Muti-Table Insert, Multi-Group-By – Batch query SELECT * FROM purchases WHERE price > 100 GROUP BY storeid
  • 32. Hive SQL Hive MapReduce 33 © 2011 Cloudera, Inc. All Rights Reserved.
  • 33. Pig – Initiated by • A high-level scripting language (Pig Latin) • Process data one step at a time • Simple to write MapReduce program • Easy understand • Easy debug A = load ‘a.txt’ as (id, name, age, ...) B = load ‘b.txt’ as (id, address, ...) C = JOIN A BY id, B BY id;STORE C into ‘c.txt’
  • 34. Pig Script Pig MapReduce © 2011 Cloudera, Inc. All Rights Reserved.
  • 35. Hive vs. Pig Hive Pig Language HiveQL (SQL-like) Pig Latin, a scripting language Schema Table definitions A schema is optionally defined that are stored in a at runtime metastore Programmait Access JDBC, ODBC PigServer
  • 36. WordCount Example • Input Hello World Bye World Hello Hadoop Goodbye Hadoop • For the given sample input the map emits < Hello, 1> < World, 1> < Bye, 1> < World, 1> < Hello, 1> < Hadoop, 1> < Goodbye, 1> < Hadoop, 1> • the reduce just sums up the values < Bye, 1> < Goodbye, 1> < Hadoop, 2> < Hello, 2> < World, 2>
  • 37. WordCount Example In MapReduce public class WordCount { public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, one); } } } public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } context.write(key, new IntWritable(sum)); } } public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = new Job(conf, "wordcount"); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.waitForCompletion(true); }
  • 38. WordCount Example By Pig A = LOAD 'wordcount/input' USING PigStorage as (token:chararray); B = GROUP A BY token; C = FOREACH B GENERATE group, COUNT(A) as count; DUMP C;
  • 39. WordCount Example By Hive CREATE TABLE wordcount (token STRING); LOAD DATA LOCAL INPATH ’wordcount/input' OVERWRITE INTO TABLE wordcount; SELECT count(*) FROM wordcount GROUP BY token;
  • 40. The Story So Far SQL Hive Pig Script Java MapReduce Java HDFS Sqoop Flume SQL RDBMS FS Posix 4 1 © 2011 Cloudera, Inc. All Rights Reserved.
  • 41. Hbase – Column NoSQL DB Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 43. I – Inspired by • Coordinated by Zookeeper • Low Latency • Random Reads And Writes • Distributed Key/Value Store • Simple API – PUT – GET – DELETE – SCAN
  • 44. Hbase – Data Model • Cells are “versioned” • Table rows are sorted by row key • Region – a row range [start-key:end-key]
  • 45. HBase Examples hbase> create 'mytable', 'mycf‘ hbase> list hbase> put 'mytable', 'row1', 'mycf:col1', 'val1‘ hbase> put 'mytable', 'row1', 'mycf:col2', 'val2‘ hbase> put 'mytable', 'row2', 'mycf:col1', 'val3‘ hbase> scan 'mytable‘ hbase> disable 'mytable‘ hbase> drop 'mytable' © 2011 Cloudera, Inc. All Rights Reserved.
  • 46. Oozie – Job Workflow & Scheduling Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 47. What is ? • A Java Web Application • Oozie is a workflow scheduler for Hadoop • Crond for Hadoop • Triggered Job 1 Job 2 – Time – Data Job 3 Job 4 Job 5
  • 48. Oozie Features • Component Independent – MapReduce – Hive – Pig – SqoopStreaming © 2011 Cloudera, Inc. All Rights Reserved.
  • 49. Mahout – Data Mining Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 50. What is • Machine-learning tool • Distributed and scalable machine learning algorithms on the Hadoop platform • Building intelligent applications easier and faster
  • 51. Mahout Use Cases • Yahoo: Spam Detection • Foursquare: Recommendations • SpeedDate.com: Recommendations • Adobe: User Targetting • Amazon: Personalization Platform © 2011 Cloudera, Inc. All Rights Reserved.
  • 52. Hue – developed by • Hadoop User Experience • Apache Open source project • HUE is a web UI for Hadoop • Platform for building custom applications with a nice UI library
  • 53. Hue • HUE comes with a suite of applications – File Browser: Browse HDFS; change permissions and ownership; upload, download, view and edit files. – Job Browser: View jobs, tasks, counters, logs, etc. – Beeswax: Wizards to help create Hive tables, load data, run and manage Hive queries, and download results in Excel format.
  • 56. Use case Example • Predict what the user likes based on – His/Her historical behavior – Aggregate behavior of people similar to him
  • 57. Conclusion Today, we introduced: • Why Hadoop is needed • The basic concepts of HDFS and MapReduce • What sort of problems can be solved with Hadoop • What other projects are included in the Hadoop ecosystem
  • 58. Recap – Hadoop Ecosystem Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 59. Trend Micro Smart Protection Network (SPN) Case Study
  • 60. Collaboration in the underground
  • 61. Network Threats Shows Explosive Growth Threats on the network like variants of the virus, spams, unknown download sources escapes the detection of the traditional security system and continues to show explosive growth. New Unique Malware Discovered 1M unique Malwares every month
  • 63. New Design Concept for Threat Intelligence CDN / xSP Human Intelligence Honeypot Web Crawler Trend Micro Mail Protection Trend Micro Trend Micro Endpoint Protection Web Protection 150M+ Worldwide Endpoints/Sensors
  • 64. Challenges We Are Faced The Concept is Great but …. 6TB of data and 15B lines of logs received daily by It becomes the Big Data Challenge!
  • 65. Issues to Address Threat Raw Data Information Intelligence/Solution  Volume: Infinite  Time: No Delay  Target: Keep Changing Threats
  • 66. SPN Feedback SPAM CDN Log SPN High Level Architecture HTTP POST L4 Log Log Receiver Receiver L4 Log Post Log Post Log Post Processing Processing Processing SPN infrastructure Adhoc-Query (Pig) MapReduce HBase Circus Hadoop Distributed File System (Ambari) (HDFS) Feedback Information Message Bus Application Email Reputation Service Web Reputation File Reputation Service Service
  • 67. Trend Micro Big Data process capacity Daily amount of SPN data to be processed • 8.5 billions Web Reputation queries • 3 billions Email Reputation queries • 7 billions File Reputation queries • Process 6 TB worldwide raw logs • 150 millions End-point connections
  • 68. Trend Micro: Web Reputation Services Technology Process Operation Trend Micro User Traffic | Honeypot Products / Technology 8 billions/day Akamai 40% filtered CDN Cache 4.8 billions/day Rating Server for Known 15 Minutes High Throughput Web Service Threats 82% filtered Unknown & Prefilter Hadoop Cluster 860 millions/day Page Download Web Crawling 99.98% filtered Threat Analysis Machine Learning Data Mining 25,000 malicious URL /day Block malicious URL within 15 minutes once it goes online!
  • 70. Google vs. Hadoop Ecosystem Chubby vs. Zookeeper MapReduce vs. MapReduce BigTable vs. HBase GFS vs. HDFS Ref: Google Cluster
  • 71. Pioneer of Big Data Infrastructure – Google
  • 72. Hbase use Case@Facebook - Messages HBase Use Cases @ Facebook Facebook Insights Operational Data Store Self-service More Analytics/Hashout apps Messages Hashout Site Integrity 2010 2011 2012 2013 Social Graph Search Indexing Realtime Hive Updates Cross-system Tracing … and more
  • 73. Flagship App:Facebook Messages Monthly data volume prior to launch • Monthly data volume prior to launch 15B x 1,024 bytes = 14TB 120B x 100 bytes = 11TB
  • 74. Facebook Messages Now book Messages NOW • StatsQuick Stats Messages Chats – 11B+ messages/day • 90B+ data accesses + messages/day • Peak:1.5M ops/sec 0B+ data~55% Read, 45% Write • accesses eak: 1.5M ops/sec 55%Rd, 45% Wr data – 20PB+ of total Emails SMS • Grows 400TB/month B+ of total data rows 400TB/month
  • 75. Facebook Messages:Requirements • Very High Write Volume – Previously, chat was not persisted to disk • Ever-growing data sets(Old data rarely gets accessed) • Elasticity & automatic failover • Strong consistency within a single data center • Large scans/map-reduce support for migrations & schema conversions • Bulk import data
  • 76. Physical Multi-tenancy • Real-time Ads Insights – Real-time analytics for social plugins on top of Hbase – Publishers get real-time distribution/engagement metrics: • # of impressions, likes • analytics by domain/URL/demographics and time periods – Uses HBase capabilities: • Efficient counters (single-RPC increments) • TTL for purging old data – Needs massive write throughput & low latencies • Billions of URLs • Millions of counter increments/second • Operational Data Store
  • 77. Facebook Open Source Stack • Memcached --> App Server Cache • ZooKeeper --> Small Data Coordination Service • HBase --> Database Storage Engine • HDFS --> Distributed FileSystem • Hadoop --> Asynchronous Map-Reduce Jobs