PlatformDay2009




SNS Analysis using Cloud Computing Services
DHT-based Key-Value Storage and MapReduce-based Analysis

DongWoo Lee
oiko.cloud@gmail.com




Oiko Laboratory
SocialFlow (OikoLab)
2 CloudKR
Agenda

 ‣ Introduction
    • Social Network Service
    • Motivation : Visualization, Social Network Analysis
    • SocialFlow
    • Scale Out Technologies : Cloud Computing
 ‣ SNS Analysis Architecture based on Cloud
    • Overall Process
    • Crawling
    • DHT Storage (CouchDB)
    • MapReduce
    • Pair-Wise Similarity
 ‣ Cloud Computing Service
    • Amazon Web Service
    • EC2 / S3 / Elastic MapReduce
    • Tips
 ‣ References
Introduction




  Social Network   Cloud Computing   Mobile Device
Social Network Service

       “Social Applications = Social Networks”
        “A social network is a collection of people bound together
        through a specific set of social relations.”

        “A collection of people is a social network if and only if it is
        possible for something to spread virally through that collection.”
Social Network Services : Twitter, Facebook
Social Applications
Social Networks




              http://www.vincos.it/world-map-of-social-networks/
Social Network Analysis
‣ Social Graph Analysis
‣ Visualization
‣ Person-to-Person Relationship
‣ Temporal Mind Mining (Content Clustering)
‣ Post-Mortem Log Processing
Social Network Analysis : Visualization
[Figures: social network visualization examples]
SocialFlow
‣ Thoughts, Feelings, Interests, Relationship and Information of SNS
‣ Real-time Massive Social Data Streams
‣ Difficult to follow the Social Streams
‣ Need a way to get a summary or clustered information based on Common Interests




SocialFlow
‣ Getting Common Flows of people through Content Similarities



‣ Reflecting Short-Term Interests of People
‣ Extracting Hot Issues
‣ Revealing Relationships among In/Out Resources
‣ Implementing Scale-Out Technologies
‣ Evolving toward Recommendation System
  based on Collective Intelligence
Scale Out Technologies : Cloud Computing
Why Cloud Computing?
‣ SPOF (Single Point of Failure)

‣ Cluster Administration (Who does this?)

‣ Initial Infrastructure Investment (Risk Management)

‣ Focus on Main Thing (Intelligence)

‣ Enable Highly Scalable Services




       New resource provision paradigms
       for Grid Infrastructures: Virtualization
       and Cloud / ISGC 2009
       http://tinyurl.com/nacgu7
Cloud Computing: e.g. Storage Failure
SNS Analysis Architecture based on Cloud
Experimental Project
‣ Python / Django / Boto
‣ ML / Data Mining
‣ DHT / CouchDB
‣ Cloud / AWS S3, EC2, Hadoop MapReduce
Workflow



 SNS    Crawler              MapReduce    Post-Processing   CDN   User




       In-house Cluster                  Cloud Service
        (Local DataCenter)
Technologies : Before




                        Crawler

             Crawler              Crawler




        Hash_ring      Consistent
                                       MapReduce       CouchJS
                         DHT



        CouchDB        Key-Value            Machine    Home
                        Storage             Learning   Made
Technologies : After




                        Crawler

             Crawler              Crawler

                                                 Storage   S3




        Hash_ring      Consistent                       EC2
                                       MapReduce
                         DHT                           Hadoop




        CouchDB        Key-Value            Machine    Home
                        Storage             Learning   Made
Crawling
‣ Fetching recent postings from SNS
‣ Storing fetched postings in CouchDB storage through the DHT layer (which selects a server)
‣ Pushing raw data into the cloud to process it with MapReduce



                         Crawler

                                                DB                     [ term, doc ]

                         Crawler
                                                DB
                                                                          Index
                                                           Indexer
                                                                           File
                                                DB
                         Crawler

                                                DB         Mapper


                         Crawler

                                      DHT   Replication
Consistent DHT (Distributed Hash Table)
‣ Uniform key distribution and load balancing with a good hash function
‣ Minimizing the effects of a storage crash or temporary downtime
‣ High availability with replication scheme


                          N-1                                                           0

                       Node N-1                                                      Node k-1


                                                                 k-1



                                                                                              ‣ Notice: A real node has non-
                                                                                                linear portions of the total key
                                                                                                space.
                                                      Replicas


                                            k+1
                       Node k+1                                                      Node k

                           2                                                            1

                                                                         Replicate(k, k-1, k+1)
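
The ring above can be sketched in Python roughly as follows. This is a minimal illustration and not the project's actual hash_ring code: the class name `HashRing`, the use of MD5, the `vnodes` parameter and the choice to place replicas on the next distinct nodes clockwise are all assumptions made for the example (the slide replicates to both neighbours, k-1 and k+1). Virtual nodes are one common way to give each real node several non-contiguous portions of the key space.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring mapping a key to a primary node plus replicas."""

    def __init__(self, nodes, vnodes=64, replicas=2):
        self.nodes = sorted(set(nodes))
        self.replicas = replicas
        # Each real node owns several points ("virtual nodes") on the ring,
        # so its share of the key space is spread over many small arcs.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in self.nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def lookup(self, key):
        """Return [primary, replica, ...]: distinct nodes clockwise from the key."""
        start = bisect.bisect(self._points, self._hash(key))
        wanted = min(self.replicas + 1, len(self.nodes))
        owners = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in owners:
                owners.append(node)
                if len(owners) == wanted:
                    break
        return owners


ring = HashRing(["A", "B", "C", "D"])   # hypothetical node names
owners = ring.lookup("posting:123")     # hypothetical key
```

With this layout, removing a node only moves the keys that node owned: every other key keeps its primary, and a key whose primary disappeared falls to what used to be its first replica.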


Consistent DHT (Distributed Hash Table)

                          Admin                                            Anonymous
                          Traffic                                            User Traffic


                                                View

                   Admin View                                           User View

                                            Generated Contents
          SNS Crawler      SNS Analysis                                 AWS S3

                                                                 html            image
                          DHT Front End



                          Memory Cache




           N-1                              0

       Node N-1                           Node k-1
                                DHT
        Node k+1                          Node k

            2                               1
Consistent DHT : Replication



                         * Replica = 2




       D             A                         B           C
   A       B     B       C               C         D   D       A




                                               B
           B     B


       Replica                               Replica
CouchDB (Key-Value Storage)
‣ Erlang-based Key-Value Storage
‣ Storage Engine (MVCC, B-tree)
‣ RESTful API
‣ Server-side JavaScript Engine (MapReduce)
‣ View Engine
‣ Futon Web UI
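
As an illustration of the RESTful API, the sketch below builds the HTTP request that stores one posting as a CouchDB document (`PUT /{database}/{document id}` with a JSON body). The server address, database name, document id and document fields are hypothetical; the request is only constructed here, not sent.

```python
import json
import urllib.parse
import urllib.request


def build_put_request(server, database, doc_id, document):
    """Build (but do not send) the HTTP PUT that stores one document."""
    url = "{}/{}/{}".format(
        server.rstrip("/"),
        urllib.parse.quote(database, safe=""),
        urllib.parse.quote(doc_id, safe=""),
    )
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_put_request(
    "http://localhost:5984",                    # hypothetical CouchDB node
    "postings",                                 # hypothetical database name
    "twitter:12345",                            # hypothetical document id
    {"user": "alice", "text": "hello cloud"},   # hypothetical posting
)
# urllib.request.urlopen(request) would send it to a running server.
```

In the architecture described here, the DHT layer would pick the `server` value for each document id before the request is issued.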
CouchDB: Server-side JavaScript
‣ Purpose

   ‣ Local Computations on Local Data Sets

‣ Features

   ‣ Mozilla’s SpiderMonkey

   ‣ MapReduce Framework with JavaScript

   ‣ Fork External Process (couchjs)

‣ Performance Enhancements Expected

   ‣ Google’s V8
     (Chrome’s JavaScript Engine / JIT)




                                             http://tinyurl.com/m76sx3
CouchDB: MapReduce


doc = (d1, d2, fq)




  dx: { di }
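
One possible reading of this slide: each stored document is a similarity record (d1, d2, fq), and a view turns those records into a per-document set of similar documents, dx: { di }. CouchDB views themselves are written in JavaScript; the Python below only emulates the map and reduce steps in memory, and the function names and sample ids are made up for the example.

```python
from collections import defaultdict


def map_pair(doc):
    """Map step: index a similarity record under both of its documents."""
    d1, d2, _fq = doc
    yield d1, d2
    yield d2, d1


def reduce_neighbors(values):
    """Reduce step: collapse all values for one key into a sorted list."""
    return sorted(set(values))


def build_view(docs):
    """Group mapped rows by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_pair(doc):
            grouped[key].append(value)
    return {key: reduce_neighbors(values) for key, values in grouped.items()}


view = build_view([("p1", "p2", 7), ("p1", "p3", 6)])  # hypothetical records
```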
Map & Reduce : Pair-Wise Similarity

                                                                                        [ term, { docs } ]
                                                                                               =>
  DB                       [ term, doc ]             [ term, { docs } ]                   [ doc1, doc2 ]


  DB
                              Index          Doc          Group              Doc           Candidate
              Indexer
                               File        Grouper         File           Combinator          File
  DB


  DB          Mapper                       Reducer                         Mapper
                                                                                             DocPair
                                                                                                              Reducer
                                                                                             Counter


                               Doc
                               File
                                                                                             Result
                                                                                              File


‣ Indexer and Grouper for Processing Korean.
                                                                                       [ freq, doc1, doc2 ]
‣ No NLP and No Structural Analysis.

‣ Produce a pairwise similarity between two postings.
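
The two MapReduce passes in the diagram can be sketched in memory as follows. This is a simplified illustration, not the talk's implementation: tokenization is plain whitespace splitting (the slides only say that no NLP is used), the similarity score is taken to be the number of shared terms, and the function names and the `min_score` parameter are introduced for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def index_terms(postings):
    """Mapper 1 (Indexer): emit one (term, doc_id) pair per distinct term."""
    for doc_id, text in postings.items():
        for term in set(text.split()):
            yield term, doc_id


def group_docs(term_doc_pairs):
    """Reducer 1 (Doc Grouper): collect [term, {docs}]."""
    groups = defaultdict(set)
    for term, doc_id in term_doc_pairs:
        groups[term].add(doc_id)
    return groups


def candidate_pairs(groups):
    """Mapper 2 (Doc Combinator): emit every [doc1, doc2] sharing a term."""
    for docs in groups.values():
        for doc1, doc2 in combinations(sorted(docs), 2):
            yield doc1, doc2


def count_pairs(pairs, min_score=1):
    """Reducer 2 (DocPair Counter): return [freq, doc1, doc2], highest first."""
    counts = Counter(pairs)
    return sorted(
        ((freq, d1, d2) for (d1, d2), freq in counts.items() if freq >= min_score),
        reverse=True,
    )


def pairwise_similarity(postings, min_score=1):
    groups = group_docs(index_terms(postings))
    return count_pairs(candidate_pairs(groups), min_score)


postings = {  # hypothetical sample postings
    "a": "cloud mapreduce hadoop",
    "b": "cloud hadoop s3",
    "c": "s3 pricing",
}
result = pairwise_similarity(postings)
```

Here postings "a" and "b" share two terms and "b" and "c" share one, so the result lists the pair (a, b) first.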
Map & Reduce : Optimization

‣ Concerns                                  ‣ Sample Data
   ‣ Consider Key Group Size Distribution      ‣ Two months of my friends’ postings
   ‣ Data Load Balancing                       ‣ Reachable graph: 4,060 people
   ‣ Barrier Point                             ‣ Total Postings: 206,115
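
A rough way to see why the key group size distribution matters: a term shared by n postings produces n(n-1)/2 candidate pairs, so a single very common term can dominate the work of one reducer and hold up the barrier between steps. The group sizes below are invented purely for illustration.

```python
def pair_count(group_size):
    """Number of document pairs produced by one term group: n * (n - 1) / 2."""
    return group_size * (group_size - 1) // 2


def pair_load(group_sizes):
    """Map each term to the number of candidate pairs it generates."""
    return {term: pair_count(size) for term, size in group_sizes.items()}


# Hypothetical group sizes: one very common term next to two rarer ones.
load = pair_load({"the": 10000, "hadoop": 120, "couchdb": 15})
```

In this made-up case the common term accounts for 49,995,000 pairs against 7,140 and 105 for the other two, which is the kind of skew that load balancing has to absorb.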
Pair-Wise Similarity and its TreeMap




                                       Posting: 110,008
                                       Users: 2,691

                                       Score >= 6
Pair-Wise Similarity and its Cluster
  ➡ One issue and different opinions among people
Pair-Wise Similarity and its Cluster
➡ Common Interest / Hot Issue
Pair-Wise Similarity and its Cluster
  ➡ One person and similar content patterns (specialty)
Pair-Wise Similarity and its Cluster
   ➡ Similar sentence structure (trendy, parody)
Deployment




             EC2



             S3/CloudFront



             Flickr


             www
Cloud Computing Service
Before the Cloud Age
‣   Smart Shell Guru’s Daily Work : Parallel Sort

$ wc -l data
$ split -l 1000k data
    (distribute the chunks to the workers via scp or NFS)
$ nohup ./work.sh data1 > data1.processed
$ nohup sort -r data1.processed > data1.sorted
    (collect the sorted chunks via scp or NFS)
$ sort -rm data*.sorted > data.sorted

Complexity
    ➡ Need to prepare/maintain physical machines and resources
    ➡ Need to monitor job progress (wait and see each job’s status)
    ➡ Need to cope with machine failure (slave nodes / storage / networks)
    ➡ Need to schedule multiple jobs
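
The same split / sort / merge pattern can be sketched in a single Python process. This is only an analogy to the shell workflow above: threads stand in for the separate worker machines, and the function names and the `chunk_size` and `workers` parameters are chosen for the example.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def split(lines, chunk_size):
    """Like `split -l`: cut the input into chunks of at most chunk_size lines."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def sort_chunk(chunk):
    """Like `sort -r` on one worker: sort a chunk in descending order."""
    return sorted(chunk, reverse=True)


def parallel_sort(lines, chunk_size, workers=4):
    chunks = split(lines, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_chunks = list(pool.map(sort_chunk, chunks))
    # Like `sort -rm`: merge already-sorted chunks without re-sorting them.
    return list(heapq.merge(*sorted_chunks, reverse=True))


ordered = parallel_sort(["b", "d", "a", "c", "e"], chunk_size=2)
```

The sketch hides exactly the parts listed under Complexity: provisioning machines, watching job status, handling failures and scheduling.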
Amazon Web Service : Overview
                                               EC2     EC2   EC2     EC2

                                                                       Messages

                                               SQS (Simple Queue Service)
                 Auto Scaling


                 CloudWatch
                  Monitoring


            Elastic Load Balancing            EC2 (Elastic Compute Cloud)             Mount         EBS (Elastic Block Store)          1 GB to 1TB


                                                                                                 Permissions         Header
  Clients                                                                     API                         Objects
  Clients         HTTP
  Clients                                                                                                 Buckets
                       AMI (Machine Image)
                                                                                                                          eSATA/USB
                                                                  SimpleDB          S3 (Simple Storage Service)
                                                                                                                              Offline
   Mgmt Console          EC2 CLI       SSH
                                                                                                                                  Import/Export
               Admin                                              key-value                 CloudFront

       Access Key ID                                                                  Edges
       Secret Access Key
       Key Pair                      Instant EC2 Hadoop Cluster       Elastic MapReduce                    HTTP


                                   Hadoop     Hadoop    Hadoop                                             Clients
Amazon Web Service
 ‣ AWS Management Console
AWS : AMI




                  AMI
            Amazon Machine Image
AWS : Paid AMI / The Cloud Market




                                     AMI
                               Amazon Machine Image


                                           Paid AMI
AWS : How to make an AMI (1)
         Loopback File
         # dd if=/dev/zero of=new_image.fs bs=1M count=1024

         Make ext3 file system
         # mke2fs -F -j new_image.fs
         # mkdir /mnt/ec2-fs
         # mount -o loop new_image.fs /mnt/ec2-fs
         # mkdir /mnt/ec2-fs/dev
         # /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x console
         # /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x null
         # /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x zero
         # mkdir /mnt/ec2-fs/etc

         Create /mnt/ec2-fs/etc/fstab (Add /dev/sda1 --> /, /dev/pts, /dev/shm, /proc, /sys)
         Create yum-xen.conf

         # mkdir /mnt/ec2-fs/proc
         # mount -t proc none /mnt/ec2-fs/proc
         # yum -c yum-xen.conf --installroot=/mnt/ec2-fs -y groupinstall Base

         Edit /mnt/ec2-fs/etc/sysconfig/network-scripts/ifcfg-eth0
         Edit /mnt/ec2-fs/etc/sysconfig/network
         Edit /mnt/ec2-fs/etc/fstab (Add /dev/sda2 --> /mnt, /dev/sda3 --> swap)

         # chroot /mnt/ec2-fs /bin/sh
         Edit services
AWS : How to make an AMI (2)
         Building an AMI
         # yum install ruby
         # rpm -i ec2-ami-tools-noarch.rpm (Download from public s3 bucket)
         # ec2-bundle-image -i new_image.fs -k my-private-key.key -u aws-user-id

         Local Machine Root File System
         # ec2-bundle-vol -k my-private-key.key -s 1000 -u aws-user-id

         Upload to S3
         # ec2-upload-bundle -b my-bucket -m image.manifest \
                             -a my-aws-access-key-id -s my-secret-key-id

         Register AMI
         # ec2-register my-bucket/image.manifest
         IMAGE ami-xxxx

         Testing
         # ec2-describe-images ami-xxxx

         Deregister AMI
         # ec2-deregister ami-xxxx

         Running AMI
         # ec2-run-instances ami-xxxx -n 1


         http://docs.amazonwebservices.com/AWSEC2/2006-06-26/DeveloperGuide/
AWS : EC2 Running Instance
 ‣ AWS Management Console
Amazon Web Service: Access Methods
‣ Access Key ID / Secret Access Key / Key Pairs
‣ AWS Management Console
‣ EC2 API (WSDL) / EC2 CLI (Command Line Interface)
‣ SSH
‣ Firefox Extensions
   • S3 Firefox Organizer
   • Elasticfox
‣ S3
   • DNS: s3  CNAME  s3.amazonaws.com.
     e.g. Bucket name: s3.xyz.com
     http://s3.xyz.com ---> S3 bucket s3.xyz.com

‣ s3cmd (python)
‣ s3cmd.rb / s3sync.rb (ruby)
‣ S3Hub (Mac)
Amazon Web Service: Elasticfox
‣ Firefox’s Extension: Elasticfox
                                 ‣ Key Pairs
                                    ‣ Private Key
                                    ‣ SSH
                                 ‣ Security Groups
                                    ‣ Open Network Ports
AWS: Elastic MapReduce
 ‣ EC2 + Hadoop
 ‣ Tools
   ‣ Management Console
   ‣ elastic-mapreduce CLI
 ‣ Preparation
    ‣ Code --> S3
    ‣ Data --> S3
 ‣ Log Folder
 ‣ Output Folder
 ‣ Job Flow
    ‣ Streaming
    ‣ Custom Jar
    ‣ Sample Applications
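
For the Streaming job flow type, the mapper and reducer are ordinary programs that read lines and write tab-separated key/value lines, and Hadoop sorts the mapper output by key before the reducer sees it. The sketch below shows that contract with plain Python generators. The input layout (`doc_id<TAB>text`) and the function names are assumptions for the example, and `run_locally` merely imitates the shuffle with `sorted`.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'term<TAB>doc_id' for each term of 'doc_id<TAB>text' input lines."""
    for line in lines:
        doc_id, _, text = line.rstrip("\n").partition("\t")
        for term in sorted(set(text.split())):
            yield f"{term}\t{doc_id}"


def reducer(sorted_lines):
    """Group consecutive lines by key and emit 'term<TAB>doc1,doc2,...'."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{term}\t{','.join(doc_id for _, doc_id in group)}"


def run_locally(lines):
    """Stand-in for Hadoop's shuffle: sort the mapper output, then reduce."""
    return list(reducer(sorted(mapper(lines))))


output = run_locally(["d1\tcloud hadoop", "d2\tcloud"])
```

In an actual streaming job each function would live in its own script, reading `sys.stdin` and printing its results, with both the scripts and the input data uploaded to S3 beforehand.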
AWS: Elastic MapReduce : Web UI
AWS: Elastic MapReduce : CLI for Workflow

      input/*                Step1



                      jobflow #id




 output1/part-000**          Step2




 output2/part-000**          Step3



 output3/part-000**
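
The chaining in this diagram, where each step reads the previous step's output folder, can be expressed as a small helper. The dictionary layout, bucket name and step names are hypothetical and do not correspond to the Elastic MapReduce API or CLI; the point is only how the input and output paths line up.

```python
def chain_steps(bucket, step_names, first_input="input"):
    """Build step descriptions where each step reads the previous output."""
    steps = []
    source = f"s3://{bucket}/{first_input}/"
    for number, name in enumerate(step_names, start=1):
        target = f"s3://{bucket}/output{number}/"
        steps.append({"name": name, "input": source, "output": target})
        source = target
    return steps


# Hypothetical bucket and step names.
steps = chain_steps("my-bucket", ["Step1", "Step2", "Step3"])
```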
AWS: Elastic MapReduce
 ‣ Failed tasks are rescheduled on other Hadoop slaves.
 ‣ When a task finishes, any other running instance of the same task is killed by the tracker.
AWS: SocialFlow Automation




    Home                   IDC                                    Amazon                Wild World
                           Local                                   Global


                                                        Results




    Admin
                            DHT                                     S3                     Users



                                           Read/Write




                                                                            Read Only
                          Renderer


        boto python   Launching EC2 pool
AWS: EC2, EMR Price Model

    Service      Type          Per Instance Hour   EMR Fee per Hour   1 Week (USD)   1 Week (KRW)

    EC2          On-Demand     $ 0.10 (S)                             $ 16.80        KRW  20,865
                               $ 0.40 (L)                             $ 67.20        KRW  83,462
                               $ 0.80 (E)                             $ 134.40       KRW 166,924

                 Reserved      $ 0.03 (S)                             $ 5.04         KRW   6,259
                 1yr $ 325     $ 0.12 (L)                             $ 20.16        KRW  25,038
                 3yr $ 500     $ 0.24 (E)                             $ 40.32        KRW  50,077

    Elastic      On-Demand     $ 0.10 (S)          $ 0.015            $ 19.32        KRW  23,995
    MapReduce                  $ 0.40 (L)          $ 0.06             $ 77.28        KRW  95,981
                               $ 0.80 (E)          $ 0.12             $ 154.56       KRW 191,963

    (S) = Small, (L) = Large, (E) = Extra Large        1 USD = 1242 KRW
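
The weekly figures follow from the hourly prices: one week is 24 × 7 = 168 instance hours, the Elastic MapReduce fee is added on top of the EC2 price, and the KRW column appears to be the USD amount times 1242 with the fraction dropped. A small sketch of that arithmetic (the function name is made up, and the one-time fee for reserved instances is not included):

```python
from decimal import Decimal

HOURS_PER_WEEK = 24 * 7
KRW_PER_USD = Decimal("1242")  # exchange rate quoted on the slide


def weekly_cost(ec2_hourly, emr_hourly="0"):
    """Return (USD, KRW) for one instance running for a full week."""
    usd = (Decimal(ec2_hourly) + Decimal(emr_hourly)) * HOURS_PER_WEEK
    krw = int(usd * KRW_PER_USD)  # truncated, matching the slide's figures
    return usd, krw


small_on_demand = weekly_cost("0.10")          # EC2 Small, on-demand
large_with_emr = weekly_cost("0.40", "0.06")   # EC2 Large plus the EMR fee
```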
AWS: Performance
   http://tinyurl.com/qj6ao7
   http://tinyurl.com/p9jsyz
   http://tinyurl.com/cqqxgl
10 Cent Tips
‣ AWS EC2

   ‣ Minimize set-up time with prepared shell scripts

   ‣ Use Boto for automating deployments

   ‣ Use S3 (Free of Charge between S3 and EC2 in the same region)

      ‣ $0.030 per GB through June 30, 2000 ($0.1 per GB normal price)



‣ AWS Elastic MapReduce

   ‣ Enable the SSH port (22) and Hadoop-related ports (9100, 9101)
   ‣ Access the master node: ssh -i keypair hadoop@public_dns_name

   ‣ Double Check (PATH, etc)

   ‣ Debug, Debug, Debug

   ‣ Use EC2 for Hadoop (e.g. Cloudera’s Hadoop AMI) (No extra cost for Hadoop!)
10 Cent Tips
‣ AWS S3

  ‣ Setting HTTP header for images and static resources.
     ‣ Cache-Control: max-age=31536000

  ‣ Block Search Bots

     ‣ robots.txt at the root of a Bucket
        ‣ User-agent: *
        ‣ Disallow: /
  ‣ Using BitTorrent for large files
     ‣ http://s3.xyz.com/xfile.zip?torrent
  ‣ Compress Rendered HTML with gzip
     ‣ Content-Encoding: gzip
      $ s3cmd put index.html s3://s3.xyz.com/www \
            --mime-type "text/html" \
            --add-header "Content-Encoding: gzip" \
            --acl-public
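
Since the deployment is automated from Python, the header tips above might be prepared in code along these lines. This is a sketch under assumptions: the function name and parameters are invented, the sample bytes are placeholders, and the actual upload call (with Boto or s3cmd) is left out. Note that an object served with `Content-Encoding: gzip` has to be stored already compressed.

```python
import gzip


def prepare_upload(data, content_type, compress=False, max_age=None):
    """Return (body, headers) for one object about to be uploaded."""
    headers = {"Content-Type": content_type}
    body = data
    if compress:
        # The stored bytes must really be gzip data for the header to be true.
        body = gzip.compress(data)
        headers["Content-Encoding"] = "gzip"
    if max_age is not None:
        headers["Cache-Control"] = f"max-age={max_age}"
    return body, headers


ONE_YEAR = 365 * 24 * 60 * 60  # 31536000 seconds, the value on the slide

html_body, html_headers = prepare_upload(
    b"<html><body>SocialFlow</body></html>", "text/html", compress=True
)
# Placeholder bytes stand in for a real image file.
png_body, png_headers = prepare_upload(b"\x89PNG", "image/png", max_age=ONE_YEAR)
```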
Amazon Web Service : Limitations
References
‣ 10 MapReduce Tips, Cloudera, http://tinyurl.com/pxuqup
‣ Christian Charras, Thierry Lecroq, Handbook of Exact String-Matching Algorithms
‣ Dan Pritchett (eBay), BASE: An ACID Alternative, ACM Queue, May/June 2008, pp. 48-55
‣ Edward Chang (Google Research), Mining Large Scale Social Networks, MMDS ’08
‣ Edward Walker, Benchmarking Amazon EC2 for high-performance scientific computing
‣ Matei Zaharia et al., Improving MapReduce Performance in Heterogeneous Environments, OSDI ’08


‣ Following Twitter
   ‣ http://twitter.com/AmazonEC2
   ‣ http://twitter.com/AmazonS3S3

Más contenido relacionado

Similar a Social Network Analysis using Cloud Computing Services

Cloud 2.0 - How Containers, Microservices and Open Source Software are Redefi...
Cloud 2.0 - How Containers, Microservices and Open Source Software are Redefi...Cloud 2.0 - How Containers, Microservices and Open Source Software are Redefi...
Cloud 2.0 - How Containers, Microservices and Open Source Software are Redefi...Mark Hinkle
 
Stateful Microservices with Apache Kafka and Spring Cloud Stream with Jan Svo...
Stateful Microservices with Apache Kafka and Spring Cloud Stream with Jan Svo...Stateful Microservices with Apache Kafka and Spring Cloud Stream with Jan Svo...
Stateful Microservices with Apache Kafka and Spring Cloud Stream with Jan Svo...HostedbyConfluent
 
Towards CloudML, a Model-Based Approach to Provision Resources in the Clouds
Towards CloudML, a Model-Based Approach  to Provision Resources in the CloudsTowards CloudML, a Model-Based Approach  to Provision Resources in the Clouds
Towards CloudML, a Model-Based Approach to Provision Resources in the CloudsSébastien Mosser
 
30 daysofcloud - 2
30 daysofcloud - 230 daysofcloud - 2
30 daysofcloud - 2HitanshDoshi
 
Inaugural address manjusha - Indicthreads cloud computing conference 2011
Inaugural address manjusha -  Indicthreads cloud computing conference 2011Inaugural address manjusha -  Indicthreads cloud computing conference 2011
Inaugural address manjusha - Indicthreads cloud computing conference 2011IndicThreads
 
Open Source Cloud Computing: Practical Solutions For Your Online Presence (PDF)
Open Source Cloud Computing: Practical Solutions For Your Online Presence (PDF)Open Source Cloud Computing: Practical Solutions For Your Online Presence (PDF)
Open Source Cloud Computing: Practical Solutions For Your Online Presence (PDF)Todd Deshane
 
O Outro Lado BSidesSP Ed. 5 - As Nove Principais Ameaças na Computação em Nuvem
O Outro Lado BSidesSP Ed. 5 - As Nove Principais Ameaças na Computação em NuvemO Outro Lado BSidesSP Ed. 5 - As Nove Principais Ameaças na Computação em Nuvem
O Outro Lado BSidesSP Ed. 5 - As Nove Principais Ameaças na Computação em NuvemAndre Serralheiro
 
The Cloud is dead ?! Blockchain in the new cloud
Social Network Analysis using Cloud Computing Services

  • 1. PlatformDay2009 SNS Analysis using Cloud Computing Services DHT-based Key-Value Storage and MapReduce-based Analysis DongWoo Lee oiko.cloud@gmail.com S Oiko Laboratory D SocialFlow OikoLab 2 CloudKR
  • 2. Agenda ‣ Introduction • Social Network Service • Motivation : Visualization, Social Network Analysis • SocialFlow • Scale Out Technologies : Cloud Computing ‣ SNS Analysis Architecture based on Cloud • Overall Process • Crawling • DHT Storage (CouchDB) • MapReduce • Pair-Wise Similarity ‣ Cloud Computing Service • Amazon Web Service • EC2 / S3 / Elastic MapReduce • Tips ‣ References
  • 3. Introduction: Social Network, Cloud Computing, Mobile Device
  • 4. Social Network Service “Social Applications = Social Networks” “A social network is a collection of people bound together through a specific set of social relations.” “A collection of people is a social network if and only if it is possible for something to spread virally through that collection.”
  • 5. Social Network Services : Twitter, Facebook
  • 7. Social Networks http://www.vincos.it/world-map-of-social-networks/
  • 8. Social Network Analysis ‣ Social Graph Analysis ‣ Visualization ‣ Person-to-Person Relationship ‣ Temporal Mind Mining (Content Clustering) ‣ Post-Mortem Log Processing
  • 9. Social Network Analysis : Visualization
  • 10. Social Network Analysis : Visualization
  • 11. Social Network Analysis : Visualization
  • 12. Social Network Analysis : Visualization
  • 13. SocialFlow ‣ Thoughts, Feelings, Interests, Relationship and Information of SNS ‣ Real-time Massive Social Data Streams ‣ Difficult to follow the Social Streams ‣ Need a way to get a summary or clustered information based on Common Interests
  • 14. SocialFlow ‣ Getting Common Flows of people through Content Similarities ‣ Reflecting Short-Term Interests of People ‣ Extracting Hot Issues ‣ Revealing Relationships among In/Out Resources ‣ Implementing Scale-Out Technologies ‣ Evolving toward Recommendation System based on Collective Intelligence
  • 15. Scale Out Technologies : Cloud Computing
  • 16. Why Cloud Computing? ‣ SPOF (Single Point of Failure) ‣ Cluster Administration (Who does this?) ‣ Initial Infrastructure Investment (Risk Management) ‣ Focus on Main Thing (Intelligence) ‣ Enable Highly Scalable Services. Source: New resource provision paradigms for Grid Infrastructures: Virtualization and Cloud / ISGC 2009 http://tinyurl.com/nacgu7
  • 17. Cloud Computing: e.g. Storage Failure
  • 18. SNS Analysis Architecture based on Cloud
  • 19. Experimental Project ‣ Python / Django / Boto ‣ ML / Data Mining ‣ DHT / CouchDB ‣ Cloud / AWS S3, EC2, Hadoop MapReduce
  • 20. Workflow: SNS --> Crawler --> MapReduce --> Post-Processing --> CDN --> User. Diagram labels: In-house Cluster (Local DataCenter), Cloud Service
  • 21. Technologies : Before. Diagram labels: Crawlers; Consistent DHT (Hash_ring); MapReduce (CouchJS); Key-Value Storage (CouchDB); Machine Learning (Home Made)
  • 22. Technologies : After. Diagram labels: Crawlers; Storage (S3); Consistent DHT (Hash_ring); MapReduce (EC2, Hadoop); Key-Value Storage (CouchDB); Machine Learning (Home Made)
  • 23. Crawling ‣ Fetching recent postings of SNS ‣ Storing fetched postings to CouchDB Storage through the DHT Layer (which selects a server) ‣ Pushing raw data into the Cloud to process them with MapReduce. Diagram labels: Crawlers, DHT, Replication, DBs, Indexer (Mapper), [ term, doc ], Index File
  • 24. Consistent DHT (Distributed Hash Table) ‣ Uniform key distribution and load balancing with a good hash function ‣ Minimizing the effects of a storage crash or temporary downtime ‣ High availability with a replication scheme ‣ Notice: A real node has non-linear portions of the total key space. Diagram: ring of nodes 0, 1, 2, ..., k-1, k, k+1, ..., N-1, with Replicate(k, k-1, k+1) placing replicas of node k on nodes k-1 and k+1
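
The consistent-hashing idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Hash_ring code used in the project: the `HashRing` class, the MD5 hash, the 64 virtual nodes per server, and the `couch-*` server names are all hypothetical.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a point on the ring (MD5 is an assumed hash choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring; virtual nodes even out each server's key share."""

    def __init__(self, nodes, vnodes=64):
        # Each server appears at `vnodes` pseudo-random points on the ring.
        self._ring = sorted(
            (ring_position(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Return the first server clockwise from the key's ring position."""
        idx = bisect.bisect_right(self._points, ring_position(key))
        return self._ring[idx % len(self._ring)][1]


# Hypothetical server names and keys, for illustration only.
ring = HashRing(["couch-a", "couch-b", "couch-c", "couch-d"])
keys = [f"posting:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

# Dropping one server remaps only the keys that server used to own.
smaller = HashRing(["couch-a", "couch-b", "couch-c"])
moved = [k for k in keys if before[k] != smaller.node_for(k)]
```

Because the surviving servers keep their ring points, every key in `moved` was previously owned by the removed server, which is the "minimizing the effects of a storage crash" property listed above.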
  • 25. Consistent DHT (Distributed Hash Table). Diagram labels: Admin, Anonymous Traffic, User Traffic; View (Admin View, User View); Generated Contents (html, image) on AWS S3; SNS Crawler; SNS Analysis; DHT Front End; Memory Cache; DHT ring of nodes 0, 1, 2, ..., k-1, k, k+1, ..., N-1
  • 26. Consistent DHT : Replication (Replica = 2). Diagram: ring of nodes A, B, C, D; each node stores two key ranges (A B, B C, C D, D A), so a range such as B is held by two replicas
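
A minimal sketch of a Replica = 2 placement, assuming each key range is stored on its primary node and on the next node clockwise. The slide does not state the direction, so the successor rule and the helper names `replica_nodes` and `read_targets` are assumptions.

```python
def replica_nodes(nodes, primary_index, copies=2):
    """Return the primary node plus its clockwise successors, `copies` in total."""
    if copies > len(nodes):
        raise ValueError("cannot place more copies than there are nodes")
    return [nodes[(primary_index + step) % len(nodes)] for step in range(copies)]


def read_targets(placement, primary, down):
    """Nodes that can still serve a key range while the nodes in `down` are offline."""
    return [node for node in placement[primary] if node not in down]


nodes = ["A", "B", "C", "D"]

# Map each primary node to the list of nodes holding a copy of its key range.
placement = {nodes[i]: replica_nodes(nodes, i) for i in range(len(nodes))}

# With node B offline, its key range is still readable from the second copy.
survivors = read_targets(placement, "B", down={"B"})
```

With two copies per range, any single node can fail without making a key range unreadable; losing two adjacent nodes would.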
  • 27. CouchDB (Key-Value Storage) ‣ Erlang-based Key-Value Storage ‣ Storage Engine (MVCC, B-tree) ‣ RESTful API ‣ Server-side JavaScript Engine (MapReduce) ‣ View Engine ‣ Futon Web UI
  • 28. CouchDB: Server-side JavaScript ‣ Purpose ‣ Local Computations on Local Data Sets ‣ Features ‣ Mozilla’s SpiderMonkey ‣ MapReduce Framework with JavaScript ‣ Fork External Process (couchjs) ‣ Performance Enhancements Expected ‣ Google’s V8 (Chrome’s JavaScript Engine / JIT) http://tinyurl.com/m76sx3
  • 29. CouchDB: MapReduce doc = (d1, d2, fq) dx: { di }
  • 30. Map & Reduce : Pair-Wise Similarity
      ‣ DB --> Indexer (Mapper) --> Index File: [ term, doc ]
      ‣ Index File --> Grouper (Reducer) --> Doc Group File: [ term, { docs } ] => DB
      ‣ Doc Group File --> Combinator (Mapper) --> Doc Candidate File: [ doc1, doc2 ]
      ‣ Doc Candidate File --> DocPair Counter (Reducer) --> Doc Result File: [ freq, doc1, doc2 ]
      ‣ Indexer and Grouper for Processing Korean.
      ‣ No NLP and No Structural Analysis.
      ‣ Produce a pairwise similarity between two postings.
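
The two map/reduce rounds on this slide can be sketched in plain Python, run in a single process rather than on Hadoop. The function names and the whitespace tokenizer are assumptions made for illustration; the slide does not say how the real Indexer splits Korean text.

```python
from collections import Counter, defaultdict
from itertools import combinations


def index_terms(postings):
    """Map, round 1: emit (term, doc_id) for each distinct term in a posting."""
    for doc_id, text in postings.items():
        for term in set(text.split()):  # assumed tokenizer: whitespace split
            yield term, doc_id


def group_docs(term_doc_pairs):
    """Reduce, round 1: collect the set of documents containing each term."""
    groups = defaultdict(set)
    for term, doc_id in term_doc_pairs:
        groups[term].add(doc_id)
    return groups


def candidate_pairs(groups):
    """Map, round 2: emit every (doc1, doc2) pair that shares a term."""
    for docs in groups.values():
        yield from combinations(sorted(docs), 2)


def pair_scores(postings):
    """Reduce, round 2: count shared terms per document pair."""
    return Counter(candidate_pairs(group_docs(index_terms(postings))))


# Hypothetical postings, for illustration only.
postings = {
    "p1": "cloud storage on s3",
    "p2": "cheap cloud storage",
    "p3": "lunch in seoul",
}
scores = pair_scores(postings)
```

Here `p1` and `p2` share two terms, so their pair gets a score of 2, while `p3` shares no term with the others and never appears as a candidate. Very common terms produce large document groups and therefore many pairs, which is presumably why the optimization slide lists key group size distribution as a concern.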
  • 31. Map & Reduce : Optimization ‣ Concerns ‣ Consider Key Group Size Distribution ‣ Data Load Balancing ‣ Barrier Point ‣ Sample Data ‣ Two months of postings of my friends ‣ Reachable graph: 4,060 People ‣ Total Postings: 206,115
  • 32. Pair-Wise Similarity and its TreeMap Posting: 110,008 Users: 2,691 Score >= 6
  • 33. Pair-Wise Similarity and its Cluster ➡ One issue and different opinions among people
  • 34. Pair-Wise Similarity and its Cluster ➡ Common Interest / Hot Issue
  • 35. Pair-Wise Similarity and its Cluster ➡ One person and the similar contents pattern (specialty)
  • 36. Pair-Wise Similarity and its Cluster ➡ Similar Structure of Sentences (trendy, parody)
  • 37. Deployment: EC2, S3/CloudFront, Flickr, www
  • 39. Before the Cloud Age ‣ Smart Shell Guru’s Daily Work : Parallel Sort
      $ wc -l data
      $ split -l 1000k data
      (scp / NFS)
      $ nohup ./work.sh data1 > data1.processed
      $ nohup sort -r data1.processed > data1.sorted
      (scp / NFS)
      $ sort -rm data*.sorted > data.sorted
      Complexity
      ➡ Need to prepare/maintain physical machines and resources
      ➡ Need to monitor job progress (wait and see job’s status)
      ➡ Need to cope with machine failure (slave nodes / storages / networks)
      ➡ Need to schedule multiple jobs
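
The split, per-chunk sort, and merge steps of this shell workflow correspond to the following single-machine Python sketch. It is an illustration only: the function names and sample data are hypothetical, and it leaves out copying chunks between machines and the `./work.sh` processing step.

```python
import heapq


def split_chunks(lines, chunk_size):
    """Like `split -l`: cut the input into fixed-size chunks."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def sort_chunk(chunk):
    """Like `sort -r` on one worker: sort a single chunk in descending order."""
    return sorted(chunk, reverse=True)


def merge_sorted(chunks):
    """Like `sort -rm`: merge already-sorted chunks without re-sorting them."""
    return list(heapq.merge(*chunks, reverse=True))


data = ["pear", "apple", "fig", "kiwi", "plum", "lime", "date"]
sorted_chunks = [sort_chunk(chunk) for chunk in split_chunks(data, 3)]
result = merge_sorted(sorted_chunks)
```

The merge step only compares the heads of the sorted chunks, so it stays cheap even when the chunks were sorted on different machines. Python's `sorted` and the shell's `sort` may order non-ASCII text differently depending on locale.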
  • 40. Amazon Web Service : Overview. Diagram labels:
      ‣ EC2 (Elastic Compute Cloud): AMI (Machine Image), Auto Scaling, CloudWatch Monitoring, Elastic Load Balancing
      ‣ EBS (Elastic Block Store): 1 GB to 1 TB, mounted by EC2
      ‣ SQS (Simple Queue Service): Messages
      ‣ S3 (Simple Storage Service): Buckets, Objects, Permissions, Header; Import/Export (Offline, eSATA/USB)
      ‣ SimpleDB: key-value
      ‣ CloudFront: Edges
      ‣ Elastic MapReduce: Instant EC2 Hadoop Cluster
      ‣ Access: Clients (API, HTTP); Admin (Mgmt Console, EC2 CLI, SSH); Access Key ID, Secret Access Key, Key Pair
  • 41. Amazon Web Service ‣ Amazon Management Console
  • 42. AWS : AMI (Amazon Machine Image)
  • 43. AWS : Paid AMI / The Cloud Market. AMI (Amazon Machine Image), Paid AMI
  • 44. AWS : How to make an AMI (1)
      Loopback File
      # dd if=/dev/zero of=new_image.fs bs=1M count=1024
      Make ext3 file system
      # mke2fs -F -j new_image.fs
      # mkdir /mnt/ec2-fs
      # mount -o loop new_image.fs /mnt/ec2-fs
      # mkdir /mnt/ec2-fs/dev
      # /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x console
      # /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x null
      # /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x zero
      # mkdir /mnt/ec2-fs/etc
      Create /mnt/ec2-fs/etc/fstab (Add /dev/sda1 --> /, /dev/pts, /dev/shm, /proc, /sys)
      Create yum-xen.conf
      # mkdir /mnt/ec2-fs/proc
      # mount -t proc none /mnt/ec2-fs/proc
      # yum -c yum-xen.conf --installroot=/mnt/ec2-fs -y groupinstall Base
      Edit /mnt/ec2-fs/etc/sysconfig/network-scripts/ifcfg-eth0
      Edit /mnt/ec2-fs/etc/sysconfig/network
      Edit /mnt/ec2-fs/etc/fstab (Add /dev/sda2 --> /mnt, /dev/sda3 --> swap)
      # chroot /mnt/ec2-fs /bin/sh
      Edit services
  • 45. AWS : How to make an AMI (2)
      Building an AMI
      # yum install ruby
      # rpm -i ec2-ami-tools-noarch.rpm (Download from public s3 bucket)
      # ec2-bundle-image -i new_image.fs -k my-private-key.key -u aws-user-id
      Local Machine Root File System
      # ec2-bundle-vol -k my-private-key.key -s 1000 -u aws-user-id
      Upload to S3
      # ec2-upload-bundle -b my-bucket -m image.manifest -a my-aws-access-key-id -s my-secret-key-id
      Register AMI
      # ec2-register my-bucket/image.manifest
      IMAGE ami-xxxx
      Testing
      # ec2-describe-images ami-xxxx
      Deregister AMI
      # ec2-deregister ami-xxxx
      Running AMI
      # ec2-run-instances ami-xxxx -n 1
      http://docs.amazonwebservices.com/AWSEC2/2006-06-26/DeveloperGuide/
  • 46. AWS : EC2 Running Instance ‣ AWS Management Console
  • 47. AWS : EC2 Running Instance
  • 48. Amazon Web Service: Access Methods ‣ Access Key ID / Secret Access Key / Key Pairs ‣ Amazon Management Console ‣ EC2 API (WSDL) / EC2 CLI (Command Line Interface) ‣ SSH ‣ Firefox Extensions • S3 Firefox Organizer • Elasticfox ‣ S3 • DNS: s3 CNAME s3.amazonaws.com. e.g. Bucket Name: s3.xyz.com, so http://s3.xyz.com ---> S3's s3.xyz.com bucket ‣ s3cmd (python) ‣ s3cmd.rb / s3sync.rb (ruby) ‣ S3Hub (Mac)
  • 49. Amazon Web Service: Elasticfox ‣ Firefox’s Extension: Elasticfox
  • 50. Amazon Web Service: Elasticfox ‣ Key Pairs ‣ Private Key ‣ SSH
  • 51. Amazon Web Service: Elasticfox ‣ Security Groups ‣ Open Network Ports
  • 52. AWS: Elastic MapReduce ‣ EC2 + Hadoop ‣ Tools ‣ Management Console ‣ elastic-mapreduce CLI ‣ Preparation ‣ Code --> S3 ‣ Data --> S3 ‣ Log Folder ‣ Output Folder ‣ Job Flow ‣ Streaming ‣ Custom Jar ‣ Sample Applications
  • 54. AWS: Elastic MapReduce : Web UI
  • 55. AWS: Elastic MapReduce : CLI for Workflow. jobflow #id: input/* --> Step1 --> output1/part-000** --> Step2 --> output2/part-000** --> Step3 --> output3/part-000**
  • 56. AWS: Elastic MapReduce ‣ Failed tasks will be rescheduled in other Hadoop slaves. ‣ If a task is finished, the same instance will be killed by a tracker.
  • 58. AWS: SocialFlow Automation. Diagram labels: Home IDC (Local): Admin, DHT, Renderer, Read/Write; Amazon (Global): S3, Results, Launching EC2 pool (boto, python); Wild World: Users, Read Only
  • 59. AWS: EC2, EMR Price Model
      Service            Type                              Per Instance Hour      1 Week (7 Days)   1 Week (7 Days)
      EC2                On-Demand                         $ 0.10 (S)             $ 16.8            KRW 20,865
      EC2                On-Demand                         $ 0.40 (L)             $ 67.2            KRW 83,462
      EC2                On-Demand                         $ 0.80 (E)             $ 134.4           KRW 166,924
      EC2                Reserved (1yr $ 325, 3yr $ 500)   $ 0.03 (S)             $ 5.04            KRW 6,259
      EC2                Reserved (1yr $ 325, 3yr $ 500)   $ 0.12 (L)             $ 20.16           KRW 25,038
      EC2                Reserved (1yr $ 325, 3yr $ 500)   $ 0.24 (E)             $ 40.32           KRW 50,077
      Elastic MapReduce  On-Demand                         $ 0.10 (S) + $ 0.015   $ 19.32           KRW 23,995
      Elastic MapReduce  On-Demand                         $ 0.40 (L) + $ 0.06    $ 77.28           KRW 95,981
      Elastic MapReduce  On-Demand                         $ 0.80 (E) + $ 0.12    $ 154.56          KRW 191,963
      (S) = Small, (L) = Large, (E) = Extra Large; 1 USD = 1242 KRW
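
The weekly figures on this slide follow from the hourly prices: one week is 7 × 24 = 168 instance-hours, and for Elastic MapReduce the hourly EMR charge is added to the EC2 price. A small sketch of that arithmetic, where the function name is hypothetical and rounding KRW down is an assumption inferred from the slide's numbers:

```python
import math

HOURS_PER_WEEK = 7 * 24  # 168 instance-hours
KRW_PER_USD = 1242       # exchange rate quoted on the slide


def weekly_cost(ec2_hourly_usd, emr_hourly_usd=0.0):
    """Return (USD, KRW) for one instance running a full week."""
    usd = round((ec2_hourly_usd + emr_hourly_usd) * HOURS_PER_WEEK, 2)
    # Assumption: the slide's KRW figures appear to be rounded down.
    krw = math.floor(usd * KRW_PER_USD)
    return usd, krw


small_ec2 = weekly_cost(0.10)         # On-Demand EC2, Small
small_emr = weekly_cost(0.10, 0.015)  # Elastic MapReduce, Small
xlarge_emr = weekly_cost(0.80, 0.12)  # Elastic MapReduce, Extra Large
```

The Reserved rows use the same formula on the lower hourly rate and do not include the one-time 1yr or 3yr fee.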
  • 60. AWS: Performance http://tinyurl.com/qj6ao7
  • 61. AWS: Performance
  • 62. AWS: Performance http://tinyurl.com/p9jsyz
  • 63. AWS: Performance http://tinyurl.com/cqqxgl
  • 64. 10 Cent Tips
      ‣ AWS EC2
        ‣ Minimizing set-up time with prepared shell scripts
        ‣ Use Boto for automating deployments
        ‣ Use S3 (Free of Charge between S3 and EC2 in the same region)
        ‣ $0.030 per GB through June 30, 2000 ($0.1 per GB normal price)
      ‣ AWS Elastic MapReduce
        ‣ Enabling the SSH port (22) and Hadoop related ports (9100, 9101)
        ‣ Access to Master Node: ssh -i keypair hadoop@public_dns_name
        ‣ Double Check (PATH, etc)
        ‣ Debug, Debug, Debug
        ‣ Use EC2 for Hadoop (e.g. Cloudera’s Hadoop AMI) (No extra cost for Hadoop!)
  • 65. 10 Cent Tips
      ‣ AWS S3
        ‣ Setting HTTP header for images and static resources.
          ‣ Cache-Control: max-age=31536000
        ‣ Block Search Bots
          ‣ robots.txt at the root of a Bucket
          ‣ User-agent: *
          ‣ Disallow: /
        ‣ Using BitTorrent for large files
          ‣ http://s3.xyz.com/xfile.zip?torrent
        ‣ Compress Rendered HTML with gzip
          ‣ Content-Encoding: gzip
          $ s3cmd put index.html s3://s3.xyz.com/www --mime-type "text/html" --add-header "Content-Encoding: gzip" --acl-public
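
The gzip and caching tips can be prepared on the client side before uploading. The sketch below uses only the Python standard library; `prepare_static_html` is a hypothetical helper, and sending both headers with one object is an assumption, since the slide lists Cache-Control for static resources and gzip for rendered HTML separately. The upload itself (for example with s3cmd as on the slide) is left out.

```python
import gzip


def prepare_static_html(html: str, max_age: int = 31536000):
    """Gzip an HTML page and build the HTTP headers to store alongside it."""
    body = gzip.compress(html.encode("utf-8"))
    headers = {
        "Content-Type": "text/html",
        "Content-Encoding": "gzip",           # browsers decompress transparently
        "Cache-Control": f"max-age={max_age}",  # 31536000 s = 365 days
    }
    return body, headers


# Hypothetical rendered page, for illustration only.
page = "<html><body>" + "<p>SocialFlow</p>" * 200 + "</body></html>"
body, headers = prepare_static_html(page)
```

S3 serves the stored bytes as they are, so the object must be compressed before upload and must carry the `Content-Encoding: gzip` header for browsers to decode it.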
  • 66. Amazon Web Service : Limitations
  • 67. References
      ‣ 10 MapReduce Tips, Cloudera, http://tinyurl.com/pxuqup
      ‣ Christian Charras, Thierry Lecroq, Handbook of Exact String-Matching Algorithms
      ‣ Dan Pritchett (eBay), BASE: An Acid Alternative, p.48-55, ACM Queue May/June 2008
      ‣ Edward Chang (Google Research), Mining Large Scale Social Networks, MMDS ’08
      ‣ Edward Walker, Benchmarking Amazon EC2 for high-performance scientific computing
      ‣ Matei Zaharia et al, Improving MapReduce Performance in Heterogeneous Environments, OSDI ’08
      ‣ Following Twitter
        ‣ http://twitter.com/AmazonEC2
        ‣ http://twitter.com/AmazonS3S3