Large-Scale Log Analysis for Marketing


                          Kenji Hara/ Yukio Uematsu
                     Innovative IP Architecture Center
                    NTT Communications Corporation


                      Copyright © 2011 NTT Communications Co., Ltd. All Rights Reserved.
Company Overview

• Name: NTT Communications
• Headquarters: Tokyo, Japan
• Revenue: USD 12.9B (March 2011; USD 1 = JPY 80)
• Employees: 8,250 (June 2011)
• Business Areas
   – International communication
   – Internet provider
   – System integration
   – Cloud services
• History
   – 1952: NTT established
   – 1987: NTT went public (Tokyo Stock Exchange: 9432)
   – 1999: NTT Communications spun off from NTT and incorporated (May 28, 1999)



NTT Group, NTT Communications Corporate Structure

NTT Group: Nippon Telegraph and Telephone and its group companies (percentages as shown on the slide)
• 100%: US$ 12.9B revenue; Global data, Internet Access, Voice, IT (NTT Communications)
• 100%: US$ 24.4B revenue; Local Telecom
• 100%: US$ 21.9B revenue; Local Telecom
• 66.4%: US$ 52.8B revenue; Mobile
• 54.2%: US$ 14.5B revenue; System Integration

NTT Communications divisions
• R&D: Innovative IP Architecture Center
• First Sales Division, Second Sales Division, ..., Global Sales Division
• Product: Video & Voice Division, Network Services Division, Cloud Services Division, Applications and Content Division, Solutions Division
• Operation: Customer Services Division, Service Infrastructure Division, Systems Division
• Staff: Corporate Planning Division, Finance Division, ...
BizCITY: Cloud Services provided by NTT Communications

• ICT Outsourcing
   – BizHosting: Virtual Server Hosting
   – BizMail: WebMail, Scheduler
   – SaaS: CRM/SFA
   – BizStorage: Online Storage
• Big Data Analysis
   – BizMarketing: Multi Layer Analysis of Big Data (user log)
• High-Speed Backbone between Datacenters
• Secure Connectivity
   – Firewall
   – Internet
• Global NW (International, Domestic)
   – VPN Service: Guaranteed, Burst, Best Effort
   – Internet/IP Phone
• Ubiquitous Office
   – Mobile Access
   – Mobile Thin Client
   – Remote Access
   – IP Phone
Big Data in BizCITY

              BizStorage                               BizMarketing
              (Online Storage)                         (Multi Layer Analysis)
 Data         Private Data                             User Log: Access Log, Query Log, CGM Log (blog)
 Feature      Secure & High-Capacity Storage Service   Mining Data for Marketing
 Application  Private Data Analysis (next target)      Statistics, Natural Language Processing

Use Hadoop for "enormous" user log analysis

Hadoop in Biz Marketing

• CGM Data Analysis
   – Increasing data!!
   – [Figure: tweets per day, Jan 2009 to July 2011]
• Web Access Analysis
   – Many join operations

Requirement for scalability -> Hadoop!!
CGM Data Analysis in Biz Marketing

"Buzz Finder" supports marketing activity
using customer feedback in social media

• Sources gathered by Buzz Finder
   – Tweets (crawl)
   – Blogs (crawl)
   – Search (collect)
• Users and what they look at
   – Marketer: branding
   – Promotion: company reputation
   – Advertiser: ads' results
   – R&D: differences from other companies
Data Flow in BuzzFinder

PostgreSQL -> Hadoop Cluster (NLP and Statistics by Map/Reduce) -> PostgreSQL
Map/Reduce in BuzzFinder

CGM Data -> Map (NLP) -> Linguistic & User Data -> Map (Data Extract) -> Reduce (Statistics)

   Map (Data Extract) output      Reduce (Statistics) output
   Keywords                       Keyword Count
   Topics                         Topic Count
   Sentiments                     Sentiment Count
   Locations                      Location Count
   Index Data                     Points

• Rich data per record
• Small number of records (x mil/day)
• Map is costly (mainly because of NLP)
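
The shape of this pipeline can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`nlp_map`, `extract_map`, `count_reduce`), the toy lexicons, and the sample posts are hypothetical, and the word-list "NLP" merely stands in for the real, much more expensive linguistic analysis that the slides do not show.

```python
from collections import Counter, defaultdict

# Hypothetical toy lexicons standing in for the real NLP step.
NEGATIVE_WORDS = {"afraid", "smoke", "accident"}
KNOWN_LOCATIONS = {"tokyo", "fukushima"}


def nlp_map(post):
    """Map (NLP): annotate one CGM post with linguistic data."""
    words = post["text"].lower().split()
    return {
        "keywords": [w for w in words if len(w) > 4],
        "sentiment": "negative" if NEGATIVE_WORDS & set(words) else "positive",
        "locations": [w for w in words if w in KNOWN_LOCATIONS],
    }


def extract_map(annotated):
    """Map (data extract): emit (kind, value) pairs for the reducers."""
    for kw in annotated["keywords"]:
        yield ("keyword", kw)
    yield ("sentiment", annotated["sentiment"])
    for loc in annotated["locations"]:
        yield ("location", loc)


def count_reduce(pairs):
    """Reduce (statistics): count values separately for each kind."""
    counts = defaultdict(Counter)
    for kind, value in pairs:
        counts[kind][value] += 1
    return counts


posts = [
    {"text": "Heavy smoke near Fukushima plant"},
    {"text": "Power saving works well in Tokyo"},
]
pairs = [p for post in posts for p in extract_map(nlp_map(post))]
stats = count_reduce(pairs)
```

The costly work sits in `nlp_map`, which runs once per record, while the reducers only count. That matches the slide's note that the map side dominates the cost.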

Results of BuzzFinder (1/4)
Trends of "Earthquake" and "Nuclear Power Plant" on Twitter

• Earthquake: 18,565 tweets/day
• Nuclear Power Plant: 65,642 tweets/day

[Figure: tweets per day for each keyword; y-axis marked at 50,000 and 100,000]
• Peak annotation: "Heavy white smoke from Fukushima No.1 nuclear power plant", 95,271 tweets
• Many tweets about "Earthquake" on the 11th of each month

Trend overview of specified keywords on Twitter
Results of BuzzFinder (2/4)

Topics about "Nuclear Power Plant" in September

• Tokyo Electric Power
• Japan
• Nuclear Accident
• Fukushima
• Noda

Popular topics about specified keywords on Twitter

Results of BuzzFinder (3/4)
Location analysis of "Nuclear Power Plant"

[Figure: tweet volume by location on a scale from Few to Many; the Disaster Area and the Tokyo Area are labeled]

Many tweets from big cities and the disaster area
Results of BuzzFinder (4/4)

Sentiment analysis of "Nuclear Power Plant"

[Figure: two charts with legend Positive / Negative]
• 2011/04: 51.6%, 48.4%
• 2011/08: 52.5%, 47.5%

The sentiment toward "Nuclear Power Plant" became more negative from April
(one month after the earthquake) to August.
It is also more negative than the average sentiment (70% positive).




Visualization of Internet Users' Behavior

• Web access log consists of
   – time
   – url
   – userid
• Other data
   – Location information
   – Referrer information
   – User attribute

Click stream based analysis
   e.g., why did users leave without converting?

• Statistics
• Click stream analysis (OLAP)
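
As an illustration of the question above, here is a minimal Python sketch that groups access log records by user and reports the last page seen by users who never converted. The log records, the `CONVERSION_URL` marker, and the function name are assumptions made up for this example; only the three fields (time, url, userid) come from the slide.

```python
from collections import Counter, defaultdict

# Hypothetical access log records: (time, url, userid).
ACCESS_LOG = [
    (1, "/top", "u1"), (2, "/item", "u1"), (3, "/cart", "u1"),
    (1, "/top", "u2"), (2, "/item", "u2"), (3, "/cart", "u2"),
    (4, "/thanks", "u2"),
    (5, "/top", "u3"),
]
CONVERSION_URL = "/thanks"  # assumed marker of a completed conversion


def exit_pages_without_conversion(log):
    """Count the last page seen by each user who never converted."""
    streams = defaultdict(list)
    for time, url, userid in log:
        streams[userid].append((time, url))
    exits = Counter()
    for clicks in streams.values():
        urls = [url for _, url in sorted(clicks)]  # order clicks by time
        if CONVERSION_URL not in urls:
            exits[urls[-1]] += 1
    return exits


exits = exit_pages_without_conversion(ACCESS_LOG)
```

In this toy data, `u1` leaves at `/cart` and `u3` leaves at `/top`, while `u2` converts and is excluded. At real scale the per-user grouping is the step that would be distributed.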



Fast Map/Reduce for PaaS Services

Shuffle is costly!
Map/Reduce speed-up techniques

Normal Hadoop Cluster -> (server reduction, at the same speed) -> High Speed Hadoop Cluster

Speed-up techniques
1. Summation
2. OLAP (multi-join processing)




Strategies for Shuffle Cost Reduction

• Statistics (summation): Map Multi-Reduce *
   – Record reduce: pre-reduce during the map function to reduce intermediate data
   – Local reduce: pre-reduce within the same server before the combiner function
• OLAP (join): PJoin **
   – Join with semi-join view: pre-process redundant data for multiple joins

*, ** "Map Multi-Reduce" and "PJoin" are techniques from NTT labs and are currently closed source.



Map Multi-Reduce/Record Reduce

Pre-reduce during the map function to reduce intermediate data

Legend: Server, Process, File
Map tasks -> Reduce tasks

Normal map/reduce:
 Input -> Map -> MapOutputBuffer -> sort & spill -> Spill files -> mergeParts -> Output

Map/reduce with record reduce (pre-reduce function in the map function):
 Input -> Map -> Record reduce -> MapOutputBuffer -> sort & spill -> Spill files -> mergeParts -> Output
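
Map Multi-Reduce itself is closed source, so the following Python sketch is not NTT's implementation. It shows the general, publicly known idea of pre-aggregating inside the map function (often called in-mapper combining), which is one way to realize "pre-reduce during map". All function names and the sample records are hypothetical.

```python
from collections import Counter


def plain_map(records):
    """Normal map: emit one (key, 1) pair per input record."""
    return [(url, 1) for url in records]


def map_with_record_reduce(records):
    """Map with pre-reduce: sum per key before anything is buffered."""
    partial = Counter()
    for url in records:
        partial[url] += 1
    return list(partial.items())


def reduce_sum(pairs):
    """Reduce: final summation over all intermediate pairs."""
    total = Counter()
    for key, value in pairs:
        total[key] += value
    return dict(total)


records = ["/top", "/item", "/top", "/top", "/item", "/cart"]
plain = plain_map(records)
reduced = map_with_record_reduce(records)
```

Both variants yield the same totals after `reduce_sum`, but the pre-reduced map emits one pair per distinct key instead of one per record, so less data reaches the output buffer, the spill files, and the shuffle.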

Map Multi-Reduce/Local Reduce

Pre-reduce data within the same server before the combiner function

Legend: Server, Process, File

• The user program forks the master and the workers
• The master assigns map, local reduce, and reduce tasks to the workers
• Map workers read the input data (Split 0 to Split 4) and write their output locally
• Local reduce workers (Local Reduce task) pre-reduce the map output on the same server
• Reduce workers do a remote read and sort, then write Output File 0 and Output File 1

Twice as fast as the normal cluster
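
The effect of a local reduce can be illustrated with a small Python sketch. Again, this is not the closed-source NTT implementation: it only assumes that the outputs of several map tasks on one server are merged before they cross the network. The server names, task outputs, and function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical placement: which server ran each map task, and the
# (key, count) pairs that task produced.
MAP_OUTPUTS = [
    ("server-a", [("/top", 2), ("/item", 1)]),
    ("server-a", [("/top", 1), ("/cart", 1)]),
    ("server-b", [("/top", 4)]),
    ("server-b", [("/top", 1), ("/item", 2)]),
]


def local_reduce(map_outputs):
    """Merge the outputs of all map tasks that ran on the same server."""
    per_server = defaultdict(Counter)
    for server, pairs in map_outputs:
        for key, value in pairs:
            per_server[server][key] += value
    return per_server


def shuffled_pair_count(per_server):
    """Number of (key, value) pairs that would cross the network."""
    return sum(len(counter) for counter in per_server.values())


def final_reduce(per_server):
    """Reduce: combine the per-server partial sums."""
    total = Counter()
    for counter in per_server.values():
        total.update(counter)  # Counter.update adds counts
    return dict(total)


before = sum(len(pairs) for _, pairs in MAP_OUTPUTS)
merged = local_reduce(MAP_OUTPUTS)
after = shuffled_pair_count(merged)
totals = final_reduce(merged)
```

In this toy case the number of pairs to shuffle drops from 7 to 5 while the final totals stay the same. The saving grows with the number of map tasks per server and with how often keys repeat across them.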
OLAP in Click Stream Based Analysis

Click stream data analysis uses a star-join schema

• Center table: click_stream
• Surrounding tables: Page info, Location info, User info, Click info

Unique key count is large

Scalable join is required!
Join using Map/Reduce

• 3 ways to join by map/reduce
   – Memory-backed join / Reduce-side join: implemented in Hive
   – Map-side join

                  Memory-backed join   Reduce-side join   Map-side join
   Scalability    △                    ○                  Depends on implementation
   Shuffle cost   High                 Very high          Low
   Speed          Fast                 Slow               Depends on implementation

   (○ = good, △ = fair)

Scalability is a requirement
Shuffle is costly!
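
To make the shuffle trade-off concrete, here is a minimal Python sketch of two textbook join shapes: a reduce-side join, where both tables are shuffled by key, and a join done while scanning the large table against an in-memory copy of the small one. The table contents and function names are hypothetical, and how these two shapes line up with the three columns of the slide's table is not spelled out in the source, so treat the naming as an assumption.

```python
from collections import defaultdict

# Hypothetical tables: a small dimension table and a large fact table.
PAGES = [(1, "/top"), (2, "/blog")]          # (page_id, url)
CLICKS = [("c1", 1), ("c2", 2), ("c3", 2)]   # (click_id, page_id)


def reduce_side_join(pages, clicks):
    """Tag every row with its source, shuffle by key, join per key."""
    shuffled = defaultdict(lambda: {"page": [], "click": []})
    for page_id, url in pages:
        shuffled[page_id]["page"].append(url)        # whole table shuffled
    for click_id, page_id in clicks:
        shuffled[page_id]["click"].append(click_id)  # whole table shuffled
    return sorted(
        (click_id, url)
        for group in shuffled.values()
        for url in group["page"]
        for click_id in group["click"]
    )


def in_memory_map_join(pages, clicks):
    """Hold the small table in memory; join while scanning clicks."""
    lookup = dict(pages)  # must fit in each mapper's memory
    return sorted(
        (click_id, lookup[page_id])
        for click_id, page_id in clicks
        if page_id in lookup
    )


rs = reduce_side_join(PAGES, CLICKS)
ms = in_memory_map_join(PAGES, CLICKS)
```

Both functions return the same joined rows. The reduce-side version scales to tables of any size but moves every row through the shuffle, while the in-memory version avoids the shuffle only as long as one table fits in memory.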
PJoin/Join using Semi-Join View

Pre-process redundant data for multiple joins
Join on the map side using a pre-generated view; only the rest of the join runs on the reduce side

Legend: DFS read, shuffle

Pre-computation
• siteinfo (site description data) is partitioned by hash(x) into siteinfo a ... siteinfo z
• accesses (access log) is partitioned by hash(y) into accesses 1 ... accesses n
• siteinfo_accesses 1 ... siteinfo_accesses n: views holding the siteinfo primary key & foreign key (accesses primary key)

Query execution
• mapper: accesses processing + semi-join
• reducer: joining with siteinfo
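
PJoin is closed source, so the sketch below shows only the generic semi-join idea that the slide's title refers to, not NTT's actual algorithm or view layout. It assumes a pre-computed set of join keys that lets the map side discard access records that cannot match, so only the surviving rows are shuffled to the reduce side. Table contents and function names are hypothetical.

```python
# Hypothetical tables mirroring the slide's names.
SITEINFO = {"s1": "blog", "s2": "news", "s3": "shop"}  # site_id -> description
ACCESSES = [("a1", "s1"), ("a2", "s9"), ("a3", "s3"), ("a4", "s8")]  # (access_id, site_id)


def build_semi_join_view(siteinfo):
    """Pre-computation: keep only the join keys of siteinfo."""
    return frozenset(siteinfo)


def map_phase(accesses, view):
    """Map side: drop accesses that cannot match any siteinfo row."""
    return [(site_id, access_id) for access_id, site_id in accesses
            if site_id in view]


def reduce_phase(shuffled, siteinfo):
    """Reduce side: finish the join against the full siteinfo rows."""
    return sorted((access_id, siteinfo[site_id])
                  for site_id, access_id in shuffled)


view = build_semi_join_view(SITEINFO)
shuffled = map_phase(ACCESSES, view)
joined = reduce_phase(shuffled, SITEINFO)
```

Here only two of the four access records survive the map-side filter, so the shuffle carries half the rows. The view is built once and can be reused by every query that joins on the same key.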

Experimental evaluation (PJoin)

1 TB access log join processing using PJoin to verify its effectiveness

50 servers (normal Hadoop cluster) = 20 servers (PJoin-applied cluster): same speed!!

[Figure: "PJoin vs Hive" (PJoin, varying the number of machines, low selectivity); x-axis: number of servers (20 to 50); y-axis: processing time in minutes (0 to 6); series: "6. pjoin -> distinct -> pjoin plan", "7. pjoin -> rsjoin plan", "Hive, fastest result on 50 machines"]

HiveQL

    insert overwrite table q1_result
    select
      count(distinct s_sessionseqid)
    from clckstrm c
      join page p
        on
          c.c_pageseqid = p.p_pageseqid
          and p.p_url like '%blog.goo.ne.jp%'
      join session_info s
        on
          s.s_clckstrmseqid = c.c_clckstrmseqid
          and s.s_referer like '%QUERY%';
Other verification of Hadoop

• 40 servers, 250 cores
• Wide-area Ethernet
• LACP 4G between racks

[Figure: processing time (0 to 80) vs. number of servers (0 to 30); labeled WAN]

Hadoop Cluster (250 cores)
• Rack 1 (LOC1), including the Namenode
• Rack 2 (LOC1)
• Rack 3 (LOC2)
• Links: LACP 4GB, WAN (30 miles)

Conclusions
• NTT Communications provides the BizCITY cloud services

• Solved two problems using Hadoop in BizMarketing
   – NLP of big CGM data
   – Join operations on big web access logs

• Reduced operating cost using speed-up techniques
   – Map Multi-Reduce
   – PJoin

• Introduced our Hadoop cluster, which spans a wide-area network



Contacts

•   Kenji Hara, @haracane, kenji.hara@ntt.com
•   Yukio Uematsu, @alfyukio, y.uematsu@ntt.com



•   BizCITY: http://www.ntt.com/bizcity/

     – BizStorage: http://www.ntt.com/bizstorage/

     – BizMarketing: http://www.ntt.com/marketing/





Hadoop World 2011: Large Scale Log Data Analysis for Marketing in NTT Communications

  • 1. Large-Scale Log Analysis for Marketing. Kenji Hara / Yukio Uematsu, Innovative IP Architecture Center, NTT Communications Corporation. Copyright © 2011 NTT Communications Co., Ltd. All Rights Reserved.
  • 2. Company Overview • Name: NTT Communications • Headquarters: Tokyo, Japan • Revenue: USD 12.9B (March 2011; USD 1 = JPY 80) • Employees: 8,250 (June 2011) • Business Areas – International communication – Internet provider – System integration – Cloud services • History – 1952 NTT is established – 1987 NTT went public (Tokyo Stock Exchange: 9432) – 1999 spun off from NTT and incorporated (May 28, 1999)
  • 3. NTT Group, NTT Communications Corporate Structure. [Organization chart] Nippon Telegraph and Telephone holds 100% (US$ 12.9B revenue; global data, Internet access, voice, IT), 100% (US$ 24.4B revenue; local telecom), 100% (US$ 21.9B revenue; local telecom), 66.4% (US$ 52.8B revenue; mobile), and 54.2% (US$ 14.5B revenue; system integration) of its group companies. NTT Communications divisions: Innovative IP Architecture Center (R&D); First Sales Division, Second Sales Division, ..., Global Sales Division; Video & Voice Division, Network Services Division, Cloud Services Division, Applications and Content Division, Solutions Division (Product); Customer Services Division, Service Infrastructure Division, Systems Division (Operation); Corporate Planning Division, Finance Division, ... (Staff).
  • 4. BizCITY: Cloud Services Provided by NTT Communications. [Service diagram] Services: BizHosting, BizMail, SaaS, BizStorage, BizMarketing. Service labels: virtual server hosting; WebMail; scheduler; CRM/SFA; online storage; multi-layer analysis. Other labels: ICT outsourcing; big data analysis; big data (user log). Network labels: high-speed backbone between datacenters; secure connectivity; firewall; Internet; global network; VPN service; Internet/IP phone; guaranteed, burst, best effort. Access labels: mobile access; thin client; remote access; IP phone; international, domestic; ubiquitous office.
  • 5. Big Data in BizCITY. [Diagram] BizStorage (online storage): private data; secure and high-capacity storage service; private data analysis is the next target. BizMarketing (multi-layer analysis): user logs (blog, access log, query log, CGM data); mining data for marketing applications; features: statistics, natural language processing. Use Hadoop for “enormous” user log analysis.
  • 6. Hadoop in BizMarketing. CGM data analysis: increasing data (chart: tweets per day, Jan 2009 to July 2011). Web access analysis: many join operations. Requirement for scalability: Hadoop!
  • 7. CGM Data Analysis in BizMarketing. “Buzz Finder” supports marketing activity using customers’ feedback in social media. [Diagram] Buzz Finder crawls and collects tweets and blogs. Users: marketer, advertiser, R&D. Uses: promotion, company branding, reputations, search, ads’ results, difference with other companies.
  • 8. Data Flow in BuzzFinder. [Diagram] PostgreSQL → Hadoop cluster (NLP and statistics by Map/Reduce) → PostgreSQL.
  • 9. Map/Reduce in BuzzFinder. [Diagram] Stages: Map (NLP) → Map (data extract) → Reduce (statistics). CGM and user data are turned into linguistic data (keywords, topics, sentiments, locations) and index data; the reduce stage produces keyword, topic, sentiment, and location counts. Points: rich data per record; small number of records (x million per day); map is costly (mainly because of NLP).
  • 10. Results of BuzzFinder (1/4). Trends of “Earthquake” and “Nuclear Power Plant” on Twitter. [Chart; y-axis ticks: 50,000 and 100,000; series: Earthquake, Nuclear Power Plant; data labels: 18,565 tweets/day, 65,642 tweets/day, 95,271 tweets; annotation: “Heavy white smoke from Fukushima No.1 nuclear power plant.”] Many tweets about “Earthquake” appear on the 11th of each month. Trend overview of specified keywords in Twitter.
  • 11. Results of BuzzFinder (2/4). Topics about “Nuclear Power Plant” in September: Tokyo Electric Power, Japan, Nuclear Accident, Fukushima, Noda. Popular topics about specified keywords in Twitter.
  • 12. Results of BuzzFinder (3/4). Location analysis of “Nuclear Power Plant”. [Map; legend: many to few; labeled regions: disaster area, Tokyo area] Many tweets come from big cities and the disaster area.
  • 13. Results of BuzzFinder (4/4). Sentiment analysis of “Nuclear Power Plant”. [Charts for 2011/04 and 2011/08; extracted labels: Positive, Negative, 51.6%, 48.4%, 52.5%, 47.5%] The sentiment toward “Nuclear Power Plant” became more negative from April (one month after the earthquake) to August. It is also more negative than the average sentiment (70% positive).
  • 14. Hadoop in BizMarketing. CGM data analysis: increasing data (chart: tweets per day, Jan 2009 to July 2011). Web access analysis: many join operations. Requirement for scalability: Hadoop!
  • 15. Visualization of Internet Users’ Behavior • Web access log consists of – time – url – userid • Other data – Location information – Referrer information – User attribute • Click-stream-based analysis, e.g., why did users leave without converting? [Diagram labels: statistics; click stream analysis (OLAP)]
  • 16. Fast Map/Reduce for PaaS Services. Shuffle is costly! Map/Reduce speed-up techniques: 1. Summation 2. OLAP (multi-join processing). [Diagram] Normal Hadoop cluster → high-speed Hadoop cluster: the speed-up techniques reduce the number of servers at the same speed.
  • 17. Strategies for Shuffle Cost Reduction • Statistics (summation): Map Multi-Reduce * – Record reduce: pre-reduce during the map function to reduce intermediate data – Local reduce: pre-reduce on the same server before the combiner function • OLAP (join): PJoin ** – Join with semi-join view: pre-process redundant data for multiple joins. *, ** “Map Multi-Reduce” and “PJoin” are techniques from NTT labs and are currently closed source.
  • 18. Map Multi-Reduce / Record Reduce. Pre-reduce during the map function to reduce intermediate data. [Diagram; legend: server, process, file, map task, reduce task] Normal map/reduce: Input → Map → MapOutputBuffer → sort & spill → spill files → mergeParts → Output. Map/reduce with record reduce function (pre-reduce function in map): Input → Map → Record reduce → MapOutputBuffer → sort & spill → spill files → mergeParts → Output.
  • 19. Map Multi-Reduce / Local Reduce. Pre-reduce data on the same server before the combiner function. [Diagram; legend: server, process, file, local reduce task] The user program forks a master and workers; the master assigns map, local reduce, and reduce tasks. Input data (splits 0 to 4) is read by map workers, passes through local reduce, and is written by reduce workers to output files 0 and 1. Flow labels as extracted: remote read, local, sort, read, write. Result: twice as fast as the normal cluster.
  • 20. OLAP in Click-Stream-Based Analysis. Click stream data analysis uses a star-join schema. [Diagram] click_stream at the center, joined with page info, location info, user info, and click info. The unique key count is large, so a scalable join is required!
  • 21. Join Using Map/Reduce • 3 ways to join by map/reduce – Memory-backed join / reduce-side join: implemented in Hive – Map-side join • Comparison (memory-backed join / reduce-side join / map-side join) – Scalability: △ / ○ / depends on implementation – Shuffle cost: high / very high / low – Speed: fast / slow / depends on implementation • Scalability is a requirement, so shuffle is costly!
  • 22. PJoin / Join Using Semi-Join View. Pre-process redundant data for multiple joins: join on the map side using a pre-generated view, and perform only the rest of the join on the reduce side. [Diagram] Pre-computation: site description data (siteinfo a ... siteinfo z) and the access log (accesses 1 ... accesses n) are semi-joined, keyed by siteinfo primary key and foreign key (accesses primary key), into views siteinfo_accesses 1 ... siteinfo_accesses n. Query execution: each mapper reads accesses i from DFS and joins it with the semi-join view siteinfo_accesses i; the output is shuffled by hash(x), hash(y) to reducers for processing.
  • 23. Experimental Evaluation (PJoin). 1 TB access log join processing using PJoin to verify its effectiveness: 50 servers (normal Hadoop cluster) = same speed as 20 servers (PJoin-applied cluster). HiveQL: insert overwrite table q1_result select count(distinct s_sessionseqid) from clckstrm c join page p on c.c_pageseqid = p.p_pageseqid and p.p_url like '%blog.goo.ne.jp%' join session_info s on s.s_clckstrmseqid = c.c_clckstrmseqid and s.s_referer like '%QUERY%'; [Chart: PJoin vs Hive, machine-count variation, low selectivity; y-axis: processing time (minutes), 0 to 6; x-axis: number of servers, 20 to 50; series: “6. pjoin -> distinct -> pjoin plan”, “7. pjoin -> rs join plan”, “Hive, 50 machines, fastest”]
  • 24. Other Verification of Hadoop • 40 servers, 250 cores • Wide-area Ethernet • LACP 4G between racks. [Chart: processing time (y-axis, 0 to 80) vs. servers (x-axis, 0 to 30); label: WAN] [Diagram: Hadoop cluster (250 cores) with Rack 1 (LOC1), Rack 2 (LOC1), Rack 3 (LOC2), ...; Namenode; LACP 4GB; WAN (30 miles)]
  • 25. Conclusions • NTT Communications provides the BizCITY cloud services • Solved two problems using Hadoop in BizMarketing – NLP of big CGM data – Join operations on big web access logs • Reduced operation cost using speed-up techniques – Map Multi-Reduce – PJoin • Introduced our Hadoop cluster, which spans a wide-area network
  • 26. Contacts • Kenji Hara, @haracane, kenji.hara@ntt.com • Yukio Uematsu, @alfyukio, y.uematsu@ntt.com • BizCITY: http://www.ntt.com/bizcity/ – BizStorage: http://www.ntt.com/bizstorage/ – BizMarketing: http://www.ntt.com/marketing/
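
The "record reduce" idea from slides 17 and 18 (pre-reducing inside the map function so that fewer intermediate records reach the shuffle) can be illustrated with a small sketch. Map Multi-Reduce itself is closed source, so this is not NTT's implementation; it is the widely known in-mapper combining pattern, simulated in plain Python rather than on Hadoop. All function names and the sample splits are hypothetical.

```python
from collections import Counter, defaultdict


def map_plain(records):
    """Baseline mapper: emit one (key, 1) pair per input record."""
    for url in records:
        yield url, 1


def map_with_pre_reduce(records):
    """Mapper with pre-reduce: sum counts in memory, emit one pair per distinct key."""
    partial = Counter()
    for url in records:
        partial[url] += 1
    yield from partial.items()


def shuffle_and_reduce(map_outputs):
    """Sum values per key (the reduce step) and count the pairs that were shuffled."""
    totals = defaultdict(int)
    shuffled = 0
    for output in map_outputs:
        for key, value in output:
            shuffled += 1
            totals[key] += value
    return dict(totals), shuffled


# Two hypothetical input splits of a web access log (URLs only).
splits = [
    ["/a", "/b", "/a", "/a"],
    ["/b", "/b", "/c", "/a"],
]

plain_totals, plain_shuffled = shuffle_and_reduce(map_plain(s) for s in splits)
pre_totals, pre_shuffled = shuffle_and_reduce(map_with_pre_reduce(s) for s in splits)

print("shuffled pairs without pre-reduce:", plain_shuffled)
print("shuffled pairs with pre-reduce:", pre_shuffled)
```

Both variants produce the same per-URL totals; only the number of intermediate pairs differs. The saving grows with the number of repeated keys per split, which is why the approach suits summation-style statistics.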