Building a Text Classifier with Hadoop/Mahout/HBase



                     2011/04/10
                   #TokyoWebmining10-2

                         yanaoki



2011   4   18
•
                • HBase
                • Mahout
                • Naive Bayes
                •
                • Web

•
                    •   naoki yanai
                •
                    •
                    •                 …

                •
                    •
                    •       Hadoop

                •
                    •

HBase
                •   KeyValue

                    •                                                         read/write

                        •   goal is the hosting of very large tables -- billions of rows,
                            millions of columns ...


                    •   Hadoop

                •   CAP                   C,P

                    •   C: consistency, A: availability, P: partition tolerance

                •            Sharding

                •   Hadoop/MapReduce
HBase
                •
                    •   ―

                    •   ―

                    •



                            qualifier

Mahout
           •
           •    Hadoop

                •
                •                          HBase

                •
           •
                •   Classifier / Clustering / Pattern Mining

                •   Recommenders / Collaborative Filtering

                •   Evolutionary Algorithms ...
Mahout

           •
           •
                •
                •   Mahout

                •   Mahout in Action PDF

                •   hamadakoichi

                •   TokyoWebmining

Naive Bayes
           •        F1,...,Fn           C




           •    C




           •
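
The formulas on this slide did not survive extraction. For reference, the textbook naive Bayes formulation in the slide's own notation (features F_1, ..., F_n and class C) is sketched below; it is the standard form, not a copy of what the slide showed.

```latex
% Posterior of class C given the features, assuming the features are
% conditionally independent given the class:
P(C \mid F_1, \dots, F_n) \;\propto\; P(C) \prod_{i=1}^{n} P(F_i \mid C)

% Classification picks the class with the largest posterior:
\hat{C} \;=\; \operatorname*{arg\,max}_{c} \; P(c) \prod_{i=1}^{n} P(F_i \mid c)
```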


Naive Bayes
                •
                    •
                        •
                    •
                        •
                    •
                        •
Naive Bayes
                •
                    •
                •
                    •
                •
                    •
                •
                    •
•       Web

                    •
                    •
                    •
                •
                    •

•    Ruby

                •   ExtractContent

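           require "open-uri"
           require "kconv"            # provides String#toutf8
           require "extractcontent"

           # URI.open replaces the bare open(url) form, which Ruby 3.0 removed
           html = URI.open("http://news.nifty.com/....htm").read
           body, title = ExtractContent::analyse(html)

           puts body.toutf8 #=> body text extracted from the HTML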


•    Ruby

                •   scrAPI


       require 'scrapi'
       require 'open-uri'

       # Collect the text of every div.tweet element into an array
       scr = Scraper.define do
         process "div.tweet", "tweets[]" => :text
         result :tweets
       end

       tweets = scr.scrape(URI.parse("http://togetter.com/li/121476"),
                           :parser_options => {:char_encoding => 'utf8'})

       tweets.each { |tw| puts tw } #=> prints the text of each tweet


•                                             RSS                      HBase


           •
       row key (URL)                                    content   categories

       http://togetter/1.html                           ...       category:src="togetter"
                                                                  category:cat="social"

       http://news.nifty.com/....html                   AKB ...   category:src="nifty"
                                                                  category:cat="entertainment"

       http://groups.google.com/group/webmining-tokyo/  10 …

       http://ameblo.jp/....html                        KARA …
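
As a rough illustration of the layout above, the sketch below models a few of these rows as a plain Ruby Hash keyed by URL, with `category:src` and `category:cat` as column names. It is an in-memory stand-in only (no HBase client is involved), the content column is left out because its values are elided on the slide, and the helper name `labeled?` is hypothetical.

```ruby
# In-memory stand-in for the table sketched above (assumption: plain Hashes,
# not a real HBase client). Row key (URL) => { "family:qualifier" => value }.
PAGES = {
  "http://togetter/1.html" => {
    "category:src" => "togetter",
    "category:cat" => "social"
  },
  "http://news.nifty.com/....html" => {
    "category:src" => "nifty",
    "category:cat" => "entertainment"
  },
  # Rows without a category:cat column are the ones still to be classified.
  "http://ameblo.jp/....html" => {}
}.freeze

# Hypothetical helper: a row counts as labeled once category:cat is present.
def labeled?(row)
  row.key?("category:cat")
end

labeled, unlabeled = PAGES.partition { |_url, row| labeled?(row) }
puts "training rows: #{labeled.map(&:first).inspect}"
puts "to classify:   #{unlabeled.map(&:first).inspect}"
```

Rows that already carry `category:cat` can serve as training data, while the remaining rows are the ones a trained classifier would fill in.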

•    HBase

                    category_id <TAB>

           •    HBase           MapReduce   HDFS

                •
                    •
                    •
                        •   Wikipedia

                    •
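
A minimal sketch of producing one training line in the `category_id <TAB> ...` shape shown above. The part after the tab is lost on the slide; this sketch assumes it is the document's tokens separated by single spaces, and the function name `training_line` is hypothetical.

```ruby
# Build one training line: category id, a tab, then space-separated tokens.
# Assumption: the text after <TAB> is a whitespace-separated token list.
def training_line(category_id, tokens)
  # Split any token that itself contains whitespace (including tabs), so the
  # only tab left in the line is the separator after the category id.
  clean = tokens.flat_map(&:split)
  "#{category_id}\t#{clean.join(' ')}"
end

puts training_line("e", ["AKB", "KARA"])
```

Splitting tokens on whitespace first keeps a stray tab inside the text from being mistaken for the label separator.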
•    mahout

                    $ mahout trainclassifier       ...

                    $ mahout testclassifier        …

           •    mahout

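                •    --input/--output: input / output paths

                •    --dataSource: where the model is stored, HDFS or HBase

                •    --gramSize: N-gram size

                •    --classifierType: classifier type (e.g. cbayes for complement naive Bayes)

                •    --alpha: smoothing parameter

                •    --minDF/--minSupport: minimum document frequency / minimum support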

•                            HBase


           •
           =======================================================
           Summary
           -------------------------------------------------------
           Correctly Classified Instances          :       1884       82.2348%
           Incorrectly Classified Instances        :        407       17.7652%
           Total Classified Instances              :       2291
           =======================================================
           Confusion Matrix
           -------------------------------------------------------
           a       b       c       d       e       <--Classified as
           216     32      22      155     0        |  425         a     = t
           0       514     13      70      0        |  597         b     = s
           0       2       514     9       0        |  525         c     = e
           1       8       13      638     0        |  660         d     = b
           0       0       67      15      2        |  84          e     = a
           Default Category: unknown: 5
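
The summary figures can be rechecked from the matrix itself. The sketch below copies the counts from the output above and assumes that each row is a true label and each column a predicted label (the row totals printed after `|` are consistent with that reading). Variable names are illustrative.

```ruby
# Confusion matrix copied from the testclassifier output above.
# Assumption: rows are true labels, columns are predicted labels.
labels = %w[t s e b a]            # row order a..e as mapped in the output
matrix = [
  [216,  32,  22, 155, 0],
  [  0, 514,  13,  70, 0],
  [  0,   2, 514,   9, 0],
  [  1,   8,  13, 638, 0],
  [  0,   0,  67,  15, 2]
]

correct  = matrix.each_index.sum { |i| matrix[i][i] }   # diagonal
total    = matrix.flatten.sum
accuracy = correct.to_f / total                          # 1884 / 2291

# Per-class recall: diagonal entry divided by its row total.
recall = labels.each_with_index.map { |label, i|
  [label, matrix[i][i].to_f / matrix[i].sum]
}.to_h

puts format("accuracy: %.4f%%", accuracy * 100)
recall.each { |label, r| puts format("recall[%s] = %.3f", label, r) }
```

Overall accuracy hides the weakest class: under this reading only 2 of the 84 documents labeled `a` are classified correctly.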


•
           •                                      reducer                      HBase


            // Set up the classifier; the trained model is read from the HBase table
            BayesParameters params = new BayesParameters();
            params.set("alpha_i", "1");
            algorithm = new CBayesAlgorithm();
            datastore = new HBaseBayesDatastore("model_table_name", params);
            classifier = new ClassifierContext(algorithm, datastore);

            // Classify a tokenized document; "default" is the fallback category
            ClassifierResult category = classifier.classifyDocument(
                doc.toArray(new String[doc.size()]), "default");

            String label = category.getLabel();


•

       row key (URL)                                    content   categories

       http://togetter/1.html                           ...       category:src="togetter"
                                                                  category:cat="social"

       http://news.nifty.com/....html                   AKB ...   category:src="nifty"
                                                                  category:cat="entertainment"

       http://groups.google.com/group/webmining-tokyo/  10 …      category:cat="technology"

       http://ameblo.jp/....html                        KARA …    category:cat="entertainment"

Web




Web
                •   Google News Togetter
                                   RSS

                •
                    •                              …

                    •                                         …
                •
                        a                   935        5.2M
                        b                  5,112       7.2M
                        e                  3,746       8.1M
                        s                  4,737       12M
                        t                  3,969       9.2M
4/18

                                 Web
                •
                      •
                =======================================================
                Summary
                -------------------------------------------------------
                Correctly Classified Instances          :      13388        91.6798%
                Incorrectly Classified Instances        :       1215         8.3202%
                Total Classified Instances              :      14603

                =======================================================
                Confusion Matrix
                -------------------------------------------------------
                a         b         c         d         e         <--Classified as
                2328      19        515       250       0          |  3112       a       =   t
                3         2939      54        20        0          |  3016       b       =   e
                32        3         3542      109       0          |  3686       c       =   s
                33        16        128       3877      0          |  4054       d       =   b
                1         27        2         3         702        |  735        e       =   a
                Default Category: unknown: 5


Web


                •
                    •
                        •                              alpha


                              alpha:    1        0.5      0.1      0.01     0.001
                                        65.38%   65.83%   66.73%   66.82%   67.02%
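
One way to read the alpha parameter is through textbook additive smoothing, where alpha is added to every term count so that a term unseen in a class never gets probability zero. The sketch below assumes that textbook form only; Mahout's trainers may use alpha inside their own weight formulas, which are not reproduced here, and the counts are made up for illustration.

```ruby
# Textbook additive smoothing (assumption: not Mahout's exact weighting).
#   P(term | class) = (count + alpha) / (class_total + alpha * vocab_size)
def smoothed_prob(count, class_total, vocab_size, alpha)
  (count + alpha).to_f / (class_total + alpha * vocab_size)
end

# Hypothetical counts for illustration only.
class_total = 100   # total term occurrences observed in one class
vocab_size  = 50    # number of distinct terms overall

[1, 0.5, 0.1, 0.01, 0.001].each do |alpha|
  unseen = smoothed_prob(0, class_total, vocab_size, alpha)
  puts format("alpha=%-6s P(unseen term | class) = %.6f", alpha, unseen)
end
```

A smaller alpha moves less probability mass to unseen terms, so the observed counts dominate the estimate.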


4/18

                               Web


                •
                    •
                        •   N-Gram


                                     unigram   bigram
                                     63.57%    66.09%
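
To make the unigram/bigram comparison concrete, here is a minimal sketch of turning a token list into n-gram features. It assumes that a gram size of 2 keeps the unigrams and adds the bigrams; the slides do not confirm whether Mahout's trainer does exactly this. The function name `ngrams` is hypothetical.

```ruby
# Generate all n-grams from size 1 up to max_n over a token list.
# Assumption: larger gram sizes add features on top of the unigrams.
def ngrams(tokens, max_n)
  (1..max_n).flat_map do |n|
    # each_cons(n) slides a window of n consecutive tokens over the list
    tokens.each_cons(n).map { |gram| gram.join(" ") }
  end
end

tokens = %w[hadoop mahout hbase]
puts ngrams(tokens, 1).inspect
puts ngrams(tokens, 2).inspect
```

Bigrams capture short phrases that single tokens miss, at the cost of a larger feature space.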


Web


                •
                    •
                        •

                                           +




                                  56.8%   65.38%


4/18

                            Web


                •
                    •
                        •



                                  67.02%   67.88%


•
                    •
                •               HBase/Mahout

                    •
                    •   HBase


