I Built a Text Classifier with Hadoop/Mahout/HBase
Naoki Yanai
1.
Hadoop/Mahout/HBase
2011/04/10 #TokyoWebmining10-2 yanaoki 2011 4 18
2.
• HBase
• Mahout
• Naive Bayes
• Web
3.
• naoki yanai
• Hadoop
4.
HBase
• KeyValue
• read/write
• goal is the hosting of very large tables -- billions of rows, millions of columns ...
• Hadoop
• CAP: C, P
  • C: Consistency, A: Availability, P: Partition tolerance
• Sharding
• Hadoop/MapReduce
5.
HBase
• qualifier
6.
Mahout
• Hadoop
• HBase
• Classifier / Clustering / Pattern Mining
• Recommenders / Collaborative Filtering
• Evolutionary Algorithms ...
7.
Mahout
• Mahout
• Mahout in Action PDF
• hamadakoichi
• TokyoWebmining
8.
Naive Bayes
• F1,...,Fn, C
• C
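For reference, the textbook naive Bayes rule for features F1,...,Fn and a class C can be written as below. This is the standard form, given here as an assumption about what the slide showed, not a copy of the slide's own notation.

```latex
P(C \mid F_1, \dots, F_n) \;\propto\; P(C) \prod_{i=1}^{n} P(F_i \mid C),
\qquad
\hat{C} \;=\; \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(F_i \mid C)
```

The product form follows from assuming the features are conditionally independent given the class.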
9.
Naive Bayes
10.
Naive Bayes
11.
Web
13.
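• Ruby
• ExtractContent
    require "open-uri"
    require "kconv"            # provides String#toutf8
    require "extractcontent"
    # On Ruby 3.0 and later, use URI.open instead of open for URLs
    html = open("http://news.nifty.com/....htm").read
    body, title = ExtractContent::analyse(html)
    puts body.toutf8 #=> body text extracted from the HTML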
14.
• Ruby
• scrAPI
    require 'scrapi'
    require 'open-uri'
    scr = Scraper.define do
      process "div.tweet", "tweets[]" => :text
      result :tweets
    end
    tweets = scr.scrape(URI.parse("http://togetter.com/li/121476"),
                        :parser_options => {:char_encoding => 'utf8'})
    tweets.each { |tw| puts tw } # prints the text of each tweet
15.
RSS HBase
(URL) | content | categories
http://togetter/1.html | ... | category:src="togetter", category:cat="social"
http://news.nifty.com/....html | AKB ... | category:src="nifty", category:cat="entertainment"
http://groups.google.com/group/webmining-tokyo/ | 10 … |
http://ameblo.jp/....html | KARA … |
16.
HBase category_id <TAB>
• HBase MapReduce HDFS
• Wikipedia
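As a rough illustration of the "category_id <TAB>" line format, the sketch below builds one training line per document. The helper name `training_line`, the sample tokens, and the assumption that the text after the tab is a space-separated token list are all hypothetical; the slide only shows the label and the tab.

```ruby
# Hypothetical helper: builds one training line in the form
#   category_id<TAB>token token token ...
# The space-separated token list after the tab is an assumption,
# not something the slide states.
def training_line(category_id, tokens)
  "#{category_id}\t#{tokens.join(' ')}"
end

# Made-up sample document, for illustration only
line = training_line("entertainment", %w[AKB live concert])
puts line
```

Each such line would then be written out to HDFS as input for training.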
17.
• mahout
    $ mahout trainclassifier ...
    $ mahout testclassifier …
• mahout
  • --input/--output
  • --dataSource: HDFS, HBase
  • --gramSize: N-gram
  • --classifierType
  • --alpha
  • --minDF/--minSupport
18.
HBase
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances   : 1884   82.2348%
Incorrectly Classified Instances :  407   17.7652%
Total Classified Instances       : 2291
=======================================================
Confusion Matrix
-------------------------------------------------------
a     b     c     d     e     <--Classified as
216   32    22    155   0     | 425   a = t
0     514   13    70    0     | 597   b = s
0     2     514   9     0     | 525   c = e
1     8     13    638   0     | 660   d = b
0     0     67    15    2     | 84    e = a
Default Category: unknown: 5
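The Summary figures can be re-derived from the confusion matrix: the diagonal holds the correctly classified documents, and the sum of all cells is the total. A minimal sketch, assuming rows are the actual categories and columns the predicted ones (the usual reading of "Classified as"):

```ruby
# Confusion matrix copied from the Mahout output above.
# Rows: actual category, columns: predicted category.
matrix = [
  [216,  32,  22, 155, 0],  # t
  [  0, 514,  13,  70, 0],  # s
  [  0,   2, 514,   9, 0],  # e
  [  1,   8,  13, 638, 0],  # b
  [  0,   0,  67,  15, 2],  # a
]

# Correct predictions sit on the diagonal
correct  = (0...matrix.size).sum { |i| matrix[i][i] }
total    = matrix.sum { |row| row.sum }
accuracy = 100.0 * correct / total

puts format("%d / %d = %.4f%%", correct, total, accuracy)
# prints "1884 / 2291 = 82.2348%"
```

The last row shows why overall accuracy can hide per-category problems: only 2 of the 84 documents in category a land on the diagonal.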
19.
• reducer HBase
    // Set up the classifier: Complementary Naive Bayes, model read from HBase
    BayesParameters params = new BayesParameters();
    params.set("alpha_i", "1");
    algorithm = new CBayesAlgorithm();
    datastore = new HBaseBayesDatastore("model_table_name", params);
    classifier = new ClassifierContext(algorithm, datastore);
    // Classify one document, passed as an array of tokens
    ClassifierResult category =
        classifier.classifyDocument(doc.toArray(new String[doc.size()]), "default");
    String label = category.getLabel();
20.
(URL) | content | categories
http://togetter/1.html | ... | category:src="togetter", category:cat="social"
http://news.nifty.com/....html | AKB ... | category:src="nifty", category:cat="entertainment"
http://groups.google.com/group/webmining-tokyo/ | 10 … | category:cat="technology"
http://ameblo.jp/....html | KARA … | category:cat="entertainment"
21.
Web
22.
Web
• Google News Togetter RSS
    a      935   5.2M
    b    5,112   7.2M
    e    3,746   8.1M
    s    4,737   12M
    t    3,969   9.2M
23.
4/18
Web
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances   : 13388   91.6798%
Incorrectly Classified Instances :  1215    8.3202%
Total Classified Instances       : 14603
=======================================================
Confusion Matrix
-------------------------------------------------------
a      b      c      d      e     <--Classified as
2328   19     515    250    0     | 3112   a = t
3      2939   54     20     0     | 3016   b = e
32     3      3542   109    0     | 3686   c = s
33     16     128    3877   0     | 4054   d = b
1      27     2      3      702   | 735    e = a
Default Category: unknown: 5
24.
Web
alpha   1        0.5      0.1      0.01     0.001
        65.38%   65.83%   66.73%   66.82%   67.02%
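The alpha parameter is the smoothing parameter of the Bayes classifier. The sketch below shows generic additive smoothing to illustrate what a smaller alpha does to the probability of an unseen word. It is an assumption-laden illustration, not Mahout's exact formula; the function name `smoothed_prob` and the counts are made up.

```ruby
# Additive smoothing of a word likelihood P(word | category).
#   count:      occurrences of the word in the category
#   total:      total word occurrences in the category
#   vocab_size: number of distinct words in the vocabulary
#   alpha:      smoothing strength
def smoothed_prob(count, total, vocab_size, alpha)
  (count + alpha).to_f / (total + alpha * vocab_size)
end

# Hypothetical numbers: an unseen word (count = 0) in a category
# with 1000 tokens and a 5000-word vocabulary.
[1, 0.5, 0.1, 0.01, 0.001].each do |alpha|
  puts format("alpha=%-6s P(unseen)=%.8f", alpha, smoothed_prob(0, 1000, 5000, alpha))
end
```

With a smaller alpha, less probability mass is moved to unseen words, so the observed counts dominate the estimate.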
25.
4/18
Web
N-Gram   unigram   bigram
         63.57%    66.09%
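To make the unigram/bigram distinction concrete, here is a minimal sketch of word n-gram generation. The helper `ngrams` and the sample tokens are hypothetical, and Mahout's own tokenizer may behave differently (for example, it may keep lower-order grams alongside the bigrams).

```ruby
# Builds word n-grams from a token list.
# n = 1 gives unigrams, n = 2 gives bigrams.
def ngrams(tokens, n)
  tokens.each_cons(n).map { |gram| gram.join(" ") }
end

tokens = %w[hadoop mahout hbase classifier]  # made-up sample tokens
p ngrams(tokens, 1)
p ngrams(tokens, 2)
```

Bigrams capture some word order that unigrams lose, at the cost of a much larger feature space.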
26.
Web
+
56.8%   65.38%
27.
4/18
Web
67.02%   67.88%
28.
• HBase/Mahout
• HBase