Opportunities and Challenges of Web Search and Mining
Lee-Feng Chien (簡立峰), Academia Sinica & National Taiwan University
Outline
WSE = Google Globalization!
WSE = Google
Problems of WSE
  Inside WSE: Fast, Coverage, Accuracy
  Business: Profitable, Models, Competition
  Impacts: Web Computing, Knowledge Windows, New Paradigm of Civilization
I.  Some Must-Know   Statistics
Online Language Populations
Top Ten Languages in the Web: more and more non-English users!
TOP TEN LANGUAGES IN THE INTERNET | Internet Users, by Language | Average Penetration | World Population Estimate for Language | Language as % of Total Internet Users
English | 287,369,520 | 26.2 % | 1,098,654,265 | 35.9 %
Chinese | 105,484,112 | 8.0 % | 1,321,669,200 | 13.2 %
Japanese | 66,548,060 | 52.1 % | 127,853,600 | 8.3 %
German | 54,035,201 | 56.3 % | 95,893,300 | 6.8 %
Spanish | 53,670,063 | 13.9 % | 386,413,200 | 6.7 %
French | 35,034,269 | 9.3 % | 375,164,185 | 4.4 %
Korean | 30,670,000 | 41.0 % | 74,730,000 | 3.8 %
Italian | 28,610,000 | 49.3 % | 57,987,100 | 3.6 %
Portuguese | 23,058,254 | 10.3 % | 224,664,100 | 2.9 %
Dutch | 13,657,170 | 56.6 % | 24,125,950 | 1.7 %
TOP TEN LANGUAGES | 698,353,773 | 18.4 % | 3,787,154,900 | 87.3 %
Rest of the Languages | 101,686,725 | 3.9 % | 2,602,992,587 | 12.7 %
WORLD TOTAL | 800,040,498 | 12.5 % | 6,390,147,487 | 100.0 %
Web Content: more and more non-English pages. Source: Network Wizards, Jan 99 Internet Domain Survey
Web Users and Pages (5 years ago): challenge of scalability! Chinese users: 110M, including 87M (CN), 4.9M (HK), 8.8M (TW), 2.14M (SG), and others. Source: Global Reach, 2004
Number of Chinese Web Pages: 573,000,000 pages. Scalability problem!
Number of Web Pages: the world's largest search engine? [chart: billions of textual documents indexed, as of Sept 2, 2003. KEY: GG=Google, ATW=AllTheWeb, INK=Inktomi, TMA=Teoma, AV=AltaVista. Source: Search Engine Watch]
The top 10 Internet trends 2004 predicted by eOneNet.com
II.  Inside WSE
Components
Architecture [diagram components: Web, Spider, Indexer, Archive, Index servers, SE front ends, Log, Browser, with stages numbered (1) to (5); annotations: 1B queries/day, quality results, spam, freshness, 5B pages, scalable]
Spider
Index Server
System Anatomy
Data Structure
  Lexicon: fits in memory; kept in two different forms
  Hit list: accounts for most of the space; uses 2 bytes per hit to save space
  Forward index: barrels are sorted by wordID; inside a barrel, entries are sorted by docID
  Inverted index: same content as the forward index, but sorted by wordID; each doc list is sorted by docID
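The relationship between the forward and inverted index can be sketched in a few lines of Python. This is a minimal illustration only: it ignores barrels and the 2-byte hit encoding, and the names `build_forward_index`, `invert`, and the toy documents are assumptions made for the example, not part of the slides.

```python
from collections import defaultdict

def build_forward_index(docs, lexicon):
    """Map docID -> sorted list of (wordID, positions).

    `lexicon` maps word -> wordID and is filled in as new words appear.
    """
    forward = {}
    for doc_id, text in docs.items():
        hits = defaultdict(list)
        for pos, word in enumerate(text.lower().split()):
            word_id = lexicon.setdefault(word, len(lexicon))
            hits[word_id].append(pos)  # a "hit" here is just a word position
        forward[doc_id] = sorted(hits.items())
    return forward

def invert(forward):
    """Re-sort the same hits by wordID; each posting list is ordered by docID."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word_id, positions in forward[doc_id]:
            inverted[word_id].append((doc_id, positions))
    return dict(sorted(inverted.items()))

# Toy documents (hypothetical), keyed by docID.
docs = {2: "web search engine", 1: "web mining and web search"}
lexicon = {}
forward = build_forward_index(docs, lexicon)
inverted = invert(forward)
print(inverted[lexicon["web"]])
```

Answering a one-word query then reduces to a lexicon lookup followed by a scan of one posting list, which is why the inverted form is the one consulted at query time.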
Query Server
PageRank
PageRank (Cont.)
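For readers who want a concrete picture of PageRank, the following is a minimal power-iteration sketch of the commonly published formulation, normalized so that ranks sum to 1. It is not taken from the slides: the damping factor 0.85, the uniform treatment of pages without out-links, and the four-page toy graph are all assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: page -> list of pages it links to. Returns page -> rank."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread uniformly (assumption).
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            for q in outs:
                # Each page passes an equal share of its rank to every target.
                new_rank[q] += damping * rank[p] / len(outs)
        rank = new_rank
    return rank

# Hypothetical link graph: C is linked from A, B, and D; nothing links to D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this toy graph the page with the most in-links (C) ends up ranked highest and the page with none (D) lowest, which matches the intuition that a link acts as a vote.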
Search Functions
Document Delivery
III.  Business
What is Google?
Company Facts: employees: 1,300+; languages spoken: 34; worldwide offices: 21 (mostly in US & Europe); annual revenues: $900m
Google Revenue. Source: Eric Schmidt interview, PCWorld.com (January 30, 2002)
Sources of Revenue
Challenges (cont.)
Competitors: eBay and Amazon
Competitors: Microsoft and Yahoo
IV.  Impacts
Impacts
Web Computing
Knowledge Windows
New Web OS
V.  New Gen. of WSE
Advanced Google
New Features in Google
Other Search Tools
Clusty.com
Example on Vivisimo
Vivisimo  (cont.)
New Directions
VI.  Web Mining
Web Search/Information Retrieval [diagram: millions of users, information seeking, Web search engine]
Improving Search via Mining [diagram: millions of users; search engine; knowledge discovery over Web texts, images, logs, …]
Valuable Web Resources (Web logs, texts, images, …) for knowledge discovery: hyperlinks, anchor texts, search result pages, query logs, query session logs, click-stream logs, deep Web, …
Discovered Knowledge: users' preferences/needs (topic, location, timing, …); authority/popularity (site, file, people, company, product); clusters/associations/relations (site, page, people, company, product, query)
Web Mining for IR: search, classification, clustering, cross-language IR, information extraction, text mining, filtering
Computational Linguistics, Volume 29, Issue 3, September 2003.
Research at Web Knowledge Discovery Lab
LiveTrans: Cross-language Web Search
LiveClassifier: Classifying search results into a user-defined classification tree
LiveClassifier: Paper Title Categorization. Note: no labeled training data
LiveCluster: Taxonomy Generation
Terms Clustering
Query Clustering [figure: hierarchical clustering of query terms, cut into five clusters]
  Job seeking: 勞委會 職訓局 就業 青輔會 自傳 徵才 人力資源 104人力銀行 人力銀行 找工作 履歷表 求職 求才
  Fortune-telling: 占卜 塔羅牌 算命 紫微斗數 命理 姓名學 心理測驗 星座 愛情
  Airlines: eva 長榮航空 (EVA airline) 長榮 (EVA) 航空公司 (airline) 航空 (airway) 華航 (China airline) 中華航空 (China airline)
  Software compilations: 補帖 大補帖 泡麵 dbt
  Martial-arts fiction: 武俠 金庸 武俠小說 黃易 作家
 
Outline
Translating Unknown Queries. Note: first work dealing with online translation
Introduction (cont.)
English Terminologies | Chinese Translation
louvre museum | 羅浮宮
NII Japan | 国立情報学研究所
Ishikawa | 石川県
Banff | 班夫 / 班芙
Digital library | 數位圖書館 / 數字圖書館
SARS | 嚴重急性呼吸道症候群 / 非典 / 沙士
Clinton | 柯林頓 / 克林頓
Bill Gates | 比爾蓋茲
Web Mining of Query Translations [diagram components: source term, term translation, Web mining (anchor-text mining, search-result mining), target translations, OOD; example: Yahoo <-> 雅虎]
Anchor Text (Yahoo <-> 雅虎)
Search Result Page (National Palace Museum vs. 故宮博物院)
Problems
Term Extraction: SCPCD
Term Selection: Probabilistic Inference Model [labels on the slide: page authority, co-occurrence, PageRank]
Observation of Anchor Text: source term (Ts) 雅虎 => translation (Tt) Yahoo
Observation of Anchor Text [figure: anchor texts pointing to www.yahoo.com and www.yahoo.com.tw, containing "Yahoo", "雅虎", and context words such as "in USA", "Taiwan", "台灣", "搜尋引擎"; labels: source query, translation candidates, anchor-text set, co-occurrence, page authority (#in-link = 187, #in-link = 21)]
Search Result Mining [diagram components: source query, search engine, Web pages, term extraction, term selection, target translations]. Term extraction: PAT-tree based term extraction method [Chien, SIGIR '97]
Term Selection [diagram: query S linked to candidate terms T1, T2, ..., Tn]
Chi-Square Test
  a: # of pages containing both terms s and t
  b: # of pages containing term s but not t
  c: # of pages containing term t but not s
  d: # of pages containing neither term s nor t
  N: the total number of pages, i.e., N = a + b + c + d
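With the four page counts defined above, the usual chi-square statistic for a 2x2 contingency table is N(ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d)). The sketch below assumes that standard form; the helper `counts_from_hits` and the numbers in the example are hypothetical, showing one way the counts could be derived from search-engine hit counts.

```python
def chi_square(a, b, c, d):
    """Chi-square association score for a 2x2 contingency table of page counts."""
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0  # a row or column is empty, so no association can be measured
    return n * (a * d - b * c) ** 2 / denom

def counts_from_hits(n_s, n_t, n_both, n_total):
    """Derive (a, b, c, d) from hit counts for s, t, 's AND t', and all pages."""
    a = n_both
    b = n_s - n_both
    c = n_t - n_both
    d = n_total - n_s - n_t + n_both
    return a, b, c, d

# Made-up hit counts: s on 120 pages, t on 100 pages, both on 80, out of 10,000.
score = chi_square(*counts_from_hits(120, 100, 80, 10_000))
print(round(score, 1))
```

A candidate translation t that appears on mostly the same pages as s gets a large score, while a term that co-occurs no more often than chance scores near zero.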
Context Vector Analysis
Indirect Association Problem [figure nodes: Cisco, 思科 (Cisco), system, 系統 (system)] Fig. 4. An illustration showing the indirect association problem, in which the dashed arrows indicate possible indirect association errors.
Competitive Linking Algorithm [figure nodes: Cisco, system, 思科 (Cisco), 系統 (system), 資訊 (information), 網路 (network), 電腦 (computer)] Fig. 6. An illustration showing a bipartite graph generated by using Algorithm 2.
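One way to see how competitive linking suppresses indirect associations is the greedy one-to-one matching below. It is a sketch of the generic greedy form of the technique, not a reproduction of the Algorithm 2 referred to in the figure caption; the function name and all scores are invented for illustration.

```python
def competitive_linking(scores):
    """Greedy one-to-one linking.

    scores: dict mapping (source_term, target_term) -> association score.
    Pairs are visited from highest to lowest score; a pair is linked only
    if neither of its terms has been linked already.
    """
    links = {}
    used_targets = set()
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for (s, t), _score in ordered:
        if s in links or t in used_targets:
            continue
        links[s] = t
        used_targets.add(t)
    return links

# Invented scores: "系統" (system) co-occurs with "Cisco" so often that a plain
# arg-max would pick it as the translation of "Cisco" (an indirect association).
scores = {
    ("system", "系統"): 0.90,
    ("Cisco", "系統"): 0.85,
    ("Cisco", "思科"): 0.80,
    ("system", "思科"): 0.30,
}
links = competitive_linking(scores)
print(links)
```

Because 系統 is claimed first by its strongest partner, "system", the weaker indirect pair is blocked and "Cisco" falls through to 思科.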
Combined Method
  R_m(s,t): the rank of the score of the pair (s,t) under method m
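Since the combination is defined over ranks rather than raw scores, one plausible reading is a weighted sum of reciprocal ranks across methods. The sketch below assumes exactly that form; the weights, the function name `combine_by_rank`, and the candidate lists are illustrative assumptions and not taken from the slides.

```python
def combine_by_rank(rankings, weights=None):
    """Merge per-method rankings into one list.

    rankings: method name -> list of candidates, best first.
    weights:  method name -> weight (defaults to 1.0 for every method).
    Each method contributes weight / rank to a candidate's combined score.
    """
    weights = weights or {m: 1.0 for m in rankings}
    combined = {}
    for method, ranked in rankings.items():
        for rank, cand in enumerate(ranked, start=1):
            combined[cand] = combined.get(cand, 0.0) + weights[method] / rank
    # Highest combined score first; ties broken alphabetically for determinism.
    return sorted(combined, key=lambda c: (-combined[c], c))

# Hypothetical candidate rankings for the source term "Yahoo".
rankings = {
    "anchor_text": ["雅虎", "搜尋引擎"],
    "chi_square": ["雅虎", "台灣", "搜尋引擎"],
    "context_vector": ["台灣", "雅虎"],
}
merged = combine_by_rank(rankings)
print(merged)
```

Working with ranks sidesteps the fact that the methods produce scores on incomparable scales, and a candidate missing from one method's list simply receives no contribution from it.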
Experiments
Random Query Test Set
Table 2. Coverage and top 1~5 inclusion rates obtained with the four different methods for the random-query set.
Method | Top-1 | Top-3 | Top-5 | Coverage
CV (context vector) | 40.0% | 54.0% | 54.0% | 68%
χ² (chi-square) | 36.0% | 50.0% | 52.0% | 68%
AT (anchor text) | 20.0% | 32.0% | 32.0% | 32%
Combined | 44.0% | 64.0% | 66.0% | 72%
Other Experiments
Transitive Translation: top-n inclusion rates obtained with different models.
Model | Top1 | Top2 | Top3 | Top4 | Top5
Direct Translation | 35.7% | 43.0% | 46.9% | 49.6% | 51.2%
Indirect Translation | 44.2% | 55.1% | 58.0% | 59.7% | 60.5%
Transitive Translation | 49.2% | 58.1% | 60.9% | 61.6% | 62.0%
Transitive Translation Model
Chinese-Japanese Translation
Table VI. Top-n inclusion rates obtained with different models for translating traditional Chinese queries into Japanese.
Model | Top1 | Top2 | Top3 | Top4 | Top5
Direct | 10.5% | 12.8% | 14.3% | 15.1% | 15.1%
Indirect | 40.2% | 49.4% | 56.6% | 58.6% | 59.6%
Transitive | 42.9% | 51.4% | 58.6% | 61.3% | 61.9%
Table VII. Selected source query terms in traditional Chinese and their translations in English, simplified Chinese and Japanese respectively, which were extracted by our transitive translation model.
Source terms (Traditional Chinese) | English | Simplified Chinese | Japanese
新力 | Sony | 索尼 | ソニー
耐吉 | Nike | 耐克 | ナイキ
史丹佛 | Stanford | 斯坦福 | スタンフォード
雪梨 | Sydney | 悉尼 | シドニー
網際網路 | internet | 互联网 | インターネット
網路 | network | 网络 | ネットワーク
首頁 | homepage | 主页 | ホームページ
電腦 | computer | 计算机 | コンピューター
資料庫 | database | 数据库 | データベース
資訊 | information | 信息 | インフォメーション
Translation Lexicons with Regional Variations. Figure 1: Examples of search-result pages in different Chinese regions that were obtained via the English query words "George Bush" from Google: (a) Taiwan, (b) Mainland China, (c) Hong Kong.
Summary
LiveCluster:  Generating Taxonomy from terms or documents
Taxonomy Generation from Terms
Hierarchical Query Clustering
The Steps
Feature Extraction. Query term: nude. Search-result snippets:
  Creative Nude Photography Network -- Fine Art Nude and ... ... The Creative Nude and Erotic Photography Network is the number one net portal to the best in fine art nude and erotic photography! Over 100 CNPN Member Sites ...
  Nude Places ... to be naked. Walking in the forest, cruising the lake in open boats, swimming, picnicking and nude photography are all enjoyed in the nude. 60 minutes $39.95. ...
  A Brave Nude World ... A Brave Nude World! Warning: This site contains links to fine art nude & erotic photography. If you are under 18 or do not wish to view this material, You can ...
Co-occurred feature terms:
term | tf/df
photography | 2/2
art | 3/2
… | …
naked | 1/1
erotic photography | 3/2
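The tf/df figures in a table like this can be produced by counting, for each candidate feature term, its total occurrences across the returned snippets (tf) and the number of snippets that contain it (df). The sketch below is a minimal illustration under that reading; the function name, the whole-word matching rule, and the neutral toy snippets are assumptions for the example.

```python
import re
from collections import Counter

def tf_df(snippets, terms):
    """Return term -> (tf, df) over a list of search-result snippets.

    tf: total number of occurrences of the term across all snippets.
    df: number of snippets in which the term occurs at least once.
    """
    tf, df = Counter(), Counter()
    for snippet in snippets:
        text = snippet.lower()
        for term in terms:
            # Whole-word, case-insensitive matching (an assumption of this sketch).
            pattern = r"\b" + re.escape(term.lower()) + r"\b"
            n = len(re.findall(pattern, text))
            if n:
                tf[term] += n
                df[term] += 1
    return {t: (tf[t], df[t]) for t in terms}

# Toy snippets (hypothetical) standing in for search results of one query.
snippets = [
    "Digital library services and digital library software for archives.",
    "The national library opens a new digital collection.",
    "Open-source software for building a digital library.",
]
result = tf_df(snippets, ["digital library", "library", "software"])
print(result)
```

The resulting (tf, df) pairs are the raw material for whatever term weighting is applied next, so that a short query term can be represented by the vocabulary of its search results.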
Term Weighting
Extraction of Basic Feature Terms
Task I: Query Clustering (Cont.)
Term Similarity
Hierarchical Term Clustering
Clustering Results [figure: hierarchical clustering of query terms, cut into five clusters]
  Job seeking: 勞委會 職訓局 就業 青輔會 自傳 徵才 人力資源 104人力銀行 人力銀行 找工作 履歷表 求職 求才
  Fortune-telling: 占卜 塔羅牌 算命 紫微斗數 命理 姓名學 心理測驗 星座 愛情
  Airlines: eva 長榮航空 (EVA airline) 長榮 (EVA) 航空公司 (airline) 航空 (airway) 華航 (China airline) 中華航空 (China airline)
  Software compilations: 補帖 大補帖 泡麵 dbt
  Martial-arts fiction: 武俠 金庸 武俠小說 黃易 作家
Cluster Partition
Quality Function
Quality Function  (Cont.)
Quality Function  (Cont.)
Preliminary Experiment
Evaluation: F-Measure
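As a reference point, the F-measure of a generated cluster against a target class is commonly computed as the harmonic mean of precision and recall. The sketch below assumes that standard definition; how scores are aggregated over a whole hierarchy is not covered here, and the function name and sample sets are hypothetical.

```python
def f_measure(cluster, target_class):
    """Harmonic mean of precision and recall of a cluster against a class."""
    cluster, target_class = set(cluster), set(target_class)
    overlap = len(cluster & target_class)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)      # share of the cluster that is correct
    recall = overlap / len(target_class)    # share of the class that was found
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: 2 of 3 clustered items belong to a 4-item class.
f = f_measure({"a", "b", "c"}, {"b", "c", "d", "e"})
print(round(f, 3))
```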
Obtained F-Measures
 
Results of Hierarchical Structure Generation
