2013 Reading, Oases
PageRank 7.3 
Introduction 
PageRank Algorithm 
Strengths and Weaknesses 
Timed PageRank & Recency Search
PageRank 7.3 Introduction 
HITS was presented by Jon Kleinberg in January 1998 at
the Ninth Annual ACM-SIAM Symposium on Discrete
Algorithms.
PageRank was presented by Sergey Brin and Larry Page
at the Seventh International World Wide Web Conference
(WWW7) in April 1998.
Based on the algorithm, they built the search engine
Google.
PageRank 7.3.1 PageRank Algorithm 
PageRank (PR) is a static ranking of Web pages.
PageRank is based on the measure of prestige in social
networks: the PageRank value of each page can be
regarded as its prestige.
Concepts: 
In-links of page i: These are the hyperlinks that point to 
page i from other pages. Usually, hyperlinks from the 
same site are not considered. 
Out-links of page i: These are the hyperlinks that point 
out to other pages from page i. Usually, links to pages of 
the same site are not considered. 
[Figure: in-links and out-links of a page]
PageRank uses a directed graph G=(V, E) [G=graph, V=pages, E=links]
PageRank score of page i:
$$P(i) = \sum_{(j,i) \in E} \frac{P(j)}{O_j}$$
Note: $O_j$ is the number of
out-links of page j
The equation above does not quite suffice.
(It models what happens under random behavior.)
Based on the Markov chain:
Note: $A_{ij}(1)$ is the probability of going
from i to j in 1 transition
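The one-step transition probabilities can be sketched as follows. This is a minimal illustration, assuming the usual random-surfer convention that A[i][j] = 1/O_i when page i links to page j and 0 otherwise (the slide does not state the formula). The function name and the three-page graph are hypothetical.

```python
def transition_matrix(out_links, n):
    """Build the one-step transition matrix A for pages 0..n-1.

    Assumption: A[i][j] = 1 / O_i if page i links to page j, else 0,
    where O_i is the number of out-links of page i.
    """
    A = [[0.0] * n for _ in range(n)]
    for i, targets in out_links.items():
        for j in targets:
            A[i][j] = 1.0 / len(targets)
    return A


# Hypothetical graph: page 0 links to 1 and 2, page 1 to 2, page 2 to 0.
graph = {0: [1, 2], 1: [2], 2: [0]}
A = transition_matrix(graph, 3)
```

Each row of A sums to 1 here because every page has at least one out-link. A page without out-links would leave an all-zero row, which is why such pages need special handling.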
Note: the example is adjusted by adding a
link from page 5 to every page
The random surfer has two options:
1. With probability d, he randomly chooses an out-link to follow.
2. With probability 1-d, he jumps to a random page without following a link.
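The two options of the random surfer can be sketched as a power iteration. This is an illustrative sketch, not the deck's own implementation. It assumes d = 0.85 (a commonly used value that the slides do not give), divides the jump term by n so the scores sum to 1, and treats a page with no out-links as linking to every page. The function name and graph are hypothetical.

```python
def pagerank(out_links, n, d=0.85, iters=100):
    """Power iteration for the random-surfer model on pages 0..n-1."""
    rank = [1.0 / n] * n
    for _ in range(iters):
        # Option 2: with probability 1-d, jump to a random page.
        new_rank = [(1.0 - d) / n] * n
        for i in range(n):
            targets = out_links.get(i, [])
            if targets:
                # Option 1: with probability d, follow a random out-link.
                share = d * rank[i] / len(targets)
                for j in targets:
                    new_rank[j] += share
            else:
                # Assumption: a page with no out-links links to every page.
                share = d * rank[i] / n
                for j in range(n):
                    new_rank[j] += share
        rank = new_rank
    return rank


# Hypothetical graph: page 2 is pointed to by pages 0, 1 and 3.
graph = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
scores = pagerank(graph, 4)
```

In this graph page 2 ends up with the highest score and page 3, which has no in-links, with the lowest.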
PageRank 7.3.2 Strengths and Weaknesses 
1. A major advantage of PageRank is its ability to fight spam.
Since it is not easy for a Web page owner to add in-links to
his/her page from other important pages, it is not easy
to influence PageRank.
Nevertheless, there are reported ways to influence PageRank. 
Recognizing and fighting spam is an important issue in 
Web search.
2. Another major advantage of PageRank is that it is a global
measure and is query independent.
At query time, only a lookup is needed to find the value
to be integrated with other strategies to rank the pages.
It is thus very efficient at query time.
Weakness: the main criticism is also the query-independent nature of
PageRank. It cannot distinguish between pages that are
authoritative in general and pages that are authoritative on
the query topic.
PageRank 7.3.3 Timed PageRank and Recency Search 
The Web is a dynamic environment. It changes constantly. 
Quality pages in the past may not be quality pages now or 
in the future. 
Many outdated pages and links are not deleted. This causes
problems for Web search because such outdated pages
may still be ranked high. Thus, search has a temporal
dimension.
The time-sensitive ranking algorithm is called TS-Rank.
In TS-Rank, the surfer can take one of two actions:
1. With probability f(t_i), he randomly chooses an out-going
link to follow.
2. With probability 1-f(t_i), he jumps to a random page
without following a link.
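The two surfer actions can be sketched as a random walk in which the follow probability depends on the page the surfer is on. This is only a sketch under loud assumptions: t_i is taken to be the age of page i in days, and f is a hypothetical exponential decay with a one-year half-life. The slides define neither, so all names and parameters here are illustrative.

```python
import math


def follow_probability(age_days, half_life_days=365.0):
    """Hypothetical f(t): 1 for a fresh page, decaying toward 0 with age."""
    return math.exp(-math.log(2.0) * age_days / half_life_days)


def time_sensitive_rank(out_links, ages, iters=100):
    """Random walk where page i is left via a link with probability f(t_i)."""
    n = len(ages)
    f = [follow_probability(age) for age in ages]
    rank = [1.0 / n] * n
    for _ in range(iters):
        new_rank = [0.0] * n
        for i in range(n):
            targets = out_links.get(i, [])
            # Assumption: a page with no out-links always jumps.
            follow = f[i] if targets else 0.0
            # Action 1: follow a random out-going link.
            for j in targets:
                new_rank[j] += follow * rank[i] / len(targets)
            # Action 2: jump to a random page.
            jump = (1.0 - follow) * rank[i] / n
            for j in range(n):
                new_rank[j] += jump
        rank = new_rank
    return rank


# Hypothetical data: page 0 is fresh and links to page 2;
# page 1 is ten years old and links to page 3.
graph = {0: [2], 1: [3]}
ages = [0.0, 3650.0, 0.0, 0.0]
ts_scores = time_sensitive_rank(graph, ages)
```

Page 2, endorsed by a fresh page, ends up scoring higher than page 3, whose only in-link comes from a long-outdated page.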
HITS 7.4 
Introduction 
HITS Algorithm 
Finding Other Eigenvectors 
Relationships with Co-Citation and 
Bibliographic Coupling 
Strengths and Weaknesses of HITS
HITS 7.4 Introduction 
HITS stands for Hypertext Induced Topic Search.
Statement:
HITS expands the list of relevant pages returned by a search
engine and then produces two rankings of the expanded
set of pages: an authority ranking and a hub ranking.
Authority:
a page with many in-links.
A good authority is a page pointed to by many good hubs.
Hub:
a page with many out-links.
A good hub is a page that points to many good authorities.
[Figure: hubs Hub1, Hub2, ..., HubN, each listing links http1, http2, http3, ..., all point to one authority page]
[Figure: one hub page, listing links http1, http2, http3, ..., points to Authority 1, Authority 2, ..., Authority N]
Authorities and hubs have a mutual reinforcement relationship.
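HITS 7.4.1 HITS Algorithm 
HITS uses a directed graph G=(V, E) [G=graph, V=pages, E=links]
Compute the authority score a(i) and the hub score h(i) of each page i.
The mutually reinforcing relationship of the two scores is
represented as follows:
$$a(i) = \sum_{(j,i) \in E} h(j), \qquad h(i) = \sum_{(i,j) \in E} a(j)$$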
Writing them in matrix form, with L the adjacency matrix of G:
$a = (a(1), a(2), \ldots, a(n))^T$
$h = (h(1), h(2), \ldots, h(n))^T$
$a = L^T h = L^T L a$
$h = L a = L L^T h$
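Ex: a graph of four pages (1, 2, 3, 4) with adjacency matrix A (the matrix L of the matrix form) and initial scores a and h:
$$A = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{pmatrix}, \quad a = (0.2, 0.2, 0.2, 0.2)^T, \quad h = (0.2, 0.2, 0.2, 0.2)^T$$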
Sol: with L = A,
$$a = L^T L a = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} 0.2 \\ 0.2 \\ 0.2 \\ 0.2 \end{pmatrix} = \begin{pmatrix} 0.4 \\ 0.2 \\ 0.6 \\ 0.2 \end{pmatrix}$$
$$h = L L^T h = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{pmatrix} \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} 0.2 \\ 0.2 \\ 0.2 \\ 0.2 \end{pmatrix} = \begin{pmatrix} 0.4 \\ 0.6 \\ 0.2 \\ 0.2 \end{pmatrix}$$
The top authority is Page 3.
The top hub is Page 2.
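The worked example can be checked with a few lines of code. This sketch applies one step of a = L^T L a and h = L L^T h to the example's four-page adjacency matrix, using plain lists. The helper names are illustrative, and no normalization is applied, matching the single step shown in the example.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    """Return the transpose of M as a list of rows."""
    return [list(col) for col in zip(*M)]


# Adjacency matrix of the example: L[i][j] = 1 if page i+1 links to page j+1.
L = [
    [0, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 0],
]
Lt = transpose(L)

a0 = [0.2, 0.2, 0.2, 0.2]
h0 = [0.2, 0.2, 0.2, 0.2]

a1 = matvec(Lt, matvec(L, a0))   # one step of a = L^T L a
h1 = matvec(L, matvec(Lt, h0))   # one step of h = L L^T h

top_authority = a1.index(max(a1)) + 1   # pages are numbered from 1
top_hub = h1.index(max(h1)) + 1
```

A full HITS run would repeat these steps and normalize a and h after each one, so the scores stay bounded while the ranking converges.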
HITS 7.4.2 Finding Other Eigenvectors 
Other eigenvectors can reveal further collections of hubs and authorities.
Each such collection could potentially be relevant to the
query topic, but the collections could be well separated from one
another in the graph G for a variety of reasons.
For example,
1. The query string may represent a topic that arises as
a term in multiple communities, e.g. “classification”.
2. The query string may refer to a highly polarized issue,
involving groups that are not likely to link to one another,
e.g. “abortion”.
HITS 7.4.3 Relationships with Co-Citation and 
Bibliographic Coupling 
An authority page is like an influential research paper 
(publication) which is cited by many subsequent papers. 
A hub page is like a survey paper which cites many other 
papers (including those influential papers).
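In matrix terms, with L the adjacency matrix, entry (i, j) of L^T L counts the pages that link to both i and j (co-citation), and entry (i, j) of L L^T counts the pages that both i and j link to (bibliographic coupling). The sketch below computes both for the four-page matrix of the HITS example. The helper names are illustrative.

```python
def transpose(M):
    """Return the transpose of M as a list of rows."""
    return [list(col) for col in zip(*M)]


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]


# Four-page example: L[i][j] = 1 if page i+1 links to page j+1.
L = [
    [0, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 0],
]

co_citation = matmul(transpose(L), L)   # authority matrix L^T L
coupling = matmul(L, transpose(L))      # hub matrix L L^T
```

For instance, co_citation[0][2] is 1 because page 2 links to both page 1 and page 3, and coupling[0][1] is 1 because pages 1 and 2 both link to page 3.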
HITS 7.4.4 Strengths and Weaknesses of HITS 
The main strength of HITS is its ability to rank pages 
according to the query topic, which may be able to 
provide more relevant authority and hub pages. 
However, HITS has several disadvantages: 
1. HITS does not have the anti-spam capability of PageRank. 
2. HITS is prone to topic drift, because people put hyperlinks
for all kinds of reasons, including favors, spamming…
3. Query-time evaluation is also a major drawback.
Performing the eigenvector computation is a
time-consuming operation.
END

WEB Data Mining

Editor's Notes

  1. Periodic: cyclical, recurring at regular intervals
  2. suffice: to be sufficient
  3. outdated: obsolete, no longer updated; temporal: relating to time. For a completely new page in a Web site, which has few or no in-links, we can use the average TS-Rank value of the past pages of the site, which represents the reputation of the site.
  4. Eigenvectors: characteristic vectors; abortion: termination of a pregnancy
  5. Spamming: an improper attempt to use the network as a broadcast medium to send the same message to a large number of users who did not request it; drift: trend; computation: the numerical value of a computed result; consuming: time-consuming