2. Specifies
The WWW is a huge, widely distributed, global
information service centre for
Information services:
news, advertisements, consumer
information, financial
management, education, government,
e-commerce, etc.
Hyper-link information
Access and usage information
WWW provides rich sources of data for data mining
3. The Web: Opportunities & Challenges
1. The amount of information on the Web is huge
2. The coverage of Web information is very wide and
diverse
3. Information/data of almost all types exist on the
Web
4. Much of the Web information is
semi-structured
5. Much of the Web information is linked
6. Much of the Web information is redundant
4. The Web: Opportunities & Challenges
7. The Web is noisy
8. The Web is also about services
9. The Web is dynamic
10. Above all, the Web is a virtual society
11. The Web consists of surface Web and deep Web.
Surface Web: pages that can be browsed using a
browser.
Deep Web: databases that can only be accessed
through parameterized query interfaces
5. What is Web Data?
Web data is
1. Web content – text, images, records, etc.
2. Web structure – hyperlinks, tags, etc.
3. Web usage – HTTP logs, app server logs, etc.
4. Intra-page structures
5. Inter-page structures
6. Supplemental data
1. Profiles
2. Registration information
3. Cookies
7. Web Mining
Web Mining is the use of the data mining techniques
to automatically discover and extract information
from web documents/services
Web mining is the application of data mining
techniques to find interesting and potentially useful
knowledge from web data
Web mining is the application of data mining
techniques to extract knowledge from web
data, including web documents, hyperlinks between
documents, usage logs of web sites, etc.
8. Web Mining
• Web Mining is the use of the data mining techniques to
automatically discover and extract information from web
documents/services
• Discovering useful information from the World-Wide
Web and its usage patterns
• My Definition: Using data mining techniques to make the
web more useful and more profitable (for some) and to
increase the efficiency of our interaction with the web
9. Why Mine the Web?
Enormous wealth of information on Web
Financial information (e.g. stock quotes)
Book/CD/Video stores (e.g. Amazon)
Restaurant information
Car prices
Lots of data on user access patterns
Web logs contain sequence of URLs accessed by users
Possible to mine interesting nuggets of information
People who ski also travel frequently to Europe
Tech stocks have corrections in the summer and rally from November
until February
10. Why is Web Mining Different?
The Web is a huge collection of documents, plus
Hyper-link information
Access and usage information
The Web is very dynamic
New pages are constantly being generated
Challenge: Develop new Web mining algorithms and adapt
traditional data mining algorithms to
Exploit hyper-links and access patterns
Be incremental
11. Web Mining: Subtasks
Resource finding
Retrieving intended documents
Information selection/pre-processing
Select and pre-process specific information from selected
documents
Generalization
Discover general patterns within and across web sites
Analysis
Validation and/or interpretation of mined patterns
12. Web Mining Issues
Size
Grows at about 1 million pages a day
Google indexes 9 billion documents
Number of web sites
Netcraft survey says 72 million sites
(http://news.netcraft.com/archives/web_server_survey.html)
Diverse types of data
Images
Text
Audio/video
XML
HTML
13. Web Mining Applications
E-commerce (Infrastructure)
Generate user profiles
Targeted advertising
Fraud
Similar image retrieval
Information retrieval (Search) on the Web
Automated generation of topic hierarchies
Web knowledge bases
Extraction of schema for XML documents
Network Management
Performance management
Fault management
15. Web Data Mining
Use of data mining techniques to
automatically discover interesting and
potentially useful information from Web
documents and services.
Web mining may be divided into three
categories:
1. Web content mining
2. Web structure mining
3. Web usage mining
17. Web Content Mining
Discovery of useful information from web
contents / data / documents
Web data contents:
1. text,
2. image,
3. audio,
4. video,
5. metadata and
6. hyperlinks
18. Web Content Mining
Examines the contents of web pages as well as the results of web
searching
Can be thought of as extending the work performed by basic
search engines
Search engines have crawlers to search the web and gather
information, indexing techniques to store the
information, and query processing support to provide
information to the users
Web Content Mining is: the process of extracting knowledge
from web contents
19. Web Content Mining
It provides no information about structure of
content that we are searching for and no
information about various categories of
documents that are found.
Need more sophisticated tools for searching or
discovering Web content.
20. Web Content mining
Discovering useful information from contents of Web
pages.
Web content is very rich, consisting of
text, image, audio, and video data, as well as
metadata and hyperlinks.
The data may be unstructured (free text) or
structured (data from a database) or semi-structured
(html) although much of the Web is unstructured.
21. Web Content Data Structure
Unstructured – free text
Semi-structured – HTML
More structured – Table or Database generated
HTML pages
Multimedia data – receive less attention than text or
hypertext
22. Web Content mining
Web content mining is related to data mining
and text mining
It is related to data mining because many data
mining techniques can be applied in Web content
mining.
It is related to text mining because much of the
web contents are texts.
Web data are mainly semi-structured and/or
unstructured, while data mining deals mainly with
structured data and text mining with unstructured text.
23. Web Content Data Structure
Web content consists of several types of data
Text, image, audio, video, hyperlinks.
Unstructured – free text
Semi-structured – HTML
More structured – Data in the tables or
database generated HTML pages
Note: much of the Web content data is unstructured
text data.
24. Semi-structured Data
Content is, in general, semi-structured
Example:
Title
Author
Publication_Date
Length
Category
Abstract
Content
25. Web Content Mining: IR View
Unstructured Documents
Bag of words, or phrase-based feature
representation
Features can be boolean or frequency based
Features can be reduced using different feature
selection techniques
Word stemming, combining morphological
variations into one feature
26. Web Content Mining: IR View
Semi-Structured Documents
Uses richer representations for features, based on
information from the document structure
(typically HTML and hyperlinks)
Uses common data mining methods (whereas
unstructured might use more text mining
methods)
27. Web Content Mining: DB View
Tries to infer the structure of a Web site or transform
a Web site to become a database
Better information management
Better querying on the Web
Can be achieved by:
Finding the schema of Web documents
Building a Web warehouse
Building a Web knowledge base
Building a virtual database
28. Web Content Mining: DB View
Mainly uses the Object Exchange Model (OEM)
Represents semi-structured data (some
structure, no rigid schema) by a labeled graph
Process typically starts with manual selection of Web
sites for content mining
Main application: building a structural summary of
semi-structured data (schema extraction or
discovery)
29. Tech for Web Content Mining
Classifications
Clustering
Association
30. Web Content Mining : Topics
Structured data extraction
Unstructured text extraction
Sentiment classification, analysis and summarization
of consumer reviews
Information integration and schema matching
Knowledge synthesis
Template detection and page segmentation
31. Structured Data Extraction
Most widely studied research topic
A large amount of information on the Web is
contained in regularly structured data objects
(retrieved from databases)
Such Web data records are important: they often
present the essential information of their host
pages, e.g., lists of products and services
32. Structured Data Extraction
Applications: integrated and value-added
services, e.g., Comparative shopping, meta-search &
query, etc
34. Structured Data Extraction: Approaches
Wrapper Generation
Write an extraction program for each website
based on observed format patterns
Labor intensive & time consuming
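A hand-written wrapper of the kind described above can be sketched as follows. This is only an illustration: the page fragment, its class names, and the function extract_products are hypothetical, standing in for the format patterns one would observe on a single real site.

```python
import re

# Hypothetical page fragment; the markup is an assumption made for illustration.
PAGE = """
<li class="item"><b>Laptop</b> <span class="price">$999.00</span></li>
<li class="item"><b>Mouse</b> <span class="price">$19.50</span></li>
"""

# Pattern written by hand after observing how this one site formats a record.
RECORD = re.compile(
    r'<li class="item"><b>(?P<name>[^<]+)</b>\s*'
    r'<span class="price">\$(?P<price>[\d.]+)</span></li>'
)

def extract_products(html):
    """Return (name, price) tuples for every record matching the site's format."""
    return [(m.group("name"), float(m.group("price")))
            for m in RECORD.finditer(html)]
```

Because the pattern is tied to one site's layout, it has to be rewritten whenever the layout changes or a new site is added, which is why the approach is labor intensive.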
38. Automatic Approach
Structured data objects on the web are normally
database records
Retrieved from databases & displayed in web
pages with fixed templates
Find patterns / grammars from the web pages &
then use them to extract data
e.g., IEPAD, MDR, ROADRUNNER, EXALG, etc.
39. Wrapper Induction or Wrapper Learning
Main technique currently
The user first manually labels a set of training
pages
A learning system then generates rules from the
training pages
The resulting rules are then applied to extract
target items from web pages
e.g. WIEN, Stalker, BWI, WL etc
40. Supervised Learning
Supervised learning is a machine learning technique for
creating a function from training data.
Documents are categorized
The output can predict a class label of the input object (called
classification).
Techniques used are
Nearest Neighbor Classifier
Feature Selection
Decision Tree
41. Feature Selection
Removes terms in the training documents which
are statistically uncorrelated with the class labels
Simple heuristics
Stop words like “a”, “an”, “the” etc.
Empirically chosen thresholds for ignoring “too
frequent” or “too rare” terms
Discard “too frequent” and “too rare” terms
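The simple heuristics above can be sketched as a document-frequency filter. This is a minimal illustration, not a statistical correlation test: the function name select_terms and the threshold values min_df and max_df_ratio are assumptions chosen for the example.

```python
from collections import Counter

STOP_WORDS = {"a", "an", "the"}

def select_terms(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms that are not stop words and whose document frequency
    lies between the 'too rare' and 'too frequent' thresholds."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        # Count each term once per document (document frequency).
        df.update(set(doc.lower().split()))
    return {
        term for term, count in df.items()
        if term not in STOP_WORDS
        and count >= min_df              # drop "too rare" terms
        and count / n <= max_df_ratio    # drop "too frequent" terms
    }
```

In practice the thresholds are tuned empirically on the training collection, as the slide notes.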
42. Examples of Discovered Patterns
Association rules
98% of AOL users also have E-trade accounts
Classification
People with age less than 40 and salary > 40k trade on-line
Clustering
Users A and B access similar URLs
Outlier Detection
User A spends more than twice the average amount of time
surfing on the Web
43. Important for improving customization
Provide users with pages, advertisements of interest
Example profiles: on-line trader, on-line shopper
Generate user profiles based on their access patterns
Cluster users based on frequently accessed URLs
Use classifier to generate a profile for each cluster
Engage technologies
Tracks web traffic to create anonymous user profiles of Web
surfers
Has profiles for more than 35 million anonymous users
44. Ads are a major source of revenue for Web
portals (e.g., Yahoo, Lycos) and E-commerce
sites
Plenty of startups doing internet advertising
Doubleclick, AdForce, Flycast, AdKnowledge
Internet advertising is probably the “hottest”
web mining application today
45. Scheme 1:
Manually associate a set of ads with each user
profile
For each user, display an ad from the set based on
profile
Scheme 2:
Automate association between ads and users
Use ad click information to cluster users (each user
is associated with a set of ads that he/she clicked
on)
For each cluster, find ads that occur most frequently
in the cluster and these become the ads for the set
of users in the cluster
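The last step of Scheme 2 can be sketched as below, assuming the users have already been clustered by some other procedure. The names ads_for_clusters, clusters, and clicks are hypothetical and used only for illustration.

```python
from collections import Counter

def ads_for_clusters(clusters, clicks, top_n=2):
    """For each cluster, return the ads clicked most often by its users.

    clusters: {cluster_id: [user, ...]}  (assumed to be computed elsewhere)
    clicks:   {user: set of ads the user clicked on}
    """
    result = {}
    for cluster_id, users in clusters.items():
        counts = Counter(ad for user in users for ad in clicks.get(user, ()))
        result[cluster_id] = [ad for ad, _ in counts.most_common(top_n)]
    return result
```

Every user in a cluster is then shown ads from that cluster's list, including ads the user has not clicked on yet.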
46. Internet Advertising
Use collaborative filtering (e.g. Likeminds, Firefly)
Each user Ui has a rating for a subset of ads (based
on click information, time spent, items bought etc.)
Rij - rating of user Ui for ad Aj
Problem: Compute user Ui’s rating for an unrated ad
Aj
[Figure: ratings over ads A1, A2, A3, with the unrated entry marked “?”]
47. Internet Advertising
Key Idea: User Ui’s rating for ad Aj is set to Rkj, where Uk
is the user whose rating of ads is most similar to Ui’s
User Ui’s rating for an ad Aj that has not been previously
displayed to Ui is computed as follows:
Consider a user Uk who has rated ad Aj
Compute Dik, the distance between Ui and Uk’s ratings on
common ads
Ui’s rating for ad Aj = Rkj (Uk is user with smallest Dik)
Display to Ui ad Aj with highest computed rating
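The nearest-user rule above can be sketched as follows. The slides do not say how the distance Dik is defined, so the mean absolute difference over commonly rated ads is an assumption here, as is the function name predict_rating.

```python
def predict_rating(ratings, user, ad):
    """Return Rkj, the rating for `ad` given by the user Uk closest to `user`.

    ratings: {user: {ad: rating}}
    Returns None if no other user has rated `ad` and shares a rated ad with `user`.
    """
    best_user, best_dist = None, float("inf")
    for other, other_ratings in ratings.items():
        if other == user or ad not in other_ratings:
            continue  # only consider users Uk who have rated ad Aj
        common = set(ratings[user]) & set(other_ratings)
        if not common:
            continue
        # Dik: mean absolute difference on common ads (an assumed distance).
        dist = sum(abs(ratings[user][a] - other_ratings[a]) for a in common) / len(common)
        if dist < best_dist:
            best_user, best_dist = other, dist
    return None if best_user is None else ratings[best_user][ad]
```

To choose the ad to display, one would call this for every ad not yet shown to the user and pick the ad with the highest predicted rating.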
48. With the growing popularity of E-commerce, systems to
detect and prevent fraud on the Web become important
Maintain a signature for each user based on buying
patterns on the Web (e.g., amount spent, categories of
items bought)
If buying pattern changes significantly, then signal fraud
HNC software uses domain knowledge and neural
networks for credit card fraud detection
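A very small version of the signature idea can be sketched as below. It is not how HNC's product works; it assumes the signature is just the mean and standard deviation of past purchase amounts, and the function name is_suspicious and the threshold k are hypothetical.

```python
from statistics import mean, pstdev

def is_suspicious(history, amount, k=3.0):
    """Flag a purchase whose amount deviates from the user's signature
    (mean and standard deviation of past amounts) by more than k deviations."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # All past purchases were identical; any different amount stands out.
        return amount != mu
    return abs(amount - mu) > k * sigma
```

A fuller signature would also cover the categories of items bought, as the slide mentions.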
49. Given:
A set of images
Find:
All images similar to a given image
All pairs of similar images
Sample applications:
Medical diagnosis
Weather prediction
Web search engine for images
E-commerce
50. QBIC, Virage, Photobook
Compute feature signature for each image
QBIC uses color histograms
WBIIS, WALRUS use wavelets
Use spatial index to retrieve database image whose
signature is closest to the query’s signature
WALRUS decomposes an image into regions
A single signature is stored for each region
Two images are considered to be similar if they have
enough similar region pairs
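The color-histogram signature idea can be sketched as follows. This is not QBIC's implementation: images are assumed to be non-empty lists of (r, g, b) pixels, the bin count and the L1 distance are assumptions, and a linear scan stands in for the spatial index.

```python
def color_histogram(pixels, bins=4):
    """Signature: normalized histogram over bins**3 coarse RGB cells."""
    step = 256 // bins
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1
    total = len(pixels)
    return [h / total for h in hist]

def histogram_distance(h1, h2):
    """L1 distance between two signatures (an assumed similarity measure)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def most_similar(query_pixels, database):
    """Return the name of the database image whose signature is closest
    to the query's signature. database: {name: pixels}."""
    q = color_histogram(query_pixels)
    return min(database,
               key=lambda name: histogram_distance(q, color_histogram(database[name])))
```

A real system would precompute the signatures and store them in a spatial index instead of recomputing them for every query.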
52. Today’s search engines are plagued by
problems:
the abundance problem (99% of info of no
interest to 99% of people)
limited coverage of the Web (internet
sources hidden behind search interfaces)
Largest crawlers cover < 18% of all web
pages
limited query interface based on keyword-
oriented search
limited customization to individual users
53. Today’s search engines are plagued by
problems:
Web is highly dynamic
Lot of pages added, removed, and updated every
day
Very high dimensionality
54. Use Web directories (or topic hierarchies)
Provide a hierarchical classification of documents (e.g., Yahoo!)
Searches performed in the context of a topic restricts the search to only
a subset of web pages related to the topic
[Figure: Yahoo home page topic hierarchy, with categories Recreation, Science, Business, News and subcategories Sports, Travel, Companies, Finance, Jobs]
55. In the Clever project, hyper-links between Web pages
are taken into account when categorizing them
Use a Bayesian classifier
Exploit knowledge of the classes of immediate neighbors of
document to be classified
Show that simply taking text from neighbors and using
standard document classifiers to classify page does not work
Inktomi’s Directory Engine uses “Concept Induction” to
automatically categorize millions of documents
56. Objective: To deliver content to users quickly and
reliably
• Traffic management
• Fault management
[Figure: service provider network with routers and servers]
57. While annual bandwidth demand is increasing ten-fold
on average, annual bandwidth supply is rising only by
a factor of three
Result is frequent congestion at servers and on
network links
During a major event (e.g., Princess Diana’s death), an
overwhelming number of user requests can result in millions
of redundant copies of data flowing back and forth across the
world
Olympic sites during the games
NASA sites close to launch and landing of shuttles
58. Key Ideas
Dynamically replicate/cache content at multiple sites within the
network and closer to the user
Multiple paths between any pair of sites
Route user requests to server closest to the user or least
loaded server
Use path with least congested network links
Akamai, Inktomi
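The routing choice above (closest server or least loaded server) can be sketched as below. This is not how Akamai or Inktomi route requests; the function pick_server, the 1-D positions, and the max_load threshold are assumptions made for illustration.

```python
def pick_server(servers, user_pos, max_load=0.8):
    """Pick the closest server that is not overloaded; if every server is
    overloaded, fall back to the least loaded one.

    servers: list of dicts with hypothetical keys
             "name", "pos" (1-D position), and "load" (0.0 to 1.0).
    """
    usable = [s for s in servers if s["load"] <= max_load]
    if usable:
        return min(usable, key=lambda s: abs(s["pos"] - user_pos))["name"]
    return min(servers, key=lambda s: s["load"])["name"]

SERVERS = [
    {"name": "east", "pos": 10, "load": 0.95},
    {"name": "central", "pos": 40, "load": 0.30},
    {"name": "west", "pos": 90, "load": 0.50},
]
```

Here a user at position 12 is sent to "central" because "east", although closer, is above the load threshold.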
60. Need to mine network and Web traffic to determine
What content to replicate?
Which servers should store replicas?
Which server to route a user request to?
What path to use to route packets?
Network Design issues
Where to place servers?
Where to place routers?
Which routers should be connected by links?
One can use association rules and sequential pattern mining
algorithms to cache/prefetch replicas at servers
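As a simplified stand-in for sequential pattern mining, the sketch below counts which page most often follows each page in past sessions and proposes it as the prefetch candidate. The function names and the session format (a list of URL paths per visit) are assumptions.

```python
from collections import Counter, defaultdict

def next_page_model(sessions):
    """Count page -> next-page transitions over all sessions."""
    transitions = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            transitions[current][following] += 1
    return transitions

def prefetch_candidate(model, page):
    """Return the page most often requested right after `page`, or None."""
    if page not in model:
        return None
    return model[page].most_common(1)[0][0]
```

Real sequential pattern mining considers longer sequences and minimum-support thresholds; first-order counts are only the simplest case.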
61. Fault management involves
Quickly identifying failed/congested servers and links in network
Re-routing user requests and packets to avoid congested/down servers and
links
Need to analyze alarm and traffic data to carry out root cause analysis of
faults
Bayesian classifiers can be used to predict the root cause given a set of
alarms
63. Web data sets can be very large
Tens to hundreds of terabytes
Cannot mine on a single server!
Need large farms of servers
How to organize hardware/software to
mine multi-terabyte data sets
Without breaking the bank!
65. Pages contain information
Links are ‘roads’
How do people navigate the Internet?
Web Usage Mining (clickstream analysis)
Information on navigation paths
available in log files
Logs can be mined from a client or a
server perspective
66. Why analyze Website usage?
Knowledge about how visitors use Website could
Provide guidelines to web site reorganization; Help prevent
disorientation
Help designers place important information where the visitors
look for it
Pre-fetching and caching web pages
Provide adaptive Website (Personalization)
Questions which could be answered
What are the differences in usage and access patterns
among users?
What user behaviors change over time?
How usage patterns change with quality of service
(slow/fast)?
What is the distribution of network traffic over time?
69. Analog – Web Log File Analyser
Gives basic statistics such as
number of hits
average hits per time period
what are the popular pages in your site
who is visiting your site
what keywords are users searching for to get to
you
what is being downloaded
http://www.analog.cx/
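Basic statistics such as hits per page can be computed from a server log along the following lines. This is not Analog's code; it assumes log lines in the Common Log Format, and the function name page_hits is hypothetical.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "METHOD path protocol" status size
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)'
)

def page_hits(lines):
    """Count successful (status 200) requests per requested path."""
    hits = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group(5) == "200":
            hits[m.group(4)] += 1
    return hits
```

The same parsed fields (host, timestamp, path) are the raw material for the other statistics listed above, such as hits per time period or visitors per host.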
73. Content is, in general, semi-structured
Example:
Title
Author
Publication_Date
Length
Category
Abstract
Content
74. Many methods designed to analyze structured data
If we can represent documents by a set of attributes
we will be able to use existing data mining methods
How to represent a document?
Vector-based representation (referred to as “bag of
words” as it is invariant to permutations)
Use statistics to add a numerical dimension to
unstructured text
75. A document representation aims to capture what the
document is about
One possible approach:
Each entry describes a document
Attributes describe whether or not a term appears in the
document
76. Another approach:
Each entry describes a document
Attributes represent the frequency with
which a term appears in the document
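The two representations above (term presence and term frequency) can be sketched as follows. Whitespace tokenization and the function names are assumptions made to keep the example short.

```python
from collections import Counter

def build_vocabulary(docs):
    """Sorted list of all distinct terms; each term becomes one attribute."""
    return sorted({term for doc in docs for term in doc.lower().split()})

def boolean_vector(doc, vocab):
    """1 if the term appears in the document, 0 otherwise."""
    terms = set(doc.lower().split())
    return [1 if term in terms else 0 for term in vocab]

def frequency_vector(doc, vocab):
    """Number of times each vocabulary term appears in the document."""
    counts = Counter(doc.lower().split())
    return [counts[term] for term in vocab]
```

Because only counts are kept, reordering the words of a document leaves both vectors unchanged, which is the "bag of words" property.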
77. Stop word removal: many words are not
informative and thus irrelevant for document
representation: the, and, a, an, is, of, that, …
Stemming: reducing words to their root form
(reduces dimensionality)
A document may contain several occurrences of
words like fish, fishes, fisher, and fishers, but would
not be retrieved by a query with the keyword
“fishing”
Words that share the same stem should be
represented by that stem (“fish”) instead of the
actual words
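Both preprocessing steps can be sketched as below. The stemmer is a deliberately crude suffix stripper written for this example, not the Porter algorithm or any library stemmer; the suffix list and the minimum stem length are assumptions.

```python
STOP_WORDS = {"the", "and", "a", "an", "is", "of", "that"}

# Longest suffixes first so "fishers" loses "ers" rather than just "s".
SUFFIXES = ("ers", "ing", "es", "er", "s")

def stem(word):
    """Strip one known suffix, keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop stop words, and reduce remaining words to stems."""
    return [stem(word) for word in text.lower().split() if word not in STOP_WORDS]
```

After this step a query for "fishing" and a document containing "fishers" both map to the single feature "fish".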