1. UNDERSTANDING DEEP WEB SEARCH INTERFACES
Prospectus Presentation
April 03, 2009
Ritu Khare
2. Presentation Order
(This presentation uses this space for writing additional facts.)

Problem Statement
  The Deep Web & Challenges
  Understanding the Semantics of Search Interfaces
  Challenges and Significance
Literature Review Results (How do existing approaches solve this problem?)
  Settings
  Reductionist Analysis
  Holistic Analysis
Research Questions and Design Ideas (What are the research gaps? How to fill them?)
  Techniques Vs Heterogeneity
  Semantics and Artificial Designer
3. PROBLEM STATEMENT
The Deep Web
Challenges in Accessing the Deep Web
Doors of Opportunity
The SIU Process
About the Stages of the Process
Why is SIU Challenging?
Why is SIU Significant?
4. The Deep Web
(It has many other names: Hidden Web, Dark Web, Invisible Web, subject-specific databases, data-intensive Web sites.)

What is the DEEP WEB?
The portion of Web resources that is not returned by search engines through traditional crawling and indexing.
[Figure: the Web divided into the Surface Web (as seen by search engines), the almost-visible Web, and the Deep Web.]

Where do the contents LIE?
Online Databases
5. The Deep Web (continued)

How are the contents ACCESSED?
By filling in HTML forms on search interfaces.

How are they PRESENTED to users?
Dynamic Pages / Result Pages / Response Pages
6. Challenges in Accessing Deep Web Contents
(Quick fact: the Deep Web includes 307,000 sites, 450,000 databases, and 1,258,000 interfaces.)

The deep Web is 500 times larger than the rest of the Web (BrightPlanet.com, 2001), and grew 3-7 times from 2000 to 2004 (He et al., 2007a).
The deep Web remains invisible on the Web: a user visits several interfaces before finding the right information, and manually reconciles information obtained from different sources.
Alternative approaches are not scalable: Invisible Web directories and search engine browse directories cover only 37% of the deep Web (He et al., 2007a).
7. Opportunities in Accessing Deep Web Contents
(Interesting fact: there exist at least 10 million high-quality HTML forms on the deep Web.)

HTML forms on search interfaces provide a useful way of discovering the underlying database structure.
The labels attached to fields are expressive and meaningful.
Instructions for users entering data may provide information on data constraints (such as the range or domain of the data) and integrity constraints (mandatory / optional attributes).
In the last decade, several prominent researchers have focused on the PROBLEM OF UNDERSTANDING SEARCH INTERFACES.
8. The Search Interface Understanding (SIU) Process

[Figure: the SIU pipeline. Input: a search interface backed by an online DB. Stages: A. Representation, B. Parsing, C. Segmentation, D. Segment Processing, E. Evaluation (against a manually tagged search interface). Output: a system-tagged search interface and the extracted DB structure.]

The SIU process is challenging because search interfaces are designed autonomously by different designers and thus do not have a standard structure (Halevy, 2005).
9. A. Representation and Modeling
(This stage builds up the foundation for the process.)

This stage formalizes the information to be extracted from a search interface:
interface component: any text or form element
semantic label: the meaning of a component from a user's standpoint
segment: a composite component formed by a group of related components
segment label: the semantic label of a segment
10. A. Representation and Modeling (continued)

Zhang et al. (2004) represent an interface as a list of query conditions.
Segment Label = Query Condition
A segment consists of the following semantic labels: an attribute name, an operator, and a value.
11. B. Parsing
(This stage is the first task physically performed on the interface.)

The interface is parsed into a workable memory structure. This can be done in two modes:
by reading the HTML source code;
by rendering the page on a Web browser, either manually or automatically using a visual layout engine.
12. B. Parsing (continued)

He et al. (2007b) parse the interface into an interface expression (IEXP) with constructs t (corresponding to any text), e (corresponding to any form element), and | (corresponding to a row delimiter).
The IEXP for the figure is: t|te|teee
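The IEXP idea can be sketched in a few lines. This is only an illustration, assuming tokens have already been extracted from the page with their row positions; the real parser works directly on HTML, and the token list below is hypothetical.

```python
# Illustrative sketch of IEXP generation (He et al., 2007b): each interface
# token becomes 't' (text) or 'e' (form element), with '|' between rows.

def to_iexp(tokens):
    """tokens: list of (kind, row) pairs, kind is 'text' or 'element'."""
    out = []
    prev_row = None
    for kind, row in tokens:
        if prev_row is not None and row != prev_row:
            out.append('|')          # row delimiter (<BR>, <P>, or </TR>)
        out.append('t' if kind == 'text' else 'e')
        prev_row = row
    return ''.join(out)

# Hypothetical pre-parsed interface: a title row, a label + textbox row,
# and a label + three form elements row.
tokens = [('text', 0),
          ('text', 1), ('element', 1),
          ('text', 2), ('element', 2), ('element', 2), ('element', 2)]
print(to_iexp(tokens))  # → t|te|teee
```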
13. C. Segmentation
(Techniques used: rules, heuristics, machine learning.)

A segment has a semantic existence but no physically defined boundaries, which makes this stage a challenging one. It involves:
Grouping of semantically related components (a sub-problem is associating a surrounding text with a form element)
Assignment of semantic labels to components
14. C. Segmentation (continued)

He et al. (2007b) use a heuristic-based method, LEX, to group elements and text labels together. One heuristic used by LEX is that a text and a form element lying on the same line are likely to belong to one segment. In the figure, the three components "Gene Name", the radio buttons with options 'Exact Match' and 'Ignore Case', and the textbox belong to one segment.
[Figure: a logical attribute composed of an attribute label, a constraint element, and a domain element.]
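The same-line heuristic can be sketched as follows. The component records are hypothetical: the first line mirrors the "Gene Name" segment from the figure, and the second ("Organism") line is invented for illustration. Real LEX combines several further cues (ending colon, textual similarity, alignment, distance).

```python
# Sketch of one LEX grouping heuristic (He et al., 2007b): components that
# lie on the same line are grouped into one candidate segment.

def group_by_line(components):
    """components: list of dicts with 'kind', 'value', and 'line' keys."""
    segments = {}
    for c in components:
        segments.setdefault(c['line'], []).append(c)
    # one segment per line, in top-to-bottom order
    return [segs for _, segs in sorted(segments.items())]

components = [
    {'kind': 'text',    'value': 'Gene Name',   'line': 1},
    {'kind': 'radio',   'value': 'Exact Match', 'line': 1},
    {'kind': 'radio',   'value': 'Ignore Case', 'line': 1},
    {'kind': 'textbox', 'value': '',            'line': 1},
    {'kind': 'text',    'value': 'Organism',    'line': 2},  # invented line
    {'kind': 'select',  'value': 'Any',         'line': 2},
]
for seg in group_by_line(components):
    print([c['value'] for c in seg])
```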
15. D. Segment Processing
(Techniques used: rules, heuristics, machine learning.)

In this stage:
Each segment is further tagged with additional meta-information about itself and its components.
The extracted information is post-processed: normalization, stemming, removal of stop words.

He et al. (2007b)'s LEX extracts meta-information about each extracted segment using the Naive Bayes classification technique. The extracted information for a segment includes the domain type (finite, infinite), unit (miles, sec), value type (numeric, character, etc.), and layout order position (in the IEXP).
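A toy illustration of Naive Bayes classification for one piece of meta-information (the domain type). The features and the four training examples are invented for this sketch and are not LEX's actual feature set.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (feature_tuple, label) pairs."""
    priors = Counter(lbl for _, lbl in examples)
    cond = defaultdict(Counter)        # (label, feature_index) -> value counts
    for feats, lbl in examples:
        for i, f in enumerate(feats):
            cond[(lbl, i)][f] += 1
    return priors, cond

def classify_nb(feats, priors, cond, alpha=1.0):
    total = sum(priors.values())
    best, best_logp = None, float('-inf')
    for lbl, n in priors.items():
        logp = math.log(n / total)
        for i, f in enumerate(feats):
            c = cond[(lbl, i)]
            # Laplace smoothing; assumes two possible values per feature
            logp += math.log((c[f] + alpha) / (sum(c.values()) + 2 * alpha))
        if logp > best_logp:
            best, best_logp = lbl, logp
    return best

# Invented training data: (element type, has enumerable values) -> domain type
train = [(('select', True), 'finite'), (('radio', True), 'finite'),
         (('textbox', False), 'infinite'), (('textbox', False), 'infinite')]
priors, cond = train_nb(train)
print(classify_nb(('select', True), priors, cond))    # → finite
print(classify_nb(('textbox', False), priors, cond))  # → infinite
```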
16. E. Evaluation
(An approach is usually tested on a set of interfaces belonging to a particular domain.)

How accurate is the extracted information?
The system-generated segmented and tagged interface is compared with the manually segmented and tagged interface.
The results are evaluated using standard metrics (precision, recall, accuracy, etc.).
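A minimal sketch of this comparison: system-extracted (label, element) pairs are matched against a manually tagged gold standard, and precision and recall are computed. The pairs below are invented for illustration.

```python
# Evaluation sketch: compare system output to a manual gold standard.

def precision_recall(system, gold):
    system, gold = set(system), set(gold)
    tp = len(system & gold)                       # correctly extracted pairs
    precision = tp / len(system) if system else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

gold = {('Title', 'textbox1'), ('Author', 'textbox2'), ('Format', 'select1')}
system = {('Title', 'textbox1'), ('Author', 'textbox2'), ('Format', 'textbox2')}
p, r = precision_recall(system, gold)
print(p, r)  # two of the three extracted pairs are correct
```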
17. Why is SIU Significant?
(Significance: SIU is a prerequisite for several advanced deep Web applications.)

Researchers have proposed solutions to make deep Web contents more useful to users. These solutions can be divided into the following categories based on their goals:
To Increase Content Visibility on Search Engines
  Building a Dynamic Page Repository: Raghavan and Garcia-Molina (2001)
  Building a Database Content Repository: Madhavan et al. (2008)
To Increase Domain-specific Usability
  Meta-search Engines: Wu et al. (2004), He et al. (2004), Chang, He and Zhang (2005), Pei et al. (2006), He and Chang (2003), and Wang et al. (2004)
To Attain Knowledge Organization
  Derivation of Ontologies: Benslimane et al. (2007)
These solutions can only be materialized by leveraging the opportunities provided by search interfaces.
18. LITERATURE REVIEW RESULTS
Review Settings
Reductionist Analysis
Holistic Analysis
Progress Made
19. Literature Review Process
(Alias: for quick reference, each work is assigned an alias.)

Reviewed research works that propose approaches for performing the SIU process in the context of the deep Web:

S.No.  Reference                           Alias
1      Raghavan and Garcia-Molina (2001)   LITE
2      Kalijuvee et al. (2001)             CombMatch
3      Wu et al. (2004)                    FieldTree
4      Zhang et al. (2004)                 HSP
5      Shestakov et al. (2005)             DEQUE
6      Pei et al. (2006)                   AttrList
7      He et al. (2007)                    LEX
8      Benslimane et al. (2007)            FormModel
9      Nguyen et al. (2008)                LabelEx
20. Literature Review Process
(Dimension: facilitates comparison among different works by placing them under the same umbrella.)

The review was done in two phases:
Reductionist Analysis: The works were decomposed into small pieces.
  Each work was visualized as a two-dimensional grid where the horizontal sections refer to stages of the SIU process.
  For each stage, the works were analyzed along vertical degrees of analysis known as stage-specific dimensions.
Holistic Analysis: Each work was studied in its entirety, within a big-picture context.
  Composite dimensions were created out of the stage-specific dimensions.
21. Reductionist Analysis: Representation
(Stage under analysis: A. Representation, out of A. Representation, B. Parsing, C. Segmentation, D. Segment Processing, E. Evaluation.)

Work: HSP
  Segment and its contents: Conditional Pattern: Attribute-name, Operator*, and Value+
  Text label : form element: 1:M

Work: DEQUE
  Segment and its contents: Field segment: f, Name(f), Label(f), domain(f), type(f), where f = a field and F = a form
  Text label : form element: 1:1
  Meta-information: JavaScript functions; visible and invisible values; Subinfo(F) = {action, method, enctype}; Iset(F) = the initial field set that can be submitted without completing the form

Work: AttrList
  Segment and its contents: Attribute: Attribute-name, description, and form element
  Text label : form element: 1:1
  Meta-information: domain information for each attribute (set of values and data types)

Work: LEX
  Segment and its contents: Logical Attribute Ai: attr-label L, a list of domain elements {Ej, ..., Ek}, and element labels
  Text label : form element: 1:M (1:1 in the case of "element label" : form element)
  Meta-information: site information and form constraints. Ai = (P, U, Re, Ca, DT, DF, VT), where Ai = the ith attribute, P = layout order position, U = unit, Re = relationship type, Ca = domain element constraint, DT = domain type, DF = default value, VT = value type. Ei = (N, Fe, V, DV), where N = internal name, Fe = format, V = set of values, DV = default value.
22. Reductionist Analysis: Parsing

Work: LITE
  Input mode: HTML source code and visual interface
  Basic step: pruning: discard images and isolate the elements that directly influence the layout of form elements and labels
  Cleaning up: ignore styling information such as font size, font style, and style sheets
  Resulting structure: Pruned Page

Work: CombMatch
  Input mode: HTML source code
  Basic step: chunk partitioning, and finding meta-information about each chunk: find bounding HTML tags, text strings delimited by table cell tags, etc.
  Cleaning up: stop phrases ("optional", "required", "*") and text-formatting HTML tags
  Resulting structure: Chunk List and Table Index List; each chunk is represented as an 8-tuple describing its meta-information

Work: DEQUE
  Input mode: HTML text and visual interface
  Basic step: preparing the form database: a DOM tree is created for each FORM element
  Cleaning up: ignore font size, typefaces, and styling information
  Resulting structure: Pruned Tree

Work: LEX
  Input mode: HTML source code
  Basic step: interface expression generation: t = text, e = element, | = row delimiter (<BR>, <P>, or </TR>)
  Resulting structure: String
23. Reductionist Analysis: Segmentation

Work: CombMatch
  Problem: assigning a text label to an input element
  Segmentation criteria: a combination of string-similarity and spatial-similarity algorithms
  Technique: heuristics (string properties, proximity, and layout)

Work: HSP
  Problem: finding the 3-tuple <attribute name, operators, values>
  Segmentation criteria: a grammar (set of rules) based on productions and preferences
  Technique: rules (a best-effort parser builds a parse tree)

Work: LEX
  Problem: assigning text labels to attributes, and element labels to domain elements
  Segmentation criteria: ending colon, textual similarity with the element name, vertical alignment, distance, preference to the current row
  Technique: heuristics (string properties, layout, and proximity)

Work: LabelEx
  Problem: assigning a text label to a form element
  Segmentation criteria: classifiers (Naive Bayes and Decision Tree); features include spatial features, element type, font type, internal-name similarity, alignment, label placement, and distance
  Technique: supervised machine learning
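The heuristic criteria above can be combined into a simple scoring function. This is an illustration in the spirit of LEX and CombMatch, not any published scorer: the weights, the `difflib` similarity measure, and the candidate labels are all invented for the sketch.

```python
import difflib

def label_score(text, element_name, distance):
    """Score a candidate text label for a form element using three cues."""
    score = 0.0
    if text.rstrip().endswith(':'):
        score += 1.0                                   # ending-colon cue
    sim = difflib.SequenceMatcher(
        None, text.lower().strip(': '), element_name.lower()).ratio()
    score += sim                                       # textual-similarity cue
    score += 1.0 / (1 + distance)                      # proximity cue
    return score

# Hypothetical candidates for an element internally named 'depart_city'
candidates = [('Departure city:', 2), ('Search', 9)]
best = max(candidates, key=lambda c: label_score(c[0], 'depart_city', c[1]))
print(best[0])  # → Departure city:
```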
24. Reductionist Analysis: Segment Processing

Work: HSP
  Post-processing: the Merger module reports conflicting tokens (those that occur in two query conditions) and missing tokens (those that do not occur in any query condition)

Work: LEX
  Technique for extracting meta-information: Naive Bayes classification (supervised machine learning)
  Post-processing: removal of meaningless stopwords (the, with, any, etc.)

Work: FormModel
  Technique for extracting meta-information: learning by examples (machine learning)

Work: LabelEx
  Post-processing: heuristics for reconciling multiple labels assigned to one element, and for handling dangling elements
25. Reductionist Analysis: Evaluation

Work: LITE
  Test domains: Semiconductor Industry, Database Technology
  Yahoo subject categories: Science, Entertainment, Movies, Computers & Internet
  Compared with: CombMatch (in terms of methodology)
  Metrics: accuracy

Work: HSP
  Test domains: airfare, automobile, book, job, real estate, car rental, hotel, movies, music records
  Yahoo subject categories: Business & Economy, Recreation & Sports, Entertainment
  Compared with: four datasets from different sources, collected by the authors
  Metrics: precision, recall

Work: LabelEx
  Test domains: airfare, automobiles, books, movies
  Yahoo subject categories: Business & Economy, Recreation & Sports, Entertainment
  Compared with: Barbosa et al. (2007)'s dataset and HSP (in terms of datasets); classifier ensemble with or without Mapping Reconciliation (MR); generic classifier vs. domain-specific classifier; generic classifier with MR vs. domain-specific classifier with MR; HSP and LEX (in terms of methodology)
  Metrics: recall, precision, F-measure
26. Holistic Analysis

Work: LITE
  Type of semantics: partial form capabilities (label associated with a form element)
  Technique: heuristics; human involvement: none
  Target application: deep Web crawler (search engine visibility)

Work: HSP
  Type of semantics: query capability (attribute name, operator, and values)
  Technique: rules; human involvement: manual specification of grammar rules
  Target application: meta-searchers (domain-specific usability)

Work: LEX
  Semantics: components belonging to the same logical attribute (labels and form elements); technique: heuristics; human involvement: none
  Semantics: meta-information; technique: supervised machine learning; human involvement: training data for the classifier
  Target application: meta-searchers (domain-specific usability)

Work: FormModel
  Semantics: structural units (groups of fields belonging to the same entity); technique: NOT REPORTED; human involvement: unknown
  Semantics: partial form capabilities (label associated with a form element); technique: heuristics; human involvement: none
  Semantics: meta-information; technique: supervised machine learning; human involvement: training data for learning by examples
  Target application: ontology derivation (knowledge organization)

Work: LabelEx
  Type of semantics: partial form capabilities (label associated with a form element)
  Technique: supervised machine learning; human involvement: classifier training data was manually tagged
  Target application: deep Web in general (search engine visibility, domain-specific usability)
27. Progress Made

SEMANTICS modeled and extracted (Stages A and B)
  from merely stating what we see, to stating what is meant by what we see
  from merely associating labels with form elements, to discovering query capabilities
  from no meta-information to a lot of meta-information that might be useful for the target application
TECHNIQUES employed (Stages C and D)
  A mild transition from naive techniques (rule-based and heuristic-based) to sophisticated techniques (supervised machine learning).
DOMAINS explored (Stage E)
  Only commercial domains: books, used cars, movies, etc.
  Still unexplored non-commercial domains: yahoo.com subject categories such as regional, society and culture, education, arts and humanities, science, reference, and others.
28. RESEARCH QUESTIONS
Techniques Vs Design Heterogeneity
Techniques Vs Domain Heterogeneity
Simulating a Human Designer
29. Research Questions
(Derived from the holistic and reductionist analyses.)

R.Q.#1 Technique Vs Design Heterogeneity
What is the correlation between the technique employed and the ability to handle heterogeneity in the design of interfaces?
R.Q.#2 Technique Vs Domains
How can we design approaches that work well for arbitrary domains, and thus prevent the need to design domain-specific approaches?
R.Q.#3 Simulating a Human Designer
How can we make a machine understand an interface in the same way a human designer does?
30. Research Question #1
What is the correlation between the technique employed and the ability to handle heterogeneity in the design of interfaces?
(Technique is a dimension of the Segmentation and Segment Processing stages.)

Elaborating the Question
Techniques: rules, heuristics, and machine learning.
Design: the arrangement of interface components.
Handling heterogeneity in design: being able to perform the following tasks for any kind of design:
  Segmentation
    Semantic Tagging
    Grouping (label assignment is a part of this)
  Segment Processing
31. Research Question #1 (continued)

[Figure: examples of design heterogeneity. The automobile domain shows segments with multiple attribute-names, an operator, and an operand; the movie domain shows a different arrangement of attribute-name and operand.]
32. Research Question #1 (continued)
(This question has been only partially explored.)

Existing Efforts to Answer
A 2002 study (Kushmerick, 2002) suggests the superiority of machine learning techniques over rule-based and heuristic-based techniques for handling design heterogeneity in general.
A 2008 study (Nguyen et al., 2008) compared the label assignment accuracy (a part of grouping accuracy) of three approaches: rule-based (HSP), heuristic-based (LEX), and machine learning based (LabelEx). The machine learning technique outperformed the other two.
33. Investigating R.Q.#1: Technique Vs Design Heterogeneity
(Tasks to test: Segmentation (grouping, semantic tagging) and Segment Processing.)

However, there is NO comparative study of overall grouping, semantic tagging, and segment processing.

Experiment: A machine learning technique based on Hidden Markov Models (HMMs) was designed and tested on a dataset from the biology domain.
  Grouping accuracy (label assignment included): 86%, a 10% improvement over the heuristic-based state-of-the-art approach LEX.
  Semantic tagging accuracy: 90%, a 17% improvement over a heuristic-based algorithm designed for comparison.

Planned comparisons:
  Segmentation performance: machine learning vs. rule-based.
  Segment processing performance: rules vs. heuristics vs. machine learning.
  Various machine learning techniques: classification vs. HMM vs. ...
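A minimal sketch of the HMM idea behind these experiments: semantic roles are hidden states, interface tokens are observations, and Viterbi decoding recovers the most likely labeling. All probabilities and the token vocabulary below are invented for illustration, not trained values from the study.

```python
# Viterbi decoding over invented HMM parameters for interface tagging.

def viterbi(obs, states, start, trans, emit):
    V = [{s: start[s] * emit[s].get(obs[0], 1e-6) for s in states}]
    path = {s: [s] for s in states}
    for o in obs[1:]:
        V.append({})
        new_path = {}
        for s in states:
            # best previous state for reaching s while emitting o
            p, prev = max((V[-2][q] * trans[q].get(s, 1e-6) *
                           emit[s].get(o, 1e-6), q) for q in states)
            V[-1][s] = p
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return path[best]

states = ['attr', 'op', 'operand']
start = {'attr': 0.8, 'op': 0.1, 'operand': 0.1}
trans = {'attr': {'op': 0.6, 'operand': 0.4},
         'op': {'operand': 0.9, 'attr': 0.1},
         'operand': {'attr': 0.8, 'op': 0.2}}
emit = {'attr': {'Price': 0.5, 'Make': 0.5},
        'op': {'less than': 0.9, 'at least': 0.1},
        'operand': {'<textbox>': 0.7, '<select>': 0.3}}
print(viterbi(['Price', 'less than', '<textbox>'], states, start, trans, emit))
# → ['attr', 'op', 'operand']
```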
34. Investigating R.Q.#1: Technique Vs Design Heterogeneity
(Human intervention is a dimension of the holistic analysis.)

There is NO comparative study measuring human intervention in these techniques.

Experiment: monitoring human intervention (IN PROGRESS); compares rule-based vs. heuristics vs. machine learning.
  Rule-based: manual crafting of rules.
  Heuristics: manual observations.
  Machine learning: manual tagging.
Experiment: the HMM was trained using the unsupervised Baum-Welch training algorithm, evaluated by P(O|λ). Results are not promising so far.
Next step: designing unsupervised techniques.
35. Research Question #2
How can we design approaches that work well for arbitrary domains, and thus prevent the need to design domain-specific approaches?
(Domain tested is a dimension of the Evaluation stage.)

Elaborating the Question
Domain heterogeneity: the deep Web is heterogeneous in terms of domains, i.e., it has databases belonging to all 14 subject categories of Yahoo (Arts & Humanities, Business & Economy, Computers & Internet, Education, Entertainment, Government, Health, etc.).
How do we design generic approaches that work for many domains?
How do interface designs differ across domains?
Which technique should be employed?
36. Research Question #2 (continued)
(The deep Web has a balanced domain distribution.)

Existing Efforts to Answer
2004: A single grammar (rule-based) generates reasonably good segmentation performance (grouping & semantic tagging) for all domains (Zhang et al., 2004). Higher accuracy can be attained using domain-specific techniques, which are not feasible to design using rules (Nguyen et al., 2008).
2008: For label assignment (a portion of grouping), domain-specific classifiers result in higher accuracy than generic classifiers (Nguyen et al., 2008).

Still missing:
  A comparison of domain-specific and generic approaches on overall segmentation performance
  The design differences across domains
  Generic approaches that give equally good results for as many domains as possible
37. Investigating R.Q.#2
(Design tendencies of designers from different domains are different.)

[Figure: HMM state-transition diagrams for four domains (Movie, References & Education, Biology, and Automobile) over the states Attribute-name, Operator, Operand, and Text-trivial. The transition probabilities differ markedly across domains.]
38. Investigating R.Q.#2: Technique Vs Domain
(All experiments done using the machine learning technique HMM.)

Domain      Experiment                              Metric                  Winner (improvement)
Movie       Domain-specific HMM vs. generic HMM     Segmentation accuracy   Generic HMM (4.4%)
Ref & Edu   Domain-specific HMM vs. generic HMM     Segmentation accuracy   Domain-specific HMM (7%)
Automobile  Domain-specific HMM vs. generic HMM     Segmentation accuracy   Domain-specific HMM (8%)
Biology     Domain-specific HMM vs. generic HMM     Segmentation accuracy   Domain-specific HMM (36%)

What is the correlation between design topology and the performance of a domain-specific model?
39. Research Question #3
How can we make a machine understand an interface and extract semantics from it in the same way a human designer does?

A human designer or user naturally understands the design and semantics of an interface based on visual cues and prior experience.
A machine cannot really "see" an interface and does not have any implicit Web search experience. (How much do visual layout engines assist?)
Hence, there is a difference between the way a machine perceives an interface and the way a designer perceives it.
How can we reconcile these differences?
40. Investigating R.Q.#3: Simulating a Human Designer
(Existing methods have been able to understand design, attach semantic labels, and derive segments and query capabilities.)

Hypothesis: A machine can be made to understand an interface in the same way a human designer does if it is enabled to discover the deep source of knowledge that created the interface in the first place.

[Figure: a designer/modeler uses Web design knowledge and a conceptual model to design the search interface; the machine, given the interface, should understand the design, attach semantic labels, derive segments, derive query capabilities, and recover the DB schema.]

Extracting the DB schema and conceptual model is still an open question.
41. Connecting the Dots

[Figure: the three research questions mapped onto the SIU pipeline. R.Q.1 covers the path from the search interface through design understanding to semantic labels, segments, and query capabilities; R.Q.2 covers handling many search interfaces across domains; R.Q.3 covers recovering the designer's Web design knowledge, conceptual model, and DB schema, toward a conceptual-model-based interface.]
42. THANK YOU!
Suggestions, Comments, Thoughts, Ideas, Questions...

Acknowledgements: To my Prospectus Committee members.
References: [1] to [42] (in the prospectus report).