1. UNDERSTANDING DEEP WEB SEARCH INTERFACES
Prospectus Presentation
April 03, 2009
Ritu Khare
2. Presentation Order
(This presentation uses this space for writing additional facts.)

Problem Statement
  The Deep Web & Challenges
  Understanding the Semantics of Search Interfaces
  Challenges and Significance
Literature Review Results (How do existing approaches solve this problem?)
  Settings
  Reductionist Analysis
  Holistic Analysis
Research Questions and Design Ideas (What are the research gaps? How to fill them?)
  Techniques Vs Heterogeneity
  Semantics and Artificial Designer
3. PROBLEM STATEMENT
The Deep Web
Challenges in Accessing the Deep Web
Doors of Opportunity
The SIU Process
About the Stages of the Process
Why is SIU Challenging?
Why is SIU Significant?
4. The Deep Web
(It has many other names: Hidden Web, Dark Web, Invisible Web, subject-specific databases, data-intensive Web sites.)

What is the DEEP WEB?
The portion of Web resources that is not returned by search engines through traditional crawling and indexing.
[Figure: the Web divided into the Surface Web (as seen by search engines), the almost-visible Web, and the Deep Web.]

Where do the contents LIE?
Online Databases
5. The Deep Web (continued)

How are the contents ACCESSED?
By filling in HTML forms on search interfaces.

How are they PRESENTED to users?
Dynamic Pages / Result Pages / Response Pages
6. Challenges in Accessing Deep Web Contents
(Quick fact: the Deep Web includes 307,000 sites, 450,000 databases, and 1,258,000 interfaces.)

The deep Web is 500 times larger than the rest of the Web (BrightPlanet.com, 2001), and grew 3-7 times from 2000 to 2004 (He et al., 2007a).
The deep Web remains invisible on the Web: a user visits several interfaces before finding the right information, and manually reconciles information obtained from different sources.
Alternative approaches are not scalable: Invisible Web directories and search engine browse directories cover only 37% of the deep Web (He et al., 2007a).
7. Opportunities in Accessing Deep Web Contents
(Interesting fact: there exist at least 10 million high-quality HTML forms on the deep Web.)

HTML forms on search interfaces provide a useful way of discovering the underlying database structure.
The labels attached to fields are expressive and meaningful.
Instructions for users entering data may provide information on data constraints (such as the range or domain of the data) and integrity constraints (mandatory / optional attributes).
In the last decade, several prominent researchers have focused on the PROBLEM OF UNDERSTANDING SEARCH INTERFACES.
8. The Search Interface Understanding (SIU) Process

[Figure: the SIU pipeline. Input: a search interface backed by an online DB. Stages: A. Representation, B. Parsing, C. Segmentation, D. Segment Processing, E. Evaluation (against a manually tagged search interface). Output: a system-tagged search interface and the extracted DB structure.]

The SIU process is challenging because search interfaces are designed autonomously by different designers and thus do not have a standard structure (Halevy, 2005).
9. A. Representation and Modeling
(This stage builds up the foundation for the process.)

This stage formalizes the information to be extracted from a search interface:
interface component: any text or form element
semantic label: the meaning of a component from a user's standpoint
segment: a composite component formed by a group of related components
segment label: the semantic label of a segment
10. A. Representation and Modeling (continued)

Zhang et al. (2004) represent an interface as a list of query conditions.
Segment Label = Query Condition
A segment consists of the following semantic labels: an attribute name, an operator, and a value.
11. B. Parsing
(This stage is the first task physically performed on the interface.)

The interface is parsed into a workable memory structure. This can be done in two modes:
by reading the HTML source code;
by rendering the page on a Web browser, either manually or automatically using a visual layout engine.
12. B. Parsing (continued)

He et al. (2007b) parse the interface into an interface expression (IEXP) with constructs t (corresponding to any text), e (corresponding to any form element), and | (corresponding to a row delimiter).
The IEXP for the figure is: t|te|teee
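The IEXP idea can be sketched in a few lines. This is only an illustration, assuming tokens have already been extracted from the page with their row positions; the real parser works directly on HTML, and the token list below is hypothetical.

```python
# Illustrative sketch of IEXP generation (He et al., 2007b): each interface
# token becomes 't' (text) or 'e' (form element), with '|' between rows.

def to_iexp(tokens):
    """tokens: list of (kind, row) pairs, kind is 'text' or 'element'."""
    out = []
    prev_row = None
    for kind, row in tokens:
        if prev_row is not None and row != prev_row:
            out.append('|')          # row delimiter (<BR>, <P>, or </TR>)
        out.append('t' if kind == 'text' else 'e')
        prev_row = row
    return ''.join(out)

# Hypothetical pre-parsed interface: a title row, a label + textbox row,
# and a label + three form elements row.
tokens = [('text', 0),
          ('text', 1), ('element', 1),
          ('text', 2), ('element', 2), ('element', 2), ('element', 2)]
print(to_iexp(tokens))  # → t|te|teee
```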
13. C. Segmentation
(Techniques used: rules, heuristics, machine learning.)

A segment has a semantic existence but no physically defined boundaries, which makes this stage a challenging one. It involves:
Grouping of semantically related components (a sub-problem is associating a surrounding text with a form element)
Assignment of semantic labels to components
14. C. Segmentation (continued)

He et al. (2007b) use a heuristic-based method, LEX, to group elements and text labels together. One heuristic used by LEX is that a text and a form element lying on the same line are likely to belong to one segment. In the figure, the three components "Gene Name", the radio buttons with options 'Exact Match' and 'Ignore Case', and the textbox belong to one segment.
[Figure: a logical attribute composed of an attribute label, a constraint element, and a domain element.]
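The same-line heuristic can be sketched as follows. The component records are hypothetical: the first line mirrors the "Gene Name" segment from the figure, and the second ("Organism") line is invented for illustration. Real LEX combines several further cues (ending colon, textual similarity, alignment, distance).

```python
# Sketch of one LEX grouping heuristic (He et al., 2007b): components that
# lie on the same line are grouped into one candidate segment.

def group_by_line(components):
    """components: list of dicts with 'kind', 'value', and 'line' keys."""
    segments = {}
    for c in components:
        segments.setdefault(c['line'], []).append(c)
    # one segment per line, in top-to-bottom order
    return [segs for _, segs in sorted(segments.items())]

components = [
    {'kind': 'text',    'value': 'Gene Name',   'line': 1},
    {'kind': 'radio',   'value': 'Exact Match', 'line': 1},
    {'kind': 'radio',   'value': 'Ignore Case', 'line': 1},
    {'kind': 'textbox', 'value': '',            'line': 1},
    {'kind': 'text',    'value': 'Organism',    'line': 2},  # invented line
    {'kind': 'select',  'value': 'Any',         'line': 2},
]
for seg in group_by_line(components):
    print([c['value'] for c in seg])
```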
15. D. Segment Processing
(Techniques used: rules, heuristics, machine learning.)

In this stage:
Each segment is further tagged with additional meta-information about itself and its components.
The extracted information is post-processed: normalization, stemming, removal of stop words.

He et al. (2007b)'s LEX extracts meta-information about each extracted segment using the Naive Bayes classification technique. The extracted information for a segment includes the domain type (finite, infinite), unit (miles, sec), value type (numeric, character, etc.), and layout order position (in the IEXP).
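A toy illustration of Naive Bayes classification for one piece of meta-information (the domain type). The features and the four training examples are invented for this sketch and are not LEX's actual feature set.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (feature_tuple, label) pairs."""
    priors = Counter(lbl for _, lbl in examples)
    cond = defaultdict(Counter)        # (label, feature_index) -> value counts
    for feats, lbl in examples:
        for i, f in enumerate(feats):
            cond[(lbl, i)][f] += 1
    return priors, cond

def classify_nb(feats, priors, cond, alpha=1.0):
    total = sum(priors.values())
    best, best_logp = None, float('-inf')
    for lbl, n in priors.items():
        logp = math.log(n / total)
        for i, f in enumerate(feats):
            c = cond[(lbl, i)]
            # Laplace smoothing; assumes two possible values per feature
            logp += math.log((c[f] + alpha) / (sum(c.values()) + 2 * alpha))
        if logp > best_logp:
            best, best_logp = lbl, logp
    return best

# Invented training data: (element type, has enumerable values) -> domain type
train = [(('select', True), 'finite'), (('radio', True), 'finite'),
         (('textbox', False), 'infinite'), (('textbox', False), 'infinite')]
priors, cond = train_nb(train)
print(classify_nb(('select', True), priors, cond))    # → finite
print(classify_nb(('textbox', False), priors, cond))  # → infinite
```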
16. E. Evaluation
(An approach is usually tested on a set of interfaces belonging to a particular domain.)

How accurate is the extracted information?
The system-generated segmented and tagged interface is compared with the manually segmented and tagged interface.
The results are evaluated using standard metrics (precision, recall, accuracy, etc.).
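A minimal sketch of this comparison: system-extracted (label, element) pairs are matched against a manually tagged gold standard, and precision and recall are computed. The pairs below are invented for illustration.

```python
# Evaluation sketch: compare system output to a manual gold standard.

def precision_recall(system, gold):
    system, gold = set(system), set(gold)
    tp = len(system & gold)                       # correctly extracted pairs
    precision = tp / len(system) if system else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

gold = {('Title', 'textbox1'), ('Author', 'textbox2'), ('Format', 'select1')}
system = {('Title', 'textbox1'), ('Author', 'textbox2'), ('Format', 'textbox2')}
p, r = precision_recall(system, gold)
print(p, r)  # two of the three extracted pairs are correct
```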
17. Why is SIU Significant?
(Significance: SIU is a prerequisite for several advanced deep Web applications.)

Researchers have proposed solutions to make deep Web contents more useful to users. These solutions can be divided into the following categories based on their goals:
To Increase Content Visibility on Search Engines
  Building a Dynamic Page Repository: Raghavan and Garcia-Molina (2001)
  Building a Database Content Repository: Madhavan et al. (2008)
To Increase Domain-specific Usability
  Meta-search Engines: Wu et al. (2004), He et al. (2004), Chang, He and Zhang (2005), Pei et al. (2006), He and Chang (2003), and Wang et al. (2004)
To Attain Knowledge Organization
  Derivation of Ontologies: Benslimane et al. (2007)
These solutions can only be materialized by leveraging the opportunities provided by search interfaces.
18. LITERATURE REVIEW RESULTS
Review Settings
Reductionist Analysis
Holistic Analysis
Progress Made
19. Literature Review Process
(Alias: for quick reference, each work is assigned an alias.)

Reviewed research works that propose approaches for performing the SIU process in the context of the deep Web:

S.No.  Reference                           Alias
1      Raghavan and Garcia-Molina (2001)   LITE
2      Kalijuvee et al. (2001)             CombMatch
3      Wu et al. (2004)                    FieldTree
4      Zhang et al. (2004)                 HSP
5      Shestakov et al. (2005)             DEQUE
6      Pei et al. (2006)                   AttrList
7      He et al. (2007)                    LEX
8      Benslimane et al. (2007)            FormModel
9      Nguyen et al. (2008)                LabelEx
20. Literature Review Process
(Dimension: facilitates comparison among different works by placing them under the same umbrella.)

The review was done in two phases:
Reductionist Analysis: The works were decomposed into small pieces.
  Each work was visualized as a two-dimensional grid where the horizontal sections refer to stages of the SIU process.
  For each stage, the works were analyzed along vertical degrees of analysis known as stage-specific dimensions.
Holistic Analysis: Each work was studied in its entirety, within a big-picture context.
  Composite dimensions were created out of the stage-specific dimensions.
21. Reductionist Analysis: Representation
(Stage under analysis: A. Representation, out of A. Representation, B. Parsing, C. Segmentation, D. Segment Processing, E. Evaluation.)

Work: HSP
  Segment and its contents: Conditional Pattern: Attribute-name, Operator*, and Value+
  Text label : form element: 1:M

Work: DEQUE
  Segment and its contents: Field segment: f, Name(f), Label(f), domain(f), type(f), where f = a field and F = a form
  Text label : form element: 1:1
  Meta-information: JavaScript functions; visible and invisible values; Subinfo(F) = {action, method, enctype}; Iset(F) = the initial field set that can be submitted without completing the form

Work: AttrList
  Segment and its contents: Attribute: Attribute-name, description, and form element
  Text label : form element: 1:1
  Meta-information: domain information for each attribute (set of values and data types)

Work: LEX
  Segment and its contents: Logical Attribute Ai: attr-label L, a list of domain elements {Ej, ..., Ek}, and element labels
  Text label : form element: 1:M (1:1 in the case of "element label" : form element)
  Meta-information: site information and form constraints. Ai = (P, U, Re, Ca, DT, DF, VT), where Ai = the ith attribute, P = layout order position, U = unit, Re = relationship type, Ca = domain element constraint, DT = domain type, DF = default value, VT = value type. Ei = (N, Fe, V, DV), where N = internal name, Fe = format, V = set of values, DV = default value.
22. Reductionist Analysis: Parsing

Work: LITE
  Input mode: HTML source code and visual interface
  Basic step: pruning: discard images and isolate the elements that directly influence the layout of form elements and labels
  Cleaning up: ignore styling information such as font size, font style, and style sheets
  Resulting structure: Pruned Page

Work: CombMatch
  Input mode: HTML source code
  Basic step: chunk partitioning, and finding meta-information about each chunk: find bounding HTML tags, text strings delimited by table cell tags, etc.
  Cleaning up: stop phrases ("optional", "required", "*") and text-formatting HTML tags
  Resulting structure: Chunk List and Table Index List; each chunk is represented as an 8-tuple describing its meta-information

Work: DEQUE
  Input mode: HTML text and visual interface
  Basic step: preparing the form database: a DOM tree is created for each FORM element
  Cleaning up: ignore font size, typefaces, and styling information
  Resulting structure: Pruned Tree

Work: LEX
  Input mode: HTML source code
  Basic step: interface expression generation: t = text, e = element, | = row delimiter (<BR>, <P>, or </TR>)
  Resulting structure: String
23. Reductionist Analysis: Segmentation

Work: CombMatch
  Problem: assigning a text label to an input element
  Segmentation criteria: a combination of string-similarity and spatial-similarity algorithms
  Technique: heuristics (string properties, proximity, and layout)

Work: HSP
  Problem: finding the 3-tuple <attribute name, operators, values>
  Segmentation criteria: a grammar (set of rules) based on productions and preferences
  Technique: rules (a best-effort parser builds a parse tree)

Work: LEX
  Problem: assigning text labels to attributes, and element labels to domain elements
  Segmentation criteria: ending colon, textual similarity with the element name, vertical alignment, distance, preference to the current row
  Technique: heuristics (string properties, layout, and proximity)

Work: LabelEx
  Problem: assigning a text label to a form element
  Segmentation criteria: classifiers (Naive Bayes and Decision Tree); features include spatial features, element type, font type, internal-name similarity, alignment, label placement, and distance
  Technique: supervised machine learning
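The heuristic criteria above can be combined into a simple scoring function. This is an illustration in the spirit of LEX and CombMatch, not any published scorer: the weights, the `difflib` similarity measure, and the candidate labels are all invented for the sketch.

```python
import difflib

def label_score(text, element_name, distance):
    """Score a candidate text label for a form element using three cues."""
    score = 0.0
    if text.rstrip().endswith(':'):
        score += 1.0                                   # ending-colon cue
    sim = difflib.SequenceMatcher(
        None, text.lower().strip(': '), element_name.lower()).ratio()
    score += sim                                       # textual-similarity cue
    score += 1.0 / (1 + distance)                      # proximity cue
    return score

# Hypothetical candidates for an element internally named 'depart_city'
candidates = [('Departure city:', 2), ('Search', 9)]
best = max(candidates, key=lambda c: label_score(c[0], 'depart_city', c[1]))
print(best[0])  # → Departure city:
```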
24. Reductionist Analysis: Segment Processing

Work: HSP
  Post-processing: the Merger module reports conflicting tokens (those that occur in two query conditions) and missing tokens (those that do not occur in any query condition)

Work: LEX
  Technique for extracting meta-information: Naive Bayes classification (supervised machine learning)
  Post-processing: removal of meaningless stopwords (the, with, any, etc.)

Work: FormModel
  Technique for extracting meta-information: learning by examples (machine learning)

Work: LabelEx
  Post-processing: heuristics for reconciling multiple labels assigned to one element, and for handling dangling elements
25. Reductionist Analysis: Evaluation

Work: LITE
  Test domains: Semiconductor Industry, Database Technology
  Yahoo subject categories: Science, Entertainment, Movies, Computers & Internet
  Compared with: CombMatch (in terms of methodology)
  Metrics: accuracy

Work: HSP
  Test domains: airfare, automobile, book, job, real estate, car rental, hotel, movies, music records
  Yahoo subject categories: Business & Economy, Recreation & Sports, Entertainment
  Compared with: four datasets from different sources, collected by the authors
  Metrics: precision, recall

Work: LabelEx
  Test domains: airfare, automobiles, books, movies
  Yahoo subject categories: Business & Economy, Recreation & Sports, Entertainment
  Compared with: Barbosa et al. (2007)'s dataset and HSP (in terms of datasets); classifier ensemble with or without Mapping Reconciliation (MR); generic classifier vs. domain-specific classifier; generic classifier with MR vs. domain-specific classifier with MR; HSP and LEX (in terms of methodology)
  Metrics: recall, precision, F-measure
26. Holistic Analysis

Work: LITE
  Type of semantics: partial form capabilities (label associated with a form element)
  Technique: heuristics; human involvement: none
  Target application: deep Web crawler (search engine visibility)

Work: HSP
  Type of semantics: query capability (attribute name, operator, and values)
  Technique: rules; human involvement: manual specification of grammar rules
  Target application: meta-searchers (domain-specific usability)

Work: LEX
  Semantics: components belonging to the same logical attribute (labels and form elements); technique: heuristics; human involvement: none
  Semantics: meta-information; technique: supervised machine learning; human involvement: training data for the classifier
  Target application: meta-searchers (domain-specific usability)

Work: FormModel
  Semantics: structural units (groups of fields belonging to the same entity); technique: NOT REPORTED; human involvement: unknown
  Semantics: partial form capabilities (label associated with a form element); technique: heuristics; human involvement: none
  Semantics: meta-information; technique: supervised machine learning; human involvement: training data for learning by examples
  Target application: ontology derivation (knowledge organization)

Work: LabelEx
  Type of semantics: partial form capabilities (label associated with a form element)
  Technique: supervised machine learning; human involvement: classifier training data was manually tagged
  Target application: deep Web in general (search engine visibility, domain-specific usability)
27. Progress Made

SEMANTICS modeled and extracted (Stages A and B)
  from merely stating what we see, to stating what is meant by what we see
  from merely associating labels with form elements, to discovering query capabilities
  from no meta-information to a lot of meta-information that might be useful for the target application
TECHNIQUES employed (Stages C and D)
  A mild transition from naive techniques (rule-based and heuristic-based) to sophisticated techniques (supervised machine learning).
DOMAINS explored (Stage E)
  Only commercial domains: books, used cars, movies, etc.
  Still unexplored non-commercial domains: yahoo.com subject categories such as regional, society and culture, education, arts and humanities, science, reference, and others.
28. RESEARCH QUESTIONS
Techniques Vs Design Heterogeneity
Techniques Vs Domain Heterogeneity
Simulating a Human Designer
29. Research Questions
(Derived from the holistic and reductionist analyses.)

R.Q.#1 Technique Vs Design Heterogeneity
What is the correlation between the technique employed and the ability to handle heterogeneity in the design of interfaces?
R.Q.#2 Technique Vs Domains
How can we design approaches that work well for arbitrary domains, and thus prevent the need to design domain-specific approaches?
R.Q.#3 Simulating a Human Designer
How can we make a machine understand an interface in the same way a human designer does?
30. Research Question #1
What is the correlation between the technique employed and the ability to handle heterogeneity in the design of interfaces?
(Technique is a dimension of the Segmentation and Segment Processing stages.)

Elaborating the Question
Techniques: rules, heuristics, and machine learning.
Design: the arrangement of interface components.
Handling heterogeneity in design: being able to perform the following tasks for any kind of design:
  Segmentation
    Semantic Tagging
    Grouping (label assignment is a part of this)
  Segment Processing
31. Research Question #1 (continued)

[Figure: examples of design heterogeneity. The automobile domain shows segments with multiple attribute-names, an operator, and an operand; the movie domain shows a different arrangement of attribute-name and operand.]
32. Research Question #1 (continued)
(This question has been only partially explored.)

Existing Efforts to Answer
A 2002 study (Kushmerick, 2002) suggests the superiority of machine learning techniques over rule-based and heuristic-based techniques for handling design heterogeneity in general.
A 2008 study (Nguyen et al., 2008) compared the label assignment accuracy (a part of grouping accuracy) of three approaches: rule-based (HSP), heuristic-based (LEX), and machine learning based (LabelEx). The machine learning technique outperformed the other two.
33. Investigating R.Q.#1: Technique Vs Design Heterogeneity
(Tasks to test: Segmentation (grouping, semantic tagging) and Segment Processing.)

However, there is NO comparative study of overall grouping, semantic tagging, and segment processing.

Experiment: A machine learning technique based on Hidden Markov Models (HMMs) was designed and tested on a dataset from the biology domain.
  Grouping accuracy (label assignment included): 86%, a 10% improvement over the heuristic-based state-of-the-art approach LEX.
  Semantic tagging accuracy: 90%, a 17% improvement over a heuristic-based algorithm designed for comparison.

Planned comparisons:
  Segmentation performance: machine learning vs. rule-based.
  Segment processing performance: rules vs. heuristics vs. machine learning.
  Various machine learning techniques: classification vs. HMM vs. ...
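A minimal sketch of the HMM idea behind these experiments: semantic roles are hidden states, interface tokens are observations, and Viterbi decoding recovers the most likely labeling. All probabilities and the token vocabulary below are invented for illustration, not trained values from the study.

```python
# Viterbi decoding over invented HMM parameters for interface tagging.

def viterbi(obs, states, start, trans, emit):
    V = [{s: start[s] * emit[s].get(obs[0], 1e-6) for s in states}]
    path = {s: [s] for s in states}
    for o in obs[1:]:
        V.append({})
        new_path = {}
        for s in states:
            # best previous state for reaching s while emitting o
            p, prev = max((V[-2][q] * trans[q].get(s, 1e-6) *
                           emit[s].get(o, 1e-6), q) for q in states)
            V[-1][s] = p
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return path[best]

states = ['attr', 'op', 'operand']
start = {'attr': 0.8, 'op': 0.1, 'operand': 0.1}
trans = {'attr': {'op': 0.6, 'operand': 0.4},
         'op': {'operand': 0.9, 'attr': 0.1},
         'operand': {'attr': 0.8, 'op': 0.2}}
emit = {'attr': {'Price': 0.5, 'Make': 0.5},
        'op': {'less than': 0.9, 'at least': 0.1},
        'operand': {'<textbox>': 0.7, '<select>': 0.3}}
print(viterbi(['Price', 'less than', '<textbox>'], states, start, trans, emit))
# → ['attr', 'op', 'operand']
```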
34. Investigating R.Q.#1: Technique Vs Design Heterogeneity
(Human intervention is a dimension of the holistic analysis.)

There is NO comparative study measuring human intervention in these techniques.

Experiment: monitoring human intervention (IN PROGRESS); compares rule-based vs. heuristics vs. machine learning.
  Rule-based: manual crafting of rules.
  Heuristics: manual observations.
  Machine learning: manual tagging.
Experiment: the HMM was trained using the unsupervised Baum-Welch training algorithm, evaluated by P(O|λ). Results are not promising so far.
Next step: designing unsupervised techniques.
35. Research Question #2
How can we design approaches that work well for arbitrary domains, and thus prevent the need to design domain-specific approaches?
(Domain tested is a dimension of the Evaluation stage.)

Elaborating the Question
Domain heterogeneity: the deep Web is heterogeneous in terms of domains, i.e., it has databases belonging to all 14 subject categories of Yahoo (Arts & Humanities, Business & Economy, Computers & Internet, Education, Entertainment, Government, Health, etc.).
How do we design generic approaches that work for many domains?
How do interface designs differ across domains?
Which technique should be employed?
36. Research Question #2 (continued)
(The deep Web has a balanced domain distribution.)

Existing Efforts to Answer
2004: A single grammar (rule-based) generates reasonably good segmentation performance (grouping & semantic tagging) for all domains (Zhang et al., 2004). Higher accuracy can be attained using domain-specific techniques, which are not feasible to design using rules (Nguyen et al., 2008).
2008: For label assignment (a portion of grouping), domain-specific classifiers result in higher accuracy than generic classifiers (Nguyen et al., 2008).

Still missing:
  A comparison of domain-specific and generic approaches on overall segmentation performance
  The design differences across domains
  Generic approaches that give equally good results for as many domains as possible
37. Investigating R.Q.#2
(Design tendencies of designers from different domains are different.)

[Figure: HMM state-transition diagrams for four domains (Movie, References & Education, Biology, and Automobile) over the states Attribute-name, Operator, Operand, and Text-trivial. The transition probabilities differ markedly across domains.]
38. Investigating R.Q.#2: Technique Vs Domain
(All experiments done using the machine learning technique HMM.)

Domain      Experiment                              Metric                  Winner (improvement)
Movie       Domain-specific HMM vs. generic HMM     Segmentation accuracy   Generic HMM (4.4%)
Ref & Edu   Domain-specific HMM vs. generic HMM     Segmentation accuracy   Domain-specific HMM (7%)
Automobile  Domain-specific HMM vs. generic HMM     Segmentation accuracy   Domain-specific HMM (8%)
Biology     Domain-specific HMM vs. generic HMM     Segmentation accuracy   Domain-specific HMM (36%)

What is the correlation between design topology and the performance of a domain-specific model?
39. Research Question #3
How can we make a machine understand an interface and extract semantics from it in the same way a human designer does?

A human designer or user naturally understands the design and semantics of an interface based on visual cues and prior experience.
A machine cannot really "see" an interface and does not have any implicit Web search experience. (How much do visual layout engines assist?)
Hence, there is a difference between the way a machine perceives an interface and the way a designer perceives it.
How can we reconcile these differences?
40. Investigating R.Q.#3: Simulating a Human Designer
(Existing methods have been able to understand design, attach semantic labels, and derive segments and query capabilities.)

Hypothesis: A machine can be made to understand an interface in the same way a human designer does if it is enabled to discover the deep source of knowledge that created the interface in the first place.

[Figure: a designer/modeler uses Web design knowledge and a conceptual model to design the search interface; the machine, given the interface, should understand the design, attach semantic labels, derive segments, derive query capabilities, and recover the DB schema.]

Extracting the DB schema and conceptual model is still an open question.
41. Connecting the Dots

[Figure: the three research questions mapped onto the SIU pipeline. R.Q.1 covers the path from the search interface through design understanding to semantic labels, segments, and query capabilities; R.Q.2 covers handling many search interfaces across domains; R.Q.3 covers recovering the designer's Web design knowledge, conceptual model, and DB schema, toward a conceptual-model-based interface.]
42. THANK YOU!
Suggestions, Comments, Thoughts, Ideas, Questions...

Acknowledgements: To my Prospectus Committee members.
References: [1] to [42] (in the prospectus report).