20090813MEETING

FivaTech ： The problem of peer
node recognition
Reporter ： Che-Min Liao

Outline
• Introduction
• Related Work
• Problem Formulation
• System Architecture
• The Approach
• Experiment
• Conclusion

Introduction
• Web data extraction has been an important part for many
web data analysis applications.
• Many web sites contain large sets of pages generated using
a common template or layout.
– EX ： Amazon 、 Ebay 、 Google, etc.
• The key to automatic extraction for these template web pages
depend on whether we can deduce the template automatically.
– There is no need to annotate the web pages for extraction targets.

Introduction (Cont.)
• According to the kind of extraction targets, the web data
extraction tasks can be classified into three categories ：
– Record-level ： the target is usually constrained to record-wide
information
• DEPTA
• IEPAD
– Page-level ： the target aims at page-wide information.
• RoadRunner
• EXALG
• FivaTech
– Site-level ： populate database from pages of a Web site.

Introduction (Cont.)
• We take FivaTech System as our research, and study it’s
problem to improve the performance.
– It is unsupervised.
– It is both page-level and record-level.
– It has much higher precision than EXALG.
– It is comparable with other record-level extraction systems
like ViPER and MSE.

• Assume the similarity between b1 and b2 is 1.0 ， and the
similarity between tr1~tr4 and tr5~tr6 is 0.6
• The FivaMatchingScore is (1.0+0.6+0.6+0.6+0.6)/5 = 0.68

The problem of FivaMatchingScore
• Case 1. Table structure.
• Case 2. Child trees containing set type data.
• Case 3. Asymmetry.

Case 2. Child trees containing set type
data
• Assume tr5 and tr6 containing set type data, and the similarity
between tr1~tr4 and tr5~tr6 is 0.3.
• The FivaMatchingScore is 1.0/5 = 0.2.

Case 3. Asymmetry
• Assume S(b1,b2) = 1.0, S(tr1,tr5) = 0.6, S(tr4,tr6) = 0.6,
S(tr2~tr4,tr5) = 0.3, S(tr1~tr3,tr6) = 0.3, where S = Similarity.
• FivaMatchingScore(A,B) = (1.0+0.6+0.6)/5 = 0.44
≠ FivaMatchingScore(B,A) = (1.0+0.6+0.6)/3 = 0.86

20090813MEETING

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Destacado

Destacado (20)

Similar a 20090813MEETING

Similar a 20090813MEETING (20)

Más de marxliouville

Más de marxliouville (13)

Último

Último (20)

20090813MEETING