3. Introduction
• Web data extraction is an important part of many
web data analysis applications.
• Many web sites contain large sets of pages generated using
a common template or layout.
– e.g., Amazon, eBay, Google, etc.
• The key to automatic extraction for these template web pages
is whether we can deduce the template automatically.
– There is no need to annotate the web pages for extraction targets.
4. Introduction (Cont.)
• According to the kind of extraction target, web data
extraction tasks can be classified into three categories:
– Record-level: the target is usually constrained to record-wide
information.
• DEPTA
• IEPAD
– Page-level: the target is page-wide information.
• RoadRunner
• EXALG
• FivaTech
– Site-level: the target is to populate a database from the pages of a web site.
5. Introduction (Cont.)
• We take the FivaTech system as our research subject and study its
problems in order to improve its performance.
– It is unsupervised.
– It is both page-level and record-level.
– It has much higher precision than EXALG.
– It is comparable with other record-level extraction systems
like ViPER and MSE.
7. • Assume the similarity between b1 and b2 is 1.0, and the
similarity between tr1~tr4 and tr5~tr6 is 0.6.
• The FivaMatchingScore is (1.0+0.6+0.6+0.6+0.6)/5 = 0.68
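The arithmetic above can be reproduced with a minimal sketch. It assumes the score is the sum of the matched child similarities divided by the number of children in the larger tree (5 here: b1 and tr1~tr4). The helper name `matching_score` is hypothetical and is not taken from FivaTech itself.

```python
def matching_score(child_similarities, larger_child_count):
    """Sum the matched-child similarities and normalize by the
    number of children in the larger tree (assumed definition)."""
    return sum(child_similarities) / larger_child_count

# b1 matches b2 at 1.0; each of tr1..tr4 contributes 0.6, as in the slide.
slide_score = matching_score([1.0, 0.6, 0.6, 0.6, 0.6], 5)
print(round(slide_score, 2))
```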
8. The problem of FivaMatchingScore
• Case 1. Table structure.
• Case 2. Child trees containing set type data.
• Case 3. Asymmetry.
11. Case 2. Child trees containing set type data
• Assume tr5 and tr6 contain set type data, and the similarity
between tr1~tr4 and tr5~tr6 is 0.3.
• The FivaMatchingScore is 1.0/5 = 0.2.
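One way to read this result is that child pairs whose similarity falls below some cutoff do not count as matched, so only the b1/b2 pair contributes. The sketch below works under that assumption. The function name `thresholded_score` and the cutoff value 0.5 are hypothetical; any cutoff above 0.3 and at most 1.0 gives the same number for this case.

```python
def thresholded_score(child_similarities, larger_child_count, threshold=0.5):
    """Sum only the similarities that reach the threshold (assumed rule),
    then normalize by the number of children in the larger tree."""
    kept = [s for s in child_similarities if s >= threshold]
    return sum(kept) / larger_child_count

# b1 matches b2 at 1.0; the tr pairs score 0.3 and are dropped by the cutoff.
case2_score = thresholded_score([1.0, 0.3, 0.3, 0.3, 0.3], 5)
print(round(case2_score, 2))
```

Under this reading, two trees that differ only because some rows hold set type data receive a low score, which is the problem the slide points out.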