1. Towards Web-Scale Information Extraction Eugene Agichtein Mathematics & Computer Science Emory University [email_address] http://www.mathcs.emory.edu/~eugene/
3. Example: Answering Queries Over Text

For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

PEOPLE
Name              Title    Organization
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  Founder  Free Soft..

Select Name From PEOPLE Where Organization = 'Microsoft'

Bill Gates
Bill Veghte

(from William Cohen's IE tutorial, 2003)
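The query on this slide can be tried directly once the extracted tuples are in a relational table. The following is a minimal sketch, assuming an in-memory SQLite database; the helper name `names_at` is hypothetical, and the rows are copied from the slide as shown (including the truncated "Free Soft.." value).

```python
import sqlite3

# Rows copied from the slide's PEOPLE table.
# "Free Soft.." is truncated in the source and is kept as-is.
PEOPLE_ROWS = [
    ("Bill Gates", "CEO", "Microsoft"),
    ("Bill Veghte", "VP", "Microsoft"),
    ("Richard Stallman", "Founder", "Free Soft.."),
]


def names_at(organization):
    """Return the names of people in the given organization, sorted for determinism."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE PEOPLE (Name TEXT, Title TEXT, Organization TEXT)")
        conn.executemany("INSERT INTO PEOPLE VALUES (?, ?, ?)", PEOPLE_ROWS)
        rows = conn.execute(
            "SELECT Name FROM PEOPLE WHERE Organization = ?", (organization,)
        ).fetchall()
    finally:
        conn.close()
    return sorted(name for (name,) in rows)


if __name__ == "__main__":
    print(names_at("Microsoft"))
```

The point of the example is that, after extraction, answering the question needs no text processing at all: it is an ordinary selection over structured rows.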
10. Representation Models [Cohen and McCallum, 2003]
Any of these models can be used to capture words, formatting or both.
Running example: "Abraham Lincoln was born in Kentucky."
- Lexicons (Alabama, Alaska, …, Wisconsin, Wyoming): is the candidate a member?
- Sliding Window: classifier decides which class; try alternate window sizes.
- Classify Pre-segmented Candidates: classifier decides which class.
- Finite State Machines: most likely state sequence?
- Context Free Grammars: most likely parse? (node labels: NNP, V, P, NP, PP, VP, S)
- Boundary Models: classifier marks BEGIN and END positions.
- … and beyond
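To make the first two models concrete, here is a minimal sketch that combines a lexicon membership test with a sliding window tried at alternate sizes. The lexicon contents, the naive tokenizer, and the function names are assumptions for illustration; the slide does not specify an implementation, and a real system would typically use a trained classifier instead of exact lookup.

```python
# Hypothetical lexicon: a handful of U.S. state names standing in for the full list.
STATE_LEXICON = {"alabama", "alaska", "kentucky", "new york", "wisconsin", "wyoming"}

PUNCTUATION = ".,;:!?"


def tokenize(text):
    """Split on whitespace and strip surrounding punctuation (deliberately naive)."""
    stripped = (tok.strip(PUNCTUATION) for tok in text.split())
    return [tok for tok in stripped if tok]


def sliding_window_matches(text, lexicon, max_window=3):
    """Slide windows of 1..max_window tokens over the text; keep lexicon members.

    Returns (start, end, candidate) triples with token offsets, end exclusive.
    """
    tokens = tokenize(text)
    matches = []
    for size in range(1, max_window + 1):
        for start in range(len(tokens) - size + 1):
            candidate = " ".join(tokens[start:start + size])
            if candidate.lower() in lexicon:  # the "member?" test
                matches.append((start, start + size, candidate))
    return matches


if __name__ == "__main__":
    print(sliding_window_matches("Abraham Lincoln was born in Kentucky.", STATE_LEXICON))
```

Trying several window sizes is what lets a multi-token entry such as "New York" match even though neither token is in the lexicon on its own.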
29. Reachability via Querying
Reachability Graph: t1 retrieves document d1 that contains t2
Tuples: t1, t2, t3, t4, t5
Documents: d1, d2, d3, d4, d5
Example tuples: <SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>
Upper recall limit: determined by the size of the biggest connected component
[Agichtein et al. 2003b]
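The recall bound on this slide can be illustrated with a small graph computation. The sketch below is an assumption-laden illustration rather than the method from [Agichtein et al. 2003b]: only the fact that t1 retrieves d1, which contains t2, comes from the slide; every other entry in `RETRIEVES` and `CONTAINS` is hypothetical, and edges are treated as undirected when finding components.

```python
from collections import defaultdict, deque

# Which documents each tuple retrieves when used as a query (hypothetical except t1 -> d1).
RETRIEVES = {"t1": {"d1"}, "t2": {"d2"}, "t3": {"d3"}, "t4": {"d4"}, "t5": {"d5"}}
# Which tuples can be extracted from each document (hypothetical except t2 in d1).
CONTAINS = {
    "d1": {"t1", "t2"},
    "d2": {"t2", "t3"},
    "d3": {"t3"},
    "d4": {"t4", "t5"},
    "d5": {"t5"},
}


def reachability_edges(retrieves, contains):
    """Edge (t, u) exists when querying with t retrieves a document that contains u."""
    edges = set()
    for source, docs in retrieves.items():
        for doc in docs:
            for target in contains.get(doc, ()):
                if target != source:
                    edges.add((source, target))
    return edges


def largest_component(nodes, edges):
    """Return the biggest connected component, ignoring edge direction (a simplification)."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, best = set(), set()
    for start in nodes:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:  # breadth-first search from an unseen node
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in component:
                    component.add(nxt)
                    queue.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


if __name__ == "__main__":
    tuples = sorted(RETRIEVES)
    biggest = largest_component(tuples, reachability_edges(RETRIEVES, CONTAINS))
    print(sorted(biggest), len(biggest) / len(tuples))
```

With this made-up data, a crawl seeded inside the largest component can never reach t4 or t5, so recall is capped at the fraction of tuples in that component regardless of how many queries are issued.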
Check attribution. Lexicon: lookup. Classify candidates. Sliding window: when candidates are not known. Boundary model: window + classification in one. Finite state machine: for the complete path. Grammars.
All of these provide a form of API for integration with other code.
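The note on finite state machines ("for the complete path") refers to labeling the whole token sequence at once. One common way to find the most likely state sequence is Viterbi decoding over a hidden Markov model; the sketch below uses it purely as an illustration. The state set and every probability are hypothetical values chosen by hand, not figures from the tutorial.

```python
import math

STATES = ["PERSON", "LOCATION", "OTHER"]

# All probabilities below are hypothetical, hand-picked for the running example.
START = {"PERSON": 0.4, "LOCATION": 0.1, "OTHER": 0.5}
TRANS = {
    "PERSON": {"PERSON": 0.5, "LOCATION": 0.05, "OTHER": 0.45},
    "LOCATION": {"PERSON": 0.05, "LOCATION": 0.3, "OTHER": 0.65},
    "OTHER": {"PERSON": 0.15, "LOCATION": 0.25, "OTHER": 0.6},
}
EMIT = {
    "PERSON": {"Abraham": 0.5, "Lincoln": 0.4},
    "LOCATION": {"Kentucky": 0.6, "Lincoln": 0.1},
    "OTHER": {"was": 0.3, "born": 0.3, "in": 0.3},
}
UNSEEN = 1e-6  # small probability for words a state never emitted


def emit_logp(state, token):
    """Log emission probability, falling back to UNSEEN for unknown words."""
    return math.log(EMIT[state].get(token, UNSEEN))


def viterbi(tokens):
    """Return the most likely state sequence for the tokens under the toy model."""
    if not tokens:
        return []
    scores = [{s: math.log(START[s]) + emit_logp(s, tokens[0]) for s in STATES}]
    backpointers = []
    for token in tokens[1:]:
        prev = scores[-1]
        current, pointers = {}, {}
        for state in STATES:
            # Best predecessor for this state at this position.
            best_prev = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][state]))
            current[state] = (
                prev[best_prev] + math.log(TRANS[best_prev][state]) + emit_logp(state, token)
            )
            pointers[state] = best_prev
        scores.append(current)
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: scores[-1][s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    print(viterbi("Abraham Lincoln was born in Kentucky".split()))
```

Unlike the sliding-window approach, each label here depends on its neighbors through the transition scores, which is why "Lincoln" is read as part of a person name even though the toy model also allows it as a location.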