2. | 2
Elsevier!
So much more!
- Entity recognition and linking
- Text summarization
- Question answering
- Image understanding
- Ontology creation/alignment
- Knowledge graph creation
- User representation and understanding
- Recommendations
- Search
- …
3. | 3
Working with Amsterdam Data Science
It’s been a brilliant journey so far!
• A successful internship program for 3 years now
- With about 30 graduates
- More than 10 publications
- And 5 hires
• An AI lab with VU and UvA
- 3 PhD students and 2 post-doctoral researchers
- Helping us with themes around information extraction and search
• Inspiring others about Amsterdam as an attractive Data Science Hub
5. | 5
Enter Topic Pages!
Definition
More info
Other related concepts
6. | 6
Breaking down the problem
• The intention was to build a scientific encyclopedia
- Automatically
- From peer-reviewed, citable content
• An encyclopedia provides well-structured and meaningful information about concepts
- So we need to have a database of concepts, at the least
- We need to find the concepts in free text
- And, finally, show the region of text where the concept was found, if it is meaningful
7. | 7
Step 1: Tag the content and find candidates
• The first step was to tag all of the incoming textual data using pre-defined concepts
- At Elsevier, we have a large, general-purpose, semi-automatically made taxonomy, called Omniscience
- It combines and extends several existing taxonomies such as EMMeT, MeSH, Reaxys, etc.
8. | 8
Step 1: Tag the content and find candidates…
• So we know what we have to tag the text with...
- We still need to figure out a way to do the tagging
• We’ve developed a state-of-the-art tagging system, which we call the Fingerprint Engine (FPE); it uses NLP-driven rules to impose an external taxonomy on incoming text
- So, given a piece of text
- And, say, the branch of Omniscience dealing with chemistry concepts, it gives you annotations, each of which corresponds to a concept in the taxonomy
• Finally, every sentence which contains an annotation is a candidate which can possibly be displayed on the topic page
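To make the candidate step concrete, here is a minimal sketch. It is not the Fingerprint Engine: the FPE uses NLP-driven rules, while this stand-in only does case-insensitive string matching against a tiny, hypothetical taxonomy. The names `TAXONOMY`, `Annotation`, `tag` and `find_candidates` are all made up for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    concept: str  # taxonomy concept the surface form maps to
    start: int    # character offset in the sentence (inclusive)
    end: int      # character offset in the sentence (exclusive)


# Hypothetical mini-taxonomy: lower-cased surface form -> concept label.
# A real taxonomy such as Omniscience is far larger and hierarchical.
TAXONOMY = {
    "amygdala": "Amygdala",
    "functional magnetic resonance imaging": "Functional Magnetic Resonance Imaging",
    "fmri": "Functional Magnetic Resonance Imaging",
}


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tag(sentence, taxonomy=TAXONOMY):
    # Return every taxonomy match in the sentence, ordered by position.
    annotations = []
    for surface, concept in taxonomy.items():
        pattern = r"\b" + re.escape(surface) + r"\b"
        for match in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            annotations.append(Annotation(concept, match.start(), match.end()))
    return sorted(annotations, key=lambda a: a.start)


def find_candidates(text, taxonomy=TAXONOMY):
    # A sentence is a candidate as soon as it carries at least one annotation.
    candidates = []
    for sentence in split_sentences(text):
        annotations = tag(sentence, taxonomy)
        if annotations:
            candidates.append((sentence, annotations))
    return candidates
```

Calling `find_candidates` on a paragraph returns only the sentences that mention a known concept, together with where each concept was found.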
9. | 9
For a single document, we get…
Candidates:
1) During adolescence, considerable social and biological changes occur that interact with functional brain maturation, some of which are sex-specific.
2) The amygdala is one brain area that has displayed sexual dimorphism, specifically in socio-affective (superficial amygdala [SFA]), stress (centromedial amygdala [CMA]), and learning and memory (basolateral amygdala [BLA]) processing.
3) The amygdala has also been implicated in mood and anxiety disorders which display sex-specific features, most prominently observed during adolescence.
4) Using functional magnetic resonance imaging (fMRI), the present study examined the interaction of age and sex on resting state functional connectivity (RSFC) of amygdala sub-regions, BLA and SFA, in a sample of healthy adolescents between the ages 10 and 16 years (n = 122, 71 boys).
…
10. | 10
And aggregating over concepts…
• As an example, for amygdala we observe about 10k candidate sentences
- In mice, which do not form pair bonds, OTR in the medial amygdala and V1aR in the lateral septum are essential for individual discrimination.272,289
- To determine the sites of action in the brain, DCS was microinjected into the NAcc, the amygdala, and the caudate putamen.
- The central amygdala, however, is viewed as an important output region.
- The central amygdala then orchestrates responses appropriate to cope with the detected biologically-significant event.
- The amygdala also innervates the locus coeruleus allowing emotional pain and physical stressors of withdrawal to trigger noradrenergic (norepinephrine) (fight-or-flight) responses.
- …
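Aggregating over concepts amounts to inverting the per-document output: instead of "sentence, with its concepts" we want "concept, with all its sentences". A minimal sketch, assuming each tagged sentence arrives as a `(sentence, concepts)` pair; the function name and the concept labels in the usage example are hypothetical.

```python
from collections import defaultdict


def aggregate_by_concept(tagged_sentences):
    """Group candidate sentences under every concept they mention.

    tagged_sentences: iterable of (sentence, [concept, ...]) pairs.
    Returns a dict mapping concept -> list of candidate sentences.
    """
    by_concept = defaultdict(list)
    for sentence, concepts in tagged_sentences:
        # Use a set so a sentence is listed once per concept,
        # even if the concept is mentioned several times in it.
        for concept in set(concepts):
            by_concept[concept].append(sentence)
    return dict(by_concept)


tagged = [
    ("The central amygdala, however, is viewed as an important output region.",
     ["Amygdala"]),
    ("To determine the sites of action in the brain, DCS was microinjected "
     "into the NAcc, the amygdala, and the caudate putamen.",
     ["Amygdala", "Caudate putamen"]),
]
index = aggregate_by_concept(tagged)
```

Here `index["Amygdala"]` holds both sentences and `index["Caudate putamen"]` holds only the second one.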
11. | 11
Step 2: Make a training set to train machine learning algorithms
• Once we had sentences, we needed to select the good ones for definitions and snippets
• A major challenge was the lack of training data
- Remember that this is highly specific information, pertaining to highly evolved domains of science
- Training data must be manually curated by subject-matter experts who know the field
- There are a lot of sentences to label!
• To collect data we devised a stratification:
- Hearst patterns (e.g. "i.e.", "is defined by", "is a")
- Which section did it come from?
- Sentence length
- Presence of other concepts
- Similarity of the main concept to other concepts
- Similarity to DBPedia definitions
- …
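A minimal sketch of how a few of these stratification signals could be computed, so that sentences with the same signal values fall into the same stratum. It covers only three of the listed signals (definitional cue, sentence length, presence of other concepts). The cue list, the extra "is an" variant, the length thresholds and all function names are assumptions for illustration, not the actual setup.

```python
import re

# Definitional cues in the spirit of Hearst patterns.
# "is defined by" and "is a" come from the slide; "is an" is an added variant.
HEARST_CUES = [r"\bis defined by\b", r"\bis an?\b"]


def length_bucket(n_tokens):
    # Arbitrary illustrative thresholds, not tuned on any data.
    if n_tokens < 10:
        return "short"
    if n_tokens <= 30:
        return "medium"
    return "long"


def stratum_features(sentence, main_concept, other_concepts):
    lowered = sentence.lower()
    n_tokens = len(lowered.split())  # whitespace tokens, punctuation attached
    others = [c.lower() for c in other_concepts if c.lower() != main_concept.lower()]
    return {
        "has_hearst_cue": any(re.search(p, lowered) for p in HEARST_CUES),
        "length_bucket": length_bucket(n_tokens),
        "mentions_other_concept": any(c in lowered for c in others),
    }


def stratum_key(features):
    # Sentences that share this tuple belong to the same stratum.
    return (
        features["has_hearst_cue"],
        features["length_bucket"],
        features["mentions_other_concept"],
    )
```

Sampling a few sentences from every stratum, rather than uniformly from all candidates, keeps rare but interesting combinations (say, a short sentence with a definitional cue) in front of the annotators.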
12. | 12
Active learning to gather data
[Figure: active learning loop. Boxes: Unlabeled Candidate Sentences, Learning Algorithm, Reinforcement (Q-) Learning, SME. Arrows: Choose next, Predict, Train, Feedback, Label.]
• Active Learning is an efficient way to get the most informative training data out of the entire unlabeled set
- The learning algorithm is an LSTM network with a linear SVM as the final layer
- And we use Q-learning to select the next sample out of the pre-computed strata
- Such that it gets a reward if it selects a sentence which the classifier thinks is a good candidate, but the human annotator marks as bad
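A minimal sketch of the selection loop, under strong simplifying assumptions. It keeps one Q-value per stratum and updates it without a next-state term, so it is a stateless (bandit-style) reduction of Q-learning rather than the full method. The classifier and the annotator are passed in as plain functions (`predict`, `ask_annotator`), and retraining is left as a comment. All names and the `alpha` and `epsilon` defaults are hypothetical.

```python
import random


def select_stratum(q_values, available, epsilon, rng):
    # Epsilon-greedy: explore a random stratum sometimes, otherwise exploit.
    if rng.random() < epsilon:
        return rng.choice(available)
    return max(available, key=lambda name: q_values[name])


def active_learning_loop(strata, predict, ask_annotator, budget,
                         alpha=0.1, epsilon=0.2, seed=0):
    """strata: dict mapping stratum name -> list of unlabeled sentences.
    predict(sentence) -> 1 if the classifier thinks it is a good definition, else 0.
    ask_annotator(sentence) -> 1 or 0, the label given by the human expert.
    """
    rng = random.Random(seed)
    pools = {name: list(sentences) for name, sentences in strata.items()}
    q_values = {name: 0.0 for name in pools}
    labeled = []
    for _ in range(budget):
        available = sorted(name for name, pool in pools.items() if pool)
        if not available:
            break  # nothing left to label
        stratum = select_stratum(q_values, available, epsilon, rng)
        pool = pools[stratum]
        sentence = pool.pop(rng.randrange(len(pool)))
        predicted = predict(sentence)
        label = ask_annotator(sentence)
        # Reward only the informative case from the slide:
        # classifier says "good", human says "bad".
        reward = 1.0 if (predicted == 1 and label == 0) else 0.0
        q_values[stratum] += alpha * (reward - q_values[stratum])
        labeled.append((sentence, label))
        # A real loop would retrain the classifier on `labeled` here.
    return labeled, q_values
```

Over time the Q-values steer labeling effort towards strata where the classifier is confidently wrong, which is where a new label teaches it the most.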
13. | 13
Training set…
Is good definition? | Concept | Definition
1 | Massively Parallel Processing | Massively parallel processing is a means of crunching huge amounts of data by distributing the processing over hundreds or thousands of processors, which might be running in the same box or in separate, distantly located computers.
0 | Software Adaption | Software adaptation is a remarkably complex phenomenon and it must be and will be studied for some time.
1 | Hash tables | Hash tables are one of the most basic data structures used to provide fast access and compact storage for sparse data.
0 | Computational space | Computational space is an imagery of the two prior spaces that resides temporarily in the magnetic, semiconductor locations during the emulation and execution phases of the problems encountered in dealing with real-space scenarios.
1 | Flip-flops | Flip-flops are the principal memory circuits that will store past values and make them available when called for.
0 | Probabilistic sampling | Probabilistic sampling is when there is a well-formed population from which you are sampling.
14. | 14
Step 3: Machine Learning for Definition Classification
Results on a public dataset
17. | 17
Summary
• Topic pages are a freely available resource
• Great for users who want to find out more about what they’re reading, or to learn about concepts that were previously unknown to them
• We use ensemble machine learning algorithms
• Deployed on Apache Spark clusters
• To continuously improve the quality of these pages
• Which keeps readers engaged
• And drives incoming traffic from search engines
18. | 18
Thanks! Bedankt!
Come talk to us in person if you’re not feeling too tired or shy!
Or go to:
https://www.elsevier.com/about/careers