Poster: Microtask crowdsourcing for disease mention annotation in PubMed abstracts
Benjamin M Good, Max Nanis, Andrew I Su
The Scripps Research Institute, La Jolla, California, USA

ABSTRACT

Identifying concepts and relationships in biomedical text enables knowledge to be applied in computational analyses that would otherwise be impossible. As a result, many biological natural language processing (BioNLP) projects attempt to address this challenge. However, the state of the art in BioNLP still leaves much room for improvement in terms of precision, recall, and the complexity of knowledge structures that can be extracted automatically. Expert curators are vital to the process of knowledge extraction but are always in short supply. Recent studies have shown that workers on microtasking platforms such as Amazon’s Mechanical Turk (AMT) can, in aggregate, generate high-quality annotations of biomedical text [1].

Here, we investigated the use of AMT to capture disease mentions in PubMed abstracts. We used the recently published NCBI Disease corpus as a gold standard for refining and benchmarking the crowdsourcing protocol. After merging the responses from 5 AMT workers per abstract with a simple voting scheme, we achieved a maximum F-measure of 0.815 (precision 0.823, recall 0.807) over 593 abstracts as compared to the NCBI annotations on the same abstracts. Comparisons were based on exact matches to annotation spans. The results can also be tuned to optimize for precision (max = 0.98 when recall = 0.23) or recall (max = 0.89 when precision = 0.45). It took 7 days and cost $192.90 to complete all 593 abstracts considered here (at $0.06/abstract, with 50 additional abstracts used for spam detection).

This experiment demonstrated that microtask-based crowdsourcing can be applied to the disease mention recognition problem in the text of biomedical research articles. The F-measure of 0.815 indicates that there is room for improvement in the crowdsourcing protocol but that, overall, AMT workers are clearly capable of performing this annotation task.
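For reference, the F-measure reported here is the balanced harmonic mean of precision and recall, and the reported values are internally consistent:

    F = 2PR / (P + R) = (2 × 0.823 × 0.807) / (0.823 + 0.807) ≈ 0.815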
FUNDING
We acknowledge support from the National Institute of General Medical Sciences (GM089820 and GM083924).
CONTACT
Benjamin Good: bgood@scripps.edu; Andrew Su: asu@scripps.edu
REFERENCES
1. Zhai, Haijun, et al. "Web 2.0-based crowdsourcing for high-quality gold standard development in clinical natural language processing." Journal of Medical Internet Research 15.4 (2013).
2. Doğan, Rezarta Islamaj, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics, 2012.
Ongoing experiments
Iterate, iterate, iterate
Microtask: highlight “diseases” in a PubMed abstract
QUESTION: To what extent can we reproduce the expert-created NCBI disease corpus [2] with non-expert annotators?
Example Disease Mentions

Highlight all diseases and disease abbreviations:
“...are associated with Huntington disease ( HD )... HD patients received...”
“The Wiskott-Aldrich syndrome ( WAS ) …”

Highlight the longest span of text specific to a disease:
“... contains the insulin-dependent diabetes mellitus locus …” (and not just ‘diabetes’)
“...was initially detected in four of 33 colorectal cancer families…”

Highlight disease conjunctions as single, long spans:
“...the life expectancy of Duchenne and Becker muscular dystrophy patients…”
“... a significant fraction of familial breast and ovarian cancer , but undergoes…”

Highlight symptoms - physical results of having a disease:
“XFE progeroid syndrome can cause dwarfism, cachexia, and microcephaly. Patients often display learning disabilities, hearing loss, and visual impairment.”

Highlight all occurrences of disease terms:
“Women who carry a mutation in the BRCA1 gene have an 80 % risk of breast cancer by the age of 70. Individuals who have rare alleles of the VNTR also have an increased risk of breast cancer ( 2-4 )”.
Experiments

Key points
• Took three design iterations to produce the current system
• Even the first attempt was better than automated methods
• Can now perform the task at about 500 abstracts and $200 per week
• Only 33/194 workers passed the qualification test
• Only 17 workers contributed to the training corpus task
• How might we increase the number of workers while maintaining high quality?

We are hiring! Looking for postdocs and programmers interested in crowdsourcing and bioinformatics; contact asu@scripps.edu
Microtask crowdsourcing, key points
• Hundreds of thousands of workers working in parallel
• Perform ‘human intelligence tasks’ (HITs)
• Paid $ for each HIT
• Tasks are managed programmatically
Worker workflow: (1) take a HIT-specific qualification exam, (2) perform HITs, (3) get paid.
Quality is managed by:
• Qualification tests, test HITs with known answers
• Redundancy and aggregation: multiple workers perform each task and we merge their work programmatically
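A minimal sketch of this merge step (illustrative only; the function and data layout are our own, not the project's actual pipeline): a highlighted span is kept when at least k of the workers assigned to an abstract marked exactly that span.

    from collections import Counter

    def merge_annotations(worker_spans, k=2):
        """Merge disease-mention highlights from several AMT workers.

        worker_spans: list of sets, one set per worker, each holding the
                      (start, end) character offsets that worker highlighted.
        k:            minimum number of workers that must independently mark
                      the same exact span for it to be kept.
        """
        votes = Counter(span for spans in worker_spans for span in spans)
        return {span for span, count in votes.items() if count >= k}

    # Example: 5 workers annotated one abstract; spans marked by >= 2 workers survive.
    workers = [
        {(10, 28), (40, 52)},
        {(10, 28)},
        {(10, 28), (40, 52), (90, 99)},
        {(40, 52)},
        {(90, 99)},
    ]
    print(merge_annotations(workers, k=2))   # {(10, 28), (40, 52), (90, 99)}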
Qualification test: identify the correct annotations in three representative abstract snippets. Scores range from 0 to 26; passing requires a score greater than 21.
Key parameters
• AMT platform
• $0.06/abstract
• 593 abstracts
• 5 workers per abstract
• No worker filters beyond qualification test
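For illustration only, posting one such HIT programmatically might look roughly like the sketch below. This uses the boto3 MTurk client; the external annotation URL and the qualification type ID are placeholders, not the actual values or code used for this work.

    import boto3

    # Illustrative sketch: post one abstract-annotation HIT with the parameters above
    # (5 assignments, $0.06 reward, qualification score > 21 out of 26).
    mturk = boto3.client("mturk", region_name="us-east-1")

    question_xml = """<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
      <ExternalURL>https://example.org/annotate?pmid=PLACEHOLDER</ExternalURL>
      <FrameHeight>800</FrameHeight>
    </ExternalQuestion>"""

    hit = mturk.create_hit(
        Title="Highlight diseases in a PubMed abstract",
        Description="Highlight all disease mentions in a short biomedical abstract.",
        Keywords="annotation, biomedical, disease",
        Reward="0.06",                    # $0.06 per abstract
        MaxAssignments=5,                 # 5 workers per abstract
        AssignmentDurationInSeconds=600,
        LifetimeInSeconds=7 * 24 * 3600,  # keep the task open for about a week
        Question=question_xml,
        QualificationRequirements=[{
            "QualificationTypeId": "PLACEHOLDER_QUAL_ID",  # custom disease-annotation test
            "Comparator": "GreaterThan",
            "IntegerValues": [21],        # passing is a score > 21
        }],
    )
    print(hit["HIT"]["HITId"])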
Annotation Interface
RESULTS: reproducing the NCBI training corpus

Top aggregate performance: F = 0.815, precision = 0.823, recall = 0.807, at K = 2 (K = 2 means that at least 2 workers independently created each annotation).
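As an illustration of this scoring (a minimal sketch in our own notation, not the original evaluation code): an aggregated annotation counts as correct only when its start and end offsets exactly match a gold NCBI annotation, and raising the vote threshold K in the merge step trades recall for precision (small K keeps almost everything any worker marked; large K keeps only widely agreed spans).

    def score_exact_spans(predicted, gold):
        """Precision, recall, and F-measure with credit only for exact span matches."""
        tp = len(predicted & gold)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(gold) if gold else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f

    # Toy example: gold has 4 disease mentions, the aggregated worker set has 5,
    # and 3 of them match a gold span exactly.
    gold = {(3, 20), (45, 61), (88, 99), (120, 141)}
    predicted = {(3, 20), (45, 61), (88, 99), (150, 160), (200, 210)}
    print(score_exact_spans(predicted, gold))   # (0.6, 0.75, ~0.667)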
Current hypothesis:
• Train workers using feedback provided after each completed task, based on embedded gold standard annotations and on annotations from other workers.