Developing a Curator Assistant for Functional Analysis of Genome Databases
 Requesting $1,451,005 from NSF BIO Advances in Biological Informatics, August 2009

PI: Bruce Schatz, Institute for Genomic Biology, University of Illinois at Urbana-Champaign
coPI: ChengXiang Zhai, Computer Science, University of Illinois, Language Technology
coPI: Susan Brown, Biology, Kansas State University, Arthropod Base Consortium (BeetleBase)
coPI: Donald Gilbert, Bioinformatics, Indiana University, Community Annotation (wFleaBase)

  Intellectual Merit
   The advent of next-generation sequencing is rapidly decreasing the cost of genomes.
Projections are that costs will decrease from $1M to $100K to $10K in the next 5 to 10 years.
As a result, the number of arthropod genomes will increase from 10 to 1000 to 10,000. This
shifts the major limitation from sequencing to annotating. The current level of annotation is
recognizing genes from sequences, rather than understanding the function of genes.
   Traditionally, functional analysis has been performed by human curators who read biological
literature to provide evidence for a genome database of gene function such as FlyBase. To
functionally analyze a genome, biologists develop an ortholog pipeline, in which orthologs are
computed for the genes of their organism and used to find the most similar gene in a model
organism database. This process is inexpensive, but inaccurate, compared to manual curation.
   We propose to develop a Curator Assistant that will enable the communities that are
generating genomes to analyze the function of their genes by themselves. While the model
organism databases (MODs) have groups of curators, subsequent genome databases have
struggled to find funding for even a single human curator. Such bases will have to be curated by
the communities themselves, by community biologists using software infrastructure to help them
extract functions from community literature. Within the Arthropod Base Consortium (ABC), for
example, only FlyBase is a MOD with professional curators.
   During the NSF-funded BeeSpace project, we developed prototype software for automatically
extracting entities and relations from biological literature. The entities include genes, anatomy,
and behavior, while the relations include interaction (gene-gene), expression (gene-anatomy),
and function (gene-behavior). These entities and relations can be used to populate relational
tables to build a genome database. Our prototype works on Drosophila literature and leverages
FlyBase, the MOD for the ABC. Our techniques appear general enough for all arthropods.
   We propose to develop a fully fledged Curator Assistant that makes full use of machine learning
technologies for natural language processing. These include community dictionaries, heuristic
procedures, and training sets. Given the community collection with relevant literature, the
assistant software suggests candidate relations that the community biologists can select from.
Providing additional knowledge is much easier than reading biological literature and
mechanisms are provided to specify the level of quality desired and revise the information itself.
  Broader Impact
   Our project has been organized via the annual Symposium of the Arthropod Base Consortium.
Our investigators include the BeeSpace PI for informatics and the Symposium organizer for
biology, representing arthropod genomes in particular and animal genomes in general. Our
project will develop language technology for entity-relation semantics into usable infrastructure
and distribute it through GMOD, which already provides the sequence support used by ABC.
We will develop the standards for literature support for customized extraction and curation,
including practical deployment to a distributed community of NSF-funded genome biologists.

1. GENOME SEQUENCING AND BIOCURATION

   The advent of next-generation sequencing is rapidly decreasing the cost of genomes.
Projections are that costs will decrease from $1M to $100K to $10K in the next 5 to 10 years.
As a result, the number of arthropod genomes will increase from 10 to 1000 to 10,000. This
shifts the major limitation from sequencing to annotating. The current level of annotation is
recognizing genes from sequences, rather than understanding the function of genes.
   Traditionally, functional analysis has been performed by human curators who read biological
literature to provide evidence for a genome database of gene function such as FlyBase. To
functionally analyze a genome, biologists develop an ortholog pipeline, in which orthologs are
computed for the genes of their organism and used to find the most similar gene in a model
organism database. This process is inexpensive, but inaccurate, compared to manual curation.
   We propose to develop a Curator Assistant that will enable the communities that are
generating genomes to analyze the function of their genes by themselves. While the model
organism databases (MODs) have groups of curators, subsequent genome databases have
struggled to find funding for even a single human curator. Such bases will have to be curated by
the communities themselves, by community biologists using software infrastructure to help them
extract functions from community literature. Within the Arthropod Base Consortium (ABC), for
example, FlyBase is the only MOD with professional curators.
     During the NSF-funded BeeSpace project, we developed prototype software for functional
analysis [33], by automatically extracting entities and relations from biological literature. The
entities include genes, anatomy, and behavior, while the relations include interaction (gene-
gene), expression (gene-anatomy), and function (gene-behavior). These entities and relations
can be used to populate relational tables to build a genome database. Our prototype currently
works on Drosophila literature and leverages FlyBase, the MOD for the ABC. Our techniques
are general enough for all arthropods, an important taxon of organisms for NSF biologists.
   We propose to develop a fully fledged Curator Assistant that makes full use of machine learning
technologies for natural language processing. These include community dictionaries, heuristic
procedures, and training sets. Given the community collection with relevant literature, the
assistant software suggests candidate relations that the community biologists can select from.
Providing additional knowledge is much easier than reading biological literature and
mechanisms are provided to specify the level of quality desired and revise the information itself.
   We will debug the new system on the existing bases such as BeetleBase and wFleaBase, then
deploy more widely to the full bases of the Arthropod Base Consortium as it grows. The
software will be general enough to be widely applicable for genome databases. We will use the
GMOD (Generic Model Organism Database) consortium as the distribution mechanism for our
literature curation software, to complement their existing software for sequence curation.




2. GENOME DATABASES AND BIOCURATION

   The Curator Assistant will initially focus on arthropod genomes, as organisms of central
interest to NSF. At least half of the described species of living animals are arthropods (jointed
legs, mostly insects), species of great scientific interest for molecular genetics and evolutionary
synthesis. The Arthropod Base Consortium (ABC) has been meeting quarterly for the past 4
years to discuss their needs for genome databases and data analysis. The inner circle has about
40 scientists, who hold workshops at the major community sites. The outer circle has about 400
scientists, who attend the Annual Symposium [www.k-state.edu/agc/symposium.shtml]. This
community consortium currently includes some 10 resource genomes, including insects
important biologically (bee, beetle, butterfly, aphid), crustaceans important ecologically (water
flea), and vectors important for human diseases (mosquito, tick, louse).
   There is a reference genome for this community, the fruit fly Drosophila melanogaster, which
has been a genetic model for 100 years. As the model insect, Drosophila is important enough to
justify a 40-person staff at FlyBase, who manually curate this model organism database (MOD).
Through a close collaboration, the FlyBase literature curation process is serving as the model for
our semantic indexing of biological literature, see Figure 1 below.
    The first wave of genomes was of the model genetic organisms, whose MODs already had
Bases with human curators. For the arthropods, the only MOD is FlyBase for the insect
Drosophila melanogaster. The second wave of genomes did not have decades of genetics, but
were attempting to jumpstart with genome sequencing. For arthropods, these include the insects
honey bee and flour beetle, both important scientifically and agriculturally. The corresponding
bases, e.g. BeeBase and BeetleBase, were able to gain modest funding, but not for professional
curators, only for postdocs and programmers. Such resources thus went into annotating genes of
particular interest (small numbers) or support of automatic processing (large numbers).
    With the third wave, the sequencing is still done at genome centers, but no attempts are made
at manual curation. These Bases, e.g. wFleaBase for Daphnia, spend their limited resources on
community annotation and computation. Beyond the third wave, the sequencing is being done at
campus centers rather than national centers and any curation is done automatically with quality
enhancement by the community itself. Within ABC, ButterflyBase and AphidBase are down this
path and will be working with our group as their genomes mature. The 10,000 arthropod
genomes expected in the next decade will all be in the post-curator era.
    From a technology standpoint, this implies that the Curator Assistant must support variable
levels of quality because different bases from different waves will do different amounts of post-
assistant quality improvement. With many curators, the system should generate many candidates
that can be manually checked by human experts. With few curators, the system should generate
few candidates for manual checking, thus higher precision and lower recall. With no curators,
the system should generate highest precision “correct” entries, which are annotated by the
community itself using collaboration technology. In the preliminary work performed in the
BeeSpace project described below, we developed prototype services tuned towards recall and
towards precision, indicating feasibility of developing a fully tunable system for curation quality.
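
The trade-off described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that the extractor attaches a confidence score to each candidate factoid; the scores, the correctness labels, and the three threshold values are invented for the example and are not results from BeeSpace.

```python
from typing import Dict, List, Tuple

# Hypothetical scored candidates: (extractor confidence, factoid is actually correct).
Candidate = Tuple[float, bool]


def precision_recall(candidates: List[Candidate], threshold: float) -> Tuple[float, float]:
    """Precision and recall of the candidates kept at a given confidence threshold."""
    kept = [correct for score, correct in candidates if score >= threshold]
    total_correct = sum(1 for _, correct in candidates if correct)
    true_positives = sum(kept)
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / total_correct if total_correct else 0.0
    return precision, recall


# Assumed thresholds for the three staffing levels discussed in the text.
THRESHOLDS: Dict[str, float] = {
    "many_curators": 0.3,  # many candidates, checked by hand
    "few_curators": 0.6,   # fewer candidates, higher precision
    "no_curators": 0.9,    # only the most confident entries
}

candidates: List[Candidate] = [
    (0.95, True), (0.85, True), (0.70, False), (0.65, True),
    (0.40, True), (0.35, False), (0.20, False), (0.10, True),
]

report = {mode: precision_recall(candidates, t) for mode, t in THRESHOLDS.items()}
```

Raising the threshold keeps fewer candidates, so precision rises while recall falls, which is the behavior a base without curators would select.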

What’s in a Base: An Examination of FlyBase

   For many reasons several of the fields in FlyBase use structured controlled vocabularies (aka
ontologies). This makes it much easier (and more robust) to make links within the database, as
well as making it much easier to search the database for information. Moreover, several of these
controlled vocabularies are shared with other databases, and this provides a degree of integration
between them. The controlled vocabularies are only implemented in certain fields in FlyBase.
The initial literature selection is done at FlyBase at Cambridge University while the bulk of the
literature curation is done at FlyBase at Harvard University to populate the gene models in the
database from highlighted facts in the literature articles [8].

    Controlled vocabularies currently used by FlyBase are [www.flybase.org]:
    • The Gene Ontology (GO). This provides structured controlled vocabularies for the
        annotation of gene products (although FlyBase annotates genes with GO terms, as a
        surrogate for their products). The GO has three domains: the molecular function of gene
        products, the biological process in which they are involved and their cellular component.
    • Anatomy. A structured controlled vocabulary of the anatomy of Drosophila
        melanogaster, used for the description of phenotypes and where a gene is expressed.
    • Development. A structured controlled vocabulary of the development of Drosophila
        melanogaster, used for the description of phenotypes and when a gene is expressed.
    • The Sequence Ontology (SO). A structured controlled vocabulary for sequence
        annotation, for the exchange of annotation data and for the description of sequence
        objects in databases. FlyBase describes the genome in a consistent and rigorous manner.
    All of these structured controlled vocabularies are in the same format, that used by the Open
Biomedical Ontology group. This format is called the OBO format [www.obo.org].
    These controlled vocabularies focus on the most important types of data for genome
databases, namely “gene”, “anatomy”, and types of “function” such as “development” [37]. The
factoids in the official database are relations on these datatypes, such as Interaction (gene-gene),
Expression (gene-anatomy), Function (gene-development). When a FlyBase curator records a
factoid, they also record the type of evidence that enables them to judge its correctness. The list
for genes is as below. Note this implies that even manual curation includes different factoids at
different qualities: whether a relation is true depends on the level of evidence chosen.
    The Gene Ontology Guide to GO Evidence Codes contains comprehensive descriptions of
the evidence codes used in GO annotation. FlyBase uses the following evidence codes when
assigning GO data: inferred from mutant phenotype (IMP), inferred from genetic interaction
(IGI), inferred from direct assay (IDA), inferred from physical interaction (IPI), inferred from
expression pattern (IEP), inferred from sequence or structural similarity (ISS), inferred from
electronic annotation (IEA), inferred from reviewed computational analysis (RCA), traceable
author statement (TAS), non-traceable author statement (NAS), inferred by curator (IC), no
biological data available (ND). Note some of these are observational and some computational.
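
As a sketch of how evidence codes could drive the level of quality, the codes listed above can be held in a lookup table and used to filter factoids. The grouping into "observational", "computational", and "statement" is our own illustrative assumption and should be checked against the GO evidence code guide; the factoids are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# code -> (description from the list above, assumed evidence group)
EVIDENCE_CODES: Dict[str, Tuple[str, str]] = {
    "IMP": ("inferred from mutant phenotype", "observational"),
    "IGI": ("inferred from genetic interaction", "observational"),
    "IDA": ("inferred from direct assay", "observational"),
    "IPI": ("inferred from physical interaction", "observational"),
    "IEP": ("inferred from expression pattern", "observational"),
    "ISS": ("inferred from sequence or structural similarity", "computational"),
    "IEA": ("inferred from electronic annotation", "computational"),
    "RCA": ("inferred from reviewed computational analysis", "computational"),
    "TAS": ("traceable author statement", "statement"),
    "NAS": ("non-traceable author statement", "statement"),
    "IC": ("inferred by curator", "statement"),
    "ND": ("no biological data available", "statement"),
}

# A factoid here is (gene, term, evidence code); the names are placeholders.
Factoid = Tuple[str, str, str]


def codes_in(group: str) -> List[str]:
    """All evidence codes assigned to one group, in alphabetical order."""
    return sorted(code for code, (_, g) in EVIDENCE_CODES.items() if g == group)


def filter_factoids(factoids: List[Factoid], allowed_groups: Set[str]) -> List[Factoid]:
    """Keep only factoids whose evidence code belongs to an allowed group."""
    return [f for f in factoids if EVIDENCE_CODES[f[2]][1] in allowed_groups]


factoids: List[Factoid] = [
    ("geneA", "termX", "IDA"),
    ("geneB", "termY", "IEA"),
    ("geneC", "termZ", "TAS"),
]
strict = filter_factoids(factoids, {"observational"})
```

A base that wants only experimentally supported entries would keep the `strict` subset, while a base that tolerates lower quality would allow more groups.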


3. CURATOR ASSISTANT SYSTEM

  Biocuration [17] is the process of extracting facts from the biological literature to populate a
database about gene function. The curators at the Model Organism Databases (MODs) read
input papers from scientific literature relevant to their organism and extract facts judged to be
correct, which are then used to populate the structured fields of their genome database. There are
currently 10 reference genomes, each with their own group of curators. These groups are falling
behind, with the current scale of literature, and new resource genomes are being denied custom
curator support, due to financial limitations.
  In the 5-year BeeSpace project just ending with NSF BIO FIBR funding, we have been
working closely with FlyBase curators to better understand what can be automated within the
biocuration process. We are fortunate in collaborating with John MacMullen from the Graduate
School of Library and Information Science, who specializes in studying the process of
biocuration by analyzing the detailed activities of MOD curators. He is analyzing the curator
annotations in FlyBase, among others, by examining which sentences are highlighted in the texts
and which database entries are inferred from these. Through the BeeSpace project, we also work
with the many curators at the FlyBase project under PI William Gelbart at Harvard University
and the few curators at the BeeBase project under PI Christine Elsik at Georgetown University.
  The Group Manager at FlyBase-Cambridge (England), Steven Marygold, provided the Figure
below giving the steps in the FlyBase curation process. He spoke at the ABC working meeting in
December 2007 hosted at our project home site in the Institute for Genomic Biology at the
University of Illinois, slides at www.beespace.uiuc.edu/files/Marygold-ABC.ppt .




  Figure 1. FlyBase Literature Curation Process Diagram [27].

   The automatic process set up in the Curator Assistant is modeled after this manual process.
The user could be a full biocurator or could be a community member research biologist, thus
differently tuning the system to their needs. They search the literature to choose articles. The
manual curator can only choose tens of articles to skim, but the assisted curator can choose
thousands of articles to be automatically skimmed. The BeeSpace system that the Curator
Assistant leverages contains powerful services for choosing collections well targeted to the
particular purpose, including searching and clustering. The major strength of the automatic
system is breadth: it can cover a much wider selection of the available literature than
humans can. When we demonstrated the prototype to many curators at the Arthropod Genomics
Symposium, even the most professional curators spoke longingly of having an automatic system
to filter candidates, in order to attempt to cope with the full range of biological literature.
   The Curator Assistant will focus on the middle of the diagram, the central core of the curation
process. This process highlights the curatable material and then performs curation, which is
basically finding sentences with functional information and extracting the facts that are described
by the functional sentences. For example, two genes interact with each other (Interaction), a
gene is expressed in a specific part of the anatomy (Expression), a gene regulates a particular
behavior (Function). Key information is usually contained within the abstract, which is why our
current services are effective, even though they cover only Medline and Biological Abstracts.
The manual curators have the advantage of reading the fulltext, so we will be also gathering
fulltext systematically for our community, through the collaboration technology described below.
   For the bottom of the diagram, the Curator Assistant will also support error checking of
different kinds by the community curators themselves and by the community biologists
themselves, as described in the later section on Community Annotation and Curation. Finally,
through an arrangement with the GMOD consortium (Generic Model Organism Database
software), who support the GBrowse genome sequence displayer and the CHADO database
schema format, we will be distributing our literature infrastructure software to the broader
genome community to supplement the existing sequence infrastructure software. The concluding
section below on Organization and Schedule contains further details on GMOD relations.
   The underlying system uses natural language processing to extract relevant entities and
relations automatically from relevant literature. An entity is a noun phrase representing a
community datatype, e.g. gene name or body part. A relation is a verb phrase representing the
action performed by an entity, e.g. gene A regulates behavior B in organism C. Many projects
extract entities and relations, using template rules for a particular domain. The BeeSpace project
pioneered trained adaptive entity recognition, where sample sentences are used to train the
recognizer for particular entities with high accuracy and software adapts the training to related
domains automatically [18,19]. We will be leveraging this NSF BIO project, which ends
in August 2009 before the proposed project would begin. We also leverage our previous
NSF research in digital libraries on interactive support for literature curation [4,22].
   The first prototype within the BeeSpace system has already become a production service, with
streamlined v4 interface available at www.beespace.uiuc.edu . The Gene Summarizer was the
subject of an accepted plenary talk at the 2nd International Biocurator Meeting in San Jose in
October 2007 [34]. The Gene Summarizer has two stages: the first highlights the curatable
materials while the second curates these materials in a usable interactive form [25,26]. The
highlighting is tuned for recall, so that sentences containing gene names are automatically
extracted from the literature abstracts, where the entity “gene” is broadly recognized, including
genes, proteins, and gene-like descriptions. The curation is simpler than what is proposed for
the Curator Assistant but is very effective for practicing biologists who use the interactive
system, where each gene sentence is placed automatically into a functional category.
     The first version of this service used a machine learning approach that was trained on the
curator generated sentences from FlyBase, explaining why the curator had entered a particular
factoid into the FlyBase relational database. PI Schatz of BeeSpace then visited PI Gelbart of
FlyBase at Harvard and observed the curator process at length. A reciprocal visit by a FlyBase
curator, Sian Giametes, to BeeSpace refined the automatic process and the functional categories.
We then also did specific training with new sentences judged by bee biologists at University of
Illinois and beetle biologists at Kansas State University. A subsequent version was developed
using this training with much higher accuracy than previous dictionary-based versions.
     Figures 2 and 3 give examples of using the Gene Summarizer with this insect training on a
Drosophila fly gene and on a Tribolium beetle gene. There are more fly papers than beetle
papers, so the number of highlighted sentences is naturally greater for the fly gene. The functional categories
are: Gene Products (GP), Expression Location (EL), Sequence Information (SI), Wild-type
Function & Phenotypic Information (WFPI), Mutant Phenotype (MP), Genetic Interaction (GI).




   Figure 2. Gene Summarization for Automatic Curation on FlyBase collection.




   Figure 3. Gene Summarization for Automatic Curation on BeetleBase collection.


4. CURATOR ASSISTANT PROTOTYPE

    After integrating the Gene Summarizer in BeeSpace v3, we developed a prototype BeeSpace
v5 that specifically extracted entities and relations from literature. This has deeper curation,
recognizing within a highlighted sentence what entities and relations are mentioned. The
extractors were tuned for precision to produce “correct” factoids, rather than the previous
extractors that were tuned for recall to produce comprehensive coverage of all entities present.
From this, it became clear that the level of precision and recall was a tunable feature of machine
learning and thus it would be feasible to support varying qualities for different purposes.
    The precision v5 system was an important prototype for the Curator Assistant, as it showed
that accurate automatic extraction was technically possible. The first version leveraged the
relations within FlyBase and was run on the Drosophila collection of standard articles that we
obtained through collaboration from FlyBase at Indiana University where the software
development is done. The high precision used disambiguation algorithms that enabled
identification of which gene was mentioned. For v3 recall, “wingless” was a particular text
phrase but for v5 precision, the same word was a particular gene number. Thus, accurate
linkouts became possible: a recognized gene entity can jump directly to the FlyBase gene
entry for that name and an anatomy entity can jump directly to the FlyBase anatomical hierarchy.
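
The step from a text phrase to a gene identifier can be sketched as a synonym lookup. This is only a minimal illustration and not the disambiguation algorithms used in v5: the synonym table, the `GENE:` identifiers, and the link prefix are placeholders that we introduce for the example, not real FlyBase accessions or URLs.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical synonym dictionary: lower-cased surface form -> gene identifier.
GENE_SYNONYMS: Dict[str, str] = {
    "wingless": "GENE:0001",
    "wg": "GENE:0001",
    "hedgehog": "GENE:0002",
    "hh": "GENE:0002",
}

# Placeholder prefix standing in for a genome database gene page.
LINK_PREFIX = "https://example.org/gene/"


def find_gene_mentions(sentence: str) -> List[Tuple[str, str]]:
    """Return (surface form, gene identifier) for every token found in the dictionary."""
    mentions = []
    for token in re.findall(r"[A-Za-z0-9\-]+", sentence):
        gene_id = GENE_SYNONYMS.get(token.lower())
        if gene_id is not None:
            mentions.append((token, gene_id))
    return mentions


def linkout(gene_id: str) -> str:
    """Build the link for a resolved gene identifier."""
    return LINK_PREFIX + gene_id


mentions = find_gene_mentions("Wingless and wg signaling regulate hh expression.")
links = [linkout(gene_id) for _, gene_id in mentions]
```

Here "Wingless" and "wg" resolve to the same identifier, which is what makes a single accurate linkout possible for different surface forms.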
    Figure 4 contains a sample output from the v5 prototype on the Drosophila fly collection.
Multiple word phrases are recognized correctly for gene in green, for anatomy in orange, for
behavior in blue, and for chemical in yellow. (Tags are correct if this figure is displayed in color.)
Anatomy is dictionary-based, just like gene, using the FlyBase anatomy terms as the base. The
function terms in the categories of behavior and chemical were extracted using heuristics of
certain key words. There was another set of function terms for development, the other category
used in FlyBase, but not many terms were identified with our simple heuristics. Figure 5 shows that
the recognized gene is linked to its corresponding correct gene database entry in FlyBase.

    In the proposed project, for entities, we will focus on gene, anatomy, and function
(combining behavior, anatomy, development). For relations, we will focus on different
combinations of these such as Interaction (Gene-Gene), Expression (Gene-Anatomy), Function
(Gene-Behavior etc). We will leverage existing resources for dictionary generation, such as gene
names from NCBI Entrez Gene [www.ncbi.nlm.nih.gov/sites/entrez?db=gene] and anatomy
names from FlyBase [http://flybase.org/static_pages/anatomy/glossary.html]. The relational
indexes in Biological Abstracts include gene and anatomy, providing a rich source of entities
tagged by human curators from phrases in biological literature. FlyMine [www.flymine.org] is a
rich source of query relations, including multistep inferences extracted from FlyBase. We will
also leverage available resources to obtain training data or pseudo training data. In particular,
BioCreative studies [16,29] have resulted in a valuable training set, which we have already used
in gene recognition. Fixed template systems such as Textpresso [30] have hand-generated rules
useful for constructing features in our learning-based framework.
   For the proposed project, we plan to do extensive training to improve the precision of the
dictionaries and of the heuristics, to automatically identify sentence slots for particular entities.
This process greatly improved our previous efforts for entity summarization, as discussed above.
To achieve better results, the community curators can supplement the dictionaries with local
gene names or anatomy names. The next section is a technical discussion of the training
procedures and how such tuning can be feasibly implemented.



Figure 4. Preliminary Work from BeeSpace Prototype v5. Interactive System for
Entity Relations using FlyBase relational database for leverage, with live linkouts.




Figure 5. FlyBase Gene entry (manual) linked to from Curator Assistant (automatic).


We have tried running the Drosophila trained v5 extractors on Tribolium literature, since few
beetle genes have direct names but commonly use the fly gene names. The anatomy is also not
identical but similar in many ways. This process sometimes produces good results as shown in
Figure 6. This version is the initial attempt at a general system for arthropods using prototype
classification: the closer the organism is to the prototype fly, the more accurate the recognition.




   Figure 6. Entity Relation v5 on Beetle Tribolium literature. This still uses the FlyBase
training, so it is not as accurate as a trained system would be, but it still produces some useful outputs.

    We are currently extracting from a large insect collection from the Biological Abstracts
database. PI Schatz is giving an invited lecture in December 2009 at the annual meeting of the
Entomological Society of America (ESA) on "Computer support for community knowledge:
information technologies for insect biologists to automatically annotate their molecular
information" and will demonstrate the evolved version of this prototype. coPI Gilbert is giving
an invited talk in the same session on Integrative Physiological and Molecular Insect Systems.
He works on the arthropod water flea, a good test of machine learning for entity anatomy.

              PROJECT SCHEDULE FOR CURATOR ASSISTANT
 Year 1. Develop v1 leveraging FlyBase (based on BeeSpace v5). Deploy to BeetleBase.
 Year 2. Develop v2 with Trained Recognizers. Deploy to BeetleBase and wFleaBase.
 Year 3. Develop v3 with Community Curation. Deploy to entire ABC, including Hymenoptera
 and Lepidoptera genome databases without curators and VectorBase with curators.




5. ENTITY RELATION EXTRACTION

    This project proposes that it is feasible to apply advanced machine learning and natural
language processing techniques to extract various biological entities and relations with tunable
extraction results in a sustainable way through leveraging the increasing amount of training data
from annotations naturally accumulated over time. This sustainability is illustrated in Figure 7.
     The main technical component is the trainable and tunable extractor. This extractor can
automatically process large amounts of literature and identify relevant entities and relations
that can become candidate factoids for curation. The extracted results would then be validated
by human curators or anyone with appropriate expertise for validation. The validated results
can be incorporated into structured databases for researcher query or analysis tools to further
process. The growing amount of validated entities and relations naturally serves as additional
training data for the extractor, leading to “organic” improvement of extraction performance
over time.

   Figure 7. Extraction Process for Assistant, where Curator tunes the Dictionaries and the Training.

    The extractor is trainable due to the use of a machine learning approach to extraction as
opposed to the traditional rule-based approaches. This means that the extractor can learn over
time from the human-validated extraction results to improve its extraction accuracy; the more
training data we have, the better the accuracy of extraction will be. Thus as we accumulate more
and more entities and relations, the Curator Assistant would become more and more intelligent
and powerful, being able to replace more and more of the human labor. Thus, the extractor
would become more and more scalable to handle large amounts of literature automatically.
    The extractor is tunable due to a combination of high-precision techniques such as
dictionary lookup and rule-based recognition with high-recall enhancement from statistical
learning. Informally, our idea is that we can first use dictionary lookup and/or rule-based
methods to obtain a small amount of highly accurate extraction results and then feed these results
as (pseudo) training data to a learning-based extractor to train the extractor to extract more
results, thus increase recall. A learning-based extractor also generally has parameters to control
the tradeoff of precision and recall, making it possible to tune the system to output either fewer
results with higher precision or more results with higher recall but potentially lower precision.
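
The first step of this idea, turning dictionary matches into (pseudo) training data, can be sketched as follows. This is a simplified illustration under our own assumptions: candidates are single tokens, and every token outside the dictionary is treated as a negative example, which produces noisy negatives that a real system would need to handle more carefully.

```python
import re
from typing import List, Set, Tuple


def pseudo_label(sentences: List[str], gene_dictionary: Set[str]) -> List[Tuple[str, int]]:
    """Label each token 1 if it matches the dictionary and 0 otherwise."""
    examples = []
    for sentence in sentences:
        for token in re.findall(r"[A-Za-z0-9\-]+", sentence):
            label = 1 if token.lower() in gene_dictionary else 0
            examples.append((token, label))
    return examples


# Hypothetical one-entry dictionary and sentence, for illustration only.
examples = pseudo_label(["The wingless gene is expressed."], {"wingless"})
positives = [token for token, label in examples if label == 1]
```

The labeled pairs would then be passed to a learning-based extractor, which can generalize beyond the dictionary and so increase recall.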
    This trainable and tunable extractor will be implemented based on a general learning
framework for information extraction, in which all resources, including dictionaries, human-
generated rules, and existing annotations, can be integrated in a principled way. The basic idea of
using machine learning [1] for extraction is to cast the extraction problem as a classification
problem. For example, for entity extraction, the task would be to classify a candidate phrase as
either being a particular type of entity (e.g., gene) or not, while for relation extraction, the
classification task can be to classify a sentence as either containing a particular relation (e.g.,
gene interaction) or not. The prediction is based on a function that combines various features that
describe an instance (i.e., a phrase or a sentence) in a weighted manner. For example, for gene
prediction, features can include any clue that can potentially help make the prediction: these
can be local syntactic features such as whether the phrase has capitalized
letters, whether there are parentheses or Greek letters, whether there is a hyphen, or contextual
features such as whether the word “gene” or “expressed” occurs in a small window around the
phrase. These features can be combined to generate a score as basis for the prediction. The exact
way to combine the features and to make the decision would vary from method to method [1].
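
The kinds of features listed above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `phrase_features`, the window size, the cue-word list, and the example sentence are hypothetical and not part of the proposed system.

```python
# Lowercase Greek letters, used for a simple surface-form check.
GREEK = set("αβγδεζηθικλμνξοπρστυφχψω")
# Context words taken from the examples in the text.
CONTEXT_CUES = {"gene", "expressed"}


def phrase_features(tokens, start, end, window=3):
    """Binary features for the candidate phrase tokens[start:end]."""
    phrase = " ".join(tokens[start:end])
    left = tokens[max(0, start - window):start]
    right = tokens[end:end + window]
    context = {t.lower().strip(".,;:()") for t in left + right}
    return {
        "has_capital": float(any(c.isupper() for c in phrase)),
        "has_paren": float("(" in phrase or ")" in phrase),
        "has_greek": float(any(c in GREEK for c in phrase.lower())),
        "has_hyphen": float("-" in phrase),
        "cue_in_window": float(bool(context & CONTEXT_CUES)),
    }


# Hypothetical sentence; the candidate phrase is the single token "Adh-1".
tokens = "The Adh-1 gene is expressed in the fat body".split()
features = phrase_features(tokens, 1, 2)
```

Each feature is a number, so the resulting dictionary can be turned directly into the feature values f1(X), ..., fk(X) used by a classifier.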
     For example, a commonly used effective classifier is based on logistic regression [1,18]. It
works as follows. Let X be a candidate phrase and f1(X), f2(X), …, fk(X) be k feature values
computed on X; e.g., f1(X)=1 (or 0) can indicate that the first letter of X is (or not) capitalized.
Let Y ∈{0,1} be a binary variable indicating whether X is a gene. The logistic regression
classifier assumes that Y and the features are related through the parameterized function:
$$
p(Y = 1 \mid X, \beta_1, \ldots, \beta_k)
  = \frac{\exp\left(\sum_{i=1}^{k} \beta_i f_i(X)\right)}{1 + \exp\left(\sum_{i=1}^{k} \beta_i f_i(X)\right)}
  \propto \exp\left(\sum_{i=1}^{k} \beta_i f_i(X)\right)
$$


where the β’s are parameters, learned from training data, that control the weights on the features.
      Given any instance X, we can use the formula above to compute p(Y=1|X), and thus can
predict X to be a gene if p(Y=1|X)> p(Y=0|X) (i.e., p(Y=1|X)>0.5), and a non-gene otherwise.
The training data will be of the form of a pair (Xj, Yj) where Xj is a phrase and Yj ∈{0,1} is the
correct prediction for Xj; thus a pair like (Xj, Yj=1) would mean that phrase Xj should be predicted
as a gene, while a pair like (Xj, Yj=0) would mean that phrase Xj should be predicted as not a
gene. In general, we will have many such training pairs, which tell us the expected predictions
for various instances. With a set of such training data {(Xj, Yj)}, j=1,…,n, in the training phase,
we would optimize the parameters (i.e., β’s) to minimize the prediction errors on the training
data. Intuitively, this is to figure out the best settings for these β’s so that ideally for all training
pairs where Yj=1, p(Yj=1| Xj) would be larger than 0.5, while for those where Yj=0, p(Yj=1| Xj)
would be smaller than 0.5.
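
A minimal Python sketch of this training and prediction procedure is given below. It assumes that the β’s are fitted by gradient ascent on the log-likelihood of the training pairs, which is one common way to train logistic regression; the text above does not commit to a particular optimizer. The feature layout, training pairs, step count, and learning rate are hypothetical.

```python
import math


def predict_proba(beta, f):
    """p(Y=1 | X) for feature values f under weights beta.

    Uses 1 / (1 + exp(-z)), which equals exp(z) / (1 + exp(z)).
    """
    z = sum(b * x for b, x in zip(beta, f))
    return 1.0 / (1.0 + math.exp(-z))


def train(data, k, steps=2000, rate=0.5):
    """Fit beta by gradient ascent on the average log-likelihood."""
    beta = [0.0] * k
    for _ in range(steps):
        grad = [0.0] * k
        for f, y in data:
            err = y - predict_proba(beta, f)
            for i in range(k):
                grad[i] += err * f[i]
        beta = [b + rate * g / len(data) for b, g in zip(beta, grad)]
    return beta


# Hypothetical training pairs (feature values, label).
# Features: [constant bias term, first letter capitalized, cue word nearby].
data = [
    ([1.0, 1.0, 1.0], 1),
    ([1.0, 0.0, 1.0], 1),
    ([1.0, 1.0, 0.0], 0),
    ([1.0, 0.0, 0.0], 0),
]
beta = train(data, k=3)

# Predict "gene" when p(Y=1|X) > 0.5.
predictions = [int(predict_proba(beta, f) > 0.5) for f, _ in data]
```

On this small, separable example the fitted weights reproduce all four training labels, which is the behavior described above for pairs with Yj=1 and Yj=0.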
     Although we used gene prediction as an example to illustrate the idea of this kind of learning
approach, it is clear that the same method can be used for recognizing other entities as well as
relations if X is a candidate sentence and Y indicates whether a certain relation is expressed in X.
There are many other classifiers [1] such as SVM and k-nearest neighbors that we can also use;
they all work in a similar way – using training data to optimize a combination of features for
making a prediction.
     A significant advantage of such a learning-based approach over the traditional rule-based
approach (as used in, e.g., the Textpresso system [30]) is that it can keep improving its
performance through leveraging the naturally growing curated database as training data, thus
gradually reducing the need for human effort over time. Indeed, such supervised learning
methods have already been applied successfully for information extraction from biology
literature (see, e.g., [3,9,12,28,35,36,43] ) and many other tasks such as text categorization and
hand-written character recognition.
    Such a learning-based method relies on the availability of two critical resources: (1) training
data; (2) computable effective features. The more training data and the more useful features we
have, the higher the accuracy of extraction will be. Unfortunately, these two resources
are not always readily available to us. Below we discuss how we can apply advanced machine
learning and NLP techniques to solve these two challenges.

Insufficient training data: All the human-generated annotations are naturally available high
quality training data, but for a new genome, we may not have many or any annotations available,
creating a problem of “cold start”. We solve this problem using three strategies:

1. “Borrow” training data from related model organisms that have already been well annotated
through the use of domain adaptation techniques [18,19,20]. For example, our previous work
shows that cross-domain validation (placing more emphasis on features that work well across multiple
domains) can lead to an improvement in the accuracy of extracting genes from a BioCreative test
set [16] by up to 40% [18].

2. Bootstrap with a small number of manually created rules to generate pseudo training
examples (e.g., by assuming that all the cases matched by a rule are correct predictions). This
is a general and powerful way to improve recall, and can thus be expected to be very useful when
we want to tune toward high recall starting from high-precision results. For example, a small set of
human-generated rules can be used for extraction with high accuracy; the generated high
precision results can then be used to train a classifier, which would be able to augment the
extraction results to improve recall. In our previous study, this technique has also been shown to
be very effective when combined with domain adaptation [20].

Figure 8 shows some sample results from using the
pseudo training data automatically generated from
entries in a FlyBase table for genetic interaction
relation recognition. Different curves correspond to
using different combinations of features. The best
performing curve uses all the words in a sentence
as features. Note that this top curve also shows that
it is possible to tune the extractor to produce either
high-precision low-recall results or low-precision
high-recall results by applying a different cutoff
threshold to a ranked list of predictions.

Figure 8. Relation Extractor with Tunable Precision-Recall depending on thresholds.

3. In the worst case, we will resort to human annotators to generate a small number of
high-quality training examples with minimum effort using active learning techniques, which allow
us to choose the most useful examples for a human annotator to work on. The basic idea is to ask
a human expert to judge the cases about which our classifier is most uncertain; we can expect the
classifier to learn the most from the correct labels for such uncertain cases. There are many
active learning techniques that we can apply [7,10,40].
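
Strategies 2 and 3 can be illustrated with a short Python sketch. The rule, sentences, candidate phrases, and probabilities below are hypothetical, and uncertainty sampling is used here as one simple instance of active learning; the cited techniques [7,10,40] are more elaborate.

```python
import re

# Hypothetical high-precision rule: a symbol immediately followed by "gene".
RULE = re.compile(r"\b([A-Za-z][A-Za-z0-9-]*) gene\b")


def pseudo_label(sentences):
    """Strategy 2: treat every rule match as a correct (positive) example."""
    return [(m.group(1), 1) for s in sentences for m in RULE.finditer(s)]


def most_uncertain(candidates, prob):
    """Strategy 3: pick the candidate whose p(Y=1|X) is closest to 0.5."""
    return min(candidates, key=lambda x: abs(prob(x) - 0.5))


sentences = [
    "The foraging gene affects behavior.",
    "Expression of the Adh gene was reduced.",
]
pseudo_training = pseudo_label(sentences)

# Hypothetical classifier outputs for three candidate phrases.
scores = {"for": 0.52, "white": 0.97, "the": 0.03}
ask_expert_about = most_uncertain(list(scores), scores.get)
```

The pseudo-labeled pairs can be fed to a learning-based extractor as training data, while the most uncertain candidate is the one a human annotator would be asked to judge next.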

Insufficient effective features: Some entities and relations are easier to extract than others; for
example, organisms are easier to extract than genes because the former is usually restricted to a
closed set of vocabulary while the latter is not. For most entities, we expect that the standard
features defined based on surface forms of words and contextual words around a phrase would
be sufficiently effective for prediction. However, for difficult cases, we may need to extend the
existing feature construction methods to define and extract additional effective features for a
specific entity or relation. We will solve this problem using two strategies:

1. Systematically generate more sophisticated linguistic features based on syntactic and semantic
structures (e.g., dependency relations between words determined by a parser). To improve the
effectiveness of features, it is useful to consider more discriminative features than words. To this
end, we will parse text to obtain syntactic and semantic structures of sentences and
systematically generate a large space of linguistically meaningful features that can potentially
capture more semantic relations and are more discriminative. In our previous study [21], we have
proposed a graph representation that enables systematic enumeration of linguistic features, and
our study has found that using a combination of features of different granularity can improve
performance for relation extraction. In this project, we will apply this methodology to enable the
classifier to work on a large space of features.

2. Involve human experts in the loop of learning so that when the system makes a mistake, the
expert can pinpoint the exact feature responsible for the error; this way, the system can
effectively improve the quality of features through human feature supervision. For example, in
some previous experiments, we have discovered that dictionary-based approaches to gene name
recognition are unable to distinguish a gene abbreviation such as “for” from the common
preposition “for”.

Figure 9. Sample gene name disambiguation results

Thus if we just add a feature to the classifier to indicate whether the phrase occurs in a
dictionary, we may potentially misrecognize a preposition like “for” as a gene name. To solve
this problem, we designed a special classifier aimed at disambiguating such cases based on the
distribution patterns of words in the nearby text. The results in Figure 9 show that this technique
can successfully distinguish all the occurrences of “foraging” and “for” (the numbers are the
scores given by the classifier; a positive number indicates a gene, while a negative number a
non-gene). The output from such a disambiguation classifier can be regarded as a high-level
feature that can be fed into a general gene recognizer to tune the classifier toward high precision.
    Note that we take a very broad view of features, which makes our framework quite general.
Thus, in addition to leveraging all kinds of training data, we can also incorporate a variety of
other useful resources such as dictionaries and human-generated rules through defining
appropriate features (e.g., a feature can correspond to whether an instance matches a particular
rule or an entry in a dictionary), effectively leveraging results from existing work. The
extraction of entities and relations from biomedical literature has been studied extensively
(see, e.g., [3,5,6,9,11-15,23-24,28,30-31,35,36,38-39,41-43]), including in our own previous work
(e.g., [18-21]). Our framework would enable us to leverage and combine the findings and
resources from all these previous studies to perform large-scale information extraction. For
example, we can obtain a wide range of useful features from previous work and various
strategies for optimizing extraction accuracy.
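
The idea of treating dictionaries and rules as features can be sketched in Python as follows. The dictionary entries, the regular-expression rule, and the helper names are hypothetical; the sketch only shows how such resources can be mapped to feature values for a classifier.

```python
import re


def make_dictionary_feature(entries):
    """Feature that fires when the phrase is an entry in a dictionary."""
    lowered = {e.lower() for e in entries}
    return lambda phrase: 1.0 if phrase.lower() in lowered else 0.0


def make_rule_feature(pattern):
    """Feature that fires when the whole phrase matches a hand-written rule."""
    compiled = re.compile(pattern)
    return lambda phrase: 1.0 if compiled.fullmatch(phrase) else 0.0


feature_functions = [
    # Hypothetical gene dictionary containing a name and its abbreviation.
    make_dictionary_feature({"foraging", "for"}),
    # Hypothetical rule: letters, an optional hyphen, then digits.
    make_rule_feature(r"[A-Za-z]+-?\d+"),
]


def feature_vector(phrase):
    """Feature values f1(X), ..., fk(X) for a candidate phrase X."""
    return [f(phrase) for f in feature_functions]


vec_for = feature_vector("for")
vec_symbol = feature_vector("abc-12")
vec_other = feature_vector("the")
```

Note that the dictionary feature fires for “for” whether it is a gene abbreviation or a preposition, which is why the output of a disambiguation classifier is useful as an additional feature.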




6. COMMUNITY ANNOTATION AND CURATION

     The Community itself will eventually have to take over the curator role, with interactive
analysis to enable scientists to use the infrastructure to infer biological functions and infer
semantic relationships. Today's new genome projects are efforts contributed by many experts
and students, supported and enabled by distributed data sets, wiki project notebooks, genome
maps, annotation and search tools. These projects are not supported in a monolithic way, but via
contributions by biologists at nearly as many institutions as the hundreds of individual labs.
     For example, more than 400 biologists contributed gene annotations to the Daphnia genome
[17]. This is on the same scale as attendance at the Arthropod Genomics Symposium, but for a
single arthropod, so the number of potential contributors to ArthropodBaseConsortium
annotations clearly runs to the tens of thousands. Each of these is a potential curator, given
effective infrastructure from the Curator Assistant. See the Collaboration Wikis for the Daphnia
Genomics Consortium [https://dgc.cgb.indiana.edu/display/DGC/] and for the Aphid Genomics
Consortium [https://dgc.cgb.indiana.edu/display/aphid/] for examples from arthropod genomes.
     This is a new model of sustainable scientific activity, with cost-effective collaboration via
widely adopted cyberinfrastructure. Experts and students in focus areas are actively involved,
and contribute according to their means and interest. They join from disparate areas of basic and
applied sciences, educational, governmental, and industry centers (e.g. Daphnia and Aphid
genomes involve EPA and USDA agencies, agricultural and environmental businesses).
     We will develop infrastructure to address collaboration support for community annotation.
By providing tunable quality for biological factoids, we provide an automatic system to filter the
literature for curatable knowledge. In current gene annotation systems, such as Apollo
distributed by GMOD, the curator is presented with a blank form in which to write a gene
description. In the Curator Assistant, they are presented with candidate suggestions, thus greatly
expanding the number of persons who can serve as effective curators. We will also provide
mechanisms for the community to enter their own documents as published into the base
collections for the system, yielding a rich source of full-text articles, and to directly provide their
own factoids from their articles, without the inaccuracy of automatic entity-relation extraction.
     Currently, the most popular collaboration tools are wikis. While a wiki excels at simplicity
and flexibility, it lacks validation tools, rich indexing and social instrumentation. We propose to
develop structured social instrumentation for collaborative research environments, including
collaborative curation. In particular, our systems will allow users to offer confidence ratings for
human annotations and for various automated metadata extracts presented to the users. The
users themselves will gain expert status when their annotations receive high confidence ratings.
These ratings and rankings will allow researchers to share expertise and enhance the precision of
automated annotation systems in a mutually-beneficial way with secure transactions.
     A relevance rating system will be integrated in the basic functioning of the system itself.
Every view of information (entities, relations, abstracts, document lists) will also include
checkboxes to up-rate or reject/dismiss any listed elements. For example, community members
can judge the quality of the factoids viewed during their usage of the system. Items which are
selected and viewed receive increased relevance ratings. Data items which are
dismissed/rejected are down-rated in relevance and/or validity. The rating system is not
optional: It is transparently embedded within the user experience, which is key to its success.
This model of relevance feedback and validity ratings embedded within the core system has
proven effective in popular commercial social network systems such as YouTube and LastFM.
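
One way such an embedded rating mechanism could work is sketched below in Python. The class name, the method names, and the numeric increments are hypothetical choices made for illustration; the design above does not specify them.

```python
from collections import defaultdict


class RelevanceRatings:
    """Keeps a relevance score per item from implicit and explicit feedback."""

    def __init__(self, view_boost=1.0, uprate_boost=2.0, dismiss_penalty=2.0):
        self.scores = defaultdict(float)
        self.view_boost = view_boost
        self.uprate_boost = uprate_boost
        self.dismiss_penalty = dismiss_penalty

    def viewed(self, item_id):
        """Implicit feedback: the item was selected and viewed."""
        self.scores[item_id] += self.view_boost

    def uprated(self, item_id):
        """Explicit feedback: the user ticked the up-rate checkbox."""
        self.scores[item_id] += self.uprate_boost

    def dismissed(self, item_id):
        """Explicit feedback: the user rejected or dismissed the item."""
        self.scores[item_id] -= self.dismiss_penalty

    def ranked(self):
        """Item ids ordered from most to least relevant."""
        return sorted(self.scores, key=lambda i: -self.scores[i])


ratings = RelevanceRatings()
ratings.viewed("factoid-a")
ratings.uprated("factoid-a")
ratings.dismissed("factoid-b")
ratings.viewed("factoid-c")
order = ratings.ranked()
```

Because viewing an item already updates its score, feedback is collected as a side effect of normal use rather than as a separate step.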



7. PROJECT ORGANIZATION AND SCHEDULE

   Our project has been organized via the annual Symposium of the Arthropod Base Consortium
(ABC). This is sponsored by the Arthropod Genomics Center at Kansas State University with
coPI Brown as Director. There have been 3 symposia held thus far in Kansas City, drawing
300-400 attendees, generally representatives of their research laboratories or genome projects
(http://www.k-state.edu/agc/symp2009/). The steering committee for the ABC meets after the
workshop to plan community support; this proposal grew out of these planning meetings.
   There have also been specific meetings of the inner circle, 30-40 attendees, once or twice a
year at the main infrastructure sites such as FlyBase. The BeeSpace project hosted the one in
December 2007 at the University of Illinois; the slides for this workshop are at
http://www.beespace.uiuc.edu/groups_abc.php . The investigators for this proposal each spoke
at this meeting, along with the Head of Literature Curation for FlyBase Cambridge. The
proposed project will host a budgeted annual specialty workshop to plan Curator Assistant.
    The genome databases being used as test models in this project have already bypassed the
use of professional curators. They are coming in later than the post-MOD wave, such as honey
bee, where a case for a few curators was eventually successful after many grant attempts. So
BeetleBase for Tribolium the flour beetle and wFleaBase for Daphnia the water flea employ a
few biologists and programmers to help with sequencing support and computational pipelines.
The coPIs who lead the bioinformatics for these, respectively Susan Brown and Donald Gilbert,
are influential proponents of the new paradigm for community curation via annotation software.
    This proposal is concerned with developing an effective Curator Assistant and testing it to
evolve to full utility. The infrastructure investigators will develop the software infrastructure,
Schatz leading the informatics system development and Zhai leading the computer science
research. These were the same roles they played in the BIO FIBR BeeSpace project, which
developed interactive services for functional analysis using computer science research. The
bioinformatics investigators will serve as the initial users; each is the lead for the informatics of a
major community of arthropod biologists with several hundred community members. Tribolium
is an insect close to Drosophila, while Daphnia is a non-insect arthropod far from Drosophila.
The close BeeSpace collaboration with FlyBase will be continued, with both the curator site at
Harvard with PI Bill Gelbart and the software site at Indiana with PI Thom Kaufman.
   Deployment to the full ABC and beyond will begin towards the end of the project. The groups
already identified coordinate multiple related databases. They will form the next wave of deployment
after the investigator organisms are effectively using the Curator Assistant. Their coordinators
have expressed great interest while serving on the ABC steering committee. NIH-supported
VectorBase has many curators for mosquitos and ticks, USDA-supported HymenopteraBase has
few curators for bees and wasps, LepidopteraBase has no curators for butterflies and moths.
There is also an international collaboration for AphidBase hosted at INRA in France.
   The GMOD (Generic Model Organism Database) consortium is a bioinformatics group that
provides common infrastructure for over 100 genome projects, including all the ABC genomes
[www.gmod.org/wiki/GMOD_Users]. We have presented our preliminary software at GMOD
meetings [32], using RESTful protocols for linking Genome Browser to Gene Summarizer, and
made arrangements with the coordinator Scott Cain to link our software into GMOD for mass
distribution, during extensive conversations at the GMOD meetings and the ABC meetings. So
the Curator Assistant will become the literature infrastructure for ABC, just as GBrowse is the
sequence infrastructure, and through GMOD made available to the genome biology community.



References Cited
[1] Bishop C (2007) Pattern Recognition and Machine Learning, Springer, 2007.
[2] Buell J, Stone D, Naeger N, Fahrbach S, Bruce C, Schatz B (2009) Experiencing BeeSpace:
   Educational Explorations in Behavioral Genomics for High School and Beyond, AAAS Annual
   Symposium, Chicago, Feb 2009. curricular materials at www.beespace.uiuc.edu/ebeespace
[3] Chang J, Schutze H, Altman R (2004) GAPSCORE: finding gene and protein names one
   word at a time, Bioinformatics, 20(2):216-25.
[4] Chung Y, Pottenger W, Schatz B (1998) Automatic Subject Indexing using an Associative
   Neural Network, 3rd Int ACM Conf on Digital Libraries, Pittsburgh, PA, Jun, pp 59-68.
   Nominated for Best Paper award.
[5] Cohen A (2005) Unsupervised gene/protein entity normalization using automatically
   extracted dictionaries, Proc BioLINK2005 Workshop Linking Biological Literature,
   Ontologies and Databases: Mining Biological Semantics. Detroit, MI: Association for
   Computational Linguistics; 2005:17-24.
[6] Daraselia N, Yuryev A, Egorov S, Novichkova S, Nikitin A, Mazo I (2004) Extracting
   human protein interactions from MEDLINE using a full-sentence parser, Bioinformatics,
   20(5):604-11, 2004.
[7] Dasgupta S, Tauman Kalai A, Monteleoni C (2005) Analysis of perceptron-based active
   learning, Proceedings of COLT 2005, 249-263, 2005.
[8] Drysdale R, Crosby M, FlyBase Consortium (2005) FlyBase: genes and gene models,
   Nucleic Acids Research, 33:D390-D395, Database Issue, doi:10.1093/nar/gki046.
[9] Finkel J, Dingare S, Manning C, Nissim M, Alex B, Grover C (2005) Exploring the
   boundaries: gene and protein identification in biomedical text, BMC Bioinformatics, 6 Suppl
   1:S5, 2005.
[10] Freund Y, Seung H, Shamir E, Tishby N (1997) Selective sampling using the query by
   committee algorithm, Machine Learning, 28(2-3):133-168.
[11] Fukuda K, Tamura A, Tsunoda T, Takagi T (1998) Toward information extraction:
   identifying protein names from biological papers, Pac Symp Biocomput, pp 707-18.
[12] Hakenberg J, Bickel S, Plake C, Brefeld U, Zahn H, Faulstich L, Leser U, Scheffer T (2005)
   Systematic feature evaluation for gene name recognition, BMC Bioinformatics, 6 Suppl
   1:S9, 2005.
[13] Hanisch D, Fundel K, Mevissen H, Zimmer R, Fluck J (2005) ProMiner: rule-based protein
   and gene entity recognition, BMC Bioinformatics, 6(Suppl 1):S14,
   doi:10.1186/1471-2105-6-S1-S14.
[14] Hatzivassiloglou V, Duboue P, Rzhetsky A (2001) Disambiguating proteins, genes, and rna
   in text: a machine learning approach, Bioinformatics, 17 Suppl 1:S97-S106.
[15] Hirschman L, Park J, Tsujii J, Wong L, Wu C (2002) Accomplishments and challenges in
   literature data mining for biology, Bioinformatics, 18(12):1553-1561.
[16] Hirschman L, Yeh A, Blaschke C, Valencia A (2005) Overview of BioCreAtIvE: critical
   assessment of information extraction for biology, BMC Bioinformatics, 6(Suppl 1):S1,
   doi:10.1186/1471-2105-6-S1-S1.
[17] Howe D, Costanzo M, Fey P, et al. (2008) Big data: The future of biocuration, Nature 455:
   47-50; doi:10.1038/455047a.
[18] Jiang J, Zhai C (2006) Exploiting Domain Structure for Named Entity
   Recognition, Proceedings of HLT/NAACL 2006.
[19] Jiang J, Zhai C (2007) Instance weighting for domain adaptation in NLP, Proceedings of
  the 45th Annual Meeting of the Association for Computational Linguistics (ACL'07), 264-271.
[20] Jiang J, Zhai C (2007) A Two-Stage Approach to Domain Adaptation for Statistical
  Classifiers, Proc 16th ACM International Conference on Information and Knowledge
  Management (CIKM'07), pp 401-410.
[21] Jiang J, Zhai C (2007) A Systematic Exploration of The Feature Space for Relation
  Extraction, Proc Human Language Technologies: Annual Conference of the North American
  Chapter of the Association for Computational Linguistics (NAACL-HLT 2007), pp 113-120.
[22] Johnson E, Schatz B, Cochrane P (1996) Interactive Term Suggestion for Users of Digital
  Libraries: Using Subject Thesauri and Co-occurrence Lists for Information Retrieval, Proc
  Digital Libraries '96: 1st ACM Intl Conf on Digital Libraries, March, Bethesda, MD.
[23] Kazama J, Makino T, Ohta Y, Tsujii J (2002) Tuning SVM for biomedical named entity
  recognition, Proc workshop on NLP in the biomedical domain, 2002.
[24] Kulick S and others (2004) Integrated Annotation for Biomedical Information Extraction,
  Proc HLT-NAACL 2004 Workshop on Linking Biological Literature, Ontologies and
  Databases, pp 61-68.
[25] Ling X, Jiang J, He X, Mei Q, Zhai C, Schatz B (2006) Automatically generating gene
  summaries from biomedical literature, Proc Pacific Symposium on Biocomputing, pp 40-51.
[26] Ling X, Jiang J, He X, Mei Q, Zhai C, Schatz B (2007) Generating gene summaries from
  biomedical literature: A study of semi-structured summarization, Information Processing and
  Management, 43: 1777-1791.
[27] Marygold S (2007) Genetic Literature Curation at FlyBase-Cambridge, presentation at
  ArthropodBaseConsortium working group meeting at University of Illinois, Dec 2007.
  www.beespace.uiuc.edu/files/Marygold-ABC.ppt
[28] Mika S, Rost B (2004) Protein names precisely peeled off free text, Bioinformatics, 20
  Suppl. 1:241-247, 2004.
[29] Morgan A, Hirschman L (2007) Overview of BioCreative II Gene Normalization, Proc of
  the Second BioCreative Challenge Evaluation Workshop. Madrid, Spain: CNIO; 2007:17-27.
[30] Muller H, Kenny E, Sternberg P (2004) Textpresso: an ontology-based information retrieval
  and extraction system for biological literature, PLoS Biology 2004 Nov; 2(11) e309.
  doi:10.1371/journal.pbio.0020309 pmid:15383839. www.textpresso.org
[31] Narayanaswamy M, Ravikumar K, Vijay-Shanker K (2003) A biological named entity
  recognizer, Proc Pacific Symposium on Biocomputing, pp 427-38.
[32] Sanders B, Arcoleo D, Schatz B (2008) BeeSpace Navigator Integration with GMOD
  GBrowse, 9th annual Bioinformatics Open Source Conference (BOSC 2008), Toronto, ON,
  Canada. www.beespace.uiuc.edu/files/BOSC2008_v3.ppt
[33] Schatz B (2002) Building Analysis Environments: Beyond the Genome and the Web,
  invited essay for Trends and Controversies section about Mining Information for Functional
  Genomics, IEEE Intelligent Systems 17: 70-73 (May/June 2002).
[34] Schatz B (2007) Gene Summarizer: Software for Automatically Generating Structured
  Summaries from Biomedical Literature, accepted plenary Presentation to 2nd International
  Biocurator Meeting, San Jose. www.canis.uiuc.edu/~schatz/Biocurator.GeneSummarizer.ppt
[35] Settles B (2005) ABNER: an open source tool for automatically tagging genes, proteins and
  other entity names in text, Bioinformatics, 21(14):3191-3192, 2005.
[36] Skounakis M, Craven M, Ray S (2003) Hierarchical hidden markov models for information
  extraction, Proc of the 18th International Joint Conference on Artificial Intelligence, 2003.
[37] Sokolowski M (2001) Drosophila: genetics meets behaviour. Nature Reviews Genetics,
  2(11), 2001.
[38] Srinivasan P, Libbus B (2004) Mining Medline for implicit links between dietary substances
  and diseases, Bioinformatics, 20 Suppl. 1:290-296, 2004.
[39] Tanabe L, Wilbur W (2002) Tagging gene and protein names in biomedical text,
  Proceedings of the workshop on NLP in the biomedical domain, 2002.
[40] Tong S, Koller D (2001) Support vector machine active learning with applications to text
  classification, Journal of Machine Learning Research, 2:45-66, 2001.
[41] Tsuruoka Y, Tsujii J (2003) Boosting precision and recall of dictionary-based protein name
  recognition, Proc ACL 2003 workshop on Natural language processing in biomedicine, pp
  41-48, Morristown, NJ.
[42] Tuason O, Chen L, Liu H, Blake J, Friedman C (2004) Biological nomenclatures: A source
  of lexical knowledge and ambiguity, Proc Pacific Symposium on Biocomputing 9, pp 238-249.
[43] Zhou G, Shen D, Zhang J, Su J, Tan S (2005) Recognition of protein/gene names from text
  using an ensemble of classifiers, BMC Bioinformatics, 6 Suppl 1:S7, 2005.




ARDA-Insider-BAA03-0..ARDA-Insider-BAA03-0..
ARDA-Insider-BAA03-0..
 
mathnightinfo.docx - Anne Arundel County Public Schools
mathnightinfo.docx - Anne Arundel County Public Schoolsmathnightinfo.docx - Anne Arundel County Public Schools
mathnightinfo.docx - Anne Arundel County Public Schools
 

Similar a ABIcurator.doc

Web Apollo at Genome Informatics 2014
Web Apollo at Genome Informatics 2014Web Apollo at Genome Informatics 2014
Web Apollo at Genome Informatics 2014
Monica Munoz-Torres
 
Developing Frameworks and Tools for Animal Trait Ontology (ATO)
Developing Frameworks and Tools for Animal Trait Ontology (ATO) Developing Frameworks and Tools for Animal Trait Ontology (ATO)
Developing Frameworks and Tools for Animal Trait Ontology (ATO)
Jie Bao
 
Munoz torres web-apollo-workshop_exeter-2014_ss
Munoz torres web-apollo-workshop_exeter-2014_ssMunoz torres web-apollo-workshop_exeter-2014_ss
Munoz torres web-apollo-workshop_exeter-2014_ss
Monica Munoz-Torres
 
Greene Bosc2008
Greene Bosc2008Greene Bosc2008
Greene Bosc2008
bosc_2008
 
How Bio Ontologies Enable Open Science
How Bio Ontologies Enable Open ScienceHow Bio Ontologies Enable Open Science
How Bio Ontologies Enable Open Science
drnigam
 

Similar a ABIcurator.doc (20)

Eumicrobedb - Oomycetes Genomics Database
Eumicrobedb - Oomycetes Genomics Database Eumicrobedb - Oomycetes Genomics Database
Eumicrobedb - Oomycetes Genomics Database
 
Web Apollo at Genome Informatics 2014
Web Apollo at Genome Informatics 2014Web Apollo at Genome Informatics 2014
Web Apollo at Genome Informatics 2014
 
Using the Semantic Web to Support Ecoinformatics
Using the Semantic Web to Support EcoinformaticsUsing the Semantic Web to Support Ecoinformatics
Using the Semantic Web to Support Ecoinformatics
 
Developing Frameworks and Tools for Animal Trait Ontology (ATO)
Developing Frameworks and Tools for Animal Trait Ontology (ATO) Developing Frameworks and Tools for Animal Trait Ontology (ATO)
Developing Frameworks and Tools for Animal Trait Ontology (ATO)
 
Munoz torres web-apollo-workshop_exeter-2014_ss
Munoz torres web-apollo-workshop_exeter-2014_ssMunoz torres web-apollo-workshop_exeter-2014_ss
Munoz torres web-apollo-workshop_exeter-2014_ss
 
Protocols for genomics and proteomics
Protocols for genomics and proteomics Protocols for genomics and proteomics
Protocols for genomics and proteomics
 
bioinformatics enabling knowledge generation from agricultural omics data
bioinformatics enabling knowledge generation from agricultural omics databioinformatics enabling knowledge generation from agricultural omics data
bioinformatics enabling knowledge generation from agricultural omics data
 
Web Apollo: Lessons learned from community-based biocuration efforts.
Web Apollo: Lessons learned from community-based biocuration efforts.Web Apollo: Lessons learned from community-based biocuration efforts.
Web Apollo: Lessons learned from community-based biocuration efforts.
 
Introduction to Web Apollo for the i5K pilot species.
Introduction to Web Apollo for the i5K pilot species.Introduction to Web Apollo for the i5K pilot species.
Introduction to Web Apollo for the i5K pilot species.
 
Semantic Technologies at FAO
Semantic Technologies at FAOSemantic Technologies at FAO
Semantic Technologies at FAO
 
VectorBase - PopGenBase Meeting at ASTMH08
VectorBase - PopGenBase Meeting at ASTMH08VectorBase - PopGenBase Meeting at ASTMH08
VectorBase - PopGenBase Meeting at ASTMH08
 
Greene Bosc2008
Greene Bosc2008Greene Bosc2008
Greene Bosc2008
 
Introduction to Ontologies for Environmental Biology
Introduction to Ontologies for Environmental BiologyIntroduction to Ontologies for Environmental Biology
Introduction to Ontologies for Environmental Biology
 
Intro bioinfo
Intro bioinfoIntro bioinfo
Intro bioinfo
 
Intro bioinfo
Intro bioinfoIntro bioinfo
Intro bioinfo
 
David
DavidDavid
David
 
How Bio Ontologies Enable Open Science
How Bio Ontologies Enable Open ScienceHow Bio Ontologies Enable Open Science
How Bio Ontologies Enable Open Science
 
Web Apollo Workshop UIUC
Web Apollo Workshop UIUCWeb Apollo Workshop UIUC
Web Apollo Workshop UIUC
 
The agricultural ontology service
The agricultural ontology serviceThe agricultural ontology service
The agricultural ontology service
 
Biodiversity Informatics: An Interdisciplinary Challenge
Biodiversity Informatics: An Interdisciplinary ChallengeBiodiversity Informatics: An Interdisciplinary Challenge
Biodiversity Informatics: An Interdisciplinary Challenge
 

Más de butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
butest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
butest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
butest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
butest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
butest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
butest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
butest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
butest
 
Facebook
Facebook Facebook
Facebook
butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
butest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
butest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
butest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
butest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
butest
 

Más de butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

ABIcurator.doc

  • 1. Developing a Curator Assistant for Functional Analysis of Genome Databases Requesting $1,451,005 from NSF BIO Advances in Biological Informatics, August 2009 PI: Bruce Schatz, Institute for Genomic Biology, University of Illinois at Urbana-Champaign coPI: ChengXiang Zhai, Computer Science, University of Illinois, LanguageTechnology coPI: Susan Brown, Biology, Kansas State University, ArthropodBaseConsortium (BeetleBase) coPI: Donald Gilbert, Bioinformatics, Indiana University, Community Annotation (wFleaBase) Intellectual Merit The advent of next-generation sequencing is rapidly decreasing the cost of genomes. Projections are that costs will decrease from $1M to $100K to $10K in the next 5 to 10 years. As a result, the number of arthropod genomes will increase from 10 to 1000 to 10,000. This shifts the major limitation from sequencing to annotating. The current level of annotation is recognizing genes from sequences, rather than understanding the function of genes. Traditionally, functional analysis has been performed by human curators who read biological literature to provide evidence for a genome database of gene function such as FlyBase. To functionally analyze a genome, biologists develop an ortholog pipeline, where the genes of their organism have the orthologs computed and these used to find the most similar gene in a model organism database. This process is inexpensive, but inaccurate, compared to manual curators. We propose to develop a Curator Assistant that will enable the communities that are generating genomes to analyze the function of their genes by themselves. While the model organism databases (MODs) have groups of curators, subsequent genome databases have struggled to find funding for even a single human curator. Such bases will have to be curated by the communities themselves, by community biologists using software infrastructure to help them extract functions from community literature. 
Within the Arthropod Base Consortium (ABC), for example, only FlyBase is a MOD with professional curators. During the NSF-funded BeeSpace project, we developed prototype software for automatically extracting entities and relations from biological literature. The entities include genes, anatomy, and behavior, while the relations include interaction (gene-gene), expression (gene-anatomy), and function (gene-behavior). These entities and relations can be used to populate relational tables to build a genome database. Our prototype works on Drosophila literature and leverages FlyBase, the MOD for the ABC. Our techniques appear general enough for all arthropods. We propose to develop a full-fledged Curator Assistant that fully utilizes machine learning technologies for natural language processing. These include community dictionaries, heuristic procedures, and training sets. Given the community collection with relevant literature, the assistant software suggests candidate relations that the community biologists can select from. Providing additional knowledge is much easier than reading biological literature, and mechanisms are provided to specify the level of quality desired and to revise the information itself. Broader Impact Our project has been organized via the annual Symposium of the Arthropod Base Consortium. Our investigators include the BeeSpace PI for informatics and the Symposium organizer for biology, representing arthropod genomes in particular and animal genomes in general. Our project will develop language technology for entity-relation semantics into usable infrastructure and distribute it through GMOD, which already provides the sequence support used by ABC. We will develop the standards for literature support for customized extraction and curation, including practical deployment to a distributed community of NSF-funded genome biologists.
  • 2. 1. GENOME SEQUENCING AND BIOCURATION The advent of next-generation sequencing is rapidly decreasing the cost of genomes. Projections are that costs will decrease from $1M to $100K to $10K in the next 5 to 10 years. As a result, the number of arthropod genomes will increase from 10 to 1000 to 10,000. This shifts the major limitation from sequencing to annotating. The current level of annotation is recognizing genes from sequences, rather than understanding the function of genes. Traditionally, functional analysis has been performed by human curators who read biological literature to provide evidence for a genome database of gene function such as FlyBase. To functionally analyze a genome, biologists develop an ortholog pipeline, where the genes of their organism have the orthologs computed and these are used to find the most similar gene in a model organism database. This process is inexpensive, but inaccurate, compared to manual curation. We propose to develop a Curator Assistant that will enable the communities that are generating genomes to analyze the function of their genes by themselves. While the model organism databases (MODs) have groups of curators, subsequent genome databases have struggled to find funding for even a single human curator. Such bases will have to be curated by the communities themselves, by community biologists using software infrastructure to help them extract functions from community literature. Within the Arthropod Base Consortium (ABC), for example, FlyBase is the only MOD with professional curators. 
During the NSF-funded BeeSpace project, we developed prototype software for functional analysis [33], by automatically extracting entities and relations from biological literature. The entities include genes, anatomy, and behavior, while the relations include interaction (gene-gene), expression (gene-anatomy), and function (gene-behavior). These entities and relations can be used to populate relational tables to build a genome database. Our prototype currently works on Drosophila literature and leverages FlyBase, the MOD for the ABC. Our techniques are general enough for all arthropods, an important taxon of organisms for NSF biologists. We propose to develop a full-fledged Curator Assistant that fully utilizes machine learning technologies for natural language processing. These include community dictionaries, heuristic procedures, and training sets. Given the community collection with relevant literature, the assistant software suggests candidate relations that the community biologists can select from. Providing additional knowledge is much easier than reading biological literature, and mechanisms are provided to specify the level of quality desired and to revise the information itself. We will debug the new system on existing bases such as BeetleBase and wFleaBase, then deploy more widely to the full bases of the Arthropod Base Consortium as it grows. The software will be general enough to be widely applicable for genome databases. We will use the GMOD (Generic Model Organism Database) consortium as the distribution mechanism for our literature curation software, to complement their existing software for sequence curation.
  • 3. 2. GENOME DATABASES AND BIOCURATION The Curator Assistant will initially focus on arthropod genomes, as organisms of central interest to NSF. At least half of the described species of living animals are arthropods (jointed legs, mostly insects), species of great scientific interest for molecular genetics and evolutionary synthesis. The Arthropod Base Consortium (ABC) has been meeting quarterly for the past 4 years, to discuss their needs for genome databases and data analysis. The inner circle has about 40 scientists, who hold workshops at the major community sites. The outer circle has about 400 scientists, who attend the Annual Symposium [www.k-state.edu/agc/symposium.shtml]. This community consortium currently includes some 10 resource genomes, including insects important biologically (bee, beetle, butterfly, aphid), crustaceans important ecologically (water flea), and vectors important for human diseases (mosquito, tick, louse). There is a reference genome for this community, the fruit fly Drosophila melanogaster, which has been a genetic model for 100 years. As the model insect, Drosophila is important enough to justify a 40-person staff at FlyBase, who manually curate this model organism database (MOD). Through a close collaboration, the FlyBase literature curation process is serving as the model for our semantic indexing of biological literature; see Figure 1 below. The first wave of genomes was of the model genetic organisms; these MODs already had Bases with human curators. For the arthropods, the only MOD is FlyBase for the insect Drosophila melanogaster. The second wave of genomes did not have decades of genetics, but were attempting to jumpstart with genome sequencing. For arthropods, these include the insects honey bee and flour beetle, both important scientifically and agriculturally. The corresponding bases, e.g. BeeBase and BeetleBase, were able to gain modest funding, but not for professional curators, only for postdocs and programmers. 
Such resources thus went into annotating genes of particular interest (small numbers) or support of automatic processing (large numbers). With the third wave, the sequencing is still done at genome centers, but no attempts are made at manual curation. These Bases, e.g. wFleaBase for Daphnia, spend their limited resources on community annotation and computation. Beyond the third wave, the sequencing is being done at campus centers rather than national centers, and any curation is done automatically with quality enhancement by the community itself. Within ABC, ButterflyBase and AphidBase are down this path and will be working with our group as their genomes mature. The 10,000 arthropod genomes expected in the next decade will all be in the post-curator era. From a technology standpoint, this implies that the Curator Assistant must support variable levels of quality, because different bases from different waves will do different amounts of post-assistant quality improvement. With many curators, the system should generate many candidates that can be manually checked by human experts. With few curators, the system should generate few candidates for manual checking, thus higher precision and lower recall. With no curators, the system should generate only the highest-precision “correct” entries, which are annotated by the community itself using collaboration technology. In the preliminary work performed in the BeeSpace project described below, we developed prototype services tuned towards recall and towards precision, indicating feasibility of developing a fully tunable system for curation quality. What’s in a Base: An Examination of FlyBase For many reasons, several of the fields in FlyBase use structured controlled vocabularies (aka ontologies). This makes it much easier (and more robust) to make links within the database, as
  • 4. well as making it much easier to search the database for information. Moreover, several of these controlled vocabularies are shared with other databases, and this provides a degree of integration between them. The controlled vocabularies are only implemented in certain fields in FlyBase. The initial literature selection is done at FlyBase at Cambridge University while the bulk of the literature curation is done at FlyBase at Harvard University to populate the gene models in the database from highlighted facts in the literature articles [8]. Controlled vocabularies currently used by FlyBase are [www.flybase.org]: • The Gene Ontology (GO). This provides structured controlled vocabularies for the annotation of gene products (although FlyBase annotates genes with GO terms, as a surrogate for their products). The GO has three domains: the molecular function of gene products, the biological process in which they are involved and their cellular component. • Anatomy. A structured controlled vocabulary of the anatomy of Drosophila melanogaster, used for the description of phenotypes and where a gene is expressed. • Development. A structured controlled vocabulary of the development of Drosophila melanogaster, used for the description of phenotypes and when a gene is expressed. • The Sequence Ontology (SO). A structured controlled vocabulary for sequence annotation, for the exchange of annotation data and for the description of sequence objects in databases. FlyBase describes the genome in a consistent and rigorous manner. All of these structured controlled vocabularies are in the same format, that used by the Open Biomedical Ontology group. This format is called the OBO format [www.obo.org] . These controlled vocabularies focus on the most important types of data for genome databases, namely “gene”, “anatomy”, and types of “function” such as “development” [37]. 
The factoids in the official database are relations on these datatypes, such as Interaction (gene-gene), Expression (gene-anatomy), Function (gene-development). When a FlyBase curator records a factoid, they also record the type of evidence that enables them to judge its correctness. The list for genes is as below. Note this implies that even manual curation includes different factoids at different qualities: whether a relation is true depends on the level of evidence chosen. The Gene Ontology Guide to GO Evidence Codes contains comprehensive descriptions of the evidence codes used in GO annotation. FlyBase uses the following evidence codes when assigning GO data: inferred from mutant phenotype (IMP), inferred from genetic interaction (IGI), inferred from direct assay (IDA), inferred from physical interaction (IPI), inferred from expression pattern (IEP), inferred from sequence or structural similarity (ISS), inferred from electronic annotation (IEA), inferred from reviewed computational analysis (RCA), traceable author statement (TAS), non-traceable author statement (NAS), inferred by curator (IC), no biological data available (ND). Note some of these are observational and some computational. 3. CURATOR ASSISTANT SYSTEM Biocuration [17] is the process of extracting facts from the biological literature to populate a database about gene function. The curators at the Model Organism Databases (MODs) read input papers from scientific literature relevant to their organism and extract facts judged to be correct, which are then used to populate the structured fields of their genome database. There are currently 10 reference genomes, each with their own group of curators. These groups are falling
  • 5. behind, with the current scale of literature, and new resource genomes are being denied custom curator support, due to financial limitations. In the 5-year BeeSpace project just ending with NSF BIO FIBR funding, we have been working closely with FlyBase curators to better understand what can be automated within the biocuration process. We are fortunate in collaborating with John MacMullen from the Graduate School of Library and Information Science, who specializes in studying the process of biocuration by analyzing the detailed activities of MOD curators. He is analyzing the curator annotations in FlyBase, among others, by examining which sentences are highlighted in the texts and which database entries are inferred from these. Through the BeeSpace project, we also work with the many curators at the FlyBase project under PI William Gelbart at Harvard University and the few curators at the BeeBase project under PI Christine Elsik at Georgetown University. The Group Manager at FlyBase-Cambridge (England), Steven Marygold, provided the Figure below giving the steps in the FlyBase curation process. He spoke at the ABC working meeting in December 2007 hosted at our project home site in the Institute for Genomic Biology at the University of Illinois, slides at www.beespace.uiuc.edu/files/Marygold-ABC.ppt . Figure 1. FlyBase Literature Curation Process Diagram [27]. The automatic process set up in the Curator Assistant is modeled after this manual process. The user could be a full biocurator or could be a community member research biologist, thus differently tuning the system to their needs. They search the literature to choose articles. The manual curator can only choose tens of articles to skim, but the assisted curator can choose thousands of articles to be automatically skimmed. The BeeSpace system that the Curator Assistant leverages contains powerful services for choosing collections well targeted to the particular purpose, including searching and clustering. 
The major strength of the automatic system is breadth: it can cover a much wider selection of the available literature than humans can. In demonstrating the prototype to many curators at the Arthropod Genomics Symposium, even the most professional curators spoke longingly of having an automatic system to filter candidates, in order to attempt to cope with the full range of biological literature. The Curator Assistant will focus on the middle of the diagram, the central core of the curation process. This process highlights the curatable material and then performs curation; this is
  • 6. basically finding sentences with functional information and extracting the facts that are described by the functional sentences. For example, two genes interact with each other (Interaction), a gene is expressed in a specific part of the anatomy (Expression), a gene regulates a particular behavior (Function). Key information is usually contained within the abstract, which is why our current services are effective, even though they cover only Medline and Biological Abstracts. The manual curators have the advantage of reading the fulltext, so we will be also gathering fulltext systematically for our community, through the collaboration technology described below. For the bottom of the diagram, the Curator Assistant will also support error checking of different kinds by the community curators themselves and by the community biologists themselves, as described in the later section on Community Annotation and Curation. Finally, through an arrangement with the GMOD consortium (Generic Model Organism Database software), who support the GBrowse genome sequence displayer and the CHADO database schema format, we will be distributing our literature infrastructure software to the broader genome community to supplement the existing sequence infrastructure software. The concluding section below on Organization and Schedule contains further details on GMOD relations. The underlying system uses natural language processing to extract relevant entities and relations automatically from relevant literature. An entity is a noun phrase representing a community datatype, e.g. gene name or body part. A relation is a verb phrase representing the action performed by an entity, e.g. gene A regulates behavior B in organism C. Many projects extract entities and relations, using template rules for a particular domain. 
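To make the entity and relation terminology concrete, the following is a minimal sketch of a template rule of the kind described above, not the BeeSpace implementation. The dictionaries, the `Relation` type, the single verb template, and the example sentence are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical mini-dictionaries; a real system would load community vocabularies.
GENES = {"foraging", "period"}
BEHAVIORS = {"feeding behavior", "circadian rhythm"}


@dataclass(frozen=True)
class Relation:
    kind: str     # relation type, e.g. "Function" for gene-behavior
    subject: str  # the gene entity (noun phrase)
    verb: str     # the verb phrase linking the entities
    target: str   # the behavior entity (noun phrase)


# One illustrative template: "<gene> [gene] regulates|controls|affects <behavior>".
TEMPLATE = re.compile(
    r"\b(?P<gene>{genes})\b(?:\s+gene)?\s+"
    r"(?P<verb>regulates|controls|affects)\s+"
    r"(?P<behavior>{behaviors})\b".format(
        genes="|".join(map(re.escape, sorted(GENES))),
        behaviors="|".join(map(re.escape, sorted(BEHAVIORS))),
    ),
    re.IGNORECASE,
)


def extract_relations(sentence):
    """Return every Function relation matched by the template in one sentence."""
    return [
        Relation(
            "Function",
            m.group("gene").lower(),
            m.group("verb").lower(),
            m.group("behavior").lower(),
        )
        for m in TEMPLATE.finditer(sentence)
    ]


if __name__ == "__main__":
    print(extract_relations("The foraging gene regulates feeding behavior in flies."))
```

A fixed template like this is precise but brittle, since it only fires on the exact phrasing it encodes.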
The BeeSpace project pioneered trained adaptive entity recognition, where sample sentences are used to train the recognizer for particular entities with high accuracy and software adapts the training to related domains automatically [18,19]. We will be leveraging this NSF BIO project, which ends in August 2009 before the proposed project would begin. We also build on our previous NSF research in digital libraries on interactive support for literature curation [4,22]. The first prototype within the BeeSpace system has already become a production service, with streamlined v4 interface available at www.beespace.uiuc.edu . The Gene Summarizer was the subject of an accepted plenary talk at the 2nd International Biocurator Meeting in San Jose in October 2007 [34]. The Gene Summarizer has two stages: the first highlights the curatable materials while the second curates these materials in a usable interactive form [25,26]. The highlighting is tuned for recall, so that sentences containing gene names are automatically extracted from the literature abstracts, where the entity “gene” is broadly recognized, including genes, proteins, and gene-like descriptions. The curation is simpler than what is proposed for the Curator Assistant but is very effective for practicing biologists who use the interactive system, where each gene sentence is placed automatically into a functional category. The first version of this service used a machine learning approach that was trained on the curator-generated sentences from FlyBase, each explaining why the curator had entered a particular factoid into the FlyBase relational database. PI Schatz of BeeSpace then visited PI Gelbart of FlyBase at Harvard and observed the curator process at length. A reciprocal visit by a FlyBase curator, Sian Giametes, to BeeSpace refined the automatic process and the functional categories. 
We also did specific training with new sentences judged by bee biologists at the University of Illinois and beetle biologists at Kansas State University. A subsequent version was developed using this training, with much higher accuracy than previous dictionary-based versions. Figures 2 and 3 give examples of using the Gene Summarizer with this insect training on a Drosophila fly gene and on a Tribolium beetle gene. There are more fly papers than beetle papers, so the number of highlighted sentences is naturally greater for the fly gene. The functional categories
  • 7. are: Gene Products (GP), Expression Location (EL), Sequence Information (SI), Wild-type Function & Phenotypic Information (WFPI), Mutant Phenotype (MP), Genetic Interaction (GI). Figure 2. Gene Summarization for Automatic Curation on FlyBase collection. Figure 3. Gene Summarization for Automatic Curation on BeetleBase collection.
  • 8. 4. CURATOR ASSISTANT PROTOTYPE After integrating the Gene Summarizer in BeeSpace v3, we developed a prototype BeeSpace v5 that specifically extracts entities and relations from literature. This has deeper curation, recognizing within a highlighted sentence which entities and relations are mentioned. The extractors were tuned for precision to produce “correct” factoids, unlike the previous extractors, which were tuned for recall to produce comprehensive coverage of all entities present. From this, it became clear that the level of precision and recall is a tunable feature of machine learning, and thus that it would be feasible to support varying qualities for different purposes. The precision v5 system was an important prototype for the Curator Assistant, as it showed that accurate automatic extraction is technically possible. The first version leveraged the relations within FlyBase and was run on the Drosophila collection of standard articles that we obtained through collaboration with FlyBase at Indiana University, where the software development is done. The high precision relied on disambiguation algorithms that identify which gene is mentioned. For v3 recall, “wingless” was a particular text phrase, but for v5 precision, the same word was a particular gene number. Thus, accurate linkouts became possible: a recognized gene entity can jump directly to the FlyBase gene entry for that name, and an anatomy entity can jump directly to the FlyBase anatomical hierarchy. Figure 4 contains a sample output from the v5 prototype on the Drosophila fly collection. Multiple-word phrases are recognized correctly for gene in green, for anatomy in orange, for behavior in blue, and for chemical in yellow. (Tags are correct if this figure is displayed in color.) Anatomy is dictionary-based, just like gene, using the FlyBase anatomy terms as the base. The function terms in the categories of behavior and chemical were extracted using heuristics based on certain key words.
There was another set of function terms for development, the other category used in FlyBase, but not many terms were identified with our simple heuristics. Figure 5 shows that the recognized gene is linked to its corresponding correct gene database entry in FlyBase. In the proposed project, for entities, we will focus on gene, anatomy, and function (combining behavior, anatomy, development). For relations, we will focus on different combinations of these, such as Interaction (Gene-Gene), Expression (Gene-Anatomy), and Function (Gene-Behavior, etc.). We will leverage existing resources for dictionary generation, such as gene names from NCBI Entrez Gene [www.ncbi.nlm.nih.gov/sites/entrez?db=gene] and anatomy names from FlyBase [http://flybase.org/static_pages/anatomy/glossary.html]. The relational indexes in Biological Abstracts include gene and anatomy, providing a rich source of entities tagged by human curators from phrases in biological literature. FlyMine [www.flymine.org] is a rich source of query relations, including multistep inferences extracted from FlyBase. We will also leverage available resources to obtain training data or pseudo training data. In particular, BioCreative studies [16,29] have resulted in a valuable training set, which we have already used in gene recognition. Fixed template systems such as Textpresso [30] have hand-generated rules useful for constructing features in our learning-based framework. For the proposed project, we plan to do extensive training to improve the precision of the dictionaries and of the heuristics, and to automatically identify sentence slots for particular entities. This process greatly improved our previous efforts for entity summarization, as discussed above. To achieve better results, the community curators can supplement the dictionaries with local gene names or anatomy names. The next section is a technical discussion of the training procedures and how such tuning can be feasibly implemented.
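As a rough illustration of the dictionary-based recognition of multiple-word phrases described in this section, the following Python sketch tags tokens by greedy longest-match lookup. It is not the BeeSpace v5 implementation: the function name `tag_entities`, the `max_len` parameter, and the entries in `TOY_DICTIONARY` are hypothetical and are not taken from FlyBase or Entrez Gene.

```python
# Hypothetical toy dictionary mapping lowercase phrases to entity types.
# A real system would load gene and anatomy names from curated resources.
TOY_DICTIONARY = {
    "wingless": "gene",
    "wing": "anatomy",
    "wing disc": "anatomy",
}


def tag_entities(tokens, dictionary, max_len=4):
    """Greedy longest-match dictionary lookup over a token list.

    Returns a list of (start, end, phrase, entity_type) tuples, where
    tokens[start:end] is the matched phrase.
    """
    tags = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate phrase first so that "wing disc"
        # is preferred over the shorter entry "wing".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in dictionary:
                match = (i, i + n, phrase, dictionary[phrase])
                break
        if match:
            tags.append(match)
            i = match[1]  # continue after the matched phrase
        else:
            i += 1
    return tags


sentence = "wingless is expressed in the wing disc"
for start, end, phrase, entity_type in tag_entities(sentence.split(), TOY_DICTIONARY):
    print(start, end, phrase, entity_type)
```

A pure lookup of this kind cannot tell which gene an ambiguous name refers to, which is why the prototype pairs dictionaries with disambiguation and why the next section turns to trained extractors.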
  • 9. Figure 4. Preliminary Work from BeeSpace Prototype v5. Interactive System for Entity Relations using FlyBase relational database for leverage, with live linkouts. Figure 5. FlyBase Gene entry (manual) linked to from Curator Assistant (automatic).
  • 10. We have tried running the Drosophila-trained v5 extractors on Tribolium literature, since few beetle genes have names of their own and most commonly use the fly gene names. The anatomy is also not identical but similar in many ways. This process sometimes produces good results, as shown in Figure 6. This version is the initial attempt at a general system for arthropods using prototype classification: the closer the organism is to the prototype fly, the more accurate the recognition. Figure 6. Entity Relation v5 on Beetle Tribolium literature. This still uses the FlyBase training, so it is not as accurate as a trained system would be, but it still produces some useful outputs. We are currently extracting from a large insect collection from the Biological Abstracts database. PI Schatz is giving an invited lecture in December 2009 at the annual meeting of the Entomological Society of America (ESA) on "Computer support for community knowledge: information technologies for insect biologists to automatically annotate their molecular information" and will demonstrate the evolved version of this prototype. coPI Gilbert is giving an invited talk in the same session on Integrative Physiological and Molecular Insect Systems. He works on the arthropod water flea, a good test of machine learning for entity anatomy.
PROJECT SCHEDULE FOR CURATOR ASSISTANT
Year 1. Develop v1 leveraging FlyBase (based on BeeSpace v5). Deploy to BeetleBase.
Year 2. Develop v2 with Trained Recognizers. Deploy to BeetleBase and wFleaBase.
Year 3. Develop v3 with Community Curation. Deploy to entire ABC, including the Hymenoptera and Lepidoptera genome databases without curators and VectorBase with curators.
  • 11. 5. ENTITY RELATION EXTRACTION This project proposes that it is feasible to apply advanced machine learning and natural language processing techniques to extract various biological entities and relations, with tunable extraction results, in a sustainable way by leveraging the increasing amount of training data from annotations that naturally accumulate over time. This sustainability is illustrated in Figure 7. The main technical component is the trainable and tunable extractor. This extractor can automatically process large amounts of literature and identify relevant entities and relations that can become candidate factoids for curation. The extracted results would then be validated by human curators or anyone with appropriate expertise for validation. The validated results can be incorporated into structured databases for researchers to query or for analysis tools to process further. The growing amount of validated entities and relations naturally serves as additional training data for the extractor, leading to “organic” improvement of extraction performance over time. Figure 7. Extraction Process for Assistant, where Curator tunes the Dictionaries and the Training. The extractor is trainable due to the use of a machine learning approach to extraction, as opposed to the traditional rule-based approaches. This means that the extractor can learn over time from the human-validated extraction results to improve its extraction accuracy; the more training data we have, the better the accuracy of extraction will be. Thus, as we accumulate more and more validated entities and relations, the Curator Assistant would become more and more intelligent and powerful, being able to replace more and more of the human labor, and the extractor would become more and more scalable to handle large amounts of literature automatically.
The extractor is tunable due to a combination of high-precision techniques, such as dictionary lookup and rule-based recognition, with high-recall enhancement from statistical learning. Informally, our idea is that we can first use dictionary lookup and/or rule-based methods to obtain a small amount of highly accurate extraction results, and then feed these results as (pseudo) training data to a learning-based extractor, training it to extract more results and thus increasing recall. A learning-based extractor also generally has parameters to control the tradeoff of precision and recall, making it possible to tune the system to output either fewer results with higher precision or more results with higher recall but potentially lower precision. This trainable and tunable extractor will be implemented based on a general learning framework for information extraction, in which all resources, including dictionaries, human-generated rules, and existing annotations, can be integrated in a principled way. The basic idea of using machine learning [1] for extraction is to cast the extraction problem as a classification problem. For example, for entity extraction, the task would be to classify a candidate phrase as either being a particular type of entity (e.g., gene) or not, while for relation extraction, the classification task can be to classify a sentence as either containing a particular relation (e.g.,
  • 12. gene interaction) or not. The prediction is based on a function that combines, in a weighted manner, various features that describe an instance (i.e., a phrase or a sentence). For example, for gene prediction, features can include every possible clue that can potentially help make the prediction. Features can be local syntactic features, such as whether the phrase has capitalized letters, whether there are parentheses or Greek letters, or whether there is a hyphen, or contextual features, such as whether the word “gene” or “expressed” occurs in a small window around the phrase. These features can be combined to generate a score as the basis for the prediction. The exact way to combine the features and to make the decision varies from method to method [1]. For example, a commonly used effective classifier is based on logistic regression [1,18]. It works as follows. Let X be a candidate phrase and f1(X), f2(X), …, fk(X) be k feature values computed on X; e.g., f1(X)=1 (or 0) can indicate that the first letter of X is (or is not) capitalized. Let Y ∈{0,1} be a binary variable indicating whether X is a gene. The logistic regression classifier assumes that Y and the features are related through the parameterized function:
$$p(Y=1 \mid X, \beta_1, \ldots, \beta_k) = \frac{\exp\left(\sum_{i=1}^{k} \beta_i f_i(X)\right)}{1 + \exp\left(\sum_{i=1}^{k} \beta_i f_i(X)\right)} \propto \exp\left(\sum_{i=1}^{k} \beta_i f_i(X)\right)$$
where the β’s are parameters, learned from training data, that control the weights on all the features. Given any instance X, we can use the formula above to compute p(Y=1|X), and thus can predict X to be a gene if p(Y=1|X) > p(Y=0|X) (i.e., p(Y=1|X) > 0.5), and a non-gene otherwise. The training data will be in the form of pairs (Xj, Yj), where Xj is a phrase and Yj ∈{0,1} is the correct prediction for Xj; thus a pair like (Xj, Yj=1) would mean that phrase Xj should be predicted as a gene, while a pair like (Xj, Yj=0) would mean that phrase Xj should be predicted as not a gene.
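The classifier described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's extractor: the feature set, the constant bias feature (the formula in the text has no explicit intercept), the stochastic gradient training loop with its `lr` and `epochs` settings, and the four toy training pairs are all hypothetical choices made for the example.

```python
import math


def features(phrase, context):
    """Compute f_1(X)..f_k(X) for a candidate phrase X and its context window."""
    window = [w.strip(".,;()") for w in context.lower().split()]
    return [
        1.0,                                         # bias (assumed, not in the text)
        1.0 if phrase[:1].isupper() else 0.0,        # first letter capitalized
        1.0 if "-" in phrase else 0.0,               # contains a hyphen
        1.0 if "(" in phrase or ")" in phrase else 0.0,  # contains parentheses
        1.0 if "gene" in window or "expressed" in window else 0.0,  # context clue
    ]


def predict_proba(beta, f):
    """p(Y=1 | X) = exp(z) / (1 + exp(z)) with z = sum_i beta_i * f_i(X)."""
    z = sum(b * x for b, x in zip(beta, f))
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs, lr=0.5, epochs=200):
    """Fit beta by stochastic gradient ascent on the log-likelihood."""
    examples = [(features(phrase, ctx), y) for phrase, ctx, y in pairs]
    beta = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for f, y in examples:
            error = y - predict_proba(beta, f)
            beta = [b + lr * error * x for b, x in zip(beta, f)]
    return beta


def is_gene(beta, phrase, context):
    """Predict a gene when p(Y=1 | X) > 0.5."""
    return predict_proba(beta, features(phrase, context)) > 0.5


# Hypothetical training pairs (phrase X_j, context window, label Y_j).
TRAINING_PAIRS = [
    ("Wnt-1", "the gene Wnt-1 is expressed in", 1),
    ("wingless", "wingless gene regulates", 1),
    ("for", "searched for food", 0),
    ("The", "The larvae were counted", 0),
]

beta = train(TRAINING_PAIRS)
print(is_gene(beta, "Wnt-1", "the gene Wnt-1 is expressed in"))
print(is_gene(beta, "for", "searched for food"))
```

The same code applies to relation extraction if X is a candidate sentence and the features describe the sentence rather than a phrase.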
In general, we will have many such training pairs, which tell us the expected predictions for various instances. With a set of such training data {(Xj, Yj)}, j=1,…,n, in the training phase, we would optimize the parameters (i.e., the β’s) to minimize the prediction errors on the training data. Intuitively, this is to figure out the best settings for these β’s so that ideally, for all training pairs where Yj=1, p(Yj=1|Xj) would be larger than 0.5, while for those where Yj=0, p(Yj=1|Xj) would be smaller than 0.5. Although we used gene prediction as an example to illustrate the idea of this kind of learning approach, it is clear that the same method can be used for recognizing other entities, as well as relations if X is a candidate sentence and Y indicates whether a certain relation is expressed in X. There are many other classifiers [1], such as SVM and k-nearest neighbors, that we can also use; they all work in a similar way, using training data to optimize a combination of features for making a prediction. A significant advantage of such a learning-based approach over the traditional rule-based approach (as used in, e.g., the Textpresso system [30]) is that it can keep improving its performance by leveraging the naturally growing curated database as training data, thus gradually reducing the need for human effort over time. Indeed, such supervised learning methods have already been applied successfully for information extraction from biology literature (see, e.g., [3,9,12,28,35,36,43]) and for many other tasks such as text categorization and hand-written character recognition. Such a learning-based method relies on the availability of two critical resources: (1) training data; (2) computable effective features. The more training data and the more useful features we have, the higher the accuracy of extraction will be. Unfortunately, these two resources
  • 13. are not always readily available to us. Below we discuss how we can apply advanced machine learning and NLP techniques to solve these two challenges. Insufficient training data: All the human-generated annotations are naturally available high quality training data, but for a new genome, we may not have many or any annotations available, creating a problem of “cold start”. We solve this problem using three strategies: 1. “Borrow” training data from related model organisms that have already been well annotated through the use of domain adaptation techniques [18,19,20]. For example, our previous work shows that cross-domain validation (emphasizing more on features that work well for multiple domains) can lead to an improvement in the accuracy of extracting genes from a BioCreative test set [16] by up to 40% [18]. 2. Bootstrap with a small number of manually created rules to generate pseudo training examples (e.g., by assuming that all the matched cases with a rule are correct predictions). This is a general powerful idea to improve recall, thus can be expected to be very useful when we want to tune toward high recall based on high precision results. For example, a small set of human-generated rules can be used for extraction with high accuracy; the generated high precision results can then be used to train a classifier, which would be able to augment the extraction results to improve recall. In our previous study, this technique has also been shown to be very effective when combined with domain adaptation [20]. Figure 8 shows some sample results from using the pseudo training data automatically generated from entries in a FlyBase table for genetic interaction relation recognition. Different curves correspond to using different combinations of features. The best performing curve uses all the words in a sentence as features. 
Note that this top curve also shows that it is possible to tune the extractor to produce either high-precision low-recall results or low-precision high-recall results by applying a different cutoff threshold to a ranked list of predictions. Figure 8. Relation Extractor with Tunable Precision-Recall depending on thresholds. 3. In the worst case, we will resort to human annotators to generate a small number of high-quality training examples with minimum effort using active learning techniques, which allow us to choose the most useful examples for a human annotator to work on so as to minimize human effort. The basic idea is to ask a human expert to judge the cases on which our classifier is most uncertain; we can expect the classifier to learn most from the correct prediction for such uncertain cases. There are many active learning techniques that we can apply [7,10,40]. Insufficient effective features: Some entities and relations are easier to extract than others; for example, organisms are easier to extract than genes because the former are usually restricted to a closed vocabulary while the latter are not. For most entities, we expect that the standard features defined based on surface forms of words and contextual words around a phrase would
  • 14. be sufficiently effective for prediction. However, for difficult cases, we may need to extend the existing feature construction methods to define and extract additional effective features for a specific entity or relation. We will solve this problem using two strategies: 1. Systematically generate more sophisticated linguistic features based on syntactic and semantic structures (e.g., dependency relations between words determined by a parser). To improve the effectiveness of features, it is useful to consider more discriminative features than words. To this end, we will parse text to obtain syntactic and semantic structures of sentences and systematically generate a large space of linguistically meaningful features that can potentially capture more semantic relations and are more discriminative. In our previous study [21], we proposed a graph representation that enables systematic enumeration of linguistic features, and found that using a combination of features of different granularity can improve performance for relation extraction. In this project, we will apply this methodology to enable the classifier to work on a large space of features. 2. Involve human experts in the loop of learning so that when the system makes a mistake, the expert can pinpoint the exact feature responsible for the error; this way, the system can effectively improve the quality of features through human feature supervision. For example, in some previous experiments, we discovered that dictionary-based approaches to gene name recognition are unable to distinguish a gene abbreviation such as “for” from the common preposition “for”. Thus, if we just add a feature to the classifier to indicate whether the phrase occurs in a dictionary, we may misrecognize a preposition like “for” as a gene name. Figure 9. Sample gene name disambiguation results.
To solve this problem, we designed a special classifier targeted at disambiguating such cases based on the distribution patterns of words in the nearby text. The results in Figure 9 show that this technique can successfully distinguish all the occurrences of “foraging” and “for” (the numbers are the scores given by the classifier; a positive number indicates a gene, while a negative number indicates a non-gene). The output from such a disambiguation classifier can be regarded as a high-level feature that can be fed into a general gene recognizer to tune the classifier toward high precision. Note that we take a very broad view of features, which makes our framework quite general. Thus, in addition to leveraging all kinds of training data, we can also incorporate a variety of other useful resources, such as dictionaries and human-generated rules, by defining appropriate features (e.g., a feature can correspond to whether an instance matches a particular rule or an entry in a dictionary), effectively leveraging results from existing work. Extracting entities and relations from biomedical literature has been studied extensively (see, e.g., [3,5,6,9,11-15,23-24,28,30-31,35,36,38-39,41-43]), including in our own previous work (e.g., [18-21]). Our framework would enable us to leverage and combine the findings and resources from all these previous studies to perform large-scale information extraction. For example, we can obtain a wide range of useful features from previous work, as well as various strategies for optimizing extraction accuracy.
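The tunable tradeoff described in this section, where a different cutoff threshold on a ranked list of predictions yields either high-precision low-recall or low-precision high-recall output, can be illustrated with a short Python sketch. The function name `precision_recall` and the scored predictions in `SCORED` are hypothetical toy values, not results from the proposal's experiments.

```python
def precision_recall(scored, threshold):
    """Precision and recall when keeping predictions with score >= threshold.

    `scored` is a list of (score, is_correct) pairs, where is_correct says
    whether a curator would validate the extracted factoid.
    """
    kept = [ok for score, ok in scored if score >= threshold]
    true_positives = sum(1 for ok in kept if ok)
    total_correct = sum(1 for _, ok in scored if ok)
    # With nothing kept there are no false positives, so report precision 1.0.
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / total_correct if total_correct else 0.0
    return precision, recall


# Hypothetical classifier scores paired with curator judgments.
SCORED = [
    (0.95, True), (0.90, True), (0.80, False),
    (0.70, True), (0.40, False), (0.30, True),
]

for threshold in (0.85, 0.50, 0.25):
    p, r = precision_recall(SCORED, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy list, a strict threshold of 0.85 keeps only two factoids, both correct, while a lenient threshold of 0.25 keeps all six and recovers every correct factoid at the cost of two errors. A community could pick the cutoff that matches how much validation effort its curators can spare.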
  • 15. 6. COMMUNITY ANNOTATION and CURATION The Community itself will eventually have to take over the curator role, with interactive analysis to enable scientists to use the infrastructure to infer biological functions and infer semantic relationships. Today's new genome projects are efforts contributed by many experts and students, supported and enabled by distributed data sets, wiki project notebooks, genome maps, annotation and search tools. These projects are not supported in a monolithic way, but via contributions by biologists at nearly as many institutions as the hundreds of individual labs. For example, more than 400 biologists contributed gene annotations to the Daphnia genome [17]. As this is the same scale of attendees to the Arthropod Genomics Symposium, but for a single arthropod, the number of potential contributors to the ArthropodBaseConsortium annotations clearly numbers in the tens of thousands. Each of these is a potential curator, with effective infrastructure for Curator Assistant. See the Collaboration Wikis for Daphnia Genomics Consortium [https://dgc.cgb.indiana.edu/display/DGC/] and for Aphid Genomics Consortium [https://dgc.cgb.indiana.edu/display/aphid/] for arthropod genomes examples. This is a new model of sustainable scientific activity, with cost-effective collaboration via widely adopted cyberinfrastructure. Experts and students in focus areas are actively involved, and contribute according to their means and interest. They join from disparate areas of basic and applied sciences, educational, governmental, and industry centers (e.g. Daphnia and Aphid genomes involve EPA and USDA agencies, agricultural and environmental businesses). We will develop infrastructure to address collaboration support for community annotation. By providing tunable quality for biological factoids, we provide an automatic system to filter the literature for curatable knowledge. 
In current gene annotation systems, such as Apollo distributed by GMOD, the curator is presented with a blank form in which to write a gene description. In the Curator Assistant, they are presented with candidate suggestions, thus greatly expanding the number of persons who can serve as effective curators. We will also provide mechanisms for the community to enter their own documents as published into the base collections for the system, yielding a rich source of full-text articles, and to directly provide their own factoids from their articles, without the inaccuracy of automatic entity-relation extraction. Currently, the most popular collaboration tools are wikis. While a wiki excels at simplicity and flexibility, it lacks validation tools, rich indexing and social instrumentation. We propose to develop structured social instrumentation for collaborative research environments, including collaborative curation. In particular, our systems will allow users to offer confidence ratings for human annotations and for various automated metadata extracts presented to the users. The users themselves will gain expert status when their annotations receive high confidence ratings. These ratings and rankings will allow researchers to share expertise and enhance the precision of automated annotation systems in a mutually-beneficial way with secure transactions. A relevance rating system will be integrated in the basic functioning of the system itself. Every view of information (entities, relations, abstracts, document lists) will also include checkboxes to up-rate or reject/dismiss any listed elements. For example, community members can judge the quality of the factoids viewed during their usage of the system. Items which are selected and viewed receive increased relevance ratings. Data items which are dismissed/rejected are down-rated in relevance and/or validity. 
The rating system is not optional: it is transparently embedded within the user experience, which is key to its success. This model of relevance feedback and validity ratings embedded within the core system has proven effective in popular commercial social network systems such as YouTube and LastFM.
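The embedded rating mechanism described in this section can be sketched as follows. This is only an illustration of the up-rate and down-rate behavior stated in the text: the `Factoid` class, the action names, and the unit step size are hypothetical, and the proposal does not specify how ratings would actually be weighted or stored.

```python
from dataclasses import dataclass

# Hypothetical action names for the two kinds of feedback in the text.
UP_ACTIONS = {"select", "view", "up-rate"}
DOWN_ACTIONS = {"dismiss", "reject"}


@dataclass
class Factoid:
    """A listed element (entity, relation, abstract) with a relevance rating."""
    text: str
    relevance: float = 0.0


def record_feedback(factoid, action, step=1.0):
    """Adjust relevance as a side effect of ordinary use of the system."""
    if action in UP_ACTIONS:
        factoid.relevance += step
    elif action in DOWN_ACTIONS:
        factoid.relevance -= step
    else:
        raise ValueError(f"unknown action: {action}")
    return factoid.relevance


def ranked(factoids):
    """Order listed elements so the highest-rated appear first."""
    return sorted(factoids, key=lambda f: f.relevance, reverse=True)


a = Factoid("gene A regulates behavior B")
b = Factoid("gene C is expressed in tissue D")
record_feedback(a, "view")
record_feedback(a, "up-rate")
record_feedback(b, "reject")
print([f.text for f in ranked([a, b])])
```

Validated and rejected factoids collected this way could also serve as the positive and negative training pairs for the extractor described in the previous section.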
  • 16. 7. PROJECT ORGANIZATION AND SCHEDULE Our project has been organized via the annual Symposium of the Arthropod Base Consortium (ABC). This is sponsored by the Arthropod Genomics Center at Kansas State University with coPI Brown as Director. There have been 3 symposia held thus far in Kansas City, drawing 300-400 attendees, generally representatives of their research laboratories or genome projects. http://www.k-state.edu/agc/symp2009/ The steering committee for the ABC meets after the workshop to plan community support, this proposal grew out of these planning meetings. There have also been specific meetings of the inner circle, 30-40 attendees, once or twice a year at the main infrastructure sites such as FlyBase. The BeeSpace project hosted the one in December 2007 at the University of Illinois, the slides for this workshop are at http://www.beespace.uiuc.edu/groups_abc.php . The investigators for this proposal each spoke at this meeting, along with the Head of Literature Curation for FlyBase Cambridge. The proposed project will host a budgeted annual specialty workshop to plan Curator Assistant. The genome databases being used as test models in this project have already bypassed the use of professional curators. They are coming in later than the post-MOD wave, such as honey bee, where a case for a few curators was eventually successful after many grant attempts. So BeetleBase for Tribolium the flour beetle and wFleaBase for Daphnia the water flea employ a few biologists and programmers to help with sequencing support and computational pipelines. The coPIs who lead the bioinformatics for these, respectively Susan Brown and Donald Gilbert, are influential proponents of the new paradigm for community curation via annotation software. This proposal is concerned with developing an effective Curator Assistant and testing it to evolve to full utility. 
The infrastructure investigators will develop the software infrastructure, Schatz leading the informatics system development and Zhai leading the computer science research. These were the same roles they played in the BIO FIBR BeeSpace project, which developed interactive services for functional analysis using computer science research. The bioinformatics investigators will serve as the initial users, each is the lead for the informatics of a major community of arthropod biologists with several hundred community members. Tribolium is an insect close to Drosophila, while Daphnia is a non-insect arthropod far from Drosophila. The close BeeSpace collaboration with FlyBase will be continued, with both the curator site at Harvard with PI Bill Gelbart and the software site at Indiana with PI Thom Kaufmann. Deployment to the full ABC and beyond will begin towards the end of the project. The groups already identified coordinate multiple related databases. They will be the wave of deployment after the investigator organisms are effectively using the Curator Assistant. Their coordinators have expressed great interest while serving on the ABC steering committee. NIH-supported VectorBase has many curators for mosquitos and ticks, USDA-supported HymenopteraBase has few curators for bees and wasps, LepidopteraBase has no curators for butterflies and moths. There is also an international collaboration for AphidBase hosted at INRA in France. The GMOD (Generic Model Organism Database) consortium is a bioinformatics group who provide common infrastructure for over 100 genome projects, including all the ABC genomes [www.gmod.org/wiki/GMOD_Users]. We have presented our preliminary software at GMOD meetings [32], using RESTful protocols for linking Genome Browser to Gene Summarizer, and made arrangements with the coordinator Scott Cain to link our software into GMOD for mass distribution, during extensive conversations at the GMOD meetings and the ABC meetings. 
So the Curator Assistant will become the literature infrastructure for ABC, just as GBrowse is the sequence infrastructure, and through GMOD it will be made available to the genome biology community.
  • 17. References Cited [1] Bishop C (2007) Pattern Recognition and Machine Learning, Springer, 2007. [2] Buell J, Stone D, Naeger N, Fahrbach S, Bruce C, Schatz B (2009) Experiencing BeeSpace: Educational Explorations in Behavioral Genomics for High School and Beyond, AAAS Annual Symposium, Chicago, Feb 2009. curricular materials at www.beespace.uiuc.edu/ebeespace [3] Chang J, Schutze H, Altman R (2004) GAPSCORE: finding gene and protein names one word at a time, Bioinformatics, 20(2):216-25. [4] Chung Y, Pottenger W, Schatz B (1998) Automatic Subject Indexing using an Associative Neural Network, 3rd Int ACM Conf on Digital Libraries, Pittsburgh, PA, Jun, pp 59-68. Nominated for Best Paper award. [5] Cohen A (2005) Unsupervised gene/protein entity normalization using automatically extracted dictionaries, Proc BioLINK2005 Workshop Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics. Detroit, MI: Association for Computational Linguistics; 2005:17-24. [6] Daraselia N, Yuryev A, Egorov S, Novichkova S, Nikitin A, Mazo I (2004) Extracting human protein interactions from MEDLINE using a full-sentence parser, Bioinformatics, 20(5):604-11, 2004. [7] Dasgupta S, Tauman Kalai A, Monteleoni C (2005) Analysis of perceptron-based active learning, Proceedings of COLT 2005, 249-263, 2005. [8] Drysdale R, Crosby M, FlyBase Consortium (2005) FlyBase: genes and gene models, Nucleic Acids Research, 33:D390-D395, Database Issue, doi:10.1093/nar/gki046. [9] Finkel J, Dingare S, Manning C, Nissim M, Alex B, Grover C (2005) Exploring the boundaries: gene and protein identification in biomedical text, BMC Bioinformatics, 6 Suppl 1(NIL):S5, 2005. [10] Freund Y, Seung H, Shamir E, Tishby N (1997) Selective sampling using the query by committee algorithm, Machine Learning, 28(2-3):133-168. [11] Fukuda K, Tamura A, Tsunoda T, Takagi T (1998) Toward information extraction: identifying protein names from biological papers, Pac Symp Biocomput, NIL(NIL):707-18. 
[12] Hakenberg J, Bickel S, Plake C, Brefeld U, Zahn H, Faulstich L, Leser U, Scheffer T (2005) Systematic feature evaluation for gene name recognition, BMC Bioinformatics, 6 Suppl 1(NIL):S9, 2005. [13] Hanisch D, Fundel K, Mevissen H, Zimmer R, Fluck J (2005) ProMiner: rule-based protein and gene entity recognition, BMC Bioinformatics 2005, 6(Suppl 1):S14doi:10.1186/1471-2105-6-S1-S14. [14] Hatzivassiloglou V, Duboue P, Rzhetsky A (2001) Disambiguating proteins, genes, and rna in text: a machine learning approach, Bioinformatics, 17 Suppl 1.:S97-S106. [15] Hirschman L, Park J, Tsujii J, Wong L, Wu C (2002) Accomplishments and challenges in literature data mining for biology, Bioinformatics, 18(12):1553-1561. [16] Hirschman L, Yeh A, Blaschke C, Valencia A (2005) Overview of BioCreAtIvE: critical assessment of information extraction for biology, BMC Bioinformatics 2005, 6(Suppl 1):S1doi:10.1186/1471-2105-6-S1-S1. [17] Howe D, Costanzo M, Fey P, et. al. (2008) Big data: The future of biocuration, Nature 455: 47-50; doi:10.1038/455047a. [18] Jiang J, Zhai C (2006) Exploiting Domain Structure for Named Entity Recognition, Proceedings of HLT/NAACL 2006. - 17 -
  • 18. [19] Jiang J, Zhai C (2007) Instance weighting for domain adaptation in NLP, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL'07), 264-271. [20] Jiang J, Zhai C (2007) A Two-Stage Approach to Domain Adaptation for Statistical Classifiers , Proc 16th ACM International Conference on Information and Knowledge Management ( CIKM'07), pp 401-410. [21] Jiang J, Zhai C (2007) A Systematic Exploration of The Feature Space for Relation Extraction, Proc Human Language Technologies: Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2007), pp 113-120. [22] Johnson E, Schatz B, Cochrane P (1996) Interactive Term Suggestion for Users of Digital Libraries: Using Subject Thesauri and Co-occurrence Lists for Information Retrieval, Proc Digital Libraries '96: 1st ACM Intl Conf on Digital Libraries, March, Bethesda, MD. [23] Kazama J, Makino T, Ohta Y, Tsujii J (2002) Tuning SVM for biomedical named entity recognition, Proc workshop on NLP in the biomedical domain, 2002. [24] Kulick S and others (2004) Integrated Annotation for Biomedical Information Extraction, Proc HTL-NAACL 2004 Workshop on Linking Biological Literature, Ontologies and Databases, pp 61-68. [25] Ling X, Jiang J, He X, Mei Q, Zhai C, Schatz B (2006) Automatically generating gene summaries from biomedical literature, Proc Pacific Symposium on Biocomputing, pp 40-51. [26] Ling X, Jiang J, He X, Mei Q, Zhai C, Schatz B (2007) Generating gene summaries from biomedical literature: A study of semi-structured summarization, Information Processing and Management, 43: 1777-1791. [27] Marygold S (2007) Genetic Literature Curation at FlyBase-Cambridge, presentation at ArthropodBaseConsortium working group meeting at University of Illinois, Dec 2007. www.beespace.uiuc.edu/files/Marygold-ABC.ppt [28] Mika S, Rost B (2004) Protein names precisely peeled off free text, Bioinformatics, 20 Suppl. 1:241-247, 2004. 
[29] Morgan A, Hirschman L (2007) Overview of BioCreative II Gene Normalization, Proc Second BioCreative Challenge Evaluation Workshop, Madrid, Spain: CNIO, pp 17-27.
[30] Muller H, Kenny E, Sternberg P (2004) Textpresso: an ontology-based information retrieval and extraction system for biological literature, PLoS Biology, 2(11):e309. doi:10.1371/journal.pbio.0020309 pmid:15383839. www.textpresso.org
[31] Narayanaswamy M, Ravikumar K, Vijay-Shanker K (2003) A biological named entity recognizer, Proc Pacific Symposium on Biocomputing, pp 427-438.
[32] Sanders B, Arcoleo D, Schatz B (2008) BeeSpace Navigator Integration with GMOD GBrowse, 9th annual Bioinformatics Open Source Conference (BOSC 2008), Toronto, ON, Canada. www.beespace.uiuc.edu/files/BOSC2008_v3.ppt
[33] Schatz B (2002) Building Analysis Environments: Beyond the Genome and the Web, invited essay for Trends and Controversies section about Mining Information for Functional Genomics, IEEE Intelligent Systems, 17:70-73 (May/June 2002).
[34] Schatz B (2007) Gene Summarizer: Software for Automatically Generating Structured Summaries from Biomedical Literature, accepted plenary presentation to 2nd International Biocurator Meeting, San Jose. www.canis.uiuc.edu/~schatz/Biocurator.GeneSummarizer.ppt
[35] Settles B (2005) ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text, Bioinformatics, 21(14):3191-3192.
[36] Skounakis M, Craven M, Ray S (2003) Hierarchical hidden Markov models for information extraction, Proc 18th International Joint Conference on Artificial Intelligence.
[37] Sokolowski M (2001) Drosophila: genetics meets behaviour, Nature Reviews Genetics, 2(11):879-890.
[38] Srinivasan P, Libbus B (2004) Mining Medline for implicit links between dietary substances and diseases, Bioinformatics, 20(Suppl 1):290-296.
[39] Tanabe L, Wilbur W (2002) Tagging gene and protein names in biomedical text, Proc workshop on NLP in the biomedical domain.
[40] Tong S, Koller D (2001) Support vector machine active learning with applications to text classification, Journal of Machine Learning Research, 2:45-66.
[41] Tsuruoka Y, Tsujii J (2003) Boosting precision and recall of dictionary-based protein name recognition, Proc ACL 2003 workshop on Natural language processing in biomedicine, pp 41-48, Morristown, NJ.
[42] Tuason O, Chen L, Liu H, Blake J, Friedman C (2004) Biological nomenclatures: A source of lexical knowledge and ambiguity, Proc Pacific Symposium on Biocomputing 9, pp 238-249.
[43] Zhou G, Shen D, Zhang J, Su J, Tan S (2005) Recognition of protein/gene names from text using an ensemble of classifiers, BMC Bioinformatics, 6(Suppl 1):S7.