This document summarizes an analysis of discourse segments in biology research articles. The analysis identified different types of segments based on linguistic characteristics and segmented two biology papers into fact, hypothesis, problem, goal, method, and result segments. Tense patterns differentiated conceptual from experimental realms, and citations were analyzed to trace the evolution of facts. While the segment types helped provide structure and enabled initial epistemic markup, more research is needed to validate the approach across more papers and subfields.
Epistemics
1. Categorizing Epistemic Segment Types in Biology Research Articles
Anita de Waard
Elsevier Labs, Amsterdam
UiL-OTS, Utrecht University
Thursday, September 17, 2009 1
3. Why Study Biological Discourse?
- There is too much of it!
- Text mining and ‘fact
extraction’ techniques are
gaining ground to tame this
tangle
- Emerging area of biological
natural language processing
(BioNLP): subfield of computational linguistics
- Main focus: identifying biological entities (genes,
proteins, drugs) and their relationships
4. Example state of the art: MEDIE
without some idea of the status of the
sentence, it cannot be interpreted!
Alteration of nm23, P53, and S100A4 expression may
contribute to the development of gastric
Previous studies have implicated miR-34a as a tumor
suppressor gene whose transcription is activated by p53.
5. How can linguistics help?
Underlying model of text mining systems:
- Scientific paper is ‘statement of pertinent facts’
- So: finding entities and relationships will give you a summary of
the knowledge within the paper
- However, information extracted this way is not very useful....
Proposed approach: treat scientific paper as a persuasive text: specific
genre, with genre characteristics and allowed persuasive techniques:
- ‘these results suggest’ (depersonification)
- ‘as fig. 2a shows’ (evidence is in the data)
- ‘oncogenes produce a stress response [Serrano, 2003]’
References and data form a “folded array of successive defense lines, behind
which scientists ensconce themselves” [Latour, 1988]
6. Modality Dropping
- Fact creation occurs through social acceptance: “[Y]ou can
transform .. fiction into fact just by adding or subtracting
references” [Latour, 1988]
- When references are cited the modality is dropped:
- A: ‘these results suggest/demonstrate/imply that’ X
- B: ‘A et al. have shown that X [A, 2009]’
- C: ‘X [2009]’
- D: ‘Since X, we investigated the possibility that Y’
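The four stages above can be told apart by surface cues. The following is a minimal sketch, assuming that a handful of regular expressions is enough to separate them; the stage labels A to D come from the slide, while the function name `modality_stage` and the patterns themselves are illustrative assumptions, not part of the analysis presented here.

```python
import re

# Stage labels A-D follow the slide. The regular expressions are
# assumptions made for this sketch, not a validated marker set.
STAGE_PATTERNS = [
    # A: hedged claim by the authors themselves
    ("A", re.compile(r"\bthese results (suggest|demonstrate|imply) that\b", re.I)),
    # B: claim attributed to named authors
    ("B", re.compile(r"\bet al\.? (have|has) shown that\b", re.I)),
    # D: claim used as a premise for new work
    ("D", re.compile(r"^since\b.*,\s*we\b", re.I)),
    # C: bare statement followed only by a bracketed citation
    ("C", re.compile(r"\[[^\]]*\d{4}\]\s*\.?$")),
]


def modality_stage(sentence):
    """Return the first matching stage label, or None if no cue matches."""
    text = sentence.strip()
    for stage, pattern in STAGE_PATTERNS:
        if pattern.search(text):
            return stage
    return None
```

The patterns are tried in order, so a sentence that both names its authors and ends in a citation is reported as stage B rather than C.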
7. Overall Research Questions
I. (How) can we add epistemic value to results from a
text mining system?
II. How is a scientific fact created, as it moves from a
hedged claim to a fact throughout successive citations?
III. Can we identify a rhetorically successful text (and
help authors create them)?
8. Present work:
Perform discourse analysis on a few selected texts in
biology:
1. Parse text into discourse segments (EDUs) containing a
single rhetorical move (if possible...)
2. Determine categories or types of discourse segments
that have similar rhetorical/pragmatic properties
3. Look at a number of linguistic characteristics and see if
these segments share those characteristics.
9. Present research questions:
i. Can these segments indeed be grouped by linguistic
characteristics (verb tense, verb registry, metadiscourse
markers?)
ii. Does this offer a useful version of the structure of a
paper?
iii. Is this useful for enabling automated epistemic markup?
iv. Can this help us to trace evolution of a hypothesis?
11. Method
1. Parse text into Discourse Segments (EDUs) according to
syntactic criteria
2. Define set of semantic segment types
3. Identify semantic type for each segment
4. Specify linguistic and structural properties for each
segment
5. Identify correlations between semantic type and
structural/syntactic properties
6. Trace a hypothesis through the process of fact creation
12. Segmentation Criteria
Goal: ‘one new thought per segment’:
Figure 4A shows that following RASV12 stimulation, p53
was stabilized and activated, and its target gene, p21cip1,
was induced in all cases, indicating an intact p53 pathway
in these cells.
a. Figure 4A shows that
b. following RASV12 stimulation,
c. p53 was stabilized and activated,
d. and its target gene, p21cip1, was induced in all cases,
e. indicating an intact p53 pathway in these cells.
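The segmentation above can be held in a small data structure. This is a sketch under stated assumptions: the `Segment` class and the `rebuild` helper are hypothetical names, and attaching each comma to the end of the preceding segment is a choice made here, not something the slide specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    label: str  # segment letter used on the slide
    text: str   # the words that belong to this segment


SENTENCE = (
    "Figure 4A shows that following RASV12 stimulation, p53 was stabilized "
    "and activated, and its target gene, p21cip1, was induced in all cases, "
    "indicating an intact p53 pathway in these cells."
)

# Boundaries follow the slide; comma placement is an assumption.
SEGMENTS = [
    Segment("a", "Figure 4A shows that"),
    Segment("b", "following RASV12 stimulation,"),
    Segment("c", "p53 was stabilized and activated,"),
    Segment("d", "and its target gene, p21cip1, was induced in all cases,"),
    Segment("e", "indicating an intact p53 pathway in these cells."),
]


def rebuild(segments):
    """Join the segment texts with single spaces, in order."""
    return " ".join(segment.text for segment in segments)
```

Checking that `rebuild(SEGMENTS)` equals `SENTENCE` is a cheap way to confirm that a segmentation neither drops nor duplicates words.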
13. Segmentation Criteria (summary)
Finite/Non-finite | Grammatical role | Segment? | Example
Finite/Non-finite | Subject | N | The extent to which miRNAs specifically affect metastasis
Finite/Non-finite | Direct Object | Y | these miRNAs are potential novel oncogenes
Nonfinite | Phrase-level adjunct (restrictive and non-restrictive) | N | spanning a given miRNA genomic region
Nonfinite | Clause-level adjunct | Y | by cloning eight miR-Vec plasmids
Finite | Non-restrictive Phrase-level adjunct | Y | which is only active when tamoxifen is added (De Vita et al, 2005) […]
Finite | Restrictive Phrase-level adjunct | N | that we examined
Finite | Clause-level adjunct | Y | which correlates with the reported ES-cell expression pattern of the miR-371-3 cluster (Suh et al, 2004)
14. Basic Segment Types
Segment | Description | Example
Fact | a known fact, generally without explicit citation | mature miR-373 is a homolog of miR-372
Hypothesis | a proposed idea, not supported by evidence | This could for instance be a result of high mdm2 levels
Problem | unresolved, contradictory, or unclear issue | However, further investigation is required to demonstrate the exact mechanism of LATS2 action
Goal | research goal | To identify novel functions of miRNAs,
Method | experimental method | Using fluorescence microscopy and luciferase assays,
Result | a restatement of the outcome of an experiment | all constructs yielded high expression levels of mature miRNAs
Implication | an interpretation of the results, in light of earlier hypotheses and facts | our procedure is sensitive enough to detect mild growth differences
15. Two Types of Derived Segment Types
‘Other-segments’, related to (referenced) other work:
- other-result: ‘they are also found in the FCX and other cortical structures
[Sokoloff et al., 1990]’
- other-goal: ‘the role of D3 receptors in the control of motivation and affect
has been intensively studied [Heidbreder et al., 2005]’
- other-implication: ‘D1 or, more likely, D5, receptors have been implicated in
mechanisms underlying long-term spatial memory [Hersi et al., 1995]’
Regulatory segments, acting as matrix sentences framing other segments:
- reg-hypothesis: ‘we hypothesized that ’
- reg-implication: ‘These observations suggest that’
- intratextual: ‘Fig 4 shows that’
- intertextual: ‘reviewed in (Serrano, 1997)’
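Because regulatory segments are short, formulaic matrix clauses, they lend themselves to cue-phrase matching. The sketch below assumes that the four example phrases on this slide can serve as cues; the function name `regulatory_type` and the regular expressions are hypothetical, and a real marker set would need to be far larger.

```python
import re

# Labels follow the slide. The cue patterns generalise the slide's own
# examples only slightly and are assumptions made for illustration.
REGULATORY_CUES = [
    ("reg-hypothesis", re.compile(r"\bwe hypothesi[sz]ed that\b", re.I)),
    ("reg-implication", re.compile(r"\bthese observations suggest that\b", re.I)),
    ("intratextual", re.compile(r"\bfig(?:ure|\.)?\s*\d+\w*\s+shows that\b", re.I)),
    ("intertextual", re.compile(r"\breviewed in\b", re.I)),
]


def regulatory_type(segment):
    """Return the regulatory segment label for a segment, or None."""
    for label, pattern in REGULATORY_CUES:
        if pattern.search(segment):
            return label
    return None
```

A segment that matches no cue returns `None`, which here simply means "not recognised as regulatory", not that it belongs to any particular basic type.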
17. Linguistic and structural properties
1. Position in text
- Section of the paper (Introduction, Results, Discussion)
- Beginning/middle/end of section
- First/second/third part of sentence
2. Verb:
- Tense, aspect, voice
- Verb class (idiosyncratic)
- Lexicon
3. Metadiscourse markers [Hyland, 2003]:
- Connectives
- Endophorics, Evidentials
- Hedges, Boosters
- Person markers
18. Verb class
Two types of entities interact in biology texts:
- Thing:
- Thing -> Increase, die, etc
- Thing-thing: affect, stimulate etc.
- People:
- People -> Thing:
- Examine (Goal)
- Operate (Method)
- Observe (Result)
- Implicate (Implication)
- People - people: Report
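The People -> Thing verb classes above pair off with segment types, which suggests a simple lookup. This is a sketch only: the class-to-type mapping is taken from the slide, but the small verb lexicon and the function name `segment_type_for_verb` are hypothetical and were invented for illustration.

```python
# Mapping from verb class to segment type, as listed on the slide.
VERB_CLASS_TO_SEGMENT = {
    "Examine": "Goal",
    "Operate": "Method",
    "Observe": "Result",
    "Implicate": "Implication",
}

# Hypothetical lexicon: which verbs belong to which class is an
# assumption of this sketch, not a result of the analysis.
VERB_LEXICON = {
    "examine": "Examine",
    "investigate": "Examine",
    "clone": "Operate",
    "transduce": "Operate",
    "observe": "Observe",
    "detect": "Observe",
    "suggest": "Implicate",
    "imply": "Implicate",
}


def segment_type_for_verb(verb):
    """Look up the segment type suggested by a verb, or None if unknown."""
    verb_class = VERB_LEXICON.get(verb.lower())
    return VERB_CLASS_TO_SEGMENT.get(verb_class)
```

The verb is only one of the properties under study, so a lookup like this would at most supply one feature alongside tense, position and metadiscourse markers.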
20. Two texts
1. Voorhoeve, 2006: Cell
- Cell biology text, written by group in Amsterdam
- Dealing with microRNAs - hot topic
- 290 citations in Google Scholar: successful paper!
2. Louiseau, 2008: European Neuropsychopharmacology
- Text on schizophrenia
- Prompted by interest from Pharma company
- Adjacent subfield of biology (neuropharmacology)
27. Interpretation: 3 Realms of Science:
Conceptual realm:
- (1) Oncogene-induced senescence is characterized by the appearance of cells with a flat morphology that express senescence associated (SA)-β-Galactosidase.
- (4b) transduction with either miR-Vec-371&2 or miR-Vec-373 prevents RASV12-induced growth arrest in primary human cells.
Between the conceptual and experimental realms:
- (2a) Indeed,
- (4a) Altogether, these data show that
Experimental realm:
- (2b) control RASV12-arrested cells showed relatively high abundance of flat cells expressing SA-β-Galactosidase
- (3a) Consistent with the cell growth assay,
- (3b) very few cells showed senescent morphology when transduced with either miR-Vec-371&2, miR-Vec-373, or control p53kd.
Data realm (Figures):
- (2c) (Figures 2G and 2H).
28. Tense 1: Concepts vs. Experiment
Concept realm:
- (1) Oncogene-induced senescence is characterized by the appearance of cells with a flat morphology that express senescence associated (SA)-β-Galactosidase.
- (4b) transduction with either miR-Vec-371&2 or miR-Vec-373 prevents RASV12-induced growth arrest in primary human cells.
Between the concept and experimental realms:
- (2a) Indeed,
- (4a) Altogether, these data show that
Experimental realm (personal, past):
- (2b) control RASV12-arrested cells showed relatively high abundance of flat cells expressing SA-β-Galactosidase
- (3a) Consistent with the cell growth assay,
- (3b) very few cells showed senescent morphology when transduced with either miR-Vec-371&2, miR-Vec-373, or control p53kd.
Data realm (Figures, nonverbal):
- (2c) (Figures 2G and 2H).
29. Tense 2: Referral
[Diagram: timeline (past, present, future) placing the own paper's Introduction (before current work), current work (= Results section) and Discussion (after current work) relative to other work in other papers, with the tense (past or present) used to refer to each.]
30. Tense 1 + 2 = 3:
[Diagram: a conceptual level (claim, fact) and an experiential level (experiment), plotted against reading time (past, present, future).]
31. Discourse Fact-ory
[Diagram: segment types arranged by realm, across the introduction, results and discussion sections, and from shared view to own view:]
- hypothetical realm (might, would): hypothesis
- realm of activity (to test, to see): goal
- realm of experience (we, past): method, result
- realm of models (present): fact, implication
[Other labels in the diagram: problem, ‘to’, ‘resulting in’, ‘suggests that’.]
32. Citation and fact creation:
Voorhoeve, 2006:
- ‘To investigate the possibility that miR-372 and miR-373 suppress the expression of LATS2, we...’
- ‘Therefore, these results point to LATS2 as a mediator of the miR-372 and miR-373 effects on cell proliferation and tumorigenicity,’
Yabuta, JBioChem 2007:
- ‘miR-372 and miR-373 target the Lats2 tumor suppressor (Voorhoeve et al., 2006)’
Raver-Shapira et al., JMolCell 2007:
- ‘two miRNAs, miRNA-372 and-373, function as potential novel oncogenes in testicular germ cell tumors by inhibition of LATS2 expression, which suggests that Lats2 is an important tumor suppressor (Voorhoeve et al., 2006).’
[Diagram: for Experiment 1 and Experiment 2, a Concepts level (KnownFact, Hypothesis, Implication, Fact) above Goal, Method, Result and Data.]
33. Answers to current research questions:
i. Can these segments indeed be identified?
✓ yes, adequate evidence, probably ok segments:
‣ need more annotators!
ii. Does this offer a useful version of the structure of a paper?
✓ yes, offers insight, and a possible model
‣ need to validate whether this structure holds over more
papers, different subcategories
iii. Is this useful for enabling automated epistemic markup?
✓ first efforts seem promising: simple markers (‘suggest’ verbs,
connectives, etc.) already help
‣ ongoing research! (Sandor, XRCE; Buitelaar, DERI)
iv. Can this help us to trace the evolution of a hypothesis?
✓ anecdotal: promising
‣ need to scale up!
34. Where are we on overall research questions?
I. (How) can we add epistemic value to results from a
text mining system?
‣ Segment types help - need to expand + verify
II. How is a scientific fact created, as it moves from a
hedged claim to a fact throughout successive citations?
‣ Model is developing, also spurt of other work!
III. Can we identify a rhetorically successful text (and
help authors create them)?
‣ Not addressed yet - verb tense, hedging seem
important.
35. Work on (biological) scientific discourse
- Is a growing field of interest!
- Several projects are going ‘beyond the facts’
- Epistemic modality is becoming a term
bioinformaticians are exploring
- Room for people who know about discourse
analysis!