Automatic Strategies for Decision Support in Telephone Triage
1. Automatic Strategies for Decision
Support in Telephone Triage
Framework and Testbed in
Smalltalk
Carlos E. Ferro
ceferro@ciudad.com.ar
Director: Dan Rozenfarb
2. Agenda
Introduction
Software Application: ExpertCare
Overview of the Framework:
Concept representation
Session, automation and simulation
Strategies
Some statistics
Examples and results
3. Telephone Triage
Phone call from a patient
Initial data gathered
Questions and answers
Presumptive diagnosis
Ambulance dispatch or treatment indications
6. ExpertCare
New questions are suggested.
Answers are recorded.
Diagnoses are re-evaluated.
Question in plain Spanish
List of scored diagnoses
List of symptoms
Session information
10. ExpertCare syndrome definition
Syndrome definitions are logical expressions in terms of symptoms.
Examples:
Definition of Appendicitis:
“right iliac fossa pain” AND “abdominal pain” AND NOT “appendix operation”
Definition of Massive Obesity:
“intense weight increase” OR “intense body fat increase”
11. ExpertCare size in numbers
Rules 3209
Symptoms 2383
Syndromes 673
Other 157
Rules account for 50% of size, but 80% of complexity and 90% of costs.
They also hinder software evolution.
12. Target
Our main metric is the number of questions:
Red (Emergency): 3 or 4 questions
Yellow (Urgency): 4 or 5 questions
Green: around 6, but may reach 12
13. Solution approach
Automated strategy
Dynamic interrogatory
Navigation and gathering of information
from the knowledge base
Adaptation to session status
Framework for session and strategies
Virtual lab as testbed
17. Automation
AnswerProvider simulates a patient/caller
Strategy guides the interrogatory, suggesting #nextQuestionFor: aCallSession
SUnit tests run through all syndromes using different strategies
StatisticalCollector gathers and caches information from the knowledge base
21. Step-by-step example
This is a typical red syndrome.
According to the definition, AnswerProvider can choose among 8 pairs of symptoms (2x4).
Each one is called a subsyndrome.
Diabetic Ketoacidosis
System: Metabolic
Frequency: Low
Severity: Red
Definition: (Diabetes OR History of diabetes) AND (Unconsciousness OR Confusion OR Ketonic breath OR Dyspnea)
22. Step-by-step example
Choosing clues:
Diabetes: systems pregnancy and metabolic
Confusion: associated with 9 different systems
Let’s choose Diabetes as a clue, and try to establish the presence of Confusion, in order to make a Diabetic Ketoacidosis diagnosis.
Diabetic Ketoacidosis_2
Definition: Diabetes AND Confusion
23. Example - Step 1
Choosing symptoms to ask:
We should try to discern the system from among these two.
[Diagram: 772 syndromes, 51 systems → with Diabetes as positive evidence → 18 syndromes, 2 systems → 6 syndromes, 2 systems.]
24. Example - Step 1
Choosing symptoms to ask:
Using information from the knowledge base and some abductive reasoning, we have 9 candidates left.
We choose symptom pregnancy, in order to confirm or discard system pregnancy.
[Diagram: Diabetes splits into system pregnancy (6 syndromes, 31 symptoms overall; 1 syndrome, 1 symptom with Diabetes) and system metabolic (13 syndromes, 48 symptoms overall; 5 syndromes, 8 symptoms with Diabetes).]
25. Example - Step 2
Now we “know” that only one system has chances left.
[Diagram: 772 syndromes, 51 systems → with Diabetes and not pregnancy → positive evidence for 5 syndromes in 1 system.]
26. Example - Step 2
Now we try to discern severity, first trying to decide whether it is red or yellow.
Using information from the knowledge base and some abductive reasoning, we have 8 symptom candidates left.
Here is where we need some tool for comparing or choosing among them. For instance, we could ask for symptom dyspnea.
[Diagram: with Diabetes and not pregnancy, system pregnancy is left with 0 syndromes and 0 symptoms, while system metabolic keeps 5 syndromes and 8 symptoms.]
27. Example - Step 3
The new information did not reduce the number of syndromes.
[Diagram: 772 syndromes, 51 systems → with Diabetes, not pregnancy, not dyspnea → positive evidence still for 5 syndromes in 1 system.]
28. Example - Step 3
We still try to discern severity, because “not dyspnea” only rejected some branches of some syndromes, but did not reduce the total number.
Now we have 7 symptom candidates left. This way, we could use up to 7 more questions to “hit” the symptom that the simulated patient has and make a diagnosis.
[Diagram: system metabolic (13 syndromes, 48 symptoms overall) still keeps 5 syndromes and 8 symptoms with Diabetes, not pregnancy, not dyspnea.]
29. Strategies
A first family of attempts, using little or no information:
SequentialStrategy, RandomStrategy, MoreSatisfiersStrategy, LessSatisfiersStrategy, MiddleSatisfiersStrategy, MoreCriticSeparationStrategy
A second family, attempting to guess the system by different indicators:
GuessSystemByFrequencyStrategy, MoreCorrelationStrategy, LessNegationStrategy, GuessSystemStrategy, GuessSystemUsingPairsStrategy, LessNegationPairStrategy
31. Strategies - Support
We coined the notion of support.
Intuitively, it is a numeric representation of the degree of likelihood of a given set of syndromes in the current session.
Calculation is straightforward:
Syndromes with full diagnoses add a large positive value.
Syndromes with disproved diagnoses add a large negative value.
For the rest, confirmed symptoms add positive value and negated symptoms add negative value.
32. SupportSeparationStrategy
The third family of strategies is
based on support.
Most promising results
15 different strategies
Hierarchy 7 levels deep
Every level evolving from the
previous one
SUCCESS according to target
36. Conclusions and remarks
It was great doing this work because:
Enhancing the ExpertCare application could have a direct impact on the population’s health.
Automated strategies allow the ExpertCare architecture to be used in other domains.
We applied a scientific research approach and techniques to this “real world” software problem.
We learned from Artificial Intelligence, Object-Oriented Programming and Medicine in an interdisciplinary work.
37. Conclusions and remarks
Smalltalk proved to be an adequate tool because:
Representation of the knowledge base was almost trivial.
Building a virtual lab for trials and benchmarks was very easy.
Additional tools for exploring and studying the knowledge base were easy to implement.
There were no barriers to implementing and testing several strategies with diverse heuristics.
It was easy to get feedback and to debug troublesome cases, in order to enhance and refine strategies.
38. Future work (technical)
A visual tool for representing the session, using some navigational metaphor.
The tool could be enhanced for tracing during simulation runs.
More tools for developers to understand and interact with strategies and sessions.
More tools for better comparative benchmarking.
39. Future work (domain model)
Integrate with ExpertCare
Incorporate exceptions and special rules
Test with real samples
Try some adaptation to other knowledge bases
This is a research and application work that began with a proposal to make improvements on an existing application.
That application was developed mainly by Dan Rozenfarb at Ilón Software, with Dolphin 6. It is named ExpertCare, and it is a decision support tool intended to help telephone operators in a dispatch call center for medical emergencies, doing what is called telephone triage.
Before proceeding with my exposition, I would like to remark that the ExpertCare project and its enhancements have great potential impact; our interest in it is not only academic. By way of example, one of its success stories involves the province of Córdoba, where the government installed the system in 35 positions of its call center to cover Córdoba city and its surroundings: in total, a population of about 1,500,000.
There were registered peaks of 70,000 calls per month.
ExpertCare analyzes the symptoms reported by the caller and suggests new questions –about other symptoms– in order to complete a presumptive diagnosis and determine whether an ambulance is needed or not. The system relies upon a knowledge base and a large set of rules used to direct the questioning. But building and maintaining those rules is the hardest and most expensive part of the application, and it hinders tailoring the system to other requirements or knowledge domains. So we are attempting a different approach, one that allows us to build automatic strategies for interrogation guidance.
With this objective in mind, we had to develop a small framework to use as a virtual lab, where we could simulate patients and calls, and test and benchmark the strategies we built.
First, we are going to explain what we mean by telephone triage and delineate the scope of our interest area and objectives.
Then, we will see how ExpertCare works and its internals and architecture, and later we will show the main features of the framework we work on.
I will show some statistics of the current knowledge base, which provided us with guidelines for our study and bases to design different strategies. Finally, I will display some result tables to compare the different strategies we tested.
All telephone dispatch systems for medical emergencies work alike. They can have more or less computer and medical support, but always a patient –or a relative– makes a phone call to the dispatch center. There, an operator –with more or less training and skill– answers the call.
Some initial data are gathered at the beginning of the call to identify the patient, but they can also be useful in the analysis of the problem –for example, age and sex rule out certain syndromes. Also, one or more symptoms reported by the patient are recorded.
From there on, the dialog continues with further questions and answers until the operator has enough information to decide whether an ambulance must be sent and/or give complementary recommendations for dealing with the situation.
In the case of ExpertCare, this is a big part of the problem: because the system not only addresses urgencies but also makes recommendations for other, “lighter” problems, it handles a much greater number of possible diagnoses. The scale is 30 diagnoses for emergencies, but more than 1100 overall.
Here I will show some screenshots from ExpertCare, to illustrate how it works. It is developed in Dolphin Smalltalk, as I said before.
This is the first input for an incoming call, a standard form for recording the patient’s personal data.
On the left there is a list with all the symptoms in the knowledge base. There the operator picks all symptoms reported by the patient from the beginning.
Note that we speak about “symptoms” because they represent the majority of this set; actually, there are other kinds of information in this category. For example, some symptoms in the knowledge base are Man, Woman, Senior Adult or Diabetes antecedents.
This is the main screen used along the session. In the lower middle part, there are possible diagnoses, each with its score. This is the most important information for the operator. The score grows as new evidence makes a diagnosis more likely. With every new answer all scores are updated. This is necessary because in a real session, patients can contradict themselves and correct or enhance the information they had given before.
Above this list there appears a question, suggested by the system itself. It asks for a symptom with a broader description, in common, non-technical language that patients and untrained operators can understand.
On the right side you can see previous questions and answers, which are recorded for statistics and auditing.
In the normal course of a session, a group of diagnoses quickly acquires scores that differentiate it from the rest. Using this, the operator forms a presumptive diagnosis and closes the session, giving the caller the indications and advice provided by the system.
Roughly speaking, the architecture and design of ExpertCare are similar to those of traditional expert systems. There is an underlying ontology and a knowledge base, composed of concrete symptoms and syndromes. There is an inference engine, independent of that knowledge base, which works by making deductions from logical predicates –the definitions of the syndromes– in the context of the current session.
It cooperates with the Interrogator, which provides new questions by operating with the interrogatory rules. The Scorer evaluates diagnoses for syndromes in the current context.
These are the classes of the ontology. They are pretty simple, except for a detail we will see on the next slide.
Symptoms are elements which do not have any attribute apart from their names. There is an implication relationship defined between symptoms; for instance, Severe abdominal pain implies Abdominal pain.
Syndromes have attributes for systems –plural ‘systems’, because syndromes may belong to more than one system–, severity and frequency. We are planning to expand this class to encompass more attributes, for instance physiopathology, but they are beyond the scope of the present work.
The definition of a syndrome is the relationship that completes the ontology. It is a complex relationship between a syndrome and several alternative sets of symptoms that characterize the syndrome. In order to work with these definitions, it is easier to express them as logical expressions, so that is the representation we implemented.
Here we have real data about the size of ExpertCare’s knowledge base.
This base is pretty small in terms of the quantity of objects. It fits well in RAM, which simplifies our work by not requiring an external database.
It is important to notice here the weight of the set of rules. It is not adequately represented by their quantity, because rules are more complex objects, and they become especially complex in their interrelationships. They are considerably harder to maintain and test than the rest of the knowledge base. That is why we attempt to replace the set of rules with something automatic, more flexible and adaptable.
There are other difficulties in obtaining the rules: symptoms, syndromes and definitions are easy to obtain and check against common technical bibliography, and it is relatively easy to achieve consensus on those matters. But it is very hard to get consensus on interrogatory rules from experts when several medical specialties are involved.
Another advantage of having automated strategies lies in bringing in more objectivity and independence from any personal bias in decisions. Automatic implementation would expose clearly and explicitly the criterion used. Working with the rules, the criterion is always an opinion based on domain expertise.
We work with a very simple metric, easy to measure and strictly objective: the number of questions an operator must ask before closing a session with a sound presumptive diagnosis.
According to the severity of the situation, our target changes. Emergency situations ask for faster decisions because in many cases they are literally life or death decisions where every second counts.
There are many other possible metrics –for instance, performance of a decision algorithm in time, memory space or processor usage. But beyond some practical and obvious range, such things are not relevant to this problem. It is pointless to concentrate on accelerating an algorithm, since in a telephone session the question/answer time works at a largely different scale.
In this work, we try to generate automated strategies for interrogatories, dynamically adjusted to information acquired during the session. These strategies would only be based on the knowledge base and on information obtained from the current session.
For this project an experimentation environment was required. We called this the “virtual lab”, a space where we could develop different strategies and simulate interaction in call sessions. The target of these simulations is to study and enhance the strategies, but mostly to measure their performance.
Of course, this environment is a Smalltalk where we used objects to model every domain concept. There we also instantiated the knowledge base.
The main ontology classes are modelled in this very simple hierarchy. We used an abstract class, NamedValue, whose main characteristic is representing objects identified by a name. We use class variables as repositories for instances, where we store the knowledge base, along with a protocol for accessing a specific instance or all of them.
We did not use classes to model the concepts of frequency and severity; instead, we used symbols for them.
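The NamedValue pattern just described can be sketched in Python (the original is Dolphin Smalltalk; the class and method names below only mirror the description and are illustrative, not the real API):

```python
# Each NamedValue subclass keeps its own class-level repository of
# instances, indexed by name, with a protocol to fetch one or all.

class NamedValue:
    _repository: dict  # created per subclass in __init_subclass__

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        cls._repository = {}                    # per-subclass repository

    def __init__(self, name):
        self.name = name
        type(self)._repository[name] = self     # register on creation

    @classmethod
    def named(cls, name):
        return cls._repository[name]            # access one instance

    @classmethod
    def all(cls):
        return list(cls._repository.values())   # access all instances

class Symptom(NamedValue):
    pass

class Syndrome(NamedValue):
    def __init__(self, name, systems, severity, frequency):
        super().__init__(name)
        self.systems = systems      # a syndrome may belong to several
        self.severity = severity    # plain symbols/strings, not classes
        self.frequency = frequency

Symptom("Diabetes")
Syndrome("Diabetic Ketoacidosis", ["metabolic"], "red", "low")
```

As in the Smalltalk version, frequency and severity are plain symbols rather than classes of their own.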
We defined another simple hierarchy for logical expressions, which we use as syndrome definitions. This is almost a textbook exercise: most operations have a trivial implementation, and only two domain-related details stray from the standard.
Variables are actually linked to symptoms in a session. Most of them have only boolean values, because symptoms are either present or not. But some are quantified, e.g. temperature or blood pressure; for those values we had to add comparisons with fixed numbers.
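The mix of boolean and quantified variables can be sketched as follows (a minimal Python rendering of the expression hierarchy; the helper names and the high-fever example are illustrative, only the Appendicitis definition comes from the slides):

```python
# A definition is a function from a session's observations to a boolean.
# Observations map a symptom name to True/False, or to a number for
# quantified symptoms such as temperature.

def And(*parts):  return lambda obs: all(p(obs) for p in parts)
def Or(*parts):   return lambda obs: any(p(obs) for p in parts)
def Not(part):    return lambda obs: not part(obs)
def Has(symptom): return lambda obs: bool(obs.get(symptom, False))

def Above(symptom, limit):
    # Quantified symptom compared with a fixed number.
    return lambda obs: obs.get(symptom, 0) > limit

# The Appendicitis definition from the earlier slide:
appendicitis = And(Has("right iliac fossa pain"),
                   Has("abdominal pain"),
                   Not(Has("appendix operation")))

# A hypothetical definition mixing a quantified and a boolean symptom:
high_fever = Or(Above("temperature", 39), Has("shivering"))
```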
The Observation hierarchy refers to what I said before: a symptom observation tells us whether the symptom is present or not, or may return a quantity in some cases. An observation is what we get in the answer to a question.
We started modelling questions and answers as first-class objects, but for this work their behavior was not interesting.
Diagnoses represent a possible diagnosis for a specific syndrome, in the context of a session. Evaluating the syndrome definition in that context, a diagnosis can be definitely false (if the current information makes the definition evaluate to false), complete (if the evaluation results in true), or open in any other case.
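The three diagnosis states follow from a three-valued evaluation, where a symptom that has not been asked about yet is simply unknown. A Python sketch, assuming a tuple encoding of definitions (the representation is invented for illustration; the Diabetic Ketoacidosis definition is the one from the example slides):

```python
# Three-valued evaluation: True, False, or None for "not asked yet".

def evaluate(expr, answers):
    op = expr[0]
    if op == "sym":
        return answers.get(expr[1])            # None until answered
    if op == "not":
        v = evaluate(expr[1], answers)
        return None if v is None else not v
    vals = [evaluate(e, answers) for e in expr[1:]]
    if op == "and":
        if any(v is False for v in vals): return False
        if all(v is True for v in vals):  return True
        return None
    if op == "or":
        if any(v is True for v in vals):  return True
        if all(v is False for v in vals): return False
        return None

def diagnosis_state(definition, answers):
    # false: disproved; complete: fully confirmed; open: anything else.
    return {True: "complete", False: "false", None: "open"}[
        evaluate(definition, answers)]

ketoacidosis = ("and",
    ("or", ("sym", "Diabetes"), ("sym", "History of diabetes")),
    ("or", ("sym", "Unconsciousness"), ("sym", "Confusion"),
           ("sym", "Ketonic breath"), ("sym", "Dyspnea")))
```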
CallSessions represent real interrogatory sessions: they record questions and answers, and they interact with a strategy to get the next appropriate question or to decide whether the session should end.
We needed automation for testing and benchmarking strategies, simulating patient calls. AnswerProvider plays the role of a patient: it knows a specific syndrome and one set of symptoms that characterizes it, and it answers the session’s questions consistently.
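The AnswerProvider role is tiny; a Python sketch of the idea (the real class is Smalltalk, and this interface is illustrative):

```python
# A simulated patient: fixes one syndrome and one characteristic symptom
# set in advance, then derives every answer from that fixed set, so it
# can never contradict itself.

class AnswerProvider:
    def __init__(self, syndrome_name, present_symptoms):
        self.syndrome_name = syndrome_name
        self.present = set(present_symptoms)

    def answer(self, symptom):
        # The answer is always consistent with the chosen symptom set.
        return symptom in self.present

patient = AnswerProvider("Diabetic Ketoacidosis", {"Diabetes", "Confusion"})
```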
We used the SUnit framework for benchmarking. Although we were not doing unit tests, as Andres Valloud demonstrated before, SUnit has several properties that make it a comfortable tool for organizing massive batches of trials. We collect the results in text files.
We added StatisticalCollector later, to provide more navigation capabilities over the knowledge base –for instance, all syndromes of a specific system, or all those with red severity. With this statistical collector we began a quantitative study of the knowledge base.
This is a plot of the number of syndromes per system; every vertical bar stands for a specific system.
We can see that syndromes are fairly scattered. The first group of three systems sticks out, concentrating a big mass of syndromes, although none of them reaches one hundred, or even ninety. Every other system has fewer than forty syndromes, most have fewer than twenty, and a good number have ten or fewer.
Refining our plots to show the number of syndromes per system and severity, we see the same scattering on a different scale.
In emergencies (red severity), only two systems have more than ten syndromes, and eight systems have only one.
In urgencies (yellow severity), only two are above twenty, only four above ten, and most have fewer than five.
In green severity we have two above fifty but, excepting the first group of five, all are below twenty and most have fewer than ten syndromes.
These figures show great dispersion among systems. But they also tell us that if we could somehow guess the right system and severity, the problem shrinks greatly and becomes manageable. We checked this with experts, and it matches some criteria they commonly use.
This plot displays in how many systems each symptom appears.
The vast majority of symptoms, seventy-five percent, appear in syndromes of only one system. Few symptoms appear in more than five systems.
This suggests that in most cases, with one or two symptoms reported by the patient, we might determine the system.
I will show an example of operation, to clarify the terms and problems I mentioned before. I will follow the simulation of a call where the patient has Diabetic Ketoacidosis. This syndrome’s definition has alternatives for its characteristic symptoms, so we can choose among 8 (minimal) sets of symptoms; the simulation will try them all in a sequence of simulated calls.
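The 8 minimal sets (the subsyndromes) come from taking one alternative per AND-clause of the definition, 2 × 4 = 8. A sketch of that enumeration in Python (the encoding of the clauses is illustrative):

```python
from itertools import product

# One list of alternatives per AND-clause of the definition.
clauses = [
    ["Diabetes", "History of diabetes"],                            # 2
    ["Unconsciousness", "Confusion", "Ketonic breath", "Dyspnea"],  # 4
]

# One minimal symptom set per combination; the simulation runs one
# simulated call for each of them.
subsyndromes = [set(choice) for choice in product(*clauses)]
```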
Let’s analyze what happens with a particular combination, Diabetes and Confusion. The AnswerProvider chooses one of them, Diabetes, as the symptom first reported by the patient. As we can see, this is not yet enough to determine a system. The session should continue asking, guided by a strategy and by the knowledge base, trying to discover that the patient also has Confusion, and close with a Diabetic Ketoacidosis diagnosis.
Here we see how the current information reduces the alternatives with positive evidence. This is an application of abductive reasoning, very common in medical applications. From 772 syndromes, we are going to focus on only 6, which have Diabetes as a symptom in their definitions.
These six syndromes have nine different symptoms we could choose to ask about. If we went at random, we would risk asking 9 questions, a number too large for an emergency. In this case, a good criterion could be to discern whether pregnancy has something to do here.
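The abductive narrowing just described can be sketched in Python: keep only the syndromes whose definitions mention a confirmed symptom, then collect the remaining symptoms of those definitions as question candidates. The toy knowledge base below is invented for illustration, only Diabetic Ketoacidosis comes from the slides:

```python
def with_positive_evidence(syndromes, confirmed):
    """syndromes: dict name -> set of symptoms in its definition."""
    return {name: syms for name, syms in syndromes.items()
            if confirmed & syms}

def question_candidates(syndromes, confirmed, denied):
    # Every symptom still mentioned by a focused syndrome, minus the
    # ones already asked about.
    asked = confirmed | denied
    return {s for syms in syndromes.values() for s in syms} - asked

base = {
    "Diabetic Ketoacidosis": {"Diabetes", "Confusion", "Dyspnea"},
    "Gestational diabetes":  {"Diabetes", "Pregnancy"},
    "Appendicitis":          {"Right iliac fossa pain", "Abdominal pain"},
}
focus = with_positive_evidence(base, {"Diabetes"})
```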
A negative answer gives us good information –one more system is discarded– but we still have 5 syndromes.
Knowing the system, we could try a criterion to decide quickly whether it is an emergency or not. No single criterion serves all cases well, and that is why we developed different strategies. Let’s go one step further, supposing we somehow choose dyspnea.
A negative for this symptom did not add valuable information. Note that a positive one would have!
So, that was one “wasted” question, and we still have to decide which of seven symptoms to ask about.
We first built a generation, or family, of strategies using little or no information. They were first attempts to study the domain and tune the framework, but they also serve as baselines: if a strategy yields worse results than asking at random –which is what RandomStrategy does– then it does not deserve attention.
The second generation was inspired by the statistics we saw on the previous slides. Each strategy represents an attempt to somehow guess the system of the patient’s syndrome, in order to cope with a reduced problem afterwards. But these attempts were not successful.
These are the results for the first and second family of strategies in cases of emergency.
We can see how inadequate they are. Only MoreSatisfiers and MoreCriticSeparation seem somewhat interesting in the first group. In the second, GuessSystemUsingPairs yields a good question count, but an error rate far too high.
Note that none of them has a 0% error rate, but this is due to some internals of the knowledge base and the criterion used to automatically close the session. There are some emergency syndromes with very small symptom sets, which are implied by other syndromes. When we simulate a call for a syndrome implying another, it is common to complete the smaller one first and then close the session. In real-life cases or non-simulated tests, the operator does not do that.
From previous attempts we envisioned the utility of a single measure of the degree of confirmation, or likelihood, of a set of syndromes. With it we could know whether one system/severity group is more likely than another in a given context, or whether the plausibility of a group grows or diminishes after a given question.
With a single indicator it is easier to focus strategies on maximizing it.
So we defined the support measure, quite similar to the score ExpertCare already had.
Basically, we add positive points for syndromes having, in their definitions, symptoms already confirmed in the session, and negative points for symptoms confirmed as absent. We give a bonus to syndromes with fully confirmed diagnoses and penalize those with false diagnoses.
Third-generation strategies are based on different selection criteria, but they all look for the symptom producing the biggest support difference.
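A minimal Python sketch of that idea: score each syndrome from confirmed and denied symptoms, then pick the candidate symptom whose yes/no answers pull the total support furthest apart. The weights and the toy data are invented for illustration; the real calculation also adds the large bonuses and penalties for complete and disproved diagnoses mentioned above:

```python
def support(definition_symptoms, confirmed, denied):
    # Confirmed symptoms add points, denied symptoms subtract them.
    return (len(definition_symptoms & confirmed)
            - len(definition_symptoms & denied))

def total_support(syndromes, confirmed, denied):
    return sum(support(s, confirmed, denied) for s in syndromes.values())

def separation(syndromes, confirmed, denied, symptom):
    # How far apart the two possible answers push the total support.
    yes = total_support(syndromes, confirmed | {symptom}, denied)
    no  = total_support(syndromes, confirmed, denied | {symptom})
    return abs(yes - no)

def next_question(syndromes, confirmed, denied, candidates):
    return max(candidates,
               key=lambda s: separation(syndromes, confirmed, denied, s))

base = {
    "Diabetic Ketoacidosis": {"Diabetes", "Confusion", "Dyspnea"},
    "Gestational diabetes":  {"Diabetes", "Pregnancy"},
    "Hypoglycemia":          {"Diabetes", "Confusion"},
}
```

With Diabetes already confirmed, Confusion appears in two open definitions, so its answer moves the total support more than Pregnancy or Dyspnea and it is chosen as the next question.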
And here results improved dramatically.
We developed several strategies; some were plain failures, but most were successful. We were able to reach the numbers of questions we had established as our target at the beginning.
Here you can see for yourselves that for emergencies the target is achieved by the last five strategies: an average between one and two questions, with a median of one and a low error rate.
Here we have the performance of support-based strategies on yellow severity. Again, our target is fulfilled in the last group.
And here we see how support-based strategies perform on green severity. The results are surprisingly good in terms of number of questions. The error rate increases, because it is more probable that a syndrome with higher severity gets confirmed before the “real” one.
There are several reasons why this is a very successful work. It is my thesis work, and that is not a minor detail to me. But I was interested –and so I told my director– in doing something that was not just interesting research never to be applied. He then gave me this chance with a real product, one that is in the market and works for people’s health.
Getting rid of the high cost of developing complex sets of interrogatory rules, replacing it with the task of tuning strategies to a new domain, would allow the application to reach new markets.
From the beginning, we stuck to a scientific way of working at all levels: discussion, analysis, code, tests… everything was addressed with scientific practices.
Finally, this was a truly interdisciplinary work.
You all know Smalltalk’s advantages for modelling and simulation tasks, but I will summarize some features we used specifically in this work.
The class hierarchies implementing domain concepts and all the virtual lab were developed very quickly. In a couple of part-time weeks we had the environment and could test the first group of strategies.
From there on, the gap between having an idea and implementing and testing it was minimal. The main bottleneck was the time required for running the tests through all combinations of symptoms of every syndrome. For some strategies, a full run takes several days (but for the last group, little over one hour).
Performance problems were attacked to accelerate these long test runs, and were solved with simple caches. No complex or sophisticated programming techniques or tools were needed: there is no big or complex data structure, and there are not many lines of code.
The debugger was the main tool in this work. We used it to run the strategies step by step, verifying their internal operation by hand or analyzing particular cases that required large numbers of questions.
There is much more work to be done. Here we settled some basics, defined a line of work, and produced a good proof of concept.
One of the items on our wishlist throughout the job was a visual tool to graphically analyze strategy operation. At the beginning we did not tackle it because definitions of neighbourhood and navigation were not available. We could have built it before finishing the work, and it might have been more useful than the debugger and the inspector for studying some cases.
Along the same line, we could have used tools for interacting with sessions and strategies, for a better understanding of their operation.
It would have been useful to have a tool to configure and run the benchmarks, which were set up by adding test methods and running test cases by hand.
Besides programming tools, there is a lot of work to be done for integrating the automated strategies into ExpertCare in domain terms.
We need a place for exceptions and special rules defined by hand. Some will be required to handle cases where automated strategies do not perform well; others deal with patient psychology and session handling.
Tests and benchmarks with real interrogatories are a must. All our tests so far were performed with simulated patients, which never fail, never contradict themselves, never ignore things and never get stuck on questions.
Finally, we should attempt adaptation to other knowledge bases, in order to verify that these strategies do not rely too heavily on the properties of this one.