Autonomous medical coding with discriminative transformers

Autonomous Medical Coding
with Discriminative Transformers
Patrick Nicolas Feb 2023 v0.3
A Tale from the trenches...

“Autonomous medical coding refers to the use of artificial
intelligence (AI) and machine learning (ML) technologies to
automatically assign medical codes to patient records. Medical
coding is the process of assigning standardized codes to
diagnoses, medical procedures, and services provided during a
patient's visit to a healthcare facility. These codes are used for
billing, reimbursement, and research purposes.
…
By automating the medical coding process, healthcare
organizations can improve efficiency, accuracy, and consistency,
while also reducing costs associated with manual coding.”
ChatGPT

Transformers and self-attention models are becoming pre-dominant
solutions in data scientist’s NLP tool box.
This presentation describes the creation, deployment and evaluation
a discriminative transformer to extract medical codes from
Electronic Health Records (EHR) while minimizing development and
training costs and keeping model up to date.
* This presentation assumes that the reader is familiar with concept of
transformer and architecture [Ref 1, 2]. This introduction to transformer for
medical coding should not be regarded as a formal technical paper.
Objective
Autonomous Medical Coding with Discriminative Transformers Patrick Nicolas

Background
Medical coding is the transformation of healthcare diagnosis,
procedures, medical services described in electronic health
records, physician's notes or laboratory results into
alphanumeric codes [Ref 3]
Medical codes are assembled into a claim to be paid by private
insurance carriers or Medicare.
See Appendix 6,7

Problem definition
Modeling
Implementation
Evaluation
References

Three challenges:
1. How to extract medical codes reliably given error prone labeling of
medical codes and inconsistency of clinical documentation?
2. How to minimize the cost of self- training complex deep models
such as transformers while preserving an acceptable accuracy?
3. How to continuously keep models up to date in production
environment?
Problem definition/Challenges

The sheer number of medical codes makes their extraction very
challenging:
• Almost infinite number of codes combination associated with a
given medical note
• Highly inconsistent patient charts (terminology, format, length)
• Difficulty to extract contextual information (Medical information
systems)
Problem definition/Challenges/Medical codes extraction

Model
Medical
Notes
Medical codes
Cohort studies
Insurance claims
Prediction hospital stay
Effect medication
…
This study focuses on automated generation of medical codes and
health insurance claims from a given clinical note.
There is a significant interest in extracting useful information from
medical notes such as diagnostic codes, insurance claims, prediction
of hospital re-admission.
Electronic
Medical
Records
(EMR)

Medical codes are extracted from clinical notes or patient charts to
predict procedure outcome, hospital stays or produce insurance claims.
The most common medical codes are
• International Classification of Diseases (ICD-10) for diagnostic
(~72,000 codes)
• Current Procedural Terminology (CPT) for procedure, medication
(~19,000 codes)
• Modifiers, SNOMED, …

A widely discussed energy study of deep learning models estimates
that training a large language model (LLM) produces 626,155
pounds of planet-warming carbon dioxide, equal to the lifetime
emissions of five cars.
For example, GPT-3/ChatGPT was trained on half a trillion words
with 175 billion parameters. It would take 355 GPU-years and cost
at least $4.6M for a single training run. Research is underway to
optimize resources for creating future models [Ref 4].
Problem definition/Challenges/Minimizing costs

Problem definition/Challenges/Up-to-date models
Real-time customer data is always evolving, sometimes outside
the range of data distribution used to train the models.
This issue is specially acute for transformers which require fine
tuning for the classification task and potentially re-initiate
pretraining, all being costly operations.
What is the benefit/cost of frequent updates?

Modeling/Overview
The extraction of medical codes from clinical documents relies on 4
distinct processes:
• Deterministic: Generate vocabulary and tokenize
• Unsupervised learning: Encoding EHR document
• Supervised learning: Classify medical codes
• Active/Transfer learning: Update model
Tokenizer
Transformer
Feed forward network
Active/Transfer learning

Modeling/Overview/Architecture
Beside the 4 key AI/NLP components Tokenizer, Bidirectional
Encoder Representations from Transformers (BERT), Neural
classifier and Active/transfer learning model, the architecture
needs to support a flexible integration mechanism with existing
IT systems.
Asynchronous messaging queue with streaming (i.e. Kafka) and
REST API interfaces are commonly used in the productization of
AI systems.

EMR
Kafka
Transformer
Encoder
Feed
Forward
Classifier
REST
Integrator
Interface
BERT Neural model
Vocabulary
Tokenizer
Modeling/Overview/Architecture
Redox.epic,
Topflight, …
Active/Transfer learning
Simulator
Illustration of an architecture that supports integration with external
medical IT systems and 4 key AI components

p(claim | note) =
p(claim | note embedding)
x p(note embedding | tokens)
x p(tokens | vocabulary, note)
Priors
Modeling/Overview/Priors
The goal is to predict a claim given a medical note and EMR context using
• Tokenizer extracts tokens, segments & vocabulary from a corpus of
notes
• Transformer encoder generates an embedding of note [Ref 5]
• Neural classifier predicts a set of medical code or insurance claim given
the embedding of the note
Tokenizer
Transformer
Feed forward network

MLM
Text processor
NSP
...
Token embedding
Position embedding
Type embedding
Embedding norm
Embedding dropout
Encoded Chart & context
Encoded segment
Tokenizer
Segmentation model
Token embedding
Position embedding
Type embedding
Embedding norm
Embedding dropout
Softmax
Hidden layer
Input layer
Concatenation
Medical codes
...
Patient chart
Tokenizer
Transformer
Classifier
• NLP processes and encodes
patient charts/medical
notes
• Transformer encodes
segments/sections of notes
[Ref 1]
• Classifier concatenates
segment embeddings and
predicts bag of codes
Modeling/Overview/Components

Load data (S3, RDS,..)
Pre-process
Importance sampling (*)
Build training set
Transformer pretraining
Select subset of
pretrained data
Classifier training
Evaluation
Training set
lifecycle
Create
Update state
Update
state
It is critical to keep track of the
state of the training set along the
various data transformation and
models.
It defines the operating range of
the model once deployed in
production
(*) Required only to overcome
limited computational resource in
pretraining
Modeling/Overview/Training data lifecycle

The quality of output from a transformer encoder is as good as its input:
tokens and segments/sentences extracted from the clinical documents:
1. What type of vocabulary is relevant to the extraction of tokens from
the notes (Domain specific, abbreviations, Tf-Idf, …)?
2. How to break a note in meaningful segments (sections, sentences)?
3. How can we input/embed contextual data related to the patient,
provider into the encoder?
Modeling/Tokenizer

The tokens input to the transformer encoder are extracting from medical
corpus using a custom vocabulary.
Vocabulary are built using any combination of
• American Medical Association terminology and abbreviation
• Terms weighted/filtered by their TfIdf score
• Word sense
• Abbreviation
• Semantic definition
Autonomous Medical Coding with Discriminative Transformers Patrick
Modeling/Tokenizer/Vocabulary

Stemming and lemming are implicitly implemented through the
wordpiece tokenizer of the transformer encoder
Rule or
machine
learning
model
AMA glossary
TF-IDF score ranking
Abbreviations
Stemming/lemming
Corpus Vocabulary
The generation of the vocabulary can be implemented through
either a rule (current implementation) or a probabilistic model.

Q: What is the optimum vocabulary to predict medical codes?
A: The American Medical Association uncased words combined
with the top 85% terms extracted from the training medical notes
with the highest TfIdf score. However, this combination requires a
costly update and it depends on the availability of training data.
See Appendix 1

Transformer
encoder
Segment
embedding
Medical
note
embedding
This approach breaks down a medical note into sections/segments
that are encoded by the transformer.
Medical note
Segment
Segment
embedding
Segment
embedding
The embeddings are then concatenated to product an embedding
for the medical note.
Concatenation
Modeling/Tokenizer/Segmentation

Labeled
medical
notes
Medical note
embedding
Medical document encoder using labeled-based clustering of
clinical notes [4]
The documents associated to a given labeled group of medical
codes are group into a corpus with document as segments.
Medical notes
Transformer
encoder
Medical note
embedding
Medical note
embedding

Contextual information such as patient data, location … is tokenized
(bucketing) and added to the tokens extracted from the medical note.
These contextual tokens can be
• Defined in their own segment
• Added to the tokens associated with first segment/section of the note
• Added to the tokens associated with a random segment/section of the
note [Ref 6]
Modeling/Tokenizer/Context embedding

Q: What is the optimum segmentation of medical records?
A: Organizing electronic health records as logical or arbitrary group
of segments/sentences is not a trivial endeavor.
Surprisingly, using entire note a sentence with AMA vocabulary
(disabling NSP model) generate poor accuracy for the downstream
classifier.
See Appendix 2

We are using a variant of Bidirectional Representation for
Transformer (BERT) [Ref 3] to
• Understand the contextual meaning of medical expressions
• Generate embedding/representation of the combine clinical note
and EMR contextual data.
Such a model is built into 2 steps
1. Pretraining on a large domain specific corpus
2. Fine tuning for specific application (classification)
This presentation does not describe the concept behind
transformer, word and document embedding
Modeling/Transformer

Pretraining
Pretrained Classifier
Embedding
weights
Classifier
weights
Accuracy
loss
Features extraction
Fine-tuning
Accuracy
loss
Pretrained Classifier
Embedding
weights
Classifier
weights
Training
set
Once pretraining is completed, the
classifier is trained
• Either from the output of the
pretrained model (note
embeddings)
• Or as part of fine tuning the
pretrained model [Ref 1, 6]. Fine
tuning is used on conjunction with
active learning to update models
Training
set
Modeling/Transformer/Strategy

Modeling/Transformer/Pretraining or not
It is recommended to leverage one of the pretrained BERT
models such as ClinicalBERT [Ref 7] embedding output then
customize the transformer for classification using a fine tuning
strategy.
However, for this project, we pretrained BERT on specific
corpus of clinical notes to estimate the impact of vocabulary
and segmentation on the accuracy of the prediction.

A transformer module is based on
a self-attention block which
processes token, position and type
embedding before normalization.
These modules are stacked to form
the encoder. Similar design is used
for the decoder.
Self-attention
Sum & normalization
Sum & normalization
FFNN FFNN
Next encoder block
Tokens and positions encoding
Modeling/Transformer/Attention

Transformer
Segment
embedding
Segment
embedding
Document
embedding
Feed forward
neural network
Softmax
In this configuration, the segment embedding generated by the
transformer are concatenated into a single vector as input to the
fully connected classifier.
A softmax layer is used to predict the claim with the highest
probability
Modeling/Transformer/Document embedding

Transformer
Segment
embedding
Segment
embedding
Feed forward
neural network
Softmax
The list of aggregation operators include addition, multiplication
max, convolution.
The segment embedded vectors are aggregated, preserving the
dimension of the output of the transformer.

Q: What is the most performant generation of document
embedding from output of transformer encoder?
A: Concatenating segment embeddings to represent a medical
document improves accuracy over aggregating those segment
embeddings.
The embedding of CLS token output of the BERT encoder is more
appropriate for classification tasks [Ref 2,8]
See Appendix 4

Modeling/Neural classifier
The classifier is implemented as a very simple feed forward
neural network (fully connected) as a more complex
architecture may not improved significantly the accuracy of
predictions. Beside the usual hyper-parameters optimization,
various network layout have been evaluated.
The layout of the network (number and size of hidden layout)
has a limited impact on the overall performance of the
prediction
See Appendix 5

Modeling/Active-Transfer learning
Update models for covariate shift in the distribution of real-
time data during inference.
Two-prong strategy
1. Sample data for which labels are outlier to the distribution
initially used in training (Active learning) [Ref 9]
2. Fine tune transformer for classification task with the
sample (Transfer learning)

Python
DL framework/C++
TensorFlow, Torch, MXNet, Blast,..
Model parameters
Java/Scala
DJL Engine/JNI
Training
Inference
Wrapper/CPython
S3/Local file/RDBMS/HDFS
Implementation/training vs inference
We leverage the most appropriate
frameworks for
1. Training: Python/MXNet to train
and store model parameters
2. Inference: Deep Java Library (DJL)
to load parameters and run
predictions.

Deep Java Library (DJL) is open-source Java framework that supports
the most common deep learning frameworks; MXNet, PyTorch and
TensorFlow [Ref 10]
DJL ability to leverage any hardware configuration (CPU, GPU) and
integrated with big data frameworks makes it the ideal solution for a
highly performant distributed inference engine. DJL can be optionally
used for training too.
Implementation/Deep Java Library (DJL)

Kafka Streams Spark 3.x
Kubernetes
DJL
TensorFlow
MXNet
Torch
Memory
Manager
The Java API of DJL makes is easy to integrate with existing big data
frameworks such as Kafka and Spark to sketch and implement an
efficient distributed inference production platform [Ref 11]
Source code related to apache Spark, Kafka and DJL execute on JVM
powered by CPU cores, while deep learning libraries execute binary
code (C++) on GPU.
Implementation/Data streaming

Deep learning models such as transformers
have 100+ million parameters.
They are broken down into reusable and
testable blocks [Ref 10]
1. Transformer as stack of pretraining blocks
2. Pretraining block contains BERT module
3. BERT block contains multiple embeddings
4. Each embedding has several parameters
Transformer
Pretraining block
BERT block
Token embedding
Parameter
Implementation/Neural blocks

Genetic algorithm Model Execution
Metrics
& Loss
Parameters
Segmentation
BERT size
FFNN layout
Batch size
Vocabulary
Learning rates
Convergence
Down sampling
….
Evolutionary algorithm was used to optimize the parameters for
the transformer and classification models [Ref 12]
Genetic algorithm have to be pre-tuned (population size,
mutation ratio, fitness …)
Implementation/Hyper-parameters optimization

CPU
Core
Memory
GPU
TPU
Core
Memory
ETL
Deep
learning
Computation load distribution
across CPU and GPU/TPU cores
Implementation/CPU-GPU process balancing
One key challenge for for an effective
training and inference is the distribution
of the computation load:
• Across CPUs for data processing
pipeline/ETLs
• Across GPUs for deep learning
pipeline
• Between the CPU and GPU clusters
See Appendix 8

Implementation/Computation load
Tokenizer
Transformer
Encoder
Feed
Forward
Classifier
CPU
GPU
TPU
Note Vocabulary
tokens
segments
Encoded
note &
context
Medical
codes
Tokenizer and NLP algorithms
distribute the execution on
multi-core CPU
The BERT encoder and Feed-
forward neural network
invokes deep learning
libraries that run efficiently
on GPU or TPU

Implementation/CPU-GPU Memory management
Contrary to the code executing on CPU which relies on JVM to manage
memory consumption, memory block tensors processed by GPU have to
be manually allocated and released.
1. Convert input from Java object to Float32 tensor
2. Create a memory manager for GPU processing
3. Attach input to memory manager
4. GPU processes input and generate output
5. Attach output to memory manager
6. Convert output to Java object
7. Close/delete memory manager

Pretraining a transformer is a very expensive operation. The
parameters for the classifier and transformer are coarsely estimated
on a small training set before been refined on a larger set.
Evaluation/strategy
Training/evaluation on 20K notes

Future development
Future improvement/evaluation
• Evaluate Convolutional Neural Network for classifier
• Encode semantic information along with token and positioning for
Masked Language Model (MLM)
• Apply importance sampling to reduce the cost of pretraining
• Quantify the impact of pre-trained ClinicalBERT

Special thanks to
• AWS-AI Deep Java Learning team for its support
• LinkedIn Artificial Intelligence In Health Care and Natural
Language Processing groups for their timely feedback

References
[1] Getting started with Google S. Ravichandiran Bert 2021 Packt Publishing
[2] Pretraining of Deep Bidirectional Transformers for Language Understanding
[3] Medical coding – AAPC
[4] A Survey on Efficient Training of Transformers
[5] Towards Transformer-based Automated ICD coding: Challenges, Pitfalls and Solutions
[6] BERT-XML: Large Scale Automated ICD Coding Using BERT Pretraining
[7] ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission
[8] BERT for Long Documents: A Case Study of Automated ICD Coding
[9] Active Learning Sampling Strategies
[10] Deep Java Library
[11] kafka.apache.org
[12] Hyper-parameter optimization algorithms: a short review

A1 Impact of vocabulary
Vocabulary Size Prediction
precision
TfIdf50 93K 70.11%
TfIdf80 101K 70.78%
AMA 117K 66.18%
AMA-TfIdf85 121K 76.41%
AMA-TfTdf95-
CodeDesc
124K 72.85%
• AMA: American Medical Association
terminology + Abbreviation
• TfIdf50: Terms with top 50% highest TfIdf
score
• AMA-TfIdf85: Combined AMA and top 95%
highest TfIdf score terms
• AMA-TfIdf95-CodeDesc: Combined AMA,
top 95% highest TfIdf score terms and label
code descriptor
Context: 240K notes with batch size 24, Wordpiece tokens, BERT base and
concatenated segment embedding
Impact of vocabulary on accuracy of the extraction of health insurance claims

A2 Impact of segmentation
Segmen
t model
Prediction
precision
F1
CT-T 70.11% 49.11%
CT-T-T 76.78% 52.92%
CT-T-T-
T
74.03% 52.74%
Group4 60.85% 41.17%
Group6 58.19% 41.83%
• CT-T: Note split in 2 section with contextual
data in first section
• CT-T-T: Note split in 3 sections with
contextual data in first section
• CT-T-T-T: Note split in 4 sections with
contextual data in first section
• Group4: Document defined as 4 sentences,
each defined as a clinical note
• Group6: Document defines as 6
notes/sentences
Context: 240K notes with 24 notes batch, Wordpiece tokens, BERT base,
Concatenated segment embedding
Impact of segmentation on accuracy of the extraction of health insurance claims

A3 Impact of BERT model size
BERT
model
Prediction
precision
Micro
F1
Micro 67.12% 49.66%
Base 75.29% 56.20%
Large 73.03% 51.85%
(*) The training set was limited to ~160K medical notes because
of high cost of training using BERT large model
Context
• Segmentation: CT-T-T
• Wordpiece tokenizer
• Note embedding: concatenated
• Vocabulary: AMA-TfIdf85
• Convergence ratio: 0.99
• Training set: 168K notes
Impact of BERT model size on accuracy of the genertion of
health insurance claims

A4 Impact of encoding scheme
Segment
embedding
Note embedding Prediction
precision
Concatenate Pooled output 67.95%
Sum Pooled output 65.89%
Concatenate CLS Embedding 77.17%
Sum CLS Embedding 70.87%
Context
• Learning rate: 5e-4
• Vocabulary: AMA
Impact of the embedding scheme for segment and note on accuracy of
the genertion of health insurance claims

A5 Impact of classifier neural layout
Network layout Hidden layer
layout
Prediction
precision
1 Hidden layer 64 70.98%
2 hidden layers 64, 16 70.61%
2 hidden layers 128, 24 68.95%
3 hidden layers 128,48,20 69.80%
Hyper-parameters
• Learning rate: 5e-4
• Vocabulary: AMA
• Note embedding: Concatenate
• Segment embedding: pooled output
Impact of the layout of feed forward neural classifier on accuracy of the
genertion of health insurance claims

A6 Anatomy of a claim
Procedure
code (CPT)
Modifier
codes
Diagnostic
codes (ICD-10)
70498 26 R29.818
G9637 R29.818
A health insurance claim consists of one or more description of
procedure, modifier and diagnostic codes which reflects the logic
behind a rendered medical service.

A7 Claim vs. diagnostic codes
A health insurance claim reflects service provided by a provider.
The medical documentation related to this service may have
vastly different content and format.
The extraction of medical codes from a clinical note has to be
absolutely accurate because the outcome such as hospital stay,
medication or procedure Is directly link these diagnostic codes.
Claims with slightly different codes may by valid for a given
service, as long as the input note, history of patient, diagnosis
and recommended procedure are consistent.

A8 AWS deployment
Amazon/AWS instances for training
• p3.2xLarge 8 vcores CPU/64 G, 1 V100 GPU/16G
• p3.8xlarge 16 vcores CPU/96G, 4 V100 GPU/64G
• g5.4xlarge 16 vcores CPU/64 G, 1 A10G GPU/24G
AWS instance for inference
• g4dn.4xlarge 16 vcores CPU/64 GB, 1 T4 GPU/16G

Autonomous medical coding with discriminative transformers

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Similar a Autonomous medical coding with discriminative transformers

Similar a Autonomous medical coding with discriminative transformers (20)

Más de Patrick Nicolas

Más de Patrick Nicolas (12)

Último

Último (20)

Autonomous medical coding with discriminative transformers