Contiguity, LLC
www.contiguitydx.com
Introduction to Digital Biomarkers
A Technology White Paper
V. 1.0
August 10, 2016
Image Analysis and Classification
Transforming image based data sources into powerful
predictive tools.
INTRODUCTION TO DIGITAL BIOMARKERS
Contents
Introduction
Classification Algorithms
Digital Biomarkers
Dual Optimization
The Classifier Development Process
Developing the Software and Algorithms – The CAMELYON16 Grand Challenge
Strategy for Meeting the Grand Challenge
Conclusions
Applications
  Medicine and Life Science
  Other Industries
Methods and Acknowledgements
Contacting Contiguity
References
Introduction
Contiguity has set out to develop a software suite that digitally identifies unique features in image-based
data, uses these features (“Digital Biomarkers”) to assemble multivariate classification algorithms, and
deploys these classification algorithms in an application that can analyze new images and classify
them with little to no user intervention.
Key software features include:
1. Analysis of very large images (> 1 GB per image, i.e. BigTIFF files)
2. An iterative Digital Biomarker and classifier optimization process
3. Rapid development of classification algorithms that are fit for purpose, including multi-tiered
decision tree approaches
4. No requirement for deep, image-specific subject matter expertise to identify and
employ effective digital biomarkers
5. Customizable information output
“Image analysis” is a broad term which includes both the highly specific tasks of machine vision[1] and
the more general methods of pattern recognition[2]. Over the past decade, the continuing
improvement in processor speed combined with the effectively infinite storage capacity of the cloud has
not only made image analysis ubiquitous in consumer applications, but also economically feasible across
a wide range of business and industrial applications[3] [4]. Nearly everyone today carries a computer in
their pocket capable of facial detection, speech recognition, and high speed data transfer (not to
mention phone calls), and terabytes of image data are added to the public and private domains on an
hourly basis. The inescapable conclusion is that image analysis will soon be one of the most significant
expenditures of computational power in the world, if it is not already.
Many current approaches to computer based image analysis largely seek to utilize the same features
and same relationships that humans use when analyzing an image [2]. While this computer assisted
process can provide faster and more consistent results compared to the human-only approach, expert
opinion spans the spectrum on whether it will ever match the performance of the human observer, let
alone exceed it. The only consensus is that such an achievement will be hard.
Another approach that has rapidly gained prominence involves so-called deep learning[5] [3], usually
involving neural networks and other higher-level abstractions of data. The most familiar application of
deep learning is found in the speech recognition ability of modern cell phones, and the same techniques
are being applied to image analysis. Such techniques require very large data sets and the
implementation of “black box” algorithms, often requiring computations that are offloaded to the cloud.
We focus our approach on extracting features that are obvious to the human eye, as well as usually non-
obvious features found by computer search for local optima. We utilize some deep learning concepts by
integrating the discovery of these features with a recursive, broad scope approach to analytics and
multivariate classifier development. This approach is intended both to match current computer or
human vision based analysis and to enable new capabilities. Our approach takes full advantage of what
computers are useful for: quantifying, calculating relationships, and considering massive data
combinations in order to meet predefined performance criteria.
Classification Algorithms
Although the discussion below is applicable across a wide range of fields, for clarity it will be restricted
to terms commonly used in the biomedical field. A biomarker is any biological quantity that can be
measured and therefore assigned a numerical value[6] [7] [8]. Although biomarkers are most
commonly measured from biological fluids, they can also be such quantities as temperature, blood
pressure or body mass index (BMI). The mathematical manipulation of biomarkers to reach a diagnosis is
called a classification algorithm[2] [8] [9]. Conventionally, a single biomarker has been used for
diagnosis, for example, the concentration of prostate-specific antigen (PSA) in blood to diagnose prostate
cancer. PSA concentrations above a certain threshold indicate some likelihood of cancer. Using more than
one biomarker relevant to a particular disease should, in principle, give better results, but how should the
measurements be combined?
The simplest way to combine biomarkers would be a “voting” algorithm, so that if two out of three
biomarkers indicated positive for disease, for instance, and one indicated negative, the combined result
would be positive. But this method fails to take into account the relative strength of the different
biomarkers. If the negative biomarker was considered a much more reliable indicator than the two
positive biomarkers, the combined result should perhaps be negative, not positive.
A linear combination of the biomarker concentrations is the simplest way to take the relative
significance into account[10]:
L = a A + b B + c C + …
where L is the combined score, A, B, and C are the various biomarker concentrations or levels, and a, b,
and c are coefficients which may be positive or negative. When L is above a predetermined level (or
within a predetermined range), the result is positive; when L is below that level, the result is negative.
In effect, this method amounts to a weighted voting algorithm. When the coefficients are determined by a specific
combination of the biomarker averages and standard deviations across the positive and negative disease
states, the above equation is known as a Fisher Linear Discriminant [2]. In practice, the Fisher
coefficients are often used as a starting point to find a better-performing discriminant, often using ROC curve
analysis[11] [12] [13].
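The weighted score described above can be sketched in a few lines of Python. This is only an illustration, not Contiguity's implementation: it assumes the biomarkers are uncorrelated, so each coefficient reduces to the difference of the class means divided by the summed class variances (a diagonal simplification of the Fisher approach). The function names, the sample values and the midpoint threshold are all hypothetical.

```python
from statistics import mean, pvariance

def fisher_coefficients(positives, negatives):
    """Per-biomarker weight: (mean_pos - mean_neg) / (var_pos + var_neg).

    Assumes uncorrelated biomarkers (diagonal simplification)."""
    coeffs = []
    for pos_col, neg_col in zip(zip(*positives), zip(*negatives)):
        spread = pvariance(pos_col) + pvariance(neg_col)
        coeffs.append((mean(pos_col) - mean(neg_col)) / spread)
    return coeffs

def linear_score(coeffs, sample):
    """L = a*A + b*B + ... for one sample."""
    return sum(c * x for c, x in zip(coeffs, sample))

# Hypothetical data: rows are samples, columns are biomarkers A and B.
positives = [(5.0, 1.0), (6.0, 2.0), (7.0, 1.5)]
negatives = [(2.0, 1.2), (3.0, 1.8), (2.5, 1.4)]

coeffs = fisher_coefficients(positives, negatives)

# Assumed decision rule: threshold halfway between the mean class scores.
pos_mean = mean(linear_score(coeffs, s) for s in positives)
neg_mean = mean(linear_score(coeffs, s) for s in negatives)
threshold = (pos_mean + neg_mean) / 2

def classify(sample):
    """True when the combined score L lies above the threshold."""
    return linear_score(coeffs, sample) > threshold
```

In this toy data set biomarker A separates the classes far better than biomarker B, so it receives the larger coefficient, which is the "weighted voting" behavior the text describes.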
Although there are many possible classification algorithms available (Bayesian, Support Vector
Machines, Principal Component Analysis, etc. [1] [2] [5]), we have found Linear Discriminant
algorithms to be comparable in power to more sophisticated methods for the problems we have
addressed. Linear Discriminants have the added advantages of simplicity and fewer parameters,
reducing the danger of overfitting.
Digital Biomarkers
Features are extracted from images in our method using very simple (and therefore computationally
inexpensive) region growing methods, in which contiguous pixels are tested for similarity after first
passing simple thresholding tests. Pixels that are similar enough are grouped into clusters [4]. The
resulting pixel clusters are then subject to population-based tests (number of pixels, pixel density).
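A minimal sketch of this kind of region growing is shown below, assuming a grayscale image stored as a list of rows, 4-connected neighbors, and three hypothetical parameters (`max_intensity`, `max_diff`, `min_pixels`). The actual similarity tests and parameter set used in our software are not specified here, so this is an illustration of the general technique only.

```python
from collections import deque

def grow_regions(image, max_intensity, max_diff, min_pixels):
    """Group 4-connected pixels that pass a threshold test and resemble their neighbor.

    Clusters smaller than min_pixels are discarded (a simple population test)."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    clusters = []
    for r in range(rows):
        for c in range(cols):
            # Thresholding test: only dark pixels may seed or join a cluster.
            if seen[r][c] or image[r][c] > max_intensity:
                continue
            seen[r][c] = True
            queue, cluster = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                cluster.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and image[ny][nx] <= max_intensity
                            and abs(image[ny][nx] - image[y][x]) <= max_diff):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(cluster) >= min_pixels:
                clusters.append(cluster)
    return clusters

# Hypothetical 4 x 5 image: one dark blob of four pixels and one isolated dark pixel.
image = [
    [200, 200, 200, 200, 200],
    [200,  40,  45, 200, 200],
    [200,  42,  50, 200,  30],
    [200, 200, 200, 200, 200],
]
clusters = grow_regions(image, max_intensity=100, max_diff=15, min_pixels=3)
```

With these settings the four-pixel blob is kept and the isolated dark pixel is rejected by the population test.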
The selection of features from an image relies on the algorithms described briefly above, which
themselves rely on several adjustable parameters (our own algorithms have as few as four and as many
as fourteen parameters, depending on the degree of specificity desired). Changing these
parameters results in the selection of different features. Population analysis of the feature collections
thus selected (for instance, average size, number, color, etc.) yields numbers that are analogous to
biomarker concentrations in medical diagnostics and can likewise be used to build classifier algorithms.
We therefore refer to these numbers as digital biomarkers. An example of features that may be selected,
and subsequently used for the generation of digital biomarkers, is shown in Figure 1. The image
on the left is an unprocessed image of a section of lymphatic tissue which is negative for cancer. The
dark ovoid features are cell nuclei. Judicious choice of the feature-selection parameters allows most of
these nuclei to be selected, as shown in the identical image to the right, in which the selected features
have been highlighted in various colors. The population statistics of all these selected nuclei are
used as digital biomarkers.
Figure 1. An example of features extracted from a non-lesion histology slide to produce digital biomarkers.
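To make the idea of population statistics concrete, the sketch below reduces a collection of pixel clusters to three numbers. The choice of statistics (count, mean size, mean intensity), the function name and the sample data are assumptions made for illustration; the actual set of digital biomarkers depends on the application.

```python
from statistics import mean

def digital_biomarkers(clusters, image):
    """Summarize a feature collection as a few numbers (hypothetical choice of statistics)."""
    if not clusters:
        return {"count": 0, "mean_size": 0.0, "mean_intensity": 0.0}
    sizes = [len(cluster) for cluster in clusters]
    intensities = [mean(image[y][x] for y, x in cluster) for cluster in clusters]
    return {
        "count": len(clusters),
        "mean_size": mean(sizes),
        "mean_intensity": mean(intensities),
    }

# Hypothetical grayscale image and two clusters of (row, column) pixel coordinates.
image = [
    [10, 20, 200],
    [200, 30, 40],
    [200, 50, 60],
]
clusters = [
    [(0, 0), (0, 1)],
    [(1, 1), (1, 2), (2, 1), (2, 2)],
]
markers = digital_biomarkers(clusters, image)
```

Each value in `markers` plays the role that a measured concentration plays for a conventional biomarker.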
The advantages of digital biomarkers over conventional biomarkers are two-fold. First, conventional
biomarkers suffer from variability due to variations in sampling, preparation, storage, and analysis. Most
of these variations have a component of human variability. Although digital biomarkers have to contend
with differences in cameras, exposure, magnification, dyes, etc., these sources of variability are much
easier to control or adjust for. Digital biomarkers are therefore much less subject to preparation
variability than conventional biomarkers. The second advantage of digital biomarkers is more significant
and compelling. Unlike conventional biomarkers, which are “found” quantities, digital biomarkers are to
a large extent the product of the adjustable parameters used to find them. This means they can be
optimized for the task at hand. Some part of this optimization happens as a matter of course, through
human adjustment of the parameters to pick out obvious features. However, application of basic
optimization techniques (such as the Downhill Simplex method [14]), allows digital biomarkers to be
tuned by computer from a human-selected starting point that shows good performance, to the nearest
local maximum of improved performance.
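The tuning step can be illustrated with a compact, pure-Python version of the Downhill Simplex (Nelder-Mead) method. This is a simplified textbook variant (reflection, expansion, inside contraction, shrink), not the routine from [14]. The performance surface and the two parameter names are hypothetical stand-ins for a real measure of classifier performance. Because the method minimizes, the sketch minimizes the negative of the performance.

```python
def nelder_mead(f, start, step=0.5, iterations=200):
    """Minimize f from a starting point using a basic Downhill Simplex scheme."""
    n = len(start)
    simplex = [list(start)]
    for i in range(n):
        point = list(start)
        point[i] += step
        simplex.append(point)
    for _ in range(iterations):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        # Centroid of every vertex except the worst one.
        centroid = [sum(p[i] for p in simplex[:-1]) / n for i in range(n)]
        reflected = [c + (c - w) for c, w in zip(centroid, worst)]
        if f(reflected) < f(best):
            expanded = [c + 2 * (c - w) for c, w in zip(centroid, worst)]
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(simplex[-2]):
            simplex[-1] = reflected
        else:
            contracted = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:
                # Shrink every vertex halfway toward the best one.
                simplex = [best] + [
                    [b + 0.5 * (p - b) for b, p in zip(best, pt)]
                    for pt in simplex[1:]
                ]
    return min(simplex, key=f)

def negative_performance(params):
    """Hypothetical surface whose best performance is at threshold=0.6, similarity=0.2."""
    threshold, similarity = params
    return (threshold - 0.6) ** 2 + (similarity - 0.2) ** 2 - 1.0

# Human-selected starting point, then computer tuning to the nearest local optimum.
best = nelder_mead(negative_performance, [0.5, 0.3], step=0.1)
```

As in the text, the result depends on the starting point: the search only finds the local optimum nearest to where the human operator began.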
Dual Optimization
The classification algorithms discussed above combine the data from several biomarkers in ways that
optimize the accuracy for diagnostic purposes (medical or otherwise). With conventional biomarkers,
that is the best that can be hoped for. However, the use of digital biomarkers allows for another layer of
optimization, as the biomarkers themselves may be optimized for improved diagnosis. As with the
classifier algorithm as a whole, the individual biomarkers can be optimized using ROC curve analysis [11]
[12] [13]. In actuality, the biomarkers are not optimized individually, but optimized in sets derived from
each feature collection, which are in turn determined by the parameters set for feature determination.
This process is shown in Figure 2 below.
Figure 2. Schematic of process used to develop image classifier algorithms from digital biomarkers.
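As a rough illustration of the ROC analysis used as the optimization target, the sketch below computes the area under the ROC curve directly from positive and negative scores and picks a cut-off. The use of Youden's J (sensitivity + specificity - 1) to choose the cut-off is an assumption made for this example, as are the function names and the scores.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve: chance that a positive outscores a negative (ties count half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

def best_threshold(positive_scores, negative_scores):
    """Cut-off that maximizes sensitivity + specificity - 1 (Youden's J, an assumed criterion)."""
    candidates = sorted(set(positive_scores) | set(negative_scores))

    def youden(t):
        sensitivity = sum(p >= t for p in positive_scores) / len(positive_scores)
        specificity = sum(n < t for n in negative_scores) / len(negative_scores)
        return sensitivity + specificity - 1

    return max(candidates, key=youden)

# Hypothetical classifier scores for known positive and negative samples.
pos_scores = [0.9, 0.8, 0.7, 0.4]
neg_scores = [0.5, 0.3, 0.2]

auc = roc_auc(pos_scores, neg_scores)
cutoff = best_threshold(pos_scores, neg_scores)
```

In a dual optimization loop, a figure of merit such as `auc` would be recomputed each time the feature-selection parameters change, so that the biomarkers and the classifier are tuned against the same target.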
The Classifier Development Process
The software for developing the classifier algorithm combines many functions. First is the development
of the patterns from which the digital biomarkers are derived. This is initially a visual process, as shown
in Figure 1, but the visually determined starting points are then optimized computationally. The digital
biomarkers thus obtained are then moved through a multivariate classifier discovery process, in which
different combinations of the digital biomarkers are explored. The effectiveness of these classifiers is
determined against different positive and negative image classes.
Once the multivariate classifiers have been developed, they are placed into a decision-tree-like structure
in order to separate out all the different classes of image samples. Technically, these are not decision
trees, as node loops are allowed for calculation efficiency, but it is theoretically possible to recombine
these node loops to regain a purely binary branched decision tree. The size of the image samples and
the sampling frequency are chosen, and the large whole-slide image (WSI, typically 2-4 gigabytes) is segmented and
classified. The results are coded into an XML file that can be read by the ASAP [19] software, or
analyzed offline in our own software.
Developing the Software and Algorithms – The CAMELYON16 Grand Challenge
Contiguity sought out a data set that could be used as a component in the development and testing of
our software suite. The data set that was chosen comes from the Camelyon16 Grand Challenge [15].
This data set was used to develop and test the ability to:
1. Analyze very large images (> 1 GB per image)
2. Apply an iterative digital biomarker and classifier optimization process
3. Develop fit-for-purpose classification algorithms including multi-tiered decision tree
approaches
4. Demonstrate that the software can support development of digital biomarkers and
classification algorithms by personnel without deep subject matter expertise.
5. Effectively analyze new images that the software is naïve to and produce classification data
about the image. In this case, the classification data was the location of tumor lesions.
The CAMELYON16 challenge is entitled “ISBI Challenge on Cancer Metastasis Detection in Lymph Node”
and is one of the Grand Challenges in Biomedical Image Analysis put forth by the Consortium for Open
Medical Image Computing[16]. The CAMELYON16 website provides whole-slide images (WSIs) of
sentinel lymph node sections collected from Radboud University Medical Center and the University
Medical Center Utrecht. Both universities and the CAMELYON16 consortium are located in the Netherlands.
The data set includes 160 normal slides and 110 slides with metastases. Each metastasis WSI
is paired with an XML file containing annotation contours delineating the metastasized areas,
prepared by professional pathologists.
Contiguity has registered with CAMELYON16 for the purpose of gaining access to the datasets in order
to develop and introduce new tools and technologies. The deadline for submission passed in April 2016.
New submissions are still being accepted for evaluation purposes (without public posting of results), and
Contiguity intends to submit as resources permit.
Strategy for Meeting the Grand Challenge
The multi-gigabyte size of the images from the Camelyon16 Grand Challenge prohibits the full analysis of
a histology slide in a timely manner. Instead, our method depends on sampling the image along a grid,
pulling out image segments that typically total 3% - 12% of the area of the full image.
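The relationship between tile size, grid spacing and the fraction of the slide that is analyzed can be sketched as follows. Square tiles on a regular square grid are assumed, and the function names and the example dimensions are hypothetical.

```python
def grid_centers(width, height, spacing):
    """Centers of sample tiles on a regular grid, starting half a step in from the edges."""
    half = spacing // 2
    return [(x, y)
            for y in range(half, height, spacing)
            for x in range(half, width, spacing)]

def coverage_fraction(tile_side, spacing):
    """Fraction of the image area covered by square tiles placed every `spacing` pixels."""
    return (tile_side / spacing) ** 2

# Hypothetical example: 200-pixel tiles every 800 pixels cover 6.25% of the area.
fraction = coverage_fraction(200, 800)
centers = grid_centers(2000, 1000, 800)
```

Under these assumptions, 200-pixel tiles spaced roughly 580 to 1150 pixels apart correspond to the 12% to 3% range quoted above.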
Regions representing bare glass or adipose tissue are quickly eliminated with a filter that tests for
grayness, so that image segments with average colors ranging from white to black, or shades of gray in
between, do not undergo time-consuming testing. Many of the remaining image segments may be
classified as negative using a simple and fast color filter, as negative tissue regions are more densely
populated with nuclei that are inconsistent with tumors. The remaining image segments are subject to the full
multivariate classifier treatment, usually utilizing cascading branches of classifiers in a decision tree,
ultimately deciding whether a particular image segment is classified as positive or negative. If an image
segment is found to be positive, the region around it is searched and classified exhaustively, providing
full coverage of that region.
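The ordering of these tests, cheapest first, can be sketched as a short cascade. The grayness tolerance, the color rule and the stand-in classifier below are placeholders invented for the example; only the control flow reflects the strategy described above.

```python
def is_gray(mean_rgb, tolerance=12):
    """True when the channels are nearly equal: white, black, or a shade of gray."""
    return max(mean_rgb) - min(mean_rgb) <= tolerance

def classify_segment(mean_rgb, color_filter, full_classifier):
    """Run the cheap tests first; only the remaining segments reach the full classifier."""
    if is_gray(mean_rgb):
        return "skipped"      # bare glass or adipose tissue
    if color_filter(mean_rgb):
        return "negative"     # rejected by the fast color pre-filter
    return "positive" if full_classifier(mean_rgb) else "negative"

# Placeholder rules, not the real filters: a blue-dominant segment is treated as
# densely nucleated normal tissue, and the "full classifier" is a trivial stand-in.
def color_filter(rgb):
    return rgb[2] > 150

def full_classifier(rgb):
    return rgb[0] > 160

results = [
    classify_segment((240, 240, 238), color_filter, full_classifier),
    classify_segment((120, 80, 170), color_filter, full_classifier),
    classify_segment((180, 90, 140), color_filter, full_classifier),
]
```

Because most segments exit at one of the first two tests, the expensive multivariate classifier runs on only a small share of the sampled tiles.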
Figure 3 shows a section of a tumor-positive histology slide. The regions demarcated with the dark blue
line are those judged by histologists associated with CAMELYON16 to be tumor positive. The light blue
squares indicate regions judged by our algorithm to be tumor negative, via color pre-filter. The squares
do not indicate the actual size of the region analyzed, but rather the center of the region. The actual
region analyzed extends one third of the distance to each neighboring indication square. The red
squares indicate tumor-positive regions as determined by the classifier decision tree, while the yellow
regions indicate tumor-negative regions as determined by the classifier decision tree.
Figure 3. A region of a lymph node histology slide showing a tumor lesion (delineated by a dark blue border). Blue squares
indicate the center points of sampled regions which a color filter has determined to be negative. Red squares indicate the
center points of sampled regions which a digital biomarker classifier has determined to be positive, while yellow squares
indicate regions classified as negative. The larger spacing to the left is indicative of the initial sample spacing, while the tighter
spacing in the lesion region is indicative of the total coverage used when positive samples are found.
The Camelyon16 Grand Challenge employed a two-tier evaluation process. A test set consisting of 130
WSIs (both positive and negative, blinded) was used for the evaluation. The first tier asked merely to
provide the probability that the provided images were positive for metastatic tumor lesions. The second
tier asked for the location of the lesions within each image, along with the confidence that the
determined locations indicated an actual lesion. Due to time and resource constraints, as a preliminary
step, we chose to analyze 8 known tumor WSIs and 11 known normal WSIs, all of which were naïve to
the training process. Images were chosen to cover the spectrum of images found in the training set. In
particular, the number of lesions per image varied from a single lesion to in excess of 40 lesions, with
the size of the lesions varying from 0.01 megapixels to more than 1400 megapixels.
For the purposes of classifying discovered lesions as true or false positives (TP or FP), or true or false
negatives (TN or FN), it was necessary to define what qualified as a valid lesion (therefore TP or FN,
depending on how classified), and what qualified as a positive lesion (TP or FP). Nearly all of the smaller
lesions were satellite lesions to larger lesions – in close proximity or even touching the major lesions. In
most cases, these were clearly artifacts of the “ground truthing” process, and not lesions in their own
right, often not big enough to hold even a single tumor cell, and definitely smaller than the image
sample size. Because all positive WSIs contained at least one lesion larger than 1 megapixel, and
because the diameter of a 1 megapixel lesion is roughly equivalent to the sample spacing we prefer to
use, we set 1 megapixel to be the lower bound size for a valid lesion.
Image samples which classify as false positives are fairly common, and it is desirable to eliminate these.
One method is to count these samples as positives only if they cluster in numbers above a certain
threshold. As shown in Figure 4, raising the threshold for the number of contiguous positive sample points in a cluster
to four allows for perfect image specificity (correctly selecting out all 11 negative images) at some
expense of image sensitivity (only selecting 6 of the 8 positive images correctly). However, perfect
image sensitivity proves to be impossible without resorting to much more sophisticated techniques, as
one of the positive images contains only one valid lesion, which is largely made up of cell-free inclusions
that the algorithm declines to classify (see Figure 5). As seen in Figure 4, high sensitivity for the lesions
is difficult to achieve without denser sampling, because many lesions are small. Better results are
achieved by executing a complete sampling around positive points, as shown in Figure 3.
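The clustering rule can be sketched as a connected-component filter over positive sample points, expressed in grid coordinates (column, row). Treating diagonal neighbors as contiguous (8-connectivity) is an assumption made for this example, as are the function name and the sample points.

```python
def filter_positive_clusters(positive_points, min_cluster_size):
    """Keep positive grid points only if their cluster has at least min_cluster_size points.

    Points are (column, row) grid indices; diagonal neighbors count as contiguous."""
    remaining = set(positive_points)
    kept = set()
    while remaining:
        stack = [remaining.pop()]
        cluster = set(stack)
        while stack:
            col, row = stack.pop()
            for dc in (-1, 0, 1):
                for dr in (-1, 0, 1):
                    neighbor = (col + dc, row + dr)
                    if neighbor in remaining:
                        remaining.remove(neighbor)
                        cluster.add(neighbor)
                        stack.append(neighbor)
        if len(cluster) >= min_cluster_size:
            kept |= cluster
    return kept

# Hypothetical positives: one cluster of four contiguous points and one isolated point.
points = {(0, 0), (1, 0), (1, 1), (2, 1), (7, 7)}
kept = filter_positive_clusters(points, min_cluster_size=4)
```

Raising `min_cluster_size` removes isolated false positives, but it also removes true positives from lesions too small to produce several contiguous hits, which is the trade-off visible in Figure 4.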
Figure 4. Image and lesion sensitivity and specificity versus positive cluster size. Open red squares are image sensitivity. Open
blue squares are image specificity. Closed red circles are lesion sensitivity. Closed blue circles are lesion specificity. The
increased difficulty of detecting positive lesions over positive images is apparent in the data. Setting the cluster threshold to
four allows for perfect image specificity, but lowers the image sensitivity to 6/8.
[Figure 4 plot. Vertical axis: Sensitivity, Specificity (0 to 1.2). Horizontal axis: Minimum Contiguous Points (1 to 10).]
Figure 5. The single valid lesion in the positive WSI Tumor_066 has a large cell-free inclusion at the image sample location,
which the algorithm declines to classify, thereby classifying this WSI as a false negative. Note the false positive sample point
(red) to the left of the lesion. This point is eliminated by setting the clustering threshold greater than one.
A high segment-density scan of WSI Tumor_37 was performed and results are shown in Figures 6 – 10.
Sampling was performed as 200 x 200 square pixel regions spaced every 400 pixels across the entire
image to demonstrate the ability to identify small tumors. Since we are using consumer level processors
and a very high level language, analysis of this particular image required approximately 4 hours of
processing time. This processing time can be reduced by a factor of 100 or more by employing a
number of readily available technologies that are outside the scope and budget of this immediate study.
Figure 6. WSI Tumor_37 with supplied tumor annotations (dark blue) and rectangular annotations showing the regions magnified
in Figure 8 (yellow), Figure 9 (green) and Figure 10 (light blue). Image file size is 2.6 GB. Included Image Scale Bar equals 5mm.
Figure 7. WSI Tumor_37 with supplied tumor annotations (dark blue and difficult to see) and segments analyzed by Contiguity
software. Light blue segments represent regions screened out as highly unlikely to contain tumor cells, yellow
segments are fully analyzed regions classified as normal, and red segments represent likely tumor regions. Result overview:
total positive points 868; total negative points 11732; true positives 145; false positives 723;
true negatives 11670; false negatives 62. Included Image Scale Bar equals 5mm.
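The point counts reported for Figure 7 can be converted into per-point sensitivity, specificity and precision with the standard definitions. The short calculation below uses only the numbers given in the caption; the variable names are illustrative.

```python
# Sample-point counts reported for WSI Tumor_37 (Figure 7).
tp, fp, tn, fn = 145, 723, 11670, 62

sensitivity = tp / (tp + fn)   # share of tumor points that were flagged
specificity = tn / (tn + fp)   # share of normal points that were cleared
precision = tp / (tp + fp)     # share of flagged points that are tumor

print(f"sensitivity={sensitivity:.3f} "
      f"specificity={specificity:.3f} "
      f"precision={precision:.3f}")
```

These are per-sample-point figures for a single slide, so they are not directly comparable to the image-level and lesion-level values plotted in Figure 4.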
Figure 8. Region defined as yellow in Figure 6 showing recognition of multiple tumor regions by red squares and non-tumor
regions by yellow and light blue squares. Included Image Scale Bar equals 1mm.
Figure 9. Region defined as green in Figure 6 showing recognition of multiple tumor regions including very small tumors by red
squares and non-tumor regions by yellow and light blue squares. Included Image Scale Bar equals 1mm.
Figure 10. A full slide image showing all true positives and false positives. Certain types of lymph node medullary tissue are
problematic at the microscale. Contiguity is confident that this tissue can be accurately classified in the future by introducing
and optimizing additional digital biomarkers, by building classifiers against smaller image segments to ensure a higher degree of
training image homogeneity, and by introducing additional classification levels to the decision trees that use more macroscale
features of an image, which tend to differ considerably from those of tumor-containing regions. Included Image Scale Bar equals
4mm.
Conclusions
Contiguity undertook this challenge as part of an image analysis software development process. We
have made significant progress developing a suite of software tools that can be used to classify and
analyze a very broad range of images. Early results are promising; there is more work to do in order to
provide an application that can fully meet the goals of the Camelyon16 Grand Challenge.
Contiguity devoted 8 weeks to this process. This included writing and testing the software, developing
algorithms, testing and optimization. We are continuing to develop tools that will improve and
accelerate this classification process. Developing algorithms that demonstrate both high sensitivity and
high specificity appears possible based on current data and rate of progress. What is clear is that this is
a difficult challenge due to the high degree of heterogeneity demonstrated across images from this
challenge and that Digital Biomarkers can be employed to identify and classify specific tissue types
across very large and heterogeneous images.
Future work will include:
1. Continuing to optimize broader pattern-finding techniques, which will enable us to produce more
effective classifiers.
2. Improving the speed of analysis by:
a. Enabling the software to utilize 64-bit processors and graphics cards,
b. Modifying the architecture to take advantage of multiple processors,
c. Converting the computationally intensive processes to assembly language or similar.
3. Using the improved speed to:
a. Improve sensitivity to small tumors by simply analyzing more of the image,
b. Set the software on a path towards being an effective tool for use in large studies or
in histology labs.
4. Integrating all elements of the software suite to:
a. Enable rapid optimization and modification of decision trees,
b. Enable a rapid recursive process whereby false positives and negatives are identified
and used to enhance existing classifiers as well as add new decision tree branches.
Applications
Medicine and Life Science
Although this white paper has focused on medical imaging, specifically on the diagnosis of tumor
metastasis in lymph nodes, these same methods are applicable across a wide
range of fields. Almost any sort of microscopy, laboratory imaging analysis or other medical image-based
tool (MRI, CAT scan and X-ray, to name a few) could benefit from these techniques. For instance:
• Discovering situation-specific biomarkers, with custom biomarker identification and deployment in support of
clinical trials.
• Developing more effective and efficient sorting and staging algorithms.
• Discovering and deploying new biomarkers and biomarker relationships with multi-index classifiers to achieve
significant improvements in sensitivity or specificity.
• Offering a new high-value service when clients request a new analytical capability.
• Deep analysis of therapeutic or procedural effects.
• Applications that enable quick but well-characterized identification and verification of biomarker-based
disease classifiers.
Other Industries
Many quality assurance tools already make extensive use of imaging. New tools could be developed, or
old tools extended, for:
• Monitoring material properties.
• Weld joint analysis.
• Automated checks of in-process manufacturing, packaging, and material acceptance.
• Increasing the capabilities of SPC by discovering and employing non-obvious relationships and metrics
available in existing processes.
• Automated image and data inspection.
• Chasing down indicators for manufacturing defects.
Methods and Acknowledgements
TIFF (Tagged Image File Format [17]) images are normally limited to 4 GB. Although all the histology
images we analyzed were in the 1-3 GB range, the Consortium for Open Medical Image Computing
has shown foresight by curating all images in the BigTIFF format[18], which in theory allows image sizes
of up to 16 million terabytes. BigTIFF files cannot be opened by most commercial imaging software.
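The two size limits follow from the width of the file offsets: classic TIFF uses 32-bit offsets and BigTIFF uses 64-bit offsets. The short calculation below works this out, using binary units (GiB, TiB) as an assumption about how the quoted figures were rounded.

```python
GIB = 2 ** 30   # bytes in one gibibyte
TIB = 2 ** 40   # bytes in one tebibyte

classic_tiff_limit = 2 ** 32   # bytes addressable with 32-bit offsets
bigtiff_limit = 2 ** 64        # bytes addressable with 64-bit offsets

classic_in_gib = classic_tiff_limit // GIB
bigtiff_in_tib = bigtiff_limit // TIB

print(classic_in_gib, "GiB for classic TIFF")
print(bigtiff_in_tib, "TiB for BigTIFF")
```

The second figure is 2 to the power 24 tebibytes, a little under 16.8 million, which matches the "16 million terabytes" quoted above.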
Automated Slide Analysis Platform (ASAP) [19] is an open source platform for visualizing, annotating and
automatically analyzing whole-slide histopathology images. ASAP is built on top of several well-developed
open source packages like OpenSlide, Qt and OpenCV[19]. We have relied on ASAP for
viewing and annotating the Camelyon16 images, and TIFFFASTCROP[20] for segmenting these original
images into JPEG images of a size we could analyze (typically 200-500 pixels square). TIFFFASTCROP was
developed by the modelling team of the IMNC laboratory near Paris, France.
The image analysis software was written in Microsoft’s C# language using Visual Studio 2013. The
classifier software was written in Java using Eclipse Europa. Software was run on either a Lenovo
ThinkPad T430 having an Intel® Core™ CPU running at 2.50 GHz, 8 GB RAM, running Windows 7
Professional, or a laptop computer having an Intel® 6th Generation Core™ CPU running at 2.50 GHz, 16
GB RAM, with an NVIDIA GeForce 940M video card, running Windows 10.
Contacting Contiguity
Web: www.contiguitydx.com
Phone: (888) 913-3915
Email: steve.tyrrell@contiguitydx.com
References
[1] B. G. Batchelor, Machine Vision Handbook, Springer, 2012.
[2] R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification, John Wiley & Sons, 2001.
[3] Vision Systems, "Machine Vision: The past, the present, and the future.," [Online]. Available:
http://www.vision-systems.com. [Accessed 22 July 2016].
[4] C. Steger, M. Ulrich and C. Wiedermann, Machine Vision Algorithms and Applications, Wiley VCH,
2007.
[5] S. Marsland, Machine Learning. An Algorithmic Perspective, Boca Raton, FL: CRC Press, 2009.
[6] E. Boja, T. Hiltke, R. Rivers, C. Kinsinger, A. Rahbar, M. Mesri and H. Rodriguez, "Evolution of Clinical
Proteomics and its Role in Medicine," Journal of Proteome Research, 2010.
[7] M. S. Pepe, Z. Feng, H. Janes, P. M. Bossuyt and J. D. Potter, "Pivotal Evaluation of the Accuracy of
a Biomarker Used for Classification or Prediction: Standards for Study Design," J Natl Cancer Inst,
vol. 100, pp. 1432-1438, 2008.
[8] R. Ostroff, W. Bigbee, W. Franklin, L. Gold, M. Mehan, Y. Miller, H. Pass, W. Rom, J. Siegfried, A.
Stewart, J. Walker, J. Weissfeld, S. Williams, D. Zichi and E. Brody, "Unlocking Biomarker Discovery:
Large Scale Application of Aptamer Proteomic Technology for Early Detection of Lung Cancer," PLoS
ONE, vol. 5, no. 12, p. e15003, 2010.
[9] R. Simon, "Development and Validation of Biomarker Classifiers for Treatment Selection," J Stat
Plan Inference, vol. 138, no. 2, pp. 308-320, 2008.
[10] R. E. Larson, R. P. Hostetler and B. H. Edwards, Multivariable Calculus, Lexington MA: D.C. Heath
and Company, 1994.
[11] T. A. Lasko, J. G. Bhagwat, K. H. Zou and L. Ohno-Machado, "The use of receiver operating
characteristic curves in biomedical informatics," Journal of Biomedical Informatics, vol. 38, pp. 404-415,
2005.
[12] T. Fawcett, "ROC Graphs: Notes and Practical Considerations for Researchers," Palo Alto, 2004.
[13] M. H. Zweig and G. Campbell, "Receiver-Operating Characteristic (ROC) Plots: A Fundamental
Evaluation Tool in Clinical Medicine," Clinical Chemistry, vol. 39, no. 4, pp. 561-577, 1993.
[14] W. H. Press, B. P. Flannery, S. A. Teukolsky and W. T. Vetterling, Numerical Recipes (Fortran),
Cambridge University Press, 1990.
[15] "ISBI challenge on metastasis detection in lymph node," 2016. [Online]. Available:
http://camelyon16.grand-challenge.org. [Accessed 22 July 2016].
[16] "Grand Challenges in Biomedical Image Analysis," 2016. [Online]. Available:
http://grand-challenge.org/Home. [Accessed 22 July 2016].
[17] "TIFF, Revision 6.0," 3 June 1992. [Online]. Available:
https://partners.adobe.com/public/developer/en/tiff/TIFF6.pdf. [Accessed 22 July 2016].
[18] "The BigTIFF File Format Proposal," [Online]. Available:
http://www.awaresystems.be/imaging/tiff/bigtiff.html#structures. [Accessed 22 July 2016].
[19] "Automated Slide Analysis Platform," [Online]. Available: https://github.com/GeerLitjens/ASAP.
[Accessed 22 July 2016].
[20] C. DeRoulers, "TIFFFASTCROP," 19 February 2016. [Online]. Available:
http://www.imnc.in2p3.fr/pagesperso/deroulers/software/largetifftools/tifffastcrop.html.
[Accessed 22 July 2016].
[21] A. M. Molinaro, R. Simon and R. M. Pfeiffer, "Prediction error estimation: a comparison of
resampling methods," Bioinformatics, vol. 21, no. 15, pp. 3301-3307, 2005.
[22] M. D. Radmacher, L. M. McShane and R. Simon, "A Paradigm for Class Prediction Using Gene
Expression Profiles," Journal of Computational Biology, vol. 9, no. 3, pp. 505-511, 2002.

Introduction to Digital Biomarkers V1.0

  • 1. Contiguity, LLC www.contiguitydx.com Introduction to Digital Biomarkers A Technology White Paper V. 1.0 August 10, 2016 Image Analysis and Classification Transforming image based data sources into powerful predictive tools.
  • 2. INTRODUCTION TO DIGITAL BIOMARKERS 2 Contents Introduction ..................................................................................................................................................3 Classification Algorithms...............................................................................................................................4 Digital Biomarkers.........................................................................................................................................5 Dual Optimization.........................................................................................................................................6 The Classifier Development Process.............................................................................................................7 Developing the Software and Algorithms – The CAMELYON16 Grand Challenge........................................7 Strategy for Meeting the Grand Challenge...................................................................................................8 Conclusions .................................................................................................................................................16 Applications.................................................................................................................................................17 Medicine and Life Science.......................................................................................................................17 Other Industries......................................................................................................................................17 Methods and Acknowledgements..............................................................................................................18 Contacting Contiguity .................................................................................................................................18 
References ..................................................................................................................................................19
  • 3. INTRODUCTION TO DIGITAL BIOMARKERS 3 Introduction Contiguity has set out to develop a software suite to digitally identify unique features in image based data, to use these features “Digital Biomarkers” to assemble multivariate classification algorithms and to deploy these classification algorithms in an application that can analyze new images and “classify” them with little to no user intervention. Key software features include: 1. Analysis of very large images (> 1Gb per image, i.e. BigTIFF files) 2. Allow for an iterative Digital Biomarker and classifier optimization process 3. Rapid development of classification algorithms that are fit-to-purpose, including multi-tiered decision tree approaches 4. No requirements for deep, image specific subject matter expertise to successfully identify and employ effective digital biomarkers, and 5. Customizable information output “Image analysis” is a broad term which includes both the highly specific tasks of machine vision[1] and the more general methods of pattern recognition[2]. Over the past decade, the continuing improvement in processor speed combined with the effectively infinite storage capacity of the cloud has not only made image analysis ubiquitous in consumer applications, but also economically feasible across a wide range of business and industrial applications[3] [4]. Nearly everyone today carries a computer in their pocket capable of facial detection, speech recognition, and high speed data transfer (not to mention phone calls), and terabytes of image data are added to the public and private domains on an hourly basis. The inescapable conclusion is that image analysis will soon be one of the most significant expenditures of computational power in the world, if it is not already. Many current approaches to computer based image analysis largely seek to utilize the same features and same relationships that humans use when analyzing an image [2]. 
While this computer assisted process can provide faster and more consistent results compared to the human-only approach, expert opinion spans the spectrum on whether it will ever match the performance of the human observer, let alone exceed it. The only consensus is that such an achievement will be hard. Another approach that has rapidly gained prominence involves so-called deep learning[5] [3], usually involving neural networks and other higher-level abstractions of data. The most familiar application of deep learning is found in the speech recognition ability of modern cell phones, and the same techniques are being applied to image analysis. Such techniques require very large data sets and the implementation of “black box” algorithms, often requiring computations that are offset in the cloud. We focus our approach on extracting features that are obvious to the human eye, as well as usually non- obvious features found by computer search for local optima. We utilize some deep learning concepts by integrating the discovery of these features with a recursive, broad scope approach to analytics and multivariate classifier development. This approach is intended both to match current computer or
  • 4. INTRODUCTION TO DIGITAL BIOMARKERS 4 human vision based analysis and to enable new capabilities. Our approach takes full advantage of what computers are useful for: quantifying, calculating relationships, and considering massive data combinations in order to meet predefined performance criteria. Classification Algorithms Although the discussion below is applicable across a wide range of fields, for clarity it will be restricted to terms commonly used in the biomedical field. Biomarkers are any biological quantity that can be measured and therefore assigned a numerical value[6] [7] [8]. Although biomarkers are most commonly measured from biological fluids, they can also be such quantities as temperature, blood pressure or body mass index (BMI). The mathematical manipulation of biomarkers to find a diagnosis is called a classification algorithm[2] [8] [9]. Conventionally, a single biomarker has been used for diagnosis, for example, the concentration of PSA in blood to diagnose prostate cancer. PSA concentrations above a certain threshold indicate some likelihood for cancer. Clearly, using more than one biomarker relevant to a particular disease should give better results, but how to combine the measurements? The simplest way to combine biomarkers would be a “voting” algorithm, so that if two out of three biomarkers indicated positive for disease, for instance, and one indicated negative, the combined result would be positive. But this method fails to take into account the relative strength of the different biomarkers. If the negative biomarker was considered a much more reliable indicator than the two positive biomarkers, the combined result should perhaps be negative, not positive. 
A linear combination of the biomarker concentrations is the simplest way to take the relative significance into account[10]: L = a A + b B + c C + … Where L is the combined score, A, B, and C are the various biomarker concentrations or levels, and a, b, and c are coefficients which may be positive or negative. When L is above a predetermined level (or within a predetermined range), the result is positive, when L is below, negative. In effect, this method amounts to a weighted voting algorithm. When the coefficients are determined by a specific combination of the biomarker averages and standard deviations across the positive and negative disease states, the above equation is known as a Fisher Linear Determinant [2]. In practice, the Fisher coefficients are often used as a starting point to find a more optimal determinant, often using ROC curve analysis[11] [12] [13]. Although there are many possible classification algorithms available, (Bayesian, Support Vector Machines, Primary Component Analysis, etc. [1] [2] [5]), we have found the Linear Determinant algorithms to be comparably as powerful as more sophisticated methods for the problems we have addressed. Linear Determinants have the added advantages of simplicity and fewer parameters, reducing the dangers of over fitting.
  • 5. INTRODUCTION TO DIGITAL BIOMARKERS 5 Digital Biomarkers Features are extracted from images in our method using very simple (and therefore computationally non-intensive) region growing methods, in which contiguous pixels are tested for similarity, after first passing simple thresholding tests. Pixels that are similar enough are grouped into clusters [4]. The resulting pixel clusters) are then subject to population based tests (number of pixels, pixel density). The selection of features from an image relies on the algorithms described briefly above, which themselves rely on several adjustable parameters (our own algorithms have as few as four and as many as fourteen parameters, depending on the degree of specificity desired). The changing of these parameters results in the selection of different features. Population analysis of the feature collections thus selected (for instance, average size, number, color, etc.) yields numbers that are analogous to biomarker concentrations in medical diagnostics, and can likewise be used to build classifier algorithms, and which we therefore refer to as digital biomarkers. An example of features that may be selected, and subsequently used for the generation of digital biomarkers, is shown in Figure 1 below. The image on the left is an unprocessed image of a section of lymphatic tissue which is negative for cancer. The dark ovoid features are cell nuclei. Judicious choice of the feature-selection parameters allow most of these nuclei to be selected, as shown in the identical image to the right, in which the selected features have been highlighted in various colors. The population statistics of all these selected nuclei are what are used as digital biomarkers. Figure 1. An example of features extracted from a non-lesion histology slide to produce digital biomarkers. The advantages of digital biomarkers over conventional biomarkers are two-fold. 
First, conventional biomarkers suffer from variability due to variations in sampling, preparation, storage, and analysis. Most of these variations have a component of human variability. Although digital biomarkers have to contend with differences in cameras, exposure, magnification, dyes, etc., these sources of variability are much easier to control or adjust for. Digital biomarkers are therefore much less subject to preparation variability than conventional biomarkers. The second advantage of digital biomarkers is more significant and compelling. Unlike conventional biomarkers, which are “found” quantities, digital biomarkers are to a large extent the product of the adjustable parameters used to find them. This means they can be
  • 6. INTRODUCTION TO DIGITAL BIOMARKERS 6 optimized for the task at hand. Some part of this optimization happens as a matter of course, through human adjustment of the parameters to pick out obvious features. However, application of basic optimization techniques (such as the Downhill Simplex method [14]), allows digital biomarkers to be tuned by computer from a human-selected starting point that shows good performance, to the nearest local maximum of improved performance. Dual Optimization The classification algorithms discussed above combine the data from several biomarkers in ways that optimize the accuracy for diagnostic purposes (medical or otherwise). With conventional biomarkers, that is the best that can be hoped for. However, the use of digital biomarkers allows for another layer of optimization, as the biomarkers themselves may be optimized for improved diagnosis. As with the classifier algorithm as a whole, the individual biomarkers can be optimized using ROC curve analysis [11] [12] [13]. In actuality, the biomarkers are not optimized individually, but optimized in sets derived from each feature collection, which are in turn determined by the parameters set for feature determination. This process is shown in Figure 2 below. Figure 2. Schematic of process used to develop image classifier algorithms from digital biomarkers.
  • 7. The Classifier Development Process

The software for developing the classifier algorithm combines many functions. First is the development of the patterns from which the digital biomarkers are derived. This is initially a visual process, as shown in Figure 1, but the visually determined starting points are then optimized computationally. The digital biomarkers thus obtained are then moved through a multivariate classifier discovery process, in which different combinations of the digital biomarkers are explored. The effectiveness of these classifiers is determined against different positive and negative image classes. Once the multivariate classifiers have been developed, they are placed into a decision-tree-like structure in order to separate out all the different classes of image samples. Technically, these are not decision trees, as node loops are allowed for calculation efficiency, but it is theoretically possible to recombine these node loops to regain a purely binary branched decision tree. The size of the image samples and the sampling frequency are chosen, and the large WSI (typically 2-4 gigabytes) is segmented and classified. The results are coded into an XML file that can be read by the ASAP [19] software, or analyzed offline in our own software.

Developing the Software and Algorithms – The CAMELYON16 Grand Challenge

Contiguity sought out a data set that could be used as a component in the development and testing of our software suite. The data set that was chosen comes from the Camelyon16 Grand Challenge [15]. This data set was used to develop and test the ability to:
1. Analyze very large images (> 1 GB per image)
2. Apply an iterative digital biomarker and classifier optimization process
3. Develop fit-for-purpose classification algorithms, including multi-tiered decision tree approaches
4. Demonstrate that the software can support development of digital biomarkers and classification algorithms by personnel without deep subject matter expertise
5. Effectively analyze new images that the software is naïve to and produce classification data about the image. In this case, the classification data was the location of tumor lesions.

The CAMELYON16 challenge is entitled “ISBI Challenge on Cancer Metastasis Detection in Lymph Node” and is one of the Grand Challenges in Biomedical Image Analysis put forth by the Consortium for Open Medical Image Computing [16]. The CAMELYON16 website provides whole-slide images (WSIs) of sentinel lymph node sections collected from Radboud University Medical Center and the University Medical Center Utrecht. Both universities and the CAMELYON16 consortium are located in the Netherlands. The data set included 160 normal slides and 110 slides with metastases. Each metastasis-positive WSI was paired with an XML file containing annotation contours delineating the metastasized areas, prepared by professional pathologists.
  • 8. Contiguity has registered with CAMELYON16 for the purpose of gaining access to the datasets in order to develop and introduce new tools and technologies. The deadline for submission passed in April 2016. New submissions are still being accepted for evaluation purposes (without public posting of results), and Contiguity intends to submit as resources permit.

Strategy for Meeting the Grand Challenge

The multi-gigabyte size of the images from the Camelyon16 Grand Challenge prohibits the full analysis of a histology slide in a timely manner. Instead, our method depends on sampling the image along a grid, pulling out image segments that typically total 3% to 12% of the area of the full image. Regions representing bare glass or adipose tissue are quickly eliminated with a filter that tests for grayness, so that image segments with average colors ranging from white to black, or shades of gray in between, do not undergo time-consuming testing. Many of the remaining image segments may be classified as negative using a simple and fast color filter, as negative tissue regions are more densely populated with nuclei inconsistent with tumors. The remaining image segments are subject to the full multivariate classifier treatment, usually utilizing cascading branches of classifiers in a decision tree, which ultimately decides whether a particular image segment is positive or negative. If an image segment is found to be positive, the region around it is searched and classified exhaustively, providing full coverage of that region.

Figure 3 shows a section of a tumor-positive histology slide. The regions demarcated with the dark blue line are those judged by histologists associated with CAMELYON16 to be tumor positive. The light blue squares indicate regions judged by our algorithm to be tumor negative via the color pre-filter. The squares do not indicate the actual size of the region analyzed, but rather the center of the region. The actual region analyzed extends one third of the distance to each neighboring indication square. The red squares indicate tumor-positive regions as determined by the classifier decision tree, while the yellow squares indicate tumor-negative regions as determined by the classifier decision tree.
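The sampling and pre-filtering steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the software used in the study: the function names, the grayness tolerance of 12 intensity levels, and the stand-in color rule passed to `triage` are all hypothetical, since the paper does not give the actual thresholds or filter definitions.

```python
def grid_centers(width, height, spacing):
    """Centers of sample windows on a regular grid, half a spacing in from each edge."""
    return [(x, y)
            for y in range(spacing // 2, height, spacing)
            for x in range(spacing // 2, width, spacing)]


def sampled_fraction(window, spacing):
    """Fraction of the image area covered by square windows on a square grid."""
    return (window / spacing) ** 2


def is_gray(mean_rgb, tolerance=12):
    """True when the average color lies near the white-gray-black axis.

    The tolerance is an assumed value; glass and fat are expected to have
    nearly equal red, green, and blue averages.
    """
    return max(mean_rgb) - min(mean_rgb) <= tolerance


def triage(mean_rgb, color_filter_negative):
    """Route one image segment through the cheap tests before the full classifier."""
    if is_gray(mean_rgb):
        return "skipped (gray)"
    if color_filter_negative(mean_rgb):
        return "negative (color filter)"
    return "full classifier"


# Hypothetical stand-in for the fast color filter: strongly blue segments are
# treated as negative. The real rule is not specified in the text.
def blue_dominant(rgb):
    return rgb[2] - rgb[0] > 40


centers = grid_centers(1000, 600, 400)
decisions = [triage(rgb, blue_dominant)
             for rgb in [(240, 238, 241), (110, 100, 170), (180, 90, 150)]]
```

With square windows on a square grid, the sampled area is simply the squared ratio of window size to spacing, so a 200-pixel window at a 1000-pixel spacing covers 4% of the image, which falls inside the 3% to 12% range quoted above.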
  • 9. Figure 3. A region of a lymph node histology slide showing a tumor lesion (delineated by a dark blue border). Blue squares indicate the center points of sampled regions which a color filter has determined to be negative. Red squares indicate the center points of sampled regions which a digital biomarker classifier has determined to be positive, while yellow squares indicate regions classified as negative. The larger spacing to the left is indicative of the initial sample spacing, while the tighter spacing in the lesion region is indicative of the total coverage used when positive samples are found.

The Camelyon16 Grand Challenge employed a two-tier evaluation process. A test set consisting of 130 WSIs (both positive and negative, blinded) was used for the evaluation. The first tier asked merely for the probability that each provided image was positive for metastatic tumor lesions. The second tier asked for the location of the lesions within each image, along with the confidence that the determined locations indicated an actual lesion. Due to time and resource constraints, as a preliminary step, we chose to analyze 8 known tumor WSIs and 11 known normal WSIs, all of which were naïve to the training process. Images were chosen to cover the spectrum of images found in the training set. In particular, the number of lesions per image varied from a single lesion to in excess of 40 lesions, with the size of the lesions varying from 0.01 megapixels to more than 1400 megapixels. For the purposes of classifying discovered lesions as true or false positives (TP or FP), or true or false negatives (TN or FN), it was necessary to define what qualified as a valid lesion (therefore TP or FN,
  • 10. depending on how classified), and what qualified as a positive lesion (TP or FP). Nearly all of the smaller lesions were satellite lesions to larger lesions, in close proximity to or even touching the major lesions. In most cases, these were clearly artifacts of the “ground truthing” process, and not lesions in their own right, often not big enough to hold even a single tumor cell, and definitely smaller than the image sample size. Because all positive WSIs contained at least one lesion larger than 1 megapixel, and because the diameter of a 1 megapixel lesion is roughly equivalent to the sample spacing we prefer to use, we set 1 megapixel to be the lower bound size for a valid lesion.

Image samples which classify as false positives are fairly common, and it is desirable to eliminate these. One method is to count these samples as positives only if they cluster in numbers above a certain threshold. As shown in Figure 4, raising the threshold for the number of contiguous positive sample points in a cluster to four allows for perfect image specificity (correctly selecting out all 11 negative images) at some expense of image sensitivity (only selecting 6 of the 8 positive images correctly). However, perfect image sensitivity proves to be impossible without resorting to much more sophisticated techniques, as one of the positive images contains only one valid lesion, which is largely made up of cell-free inclusions that the algorithm declines to classify (see Figure 5). As seen in Figure 4, high sensitivity for the lesions is difficult to achieve without denser sampling, because many lesions are small. Better results are achieved by executing a complete sampling around positive points, as shown in Figure 3.

Figure 4. Image and lesion sensitivity and specificity versus positive cluster size. Open red squares are image sensitivity. Open blue squares are image specificity. Closed red circles are lesion sensitivity. Closed blue circles are lesion specificity. The increased difficulty of detecting positive lesions over positive images is apparent in the data. Setting the cluster threshold to four allows for perfect image specificity, but lowers the image sensitivity to 6/8.

[Plot: sensitivity and specificity (vertical axis, 0 to 1.2) versus minimum contiguous points (horizontal axis, 1 to 10).]
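The cluster-size threshold discussed above can be sketched as a connected-component count over the positive sample points. This is a minimal illustration, not the code used in the study: it assumes positive samples are identified by integer grid indices (column, row) and that diagonal neighbors count as contiguous, neither of which the paper specifies. The names `clusters` and `image_is_positive` are hypothetical.

```python
def clusters(points):
    """Group grid points into 8-connected clusters using an iterative flood fill."""
    remaining = set(points)
    found = []
    while remaining:
        stack = [remaining.pop()]
        cluster = []
        while stack:
            cx, cy = stack.pop()
            cluster.append((cx, cy))
            # Visit the eight neighbors (the center itself was already removed).
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    neighbor = (cx + dx, cy + dy)
                    if neighbor in remaining:
                        remaining.remove(neighbor)
                        stack.append(neighbor)
        found.append(cluster)
    return found


def image_is_positive(points, min_cluster_size):
    """Call an image positive only if some cluster reaches the size threshold."""
    return any(len(c) >= min_cluster_size for c in clusters(points))


# Four touching positive samples plus one isolated (likely false) positive.
positives = [(0, 0), (1, 0), (1, 1), (2, 2), (10, 10)]
sizes = sorted(len(c) for c in clusters(positives))
```

Raising `min_cluster_size` discards isolated positives first, which is why specificity improves before sensitivity starts to suffer; an image whose only lesion yields fewer contiguous positives than the threshold is then missed.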
  • 11. Figure 5. The single valid lesion in the positive WSI Tumor_066 has a large cell-free inclusion at the image sample location, which the algorithm declines to classify, so this WSI is scored as a false negative. Note the false positive sample point (red) to the left of the lesion. This point is eliminated by setting the clustering threshold greater than one.

A high segment-density scan of WSI Tumor_37 was performed, and the results are shown in Figures 6 through 10. Sampling was performed as 200 x 200 pixel square regions spaced every 400 pixels across the entire image to demonstrate the ability to identify small tumors. Since we are using consumer-level processors and a very high-level language, analysis of this particular image required approximately 4 hours of processing time. This processing time can be reduced by a factor of 100 or more by employing a number of readily available technologies that are outside the scope and budget of this immediate study.
  • 12. Figure 6. WSI Tumor_37 with supplied tumor annotations (dark blue) and rectangular annotations showing the regions magnified in Figure 8 (yellow), Figure 9 (green), and Figure 10 (light blue). Image file size is 2.6 GB. Included image scale bar equals 5 mm.
  • 13. Figure 7. WSI Tumor_37 with supplied tumor annotations (dark blue and difficult to see) and segments analyzed by Contiguity software. The light blue segments represent regions screened out as highly unlikely to contain tumor cells, the yellow segments are fully analyzed regions classified as normal, and the red segments represent likely tumor regions. Result overview: total positive points, 868; total negative points, 11732; true positives, 145; false positives, 723; true negatives, 11670; false negatives, 62. Included image scale bar equals 5 mm.
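The counts reported in the Figure 7 caption can be turned into the usual summary rates with a few lines of arithmetic. The sketch below uses only the four counts given in the caption; the function name `rates` is an assumption, and the derived values are per sample point for this one slide, not the image-level or lesion-level figures plotted in Figure 4.

```python
def rates(tp, fp, tn, fn):
    """Sensitivity, specificity, and positive predictive value from raw counts."""
    return {
        "sensitivity": tp / (tp + fn),   # share of tumor points that were flagged
        "specificity": tn / (tn + fp),   # share of normal points that were cleared
        "ppv": tp / (tp + fp),           # share of flagged points that were tumor
    }


# Counts from the Figure 7 caption for WSI Tumor_37.
tp, fp, tn, fn = 145, 723, 11670, 62
summary = rates(tp, fp, tn, fn)
# sensitivity is about 0.700, specificity about 0.942, ppv about 0.167
```

The counts are internally consistent (145 + 723 = 868 positive points and 11670 + 62 = 11732 negative points). The low positive predictive value reflects the many false positives visible in the slide, which the text attributes largely to lymph node medullary tissue.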
  • 14. Figure 8. Region defined as yellow in Figure 6, showing recognition of multiple tumor regions by red squares and non-tumor regions by yellow and light blue squares. Included image scale bar equals 1 mm.
  • 15. Figure 9. Region defined as green in Figure 6, showing recognition of multiple tumor regions, including very small tumors, by red squares and non-tumor regions by yellow and light blue squares. Included image scale bar equals 1 mm.
  • 16. Figure 10. A full slide image showing all true positives and false positives. Certain types of lymph node medullary tissue are problematic at the microscale. Contiguity is confident that this tissue can be accurately classified in the future by three measures: introducing and optimizing additional digital biomarkers; building classifiers against smaller image segments to ensure a higher degree of training image homogeneity; and adding classification levels to the decision trees that use more macroscale features of an image, which tend to differ considerably from those of tumor-containing regions. Included image scale bar equals 4 mm.

Conclusions

Contiguity undertook this challenge as part of an image analysis software development process. We have made significant progress developing a suite of software tools that can be used to classify and analyze a very broad range of images. Early results are promising, but there is more work to do in order to provide an application that can fully meet the goals of the Camelyon16 Grand Challenge. Contiguity devoted 8 weeks to this process. This included writing and testing the software, developing algorithms, testing and optimization. We are continuing to develop tools that will improve and accelerate this classification process. Developing algorithms that demonstrate both high sensitivity and high specificity appears possible based on current data and rate of progress. Two things are clear: this is a difficult challenge because of the high degree of heterogeneity across its images, and Digital Biomarkers can be employed to identify and classify specific tissue types across very large and heterogeneous images.
  • 17. Future work will include:
1. Continue optimizing broader pattern-finding techniques, which will enable us to produce more effective classifiers.
2. Improve the speed of analysis by:
   a. Enabling the software to utilize 64-bit processors and graphics cards
   b. Modifying the architecture to take advantage of multiple processors
   c. Converting the computationally intensive processes to assembly language or similar.
3. Improving speed will:
   a. Improve sensitivity to small tumors by simply analyzing more of the image
   b. Set the software on a path towards being an effective tool for use in large studies or in histology labs.
4. Integrate all elements of the software suite to:
   a. Enable rapid optimization and modification of decision trees
   b. Enable a rapid recursive process whereby false positives and negatives are identified and used to enhance existing classifiers as well as add new decision tree branches.

Applications

Medicine and Life Science

Although this white paper has focused on medical imaging, specifically on the diagnosis of tumor metastasis in lymph nodes, it should be clear that these same methods have applicability across a wide range of fields. Almost any sort of microscopy, laboratory imaging analysis, or other medical image-based tool, such as MRI, CAT scan, or X-ray, could benefit from these techniques. For instance:
• Discover situation-specific biomarkers, with custom biomarker identification and deployment in support of clinical trials.
• Develop more effective and efficient sorting and staging algorithms.
• Discover and deploy new biomarkers and biomarker relationships with multi-index classifiers to achieve significant improvements in sensitivity or specificity.
• If your clients are requesting a new analytical capability, our approaches may enable you to offer a new high-value service.
• Deep analysis of therapeutic or procedural effects.
• Applications that enable quick but well-characterized biomarker-based disease classifier identification and verification.

Other Industries

Many quality assurance tools already make extensive use of imaging. New tools could be developed, or old tools extended, for:
• Monitoring material properties.
• Weld joint analysis.
• Automated checks of in-process manufacturing, packaging, and material acceptance.
  • 18.
• Increasing the capabilities of SPC by discovering and employing non-obvious relationships and metrics available in existing processes.
• Automated image and data inspection.
• Chasing down indicators for manufacturing defects.

Methods and Acknowledgements

TIFF (Tagged Image File Format [17]) images are normally limited to 4 GB. Although all the histology images we analyzed were in the 1 to 3 GB range, the Consortium for Open Medical Image Computing has shown foresight by curating all images in the BigTIFF format [18], which in theory allows image sizes of up to 16 million terabytes. BigTIFF files cannot be opened by most commercial imaging software.

Automated Slide Analysis Platform (ASAP) [19] is an open source platform for visualizing, annotating and automatically analyzing whole-slide histopathology images. ASAP is built on top of several well-developed open source packages like OpenSlide, Qt and OpenCV [15]. We have relied on ASAP for viewing and annotating the Camelyon16 images, and on TIFFFASTCROP [20] for segmenting these original images into JPEG images of a size we could analyze (typically 200-500 pixels square). TIFFFASTCROP was developed by the modelling team of the IMNC laboratory near Paris, France.

The image analysis software was written in Microsoft’s C# language using Visual Studio 2013. The classifier software was written in Java using Eclipse Europa. Software was run on either a Lenovo ThinkPad T430 having an Intel® Core™ CPU running at 2.50 GHz, 8 GB RAM, running Windows 7 Professional, or a laptop computer having an Intel® 6th Generation Core™ CPU running at 2.50 GHz, 16 GB RAM, with an NVIDIA GeForce 940M video card, running Windows 10.

Contacting Contiguity
Web: www.contiguitydx.com
Phone: (888) 913-3915
Email: steve.tyrrell@contiguitydx.com
  • 19. References
[1] B. G. Batchelor, Machine Vision Handbook, Springer, 2012.
[2] R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification, John Wiley & Sons, 2001.
[3] Vision Systems, "Machine Vision: The past, the present, and the future," [Online]. Available: http://www.vision-systems.com. [Accessed 22 July 2016].
[4] C. Steger, M. Ulrich and C. Wiedermann, Machine Vision Algorithms and Applications, Wiley VCH, 2007.
[5] S. Marsland, Machine Learning. An Algorithmic Perspective, Boca Raton, FL: CRC Press, 2009.
[6] E. Boja, T. Hiltke, R. Rivers, C. Kinsinger, A. Rahbar, M. Mesri and H. Rodriguez, "Evolution of Clinical Proteomics and its Role in Medicine," Journal of Proteome Research, 2010.
[7] S. M. Pepe, F. Ziding, H. Janes, P. M. Bossuyt and J. D. Potter, "Pivotal Evaluation of the Accuracy of a Biomarker Used for Classification or Prediction: Standards for Study Design," J Natl Cancer Inst, vol. 100, pp. 1432-1438, 2008.
[8] R. Ostroff, W. Bigbee, W. Franklin, L. Gold, M. Mehan, Y. Miller, H. Pass, W. Rom, J. Siegfried, A. Stewart, J. Walker, J. Weissfeld, S. Williams, D. Zichi and E. Brody, "Unlocking Biomarker Discovery: Large Scale Application of Aptamer Proteomic Technology for Early Detection of Lung Cancer," PLoS ONE, vol. 5, no. 12, p. e15003, 2010.
[9] R. Simon, "Development and Validation of Biomarker Classifiers for Treatment Selection," J Stat Plan Inference, vol. 138, no. 2, pp. 308-320, 2008.
[10] R. E. Larson, R. P. Hostetler and B. H. Edwards, Multivariable Calculus, Lexington MA: D.C. Heath and Company, 1994.
[11] T. A. Lasko, J. G. Bhagwat, K. H. Zou and L. Ohno-Machado, "The use of receiver operating characteristic curves in biomedical informatics," Journal of Biomedical Informatics, vol. 38, pp. 404-415, 2005.
[12] T. Fawcett, "ROC Graphs: Notes and Practical Considerations for Researchers," Palo Alto, 2004.
[13] M. H. Zweig and G. Campbell, "Receiver-Operating Characteristic (ROC) Plots: A Fundamental Evaluation Tool in Clinical Medicine," Clinical Chemistry, vol. 39, no. 4, pp. 561-577, 1993.
  • 20.
[14] W. H. Press, B. P. Flannery, S. A. Teukolsky and W. T. Vetterling, Numerical Recipes (Fortran), Cambridge University Press, 1990.
[15] "ISBI challenge on metastasis detection in lymph node," 2016. [Online]. Available: http://camelyon16.grand-challenge.org. [Accessed 22 July 2016].
[16] "Grand Challenges in Biomedical Image Analysis," 2016. [Online]. Available: http://grand-challenge.org/Home. [Accessed 22 July 2016].
[17] "TIFF. Revision 6.0," 3 June 1992. [Online]. Available: https://partners.adobe.com/public/developer/en/tiff/TIFF6.pdf. [Accessed 22 July 2016].
[18] "The BigTIFF File Format Proposal," [Online]. Available: http://www.awaresystems.be/imaging/tiff/bigtiff.html#structures. [Accessed 22 July 2016].
[19] "Automated Slide Analysis Platform," [Online]. Available: https://github.com/GeertLitjens/ASAP. [Accessed 22 July 2016].
[20] C. DeRoulers, "TIFFFASTCROP," 19 February 2016. [Online]. Available: http://www.imnc.in2p3.fr/pagesperso/deroulers/software/largetifftools/tifffastcrop.html. [Accessed 22 July 2016].
[21] A. M. Molinaro, R. Simon and R. M. Pfeiffer, "Prediction error estimation: a comparison of resampling methods," Bioinformatics, vol. 21, no. 15, pp. 3301-3307, 2005.
[22] M. D. Radmacher, L. M. McShane and R. Simon, "A Paradigm for Class Prediction Using Gene Expression Profiles," Journal of Computational Biology, vol. 9, no. 3, pp. 505-511, 2002.