Instead of talking about artificial intelligence at the organizational level in hospitals and in research laboratories, the focus for non-machine-learning practitioners should be on understanding the data pipelines and what is involved around model training.
alternative download link:
https://www.dropbox.com/s/9tv673sxkxcnojj/dataStrategyForOphthalmology.pdf?dl=0
3. Value of Data
Trask (@iamtrask), iamtrask.github.io
Something worth remembering. Despite the hype, deep learning algorithms are commodities. It's the data that's the real value.
Well, actually, good-quality structured data that can be refined to information and knowledge with suitable models
4. Structuring Data
Common ways
Excel sheets are maybe not the way to go for good-quality databases
A Typical Data Science Department
Most companies structure their data science departments into 3 groups:
Data scientists: the folks who are “better engineers than statisticians and better statisticians than engineers”. Aka, “the thinkers”.
Data engineers: these are the folks who build pipelines that feed data scientists with data and take the ideas from the data scientists and implement them. Aka, “the doers”.
Infrastructure engineers: these are the folks who maintain the Hadoop cluster / big data infrastructure. Aka, “the plumbers”.
http://www.kdnuggets.com/2016/03/engineers-shouldnt-write-etl.html
5. Data vs Models vs Hardware
Sun et al. (2017)
https://arxiv.org/abs/1707.02968
“Analogous to going higher in polynomial order”
“Better models cannot reach their full capacity as datasets stay small”
“Heavier models processed in reasonable time, or old ones faster”
6. Preprocessing vs Data Engineering vs “The AI” part vs Deployment
Only a small fraction of real-world ML systems is composed of
the ML code, as shown by the small black box in the middle.
The required surrounding infrastructure is vast and complex.
Google (Sculley et al., 2015) at NIPS:
“Hidden Technical Debt in Machine Learning Systems”
Successful hospitals and labs are the ones that can or want to re-design their processes for “data-driven medicine”
7. Data, and then there is Structured Data
For most applications, it is not enough to have a bunch of images on a hard drive with no labeling (pathology, outlines of structures of interest, etc.)
Glaucomatous fundus image from the RIM-ONE r3 database (S-17-L)
http://medimrg.webs.ull.es/research/downloads/
Semantic (image) Segmentation
- Optic Disc
- Optic Cup
Image Classification
- Health vs
- Glaucoma severity
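One way to picture the difference between "a bunch of images" and structured data is a record that ties each image to its labels. The following is a minimal sketch under stated assumptions: the class `FundusRecord`, its field names, and the file paths are all hypothetical and are not taken from the RIM-ONE database or any other real dataset.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FundusRecord:
    """One fundus image plus its labels (all field names are hypothetical)."""
    image_path: str
    # Image-level label used for classification, e.g. "healthy" or "glaucoma".
    diagnosis: Optional[str] = None
    # Maps a structure name to the file holding its segmentation mask.
    mask_paths: Dict[str, str] = field(default_factory=dict)

    def usable_for_classification(self) -> bool:
        return self.diagnosis is not None

    def usable_for_segmentation(self, structures: List[str]) -> bool:
        return all(name in self.mask_paths for name in structures)


records = [
    # An image on a hard drive with no labeling: usable for neither task.
    FundusRecord("images/a.png"),
    # Image-level label only: usable for classification.
    FundusRecord("images/b.png", diagnosis="glaucoma"),
    # Label plus outlines: usable for classification and segmentation.
    FundusRecord(
        "images/c.png",
        diagnosis="healthy",
        mask_paths={
            "optic_disc": "masks/c_disc.png",
            "optic_cup": "masks/c_cup.png",
        },
    ),
]

classification_set = [r for r in records if r.usable_for_classification()]
segmentation_set = [
    r for r in records
    if r.usable_for_segmentation(["optic_disc", "optic_cup"])
]
```

In this toy example only the records that carry labels end up in a training set; the unlabeled image contributes nothing to either task.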
9. Structuring Data
Common ways
Excel sheets are maybe not the way to go for good-quality databases
ETL (extract, transform and load)
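The ETL idea can be sketched in a few lines. This is an illustrative example only, assuming a hypothetical spreadsheet export with the columns `patient_id`, `eye`, and `iop_mmhg`; the table name, the column names, and the cleaning rules are assumptions made for the sketch, not part of any real hospital system. It uses only the Python standard library (`csv`, `sqlite3`).

```python
import csv
import io
import sqlite3

# Hypothetical spreadsheet export with typical inconsistencies:
# stray whitespace, mixed case, and an unparseable measurement.
RAW_EXPORT = """patient_id,eye,iop_mmhg
P001, left ,14.5
P002,RIGHT,21
P003,right,n/a
"""


def extract(text):
    """Extract: read the raw rows as dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize text fields and drop rows that cannot be parsed."""
    clean = []
    for row in rows:
        try:
            iop = float(row["iop_mmhg"])
        except ValueError:
            continue  # skip rows whose measurement is not a number
        clean.append((row["patient_id"].strip(), row["eye"].strip().lower(), iop))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into a table with explicit constraints."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS measurements ("
        "patient_id TEXT NOT NULL, "
        "eye TEXT NOT NULL CHECK (eye IN ('left', 'right')), "
        "iop_mmhg REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO measurements VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_EXPORT)), conn)
stored = conn.execute(
    "SELECT patient_id, eye, iop_mmhg FROM measurements ORDER BY patient_id"
).fetchall()
```

The database schema enforces what a spreadsheet leaves to convention: every stored row has a patient identifier, a valid eye label, and a numeric measurement.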
12. Changes in Publishing
Most AI papers are published on arXiv without peer review → acceleration of science
“We (Science) do allow posting of research papers on not-for-profit preprint servers such as arxiv.org and bioRxiv”
“Presentation of data on a pre-print server does not conflict with submission to The Lancet”
“The ArXiv preprint server is the medium of choice for (mainly) physicists and astronomers who wish to share drafts of their papers with their colleagues, and with anyone else with sufficient time and knowledge to navigate it. [...] If scientists wish to display drafts of their research papers on an established preprint server before or during submission to Nature or any Nature journal, that's fine by us.”
17. Multimodal Diagnostics Power
Mining hospital databases for disease predictors (“without a hypothesis”)
In 2015, a research group at Mount Sinai Hospital in New York was inspired to apply deep learning to the hospital’s vast database of patient records. This data set features hundreds of variables on patients, drawn from their test results, doctor visits, and so on. The resulting program, which the researchers named Deep Patient, was trained using data from about 700,000 individuals, and when tested on new records, it proved incredibly good at predicting disease.
https://www.technologyreview.com/s/604087/the-dark-secret-at-the-heart-of-ai/
18. Deep Patient
“We performed evaluation using 76,214 test patients comprising 78 diseases from diverse clinical domains and temporal windows. Prediction performance for severe diabetes, schizophrenia, and various cancers were among the top performing.”
19. Does not stop at diagnostics
Stratify your population, and go deeper into personalized medicine
Throw away your heuristic decision trees
Wolfs et al. (2000) IOVS for primary open-angle glaucoma (POAG) classification
https://doi.org/10.1016/j.preteyeres.2015.07.007