Data-Driven Insights to Improve Pharmaceutical Manufacturing

Data Driven Drugs:
Predictive Models to Improve
Product Quality in Pharmaceuticals
Sarah Aerni, PhD
Senior Data Scientist at Pivotal
saerni@gopivotal.com
Strata RX
September 26, 2013

© Copyright 2013 Pivotal. All rights reserved.

2

The Quantified Patient
•  Medical History
•  Genetics
•  Family History
•  Imaging
•  Clinical Narratives
•  Medications
•  Molecular Diagnostics
•  Lab tests
•  Environment
•  Sensors & Mobile
3

Data driven drugs: From discovery to delivery
Stages: Drug discovery + development, Clinical Trials, Distribution and surveillance, Manufacturing, Marketing

RICH DATA SOURCES
•  Molecular data
–  Cellular drug screens
–  Animal models
•  Clinical data including notes, images, markers (e.g. genomics, lab results)
•  Sensor and assay data
•  Internal and partner/purchased external data
•  Contact center data
•  Patient registries, public and federal data, clinical partnerships
4

Data integration
How Pivotal can enable industries to
extract new value from data sources


5

Successful transformation into a data-driven enterprise requires a paradigm shift

DATA IS THE NEW CENTER OF GRAVITY

•  Bring available data sources to a central location
–  Integration of a variety of data leads to new insights
•  Analyze large volumes of variable data for richer models
–  Building models without data movement reduces time to insight
•  Share data, insights and ideas
–  Leveraging various expertise will lead to more relevant business insights

Data > Application
6

Traditional Analytics Processes

If you think databases are only good for storing data
Time-to-Insights

sample

In-memory
statistics
tool

In-memory
optimization
tool

solution

forecast


7

Pivotal One: Heritage
Application Fabric

Data Fabric

GemFire

Ingest & Query: very high-capacity &
in-memory
Scale-out storage: HDFS/Object

vFabric
Languages
&
Frameworks

Services

Analytics

Automation: App Provisioning & Life-cycle
Service Registry
Cloud Abstraction (portability)

Cloud Fabric


8

Performance Through Parallelism
•  Automatic parallelization
–  Load and query like any database
–  Automatically distributed tables across nodes
–  No need for manual partitioning or tuning
•  Analytics Optimized:
–  Analytics-oriented query optimization
•  Extremely scalable MPP shared-nothing architecture
–  All nodes can scan and process in parallel
–  Linear scalability by adding nodes

[Diagram: Database with Interconnect, Compute and Storage layers, plus Loading]
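
The shared-nothing idea above can be sketched in a few lines. This is a minimal illustration, not how the Pivotal database is implemented: the helper names (`segment_for`, `distribute`, `local_aggregate`, `global_mean`), the CRC32 hash and the sample rows are all assumptions made for the example, and the segments are processed in a plain loop where a real MPP system would run them in parallel on separate nodes.

```python
import zlib


def segment_for(key, n_segments):
    # Stable hash: the same distribution key always maps to the same segment.
    return zlib.crc32(str(key).encode("utf-8")) % n_segments


def distribute(rows, key_field, n_segments):
    # Spread rows across segments by hashing the distribution key.
    segments = [[] for _ in range(n_segments)]
    for row in rows:
        segments[segment_for(row[key_field], n_segments)].append(row)
    return segments


def local_aggregate(segment_rows, value_field):
    # Each segment scans only its own rows and returns a partial (sum, count).
    values = [r[value_field] for r in segment_rows]
    return sum(values), len(values)


def global_mean(segments, value_field):
    # Combine the per-segment partial results into one answer.
    partials = [local_aggregate(s, value_field) for s in segments]
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count if count else None


# Hypothetical sensor rows, invented for illustration only.
rows = [{"batch_id": i, "temp_c": 36.0 + (i % 5) * 0.5} for i in range(20)]
segments = distribute(rows, "batch_id", n_segments=4)
mean_temp = global_mean(segments, "temp_c")
```

Because each segment only returns a small partial result, adding segments spreads the scan work without moving the underlying rows.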


9

Performance Through Parallelism
•  Automatic parallelization
–  Load and query like any database
–  Automatically distributed tables across nodes
–  No need for manual partitioning or tuning
•  Analytics Optimized:
–  Analytics-oriented query optimization
•  Extremely scalable MPP shared-nothing architecture
–  All nodes can scan and process in parallel
–  Linear scalability by adding nodes

[Diagram: Database with Interconnect, Compute and Storage layers; Loading from ETL and File Systems]
External Sources: Loading, streaming, etc.

10

Pivotal HD Architecture
[Diagram: Pivotal HD Enterprise stack; legend: Apache, Pivotal HD Enterprise]
•  HDFS
•  HBase
•  Map Reduce
•  Pig, Hive, Mahout
•  Yarn
•  Zookeeper
•  Sqoop
•  Flume
•  Resource Management & Workflow
•  Hadoop Virtualization (HVE)
•  Command Center: Deploy, Configure, Monitor, Manage
•  Data Loader

11

Pivotal HD Architecture
[Diagram: Pivotal HD Enterprise stack with HAWQ; legend: Apache, Pivotal HD Enterprise, HAWQ]
•  HAWQ: Advanced Database Services (ANSI SQL + Analytics)
–  Xtension Framework
–  Query Optimizer
–  Dynamic Pipelining
–  Catalog Services
•  HDFS
•  HBase
•  Map Reduce
•  Pig, Hive, Mahout
•  Yarn
•  Zookeeper
•  Sqoop
•  Flume
•  Resource Management & Workflow
•  Hadoop Virtualization (HVE)
•  Command Center: Deploy, Configure, Monitor, Manage
•  Data Loader

12

Leveraging healthcare data to drive predictive and precision care
Data sources: Clinical Narratives, Medications, Imaging, Genetics, Environment, Lab tests
Applications: Decision support, Precision care, Cohort identification

Unified data supporting unified risk evaluation, decision-making, etc.
Acting on full patient and medical profile

13

Traditional Analytics Processes

If you think databases are only good for storing data
Time-to-Insights

sample

In-memory
statistics
tool

In-memory
optimization
tool

solution

forecast


14

Analytics with Pivotal

A single address for everything analytics
Time-to-Insights

Forecasting

Clustering

Regression

Optimization
Classification


15

Analytics Ecosystem
COMMERCIAL

OPEN SOURCE
MADlib

SAS/ACCESS
SAS Scoring Accelerator
SAS High Performance Analytics

In-database analytics

PL/R, PL/Python, PL/Java


16

MADlib: Machine Learning at Scale

Collaborators


17

Stages: Drug discovery + development, Clinical Trials, Distribution and surveillance, Marketing, Manufacturing

•  Molecular data
–  Cellular drug screens
–  Animal models
•  Clinical data including notes, images, markers (e.g. genomics, lab results)
•  Sensor and assay data
•  Internal and partner/purchased external data
•  Contact center data
•  Patient registries, public and federal data, clinical partnerships
18

Manufacturing
Data-driven approaches to tuning a
drug manufacturing process


19

Predicting potency in vaccine manufacturing
Customer
A major pharmaceutical company

Business Problem
Predict potency and antigen levels of live virus vaccines based on manufacturing sensor data and manual data collected throughout the process.

Challenges
•  Customer’s data model was not optimal for running analytical queries
•  Manual data quality issues
•  Data capture was performed with varying consistency due to high cost associated with manual data collection

Solution
•  Introduced a new data model to make data accessible and enable analytics
•  Built automated outlier detection/correction methods to address manual data entry quality issues
•  Devised imputation methods to deal with data completeness issues
•  Built predictive models with high accuracy
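
The slides do not say which imputation methods were devised. As one simple possibility, the sketch below fills missing entries with the median of the observed values; the function name `impute_median` and the sample readings are hypothetical.

```python
from statistics import median


def impute_median(values):
    # Replace missing entries (None) with the median of the observed values.
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to learn from, leave the gaps as they are
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical manual readings with two gaps.
readings = [36.5, None, 37.0, 36.8, None]
filled = impute_median(readings)
```

The median is used here because it is less sensitive than the mean to the kind of data entry errors described above.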


20

Building predictive models to improve outcomes in manufacturing of vaccines
[Diagram: process steps Cell expansion, Virus propagation and Pooling into final product, with measurements (Temp, Counts, Duration of step) shown over Time]
•  Future Looking Predictive Models
•  Backward Looking Models
•  Warning! Entered value not in expected range

21

Enabling predictive models through rearchitecting
Challenges
•  Accessibility
–  Certain parts of the data have
never been used in any predictive
modeling since it is extremely hard
to query them

Cell
expansion

•  Data Integrity
–  Manual data entries are prone to
errors. There is no immediate
feedback to examine the validity of
the values entered

Virus
propagation

•  Data Completeness
–  Manual data entry is time
consuming. There is no feedback
on what data is most useful in
improving the efficiency and
quality and hence no prioritization
of what data should be collected

Pooling into
final product

22

Enabling predictive models through rearchitecting
Challenges
•  Accessibility
–  Certain parts of the data have
never been used in any predictive
modeling since it is extremely hard
to query them

Purpose-built data models for rapid
data querying and exploration

•  Data Integrity
–  Manual data entries are prone to
errors. There is no immediate
feedback to examine the validity of
the values entered

Automated data cleansing
techniques

•  Data Completeness
–  Manual data entry is time
consuming. There is no feedback
on what data is most useful in
improving the efficiency and
quality and hence no prioritization
of what data should be collected

Opportunities to eliminate collection
of incomplete or non-predictive data

23

Identifying and correcting data integrity problems
Creating automated methods for detection and correction
[Figure: plot of "all data"; x-axis 1 to 23, y-axis 0 to 100]

•  Data integrity problems cause challenges in modeling
•  Sources of variation in entries of measurements
–  Variable units of measurement
–  Manual data entry errors

Approach: Detect the optimal threshold to separate two distributions
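
The slides do not name the thresholding algorithm. One common way to find a cut point between two groups of values is Otsu's method, which picks the split that maximizes the between-class variance; the sketch below applies it to a one-dimensional list. The function name `otsu_threshold` and the sample values are assumptions for illustration.

```python
def otsu_threshold(values):
    # Return the cut point that maximizes between-class variance.
    xs = sorted(values)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values")
    total = sum(xs)
    best_score, best_cut = -1.0, None
    left_sum = 0.0
    for i in range(1, n):
        left_sum += xs[i - 1]
        if xs[i] == xs[i - 1]:
            continue  # cannot place a cut between two equal values
        w0, w1 = i / n, (n - i) / n          # class weights
        mu0 = left_sum / i                   # mean of the lower class
        mu1 = (total - left_sum) / (n - i)   # mean of the upper class
        score = w0 * w1 * (mu0 - mu1) ** 2   # between-class variance
        if score > best_score:
            best_score = score
            best_cut = (xs[i - 1] + xs[i]) / 2
    if best_cut is None:
        raise ValueError("all values are identical")
    return best_cut


# Hypothetical measurements recorded in two different units.
measurements = [0.14, 0.16, 0.18, 0.15, 14.0, 16.0, 18.0, 17.0, 15.0]
threshold = otsu_threshold(measurements)
```

Values below the returned threshold fall in the lower group and the rest in the upper group.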

24

•  Data integrity problems cause challenges in modeling
•  Sources of variation in entries of measurements
–  Variable units of measurement
–  Manual data entry errors
•  Approach: Detect the optimal threshold to separate two distributions

[Figure: plot of "all data", with frequency histograms of the "lower half" (values about 0.12 to 0.22) and the "upper half" (values about 12 to 24)]


25


[Figure: plot of "all data" with points labeled Foreground and Background, with frequency histograms of the "lower half" (values about 0.12 to 0.22) and the "upper half" (values about 12 to 24)]


26




[Figure: plot of "all data", frequency histograms of the "lower half" and "upper half", and the cleaned histogram of c(loh, uph) with multiplier = 100 (values about 12 to 24)]
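
The cleaned histogram uses a multiplier of 100. A minimal sketch of that correction step follows, assuming the lower group is the one recorded in the wrong unit and that a threshold separating the groups is already known; the function name `rescale_lower_group`, the threshold of 1.0 and the sample values are hypothetical.

```python
def rescale_lower_group(values, threshold, multiplier=100):
    # Bring entries below the threshold onto the same scale as the rest.
    return [v * multiplier if v < threshold else v for v in values]


# Hypothetical entries: some typed as 0.14, others as 14 for the same quantity.
raw = [0.14, 16.0, 0.18, 20.0, 0.22, 12.0]
cleaned = rescale_lower_group(raw, threshold=1.0)
```

After rescaling, all entries sit in a single range and can be modeled together.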


28

Building models: First, start with the answer
How to build models that solve the right problem
Cell
expansion

Approach: Use historical data to build a model
predicting potency of a final product using data
from the manufacturing process
!  Model form, how do we pick the right one?

Virus
propagation

–  How do we deal with correlated features?
–  Accuracy or interpretability?

!  Available data
Pooling into
final product


–  Thousands of features, without expert guidance how do we
choose the right ones?
–  What data do we want to use to predict? When is the right
time for an intervention?

29

Model generation and evaluation
Predicting vaccine potency using manufacturing data

•  Feature engineering and transformation
–  Enabled by rapid in-database processing
•  Experimentation with model forms
–  Partial least squares
–  Random forest
–  Regularized regression
•  Interpretation of model results for insight generation
–  Use cross-validation framework to assess variable importance

[Figure: scatter plot of Predicted Potency against True Potency, both axes 12.0 to 13.5. Train R2=0.823, Test R2=0.742]
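
The train and test R2 values compare predicted potency against true potency. The slides do not state which definition was used; the sketch below uses the common coefficient of determination, 1 minus the ratio of residual to total sum of squares. The function name `r_squared` and the numbers are illustrative and are not taken from the deck.

```python
def r_squared(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("y_true has no variance")
    return 1.0 - ss_res / ss_tot


# Made-up potency values for illustration only.
true_potency = [12.0, 12.5, 13.0, 13.5]
predicted_potency = [12.1, 12.4, 13.2, 13.3]
r2 = r_squared(true_potency, predicted_potency)
```

A test R2 noticeably below the train R2, as on the slide, is the usual sign that the model fits the training batches somewhat better than unseen ones.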
30

Sample model insights
Interpreting the utility of a measure obtained during manufacturing based
on model outcomes
•  Features consistently absent from models may be uninformative for predicting potency
•  Some features may reveal tunable parameters to alter potency, others may simply be markers
•  Opportunities to provide real-time feedback on data entry errors and predicted potency outcomes

[Figure: two panels plotting Log of Potency against a manufacturing measure, annotated Correlation = 0.38 and Correlation = -0.45]
–  Assayed value: SP1 Total Viable Cells Harvested Per Sq. Cm
–  Duration of a step: SP2 Total Trypsinization Exposure Time of per CCS
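
The panels report a correlation between a manufacturing measure and log potency. The slides do not say which correlation coefficient was used; assuming the Pearson coefficient, a minimal sketch looks like this. The function name `pearson` and the sample data are hypothetical.

```python
import math


def pearson(xs, ys):
    # Pearson correlation: covariance divided by the product of the spreads.
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return cov / (sx * sy)


# Made-up values: a step duration and a log potency that moves the other way.
duration = [12.0, 13.0, 14.0, 15.0]
log_potency = [13.0, 12.8, 12.6, 12.4]
corr = pearson(duration, log_potency)
```

A coefficient near zero would suggest the measure carries little linear signal about potency, which is one way to flag data that may not be worth collecting.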

31

Data-driven drugs
Opportunities for data mining across the
pharmaceutical industry


32

Drug discovery
+ development
Clinical
Trials
Distribution and
surveillance

Manufacturing
Marketing


33

Stages: Drug discovery + development, Clinical Trials, Distribution and surveillance, Manufacturing, Marketing

•  Data repurposing
–  New value exists in leveraging historical data across drugs and stages
•  Data discovery
–  External and publicly available datasets can augment proprietary sources
•  Data collection
–  Obtaining new data from different sources drives additional value

34

Stages: Drug discovery + development, Clinical Trials, Distribution and surveillance, Manufacturing, Marketing

•  Data repurposing
–  New value exists in leveraging historical data across drugs and stages
–  Adverse events for new clinical indications
•  Data discovery
–  External and publicly available datasets can augment proprietary sources
–  Twitter data to forecast demand
•  Data collection
–  Obtaining new data from different sources drives additional value
–  Mobile and sensor data to measure patient adherence and outcomes
35

Leveraging Data to Improve Demand Forecasts
Hospitals
Doctor’s Offices
Supply Distr.

Surgery Centers

Sales Data

Pharmacies

Analyze orders from
customers

Patients

Laboratories

Self-Reporting

Publicly Available Resources
Monitoring Patient Populations


36

Promising Advancements in Diabetes Studies
Use of telehealth to provide tight glucose control

Biochemical
Measurements

EMR
Genomics
Lifestyle

Intervention


37

Launching a successful diabetes management program
Multiple potential points of failure, requires use of analytics at every step

Program stages:
•  Increase Awareness
•  Patient Enrollment
•  Comparative Effectiveness
•  Remote Patient Monitoring
•  Design Interventions
•  Measure Impact on Population

Analytics applied across the stages:
•  Best channel per cohort
•  Identify highest impact channels
•  Identify influencers
•  Campaign optimization
•  Stochastic entity resolution
•  Churn prediction
•  Measure engagement
•  Best therapy for each cohort: Medication, Delivery Method, Monitoring Method
•  Medication adherence
•  Predict risk of negative outcome for next 3 months
•  Resource allocation decisions
•  A/B testing to design best engagement platform
•  Attribution models
•  Careful design of experiment to quantify the impact

38

Launching a successful diabetes management program
Interdisciplinary collaboration of data scientists essential to success
Disciplines: Marketing, Healthcare, Web Analytics, Optimization, General ML

Program stages:
•  Increase Awareness
•  Patient Enrollment
•  Comparative Effectiveness
•  Remote Patient Monitoring
•  Design Interventions
•  Measure Impact on Population


39

Pivotal Labs rapid application development
•  Rheumatoid arthritis remote patient monitoring system
–  Self-reporting
–  Intuitive user interface

https://itunes.apple.com/us/app/myra/id563338979?mt=8


40

