A Tree-Based Approach
for Addressing Self-selection in Impact Studies
with Big Data
Inbal Yahav Galit Shmueli Deepa Mani
Bar Ilan University Indian School of Business
Israel India
@ HKUST
Business School
Dept of ISBSOM
May 16, 2017
PART A (BACKGROUND):
EXPERIMENTS (& FRIENDS),
RANDOMIZATION, AND CAUSAL INFERENCE
PART C:
OUR NEW TREE APPROACH
(AN ALTERNATIVE TO PSM)
PART B:
DEALING WITH SELF SELECTION
(FOR CAUSAL INFERENCE)
PART A (BACKGROUND):
EXPERIMENTS (& FRIENDS),
RANDOMIZATION, AND CAUSAL INFERENCE
Experimental Studies
• Goal: Causal inference
• Effects of causes (causal description) vs.
causes of effects (causal explanation)
• Manipulable cause
Randomization vs self-selection
• RCT: manipulation + random assignment → balanced groups, ruling out alternative explanations (confound / third variable) and providing a counterfactual (always?)
• Quasi-experiment: manipulation + self-selection (or administrator selection) → alternative explanations remain
Experiments & Variations
• Randomized experiment (RCT), natural experiment,
quasi-experiment
• Lab vs. field experiments
Validity
External validity: generalization
Internal validity: alternative explanations,
heterogeneous treatment effect
PART B:
DEALING WITH SELF SELECTION
(FOR CAUSAL INFERENCE)
Self selection: the challenge
• Large impact studies of an intervention
• Individuals/firms self-select intervention group/duration
• Even in RCT, some variables might remain unbalanced
How to identify and adjust for self-selection?
Three Applications
1. Impact of training on earnings (RCT): field experiment by US govt
   • LaLonde (1986) compared to observational control
   • Re-analysis by PSM (Dehejia & Wahba, 1999, 2002)
2. Impact of e-Gov service in India (quasi-experiment): new online passport service
   • Survey of online + offline users
   • Bribes, travel time, etc.
3. Impact of outsourcing contract features on financial performance (observational)
   • Pricing mechanism
   • Contract duration
Common Approaches
• Heckman-type modeling
• Propensity Score Approach (Rubin & Rosenbaum)
Two steps:
1. Selection model: T = f(X)
2. Performance analysis on matched samples
Y = performance measure(s)
T = intervention
X = pre-intervention variables
Propensity Scores Approach
Self-selection: P(T|X) ≠ P(T)
Step 1: Estimate selection model logit(T) = f(X) to compute propensity scores P(T|X)
Step 2: Use scores to create matched samples
  • PSM = use matching algorithm
  • PSS = divide scores into bins
Step 3: Estimate effect on Y (compare groups), e.g., t-test or Y = b0 + b1 T + b2 X + b3 PS + e
Y = performance measure(s); T = intervention; X = pre-intervention variables
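The three steps can be sketched in Python with scikit-learn (a minimal illustration on simulated data, not the settings used in any of the studies discussed here):

```python
# Minimal propensity-score-matching sketch (illustrative; data is simulated)
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))                                   # pre-intervention covariates
t = (rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)  # self-selected treatment
y = 2 * t + X[:, 0] + rng.normal(size=n)                      # outcome; true effect = 2

# Step 1: selection model -> propensity scores P(T=1|X)
ps = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]

# Step 2 (PSM): match each treated unit to nearest-score control, with replacement
treated, control = np.where(t == 1)[0], np.where(t == 0)[0]
matches = control[np.abs(ps[treated][:, None] - ps[control][None, :]).argmin(axis=1)]

# Step 3: estimate effect on the matched samples
att = (y[treated] - y[matches]).mean()
print(f"Estimated ATT: {att:.2f}")
```

The estimate should land near the true effect of 2, while the naive treated-minus-control difference would be biased upward by the self-selection on X1.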
The Idea of PSM: Balancing
“The propensity score allows one to design and
analyze an observational (nonrandomized) study so
that it mimics some of the particular characteristics of
a randomized controlled trial. In particular, the
propensity score is a balancing score: conditional on
the propensity score, the distribution of observed
baseline covariates will be similar between treated and
untreated subjects.”
Study 1: Impact of training on financial gains
(LaLonde 1986)
Experiment: US govt program randomly assigns
eligible candidates to training program
• Goal: increase future earnings
• LaLonde (1986) shows:
Groups statistically equal in terms of demographic
& pre-training earnings
→ Training effect: ATE = $1,794 (p < 0.004)
Observational control group
LaLonde also compared with observational
control groups (PSID, CPS)
– experimental training group + obs control group
– shows training effect not estimated correctly with
structural equations
PSID = Panel Study of Income Dynamics
CPS = Westat’s Matched Current Population Survey (Social Security Administration)
This paper compares the effect on trainee earnings
of an employment program that was run as a field
experiment where participants were randomly
assigned to treatment and control groups with the
estimates that would have been produced by an
econometrician. This comparison shows that many
of the econometric procedures do not replicate the
experimentally determined results, and it suggests
that researchers should be aware of the potential
for specification errors in other nonexperimental
evaluations.
Yahav et al./Tree-Based Approach for Addressing Self-Selection
Table 4. Summary Statistics of Datasets Used by Dehejia and Wahba (1999). Averages with standard deviations in parentheses, computed directly from the datasets at http://sekhon.berkeley.edu/matching/lalonde.html.

| Characteristic (variable name) | Experimental NSW: Treatment | Experimental NSW: Control | Nonexperimental CPS: Control |
|---|---|---|---|
| Age (age) | 25.82 (7.16) | 25.05 (7.06) | 33.22 (11.05) |
| Years of schooling (educ) | 10.35 (2.01) | 10.09 (1.61) | 12.03 (2.87) |
| Proportion of blacks (black) | 0.84 (0.36) | 0.83 (0.38) | 0.07 (0.26) |
| Proportion of Hispanic (hisp) | 0.06 (0.24) | 0.11 (0.31) | 0.07 (0.26) |
| Proportion of married (married) | 0.19 (0.39) | 0.15 (0.36) | 0.71 (0.45) |
| Proportion of high school dropouts (nodegr) | 0.71 (0.46) | 0.83 (0.37) | 0.30 (0.46) |
| Real earnings 24 months prior to training (1974) (re74) | 2,096 (4,887) | 2,107 (5,688) | 14,024 (9,579) |
| Real earnings 12 months prior to training (1975) (re75) | 1,532 (3,219) | 1,267 (3,103) | 13,642 (9,260) |
| Proportion of nonworkers in 1974 (u74) | 0.71 (0.46) | 0.75 (0.43) | 0.88 (0.32) |
| Proportion of nonworkers in 1975 (u75) | 0.60 (0.49) | 0.68 (0.47) | 0.89 (0.31) |
| Outcome: real earnings in 1978 (re78) | 6,349 (7,867) | 4,555 (5,484) | 14,855 (9,658) |
| Sample size | 185 | 260 | 15,991 |
Table 5. Training Effect in NSW Experiment: Comparison between Approaches. Based on the DW99 sample; tree-based results are split by presence/absence of a high school degree; the overall tree-approach training effect is computed as a weighted average of HS degree and HS dropout (computed for comparison only; due to the …)
PSM:
Observational control group
Dehejia & Wahba (1999,2002) re-analyzed CPS
control group (n=15,991), using PSM
– Effects in range $1,122–$1,681, depending on settings
– “Best” setting effect: $1360
– PSM uses only 119 control group members
How did Dehejia & Wahba use PSM?
D&W obtained training effects in the range $1,122 to $1,681 under
different PSM settings and several matching schemes:
• Subset selection with/without replacement, combined with low-to-
high/high-to-low/random/nearest-neighbor (NN)/caliper matching.
• DW02 show that selection with replacement followed by NN
matching best captures the effect of the training program. However,
other matching schemes often yield poor performance, such as a
negative training effect.
• The overall training effect under their best settings (they can
compare to the actual experimental results!) is reported as $1,360.
• Note: This setting leads to dropping most of the 15,991 control
records, with only 119 records remaining in the control group.
Bottom line:
The researcher must make choices about settings and
parameters/bins; different choices can lead to different results.
PART C:
OUR NEW TREE APPROACH
(AN ALTERNATIVE TO PSM)
MIS Quarterly, Vol. 40 No. 4, pp. 819-848, December 2016
Challenges of PS in Big Data
1. Matching leads to severe data loss
2. PS methods suffer from “data dredging”
3. No variable selection (cannot identify variables that
drive the selection)
4. Assumes constant intervention effect
5. Sequential nature is computationally costly
6. Logistic model requires researcher to specify exact
form of selection model
Proposed Solution:
Tree-based approach
"Kill the Intermediary": instead of going from the data (Y, T, X) through propensity scores P(T|X) to E(Y|T), go directly to E(Y|T), and even E(Y|T, X).
Classification Tree
Output: T (treat/control)
Inputs: X’s (income, edu, family…)
Records in each terminal node share same
profile (X) and same propensity score P(T=1| X)
Tree Creation
Which algorithm?
Conditional-Inference trees (Hothorn et al., 2006)
– Stop tree growth using statistical tests of
independence
– Binary splits
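The stopping rule's core idea, split only where a statistical test rejects independence between a covariate and the treatment, can be illustrated as follows (a simplified sketch using a chi-square test; the actual ctree algorithm uses permutation tests with multiplicity adjustment):

```python
# Sketch of the conditional-inference stopping idea: grow the tree only where
# a test rejects independence between a candidate split variable and T.
# Not the actual ctree algorithm, just the principle behind its stopping rule.
import numpy as np
from scipy.stats import chi2_contingency

def should_split(x_binary, t, alpha=0.05):
    """Split on x only if x and treatment assignment appear dependent."""
    table = np.array([[np.sum((x_binary == a) & (t == b)) for b in (0, 1)]
                      for a in (0, 1)])
    _, p_value, _, _ = chi2_contingency(table)
    return p_value < alpha

rng = np.random.default_rng(1)
t = rng.integers(0, 2, 500)
x_unbalanced = (rng.random(500) < np.where(t == 1, 0.8, 0.3)).astype(int)
x_balanced = rng.integers(0, 2, 500)

print(should_split(x_unbalanced, t))  # dependence: tree would split here
print(should_split(x_balanced, t))    # likely independent: growth stops
```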
Tree-Based Approach
Four steps (plus an optional fifth):
1. Run selection model: fit tree T = f(X)
2. Present resulting tree; discover unbalanced X’s
3. Treat each terminal node as sub-sample for
measuring Y; conduct terminal-node-level
performance analysis
4. Present terminal-node-analyses visually
5. [optional]: combine analyses from nodes with
homogeneous effects
Like PS, assumes observable self-selection
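Steps 1–3 can be sketched with an off-the-shelf CART tree on simulated data (the paper itself uses conditional-inference trees in R; this Python approximation and its variable names are illustrative):

```python
# Sketch of the tree-based approach: fit tree T = f(X), then treat each
# terminal node as a sub-sample for estimating the effect of T on Y.
# (CART stands in for the paper's conditional-inference trees.)
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
n = 5000
X = rng.integers(0, 2, size=(n, 3))          # binary pre-intervention variables
p_select = np.where(X[:, 0] == 1, 0.7, 0.3)  # self-selection driven by X1
t = (rng.random(n) < p_select).astype(int)
effect = np.where(X[:, 0] == 1, 3.0, 1.0)    # heterogeneous intervention effect
y = effect * t + rng.normal(size=n)

# Step 1: selection model T = f(X); unbalanced X's appear as splits
tree = DecisionTreeClassifier(min_samples_leaf=200).fit(X, t)

# Steps 2-3: node-level performance analysis within each terminal node
leaves = tree.apply(X)
for leaf in np.unique(leaves):
    m = leaves == leaf
    node_effect = y[m & (t == 1)].mean() - y[m & (t == 0)].mean()
    print(f"node {leaf}: n={m.sum()}, estimated effect={node_effect:.2f}")
```

Because the tree splits on the unbalanced X1, the node-level estimates recover the two distinct effects (near 1 and near 3) that a pooled comparison would average away.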
Tree on Lalonde’s RCT data
If groups are completely
balanced, we expect…
Y = Earnings in 1978
T = Received NSW training (T = 1) or not (T =
0)
X = Demographic information and prior
earnings
Tree reveals… (tree splits on high school degree: no / yes)

| | LaLonde's naïve approach (experiment) | Tree approach: HS dropout (n=348) | Tree approach: HS degree (n=97) |
|---|---|---|---|
| Not trained (n=260) | $4,554 | $4,495 | $4,855 |
| Trained (n=185) | $6,349 | $5,649 | $8,047 |
| Training effect | $1,794 (p=0.004) | $1,154 (p=0.063) | $3,192 (p=0.015) |

Tree approach overall: $1,598 (p=0.017)

1. Unbalanced variable (HS degree)
2. Heterogeneous effect
Tree for obs control group reveals…
1. Unbalanced variables
2. Heterogeneous effect in u74: unemployed prior to training in 1974 (u74 = 0) → negative effect
3. Outlier
4. Eligibility issue: some profiles are rare in the trained group but popular in the control group
Solves challenges of PS in Big Data
1. Matching leads to severe data loss
2. PS methods suffer from “data dredging”
3. No variable selection (cannot identify variables that
drive the selection)
4. Assumes constant intervention effect
5. Sequential nature is computationally costly
6. Logistic model requires researcher to specify exact
form of selection model
Why Trees in Explanatory Study?
Flexible non-parametric
selection model (f)
Automated detection of
unbalanced pre-intervention
variables (X)
Easy to interpret,
transparent, visual
Applicable to binary, polytomous,
continuous intervention (T)
Useful in Big Data
context
Identify heterogeneous
effects (effect of T on Y)
Study 2: Impact of eGov Initiative (India)
Survey commissioned by Govt of India in 2006
• >9500 individuals who used passport services
• Representative sample of 13 Passport Offices
• "Quasi-experimental, non-equivalent groups design"
• Equal number of offline and online users, matched by geography and demographics
Current Practice
Assess impact by comparing online/offline performance stats
[Chart: awareness of electronic services provided by Government of India; % bribe RPO; % use agent; % prefer online; % bribe police]
Simpson’s Paradox
1. Demographics properly balanced
2. Unbalanced variable (Aware)
3. Heterogeneous effects on various y’s
+ even Simpson’s paradox
Would we detect this with PSM?
Heterogeneous effect
Scaling Up to Big Data
• We inflated eGov dataset by bootstrap
• Up to 9,000,000 records and 360 variables
• 10 runs for each configuration; runtime for tree: 20 sec
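Inflating a dataset by bootstrap can be sketched as follows (the eGov data is not public here, so a stand-in frame with hypothetical columns is used):

```python
# Sketch of bootstrap inflation: resample an existing dataset with
# replacement up to Big Data sizes. Column names are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
small = pd.DataFrame({
    "online": rng.integers(0, 2, 9500),   # stand-in for online/offline user flag
    "aware": rng.integers(0, 2, 9500),    # stand-in for awareness variable
})

target_n = 1_000_000  # scale toward the 9,000,000-record experiments
big = small.sample(n=target_n, replace=True, random_state=42).reset_index(drop=True)
print(len(big))
```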
Big Data Simulation
• Interventions: binary (T ∈ {0, 1}) and continuous (T ~ Normal)
• Sample sizes (n): 10K, 100K, 1M
• #Pre-intervention variables (p): 4, 50 (+interactions)
• Pre-intervention variable types: binary, Likert-scale, continuous
• Outcome variable types: binary, continuous
• Selection models:
  #1: P(T=1) = logit(b0 + b1 x1 + … + bp xp)
  #2: P(T=1) = logit(b0 + b1 x1 + … + bp xp + interactions)
• Intervention effects, binary T:
  1. Homogeneous: E(Y | T=0) = 0.5; E(Y | T=1) = 0.7
  2. Heterogeneous: E(Y | T=0) = 0.5; E(Y | T=1, X1=0) = 0.7; E(Y | T=1, X1=1) = 0.3
• Intervention effects, continuous T:
  1. Homogeneous: E(Y | T=0) = 0; E(Y | T=1) = 1
  2. Heterogeneous: E(Y | T=0) = 0; E(Y | T=1, X1=0) = 1; E(Y | T=1, X1=1) = -1
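The binary-intervention setup with selection model #1 and the heterogeneous effect can be generated as follows (the coefficients b0, b1, … are illustrative choices, not values specified on the slide):

```python
# Sketch of the simulation's binary-intervention design: selection model #1
# plus the heterogeneous binary-outcome effect. Coefficients are illustrative.
import numpy as np

rng = np.random.default_rng(3)
n, p = 10_000, 4
X = rng.integers(0, 2, size=(n, p)).astype(float)  # binary pre-intervention variables
b0, b = -0.5, np.array([1.0, 0.5, -0.5, 0.25])     # assumed coefficients

# Selection model #1: P(T=1) = inverse-logit(b0 + b1 x1 + ... + bp xp)
p_t = 1 / (1 + np.exp(-(b0 + X @ b)))
T = (rng.random(n) < p_t).astype(int)

# Heterogeneous effect: E(Y|T=0)=0.5, E(Y|T=1,X1=0)=0.7, E(Y|T=1,X1=1)=0.3
p_y = np.where(T == 0, 0.5, np.where(X[:, 0] == 0, 0.7, 0.3))
Y = (rng.random(n) < p_y).astype(int)

print(Y[(T == 1) & (X[:, 0] == 0)].mean())  # near 0.7
print(Y[(T == 1) & (X[:, 0] == 1)].mean())  # near 0.3
```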
Results for selection model
P (T=1 | X ) = logit (b0 + b1 X1 +…+ bp Xp)
PSS (5 bins)
Big Data Scalability
Theoretical Complexity:
• O(mn/p) for binary X
• O((m/p) n log(n)) for continuous X
Runtime as function of sample size, dimension
Scaling Trees Even Further
• “Big Data” in research vs. industry
• Industrial scaling
– Sequential trees: efficient data structure, access
(SPRINT, SLIQ, RainForest)
– Parallel computing (parallel SPRINT, ScalParC,
SPARK, PLANET) “as long as split metric can be
computed on subsets of the training data and
later aggregated, PLANET can be easily extended”
Heterogeneous Effect
Continuous Intervention
16 nodes
Tree Approach Benefits
1. Data-driven selection model
2. Scales up to Big Data
3. Fewer user choices (less room for data dredging)
4. Nuanced insights
• Detect unbalanced variables
• Detect heterogeneous effect from anticipated outcomes
5. Simple to communicate
6. Automatic variable selection
7. Missing values do not force removal of the record
8. Binary, multiple, continuous interventions
9. Post-analysis of RCT, quasi-experiments & observational studies
Tree Approach Limits
1. Assumes selection on observables
2. Need sufficient data
3. Continuous variables can lead to a large tree
4. Instability
[possible solution: use variable importance scores (forest)]
Insights from tree-approach in
the three applications
Labor (Lalonde ‘86)
Heterogeneous effect:
Impact of training depends
on High school diploma
Contract Duration
First attempt to study
effect of duration on
contract performance
Price Mechanism
Heterogeneous effect:
Fixed-price creates long-
term market value (not
productivity), but only in
high-trust contracts
eGov
Heterogeneous effect:
Impact of online system
depends on user awareness
Yahav, I., Shmueli, G. and Mani, D. (2016) “A Tree-Based Approach for Addressing Self-Selection
in Impact Studies with Big Data”, MIS Quarterly, vol 40 no 4, pp. 819-848.
Impact of IT Outsourcing Contract Attributes
How does financial performance of
outsourcing contracts vary with two
attributes of the contract:
• Pricing mechanisms (6 options)
• Contract duration (continuous)
Observational Data
• >1400 contracts, implemented 1996-2008
• 374 vendors and 710 clients
• Obtained from IDC database, Lexis-Nexis,
COMPUSTAT, etc.
T = Six Pricing Mechanisms
(polytomous intervention)
Interventions (T):
1. Fixed Price
2. Transactional Price
3. Time-and-Materials
4. Incentive
5. Combination
6. Joint Venture
Fixed Price
Variable Price
Pre-Intervention Variables (X):
Task Type
Bid Type
Contract Value
Uncertainty in business requirements
Outsourcing Experience
Firm Size (market value of equity)
Outcomes (Y):
Announcement Returns
Long Term Returns
Median Income Efficiency
Six Pricing Mechanisms
(polytomous intervention)
Interventions (T):
1. Fixed Price
2. Transactional Price
3. Time-and-Materials
4. Incentive
5. Combination
6. Joint Venture
Fixed Price - Fixed payment per billing cycle
Transactional - Fixed payment per transaction
per billing cycle
Time and Materials - Payment based on input
time and materials used during billing cycle
Incentive - Payment based on output
improvements against key performance
indicators or any combination of indicators
Combination - A combination of any of the
above contract types, largely fixed price and
time and materials
Joint Venture - A separately incorporated
entity, jointly owned by the client and the
vendor, used to govern the outsourcing
relationship.
Fixed Price
Variable Price
Six Price Mechanisms
Questions of interest:
1) Do all simple outsourcing engagements, governed by fixed or transactional price contracts, create value?
2) What types of complex outsourcing engagements create value for the client?
3) How do firms mitigate risks inherent to these engagements?
Impact Measures (Y)
Announcement Returns
Firm-specific daily abnormal returns (ε_it, for firm i on day t)
• Computed as ε_it = r_it − r̂_it, where r̂_it is the daily return predicted by the market model r_it = α_i + β_i r_mt + ε_it, with r_mt the daily return to the value-weighted S&P 500.
• Model used to predict daily returns for each firm over announcement period [-5,+5].
Long Term Returns
Monthly abnormal returns
• Estimated from the Fama- French three-factor model as excess of that achieved by
passive investments in systematic risk factors.
• Expected to be zero under the null hypothesis of market efficiency.
• Used to estimate the implied three-year abnormal return following the contract.
Median Income Efficiency
Income efficiency is estimated as earnings before interest and taxes divided by
number of employees.
• Median of income efficiency for the three-year period following contract
implementation
Six pricing methodologies
Selection Model (tree): splits separate large custom IT tasks (complex) from BPO + simple tasks, with a further split on high trust.
Node-level Performance Analysis: combination contracts create value for complex engagements (custom IT: complex, high trust).
Contract Duration
(Continuous Intervention)
T = Contract duration (months)
Pre-Intervention Variables (X):
Task Type
Bid Type
Contract Value
Uncertainty in business requirements
Outsourcing Experience
Firm Size (market value of equity)
Outcomes (Y)
Announcement Returns
Long Term Returns
Median Income Efficiency
Contract duration has no impact on
performance gains from outsourcing
Contract Duration
Selection Model (regression tree): duration proportional to contract value (nodes 1, 6, 8 vs. 4, 5)
Node-Level Performance Analysis (regression)
• Markets reward long duration for high-value contracts and for low-value, minimal-scope contracts.
• For contracts requiring specific or non-contractible investments, costs outweigh benefits.
Price methodologies: Main insight
Prior research:
• Role of trust only in complex contracts
• Fixed price known to create value; unrelated to trust
Tree finding:
Fixed-price creates long-term market value (not productivity), but only in high-trust contracts!
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 

A Tree-Based Approach for Addressing Self-selection in Impact Studies with Big Data

  • 1. A Tree-Based Approach for Addressing Self-selection in Impact Studies with Big Data Inbal Yahav Galit Shmueli Deepa Mani Bar Ilan University Indian School of Business Israel India @ HKUST Business School Dept of ISBSOM May 16, 2017
  • 2. PART A (BACKGROUND): EXPERIMENTS (& FRIENDS), RANDOMIZATION, AND CAUSAL INFERENCE PART C: OUR NEW TREE APPROACH (AN ALTERNATIVE TO PSM) PART B: DEALING WITH SELF SELECTION (FOR CAUSAL INFERENCE)
  • 3. PART A (BACKGROUND): EXPERIMENTS (& FRIENDS), RANDOMIZATION, AND CAUSAL INFERENCE
  • 4. Experimental Studies • Goal: Causal inference • Effects of causes (causal description) vs. causes of effects (causal explanation) • Manipulable cause
  • 7. Quasi-Experiment (Self-selection or administrator selection) Manipulation Self Selection
  • 8. Alternative explanations Random assignment Balanced groups Confound (third variable) Counterfactual (always?)
  • 9. Experiments & Variations • Randomized experiment (RCT), natural experiment, quasi-experiment • Lab vs. field experiments Validity External validity: generalization Internal validity: alternative explanations, heterogeneous treatment effect
  • 10. PART B: DEALING WITH SELF SELECTION (FOR CAUSAL INFERENCE)
  • 11. Self selection: the challenge • Large impact studies of an intervention • Individuals/firms self-select intervention group/duration • Even in RCT, some variables might remain unbalanced How to identify and adjust for self-selection?
  • 12. Three Applications Impact of training on earnings Field experiment by US govt • LaLonde (1986) compared to observational control • Re-analysis by PSM (Dehejia & Wahba, 1999, 2002) RCT Impact of e-Gov service in India New online passport service • survey of online + offline users • bribes, travel time, etc. Quasi- experiment Impact of outsourcing contract features on financial performance • pricing mechanism • contract duration Observational
  • 13. Common Approaches • Heckman-type modeling • Propensity Score Approach (Rubin & Rosenbaum) Two steps: 1. Selection model: T = f(X) 2. Performance analysis on matched samples Y = performance measure(s) T = intervention X = pre-intervention variables
  • 14. Propensity Scores Approach Step 1: Estimate selection model logit(T) = f(X) to compute propensity scores P(T|X). Step 2: Use the scores to create matched samples (PSM = use a matching algorithm; PSS = divide scores into bins). Step 3: Estimate effect on Y (compare groups), e.g., t-test or Y = b0 + b1 T + b2 X + b3 PS + e. Y = performance measure(s); T = intervention; X = pre-intervention variables. Self-selection: P(T|X) ≠ P(T)
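To make the two-step logic concrete, here is a minimal, self-contained Python sketch (not from the original study; the data, variable names, and scikit-learn/scipy usage are illustrative assumptions): estimate P(T|X) with logistic regression, match each treated record to its nearest-neighbor control on the score, then compare outcomes on the matched samples.

```python
# Minimal sketch of the two-step propensity-score approach.
# All data and names are illustrative, not from the original study.
import numpy as np
from sklearn.linear_model import LogisticRegression
from scipy import stats

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))                      # pre-intervention variables
T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))  # self-selection depends on X
Y = 2.0 * T + X[:, 0] + rng.normal(size=n)       # true effect = 2.0

# Step 1: selection model logit(T) = f(X) -> propensity scores P(T|X)
ps = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]

# Step 2 (PSM variant): 1-nearest-neighbor matching on the score,
# with replacement
treated, control = np.where(T == 1)[0], np.where(T == 0)[0]
matches = control[np.abs(ps[treated][:, None] - ps[control][None, :]).argmin(axis=1)]

# Step 3: estimate the effect on Y using the matched samples
effect = Y[treated].mean() - Y[matches].mean()
t_stat, p_val = stats.ttest_ind(Y[treated], Y[matches])
print(f"estimated effect: {effect:.2f} (p={p_val:.3f})")
```

Note that, as the later slides stress, the matching scheme (with/without replacement, nearest-neighbor vs. caliper) is a researcher choice that can change the estimate.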
  • 15. The Idea of PSM: Balancing “The propensity score allows one to design and analyze an observational (nonrandomized) study so that it mimics some of the particular characteristics of a randomized controlled trial. In particular, the propensity score is a balancing score: conditional on the propensity score, the distribution of observed baseline covariates will be similar between treated and untreated subjects.”
  • 16. Study 1: Impact of training on financial gains (LaLonde 1986) Experiment: US govt program randomly assigns eligible candidates to training program • Goal: increase future earnings • LaLonde (1986) shows: groups statistically equal in terms of demographics & pre-training earnings → ATE = $1,794 (p<0.004)
  • 17. Training effect: Observational control group LaLonde also compared with observational control groups (PSID, CPS) – experimental training group + obs control group – shows training effect not estimated correctly with structural equations PSID = Panel Study of Income Dynamics CPS = Westat’s Matched Current Population Survey (Social Security Administration)
  • 18. This paper compares the effect on trainee earnings of an employment program that was run as a field experiment where participants were randomly assigned to treatment and control groups with the estimates that would have been produced by an econometrician. This comparison shows that many of the econometric procedures do not replicate the experimentally determined results, and it suggests that researchers should be aware of the potential for specification errors in other nonexperimental evaluations.
  • 19. Yahav et al./Tree-Based Approach for Addressing Self-Selection
Table 4. Summary Statistics of Datasets Used by Dehejia and Wahba (1999)
(Average values and standard deviations computed directly from the datasets in http://sekhon.berkeley.edu/matching/lalonde.html.)

Characteristic (variable)                               NSW Treatment   NSW Control     CPS Control
Age (age)                                               25.82 (7.16)    25.05 (7.06)    33.22 (11.05)
Years of schooling (educ)                               10.35 (2.01)    10.09 (1.61)    12.03 (2.87)
Proportion of blacks (black)                            0.84 (0.36)     0.83 (0.38)     0.07 (0.26)
Proportion of Hispanics (hisp)                          0.06 (0.24)     0.11 (0.31)     0.07 (0.26)
Proportion of married (married)                         0.19 (0.39)     0.15 (0.36)     0.71 (0.45)
Proportion of high school dropouts (nodegr)             0.71 (0.46)     0.83 (0.37)     0.30 (0.46)
Real earnings 24 months prior to training, 1974 (re74)  2,096 (4,887)   2,107 (5,688)   14,024 (9,578.99)
Real earnings 12 months prior to training, 1975 (re75)  1,532 (3,219)   1,267 (3,103)   13,642 (9,260)
Proportion of nonworkers in 1974 (u74)                  0.71 (0.46)     0.75 (0.43)     0.88 (0.32)
Proportion of nonworkers in 1975 (u75)                  0.60 (0.49)     0.68 (0.47)     0.89 (0.31)
Outcome: real earnings in 1978 (re78)                   6,349 (7,867)   4,555 (5,484)   14,855 (9,658)
Sample size                                             185             260             15,991

Table 5. Training Effect in NSW Experiment: Comparison between Approaches (Based on DW99 sample. Tree-based results are split by presence/absence of a high school degree. Overall tree-approach training effect is computed by a weighted average of HS degree and HS dropout (computed for comparison only; due to the
  • 20. PSM: Observational control group Dehejia & Wahba (1999, 2002) re-analyzed the CPS control group (n=15,991) using PSM – effects in the range $1,122–$1,681, depending on settings – "best" setting effect: $1,360 – PSM uses only 119 control-group members
  • 21.
  • 22.
  • 23. How did Dehejia & Wahba use PSM? D&W obtained training effects in the range $1,122 to $1,681 under different PSM settings and several matching schemes: • Subset selection with/without replacement, combined with low-to- high/high-to-low/random/nearest-neighbor (NN)/caliper matching. • DW02 show that selection with replacement followed by NN matching best captures the effect of the training program. However, other matching schemes often yield poor performance, such as a negative training effect. • The overall training effect under their best settings (they can compare to the actual experimental results!) is reported as $1,360. • Note: This setting leads to dropping most of the 15,991 control records, with only 119 records remaining in the control group. Bottom line: The researcher must make choices about settings and parameters/bins; different choices can lead to different results.
  • 24. PART C: OUR NEW TREE APPROACH (AN ALTERNATIVE TO PSM)
  • 25. Vol. 40 No. 4, pp. 819-848 / December 2016
  • 26. Challenges of PS in Big Data 1. Matching leads to severe data loss 2. PS methods suffer from “data dredging” 3. No variable selection (cannot identify variables that drive the selection) 4. Assumes constant intervention effect 5. Sequential nature is computationally costly 6. Logistic model requires researcher to specify exact form of selection model
  • 27. Proposed Solution: Tree-based approach [Diagram: (Y, T, X) → propensity scores P(T|X) → E(Y|T)] "Kill the Intermediary": go from (Y, T, X) directly to E(Y|T), and even E(Y|T,X)
  • 28. Classification Tree Output: T (treat/control) Inputs: X’s (income, edu, family…) Records in each terminal node share same profile (X) and same propensity score P(T=1| X)
  • 29. Tree Creation Which algorithm? Conditional-Inference trees (Hothorn et al., 2006) – Stop tree growth using statistical tests of independence – Binary splits
  • 30. Tree-Based Approach Four steps (plus an optional fifth): 1. Run selection model: fit tree T = f(X) 2. Present resulting tree; discover unbalanced X's 3. Treat each terminal node as sub-sample for measuring Y; conduct terminal-node-level performance analysis 4. Present terminal-node analyses visually 5. [optional]: combine analyses from nodes with homogeneous effects Like PS, assumes observable self-selection
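These steps can be sketched in Python as follows. The paper uses conditional-inference trees (R's partykit::ctree); scikit-learn ships no equivalent, so a CART classifier stands in here as an approximation, and all data and names are simulated/illustrative.

```python
# Illustrative sketch of the tree-based approach: fit a tree T = f(X),
# then run the outcome analysis separately inside each terminal node.
# CART approximates the conditional-inference trees used in the paper.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from scipy import stats

rng = np.random.default_rng(1)
n = 4000
X = rng.binomial(1, 0.5, size=(n, 4)).astype(float)    # pre-intervention vars
T = rng.binomial(1, np.where(X[:, 0] == 1, 0.7, 0.3))  # selection on X1
Y = X[:, 0] + T * np.where(X[:, 0] == 1, 2.0, 0.5) + rng.normal(size=n)

# Step 1: selection model -- fit tree T = f(X)
tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=200).fit(X, T)

# Step 2: the tree's splits reveal the unbalanced X's (here X1)
leaves = tree.apply(X)

# Steps 3-4: terminal-node-level performance analysis (effect of T on Y)
for leaf in np.unique(leaves):
    in_leaf = leaves == leaf
    y1, y0 = Y[in_leaf & (T == 1)], Y[in_leaf & (T == 0)]
    if len(y1) > 1 and len(y0) > 1:
        t_stat, p = stats.ttest_ind(y1, y0)
        print(f"node {leaf}: n={in_leaf.sum()}, "
              f"effect={y1.mean() - y0.mean():.2f}, p={p:.3f}")
```

Because the simulated effect differs across X1, the per-node estimates differ too, illustrating how the approach surfaces heterogeneous effects that a single pooled comparison would average away.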
  • 31. Tree on Lalonde’s RCT data If groups are completely balanced, we expect… Y = Earnings in 1978 T = Received NSW training (T = 1) or not (T = 0) X = Demographic information and prior earnings
  • 32. Tree reveals… (tree splits on High school degree: no/yes)

                      LaLonde's naïve          Tree approach:       Tree approach:
                      approach (experiment)    HS dropout (n=348)   HS degree (n=97)
Not trained (n=260)   $4,554                   $4,495               $4,855
Trained (n=185)       $6,349                   $5,649               $8,047
Training effect       $1,794 (p=0.004)         $1,154 (p=0.063)     $3,192 (p=0.015)

Overall tree-approach effect: $1,598 (p=0.017)
1. Unbalanced variable (HS degree) 2. Heterogeneous effect
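The overall tree-approach effect on this slide is the node-size-weighted average of the two terminal-node effects; a quick arithmetic check using the slide's figures:

```python
# Node-size-weighted average of the terminal-node effects from the slide:
# HS dropouts (n=348, effect $1,154) and HS degree (n=97, effect $3,192).
effects = {"HS dropout": (348, 1154), "HS degree": (97, 3192)}
total_n = sum(n for n, _ in effects.values())
overall = sum(n * e for n, e in effects.values()) / total_n
print(round(overall))  # prints 1598, matching the slide's overall effect
```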
  • 33. Tree for obs control group reveals… 1. Unbalanced variables 2. Heterogeneous effect in u74: unemployed prior to training in 1974 (u74=0) → negative effect 3. Outlier 4. Eligibility issue: some profiles are rare in the trained group but popular in the control group
  • 34. Solves challenges of PS in Big Data 1. Matching leads to severe data loss 2. PS methods suffer from “data dredging” 3. No variable selection (cannot identify variables that drive the selection) 4. Assumes constant intervention effect 5. Sequential nature is computationally costly 6. Logistic model requires researcher to specify exact form of selection model
  • 35. Why Trees in Explanatory Study? Flexible non-parametric selection model (f) Automated detection of unbalanced pre-intervention variables (X) Easy to interpret, transparent, visual Applicable to binary, polytomous, continuous intervention (T) Useful in Big Data context Identify heterogeneous effects (effect of T on Y)
  • 36. Survey commissioned by Govt of India in 2006 • >9500 individuals who used passport services • Representative sample of 13 Passport Offices • “Quasi-experimental, non-equivalent groups design” • Equal number of offline and online users, matched by geography and demographics Study 2: Impact of eGov Initiative (India)
  • 37. Current Practice Assess impact by comparing online/offline performance stats
  • 38. Awareness of electronic services provided by Government of India [charts: % bribe RPO, % use agent, % prefer online, % bribe police] 1. Demographics properly balanced 2. Unbalanced variable (Aware) 3. Heterogeneous effects on various y's, including a Simpson's paradox
  • 39. PSM: Awareness of electronic services provided by Government of India. Would we detect this with PSM?
  • 41. Scaling Up to Big Data • We inflated the eGov dataset by bootstrap • Up to 9,000,000 records and 360 variables • 10 runs for each configuration; tree runtime ≈ 20 sec
  • 42. Big Data Simulation

Design factor                    Setting
Intervention types               Binary T ∈ {0, 1}; continuous T ~ N
Sample sizes (n)                 10K, 100K, 1M
#Pre-intervention variables (p)  4, 50 (+interactions)
Pre-intervention variable types  Binary, Likert-scale, continuous
Outcome variable types           Binary, continuous
Selection models                 #1: logit(P(T=1)) = b0 + b1 x1 + … + bp xp
                                 #2: logit(P(T=1)) = b0 + b1 x1 + … + bp xp + interactions

Intervention effects:
  Binary T — homogeneous: E(Y|T=0) = 0.5, E(Y|T=1) = 0.7;
             heterogeneous: E(Y|T=0) = 0.5, E(Y|T=1, X1=0) = 0.7, E(Y|T=1, X1=1) = 0.3
  Continuous T — homogeneous: E(Y|T=0) = 0, E(Y|T=1) = 1;
                 heterogeneous: E(Y|T=0) = 0, E(Y|T=1, X1=0) = 1, E(Y|T=1, X1=1) = -1
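One cell of this simulation design can be sketched as follows (a simplified illustration, not the authors' simulation code): binary T drawn from selection model #1, combined with the heterogeneous binary-outcome effect.

```python
# One configuration from the simulation design: binary T from selection
# model #1, heterogeneous binary outcome. Coefficients are illustrative.
import numpy as np

rng = np.random.default_rng(42)
n, p = 100_000, 4
X = rng.binomial(1, 0.5, size=(n, p))
b = rng.normal(scale=0.5, size=p)

# Selection model #1: logit(P(T=1)) = b0 + b1*x1 + ... + bp*xp
logits = 0.1 + X @ b
T = rng.binomial(1, 1 / (1 + np.exp(-logits)))

# Heterogeneous effect: E(Y|T=0)=0.5, E(Y|T=1,X1=0)=0.7, E(Y|T=1,X1=1)=0.3
p_y = np.where(T == 0, 0.5, np.where(X[:, 0] == 0, 0.7, 0.3))
Y = rng.binomial(1, p_y)
print(Y[(T == 1) & (X[:, 0] == 0)].mean())  # close to 0.7 by construction
```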
  • 43. Results for selection model logit(P(T=1|X)) = b0 + b1 X1 + … + bp Xp; PSS uses 5 bins
  • 44. Big Data Scalability Theoretical complexity: • O(mn/p) for binary X • O((m/p) n log n) for continuous X Runtime as a function of sample size and dimension
  • 45. Scaling Trees Even Further • “Big Data” in research vs. industry • Industrial scaling – Sequential trees: efficient data structure, access (SPRINT, SLIQ, RainForest) – Parallel computing (parallel SPRINT, ScalParC, SPARK, PLANET) “as long as split metric can be computed on subsets of the training data and later aggregated, PLANET can be easily extended”
  • 48. Tree Approach Benefits 1. Data-driven selection model 2. Scales up to Big Data 3. Fewer user choices (less data dredging) 4. Nuanced insights • Detect unbalanced variables • Detect heterogeneous effects and deviations from anticipated outcomes 5. Simple to communicate 6. Automatic variable selection 7. Missing values do not remove a record 8. Binary, multiple, continuous interventions 9. Post-analysis of RCTs, quasi-experiments & observational studies
  • 49. Tree Approach Limits 1. Assumes selection on observables 2. Need sufficient data 3. Continuous variables can lead to large tree 4. Instability [possible solution: use variable importance scores (forest)]
  • 50. Insights from tree-approach in the three applications Labor (Lalonde ‘86) Heterogeneous effect: Impact of training depends on High school diploma Contract Duration First attempt to study effect of duration on contract performance Price Mechanism Heterogeneous effect: Fixed-price creates long- term market value (not productivity), but only in high-trust contracts eGov Heterogeneous effect: Impact of online system depends on user awareness
  • 51. Yahav, I., Shmueli, G. and Mani, D. (2016) “A Tree-Based Approach for Addressing Self-Selection in Impact Studies with Big Data”, MIS Quarterly, vol 40 no 4, pp. 819-848.
  • 52. Impact of IT Outsourcing Contract Attributes How does financial performance of outsourcing contracts vary with two attributes of the contract: • Pricing mechanisms (6 options) • Contract duration (continuous) Observational Data • >1400 contracts, implemented 1996-2008 • 374 vendors and 710 clients • Obtained from IDC database, Lexis-Nexis, COMPUSTAT, etc.
  • 53. T = Six Pricing Mechanisms (polytomous intervention) Interventions (T): 1. Fixed Price 2. Transactional Price 3. Time-and-Materials 4. Incentive 5. Combination 6. Joint Venture Fixed Price Variable Price Pre-Intervention Variables (X): Task Type Bid Type Contract Value Uncertainty in business requirements Outsourcing Experience Firm Size (market value of equity) Outcomes (Y): Announcement Returns Long Term Returns Median Income Efficiency
  • 54. Six Pricing Mechanisms (polytomous intervention) Interventions (T): 1. Fixed Price 2. Transactional Price 3. Time-and-Materials 4. Incentive 5. Combination 6. Joint Venture Fixed Price - Fixed payment per billing cycle Transactional - Fixed payment per transaction per billing cycle Time and Materials - Payment based on input time and materials used during billing cycle Incentive - Payment based on output improvements against key performance indicators or any combination of indicators Combination - A combination of any of the above contract types, largely fixed price and time and materials Joint Venture - A separately incorporated entity, jointly owned by the client and the vendor, used to govern the outsourcing relationship. Fixed Price Variable Price
  • 55. Six Price Mechanisms Questions of interest: 1)Do all simple outsourcing engagements, governed by fixed or transactional price contracts, create value? 2)What types of complex outsourcing engagements create value for the client? 3)How do firms mitigate risks inherent to these engagements?
  • 56. Impact Measures (Y) Announcement Returns: firm-specific daily abnormal returns (ε̂_it, for firm i on day t) • Computed as ε̂_it = r_it − r̂_it, where r̂_it is the predicted daily return (relative to the value-weighted S&P 500), estimated from the market model r_it = α_i + β_i r_mt + ε_it • The model is used to predict daily returns for each firm over the announcement period [−5, +5]. Long Term Returns: monthly abnormal returns • Estimated from the Fama-French three-factor model as the excess over that achieved by passive investments in systematic risk factors • Expected to be zero under the null hypothesis of market efficiency • Used to estimate the implied three-year abnormal return following the contract. Median Income Efficiency: income efficiency is estimated as earnings before interest and taxes divided by number of employees • Median of income efficiency for the three-year period following contract implementation
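As an illustration of the announcement-return computation (a sketch on simulated data, not the study's code): fit the market model by OLS on an estimation window, then take abnormal returns as actual minus predicted over the [−5, +5] event window.

```python
# Sketch of market-model abnormal returns: fit r_it = a_i + b_i*r_mt + e_it
# by OLS on an estimation window, then abnormal return = actual - predicted
# over the 11-day event window. All returns here are simulated.
import numpy as np

rng = np.random.default_rng(7)
r_m = rng.normal(0.0005, 0.01, size=250)              # market index returns
r_i = 0.0002 + 1.2 * r_m + rng.normal(0, 0.005, 250)  # one firm's returns

est, event = slice(0, 239), slice(239, 250)        # estimation vs event window
beta, alpha = np.polyfit(r_m[est], r_i[est], 1)    # OLS: slope, intercept

abnormal = r_i[event] - (alpha + beta * r_m[event])
car = abnormal.sum()                               # cumulative abnormal return
print(f"CAR over event window: {car:.4f}")
```

With no simulated event effect, the CAR hovers near zero; in the study, a nonzero announcement return is the impact measure of interest.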
  • 57. Six pricing methodologies: Selection Model [tree annotations: large custom IT tasks (complex); BPO + simple tasks; high-trust]
  • 58. Node-level Performance Analysis Combination contracts create value for complex engagements Custom IT (complex, high trust)
  • 59. Contract Duration (Continuous Intervention) T = Contract duration (months) Pre-Intervention Variables (X): Task Type Bid Type Contract Value Uncertainty in business requirements Outsourcing Experience Firm Size (market value of equity) Outcomes (Y) Announcement Returns Long Term Returns Median Income Efficiency Contract duration has no impact on performance gains from outsourcing
  • 60. Contract Duration Selection Model (regression tree) Duration proportional to contract value (1,6,8 vs. 4,5)
  • 61. Node-Level Performance Analysis (Regression) Positive: markets reward long-term contracts for high-value contracts and for low-value minimal-scope contracts. Negative: in contracts requiring specific or non-contractible investments, the costs outweigh the benefits.
  • 62. Price methodologies: Main insight Prior research: • Role of trust only in complex contracts • Fixed price known to create value; unrelated to trust Tree finding: Fixed-price creates long-term market value (not productivity), but only in high-trust contracts!

Editor's notes

  1. Experimental causes must be manipulable (Shadish, Cook & Campbell, p8)
  2. http://sekhon.berkeley.edu/matching/lalonde.html
  3. Joint ventures are equity arrangements, so, they usually don't come in the gambit of contractual agreements. Positive long term returns for Transactional: Market underestimates the value created by transactional pricing contracts. Positive income efficiency gains for Fixed Price, Incentive, Combination – but not impounded in the market. Q’s that arise: Do all simple outsourcing engagements, governed by fixed or transactional price contracts, create value? What types of complex outsourcing engagements create value for the client? How do firms mitigate risks inherent to these engagements?
  4. Joint ventures are equity arrangements, so, they usually don't come in the gambit of contractual agreements. Positive long term returns for Transactional: Market underestimates the value created by transactional pricing contracts. Positive income efficiency gains for Fixed Price, Incentive, Combination – but not impounded in the market. Q’s that arise: Do all simple outsourcing engagements, governed by fixed or transactional price contracts, create value? What types of complex outsourcing engagements create value for the client? How do firms mitigate risks inherent to these engagements?
  5. Fixed and Transactional: positive Med.Income.Eff but zero Ann.Returns = market underestimates the value created by these contracts when they are announced. Nodes 1,4 – Incentive. No data for Node 4 Ann.Returns.
  6. Novelty: literature has not investigated contract duration as a self-selected intervention; no tool for continuous intervention.
  7. Ann.Returns: Positive: Markets reward long-term contracts for high-value contracts (node 5) and low-value custom-IT (=minimal scope) contracts (node 8) Negative: In long-term IT contracts that require specific or non-contractible investments (nodes 3, 10), as the scope of the engagement increases, the costs outweigh the benefits.