Off policy evaluation -survey-
Masatoshi Uehara (Harvard University)
December 25, 2019
Disclaimer: this is a very casual note.
Overview
1. Motivation
2. Contextual bandit setting (with parametric models)
3. Bandit setting (with nonparametric models)
4. RL setting (sequential or longitudinal setting)
5. Open problems (general DAG, mediation, interference)
Off policy evaluation (OPE)
The goal is to evaluate the value of a policy from historical data. More formally, we estimate the value of the evaluation policy $\pi_e$ from data obtained by the behavior policy $\pi_b$.
Some notations from semiparametric theory
Refer to van der Vaart (1998); Bickel et al. (1998); Tsiatis (2006); Kennedy (2016).
(Semiparametric models)... Combination of parametric and nonparametric models
(Semiparametric efficiency bound)... Extension of the Cramér-Rao lower bound for parametric models to semiparametric models
(Influence function (IF) of the estimator and estimand)... $\phi(x)$ for $\hat{\theta}$ or $\theta^*$:
$$\sqrt{N}(\hat{\theta} - \theta^*) = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \phi(x^{(i)}) + o_p(1/\sqrt{N})$$
(Efficient influence function (EIF))... IF of the estimand minimizing the variance
(Efficient estimator)... Estimator achieving the efficiency bound
Contextual bandit setting
Setting
We have $\{s^{(i)}, a^{(i)}, r^{(i)}\}_{i=1}^{N} \sim p(s)\,\pi_b(a \mid s)\,p(r \mid s, a)$. We want to estimate
$$E_{\pi_e}[r] = \int r\, p(s)\, \pi_e(a \mid s)\, p(r \mid s, a)\, d\mu(r, s, a).$$
Good surveys: (Rotnitzky and Vansteelandt, 2014; Seaman and Vansteelandt, 2018; Huber, 2019; Diaz, 2019)
Unless otherwise noted, the expectation is taken w.r.t. the behavior policy.
Extension to the counterfactual setting is easy.
$E_N[\cdot]$... Empirical approximation
Value functions and Q-functions are defined for the evaluation policy.
CB: Semiparametric lower bound
The efficiency bound under the nonparametric model is
$$\mathrm{var}\{v(s)\} + E\{\eta(s, a)^2\, \mathrm{var}(r \mid s, a)\},$$
where $E(r \mid s, a) = q(s, a)$, $E_{\pi_e}\{E(r \mid s, a) \mid s\} = v(s)$, and $\eta(s, a) = \pi_e/\pi_b$.
How to obtain it?
Approximate your infinite-dimensional model by a parametric submodel. Then, calculate the supremum of the Cramér-Rao lower bounds over such submodels.
Implication of the semiparametric lower bound
The semiparametric lower bound is the lower bound of the asymptotic MSE among regular estimators. Therefore, for example,
$$\mathrm{var}\{v(s)\} + E\{\eta(s, a)^2\, \mathrm{var}(r \mid s, a)\} < \mathrm{var}\{\eta(s, a)\, r\}.$$
Importantly, this lower bound does not change whether the behavior policy is known or not.
Common estimators
IS (Importance sampling, a.k.a. IPW, Horvitz-Thompson):
$$E_N[\hat{\eta}(s, a)\, r], \quad \eta(s, a) = \frac{\pi_e(a \mid s)}{\pi_b(a \mid s)}$$
NIS (Normalized IS):
$$E_N[\hat{\eta}(s, a)\, r] \,/\, E_N[\hat{\eta}(s, a)]$$
DM (Direct method):
$$E_N[\hat{q}(s, a)], \quad (E[r \mid s, a] = q(s, a))$$
AIS (Augmented IS (Robins et al., 1994; Dudik et al., 2014)):
$$E_N\left[\hat{\eta}(s, a)(r - \hat{q}(s, a)) + \hat{v}(s)\right], \quad (\hat{v}(s) = E_{\pi_e}[\hat{q}(s, a) \mid s])$$
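As a concrete reference point, here is a minimal numerical sketch of these four estimators, assuming the data and nuisance estimates are already available as NumPy arrays (all names are illustrative; DM here plugs in the fitted value $\hat{v}(s)$):

```python
import numpy as np

def ope_estimates(r, eta_hat, q_hat, v_hat):
    """Sketch of IS, NIS, DM, and AIS estimates of E_{pi_e}[r].

    r:       observed rewards, shape (N,)
    eta_hat: estimated ratios pi_e(a_i|s_i) / pi_b(a_i|s_i), shape (N,)
    q_hat:   fitted outcome model q(s_i, a_i), shape (N,)
    v_hat:   fitted values v(s_i) = E_{pi_e}[q(s_i, a) | s_i], shape (N,)
    """
    return {
        "IS":  np.mean(eta_hat * r),                    # importance sampling
        "NIS": np.sum(eta_hat * r) / np.sum(eta_hat),   # self-normalized IS
        "DM":  np.mean(v_hat),                          # direct method (plug-in)
        "AIS": np.mean(eta_hat * (r - q_hat) + v_hat),  # augmented IS / doubly robust
    }
```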
Useful properties for AIS 1
Model double robustness (in terms of consistency and $\sqrt{N}$-consistency): the slide's 2x2 table over "$\eta(s, a) \approx \hat{\eta}(s, a)$?" and "$q(s, a) \approx \hat{q}(s, a)$?" shows that AIS is consistent whenever at least one of the two nuisance models is well-specified.
Useful properties for AIS 2
Rate double robustness
$\|\hat{\eta} - \eta\|_2 = o_p(N^{-1/4})$ and $\|\hat{q} - q\|_2 = o_p(N^{-1/4})$ are sufficient conditions to guarantee efficiency (Chernozhukov et al., 2018; Rotnitzky and Smucler, 2019).
Fact regarding plug-in
In general, even if nuisance functions are estimated at a parametric $\sqrt{N}$-rate, plugging them in changes the asymptotic variance.
Thanks to the orthogonality of the IF, however, the asymptotic variance here is not changed by the plug-in (Rotnitzky et al., 2019).
Double robust IS or double robust direct estimator
Double robust regression estimator (Scharfstein et al., 1999; Kang and Schafer, 2007)
Learn $q(s, a)$ with covariates including $\hat{\eta}(s, a)$ (weighted regression).
Define the estimator as $E_N[\hat{q}(s, a)]$.
This is double robust!!
Close to TMLE.
Double robust IS estimator (Robins et al., 2007)
Learn $\eta(s, a)$ with covariates based on $\hat{q}(s, a)$.
Define the IS estimator $E_N[\hat{\eta}(s, a)\, r]$.
This is double robust.
Close to TMLE.
More doubly robust (MDR) estimator
Motivation... AIS has poor performance when $q(s, a)$ is mis-specified (Rubin and van Der Laan, 2008; Cao et al., 2009).
MDR
MDR minimizes the variance among some class of estimators irrespective of the model specification of $q(s, a)$.
When the behavior policy is known, the Q-function is estimated by minimizing the asymptotic variance over $q \in \mathcal{F}_q$:
$$\hat{q} = \arg\min_{q \in \mathcal{F}_q} \left[ \mathrm{var}\{v(s)\} + E\{\eta(s, a)^2\, \mathrm{var}(r \mid s, a)\} \right].$$
Then, plug it into DR.
(Property)... Still double robust
Can be extended to the case where the behavior policy is unknown.
Extension to RL (Farajtabar et al., 2018)
Intrinsic efficient estimator
Motivation... The performance of AIS can become worse than IS or NIS (when q-models are mis-specified).
Intrinsic efficient estimator (Tan, 2006, 2010)
Construct a class of estimators including IS and NIS, and optimize within it so that the variance is minimized.
(Property)... Still double robust, and better than IS and NIS
Extension to RL (Kallus and Uehara, 2019c)
Bias reduced estimator (Vermeulen and Vansteelandt, 2015)
(Motivation)... What happens when both models are mis-specified?
Vermeulen and Vansteelandt (2015) introduced an estimator based on the idea of reducing the MSE irrespective of model specifications.
(Property)... Double robust and robust to model misspecification!!
Nonparametric IS (Hirano et al., 2003)
IS where $\pi_b$ is estimated nonparametrically.
This achieves the efficiency bound under some smoothness conditions.
Plug-in paradox (Robins et al., 1992; Henmi and Eguchi, 2004; Henmi et al., 2007)
A plug-in estimator based on the MLE is more efficient than the non-plug-in estimator.
If so, is the plug-in IS estimator better than the non-plug-in estimator?
Yes, if models are well-specified; it implicitly uses a control variate (Robins et al., 2007).
No, if models are mis-specified.
Nonparametric direct method
Hahn (1998) introduced an estimator based on the direct method where $q(s, a)$ is estimated nonparametrically.
This achieves the efficiency bound under some smoothness conditions.
Parametric direct method
A.k.a. the G-formula (Hernan and Robins, 2019).
We can also assume a parametric model for $q(s, a)$ directly (semiparametric direct method).
The efficiency bound under a parametric q-model is smaller than the efficiency bound under the nonparametric model (Tan, 2007).
Double debiased machine learning (Chernozhukov et al., 2018)
The estimator is
$$E_N\left[\hat{\mu}(s, a)(r - \hat{q}(s, a)) + \hat{v}(s)\right], \quad (E_{\pi_e}[r \mid s] = v(s))$$
with cross-fitting (a.k.a. sample splitting) (van der Vaart, 1998).
Both $\mu$ and $q$ are estimated nonparametrically.
Rate double robustness is attained without Donsker conditions on the nuisance estimators.
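A sketch of what cross-fitting looks like for the DR score, assuming binary actions, sklearn-style nuisance learners, and a user-supplied function `pi_e(s, a)` returning evaluation-policy probabilities (all names are illustrative, not the authors' code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def cross_fitted_dr(s, a, r, pi_e, n_folds=2, seed=0):
    """Cross-fitted doubly robust estimate of E_{pi_e}[r] (binary actions).

    Nuisances (behavior policy, q-model) are fit on the other folds and
    evaluated on the held-out fold; the DR scores are then averaged.
    """
    rng = np.random.default_rng(seed)
    n = len(r)
    folds = rng.integers(0, n_folds, size=n)
    scores = np.empty(n)
    for k in range(n_folds):
        tr, te = folds != k, folds == k
        m = te.sum()
        pb = LogisticRegression().fit(s[tr], a[tr])              # behavior policy model
        qm = LinearRegression().fit(np.c_[s[tr], a[tr]], r[tr])  # outcome (q) model
        pb_prob = pb.predict_proba(s[te])[np.arange(m), a[te]]   # pi_b_hat(a_i|s_i)
        eta = pi_e(s[te], a[te]) / pb_prob                       # density ratio
        q_hat = qm.predict(np.c_[s[te], a[te]])
        v_hat = sum(pi_e(s[te], np.full(m, act)) *               # v_hat = sum_a pi_e * q_hat
                    qm.predict(np.c_[s[te], np.full(m, act)])
                    for act in (0, 1))
        scores[te] = eta * (r[te] - q_hat) + v_hat
    return scores.mean()
```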
TMLE (Rubin, 2006; van der Laan, 2011; Benkeser et al., 2017)
TMLE?... Update the estimator based on the efficient influence function of the target (a super-learner is also used here).
When the EIF can be written analytically, TMLE reduces to a one-step estimator; see the double robust IS / direct estimator slide above.
When the EIF does not have a closed form, it is an iterative estimator.
Collaborative double robustness (van Der Laan and Gruber, 2010; Diaz, 2018)
Other important estimators
Switching estimator (Tsiatis and Davidian, 2007; Wang et al., 2017)
Matching estimator (Abadie and Imbens, 2006; Wang and Zubizarreta, 2019a)
Covariate balancing with various divergences (Imai and Ratkovic, 2014; Wang and Zubizarreta, 2019b)
Minimax estimator (Kallus, 2018; Chernozhukov et al., 2018; Hirshberg and Wager, 2019)
High-dimensional setting (many; e.g., Farrell (2015); Smucler et al. (2019))
Continuous treatment (the estimand is the difference) (Kennedy et al., 2017)
Finite population inference (Bojinov and Shephard, 2019)
Multiple robustness (Rotnitzky et al., 2017)
RL setting (Application)
Figure: ADHD example (Chakraborty, 2009)
Summary of RL situation
Table: Efficiency bounds and estimators for OPE

        | Efficiency bound          | Efficient estimator
NMDP    | Kallus and Uehara (2019a) | Jiang and Li (2016); Thomas and Brunskill (2016)
TMDP    | Kallus and Uehara (2019a) | Kallus and Uehara (2019a)
MDP     | Kallus and Uehara (2019b) | Kallus and Uehara (2019b)

Jiang and Li (2016) also calculated the bounds for NMDP and TMDP in the tabular case.
Note that the efficiency bound and estimator under NMDP are essentially already given in the causal inference literature (Murphy, 2003; van Der Laan and Robins, 2003; Bang and Robins, 2005).
MDP
MDP = $\{S, A, R, p\}$
$S, A, R$... state space, action space, reward space
Transition density... $p(s' \mid s, a)$
Reward distribution... $p(r \mid s, a)$
Initial distribution... $p^{(0)}(s_0)$
Evaluation policy $\pi_e(a \mid s)$, behavior policy $\pi_b(a \mid s)$
The distribution induced by the MDP and the behavior policy is
$$p(s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \cdots) = p^{(0)}(s_0)\,\pi_b(a_0 \mid s_0)\,p(r_0 \mid s_0, a_0)\,p(s_1 \mid s_0, a_0)\,\pi_b(a_1 \mid s_1)\,p(r_1 \mid s_1, a_1)\cdots.$$
Figure: MDP (graph over $s_0, a_0, r_0, s_1, a_1, r_1, s_2, \ldots$)
NMDP and TMDP
MDP can be relaxed in two ways: NMDP (without Markovness) and TMDP (without time-invariance).
Figure: NMDP (Non-Markov Decision Process)
Figure: TMDP (Time-varying Markov Decision Process)
Goal in OPE for RL
[Goal] Estimate $\rho^{\pi_e}$:
$$\rho^{\pi_e} = (1 - \gamma) \sum_{t=0}^{\infty} E_{\pi_e}[\gamma^t r_t], \quad (\gamma < 1).$$
Note that this expectation is taken w.r.t.
$$p_e^{(0)}(s_0)\,\pi_e(a_0 \mid s_0)\,p(r_0 \mid s_0, a_0)\,p(s_1 \mid s_0, a_0)\,\pi_e(a_1 \mid s_1)\,p(r_1 \mid s_1, a_1)\cdots.$$
We can use a set of samples generated by the MDP and the behavior policy $\pi_b$,
$$\{s_t^{(i)}, a_t^{(i)}, r_t^{(i)}\}_{i=1,\,t=0}^{N,\,T},$$
following
$$p_b^{(0)}(s_0)\,\pi_b(a_0 \mid s_0)\,p(r_0 \mid s_0, a_0)\,p(s_1 \mid s_0, a_0)\,\pi_b(a_1 \mid s_1)\,p(r_1 \mid s_1, a_1)\cdots.$$
Common three approaches
DM (Direct Method) estimator:
$$\hat{\rho}_{DM} = (1 - \gamma)\, E_N\big[E_{\pi_e}[\hat{q}(s_0, a_0) \mid s_0]\big],$$
where $E[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0, a_0] = q(s_0, a_0)$.
SIS (Sequential Importance Sampling) estimator:
$$\hat{\rho}_{SIS} = (1 - \gamma)\, E_N\Big[\sum_{t=0}^{T} \gamma^t \nu_t r_t\Big],$$
where
$$\nu_t(\mathcal{H}_{a_t}) = \prod_{k=0}^{t} \eta_k(s_k, a_k), \quad \eta_k(s_k, a_k) = \frac{\pi_e(a_k \mid s_k)}{\pi_b(a_k \mid s_k)}.$$
Double Robust (DR) estimator (Jiang and Li, 2016; Thomas and Brunskill, 2016):
$$\hat{\rho}_{DR} = (1 - \gamma)\, E_N\Big[\sum_{t=0}^{T} \gamma^t \big(\nu_t(r_t - \hat{q}_t) + \nu_{t-1}\, \hat{v}_t(s_t)\big)\Big].$$
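A sketch of the three estimators on trajectory data, assuming the per-step ratios and fitted $\hat{q}_t$, $\hat{v}_t$ are pre-computed as (N, T+1) arrays (names are illustrative):

```python
import numpy as np

def rl_ope(eta, r, q_hat, v_hat, gamma):
    """DM, SIS, and DR estimates of rho^{pi_e} from N trajectories of length T+1.

    eta:   (N, T+1) per-step ratios pi_e(a_t|s_t) / pi_b(a_t|s_t)
    r:     (N, T+1) rewards
    q_hat: (N, T+1) fitted q_t(s_t, a_t) (discounted reward-to-go model)
    v_hat: (N, T+1) fitted v_t(s_t) = E_{pi_e}[q_t(s_t, a) | s_t]
    """
    N, T1 = r.shape
    disc = gamma ** np.arange(T1)                       # gamma^t
    nu = np.cumprod(eta, axis=1)                        # nu_t = prod_{k<=t} eta_k
    nu_prev = np.hstack([np.ones((N, 1)), nu[:, :-1]])  # nu_{t-1}, with nu_{-1} = 1
    dm = (1 - gamma) * v_hat[:, 0].mean()               # DM uses only v_0(s_0)
    sis = (1 - gamma) * (disc * nu * r).sum(axis=1).mean()
    dr = (1 - gamma) * (disc * (nu * (r - q_hat) + nu_prev * v_hat)).sum(axis=1).mean()
    return dm, sis, dr
```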
Curse of horizon
$$E_{\pi_e}\Big[\sum_{t=0}^{T} \gamma^t r_t\Big] = E_{\pi_b}\Big[\prod_{k=0}^{T} \frac{\pi_e(a_k \mid s_k)}{\pi_b(a_k \mid s_k)} \sum_{t=0}^{T} \gamma^t r_t\Big] = E_{\pi_b}\Big[\sum_{t=0}^{T} \prod_{k=0}^{t} \frac{\pi_e(a_k \mid s_k)}{\pi_b(a_k \mid s_k)}\, \gamma^t r_t\Big] = E_{\pi_b}\Big[\sum_{t=0}^{T} \gamma^t \nu_t r_t\Big] \approx E_N\Big[\sum_{t=0}^{T} \gamma^t \nu_t r_t\Big]$$
Problem: the variance grows exponentially w.r.t. T.
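A toy simulation of this blow-up, assuming i.i.d. per-step ratios with mean 1 but second moment $C > 1$, so that $\mathrm{var}(\nu_T) = C^{T+1} - 1$ grows geometrically (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Log-normal ratios with E[eta] = 1 and C = E[eta^2] = exp(0.25) ~ 1.28.
eta = rng.lognormal(mean=-0.125, sigma=0.5, size=(100_000, 40))
nu = np.cumprod(eta, axis=1)  # cumulative importance weights nu_t
for T in (4, 9, 19, 39):
    # Empirical variance; theory predicts exp(0.25 * (T + 1)) - 1.
    print(f"T={T:2d}  var(nu_T) ~ {nu[:, T].var():.1f}")
```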
Curse of horizon
The SIS and DR estimators suffer from the curse of horizon.
The DM estimator does not, but it suffers from model misspecification.
Q: Are there any solutions?
A: The MDP assumptions are not yet fully exploited:
Markov assumption
Time-invariance assumption
Leveraging Markovness
Xie et al. (2019) proposed a marginal importance sampling estimator;
$$E_{\pi_e}\Big[\sum_{t=0}^{T} \gamma^t r_t\Big] = E_{\pi_b}\Big[\sum_{t=0}^{T} \prod_{k=0}^{t} \frac{\pi_e(a_k \mid s_k)}{\pi_b(a_k \mid s_k)}\, \gamma^t r_t\Big] = E_{\pi_b}\Big[\sum_{t=0}^{T} \gamma^t \mu_t r_t\Big] \approx E_N\Big[\sum_{t=0}^{T} \gamma^t \mu_t r_t\Big].$$
Here, $\mu_t$ is the marginal density ratio at time $t$;
$$\mu_t = \frac{p_{\pi_e}(s_t, a_t)}{p_{\pi_b}(s_t, a_t)}.$$
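One simple tabular instantiation of this idea, assuming finite state/action spaces and estimating the marginals $p_{\pi}(s_t)$ by rolling an empirical transition model forward (an illustrative sketch, not the actual procedure of Xie et al. (2019)):

```python
import numpy as np

def marginal_is(s, a, r, pi_e, pi_b, n_states, n_actions, gamma):
    """Tabular marginal IS: weight r_t by mu_t = p_{pi_e}(s_t,a_t) / p_{pi_b}(s_t,a_t).

    s, a: (N, T+1) integer state/action indices; r: (N, T+1) rewards;
    pi_e, pi_b: (n_states, n_actions) policy matrices.
    """
    N, T1 = r.shape
    P = np.zeros((n_states, n_actions, n_states))   # empirical P(s'|s,a)
    for t in range(T1 - 1):
        np.add.at(P, (s[:, t], a[:, t], s[:, t + 1]), 1.0)
    P /= np.clip(P.sum(axis=2, keepdims=True), 1e-12, None)
    d0 = np.bincount(s[:, 0], minlength=n_states) / N  # empirical initial dist
    de, db = d0.copy(), d0.copy()                      # state marginals under pi_e, pi_b
    est = 0.0
    for t in range(T1):
        num = de[s[:, t]] * pi_e[s[:, t], a[:, t]]
        den = np.clip(db[s[:, t]] * pi_b[s[:, t], a[:, t]], 1e-12, None)
        est += (gamma ** t) * np.mean((num / den) * r[:, t])
        de = np.einsum('s,sa,sax->x', de, pi_e, P)     # propagate marginals one step
        db = np.einsum('s,sa,sax->x', db, pi_b, P)
    return (1 - gamma) * est
```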
Efficiency bound under NMDP and TMDP
Theorem (EB under NMDP)
$$EB(M_1) = (1 - \gamma)^2 \sum_{k=1}^{\infty} E\big[\gamma^{2(k-1)}\, \nu_{k-1}^2\, \mathrm{var}\big(r_{k-1} + v_k \mid \mathcal{H}_{a_{k-1}}\big)\big].$$
Theorem (EB under TMDP)
$$EB(M_2) = (1 - \gamma)^2 \sum_{k=1}^{\infty} E\big[\gamma^{2(k-1)}\, \mu_{k-1}^2\, \mathrm{var}\big(r_{k-1} + v_k \mid a_{k-1}, s_{k-1}\big)\big].$$
The typical behavior of $\nu_{k-1}^2$ is $O(C^k)$; the typical behavior of $\mu_{k-1}^2$ is $O(1)$.
The variance does not grow exponentially w.r.t. T under TMDP.
Double Reinforcement Learning (for TMDP)
Kallus and Uehara (2019a) proposed an estimator (DRL) achieving the efficiency bound under TMDP;
$$\hat{\rho}_{DRL}(M_2) = (1 - \gamma)\, E_N\Big[\sum_{t=0}^{T} \gamma^t \big(\hat{\mu}_t(r_t - \hat{q}_t) + \hat{\mu}_{t-1}\, \hat{v}_t(s_t)\big)\Big].$$
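In the DR sketch given earlier, replacing the cumulative weights $\nu_t$ (the `nu` array) by estimated marginal ratios $\hat{\mu}_t$ yields a minimal version of this estimator; the structure of the correction terms is otherwise unchanged.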
Double robustness of DRL for TMDP
Model double robustness (also rate double robustness): the slide's 2x2 table over "$\mu_t(s, a) \approx \hat{\mu}_t(s, a)$?" and "$q_t(s, a) \approx \hat{q}_t(s, a)$?" shows that DRL is consistent whenever at least one of the two nuisance models is well-specified.
Curse of horizon (again)
Q: Is the curse of horizon solved?
A: At least, the variance does not blow up w.r.t. the horizon. But the rate is still not right under MDP.
Correct rate for OPE under ergodic MDP
The rate (MSE) of the estimators introduced so far is 1/N.
However, we can learn the estimand at a 1/(NT) rate assuming ergodicity.
Importantly, we can then learn from a single trajectory (N = 1, T → ∞).
Leveraging time-invariance
Liu et al. (2018) proposed an ergodic importance sampling estimator;
$$\lim_{T \to \infty}(1 - \gamma)\, E_{\pi_e}\Big[\sum_{t=0}^{T} \gamma^t r_t\Big] = \int r\, p^{\infty}_{e,\gamma}(s, a)\, d\mu(s, a, r) = \int r\, \frac{p^{\infty}_{e,\gamma}(s, a)}{p^{\infty}_{b}(s, a)}\, p^{\infty}_{b}(s, a)\, d\mu(s, a, r) = E_{\pi^{\infty}_{b}}[r\, w(s, a)] \approx E_N E_T[r\, w(s, a)] = \frac{1}{N}\frac{1}{T} \sum_{i=1}^{N} \sum_{t=1}^{T} r^{(i)}_t\, w(s^{(i)}_t, a^{(i)}_t),$$
where $p^{\infty}_{e,\gamma}(s, a)$ is the average visitation distribution of states and actions, and
$$w(s, a) = \frac{p^{\infty}_{e,\gamma}(s, a)}{p^{\infty}_{b}(s, a)}.$$
Efficiency bound under ergodic MDP
The lower bound of the asymptotic MSE scaled by NT among regular estimators is
$$EB(M_3) = E_{p^{(\infty)}_{b}}\Big[\underbrace{w^2(s, a)}_{\text{distribution mismatch}}\; \underbrace{\{r + \gamma v(s') - q(s, a)\}^2}_{\text{Bellman squared residual}}\Big].$$
Table: Comparison regarding rate

        | Rate     | Curse of horizon
NMDP    | O(1/N)   | Yes
TMDP    | O(1/N)   | No, but the rate is still O(1/N)
MDP     | O(1/NT)  | No
Efficient estimator under ergodic MDP
Defining $v(s_0) = E_{\pi_e}[q(s_0, a_0) \mid s_0]$, the efficient estimator $\hat{\rho}_{DRL}(M_3)$ is defined as
$$(1 - \gamma)\, E_N E_{p^{(0)}_{e}}[\hat{v}(s_0)] + E_N E_T\big[\hat{w}(s, a)\big(r + \gamma \hat{v}(s') - \hat{q}(s, a)\big)\big],$$
or, equivalently,
$$E_N E_T[\hat{w}(s, a)\, r] + (1 - \gamma)\, E_N E_{p^{(0)}_{e}}[\hat{v}(s_0)] + E_N E_T\big[\hat{w}(s, a)\big(\gamma \hat{v}(s') - \hat{q}(s, a)\big)\big].$$
In each expression, the first term corresponds to the DM or IS estimator, and the remaining terms are control variates.
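A minimal sketch of the first form, assuming the nuisances $\hat{w}$, $\hat{q}$, $\hat{v}$ are given as callables and the transitions are flattened over all N trajectories and T steps (names are illustrative):

```python
import numpy as np

def drl_ergodic(s, a, r, s_next, s0, w_hat, q_hat, v_hat, gamma):
    """Efficient (DRL) estimate of rho^{pi_e} under an ergodic MDP: the DM term
    on initial states plus a w-weighted Bellman-residual correction term."""
    correction = w_hat(s, a) * (r + gamma * v_hat(s_next) - q_hat(s, a))
    return (1 - gamma) * np.mean(v_hat(s0)) + np.mean(correction)
```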
Double robustness of DRL for MDP
Model double robustness (also rate double robustness): the slide's 2x2 table over "$w(s, a) \approx \hat{w}(s, a)$?" and "$q(s, a) \approx \hat{q}(s, a)$?" shows that DRL is consistent whenever at least one of the two nuisance models is well-specified.
General causal DAG (when all variables are measured)
Given a causal DAG (FFRCISTG), the G-formula or the IS estimator gives an identification formula (Hernan and Robins, 2019).
How to obtain an efficient estimator?... See van Der Laan and Robins (2003).
The problem is how the estimator can be simplified (Rotnitzky and Smucler, 2019).
General causal DAG (with unmeasured variables)
The ID algorithm (Shpitser and Pearl, 2008; Tian, 2008; Shpitser and Sherman, 2018) gives a sufficient and necessary identification formula.
The relation to efficient estimation is still an open problem.
Figure: With unmeasured confounding
Mediation effect (pathway effect)
The modified ID algorithm (Shpitser and Sherman, 2018) gives a sufficient and necessary identification formula.
The efficiency theory is still being constructed (Nabi et al., 2018).
Figure: Edge intervention
Network, interference
Some estimation methods and their theory exist (Ogburn et al., 2017).
Chain graphs are also useful for the network setting (Ogburn et al., 2018).
Figure: Chain graph
An identification formula is given (Sherman and Shpitser, 2018).
Since the units are not i.i.d., estimation is difficult:
E.g., to estimate from a single network, ergodicity is needed.
Ordinary semiparametric theory assumes i.i.d. data.
References
Abadie, A. and G. W. Imbens (2006). Large sample properties of matching estimators
for average treatment effects. Econometrica 74, 235–267.
Bang, H. and J. M. Robins (2005). Doubly robust estimation in missing data and causal
inference models. Biometrics 61, 962–973.
Benkeser, D., M. Carone, M. J. V. D. Laan, and P. B. Gilbert (2017). Doubly robust
nonparametric inference on the average treatment effect. Biometrika 104, 863–880.
Bickel, P. J., C. A. J. Klaassen, Y. Ritov, and J. A. Wellner (1998). Efficient and
Adaptive Estimation for Semiparametric Models. Springer.
Bojinov, I. and N. Shephard (2019). Time series experiments and causal estimands:
exact randomization tests and trading. Journal of the American Statistical
Association.
Cao, W., A. A. Tsiatis, and M. Davidian (2009). Improving efficiency and robustness of
the doubly robust estimator for a population mean with incomplete data.
Biometrika 96, 723–734.
Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and
J. Robins (2018). Double/debiased machine learning for treatment and structural
parameters. Econometrics Journal 21, C1–C68.
Chernozhukov, V., W. Newey, J. Robins, and R. Singh (2018). Double/de-biased
machine learning of global and local parameters using regularized riesz representers.
arXiv.org.
Diaz, I. (2018). Doubly robust estimators for the average treatment effect under
positivity violations: introducing the e-score. arXiv.org.
Diaz, I. (2019). Machine learning in the estimation of causal effects: targeted minimum
loss-based estimation and double/debiased machine learning. Biostatistics.
Dudik, M., D. Erhan, J. Langford, and L. Li (2014). Doubly robust policy evaluation
and optimization. Statistical Science 29, 485–511.
Farajtabar, M., Y. Chow, and M. Ghavamzadeh (2018). More robust doubly robust
off-policy evaluation. In Proceedings of the 35th International Conference on Machine
Learning, 1447–1456.
Farrell, M. H. (2015). Robust inference on average treatment effects with possibly more
covariates than observations. Journal of Econometrics 189, 1–23.
Hahn, J. (1998). On the role of the propensity score in efficient semiparametric
estimation of average treatment effects. Econometrica 66, 315–331.
Henmi, M. and S. Eguchi (2004). A paradox concerning nuisance parameters and
projected estimating functions. Biometrika 91, 929–941.
Henmi, M., R. Yoshida, and S. Eguchi (2007). Importance sampling via the estimated
sampler. Biometrika 94, 985–991.
Hernan, M. and J. Robins (2019). Causal Inference. Boca Raton: Chapman &
Hall/CRC.
Hirano, K., G. Imbens, and G. Ridder (2003). Efficient estimation of average treatment
effects using the estimated propensity score. Econometrica 71, 1161–1189.
Hirshberg, D. and S. Wager (2019). Augmented minimax linear estimation. arXiv.org.
Huber, M. (2019). An introduction to flexible methods for policy evaluation. arXiv.org.
Imai, K. and M. Ratkovic (2014). Covariate balancing propensity score. J. R. Statist.
Soc. B 76, 243–263.
Jiang, N. and L. Li (2016). Doubly robust off-policy value evaluation for reinforcement
learning. In Proceedings of the 33rd International Conference on International
Conference on Machine Learning-Volume, 652–661.
Kallus, N. (2018). Balanced policy evaluation and learning. In Advances in Neural
Information Processing Systems 31, pp. 8895–8906.
Kallus, N. and M. Uehara (2019a). Double reinforcement learning for efficient off-policy
evaluation in markov decision processes. arXiv preprint arXiv:1908.08526.
Kallus, N. and M. Uehara (2019b). Efficiently breaking the curse of horizon: Double
reinforcement learning in infinite-horizon processes. arXiv preprint arXiv:1909.05850.
Kallus, N. and M. Uehara (2019c). Intrinsically efficient, stable, and bounded off-policy
evaluation for reinforcement learning. In Advances in Neural Information Processing
Systems 32, pp. 3320–3329.
Kang, J. D. Y. and J. L. Schafer (2007). Demystifying double robustness: A comparison
of alternative strategies for estimating a population mean from incomplete data.
Statistical Science 22, 523–529.
Kennedy, E. (2016). Semiparametric theory and empirical processes in causal inference.
arXiv.org.
Kennedy, E. H., Z. Ma, M. D. Mchugh, and D. S. Small (2017). Nonparametric
methods for doubly robust estimation of continuous treatment effects. Journal of the
Royal Statistical Society: Series B (Statistical Methodology) 79, 1229–1245.
Liu, Q., L. Li, Z. Tang, and D. Zhou (2018). Breaking the curse of horizon:
Infinite-horizon off-policy estimation. In Advances in Neural Information Processing
Systems 31, pp. 5356–5366.
Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal
Statistical Society: Series B (Statistical Methodology) 65, 331–355.
Nabi, R., P. Kanki, and I. Shpitser (2018). Estimation of personalized effects associated
with causal pathways. In Proceedings of the Conference on Uncertainty in Artificial
Intelligence (UAI 2018).
Ogburn, E., O. Sofrygin, and I. Diaz (2017). Causal inference for social network data.
arXiv.org.
Ogburn, E. L., I. Shpitser, and Y. Lee (2018). Causal inference, social networks, and
chain graphs. arXiv.org.
Robins, J., M. Sued, Q. Lei-Gomez, and A. Rotnitzky (2007). Comment: Performance
of double-robust estimators when "inverse probability" weights are highly variable.
Statistical Science 22, 544–559.
Robins, J. M., S. D. Mark, and W. K. Newey (1992). Estimating exposure effects by
modelling the expectation of exposure conditional on confounders. Biometrics 48,
479–495.
Robins, J. M., A. Rotnitzky, and L. P. Zhao (1994). Estimation of regression coefficients
when some regressors are not always observed. Journal of the American Statistical
Association 89, 846–866.
Rotnitzky, A., J. Robins, and L. Babino (2017). On the multiply robust estimation of
the mean of the g-functional. arXiv preprint arXiv:1705.08582.
Rotnitzky, A. and E. Smucler (2019). Efficient adjustment sets for population average
treatment effect estimation in non-parametric causal graphical models. arXiv.org.
Rotnitzky, A., E. Smucler, and J. Robins (2019). Characterization of parameters with a
mixed bias property. arXiv preprint arXiv:1509.02556.
Rotnitzky, A. and S. Vansteelandt (2014). Double-robust methods. In Handbook of
Missing Data Methodology, Handbooks of Modern Statistical Methods, pp. 185–212.
Chapman and Hall/CRC.
Rubin, D. (2006). Targeted maximum likelihood learning. The International Journal of
Biostatistics 2, 1043–1043.
Rubin, D. B. and M. J. van Der Laan (2008). Empirical efficiency maximization:
improved locally efficient covariate adjustment in randomized experiments and
survival analysis. The international journal of biostatistics 4.
Scharfstein, D., A. Rotnizky, and J. M. Robins (1999). Adjusting for nonignorable
dropout using semi-parametric models. Journal of the American Statistical
Association 94, 1096–1146.
Seaman, S. R. and S. Vansteelandt (2018). Introduction to double robust methods for
incomplete data. Statistical science 33, 184–197.
Sherman, E. and I. Shpitser (2018). Identification and estimation of causal effects from
dependent data. In Advances in Neural Information Processing Systems 31, pp.
9424–9435.
Shpitser, I. and J. Pearl (2008). Complete identification methods for the causal
hierarchy. Journal of Machine Learning Research, 1941–1979.
Shpitser, I. and E. Sherman (2018). Identification of personalized effects associated with
causal pathways. In Proceedings of the Conference on Uncertainty in Artificial
Intelligence (UAI 2018).
Smucler, E., A. Rotnitzky, and J. M. Robins (2019). A unifying approach for
doubly-robust ℓ1 regularized estimation of causal contrasts. arXiv preprint
arXiv:1904.03737.
Tan, Z. (2006). A distributional approach for causal inference using propensity scores.
Journal of the American Statistical Association 101, 1619–1637.
Tan, Z. (2007). Comment: Understanding or, ps and dr. Statistical Science 22,
560–568.
Tan, Z. (2010). Bounded, efficient and doubly robust estimation with inverse weighting.
Biometrika 97, 661–682.
Thomas, P. and E. Brunskill (2016). Data-efficient off-policy policy evaluation for
reinforcement learning. In Proceedings of the 33rd International Conference on
Machine Learning, 2139–2148.
Tian, J. (2008). Identifying dynamic sequential plans. In Proceedings of the
Twenty-Fourth Annual Conference on Uncertainty in Artificial Intelligence
(UAI-08), 554–561.
Tsiatis, A. A. (2006). Semiparametric Theory and Missing Data. Springer Series in
Statistics. New York, NY: Springer New York.
Tsiatis, A. A. and M. Davidian (2007). Comment: Demystifying double robustness: A
comparison of alternative strategies for estimating a population mean from
incomplete data. Statistical science 22, 569–573.
van Der Laan, M. and S. Gruber (2010). Collaborative double robust targeted maximum
likelihood estimation. International Journal of Biostatistics 6, 1181–1181.
van der Laan, M. J. (2011). Targeted Learning :Causal Inference for Observational and
Experimental Data (1st ed. 2011. ed.). Springer Series in Statistics. New York, NY:
Springer.
van Der Laan, M. J. and J. M. Robins (2003). Unified Methods for Censored
Longitudinal Data and Causality. Springer Series in Statistics,. New York, NY:
Springer New York.
van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge, UK: Cambridge
University Press.
Vermeulen, K. and S. Vansteelandt (2015). Bias-reduced doubly robust estimation.
Journal of the American Statistical Association 110, 1024–1036.
Wang, Y. and J. Zubizarreta (2019a). Large sample properties of matching for balance.
arXiv.org.
Wang, Y. and J. Zubizarreta (2019b). Minimal dispersion approximately balancing
weights: Asymptotic properties and practical considerations. arXiv.org.
Wang, Y.-X., A. Agarwal, and M. Dudík (2017). Optimal and adaptive off-policy
evaluation in contextual bandits. In Proceedings of the 34th International Conference
on Machine Learning, Volume 70, pp. 3589–3597.
Xie, T., Y. Ma, and Y.-X. Wang (2019). Towards optimal off-policy evaluation for
reinforcement learning with marginalized importance sampling. In Advances in Neural
Information Processing Systems 32, pp. 9665–9675.
Masatoshi Uehara (Harvard University) OPE December 25, 2019 50 / 50

Más contenido relacionado

La actualidad más candente

階層ベイズによるワンToワンマーケティング入門
階層ベイズによるワンToワンマーケティング入門階層ベイズによるワンToワンマーケティング入門
階層ベイズによるワンToワンマーケティング入門
shima o
 
グラフィカルモデル入門
グラフィカルモデル入門グラフィカルモデル入門
グラフィカルモデル入門
Kawamoto_Kazuhiko
 

La actualidad más candente (20)

Counterfaual Machine Learning(CFML)のサーベイ
Counterfaual Machine Learning(CFML)のサーベイCounterfaual Machine Learning(CFML)のサーベイ
Counterfaual Machine Learning(CFML)のサーベイ
 
【論文調査】XAI技術の効能を ユーザ実験で評価する研究
【論文調査】XAI技術の効能を ユーザ実験で評価する研究【論文調査】XAI技術の効能を ユーザ実験で評価する研究
【論文調査】XAI技術の効能を ユーザ実験で評価する研究
 
“機械学習の説明”の信頼性
“機械学習の説明”の信頼性“機械学習の説明”の信頼性
“機械学習の説明”の信頼性
 
cvpaper.challenge 研究効率化 Tips
cvpaper.challenge 研究効率化 Tipscvpaper.challenge 研究効率化 Tips
cvpaper.challenge 研究効率化 Tips
 
pymcとpystanでベイズ推定してみた話
pymcとpystanでベイズ推定してみた話pymcとpystanでベイズ推定してみた話
pymcとpystanでベイズ推定してみた話
 
構造方程式モデルによる因果探索と非ガウス性
構造方程式モデルによる因果探索と非ガウス性構造方程式モデルによる因果探索と非ガウス性
構造方程式モデルによる因果探索と非ガウス性
 
Rによるやさしい統計学第20章「検定力分析によるサンプルサイズの決定」
Rによるやさしい統計学第20章「検定力分析によるサンプルサイズの決定」Rによるやさしい統計学第20章「検定力分析によるサンプルサイズの決定」
Rによるやさしい統計学第20章「検定力分析によるサンプルサイズの決定」
 
傾向スコアの概念とその実践
傾向スコアの概念とその実践傾向スコアの概念とその実践
傾向スコアの概念とその実践
 
統計的因果推論への招待 -因果構造探索を中心に-
統計的因果推論への招待 -因果構造探索を中心に-統計的因果推論への招待 -因果構造探索を中心に-
統計的因果推論への招待 -因果構造探索を中心に-
 
【DL輪読会】Dropout Reduces Underfitting
【DL輪読会】Dropout Reduces Underfitting【DL輪読会】Dropout Reduces Underfitting
【DL輪読会】Dropout Reduces Underfitting
 
機械学習モデルの判断根拠の説明
機械学習モデルの判断根拠の説明機械学習モデルの判断根拠の説明
機械学習モデルの判断根拠の説明
 
ようやく分かった!最尤推定とベイズ推定
ようやく分かった!最尤推定とベイズ推定ようやく分かった!最尤推定とベイズ推定
ようやく分かった!最尤推定とベイズ推定
 
階層ベイズによるワンToワンマーケティング入門
階層ベイズによるワンToワンマーケティング入門階層ベイズによるワンToワンマーケティング入門
階層ベイズによるワンToワンマーケティング入門
 
強化学習と逆強化学習を組み合わせた模倣学習
強化学習と逆強化学習を組み合わせた模倣学習強化学習と逆強化学習を組み合わせた模倣学習
強化学習と逆強化学習を組み合わせた模倣学習
 
協調フィルタリング入門
協調フィルタリング入門協調フィルタリング入門
協調フィルタリング入門
 
グラフィカルモデル入門
グラフィカルモデル入門グラフィカルモデル入門
グラフィカルモデル入門
 
Oracle property and_hdm_pkg_rigorouslasso
Oracle property and_hdm_pkg_rigorouslassoOracle property and_hdm_pkg_rigorouslasso
Oracle property and_hdm_pkg_rigorouslasso
 
TabNetの論文紹介
TabNetの論文紹介TabNetの論文紹介
TabNetの論文紹介
 
方策勾配型強化学習の基礎と応用
方策勾配型強化学習の基礎と応用方策勾配型強化学習の基礎と応用
方策勾配型強化学習の基礎と応用
 
ELBO型VAEのダメなところ
ELBO型VAEのダメなところELBO型VAEのダメなところ
ELBO型VAEのダメなところ
 

Similar a Off policy evaluation

ALPHA LOGARITHM TRANSFORMED SEMI LOGISTIC DISTRIBUTION USING MAXIMUM LIKELIH...
ALPHA LOGARITHM TRANSFORMED SEMI LOGISTIC  DISTRIBUTION USING MAXIMUM LIKELIH...ALPHA LOGARITHM TRANSFORMED SEMI LOGISTIC  DISTRIBUTION USING MAXIMUM LIKELIH...
ALPHA LOGARITHM TRANSFORMED SEMI LOGISTIC DISTRIBUTION USING MAXIMUM LIKELIH...
BRNSS Publication Hub
 
8064-an-excel-spreadsheet-and-vba-macro-for-model-selection-and-predictor-imp...
8064-an-excel-spreadsheet-and-vba-macro-for-model-selection-and-predictor-imp...8064-an-excel-spreadsheet-and-vba-macro-for-model-selection-and-predictor-imp...
8064-an-excel-spreadsheet-and-vba-macro-for-model-selection-and-predictor-imp...
MFOZOUNI
 
Naïve Bayes Machine Learning Classification with R Programming: A case study ...
Naïve Bayes Machine Learning Classification with R Programming: A case study ...Naïve Bayes Machine Learning Classification with R Programming: A case study ...
Naïve Bayes Machine Learning Classification with R Programming: A case study ...
SubmissionResearchpa
 
With the consideration of the recurring theme and ideas of .docx
With the consideration of the recurring theme and ideas of .docxWith the consideration of the recurring theme and ideas of .docx
With the consideration of the recurring theme and ideas of .docx
madlynplamondon
 
SOC2002 Lecture 11
SOC2002 Lecture 11SOC2002 Lecture 11
SOC2002 Lecture 11
Bonnie Green
 
Estimators for structural equation models of Likert scale data
Estimators for structural equation models of Likert scale dataEstimators for structural equation models of Likert scale data
Estimators for structural equation models of Likert scale data
Nick Stauner
 
Ejbrm volume6-issue1-article183
Ejbrm volume6-issue1-article183Ejbrm volume6-issue1-article183
Ejbrm volume6-issue1-article183
Soma Sinha Roy
 

Similar a Off policy evaluation (20)

Assessing relative importance using rsp scoring to generate
Assessing relative importance using rsp scoring to generateAssessing relative importance using rsp scoring to generate
Assessing relative importance using rsp scoring to generate
 
Assessing Relative Importance using RSP Scoring to Generate VIF
Assessing Relative Importance using RSP Scoring to Generate VIFAssessing Relative Importance using RSP Scoring to Generate VIF
Assessing Relative Importance using RSP Scoring to Generate VIF
 
ALPHA LOGARITHM TRANSFORMED SEMI LOGISTIC DISTRIBUTION USING MAXIMUM LIKELIH...
ALPHA LOGARITHM TRANSFORMED SEMI LOGISTIC  DISTRIBUTION USING MAXIMUM LIKELIH...ALPHA LOGARITHM TRANSFORMED SEMI LOGISTIC  DISTRIBUTION USING MAXIMUM LIKELIH...
ALPHA LOGARITHM TRANSFORMED SEMI LOGISTIC DISTRIBUTION USING MAXIMUM LIKELIH...
 
Off policy learning
Off policy learningOff policy learning
Off policy learning
 
8064-an-excel-spreadsheet-and-vba-macro-for-model-selection-and-predictor-imp...
8064-an-excel-spreadsheet-and-vba-macro-for-model-selection-and-predictor-imp...8064-an-excel-spreadsheet-and-vba-macro-for-model-selection-and-predictor-imp...
8064-an-excel-spreadsheet-and-vba-macro-for-model-selection-and-predictor-imp...
 
Lad
LadLad
Lad
 
Naïve Bayes Machine Learning Classification with R Programming: A case study ...
Naïve Bayes Machine Learning Classification with R Programming: A case study ...Naïve Bayes Machine Learning Classification with R Programming: A case study ...
Naïve Bayes Machine Learning Classification with R Programming: A case study ...
 
Dd31720725
Dd31720725Dd31720725
Dd31720725
 
With the consideration of the recurring theme and ideas of .docx
With the consideration of the recurring theme and ideas of .docxWith the consideration of the recurring theme and ideas of .docx
With the consideration of the recurring theme and ideas of .docx
 
Review of "Survey Research Methods & Design in Psychology"
Review of "Survey Research Methods & Design in Psychology"Review of "Survey Research Methods & Design in Psychology"
Review of "Survey Research Methods & Design in Psychology"
 
ESTIMATING R 2 SHRINKAGE IN REGRESSION
ESTIMATING R 2 SHRINKAGE IN REGRESSIONESTIMATING R 2 SHRINKAGE IN REGRESSION
ESTIMATING R 2 SHRINKAGE IN REGRESSION
 
SOC2002 Lecture 11
SOC2002 Lecture 11SOC2002 Lecture 11
SOC2002 Lecture 11
 
Estimators for structural equation models of Likert scale data
Estimators for structural equation models of Likert scale dataEstimators for structural equation models of Likert scale data
Estimators for structural equation models of Likert scale data
 
Ejbrm volume6-issue1-article183
Ejbrm volume6-issue1-article183Ejbrm volume6-issue1-article183
Ejbrm volume6-issue1-article183
 
Improving the Efficiency of Ratio Estimators by Calibration Weightings
Improving the Efficiency of Ratio Estimators by Calibration WeightingsImproving the Efficiency of Ratio Estimators by Calibration Weightings
Improving the Efficiency of Ratio Estimators by Calibration Weightings
 
'ACCOST' for differential HiC analysis
'ACCOST' for differential HiC analysis'ACCOST' for differential HiC analysis
'ACCOST' for differential HiC analysis
 
Presenting Data
Presenting DataPresenting Data
Presenting Data
 
ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
 
Penalized Regressions with Different Tuning Parameter Choosing Criteria and t...
Penalized Regressions with Different Tuning Parameter Choosing Criteria and t...Penalized Regressions with Different Tuning Parameter Choosing Criteria and t...
Penalized Regressions with Different Tuning Parameter Choosing Criteria and t...
 
1607.01152.pdf
1607.01152.pdf1607.01152.pdf
1607.01152.pdf
 

Último

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Último (20)

Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 

Off policy evaluation

  • 1. Off policy evaluation -survey- 1 Masatoshi Uehara (Harvard University) December 25, 2019 1 Disclaimer; this is a very casual note Masatoshi Uehara (Harvard University) OPE December 25, 2019 1 / 50
  • 2. Overview 1 Motivation 2 Contextual bandit setting (With parametric models) 3 Bandit setting (With nonparametric models) 4 RL setting (Sequential or longitudinal setting) 5 Open Problems (General DAG, Mediation, Interference) Masatoshi Uehara (Harvard University) OPE December 25, 2019 2 / 50
  • 3. Off policy evaluation (OPE) Goal is evaluating the value of the policy from the historical data. More formally, estimating the value of the evaluation policy πe from the data obtained by the behavior policy πb. Masatoshi Uehara (Harvard University) OPE December 25, 2019 3 / 50
  • 4. Some notations from semiparametric theory Refer to (van der Vaart, 1998; Bickel et al., 1998; Tsiatis, 2006; Kennedy, 2016) (Semiparametric models)... Combination of parametric and nonparametric models (Semiparametric efficiency bound)....Extension of Cramer-Rao lower bound for parametric models to semiparametric models. (Influence function (IF) of the estimator and estimand)... φ(x) for ˆθ or θ∗ √ N(ˆθ − θ∗ ) = 1 √ N N i=1 φ(x(i) ) + op(1/ √ N) (Efficient influence function (EIF))... IF of the estimand minimizing the variance. (Efficient estimator).... Estimator achieving the efficiency bound Masatoshi Uehara (Harvard University) OPE December 25, 2019 4 / 50
  • 5. Contextual bandit setting Setting We have {s(i), a(i), r(i)}N i=1 ∼ p(s)πb(a|s)p(r|s, a). We want to estimate Eπe [r] = Ep[rπe (a|s)] = rp(s)πe (a|s)p(r|s, a)dµ(r, s, a) . Good surveys (Rotnitzky and Vansteelandt, 2014; Seaman and Vansteelandt, 2018; Huber, 2019; Diaz, 2019) Unless otherwise noted, the expectation is taken w.r.t behavior policy Extension to conterfactual setting is easy EN[·] Empirical approximation Value function and Q-functions are defined for evalution policies Masatoshi Uehara (Harvard University) OPE December 25, 2019 5 / 50
  • 6. CB; Semiparemtric Lower bound The efficiency bound under nonparametric model is var{v(s)} + E{η(s, a)2 var(r|s, a)}, where E(r|s, a) = q(s, a) and Eπe {E(r|s, a)|s} = v(s), η(s, a) = πe/πb. How to obtain? Approximate your infinite dimensional model as a parametric model. Then, calculate the supremum of the Cramer-Rao lower bound. Masatoshi Uehara (Harvard University) OPE December 25, 2019 6 / 50
  • 7. Implication of semiparametric lower bound Semiparametric lower bound gives the lower bound of asymptotic MSE among regular estimators. Therefore, for example, var{v(s)} + E{η(s, a)2 var(r|s, a)} < var{η(s, a)r}. Importantly, this lower bound is not changed whether behavior policy is known or not. Masatoshi Uehara (Harvard University) OPE December 25, 2019 7 / 50
  • 8. Common estimators IS (Importance sampling a.k.a IPW, HorvitzThompson); EN [ˆη(s, a)r] , πe(a|s) πb(a|s) = η(s, a) NIS (Normalized IS); EN [ˆη(a, s)r/EN[ˆη(s, a)]] DM (Direct method); EN[ˆq(s, a)], (E[r|a, s] = q(s, a)) AIS (Augmented IS (Robins et al., 1994; Dudik et al., 2014)); EN [ˆη(a, s)(r − ˆq(s, a)) + ˆv(s)] , (ˆv(s) = E[ˆq(s, a) | s]) Masatoshi Uehara (Harvard University) OPE December 25, 2019 8 / 50
  • 9. Useful properties for AIS 1 Model double robustness (In terms of consistency and √ N–consistency) η(s, a) ≈ ˆη(s, a)? q(s, a) ≈ ˆq(s, a)? Red... Consistent, Green.... Not Consistent Masatoshi Uehara (Harvard University) OPE December 25, 2019 9 / 50
  • 10. Useful properties for AIS 2 Rate double robustness ˆη − η 2 = op(N−1/4) and ˆq − q 2 = op(N−1/4) are sufficient conditions to guarantee the efficiency (Chernozhukov et al., 2018; Rotnitzky and Smucler, 2019) Fact regarding plug-in Even if nuisance functions are estimated with parametric √ N–rate, the asymptotic variance will be generally changed Thanks to the orthogonality of IF, the asymptotic variance is not changed even if there is plug-in (Rotnitzky et al., 2019) Masatoshi Uehara (Harvard University) OPE December 25, 2019 10 / 50
  • 11. Double robust IS or Double robust direct estimator Double robust regression estimator (Scharfstein et al., 1999; Kang and Schafer, 2007) Learn q(s, a) with some covariate including ˆη(s, a) (weighted regression). Define an estimator as EN[ˆq(s, a)] This is double robust!! Close to TMLE Double robust IS estimator (Robins et al., 2007) Learn η(s, a) with some covariate based on ˆq(s, a) Define an IS estimator EN[ˆη(s, a)r] This is double robust Close to TMLE Masatoshi Uehara (Harvard University) OPE December 25, 2019 11 / 50
  • 12. More doubly robust (MDR) estimator Motivation...AIS has poor performance when q(s, a) is mis-specified (Rubin and van Der Laan, 2008; Cao et al., 2009) MDR MDR is minimizing the variance among some class of estimators irrespective of the model-specification of q(s, a) When behavior policy is known, Q-function is estimated as follows; ˆq = arg min q∈Fq var{v(s)} + E{η(s, a)2 var(r|s, a)} . Then, plug it in DR. (Property)... Still double robust Can be extended when behavior policy is unknown. Extension to RL (Farajtabar et al., 2018) Masatoshi Uehara (Harvard University) OPE December 25, 2019 12 / 50
  • 13. Intrinsic efficient estimator Motivation...The performance of AIS can become worse than IS or NIS (when q-models are mis-specified). Intrinsic efficient estimator (Tan, 2006, 2010) Making the class of estimator including IS and NIS, and optimizing so that the variance is minimized (Property)... Still double robust and better than IS and NIS Extension to RL (Kallus and Uehara, 2019c) Masatoshi Uehara (Harvard University) OPE December 25, 2019 13 / 50
  • 14. Bias reduced estimator (Vermeulen and Vansteelandt, 2015) (Motivation)...What will happen when both models are mis-specified? Vermeulen and Vansteelandt (2015) has introduced an estimator based on the idea of reducing MSE irrespective of model-specifications. (Property)...Double robust and robust to model-misspecifications!! Masatoshi Uehara (Harvard University) OPE December 25, 2019 14 / 50
  • 15. Nonparametric IS (Hirano et al., 2003) IS when πb is estimated nonparametrically This achieves the efficiency bound under some smoothness conditions Plug-in paradox (Robins et al., 1992; Henmi and Eguchi, 2004; Henmi et al., 2007) Plug-in estimator based on MLE is more efficient than non plug-in estimator If so, is plug-in IS estimator better than no plug-in estimator? Yes; If models are well-specified. Kind of using some control variate (Robins et al., 2007) No; If models are mis-specified Masatoshi Uehara (Harvard University) OPE December 25, 2019 15 / 50
  • 16. Nonoparametric direct method Hahn (1998) introduce an estimator based on a direct method when q(a, x) is estimated nonparametrically This achieves the efficiency bound under some smoothness conditions Parametric direct method A.K.A G-formula (Hernan and Robins, 2019) We can also assume a parametric model for q(a, x) directly (semiparametric direct method). Efficiency bound under parametric q-model is smaller than efficiency bound under nonparametric model (Tan, 2007) Masatoshi Uehara (Harvard University) OPE December 25, 2019 16 / 50
  • 17. Double debiased machine learning (Chernozhukov et al., 2018) The estimator is EN [ˆµ(a, s)(r − ˆq(s, a)) + ˆv(s)] , (Eπe [r|s] = v(s)) with cross fitting (aka. sample splitting) (van der Vaart, 1998) Both µ and q are estimated nonparametrically. Rate double robustness is attained without Donsker conditions for nuisance estimators Masatoshi Uehara (Harvard University) OPE December 25, 2019 17 / 50
  • 18. TMLE (Rubin, 2006; van der Laan, 2011; Benkeser et al., 2017) TMLE??... Updating the estimator based on the efficient influence function of the target. (Super-learner is also used here) When EIF is analytically written, TMLE is reduced to a one-step estimator. See Page 11. When EIF does not have a closed form, iterative estimator. Corraborative double robustness (van Der Laan and Gruber, 2010; Diaz, 2018) Masatoshi Uehara (Harvard University) OPE December 25, 2019 18 / 50
  • 19. Other important estimators Switching estimator (Tsiatis and Davidian, 2007; Wang et al., 2017) Matching estimator (Abadie and Imbens, 2006; Wang and Zubizarreta, 2019a) Covariate balancing with various divergences (Imai and Ratkovic, 2014; Wang and Zubizarreta, 2019b) Minimax estimator (Kallus, 2018; Chernozhukov et al., 2018; Hirshberg and Wager, 2019) High dimensional setting (Many... E.g. Farrell (2015); Smucler et al. (2019)) Continuous treatment (estimand is the difference) (Kennedy et al., 2017) Finite population inference (Bojinov and Shephard, 2019) Multiple robustness (Rotnitzky et al., 2017) Masatoshi Uehara (Harvard University) OPE December 25, 2019 19 / 50
  • 20. RL setting (Application) Figure: ADHD Example [Chakraborty,2009] Masatoshi Uehara (Harvard University) OPE December 25, 2019 20 / 50
  • 21. Summary of RL situation Table: Efficiency bounds and estimators for OPE Efficiency bound Efficient estimator NMDP Kallus and Uehara (2019a) Jiang and Li (2016) Thomas and Brunskill (2016) TMDP Kallus and Uehara (2019a) Kallus and Uehara (2019a) MDP Kallus and Uehara (2019b) Kallus and Uehara (2019b) Jiang and Li (2016) also calculated bounds of NMDP and TMDP for a tabular case. Note that efficiency bound and estimator under NMDP are kind of given in causal inference literature (Murphy, 2003; van Der Laan and Robins, 2003; Bang and Robins, 2005) Masatoshi Uehara (Harvard University) OPE December 25, 2019 21 / 50
  • 22. MDP MDP = {S, A, R, p} S, A, R... State space, Action space, Reward space Transition density... p(s |s, a) Reward distribution.... p(r|s, a) Initial distribution.... p(0) (s0) Evaluation policy πe(a|s), behavior policy πb(a|s) The induced distribution by MDP and the behavior policy is p(s0, a0, r0, a0, s1, a1, r1, s2, a2, r2, · · · ) = p(0) (s0)πb (a0|s0)p(r0|s0, a0)p(s1|s0, a0)πb (a1|s1)p(r1|s1, a1) · · · . s0 a0 r0 s1 a1 r1 s2 | | | | || || || || ||| ||| Figure: MDP Masatoshi Uehara (Harvard University) OPE December 25, 2019 22 / 50
  • 23. NMDP and TMDP MDP can be relaxed into two ways; NMDP (without Markovness) and TMDP (Without time-invariance) Figure: NMDP (Non-Markov Decision process) Figure: TMDP (Time-varying Markov Decision Process) Masatoshi Uehara (Harvard University) OPE December 25, 2019 23 / 50
  • 24. Goal in OPE for RL [Goal]; Estimate ρπe ; ρπe = (1 − γ) ∞ t=0 Eπe [γt rt], (γ < 1) Note that this expectation is taken w.r.t p (0) e (s0)πe (a0|s0)p(r0|s0, a0)p(s1|s0, a0)πe (a1|s1)p(r1|s1, a1) · · · . We can use a set of samples generated by MDP and the behavior policy πb; {s (i) t , a (i) t , r (i) t }N,T i=1,t=0. following p (0) b (s0)πb (a0|s0)p(r0|s0, a0)p(s1|s0, a0)πb (a1|s1)p(r1|s1, a1) · · · . Masatoshi Uehara (Harvard University) OPE December 25, 2019 24 / 50
  • 25. Common three approaches DM (Direct Method) estimator ˆρDM = (1 − γ) EN [Eπe [ˆq(s0, a0)|s0]] , where E[ ∞ t=0 γtrt|s0, a0] = q(s0, a0). SIS (Sequential Importance Sampling) estimator ˆρSIS = (1 − γ) EN T t=0 γt νtrt , where νt(Hat ) = t k=0 ηk(sk, ak), ηk(sk, ak) = πe(ak|sk) πb(ak|sk) . Double Robust (DR) estimator (Jiang and Li, 2016; Thomas and Brunskill, 2016) ˆρDR = (1 − γ) EN T t=0 γt (νt(rt − ˆqt) + νt−1 ˆvt(st)) . Masatoshi Uehara (Harvard University) OPE December 25, 2019 25 / 50
  • 26. Curse of horizon Eπe [ T t=0 γt rt] = Eπb T k=0 πe(ak|sk) πb(ak|sk) T t=0 γt rt = Eπb T t=0 t k=0 πe(ak|sk) πb(ak|sk) γt rt = Eπb T t=0 γt νtrt ≈ EN T t=0 γt νtrt Problem: Variance grows exponentially w.r.t T. Masatoshi Uehara (Harvard University) OPE December 25, 2019 26 / 50
  • 27. Curse of horizon SIS and DR estimator suffer from the curse of horizon DM estimator does not. But it suffer from the model misspefication. Q; Are there any solutions? A; MDP assumptions are not exploited fully. Markov assumption Time-invariant assumption Masatoshi Uehara (Harvard University) OPE December 25, 2019 27 / 50
  • 28. Leveraging Markovness Xie et al. (2019) proposed a marginal importance sampling estimator; Eπe T t=0 γt rt = Eπb T t=0 t k=0 πe(ak|sk) πb(ak|sk) γt rt = Eπb T t=0 γt µtrt ≈ EN T t=0 γt µtrt . Here, µt is a marginal density ratio at t; µt = pπe (st, at) pπb (st, at) . Masatoshi Uehara (Harvard University) OPE December 25, 2019 28 / 50
  • 29. Efficiency bound under NMDP and TMDP Theorem (EB under NMDP) EB(M1) = (1 − γ)2 ∞ k=1 E[γ2(k−1) ν2 k−1var rk−1 + vk|Hak−1 ]. Theorem (EB under TMDP) EB(M2) = (1 − γ)2 ∞ k=1 E[γ2(k−1) µ2 k−1var (rk−1 + vk|ak−1, sk−1)]. Typical behavior of ν2 k−1 is O(Ck). Typical behavior of µ2 k−1 is O(1). Variance does not grow exponentially w.r.t T under TMDP Masatoshi Uehara (Harvard University) OPE December 25, 2019 29 / 50
  • 30. Double Reinforcement learning (for TMDP) Kallus and Uehara (2019a) has proposed an estimator (DRL) achieving the efficiency bound under TMDP; ˆρDRL(M2) = (1 − γ) EN T t=0 γt (ˆµt(rt − ˆqt) + ˆµt−1 ˆvt(st)) . Masatoshi Uehara (Harvard University) OPE December 25, 2019 30 / 50
  • 31. Double robustness of DRL for TMDP Model double robustness (Also, rate double robustness) µt(s, a) ≈ ˆµt(s, a)? qt(s, a) ≈ ˆqt(s, a)? Red... Consistent, Green.... Not Consistent Masatoshi Uehara (Harvard University) OPE December 25, 2019 31 / 50
  • 32. Curse of horizon (Again) Q: Is the curse of horizon solved? A: At least, it does not blow up w.r.t horizon. But, rate is not right under MDP Masatoshi Uehara (Harvard University) OPE December 25, 2019 32 / 50
  • 33. Correct rate for OPE under Ergodic MDP The rate of the estimator (MSE) introduced so far is 1/N However, we can learn the estimand with 1/NT–rate assuming Ergodicity. Importantly, we can learn from a single trajectory (N = 1, T → ∞) Masatoshi Uehara (Harvard University) OPE December 25, 2019 33 / 50
  • 34. Leveraging Time-invariance Liu et al. (2018) proposed an Ergodic importance sampling estimator; lim T→∞ (1 − γ)Eπe T t=0 γt rt = rp∞ e,γ(s, a)dµ(s, a, r) = r p∞ e,γ(s, a) p∞ b (s, a) p∞ b (s, a)dµ(s, a, r) = Eπ∞ b [rw(s, a)] ≈ ENET [rw(s, a)] = 1 N 1 T N i=1 T t=1 r (i) t w(s (i) t , a (i) t ) where p∞ e,γ(s, a) is an average visitation distribution of state and action, w(s, a) = p∞ e,γ(s, a) p∞ b (s, a) . Masatoshi Uehara (Harvard University) OPE December 25, 2019 34 / 50
  • 35. Efficiency bound under Ergodic MDP The lower bound of asymptotic MSE scaled by NT among regular estimators is EB(M3) = Ep (∞) b   w2 (s, a) Distribution mismatch {r + γv(s ) − q(s, a)}2 Bellman squared residual    . Table: Comparison regarding rate Rate Curse of horizon NMDP O(1/N) Yes TMDP O(1/N) No. But still rate... MDP O(1/NT) No Masatoshi Uehara (Harvard University) OPE December 25, 2019 35 / 50
  • 36. Efficient estimator under ergodic MDP Defining v(s0) = Eπe [q(s0, a0)|s0], efficient estimator ˆρDRL(M3) is defined as follows; (1 − γ)ENEp (0) e [ˆv(s0)] + ENET [ ˆw(s, a)(r + γˆv(s ) − ˆq(s, a))], or ENET [ ˆw(s, a)r] + (1 − γ)ENEp (0) e [ˆv(s0)] + ENET [ ˆw(s, a)(r + γˆv(s ) − ˆq(s, a))]. Here, Red terms correspond to IS estimator or DM estimator. And, Blue terms correspond to control variates. Masatoshi Uehara (Harvard University) OPE December 25, 2019 36 / 50
  • 37. Double robustness of DRL for MDP Model double robustness (Also, rate double robustness) w(s, a) ≈ ˆw(s, a)? q(s, a) ≈ ˆq(s, a)? Red... Consistent, Green.... Not Consistent Masatoshi Uehara (Harvard University) OPE December 25, 2019 37 / 50
• 38. General causal DAG (when all variables are measured)
Given a causal DAG (FFRCISTG), the g-formula or the IS estimator gives an identification formula (Hernan and Robins, 2019).
How to obtain an efficient estimator? See van der Laan and Robins (2003).
The remaining problem is how the estimator can be simplified (Rotnitzky and Smucler, 2019).
• 39. General causal DAG (with unmeasured variables)
The ID algorithm (Shpitser and Pearl, 2008; Tian, 2008; Shpitser and Sherman, 2018) gives a sufficient and necessary identification formula.
The relation with efficient estimation is still an open problem.
Figure: With unmeasured confounding
• 40. Mediation effect (pathway effect)
A modified ID algorithm (Shpitser and Sherman, 2018) gives a sufficient and necessary identification formula.
The efficiency theory is still under construction (Nabi et al., 2018).
Figure: Edge intervention
• 41. Network, interference
Some estimation methods and their theory exist (Ogburn et al., 2017).
Chain graphs are also useful in the network setting (Ogburn et al., 2018).
Figure: Chain graph
An identification formula is given in Sherman and Shpitser (2018).
Since the units are not i.i.d., estimation is difficult: e.g., to estimate from a single network, ergodicity is needed, and ordinary semiparametric theory assumes i.i.d. data.
• 42. Ref I
Abadie, A. and G. W. Imbens (2006). Large sample properties of matching estimators for average treatment effects. Econometrica 74, 235–267.
Bang, H. and J. M. Robins (2005). Doubly robust estimation in missing data and causal inference models. Biometrics 61, 962–973.
Benkeser, D., M. Carone, M. J. van der Laan, and P. B. Gilbert (2017). Doubly robust nonparametric inference on the average treatment effect. Biometrika 104, 863–880.
Bickel, P. J., C. A. J. Klaassen, Y. Ritov, and J. A. Wellner (1998). Efficient and Adaptive Estimation for Semiparametric Models. Springer.
Bojinov, I. and N. Shephard (2019). Time series experiments and causal estimands: exact randomization tests and trading. Journal of the American Statistical Association.
Cao, W., A. A. Tsiatis, and M. Davidian (2009). Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika 96, 723–734.
Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018). Double/debiased machine learning for treatment and structural parameters. Econometrics Journal 21, C1–C68.
• 43. Ref II
Chernozhukov, V., W. Newey, J. Robins, and R. Singh (2018). Double/de-biased machine learning of global and local parameters using regularized Riesz representers. arXiv.org.
Diaz, I. (2018). Doubly robust estimators for the average treatment effect under positivity violations: introducing the e-score. arXiv.org.
Diaz, I. (2019). Machine learning in the estimation of causal effects: targeted minimum loss-based estimation and double/debiased machine learning. Biostatistics.
Dudik, M., D. Erhan, J. Langford, and L. Li (2014). Doubly robust policy evaluation and optimization. Statistical Science 29, 485–511.
Farajtabar, M., Y. Chow, and M. Ghavamzadeh (2018). More robust doubly robust off-policy evaluation. In Proceedings of the 35th International Conference on Machine Learning, 1447–1456.
Farrell, M. H. (2015). Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics 189, 1–23.
Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica 66, 315–331.
Henmi, M. and S. Eguchi (2004). A paradox concerning nuisance parameters and projected estimating functions. Biometrika 91, 929–941.
• 44. Ref III
Henmi, M., R. Yoshida, and S. Eguchi (2007). Importance sampling via the estimated sampler. Biometrika 94, 985–991.
Hernan, M. and J. Robins (2019). Causal Inference. Boca Raton: Chapman & Hall/CRC.
Hirano, K., G. Imbens, and G. Ridder (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71, 1161–1189.
Hirshberg, D. and S. Wager (2019). Augmented minimax linear estimation. arXiv.org.
Huber, M. (2019). An introduction to flexible methods for policy evaluation. arXiv.org.
Imai, K. and M. Ratkovic (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76, 243–263.
Jiang, N. and L. Li (2016). Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, 652–661.
Kallus, N. (2018). Balanced policy evaluation and learning. In Advances in Neural Information Processing Systems 31, pp. 8895–8906.
Kallus, N. and M. Uehara (2019a). Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. arXiv preprint arXiv:1908.08526.
• 45. Ref IV
Kallus, N. and M. Uehara (2019b). Efficiently breaking the curse of horizon: Double reinforcement learning in infinite-horizon processes. arXiv preprint arXiv:1909.05850.
Kallus, N. and M. Uehara (2019c). Intrinsically efficient, stable, and bounded off-policy evaluation for reinforcement learning. In Advances in Neural Information Processing Systems 32, pp. 3320–3329.
Kang, J. D. Y. and J. L. Schafer (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science 22, 523–529.
Kennedy, E. (2016). Semiparametric theory and empirical processes in causal inference. arXiv.org.
Kennedy, E. H., Z. Ma, M. D. Mchugh, and D. S. Small (2017). Nonparametric methods for doubly robust estimation of continuous treatment effects. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79, 1229–1245.
Liu, Q., L. Li, Z. Tang, and D. Zhou (2018). Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems 31, pp. 5356–5366.
Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65, 331–355.
• 46. Ref V
Nabi, R., P. Kanki, and I. Shpitser (2018). Estimation of personalized effects associated with causal pathways. In Proceedings of the Conference on Uncertainty in Artificial Intelligence 2018.
Ogburn, E., O. Sofrygin, and I. Diaz (2017). Causal inference for social network data. arXiv.org.
Ogburn, E. L., I. Shpitser, and Y. Lee (2018). Causal inference, social networks, and chain graphs. arXiv.org.
Robins, J., M. Sued, Q. Lei-Gomez, and A. Rotnitzky (2007). Comment: Performance of double-robust estimators when "inverse probability" weights are highly variable. Statistical Science 22, 544–559.
Robins, J. M., S. D. Mark, and W. K. Newey (1992). Estimating exposure effects by modelling the expectation of exposure conditional on confounders. Biometrics 48, 479–495.
Robins, J. M., A. Rotnitzky, and L. P. Zhao (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89, 846–866.
Rotnitzky, A., J. Robins, and L. Babino (2017). On the multiply robust estimation of the mean of the g-functional. arXiv preprint arXiv:1705.08582.
• 47. Ref VI
Rotnitzky, A. and E. Smucler (2019). Efficient adjustment sets for population average treatment effect estimation in non-parametric causal graphical models. arXiv.org.
Rotnitzky, A., E. Smucler, and J. Robins (2019). Characterization of parameters with a mixed bias property. arXiv preprint arXiv:1509.02556.
Rotnitzky, A. and S. Vansteelandt (2014). Double-robust methods. In Handbook of Missing Data Methodology, Handbooks of Modern Statistical Methods, pp. 185–212. Chapman and Hall/CRC.
Rubin, D. (2006). Targeted maximum likelihood learning. The International Journal of Biostatistics 2, 1043–1043.
Rubin, D. B. and M. J. van der Laan (2008). Empirical efficiency maximization: improved locally efficient covariate adjustment in randomized experiments and survival analysis. The International Journal of Biostatistics 4.
Scharfstein, D., A. Rotnitzky, and J. M. Robins (1999). Adjusting for nonignorable dropout using semi-parametric models. Journal of the American Statistical Association 94, 1096–1146.
Seaman, S. R. and S. Vansteelandt (2018). Introduction to double robust methods for incomplete data. Statistical Science 33, 184–197.
• 48. Ref VII
Sherman, E. and I. Shpitser (2018). Identification and estimation of causal effects from dependent data. In Advances in Neural Information Processing Systems 31, pp. 9424–9435.
Shpitser, I. and J. Pearl (2008). Complete identification methods for the causal hierarchy. Journal of Machine Learning Research 9, 1941–1979.
Shpitser, I. and E. Sherman (2018). Identification of personalized effects associated with causal pathways. In Proceedings of the Conference on Uncertainty in Artificial Intelligence 2018.
Smucler, E., A. Rotnitzky, and J. M. Robins (2019). A unifying approach for doubly-robust ℓ1 regularized estimation of causal contrasts. arXiv preprint arXiv:1904.03737.
Tan, Z. (2006). A distributional approach for causal inference using propensity scores. Journal of the American Statistical Association 101, 1619–1637.
Tan, Z. (2007). Comment: Understanding OR, PS and DR. Statistical Science 22, 560–568.
Tan, Z. (2010). Bounded, efficient and doubly robust estimation with inverse weighting. Biometrika 97, 661–682.
• 49. Ref VIII
Thomas, P. and E. Brunskill (2016). Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, 2139–2148.
Tian, J. (2008). Identifying dynamic sequential plans. In Proceedings of the Twenty-Fourth Annual Conference on Uncertainty in Artificial Intelligence (UAI-08), 554–561.
Tsiatis, A. A. (2006). Semiparametric Theory and Missing Data. Springer Series in Statistics. New York, NY: Springer New York.
Tsiatis, A. A. and M. Davidian (2007). Comment: Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science 22, 569–573.
van der Laan, M. and S. Gruber (2010). Collaborative double robust targeted maximum likelihood estimation. International Journal of Biostatistics 6, 1181–1181.
van der Laan, M. J. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data (1st ed.). Springer Series in Statistics. New York, NY: Springer.
• 50. Ref IX
van der Laan, M. J. and J. M. Robins (2003). Unified Methods for Censored Longitudinal Data and Causality. Springer Series in Statistics. New York, NY: Springer New York.
van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge, UK: Cambridge University Press.
Vermeulen, K. and S. Vansteelandt (2015). Bias-reduced doubly robust estimation. Journal of the American Statistical Association 110, 1024–1036.
Wang, Y. and J. Zubizarreta (2019a). Large sample properties of matching for balance. arXiv.org.
Wang, Y. and J. Zubizarreta (2019b). Minimal dispersion approximately balancing weights: Asymptotic properties and practical considerations. arXiv.org.
Wang, Y.-X., A. Agarwal, and M. Dudík (2017). Optimal and adaptive off-policy evaluation in contextual bandits. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 3589–3597.
Xie, T., Y. Ma, and Y.-X. Wang (2019). Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. In Advances in Neural Information Processing Systems 32, pp. 9665–9675.