3. Off-policy evaluation (OPE)
The goal is to evaluate the value of a policy from historical data. More
formally, to estimate the value of the evaluation policy πe from data
generated by the behavior policy πb.
Masatoshi Uehara (Harvard University) OPE December 25, 2019 3 / 50
4. Some notation from semiparametric theory
Refer to (van der Vaart, 1998; Bickel et al., 1998; Tsiatis, 2006; Kennedy,
2016)
(Semiparametric models)... Combinations of parametric and
nonparametric models
(Semiparametric efficiency bound)... Extension of the Cramér-Rao lower
bound for parametric models to semiparametric models.
(Influence function (IF) of the estimator and estimand)... φ(x) for θ̂
or θ*, satisfying
√N (θ̂ − θ*) = (1/√N) Σ_{i=1}^N φ(x^{(i)}) + o_p(1)
(Efficient influence function (EIF))... the IF of the estimand that
minimizes the variance.
(Efficient estimator)... an estimator achieving the efficiency bound
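As a minimal numeric sketch (not from the slides): for the sample mean, the influence function is φ(x) = x − θ*, and the asymptotic-linearity expansion holds with the remainder exactly zero.

```python
import numpy as np

# Minimal sketch: for the sample mean theta_hat = EN[x] with estimand
# theta* = E[x], the influence function is phi(x) = x - theta*. The expansion
# sqrt(N)(theta_hat - theta*) = N^{-1/2} sum_i phi(x^{(i)}) + o_p(1)
# holds with the remainder identically zero in this special case.
rng = np.random.default_rng(0)
N, theta_star = 1_000, 2.0
x = rng.normal(loc=theta_star, scale=1.0, size=N)

lhs = np.sqrt(N) * (x.mean() - theta_star)
rhs = np.sum(x - theta_star) / np.sqrt(N)
print(abs(lhs - rhs))  # zero up to floating-point error
```

For more complicated estimators the remainder is only o_p(1), but the leading linear term still determines the asymptotic variance.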
5. Contextual bandit setting
Setting
We have {s^{(i)}, a^{(i)}, r^{(i)}}_{i=1}^N ∼ p(s) πb(a|s) p(r|s,a). We want to estimate
E_{πe}[r] = ∫ r p(s) πe(a|s) p(r|s,a) dµ(r, s, a).
Good surveys (Rotnitzky and Vansteelandt, 2014; Seaman and
Vansteelandt, 2018; Huber, 2019; Diaz, 2019)
Unless otherwise noted, expectations are taken w.r.t. the behavior policy
Extension to the counterfactual setting is easy
EN[·]... empirical approximation (sample average)
Value functions and Q-functions are defined for the evaluation policy
6. CB; Semiparametric lower bound
The efficiency bound under the nonparametric model is
var{v(s)} + E{η(s,a)² var(r|s,a)},
where q(s,a) = E(r|s,a), v(s) = E_{πe}{q(s,a)|s}, and η(s,a) = πe(a|s)/πb(a|s).
How to obtain it?
Approximate the infinite-dimensional model by parametric submodels; the
bound is the supremum of the Cramér-Rao lower bounds over these submodels.
7. Implication of semiparametric lower bound
The semiparametric lower bound is a lower bound on the asymptotic MSE
among regular estimators. Therefore, for example,
var{v(s)} + E{η(s,a)² var(r|s,a)} ≤ var{η(s,a)r}.
Importantly, this lower bound does not change whether the behavior policy is
known or not.
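The inequality can be verified exactly in a small discrete example. The sketch below uses a hypothetical 2-state, 2-action bandit (all numbers are made up) and computes both sides in closed form; by the law of total variance the bound is never larger.

```python
import numpy as np

# Exact check of var{v(s)} + E{eta^2 var(r|s,a)} <= var{eta(s,a) r} in a
# hypothetical 2-state, 2-action bandit with r|s,a ~ N(q(s,a), sigma2).
p_s = np.array([0.5, 0.5])                    # p(s)
pi_b = np.full((2, 2), 0.5)                   # pi[s, a]
pi_e = np.array([[0.7, 0.3], [0.2, 0.8]])
q = np.array([[0.0, 1.0], [1.0, 2.0]])        # q[s, a] = E[r|s, a]
sigma2 = 1.0
eta = pi_e / pi_b                             # density ratio eta(s, a)

v = (pi_e * q).sum(axis=1)                    # v(s) = E_pie[q(s, a)|s]
value = p_s @ v                               # estimand E_pie[r]

lhs = p_s @ (v - value) ** 2 + (p_s[:, None] * pi_b * eta**2).sum() * sigma2
second_moment = (p_s[:, None] * pi_b * eta**2 * (q**2 + sigma2)).sum()
rhs = second_moment - value**2                # var(eta(s,a) * r) under pi_b
print(lhs <= rhs)  # True
```

The gap between the two sides is exactly E[η²q²] − E[v²] ≥ 0, which is the variance removed by conditioning on s.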
8. Common estimators
IS (Importance sampling a.k.a IPW, HorvitzThompson);
EN [ˆη(s, a)r] ,
πe(a|s)
πb(a|s)
= η(s, a)
NIS (Normalized IS);
EN [ˆη(a, s)r/EN[ˆη(s, a)]]
DM (Direct method); EN[v̂(s)], where q(s,a) = E[r|s,a] and v̂(s) = E_{πe}[q̂(s,a)|s]
AIS (Augmented IS (Robins et al., 1994; Dudik et al., 2014));
EN[η̂(s,a)(r − q̂(s,a)) + v̂(s)], where v̂(s) = E_{πe}[q̂(s,a)|s]
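A minimal simulation sketch of the four estimators; the bandit distribution, policies, and oracle nuisances below are all hypothetical choices for illustration, not from the slides.

```python
import numpy as np

# Hypothetical contextual bandit: s ~ N(0,1), binary action, q(s,a) = s*a.
# Oracle eta and q are plugged in purely for illustration.
rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

N = 50_000
s = rng.normal(size=N)
pb = sigmoid(s)                                # behavior policy pi_b(a=1|s)
pe = sigmoid(-s)                               # evaluation policy pi_e(a=1|s)
a = rng.binomial(1, pb)
r = s * a + 0.1 * rng.normal(size=N)           # E[r|s,a] = q(s,a) = s*a

eta_hat = np.where(a == 1, pe / pb, (1 - pe) / (1 - pb))
q_hat = s * a                                  # oracle q at the observed (s,a)
v_hat = pe * s                                 # v(s) = E_pie[q(s,a)|s]

est_is  = np.mean(eta_hat * r)                          # IS / IPW
est_nis = np.mean(eta_hat * r) / np.mean(eta_hat)       # normalized IS
est_dm  = np.mean(v_hat)                                # direct method
est_ais = np.mean(eta_hat * (r - q_hat) + v_hat)        # augmented IS
print(est_is, est_nis, est_dm, est_ais)  # all close to E_pie[r]
```

With oracle nuisances all four agree; the differences appear once η̂ or q̂ is mis-specified, which is the subject of the next slides.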
9. Useful properties for AIS 1
Model double robustness (in terms of consistency and √N-consistency):
AIS is consistent if at least one of η̂(s,a) and q̂(s,a) is consistent.
[The original slide shows this as a colored 2×2 table over η(s,a) ≈ η̂(s,a)? and q(s,a) ≈ q̂(s,a)?]
10. Useful properties for AIS 2
Rate double robustness
‖η̂ − η‖₂ = o_p(N^{−1/4}) and ‖q̂ − q‖₂ = o_p(N^{−1/4}) are sufficient
conditions to guarantee efficiency (Chernozhukov et al., 2018; Rotnitzky and
Smucler, 2019)
Fact regarding plug-in
For a generic estimating equation, even if nuisance functions are
estimated at a parametric √N-rate, the asymptotic variance generally
changes under plug-in
Thanks to the orthogonality of the IF, the asymptotic variance does not
change under plug-in (Rotnitzky et al., 2019)
11. Double robust IS or Double robust direct estimator
Double robust regression estimator (Scharfstein et al., 1999; Kang
and Schafer, 2007)
Learn q(s,a) by a regression that includes η̂(s,a) as a covariate
(weighted regression).
Define the estimator as EN[q̂(s,a)]
This is double robust!!
Close to TMLE
Double robust IS estimator (Robins et al., 2007)
Learn η(s,a) with covariates based on q̂(s,a)
Define the IS estimator EN[η̂(s,a)r]
This is double robust
Close to TMLE
12. More doubly robust (MDR) estimator
Motivation... AIS has poor performance when q(s,a) is mis-specified
(Rubin and van Der Laan, 2008; Cao et al., 2009)
MDR
MDR minimizes the variance over a class of estimators, irrespective of
the model specification of q(s,a)
When the behavior policy is known, the Q-function is estimated by
minimizing the (empirical) variance of the DR estimator;
q̂ = argmin_{q ∈ F_q} var{η(s,a)(r − q(s,a)) + v_q(s)},
where v_q(s) = E_{πe}[q(s,a)|s]. Then, plug it into DR.
(Property)... Still double robust
Can be extended to the case when the behavior policy is unknown.
Extension to RL (Farajtabar et al., 2018)
13. Intrinsic efficient estimator
Motivation... The performance of AIS can become worse than that of IS or
NIS when q-models are mis-specified.
Intrinsic efficient estimator (Tan, 2006, 2010)
Construct a class of estimators including IS and NIS, and optimize within
the class so that the variance is minimized
(Property)... Still double robust, and no worse than IS and NIS
Extension to RL (Kallus and Uehara, 2019c)
14. Bias reduced estimator (Vermeulen and Vansteelandt,
2015)
(Motivation)... What happens when both models are mis-specified?
Vermeulen and Vansteelandt (2015) introduced an estimator based on
the idea of reducing the MSE irrespective of model specification.
(Property)...Double robust and robust to model-misspecifications!!
15. Nonparametric IS (Hirano et al., 2003)
IS with πb estimated nonparametrically
This achieves the efficiency bound under some smoothness conditions
Plug-in paradox (Robins et al., 1992; Henmi and Eguchi, 2004; Henmi
et al., 2007)
A plug-in estimator based on the MLE is more efficient than the
non-plug-in estimator
If so, is the plug-in IS estimator better than the non-plug-in one?
Yes, if the models are well-specified; this amounts to using a control
variate (Robins et al., 2007)
No, if the models are mis-specified
16. Nonparametric direct method
Hahn (1998) introduced an estimator based on the direct method, with
q(s,a) estimated nonparametrically
This achieves the efficiency bound under some smoothness conditions
Parametric direct method
A.k.a. the G-formula (Hernan and Robins, 2019)
We can also assume a parametric model for q(s,a) directly
(semiparametric direct method).
The efficiency bound under a parametric q-model is smaller than the
efficiency bound under the nonparametric model (Tan, 2007)
17. Double debiased machine learning (Chernozhukov et al.,
2018)
The estimator is EN[µ̂(s,a)(r − q̂(s,a)) + v̂(s)], where E_{πe}[r|s] = v(s),
with cross-fitting (a.k.a. sample splitting) (van der Vaart, 1998)
Both µ (the density ratio) and q are estimated nonparametrically.
Rate double robustness is attained without Donsker conditions on the
nuisance estimators
18. TMLE (Rubin, 2006; van der Laan, 2011; Benkeser et al.,
2017)
TMLE??... Update the estimator based on the efficient influence
function of the target (the Super-learner is also used here)
When the EIF can be written analytically, TMLE reduces to a one-step
estimator. See Page 11.
When the EIF does not have a closed form, it is an iterative estimator.
Collaborative double robustness (van Der Laan and Gruber, 2010;
Diaz, 2018)
19. Other important estimators
Switching estimator (Tsiatis and Davidian, 2007; Wang et al., 2017)
Matching estimator (Abadie and Imbens, 2006; Wang and
Zubizarreta, 2019a)
Covariate balancing with various divergences (Imai and Ratkovic,
2014; Wang and Zubizarreta, 2019b)
Minimax estimator (Kallus, 2018; Chernozhukov et al., 2018;
Hirshberg and Wager, 2019)
High dimensional setting (Many... E.g. Farrell (2015); Smucler et al.
(2019))
Continuous treatment (estimand is the difference) (Kennedy et al.,
2017)
Finite population inference (Bojinov and Shephard, 2019)
Multiple robustness (Rotnitzky et al., 2017)
20. RL setting (Application)
Figure: ADHD Example [Chakraborty,2009]
21. Summary of RL situation
Table: Efficiency bounds and estimators for OPE

       Efficiency bound             Efficient estimator
NMDP   Kallus and Uehara (2019a)    Jiang and Li (2016); Thomas and Brunskill (2016)
TMDP   Kallus and Uehara (2019a)    Kallus and Uehara (2019a)
MDP    Kallus and Uehara (2019b)    Kallus and Uehara (2019b)

Jiang and Li (2016) also calculated the bounds under NMDP and TMDP for
the tabular case.
Note that the efficiency bound and estimator under NMDP are essentially
already given in the causal inference literature (Murphy, 2003; van Der Laan
and Robins, 2003; Bang and Robins, 2005)
22. MDP
MDP = {S, A, R, p}
S, A, R... state space, action space, reward space
Transition density... p(s′|s,a)
Reward distribution... p(r|s,a)
Initial distribution... p^{(0)}(s0)
Evaluation policy πe(a|s), behavior policy πb(a|s)
The distribution induced by the MDP and the behavior policy is
p(s0, a0, r0, s1, a1, r1, s2, a2, r2, ···)
= p^{(0)}(s0) πb(a0|s0) p(r0|s0,a0) p(s1|s0,a0) πb(a1|s1) p(r1|s1,a1) ···.
Figure: MDP
23. NMDP and TMDP
The MDP can be relaxed in two ways; NMDP (without Markovness) and
TMDP (without time-invariance)
Figure: NMDP (Non-Markov Decision process)
Figure: TMDP (Time-varying Markov Decision Process)
24. Goal in OPE for RL
[Goal]; Estimate ρ^{πe};
ρ^{πe} = (1 − γ) Σ_{t=0}^∞ E_{πe}[γ^t r_t], (γ < 1)
Note that this expectation is taken w.r.t.
p_e^{(0)}(s0) πe(a0|s0) p(r0|s0,a0) p(s1|s0,a0) πe(a1|s1) p(r1|s1,a1) ···.
We can use a set of samples generated by the MDP and the behavior policy
πb;
{s_t^{(i)}, a_t^{(i)}, r_t^{(i)}}_{i=1,t=0}^{N,T},
following
p_b^{(0)}(s0) πb(a0|s0) p(r0|s0,a0) p(s1|s0,a0) πb(a1|s1) p(r1|s1,a1) ···.
25. Common three approaches
DM (Direct Method) estimator
ρ̂_DM = (1 − γ) EN[E_{πe}[q̂(s0, a0)|s0]],
where q(s0, a0) = E[Σ_{t=0}^∞ γ^t r_t | s0, a0].
SIS (Sequential Importance Sampling) estimator
ρ̂_SIS = (1 − γ) EN[Σ_{t=0}^T γ^t ν_t r_t],
where
ν_t = ∏_{k=0}^t η_k(s_k, a_k), η_k(s_k, a_k) = πe(a_k|s_k)/πb(a_k|s_k).
Double Robust (DR) estimator (Jiang and Li, 2016; Thomas and
Brunskill, 2016)
ρ̂_DR = (1 − γ) EN[Σ_{t=0}^T γ^t (ν_t(r_t − q̂_t) + ν_{t−1} v̂_t(s_t))].
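The SIS and DR estimators can be sketched on a small synthetic problem. The tabular MDP, policies, and horizon below are hypothetical; q_t is computed exactly by backward recursion and stands in for q̂_t.

```python
import numpy as np

# Hypothetical 2-state, 2-action tabular MDP with horizon T; q_t is computed
# exactly by backward recursion and stands in for the estimate q-hat_t.
rng = np.random.default_rng(3)
P = np.zeros((2, 2, 2))                        # P[s, a, s']
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.3, 0.7]
P[1, 0] = [0.6, 0.4]; P[1, 1] = [0.2, 0.8]
R = np.array([[1.0, 0.0], [0.0, 1.0]])         # deterministic reward r(s, a)
pi_b = np.full((2, 2), 0.5)
pi_e = np.array([[0.8, 0.2], [0.3, 0.7]])
gamma, T, N = 0.9, 10, 5_000

# Backward recursion: q_t(s,a) = r(s,a) + gamma * E[v_{t+1}(s') | s, a].
q, v = [None] * (T + 1), [None] * (T + 1)
v_next = np.zeros(2)
for t in range(T, -1, -1):
    q[t] = R + gamma * (P @ v_next)
    v[t] = (pi_e * q[t]).sum(axis=1)           # v_t(s) = E_pie[q_t(s,a)|s]
    v_next = v[t]

# True (1-gamma)-scaled value by forward recursion under pi_e.
ps, rho = np.array([1.0, 0.0]), 0.0
for t in range(T + 1):
    psa = ps[:, None] * pi_e
    rho += (1 - gamma) * gamma**t * (psa * R).sum()
    ps = np.einsum('sa,sap->p', psa, P)

# SIS and DR from N trajectories generated under pi_b.
sis_sum = dr_sum = 0.0
for _ in range(N):
    s, prev_nu = 0, 1.0                        # nu_{-1} = 1
    traj_sis = traj_dr = 0.0
    for t in range(T + 1):
        a = 0 if rng.random() < pi_b[s, 0] else 1
        r = R[s, a]
        nu = prev_nu * pi_e[s, a] / pi_b[s, a]
        traj_sis += gamma**t * nu * r
        traj_dr += gamma**t * (nu * (r - q[t][s, a]) + prev_nu * v[t][s])
        prev_nu = nu
        s = 0 if rng.random() < P[s, a, 0] else 1
    sis_sum += traj_sis
    dr_sum += traj_dr

rho_sis = (1 - gamma) * sis_sum / N
rho_dr = (1 - gamma) * dr_sum / N
print(rho, rho_sis, rho_dr)
```

Both estimators recover ρ here; the DR correction term has mean zero whenever either the weights or the q-functions are correct.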
26. Curse of horizon
E_{πe}[Σ_{t=0}^T γ^t r_t] = E_{πb}[(∏_{k=0}^T πe(a_k|s_k)/πb(a_k|s_k)) Σ_{t=0}^T γ^t r_t]
= E_{πb}[Σ_{t=0}^T (∏_{k=0}^t πe(a_k|s_k)/πb(a_k|s_k)) γ^t r_t]
= E_{πb}[Σ_{t=0}^T γ^t ν_t r_t] ≈ EN[Σ_{t=0}^T γ^t ν_t r_t]
Problem: Variance grows exponentially w.r.t. T.
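The blow-up can be illustrated numerically. The per-step ratios below are hypothetical i.i.d. lognormals with mean one (so Var(ν_T) = e^{σ²T} − 1 in closed form), not ratios from an actual MDP.

```python
import numpy as np

# Cumulative weight nu_T = prod_{k=0}^{T-1} eta_k with hypothetical i.i.d.
# lognormal ratios satisfying E[eta_k] = 1. Then Var(nu_T) = exp(sigma^2 T) - 1,
# i.e. exponential growth in the horizon T.
rng = np.random.default_rng(2)
sigma = 0.3
variances = []
for T in (1, 5, 10, 20):
    eta = rng.lognormal(mean=-sigma**2 / 2, sigma=sigma, size=(100_000, T))
    variances.append(eta.prod(axis=1).var())
print(variances)  # increasing roughly like exp(0.09 * T) - 1
```

Even though every ν_T has mean one, a few huge products dominate the sample average for large T, which is exactly the curse of horizon.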
27. Curse of horizon
The SIS and DR estimators suffer from the curse of horizon
The DM estimator does not, but it suffers from model mis-specification.
Q; Are there any solutions?
A; The MDP assumptions are not fully exploited:
the Markov assumption
the time-invariance assumption
28. Leveraging Markovness
Xie et al. (2019) proposed a marginal importance sampling estimator;
E_{πe}[Σ_{t=0}^T γ^t r_t] = E_{πb}[Σ_{t=0}^T (∏_{k=0}^t πe(a_k|s_k)/πb(a_k|s_k)) γ^t r_t]
= E_{πb}[Σ_{t=0}^T γ^t µ_t r_t] ≈ EN[Σ_{t=0}^T γ^t µ_t r_t].
Here, µ_t is the marginal density ratio at time t;
µ_t = p_{πe}(s_t, a_t) / p_{πb}(s_t, a_t).
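In the tabular case the marginal ratio can be computed by forward recursion over state marginals; the MDP below is hypothetical, chosen to show that µ_t stays bounded in t, unlike the cumulative product ν_t.

```python
import numpy as np

# Hypothetical 2-state, 2-action tabular MDP: the marginal ratio
# mu_t = p_pie(s_t, a_t) / p_pib(s_t, a_t) is computed by forward recursion
# over the state marginals and stays bounded as t grows.
P = np.zeros((2, 2, 2))                        # P[s, a, s']
P[0, 0] = [0.9, 0.1]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.7, 0.3]; P[1, 1] = [0.1, 0.9]
pi_b = np.full((2, 2), 0.5)                    # pi[s, a]
pi_e = np.array([[0.9, 0.1], [0.2, 0.8]])
p0 = np.array([0.6, 0.4])                      # initial state distribution

def occupancies(pi, T):
    """p(s_t, a_t) for t = 0..T-1 under policy pi."""
    ps, out = p0.copy(), []
    for _ in range(T):
        psa = ps[:, None] * pi                 # p(s_t, a_t) = p(s_t) pi(a_t|s_t)
        out.append(psa)
        ps = np.einsum('sa,sap->p', psa, P)    # marginal p(s_{t+1})
    return out

T = 50
mu = [oe / ob for oe, ob in zip(occupancies(pi_e, T), occupancies(pi_b, T))]
print(max(m.max() for m in mu))  # stays bounded in t
```

At t = 0 the ratio equals πe/πb exactly; for later t it converges to the ratio of the stationary occupancies, so its magnitude does not compound with the horizon.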
29. Efficiency bound under NMDP and TMDP
Theorem (EB under NMDP)
EB(M1) = (1 − γ)² Σ_{k=1}^∞ E[γ^{2(k−1)} ν²_{k−1} var(r_{k−1} + v_k | H_{a_{k−1}})].
Theorem (EB under TMDP)
EB(M2) = (1 − γ)² Σ_{k=1}^∞ E[γ^{2(k−1)} µ²_{k−1} var(r_{k−1} + v_k | a_{k−1}, s_{k−1})].
The typical behavior of ν²_{k−1} is O(C^k); the typical behavior of µ²_{k−1} is O(1).
The variance does not grow exponentially w.r.t. T under TMDP
30. Double Reinforcement learning (for TMDP)
Kallus and Uehara (2019a) proposed an estimator (DRL) achieving the
efficiency bound under TMDP;
ρ̂_DRL(M2) = (1 − γ) EN[Σ_{t=0}^T γ^t (µ̂_t(r_t − q̂_t) + µ̂_{t−1} v̂_t(s_t))].
31. Double robustness of DRL for TMDP
Model double robustness (Also, rate double robustness)
µt(s, a) ≈ ˆµt(s, a)? qt(s, a) ≈ ˆqt(s, a)?
Red... Consistent, Green.... Not Consistent
32. Curse of horizon (Again)
Q: Is the curse of horizon solved?
A: At least, the variance does not blow up w.r.t. the horizon. But the rate is
still not the right one under MDP
33. Correct rate for OPE under Ergodic MDP
The rate (MSE) of the estimators introduced so far is 1/N
However, we can learn the estimand at a 1/(NT) rate assuming
ergodicity.
Importantly, we can then learn from a single trajectory (N = 1, T → ∞)
34. Leveraging Time-invariance
Liu et al. (2018) proposed an ergodic importance sampling estimator;
lim_{T→∞} (1 − γ) E_{πe}[Σ_{t=0}^T γ^t r_t] = ∫ r p^∞_{e,γ}(s,a) dµ(s,a,r)
= ∫ r [p^∞_{e,γ}(s,a) / p^∞_b(s,a)] p^∞_b(s,a) dµ(s,a,r)
= E_{p^∞_b}[r w(s,a)]
≈ EN ET[r w(s,a)] = (1/N)(1/T) Σ_{i=1}^N Σ_{t=1}^T r_t^{(i)} w(s_t^{(i)}, a_t^{(i)}),
where p^∞_{e,γ}(s,a) is the (discounted) average visitation distribution of
states and actions under πe, and
w(s,a) = p^∞_{e,γ}(s,a) / p^∞_b(s,a).
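A sketch of this pipeline in a tabular toy (everything below is hypothetical): w(s,a) is computed in closed form from the discounted occupancy of πe and the stationary distribution of πb, and ρ is then estimated from a single long behavior trajectory.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP. w(s,a) = p_{e,gamma}(s,a) / p_b(s,a)
# is computed in closed form; rho is then estimated from one long trajectory
# generated under pi_b (N = 1, T large).
rng = np.random.default_rng(4)
P = np.zeros((2, 2, 2))                        # P[s, a, s']
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.3, 0.7]
P[1, 0] = [0.6, 0.4]; P[1, 1] = [0.2, 0.8]
R = np.array([[1.0, 0.0], [0.0, 1.0]])         # deterministic reward r(s, a)
pi_b = np.full((2, 2), 0.5)
pi_e = np.array([[0.8, 0.2], [0.3, 0.7]])
gamma = 0.9
p0 = np.array([1.0, 0.0])

def state_kernel(pi):
    return np.einsum('sa,sap->sp', pi, P)      # P_pi[s, s']

# Discounted occupancy d_e(s) = (1 - gamma) * sum_t gamma^t p_pie(s_t = s):
d_e = (1 - gamma) * np.linalg.solve(np.eye(2) - gamma * state_kernel(pi_e).T, p0)
# Stationary distribution of the behavior chain (Perron eigenvector):
evals, evecs = np.linalg.eig(state_kernel(pi_b).T)
d_b = np.real(evecs[:, np.argmax(np.real(evals))])
d_b = d_b / d_b.sum()

w = (d_e[:, None] * pi_e) / (d_b[:, None] * pi_b)   # w(s, a)
rho_true = (d_e[:, None] * pi_e * R).sum()          # (1 - gamma)-scaled value

# Single long behavior trajectory:
T, s, total = 200_000, 0, 0.0
for _ in range(T):
    a = 0 if rng.random() < pi_b[s, 0] else 1
    total += R[s, a] * w[s, a]
    s = 0 if rng.random() < P[s, a, 0] else 1
rho_hat = total / T
print(rho_true, rho_hat)
```

The weight w(s,a) depends only on the current state-action pair, so its magnitude, and hence the variance, does not grow with the trajectory length.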
35. Efficiency bound under Ergodic MDP
The lower bound of the asymptotic MSE scaled by NT among regular
estimators is
EB(M3) = E_{p^∞_b}[ w²(s,a) {r + γ v(s′) − q(s,a)}² ],
where w²(s,a) is the distribution-mismatch term and
{r + γ v(s′) − q(s,a)}² is the squared Bellman residual.

Table: Comparison regarding rate

       Rate       Curse of horizon
NMDP   O(1/N)     Yes
TMDP   O(1/N)     No, but the rate is still 1/N
MDP    O(1/(NT))  No
36. Efficient estimator under ergodic MDP
Defining v(s0) = E_{πe}[q(s0, a0)|s0], the efficient estimator ρ̂_DRL(M3) is
defined as follows;
(1 − γ) EN E_{p^{(0)}_e}[v̂(s0)] + EN ET[ŵ(s,a)(r + γ v̂(s′) − q̂(s,a))],
or, equivalently,
EN ET[ŵ(s,a) r] + (1 − γ) EN E_{p^{(0)}_e}[v̂(s0)] + EN ET[ŵ(s,a)(γ v̂(s′) − q̂(s,a))].
The leading term (red on the original slide) is the DM estimator in the first
form and the IS estimator in the second; the remaining terms (blue)
correspond to control variates.
37. Double robustness of DRL for MDP
Model double robustness (also, rate double robustness):
the estimator is consistent if at least one of ŵ(s,a) and q̂(s,a) is consistent.
[The original slide shows this as a colored 2×2 table over w(s,a) ≈ ŵ(s,a)? and q(s,a) ≈ q̂(s,a)?]
38. General causal DAG (when all of variables are measured)
Given a causal DAG (FFRCISTG), the G-formula or the IS estimator gives
an identification formula (Hernan and Robins, 2019).
How to obtain an efficient estimator? ... See van Der Laan and
Robins (2003)
The question is how far the estimator can be simplified (Rotnitzky and
Smucler, 2019).
39. General causal DAG (With unmeasured variables)
The ID algorithm (Shpitser and Pearl, 2008; Tian, 2008; Shpitser and
Sherman, 2018) gives a sufficient and necessary identification formula.
The relation to efficient estimation is still an open problem.
Figure: With unmeasured confounding
40. Mediation effect (Pathway effect)
The modified ID algorithm (Shpitser and Sherman, 2018) gives a sufficient
and necessary identification formula.
The efficiency theory is still being constructed (Nabi et al., 2018).
Figure: Edge intervention
41. Network, interference
Some estimation methods and their theory (Ogburn et al., 2017).
Chain graphs are also useful in the network setting (Ogburn et al., 2018)
Figure: Chain Graph
An identification formula is given by Sherman and Shpitser (2018)
Since units are not i.i.d., estimation is difficult
E.g., to estimate from a single network, ergodicity is needed.
Ordinary semiparametric theory assumes i.i.d. data.
42. Ref I
Abadie, A. and G. W. Imbens (2006). Large sample properties of matching estimators
for average treatment effects. Econometrica 74, 235–267.
Bang, H. and J. M. Robins (2005). Doubly robust estimation in missing data and causal
inference models. Biometrics 61, 962–973.
Benkeser, D., M. Carone, M. J. V. D. Laan, and P. B. Gilbert (2017). Doubly robust
nonparametric inference on the average treatment effect. Biometrika 104, 863–880.
Bickel, P. J., C. A. J. Klaassen, Y. Ritov, and J. A. Wellner (1998). Efficient and
Adaptive Estimation for Semiparametric Models. Springer.
Bojinov, I. and N. Shephard (2019). Time series experiments and causal estimands:
exact randomization tests and trading. Journal of the American Statistical
Association.
Cao, W., A. A. Tsiatis, and M. Davidian (2009). Improving efficiency and robustness of
the doubly robust estimator for a population mean with incomplete data.
Biometrika 96, 723–734.
Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and
J. Robins (2018). Double/debiased machine learning for treatment and structural
parameters. Econometrics Journal 21, C1–C68.
43. Ref II
Chernozhukov, V., W. Newey, J. Robins, and R. Singh (2018). Double/de-biased
machine learning of global and local parameters using regularized riesz representers.
arXiv.org.
Diaz, I. (2018). Doubly robust estimators for the average treatment effect under
positivity violations: introducing the e-score. arXiv.org.
Diaz, I. (2019). Machine learning in the estimation of causal effects: targeted minimum
loss-based estimation and double/debiased machine learning. Biostatistics.
Dudik, M., D. Erhan, J. Langford, and L. Li (2014). Doubly robust policy evaluation
and optimization. Statistical Science 29, 485–511.
Farajtabar, M., Y. Chow, and M. Ghavamzadeh (2018). More robust doubly robust
off-policy evaluation. In Proceedings of the 35th International Conference on Machine
Learning, 1447–1456.
Farrell, M. H. (2015). Robust inference on average treatment effects with possibly more
covariates than observations. Journal of Econometrics 189, 1–23.
Hahn, J. (1998). On the role of the propensity score in efficient semiparametric
estimation of average treatment effects. Econometrica 66, 315–331.
Henmi, M. and S. Eguchi (2004). A paradox concerning nuisance parameters and
projected estimating functions. Biometrika 91, 929–941.
44. Ref III
Henmi, M., R. Yoshida, and S. Eguchi (2007). Importance sampling via the estimated
sampler. Biometrika 94, 985–991.
Hernan, M. and J. Robins (2019). Causal Inference. Boca Raton: Chapman &
Hall/CRC.
Hirano, K., G. Imbens, and G. Ridder (2003). Efficient estimation of average treatment
effects using the estimated propensity score. Econometrica 71, 1161–1189.
Hirshberg, D. and S. Wager (2019). Augmented minimax linear estimation. arXiv.org.
Huber, M. (2019). An introduction to flexible methods for policy evaluation. arXiv.org.
Imai, K. and M. Ratkovic (2014). Covariate balancing propensity score. J. R. Statist.
Soc. B 76, 243–263.
Jiang, N. and L. Li (2016). Doubly robust off-policy value evaluation for reinforcement
learning. In Proceedings of the 33rd International Conference on International
Conference on Machine Learning-Volume, 652–661.
Kallus, N. (2018). Balanced policy evaluation and learning. In Advances in Neural
Information Processing Systems 31, pp. 8895–8906.
Kallus, N. and M. Uehara (2019a). Double reinforcement learning for efficient off-policy
evaluation in markov decision processes. arXiv preprint arXiv:1908.08526.
45. Ref IV
Kallus, N. and M. Uehara (2019b). Efficiently breaking the curse of horizon: Double
reinforcement learning in infinite-horizon processes. arXiv preprint arXiv:1909.05850.
Kallus, N. and M. Uehara (2019c). Intrinsically efficient, stable, and bounded off-policy
evaluation for reinforcement learning. In Advances in Neural Information Processing
Systems 32, pp. 3320–3329.
Kang, J. D. Y. and J. L. Schafer (2007). Demystifying double robustness: A comparison
of alternative strategies for estimating a population mean from incomplete data.
Statistical Science 22, 523–529.
Kennedy, E. (2016). Semiparametric theory and empirical processes in causal inference.
arXiv.org.
Kennedy, E. H., Z. Ma, M. D. Mchugh, and D. S. Small (2017). Nonparametric
methods for doubly robust estimation of continuous treatment effects. Journal of the
Royal Statistical Society: Series B (Statistical Methodology) 79, 1229–1245.
Liu, Q., L. Li, Z. Tang, and D. Zhou (2018). Breaking the curse of horizon:
Infinite-horizon off-policy estimation. In Advances in Neural Information Processing
Systems 31, pp. 5356–5366.
Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal
Statistical Society: Series B (Statistical Methodology) 65, 331–355.
46. Ref V
Nabi, R., P. Kanki, and I. Shpitser (2018). Estimation of personalized effects associated
with causal pathways. In Proceedings of the Conference on Uncertainty in Artificial
Intelligence (UAI) 2018.
Ogburn, E., O. Sofrygin, and I. Diaz (2017). Causal inference for social network data.
arXiv.org.
Ogburn, E. L., I. Shpitser, and Y. Lee (2018). Causal inference, social networks, and
chain graphs. arXiv.org.
Robins, J., M. Sued, Q. Lei-Gomez, and A. Rotnitzky (2007). Comment: Performance
of double-robust estimators when "inverse probability" weights are highly variable.
Statistical Science 22, 544–559.
Robins, J. M., S. D. Mark, and W. K. Newey (1992). Estimating exposure effects by
modelling the expectation of exposure conditional on confounders. Biometrics 48,
479–495.
Robins, J. M., A. Rotnitzky, and L. P. Zhao (1994). Estimation of regression coefficients
when some regressors are not always observed. Journal of the American Statistical
Association 89, 846–866.
Rotnitzky, A., J. Robins, and L. Babino (2017). On the multiply robust estimation of
the mean of the g-functional. arXiv preprint arXiv:1705.08582.
47. Ref VI
Rotnitzky, A. and E. Smucler (2019). Efficient adjustment sets for population average
treatment effect estimation in non-parametric causal graphical models. arXiv.org.
Rotnitzky, A., E. Smucler, and J. Robins (2019). Characterization of parameters with a
mixed bias property. arXiv preprint arXiv:1509.02556.
Rotnitzky, A. and S. Vansteelandt (2014). Double-robust methods. In Handbook of
Missing Data Methodology, Handbooks of Modern Statistical Methods, pp. 185–212.
Chapman and Hall/CRC.
Rubin, D. (2006). Targeted maximum likelihood learning. The International Journal of
Biostatistics 2, 1043–1043.
Rubin, D. B. and M. J. van Der Laan (2008). Empirical efficiency maximization:
improved locally efficient covariate adjustment in randomized experiments and
survival analysis. The international journal of biostatistics 4.
Scharfstein, D., A. Rotnitzky, and J. M. Robins (1999). Adjusting for nonignorable
dropout using semi-parametric models. Journal of the American Statistical
Association 94, 1096–1146.
Seaman, S. R. and S. Vansteelandt (2018). Introduction to double robust methods for
incomplete data. Statistical science 33, 184–197.
48. Ref VII
Sherman, E. and I. Shpitser (2018). Identification and estimation of causal effects from
dependent data. In Advances in Neural Information Processing Systems 31, pp.
9424–9435.
Shpitser, I. and J. Pearl (2008). Complete identification methods for the causal
hierarchy. Journal of Machine Learning Research 9, 1941–1979.
Shpitser, I. and E. Sherman (2018). Identification of personalized effects associated with
causal pathways. In Proceedings of the Conference on Uncertainty in Artificial
Intelligence (UAI) 2018.
Smucler, E., A. Rotnitzky, and J. M. Robins (2019). A unifying approach for
doubly-robust ℓ1 regularized estimation of causal contrasts. arXiv preprint
arXiv:1904.03737.
Tan, Z. (2006). A distributional approach for causal inference using propensity scores.
Journal of the American Statistical Association 101, 1619–1637.
Tan, Z. (2007). Comment: Understanding or, ps and dr. Statistical Science 22,
560–568.
Tan, Z. (2010). Bounded, efficient and doubly robust estimation with inverse weighting.
Biometrika 97, 661–682.
49. Ref VIII
Thomas, P. and E. Brunskill (2016). Data-efficient off-policy policy evaluation for
reinforcement learning. In Proceedings of the 33rd International Conference on
Machine Learning, 2139–2148.
Tian, J. (2008). Identifying dynamic sequential plans. In Proceedings of the
Twenty-Fourth Annual Conference on Uncertainty in Artificial Intelligence
(UAI-08), 554–561.
Tsiatis, A. A. (2006). Semiparametric Theory and Missing Data. Springer Series in
Statistics. New York, NY: Springer New York.
Tsiatis, A. A. and M. Davidian (2007). Comment: Demystifying double robustness: A
comparison of alternative strategies for estimating a population mean from
incomplete data. Statistical science 22, 569–573.
van Der Laan, M. and S. Gruber (2010). Collaborative double robust targeted maximum
likelihood estimation. International Journal of Biostatistics 6, 1181–1181.
van der Laan, M. J. (2011). Targeted Learning: Causal Inference for Observational and
Experimental Data (1st ed.). Springer Series in Statistics. New York, NY:
Springer.
50. Ref IX
van Der Laan, M. J. and J. M. Robins (2003). Unified Methods for Censored
Longitudinal Data and Causality. Springer Series in Statistics,. New York, NY:
Springer New York.
van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge, UK: Cambridge
University Press.
Vermeulen, K. and S. Vansteelandt (2015). Bias-reduced doubly robust estimation.
Journal of the American Statistical Association 110, 1024–1036.
Wang, Y. and J. Zubizarreta (2019a). Large sample properties of matching for balance.
arXiv.org.
Wang, Y. and J. Zubizarreta (2019b). Minimal dispersion approximately balancing
weights: Asymptotic properties and practical considerations. arXiv.org.
Wang, Y.-X., A. Agarwal, and M. Dudík (2017). Optimal and adaptive off-policy
evaluation in contextual bandits. In Proceedings of the 34th International Conference
on Machine Learning, Volume 70, pp. 3589–3597.
Xie, T., Y. Ma, and Y.-X. Wang (2019). Towards optimal off-policy evaluation for
reinforcement learning with marginalized importance sampling. In Advances in Neural
Information Processing Systems 32, pp. 9665–9675.