Survey of the computational complexity and computability of sequential decision making (games, planning).
Contains two more detailed proofs:
- EXPSPACE-completeness of unobservable adversarial planning for the existence of a 100% winning strategy (Haslum & Jonsson)
- undecidability of unobservable adversarial planning for an arbitrary winning rate (including optimal play in the Nash sense)
Complexity of planning and games with partial information
1. Sequential decision making:
decidability and complexity
Searching with partial
observation
Olivier.Teytaud@inria.fr + too many people to all be cited. Includes Inria, Cnrs, Univ.
Paris-Sud, LRI, Taiwan universities (including NUTN), CITINES project,
TAO, Inria-Saclay IDF, Cnrs 8623,
Lri, Univ. Paris-Sud,
Digiteo Labs, Pascal
Network of Excellence.
Bielefeld
September 2012.
2. A quite general model
A directed graph (finite).
A starting point on the graph, a target (or
several targets, with different rewards).
I want to reach a target.
Labels (= decisions) on edges:
Next node = f(current node, decision)
Each node is either:
- a random node (random decision),
- a decision node (I choose a decision), or
- an opponent node (an opponent chooses).
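The model above can be sketched in a few lines of code; the graph, labels, and node kinds below are an illustrative toy instance, not something from the talk:

```python
import random

# Illustrative toy instance of the model: each node is a 'decision',
# 'random', or 'opponent' node; edges carry labels (= decisions).
GRAPH = {
    "start": {"kind": "decision", "edges": {"a": "risk", "b": "lose"}},
    "risk":  {"kind": "random",   "edges": {"h": "goal", "t": "lose"}},
    "goal":  {"kind": "decision", "edges": {}},  # target (reward 1)
    "lose":  {"kind": "decision", "edges": {}},  # dead end (reward 0)
}

def step(node, decision=None, rng=random):
    """Next node = f(current node, decision); random nodes draw a label."""
    edges = GRAPH[node]["edges"]
    if GRAPH[node]["kind"] == "random":
        decision = rng.choice(sorted(edges))
    return edges[decision]

node = step("start", "a")                 # my decision at a decision node
node = step(node, rng=random.Random(0))   # a random node resolves by itself
```

Opponent nodes would work like decision nodes, with the other player choosing the label.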
3. Partial observation
Each decision node
is equipped with an observation;
you can only make decisions using
the list of past observations
==> you don't know
where you are in the graph.
4. Overview
● 10%: overview of Alternating Turing
machine & computational complexity
(great tool for complexity upper bounds)
● 50%: general culture on games
(including undecidability)
● 35%: general culture on fictitious play
(matrix games) (probably no time for this...)
● 4%: my results on that stuff
==> 2 detailed proofs (one new)
==> feel free to interrupt
5. Outline
● Complexity and ATM
● Complexity and games (incl. planning)
● Bounded horizon games
7. Complexity and alternating
Turing machines
● Turing machine (TM) = abstract computer
● Non-deterministic Turing Machine (NTM)
= TM with “exists” states (i.e. several
transitions; accepts if at least one
transition accepts)
● Co-NTM: TM with “for all” states (i.e.
several transitions; accepts if all
transitions lead to accept)
● ATM: TM with both “exists” and “for all”
states.
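The acceptance condition of an ATM can be read as an AND-OR evaluation over its configuration graph: “exists” nodes accept if some successor accepts, “for all” nodes if every successor does. A minimal sketch on an invented toy graph:

```python
# Toy sketch (not from the slides) of ATM acceptance as AND-OR evaluation.
def accepts(node, kind, succ):
    """kind[n] is 'exists', 'forall', or a boolean at the leaves."""
    k = kind[node]
    if isinstance(k, bool):
        return k  # halting configuration: accept or reject
    results = (accepts(s, kind, succ) for s in succ[node])
    return any(results) if k == "exists" else all(results)

kind = {"r": "exists", "u": "forall", "v": "forall",
        "a": True, "b": False, "c": True, "d": True}
succ = {"r": ["u", "v"], "u": ["a", "b"], "v": ["c", "d"]}
# r accepts: the "exists" root can pick v, all of whose leaves accept.
```

This is exactly the game-reading of an ATM: “exists” states belong to the player trying to accept, “for all” states to the adversary.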
13. Outline
● Complexity and ATM
● Complexity and games (incl.
planning)
● Bounded horizon games
14. Computational complexity:
framework
Uncertainty can be:
– Adversarial: I focus on worst case
– Stochastic: I focus on average result
– Or both.
“Stochastic = adversarial” if goal = 100%
success.
“Stochastic != adversarial” in the general case.
15. Computational complexity:
framework
Many representations for problems. E.g.:
– Succinct: a circuit computes the i-th bit of
the probability that action a leads to a
transition from s to s'
– Compressed: a circuit computes many bits
simultaneously
– Flat: longer encoding (transition tables)
==> does not matter for decidability
==> matters for complexity
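The gap between representations can be made concrete: the same deterministic transition function, once as a tiny circuit over the bits of the state (succinct) and once as an explicit table (flat). The function below is an arbitrary toy chosen for illustration:

```python
# Same transition function, two representations of wildly different size.
N_BITS = 16  # 2**16 states: the circuit is tiny, the flat table is not

def succinct_next(state: int) -> int:
    """A small 'circuit' over the bits of the state: rotate left by one."""
    return ((state << 1) | (state >> (N_BITS - 1))) & ((1 << N_BITS) - 1)

# Flat representation: materialize all 2**N_BITS entries explicitly.
flat_table = [succinct_next(s) for s in range(1 << N_BITS)]
```

Complexity results are sensitive to which of these two encodings the input uses, even though they describe the same game.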
16. Computational complexity:
framework
Many representations for problems. E.g.:
– Succinct
– Compressed
– Flat
Compressed representation “somehow” natural
(state space has exponential size, transitions
are fast): see e.g. Mundhenk for detailed defs
and flat representations.
17. Computational complexity:
framework
We use mainly compressed representation; see
also Mundhenk for flat representations.
Typically, exponentially small representations
lead to exponentially higher complexity
==> but this is not always the case...
Simple rule changes can greatly affect the complexity:
“superko”: the rules forbid repeating a position;
some fully observable 2-player games become
EXPSPACE instead of EXP ==> discussed later
18. Computational complexity: framework
for first tables of results
Either search (find a target)
or optimize (cumulate rewards over time)
Compressed (written with circuits or others...)
or not (flat).
Horizon:
- Short horizon: horizon ≤ size of input
- Long horizon: log2(horizon) ≤ size of input
- Infinite horizon: no limit
20. Mundhenk's summary: one player, non-negative
reward, looking for a positive average reward
(= positive proba of reaching a target): easier
21. Complexity, partial observation, infinite
horizon, proba of reaching a target
● 1P+random, unobservable: undecidable
(Madani et al.)
● 1P+random, P(win)=1,
or equivalently 2P, P(win)=1:
[Rintanen and refs therein]
– Fully observable: EXP [Littman, 1994]
– Unobservable: EXPSPACE [Haslum & Jonsson, 2000]
– Partially observable: 2EXP [Rintanen, 2003]
Rmk: “2P, P(win)=1” is not “2P”!
22. Complexity, partial observation,
infinite horizon
● 2P vs 1P, P(win)=1?: undecidable! [Hearn, Demaine]
● 2P (random or not):
– Existence of a sure win: equiv. to 1P+random!
● EXP fully observable (e.g. Go, Robson 1984)
● PSPACE unobservable
● 2EXP partially observable
– Existence of a sure win, repeating a state forbidden:
EXPSPACE-complete (Go with Chinese rules?
rather conjectured EXPTIME or PSPACE...)
– General case (optimal play): undecidable
(Auger, Teytaud) (what about phantom-Go?)
23. Complexity, partial observation
Remarks:
● Continuous case ?
● Purely epistemic (we gather information, we
don't change the state) ? [Sabbadin et al]
● Restrictions on the policy, on the set of
actions...
● Discounted reward
● DEC-POMDP, POSG : many players,
same/opposite/different reward functions...
24. What are the approaches ?
– Dynamic programming (Massé – Bellman 50's) (still
the main approach in industry), alpha-beta, retrograde analysis
– Reinforcement learning
– MCTS (R. Coulom. Efficient Selectivity and Backup
Operators in Monte-Carlo Tree Search. In
Proceedings of the 5th International Conference on
Computers and Games, Turin, Italy, 2006)
– Scripts + Tuning / Direct Policy Search
– Coevolution
All have PO extensions, but the last two
are the most convenient in this case.
25. Partially observable games
Many tools for fully observable games.
Not so many for partially observable ones.
● Shi-Fu-Mi (Rock Paper Scissor)
● Card games
● Phantom games
26. Shi-Fu-Mi (Rock-Paper-Scissors)
● Fully observable in simultaneous play, but
partially observable in turn-based version.
● Computers stronger than humans (yes, it's
true).
27. Card games, phantom games
● Phantomized version of a game:
– You don't see the move of your opponents
– If you play an illegal move, you are
informed that it's illegal, you play again
– Usually, you get some more information
(captures, threats...) <== game-dependent
● Phantom-games:
– phantom-Chess = Kriegspiel
==> Dark Chess: more info
– phantom-Go
– etc.
28. Partially observable games
● Usually quite heuristic algorithms
● Best performing algorithms combine:
– Opponent modelling (as for Shi-Fu-Mi)
– Belief state (often by Monte-Carlo
simulations)
– Not a lot of tree search
– A lot of tuning
==> usually no consistency analysis
29. Part I: Complexity analysis
(unbounded horizon)
– Game:
● One or two players
● Win, loss, draw (incl. endless loop)
– Partial observability, no random part
– Finite state space:
● state=transition(state,action)
● action decided by each player in turn
30. State of the art
- makes sense in fully observable games
- not so much in non-observable games
31. State of the art
EXPTIME-complete in the general
fully-observable case
32. EXPTIME-complete fully
observable games
- Chess (for some nxn generalization)
- Go (with no superko)
- Draughts (international or English)
- Chinese checkers
- Shogi
33. PSPACE-complete fully
observable games
- Amazons
- Hex
- Go-moku
- Connect-6
- Qubic
- Reversi
- Tic-Tac-Toe
Polynomial horizon + full observation ==> PSPACE.
Many games in which each cell is filled once and only once.
34. EXPSPACE-complete
unobservable games (Haslum & Jonsson)
The two-player unobservable case is
EXPSPACE-complete
(games in succinct form, infinite horizon).
(Still for the 100%-win “UD” criterion;
for not fully observable cases it
is necessary to be precise about the criterion...)
Importantly, under the UD criterion, strategies
are the same whether the opponent has full observation
or no observation ==> UD is very bad :-(
35. EXPSPACE-complete
unobservable games (Haslum & Jonsson)
The two-player unobservable case is
EXPSPACE-complete
(games in succinct form).
PROOF:
(I) First note that strategies are just sequences of actions
(no observability!).
(II) It is in EXPSPACE = NEXPSPACE, because of the
following algorithm:
(a) Non-deterministically choose the sequence of
actions (an exponential-length list of actions is enough...)
(b) Check the result against all possible opponent strategies
(III) We only have to check the hardness.
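On a toy instance, steps (a)-(b) of the membership argument can be made concrete by replacing the nondeterministic guess with enumeration: a strategy is just a blind action sequence, checked against every opponent sequence. The game below is invented purely for illustration:

```python
from itertools import product

# Toy unobservable game: states 0..2, start 1, target 0 (absorbing).
# TRANS[state][my_action][opp_action] -> next state.
TRANS = {
    0: {"A": {"x": 0, "y": 0}, "B": {"x": 0, "y": 0}},
    1: {"A": {"x": 2, "y": 2}, "B": {"x": 1, "y": 2}},
    2: {"A": {"x": 2, "y": 1}, "B": {"x": 0, "y": 0}},
}

def outcome(seq, opp_seq, start=1):
    s = start
    for a, b in zip(seq, opp_seq):
        s = TRANS[s][a][b]
    return s

def sure_winning_sequence(horizon):
    """Enumerate blind action sequences; keep one winning for ALL opponents."""
    for seq in product("AB", repeat=horizon):
        if all(outcome(seq, opp) == 0 for opp in product("xy", repeat=horizon)):
            return seq
    return None
```

The real algorithm avoids this doubly-exponential enumeration by guessing the sequence nondeterministically, giving NEXPSPACE = EXPSPACE.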
38. EXPSPACE-complete
unobservable games (Haslum & Jonsson)
The two-player unobservable case is
EXPSPACE-complete
(games in succinct form).
PROOF of the hardness:
Reduction from: is a given TM with exponential tape
going to halt?
Consider a TM with a tape of size N = 2^n.
We must find a game
- of size n (n = log2(N))
- such that player 1 has a winning
strategy iff the TM halts.
39. EXPSPACE-complete unobservable games:
encoding a Turing machine with tape of size N
as a game with state O(log(N))
Player 1 chooses the sequence of
configurations of the tape (N=4):
x(0,1), x(0,2), x(0,3), x(0,4) ==> initial state
x(1,1), x(1,2), x(1,3), x(1,4)
x(2,1), x(2,2), x(2,3), x(2,4)
x(3,1), x(3,2), x(3,3), x(3,4)
.....................................
x(N,1), x(N,2), x(N,3), x(N,4)
Player 1 wins by reaching the final state...
except if player 2 finds an illegal transition!
==> Player 2 can check the consistency of one 3-tuple per line
==> this requests space log(N) (= the position of the 3-tuple).
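Player 2's job can be sketched as a purely local check: in a valid tableau, each cell of row t+1 is determined by a window of three cells of row t, so pointing at a single inconsistent position suffices. The local rule below is a toy stand-in for a real TM transition table:

```python
BLANK = "#"

def window(row, i):
    """The 3 cells of the previous row that determine cell i of the next."""
    return (row[i - 1] if i > 0 else BLANK,
            row[i],
            row[i + 1] if i + 1 < len(row) else BLANK)

def rule(w):
    """Toy local rule standing in for the TM's transition table."""
    return "1" if w.count("1") >= 2 else "0"

def legal_cell(prev_row, i, claimed):
    """P2 challenges ONE position i: storing i takes space O(log N)."""
    return rule(window(prev_row, i)) == claimed

row0 = list("0110")
row1 = [rule(window(row0, i)) for i in range(len(row0))]  # a consistent row
```

Since player 2 only remembers a position index (and one window), its memory is logarithmic in the tape size N, which is what keeps the game's state small.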
42. EXPSPACE-complete
unobservable games
The 1-player case with unknown initial state,
unobservable, is
EXPSPACE-complete
(games in succinct form).
2P + unobservable as well.
43. 2EXPTIME-complete PO games
The two-player PO case,
or 1P+random PO is
2EXP-complete
(games in succinct form).
(2P = 1P+random because of UD)
44. Undecidable games (B. Hearn)
The three-player PO case is
undecidable (a team of two players against one,
the teammates not allowed to communicate).
45. Hummm ?
Do you know a PO game in which you can
ensure a win with probability 1 ?
46. Another formalization
Criterion: win with probability at least c (not only c = 1)
==> much more satisfactory
(might have drawbacks as well...)
47. Madani et al.
1 player + random = undecidable
(even without opponent!)
48. Madani et al.
1 player + random = undecidable.
==> answers a (related) question by
Papadimitriou and Tsitsiklis.
Proof ?
Based on the emptiness problem for
probabilistic finite automata (see Paz 71):
Given a probabilistic finite automaton,
is there a word accepted with proba at least c ?
==> undecidable
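The emptiness problem itself is easy to state in code: the acceptance probability of a word is obtained by chaining one stochastic matrix per letter; the undecidable question is whether ANY word exceeds the threshold c. A toy two-state automaton with invented numbers:

```python
# Toy probabilistic finite automaton: one row-stochastic transition
# matrix per letter; we start in state 0, state 1 is accepting.
A = {
    "a": [[0.5, 0.5], [0.0, 1.0]],
    "b": [[1.0, 0.0], [0.3, 0.7]],
}
START = [1.0, 0.0]
ACCEPT = [0.0, 1.0]

def accept_proba(word):
    """Probability that the automaton accepts the given word."""
    d = START
    for letter in word:
        m = A[letter]
        d = [sum(d[i] * m[i][j] for i in range(2)) for j in range(2)]
    return sum(d[j] * ACCEPT[j] for j in range(2))

# Computing accept_proba(w) for a GIVEN w is trivial; deciding whether
# some word reaches probability >= c is the undecidable part (Paz 71).
```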
53. A random node to be rewritten
Rewritten as follows:
● Player 1 chooses a in [[0,N-1]]
● Player 2 chooses b in [[0,N-1]]
● c=(a+b) modulo N
● Go to t_c
Each player can force the game to be equivalent to
the initial one (by playing uniformly)
==> the proba of winning for player 1 (in case of perfect play)
is the same as for the initial game
==> undecidability!
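The claim that either player can force the rewritten node to behave like the original random node is a one-line computation: if a is uniform, then c = (a+b) mod N is uniform whatever the distribution of b (and symmetrically for b). A quick exact check:

```python
from fractions import Fraction
from itertools import product

N = 5

def dist_of_c(p_a, p_b):
    """Distribution of c = (a+b) mod N when a ~ p_a and b ~ p_b."""
    p_c = [Fraction(0)] * N
    for a, b in product(range(N), repeat=2):
        p_c[(a + b) % N] += p_a[a] * p_b[b]
    return p_c

uniform = [Fraction(1, N)] * N
always_zero = [Fraction(1)] + [Fraction(0)] * (N - 1)  # P2 always plays b=0
```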
54. Important remark
Existence of a strategy for winning with
proba 0.5 = also undecidable for the
restriction to games in which the proba
is >0.6 or <0.4 ==> not just a subtle
precision issue.
55. So what ?
We have seen that
unbounded horizon
+ partial observability
+ natural criterion (not sure win)
==> undecidability
contrary to what is expected from the usual definitions.
What about bounded horizon, 2P ?
– Clearly decidable
– Complexity ?
– Algorithms ? (==> coevolution & LP)
57. Part II: Fictitious play (bounded
horizon) in the antagonist case
Fictitious play ?
Somehow an abstract version of
antagonist coevolution with full memory
● unlimited population (finite, but
increasing): one more individual per iteration
● perfect choice of each mutation against
the current population of opponents
58. Part II: Fictitious play in the
zero-sum case
Why zero-sum cases ?
Evolutionary stable solutions (found by
FP) are usually sub-optimal (as are nature's
choices of lion strategies or of cheating behaviors in the
Scaly-breasted Munia)
59. What is a matrix 0-sum game ?
● A matrix M is given (type n x m).
● Player 1 chooses (privately) i in [[1,n]]
● Player 2 chooses j in [[1,m]]
● Reward
= Mij for player 1
= -Mij for player 2 (zero-sum game)
==> Model for finite antagonist games
60. Nash equilibrium
● Nash equilibrium: there is a distribution
of probability for each player
(= mixed strategy)
such that the reward is optimum (for the
worst case on the distribution of
probabilities by the opponent)
● Linear programming is a polynomial
algorithm for finding the Nash eq.
● FP= tool for approximating it
(at least in 0-sum cases)
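Checking a candidate equilibrium needs no LP solver: a mixed strategy's guaranteed value is its worst case over the opponent's pure strategies. For Rock-Paper-Scissors the uniform strategy guarantees the game's value 0, while any pure strategy guarantees only -1:

```python
# Row player's payoff matrix for Rock-Paper-Scissors (zero-sum).
M = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]

def guaranteed_value(p, M):
    """Worst case, over opponent pure strategies j, of sum_i p[i]*M[i][j]."""
    n, m = len(M), len(M[0])
    return min(sum(p[i] * M[i][j] for i in range(n)) for j in range(m))
```

A Nash equilibrium of the zero-sum game is exactly a pair of mixed strategies whose guaranteed values (max for one player, min for the other) coincide; LP finds it in polynomial time.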
61. Fictitious play (Brown 1949)
● Each player starts with a distribution on
its strategies
● Each player in turn:
– Finds an optimal strategy against the
current opponent's distribution (randomly
break ties)
– Adds it to its distribution (the distribution does
not sum to 1!)
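A minimal sketch of the procedure above, with counts playing the role of the growing, non-normalized distribution and random tie-breaking as stated:

```python
import random

def fictitious_play(M, iters=2000, seed=0):
    """Fictitious play on a zero-sum matrix game; returns the two
    empirical mixed strategies (normalized counts)."""
    rng = random.Random(seed)
    n, m = len(M), len(M[0])
    cnt1, cnt2 = [1] * n, [1] * m  # initial "distributions" (counts)
    for _ in range(iters):
        # best response of player 1 (maximizer) to P2's empirical mixture
        v1 = [sum(M[i][j] * cnt2[j] for j in range(m)) for i in range(n)]
        best1 = max(v1)
        cnt1[rng.choice([i for i in range(n) if v1[i] == best1])] += 1
        # best response of player 2 (minimizer) to P1's empirical mixture
        v2 = [sum(M[i][j] * cnt1[i] for i in range(n)) for j in range(m)]
        best2 = min(v2)
        cnt2[rng.choice([j for j in range(m) if v2[j] == best2])] += 1
    t1, t2 = sum(cnt1), sum(cnt2)
    return [c / t1 for c in cnt1], [c / t2 for c in cnt2]

# On matching pennies or Rock-Paper-Scissors, the empirical frequencies
# slowly approach the uniform equilibrium (Robinson's theorem, 0-sum case).
p, q = fictitious_play([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
```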
67. Improvements for KxK matrix
games: approximations
● There exist ε-approximations with support of size
O(log(K)/ε²) [Althoefer]
● Such an approximation can be found in
time O(K·log(K)/ε²) [Grigoriadis et al.]: basically a
stochastic FP
69. Improvements for KxK matrix
games: exact solution if k-sparse
● Exact solution in time (Auger, Ruette, Teytaud):
O(K·log(K)·k^(2k) + poly(k))
if the solution is k-sparse (good only if k is
smaller than log(K)/log(log(K))!
Better?)
70. Improvements for KxK matrix
game: approximations
So, LP & FP are two tools for matrix
games.
LP can be adapted to PO
games without building the complete
matrix (using information sets).
The same for FP variants ?
71. Conclusions
There are still natural questions which
provide nice decidability problems:
Madani et al. (1 player against random, no observability), extended here to
2 players with no randomness
==> undecidable problems “less than”
the Halting problem?
Solving zero-sum matrix-games is still an
active area of research
● Approximate cases
● Sparse case
72. Open problems
● Phantom-Go undecidable? (or another “real” game...)
● Complexity of Go with Chinese rules?
(conjectured: PSPACE or EXPTIME;
proved: PSPACE-hard and in EXPSPACE)
● More to say about “epistemic” games (internal
state not modified)
● Frontier of undecidability in PO games?
(100%-halting games: 2P becomes decidable)
● Chess with finitely many pieces on an infinite board:
decidability of forced mate?
(in n moves: Brumleve et al., 2012, by simulation in Presburger
arithmetic; thanks to S. Riis :-) )