This document describes stochastic definite clause grammars (SDCG), which extend definite clause grammars (DCG) with probabilities. SDCG transforms a DCG into a stochastic logic program implemented in PRISM, enabling probabilistic inference and parameter learning. The probabilistic model assigns a random variable to each group of rules sharing a functor and arity; its outcome selects which rule is expanded. SDCG introduces syntax extensions such as regular expression operators and macros to make grammars more concise, and conditioned rules, which select rules by variable unification and allow modeling higher-order hidden Markov models. SDCG provides tools for parsing sentences and learning rule probabilities from data.
2. What and why?
● DCG syntax
– Convenient
– Expressive
– Flexible
● Probabilistic model
– Polynomial parsing
– Parameter learning
– Robust
3. DCG Grammar rules
● Definite Clause Grammars
– Grammar formalism on top of Prolog.
– Production rules with unification variables
– Context-sensitive (in fact stronger, since rule bodies may contain arbitrary Prolog goals)
– Exploits unification semantics of Prolog
Simple DCG grammar:
  sentence --> subject(N), verb(N), object.
  subject(sing) --> [he].
  subject(plur) --> [they].
  verb(sing) --> [eats].
  verb(plur) --> [eat].
  object --> [cake].
  object --> [food].

Difference list representation:
  sentence(L1,L4) :-
      subject(N,L1,L2),
      verb(N,L2,L3),
      object(L3,L4).
  subject(sing,[he|R],R).
  ...
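As a quick usage sketch (standard Prolog, not specific to SDCG), the grammar can be run through phrase/2, which supplies the difference-list arguments:

  % Parsing with the DCG above; phrase/2 wraps the difference-list encoding.
  | ?- phrase(sentence, [he, eats, cake]).
  yes
  | ?- phrase(sentence, [they, eats, cake]).   % number agreement on N fails
  no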
4. Stochastic Definite Clause Grammars
● Implemented as a DCG compiler
– With some extensions to DCG syntax
● Transforms a DCG (grammar) into a stochastic
logic program implemented in PRISM.
● Probabilistic inferences and parameter learning are then performed using PRISM.
[Diagram: (S)DCG → compilation → PRISM program]
6. PRISM
● PRISM – http://sato-www.cs.titech.ac.jp/prism/
● Extends Prolog with random variables (msws in PRISM lingo)
● Performs probabilistic inferences over such programs:
– Probability calculation – probability of a derivation
– Viterbi – find the most probable derivation
– EM learning – learn parameters from a set of example goals
PRISM program example: Bernoulli trials
  target(ber,2).
  values(coin,[heads,tails]).
  :- set_sw(coin, 0.6+0.4).

  ber(N,[R|Y]) :-
      N > 0,
      msw(coin,R),     % Probabilistic choice
      N1 is N - 1,
      ber(N1,Y).       % Recursion
  ber(0,[]).
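A hedged usage sketch of the program above (prob/2 and learn/1 are PRISM built-ins; the example goals here are made up):

  | ?- prob(ber(2,[heads,tails]), P).
  P = 0.24                % 0.6 * 0.4
  | ?- learn([ber(2,[heads,heads]), ber(2,[heads,tails])]).
  % EM re-estimates the distribution of the coin switch from the goals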
7. The probabilistic model
One random variable encodes the probability of expansion for rules with the
same functor/arity:
  s(N) ==> np(N).
  s(N) ==> np(N),vp(N).
The choice is made by a selection rule; the selected rule is then invoked
through unification:
  target(s,2).
  values(s,[s1,s2]).

  % Selection rule
  s(A,B) :- msw(s,Outcome), s(Outcome, A, B).

  % Implementation rules
  s(s1, A, B) :- np(_, A, B).
  s(s2, A, B) :- np(N, A, D), vp(N, D, B).
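As an illustration (a sketch assuming the compiled program above is loaded in PRISM; sample/1 is PRISM's forward sampler):

  | ?- sample(s(Sentence, [])).
  % Draws one derivation: msw(s,·) picks s1 or s2 with the current
  % switch probabilities, then the matching implementation rule runs.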
8. Unification failure
Since SDCG embodies unification constraints, some derivations may fail, and we
only observe the successful derivations in sample data. If the training
algorithm only considers successful derivations, it will converge to a wrong
probability distribution (missing probability mass).
[Diagram: failed derivations as a subset of all derivations]
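The needed correction is a standard conditional normalization (a general identity, not specific to SDCG): writing P(fail) for the total probability mass of failed derivations, the distribution over observable derivations is

  P(x | success) = P(x) / (1 - P(fail))

and this is the quantity that failure-adjusted estimation maximizes.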
In PRISM this is handled using the fgEM algorithm, which is based on Cussens'
Failure-Adjusted Maximization (FAM) algorithm. A "failure program" which traces
all derivations is derived using First Order Compilation, and the probabilities
of failed derivations are estimated as part of the fgEM algorithm.
9. Unification failure issues
Infinite/long derivation paths
● Impossible/difficult to derive failure program.
● Workaround: SDCG has an option which limits the depth of
derivation.
● Still: the size of the failure program is very much an issue.
FOC requirement - “universally quantified clauses”:
● Not the case with Difference Lists: 'C'([X|Y], X,Y).
● Workaround 1:
– Trick the first order compiler by manually adding
implications after program is partly compiled.
– Works empirically, but may be dubious
● Workaround 2:
– Append based grammar
– Works, but has inherent inefficiencies (see the sketch below)
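A minimal sketch of the append-based alternative (hypothetical encoding; SDCG's actual transformation may differ): every constituent binds a complete word list and the rule body concatenates them, so all clauses are universally quantified as FOC requires:

  % Append-based encoding of: sentence --> subject, verb.
  sentence(S) :- subject(S1), verb(S2), append(S1, S2, S).
  subject([he]).
  verb([eats]).

The inefficiency comes from append/3 copying lists at every rule application, where difference lists concatenate in constant time.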
10. Syntax extensions
● SDCG extends the usual DCG syntax
– Compatible with DCG (superset)
● Extensions:
– Regular expression operators
● Convenient rule recursion
– “Macros”
● Allows writing rules as templates which are filled out
according to certain rules
– Conditioning
● Convenient expression of higher-order HMMs
● Lexicalization
11. Regular expression operators
Regular expression operators can be associated with rule constituents:
  name ==> ?(title), +(firstname), *(lastname).
Meaning:
  ?   repeated zero or one times
  *   repeated zero or more times
  +   repeated one or more times
The constituent in the original rule is replaced with a substitute which
refers to intermediary rules, which implement the regular expression.
The intermediary rules follow this scheme:
  regex_sub ==> [].
  regex_sub ==> original_constituent.
  regex_sub ==> regex_sub, regex_sub.
(? uses the first two rules, + the last two, and * all three.)
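For illustration (hypothetical generated names; the compiler's own naming will differ), ?(title) and +(firstname) from the rule above could expand along these lines:

  name ==> title_q, firstname_plus, lastname_star.
  title_q ==> [].
  title_q ==> title.
  firstname_plus ==> firstname.
  firstname_plus ==> firstname_plus, firstname_plus.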
Limitation: Cannot be used in rules with unification variables.
12. Template macros
Special goals prefixed with @ are treated as macros. Grammar rules with macros
are dynamically expanded.

Example:
  word(he,sg,masc).
  word(she,sg,fem).
  number(Word,Number) :- word(Word,Number,_).
  gender(Word,Gender) :- word(Word,_,Gender).
  wordlist(X,[X]).

expand_mode determines which variables to keep: arguments marked - are
removed, arguments marked + are kept and their bindings inserted:
  expand_mode(number(-, +)).
  expand_mode(gender(-, +)).
  expand_mode(wordlist(-, +)).

A rule using the macros:
  word(@number(Word, N), @gender(Word,G)) ==> @wordlist(Word, WordList).

A meta rule is created and called, finding all answers:
  exp(Word, N, G, WordList) :- number(Word,N), gender(Word, G), wordlist(Word,WordList).

Resulting grammar:
  word(sg,masc) ==> [ he ].
  word(sg,fem) ==> [ she ].
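A hedged sketch of the expansion step (findall/3 is standard Prolog; m/3 is just an illustrative wrapper term): each answer to the meta rule yields one concrete grammar rule, with arguments kept or removed according to expand_mode:

  | ?- findall(m(N,G,WordList), exp(_Word,N,G,WordList), Rules).
  Rules = [m(sg,masc,[he]), m(sg,fem,[she])]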
13. Conditioning
A conditioned rule takes the form,
name(F1,F2,...,Fn) | V1,V2,...,Vn ==> C1,C2,...,Cn.
The | operator can be seen as a guard that ensures the rule is only expanded
if the conditions V1..Vn unify with F1..Fn.
It is possible to specify which variables must unify using a condition_mode:
  condition_mode(n(+,+,-)).
  n(A,B,C) | x,y ==> c1, c2.
Conditioned rules are grouped by non-terminal name and arity, and all rules in
a group must have the same number of conditions.
Probabilistic semantics: A distinct probability distribution for each
distinct set of conditions.
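A hedged sketch of what this could compile to in PRISM (hypothetical switch naming; the actual SDCG encoding may differ): one msw per distinct condition set, each estimated independently:

  values(n(a), [n1, n2]).   % distribution used when the conditions unify with a
  values(n(b), [n1, n2]).   % separate, independently trained distribution for b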
14. Conditioning semantics
Model without conditioning:
  n ==> n1.
  n ==> n2.
  n1 ==> ...
  ...
Model with conditioning:
  n|a ==> n1(X).
  n|a ==> n2(X).
  n|b ==> n1(X).
  n|b ==> n2(X).
  ...
[Diagram: without conditioning, the choice between n1 and n2 is a single
stochastic selection; with conditioning, unification first selects the group
(n|a or n|b), and a stochastic selection is then made within that group.]
15. Example, simple toy grammar
  start ==> s(N).
  s(N) ==> np(N).
  s(N) ==> np(N),vp(N).
  np(N) ==> n(sg),n(N).
  np(N) ==> n(N).
  vp(N) ==> v(N),np(N).
  vp(N) ==> v(N).

  n(sg) ==> [time].
  n(pl) ==> [flies].
  v(sg) ==> [flies].
  v(sg) ==> [crawls].
  v(pl) ==> [fly].

Probability of a sentence:
  | ?- prob(start([time,flies],[],Tree), P).
  P = 0.083333333333333 ?
  yes

The most probable parse:
  | ?- viterbig(start([time,flies],[],Tree), P).
  Tree = [start,[[s(pl),[[np(pl),[[n(sg),[[]]],[n(pl),[[]]]]]]]]]
  P = 0.0625 ?
  yes

The most probable parses (there are only two):
  | ?- n_viterbig(10,start([time,flies],[],Tree), P).
  Tree = [start,[[s(pl),[[np(pl),[[n(sg),[[]]],[n(pl),[[]]]]]]]]]
  P = 0.0625 ?;
  Tree = [start,[[s(sg),[[np(sg),[[n(sg),[[]]]]],[vp(sg),[[v(sg),[[]]]]]]]]]
  P = 0.020833333333333 ?;
  no
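Parameter learning on this grammar could look as follows (a sketch using PRISM's learn/1; the training corpus is made up):

  | ?- learn([start([time,flies],[],_), start([time,flies],[],_),
              start([time,crawls],[],_)]).
  % EM estimates the expansion probabilities of s/np/vp/n/v
  % from the derivations of the training sentences.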
16. More interesting example
Simple part-of-speech tagger – a fully connected first-order HMM.

  conditioning_mode(tag_word(+,-,-)).

  consume_word([Word]) :-
      word(Word).

  start(TagList) ==>
      tag_word(none,_,TagList).

  tag_word(Previous, @tag(Current), [Current|TagsRest]) | @tag(SomeTag) ==>
      @consume_word(W),
      ?(tag_word(Current,_,TagsRest)).

Some tags:
  tag(none). tag(det). tag(noun). tag(verb). tag(modalverb).

Some words:
  word(the). word(can). word(will). word(rust).
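A usage sketch for the tagger (hedged: the query assumes the compiler appends the difference-list arguments after the grammar's own TagList argument, as in the queries of the previous slide): the most probable tag sequence via viterbig/2:

  | ?- viterbig(start(Tags, [the,can,will,rust], []), P).
  % Tags is bound to the most probable tag sequence for the sentence.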