Commit 14eb34f8948b11bfdd9acb2e767e5d04f805c697
1 parent 6be7d8ee · Exists in master
Organised Text
Showing 23 changed files with 810 additions and 953 deletions
text/00_ASP.md
... | ... | @@ -1,239 +0,0 @@ |
1 | -# Answer Set Programming | |
2 | - | |
3 | -> **Answer set programming (ASP) is a form of declarative programming oriented towards difficult (primarily NP-hard) search problems.** | |
4 | -> | |
5 | -> It is **based on the stable model (answer set) semantics** of logic programming. In ASP, search problems are reduced to computing stable models, and answer set solvers ---programs for generating stable models--- are used to perform search. | |
6 | - | |
7 | ---- | |
8 | - | |
9 | -**ASP** "programs" generate "deduction-minimal" models, _aka_ **stable models** or **answer sets**. |
10 | -- Given an ASP program $P$, a model $X$ of $P$ is a set of atoms that satisfies every rule of $P$. |
11 | -- In a "deduction-minimal" model $X$, each element $x \in X$ has a proof using $P$. Non-minimal models contain elements without a proof. |
12 | - | |
13 | -## Key Questions | |
14 | - | |
15 | -1. What is the relation between ASP and Prolog? | |
16 | - 1. **Prolog** performs **top-down query evaluation**. Solutions are extracted from the instantiation of variables of successful queries. | |
17 | - 2. **ASP** proceeds in two steps: first, **grounding** generates a (finite) _propositional representation of the program_; second, **solving** computes the _stable models_ of that representation. | |
18 | -2. What are the roles of **grounding** with `gringo` and **solving** with `clasp`? | |
19 | -3. Can ASP be used for **pLP** (probabilistic logic programming)? |
20 | - 1. What are the key probabilistic tasks/questions/problems? | |
21 | - 2. Where does the distribution semantics enter? What about **pILP**? |
22 | -4. Can the probabilistic task control the grounding (`gringo`) or solving (`clasp`) steps in ASP? | |
23 | -5. Can ASP replace kanren? | |
24 | - 1. As much as ASP can replace Prolog. | |
25 | - | |
26 | -## Formal Foundations | |
27 | - | |
28 | -### Common Concepts and Notation | |
29 | - | |
30 | - context | true, false | if | and | or | iff | default negation | classical negation |
31 | ----------|-------------|------|-----|------|-----|------------------|--------------------|
32 | -source | | `:-` | `,` | `\|` | | `not` | `-` |
33 | -logic prog. | | ← | , | ; | | ∼ | ¬ |
34 | -formula | ⊤, ⊥ | → | ∧ | ∨ | ↔ | ∼ | ¬ |
35 | - | |
36 | -> - **default negation** or **negation as failure (naf)**, `not a` ($\sim a$), means "_no information about `a`_". | |
37 | -> - **classical negation** or **strong negation**, `-a` ($\neg a$), means "_positive information about `-a`_" i.e. "_negative information about `a`_". Likewise `a`: "_positive information about `a`_". |
38 | -> - The symbol `not` ($\sim$), is a new logical connective; `not a` ($\sim a$) is often read as "_it is not believed that `a` is true_" or "_there is no proof of `a`_". Note that this does not imply that `a` is believed to be false. | |
39 | - | |
40 | -- **Interpretation.** A _boolean_ interpretation is a function from ground atoms to **⊤** and **⊥**. It is represented by the atoms mapped to **⊤**. | |
41 | - - if u, v are two interpretations **u ≤ v** iff u ⊆ v under this representation. | |
42 | - - **partial interpretations** are represented by ( {true atoms}, {false atoms}) leaving the undefined atoms implicit. | |
43 | - - an **ordered boolean assignment** $a$ over $dom(a)$ is represented by a sequence $a = (V_ix_i | i \in 1:n)$ where $V_i$ is either $\top$ or $\bot$ and each $x_i\in dom(a)$. |
44 | - - $a^\top \subseteq a$ such that $\top x \in a$; $a^\bot \subseteq a$ such that $\bot x \in a$. | |
45 | - - An ordered assignment $(a^\top, a^\bot)$ is a partial boolean interpretation. | |
46 | -- The subsets of a set are partially ordered by $\subseteq$; recall maximal and minimal elements. |
47 | -- Directed graphs; Path; **Strongly connected** iff all vertex pairs (a,b) are connected; The **strongly connected components** are the strongly connected subgraphs. | |
48 | - | |
49 | -### Basic ASP syntax and semantics | |
50 | - | |
51 | -- A **definite clause** is, by definition, $a_0 \vee \neg a_1 \vee \cdots \vee \neg a_n$, a disjunction with exactly one positive atom. | |
52 | - - Also denoted $a_0 \leftarrow a_1 \wedge \cdots \wedge a_n$. | |
53 | - - **A set of definite clauses has exactly one smallest model.** | |
54 | -- A **Horn clause** has at most one positive atom. |
55 | - - A Horn clause without a positive atom is an _integrity constraint_: _a conjunction that **cannot** hold_. |
56 | - - **A set of Horn clauses has one or zero smallest models.** |
57 | -- If $P$ is a **positive program**: | |
58 | - - A set $X$ is **closed** under $P$ if, for each rule $r \in P$, $\text{body}^+(r) \subseteq X$ implies $\text{head}(r) \in X$. |
59 | - - $Cn(P)$ is, by definition, the set of **consequences of $P$**. | |
60 | - - $Cn(P)$ is the smallest set closed under $P$. | |
61 | - - $Cn(P)$ is the $\subseteq$-smallest model of $P$. | |
62 | - - The **stable model** of $P$ is, by definition, $Cn(P)$. | |
63 | - - If $P$ is a positive program, $Cn(P)$ is the smallest model of the definite clauses of $P$. | |
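The definition of $Cn(P)$ as a least fixpoint can be sketched directly. The following Python fragment is a hypothetical illustration (the pair representation of rules and the name `consequences` are assumptions, not part of any ASP system):

```python
# Hypothetical representation: a ground positive rule is (head, set of body atoms).
def consequences(rules):
    """Iterate the consequence operator to closure: the result is the
    smallest set closed under the rules, i.e. Cn(P)."""
    cn = set()
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            if body <= cn and head not in cn:
                cn.add(head)
                changed = True
    return cn

# P = { p.   q :- p.   r :- q, s. }
P = [("p", set()), ("q", {"p"}), ("r", {"q", "s"})]
print(sorted(consequences(P)))  # ['p', 'q']
```

Here `r` is excluded because `s` has no proof, matching the reading of $Cn(P)$ as the $\subseteq$-smallest model.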
64 | - | |
65 | -#### Example calculation of stable models | |
66 | - | |
67 | -Consider the program P: | |
68 | -```prolog | |
69 | -person(joey). | |
70 | -male(X); female(X) :- person(X). | |
71 | -bachelor(X) :- male(X), not married(X). | |
72 | -``` | |
73 | - | |
74 | -1. Any SM of P must have the **fact** `person(joey)`. | |
75 | -2. Therefore the **grounded rule** `male(joey) ; female(joey) :- person(joey).` entails that the SMs of P either have `male(joey)` or `female(joey)`. | |
76 | -3. Any **SM must contain** either A: `{person(joey), male(joey)}` or B: `{person(joey), female(joey)}`. | |
77 | -4. In **the reduct** of P in A we get the rule `bachelor(joey) :- male(joey).` and therefore `bachelor(joey)` must be in a SM that contains A. Let A1: `{person(joey), male(joey), bachelor(joey)}`. | |
78 | -5. No further conclusions result from P on A1. Therefore A1 is a SM. | |
79 | -6. Also no further conclusions result from P on B; It is also a SM. | |
80 | -7. The SMs of P are: | |
81 | - 1. `{person(joey), male(joey), bachelor(joey)}` | |
82 | - 2. `{person(joey), female(joey)}` | |
83 | - | |
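The hand calculation above can be checked mechanically. The sketch below is a naive brute-force illustration in Python, not how `clasp` actually works; all names (`ATOMS`, `RULES`, `stable_models`) are assumed for the example. It enumerates candidate atom sets, builds the reduct, and keeps the candidates that are minimal models of their own reduct:

```python
from itertools import chain, combinations

ATOMS = ["person", "male", "female", "bachelor", "married"]
# A ground rule is (disjunctive heads, positive body, naf body).
RULES = [
    ({"person"}, set(), set()),               # person(joey).
    ({"male", "female"}, {"person"}, set()),  # male(joey);female(joey) :- person(joey).
    ({"bachelor"}, {"male"}, {"married"}),    # bachelor(joey) :- male(joey), not married(joey).
]

def powerset(atoms):
    atoms = list(atoms)
    return (set(c) for c in chain.from_iterable(
        combinations(atoms, r) for r in range(len(atoms) + 1)))

def is_model(x, rules):
    # Every rule whose positive body holds must have some head atom in x.
    return all(not (bp <= x) or (heads & x) for heads, bp, _ in rules)

def reduct(rules, x):
    # Drop rules whose naf body meets x; strip naf literals from the rest.
    return [(h, bp, set()) for h, bp, bn in rules if not (bn & x)]

def stable_models(atoms, rules):
    for x in powerset(atoms):
        r = reduct(rules, x)
        if is_model(x, r) and not any(is_model(y, r)
                                      for y in powerset(x) if y < x):
            yield x

for sm in stable_models(ATOMS, RULES):
    print(sorted(sm))
# → ['female', 'person']
# → ['bachelor', 'male', 'person']
```

The two printed sets are exactly the stable models A1 and B derived step by step in the text.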
84 | - | |
85 | -```prolog | |
86 | --a. | |
87 | -not a. | |
88 | -% | |
89 | -% { -a } | |
90 | -% | |
91 | --a. | |
92 | -a. | |
93 | -% | |
94 | -% UNSAT. | |
95 | -% | |
96 | -not a. | |
97 | -a. | |
98 | -% | |
99 | -% UNSAT | |
100 | -% | |
101 | -%---------------------------------------- | |
102 | -% | |
103 | -a. | |
104 | -%% Answer: 1 | |
105 | -%% a | |
106 | -%% SATISFIABLE | |
107 | -% | |
108 | -% There is (only) one (stable) model: {a} | |
109 | -% | |
110 | -%---------------------------------------- | |
111 | -% | |
112 | --a. | |
113 | -%% Answer: 1 | |
114 | -%% -a | |
115 | -%% SATISFIABLE | |
116 | -% | |
117 | -% Same as above. | |
118 | -% | |
119 | -%---------------------------------------- | |
120 | -% | |
121 | ---a. | |
122 | -%% *** ERROR: (clingo): parsing failed | |
123 | -% | |
124 | -% WTF? | |
125 | -% | |
126 | -%---------------------------------------- | |
127 | -% | |
128 | -not a. | |
129 | -%% Answer: 1 | |
130 | -%% | |
131 | -%% SATISFIABLE | |
132 | -% | |
133 | -% ie there is (only) one (stable) model: {} | |
134 | -% | |
135 | -% This program states that there is no information. | |
136 | -% In particular, there is no information about a. | |
137 | -% Therefore there are no provable atoms. | |
138 | -% Hence the empty set is a stable model. | |
139 | -% | |
140 | -%---------------------------------------- | |
141 | -% | |
142 | -not not a. | |
143 | -%% UNSATISFIABLE | |
144 | -% | |
145 | -% ie no models. Because: |
146 | -% 1. The fact `not not a.` is satisfied only if a is in the model. |
147 | -% 2. But no rule derives a, so a has no proof. |
148 | -% 3. A stable model contains only provable atoms, so no model contains a. |
149 | -% 4. Yet every model must satisfy the fact, i.e. contain a. |
150 | -% 5. Therefore there are no stable models. |
152 | -% | |
153 | -%---------------------------------------- | |
154 | -% | |
155 | -not -a. | |
156 | -%% Answer: 1 | |
157 | -%% | |
158 | -%% SATISFIABLE | |
159 | -% | |
160 | -% Same as ~a. | |
161 | -% | |
162 | -%---------------------------------------- | |
163 | -% | |
164 | -b. | |
165 | -a;-a. | |
166 | -not a :- b. | |
167 | -% Answer: 1 | |
168 | -% b -a | |
169 | -% SATISFIABLE | |
170 | -% | |
171 | -% 1. Any model must contain b (fact b). | |
172 | -% 2. Any models entails ~a (rule not a :- b.). | |
173 | -% 3. Any model must contain one of a or ¬a (rule a;-a). | |
174 | -% 4. No model can contain both a and ~a. | |
175 | -% 5. Therefore any model must contain {b, ¬a}, which is stable. | |
176 | -% | |
177 | -% Q: Why does ~a not contradict -a? |
178 | -% A: ~a only states that a is not in the model (there is no proof of a); -a is a separate atom carrying positive information about ¬a, so a model may contain -a while lacking a. |
179 | -% | |
180 | -%---------------------------------------- | |
181 | -% | |
182 | -b. | |
183 | -a;c. | |
184 | -% Answer: 1 | |
185 | -% b c | |
186 | -% Answer: 2 | |
187 | -% b a | |
188 | -% SATISFIABLE | |
189 | -% | |
190 | -% 1. Any model must have b. | |
191 | -% 2. Any model must have one of a or c. | |
192 | -% 3. No model with both a and c is minimal because either one satisfies a;c | |
193 | -``` | |
194 | - | |
195 | -- Why is the double strong negation, `--a`, a syntax error, while the double default negation, `not not a`, is not? |
196 | - | |
197 | -#### Definitions and basic propositions | |
198 | -1. Let $\cal{A}$ be a **set of ground atoms**. | |
199 | -2. A **normal rule** $r$ has the form $a \leftarrow b_1, \ldots, b_m, \sim c_1, \ldots, \sim c_n$ with $m \geq 0$ and $n \geq 0$. |
200 | - - _Intuitively,_ the head $a$ is true if **each one of the $b_i$ has a proof** and **none of the $c_j$ has a proof**. | |
201 | -3. A **program** is a finite set of rules. | |
202 | -4. The **head** of the rule is $\text{head}(r) = a$; The **body** is $\text{body}(r) = \left\lbrace b_1, \ldots, b_m, \sim c_1, \ldots, \sim c_n \right\rbrace$. | |
203 | -5. A **fact** is a rule with empty body and is simply denoted $a$. | |
204 | -6. A **literal** is an atom $a$ or the default negation $\sim a$ of an atom. | |
205 | -7. Let $X$ be a set of literals. $X^+ = X \cap \cal{A}$ and the $X^- = \left\lbrace p\middle| \sim p \in X\right\rbrace$. | |
206 | -8. The set of atoms that occur in program $P$ is denoted $\text{atom}(P)$. Also $\text{body}(P) = \left\lbrace \text{body}(r)~\middle|~r \in P\right\rbrace$. Finally, $\text{body}_P(a) = \left\lbrace \text{body}(r)~\middle|~r \in P \wedge \text{head}(r) = a\right\rbrace$. |
207 | -9. A **model** of the program $P$ is a set of ground atoms $X \subseteq \cal{A}$ such that, for each rule $r \in P$, $$\text{body}^+(r) \subseteq X \wedge \text{body}^-(r) \cap X = \emptyset \to \text{head}(r) \in X.$$ |
208 | -10. A rule $r$ is **positive** if $\text{body}^-(r) = \emptyset$; a program is positive if all its rules are positive. |
209 | -11. _A positive program has a unique $\subseteq$-minimal model._ **Is this the link to Prolog?** |
210 | -12. The **reduct** of a formula $f$ relative to $X$ is the formula $f^X$ that results from $f$ replacing each maximal sub-formula _not satisfied by $X$_ by $\bot$. | |
211 | -13. The **reduct** of program $P$ relative to $X$ is $$P^X = \left\lbrace \text{head}(r) \leftarrow \text{body}^+(r) \middle| r \in P \wedge \text{body}^-(r) \cap X = \emptyset \right\rbrace.$$ Thus $P^X$ results from | |
212 | - 1. Remove every rule with a naf literal $\sim a$ where $a \in X$. | |
213 | - 2. Remove the naf literals of the remaining rules. | |
214 | -14. Since $P^X$ is a positive program, it has a unique $\subseteq$-minimal model. | |
215 | -15. $X$ is a **stable model** of $P$ if $X$ is the $\subseteq$-minimal model of $P^X$. | |
216 | -16. **Alternatively,** let ${\cal C}$ be the **consequence operator**, that yields the smallest model of a positive program. A **stable model** $X$ is a solution of $${\cal C}\left(P^X\right) = X.$$ | |
217 | - - _negative literals must only be true, while positive ones must also be provable._ | |
218 | -17. _A stable model is $\subseteq$-minimal but not the converse._ | |
219 | -18. _A positive program has a unique stable model, its smallest model._ | |
220 | -19. _If $X,Y$ are stable models of a normal program then $X \not\subset Y$._ | |
221 | -20. _Also, $X \subseteq {\cal C}(P^X) \subseteq \text{head}(P^X)$._ | |
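Definitions 13-16 can be made concrete in a few lines. The sketch below is hypothetical (a normal rule is assumed to be a triple `(head, positive body, naf body)`); it tests stability literally as ${\cal C}(P^X) = X$:

```python
def reduct(rules, x):
    # Drop rules whose naf body meets x; strip naf literals from the rest.
    return [(h, bp) for h, bp, bn in rules if not (bn & x)]

def cn(positive_rules):
    # Consequence operator: least fixpoint = smallest model of a positive program.
    m, changed = set(), True
    while changed:
        changed = False
        for h, bp in positive_rules:
            if bp <= m and h not in m:
                m.add(h)
                changed = True
    return m

def is_stable(rules, x):
    return cn(reduct(rules, x)) == x

R1 = [("p", set(), {"q"})]   # p :- not q.
R2 = [("p", set(), {"p"})]   # p :- not p.

print(is_stable(R1, {"p"}))  # True: {p} is the unique stable model
print(is_stable(R2, set()), is_stable(R2, {"p"}))  # False False: no stable model
```

`R2` reproduces the `p :- not p.` example from the ASP vs. Prolog comparison: neither candidate set is a fixpoint, so the program has zero stable models.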
222 | - | |
223 | -## ASP Programming Strategies | |
224 | - | |
225 | -- **Elimination of unnecessary combinatorics.** The number of grounded instances has a huge impact on performance. Rules can be used as "pre-computation" steps. |
226 | -- **Boolean Constraint Solving.** This is at the core of the **solving** step, e.g. `clasp`. | |
227 | - | |
228 | -## ASP vs. Prolog | |
229 | - | |
230 | -- The different number of stable models lies precisely at the core difference between Prolog and ASP. **In Prolog, programs with negation that lack a unique stable model cause trouble, and SLDNF resolution does not terminate on them [17]**. However, ASP embraces the multiplicity of stable models and treats the stable models of a program as solutions to a given search problem (from [Prolog and Answer Set Programming: Languages in Logic Programming](https://silviacasacuberta.files.wordpress.com/2020/07/final_paper.pdf)). |
231 | -- Prolog programs may not terminate (`p :- \+ p.`); ASP "programs" always terminate (`p :- not p.` has zero solutions). | |
232 | -- ASP restricts function symbols so that grounding stays finite; Prolog allows them freely. |
233 | - | |
234 | - | |
235 | -## References | |
236 | - | |
237 | -1. Martin Gebser, Roland Kaminski, Benjamin Kaufmann, Torsten Schaub, _Answer Set Solving in Practice_, Morgan & Claypool (2013). |
238 | -2. [Potassco, clingo and gringo](https://potassco.org/). |
239 | -3. ["Answer Set Programming" lecture notes](http://web.stanford.edu/~vinayc/logicprogramming/html/answer_set_programming.html) for Stanford's course on Logic Programming by Vinay K. Chaudhri. Check also [the ILP section](http://web.stanford.edu/~vinayc/logicprogramming/html/inductive_logic_programming.html), this ASP example of an [encoding](http://www.stanford.edu/~vinayc/logicprogramming/epilog/jackal_encoding.lp) and related [instance](http://www.stanford.edu/~vinayc/logicprogramming/epilog/jackal_instance.lp), and [project suggestions](http://web.stanford.edu/~vinayc/logicprogramming/html/projects.html). |
240 | 0 | \ No newline at end of file |
text/00_DistSem.md
... | ... | @@ -1,111 +0,0 @@ |
1 | -# Distribution Semantics of Probabilistic Logic Programs | |
2 | - | |
3 | -> There are two major approaches to integrating probabilistic reasoning into logical representations: **distribution semantics** and **maximum entropy**. | |
4 | - | |
5 | -> - Is there a **sound interpretation of ASP**, in particular of **stable models**, under either of the two approaches above? |
6 | -> - Under such interpretation, **what probabilistic problems can be addressed?** MARG? MLE? MAP? Decision? | |
7 | -> - **What is the relation to other logic and uncertainty approaches?** Independent Choice Logic? Abduction? Stochastic Logic Programs? etc. | |
8 | - | |
9 | - | |
10 | -## Maximum Entropy Summary | |
11 | - | |
12 | -> ME approaches annotate uncertainties only at the level of a logical theory. That is, they assume that the predicates in the BK are labelled as either true or false, but the label may be incorrect. | |
13 | - | |
14 | -These approaches are not based on logic programming, but rather on first-order logic. Consequently, the underlying semantics are different: rather than consider proofs, **these approaches consider models or groundings of a theory**. | |
15 | - | |
16 | -This difference primarily changes what uncertainties represent. For instance, Markov Logic Networks (MLN) represent programs as a set of weighted clauses. The weights in MLN do not correspond to probabilities of a formula being true but, intuitively, to a log odds between a possible world (an interpretation) where the clause is true and a world where the clause is false. | |
17 | - | |
18 | -## Distribution Semantics | |
19 | - | |
20 | -> DS approaches explicitly annotate uncertainties in BK. To allow such annotation, they extend Prolog with two primitives for stochastic execution: probabilistic facts and annotated disjunctions. | |
21 | - | |
22 | -Probabilistic facts are the most basic stochastic primitive and they take the form of logical facts labelled with a probability p. **Each probabilistic fact represents a Boolean random variable that is true with probability p and false with probability 1 − p.** _This is very close to facts in ASP. A "simple" syntax extension would be enough to capture probability annotations. **What about the semantics of such programs?**_ | |
23 | - | |
24 | -Whereas probabilistic facts introduce non-deterministic behaviour on the level of facts, annotated disjunctions introduce non-determinism on the level of clauses. Annotated disjunctions allow for multiple literals in the head, where only one of the head literals can be true at a time. | |
25 | - | |
26 | -### Core Distribution Semantics | |
27 | - | |
28 | -- Let $F$ be a set of **grounded probabilistic facts** and $P:F \to \left[0, 1 \right]$. | |
29 | - | |
30 | -> For example, `F` and `P` result from | |
31 | -> ```prolog | |
32 | -> 0.9::edge(a,c). | |
33 | -> 0.7::edge(c,b). | |
34 | -> 0.6::edge(d,c). | |
35 | -> 0.9::edge(d,b). | |
36 | -> ``` | |
37 | - | |
38 | -- **Facts are assumed marginally independent:** $$\forall a,b \in F, P(a \wedge b) = P(a)P(b).$$ | |
39 | - | |
40 | -- The **probability of $S \subseteq F$** is $$P_F(S) = \prod_{f \in S} P(f) \prod_{f \not\in S} \left(1 - P(f) \right).$$ | |
41 | - | |
42 | -- Let $R$ be a set of **definite clauses** defining further (new) predicates. | |
43 | - | |
44 | -> For example, `R` is | |
45 | -> ```prolog | |
46 | -> path(X,Y) :- edge(X,Y). | |
47 | -> path(X,Y) :- edge(X,Z), path(Z,Y). | |
48 | -> ``` | |
49 | - | |
50 | -- Any combination $S \cup R$ has a **unique least Herbrand model**, $$W = M(S \cup R).$$ |
51 | - | |
52 | -- **That uniqueness fails for stable models.** Exactly why? - What is the relation of stable models and least Herbrand models? | |
53 | - | |
54 | -- The set of ground facts $S$ is an **explanation** of the world $W = M(S \cup R)$. A world might have multiple explanations. In ASP an explanation can entail 0, 1 or more worlds. |
55 | - | |
56 | -- The **probability of a possible world** $W$ is | |
57 | -$$P(W) = \sum_{S \subseteq F :~W=M(S\cup R)} P_F(S).$$ | |
58 | - | |
59 | -- The **probability of a ground proposition** $p$ is (defined as) the probability that $p$ has a proof: $$P(p) = \sum_{S :~ S\cup R ~\vdash~ p} P_F(S) = \sum_{W :~ p\in W} P(W).$$ | |
60 | - | |
61 | -- A proposition may have many proofs in a single world $M(S \cup R)$. Without further guarantees, the probabilities of those proofs cannot be summed. The definition above avoids this problem. |
62 | - | |
63 | - | |
64 | -> For example, a proof of `path(a,b)` employs (only) the facts `edge(a,c)` and `edge(c,b)` _i.e._ these facts are an explanation of `path(a,b)`. Since these facts are (marginally) independent, **the probability of the proof** is $$\begin{aligned}P(\text{path}(a, b)) & = P(\text{edge}(a,c) \wedge\text{edge}(c,b)) \\&= P(\text{edge}(a,c)) \times P(\text{edge}(c,b)) \\ &= 0.9 \times 0.7 \\ &= 0.63. \end{aligned}$$ | |
65 | -> This is the only proof of `path(a,b)` so $P(\text{path}(a,b)) = 0.63$. | |
66 | -> | |
67 | -> On the other hand, since `path(d,b)` has two explanations, `edge(d,b)` and `edge(d,c), edge(c,b)`: $$\begin{aligned} P(\text{path}(d,b)) & = P\left(\text{edge}(d,b) \vee \left(\text{edge}(d,c)\wedge\text{edge}(c,b)\right)\right) \\ &= 0.9 + 0.6 \times 0.7 - 0.9 \times 0.6 \times 0.7 \\ &= 0.942.\end{aligned}$$ |
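Both results can be reproduced by brute force: enumerate every subset $S \subseteq F$, weight it by $P_F(S)$, and sum over the subsets that prove the query. This naive Python sketch assumes the four edge facts above; `FACTS`, `prob_path` and the tuple encoding are illustrative names, not an existing API:

```python
from itertools import product

FACTS = {("a", "c"): 0.9, ("c", "b"): 0.7, ("d", "c"): 0.6, ("d", "b"): 0.9}

def path(edges, x, y, seen=()):
    # path(X,Y) :- edge(X,Y).   path(X,Y) :- edge(X,Z), path(Z,Y).
    if (x, y) in edges:
        return True
    return any(z not in seen and path(edges, z, y, seen + (x,))
               for (u, z) in edges if u == x)

def prob_path(x, y):
    items = list(FACTS.items())
    total = 0.0
    for bits in product([False, True], repeat=len(items)):
        world_p = 1.0
        edges = set()
        for (edge, p), present in zip(items, bits):
            world_p *= p if present else 1.0 - p
            if present:
                edges.add(edge)
        if path(edges, x, y):   # the query has a proof from this subset S
            total += world_p    # add P_F(S)
    return total

print(round(prob_path("a", "b"), 3))  # 0.63
print(round(prob_path("d", "b"), 3))  # 0.942
```

Summing over subsets rather than over proofs is what sidesteps the danger of double-counting overlapping explanations.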
68 | - | |
69 | -- With this **semantics of the probability of a possible world**, the probability of an arbitrary proposition is still hard to compute, because of the _disjoint-sum_ problem: **an explanation can have many worlds.** Since the probability is computed via the explanation, if there are many models for a single explanation, **how to assign probability to specific worlds within the same explanation?** |
70 | - | |
71 | -> Because computing the probability of a fact or goal under the distribution semantics is hard, systems such as Prism [4] and Probabilistic Horn Abduction (PHA) [8] impose additional restrictions that can be used to improve the efficiency of the inference procedure. | |
72 | -> | |
73 | -> **The key assumption is that the explanations for a goal are mutually exclusive, which overcomes the disjoint-sum problem.** If the different explanations of a fact do not overlap, then its probability is simply the sum of the probabilities of its explanations. This directly follows from the inclusion-exclusion formulae as under the exclusive-explanation assumption the conjunctions (or intersections) are empty (_Statistical Relational Learning_, Luc De Raedt and Kristian Kersting, 2010) | |
74 | -> | |
75 | -> **This assumption/restriction is quite _ad-hoc_ and overcoming it requires further inquiry.** | |
76 | - | |
77 | -- Reading Fabio Gagliardi Cozman, Denis Deratani Mauá, _The Joy of Probabilistic Answer Set Programming: Semantics, Complexity, Expressivity, Inference_ (2020) gave a big boost, confirming my initial intuition. |
78 | - | |
79 | -- The problem can be illustrated with disjunctive clauses, such as the one in the following example. | |
80 | - | |
81 | -```prolog | |
82 | -a ; -a. % prob(a) = 0.7 | |
83 | -b ; c :- a. | |
84 | -``` | |
85 | - | |
86 | -- More specifically, in the example above, **the explanation `a` entails two possible worlds, `ab` and `ac`. How to assign the probability of each one?** |
87 | - | |
88 | -### Assigning Probabilities on "Multiple Worlds per Explanation" Scenarios | |
89 | - | |
90 | -#### Clause Annotations | |
91 | - | |
92 | -> Assign a probability to each case in the head of the clause. For example, annotate $P(b|a) = 0.8$. | |
93 | - | |
94 | -This case needs further study on the respective consequences, specially concerning the joint probability distribution. | |
95 | - | |
96 | -- In particular, $P(b|a) = 0.8$ entails $P(\neg b | a) = 0.2$. But $\neg b$ is not in any world. | |
97 | -- Also, unless assumed the contrary, the independence of $b$ and $c$ is unknown. | |
98 | - | |
99 | -#### Learn from Observations | |
100 | - | |
101 | -> Leave the probabilities uniformly distributed; update them from observation. | |
102 | - | |
103 | -Under this approach, how do observations affect the assigned probabilities? | |
104 | - | |
105 | -- In particular, how to update the probabilities of the worlds `a b` and `a c` given observations such as `a`, `b`, `ab`, `a-b`, `-ab` or `abc`? | |
106 | - 1. Define a criterion to decide whether an observation $z$ is compatible with a world $w$. For example, $z \subseteq w$. |
107 | - 2. Define the probability of a world from the explanation probability and a count of **compatible observations**. |
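One possible concretisation of these two steps, purely illustrative: the criterion $z \subseteq w$, the proportional counting scheme, and all names below are assumptions, not an established method.

```python
# Worlds of the explanation `a` in the example program; names are hypothetical.
WORLDS = [frozenset("ab"), frozenset("ac")]

def world_probs(worlds, observations, p_explanation):
    """Split the explanation's probability among its worlds in proportion
    to the number of compatible observations (z compatible with w iff z <= w)."""
    counts = [sum(z <= w for z in observations) for w in worlds]
    if sum(counts) == 0:           # no compatible data: stay uniform
        counts = [1] * len(worlds)
    total = sum(counts)
    return [p_explanation * c / total for c in counts]

obs = [frozenset("a"), frozenset("ab"), frozenset("b")]
print([round(p, 3) for p in world_probs(WORLDS, obs, 0.7)])  # [0.525, 0.175]
```

Note that the two world probabilities always sum to the explanation probability (here 0.7), so the distribution semantics at the level of explanations is preserved.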
108 | - | |
109 | -#### Leave One World Out | |
110 | - | |
111 | -> Define a **compatibility criterion** for observations and worlds; add another world and update its probability on incompatible observations. The probability of this world measures the model's and sensors' limitations. |
text/00_DistSem.pdf
No preview for this file type
text/00_ILP.md
... | ... | @@ -1,74 +0,0 @@ |
1 | -# Inductive Logic Programming | |
2 | - | |
3 | -> Inductive logic programming (ILP) is a form of machine learning (ML). As with | |
4 | -> other forms of ML, the goal of ILP is to induce a hypothesis that generalises training examples. However, whereas most forms of ML use vectors/tensors to represent data (examples and hypotheses), ILP uses logic programs (sets of logical rules). Moreover, whereas most forms of ML learn functions, ILP learns relations. |
5 | - | |
6 | -## Why ILP? | |
7 | - | |
8 | -- **Data efficiency.** Many forms of ML are notorious for their inability to generalise from small numbers of training examples, notably deep learning. By contrast, ILP can induce hypotheses from small numbers of examples, often from a single example. | |
9 | -- **Background knowledge.** ILP learns using BK represented as a logic program. Moreover, because hypotheses are symbolic, they can be added to the BK, and thus ILP systems naturally support lifelong and transfer learning. |
10 | -- **Expressivity.** Because of the expressivity of logic programs, ILP can learn complex relational theories. Because of the symbolic nature of logic programs, ILP can reason about hypotheses, which allows it to learn optimal programs. | |
11 | -- **Explainability.** Because of logic’s similarity to natural language, logic programs can be easily read by humans, which is crucial for explainable AI. | |
12 | - | |
13 | -## Recent Advances | |
14 | - | |
15 | -- Search: Meta-level | |
16 | -- Recursion: Yes | |
17 | -- Predicate Invention: Limited | |
18 | -- Hypotheses: Higher-order; ASP | |
19 | -- Optimality: Yes | |
20 | -- Technology: Prolog; ASP; NNs | |
21 | - | |
22 | -### Review | |
23 | - | |
24 | -- **Search.** The fundamental ILP problem is to efficiently search a large hypothesis space. Most older ILP approaches search in either a top-down or bottom-up fashion. A third new search approach has recently emerged called meta-level ILP. | |
25 | - - **Top-down** approaches start with a general hypothesis and then specialise it. | |
26 | - - **Bottom-up** approaches start with the examples and generalise them. | |
27 | - - **Meta-level.** (Most) approaches encode the ILP problem as a program that reasons about programs. | |
28 | -- **Recursion.** Learning recursive programs has long been considered a difficult problem for ILP. The power of recursion is that an infinite number of computations can be described by a finite recursive program. | |
29 | - - Interest in recursion has resurged with the introduction of meta-interpretive learning (MIL) and the MIL system Metagol. The key idea of MIL is to use metarules, or program templates, to restrict the form of inducible programs, and thus the hypothesis space. A metarule is a higher-order clause. Following MIL, many meta-level ILP systems can learn recursive programs. With recursion, ILP systems can now generalise from small numbers of examples, often a single example. Moreover, the ability to learn recursive programs has opened up ILP to new application areas. | |
30 | -- **Predicate invention.** A key characteristic of ILP is the use of BK. BK is similar to features used in most forms of ML. However, whereas features are tables, BK contains facts and rules (extensional and intensional definitions) in the form of a logic program. | |
31 | - - Rather than expecting a user to provide all the necessary BK, the goal of predicate invention (PI) is for an ILP system to automatically invent new auxiliary predicate symbols. Whilst PI has attracted interest since the beginnings of ILP, and has subsequently been repeatedly stated as a major challenge, most ILP systems do not support it. | |
32 | - - Several PI approaches try to address this challenge: Placeholders, Metarules, Pre/post-processing, Lifelong Learning. | |
33 | - - The aforementioned techniques have improved the ability of ILP to invent high-level concepts. However, PI is still difficult and there are many challenges to overcome. The challenges are that (i) many systems struggle to perform PI at all, and (ii) those that do support PI mostly need much user-guidance, metarules to restrict the space of invented symbols or that a user specifies the arity and argument types of invented symbols. | |
34 | -- ILP systems have traditionally induced definite and normal logic programs, typically represented as Prolog programs. A recent development has been to use different **hypothesis representations**. | |
35 | - - **Datalog** is a syntactical subset of Prolog which disallows complex terms as arguments of predicates and imposes restrictions on the use of negation. The general motivation for reducing the expressivity of the representation language from Prolog to Datalog is to allow the problem to be encoded as a satisfiability problem, particularly to leverage recent developments in SAT and SMT. | |
36 | - - **Answer Set Programming** (ASP) is a logic programming paradigm based on the stable model semantics of normal logic programs that can be implemented using the latest advances in SAT solving technology. | |
37 | - - When learning Prolog programs, the procedural aspect of SLD-resolution must be taken into account. By contrast, as ASP is a truly declarative language, no such consideration need be taken into account when learning ASP programs. Compared to Datalog and Prolog, ASP supports additional language constructs, such as disjunction in the head of a clause, choice rules, and hard and weak constraints. | |
38 | - - **A key difference between ASP and Prolog is semantics.** A definite logic program has only one model (the least Herbrand model). By contrast, an ASP program can have one, many, or even no stable models (answer sets). Due to its non-monotonicity, ASP is particularly useful for expressing common-sense reasoning. | |
39 | - - Approaches to learning ASP programs can mostly be divided into two categories: **brave learners**, which aim to learn a program such that at least one answer set covers the examples, and **cautious learners**, which aim to find a program which covers the examples in all answer sets. | |
40 | - - **Higher-order programs** where predicate symbols can be used as terms. | |
41 | - - **Probabilistic logic programs.** A major limitation of logical representations, such as Prolog and its derivatives, is the implicit assumption that the BK is perfect. This assumption is problematic if data is noisy, which is often the case. | |
42 | - - **Integrating probabilistic reasoning into logical representations** is a principled way to handle such uncertainty in data. This integration is the focus of statistical relational artificial intelligence (StarAI). In essence, StarAI hypothesis representations extend BK with probabilities or weights indicating the degree of confidence in the correctness of parts of BK. Generally, StarAI techniques can be divided in two groups: _distribution representations_ and _maximum entropy_ approaches. | |
43 | - - **Distribution semantics** approaches explicitly annotate uncertainties in BK. To allow such annotation, they extend Prolog with two primitives for stochastic execution: probabilistic facts and annotated disjunctions. Probabilistic facts are the most basic stochastic primitive and they take the form of logical facts labelled with a probability p. Each probabilistic fact represents a Boolean random variable that is true with probability p and false with probability 1 − p. Whereas probabilistic facts introduce non-deterministic behaviour on the level of facts, annotated disjunctions introduce non-determinism on the level of clauses. Annotated disjunctions allow for multiple literals in the head, where only one of the head literals can be true at a time. | |
44 | - - **Maximum entropy** approaches annotate uncertainties only at the level of a logical theory. That is, they assume that the predicates in the BK are labelled as either true or false, but the label may be incorrect. These approaches are not based on logic programming, but rather on first-order logic. Consequently, the underlying semantics are different: rather than consider proofs, these approaches consider models or groundings of a theory. This difference primarily changes what uncertainties represent. For instance, Markov Logic Networks (MLN) represent programs as a set of weighted clauses. The weights in MLN do not correspond to probabilities of a formula being true but, intuitively, to a log odds between a possible world (an interpretation) where the clause is true and a world where the clause is false. | |
45 | - - The techniques from learning such probabilistic programs are typically direct extensions of ILP techniques. | |
46 | -- **Optimality.** There are often multiple (sometimes infinitely many) hypotheses that explain the data. Deciding which hypothesis to choose has long been a difficult problem. | |
47 | - - Older ILP systems were not guaranteed to induce optimal programs, where optimal typically means with respect to the size of the induced program or the coverage of examples. A key reason for this limitation was that most search techniques learned a single clause at a time, leading to the construction of sub-programs which were sub-optimal in terms of program size and coverage. | |
48 | - - Newer ILP systems try to address this limitation. As with the ability to learn recursive programs, the main development is to take a global view of the induction task by using meta-level search techniques. In other words, rather than induce a single clause at a time from a single example, the idea is to induce multiple clauses from multiple examples. | |
49 | - - The ability to learn optimal programs opens up ILP to new problems. For instance, learning efficient logic programs has long been considered a difficult problem in ILP, mainly because there is no declarative difference between an efficient program and an inefficient program. | |
50 | -- **Technologies.** Older ILP systems mostly use Prolog for reasoning. Recent work considers using different technologies. | |
51 | - - **Constraint satisfaction and satisfiability.** There have been tremendous recent advances in SAT. | |
52 | - - To leverage these advances, much recent work in ILP uses related techniques, notably ASP. The main motivations for using ASP are to leverage (i) the language benefits of ASP, and (ii) the efficiency and optimisation techniques of modern ASP solvers, which supports conflict propagation and learning. | |
53 | - - With similar motivations, other approaches encode the ILP problem as SAT or SMT problems. | |
54 | - - These approaches have been shown able to **reduce learning times** compared to standard Prolog-based approaches. However, some unresolved issues remain. A key issue is that most approaches **encode an ILP problem as a single (often very large) satisfiability problem**. These approaches therefore often struggle to scale to very large problems, although preliminary work attempts to tackle this issue. | |
55 | - - **Neural Networks.** With the rise of deep learning, several approaches have explored using gradient-based methods to learn logic programs. These approaches all **replace discrete logical reasoning with a relaxed version that yields continuous values** reflecting the confidence of the conclusion. | |
56 | -- **Applications.** | |
57 | - - **Scientific discovery.** Perhaps the most prominent application of ILP is in scientific discovery: identifying and predicting ligands (substructures responsible for medicinal activity), inferring missing pathways in protein signalling networks, and modelling problems in ecology. | |
58 | - - **Program analysis.** Learning SQL queries, programming language semantics, and code search. | |
59 | - - **Robotics.** Robotics applications often require incorporating domain knowledge or imposing certain requirements on the learnt programs. | |
60 | - - **Games.** Inducing game rules has a long history in ILP, where chess has often been the focus. | |
61 | - - **Data curation and transformation.** Another successful application of ILP is in data curation and transformation, which is again largely because ILP can learn executable programs. There is much interest in this topic, largely due to success in synthesising programs for end-user problems, such as string transformations. Other transformation tasks include extracting values from semi-structured data (e.g. XML files or medical records), extracting relations from ecological papers, and spreadsheet manipulation. | |
62 | - - **Learning from trajectories.** Learning from interpretation transitions (LFIT) automatically constructs a model of the dynamics of a system from the observation of its state transitions. LFIT has been applied to learn biological models, like Boolean Networks, under several semantics: memory-less deterministic systems, and their multi-valued extensions. The Apperception Engine explains sequential data, such as cellular automata traces, rhythms and simple nursery tunes, image occlusion tasks, game dynamics, and sequence induction intelligence tests. Surprisingly, it can achieve human-level performance on the sequence induction intelligence tests in the zero-shot setting (without having been trained on lots of other examples of such tests, and without hand-engineered knowledge of the particular setting). At a high level, these systems take the unique selling point of ILP systems (the ability to strongly generalise from a handful of data), and apply it to the self-supervised setting, producing an explicit human-readable theory that explains the observed state transitions. | |
63 | -- **Limitations and future research.** | |
64 | - - **Better systems.** A problem with ILP is the lack of well engineered tools. They state that whilst over 100 ILP systems have been built, less than a handful of systems can be meaningfully used by ILP researchers. By contrast, driven by industry, other forms of ML now have reliable and well-maintained implementations, which has helped drive research. A frustrating issue with ILP systems is that they use many different language biases or even different syntax for the same biases. _For ILP to be more widely adopted both inside and outside of academia, we must develop more standardised, user-friendly, and better-engineered tools._ | |
65 | - - **Language biases.** One major issue with ILP is choosing an appropriate language bias. Even for ILP experts, determining a suitable language bias is often a frustrating and time-consuming process. We think the need for an almost perfect language bias is severely holding back ILP from being widely adopted. _We think that an important direction for future work in ILP is to develop techniques for automatically identifying suitable language biases._ This area of research is largely under-researched. | |
66 | - - **Better datasets.** Interesting problems, alongside usable systems, drive research and attract interest in a research field. This relationship is most evident in the deep learning community which has, over a decade, grown into the largest AI community. This community growth has been supported by the constant introduction of new problems, datasets, and well-engineered tools. ILP has, unfortunately, failed to deliver on this front: most research is still evaluated on 20-year old datasets. Most new datasets that have been introduced often come from toy domains and are designed to test specific properties of the introduced technique. _We think that the ILP community should learn from the experiences of other AI communities and put significant efforts into developing datasets that identify limitations of existing methods as well as showcase potential applications of ILP._ | |
67 | - - **Relevance.** New methods for predicate invention have improved the abilities of ILP systems to learn large programs. Moreover, these techniques raise the potential for ILP to be used in lifelong learning settings. However, inventing and acquiring new BK could lead to a problem of too much BK, which can overwhelm an ILP system. On this issue, a key under-explored topic is that of relevancy. _Given a new induction problem with large amounts of BK, how does an ILP system decide which BK is relevant?_ One emerging technique is to train a neural network to score how relevant programs are in the BK and to then only use BK with the highest score to learn programs. Without efficient methods of relevance identification, it is unclear how efficient lifelong learning can be achieved. | |
68 | - - **Handling mislabelled and ambiguous data.** A major open question in ILP is how best to handle noisy and ambiguous data. Neural ILP systems are designed from the start to robustly handle mislabelled data. Although there has been work in recent years on designing ILP systems that can handle noisy mislabelled data, there is much less work on the even harder and more fundamental problem of designing ILP systems that can handle raw ambiguous data. ILP systems typically assume that the input has already been preprocessed into symbolic declarative form (typically, a set of ground atoms representing positive and negative examples). But real-world input does not arrive in symbolic form. _For ILP systems to be widely applicable in the real world, they need to be redesigned so they can handle raw ambiguous input from the outset._ | |
69 | - - **Probabilistic ILP.** Real-world data is often noisy and uncertain. Extending ILP to deal with such uncertainty substantially broadens its applicability. While StarAI is receiving growing attention, **learning probabilistic programs from data is still largely under-investigated due to the complexity of joint probabilistic and logical inference.** When working with probabilistic programs, we are interested in the probability that a program covers an example, not only whether the program covers the example. Consequently, probabilistic programs need to compute all possible derivations of an example, not just a single one. Despite added complexity, probabilistic ILP opens many new challenges. Most of the existing work on probabilistic ILP considers the minimal extension of ILP to the probabilistic setting, by assuming that either (i) BK facts are uncertain, or (ii) that learned clauses need to model uncertainty. **These assumptions make it possible to separate structure from uncertainty and simply reuse existing ILP techniques.** Following this minimal extension, the existing work focuses on discriminative learning in which the goal is to learn a program for a single target relation. However, a grand challenge in probabilistic programming is generative learning. That is, learning a program describing a generative process behind the data, not a single target relation. **Learning generative programs is a significantly more challenging problem, which has received very little attention in probabilistic ILP.** | |
70 | - - **Explainability.** Explainability is one of the claimed advantages of a symbolic representation. Recent work evaluates the comprehensibility of ILP hypotheses using Michie’s framework of ultra-strong machine learning, where a learned hypothesis is expected to not only be accurate but to also demonstrably improve the performance of a human being provided with the learned hypothesis. [Some work] empirically demonstrate improved human understanding directly through learned hypotheses. _However, more work is required to better understand the conditions under which this can be achieved, especially given the rise of PI._ | |
71 | - | |
72 | -## Bibliography | |
73 | - | |
74 | -1. Inductive logic programming at 30 | |
75 | 0 | \ No newline at end of file |
text/00_ILP.pdf
No preview for this file type
text/00_PASP.md
... | ... | @@ -1,143 +0,0 @@ |
1 | -# Probabilistic Answer Set Programming | |
2 | - | |
3 | -## Non-stratified programs | |
4 | - | |
5 | -> Minimal example of **non-stratified program**. | |
6 | - | |
7 | -The following annotated LP, with clauses $c_1, c_2, c_3$ respectively, is non-stratified (because has a cycle with negated arcs) but no head is disjunctive: | |
8 | -```prolog | |
9 | -0.3::a. % c1 | |
10 | -s :- not w, not a. % c2 | |
11 | -w :- not s. % c3 | |
12 | -``` | |
13 | - | |
14 | -This program has three stable models: | |
15 | -$$ | |
16 | -\begin{aligned} | |
17 | -m_1 &= \set{ a, w } \cr | |
18 | -m_2 &= \set{ \neg a, s } \cr | |
19 | -m_3 &= \set{ \neg a, w } | |
20 | -\end{aligned} | |
21 | -$$ | |
22 | - | |
23 | -The probabilistic clause `0.3::a.` defines a **total choice** | |
24 | -$$ | |
25 | -\Theta = \set{ | |
26 | -    \theta_1 = \set{ a }, | |
27 | - \theta_2 = \set{ \neg a } | |
28 | -} | |
29 | -$$ | |
30 | -such that | |
31 | -$$ | |
32 | -\begin{aligned} | |
33 | -P(\Theta = \set{ a }) &= 0.3\cr | |
34 | -P(\Theta = \set{ \neg a }) &= 0.7 \cr | |
35 | -\end{aligned} | |
36 | -$$ | |
37 | - | |
38 | -> While it is natural to extend $P( m_1 ) = 0.3$ from $P(\theta_1) = 0.3$, there is no clear way to assign $P(m_2), P(m_3)$ since both models result from the total choice $\theta_2$. | |
39 | - | |
40 | - | |
41 | - | |
42 | -Under the **CWA**, $\sim\!\!q \models \neg q$, so $c_2, c_3$ induce probabilities: | |
43 | - | |
44 | -$$ | |
45 | -\begin{aligned} | |
46 | -p_a &= P(a | \Theta) \cr | |
47 | -p_s &= P(s | \Theta) &= (1 - p_w)(1 - p_a) \cr | |
48 | -p_w &= P(w | \Theta) &= (1 - p_s) | |
49 | -\end{aligned} | |
50 | -$$ | |
51 | -from which results | |
52 | -$$ | |
53 | -\begin{equation} | |
54 | -p_s = p_s(1 - p_a). | |
55 | -\end{equation} | |
56 | -$$ | |
57 | - | |
58 | -So, if $\Theta = \theta_1 = \set{ a }$ (one stable model): | |
59 | - | |
60 | -- We have $p_a = P(a | \Theta = \set{ a }) = 1$. | |
61 | -- Equation (1) becomes $p_s = 0$. | |
62 | -- From $p_w = 1 - p_s$ we get $P(w | \Theta) = 1$. | |
63 | - | |
64 | -and if $\Theta = \theta_2 = \set{ \neg a }$ (two stable models): | |
65 | - | |
66 | -- We have $p_a = P(a | \Theta = \set{ \neg a }) = 0$. | |
67 | -- Equation (1) becomes $p_s = p_s$; since we know nothing about $p_s$, we let $p_s = \alpha \in \left[0, 1\right]$. | |
68 | -- We still have the relation $p_w = 1 - p_s$ so $p_w = 1 - \alpha$. | |
69 | - | |
70 | -We can now define the **marginals** for $s, w$: | |
71 | -$$ | |
72 | -\begin{aligned} | |
73 | -P(s) &=\sum_\theta P(s|\theta)P(\theta)= 0.7\alpha \cr | |
74 | -P(w) &=\sum_\theta P(w|\theta)P(\theta)= 0.3 + 0.7(1 - \alpha) \cr | |
75 | -\alpha &\in\left[ 0, 1 \right] | |
76 | -\end{aligned} | |
77 | -$$ | |
78 | - | |
79 | -> The parameter $\alpha$ not only **expresses insufficient information** to sharply define $p_s$ but also **relates** $p_s$ and $p_w$. | |
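These marginals can be checked by enumerating the three stable models with the weights induced by the total choices. The following is an illustrative sketch (the helper `marginals` is not part of the original notes): $m_1$ carries the full mass of $\theta_1$, while $\theta_2$ splits its mass $0.7$ between $m_2$ and $m_3$ using $\alpha$.

```python
# Sketch: marginals of the non-stratified example, for a given alpha.
# Model weights follow the text: theta1 = {a} yields m1 only; theta2 = {-a}
# splits its mass 0.7 between m2 and m3 using the parameter alpha.

def marginals(alpha):
    models = [
        ({"a", "w"}, 0.3),                    # m1 = {a, w}
        ({"not a", "s"}, 0.7 * alpha),        # m2 = {-a, s}
        ({"not a", "w"}, 0.7 * (1 - alpha)),  # m3 = {-a, w}
    ]
    prob = lambda atom: sum(w for model, w in models if atom in model)
    return prob("s"), prob("w")

for alpha in (0.0, 0.25, 0.5, 1.0):
    p_s, p_w = marginals(alpha)
    assert abs(p_s - 0.7 * alpha) < 1e-9
    assert abs(p_w - (0.3 + 0.7 * (1 - alpha))) < 1e-9
    # exactly one of s, w holds in each stable model, so the marginals sum to 1
    assert abs(p_s + p_w - 1.0) < 1e-9
```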
80 | - | |
81 | -## Disjunctive heads | |
82 | - | |
83 | -> Minimal example of **disjunctive heads** program. | |
84 | - | |
85 | -Consider this LP | |
86 | - | |
87 | -```prolog | |
88 | -0.3::a. | |
89 | -b ; c :- a. | |
90 | -``` | |
91 | - | |
92 | -with three stable models: | |
93 | -$$ | |
94 | -\begin{aligned} | |
95 | -m_1 &= \set{ \neg a } \cr | |
96 | -m_2 &= \set{ a, b } \cr | |
97 | -m_3 &= \set{ a, c } | |
98 | -\end{aligned} | |
99 | -$$ | |
100 | - | |
101 | -Again, $P(m_1) = 0.7$ is quite natural but there are no clear assignments for $P(m_2), P(m_3)$. | |
102 | - | |
103 | -The total choices here are | |
104 | -$$ | |
105 | -\Theta = \set{ | |
106 | -    \theta_1 = \set{ a }, | |
107 | - \theta_2 = \set{ \neg a } | |
108 | -} | |
109 | -$$ | |
110 | -such that | |
111 | -$$ | |
112 | -\begin{aligned} | |
113 | -P(\Theta = \set{ a }) &= 0.3\cr | |
114 | -P(\Theta = \set{ \neg a }) &= 0.7 \cr | |
115 | -\end{aligned} | |
116 | -$$ | |
117 | -and the LP induces | |
118 | -$$ | |
119 | -P(b \vee c | \Theta) = P(a | \Theta). | |
120 | -$$ | |
121 | - | |
122 | -Since the disjunction expands as | |
123 | -$$ | |
124 | -\begin{equation} | |
125 | -P(b \vee c | \Theta) = P(b | \Theta) + P( c | \Theta) - P(b \wedge c | \Theta) | |
126 | -\end{equation} | |
127 | -$$ | |
128 | -and we know that $P(b \vee c | \Theta) = P(a | \Theta)$, we need two independent parameters, for example | |
129 | -$$ | |
130 | -\begin{aligned} | |
131 | -P(b | \Theta) &= \beta, \cr | |
132 | -P(c | \Theta) &= \gamma, \cr | |
133 | -\end{aligned} | |
134 | -$$ | |
135 | -where | |
136 | -$$ | |
137 | -\begin{aligned} | |
138 | -    \beta & \in \left[0, 0.3\right] \cr | |
139 | -    \gamma & \in \left[0, \beta\right] | |
140 | -\end{aligned} | |
141 | -$$ | |
142 | - | |
143 | -This example also calls for reconsidering the CWA since it entails that **we should assume that $b$ and $c$ are conditionally independent given $a$.** | |
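One way to make the example concrete is by enumerating the stable models. This is an illustrative sketch (the helper `marginals` is hypothetical), under the simplifying assumption that all mass sits on the stable models, so $P(b \wedge c \mid \Theta) = 0$ and a single parameter $x$ splits $P(a) = 0.3$ between $m_2$ and $m_3$.

```python
# Sketch: model-level marginals for the disjunctive-head example.
# x in [0, 1] splits the mass of the total choice {a} between m2 and m3,
# assuming no mass on worlds containing both b and c.

def marginals(x):
    models = [
        ({"not a"}, 0.7),            # m1 = {-a}
        ({"a", "b"}, 0.3 * x),       # m2 = {a, b}
        ({"a", "c"}, 0.3 * (1 - x)), # m3 = {a, c}
    ]
    prob = lambda atom: sum(w for model, w in models if atom in model)
    return prob("b"), prob("c")

p_b, p_c = marginals(0.4)
assert abs(p_b - 0.12) < 1e-9
assert abs(p_b + p_c - 0.3) < 1e-9  # P(b) + P(c) = P(a) under this assumption
```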
144 | 0 | \ No newline at end of file |
text/00_PASP.pdf
No preview for this file type
text/00_POTASSCO.md
... | ... | @@ -1,13 +0,0 @@ |
1 | -# Potassco | |
2 | - | |
3 | -> [Potassco](https://potassco.org/), the Potsdam Answer Set Solving Collection, bundles tools for Answer Set Programming developed at the University of Potsdam. | |
4 | - | |
5 | -- [The Potassco Guide](https://github.com/potassco/guide) | |
6 | - | |
7 | -## clingo | |
8 | - | |
9 | -> Current answer set solvers work on variable-free programs. Hence, a grounder is needed that, given an input program with first-order variables, computes an equivalent ground (variable-free) program. gringo is such a grounder. Its output can be processed further with clasp, claspfolio, or clingcon. | |
10 | -> | |
11 | -> [clingo](https://potassco.org/clingo/) combines both gringo and clasp into a monolithic system. This way it offers more control over the grounding and solving process than gringo and clasp can offer individually - e.g., incremental grounding and solving. | |
12 | - | |
13 | -- [Python module list](https://potassco.org/clingo/python-api/current/) | |
14 | 0 | \ No newline at end of file |
text/00_PROB.md
... | ... | @@ -1,41 +0,0 @@ |
1 | -# Probability Problems | |
2 | - | |
3 | ->- What are the general tasks we expect to solve with probabilistic programs? | |
4 | -> - The **MAP** task is the one with best applications. It is also the hardest to compute. | |
5 | -> - **MLE** is the limit case of **MAP**. Has simpler computations but overfits the data. | |
6 | - | |
7 | -## Background | |
8 | - | |
9 | -- **Conditional Probability** $$P(A, B) = P(B | A) P(A).$$ | |
10 | -- **Bayes Theorem** $$P(B | A) = \frac{P(A | B) P(B)}{P(A)}.$$ | |
11 | -- **For maximization tasks** $$P(B | A) \propto P(A | B) P(B).$$ | |
12 | -- **Marginal** $$P(A) = \sum_b P(A,b).$$ | |
13 | -- In $P(B | A) \propto P(A | B) P(B)$, if the **posterior** $P(B | A)$ and the **prior** $P(B)$ follow distributions of the same family, $P(B)$ is a **conjugate prior** for the **likelihood** $P(A | B)$. | |
14 | -- **Density Estimation:** Estimate a joint probability distribution from a set of observations; Select a probability distribution function and the parameters that best explains the distributions of the observations. | |
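The identities above can be verified numerically on a small joint distribution. This is an illustrative sketch with arbitrary numbers, not from the original notes.

```python
# Sketch: numeric check of conditional probability, Bayes theorem and
# marginalisation on a toy joint distribution P(A, B) over binary A, B.
P = {(0, 0): 0.10, (0, 1): 0.30,
     (1, 0): 0.40, (1, 1): 0.20}

def p_A(a): return sum(P[(a, b)] for b in (0, 1))  # marginal P(A)
def p_B(b): return sum(P[(a, b)] for a in (0, 1))  # marginal P(B)
def p_B_given_A(b, a): return P[(a, b)] / p_A(a)   # P(B | A)
def p_A_given_B(a, b): return P[(a, b)] / p_B(b)   # P(A | B)

# Conditional probability: P(A, B) = P(B | A) P(A)
assert abs(P[(1, 1)] - p_B_given_A(1, 1) * p_A(1)) < 1e-9
# Bayes theorem: P(B | A) = P(A | B) P(B) / P(A)
assert abs(p_B_given_A(1, 0) - p_A_given_B(0, 1) * p_B(1) / p_A(0)) < 1e-9
# Marginals are a proper distribution
assert abs(p_A(0) + p_A(1) - 1.0) < 1e-9
```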
15 | - | |
16 | -## MLE: Maximum Likelihood Estimation | |
17 | - | |
18 | -> Given a probability **distribution** $d$ and a set of **observations** $X$, find the distribution **parameters** $\theta$ that maximize the **likelihood** (_i.e._ the probability of those observations) for that distribution. | |
19 | -> | |
20 | -> **Overfits the data:** high variance of the parameter estimate; sensitive to random variations in the data. Regularization with $P(\theta)$ leads to **MAP**. | |
21 | - | |
22 | -Given $d, X$, find | |
23 | -$$ | |
24 | -\hat{\theta}_{\text{MLE}}(d,X) = \arg_{\theta} \max P_d(X | \theta). | |
25 | -$$ | |
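For a concrete (illustrative) instance with a Bernoulli distribution: the MLE of $\theta$ is the sample mean, and a grid scan of the likelihood confirms the closed form. The helpers below are a sketch, not from the original notes.

```python
from math import prod

# Sketch: MLE for a Bernoulli distribution. The closed form is the sample
# mean k/n; a grid scan confirms it maximises the likelihood P(X | theta).

def mle_bernoulli(observations):
    return sum(observations) / len(observations)

def likelihood(theta, observations):
    return prod(theta if x else (1 - theta) for x in observations)

X = [1, 0, 1, 1, 0, 1, 1, 1]   # 6 successes in 8 trials
theta_hat = mle_bernoulli(X)   # sample mean

grid = [i / 1000 for i in range(1001)]
best = max(grid, key=lambda t: likelihood(t, X))
assert abs(theta_hat - 0.75) < 1e-9
assert abs(best - theta_hat) < 1e-3  # grid maximum agrees with closed form
```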
26 | - | |
27 | -## MAP: Maximum A-Priori | |
28 | - | |
29 | -> Given a probability **distribution** $d$ and a set of **observations** $X$, find the distribution **parameters** $\theta$ that best explain those observations. | |
30 | - | |
31 | -Given $d, X$, find | |
32 | -$$ | |
33 | -\hat{\theta}_{\text{MAP}}(d, X) = \arg_{\theta}\max P(\theta | X). | |
34 | -$$ | |
35 | - | |
36 | -Using $P(A | B) \propto P(B | A) P(A)$, | |
37 | -$$\hat{\theta}_{\text{MAP}}(d, X) = \arg_{\theta} \max P_d(X | \theta) P(\theta)$$ | |
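For the Bernoulli likelihood with its conjugate Beta prior (an illustrative instance, not from the original notes), the posterior is again a Beta and its mode gives the MAP estimate in closed form; note how the prior regularises the MLE towards $0.5$.

```python
# Sketch: MAP for a Bernoulli likelihood with a conjugate Beta(a, b) prior.
# The posterior is Beta(a + k, b + n - k); its mode is the MAP estimate:
#   theta_MAP = (k + a - 1) / (n + a + b - 2),  valid for a, b > 1.

def map_bernoulli(observations, a=2.0, b=2.0):
    n, k = len(observations), sum(observations)
    return (k + a - 1) / (n + a + b - 2)

X = [1, 0, 1, 1, 0, 1, 1, 1]               # 6 successes in 8 trials; MLE = 0.75
assert abs(map_bernoulli(X) - 0.7) < 1e-9  # pulled towards the prior mode 0.5
```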
38 | - | |
39 | -Variants: | |
40 | -- **Viterbi algorithm:** Find the most likely sequence of hidden states (on HMMs) that results in a sequence of observed events. | |
41 | -- **MPE: Most Probable Explanation** and **Max-sum, Max-product algorithms:** Calculates the marginal distribution for each unobserved node, conditional on any observed nodes; Defines the most likely assignment to all the random variables that is consistent with the given evidence. |
text/00_Z3.md
... | ... | @@ -1,14 +0,0 @@ |
1 | -# Z3 - An SMT solver | |
2 | - | |
3 | -> `Z3` is a theorem prover from Microsoft Research. | |
4 | -> **However, `Potassco` seems more to the point.** | |
5 | - | |
6 | -## Introduction | |
7 | - | |
8 | -An Answer Set Program can be solved translating it into a SAP problem. | |
9 | - | |
10 | -## References | |
11 | - | |
12 | -1. [Programming Z3](https://theory.stanford.edu/~nikolaj/programmingz3.html): <https://theory.stanford.edu/~nikolaj/programmingz3.html>. | |
13 | -2. [Julia Package](https://www.juliapackages.com/p/z3): <https://www.juliapackages.com/p/z3>. | |
14 | -3. [Repository](https://github.com/Z3Prover): <https://github.com/Z3Prover>. | |
15 | 0 | \ No newline at end of file |
No preview for this file type
... | ... | @@ -0,0 +1,318 @@ |
1 | +\documentclass[a4paper, 12pt]{article} | |
2 | + | |
3 | +\usepackage[x11colors]{xcolor} | |
4 | +% | |
5 | +\usepackage{tikz} | |
6 | +\usetikzlibrary{calc} | |
7 | +% | |
8 | +\usepackage{hyperref} | |
9 | +\hypersetup{ | |
10 | + colorlinks=true, | |
11 | + linkcolor=blue, | |
12 | +} | |
13 | +% | |
14 | +\usepackage{commath} | |
15 | +\usepackage{amsthm} | |
16 | +\newtheorem{assumption}{Assumption} | |
17 | +\usepackage{amssymb} | |
18 | +% | |
19 | +% Local commands | |
20 | +% | |
21 | +\newcommand{\note}[1]{\marginpar{\scriptsize #1}} | |
22 | +\newcommand{\naf}{\ensuremath{\sim\!}} | |
23 | +\newcommand{\larr}{\ensuremath{\leftarrow}} | |
24 | +\newcommand{\at}[1]{\ensuremath{\!\del{#1}}} | |
25 | +\newcommand{\co}[1]{\ensuremath{\overline{#1}}} | |
26 | +\newcommand{\fml}[1]{\ensuremath{{\cal #1}}} | |
27 | +\newcommand{\deft}[1]{\textbf{#1}} | |
28 | +\newcommand{\pset}[1]{\ensuremath{\mathbb{P}\at{#1}}} | |
29 | +\newcommand{\ent}{\ensuremath{\lhd}} | |
30 | +\newcommand{\cset}[2]{\ensuremath{\set{#1,~#2}}} | |
31 | +\newcommand{\langof}[1]{\ensuremath{\fml{L}\at{#1}}} | |
32 | +\newcommand{\uset}[1]{\ensuremath{\left|{#1}\right>}} | |
33 | +\newcommand{\lset}[1]{\ensuremath{\left<{#1}\right|}} | |
34 | +\newcommand{\pr}[1]{\ensuremath{\mathrm{P}\at{#1}}} | |
35 | +\newcommand{\given}{\ensuremath{~\middle|~}} | |
36 | + | |
37 | +\title{Zugzwang\\\textit{Logic and Artificial Intelligence}} | |
38 | +\author{Francisco Coelho\\ \texttt{fc@uevora.pt}} | |
39 | + | |
40 | +\begin{document} | |
41 | +\maketitle | |
42 | + | |
43 | +\begin{abstract} | |
44 | + A major limitation of logical representations is the implicit assumption that the Background Knowledge (BK) is perfect. This assumption is problematic if data is noisy, which is often the case. Here we aim to explore how ASP specifications with probabilistic facts can lead to characterizations of probability functions on the specification's domain. | |
45 | +\end{abstract} | |
46 | + | |
47 | +\section{Introduction and Motivation } | |
48 | + | |
49 | +Answer Set Programming (ASP) is a logic programming paradigm based on the Stable Model semantics of Normal Logic Programs (NLP) that can be implemented using the latest advances in SAT solving technology. ASP is a truly declarative language that supports language constructs such as disjunction in the head of a clause, choice rules, and hard and weak constraints. | |
50 | + | |
51 | +The Distribution Semantics (DS) is a key approach to extend logical representations with probabilistic reasoning. Probabilistic Facts (PF) are the most basic stochastic DS primitive; they take the form of logical facts, $a$, labelled with a probability, such as $p_a::a$, and each represents a boolean random variable that is true with probability $p_a$ and false with probability $1 - p_a$. A (consistent) combination of the PFs defines a \textit{total choice} $c = \set{p_a::a, \ldots}$ such that | |
52 | + | |
53 | +\begin{equation} | |
54 | +    \pr{C = c} = \prod_{a\in c} p_a \prod_{a \not\in c} (1 - p_a). | |
55 | + \label{eq:prob.total.choice} | |
56 | +\end{equation} | |
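For instance (an illustrative case, with an extra probabilistic fact beyond the running example), given the probabilistic facts $0.3::a$ and $0.2::d$, the total choice $c = a\co{d}$ selects $a$ true and $d$ false, so equation (\ref{eq:prob.total.choice}) gives
\[
    \pr{C = a\co{d}} = 0.3 \times (1 - 0.2) = 0.24,
\]
and the four total choices $\co{a}\co{d}, \co{a}d, a\co{d}, ad$ have probabilities $0.56, 0.14, 0.24, 0.06$ respectively, summing to $1$.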
57 | + | |
58 | +Our goal is to extend this probability, from total choices, to cover the specification domain. We can foresee two key applications of this extended probability: | |
59 | + | |
60 | +\begin{enumerate} | |
61 | + \item Support any probabilistic reasoning/task on the specification domain. | |
62 | +    \item Also, given a dataset and a divergence measure, the specification can be scored (by the divergence w.r.t.\ the \emph{empirical} distribution of the dataset) and ranked amongst other specifications. This is a key ingredient in algorithms searching, for example, for an optimal specification of a dataset. | |
63 | +\end{enumerate} | |
64 | + | |
65 | +This goal faces a critical problem concerning situations where multiple stable models result from a given total choice, illustrated by the following example. The specification | |
66 | +\begin{equation} | |
67 | + \begin{aligned} | |
68 | + 0.3::a&,\cr | |
69 | + b \vee c& \leftarrow a. | |
70 | + \end{aligned} | |
71 | + \label{eq:example.1} | |
72 | +\end{equation} | |
73 | +has three stable models, $\co{a}, ab$ and $ac$. While it is straightforward to set $P(\co{a})=0.7$, there is \textit{no further information} to assign values to $P(ab)$ and $P(ac)$. At best, we can use a parameter $x$ such that | |
74 | +$$ | |
75 | +\begin{aligned} | |
76 | +P(ab) &= 0.3 x,\cr | |
77 | +P(ac) &= 0.3 (1 - x). | |
78 | +\end{aligned} | |
79 | +$$ | |
80 | + | |
81 | +This uncertainty is inherent in the specification, but can be mitigated with the help of a dataset: the parameter $x$ can be estimated from the empirical distribution. | |
82 | + | |
83 | +In summary, if an ASP specification is intended to describe some observable system then: | |
84 | + | |
85 | +\begin{enumerate} | |
86 | + \item The observations can be used to estimate the value of the parameters (such as $x$ above and others entailed from further clauses). | |
87 | + \item With a probability set for the stable models, we want to extend it to all the samples (\textit{i.e.} consistent sets of literals) of the specification. | |
88 | + \item This extended probability can then be related to the \textit{empirical distribution}, using a probability divergence, such as Kullback-Leibler; and the divergence value used as a \textit{performance} measure of the specification with respect to the observations. | |
89 | + \item If that specification is only but one of many possible candidates then that performance measure can be used, \textit{e.g.} as fitness, by algorithms searching (optimal) specifications of a dataset of observations. | |
90 | +\end{enumerate} | |
91 | + | |
92 | +Currently, we are at step two above: extending a probability function (with parameters such as $x$), defined on the stable models of a specification, to all the events of the specification. This extension must, of course, respect the axioms of probability so that probabilistic reasoning is consistent with the ASP specification. | |
93 | + | |
94 | +\section{Extending Probabilities} | |
95 | + | |
96 | +Given an ASP specification, we consider the \textit{atoms} $a \in \fml{A}$ and \textit{literals} $z \in \fml{L}$, the \textit{events} $e \in \fml{E} \iff e \subseteq \fml{L}$ and \textit{worlds} $w \in \fml{W}$ (consistent events), the \textit{total choices} $c \in \fml{C}$ (each total choice selects exactly one of $a, \neg a$ for every probabilistic fact $a$) and the \textit{stable models} $s \in \fml{S}$. | |
97 | + | |
98 | +% In a statistical setting, the outcomes are the literals $x$, $\neg x$ for each atom $x$, the events express a set of possible outcomes (including $\emptyset$, $\set{a, b}$, $\set{a, \neg a, b}$, \textit{etc.}), and worlds are events with no contradictions. | |
99 | + | |
100 | +Our path, traced by equations (\ref{eq:prob.total.choice}) and (\ref{eq:prob.stablemodel} --- \ref{eq:prob.events}), starts with the probability of total choices, $\pr{C = c}$, expands it to stable models, $\pr{S = s}$, and then to worlds $\pr{W = w}$ and events $\pr{E = e}$. | |
101 | + | |
102 | +\begin{enumerate} | |
103 | + \item \textbf{Total Choices.} This case is given by $\pr{C = c}$, from equation \ref{eq:prob.total.choice}. Each total choice $C = c$ (together with the facts and rules) entails some stable models, $s \in S_c$, and each stable model $S = s$ contains a single total choice $c_s \subseteq s$. | |
104 | + \item \textbf{Stable Models.} Given a stable model $s \in \fml{S}$, and variables/values $x_{s,c} \in \intcc{0, 1}$, | |
105 | + \begin{equation} | |
106 | + \pr{S = s \given C = c} = \begin{cases} | |
107 | + x_{s,c} & \text{if~} s \in S_c,\cr | |
108 | + 0&\text{otherwise} | |
109 | + \end{cases} | |
110 | + \label{eq:prob.stablemodel} | |
111 | + \end{equation} | |
112 | + such that $\sum_{s \in S_c} x_{s,c} = 1$. | |
113 | + \item\label{item:world.cases} \textbf{Worlds.} Each world $W = w$ either: | |
114 | + \begin{enumerate} | |
115 | + \item Is a \textit{stable model}. Then | |
116 | + \begin{equation} | |
117 | + \pr{W = w \given C = c} = \pr{S = s \given C = c}. | |
118 | + \label{eq:world.fold.stablemodel} | |
119 | + \end{equation} | |
120 | + \item \textit{Contains} some stable models. Then | |
121 | + \begin{equation} | |
122 | + \pr{W = w \given C = c} = \prod_{s \subset w}\pr{S = s \given C = c}. | |
123 | + \label{eq:world.fold.superset} | |
124 | + \end{equation} | |
125 | + \item \textit{Is contained} in some stable models. Then | |
126 | + \begin{equation} | |
127 | + \pr{W = w \given C = c} = \sum_{s \supset w}\pr{S = s \given C = c}. | |
128 | + \label{eq:world.fold.subset} | |
129 | + \end{equation} | |
130 | + \item Neither contains nor is contained by a stable model. Then | |
131 | + \begin{equation} | |
132 | + \pr{W = w} = 0. | |
133 | + \label{eq:world.fold.independent} | |
134 | + \end{equation} | |
135 | + \end{enumerate} | |
136 | + \item \textbf{Events.} For each event $E = e$, | |
137 | + \begin{equation} | |
138 | + \pr{E = e \given C = c} = \begin{cases} | |
139 | + \pr{W = e \given C = c} & e \in \fml{W}, \cr | |
140 | + 0 & \text{otherwise}. | |
141 | + \end{cases} | |
142 | + \label{eq:prob.events} | |
143 | + \end{equation} | |
144 | +\end{enumerate} | |
145 | + | |
146 | +Since stable models are minimal, there is no proper chain $s_1 \subset w \subset s_2$, so each world falls into exactly one of the four cases of point \ref{item:world.cases} above. | |
147 | + | |
148 | +% PARAMETERS FOR UNCERTAINTY | |
149 | + | |
150 | +Equation (\ref{eq:prob.stablemodel}) expresses the lack of knowledge about the probability assignment when a single total choice entails more than one stable model. In this case, how should the respective probability be distributed? Our answer consists in assigning an unknown probability, $x_{s,c}$, conditional on the total choice, $c$, to each stable model $s$. This approach allows expressing an unknown quantity that can later be estimated from observed data. | |
151 | + | |
152 | +% STABLE MODEL | |
153 | +The stable model case, in equation (\ref{eq:world.fold.stablemodel}), identifies the probability of a stable model \textit{as a world} with its probability as defined previously in equation (\ref{eq:prob.stablemodel}), as a stable model. | |
154 | + | |
155 | +% SUPERSET | |
156 | +Equation \ref{eq:world.fold.superset} results from conditional independence of the stable models $s \subset w$. Conditional independence of stable models asserts a least-informed strategy that we make explicit: | |
157 | + | |
158 | +\begin{assumption} | |
159 | + Stable models are conditionally independent, given their total choices. | |
160 | +\end{assumption} | |
161 | + | |
162 | +Consider the stable models $ab, ac$ from the example above. They result from the clause $b \vee c \leftarrow a$ and the total choice $a$. These formulas alone impose no relation between $b$ and $c$ (given $a$), so none should be assumed. Dependence relations are further discussed in Subsection (\ref{subsec:dependence}). | |
163 | + | |
164 | +% SUBSET | |
165 | +\hrule | |
166 | + | |
167 | +\bigskip | |
168 | +I'm not sure about what to say here.\marginpar{todo} | |
169 | + | |
170 | +My first guess was | |
171 | +\begin{equation*} | |
172 | + \pr{W = w \given C = c} = \sum_{s \supset w}\pr{S = s \given C = c}. | |
173 | +\end{equation*} | |
174 | + | |
175 | +$\pr{W = w \given C = c}$ already separates $\pr{W}$ into \textbf{disjoint} events! | |
176 | + | |
177 | +Also, I am assuming that stable models are independent. | |
178 | + | |
179 | +This would entail $p(w) = p(s_1) + p(s_2) - p(s_1)p(s_2)$ \textit{if I'm bound to set inclusion}. But I'm not. I'm defining a relation | |
180 | + | |
181 | +Also, if I set $p(w) = p(s_1) + p(s_2)$ and respect the laws of probability, this entails $p(s_1)p(s_2) = 0$. | |
182 | + | |
183 | +So, maybe what I want is (1) to define the cover $\hat{w} = \cup_{s \supset w} s$ | |
184 | + | |
185 | +\begin{equation*} | |
186 | + \pr{W = w \given C = c} = \sum_{s \supset w}\pr{S = s \given C = c} - \pr{W = \hat{w} \given C = c}. | |
187 | +\end{equation*} | |
188 | + | |
189 | +But this doesn't work, because we'd get $\pr{W = a \given C = a} < 1$. | |
190 | +% | |
191 | + | |
192 | +% | |
193 | +\bigskip | |
194 | +\hrule | |
195 | + | |
196 | +% INDEPENDENCE | |
197 | + | |
198 | +A world that neither contains nor is contained in a stable model describes a case that, according to the specification, should never be observed. So the respective probability is set to zero, per equation (\ref{eq:world.fold.independent}). | |
199 | +% | |
200 | +% ================================================================ | |
201 | +% | |
202 | +\subsection{Dependence} | |
203 | +\label{subsec:dependence} | |
204 | + | |
205 | +Dependence relations in the underlying system can be explicitly expressed in the specification. | |
206 | + | |
207 | +For example, adding $b \leftarrow c \wedge d$, where $d$ is an atomic choice, explicitly expresses a dependence of $b$ on $c$. One would get, for example, the specification | |
208 | +$$ | |
209 | +0.3::a, b \vee c \leftarrow a, 0.2::d, b \leftarrow c \wedge d. | |
210 | +$$ | |
211 | +with the stable models | |
212 | +$ | |
213 | +\co{ad}, \co{a}d, a\co{d}b, a\co{d}c, adb | |
214 | +$. | |
215 | + | |
216 | + | |
217 | +The interesting case is the subtree of the total choice $ad$. Notice that no stable model $s$ contains $adc$ because (1) $adb$ is a stable model and (2) if $adc \subset s$ then $b \in s$ so $adb \subset s$, contradicting the minimality of stable models. | |
218 | + | |
219 | +Following equations (\ref{eq:world.fold.stablemodel}) and (\ref{eq:world.fold.independent}) this sets | |
220 | +\begin{equation*} | |
221 | + \begin{cases} | |
222 | + \pr{W = adc \given C = ad} = 0,\cr | |
223 | + \pr{W = adb \given C = ad} = 1 | |
224 | + \end{cases} | |
225 | +\end{equation*} | |
226 | +which concentrates all probability mass from the total choice $ad$ in the $adb$ branch, including the node $W = adbc$. This leads to the following cases: | |
227 | +$$ | |
228 | +\begin{array}{l|r} | |
229 | + x & \pr{W = x \given C = ad}\\ | |
230 | + \hline | |
231 | + ad & 1 \\ | |
232 | + adb & 1\\ | |
233 | + adc & 0\\ | |
234 | + adbc & 1 | |
235 | +\end{array} | |
236 | +$$ | |
237 | +so, for $C = ad$, | |
238 | +$$ | |
239 | +\begin{aligned} | |
240 | + \pr{W = b} &= \frac{2}{4} \cr | |
241 | + \pr{W = c} &= \frac{1}{4} \cr | |
242 | + \pr{W = bc} &= \frac{1}{4} \cr | |
243 | + &\not= \pr{W = b}\pr{W = c} | |
244 | +\end{aligned} | |
245 | +$$ | |
246 | +\textit{i.e.}, the events $W = b$ and $W = c$ are dependent, and that dependence results directly from the segment $0.2::d, b \leftarrow c \wedge d$ in the specification. | |
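The arithmetic above can be checked mechanically. Below is a small Python sketch (the encoding is ours: worlds as strings of atoms, the probability column of the previous table as a dict, normalising by the four nodes as in the text):

```python
from fractions import Fraction

# Probability mass P(W = x | C = ad) over the four worlds of the table above.
mass = {"ad": 1, "adb": 1, "adc": 0, "adbc": 1}

def p(*atoms):
    """Mass of the worlds containing all the given atoms, over the 4 nodes."""
    hits = sum(m for w, m in mass.items() if all(a in w for a in atoms))
    return Fraction(hits, len(mass))

assert p("b") == Fraction(2, 4)
assert p("c") == Fraction(1, 4)
assert p("b", "c") == Fraction(1, 4)
assert p("b", "c") != p("b") * p("c")  # W = b and W = c are dependent
```

The last assertion is the dependence claimed in the text: $1/4 \neq 1/2 \times 1/4$.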
247 | + | |
248 | + | |
249 | +% | |
250 | + | |
251 | +% | |
252 | +\hrule | |
253 | +\begin{quotation}\note{Todo} | |
254 | + | |
255 | + Prove the four world cases (done), support the product (done) and sum (tbd) options, with the independence assumptions. | |
256 | +\end{quotation} | |
257 | + | |
258 | +\section{Developed Example} | |
259 | + | |
260 | +We continue with the specification from equation \ref{eq:example.1}. | |
261 | + | |
262 | +\textbf{Step 1: Total Choices.} The total choices, and respective stable models, are | |
263 | +\begin{center} | |
264 | + \begin{tabular}{l|r|r} | |
265 | + Total Choice ($c$) & $\pr{C = c}$ & Stable Models ($s$)\\ | |
266 | + \hline | |
267 | + $a$ & $0.3$ & $ab$ and $ac$.\\ | |
268 | + $\co{a} = \neg a$ & $\co{0.3} = 0.7$ & $\co{a}$. | |
269 | + \end{tabular} | |
270 | +\end{center} | |
271 | + | |
272 | +\textbf{Step 2: Stable Models.} Suppose now that | |
273 | +\begin{center} | |
274 | + \begin{tabular}{l|c|r} | |
275 | + Stable Model ($s$) & $\pr{S = s \given C = c}$ & Total Choice ($c$)\\ | |
276 | + \hline | |
277 | + $\co{a}$ & $1.0$ & $\co{a}$. \\ | |
278 | + $ab$ & $0.8$ & $a$. \\ | |
279 | + $ac$ & $0.2 = \co{0.8}$ & $a$. | |
280 | + \end{tabular} | |
281 | +\end{center} | |
282 | + | |
283 | +\textbf{Step 3: Worlds.} Following equations (\ref{eq:world.fold.stablemodel})--(\ref{eq:world.fold.independent}) we get: | |
284 | +\begin{center} | |
285 | + \begin{tabular}{l|c|l|c|r} | |
286 | + World ($w$) & S.M. ($s$) & Relation & T.C. ($c$) & $\pr{W = w}$\\ | |
287 | + \hline | |
288 | + $\emptyset$ & all & contained & $a$, $\co{a}$ & $1.0$ \\ | |
289 | + $a$ & $ab$, $ac$ & contained & $a$ & $0.8\times 0.3 + 0.2\times 0.3 = 0.3$ \\ | |
290 | + $b$ & $ab$ & contained & $a$ & $0.8\times 0.3 = 0.24$ \\ | |
291 | + $c$ & $ac$ & contained & $a$ & $0.2\times 0.3 = 0.06$ \\ | |
292 | + $\co{a}$ & $\co{a}$ & stable model & $\co{a}$ & $1.0\times 0.7 = 0.7$ \\ | |
293 | + $\co{b}$ & none & independent & none & $0.0$ \\ | |
294 | + $\co{c}$ & none & \ldots & & \\ | |
295 | + $ab$ & $ab$ & stable model & $a$ & $0.24$ \\ | |
296 | + $ac$ & $ac$ & stable model & $a$ & $0.06$ \\ | |
297 | + $a\co{b}$ & none & \ldots & & \\ | |
298 | + $a\co{c}$ & none & \ldots & & \\ | |
299 | + $\co{a}b$ & $\co{a}$ & contains & $\co{a}$ & $1.0\times 0.7 = 0.7$ \\ | |
300 | + $\co{a}c$ & $\co{a}$ & \ldots & & \\ | |
301 | + $\co{a}\co{b}$ & $\co{a}$ & \ldots & & \\ | |
302 | + $\co{a}\co{c}$ & $\co{a}$ & \ldots & & \\ | |
303 | + $abc$ & $ab$, $ac$ & contains & $a$ & $0.8\times 0.2\times 0.3 = 0.048$ \\ | |
304 | + \end{tabular} | |
305 | +\end{center} | |
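The table above can be reproduced mechanically. Here is a minimal Python sketch of the four world cases of equations (\ref{eq:world.fold.stablemodel})--(\ref{eq:world.fold.independent}); the encoding (literals as strings, worlds as frozensets, the names \texttt{p\_choice}, \texttt{p\_stable}, \texttt{p\_world}) is ours, with the parameters of Step 2:

```python
# Total choices, stable models and worlds encoded as frozensets of literals.
# Parameters from Step 2: P(S=ab|C=a) = 0.8, P(S=ac|C=a) = 0.2.
p_choice = {frozenset({"a"}): 0.3, frozenset({"-a"}): 0.7}
p_stable = {  # (stable model, total choice) -> conditional probability
    (frozenset({"-a"}), frozenset({"-a"})): 1.0,
    (frozenset({"a", "b"}), frozenset({"a"})): 0.8,
    (frozenset({"a", "c"}), frozenset({"a"})): 0.2,
}

def p_world(w, c):
    """P(W = w | C = c): fold w into exactly one of the four cases."""
    models = [s for (s, cc) in p_stable if cc == c]
    if w in models:                                 # stable model case
        return p_stable[(w, c)]
    if any(w < s for s in models):                  # contained in stable models: sum
        return sum(p_stable[(s, c)] for s in models if w < s)
    if any(s < w for s in models):                  # contains stable models: product
        prob = 1.0
        for s in models:
            if s < w:
                prob *= p_stable[(s, c)]
        return prob
    return 0.0                                      # neither: impossible world

a, ab, ac = frozenset("a"), frozenset("ab"), frozenset("ac")
print(round(p_world(ab, a) * p_choice[a], 6))   # 0.24, as in the table
print(round(p_world(ac, a) * p_choice[a], 6))   # 0.06
print(round(p_world(frozenset("abc"), a) * p_choice[a], 6))
```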
306 | + | |
307 | +\section*{References} | |
308 | + | |
309 | + | |
310 | +\begin{enumerate} | |
311 | + \item Victor Verreet, Vincent Derkinderen, Pedro Zuidberg Dos Martires, Luc De Raedt, Inference and Learning with Model Uncertainty in Probabilistic Logic Programs (2022) | |
312 | + \item Andrew Cropper, Sebastijan Dumancic, Richard Evans, Stephen H. Muggleton, Inductive logic programming at 30 (2021) | |
313 | + \item Fabio Gagliardi Cozman, Denis Deratani Mauá, The joy of Probabilistic Answer Set Programming: Semantics - complexity, expressivity, inference (2020) | |
314 | + \item Fabrizio Riguzzi, Foundations of Probabilistic Logic Programming Languages, Semantics, Inference and Learning. Rivers Publishers (2018) | |
315 | + \item Martin Gebser, Roland Kaminski, Benjamin Kaufmann, and Torsten Schaub, Answer Set Solving in Practice, Morgan \& Claypool Publishers (2013) | |
316 | +\end{enumerate} | |
317 | + | |
318 | +\end{document} | |
0 | 319 | \ No newline at end of file | ... | ... |
text/pre-paper.pdf
No preview for this file type
text/pre-paper.tex
... | ... | @@ -1,318 +0,0 @@ |
1 | -\documentclass[a4paper, 12pt]{article} | |
2 | - | |
3 | -\usepackage[x11colors]{xcolor} | |
4 | -% | |
5 | -\usepackage{tikz} | |
6 | -\usetikzlibrary{calc} | |
7 | -% | |
8 | -\usepackage{hyperref} | |
9 | -\hypersetup{ | |
10 | - colorlinks=true, | |
11 | - linkcolor=blue, | |
12 | -} | |
13 | -% | |
14 | -\usepackage{commath} | |
15 | -\usepackage{amsthm} | |
16 | -\newtheorem{assumption}{Assumption} | |
17 | -\usepackage{amssymb} | |
18 | -% | |
19 | -% Local commands | |
20 | -% | |
21 | -\newcommand{\note}[1]{\marginpar{\scriptsize #1}} | |
22 | -\newcommand{\naf}{\ensuremath{\sim\!}} | |
23 | -\newcommand{\larr}{\ensuremath{\leftarrow}} | |
24 | -\newcommand{\at}[1]{\ensuremath{\!\del{#1}}} | |
25 | -\newcommand{\co}[1]{\ensuremath{\overline{#1}}} | |
26 | -\newcommand{\fml}[1]{\ensuremath{{\cal #1}}} | |
27 | -\newcommand{\deft}[1]{\textbf{#1}} | |
28 | -\newcommand{\pset}[1]{\ensuremath{\mathbb{P}\at{#1}}} | |
29 | -\newcommand{\ent}{\ensuremath{\lhd}} | |
30 | -\newcommand{\cset}[2]{\ensuremath{\set{#1,~#2}}} | |
31 | -\newcommand{\langof}[1]{\ensuremath{\fml{L}\at{#1}}} | |
32 | -\newcommand{\uset}[1]{\ensuremath{\left|{#1}\right>}} | |
33 | -\newcommand{\lset}[1]{\ensuremath{\left<{#1}\right|}} | |
34 | -\newcommand{\pr}[1]{\ensuremath{\mathrm{P}\at{#1}}} | |
35 | -\newcommand{\given}{\ensuremath{~\middle|~}} | |
36 | - | |
37 | -\title{Zugzwang\\\textit{Logic and Artificial Intelligence}} | |
38 | -\author{Francisco Coelho\\ \texttt{fc@uevora.pt}} | |
39 | - | |
40 | -\begin{document} | |
41 | -\maketitle | |
42 | - | |
43 | -\begin{abstract} | |
44 | - A major limitation of logical representations is the implicit assumption that the Background Knowledge (BK) is perfect. This assumption is problematic if data is noisy, which is often the case. Here we aim to explore how ASP specifications with probabilistic facts can lead to characterizations of probability functions on the specification's domain. | |
45 | -\end{abstract} | |
46 | - | |
47 | -\section{Introduction and Motivation } | |
48 | - | |
49 | -Answer Set Programming (ASP) is a logic programming paradigm based on the Stable Model semantics of Normal Logic Programs (NP) that can be implemented using the latest advances in SAT solving technology. ASP is a truly declarative language that supports language constructs such as disjunction in the head of a clause, choice rules, and hard and weak constraints. | |
50 | - | |
51 | -The Distribution Semantics (DS) is a key approach to extend logical representations with probabilistic reasoning. Probabilistic Facts (PF) are the most basic stochastic DS primitive and they take the form of logical facts, $a$, labelled with a probability, such as $p::a$; Each probabilistic fact represents a boolean random variable that is true with probability $p$ and false with probability $1 - p$. A (consistent) combination of the PFs defines a \textit{total choice} $c = \set{p::a, \ldots}$ such that | |
52 | - | |
53 | -\begin{equation} | |
54 | - \pr{C = x} = \prod_{a\in c} p \prod_{a \not\in c} (1- p). | |
55 | - \label{eq:prob.total.choice} | |
56 | -\end{equation} | |
57 | - | |
58 | -Our goal is to extend this probability, from total choices, to cover the specification domain. We can foresee two key applications of this extended probability: | |
59 | - | |
60 | -\begin{enumerate} | |
61 | - \item Support any probabilistic reasoning/task on the specification domain. | |
62 | - \item Also, given a dataset and a divergence measure, now the specification can be scored (by the divergence w.r.t.\ the \emph{empiric} distribution of the dataset), and sorted amongst other specifications. This is a key ingredient in algorithms searching, for example, an optimal specification of the dataset. | |
63 | -\end{enumerate} | |
64 | - | |
65 | -This goal faces a critical problem concerning situations where multiple standard models result from a given total choice, illustrated by the following example. The specification | |
66 | -\begin{equation} | |
67 | - \begin{aligned} | |
68 | - 0.3::a&,\cr | |
69 | - b \vee c& \leftarrow a. | |
70 | - \end{aligned} | |
71 | - \label{eq:example.1} | |
72 | -\end{equation} | |
73 | -has three stable models, $\co{a}, ab$ and $ac$. While it is straightforward to set $P(\co{a})=0.7$, there is \textit{no further information} to assign values to $P(ab)$ and $P(ac)$. At best, we can use a parameter $x$ such that | |
74 | -$$ | |
75 | -\begin{aligned} | |
76 | -P(ab) &= 0.3 x,\cr | |
77 | -P(ac) &= 0.3 (1 - x). | |
78 | -\end{aligned} | |
79 | -$$ | |
80 | - | |
81 | -This uncertainty in inherent to the specification, but can be mitigated with the help of a dataset: the parameter $x$ can be estimated from the empirical distribution. | |
82 | - | |
83 | -In summary, if an ASP specification is intended to describe some observable system then: | |
84 | - | |
85 | -\begin{enumerate} | |
86 | - \item The observations can be used to estimate the value of the parameters (such as $x$ above and others entailed from further clauses). | |
87 | - \item With a probability set for the stable models, we want to extend it to all the samples (\textit{i.e.} consistent sets of literals) of the specification. | |
88 | - \item This extended probability can then be related to the \textit{empirical distribution}, using a probability divergence, such as Kullback-Leibler; and the divergence value used as a \textit{performance} measure of the specification with respect to the observations. | |
89 | - \item If that specification is only but one of many possible candidates then that performance measure can be used, \textit{e.g.} as fitness, by algorithms searching (optimal) specifications of a dataset of observations. | |
90 | -\end{enumerate} | |
91 | - | |
92 | -Currently, we are on the step two above: Extending a probability function (with parameters such as $x$), defined on the stable sets of a specification, to all the events of the specification. This extension must, of course, respect the axioms of probability so that probabilistic reasoning is consistent with the ASP specification. | |
93 | - | |
94 | -\section{Extending Probabilities} | |
95 | - | |
96 | -Given an ASP specification, we consider the \textit{atoms} $a \in \fml{A}$ and \textit{literals}, $z \in \fml{L}$, \textit{events} $e \in \fml{E} \iff e \subseteq \fml{L}$ and \textit{worlds} $w \in \fml{W}$ (consistent events), \textit{total choices} $c \in \fml{C} \iff c = a \vee \neg a$ and \textit{stable models} $s \in \fml{S}$. | |
97 | - | |
98 | -% In a statistical setting, the outcomes are the literals $x$, $\neg x$ for each atom $x$, the events express a set of possible outcomes (including $\emptyset$, $\set{a, b}$, $\set{a, \neg a, b}$, \textit{etc.}), and worlds are events with no contradictions. | |
99 | - | |
100 | -Our path, traced by equations (\ref{eq:prob.total.choice}) and (\ref{eq:prob.stablemodel} --- \ref{eq:prob.events}), starts with the probability of total choices, $\pr{C = c}$, expands it to stable models, $\pr{S = s}$, and then to worlds $\pr{W = w}$ and events $\pr{E = e}$. | |
101 | - | |
102 | -\begin{enumerate} | |
103 | - \item \textbf{Total Choices.} This case is given by $\pr{C = c}$, from equation \ref{eq:prob.total.choice}. Each total choice $C = c$ (together with the facts and rules) entails some stable models, $s \in S_c$, and each stable model $S = s$ contains a single total choice $c_s \subseteq s$. | |
104 | - \item \textbf{Stable Models.} Given a stable model $s \in \fml{S}$, and variables/values $x_{s,c} \in \intcc{0, 1}$, | |
105 | - \begin{equation} | |
106 | - \pr{S = s \given C = c} = \begin{cases} | |
107 | - x_{s,c} & \text{if~} s \in S_c,\cr | |
108 | - 0&\text{otherwise} | |
109 | - \end{cases} | |
110 | - \label{eq:prob.stablemodel} | |
111 | - \end{equation} | |
112 | - such that $\sum_{s \in S_c} x_{s,c} = 1$. | |
113 | - \item\label{item:world.cases} \textbf{Worlds.} Each world $W = w$ either: | |
114 | - \begin{enumerate} | |
115 | - \item Is a \textit{stable model}. Then | |
116 | - \begin{equation} | |
117 | - \pr{W = w \given C = c} = \pr{S = s \given C = c}. | |
118 | - \label{eq:world.fold.stablemodel} | |
119 | - \end{equation} | |
120 | - \item \textit{Contains} some stable models. Then | |
121 | - \begin{equation} | |
122 | - \pr{W = w \given C = c} = \prod_{s \subset w}\pr{S = s \given C = c}. | |
123 | - \label{eq:world.fold.superset} | |
124 | - \end{equation} | |
125 | - \item \textit{Is contained} in some stable models. Then | |
126 | - \begin{equation} | |
127 | - \pr{W = w \given C = c} = \sum_{s \supset w}\pr{S = s \given C = c}. | |
128 | - \label{eq:world.fold.subset} | |
129 | - \end{equation} | |
130 | - \item Neither contains nor is contained by a stable model. Then | |
131 | - \begin{equation} | |
132 | - \pr{W = w} = 0. | |
133 | - \label{eq:world.fold.independent} | |
134 | - \end{equation} | |
135 | - \end{enumerate} | |
136 | - \item \textbf{Events.} For each event $E = e$, | |
137 | - \begin{equation} | |
138 | - \pr{E = e \given C = c} = \begin{cases} | |
139 | - \pr{W = e \given C = c} & e \in \fml{W}, \cr | |
140 | - 0 & \text{otherwise}. | |
141 | - \end{cases} | |
142 | - \label{eq:prob.events} | |
143 | - \end{equation} | |
144 | -\end{enumerate} | |
145 | - | |
146 | -Since stable model are minimal, there is no proper chain $s_1 \subset w \subset s_2$ so each world folds into exactly one ot the four cases of point \ref{item:world.cases} above. | |
147 | - | |
148 | -% PARAMETERS FOR UNCERTAINTY | |
149 | - | |
150 | -Equation (\ref{eq:prob.stablemodel}) expresses the lack of knowledge about the probability assignment when a single total choice entails more than one stable model. In this case, how to distribute the respective probability? Our answer to this problem consists in assigning an unknown probability, $x_{s,c}$, conditional on the total choice, $c$, to each stable model $s$. This approach allow the expression of an unknown quantity and future estimation, given observed data. | |
151 | - | |
152 | -% STABLE MODEL | |
153 | -The stable model case, in equation (\ref{eq:world.fold.stablemodel}), identifies the probability of a stable model \textit{as a world} with its probability as defined previously in equation (\ref{eq:prob.stablemodel}), as a stable model. | |
154 | - | |
155 | -% SUPERSET | |
156 | -Equation \ref{eq:world.fold.superset} results from conditional independence of the stable models $s \subset w$. Conditional independence of stable worlds asserts a least informed strategy that we make explicit: | |
157 | - | |
158 | -\begin{assumption} | |
159 | - Stable models are conditionally independent, given their total choices. | |
160 | -\end{assumption} | |
161 | - | |
162 | -Consider the stable models $ab, ac$ from the example above. They result from the clause $b \vee c \leftarrow a$ and the total choice $a$. These formulas alone impose no relation between $b$ and $c$ (given $a$), so none should be assumed. Dependence relations are further discussed in Subsection (\ref{subsec:dependence}). | |
163 | - | |
164 | -% SUBSET | |
165 | -\hrule | |
166 | - | |
167 | -\bigskip | |
168 | -I'm not sure about what to say here.\marginpar{todo} | |
169 | - | |
170 | -My first guess was | |
171 | -\begin{equation*} | |
172 | - \pr{W = w \given C = c} = \sum_{s \supset w}\pr{S = s \given C = c}. | |
173 | -\end{equation*} | |
174 | - | |
175 | -$\pr{W = w \given C = c}$ already separates $\pr{W}$ into \textbf{disjoint} events! | |
176 | - | |
177 | -Also, I am assuming that stable models are independent. | |
178 | - | |
179 | -This would entail $p(w) = p(s_1) + p(s_2) - p(s_1)p(s_2)$ \textit{if I'm bound to set inclusion}. But I'm not. I'm defining a relation | |
180 | - | |
181 | -Also, if I set $p(w) = p(s_1) + p(s_2)$ and respect the laws of probability, this entails $p(s_1)p(s_2) = 0$. | |
182 | - | |
183 | -So, maybe what I want is (1) to define the cover $\hat{w} = \cup_{s \supset w} s$ | |
184 | - | |
185 | -\begin{equation*} | |
186 | - \pr{W = w \given C = c} = \sum_{s \supset w}\pr{S = s \given C = c} - \pr{W = \hat{w} \given C = c}. | |
187 | -\end{equation*} | |
188 | - | |
189 | -But this doesn't works, because we'd get $\pr{W = a \given C = a} < 1$. | |
190 | -% | |
191 | - | |
192 | -% | |
193 | -\bigskip | |
194 | -\hrule | |
195 | - | |
196 | -% INDEPENDENCE | |
197 | - | |
198 | -A world that neither contains nor is contained in a stable model describes a case that, according to the specification, should never be observed. So the respective probability is set to zero, per equation (\ref{eq:world.fold.independent}). | |
199 | -% | |
200 | -% ================================================================ | |
201 | -% | |
202 | -\subsection{Dependence} | |
203 | -\label{subsec:dependence} | |
204 | - | |
205 | -Dependence relations in the underlying system can be explicitly expressed in the specification. | |
206 | - | |
207 | -For example, $b \leftarrow c \wedge d$, where $d$ is an atomic choice, explicitly expressing this dependence between $b$ and $c$. One would get, for example, the specification | |
208 | -$$ | |
209 | -0.3::a, b \vee c \leftarrow a, 0.2::d, b \leftarrow c \wedge d. | |
210 | -$$ | |
211 | -with the stable models | |
212 | -$ | |
213 | -\co{ad}, \co{a}d, a\co{d}b, a\co{d}c, adb | |
214 | -$. | |
215 | - | |
216 | - | |
217 | -The interesting case is the subtree of the total choice $ad$. Notice that no stable model $s$ contains $adc$ because (1) $adb$ is a stable model and (2) if $adc \subset s$ then $b \in s$ so $adb \subset s$. | |
218 | - | |
219 | -Following equations (\ref{eq:world.fold.stablemodel}) and (\ref{eq:world.fold.independent}) this sets | |
220 | -\begin{equation*} | |
221 | - \begin{cases} | |
222 | - \pr{W = adc \given C = ad} = 0,\cr | |
223 | - \pr{W = adb \given C = ad} = 1 | |
224 | - \end{cases} | |
225 | -\end{equation*} | |
226 | -which concentrates all probability mass from the total choice $ad$ in the $adb$ branch, including the node $W = adbc$. This leads to the following cases: | |
227 | -$$ | |
228 | -\begin{array}{l|r} | |
229 | - x & \pr{W = x \given C = ad}\\ | |
230 | - \hline | |
231 | - ad & 1 \\ | |
232 | - adb & 1\\ | |
233 | - adc & 0\\ | |
234 | - adbc & 1 | |
235 | -\end{array} | |
236 | -$$ | |
237 | -so, for $C = ad$, | |
238 | -$$ | |
239 | -\begin{aligned} | |
240 | - \pr{W = b} &= \frac{2}{4} \cr | |
241 | - \pr{W = c} &= \frac{1}{4} \cr | |
242 | - \pr{W = bc} &= \frac{1}{4} \cr | |
243 | - &\not= \pr{W = b}\pr{W = c} | |
244 | -\end{aligned} | |
245 | -$$ | |
246 | -\textit{i.e.} the events $W = b$ and $W = c$ are dependent and that dependence results directly from the segment $0.2::d, b \leftarrow c \wedge d$ in the specification. | |
247 | - | |
248 | - | |
249 | -% | |
250 | - | |
251 | -% | |
252 | -\hrule | |
253 | -\begin{quotation}\note{Todo} | |
254 | - | |
255 | - Prove the four world cases (done), support the product (done) and sum (tbd) options, with the independence assumptions. | |
256 | -\end{quotation} | |
257 | - | |
258 | -\section{Developed Example} | |
259 | - | |
260 | -We continue with the specification from equation \ref{eq:example.1}. | |
261 | - | |
262 | -\textbf{Step 1: Total Choices.} The total choices, and respective stable models, are | |
263 | -\begin{center} | |
264 | - \begin{tabular}{l|r|r} | |
265 | - Total Choice ($c$) & $\pr{C = c}$ & Stable Models ($s$)\\ | |
266 | - \hline | |
267 | - $a$ & $0.3$ & $ab$ and $ac$.\\ | |
268 | - $\co{a} = \neg a$ & $\co{0.3} = 0.7$ & $\co{a}$. | |
269 | - \end{tabular} | |
270 | -\end{center} | |
271 | - | |
272 | -\textbf{Step 2: Stable Models.} Suppose now that | |
273 | -\begin{center} | |
274 | - \begin{tabular}{l|c|r} | |
275 | - Stable Models ($s$) & Total Choice ($c$) & $\pr{S = c \given C = c}$\\ | |
276 | - \hline | |
277 | - $\co{a}$ & $1.0$ & $\co{a}$. \\ | |
278 | - $ab$ & $0.8$ & $a$. \\ | |
279 | - $ac$ & $0.2 = \co{0.8}$ & $a$. | |
280 | - \end{tabular} | |
281 | -\end{center} | |
282 | - | |
283 | -\textbf{Step 3: Worlds.} Following equations \ref{eq:world.fold.stablemodel} --- \ref{eq:world.fold.independent} we get: | |
284 | -\begin{center} | |
285 | - \begin{tabular}{l|c|l|c|r} | |
286 | - Occ. ($o$) & S.M. ($s$) & Relation & T.C. ($c$) & $\pr{W = w}$\\ | |
287 | - \hline | |
288 | - $\emptyset$ & all & contained & $a$, $\co{a}$ & $1.0$ \\ | |
289 | - $a$ & $ab$, $ac$ & contained & $a$ & $0.8\times 0.3 + 0.2\times 0.3 = 0.3$ \\ | |
290 | - $b$ & $ab$ & contained & $a$ & $0.8\times 0.3 = 0.24$ \\ | |
291 | - $c$ & $ac$ & contained & $a$ & $0.2\times 0.3 = 0.06$ \\ | |
292 | - $\co{a}$ & $\co{a}$ & stable model & $\co{a}$ & $1.0\times 0.3 = 0.3$ \\ | |
293 | - $\co{b}$ & none & independent & none & $0.0$ \\ | |
294 | - $\co{c}$ & none & \ldots & & \\ | |
295 | - $ab$ & $ab$ & stable model & $a$ & $0.24$ \\ | |
296 | - $ac$ & $ac$ & stable model & $a$ & $0.06$ \\ | |
297 | - $a\co{b}$ & none & \ldots & & \\ | |
298 | - $a\co{c}$ & none & \ldots & & \\ | |
299 | - $\co{a}b$ & $\co{a}$ & contains & $\co{a}$ & $1.0$ \\ | |
300 | - $\co{a}c$ & $\co{a}$ & \ldots & & \\ | |
301 | - $\co{a}\co{b}$ & $\co{a}$ & \ldots & & \\ | |
302 | - $\co{a}\co{c}$ & $\co{a}$ & \ldots & & \\ | |
303 | - $abc$ & $ab$, $ac$ & contains & $a$ & $0.8\times 0.2 = 0.016$ \\ | |
304 | - \end{tabular} | |
305 | -\end{center} | |
306 | - | |
307 | -\section*{References} | |
308 | - | |
309 | - | |
310 | -\begin{enumerate} | |
311 | - \item Victor Verreet, Vincent Derkinderen, Pedro Zuidberg Dos Martires, Luc De Raedt, Inference and Learning with Model Uncertainty in Probabilistic Logic Programs (2022) | |
312 | - \item Andrew Cropper, Sebastijan Dumancic, Richard Evans, Stephen H. Muggleton, Inductive logic programming at 30 (2021) | |
313 | - \item Fabio Gagliardi Cozman, Denis Deratani Mauá, The joy of Probabilistic Answer Set Programming: Semantics - complexity, expressivity, inference (2020) | |
314 | - \item Fabrizio Riguzzi, Foundations of Probabilistic Logic Programming Languages, Semantics, Inference and Learning. Rivers Publishers (2018) | |
315 | - \item Martin Gebser, Roland Kaminski, Benjamin Kaufmann, and Torsten Schaub, Answer Set Solving in Practice, Morgan \& Claypool Publishers (2013) | |
316 | -\end{enumerate} | |
317 | - | |
318 | -\end{document} | |
319 | 0 | \ No newline at end of file |
... | ... | @@ -0,0 +1,239 @@ |
1 | +# Answer Set Programming | |
2 | + | |
3 | +> **Answer set programming (ASP) is a form of declarative programming oriented towards difficult (primarily NP-hard) search problems.** | |
4 | +> | |
5 | +> It is **based on the stable model (answer set) semantics** of logic programming. In ASP, search problems are reduced to computing stable models, and answer set solvers ---programs for generating stable models--- are used to perform search. | |
6 | + | |
7 | +--- | |
8 | + | |
9 | +**ASP** "programs" generate "deduction-minimal" models, _aka_ **stable models** or **answer sets**. | |
10 | +- Given an ASP program $P$, a model $X$ of $P$ is a set of atoms that satisfies all the rules of $P$. | |
11 | +- In a "deduction-minimal" model $X$ each element $x \in X$ has a proof using $P$. Non-minimal models have elements without a proof. | |
12 | + | |
13 | +## Key Questions | |
14 | + | |
15 | +1. What is the relation between ASP and Prolog? | |
16 | + 1. **Prolog** performs **top-down query evaluation**. Solutions are extracted from the instantiation of variables of successful queries. | |
17 | + 2. **ASP** proceeds in two steps: first, **grounding** generates a (finite) _propositional representation of the program_; second, **solving** computes the _stable models_ of that representation. | |
18 | +2. What are the roles of **grounding** with `gringo` and **solving** with `clasp`? | |
19 | +3. Can ASP be used for **pLP**? | |
20 | + 1. What are the key probabilistic tasks/questions/problems? | |
21 | + 2. Where does distribution semantics enter? What about **pILP**? | |
22 | +4. Can the probabilistic task control the grounding (`gringo`) or solving (`clasp`) steps in ASP? | |
23 | +5. Can ASP replace kanren? | |
24 | + 1. As much as ASP can replace Prolog. | |
25 | + | |
26 | +## Formal Foundations | |
27 | + | |
28 | +### Common Concepts and Notation | |
29 | + | |
30 | +context | true, false | if | and | or | iff | default negation | classical negation | |
31 | +--------|-------------|----|-----|----|-----|------------------|------------------- | |
32 | +source | | `:-` | `,` | `\|` | | `not` | `-` | |
33 | +logic prog. | | ← | , | ; | | ∼ | ¬ | |
34 | +formula | ⊤, ⊥ | → | ∧ | ∨ | ↔ | ∼ | ¬ | |
35 | + | |
36 | +> - **default negation** or **negation as failure (naf)**, `not a` ($\sim a$), means "_no information about `a`_". | |
37 | +> - **classical negation** or **strong negation**, `-a` ($\neg a$), means "_positive information about `-a`_", i.e. "_negative information about `a`_". Likewise `a`: "_positive information about `a`_". | |
38 | +> - The symbol `not` ($\sim$), is a new logical connective; `not a` ($\sim a$) is often read as "_it is not believed that `a` is true_" or "_there is no proof of `a`_". Note that this does not imply that `a` is believed to be false. | |
39 | + | |
40 | +- **Interpretation.** A _boolean_ interpretation is a function from ground atoms to **⊤** and **⊥**. It is represented by the atoms mapped to **⊤**. | |
41 | + - if u, v are two interpretations, **u ≤ v** iff u ⊆ v under this representation. | |
42 | + - **partial interpretations** are represented by ({true atoms}, {false atoms}), leaving the undefined atoms implicit. | |
43 | + - an **ordered boolean assignment** $a$ over $dom(a)$ is represented by a sequence $a = (V_ix_i | i \in 1:n)$ where $V_i$ is either $\top$ or $\bot$ and each $x_i\in dom(a)$. | |
44 | + - $a^\top \subseteq a$ is the subsequence of entries $\top x \in a$; $a^\bot \subseteq a$ the subsequence of entries $\bot x \in a$. | |
45 | + - An ordered assignment $(a^\top, a^\bot)$ is a partial boolean interpretation. | |
46 | +- Subsets have a partial order for the $\subset$ relation; remember maximal and minimal elements. | |
47 | +- Directed graphs; Path; **Strongly connected** iff all vertex pairs (a,b) are connected; The **strongly connected components** are the maximal strongly connected subgraphs. | |
48 | + | |
49 | +### Basic ASP syntax and semantics | |
50 | + | |
51 | +- A **definite clause** is, by definition, $a_0 \vee \neg a_1 \vee \cdots \vee \neg a_n$, a disjunction with exactly one positive atom. | |
52 | + - Also denoted $a_0 \leftarrow a_1 \wedge \cdots \wedge a_n$. | |
53 | + - **A set of definite clauses has exactly one smallest model.** | |
54 | +- A **Horn clause** has at most one positive atom. | |
55 | + - A Horn clause without a positive atom is an _integrity constraint_: _a conjunction that **can't** hold_. | |
56 | + - **A set of Horn clauses has one or zero smallest models.** | |
57 | +- If $P$ is a **positive program**: | |
58 | + - A set $X$ is **closed** under $P$ if, for each rule $r \in P$, $head(r) \in X$ whenever $body^+(r) \subseteq X$. | |
59 | + - $Cn(P)$ is, by definition, the set of **consequences of $P$**. | |
60 | + - $Cn(P)$ is the smallest set closed under $P$. | |
61 | + - $Cn(P)$ is the $\subseteq$-smallest model of $P$. | |
62 | + - The **stable model** of $P$ is, by definition, $Cn(P)$. | |
63 | + - If $P$ is a positive program, $Cn(P)$ is the smallest model of the definite clauses of $P$. | |
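The consequence operator above can be sketched in Python (a toy encoding, assumed for illustration: a positive program as a list of `(head, body)` pairs, with `body` a set of atoms):

```python
def cn(program):
    """Cn(P) for a positive program: iterate the immediate-consequence
    operator until a fixpoint, yielding the smallest set closed under P."""
    x, changed = set(), True
    while changed:
        changed = False
        for head, body in program:
            if body <= x and head not in x:
                x.add(head)
                changed = True
    return x

# P:  a.   b :- a.   c :- b, d.
program = [("a", set()), ("b", {"a"}), ("c", {"b", "d"})]
print(sorted(cn(program)))  # ['a', 'b'] -- c is not derivable, d is never proved
```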
64 | + | |
65 | +#### Example calculation of stable models | |
66 | + | |
67 | +Consider the program P: | |
68 | +```prolog | |
69 | +person(joey). | |
70 | +male(X); female(X) :- person(X). | |
71 | +bachelor(X) :- male(X), not married(X). | |
72 | +``` | |
73 | + | |
74 | +1. Any SM of P must have the **fact** `person(joey)`. | |
75 | +2. Therefore the **grounded rule** `male(joey) ; female(joey) :- person(joey).` entails that the SMs of P either have `male(joey)` or `female(joey)`. | |
76 | +3. Any **SM must contain** either A: `{person(joey), male(joey)}` or B: `{person(joey), female(joey)}`. | |
77 | +4. In **the reduct** of P in A we get the rule `bachelor(joey) :- male(joey).` and therefore `bachelor(joey)` must be in a SM that contains A. Let A1: `{person(joey), male(joey), bachelor(joey)}`. | |
78 | +5. No further conclusions result from P on A1. Therefore A1 is a SM. | |
79 | +6. Also no further conclusions result from P on B; It is also a SM. | |
80 | +7. The SMs of P are: | |
81 | + 1. `{person(joey), male(joey), bachelor(joey)}` | |
82 | + 2. `{person(joey), female(joey)}` | |
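The same two stable models can be found by brute force (an illustrative guess-and-check over the grounded program, not how `clasp` works): a candidate set is stable iff it is a minimal model of its reduct.

```python
from itertools import chain, combinations

# Grounded rules of P: (heads, positive body, naf body).
rules = [
    ({"person"}, set(), set()),
    ({"male", "female"}, {"person"}, set()),
    ({"bachelor"}, {"male"}, {"married"}),
]
atoms = {"person", "male", "female", "bachelor", "married"}

def subsets(u):
    u = sorted(u)
    return [set(c) for c in
            chain.from_iterable(combinations(u, r) for r in range(len(u) + 1))]

def stable_models(rules, atoms):
    out = []
    for x in subsets(atoms):
        # Reduct w.r.t. x: drop rules blocked by naf, then drop naf bodies.
        reduct = [(h, p) for h, p, n in rules if not (n & x)]
        models = [m for m in subsets(atoms)
                  if all(not (p <= m) or (h & m) for h, p in reduct)]
        if x in models and not any(m < x for m in models):  # x is a minimal model
            out.append(x)
    return out

for s in stable_models(rules, atoms):
    print(sorted(s))
# prints ['female', 'person'] then ['bachelor', 'male', 'person']
```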
83 | + | |
84 | + | |
85 | +```prolog | |
86 | +-a. | |
87 | +not a. | |
88 | +% | |
89 | +% { -a } | |
90 | +% | |
91 | +-a. | |
92 | +a. | |
93 | +% | |
94 | +% UNSAT. | |
95 | +% | |
96 | +not a. | |
97 | +a. | |
98 | +% | |
99 | +% UNSAT | |
100 | +% | |
101 | +%---------------------------------------- | |
102 | +% | |
103 | +a. | |
104 | +%% Answer: 1 | |
105 | +%% a | |
106 | +%% SATISFIABLE | |
107 | +% | |
108 | +% There is (only) one (stable) model: {a} | |
109 | +% | |
110 | +%---------------------------------------- | |
111 | +% | |
112 | +-a. | |
113 | +%% Answer: 1 | |
114 | +%% -a | |
115 | +%% SATISFIABLE | |
116 | +% | |
117 | +% Same as above. | |
118 | +% | |
119 | +%---------------------------------------- | |
120 | +% | |
121 | +--a. | |
122 | +%% *** ERROR: (clingo): parsing failed | |
123 | +% | |
124 | +% Why does the parser reject double strong negation? | |
125 | +% | |
126 | +%---------------------------------------- | |
127 | +% | |
128 | +not a. | |
129 | +%% Answer: 1 | |
130 | +%% | |
131 | +%% SATISFIABLE | |
132 | +% | |
133 | +% ie there is (only) one (stable) model: {} | |
134 | +% | |
135 | +% This program states that there is no information. | |
136 | +% In particular, there is no information about a. | |
137 | +% Therefore there are no provable atoms. | |
138 | +% Hence the empty set is a stable model. | |
139 | +% | |
140 | +%---------------------------------------- | |
141 | +% | |
142 | +not not a. | |
143 | +%% UNSATISFIABLE | |
144 | +% | |
145 | +% ie no models. Because | |
146 | +% 1. The fact "not not a" acts as the constraint ":- not a.", | |
147 | +%    so a must be in every stable model. | |
148 | +% 2. But no rule derives a, so a has no proof. | |
149 | +% 3. A stable model contains only provable atoms. | |
150 | +% 4. Since a must be, yet cannot be, in any stable model, | |
151 | +%    there are no stable models. | |
152 | +% | |
153 | +%---------------------------------------- | |
154 | +% | |
155 | +not -a. | |
156 | +%% Answer: 1 | |
157 | +%% | |
158 | +%% SATISFIABLE | |
159 | +% | |
160 | +% Same as ~a. | |
161 | +% | |
162 | +%---------------------------------------- | |
163 | +% | |
164 | +b. | |
165 | +a;-a. | |
166 | +not a :- b. | |
167 | +% Answer: 1 | |
168 | +% b -a | |
169 | +% SATISFIABLE | |
170 | +% | |
171 | +% 1. Any model must contain b (fact b). | |
172 | +% 2. Every model must satisfy not a (rule not a :- b.). | |
173 | +% 3. Any model must contain one of a or ¬a (rule a;-a). | |
174 | +% 4. No model can contain both a and ~a. | |
175 | +% 5. Therefore any model must contain {b, ¬a}, which is stable. | |
176 | +% | |
177 | +% Q: Why does "not a" (naf) not contradict "-a" (strong negation)? | |
178 | +% A: "not a" only states that a is not provable; "-a" is a distinct atom asserting that a is false. Only a together with -a (or a with not a) is contradictory. | |
179 | +% | |
180 | +%---------------------------------------- | |
181 | +% | |
182 | +b. | |
183 | +a;c. | |
184 | +% Answer: 1 | |
185 | +% b c | |
186 | +% Answer: 2 | |
187 | +% b a | |
188 | +% SATISFIABLE | |
189 | +% | |
190 | +% 1. Any model must have b. | |
191 | +% 2. Any model must have one of a or c. | |
192 | +% 3. No model with both a and c is minimal because either one satisfies a;c | |
193 | +``` | |
194 | + | |
195 | +- Why is the double strong negation, `--a`, a syntax error, while the double naf, `not not a`, is not? Presumably because strong negation builds a classical literal from an atom and cannot be nested, whereas default negation applies to literals and can. | |
196 | + | |
197 | +#### Definitions and basic propositions | |
198 | +1. Let $\cal{A}$ be a **set of ground atoms**. | |
199 | +2. A **normal rule** $r$ has the form $a \leftarrow b_1, \ldots, b_m, \sim c_1, \ldots, \sim c_n$ with $m, n \geq 0$. | |
200 | + - _Intuitively,_ the head $a$ is true if **each one of the $b_i$ has a proof** and **none of the $c_j$ has a proof**. | |
201 | +3. A **program** is a finite set of rules. | |
202 | +4. The **head** of the rule is $\text{head}(r) = a$; The **body** is $\text{body}(r) = \left\lbrace b_1, \ldots, b_m, \sim c_1, \ldots, \sim c_n \right\rbrace$. | |
203 | +5. A **fact** is a rule with empty body and is simply denoted $a$. | |
204 | +6. A **literal** is an atom $a$ or the default negation $\sim a$ of an atom. | |
205 | +7. Let $X$ be a set of literals. $X^+ = X \cap \cal{A}$ and $X^- = \left\lbrace p~\middle|~\sim p \in X\right\rbrace$. | |
206 | +8. The set of atoms that occur in program $P$ is denoted $\text{atom}(P)$. Also $\text{body}(P) = \left\lbrace \text{body}(r)~\middle|~r \in P\right\rbrace$. Finally, $\text{body}_P(a) = \left\lbrace \text{body}(r)~\middle|~r \in P \wedge \text{head}(r) = a\right\rbrace$. | |
207 | +9. A **model** of the program $P$ is a set of ground atoms $X \subseteq \cal{A}$ such that, for each rule $r \in P$, $$\text{body}^+(r) \subseteq X \wedge \text{body}^-(r) \cap X = \emptyset \to \text{head}(r) \in X.$$ | |
208 | +10. A rule $r$ is **positive** if $\text{body}^-(r) = \emptyset$; a program is positive if all its rules are positive. | |
209 | +11. _A positive program has a unique $\subseteq$-minimal model._ **Is this the link to Prolog?** | |
210 | +12. The **reduct** of a formula $f$ relative to $X$ is the formula $f^X$ that results from $f$ replacing each maximal sub-formula _not satisfied by $X$_ by $\bot$. | |
211 | +13. The **reduct** of program $P$ relative to $X$ is $$P^X = \left\lbrace \text{head}(r) \leftarrow \text{body}^+(r) \middle| r \in P \wedge \text{body}^-(r) \cap X = \emptyset \right\rbrace.$$ Thus $P^X$ results from: | |
212 | + 1. removing every rule with a naf literal $\sim a$ where $a \in X$; | |
213 | + 2. removing the naf literals from the remaining rules. | |
214 | +14. Since $P^X$ is a positive program, it has a unique $\subseteq$-minimal model. | |
215 | +15. $X$ is a **stable model** of $P$ if $X$ is the $\subseteq$-minimal model of $P^X$. | |
216 | +16. **Alternatively,** let ${\cal C}$ be the **consequence operator**, which yields the smallest model of a positive program. A **stable model** $X$ is a solution of $${\cal C}\left(P^X\right) = X.$$ | |
217 | + - _negative literals must only be true, while positive ones must also be provable._ | |
218 | +17. _A stable model is $\subseteq$-minimal but not the converse._ | |
219 | +18. _A positive program has a unique stable model, its smallest model._ | |
220 | +19. _If $X,Y$ are stable models of a normal program then $X \not\subset Y$._ | |
221 | +20. _Also, $X \subseteq {\cal C}(P^X) \subseteq \text{head}(P^X)$._ | |
222 | + | |
223 | +## ASP Programming Strategies | |
224 | + | |
225 | +- **Elimination of unnecessary combinatorics.** The number of grounded instances has a huge impact on performance. Rules can be used as "pre-computation" steps. | |
226 | +- **Boolean Constraint Solving.** This is at the core of the **solving** step, e.g. `clasp`. | |
227 | + | |
228 | +## ASP vs. Prolog | |
229 | + | |
230 | +- The different number of stable models lies precisely at the core difference between Prolog and ASP. **In Prolog, the presence of programs with negation that do not have a unique stable model causes trouble, and SLDNF resolution does not terminate on them [17]**. However, ASP embraces the disparity of stable models and treats the stable models of the programs as solutions to a given search problem (from [Prolog and Answer Set Programming: Languages in Logic Programming](https://silviacasacuberta.files.wordpress.com/2020/07/final_paper.pdf) ) | |
231 | +- Prolog programs may not terminate (`p :- \+ p.`); ASP "programs" always terminate (`p :- not p.` has zero solutions). | |
232 | +- ASP, in its basic form, disallows function symbols so that grounding stays finite; Prolog allows them. | |
233 | + | |
234 | + | |
235 | +## References | |
236 | + | |
237 | +1. Martin Gebser, Roland Kaminski, Benjamin Kaufmann, Torsten Schaub, _Answer Set Solving in Practice_, Morgan & Claypool (2013) | |
238 | +2. [Potassco, clingo and gringo](https://potassco.org/): <https://potassco.org/> | |
239 | +3. ["Answer Set Programming" lecture notes](http://web.stanford.edu/~vinayc/logicprogramming/html/answer_set_programming.html) for Stanford's course on Logic Programming by Vinay K. Chaudhri. Check also [the ILP section](http://web.stanford.edu/~vinayc/logicprogramming/html/inductive_logic_programming.html), this ASP example of an [encoding](http://www.stanford.edu/~vinayc/logicprogramming/epilog/jackal_encoding.lp) and related [instance](http://www.stanford.edu/~vinayc/logicprogramming/epilog/jackal_instance.lp) and [project suggestions](http://web.stanford.edu/~vinayc/logicprogramming/html/projects.html). | |
0 | 240 | \ No newline at end of file | ... | ... |
... | ... | @@ -0,0 +1,111 @@ |
1 | +# Distribution Semantics of Probabilistic Logic Programs | |
2 | + | |
3 | +> There are two major approaches to integrating probabilistic reasoning into logical representations: **distribution semantics** and **maximum entropy**. | |
4 | + | |
5 | +> - Is there a **sound interpretation of ASP**, in particular of **stable models**, to any of the two approaches above? | |
6 | +> - Under such interpretation, **what probabilistic problems can be addressed?** MARG? MLE? MAP? Decision? | |
7 | +> - **What is the relation to other logic and uncertainty approaches?** Independent Choice Logic? Abduction? Stochastic Logic Programs? etc. | |
8 | + | |
9 | + | |
10 | +## Maximum Entropy Summary | |
11 | + | |
12 | +> ME approaches annotate uncertainties only at the level of a logical theory. That is, they assume that the predicates in the BK are labelled as either true or false, but the label may be incorrect. | |
13 | + | |
14 | +These approaches are not based on logic programming, but rather on first-order logic. Consequently, the underlying semantics are different: rather than consider proofs, **these approaches consider models or groundings of a theory**. | |
15 | + | |
16 | +This difference primarily changes what uncertainties represent. For instance, Markov Logic Networks (MLN) represent programs as a set of weighted clauses. The weights in MLN do not correspond to probabilities of a formula being true but, intuitively, to a log odds between a possible world (an interpretation) where the clause is true and a world where the clause is false. | |
17 | + | |
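The log-odds reading of an MLN weight can be checked numerically (a toy sketch with one ground clause and a made-up weight; real MLNs sum over all groundings of all clauses):

```python
import math

w = 1.5  # hypothetical clause weight (an assumption, not from the text)

# Two worlds: one where the single ground clause is true (n = 1),
# one where it is false (n = 0).  MLN semantics: P(world) ∝ exp(w * n(world)).
n = {'clause_true': 1, 'clause_false': 0}
Z = sum(math.exp(w * k) for k in n.values())                    # partition function
probs = {world: math.exp(w * k) / Z for world, k in n.items()}

# The weight is a log odds between the two worlds, not a probability:
log_odds = math.log(probs['clause_true'] / probs['clause_false'])
```

Here `log_odds` recovers `w` (up to float error), while `probs['clause_true']` is about 0.82, illustrating that weights and probabilities are different scales.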
18 | +## Distribution Semantics | |
19 | + | |
20 | +> DS approaches explicitly annotate uncertainties in BK. To allow such annotation, they extend Prolog with two primitives for stochastic execution: probabilistic facts and annotated disjunctions. | |
21 | + | |
22 | +Probabilistic facts are the most basic stochastic primitive and they take the form of logical facts labelled with a probability p. **Each probabilistic fact represents a Boolean random variable that is true with probability p and false with probability 1 − p.** _This is very close to facts in ASP. A "simple" syntax extension would be enough to capture probability annotations. **What about the semantics of such programs?**_ | |
23 | + | |
24 | +Whereas probabilistic facts introduce non-deterministic behaviour on the level of facts, annotated disjunctions introduce non-determinism on the level of clauses. Annotated disjunctions allow for multiple literals in the head, where only one of the head literals can be true at a time. | |
25 | + | |
26 | +### Core Distribution Semantics | |
27 | + | |
28 | +- Let $F$ be a set of **grounded probabilistic facts** and $P:F \to \left[0, 1 \right]$. | |
29 | + | |
30 | +> For example, `F` and `P` result from | |
31 | +> ```prolog | |
32 | +> 0.9::edge(a,c). | |
33 | +> 0.7::edge(c,b). | |
34 | +> 0.6::edge(d,c). | |
35 | +> 0.9::edge(d,b). | |
36 | +> ``` | |
37 | + | |
38 | +- **Facts are assumed marginally independent:** $$\forall a,b \in F, P(a \wedge b) = P(a)P(b).$$ | |
39 | + | |
40 | +- The **probability of $S \subseteq F$** is $$P_F(S) = \prod_{f \in S} P(f) \prod_{f \not\in S} \left(1 - P(f) \right).$$ | |
41 | + | |
42 | +- Let $R$ be a set of **definite clauses** defining further (new) predicates. | |
43 | + | |
44 | +> For example, `R` is | |
45 | +> ```prolog | |
46 | +> path(X,Y) :- edge(X,Y). | |
47 | +> path(X,Y) :- edge(X,Z), path(Z,Y). | |
48 | +> ``` | |
49 | + | |
50 | +- Any combination $S \cup R$ has a **unique least Herbrand model**, $$W = M(S \cup R).$$ | |
51 | + | |
52 | +- **That uniqueness fails for stable models.** Exactly why? What is the relation between stable models and least Herbrand models? | |
53 | + | |
54 | +- The set of ground facts $S$ is an **explanation** of the world $W = M(S \cup R)$. A world might have multiple explanations. In ASP an explanation can entail 0, 1 or more worlds. | |
55 | + | |
56 | +- The **probability of a possible world** $W$ is | |
57 | +$$P(W) = \sum_{S \subseteq F :~W=M(S\cup R)} P_F(S).$$ | |
58 | + | |
59 | +- The **probability of a ground proposition** $p$ is (defined as) the probability that $p$ has a proof: $$P(p) = \sum_{S :~ S\cup R ~\vdash~ p} P_F(S) = \sum_{W :~ p\in W} P(W).$$ | |
60 | + | |
61 | +- A proposition may have many proofs in a single world $M(S\cup R)$. Without further guarantees, the probabilities of those proofs cannot be summed. The definition above avoids this problem. | |
62 | + | |
63 | + | |
64 | +> For example, a proof of `path(a,b)` employs (only) the facts `edge(a,c)` and `edge(c,b)` _i.e._ these facts are an explanation of `path(a,b)`. Since these facts are (marginally) independent, **the probability of the proof** is $$\begin{aligned}P(\text{path}(a, b)) & = P(\text{edge}(a,c) \wedge\text{edge}(c,b)) \\&= P(\text{edge}(a,c)) \times P(\text{edge}(c,b)) \\ &= 0.9 \times 0.7 \\ &= 0.63. \end{aligned}$$ | |
65 | +> This is the only proof of `path(a,b)` so $P(\text{path}(a,b)) = 0.63$. | |
66 | +> | |
67 | +> On the other hand, since `path(d,b)` has two explanations, `edge(d,b)` and `edge(d,c), edge(c,b)`: $$\begin{aligned} P(\text{path}(d,b)) & = P\left(\text{edge}(d,b) \vee \left(\text{edge}(d,c)\wedge\text{edge}(c,b)\right)\right) \\ &= 0.9 + 0.6 \times 0.7 - 0.9 \times 0.6 \times 0.7 \\ &= 0.942.\end{aligned}$$ | |
68 | + | |
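These probabilities can be reproduced by brute-force enumeration of the subsets $S \subseteq F$, summing $P_F(S)$ over the subsets whose least model proves the query (an illustrative sketch; systems like ProbLog use knowledge compilation instead of enumeration):

```python
from itertools import product

# Probabilistic edge facts from the example above.
facts = {('a', 'c'): 0.9, ('c', 'b'): 0.7, ('d', 'c'): 0.6, ('d', 'b'): 0.9}

def reachable(edges, x, y):
    """Does path(x, y) hold in the least model built from these edges?"""
    frontier, seen = {x}, set()
    while frontier:
        n = frontier.pop()
        seen.add(n)
        frontier |= {v for (u, v) in edges if u == n and v not in seen}
    return y in seen

def prob_path(x, y):
    """P(path(x,y)) = sum of P_F(S) over subsets S that prove path(x,y)."""
    total = 0.0
    for choice in product([True, False], repeat=len(facts)):
        edges = [e for e, keep in zip(facts, choice) if keep]
        p_S = 1.0  # P_F(S): product over chosen and rejected facts
        for e, keep in zip(facts, choice):
            p_S *= facts[e] if keep else 1 - facts[e]
        if reachable(edges, x, y):
            total += p_S
    return total

print(round(prob_path('a', 'b'), 3))  # 0.63
print(round(prob_path('d', 'b'), 3))  # 0.942
```

The two queries recover exactly the hand calculations above.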
69 | +- With this **semantics of the probability of a possible world**, the probability of an arbitrary proposition is still hard to compute, because of the _disjoint-sum_ problem: **an explanation can have many worlds.** Since the probability is computed via the explanation, if there are many models for a single explanation, **how to assign probability to specific worlds within the same explanation?** | |
70 | + | |
71 | +> Because computing the probability of a fact or goal under the distribution semantics is hard, systems such as Prism [4] and Probabilistic Horn Abduction (PHA) [8] impose additional restrictions that can be used to improve the efficiency of the inference procedure. | |
72 | +> | |
73 | +> **The key assumption is that the explanations for a goal are mutually exclusive, which overcomes the disjoint-sum problem.** If the different explanations of a fact do not overlap, then its probability is simply the sum of the probabilities of its explanations. This directly follows from the inclusion-exclusion formulae as under the exclusive-explanation assumption the conjunctions (or intersections) are empty (_Statistical Relational Learning_, Luc De Raedt and Kristian Kersting, 2010) | |
74 | +> | |
75 | +> **This assumption/restriction is quite _ad-hoc_ and overcoming it requires further inquiry.** | |
76 | + | |
77 | +- Reading Fabio Gagliardi Cozman, Denis Deratani Mauá, _The joy of Probabilistic Answer Set Programming: Semantics - complexity, expressivity, inference_ (2020) strongly reinforced my initial intuition. | |
78 | + | |
79 | +- The problem can be illustrated with disjunctive clauses, such as the one in the following example. | |
80 | + | |
81 | +```prolog | |
82 | +a ; -a. % prob(a) = 0.7 | |
83 | +b ; c :- a. | |
84 | +``` | |
85 | + | |
86 | +- More specifically, in the example above, **the explanation `a` entails two possible worlds, `ab` and `ac`. How to assign a probability to each one?** | |
87 | + | |
88 | +### Assigning Probabilities on "Multiple Worlds per Explanation" Scenarios | |
89 | + | |
90 | +#### Clause Annotations | |
91 | + | |
92 | +> Assign a probability to each case in the head of the clause. For example, annotate $P(b|a) = 0.8$. | |
93 | + | |
94 | +This case needs further study of its consequences, especially concerning the joint probability distribution. | |
95 | + | |
96 | +- In particular, $P(b|a) = 0.8$ entails $P(\neg b | a) = 0.2$. But $\neg b$ is not in any world. | |
97 | +- Also, unless assumed otherwise, the independence of $b$ and $c$ is unknown. | |
98 | + | |
99 | +#### Learn from Observations | |
100 | + | |
101 | +> Leave the probabilities uniformly distributed; update them from observation. | |
102 | + | |
103 | +Under this approach, how do observations affect the assigned probabilities? | |
104 | + | |
105 | +- In particular, how to update the probabilities of the worlds `a b` and `a c` given observations such as `a`, `b`, `ab`, `a-b`, `-ab` or `abc`? | |
106 | + 1. Define a criterion to decide whether an observation $z$ is compatible with world $w$. For example, $z \subseteq w$. | |
107 | + 2. Define the probability of a world from the explanation probability and a count of **compatible observations**. | |
108 | + | |
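A minimal sketch of these two steps, assuming containment ($z \subseteq w$) as the compatibility criterion and add-one smoothing for the counts (both choices are illustrative, not fixed by the text):

```python
# Worlds entailed by the explanation `a` of the disjunctive example,
# which carries probability 0.7; initially the split between them is unknown.
worlds = [frozenset('ab'), frozenset('ac')]
explanation_prob = 0.7

def compatible(z, w):
    # Step 1: an observation is compatible with a world that contains it.
    return z <= w

def world_probs(observations):
    # Step 2: split the explanation probability in proportion to
    # (smoothed) counts of compatible observations.
    counts = {w: 1 + sum(compatible(z, w) for z in observations) for w in worlds}
    total = sum(counts.values())
    return {w: explanation_prob * counts[w] / total for w in worlds}

probs = world_probs([frozenset('a'), frozenset('b'), frozenset('ab')])
# The worlds' probabilities shift toward {a,b} but still sum to 0.7.
```

Observing `a` supports both worlds, while `b` and `ab` support only `{a,b}`, so the update concentrates mass there without ever exceeding the explanation's own probability.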
109 | +#### Leave One World Out | |
110 | + | |
111 | +> Define a **compatibility criterion** for observations and worlds, add another world, and update its probability on incompatible observations; the probability of this world measures the model+sensors limitations. | |
... | ... | @@ -0,0 +1,74 @@ |
1 | +# Inductive Logic Programming | |
2 | + | |
3 | +> Inductive logic programming (ILP) is a form of machine learning (ML). As with | |
4 | +> other forms of ML, the goal of ILP is to induce a hypothesis that generalises training examples. However, whereas most forms of ML use vectors/tensors to represent data (examples and hypotheses), ILP uses logic programs (sets of logical rules). Moreover, whereas most forms of ML learn functions, ILP learns relations. | |
5 | + | |
6 | +## Why ILP? | |
7 | + | |
8 | +- **Data efficiency.** Many forms of ML are notorious for their inability to generalise from small numbers of training examples, notably deep learning. By contrast, ILP can induce hypotheses from small numbers of examples, often from a single example. | |
9 | +- **Background knowledge.** ILP learns using BK represented as a logic program. Moreover, because hypotheses are symbolic, they can be added to the BK, and thus ILP systems naturally support lifelong and transfer learning. | |
10 | +- **Expressivity.** Because of the expressivity of logic programs, ILP can learn complex relational theories. Because of the symbolic nature of logic programs, ILP can reason about hypotheses, which allows it to learn optimal programs. | |
11 | +- **Explainability.** Because of logic’s similarity to natural language, logic programs can be easily read by humans, which is crucial for explainable AI. | |
12 | + | |
13 | +## Recent Advances | |
14 | + | |
15 | +- Search: Meta-level | |
16 | +- Recursion: Yes | |
17 | +- Predicate Invention: Limited | |
18 | +- Hypotheses: Higher-order; ASP | |
19 | +- Optimality: Yes | |
20 | +- Technology: Prolog; ASP; NNs | |
21 | + | |
22 | +### Review | |
23 | + | |
24 | +- **Search.** The fundamental ILP problem is to efficiently search a large hypothesis space. Most older ILP approaches search in either a top-down or bottom-up fashion. A third new search approach has recently emerged called meta-level ILP. | |
25 | + - **Top-down** approaches start with a general hypothesis and then specialise it. | |
26 | + - **Bottom-up** approaches start with the examples and generalise them. | |
27 | + - **Meta-level.** Most meta-level approaches encode the ILP problem as a program that reasons about programs. | |
28 | +- **Recursion.** Learning recursive programs has long been considered a difficult problem for ILP. The power of recursion is that an infinite number of computations can be described by a finite recursive program. | |
29 | + - Interest in recursion has resurged with the introduction of meta-interpretive learning (MIL) and the MIL system Metagol. The key idea of MIL is to use metarules, or program templates, to restrict the form of inducible programs, and thus the hypothesis space. A metarule is a higher-order clause. Following MIL, many meta-level ILP systems can learn recursive programs. With recursion, ILP systems can now generalise from small numbers of examples, often a single example. Moreover, the ability to learn recursive programs has opened up ILP to new application areas. | |
30 | +- **Predicate invention.** A key characteristic of ILP is the use of BK. BK is similar to features used in most forms of ML. However, whereas features are tables, BK contains facts and rules (extensional and intensional definitions) in the form of a logic program. | |
31 | + - Rather than expecting a user to provide all the necessary BK, the goal of predicate invention (PI) is for an ILP system to automatically invent new auxiliary predicate symbols. Whilst PI has attracted interest since the beginnings of ILP, and has subsequently been repeatedly stated as a major challenge, most ILP systems do not support it. | |
32 | + - Several PI approaches try to address this challenge: Placeholders, Metarules, Pre/post-processing, Lifelong Learning. | |
33 | + - The aforementioned techniques have improved the ability of ILP to invent high-level concepts. However, PI is still difficult and there are many challenges to overcome. The challenges are that (i) many systems struggle to perform PI at all, and (ii) those that do support PI mostly need substantial user guidance, such as metarules that restrict the space of invented symbols, or user-specified arities and argument types for the invented symbols. | |
34 | +- ILP systems have traditionally induced definite and normal logic programs, typically represented as Prolog programs. A recent development has been to use different **hypothesis representations**. | |
35 | + - **Datalog** is a syntactical subset of Prolog which disallows complex terms as arguments of predicates and imposes restrictions on the use of negation. The general motivation for reducing the expressivity of the representation language from Prolog to Datalog is to allow the problem to be encoded as a satisfiability problem, particularly to leverage recent developments in SAT and SMT. | |
36 | + - **Answer Set Programming** (ASP) is a logic programming paradigm based on the stable model semantics of normal logic programs that can be implemented using the latest advances in SAT solving technology. | |
37 | + - When learning Prolog programs, the procedural aspect of SLD-resolution must be taken into account. By contrast, as ASP is a truly declarative language, no such consideration need be taken into account when learning ASP programs. Compared to Datalog and Prolog, ASP supports additional language constructs, such as disjunction in the head of a clause, choice rules, and hard and weak constraints. | |
38 | + - **A key difference between ASP and Prolog is semantics.** A definite logic program has only one model (the least Herbrand model). By contrast, an ASP program can have one, many, or even no stable models (answer sets). Due to its non-monotonicity, ASP is particularly useful for expressing common-sense reasoning. | |
39 | + - Approaches to learning ASP programs can mostly be divided into two categories: **brave learners**, which aim to learn a program such that at least one answer set covers the examples, and **cautious learners**, which aim to find a program which covers the examples in all answer sets. | |
40 | + - **Higher-order programs** where predicate symbols can be used as terms. | |
41 | + - **Probabilistic logic programs.** A major limitation of logical representations, such as Prolog and its derivatives, is the implicit assumption that the BK is perfect. This assumption is problematic if data is noisy, which is often the case. | |
42 | + - **Integrating probabilistic reasoning into logical representations** is a principled way to handle such uncertainty in data. This integration is the focus of statistical relational artificial intelligence (StarAI). In essence, StarAI hypothesis representations extend BK with probabilities or weights indicating the degree of confidence in the correctness of parts of BK. Generally, StarAI techniques can be divided into two groups: _distribution semantics_ and _maximum entropy_ approaches. | |
43 | + - **Distribution semantics** approaches explicitly annotate uncertainties in BK. To allow such annotation, they extend Prolog with two primitives for stochastic execution: probabilistic facts and annotated disjunctions. Probabilistic facts are the most basic stochastic primitive and they take the form of logical facts labelled with a probability p. Each probabilistic fact represents a Boolean random variable that is true with probability p and false with probability 1 − p. Whereas probabilistic facts introduce non-deterministic behaviour on the level of facts, annotated disjunctions introduce non-determinism on the level of clauses. Annotated disjunctions allow for multiple literals in the head, where only one of the head literals can be true at a time. | |
44 | + - **Maximum entropy** approaches annotate uncertainties only at the level of a logical theory. That is, they assume that the predicates in the BK are labelled as either true or false, but the label may be incorrect. These approaches are not based on logic programming, but rather on first-order logic. Consequently, the underlying semantics are different: rather than consider proofs, these approaches consider models or groundings of a theory. This difference primarily changes what uncertainties represent. For instance, Markov Logic Networks (MLN) represent programs as a set of weighted clauses. The weights in MLN do not correspond to probabilities of a formula being true but, intuitively, to a log odds between a possible world (an interpretation) where the clause is true and a world where the clause is false. | |
45 | + - The techniques for learning such probabilistic programs are typically direct extensions of ILP techniques. | |
46 | +- **Optimality.** There are often multiple (sometimes infinitely many) hypotheses that explain the data. Deciding which hypothesis to choose has long been a difficult problem. | |
47 | + - Older ILP systems were not guaranteed to induce optimal programs, where optimal typically means with respect to the size of the induced program or the coverage of examples. A key reason for this limitation was that most search techniques learned a single clause at a time, leading to the construction of sub-programs which were sub-optimal in terms of program size and coverage. | |
48 | + - Newer ILP systems try to address this limitation. As with the ability to learn recursive programs, the main development is to take a global view of the induction task by using meta-level search techniques. In other words, rather than induce a single clause at a time from a single example, the idea is to induce multiple clauses from multiple examples. | |
49 | + - The ability to learn optimal programs opens up ILP to new problems. For instance, learning efficient logic programs has long been considered a difficult problem in ILP, mainly because there is no declarative difference between an efficient program and an inefficient program. | |
50 | +- **Technologies.** Older ILP systems mostly use Prolog for reasoning. Recent work considers using different technologies. | |
51 | + - **Constraint satisfaction and satisfiability.** There have been tremendous recent advances in SAT. | |
52 | + - To leverage these advances, much recent work in ILP uses related techniques, notably ASP. The main motivations for using ASP are to leverage (i) the language benefits of ASP, and (ii) the efficiency and optimisation techniques of modern ASP solvers, which supports conflict propagation and learning. | |
53 | + - With similar motivations, other approaches encode the ILP problem as SAT or SMT problems. | |
54 | + - These approaches have been shown able to **reduce learning times** compared to standard Prolog-based approaches. However, some unresolved issues remain. A key issue is that most approaches **encode an ILP problem as a single (often very large) satisfiability problem**. These approaches therefore often struggle to scale to very large problems, although preliminary work attempts to tackle this issue. | |
55 | + - **Neural Networks.** With the rise of deep learning, several approaches have explored using gradient-based methods to learn logic programs. These approaches all **replace discrete logical reasoning with a relaxed version that yields continuous values** reflecting the confidence of the conclusion. | |
56 | +- **Applications.** | |
57 | + - **Scientific discovery.** Perhaps the most prominent application of ILP is in scientific discovery: identify and predict ligands (substructures responsible for medical activity) and infer missing pathways in protein signalling networks; ecology. | |
58 | + - **Program analysis.** learning SQL queries; programming language semantics, and code search. | |
59 | + - **Robotics.** Robotics applications often require incorporating domain knowledge or imposing certain requirements on the learnt programs. | |
60 | + - **Games.** Inducing game rules has a long history in ILP, where chess has often been the focus | |
61 | + - **Data curation and transformation.** Another successful application of ILP is in data curation and transformation, which is again largely because ILP can learn executable programs. There is much interest in this topic, largely due to success in synthesising programs for end-user problems, such as string transformations. Other transformation tasks include extracting values from semi-structured data (e.g. XML files or medical records), extracting relations from ecological papers, and spreadsheet manipulation. | |
62 | + - **Learning from trajectories.** Learning from interpretation transitions (LFIT) automatically constructs a model of the dynamics of a system from the observation of its state transitions. LFIT has been applied to learn biological models, like Boolean Networks, under several semantics: memory-less deterministic systems, and their multi-valued extensions. The Apperception Engine explains sequential data, such as cellular automata traces, rhythms and simple nursery tunes, image occlusion tasks, game dynamics, and sequence induction intelligence tests. Surprisingly, it can achieve human-level performance on the sequence induction intelligence tests in the zero-shot setting (without having been trained on lots of other examples of such tests, and without hand-engineered knowledge of the particular setting). At a high level, these systems take the unique selling point of ILP systems (the ability to strongly generalise from a handful of data), and apply it to the self-supervised setting, producing an explicit human-readable theory that explains the observed state transitions. | |
63 | +- **Limitations and future research.** | |
64 | + - **Better systems.** A problem with ILP is the lack of well-engineered tools. Whilst over 100 ILP systems have been built, fewer than a handful can be meaningfully used by ILP researchers. By contrast, driven by industry, other forms of ML now have reliable and well-maintained implementations, which has helped drive research. A frustrating issue with ILP systems is that they use many different language biases or even different syntax for the same biases. _For ILP to be more widely adopted both inside and outside of academia, we must develop more standardised, user-friendly, and better-engineered tools._ | |
65 | + - **Language biases.** One major issue with ILP is choosing an appropriate language bias. Even for ILP experts, determining a suitable language bias is often a frustrating and time-consuming process. We think the need for an almost perfect language bias is severely holding back ILP from being widely adopted. _We think that an important direction for future work in ILP is to develop techniques for automatically identifying suitable language biases._ This area of research is largely under-researched. | |
66 | + - **Better datasets.** Interesting problems, alongside usable systems, drive research and attract interest in a research field. This relationship is most evident in the deep learning community which has, over a decade, grown into the largest AI community. This community growth has been supported by the constant introduction of new problems, datasets, and well-engineered tools. ILP has, unfortunately, failed to deliver on this front: most research is still evaluated on 20-year old datasets. Most new datasets that have been introduced often come from toy domains and are designed to test specific properties of the introduced technique. _We think that the ILP community should learn from the experiences of other AI communities and put significant efforts into developing datasets that identify limitations of existing methods as well as showcase potential applications of ILP._ | |
67 | + - **Relevance.** New methods for predicate invention have improved the abilities of ILP systems to learn large programs. Moreover, these techniques raise the potential for ILP to be used in lifelong learning settings. However, inventing and acquiring new BK could lead to a problem of too much BK, which can overwhelm an ILP system. On this issue, a key under-explored topic is that of relevancy. _Given a new induction problem with large amounts of BK, how does an ILP system decide which BK is relevant?_ One emerging technique is to train a neural network to score how relevant programs are in the BK and to then only use BK with the highest score to learn programs. Without efficient methods of relevance identification, it is unclear how efficient lifelong learning can be achieved. | |
68 | + - **Handling mislabelled and ambiguous data.** A major open question in ILP is how best to handle noisy and ambiguous data. Neural ILP systems are designed from the start to robustly handle mislabelled data. Although there has been work in recent years on designing ILP systems that can handle noisy mislabelled data, there is much less work on the even harder and more fundamental problem of designing ILP systems that can handle raw ambiguous data. ILP systems typically assume that the input has already been preprocessed into symbolic declarative form (typically, a set of ground atoms representing positive and negative examples). But real-world input does not arrive in symbolic form. _For ILP systems to be widely applicable in the real world, they need to be redesigned so they can handle raw ambiguous input from the outset._ | |
69 | + - **Probabilistic ILP.** Real-world data is often noisy and uncertain. Extending ILP to deal with such uncertainty substantially broadens its applicability. While StarAI is receiving growing attention, **learning probabilistic programs from data is still largely under-investigated due to the complexity of joint probabilistic and logical inference.** When working with probabilistic programs, we are interested in the probability that a program covers an example, not only whether the program covers the example. Consequently, probabilistic programs need to compute all possible derivations of an example, not just a single one. Despite added complexity, probabilistic ILP opens many new challenges. Most of the existing work on probabilistic ILP considers the minimal extension of ILP to the probabilistic setting, by assuming that either (i) BK facts are uncertain, or (ii) that learned clauses need to model uncertainty. **These assumptions make it possible to separate structure from uncertainty and simply reuse existing ILP techniques.** Following this minimal extension, the existing work focuses on discriminative learning in which the goal is to learn a program for a single target relation. However, a grand challenge in probabilistic programming is generative learning. That is, learning a program describing a generative process behind the data, not a single target relation. **Learning generative programs is a significantly more challenging problem, which has received very little attention in probabilistic ILP.** | |
70 | + - **Explainability.** Explainability is one of the claimed advantages of a symbolic representation. Recent work evaluates the comprehensibility of ILP hypotheses using Michie’s framework of ultra-strong machine learning, where a learned hypothesis is expected to not only be accurate but to also demonstrably improve the performance of a human being provided with the learned hypothesis. [Some work] empirically demonstrates improved human understanding directly through learned hypotheses. _However, more work is required to better understand the conditions under which this can be achieved, especially given the rise of predicate invention (PI)._ | |
71 | + | |
72 | +## Bibliography | |
73 | + | |
74 | +1. Inductive logic programming at 30 | |
0 | 75 | \ No newline at end of file | ... | ... |
No preview for this file type
No preview for this file type
... | ... | @@ -0,0 +1,13 @@ |
1 | +# Potassco | |
2 | + | |
3 | +> [Potassco](https://potassco.org/), the Potsdam Answer Set Solving Collection, bundles tools for Answer Set Programming developed at the University of Potsdam. | |
4 | + | |
5 | +- [The Potassco Guide](https://github.com/potassco/guide) | |
6 | + | |
7 | +## clingo | |
8 | + | |
9 | +> Current answer set solvers work on variable-free programs. Hence, a grounder is needed that, given an input program with first-order variables, computes an equivalent ground (variable-free) program. gringo is such a grounder. Its output can be processed further with clasp, claspfolio, or clingcon. | |
10 | +> | |
11 | +> [clingo](https://potassco.org/clingo/) combines both gringo and clasp into a monolithic system. This way it offers more control over the grounding and solving process than gringo and clasp can offer individually - e.g., incremental grounding and solving. | |
12 | + | |
13 | +- [Python module list](https://potassco.org/clingo/python-api/current/) | |
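A tiny example makes the quote concrete. The program below (the file name `two.lp` is just illustrative) has two stable models, `{p}` and `{q}`:

```
% two.lp -- a ground program with two stable models
p :- not q.
q :- not p.
```

Running `clingo two.lp 0` (the `0` asks for all answer sets) first grounds the program with the embedded gringo (here it is already ground) and then lets clasp enumerate the stable models.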
0 | 14 | \ No newline at end of file | ... | ... |
... | ... | @@ -0,0 +1,41 @@ |
1 | +# Probability Problems | |
2 | + | |
3 | +> - What are the general tasks we expect to solve with probabilistic programs? | |
4 | +> - The **MAP** task is the one with the most useful applications; it is also the hardest to compute. | |
5 | +> - **MLE** is the limiting case of **MAP**; it is simpler to compute but overfits the data. | |
6 | + | |
7 | +## Background | |
8 | + | |
9 | +- **Conditional Probability** $$P(A, B) = P(B | A) P(A).$$ | |
10 | +- **Bayes Theorem** $$P(B | A) = \frac{P(A | B) P(B)}{P(A)}.$$ | |
11 | +- **For maximization tasks** $$P(B | A) \propto P(A | B) P(B).$$ | |
12 | +- **Marginal** $$P(A) = \sum_b P(A,b).$$ | |
13 | +- In $P(B | A) \propto P(A | B) P(B)$, if the **posterior** $P(B | A)$ and the **prior** $P(B)$ follow distributions of the same family, $P(B)$ is a **conjugate prior** for the **likelihood** $P(A | B)$. | |
14 | +- **Density Estimation:** Estimate a joint probability distribution from a set of observations; select a probability distribution function and the parameters that best explain the distribution of the observations. | |
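A quick numeric check of Bayes' theorem and the marginal (the rates below are made up purely for illustration):

```python
# Posterior P(B | A) from prior, likelihood, and the marginal P(A).
p_b = 0.01                    # prior P(B)
p_a_given_b = 0.90            # likelihood P(A | B)
p_a_given_not_b = 0.05        # P(A | not B)

# Marginal: P(A) = sum_b P(A | b) P(b)
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)

# Bayes theorem: P(B | A) = P(A | B) P(B) / P(A)
posterior = p_a_given_b * p_b / p_a   # ~0.154: small despite the strong likelihood
```

Note how the low prior keeps the posterior small even though the likelihood is high.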
15 | + | |
16 | +## MLE: Maximum Likelihood Estimation | |
17 | + | |
18 | +> Given a probability **distribution** $d$ and a set of **observations** $X$, find the distribution **parameters** $\theta$ that maximize the **likelihood** (_i.e._ the probability of those observations) for that distribution. | |
19 | +> | |
20 | +> **Overfits the data:** high variance of the parameter estimate; sensitive to random variations in the data. Regularization with $P(\theta)$ leads to **MAP**. | |
21 | + | |
22 | +Given $d, X$, find | |
23 | +$$ | |
24 | +\hat{\theta}_{\text{MLE}}(d,X) = \arg\max_{\theta} P_d(X | \theta). | |
25 | +$$ | |
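As a concrete (hypothetical) instance: with Bernoulli observations, the likelihood $P_d(X | \theta)$ is maximised at the sample mean, so the MLE is a one-liner:

```python
from statistics import mean

# Bernoulli MLE: theta_hat is the sample mean of the 0/1 observations.
def bernoulli_mle(observations):
    return mean(observations)

theta_hat = bernoulli_mle([1, 1, 0, 1, 0, 1, 1, 0])   # 5/8 = 0.625

# Overfitting in action: three observed heads give theta_hat = 1.0,
# i.e. the estimate claims tails are impossible.
theta_overfit = bernoulli_mle([1, 1, 1])
```

The second call illustrates the high variance noted above: a tiny sample drives the estimate to an extreme value.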
26 | + | |
27 | +## MAP: Maximum A Posteriori | |
28 | + | |
29 | +> Given a probability **distribution** $d$ and a set of **observations** $X$, find the distribution **parameters** $\theta$ that best explain those observations. | |
30 | + | |
31 | +Given $d, X$, find | |
32 | +$$ | |
33 | +\hat{\theta}_{\text{MAP}}(d, X) = \arg\max_{\theta} P(\theta | X). | |
34 | +$$ | |
35 | + | |
36 | +Using $P(B | A) \propto P(A | B) P(B)$ with $B = \theta$ and $A = X$, | |
37 | +$$\hat{\theta}_{\text{MAP}}(d, X) = \arg\max_{\theta} P_d(X | \theta) P(\theta).$$ | |
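Continuing the hypothetical Bernoulli example, a conjugate Beta prior makes the MAP estimate closed-form: the posterior is again a Beta distribution, and we take its mode.

```python
# Bernoulli likelihood with a conjugate Beta(alpha, beta) prior on theta.
# Posterior: Beta(k + alpha, n - k + beta); its mode is the MAP estimate.
def bernoulli_map(observations, alpha=2.0, beta=2.0):
    k, n = sum(observations), len(observations)
    return (k + alpha - 1) / (n + alpha + beta - 2)

# Where MLE would overfit three heads to theta = 1.0, the Beta(2, 2)
# prior acts as a regulariser and pulls the estimate towards 1/2:
theta_map = bernoulli_map([1, 1, 1])   # (3 + 1) / (3 + 2) = 0.8
```

With a flat Beta(1, 1) prior the formula reduces to $k/n$, recovering MLE as the limit case of MAP mentioned at the top.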
38 | + | |
39 | +Variants: | |
40 | +- **Viterbi algorithm:** Find the most likely sequence of hidden states (on HMMs) that results in a sequence of observed events. | |
41 | +- **MPE (Most Probable Explanation)** and the **max-sum/max-product algorithms:** Calculate the max-marginal for each unobserved node, conditional on any observed nodes; this yields the most likely assignment to all the random variables that is consistent with the given evidence. | ... | ... |
... | ... | @@ -0,0 +1,14 @@ |
1 | +# Z3 - An SMT solver | |
2 | + | |
3 | +> `Z3` is a theorem prover from Microsoft Research. | |
4 | +> **However, for solving ASP programs, `Potassco` seems more to the point.** | |
5 | + | |
6 | +## Introduction | |
7 | + | |
8 | +An Answer Set Program can be solved by translating it into a SAT (Boolean satisfiability) problem. | |
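As a toy sketch of that reduction (brute-force plain Python, deliberately not using Z3's API): for the program `p :- not q.  q :- not p.`, Clark's completion yields the propositional formula `(p <-> not q) and (q <-> not p)`, whose models coincide with the program's stable models `{q}` and `{p}`.

```python
from itertools import product

# Brute-force SAT check: enumerate truth assignments to the atoms and
# keep those satisfying every clause of the translated program.
def sat_models(n_atoms, clauses):
    return [a for a in product([False, True], repeat=n_atoms)
            if all(clause(a) for clause in clauses)]

clauses = [
    lambda a: a[0] == (not a[1]),   # p <-> not q
    lambda a: a[1] == (not a[0]),   # q <-> not p
]
models = sat_models(2, clauses)     # [(False, True), (True, False)]
```

A real translation (and Z3 itself) replaces this exponential enumeration with clever search; the point here is only the shape of the reduction.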
9 | + | |
10 | +## References | |
11 | + | |
12 | +1. [Programming Z3](https://theory.stanford.edu/~nikolaj/programmingz3.html): <https://theory.stanford.edu/~nikolaj/programmingz3.html>. | |
13 | +2. [Julia Package](https://www.juliapackages.com/p/z3): <https://www.juliapackages.com/p/z3>. | |
14 | +3. [Repository](https://github.com/Z3Prover): <https://github.com/Z3Prover>. | |
0 | 15 | \ No newline at end of file | ... | ... |