In which we look at two pragmatic hacks that, when pushed further with added constraints, lead to the Bayesian approach to probability.

## Coinflips

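Let’s say we have a coin, and we want to decide whether it’s fair. We throw it $N$ times and get $m$ heads; we code heads=1 and tails=0. With $\mu$ the ratio of heads, that is, the probability that a toss $x$ lands on heads:

$$P(x=1|\mu) = \mu, \qquad P(x=0|\mu) = 1-\mu, \qquad \text{i.e.} \quad P(x|\mu) = \mu^{x}(1-\mu)^{1-x}$$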

### Maximum likelihood

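How do we set $\mu$? We could maximize the probability of the data that we saw under our model, that is, maximize the *likelihood*. Let’s say that $D = \{x_1 \dots x_N\}$; then, assuming independent tosses, we have:

$$P(D|\mu) = \prod_{n=1}^{N} P(x_n|\mu) = \prod_{n=1}^{N} \mu^{x_n}(1-\mu)^{1-x_n} = \mu^{m}(1-\mu)^{N-m}$$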

The maximum of this function of $\mu$ is reached for $\mu= \frac{m}{N}$. The problem arises if we have little data (in fact, when we have data that does not cover the whole space of possible data). If $D=(1,1,1)$, the maximum likelihood estimate of $\mu$ will be $1.0$. It means that we predict that *all* the tosses will land on heads, after only three observations!
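As a quick sanity check of this failure mode, here is a minimal Python sketch. The function names (`likelihood`, `ml_estimate`) and the grid search are illustrative choices of mine, not part of the original derivation; the sketch assumes independent tosses coded heads=1, tails=0.

```python
def likelihood(mu, data):
    """P(D | mu) for independent tosses coded heads=1, tails=0."""
    m = sum(data)  # number of heads
    n = len(data)  # number of tosses
    return mu ** m * (1 - mu) ** (n - m)


def ml_estimate(data):
    """Closed-form maximizer of the likelihood: m / N."""
    return sum(data) / len(data)


data = [1, 1, 1]

# Brute-force check: evaluate the likelihood on a grid of candidate values.
grid = [i / 100 for i in range(101)]
mu_grid = max(grid, key=lambda mu: likelihood(mu, data))

mu_ml = ml_estimate(data)
print(mu_ml, mu_grid)
```

Both the closed form and the grid search put all the probability on heads for this tiny data set.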

### Smoothing

A classical hack is to smooth the maximum likelihood estimate by adding “fake data”: we could consider that we already saw the coin land once on heads and once on tails before getting our data. This way, before (“prior to”) the experiment, we would have $\mu=1/2=0.5$. After (*posterior* to) our experiment, taking the data $D=(1,1,1)$ into account, we would have $\mu = (3+1)/(3+2) = 0.8$. But how do we set these prior coin flips (the smoothing parameters)?
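The smoothing hack can be sketched in a few lines of Python. The function name `smoothed_estimate` and its `fake_heads` / `fake_tails` parameters are hypothetical names chosen for this illustration; the defaults correspond to the "one head, one tail" choice above.

```python
def smoothed_estimate(data, fake_heads=1, fake_tails=1):
    """Ratio of heads after adding fake observations to the real data.

    `data` is a list of tosses coded heads=1, tails=0.
    """
    m = sum(data)
    n = len(data)
    return (m + fake_heads) / (n + fake_heads + fake_tails)


prior_estimate = smoothed_estimate([])             # no data yet
posterior_estimate = smoothed_estimate([1, 1, 1])  # after three heads
print(prior_estimate, posterior_estimate)
```

Changing the number of fake tosses changes how strongly the estimate resists the data, which is exactly the question of how to set the smoothing parameters.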

### Maximum A Posteriori

The *right way* to encode this prior knowledge is to put a probability distribution on the parameter $\mu$. As $\mu$ is a ratio, we should have a continuous distribution on $[0, 1]$ that can represent a whole range of prior beliefs about what the coin’s ratio of heads is. For these reasons, a sensible choice is the Beta distribution:

$$Beta(\mu|\alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\mu^{\alpha-1}(1-\mu)^{\beta-1}$$

On Wikipedia, we can check what the $Beta(x|\alpha, \beta)$ distribution looks like for different values of $\alpha$ and $\beta$.

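Now we can compute the *posterior* distribution of $\mu$ given the data $D$ and the Beta prior ($\propto$ means “proportional to”):

$$P(\mu|D, \alpha, \beta) \propto P(D|\mu)\,P(\mu|\alpha, \beta) \propto \mu^{m}(1-\mu)^{N-m}\,\mu^{\alpha-1}(1-\mu)^{\beta-1}$$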

Fortunately, the Beta distribution is the conjugate prior for the Bernoulli and binomial distributions, and thus a bit of calculus reduces it to:

$$P(\mu|D, \alpha, \beta) = Beta(\mu|m+\alpha, N-m+\beta)$$

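We can compute that, when $N \rightarrow \infty$, the expectation of $\mu$ tends to the maximum likelihood estimate, $\mathbb{E}[\mu] \rightarrow \mu_{ML}$, as:

$$\mathbb{E}[\mu] = \frac{m+\alpha}{N+\alpha+\beta} \approx \frac{m}{N} = \mu_{ML} \quad \text{when } N \gg \alpha+\beta$$

With $\alpha=\beta=1$ and $D=(1,1,1)$, this gives $(3+1)/(3+2)=0.8$, the smoothed estimate from before: $\alpha$ and $\beta$ play the role of the prior coin flips.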
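To see the conjugate update at work, here is a small sketch in plain Python. The helper names `posterior_params` and `posterior_mean` are mine, and the sketch relies on two standard facts: a $Beta(\alpha, \beta)$ prior combined with $m$ heads in $N$ tosses gives a $Beta(m+\alpha, N-m+\beta)$ posterior, and the mean of $Beta(a, b)$ is $a/(a+b)$. The "70% heads" data sets are made up for the illustration.

```python
def posterior_params(m, n, alpha, beta):
    """Parameters of the Beta posterior after m heads in n tosses."""
    return m + alpha, n - m + beta


def posterior_mean(m, n, alpha, beta):
    """Mean of the Beta posterior: (m + alpha) / (n + alpha + beta)."""
    a, b = posterior_params(m, n, alpha, beta)
    return a / (a + b)


# Three heads in three tosses, with a uniform Beta(1, 1) prior.
small_sample_mean = posterior_mean(3, 3, 1, 1)
print(small_sample_mean)

# Hypothetical data sets with 70% heads and growing size:
# the posterior mean gets closer and closer to m / N = 0.7.
for n in (10, 100, 10000):
    m = 7 * n // 10
    print(n, m / n, posterior_mean(m, n, 1, 1))
```

The prior matters when data is scarce and fades away as $N$ grows.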

### First conclusion

This approach of using a prior on the parameters of the distributions that are essential to our model (the predicting distribution) is central to the Bayesian approach to building models. It makes the model robust to what can happen, even when we have little data. It makes it easier to reason about our prior assumptions than simply “adding unseen data”, and the prior yields to the evidence as more data comes in.

If you’re interested in Bayesian modeling, there are plenty of very good textbooks. My preferred gradual introduction is MacKay’s ITILA, which you can find as a free ebook.

## Causality

Now here is another hack, this time for logical reasoning, that leads to Bayesian probabilities. Let’s say that you want to express that an event $A$ entails an event $B$: in logic you would write $A \Rightarrow B$. We will be abusing the notation $A=[A=true]$ and $\neg A=[A=false]$. With *modus ponens*, you can deduce $B$ whenever $A$ is true.

### Plausible reasoning

Now, we want to extend propositional logic to *plausible reasoning*, in which we can have degrees of probability that rules are true, or degrees of belief in these rules and facts. A pragmatic way to do that is to introduce the variable $C$ which represents $A \Rightarrow B$, that is: if $P(C)=p$, there is a probability $p$ that $A \Rightarrow B$. Then, the previous *modus ponens* translates to:

$$P(B|A,C) = \frac{P(A,B|C)}{P(A|C)}$$

And actually, as $P(A,B|C)=P(A|C)$, we have $P(B|A,C)=1$, which corresponds to the strong syllogism of *modus ponens*.

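So now, if we are only 80% sure of $C$, we can write $P(C) = 0.8$ and seek $P(B|A)$ (we are 100% sure of $A$). Assuming that $C$ is independent of $A$:

$$P(B|A) = P(B|A,C)\,P(C) + P(B|A,\neg C)\,P(\neg C) = 0.8 + 0.2 \times P(B|A,\neg C)$$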

This means that $B$ has an 80% chance of being true by following the strong syllogism of *modus ponens*, but it can also be true even though $C=false$.
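Numerically, this is just the law of total probability over $C$. The sketch below assumes that our belief in the rule $C$ does not depend on $A$ (so $P(C|A)=P(C)$); the function name `prob_b_given_a` and the parameter `p_b_given_a_not_c` (the probability of $B$ when $A$ holds but the rule does not) are hypothetical names introduced for this illustration.

```python
def prob_b_given_a(p_c, p_b_given_a_not_c):
    """P(B|A), marginalizing over whether the rule C (A => B) holds.

    Assumes C is independent of A, so that P(C|A) = P(C).
    """
    p_b_given_a_c = 1.0  # strong syllogism: if the rule holds, A entails B
    return p_b_given_a_c * p_c + p_b_given_a_not_c * (1.0 - p_c)


# With P(C) = 0.8, P(B|A) ranges from 0.8 to 1 depending on how likely
# B is when the rule does not hold.
for p in (0.0, 0.5, 1.0):
    print(p, prob_b_given_a(0.8, p))
```

The 80% is therefore a lower bound on $P(B|A)$, reached when $B$ can only be true through the rule.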

Finally, contrary to propositional logic, we *also* get the weak syllogism (and I’ll let you think it through):

$$P(A|B,C) = P(A|C)\,\frac{P(B|A,C)}{P(B|C)}$$

A similar derivation and observation can be done for *modus tollens*.
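If you want to check your reasoning on the weak syllogism, here is a small numeric sketch using Bayes’ rule with the rule $C$ taken as given, so that $P(B|A,C)=1$. The function name `prob_a_given_b` and the example numbers are assumptions made up for this illustration.

```python
def prob_a_given_b(p_a, p_b_given_not_a):
    """P(A|B,C) when the rule C (A => B) holds, so that P(B|A,C) = 1.

    p_a is P(A|C) and p_b_given_not_a is P(B|not A, C).
    """
    p_b = 1.0 * p_a + p_b_given_not_a * (1.0 - p_a)  # P(B|C)
    if p_b == 0.0:
        raise ValueError("B has probability zero, cannot condition on it")
    return 1.0 * p_a / p_b  # Bayes' rule


p_a = 0.3
p_a_after_seeing_b = prob_a_given_b(p_a, 0.5)
print(p_a, p_a_after_seeing_b)
```

Since $P(B|C) \leq 1$, observing $B$ can only leave $A$ as plausible as before or make it more plausible, never less.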

### Cox-Jaynes theorem

A reasoning mechanism needs to be consistent (one cannot prove $A$ and $\neg A$ at the same time). For plausible reasoning, consistency means: a) all the possible ways to reach a conclusion lead to the same result, b) information cannot be ignored, c) two equal states of knowledge have the same plausibilities. Adding consistency to plausible reasoning leads to Cox’s theorem, which derives the laws of probability (the product rule and the sum rule). So, the degrees of belief of any consistent induction mechanism satisfy Kolmogorov’s axioms.

### Second and last conclusion

With plausible reasoning, we get all the benefits of propositional logic, but we can also reason with and about facts and rules that are not 100% true. We have another example of how a pragmatic (sensible) hack to extend logic to “degrees of belief” (probabilities) leads to Bayesian probabilities.

If you are interested in learning about plausible reasoning, you can look at my thesis, or, better yet, read it directly from one of the masters in Jaynes’s Probability Theory: The Logic of Science, for which a pre-print is available.