KL Divergence Background
Suppose there is a true generative distribution \(P\) and an agent that approximates this distribution with \(Q\). The Kullback-Leibler (KL) divergence can be interpreted as the average relative surprise the agent will experience. This definition can be built up in three stages: 'surprise', 'relative surprise', and 'average relative surprise'.
Surprise
Intuitively, the surprise is a measure associated with observing a sample that increases as the probability of that sample decreases. We would also like it to have the property that if we have two independent events, the surprise of the 'event' of observing both is the sum of their individual surprises. Per the rules of probability, the probability of observing two independent events is the product of their individual probabilities. Taken together, these desiderata are satisfied by the negative logarithm of the probability.
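Concretely, using base 2 (the convention for surprise used throughout this post), the surprise of a sample \(x\) is
\[
\text{surprise}(x) = -\log_2 P(x),
\]
and for independent events \(x\) and \(y\),
\[
-\log_2\big(P(x)\,P(y)\big) = -\log_2 P(x) - \log_2 P(y).
\]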
Side Note on PDFs and Surprise
If the distribution \(P(x)\) is a continuous density, then the surprise has the same formula, but it will be negative at points where the density exceeds 1. It is better thought of as a 'surprise density'. Consider a region of \(P(x)\) where the density is everywhere above one (say, equal to 8), with length \(\Delta x = 0.0125\), so that its definite integral (the probability mass) is 0.1. Then the surprise (in bits) at each point in this region is \(-\log_2(8) = -3\). Since the probability mass is 0.1, the surprise in bits for the region as a whole is \(-\log_2(0.1) \approx 3.322\). We can convert from the 'surprise density' to the 'surprise mass' by adding \(-\log_2(\Delta x)\). Because the probability mass is at most 1, this always yields a nonnegative surprise mass.
To summarize the above, if you don't like thinking about negative surprise, just mentally add a constant offset to get a positive surprise mass.
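As a quick numerical check of the example above (a minimal sketch using only the standard library; the density value and interval width are the ones from the text):

```python
import math

density = 8.0        # value of the density P(x) throughout the small region
dx = 0.0125          # width of the region
mass = density * dx  # probability mass in the region: 0.1

surprise_density = -math.log2(density)  # -log2(8) = -3 bits
surprise_mass = -math.log2(mass)        # -log2(0.1) ≈ 3.322 bits

# Converting the surprise density to the surprise mass means adding -log2(dx):
assert math.isclose(surprise_mass, surprise_density + (-math.log2(dx)))
print(surprise_density, surprise_mass)
```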
Relative Surprise
The relative surprise is the difference between the surprise that the agent expects from a sample and the surprise it should expect if it had a perfect model of the true generative distribution. That we have a quantity interpreted as the 'expected surprise' is counterintuitive; if an agent expects to be surprised, is it really surprised? It makes more sense when we think about aggregating many samples at a particular point in the sample space. Even if the agent knew the true distribution, it would still be surprised by samples coming from a low-probability region, so in this sense it expects to be surprised by those samples. But if there is a mismatch (divergence) between the agent's model and the true distribution, the agent will be more or less surprised by these samples than it would be if it knew the true generative distribution.
Suppose we have two beta distributions, as shown below, representing the true generative distribution and the agent's current approximation respectively. Then intuitively, the relative surprise is how much more surprised the agent is when observing a sample at \(x\) than it ought to be. Algebraically:
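\[
\text{relative surprise}(x) = -\log_2 Q(x) - \big(-\log_2 P(x)\big) = \log_2 \frac{P(x)}{Q(x)}
\]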
In the plot below, the individual surprises for two beta distributions are shown, as well as their relative surprise.
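For concreteness, here is a small sketch of the quantities plotted (the specific beta parameters are illustrative placeholders, not necessarily the ones used in the plot):

```python
import numpy as np
from scipy.stats import beta

x = np.linspace(0.001, 0.999, 999)

# Illustrative parameters: P is the true distribution, Q is the agent's approximation.
p = beta(5, 3).pdf(x)   # true generative distribution P(x)
q = beta(2, 2).pdf(x)   # agent's approximation Q(x)

surprise_p = -np.log2(p)                     # surprise density under the true model
surprise_q = -np.log2(q)                     # surprise density under the agent's model
relative_surprise = surprise_q - surprise_p  # equals log2(P(x) / Q(x))
```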
Average Relative Surprise: KL Divergence
To measure how well \(Q\) approximates \(P\) overall, we can weight the relative surprises by how often the agent will encounter them. In other words, we should weight by \(P(x)\). This is shown as the dotted green line in the plot below. Then, we can integrate over the entire sample space to get the 'average relative surprise', which is the Kullback-Leibler (KL) divergence.
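Written out, with the natural logarithm (the convention used for the KL divergence in the code accompanying this post):
\[
D_{\mathrm{KL}}(P \,\|\, Q) = \int P(x) \, \ln \frac{P(x)}{Q(x)} \, dx
\]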
When \(Q\) matches \(P\) perfectly, the relative surprise is zero everywhere, so the KL divergence is zero. When \(Q\) underestimates \(P\) in regions where \(P\) has substantial probability mass, those regions contribute positively to the integral, increasing the KL divergence. On the other hand, when \(Q\) overestimates \(P\), the agent has underestimated the surprise, so those regions contribute negatively. However, since the agent experiences this underestimated surprise less often, it contributes less to the average. Taken together, the positive regions always outweigh the negative ones: the KL divergence can never be negative, and it is zero only when the agent's estimate of \(P\) is exact.
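A minimal numerical sketch of the integral (reusing the illustrative parameters from above, and assuming scipy is available):

```python
import numpy as np
from scipy.stats import beta
from scipy.integrate import quad

p = beta(5, 3)   # true generative distribution P
q = beta(2, 2)   # agent's approximation Q

def weighted_relative_surprise(x):
    # P(x) * ln(P(x) / Q(x)): the integrand of the KL divergence, in nats.
    return p.pdf(x) * np.log(p.pdf(x) / q.pdf(x))

# Integrate just inside the endpoints, where both densities vanish.
kl, _ = quad(weighted_relative_surprise, 1e-9, 1 - 1e-9)
print(kl)  # always >= 0, and 0 only when P and Q coincide
```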
Subjective Logic
A lot of the inspiration for this post came from this Opinion Visualization Demo, created by Gaëtan Bouguier. The demo visualizes the correspondence between opinions in Audun Jøsang's Subjective Logic framework and beta distributions. In this post, I used precision as one of the parameters for the beta distribution, where precision is the inverse of the variance of the beta. Given the two usual parameters \(\alpha\) and \(\beta\), the precision is given by:
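\[
\text{precision} = \frac{1}{\operatorname{Var}} = \frac{(\alpha + \beta)^2 (\alpha + \beta + 1)}{\alpha \beta}
\]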
In a subjective logic opinion, the distribution is parameterized by uncertainty. Using equation 3.11 of Jøsang's book Subjective Logic (2016), if we assume an uninformative prior, then \(W\) (the prior mass) is 2.
It's convenient to define an evidence parameter \(S\), which is the sum of \(\alpha\) and \(\beta\). This is the number of samples the agent has seen plus \(W\), the prior evidence.
Then the uncertainty is given by
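\[
u = \frac{W}{S} = \frac{2}{S}
\]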
The expected value of the beta distribution is given by:
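\[
\mathbb{E}[x] = \frac{\alpha}{\alpha + \beta} = \frac{\alpha}{S}
\]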
This means that we can equally well parameterize the beta distribution by the following pairs, and freely convert between them:
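For example, the formulas above let us convert between \((\alpha, \beta)\), (expected value, evidence \(S\)), and (expected value, uncertainty \(u\)). A small sketch (the function names are my own, not necessarily those in beta_lib.py):

```python
W = 2  # prior mass for an uninformative prior

def to_expected_value_and_uncertainty(alpha, beta_):
    """Convert (alpha, beta) to (expected value, uncertainty)."""
    s = alpha + beta_          # evidence parameter S
    return alpha / s, W / s    # E[x] = alpha / S, u = W / S

def from_expected_value_and_uncertainty(e, u):
    """Convert (expected value, uncertainty) back to (alpha, beta)."""
    s = W / u                  # S = W / u
    alpha = e * s              # alpha = E[x] * S
    return alpha, s - alpha    # beta = S - alpha

print(from_expected_value_and_uncertainty(*to_expected_value_and_uncertainty(5, 3)))  # (5.0, 3.0)
```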
Python Code for Computing KL Divergence of Two Beta Distributions
I tested the numerical values of the JavaScript code on this site against a Python implementation. The Python library is at beta_lib.py, with the corresponding tests at beta_lib_test.py. In both the JavaScript and Python versions, surprise is computed in base 2, while the KL divergence is computed in natural logarithms.
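For reference, the KL divergence between two beta distributions also has a closed form in terms of the log-beta and digamma functions. Here is a sketch using scipy (this is the standard identity, not necessarily the approach taken in beta_lib.py):

```python
from scipy.special import betaln, digamma

def beta_kl(a1, b1, a2, b2):
    """KL divergence (in nats) from Beta(a1, b1) to Beta(a2, b2)."""
    return (betaln(a2, b2) - betaln(a1, b1)
            + (a1 - a2) * digamma(a1)
            + (b1 - b2) * digamma(b1)
            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))

print(beta_kl(5, 3, 2, 2))  # should match the numerical integral above
print(beta_kl(5, 3, 5, 3))  # 0.0 when the two distributions coincide
```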