Data Generation

Suppose the data we are analyzing is generated by the procedure shown in the figure above. For each data point, we start at the top. The darkness of the arrows indicates the relative probabilities that a single class will be chosen. There are 5 classes, labeled 1 through 5, and each class produces its own distribution over 6 binary effects.
So the entire procedure for data generation is: start at the top and pick a cause according to a probability distribution. Then, the presence or absence of each effect is sampled from a Bernoulli distribution with a probability entirely determined by the cause.
This generative model is a simple example of a Bayesian Network. A Bayesian Network specifies a joint probability distribution with a very useful property: if the values of the parents of a particular variable are fixed, no change in variables anywhere else in the network will influence the value of that variable. In our case, once a particular cause is chosen, if we perform an experiment where we change the value of one effect, the probabilities of the other effects do not change: each effect depends only on the cause.
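Concretely, this conditional independence means the joint distribution of our network factors into the cause's prior and one term per effect. Writing $c$ for the cause and $e_1, \dots, e_6$ for the effects,

$$P(c, e_1, \dots, e_6) = P(c) \prod_{i=1}^{6} P(e_i \mid c).$$

Each effect node has the cause as its only parent, so once $c$ is fixed the effects are independent of one another.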
The central problem: how can we determine which cause was present if we can only observe the effects? To solve this problem we will use the naive Bayes classifier. Before we get into the classifier itself, let's generate data as described above.
Generate Data
import numpy as np
import matplotlib.pyplot as plt
import torch
import pyro
import pyro.distributions as dist
cause_probs = [.1, .05, .4, .25, .2]
def sample_point():
    # Pick a cause, then sample each effect from a Bernoulli distribution
    # whose probability is determined by that cause.
    cause = dist.Categorical(torch.tensor(cause_probs)).sample().item()
    if cause == 0:
        effect_probs = [0, 0, 0, 0, 0, 1]
    elif cause == 1:
        effect_probs = [.2, .2, .2, .2, .2, 0]
    elif cause == 2:
        effect_probs = [.5, .5, 0, 0, 0, 0]
    elif cause == 3:
        effect_probs = [.1, .1, .2, .2, .01, .39]
    elif cause == 4:
        effect_probs = [.05, .05, .3, .3, .2, .1]
    effects = [dist.Bernoulli(e).sample().item() for e in effect_probs]
    return cause, effects
data = [sample_point() for i in range(1000)]
causes = np.array([d[0] for d in data])
effects = np.array([d[1] for d in data])
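Though not part of the walkthrough, a quick sanity check on the simulated data is to compare the empirical frequency of each cause against cause_probs; with 1000 points the two should agree fairly closely. The snippet below is a minimal sketch using the arrays we just built.

labels, counts = np.unique(causes, return_counts=True)
empirical = counts / counts.sum()
for label, p_hat, p_true in zip(labels, empirical, cause_probs):
    # Empirical frequency of each cause vs. the probability used to generate it.
    print(f"Cause {int(label) + 1}: empirical {p_hat:.3f}, true {p_true:.3f}")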
Naive Bayes Classifier
The naive Bayes classifier uses Bayes' rule and a conditional independence assumption to determine which class a particular data point belongs to. In our generative framework, this means finding the cause that generated the particular effects we observed. Bayes' rule is given by:

$$P(\text{cause} \mid \text{effects}) = \frac{P(\text{effects} \mid \text{cause}) \, P(\text{cause})}{P(\text{effects})}$$

The prior $P(\text{cause})$ is the probability of each cause before any effects are observed; we estimate it from how often each cause appears in the data. The likelihood $P(\text{effects} \mid \text{cause})$ is where the "naive" assumption comes in: we treat the effects as conditionally independent given the cause, so the likelihood factors into a product of one Bernoulli probability per effect. The denominator $P(\text{effects})$ is the same for every cause, so we can drop it and simply normalize at the end.
Instead of having to estimate 64 different probabilities for each class - one for each unique combination of effects (technically 63, because all probabilities have to sum to 1) - we only have to estimate 6. This is the power of the conditional independence assumption, and of the naive Bayes classifier. Our probability estimates will be much more accurate because we have fewer probabilities to estimate, so we have more data per estimate.
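To spell out the counting: 6 binary effects give $2^6 = 64$ possible effect combinations per class, or $63$ free parameters once they must sum to 1, so the full table would need $5 \times 63 = 315$ numbers. Under the independence assumption we need only one Bernoulli probability per effect per class, i.e. $5 \times 6 = 30$ numbers.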
Putting it all together, in order to estimate the posterior probability distribution over causes given the effects, we compute

$$P(\text{cause} = j \mid \text{effects}) \propto P(\text{cause} = j) \prod_{i=1}^{6} P(\text{effect}_i \mid \text{cause} = j)$$

for each cause $j$ and then normalize so the five values sum to 1.
Let's do this in code:
## Get prior
_, counts = np.unique(causes, return_counts=True)
prior = counts / np.sum(counts)
## Get array of likelihoods
likelihoods = []
for j in range(5):
    # Fraction of points with cause j in which each effect was present;
    # this estimates P(effect_i = 1 | cause = j).
    effects_j = effects[np.where(causes == j)[0]]
    likelihoods.append(np.mean(effects_j, axis=0))
likelihoods = np.array(likelihoods)
def get_posterior(effects):
    # Unnormalized posterior: prior[j] times the product of per-effect likelihoods.
    unnormalized_posterior = []
    for j in range(len(likelihoods)):
        lj = likelihoods[j]
        l = [lj[i] if effects[i] == 1 else 1 - lj[i] for i in range(len(effects))]
        unnormalized_posterior.append(np.prod(l) * prior[j])
    posterior = unnormalized_posterior / np.sum(unnormalized_posterior)
    return posterior
cause, effects = sample_point()
posterior = get_posterior(effects)
print(f"Cause: {1 + cause}")
print(f"Effects: {effects}")
print(f"Posterior: {posterior}")
print(f"Predicted Cause: {1 + np.argmax(posterior)}")
Output:
Cause: 3
Effects: [1 1 0 0 0 0]
Posterior: [0 0.014 0.974 0.019 0.001]
Predicted Cause: 3
Here we see that for this data point, the hidden cause was 3, and the posterior puts almost all of its mass (about 0.97) on cause 3, so the classifier recovers the correct cause from the effects alone.
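A natural follow-up, not shown in the output above, is to measure how often the classifier recovers the true cause over many fresh draws from the same generative process. The helper below, evaluate_accuracy, is a hypothetical addition built from sample_point and get_posterior.

def evaluate_accuracy(n_points=2000):
    # Sample fresh (cause, effects) pairs and score argmax-posterior predictions.
    correct = 0
    for _ in range(n_points):
        cause, effects = sample_point()
        predicted = np.argmax(get_posterior(effects))
        correct += int(predicted == cause)
    return correct / n_points

print(f"Accuracy: {evaluate_accuracy():.3f}")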