3. User guide
3.1. Densities
pypmc revolves around adapting mixture densities of the form

$$q(x) = \sum_j \alpha_j \, q_j(x), \qquad \sum_j \alpha_j = 1, \quad \alpha_j \ge 0,$$

where each component $q_j$ is either a Gaussian

$$q_j(x) = \mathcal{N}(x \,|\, \mu_j, \Sigma_j)$$

or a Student's t distribution

$$q_j(x) = \mathcal{T}(x \,|\, \mu_j, \Sigma_j, \nu_j).$$

The free parameters of the mixture density, i.e. the component weights $\alpha_j$, the means $\mu_j$, the covariances $\Sigma_j$, and, in the Student's t case, the degrees of freedom $\nu_j$, are adjusted by the adaptation algorithms described below.
3.1.1. Component density
The two densities supported by pypmc, the Gaussian and the Student's t, come in two variants whose methods have identical names but differ in their arguments. The standard classes are Gauss and StudentT:
import numpy as np
import pypmc

mean = np.zeros(2)
sigma = np.array([[ 0.1  , -0.001],
                  [-0.001,  0.02 ]])
dof = 5.
gauss = pypmc.density.gauss.Gauss(mean, sigma)
student_t = pypmc.density.student_t.StudentT(mean, sigma, dof)

# density at (1, 1)
f = gauss.evaluate(np.ones(2))
# draw sample from density
x = gauss.propose()
For use as proposal densities in Markov chains (see below), there are also local variants whose mean can be varied in each call to evaluate or propose. LocalGauss and LocalStudentT do not take the mean argument in the constructor. To reproduce the previous results, one would do:
local_gauss = pypmc.density.gauss.LocalGauss(sigma)
f = local_gauss.evaluate(mean, np.ones(2))
x = local_gauss.propose(mean)
3.1.2. Mixture density
Mixture densities are represented by MixtureDensity. One can create a mixture density from a list of arbitrary component densities and weights. However, usually all one needs are the convenience shortcuts to create mixtures of Gaussian or Student's t components from means and covariances. For example, for two components with weights 60% and 40%:
from pypmc.density.mixture import create_gaussian_mixture, create_t_mixture

D = 2  # dimension of the parameter space
weights = np.array([0.6, 0.4])
means = [np.zeros(D), np.ones(D)]
covariances = [np.eye(D), np.eye(D)]
gauss_mixture = create_gaussian_mixture(means, covariances, weights)

dofs = [13, 17]
mixture = create_t_mixture(means, covariances, dofs, weights)
The most common interaction pattern with a mixture density requires only a few attributes and methods:
gauss_mixture.evaluate(np.zeros(D))
samples = gauss_mixture.propose(N=500)
D = gauss_mixture.dim
first_component = gauss_mixture.components[0]
second_weight = gauss_mixture.weights[1]
3.2. Indicator function
The indicator function of a region $V$ is one for $x \in V$ and zero otherwise. The indicator module provides indicator functions for a ball and a hyperrectangle in $\mathbb{R}^D$. A typical use case is to restrict the support of a target density to a finite region by merging the density with an indicator:
import numpy as np
import pypmc
from pypmc.tools.indicator import merge_function_with_indicator

# define unbounded density on log scale (here a unit gaussian)
def target_density(x):
    return -0.5 * x.dot(x)

# define indicator; `priors` is a list of prior objects with
# range_min/range_max attributes defined elsewhere
ind_lower = [p.range_min for p in priors]
ind_upper = [p.range_max for p in priors]
ind = pypmc.tools.indicator.hyperrectangle(ind_lower, ind_upper)

# merge with indicator: outside of the support, return -inf
log_target = merge_function_with_indicator(target_density, ind, -np.inf)
3.3. Markov chain
3.3.1. Initialization
We provide a generic implementation of adaptive local-random-walk MCMC [HST01] featuring Gauss and Student's t local proposals. To create a MarkovChain, one needs three ingredients:

1. the target density, evaluated on the log scale;
2. a local proposal density;
3. a valid initial point.

For example:
from pypmc.density.student_t import LocalStudentT
from pypmc.sampler.markov_chain import AdaptiveMarkovChain

# unit gaussian, unnormalized, on the log scale
def log_target(x):
    return -0.5 * x.dot(x)

# example values for the proposal covariance and degrees of freedom
prop_sigma = np.array([[0.1, 0.0 ],
                       [0.0, 0.02]])
prop_dof = 50.
prop = LocalStudentT(prop_sigma, prop_dof)

start = np.array([-2., 10.])
mc = AdaptiveMarkovChain(log_target, prop, start)
The initial proposal covariance should be chosen similar to the target's covariance, but scaled to yield an acceptance rate in the range of 20%. For a Gaussian target and a Gaussian proposal in $D$ dimensions, a well-known rule of thumb is to scale the target covariance by $2.38^2/D$.
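As an illustration of this rule of thumb, here is a minimal sketch; target_sigma stands for a hypothetical estimate of the target covariance (e.g. from an earlier run), and the degrees of freedom are arbitrary:

import numpy as np
from pypmc.density.student_t import LocalStudentT

# hypothetical covariance estimate of the target
target_sigma = np.array([[1.0, 0.5],
                         [0.5, 2.0]])
D = len(target_sigma)

# scale the covariance estimate by 2.38**2 / D for the proposal
prop_sigma = 2.38**2 / D * target_sigma
prop = LocalStudentT(prop_sigma, 5.)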
In order to constrain the support of the target in a simple way, one can pass an indicator function to the constructor using the keyword argument indicator. Any proposed point is then first checked to lie in the support; i.e., whether indicator(x) == True. Only then is the target density called. This leads to significant speed-ups if the mass of the target density is close to a boundary and its evaluation is slow.
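For instance, reusing log_target, prop, start, and the hyperrectangle ind from the previous sections, and assuming the indicator keyword of the constructor as described above, a sketch looks like this:

# proposed points outside the hyperrectangle are rejected
# without ever calling the target density
mc = AdaptiveMarkovChain(log_target, prop, start, indicator=ind)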
3.3.2. Adaptation
The prototypical use is to run the chain for a number of iterations until it finds the bulk of the distribution, and to discard these samples as burn-in or warm-up. Then the samples can be used to tune the proposal covariance:
mc.run(10**4)
mc.clear()

# run 100,000 steps, adapting the proposal every 500 steps;
# also accumulate the accept count returned by mc.run
accept_count = 0
for i in range(200):
    accept_count += mc.run(500)
    mc.adapt()
Note that the proposal can be tuned continuously; the Markov property is then lost, but the samples are still asymptotically distributed according to the target. That is, there is no need to fix the proposal to generate valid samples.
Parameters such as the desired minimum and maximum acceptance rate can be set via set_adapt_params.
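For example, to narrow the acceptance-rate window that the adaptation aims for, a sketch could look as follows; the keyword names force_acceptance_min and force_acceptance_max are assumed from set_adapt_params, and the values are illustrative:

# aim for an acceptance rate between 15% and 35% during adaptation
mc.set_adapt_params(force_acceptance_min=0.15,
                    force_acceptance_max=0.35)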
3.4. Importance sampling
3.4.1. Standard
The standard ImportanceSampler implements serial importance sampling to compute the expectation of some function $f$ under the target density $P$,

$$E_P[f] = \int f(x) \, P(x) \, \mathrm{d}x \approx \frac{1}{N} \sum_{i=1}^{N} w_i \, f(x_i),$$

where the samples $x_i$ are drawn from a proposal density $q$ and the importance weights are $w_i = P(x_i)/q(x_i)$. To start, one only needs the target density (on the log scale) and a proposal density:
from pypmc.sampler.importance_sampling import ImportanceSampler

# log_proposal: a proposal density object (e.g. a MixtureDensity)
sampler = ImportanceSampler(log_target, log_proposal)
Optionally, the sampler accepts an indicator; see Indicator function. What to do with sampler? Run it:
sampler.run(N=500)
to draw 500 samples. If the proposal is a MixtureDensity and the option trace_sort=True is set, then run returns the generating component of each sample.
The samples and weights are stored in two History objects:
samples = sampler.samples[-1]
weights = sampler.weights[-1]
Note that a History object can contain the output of several runs; the last one is available as history[-1].
The samples are ordered according to the generating component if trace_sort=True.
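To access the accumulated output of all runs at once, a small sketch (it assumes the History objects support the [:] syntax to concatenate all runs):

# output of run i: history[i]; all runs concatenated: history[:]
all_samples = sampler.samples[:]
all_weights = sampler.weights[:]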
3.4.2. Deterministic mixture
If weighted samples from the same target but different proposal densities are available, the weights can be combined in a clever way as though the samples had been drawn from the mixture of the individual proposals [Cor+12]. This preserves the unbiasedness of the fundamental estimate of importance sampling. The motivation to combine multiple proposals is to reduce the variance of the estimator by suppressing the effect of outliers; i.e., samples with very large weights in the tails of the proposal.
The function combine_weights takes the samples and regular importance weights as lists of arrays and the proposals as a list, and returns the combined weights as a History object such that the weights for each proposal are easily accessible.
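A sketch of this calling convention, assuming two earlier importance-sampling runs with proposals proposal1 and proposal2 that produced the arrays (samples1, weights1) and (samples2, weights2):

from pypmc.sampler.importance_sampling import combine_weights

# one list entry per run: samples, standard importance weights, proposals
combined = combine_weights([samples1, samples2],
                           [weights1, weights2],
                           [proposal1, proposal2])
# combined weights of the second run
combined_weights_2 = combined[-1]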
3.4.3. Comparison
Compared to the regular ImportanceSampler, combine_weights requires more memory and slightly more CPU time, but it usually increases the relative effective sample size and, in most cases, significantly increases the total effective sample size compared to throwing away the samples from all but the last run. If the samples are all drawn from the same proposal, both approaches yield identical results.
3.5. PMC
Population Monte Carlo [Cap+08] is a class of algorithms designed to approximate the target density by a mixture density. The basic idea is to minimize the Kullback-Leibler divergence between the target and the mixture by optimizing the mixture parameters. The expectation values taken over the unknown target distribution are approximated by importance sampling using samples from the proposal mixture; the set of samples is the population. The algorithm is a form of expectation maximization (EM) and yields the optimal values of the parameters of a Gaussian or Student’s t mixture density. The crucial task (more on this below) is to supply a good initial proposal.
3.5.1. Basic approach
In the simplest scheme, new samples are drawn from the current proposal mixture, the mixture parameters are then updated based on these weighted samples, and the procedure is repeated for a fixed number of steps:
from pypmc.density.mixture import MixtureDensity
from pypmc.sampler.importance_sampling import ImportanceSampler
from pypmc.mix_adapt.pmc import gaussian_pmc

# initial_prop_components: list of component densities defined beforehand
initial_proposal = MixtureDensity(initial_prop_components)
sampler = ImportanceSampler(log_target, initial_proposal)

generating_components = []
for i in range(10):
    generating_components.append(sampler.run(10**3, trace_sort=True))
    samples = sampler.samples[-1]
    weights = sampler.weights[-1]
    gaussian_pmc(samples, sampler.proposal, weights,
                 latent=generating_components[-1],
                 mincount=20, rb=True, copy=False)
In the example code, we keep track of which sample came from which component by passing trace_sort=True to the run method, which then returns the index of the generating component for each sample. The PMC update can use this information to prune irrelevant components that contributed fewer than mincount samples. If mincount=0, the pruning is disabled. This may lead to many components with vanishing weights, which can slow down the PMC update, but otherwise does no harm.
Note that the actual parameter update needs the latent variables, but when using the recommended Rao-Blackwellization (rb=True), the generating components are ignored and the corresponding latent variables are inferred from the data. This is more time consuming but leads to more robust fits [Cap+08]. The faster but less powerful variant rb=False then requires that the generating components be passed via latent.
The keyword copy=False allows gaussian_pmc to update the density in place.
3.5.2. Student’s t
A Student's t mixture should be preferred over a Gaussian mixture if one suspects long tails in the target density. In the original proposal by Cappé et al. [Cap+08], the degrees of freedom of each component, $\nu_j$, are held fixed by the user; pypmc can optionally update them as well with a one-dimensional numerical solver in each PMC step.
The function student_t_pmc is invoked just like its Gaussian counterpart, but it has three extra arguments to limit the number of steps of the numerical solver (dof_solver_steps) and to pass the allowed range of values of $\nu_j$ (mindof, maxdof). The Student's t distribution converges to the Gaussian distribution as $\nu \to \infty$.
For small problems (few samples/components), the numerical solver may add a significant overhead to the overall time of one PMC update. But since it adds flexibility, our recommendation is to start with it and only turn it off (dof_solver_steps=0) if the overhead is intolerable.
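The call mirrors the Gaussian PMC update shown above; a sketch, reusing the variables from that example and assuming the proposal is a Student's t mixture (the concrete values of dof_solver_steps, mindof, and maxdof are illustrative):

from pypmc.mix_adapt.pmc import student_t_pmc

# update a Student's t mixture proposal in place, adapting the dof as well
student_t_pmc(samples, sampler.proposal, weights,
              latent=generating_components[-1],
              mincount=20, rb=True, copy=False,
              dof_solver_steps=100, mindof=5., maxdof=1e3)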
3.5.3. PMC with multiple EM steps
In order to make the most out of the available samples, it is better to run multiple EM update steps to infer the parameters of the mixture density. The convergence criterion is the likelihood value given in Eq. 5 of [Cap+08]. Depending on the sought precision, several hundred EM steps may be required. We advise the user to decide, based on the cost of computing new samples, whether it is worth running the EM for many iterations or if one gets better results by just computing new samples for a mixture that is not quite at the (local) optimum.
A PMC object handles the convergence testing for both Gaussian and Student's t mixtures as follows:
from pypmc.mix_adapt.pmc import PMC

# prop is the mixture proposal to adapt
pmc = PMC(samples, prop)
pmc.run(iterations=1000, rel_tol=1e-10, abs_tol=1e-5, verbose=True)
This means that a maximum of 1000 updates is performed, but the updates are stopped if convergence is reached before that. For details, see run, which calls gaussian_pmc or student_t_pmc to do the heavy lifting.
3.6. Variational Bayes
The general idea of variational Bayes is pedagogically explained in [Bis06], Ch. 10. In a nutshell, the unknown joint posterior density of hidden (or latent) data $Z$ and parameters $\theta$ given the observed data $X$, $P(Z, \theta \,|\, X)$, is approximated by a factorized density $q(Z)\,q(\theta)$. In our case, we assume the data $X = \{x_1, \dots, x_N\}$ to be distributed according to a Gaussian mixture,

$$P(X \,|\, \theta) = \prod_{n=1}^{N} \sum_{k=1}^{K} \alpha_k \, \mathcal{N}(x_n \,|\, \mu_k, \Sigma_k),$$

where the latent data, the component that generated each individual sample, have been marginalized out. The priors over the parameters are chosen conjugate: a Dirichlet prior for the component weights and a Gauss-Wishart prior for the means and precisions. Note that variational Bayes yields an approximation of the posterior over the mixture parameters, $q(\theta) \approx P(\theta \,|\, X)$, from which a single mixture can be extracted, e.g. at the posterior mode.
Perhaps the biggest advantage of variational Bayes over PMC is that we can choose a prior that is noninformative but still prevents the usual pathologies of maximum likelihood such as excessive model complexity due to components that are responsible for only one sample and whose covariance matrix shrinks to zero. Variational Bayes is very effective at automatically determining a suitable number of components by assigning weight zero to irrelevant components.
As opposed to PMC, variational Bayes has a natural convergence criterion, the lower bound to the evidence. We propose to run as many update steps as necessary until the change of the lower bound is less than some user-configurable number. Often the smaller that number, the more irrelevant components are removed.
We implement two variants of variational Bayes, both of which yield a posterior over the parameters of a Gaussian mixture. In either case, one can fully specify all hyperparameter values for both the prior and the starting point of the posterior.
The classic version [Bis06] is the most well known and widely used. It takes the individual samples as input data, whereas the second variant, VBMerge (see Mixture reduction below), operates directly on an input mixture.
3.6.1. Classic version
A basic example: draw samples from a standard Gaussian in 2D, then run variational Bayes to recover that exact Gaussian. Paste the following code into your Python shell and you should get plots similar to those shown, modulo the random data points:
import numpy as np
from pypmc.mix_adapt.variational import GaussianInference
from pypmc.tools import plot_mixture
import matplotlib.pyplot as plt
# data points
N = 500
data = np.random.normal(size=2*N).reshape(N, 2)
# maximum number of components in mixture
K = 6
vb = GaussianInference(data, components=K,
                       alpha=10*np.ones(K),
                       nu=3*np.ones(K))
# plot data and initial guess
plt.subplot(1, 2, 1)
plt.scatter(data[:, 0], data[:, 1], color='gray')
initial_mix = vb.make_mixture()
plot_mixture(initial_mix, cmap='gist_rainbow')
x_range = (-4, 4)
y_range = x_range
plt.xlim(x_range)
plt.ylim(y_range)
plt.gca().set_aspect('equal')
plt.title('Initial')
# compute variational Bayes posterior
vb.run(prune=0.5*len(data) / K, verbose=True)
# obtain most probable mixture and plot it
mix = vb.make_mixture()
plt.subplot(1, 2, 2)
plt.scatter(data[:, 0], data[:, 1], color='gray')
plt.xlim(x_range)
plt.ylim(y_range)
plt.gca().set_aspect('equal')
plot_mixture(mix, cmap='gist_rainbow')
plt.title('Final')
plt.show()
[Figure: scatter of the data points with the initial (left) and final (right) mixture overlaid.]

3.6.1.1. Initialization
In more complicated examples, it may be necessary to give good starting values to the means and covariances of the components in order to accelerate convergence to a sensible solution. You can pass this information when you create the GaussianInference object. Internally, the info is forwarded to a call to set_variational_parameters, where all parameter names and symbols are explained in detail. If an initial guess in the form of a Gaussian MixtureDensity is available, this can be used to define the initial values using GaussianInference(..., initial_guess=mixture).
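For instance, a minimal sketch that reuses the Gaussian mixture created in the Densities section as the starting point, with all remaining hyperparameters left at their defaults:

# initialize the variational posterior from an existing mixture density
vb_from_mixture = GaussianInference(data, initial_guess=gauss_mixture)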
Note that the vb object carries the posterior distribution of hyperparameters describing a Gaussian mixture. Invoking make_mixture() singles out the mixture at the mode of the posterior. To have a well-defined mode, one needs nu[k] > d and alpha[k] > 0 for at least one component k; the default hyperparameter values respect these constraints. The effect of GaussianInference(..., initial_guess="random") is to randomly select K data points and to start with a Gaussian of unit covariance at each of them. K is the maximum number of components and has to be chosen by the user. A safe procedure is to choose K larger than desired and let variational Bayes figure out the right value.
3.6.1.2. Running
Running variational Bayes with vb.run() can take a while if you have a lot of data points, lots of components, and high-dimensional data. Monitor the progress with verbose=True.
The pruning (removal) of components is determined by the prune keyword. After a VB update, every component is responsible for an effective number of samples. If this number is lower than the threshold set by prune, the component is pruned. In our experiments, a good rule of thumb to remove many components is to set the threshold to N/(2K), i.e., half the average number of samples per component; this is the value used in the example above (prune=0.5*len(data)/K).
3.6.1.3. Results
Continuing the example, you can inspect how all hyperparameters were updated by the data:
vb.prior_posterior()
and you can check that the mean of the most probable Gaussian (assuming the mixture only has one component) is close to zero and the covariance is close to the identity matrix:
mix = vb.make_mixture()
mix.components[0].mu
mix.components[0].sigma
3.7. Mixture reduction
Let us suppose samples are fed into a clustering algorithm that yields a Gaussian mixture. To save memory, we discard the samples and retain only the mixture as a description of the data. Assume the same procedure is carried out on different sets of samples from the same parent distribution, and we end up with a collection of mixture densities that contain similar information. How to combine them? A simple merge would be overly complex, as similar information is stored in every mixture. How then to compress this collection into one Gaussian mixture with fewer components but similar descriptive power? We provide two algorithms for this task, illustrated in the example Mixture reduction.
3.7.1. Hierarchical clustering
While the KL divergence between two Gaussians is known analytically, the corresponding result between Gaussian mixtures is not known. The hierarchical clustering described in [GR04] seeks to minimize an ad-hoc function used as a proxy for the metric between two Gaussian mixtures. The basic idea is very simple: map input components to output components such that every component in the output mixture is made up of an integer number of input components (regroup step). Then update the output component weights, means, and covariances (refit step). Continue until the metric is unchanged.
Note that this is a discrete problem: each input component is associated with only one output component, so if the mapping does not change, the metric does not change either. Output components can only die out if they receive no input component. Typically this is rare, so the number of output components is essentially chosen by the user, and not by the algorithm (Hierarchical). A user has to supply the input mixture and an initial guess of the output mixture, thereby defining the maximum number of components:
from pypmc.mix_adapt.hierarchical import Hierarchical
h = Hierarchical(input_mixture, initial_guess)
where both arguments are pypmc.density.mixture.MixtureDensity
objects. To perform the clustering:
h.run()
Optional arguments to pypmc.mix_adapt.hierarchical.Hierarchical.run are the tolerance by which the metric may change to declare convergence (eps), whether to remove output components with zero weight (kill), and the total number of (regroup + refit) steps (max_steps).
3.7.2. VBmerge
In [BGP10], a variational algorithm is derived in the limit of a large number of samples N. To create a VBMerge object, the required inputs are a MixtureDensity, the total number of samples N encoded in the input mixture, and the maximum number of output components K:
from pypmc.mix_adapt.variational import VBMerge
VBMerge(input_mixture, N, K)
As guidance, if
The classes VBMerge and GaussianInference share the same interface; please check Classic version.
The great advantage compared to hierarchical clustering is that the number of output components is chosen automatically. One starts with (too) many components, updates, and removes those components with small weight using the prune argument to pypmc.mix_adapt.variational.GaussianInference.run.
3.8. Putting it all together
The examples in the next section show how to use the different algorithms in practice. The most advanced example, MCMC + variational Bayes, demonstrates how to combine various algorithms to integrate and sample from a multimodal function:
run multiple Markov chains to learn the local features of the target density;
combine the samples into a mixture density with variational Bayes;
run importance sampling;
rerun variational Bayes on the importance samples;
repeat importance sampling with the improved proposal;
combine the samples with the deterministic-mixture approach.