Advanced Use¶
Log Likelihood Function¶
The Log Likelihood function needs to be defined in a separate .py file. It should be a function of one argument,
either a numpy array or a dictionary.
If you need to pass more information (e.g. data, covariance matrix, precision matrix, etc.) to the Log Likelihood function you should declare those as global variables. This is the easiest and most consistent way to make MPI not complain; it’s also the most computationally efficient method (i.e. passing the whole dataset to all processes eveytime you call the function can be slow).
Here we show a short toy example where we demonstrate how we should define such a function.
import numpy as np
ndim = 10
data = np.random.randn(ndim) # Random data vector
C = np.identity(ndim) # Identity Covariance Matrix
Cinv = np.linalg.inv(C) # Inverse Covariance Matrix
def log_likelihood(x): # Normal distribution
return -0.5*np.dot(x, np.dot(Cinv, x))
Parameter File¶
The parameter file can generally include more information than the options presented in the Quick Start page.
Likelihood¶
Usually the argument of the Log Likelihood function is a 1D numpy array but we can also use a dictionary instead.
To do so we need to add the dictionary: True option to the Likelihood block, for instance:
Likelihood:
path: path/to/logprob.py
function: log_likelihood
dictionary: True
Parameters¶
Every parameter needs to be either fixed or free:
- For fixed parameters we need to specify their value in Parameter block (i.e. parameter
ain the following example). - For free parameters we need to specify a prior instead. So far, only
uniformandnormalpriors are supported. For auniformprior we need to specify the uniform interval(min, max)(i.e. parameterbin the following example). For anormalprior we need to specify the meanlocand standard deviationscale(i.e. parametercin the following example).
Parameters:
a:
fixed: 1.0
b:
prior:
type: uniform
min: -1.0
max: 1.0
c:
prior:
type: normal
loc: 0.0
scale: 1.0
Sampler¶
cronus supports three different samplers, zeus (Default), emcee, and dynesty. The prefered sampler can be specified
using the name option in the Sampler section of the parameter file, for instance:
Sampler:
name: zeus
...
When either zeus or emcee is used as the prefered sampler then the following options are available:
ndimis the total number of parameters/dimensions.nwalkersis the total number of walkers (i.e. internal parallel chains for zeus or emcee). This number needs to be at least twice the number of free parameters.nchainsis the number of parallel chains, we recommend at least two and preferably 4 to get good estimate of the Gelman-Rubin diagnostic.ncheckspecifies the number of steps after which the samples are saved and the Convergence Criteria are assessed. The default value is 100 which means that the samples are saved and convergence is diagnosed every 100 steps.maxiterspecifies the maximum number of iterations (Default is inf).miniterspecifies the minimum number of iterations (Default is 0).maxcallspecifies the maximum number of Log Likelihood evalluations/calls (Default is inf).initialcontrols the initialization of the walker positions. The available options are:ellipse(this is a small ellipse around the Maximum a posteriori estimate, this is the default and recommended choice),laplace(sample the initial positions of the walkers from the Laplace approximation of the posterior distribution), andprior(sample the initial positions from the prior distribution, not the best choice).thinis the thinning rate for the chains (i.e. ifthin=5then save every 5th element to the chain). This can significantly reduce the size of the output files if the autocorrelation time of the chain is large. The default value is 1.
When dynesty is used as the prefered sampler then the following options are available:
ndimis the total number of parameters/dimensions.bounddlogzmaxiterspecifies the maximum number of iterations (Default is inf).maxcallspecifies the maximum number of Log Likelihood evalluations/calls (Default is inf).pfrac
Diagnostics¶
So far cronus includes two distinct convergence diagnostics, the Gelman-Rubin statistic and the Autocorrelation Time test.
Their combination seems to work well in Astrophysical and Cosmological likelihoods.
Lets see how one can customize the thresholds of those criteria:
- Either of them can be turned off or on (Default) using the
useargument. |R_hat - 1| < epsilonis the threshold for the Potential Scale Reduction Factor (PSRF). We recommend to use a value ofepsilonthat it is smaller than 0.05 (Default).- In terms of the Integrated Autocorrelation Time (IAT) we provide two criteria, if the chain is longer than
nact = 20(Default) times the estimated IAT and the IAT has changed less thandact = 3%(Default) the criteria are satisfied. If both Gelman-Rubin and IAT criteria are satisfied then sampling stops.
All of the diagnostic options can be seen here:
Diagnostics:
Gelman-Rubin:
use: True
epsilon: 0.05
Autocorrelation:
use: True
nact: 20
dact: 0.03
Output¶
The only option of the Output block is a directory path in which the samples/results will be saved. If
the provided directory doesn’t exist one will be created by cronus. The default directory is the current one.
Output: path/to/output/folder/chains
Running cronus¶
To run cronus, given a parameter file file.yaml, we execute the following command:
$ mpiexec -n [nprocesses] cronus-run file.yaml
where, nprocesses is the number of available CPUs. Depending on the cluster you are using you may need to use
mpirun or srun instead of mpiexec.
Note
For better performance we recommend to use a number of processes that can be divided by the number of chains nchains.
Ideally, we recommend to use nchains * (nwalkers/2 + 1) if available, there’s no real computational benefit in using
more than this.
Results¶
zeus or emcee¶
When either zeus or emcee is used as the prefered sampler then the results are saved as h5 files.
There are as many h5 files saved as the number of chains nchains. Each file contains two datasets, one
called samples which constists of the samples as the name suggests, and one named logprob which includes
the respective values of the Log Posterior Distribution.
After a few seconds of running the following files will be created in the provided Output directory:
chains ├── chain_0.h5 ├── chain_1.h5 ├── ... └── chain_[nchains].h5
The files will iteratively be updated every few iterations.
Note
You can access those results by doing:
import numpy as np import h5py with h5py.File('chains/chain_0.h5', "r") as hf: samples = np.copy(hf['samples']) logprob = np.copy(hf['logprob'])
The shape of the samples array would be (Iteration, nwalkers, ndim) and the shape of the Log Posterior array will
be (Iteration, nwalkers). You can easily flatten this, combining all the walkers into one chain and discarding
the first half of the chain, by running:
nsamples, nwalkers, ndim_prime = np.shape(samples) samples_flat = samples[nsamples//2:].reshape(-1, ndim_prime) logprob_flat = logprob[nsamples//2:].reshape(-1, 1)
dynesty¶
When dynesty is used as the sampler then the results are saved as a numpy npy format file.