Case Study: Bayesian Curve Fitting

In this case study, we use vegas to fit a straight line to data with outliers. We use the specialized integrator PDFIntegrator with a non-Gaussian probability density function (PDF) in a Bayesian analysis. We look at two examples, one with 4 parameters and the other with 22 parameters.

This case study is adapted from an example by Jake Vanderplas on his Python blog. It is also discussed in the documentation for the lsqfit module.

The Problem

We want to extrapolate the y values in the following figure to x=0:

_images/outliers1.png

A linear least-squares fit to the data (dotted line) is unconvincing; in particular, the extrapolated value at x=0 is larger than one while most of the data near x=0 suggest an intercept less than 0.5. The problem, of course, is caused by the outliers. There are at least three outliers.

A Solution

There are many ad hoc prescriptions for handling outliers. In the best of situations one has an explanation for the outliers and can model them accordingly. For example, we might know that some fraction w of the time our device malfunctions, resulting in much larger measurement errors than usual. This can be modeled in a Bayesian analysis by describing the data using a linear combination of two Gaussian probability density functions (PDFs) for each data point. One is the usual PDF proportional to \exp(-(y - f(x))^2 / (2\sigma_y^2)), where the fit function f(x) is c_0 + c_1 x. The second is the same but with \sigma_y replaced by b\sigma_y for some b > 1. The relative weights assigned to these two terms are 1-w and w, respectively.
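
Written out in this notation, the PDF for a single data point y_i measured at x_i is

    P(y_i) = (1-w) \exp(-(y_i - f(x_i))^2 / (2\sigma_{y_i}^2)) / \sqrt{2\pi\sigma_{y_i}^2}
             + w \exp(-(y_i - f(x_i))^2 / (2 b^2 \sigma_{y_i}^2)) / (b \sqrt{2\pi\sigma_{y_i}^2})

and the PDF for the full data set is the product of these factors over all data points. This is exactly what ModifiedPDF.__call__ computes in the code below, before multiplying by the prior's PDF.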

The following code does a Bayesian fit with this modified PDF:

import numpy as np
import gvar as gv
import vegas

def main():
    # 1) create data
    x = np.array([
        0.2, 0.4, 0.6, 0.8, 1.,
        1.2, 1.4, 1.6, 1.8, 2.,
        2.2, 2.4, 2.6, 2.8, 3.,
        3.2, 3.4, 3.6, 3.8
        ])
    y = gv.gvar([
        '0.38(20)', '2.89(20)', '0.85(20)', '0.59(20)', '2.88(20)',
        '1.44(20)', '0.73(20)', '1.23(20)', '1.68(20)', '1.36(20)',
        '1.51(20)', '1.73(20)', '2.16(20)', '1.85(20)', '2.00(20)',
        '2.11(20)', '2.75(20)', '0.86(20)', '2.73(20)'
        ])

    # 2) create prior and modified PDF
    prior = make_prior()
    mod_pdf = ModifiedPDF(data=(x, y), fitfcn=fitfcn, prior=prior)

    # 3) create integrator and adapt it to the modified PDF
    expval = vegas.PDFIntegrator(prior, pdf=mod_pdf)
    nitn = 10
    nstrat = [10, 10, 2, 2]
    warmup = expval(nstrat=nstrat, nitn=2*nitn)

    # 4) evaluate statistics for g(p)
    @vegas.rbatchintegrand
    def g(p):
        return {k:p[k] for k in ['c', 'b', 'w']}
    results = expval.stats(g, nitn=nitn)
    print(results.summary())

    # 5) print out results
    print('Bayesian fit results:')
    for k in results:
        print(f'      {k} = {results[k]}')
        if k == 'c':
            # correlation matrix for c
            print(
                ' corr_c =',
                np.array2string(gv.evalcorr(results['c']), prefix=10 * ' ', precision=3),
                )
    # Bayes Factor
    print('\n  logBF =', np.log(results.pdfnorm))

def fitfcn(x, p):
    c = p['c']
    return c[0] + c[1] * x

def make_prior(w_shape=()):
    prior = gv.BufferDict()
    prior['c'] = gv.gvar(['0(5)', '0(5)'])
    # uniform distributions for w and b
    prior['gw(w)'] = gv.BufferDict.uniform('gw', 0., 1., shape=w_shape)
    prior['gb(b)'] = gv.BufferDict.uniform('gb', 5., 20.)
    return prior

@vegas.rbatchintegrand
class ModifiedPDF:
    """ Modified PDF to account for measurement failure. """
    def __init__(self, data, fitfcn, prior):
        x, y = data
        self.fitfcn = fitfcn
        self.prior_pdf = gv.PDF(prior, mode='rbatch')
        # add rbatch index
        self.x = x[:, None]
        self.ymean = gv.mean(y)[:, None]
        self.yvar = gv.var(y)[:, None]

    def __call__(self, p):
        w = p['w']
        b = p['b']

        # modified PDF for data
        fxp = self.fitfcn(self.x, p)
        chi2 = (self.ymean - fxp) ** 2 / self.yvar
        norm = np.sqrt(2 * np.pi * self.yvar)
        y_pdf = np.exp(-chi2 / 2) / norm
        yb_pdf = np.exp(-chi2 / (2 * b**2)) / (b * norm)
        # product over PDFs for each y[i]
        data_pdf = np.prod((1 - w) * y_pdf + w * yb_pdf, axis=0)

        # multiply by prior PDF
        return data_pdf * self.prior_pdf(p)

if __name__ == '__main__':
    main()

Here class ModifiedPDF implements the modified PDF. The parameters for this distribution are the fit function coefficients c = p['c'], the weight w = p['w'], and the breadth parameter b = p['b']. As usual the PDF for the parameters (in __call__) is the product of a PDF for the data and a PDF for the prior. The data PDF combines two Gaussian distributions for each data point: one, y_pdf, with the nominal data errors and weight (1-w), and the other, yb_pdf, with errors that are b times larger and weight w. The PDF for the prior is implemented using gvar.PDF from the gvar module.

We want broad Gaussian priors for the fit function coefficients, but uniform priors for the weight parameter (0<w<1) and breadth parameter (5<b<20). An easy way to implement the uniform priors for use by vegas.PDFIntegrator is to replace the weight and breadth parameters by new parameters p['gw(w)'] and p['gb(b)'], respectively, that map the uniform distributions onto Gaussian distributions (0 ± 1). Values for the weight p['w'] and breadth p['b'] are then obtained from the new variables using the inverse map. This strategy is easily implemented using a gvar.BufferDict dictionary to describe the parameter prior.

The parameter priors are specified in make_prior() which returns the BufferDict dictionary, with a Gaussian random variable (a gvar.GVar) for each parameter. The fit function coefficients (prior['c']) have broad priors: 0 ± 5. The prior for parameter p['gw(w)'] is specified by

prior['gw(w)'] = gv.BufferDict.uniform('gw', 0., 1.)

which assigns it a Gaussian prior (0 ± 1) while also instructing any BufferDict dictionary p that includes a value for p['gw(w)'] to automatically generate the corresponding value for the weight p['w']. This makes the weight parameter available automatically even though vegas.PDFIntegrator integrates over p['gw(w)']. The same strategy is used for the breadth parameter.
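
As a quick illustration of this mechanism (a minimal sketch, separate from the fit script above), once the 'gw' distribution has been registered, any BufferDict that stores a value for 'gw(w)' also exposes the mapped variable 'w':

import gvar as gv

# register the 'gw' mapping (uniform on (0,1)) and use it as a prior entry,
# as in make_prior() above
prior = gv.BufferDict()
prior['gw(w)'] = gv.BufferDict.uniform('gw', 0., 1.)

# a BufferDict containing 'gw(w)' automatically provides the mapped value 'w'
p = gv.BufferDict()
p['gw(w)'] = 0.0          # Gaussian variable at its central value
print(p['w'])             # 0.5 -- the midpoint of the uniform interval (0,1)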

The Bayesian integrals are estimated using vegas.PDFIntegrator expval, which is created from the prior and the modified PDF. It is used to evaluate expectation values of arbitrary functions of the fit variables. Here it optimizes the integration variables for integrals of the prior’s PDF, but replaces that PDF with the modified PDF when evaluating expectation values.

We first call expval with no function, to allow the integrator to adapt to the modified PDF. We specify the number of stratifications for each direction in parameter space, concentrating stratifications in the c[0] and c[1] directions (because we expect more structure there); expval uses about 3000 integrand evaluations per iteration with this stratification.
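
The warmup call returns an ordinary vegas result; if desired (this is not done in the script above), its summary can be printed to check how the adaptation went:

# optional: inspect the warmup iterations
print(warmup.summary())     # per-iteration results from the warmup integration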

We next use the integrator’s PDFIntegrator.stats() method to evaluate the expectation values, standard deviations and correlations of the functions of the fit parameters returned by g(p). The output dictionary results contains fit results (gvar.GVars) for each of the fit parameters.

Note that g(p) and mod_pdf(p) are both batch integrands, with the batch index on the right (i.e., the last index). This significantly reduces the time required for the integrations.
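
As an aside, the array shapes involved in ModifiedPDF.__call__ can be checked with plain numpy (illustrative only; the batch size nbatch is chosen by vegas at run time):

import numpy as np

nbatch = 1000
c = np.zeros((2, nbatch))              # p['c']: parameter index first, batch index last
w = np.zeros(nbatch)                   # p['w'] and p['b'] have shape (nbatch,)
x = np.linspace(0.2, 3.8, 19)[:, None] # self.x: shape (19, 1), broadcasts over the batch
fxp = c[0] + c[1] * x                  # fit function values, shape (19, nbatch)
print(fxp.shape)                       # (19, 1000)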

The results from this code are as follows:

itn   integral        average         chi2/dof        Q
-------------------------------------------------------
  1   4.437(45)e-11   4.437(45)e-11       0.00     1.00
  2   4.457(43)e-11   4.447(31)e-11       0.30     1.00
  3   4.509(49)e-11   4.468(26)e-11       0.81     0.76
  4   4.548(60)e-11   4.488(25)e-11       0.61     0.98
  5   4.491(45)e-11   4.488(22)e-11       0.54     1.00
  6   4.523(42)e-11   4.494(20)e-11       0.57     1.00
  7   4.444(42)e-11   4.487(18)e-11       0.58     1.00
  8   4.433(42)e-11   4.480(16)e-11       0.62     1.00
  9   4.530(48)e-11   4.486(16)e-11       0.62     1.00
 10   4.511(47)e-11   4.488(15)e-11       0.63     1.00

Bayesian fit results:
      c = [0.29(14) 0.620(58)]
 corr_c = [[ 1.    -0.897]
           [-0.897  1.   ]]
      b = 10.6(3.7)
      w = 0.27(12)

  logBF = -23.8270(33)

The table shows results for the normalization of the modified PDF from each of the nitn=10 iterations of the vegas algorithm used to estimate the integrals. The logarithm of the normalization (logBF) is -23.8. This is the logarithm of the Bayes Factor (or Evidence) for the fit. It is much larger than the value -117.5 obtained from a least-squares fit (i.e., from the script above but with w=0 in the PDF). This means that the data much prefer the modified PDF (by a factor of exp(-23.8 + 117.5), or about 10^41).

The new fit parameters are much more reasonable than the results from the least-squares fit. In particular the intercept is 0.29(14) which is much more plausible than the least-squares result (compare the dashed line in red with the dotted line):

_images/outliers2.png

The analysis also gives us an estimate for the failure rate w=0.27(12) of our devices — they fail about a quarter of the time — and shows that the y errors are b=10.6(3.7) times larger when there is a failure.

Note, from the correlation matrix, that the intercept and slope are anti-correlated, as one might guess for this fit. We can illustrate this correlation and look for others by sampling the distribution associated with the modified PDF and using the samples to create histograms and contour plots of the distributions. The following code uses vegas.PDFIntegrator.sample() to sample the distribution, and the corner Python module to make the plots (this requires the corner and arviz Python modules, neither of which is included in vegas):

import corner
import matplotlib.pyplot as plt

def make_cornerplots(expval):
    # sample the distribution and repack with the variables we want to analyze
    wgts, all_samples = expval.sample(nbatch=50_000)
    samples = dict(
        c0=all_samples['c'][0], c1=all_samples['c'][1],
        b=all_samples['b'], w=all_samples['w']
        )
    # corner plots
    fig = corner.corner(
        data=samples, weights=wgts, range=4*[0.99],
        show_titles=True, quantiles=[0.16, 0.5, 0.84],
        plot_datapoints=False, fill_contours=True,
        contourf_kwargs=dict(cmap='Blues', colors=None),
        )
    plt.show()

The plots are created from approximately 50,000 random samples all_samples, which is a dictionary where: all_samples['c'][d,i] are samples for parameter c[d], with index d=0,1 labeling directions in c-space and index i labeling the sample; and all_samples['w'][i] and all_samples['b'][i] are the corresponding samples for parameters w and b, respectively. The weight for sample i is wgts[i].
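
Continuing from the sample() call above, the weights can be used with standard numpy routines; for example, weighted sample averages should reproduce (within Monte Carlo errors) the means reported by expval.stats():

import numpy as np

# weighted sample means, computed from the samples drawn above
c0_mean = np.average(all_samples['c'][0], weights=wgts)
w_mean = np.average(all_samples['w'], weights=wgts)
print(c0_mean, w_mean)     # roughly 0.29 and 0.27 for this fit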

The output shows histograms of the probability density for each parameter, and contour plots for each pair of parameters:

_images/outliers3.png

From the plots, parameters c[0] and c[1] are reasonably Gaussian, with a strong negative correlation, as expected. The other parameters are somewhat skewed, but show only weak correlations. The contour lines are at 0.5, 1, 1.5, and 2 sigma. The histograms are labeled with the median values of the parameters with plus/minus uncertainties that enclose 34% of the probability on either side of the median (quantiles=[0.16, 0.5, 0.84]).

Finally, note that the Monte Carlo integrations can be made twice as accurate (or faster) by using the results of a least-squares fit in place of the prior to define the vegas.PDFIntegrator. This is done, for example, using the lsqfit module to replace

expval = vegas.PDFIntegrator(prior, pdf=mod_pdf)

by

fit = lsqfit.nonlinear_fit(data=(x,y), prior=prior, fcn=fitfcn)
expval = vegas.PDFIntegrator(fit.p, pdf=mod_pdf)

where fit.p are the best-fit values of the parameters from the fit. The values of the expectation values are unchanged in the second case but the optimized integration variables used by vegas.PDFIntegrator are better suited to the PDF.

A Variation

A somewhat different model for the data PDF assigns a separate value w to each data point. The script above does this if

prior = make_prior()

is replaced by

prior = make_prior(w_shape=19)

and nstrat is replaced by nstrat = [20, 20] + 20 * [1]. The Bayesian integral then has 22 parameters, rather than the 4 parameters before. The code still takes only seconds to run on a 2020 laptop.
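
For concreteness, the changed lines look like this in context (a sketch; the rest of main() is unchanged):

# per-data-point weights: one w[i] for each of the 19 data points
prior = make_prior(w_shape=19)
mod_pdf = ModifiedPDF(data=(x, y), fitfcn=fitfcn, prior=prior)
expval = vegas.PDFIntegrator(prior, pdf=mod_pdf)
nstrat = [20, 20] + 20 * [1]   # 22 directions: c[0], c[1], then the 19 gw(w)'s and gb(b)
warmup = expval(nstrat=nstrat, nitn=2 * nitn)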

The final results are quite similar to the other model:

itn   integral        average         chi2/dof        Q
-------------------------------------------------------
  1   2.558(97)e-11   2.558(97)e-11       0.00     1.00
  2   2.88(22)e-11    2.72(12)e-11        1.13     0.06
  3   2.691(96)e-11   2.710(86)e-11       1.03     0.28
  4   2.658(91)e-11   2.697(68)e-11       1.02     0.33
  5   2.69(16)e-11    2.695(63)e-11       1.00     0.48
  6   2.63(10)e-11    2.685(55)e-11       0.99     0.58
  7   2.495(73)e-11   2.658(49)e-11       0.98     0.68
  8   2.462(81)e-11   2.633(44)e-11       0.97     0.82
  9   2.446(76)e-11   2.613(40)e-11       0.98     0.72
 10   2.51(11)e-11    2.602(37)e-11       0.97     0.84

Bayesian fit results:
      c = [0.31(17) 0.608(70)]
 corr_c = [[ 1.    -0.902]
           [-0.902  1.   ]]
      b = 8.8(3.0)
      w = [0.39(27) 0.66(23) 0.40(27) 0.40(27) 0.66(24) 0.49(29) 0.51(29) 0.37(26)
 0.42(28) 0.39(27) 0.38(26) 0.37(26) 0.42(28) 0.39(27) 0.38(26) 0.39(27)
 0.48(29) 0.66(24) 0.39(27)]

  logBF = -24.372(14)

The logarithm of the Bayes Factor logBF is slightly lower for this model than before. It is also less accurately determined (by about a factor of four), because 22-parameter integrals are more difficult than 4-parameter integrals. More precision can be obtained by increasing nstrat, but the current precision is more than adequate.

Only three of the w[i] values listed in the output are more than two standard deviations away from zero. Not surprisingly, these correspond to the unambiguous outliers. The fit function parameters are almost the same as before.

The outliers in this case are pretty obvious; one is tempted to simply drop them. It is clearly better, however, to understand why they have occurred and to quantify the effect if possible, as above. Dropping outliers would be much more difficult if they were, say, three times closer to the rest of the data. The least-squares fit would still be poor (chi**2 per degree of freedom of 3) and its intercept a bit too high (0.6(1)). Using the modified PDF, on the other hand, would give results very similar to what we obtained above: for example, the intercept would be 0.35(17).