.. |GVar| replace:: :class:`gvar.GVar`
.. |nonlinear_fit| replace:: :class:`lsqfit.nonlinear_fit`
.. |BufferDict| replace:: :class:`gvar.BufferDict`
.. |~| unicode:: U+00A0
   :trim:

.. _case_curve_fitting:

Case Study: Bayesian Curve Fitting
=====================================================
In this case study, we use :mod:`vegas` to fit a straight line
to data with outliers. We use the specialized integrator
:class:`PDFIntegrator` with a non-Gaussian probability
density function (PDF) in a Bayesian analysis. We look at two
examples, one with 4 |~| parameters and the other with
22 |~| parameters.

This case study is adapted from an example by Jake Vanderplas
on his `Python blog `_. It is also discussed in the documentation
for the :mod:`lsqfit` module.

The Problem
------------------
We want to extrapolate the ``y`` values in the following figure
to ``x=0``:

.. image:: outliers1.png
   :width: 60%

A linear least-squares fit to the data (dotted line) is unconvincing;
in particular, the extrapolated value at ``x=0`` is larger than one,
while most of the data near ``x=0`` suggest an intercept
less than |~| 0.5. The problem, of course, is caused by the outliers,
of which there are at least three.

A Solution
-----------------
There are many *ad hoc* prescriptions for handling outliers. In
the best of situations, one has an explanation for the outliers and
can model them accordingly. For example, we might know that
some fraction :math:`w` of the time our device malfunctions,
resulting in much larger measurement errors than usual. This
can be modeled in a Bayesian analysis by describing the data using
a linear combination of two Gaussian probability density functions
(PDFs) for each data point. One is the usual PDF proportional
to :math:`\exp(-(y-f(x))^2 / 2\sigma_y^2)`, where the fit function
:math:`f(x)` is :math:`c_0 + c_1 x`. The second is the same
but with :math:`\sigma_y \to b\,\sigma_y` for some |~| :math:`b>1`.
The relative weights assigned to these two terms
are :math:`1-w` and :math:`w`, respectively.

The following code does a Bayesian fit with this modified PDF::

    import numpy as np
    import gvar as gv
    import vegas

    def main():
        # 1) create data
        x = np.array([
            0.2, 0.4, 0.6, 0.8, 1.,
            1.2, 1.4, 1.6, 1.8, 2.,
            2.2, 2.4, 2.6, 2.8, 3.,
            3.2, 3.4, 3.6, 3.8
            ])
        y = gv.gvar([
            '0.38(20)', '2.89(20)', '0.85(20)', '0.59(20)',
            '2.88(20)', '1.44(20)', '0.73(20)', '1.23(20)',
            '1.68(20)', '1.36(20)', '1.51(20)', '1.73(20)',
            '2.16(20)', '1.85(20)', '2.00(20)', '2.11(20)',
            '2.75(20)', '0.86(20)', '2.73(20)'
            ])

        # 2) create prior and modified PDF
        prior = make_prior()
        mod_pdf = ModifiedPDF(data=(x, y), fitfcn=fitfcn, prior=prior)

        # 3) create integrator and adapt it to the modified PDF
        expval = vegas.PDFIntegrator(prior, pdf=mod_pdf)
        nitn = 10
        nstrat = [10, 10, 2, 2]
        warmup = expval(nstrat=nstrat, nitn=2*nitn)

        # 4) evaluate statistics for g(p)
        @vegas.rbatchintegrand
        def g(p):
            return {k:p[k] for k in ['c', 'b', 'w']}
        results = expval.stats(g, nitn=nitn)
        print(results.summary())

        # 5) print out results
        print('Bayesian fit results:')
        for k in results:
            print(f'  {k} = {results[k]}')
            if k == 'c':
                # correlation matrix for c
                print(
                    '  corr_c =',
                    np.array2string(
                        gv.evalcorr(results['c']),
                        prefix=10 * ' ', precision=3,
                        ),
                    )
        # Bayes Factor
        print('\n  logBF =', np.log(results.pdfnorm))

    def fitfcn(x, p):
        c = p['c']
        return c[0] + c[1] * x

    def make_prior(w_shape=()):
        prior = gv.BufferDict()
        prior['c'] = gv.gvar(['0(5)', '0(5)'])
        # uniform distributions for w and b
        prior['gw(w)'] = gv.BufferDict.uniform('gw', 0., 1., shape=w_shape)
        prior['gb(b)'] = gv.BufferDict.uniform('gb', 5., 20.)
        return prior

    @vegas.rbatchintegrand
    class ModifiedPDF:
        """ Modified PDF to account for measurement failure. """
        def __init__(self, data, fitfcn, prior):
            x, y = data
            self.fitfcn = fitfcn
            self.prior_pdf = gv.PDF(prior, mode='rbatch')
            # add rbatch index
            self.x = x[:, None]
            self.ymean = gv.mean(y)[:, None]
            self.yvar = gv.var(y)[:, None]

        def __call__(self, p):
            w = p['w']
            b = p['b']
            # modified PDF for data
            fxp = self.fitfcn(self.x, p)
            chi2 = (self.ymean - fxp) ** 2 / self.yvar
            norm = np.sqrt(2 * np.pi * self.yvar)
            y_pdf = np.exp(-chi2 / 2) / norm
            yb_pdf = np.exp(-chi2 / (2 * b**2)) / (b * norm)
            # product over PDFs for each y[i]
            data_pdf = np.prod((1 - w) * y_pdf + w * yb_pdf, axis=0)
            # multiply by prior PDF
            return data_pdf * self.prior_pdf(p)

    if __name__ == '__main__':
        main()

Here class ``ModifiedPDF`` implements the modified PDF. The
parameters for this distribution are the fit function coefficients
``c = p['c']``, the weight ``w = p['w']``, and the breadth
parameter ``b = p['b']``. As usual, the PDF for the parameters
(in ``__call__``) is the product of a PDF for the data times a PDF
for the priors. The data PDF consists of two Gaussian
distributions for each data point: one, ``y_pdf``, with the nominal
data errors and weight ``(1-w)``, and the other, ``yb_pdf``, with
errors that are ``p['b']`` times larger and weight |~| ``w``. The PDF
for the prior is implemented using ``gvar.PDF`` from the :mod:`gvar`
module.

We want broad Gaussian priors for the fit function coefficients,
but uniform priors for the weight parameter (:math:`0 < w < 1`) and
the breadth parameter (:math:`5 < b < 20`).