Project author: lzkelley

Project description: Kernel Density Estimation and (re)sampling
Language: Python
Repository: git://github.com/lzkelley/kalepy.git
Created: 2019-05-17T18:56:34Z
Project page: https://github.com/lzkelley/kalepy

License: Other


kalepy: Kernel Density Estimation and Sampling


kalepy animated logo

This package performs KDE operations on multidimensional data to: 1) calculate estimated PDFs (probability distribution functions), and 2) resample new data from those PDFs.
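In essence, a KDE places a smoothing kernel on every data point: the summed kernels give a density estimate, and new samples can be drawn by perturbing randomly chosen data points with kernel-shaped noise. A minimal from-scratch sketch of both operations (illustrative only, not kalepy's implementation; the bandwidth here is fixed by hand):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 200)

# (1) Density estimation: place a Gaussian kernel on every data point
#     and sum them into a normalized PDF on a grid.
bw = 0.4                                   # bandwidth, chosen by hand here
grid = np.linspace(-4.0, 4.0, 201)
zz = (grid[:, None] - data[None, :]) / bw
pdf = np.exp(-0.5 * zz**2).sum(axis=1) / (data.size * bw * np.sqrt(2 * np.pi))

# (2) Resampling: pick random data points, then add kernel-shaped noise.
picks = rng.choice(data, size=500)
samples = picks + rng.normal(0.0, bw, size=500)
```

kalepy wraps both steps (in any number of dimensions) behind `kale.density` and `kale.resample`, shown below.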

Documentation

A number of examples (also used for continuous integration testing) are included in the package notebooks. Some background information and references are included in the JOSS paper.

Full documentation is available on kalepy.readthedocs.io.


Installation

from pypi (i.e. via pip)

```bash
pip install kalepy
```

from source (e.g. for development)

```bash
git clone https://github.com/lzkelley/kalepy.git
pip install -e kalepy/
```

In this case the package can easily be updated by changing into the source directory, pulling, and rebuilding:

```bash
cd kalepy
git pull
pip install -e .
# Optional: run unit tests (using the `pytest` package)
pytest
```

Basic Usage

```python
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

import kalepy as kale
from kalepy.plot import nbshow
```

Generate some random data, and its corresponding distribution function

```python
NUM = int(1e4)
np.random.seed(12345)
# Combine data from two different PDFs
_d1 = np.random.normal(4.0, 1.0, NUM)
_d2 = np.random.lognormal(0, 0.5, size=NUM)
data = np.concatenate([_d1, _d2])

# Calculate the "true" distribution
xx = np.linspace(0.0, 7.0, 100)[1:]
yy = 0.5*np.exp(-(xx - 4.0)**2/2) / np.sqrt(2*np.pi)
yy += 0.5 * np.exp(-np.log(xx)**2/(2*0.5**2)) / (0.5*xx*np.sqrt(2*np.pi))
```

Plotting Smooth Distributions

```python
# Reconstruct the probability-density based on the given data points.
points, density = kale.density(data, probability=True)

# Plot the KDE-reconstructed PDF
plt.plot(points, density, 'k-', lw=2.0, alpha=0.8, label='KDE')
# Plot the "true" PDF
plt.plot(xx, yy, 'r--', alpha=0.4, lw=3.0, label='truth')
# Plot the standard histogram density estimate
plt.hist(data, density=True, histtype='step', lw=2.0, alpha=0.5, label='hist')

plt.legend()
nbshow()
```


Resampling: Constructing Statistically Similar Values

Draw a new sample of data-points from the KDE PDF

```python
# Draw new samples from the KDE-reconstructed PDF
samples = kale.resample(data)

# Plot the new samples
plt.hist(samples, density=True, label='new samples', alpha=0.5, color='0.65', edgecolor='b')
# Plot the old samples
plt.hist(data, density=True, histtype='step', lw=2.0, alpha=0.5, color='r', label='input data')
# Plot the KDE-reconstructed PDF
plt.plot(points, density, 'k-', lw=2.0, alpha=0.8, label='KDE')

plt.legend()
nbshow()
```


Multivariate Distributions

```python
# Load some random-ish three-dimensional data
np.random.seed(9485)
data = kale.utils._random_data_3d_02(num=3e3)

# Construct a KDE
kde = kale.KDE(data)
# Construct new data by resampling from the KDE
resamp = kde.resample(size=1e3)

# Plot the data and distributions using the builtin `kalepy.corner` plot
corner, h1 = kale.corner(kde, quantiles=[0.5, 0.9])
h2 = corner.clean(resamp, quantiles=[0.5, 0.9], dist2d=dict(median=False), ls='--')
corner.legend([h1, h2], ['input data', 'new samples'])
nbshow()
```


```python
# Resample the data (default output is the same size as the input data)
samples = kde.resample()

# ---- Plot the input data compared to the resampled data ----
fig, axes = plt.subplots(figsize=[16, 4], ncols=kde.ndim)
for ii, ax in enumerate(axes):
    # Calculate and plot the PDF for the `ii`th parameter (i.e. data dimension `ii`)
    xx, yy = kde.density(params=ii, probability=True)
    ax.plot(xx, yy, 'k--', label='KDE', lw=2.0, alpha=0.5)
    # Draw histograms of the original and newly resampled datasets
    *_, h1 = ax.hist(data[ii], histtype='step', density=True, lw=2.0, label='input')
    *_, h2 = ax.hist(samples[ii], histtype='step', density=True, lw=2.0, label='resample')
    # Add `kalepy.carpet` plots showing the data points themselves
    kale.carpet(data[ii], ax=ax, color=h1[0].get_facecolor())
    kale.carpet(samples[ii], ax=ax, color=h2[0].get_facecolor(), shift=ax.get_ylim()[0])

axes[0].legend()
nbshow()
```


Fancy Usage

Reflecting Boundaries

What if the distribution you're trying to capture has hard edges, like a uniform distribution between two bounds? In the example below, the KDE chooses 'reflection' locations based on the extrema of the given data.

```python
# Uniform data (edges at -1 and +1)
NDATA = 1e3
np.random.seed(54321)
data = np.random.uniform(-1.0, 1.0, int(NDATA))

# Create a 'carpet' plot of the data
kale.carpet(data, label='data')
# Histogram the data
plt.hist(data, density=True, alpha=0.5, label='hist', color='0.65', edgecolor='k')

# ---- A standard KDE will undershoot just inside the edges, and overshoot outside them
points, pdf_basic = kale.density(data, probability=True)
plt.plot(points, pdf_basic, 'r--', lw=3.0, alpha=0.5, label='KDE')

# ---- A reflecting KDE keeps the probability within the given bounds
# Setting `reflect=True` lets the KDE guess the edge locations based on the data extrema
points, pdf_reflect = kale.density(data, reflect=True, probability=True)
plt.plot(points, pdf_reflect, 'b-', lw=2.0, alpha=0.75, label='reflecting KDE')

plt.legend()
nbshow()
```

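The reflection trick itself is simple to sketch: data are mirrored across each boundary, and the mirrored copies' density is added back inside the bounds, so probability no longer leaks past the edges. A hedged from-scratch illustration (not kalepy's internals; bandwidth chosen by hand):

```python
import numpy as np

rng = np.random.default_rng(1)
lo, hi = -1.0, 1.0
data = rng.uniform(lo, hi, 1000)
bw = 0.1

def gauss_kde(grid, pts, bw):
    # Plain Gaussian KDE evaluated on `grid`.
    zz = (grid[:, None] - pts[None, :]) / bw
    return np.exp(-0.5 * zz**2).sum(axis=1) / (pts.size * bw * np.sqrt(2 * np.pi))

grid = np.linspace(lo, hi, 401)
# Mirror the data across both edges and add the reflected contributions.
pdf = (gauss_kde(grid, data, bw)
       + gauss_kde(grid, 2 * lo - data, bw)
       + gauss_kde(grid, 2 * hi - data, bw))
```

With reflection, the estimate at each edge recovers the full uniform density (0.5 here) instead of half of it.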

Explicit reflection locations can also be provided (in any number of dimensions).

```python
# Construct random data and add an artificial 'edge'
np.random.seed(5142)
edge = 1.0
data = np.random.lognormal(sigma=0.5, size=int(3e3))
data = data[data >= edge]

# Histogram the data using fixed bin positions
edges = np.linspace(edge, 4, 20)
plt.hist(data, bins=edges, density=True, alpha=0.5, label='data', color='0.65', edgecolor='k')

# Standard KDE, which over- and under-estimates near the edge
points, pdf_basic = kale.density(data, probability=True)
plt.plot(points, pdf_basic, 'r--', lw=4.0, alpha=0.5, label='Basic KDE')

# Reflecting KDE, setting the lower boundary to the known value;
# there is no upper boundary when `None` is given.
points, pdf_reflect = kale.density(data, reflect=[edge, None], probability=True)
plt.plot(points, pdf_reflect, 'b-', lw=3.0, alpha=0.5, label='Reflecting KDE')

plt.gca().set_xlim(edge - 0.5, 3)
plt.legend()
nbshow()
```


Multivariate Reflection

```python
# Load a predefined dataset that has boundaries at:
#   x: 0.0 on the low end
#   y: 1.0 on the high end
data = kale.utils._random_data_2d_03()

# Construct a KDE with the reflection boundaries given explicitly
kde = kale.KDE(data, reflect=[[0, None], [None, 1]])

# Plot using default settings
kale.corner(kde)
nbshow()
```


Specifying Bandwidths and Kernel Functions

```python
# Load predefined 'random' data
data = kale.utils._random_data_1d_02(num=100)
# Choose a uniform x-spacing for drawing PDFs
xx = np.linspace(-2, 8, 1000)

# ---- Choose the kernel functions and bandwidths to test ----
kernels = ['parabola', 'gaussian', 'box']
bandwidths = [None, 0.9, 0.15]   # `None` means let kalepy choose
# ------------------------------------------------------------

ylabels = ['Automatic', 'Coarse', 'Fine']
fig, axes = plt.subplots(figsize=[16, 10], ncols=len(kernels), nrows=len(bandwidths),
                         sharex=True, sharey=True)
plt.subplots_adjust(hspace=0.2, wspace=0.05)
for (ii, jj), ax in np.ndenumerate(axes):
    # ---- Construct a KDE using this kernel function and bandwidth ----
    kern = kernels[jj]
    bw = bandwidths[ii]
    kde = kale.KDE(data, kernel=kern, bandwidth=bw)
    # ------------------------------------------------------------------

    # If the bandwidth was set to `None`, the KDE chooses the 'optimal' value
    if bw is None:
        bw = kde.bandwidth[0, 0]

    ax.set_title('{} (bw={:.3f})'.format(kern, bw))
    if jj == 0:
        ax.set_ylabel(ylabels[ii])

    # Plot the KDE
    ax.plot(*kde.pdf(points=xx), color='r')
    # Plot a histogram of the data (same for all panels)
    ax.hist(data, bins='auto', color='b', alpha=0.2, density=True)
    # Plot a carpet of the data (same for all panels)
    kale.carpet(data, ax=ax, color='b')
    ax.set(xlim=[-2, 5], ylim=[-0.2, 0.6])

nbshow()
```

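Automatic bandwidth selection (`bandwidth=None` above) is typically done with a rule of thumb that shrinks the bandwidth as the number of points grows. As an illustration of what such rules look like, here is Scott's rule for one-dimensional data (kalepy's actual default selection may differ):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(0.0, 1.0, 400)

# Scott's rule of thumb for d-dimensional data: bw = sigma * n**(-1/(d+4))
ndim = 1
bw_scott = data.std(ddof=1) * data.size ** (-1.0 / (ndim + 4))
```

For these 400 standard-normal points this gives a bandwidth around 0.3, comparable to the 'Automatic' panels above in spirit: more data means narrower kernels and finer structure.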

Resampling

Using different data weights

```python
# Load some random data (and the 'true' PDF for comparison)
data, truth = kale.utils._random_data_1d_01()

# ---- Resample the same data using different weightings ----
resamp_uni = kale.resample(data, size=1000)
resamp_sqr = kale.resample(data, weights=data**2, size=1000)
resamp_inv = kale.resample(data, weights=data**-1, size=1000)
# -----------------------------------------------------------

# ---- Plot the different distributions ----
# Setup plotting parameters
kw = dict(density=True, histtype='step', lw=2.0, alpha=0.75, bins='auto')

xx, yy = truth
samples = [resamp_inv, resamp_uni, resamp_sqr]
yvals = [yy/xx, yy, yy*xx**2/10]
labels = [r'$\propto X^{-1}$', r'$\propto 1$', r'$\propto X^2$']

plt.figure(figsize=[10, 5])
for ii, (res, yy, lab) in enumerate(zip(samples, yvals, labels)):
    hh, = plt.plot(xx, yy, ls='--', alpha=0.5, lw=2.0)
    col = hh.get_color()
    kale.carpet(res, color=col, shift=-0.1*ii)
    plt.hist(res, color=col, label=lab, **kw)

plt.gca().set(xlim=[-0.5, 6.5])
# Add the legend
plt.legend()
# Display the figure if this is a notebook
nbshow()
```

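Conceptually, weighting biases which data points get drawn before any smoothing is applied. That first step can be sketched as a weighted bootstrap (illustrative only; a full KDE resampling would then add kernel noise to the drawn points):

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.uniform(1.0, 5.0, 1000)

weights = data**2                  # upweight larger values (like `weights=data**2` above)
prob = weights / weights.sum()     # normalize weights into probabilities
resamp = rng.choice(data, size=2000, p=prob)
```

The resampled distribution is shifted toward larger values, just as the `\propto X^2` histogram above is.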

Resampling while ‘keeping’ certain parameters/dimensions

```python
# Construct a covariant 2D dataset where the 0th parameter takes on discrete values
xx = np.random.randint(2, 7, 1000)
yy = np.random.normal(4, 2, xx.size) + xx**(3/2)
data = [xx, yy]

# 2D plotting settings: disable the 2D histogram & disable masking of dense scatter-points
dist2d = dict(hist=False, mask_dense=False)

# Draw a corner plot
kale.corner(data, dist2d=dist2d)
nbshow()
```


A standard KDE resampling will smooth out discrete variables, producing a smoother distribution. Using the `keep` parameter, we can instead resample that parameter from its actual data values, without any KDE smoothing.

```python
kde = kale.KDE(data)

# ---- Resample the data both normally, and 'keep'ing the 0th parameter values ----
resamp_stnd = kde.resample()
resamp_keep = kde.resample(keep=0)
# ---------------------------------------------------------------------------------

corner = kale.Corner(2)
dist2d['median'] = False    # disable the median 'cross-hairs'
h1 = corner.plot(resamp_stnd, dist2d=dist2d)
h2 = corner.plot(resamp_keep, dist2d=dist2d)
corner.legend([h1, h2], ['Standard', "'keep'"])
nbshow()
```

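The idea behind `keep` can be sketched without kalepy: bootstrap indices jointly across dimensions, keep the raw values in the 'kept' dimension, and smooth only the others. A hedged illustration (the data and bandwidth here are made up for the example, and this is not kalepy's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(4)
xx = rng.integers(2, 7, 1000)               # discrete parameter (values 2..6)
yy = rng.normal(4.0, 2.0, xx.size) + xx**1.5
bw = 0.5                                    # smoothing scale, chosen by hand

idx = rng.integers(0, xx.size, 1000)        # joint bootstrap indices (preserves covariance)
keep_x = xx[idx]                            # kept dimension: exact discrete values
smooth_y = yy[idx] + rng.normal(0.0, bw, idx.size)  # other dimension: kernel noise added
```

The kept dimension retains only the original discrete values, while the other dimension remains smoothly distributed.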

Development & Contributions

Please visit the [github page](https://github.com/lzkelley/kalepy) for issues or bug reports. Contributions and feedback are very welcome.


Attribution

A JOSS paper has been submitted. If you have found this package useful in your research, please add a reference to the code paper:

```tex
@article{kalepy,
  author = {Luke Zoltan Kelley},
  title = {kalepy: a python package for kernel density estimation and sampling},
  journal = {The Journal of Open Source Software},
  publisher = {The Open Journal},
}
```