Subsampling

Functions for subsampling datasets.

The functions in this module subsample either dHdl or u_nk datasets to yield less correlated timeseries.

API Reference

alchemlyb.preprocessing.subsampling.slicing(df, lower=None, upper=None, step=None, force=False)

Subsample a DataFrame using simple slicing.

Parameters
  • df (DataFrame) – DataFrame to subsample.

  • lower (float) – Lower time to slice from.

  • upper (float) – Upper time to slice to (inclusive).

  • step (int) – Step between rows to slice by.

  • force (bool) – Skip the checks that the DataFrame is in the proper form for the expected behavior.

Returns

df subsampled.

Return type

DataFrame
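The slicing semantics described above (a time window with an inclusive upper bound, followed by a row step) can be illustrated without alchemlyb. The following is only a minimal sketch on plain Python tuples; the helper name slice_by_time and the toy data are hypothetical and are not part of the library, which operates on a time-indexed DataFrame.

```python
def slice_by_time(rows, lower=None, upper=None, step=None):
    """Illustrative stand-in for time-window slicing (hypothetical helper).

    rows is a list of (time, value) pairs sorted by time.
    """
    # Keep rows with lower <= time <= upper; note the inclusive upper bound.
    window = [
        (t, v)
        for t, v in rows
        if (lower is None or t >= lower) and (upper is None or t <= upper)
    ]
    # Take every step-th row of the window; step=None keeps all rows.
    return window[::step]


# Toy timeseries sampled every 10 time units: t = 0, 10, ..., 90.
rows = [(float(t), float(t * t)) for t in range(0, 100, 10)]

subsampled = slice_by_time(rows, lower=20, upper=60, step=2)
times = [t for t, _ in subsampled]
```

Here the window covers the times 20 through 60, and the step of 2 then keeps every second row inside that window.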

alchemlyb.preprocessing.subsampling.statistical_inefficiency(df, series=None, lower=None, upper=None, step=None, conservative=True, drop_duplicates=False, sort=False)

Subsample a DataFrame based on the calculated statistical inefficiency of a timeseries.

If series is None, then this function will behave the same as slicing().

Parameters
  • df (DataFrame) – DataFrame to subsample according to the statistical inefficiency of series.

  • series (Series) – Series to use for calculating statistical inefficiency. If None, no statistical inefficiency-based subsampling will be performed.

  • lower (float) – Lower bound to pre-slice series data from.

  • upper (float) – Upper bound to pre-slice series to (inclusive).

  • step (int) – Step between series items to pre-slice by.

  • conservative (bool) – If True (the default), use ceil(statistical_inefficiency) to slice the data in uniform intervals. If False, sample at non-uniform intervals to closely match the (fractional) statistical_inefficiency, as implemented in pymbar.timeseries.subsampleCorrelatedData().

  • drop_duplicates (bool) – Drop the duplicated lines based on time.

  • sort (bool) – Sort the DataFrame based on the time column.

Returns

df subsampled according to subsampled series.

Return type

DataFrame

Warning

The series and the data to be sliced, df, need to have the same number of elements because the statistical inefficiency is calculated based on the index of the series (and not an associated time). At the moment there is no automatic conversion from a time to an index.

Note

For a non-integer statistical inefficiency \(g\), the default value conservative=True will provide _fewer_ data points than allowed by \(g\) and thus error estimates will be _higher_. For large numbers of data points and converged free energies, the choice should not make a difference. For small numbers of data points, conservative=True avoids a false sense of accuracy and is deemed the more careful and conservative approach.
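The difference between the two settings can be made concrete with a small index calculation. The sketch below is an assumption-laden illustration, loosely modeled on the behavior described for pymbar.timeseries.subsampleCorrelatedData(); the function name subsample_indices and the values of g and the series length are made up for the example and are not taken from either library.

```python
import math


def subsample_indices(n_samples, g, conservative=True):
    """Return positions of retained samples for statistical inefficiency g.

    Hypothetical helper for illustration only.
    """
    if conservative:
        # Uniform stride of ceil(g): never denser than g allows.
        stride = math.ceil(g)
        return list(range(0, n_samples, stride))
    # Non-uniform spacing: pick the index nearest to each multiple of g.
    indices = []
    n = 0
    while round(n * g) < n_samples:
        t = round(n * g)
        if not indices or t != indices[-1]:  # skip repeated indices
            indices.append(t)
        n += 1
    return indices


# Assumed example values: 20 samples with g = 2.4.
uniform = subsample_indices(20, 2.4, conservative=True)
fractional = subsample_indices(20, 2.4, conservative=False)
```

With these assumed values, the uniform stride of 3 retains 7 samples while the fractional spacing retains 9, which is the "fewer data points, higher error estimates" trade-off described in the note.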

Changed in version 0.2.0: The conservative keyword was added and the method now uses pymbar.timeseries.statisticalInefficiency(); previously, the statistical inefficiency was _rounded_ (instead of using ceil()), so one could end up with correlated data.

alchemlyb.preprocessing.subsampling.equilibrium_detection(df, series=None, lower=None, upper=None, step=None)

Subsample a DataFrame using automated equilibrium detection on a timeseries.

If series is None, then this function will behave the same as slicing().

Parameters
  • df (DataFrame) – DataFrame to subsample according to equilibrium detection on series.

  • series (Series) – Series to detect equilibration on. If None, no equilibrium detection-based subsampling will be performed.

  • lower (float) – Lower bound to pre-slice series data from.

  • upper (float) – Upper bound to pre-slice series to (inclusive).

  • step (int) – Step between series items to pre-slice by.

Returns

df subsampled according to subsampled series.

Return type

DataFrame

See also

pymbar.timeseries.detectEquilibration – detailed background
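As rough intuition for automated equilibrium detection, one common approach is to choose the start index t0 that maximizes the effective number of uncorrelated samples in the remaining data, (N - t0) / g(t0). The sketch below is a simplified, pure-Python illustration of that idea under stated assumptions; it is not the pymbar or alchemlyb implementation, and the function names, the truncated autocorrelation estimator, the min_tail parameter, and the synthetic series are all hypothetical.

```python
def statistical_inefficiency(x):
    """Simplified estimate of g = 1 + 2 * sum_t (1 - t/N) * C(t)."""
    n = len(x)
    mean = sum(x) / n
    dx = [v - mean for v in x]
    var = sum(d * d for d in dx) / n
    if var == 0.0:
        return 1.0  # constant series: treated as uncorrelated by convention
    g = 1.0
    for t in range(1, n - 1):
        # Normalized autocorrelation at lag t.
        c = sum(dx[i] * dx[i + t] for i in range(n - t)) / ((n - t) * var)
        if c <= 0.0:
            break  # truncate the sum at the first non-positive value
        g += 2.0 * c * (1.0 - t / n)
    return max(g, 1.0)


def detect_equilibration(x, min_tail=10):
    """Return (t0, g, n_eff) maximizing n_eff = (N - t0) / g over t0."""
    best = (0, 1.0, -1.0)
    for t0 in range(len(x) - min_tail):
        g = statistical_inefficiency(x[t0:])
        n_eff = (len(x) - t0) / g
        if n_eff > best[2]:  # strict comparison keeps the earliest maximum
            best = (t0, g, n_eff)
    return best


# Synthetic series: a 50-point relaxation ramp, then 200 stationary points.
ramp = [10.0 * (1.0 - i / 50) for i in range(50)]
plateau = [1.0 if i % 2 == 0 else -1.0 for i in range(200)]
t0, g, n_eff = detect_equilibration(ramp + plateau)
```

On this synthetic series the strongly correlated ramp is discarded, so the detected t0 falls after the start of the data and no later than the end of the ramp. Real simulation data are noisier, and a production estimator handles many edge cases that this sketch ignores.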