Subsampling
Functions for subsampling datasets.
The functions featured in this module can be used to easily subsample either dHdl or u_nk datasets to give less correlated timeseries.
API Reference
- alchemlyb.preprocessing.subsampling.slicing(df, lower=None, upper=None, step=None, force=False)
Subsample a DataFrame using simple slicing.
- Parameters
- Returns
df subsampled.
- Return type
DataFrame
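As a rough illustration of what simple slicing does, the sketch below applies lower, upper and step to a plain list of (time, value) pairs. It is a pure-Python stand-in, not alchemlyb's implementation: the helper name slice_by_time and the sample data are hypothetical, and it assumes that the bounds refer to the time value and that both ends are inclusive.

```python
def slice_by_time(rows, lower=None, upper=None, step=None):
    """Keep rows with lower <= time <= upper, then take every step-th row.

    rows is a list of (time, value) tuples sorted by time; it stands in
    for a time-indexed DataFrame (hypothetical helper, for illustration).
    """
    kept = [
        (t, v)
        for t, v in rows
        if (lower is None or t >= lower) and (upper is None or t <= upper)
    ]
    return kept[::step]  # step=None keeps every remaining row


# Hypothetical data: times 0, 10, ..., 90 with arbitrary values.
rows = [(float(t), t * t) for t in range(0, 100, 10)]

subsampled = slice_by_time(rows, lower=20.0, upper=80.0, step=2)
print([t for t, _ in subsampled])
```

With no bounds and no step the helper returns the input unchanged, which mirrors the idea that slicing alone does nothing to decorrelate the data unless a suitable step is chosen.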
- alchemlyb.preprocessing.subsampling.statistical_inefficiency(df, series=None, lower=None, upper=None, step=None, conservative=True, drop_duplicates=False, sort=False)
Subsample a DataFrame based on the calculated statistical inefficiency of a timeseries.
If series is None, then this function will behave the same as slicing().
- Parameters
df (DataFrame) – DataFrame to subsample according to the statistical inefficiency of series.
series (Series) – Series to use for calculating statistical inefficiency. If None, no statistical inefficiency-based subsampling will be performed.
lower (float) – Lower bound to pre-slice series data from.
upper (float) – Upper bound to pre-slice series to (inclusive).
step (int) – Step between series items to pre-slice by.
conservative (bool) – True: use ceil(statistical_inefficiency) to slice the data in uniform intervals (the default). False: sample at non-uniform intervals to closely match the (fractional) statistical_inefficiency, as implemented in pymbar.timeseries.subsampleCorrelatedData().
drop_duplicates (bool) – Drop the duplicated lines based on time.
sort (bool) – Sort the DataFrame based on the time column.
- Returns
df subsampled according to subsampled series.
- Return type
DataFrame
Warning
The series and the data to be sliced, df, need to have the same number of elements because the statistical inefficiency is calculated based on the index of the series (and not an associated time). At the moment there is no automatic conversion from a time to an index.
Note
For a non-integer statistical inefficiency \(g\), the default value conservative=True will provide _fewer_ data points than allowed by \(g\) and thus error estimates will be _higher_. For large numbers of data points and converged free energies, the choice should not make a difference. For small numbers of data points, conservative=True guards against a false sense of accuracy and is deemed the more careful and conservative approach.
See also
pymbar.timeseries.statisticalInefficiency
detailed background
pymbar.timeseries.subsampleCorrelatedData
used for subsampling
Changed in version 0.2.0: The conservative keyword was added and the method is now using pymbar.timeseries.statisticalInefficiency(); previously, the statistical inefficiency was _rounded_ (instead of ceil()) and thus one could end up with correlated data.
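The difference between the two conservative settings described for statistical_inefficiency() can be sketched in a few lines of plain Python. This is an illustration only, not the code of alchemlyb or pymbar: the function names are hypothetical, the value of g is made up, and the non-uniform scheme is assumed to pick the indices closest to integer multiples of g.

```python
import math


def uniform_indices(n_samples, g):
    """conservative=True style: keep one sample every ceil(g) steps."""
    stride = math.ceil(g)
    return list(range(0, n_samples, stride))


def nonuniform_indices(n_samples, g):
    """conservative=False style: keep samples near multiples of fractional g."""
    indices = []
    n = 0
    while True:
        t = int(round(n * g))
        if t >= n_samples:
            break
        # Skip an index that rounding would select twice (possible for g < 1).
        if not indices or t != indices[-1]:
            indices.append(t)
        n += 1
    return indices


g = 2.4  # hypothetical statistical inefficiency
uniform = uniform_indices(20, g)
nonuniform = nonuniform_indices(20, g)
print(len(uniform), uniform)
print(len(nonuniform), nonuniform)
```

For 20 samples and g = 2.4 the uniform scheme keeps 7 points (stride 3) while the non-uniform scheme keeps 9, which is the sense in which conservative=True yields fewer points and therefore larger error estimates.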
- alchemlyb.preprocessing.subsampling.equilibrium_detection(df, series=None, lower=None, upper=None, step=None)
Subsample a DataFrame using automated equilibrium detection on a timeseries.
If series is None, then this function will behave the same as slicing().
- Parameters
df (DataFrame) – DataFrame to subsample according to equilibrium detection on series.
series (Series) – Series to detect equilibration on. If None, no equilibrium detection-based subsampling will be performed.
lower (float) – Lower bound to pre-slice series data from.
upper (float) – Upper bound to pre-slice series to (inclusive).
step (int) – Step between series items to pre-slice by.
- Returns
df subsampled according to subsampled series.
- Return type
DataFrame
See also
pymbar.timeseries.detectEquilibration
detailed background
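The idea behind automated equilibrium detection can be sketched as follows: for each candidate start index t0, estimate the statistical inefficiency g of the remaining series and keep the t0 that maximises the number of effectively uncorrelated samples, (N - t0) / g. The code below is a simplified pure-Python sketch under these assumptions, not the estimator used by pymbar.timeseries.detectEquilibration; the helper names, the truncation rule for the autocorrelation sum, and the synthetic series are all hypothetical.

```python
import math
import random


def estimate_g(x):
    """Crude statistical inefficiency: g = 1 + 2 * sum of autocorrelations.

    The sum is truncated at the first non-positive autocorrelation.
    Simplified for illustration; pymbar's estimator differs in detail.
    """
    n = len(x)
    mean = sum(x) / n
    var = sum((xi - mean) ** 2 for xi in x) / n
    if var == 0.0:
        return 1.0
    g = 1.0
    for lag in range(1, n - 1):
        cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
        corr = cov / ((n - lag) * var)
        if corr <= 0.0:
            break
        g += 2.0 * corr * (1.0 - lag / n)
    return max(g, 1.0)


def find_equilibration_start(x, stride=10):
    """Pick the start index t0 that maximises n_eff = (len(x) - t0) / g."""
    best_t0, best_g, best_n_eff = 0, 1.0, 0.0
    for t0 in range(0, len(x) - 1, stride):
        g = estimate_g(x[t0:])
        n_eff = (len(x) - t0) / g
        if n_eff > best_n_eff:
            best_t0, best_g, best_n_eff = t0, g, n_eff
    return best_t0, best_g, best_n_eff


# Hypothetical series: a linear relaxation from 10 to 0 over 50 steps,
# followed by a stationary part, with Gaussian noise throughout.
random.seed(0)
series = [max(0.0, 10.0 - 0.2 * i) + random.gauss(0.0, 1.0) for i in range(200)]

t0, g, n_eff = find_equilibration_start(series)
kept = list(range(t0, len(series), math.ceil(g)))  # positional indices to keep
print(t0, round(g, 2), len(kept))
```

The resulting positional indices would then be used to select rows from the data to be subsampled, which is why the series and that data need to have the same number of elements.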