Subsampling
Functions for subsampling datasets.
The functions in this module subsample either dHdl or u_nk datasets to give less correlated timeseries.
High-level functions
The two high-level functions decorrelate_u_nk() and decorrelate_dhdl() preprocess the dHdl or u_nk data automatically. The following code removes an initial “burn-in” period and decorrelates the data.
>>> from alchemtest.gmx import load_benzene
>>> from alchemlyb.parsing.gmx import extract_u_nk, extract_dHdl
>>> from alchemlyb.preprocessing.subsampling import (decorrelate_u_nk,
...     decorrelate_dhdl)
>>> bz = load_benzene().data
>>> # bz['Coulomb'] is a list of files, one per lambda window; use the first
>>> u_nk = extract_u_nk(bz['Coulomb'][0], T=300)
>>> decorrelated_u_nk = decorrelate_u_nk(u_nk, method='all',
...     remove_burnin=True)
>>> dhdl = extract_dHdl(bz['Coulomb'][0], T=300)
>>> decorrelated_dhdl = decorrelate_dhdl(dhdl, remove_burnin=True)
Low-level functions
To decorrelate the data, a pandas.Series is needed for the autocorrelation analysis in addition to the DataFrame that contains the dHdl or u_nk. The series can be generated with u_nk2series() or dhdl2series() and fed into statistical_inefficiency() or equilibrium_detection().
>>> from alchemtest.gmx import load_benzene
>>> from alchemlyb.parsing.gmx import extract_u_nk, extract_dHdl
>>> from alchemlyb.preprocessing.subsampling import (u_nk2series,
...     dhdl2series, statistical_inefficiency, equilibrium_detection)
>>> bz = load_benzene().data
>>> # bz['Coulomb'] is a list of files, one per lambda window; use the first
>>> u_nk = extract_u_nk(bz['Coulomb'][0], T=300)
>>> u_nk_series = u_nk2series(u_nk, method='dE')
>>> decorrelated_u_nk = statistical_inefficiency(u_nk, series=u_nk_series)
>>> decorrelated_u_nk = equilibrium_detection(u_nk, series=u_nk_series)
>>> dhdl = extract_dHdl(bz['Coulomb'][0], T=300)
>>> dhdl_series = dhdl2series(dhdl)
>>> decorrelated_dhdl = statistical_inefficiency(dhdl, series=dhdl_series)
>>> decorrelated_dhdl = equilibrium_detection(dhdl, series=dhdl_series)
API Reference
- alchemlyb.preprocessing.subsampling.decorrelate_u_nk(df, method='dE', drop_duplicates=True, sort=True, remove_burnin=False, **kwargs)
Subsample a u_nk DataFrame based on the selected method.
The method can be either ‘all’ (obtained as a sum over all energy components) or ‘dE’. In the latter case the energy differences \(dE_{i,i+1}\) (\(dE_{i,i-1}\) for the last lambda) are used. This is a wrapper around statistical_inefficiency() or equilibrium_detection().
- Parameters:
df (DataFrame) – DataFrame to be subsampled according to the selected method.
method ({'all', 'dE'}) – Method for decorrelating the data.
drop_duplicates (bool) – Drop the duplicated lines based on time.
sort (bool) – Sort the DataFrame based on the time column.
remove_burnin (bool) – Whether to remove the initial unequilibrated frames with equilibrium detection (True) or only subsample by statistical inefficiency (False). Added in version 1.0.0.
**kwargs – Additional keyword arguments for statistical_inefficiency() or equilibrium_detection().
- Returns:
df subsampled according to selected method.
- Return type:
DataFrame
Note
The default of True for drop_duplicates and sort should result in robust decorrelation but can lose data.
Added in version 0.6.0.
Changed in version 1.0.0: Added the remove_burnin keyword to allow unequilibrated frames to be removed. Renamed the method value ‘dhdl_all’ to ‘all’ and deprecated ‘dhdl’.
- alchemlyb.preprocessing.subsampling.decorrelate_dhdl(df, drop_duplicates=True, sort=True, remove_burnin=False, **kwargs)
Subsample a dhdl DataFrame. This is a wrapper around statistical_inefficiency() and equilibrium_detection().
- Parameters:
df (DataFrame) – DataFrame to subsample according to the selected method.
drop_duplicates (bool) – Drop the duplicated lines based on time.
sort (bool) – Sort the DataFrame based on the time column.
remove_burnin (bool) – Whether to remove the initial unequilibrated frames with equilibrium detection (True) or only subsample by statistical inefficiency (False). Added in version 1.0.0.
**kwargs – Additional keyword arguments for statistical_inefficiency() or equilibrium_detection().
- Returns:
df subsampled.
- Return type:
DataFrame
Note
The default of True for drop_duplicates and sort should result in robust decorrelation but can lose data.
Added in version 0.6.0.
Changed in version 1.0.0: Added the remove_burnin keyword to allow unequilibrated frames to be removed.
- alchemlyb.preprocessing.subsampling.u_nk2series(df, method='dE')
Convert a u_nk DataFrame into a series based on the selected method for subsampling.
The method can be either ‘all’ (obtained as a sum over all energy components) or ‘dE’. In the latter case the energy differences \(dE_{i,i+1}\) (\(dE_{i,i-1}\) for the last lambda) are used.
- Parameters:
df (DataFrame) – DataFrame to be converted according to the selected method.
method ({'all', 'dE'}) – Method for converting the data.
- Returns:
series to be used as input for statistical_inefficiency() or equilibrium_detection().
- Return type:
Series
Added in version 1.0.0.
Changed in version 2.0.1: The dE method computes the difference between the current lambda and the next lambda (previous lambda for the last window), instead of using the next lambda or the previous lambda for the last window.
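The ‘dE’ rule described above can be sketched without any dependencies. This is only an illustration of the neighbour selection, not the library's implementation: plain lists stand in for the u_nk DataFrame, the helper name de_series and its arguments are hypothetical, and the sign convention (neighbour minus current) is an assumption.

```python
def de_series(rows, lambdas, current):
    """Energy difference to the neighbouring lambda state for each frame.

    rows    : per-frame reduced energies, one value per lambda column
    lambdas : lambda value of each column, in column order (at least two)
    current : lambda value at which the frames were sampled
    """
    i = lambdas.index(current)
    # Use the next state, or the previous one when sampling the last lambda.
    j = i + 1 if i + 1 < len(lambdas) else i - 1
    # Assumed sign convention: neighbour minus current state.
    return [row[j] - row[i] for row in rows]


lambdas = [0.0, 0.5, 1.0]

# Frames sampled at the first lambda: dE_{i,i+1} is used.
rows_first = [[0.0, 1.5, 4.0], [0.0, 1.0, 3.0]]
first = de_series(rows_first, lambdas, current=0.0)

# Frames sampled at the last lambda: dE_{i,i-1} is used.
rows_last = [[5.0, 2.0, 0.0], [6.0, 2.5, 0.0]]
last = de_series(rows_last, lambdas, current=1.0)
```

The resulting list plays the role of the Series that is passed to statistical_inefficiency() or equilibrium_detection().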
- alchemlyb.preprocessing.subsampling.dhdl2series(df, method='all')
Convert a dhdl DataFrame to a series for subsampling.
The series is generated by summing over all energy components (axis 1 of df), as for method='all' in u_nk2series(). Commonly, df only contains a single energy component but in some cases (such as using a split protocol in GROMACS), it can contain multiple columns for different energy terms.
- Parameters:
df (DataFrame) – DataFrame to subsample according to the selected method.
method ('all') – Only ‘all’ is available; the keyword is provided for compatibility with u_nk2series().
- Returns:
series to be used as input for statistical_inefficiency() or equilibrium_detection().
- Return type:
Series
Added in version 1.0.0.
- alchemlyb.preprocessing.subsampling.slicing(df, lower=None, upper=None, step=None, force=False)
Subsample a DataFrame using simple slicing.
- Parameters:
- Returns:
df subsampled.
- Return type:
DataFrame
Changed in version 1.0.1: The rows with NaN values are not dropped by default.
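As a rough picture of what simple slicing with lower, upper and step amounts to, the following dependency-free sketch selects frame positions from a list of times. It assumes that lower and upper are time bounds (upper inclusive) and that step counts rows inside that window; the helper slice_by_time is hypothetical and the force keyword is not modelled.

```python
def slice_by_time(times, lower=None, upper=None, step=None):
    """Positions of frames with lower <= time <= upper, keeping every step-th."""
    in_window = [
        k for k, t in enumerate(times)
        if (lower is None or t >= lower) and (upper is None or t <= upper)
    ]
    # A step of None keeps every frame inside the window.
    return in_window[::step]


times = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
kept = slice_by_time(times, lower=20, upper=70, step=2)
everything = slice_by_time(times)
```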
- alchemlyb.preprocessing.subsampling.statistical_inefficiency(df, series=None, lower=None, upper=None, step=None, conservative=True, drop_duplicates=False, sort=False)
Subsample a DataFrame based on the calculated statistical inefficiency of a timeseries.
If series is None, then this function will behave the same as slicing().
- Parameters:
df (DataFrame) – DataFrame to subsample according to the statistical inefficiency of series.
series (Series) – Series to use for calculating statistical inefficiency. If None, no statistical inefficiency-based subsampling will be performed.
lower (float) – Lower bound to pre-slice series data from.
upper (float) – Upper bound to pre-slice series to (inclusive).
step (int) – Step between series items to pre-slice by.
conservative (bool) – True uses ceil(statistical_inefficiency) to slice the data in uniform intervals (the default). False samples at non-uniform intervals to closely match the (fractional) statistical inefficiency, as implemented in pymbar.timeseries.subsample_correlated_data().
drop_duplicates (bool) – Drop the duplicated lines based on time.
sort (bool) – Sort the DataFrame based on the time column.
- Returns:
df subsampled according to subsampled series.
- Return type:
DataFrame
Warning
The series and the data to be sliced, df, need to have the same number of elements because the statistical inefficiency is calculated based on the index of the series (and not an associated time). At the moment there is no automatic conversion from a time to an index.
Note
For a non-integer statistical inefficiency \(g\), the default value conservative=True will provide fewer data points than allowed by \(g\) and thus error estimates will be higher. For large numbers of data points and converged free energies, the choice should not make a difference. For small numbers of data points, conservative=True avoids a false sense of accuracy and is deemed the more careful and conservative approach.
See also
pymbar.timeseries.statistical_inefficiency
detailed background
pymbar.timeseries.subsample_correlated_data
used for subsampling
Changed in version 0.2.0: The conservative keyword was added and the method is now using pymbar.timeseries.statistical_inefficiency(); previously, the statistical inefficiency was rounded (instead of ceil()) and thus one could end up with correlated data.
Changed in version 1.0.0: Fixed a bug that would effectively ignore the lower and step keywords when returning the subsampled DataFrame object. See issue #198 for more details.
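The effect of the conservative keyword can be illustrated with a small, dependency-free sketch of the index selection. It follows the description above (a uniform stride of ceil(g) versus indices placed near multiples of the fractional g) and is not a copy of the pymbar code; the helper subsample_indices is hypothetical.

```python
import math


def subsample_indices(n_frames, g, conservative=True):
    """Frame indices kept for a statistical inefficiency g."""
    if conservative:
        # Uniform stride, rounded up so that kept frames are at least g apart.
        return list(range(0, n_frames, math.ceil(g)))
    indices = []
    n = 0
    # Non-uniform spacing: place the n-th sample near n * g.
    while round(n * g) < n_frames:
        t = round(n * g)
        if not indices or t != indices[-1]:
            indices.append(t)
        n += 1
    return indices


uniform = subsample_indices(20, 2.4, conservative=True)
fractional = subsample_indices(20, 2.4, conservative=False)
```

For 20 frames and g = 2.4, the conservative choice keeps every third frame, while the fractional choice keeps more frames at uneven spacing, which is why conservative=True yields fewer data points and larger error estimates.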
- alchemlyb.preprocessing.subsampling.equilibrium_detection(df, series=None, lower=None, upper=None, step=None, drop_duplicates=False, sort=False)
Subsample a DataFrame using automated equilibrium detection on a timeseries.
This function uses the pymbar implementation of the simple automated equilibrium detection algorithm in [Chodera2016].
If series is None, then this function will behave the same as slicing().
- Parameters:
df (DataFrame) – DataFrame to subsample according to equilibrium detection on series.
series (Series) – Series to detect equilibration on. If None, no equilibrium detection-based subsampling will be performed.
lower (float) – Lower bound to pre-slice series data from.
upper (float) – Upper bound to pre-slice series to (inclusive).
step (int) – Step between series items to pre-slice by.
drop_duplicates (bool) – Drop the duplicated lines based on time.
sort (bool) – Sort the DataFrame based on the time column.
- Returns:
df subsampled according to subsampled series.
- Return type:
DataFrame
Notes
Please cite [Chodera2016] when you use this function in published work.
See also
pymbar.timeseries.detect_equilibration
detailed background
pymbar.timeseries.subsample_correlated_data
used for subsampling
Changed in version 1.0.0: Added the drop_duplicates and sort keywords to unify the behaviour between statistical_inefficiency() and equilibrium_detection().
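The idea behind automated equilibrium detection can be sketched in plain Python: for each candidate start frame t0, estimate the statistical inefficiency g of the remaining data and keep the t0 that maximizes the effective number of uncorrelated samples, (N - t0) / g. The code below is a simplified illustration under stated assumptions, not the pymbar implementation: the autocorrelation sum is truncated at the first non-positive value after a few lags, and the helper names, the nskip stride and the synthetic data are all hypothetical.

```python
import random


def statistical_inefficiency(x, mintime=3):
    """Simplified estimate of g = 1 + 2 * sum_t (1 - t/N) C(t)."""
    n = len(x)
    mean = sum(x) / n
    dx = [v - mean for v in x]
    var = sum(d * d for d in dx) / n
    if var == 0.0:
        return 1.0
    g = 1.0
    for t in range(1, n - 1):
        # Normalized autocorrelation at lag t.
        c = sum(dx[i] * dx[i + t] for i in range(n - t)) / ((n - t) * var)
        # Stop once the correlation has decayed to noise.
        if c <= 0.0 and t > mintime:
            break
        g += 2.0 * c * (1.0 - t / n)
    return max(g, 1.0)


def detect_equilibration(x, nskip=10):
    """Return (t0, g, n_eff) maximizing the effective sample count."""
    n = len(x)
    best_t0, best_g, best_neff = 0, 1.0, 0.0
    for t0 in range(0, n - 1, nskip):
        g = statistical_inefficiency(x[t0:])
        n_eff = (n - t0) / g
        if n_eff > best_neff:
            best_t0, best_g, best_neff = t0, g, n_eff
    return best_t0, best_g, best_neff


# Synthetic series: a 30-frame relaxation from 50 to 0, then stationary noise.
random.seed(0)
x = [50.0 - 50.0 * k / 30 + random.gauss(0.0, 1.0) for k in range(30)]
x += [random.gauss(0.0, 1.0) for _ in range(300)]

t0, g, n_eff = detect_equilibration(x)
```

Frames before t0 are treated as burn-in and discarded; the remaining frames are then subsampled according to g.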