Subsampling

Functions for subsampling datasets.

The functions featured in this module can be used to easily subsample either dHdl or u_nk datasets to give less correlated timeseries.

High-level functions

Two high-level functions, decorrelate_u_nk() and decorrelate_dhdl(), can be used to preprocess u_nk or dHdl data automatically. The following code removes an initial “burn-in” period and decorrelates the data.

>>> from alchemtest.gmx import load_benzene
>>> from alchemlyb.parsing.gmx import extract_u_nk, extract_dHdl
>>> from alchemlyb.preprocessing.subsampling import (decorrelate_u_nk,
...     decorrelate_dhdl)
>>> bz = load_benzene().data
>>> # bz['Coulomb'] lists one file per lambda window; use the first window
>>> u_nk = extract_u_nk(bz['Coulomb'][0], T=300)
>>> # 'dE' and 'all' are the supported methods; 'dhdl' is deprecated
>>> decorrelated_u_nk = decorrelate_u_nk(u_nk, method='dE',
...     remove_burnin=True)
>>> dhdl = extract_dHdl(bz['Coulomb'][0], T=300)
>>> decorrelated_dhdl = decorrelate_dhdl(dhdl, remove_burnin=True)

Low-level functions

To decorrelate the data, a pandas.Series is needed for the autocorrelation analysis in addition to the DataFrame that contains the dHdl or u_nk. The series can be generated with u_nk2series() or dhdl2series() and fed into statistical_inefficiency() or equilibrium_detection().

>>> from alchemtest.gmx import load_benzene
>>> from alchemlyb.parsing.gmx import extract_u_nk, extract_dHdl
>>> from alchemlyb.preprocessing.subsampling import (u_nk2series,
...     dhdl2series, statistical_inefficiency, equilibrium_detection)
>>> bz = load_benzene().data
>>> # bz['Coulomb'] lists one file per lambda window; use the first window
>>> u_nk = extract_u_nk(bz['Coulomb'][0], T=300)
>>> u_nk_series = u_nk2series(u_nk, method='dE')
>>> # subsample using the statistical inefficiency only
>>> subsampled_u_nk = statistical_inefficiency(u_nk, series=u_nk_series)
>>> # or remove the unequilibrated frames as well
>>> equilibrated_u_nk = equilibrium_detection(u_nk, series=u_nk_series)
>>> dhdl = extract_dHdl(bz['Coulomb'][0], T=300)
>>> dhdl_series = dhdl2series(dhdl)
>>> subsampled_dhdl = statistical_inefficiency(dhdl, series=dhdl_series)
>>> equilibrated_dhdl = equilibrium_detection(dhdl, series=dhdl_series)

API Reference

alchemlyb.preprocessing.subsampling.decorrelate_u_nk(df: DataFrame, method: str = 'dE', drop_duplicates: bool = True, sort: bool = True, remove_burnin: bool = False, **kwargs: Any) → DataFrame[source]

Subsample a u_nk DataFrame based on the selected method.

The method can be either ‘all’ (obtained as a sum over all energy components) or ‘dE’. In the latter case the energy differences \(dE_{i,i+1}\) (\(dE_{i,i-1}\) for the last lambda) are used. This function wraps statistical_inefficiency() or equilibrium_detection(), depending on remove_burnin.

Parameters:

df (DataFrame) – DataFrame to be subsampled according to the selected method.
method ({'all', 'dE'}) – Method for decorrelating the data.
drop_duplicates (bool) – Drop the duplicated lines based on time.
sort (bool) – Sort the Dataframe based on the time column.
remove_burnin (bool) – Whether to perform equilibrium detection (True) or just do statistical inefficiency (False). Added in version 1.0.0.
**kwargs – Additional keyword arguments for statistical_inefficiency() or equilibrium_detection().

Returns:

df subsampled according to selected method.

Return type:

pandas.DataFrame

Note

The default of True for drop_duplicates and sort should result in robust decorrelation but can lose data.

Added in version 0.6.0.

Changed in version 1.0.0: Add the remove_burnin keyword to allow unequilibrated frames to be removed. Rename the method value ‘dhdl_all’ to ‘all’ and deprecate ‘dhdl’.
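
To illustrate what drop_duplicates and sort guard against, the following sketch removes rows with repeated time stamps and restores time order in a toy frame using plain pandas. The index level name fep-lambda, the column labels, and the choice to keep the first of two duplicated rows are assumptions made for this example; it does not reproduce the library's internal code.

```python
import pandas as pd

# Toy frame indexed by (time, fep-lambda): time 20.0 occurs twice
# and the rows are not in time order.
index = pd.MultiIndex.from_tuples(
    [(0.0, 0.0), (20.0, 0.0), (10.0, 0.0), (20.0, 0.0)],
    names=["time", "fep-lambda"],
)
df = pd.DataFrame({0.0: [0.0, 0.0, 0.0, 0.0], 1.0: [1.2, 1.5, 1.1, 1.7]}, index=index)

# Drop rows whose time stamp was already seen (assumption: keep the first one).
times = df.index.get_level_values("time")
deduplicated = df[~times.duplicated(keep="first")]

# Restore ascending time order.
cleaned = deduplicated.sort_index(level="time")
```

One of the two rows at time 20.0 is discarded, which is the kind of data loss the note above refers to.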

alchemlyb.preprocessing.subsampling.decorrelate_dhdl(df: DataFrame, drop_duplicates: bool = True, sort: bool = True, remove_burnin: bool = False, **kwargs: Any) → DataFrame[source]

Subsample a dhdl DataFrame. This function wraps statistical_inefficiency() or equilibrium_detection(), depending on remove_burnin.

Parameters:

df (DataFrame) – DataFrame to subsample.
drop_duplicates (bool) – Drop the duplicated lines based on time.
sort (bool) – Sort the Dataframe based on the time column.
remove_burnin (bool) – Whether to perform equilibrium detection (True) or just do statistical inefficiency (False). Added in version 1.0.0.
**kwargs – Additional keyword arguments for statistical_inefficiency() or equilibrium_detection().

Returns:

df subsampled.

Return type:

pandas.DataFrame

Note

The default of True for drop_duplicates and sort should result in robust decorrelation but can lose data.

Added in version 0.6.0.

Changed in version 1.0.0: Add the remove_burnin keyword to allow unequilibrated frames to be removed.

alchemlyb.preprocessing.subsampling.u_nk2series(df: DataFrame, method: str = 'dE') → Series[source]

Convert a u_nk DataFrame into a series based on the selected method for subsampling.

The method can be either ‘all’ (obtained as a sum over all energy components) or ‘dE’. In the latter case the energy differences \(dE_{i,i+1}\) (\(dE_{i,i-1}\) for the last lambda) are used.

Parameters:

df (DataFrame) – DataFrame to be converted according to the selected method.
method ({'all', 'dE'}) – Method for converting the data.

Returns:

series to be used as input for statistical_inefficiency() or equilibrium_detection().

Return type:

pandas.Series

Added in version 1.0.0.

Changed in version 2.0.1: The dE method computes the difference between the current lambda and the next lambda (previous lambda for the last window), instead of using the next lambda or the previous lambda for the last window.
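
The ‘dE’ construction can be sketched for a toy u_nk frame as follows. The helper de_series is hypothetical, the sign convention (neighbouring state minus current state) is an assumption, and the sampled lambda is passed in explicitly here instead of being read from the index of a real u_nk frame.

```python
import pandas as pd


def de_series(u_nk: pd.DataFrame, current: float) -> pd.Series:
    """Energy difference between the sampled state and its neighbour.

    Hypothetical helper for illustration; not part of alchemlyb.
    """
    lambdas = list(u_nk.columns)
    i = lambdas.index(current)
    # Use the next lambda state, or the previous one if `current` is the last.
    neighbour = lambdas[i + 1] if i + 1 < len(lambdas) else lambdas[i - 1]
    return u_nk[neighbour] - u_nk[current]


# Toy reduced energies: rows are frames, columns are lambda states.
u_nk = pd.DataFrame(
    {0.0: [0.0, 0.0, 0.0], 0.5: [1.0, 1.5, 2.0], 1.0: [3.0, 2.0, 4.0]},
    index=pd.Index([0.0, 10.0, 20.0], name="time"),
)

middle = de_series(u_nk, current=0.5)  # u(1.0) - u(0.5)
last = de_series(u_nk, current=1.0)    # u(0.5) - u(1.0): 1.0 is the last state
```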

alchemlyb.preprocessing.subsampling.dhdl2series(df: DataFrame, method: str = 'all') → Series[source]

Convert a dhdl DataFrame to a series for subsampling.

The series is generated by summing over all energy components (axis 1 of df), as for method='all' in u_nk2series(). Commonly, df only contains a single energy component but in some cases (such as using a split protocol in GROMACS), it can contain multiple columns for different energy terms.

Parameters:

df (DataFrame) – DataFrame to be converted to a series.
method ('all') – Only ‘all’ is available; the keyword is provided for compatibility with u_nk2series().

Returns:

series to be used as input for statistical_inefficiency() or equilibrium_detection().

Return type:

pandas.Series

Added in version 1.0.0.
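
The summation over energy components amounts to a row-wise sum in pandas. A minimal sketch with a toy two-component frame follows; the column names coul-lambda and vdw-lambda are made up for the example.

```python
import pandas as pd

# Toy dhdl frame with two energy components per frame.
dhdl = pd.DataFrame(
    {"coul-lambda": [1.0, 2.0, 3.0], "vdw-lambda": [0.5, 0.5, 1.5]},
    index=pd.Index([0.0, 10.0, 20.0], name="time"),
)

# Sum over axis 1 (the columns) to obtain a single value per frame.
series = dhdl.sum(axis=1)
```

The resulting series keeps the index of the input frame, so it has one entry per row of dhdl.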

alchemlyb.preprocessing.subsampling.slicing(df, lower, upper, step, force)

Subsample a DataFrame using simple slicing.

Parameters:

df (pandas.DataFrame) – DataFrame to subsample.
lower (float) – Lower time to slice from.
upper (float) – Upper time to slice to (inclusive).
step (int) – Step between rows to slice by.
force (bool) – Ignore checks that DataFrame is in proper form for expected behavior.

Returns:

df subsampled.

Return type:

pandas.DataFrame

Changed in version 1.0.1: The rows with NaN values are not dropped by default.
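
The effect of lower, upper, and step can be illustrated with label-based slicing on a time index in plain pandas. This is a sketch of the behaviour described by the parameters above, assuming a sorted, single-level time index; it is not the library's implementation.

```python
import pandas as pd

df = pd.DataFrame(
    {"dHdl": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]},
    index=pd.Index([0.0, 10.0, 20.0, 30.0, 40.0, 50.0], name="time"),
)

lower, upper, step = 10.0, 40.0, 2

# Label-based slicing includes the upper bound.
window = df.loc[lower:upper]

# Adding a step keeps every second row inside the window.
sliced = df.loc[lower:upper:step]
```

Note that lower and upper are times, while step counts rows.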

alchemlyb.preprocessing.subsampling.statistical_inefficiency(df: DataFrame | Series, series: None | Series = None, lower: None | float = None, upper: None | float = None, step: None | int = None, conservative: bool = True, drop_duplicates: bool = False, sort: bool = False) → DataFrame | Series[source]

Subsample a DataFrame based on the calculated statistical inefficiency of a timeseries.

If series is None, then this function will behave the same as slicing().

Parameters:

df (pandas.DataFrame) – DataFrame to subsample according to the statistical inefficiency of series.
series (pandas.Series) – Series to use for calculating statistical inefficiency. If None, no statistical inefficiency-based subsampling will be performed.
lower (float) – Lower bound to pre-slice series data from.
upper (float) – Upper bound to pre-slice series to (inclusive).
step (int) – Step between series items to pre-slice by.
conservative (bool) – If True (the default), use ceil(statistical_inefficiency) to slice the data in uniform intervals. If False, sample at non-uniform intervals to closely match the (fractional) statistical inefficiency, as implemented in pymbar.timeseries.subsample_correlated_data().
drop_duplicates (bool) – Drop the duplicated lines based on time.
sort (bool) – Sort the Dataframe based on the time column.

Returns:

df subsampled according to subsampled series.

Return type:

pandas.DataFrame

Warning

The series and the data to be sliced, df, need to have the same number of elements because the statistical inefficiency is calculated based on the index of the series (and not an associated time). At the moment there is no automatic conversion from a time to an index.
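
The reason for this requirement can be seen in a small sketch: the rows to keep are chosen as positions within the series and then applied to df by position, so the two objects must line up row by row. The value of g and the explicit length check are assumptions for illustration.

```python
import math

import pandas as pd

df = pd.DataFrame(
    {"dHdl": [float(i) for i in range(10)]},
    index=pd.Index([10.0 * i for i in range(10)], name="time"),
)
series = df["dHdl"]

# Positions are derived from the series alone, so the lengths must match.
if len(series) != len(df):
    raise ValueError("series and df must have the same number of rows")

g = 2.4  # assumed statistical inefficiency, counted in frames (not in time)
positions = list(range(0, len(series), math.ceil(g)))

# Apply the positions to the frame by integer location.
subsampled = df.iloc[positions]
```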

Note

For a non-integer statistical inefficiency \(g\), the default value conservative=True will provide _fewer_ data points than allowed by \(g\) and thus error estimates will be _higher_. For large numbers of data points and converged free energies, the choice should not make a difference. For small numbers of data points, conservative=True guards against a false sense of accuracy and is deemed the more careful and conservative approach.
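
A worked example makes the difference concrete. The sketch below contrasts a uniform stride of ceil(g) with non-uniform sampling at rounded multiples of g. The helper subsample_indices is hypothetical and the rounding rule is an assumption chosen for illustration, not taken from the pymbar source.

```python
import math


def subsample_indices(n_samples: int, g: float, conservative: bool = True) -> list[int]:
    """Indices of approximately uncorrelated samples (illustrative helper)."""
    if conservative:
        # Uniform stride, rounded up so that samples are at least g apart.
        return list(range(0, n_samples, math.ceil(g)))
    # Non-uniform: take the index closest to each multiple of g.
    indices: list[int] = []
    n = 0
    while round(n * g) < n_samples:
        idx = round(n * g)
        if not indices or idx != indices[-1]:
            indices.append(idx)
        n += 1
    return indices


uniform = subsample_indices(100, 2.4, conservative=True)
fractional = subsample_indices(100, 2.4, conservative=False)
```

For 100 frames and g = 2.4, the uniform stride of 3 keeps 34 frames, whereas the fractional scheme keeps 42.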

See also

pymbar.timeseries.statistical_inefficiency: detailed background
pymbar.timeseries.subsample_correlated_data: used for subsampling

Changed in version 0.2.0: The conservative keyword was added and the method is now using pymbar.timeseries.statistical_inefficiency(); previously, the statistical inefficiency was _rounded_ (instead of ceil()) and thus one could end up with correlated data.

Changed in version 1.0.0: Fixed a bug that would effectively ignore the lower and step keywords when returning the subsampled DataFrame object. See issue #198 for more details.

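alchemlyb.preprocessing.subsampling.equilibrium_detection(df, series, lower, upper, step, drop_duplicates, sort)

Subsample a DataFrame using automated equilibrium detection on a timeseries.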

This function uses the pymbar implementation of the simple automated equilibrium detection algorithm in [Chodera2016].

If series is None, then this function will behave the same as slicing().

Parameters:

df (pandas.DataFrame) – DataFrame to subsample according to equilibrium detection on series.
series (pandas.Series) – Series to detect equilibration on. If None, no equilibrium detection-based subsampling will be performed.
lower (float) – Lower bound to pre-slice series data from.
upper (float) – Upper bound to pre-slice series to (inclusive).
step (int) – Step between series items to pre-slice by.
drop_duplicates (bool) – Drop the duplicated lines based on time.
sort (bool) – Sort the Dataframe based on the time column.

Returns:

df subsampled according to subsampled series.

Return type:

pandas.DataFrame

Notes

Please cite [Chodera2016] when you use this function in published work.

See also

pymbar.timeseries.detect_equilibration: detailed background
pymbar.timeseries.subsample_correlated_data: used for subsampling

Changed in version 1.0.0: Add the drop_duplicates and sort keywords to unify the behaviour of statistical_inefficiency() and equilibrium_detection().
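
The idea behind automated equilibrium detection can be sketched in pure Python: for every candidate start frame t0, estimate the statistical inefficiency g of the remaining data and keep the t0 that maximises the effective number of uncorrelated samples, (N - t0) / g. The helpers estimate_g and find_t0 are hypothetical, and the autocorrelation estimator (truncated at the first non-positive value) is a simplification; pymbar's implementation differs in detail.

```python
import random


def estimate_g(x: list[float]) -> float:
    """Estimate g = 1 + 2 * sum_t (1 - t/N) * C(t), stopping at the first C(t) <= 0."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    if var == 0.0:
        return 1.0
    g = 1.0
    for t in range(1, n - 1):
        cov = sum((x[i] - mean) * (x[i + t] - mean) for i in range(n - t)) / (n - t)
        c = cov / var  # normalised autocorrelation at lag t
        if c <= 0.0:
            break
        g += 2.0 * c * (1.0 - t / n)
    return max(g, 1.0)


def find_t0(x: list[float]) -> tuple[int, float, float]:
    """Return (t0, g, n_eff) for the start frame that maximises n_eff."""
    n = len(x)
    best_t0, best_g, best_n_eff = 0, 1.0, 0.0
    for t0 in range(n - 1):
        g = estimate_g(x[t0:])
        n_eff = (n - t0) / g
        if n_eff > best_n_eff:
            best_t0, best_g, best_n_eff = t0, g, n_eff
    return best_t0, best_g, best_n_eff


# Toy series: an offset that decays over the first 20 frames, plus Gaussian noise.
rng = random.Random(0)
series = [max(0.0, 10.0 - 0.5 * t) + rng.gauss(0.0, 1.0) for t in range(120)]
t0, g, n_eff = find_t0(series)
```

Frames before t0 would be discarded as unequilibrated, and the remainder subsampled according to g. For a series with a pronounced initial drift, such as the toy data here, one would expect t0 to be greater than zero.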