Subsampling
Functions for subsampling datasets.
The functions featured in this module can be used to easily subsample either dHdl or u_nk datasets to give less correlated timeseries.
High-level functions
Two high-level functions,
decorrelate_u_nk() and
decorrelate_dhdl(), preprocess the
dHdl or u_nk data automatically.
The following code removes an initial “burn-in” period and
decorrelates the data.
>>> from alchemtest.gmx import load_benzene
>>> from alchemlyb.parsing.gmx import extract_u_nk, extract_dHdl
>>> from alchemlyb.preprocessing.subsampling import (decorrelate_u_nk,
...     decorrelate_dhdl)
>>> bz = load_benzene().data
>>> # bz['Coulomb'] lists one file per lambda window; parse a single window
>>> u_nk = extract_u_nk(bz['Coulomb'][0], T=300)
>>> # method='dhdl' is deprecated; use 'dE' (the default) or 'all'
>>> decorrelated_u_nk = decorrelate_u_nk(u_nk, method='dE',
...     remove_burnin=True)
>>> dhdl = extract_dHdl(bz['Coulomb'][0], T=300)
>>> decorrelated_dhdl = decorrelate_dhdl(dhdl, remove_burnin=True)
Low-level functions
To decorrelate the data, a pandas.Series is needed for the
autocorrelation analysis, in addition to the DataFrame that contains the
dHdl or u_nk. The series can be generated with
u_nk2series() or
dhdl2series() and fed into
statistical_inefficiency() or
equilibrium_detection().
>>> from alchemtest.gmx import load_benzene
>>> from alchemlyb.parsing.gmx import extract_u_nk, extract_dHdl
>>> from alchemlyb.preprocessing.subsampling import (u_nk2series,
...     dhdl2series, statistical_inefficiency, equilibrium_detection)
>>> bz = load_benzene().data
>>> # bz['Coulomb'] lists one file per lambda window; parse a single window
>>> u_nk = extract_u_nk(bz['Coulomb'][0], T=300)
>>> u_nk_series = u_nk2series(u_nk, method='dE')
>>> # either decorrelate only ...
>>> decorrelated_u_nk = statistical_inefficiency(u_nk, series=u_nk_series)
>>> # ... or also discard the unequilibrated initial frames
>>> decorrelated_u_nk = equilibrium_detection(u_nk, series=u_nk_series)
>>> dhdl = extract_dHdl(bz['Coulomb'][0], T=300)
>>> dhdl_series = dhdl2series(dhdl)
>>> decorrelated_dhdl = statistical_inefficiency(dhdl, series=dhdl_series)
>>> decorrelated_dhdl = equilibrium_detection(dhdl, series=dhdl_series)
API Reference
- alchemlyb.preprocessing.subsampling.decorrelate_u_nk(df: DataFrame, method: str = 'dE', drop_duplicates: bool = True, sort: bool = True, remove_burnin: bool = False, **kwargs: Any) -> DataFrame
Subsample a u_nk DataFrame based on the selected method.
The method can be either ‘all’ (obtained as a sum over all energy components) or ‘dE’. In the latter case the energy differences \(dE_{i,i+1}\) (\(dE_{i,i-1}\) for the last lambda) are used. This is a wrapper function around statistical_inefficiency() and equilibrium_detection().
- Parameters:
df (DataFrame) – DataFrame to be subsampled according to the selected method.
method ({'all', 'dE'}) – Method for decorrelating the data.
drop_duplicates (bool) – Drop the duplicated lines based on time.
sort (bool) – Sort the DataFrame based on the time column.
remove_burnin (bool) – Whether to perform equilibrium detection (True) or just do statistical inefficiency (False). Added in version 1.0.0.
**kwargs – Additional keyword arguments for statistical_inefficiency() or equilibrium_detection().
- Returns:
df subsampled according to selected method.
- Return type:
DataFrame
Note
The default of True for drop_duplicates and sort should result in robust decorrelation but can lose data.
Added in version 0.6.0.
Changed in version 1.0.0: Add the remove_burnin keyword to allow unequilibrated frames to be removed. Rename method value ‘dhdl_all’ to ‘all’ and deprecate the ‘dhdl’.
- alchemlyb.preprocessing.subsampling.decorrelate_dhdl(df: DataFrame, drop_duplicates: bool = True, sort: bool = True, remove_burnin: bool = False, **kwargs: Any) -> DataFrame
Subsample a dhdl DataFrame. This is a wrapper function around statistical_inefficiency() and equilibrium_detection().
- Parameters:
df (DataFrame) – DataFrame to subsample.
drop_duplicates (bool) – Drop the duplicated lines based on time.
sort (bool) – Sort the DataFrame based on the time column.
remove_burnin (bool) – Whether to perform equilibrium detection (True) or just do statistical inefficiency (False). Added in version 1.0.0.
**kwargs – Additional keyword arguments for statistical_inefficiency() or equilibrium_detection().
- Returns:
df subsampled.
- Return type:
DataFrame
Note
The default of True for drop_duplicates and sort should result in robust decorrelation but can lose data.
Added in version 0.6.0.
Changed in version 1.0.0: Add the remove_burnin keyword to allow unequilibrated frames to be removed.
- alchemlyb.preprocessing.subsampling.u_nk2series(df: DataFrame, method: str = 'dE') -> Series
Convert a u_nk DataFrame into a series based on the selected method for subsampling.
The method can be either ‘all’ (obtained as a sum over all energy components) or ‘dE’. In the latter case the energy differences \(dE_{i,i+1}\) (\(dE_{i,i-1}\) for the last lambda) are used.
- Parameters:
df (DataFrame) – DataFrame to be converted according to the selected method.
method ({'all', 'dE'}) – Method for converting the data.
- Returns:
series to be used as input for statistical_inefficiency() or equilibrium_detection().
- Return type:
Series
Added in version 1.0.0.
Changed in version 2.0.1: The dE method computes the difference between the current lambda and the next lambda (previous lambda for the last window), instead of using the energies at the next lambda (or the previous lambda for the last window) directly.
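As a rough illustration of the ‘dE’ choice described above, the following plain-Python sketch picks the neighbouring state and forms the energy difference for every frame. It assumes each frame is a list of reduced energies evaluated at every lambda state and that the index of the sampled state is known. The helper name de_series and the numbers are hypothetical, and this is not alchemlyb's implementation.

```python
def de_series(frames, sampled_state):
    """Return u[neighbour] - u[sampled_state] for every frame.

    frames: list of frames; each frame lists the reduced energy evaluated
        at every lambda state (hypothetical layout for this sketch).
    sampled_state: index of the lambda state the frames were sampled at.
    """
    n_states = len(frames[0])
    if n_states < 2:
        raise ValueError("need at least two lambda states")
    # Use the next state; the last window falls back to the previous one.
    if sampled_state + 1 < n_states:
        neighbour = sampled_state + 1
    else:
        neighbour = sampled_state - 1
    return [frame[neighbour] - frame[sampled_state] for frame in frames]


# Two frames, three lambda states (made-up numbers).
frames = [[0.0, 1.5, 3.0], [0.0, 2.0, 4.5]]
first_window = de_series(frames, sampled_state=0)  # uses state 1 - state 0
last_window = de_series(frames, sampled_state=2)   # uses state 1 - state 2
```

The resulting list plays the role of the series that is later analysed for correlations; only its fluctuations matter, not its absolute values.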
- alchemlyb.preprocessing.subsampling.dhdl2series(df: DataFrame, method: str = 'all') -> Series
Convert a dhdl DataFrame to a series for subsampling.
The series is generated by summing over all energy components (axis 1 of df), as for method='all' in u_nk2series(). Commonly, df only contains a single energy component but in some cases (such as using a split protocol in GROMACS), it can contain multiple columns for different energy terms.
- Parameters:
df (DataFrame) – DataFrame to convert.
method ('all') – Only ‘all’ is available; the keyword is provided for compatibility with u_nk2series().
- Returns:
series to be used as input for statistical_inefficiency() or equilibrium_detection().
- Return type:
Series
Added in version 1.0.0.
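The summation over energy components can be pictured with a small plain-Python sketch. It assumes each frame is a mapping from a component name to its dH/dl value; the column names and numbers are made up for illustration, and the helper sum_components is hypothetical rather than part of alchemlyb.

```python
def sum_components(rows):
    """Collapse every frame to one value by summing its energy components."""
    return [sum(row.values()) for row in rows]


# Hypothetical split protocol with separate Coulomb and van der Waals terms.
rows = [
    {"coul-lambda": 1.0, "vdw-lambda": 0.5},
    {"coul-lambda": -0.25, "vdw-lambda": 0.75},
]
series = sum_components(rows)
```

With a single energy component the sum simply returns that component unchanged.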
- alchemlyb.preprocessing.subsampling.slicing(df: DataFrame | Series, lower: None | float = None, upper: None | float = None, step: None | int = None, force: bool = False) -> DataFrame | Series
Subsample a DataFrame using simple slicing.
- Parameters:
df (pandas.DataFrame) – DataFrame to subsample.
lower (float) – Lower time to slice from.
upper (float) – Upper time to slice to (inclusive).
step (int) – Step between rows to slice by.
force (bool) – Ignore checks that DataFrame is in proper form for expected behavior.
- Returns:
df subsampled.
- Return type:
DataFrame | Series
Changed in version 1.0.1: The rows with NaN values are not dropped by default.
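The meaning of lower, upper and step can be sketched without pandas. The example below assumes the data are (time, value) pairs already sorted by time, treats both bounds as inclusive as documented above, and applies step to the rows that remain after the time window is selected. The helper slice_by_time is hypothetical and only mirrors the parameter description.

```python
def slice_by_time(rows, lower=None, upper=None, step=None):
    """Keep rows with lower <= time <= upper, then take every step-th row.

    rows: list of (time, value) pairs sorted by time (assumed layout).
    """
    selected = [
        (t, v)
        for t, v in rows
        if (lower is None or t >= lower) and (upper is None or t <= upper)
    ]
    # A step of None behaves like a step of 1 in Python slicing.
    return selected[::step]


# Frames written every 10 time units from 0 to 100 (made-up data).
rows = [(float(t), t * t) for t in range(0, 101, 10)]
window = slice_by_time(rows, lower=20, upper=60, step=2)
times = [t for t, _ in window]
```

Here the window keeps the frames at times 20, 40 and 60, because the upper bound is included and every second remaining row is taken.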
- alchemlyb.preprocessing.subsampling.statistical_inefficiency(df: DataFrame | Series, series: None | Series = None, lower: None | float = None, upper: None | float = None, step: None | int = None, conservative: bool = True, drop_duplicates: bool = False, sort: bool = False) -> DataFrame | Series
Subsample a DataFrame based on the calculated statistical inefficiency of a timeseries.
If series is None, then this function will behave the same as slicing().
- Parameters:
df (pandas.DataFrame) – DataFrame to subsample according to the statistical inefficiency of series.
series (pandas.Series) – Series to use for calculating statistical inefficiency. If None, no statistical inefficiency-based subsampling will be performed.
lower (float) – Lower bound to pre-slice series data from.
upper (float) – Upper bound to pre-slice series to (inclusive).
step (int) – Step between series items to pre-slice by.
conservative (bool) – True uses ceil(statistical_inefficiency) to slice the data in uniform intervals (the default). False will sample at non-uniform intervals to closely match the (fractional) statistical inefficiency, as implemented in pymbar.timeseries.subsample_correlated_data().
drop_duplicates (bool) – Drop the duplicated lines based on time.
sort (bool) – Sort the DataFrame based on the time column.
- Returns:
df subsampled according to subsampled series.
- Return type:
DataFrame | Series
Warning
The series and the data to be sliced, df, need to have the same number of elements because the statistical inefficiency is calculated based on the index of the series (and not an associated time). At the moment there is no automatic conversion from a time to an index.
Note
For a non-integer statistical inefficiency \(g\), the default value conservative=True will provide fewer data points than allowed by \(g\) and thus error estimates will be higher. For large numbers of data points and converged free energies, the choice should not make a difference. For small numbers of data points, conservative=True reduces the risk of a false sense of accuracy and is deemed the more careful and conservative approach.
See also
pymbar.timeseries.statistical_inefficiency: detailed background
pymbar.timeseries.subsample_correlated_data: used for subsampling
Changed in version 0.2.0: The conservative keyword was added and the method is now using pymbar.timeseries.statistical_inefficiency(); previously, the statistical inefficiency was rounded (instead of ceil()) and thus one could end up with correlated data.
Changed in version 1.0.0: Fixed a bug that would effectively ignore the lower and step keywords when returning the subsampled DataFrame object. See issue #198 for more details.
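The statistical inefficiency is \(g = 1 + 2\tau\), where \(\tau\) is the autocorrelation time in units of the sampling interval, so roughly \(N/g\) of \(N\) frames are independent. The sketch below illustrates how the two settings of conservative select frame indices for a given \(g\). It follows the description above (a uniform stride of ceil(g) versus indices near multiples of the fractional g); the function name subsample_indices is hypothetical, and the code is a plain-Python illustration rather than the pymbar or alchemlyb source.

```python
import math


def subsample_indices(n_frames, g, conservative=True):
    """Return the frame indices kept for statistical inefficiency g."""
    if g < 1.0:
        raise ValueError("the statistical inefficiency must be >= 1")
    if conservative:
        # Uniform stride, rounded up so that kept frames are at least g apart.
        return list(range(0, n_frames, math.ceil(g)))
    indices = []
    n = 0
    # Non-uniform spacing: take the frame closest to each multiple of g.
    while round(n * g) < n_frames:
        t = round(n * g)
        if not indices or t != indices[-1]:  # never pick a frame twice
            indices.append(t)
        n += 1
    return indices


uniform = subsample_indices(12, 2.4, conservative=True)
fractional = subsample_indices(12, 2.4, conservative=False)
```

For 12 frames and g = 2.4 the conservative choice keeps four frames (stride 3), whereas the fractional choice keeps five, which shows why conservative=True yields fewer points and larger error estimates.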
- alchemlyb.preprocessing.subsampling.equilibrium_detection(df: DataFrame | Series, series: None | Series = None, lower: None | float = None, upper: None | float = None, step: None | int = None, drop_duplicates: bool = False, sort: bool = False) -> DataFrame | Series
Subsample a DataFrame using automated equilibrium detection on a timeseries.
This function uses the pymbar implementation of the simple automated equilibrium detection algorithm in [Chodera2016].
If series is None, then this function will behave the same as slicing().
- Parameters:
df (pandas.DataFrame) – DataFrame to subsample according to equilibrium detection on series.
series (pandas.Series) – Series to detect equilibration on. If None, no equilibrium detection-based subsampling will be performed.
lower (float) – Lower bound to pre-slice series data from.
upper (float) – Upper bound to pre-slice series to (inclusive).
step (int) – Step between series items to pre-slice by.
drop_duplicates (bool) – Drop the duplicated lines based on time.
sort (bool) – Sort the DataFrame based on the time column.
- Returns:
df subsampled according to subsampled series.
- Return type:
DataFrame | Series
Notes
Please cite [Chodera2016] when you use this function in published work.
See also
pymbar.timeseries.detect_equilibration: detailed background
pymbar.timeseries.subsample_correlated_data: used for subsampling
Changed in version 1.0.0: Add the drop_duplicates and sort keywords to unify the behaviour between statistical_inefficiency() and equilibrium_detection().
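The idea behind automated equilibrium detection can be sketched in plain Python: for every candidate start frame t0, estimate the statistical inefficiency g of the remaining data and keep the t0 that maximises the effective number of uncorrelated samples, (N - t0) / g. The code below is a simplified illustration under stated assumptions: the autocorrelation sum is truncated at the first non-positive value, every frame is tried as t0, and both function names are hypothetical. It is not the pymbar or alchemlyb implementation, which should be used for real analyses.

```python
def statistical_inefficiency_estimate(series):
    """Estimate g = 1 + 2*tau from the normalised autocorrelation function."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    var = sum(d * d for d in dev) / n
    if var == 0.0:
        return 1.0  # constant data: return 1 to avoid dividing by zero
    g = 1.0
    for lag in range(1, n - 1):
        c = sum(dev[i] * dev[i + lag] for i in range(n - lag)) / ((n - lag) * var)
        if c <= 0.0:
            break  # truncate the sum at the first non-positive value
        g += 2.0 * c * (1.0 - lag / n)
    return max(g, 1.0)


def detect_equilibration_sketch(series):
    """Return (t0, g, n_eff) that maximises the effective sample count."""
    n = len(series)
    best = (0, 1.0, 0.0)
    for t0 in range(n - 1):
        g = statistical_inefficiency_estimate(series[t0:])
        n_eff = (n - t0) / g
        if n_eff > best[2]:
            best = (t0, g, n_eff)
    return best


# Synthetic data: a linear relaxation over 50 frames, then stationary noise.
transient = [10.0 * (1.0 - i / 50) for i in range(50)]
stationary = [1.0 if i % 2 == 0 else -1.0 for i in range(200)]
data = transient + stationary
t0, g, n_eff = detect_equilibration_sketch(data)
equilibrated = data[t0:]  # frames kept after discarding the burn-in
```

In this synthetic case the detected start lies inside the relaxation period rather than at frame 0, because including the early drift inflates g and lowers the effective sample count.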