Parsing data files

alchemlyb features parsing submodules for getting raw data from different software packages into common data structures that can be used directly by its subsamplers and estimators. Each submodule features at least two functions, namely:

extract_dHdl()

Extract the gradient of the Hamiltonian, \(\frac{dH}{d\lambda}\), for each timestep of the sampled state. Required input for TI-based estimators.

extract_u_nk()

Extract reduced potentials, \(u_{nk}\), for each timestep of the sampled state and all neighboring states. Required input for FEP-based estimators.

extract()

Extract both reduced potentials and the gradient of the Hamiltonian, \(u_{nk}\) and \(\frac{dH}{d\lambda}\), in the form of a dictionary 'dHdl': Series,  'u_nk': DataFrame. Required input for FEP-based estimators and TI-based estimators.

These functions have a consistent interface across all submodules, often taking a single file as input and any additional parameters required for giving either dHdl or u_nk in standard form.

Standard forms of raw data

All components of alchemlyb are designed to work together well with minimal work on the part of the user. To make this possible, the library deals in a common data structure for each dHdl and u_nk, and all parsers yield these quantities in these standard forms. The common data structure is a pandas.DataFrame. Normally, it should be sufficient to just pass the dHdl and u_nk dataframes from one alchemlyb function to the next. However, being a DataFrame provides enormous flexibility if the data need to be reorganized or transformed because of the powerful tools that pandas makes available to manipulate these data structures.

Warning

When alchemlyb dataframes are transformed with standard pandas functions (such as pandas.concat()), care needs to be taken to ensure that alchemlyb metadata, which are stored in the dataframe, are maintained and propagated during processing of alchemlyb dataframes. See metadata propagation for how do work with dataframes safely in alchemlyb.

The metadata (such as the unit of the energy and temperature) are stored in pandas.DataFrame.attrs, a dict. Functions in alchemlyb are aware of these metadata but working with the data using pandas requires some care due to shortcomings in how pandas currently handles metadata (see issue pandas-dev/pandas#28283).

Serialisation

Alchemlyb data structures (dHdl and u_nk) can be serialized as dataframes and made persistent. We use the parquet format for serializing (writing) to a file and de-serializing (reading) from a parquet file.

For serialization we simply use the pandas.DataFrame.to_parquet() method of a pandas.DataFrame. For loading alchemlyb data we provide the alchemlyb.parsing.parquet.extract_dHdl() and alchemlyb.parsing.parquet.extract_u_nk() functions as shown in the example:

from alchemlyb.parsing.parquet import extract_dHdl, extract_u_nk
import pandas as pd

u_nk.to_parquet(path='u_nk.parquet', index=True)
dHdl.to_parquet(path='dHdl.parquet', index=True)

new_u_nk = extract_u_nk('u_nk.parquet', T=300)
new_dHdl = extract_dHdl('dHdl.parquet', T=300)

Note

Serialization of pandas.DataFrame to parquet file is only allowed for pandas>=2, whereas the deserialization is permitted for any pandas version.

dHdl standard form

All parsers yielding dHdl gradients return this as a pandas.DataFrame with the following structure:

                                   coul        vdw
time  coul-lambda vdw-lambda
  0.0 0.0         0.0         10.264125  -0.522539
  1.0 0.0         0.0          9.214077  -2.492852
  2.0 0.0         0.0         -8.527066  -0.405814
  3.0 0.0         0.0         11.544028  -0.358754
..... ...         ...         .........  .........
 97.0 1.0         1.0        -10.681702 -18.603644
 98.0 1.0         1.0         29.518990  -4.955664
 99 0 1.0         1.0         -3.833667  -0.836967
100.0 1.0         1.0        -12.835707   0.786278

This is a multi-index DataFrame, giving time for each sample as the outermost index, and the value of each \(\lambda\) from which the sample came as subsequent indexes. The columns of the DataFrame give the value of \(\frac{dH}{d\lambda}\) with respect to each of these separate \(\lambda\) parameters.

For datasets that sample with only a single \(\lambda\) parameter, then the DataFrame will feature only a single column perhaps like:

                        fep
time  fep-lambda
  0.0 0.0         10.264125
  1.0 0.0          9.214077
  2.0 0.0         -8.527066
  3.0 0.0         11.544028
..... ...         .........
 97.0 1.0        -10.681702
 98.0 1.0         29.518990
 99 0 1.0         -3.833667
100.0 1.0        -12.835707

u_nk standard form

All parsers yielding u_nk reduced potentials return this as a pandas.DataFrame with the following structure:

                                 (0.0, 0.0) (0.25, 0.0) (0.5, 0.0)  ...  (1.0, 1.0)
time  coul-lambda vdw-lambda
  0.0 0.0         0.0         -22144.50   -22144.24  -22143.98        -21984.81
  1.0 0.0         0.0         -21985.24   -21985.10  -21984.96        -22124.26
  2.0 0.0         0.0         -22124.58   -22124.47  -22124.37        -22230.61
  3.0 1.0         0.1         -22230.65   -22230.63  -22230.62        -22083.04
..... ...         ...         .........   .........  .........  ...   .........
 97.0 1.0         1.0         -22082.29   -22082.54  -22082.79        -22017.42
 98.0 1.0         1.0         -22087.57   -22087.76  -22087.94        -22135.15
 99.0 1.0         1.0         -22016.69   -22016.93  -22017.17        -22057.68
100.0 1.0         1.0         -22137.19   -22136.51  -22135.83        -22101.26

This is a multi-index DataFrame, giving time for each sample as the outermost index, and the value of each \(\lambda\) from which the sample came as subsequent indexes. The columns of the DataFrame give the value of \(u_{nk}\) for each set of \(\lambda\) parameters values were recorded for. Column labels are the values of the \(\lambda\) parameters as a tuple in the same order as they appear in the multi-index.

For datasets that sample only a single \(\lambda\) parameter, then the DataFrame will feature only a single index in addition to time, with the values of \(\lambda\) for which reduced potentials were recorded given as column labels:

                        0.0        0.25        0.5  ...         1.0
time  fep-lambda
  0.0 0.0         -22144.50   -22144.24  -22143.98        -21984.81
  1.0 0.0         -21985.24   -21985.10  -21984.96        -22124.26
  2.0 0.0         -22124.58   -22124.47  -22124.37        -22230.61
  3.0 1.0         -22230.65   -22230.63  -22230.62        -22083.04
..... ...         .........   .........  .........  ...   .........
 97.0 1.0         -22082.29   -22082.54  -22082.79        -22017.42
 98.0 1.0         -22087.57   -22087.76  -22087.94        -22135.15
 99.0 1.0         -22016.69   -22016.93  -22017.17        -22057.68
100.0 1.0         -22137.19   -22136.51  -22135.83        -22101.26

A note on units

alchemlyb reads input files in native energy units and converts them to a common unit, the energy measured in \(k_B T\), where \(k_B\) is Boltzmann’s constant and \(T\) is the thermodynamic absolute temperature in Kelvin. Therefore, all parsers require specification of \(T\).

Throughout alchemlyb, the metadata, such as the energy unit and temperature of the dataset, are stored as a dictionary in pandas.DataFrame.attrs metadata attribute. The keys of the attrs dictionary are

"temperature"

the temperature at which the simulation was performed, in Kelvin

"energy_unit"

the unit of energy, such as “kT”, “kcal/mol”, “kJ/mol” (as defined in units)

Conversion functions in alchemlyb.postprocessing and elsewhere may use the metadata for unit conversion and other transformations.

As the following example shows, after parsing of data files, the energy unit is “kT”, i.e., the \(\partial H/\partial\lambda\) timeseries is measured in multiples of \(k_B T\) per lambda step:

>>> from alchemtest.gmx import load_benzene
>>> from alchemlyb.parsing.gmx import extract_dHdl
>>> dataset = load_benzene()
>>> dhdl = extract_dHdl(dataset['data']['Coulomb'][0], 310)
>>> dhdl.attrs['temperature']
310
>>> dhdl.attrs['energy_unit']
'kT'

Also, although parsers will extract timestamps from input data, these are taken as-is and the library does not have any awareness of units for these. Keep this in mind when doing, e.g. subsampling.

Metadata Propagation

The metadata is stored in pandas.DataFrame.attrs. Though common pandas functions can safely propagate the metadata, the metadata might get lost during some operations such as concatenation (pandas-dev/pandas#28283). alchemlyb.concat() is provided to replace pandas.concat() allowing the safe propagation of metadata.

>>> import alchemlyb
>>> from alchemtest.gmx import load_benzene
>>> from alchemlyb.parsing.gmx import extract_dHdl
>>> dataset = load_benzene().data
>>> dhdl_coul = alchemlyb.concat([extract_dHdl(xvg, T=300) for xvg in dataset['Coulomb']])
>>> dhdl_coul.attrs
{'temperature': 300, 'energy_unit': 'kT'}
alchemlyb.concat(objs, *args, **kwargs)

Concatenate pandas objects while persevering the attrs.

Concatenate pandas objects along a particular axis with optional set logic along the other axes. If all pandas objects have the same attrs attribute, the new pandas objects would have this attrs attribute. A ValueError would be raised if any pandas object has a different attrs.

Parameters:

objs – A sequence or mapping of Series or DataFrame objects.

Returns:

Concatenated pandas object.

Return type:

DataFrame

Raises:

ValueError – If not all pandas objects have the same attrs.

See also

pandas.concat

Added in version 0.5.0.

Changed in version 1.0.1: When input is single dataframe, it will be sent out directly.

Although all functions in alchemlyb will safely propagate the metadata, if the user is interested in writing custom data manipulation functions, a decorator alchemlyb.pass_attrs() could be used to pass the metadata from the input data frame (first positional argument) to the output dataframe to ensure safe propagation of metadata.

>>> from alchemlyb import pass_attrs
>>> @pass_attrs
>>> def manipulation(dataframes, *args, **kwargs):
>>>     return func(dataframes, *args, **kwargs)
alchemlyb.pass_attrs(func)

Pass the attrs from the first positional argument to the output dataframe.

Added in version 0.5.0.

Parsers by software package

alchemlyb tries to provide parser functions for as many simulation packages as possible. See the documentation for the package you are using for more details on parser usage, including the assumptions parsers make and suggestions for how output data should be structured for ease of use:

gmx

Parsers for extracting alchemical data from Gromacs output files.

amber

Parsers for extracting alchemical data from AMBER output files.

namd

Parsers for extracting alchemical data from NAMD output files.

gomc

Parsers for extracting alchemical data from GOMC output files.

parquet