dataset#

This field is used to store information about the sources of the train/valid/test set. It also controls whether the constructed graphs are saved locally, whether they are loaded from specified files, and which keyscare used to read training labels from the input data.

Note

Below is an example of usage. If a parameter comes from the internal implementation of TACE, it may not be the most up-to-date. For the latest parameters, please refer to the corresponding configuration files on GitHub. A complete list of parameters along with detailed explanations is provided.

Example#

dataset:
  neighborlist_backend: matscipy # [ase, vesin, matscipy] recommend matscipy
  storage_mode: memory # memory or lmdb, if your dataset is large (> 100w), the recommended approach is to use lmdb, as this avoids repeatedly constructing the graph.
  shard_dirs: # if lmdb model, specify a list of path where you save you graph
    - graphCache
  # If using LMDB mode, you must allocate a reasonable pre-storage size.
  # Generally, if the number of neighboring atoms is around 60, about 200 KB per graph is sufficient.
  shard_size: 10000 # number of graphs stored per file in LMDB mode, should be large, here is just an example, if small, it will be slow and may be error
  cache_size: 1024 # cache for faster load data from dataloader
  avg_graph_size_in_KB: 200 # in KB, the total disk usage equals the size of this multiplied by the total number of your structures.
  lmdb_wait_timeout: 86400 # in seconds; when training with multi-GPU, the maximum waiting time.
  force_dtype: null # 32 or 64, if your data grapg saved in lmdb, but original float64, you can force to float32 for convenience
  type: ase # ase or ase-db
  split_seed: ${misc.global_seed} # this random seed is useful if auto split
  train_file: dataset/train_300K.xyz
  valid_file: null
  #test_files: # list of test file or null
  #   - dataset/test_300K.xyz
  #   - dataset/test_600K.xyz
  #   - dataset/test_1200K.xyz
  #   - dataset/test_dih.xyz
  valid_ratio: 0.1 # auto split from train file if valid file is null, The priority order is:  no_valid_set > valid_file > valid_from_index > ``valid_ratio.
  valid_from_index: false # split train and val from train.index and valid.index in current directory
  no_valid_set: false # The prerequisite for enabling this is that you are using a learning rate scheduler that does not depend on the validation set.
  keys: # all default key name is the property name
    level_key: fidelity_idx
    energy_key: energy
    forces_key: forces
    stress_key: stress
    virials_key: virials
    charges_key: charges
    total_charge_key: total_charge
    spin_multiplicity_key: spin_multiplicity
    temperature_key: temperature
    electron_temperature_key: electron_temperature
    direct_forces_key: forces
    direct_stress_key: stress
    direct_virials_key: virials
    polarization_key: polarization
    direct_dipole_key: dipole
    conservative_dipole_key: dipole
    direct_polarizability_key: polarizability
    conservative_polarizability_key: polarizability
    born_effective_charges_key: born_effective_charges
    electric_field_key: electric_field
    magnetic_field_key: magnetic_field
    collinear_magmoms_key: collinear_magmoms
    noncollinear_magmoms_key: noncollinear_magmoms
    total_collinear_magmoms_key: total_collinear_magmoms
    total_noncollinear_magmoms_key: total_noncollinear_magmoms
    collinear_magnetic_forces_key: collinear_magnetic_forces
    noncollinear_magnetic_forces_key: noncollinear_magnetic_forces

Note

  • The priority order is: no_valid_set > valid_file > valid_from_index > valid_ratio.