README
Introduction
Description
This data set is generated with DPGEN [1,2], a concurrent learning scheme that generates the DP training dataset iteratively.
The set of data contains structure-to-energy-force labels for 20 semiconductors, namely, Si, Ge, SiC, BAs, BN, AlN, AlP, AlAs, InP, InAs, InSb, GaN, GaP, GaAs, CdTe, InTe-In2Te3, CdSe-CdSe2, InSe-In2Se3, ZnS, CdS-CdS2. The total number of frames is 151,715.
DFT
The DFT-based electronic structure calculations are computed by ABACUS with numerical atomic orbitals and used triple-zeta plus double polarization (TZDP) basis sets. The generalized gradient approximation (GGA) in the form of the Perdew–Burke–Ernzerhof (PBE) was used for the exchange–correlation functional.
The norm-conserving pseudopotentials were employed and the valence configurations were [B]2s2p, [C]2s2p, [N]2s2p, [Al]3s3p, [Si]3s3p, [P]3s3p, [S]3s3p, [Zn]3s3p3d4s, [Ga]4s4p, [Ge]4s4p, [As]4s4p, [Se]4s4p, [Cd]4s4p4d5s, [In]4d 5s5p, [Sb]5s5p, and [Te]4d 5s5p, respectively.
The numerical atomic orbitals were 3s3p2d for B, 3s3p2d for C, 3s3p2d for N, 3s3p2d for Al, 3s3p2d for Si, 3s3p2d for P, 3s3p2d for S, 6s3p3d2f for Zn, s3p2d for Ga, 3s3p2d for Ge, 3s3p2d for As, 3s3p2d for Se, 6s3p3d2f for Cd, 3s3p3d2f for In, 3s3p2d for Sb, and 3s3p3d2f for Te, respectively.
The cutoff orbitals are set to 9 a.u.. The energy cutoff is set to 100 Ry. The K-points are set using the Monkhorst–Pack mesh with the spacing 0.08 1/a.u.. We employed the Gaussian smearing method with the smearing widths of 0.002 Ry. The electronic iteration convergence threshold was set to 10.
Data generation
se_e2_a model is used in DP-GEN. DP-GEN temperature upto 1 Tm.
Data Format
The directory tree is as follows:
sys_dir (optional) -- train.txt -- valid.txt sys_dir_downstream (optional) -- train.txt -- valid.txt *other_data_dirs
The dataset encompasses upstream and/or downstream splits, which are used for training DPA [3] models. Each system is represented as a single line in *.txt files in the DeePMD [4] format for training/validation. To obtain system paths from unstructured *, one can refer to these *.txt files.
A pbc unit system in DeePMD format usually has the following substructure:
-- system -- -- type_map.raw -- -- type.raw -- -- set.000/box.npy -- -- set.000/coord.npy -- -- set.000/energy.npy -- -- set.000/force.npy -- -- set.000/virial.npy
Format Description
Name | Property | Raw file | Unit | Shape | Description |
---|---|---|---|---|---|
type | Atom type indexes | type.raw | Natoms | Integers that start with 0, represent the atomic type corresponding to type_map.raw | |
type_map | Atom type names | type_map.raw | Ntypes | Atom names that map to atom type, which is unnecessart to be contained in the periodic table | |
coord | Atomic coordinates | coord.npy | Å | Nframes * Natoms * 3 | The atomic coordinates |
box | Boxes | box.npy | Å | Nframes * 3 * 3 | The box axes in the order XX XY XZ YX YY YZ ZX ZY ZZ |
energy | Frame energies | energy.npy | eV | Nframes | The potential energy of snapshot |
force | Atomic forces | force.npy | eV/Å | Nframes * Natoms * 3 | The atomic forces |
virial | Frame virial | virial.npy | eV | Nframes * 9 | The virial frames are in the order XX XY XZ YX YY YZ ZX ZY ZZ |
Check here for more details.
References
[1] Zhang, Linfeng, et al. "Active learning of uniformly accurate interatomic potentials for materials simulation." Physical Review Materials 3.2 (2019): 023804.
[2] Zhang, Yuzhi, et al. "DP-GEN: A concurrent learning platform for the generation of reliable deep learning based potential energy models." Computer Physics Communications 253 (2020): 107206.
[3] Zhang, Duo, et al. "DPA-1: Pretraining of Attention-based Deep Potential Model for Molecular Simulation." arXiv preprint arXiv:2208.08236 (2022).
[4] Wang, Han, et al. "DeePMD-kit: A deep learning package for many-body potential energy representation and molecular dynamics." Computer Physics Communications 228 (2018): 178-184.