OpenLAM | 2024 Q0 Report

Posted on 2024-01-10 in OpenLAM

The slogan of OpenLAM is "Conquer the Periodic Table!" By establishing an open-source ecosystem around large microscale models, we hope to provide new infrastructure for microscale scientific research and to drive the transformation of microscale industrial design in fields such as materials, energy, and biopharmaceuticals. Relevant models, data, and workflows will be consolidated on AIS Square, and related software development will take place in the DeepModeling open-source community. We also welcome open collaboration with other communities on model development, data sharing, evaluation, and testing.

See AIS Square for more details.

Model Structure

The DPA-2 model structure (PyTorch-based) has been released, showing significantly better fitting accuracy and transferability than DPA-1 (arXiv:2312.15492).
A new capability for unsupervised denoising pretraining has been added (DOI:10.5281/zenodo.10483908).

Data

The DPA-2 paper includes pretraining data for 18 systems and downstream data for 10 systems, covering over ten million frames and 73 elements (for the detailed data inventory, see below; the data can also be downloaded directly from DOI:10.5281/zenodo.10483908).
Four new energy and force datasets have been added, covering electrolytes, solid-state electrolytes, chemical reactions, and methane combustion (for details, see the data inventory below).
Seven new equilibrium-state datasets for unsupervised denoising tasks have been added, including AFLOW, MC2D/3D, CALYPSO, etc. (for details, see the data inventory below).

Training Strategy

The DPA-2 paper includes a multi-task pretraining framework for energy and force that supports joint training on datasets computed with different DFT settings.
An unsupervised denoising task has been added and integrated into the multi-task pretraining framework (results are detailed below).
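
As a rough illustration of how several datasets can share one training loop, the sketch below picks which dataset head to train at each step. It assumes that every step updates the shared descriptor together with one head, and that heads are sampled in proportion to a per-dataset weight. The names `HEAD_WEIGHTS`, `sample_head`, and `schedule` are hypothetical, the weight of the denoising head is a placeholder, and this is not the DeePMD-kit API.

```python
import random
from collections import Counter

# Hypothetical sampling weights. The first three values are borrowed from the
# "Weight" column of the performance table; reading them as sampling
# probabilities is an assumption of this sketch. The denoising head weight is
# a pure placeholder.
HEAD_WEIGHTS = {
    "Alloy": 2.0,
    "Cluster": 1.0,
    "Cu": 0.1,
    "denoise_placeholder": 1.0,
}


def sample_head(rng, weights):
    """Pick the head to train for one step, proportionally to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def schedule(num_steps, weights, seed=0):
    """Return the sequence of heads visited over num_steps training steps."""
    rng = random.Random(seed)
    return [sample_head(rng, weights) for _ in range(num_steps)]


# In a real loop, each step would draw a batch from the sampled dataset and
# apply that head's own loss, so labels from different DFT settings never
# share a fitting network.
steps = schedule(10_000, HEAD_WEIGHTS)
counts = Counter(steps)
print(counts.most_common())
```

Because each dataset keeps its own head, energy labels that differ by a DFT-setting-dependent offset do not have to be reconciled before training.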

Automation Process

The DPA-2 paper describes an automated workflow covering pretraining, fine-tuning, transferability testing, distillation, and compression (experience it at DP Combo and try it in the notebook).
The AIS Square website now includes an automated process for integrating user data, which automatically determines how well the pretrained model covers that data.

Competition

Coming in March...

Teaching

Coming in February...

Readers interested in the background of the project and details of the paper can also refer to the OpenLAM initiative and the DPA-2 paper for further information.

Conclusion

In the less than a month since DPA-2 was released, there have been numerous developments, which can be summarized as follows:

The DPA-2 multi-task pretraining framework has gained an unsupervised training task: datasets computed with different DFT settings can be trained together, and equilibrium-state data without DFT labels can now be used through denoising, so the model learns a broader range of representations;
The OpenLAM initiative has incorporated more production-type data and more publicly available equilibrium-state crystal structure data, and the pretraining data pool continues to expand rapidly;
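After the unsupervised training task was incorporated, the overall energy prediction accuracy of the model is slightly higher when compared fairly (the summary energy RMSE over the original 18 systems falls from 18.6 to 18.3 meV/atom, while the summary force RMSE rises from 116.3 to 123.6 meV/Å), suggesting that information from different systems and tasks can be mutually reinforcing.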

The OpenLAM initiative is iterating rapidly. As we move toward the era of large atomic models, open-source sharing becomes an inevitable theme. We welcome like-minded people to join us and open up new opportunities for scientific discovery and industrial application. On the journey to conquer the periodic table, we look forward to creating a new era with you!
To join the "OpenLAM Initiative", visit AIS Square.

Appendix

Unsupervised Denoising Method
- Data Structure
  - The equilibrium-state data consist only of configurations, with no DFT results. During preprocessing, noise is added separately to the coordinates and to the element types (for example, Gaussian noise on the atomic positions and masking of some element types).
- Training Method
  - The noisy configurations are fed into the network and processed by DPA-2's unified descriptor and a denoising fitting network, which yields a denoising vector for each atom (i.e., the network's prediction of the displacement that restores the atom's position) together with the predicted element types. The configuration and element types are restored from these predictions, and a loss is computed against the true, noise-free configuration and element types. The model is trained by minimizing this loss.
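
The procedure above can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the DPA-2 implementation: the function names `corrupt` and `denoise_loss`, the `MASK` token, and the values of `sigma` and `mask_prob` are all hypothetical, and the type term uses a plain error rate where a real model would use a differentiable loss such as cross-entropy.

```python
import random

MASK = "MASK"  # hypothetical placeholder token for a hidden element type


def corrupt(coords, types, sigma=0.1, mask_prob=0.2, rng=None):
    """Return a noisy copy: Gaussian noise on coordinates, random type masking."""
    rng = rng or random.Random(0)
    noisy_coords = [[x + rng.gauss(0.0, sigma) for x in atom] for atom in coords]
    noisy_types = [MASK if rng.random() < mask_prob else t for t in types]
    return noisy_coords, noisy_types


def denoise_loss(noisy_coords, pred_vectors, pred_types, clean_coords, clean_types):
    """Coordinate MSE after applying the denoising vectors, plus type error rate."""
    n = len(clean_coords)
    squared_error = 0.0
    for noisy, vec, clean in zip(noisy_coords, pred_vectors, clean_coords):
        restored = [p + d for p, d in zip(noisy, vec)]  # undo the noise
        squared_error += sum((r - c) ** 2 for r, c in zip(restored, clean))
    coord_loss = squared_error / (3 * n)
    type_loss = sum(p != t for p, t in zip(pred_types, clean_types)) / n
    return coord_loss + type_loss


# Toy three-atom configuration (illustrative numbers only).
clean_coords = [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]
clean_types = ["O", "H", "H"]
noisy_coords, noisy_types = corrupt(clean_coords, clean_types, rng=random.Random(42))

# An ideal network would predict exactly the displacement back to the clean state.
oracle_vectors = [
    [c - p for c, p in zip(clean, noisy)]
    for clean, noisy in zip(clean_coords, noisy_coords)
]
perfect = denoise_loss(noisy_coords, oracle_vectors, clean_types, clean_coords, clean_types)

# Predicting "no displacement" and echoing the noisy types leaves the noise in place.
zero_vectors = [[0.0, 0.0, 0.0] for _ in clean_coords]
untrained = denoise_loss(noisy_coords, zero_vectors, noisy_types, clean_coords, clean_types)
```

In this sketch the training target never requires a DFT label: the clean configuration itself supplies the supervision.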

Data Inventory

The datasets currently used for training the DPA-2 model cover a wide range of systems, including semiconductors, perovskites, alloys, surface catalysis, cathode materials, solid-state electrolytes, organic molecules, and more, as well as the newly added equilibrium-state datasets for unsupervised denoising. All of these data have been uploaded to the AIS Square website, where users can find more detailed descriptions and download the datasets. Specifically, they include:
Datasets included in the DPA-2 paper

Index	Dataset name	Contributors
1	Alloy_DPA_v1_0	Fuzhi Dai, Wanrun Jiang
2	Cathode(Anode)_DPA_v1_0	Linshuang Zhang, Jianchuan Liu
3	Cluster_DPA_v1_0	Fuqiang Gong
4	Drug(drug-like-molecule)_DPA_v1_0	Manyi Yang
5	FerroEle_DPA_v1_0	Jing Wu, Jiyuan Yang, YuanJinsheng Liu, Duo Zhang, Yudi Yang, Yuzhi Zhang, Linfeng Zhang, Shi Liu
6	Open_Catalyst_2020(OC20_Dataset)	Duo Zhang
7	SSE-PBE_DPA_v1_0	Jianxing Huang
8	SemiCond_DPA_v1_0	Jianchuan Liu
9	H2O-PD_DPA_v1_0	Linfeng Zhang, Han Wang, Roberto Car, Weinan E
10	AgAu-PBE(unitary)_DPA_v1_0	Yinan Wang, Linfeng Zhang, Ben Xu, Xiaoyang Wang, Han Wang
11	AlMgCu_DPA_v1_0	Wanrun Jiang, Yuzhi Zhang, Linfeng Zhang, Han Wang
12	Cu_DPA_v1_0	Yuzhi Zhang, Haidi Wang, Weijie Chen, Jinzhe Zeng, Linfeng Zhang
13	Sn_DPA_v1_0	Fengbo Yuan
14	Ti_DPA_v1_0	Tongqi Wen, Rui Wang, Lingyu Zhu, Linfeng Zhang, Han Wang, David J Srolovitz, Zhaoxuan Wu
15	V_DPA_v1_0	Rui Wang, Xiaoxiao Ma, Linfeng Zhang, Han Wang, David J Srolovitz, Tongqi Wen, Zhaoxuan Wu
16	W_DPA_v1_0	Xiaoyang Wang, Yinan Wang, Linfeng Zhang, Fuzhi Dai, Han Wang
17	C12H26_DPA_v1_0	Jinzhe Zeng, Linfeng Zhang, Han Wang, Tong Zhu
18	HfO2_DPA_v1_0	Jing Wu, Yuzhi Zhang, Linfeng Zhang, Shi Liu

Four new datasets for energy & force data

Index	Dataset name	Contributors
19	Electrolyte	Mengchao Shi, Yuzhi Zhang
20	Solid_State_Electrolyte	Mengchao Shi, Yuzhi Zhang
21	Organic_reactions_dataset	Tong Zhu, Bowen Li
22	CHO-methane-combustion	Jinzhe Zeng, Liqun Cao, Mingyuan Xu, Tong Zhu, John ZH Zhang

Seven new datasets in equilibrium state for unsupervised denoising

Index	Dataset name	Contributors/Link
1	AFLOW_MP	AFLOW, MP
2	MC2D	Davide Campi, Nicolas Mounet, Marco Gibertini, Giovanni Pizzi, Nicola Marzari, The Materials Cloud 2D database (MC2D), Materials Cloud Archive 2022.84 (2022), doi: 10.24435/materialscloud:36-nd.
3	MC3D	Sebastiaan Huber, Marnik Bercx, Nicolas Hörmann, Martin Uhrin, Giovanni Pizzi, Nicola Marzari, Materials Cloud three-dimensional crystals database (MC3D), Materials Cloud Archive 2022.38 (2022), doi: 10.24435/materialscloud:rw-t0.
4	ChemicalSimilarity	Hai-Chen Wang, Silvana Botti, Miguel A. L. Marques, Finding new crystalline compounds using chemical similarity, Materials Cloud Archive 2021.68 (2021), doi: 10.24435/materialscloud:96-09.
5	ClusterIsomer	Giuseppe Fisicaro, Bastian Schaefer, Jonas A. Finkler, Stefan Goedecker, Principles of isomer stability in small clusters, Materials Cloud Archive 2023.36 (2023), doi: 10.24435/materialscloud:46-nr.
6	MolecularCrystal	Rose Cersonsky, Maria Pakhnova, Edgar Engel, Michele Ceriotti, Lattice energies and relaxed geometries for 2'707 organic molecular crystals and their 3'242 molecular components, Materials Cloud Archive 2023.5 (2023), doi: 10.24435/materialscloud:71-21.
7	CALYPSO_database	Zhenyu Wang, Xiaoshan Luo

Latest Performance (root mean squared error, RMSE) of the Multi-task Pretrained Model (22 energy and force systems + 7 unsupervised denoising systems)

		DPA-2 (multi-task, 18 heads, 1M steps)		DPA-2 (multi-task, 29 heads, 1.84M steps)	
System	Weight	Energy (meV/atom)	Force (meV/Å)	Energy (meV/atom)	Force (meV/Å)
Alloy	2.0	36.5	169.5	32.2	160.5
Cluster	1.0	34.4	162.5	40.6	171.0
Anode	1.0	3.3	39.8	2.5	45.0
FerroEle	1.0	4.4	44.2	1.7	47.2
AgAu-PBE	0.2	9.4	28.2	10.9	31.2
Cu	0.1	3.6	18.2	6.8	21.2
Sn	0.1	24.8	69.7	17.3	76.7
Ti	0.1	16.3	112.4	26.8	133.7
AlMgCu	0.3	4.9	23.4	10.6	28.6
V	0.1	13.9	110.2	16.7	121.3
W	0.1	24.6	157.9	45.8	174.0
C12H26	0.1	62.5	710.6	75.3	1486.7
SSE-PBE	1.0	2.1	64.0	2.2	75.7
HfO2	0.1	3.9	102.8	5.0	108.4
SemiCond	1.0	6.5	131.9	7.2	139.8
Drug	2.0	20.6	128.9	21.8	140.6
OC2M	2.0	29.3	157.6	26.7	138.7
H2O-PD	1.0	3.2	39.7	1.0	45.6
Weighted average		18.6	116.3	18.3	123.6

Electrolyte	1.0	/	/	2.9	64.3
SSE_new	1.0	/	/	3.2	72.4
Organic_reactions	1.0	/	/	15.1	97.7
Methane-combustion	1.0	/	/	147.2	251.4
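
The summary row of the table is consistent with a weight-normalized average over the 18 original systems. This reading is inferred from the numbers rather than stated in the report. The sketch below reproduces the energy value for the 18-head model; the names `ROWS_18_HEADS` and `weighted_average` are illustrative.

```python
# (weight, energy RMSE in meV/atom) for the 18-head model, copied from the table.
ROWS_18_HEADS = {
    "Alloy": (2.0, 36.5), "Cluster": (1.0, 34.4), "Anode": (1.0, 3.3),
    "FerroEle": (1.0, 4.4), "AgAu-PBE": (0.2, 9.4), "Cu": (0.1, 3.6),
    "Sn": (0.1, 24.8), "Ti": (0.1, 16.3), "AlMgCu": (0.3, 4.9),
    "V": (0.1, 13.9), "W": (0.1, 24.6), "C12H26": (0.1, 62.5),
    "SSE-PBE": (1.0, 2.1), "HfO2": (0.1, 3.9), "SemiCond": (1.0, 6.5),
    "Drug": (2.0, 20.6), "OC2M": (2.0, 29.3), "H2O-PD": (1.0, 3.2),
}


def weighted_average(rows):
    """Sum of weight * value divided by the sum of weights."""
    total_weight = sum(weight for weight, _ in rows.values())
    return sum(weight * value for weight, value in rows.values()) / total_weight


energy_18 = weighted_average(ROWS_18_HEADS)
print(round(energy_18, 1))  # 18.6
```

The four newly added systems are left out of the average, which keeps the two model columns comparable over the same 18 systems.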