Progress in Chemistry

Abbreviation (ISO4): Prog Chem      Editor in chief: Jincai ZHAO

Review

Application of Molecular Descriptors and End-to-End Deep Learning in MOFs Design

  • Ying He 1,
  • Fangchang Tan 2,
  • Xiliang Yan 3,*
  • 1 Institute of Environmental Research at Greater Bay Area, Guangzhou University, Guangzhou 510006, China
  • 2 College of Physics and Optoelectronic Engineering, Jinan University, Guangzhou 511443, China
  • 3 College of Animal Science, South China Agricultural University, Guangzhou 510642, China

Received date: 2024-11-05

Revised date: 2025-03-12

Online published: 2025-07-30

Supported by

the National Natural Science Foundation of China(22476056)

the National Natural Science Foundation of China(22106025)

Abstract

Metal-organic frameworks (MOFs) exhibit great promise in diverse applications such as gas storage, catalysis, and sensing due to their distinctive structures and physicochemical properties. However, traditional experimental approaches face challenges in quickly and efficiently designing MOFs with the desired characteristics. In recent years, artificial intelligence (AI) techniques, particularly traditional machine learning and deep learning, have been extensively applied in materials science, yielding numerous noteworthy results. An essential requirement for successful modeling with these techniques is the ability to extract the structural features of MOFs and transform them into computer-readable formats. Therefore, we present a comprehensive review of two feature extraction approaches based on molecular descriptors and end-to-end deep learning. We summarize the fundamental concepts and principles of both methods, emphasizing their specific applications and recent advancements in MOFs design. Finally, we discuss the challenges and future directions for improving the comprehensiveness, interpretability, and reproducibility of structural feature extraction. This review aims to provide valuable insights and theoretical guidance for AI-driven MOFs design.

Contents

1 Introduction

2 Traditional machine learning and end-to-end deep learning

2.1 Basic concepts and historical development of artificial intelligence

2.2 Key steps in traditional machine learning and end-to-end deep learning

2.3 Differences between traditional machine learning and end-to-end deep learning

2.4 Overview of the MOF databases

3 Feature extraction based on molecular descriptors

3.1 Structural descriptors

3.2 Chemical characteristics

3.3 Thermodynamic properties

3.4 Feature selection and dimensionality reduction techniques

3.5 Effective strategies for handling missing features and noisy data

4 Application of end-to-end deep learning model to MOFs design

4.1 Convolutional neural networks

4.2 Recurrent neural networks

4.3 Graph neural networks

4.4 Generative adversarial networks

5 Conclusion and outlook

Cite this article

Ying He, Fangchang Tan, Xiliang Yan. Application of Molecular Descriptors and End-to-End Deep Learning in MOFs Design[J]. Progress in Chemistry, 2025, 37(8): 1177-1187. DOI: 10.7536/PC241104

1 Introduction

Metal-organic frameworks (MOFs) are a class of porous materials formed by self-assembly of metal ions or metal clusters with organic ligands. Due to their unique structural characteristics, MOFs exhibit extremely high specific surface areas, tunable pore structures, and diverse chemical functionalities[1]. These superior properties have led to extensive research and applications of MOFs in various fields. In the area of gas storage and separation, MOFs, with their tunable pore sizes and high specific surface areas, have become ideal storage materials for gases such as hydrogen, methane, and carbon dioxide[2-4]. Additionally, MOFs are used for the selective adsorption and separation of specific components from mixed gases, such as the separation of hydrogen sulfide and carbon dioxide from natural gas[5]. In the field of catalysis, MOFs, owing to the diversity of their metal centers and the functionalization of organic ligands, can serve as homogeneous or heterogeneous catalysts, widely applied in processes such as organic reactions, photocatalysis, and electrocatalysis[6-7]. MOFs also perform exceptionally well in the sensor field. Their highly ordered pore structures and functionalized surfaces enable MOFs to achieve highly sensitive detection of specific gases, liquids, or biomolecules[8-9]. Furthermore, MOFs demonstrate significant application potential in drug delivery, pollutant removal, and electronic devices[10-11].
In traditional MOF design, reliance on experimentation and empirical experience is not only time-consuming but also costly. The introduction of machine learning has transformed this situation, significantly enhancing design efficiency and success rates through data-driven approaches. Machine learning models can learn the complex relationships between MOF structures and their properties from extensive existing datasets, enabling rapid predictions of new materials' performance and substantially reducing the number of materials ultimately tested experimentally. For example, by training regression models, the performance of MOFs in gas storage or catalysis can be quickly predicted[12]. Feature extraction is a core step in machine learning and is particularly important in MOF material design. The purpose of feature extraction is to transform high-dimensional and complex structural features and physicochemical properties into key information that can describe and differentiate various materials, ultimately improving model prediction performance[13]. Moreover, the extracted features should have clear physical and chemical significance, facilitating researchers' interpretation of model predictions and guiding the development of novel MOFs with superior performance.
Currently, MOF feature extraction methods can be broadly categorized into two main types: molecular descriptors and end-to-end deep learning. (1) Molecular descriptors involve converting complex MOF structural information into numerical formats that traditional machine learning models can handle, using experimental or computational approaches. Due to the clear physical and chemical significance of molecular descriptors, traditional machine learning models offer more intuitive interpretations of prediction results. This interpretability helps researchers understand the relationship between material properties and structural features, thereby guiding material design. (2) Unlike traditional machine learning, end-to-end deep learning models can directly recognize raw input data (such as molecular images or SMILES representations) and automatically extract important features during training, without the need for manually generated molecular descriptors[14]. However, the "black-box" nature of end-to-end deep learning models makes their decision-making process difficult to interpret, requiring the assistance of deep learning interpreters for analysis[15].
This article focuses on comparing the applications of traditional machine learning models based on molecular descriptors with end-to-end deep learning models in MOF design, covering basic concepts, common methods, and key steps. It also discusses and provides insights into the development of publicly available high-quality databases, the creation of universal descriptors, and the construction of AI prediction models. With the assistance of traditional machine learning and deep learning, breakthroughs in intelligent MOF design are expected in the future, further advancing the field of materials science.

2 Traditional machine learning and end-to-end deep learning

2.1 Basic concepts and historical development of artificial intelligence

AI is a branch of computer science aimed at enabling machines to autonomously learn, reason, and make decisions by simulating human intelligence. Machine learning is at the core of AI, using algorithms to allow computers to learn from data and generate predictions or decisions. The development of AI can be traced back to the 1950s, when Alan Turing proposed the "Turing Test" to assess whether machines possess intelligence[16]. In 1956, at the Dartmouth Conference, John McCarthy first introduced the term "artificial intelligence," marking the official beginning of AI research[17].
The development of AI in materials design can be divided into several key stages. In the early exploration phase (1980s–1990s), the focus was primarily on expert systems and rule-based reasoning methods. Researchers attempted to use knowledge bases and reasoning mechanisms to assist in materials design; however, due to limitations in computing power and data volume, these efforts largely relied on the experience and knowledge of experts. In the initial data-driven application phase (early 21st century), with the advancement of computer and automation technologies, the field of materials science gradually began accumulating large amounts of experimental data. Machine learning methods started to be introduced for analyzing materials data. In 2012, the rapid development of deep learning brought new opportunities for materials design[18]. The emergence of technologies such as convolutional neural networks (CNN) and recurrent neural networks (RNN) made it possible to accurately quantify complex structure-property relationships in materials from large-scale datasets. At this stage, AI began demonstrating strong capabilities in materials discovery, optimization, and performance prediction.

2.2 Key steps in traditional machine learning and end-to-end deep learning

The design of new MOF materials typically involves two stages: (1) MOF property prediction. End-to-end deep learning models can directly map molecular structures (such as molecular graphs, SMILES representations, and 3D structures) to target properties, making them suitable for handling large-scale data and complex nonlinear relationships. Unlike traditional machine learning, end-to-end deep learning automatically extracts hierarchical features from data through multi-layer neural networks, significantly reducing reliance on domain knowledge[19]. This allows for the rapid screening of candidate structures with potentially excellent properties and provides initial guidance for subsequent optimization. (2) MOF structure optimization and screening. At this stage, the incorporation of domain knowledge (such as chemical rules and structural constraints) is crucial. The outputs of end-to-end models may not be directly applicable due to issues of chemical plausibility (e.g., bond length/angle constraints), synthetic feasibility (e.g., ligand accessibility), or stability[20]. Therefore, domain-knowledge-driven optimization must be introduced at this stage. End-to-end models efficiently narrow down the search space, while domain knowledge ensures that the final designed structures possess both high performance and practical operability.
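The two-stage workflow described above can be sketched as a property-cutoff screen followed by a domain-constraint filter. All MOF names, property values, and thresholds below are hypothetical, chosen only to illustrate the idea:

```python
import numpy as np

# Hypothetical candidates: (name, predicted uptake, largest pore diameter in angstroms,
# has open metal site). These values are invented for illustration.
candidates = [
    ("MOF-A", 52.1, 12.4, True),
    ("MOF-B", 47.8, 3.1,  True),   # pore too small for the target guest molecule
    ("MOF-C", 44.0, 9.8,  False),
    ("MOF-D", 39.5, 15.0, True),
]

def screen(cands, uptake_cutoff=40.0, min_pore=4.0):
    """Stage 1: keep structures whose model-predicted uptake passes a cutoff.
    Stage 2: apply a domain-knowledge constraint (here, a minimum pore size).
    Finally rank the survivors by predicted performance."""
    stage1 = [c for c in cands if c[1] >= uptake_cutoff]
    stage2 = [c for c in stage1 if c[2] >= min_pore]
    return sorted(stage2, key=lambda c: c[1], reverse=True)

shortlist = screen(candidates)
```

In a real pipeline the cutoff comes from a trained model and the constraints from chemical rules (bond geometry, ligand availability), but the control flow is the same: the model narrows the space, the constraints guarantee operability.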
The establishment of a MOF prediction model can be further broken down into several steps: construction of a standard dataset, model training and evaluation, and finally model application (Figure 1). First, MOF-related data are collected from public databases, the literature, and experimental results. Next comes feature engineering, the main focus of this paper, involving steps such as descriptor (feature) calculation and feature selection. The entire dataset is then divided into a training set and a test set; the training set is used for parameter tuning and learning the patterns inherent in the data, while the test set is employed to evaluate the model's performance on unseen data, thereby testing its generalization capability. Finally, the accuracy of the model is assessed by validating it on the test set. If the evaluation results indicate that the model's predictive accuracy meets the expected standards, it can be adopted as the final model for subsequent practical applications such as new material screening and assisted design. If, however, the model performs poorly on the test set, that is, its predictive accuracy is insufficient or its error excessive, it is necessary to re-examine aspects such as the model structure, feature selection, and hyperparameter settings, and to make adjustments or rebuild the model.
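The split-train-evaluate loop above can be sketched with a synthetic descriptor dataset and an ordinary least-squares model standing in for the real learner (the three "descriptors" and the target are simulated, not MOF data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a MOF dataset: 200 samples, 3 descriptors
# (imagine surface area, pore volume, metal electronegativity) and one property.
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Split into training (80%) and test (20%) sets.
idx = rng.permutation(len(X))
train, test = idx[:160], idx[160:]

# Fit an ordinary least-squares model on the training set.
Xb = np.c_[X[train], np.ones(len(train))]          # add a bias column
coef, *_ = np.linalg.lstsq(Xb, y[train], rcond=None)

# Evaluate generalization on the held-out test set with R^2.
pred = np.c_[X[test], np.ones(len(test))] @ coef
ss_res = np.sum((y[test] - pred) ** 2)
ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
```

If `r2` falls short of the required standard, the loop returns to feature selection and model choice, exactly as described in the text.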

Fig. 1 Comparison of traditional machine learning and deep learning methods,and the key modeling steps

2.3 Differences between traditional machine learning and end-to-end deep learning

Deep learning is an important branch of machine learning, based on the concept of artificial neural networks, aiming to simulate the structure and function of the human brain. Deep learning models contain no fewer than 3 hidden layers (typically 5 to 20 layers)[21], achieving automatic feature extraction through hierarchical nonlinear transformations, whereas traditional machine learning models (also known as shallow learning) usually have 0 to 2 layers or no layer structure at all[22]. Unlike traditional machine learning methods, deep learning automatically learns features and patterns from large amounts of data through multi-layer neural networks. It excels at handling large-scale and complex datasets, such as images, audio, and natural language processing. As shown in Figure 1, the core advantage of deep learning lies in its powerful feature extraction capability, which allows it to automatically learn high-level feature representations from raw data, thereby reducing reliance on manual feature engineering[23]. This automated feature learning capability has enabled deep learning to achieve remarkable success in many fields, such as image recognition in computer vision, language translation in natural language processing, and speech recognition.
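The "hierarchical nonlinear transformations" mentioned above amount to stacked matrix multiplications interleaved with nonlinearities. The following forward pass through a small untrained network (random weights, illustrative layer sizes, no training loop) shows the mechanics:

```python
import numpy as np

def relu(x):
    """Elementwise rectified linear unit, a common nonlinearity."""
    return np.maximum(0.0, x)

rng = np.random.default_rng(1)

# Input of width 8, three hidden layers, scalar output: each layer
# re-represents its input through a linear map followed by a nonlinearity.
sizes = [8, 16, 8, 4, 1]
weights = [rng.normal(scale=0.5, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x):
    h = x
    for W in weights[:-1]:
        h = relu(h @ W)            # one hierarchical nonlinear transformation
    return h @ weights[-1]         # linear read-out layer

y = forward(rng.normal(size=(5, 8)))   # 5 samples in, 5 predictions out
```

A "shallow" model in the sense of the text would keep only the final linear read-out; depth comes from repeating the transform so that later layers operate on increasingly abstract representations.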
Traditional machine learning and deep learning are key technologies behind today's data-driven decision-making and intelligent systems, each with its own unique advantages and disadvantages. Models built using traditional machine learning have strong interpretability, typically providing explainable results that facilitate understanding the basis of model decisions. They also offer high computational efficiency, with relatively fast training speeds when dealing with small to medium-sized datasets, making them suitable for resource-constrained environments. Traditional machine learning excels at handling datasets with clear features and patterns, such as tabular data. However, it places high demands on feature selection and preprocessing, requiring expertise from domain specialists; its ability to handle nonlinear, large-scale, and high-dimensional data is limited; and it heavily relies on features, with generalization performance affected by feature quality and selection. In contrast, end-to-end deep learning is well-suited for large-scale data and can effectively capture complex nonlinear relationships, such as those found in images, speech, and natural language[24]. It also enables automated feature learning, extracting more abstract feature representations directly from raw data and reducing the need for manual feature engineering. Consequently, end-to-end deep learning often achieves higher predictive accuracy in big data and complex tasks. Training deep learning models, however, requires substantial computational resources and time, such as graphics processing unit (GPU) acceleration. The complexity and large number of parameters in deep learning models make their decision-making processes difficult to interpret, resulting in a lack of transparency. Moreover, deep learning relies heavily on large amounts of labeled data, and insufficient data can lead to degraded model performance[25]. 
Choosing between traditional machine learning and deep learning requires a comprehensive assessment of factors such as data size, task complexity, and resource availability (Table 1).

Table 1 Differences between traditional machine learning and deep learning

Difference | Traditional machine learning | Deep learning
Model structure | Shallow models, typically 0 to 2 hidden layers | Deep neural networks, usually no fewer than three hidden layers
Data requirements | Small to medium-sized data (thousands of samples) | Large-scale data (tens of thousands to millions of samples)
Feature engineering | Depends on manual feature extraction; requires domain knowledge to design features | Learns features automatically, extracting high-order features from raw data through multi-layer networks
Computing resources | Less computationally intensive | Requires high computational power and long training time
Model interpretability | High interpretability | Low interpretability

2.4 Overview of the MOF databases

Over the past several decades, both experimental and computational studies on MOFs have made remarkable progress, generating a wealth of experimental and simulation data. Table 2 traces the developmental trajectory of this process, starting from early experimental and molecular simulation studies and gradually evolving into cutting-edge methods that employ machine learning techniques to process and analyze these data. In the late 20th century, research on MOF materials was primarily conducted through experimental approaches, including synthesis, characterization, and measurement of adsorption and diffusion properties. During this period, the development of computational tools was relatively slow, and in most cases experimental data were not systematically collected or organized, so no corresponding databases were established. With the improvement of computational power and the advancement of computational methods, detailed simulation studies of MOFs gradually emerged, enabling researchers to predict and analyze MOF properties. Through meticulous comparison and calibration of simulation results, simulation methodologies were refined, and the first computational datasets and databases for MOFs began to appear, such as the hMOF and CoRE MOF databases[26-27]. The stage from the early 21st century to the present is characterized by the production of large amounts of high-quality data and the widespread application of machine learning methods. Vast quantities of computational and experimental data have been systematically collected and organized, leading to the establishment of several large-scale MOF databases, such as the MOF-DB and CSD-MOF databases[28-29]. The evolution of MOF structural and performance data and databases, from non-existence to abundance, is the combined result of advances in automated material synthesis and characterization, enhanced computational capabilities, and improved data processing technologies.

Table 2 An overview of the MOF databases

Stage Year Main research/data sets
Early experimental research 1995 The adsorption properties of MOFs, including MOF-1, HKUST-1, and MOF-5, for gases such as H₂, CH₄, and CO₂[30]
Molecular simulation 2004 The gas adsorption of MOFs was simulated for the first time using the GCMC method[31]
2004 For the first time, molecular dynamics simulations were performed to investigate gas diffusion in MOFs[31]
Machine learning 2012 The hMOF database with 137953 MOFs was established[32]
2014 The CoRE MOF database contains extensive data on the structures, properties, and potential applications of 5109 MOFs[27]
2017 The CSD MOF database contains 3D crystal structures of 69666 MOFs, including information on their crystal symmetry, atomic coordinates, bond lengths, and angles[33]
2019 The updated CoRE MOF database contains 14142 MOFs[34]
2021 The QMOF database includes 14482 MOFs, with a special focus on quantum chemical properties such as quantum states, electronic structures, and magnetism[35]
2023 MOFX-DB contains adsorption data for more than 160000 MOFs from both experimental measurements and computational simulations[28,36]

3 Feature extraction based on molecular descriptors

Molecular descriptors are numerical characteristics of compounds or materials that quantitatively describe their structure and properties. Based on the method of acquisition, molecular descriptors can be classified into experimental and theoretical descriptors. Experimental descriptors include the morphology, size, Zeta potential, and various spectral data of MOFs. As shown in Figure 2, theoretical descriptors can be further categorized according to their nature as follows: (1) structural feature descriptors: topological structure, metal ions or metal clusters and organic ligand types in MOFs, pore size, surface area, pore volume, etc.; (2) chemical property descriptors: including the types and oxidation states of metals in MOF structures, as well as quantitative descriptors related to specific chemical structures; (3) thermodynamic property descriptors: thermal stability, heat of adsorption, other mechanical properties, band gap, diffusion coefficient, proton conductivity, etc.

Fig. 2 Classification of commonly used theoretical MOF descriptors

3.1 Structural feature descriptor

Structural feature descriptors are parameters used to characterize molecular structure, reflecting the geometric, topological, and compositional characteristics of molecules. They are commonly applied in fields such as cheminformatics, drug design, and materials science. Among these, topological descriptors are widely used structural features, primarily focusing on the cage-like structure of MOFs, i.e., the connectivity between metal nodes and organic ligands. Topological descriptors can help researchers understand and predict the physicochemical properties of MOFs, such as stability, porosity, adsorption capacity, and catalytic activity. Common topological descriptors include: (1) the types and quantities of metal nodes and organic ligands in MOFs; (2) the bonding states of each metal node with its organic ligands; and (3) the network structure types of MOFs, such as cubic, tetrahedral, and octahedral structures. For example, Batra et al.[37] developed a comprehensive chemical feature extraction program to obtain information on MOF metal nodes, organic ligand connectivity, and their molar ratios, and employed various machine learning models to predict the water stability of MOFs. Feature importance analysis revealed that the number of cyclic divalent nodes or six-membered rings, as well as the number of hydrogen bond acceptor sites, significantly influences water stability. In addition, structural information such as MOF surface area, pore size, and metal type is also frequently used as input features for machine learning models[38]. Structural feature descriptors directly represent the intrinsic geometric and topological properties of MOFs (such as pore size, specific surface area, and topological type) and are typically obtained directly from crystal structure data or theoretical simulations, offering high data reliability. They are well suited as core model input features, but their static nature may limit their ability to predict dynamic processes (such as adsorption kinetics).
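However a structural descriptor is obtained, it must end up as a fixed-length numeric vector before a model can consume it. A minimal sketch, with hypothetical descriptor names and values (not taken from any real MOF):

```python
import numpy as np

# Hypothetical structural descriptors for one MOF, as they might be read
# from crystal-structure analysis; names and values are illustrative only.
structure = {
    "largest_cavity_diameter_A": 11.2,
    "pore_limiting_diameter_A": 6.4,
    "surface_area_m2_per_g": 1850.0,
    "void_fraction": 0.71,
    "metal_coordination_number": 6,
}

# Fixing the key order turns heterogeneous structural information into a
# numeric vector in a reproducible way, so every MOF maps to the same layout.
DESCRIPTOR_KEYS = sorted(structure)

def to_feature_vector(s):
    return np.array([float(s[k]) for k in DESCRIPTOR_KEYS])

x = to_feature_vector(structure)
```

Keeping the key order explicit (here, sorted) is what makes descriptor matrices from different MOFs column-aligned and therefore usable as model input.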

3.2 Chemical property descriptor

When constructing machine learning models, the chemical properties of MOFs are crucial for understanding their behavior and performance. In recent years, researchers have designed sets of stoichiometric features to describe MOFs, aiming to predict and optimize MOF performance through machine learning and data-driven approaches. These include Stoichiometric-120 Features, Stoichiometric-45 Features, Sine Coulomb Matrix, and others. These stoichiometric descriptors primarily focus on the molecular composition and relative proportions of elements, typically used to quantitatively characterize the composition and chemical properties of compounds.
Stoichiometric-120 Features are stoichiometric descriptors calculated based on the chemical composition of MOFs, specifically the ratio and type of metal centers and organic ligands[39]. Each feature in Stoichiometric-120 Features represents a chemical or structural property of MOFs. This descriptor set comprises 120 properties, primarily calculated according to the ratio and type of metal centers (such as Zn, Cu, etc.) and organic ligands (such as benzenedicarboxylic acid, etc.) in MOFs, as well as several other chemical properties: average atomic weight, average group number, average period number, maximum atomic number difference, and average atomic number. Stoichiometric-45 Features include 45 independent descriptors, each representing a chemical or structural property of MOFs[40]. The Sine Coulomb matrix is a matrix descriptor used to characterize material structures, capturing geometric and electronic properties by considering interactions between atoms in the material, particularly Coulomb interactions[41]. The Orbital Field Matrix (OFM) descriptor is based on information about atomic orbitals within the material, constructing a matrix to represent the local environment and electronic characteristics of the material, with each matrix element containing information related to specific atomic orbitals[42]. The Average Smooth Overlap of Atomic Positions (SOAP) descriptor is an advanced machine learning feature used to capture local environmental information of materials and molecules, by calculating the three-dimensional density distribution of the atomic local environment and converting it into a mathematical representation that can be effectively processed and compared, such as vectors or similarity metrics. 
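To make the matrix-descriptor idea concrete, here is a simplified sketch in the spirit of the sine Coulomb matrix for a periodic structure. The published formulation (Faber et al.) differs in detail, and the atoms and lattice below are invented; this is an illustration of the structure of such a descriptor, not a reference implementation:

```python
import numpy as np

def sine_coulomb_matrix(Z, frac_coords, lattice):
    """Simplified sine-Coulomb-style matrix for a periodic structure.
    Z: atomic numbers (n,); frac_coords: fractional coordinates (n, 3);
    lattice: 3x3 lattice matrix. Off-diagonal terms use a sine-smoothed
    periodic distance; diagonal terms follow the 0.5 * Z**2.4 convention."""
    n = len(Z)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4
            else:
                d_frac = frac_coords[i] - frac_coords[j]
                # smooth, periodic surrogate for the interatomic distance
                d = np.linalg.norm(lattice.T @ np.sin(np.pi * d_frac) ** 2)
                M[i, j] = Z[i] * Z[j] / d
    return M

# Toy two-atom periodic cell (values invented): a Zn and an O atom.
lattice = np.eye(3) * 10.0
Z = np.array([30, 8])
frac = np.array([[0.0, 0.0, 0.0], [0.25, 0.0, 0.0]])
M = sine_coulomb_matrix(Z, frac, lattice)
```

The resulting matrix is symmetric and invariant to lattice translations, which is exactly the property that makes such descriptors suitable for periodic materials like MOFs.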
Using these descriptors and the kernel ridge regression machine learning algorithm, researchers predicted the quantum chemical properties (such as band gap) of over 14 000 experimentally synthesized MOFs from the QMOF database, with all model prediction accuracies exceeding 0.64. This provides an effective way to avoid time-consuming and labor-intensive density functional theory calculations in future research[35]. Recently, Bai et al.[43] developed a predictive model based on machine learning algorithms and applied it to high-throughput screening of MOF catalysts for carbon dioxide cycloaddition reactions. The descriptors used were easily obtainable structural and physicochemical properties of MOFs selected according to the reaction mechanism, for example, using the OMS (open metal sites) charge in place of the Lewis acidity of the catalytic centers. With this model, 239 highly active catalysts were successfully screened, and the catalytic performance of the preferred material MOF-76(Y) was experimentally verified. However, the stoichiometric characteristics of MOFs (such as metal/ligand ratio and elemental composition) must be obtained through chemical analysis or theoretical calculations, and data integrity is closely tied to the experimental methods used.
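Kernel ridge regression, as used in such screening studies, has a simple closed form. The sketch below implements it with NumPy on synthetic descriptor data (not QMOF data); the RBF kernel and hyperparameter values are illustrative choices:

```python
import numpy as np

def rbf_kernel(A, B, gamma=2.0):
    """Gaussian (RBF) kernel matrix between two sets of descriptor vectors."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit_predict(X_train, y_train, X_test, alpha=1e-3):
    """Closed-form kernel ridge regression: solve (K + alpha*I) c = y on the
    training set, then predict via the cross-kernel to the test descriptors."""
    K = rbf_kernel(X_train, X_train)
    c = np.linalg.solve(K + alpha * np.eye(len(X_train)), y_train)
    return rbf_kernel(X_test, X_train) @ c

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(100, 2))      # stand-in descriptor vectors
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2     # smooth nonlinear "property"
pred = krr_fit_predict(X[:80], y[:80], X[80:])
```

The regularization strength `alpha` and kernel width `gamma` would be tuned by cross-validation in practice; the point here is that the whole learner is two kernel evaluations and one linear solve.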

3.3 Thermodynamic property descriptors

Thermodynamic property descriptors are primarily used to characterize the energy properties, stability, and reactivity of MOFs, including adsorption energy, binding energy, desorption energy, band gap, and others. These descriptors can quantitatively represent the performance of MOFs in fields such as gas adsorption, catalysis, batteries, and supercapacitors. Bucior et al.[44] developed a novel energy descriptor based on the interaction between MOFs and guest molecules. Using this descriptor, they constructed a machine learning model capable of predicting the gas adsorption capacity of MOFs across multiple databases, achieving a prediction accuracy within 3 g/L. Furthermore, by applying the constructed model to virtually screen a database containing over 50 000 MOFs, they identified an outstanding candidate material. Under storage conditions of 77 K and 100 bar, and release conditions of 160 K and 5 bar, this material exhibited a hydrogen delivery capacity of 47 g/L (simulated value: 54 g/L), demonstrating the effectiveness of the developed descriptor and the broad application prospects of machine learning-driven new material design. Thermodynamic properties rely on experimental measurements or high-precision calculations (such as density functional theory), and the data may contain significant errors or gaps. Data noise and missing values need to be addressed through interpolation, data augmentation, or robust models (such as adversarial training); otherwise, the model's generalization performance may decline.
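The energy-descriptor idea can be illustrated, in a much simplified form, as binning sampled host-guest interaction energies into a normalized histogram, one fixed-length feature vector per MOF. The energies below are synthetic draws, not actual grid data from Bucior et al.:

```python
import numpy as np

def energy_histogram(energies, bins):
    """Sketch of an energy-based descriptor: bin sampled host-guest
    interaction energies into a fixed-length histogram and normalize it,
    yielding one feature vector per MOF."""
    hist, _ = np.histogram(energies, bins=bins)
    return hist / hist.sum()

rng = np.random.default_rng(3)
# Hypothetical interaction energies (kJ/mol) sampled on a grid inside one MOF.
energies = rng.normal(loc=-8.0, scale=4.0, size=5000)
bins = np.linspace(-25.0, 5.0, 16)          # 15 energy bins
feature = energy_histogram(energies, bins)
```

Because the histogram shape depends only on the energy landscape, MOFs from different databases become directly comparable through the same fixed-length vector, which is what makes cross-database prediction possible.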

3.4 Feature Selection and Dimensionality Reduction

In machine learning-assisted materials design, feature selection is an important task aimed at identifying the most representative features from a large number of descriptors to build efficient and accurate predictive models[45]. Using high-dimensional and correlated features in machine learning can lead to increased model complexity, longer training times, and a higher risk of overfitting. There are various methods for feature reduction, including manually examining the Pearson correlation coefficient matrix, as well as automated approaches such as recursive feature addition, recursive feature elimination, univariate feature filtering, and wrapper feature selection. Different methods have their own advantages and disadvantages; selecting an appropriate method can effectively reduce feature dimensionality and improve model performance. For example, Batra et al.[37] demonstrated that recursive feature elimination can significantly reduce feature dimensionality from 149 to approximately 30 while improving model accuracy. Additionally, dimensionality reduction techniques are widely used to address the complexity of feature spaces, including principal component analysis, t-distributed stochastic neighbor embedding, and uniform manifold approximation and projection. These methods aim to reduce the ratio between feature dimensionality and the number of training samples, making models more efficient. They also effectively identify and remove irrelevant or redundant features, helping to mitigate overfitting[46].
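The manual approach of examining the Pearson correlation matrix can be automated as a greedy filter; a minimal sketch on synthetic data, where one feature is deliberately an almost exact duplicate of another:

```python
import numpy as np

def drop_correlated(X, threshold=0.9):
    """Greedy correlation filter: scan features left to right and keep a
    feature only if its |Pearson r| with every already-kept feature is
    below the threshold."""
    corr = np.corrcoef(X, rowvar=False)
    keep = []
    for j in range(X.shape[1]):
        if all(abs(corr[j, k]) < threshold for k in keep):
            keep.append(j)
    return keep

rng = np.random.default_rng(4)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.01, size=200)    # nearly a duplicate of a
c = rng.normal(size=200)                    # independent feature
X = np.column_stack([a, b, c])
kept = drop_correlated(X)                   # the near-duplicate is removed
```

The greedy order matters (earlier columns win ties), which is one reason wrapper methods such as recursive feature elimination, which score features by model performance rather than pairwise correlation, can do better.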

3.5 Handling Missing Features and Data Noise

The missing features in MOF descriptors and the noise in experimental data constitute key bottlenecks that limit the predictive accuracy of machine learning models. Specifically, difficult-to-determine structural features (such as pore size and topological type), noisy thermodynamic properties (such as adsorption heat and band gap), and missing chemical compositions (such as metal/ligand ratios) collectively pose core challenges for model optimization. Machine learning addresses these feature-related challenges through various strategies, including data noise suppression, missing data reconstruction, and multi-source data fusion.
Traditional machine learning methods (such as linear regression and random forests) have achieved some success in low-dimensional data preprocessing through statistical imputation, feature selection, and regularization techniques. Deep learning can also preprocess high-dimensional and complex features efficiently. First, robust loss functions and uncertainty modeling can reduce the model's sensitivity to noisy features arising from experimental and computational errors in physicochemical properties[47]. Second, for the high missing rate of chemical composition descriptors in public databases, deep learning models such as graph attention networks (GATs) demonstrate significant advantages, effectively leveraging the available data to infer missing values[48]. Additionally, generative models such as generative adversarial networks (GANs) and variational autoencoders (VAEs) can generate samples resembling the real data distribution, which are used to fill in missing values for MOF chemical compositions and physicochemical properties. For instance, Choudhary et al.[49] used ALIGNN (Atomistic Line Graph Neural Network) to predict 29 properties of unlabeled materials. Combining traditional imputation techniques with deep generative features provides a solution for prediction tasks involving high noise and high missing-data rates.
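As a baseline against which the generative approaches above are usually compared, statistical imputation can be as simple as a per-column mean fill; the descriptor names in the comments are illustrative, not from a real dataset:

```python
import numpy as np

def impute_column_means(X):
    """Replace NaN entries of a descriptor matrix with per-column means
    computed over the observed values, a common baseline before training."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    nan_rows, nan_cols = np.where(np.isnan(X))
    X[nan_rows, nan_cols] = col_means[nan_cols]
    return X

# Rows are MOFs; columns might be surface area, void fraction, band gap.
X = np.array([
    [1200.0, 0.7,    np.nan],
    [800.0,  np.nan, 2.1],
    [1000.0, 0.5,    1.9],
])
X_filled = impute_column_means(X)
```

Mean imputation ignores correlations between descriptors, which is exactly the shortcoming that model-based approaches (GATs, GANs, VAEs) address by inferring each missing value from the rest of the sample.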

4 Application of End-to-End Deep Learning Models in MOFs Design

End-to-end deep learning automatically extracts features from data by constructing multi-layer neural networks, thereby enabling tasks such as data classification, prediction, and generation. Commonly used models include CNN, RNN, GAN, and autoencoders.

4.1 Convolutional Neural Network

CNN is a deep learning model primarily used for image and video processing. It extracts features through convolutional, pooling, and fully connected layers, effectively capturing local features and spatial structure in images (Figure 3a). The convolutional layer is the core of a CNN: it contains multiple learnable filters, each of which recognizes a different pattern and generates a feature map[50]. CNNs extract features hierarchically, with the level of abstraction increasing as the network deepens. Researchers have proposed a general framework for predicting MOF gas adsorption performance that uses the potential energy surface as the sole descriptor and employs a CNN to process three-dimensional energy images, enabling efficient prediction of the CO2 adsorption performance of MOFs[51]. Compared with models based on traditional geometric descriptors, this model performs better and requires less training data. Hung et al.[52] developed a gas adsorption prediction method based on a chemically encoded CNN, which represents MOF structural features using atomic positions and the corresponding chemical information. Trained on molecular-simulation Henry's constants for CO2 and CH4 in nearly ten thousand MOF structures, this CNN model predicts the adsorption performance of CH4 and CO2 with excellent accuracy.
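The idea of convolving a filter over a voxelized potential-energy surface can be sketched as follows. The grid size, the single random filter, and the linear readout weights are toy assumptions for illustration, not the published architectures.

```python
import numpy as np

def conv3d_valid(grid, kernel):
    """Slide one 3D filter over an energy grid (valid padding), producing a feature map."""
    gx, gy, gz = grid.shape
    kx, ky, kz = kernel.shape
    out = np.zeros((gx - kx + 1, gy - ky + 1, gz - kz + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(grid[i:i+kx, j:j+ky, k:k+kz] * kernel)
    return out

rng = np.random.default_rng(0)
energy_grid = rng.normal(size=(8, 8, 8))      # voxelized potential-energy surface (toy)
kernel = rng.normal(size=(3, 3, 3))           # one learnable 3D filter
fmap = np.maximum(conv3d_valid(energy_grid, kernel), 0.0)  # ReLU activation
pooled = fmap.mean()                          # global average pooling
w, b = 0.7, 0.1                               # toy linear readout weights
predicted_uptake = w * pooled + b             # scalar adsorption prediction
print(fmap.shape)                             # (6, 6, 6)
```

A real model stacks many such filters and layers, so that deeper feature maps respond to increasingly abstract energetic motifs (channels, binding pockets) in the grid.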

Fig. 3 Introduction of commonly used end-to-end deep learning frameworks:(a) CNN model,(b) RNN model,(c) GNN model and (d) GAN model

4.2 Recurrent Neural Network

RNN is a deep learning model well-suited for handling sequential data, particularly time series and text data. By using hidden states to remember previous information, RNN captures temporal dependencies within sequences. The basic structure of an RNN (Figure 3b) is as follows: (1) Input layer: receives sequential data; (2) Hidden layer: saves past information through recurrent connections and updates the current state; (3) Output layer: generates outputs related to the input sequence. RNN can be extended into various forms, including Long Short-Term Memory networks (LSTM) and Gated Recurrent Units (GRU), both of which can alleviate the vanishing gradient problem in standard RNNs and are suitable for capturing longer-term dependencies. For example, MOF structures can be encoded into specific SMILES representations, where each SMILES is defined by three components: inorganic nodes (metals), organic ligands, and topological structure. When learning to generate new SMILES, the distribution of the next symbol is determined based on predictions from the RNN model, and properties such as gas adsorption capacity and thermodynamic stability of the newly generated MOF materials are evaluated, helping researchers identify high-performance MOF materials[53]. Recently, Zhang et al.[54] combined Monte Carlo tree search with an RNN algorithm to build a deep learning model for designing novel MOFs with high adsorption performance for methane and carbon dioxide.
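The autoregressive sampling step described above can be sketched with a minimal numpy RNN cell: each step updates the hidden state and yields a softmax distribution over the next symbol. The toy vocabulary, hidden size, and random weights are illustrative assumptions; a real model would be trained on MOF SMILES data.

```python
import numpy as np

vocab = ["C", "O", "N", "Zn", "(", ")", "<end>"]   # toy MOF-SMILES vocabulary (assumed)
V, H = len(vocab), 16
rng = np.random.default_rng(1)
Wxh = rng.normal(scale=0.1, size=(H, V))   # input-to-hidden weights
Whh = rng.normal(scale=0.1, size=(H, H))   # recurrent hidden-to-hidden weights
Why = rng.normal(scale=0.1, size=(V, H))   # hidden-to-output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(h, token_id):
    """One recurrent step: update the hidden state, return the distribution
    over the next symbol in the sequence."""
    x = np.zeros(V)
    x[token_id] = 1.0                      # one-hot encoding of current symbol
    h = np.tanh(Wxh @ x + Whh @ h)
    return h, softmax(Why @ h)

h = np.zeros(H)
seq = [0]                                  # start from "C"
for _ in range(5):                         # autoregressively sample 5 more symbols
    h, p = step(h, seq[-1])
    seq.append(int(rng.choice(V, p=p)))
print([vocab[i] for i in seq])
```

Training adjusts the three weight matrices so that `p` assigns high probability to chemically valid continuations; LSTM/GRU cells replace the `tanh` update with gated variants.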

4.3 Graph Neural Networks

Graph neural networks (GNNs) are deep learning models designed for graph-structured data. Unlike CNNs, which operate on regular Euclidean data, GNNs can handle non-Euclidean data composed of nodes and edges, modeling local and global relationships between nodes through a message-passing mechanism (Figure 3c). The core idea is to iteratively update node features so that each node's representation includes not only its own attributes but also aggregated information from its neighbors, enabling GNNs to capture the complex topology and dynamic interactions within graph data. Traditional neural networks perform poorly on graph data because of its irregularity and complex connectivity. MOFs have complex and diverse structures in which atoms can be regarded as nodes, chemical bonds as edges connecting them, and weak interactions such as coordination bonds and hydrogen bonds can be encoded through edge weights or types. Conventional molecular modeling methods (such as those based on geometric descriptors) struggle to capture this complexity, whereas GNNs naturally handle the graph-structured data of MOFs and extract their topological features (e.g., node connection patterns, subgraph structures). In terms of model complexity, GNNs are generally more demanding than RNNs: an RNN processes sequential data with a cost that grows linearly with sequence length, whereas a GNN must simultaneously model node features and edge relationships, so its cost grows with the number of nodes and edges[55].
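The message-passing update can be sketched in a few lines: aggregate neighbor features via the adjacency matrix, then apply a shared learnable transform. The toy graph, feature dimensions, and random weights below are assumptions for illustration.

```python
import numpy as np

def message_passing_layer(H, A, W):
    """Mean-aggregate neighbor features, then apply a shared linear update + ReLU.
    H: node features (n x d); A: adjacency matrix (n x n); W: weights (d x d)."""
    deg = A.sum(axis=1, keepdims=True).clip(min=1)
    messages = (A @ H) / deg                     # average over each node's neighbors
    return np.maximum((H + messages) @ W, 0.0)   # combine self + neighbor information

# Toy graph: 4 atoms, bonds as edges (e.g. one metal node bonded to three O nodes)
A = np.array([[0, 1, 1, 1],
              [1, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0]], dtype=float)
rng = np.random.default_rng(2)
H = rng.normal(size=(4, 8))                      # initial atom features
W = rng.normal(scale=0.3, size=(8, 8))
H1 = message_passing_layer(H, A, W)              # one round of message passing
graph_embedding = H1.mean(axis=0)                # pooled representation for property readout
print(H1.shape, graph_embedding.shape)
```

Stacking several such layers lets information propagate over multi-bond paths, which is how a GNN picks up subgraph and topology patterns; edge weights or edge-type embeddings can be folded into `A` to encode coordination or hydrogen bonds.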
Xie et al.[56] proposed the crystal graph convolutional neural network (CGCNN) model, which represents a crystal structure as a graph encoding atomic information and interatomic bonding interactions, then applies convolutional layers to extract features and accurately predict properties such as formation energy, absolute energy, band gap, and Fermi energy. Lu et al.[57] used CGCNN to develop a deep learning model with an accuracy greater than 0.81 and applied it to screen the CSD database for high-performance hydrogen storage materials.

4.4 Generative Adversarial Networks

As shown in Figure 3d, a GAN is a deep learning framework consisting of two neural networks: a generator and a discriminator. The generator produces samples that resemble real data, while the discriminator judges whether an input sample is genuine or generated. Through this adversarial process the two networks continuously optimize each other, ultimately yielding high-quality generated samples[58]. A GAN can learn the characteristics of existing MOF datasets and generate novel MOFs with potentially superior performance. For example, researchers have proposed an automated nanoporous-material discovery platform driven by a supramolecular variational autoencoder (a related generative model), aimed at the generative design of reticular materials. Taking MOF structures as an example, the platform targets the separation of carbon dioxide from natural gas or flue gas. Jointly trained on various MOFs identified as superior for gas separation, the autoencoder demonstrates strong optimization capability, and the newly discovered MOFs exhibit adsorption performance superior to some of the best reported MOF/zeolite materials. Furthermore, GANs show great application potential in areas such as data augmentation and inverse design[58-59].
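The adversarial objectives can be sketched on toy 1-D data, with generator and discriminator each reduced to a single parameter. This shows only how the two losses are computed; no training loop is included, and all names and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def generator(z, w):
    """Toy generator: maps latent noise to a 1-D 'descriptor' value."""
    return w * z

def discriminator(x, v):
    """Toy discriminator: sigmoid score that x came from the real data."""
    return 1.0 / (1.0 + np.exp(-v * x))

real = rng.normal(loc=2.0, size=64)   # 'real' descriptor samples (toy distribution)
z = rng.normal(size=64)               # latent noise
w, v = 0.5, 1.0                       # toy generator / discriminator parameters
fake = generator(z, w)

# Discriminator maximizes log D(real) + log(1 - D(fake));
# generator minimizes log(1 - D(fake)) (equivalently maximizes log D(fake)).
d_loss = -(np.log(discriminator(real, v)).mean()
           + np.log(1.0 - discriminator(fake, v)).mean())
g_loss = -np.log(discriminator(fake, v)).mean()
print(round(float(d_loss), 3), round(float(g_loss), 3))
```

Training alternates gradient steps on `d_loss` and `g_loss` until the fake distribution is indistinguishable from the real one; in MOF generation the 1-D value is replaced by a full structure representation.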
In addition to GANs, VAEs and diffusion models have also made significant progress in the generative design of MOFs. VAEs map MOF structures into a latent space using an encoder and then generate new structures through a decoder. For example, Yao et al.[60]developed a VAE-based MOF generation framework that uses latent space interpolation to create MOFs with continuously varying porosity. Diffusion models generate high-quality MOF structures by gradually removing noise. For instance, Park et al.[61]proposed a diffusion model-based MOF generation method that simulates the crystal growth process to produce MOFs with specific topological structures, whose structural stability is superior to that of MOFs generated by traditional methods. These generative models provide diverse tools for MOF design, and in the future, integrating multiple models could further enhance the efficiency and quality of new material generation.
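Latent-space interpolation of the kind used in VAE-based generation can be sketched as follows. The linear "encoder" and "decoder" and the random descriptor vectors are illustrative stand-ins for a trained model, not any published framework.

```python
import numpy as np

rng = np.random.default_rng(4)
D, Z = 12, 3                              # descriptor and latent dimensions (toy)
We = rng.normal(scale=0.3, size=(Z, D))   # 'encoder' weights (fixed, illustrative)
Wd = rng.normal(scale=0.3, size=(D, Z))   # 'decoder' weights

encode = lambda x: We @ x                 # structure -> latent code
decode = lambda z: Wd @ z                 # latent code -> structure descriptors

mof_a = rng.normal(size=D)                # descriptor vectors of two known MOFs (toy)
mof_b = rng.normal(size=D)
za, zb = encode(mof_a), encode(mof_b)

# Walk the latent space from A to B; each decoded point is a candidate
# structure whose features (e.g. porosity-like quantities) vary smoothly.
candidates = [decode((1 - t) * za + t * zb) for t in np.linspace(0, 1, 5)]
print(len(candidates), candidates[0].shape)
```

In a real VAE the encoder outputs a mean and variance, the decoder is a deep network, and the smoothness of the latent space is what makes such interpolation chemically meaningful.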

5 Conclusion and Outlook

This article has examined the importance of developing novel descriptors for efficiently characterizing the structural features of MOFs, drawing on recent applications of traditional machine learning techniques in MOF design, and thereby provides theoretical support for enhancing the predictive performance of machine learning models. It has also surveyed the innovative achievements of end-to-end deep learning models in MOF material design, showcasing their potential in the materials discovery process.
Future developments of MOF descriptors should focus on the following aspects: First, descriptors should be more intuitive and have clear physical meanings, making model prediction results easier to understand and validate[62]. Second, descriptors should cover the diverse structures and properties of MOFs, ensuring their applicability to different scenarios. Additionally, to enhance the reliability of scientific research, MOF-related descriptors must exhibit high reproducibility. Finally, given the complex structure of MOF materials, optimizing computational costs is another important direction for descriptor development. For example, researchers are developing lightweight descriptors with low computational complexity to facilitate large-scale screening and rapid prediction[63].
Deep learning models have demonstrated significant advantages in handling high-dimensional data and capturing nonlinear relationships, but they also face challenges such as high demands for data volume and computational resources, as well as limited interpretability[64]. Researchers have proposed various methods to open the "black box" of deep learning models, and significant progress has been made in recent years. For CNNs, Grad-CAM (gradient-weighted class activation mapping) uses gradient information to map the regions the model focuses on back onto the input image, providing an intuitive explanation for image classification tasks[65,68]. Model-agnostic methods such as SHAP and LIME offer another perspective: LIME (local interpretable model-agnostic explanations) generates perturbed samples around an input and fits a simple model to approximate the complex model's local behavior, thereby identifying which features (e.g., specific regions of an image) drove a particular prediction[67]. Attention-based models (such as the Transformer) can directly expose the importance assigned to different parts of the input through their attention weights[66]. Furthermore, combining deep learning with traditional machine learning can also enhance the interpretability of deep learning models.
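The perturb-and-fit idea behind LIME can be implemented from scratch in a few lines. The black-box function, the perturbation scale, and the sample count below are illustrative assumptions, not the LIME library's actual implementation.

```python
import numpy as np

def black_box(X):
    """Stand-in for a deep model: nonlinear in two 'descriptor' inputs."""
    return np.sin(X[:, 0]) + 0.5 * X[:, 1]**2

def lime_like_explain(f, x0, n=500, scale=0.1, seed=0):
    """Perturb around x0, fit a local linear model by least squares, and
    return its coefficients as local feature importances (LIME-style sketch)."""
    rng = np.random.default_rng(seed)
    Xp = x0 + rng.normal(scale=scale, size=(n, x0.size))   # perturbed samples
    y = f(Xp)                                              # black-box predictions
    A = np.hstack([Xp - x0, np.ones((n, 1))])              # centered features + intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1]                                       # drop the intercept

x0 = np.array([0.0, 2.0])
importances = lime_like_explain(black_box, x0)
print(importances)   # ≈ [1.0, 2.0], the local gradient of f at x0
```

The fitted slopes recover the local sensitivities of the black box (here cos(0) = 1 and x₁ = 2), which is exactly the kind of local explanation LIME provides; the real library adds distance weighting and sparse feature selection.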
One practical route is to use a deep learning model as a feature extractor and construct a simple, interpretable model that approximates its behavior. An appropriate deep learning model is selected and trained on the target task to ensure effective prediction; input data are then passed through the network to obtain intermediate-layer outputs (feature representations). The feature outputs of all samples are aggregated into a feature matrix, on which a machine learning model is built. Comparing the predictions of the interpretable model with those of the deep learning model, the feature importances from the machine learning model help explain the deep learning predictions[69]. Coupling deep learning with more readily interpretable traditional machine learning methods in this way further enhances model transparency[70]. These advances indicate that the "black box" of deep learning models is gradually being opened.
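The extractor-plus-surrogate workflow just described can be sketched as follows, with a random two-layer network standing in for the trained deep model and ridge regression as the interpretable surrogate. All data and weights are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def deep_features(X, W1, W2):
    """Stand-in 'deep model' intermediate layer: two ReLU layers."""
    return np.maximum(np.maximum(X @ W1, 0.0) @ W2, 0.0)

n, d, h = 200, 6, 10
X = rng.normal(size=(n, d))                        # MOF descriptors (toy)
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=n)  # toy target property
W1 = rng.normal(scale=0.5, size=(d, h))            # frozen 'trained' weights (toy)
W2 = rng.normal(scale=0.5, size=(h, h))

F = deep_features(X, W1, W2)                       # aggregated feature matrix

# Interpretable surrogate: ridge regression on the extracted features;
# |coefficient| ranks which learned features drive the surrogate's prediction.
lam = 1e-2
coef = np.linalg.solve(F.T @ F + lam * np.eye(h), F.T @ y)
importance_rank = np.argsort(-np.abs(coef))
print(importance_rank[:3])                         # three most influential features
```

In practice one would also compare the surrogate's predictions against the deep model's on held-out data to confirm the approximation is faithful before trusting the importances.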
However, experimental data on MOFs are relatively scarce, and the diversity and representativeness of the available data are insufficient, which limits the training and generalization of machine-learning-based materials design models. Developing end-to-end frameworks based on transfer learning or meta-learning to reduce reliance on large-scale annotated datasets has therefore become an important research direction[71]. In addition, open-source datasets and model sharing will facilitate knowledge exchange among researchers worldwide, accelerating innovation and application in MOF materials design. Integrating multi-source data such as crystal graphs, potential energy surfaces, and experimental spectra to build a more comprehensive MOF design platform is a key direction for the field. In this context, AI agents show great potential: they can efficiently extract MOF-related structural, property, and synthesis information from vast amounts of scientific literature to construct a comprehensive knowledge base. Large language models (LLMs), as an emerging deep learning tool, leverage their powerful natural language processing capabilities to extract key information from complex scientific texts, generate structured data, and assist researchers with tasks such as experimental design, computational script writing, and data analysis[72]. For example, Bai et al.[73] recently conducted a systematic evaluation of open-source large language models in MOF research, demonstrating their excellent performance in knowledge extraction, experimental design, computational script generation, and database parsing; in particular, the Llama2-7B and ChatGLM2-6B models achieve high-precision information mining with moderate computational resources. This provides significant technical support for AI-driven MOF research.
Building upon these tools, Kang et al.[74] developed the L2M3 system, which further expands the application boundaries of LLMs by automatically extracting MOF synthesis conditions and performance data from over 40,000 articles and constructing a structure-synthesis-performance relationship database. The efficient data-processing capability of LLMs not only compensates for the shortage of experimental data but also lays a stronger theoretical and data foundation for AI-assisted MOF design through multi-task collaboration (such as predicting material properties and optimizing experimental protocols).
With the widespread application of MOFs, the environmental health risks they pose also require attention. The potential toxicity of MOFs primarily stems from the following aspects: First, metal ions may be released under specific conditions, leading to health issues such as cytotoxicity, neurotoxicity, and immunosuppression[75]. Second, the organic ligands of MOFs may degrade under certain conditions, and the degradation products could have adverse effects on biological systems[76]. Additionally, MOFs typically exist in the form of nanoparticles, and their unique physical properties may trigger a range of biological effects, such as inflammatory responses and accumulation within organisms, increasing the risk of long-term toxicity[77]. Therefore, while focusing on enhancing MOF performance, it is equally important to minimize their adverse impacts on ecosystems and human health. In the future, toxicity prediction models can be developed based on existing toxicity databases and machine learning approaches, enabling the screening of biocompatible MOFs[78].
In conclusion, although the application of traditional machine learning and deep learning in MOF design faces numerous challenges, their development prospects remain broad. With continuous technological advancements and deeper interdisciplinary collaborations, traditional machine learning and deep learning are expected to become crucial tools for intelligent MOF design, driving innovation in materials science and bringing new opportunities for the development and application of novel materials.
[1]
Furukawa H, Cordova K E, O’Keeffe M, Yaghi O M. Science, 2013, 341(6149): 1230444.

[2]
Yang Q Y, Liu D H, Zhong C L, Li J R. Chem. Rev., 2013, 113(10): 8261.

[3]
Li J R, Kuppler R J, Zhou H C. Chem. Soc. Rev., 2009, 38(5): 1477.

[4]
Knebel A, Caro J. Nat. Nanotechnol., 2022, 17(9): 911.

[5]
Belmabkhout Y, Bhatt P M, Adil K, Pillai R S, Cadiau A, Shkurenko A, Maurin G, Liu G P, Koros W J, Eddaoudi M. Nat. Energy, 2018, 3(12): 1059.

[6]
Yarahmadi H, Salamah S K, Kheimi M. Sci. Rep., 2023, 13: 19136.

[7]
Agirrezabal-Telleria I, Luz I, Ortuño M A, Oregui-Bengoechea M, Gandarias I, López N, Lail M A, Soukri M. Nat. Commun., 2019, 10: 2076.

[8]
Wang S, Fu Y, Wang T, Liu W S, Wang J, Zhao P, Ma H P, Chen Y, Cheng P, Zhang Z J. Nat. Commun., 2023, 14: 7261.

[9]
Zhang J, Liu L S, Zheng C F, Li W, Wang C R, Wang T S. Nat. Commun., 2023, 14: 4922.

[10]
Deng K R, Hou Z Y, Li X J, Li C X, Zhang Y X, Deng X R, Cheng Z Y, Lin J. Sci. Rep., 2015, 5: 7851.

[11]
Hu L G, Wu W H, Hu M, Jiang L, Lin D H, Wu J, Yang K. Nat. Commun., 2024, 15: 3204.

[12]
Rosen A S, Fung V, Huck P, O’Donnell C T, Horton M K, Truhlar D G, Persson K A, Notestein J M, Snurr R Q. NPJ Comput. Mater., 2022, 8: 112.

[13]
Erickson B J. Radiol. Clin. N. Am., 2021, 59(6): 933.

[14]
Rosenberg I, Sicard G, David E O. Entropy, 2018, 20(5): 390.

[15]
Salih A, Boscolo Galazzo I, Gkontra P, Lee A M, Lekadir K, Raisi-Estabragh Z, Petersen S E. Circ. Cardiovasc. Imag., 2023, 16(4): e014519.

[16]
Turing A M. Mind, 1950, 59(236): 433.

[17]
McCarthy J, Minsky M L, Rochester N, Shannon C E. AI Mag., 2006, 27(4): 12.

[18]
Krizhevsky A, Sutskever I, Hinton G E. Adv. Neural Inf. Process. Syst., 2012, 25: 1097.

[19]
Lu C X, Wan X L, Ma X H, Guan X J, Zhu A C. J. Chem. Inf. Model., 2022, 62(14): 3281.

[20]
He Y, Liu F, Min W C, Liu G H, Wu Y B, Wang Y, Yan X L, Yan B. ACS Appl. Mater. Interfaces, 2024, 16(48): 66367.

[21]
LeCun Y, Bengio Y, Hinton G. Nature, 2015, 521(7553): 436.

[22]
Zhang Y, Zhou B H, Cai X R, Guo W Y, Ding X K, Yuan X J. Inf. Sci., 2021, 551: 67.

[23]
Wang Z H, Chen J, Hoi S C H. IEEE Trans. Pattern Anal. Mach. Intell., 2021, 43(10): 3365.

[24]
Choi R Y, Coyner A S, Kalpathy-Cramer J, Chiang M F, Campbell J P. Transl. Vis. Sci. Technol., 2020, 9(2): 14.

[25]
Talaei Khoei T, Ould Slimane H, Kaabouch N. Neural Comput. Appl., 2023, 35(31): 23103.

[26]
Wilmer C E, Leaf M, Lee C Y, Farha O K, Hauser B G, Hupp J T, Snurr R Q. Nat. Chem., 2012, 4(2): 83.

[27]
Chung Y G, Camp J, Haranczyk M, Sikora B J, Bury W, Krungleviciute V, Yildirim T, Farha O K, Sholl D S, Snurr R Q. Chem. Mater., 2014, 26(21): 6185.

[28]
Bobbitt N S, Shi K H, Bucior B J, Chen H Y, Tracy-Amoroso N, Li Z, Sun Y, Merlin J H, Siepmann J I, Siderius D W, Snurr R Q. J. Chem. Eng. Data, 2023, 68(2): 483.

[29]
Li A, Perez R B, Wiggin S, Ward S C, Wood P A, Fairen-Jimenez D. Matter, 2021, 4(4): 1105.

[30]
Yaghi O M, Li H L. J. Am. Chem. Soc., 1995, 117(41): 10401.

[31]
Düren T, Sarkisov L, Yaghi O M, Snurr R Q. Langmuir, 2004, 20(7): 2683.

[32]
Wilmer C E, Farha O K, Bae Y S, Hupp J T, Snurr R Q. Energy Environ. Sci., 2012, 5(12): 9849.

[33]
Groom C R, Bruno I J, Lightfoot M P, Ward S C. Acta Crystallogr. Sect. B Struct. Sci. Cryst. Eng. Mater., 2016, 72(2): 171.

[34]
Chung Y G, Haldoupis E, Bucior B J, Haranczyk M, Lee S, Zhang H D, Vogiatzis K D, Milisavljevic M, Ling S L, Camp J S, Slater B, Siepmann J I, Sholl D S, Snurr R Q. J. Chem. Eng. Data, 2019, 64(12): 5985.

[35]
Rosen A S, Iyer S M, Ray D, Yao Z P, Aspuru-Guzik A, Gagliardi L, Notestein J M, Snurr R Q. Matter, 2021, 4(5): 1578.

[36]
Wang J Q, Liu J P, Wang H S, Zhou M S, Ke G L, Zhang L F, Wu J Z, Gao Z F, Lu D N. Nat. Commun., 2024, 15: 1904.

[37]
Batra R, Chen C, Evans T G, Walton K S, Ramprasad R. Nat. Mach. Intell., 2020, 2(11): 704.

[38]
Mashhadimoslem H, Ali Abdol M, Karimi P, Zanganeh K, Shafeen A, Elkamel A, Kamkar M. ACS Nano, 2024, 18(35): 23842.

[39]
Cao Z L, Magar R, Wang Y Y, Barati Farimani A. J. Am. Chem. Soc., 2023, 145(5): 2958.

[40]
He Y P, Cubuk E D, Allendorf M D, Reed E J. J. Phys. Chem. Lett., 2018, 9(16): 4562.

[41]
Faber F, Lindmaa A, von Lilienfeld O A, Armiento R. Int. J. Quantum Chem., 2015, 115(16): 1094.

[42]
Pham T L, Nguyen N D, Nguyen V D, Kino H, Miyake T, Dam H C. J. Chem. Phys., 2018, 148(20): 204106.

[43]
Bai X F, Li Y, Xie Y B, Chen Q C, Zhang X, Li J R. Green Energy Environ., 2025, 10(1): 132.

[44]
Bucior B J, Bobbitt N S, Islamoglu T, Goswami S, Gopalan A, Yildirim T, Farha O K, Bagheri N, Snurr R Q. Mol. Syst. Des. Eng., 2019, 4(1): 162.

[45]
Chandrashekar G, Sahin F. Comput. Electr. Eng., 2014, 40(1): 16.

[46]
Odhiambo Omuya E, Onyango Okeyo G, Waema Kimwele M. Expert Syst. Appl., 2021, 174: 114765.

[47]
Goodall R E A, Lee A A. Nat. Commun., 2020, 11: 6280.

[48]
Zhao M M, Peng H P, Li L X, Ren Y Q. Sensors, 2024, 24(5): 1522.

[49]
Choudhary K, DeCost B. NPJ Comput. Mater., 2021, 7: 185.

[50]
Sanchez-Cesteros O, Rincon M, Bachiller M, Valladares-Rodriguez S. Sensors, 2023, 23(17): 7582.

[51]
Sarikas A P, Gkagkas K, Froudakis G E. Sci. Rep., 2024, 14: 2242.

[52]
Hung T H, Xu Z X, Kang D Y, Lin L C. J. Phys. Chem. C, 2022, 126(5): 2813.

[53]
Arús-Pous J, Johansson S V, Prykhodko O, Bjerrum E J, Tyrchan C, Reymond J L, Chen H M, Engkvist O. J. Cheminf., 2019, 11: 71.

[54]
Zhang X Y, Zhang K X, Lee Y J. ACS Appl. Mater. Interfaces, 2020, 12(1): 734.

[55]
Ju W, Fang Z, Gu Y Y, Liu Z Q, Long Q Q, Qiao Z Y, Qin Y F, Shen J H, Sun F, Xiao Z P, Yang J W, Yuan J Y, Zhao Y S, Wang Y F, Luo X, Zhang M. Neural Netw., 2024, 173: 106207.

[56]
Xie T, Grossman J C. Phys. Rev. Lett., 2018, 120(14): 145301.

[57]
Lu X Y, Xie Z Z, Wu X J, Li M M, Cai W Q. Chem. Eng. Sci., 2022, 259: 117813.

[58]
Sanchez-Lengeling B, Aspuru-Guzik A. Science, 2018, 361(6400): 360.

[59]
Gómez-Bombarelli R, Wei J N, Duvenaud D, Hernández-Lobato J M, Sánchez-Lengeling B, Sheberla D, Aguilera-Iparraguirre J, Hirzel T D, Adams R P, Aspuru-Guzik A. ACS Cent. Sci., 2018, 4(2): 268.

[60]
Yao Z P, Sánchez-Lengeling B, Bobbitt N S, Bucior B J, Kumar S G H, Collins S P, Burns T, Woo T K, Farha O K, Snurr R Q, Aspuru-Guzik A. Nat. Mach. Intell., 2021, 3(1): 76.

[61]
Park J, Lee Y, Kim J. Nat. Commun., 2025, 16(1): 34.

[62]
Nandy A, Terrones G, Arunachalam N, Duan C R, Kastner D W, Kulik H J. Sci. Data, 2022, 9: 74.

[63]
Burner J, Luo J, White A, Mirmiran A, Kwon O, Boyd P G, Maley S, Gibaldi M, Simrod S, Ogden V, Woo T K. Chem. Mater., 2023, 35(3): 900.

[64]
Masegosa A R, Cabañas R, Langseth H, Nielsen T D, Salmerón A. Entropy, 2021, 23(1): 117.

[65]
Selvaraju R R, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Int. J. Comput. Vis., 2020, 128(2): 336.

[66]
Yeh C, Chen Y D, Wu A Y, Chen C, Viégas F, Wattenberg M. IEEE Trans. Visual. Comput. Graphics, 2024, 30(1): 262.

[67]
Hussain I, Jany R, Boyer R, Azad A, Alyami S A, Park S J, Hasan M M, Hossain M A. Sensors, 2023, 23(17): 7452.

[68]
Rajpoot R, Gour M, Jain S, Semwal V B. Sci. Rep., 2024, 14: 24985.

[69]
Linardatos P, Papastefanopoulos V, Kotsiantis S. Entropy, 2021, 23(1): 18.

[70]
Samek W, Montavon G, Vedaldi A, Hansen L K, Müller K. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Springer Cham, 2019. 5.

[71]
Kang Y, Park H, Smit B, Kim J. Nat. Mach. Intell., 2023, 5(3): 309.

[72]
Zhang W, Wang Q G, Kong X T, Xiong J C, Ni S K, Cao D H, Niu B Y, Chen M G, Li Y M, Zhang R Z, Wang Y T, Zhang L H, Li X T, Xiong Z P, Shi Q, Huang Z M, Fu Z Y, Zheng M Y. Chem. Sci., 2024, 15(27): 10600.

[73]
Bai X F, Xie Y B, Zhang X, Han H G, Li J R. J. Chem. Inf. Model., 2024, 64(13): 4958.

[74]
Kang Y, Lee W, Bae T, Han S, Jang H, Kim J. J. Am. Chem. Soc., 2025, 147(5): 3943.

[75]
Wiśniewska P, Haponiuk J, Saeb M R, Rabiee N, Bencherif S A. Chem. Eng. J., 2023, 471: 144400.

[76]
Tang M Y, Guan Q, Fang Y L, Wu X, Zhang J J, Xie H, Yu X, Ou R W. Sep. Purif. Technol., 2024, 342: 127059.

[77]
Ettlinger R, Lächelt U, Gref R, Horcajada P, Lammers T, Serre C, Couvreur P, Morris R E, Wuttke S. Chem. Soc. Rev., 2022, 51(2): 464.

[78]
He Y, Liu G, Li C, Yan X. Rev. Environ. Contam. Toxicol., 2022, 260(1): 21.
