Progress in Chemistry

Abbreviation (ISO4): Prog Chem      Editor in chief: Jincai ZHAO

Review

Latent Space Embedding Methods for Chemical Molecules: Principles and Applications

  • Haotian Chen 1, 2 ,
  • Tao Yang 1, 2 ,
  • Xiaotong Liu 1, 2, *
  • 1 College of Computer Science, Beijing Information Science and Technology University, Beijing 100192, China
  • 2 Beijing Advanced Innovation Center for Materials Genome Engineering, Beijing Information Science and Technology University, Beijing 100192, China

Received date: 2025-03-12

  Revised date: 2025-06-19

  Online published: 2025-10-15

Supported by

National Natural Science Foundation of China(22272009)

National Natural Science Foundation of China(22203008)

Abstract

Effective representation of chemical molecules is the key to promoting chemical informatics and new material research and development. In recent years, data-driven molecular representation technology has been developed. Compared with traditional manually designed descriptors and graph structure analysis methods, it can effectively avoid noise and information redundancy, and provide support for efficient and accurate property prediction. Embedding representation has the characteristics of efficient information compression, data representation enhancement and semantic retention, and has been widely used in fields such as deep learning and data mining. Inspired by word embeddings in the field of natural language processing, researchers began to explore the application of similar methods to the construction of the latent space of chemical molecules, and proposed a variety of embedding methods for molecular property prediction and molecular structure generation. This review first elucidates the principles of general embedding technology in machine learning, and then sequentially discusses chemical element latent space representation methods and chemical molecule latent space embedding techniques. By examining the innovative applications of related technologies in natural language processing and graph embedding to molecular embeddings, the review reveals that current molecular embedding methods are gradually evolving towards multimodality, self-supervised learning, and dynamic modeling, and it outlines prospects for future research trends.

Contents

1 Introduction

2 Principles of embedding in machine learning

2.1 Word embedding

2.2 Graph embedding

2.3 Multimodal embedding

3 Element latent space representation methods

3.1 Attribute-based element representation

3.2 Element representation based on physicochemical knowledge

3.3 Data-driven element embedding

4 Advances in molecular latent space embedding

4.1 Traditional chemical feature-based molecular descriptors

4.2 Graph theory-driven molecular embedding

4.3 Data-driven molecular embedding

4.4 Multimodal molecular embedding

5 Conclusion and outlook

5.1 Current status and key technology

5.2 Future research prospects

Cite this article

Haotian Chen, Tao Yang, Xiaotong Liu. Latent Space Embedding Methods for Chemical Molecules: Principles and Applications[J]. Progress in Chemistry, 2025, 37(10): 1456-1478. DOI: 10.7536/PC20250308

1 Introduction

To meet the demand for large-scale data in molecular information processing and intelligent tasks, high-throughput experimental and computational simulation technologies have rapidly advanced in recent years[1-2], driving an exponential increase in research data in the field of chemistry. How to conduct efficient data analysis and knowledge extraction from large-scale data has become a key challenge in uncovering the intrinsic patterns within the data[3]. Constrained by theoretical formulas and prior assumptions, traditional analytical methods such as empirical formulas, linear regression, and manual feature engineering exhibit significant limitations when dealing with high-dimensional, non-linear, multi-source heterogeneous data, making it difficult to capture the deep relationships between molecular structures and properties. Researchers have attempted to introduce machine learning (ML) methods to transform the traditional research paradigm[4], with the aim of achieving efficient data compression and discovering latent associations while preserving critical information to the greatest extent possible. As a subset of machine learning methods, deep learning, with its outstanding performance in complex data feature extraction and representation learning, has gradually emerged as a mainstream modeling approach in the fields of materials science and chemistry.
Initially, embedding techniques achieved breakthrough progress in the field of natural language processing[5-6]. Word embedding technologies enable computers to understand and process semantic relationships in natural language, bringing a qualitative leap to language understanding and generation. As embedding techniques have demonstrated strong adaptability in recommendation systems, computer vision, and social network analysis, researchers have begun to extend their application to complex, diverse molecular and materials systems. Embedding methods transform atomic or molecular structural information into numerical features that can be processed by algorithms, and the amount of meaningful information preserved in these vectors largely determines the performance and predictive accuracy of downstream models[7].
As an important research direction in cheminformatics, the study of chemical molecule embedding in latent space can reveal the implicit relationships among molecules and has demonstrated unique value in data analysis and model training. Data types such as Simplified Molecular Input Line Entry Specification (SMILES), molecular fingerprints, molecular graphs, molecular images, spectra, and text (scientific literature) provide rich application scenarios for embedding techniques. Experiments have found that these modalities often provide molecular information only from specific perspectives, whereas multimodal embedding methods that integrate multiple data sources can more comprehensively characterize the multidimensional properties of complex molecular systems[8-11]. By integrating information from different sources, this approach provides models with richer and more stable feature representations. Figure 1 illustrates the evolution of molecular embedding methods, reflecting the trend of molecular representation methods toward data-driven learning. Leveraging abundant chemical data resources and scholarly knowledge, along with the computational power of machine learning in feature learning, data-driven embedding methods typically exhibit representational capabilities that surpass those of traditionally hand-crafted descriptors[12]. Current AI-based molecular models can be broadly categorized into two types: machine learning models based on molecular descriptors and end-to-end geometric deep learning models. The feature representation methods employed by these two categories of molecular models can be further classified into three types according to their generation mechanisms: representations that capture physicochemical and structural properties (descriptors and fingerprints), graph-theory-driven graphical representations of compounds, and data-driven embedding methods that utilize deep learning and natural language processing algorithms.

Fig.1 The evolution of latent space embedding methods for chemical molecules

Chemical molecules are composed of atoms and their bonding patterns, with microscopic information such as atom type, position, and local environment playing a decisive role in determining overall properties. Element embedding, as an atomic-level feature representation, provides foundational data for subsequent molecular descriptions (such as graph embeddings and molecular fingerprints) by capturing atomic information and their interrelationships, making it the primary step in efficiently constructing molecular embeddings. This article will proceed from the local to the global, sequentially introducing the principles of machine learning embeddings, element latent space representations, and molecular latent space embeddings, comparing the similarities and differences among various embedding methods, and outlining future research directions in molecular embedding.

2 Principles of Embedding in Machine Learning

2.1 Word Embedding

The concept of embedding can be traced back to the field of natural language processing, where Bengio et al.[13] first proposed in 2000 the idea of mapping words to a continuous vector space, known as "word embedding", with the aim of capturing semantic relationships between words. This concept has effectively advanced tasks such as natural language understanding and generation, and has laid the theoretical and methodological foundation for modern representation learning. Typical word embedding models include Word2Vec[5], GloVe[14], and FastText[15], with their underlying principles summarized in Table 1. Vectorization methods inspired by linguistic structures have also provided a general framework for embedding techniques applied to other complex data structures. In the field of chemistry, word embedding methods have been introduced into modeling molecular sequences and corpus structures, enabling the handling of text-based structural representations or domain-specific knowledge. For example, SMILES molecular structure sequences can be treated analogously to sentences in natural language, or contextual relationships in literature corpora can be used to represent molecular features.

Table 1 Summary of word embedding methods

Method Implementation principles Ref
Skip-Gram Predicts context words based on a given center word. 6
CBOW Predicts the center word based on given context words. 6
GloVe Constructs a global co-occurrence matrix between words and uses the co-occurrence probabilities to learn word vectors. 14
FastText Uses subwords (n-grams) to construct word vectors and effectively handles rare words. 15
LSA Reduces the dimension of the word co-occurrence matrix based on singular value decomposition (SVD). 16
LDA Models the co-occurrence relationship between documents and topics to obtain the probability distribution of words. 17
ELMo Uses a bidirectional Long Short-Term Memory (LSTM) network to obtain contextualized (dynamic) word embeddings. 18
BERT Generates context-aware word vectors based on the self-attention mechanism. 19
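The Skip-Gram idea in Table 1 can be illustrated on molecular text. The sketch below is a minimal, hypothetical example of treating a SMILES string as a "sentence": the tokenizer pattern and the small element set it covers are simplifying assumptions, and a real pipeline would feed the resulting pairs to a full Word2Vec implementation.

```python
import re

# Illustrative tokenizer: splits a SMILES string into atom and bond
# "words" (covers only a few common elements; not a complete SMILES parser).
TOKEN_RE = re.compile(r"Cl|Br|\[[^\]]+\]|[BCNOSPFI]|[bcnops]|[=#()\d]")

def tokenize_smiles(smiles):
    return TOKEN_RE.findall(smiles)

def skipgram_pairs(tokens, window=1):
    """Build (center, context) training pairs as in the Skip-Gram objective."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = tokenize_smiles("CC(=O)O")  # acetic acid
print(tokens)
print(skipgram_pairs(tokens, window=1)[:3])
```

Each (center, context) pair would serve as one training example for the Skip-Gram objective, exactly as word pairs do in natural language corpora.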

2.2 Graph Embedding

In the field of chemistry, data such as molecular graphs, reaction networks, and material configurations naturally exhibit graph structures. Graph embedding techniques aim to convert nodes, edges, and their local or global structural features into low-dimensional vector representations, thereby effectively capturing relationships among data entities and supporting downstream tasks such as prediction or clustering. Early graph embedding methods, such as Laplacian eigenmaps (LE)[20] and matrix factorization (MF)[21], have already been applied to graph analysis tasks in chemistry. In the year following the introduction of Word2Vec, Perozzi et al.[22] creatively proposed DeepWalk, bringing word embedding methods into the domain of graph embeddings for handling structured and graph data. DeepWalk draws an analogy between random walks on a graph and sentences in a corpus: nodes in a walk sequence are analogous to words in a sentence, and the co-occurrence relationships of nodes during random walks are analogous to word co-occurrence relationships. This method applies the Word2Vec approach to node sequences generated by random walks to obtain node embeddings. In subsequent research, Grover and Leskovec[23] addressed the limitation of DeepWalk, which cannot be applied to weighted graphs, by introducing parameters that control the direction of random walks and designing probabilistic walk strategies. Figure 2 illustrates the process of applying graph embedding methods to downstream tasks. Unlike word embeddings, graph embeddings not only need to capture relationships between nodes but also must account for the topological structure of the graph. Subsequently, methods such as GraphSAGE (Graph SAmple and aggreGatE)[24] and graph convolutional networks (GCN)[25] were developed, directly leveraging information from node neighbors or performing graph convolution operations to efficiently handle large-scale or highly complex graph data.
The core advantage of graph embeddings lies in their ability to capture overall structural topological features by leveraging local atomic environments, making them well suited for structurally complex and topologically diverse molecular structures. These methods have been widely applied in tasks such as chemical property prediction, molecular classification, and reaction pathway identification.
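The random-walk analogy behind DeepWalk can be sketched in a few lines. The toy graph and parameters below are illustrative only: walks over an adjacency list produce "sentences" of node identifiers, which a Word2Vec model would then embed.

```python
import random

# Toy undirected graph as an adjacency list (illustrative, not a molecule)
graph = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}

def random_walk(graph, start, length, rng):
    """Generate one walk: each step moves to a uniformly chosen neighbor."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

def build_corpus(graph, walks_per_node=2, length=4, seed=0):
    """Collect walks from every node; each walk plays the role of a sentence."""
    rng = random.Random(seed)
    corpus = []
    for node in graph:
        for _ in range(walks_per_node):
            corpus.append(random_walk(graph, node, length, rng))
    return corpus

corpus = build_corpus(graph)
print(corpus[0])  # a length-4 walk starting at node "A"
```

In node2vec, the uniform `rng.choice` step would be replaced by a biased choice controlled by the return and in-out parameters, which is how weighted and directed walk strategies are introduced.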

Fig.2 (a) Matrix factorization-based methods use a data matrix to learn embeddings through factorization. (b) Random walk-based methods generate node sequences via random walks and apply the Word2Vec model to learn embedding representations. (c) Neural network-based methods vary in architecture and input across different models

2.3 Multimodal Embedding

To comprehensively convey information about an entity, multiple perceptions of the same object from different perspectives are recorded in various types of data, such as text, images, video, and audio. The goal of multimodal representation learning is to reduce distributional differences in the joint semantic subspace while preserving the modality-specific semantics. In the field of representation learning, "modality" refers to a specific method or mechanism for encoding information. Because multimodal data describe an entity from different perspectives and are often complementary in content, they contain richer information than single-modal data. In molecular representation learning, information sources encompass various heterogeneous data types, including SMILES strings, molecular graphs, molecular images, spectra, and textual literature. These modalities provide rich molecular features from different dimensions. Since feature vectors from different modalities are initially distributed in their respective independent feature subspaces, even when semantically highly related, their corresponding numerical representations may still exhibit significant differences. This intermodal heterogeneity can hinder the effective utilization of multimodal data by subsequent machine learning models[26]; therefore, a common approach is to project these heterogeneous features into a common subspace[27] to reduce the heterogeneity gap between modalities. In this subspace, multimodal data with similar semantics will be represented by similar vectors, and using the aligned features to build machine learning models can help the models better understand and integrate multimodal information.
As shown in Figure 3, current multimodal representation learning mainly comprises three frameworks[28]: the joint representation framework integrates information by learning a shared semantic subspace; the coordinated representation framework learns independent yet coordinated representations for each modality under specific constraints; and the encoder-decoder framework focuses on intermodal transformation while maintaining semantic consistency.
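The projection into a common subspace described above can be sketched as follows. All vectors and projection matrices here are made-up placeholders standing in for learned encoders; the point is only the mechanics: two modality features of different dimensionality are mapped into the same low-dimensional space, where semantic similarity can be compared with the cosine measure.

```python
import math

def matvec(W, x):
    """Apply a linear projection W (list of rows) to a feature vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical 3-D "SMILES" feature and 4-D "image" feature of one molecule
x_smiles = [1.0, 0.5, 0.0]
x_image = [0.9, 0.4, 0.1, 0.0]

# Fixed projection matrices into a shared 2-D subspace (illustrative only;
# in practice these would be trained under an alignment objective)
W_smiles = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_image = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

z1 = matvec(W_smiles, x_smiles)
z2 = matvec(W_image, x_image)
print(round(cosine(z1, z2), 3))  # close to 1: semantically aligned
```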

Fig.3 Multimodal representation learning frameworks. (a) Joint representation framework; (b) coordinated representation framework; (c) encoder-decoder framework

3 Element Latent Space Representation Methods

3.1 Element Representation Based on Attribute Features

The properties of a molecule are determined by the attributes of its constituent elements and the relationships among these elements. To encode elements, a set of physicochemical properties is typically selected as features based on the specific task, and these are then converted into fixed-length vector representations. The constructed feature representations directly reflect the position of elements in the periodic table and their fundamental chemical characteristics, aiding in the understanding of how elements behave within molecular or material systems. Commonly used physicochemical properties include group, period, electronegativity, atomic radius, electron affinity, and bond energy[29-30]. Researchers can construct different element representations according to their research needs to accommodate various molecular systems. The atomic number, as a fundamental feature, is also frequently used in encoding design; a typical implementation involves converting the atomic number (or other numerical identifier) of an element into a one-hot encoded vector, which is particularly suitable when the number of chemical elements is relatively limited. However, since one-hot encoded vectors are equidistant in space and fail to capture the physical or chemical similarities among elements, this approach can limit the discriminative power of the model. When atoms serve as nodes in molecular graph structures, their representation varies across different models[31-34]. A representative example is the Crystal Graph Convolutional Neural Networks (CGCNN) proposed by Xie et al.[35], which selects nine elemental attributes to represent elements, using categorical encoding for discrete values based on attribute information and dividing continuous values into 10 categories uniformly according to the range of attribute values. 
The purpose of discretizing continuous attribute values is to introduce interval information for different features into the model, and the segmentation points in the discretization process are typically determined based on data distribution or physicochemical characteristics.
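The two encodings discussed above can be sketched briefly. The element vocabulary, attribute range, and bin count below are illustrative choices: a one-hot vector over a small element set, and uniform discretization of a continuous attribute into 10 bins in the spirit of CGCNN's categorical encoding.

```python
# Small element vocabulary (illustrative; CGCNN covers far more elements)
ELEMENTS = ["H", "C", "N", "O", "F"]

def one_hot(element):
    """Equidistant one-hot vector: captures identity but not similarity."""
    return [1 if e == element else 0 for e in ELEMENTS]

def discretize(value, lo, hi, n_bins=10):
    """Map a continuous attribute (e.g. electronegativity) to a bin index
    by uniformly splitting the range [lo, hi) into n_bins categories."""
    if value >= hi:
        return n_bins - 1
    frac = (value - lo) / (hi - lo)
    return max(0, min(n_bins - 1, int(frac * n_bins)))

print(one_hot("O"))
print(discretize(3.44, lo=0.7, hi=4.0))  # Pauling electronegativity of O
```

A discretized index like this is typically itself one-hot encoded before being concatenated with the other attribute categories into the final atom feature vector.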

3.2 Element Representation Based on Physical and Chemical Knowledge

Feature vectors composed solely of single-element property information struggle to reflect interactions between elements, and thus often exhibit limitations when handling tasks involving complex interrelationships. Some researchers have turned to element representation methods based on modeling with physical laws, leveraging quantum mechanical theories to characterize the microscopic behavior of elements. A typical approach involves using physical models such as Density Functional Theory (DFT) to compute physical quantities like electronic structure, atomic orbital distribution, and band structure for elements or their constituent molecules, and then converting these results into vector representations of the elements[36]. Such representation methods based on physicochemical knowledge are primarily applied in physics domains that require precise modeling, as their feature vectors can incorporate physical laws to capture the deep-level physicochemical properties of elements. However, to achieve high accuracy, the feature selection process often involves a high degree of subjectivity and is limited to specific domains.

3.3 Data-Driven Element Embedding

Unlike manually constructed methods based on expert knowledge, data-driven embedding methods rely on models such as neural networks to autonomously learn the distributional representations of chemical elements in a low-dimensional latent space from large-scale data. The effectiveness of embedding vectors depends on data quality and model architecture; through a task-driven training process, the learned vectors can capture patterns of elements in specific chemical environments. In deep learning, a common strategy is to construct a learnable lookup table (i.e., an embedding matrix) that maps discrete input indices (such as atomic numbers) to continuous vector representations of arbitrary length. The model optimizes the randomly initialized embedding vectors using the backpropagation algorithm, adjusting the vector representations to minimize the loss function in the training task, thereby ensuring that elements with similar chemical environments exhibit higher similarity in the vector space. For example, Chen et al.[37] have employed this approach in MEGNet (MatErials graph network), a materials science machine learning prediction framework built using graph neural networks (GNNs), to represent atoms in crystal graphs, effectively circumventing the limitations of manually selected elemental features.
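The lookup-table mechanism can be shown in a minimal sketch. The dimensions and random initialization below are illustrative; a real model (e.g. one built with a deep learning framework's embedding layer) would update the rows by backpropagation, whereas here only the lookup mechanics are shown.

```python
import random

class ElementEmbedding:
    """Learnable-style lookup table: atomic number -> dense vector.
    Rows are randomly initialized; training would adjust them."""

    def __init__(self, n_elements=118, dim=16, seed=0):
        rng = random.Random(seed)
        # index 0 is unused so atomic numbers index rows directly
        self.table = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                      for _ in range(n_elements + 1)]

    def __call__(self, atomic_number):
        return self.table[atomic_number]

emb = ElementEmbedding()
carbon = emb(6)   # 16-dimensional vector for carbon
oxygen = emb(8)
print(len(carbon))
```

After training on a property-prediction task, rows for chemically similar elements (e.g. those in the same group) tend to end up close together in the vector space.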
Knowledge graphs provide a structured and interpretable representation method for capturing chemical elements, compounds, materials, and their complex semantic relationships. In a knowledge graph, entities such as chemical elements, molecules, and material properties are represented as nodes, which are connected through relationships (e.g., "participates in reaction," "possesses property"). This approach integrates scientific data from multiple sources and reveals how elements behave in different chemical environments. Take MatKG (Materials Knowledge Graph)[38] as an example: it is one of the largest knowledge graphs in the field of materials science, covering a wide range of topics related to materials science. MatKG extracts data from a vast corpus of materials science literature and, using advanced natural language processing techniques, automatically identifies and constructs a structured graph that includes entities such as materials, properties, applications, and synthesis methods. The graph currently covers more than 70,000 entities and 5.4 million relationships. By leveraging knowledge graph embedding (KGE) technology, the nodes and relationships in the graph can be mapped into a low-dimensional vector space. For chemical element nodes, their roles and chemical behavior in different material or molecular environments can be reflected through the graph's structured information.
The Word2Vec model, used in natural language processing for word embedding, has also been successfully applied in the fields of chemistry and materials science. Since the vast majority of scientific knowledge is published in textual form, it is challenging for both traditional statistical methods and modern machine learning approaches to fully unlock its value. Tshitoyan et al.[39] proposed an unsupervised text embedding method called Mat2Vec, which represents a pioneering effort in constructing text embeddings in materials science. This method applies a Skip-gram variant to a corpus of materials science literature to predict context words surrounding target terms (such as chemical elements, materials, and related physical properties). The rich contextual information enables the generated embeddings to capture not only simple element symbols or positional relationships but also multidimensional information about elements in real-world material environments. Furthermore, models such as BERT enhance contextual understanding through masked language models (MLMs), thereby improving the representational quality of text embeddings. The aforementioned knowledge graph embedding and text embedding methods in materials science are not limited to elemental representation; they are equally applicable to molecular and material descriptions. Relevant advances will be further discussed in the subsequent chapter on data-driven molecular embeddings.

4 Advances in Molecular Latent Space Embedding

4.1 Traditional Chemical Feature-Based Molecular Descriptors

A key step in chemical and materials science applications is converting molecular structural information into a digital representation that can be processed by computational models through molecular descriptors. To date, various types of molecular descriptors have been developed, ranging from molecular formulas and 2D molecular graphs to 3D conformations and higher-level representations (such as those that consider intermolecular relative orientations and time-dependent dynamic behavior)[40]. These descriptors encode features at different levels of molecular structure, providing effective input for tasks such as predicting, screening, and classifying molecular properties.
In quantitative structure-activity relationship (QSAR) studies, the widely used molecular descriptors can be categorized into multiple levels, ranging from 0D to 4D. Figure 4 takes C13H18O2 as an example to illustrate the information contained in descriptors of different dimensions. The chemical formula is the most basic form of molecular representation; it provides only information on the elemental composition of the molecule and the stoichiometric ratios, but does not include information on atomic connectivity or molecular structure. Zero-dimensional descriptors are calculated directly based on the chemical formula, yielding descriptors such as the count of atom types or the sum of atomic properties. These descriptors are easy to compute, but they contain limited information and cannot be uniquely mapped to specific molecules.

Fig.4 Molecular descriptors of different dimensions

1D descriptors are used to characterize molecular topological structural properties, such as structural fragments or fingerprints. Among these, SMILES has become the most popular method for storing and processing molecular chemical structure information in the form of a 1D text string. Similar representations, such as SMARTS (SMiles ARbitrary target specification), can uniquely encode molecules and implicitly contain chemical information such as atom types, connectivity, bond orders, branching, rings, protonation sites, and chiral centers. Based on this information, more complex 2D and 3D molecular structure representations can be derived. By fragmenting SMILES strings, molecules can be regarded as combinations of several typical substructures, thereby constructing 1D molecular descriptors. These descriptors typically consist of a set of substrings and are encoded in binary form (indicating the presence or absence of specific substructures) or in count form (recording the frequency of substructure occurrences).
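A 1D substructure descriptor of the kind described above can be sketched as follows. The fragment list is illustrative, and plain substring matching is only a crude stand-in for real SMARTS pattern matching; it nonetheless shows the binary (presence/absence) and count (frequency) encoding modes.

```python
# Hypothetical fragment dictionary (real descriptors use SMARTS patterns)
FRAGMENTS = ["C(=O)O", "OH", "N", "c1ccccc1"]

def binary_descriptor(smiles):
    """Binary encoding: 1 if the fragment substring occurs, else 0."""
    return [1 if frag in smiles else 0 for frag in FRAGMENTS]

def count_descriptor(smiles):
    """Count encoding: number of (possibly overlapping-free) occurrences."""
    return [smiles.count(frag) for frag in FRAGMENTS]

smiles = "CC(=O)O"  # acetic acid
print(binary_descriptor(smiles))
print(count_descriptor(smiles))
```

In practice a cheminformatics toolkit would perform true substructure matching on the molecular graph rather than on the raw string, since the same substructure can be written in many SMILES variants.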
Two-dimensional descriptors capture the structural information and properties of a molecule in a two-dimensional plane, reflecting the connectivity between atoms as well as the presence and properties of chemical bonds. From a molecular graph, multiple two-dimensional descriptors can be derived using mathematical methods. These descriptors are sensitive to structural features and can be used to characterize information such as molecular size, shape, symmetry, branching, and cyclic structures. In addition, specific chemical information can be incorporated through the weighting of atomic attributes in the molecular graph[41].
3D descriptors are used to describe the geometric properties of molecules in three-dimensional space, with a focus on reflecting the spatial arrangement of atoms. When a molecule is regarded as a geometric entity in 3D space, the positions and arrangements of its atoms can be represented using Cartesian coordinates. These descriptors can accurately capture bond lengths, bond angles, and atomic connectivity information, thereby integrating the conformational characteristics of the entire molecule. 4D descriptors build upon 3D descriptors by further incorporating dynamic factors such as molecular–receptor interactions and conformational changes.
Different-dimensional molecular descriptors characterize the composition, structure, and behavioral features of molecules at multiple levels. Based on these structural levels, researchers have developed various forms of classical molecular descriptors. Classical molecular descriptors represent molecular structures or chemical properties—derived either from experimental measurements or calculated from the topological and geometric features of molecular structures—by converting them into numerical vectors, where each numerical value in the vector corresponds to a specific structural feature or chemical property of the molecule. To quantitatively describe molecular structure, properties, and reaction behavior, these descriptors can be used individually or in combination with other descriptors, depending on the requirements of the task. The measurement scales of these descriptors include discrete values (such as the number of double bonds or counts of atom types), binary values (such as the presence or absence of specific substituents), and continuous values (such as molecular weight or polarity). Experimental evidence[42]demonstrates that representing molecule-specific features relevant to a particular task in vector form can effectively support the construction and performance enhancement of machine learning models.
Molecular fingerprints, as an abstract representation of molecular structural features, were initially used for substructure and similarity searches in chemical databases. By qualitatively encoding important substructures or features within a molecule's structure, they have been widely employed in machine learning. Fingerprints can be classified into various types based on the type of chemical information, the algorithms used, and the application scenarios. Dictionary-based fingerprints set bit strings according to the presence or absence of specific substructures or features from a predefined list of structures. For example, MACCS (Molecular ACCess system) fingerprints[43] use 166 fragment substructures encoded by SMARTS to describe molecular structures, covering most of the chemical features required in drug discovery and virtual screening. These fingerprints are simple and efficient; however, since they rely on predefined substructures, they may overlook more complex topological information in molecules. Topology- or path-based fingerprints, on the other hand, generate fingerprints by analyzing all molecular fragments starting from an atom and extending up to a specified number of bonds (typically linear paths), then applying hashing to each path. Such fingerprints are applicable to arbitrary molecules and their length is adjustable. However, due to the hashing process, specific bit positions in the fingerprint do not correspond one-to-one with structural features, meaning that different structures may map to the same bit position. The Daylight fingerprint is a prominent representative of this type of fingerprint; it consists of up to 2048 bit positions and encodes all possible connection pathways of a molecule up to a given length. Circular fingerprints are another type of hashed topology-based fingerprint.
Unlike path-based fingerprints, circular fingerprints do not search for paths within the molecule but instead record the atomic environment within a specified radius starting from an atom. The Extended connectivity fingerprint (ECFP)[44], derived from the Morgan algorithm, is the industry-standard method for circular molecular fingerprints. When used, ECFP generates fingerprints of variable length depending on the set diameter. According to the research by Rogers and Hahn[44], ECFPs with larger diameters can capture more molecular structural details and are theoretically better suited for machine learning-based predictions; however, due to higher computational costs, ECFP6 and ECFP8 (where the numbers represent the diameter) are more commonly used in practice. In addition, text-based fingerprints (LINGO[45] and SMIfp[46]) generate compound fingerprints from the canonical SMILES string of a compound. With advances in chemical informatics, classical fingerprint types have been improved to form enhanced fingerprints. For example, Bender et al.[47], building on the work of Nidhi et al.[48], combined ECFP4 with Bayesian models and the Pearson correlation coefficient to develop the Bayesian affinity fingerprint, which is used to predict ligand-target binding affinity based on ligand structure. Since a single fingerprint mode often struggles to cover all key molecular features or compound properties, multi-fingerprint fusion strategies have emerged. These strategies integrate fingerprint vectors from different sources to simultaneously account for structural, electronic, and bioactive information. For example, Laufkotter et al.[49] combined the bioactive descriptor HTSFP66 with the structural descriptor ECFP4 to propose the bioactive-structural hybrid (BaSH) fingerprint, which, when combined with machine learning, demonstrates superior compound activity prediction capabilities and scaffold-hopping potential.
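The hash-and-fold mechanism common to path-based and circular fingerprints can be sketched as follows. The bit length and the use of linear SMILES substrings as stand-ins for molecular paths are deliberate simplifications; the sketch shows why different fragments can collide on the same bit position.

```python
import hashlib

N_BITS = 64  # real fingerprints typically use 1024 or 2048 bits

def hashed_fingerprint(smiles, max_len=3):
    """Hash every substring of length 1..max_len into a fixed-length
    bit vector; collisions (different fragments, same bit) are possible."""
    bits = [0] * N_BITS
    for length in range(1, max_len + 1):
        for i in range(len(smiles) - length + 1):
            frag = smiles[i:i + length]
            h = int(hashlib.md5(frag.encode()).hexdigest(), 16)
            bits[h % N_BITS] = 1
    return bits

fp = hashed_fingerprint("CC(=O)O")
print(sum(fp), "bits set out of", N_BITS)
```

A real path-based fingerprint enumerates bond paths on the molecular graph rather than string substrings, and a circular fingerprint like ECFP instead hashes iteratively grown atom environments, but the folding of hash values into a fixed bit vector works the same way.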

4.2 Graph Theory–Driven Molecular Embedding

In molecular modeling, graph structures provide a natural and intuitive method for data representation. Mathematically, a graph is defined as a tuple G = (V, E), where each edge e ∈ E connects a pair of nodes in V. As shown in Figure 5a, the concept of a molecular graph involves mapping the atoms and chemical bonds that constitute a molecule to a set of nodes and edges. In terms of representation, a molecular graph belongs to a two-dimensional topological structure; the nodes themselves do not contain fixed spatial position information, but rather reflect the connectivity between nodes. However, three-dimensional information (such as atomic coordinates and bond angles) can be encoded in the attributes of nodes and edges, thereby effectively representing the spatial configuration of the molecule.

Fig.5 (a) Feature labelling of the graph structure. (b) Feature updates through message passing and aggregation. (c) Iterative graph updates in the graph neural network

To convert a molecular graph into a numerical representation that can be processed by machine learning models, the molecular topology (including the way atoms are connected, the types of atoms, and the types of bonds) must be mapped to a matrix. Specifically, the connectivity between atoms is typically represented as an adjacency matrix, which indicates whether a chemical bond exists between any two nodes. Atomic attributes (such as atom type and formal charge) and bond attributes (such as bond type) are represented through a feature matrix, where each row corresponds to a feature vector for an atom or an edge in the molecular graph. It is important to note that the matrix representation of a graph depends on the order of the nodes. Depending on whether the representation of the same molecule needs to be consistent, graph traversal algorithms such as depth-first search, breadth-first search, or random search are typically used to determine the node order.
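The mapping from a small molecular graph to an adjacency matrix and a one-hot atom feature matrix can be sketched as follows. The hand-built formaldehyde (C=O) example and the three-element atom-type list are illustrative assumptions; toolkits such as RDKit automate this step for real molecules.

```python
# Illustrative atom-type vocabulary for one-hot encoding (assumption).
ATOM_TYPES = ["C", "O", "N"]

def encode(atoms, bonds, n):
    """Return (adjacency matrix, one-hot feature matrix) for a toy graph.

    atoms: list of element symbols, one per node.
    bonds: list of (i, j, order) tuples; the bond order is stored on the
           edge, a simple way to encode bond attributes in the matrix.
    """
    adj = [[0] * n for _ in range(n)]
    for a, b, order in bonds:
        adj[a][b] = adj[b][a] = order       # symmetric: undirected graph
    feats = [[1 if sym == t else 0 for t in ATOM_TYPES] for sym in atoms]
    return adj, feats

# Formaldehyde heavy atoms: a C=O double bond between node 0 and node 1.
adj, feats = encode(["C", "O"], [(0, 1, 2)], 2)
```

Note that the row order of both matrices depends on the chosen node ordering, which is why a canonical traversal (depth-first, breadth-first) is needed when the same molecule must always yield the same matrices.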
The fundamental assumption of molecular graphs is that the key interactions between atomic nuclei and electrons in a molecule can be implicitly represented through graph structures, thereby effectively describing the molecule’s geometry, function, and properties. Although molecular graphs simplify the representation of molecules, they still have limitations when dealing with certain special types of molecules[50]. For example, for hypervalent compound molecules whose bonding patterns do not conform to valence bond theory, hypergraphs must be introduced for processing; in a hypergraph, a hyperedge can connect two or more atoms. Furthermore, for molecules whose atomic arrangements in space are constantly changing—such as during bond formation or breakage and frequent structural reorganization—a static molecular graph representation is no longer suitable. In real-world scenarios, graph structures, including molecular graphs, often need to handle dynamic processes (changes in nodes, edges, and attributes), which poses significant challenges for machine learning-based reasoning and prediction. Although dynamic graph representation learning[51-52] has become a current research hotspot, most graph learning methods in the field of chemistry still remain at the level of static graph structures, leaving room for future research on dynamics in chemistry.
Before the rise of graph learning, researchers had attempted to use various fixed-size descriptors to represent crystal and molecular structures, such as the Coulomb matrix[53], classical force-field inspired descriptors (CFID)[54-55], and Voronoi tessellations[56]. With the evolution of methods, GNNs have demonstrated performance advantages over traditionally hand-engineered descriptors[32]. GNNs can be understood as a generalization of convolutional neural networks (CNNs) to irregular graph structures. They can simultaneously consider both node-specific features and topological relationships between nodes, thereby learning graph embedding representations directly from molecular structures—including molecular structure graphs composed of atoms and chemical bonds, three-dimensional conformations, or point clouds[57-58]. As shown in Figure 5b, the core process of GNNs involves transforming the input molecular graph through multiple rounds of information aggregation and iteration, and then making predictions based on the updated graph structure. Inspired by position-based models, various types of geometric information (distance[37], bond[59], and dihedral angle[60-61]) are represented using symmetry or basis functions to construct node or edge representations. Graph networks extend or process this geometric information by employing methods such as Gaussian functions[62], radial basis functions[63], and spherical Fourier-Bessel functions[64]. In addition to hand-designed input features, researchers have also drawn on word embedding techniques from natural language processing to explore embedding representation learning[65-66].
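As a concrete example of the Gaussian-function expansion mentioned above, the sketch below smears a scalar interatomic distance over a grid of Gaussian centers, the edge-featurization style used by SchNet-like networks. The center grid and width are arbitrary illustrative values, not those of any published model.

```python
import math

def gaussian_expansion(d, centers, width=0.5):
    """Expand a scalar distance d into a smooth feature vector.

    Each output component measures how close d is to one Gaussian center,
    turning a single number into a vector a neural network can filter on.
    """
    return [math.exp(-((d - c) ** 2) / (2 * width ** 2)) for c in centers]

centers = [0.0, 1.0, 2.0, 3.0]        # illustrative grid in Angstrom
feat = gaussian_expansion(1.5, centers)
```

Radial basis and spherical Fourier-Bessel expansions play the same role with different basis functions: they convert raw geometry into fixed-length, smoothly varying edge features.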
GNNs differ from traditional machine learning models primarily in how they extract structural information. Most GNN architectures proposed in the field of materials science follow the message passing neural network (MPNN) framework introduced by Gilmer et al.[67]. The aggregation function is a key component of the network architecture; it determines how the model effectively combines information from neighboring nodes and significantly influences the model's performance in practical tasks. Figure 5c illustrates the process by which each layer of a GNN aggregates information from neighboring nodes. To obtain a holistic molecular representation (rather than being limited to node representations), all node information must be aggregated into a global descriptor via a Readout function. Table 2 selects five GNN models applied to molecular research and summarizes their similarities and differences in terms of input features and the choice of aggregation methods. In practical applications within the chemical domain, the focus of GNN research has gradually shifted from selecting atomic features to selecting edge (bond) features, thereby accurately capturing interactions within molecules. At the same time, researchers have begun using equivariant graph neural networks to address the impact of spatial rotations on molecular representations and have introduced auxiliary tasks to enhance the model's feature learning capabilities.

Table 2 GNNs feature selection and information aggregation methods

Model | Node features | Edge features | Others | Activation function | Aggregation method | Ref
SchNet | Atomic number | Atomic distances expanded with Gaussian basis functions | - | Softplus | Uses a filter-generating network (a fully connected neural network) based on interatomic positions to generate filter values; performs element-wise multiplication of the neighboring atoms’ atomic representations with these filter values and applies continuous-filter convolution to aggregate information from surrounding atoms | 62
MEGNet | Atom type, chirality, ring sizes, hybridization, acceptor, donor, aromatic | Bond type, same ring, graph distance, expanded distance | Defines the global state as the average atomic weight and the number of bonds per atom | Softplus | Employs a multi-layer perceptron with two hidden layers to sequentially update bond attributes, atom attributes, and global state attributes, where the attributes updated in one step are passed to the next update | 37
DimeNet++ | Atomic number | Atomic distances expanded with radial basis functions (RBF) and bond angles expanded with spherical basis functions (SBF) | - | SiLU | Sums the neighboring message with the message obtained by element-wise multiplying the fully connected layer embedded RBF distance information with the SBF angle information, and then combines this with its own message to complete the message aggregation | 59
ALIGNN | Electronegativity, group number, covalent radius, valence electrons, first ionization energy, electron affinity, block, atomic volume | Atomic distances expanded with RBF | Constructs the atomistic line graph from the atomistic graph, where the nodes share latent representations with the bonds in the atomistic graph, and the initial edge features are derived from the RBF expansion of the cosine of bond angles | SiLU | Performs edge-gated graph convolution[68] on both the atomistic bond graph and the line graph; in the line graph, triplets of atoms and bond features are updated, and the newly updated pair features are then propagated to the edges of the direct graph and further updated with the atom features via a second edge-gated graph convolution applied to the direct graph | 69
eqV2 S DeNS | Atomic number | Atomic distances expanded with RBF and relative position vectors expanded using spherical harmonics (SH) | Utilizes Denoising Non-equilibrium Structures (DeNS)[70] as an auxiliary task by adding noise to the 3D structure, combining it with forces from the original non-equilibrium structure, and predicting the denoised structure | SiLU | Incorporates relative positions between nodes by rotating the concatenated source and target node features (ensuring equivariance); decomposes the product of the radial function-generated distance embedding and the rotated node features to compute attention weights and value features, multiplies these, rotates back to the original coordinate system, and finally concatenates features from multiple attention heads to generate new features | 71
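The aggregation-then-Readout pipeline described above can be illustrated with a deliberately minimal, parameter-free sketch: one round of sum-aggregation message passing followed by a sum Readout. Real MPNN layers interleave learned transformations (MLPs, gated filters) with the aggregation step; none of that is shown here.

```python
def mpnn_layer(h, adj):
    """One round of sum-aggregation message passing (minimal sketch):
    each node's new state is its own state plus the sum of its
    neighbours' states."""
    n, dim = len(h), len(h[0])
    new_h = []
    for i in range(n):
        msg = [0.0] * dim
        for j in range(n):
            if adj[i][j]:                          # j is a neighbour of i
                msg = [m + x for m, x in zip(msg, h[j])]
        new_h.append([a + b for a, b in zip(h[i], msg)])
    return new_h

def readout(h):
    """Sum Readout: aggregate all node states into one graph descriptor."""
    return [sum(col) for col in zip(*h)]

# 3-node path graph A-B-C with 2-dimensional node features.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
g = readout(mpnn_layer(h, adj))   # graph-level descriptor
```

Swapping the sum for a mean, max, or attention-weighted sum is exactly the "choice of aggregation method" compared across the models in Table 2.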

4.3 Data-Driven Molecular Embeddings

As data science concepts have been increasingly applied in chemistry, researchers have gradually moved away from reliance on manually designed features and instead turned to deep learning and natural language processing algorithms to enable models to autonomously learn representations of molecular structures in a latent space. This section will introduce the application of methods such as language models, generative models, graph contrastive learning, graph pre-training, and graph embedding in molecular embedding learning. These methods have been introduced into the field of cheminformatics with the aim of overcoming the limitations of input formats like SMILES strings and molecular graphs in terms of their capacity for molecular representation; Table 3 lists the relevant representative embedding methods discussed in this subsection.

Table 3 Data-driven embedding approaches

Method Architecture type Input type Representation learning strategy Ref
Mol2Vec Word2Vec SMILES Treats compound substructures derived from the Morgan algorithm as "words" and compounds as "sentences", applying the Word2Vec algorithm on a corpus of compounds. 72
Grammar2Vec Word2Vec SMILES Treats the grammar rules that generate SMILES as "words" and molecules as "sentences", applying the CBOW method on the dataset. 73
Mat2Vec Word2Vec Text (material science abstracts) Builds a vocabulary of 500000 from 3.3 million scientific abstracts and applies the skip-gram method on the text corpus. 39
OWL2Vec* Word2Vec OWL[74] Encodes OWL ontology semantics by considering graph structure, lexical information, and logical constructors, implementing an ontology embedding method based on random walks and word embeddings. 75
Smiles2Vec RNN SMILES Encodes the SMILES string into a fixed-length vector and uses RNN to process the string character by character. 76
FP2Vec CNN Fingerprint Extracts molecular substructures from SMILES representations, builds a lookup table for the generated fingerprint indices, and maps these indices into trainable embedding vectors. 77
HiMol GNN Molecular graph Comprises a hierarchical molecular graph neural network and multi-level self-supervised pre-training. It builds a molecular representation learning method based on node-motif-graph hierarchical information, augments graph-level nodes to simulate molecular graph representations, and enables bidirectional transmission of local and global features. 78
SMILES-BERT BERT SMILES Adjusts the Transformer layer design in BERT and pre-trains the model using a Masked SMILES Recovery task. It also incorporates a quantitative drug similarity prediction task to improve classification performance during fine-tuning. 79
ChemicalBERT BERT Text (chemical texts from PMC abstracts) Combines ChemicalBERT and AGGCN[80] components to generate high-quality contextual representations and capture syntactic graph information. It merges features from sequential and syntactic graph representations in parallel to predict chemical-protein interaction types. 81
Mol-BERT BERT SMILES Obtains atomic identifiers with radii 0 and 1 using the Morgan algorithm, uses identifier embeddings as input for pre-training BERT modules, and performs pre-training using only the Masked Language Model (MLM) task. 82
MolRoPE-BERT BERT SMILES Modifies Mol-BERT by replacing absolute positional encoding with Rotary Positional Encoding (RoPE) to address the insensitivity of the self-attention mechanism to positional information. 83
Chem-BERT BERT SMILES Designs a matrix embedding layer to learn molecular connectivity. In addition to the MLM task, it adds a quantitative drug similarity prediction task to obtain representations integrating SMILES and chemistry context. 84
ChemBERTa RoBERTa SMILES Processes SMILES sequences using Byte-Pair Encoding (BPE) tokenization and pre-trains using the MLM task. 85
DeBERTaSSL DeBERTa SMILES Uses SMILES tokenization to represent molecular components and pre-trains the DeBERTa model with the MLM task. It further combines GCN for self-supervised learning of molecular graph structure information, capturing both sequence and molecular structure information. 86
MoLFormer Transformer SMILES Combines Rotary Positional Encoding, a linear attention mechanism, and the MLM pre-training method to demonstrate that training on SMILES learns spatial relationships between atoms in a molecule. 87
Molformer Transformer Heterogeneous molecular graph Constructs heterogeneous molecular graphs by extracting motifs, then uses a Transformer with heterogeneous self-attention to distinguish multi-level node interactions. Incorporates an attentive downsampling algorithm to aggregate informative molecular representations efficiently. 88
In recent years, natural language processing (NLP) techniques have begun to be applied to representation learning of classical molecular features[32,89-90]. One common approach involves using “term frequency-inverse document frequency” (TF-IDF)[89] and “latent Dirichlet allocation” (LDA) methods[91]; another approach learns embedded representations of molecular fragments via Word2Vec. Mol2Vec[72] is a representative method in recent years that learns molecular embeddings based on Word2Vec, effectively addressing the issue of insufficient representation of substructure correlations in ECFP. Mol2Vec decomposes molecules into fragments (substructures), and its processing of substructures draws on the Word2Vec model from NLP, treating substructures as “words” and compounds as “sentences.” The molecular representation is obtained by taking a weighted average of the vectors of all substructures within the molecule. Similarly, SMILES lacks consideration of the overall molecular structure and stereochemistry when encoding molecular structures, whereas Grammar2Vec[73] leverages SMILES syntax generation rules[92], treating molecules as “sentences” composed of these rules, with each rule regarded as a “word,” and applying the CBOW (continuous bag-of-words) method from Word2Vec to generate molecular vector representations.
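The final aggregation step of Mol2Vec, averaging the vectors of a molecule's substructure "words", can be sketched as below. The embedding table here is a made-up toy stand-in for the vectors Mol2Vec actually learns with Word2Vec over Morgan-algorithm substructure identifiers.

```python
# Toy substructure embedding table (invented for illustration; Mol2Vec
# learns these vectors with Word2Vec over a large compound corpus).
toy_embeddings = {
    "sub_a": [1.0, 0.0],
    "sub_b": [0.0, 1.0],
    "sub_c": [1.0, 1.0],
}

def molecule_vector(substructures, emb):
    """Molecule vector = mean of its substructure 'word' vectors."""
    dim = len(next(iter(emb.values())))
    acc = [0.0] * dim
    for s in substructures:
        acc = [a + x for a, x in zip(acc, emb[s])]
    return [a / len(substructures) for a in acc]

v = molecule_vector(["sub_a", "sub_b", "sub_c"], toy_embeddings)
```

The same scheme accepts any weighting of substructures (e.g., by frequency), which is what "weighted average" refers to in the text.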
Hinton et al.[93] proposed an autoencoder framework that uses unsupervised methods to obtain latent representations of input data, providing a new approach to chemical structure representation. This method encodes and compresses input data into a low-dimensional space, capturing essential data information by learning to reconstruct the original input. Building on this idea, Gómez-Bombarelli et al.[94] were the first to introduce SMILES sequences into variational auto-encoders (VAEs), extracting and reconstructing latent representations of chemical structures within an encoder-decoder (ED) architecture, thereby generating continuous and reversible molecular representations. However, this method is based on character-level reconstruction and fails to fully preserve the overall chemical structural information of molecules. To address this limitation, Winter et al.[95] introduced the concept of neural machine translation, using a sequence-to-sequence (Seq2Seq) architecture to build a model that translates random SMILES into canonical SMILES (SMILES-Seq2Seq). This method enables conversion between random SMILES and canonical SMILES and can be further extended to conversions with other string-based representations (e.g., InChI). In addition, Winter et al.[94] introduced SMILES2Vec into the model architecture, using a gated recurrent unit (GRU) architecture to generate latent representations.
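Before a SMILES string can enter a character-level VAE of this kind, it is typically padded to a fixed length and one-hot encoded. A minimal sketch follows; the character set and the space padding token are illustrative choices, not those of the original work.

```python
def one_hot_smiles(smiles, charset, max_len):
    """Character-level one-hot encoding of a SMILES string.

    Returns a max_len x len(charset) 0/1 matrix: one row per character
    position, with a 1 at the index of that character in the charset.
    """
    pad = " "                                  # padding token (assumption)
    padded = smiles.ljust(max_len, pad)[:max_len]
    table = {c: i for i, c in enumerate(charset)}
    mat = []
    for ch in padded:
        row = [0] * len(charset)
        row[table[ch]] = 1
        mat.append(row)
    return mat

charset = " CO=()"                             # toy charset (assumption)
x = one_hot_smiles("C=O", charset, 5)          # formaldehyde, padded to 5
```

The decoder of the VAE emits a probability distribution over the same charset per position, which is why character-level reconstruction can lose global structural information: nothing in this encoding ties distant ring-opening and ring-closing characters together.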
Given the sequential nature of SMILES, researchers have further developed various methods based on recurrent neural networks (RNNs) for generating new molecules or learning new molecular embeddings[96-98]. For example, Xu et al.[98] proposed the unsupervised method Seq2Seq Fingerprint, which uses a multi-layer RNN autoencoder built on GRUs to obtain molecular embeddings by concatenating the hidden states of the autoencoder. Huo et al.[99] used a bidirectional long short-term memory network (BiLSTM) combined with channel and spatial attention modules to explore SMILES representations. However, RNN models have limitations in capturing interatomic relationships and bond types. To address this, Mol2Context-vec[100] generates dynamic representations of molecular substructures by integrating multiple internal states. With the development of pre-trained language models, MolGPT[101] was the first to apply the GPT[102] (generative pre-trained transformer) model to molecular generation tasks, achieving success in both molecular property prediction and structural optimization.
Molecular embedding has become a crucial step in achieving efficient molecular representation learning and chemical structure generation. Continuous latent-space embedding and discrete molecular embedding are two common strategies for applying molecular embedding in generative models. The former provides a smooth representation that facilitates gradient optimization and probabilistic inference but faces the challenge of mapping discrete structures into continuous space; the latter preserves the inherent discrete properties of molecules and is better suited to capturing graph-structural distributional features. This section will focus on these two strategies and introduce their applications in various generative models. In the context of continuous latent-space modeling, Jin et al.[103] proposed the JT-VAE for molecular graph generation, enabling automated molecular design based on specific chemical characteristics. Compared with generating general graphs with degree constraints, creating tree structures is less computationally complex, thanks to the implementation of a tree-based graph representation (which decomposes a molecular graph into molecular substructures represented by tree nodes). JT-VAE achieved a milestone 100% molecular validity rate. HierVAE[104] builds on JT-VAE by introducing larger and more flexible graph structural units as basic building blocks, demonstrating higher performance when handling larger molecules. G-SchNet[105] also employs an autoregressive approach, placing atoms one by one in three-dimensional Euclidean space. By integrating geometric constraints in Euclidean space and rotational invariance of atomic distributions as prior knowledge, this method directly generates three-dimensional molecular structures without relying on any graph- or bond-based information.
Deep generative models learn a continuous latent space by encoding molecular graphs, but in accelerating drug discovery, they often fail to simultaneously ensure that the generated atoms and bond types comply with chemical bonding constraints. Compared with VAEs, GANs, and autoregressive models, flow-based models can memorize and precisely reconstruct the entire input dataset. Shi et al.[106] proposed GraphAF, a flow-based autoregressive molecular graph generation model, which converts discrete graph data into continuous data by adding real-valued noise. This method abstracts the molecular graph generation problem as a sequential decision-making process: starting from an empty graph, new nodes are generated sequentially according to subgraph structures, and edges between new nodes and existing nodes are systematically constructed. Even without explicit chemical knowledge rules, GraphAF achieves an effective molecular generation rate of 68%. MoFlow[107] also builds on a flow-based model, enabling reversible mapping to generate molecular graphs in a single step while ensuring chemical validity. MoFlow is based on the Glow model[108] to generate multiple types of chemical bonds and uses graph convolutions to construct a graph-conditioned flow model for generating atoms with specified bonds, ultimately assembling atoms and bonds to form valid molecular graphs that satisfy valence-bond constraints. Compared with GraphAF, MoFlow increases the effective molecular generation rate to 82%. In generating three-dimensional molecules, models typically first predict atom types and corresponding atomic coordinates, then determine atomic bonds based on interatomic distances, which can lead to the generation of unrealistic topological structures (such as large rings) or errors in atomic valence bonds.
Peng et al.[109] addressed the issue of atom–bond inconsistency in three-dimensional molecular generation by proposing MolDiff, a diffusion model that performs probabilistic sampling of both atoms and bonds simultaneously, significantly enhancing the drug-likeness of generated molecules. MolDiff is based on SE3-equivariant neural networks, performing message passing for both atoms and chemical bonds, and uses gradients from an atom–bond predictor during molecular generation to guide the formation of more chemically appropriate bonds. The diffusion model DrugDiff[110] maps molecular SELFIES[111] (SELF-referencIng Embedded Strings) sequences into a continuous latent space through VAE training[112], while a series of attribute predictors guide the latent-space diffusion model during the sampling process[113]. This process generates novel compounds with multiple desired molecular properties. The dequantization process of converting discrete graphs into continuous data weakens the model’s ability to accurately represent the distribution of the original discrete graph structure. In the context of discrete molecular embedding, Luo et al.[114] proposed GraphDF, a discrete-flow model based on GraphAF[106], which uses discrete latent variables to generate molecular graphs, thereby addressing the issues arising from dequantization. In GraphDF, all latent variables are discrete, sampled via multinomial distributions, and the model reversibly maps discrete latent variables to new nodes and edges. In addition to generating complete molecules, some models focus on specific structural design scenarios. The graph-based deep-learning method DeLinker[115] is the first to directly incorporate three-dimensional structural information into the molecular generation model design process.
DeLinker generates or replaces connectors between two molecular fragments based on their relative positions and orientations, using the relative distances and orientations between partial structures to achieve protein-context sensitivity, and has demonstrated effectiveness and applicability in various design problems (fragment linking, scaffold hopping, and chimeric constructs targeting protein degradation). Liu et al.[116] proposed the energy-based molecular graph generation model GraphEBM to address the issue of model bias arising from the failure to ensure permutation invariance. Permutation invariance is an intrinsic and desirable inductive bias in graph modeling; this study parameterizes the energy function in a permutation-invariant manner, endowing the model with permutation invariance. The model employs Langevin dynamics[117] to train the energy function using an approximate maximum likelihood approach, and then generates samples from the trained energy function. Xu et al.[118] build on the denoising diffusion model[119] by treating atoms as particles in a thermodynamic system, using simulation of diffusion and reverse-generation processes to ensure rotational and translational invariance, thereby effectively generating three-dimensional conformations of molecules. For structure-based drug design, Liu et al.[120] proposed the GraphBP model, which generates three-dimensional molecules with specific three-dimensional binding sites by placing atoms one by one. This method uses a three-dimensional graph neural network to extract current semantic information, generates atoms sequentially through an autoregressive flow model, and takes into account the equivariance properties of three-dimensional space during generation, effectively producing molecules capable of binding to target proteins.
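The dequantization step that flow models such as GraphAF apply to discrete graph data, adding real-valued noise so that discrete labels become continuous while remaining recoverable, can be sketched in a few lines (a toy illustration on bond-type labels, not the models' actual implementation):

```python
import random

def dequantize(x_int, rng):
    """Add uniform noise in [0, 1) to each discrete value.

    The result is continuous (suitable for a flow model's change-of-
    variables), yet taking the floor recovers the original label exactly.
    """
    return [v + rng.random() for v in x_int]

rng = random.Random(0)                 # fixed seed for reproducibility
bonds = [0, 1, 2, 1]                   # toy discrete bond-type labels
cont = dequantize(bonds, rng)          # continuous values
recovered = [int(v) for v in cont]     # floor recovers the labels
```

The text's caveat is visible here: the flow ends up modeling a smeared density over each unit interval rather than the true discrete distribution, which is the mismatch GraphDF's fully discrete latent variables avoid.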
The introduction of the Transformer architecture has opened up new avenues for molecular representation learning. Researchers pre-train models on large-scale unsupervised data and then fine-tune them for specific chemical tasks. The successful adaptation of BERT in SMILES-BERT[79] demonstrated the feasibility of molecular representation research based on the Transformer framework, laying a foundation for subsequent studies. Building on this, Chithrananda et al.[85] proposed ChemBERTa, which replaces the traditional BERT model with RoBERTa[121] to effectively enhance the model’s ability to represent SMILES strings. Experimental results show that the MLM-based pre-training strategy significantly improves the model’s predictive performance in downstream tasks on MoleculeNet[122], demonstrating the cross-domain transfer potential of language model pre-training in cheminformatics. Liu et al.[83] introduced rotary position embedding (RoPE)[123] to more efficiently encode positional information in SMILES sequences, thereby enhancing the ability of BERT pre-trained models to extract latent molecular substructure information. Building on the Mol-BERT model[82], this approach replaces absolute position encoding with RoPE, addressing the issue of model performance being affected by sequence length and achieving performance improvements in molecular property prediction tasks. MoLFormer[87] also adopts RoPE, integrating molecular language representation with Transformer encoder modules to model intra-molecular atomic spatial relationships in large-scale molecular language models. In addition to sequence modeling, Wu et al.[88] introduced the concept of heterogeneous molecular graphs (HMGs) and proposed Molformer, a molecular representation learning framework that simultaneously leverages molecular motifs and three-dimensional geometric structures.
This method uses a heterogeneous self-attention mechanism (HSA) to distinguish interaction relationships among nodes at different levels and employs an attentive farthest point sampling (AFPS) algorithm to aggregate molecular representations, demonstrating its potential and advantages in molecular modeling tasks across multiple fields, including quantum chemistry, physiology, and biophysics. In addition to optimizing the encoder, MegaMolBART, a small-molecule language model, explores the application of bidirectional and auto-regressive transformer architectures (BART) in molecular pre-training and has demonstrated embedding performance superior to Morgan fingerprints[12]. These research findings fully demonstrate that large-scale molecular language models can effectively capture chemical and structural information, providing strong support for predicting various molecular properties.
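The rotary position embedding (RoPE) adopted by MolRoPE-BERT and MoLFormer can be sketched as follows: each two-dimensional slice of a token vector is rotated by a position-dependent angle, so dot products between rotated vectors depend only on the relative position of the two tokens. This is a simplified illustration, not those models' exact implementation.

```python
import math

def rope(vec, pos, base=10000.0):
    """Apply a rotary positional embedding to an even-length vector.

    Pairs of dimensions (2i, 2i+1) are rotated by pos * base^(-2i/d),
    a different frequency per pair, mirroring sinusoidal encodings.
    """
    out = []
    for i in range(0, len(vec), 2):
        theta = pos * base ** (-i / len(vec))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]   # 2-d rotation
    return out

v = [1.0, 0.0, 1.0, 0.0]
r0 = rope(v, pos=0)     # position 0: identity rotation
r1 = rope(v, pos=1)
```

Because rotations compose, the inner product of a query rotated at position m with a key rotated at position n depends only on n - m, which is exactly the relative-position sensitivity that absolute position encodings lack.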
One-dimensional representations based on string-type molecular descriptions struggle to capture molecular topological features; therefore, researchers have shifted their focus from one-dimensional molecular sequences to two-dimensional graph representations that convey more structural information[124]. Given the successful application of self-supervised learning (SSL) frameworks to molecular sequences, graph-based pre-training frameworks[125-129] have rapidly developed in recent years. In the field of molecular representation learning, an increasing number of studies are attempting to model molecular graphs by incorporating the inherent characteristics of the molecular graph itself. For example, Zang et al.[78] proposed the HiMol framework to effectively mine the chemical structural features within molecular graphs and address the limitation of Readout functions in lacking global information. The framework achieves this goal by leveraging hierarchical molecular graph neural networks and multi-level self-supervised pre-training. Zhang et al.[130] proposed the self-supervised framework MGSSL, which implements structure-aware self-supervised learning by defining the traversal order of motifs in the graph (using either depth-first search or breadth-first search) as a pretext task. In addition, a series of contrastive learning methods have emerged that focus on subgraphs (MICRO-Graph[131]), molecular graph structure-enhancing patterns (MolCLR[132]), motif learning (iMolCLR[133]), and modeling chemical reaction relationships (MolR[134]). Molecular contrastive learning (MCL) has become a key approach for addressing the lack of explicit relationships between molecules. One of the central challenges is how to design reasonable molecular graph augmentation strategies to generate effective positive and negative samples.
Existing MCL methods and their variants largely rely on the molecular graph random perturbation augmentation scheme proposed by MolCLR: atom masking, bond deletion, and subgraph deletion. However, this augmentation scheme often overlooks the chemical rules and prior knowledge embedded in molecular structures. To address this issue, Gong et al.[135] proposed the MIFS framework, which aims to tackle the limitations of information propagation within molecules and the lack of chemical plausibility in self-supervised learning. By integrating three distinct information propagation pathways, MIFS for the first time realizes a dedicated molecular graph encoder with multi-path propagation. In its adaptive contrastive pre-training strategy, MIFS generates augmented instances based on the molecular backbone and side chains to ensure structural plausibility, and during the fine-tuning phase, it incorporates additional chemical knowledge through an elemental knowledge graph. MDFCL[136] is similarly inspired by MolCLR; it implements an adaptive augmentation strategy by performing structural operations directly on the molecular main chain and side chains, while also integrating multimodal data. Building on traditional graph contrastive learning, this method constructs four different types of augmented instances and a three-level loss function, thereby thoroughly exploring subtle differences between molecules and optimizing the representation of molecules in chemical space.
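The MolCLR-style random perturbations mentioned above, atom masking and bond deletion, can be sketched as simple operations on atom and bond lists. The mask token and the ratios below are illustrative assumptions; the point is only that each call produces a slightly different "view" of the same molecule to serve as a positive pair.

```python
import random

MASK = "*"   # illustrative mask token

def atom_mask(atoms, ratio, rng):
    """Atom masking: replace a random fraction of atom labels with MASK."""
    atoms = list(atoms)
    k = max(1, int(len(atoms) * ratio))       # at least one atom masked
    for i in rng.sample(range(len(atoms)), k):
        atoms[i] = MASK
    return atoms

def bond_delete(bonds, ratio, rng):
    """Bond deletion: randomly drop a fraction of edges."""
    keep = len(bonds) - max(1, int(len(bonds) * ratio))
    return rng.sample(bonds, max(0, keep))

rng = random.Random(0)                        # fixed seed for reproducibility
view = atom_mask(["C", "C", "O", "N"], 0.25, rng)
pruned = bond_delete([(0, 1), (1, 2), (2, 3)], 0.34, rng)
```

Note that neither operation consults valence rules or functional-group integrity, which is precisely the chemical-plausibility gap that MIFS and MDFCL address with backbone- and side-chain-aware augmentations.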
In contrast to the above-mentioned model training that utilizes only a single type of data (such as structural information), knowledge graphs can provide richer external molecular information, such as functional groups, molecular physicochemical properties, and other critical prior knowledge. Knowledge graphs encompass various entity types (e.g., chemical elements, compounds, drugs, and proteins) and relationships (e.g., chemical reactions between compounds), and their heterogeneity provides strong support for characterizing multi-level connections among entities. Due to the large scale of knowledge graphs and the extensive external information they contain, entity and relation representations are high-dimensional and complex. To address this challenge, current research largely focuses on knowledge graph embedding (KGE) methods, which aim to map entities and relations in knowledge graphs into dense, low-dimensional real-valued vectors, thereby effectively reducing computational complexity while preserving the structural integrity of the knowledge graph. Wang et al.[137] have systematically reviewed existing KGE model approaches, categorizing these methods into distance-based models, semantic matching models, and neural network-based and other novel embedding approaches, and have analyzed the applicability and performance of each type of model in relation modeling capabilities and specific downstream tasks.
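As a concrete example of a distance-based KGE model, the sketch below scores a triple with TransE's criterion: the head embedding translated by the relation embedding should land near the tail embedding. The two-dimensional toy embeddings and the entity names are invented for illustration.

```python
def transe_score(h, r, t):
    """TransE plausibility score: negative L2 distance ||h + r - t||.

    A perfect triple (h + r == t) scores 0, the maximum; corrupted
    triples score lower, which is what margin-based training exploits.
    """
    return -sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)) ** 0.5

# Toy 2-d embeddings for a (compound, relation, target) triple.
h = [1.0, 0.0]                      # head entity
r = [0.0, 1.0]                      # relation as a translation vector
good = transe_score(h, r, [1.0, 1.0])   # true tail: h + r == t
bad = transe_score(h, r, [3.0, 2.0])    # corrupted tail
```

Semantic-matching models (e.g., bilinear scoring) replace the translation-and-distance criterion with a similarity product, but the goal is the same: dense, low-dimensional vectors whose geometry preserves the graph's relational structure.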

4.4 Molecular Multimodal Embedding

In Section 2.3, this paper introduced the basic concepts of multimodal embedding, including the definition of modalities, the complementarity of multimodal data, methods for constructing joint semantic subspaces, and common multimodal representation learning frameworks. Building on this foundation, this section focuses on the application of multimodal embedding in molecular representation learning. The research by Karim et al.[8-9] demonstrates the importance of integrating multiple molecular representations, indicating that fusing molecular features from different perspectives has become a key research direction in molecular property prediction. As shown in Figure 6, the core of multimodal learning lies in how to effectively integrate and leverage heterogeneous modal information. Current research primarily avoids introducing additional noise during fusion by carefully designing learning frameworks and modality interaction strategies.

Fig.6 Multimodal molecular representation embedding

In molecular representation, SMILES records atoms and their connectivity as a character sequence, making it the most common molecular representation in chemical databases and the most readily available data modality for molecular property prediction tasks. Its linear structure naturally aligns with sequence modeling approaches, and it is often introduced as a foundational modality when constructing multimodal learning frameworks. For example, GraSeq[138] combines SMILES with molecular graphs using an LSTM and a GNN to capture both sequential and topological information, thereby enhancing cross-task performance; MTBG[139] integrates SMILES with molecular graphs using a BiGRU (bidirectional gated recurrent unit) and GraphSAGE. These studies treat SMILES as a source of sequential information and jointly model it with structural modalities (molecular graphs) to exploit the complementary nature of chemical information. Given the impact of molecular conformation on properties such as solubility and toxicity, relying solely on SMILES and two-dimensional molecular graphs is limiting for properties closely tied to molecular geometry. To address this issue, Nguyen et al.[140] combine the two-dimensional structure of molecules with multiple three-dimensional conformations, introducing differentiable optimal transport methods such as FGW (Fused Gromov-Wasserstein) barycenters[141] to aggregate geometric and semantic information from different conformations, thereby generating a unified, structure-aware molecular representation.
To obtain a simpler and more efficient multimodal molecular representation architecture, MolMix[10] introduces a Transformer, message-passing neural networks, and equivariant neural networks as encoders for SMILES strings, two-dimensional graph features, and three-dimensional conformers, concatenating the modal features into a unified multimodal sequence that is then fed to a downstream Transformer. While ensuring computational efficiency and scalability, MolMix achieves state-of-the-art results on multiple property prediction tasks in the MoleculeNet[122] and MARCEL[142] datasets, further validating the potential of jointly modeling SMILES with other modalities. In addition, molecular fingerprints derived from SMILES or molecular graphs are frequently used as supplementary modalities or preprocessing features in multimodal frameworks, enhancing the model's sensitivity to differences in molecular structure[143-148]. Yi et al.[149] extract three modal features from drug molecules (one-dimensional fingerprint features, two-dimensional topological structure, and three-dimensional geometric structure) and design a dynamic weighting mechanism based on an energy function to fuse them, thereby addressing the increase in false-positive predictions caused by semantic redundancy and noise across modalities.
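The energy-based dynamic weighting idea can be illustrated generically: assign each modality embedding a scalar "energy" and turn negative energies into softmax fusion weights, so that noisier or outlying modalities contribute less. The energy function below (distance from the mean embedding) is a stand-in for illustration, not the published formulation:

```python
import numpy as np

# Generic sketch of energy-weighted modality fusion. Each modality
# embedding gets a scalar energy (here: distance from the mean of all
# modality embeddings, an illustrative choice), and a softmax over
# negative energies yields the fusion weights.

def fuse(modal_embeddings):
    X = np.stack(modal_embeddings)             # (n_modalities, dim)
    center = X.mean(axis=0)
    energies = np.linalg.norm(X - center, axis=1)
    weights = np.exp(-energies) / np.exp(-energies).sum()
    return weights @ X, weights

fp_1d   = np.array([1.0, 0.0, 0.0])   # 1D fingerprint features
topo_2d = np.array([0.9, 0.1, 0.0])   # 2D topology features
geom_3d = np.array([0.0, 0.0, 5.0])   # noisy 3D features (outlier)
fused, w = fuse([fp_1d, topo_2d, geom_3d])
```

Here the outlying 3D modality receives the smallest weight, which is the intended behavior when one modality is dominated by noise or redundancy.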
Molecular graph representations can accurately reflect molecular structures and are one of the core tools for current molecular modeling tasks. In multimodal learning, molecular graphs are often fused with modalities such as sequences, conformations, and molecular images[150-153]. In fusing graph structures with three-dimensional geometric information, GraphMVP[127] was the first to introduce three-dimensional conformational information into graph self-supervised learning, using contrastive SSL and generative SSL to jointly pre-train on the consistency between two-dimensional topological and three-dimensional geometric views. Stärk et al.[154] also utilized the two-dimensional and three-dimensional graph structures of molecules, proposing an information-maximization method that increases the mutual information between two-dimensional and three-dimensional molecular embeddings. Addressing the challenge of heterogeneous modal fusion, Chen et al.[155] integrated different molecular modal representations using heterogeneous graphs, constructing a unified molecular graph by defining element knowledge graphs and meta-paths. However, it is difficult to effectively bridge inter-modal associations through direct or coarse-grained molecular alignment alone. To address this issue, Chen et al.[156] conduct node-level and graph-level pre-training tasks on both two-dimensional topological and three-dimensional geometric data, enabling cross-modal knowledge sharing from both the node and the graph perspective. To ensure that graph-level representations acquire more comprehensive knowledge, MolGT[156] clusters molecular fingerprints to generate prototype labels, providing prior knowledge that guides graph-level representations to form cluster structures in the feature space.
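Mutual-information maximization between paired 2D and 3D embeddings is commonly implemented with an InfoNCE-style contrastive objective: matched pairs (the two views of the same molecule) are pulled together, mismatched pairs pushed apart. The sketch below is a generic InfoNCE on random vectors, with details that differ from any specific paper:

```python
import numpy as np

# InfoNCE-style contrastive loss between paired 2D and 3D embeddings of
# the same molecules. Row i of z2d and row i of z3d are the two views of
# molecule i; the loss is low when each 2D embedding is most similar to
# its own 3D counterpart. A standard proxy for maximizing mutual
# information between the two views.

def info_nce(z2d, z3d, temperature=0.1):
    z2d = z2d / np.linalg.norm(z2d, axis=1, keepdims=True)
    z3d = z3d / np.linalg.norm(z3d, axis=1, keepdims=True)
    logits = z2d @ z3d.T / temperature           # (n, n) similarity matrix
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # matched pairs on diagonal

rng = np.random.default_rng(0)
aligned = rng.normal(size=(8, 16))
loss_aligned = info_nce(aligned, aligned + 0.01 * rng.normal(size=(8, 16)))
loss_random = info_nce(aligned, rng.normal(size=(8, 16)))
```

Well-aligned views yield a much lower loss than unrelated embeddings, which is exactly the signal used to pull the 2D and 3D encoders into a shared space.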
In addition to structural data, molecular images obtained from electron microscopy, X-ray crystallography, and molecular rendering tools provide intuitive geometric information for molecular modeling. Multimodal molecular representation methods based on two-/three-dimensional topological structures and images are gradually becoming mainstream[9,157]. For example, DLF-MFF[158] employs four deep learning frameworks to extract and fuse features from molecular fingerprints, two-dimensional molecular graphs, three-dimensional molecular graphs, and molecular images, providing a powerful computational tool for COVID-19 drug screening. Addressing the tendency of similar approaches to overlook inter-modal interactions, MMSA[159] fully considers the structure-invariant information shared among molecules, integrating molecular graphs, molecular images, and three-dimensional conformations. By constructing a hypergraph to model higher-order relationships among molecules and combining it with memory mechanisms, the approach aligns invariant structural knowledge.
Large language models (LLMs) have been applied to tasks such as drug design, materials screening, and property prediction[160-164]. Compared with structural data, text can capture information that is difficult to quantify or lacks a standardized representation, such as experimental conditions, synthesis routes, and mechanisms of biological activity. Early work attempted to jointly model two-dimensional molecular representations with text[165-167]. To enhance generalization to unseen instances, Hua et al.[168] used a pre-trained large language model to extract generic textual features of proteins and drugs, innovatively introducing two domain-generalization techniques, domain-adversarial training and contrastive learning, into the classifier, thereby fusing textual and structural features. Polat et al.[169] incorporated textual information from PubChem[170] (such as IUPAC names, molecular formulas, computed physicochemical descriptors, and spectral features) and combined it with molecular graph structures, using a gating fusion mechanism[171] to adaptively integrate textual and geometric features and balance the two types of information in a data-driven manner. MolLM[172] further integrates three-dimensional information with natural language, proposing the first multimodal molecular language model that combines two-dimensional structures, three-dimensional structures, and natural language. The model uses a graph Transformer to encode two- and three-dimensional molecular structures, a text Transformer to process biomedical text, and contrastive learning to align semantic representations across the three modalities. Synthesis routes determine how complex target molecules are synthesized; traditional retrosynthetic methods typically represent synthesis routes as graphs or trees, but are limited by heuristic search, vast combinatorial spaces, and a lack of high-quality structural data.
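The gating fusion mechanism mentioned above can be sketched generically: a sigmoid gate, computed from both inputs, decides per dimension how much to take from the textual feature versus the geometric feature. The gate parameters below are random for illustration; in a real model they are trained end-to-end:

```python
import numpy as np

# Generic gated-fusion sketch: per-dimension sigmoid gate mixing a text
# feature with a graph feature. W is an untrained, randomly initialized
# parameter matrix used only for illustration.

rng = np.random.default_rng(0)
dim = 8
W = rng.normal(scale=0.1, size=(2 * dim, dim))   # gate parameters (untrained)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fuse(text_feat, graph_feat):
    gate = sigmoid(np.concatenate([text_feat, graph_feat]) @ W)
    return gate * text_feat + (1.0 - gate) * graph_feat

text_feat = rng.normal(size=dim)
graph_feat = rng.normal(size=dim)
fused = gated_fuse(text_feat, graph_feat)
```

Because the gate lies in (0, 1), the fused vector is a per-dimension convex combination of the two modalities, letting the model lean on whichever source is more informative for each feature.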
Inspired by natural language processing, researchers have leveraged textual information to enhance models' understanding of reaction mechanisms. RetroInText[173] focuses on the valuable contextual information contained in complete synthesis routes. It uses ChatGPT to generate a textual description of the entire retrosynthetic pathway starting from the name of the target product, and treats this description together with three-dimensional molecular structure information as training input. The method employs a multimodal encoder to extract features from text, molecular graphs, and three-dimensional structures, and uses an attention-based mechanism to fuse this information, enabling context-aware prediction of single-step intermediates and screening of reactants. RetroInText excels particularly on longer synthesis routes; on the USPTO path dataset RetroBench[174], its Top-1 accuracy exceeds that of state-of-the-art methods by 5%. A key challenge in applying large language models to materials and drug design lies in achieving coherent autoregressive generation across text and graph structures. In response, Liu et al.[175] proposed Llamole, the first multimodal large language model that supports interleaved text and graph generation. The model combines A* search with molecular inverse design to enable efficient retrosynthetic pathway planning. Llamole integrates a base LLM with two pre-trained graph modules: the Graph Diffusion Transformer (Graph DiT)[176] for multi-conditional molecular generation, and a GNN for predicting reaction templates; the LLM dynamically activates the corresponding graph module based on the content being generated.
In previous studies, most physicochemical descriptors constructed from expert knowledge and experiments have demonstrated good predictive performance. These descriptors directly quantify global properties such as polarizability, hydrophobicity, and molecular volume, thereby compensating for the tendency of structural modalities to focus on local topology. Following the idea of integrating structural embeddings with physicochemical descriptors, researchers have conducted numerous studies. For general property prediction, DNN-PP[177] combines molecular structure embeddings from graph attention mechanisms with molecular descriptors processed by deep neural networks (DNNs), representing molecular features from both structural and physicochemical perspectives to enhance predictive power. To address representation conflicts and training imbalance in molecular property prediction, He et al.[178] proposed a self-supervised contrastive learning framework based on descriptors and molecular graphs, introducing a multi-branch predictor structure in the supervised learning phase to achieve information fusion and training balance. For toxicity prediction, TopTox[179] integrates physicochemical descriptors and fingerprint features into deep neural networks and consensus models for regression-based toxicity prediction; Karim et al.[8] enhance overall toxicity prediction performance by fusing SMILES, molecular images, and physicochemical features extracted with the PADEL software[180]. The blood-brain barrier (BBB) permeability of compounds, which is difficult to express concisely, is an important consideration in the development of central nervous system drugs, and researchers have turned to computational methods to predict it more efficiently.
For example, Ding et al.[181] introduced a relational graph convolutional network (RGCN)[182], combining Mordred descriptors[183] and drug-protein interaction data to construct a heterogeneous graph, providing high-confidence predictions of drug molecule BBB permeability. Deep-B3[84] likewise predicts the BBB permeability of candidate compounds. The model integrates tabular data (molecular descriptors, MACCS fingerprints, Morgan fingerprints), text (SMILES representations), and molecular images, and uses pre-trained models to extract latent features from each modality separately. DeePred-BBB[184] encodes compounds into 1,917 features, including 1,444 physicochemical attributes (such as molecular weight, molecular volume, solubility, and partition coefficient), 166 MACCS fingerprints, and 307 substructure fingerprints, and selects the best-performing DNN from multiple candidate models to build a BBB permeability prediction tool.
Spectroscopic techniques are crucial tools for substance discovery and structural identification. Given the computational cost of quantum chemical methods, existing research has integrated deep learning models to accelerate molecular spectral prediction[185]. Researchers use spectroscopy to link microscopic properties with macroscopic observables, and different spectroscopic techniques reveal molecular information from distinct perspectives: infrared (IR) spectroscopy probes changes in molecular vibrational dipoles, while Raman spectroscopy probes changes in molecular polarizability. To this end, Alberts et al.[186] constructed a multispectral dataset containing 790,000 molecules to support multimodal machine learning. Additionally, Guo et al.[187] developed MolPuzzle, a multi-step spectroscopic reasoning benchmark that infers molecular structures from various types of spectral data, highlighting the importance of spectral data in complex chemical reasoning. At the level of complementary physical mechanisms, Guo et al.[188] integrated three spectra governed by different physical mechanisms (nuclear magnetic resonance (NMR), IR, and Raman spectroscopy), which respectively provide information on molecular structure and chemical bonds. The study established a quantitative relationship between spectral signals and chemical bond properties, and the fine-tuned multispectral model demonstrated excellent transferability across datasets from different sources. Yang et al.[189] proposed an encoder-decoder machine learning framework based on multimodal descriptors, achieving information synchronization and mutual conversion among structural, IR, and Raman spectroscopy descriptors through pre-training.
The model employs a masking strategy that randomly replaces internal coordinates of the molecular structure and segments of the spectral data with Gaussian noise, processes the three modalities in parallel, and reconstructs the complete descriptors via neural networks. The hidden tensors of the different modalities are summed and normalized to achieve feature alignment and fusion; as shown in Figure 7, the cosine similarity for predicting a third modal descriptor from any two modalities exceeds 0.93, significantly enhancing the ability to predict molecular properties under incomplete data conditions. By aligning or fusing spectral data across modalities, these studies achieve multimodal representations with clear advantages in property prediction and molecular identification tasks, confirming the complementarity of different types of spectra. Determining molecular structure from spectral data is a fundamental task in chemistry and is of major importance to several frontier research fields, including drug discovery and materials science. Chacko et al.[190] combined 13C and 1H NMR data with IR spectroscopy for molecular structure elucidation, using SELFIES symbols to embed IR and NMR spectral data into representations that can be converted into molecular structures. The method uses the LLM2Vec model[191] to process NMR spectral text and a vision model to identify relevant functional group peaks in IR spectra (with an F1 score of 91%), achieving an overall test accuracy of 93% during inference without relying on databases. When machine learning methods are applied to automate molecular structure elucidation, the lack of confidence metrics often limits their practical application.
To address this issue, Mirza et al.[192] proposed the spec2struct framework, which not only extracts molecular structures from spectral data but also provides chemists with relevant background information and confidence assessments. The method combines multimodal embeddings, contrastive learning, and evolutionary algorithms to emulate the approach chemical experts take in structure determination. By aligning encoders for multiple spectroscopic techniques with molecular representations, the system can analyze several types of spectral evidence simultaneously, helping to avoid erroneous structure assignments and successfully identifying structures that had been incorrectly assigned in the literature. The deep reinforcement learning method DeepSPInN[11] formulates the prediction of molecular structures from given IR and 13C NMR spectra as a Markov decision process, thereby enabling automated molecular spectral analysis. Rocabert-Oriols et al.[193] proposed VibraCLIP, which applies cross-modal contrastive learning to vibrational spectroscopy, enabling direct extraction of molecular structures from spectral data. Based on the CLIP architecture[194], VibraCLIP aligns IR spectra, Raman spectra, and molecular graph embeddings in a shared representation space. When only IR and Raman spectra are aligned, the Top-1 retrieval accuracy increases from 12.4% to 62.9%; after incorporating anchor features (standardized molecular mass), the Top-25 accuracy further improves to 98.9%. By leveraging multiple vibrational spectra simultaneously, the model achieves molecular recognition far exceeding single-modal baselines. These multispectral studies systematically validate the potential of spectrum fusion from the perspectives of dataset construction, complementary physical mechanisms, and molecular structure elucidation.
In terms of pre-training, Wang et al.[195] proposed MolSpectra, a pre-training framework that enhances three-dimensional molecular representations with spectral energy information. The method designs a Transformer-based multispectral encoder that jointly considers multiple molecular spectra (such as UV-vis, IR, and Raman spectroscopy) and captures peak correlations within and across spectra through a masked patch reconstruction task. By introducing a contrastive loss, the spectral features and the intrinsic knowledge they encode are transferred into the three-dimensional representations, enabling the framework to outperform existing three-dimensional pre-training methods on multiple molecular property prediction tasks.
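The masked patch reconstruction pretext task can be sketched for a 1D spectrum: split the signal into fixed-length patches, hide a random subset, and keep the originals as reconstruction targets. Only this data-preparation step is shown; the patch length, mask ratio, and zero-valued mask token are illustrative choices, not MolSpectra's actual settings:

```python
import numpy as np

# Sketch of the masked-patch pretext setup on a toy 1D spectrum: split
# into fixed-length patches, zero out a random subset, and return the
# masked originals as reconstruction targets. No model is trained here.

def mask_patches(spectrum, patch_len=8, mask_ratio=0.25, seed=0):
    rng = np.random.default_rng(seed)
    n_patches = len(spectrum) // patch_len
    patches = spectrum[: n_patches * patch_len].reshape(n_patches, patch_len)
    n_mask = max(1, int(n_patches * mask_ratio))
    masked_idx = rng.choice(n_patches, size=n_mask, replace=False)
    corrupted = patches.copy()
    corrupted[masked_idx] = 0.0                  # mask token = zeros
    return corrupted, patches[masked_idx], masked_idx

spectrum = np.sin(np.linspace(0, 10, 64)) + 1.5  # toy IR-like signal
corrupted, targets, idx = mask_patches(spectrum)
```

An encoder trained to reconstruct `targets` from `corrupted` must learn the correlations between neighboring peaks, which is the knowledge later transferred into the 3D representations.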

Fig.7 Bimodal collaborative prediction of the third modality

5 Conclusion and Outlook

5.1 Current Status and Key Technologies

Data-driven molecular representation learning has become a core technology in fields such as molecular modeling and drug design. Compared with traditional molecular descriptors constructed from physical and chemical principles, data-driven methods autonomously learn molecular representations in a latent space from large-scale data, capturing the complex relationships between structure and properties through nonlinear mappings and thereby demonstrating stronger generalization in property prediction tasks. Although data-driven molecular embeddings already outperform traditional molecular descriptors, their full potential remains untapped. The core challenge now lies in extracting general representations with clear chemical meaning from multimodal and complex molecular data, including molecular graph structures, 3D conformations, dynamic trajectories, spectral data, and quantum chemical features, in order to uncover the deep relationships between molecular structure and properties. Based on current research progress, future machine learning research on molecular representation can focus on the following three aspects.
(1) As data scales continue to expand, self-supervised learning offers new opportunities for molecular representation learning. By learning general molecular representations from vast amounts of unlabeled data, self-supervised learning significantly reduces the dependence of molecular modeling on labeled data. Most models focus on topological information in molecular graphs; identifying subgraphs that contribute to prediction tasks across multiple contexts may enable models to extract more stable features. While existing methods incorporate basic chemical constraints when generating augmented instances, they still lack sufficient sensitivity to subtle structural changes in molecules. Developing augmentation strategies that capture fine-grained differences between molecules is therefore an important direction for improving self-supervised training, and it calls for more refined contrastive learning approaches that sharpen the model's ability to distinguish subtle inter-molecular differences. Current self-supervised methods already capture multi-level intra-molecular associations through pre-training tasks such as molecular property prediction, molecular structure reconstruction, and local subgraph prediction. However, given the limited and uneven distribution of atomic species in nature, existing masking strategies struggle to fully capture the true chemical semantics among elements in compositional modeling, underscoring the need to design pre-training tasks more closely aligned with the fundamental principles of chemistry.
(2) Single molecular representations can no longer adequately meet the demands of complex predictive modeling, making the integration of multiple molecular representation methods a current research hotspot. Various molecular representation formats (SMILES sequences, molecular graphs, molecular images, spectra, expert features, etc.) provide complementary informational dimensions for models. Researchers employ different methods to achieve semantic alignment of heterogeneous data, thereby integrating the advantages of each modality. Research on multimodal molecular representations is continuously advancing through the use of ensemble learning, contrastive learning, and the incorporation of more expressive embedding methods. Due to the semantic gap between modalities, subsequent research must incorporate domain knowledge (such as reaction mechanisms) to guide cross-modal alignment. Moreover, during dynamic information fusion across different modalities, there is competition among weights, which can lead to fluctuations in evaluation metrics during training. Therefore, there is an urgent need to design more stable dynamic weight fusion strategies to enhance cross-scenario robustness.
(3) As the mainstream architecture for molecular embedding learning and property prediction, graph neural networks typically rely on static topological structures, making it difficult to model dynamic molecular structures and three-dimensional geometric information. Most existing graph convolutional neural networks explicitly assume that the input graph is static; however, molecules are not static structures, as their atomic positions, bond lengths, bond angles, and electron distributions change over time. To address this limitation, dynamic graph representation learning represents molecular graphs as graph structures whose nodes and edges evolve continuously over time, exploring time-series-based graph neural network models to simulate the complex dynamic processes of molecules. Therefore, combining dynamic molecular graphs with efficient dynamic graph convolutional neural network models has become an important direction for breaking through the bottleneck of static modeling.

5.2 Future Research Prospects

Data-driven molecular representation learning has transformed researchers' understanding of the relationship between molecular structure and properties. It outperforms traditional methods in machine learning models and provides a new research paradigm for downstream applications such as molecular design and drug discovery. These methods overcome the limitations of traditional molecular descriptors, which rely heavily on expert knowledge and lack robust automated learning capabilities. By using embedding "languages" to bridge chemical semantics with latent space features, they open new pathways for enhancing the efficiency of scientific exploration. Moreover, data-driven molecular embedding methods provide critical technical support for related fields such as catalyst design, functional material prediction, and environmental chemical analysis, strongly promoting interdisciplinary integration and development. In light of this, the following recommendations are proposed for future research in this area.
(1) Promote the construction of a shared molecular latent space and develop a unified framework for molecular embedding methods. Currently, the molecular embedding methods employed by different research teams vary in terms of molecular representation formats and data sources. These differences make it challenging for the latent spaces modeled by each team to interoperate, hindering the effective integration of knowledge from multiple sources. Therefore, while safeguarding data privacy and security, efforts should be made to establish an open and shared platform for aligning molecular latent spaces. This initiative can fully leverage data resources distributed across various domains, enhance the generalization capabilities and universal applicability of molecular representation models, strengthen the scalability of cross-domain applications, and provide more robust theoretical support for molecular design and optimization.
(2) Build a high-quality molecular data infrastructure to advance the development of standardized data systems. Data-driven molecular representation learning relies on large-scale, high-quality datasets; however, research in this field is still rapidly evolving, with differences among datasets, model architectures, and evaluation metrics used by various research teams. These discrepancies significantly hinder the comparability and reproducibility of models. Therefore, there is an urgent need to establish a robust data-sharing mechanism and to develop unified data standards and evaluation metrics. As an essential foundation for the application of machine learning in this field, the development of molecular representations is highly dependent on the completeness, reliability, and standardization of data. Only by strengthening the data foundation can breakthroughs be achieved in theoretical research, thereby enabling rapid translation into new materials, new drugs, and other applications.
(3) Strengthen interdisciplinary integration and promote coordinated development from computational prediction to experimental synthesis. Molecular representation learning involves multiple disciplines, including artificial intelligence, computational chemistry, materials science, and biomedicine. Encouraging deep disciplinary integration and supporting multidisciplinary research teams to collaboratively tackle challenges will help develop molecular embedding methods that meet the needs of different industries and have greater practical value. This, in turn, will enhance their application value in strategic industries such as drug discovery and green energy, accelerating the translation of research findings into real-world applications.
[1]
Afzal M A F, Hachmann J. Handbook on Big Data and Machine Learning in the Physical Sciences. Singapore: World Scientific, 2020. 1.

[2]
Xu D G, Zhang Q, Huo X Y, Wang Y T, Yang M L. Mater. Genome Eng. Adv., 2023, 1: e11.

[3]
Isayev O, Fourches D, Muratov E, Oses C, Rasch K, Tropsha A, Curtarolo S. Bull. Am. Phys. Soc., 2014, 39799817.

[4]
Ramprasad R, Batra R, Pilania G, Mannodi-Kanakkithodi A, Kim C. NPJ Comput. Mater., 2017, 3: 54.

[5]
Mikolov T, Sutskever I, Chen K, Corrado G, Dean J. In Neural Information Processing Systems. Nevada: NeurIPS, 2013. 16447573.

[6]
Mikolov T, Chen K, Corrado G, Dean J. Efficient Estimation of Word Representations in Vector Space. [2013-09-07]. https://doi.org/10.48550/arXiv.1301.3781.

[7]
Yaghoobi M, Alaei M. Comput. Mater. Sci., 2022, 207: 111284.

[8]
Karim A, Singh J, Mishra A, Dehzangi A, Newton M A H, Sattar A. Lecture Notes in Computer Science. Eds.: Ohara K, Bai Q. Cham: Springer, 2019. 11669: 142.

[9]
Karim A, Riahi V, Mishra A, Hakim Newton M A, Dehzangi A, Balle T, Sattar A. ACS Omega, 2021, 6(18): 12306.

[10]
Manolache A, Tantaru D, Niepert M. MolMix: A Simple Yet Effective Baseline for Multimodal Molecular Representation Learning. [2024-10-24]. https://doi.org/10.48550/arXiv.2410.07981.

[11]
Devata S, Sridharan B, Mehta S, Pathak Y, Laghuvarapu S, Varma G, Priyakumar U D. Digit. Discov., 2024, 3(4): 818.

[12]
Huang E, Yang J S, Liao K Y K, Tseng W C W, Lee C K, Gill M, Compas C B, See S, Tsai F J. Sci. Rep., 2024, 271087139.

[13]
Bengio Y, Ducharme R, Vincent P. In Neural Information Processing Systems. Colorado: NeurIPS, 2000, 13.

[14]
Pennington J, Socher R, Manning C. 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Stroudsburg: ACL, 2014. 1532.

[15]
Bojanowski P, Grave E, Joulin A, Mikolov T. Trans. Assoc. Comput. Linguist., 2017, 5: 135.

[16]
Deerwester S, Dumais S T, Furnas G W, Landauer T K, Harshman R. J. Am. Soc. Inf. Sci., 1990, 41(6): 391.

[17]
Blei D M, Ng A Y, Jordan M I. J. Mach. Learn. Res. 2003, 3(1): 993.

[18]
Peters M E, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L. Deep Contextualized Word Representations. [2018-02-15]. https://doi.org/10.48550/arXiv.1802.05365.

[19]
Devlin J, Chang M W, Lee K, Toutanova K. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minnesota: ACL, 2019. 4171.

[20]
Belkin M, Niyogi P. Neural Comput., 2003, 15(6): 1373.

[21]
Ezzat A, Wu M, Li X L, Kwoh C. Methods, 2017, 129: 81.

[22]
Perozzi B, Al-Rfou R, Skiena S. 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2014. 701.

[23]
Grover A, Leskovec J. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. California: ACM, 2016, 855.

[24]
Hamilton W, Ying Z, Leskovec J. Neural Information Processing Systems. California: NeurIPS, 2017. 30.

[25]
Kipf T N, Welling M. Semi-Supervised Classification with Graph Convolutional Networks. [2017-02-22]. https://doi.org/10.48550/arXiv.1609.02907.

[26]
Peng Y, Qi J. Multimedia Comput. Commun. Appl., 2019, 15(1): 1.

[27]
Rasiwasia N, Costa Pereira J, Coviello E, Doyle G, Lanckriet G R G, Levy R, Vasconcelos N. 18th ACM International Conference on Multimedia. Firenze: ACM, 2010. 251.

[28]
Guo W Z, Wang J W, Wang S P. IEEE Access, 2019, 7: 63373.

[29]
Rupp M, Tkatchenko A, Müller K, von Lilienfeld O A. Phys. Rev. Lett., 2012, 108(5): 058301.

[30]
Ward L, Liu R Q, Krishna A, Hegde V I, Agrawal A, Choudhary A, Wolverton C. Phys. Rev. B, 2017, 96(2): 024104.

[31]
Liu K, Sun X, Jia L, Ma J, Xing H, Wu J, Gao H, Sun Y, Boulnois F, Fan J. Int. J. Mol. Sci., 2019, 20(14): 3389.

[32]
Kearnes S, McCloskey K, Berndl M, Pande V, Riley P. J. Comput. Aided Mol. Des., 2016, 30(8): 595.

[33]
Coley C W, Barzilay R, Green W, Jaakkola T, Jensen K. J. Chem. Inf. Model., 2017, 57(8): 1757.

[34]
Yang K, Swanson K, Jin W G, Coley C, Eiden P, Gao H, Guzman-Perez A, Hopper T, Kelley B, Mathea M, Palmer A, Settels V, Jaakkola T, Jensen K, Barzilay R. J. Chem. Inf. Model., 2019, 59(8): 3370.

[35]
Xie T, Grossman J. Phys. Rev. Lett., 2018, 120(14): 145301.

[36]
Ramakrishnan R, Dral P O, Rupp M, von Lilienfeld O A. Sci. Data, 2014, 1: 140022.

[37]
Chen C, Ye W K, Zuo Y X, Zheng C, Ong S. Chem. Mater., 2019, 31(9): 3564.

[38]
Venugopal V, Olivetti E. Sci. Data, 2024, 11(1): 217.

[39]
Tshitoyan V, Dagdelen J, Weston L, Dunn A, Rong Z Q, Kononova O, Persson K A, Ceder G, Jain A. Nature, 2019, 571(7763): 95.

[40]
Consonni V, Ballabio D, Todeschini R. Cheminformatics, QSAR and Machine Learning Applications for Novel Drug Development. New York: Academic Press, 2023. 303.

[41]
Consonni V, Todeschini R. Statistical Modelling of Molecular Descriptors in QSAR/QSPR. Dehmer M, Varmuza K, Bonchev D (Eds.). New Jersey: Wiley, 2012. 111.

[42]
Amar Y, Schweidtmann A M, Deutsch P, Cao L W, Lapkin A. Chem. Sci., 2019, 10(27): 6697.

[43]
Durant J L, Leland B A, Henry D R, Nourse J G. J. Chem. Inf. Comput. Sci., 2002, 42(6): 1273.

[44]
Rogers D, Hahn M. J. Chem. Inf. Model., 2010, 50(5): 742.

[45]
Vidal D, Thormann M, Pons M. J. Chem. Inf. Model., 2005, 45(2): 386.

[46]
Schwartz J, Awale M, Reymond J. J. Chem. Inf. Model., 2013, 53(8): 1979.

[47]
Bender A, Jenkins J L, Glick M, Deng Z, Nettles J H, Davies J W. J. Chem. Inf. Model., 2006, 46(6): 2445.

[48]
Nidhi, Glick M, Davies J, Jenkins J. J. Chem. Inf. Model., 2006, 46(3): 1124.

[49]
Laufkötter O, Sturm N, Bajorath J, Chen H M, Engkvist O. J. Cheminf., 2019, 11(1): 54.

[50]
David L, Thakkar A, Mercado R, Engkvist O. J. Cheminf., 2020, 12(1): 56.

[51]
Yu L, Sun L L, Du B W, Lv W F. Adv. Neural Inform. Process. Syst., 2023, 36: 67686.

[52]
Yuan H N, Sun Q Y, Fu X C, Zhang Z W, Ji C, Peng H, Li J X. Neural Information Processing Systems. New York: Curran Associates, 2024. 36.

[53]
Faber F A, Hutchison L, Huang B, Gilmer J, Schoenholz S S, Dahl G E, Vinyals O, Kearnes S, Riley P F, von Lilienfeld O A. J. Chem. Theory Comput., 2017, 13(11): 5255.

[54]
Choudhary K, Garrity K, Ghimire N, Anand N, Tavazza F. Phys. Rev. B, 2021, 103(15): 155131.

[55]
Choudhary K, Garrity K F, Tavazza F. J. Phys. Condens. Matter, 2020, 32(47): 475501.

[56]
Liu C H, Tao Y Z, Hsu D, Du Q, Billinge S. Acta Crystallogr. A: Found. Adv., 2019, 75(4): 633.

[57]
Xu K, Hu W, Leskovec J, Jegelka S. How Powerful Are Graph Neural Networks? [2018-10-01]. https://doi.org/10.48550/arXiv.1810.00826.

[58]
Battaglia P, Hamrick J B, Bapst V, Sanchez-Gonzalez A, Zambaldi V, Malinowski M, Tacchetti A, Raposo D, Santoro A, Faulkner R, Gülçehre Ç, Song H F, Ballard A J, Gilmer J, Dahl G E, Vaswani A, Allen K R, Nash C, Langston V, Dyer C, Heess N, Wierstra D, Kohli P, Botvinick M, Vinyals O, Li Y J, Pascanu R. Relational Inductive Biases, Deep Learning, and Graph Networks. [2018-06-04]. https://doi.org/10.48550/arXiv.1806.01261.

[59]
Gasteiger J, Giri S, Margraf J T, Günnemann S. Fast and Uncertainty-Aware Directional Message Passing for Non-Equilibrium Molecules. [2022-05-05]. https://doi.org/10.48550/arXiv.2011.14115.

[60]
Flam-Shepherd D, Wu T C, Friederich P, Aspuru-Guzik A. Mach. Learn. Sci. Technol., 2021, 2(4): 045009.

[61]
Gasteiger J, Becker F, Günnemann S. Neural Information Processing Systems. New York: Curran Associates, 2021. 34: 6790.

[62]
Schütt K T, Sauceda H E, Kindermans P J, Tkatchenko A, Müller K R. J. Chem. Phys., 2018, 148(24): 241722.

[63]
Unke O T, Meuwly M. J. Chem. Theory Comput., 2019, 15(6): 3678.

[64]
Gasteiger J, Groß J, Günnemann S. Directional Message Passing for Molecular Graphs. [2022-04-05]. https://doi.org/10.48550/arXiv.2003.03123.

[65]
Chen Z H, You Z H, Guo Z H, Yi H C, Luo G X, Wang Y B. Front. Bioeng. Biotechnol., 2020, 8: 338.

[66]
Jo J, Baek J, Lee S, Kim D, Kang M, Hwang S J. Neural Information Processing Systems. New York: Curran Associates, 2021. 34: 7534.

[67]
Gilmer J, Schoenholz S S, Riley P F, Vinyals O, Dahl G E. International Conference on Machine Learning. Sydney: PMLR, 2017. 1263.

[68]
Dwivedi V P, Joshi C K, Luu A T, Laurent T, Bengio Y, Bresson X. J. Mach. Learn. Res., 2023, 24(43): 1.

[69]
Choudhary K, DeCost B L. NPJ Comput. Mater., 2021, 7(1): 185.

[70]
Liao Y L, Smidt T, Shuaibi M, Da A. Generalizing Denoising to Non-Equilibrium Structures Improves Equivariant Force Fields. [2024-12-19]. https://doi.org/10.48550/arXiv.2403.09549.

[71]
Barroso-Luque L, Shuaibi M, Fu X, Wood B M, Dzamba M, Gao M, Rizvi A, Zitnick C L, Ulissi Z W. Open Materials 2024 (OMat24) Inorganic Materials Dataset and Models. [2024-10-16]. https://arxiv.org/abs/2410.12771

[72]
Jaeger S, Fulle S, Turk S. J. Chem. Inf. Model., 2018, 58(1): 27.

[73]
Mann V, Brito K, Gani R, Venkatasubramanian V. Fluid Phase Equilib., 2022, 561: 113531.

[74]
Bechhofer S, Harmelen F V, Hendler J, Horrocks I, McGuinness D L, Patel-Schneider P, Stein L. OWL Web Ontology Language Reference. [2024-02-10]. http://www.w3.org/TR/2004/rec-owl-ref-20040210/

[75]
Chen J Y, Hu P, Jimenez-Ruiz E, Holter O M, Antonyrajah D, Horrocks I. Mach. Learn., 2021, 110(7): 1813.

[76]
Goh G B, Hodas N O, Siegel C, Vishnu A. SMILES2Vec: An Interpretable General-Purpose Deep Neural Network for Predicting Chemical Properties. [2018-03-18]. https://arxiv.org/abs/1712.02034

[77]
Jeon W, Kim D. Bioinformatics, 2019, 35(23): 4979.

[78]
Zang X, Zhao X B, Tang B Z. Commun. Chem., 2023, 6(1): 34.

[79]
Wang S, Guo Y Z, Wang Y H, Sun H M, Huang J Z. 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. New York: ACM, 2019. 429.

[80]
Guo Z J, Zhang Y, Lu W. Attention Guided Graph Convolutional Networks for Relation Extraction. [2019-08-02]. https://www.aclweb.org/anthology/P19-1024/

[81]
Qin L, Dong G C, Peng J. 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). Seoul: IEEE, 2020. 708.

[82]
Li J, Jiang X. Wirel. Commun. Mob. Comput., 2021, 2021(1): 7181815.

[83]
Liu Y W, Zhang R S, Li T F, Jiang J, Ma J, Wang P. J. Mol. Graph. Model., 2023, 118: 108344.

[84]
Tang Q, Nie F L, Zhao Q, Chen W. Brief. Bioinform., 2022, 23(5): bbac357.

[85]
Chithrananda S, Grand G, Ramsundar B. ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. [2020-10-19]. https://arxiv.org/abs/2010.09885

[86]
Ghaayathri Devi K, Bedadhala R S, Sachin Kumar S, Soman K P, Bodapati J D. 2024 4th International Conference on Intelligent Technologies (CONIT). Bangalore: IEEE, 2024. 1.

[87]
Ross J, Belgodere B M, Chenthamarakshan V, Padhi I, Mroueh Y, Das P. Nat. Mach. Intell., 2022, 4(12): 1256.

[88]
Wu F, Radev D, Li S Z. Proc. AAAI Conf. Artif. Intell., 2023, 37(4): 5312.

[89]
Wan F P, Zeng J Y. Deep Learning with Feature Embedding for Compound-Protein Interaction Prediction. [2016-11-07]. https://www.biorxiv.org/content/10.1101/086033v1

[90]
Olivecrona M, Blaschke T, Engkvist O, Chen H M. J. Cheminf., 2017, 9: 48.

[91]
Schneider N, Fechner N, Landrum G, Stiefl N. J. Chem. Inf. Model., 2017, 57(8): 1816.

[92]
Mann V, Venkatasubramanian V. AIChE J., 2021, 67(3): e17190.

[93]
Hinton G E, Salakhutdinov R R. Science, 2006, 313(5786): 504.

[94]
Gómez-Bombarelli R, Wei J N, Duvenaud D, Hernández-Lobato J M, Sánchez-Lengeling B, Sheberla D, Aguilera-Iparraguirre J, Hirzel T D, Adams R P, Aspuru-Guzik A. ACS Cent. Sci., 2018, 4(2): 268.

[95]
Winter R, Montanari F, Noé F, Clevert D A. Chem. Sci., 2019, 10(6): 1692.

[96]
Popova M, Isayev O, Tropsha A. Sci. Adv., 2018, 4(7): eaap7885.

[97]
Segler M H S, Kogej T, Tyrchan C, Waller M. ACS Cent. Sci., 2018, 4(1): 120.

[98]
Xu Z, Wang S, Zhu F Y, Huang J Z. 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. Boston: ACM, 2017. 285.

[99]
Hou Y Y, Wang S Y, Bai B, Chan H C S, Yuan S G. Molecules, 2022, 27(5): 1668.

[100]
Lv Q J, Chen G X, Zhao L, Zhong W H, Chen C Y. Brief. Bioinform., 2021, 22(6): bbab317.

[101]
Bagal V, Aggarwal R, Vinod P K, Priyakumar U D. J. Chem. Inf. Model., 2022, 62(9): 2064.

[102]
Radford A, Narasimhan K, Salimans T, Sutskever I. Improving Language Understanding by Generative Pre-Training. [2018-06-11]. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

[103]
Jin W G, Barzilay R, Jaakkola T. International Conference on Machine Learning. Sweden: PMLR, 2018. 2323.

[104]
Jin W G, Barzilay R, Jaakkola T. International Conference on Machine Learning. Vienna: PMLR, 2020. 4839.

[105]
Gebauer N, Gastegger M, Schütt K T. Neural Information Processing Systems. Vancouver: Curran Associates, 2019. 32.

[106]
Shi C, Xu M, Zhu Z, Zhang W, Zhang M, Tang J. GraphAF: A Flow-Based Autoregressive Model for Molecular Graph Generation. [2020-02-27]. https://doi.org/10.48550/arXiv.2001.09382.

[107]
Zang C X, Wang F. 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. California: ACM, 2020. 617.

[108]
Kingma D P, Dhariwal P. Neural Information Processing Systems. Montreal: Curran Associates, 2018. 31.

[109]
Peng X, Guan J, Liu Q, Ma J. MolDiff: Addressing the Atom-Bond Inconsistency Problem in 3D Molecule Diffusion Generation. [2023-05-11]. https://doi.org/10.48550/arXiv.2305.07508.

[110]
Oestreich M, Merdivan E, Lee M, Schultze J L, Piraud M, Becker M. J. Cheminf., 2025, 17: 23.

[111]
Krenn M, Häse F, Nigam A, Friederich P, Aspuru-Guzik A. Mach. Learn. Sci. Technol., 2020, 1(4): 045024.

[112]
Eckmann P, Sun K, Zhao B, Feng M, Gilson M K, Yu R. International Conference on Machine Learning, ICML 2022. Baltimore: PMLR, 2022. 162: 5777.

[113]
Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans: IEEE, 2022. 10674.

[114]
Luo Y, Yan K, Ji S. International Conference on Machine Learning. PMLR, 2021. 7192.

[115]
Imrie F, Bradley A R, van der Schaar M, Deane C M. J. Chem. Inf. Model., 2020, 60(4): 1983.

[116]
Liu M, Yan K, Oztekin B, Ji S. GraphEBM: Molecular Graph Generation with Energy-Based Models. [2021-04-11]. https://doi.org/10.48550/arXiv.2102.00546.

[117]
Welling M, Teh Y W. 28th International Conference on Machine Learning, ICML-11. Bellevue: Citeseer, 2011. 681.

[118]
Xu M, Yu L, Song Y, Shi C, Ermon S, Tang J. GeoDiff: A Geometric Diffusion Model for Molecular Conformation Generation. [2022-05-06]. https://doi.org/10.48550/arXiv.2203.02923.

[119]
Sohl-Dickstein J, Weiss E, Maheswaranathan N, Ganguli S. International Conference on Machine Learning. Lille: PMLR, 2015. 2256.

[120]
Liu M, Luo Y, Uchino K, Maruhashi K, Ji S. Generating 3D Molecules for Target Protein Binding. [2022-05-30]. https://doi.org/10.48550/arXiv.2204.09410

[121]
Liu Y. RoBERTa: A Robustly Optimized BERT Pretraining Approach. [2019-07-26]. https://arxiv.org/abs/1907.11692

[122]
Wu Z Q, Ramsundar B, Feinberg E N, Gomes J, Geniesse C, Pappu A S, Leswing K, Pande V. Chem. Sci., 2018, 9(2): 513.

[123]
Su J L, Ahmed M, Lu Y, Pan S F, Bo W, Liu Y F. Neurocomputing, 2024, 568: 127063.

[124]
Ishida S, Miyazaki T, Sugaya Y, Omachi S. Molecules, 2021, 26(11): 3125.

[125]
Qiu J Z, Chen Q B, Dong Y X, Zhang J, Yang H X, Ding M, Wang K S, Tang J. 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Virtual Event: ACM, 2020. 1150.

[126]
Xu M, Wang H, Ni B, Guo H, Tang J. International Conference on Machine Learning. PMLR, 2021. 11548.

[127]
Liu S, Wang H, Liu W, Lasenby J, Guo H, Tang J. Pre-Training Molecular Graph Representation with 3D Geometry. [2022-05-29]. https://doi.org/10.48550/arXiv.2110.07728.

[128]
Veličković P, Fedus W, Hamilton W L, Liò P, Bengio Y, Hjelm R D. Deep Graph Infomax. [2018-12-21]. https://doi.org/10.48550/arXiv.1809.10341.

[129]
You Y, Chen T, Sui Y, Chen T, Wang Z, Shen Y. Neural Information Processing Systems. Curran Associates, 2020. 33: 5812.

[130]
Zhang Z X, Liu Q, Wang H, Lu C Q, Lee C K. Neural Information Processing Systems. Curran Associates, 2021. 34: 15870.

[131]
Zhang S, Hu Z, Subramonian A, Sun Y. Motif-Driven Contrastive Learning of Graph Representations. [2021-03-15]. https://doi.org/10.48550/arXiv.2012.12533.

[132]
Wang Y, Wang J, Cao Z, Barati Farimani A. Nat. Mach. Intell., 2022, 4(3): 279.

[133]
Wang Y Y, Magar R, Liang C, Farimani A. J. Chem. Inf. Model., 2022, 62(11): 2713.

[134]
Wang H, Li W, Jin X, Cho K, Ji H, Han J, Burke M D. Chemical-Reaction-Aware Molecule Representation Learning. [2021-10-12]. https://doi.org/10.48550/arXiv.2109.09888.

[135]
Gong X, Liu Q, Han R, Guo Y K, Wang G Y. Neural Netw., 2025, 184: 107088.

[136]
Gong X, Liu M T, Liu Q, Guo Y K, Wang G Y. Pattern Recognit., 2025, 163: 111463.

[137]
Wang C H, Yang Y Q, Song J S, Nan X F. J. Chem. Inf. Model., 2024, 64(19): 7189.

[138]
Guo Z C, Yu W H, Zhang C X, Jiang M, Chawla N V. 29th ACM International Conference on Information & Knowledge Management. Virtual Event Ireland: ACM, 2020. 435.

[139]
Liu J P, Lei X J, Zhang Y C, Pan Y. Comput. Biol. Med., 2023, 153: 106524.

[140]
Nguyen D M H, Lukashina N, Nguyen T, Le A T, Nguyen T, Ho N, Peters J, Sonntag D, Zaverkin V, Niepert M. Structure-Aware E(3)-Invariant Molecular Conformer Aggregation Networks. [2024-08-19]. https://doi.org/10.48550/arXiv.2402.01975.

[141]
Vayer T, Courty N, Tavenard R, Chapel L, Flamary R. International Conference on Machine Learning. California: PMLR, 2019. 6275.

[142]
Zhu Y, Hwang J, Adams K, Liu Z, Nan B, Stenfors B, Du Y, Chauhan J, Wiest O, Isayev O, Coley C W, Sun Y, Wang W. Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks. [2024-07-28]. https://doi.org/10.48550/arXiv.2310.00115.

[143]
Wang X F, Li Z, Jiang M J, Wang S, Zhang S G, Wei Z Q. J. Chem. Inf. Model., 2019, 59(9): 3817.

[144]
Cai H X, Zhang H M, Zhao D C, Wu J X, Wang L. Brief. Bioinform., 2022, 23(6): bbac408.

[145]
Wang T Y, Sun J Q, Zhao Q. Comput. Biol. Med., 2023, 153: 106464.

[146]
Zhang H H, Wu J T, Liu S C, Han S. Inf. Fusion, 2024, 103: 102092.

[147]
Lu X, Xie L, Xu L, Mao R, Chang S, Xu X. Comput. Struct. Biotechnol. J., 2024, 23: 1666.

[148]
Nan S H, Li Z M, Jin S M, Du W L, Shen W F. Ind. Eng. Chem. Res., 2025, 64(5): 3045.

[149]
Yi W L, Zhang L, Xu Y L, Cheng X P, Chen T Z. Expert Syst. Appl., 2025, 260: 125403.

[150]
Ryu J, Lee M Y, Lee J H, Lee B, Oh K. Bioinformatics, 2020, 36(10): 3049.

[151]
Deng D G, Chen X W, Zhang R C, Lei Z R, Wang X J, Zhou F. J. Chem. Inf. Model., 2021, 61(6): 2697.

[152]
Wu J Z, Su Y, Yang A, Ren J Z, Xiang Y. Comput. Biol. Med., 2023, 165: 107452.

[153]
Zheng Z X, Wang H, Tan Y Y, Liang C, Sun Y S. Expert Syst. Appl., 2023, 234: 121016.

[154]
Stärk H, Beaini D, Corso G, Tossou P, Dallago C, Günnemann S, Liò P. International Conference on Machine Learning. Baltimore: PMLR, 2022. 20479.

[155]
Chen M K, Gong X W, Pan S R, Wu J, Lin F, Du B, Hu W B. Neural Netw., 2025, 184: 107068.

[156]
Chen R Z, Li C Y, Wang L Y, Liu M Q, Chen S G, Yang J H, Zeng X X. Inf. Fusion, 2025, 115: 102784.

[157]
Xiang H X, Jin S T, Xia J, Zhou M, Wang J M, Zeng L, Zeng X X. Thirty-Third International Joint Conference on Artificial Intelligence. Jeju: IJCAI Organization, 2024. 6107.

[158]
Ma M, Lei X J. Comput. Biol. Med., 2024, 169: 107911.

[159]
Yin R, Liu R Y, Hao X S, Zhou X R, Liu Y, Ma C, Wang W P. IEEE Trans. Image Process., 2024, 34: 3225.

[160]
Chen Z Y, Xie F K, Wan M, Yuan Y, Liu M, Wang Z G, Meng S, Wang Y G. Chin. Phys. B, 2023, 32(11): 118104.

[161]
Grisoni F. Curr. Opin. Struct. Biol., 2023, 79: 102527.

[162]
Luo Y, Zhang J, Fan S, Yang K, Wu Y, Qiao M, Nie Z. BioMedGPT: Open Multimodal Generative Pre-Trained Transformer for BioMedicine. [2023-08-21]. https://doi.org/10.48550/arXiv.2308.09442.

[163]
Xie T, Wan Y, Liu Y, Zeng Y, Wang S, Zhang W, Grazian C, Kit C, Ouyang W, Zhou D, Hoex B. DARWIN 1.5: Large Language Models as Materials Science Adapted Learners. [2025-05-21]. https://doi.org/10.48550/arXiv.2412.11970.

[164]
Liu X, Wang Y, Yang T, Liu X, Wen X D. AlchemBERT: Exploring Lightweight Language Models for Materials Informatics. [2025-02-13]. https://www.cambridge.org/engage/chemrxiv/article-details/6781a6b481d2151a02a3212e

[165]
Zeng Z N, Yao Y, Liu Z Y, Sun M S. Nat. Commun., 2022, 13(1): 862.

[166]
Su B, Du D, Yang Z, Zhou Y, Li J, Rao A, Sun H, Lu Z, Wen J R. A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language. [2022-09-12]. https://doi.org/10.48550/arXiv.2209.05481.

[167]
Liu S, Nie W, Wang C, Lu J, Qiao Z, Liu L, Tang J, Xiao C, Anandkumar A. Nat. Mach. Intell., 2023, 5(12): 1447.

[168]
Hua Y, Feng Z H, Song X N, Wu X J, Kittler J. Pattern Recognit., 2025, 157: 110887.

[169]
Polat C, Kurban H, Serpedin E, Kurban M. Understanding the Capabilities of Molecular Graph Neural Networks in Materials Science Through Multimodal Learning and Physical Context Encoding. [2025-05-17]. https://doi.org/10.48550/arXiv.2505.12137.

[170]
Kim S, Chen J, Cheng T J, Gindulyte A, He J, He S Q, Li Q L, Shoemaker B A, Thiessen P A, Yu B, Zaslavsky L, Zhang J, Bolton E E. Nucleic Acids Res., 2025, 53(D1): D1516.

[171]
Arevalo J, Solorio T, Montes-y-Gómez M, González F A. Neural Comput. Appl., 2020, 32(14): 10209.

[172]
Tang X R, Tran A, Tan J, Gerstein M B. Bioinformatics, 2024, 40(Supplement_1): i357.

[173]
Kang C L, Liu X Y, Guo F. The Thirteenth International Conference on Learning Representations. Singapore: ICLR, 2025.

[174]
Chen B, Li C, Dai H, Song L. International Conference on Machine Learning. PMLR, 2020. 1608.

[175]
Liu G, Sun M, Matusik W, Jiang M, Chen J. Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning. [2024-10-08]. https://arxiv.org/abs/2410.04223

[176]
Liu G, Xu J, Luo T, Jiang M. Graph Diffusion Transformers for Multi-Conditional Molecular Generation. [2024-10-03]. https://doi.org/10.48550/arXiv.2401.13858.

[177]
Wiercioch M, Kirchmair J. Expert Syst. Appl., 2023, 213: 119055.

[178]
He Z, Chen L, Lv H, Zhou R, Xu J, Chen Y, Hu J, Gao Y. Advanced Intelligent Computing Technology and Applications. Huang D S, Premaratne P, Jin B, Qu B, Jo K H, Hussain A (Eds.). Singapore: Springer, 2023. 14088: 700.

[179]
Wu K D, Wei G. J. Chem. Inf. Model., 2018, 58(2): 520.

[180]
Yap C W. J. Comput. Chem., 2011, 32(7): 1466.

[181]
Ding Y, Jiang X Q, Kim Y. Bioinformatics, 2022, 38(10): 2826.

[182]
Schlichtkrull M, Kipf T N, Bloem P, Van Den Berg R, Titov I, Welling M. The Semantic Web. Gangemi A, Navigli R, Vidal M E, Hitzler P, Troncy R, Hollink L, Tordai A, Alam M (Eds.). Cham: Springer, 2018. 10843: 593.

[183]
Moriwaki H, Tian Y S, Kawashita N, Takagi T. J. Cheminf., 2018, 10: 4.

[184]
Kumar R, Sharma A, Alexiou A, Bilgrami A L, Kamal M A, Ashraf G M. Front. Neurosci., 2022, 16: 858126.

[185]
Zou Z, Zhang Y, Liang L, Wei M, Leng J, Jiang J, Luo Y, Hu W. Nat. Comput. Sci., 2023, 3(11): 957.

[186]
Alberts M, Schilter O, Zipoli F, Hartrampf N, Laino T. Neural Information Processing Systems. Vancouver: Curran Associates, 2024, 37: 125780.

[187]
Guo K, Nan B, Zhou Y, Guo T, Guo Z, Surve M, Liang Z, Chawla N, Wiest O, Zhang X. Neural Information Processing Systems. Vancouver: Curran Associates, 2024. 37: 134721.

[188]
Guo S B, Jiang J, Ren H, Wang S. J. Phys. Chem. Lett., 2023, 14(33): 7461.

[189]
Yang G, Jiang S, Luo Y, Wang S, Jiang J. J. Phys. Chem. Lett. 2024, 15(34): 8766.

[190]
Chacko E, Sondhi R, Praveen A, Luska K L. Spectro: A Multi-Modal Approach for Molecule Elucidation Using IR and NMR Data. [2024-11-06]. https://www.cambridge.org/engage/chemrxiv/article-details/6724fb5b7be152b1d0ae66f8

[191]
BehnamGhader P, Adlakha V, Mosbach M, Bahdanau D, Chapados N, Reddy S. LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders. [2024-08-21]. https://doi.org/10.48550/arXiv.2404.05961

[192]
Mirza A, Jablonka K M. Elucidating Structures from Spectra Using Multimodal Embeddings and Discrete Optimization. [2024-11-22]. https://chemrxiv.org/engage/chemrxiv/article-details/673fbcab5a82cea2fa4c4a39

[193]
Rocabert-Oriols P, López N, Heras-Domingo J. Multi-Modal Contrastive Learning for Chemical Structure Elucidation with VibraCLIP. [2025-04-23]. https://www.cambridge.org/engage/chemrxiv/article-details/6807a71c50018ac7c5a0d0cb

[194]
Radford A, Kim J W, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J. International Conference on Machine Learning. PMLR, 2021. 8748.

[195]
Wang L, Liu S, Rong Y, Zhao D, Liu Q, Wu S, Wang L. MolSpectra: Pre-Training 3D Molecular Representation with Multi-Modal Energy Spectra. [2025-02-22]. https://doi.org/10.48550/arXiv.2502.16284
