Application of Advanced Artificial Intelligence Technology in New Drug Discovery

Zhonghua Wang; Yichu Wu; Zhongshan Wu; Ranran Zhu; Yang Yang; Fanhong Wu

doi:10.7536/PC230318

Progress in Chemistry, 2023, Vol. 35, Issue 10: 1505-1518

Review

Zhonghua Wang ¹,², Yichu Wu ¹, Zhongshan Wu ¹, Ranran Zhu ¹, Yang Yang ¹, Fanhong Wu ¹,²,*

¹ School of Chemical and Environmental Engineering, Shanghai Institute of Technology,Shanghai 201418, China
² Shanghai Engineering Research Center of Green Fluoropharmaceutical Technology,Shanghai 201418, China

*Corresponding author e-mail: wfh@sit.edu.cn

Received date: 2023-03-16

Revised date: 2023-05-24

Online published: 2023-08-06

Supported by

National Natural Science Foundation of China(21672151)

National Natural Science Foundation of China(21602136)

Abstract

In recent years, the discovery of new drugs driven by advanced artificial intelligence (AI) has attracted much attention. Advanced artificial intelligence algorithms (machine learning and deep learning) have been gradually applied in various scenarios of new drug discovery, such as representation learning task (molecular descriptor), prediction task (drug target binding affinity prediction, crystal structure prediction and molecular basic properties prediction) and generation task (molecular conformation generation and drug molecular generation). This technology can significantly reduce the cost and time of new drug development, improve the efficiency of drug development, and reduce the costs and risks associated with preclinical and clinical trials. This review summarizes the application of advanced artificial intelligence technology in new drug discovery in recent years, to help understand the research progress and future development trend in this field, and to facilitate the discovery of innovative drugs.

Contents

1 Introduction

2 Artificial intelligence

2.1 Convolutional neural network

2.2 Recurrent neural network

2.3 Graph neural network

2.4 Generative adversarial network

2.5 Variational auto encoder

2.6 Diffusion model

2.7 Transformer model

3 The application of artificial intelligence in drug discovery

3.1 Data resources and open-source tools

3.2 Artificial intelligence technology drives molecular representation learning tasks

3.3 Artificial intelligence technology drives predictive tasks

3.4 Artificial intelligence technology drives generation tasks

4 Conclusion and outlook

Key words: artificial intelligence; new drug discovery; deep learning; representation learning; task application

Cite this article

Zhonghua Wang, Yichu Wu, Zhongshan Wu, Ranran Zhu, Yang Yang, Fanhong Wu. Application of Advanced Artificial Intelligence Technology in New Drug Discovery[J]. Progress in Chemistry, 2023, 35(10): 1505-1518. DOI: 10.7536/PC230318

1 Introduction

The primary goal of drug discovery is to develop safe and effective drugs for the treatment of human diseases. Every drug development program proceeds from drug design through successive stages of clinical trials, which demands a great deal of time and money. Generally speaking, it takes billions of dollars and 10 to 15 years for a new drug to be developed and finally launched^[1]. Because the cost of research and development increases at each step, it is essential to ensure that appropriate drug candidates are selected for the next stage at each milestone. In particular, the discovery and identification of lead compounds is a key step in new drug discovery. One reason why candidates show side effects or lack in vivo efficacy in clinical trials is polypharmacology: single or multiple drugs often interact with multiple targets^[2]. Ideally, comprehensive in vivo testing in each disease model would resolve this issue, but it would require astronomical amounts of time and effort. Since the 1980s, computer-aided drug discovery and design methods have played an important role in new drug development by reducing the burden of experimental validation^[3-5]. However, even this approach has failed to prevent the decline in the efficiency of new drug development in the pharmaceutical industry since the mid-1990s. In recent years, artificial intelligence (AI) has been applied to drug discovery, enabling academia and the pharmaceutical industry to pursue effective and more cost-efficient development strategies.

The vast amount of chemical and biological data accumulated over decades, together with the automation made possible by high-performance processors such as graphics processing units (GPUs), has paved the way for AI to be adopted in drug research and development^[6,7]. In the early stage of drug research and development, advanced AI technology dominated by machine learning and deep learning has gradually been applied to drug design and drug property prediction, including molecular representation learning, molecular property prediction and molecular generation. In the late stage, AI technology has helped innovative drugs enter the clinic. Using its self-developed AI platform Chemistry42, Insilico Medicine has designed new small-molecule compounds for new targets^[8]. On July 28, the company announced that ISM001-055, the first candidate drug discovered and designed by artificial intelligence to enter clinical trials in China, had completed phase I clinical trials in China. It is worth mentioning that ISM001-055 took only 18 months and $2.7 million in research and development spending to go from target discovery to nomination of a preclinical candidate compound. In the future, as more and more AI algorithms and models are introduced, AI is bound to become a major driving force for efficient drug research and development.

2 Artificial intelligence technology

The core of AI technology is to construct mathematical models with algorithms to solve practical problems, and machine learning and deep learning are two of its most active research areas. Deep learning, a branch of machine learning, differs from conventional machine learning mainly in the amount of data and the complexity of the model: deep learning models are usually more complex and require more data^[9]. With the unprecedented progress of highly parallel graphics processing units (GPUs) and the development of GPU-supported algorithms, deep learning has entered the field of new drug research and development with revolutionary impact^[10]. The logical layout of the GPU and CPU is shown in Figure 1. Most of the area of a CPU is devoted to control units and registers. In contrast, a GPU devotes more of its area to arithmetic logic units (ALUs) for data processing rather than to data caching and flow control. This structure is well suited to parallel processing of data-intensive workloads. Sections 2.1 to 2.7 introduce several current cutting-edge deep learning algorithms.

Fig.1 Logical mode of the GPU and CPU^[10]

2.1 Convolutional neural network (CNN)

Convolutional neural networks (CNN) are mainly used in computer vision to process the pixel data of an image^[11]. As shown in Figure 2, a CNN contains convolutional and pooling (i.e., sub-sampling) layers, on top of which a vector representation is learned by concatenating feature maps to make the final prediction. The advantage of a CNN is that each filter shares its parameters across all positions of the input, which greatly reduces the number of parameters that need to be learned, thus reducing memory consumption and improving computational speed. In drug discovery, CNNs can be used to elucidate the spectrum of biological activity from microscopy images, and they are also widely used for molecular property prediction^[12-15]. Duvenaud et al. applied a CNN to circular fingerprints to create a differentiable fingerprint. This was the first time that data-driven representation learning, rather than fixed chemical descriptors, was used to predict molecular properties, and the work has greatly promoted the learning of molecular representations^[15]. In addition, Goh et al. predicted the free energy of molecules by training on two-dimensional images of molecular structures^[16].
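The effect of parameter sharing can be made concrete by counting parameters. The following is a minimal sketch, not taken from the cited works; the layer sizes (a 64 x 64 single-channel image mapped to 16 feature maps with 3 x 3 filters) are illustrative assumptions.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def conv2d_params(in_channels, out_channels, k):
    """Weights plus biases of a 2D convolution with k x k filters."""
    return in_channels * out_channels * k * k + out_channels


def conv1d(signal, kernel):
    """Valid 1D cross-correlation: one kernel is reused at every position."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# Hypothetical sizes chosen only for illustration.
height = width = 64
n_pixels = height * width

# A dense layer producing 16 maps of the same spatial size needs one weight
# per (input pixel, output unit) pair.
print(dense_params(n_pixels, 16 * n_pixels))

# A convolutional layer needs only 16 small filters, whatever the image size.
print(conv2d_params(1, 16, 3))

# The same three weights are applied at every position of the input.
print(conv1d([1, 2, 3, 4, 5], [1, 0, -1]))
```

In this sketch the convolutional layer needs 160 parameters where the dense layer needs several hundred million, which is the memory and speed advantage described above.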

Fig.2 Convolutional neural network model architecture

2.2 Recurrent neural network (RNN)

Recurrent neural networks (RNN) are deep learning models mainly used to process sequence data^[11]. As shown in Figure 3, the network builds on the traditional neural network model by allowing connections between neurons in the same hidden layer to form a directed cycle, so that sequential input can be processed; it is commonly used for language processing^[17]. However, long-term dependencies make the parameters of an RNN difficult to learn because of the exploding or vanishing gradient problem. Therefore, two variants, long short-term memory (LSTM)^[18] and the gated recurrent unit (GRU)^[19], were developed to augment the network with memory modules. In drug research and development, RNNs mainly take a string representation of a molecule as input to predict molecular properties and to generate molecules. The characters in the string are first converted into "one-hot" vectors and then fed to the RNN one at a time. Each step updates a hidden vector, and the output is produced at the end. Goh et al. proposed SMILES2Vec, which uses an RNN to learn features from SMILES and predict a wide range of chemical properties^[20]. Mayr et al. proposed SmilesLSTM to predict drug-target interactions (DTI), with results better than those of traditional machine learning models^[21].
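The one-hot encoding and step-by-step hidden-state update described above can be sketched as follows. This is a toy illustration rather than SMILES2Vec or SmilesLSTM: the vocabulary, hidden size and random weights are assumptions, the cell is a plain tanh RNN rather than an LSTM or GRU, and the tokenizer works character by character, so multi-character atoms such as Cl or Br would need a real SMILES tokenizer.

```python
import math
import random


def one_hot_encode(smiles, vocab):
    """Map each character of a SMILES string to a one-hot vector over vocab."""
    index = {ch: i for i, ch in enumerate(vocab)}
    vectors = []
    for ch in smiles:
        vec = [0] * len(vocab)
        vec[index[ch]] = 1
        vectors.append(vec)
    return vectors


def rnn_step(x, h, w_xh, w_hh, b):
    """One recurrent update: h_new = tanh(W_xh x + W_hh h + b)."""
    h_new = []
    for i in range(len(h)):
        total = b[i]
        total += sum(w_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(w_hh[i][j] * h[j] for j in range(len(h)))
        h_new.append(math.tanh(total))
    return h_new


def encode_sequence(smiles, vocab, hidden_size=3, seed=0):
    """Feed the one-hot vectors in order and return the final hidden vector."""
    rng = random.Random(seed)  # untrained, randomly initialised weights
    w_xh = [[rng.uniform(-0.5, 0.5) for _ in vocab] for _ in range(hidden_size)]
    w_hh = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
            for _ in range(hidden_size)]
    b = [0.0] * hidden_size
    h = [0.0] * hidden_size
    for x in one_hot_encode(smiles, vocab):
        h = rnn_step(x, h, w_xh, w_hh, b)
    return h


vocab = ["C", "O", "N", "=", "(", ")"]  # hypothetical character vocabulary
print(one_hot_encode("CCO", vocab))      # ethanol
print(encode_sequence("CCO", vocab))
```

In a trained model the final hidden vector would be passed to an output layer that predicts a property or the next character.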

Fig.3 Recurrent neural network model architecture

2.3 Graph neural network (GNN)

In recent years, graph neural networks (GNN) have become increasingly popular. A GNN is a modeling method that abstracts data into nodes and edges and represents them as graphs^[22]. As shown in Figure 4, GNNs can handle node-level (such as node classification), edge-level (such as link prediction), and graph-level (such as graph regression) tasks, using neighborhood aggregation, pooling, and readout operations. Current GNNs are mainly divided into two types: convolutional GNNs and recurrent GNNs^[22]. In a recurrent GNN, node representations are learned by a recurrent neural architecture. Convolutional GNNs, on the other hand, generalize convolutional operations from grid data to graph data and can stack multiple graph convolutional layers to extract high-level node representations. From the point of view of data structure, a molecular structural formula is easily represented as a graph, so GNNs are a natural fit for new drug research and development. For example, GNN-based methods such as SchNet^[23], PotentialNet^[24], and DimeNet^[25] have been used for molecular prediction tasks.
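Neighborhood aggregation and readout can be illustrated on a very small molecular graph. The sketch below is an assumption-laden toy, not SchNet, PotentialNet or DimeNet: it uses the heavy atoms of ethanol (C-C-O), one-hot element features, a sum aggregator with no learned weights, and a sum readout.

```python
def aggregate(features, neighbors):
    """One round of sum aggregation: each node adds its neighbors' features."""
    updated = []
    for v, h_v in enumerate(features):
        message = list(h_v)
        for u in neighbors[v]:
            message = [a + b for a, b in zip(message, features[u])]
        updated.append(message)
    return updated


def readout(features):
    """Graph-level representation: sum of all node vectors."""
    return [sum(column) for column in zip(*features)]


# Ethanol heavy atoms: C0 - C1 - O2. Feature layout is [is_carbon, is_oxygen].
node_features = [[1, 0], [1, 0], [0, 1]]
adjacency = {0: [1], 1: [0, 2], 2: [1]}

hidden = aggregate(node_features, adjacency)
print(hidden)           # node-level representations after one round
print(readout(hidden))  # graph-level representation
```

After one round, the middle carbon already "sees" both a carbon and an oxygen neighbor; stacking more rounds lets information travel further along the bonds, which is the role of stacked graph convolutional layers.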

Fig.4 Graph neural network model architecture

2.4 Generative adversarial network (GAN)

The generative adversarial network (GAN) was proposed by Goodfellow et al. in 2014 and has achieved remarkable results in generating realistic synthetic samples^[26]. As shown in Fig. 5, a GAN consists of a generative model (the generator) and a discriminative model (the discriminator). The generator aims to generate new data points from a random distribution, while the discriminator aims to classify whether a sample comes from the training data distribution or from the generator. The generator and discriminator are trained jointly with a minimax objective:

$$\min_G \max_D L(G, D) = \mathbb{E}_{x \sim p_x}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \tag{1}$$

Fig.5 Generative adversarial network model architecture

In Eq. (1), $p_x$ and $p_z$ denote the distribution of the real data $x$ and the distribution of the prior noise $z$, respectively.
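As a numerical illustration of Eq. (1), the objective can be estimated from discriminator outputs on a batch of real and generated samples. This is a minimal sketch: the discriminator outputs below are made-up numbers, and no network is trained.

```python
import math


def gan_objective(d_real, d_fake):
    """Monte Carlo estimate of L(G, D).

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that separates real from generated samples well
# (hypothetical outputs) drives the objective toward its maximum of 0.
confident = gan_objective([0.9, 0.8], [0.1, 0.2])

# When the generator fools the discriminator completely, D outputs 0.5
# everywhere and the objective equals -log(4).
fooled = gan_objective([0.5, 0.5], [0.5, 0.5])

print(confident, fooled, -math.log(4))
```

The discriminator tries to push this value up, while the generator tries to push it down by making $D(G(z))$ large.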

As a powerful generative model, the GAN has been applied to molecular generation tasks in new drug research and development, for example in the ORGAN model proposed by Guimaraes et al.^[27] and the ORGANIC model built on it^[28].

2.5 Variational autoencoder (VAE)

The variational autoencoder (VAE) is a powerful class of probabilistic generative models, first proposed by Kingma et al. in 2013^[29]. As shown in Fig. 6, a VAE consists of an encoder and a decoder; the encoder maps high-dimensional data to a low-dimensional continuous latent space Z. In contrast to ordinary autoencoders, the latent space is regularized, and ideally well organized, by a KL divergence term. In addition to reconstructing its input, the VAE approximates a probability distribution that can be sampled to generate new data. Thus, given an input x, the parameters of the VAE are optimized by minimizing the sum of the reconstruction loss and the KL divergence^[30]:

$$\lVert x - D(E(x)) \rVert^2 + \mathrm{KL}\big(\mathcal{N}(\mu_x, \sigma_x),\ \mathcal{N}(0, 1)\big) \tag{2}$$

Fig.6 Variational auto encoder model architecture

In Eq. (2), $\mathcal{N}(0, 1)$ is the unit normal distribution, and $\mu_x$ and $\sigma_x$ are the learnable mean and covariance parameters of the Gaussian distribution.
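The two terms of Eq. (2) can be evaluated directly once the encoder outputs are known. The sketch below assumes a diagonal Gaussian whose `sigma` values are per-dimension standard deviations (an assumption; the text only calls them covariance parameters) and uses the standard closed form of the KL divergence to a unit normal. The input values are made up.

```python
import math


def vae_loss(x, x_recon, mu, sigma):
    """Return (reconstruction term, KL term) of the VAE objective.

    x, x_recon: input and decoder output D(E(x)), same length.
    mu, sigma:  per-dimension mean and standard deviation from the encoder.
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    # Closed form for KL(N(mu, sigma^2) || N(0, 1)), summed over dimensions.
    kl = 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s)
                   for m, s in zip(mu, sigma))
    return recon, kl


# A latent code that already matches the prior incurs no KL penalty.
print(vae_loss([1.0, 2.0], [1.0, 1.0], mu=[0.0, 0.0], sigma=[1.0, 1.0]))

# Shifting the mean away from zero is penalised by the KL term.
print(vae_loss([1.0, 2.0], [1.0, 1.0], mu=[1.0, 0.0], sigma=[1.0, 1.0]))
```

The KL term is what keeps the latent space organized: codes that drift far from the unit normal are penalized even when reconstruction is good.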

Because the VAE is a powerful generative model, it has been applied to molecular generation tasks in new drug research and development, for example in GrammarVAE proposed by Kusner et al.^[31] and the syntax-directed VAE proposed by Dai et al.^[32].

2.6 Diffusion model

Diffusion models are a class of generative models widely used for image generation, including several well-known text-to-image applications. As shown in Fig. 7, the architecture consists of a diffusion (noising) stage, a latent space, and a denoising stage. Unlike VAEs or flow-based models, diffusion models use a fixed forward process that slowly adds random noise to the data, and then learn a reverse diffusion process that constructs the desired data samples from noise; the latent space has a relatively high dimension^[33,34]. Fundamentally, the model corrupts the training data by successively adding Gaussian noise and learns a denoising process that reverses this corruption. After training, randomly sampled noise can be passed through the learned denoising process to generate data^[35,36].
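The fixed forward (noising) process can be sketched in a few lines. The formulation below, with a linear variance schedule and the closed form $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$, is one common choice and is an assumption here; the schedule endpoints and the number of steps are illustrative values, and the learned denoising network is not shown.

```python
import math
import random


def linear_beta_schedule(n_steps, beta_start=1e-4, beta_end=0.02):
    """Noise variances beta_t increasing linearly over the diffusion steps."""
    return [beta_start + (beta_end - beta_start) * t / (n_steps - 1)
            for t in range(n_steps)]


def alpha_bar(betas):
    """Cumulative products of (1 - beta_t): the fraction of signal kept."""
    out, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        out.append(product)
    return out


def q_sample(x0, t, alpha_bars, rng):
    """Sample x_t directly from x_0 by mixing signal with Gaussian noise."""
    keep = alpha_bars[t]
    return [math.sqrt(keep) * v + math.sqrt(1.0 - keep) * rng.gauss(0.0, 1.0)
            for v in x0]


betas = linear_beta_schedule(1000)
alpha_bars = alpha_bar(betas)
rng = random.Random(0)

x0 = [1.0, -1.0, 0.5]  # a made-up data point
print(alpha_bars[0], alpha_bars[-1])     # almost all signal, almost none
print(q_sample(x0, 10, alpha_bars, rng))   # early step: close to x0
print(q_sample(x0, 999, alpha_bars, rng))  # last step: essentially pure noise
```

Generation runs this process in reverse: starting from pure noise, a trained network removes a little noise at each step.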

Fig.7 Diffusion model architecture

Building on the diffusion model, the RFdiffusion method and the family-wide hallucination method proposed by Baker et al. have been used for protein design and de novo luciferase design, respectively^[37]^[38].

2.7 Transformer model

In 2017, the Transformer architecture was first proposed and became a powerful language model^[39,40]. Subsequently, its variants GPT^[41] and BERT^[42] were proposed. ChatGPT and GPT-4, the most advanced models released by OpenAI at the time of writing, are language models developed from GPT^[43]^[44]. As shown in Fig. 8, a Transformer is composed of an input part, an encoder, a decoder, and an output part. The key to this deep learning model is the attention mechanism, in which three sets of vectors, Query (Q), Key (K) and Value (V), are combined according to the following formula:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \tag{3}$$

Fig.8 Transformer model architecture

In Eq. (3), $d_k$ is the dimension of the Key and Query vectors and is used to scale their dot product. Unlike the RNN, the Transformer drops recurrent connections and instead employs positional embeddings, which allows it to handle long sequences better. The model was soon applied to new drug discovery. The SMILES-BERT model proposed by Wang et al. and the SMILES-Transformer model proposed by Honda et al. have been applied to molecular prediction and representation learning, respectively^[45]^[46].
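Eq. (3) can be written out directly for small matrices. The following is a minimal single-head sketch in plain Python, with made-up inputs and without the learned projections, masking or multiple heads of a full Transformer.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


# Two identical keys receive equal attention, so the output is the mean
# of the two value vectors.
print(attention([[1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]], [[0.0, 2.0], [4.0, 6.0]]))

# A query aligned with the first key puts most of the weight on the first value.
print(attention([[4.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]))
```

Because every query is compared with every key in one step, no information has to be carried through a chain of recurrent updates.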

3 Application of Artificial Intelligence Technology in Drug Research and Development

With the emergence of more and more chemical and biological open source data and AI models, AI technology is gradually applied to many aspects of new drug research and development. This section will summarize the application of AI in drug research and development from the current major open source tools and databases, molecular representation learning, prediction tasks, and generation tasks. Figure 9 is the framework diagram of this section.

Fig.9 (a) Data and open-source tools. (b) Model architecture. (c) Molecular descriptions. (d) Performing tasks

3.1 Data resources and open source tools

Data on molecular activities and related properties continue to grow as improvements in high-throughput screening techniques feed rich public data resources. These resources typically provide molecular structures, molecular properties, and target information^[7]. PubChem^[47], ChEMBL^[48] and ZINC^[49] are the main databases used in current research. PubChem is a chemical information database launched by the National Institutes of Health in 2004. As of August 2020, PubChem contained 111 million unique chemical structures with 271 million activity data points from 1.2 million biological assay experiments. ChEMBL is another large, manually curated chemical database, maintained by the European Molecular Biology Laboratory. ChEMBL22 (version 22) contains more than 1.6 million distinct chemical structures and more than 14 million activity values. ZINC is a free database of commercially available compounds used for virtual screening. It contains more than 750 million purchasable compounds, of which more than 230 million are available in 3D form.

In addition to data, powerful chemical processing tools are also necessary. RDKit is an open-source cheminformatics toolkit that supports 2D and 3D molecular operations, descriptor generation for machine learning, structural similarity calculation, and 2D and 3D molecular depiction^[50]. Several current open-source platforms are built on RDKit, such as DeepChem^[51], CheTo^[52] and OCEAN^[53].

3.2 Artificial Intelligence Technology Driving Molecular Representation Learning Task

Machine learning models built with AI technology can automatically extract the characteristics of data after learning from large datasets. In new drug research and development, how drug molecules and protein macromolecules are described is very important, because it determines what features a machine learning model can extract from such data. To describe these molecules, machine learning methods use several types of molecular representation, ranging from simple sequences of molecular entities to manually predefined molecular features^[54,55]. Moreover, data representation has a significant impact on model pre-training, because it directly determines the knowledge the model learns, and a reasonable data representation can improve the performance of the prediction model. Interest in molecular representation has therefore surged, as better representations help capture previously unknown characteristics of target compounds^[56].

Drug molecules are currently represented mainly as SMILES sequences, fingerprints (FP) and molecular graphs^[57-60]^[61-64]^[65]. A molecular fingerprint is a binary vector in which each dimension indicates the presence or absence of a particular substructure. Fingerprints include 1D descriptors representing substituent atoms, chemical bonds, structural fragments, and functional groups, as well as 2D descriptors representing atomic connectivity and molecular topology. However, current FPs have some problems, such as heavy dependence on the amount of data available for feature extraction and loss of features. A molecular graph represents a molecule as a set of nodes (atoms) and a set of edges (bonds), but the matrix used to store the graph requires a lot of disk space and a lot of memory during computation, which may reduce the efficiency of molecule generation. A SMILES sequence represents a molecule as a string. Typically, a SMILES string is converted to one-hot vectors before being fed into a machine learning model^[66]. SMILES strings are less expensive to compute with than molecular graph representations. However, since SMILES strings do not directly encode atomic connectivity, they may lose structural information^[67].

Building on these data structures, Ren et al. proposed SuHAN, a new substructure hierarchical attention network that discovers the intrinsic characteristics of molecules for representation learning^[68]. SuHAN divides a molecule in SMILES format into several substructure fragments according to predefined division rules, and then passes them through an atomic layer and a substructure layer in turn to obtain feature representations from two perspectives, the atomic view and the substructure view. Liu et al. used rotary position embedding (RoPE) to encode the position information of SMILES sequences effectively, improving the ability of a BERT pre-trained model to extract latent molecular substructure information for molecular property prediction^[69]. Vipul et al. proposed Grammar2vec, a framework based on the SMILES grammar for generating dense numerical molecular representations^[70]. Lv et al. proposed Mol2Context-vec, a deep contextual Bi-LSTM architecture that integrates internal states at different levels and thus provides a dynamic representation of molecular substructures^[71]. Ji et al. proposed ReLMole, a representation learning method for molecular graphs characterized by hierarchical graph modeling of molecules and a contrastive learning scheme based on the similarity of two levels of graphs^[72].
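The idea of a fingerprint as a binary substructure vector can be illustrated with a toy example. Everything below is hypothetical: the five substructure keys and the hand-assigned annotations stand in for a real fingerprinting algorithm, and the Tanimoto coefficient is used as one common way to compare fingerprints, not as a method prescribed by the works cited above.

```python
def fingerprint(present, keys):
    """Binary vector: 1 if the molecule contains the substructure, else 0."""
    return [1 if key in present else 0 for key in keys]


def tanimoto(fp_a, fp_b):
    """Shared on-bits divided by on-bits set in either fingerprint."""
    both = sum(a & b for a, b in zip(fp_a, fp_b))
    either = sum(a | b for a, b in zip(fp_a, fp_b))
    return both / either if either else 0.0


# Hypothetical substructure keys; real fingerprints use hundreds or thousands.
KEYS = ["hydroxyl", "carbonyl", "amine", "aromatic ring", "halogen"]

# Hand-assigned substructure annotations for three small molecules.
ethanol = fingerprint({"hydroxyl"}, KEYS)
phenol = fingerprint({"hydroxyl", "aromatic ring"}, KEYS)
acetone = fingerprint({"carbonyl"}, KEYS)

print(ethanol, phenol, acetone)
print(tanimoto(ethanol, phenol))   # share the hydroxyl bit
print(tanimoto(ethanol, acetone))  # no bits in common
```

The example also shows the feature loss mentioned above: any structural detail not covered by a key is invisible to the fingerprint.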

3.3 Artificial Intelligence Technology Drives Prediction Task

3.3.1 Prediction of drug-target binding affinity

In the development of targeted drugs, it is very important to predict the interaction between a drug and its target, a task known as drug-target interaction (DTI) prediction. The process is illustrated in Figure 10^[73]. At present, deep learning based neural network models usually perform well in affinity prediction. Because neural networks can handle complex drug molecular data and automatically learn and extract relevant features through layer-by-layer learning and propagation, they improve the accuracy of activity prediction.

Fig.10 DTI schematic diagram: drug molecules and proteins are first represented separately; after each passes through a neural network model, the two resulting vectors are concatenated and fed into an MLP, which performs a classification task to determine whether the drug acts on the target^[76]
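The final stage of the pipeline in Fig. 10 can be sketched as follows. This is an illustrative toy, not the model of any cited work: the drug and protein vectors stand in for the outputs of upstream encoders, and the weights are hand-picked rather than trained.

```python
import math


def linear(x, weights, bias):
    """Affine layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_interaction(drug_vec, protein_vec, w1, b1, w2, b2):
    """Concatenate the two embeddings and classify the pair with an MLP."""
    pair = drug_vec + protein_vec         # list concatenation
    hidden = relu(linear(pair, w1, b1))   # hidden layer
    logit = linear(hidden, w2, b2)[0]     # single output unit
    return sigmoid(logit)                 # probability of interaction


# Hypothetical embeddings and hand-picked weights, for illustration only.
drug_vec = [1.0, 0.0]
protein_vec = [0.5, 0.5]
w1 = [[1.0, 0.0, 1.0, 0.0],
      [0.0, 1.0, 0.0, 1.0]]
b1 = [0.0, 0.0]
w2 = [[1.0, -1.0]]
b2 = [0.0]

probability = predict_interaction(drug_vec, protein_vec, w1, b1, w2, b2)
print(probability)
print("interacts" if probability >= 0.5 else "does not interact")
```

In practice the encoders and the MLP are trained end to end on known drug-target pairs, and the 0.5 threshold is itself a tunable choice.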

For drug-target affinity prediction, Lee et al. proposed a deep learning based DTI prediction model that captures the local residue patterns of proteins involved in DTI^[61]. By applying a convolutional neural network (CNN) to the raw protein sequence, amino acid subsequences of different lengths are convolved to capture local residue patterns of generalized protein classes. The long short-term memory network (LSTM), a variant of RNN, is also widely used for DTI prediction. Wang et al. developed a DTI prediction model based on the LSTM architecture^[64]. Evolutionary features of proteins were extracted with a position-specific scoring matrix (PSSM) and Legendre moments (LM) and combined with drug molecular substructure fingerprints to form feature vectors of drug-target pairs. Sparse principal component analysis (SPCA) was then used to compress the features of drugs and proteins into a uniform vector space. The experimental results show that DTI prediction performance is significantly improved, with AUCs of 0.9951, 0.9705, 0.9951 and 0.9206 on four important drug-target datasets. Yuan et al. proposed FusionDTA, a method based on the LSTM architecture^[74]. To address the loss of implicit information, a novel multi-head linear attention mechanism replaces the coarse pooling method, allowing FusionDTA to aggregate global information according to attention weights instead of selecting only the largest value as max pooling does. The experimental results show that the concordance index (CI) of the model on the Davis and KIBA datasets is 0.913 and 0.906, respectively.

Shao et al. proposed DTI-HETA, an end-to-end GNN-based model built on a heterogeneous graph with an attention mechanism^[75]. In this model, the heterogeneous graph is first constructed from the drug-drug and target-target similarity matrices and the DTI matrix. A graph convolutional neural network is then used to obtain embedding representations of drugs and targets. To highlight the contributions of different neighboring nodes to the central node when aggregating graph convolution information, a graph attention mechanism is introduced in the node embedding process. Finally, an inner product decoder is used to predict DTIs. Li et al. proposed MINN-DTI, a network architecture based on an improved Transformer^[76]. MINN-DTI combines an interaction Transformer module (Interformer) with a modified communicative message passing neural network (CMPNN), together called Inter-CMPNN, to better capture the bidirectional influence between drug and target, which are represented by a molecular graph and a distance map, respectively. This method also offers good interpretability, as shown in Figure 11.
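The consistency (concordance) index reported above for affinity prediction measures how often a model ranks a pair of drug-target complexes in the correct order. A minimal sketch under the usual pairwise definition follows; the affinity values are made up and are not taken from the Davis or KIBA datasets.

```python
def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the truth.

    Pairs with equal true values are skipped; ties in the predictions
    count as half-correct.
    """
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue
            comparable += 1
            # Index of the pair member with the higher true affinity.
            high, low = (i, j) if y_true[i] > y_true[j] else (j, i)
            if y_pred[high] > y_pred[low]:
                concordant += 1.0
            elif y_pred[high] == y_pred[low]:
                concordant += 0.5
    return concordant / comparable if comparable else 0.0


measured = [5.0, 6.2, 7.1, 8.0]   # hypothetical measured affinities
predicted = [5.1, 6.0, 7.5, 7.2]  # hypothetical model outputs

# Five of the six pairs are ordered correctly; only the last two are swapped.
print(concordance_index(measured, predicted))
```

A CI of 1.0 means every pair is ranked correctly and 0.5 corresponds to random ordering, so values around 0.9 indicate a strong ranking ability.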

Fig.11 Attention visualization of DTIs. Left: Protein distance maps are displayed in the form of heat maps. The corresponding targets’ attention bars are shown. Middle: Ligands and predicted important residues are represented as green and pink skeletons, respectively. Predicted important atoms of ligands are highlighted in red. Known hydrogen bonds are marked with yellow dashed lines. Local target structures are painted grey as the background. Right: Ligands are represented by 2D Kekule formula. The corresponding predicted important atoms are highlighted by light red dots^[76]

3.3.2 Crystal structure prediction

Crystal form is an important property of drugs. It significantly affects physical and chemical properties such as stability, solubility and dissolution rate, and in turn the storage, formulation, product design and delivery mechanism of drugs^[77]. Crystal morphology also affects downstream processing, for example in processes that require a specific particle morphology to maximize filtration, particle flowability, and tabletability^[78,79]. Current crystal engineering focuses on targeting specific interactions through solvent selection or control of experimental conditions to produce the ideal crystal morphology^[80]. Success therefore depends on the experience of scientists and on frequent trial and error, which requires a lot of manpower, time and resources, so the development of AI technology has brought new opportunities for crystal form prediction. The AI pharmaceutical company XtalPi and AstraZeneca jointly reported a study of AZD1305, an active pharmaceutical ingredient (API) for cardiovascular disease, in which two crystal forms with very similar physical stability at room temperature were found. With XtalPi's crystal form prediction and stability evaluation technology, the polymorphism was systematically evaluated and the experimental results were verified, as shown in Figure 12^[81].

Fig.12 (a)Structure overlay of the predicted lowest lattice energy form Z1 (in blue, rmsd15 = 0.091 Å) and form X1 (in red, rmsd15 = 0.141 Å) with experimental single crystal structure of the stable form B of AZD1305 (in green). (b)Comparison of experimental PXRD data for form A (black) and simulated PXRD patterns of predicted form Z8 (blue) and X23(red). (c)Structural overlay of conformers in predicted form A (blue) and form B (red) of AZD1305. (d)Temperature dependence of relative free-energy stabilities of forms X1 (form B), X2, X3, X4, X5, and X23 (form A) predicted by XtalPi^[81]

In addition, Bhardwaj et al. used a random forest (RF) model to predict the crystallinity of organic molecules for the first time, with an accuracy of about 70%^[82]. The predictive model is based on calculated molecular descriptors and published crystallization propensity experiments for a library of substituted acylanilides. Wicker et al. used Python and the RDKit cheminformatics toolkit to build models from chemical descriptors and machine learning methods^[83]. They predicted whether molecules would crystallize without considering crystal growth mechanisms or conditions, focusing instead on the properties and interactions of individual molecules. Comparing support vector machine (SVM) and RF models, they found that the SVM model had the highest prediction accuracy, 90.3%. On this basis, Pillong et al. built an RF model to evaluate the solubility and crystallization tendency of 319 small molecules in 18 different solvents^[84]. The model can guide the selection of an appropriate crystallization solvent, reducing the workload to one third of the initial plan while keeping the crystallization success rate above 92%. More recently, Yang et al. developed an efficient molecular mechanics protocol based on the Einstein crystal method^[85]. Using this approach, they applied finite-temperature free energy corrections to the predicted crystal structures and the five experimentally known polymorphs of molecule XXIII, a pharmaceutically relevant compound. Wilkinson et al. proposed a method based on transfer learning and open-source robotics that improves AI prediction of crystal morphology^[86]. Experiments show that the data-driven model predicts crystal morphology with an accuracy of 87.9%. This approach will shorten drug development time and enable more sustainable and efficient manufacturing methods.

3.3.3 Molecular property prediction

Molecular property prediction is a fundamental but challenging task in drug discovery, which has gained increasing attention in recent decades with the development of machine learning and deep learning^[87]. As shown in Table 1, the properties of compounds can be divided into three categories: physicochemical properties, biophysical properties, and physiological properties, and each category corresponds to several benchmark datasets. On these benchmarks, prediction tasks are usually divided into classification (evaluated by AUC-ROC) and regression (evaluated by RMSE), and model performance in molecular property prediction is assessed with these two metrics^[88].

Table 1 The dataset for compound property prediction by artificial intelligence

Category	Dataset	Description
Physical chemistry	ESOL	Aqueous solubility
	FreeSolv	Hydration free energy
	Lipophilicity	Octanol/water distribution coefficient (LogD)
Biophysics	MUV	17 tasks from PubChem BioAssay
	HIV	Ability to inhibit HIV replication
	BACE	Binding results for inhibitors of human BACE-1
Physiology	BBBP	Blood-brain barrier penetration
	Tox21	Toxicity measurements
	SIDER	Adverse drug reactions grouped into 27 system organ classes
	ClinTox	Clinical trial toxicity and FDA approval status

Based on SMILES sequences, Jiang et al. proposed the NoiseMol method, the first systematic approach to injecting perturbation noise into SMILES strings at the atom level or the substring level, which enlarges the data set and alleviates the limitation that scarce labeled data imposes on molecular property prediction^[89]. Experimentally, the method improves the predictive ability and robustness of the Transformer, with the most pronounced gains on small data sets such as BACE, BBBP, FDA and ecoli. Li et al. proposed the Multiple SMILES method, which encodes multiple SMILES strings for each molecule as a form of automatic data augmentation; it performs well on molecular property prediction tasks and alleviates the overfitting caused by the small size of molecular property data sets^[90].
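To make the idea of atom-level noise concrete, the sketch below tokenizes a SMILES string and randomly replaces atom tokens with a mask token. It is an illustrative assumption of what such a perturbation could look like, not the NoiseMol implementation: the tokenizer is deliberately simplified (for example, it does not group multi-digit ring closures such as `%10`), and the names `tokenize`, `mask_atoms`, and the `[MASK]` token are hypothetical.

```python
import random
import re

# Simplified SMILES tokenizer: bracket atoms, two-letter halogens,
# then any remaining single character (bonds, branches, ring digits).
_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|.")
# Tokens treated as atoms in this sketch.
_ATOM = re.compile(r"^(\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops])$")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return _TOKEN.findall(smiles)


def mask_atoms(smiles, rate=0.15, mask_token="[MASK]", seed=None):
    """Replace each atom token with `mask_token` with probability `rate`.

    Non-atom tokens (bonds, parentheses, ring-closure digits) are kept,
    so the perturbed sequence has the same length as the original.
    """
    rng = random.Random(seed)
    noisy = []
    for tok in tokenize(smiles):
        if _ATOM.match(tok) and rng.random() < rate:
            noisy.append(mask_token)
        else:
            noisy.append(tok)
    return noisy
```

Calling `mask_atoms` several times with different seeds on the same labeled molecule yields several perturbed training sequences that share one label, which is how noise injection enlarges a small data set.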

Based on the graph structure, Maziarka et al. proposed the MAT method, which adapts the Transformer to chemical molecules by augmenting self-attention with interatomic distances and the molecular graph structure^[91]. Good downstream prediction results and interpretable attention weights are further highlights of their work. Liu et al. proposed ATMOL, a self-supervised learning method for molecular representation learning and property prediction, together with a new molecular graph augmentation strategy, called attention-wise graph masking, that generates challenging positive samples for contrastive learning^[92]. In separate work, Liu, Hu and co-workers proposed ABT-MPNN, based on a variant of the graph convolutional neural network (GCN), to improve the molecular representation embedding process for molecular property prediction^[93]. Their approach provides a novel architecture that integrates molecular representations at the bond, atom, and molecule levels in an end-to-end manner. Experimental results on nine benchmark data sets show that ABT-MPNN outperforms or is comparable to state-of-the-art baseline models in quantitative structure-property relationship tasks.
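The general idea behind MAT, mixing ordinary self-attention with geometric and topological information about the molecule, can be sketched as a weighted sum of three atom-by-atom matrices. The code below is an illustration under stated assumptions and not the authors' implementation: the mixing weights, the use of a row-wise softmax over negative distances, and the function names are all choices made for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax of a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def molecule_attention(scores, dist, adj,
                       lam_att=0.5, lam_dist=0.25, lam_adj=0.25):
    """Blend three n x n atom-atom matrices into one weight matrix.

    scores: raw scaled dot-product attention scores (Q K^T / sqrt(d))
    dist:   interatomic distances
    adj:    0/1 adjacency matrix of the molecular graph
    The lambda weights are illustrative hyperparameters.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        att = softmax(scores[i])                # standard self-attention
        geo = softmax([-d for d in dist[i]])    # closer atoms weigh more
        weights.append([lam_att * att[j] + lam_dist * geo[j] + lam_adj * adj[i][j]
                        for j in range(n)])
    return weights
```

The resulting matrix would multiply the value vectors in place of the usual softmax attention weights. Setting `lam_att=1.0` and the other two weights to zero recovers plain self-attention.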

3.4 Artificial Intelligence Technology Driven Generation Task

3.4.1 Molecular conformation generation

Molecular conformation generation is an important task in drug discovery, where the goal is to generate low-energy spatial arrangements of the atoms in a molecule. The traditional approach first generates a set of candidate conformations using conformational search methods; after energy minimization of each conformation, a batch of low-energy conformations is selected according to the potential energy evaluated by density functional theory (DFT), semi-empirical DFT, or molecular mechanics (MM)^[94]. In recent years, with the development of deep learning, data-driven molecular conformation generation has become possible; its basic process is shown in Figure 13.
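The final screening step of the traditional workflow can be sketched as a simple energy-window filter. The example assumes that conformer energies have already been produced by an upstream search and minimization step (with DFT or a force field); the function name `select_low_energy`, the default window of 3.0 energy units, and the `max_keep` option are hypothetical choices for illustration.

```python
def select_low_energy(energies, window=3.0, max_keep=None):
    """Return conformer ids within `window` of the lowest energy.

    energies: dict mapping conformer id -> minimized potential energy,
              all in the same unit (for example kcal/mol).
    The ids are returned sorted from lowest to highest energy.
    """
    if not energies:
        return []
    e_min = min(energies.values())
    kept = [cid for cid, e in energies.items() if e - e_min <= window]
    kept.sort(key=lambda cid: energies[cid])
    return kept if max_keep is None else kept[:max_keep]
```

With `energies = {0: 12.4, 1: 10.1, 2: 15.9, 3: 10.8}` and the default window, conformer 2 lies 5.8 units above the minimum and is discarded, and the function returns `[1, 3, 0]`.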

Fig.13 RDKit and a clustering algorithm used for molecular conformation generation^[95]

Mansimov et al. used a VAE to directly generate atomic coordinates^[96]. However, this model does not preserve the translational and rotational equivariance of the generated conformations, and it performs poorly in benchmarks. To guarantee this geometric property, later studies generated conformations via intermediate quantities such as interatomic distances or torsion angles. For example, Xu et al. used a flow model to generate the distance matrix and then applied classical distance geometry techniques to iteratively generate the conformation^[97]. The team subsequently designed an end-to-end framework that improves model performance through bilevel optimization^[98]. Zhu et al. proposed a method that predicts atomic coordinates directly, without first predicting interatomic distances, their gradients, or the local structure of the molecule, and achieved good results on two types of data sets^[99]. Recently, Xu et al. proposed GeoDiff, a geometric diffusion model for molecular conformation generation whose entire framework can be trained end to end by optimizing a weighted variational lower bound of the (conditional) likelihood^[100].
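The reason interatomic distances are a convenient intermediate quantity is that they do not change when a conformation is rotated or translated, whereas raw coordinates do. The short check below illustrates this with a rigid motion about the z axis; the coordinates are made-up values and the helper names are chosen for this example only.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between 3D atom positions."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def rotate_z_then_translate(coords, angle, shift):
    """Rotate all atoms about the z axis by `angle` (radians), then shift."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0],
             s * x + c * y + shift[1],
             z + shift[2]) for x, y, z in coords]


# Three atoms with arbitrary illustrative coordinates.
conformer = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.2, 0.3)]
moved = rotate_z_then_translate(conformer, math.pi / 3, (4.0, -2.0, 1.0))

before = distance_matrix(conformer)
after = distance_matrix(moved)
same = all(math.isclose(p, q, abs_tol=1e-9)
           for row_p, row_q in zip(before, after)
           for p, q in zip(row_p, row_q))
```

Here `same` evaluates to `True` even though every coordinate in `moved` differs from `conformer`, so a model that generates distances cannot produce two outputs that differ only by a rigid motion.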

3.4.2 Molecular generation

Molecular generation is a challenging open problem in cheminformatics. The total number of potential drug-like molecules is estimated at between 10²³ and 10⁶⁰, of which only about 10⁸ have been synthesized^[101,102]. Because this practically infinite chemical space is difficult to screen exhaustively and the gap between synthesized and potential molecules is enormous, the main purpose of this task is to model the distribution of molecules with a generative model so that molecules with desirable properties can be sampled. The process of molecule generation is shown in Figure 14.

Fig.14 Molecular generation aims to produce compounds with specific properties from a vast chemical space^[107]

Influenced by natural language processing (NLP), many generative models operate on SMILES sequences. Segler et al. trained a long short-term memory (LSTM) RNN model on a set of molecules from ChEMBL, represented as canonical SMILES, which can generate diverse and chemically plausible molecules^[103]. Yang et al. trained an LSTM-based neural network model on 200,000 compounds from the ChEMBL database and then fine-tuned it on a data set containing 135 published P300 inhibitors and 576 macrocyclic molecules to generate new P300/CBP inhibitors^[104]. The model generated a focused library of 672 chemical structures, and a series of highly potent inhibitors was obtained through synthesis and optimization. Bagal et al. proposed the LigGPT model for molecule generation, which can learn long-range dependencies in SMILES, such as ring closures^[105]. Since only the tokens preceding the current position may be attended to when predicting the next token in the sequence, a masked self-attention mechanism is applied in the model. Later, the team proposed the MolGPT model, which performs on par with other recently proposed machine learning frameworks for molecule generation in terms of generating valid, unique, and novel molecules^[106]. Furthermore, they demonstrated that the model can be trained conditionally to control multiple properties of the generated molecules. By conditioning on the SMILES string of a desired scaffold and on the desired property values, the model can generate molecules with that scaffold and those properties. Wang et al. proposed a Transformer model for target-specific molecular design^[107]. Their method can generate both drug-like compounds (without assigned targets) and target-specific compounds, where the latter are generated by applying different multi-head attention keys and values for each target. The experimental results show that the method can generate not only valid drug-like compounds but also target-specific compounds.
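The masking used in GPT-style SMILES generators can be illustrated with a small causal attention routine: when computing the attention weights for position i, only positions 0 to i are visible. This is a generic sketch of causal masking in plain Python, not code from LigGPT or MolGPT, and the function name is chosen for this example.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over an n x n score matrix with the future masked.

    scores[i][j] is the raw attention score of query position i for key
    position j. Positions j > i receive weight 0, so a token can attend
    only to itself and to the tokens before it.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][: i + 1]            # keys 0..i only
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights
```

For a 3 x 3 matrix of equal scores, the first token attends only to itself (weight 1.0), the second splits its attention evenly over two tokens, and the third over three, which is the pattern that lets the model be trained to predict each next SMILES token without seeing it.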

In addition, with the development of GNNs, molecular generation models based on molecular graphs have become a hot research topic. You et al. proposed the graph convolutional policy network (GCPN) model to generate molecular graphs^[108]. The model combines graph representation, reinforcement learning, and adversarial training in a unified framework, which enables the generation of valid molecular graphs. The method can also directly optimize properties of the molecular graph to generate goal-directed molecules. Samanta et al. proposed a graph-based VAE model^[109]. In this model, the encoder aggregates information and maps it into a continuous latent space, so that graphs with a variable number of atoms can be encoded; the decoder collectively represents all edges as an unnormalized log-probability vector, which is then fed to a single edge distribution, enabling efficient inference and decoding. The results show that the model can efficiently generate new molecules. Li et al. proposed a multi-physical graph neural network (MP-GNN) model based on multi-physical molecular graph representations and features^[110]. The various molecular interactions between different atom types and at different scales are systematically represented by a series of scale-specific and element-specific graphs with distance-dependent node features. From these graphs, graph convolutional network (GCN) models are built with a specially designed weight-sharing architecture. Base learners are constructed from the GCN models for different elements at different scales and are further combined using single-scale and multi-scale ensemble learning schemes. The experimental results show that the model has high accuracy.

4 Conclusion and prospect

The use of AI technology in drug discovery has proliferated over the past decade and is still growing in popularity. This technology has already had a positive impact on drug research and development, for example in predicting important molecular properties and generating compounds with specific characteristics. These approaches all aim to reduce the time and money spent on drug discovery, but, like any single approach, AI is unlikely to be a master key. Its role in driving drug research and development is still in its infancy, and many problems remain to be solved. The key problems at present are mainly reflected in the following aspects:

(1) Data. Despite the flourishing of deep learning models, data is always at the core of model development and evaluation. For a model (whether predictive or generative) to be useful, data must be available in sufficient quantity and must be of high quality. Although existing chemical libraries contain a large number of molecules, the number of data points for any particular assay can be very small. Benchmark data sets are also sometimes problematic in how well they represent the vast chemical space relevant to real-world drug discovery. Therefore, in addition to a sufficient amount of data, suitable data sets and evaluation metrics are important when evaluating a model. Data quality is equally crucial, and it depends on data labels. Labeled data usually allow a model to extract the characteristics of the data more effectively, but labeling requires a large number of wet-lab experiments, which makes building a high-quality data set expensive.

(2) Interpretability. Despite their superior performance, deep learning models remain hard for people to interpret. An important need, therefore, is to develop models with a high degree of interpretability. Visual indicators that expose what a model has learned could also help address this problem.

(3) Parameter complexity. Model algorithms underpin the development of deep learning, and models are currently developing toward ever larger scale, which means that a large number of parameters must be trained. Future work should therefore move toward more refined model design, that is, compressing and lightweighting deep learning models, which can greatly reduce their parameter count and computational cost.

In general, applying AI technology to drug development presents both opportunities and great challenges. As research in this field deepens and AI technology continues to improve, it is expected to play an increasingly important role in the research and development of new drugs.

References

[1]	DiMasi J A, Grabowski H G, Hansen R W. J. Health Econ., 2016, 47: 20.

[2]	Reddy A S, Zhang S X. Expert Rev. Clin. Pharmacol., 2013, 6(1): 41.

[3]	Sachdev K, Gupta M K. J. Biomed. Inform., 2019, 93: 103159.

[4]	Vamathevan J, Clark D, Czodrowski P, Dunham I, Ferran E, Lee G, Li B, Madabhushi A, Shah P, Spitzer M, Zhao S R. Nat. Rev. Drug Discov., 2019, 18(6): 463.

[5]	Kimber T B, Chen Y H, Volkamer A. Int. J. Mol. Sci., 2021, 22(9): 4435.

[6]	Lipinski C F, Maltarollo V G, Oliveira P R, da Silva A B F, Honorio K M. Front. Robot. AI, 2019, 6: 108.

[7]	Rifaioglu A S, Atas H, Martin M J, Cetin-Atalay R, Atalay V, Doğan T. Brief Bioinform., 2019, 20(5): 1878.

[8]	Ivanenkov Y A, Polykovskiy D, Bezrukov D, Zagribelnyy B, Aladinskiy V, Kamya P, Aliper A, Ren F, Zhavoronkov A. J. Chem. Inf. Model., 2023, 63(3): 695.

[9]	Janiesch C, Zschech P, Heinrich K. Electron. Mark., 2021, 31(3): 685.

[10]	Wang M Z, Wang B, He Q, Liu X X, Zhu K S. arXiv preprint arXiv: 1505.06561, 2015.

[11]	LeCun Y, Bengio Y, Hinton G. Nature, 2015, 521(7553): 436.

[12]	Simm J, Klambauer G, Arany A, Steijaert M, Wegner J K, Gustin E, Chupakhin V, Chong Y T, Vialard J, Buijnsters P, Velter I, Vapirev A, Singh S, Carpenter A E, Wuyts R, Hochreiter S, Moreau Y, Ceulemans H. Cell Chem. Biol., 2018, 25(5): 611.

[13]	Hofmarcher M, Rumetshofer E, Clevert D A, Hochreiter S, Klambauer G. J. Chem. Inf. Model., 2019, 59(3): 1163.

[14]	Ramsundar B, Kearnes S, Riley P, Webster D, Konerding D, Pande V. arXiv preprint arXiv: 1502.02072, 2015.

[15]	Duvenaud D, Maclaurin D, Aguilera-Iparraguirre J, Gómez-Bombarelli R, Hirzel T, Aspuru-Guzik A, Adams R P. arXiv preprint arXiv: 1509.09292, 2015.

[16]	Goh G B, Siegel C, Vishnu A, Hodas N O, Baker N. arXiv preprint arXiv: 1706.06689, 2017.

[17]	Chen M Y, Chiang H S, Sangaiah A K, Hsieh T C. Neural Comput. Appl., 2020, 32(12): 7915.

[18]	Hochreiter S, Schmidhuber J. Neural Comput., 1997, 9(8): 1735.

[19]	Chung J, Gulcehre C, Cho K, Bengio Y. arXiv preprint arXiv: 1412.3555, 2014.

[20]	Goh G B, Hodas N O, Siegel C, Vishnu A. arXiv preprint arXiv: 1712.02034, 2017.

[21]	Mayr A, Klambauer G, Unterthiner T, Steijaert M, Wegner J K, Ceulemans H, Clevert D A, Hochreiter S. Chem. Sci., 2018, 9(24): 5441.

[22]	Wu Z H, Pan S R, Chen F W, Long G D, Zhang C Q, Yu P S. IEEE Trans. Neural Netw. Learn. Syst., 2021, 32(1): 4.

[23]	Schütt K T, Kindermans P J, Sauceda H E, Chmiela S, Tkatchenko A, Müller K R. arXiv preprint arXiv: 1706.08566, 2017.

[24]	Feinberg E N, Sur D, Wu Z Q, Husic B E, Mai H H, Li Y, Sun S S, Yang J Y, Ramsundar B, Pande V S. ACS Cent. Sci., 2018, 4(11): 1520.

[25]	Gasteiger J, Groß J, Günnemann S. arXiv preprint arXiv: 2003.03123, 2020.

[26]	Goodfellow I J, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Proceedings of the 27th International Conference on Neural Information Processing Systems, 2014, 2672.

[27]	Guimaraes G L, Sanchez-Lengeling B, Outeiral C, Farias P L C, Aspuru-Guzik A. arXiv preprint arXiv: 1705.10843, 2017.

[28]	Sanchez-Lengeling B, Outeiral C, Guimaraes G L, Aspuru-Guzik A. chemRxiv preprint: 10.26434/chemrxiv.5309668.v3, 2017.

[29]	Kingma D P, Welling M. arXiv preprint arXiv: 1312.6114, 2013.

[30]	Kingma D P, Welling M. Found. Trends^® Mach. Learn., 2019, 12(4): 307.

[31]	Kusner M J, Paige B, Hernández-Lobato J M. International Conference on Machine Learning. PMLR, 2017, 1945.

[32]	Dai H J, Tian Y T, Dai B, Skiena S, Song L. arXiv preprint arXiv: 1802.08786, 2018.

[33]	Saharia C, Chan W, Chang H, Lee C A, Ho J, Salimans T, Fleet D J, Norouzi M. arXiv preprint arXiv: 2111.05826, 2021.

[34]	Hoogeboom E, Satorras V G, Vignac C, Welling M. International Conference on Machine Learning. PMLR, 2022: 8867.

[35]	Luo S, Su Y, Peng X, Wang S, Peng J, Ma J. bioRxiv, 2022, 10.499510.

[36]	Tachibana H, Go M, Inahara M, Katayama Y, Watanabe Y. arXiv preprint arXiv: 2112.13339, 2021.

[37]	Watson J L, Juergens, Bennett N R, Trippe B L, Yim, Eisenach H E, Ahern, Borst A J, Ragotte R J, Milles L F, Wicky B I M, Hanikel, Pellock S J, Courbet, Sheffler, Wang, Venkatesh, Sappington, Torres S V, Lauko, De Bortoli V, Mathieu, Ovchinnikov, Barzilay, Jaakkola T S, DiMaio, Baek, Baker. Nature, 2023, 620: 1089.

[38]	Yeh A H W, Norn C, Kipnis Y, Tischer D, Pellock S J, Evans D, Ma P C, Lee G R, Zhang J Z, Anishchenko I, Coventry B, Cao L X, Dauparas J, Halabiya S, DeWitt M, Carter L, Houk K N, Baker D. Nature, 2023, 614(7949): 774.

[39]	Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, Davison J, Shleifer S, Platen P, Ma C, Jernite Y, Plu J, Xu C W, Scao T L, Gugger S, Drame M, Lhoest Q, Rush A M. arXiv preprint arXiv: 1910.03771, 2019.

[40]	Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L, Polosukhin I. arXiv preprint arXiv: 1706.03762, 2017.

[41]	Radford A, Narasimhan K, Salimans T, Sutskever I. OpenAI: https://openai.com/blog/language-unsupervised/, 2018.

[42]	Devlin J, Chang M W, Lee K, Toutanova. Proceedings of NAACL-HLT, 2019, 4171-4186.

[43]	Thorp H H. Science, 2023, 379(6630): 313.

[44]	OpenAI. arXiv preprint arXiv: 2303.08774, 2023.

[45]	Wang S, Guo Y Z, Wang Y H, Sun H M, Huang J Z. Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Niagara Falls NY USA. New York, NY, USA: ACM, 2019, 429.

[46]	Honda S, Shi S, Ueda H R. arXiv preprint arXiv: 1911.04738, 2019.

[47]	Kim S, Chen J, Cheng T J, Gindulyte A, He J, He S Q, Li Q L, Shoemaker B A, Thiessen P A, Yu B, Zaslavsky L, Zhang J, Bolton E E. Nucleic Acids Res., 2021, 49(D1): D1388.

[48]	Gaulton A, Hersey A, Nowotka M, Bento A P, Chambers J, Mendez D, Mutowo P, Atkinson F, Bellis L J, Cibrián-Uhalte E, Davies M, Dedman N, Karlsson A, Magariños M P, Overington J P, Papadatos G, Smit I, Leach A R. Nucleic Acids Res., 2017, 45(D1): D945.

[49]	Irwin J J, Tang K G, Young J, Dandarchuluun C, Wong B R, Khurelbaatar M, Moroz Y S, Mayfield J, Sayle R A. J. Chem. Inf. Model., 2020, 60(12): 6065.

[50]	Schneider N, Sayle R A, Landrum G A. J. Chem. Inf. Model., 2015, 55(10): 2111.

[51]	Ramsundar B, Eastman P, Walters P, Pande V. Drug Discovery and More. 1st ed. CA; O’Reilly Media: Sebastopol, 2019.

[52]	Schneider N, Fechner N, Landrum G A, Stiefl N. J. Chem. Inf. Model., 2017, 57(8): 1816.

[53]	Czodrowski P, Bolick W G. J. Chem. Inf. Model., 2016, 56(10): 2013.

[54]	Xue L, Bajorath J. Comb. Chem. High Throughput Screen., 2000, 3(5): 363.

[55]	Redkar S, Mondal S, Joseph A, Hareesha K S. Mol. Inform., 2020, 39(5): 1900062.

[56]	Rifaioglu A S, Nalbat E, Atalay V, Martin M J, Cetin-Atalay R, Doğan T. Chem. Sci., 2020, 11(9): 2531.

[57]	Stokes J M, Yang K, Swanson K, Jin W G, Cubillos-Ruiz A, Donghia N M, MacNair C R, French S, Carfrae L A, Bloom-Ackermann Z, Tran V M, Chiappino-Pepe A, Badran A H, Andrews I W, Chory E J, Church G M, Brown E D, Jaakkola T S, Barzilay R, Collins J J. Cell, 2020, 181(2): 475.

[58]	Shin B, Park S, Kang K, Ho J C. arXiv preprint arXiv: 1908.06760, 2019.

[59]	Öztürk H, Özgür A, Ozkirimli E. Bioinformatics, 2018, 34(17): i821.

[60]	Abnousi A, Broschat S L, Kalyanaraman A. BMC Bioinform., 2018, 19(1): 83.

[61]	Lee I, Keum J, Nam H. PLoS Comput. Biol., 2019, 15(6): e1007129.

[62]	Zhao Q, Xing F, Tao Y Y, Liu H L, Huang K, Peng Y, Feng N P, Liu C H. Front. Pharmacol., 2020, 11: 1.

[63]	Gao K F, Nguyen D D, Sresht V, Mathiowetz A M, Tu M H, Wei G W. Phys. Chem. Chem. Phys., 2020, 22(16): 8373.

[64]	Wang Y B, You Z H, Yang S, Yi H C, Chen Z H, Zheng K. BMC Med. Inform. Decis. Mak., 2020, 20(2): 49.

[65]	David L, Thakkar A, Mercado R, Engkvist O. J. Cheminformatics, 2020, 12(1): 56.

[66]	Bian Y, Xie X Q. J. Chem. Inf. Model., 2021, 27(3): 1.

[67]	Xiong Z P, Wang D Y, Liu X H, Zhong F S, Wan X Z, Li X T, Li Z J, Luo X M, Chen K X, Jiang H L, Zheng M Y. J. Med. Chem., 2020, 63(16): 8749.

[68]	Ren T, Zhang H D, Shi Y, Luo X M, Zhou S Q. J. Mol. Graph. Model., 2023, 119: 108401.

[69]	Liu Y W, Zhang R S, Li T F, Jiang J, Ma J, Wang P. J. Mol. Graph. Model., 2023, 118: 108344.

[70]	Vipul M, Karoline B, Rafiqul G, Venkat V. Fluid Phase Equilib., 2022, 561: 113531.

[71]	Lv Q J, Chen G X, Zhao L, Zhong W H, Chen C Y C. Brief Bioinform., 2021, 22(6): bbab317.

[72]	Ji Z W, Shi R H, Lu J R, Li F, Yang Y. J. Chem. Inf. Model., 2022, 62(22): 5361.

[73]	Vázquez J, López M, Gibert E, Herrero E, Luque F J. Molecules, 2020, 25(20): 4723.

[74]	Yuan W, Chen G, Chen C Y C. Brief Bioinform., 2022, 23(1): bbab506.

[75]	Shao K H, Zhang Y H, Wen Y Q, Zhang Z N, He S, Bo X C. Brief Bioinform., 2022, 23(3): bbac109.

[76]	Li F, Zhang Z Q, Guan J H, Zhou S G. Bioinformatics, 2022, 38(14): 3582.

[77]	Gardner C R, Walsh C T, Almarsson Ö. Nat. Rev. Drug Discov., 2004, 3(11): 926.

[78]	Tung H H. Org. Process Res. Dev., 2013, 17(3): 445.

[79]	Waknis V, Chu E, Schlam R, Sidorenko A, Badawy S, Yin S, Narang A S. Pharm. Res., 2014, 31(1): 160.

[80]	Dandekar P, Kuvadia Z B, Doherty M F. Annu. Rev. Mater. Res., 2013, 43: 359.

[81]	Sun G X, Liu X T, Abramov Y A, Nilsson Lill S O, Chang C, Burger V, Broo A. Cryst. Growth Des., 2021, 21(4): 1972.

[82]	Bhardwaj R M, Johnston A, Johnston B F, Florence A J. CrystEngComm, 2015, 17(23): 4272.

[83]	Wicker J G P, Cooper R I. CrystEngComm, 2015, 17(9): 1927.

[84]	Pillong M, Marx C, Piechon P, Wicker J G P, Cooper R I, Wagner T. CrystEngComm, 2017, 19(27): 3737.

[85]	Yang M J, Dybeck E, Sun G X, Peng C W, Samas B, Burger V M, Zeng Q, Jin Y D, Bellucci M A, Liu Y, Zhang P Y, Ma J, Alan Jiang Y, Hancock B C, Wen S H, Wood G P F. Cryst. Growth Des., 2020, 20(8): 5211.

[86]	Wilkinson M R, Martinez-Hernandez U, Huggon L K, Wilson C C, Castro Dominguez B. CrystEngComm, 2022, 24(43): 7545.

[87]	Mater A C, Coote M L. J. Chem. Inf. Model., 2019, 59(6): 2545.

[88]	Wu Z Q, Ramsundar B, Feinberg E N, Gomes J, Geniesse C, Pappu A S, Leswing K, Pande V. Chem. Sci., 2018, 9(2): 513.

[89]	Jiang J, Zhang R S, Yuan Y N, Li T F, Li G L, Zhao Z L, Yu Z X. J. Mol. Graph. Model., 2023, 121: 108454.

[90]	Li C Y, Feng J H, Liu S H, Yao J F. Comput. Intell. Neurosci., 2022, 2022: 8464452.

[91]	Maziarka Ł, Danel T, Mucha S, Rataj K, Tabor J, Jastrzębski S. arXiv preprint arXiv: 2002.08264, 2020.

[92]	Liu H, Huang Y B, Liu X J, Deng L. Brief Bioinform., 2022, 23(5): bbac303.

[93]	Liu C Y, Sun Y, Davis R, Cardona S T, Hu P Z. J. Cheminformatics, 2023, 15(1): 29.

[94]	Hawkins P C D. J. Chem. Inf. Model., 2017, 57(8): 1747.

[95]	Zhou G M, Gao Z F, Wei Z W, Zheng H, Ke G L. arXiv preprint arXiv: 2302.07061, 2023.

[96]	Mansimov E, Mahmood O, Kang S, Cho K. Sci. Rep., 2019, 9: 20381.

[97]	Xu M K, Luo S T, Bengio Y, Peng J, Tang J. arXiv preprint arXiv: 2102.10240, 2021.

[98]	Xu M K, Wang W J, Luo S T, Shi C, Bengio Y, Gomez-Bombarelli R, Tang J. International Conference on Machine Learning. PMLR, 2021: 11537-11547.

[99]	Zhu J H, Xia Y C, Liu C, Wu L J, Xie S F, Wang T, Wang Y S, Zhou W G, Qin T, Li H Q, Liu T Y. arXiv preprint arXiv: 2202.01356, 2022.

[100]	Xu M K, Yu L T, Song, Shi, Ermon, Tang. arXiv preprint arXiv: 2203.02923, 2022.

[101]	Polishchuk P G, Madzhidov T I, Varnek. J. Comput. Aided Mol. Des., 2013, 27(8): 675.

[102]	Kim, Thiessen P A, Bolton E E, Chen, Fu, Gindulyte, Han L Y, He S Q, Shoemaker B A, Wang J Y, Yu, Zhang, Bryant S H. Nucleic Acids Res., 2016, 44(D1): D1202.

[103]	Segler M H S, Kogej, Tyrchan, Waller M P. ACS Cent. Sci., 2018, 4(1): 120.

[104]	Yang Y X, Zhang R K, Li Z J, Mei L H, Wan S L, Ding, Chen Z F, Xing, Feng H J, Han, Jiang H L, Zheng M Y, Luo, Zhou. J. Med. Chem., 2020, 63(3): 1337.

[105]	Bagal, Aggarwal, Vinod P K, Priyakumar U D. chemRxiv preprint: 10.26434/chemrxiv.14561901.v1, 2021.

[106]	Bagal, Aggarwal, Vinod P K, Deva Priyakumar. J. Chem. Inf. Model., 2022, 62(9): 2064.

[107]	Wang W L, Wang, Zhao H G, Sciabola. arXiv preprint arXiv: 2210.08749, 2022.

[108]	You J X, Liu B W, Ying, Pande, Leskovec. arXiv preprint arXiv: 1806.02473, 2018.

[109]	Samanta, De, Jana, Chattaraj P K, Ganguly, Gomez-Rodriguez. arXiv preprint arXiv: 1802.05283, 2018.

[110]	Li X S, Liu, Lu, Hua X S, Chi, Xia K L. Brief Bioinform., 2022, 23(4): bbac231.
