Machine Learning Assisted Molecule Design of Fuel

Zhang Xiangwen; Hou Fang; Liu Ruichen; Wang Li; Li Guozhu

doi:10.7536/PC230911

Progress in Chemistry, 2024, Vol. 36, Issue 4: 471-485


Review


Zhang Xiangwen ¹,²,³,
Hou Fang ¹,
Liu Ruichen ¹,
Wang Li ¹,²,³,
Li Guozhu ¹,²,³,*


¹ School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China
² Key Laboratory for Advanced Fuel and Chemical Propellant of Ministry of Education, Tianjin 300072, China
³ Haihe Laboratory of Green Creation and Manufacture of Matter, Tianjin 300192, China

*e-mail: gzli@tju.edu.cn

Received date: 2023-09-26

Revised date: 2023-12-10

Online published: 2024-02-26

Supported by

National Natural Science Foundation of China (22178248)


Abstract

Theoretical design has long been a focus of fuel research in propulsion technology. It avoids much of the complexity and potential danger of experiments and guides the experimental synthesis of fuels, and its predictions can in turn be verified by experimental results. A new generation of fuels is therefore expected to be designed efficiently for subsequent synthesis and application. However, traditional theoretical calculation methods, such as the group contribution method and quantum chemical methods, suffer from low accuracy and efficiency. Machine learning, a rapidly developing class of algorithms, has opened a new route to the design of potential high-energy fuels and shows strong capabilities in both property prediction and molecule design. In this review, several fuel molecule descriptors for machine learning are introduced, and different machine learning models for fuel property prediction and molecule design are outlined. Research on machine learning assisted property prediction and on new molecule design of fuels is then summarized in turn. Finally, the challenges and future development of machine learning in fuel design are discussed.

Contents

1 Introduction

2 Fuel molecule description method

2.1 Molecular fingerprinting based on SMILES

2.2 Coulomb matrix

2.3 Continuous operable molecular entry specification

2.4 Molecule graph

3 Machine learning model

3.1 Model for fuel property prediction

3.2 Model for fuel molecule generation

4 Fuel property prediction

4.1 Single fuel property prediction

4.2 Multiple fuel properties prediction

5 Design of new fuel molecules

5.1 High throughput screening of fuel molecules

5.2 Reverse design of new fuel molecules

6 Conclusion and outlook

Key words: fuel; machine learning; molecule description; property prediction; molecule design; high-throughput screening

Cite this article

Zhang Xiangwen, Hou Fang, Liu Ruichen, Wang Li, Li Guozhu. Machine Learning Assisted Molecule Design of Fuel[J]. Progress in Chemistry, 2024, 36(4): 471-485. DOI: 10.7536/PC230911

1 Introduction

Fuel is a substance that releases energy through combustion. Fuels can be classified as solid, liquid or gaseous according to their physical state, and as fossil or biomass fuels according to the source of their raw materials. They are widely used in industrial, civil, national defense, military and aerospace fields^[1~5]. The rapid development of downstream applications, the demand for environmentally friendly products, and the advance of the energy revolution and the "double carbon" strategy have placed new and higher requirements on fuel performance, making the development of new fuels increasingly urgent. However, the diversity of molecular structures of fuel compounds and the complexity of experimental synthesis have greatly limited the development of next-generation fuels^[6]. Theoretical design in the early stage of research and development is therefore essential. It is the starting point of new fuel research and development, covers molecular structure design, parameter selection and property calculation, and plays a decisive role in the subsequent synthesis and evaluation of fuels. Theoretical design of fuels rests on the molecular structure-property relationship, which allows the properties of target fuel molecules or intermediate products to be evaluated before synthesis. It supplies more effective candidate structures for the experimental synthesis of fuels that meet specific needs, improves the efficiency of fuel design in terms of both "quantity" and "quality", and avoids much of the repetitive, redundant and even dangerous experimental work of the traditional trial-and-error approach^[7]. Calculation methods based on quantitative structure-property relationships, such as empirical equations, the group contribution method and quantum chemical calculation, have emerged accordingly^[8~12].

Machine learning (ML), an emerging branch of artificial intelligence, has developed rapidly in fields such as text detection, image recognition, language translation and autonomous driving. In recent years it has been widely used for the prediction of molecular properties and the design of molecules, and its use has gradually spread from drug molecules to other classes of materials, effectively driving the development of materials science, greatly reducing the cost of developing new materials and improving efficiency^[13~19]^[20~24]. Machine learning likewise offers new ideas for fuel property prediction and theoretical design. Compared with traditional calculation methods, it shows strong advantages in large-scale data processing, computing power and reverse design, and can better meet the needs of theoretical design of new fuels. Researchers at home and abroad are paying increasing attention to machine learning assisted theoretical design of fuels, and our research group has also carried out work in this area^[25~28].

Theoretical research on machine learning assisted fuel design mainly comprises two stages: fuel property prediction and new fuel molecule design. The basic flow of fuel property prediction by machine learning is shown in Fig. 1a. First, a database containing the molecular structures of compounds and the corresponding fuel properties is collected, curated and established. The data come mainly from experimental synthesis and testing and from theoretical calculation, and can be obtained from published literature, open-source databases, in-house experiments and software. The data set is then divided into a training set, a validation set and a test set in chosen proportions. The fuel molecular structures are converted into a molecular description that a machine learning model can recognize and are fed into the model for training. After several rounds of optimization, a fuel property prediction model with a small prediction error is obtained. Finally, the target property of a new molecule is predicted by inputting its structure into the model. The basic process of reverse design of new fuel molecules by machine learning is shown in Fig. 1b and follows two main paths. The first is high-throughput screening based on an existing molecular structure library: the structures in a large open-source library are fed into the trained and optimized property prediction model to quickly build a large "molecular structure-fuel property" database, and thresholds are then set on one or more properties to quickly screen out the structures that meet the target requirements. The second is to design molecules from scratch. Molecular structures are first encoded into a molecular description that machine learning can recognize, and a deep generative model that can produce new molecules according to fuel properties is trained on these descriptions. The output of the model is then decoded back into molecular structures to give the required new molecules. Both design paths show that accurate prediction of fuel properties is the basis for the design of new fuel molecules.


Fig. 1 Flow of machine learning application in (a) the property prediction and (b) molecule design of fuel
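The data-splitting step of the property prediction flow in Fig. 1a can be sketched as follows. This is a minimal illustration and not the procedure of any cited study: the function name `split_dataset`, the 80/10/10 proportions, the fixed random seed and the placeholder records are all assumptions made for the example.

```python
import random

def split_dataset(records, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle records and split them into training, validation and test sets.

    The fractions are illustrative defaults (an assumption); the text does
    not fix the proportions.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # everything left over
    return train, val, test

# Hypothetical (molecule id, density in g/cm^3) pairs, for illustration only.
data = [(f"mol_{i}", 0.70 + 0.001 * i) for i in range(100)]
train, val, test = split_dataset(data)
print(len(train), len(val), len(test))
```

The model is fitted on the training set, tuned on the validation set, and its final error is reported on the test set, which it has never seen.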

Although machine learning is increasingly used in research on the theoretical design of fuels, few systematic reviews focus on its application to fuels; most published reviews address its application in materials science or molecular science more broadly. Following the sequence of machine learning assisted fuel property prediction and molecule design, this review first introduces several commonly used fuel molecule description methods and classifies the different machine learning models. The state of research on machine learning prediction of single and multiple fuel properties is then summarized. Research on the design of new fuel molecules by machine learning is summarized from two perspectives: high-throughput screening and reverse design of new molecules. Finally, based on current progress, the prospects for machine learning in the fuel field are discussed.

2 Fuel molecule description method

Effectively extracting the structural features of molecules and transforming them into specific mathematical descriptions is a key step in data-driven materials and molecular science^[29]. Whether machine learning is used to predict fuel properties or to generate new fuel molecular structures, discrete molecular structures must be transformed into mathematical descriptions that machine learning models can recognize. In the research reported so far, the main description methods for fuel molecules are molecular fingerprints based on SMILES, the Coulomb matrix, the continuous operable molecular entry specification and the molecule graph.

2.1 Molecular fingerprinting based on SMILES

The Simplified Molecular Input Line Entry System (SMILES) describes the structure of a molecule with an ASCII string; a simple example is shown in Fig. 2a^[30]. For canonical SMILES, each SMILES string corresponds to a unique molecular structure, and each molecular structure corresponds to a unique SMILES string. SMILES is a one-dimensional linear sequence that can be converted relatively easily into molecular fingerprints that describe molecular structure, including extended-connectivity fingerprints (ECFPs) and functional-class fingerprints (FCFPs)^[31]. An extended-connectivity fingerprint first assigns an integer identifier to each heavy atom. Then, with each heavy atom as the center, the surrounding shell of heavy atoms is merged in, one shell at a time, until the specified radius is reached. Finally, the resulting substructures are hashed to generate the feature sequence. Functional-class fingerprints are more generalized: atoms with the same functional role are treated as the same feature. RDKit (http://www.rdkit.org) is an open-source cheminformatics toolkit with a Python interface that can convert SMILES into various molecular fingerprints, and it can be used to generate a variety of fuel molecule descriptions.


Fig. 2 Different molecule descriptors representing the molecular structure of fuels: (a) SMILES; (b) Coulomb matrix^[25]; (c) Continuous operable molecular entry specification^[27]; (d) Molecule graph^[35]
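The three steps of a circular fingerprint (initial atom identifiers, shell-by-shell growth, hashing into a feature sequence) can be illustrated with a deliberately simplified sketch. This is a toy, not the ECFP algorithm as implemented in RDKit or any other toolkit: the function `circular_fingerprint`, the string identifiers, the choice of SHA-256 and the 64-bit length are assumptions made for illustration, and the molecule is supplied as a hand-written heavy-atom graph rather than parsed from SMILES.

```python
import hashlib

def circular_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """Toy ECFP-like fingerprint over a heavy-atom graph.

    atoms: list of element symbols; bonds: list of (i, j) index pairs.
    Identifiers are strings here for readability; real ECFPs use integers
    and richer atom invariants.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # Step 1: initial identifier per heavy atom (element + number of neighbors).
    ids = [f"{sym}{len(neighbors[i])}" for i, sym in enumerate(atoms)]
    features = set(ids)

    # Step 2: grow each atom's environment by one shell per iteration.
    for _ in range(radius):
        ids = [
            ids[i] + "(" + ",".join(sorted(ids[j] for j in neighbors[i])) + ")"
            for i in range(len(atoms))
        ]
        features.update(ids)

    # Step 3: hash every substructure identifier into a fixed-length bit vector.
    bits = [0] * n_bits
    for feature in features:
        digest = hashlib.sha256(feature.encode("utf-8")).hexdigest()
        bits[int(digest, 16) % n_bits] = 1
    return bits

# Heavy atoms of ethanol (SMILES "CCO"), written in two different atom orders.
fp_a = circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
fp_b = circular_fingerprint(["O", "C", "C"], [(0, 1), (1, 2)])
print(fp_a == fp_b)
```

Because neighbor identifiers are sorted before they are merged, the fingerprint does not depend on the order in which the atoms are listed, which is the property that makes such fingerprints usable as model inputs.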

2.2 Coulomb matrix

The Coulomb matrix (CM) is a two-dimensional matrix that jointly encodes the Cartesian coordinates and nuclear charges of the atoms in a molecule^[32,33]. Its calculation formula is shown in Fig. 2b^[25]. The off-diagonal elements correspond to the Coulomb repulsion between different atoms i and j in the molecule, while the diagonal elements correspond to the atomization energy of each individual atom. The Coulomb matrix can be further reduced to its eigenvalues, which lowers the representation from two dimensions to one. Because the Coulomb matrix and its eigenvalues contain both the spatial structure of the molecule and its nuclear charge information, they are well suited as inputs for predicting energy-related molecular properties.
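A minimal sketch of the Coulomb matrix and its eigenvalue spectrum is given below. The formula in Fig. 2b is not reproduced in the text, so the sketch assumes the form commonly used in the Coulomb matrix literature: diagonal elements 0.5·Z_i^2.4 and off-diagonal elements Z_i·Z_j/|R_i − R_j|. The function names and the two-atom geometry are illustrative assumptions; NumPy is used for the linear algebra.

```python
import numpy as np

def coulomb_matrix(charges, coords):
    """Coulomb matrix: 0.5*Z_i**2.4 on the diagonal, Z_i*Z_j/|R_i-R_j| elsewhere."""
    z = np.asarray(charges, dtype=float)
    r = np.asarray(coords, dtype=float)
    n = len(z)
    m = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                m[i, j] = 0.5 * z[i] ** 2.4
            else:
                m[i, j] = z[i] * z[j] / np.linalg.norm(r[i] - r[j])
    return m

def coulomb_eigenvalues(charges, coords):
    """Eigenvalue spectrum sorted in descending order (a one-dimensional descriptor)."""
    eig = np.linalg.eigvalsh(coulomb_matrix(charges, coords))
    return np.sort(eig)[::-1]

# Two hydrogen nuclei placed 2.0 length units apart (illustrative geometry).
z = [1, 1]
r = [[0.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
print(coulomb_matrix(z, r))
print(coulomb_eigenvalues(z, r))
```

Sorting the eigenvalues removes the dependence on atom ordering, at the cost of discarding some of the information held in the full matrix.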

2.3 Continuous operable molecular entry specification

The Continuous Operable Molecular Entry Specification (COMES) is a continuous multidimensional vector generated by converting the SMILES string of a molecule with a variational autoencoder, and it effectively represents the structural information of the molecule^[27]. Following the deep learning method reported by Aspuru-Guzik et al., the trained and optimized variational autoencoder reversibly transforms discrete molecular structures into multidimensional continuous vectors, with a one-to-one correspondence between structure and vector, as shown in Fig. 2c^[34]. Because COMES turns a discrete molecular structure into a continuous, differentiable multidimensional vector in a reversible way, it is most widely used in molecule generation models.

2.4 Molecule graph

A molecule graph represents a molecular structure in the form of a graph^[35]. Atoms are generally represented as nodes and the chemical bonds between them as edges. Hydrogen atoms are often omitted, and the result is a complete labeled graph of the molecular structure, as shown in Fig. 2d. The molecule graph representation is mainly used with graph neural network models.
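The construction of a heavy-atom molecule graph can be sketched in a few lines. The function name `build_molecule_graph` and the hand-written atom and bond lists are assumptions for illustration; in practice a cheminformatics toolkit would derive them from a SMILES string.

```python
def build_molecule_graph(atoms, bonds, drop_hydrogens=True):
    """Return (nodes, edges): nodes map atom index -> element, edges are index pairs."""
    keep = {i for i, sym in enumerate(atoms)
            if not (drop_hydrogens and sym == "H")}
    nodes = {i: atoms[i] for i in sorted(keep)}
    # An edge survives only if both of its atoms are kept.
    edges = [(a, b) for a, b in bonds if a in keep and b in keep]
    return nodes, edges

# Ethanol, CH3-CH2-OH, with explicit hydrogens.
atoms = ["C", "C", "O", "H", "H", "H", "H", "H", "H"]
bonds = [(0, 1), (1, 2), (0, 3), (0, 4), (0, 5), (1, 6), (1, 7), (2, 8)]
nodes, edges = build_molecule_graph(atoms, bonds)
print(nodes, edges)
```

With hydrogens dropped, the nine-atom molecule reduces to a three-node, two-edge graph, which is the form a graph neural network would typically take as input.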

In addition to the above descriptions derived from molecular structure, some studies use other physical and chemical properties of the molecules (properties other than the prediction target), the percentage composition of mixtures, or other numerical combinations as model inputs to predict the target properties^[36~38]. In essence, this approach explores and establishes mathematical relationships between different properties.

3 Machine learning model

According to their purpose in the fuel field, machine learning models can be divided into two categories. The first is fuel property prediction models, mainly linear regression, artificial neural networks, support vector machines and decision trees, together with ensemble learning that combines multiple models. The second is fuel molecule generation models, mainly the variational autoencoder and the generative adversarial network. The different machine learning models are briefly introduced below.

3.1 Model for fuel property prediction

3.1.1 Linear regression

Linear regression is one of the simplest machine learning models. In its simplest form it establishes a linear relationship between a target property y and a single influencing factor x, that is, y = wx + b, where w and b are the parameters of the model. Because a target property is usually affected by more than one factor, multiple linear regression (MLR) is needed to establish the linear relationship between the target property and several independent variables. Linear regression has an explicit formula with explicit parameters and is usually applied to property prediction problems with small, simple data sets^[39].
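For the single-variable case y = wx + b, the least-squares parameters have a closed form, sketched below. The function name `fit_line` and the synthetic data are assumptions for illustration; the data are generated from y = 2x + 1 so that the recovered parameters can be checked by eye.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = w*x + b with a single input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    w = s_xy / s_xx            # slope
    b = mean_y - w * mean_x    # intercept
    return w, b

# Synthetic points lying exactly on y = 2x + 1 (illustrative, not fuel data).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(w, b)
```

Multiple linear regression generalizes the same idea to a weight vector, usually solved with a linear-algebra routine rather than by hand.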

3.1.2 Artificial neural network

The artificial neural network (ANN) is a classical machine learning model that abstracts the neural network of the human brain from the perspective of information processing and forms different network structures through different connection patterns^[40]. Artificial neural networks have many variants according to their uses and algorithms. Those currently used to assist fuel property prediction are mainly single-layer neural networks, deep neural networks, convolutional neural networks and graph neural networks.

3.1.2.1 Single layer neural network

The single-layer neural network (SLNN) is a simple neural network model composed of visible layers and one hidden layer^[32]. The visible layers are the input layer and the output layer, which take in molecular structure information and output target properties, respectively. The hidden layer has no direct connection to the outside of the network; it carries out the internal calculation through adjustable parameter weights. The nntool toolbox in MATLAB provides modular functions for building single-layer neural networks, with a choice of training algorithms including Levenberg-Marquardt.

3.1.2.2 Deep neural network

A deep neural network (DNN) is composed of an input layer, hidden layers and an output layer, like an ordinary neural network. The difference is that a DNN has more hidden layers and more complex parameters^[41]. In 2006, Hinton et al. used pre-training to alleviate the problem of local optima and successfully extended the hidden layers of a neural network to 7 layers^[42]. Neural networks have had real "depth" since then, and the advantage of depth is a stronger ability to compute and process data. There is no clear definition of the "depth" of a deep neural network, that is, no fixed number of hidden layers above which a network counts as deep.

3.1.2.3 Convolutional neural network

The convolutional neural network (CNN) is a feed-forward neural network that involves convolution operations and has a deep structure^[43]. A typical CNN consists of convolutional layers, pooling layers and fully connected layers. A convolutional layer extracts features from the input data and contains multiple convolution kernels; the region of the input seen by a kernel is known as its receptive field. A pooling layer reduces the dimensionality of the features (downsampling) in order to reduce the number of parameters and the amount of computation. A fully connected layer nonlinearly combines the features extracted by the convolutional and pooling layers to give the target result.
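The roles of the convolutional and pooling layers can be shown on a one-dimensional input. The sketch below is a hand-rolled illustration with a fixed kernel, not a trainable network; the function names and the numbers are assumptions. As in most deep learning libraries, the "convolution" is implemented as a sliding dot product without flipping the kernel.

```python
def conv1d(signal, kernel):
    """Slide the kernel over the signal (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def max_pool1d(features, size):
    """Non-overlapping max pooling: keep the largest value in each window."""
    return [
        max(features[i:i + size])
        for i in range(0, len(features) - size + 1, size)
    ]

signal = [1, 3, 2, 5, 4, 6]   # illustrative input sequence
kernel = [1, 0, -1]           # a simple difference filter
features = conv1d(signal, kernel)
pooled = max_pool1d(features, 2)
print(features, pooled)
```

Each output of `conv1d` depends on only three neighboring inputs (its receptive field), and pooling halves the number of features passed on to later layers.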

3.1.2.4 Graph neural network

The graph neural network (GNN) is a neural network model that learns from graph-structured data and extracts the features and patterns in it^[44,45]. A GNN preserves the symmetry information of the graph by optimizing all the attributes on the graph, and this transformation does not change the connectivity of the graph. GNNs are increasingly applied to fuel property prediction, taking the molecule graph as input to predict fuel properties^[46,47].
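One way to see how a GNN updates node attributes without altering connectivity is a single round of neighbor aggregation. The sketch below is a simplification under stated assumptions: node features are scalars, the update is a plain sum with no learned weights or nonlinearity, and the function names are hypothetical.

```python
def message_passing_step(node_features, edges):
    """One aggregation round: each node adds the features of its neighbors."""
    updated = dict(node_features)
    for a, b in edges:
        updated[a] += node_features[b]
        updated[b] += node_features[a]
    return updated

def readout(node_features):
    """Sum over nodes gives a graph-level value that ignores node ordering."""
    return sum(node_features.values())

# A three-atom chain (for example the heavy atoms of ethanol) with toy features.
features = {0: 1.0, 1: 2.0, 2: 3.0}
edges = [(0, 1), (1, 2)]
updated = message_passing_step(features, edges)
print(updated, readout(updated))

# The same chain with its nodes labeled in reverse order.
relabeled = message_passing_step({0: 3.0, 1: 2.0, 2: 1.0}, edges)
print(readout(relabeled))
```

The edge list is never modified, and the summed readout is identical for both labelings of the chain, which illustrates the symmetry-preserving behavior described above. A trained GNN would insert learned weight matrices and nonlinear activations into the update.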

3.1.3 Support vector machine

The support vector machine (SVM) is a generalized linear classifier that performs binary classification of data by supervised learning^[48,49]. Its decision boundary is the maximum-margin hyperplane of the training samples, and finding it can be formulated as a convex quadratic programming problem. A nonlinear SVM can be constructed by introducing a kernel function to solve nonlinear classification problems. Support vector machines can also be used for regression. The difference is that in classification the SVM maximizes the margin to the sample points nearest to the hyperplane, whereas in regression the margin to the sample points farthest from the hyperplane is maximized^[49,50].

3.1.4 Decision tree

The decision tree is a machine learning model that classifies data with a tree structure^[51]. A decision tree consists of multiple decision nodes and leaf nodes. Each decision node selects a branch according to a condition, while the leaf nodes represent the final classification results. A suitable classification is obtained by designing the node splits with different parameters. Decision trees can also perform regression through successive splitting, so they can be used as fuel property prediction models as well.

3.1.5 Ensemble learning model

An ensemble learning model combines several machine learning models of one or more types, including the neural networks, decision trees and support vector machines described above, to carry out tasks such as classification and prediction^[52]. For example, the random forest is an ensemble learning model that integrates many decision trees into a "forest" and uses them together to predict the final target^[53].

The prediction accuracy of machine learning models is usually evaluated with indicators such as the coefficient of determination (R²), mean absolute error (MAE), mean squared error (MSE) and root mean squared error (RMSE). The formulas for these indicators are given in Table 1, where $y_{i}$ is the true value, $\hat{y}_{i}$ is the predicted value, $\bar{y}$ is the mean of the true values and $n$ is the number of samples.

Table 1 Evaluation indexes for prediction using machine learning model

Index	Implication	Expression
R²	Coefficient of Determination	$R^{2}=1-\frac{\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}$
MAE	Mean Absolute Error	$MAE=\frac{1}{n} \sum_{i=1}^{n}\left|\hat{y}_{i}-y_{i}\right|$
MSE	Mean Squared Error	$MSE=\frac{1}{n} \sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}$
RMSE	Root Mean Square Error	$RMSE=\sqrt{\frac{1}{n} \sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}}$
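The four indicators can be computed directly from their standard definitions, as in the sketch below. The function name `regression_metrics` and the three-point example are assumptions for illustration; libraries such as scikit-learn provide equivalent routines.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R2, MAE, MSE and RMSE for paired true and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((y - mean_true) ** 2 for y in y_true)          # total sum of squares
    mae = sum(abs(p - y) for y, p in zip(y_true, y_pred)) / n
    mse = ss_res / n
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": mae,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
    }

# Toy values: the third prediction is off by one unit.
metrics = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(metrics)
```

MAE and RMSE carry the units of the predicted property (for example K for flash point), whereas R² is dimensionless, which is why the studies summarized in this review usually report R² together with one of the error measures.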

3.2 Model for fuel molecule generation

Inspired by computer vision and natural language processing, researchers have developed a variety of molecular generative models since 2017, including variational autoencoders, recurrent neural networks, reinforcement learning and generative adversarial networks^[34]^[54,55]^[56]^[18]. At present, the variational autoencoder and the generative adversarial network are widely used in the reverse design of new fuel molecules.

3.2.1 Variational autoencoder

The variational autoencoder (VAE) is a deep generative model based on the autoencoder (AE) that gains the ability to generate new data through a variational lower bound and reparameterization^[57]. A VAE consists of an encoder and a decoder, as shown in Fig. 3a. The encoder maps a molecule to a low-dimensional latent vector drawn from a Gaussian distribution, and the decoder maps the latent vector back to the input molecule. The encoder and decoder can adopt a variety of neural network architectures, including deep models such as convolutional neural networks and graph neural networks.


Fig. 3 (a) The structure of VAE and (b) the structure of GAN^[57]
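In the standard VAE formulation, which is assumed here rather than taken from the text, the encoder outputs a mean and a log-variance for each latent dimension, a latent vector is sampled through the reparameterization z = μ + σ·ε, and a Kullback-Leibler term keeps the latent distribution close to a standard Gaussian. The sketch below shows only these two pieces; the function names and the numbers are illustrative, and the encoder and decoder networks themselves are omitted.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one draw per dimension."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]

def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, sigma^2) to N(0, 1), summed over dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )

rng = random.Random(0)
mu = [0.5, -1.0]          # hypothetical encoder outputs for one molecule
log_var = [0.0, -2.0]
z = reparameterize(mu, log_var, rng)
print(z, kl_to_standard_normal(mu, log_var))
```

Writing the sample as a deterministic function of μ, σ and an external noise term is what allows gradients to flow back into the encoder during training.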

3.2.2 Generative adversarial network

Unlike the variational autoencoder, the generative adversarial network (GAN) does not use an explicit probability density function; instead, a generator and a discriminator form an adversarial training framework^[57~59]. The structure of the GAN is shown in Fig. 3b. The generator is trained to produce molecules that the discriminator cannot tell from real ones, so that generated samples pass for real. The discriminator is trained to distinguish real data from generated data as well as possible. Through continued training the generator and the discriminator play a zero-sum game until the two reach a Nash equilibrium, at which point the generator produces new molecular structures that meet the requirements.

4 Fuel property prediction

4.1 Single fuel property prediction

Machine learning has now achieved accurate prediction of a range of fuel properties, effectively accelerating the evaluation of fuel performance. These properties mainly include density, flash point, viscosity, calorific value and cetane number. Representative studies are introduced below.

4.1.1 Density

Density is the mass per unit volume of a substance and is a key indicator for evaluating the specific energy of a fuel. Available data show a positive correlation between the density and the volumetric calorific value of hydrocarbon fuels^[60]. For example, tetrahydronorbornadiene dimer (RJ-5) is the highest-density liquid hydrocarbon fuel publicly reported so far, with a density of up to 1.08 g·cm^-3 and a volumetric calorific value of 44.9 MJ·L^-1^[61]. Accurate prediction of fuel density is therefore essential, and machine learning is playing a growing role in it. Yang et al. measured the densities of 69 diesel blends composed of 12 hydrocarbons in different proportions and correlated the mass percentages of the components with the density of the blends using models such as multiple linear regression and artificial neural networks^[36]. The best model, a GRNN, predicted density on the full data set with a coefficient of determination (R²) of 0.98 and a mean absolute error (MAE) of 0.003 g·cm^-3. Hall et al. proposed a Gaussian process regression model for predicting jet fuel density^[62]. The densities were measured over a temperature range of -40 to 140 ℃. To evaluate the effect of synthetic fuels on predictive ability, 12 synthetic fuels were added to the training data of 54 conventional fuels. The results show that introducing synthetic fuel data improves the accuracy of the predicted density.
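The link between density and volumetric calorific value follows from a unit identity: a heating value per unit mass multiplied by the density gives a heating value per unit volume. The short calculation below applies this to the RJ-5 figures quoted above. The function names are hypothetical, and the mass-based value it prints is derived from those two figures, not reported in the source.

```python
def volumetric_heating_value(mass_heating_value_mj_per_kg, density_g_per_cm3):
    """MJ/kg times g/cm^3 gives MJ/L, because 1 g/cm^3 equals 1 kg/L."""
    return mass_heating_value_mj_per_kg * density_g_per_cm3

def implied_mass_heating_value(volumetric_mj_per_l, density_g_per_cm3):
    """Invert the relation to recover the heating value per unit mass."""
    return volumetric_mj_per_l / density_g_per_cm3

# RJ-5: density 1.08 g/cm^3 and volumetric calorific value 44.9 MJ/L.
rj5_mass_value = implied_mass_heating_value(44.9, 1.08)
print(round(rj5_mass_value, 1), "MJ/kg (derived)")
```

For fuels with similar heating values per unit mass, a higher density therefore translates directly into more energy per litre of tank volume.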

4.1.2 Flash point

Flash point (FP) is the lowest temperature at which the vapor produced by a fuel ignites in air and is the key index for evaluating the flammability of a fuel. Fuels with high flash points are easier to store and transport. In 2020, Sun et al. collected a flash point data set of 10,575 compounds and evaluated two deep graph neural network models, the message-passing neural network (MPNN) and the graph convolutional neural network (GCNN), for predicting the flash point of compounds^[63]. The optimized MPNN predicted flash point on the full data set with an MAE of 18.76 K and an R² of 0.83, as shown in Fig. 4. Machine learning assisted prediction of the flash point of mixed fuels has also been studied, and several feature descriptions have been established as inputs for mixtures^[64]^[65~67]. Aljaman et al. decomposed fuel molecular structures into 11 types of functional groups and used the mass fractions of the functional groups and of the different components as inputs to the machine learning model^[68]. Two neural network models, developed in MATLAB and Keras, respectively, were then used to predict the flash points of 788 oxygenated petroleum-based fuels (474 pure compounds and 314 mixtures). The flash points predicted by the MATLAB and Keras models had R² values of 0.981 and 0.979 and MAEs of 3.12 K and 3.55 K, respectively. Jiao et al. used the electrotopological state indices (ETSI) of the pure substances, averaged with mole-fraction weights, as descriptors of binary mixtures. Flash point prediction models were built by multiple linear regression (MLR), stepwise regression, radial basis function artificial neural network (RBF-ANN) and other methods, and the quantitative relationship between the electrotopological state index and the flash point of 288 binary mixtures was verified^[69].


Fig. 4 Comparison of predicted values via (a) message-passing neural network (MPNN) and (b) graph convolutional neural network (GCNN) with experimental values for flash point^[63]
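A mole-fraction-weighted mixture descriptor of the kind used for binary mixtures can be sketched as follows. The function name `mixture_descriptor` and the descriptor values are hypothetical; they are not electrotopological state indices of real compounds and serve only to show the averaging step.

```python
def mixture_descriptor(component_descriptors, mole_fractions):
    """Average pure-component descriptor vectors, weighted by mole fraction."""
    if abs(sum(mole_fractions) - 1.0) > 1e-9:
        raise ValueError("mole fractions must sum to 1")
    n_features = len(component_descriptors[0])
    return [
        sum(x * d[k] for x, d in zip(mole_fractions, component_descriptors))
        for k in range(n_features)
    ]

# Two pure components with made-up two-element descriptor vectors.
component_a = [2.0, 0.0]
component_b = [4.0, 8.0]
mixed = mixture_descriptor([component_a, component_b], [0.25, 0.75])
print(mixed)
```

The resulting vector has the same length as a pure-component descriptor, so the same regression model can be trained on pure compounds and mixtures alike.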

4.1.3 Viscosity

Viscosity is mainly used to evaluate the fluidity of a fuel and to help calculate the pressure drop of liquid fuel in a pipeline. Dynamic viscosity and kinematic viscosity are the main viscosity indexes, where kinematic viscosity is the ratio of dynamic viscosity to density. Researchers have predicted fuel viscosity with different machine learning methods^[70~73]. Cengiz et al. constructed three machine learning models, a multilayer perceptron, an extreme learning machine and the K-nearest neighbor method, and predicted the kinematic viscosity of 77 liquid fuels from the experimentally measured water content, density and flash point^[70]. The extreme learning machine had the lowest MRE (0.0140) and MSE (0.0313) and was the most suitable for predicting the kinematic viscosity of fuels. Yahya et al. used temperature, the kinematic viscosity of the biodiesel and its concentration in the blend as inputs to an adaptive neuro-fuzzy system and a least squares support vector machine to predict the kinematic viscosity of biodiesel blends^[71]. The kinematic viscosities predicted by the different models are shown in Fig. 5. The least squares support vector machine with a polynomial kernel function had the highest prediction accuracy, with an MAE of 0.03 mm²/s and an R² of 0.9997 for the kinematic viscosity of 636 biodiesel blends.


Fig. 5 Prediction performance of the different intelligent approaches in the database of (a) training, (b) testing and (c) the whole^[71]
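The definition of kinematic viscosity as dynamic viscosity divided by density reduces to a one-line conversion when convenient units are chosen. In the sketch below the function name and the sample values are assumptions; the unit choice relies on the identity that 1 mPa·s divided by 1 g/cm³ equals 1 mm²/s.

```python
def kinematic_viscosity(dynamic_viscosity_mpa_s, density_g_per_cm3):
    """Return kinematic viscosity in mm^2/s.

    1 mPa·s divided by 1 g/cm^3 equals 1 mm^2/s, so no conversion
    factor is needed with these units.
    """
    if density_g_per_cm3 <= 0:
        raise ValueError("density must be positive")
    return dynamic_viscosity_mpa_s / density_g_per_cm3

# Illustrative liquid: 2.0 mPa·s at a density of 0.8 g/cm^3.
nu = kinematic_viscosity(2.0, 0.8)
print(nu, "mm^2/s")
```

Because density enters the definition, errors in a predicted density propagate into any kinematic viscosity computed from a predicted dynamic viscosity.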

4.1.4 Calorific value

Calorific value is an important index for evaluating the power performance of a fuel; a fuel with a high calorific value can supply sufficient energy to the engine. Several specific indexes are used, including the net heat of combustion (NHOC), the volumetric net heat of combustion (v-NHOC) and the higher heating value (HHV)^[74]. Xing et al. predicted the mass calorific value of biofuels with different machine learning models, including linear regression, artificial neural network, support vector machine, random forest and decision tree, using the mass percentages of the five elements C, H, O, N and S in the compounds as inputs^[75]. Among these models, random forest regression (RFR) and decision tree regression (DTR) performed better on the full data set, with R² values of 0.9814 and 0.9664, respectively. Hosseinpour et al. proposed a new prediction model based on an iterative network of fuzzy partial least squares combined with principal component analysis (PCA-INFPLS), which correlated the fixed carbon (FC), volatile matter (VM) and ash content of 350 biomass fuels with their higher heating value^[76]. The R² between the predicted and measured calorific values of the biomass fuels was 0.96.

4.1.5 Cetane number

Cetane number (CN) is a key index for evaluating the ignition performance of a fuel^[77]. The higher the cetane number, the better the ignition performance, the more uniform the combustion and the smoother the engine start. Machine learning is increasingly used to predict fuel cetane number^[78,79]. The artificial neural network (ANN) established by Guo et al. predicted the cetane number of 349 hydrocarbons and oxygenates with better accuracy than multiple linear regression (MLR)^[78]. The mean absolute errors of the optimal model for cyclic and chain compounds were 6.5 CN and 4.0 CN, respectively. Kessler et al. reported a neural network model for predicting the cetane number of furan compounds^[79]. During model optimization, adding more of the target furan compounds to the training data set improved the accuracy of the predicted cetane number of furanic molecules by 49.21% (3.74 CN) on average.

In addition to density, flash point, viscosity, calorific value and cetane number, machine learning has been applied to other fuel properties such as octane number and soot formation characteristics^[80-83]^[84,85]. Moreover, machine learning models developed for the physicochemical properties of other classes of compounds can also inform and guide the prediction of fuel properties.

4.2 Prediction of multiple fuel properties

The machine learning methods developed for a single fuel property have also been extended to other properties, so that multiple fuel properties can now be predicted together and fuel performance can be evaluated more comprehensively^[86]^[87].

As early as 2007, Liu Guozhu et al. used GC-MS to analyze the chemical composition of more than 80 fuels and grouped the components of these fuels into eight hydrocarbon classes, including monocycloalkanes, dicycloalkanes, n-alkanes, iso-alkanes, and naphthalene and its substituted derivatives^[37]. A simple artificial neural network was then constructed to correlate fuel composition with flash point, freezing point, density, net calorific value and other properties. In 2022, Liu Guozhu et al. constructed several deep neural networks to predict various fuel properties^[88]. They compared the accuracy of three graph neural networks, the graph convolutional network (GCN), the graph attention network (GAT) and the graph isomorphism network (GIN), in predicting the flash point of fuel compounds. The GIN coupling molecular and atomic features gave the best accuracy, with an R² of 0.991 and an MAE of 3.952 K for the predicted flash point. The model was also extended to freezing point and density, with R² of 0.997 and 0.991, respectively. The predicted results for the three fuel properties are shown in Fig. 6.

Fig. 6 Experimental values versus predicted values of (a) boiling point, (b) density, (c) FP. The yellow, red, and blue points refer to the predicted values from the train, validation, and test sets, respectively^[88]

Zhang Linzhou et al. used molecular structural groups and chemical descriptors as inputs to several machine learning models and accurately predicted four key properties of diesel: freezing point, smoke index, cetane number and heat of combustion^[89]. Taking cetane number as an example, they compared an artificial neural network, a support vector machine and a random forest; the artificial neural network gave small errors on both the training set and the test set. With these accurate predictions, the four properties were used to evaluate the low-temperature fluidity, cleanliness, ignition performance and power performance of diesel molecules, respectively, and a property radar chart was used to build a system for evaluating overall diesel performance, as shown in Fig. 7. Hexylcyclohexane and 2,6,10-trimethylundecane showed good low-temperature fluidity, cleanliness, ignition performance and power performance, which led to the general rule that isoparaffins and naphthenes are ideal components of high-quality clean diesel.

Fig. 7 Radar chart of comprehensive performance of representative molecules using the combination of QSPR models^[89]

Saldana et al. used a variety of machine learning models to predict six fuel properties: density, viscosity, flash point, cetane number, freezing point and net heat of combustion^[90-92]. The best predicted values are compared with the true values in Fig. 8. Except for the freezing point, R² exceeded 0.9 for the other five properties and reached 0.999 for the net heat of combustion. Two kinds of molecular description were used in these studies: functional group descriptors derived from the molecular SMILES string and molecular topological descriptors calculated with Materials Studio. Different machine learning models, such as artificial neural networks and support vector machines, were combined with the two kinds of description to build new models, and their predictive performance was compared. The "consensus model", which averages the predictions of several individual models, performed best, and the specific set of models combined in the consensus model differed from property to property.

Fig. 8 Comparison of predicted values and real values for six fuel properties: (a) density, (b) viscosity, (c) flash point, (d) cetane numbers, (e) freezing point, and (f) net heat of combustion^[90⇓~92]
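The consensus idea described for the work of Saldana et al., averaging the predictions of several models and choosing the combination separately for each property, can be sketched as follows. The model names and numbers are hypothetical, and the exhaustive subset search is only one simple way to pick a combination; the source does not state how the original combinations were selected.

```python
from itertools import combinations


def consensus(predictions):
    """Element-wise mean of several models' prediction lists."""
    n_models = len(predictions)
    return [sum(values) / n_models for values in zip(*predictions)]


def mean_abs_error(y_true, y_pred):
    """Mean absolute error between reference and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def best_consensus(model_predictions, y_true):
    """Try every non-empty subset of models and return the subset whose
    averaged prediction has the lowest MAE, together with that MAE."""
    best_names, best_error = None, float("inf")
    names = sorted(model_predictions)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            averaged = consensus([model_predictions[name] for name in subset])
            error = mean_abs_error(y_true, averaged)
            if error < best_error:
                best_names, best_error = subset, error
    return best_names, best_error


# Hypothetical validation data for one property (illustration only).
reference = [10.0, 20.0, 30.0]
validation_predictions = {
    "ann": [12.0, 22.0, 32.0],  # systematically too high
    "svm": [8.0, 18.0, 28.0],   # systematically too low
    "rf": [10.0, 25.0, 30.0],   # one large outlier
}

if __name__ == "__main__":
    print(best_consensus(validation_predictions, reference))
```

In this toy case the opposite biases of the two hypothetical models cancel, so their average beats every single model. The subset should be chosen on validation data rather than on the final test set, otherwise the reported accuracy is optimistic.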

Our research group designed and optimized 342 hydrocarbon molecular structures and calculated their density, freezing point, boiling point, heat of combustion, specific impulse and other key fuel properties by DFT and the group contribution method, thereby constructing a structure-property database of 342 hydrocarbon molecules. On this database, a single-layer neural network (SLNN) was built and optimized in Matlab to predict multiple fuel properties of the 342 molecules, with R² above 0.9 for each property^[25]. The training database was later expanded to 739 hydrocarbon molecules by retrieving further physical and chemical property data for hydrocarbons from Chemical Abstracts. Individual learners and an ensemble stacking model were then built with the continuously operable molecular input paradigm (COMES) and the Coulomb matrix (CM) as inputs to predict multiple fuel properties accurately. The stacking model showed lower prediction errors whichever of the two inputs was used. Compared with the earlier SLNN, it still improved the prediction of fuel properties such as density, mass calorific value and specific impulse, even though the training data were more numerous and came from different sources^[27]. Table 2 lists the prediction accuracy of the two studies for several key fuel properties of hydrocarbons.

Table 2 Comparison of the errors for predicting multiple fuel properties by single layer neural network and stacking model^[25,27]

Model	Molecular descriptor	T_m MAE/K	T_m R²	FP MAE/℃	FP R²	ρ MAE/g·cm^-3	ρ R²	NHOC MAE/MJ·kg^-1	NHOC R²
SLNN	CM	11.47	0.9675	4.029	0.9910	0.0515	0.9736	0.3651	0.9023
Stacking	CM	13.47	0.8873	4.294	0.9686	0.0440	0.9266	0.1783	0.9334
Stacking	COMES	113.61	0.8960	6.334	0.9337	0.0322	0.9457	0.1800	0.9058
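For readers unfamiliar with the Coulomb matrix (CM) descriptor listed in Table 2, the sketch below builds it from nuclear charges and Cartesian coordinates using the commonly cited definition: diagonal elements 0.5·Z_i^2.4 and off-diagonal elements Z_i·Z_j/|R_i − R_j|. Coordinates are assumed to be in atomic units (bohr). Whether the cited work used this plain form or a sorted or padded variant is not stated in the text, so this is an illustration rather than a reproduction.

```python
import math


def coulomb_matrix(charges, coords):
    """Return the Coulomb matrix as a nested list.

    charges: nuclear charges Z_i, one per atom
    coords:  Cartesian coordinates (x, y, z) per atom, assumed in bohr
    """
    n_atoms = len(charges)
    matrix = [[0.0] * n_atoms for _ in range(n_atoms)]
    for i in range(n_atoms):
        for j in range(n_atoms):
            if i == j:
                # Diagonal term: fitted approximation of the atomic energy.
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal term: Coulomb repulsion between nuclei i and j.
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


if __name__ == "__main__":
    # H2 with an H-H distance of 1.4 bohr, used here only as a small example.
    for row in coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)]):
        print(row)
```

The matrix depends on atom ordering, so in practice it is usually sorted or otherwise made permutation-invariant and padded to a fixed size before being fed to a model.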

Li et al. proposed an integrated machine learning and quantitative structure-property relationship (ML-QSPR) method to predict 15 physicochemical properties of 23 types of fuels^[93]. They used 10-fold cross-validation and leave-one-out cross-validation to train the regression models and test the accuracy of the predictions. Compared with previously published fuel property prediction models, this model has four main advantages: (1) it predicts 15 fuel properties, namely CN, RON, MON, T_m, T_b, ΔH_vap, γ, LHV, ρ, YSI, IT, FP, VP, LFL and UFL; (2) it is applicable to 23 types of fuels, including alkanes, cycloalkanes, alkenes, cyclic alkenes, alkynes, alcohols and aldehydes; (3) it achieves high accuracy, with an average coefficient of determination R² of 0.9816 over the 15 properties (the accuracy for each property is given in Table 3); (4) it shows reasonable interpolation and extrapolation capability when tested on new molecules. These advantages are attributed mainly to two factors. First, the functional group system developed, UOB 3.0, accounts for both the contributions of fuel molecular structure and the interactions between functional groups, and can be converted effectively into molecular input. Second, the machine learning model describes the relationship between fuel molecular structure and properties by non-parametric fitting and is optimized by automatic hyperparameter tuning, feature selection and best-model identification, so that it captures accurately how molecular structure information affects fuel properties.

Table 3 Predictive performances of ML-QSPR models trained by 10-fold cross validation and leave-one-out cross validation^[93]

Property	10-fold CV R²	10-fold CV RMSE	Leave-one-out CV R²	Leave-one-out CV RMSE
CN	0.9898	2.776	0.9948	2.045
RON	0.9884	2.468	0.9884	2.466
MON	0.9758	2.805	0.9821	2.448
T_m	0.9653	15.214	0.9625	15.805
T_b	0.9484	20.097	0.9788	13.010
ΔH_vap	0.9968	1.399	0.9986	0.926
γ	0.9898	0.799	0.9894	0.813
LHV	0.9959	189.563	0.9961	184.204
ρ	0.9946	11.945	0.9946	11.943
YSI	0.9993	7.567	0.9993	7.567
IT	0.9603	21.951	0.9631	21.218
FP	0.9798	10.142	0.9938	5.666
VP	0.9972	4.798	0.9971	4.825
LFL	0.9935	0.062	0.9948	0.056
UFL	0.9486	0.725	0.9826	0.429
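The two validation schemes named in Table 3 differ only in how the data are partitioned. The sketch below generates the train/test index splits for k-fold cross-validation and treats leave-one-out as the special case k = n. It is a plain illustration without shuffling; the function names are chosen here and do not come from the cited work.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k-fold cross-validation.

    Samples are split into k contiguous folds whose sizes differ by at most
    one; each fold serves as the test set exactly once.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def leave_one_out_splits(n_samples):
    """Leave-one-out cross-validation is k-fold with k equal to n_samples."""
    return k_fold_splits(n_samples, n_samples)


if __name__ == "__main__":
    for train, test in k_fold_splits(7, 3):
        print("train:", train, "test:", test)
```

Leave-one-out uses almost all of the data for every fit, which helps with small datasets, but it requires n model fits instead of k. In practice the indices are usually shuffled first so that folds are not biased by the order of the data.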

5 Molecular Design of New Fuels

In the research reported so far, there are two main routes to designing new fuel molecules: high-throughput screening for molecules that meet specific performance requirements, and de novo design of fuel molecules.

5.1 High throughput screening of fuel molecules

A machine learning model trained and optimized on a small dataset can be used to predict the fuel properties of the molecules in a large molecular structure library, which quickly yields a large structure-property database. Open-source large molecular structure libraries currently include GDB, QM9, ZINC and PubChem^[94,95]^[96]^[97,98]^[99]. By setting screening criteria on specific properties, the molecular structures that meet the requirements can then be selected from this database in a high-throughput manner. Such virtual screening narrows a large set of candidate molecules down to a smaller, more focused set and can reduce the number of candidates from the order of 10⁵ to 10¹.

Building on their earlier work on predicting multiple fuel properties, Li et al. proposed a method for high-throughput screening of fuels with specific properties and set up a two-tier virtual screening process, as shown in Fig. 9a^[100]. Tier 1 uses the ML-QSPR model from the earlier study to predict 15 fuel properties, such as melting point, boiling point, cetane number and enthalpy of vaporization. Thresholds for the different properties were set according to the performance requirements of the spark-ignition (SI) engine; the specific criteria are shown in Fig. 9b. Tier 1 selected 166 qualified fuel compounds from 1742 compounds. Tier 2 is based on chemical kinetics: ignition delay time, sensitivity and laminar flame speed were used to further evaluate combustion performance, and eight candidate fuel molecules were finally obtained. This funnel-type hierarchical screening progressively narrows the candidates to new fuel molecules that best match the final target.

Fig. 9 (a) Property-oriented fuel design workflow; the shaded region represents virtual fuel screening by ML-QSPR and chemical kinetics. (b) Tier 1 screening of fuel physicochemical properties by ML-QSPR models for SI engines^[100]

Using the single-layer neural network trained on the small database of 342 hydrocarbon structures and their fuel properties, our research group predicted multiple fuel properties of 319 893 hydrocarbon molecules in GDB-13C and constructed a large structure-property database of these molecules, as shown in Fig. 10a^[25]. With the screening criteria of freezing point below 273.15 K, mass heating value greater than 85% of the maximum value and specific impulse greater than 80% of the maximum value, 28 new hydrocarbon fuel molecules with high density, high specific impulse, high mass heating value and low freezing point were selected from this database. Their structures are shown in Fig. 10b, and their fuel properties were verified by theoretical calculation.

Fig. 10 Work of our group on high-throughput screening of hydrocarbon fuel molecules: (a) values of NHOC (x axis), I_sp (y axis), ρ (z axis) and T_m (color depth of the dots) predicted by the SLNN for the 319 895 molecules^[25]; (b) molecular structures of the 28 screened hydrocarbon molecules^[25]; (c) comparison of NHOC predicted by machine learning (black crosses) with NHOC calculated by DFT and the group contribution method (red circles)^[27]; (d) molecular structures of the 20 screened hydrocarbon molecules^[27]

Building on this work, our research group used the trained and optimized stacking model to predict various fuel properties of the 319 893 hydrocarbon molecules in the GDB-13C database and carried out high-throughput screening^[27]. The screening criteria were stricter, and the synthetic accessibility (SA) score was added as a criterion. SA represents the ease of synthesizing a molecule: the smaller the SA value, the easier the synthesis. With the criteria ρ > 1.1 g·cm^-3, NHOC > 42 MJ·kg^-1, I_sp > 343 s and SA < 5.0, 1026 molecular structures were selected from the 319 893 molecules. To verify the screening results, 78 molecules were randomly selected and their predicted values were compared with calculated values; the comparison for mass calorific value is shown in Fig. 10c. Compared with the earlier single-layer neural network, the MAE of the properties predicted by the stacking model was reduced by 87% on average; for example, the MAE of NHOC fell from 4.12 MJ·kg^-1 to 0.45 MJ·kg^-1. From the 1026 screened molecules, 20 new hydrocarbon molecules with characteristic structures (Fig. 10d) were selected to guide the future design of new fuel molecules. Compared with the 28 hydrocarbon molecules screened with the single-layer neural network, the 20 molecules screened with the stacking model have, on average, a freezing point 21.8 K lower, a density 0.06 g·cm^-3 higher, a mass heating value 0.71 MJ·kg^-1 higher and a specific impulse 5.8 m·s^-1 higher, and are therefore closer to the performance standard of high-density liquid hydrocarbon fuels. Adding SA as a screening dimension also improves the experimental synthesizability of the target fuels.
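The screening step itself is a simple filter over predicted properties. The sketch below applies the four thresholds quoted above (ρ > 1.1 g·cm^-3, NHOC > 42 MJ·kg^-1, I_sp > 343 s, SA < 5.0) to a few hypothetical candidates; the candidate labels and property values are invented for illustration and do not correspond to molecules from the cited database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    label: str       # identifier of the molecule (hypothetical labels here)
    density: float   # predicted density, g/cm^3
    nhoc: float      # predicted net heat of combustion, MJ/kg
    isp: float       # predicted specific impulse, s
    sa: float        # synthetic accessibility score (lower is easier)


def passes_screen(c, min_density=1.1, min_nhoc=42.0, min_isp=343.0, max_sa=5.0):
    """Return True if the candidate satisfies all four screening criteria."""
    return (c.density > min_density and c.nhoc > min_nhoc
            and c.isp > min_isp and c.sa < max_sa)


def screen(candidates, **thresholds):
    """Keep only the candidates that pass every criterion."""
    return [c for c in candidates if passes_screen(c, **thresholds)]


# Hypothetical predicted properties, for illustration only.
library = [
    Candidate("mol_A", 1.12, 42.5, 344.0, 4.2),  # passes all criteria
    Candidate("mol_B", 1.15, 41.0, 345.0, 3.9),  # heating value too low
    Candidate("mol_C", 1.20, 43.0, 346.0, 5.6),  # too hard to synthesize
    Candidate("mol_D", 0.95, 43.5, 350.0, 3.0),  # density too low
]

if __name__ == "__main__":
    for hit in screen(library):
        print(hit.label)
```

Because every criterion must hold at once, tightening any single threshold can shrink the hit list sharply, which is how such a filter reduces hundreds of thousands of candidates to a short list.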

5.2 Inverse design of new fuel molecules

Unlike high-throughput screening, which searches existing molecular structure libraries for fuel molecules that meet specific requirements, inverse design creates entirely new fuel molecular structures from scratch. Inverse molecular generation models, which generate new molecular structures according to specified targets, are currently a hot topic of machine learning in materials chemistry^[18,101]. However, they have so far seen few applications in fuel design. Our research group has carried out exploratory work in this area, mainly using the variational autoencoder (VAE) and the generative adversarial network (GAN) to realize the inverse design of new fuel molecules.

Our research group developed a variational autoencoder that reversibly converts fuel molecules into continuous multidimensional numerical vectors; the process is shown in Fig. 11a^[26]. Fig. 11b shows the results of sampling with the VAE model around the classical hydrocarbon fuel molecule JP-10. The distance between each sampled molecule and the original molecule is smaller than the average distance between neighboring molecules in the training set, which demonstrates the strong generative ability of the model. A Python code based on the group contribution method was also developed to calculate fuel properties of hydrocarbons, such as density, freezing point, boiling point and flash point, automatically from the molecular SMILES string. A VAE jointly trained with a multilayer perceptron (MLP) was trained and optimized on a small quantum chemical database of 9252 hydrocarbon molecules and their calorific values, using the continuous multidimensional numerical vector (CMR) as input; the trained model accurately predicts the heat of combustion. As shown in Fig. 11c, the distribution of molecular properties in the latent space is relatively ordered, indicating that joint training helps the model capture molecular structural features and organize the latent space. The VAE model was then used to generate 11 291 051 hydrocarbon molecules, whose properties were calculated by the method above, giving a large database, CH-02, of 11 291 051 hydrocarbon structures and their fuel properties. With the screening thresholds of density greater than 1.05 g·cm^-3, freezing point lower than 240 K, mass heating value greater than 42 MJ·kg^-1, specific impulse greater than 340 s and fewer than 5 rings per molecule, 41 199 hydrocarbon molecular structures were obtained from CH-02 by high-throughput screening. Twenty typical new molecular structures were further selected from these 41 199 molecules, as shown in Fig. 11d. Structural analysis shows that strained cyclopropane or cyclobutane rings joined in spiro fashion are effective building blocks for improving the overall fuel performance of hydrocarbon molecules.
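The group contribution method mentioned above estimates a property by summing tabulated contributions of the structural groups in a molecule. The sketch below illustrates the idea for density, computed as molar mass divided by an additive molar volume. The group masses follow from standard atomic masses, but the group volumes are placeholder numbers chosen for illustration, not fitted parameters from the cited code, and the step that counts groups from a SMILES string is omitted.

```python
# Molar masses of the groups in g/mol (from standard atomic masses).
GROUP_MASS = {"CH3": 15.035, "CH2": 14.027}

# Molar volume contributions in cm^3/mol.
# Placeholder values for illustration only, not fitted parameters.
GROUP_VOLUME = {"CH3": 33.0, "CH2": 16.3}


def additive_sum(group_counts, table):
    """Sum count * contribution over all groups; unknown groups raise KeyError."""
    return sum(count * table[name] for name, count in group_counts.items())


def estimate_density(group_counts):
    """Estimate liquid density in g/cm^3 as molar mass / additive molar volume."""
    molar_mass = additive_sum(group_counts, GROUP_MASS)
    molar_volume = additive_sum(group_counts, GROUP_VOLUME)
    return molar_mass / molar_volume


if __name__ == "__main__":
    # n-Hexane written as group counts: two CH3 and four CH2 groups.
    n_hexane = {"CH3": 2, "CH2": 4}
    print(f"estimated density: {estimate_density(n_hexane):.3f} g/cm^3")
```

The same additive pattern applies to other properties once a contribution table is available. Its low cost is what makes it practical for labelling millions of generated molecules, at the price of limited accuracy for unusual structures.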

Fig. 11 (a) Illustration of the VAE containing an encoder and a decoder developed, a joint property prediction model has been included for fuel design; (b) Generation capability of the as-trained VAE, sampling results around the classic fuel molecule JP-10; (c) Two-dimensional principal component analysis of the latent space generated by the VAE jointly trained with MLP for the prediction of density (ρ, g·cm^-3), freezing point (T_m, K), specific impulse (I_sp, s) and heat value (NHOC, MJ·kg^-1), the color bar shows the numerical value of the molecular properties; (d) The structures of as-screened 20 excellent molecules^[26]
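The sampling test shown in Fig. 11b can be pictured as follows: perturb the latent vector of a seed molecule with small Gaussian noise, then check that the sampled points lie closer to the seed than training molecules typically lie to their nearest neighbor. The sketch below works on latent vectors only. The encoder and decoder are not included, and the vectors, noise scale and function names are assumptions made for illustration.

```python
import math
import random


def sample_around(center, sigma, n_samples, rng):
    """Draw latent vectors from an isotropic Gaussian centered on `center`."""
    return [[c + rng.gauss(0.0, sigma) for c in center] for _ in range(n_samples)]


def mean_nearest_neighbor_distance(vectors):
    """Average Euclidean distance from each vector to its nearest neighbor."""
    total = 0.0
    for i, v in enumerate(vectors):
        total += min(math.dist(v, w) for j, w in enumerate(vectors) if j != i)
    return total / len(vectors)


def within_neighborhood(center, samples, radius):
    """Keep the samples whose distance to the center is below `radius`."""
    return [s for s in samples if math.dist(center, s) < radius]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical 2-D latent vectors of training molecules.
    training = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    seed = [0.5, 0.5]  # hypothetical latent vector of the seed molecule
    radius = mean_nearest_neighbor_distance(training)
    samples = sample_around(seed, sigma=0.2, n_samples=100, rng=rng)
    close = within_neighborhood(seed, samples, radius)
    print(f"{len(close)} of {len(samples)} samples lie within {radius:.2f} of the seed")
```

In a real workflow each retained latent vector would be passed to the trained decoder to obtain a candidate structure, which then needs a validity check before any property is predicted.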

Recently, building on the use of the variational autoencoder for fuel molecule design, our group developed the latent space generative adversarial network with stacking domain (LIGANDS) model^[28]. The model consists of three parts: a variational autoencoder (VAE), a generative adversarial network (GAN) and a stacking model; its workflow is shown in Fig. 12. The VAE is trained on the 319 893 hydrocarbon structures in the GDB-13C database and transforms hydrocarbon structures into continuous, operable real-valued vectors (COMES) in the latent space. The trained model is robust, with an encoding-decoding accuracy of up to 99%. The stacking model predicts six key fuel properties of a molecule: freezing point, boiling point, flash point, density, mass heating value and specific impulse. In the GAN, 255 known typical high-density hydrocarbon fuel structures serve as the training set. The discriminator continually updates its parameters on the target fuel structures in the training set to judge whether a given molecule meets the standard, while the generator is trained to produce new molecules that the discriminator cannot distinguish from the target fuels. When the two reach Nash equilibrium, the generator can transform random input vectors into new hydrocarbon molecules with the characteristics of the target fuels. The LIGANDS model finally designed 3461 qualified new fuel molecules from scratch, whose property distribution is similar to that of the target fuels and whose energy characteristics are better. Sixteen new hydrocarbon fuel molecules with unique structures were further selected; their performance is comparable to or better than that of the typical hydrocarbon fuels JP-10 and QC. The LIGANDS deep generative model can therefore generate hydrocarbon fuels efficiently and robustly.

Fig. 12 The workflow of deep generative algorithm of LIGANDS by integrating a VAE with encoder and decoder, a GAN with generator and discriminator and a stacking model for de novo fuel design^[28]
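The adversarial training between generator and discriminator described for the GAN part of LIGANDS can be summarized by the standard GAN minimax objective. The text does not state which loss the LIGANDS model actually uses, so the formulation below is the textbook form, written under the assumption that x denotes the latent vector of a target fuel molecule and z a random input vector.

```latex
% Standard GAN objective: D is the discriminator, G the generator.
% x ~ p_data : latent vectors of the target fuel molecules (assumed)
% z ~ p_z    : random input vectors fed to the generator
\min_{G}\max_{D} V(D,G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]

% For a fixed generator with output distribution p_g, the optimal discriminator is
D^{*}(x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_{g}(x)}

% At the Nash equilibrium p_g = p_data, so D^{*}(x) = 1/2 everywhere:
% the discriminator can no longer tell generated vectors from target ones.
```

Under this formulation, reaching equilibrium means the generated distribution matches that of the target fuels, which is consistent with the statement that the generated molecules have a property distribution similar to the training set.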

5.3 Comparison of different design methods

Of the two types of machine learning-assisted fuel molecule design introduced above, high-throughput screening is a form of virtual screening based on existing large molecular structure databases. Machine learning models predict the fuel properties of the molecular structures in the database, and the structures that meet specific performance requirements are then selected in a medium- or high-throughput manner. In this method, the performance of the screened molecules and the reliability of the result depend mainly on the quality of the training database, the accuracy with which the machine learning model predicts fuel properties, and the strictness of the screening conditions. Inverse design of new fuel molecules is de novo design: starting from the required fuel properties, deep generative models are used to design molecular structures that meet them. This method can generate new structures that are absent from existing molecular structure databases, and the performance of the designed molecules depends largely on how accurately the deep generative model encodes and decodes molecular structures, that is, on whether it accurately learns the structural information of high-performance fuel molecules.

6 Conclusion and prospect

This paper reviews research progress in two key areas: machine learning-assisted fuel property prediction and the design of new fuel molecules. Machine learning can accurately predict fuel properties such as density, flash point, viscosity, heat of combustion and cetane number, and can design new fuel molecules efficiently and accurately through high-throughput screening and inverse design. It has helped to discover more potential high-energy fuel molecules, demonstrating its efficiency and advantages in the design of next-generation fuels. Based on current progress, future applications of machine learning in the fuel field are expected to focus on the following four aspects.

(1) Improvement and standardization of basic data on fuel structure and properties. The key problems of data-driven machine learning are the shortage of data and inconsistent standards, especially the shortage of experimental data, and the same holds for theoretical fuel research. Acquiring more experimental data and developing data standards will accelerate theoretical fuel design, so that it can guide experimental synthesis and specific application scenarios more efficiently.

(2) Effective extraction of fuel molecular features. A variety of molecular descriptors have been designed and their effects on property prediction compared. However, the information these descriptors extract is still limited, and they cannot fully capture the molecular characteristics that relate molecular structure to the target properties. The ultimate goal of a molecular descriptor is to extract molecular features effectively, integrating molecular structure and other information into a simple molecular fingerprint so that the target property can be predicted with as few features as possible. More "precise" molecular descriptors therefore need to be developed. In addition, in practical applications the fuels used by aircraft and engines are mostly mixtures. Future research should address such practical problems by designing molecular description methods that effectively extract the features of mixed fuels.

(3) Generalization ability and interpretability of fuel property prediction models. Current machine learning models show high accuracy on existing training and test sets, but their ability to extrapolate is not ideal: predictions for fuel molecules outside the existing datasets still deviate considerably from reality. Obtaining machine learning models with strong generalization ability from small training databases is one of the difficult problems to be overcome. Moreover, current fuel property prediction models are largely "black boxes": they achieve the purpose of predicting properties, but their interpretability needs to be improved. Interpretable deep learning is a research hotspot in computer science and big data and is expected to be applied gradually to fuel property prediction, which is necessary for further explaining the relationship between molecular structure and fuel properties.

(4) On-demand inverse design of fuel molecules and prediction of synthesis routes. There are still few studies on inverse design models for fuel molecules. As a next step, reinforcement learning, recurrent neural networks and other generative models are expected to be applied to the on-demand inverse design of fuel molecules, to design more potential high-energy fuel molecules from scratch. Besides the direct design of molecular structures, intelligent prediction and optimization of synthesis routes for target molecules are also crucial, as they help to evaluate the practicability of a target molecule and to guide experimental synthesis.

References

[1]	Zou J J, Guo C, Zhang X W, Wang L, Mi Z T. Journal of Propulsion Technology, 2014, 35(10): 1419. ( 邹吉军, 郭成, 张香文, 王莅, 米镇涛. 推进技术, 2014, 35(10): 1419.)

[2]	Zhang X W, Pan L, Wang L, Zou J J. Chem. Eng. Sci., 2018, 180: 95.

[3]	Yu R, Liu X L, Shi C X, Pan L, Zhang X W, Zou J J. Chinese Journal of Energetic Materials, 2022, 30(11): 1167. ( 余锐, 刘显龙, 史成香, 潘伦, 张香文, 邹吉军. 含能材料, 2022, 30(11): 1167.)

[4]	Liu N, Shi C X, Pan L, Zhang X W, Zou J J. Journal of Fuel Chemistry and Technology, 2021, 49(12): 1780. ( 刘宁, 史成香, 潘伦, 张香文, 邹吉军. 燃料化学学报, 2021, 49(12): 1780. )

[5]	Pan L, Li H Y, Xue K, Zhang X W, Zou J J. Journal of Propulsion Technology, 2023, 44(09): 6. ( 潘伦, 李怀宇, 薛康, 张香文, 邹吉军. 推进技术, 2023, 44(09): 6.)

[6]	Pan L D, E X T F, Nie G K, Zhang X W, Zou J J. Progress in Chemistry, 2015, 27(11): 1531. ( 潘伦邓, 鄂秀天凤, 聂根阔, 张香文, 邹吉军. 化学进展, 2015, 27(11): 1531.)

[7]	Yang J, Xin Z, He Q S, Corscadden K, Niu H B. Fuel, 2019, 237: 916.

[8]	Savos'kin M V, Kapkan L M, Vaiman G E, Vdovichenko A N, Gorkunenko O A, Yaroshenko A P, Popov A F, Mashchenko A N, Tkachev V A, Voloshin M L, Potapov Y F. Russ. J. Appl. Chem., 2007, 80(1): 31.

[9]	Marrero J, Gani R. Fluid Phase Equilib., 2001, 183/184: 183.

[10]	Hukkerikar A S, Sarup B, Ten Kate A, Abildskov J, Sin G, Gani R. Fluid Phase Equilib., 2012, 321: 25.

[11]	Wang X Y, Jia T H, Pan L, Liu Q, Fang Y M, Zou J J, Zhang X W. Trans. Tianjin Univ., 2021, 27(2): 87.

[12]	Zhang S Y, Jia Q Z, Yan F Y, Xia S Q, Wang Q. Chem. Eng. Sci., 2021, 231: 116326.

[13]	Butler K T, Davies D W, Cartwright H, Isayev O, Walsh A. Nature, 2018, 559(7715): 547.

[14]	Pilania G, Wang C C, Jiang X, Rajasekaran S, Ramprasad R. Sci. Rep., 2013, 3: 2810.

[15]	Wu W, Sun Q. Sci. Sin.-Phys. Mech. Astron., 2018, 48(10): 107001.

[16]	Hou F, Ma Y, Hu Z, Ding S N, Fu H H, Wang L, Zhang X W, Li G Z. Adv. Theory Simul., 2021, 4(6): 2100057.

[17]	Damewood J, Karaguesian J, Lunger J R, Tan A R, Xie M R, Peng J Y, Gómez-Bombarelli R. Annu. Rev. Mater. Res., 2023, 53: 399.

[18]	Lee Y J, Kahng H, Kim S B. Mol. Inform., 2021, 40(10): 2100045.

[19]	Afzal M A F, Sonpal A, Haghighatlari M, Schultz A J, Hachmann J. Chem. Sci., 2019, 10(36): 8374.

[20]	Mayr A, Klambauer G, Unterthiner T, Hochreiter S. Front. Environ. Sci., 2016, 3: 80.

[21]	Song S W, Wang Y, Chen F, Yan M, Zhang Q H. Engineering, 2022, 10: 99.

[22]	Tian X L, Song S W, Chen F, Qi X J, Wang Y, Zhang Q H. Energ. Mater. Front., 2022, 3(3): 177.

[23]	Kang P, Liu Z L, Abou-Rachid H, Guo H. J. Phys. Chem. A, 2020, 124(26): 5341.

[24]	Mai H X, Le T C, Chen D H, Winkler D A, Caruso R A. Chem. Rev., 2022, 122(16): 13478.

[25]	Li G Z, Hu Z, Hou F, Li X Y, Wang L, Zhang X W. Fuel, 2020, 265: 116968.

[26]	Liu R C, Liu R Z, Liu Y F, Wang L, Zhang X W, Li G Z. Fuel, 2022, 316: 123426.

[27]	Liu R Z, Liu Y F, Duan J Y, Hou F, Wang L, Zhang X W, Li G Z. Fuel, 2022, 324: 124520.

[28]	Liu Y F, Liu R Z, Duan J Y, Wang L, Zhang X W, Li G Z. Chem. Eng. Sci., 2023, 274: 118686.

[29]	Wigh D S, Goodman J M, Lapkin A A. WIREs Comput. Mol. Sci., 2022, 12(5): e1603.

[30]	Weininger D. J. Chem. Inf. Comput. Sci., 1988, 28(1): 31.

[31]	Wen H Q, Su Y, Wang Z H, Jin S M, Ren J Z, Shen W F, Eden M. AIChE J., 2022, 68(1): e17402.

[32]	Hou F, Wu Z Y, Hu Z, Xiao Z R, Wang L, Zhang X W, Li G Z. J. Phys. Chem. A, 2018, 122(46): 9128.

[33]	Hansen K, Montavon G, Biegler F, Fazli S, Rupp M, Scheffler M, von Lilienfeld O A, Tkatchenko A, Müller K R. J. Chem. Theory Comput., 2013, 9(8): 3404.

[34]	Gómez-Bombarelli R, Wei J N, Duvenaud D, Hernández-Lobato J M, Sánchez-Lengeling B, Sheberla D, Aguilera-Iparraguirre J, Hirzel T D, Adams R P, Aspuru-Guzik A. ACS Cent. Sci., 2018, 4(2): 268.

[35]	Liu X, Liu X, Yang T, Qin Z, Chen T, Liu X, Tan P. Chin. J. Org. Chem., 2021, 41(7).

[36]	Yang H, Ring Z, Briker Y, McLean N, Friesen W, Fairbridge C. Fuel, 2002, 81(1): 65.

[37]	Liu G Z, Wang L, Qu H J, Shen H M, Zhang X W, Zhang S T, Mi Z T. Fuel, 2007, 86(16): 2551.

[38]	Özgür C, Tosun E. Energy Sources Part A Recovery Util. Environ. Eff., 2017, 39(10): 985.

[39]	Hanley J A. J. Clin. Epidemiol., 2016, 79: 112.

[40]	Zhang C, Guo Y, Li M. Computer Engineering and Applications, 2021, 57(11): 57. ( 张弛, 郭媛, 黎明. 计算机工程与应用, 2021, 57(11): 57.)

[41]	Rithani M, Kumar R P, Doss S. Artif. Intell. Rev., 2023, 56(12): 14765.

[42]	Hinton G E, Salakhutdinov R R. Science, 2006, 313(5786): 504.

[43]	Gu J X, Wang Z H, Kuen J, Ma L Y, Shahroudy A, Shuai B, Liu T, Wang X X, Wang G, Cai J F, Chen T. Pattern Recognit., 2018, 77: 354.

[44]	Tsubaki M, Tomii K, Jun S S. Bioinformatics, 2019, 35(2): 309.

[45]	Reiser P, Neubert M, Eberhard A, Torresi L, Zhou C, Shao C, Metni H, van Hoesel C, Schopmans H, Sommer T, Friederich P. Commun. Mater., 2022, 3: 93.

[46]	Kim Y, Cho J, Naser N, Kumar S, Jeong K, McCormick R L, St John P C, Kim S. Proc. Combust. Inst., 2023, 39(4): 4969.

[47]	Martinez-Hernandez E, Valencia D, Arvizu C, Romero Alatorre D F, Aburto J. ACS Sustainable Chem. Eng., 2021, 9(20): 7044.

[48]	Ding S F, Shi Z Z, Tao D C, An B. Neurocomputing, 2016, 211: 1.

[49]	Lu W C, Ji X B, Li M J, Liu L, Yue B H, Zhang L M. Adv. Manuf., 2013, 1(2): 151.

[50]	Liu X R, Yang J M, Yuan L J. RSC Adv., 2023, 13(2): 802.

[51]	Hehn T M, Kooij J F P, Hamprecht F A. Int. J. Comput. Vis., 2020, 128(4): 997.

[52]	Ganaie M A, Hu M H, Malik A K, Tanveer M, Suganthan P N. Eng. Appl. Artif. Intell., 2022, 115: 105151.

[53]	Zhao C M, Wu D R, Huang J, Yuan Y, Zhang H T, Peng R M, Shi Z H. IEEE Trans. Pattern Anal. Mach. Intell., 2022: 1.

[54]	Segler M H S, Kogej T, Tyrchan C, Waller M P. ACS Cent. Sci., 2018, 4(1): 120.

[55]	Li C, Wang C H, Sun M, Zeng Y, Yuan Y, Gou Q L, Wang G C, Guo Y Z, Pu X M. J. Chem. Inf. Model., 2022, 62(20): 4873.

[56]	Korshunova M, Huang N, Capuzzi S, Radchenko D S, Savych O, Moroz Y S, Wells C I, Willson T M, Tropsha A, Isayev O. Commun. Chem., 2022, 5: 129.

[57]	Tong X C, Liu X H, Tan X Q, Li X T, Jiang J X, Xiong Z P, Xu T Y, Jiang H L, Qiao N, Zheng M Y. J. Med. Chem., 2021, 64(19): 14011.

[58]	Cai Z P, Xiong Z B, Xu H H, Wang P, Li W, Pan Y. ACM Comput. Surv., 2022, 54(6): 1.

[59]	Saxena D, Cao J N. ACM Comput. Surv., 2022, 54(3): 1.

[60]	Zou J J, Zhang X W, Wang L, Mi Z T. Chem. Propellants Polym. Mater., 2008, 6(1): 26. ( 邹吉军, 张香文, 王莅, 米镇涛. 化学推进剂与高分子材料, 2008, 6(1): 26.)

[61]	Xiong Z Q, Mi Z T, Zhang X W, Xing E H. Progress in Chemistry, 2005, 17(2): 359. ( 熊中强, 米镇涛, 张香文, 邢恩会. 化学进展, 2005, 17(2): 359.)

[62]	Hall C, Rauch B, Bauder U, Le Clercq P, Aigner M. Energy Fuels, 2021, 35(3): 2520.

[63]	Sun X Y, Krakauer N J, Politowicz A, Chen W T, Li Q Y, Li Z Y, Shao X J, Sunaryo A, Shen M R, Wang J, Morgan D. Mol. Inform., 2020, 39(6): 1900101.

[64]	Amirkhani F, Dashti A, Abedsoltan H, Mohammadi A H, Chofreh A G, Goni F A, Klemeš J J. Fuel, 2022, 323: 124292.

[65]	Gao Y, Zhang X Q, Zhang Z Y, Zhang J M, Wang Y Q, Zhang H Z. Journal of Safety Science and Technology, 2020, 16(10): 20. ( 高月, 张向倩, 张子炎, 张金梅, 王亚琴, 张宏哲. 中国安全生产科学技术, 2020, 16(10): 20.)

[66]	Song X Y, Pan Y, Jiang J C, Xu X. Journal of Safety and Environment, 2016, 16(6): 133. ( 宋晓亚, 潘勇, 蒋军成, 徐迅. 安全与环境学报, 2016, 16(6): 133.)

[67]	Saldana D A, Starck L, Mougin P, Rousseau B, Creton B. Energy Fuels, 2013, 27(7): 3811.

[68]	Aljaman B, Ahmed U, Zahid U, Reddy V M, Sarathy S M, Abdul Jameel A G. Fuel, 2022, 317: 123428.

[69]	Jiao L, Zhang X F, Qin Y C, Wang X F, Li H. Chemom. Intell. Lab. Syst., 2016, 156: 211.

[70]	Cengiz E, Babagiray M, Aysal F E, Aksoy F. Fuel, 2022, 316: 123422.

[71]	Yahya S I, Aghel B. Renew. Energy, 2021, 177: 318.

[72]	Zheng Y Z, Shadloo M S, Nasiri H, Maleki A, Karimipour A, Tlili I. Renew. Energy, 2020, 153: 1296.

[73]	Gülüm M, Onay F K, Bilgin A. Energy, 2018, 161: 361.

[74]	Akkaya E. Fuel, 2016, 180: 687.

[75]	Xing J K, Luo K, Wang H O, Gao Z W, Fan J R. Energy, 2019, 188: 116077.

[76]	Hosseinpour S, Aghbashlo M, Tabatabaei M. Fuel, 2018, 222: 1.

[77]	Creton B, Dartiguelongue C, de Bruin T, Toulhoat H. Energy Fuels, 2010, 24(10): 5396.

[78]	Guo Z, Lim K H, Chen M, Thio B J R, Loo B L W. Fuel, 2017, 207: 344.

[79]	Kessler T, Sacia E R, Bell A T, Mack J H. Fuel, 2017, 206: 171.

[80]	vom Lehn F, Brosius B, Broda R, Cai L, Pitsch H. Fuel, 2020, 281: 118772.

[81]	Li R, Herreros J, Tsolakis A, Yang W Z. Fuel, 2020, 280: 118589.

[82]	Sun X Y, Zhang F, Liu J P, Duan X B. Fuel, 2023, 333: 126510.

[83]	Rittig J G, Ritzert M, Schweidtmann A M, Winkler S, Weber J M, Morsch P, Heufer K A, Grohe M, Mitsos A, Dahmen M. AIChE J., 2023, 69(4): e17971.

[84]	Li R Z, Herreros J M, Tsolakis A, Yang W Z. J. Mol. Graph. Model., 2022, 111: 108083.

[85]	Gao Z, Zou X Y, Huang Z, Zhu L. Fuel, 2019, 242: 438.

[86]	Freitas R S M, Lima Á P F, Chen C, Rochinha F A, Mira D, Jiang X. Fuel, 2022, 329: 125415.

[87]	Rocabruno-Valdés C I, Ramírez-Verduzco L F, Hernández J A. Fuel, 2015, 147: 9.

[88]	Liu J P, Gong S Y, Li H W, Liu G Z. Fuel, 2022, 313: 122712.

[89]	Cai G Q, Zhang L Z. Petrol. Sci., 2022, 19(2): 809.

[90]	Saldana D A, Starck L, Mougin P, Rousseau B, Pidol L, Jeuland N, Creton B. Energy Fuels, 2011, 25(9): 3900.

[91]	Saldana D A, Starck L, Mougin P, Rousseau B, Ferrando N, Creton B. Energy Fuels, 2012, 26(4): 2416.

[92]	Saldana D A, Starck L, Mougin P, Rousseau B, Creton B. SAR QSAR Environ. Res., 2013, 24(4): 259.

[93]	Li R Z, Herreros J, Tsolakis A, Yang W Z. Fuel, 2021, 304: 121437.

[94]	Blum L C, Reymond J L. J. Am. Chem. Soc., 2009, 131(25): 8732.

[95]	Ruddigkeit L, van Deursen R, Blum L C, Reymond J L. J. Chem. Inf. Model., 2012, 52(11): 2864.

[96]	Ramakrishnan R, Dral P O, Rupp M, von Lilienfeld O A. Sci. Data, 2014, 1: 140022.

[97]	Sterling T, Irwin J J. J. Chem. Inf. Model., 2015, 55(11): 2324.

[98]	Irwin J J, Tang K G, Young J, Dandarchuluun C, Wong B R, Khurelbaatar M, Moroz Y S, Mayfield J, Sayle R A. J. Chem. Inf. Model., 2020, 60(12): 6065.

[99]	Wang Y, Xiao J, Suzek T O, Zhang J, Wang J, Bryant S H. Nucleic Acids Res., 2009, 37: W623.

[100]	Li R Z, Herreros J M, Tsolakis A, Yang W Z. Fuel, 2022, 307: 121908.

[101]	Jørgensen P B, Schmidt M N, Winther O. Mol. Inform., 2018, 37(1/2): 1700133.

Outlines

Abstract

1 Introduction

Fig. 1 Workflow of machine learning for (a) fuel property prediction and (b) molecule design

2 Fuel molecule descriptors

2.1 Molecular fingerprint based on SMILES

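Fig. 2 Different molecular descriptors for representing fuel molecular structures: (a) molecular SMILES string; (b) Coulomb matrix[25]; (c) continuously operable molecular input paradigm[27]; (d) molecular graph[35]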

2.2 Coulomb matrix

2.3 Continuously operable molecular input paradigm

2.4 Molecular graph

3 Machine learning model

3.1 Fuel property prediction model

3.1.1 Linear regression

3.1.2 Artificial neural network

3.1.2.1 Single layer neural network

3.1.2.2 Deep neural network

3.1.2.3 Convolutional neural network

3.1.2.4 Graph neural network

3.1.3 Support vector machine

3.1.4 Decision tree

3.1.5 Ensemble learning model

Table 1 Evaluation indices for the prediction accuracy of machine learning models

3.2 Fuel molecule generation model

3.2.1 Variational autoencoder

Fig. 3 Structure of (a) a variational autoencoder and (b) a generative adversarial network[57]

3.2.2 Generative adversarial network

4 Fuel property prediction

4.1 Prediction of single fuel properties

4.1.1 Density

4.1.2 Flash point

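Fig. 4 Comparison of flash points predicted by (a) a message passing neural network and (b) a graph convolutional neural network with experimental values[63]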

4.1.3 Viscosity

Fig. 5 Prediction performance of different intelligent methods on the (a) training, (b) testing and (c) entire datasets[71]

4.1.4 Calorific value

4.1.5 Cetane number

4.2 Prediction of multiple fuel properties

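Fig. 6 Predicted versus experimental values of (a) boiling point, (b) density and (c) flash point. Yellow, red and blue represent data from the training, validation and test sets, respectively[88]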

Fig. 7 Radar chart of the overall performance of representative molecules predicted by the QSPR model[89]

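Fig. 8 Comparison of predicted and true values for six fuel properties: (a) density, (b) viscosity, (c) flash point, (d) cetane number, (e) freezing point and (f) net heat of combustion[90-92]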

Table 2 Error comparison of single layer neural network and stacked ensemble models for predicting multiple fuel properties[25,27]

Table 3 Prediction performance of machine learning and quantitative structure-property relationship models trained with 10-fold cross-validation and leave-one-out cross-validation[93]

5 Molecular Design of New Fuels

5.1 High throughput screening of fuel molecules

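Fig. 9 (a) Workflow for designing fuels with targeted properties; the shaded area indicates virtual fuel screening by ML-QSPR and chemical kinetics; (b) first-level screening of fuel physicochemical properties for SI engines by the ML-QSPR model[100]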

5.2 Inverse design of new fuel molecules

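Fig. 12 Workflow of the LIGANDS deep generative model for de novo fuel design, integrating the encoder and decoder of a VAE, the generator and discriminator of a GAN, and a stacked prediction model[28]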

5.3 Comparison of different design methods

6 Conclusion and prospect

References
