MemBrain-contact 2.0: a new two-stage machine learning model for the prediction enhancement of transmembrane protein residue contacts in the full chain

Abstract

Motivation

Inter-residue contacts in proteins have been widely acknowledged to be valuable for protein 3D structure prediction. Accurate prediction of long-range transmembrane inter-helix residue contacts can significantly improve the quality of simulated membrane protein models.

Results

In this paper, we present an updated MemBrain predictor, which aims to predict transmembrane protein residue contacts. Our new model benefits from an efficient learning algorithm that can mine latent structural features existing in the original feature space. The new MemBrain is a two-stage inter-helix contact predictor. The first stage takes sequence-based features as inputs and outputs coarse contact probabilities for each residue pair, which are further fed into a convolutional neural network together with predictions from three direct-coupling analysis approaches in the second stage. Experimental results on the training dataset show that our method achieves an average accuracy of 81.6% for the top L/5 predictions using a strict sequence-based jackknife cross-validation. Evaluated on the test dataset, MemBrain achieves 79.4% prediction accuracy. Moreover, for the top L/5 predicted long-range loop contacts, the prediction accuracy reaches 56.4%. These results demonstrate that the new MemBrain is promising for transmembrane protein contact map prediction.

1 Introduction

Integral membrane proteins play essential functional roles in living organisms and are involved in various crucial cellular processes, such as molecular transport, cell signaling and cell adhesion. On the drug market, it has been shown that more than half of all current drug targets are membrane proteins, and knowing their three-dimensional (3D) structures is valuable for drug design. However, the number of membrane protein structures in the Protein Data Bank (PDB) is relatively small compared with that of soluble proteins because of the experimental difficulties in their study (e.g. they are hard to crystallize). Fortunately, in recent years, several studies have suggested that inter-helix contacts can assist membrane protein structure prediction.

Residue contact prediction has been of long-term interest due to its critical importance and wide applications in protein structural bioinformatics. To date, a large number of methods have been proposed to predict residue contacts based on machine learning (ML) frameworks, correlated mutation analysis (CMA), or a combination of the two. Nevertheless, most currently available residue contact predictors were developed for soluble proteins, such as SVMSEQ, CMAPpro, DNCON, PhyCMAP, PconsC2, MetaPSICOV, CoinDCA and R2C. The reason could be that the limited number of membrane protein structures hinders the development of high-quality contact prediction models for membrane proteins because of the small training sample size. Even so, several predictors have been designed to predict inter-helix contacts for transmembrane (TM) proteins.

Inter-helix contacts can be predicted either by ML-based methods or by CMA-based approaches. The ML-based methods try to learn statistical models guided by informative sequence-derived features, for instance TMHcon, TMhit, MEMPACK, TMhhcp and COMSAT. These prediction models rely on ML algorithms such as neural networks (NN), support vector machines (SVM) or random forests (RF). The CMA-based approaches aim to detect residue contacts by analysing multiple sequence alignments (MSA) using local or global algorithms. This type of algorithm includes HelixCorr, mfDCA, PSICOV, plmDCA, GREMLIN and CCMpred. For a protein with sufficient homologous sequences, the CMA-based approaches can give precise predictions, while the ML-based methods perform better on proteins with few sequence homologs. Predictors that combine the above two classes of methods, e.g. MemBrain and MemConP, are also available. It has been demonstrated that combining ML- and CMA-based methods improves the prediction performance.

Recently, several methods have been developed to improve inter-helix residue contact prediction in TM proteins. MemConP, an updated version of TMHcon, incorporates a series of sequence-based features and correlated mutations generated by Freecontact to train an RF model on a non-redundant dataset. To better handle the case of insufficient homologous sequences, a hybrid method called COMSAT was proposed, which integrates SVM and mixed integer linear programming. When the statistical SVM model fails to predict any contacts, the optimization-based method works to maximize the cumulative potential of residue contacts. Despite this significant progress, there is still much room to further improve the prediction performance. For instance, at the current stage, a prediction accuracy of 65.6% was reported for the top L/5 inter-helix residue contact prediction, which is expected to be further enhanced.

Deep learning has been successfully applied to computer vision, speech recognition, natural language processing and also bioinformatics, because it can learn high-level abstract features from the original inputs and thus performs quite well by reducing the noise effects embedded in the original features. For instance, DeepBind uses a deep convolutional neural network (CNN) to predict the sequence specificities of protein binding, and RaptorX-Contact uses a deep residual network to predict protein contact maps. In this work, we applied a CNN to develop a new TM inter-helix residue contact prediction model and updated our former predictor MemBrain to further improve its performance.

For residue contact prediction, a series of sequence-based features are used to encode a residue pair. Among these features, the correlated mutation score indicates the potential of two residues forming a spatial contact. Usually, with two independent feature vectors for the two target residues, a direct concatenation of them is fed into the prediction model, which may lack straightforward biophysical meaning. Furthermore, the doubled dimensionality results in more CNN parameters to be optimized. Therefore, taking all the sequence-based features as inputs to the CNN may not be an optimal choice, especially when there are not enough training samples from membrane protein structures to fit the weights. Motivated by these observations, we developed a two-stage prediction framework, where the first stage produces a coarse prediction map that is then deeply refined by the second stage.

2 Materials and methods

2.1 Datasets

To make a fair comparison with previous studies, we selected the same benchmark training and test datasets used in MemConP, which were collected from the PDBTM database in 2015. The original training dataset contains 90 TM proteins. In this work, however, we excluded the proteins 2e74B and 4mt4A from the dataset, because 2e74B has too few inter-helix contacts and 4mt4A is a beta-barrel protein. The remaining 88 alpha-helical TM proteins form the final training dataset, and the test dataset contains 30 alpha-helical TM proteins. All the proteins in the benchmark datasets have at least 3 TM helices, and their locations were extracted from the PDBTM. Supplementary Table S1 lists the details of the training and test datasets. The high-quality benchmark datasets were screened rigorously so that: (1) resolution is less than 3.5 Å; (2) pairwise sequence identity is less than 35%; (3) pairwise TM-score is below 0.5; (4) proteins are from different Pfam families.

In order to evaluate our method with more TM proteins, we prepared a larger independent test dataset by considering only the first two criteria above. First, we downloaded all redundant alpha-helical TM proteins from the PDBTM and removed the proteins appearing in the training and test datasets or having fewer than 3 TM helices. Then, the remaining proteins were culled by running the PISCES server to get a non-redundant dataset. Next, we discarded the proteins in the non-redundant dataset sharing more than 35% sequence identity with any protein from the training dataset. Finally, 175 TM proteins were obtained (denoted as ITD35). We also created another independent test dataset with pairwise sequence identity less than 30% and no more than 30% sequence identity with the training dataset (155 TM proteins collected, denoted as ITD30). These two independent test datasets are listed in Supplementary Tables S2–S3.

2.2 Contact definition

In the literature, there are multiple definitions of residue contact. For instance, in the well-known Critical Assessment of protein Structure Prediction (CASP) competition, the contact definition is based on Cβ atoms, i.e. if the Euclidean distance between the Cβ atoms (Cα for GLY) of two residues is less than 8 Å, then the two residues are said to be in contact. But in the case of TM proteins, residue contacts are often determined according to the residues' heavy atoms. Concretely, two residues from different TM helices are considered to be in contact if the minimal distance between their side chain or backbone heavy atoms is less than 5.5 Å.

For a fair comparison, we used the definition based on heavy atoms for inter-helix residue contact prediction, which is the same as in previous studies. Using the above contact definition, we obtained only 19,920 contact residue pairs from the training dataset. To enlarge the positive set, we also took contacts (sequence separation >= 6) involving residues from loop regions into account, which resulted in 62,493 contact residue pairs. In the end, three types of contacts were present in the training dataset: (I) contacts between residues from TM helical regions, (II) contacts between residues from loop regions and (III) contacts between residues from TM helical regions and loop regions. By doing so, our new model is able to predict the contact map for the entire TM protein sequence, not just the TM helical regions.
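
To make the contact definition above concrete, the following sketch derives heavy-atom contacts from 3D coordinates. It is a minimal illustration rather than the authors' code: the input layout (a list of residue records with their heavy-atom coordinates and an optional TM-helix label) and the helper names are assumptions.

```python
import numpy as np

# Each residue is assumed to be a dict such as
# {"heavy_atoms": (n_atoms, 3) float array, "helix": helix_id or None}.

def min_heavy_atom_distance(res_a, res_b):
    """Smallest pairwise distance between the heavy atoms of two residues."""
    a = res_a["heavy_atoms"][:, None, :]        # (Na, 1, 3)
    b = res_b["heavy_atoms"][None, :, :]        # (1, Nb, 3)
    return float(np.sqrt(((a - b) ** 2).sum(axis=-1)).min())

def contact_type(res_a, res_b):
    """Type I: both residues in TM helices; II: both in loops; III: mixed."""
    flags = (res_a["helix"] is not None, res_b["helix"] is not None)
    return {(True, True): "I", (False, False): "II"}.get(flags, "III")

def build_contact_map(residues, cutoff=5.5, min_separation=6):
    """Symmetric 0/1 contact map under the heavy-atom definition of the paper."""
    n = len(residues)
    contacts = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + min_separation, n):
            if min_heavy_atom_distance(residues[i], residues[j]) < cutoff:
                contacts[i, j] = contacts[j, i] = 1
    return contacts
```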

When residue contact prediction is treated as a classification problem (distinguishing contact pairs from non-contact pairs), it is actually an imbalanced learning problem, i.e. the non-contact pairs far outnumber the contact ones. Previous statistics have shown that the contact density is approximately 2%–3%. Thus, to balance the positive and negative training samples, we used an under-sampling strategy, where all the positive samples and a subset of the negative samples are used for model training. To determine a proper sampling ratio with respect to the positive samples, we tested eight ratios and found that a ratio of 1:5 gives good and robust results (Supplementary Table S4).
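
As a small sketch of the 1:5 under-sampling described above, assuming the training samples are indexed by a 0/1 label array (all names here are illustrative):

```python
import numpy as np

def undersample_negatives(labels, ratio=5, seed=0):
    """Keep all positive samples and `ratio` times as many random negatives."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == 0)
    n_neg = min(len(neg_idx), ratio * len(pos_idx))
    keep = np.concatenate([pos_idx, rng.choice(neg_idx, size=n_neg, replace=False)])
    rng.shuffle(keep)
    return keep          # indices of the balanced training subset
```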

2.3 Feature extraction

For machine learning algorithms, discriminative features are crucial for model building and for classifying unknown samples. Ab initio residue contact prediction mainly relies on sequence-derived information. In this work, six different types of input features were extracted for training the MemBrain model, including amino acid composition, secondary structure, solvent accessibility, residue conservation score, contact potential and correlated mutation score, all of which are commonly used to build ML models. These features are described in detail below.

Single-residue features. Amino acid composition represents the appearance frequencies of the 20 amino acids, plus the gap, occurring at a certain position in the MSA. We used HHblits to search against the bundled UniProt20 database with three iterations to generate an MSA for each protein sequence. The predicted secondary structure and solvent accessibility were calculated by running PSIPRED and SOLVPRED, respectively. Residue conservation measures the probability of a given residue mutating into another.

The conservation score of each column in the MSA was calculated according to the Shannon entropy, which is defined as follows:

S = −Σ_i f_i · log(f_i)

where f_i is the frequency of a certain residue type occurring at the column of interest. For these local features, a sliding window of size 9 was used to encode the current and neighboring positions. In addition, for a given residue pair (i, j), a window of size 5 centered at position (i + j)/2 was used to extract extra local features.
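
The sketch below computes the per-column frequency profile and the Shannon-entropy conservation score from an MSA given as a list of aligned strings. The 21-symbol alphabet (20 amino acids plus the gap character '-') and the natural logarithm are assumptions, since the text does not state them explicitly.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"   # 20 amino acids + gap (assumed encoding)

def column_profile(msa, col):
    """Frequency of each alphabet symbol in one MSA column."""
    counts = np.zeros(len(ALPHABET))
    for seq in msa:
        idx = ALPHABET.find(seq[col].upper())
        if idx >= 0:
            counts[idx] += 1
    return counts / max(counts.sum(), 1.0)

def conservation_score(msa, col):
    """Shannon entropy of a column; lower entropy means higher conservation."""
    f = column_profile(msa, col)
    nonzero = f[f > 0]
    return float(-(nonzero * np.log(nonzero)).sum())
```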

Residue-pair features. Contact potential is a mean value, computed by averaging the contact energies of all residue pairs that pass through two particular columns of the MSA. The correlated mutation score between two columns in the MSA indicates the potential of that residue pair forming a molecular contact. It is an informative descriptor because the inferred values can be directly used for residue contact prediction. In recent years, some elegant global CMA-based algorithms have been proposed to detect coevolving residue pairs in MSAs. When sufficient homologous sequences are available, the resulting contact predictions are quite reliable. There are only a few solved 3D membrane protein structures in the PDB compared with soluble proteins; however, many homologous sequences can be found because membrane proteins constitute approximately 30% of all proteins. In the training dataset, the smallest number of homologous sequences is 48 (for the protein 1yewC) and the largest is 46,701 (for the protein 4tpjB). The average number of homologous sequences over the whole training dataset is 3,464, which means that the correlated mutation score is very powerful for model training and evaluation. To reduce calculation bias, we used five different algorithms to calculate this type of feature, i.e. MI, MIp, mfDCA, PSICOV and CCMpred. Since residue contacts are densely distributed in native structures, a 9 by 9 window centered at position (i, j) was applied to extract nearby correlated mutation information.
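
A short sketch of extracting such a 9×9 window centered at (i, j) from an L×L correlated-mutation matrix, zero-padding at the chain ends; it mirrors the idea rather than the authors' exact implementation:

```python
import numpy as np

def centered_patch(matrix, i, j, size=9):
    """size x size window of `matrix` centered at (i, j), zero-padded at edges."""
    half = size // 2
    padded = np.pad(matrix, half, mode="constant", constant_values=0.0)
    # (i, j) of the original matrix sits at (i + half, j + half) in `padded`,
    # so the centered window is simply:
    return padded[i:i + size, j:j + size]
```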

2.4 MemBrain-contact 2.0 prediction model

MemBrain-contact 2.0 is a hierarchical two-stage residue contact predictor. The first stage is a conventional two-hidden-layer perceptron. The 1084-dimensional sequence-based features (26×9×2 + 26×5 + 6×9×9) are fed into this neural network, which has 150 units in each of its two hidden layers. The single output indicates the contact potential of a given residue pair. The second stage is the fusion of three CNNs, which have one, two and three convolution layers, respectively. We also tried more convolution layers, but found no improvement due to the insufficient training samples. On top of each CNN, a fully connected layer with 150 hidden units is used to predict the final contact probability. For a target residue pair (i, j), the second stage takes as inputs four 25 by 25 patches from the raw contact maps generated by the first stage of MemBrain, mfDCA, PSICOV and CCMpred, where each patch is centered at position (i, j). These sub contact maps then go through the subsequent convolution and max-pooling layers.
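
A rough PyTorch sketch of the two stages as described above: a 1084-dimensional two-hidden-layer perceptron (150 units per layer) for stage one, and one CNN branch of stage two operating on the four stacked 25×25 sub contact maps with a 5×5 convolution, 2×2 max-pooling and a 150-unit fully connected layer. The activation functions and the number of convolution filters are not given in the text and are assumptions here.

```python
import torch
import torch.nn as nn

FEATURE_DIM = 26 * 9 * 2 + 26 * 5 + 6 * 9 * 9    # = 1084, as stated in the paper

class StageOneMLP(nn.Module):
    """Two-hidden-layer perceptron producing a coarse contact probability."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEATURE_DIM, 150), nn.ReLU(),   # activations assumed
            nn.Linear(150, 150), nn.ReLU(),
            nn.Linear(150, 1), nn.Sigmoid(),
        )

    def forward(self, x):            # x: (batch, 1084)
        return self.net(x)

class StageTwoCNN(nn.Module):
    """One branch of the second stage: a single 5x5 convolution over the four
    stacked 25x25 patches (the 2- and 3-layer branches stack further conv/pool
    blocks in the same fashion). The filter count of 16 is an assumption."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=5), nn.ReLU(),   # 25x25 -> 21x21
            nn.MaxPool2d(2),                              # 21x21 -> 10x10
        )
        self.fc = nn.Sequential(
            nn.Linear(16 * 10 * 10, 150), nn.ReLU(),
            nn.Linear(150, 1), nn.Sigmoid(),
        )

    def forward(self, patches):      # patches: (batch, 4, 25, 25)
        return self.fc(self.conv(patches).flatten(1))
```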

The convolution operator is formulated by:

C(x, y) = Σ_{u=1..5} Σ_{v=1..5} P(x + u − 1, y + v − 1) · F(u, v)

where P is a 25 by 25 patch and F is a 5 by 5 filter. The max-pooling is a form of down-sampling that outputs the maximum value of a 2 by 2 patch of interest. Figure 1 illustrates the flow chart of the new MemBrain-contact 2.0 protocol.
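
To restate the two operations numerically, the snippet below applies a valid 5×5 convolution and a non-overlapping 2×2 max-pooling to a 25×25 patch with plain NumPy; it is purely didactic and independent of any deep learning framework.

```python
import numpy as np

def conv2d_valid(P, F):
    """Valid 2D convolution of patch P (25x25) with filter F (5x5) -> 21x21."""
    h, w = P.shape[0] - F.shape[0] + 1, P.shape[1] - F.shape[1] + 1
    out = np.zeros((h, w))
    for x in range(h):
        for y in range(w):
            out[x, y] = np.sum(P[x:x + F.shape[0], y:y + F.shape[1]] * F)
    return out

def max_pool_2x2(A):
    """Non-overlapping 2x2 max-pooling (any odd trailing row/column is dropped)."""
    h, w = A.shape[0] // 2, A.shape[1] // 2
    return A[:h * 2, :w * 2].reshape(h, 2, w, 2).max(axis=(1, 3))

P, F = np.random.rand(25, 25), np.random.rand(5, 5)
pooled = max_pool_2x2(conv2d_valid(P, F))     # shape (10, 10)
```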

During the training phase, we used batch gradient descent to minimize the cross entropy, with 100 training samples per batch, over 30 epochs. We also introduced L2-norm regularization to avoid overfitting.

The loss function is defined as follows:

Loss = −(1/N) Σ_{i=1..N} [ y_i·log(p_i) + (1 − y_i)·log(1 − p_i) ] + λ·||w||²

where N is the number of training samples, y_i is the expected output, p_i is the prediction, w denotes the parameters of the entire model, and λ balances the cross entropy and the penalty term and is set to 1e-4. The learning rates for the first-stage multilayer perceptron and the second-stage CNN are 0.001 and 0.01, respectively. In the second stage, we trained three CNN models, whose outputs were averaged to give the final predictions.
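
Continuing the earlier PyTorch sketch, the snippet below shows how the stated objective (binary cross entropy plus an L2 penalty with λ = 1e-4) and learning rates might be wired up. The optimizer choice (plain SGD) and the data loader `batches_of_100` are assumptions, not details taken from the paper.

```python
import torch

LAMBDA = 1e-4                                  # weight of the L2 penalty (paper)

def regularized_loss(model, probs, targets):
    """Binary cross entropy plus LAMBDA * ||w||^2 over all model parameters."""
    bce = torch.nn.functional.binary_cross_entropy(probs, targets)
    l2 = sum((p ** 2).sum() for p in model.parameters())
    return bce + LAMBDA * l2

stage2 = StageTwoCNN()                                       # from the sketch above
optimizer = torch.optim.SGD(stage2.parameters(), lr=0.01)    # 0.001 for stage one

for epoch in range(30):                        # 30 epochs
    for patches, targets in batches_of_100:    # assumed loader of 100-sample batches
        optimizer.zero_grad()
        probs = stage2(patches).squeeze(1)
        loss = regularized_loss(stage2, probs, targets.float())
        loss.backward()
        optimizer.step()
```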

2.5 Evaluation criteria

The predictions can be separated into four categories, i.e. true positives (TP), false negatives (FN), false positives (FP) and true negatives (TN). TP is the group of correctly predicted positive samples, FN is the set of positive samples that are mistakenly predicted as negative, FP includes wrongly predicted negative samples and TN denotes accurately predicted negative samples. Based on these counts, three derived evaluation criteria were used to compare the prediction performance with state-of-the-art methods.

The first performance criterion is accuracy (Acc). It is defined as the fraction of correctly predicted contacts with respect to all the predicted contacts:

Acc = TP / (TP + FP)

where TP and FP are defined above. The accuracy is calculated over the top predictions, such as the top L/5 or top L, where L is the length of the concatenated TM helices.

The second performance criterion is coverage (Cov), which is also known as sensitivity. It is defined as the ratio of correctly predicted contacts to all the observed true inter-helix contacts in the native structure:

Cov = TP / (TP + FN)

Thus, keeping TP unchanged, a larger coverage is obtained on proteins with fewer native contacts.

The last measure is the Matthews correlation coefficient (MCC), which is used to evaluate the performance and robustness of a given predictor. It is formulated as follows:

MCC = (TP·TN − FP·FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
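
The sketch below evaluates the top L/5 predictions with these three criteria, given a predicted probability map and the native contact map. Restricting the ranking to the upper triangle with sequence separation >= 6 is an assumption about how the residue pairs are enumerated.

```python
import numpy as np

def evaluate_top(pred, native, L, frac=0.2, min_sep=6):
    """Acc, Cov and MCC over the top round(frac * L) scored residue pairs.

    pred, native: (n, n) arrays of predicted probabilities and 0/1 native contacts;
    L: length of the concatenated TM helices (frac=0.2 gives the top L/5)."""
    n = pred.shape[0]
    iu = np.triu_indices(n, k=min_sep)            # upper triangle, j - i >= min_sep
    order = np.argsort(pred[iu])[::-1]            # highest scores first
    k = max(int(round(frac * L)), 1)
    chosen = np.zeros(order.size, dtype=bool)
    chosen[order[:k]] = True
    truth = native[iu].astype(bool)

    tp = int(np.sum(chosen & truth))
    fp = int(np.sum(chosen & ~truth))
    fn = int(np.sum(~chosen & truth))
    tn = int(np.sum(~chosen & ~truth))
    acc = tp / max(tp + fp, 1)
    cov = tp / max(tp + fn, 1)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return acc, cov, mcc
```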

3 Results

3.1 Influence of neighboring contact pattern

In the current state-of-the-art residue contact predictors, correlated mutations are incorporated into the protocols to enhance the prediction ability. Although CMA-based approaches require sufficient homologous sequences in the MSA, they are still valuable for ML-based methods. As we know, CMA-based algorithms perform poorly on CASP hard targets because few homologous sequences can be found for them. In the case of TM proteins, although there are not many solved 3D structures, there are enough homologous sequences to analyse, as stated in the previous section. Inspired by the intrinsic characteristic of protein structures that residue contacts are densely distributed, the first stage of the MemBrain protocol encodes correlated mutations with a 9 by 9 window and then flattens them into a feature vector that can be fed into a traditional multilayer perceptron. This strategy has been demonstrated to be helpful for improving the prediction performance. Here, we trained four one-convolution-layer CNNs with sub contact maps generated by mfDCA, PSICOV, CCMpred and the first stage of MemBrain, respectively, where each patch covers the target and nearby residue pairs, in order to see the influence of the neighboring contact pattern.

Figure 2 shows the performance improvement after using the initial sub contact maps to train the CNN models. No matter what the input source is, the CNN can further increase the prediction accuracy visibly. From Table 1, we can see that PSICOV achieves an average accuracy of 55.0%/53.2% for the top L/5 predicted inter-helix contacts on the training/test dataset. When we decomposed each contact map from the training dataset into a series of patches and used these sub contact maps to train a CNN model, the prediction accuracy on the test dataset increased to 69.2%, which is 16.0% higher than the initial prediction. Similar conclusions can be drawn for mfDCA, CCMpred and MemBrain. After using the CNN framework, mfDCA/CCMpred/MemBrain achieve 72.0%/73.8%/76.4% prediction accuracy, with improvements of 14.8%, 11.9% and 5.3%, respectively. From the four CNN models, we can see that the improvement for MemBrain is much smaller than those obtained by the CMA-based methods. This is because our ML-based model already uses correlated mutations as neighboring contact information. Even so, its prediction accuracy is increased from 71.1% to 76.4%. In addition, given the same MSAs for the proteins in the test dataset, CCMpred performs much better than PSICOV, achieving 8.7% higher prediction accuracy. This difference is decreased to 4.6% with the help of the CNN. These results demonstrate that the neighboring contact pattern is indeed important for residue contact prediction. However, structural features hidden in the sub contact maps are omitted by traditional serial feature combination. When we take the patches as inputs, the latent structural features can be mined, which improves the prediction performance.

3.2 Evaluation of inter-helix residue contact prediction

In this section, we compare the performance of MemBrain with the state-of-the-art inter-helix contact predictor MemConP. We also list the results of three representative CMA-based approaches, i.e. mfDCA, PSICOV and CCMpred. Since many homologous sequences can be found for most TM proteins, the CMA-based approaches can also give good predictions. On the training dataset, we used a strict sequence-based jackknife cross-validation to evaluate our method. During the validation process, each protein of the training dataset was in turn selected to test the model, which was trained using the remaining proteins. Note that MemConP used a 10-fold cross-validation on the training dataset. Table 1 shows the results of the different methods for inter-helix residue contact prediction.

On both the training and test datasets, CCMpred achieves the best performance among the three CMA-based approaches in terms of all evaluation criteria. Compared with the ML-based predictor MemConP, CCMpred gives close accuracies for the top L/5 predicted inter-helix contacts, with differences of 5.6% and 3.7% on the training and test datasets, respectively. However, the differences increase to 12.1% and 6.9% for the top L predicted inter-helix contacts. The reason could be that CMA-based predictions are widely spread over the contact map, so when more contacts are evaluated there is a higher chance of introducing more false positives. The first stage of MemBrain achieves 73.3%/71.1% prediction accuracy on the training/test dataset for the top L/5 predicted inter-helix contacts, which is 2.4%/5.5% higher than that of MemConP. When we use the CNN architecture to enhance MemBrain in the second stage, we obtain 81.6%/79.4% accuracy, 12.4%/11.0% coverage and an MCC of 0.308/0.285 on the training/test dataset, which is 10.7%/13.8% higher than MemConP in terms of accuracy. For the top L predicted inter-helix contacts, MemBrain also gives 8.1%/13.0% higher accuracy, 4.1%/7.2% higher coverage and 0.064/0.101 higher MCC on the training/test dataset compared with MemConP. As shown in Supplementary Figure S1, the area under the curve (AUC) of the final MemBrain is 0.915, which is higher than those of the first stage of MemBrain and of the other methods.

To gain deeper insight into the difference between the first and second stages of MemBrain, Figure 3 reports the prediction accuracies of both stages for the 118 TM proteins from the training and test datasets. As can be seen, most targets are predicted with higher accuracy after the CNN refinement is applied. Among these targets, the largest improvement occurs for the protein 4mndA, where the top L/5 prediction accuracy increases from 25.0% to 65.0%. Supplementary Figure S2 also shows the comparison of the top L prediction performance. In Supplementary Table S5, we list the detailed predictions for each protein from the training and test datasets. On the training dataset, 763 true contacts are eliminated from the top L predictions by the second stage of MemBrain. However, 2309 new true contacts are introduced, which improves the average accuracy from 45.6% to 56.4%. On the test dataset, the extra 577 true contacts result in a 10.4% accuracy improvement.

3.3 Evaluation of contact prediction for loop region

Since all TM protein contact predictors focus on the performance of inter-helix contacts, in this section we evaluate long-range type II and III contact residue pairs (one or both residues are from the loop regions) to see how reliable the prediction of these types of contacts is. Because MemBrain was trained using the entire native contact maps, it is able to predict type II and III contacts. For these kinds of contacts, we can also use contact predictors developed for soluble proteins. Here, we show the prediction performance of MetaPSICOV. We also list the performance of mfDCA, PSICOV and CCMpred. Since these three approaches can be viewed as unsupervised learning algorithms, they are also suitable for inferring type II and III contacts. For the purpose of quantitative comparison with inter-helix residue contact prediction, the definition of L is the same as above, i.e. the length of the concatenated TM helices. However, the number of native contacts differs for each contact type, covering only the corresponding type of contacts in the native structure. Table 2 lists the prediction performance of the different methods for type II contacts.

CCMpred performs the best among the three CMA-based approaches. The ML-based methods MetaPSICOV and MemBrain work better than CCMpred because they use additional sequence-derived features to predict residue contacts. MetaPSICOV achieves 41.5%/48.0% prediction accuracy and an MCC of 0.173/0.172 for the top L/5 predictions on the training/test dataset, while MemBrain reaches 52.1%/56.4% prediction accuracy, which is 10.6%/8.4% higher than that of MetaPSICOV. Also, MemBrain gives the best MCC of 0.224/0.217 on the training/test dataset. For the top L predicted loop contacts, MemBrain achieves 30.5% and 35.3% prediction accuracy on the training and test datasets, respectively, which is 7.1% and 5.4% higher than that of MetaPSICOV.

For type III contacts, we evaluate the top L/5 predicted contacts; Supplementary Table S6 shows the prediction performance. MemBrain achieves 45.0%/37.8% prediction accuracy for the top L/5 predicted contacts on the training/test dataset. The results demonstrate that MemBrain is capable of predicting residue contacts for the entire TM protein sequence. Compared with inter-helix residue contact prediction, where MemBrain achieves 79.4% prediction accuracy for the top L/5 predicted contacts on the test dataset, it provides only 56.4%/37.8% prediction accuracy for type II/III contacts. These results show that although MemBrain was trained with relatively few inter-helix residue pairs, it still performs much better on inter-helix contacts. This interesting phenomenon indicates that inter-helix contacts are more conserved and hence easier for a contact predictor to detect. Due to the flexibility of the loop regions, modeling loop contacts remains a challenging task.

3.4 Performance on the bigger independent test dataset

On the training and test datasets, MemBrain achieves 81.6% and 79.4% prediction accuracy for the top L/5 predicted inter-helix contacts, respectively. To evaluate MemBrain with more TM proteins, a larger independent test dataset was prepared without considering the TM-score criterion. Table 3 lists the overall performance of inter-helix residue contact prediction, where MemBrain achieves 84.5%/60.8% prediction accuracy, 12.3%/42.4% coverage and an MCC of 0.311/0.488 for the top L/5 and top L predicted contacts on the ITD35. When tested on the ITD30, where the sequence identity threshold is reduced from 35% to 30%, the average accuracy slightly decreases to 84.0% for the top L/5 predicted contacts. This is because the performance of CMA-based approaches is not sensitive to sequence identity but relies on the number of effective sequences in the MSA, and thus the prediction accuracy does not decrease much. From Tables 1 and 3, we can see that MemBrain performs better on the ITD35 and ITD30 than on the test dataset. Also, the three CMA-based approaches achieve better performance on these two datasets than on the test dataset. This can partially explain why MemBrain gives better predictions on the ITD35 and ITD30, because it takes the CMA-based predictions as input features. Since these two datasets were screened mainly based on sequence identity, they could be more redundant than the test dataset in terms of the TM-score criterion.

For each protein from the ITD35, we used TM-align to obtain the largest TM-score against all the proteins from the training dataset. As can be seen in Figure 4, in general, the larger the TM-score, the higher the prediction accuracy MemBrain gives. Three cases have low prediction accuracy (less than 10.0%) despite TM-scores above 0.5. The reason is that low-quality MSAs result in unreliable correlated mutations, which lead to poor performance by MemBrain. Of these 175 TM proteins, 42 have a largest TM-score below 0.5 against the training dataset. For this non-redundant sub-dataset, MemBrain achieves 79.4% and 51.3% prediction accuracy for the top L/5 and L predicted inter-helix contacts, respectively, which is comparable with its performance on the test dataset. The results demonstrate that MemBrain is robust for inter-helix contact prediction. In addition, when we remove the redundancy among protein sequences, sequence identity alone is not enough to obtain a non-redundant dataset. There exist sequence pairs that have low sequence identity but a large TM-score, which is known as the 'twilight zone' phenomenon. Therefore, structural similarity may also need to be considered to ensure that the proteins of interest are not similar to each other.

3.5 Case study

In this section, we use the TM protein 3wajA from the test dataset as an illustrative case to show the efficacy of the CNN architecture. This protein has 13 TM helices. Among the top L/5 predicted inter-helix contacts, 42 out of 46 residue contacts are correctly predicted by MemBrain, giving a prediction accuracy of 91.3%. For the top L predicted contacts, the prediction accuracy drops to 47.4%. Before applying the CNN, the prediction accuracies for the top L/5 and L predicted inter-helix contacts are only 69.6% and 33.8%, respectively. To dig into the data, Supplementary Figure S3 shows the prediction details, where green, red and blue points represent native contacts, contacts predicted by the first stage of MemBrain and contacts predicted by the final MemBrain, respectively. As can be seen, red points within the red ellipses (false positives) are partially or totally eliminated with the help of the CNN. Also, more points are introduced inside the blue ellipses (true positives). The results show that, from the CNN's point of view, a residue pair surrounded by more contact pairs has a higher chance of being predicted as a positive pair. This rule is also consistent with the observation that contacts are densely distributed in native structures.

4 Discussions

In recent years, residue contact prediction has reached a high level of performance with machine learning and data mining techniques. For inter-helix residue contact prediction, MemConP used a high-quality dataset to train an RF model and improved the prediction performance. Although few non-redundant membrane protein structures are available, compared with several years ago we can now obtain more structures with which to study the characteristics of contact residue pairs. By observing helix packing, we can see that residue contacts are densely distributed in native structures. This conclusion can also be extended to soluble proteins. Conventional serial feature combination of neighboring contact potentials can make partial use of this kind of information, but the structural relationship of the neighboring contact pattern could be missed. In this work, we used a CNN architecture to mine the latent structural features. Our MemBrain achieves 79.4% prediction accuracy for the top L/5 predictions on the test dataset, which is a significant improvement for inter-helix residue contact prediction given the limited training samples.

MemBrain was trained with inter-helix residue pairs and also with long-range residue pairs of types II and III. On one hand, this lets us better fit the parameters of the CNN; on the other, MemBrain can output the entire contact map of the query TM protein sequence, not just the inter-helix contact map. We compared the performance of MemBrain on loop contacts with MetaPSICOV and found that MemBrain performs better, achieving 56.4% prediction accuracy for the top L/5 predicted loop contacts. However, this is still worse than inter-helix contact prediction, where a prediction accuracy of 79.4% is reached on the test dataset. Although there are more contact pairs (2.1 times) in the loop regions to train the MemBrain model, inter-helix contacts are easier to detect. The reason may be that inter-helix contacts are more conserved.

By evaluating MemBrain on the independent test datasets, we show that it is important to consider both sequence identity and structural similarity when removing homologous redundancy. We separated the ITD35 dataset into two sub-datasets, one consisting of proteins whose largest TM-score is below 0.5 and the other consisting of the remaining proteins. On the first sub-dataset, MemBrain achieves 79.4% and 51.3% prediction accuracy for the top L/5 and L predicted contacts, respectively. On the second sub-dataset, the corresponding prediction accuracies increase to 86.1% and 63.8%. Thus, we suggest that both sequence and structural similarity should be taken into account when dealing with homologous redundancy.

The promising performance of the new MemBrain-contact 2.0 algorithm is due to the new hierarchical design of the prediction model and the CNN's powerful capability of mining latent structural features existing in the original feature space. The so-called 'curse of dimensionality' is a typical challenge of this study, i.e. more samples are required to train a reliable predictor as the number of feature dimensions increases. Because relatively few solved membrane protein structures are available for training, the original 1084 feature dimensions resulting from combining two target residues become a heavy high-dimensional computational load. Our first-stage model transforms the high-dimensional feature space into a coarse prediction represented as a 2D probability image, from which the CNN learns the latent spatial structural correlations. Our results show that the final prediction of the hierarchical two-stage model is 8.3% higher than that of the first stage in terms of the top L/5 predictions, demonstrating the efficacy of the new protocol.

Compared with our previous MemBrain-contact 1.0 model, the new predictor is significantly improved in the following aspects: (1) a more powerful learning algorithm is used, i.e. we applied the deep learning algorithm CNN to mine the latent structural features of the neighboring contact pattern and thus enhance the prediction ability; (2) the application scope is extended. In our previous model, predictions were made only for inter-helix contacts, whereas MemBrain 2.0 is now capable of predicting residue contacts in the full chain. Currently, the MemBrain predictor is built on alpha-helical TM proteins, but it could potentially be extended to beta-barrel membrane proteins. A potential challenge is that the number of solved beta-barrel membrane protein structures is even smaller than that of alpha-helical TM proteins, which would result in a smaller training dataset than the one used in the current study. One important future improvement of MemBrain will consist in developing a specific residue contact prediction engine for beta-barrel membrane proteins.
