生信问题记录

我的需求

input：

FASTA文件，含六千余个蛋白序列。命名为FA.fasta
txt文件，经过interpro注释后，筛选出五千余个蛋白，将五千余个蛋白ID导出到txt文件中，每行一个。命名为ID.txt

output：

FASTA文件，根据ID.txt里的ID从FA.fasta里提取出对应的五千余个序列。提取结果也以文件的格式保存为out_file.fasta

环境

Windows 10 64x
命令行运行py

解决流程

如果编程娴熟，可以直接用perl或Python写脚本；碍于个人技艺生疏，直接面向百度编程。搜索发现，解决方案非常多，最终某位网友的Python脚本成功运行

提示如果你是用ID行，形如 >RLS74643.1 dihydrofolate reductase [Planctomycetes bacterium] 来提取序列，那么更为简单，可以使用联川生物的免费在线小工具

参考

1c43f522e1c3的帖子

https://www.jianshu.com/p/b7032dfae44e

步骤

直接复制网友的代码会出很多错，要先安装click模块和biopython的包

安装包

pip install click
pip install biopython

复制粘贴，另存为get_seqs_by_id.py

#coding:utf-8

import click

from Bio import SeqIO

@click.command()

@click.option('-f', '--fastafile', help='Input a fasta file', required=True)

@click.option('-i', '--idfile', help='Input an idlist', required=True)

@click.option('-o', '--outfile', help='Input the name of result file', default='result_out.fa')

def main(fastafile="FA.fa",idfile="ID.txt",outfile= "result_out.fa"):

    with open(idfile) as id_handle:
        wanted = set(line.rstrip("\n").split(None,1)[0] for line in id_handle)

    print("Found %i unique identifiers in %s" % (len(wanted), idfile))

    records = (r for r in SeqIO.parse(fastafile, "fasta") if r.id in wanted)

    count = SeqIO.write(records, outfile, "fasta")

    print("Saved %i records from %s to %s" % (count, fastafile, outfile))

    if count < len(wanted):

        print("Warning %i IDs not found in %s" % (len(wanted) - count, fastafile))
 
if __name__ == '__main__':
    main()

将输入文件和py文件放在同一目录下，命令行输入

python get_seqs_by_id.py -f **.fasta -i ID.txt -o out_file.fasta

如果顺利出结果再好不过了，然而我报了错

报错不要慌，只有最后一行是关键信息

debug

错误的意思是：Unicode的解码（Decode）出现错误（Error）了，以gbk编码的方式去解码（该字符串变成Unicode），但是此处通过gbk的方式，却无法解码（can’t decode ）。“illegal multibyte sequence”意思是非法的多字节序列，即没法（解码）了。
网上解决方案
于是我在open函数这一行加入encoding=‘UTF-8’

with open(idfile,encoding='UTF-8') as id_handle:

结果又双叒报错了！
错误信息
耐心搜索，发现可能是用Python读文件（txt或者csv），出现编码错误
打开我的ID.txt一看，从Excel复制的信息居然是UTF-16！怪不得一直报错。把txt文件用UTF-8另存就好了……

运行成功

运行成功是这样的，输出一个fasta文件，正是我想要的。

Python知识库最新文章

Python中String模块

【Python】 14-CVS文件操作

python的panda库读写文件

使用Nordic的nrf52840实现蓝牙DFU过程

【Python学习记录】numpy数组用法整理

Python学习笔记

python字符串和列表

python如何从txt文件中解析出有效的数据

Python编程从入门到实践自学/3.1-3.2

python变量

加:2022-03-21 20:45:38 更:2022-03-21 20:49:48

360图书馆购物三丰科技阅读网日历万年历 2026年1日历

-2026/1/9 0:42:27-

图片自动播放器
↓图片自动播放器↓

TxT小说阅读器
↓语音阅读,小说下载,古典文学↓

一键清除垃圾
↓轻轻一点,清除系统垃圾↓

图片批量下载器
↓批量下载图片,美女图库↓

网站联系: qq:121756557 email:121756557@qq.com IT数码