Sequence input and output: the Bio.SeqIO module [biopython forum]

The Bio.SeqIO module aims to provide a simple interface for working with sequence files in many different formats in a uniform way.
>>> from Bio import SeqIO
>>> help(SeqIO)
...

Post 1, 2017-01-10 13:58

biopython.org/DIST/docs/api/Bio.SeqIO-module.html

Post 2, 2017-01-10 13:59

http://biopython.org/wiki/SeqIO

Post 3, 2017-01-10 13:59

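Parsing/reading sequences:
Bio.SeqIO.parse()
Reads a sequence file and returns an iterator that yields SeqRecord objects. It takes two required arguments:
The first argument is a file name or a handle.
The second argument is a lowercase string specifying the sequence format.
For example: SeqIO.parse("ls_orchid.fasta", "fasta")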
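
A quick way to try this without a file on disk: since the first argument may be a handle, an in-memory io.StringIO should work as well. A minimal sketch, assuming Python 3 and a made-up two-record FASTA string (the names seq1 and seq2 are purely illustrative):

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical FASTA data, made up for illustration only.
fasta_text = ">seq1 first example\nACGTACGT\n>seq2 second example\nGGGCCC\n"

# SeqIO.parse() accepts a handle in place of a file name and yields
# one SeqRecord per entry.
parsed = []
for record in SeqIO.parse(StringIO(fasta_text), "fasta"):
    parsed.append((record.id, str(record.seq), len(record)))
    print(record.id, len(record))
```

The loop body sees one record at a time, so nothing forces the whole file into memory.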

Post 4, 2017-01-10 14:01

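To handle a file that contains only one sequence record, use the function Bio.SeqIO.read(). It takes the same arguments as Bio.SeqIO.parse(). If the file contains exactly one record, it returns a single SeqRecord object; otherwise it raises an exception.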
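
To see both behaviours side by side, here is a small sketch, assuming Python 3 and made-up in-memory FASTA data rather than a real file. In the Biopython versions I am aware of, the exception raised is ValueError:

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical data: one handle with a single record, one with two.
one_record = ">only_one\nACGT\n"
two_records = ">first\nACGT\n>second\nGGCC\n"

# Exactly one record: read() returns the SeqRecord itself, not an iterator.
single = SeqIO.read(StringIO(one_record), "fasta")
print(single.id)

# More than one record: read() raises ValueError.
try:
    SeqIO.read(StringIO(two_records), "fasta")
    raised = False
except ValueError as err:
    raised = True
    print("read() refused the input:", err)
```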

Post 5, 2017-01-12 14:17

Extracting a list of sequence IDs from a file:
>>> from Bio import SeqIO
>>> identifiers = [seq_record.id for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank")]
>>> identifiers
['Z78533.1', 'Z78532.1', 'Z78531.1', 'Z78530.1', 'Z78529.1', 'Z78527.1', ..., 'Z78439.1']

Post 8, 2017-01-12 14:20

Stepping through the records in a sequence file:
from Bio import SeqIO
record_iterator = SeqIO.parse("ls_orchid.fasta", "fasta")
first_record = next(record_iterator)  # the .next() method only exists on Python 2
print(first_record.id)
print(first_record.description)
second_record = next(record_iterator)
print(second_record.id)
print(second_record.description)
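
One thing to watch for when stepping through records by hand: once the iterator is exhausted, asking for another record raises StopIteration. Passing a default as the second argument to the built-in next() avoids that. A sketch assuming Python 3 and a made-up single-record FASTA string:

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical one-record FASTA data, for illustration only.
record_iterator = SeqIO.parse(StringIO(">lonely\nACGT\n"), "fasta")

first_record = next(record_iterator)         # the only record
second_record = next(record_iterator, None)  # nothing left, so the default comes back

print(first_record.id)
print(second_record)
```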

Post 9, 2017-01-12 14:24

If you only need the first record:
from Bio import SeqIO
first_record = next(SeqIO.parse("ls_orchid.gbk", "genbank"))

Post 10, 2017-01-12 14:25

Getting the records in a sequence file as a list:
from Bio import SeqIO
records = list(SeqIO.parse("ls_orchid.gbk", "genbank"))
print("Found %i records" % len(records))
print("The last record")
last_record = records[-1]  # using Python's list tricks
print(last_record.id)
print(repr(last_record.seq))
print(len(last_record))
print("The first record")
first_record = records[0]  # remember, Python counts from zero
print(first_record.id)
print(repr(first_record.seq))
print(len(first_record))

Post 11, 2017-01-12 14:27

Output:
Found 94 records
The last record
Z78439.1
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', IUPACAmbiguousDNA())
592
The first record
Z78533.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', IUPACAmbiguousDNA())
740

Post 12, 2017-01-12 14:32

Extracting data:
from Bio import SeqIO
record_iterator = SeqIO.parse("ls_orchid.gbk", "genbank")
first_record = next(record_iterator)
print(first_record)
Output:
ID: Z78533.1
Name: Z78533
Description: C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA.
Number of features: 5
/sequence_version=1
/source=Cypripedium irapeanum
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', ..., 'Cypripedium']
/keywords=['5.8S ribosomal RNA', '5.8S rRNA gene', ..., 'ITS1', 'ITS2']
/references=[...]
/accessions=['Z78533']
/data_file_division=PLN
/date=30-NOV-2006
/organism=Cypripedium irapeanum
/gi=2765658
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', IUPACAmbiguousDNA())

Post 13, 2017-01-12 14:34

The annotations dictionary can be printed directly:
print(first_record.annotations)
As with any other Python dictionary, you can easily get the list of keys:
print(first_record.annotations.keys())
or the list of values:
print(first_record.annotations.values())
>>> print(first_record.annotations["source"])
Cypripedium irapeanum
>>> print(first_record.annotations["organism"])
Cypripedium irapeanum
To build a list of the species in the file, loop over the records:
from Bio import SeqIO
all_species = []
for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
    all_species.append(seq_record.annotations["organism"])
print(all_species)
Or, equivalently, with a list comprehension:
from Bio import SeqIO
all_species = [seq_record.annotations["organism"] for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank")]
print(all_species)
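
Note that indexing annotations["organism"] raises KeyError if a record lacks that annotation. Since annotations is an ordinary dictionary, dict.get() with a fallback is one way around it. A sketch assuming Python 3, using two hand-built SeqRecord objects (made up here, not taken from ls_orchid.gbk):

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical records built by hand; the second has no "organism" entry.
records = [
    SeqRecord(Seq("ACGT"), id="rec1",
              annotations={"organism": "Cypripedium irapeanum"}),
    SeqRecord(Seq("GGCC"), id="rec2"),
]

# dict.get() returns the fallback instead of raising KeyError.
all_species = [rec.annotations.get("organism", "unknown") for rec in records]
print(all_species)
```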

Post 14, 2017-01-12 14:37

Suppose you need to extract the list of species from a FASTA file whose records look like this:
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
...
Here the species name is the second whitespace-separated word of the description line:
from Bio import SeqIO
all_species = []
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    all_species.append(seq_record.description.split()[1])
print(all_species)
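
To see why split()[1] picks out the species: for FASTA input, the description is the header line without the leading ">", and the species is its second whitespace-separated word. A plain-Python sketch using the header shown above (no Biopython needed; it assumes every header in the file follows this same layout):

```python
# Header line from the example above, minus the leading ">".
description = ("gi|2765658|emb|Z78533.1|CIZ78533 "
               "C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA")

words = description.split()  # split on runs of whitespace
record_id = words[0]         # everything before the first space
species = words[1]           # the word right after the identifier

print(record_id)
print(species)
```

If a header had no space after the identifier, words[1] would raise IndexError, so this trick only suits files with a consistent header format.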

Post 15, 2017-01-12 14:40

The same species list from the FASTA file, as a list comprehension:
from Bio import SeqIO
all_species = [seq_record.description.split()[1] for seq_record in
               SeqIO.parse("ls_orchid.fasta", "fasta")]
print(all_species)

Post 18, 2017-01-12 23:51

Computing the total length of all the records in a GenBank file:
>>> from Bio import SeqIO
>>> print(sum(len(r) for r in SeqIO.parse("ls_orchid.gbk", "gb")))
67518
Using a with statement (Python 2.5 and later) to close the handle automatically:
>>> from __future__ import with_statement  # only needed on Python 2.5
>>> from Bio import SeqIO
>>> with open("ls_orchid.gbk") as handle:
...     print(sum(len(r) for r in SeqIO.parse(handle, "gb")))
67518
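
Worth remembering with the handle-based version: SeqIO.parse() reads lazily, so the records have to be consumed while the handle is still open, that is, inside the with block. A sketch assuming Python 3, with an in-memory handle and made-up FASTA data standing in for ls_orchid.gbk:

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical FASTA data: two records of length 4 and 6.
fasta_text = ">a\nACGT\n>b\nGGGCCC\n"

with StringIO(fasta_text) as handle:
    # sum() drains the iterator here, before the handle is closed.
    total_length = sum(len(record) for record in SeqIO.parse(handle, "fasta"))

print(total_length)
```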
Or, the old-fashioned way, closing the handle manually:
>>> from Bio import SeqIO
>>> handle = open("ls_orchid.gbk")
>>> print(sum(len(r) for r in SeqIO.parse(handle, "gb")))
67518
>>> handle.close()

Post 19, 2017-01-12 23:55
