biopython forum (biopython吧)

Sequence input and output: the Bio.SeqIO module


The Bio.SeqIO module aims to provide a simple, uniform interface for working with sequence files in many different formats.
>>> from Bio import SeqIO
>>> help(SeqIO)
...


Post 1, 2017-01-10 13:58, IP location: Guangdong
API documentation: biopython.org/DIST/docs/api/Bio.SeqIO-module.html


Post 2, 2017-01-10 13:59, IP location: Guangdong
Wiki page: http://biopython.org/wiki/SeqIO


Post 3, 2017-01-10 13:59, IP location: Guangdong
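Parsing/reading sequences:
Bio.SeqIO.parse()
Reads a sequence file and returns an iterator that yields SeqRecord objects. It takes two arguments:
The first argument is a file name or a handle.
The second argument is a lowercase string that specifies the sequence format.
For example: SeqIO.parse("ls_orchid.fasta", "fasta")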
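
To make the two arguments concrete, here is a minimal sketch that parses the same data once by file name and once through an open handle. The records seq1 and seq2 and the temporary file are made up for illustration; they are not taken from ls_orchid.fasta.

```python
import os
import tempfile

from Bio import SeqIO

# Two made-up FASTA records, used only for illustration.
FASTA_TEXT = ">seq1 first toy record\nACGTACGT\n>seq2 second toy record\nGGCC\n"

# Write the toy data to a temporary file so both call styles can be shown.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(FASTA_TEXT)
    fasta_path = tmp.name

try:
    # First argument given as a file name.
    ids_from_name = [record.id for record in SeqIO.parse(fasta_path, "fasta")]

    # First argument given as an open handle.
    with open(fasta_path) as handle:
        ids_from_handle = [record.id for record in SeqIO.parse(handle, "fasta")]
finally:
    os.remove(fasta_path)  # clean up the temporary file

print(ids_from_name)
print(ids_from_handle)
```

Both calls should give the same list of IDs; which form to use is mostly a matter of whether you already have the file open.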


Post 4, 2017-01-10 14:01, IP location: Guangdong
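To handle a file that contains only one sequence record, use the function Bio.SeqIO.read(). It takes the same arguments as Bio.SeqIO.parse(). It returns a single SeqRecord object when the file contains exactly one record, and raises an exception otherwise.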
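
A small sketch of that behaviour, using in-memory toy data instead of a real file (the record names are invented for the example). As far as I know the exception raised is a ValueError, which is what the sketch catches.

```python
from io import StringIO

from Bio import SeqIO

# Toy FASTA data: one input with a single record, one with two records.
ONE_RECORD = ">only_one toy record\nACGTAC\n"
TWO_RECORDS = ">first toy record\nACGT\n>second toy record\nTTGA\n"

# Exactly one record: read() returns the SeqRecord itself, not an iterator.
record = SeqIO.read(StringIO(ONE_RECORD), "fasta")
print(record.id, len(record))

# More than one record: read() raises instead of silently picking one.
try:
    SeqIO.read(StringIO(TWO_RECORDS), "fasta")
    outcome = "no error"
except ValueError as error:
    outcome = "ValueError"
    print("read() refused the input:", error)
```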


Post 5, 2017-01-12 14:17, IP location: Guangdong
Extracting a list of sequence IDs from a file:
>>> from Bio import SeqIO
>>> identifiers = [seq_record.id for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank")]
>>> identifiers
['Z78533.1', 'Z78532.1', 'Z78531.1', 'Z78530.1', 'Z78529.1', 'Z78527.1', ..., 'Z78439.1']


Post 8, 2017-01-12 14:20, IP location: Guangdong
Stepping through a sequence file one record at a time:
from Bio import SeqIO
record_iterator = SeqIO.parse("ls_orchid.fasta", "fasta")
first_record = next(record_iterator)  # next() works on Python 2.6+ and Python 3
print(first_record.id)
print(first_record.description)
second_record = next(record_iterator)
print(second_record.id)
print(second_record.description)


Post 9, 2017-01-12 14:24, IP location: Guangdong
If you only need the first record:
from Bio import SeqIO
first_record = next(SeqIO.parse("ls_orchid.gbk", "genbank"))
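
One thing worth knowing when stepping through records by hand: once the records run out, next() raises StopIteration. A hedged sketch with made-up in-memory data, where a default value is passed to next() so that the end of the file can be detected without an exception:

```python
from io import StringIO

from Bio import SeqIO

# Two invented FASTA records, for illustration only.
FASTA_TEXT = ">seq1 first toy record\nACGT\n>seq2 second toy record\nGGCCTT\n"

record_iterator = SeqIO.parse(StringIO(FASTA_TEXT), "fasta")
first_record = next(record_iterator)
second_record = next(record_iterator)

# There is no third record; the default (None) is returned
# instead of StopIteration being raised.
third_record = next(record_iterator, None)

print(first_record.id, second_record.id, third_record)
```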


Post 10, 2017-01-12 14:25, IP location: Guangdong
Getting all the records of a sequence file as a list:
from Bio import SeqIO
records = list(SeqIO.parse("ls_orchid.gbk", "genbank"))
print("Found %i records" % len(records))
print("The last record")
last_record = records[-1]  # using Python's list tricks
print(last_record.id)
print(repr(last_record.seq))
print(len(last_record))
print("The first record")
first_record = records[0]  # remember, Python counts from zero
print(first_record.id)
print(repr(first_record.seq))
print(len(first_record))


Post 11, 2017-01-12 14:27, IP location: Guangdong
Output:
Found 94 records
The last record
Z78439.1
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', IUPACAmbiguousDNA())
592
The first record
Z78533.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', IUPACAmbiguousDNA())
740
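
Why bother with list() at all? The object returned by SeqIO.parse() can only be walked through once and cannot be indexed, whereas a list can be indexed and reused. A rough sketch with three invented records held in memory:

```python
from io import StringIO

from Bio import SeqIO

# Three made-up FASTA records, for illustration only.
FASTA_TEXT = ">seq1 toy\nACGT\n>seq2 toy\nGGCCTT\n>seq3 toy\nA\n"

record_iterator = SeqIO.parse(StringIO(FASTA_TEXT), "fasta")
records = list(record_iterator)   # consumes the iterator
leftover = list(record_iterator)  # a second pass finds nothing left

print(len(records), len(leftover))
print(records[0].id, records[-1].id)  # indexing works on the list
```

The trade-off is memory: the list holds every record at once, so for very large files iterating is usually the safer choice.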


Post 12, 2017-01-12 14:32, IP location: Guangdong
Extracting data:
from Bio import SeqIO
record_iterator = SeqIO.parse("ls_orchid.gbk", "genbank")
first_record = next(record_iterator)
print(first_record)
Output:
ID: Z78533.1
Name: Z78533
Description: C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA.
Number of features: 5
/sequence_version=1
/source=Cypripedium irapeanum
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', ..., 'Cypripedium']
/keywords=['5.8S ribosomal RNA', '5.8S rRNA gene', ..., 'ITS1', 'ITS2']
/references=[...]
/accessions=['Z78533']
/data_file_division=PLN
/date=30-NOV-2006
/organism=Cypripedium irapeanum
/gi=2765658
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', IUPACAmbiguousDNA())


Post 13, 2017-01-12 14:34, IP location: Guangdong
The annotations dictionary can be printed directly:
print(first_record.annotations)
As with any other Python dictionary, you can easily get the list of keys:
print(list(first_record.annotations.keys()))
or the list of values:
print(list(first_record.annotations.values()))
>>> print(first_record.annotations["source"])
Cypripedium irapeanum
>>> print(first_record.annotations["organism"])
Cypripedium irapeanum
Collecting the organism of every record in the file:
from Bio import SeqIO
all_species = []
for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
    all_species.append(seq_record.annotations["organism"])
print(all_species)
The same thing as a list comprehension:
from Bio import SeqIO
all_species = [seq_record.annotations["organism"] for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank")]
print(all_species)
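
Since annotations is an ordinary dictionary, a missing key raises KeyError. Below is a sketch of a more forgiving lookup with dict.get(). The two records are built by hand as stand-ins for parsed GenBank entries (toy1 and toy2 are invented), so no input file is needed.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hand-built records standing in for parsed GenBank entries (toy data).
records = [
    SeqRecord(
        Seq("ACGT"),
        id="toy1",
        annotations={"organism": "Cypripedium irapeanum"},
    ),
    SeqRecord(Seq("GGCC"), id="toy2"),  # no organism annotation at all
]

# dict.get() supplies a fallback instead of raising KeyError.
all_species = [r.annotations.get("organism", "unknown") for r in records]
print(all_species)
```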


Post 14, 2017-01-12 14:37, IP location: Guangdong
Now suppose you need to extract the list of species from a FASTA file, whose records look like this:
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
...
Here the species name is the second word of the description line:
from Bio import SeqIO
all_species = []
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    all_species.append(seq_record.description.split()[1])
print(all_species)


Post 15, 2017-01-12 14:40, IP location: Guangdong
The same extraction as a list comprehension:
from Bio import SeqIO
all_species = [seq_record.description.split()[1] for seq_record in
               SeqIO.parse("ls_orchid.fasta", "fasta")]
print(all_species)
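
The split()[1] trick depends entirely on how the header line is laid out. This plain-Python sketch shows what it does to the header quoted earlier in the thread, with a guard for headers that have no second word. The helper name species_from_description is made up for this example and is not part of Biopython.

```python
def species_from_description(description):
    """Return the second whitespace-separated word, or None if there is none."""
    words = description.split()
    return words[1] if len(words) > 1 else None


header = (
    "gi|2765658|emb|Z78533.1|CIZ78533 "
    "C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA"
)

print(species_from_description(header))        # C.irapeanum
print(species_from_description("only_an_id"))  # None
```

Without the guard, a header consisting of a single word would raise IndexError.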


Post 18, 2017-01-12 23:51, IP location: Guangdong
Computing the total length of all the sequence records in a GenBank file:
>>> from Bio import SeqIO
>>> print(sum(len(r) for r in SeqIO.parse("ls_orchid.gbk", "gb")))
67518
Using a with statement (Python 2.5 and later) so that the handle is closed automatically:
>>> from __future__ import with_statement  # Needed on Python 2.5
>>> from Bio import SeqIO
>>> with open("ls_orchid.gbk") as handle:
...     print(sum(len(r) for r in SeqIO.parse(handle, "gb")))
...
67518
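
To check that the with statement really closes the handle, here is a self-contained sketch. It uses a small temporary FASTA file with two invented records rather than ls_orchid.gbk, so the total length is different from the one above.

```python
import os
import tempfile

from Bio import SeqIO

# Two made-up records of length 8 and 4, for illustration only.
FASTA_TEXT = ">seq1 toy\nACGTACGT\n>seq2 toy\nGGCC\n"

with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(FASTA_TEXT)
    fasta_path = tmp.name

try:
    with open(fasta_path) as handle:
        total_length = sum(len(record) for record in SeqIO.parse(handle, "fasta"))
    # Leaving the with block has already closed the handle.
    closed_after_with = handle.closed
finally:
    os.remove(fasta_path)  # clean up the temporary file

print(total_length, closed_after_with)
```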
Or, the old-fashioned way, closing the handle manually:
>>> from Bio import SeqIO
>>> handle = open("ls_orchid.gbk")
>>> print(sum(len(r) for r in SeqIO.parse(handle, "gb")))
67518
>>> handle.close()


Post 19, 2017-01-12 23:55, IP location: Guangdong