
Sequence files as dictionaries - indexed files


Because Bio.SeqIO.to_dict() keeps every record in memory, the file size it can handle is limited by your computer's RAM. For larger files, consider Bio.SeqIO.index() instead:
>>> from Bio import SeqIO
>>> orchid_dict = SeqIO.index("ls_orchid.gbk", "genbank")
>>> len(orchid_dict)
94
>>> orchid_dict.keys()
['Z78484.1', 'Z78464.1', 'Z78455.1', 'Z78442.1', 'Z78532.1', 'Z78453.1', ..., 'Z78471.1']
>>> seq_record = orchid_dict["Z78475.1"]
>>> print(seq_record.description)
P.supardii 5.8S rRNA gene and ITS1 and ITS2 DNA.
>>> seq_record.seq
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GGT', IUPACAmbiguousDNA())


IP location: Guangdong | Floor 1 | 2017-01-16 14:44
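    It works slightly differently. Although it still returns a dictionary-like object, it does not hold all the information in memory. Instead, it only records where each sequence record sits in the file, and parses a record only when you ask for that specific record.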
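
To make the idea above concrete, here is a minimal pure-Python sketch of an offset index for a FASTA file. It is not Biopython's actual implementation; the function names build_offset_index and fetch_record and the two-record demo file are made up for illustration, and the sketch assumes every header line has a non-empty ID after the ">".

```python
import os
import tempfile


def build_offset_index(path):
    """Map each FASTA record ID to the byte offset of its '>' header line.

    Only the offsets are kept in memory, not the sequences themselves.
    """
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            position = handle.tell()  # offset of the line about to be read
            line = handle.readline()
            if not line:
                break  # end of file
            if line.startswith(b">"):
                # Assumption: the ID is the first whitespace-delimited word.
                record_id = line[1:].split()[0].decode("ascii")
                offsets[record_id] = position
    return offsets


def fetch_record(path, offsets, record_id):
    """Seek to one record and parse only that record into (header, sequence)."""
    with open(path, "rb") as handle:
        handle.seek(offsets[record_id])
        header = handle.readline()[1:].decode("ascii").strip()
        chunks = []
        for line in handle:
            if line.startswith(b">"):
                break  # reached the next record
            chunks.append(line.decode("ascii").strip())
    return header, "".join(chunks)


# Made-up two-record FASTA file, used only for this demonstration.
demo_text = ">seq1 first demo record\nACGT\nACGT\n>seq2 second demo record\nGGCC\n"
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(demo_text)
    demo_path = tmp.name

demo_offsets = build_offset_index(demo_path)
first_record = fetch_record(demo_path, demo_offsets, "seq1")
second_record = fetch_record(demo_path, demo_offsets, "seq2")
os.remove(demo_path)
```

The index costs one pass over the file plus one small dictionary entry per record, and each lookup then reads only the lines of the requested record. That is the trade-off the post describes: lower memory use in exchange for parsing on every access.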


    IP location: Guangdong | Floor 2 | 2017-01-16 14:44
      FASTA, using a key function to index by accession:
      def get_acc(identifier):
          """Given a SeqRecord identifier string, return the accession number as a string.

          e.g. "gi|2765613|emb|Z78488.1|PTZ78488" -> "Z78488.1"
          """
          # Split the pipe-delimited identifier into its five fields.
          parts = identifier.split("|")
          assert len(parts) == 5 and parts[0] == "gi" and parts[2] == "emb"
          # The fourth field is the accession number.
          return parts[3]
      >>> from Bio import SeqIO
      >>> orchid_dict = SeqIO.index("ls_orchid.fasta", "fasta", key_function=get_acc)
      >>> print(list(orchid_dict.keys()))
      ['Z78484.1', 'Z78464.1', 'Z78455.1', 'Z78442.1', 'Z78532.1', 'Z78453.1', ..., 'Z78471.1']
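
One possible variation on the key function, offered only as a sketch: assert statements are skipped when Python runs with -O, so a version that raises ValueError keeps the format check in all cases. The name get_accession is hypothetical and does not come from Biopython or from the post above.

```python
def get_accession(identifier):
    """Return the accession from a 'gi|...|emb|...|...' style identifier.

    Hypothetical variant of get_acc that raises ValueError instead of
    relying on assert, so the check still runs under "python -O".
    """
    parts = identifier.split("|")
    if len(parts) != 5 or parts[0] != "gi" or parts[2] != "emb":
        raise ValueError("Unexpected identifier format: %r" % identifier)
    return parts[3]


# Same sample identifier as in the docstring of get_acc.
example_key = get_accession("gi|2765613|emb|Z78488.1|PTZ78488")
```

It can be passed the same way as before, as key_function=get_accession in the SeqIO.index() call.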


      IP location: Guangdong | Floor 3 | 2017-01-16 17:55