2013-09-14

using HTSeq

用tophat做完mapping之后，接下来就是count reads，HTSeq的文档很详细，还有一个use case，除了这个case，其他的全部跟着走了一遍。接下来使用htseq-qa和htseq-count的时候，遇到了问题，记录在这里：

###htseq-qa
安装了HTSeq后，htseq-qa的使用也很简单：

htseq-qa [options] read_file #assumes htseq is in your PATH
python -m HTSeq.scripts.qa [options] read_file #if htseq is not in your PATH

检查clean data的时候很顺利，在read_file是Col_0的raw data时，遇到了问题，原因是raw data里面有的reads quality string信息不完整，尝试过删除该reads或者把reads quality数据补全，都不成。

###htseq-count
htseq-count的使用与qa类似，如果htseq不再你的PATH，可以这样：python -m HTSeq.scripts.count [options] <sam_file> <gff_file>。我遇到的问题有两个：

tophat输出的是BAM格式的结果，而htseq-count需要SAM格式的比对结果，首先要转换，这个用samtools就行了：samtools view col.bam >> col.sam samtools默认的输入就是BAM格式，这里用重定向将屏幕输出写入col.sam
然后运行htseq-count报错：”ValueError: The attribute string seems to contain mismatched quotes”.搜了下，报错的行里面’gene_name’id含有多余的分号，手工修改了几次，还是报同样的错。这时候需要写一个脚本了：

import re
import sys

input_file = sys.argv[1]
output_file = sys.argv[2]
j = 0
with open(input_file, 'r') as f:
  for line in f.readlines():
    gene_name = re.search(r'gene_name\s(\S+)',line)
    gene_id = gene_name.group(1)
    match = re.search(r'(\w+);(\w+)', gene_id)
    if match:
      newid = ''.join(['"', match.group(1), match.group(2), '"', ';'])
      newline = line.replace(gene_id, newid)
    else:
      newline = line

    with open(output_file, 'a') as newf:
      newf.write(newline)
      j += 1

print 'totaly', j, 'number of lines modified!'

得到正确格式的gtf之后，就可以count了：python -m HTSeq.scripts.count -m intersection-nonempty -s no sam_file gtf_file >> samplecount 需要注意的事这里’-s’选项默认是’yes’,所以如果样品建库不是连特异性的话，一定要加-s no,要不有一半的reads就浪费了。另外就是把count结果写入文件，以备之后分析。

popucui

Stuff Related to Bioinformatics Computer and More

using HTSeq

留言