Bioinformatics

Counting Q20 Bases in a .qual File

I sometimes get asked to count the number of bases with qualities greater than or equal to 20 in a quality file. I’m not entirely sure this is all that good a metric with 454 sequencing but that’s another story. It always takes me a minute or two to come up with the right Unix commands to do it so I’m going to post it here so I remember (and maybe save someone else a couple minutes).

cat *qual|grep '^[^>]'|sed 's/ /\n/g'|grep -c [234][0-9]

This is very quick and dirty (just removing lines starting with “>”, replacing spaces with newlines and counting the resulting lines with quals 20-40) but it seems to work ok for me. Also yes I know it’s stupid to cat to a grep but I often replace the cat with head for testing. And I’m sure you could do it in a single awk or sed step but it gets done in a minute or two for several hundred million bases so I haven’t really been motivated to change it.

Bash/UNIX
Bioinformatics
Biologist
Programmer

Comments (1)

Permalink

Converting Between Gapped and NonGapped DNA Coordinates in R

I've been needing to convert coordinates in gapped DNA alignments back and forth to coordinates in the nongapped sequence a lot recently. For example, the T in AC--TGA--A is in the 5th position but in the gapless sequence ACTGAA it is 3rd. My first few tries (counting chars in regex'd substrings) at programming up a decent converters ended up not working very well with large datasets. Since I had to do it with 50,000 positions in a sequence with 2 million letters, I was running into a little trouble. But it ended up actually being a pretty easy problem in R so I thought I'd post it up in case anybody else is running into something similar. Also it's a pretty good example why programming in R can be fun.

R:
  1. gap2NoGap<-function(gapSeq,coords){
  2.    gapSeqSplit<-strsplit(gapSeq,'')[[1]]
  3.    nonDash<-gapSeqSplit!='-'
  4.    newCoords<-cumsum(nonDash)
  5.    return(newCoords[coords])
  6. }

So the function takes a gapped sequence and the gapped coordinates. It splits the gapped sequence into an array of single characters and finds which ones are not -'s. It then takes the cumulative sum at each element in the array (TRUE evaluates as 1 and FALSE as 0). This gives an array where each element gives the new gap-free coordinates. So the function just returns the appropriate new coordinate for each old coordinate.

R:
  1. noGap2Gap<-function(gapSeq,coords){
  2.    gapSeqSplit<-strsplit(gapSeq,'')[[1]]
  3.    newCoords<-which(gapSeqSplit!='-')
  4.    return(newCoords[coords])
  5. }

This opposite conversion function again takes the gapped sequence and appropriately the nongapped coordinates. It splits the gapped sequence into an array of single characters and finds the indices of the characters which are not -'s. Since we only stored the indices of nongap characters, this array gives the gapped coordinates for each nongap letter. So again the function can just returns the appropriate new coordinate for each old coordinate.

These ended up being pretty elegant in R (once I finally figured them out).

Bioinformatics
Biologist
Programmer
R

Comments (3)

Permalink