Programmer

Displaying Code in LaTeX

gioby of Bioinfo Blog! (an interesting read by the way) left a comment asking about displaying code in LaTeX documents. I’ve sort of been cludging around using \hspace‘s and \textcolor but I’ve always meant to figure out the right way to do things so this seemed like a good chance to figure out how to do it right.

LaTeX tends to ignore white space. This is good when you’re writing papers but not so good when you’re trying to show code where white space is an essential part (e.g. Python). Luckily there’s a builtin verbatim environment in LaTeX that is equivalent to html’s <pre>. So something like the following should preserve white space.

Code in LaTeX using verbatim
\begin{verbatim}
for i in range(1, 5):
  print i
else:
  print "The for loop is over"
\end{verbatim}

Unfortunately, you can’t use any normal LaTeX commands inside verbatim (since they’re displayed verbatim). But luckily there a handy package called fancyvrb that fixes this (the color package is also useful for adding colors). For example, if you wanted to highlight “for” in the above code, you can use the Verbatim (note the capital V) environment from fancyvrb:

Code in LaTeX using fancyvrb
\newcommand\codeHighlight[1]{\textcolor[rgb]{1,0,0}{\textbf{#1}}}
\begin{Verbatim}[commandchars=\\\{\}]
\codeHighlight{for} i in range(1, 5):
  print i
else:
  print "The for loop is over"
\end{Verbatim}
Code in LaTeX using pygmentize

If you really want to get fancy, the Pygments package in Python will output syntax highlighted latex code with a command like: pygmentize -f latex -O full test.py >py.tex The LaTeX it outputs is a bit hard to read but it’s not too bad (it helped me figure out the fancyvrb package) and it does make nice syntax highlighted output.

Here’s an example LaTeX file with the three examples above and the pdf it generates if you’re curious.

LaTeX
Programmer

Comments (8)

Permalink

Counting Q20 Bases in a .qual File

I sometimes get asked to count the number of bases with qualities greater than or equal to 20 in a quality file. I’m not entirely sure this is all that good a metric with 454 sequencing but that’s another story. It always takes me a minute or two to come up with the right Unix commands to do it so I’m going to post it here so I remember (and maybe save someone else a couple minutes).

cat *qual|grep '^[^>]'|sed 's/ /\n/g'|grep -c [234][0-9]

This is very quick and dirty (just removing lines starting with “>”, replacing spaces with newlines and counting the resulting lines with quals 20-40) but it seems to work ok for me. Also yes I know it’s stupid to cat to a grep but I often replace the cat with head for testing. And I’m sure you could do it in a single awk or sed step but it gets done in a minute or two for several hundred million bases so I haven’t really been motivated to change it.

Bash/UNIX
Bioinformatics
Biologist
Programmer

Comments (1)

Permalink

Converting Between Gapped and NonGapped DNA Coordinates in R

I've been needing to convert coordinates in gapped DNA alignments back and forth to coordinates in the nongapped sequence a lot recently. For example, the T in AC--TGA--A is in the 5th position but in the gapless sequence ACTGAA it is 3rd. My first few tries (counting chars in regex'd substrings) at programming up a decent converters ended up not working very well with large datasets. Since I had to do it with 50,000 positions in a sequence with 2 million letters, I was running into a little trouble. But it ended up actually being a pretty easy problem in R so I thought I'd post it up in case anybody else is running into something similar. Also it's a pretty good example why programming in R can be fun.

R:
  1. gap2NoGap<-function(gapSeq,coords){
  2.    gapSeqSplit<-strsplit(gapSeq,'')[[1]]
  3.    nonDash<-gapSeqSplit!='-'
  4.    newCoords<-cumsum(nonDash)
  5.    return(newCoords[coords])
  6. }

So the function takes a gapped sequence and the gapped coordinates. It splits the gapped sequence into an array of single characters and finds which ones are not -'s. It then takes the cumulative sum at each element in the array (TRUE evaluates as 1 and FALSE as 0). This gives an array where each element gives the new gap-free coordinates. So the function just returns the appropriate new coordinate for each old coordinate.

R:
  1. noGap2Gap<-function(gapSeq,coords){
  2.    gapSeqSplit<-strsplit(gapSeq,'')[[1]]
  3.    newCoords<-which(gapSeqSplit!='-')
  4.    return(newCoords[coords])
  5. }

This opposite conversion function again takes the gapped sequence and appropriately the nongapped coordinates. It splits the gapped sequence into an array of single characters and finds the indices of the characters which are not -'s. Since we only stored the indices of nongap characters, this array gives the gapped coordinates for each nongap letter. So again the function can just returns the appropriate new coordinate for each old coordinate.

These ended up being pretty elegant in R (once I finally figured them out).

Bioinformatics
Biologist
Programmer
R

Comments (3)

Permalink

How SINs (and Credit Card Numbers) Are Validated

A while back it was tax season in Canada and a friend of mine was trying to do his taxes online. But since he was foreign and didn't have a Social Insurance Number (their equivalent of an SSN), the helpful webapp wouldn't let him print the thing (of course it only informed him of that after he had already entered everything). We tried a few guesses before finally just using mine and crossing out my numbers after printing it. But I always wondered how the form knew our guesses were invalid. Luckily, I recently stumbled across a mention of how credit card numbers are validated.

It turns out SINs and credit card number are checked with something called the Luhn algorithm. Basically the algorithm just involves taking each digit, multiplying the second, fourth, sixth and so on digit from the right by 2 and adding up all the resulting digits (e.g. if 7 is multiplied by 2 then the resulting 14 is split into 1+4). If the sum of the digits is a multiple of 10, the number passes.

For example, to check 345678, you'd split it into 3,4,5,6,7,8. Then multiply 3, 5 and 7 by 2 to give 6, 4, 10, 6, 14, 8. Then split all the digits again to give 6, 4, 1, 0, 6, 1, 4, 8. That adds up to 30 so 345678 would be a valid credit card number (if there wasn't a set length).

Just for fun, here's a quick function in R to run the Luhn Algorithm on a number (or tell you the remainder so you can adjust):

R:
  1. luhnCheck <- function(number,returnLogical=TRUE){
  2.   numbers <- gsub('[^0-9]','',as.character(number))
  3.   numbers <- as.numeric(strsplit(numbers,'')[[1]])
  4.   selector<-seq(length(numbers)-1,1,-2)
  5.   numbers[selector]<-numbers[selector]*2
  6.   numbers[numbers> 9] <- numbers[numbers> 9] - 9
  7.   remainder <- sum(numbers) %% 10
  8.   if(returnLogical) return(remainder==0)
  9.   else return(remainder)
  10. }

So the next time some stupid web form needs a SIN number I'm going with 999999998.

Programmer
R

Comments (0)

Permalink

Don’t Use Zypper to Upgrade OpenSuse

OpenSuse gecko

I installed OpenSuse on my work computer and I've been really happy with it so far. Recently, OpenSuse came out with a new version 11.1 so I figured I would upgrade. Since this was my first time updating, I turned to google and the first result for "upgrade opensuse 11.1" is this page about zypper on the OpenSuse site. It sure sounds easy, just run zypper and OpenSuse will upgrade in the background.

But after 5 hours of installation fun (Want me to fall back to /dev/mapper/...?) I just wanted to warn people that this is not the way to go. Somehow zypper decided that my computer would function just fine without supporting the harddrive (RAID0) but I was forced to disagree with this point when my computer became unbootable. Luckily the installation DVD (after finally downloading) was able to fix the problem.

So if you're planning on upgrading your OpenSuse, I would say it's definitely worth just downloading the DVD and updating from that (which actually went really smoothly).

Yes I do (now) realize the zypper upgrade page says:

This method is unsupported. The official method of upgrading to a new release is by using the latest DVD. To start the upgrade you boot from the DVD and start installation, at some point you will be prompted to either do a New Installation or Upgrade, at that point you select Upgrade and continue with the setup.

But I didn't realize that translated to really really don't use this method. Anyway live and learn and now I know it's really easy to upgrade from the DVD.

Bash/UNIX
Programmer

Comments (5)

Permalink