I've been needing to convert coordinates in gapped DNA alignments back and forth to coordinates in the nongapped sequence a lot recently. For example, the T in AC--TGA--A is in the 5th position but in the gapless sequence ACTGAA it is 3rd. My first few tries (counting chars in regex'd substrings) at programming up a decent converters ended up not working very well with large datasets. Since I had to do it with 50,000 positions in a sequence with 2 million letters, I was running into a little trouble. But it ended up actually being a pretty easy problem in R so I thought I'd post it up in case anybody else is running into something similar. Also it's a pretty good example why programming in R can be fun.
So the function takes a gapped sequence and the gapped coordinates. It splits the gapped sequence into an array of single characters and finds which ones are not -'s. It then takes the cumulative sum at each element in the array (TRUE evaluates as 1 and FALSE as 0). This gives an array where each element gives the new gap-free coordinates. So the function just returns the appropriate new coordinate for each old coordinate.
This opposite conversion function again takes the gapped sequence and appropriately the nongapped coordinates. It splits the gapped sequence into an array of single characters and finds the indices of the characters which are not -'s. Since we only stored the indices of nongap characters, this array gives the gapped coordinates for each nongap letter. So again the function can just returns the appropriate new coordinate for each old coordinate.
These ended up being pretty elegant in R (once I finally figured them out).