<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Dammit Jim! &#187; Statistician</title>
	<atom:link href="http://scott.sherrillmix.com/blog/category/statistician/feed/" rel="self" type="application/rss+xml" />
	<link>http://scott.sherrillmix.com/blog</link>
	<description>I'm a biologist not a...</description>
	<lastBuildDate>Mon, 06 Feb 2012 05:19:08 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>Functional Metagenomics: Sequence Everything and Let DNA Sort The Functions Out</title>
		<link>http://scott.sherrillmix.com/blog/biologist/functional-metagenomics-sequence-everything-and-let-dna-sort-the-functions-out/</link>
		<comments>http://scott.sherrillmix.com/blog/biologist/functional-metagenomics-sequence-everything-and-let-dna-sort-the-functions-out/#comments</comments>
		<pubDate>Mon, 27 Oct 2008 03:57:55 +0000</pubDate>
		<dc:creator>ScottS-M</dc:creator>
				<category><![CDATA[Biologist]]></category>
		<category><![CDATA[Statistician]]></category>
		<category><![CDATA[bacteria]]></category>
		<category><![CDATA[DNA]]></category>
		<category><![CDATA[DNA sequencing]]></category>
		<category><![CDATA[metagenomics]]></category>
		<category><![CDATA[pyrosequencing]]></category>
		<category><![CDATA[virus]]></category>

		<guid isPermaLink="false">http://scott.sherrillmix.com/blog/?p=163</guid>
		<description><![CDATA[One of the cool things you can do with the high-throughput DNA analysis of pyrosequencing is to collect a sample from the environment, isolate the DNA from everything in it and sequence it. Then you can match the DNA up with known sequences and see what sort of microbes you had. Dinsdale and a [...]]]></description>
			<content:encoded><![CDATA[<a href="http://www.researchblogging.org"><img alt="ResearchBlogging.org" src="http://www.researchblogging.org/images/rbicons/ResearchBlogging-Medium-Trans.png" width="80" height="50" class="left"/></a>
<p>One of the cool things you can do with the high-throughput DNA analysis of <a href="http://scott.sherrillmix.com/blog/biologist/can-you-sequence-a-bacterias-entire-genome-overnight/">pyrosequencing</a> is to collect a sample from the environment, isolate the DNA from everything in it and sequence it. Then you can match the DNA up with known sequences and see what sort of microbes you had. Dinsdale and her coauthors pooled the data from a number of such studies. They managed to find 45 bacterial samples and 42 viral samples from 9 broad environmental classifications. The map below shows all the samples the authors pooled together (circles are microbial, squares are viral).</p>
<img src="/res/images/metagenomics_locations.png" alt="Locations of metagenomic samples from Dinsdale et al." class="center"/>
<p>The interesting thing about this study was that instead of looking at the taxonomy of the critters as usual, they looked at the <em>function</em> of the genes. By simply looking at what the genes do, the researchers hoped to get a feel for what activities were going on in that environment without necessarily having to identify the species of the bacteria and viruses. To do this, they fed their 14.5 million sequences (pyrosequencing sure can generate data) into the <a href="http://www.theseed.org/wiki/Main_Page">SEED database</a>, a big collection of genes which have been assigned to functions (for example membrane transport or sulphur metabolism) by experts. They were able to match 1 million of the bacterial and 500,000 of the viral sequences to previously identified gene functions.</p>
<p>It might seem odd that they would look at viral DNA since viruses are rather simple and have only a few basic genes. But the researchers were actually looking at bacterial genetic sequences being carried inside viruses. This of course brings up the question of what bacterial DNA is doing inside viruses. It turns out there are a lot of bacteriophage viruses that like to infect bacteria and sometimes these viruses capture some of the DNA of their bacterial hosts and carry it to their next host. Looking at the bacterial DNA present in a viral population gives an interesting look at what types of genes are being passed around between individual bacteria (and even between bacterial species).</p>
<p>So here are the high level classifications of the function of the genes they found for each environment.</p>
<img src="/res/images/metagenomics_percent_functions.png" alt="Percentages of gene function of bacterial and viral gene function from Dinsdale et al." title="Percentages of gene function of bacterial and viral gene function from Dinsdale et al." class="center"/>
<p>It&#8217;s pretty cool that the viruses were carrying around such a variety of bacterial DNA. The authors suggest that motility genes coding for things like flagella and cilia (which could help the bacterial host spread the virus further) were enriched in the viral samples, but it seems a bit hard to say that for certain without a bit more analysis.</p>
<p>A useful way to look at huge masses of data, like their 1.5 million matches, is to reduce all the different counts in the functional categories into a couple of condensed variables. This can be seen in the next couple of plots, which could use a little explaining. Bacterial sequences are on top and viral sequences on the bottom. Lines show how the various functional categories have been condensed into the x and y variables. For example, bacterial samples that contained lots of genes for making cell walls will tend to be at the top of the plot and tend not to have many genes for respiration.</p>
<img src="/res/images/metagenomics_cdf.png" alt="Canonical discriminant function analysis of bacterial and viral gene function from Dinsdale et al." title="Canonical discriminant function analysis of bacterial and viral gene function from Dinsdale et al." class="center"/>
<p>It&#8217;s pretty cool to see how the samples clustered with other samples from the same environment. For example, all the yellow diamond fish farm samples ended up on the right side of the bacteria graphs even though they were sampled independently. Gene functions appear to correlate with environmental conditions. For example, the fish food at the fish farms contained a lot of sulfur supplements and the bacteria from those samples were rich in sulfur metabolism genes. Similarly, the bacteria from corals contained many different respiration genes to deal with the highly variable oxygen concentrations found there. Dinsdale and her coauthors go so far as to suggest that gene function may provide a better indicator of environment than the taxonomy of the bacteria present.</p>
<p>The paper did have a little trouble in the math in one part but the authors already have a correction in for it so it&#8217;s really not worth worrying about. Overall, it was a pretty interesting story and a good example of stuff to do with a sequencing machine (also it must have taken a good bit of work to collect all that data together from all those authors).</p>

<h3>References</h3>
<p><span class="Z3988" title="ctx_ver=Z39.88-2004&#038;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&#038;rft.jtitle=Nature&#038;rft.id=info:DOI/10.1038%2Fnature06810&#038;rft.atitle=Functional+metagenomic+profiling+of+nine+biomes&#038;rft.date=2008&#038;rft.volume=452&#038;rft.issue=7187&#038;rft.spage=629&#038;rft.epage=632&#038;rft.artnum=http%3A%2F%2Fwww.nature.com%2Fdoifinder%2F10.1038%2Fnature06810&#038;rft.au=Elizabeth+A.+Dinsdale&#038;rft.au=Robert+A.+Edwards&#038;rft.au=Dana+Hall&#038;rft.au=Florent+Angly&#038;rft.au=Mya+Breitbart&#038;rft.au=Jennifer+M.+Brulc&#038;rft.au=Mike+Furlan&#038;rft.au=Christelle+Desnues&#038;rft.au=Matthew+Haynes&#038;rft.au=Linlin+Li&#038;rft.au=Lauren+McDaniel&#038;rft.au=Mary+Ann+Moran&#038;rft.au=Karen+E.+Nelson&#038;rft.au=Christina+Nilsson&#038;rft.au=Robert+Olson&#038;rft.au=John+Paul&#038;rft.au=Beltran+Rodriguez+Brito&#038;rft.au=Yijun+Ruan&#038;rft.au=Brandon+K.+Swan&#038;rft.au=Rick+Stevens&#038;rft.au=David+L.+Valentine&#038;rft.au=Rebecca+Vega+Thurber&#038;rft.au=Linda+Wegley&#038;rft.au=Bryan+A.+White&#038;rft.au=Forest+Rohwer&#038;bpr3.included=1&#038;bpr3.tags=Biology">Elizabeth A. Dinsdale, Robert A. Edwards, Dana Hall, Florent Angly, Mya Breitbart, Jennifer M. Brulc, Mike Furlan, Christelle Desnues, Matthew Haynes, Linlin Li, Lauren McDaniel, Mary Ann Moran, Karen E. Nelson, Christina Nilsson, Robert Olson, John Paul, Beltran Rodriguez Brito, Yijun Ruan, Brandon K. Swan, Rick Stevens, David L. Valentine, Rebecca Vega Thurber, Linda Wegley, Bryan A. White, Forest Rohwer (2008). <cite>Functional metagenomic profiling of nine biomes</cite> Nature, 452 (7187), 629-632 DOI: <a rev="review" href="http://dx.doi.org/10.1038/nature06810">10.1038/nature06810</a></span></p>
]]></content:encoded>
			<wfw:commentRss>http://scott.sherrillmix.com/blog/biologist/functional-metagenomics-sequence-everything-and-let-dna-sort-the-functions-out/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Getting Help with SAS</title>
		<link>http://scott.sherrillmix.com/blog/programmer/getting-help-with-sas/</link>
		<comments>http://scott.sherrillmix.com/blog/programmer/getting-help-with-sas/#comments</comments>
		<pubDate>Thu, 15 Nov 2007 06:01:35 +0000</pubDate>
		<dc:creator>ScottS-M</dc:creator>
				<category><![CDATA[Programmer]]></category>
		<category><![CDATA[SAS]]></category>
		<category><![CDATA[Statistician]]></category>
		<category><![CDATA[community]]></category>
		<category><![CDATA[group]]></category>
		<category><![CDATA[help]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[questions]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://scott.sherrillmix.com/blog/programmer/getting-help-with-sas/</guid>
		<description><![CDATA[There was some discussion in one of my SAS posts about where to find SAS help and communities. It seemed like a pretty useful topic so I thought I&#8217;d expand it a bit and make a post out of it. First, let me say I&#8217;m not the most knowledgeable since I&#8217;m more of a find-wall-bang-head [...]]]></description>
			<content:encoded><![CDATA[<img src="/res/images/sas_source.png" alt="SAS source code" class="right"/>

<p>There was some <a href="http://scott.sherrillmix.com/blog/programmer/sas-lag-problems/#comment-14004">discussion</a> in one of my SAS posts about where to find SAS help and communities. It seemed like a pretty useful topic so I thought I&#8217;d expand it a bit and make a post out of it. First, let me say I&#8217;m not the most knowledgeable since I&#8217;m more of a find-wall-bang-head type of programmer but I did my best to dig up some possible answers. If anyone has any other suggestions, feel free to leave them in the comments.</p> 

<ul>
<li>To start with, there&#8217;s always the official online <a href="http://support.sas.com/documentation/onlinedoc/91pdf/index.html">documentation</a> although this tends to be more for polishing something you already know how to do than starting cold.</li>
<li>Speaking of official, there&#8217;s also the official <a href="http://support.sas.com/forums/index.jspa">SAS forums</a>. I didn&#8217;t know about these until I started looking around for this post so I can&#8217;t say much about them but the topics they have available seem rather specific and I can&#8217;t figure out where one would go to post a basic question.</li>
<li><ins datetime="2007-11-17T18:23:59+00:00"><i>Edit:</i> There&#8217;s also the <a href="http://support.sas.com/resources/">SAS Knowledge Base</a> that has a lot of good papers and notes detailing SAS features, complete with sample code and explanations. It&#8217;s really useful if you&#8217;re a learn-by-example type. (Thanks to <a href="http://blogs.sas.com/sascom">Alison</a> for pointing this one out.)</ins></li>
<li><a href="http://scott.sherrillmix.com/blog/programmer/sas-lag-problems/#comment-15278">Kelly Levoyer</a> of SAS points out <a href="http://www.sascommunity.org/wiki/Main_Page">SAScommunity.org</a> which seems like it is a little sparse but does have a surprisingly long list of <a href="http://www.sascommunity.org/wiki/Category:Bloggers_Corner">SAS-related blogs</a>.</li>
<li>The SAS company also appears to have jumped on the <a href="http://blogs.sas.com/">blogging bandwagon</a> although really only <a href="http://blogs.sas.com/sasdummy/">SAS Dummy</a> looks helpful for learning SAS at the moment.</li>
<li>The only place that seems to be available for asking general questions is the <a href="http://www.listserv.uga.edu/archives/sas-l.html">SAS-L email list</a> (which I just found out is the same as the <a href="http://groups.google.com/group/comp.soft-sys.sas/about">comp.soft-sys.sas</a> Usenet group). There&#8217;s a nice paper on <a href="http://www2.sas.com/proceedings/sugi29/247-29.pdf">SAS-L etiquette</a> (mostly: do your homework first), found via the sascommunity site.</li>
</ul>

<p>Offline, there are also <a href="http://support.sas.com/usergroups/">SAS user groups</a>. I often get emails from our local one but I&#8217;ve never actually gone. The SAS company also has trainers that travel and teach quick classes. Our university stats department brought one in to teach a couple of short two-day classes about statistical functions and macros. The classes were pretty good although I&#8217;m not sure how much they cost or how frequently they are offered. It might be worth checking on if you&#8217;re near a university.</p>

<p>Finally, you can also read my poor attempts at explaining <a href="http://scott.sherrillmix.com/blog/programmer/sas-macros/">SAS macro variables</a> and <a href="http://scott.sherrillmix.com/blog/programmer/sas-macros-letting-sas-do-the-typing/">SAS macros</a>. Also, if you have any specific questions you can try asking in the comments here and if it&#8217;s not too time consuming I&#8217;ll try to lend a hand.</p>]]></content:encoded>
			<wfw:commentRss>http://scott.sherrillmix.com/blog/programmer/getting-help-with-sas/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>SAS Macros: Letting SAS Do the Typing</title>
		<link>http://scott.sherrillmix.com/blog/programmer/sas-macros-letting-sas-do-the-typing/</link>
		<comments>http://scott.sherrillmix.com/blog/programmer/sas-macros-letting-sas-do-the-typing/#comments</comments>
		<pubDate>Sun, 04 Nov 2007 08:09:31 +0000</pubDate>
		<dc:creator>ScottS-M</dc:creator>
				<category><![CDATA[Programmer]]></category>
		<category><![CDATA[SAS]]></category>
		<category><![CDATA[Statistician]]></category>
		<category><![CDATA[ampersand]]></category>
		<category><![CDATA[arrays]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[do loop]]></category>
		<category><![CDATA[macro]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[std]]></category>
		<category><![CDATA[syput]]></category>
		<category><![CDATA[tutorial]]></category>
		<category><![CDATA[variable]]></category>

		<guid isPermaLink="false">http://scott.sherrillmix.com/blog/programmer/sas-macros-letting-sas-do-the-typing/</guid>
		<description><![CDATA[I've been meaning to write up a bit on using macros in SAS to complement my previous post on macro variables for quite a while. Luckily Norwegian guy reminded me about the pain of starting programming in SAS and provided me some motivation. So here's my take on using macros in programming. So what is [...]]]></description>
			<content:encoded><![CDATA[<p>I've been meaning to write up a bit on using macros in SAS to complement my previous post on <a href="http://scott.sherrillmix.com/blog/programmer/sas-macros/">macro variables</a> for quite a while. Luckily <a href="http://scott.sherrillmix.com/blog/programmer/sas-lag-problems/#comment-14002">Norwegian guy</a> reminded me about the pain of starting programming in SAS and provided me some motivation. So here's my take on using macros in programming.</p>

<p>So what is a macro? Macros are a part of SAS that look through your code before the normal part of SAS sees it and write out code for you based on a special syntax. If you've ever found yourself copying and pasting code then you've probably been in a situation well suited for macros. They're also great if you need to perform different functions under different conditions. Once I learned macros, SAS seemed a lot more like a usable (although weird) programming language and tasks seemed to get a lot easier (except actually picking the statistical techniques to use).</p>

<p>Probably the easiest way to see what macros do is with an example. Say we once again have a data set of tree heights:</p>

<div class="syntax_hilite"><span class="langName">SAS:</span><br /><div id="sas-8">
<div class="sas"><ol><li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #000080; font-weight: bold;">data</span> trees;</div></li>
<li style="font-weight: bold;color:#26536A;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #0000ff;">input</span> name:$<span style="color: #2e8b57; font-weight: bold;color:#800000;">8</span>. height;</div></li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">cards;</div></li>
<li style="font-weight: bold;color:#26536A;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Maple <span style="color: #2e8b57; font-weight: bold;color:#800000;">123</span></div></li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Maple <span style="color: #2e8b57; font-weight: bold;color:#800000;">78</span></div></li>
<li style="font-weight: bold;color:#26536A;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Maple <span style="color: #2e8b57; font-weight: bold;color:#800000;">90</span></div></li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Elm <span style="color: #2e8b57; font-weight: bold;color:#800000;">155</span></div></li>
<li style="font-weight: bold;color:#26536A;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Elm <span style="color: #2e8b57; font-weight: bold;color:#800000;">65</span></div></li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Elm <span style="color: #2e8b57; font-weight: bold;color:#800000;">90</span></div></li>
<li style="font-weight: bold;color:#26536A;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Elm <span style="color: #2e8b57; font-weight: bold;color:#800000;">120</span></div></li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Birch <span style="color: #2e8b57; font-weight: bold;color:#800000;">100</span></div></li>
<li style="font-weight: bold;color:#26536A;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Birch <span style="color: #2e8b57; font-weight: bold;color:#800000;">30</span></div></li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Maple <span style="color: #2e8b57; font-weight: bold;color:#800000;">111</span></div></li>
<li style="font-weight: bold;color:#26536A;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #000080; font-weight: bold;">run</span>; </div></li></ol></div>
</div></div><br />

<p>I already talked about how to find and use the mean and standard deviation for the <a href="http://scott.sherrillmix.com/blog/programmer/sas-macros/">whole data set</a>. Now what if we wanted to standardize each species by its own mean and standard deviation? We could cut and paste, but once we get a few more species or want to change something later, this really becomes a hassle. This is where macros come in.</p>

<p>The first thing to do is to calculate the mean and standard deviation for each species. We can use <code>proc means</code> again to do this. Since we won't be using the printed output, I'll add the <code>noprint</code> option, and since we only want the means for the individual species and not the whole dataset, I'll add the <code>nway</code> option. The <code>class name;</code> statement tells SAS to find the statistics separately for each species and the <code>output</code> line tells SAS to save the mean and deviation in a dataset called <code>meansd</code>.</p>
<div class="syntax_hilite"><span class="langName">SAS:</span><br /><div id="sas-9">
<div class="sas"><ol><li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #000080; font-weight: bold;">proc means</span> <span style="color: #000080; font-weight: bold;">data</span>=trees nway noprint;</div></li>
<li style="font-weight: bold;color:#26536A;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">class name;</div></li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #0000ff;">var</span> height;</div></li>
<li style="font-weight: bold;color:#26536A;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #0000ff;">output</span> out=meansd <span style="color: #0000ff;">mean</span>=meanheight <span style="color: #0000ff;">std</span>=sdheight;</div></li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #000080; font-weight: bold;">run</span>; </div></li></ol></div>
</div></div><br />
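<p>To check what ended up in <code>meansd</code>, a quick <code>proc print</code> is one option. This is only a minimal sketch and it assumes the <code>proc means</code> step above has already run. Besides <code>name</code>, <code>meanheight</code> and <code>sdheight</code>, the output dataset should also carry the <code>_TYPE_</code> and <code>_FREQ_</code> columns that <code>proc means</code> adds on its own, with <code>_FREQ_</code> giving the number of trees in each species.</p>
<div class="syntax_hilite"><span class="langName">SAS:</span><br /><pre>
/* List the per-species statistics saved by proc means.   */
/* With the nway option there should be one row per species. */
proc print data=meansd;
run;
</pre></div><br />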

<p>Now we just need to get the values from the <code>meansd</code> dataset into macro variables. We'll use the _NULL_ dataset and <code>call symput</code> again to create macro variables. This time we need to create separate macro variables for each species. Luckily SAS automatically numbers each pass through the data step in an automatic variable called <code>_N_</code>, which here matches the row number in <code>meansd</code>. Since each line of the dataset corresponds to a tree species, we can easily use this identifier to create the macro variables by using <code>call symput(&#039;mean&#039;||left(_N_), meanheight);</code>. The <code>left()</code> and <code>trim()</code> functions take care of unwanted spaces (numbers converted to text have extra spaces on the left and string variables have extra spaces on the right) and the <code>||</code> concatenates (connects) the text "mean" with the line number to give <code>mean1</code>, <code>mean2</code>, etc. I'll do the same thing for standard deviation and tree name. Once the macro variables are created, there is still one problem remaining: we don't know how many species there were or how many macro variables were created. Luckily SAS will make another temporary column that flags the last line of the dataset when it sees <code>end=newcolumnname</code> in a <code>set</code> statement. Then we just need to check if SAS is on the last line and, if so, save the line number (<code>_N_</code>) to know the number of species of trees.</p>

<div class="syntax_hilite"><span class="langName">SAS:</span><br /><div id="sas-10">
<div class="sas"><ol><li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #000080; font-weight: bold;">data</span> <span style="color: #0000ff;">_NULL_</span>;</div></li>
<li style="font-weight: bold;color:#26536A;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #0000ff;">set</span> meansd <span style="color: #0000ff;">end</span>=last;</div></li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #0000ff;">call</span> symput<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">'mean'</span>||left<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">_N_</span><span style="color: #66cc66;">&#41;</span>,meanheight<span style="color: #66cc66;">&#41;</span>;</div></li>
<li style="font-weight: bold;color:#26536A;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #0000ff;">call</span> symput<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">'sd'</span>||left<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">_N_</span><span style="color: #66cc66;">&#41;</span>,sdheight<span style="color: #66cc66;">&#41;</span>;</div></li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #0000ff;">call</span> symput<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">'name'</span>||left<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">_N_</span><span style="color: #66cc66;">&#41;</span>,<span style="color: #0000ff;">trim</span><span style="color: #66cc66;">&#40;</span>name<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;</div></li>
<li style="font-weight: bold;color:#26536A;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #0000ff;">if</span> last <span style="color: #0000ff;">then</span> <span style="color: #0000ff;">call</span> symput<span style="color: #66cc66;">&#40;</span><span style="color: #a020f0;">'numspecies'</span>,<span style="color: #0000ff;">_N_</span><span style="color: #66cc66;">&#41;</span>;</div></li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #000080; font-weight: bold;">run</span>; </div></li></ol></div>
</div></div><br />

<p>If you ever want to check what macro variables you have in your program, you can use <code>%PUT _USER_;</code> to print them all to the log file. Or if you want to see every macro variable available (SAS has quite a few automatic ones like operating system and date), use <code>%PUT _ALL_;</code>. Inserting <code>%PUT _USER_;</code> here produces:</p>
<div class="syntax_hilite"><span class="langName">SAS:</span><br /><div id="sas-11">
<div class="sas"><ol><li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">GLOBAL NUMSPECIES&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #2e8b57; font-weight: bold;color:#800000;">3</span></div></li>
<li style="font-weight: bold;color:#26536A;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">GLOBAL NAME1 Birch</div></li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">GLOBAL NAME2 Elm</div></li>
<li style="font-weight: bold;color:#26536A;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">GLOBAL NAME3 Maple</div></li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">GLOBAL MEAN1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span style="color: #2e8b57; font-weight: bold;color:#800000;">65</span></div></li>
<li style="font-weight: bold;color:#26536A;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">GLOBAL MEAN2&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #2e8b57; font-weight: bold;color:#800000;">107</span>.<span style="color: #2e8b57; font-weight: bold;color:#800000;">5</span></div></li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">GLOBAL MEAN3&nbsp; &nbsp; &nbsp; &nbsp; <span style="color: #2e8b57; font-weight: bold;color:#800000;">100</span>.<span style="color: #2e8b57; font-weight: bold;color:#800000;">5</span></div></li>
<li style="font-weight: bold;color:#26536A;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">GLOBAL SD1 <span style="color: #2e8b57; font-weight: bold;color:#800000;">49</span>.<span style="color: #2e8b57; font-weight: bold;color:#800000;">497474683</span></div></li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">GLOBAL SD2 <span style="color: #2e8b57; font-weight: bold;color:#800000;">38</span>.<span style="color: #2e8b57; font-weight: bold;color:#800000;">837267326</span></div></li>
<li style="font-weight: bold;color:#26536A;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">GLOBAL SD3 <span style="color: #2e8b57; font-weight: bold;color:#800000;">20</span>.<span style="color: #2e8b57; font-weight: bold;color:#800000;">273134933</span> </div></li></ol></div>
</div></div><br />
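<p>The padding in front of some of the values above comes from SAS converting numbers to text. If that padding gets in the way, one alternative (not what this post uses) is <code>call symputx</code>, a routine in SAS 9 and later that strips leading and trailing blanks before storing the value. Here is a minimal sketch, assuming the <code>meansd</code> dataset from above and using <code>cats()</code> to build the macro variable names:</p>
<div class="syntax_hilite"><span class="langName">SAS:</span><br /><pre>
data _NULL_;
set meansd end=last;
/* cats() joins its arguments and strips blanks, so left() and trim() are not needed */
call symputx(cats('mean', _N_), meanheight);
call symputx(cats('sd', _N_), sdheight);
call symputx(cats('name', _N_), name);
/* on the last row, _N_ equals the number of species */
if last then call symputx('numspecies', _N_);
run;

/* print the user-defined macro variables to the log to compare */
%PUT _USER_;
</pre></div><br />
<p>This writes to the same macro variable names as before, so the rest of the example should work either way.</p>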
<p>Now we've set a lot of macro variables but we still haven't created a real macro. In SAS, macros are started with <code>%MACRO macroname;</code> and finished with <code>%MEND;</code> (short for M[acro]END). Percent signs (<code>%</code>) mark commands that the SAS macro facility will read and remove before normal SAS sees the code. Anything without a % will be written out by the macro facility for normal SAS to run. Macros don't spit out their code for SAS until they are called using <code>%macroname</code>.</p>
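<p>As a minimal sketch of that define-then-call pattern, here is a throwaway macro (the names <code>maketiny</code> and <code>tiny</code> are made up for illustration). Turning on the <code>mprint</code> system option asks SAS to echo the code a macro generates to the log, which makes it easier to see that nothing runs until the macro is actually called.</p>
<div class="syntax_hilite"><span class="langName">SAS:</span><br /><pre>
options mprint;   /* echo macro-generated SAS code to the log */

%MACRO maketiny;
data tiny;
x = 1;
run;
%MEND;

/* Defining the macro ran nothing. The data step is only generated here: */
%maketiny

options nomprint; /* switch the echo back off */
</pre></div><br />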

<p>So I'll call my macro <code>treestandardizer</code> but you can call it whatever you want. I'm going to use a pretty simple and specific macro but if you were going to use this often and for different datasets you would want to program it better. The first thing to do is create the <code>final</code> dataset and set it to the <code>trees</code> dataset. Since we need to loop through each species of tree, we'll need a <code>%DO</code> loop. Everything between <code>%DO</code> and <code>%END</code> will be repeated while <code>i</code> increments from 1 to the number of tree species. If you want to combine text and a macro variable to reference another macro variable, you use the double ampersand <code>&amp;&amp;<!--formatted--></code> in SAS. For example, we want to get the mean for species 1 by looking in the macro variable <code>&amp;mean1<!--formatted--></code> so we use <code>&amp;&amp;mean&amp;i<!--formatted--></code>. I <em>think</em> the macro processing part of SAS ends up running through the code twice, the first time finding the <code>&amp;&amp;<!--formatted--></code> and replacing it with <code>&amp;<!--formatted--></code> and the <code>&amp;i<!--formatted--></code> and replacing it with <code>1</code> to leave <code>&amp;mean1<!--formatted--></code> and the second time finding <code>&amp;mean1<!--formatted--></code> and pasting in the appropriate value (65). So we'll have the <code>%DO</code> loop write out a series of <code>if</code> statements to check what the name of the tree is and use the appropriate mean and deviation. Note that when using a string macro variable like <code>&amp;nameX<!--formatted--></code>, you need to surround it with double quotes (the macro processor doesn't look inside single quotes) so SAS doesn't think it is a variable name.</p>
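<p>The double-ampersand lookup can be tried on its own before putting it in a macro. This is a small sketch with made-up macro variables (<code>demo1</code>, <code>demo2</code> and <code>k</code> are just for illustration). The <code>symbolgen</code> system option asks SAS to log each step as it resolves macro variables, so the log should show the two passes described above.</p>
<div class="syntax_hilite"><span class="langName">SAS:</span><br /><pre>
%LET demo1 = 65;
%LET demo2 = 107.5;
%LET k = 2;

options symbolgen;   /* log each step of macro variable resolution */

/* first pass:  &amp;&amp;demo&amp;k becomes &amp;demo2   */
/* second pass: &amp;demo2 becomes 107.5      */
%PUT The value for item &amp;k is &amp;&amp;demo&amp;k;

options nosymbolgen; /* switch the extra logging back off */
</pre></div><br />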

<div class="syntax_hilite"><span class="langName">SAS:</span><br /><div id="sas-12">
<div class="sas"><ol><li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #0000ff;">%MACRO</span> treestandardizer;</div></li>
<li style="font-weight: bold;color:#26536A;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #000080; font-weight: bold;">data</span> final;</div></li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #0000ff;">set</span> trees;</div></li>
<li style="font-weight: bold;color:#26536A;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #0000ff;">%DO</span> i = <span style="color: #2e8b57; font-weight: bold;color:#800000;">1</span> <span style="color: #0000ff;">%TO</span> <span style="color: #0000ff; font-weight: bold;">&amp;numspecies</span>;</div></li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #0000ff;">if</span> name=<span style="color: #a020f0;">"&amp;&amp;name&amp;i"</span> <span style="color: #0000ff;">then</span> stheight=<span style="color: #66cc66;">&#40;</span>height-&amp;<span style="color: #0000ff; font-weight: bold;">&amp;mean</span><span style="color: #0000ff; font-weight: bold;">&amp;i</span><span style="color: #66cc66;">&#41;</span>/&amp;<span style="color: #0000ff; font-weight: bold;">&amp;sd</span><span style="color: #0000ff; font-weight: bold;">&amp;i</span>; </div></li>
<li style="font-weight: bold;color:#26536A;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #0000ff;">%END</span>;</div></li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #000080; font-weight: bold;">run</span>;</div></li>
<li style="font-weight: bold;color:#26536A;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color: #0000ff;">%MEND</span>; </div></li></ol></div>
</div></div><br />
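<p>To make the double-ampersand step a bit more concrete, here is a toy model of it in Python (not SAS, and not how the macro facility is actually implemented). It assumes each pass simply turns <code>&amp;&amp;</code> into <code>&amp;</code> and swaps <code>&amp;name</code> for its value, repeating until nothing changes. The names <code>MACRO_VARS</code>, <code>resolve_once</code> and <code>resolve</code> are made up for this sketch; the values are the ones for species 1.</p>

```python
import re

# Hypothetical stand-in for the macro variables set earlier (species 1 only).
MACRO_VARS = {"i": "1", "name1": "Birch", "mean1": "65", "sd1": "49.497474683"}

# Matches either a double ampersand or a single-ampersand reference such as &mean1.
TOKEN = re.compile(r"&&|&([A-Za-z_][A-Za-z0-9_]*)")


def resolve_once(text, variables):
    """One left-to-right pass: '&&' becomes '&', '&name' becomes its value."""
    def substitute(match):
        if match.group(0) == "&&":
            return "&"
        return variables[match.group(1).lower()]
    return TOKEN.sub(substitute, text)


def resolve(text, variables):
    """Repeat passes until the text stops changing; return every step."""
    steps = [text]
    while True:
        next_text = resolve_once(steps[-1], variables)
        if next_text == steps[-1]:
            return steps
        steps.append(next_text)


# Print the original text, the text after pass one, and the final text.
for step in resolve("(height-&&mean&i)/&&sd&i", MACRO_VARS):
    print(step)
```

<p>The middle line printed is the half-resolved form with <code>&amp;mean1</code> and <code>&amp;sd1</code>, which is the intermediate step guessed at above.</p>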

<p>The previous code prepared the macro but nothing actually happens until we call it using <code>%treestandardizer</code>. Unlike almost everything else in SAS this line doesn't have to end in a semicolon (although it's pretty unlikely to hurt if you forget and add one). So to call the macro:</p>
<div class="syntax_hilite"><span class="langName">SAS:</span><br /><div id="sas-13">
<div class="sas"><ol><li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">%treestandardizer </div></li></ol></div>
</div></div><br />

<p>If you want to see what happens when you call a macro, you can have SAS print the code generated by the macro to the log file with the statement <code>option mprint;</code> (make sure to run it before actually calling the macro). In this case, it gives:</p>
<div class="syntax_hilite"><span class="langName">SAS:</span><br /><div id="sas-14">
<div class="sas"><ol><li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">MPRINT<span style="color: #66cc66;">&#40;</span>TREESTANDARDIZER<span style="color: #66cc66;">&#41;</span>:&nbsp; &nbsp;<span style="color: #000080; font-weight: bold;">data</span> final;</div></li>
<li style="font-weight: bold;color:#26536A;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">MPRINT<span style="color: #66cc66;">&#40;</span>TREESTANDARDIZER<span style="color: #66cc66;">&#41;</span>:&nbsp; &nbsp;<span style="color: #0000ff;">set</span> trees;</div></li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">MPRINT<span style="color: #66cc66;">&#40;</span>TREESTANDARDIZER<span style="color: #66cc66;">&#41;</span>:&nbsp; &nbsp;<span style="color: #0000ff;">if</span> name=<span style="color: #a020f0;">"Birch"</span> <span style="color: #0000ff;">then</span> stheight=<span style="color: #66cc66;">&#40;</span>height- <span style="color: #2e8b57; font-weight: bold;color:#800000;">65</span><span style="color: #66cc66;">&#41;</span>/<span style="color: #2e8b57; font-weight: bold;color:#800000;">49</span>.<span style="color: #2e8b57; font-weight: bold;color:#800000;">497474683</span>;</div></li>
<li style="font-weight: bold;color:#26536A;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">MPRINT<span style="color: #66cc66;">&#40;</span>TREESTANDARDIZER<span style="color: #66cc66;">&#41;</span>:&nbsp; &nbsp;<span style="color: #0000ff;">if</span> name=<span style="color: #a020f0;">"Elm"</span> <span style="color: #0000ff;">then</span> stheight=<span style="color: #66cc66;">&#40;</span>height- <span style="color: #2e8b57; font-weight: bold;color:#800000;">107</span>.<span style="color: #2e8b57; font-weight: bold;color:#800000;">5</span><span style="color: #66cc66;">&#41;</span>/<span style="color: #2e8b57; font-weight: bold;color:#800000;">38</span>.<span style="color: #2e8b57; font-weight: bold;color:#800000;">837267326</span>;</div></li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;"><div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">MPRINT<span style="color: #66cc66;">&#40;</span>TREESTANDARDIZER<span style="color: #66cc66;">&#41;</span>:&nbsp; &nbsp;<span style="color: #0000ff;">if</span> name=<span style="color: #a020f0;">"Maple"</span> <span style="color: #0000ff;">then</span> stheight=<span style="color: #66cc66;">&#40;</span>height- <span style="color: #2e8b57; font-weight: bold;color:#800000;">100</span>.<span style="color: #2e8b57; font-weight: bold;color:#800000;">5</span><span style="color: #66cc66;">&#41;</span>/<span style="color: #2e8b57; font-weight: bold;color:#800000;">20</span>.<span style="color: #2e8b57; font-weight: bold;color:#800000;">273134933</span>; </div></li></ol></div>
</div></div><br />

<p>So it worked and we now have the standardized heights in the <code>stheight</code> column of the <code>final</code> dataset. This particular example could be done a few different ways (the easiest, and probably better, way being to merge the <code>meancv</code> dataset with the <code>trees</code> dataset) but I hope it gives a decent introduction to SAS macros. If you have any specific questions or something wasn't clear, feel free to ask in a comment.</p>
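<p>For comparison, the merge idea boils down to looking up each tree's species statistics and applying the same formula. Here is a rough sketch of that in Python rather than SAS. The means and standard deviations are copied from the MPRINT output above, but the tree heights and the names <code>SPECIES_STATS</code> and <code>standardize</code> are invented for illustration.</p>

```python
# Per-species (mean, standard deviation), copied from the MPRINT output above.
SPECIES_STATS = {
    "Birch": (65.0, 49.497474683),
    "Elm": (107.5, 38.837267326),
    "Maple": (100.5, 20.273134933),
}

# Hypothetical tree records; these heights are made up for the example.
trees = [
    {"name": "Birch", "height": 100.0},
    {"name": "Elm", "height": 80.0},
    {"name": "Maple", "height": 115.0},
]


def standardize(records, stats):
    """Add stheight = (height - mean) / sd using each record's species."""
    result = []
    for record in records:
        mean, sd = stats[record["name"]]
        standardized = (record["height"] - mean) / sd
        result.append({**record, "stheight": standardized})
    return result


final = standardize(trees, SPECIES_STATS)
for row in final:
    print(row["name"], round(row["stheight"], 3))
```

<p>The lookup table plays the role of the <code>meancv</code> dataset, so adding a species means adding a row of data instead of generating another <code>if</code> statement.</p>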

<p>Here is the <a href="/res/SAS_macro_example.sas">SAS source code</a> if you don't feel like copying and pasting.</p>

]]></content:encoded>
			<wfw:commentRss>http://scott.sherrillmix.com/blog/programmer/sas-macros-letting-sas-do-the-typing/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>WP_MonsterID and Statistics</title>
		<link>http://scott.sherrillmix.com/blog/programmer/wp_monsterid-and-statistics/</link>
		<comments>http://scott.sherrillmix.com/blog/programmer/wp_monsterid-and-statistics/#comments</comments>
		<pubDate>Wed, 24 Jan 2007 21:44:26 +0000</pubDate>
		<dc:creator>ScottS-M</dc:creator>
				<category><![CDATA[Programmer]]></category>
		<category><![CDATA[Statistician]]></category>
		<category><![CDATA[Web]]></category>
		<category><![CDATA[birthday paradox]]></category>
		<category><![CDATA[monster]]></category>
		<category><![CDATA[monsterid]]></category>
		<category><![CDATA[random]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[user]]></category>
		<category><![CDATA[wordpress]]></category>

		<guid isPermaLink="false">http://scott.sherrillmix.com/blog/programmer/wp_monsterid-and-statistics/</guid>
		<description><![CDATA[After making the WP_MonsterID WordPress plugin to create a random monster avatar from an assortment of parts for each commenter (based on other people's code), fruityoaty asked This looks nifty, but how many monster images are available for assigning? I'd been meaning to calculate this anyway so I did the math and posted it in [...]]]></description>
			<content:encoded><![CDATA[<img class="left" src="/res/images/monsterid_example2.png" alt="An example of a MonsterID" /><p>After making the <a href="http://scott.sherrillmix.com/blog/blogger/wp_monsterid/">WP_MonsterID WordPress plugin</a> to create a random monster avatar from an assortment of parts for each commenter (based on <a href="http://www.docuverse.com/blog/donpark/2007/01/19/identicon-explained">other</a> <a href="http://www.splitbrain.org/projects/monsterid">people's</a> code), <a href="http://fruityoaty.com/">fruityoaty</a> asked <q>This looks nifty, but how many monster images are available for assigning?</q></p>

<p>I'd been meaning to calculate this anyway so I did the math and posted it in the <a href="http://scott.sherrillmix.com/blog/blogger/wp_monsterid/#comment-72">comments</a>:</p>
<blockquote><p>The current part totals are: 17 eyes, 8 hairs, 12 mouths, 15 bodies, 10 legs. That is 244,800 possible combinations. In addition, the body color can range between 20-235 for red, green, and blue. If we count that as 20 distinguishable values for red, green, and blue that adds 8000 possible colors and brings the unique monster count to 2 billion. The only problem is that the algorithm is only using the first 6 digits of the md5 hash of the email which only provides 16 million possible combinations. So I guess the answer is 16 million monsters currently and in the next release I’ll use a few more digits of the hash and increase it to a billion or so. <ins datetime="2007-02-10T21:00:14+00:00"><em>Edit: I did change this so in version 0.3 and later there should be a couple billion possibilities.</em></ins></p></blockquote>
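<p>The arithmetic in that comment is easy to re-run. This is a small Python sketch using only the counts quoted above; the variable names are mine, and the 20 levels per color channel is the same rough assumption made in the comment.</p>

```python
from math import prod

# Part counts quoted in the comment above.
PARTS = {"eyes": 17, "hairs": 8, "mouths": 12, "bodies": 15, "legs": 10}

part_combinations = prod(PARTS.values())   # one choice of each part
color_combinations = 20 ** 3               # 20 levels each for red, green and blue
total_monsters = part_combinations * color_combinations
hash_limit = 16 ** 6                       # values of 6 hexadecimal md5 digits

print(f"{part_combinations:,} part combinations")
print(f"{total_monsters:,} monsters once color is included")
print(f"{hash_limit:,} values available from 6 hex digits")
# The hash can select at most this many distinct monsters.
print(f"at most {min(total_monsters, hash_limit):,} monsters reachable")
```

<p>The product lands just under 2 billion, and six hex digits cap the reachable monsters at about 16.8 million, which is why using more of the hash mattered.</p>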
<p>Calculating this got me wondering how many unique users it would take before there was likely to be a duplicate monster. For two users it was easy (1 out of 2 billion) but as the number of users increased things got messy since each new monster could match any of the prior monsters. Luckily I remembered enough of my stats class to google for something on <a href="http://en.wikipedia.org/wiki/Birthday_paradox">calculating the chance of people in a group sharing a birthday</a>.</p> 
<p>If you've never heard of this problem, stop and take a quick guess at how many people you think it would take for the odds of two people sharing a birthday to be better than 50%. Or as my statistics professor put it, <q>There are twenty-five people in the room. Will you bet me that no one shares the same birthday?</q></p>
<p>...</p>
<p>...</p>
<p>Guessed?</p>
<p>Now I know just enough statistics to know betting against a statistics professor is a bad idea but I have to say, at the time, I thought it would have been a fairly good bet. It turns out that I, like most non-statistics professors, underestimated the chance of any two people in a group sharing the same birthday. Actually, if there are 23 people in a room there is a greater than 50% chance that at least two will share a birthday. If there are 47 people in a room, there is a 95% chance that at least one pair shares a birthday. The probability climbs this quickly because, like the monsters, each person added to the room can match any of the previous people (the 5th person can match person 1, 2, 3 or 4; the 25th person can match any of persons 1 through 24; and so on).</p>
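<p>For a number as small as 365 the exact answer is easy to compute by multiplying together the chances that each new person misses everyone already in the room. Here is a minimal Python sketch, assuming birthdays are independent and spread evenly over 365 days (no leap years). The function name is just something I picked.</p>

```python
def shared_birthday_probability(people, days=365):
    """Exact chance that at least two of `people` share a birthday.

    Assumes independent birthdays, uniform over `days`.
    """
    all_distinct = 1.0
    for already_present in range(people):
        # The next person has to avoid every birthday already taken.
        all_distinct *= (days - already_present) / days
    return 1.0 - all_distinct


for group_size in (5, 22, 23, 25, 47):
    chance = shared_birthday_probability(group_size)
    print(group_size, "people:", round(chance, 3))
```

<p>Running it shows the chance crossing 50% between 22 and 23 people and passing 95% at 47.</p>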
<p>All this was interesting for understanding the problem but didn't really get me any closer to finding the probability for the monsters. Luckily Wikipedia provides an approximation for determining the number of people at a given probability of overlap:</p>
<img src="/res/images/birthday_problem_n_at_p.png" alt="An approximation for the number of people needed to reach a given chance that two or more people in a group share a birthday."/>
<p>Substituting 2 billion for 365 results in a probability of overlap that looks like:</p>
<img src="/res/images/monster_vs_prob_overlap.png" alt="Number of monsters for a given probability of overlap in 2 billion monsters."/>
<p>Even with 2 billion possible monsters there is still a 50% chance of overlap with only 52,000 monsters and a 1 out of 10 chance of overlap with only 20,000 monsters. Most unintuitive to me is that there's a 99% chance of overlap with only 135,000 monsters. The chance of an overlap really does pile up as the number of monsters already present grows. On the plus side, most normal-sized blogs should be safe from monster overlap, with only a 0.1% chance of overlap even with 2000 commenters.</p>
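<p>Those figures can be reproduced with a few lines of Python. This sketch uses the same square-root approximation (the formula in the R one-liner at the end of this post) plus the same formula solved the other way round for the probability. The Python function names and the <code>MONSTERS</code> constant are my own.</p>

```python
from math import exp, log, sqrt


def assignments_for_probability(total_number, probability_overlap):
    """Approximate number of random assignments needed to reach the given
    chance of at least one overlap among total_number possibilities."""
    return sqrt(2 * total_number * log(1 / (1 - probability_overlap)))


def probability_for_assignments(total_number, assignments):
    """The same approximation, solved for the probability instead."""
    return 1 - exp(-(assignments ** 2) / (2 * total_number))


MONSTERS = 2_000_000_000  # the rounded 2 billion figure used in the post

for p in (0.10, 0.50, 0.99):
    needed = assignments_for_probability(MONSTERS, p)
    print(f"{p:.0%} chance of overlap at about {needed:,.0f} monsters")

small_blog = probability_for_assignments(MONSTERS, 2000)
print(f"2000 commenters: {small_blog:.2%} chance of overlap")
```

<p>Since it is only an approximation, the output should be read as ballpark numbers, which is all the rounded figures above claim to be.</p>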
<p>So what does all this mean? Well, besides not getting suckered in any birthday betting, it's a good reminder to be careful about assuming uniqueness among a group just because the chance of any single match is small. For example, if some application assigned each user a random 4-digit key (10,000 possible combinations), there would be roughly a 50% chance of overlap after only about 118 users, just over 1% of the keys, had been assigned.</p>
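<p>If the formula feels too much like magic, the key example is small enough to simply simulate. Below is a rough Python sketch that hands out random keys many times over and counts how often a repeat shows up. It assumes keys are drawn uniformly and independently. The function names, the trial count and the seed are arbitrary choices of mine, and 118 users is used because the approximation puts the 50% point at about 117.7.</p>

```python
import random


def has_duplicate_key(users, possible_keys, rng):
    """Give each user one uniformly random key; report whether any key repeats."""
    seen = set()
    for _ in range(users):
        key = rng.randrange(possible_keys)
        if key in seen:
            return True
        seen.add(key)
    return False


def estimate_overlap(users, possible_keys, trials=5000, seed=1):
    """Fraction of simulated sign-up runs containing at least one repeated key."""
    rng = random.Random(seed)  # fixed seed so reruns give the same estimate
    hits = sum(has_duplicate_key(users, possible_keys, rng) for _ in range(trials))
    return hits / trials


for users in (50, 118, 200):
    print(users, "users:", estimate_overlap(users, 10_000))
```

<p>The estimate for 118 users should come out close to one half, and by 200 users a repeated key is the normal outcome rather than the exception.</p>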
<p>If anyone feels like messing around with the calculations themselves, here's the function in R to calculate the minimum number of assignments needed to reach a certain probability of overlap given a total number of possible combinations. I'm sure it would be trivial to convert to any other language. Note that <code>log</code> is the natural log, not log10. <code>number_assignments=function(total_number,probability_overlap){sqrt(2*total_number*log(1/(1-probability_overlap)))}</code></p>]]></content:encoded>
			<wfw:commentRss>http://scott.sherrillmix.com/blog/programmer/wp_monsterid-and-statistics/feed/</wfw:commentRss>
		<slash:comments>33</slash:comments>
		</item>
		<item>
		<title>xkcd Geek Comic Site</title>
		<link>http://scott.sherrillmix.com/blog/programmer/xkcd-geek-comic-site/</link>
		<comments>http://scott.sherrillmix.com/blog/programmer/xkcd-geek-comic-site/#comments</comments>
		<pubDate>Thu, 23 Nov 2006 18:11:30 +0000</pubDate>
		<dc:creator>ScottS-M</dc:creator>
				<category><![CDATA[Programmer]]></category>
		<category><![CDATA[Statistician]]></category>
		<category><![CDATA[comic]]></category>
		<category><![CDATA[computer]]></category>
		<category><![CDATA[computer science]]></category>
		<category><![CDATA[funny]]></category>
		<category><![CDATA[geek]]></category>

		<guid isPermaLink="false">http://scott.sherrillmix.com/blog/programmer/xkcd-geek-comic-site/</guid>
		<description><![CDATA[I've been running into math and programming related comics for a while now and always wondered where they were coming from. Today I finally ran across the source. From the topics, it appears the guy is some sort of computer networking mathy type. Some of them are beyond me and I've actually learned a bit [...]]]></description>
			<content:encoded><![CDATA[<p>I've been running into math- and programming-related comics for a while now and always wondered where they were coming from. Today I finally ran across the <a href="http://xkcd.com">source</a>. From the topics, it appears the guy is some sort of computer networking mathy type. Some of them are beyond me and I've actually learned a bit by googling the ones I didn't understand, like the <a href="http://tabo.aurealsys.com/archives/2006/10/30/xkcd-on-cryptography-alice-bob-and-eve-the-eavesdropper/">Alice and Bob one</a>. Anyway, here are a few samples: </p>

<a href="http://xkcd.com/c149.html"><img class="center" src="/res/images/sandwich.png" alt="Sandwich from xkcd.com" /></a>
<a href="http://xkcd.com/c138.html"><img class="center" src="/res/images/pointers.png" alt="Pointers from xkcd.com" /></a>
<a href="http://xkcd.com/c74.html"><img class="center" src="/res/images/su_doku.jpg" alt="Binary Sudoku from xkcd.com" /></a>]]></content:encoded>
			<wfw:commentRss>http://scott.sherrillmix.com/blog/programmer/xkcd-geek-comic-site/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

