<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>I am ... unhindered by talent &#187; estimation of distribution algorithm</title>
	<atom:link href="http://UnhinderedByTalent.com/Phi/archives/tag/estimation-of-distribution-algorithm/feed/" rel="self" type="application/rss+xml" />
	<link>http://UnhinderedByTalent.com/Phi</link>
	<description>Not all battles are fought with a sword</description>
	<lastBuildDate>Thu, 17 May 2012 02:57:09 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>N-grams and the evolution of programs</title>
		<link>http://UnhinderedByTalent.com/Phi/archives/2008/02/02/n-grams-and-the-evolution-of-programs/</link>
		<comments>http://UnhinderedByTalent.com/Phi/archives/2008/02/02/n-grams-and-the-evolution-of-programs/#comments</comments>
		<pubDate>Sat, 02 Feb 2008 22:58:53 +0000</pubDate>
		<dc:creator>Phi</dc:creator>
				<category><![CDATA[Computing]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Sabbatical]]></category>
		<category><![CDATA[Science]]></category>
		<category><![CDATA[Dagstuhl]]></category>
		<category><![CDATA[estimation of distribution algorithm]]></category>
		<category><![CDATA[evolutionary computation]]></category>
		<category><![CDATA[genetic programming]]></category>
		<category><![CDATA[N-gram]]></category>
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://UnhinderedByTalent.com/Phi/archives/2008/02/02/n-grams-and-the-evolution-of-programs/</guid>
		<description><![CDATA[Which of the following was written by (a) me, (b) William Shakespeare, and (c) Charles Darwin? “I would have sent to Rome that&#8217;s worthy death?” “The naturalist looking at species as he might succeed from a fork low down in &#8230; <a href="http://UnhinderedByTalent.com/Phi/archives/2008/02/02/n-grams-and-the-evolution-of-programs/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p class="p1">Which of the following was written by (a) me, (b) William Shakespeare, and (c) Charles Darwin?</p>
<p class="p2"></p>
<blockquote><p>“I would have sent to Rome that&#8217;s worthy death?”</p></blockquote>
<blockquote><p>
“The naturalist looking at species as he might succeed from a fork low down in the separation of the species of any species in a more or less from their sap this is unimportant for the instincts already possessed by certain plants so that natural selection of mere piles of superimposed strata and watch the sea separating an island even if we believe that pure water can effect little or no offspring.”
</p></blockquote>
<blockquote><p>
“The troubling aspects of a building block semantics in a given tree in the context and false.”
</p></blockquote>
<p class="p1">The answer and (much) more is below the fold.</p>
<p><span id="more-788"></span></p>
<p class="p1">That probably wasn’t too hard.</p>
<p class="p1">Except, of course, I’m a mean person and it was a trick question. In fact none of those sentences was actually written by a human being. All three of these pseudo-quotes were generated using simple statistics and the notion of <a href="http://en.wikipedia.org/wiki/N-gram">N-grams</a> (which I will describe below) to capture and reproduce the <i>sense</i> of a text, without really understanding it at a larger scale. I wrote a program to create these examples as an opening illustration for my <a href="http://kathrin.dagstuhl.de/08051/Materials2/">Dagstuhl</a> talk, and since they’re a lot of fun I thought I’d summarize the key concepts here.</p>
<p class="p2"></p>
<h3>A quick intro to N-grams</h3>
<p class="p1">The essential idea is to look at how frequently groups of adjacent symbols (called N-grams) occur. In this example the symbols are words, and we’re using N=3, so we’re looking at triplets of adjacent words. The preceding sentence, for example, starts with the following sequence of triplets:</p>
<ul>
<li>(“In”, “this”, “example”)</li>
<li>(“this”, “example”, “the”)</li>
<li>(“example”, “the”, “symbols”)</li>
<li>Etc.</li>
</ul>
<p class="p1">Note that the triplets overlap, so most words will in fact appear in three N-grams (triplets), once at the end, once in the middle, and once at the beginning, as happens for “example” above.</p>
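<p class="p1">The sliding window described above can be sketched in a few lines of Python. This is a minimal illustration, not the program used for the talk; the function name <code>ngrams</code> and the whitespace-only tokenization are assumptions made for the example.</p>

```python
def ngrams(words, n=3):
    """Return every run of n adjacent words as a tuple, in order."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Assumption: words are separated by plain whitespace.
sentence = "In this example the symbols are words"
triplets = ngrams(sentence.split())

for triplet in triplets[:3]:
    print(triplet)

# How many triplets contain the word "example"?
print(sum(1 for triplet in triplets if "example" in triplet))
```

<p class="p1">The seven words yield five overlapping triplets, and “example” appears in three of them: once at the end, once in the middle, and once at the beginning.</p>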
<p class="p1">Let’s say you go through a text (a book, say), and make a great big list of all the N-grams and how often they each occur. What would that tell us about that text? That depends pretty crucially on the value of N. If N=1, for example, then we’re simply counting up the frequency of individual words. This could be very useful in certain circumstances; we could, for example, probably use that to decide whether the text is a crime novel or a technical paper on econometrics. N=1, however, doesn’t give us any sense of the larger structure of the document; we may find it difficult to distinguish between two particular novels, and we’d be unable to recognize when words tended to appear in related contexts (does the author tend to mention the butler whenever the missing candlestick is discussed?).</p>
<p class="p1">If, on the other hand, we take N to be much larger (hundreds or thousands of words), then we capture lots of long-range structure, but we lose any meaningful statistics. Almost all sequences of 100 consecutive words, for example, are likely to be unique. Consequently they will tend to appear only once in our table for a given text, and are unlikely to appear in the tables for two different texts (although if they did, that would be a useful plagiarism flag). Very long sequences don’t really work, therefore, as a tool for looking for relationships between documents, or even patterns within a text.</p>
<p class="p1">Experience in fields such as computational linguistics suggests that taking N=3 is a useful compromise position. [1] Different triples occur often enough (assuming the text has some length) that their distribution is meaningful (like when N=1), while there’s enough structure and overlap with N=3 that you can capture some long-range regularities as well.</p>
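<p class="p1">The trade-off shows up even in a tiny example. The sketch below counts N-grams for N=1, 2 and 3 using Python’s <code>collections.Counter</code>; the toy sentence and the helper name <code>ngram_counts</code> are made up for illustration.</p>

```python
from collections import Counter


def ngram_counts(words, n):
    """Count how often each run of n adjacent words occurs."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


# A made-up toy text; a real text would be far longer.
text = "the butler took the candlestick and the butler hid the candlestick"
words = text.split()

counts_by_n = {n: ngram_counts(words, n) for n in (1, 2, 3)}

for n, counts in counts_by_n.items():
    # Distinct N-grams, and the count of the most frequent one.
    print(n, len(counts), max(counts.values()))
```

<p class="p1">In this toy text the word “the” occurs four times and the pair (“the”, “butler”) twice, but every one of the nine triples is unique. A text this short has no repetition left by N=3; longer texts push that point out, but not indefinitely.</p>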
<h3>Generating my &#8220;pseudo-quotes&#8221;</h3>
<p class="p1">All of which is illustrated by my pseudo-quotes above. Each was generated by first computing the frequency of triples in one of three texts: <i>Coriolanus</i>, <i>The Origin of Species</i>, and a paper I wrote last fall with a couple of UMM students. [2] I also kept track of the frequency with which words occurred as the first word of a sentence, and the frequency of pairs containing the first two words in every sentence.</p>
<p class="p1">The program starts generating a sentence by picking a first word based on the frequency of first words (so if 20% of the sentences started with “The”, there would be a 20% chance of starting our new sentence with “The”). Given that choice, the program takes the table of frequencies of all the pairs of first two words, pulls out just those pairs that use our chosen first word, and picks the second word in proportion to those frequencies. Once we have the first two words, we can repeatedly generate words based on the frequencies of 3-grams in our big table. If we’re using <i>Coriolanus</i>, and the last two words we’ve generated were (in order) “I” and “would”, then the relevant triples that occur in that play more than once are:</p>
<table>
<tbody>
<tr>
<td valign="middle" class="td1">
        I would <i>he</i>
      </td>
<td valign="middle" class="td1">
        2 occurrences
      </td>
</tr>
<tr>
<td valign="middle" class="td1">
        I would <i>I</i>
      </td>
<td valign="middle" class="td1">
        2 occurrences
      </td>
</tr>
<tr>
<td valign="middle" class="td1">
        <strong>I would <i>have</i></strong>
      </td>
<td valign="middle" class="td1">
        3 occurrences
      </td>
</tr>
<tr>
<td valign="middle" class="td1">
        I would <i>they</i>
      </td>
<td valign="middle" class="td1">
        4 occurrences
      </td>
</tr>
<tr>
<td valign="middle" class="td1">
        I would <i>not</i>
      </td>
<td valign="middle" class="td1">
        4 occurrences
      </td>
</tr>
</tbody>
</table>
<p class="p1">with another 11 &#8220;I would&#8221; triples that occurred a single time. Thus &#8220;I would have&#8221; (the prefix we chose in the &#8220;Shakespearean&#8221; sentence above) was more likely to be chosen than &#8220;I would he&#8221; by a ratio of 3:2, and three times more likely to be chosen than any of the triples that occurred just once. It was, on the other hand, <i>less</i> likely to be chosen than either &#8220;I would they&#8221; or &#8220;I would not&#8221;, but the (digital) dice rolled in its favor on this particular run.</p>
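<p class="p1">The arithmetic behind those ratios can be checked with a short Python sketch. The five counts come from the table above; the 11 single-occurrence words aren’t listed here, so the names <code>single_0</code> through <code>single_10</code> are hypothetical placeholders. The weighted draw uses the standard library’s <code>random.choices</code>, which may or may not match how the original program rolled its dice.</p>

```python
import random
from fractions import Fraction

# Counts for the words following "I would", taken from the table above.
followers = {"he": 2, "I": 2, "have": 3, "they": 4, "not": 4}
# Hypothetical placeholders for the 11 triples that occurred only once.
followers.update({f"single_{i}": 1 for i in range(11)})

total = sum(followers.values())
prob = {word: Fraction(count, total) for word, count in followers.items()}

print(total)                            # all "I would ..." occurrences
print(prob["have"])                     # chance of picking "have"
print(prob["have"] / prob["he"])        # "have" relative to "he"
print(prob["have"] / prob["single_0"])  # "have" relative to a singleton

# One weighted draw; the seed is arbitrary and only makes the run repeatable.
rng = random.Random(0)
next_word = rng.choices(list(followers), weights=list(followers.values()))[0]
print(next_word)
```

<p class="p1">With 26 occurrences in total, “have” is chosen with probability 3/26 (about 11.5%), which is 3/2 the chance of “he” and three times the chance of any singleton.</p>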
<p class="p1">Now that we have that start, we continue choosing words based on the last two that we&#8217;ve added to the sentence. So we look in our table at what triples start &#8220;would have&#8221;, and choose (this time) &#8220;sent&#8221; as the next word. We then look up &#8220;have sent&#8221; and choose &#8220;to&#8221;. The process continues (in my program at least) until we choose a &#8220;word&#8221; that&#8217;s in fact a terminal punctuation mark (a period, a question mark, or an exclamation point), thus ending the sentence.</p>
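<p class="p1">Putting the pieces together, the whole procedure might look something like the Python sketch below. It is a reconstruction from the description above, not the original program: the tokenizer, the three-sentence corpus, the function names, the choice to count triples only within a sentence, and the <code>max_len</code> safety cap are all assumptions.</p>

```python
import random
import re
from collections import Counter, defaultdict

TERMINALS = {".", "?", "!"}


def tokenize(text):
    """Split text into words and terminal punctuation marks."""
    return re.findall(r"[\w']+|[.?!]", text)


def build_tables(tokens):
    """Return first-word counts, second-word counts and triple counts."""
    first = Counter()
    second = defaultdict(Counter)   # first word -> counts of second words
    follow = defaultdict(Counter)   # (w1, w2) -> counts of the next token
    sentence = []
    for token in tokens:
        sentence.append(token)
        if token in TERMINALS:
            if len(sentence) >= 2:
                first[sentence[0]] += 1
                second[sentence[0]][sentence[1]] += 1
                for i in range(len(sentence) - 2):
                    pair = (sentence[i], sentence[i + 1])
                    follow[pair][sentence[i + 2]] += 1
            sentence = []
    return first, second, follow


def pick(counter, rng):
    """Draw one key with probability proportional to its count."""
    return rng.choices(list(counter), weights=list(counter.values()))[0]


def generate(first, second, follow, rng, max_len=50):
    """Generate one sentence as a list of tokens."""
    w1 = pick(first, rng)
    out = [w1, pick(second[w1], rng)]
    for _ in range(max_len - 2):
        if out[-1] in TERMINALS:
            break
        out.append(pick(follow[(out[-2], out[-1])], rng))
    return out


# A made-up corpus; the real runs used whole texts such as Coriolanus.
corpus = ("I would have sent it to Rome. They would have sent him home. "
          "I would not go to Rome.")
first, second, follow = build_tables(tokenize(corpus))
rng = random.Random(1)   # arbitrary seed, only for repeatable runs
print(" ".join(generate(first, second, follow, rng)))
```

<p class="p1">Even with three sentences the generator can produce something that isn’t in the corpus, such as “I would have sent him home .”, because “have sent” was followed by both “it” and “him”.</p>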
<h3>A few observations</h3>
<p class="p1">As we can see from the pseudo-quotes that I opened with, this process can generate sentences whose style is clearly recognizable, and which can make a great deal of sense at the local level (e.g., at the level of phrases). This process, however, doesn&#8217;t &#8220;understand&#8221; or respect larger structural connections or semantics. The third pseudo-quote, for example,</p>
<blockquote><p>
“The troubling aspects of a building block semantics in a given tree in the context and false.”
</p></blockquote>
<p class="p1">lacks a verb. The large Darwin pseudo-quote</p>
<blockquote><p>
“The naturalist looking at species as he might succeed from a fork low down in the separation of the species of any species in a more or less from their sap this is unimportant for the instincts already possessed by certain plants so that natural selection of mere piles of superimposed strata and watch the sea separating an island even if we believe that pure water can effect little or no offspring.”
</p></blockquote>
<p class="p1">rambles all over the shop, containing very sensible phrases like &#8220;The naturalist looking at species&#8221; as well as nonsense such as &#8220;pure water can effect little or no offspring&#8221;.</p>
<h3>What we did with it</h3>
<p class="p1">In the research I was reporting on, Riccardo Poli and I applied this idea of N-grams as a tool for capturing regularities in a language to computer programs that were represented as sequences of simple &#8220;machine&#8221; instructions in a highly simplified programming language that was designed for a specific set of test problems. We used a type of <i>Estimation of distribution algorithm</i> (EDA) to essentially evolve the triplet frequencies, instead of taking them from a text like I did with <i>Coriolanus</i>, and then used those to generate programs much like I did with the text examples. We would generate a set of, say, 100 programs this way, and try each of them on our test problem. Some were better than others, so we&#8217;d take the better ones (say the top half) and use those to update the frequencies; triplets that appeared in those better programs would have their frequencies increased somewhat, while those that didn&#8217;t would have their frequencies reduced.</p>
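<p class="p1">To make that loop concrete, here is a toy Python sketch of the generate, evaluate, and update cycle. It is emphatically not the system from the paper: the four-letter instruction set, the fixed program length, the fitness function (which just rewards the pattern A, B, C), the uniformly random starting pair, and the blending update with learning rate <code>RATE</code> are all hypothetical stand-ins chosen to keep the example short.</p>

```python
import random
from collections import Counter, defaultdict
from itertools import product

INSTRUCTIONS = ["A", "B", "C", "D"]   # hypothetical instruction set
PROGRAM_LEN = 12                      # hypothetical fixed program length
RATE = 0.3                            # hypothetical learning rate


def uniform_model():
    """Every instruction equally likely after every pair of instructions."""
    p = 1.0 / len(INSTRUCTIONS)
    return {pair: {ins: p for ins in INSTRUCTIONS}
            for pair in product(INSTRUCTIONS, repeat=2)}


def sample_program(model, rng):
    """Pick a random starting pair, then extend it one instruction at a time."""
    prog = [rng.choice(INSTRUCTIONS), rng.choice(INSTRUCTIONS)]
    for _ in range(PROGRAM_LEN - 2):
        dist = model[(prog[-2], prog[-1])]
        prog.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return prog


def fitness(prog):
    """Toy stand-in for a test problem: count occurrences of A, B, C."""
    return sum(prog[i:i + 3] == ["A", "B", "C"] for i in range(len(prog) - 2))


def update(model, elite):
    """Blend each pair's distribution toward the triples the elite used."""
    seen = defaultdict(Counter)
    for prog in elite:
        for i in range(len(prog) - 2):
            seen[(prog[i], prog[i + 1])][prog[i + 2]] += 1
    for pair, counts in seen.items():
        total = sum(counts.values())
        for ins in INSTRUCTIONS:
            observed = counts[ins] / total
            model[pair][ins] = (1 - RATE) * model[pair][ins] + RATE * observed


def evolve(generations=30, pop_size=100, seed=0):
    """Run the generate, evaluate, update loop and return the final model."""
    rng = random.Random(seed)
    model = uniform_model()
    for _ in range(generations):
        population = [sample_program(model, rng) for _ in range(pop_size)]
        population.sort(key=fitness, reverse=True)
        update(model, population[:pop_size // 2])   # keep the top half
    return model


model = evolve()
print(round(model[("A", "B")]["C"], 3))
```

<p class="p1">Because the update blends two probability distributions, each pair’s probabilities still sum to one: triples that show up in the better programs gain probability and the rest lose it, which is the behavior described above.</p>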
<p class="p1">When repeated over several generations, this process would evolve/learn/find/discover a set of frequencies that allowed it to generate successful programs with a reasonably high probability, at least on problems that had solutions that could be formed from repeated sequences of instructions. Even though it was limited to 3-grams, the system was able to &#8220;learn&#8221; some fairly long sequences of instructions. In one case, for example, the evolved set of probabilities generated, given a particular starting pair, a particular sequence of 9 instructions with a probability of over 60%, which is some 500,000 times more likely than generating a sequence of that length by randomly drawing instructions from a hat. The solutions that were generated tended to be composed of numerous copies of a small number of basic patterns, and it seems likely that this approach will do better in problem spaces where there are solutions that exhibit that kind of regularity.</p>
<p class="p1">If you&#8217;re interested in learning more, check out the paper: “<a href="http://www.essex.ac.uk/dces/research/publications/technicalreports/2008/ces-479.pdf">A Linear Estimation-of-Distribution GP System</a>”.</p>
<p class="p6">[1] Does anyone know of a theoretical justification for this?</p>
<p class="p6">[2] “<a href="http://www.morris.umn.edu/academic/fclt/Working%20Papers/Morris_WP_3.2.pdf">Semantic building blocks in Genetic Programming</a>”, which will be appearing next month in the Proceedings of the European Conference on Genetic Programming (provide link).</p>
]]></content:encoded>
			<wfw:commentRss>http://UnhinderedByTalent.com/Phi/archives/2008/02/02/n-grams-and-the-evolution-of-programs/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>

