<?xml version="1.0" encoding="utf-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: N-grams and the evolution of programs</title>
	<atom:link href="http://unhinderedbytalent.com/Phi/archives/2008/02/02/n-grams-and-the-evolution-of-programs/feed/" rel="self" type="application/rss+xml" />
	<link>http://UnhinderedByTalent.com/Phi/archives/2008/02/02/n-grams-and-the-evolution-of-programs/</link>
	<description>Not all battles are fought with a sword</description>
	<pubDate>Fri, 25 Jul 2008 15:06:21 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5.1</generator>
		<item>
		<title>By: Phi</title>
		<link>http://UnhinderedByTalent.com/Phi/archives/2008/02/02/n-grams-and-the-evolution-of-programs/#comment-32506</link>
		<dc:creator>Phi</dc:creator>
		<pubDate>Wed, 11 Jun 2008 23:10:56 +0000</pubDate>
		<guid isPermaLink="false">http://UnhinderedByTalent.com/Phi/archives/2008/02/02/n-grams-and-the-evolution-of-programs/#comment-32506</guid>
		<description>Sorry for the delay in responding to this - the hazards of family holidays :-).

I'm not sure why the old link doesn't work anymore.  I found the new link, and have edited the link in the article above.  Thanks for pointing out the problem!</description>
		<content:encoded><![CDATA[<p>Sorry for the delay in responding to this - the hazards of family holidays :-).</p>
<p>I&#8217;m not sure why the old link doesn&#8217;t work anymore.  I found the new link, and have edited the link in the article above.  Thanks for pointing out the problem!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tozier</title>
		<link>http://UnhinderedByTalent.com/Phi/archives/2008/02/02/n-grams-and-the-evolution-of-programs/#comment-32501</link>
		<dc:creator>Tozier</dc:creator>
		<pubDate>Fri, 06 Jun 2008 11:22:37 +0000</pubDate>
		<guid isPermaLink="false">http://UnhinderedByTalent.com/Phi/archives/2008/02/02/n-grams-and-the-evolution-of-programs/#comment-32501</guid>
		<description>Oy! Link to preprint/paper is dead. Have we another recourse, or has the evil Springer empire eaten the thing completely and hidden it from public eyes?</description>
		<content:encoded><![CDATA[<p>Oy! Link to preprint/paper is dead. Have we another recourse, or has the evil Springer empire eaten the thing completely and hidden it from public eyes?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: CoryQ</title>
		<link>http://UnhinderedByTalent.com/Phi/archives/2008/02/02/n-grams-and-the-evolution-of-programs/#comment-32291</link>
		<dc:creator>CoryQ</dc:creator>
		<pubDate>Mon, 04 Feb 2008 15:07:51 +0000</pubDate>
		<guid isPermaLink="false">http://UnhinderedByTalent.com/Phi/archives/2008/02/02/n-grams-and-the-evolution-of-programs/#comment-32291</guid>
		<description>OK, I'm not a scientist (nor do I play one on TV) and though I'm not sure what you might use this N-gram stuff for, I think it is interesting to read about.</description>
		<content:encoded><![CDATA[<p>OK, I&#8217;m not a scientist (nor do I play one on TV) and though I&#8217;m not sure what you might use this N-gram stuff for, I think it is interesting to read about.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Bill Tozier</title>
		<link>http://UnhinderedByTalent.com/Phi/archives/2008/02/02/n-grams-and-the-evolution-of-programs/#comment-32288</link>
		<dc:creator>Bill Tozier</dc:creator>
		<pubDate>Sun, 03 Feb 2008 19:01:51 +0000</pubDate>
		<guid isPermaLink="false">http://UnhinderedByTalent.com/Phi/archives/2008/02/02/n-grams-and-the-evolution-of-programs/#comment-32288</guid>
		<description>A long, long time ago (1991 SFI Summer School, I think) we played with masked long-range N-grams as a way of capturing multi-scale linguistic structure. Admittedly, we were considering texts and CA patterns character-wise, not word-wise, but the idea still seems applicable and interesting.

Most N-gram datasets consider only contiguous atoms, as in strings of three characters, or runs of three words. Back in the Dark Ages we looked at "masked" runs with "don't care" spaces in them. For example, as I recall we looked at how well "English-like" texts were reproduced by standard 4-grams [1234], compared to masked schema like [12#3##4] and [1##23##4], where "#" indicates a "don't care" space-holder.
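
To make the mask idea concrete, here is a minimal Python sketch. It is not the 1991 code; the function name masked_ngrams and the sample string are hypothetical, chosen only for illustration. It slides a mask over a string and counts the characters found at the kept positions, skipping every "#":

```python
from collections import Counter

def masked_ngrams(text, mask):
    """Count the symbols picked out by a mask such as "12#3##4".

    Digits mark positions that are kept; "#" marks don't-care positions.
    """
    keep = [i for i, ch in enumerate(mask) if ch != "#"]
    counts = Counter()
    # Slide a window as wide as the mask across the text.
    for start in range(len(text) - len(mask) + 1):
        window = text[start:start + len(mask)]
        counts["".join(window[i] for i in keep)] += 1
    return counts

contiguous = masked_ngrams("abcdefgh", "1234")   # ordinary 4-grams
masked = masked_ngrams("abcdefgh", "12#3##4")    # 4 symbols spread over 7
```

Both calls record four symbols per window, but the masked version spans seven positions, so it can pick up longer-range structure at the same table size.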

There was some math, and some fancy statistics we did on which masks were "better" for particular tasks... but it was just Summer School play. We looked at a limited subset of possible masks, and spent most of our time considering how to determine "Englishness" automatically. And on Mac IIs and first-generation NeXT boxen, so very very slowly.

I'm not sure if anybody ever followed up on it.

As I see it, you and Riccardo are essentially "lossily compressing" the kind of subtree archives that are useful in Conor Ryan's and Guido Smits's and others' work in GP. That seems like a great idea, and a challenge we're tackling right now in practice here in Ann Arbor. We're building a Push-GP system, and wanted to use "subtree" archives (Push programs aren't really trees) to accelerate and do pattern mining. I think we'll look at your approach instead.

And I can see how this would really boost the exploration/exploitation angle in GP: I can use your predictive EDA rules in an ongoing run of GP to bias search towards successful patterns (as you have), but can use the same info to hare off into unexplored territory by invoking some kind of inverse. By &lt;i&gt;avoiding&lt;/i&gt; what's been shown to be useful before.

Maybe somebody should do that right away. (He said, pointedly.)</description>
		<content:encoded><![CDATA[<p>A long, long time ago (1991 SFI Summer School, I think) we played with masked long-range N-grams as a way of capturing multi-scale linguistic structure. Admittedly, we were considering texts and CA patterns character-wise, not word-wise, but the idea still seems applicable and interesting.</p>
<p>Most N-gram datasets consider only contiguous atoms, as in strings of three characters, or runs of three words. Back in the Dark Ages we looked at &#8220;masked&#8221; runs with &#8220;don&#8217;t care&#8221; spaces in them. For example, as I recall we looked at how well &#8220;English-like&#8221; texts were reproduced by standard 4-grams [1234], compared to masked schema like [12#3##4] and [1##23##4], where &#8220;#&#8221; indicates a &#8220;don&#8217;t care&#8221; space-holder.</p>
<p>There was some math, and some fancy statistics we did on which masks were &#8220;better&#8221; for particular tasks&#8230; but it was just Summer School play. We looked at a limited subset of possible masks, and spent most of our time considering how to determine &#8220;Englishness&#8221; automatically. And on Mac IIs and first-generation NeXT boxen, so very very slowly.</p>
<p>I&#8217;m not sure if anybody ever followed up on it.</p>
<p>As I see it, you and Riccardo are essentially &#8220;lossily compressing&#8221; the kind of subtree archives that are useful in Conor Ryan&#8217;s and Guido Smits&#8217;s and others&#8217; work in GP. That seems like a great idea, and a challenge we&#8217;re tackling right now in practice here in Ann Arbor. We&#8217;re building a Push-GP system, and wanted to use &#8220;subtree&#8221; archives (Push programs aren&#8217;t really trees) to accelerate and do pattern mining. I think we&#8217;ll look at your approach instead.</p>
<p>And I can see how this would really boost the exploration/exploitation angle in GP: I can use your predictive EDA rules in an ongoing run of GP to bias search towards successful patterns (as you have), but can use the same info to hare off into unexplored territory by invoking some kind of inverse. By <i>avoiding</i> what&#8217;s been shown to be useful before.</p>
<p>Maybe somebody should do that right away. (He said, pointedly.)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Phi</title>
		<link>http://UnhinderedByTalent.com/Phi/archives/2008/02/02/n-grams-and-the-evolution-of-programs/#comment-32286</link>
		<dc:creator>Phi</dc:creator>
		<pubDate>Sun, 03 Feb 2008 13:08:06 +0000</pubDate>
		<guid isPermaLink="false">http://UnhinderedByTalent.com/Phi/archives/2008/02/02/n-grams-and-the-evolution-of-programs/#comment-32286</guid>
		<description>Very cool.  I'd thought about turning my bit of Ruby code into a web app so people could play, but hadn't actually done anything about it, in part because I assumed that there were probably similar things out there on the web already.

One important difference between the tool you found and the program I used is that he works at the character level, and I worked at the level of words.  Thus his program can generate "words" that aren't (like "linguish" and "documential"), whereas the words from my program will always be real, and the breakdown will be more at the sentence level.
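
The character-versus-word difference comes down to what counts as a token. A minimal Python sketch, assuming a simple follow-the-context sampling scheme (this is neither the Ruby program mentioned above nor the code behind the online tool; build_model, generate, and the sample sentence are hypothetical):

```python
import random
from collections import defaultdict

def build_model(tokens, n):
    """Map each run of n-1 tokens to the tokens seen right after it."""
    model = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context].append(tokens[i + n - 1])
    return model

def generate(tokens, n, length, rng):
    """Emit up to `length` tokens by repeatedly sampling a follower."""
    model = build_model(tokens, n)
    out = list(tokens[:n - 1])          # seed with the opening context
    while length > len(out):
        followers = model.get(tuple(out[len(out) - (n - 1):]))
        if not followers:               # dead end: context never continued
            break
        out.append(rng.choice(followers))
    return out

text = "the cat sat on the mat and the cat ran"
rng = random.Random(0)
words = generate(text.split(), 2, 8, rng)          # tokens are words
chars = "".join(generate(list(text), 3, 30, rng))  # tokens are characters
```

With words as tokens, every item in the output is a word from the source; with characters as tokens, only each run of three characters is guaranteed to come from the source, which is how non-words can appear.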

I hadn't thought about the relationship to spam text, but it does look a lot like the stuff they tend to include.  I suspect this is not a coincidence!</description>
		<content:encoded><![CDATA[<p>Very cool.  I&#8217;d thought about turning my bit of Ruby code into a web app so people could play, but hadn&#8217;t actually done anything about it, in part because I assumed that there were probably similar things out there on the web already.</p>
<p>One important difference between the tool you found and the program I used is that he works at the character level, and I worked at the level of words.  Thus his program can generate &#8220;words&#8221; that aren&#8217;t (like &#8220;linguish&#8221; and &#8220;documential&#8221;), whereas the words from my program will always be real, and the breakdown will be more at the sentence level.</p>
<p>I hadn&#8217;t thought about the relationship to spam text, but it does look a lot like the stuff they tend to include.  I suspect this is not a coincidence!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: ToolboxY2K</title>
		<link>http://UnhinderedByTalent.com/Phi/archives/2008/02/02/n-grams-and-the-evolution-of-programs/#comment-32285</link>
		<dc:creator>ToolboxY2K</dc:creator>
		<pubDate>Sun, 03 Feb 2008 08:36:57 +0000</pubDate>
		<guid isPermaLink="false">http://UnhinderedByTalent.com/Phi/archives/2008/02/02/n-grams-and-the-evolution-of-programs/#comment-32285</guid>
		<description>I found an online tool that creates N-grams (http://jonathanwellons.com/n-grams/index.cgi) and ran your "intro to N-grams" through it a couple of times to see what I would get:

N=1, generate 20 characters:
TTTTTTTTTTTTTTTTTTTT

N=5, generate 600 characters:
t (a book, say you can capture, but we table the beginning, as well.A quick is to ment; we could be very useful computational linguish between two texts (although if therefore, are lots of then N=1), while the butler whenever, does the but we table following sequently the otherefore, are words within a technical paper once at texts (does the frequency of triplets overlap, so we’re us and however, doesn’t really words, and we’re simply counting can capture, are like when words) that triplets), once in the documential idea is to looking up the beginning, as well.A quick is meaningful compromise

Looks interesting.  The second example looks like the "classic prose" that spammers use to get their junk mail past spam filters.

I like the term "computational linguish" to describe it.  Maybe we can evolve that term and find a more successful one. :)</description>
		<content:encoded><![CDATA[<p>I found an online tool that creates N-grams (http://jonathanwellons.com/n-grams/index.cgi) and ran your &#8220;intro to N-grams&#8221; through it a couple of times to see what I would get:</p>
<p>N=1, generate 20 characters:<br />
TTTTTTTTTTTTTTTTTTTT</p>
<p>N=5, generate 600 characters:<br />
t (a book, say you can capture, but we table the beginning, as well.A quick is to ment; we could be very useful computational linguish between two texts (although if therefore, are lots of then N=1), while the butler whenever, does the but we table following sequently the otherefore, are words within a technical paper once at texts (does the frequency of triplets overlap, so we’re us and however, doesn’t really words, and we’re simply counting can capture, are like when words) that triplets), once in the documential idea is to looking up the beginning, as well.A quick is meaningful compromise</p>
<p>Looks interesting.  The second example looks like the &#8220;classic prose&#8221; that spammers use to get their junk mail past spam filters.</p>
<p>I like the term &#8220;computational linguish&#8221; to describe it.  Maybe we can evolve that term and find a more successful one. :)</p>
]]></content:encoded>
	</item>
</channel>
</rss>
