N-grams and the evolution of programs

Which of the following was written by (a) me, (b) William Shakespeare, and (c) Charles Darwin?

“I would have sent to Rome that’s worthy death?”

“The naturalist looking at species as he might succeed from a fork low down in the separation of the species of any species in a more or less from their sap this is unimportant for the instincts already possessed by certain plants so that natural selection of mere piles of superimposed strata and watch the sea separating an island even if we believe that pure water can effect little or no offspring.”

“The troubling aspects of a building block semantics in a given tree in the context and false.”

The answer and (much) more is below the fold.

That probably wasn’t too hard.

Except, of course, I’m a mean person and it was a trick question. In fact none of those sentences was actually written by a human being. All three of these pseudo-quotes were generated using simple statistics and the notion of N-grams (which I will describe below) to capture and reproduce the sense of a text, without really understanding it at a larger scale. I wrote a program to create these examples as an opening illustration for my Dagstuhl talk, and since they’re a lot of fun I thought I’d summarize the key concepts here.

A quick intro to N-grams

The essential idea is to look at how frequently groups of adjacent symbols (called N-grams) occur. In this example the symbols are words, and we’re using N=3, so we’re looking at triplets of adjacent words. The preceding sentence, for example, starts with the following sequence of triplets:

  • (“In”, “this”, “example”)
  • (“this”, “example”, “the”)
  • (“example”, “the”, “symbols”)
  • Etc.

Note that the triplets overlap, so most words will in fact appear in three N-grams (triplets), once at the end, once in the middle, and once at the beginning, as happens for “example” above.
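
For the programmers in the audience, extracting those overlapping triplets is a one-liner in Ruby (the language my little program happened to be written in); this is an illustrative sketch rather than an excerpt from that program:

    # An illustrative sketch: extracting the overlapping triplets (3-grams) from a sentence.
    sentence = "In this example the symbols are words"
    words = sentence.split

    # each_cons(3) yields every run of three adjacent words, overlapping as described above.
    triplets = words.each_cons(3).to_a
    # => [["In", "this", "example"], ["this", "example", "the"], ["example", "the", "symbols"], ...]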

Let’s say you go through a text (a book, say), and make a great big list of all the N-grams and how often each of them occurs. What would that tell us about that text? That depends pretty crucially on the value of N. If N=1, for example, then we’re simply counting up the frequency of individual words. This could be very useful in certain circumstances; we could probably use it, for example, to decide whether the text is a crime novel or a technical paper on econometrics. N=1, however, doesn’t give us any sense of the larger structure of the document; we might find it difficult to distinguish between two particular novels, and we’d be unable to recognize when words tended to appear in related contexts (does the author tend to mention the butler whenever the missing candlestick is discussed?).
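
Building that great big list is just a matter of tallying N-grams as you go. A rough sketch of the bookkeeping (the file name here is made up, and this isn’t the exact code I used):

    # A rough sketch of building an N-gram frequency table (N=3 here).
    words = File.read("some_text.txt").split   # hypothetical file name

    counts = Hash.new(0)
    words.each_cons(3) { |triplet| counts[triplet] += 1 }

    # With N=1 this degenerates into plain word counts:
    word_counts = Hash.new(0)
    words.each { |word| word_counts[word] += 1 }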

If, on the other hand, we take N to be much larger (several hundred or even several thousand words) then we capture lots of long-range structure, but we lose any kind of meaningful statistics. Almost all sequences of 100 consecutive words, for example, are likely to be unique. Consequently they will tend to appear only once in our table for a given text, and are unlikely to appear in the tables for any two texts (although if they did, that would be a useful plagiarism flag). Very long sequences don’t really work, therefore, as a tool for looking for relationships between documents, or even for patterns within a single text.

Experience in fields such as computational linguistics suggests that taking N=3 is a useful compromise. [1] Different triples occur often enough (assuming the text has some length) that their distribution is meaningful (as with N=1), while there is enough structure and overlap at N=3 that you can capture some longer-range regularities as well.

Generating my “pseudo-quotes”

All of which is illustrated by my pseudo-quotes above. Each was generated by first computing the frequency of triples in three texts: Coriolanus, The Origin of Species, and a paper I wrote last fall with a couple of UMM students. [2] I also kept track of the frequency with which words occurred as the first word of a sentence, and the frequency of the pairs formed by the first two words of every sentence.

The program starts generating a sentence by picking a first word based on the frequency of first words (so if 20% of the sentences in the source text started with “The”, there would be a 20% chance of starting our new sentence with “The”). Given that choice, the program takes the table of frequencies of all the starting pairs, pulls out just those pairs that begin with our chosen first word, and picks the second word from among them, again weighted by frequency. Once we have the first two words, we can repeatedly generate words based on the frequencies of 3-grams in our big table. If we’re using Coriolanus, and the last two words we’ve generated were (in order) “I” and “would”, then the relevant triples that occur in that play more than once are:

I would he      2 occurrences
I would I       2 occurrences
I would have    3 occurrences
I would they    4 occurrences
I would not     4 occurrences

with another 11 “I would” triples that occurred a single time. Thus “I would have” (the prefix we chose in the “Shakespearean” sentence above) was more likely to be chosen than “I would he” by a ratio of 3:2, and three times more likely to be chosen than any of the triples that occurred just once. It was, on the other hand, less likely to be chosen than either “I would they” or “I would not”, but the (digital) dice rolled in its favor on this particular run.

Now that we have that start, we continue choosing words based on the last two that we’ve added to the sentence. So we look in our table at which triples start with “would have”, and choose (this time) “sent” as the next word. We then look up “have sent” and choose “to”. The process continues (in my program at least) until we choose a “word” that’s in fact a terminal punctuation mark (a period, a question mark, or an exclamation point), thus ending the sentence.
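
The whole generation loop, then, is just a series of weighted random choices that continues until we hit terminal punctuation. Here’s a stripped-down sketch of the idea; the table layout and the weighted_choice helper are my illustration here, not a verbatim excerpt from the program:

    # A stripped-down sketch of the sentence generator. first_words, first_pairs,
    # and triples are frequency tables like those described above ({ thing => count }).

    # Pick a key from a { thing => count } hash, weighted by the counts.
    def weighted_choice(counts)
      total = counts.values.sum
      target = rand(total)
      counts.each { |item, count| return item if (target -= count) < 0 }
    end

    def generate_sentence(first_words, first_pairs, triples)
      w1 = weighted_choice(first_words)
      # Restrict the table of starting pairs to those that begin with our chosen first word.
      w2 = weighted_choice(first_pairs.select { |pair, _| pair.first == w1 })[1]

      sentence = [w1, w2]
      until %w[. ! ?].include?(sentence.last)
        # All the triples whose first two words match the last two words chosen so far.
        candidates = triples.select { |triple, _| triple[0, 2] == sentence.last(2) }
        sentence << weighted_choice(candidates)[2]
      end
      sentence.join(" ")
    end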

A few observations

As we can see from the pseudo-quotes that I opened with, this process can generate sentences whose style is clearly recognizable, and which can make a great deal of sense at the local level (e.g., at the level of phrases). This process, however, doesn’t “understand” or respect larger structural connections or semantics. The third pseudo-quote, for example,

“The troubling aspects of a building block semantics in a given tree in the context and false.”

lacks a verb. The long Darwin pseudo-quote

“The naturalist looking at species as he might succeed from a fork low down in the separation of the species of any species in a more or less from their sap this is unimportant for the instincts already possessed by certain plants so that natural selection of mere piles of superimposed strata and watch the sea separating an island even if we believe that pure water can effect little or no offspring.”

rambles all over the shop, containing very sensible phrases like “The naturalist looking at species” as well as nonsense such as “pure water can effect little or no offspring”.

What we did with it

In the research I was reporting on, Riccardo Poli and I applied this idea of N-grams as a tool for capturing regularities in a language, but to computer programs rather than English text. The programs were represented as sequences of simple “machine” instructions in a highly simplified programming language designed for a specific set of test problems. We used a type of Estimation of Distribution Algorithm (EDA) to essentially evolve the triplet frequencies, instead of taking them from a text as I did with Coriolanus, and then used those frequencies to generate programs much as I did with the text examples. We would generate a set of, say, 100 programs this way, and try each of them on our test problem. Some were better than others, so we’d take the better ones (say the top half) and use them to update the frequencies; triplets that appeared in those better programs would have their frequencies increased somewhat, while those that didn’t would have their frequencies reduced.

When repeated over several generations, this process would evolve/learn/find/discover a set of frequencies that allowed it to generate successful programs with a reasonably high probability, at least on problems whose solutions could be formed from repeated sequences of instructions. Even though it was limited to 3-grams, the system was able to “learn” some fairly long sequences of instructions. In one case, for example, the evolved set of probabilities generated, given a particular starting pair, a particular sequence of 9 instructions with a probability of over 60%, which is some 500,000 times more likely than generating a sequence of that length by randomly drawing instructions from a hat. The solutions that were generated tended to be composed of numerous copies of a small number of basic patterns, and it seems likely that this approach will do better in problem spaces where there are solutions that exhibit that kind of regularity.
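
To give a flavor of the frequency update, here’s a schematic sketch of one generation. The generate_program and fitness calls stand in for the real pieces of the system, and the learning rate is a placeholder value, so treat this as an outline of the idea rather than the actual implementation:

    # A schematic sketch of one generation of the triplet-frequency EDA.
    # `frequencies` maps instruction triples to weights; generate_program and
    # fitness are placeholders standing in for the real system.

    LEARNING_RATE = 0.1   # placeholder value

    def update_frequencies(frequencies, population_size: 100)
      programs = Array.new(population_size) { generate_program(frequencies) }

      # Keep the better half of the population (higher fitness is better here).
      best = programs.sort_by { |prog| -fitness(prog) }.first(population_size / 2)

      # Tally the triples that appear in the better programs...
      observed = Hash.new(0)
      best.each { |prog| prog.each_cons(3) { |triple| observed[triple] += 1 } }
      total = observed.values.sum.to_f
      return frequencies if total.zero?

      # ...and nudge each stored frequency toward the observed distribution.
      frequencies.each_key do |triple|
        target = observed[triple] / total
        frequencies[triple] = (1 - LEARNING_RATE) * frequencies[triple] + LEARNING_RATE * target
      end
      frequencies
    end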

If you’re interested in learning more, check out the paper: “A Linear Estimation-of-Distribution GP System”.

[1] Does anyone know of a theoretical justification for this?

[2] “Semantic building blocks in Genetic Programming”, which will be appearing next month in the Proceedings of the European Conference on Genetic Programming (provide link).

6 thoughts on “N-grams and the evolution of programs”

  1. I found an online tool that creates N-grams (http://jonathanwellons.com/n-grams/index.cgi) and ran your “intro to N-grams” through it a couple of times to see what I would get:

    N=1, generate 20 characters:
    TTTTTTTTTTTTTTTTTTTT

    N=5, generate 600 characters:
    t (a book, say you can capture, but we table the beginning, as well.A quick is to ment; we could be very useful computational linguish between two texts (although if therefore, are lots of then N=1), while the butler whenever, does the but we table following sequently the otherefore, are words within a technical paper once at texts (does the frequency of triplets overlap, so we’re us and however, doesn’t really words, and we’re simply counting can capture, are like when words) that triplets), once in the documential idea is to looking up the beginning, as well.A quick is meaningful compromise

    Looks interesting. The second example looks like the “classic prose” that spammers use to get their junk mail past spam filters.

    I like the term “computational linguish” to describe it. Maybe we can evolve that term and find a more successful one. :)

  2. Very cool. I’d thought about turning my bit of Ruby code into a web app so people could play, but hadn’t actually done anything about it, in part because I assumed that there were probably similar things out there on the web already.

    One important difference between the tool you found and the program I used is that his works at the character level, while mine works at the level of words. Thus his program can generate “words” that aren’t real words (like “linguish” and “documential”), whereas the words from my program will always be real; the breakdown instead happens more at the phrase and sentence level.

    I hadn’t thought about the relationship to spam text, but it does look a lot like the stuff they tend to include. I suspect this is not a coincidence!

  3. A long, long time ago (1991 SFI Summer School, I think) we played with masked long-range N-grams as a way of capturing multi-scale linguistic structure. Admittedly, we were considering texts and CA patterns character-wise, not word-wise, but the idea still seems applicable and interesting.

    Most N-gram datasets consider only contiguous atoms, as in strings of three characters, or runs of three words. Back in the Dark Ages we looked at “masked” runs with “don’t care” spaces in them. For example, as I recall we looked at how well “English-like” texts were reproduced by standard 4-grams [1234], compared to masked schema like [12#3##4] and [1##23##4], where “#” indicates a “don’t care” space-holder.
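
    For concreteness, a mask like [12#3##4] amounts to something like this little sketch (just to show what I mean by the “#” positions; the function name and the text are made up for illustration):

    # Extract "masked" 4-grams, skipping the "#" (don't care) positions.
    def masked_ngrams(chars, mask)
      keep = mask.each_index.select { |i| mask[i] != "#" }
      (0..chars.length - mask.length).map do |start|
        keep.map { |offset| chars[start + offset] }
      end
    end

    p masked_ngrams("the quick brown fox".chars, %w[1 2 # 3 # # 4])
    # => [["t", "h", " ", "i"], ["h", "e", "q", "c"], ...]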

    There was some math, and some fancy statistics we did on which masks were “better” for particular tasks… but it was just Summer School play. We looked at a limited subset of possible masks, and spent most of our time considering how to determine “Englishness” automatically. And on Mac IIs and first-generation NeXT boxen, so very very slowly.

    I’m not sure if anybody ever followed up on it.

    As I see it, you and Riccardo are essentially “lossily compressing” the kind of subtree archives that are useful in Conor Ryan’s and Guido Smits’s and others’ work in GP. That seems like a great idea, and a challenge we’re tackling right now in practice here in Ann Arbor. We’re building a Push-GP system, and wanted to use “subtree” archives (Push programs aren’t really trees) to accelerate and do pattern mining. I think we’ll look at your approach instead.

    And I can see how this would really boost the exploration/exploitation angle in GP: I can use your predictive EDA rules in an ongoing run of GP to bias search towards successful patterns (as you have), but I can also use the same info to hare off into unexplored territory by invoking some kind of inverse: avoiding what’s been shown to be useful before.

    Maybe somebody should do that right away. (He said, pointedly.)

  4. OK, I’m not a scientist (nor do I play one on TV) and though I’m not sure what you might use this N-gram stuff for, I think it is interesting to read about.

  5. Oy! Link to preprint/paper is dead. Have we another recourse, or has the evil Springer empire eaten the thing completely and hidden it from public eyes?

  6. Sorry for the delay in responding to this – the hazards of family holidays :-).

    I’m not sure why the old link doesn’t work anymore. I found the new link, and have edited the link in the article above. Thanks for pointing out the problem!
