wide-coding-overcomes-noise

Wide Coding Overcomes Noise

Our research on a word game shows that restrictions impose a limit on variation. There cannot be unlimited variation when there is a tight restriction on the range of a step of change. Natural languages take advantage of the limits of variation to protect their messages from the interference of noise. Living organisms have the characteristics of different species encoded in their DNA, a language-like system. The DNA language protects the natural reproduction process from many simple errors that could produce birth defects.

The most probable kind of noise corresponds to small steps, those that could change a single letter in a message. Natural languages protect messages against noise corruption by interspersing letter strings that are not words between the accepted words of the language. The natural language of DNA prevents reproduction errors by interspersing DNA sequences that do not code for any constructible protein between DNA code segments that do provide instructions for assembling proteins. With wide spacing a small amount of noise changes a word or code segment into an easily recognized nonsense sequence. The noise is unlikely to make the broad leap to another acceptable word or code segment that was not part of the original message.

We must introduce the concepts of sets, arrays, and spaces to see how this works. A set contains a bunch of items. The items need not be in any order. They may be an unorganized jumble. We can characterize a set by its size, the number of items. The set of all possible two-letter strings has 676 items. The subset of all two-letter English words has 59 items.

We can impose some order on the jumble by alphabetizing the strings. The order is arbitrary because the sequence of letters in the alphabet is conventional. Alphabetizing by each letter in turn starting with the first letter turns a set of strings into a list. There is another way of alphabetizing. We can arrange two-letter strings in a table if we alphabetize the first letter in rows and the second letter in columns.

Continuing this idea, we can introduce the concept of an array. A list is an array of one-letter strings. A table is an array of two-letter strings, and the table is square. An array of three-letter strings stacks up 26 square tables in alphabetical order by the third letter, making a cube. We can locate a three-letter word in such an array by its row, column, and table letter. We can visualize an array of three-letter strings with a three-dimensional model. Suppose the 26 square tables of three-letter strings are printed on 26 sequential pages of a book, with the pages in order by the third letter of the string. The book will not be cubical because the paper is much thinner than the length or width of any of the tables, but we can think of the array as being cubical.

If we go to four-letter strings, we need to put 26 cubical arrays in alphabetical order by the fourth letter. This technically makes the array a four-dimensional cube or tesseract, but few people can visualize such an object. In general, we can arrange strings of any given length n into a multidimensional array that is an n-dimensional cube. Does that sound complex? Arranging words in multidimensional arrays is a way of measuring their complexity.

An array becomes a space if there is some way to measure distance from any element of the array to any other element of the array. In geometric space we use the Pythagorean Theorem to measure distance. The square of the diagonal distance is the sum of the squares of the distances along each row, column, or edge of the array. There are simpler ways to measure distance. We may define the distance between two words as the minimum number of moves up or down, left or right, from table to table, to go from one word to another. This might be called the “city block” distance, because in a city one must go around corners instead of cutting across vacant lots.

From information theory we learn that all natural languages have wide coding spaces between accepted words to protect messages against noise. The spaces may be understood with the help of a table. We arrange the 59 two-letter English words in a coding space with all words beginning with a in the first row, all words beginning with b in the second row, etc. Then we arrange all the words in a row alphabetically, and space them out in columns according to their second letter. This gives the table shown below.

We note that most words are separated from the others by more than one column in a row, or by more than one row in a column. The only words adjacent in rows are: "am" and "an," "as" and "at," "aw" and "ax," "ax" and "ay," "el" and "em," "em" and "en," and "is," and "it." The only words adjacent in columns are "ay" and "by," "go" and "ho," "la" and "ma," "li" and "mi," "mu" and "nu," and "so" and "to." In the table below we have replaced each word with its city-block distance to its nearest neighbor.

The sum of the 59 numbers enclosed above is 115. The average spacing between nearest neighbors is 115/59 = 1.95 moves. The occupancy, or number of words divided by the number of spaces, is 59/676 = 9.42%.

To put the sparse occupation of the alphabet hyperspace another way, suppose an “average” two-letter word occupied a single cell in the two-dimensional space shown above, but it was surrounded by a layer one cell thick of empty cells. Then each of the 59 words would need 9 cells. There would be 9 x 59 = 531 cells occupied out of a total of 26 x 26 = 676 cells. That is, even with a minimum layer one cell thick surrounding each word, only 531/676 = 78.6% of the space is filled.

The space of three-letter words is 26 x 26 x 26 = 17 576 cells. Let each of the 340 accepted English words be embedded in a cube with walls one cell thick. The cube would be made up of 3 x 3 x 3 = 27 cells. Thus, with each word surrounded by a layer one cell thick, only 340 x 27 = 9,180 cells are occupied. The percentage of filling drops to 9,180/17,576 = 52.2%.

The space of four-letter words is 26 x 26 x 26 x 26 = 456,976 cells. Let each of the 833 accepted English words be embedded in a hypercube with walls one cell thick. The hypercube would be 3 x 3 x 3 x 3 = 81 cells. Thus, with each word surrounded by a layer one cell thick, only 833 x 81 = 67,473 cells are occupied. The percentage of filling drops to 67,473/456,976 = 14.8%. This allows us to increase the hyperspace around each word to a four-dimensional hypercube or tesseract four cells on a side. The tesseract surrounding a word has now 4 x 4 x 4 x 4 = 256 cells. The 833 words occupy 833 x 256 = 213,248 cells, which is 213,248/456,976 or 46.7% of the available space.

The 859 five-letter words occupy only 22.6% of the available space even with five cells on an edge. If we make the edges six cells long, the words occupy 56.2% of the space.

The 406 six-letter words occupy only 6.1% of the available space even with six cells on an edge. If we make the edges nine cells long, the words occupy 69.8% of the space.

The 86 seven-letter words occupy 69.5% of the available space with thirteen cells on an edge.

The 12 eight-letter words occupy 63.3% of the available space with eighteen cells on an edge.

Finally, the 1 nine-letter word occupies 100% of the available space with twenty-six cells on an edge.

The average spacing between words is large and the occupancy is much less than 100%. There is a good reason for this. English, like all natural languages, uses redundancy to help hearers understand even though there is noise and to help readers even if there are printing errors. As the environment becomes noisier, we restrict the number of words we use, and we say each one with more emphasis. This increases the average space between words. A bomber crew communicating among themselves over an intercom system may substitute the word “negative” for “no,” because “no” is too easily confused with “go” in a noisy environment.

Written language is also prone to transmission errors. Handwriting is not always clear enough to distinguish between letters. Typographical errors are easy to make. Let’s see how the average spacing between words above and the sparse occupancy prevents misunderstandings arising from transmission errors. There are 20 rows containing words. In the row containing the 9 entries “ad ah am an as at aw ax ay” a single error in the second letter of any entry has a (9 ‑ 1)/ (26 ‑ 1) = 8/25 = 32% chance of falsely substituting one word for another word. In the rows containing “do,” “fa,” “go,” “re,” “so,” “we,” “xi,” and “ye” a single error in the second letter of any entry has zero chances of falsely substituting one word for another word. The average chance of a single error in the second letter falsely substituting one word for another word is 2.64%. A similar calculation shows that a single error in the first letter of a word has a 2.78% chance of falsely substituting one word for another word. A single error in either the first or second letter therefore has a 2.64% + 2.78% = 5.42% chance of falsely substituting one word for another word. This is slightly better than a chance in eighteen.

Suppose we are sending a two-letter word as part of a message. The communications channel may have noise. Noise takes many forms, depending on the type of transmission channel. Noise broadly includes typographical or transcription errors. We hope the recipients will at least know if there was an error in transmission so they can ask for a retransmission.

From the above analysis we know that, if noise corrupts the transmission, there is a 5.42% chance of a two-letter English word appearing instead of another in the message. The recipients might not notice the error. The corrupted message could mislead them. But almost 95% of the time a single error in one of the letters will produce a two-letter string that our recipients will know is not an English word. This shows how the wide spacing keeps noise from corrupting the message.

As words become longer the spacing becomes wider and wider. This is why it is easier to detect a typographical error the longer the words are.

Human readers can process messages at a conceptual level. This allows them to detect and even correct many errors. Computer programs are still a long way behind. Grammar checkers often flag errors in sentences that are correct but may have unusual structures. For this reason, messages that control computers directly, such as program code, must usually be free of errors. There are various error-detecting and error-correcting schemes in use.

There is a good reason for the wide spacing in natural languages. However, it is preposterous to suppose that every originator of a language was thinking of that reason. People agree collectively to the words and meanings of a language using the rational facilities of their minds, but very few people understand what we have just explained about wide spacing. Do we collectively and unconsciously build safeguards into our language? Language shows much evidence of design, but we cannot identify any single human designer. Perhaps people learned collectively to make the sounds of their auditory system of signs easily distinguishable.

In the 20th century Darwinists tried to explain that languages evolved by some analog of Darwin’s natural selection process, but that idea is now completely discredited. Primitive natural languages are more complex than modern languages. Latin and ancient Greek had case declensions as well as several verb conjugations. We have to learn the proper endings to apply to different nouns and verbs to write and speak older languages correctly. Such features continue to a greater or lesser degree in modern languages. English has a few vestiges of declensions and conjugations. We suffix an apostrophe and sometime an “s” to words in the possessive case. In the present tense we add an “s” to the third person singular of regular verbs. Basically, languages become simpler when their speakers learn to write them down and when more and more people use them. This is just the opposite of the Darwinist idea of building up from the simple to the complex.

DNA is a Natural Language

Home