Heyyyyy, Alexander,

Long time no see! Of course I remember our discussions on Crap. I 
can't believe you are still busy playing with that program. I am 
flattered!

The principle is purely statistics. You know, different languages 
have different frequencies in terms of occurrances of letters, 
bigrams, triads, quads, etc. In English, E, S and A are the most 
frequent characters and it contains letters like Q and Z. Other 
languages, like Swedish, for instance, use Q and Z only in very 
special cases, but contain accented characters to a large extent (one 
of which is specific to Swedish and Norwegian only). German, too, 
uses accented characters. As for bigrams, the combination UU occurs 
in Finnish, whereas English uses EE and EA a lot. The list could be 
made enormous!

CRAP performs a two-pass operation. First it makes a 
statistical analysis of the input material in terms of word lengths 
and spelling. Then it uses this 'knowledge' to fake an output that 
looks like genuine language, in particular to someone only vaguely 
familiar with the language.

I once used it to produce a report (in English) on some test problems 
we had, and faxed it over to an American colleague of mine. He was 
thrilled!!

I have a new PC here, and the source code is on a drive I don't use 
any more, but I'll explain the algo to you:

Data structures:
==============

Sentence Length Counter (Variable).
Byte counter (Variable).

Linked Lists with the following format:
	| Item count | Link | Count | 3 chars | Link | (etc)
	|     2      |  2   |   2   |    3    |  bytes

(You need one list for one letter words, one list for two letter 
words, one list for three letter words and three lists for longer 
words.)

One linked list for the word lengths, and one for the sentence 
lengths. (You can have one for paragraph lengths too, if you want). 
Format:
	| Item Count | Link | Count | Length | ...
	|     2      |  2   |   2   |   1    | bytes

NOTE: All lists are null terminated, i.e. the last link in the list 
is 0. Note also that the Item Count appears only once per list. It is 
initialised to 0.

Word composition area. This is a linear area the size of the longest 
expected word plus a count byte.

Output data area: Linear area for characters. It would have to be 
rather lengthy. Try using a separate segment for this area.

Learning process:
================

Clear the Sentence-Length counter (variable).

While there is still room in your memory:
	Take a word from the input file and scrap special characters like 
	full stop, exclamation mark, comma, etc.

	Turn all characters into lower case.

	Count the number of letters in the word. Append the number to a 
	linked list of Word-Lengths.

	Increment a sentence length counter.

	If the word ends with a special character (full stop, etc), append 
	the contents of the sentence length counter to a linked list of 
	Sentence-Lengths and clear the counter.

	If the word is a one letter word, append it to a linked list of 
	one-letter words, then go to Done. If the word already exists in the list,
	 increment the counter for that word in the list.

	If it is a two letter word, append it to a linked list of two letter 
	words, then go to Done. If the word already exists in the list,
	 increment the counter for that word in the list.

	If it is a three letter word, append it to a linked list of three 
	letter words, then go to Done. If the word already exists in the list,
	 increment the counter for that word in the list.

	Take the three first letters and append them to a linked list of 
	First-Triads. Delete the first letter from the remainder of the 
	word. If the word already exists in the list, increment the counter 
	for that word in the list.

	WHILE there are MORE THAN three letters left in the word, 
		append the three first letters in the remaining word to a linked 
		list of Mid-Triads and delete the first letter from the remainder 
		of the word. If the word already exists in the list, increment the 
		counter for that word in the list.

	Add the remaining three letters to a linked list of End-Triads. If 
	the word already exists in the list, increment the counter for that 
	word in the list.

DONE:
Repeat


Preparation for output:
====================
The principle is built on weighing the triad frequencies. One way of 
doing this is to fill the output area with triads and index into the 
area with a random number, multiplied by three for the characters and 
1 for the lengths. The output area contains triads, repeated as many 
times as they appear in the input material. I have done it this way:

Getting a Word Length:
--------------------------------

Select the Word Length list. While the link is >0 do:
	For as many times as the Count do:
		Append the Length Byte to the Output Area.
		Increment the Byte Counter.
	Next link.
Repeat

Generate a random number between 0 and the number of bytes in the
Output Area.

Index the output area with this random number as offset. This is the 
current Length.

Getting a Sentence Length:
--------------------------------------

Do the same as for the Word Length above, but use the Sentence Length 
list instead.

Composing a word:
--------------------------

If the word length is 1, select the One Letter Word list.

If the word length is 2, select the Two Letter Word list.

If the word length is 3, select the Three Letter Word list.

If the word length is >3, select the First-Triad list.

Expand the selected list into the Output area in a way similar to  
the Word Lengths above, but appending ALL THREE characters at the 
time from the list. Then generate a random number between 0 and the 
number of bytes in the Output Area. Multiply the number with 3. Use 
this number to index into the Output area and copy the three 
following characters into the three first character positions of the 
Word Composition area.

If this is the first word in a sentence, turn the first character to 
upper case.

If the Word Length is 1, set the count to 1 and print (save) it, 
followed by a blank space. Go to DONE.

If the Word Length is 2, set the count to 2 and print (save) it,
followed by a blank space. Go to DONE.

If the Word Length is 3, set the count to 3 and print (save) it,
followed by a blank space. Go to DONE.

This is the tricky bit. The word is longer than three letters. We 
have the beginning and have to find the continuation of the word. 
This is done by comparing the TWO LAST characters of the word we 
are building with the TWO FIRST characters of the list of Mid-Triads. 
When we find a match, we take the last character from the matching 
three-character string and append to the output area. Then we 
generate a random number again between 0 and the number of bytes in 
the output area, pick the character at that index and append it to 
the word under construction. Loop until there is one more character 
left to append. Then select a character from the End-Triad list.

One thing to bear in mind:

If there is no match to the characters anywhere along the loop, we 
have to select a different First-triad and start again.

If you expand the lists to quads rather than triads, you have to 
expand the lists and collect 1-letter words, 2-letter words, 3-letter 
words, 4-letter words, First-Quads, Mid-Quads and End-Quads. The 
principle is exactly the same. You get a significantly better result, 
but it requires more memory.

This is how I remember the program. I hope I haven't missed something 
and that I have managed to explain it understandably. Even though the 
principle is very simple, parts of it are a bit complicated to 
implement.

I don't really see any sensible use for it. Do you?

Please let me know how you are getting along.

Kind regards,

Bengt
