by Erick Engelke Aug 22, 2025
Have you ever wondered how your phone and other systems predict the next word or phrase you might type?
They use something called Markov chains. Put very simply, a Markov chain gives the probability of a word given the previous word or words.
In just a few lines, I will show you how to do this for a two-word sequence. Longer sequences take more computation and slow us down, but two words are surprisingly enough to give pretty good results.
Data Source
We need a source of data to learn from. I picked the text of Frankenstein from Project Gutenberg. It’s in the public domain, not terribly long (12,000 words), and in English, all of which made it perfect for this test.
The Preprocessing Code
The page’s onshow handler loads Frankenstein from the network and calls ProcessText to convert it into a useful format. The progress symbol shows it’s working.
We separate the text into words. I didn’t add fancy code to detect punctuation, page numbers, chapter numbers, etc. The code is deliberately simplified so that it stays readable for others.
So if the word ends with a period or a comma, we record that as part of the word.
We also record the word that follows each word. If the same sequence of words appears multiple times in the text, then the second word is repeated that number of times.
So we are left with a data table. It’s not sorted or optimized for speed; again, I tried to keep this as simple as possible.
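To make the preprocessing step concrete, here is a minimal sketch in Python. It is not the source of this project; the names build_table and sample are made up for illustration. It assumes the text is split on whitespace only, so punctuation stays attached to the word, and that repeated followers are stored as repeated entries.

```python
from collections import defaultdict


def build_table(text: str) -> dict[str, list[str]]:
    """Map each word to the list of words that follow it in the text.

    Hypothetical helper for illustration only.
    """
    # Whitespace split: "sat." and "ran," keep their punctuation.
    words = text.split()
    table: defaultdict[str, list[str]] = defaultdict(list)
    # Pair every word with the word right after it.
    for current, following in zip(words, words[1:]):
        # Repeats are kept on purpose: a follower seen three times
        # appears three times in the list.
        table[current].append(following)
    return dict(table)


sample = "the cat sat. the cat ran, the dog sat."
table = build_table(sample)
print(table["the"])
```

Keeping duplicates in the list means the frequency of each follower can be counted later without storing separate counters.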
The User
The user is encouraged to enter text.
As they type, the onChange handler of the TEdit is called. If the final character on the line is a letter, number, etc., we search for words like the current one: word completion.
If the final character is a space, we look for the preceding word and guess what comes next.
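The decision between completion and prediction can be sketched as follows. This is Python for illustration, not the handler from this project; classify_input and complete are hypothetical names, and treating “words like the current one” as a simple prefix match is an assumption.

```python
def classify_input(line: str) -> tuple[str, str]:
    """Decide what to do with the current contents of the edit box.

    Returns ("predict", previous_word) when the line ends in whitespace,
    ("complete", partial_word) when it ends in any other character,
    and ("none", "") when there is nothing to work with.
    """
    if not line.strip():
        return ("none", "")
    last_word = line.split()[-1]
    if line[-1].isspace():
        # The user finished a word: guess what comes next.
        return ("predict", last_word)
    # The user is in the middle of a word: offer completions.
    return ("complete", last_word)


def complete(prefix: str, vocabulary: set[str]) -> list[str]:
    """Return known words starting with prefix (assumed matching rule)."""
    return sorted(word for word in vocabulary if word.startswith(prefix))


print(classify_input("the ca"))
print(classify_input("the cat "))
print(complete("ca", {"cat", "car", "dog"}))
```

In prediction mode, the returned word would be looked up in the data table to get the list of words that followed it in the book.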
ShowFrequency
A routine called ShowFrequency is used to show the predicted words. It counts the occurrences of each candidate word in the source text (Frankenstein) and treats that count as how likely the word is. It also sorts the words for good measure.
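Here is a rough Python sketch of the same idea, not the actual ShowFrequency routine. The name show_frequency is hypothetical, and so are two details: reporting each count as a share of the total, and sorting the most frequent word first.

```python
from collections import Counter


def show_frequency(followers: list[str]) -> list[tuple[str, float]]:
    """Rank candidate words by how often they occurred.

    followers is the list of words seen after the previous word,
    with repeats. Each result is (word, share of all occurrences).
    """
    counts = Counter(followers)
    total = len(followers)
    # Most frequent first; ties broken alphabetically so the order is stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(word, count / total) for word, count in ranked]


for word, share in show_frequency(["cat", "dog", "cat", "cat"]):
    print(f"{word}: {share:.0%}")
```

With this input, “cat” accounts for three of the four occurrences and “dog” for one, so “cat” is listed first.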
That’s all there is to it.
You can try out the system here.
Download the source here.
You can substitute other books to get different results.