Markov Models

Introduction

The premise of a Markov model is rather simple: keep track of how often events follow one another and leverage this information to perform inference. This principle makes Markov models flexible, powerful statistical models that can be put to use in almost every field of modern applied mathematics.

In practice, Markov models typically have a finite state-space. They are trained by creating an index of the frequency events follow one another in the training set. Once trained, a Markov model can:

For instance, when a college friend had a poetry assignment, rather than search his soul for words, he constructed a Markov model to generate a poem. He used a database of poems to discover which words commonly follow others (for example, the word “you” is often after the word “love”). When this same friend was forced to write an apology letter to the school for a certain mishap, he developed another model to construct the letter.

The Training Set

All Markov models need to be trained on some set of data vectors. This is called the training set. Since the training set will determine how a model behaves, properly selecting the data on which to train the model is very important. Selecting the right training set is domain specific and is something that should be given careful consideration when using any type of statistical model. Although choosing carefully is important, using common sense will usually suffice here. If you are training a model to generate rap lyrics, then train on rap, not love songs.

The Test Set

Once you have trained a Markov model you will probably want to test it. The dataset that is used for testing is called the test set. The most important thing to know about the test set is that it should be completely disjoint from the training set. Don't test on vectors that are in your training set. However, a good test set should follow the same probability distribution as the training set. In practice, this means that vectors in the test set should be similar but not identical to vectors in the training set. If you are training on poems, test on poems, but don't test on any poems you trained on because that would be cheating.

First-Order Markov model

In a first-order Markov model (FOMM), the next state is determined solely by the current state. That is, the next state is independent of all past events except for the current state. We can write this formally as: $$P(X_{n+1}|X_{n}, X_{n-1}, X_{n-2},...) = P(X_{n+1} | X_{n})$$ FOMMs are very convenient because they are easy to implement and provide very manageable space and time complexity. Despite their simplicity, FOMMs may outperform higher-order Markov models, especially when the data set is too sparse for a higher-order Markov model to detect a strong signal.

Training a FOMM

Training a FOMM involves keeping count of all bigrams in the training set. For example, if we were to train a FOMM on the set: $$\{A,B,C,A,B,C,A,B,C\}$$ We just count the frequency that one state follows another and end up with the following frequency table: $$\{A,A\} = 0 \qquad \{B,A\} = 0 \qquad \{C,A\} = 3$$ $$\{A,B\} = 3 \qquad \{B,B\} = 0 \qquad \{C,B\} = 0$$ $$\{A,C\} = 0 \qquad \{B,C\} = 3 \qquad \{C,C\} = 0$$ Now at any given state we can predict the next state by selecting the outcome with the highest probability given the observed data. If we were to use this model to construct a Markov chain starting at state A we would get:
$$A \to B \to C \to A \to B \to C \to A \to B \to C \to A \to ....$$
As you can see, Markov models can be very good at detecting patterns in data when they exist.

FOMM: Applications

We've seen above that a first-order Markov model is able to detect simple patterns. This at the core of what data mining is about.

FOMM: Implementation Notes

There are a great many ways to implement a first-order Markov model. Here are two ways that I think are worth pointing out.

With a Database

If you need to persist your model to disk, index more data than can be stored in physical memory, or update your model on the fly then a database may be the way to go. You can think of the bigrams and counts as rows in a database table:
TABLE FOMM:
E_1E_2Total
AA0
AB3
AC0
BA0
BB0
BC3
CA3
CB0
CC0

First to create the database you might have SQL that looks something like:
CREATE TABLE FOMM (E_1 TEXT, E_2 TEXT, Total INTEGER);
To check if a bigram exists in the database:
SELECT Total FROM FOMM WHERE E_1='A' AND E_2='B';
To add a bigram to the database:
INSERT INTO FOMM VALUES ('A', 'B', 1);
To increment a bigram:
UPDATE FOMM SET Total=Total+1 WHERE E_1='A' AND E_2='B';
To get the most likely predecessor(s) of 'A':
SELECT E_2 FROM FOMM WHERE E_1='A' AND Total=(SELECT MAX(Total) FROM FOMM WHERE E_1='A');

With a HashMap/HashTable

If you don't want to use a database then you might want to construct a FOMM class that encapsulates nested HashMaps/HashTables. When properly implemented this approach provides some nice benefits:

Example: Click-Through Prediction

You may be wondering what real-world systems are simple enough to be modeled by a first-order Markov model. Consider the following problem: an eCommerce website (think Amazon) wants to predict what page a user will visit next given his or her current page. One way to address this problem is to use a FOMM.

As you can probably imagine, $X_n$ is the user's current page and $X_{n+1}$ is the user's next page. When a user transitions from one page to the next the model should be updated. Once enough data has been collected, such a model could be leveraged to make product recommendations and guide the user's path through the site.

Practical Considerations

Most systems are not first-order. Sometimes employing a higher-order Markov model can help improve signal. For more complicated problems there are more sophisticated techniques such as hidden Markov models, support vector machines and regression analysis, among others.