The premise of a Markov model is rather simple: keep track of how often
events follow one another and leverage this information to perform
inference. This principle makes Markov models flexible, powerful
statistical models that can be put to use in almost every field of
modern applied mathematics.
In practice, Markov models typically have a finite state-space. They are trained by creating an index of the frequency events follow one another in the training set.
Once trained, a Markov model can:
yield the probability that one event follows another
predict future events
generate a Markov chain representative of the training set
be used in conjunction with other Markov models to perform classification
For instance, when a college friend had a poetry assignment, rather than
search his soul for words, he constructed a Markov model to generate a poem. He
used a database of poems to discover which words commonly follow others (for
example, the word “you” is often after the word “love”). When
this same friend was forced to write an apology letter to the school for a
certain mishap, he developed another model to construct the letter.
The Training Set
All Markov models need to be trained on some set of data vectors. This is called the training set.
Since the training set will determine how a model behaves, properly selecting
the data on which to train the model is very important. Selecting the right
training set is domain specific and is something that should be given careful
consideration when using any type of statistical model. Although choosing
carefully is important, using common sense will usually suffice here. If you
are training a model to generate rap lyrics, then train on rap, not love songs.
The Test Set
Once you have trained a Markov model you will probably want to test it. The
dataset that is used for testing is called the test set. The most
important thing to know about the test set is that it should be completely
disjoint from the training set. Don't test on vectors that are in your training
set. However, a good test set should follow the same probability distribution
as the training set. In practice, this means that vectors in the test set
should be similar but not identical to vectors in the training set. If you are
training on poems, test on poems, but don't test on any poems you trained on
because that would be cheating.
First-Order Markov model
In a first-order Markov model (FOMM), the next state is determined solely by
the current state. That is, the next state is independent of all past events
except for the current state. We can write this formally as:
$$P(X_{n+1}|X_{n}, X_{n-1}, X_{n-2},...) = P(X_{n+1} | X_{n})$$
FOMMs are very convenient because they are easy to implement and provide very
manageable space and time complexity. Despite their simplicity, FOMMs may
outperform higher-order Markov models, especially when the data set is too
sparse for a higher-order Markov model to detect a strong signal.
Training a FOMM
Training a FOMM involves keeping count of all bigrams in the training set. For
example, if we were to train a FOMM on the set:
$$\{A,B,C,A,B,C,A,B,C\}$$
We just count the frequency that one state follows another and end up with the
following frequency table:
$$\{A,A\} = 0 \qquad \{B,A\} = 0 \qquad \{C,A\} = 3$$
$$\{A,B\} = 3 \qquad \{B,B\} = 0 \qquad \{C,B\} = 0$$
$$\{A,C\} = 0 \qquad \{B,C\} = 3 \qquad \{C,C\} = 0$$
Now at any given state we can predict the next state by selecting the outcome
with the highest probability given the observed data.
If we were to use this model to construct a Markov chain starting at state A we
would get:
$$A \to B \to C \to A \to B \to C \to A \to B \to C \to A \to ....$$
As you can see, Markov models can be very good at detecting patterns in data
when they exist.
FOMM: Applications
We've seen above that a first-order Markov model is
able to detect simple patterns. This at the core of what data mining is about.
FOMM: Implementation Notes
There are a great many ways to implement a first-order Markov model. Here are
two ways that I think are worth pointing out.
With a Database
If you need to persist your model to disk, index more data than can be stored
in physical memory, or update your model on the fly then a database may be the
way to go. You can think of the bigrams and counts as rows in a database table:
TABLE FOMM:
E_1
E_2
Total
A
A
0
A
B
3
A
C
0
B
A
0
B
B
0
B
C
3
C
A
3
C
B
0
C
C
0
First to create the database you might have SQL that looks something like:
CREATE TABLE FOMM (E_1 TEXT, E_2 TEXT, Total INTEGER);
To check if a bigram exists in the database:
SELECT Total FROM FOMM WHERE E_1='A' AND E_2='B';
To add a bigram to the database:
INSERT INTO FOMM VALUES ('A', 'B', 1);
To increment a bigram:
UPDATE FOMM SET Total=Total+1 WHERE E_1='A' AND E_2='B';
To get the most likely predecessor(s) of 'A':
SELECT E_2 FROM FOMM WHERE E_1='A' AND Total=(SELECT MAX(Total) FROM FOMM WHERE E_1='A');
With a HashMap/HashTable
If you don't want to use a database then you might want to construct a FOMM
class that encapsulates nested HashMaps/HashTables.
When properly implemented this approach provides some nice benefits:
Fast $O(1)$ lookups and updates
Efficient use of space despite keeping the model in physical memory
Java HashTables are serializeable so
they can be persisted to disk and restored later
Example: Click-Through Prediction
You may be wondering what real-world systems are simple enough to be modeled
by a first-order Markov model. Consider the following problem: an eCommerce
website (think Amazon) wants to predict what page a user will visit next given
his or her current page. One way to address this problem is to use a FOMM.
As you can probably imagine, $X_n$ is the user's current page and $X_{n+1}$
is the user's next page. When a user transitions from one page to the next the
model should be updated. Once enough data has been collected, such a model
could be leveraged to make product recommendations and guide the user's path
through the site.
Practical Considerations
Most systems are not first-order. Sometimes employing a higher-order Markov
model can help improve signal. For more complicated problems there are more
sophisticated techniques such as hidden Markov models, support vector
machines and regression analysis, among others.