Hands-On Markov Models with Python: Implement probabilistic models for learning complex data sequences using the Python ecosystem


Introduction to the Markov Process

In this chapter, we will develop the basic concepts that we need to understand Hidden Markov Models (HMM). We will cover the following topics:

  • Random processes
  • Markov processes
  • Markov chains or discrete-time Markov processes
  • Continuous-time Markov chains

Random variables

As we always do in statistics, let's start with a simple example of rolling a die. If we consider rolling a fair die, the outcome can be any number from 1 to 6 and is random. To represent such situations (the outcome of rolling the die in this case), in mathematics we use the concept of random variables. We come across many such variables in our everyday lives. Another example could be ordering food at a restaurant; in this case, the outcome could be any item on the menu. In general terms, a random variable is a variable whose possible values are outcomes of a random phenomenon. The possible outcomes are also known as the domain of the random variable, and the outcome is based on the probability distribution defined over that domain.

Coming back to rolling the die, the domain of the random variable outcome, O, is given by domain(O) = {1, 2, 3, 4, 5, 6}, and the probability distribution is given by a uniform distribution P(o) = 1/6 ∀ o ∈ domain(O). Similarly, in the case of the restaurant example, for the random variable choosing a dish, the domain would be every item on the menu, and the probability distribution would depend on your food preference. In both of the previous examples, the domain of the random variable contains discrete values; such random variables are known as discrete random variables. But it's also possible for the domain to be a continuous space. For example, consider the random variable representing the stock price of Google tomorrow. The domain of this random variable is all positive real numbers, with most of the probability mass distributed around ±5% of today's price. Such random variables are known as continuous random variables.
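
To make this concrete, here is a minimal sketch (not from the book) of the die roll as a discrete random variable in NumPy; the domain is the list of possible outcomes and the distribution over it is uniform:

import numpy as np

# Domain of the random variable O (the outcome of a fair die roll).
domain = [1, 2, 3, 4, 5, 6]
# Uniform probability distribution over the domain: P(o) = 1/6 for every o.
probabilities = [1 / 6] * 6

# Draw a few samples of the random variable.
samples = np.random.choice(domain, size=10, p=probabilities)
print(samples)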

Random processes

In the previous section, we discussed random variables that are able to mathematically represent the outcomes of a single random phenomenon. But what if we want to represent these random events over some period of time or over the length of an experiment? For example, let's say we want to represent the stock price at one-hour intervals throughout a day, or we want to represent the height of a ball at one-second intervals after it is dropped from some height in a vacuum. For such situations, we need a set of random variables, each of which represents the outcome at a given instance of time. Such a set of random variables, indexed over a period of time, is known as a random process. It is worth noting that the domains of all these random variables are the same. Therefore, we can also think of the process as just changing states over time.

Here, we have been talking about random variables at different instances of time, but the index doesn't need to be time in every case; it could be some other event. However, since in most cases it is time, and it is much easier to talk about random processes in terms of time, we will use time to represent any such index. The same concepts apply to a model that varies over some other event instead of time.

Now let's discuss the previous two examples in more detail. Starting with the example of dropping the ball from a height in a vacuum, if we know the exact value of gravity and the height from which the ball is being dropped, we will be able to determine the exact location of the ball at every interval of one second using Newton's laws of motion.

Such random processes, in which we can deterministically find the state of each random variable given the initial conditions (in this case, dropping the ball, zero initial velocity) and the parameters of the system (in this case, the value of gravity), are known as deterministic random processes (commonly called deterministic processes).

Now let's go to the second example: representing the stock price over time. In this case, even if we know the current price and the exact probability distribution of the price at the next one-hour mark, we won't be able to deterministically compute the value. Such random processes, in which we can't determine the state of the process even if we are given the initial conditions and all the parameters of the system, are known as stochastic random processes (commonly called stochastic processes). A very good way of understanding or getting a feel for a stochastic process is to think of it as the opposite of a deterministic process.
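
The following is a small sketch (not from the book; the 100 m starting height, g = 9.8 m/s², and the ±1% hourly price moves are arbitrary illustrative values) contrasting the two kinds of processes: the ball's heights are fully determined by the initial conditions, whereas the price path changes on every run:

import numpy as np

# Deterministic process: height of a ball dropped in a vacuum, sampled every second.
g, h0 = 9.8, 100.0
heights = [max(h0 - 0.5 * g * t ** 2, 0.0) for t in range(6)]
print(heights)   # The same output on every run.

# Stochastic process: a toy stock price sampled every hour, with an assumed
# Gaussian move of about 1% per hour (purely for illustration).
price = 1000.0
prices = []
for _ in range(6):
    price *= 1 + np.random.normal(loc=0.0, scale=0.01)
    prices.append(round(price, 2))
print(prices)    # A different output on every run.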

Markov processes

A stochastic process is called a Markov process if the state of the random variable at the next instance of time depends only on the outcome of the random variable at the current time. In simple mathematical terms, for a stochastic process, S = {R1, R2, . . ., Rn} = {R}t=1, . . ., n, to be a Markov process, it must satisfy the following condition:

P(Rn+1 = r | R1, R2, ..., Rn) = P(Rn+1 = r | Rn)

According to the previous condition, the probability distribution for any variable at any given instance in a Markov process is a conditional distribution, which is conditioned only on the random variable at the last time instance. This property of a system, such that the future states of the system depend only on the current state of the system, is also known as the Markov property. Systems satisfying the Markov property are also known as memoryless systems since they don't need to remember the previous states to compute the distribution of the next state, or, in other words, the next state depends only on the current state of the system.

A very common example used to explain the Markov process is that of a drunk man walking along a street. We consider that, since the man is drunk, he can either take a step backward, take a step forward, or stay in his current position, with some probability distribution over these actions, let's say [0.4, 0.4, 0.2]. Now, given the position of the man at any given instance in time, his position at the next instance depends only on his current position and the parameters of the system (his step size and the probability distribution over the possible actions). Therefore, this is an example of a Markov process.
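
As a quick illustration (a sketch, not code from the book), we can simulate this walk with NumPy; the unit step size and the [0.4, 0.4, 0.2] distribution follow the description above, and the starting position of 0 is an arbitrary choice:

import numpy as np

# Possible moves (forward, backward, stay) and their probabilities.
moves = [1, -1, 0]
probs = [0.4, 0.4, 0.2]

position = 0
positions = [position]
for _ in range(10):
    # The next position depends only on the current position (the Markov property).
    position += np.random.choice(moves, p=probs)
    positions.append(position)
print(positions)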

In the previous example, let's assume that the drunk man takes an action (steps forward/backward or stays in his position) at fixed intervals of time and that his step size is always the same. With these considerations, the Markov process in our example has a discrete state space. Also, since the man takes steps at fixed intervals of time, we can think of the process as having discrete time. But Markov processes don't need to have a discrete state space or discrete time intervals. Considering discrete and continuous time, as well as discrete and continuous state spaces, we can categorize Markov processes into four main categories:

  • Discrete time and discrete state space
  • Discrete time and continuous state space
  • Continuous time and discrete state space
  • Continuous time and continuous state space

We will discuss each of these categories of Markov process in more detail in the following sections.

Installing Python and packages

Installation on Windows

Miniconda can be installed on a Windows system by just double-clicking on the downloaded .exe file and following the installation instructions. After installation, we will need to create a conda environment and install all the required packages in the environment. To create a new Python 3.4 environment with the name hmm, run the following command:

conda create -n hmm python=3.4

After creating the environment, we will need to activate it and install the required packages in it. This can be done using the following commands:

activate hmm
conda install numpy scipy

Installation on Linux

On Linux, after downloading the Miniconda file, we will need to give it execution permissions and then install it. This can be done using the following commands:

chmod +x Miniconda.sh
./Miniconda.sh

After executing the file, we can simply follow the installation instructions. Once installed, we will need to create a new environment and install the required packages. We can create a new Python 3.4 environment with the name hmm using the following commands:

conda create -n hmm python=3.4

Once the environment has been created, we will need to activate it and install the packages inside it using the following:

source activate hmm
conda install numpy scipy
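
Once the packages are installed, a quick way to verify the environment (not part of the book's commands, just a convenience) is to start Python inside the activated environment and print the package versions:

import numpy
import scipy

print(numpy.__version__)
print(scipy.__version__)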

Markov chains or discrete-time Markov processes

A Markov chain is a type of Markov process in which the time is discrete. However, there is a lot of disagreement among researchers about which categories of Markov processes should be called Markov chains. Most commonly, though, the term is used to refer to discrete-state-space Markov processes. Therefore, a Markov chain is a stochastic process over a discrete state space satisfying the Markov property. More formally, we can say that a discrete-time Markov chain is a sequence of random variables X1, X2, X3, ... that satisfy the Markov property, namely that the probability of moving from the current state to the next state depends only on the present state and not on any of the previous states. In terms of the probability distribution, we can say that, given that the system is at time instance n, the conditional distribution of the states at the next time instance, n + 1, is conditionally independent of the states of the system at time instances {1, 2, . . ., n-1}, given the state of the random variable at time instance n. This can be written as follows:

P(Xn+1 = x | X1 = x1, X2 = x2, ..., Xn = xn) = P(Xn+1 = x | Xn = xn)

Markov chains are often represented using directed graphs. The nodes in the directed graph represent the different possible states of the random variables, and the edges represent the probability of the system going from one state to the other at the next time instance. Let's take a simple example of predicting the weather to understand this representation better. We will consider that there are three possible states of the random variable Weather={Sunny, Rainy, Snowy}, and a possible Markov chain for this can be represented as shown in Figure 1.1:

Figure 1.1: A simple Markov chain on the random variable, representing the random variable Weather={Sunny, Rainy, Snowy} and showing the probability of the random variable switching to other states in the next time instance

One of the main points to understand in Markov chains is that we are modeling the outcomes of a sequence of random variables over time. This is sometimes confusing for people since the model is represented using a single graph, which doesn't mention anything about time. So, the name state transitions is not a particularly good name for this, since the state is not changing for any random variable; rather, we are trying to determine the state of the next random variable given the observed state of our current random variable. Coming back to our example, we can see that the nodes of the graph represent the different possible states of the random variable Weather, and the edges between them show the probability of the next random variable taking the different possible states, given the state of the current random variable. The self-loops show the probability of the model staying in its current state. In the previous Markov chain, let's say we know that the observed state of the current random variable is Sunny, then the probability that the random variable at the next time instance will also take the value Sunny is 0.8. It could also take the value Rainy with a probability of 0.19, or Snowy with a probability of 0.01. One thing to note here is that the sum of all the probability values on all the outward edges from any state should equal 1, since it's an exhaustive event.

Now, let's try to code this simple Markov chain. We will start by defining a simple MarkovChain class, and we will keep on adding methods to this class as we go through this chapter:

import numpy as np

class MarkovChain(object):
    def __init__(self, transition_prob):
        """
        Initialize the MarkovChain instance.

        Parameters
        ----------
        transition_prob: dict
            A dict object representing the transition probabilities in
            Markov Chain. Should be of the form: {'state1': {'state1':
            0.1, 'state2': 0.4}, 'state2': {...}}
        """
        self.transition_prob = transition_prob
        self.states = list(transition_prob.keys())

    def next_state(self, current_state):
        """
        Returns the state of the random variable at the next time
        instance.

        Parameters
        ----------
        current_state: str
            The current state of the system.
        """
        return np.random.choice(
            self.states,
            p=[self.transition_prob[current_state][next_state]
               for next_state in self.states])

    def generate_states(self, current_state, no=10):
        """
        Generates the next states of the system.

        Parameters
        ----------
        current_state: str
            The state of the current random variable.

        no: int
            The number of future states to generate.
        """
        future_states = []
        for i in range(no):
            next_state = self.next_state(current_state)
            future_states.append(next_state)
            current_state = next_state
        return future_states

Now, we can try out our example with this MarkovChain class:

>>> transition_prob = {'Sunny': {'Sunny': 0.8, 'Rainy': 0.19, 
'Snowy': 0.01},
'Rainy': {'Sunny': 0.2, 'Rainy': 0.7,
'Snowy': 0.1},
'Snowy': {'Sunny': 0.1, 'Rainy': 0.2,
'Snowy': 0.7}}

>>> weather_chain = MarkovChain(transition_prob=transition_prob)
>>> weather_chain.next_state(current_state='Sunny')
'Sunny'
>>> weather_chain.next_state(current_state='Snowy')
'Snowy'
>>> weather_chain.generate_states(current_state='Snowy', no=10)
['Snowy', 'Snowy', 'Snowy', 'Rainy', 'Snowy', 'Snowy', 'Rainy', 'Rainy', 'Snowy', 'Snowy']

In the previous code example, you might find that your outputs differ from what is shown here. This is because the Markov chain is probabilistic in nature: it picks the next state based on a probability distribution, which can give different outputs on different runs.

So far in the discussion, we have assumed that the transition probabilities of the variables don't change over different instances of time. This is known as a time-homogeneous Markov chain, but it is also possible to have a time-inhomogeneous Markov chain, which also has a lot of applications but is outside the scope of this book.

Parameterization of Markov chains

In the code for the Markov chain in the previous section, we used a dictionary to parameterize the Markov chain, containing the probability values of all the possible state transitions. Another way of representing state transitions is to use a transition matrix. The transition matrix, as the name suggests, uses a tabular representation of the transition probabilities.

The following table shows the transition matrix for the Markov chain shown in Figure 1.1. Each probability value represents the probability of the system going from the state in its row to the state in its column:

States   Sunny   Rainy   Snowy
Sunny    0.8     0.19    0.01
Rainy    0.2     0.7     0.1
Snowy    0.1     0.2     0.7

The transition matrix represents the same information as in the dictionary, but in a more compact way. For this reason, the transition matrix is the standard way of representing Markov chains. Let's modify our MarkovChain class so that it can accept a transition matrix:

import numpy as np

class MarkovChain(object):
    def __init__(self, transition_matrix, states):
        """
        Initialize the MarkovChain instance.

        Parameters
        ----------
        transition_matrix: 2-D array
            A 2-D array representing the probabilities of change of
            state in the Markov Chain.

        states: 1-D array
            An array representing the states of the Markov Chain. It
            needs to be in the same order as transition_matrix.
        """
        self.transition_matrix = np.atleast_2d(transition_matrix)
        self.states = states
        self.index_dict = {self.states[index]: index for index in
                           range(len(self.states))}
        self.state_dict = {index: self.states[index] for index in
                           range(len(self.states))}

    def next_state(self, current_state):
        """
        Returns the state of the random variable at the next time
        instance.

        Parameters
        ----------
        current_state: str
            The current state of the system.
        """
        return np.random.choice(
            self.states,
            p=self.transition_matrix[self.index_dict[current_state], :])

    def generate_states(self, current_state, no=10):
        """
        Generates the next states of the system.

        Parameters
        ----------
        current_state: str
            The state of the current random variable.

        no: int
            The number of future states to generate.
        """
        future_states = []
        for i in range(no):
            next_state = self.next_state(current_state)
            future_states.append(next_state)
            current_state = next_state
        return future_states

Running this code should also give similar results to what we got in the previous section. Using a transition matrix might not seem like a good idea because it requires us to create extra variables to store the indices. But, in cases when we have hundreds of states, using a transition matrix is much more efficient than using the simple dictionary implementation. In the case of a transition matrix, we can simply use NumPy indexing to get the probability values in the next_state method, whereas we were looping over all the state names in the previous implementation:

>>> transition_matrix = [[0.8, 0.19, 0.01],
[0.2, 0.7, 0.1],
[0.1, 0.2, 0.7]]
>>> weather_chain = MarkovChain(transition_matrix=transition_matrix,
states=['Sunny', 'Rainy', 'Snowy'])
>>> weather_chain.next_state(current_state='Sunny')
'Sunny'
>>> weather_chain.next_state(current_state='Snowy')
'Sunny'
>>> weather_chain.generate_states(current_state='Snowy', no=10)
['Snowy', 'Rainy', 'Rainy', 'Rainy', 'Rainy', 'Rainy',
'Rainy', 'Rainy', 'Sunny', 'Sunny']

Properties of Markov chains

In this section, we will talk about the different properties of Markov chains, namely reducibility, periodicity, transience and recurrence, ergodicity, and steady-state analysis and limiting distributions. We will also try some simple examples of our MarkovChain class to show these properties.

Reducibility

A Markov chain is said to be irreducible if we can reach any state of the given Markov chain from any other state. In terms of states, state j is said to be accessible from another state i if a system that started at state i has a non-zero probability of getting to state j. More formally, state j is said to be accessible from state i if an integer nij ≥ 0 exists such that the following condition is met:

P(Xnij = j | X0 = i) > 0

The nij here is basically the number of steps it takes to go from state i to state j, and it can be different for different pairs of values of i and j. Also, for a given state i, if nij = 1 for every state j, it means that all the states of the Markov chain are directly accessible from it in a single step. The accessibility relation is reflexive and transitive, but not necessarily symmetric. We can take a simple example to understand this property:

Figure 1.2: An example of an irreducible Markov chain

In the previous example, it can clearly be seen that all of the states are accessible from all other states, and hence the chain is irreducible.

Note that, in the examples in Figure 1.2 and Figure 1.3, we haven't drawn edges whose probability values are 0. This helps to keep the model less cluttered and easier to read.

In the following example, we can see that state D is not accessible from A, B, or C. Also, state C is not accessible from either A or B. But all the states are accessible from state D, and states A and B are accessible from C:

Figure 1.3: An example of a reducible Markov chain

We can also add a couple of methods to our MarkovChain class to check which states in our chain are reachable and whether our chain is irreducible:

from itertools import permutations

def is_accessible(self, i_state, f_state):
    """
    Check if state f_state is accessible from i_state.

    Parameters
    ----------
    i_state: str
        The state from which the accessibility needs to be checked.

    f_state: str
        The state to which accessibility needs to be checked.
    """
    source_index = self.index_dict[i_state]
    target_index = self.index_dict[f_state]
    reachable_state_indexes = [source_index]
    visited = set()
    while reachable_state_indexes:
        state_index = reachable_state_indexes.pop()
        if state_index == target_index:
            return True
        if state_index in visited:
            continue
        visited.add(state_index)
        # Add every state that can be reached from here in one step
        # with a non-zero probability.
        reachable_state_indexes.extend(
            np.nonzero(self.transition_matrix[state_index, :])[0])
    return False

def is_irreducible(self):
    """
    Check if the Markov Chain is irreducible.
    """
    for (i, j) in permutations(self.states, 2):
        if not self.is_accessible(i, j):
            return False
    return True

Let's try these methods out on the examples from Figure 1.2 and Figure 1.3:

>>> transition_irreducible = [[0.5, 0.5, 0, 0],
[0.25, 0, 0.5, 0.25],
[0.25, 0.5, 0, 0.25],
[0, 0, 0.5, 0.5]]
>>> transition_reducible = [[0.5, 0.5, 0, 0],
[0, 1, 0, 0],
[0.25, 0.5, 0, 0],
[0, 0, 0.25, 0.75]]
>>> markov_irreducible = MarkovChain(transition_matrix=transition_irreducible,
states=['A', 'B', 'C', 'D'])
>>> markov_reducible = MarkovChain(transition_matrix=transition_reducible,
states=['A', 'B', 'C', 'D'])
>>> markov_irreducible.is_accessible(i_state='A', f_state='D')
True
>>> markov_irreducible.is_accessible(i_state='B', f_state='D')
True
>>> markov_irreducible.is_irreducible()
True
>>> markov_reducible.is_accessible(i_state='A', f_state='D')
False
>>> markov_reducible.is_accessible(i_state='D', f_state='A')
True
>>> markov_reducible.is_accessible(i_state='C', f_state='D')
False
>>> markov_reducible.is_irreducible()
False

Periodicity

State i is said to have period k if the length of any possible path returning to state i is a multiple of k steps. Formally, it is defined like this:

k = gcd{n > 0 : P(Xn = i | X0 = i) > 0}

Here, gcd means the greatest common divisor (GCD). Basically, k is the GCD of the lengths of all possible paths from state i back to itself. If there is no possible path from state i back to itself, then its period is not defined. We also need to note that k has nothing to do with the minimum number of steps required to return to the starting state. For example, let's say that for a given state the possible numbers of steps required to return to it are (4, 6, 8, 12, 16). In this case k = 2, but the minimum number of steps required to return is 4, and 2 doesn't even appear in the list of possible numbers of steps.

For any given state in the Markov chain, if k=1, the state is said to be aperiodic. A Markov chain is called aperiodic if all of its states are aperiodic. One major thing to note is that, in the case of an irreducible Markov chain, a single aperiodic state is enough to imply that all the states are aperiodic. Let's take a simple example and check the periodicity of different states:

Figure 1.4: An example of a periodic Markov chain

In the previous example, we can easily see that for state A the possible paths to return are A -> B -> C -> A or A -> B -> C -> D -> E -> C -> A. These two paths have lengths of 3 and 6, respectively, and hence state A has a period of 3. Similarly, B, C, D, and E each have a period of 3, and hence the Markov chain itself is periodic:

Figure 1.5: Example of Markov Chain with aperiodic states.

In this example, we added a couple of extra edges, due to which the possible path lengths for A are now 3, 5, 7, ...; and for B are 2, 3, 4, 5, ... And, since the GCD of these path lengths is 1, states A and B are both now aperiodic. Similarly, we can compute the period of other nodes, each of which is also 1, and hence the Markov chain is also aperiodic.

Let's now add a couple of new methods to our MarkovChain class to compute the period of different states and check whether our model is aperiodic:

from functools import reduce
from math import gcd

def get_period(self, state):
    """
    Returns the period of the state in the Markov Chain.

    Parameters
    ----------
    state: str
        The state for which the period needs to be computed.
    """
    state_index = self.index_dict[state]
    n_step_probability = np.identity(len(self.states))
    return_path_lengths = []
    # Collect every path length, up to a bound of 3 * number of states
    # (which is enough to determine the gcd), for which the process can
    # return to the state with a non-zero probability.
    for n in range(1, 3 * len(self.states) + 1):
        n_step_probability = n_step_probability @ self.transition_matrix
        if n_step_probability[state_index, state_index] > 0:
            return_path_lengths.append(n)
    if not return_path_lengths:
        return None
    return reduce(gcd, return_path_lengths)

def is_aperiodic(self):
    """
    Checks if the Markov Chain is aperiodic.
    """
    periods = [self.get_period(state) for state in self.states]
    for period in periods:
        if period != 1:
            return False
    return True

Let's now try out our methods on our examples. In this example, we will randomly assign probability values to different transitions:

>>> transition_periodic = [[0, 1, 0, 0, 0],
[0, 0, 1, 0, 0],
[0.5, 0, 0, 0.5, 0],
[0, 0, 0, 0, 1],
[0, 0, 1, 0, 0]]
>>> transition_aperiodic = [[0, 1, 0, 0, 0],
[0, 0, 1, 0, 0],
[0.5, 0.25, 0, 0.25, 0],
[0, 0, 0, 0, 1],
[0, 0, 0.5, 0.5, 0]]
>>> markov_periodic = MarkovChain(transition_matrix=transition_periodic,
states=['A', 'B', 'C', 'D', 'E'])
>>> markov_aperiodic = MarkovChain(transition_matrix=transition_aperiodic,
states=['A', 'B', 'C', 'D', 'E'])

>>> markov_periodic.get_period('A')
3
>>> markov_periodic.get_period('C')
3
>>> markov_periodic.is_aperiodic()
False

>>> markov_aperiodic.get_period('A')
1
>>> markov_aperiodic.get_period('B')
1
>>> markov_aperiodic.is_aperiodic()
True

Transience and recurrence

Given that we start at state i, the state is called transient if there is a non-zero probability that we will never return to it. To define this in more formal terms, let's consider a random variable Ti as the first return time to state i:

Ti = min{n ≥ 1 : Xn = i}, given that X0 = i

Let's now define another term, fi(n), as the probability that the system returns to state i for the first time after n steps:

fi(n) = P(Ti = n)

Now we can say that any given state i is transient if the following condition is met:

∑(n=1 to ∞) fi(n) = P(Ti < ∞) < 1

In the preceding equation, we are basically checking whether the total probability of returning to state i in a finite number of steps is less than 1. If it is, then the probability that Ti is infinite is greater than 0, which means that state i is transient. A given state i is called recurrent if it is not transient:

Figure 1.6: A Markov chain with transient states A and B

In the preceding example, we can see that states A and B are transient: A doesn't have any incoming edges, and B's only incoming edge comes from another transient state, so once the system leaves state A or B, it won't be able to come back.

For a finite Markov chain, it is really simple to check whether a given state is transient or not: a state is transient if it can reach some state from which it cannot get back. We can check this using the is_accessible method we defined earlier. Let's write a simple method to do this for our MarkovChain class:

def is_transient(self, state):
    """
    Checks if a state is transient or not.

    Parameters
    ----------
    state: str
        The state for which the transient property needs to be checked.
    """
    # A state is transient if some state reachable from it cannot
    # reach it back.
    for other_state in self.states:
        if other_state == state:
            continue
        if (self.is_accessible(state, other_state) and
                not self.is_accessible(other_state, state)):
            return True
    return False

Now we can use this method in our example in Figure 1.6 to check which nodes are transient:

>>> transient_matrix = [[0, 0.5, 0.5, 0],
[0, 0, 0.25, 0.75],
[0, 0, 0, 1],
[0, 0, 0.5, 0.5]]
>>> transient_markov = MarkovChain(transition_matrix=transient_matrix,
states=['A', 'B', 'C', 'D'])
>>> transient_markov.is_transient('A')
True
>>> transient_markov.is_transient('B')
True
>>> transient_markov.is_transient('C')
False

In the following subsections, we will talk about the statistical properties of the random variable Ti.

Mean recurrence time

The first return time for the initial state i is also known as the hitting time. It was represented using the random variable Ti in the previous section. The mean recurrence time of state i is defined as its expected return time:

Mi = E[Ti] = ∑(n=1 to ∞) n · fi(n)

If the mean recurrence time, Mi, is finite, the state is called positive recurrent, otherwise it is called null recurrent.
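
Since we don't compute Mi analytically in this chapter, here is a rough Monte Carlo sketch (not from the book) that estimates the mean recurrence time of the Sunny state by repeatedly simulating the weather_chain instance defined earlier and recording how many steps it takes to come back; the number of runs and the step cap are arbitrary illustrative choices:

import numpy as np

def estimate_mean_recurrence_time(chain, state, n_runs=1000, max_steps=1000):
    return_times = []
    for _ in range(n_runs):
        current_state = state
        # Walk the chain until it returns to the starting state.
        for step in range(1, max_steps + 1):
            current_state = chain.next_state(current_state)
            if current_state == state:
                return_times.append(step)
                break
    return np.mean(return_times)

# Assumes the weather_chain object from the earlier example is available.
print(estimate_mean_recurrence_time(weather_chain, 'Sunny'))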

Expected number of visits

As is evident from the name, the expected number of visits for any state i is the number of times the system is expected to be in that state. A given state i is recurrent if and only if the expected number of visits to i is infinite:

∑(n=0 to ∞) pii(n) = ∞, where pii(n) is the probability of being in state i after n steps, having started in state i

Absorbing states

State i is said to be an absorbing state if it is impossible for the system to leave that state once it reaches it. For a state to be an absorbing state, the probability of staying in the same state should be 1, and all the other transition probabilities should be 0:

pii = 1 and pij = 0 for all j ≠ i

A Markov chain is called an absorbing Markov chain if it has at least one absorbing state and an absorbing state can be reached from every other state (not necessarily in a single step):

Figure 1.7: An example showing an absorbing state C, since the probability of transitioning from state C to C is 1

Again, we can add a very simple method to check for absorbing states in our MarkovChain class:

def is_absorbing(self, state):
    """
    Checks if the given state is absorbing.

    Parameters
    ----------
    state: str
        The state for which we need to check whether it's absorbing
        or not.
    """
    state_index = self.index_dict[state]
    return self.transition_matrix[state_index, state_index] == 1

We can again check whether our state in the example is absorbing by creating a Markov chain and using the is_absorbing method:

>>> absorbing_matrix = [[0, 1, 0],
[0.5, 0, 0.5],
[0, 0, 1]]
>>> absorbing_chain = MarkovChain(transition_matrix=absorbing_matrix,
states=['A', 'B', 'C'])
>>> absorbing_chain.is_absorbing('A')
False
>>> absorbing_chain.is_absorbing('C')
True

Ergodicity

State i is said to be ergodic if it is recurrent, has a period of 1, and has a finite mean recurrence time. If all the states of a Markov chain are ergodic, then it's an ergodic Markov chain. In general terms, a Markov chain is ergodic if there is a number N, such that any state in the system can be reached from any other state in any number of steps greater than or equal to the number N. Therefore, in the case of a fully connected transition matrix, where all transitions have a non-zero probability, this condition is fulfilled with N=1.
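
Putting together the methods we have already added, a minimal ergodicity check for our MarkovChain class could look like the following sketch (not from the book). It assumes the is_irreducible and is_aperiodic methods defined earlier, and relies on the fact that every state of a finite, irreducible chain is automatically positive recurrent, so checking irreducibility and aperiodicity is enough:

def is_ergodic(self):
    """
    Checks if the Markov Chain is ergodic, assuming a finite state
    space: every state of a finite, irreducible chain is positive
    recurrent, so irreducibility plus aperiodicity suffices.
    """
    return self.is_irreducible() and self.is_aperiodic()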

Steady-state analysis and limiting distributions

In a Markov chain, a vector π is called the stationary distribution if it satisfies the following conditions for all j ∈ S:

πj ≥ 0,  ∑(j ∈ S) πj = 1,  and  πj = ∑(i ∈ S) πi pij

The stationary distribution is one of the most important properties of Markov chains, and we will talk about it in much more detail in later sections of this chapter.
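
Although we will study it in detail later, as a quick illustration (a sketch, not from the book), the stationary distribution of a finite chain can be obtained as the left eigenvector of the transition matrix associated with the eigenvalue 1. Here we apply this to the weather transition matrix used earlier:

import numpy as np

transition_matrix = np.array([[0.8, 0.19, 0.01],
                              [0.2, 0.7, 0.1],
                              [0.1, 0.2, 0.7]])

# pi is a left eigenvector of the transition matrix (pi = pi P), so we work
# with the transpose to turn it into an ordinary eigenvector problem.
eigenvalues, eigenvectors = np.linalg.eig(transition_matrix.T)
# Take the eigenvector whose eigenvalue is (numerically) 1 and normalize it
# so that its entries sum to 1.
stationary = eigenvectors[:, np.isclose(eigenvalues, 1)][:, 0].real
stationary = stationary / stationary.sum()
print(stationary)   # Approximate long-run probabilities of Sunny, Rainy, Snowy.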

Continuous-time Markov chains

Continuous-time Markov chains are quite similar to discrete-time Markov chains, except that in the continuous case we explicitly model the transition time between states using a positive-valued random variable. Also, we consider the system at all possible values of time instead of just at the transition times.

Exponential distributions

A random variable X is said to have an exponential distribution with rate λ if its probability density function is defined as follows:

f(x) = λe^(-λx) for x ≥ 0, and f(x) = 0 for x < 0

Here, the rate λ needs to be greater than 0. We can also compute the expectation of X:

E[X] = 1/λ

We see that the expectation of X is inversely proportional to the rate. This means that an exponential distribution with a higher rate has a lower expectation. The exponential distribution is often used to model problems that involve the time until some event happens. A simple example could be modelling the time before an alarm clock goes off, or the time before a server comes to your table in a restaurant. And, as we would expect, the higher the rate, the sooner the event is expected to happen, which is why λ is called the rate of the distribution.

We can also compute the second moment of the exponential distribution:

E[X²] = 2/λ²

And, using the first and second moments, we can compute the variance of the distribution:

Var(X) = E[X²] - (E[X])² = 2/λ² - 1/λ² = 1/λ²

Figure 1.x: Probability density function of the exponential distribution

Now we will move on to some of the properties of the exponential distribution that are relevant to our example:

  • Memoryless: Figure 1.x shows a plot of an exponential distribution. In the plot, we can clearly see that the graph after any given point (a, in this case) is an exact copy of the original distribution; in other words, an exponential distribution conditioned on (X > a) is still exponential. If we think about this property in terms of our example, it means that if we have an alarm clock and, at some time t, we see that it still hasn't gone off, we can still determine the distribution over the time remaining after t, and it will be the same exponential distribution. This property of the exponential distribution is known as memorylessness, since at any given point in time, if you know the current state of the system (in this example, that the alarm hasn't gone off), you can determine the probability distribution over any time in the future. This property of exponential distributions is quite similar to the Markov property of Markov chains, which we saw in the previous sections.
  • Probability of minimum value: Let's say we have independent exponentially distributed random variables X0, . . ., Xn with rates λ0, ..., λn, respectively. For these variables, we can prove that the distribution of min(X0, . . ., Xn) is also exponential, with rate λ0 + λ1 + ... + λn.

We will use both of these properties of the exponential distribution in our example for the continuous time Markov chain in a later section.
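
As a quick numerical check of the second property (a sketch, not from the book; the three rate values are chosen arbitrarily), we can sample from independent exponential distributions and verify that the mean of their minimum is close to 1/(λ0 + λ1 + λ2):

import numpy as np

rates = [0.5, 1.0, 2.5]      # Arbitrary illustrative rates.
size = 100000
# NumPy's exponential sampler takes the scale parameter, which is 1/rate.
samples = np.array([np.random.exponential(scale=1 / rate, size=size)
                    for rate in rates])
minimum = samples.min(axis=0)

print(minimum.mean())        # Should be close to 1 / sum(rates) = 0.25.
print(1 / sum(rates))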

Poisson process

The Poisson process is a continuous-time process, and there can be multiple interpretations of it, which lead to different possible definitions. In this section, we will start with the formal definition and build up to a simpler, more intuitive one. A continuous-time stochastic process N(t): t ≥ 0 is a Poisson process with rate λ > 0 if the following conditions are met:

  • N(0) = 0
  • It has stationary and independent increments
  • The distribution of N(t) is Poisson with mean λt; that is, P(N(t) = n) = e^(-λt) (λt)^n / n! for n = 0, 1, 2, ...

First of all, we need to define what stationary and independent increments are. For a continuous-time stochastic process, X(t): t ≥ 0, an increment is defined as the difference in the state of the system between two time instances; that is, given two time instances s and t with s < t, the increment from time s to time t is X(t) - X(s). As the name suggests, a process is said to have stationary increments if the distribution of an increment depends only on the time difference.

In other words, a process has stationary increments if the distribution of X(t1) - X(s1) is equal to that of X(t2) - X(s2) whenever t1 > s1, t2 > s2 and t1 - s1 = t2 - s2. A process has independent increments if any two increments over disjoint time intervals are independent; that is, if t1 > s1 > t2 > s2, then the increments X(t2) - X(s2) and X(t1) - X(s1) are independent.

Now let's come back to defining the Poisson process. The Poisson process is essentially a counting process that counts the number of events that have occurred before time t. This count of the number of events before time t is given by N(t) and, similarly, the number of events occurring between times t and t + s is given by N(t + s) - N(t). The value N(t + s) - N(t) is Poisson-distributed with mean λs. We can see that the Poisson process has stationary increments over fixed time intervals, but as t → ∞, the value of N(t) will also approach infinity; that is, N(t) → ∞. Another thing worth noting is that, as the value of λ increases, the number of events happening will also increase, which is why λ is also known as the rate of the process.

This brings us to our second, simplified definition of the Poisson process. A continuous-time stochastic process N(t): t ≥ 0 is called a Poisson process with rate λ > 0 if the following conditions are met:

  • N(0) = 0
  • It is a counting process; that is, N(t) gives the count of the number of events that have occurred before time t
  • The times between events are independently and identically distributed, with an exponential distribution of rate λ (the simulation sketch below illustrates this equivalence)
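
The following is a minimal simulation sketch of this second definition (not from the book; the rate, time horizon, and number of runs are arbitrary illustrative values): we keep drawing exponential inter-arrival times and count how many events fall before time t. The resulting counts should be approximately Poisson-distributed with mean λt:

import numpy as np

rate, t, n_runs = 2.0, 5.0, 100000    # Arbitrary illustrative values.

counts = []
for _ in range(n_runs):
    arrival_time, count = 0.0, 0
    # Keep adding exponential inter-arrival times until we pass time t.
    while True:
        arrival_time += np.random.exponential(scale=1 / rate)
        if arrival_time > t:
            break
        count += 1
    counts.append(count)

print(np.mean(counts))   # Should be close to rate * t = 10.
print(np.var(counts))    # For a Poisson distribution, the variance is also ~ rate * t.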

Continuous-time Markov chain example

Now that we have a basic understanding of exponential distributions and the Poisson process, we can move on to an example that builds up a continuous-time Markov chain. In this example, we will try to show how the properties of exponential distributions can be used to build up generic continuous-time Markov chains. Let's consider a hotel reception where n receptionists are working in parallel. Also consider that the guests arrive according to a Poisson process with rate λ, and that the service time for each guest is represented by an exponential random variable with rate µ. Also, if all the receptionists are busy when a new guest arrives, he/she will depart without getting any service. Now, let's try to compute the expected number of busy receptionists that the next arriving guest will find.

Let's start by assuming that Tk represents the expected number of busy receptionists found by the next arriving guest, given that k receptionists are busy at the current time instance. Thanks to the memoryless property of the exponential distribution, it doesn't matter how long the k receptionists have already been busy; their remaining service times are still exponentially distributed with rate µ, so this definition does not depend on when we start observing the system.

Firstly, T0 is clearly 0, because if there are currently 0 busy receptionists, the next arrival will also find 0 busy receptionists for sure. Now consider T1: if there is currently 1 busy receptionist, the next arriving guest will find 1 busy receptionist if the time to the next arrival is less than the remaining service time of the busy receptionist. From the memoryless property, we know that the next arrival time is exponentially distributed with rate λ, and the remaining service time is also exponentially distributed with rate µ. Therefore, the probability that the next guest will find one receptionist busy is λ/(λ + µ), and hence the following is true:

T1 = λ/(λ + µ)
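
As a quick sanity check of this value (a sketch, not from the book; λ = 1.0 and µ = 1.5 are arbitrary illustrative rates), we can estimate T1 by simulating the race between the next arrival and the remaining service time:

import numpy as np

arrival_rate, service_rate, n_runs = 1.0, 1.5, 100000   # Illustrative values.

next_arrival = np.random.exponential(scale=1 / arrival_rate, size=n_runs)
remaining_service = np.random.exponential(scale=1 / service_rate, size=n_runs)

# The next guest finds 1 busy receptionist if they arrive before the current
# service finishes, and 0 busy receptionists otherwise.
busy_found = (next_arrival < remaining_service).astype(int)

print(busy_found.mean())                               # Estimate of T1.
print(arrival_rate / (arrival_rate + service_rate))    # lambda / (lambda + mu) = 0.4.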

In general, consider the situation in which k receptionists are busy. We can obtain an expression for Tk by conditioning on what happens first. When k receptionists are busy, we can think of k + 1 independent exponential distributions: k exponentials with rate µ for the remaining service time of each busy receptionist, and 1 exponential with rate λ for the next arriving guest. In our case, we want to condition on whether a service completion happens first or a new guest arrives first. The time until the first service completion is the minimum of the k service-time exponentials, and is therefore also exponentially distributed, with rate kµ. Hence, the probability of a service completion happening before the next guest arrives is kµ/(kµ + λ). Similarly, the probability that the next event is a guest arrival is λ/(kµ + λ).

Now, based on this, we can say that if the next event is a service completion, the expected number of busy receptionists found by the next arrival will be Tk-1. Otherwise, if a guest arrives first, that guest will find k busy receptionists. Therefore, we have the following:

Tk = (kµ/(kµ + λ)) Tk-1 + (λ/(kµ + λ)) · k

We just need to solve this recursion now. T2 will be given by this equation:

T2 = (2µ/(2µ + λ)) T1 + (λ/(2µ + λ)) · 2, where T1 = λ/(λ + µ)

If we continue the same pattern, we will get T3 as follows:

T3 = (3µ/(3µ + λ)) T2 + (λ/(3µ + λ)) · 3

We can see a pattern in the values of T1, T2, and T3 and, unrolling the recursion with T0 = 0, we can write a general term as follows:

Tk = ∑(i=1 to k) [ iλ/(iµ + λ) ] · ∏(j=i+1 to k) [ jµ/(jµ + λ) ]

Let's point out our observations on the previous example:

  • At any given time instance, if there are i busy receptionists with i < n, there are i + 1 independent exponential distributions, i of them with rate µ and 1 of them with rate λ. The time until the process makes a jump is exponential, and its rate is given by iµ + λ. If all the receptionists are busy, then only the n exponential distributions corresponding to the service times can trigger a jump, and the time until the process makes a jump is exponential with rate nµ.
  • When the process jumps from state i, for i < n, it jumps to state i + 1 with probability λ/(iµ + λ), and jumps to state i - 1 with probability iµ/(iµ + λ).
  • When the process makes a jump from state i, we can start up a whole new set of exponential distributions corresponding to the state we jumped to. This is because, even though some of the old exponential distributions haven't triggered, by memorylessness this is equivalent to resetting or replacing those distributions.
  • Every time we jump to state i, regardless of what the time is, the distribution of how long we stay in state i and the probabilities of where we jump to next when we leave state i are the same. In other words, the process is time-homogeneous.

The preceding description of a continuous-time stochastic process corresponds to a continuous-time Markov chain. In the next section, we will try to define it in a more formal way.

Continuous-time Markov chain

In the previous section, we showed an example of a continuous-time Markov chain to give an indication of how it works. Let's now define it formally. In a continuous-time Markov chain with a discrete state space S, for each state i ∈ S we have an associated set of ni independent exponential distributions with rates qi,j1, ..., qi,jni, where j1, ..., jni are the states the process may jump to when it leaves state i. When the process enters state i, the amount of time it spends in state i is exponentially distributed with rate vi = qi,j1 + ... + qi,jni, and when it leaves state i it goes to state jl with probability qi,jl/vi, for l = 1, ..., ni.
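
Based on this definition, here is a minimal simulation sketch of a continuous-time Markov chain (not from the book; the two-state chain and its jump rates are arbitrary illustrative choices): from state i we sample the holding time from an exponential distribution with rate vi and then choose the next state with probability qi,j/vi:

import numpy as np

# Illustrative jump rates q[i][j] for a two-state chain.
rates = {'On': {'Off': 0.5}, 'Off': {'On': 2.0}}

def simulate_ctmc(rates, state, t_max):
    t, path = 0.0, [(0.0, state)]
    while True:
        targets = list(rates[state].keys())
        q = np.array([rates[state][j] for j in targets])
        v = q.sum()                                  # Total jump rate v_i out of the state.
        t += np.random.exponential(scale=1 / v)      # Holding time ~ Exp(v_i).
        if t > t_max:
            return path
        state = np.random.choice(targets, p=q / v)   # Next state with probability q_ij / v_i.
        path.append((t, state))

print(simulate_ctmc(rates, 'On', t_max=10.0))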

We can also extend the Markov property from the discrete-time case to continuous time.

For a continuous-time stochastic process (X(t) : t ≥ 0) with state space S, we say it has the Markov property if the following condition is met:

P(X(t) = j | X(s) = i, X(tn-1) = in-1, ..., X(t1) = i1) = P(X(t) = j | X(s) = i)

Here, 0 ≤ t1 ≤ t2 ≤ . . . ≤ tn-1 ≤ s ≤ t is any non-decreasing sequence of n + 1 times, and i1, i2, . . ., in-1, i, j ∈ S are any n + 1 states in the state space, for any integer n ≥ 1.

Similarly, we can extend time-homogeneity to the case of continuous-time Markov chains. We say that a continuous-time Markov chain is time-homogeneous if, for any s ≤ t and any states i, j ∈ S, the following condition is met:

P(X(t) = j | X(s) = i) = P(X(t - s) = j | X(0) = i)

As in the case of discrete-time Markov chains, a continuous-time Markov chain does not need to be time-homogeneous, but non-homogeneous Markov chains are out of scope for this book. For more details on non-homogeneous Markov chains, you can refer to Cheng-Chi Huang's thesis on the topic: https://lib.dr.iastate.edu/cgi/viewcontent.cgi?article=8613&context=rtd.

Now let's define the transition probability for a continuous-time Markov chain. Just as the rates qij in a continuous-time Markov chain are the counterpart of the transition probabilities pij in a discrete-time Markov chain, there is also a counterpart to the n-step transition probabilities, namely pij(t), for a time-homogeneous, continuous-time Markov chain, which is defined as follows:

pij(t) = P(X(t) = j | X(0) = i)

Summary

In this chapter, we gave a detailed introduction to Markov chains. We talked about different types of Markov chains, mainly chains with a discrete state space, with either discrete time or continuous time. We also introduced the concepts of time-homogeneous and non-time-homogeneous Markov chains. We discussed the different properties of Markov chains in detail, and provided relevant examples and code.

Markov chains and their properties are the basic concepts on which HMMs are built. In the next chapter, we will discuss HMMs in much more detail.


Key benefits

  • Build a variety of Hidden Markov Models (HMM)
  • Create and apply models to any sequence of data to analyze, predict, and extract valuable insights
  • Use natural language processing (NLP) techniques and 2D-HMM model for image segmentation

Description

Hidden Markov Model (HMM) is a statistical model based on the Markov chain concept. Hands-On Markov Models with Python helps you get to grips with HMMs and different inference algorithms by working on real-world problems. The hands-on examples explored in the book help you simplify the process flow in machine learning by using Markov model concepts, thereby making it accessible to everyone. Once you’ve covered the basic concepts of Markov chains, you’ll get insights into Markov processes, models, and types with the help of practical examples. After grasping these fundamentals, you’ll move on to learning about the different algorithms used in inferences and applying them in state and parameter inference. In addition to this, you’ll explore the Bayesian approach of inference and learn how to apply it in HMMs. In further chapters, you’ll discover how to use HMMs in time series analysis and natural language processing (NLP) using Python. You’ll also learn to apply HMM to image processing using 2D-HMM to segment images. Finally, you’ll understand how to apply HMM for reinforcement learning (RL) with the help of Q-Learning, and use this technique for single-stock and multi-stock algorithmic trading. By the end of this book, you will have grasped how to build your own Markov and hidden Markov models on complex datasets in order to apply them to projects.

Who is this book for?

Hands-On Markov Models with Python is for you if you are a data analyst, data scientist, or machine learning developer and want to enhance your machine learning knowledge and skills. This book will also help you build your own hidden Markov models by applying them to any sequence of data. Basic knowledge of machine learning and the Python programming language is expected to get the most out of the book.

What you will learn

  • Explore a balance of both theoretical and practical aspects of HMM
  • Implement HMMs using different datasets in Python using different packages
  • Understand multiple inference algorithms and how to select the right algorithm to resolve your problems
  • Develop a Bayesian approach to inference in HMMs
  • Implement HMMs in finance, natural language processing (NLP), and image processing
  • Determine the most likely sequence of hidden states in an HMM using the Viterbi algorithm

Product Details

Publication date : Sep 27, 2018
Length: 178 pages
Edition : 1st
Language : English
ISBN-13 : 9781788629331




Table of Contents

10 Chapters
Introduction to the Markov Process
Hidden Markov Models
State Inference - Predicting the States
Parameter Learning Using Maximum Likelihood
Parameter Inference Using the Bayesian Approach
Time Series Predicting
Natural Language Processing
2D HMM for Image Processing
Markov Decision Process
Other Books You May Enjoy

Customer reviews

Rating distribution
2.3 out of 5 stars
(4 Ratings)
5 star 0%
4 star 25%
3 star 0%
2 star 50%
1 star 25%
RFEMYGDIO Jan 21, 2019
4 out of 5 stars
Excellent
Amazon Verified review
DWH Dec 02, 2018
2 out of 5 stars
Although the algorithms in this book are generally correct, it is riddled with crippling errors. There are undefined variables and out-of-range errors in almost every example. These are still present in the code that you download directly from the publisher. Buyer beware: you'll spend more time troubleshooting than learning.
Amazon Verified review
Victoria Sherratt Jul 10, 2021
2 out of 5 stars
It's OK, not great. It's printed by Amazon and some of the graphics are a bit low quality. What annoyed me most about this book is that, in the chapter I was most interested in, the authors didn't bother with the code - "it would be too long for this book" - but in later chapters presented pages of back-to-back code. The book is not long at all, so why skirt over it?
Amazon Verified review
Damian Jan Cordon Jun 26, 2019
1 out of 5 stars
The formulas and graphics in this book are tiny on the Kindle and cannot be enlarged, which makes it impossible to properly follow the reasoning being applied, since you cannot access the mathematical justification of what is explained.
Amazon Verified review

FAQs

How do I buy and download an eBook?

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the ebook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website?

If you want to purchase a video course, eBook, or Bundle (Print+eBook), please follow the steps below:

  1. Register on our website using your email address and the password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Card, or PayPal)
Where can I access support around an eBook?
  • If you experience a problem with using or installing Adobe Reader, then contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats does Packt support?

Our eBooks are currently available in a variety of formats such as PDF and ePub. In the future, this may well change with trends and developments in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks?
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower in price than print
  • They save resources and space
What is an eBook?

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply log in to your account and click on the link in Your Download Area. We recommend saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.