When building mathematical models to describe real-world processes, we often come across two parameter estimators, the maximum a posteriori estmator (MAP) and the maximum likelihood estimator (MLE). Their formulae look similar, which seems to imply that one is just a slight variation of the other. However, the intuitions behind them are actually very different. MAP is best explained through the Bayesian statistic framework, while MLE the frequentist statistic framework. In this post, I am going to walk through the motivation behind these two estimators.
Bayesian statistics and frequentist statistics differs in their interpretations of probability. In Bayesian statistics, we can think of probability has one’s belief about a certain event, and we would adjust our belief if new information becomes available.
For example, let’s say you have been living in cave and have no contact with the outside world for 10 days. If you were to guess the chances of rain tomorrow, you could only base your guess on the region you live in. Say, if you live in a costal region which has a fair chance of rain, then your guess may be at 60%. If you live in a land-locked region which is relatively arid, then your guess may be at 40%. Your knowledge about the region you live in is your a priori information for your estimate.
Now, if someone from the outside world told you that he had seen thick dark clouds closing in the area. With this new information, you may up your estimate for chances of rain to 90%. You have used the a posteriori information to adjust your estimate.
We can now apply the same idea to parameter estimation. Suppose we have a process that generate \(y\) from \(\mathbf{x}\), where \(y\) is a scalar and \(\mathbf{x} \in R^n\), we can describe this process using a model \(f\):
\[y = f(\mathbf{x})\]
And if this model is parameterized in \(\boldsymbol{\theta}\), we then have the followings:
\[y = f(\mathbf{x}; \boldsymbol{\theta})\]
Under the Bayesian statistic framework, we think of \(\boldsymbol{\theta}\) as a random variable, and we have some a priori belief about it. We express our a priori belief as a probability distribution. For example, we may believe that each parameter is independently drawn from a Gaussian distribution with mean 0 and variance 1. Then we have:
\[p(\boldsymbol{\theta}) = \prod_i p(\theta_i)\]
because each parameter \(\theta_i\) is independent from others, and:
\[p(\theta_i) \sim N(0,1)\]
where \(N(\mu,\sigma)\) is a Gaussian distribution with mean \(\mu\) and variance \(\sigma\).
If we collect some pairs of \(\mathbf{x}\) and \(y\) from the process, we can then adjust our belief about \(\boldsymbol{\theta}\) using these data points, and express our new belief about \(\boldsymbol{\theta}\) as a condtional probability:
\[p(\boldsymbol{\theta}|\mathbf{x},y)\]
which is interpreted as the probability of \(\boldsymbol{\theta}\) given \(\mathbf{x}\) and \(y\). The maximum a posteriori estimator is simply the value of \(\boldsymbol{\theta}\) at which \(p(\boldsymbol{\theta}|\mathbf{x},y)\) is at maximum.
We will now reconsider the data-generation process \(y = f(\mathbf{x}; \boldsymbol{\theta})\) from a different point of view. Suppose there is some randomness in the process, such that the same value of \(\mathbf{x}\) does not always produce the same value of \(y\). This may due to measurement noises or some inherent randomness of the process. In this case, the process is better described by a probabilistic model:
\[p(y|\mathbf{x};\boldsymbol{\theta})\]
which is the probability of \(y\) given \(\mathbf{x}\), and \(\boldsymbol{\theta}\) is the parameter of this probability distribution.
An interesting point to consider here is that \(\boldsymbol{\theta}\) in this perspective is not a random variable. It is fixed and we want to estimate its true value by considering the set of models:
\[p(y|\mathbf{x};\hat{\boldsymbol{\theta}})\]
where \(\hat{\boldsymbol{\theta}}\) is an estimate of \(\boldsymbol{\theta}\). If we collect some pairs of \(\mathbf{x}\) and \(y\) from the process, we can then choose \(\hat{\boldsymbol{\theta}}\) such that \(p(y|\mathbf{x};\hat{\boldsymbol{\theta}})\) is at maximum. This chosen \(\hat{\boldsymbol{\theta}}\) is the maximum likelihood estimator of \(\boldsymbol{\theta}\).
It should be noted that \(\hat{\boldsymbol{\theta}}\) is a random variable because it is calculated from the random variables \(y\) and \(\mathbf{x}\).
We have gone through the motivations behind the maximum a posteriori estimator and the maximum likelihood estimator. To re-cap, their definitions are given as:
\[\boldsymbol{\theta}_{\operatorname{MAP}} =\; \underset{\theta}{\operatorname{arg} \operatorname{max}}\;p(\boldsymbol{\theta}|\mathbf{x},y)\]
\[\boldsymbol{\theta}_{\operatorname{MLE}} =\; \underset{\theta}{\operatorname{arg} \operatorname{max}}\;p(y|\mathbf{x};\boldsymbol{\theta})\]
Though visually similar, they are derived from two entirely different frameworks.
Comments
Michael Checker April 22nd, 2020
Thanks for reading. Do you want to leave a comment?
It will take a couple of minutes to show, please come back or refresh then.