Prologue

When I was a freshman taking the course Probability and Statistics, I was confused by the concept of a moment, which in Chinese is “矩”. Only after I gained some intuition about it did I appreciate the subtlety of the Chinese term, which is quite close to the origin of the English word moment. Languages are magical, but that's not the point I want to focus on here. Let's dive into the subject.

Moment

\[ \begin{array}{lll} \text{Moment} & \text{Uncentered} & \text{Centered} \\ \hline \text{1st} & E(X)=\mu & \\ \text{2nd} & E\left(X^{2}\right) & E\left((X-\mu)^{2}\right) \\ \text{3rd} & E\left(X^{3}\right) & E\left((X-\mu)^{3}\right) \\ \text{4th} & E\left(X^{4}\right) & E\left((X-\mu)^{4}\right) \end{array} \]

\[ \begin{array}{ll}\operatorname{Mean}(X) & =E(X) \\ \operatorname{Var}(X) & =E\left[(X-\mu)^{2}\right]=\sigma^{2} \\ \text {Skewness}(X) & =E\left[(X-\mu)^{3}\right] / \sigma^{3} \\ \text {Kurtosis}(X) & =E\left[(X-\mu)^{4}\right] / \sigma^{4}\end{array} \]
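As a quick numerical illustration of these four quantities, here is a minimal Monte Carlo sketch with numpy, using the Exp(1) distribution as an assumed example (its mean, variance, skewness, and kurtosis are 1, 1, 2, and 9, respectively).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # Exp(1) samples, used only as an illustration

mu = x.mean()
sigma = x.std()
print(mu)                                # ~1  (mean)
print(x.var())                           # ~1  (variance)
print(np.mean((x - mu)**3) / sigma**3)   # ~2  (skewness)
print(np.mean((x - mu)**4) / sigma**4)   # ~9  (kurtosis, not excess kurtosis)
```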

Moment Generating Functions (MGF)

The moment generating function of \(X\) is defined by \[ \begin{aligned} \psi(t) &=E\left[e^{tX}\right] \\ &=\int e^{t X} \mathrm{d} F(x) . \end{aligned} \tag{1} \] All the moments of \(X\) can be obtained successively by differentiating \(\psi\) and then evaluating at \(t=0\). That is, \[ \begin{aligned} &\psi^{\prime}(t)=E\left[X e^{t X}\right],\\ &\psi^{\prime \prime}(t)=E\left[X^{2} e^{t X}\right],\\ &\qquad\quad \cdots\\ &\psi^{(n)}(t)=E\left[X^{n} e^{t X}\right]. \end{aligned} \tag{2} \] Evaluating at \(t=0\) yields \[ \psi^{(n)}(0)=E\left[X^{n}\right], \ n \geqslant 1\ . \tag{3} \] It should be noted that we have assumed that it is justifiable to interchange the differentiation and integration operations. This is usually the case.
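To make formulas \((2)\) and \((3)\) concrete, here is a minimal sympy sketch that differentiates an MGF symbolically and evaluates the derivatives at \(t=0\); the standard normal MGF \(\psi(t)=e^{t^{2}/2}\) is used purely as an illustration, not something derived in this post.

```python
import sympy as sp

t = sp.symbols('t')

# MGF of a standard normal N(0, 1), used here only as an illustration: psi(t) = exp(t**2 / 2)
psi = sp.exp(t**2 / 2)

# Formula (3): the n-th moment is the n-th derivative of psi evaluated at t = 0
for n in range(1, 5):
    moment = sp.diff(psi, t, n).subs(t, 0)
    print(n, moment)   # prints 0, 1, 0, 3 -- the first four moments of N(0, 1)
```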

 When a moment generating function exists, it uniquely determines the distribution. This is quite important because it enables us to characterize the probability distribution of a random variable by its generating function.

 Here, we need to distinguish the MGF from the probability generating function (PGF) of a non-negative integer-valued random variable \(X\), which is defined by \[ G(t)=E\left(t^{X}\right)=\sum_{x=0}^{\infty} P(x) t^{x}\ . \tag{4} \]  By the way, we may well be curious why the MGF is defined with an exponential in \(e\). That's what will be talked about in Example 2.
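As a quick way to tell the two apart, here is a minimal sympy sketch using the Poisson(\(\lambda\)) distribution as an assumed example: its PGF is \(G(t)=e^{\lambda(t-1)}\), the probabilities \(P(X=k)\) can be read off from the derivatives of \(G\) at \(t=0\), and the MGF is recovered by substituting \(t \to e^{t}\).

```python
import sympy as sp

t, lam = sp.symbols('t lambda', positive=True)

# PGF of a Poisson(lambda) random variable, used here only as an example: G(t) = exp(lambda*(t - 1))
G = sp.exp(lam * (t - 1))

# Since G(t) = sum_k P(k) * t**k, we have P(X = k) = G^(k)(0) / k!
for k in range(4):
    p_k = sp.simplify(sp.diff(G, t, k).subs(t, 0) / sp.factorial(k))
    print(k, p_k)                     # lambda**k * exp(-lambda) / k!

# The MGF of the same variable is psi(t) = E[e^{tX}] = E[(e^t)^X] = G(e^t)
psi = G.subs(t, sp.exp(t))
print(sp.simplify(psi))               # exp(lambda*(exp(t) - 1))
```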

Now, let's dive into formula \((3)\). We use the Taylor series to prove it. Start from the Maclaurin expansion of \(e^X\): \[ e^{X}=1+X+\frac{X^{2}}{2 !}+\frac{X^{3}}{3 !}+\cdots+\frac{X^{n}}{n !}+\cdots \] then, \[ e^{t X}=1+t X+\frac{(t X)^{2}}{2 !}+\frac{(t X)^{3}}{3 !}+\cdots+\frac{(t X)^{n}}{n !}+\cdots\ . \] Take the expected value: \[ \begin{aligned} E\left(e^{t X}\right) &=E\left[1+t X+\frac{(t X)^{2}}{2 !}+\frac{(t X)^{3}}{3 !}+\cdots+\frac{(t X)^{n}}{n !}+\cdots\right] \\ &=E(1)+t E(X)+\frac{t^{2}}{2 !} E\left(X^{2}\right)+\frac{t^{3}}{3 !} E\left(X^{3}\right)+\cdots+\frac{t^{n}}{n !} E\left(X^{n}\right)+\cdots \ . \end{aligned} \] Now, take a derivative with respect to \(t\) and then set \(t=0\): \[ \begin{aligned} \frac{\mathrm{d}}{\mathrm{d} t} E\left(e^{t X}\right)&=\frac{\mathrm{d}}{\mathrm{d} t}\left[E(1)+t E(X)+\frac{t^{2}}{2 !} E\left(X^{2}\right)+\frac{t^{3}}{3 !} E\left(X^{3}\right)+\cdots+\frac{t^{n}}{n !} E\left(X^{n}\right)+\cdots\right] \\ &=E(X)+t E\left(X^{2}\right)+\frac{t^{2}}{2 !} E\left(X^{3}\right)+\cdots \\ &=E(X) \quad (\text{let } t=0)\ . \end{aligned} \] Differentiate \(n\) times instead of once before evaluating at \(t=0\), and we get formula \((3)\). Besides, we can see the role of the variable \(t\), which helps us take the derivatives and makes the terms that we are not interested in vanish.
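The same bookkeeping can be checked mechanically: expand an MGF as a power series in \(t\), and the coefficient of \(t^{n}\) is \(E(X^{n})/n!\). Below is a minimal sympy sketch using the Bernoulli(\(p\)) MGF \(\psi(t)=(1-p)+pe^{t}\) as an assumed example (for a 0-1 variable, \(X^{n}=X\), so every moment equals \(p\)).

```python
import sympy as sp

t, p = sp.symbols('t p', positive=True)

# MGF of a Bernoulli(p) variable, used here only as an example: psi(t) = (1 - p) + p*exp(t)
psi = (1 - p) + p * sp.exp(t)

# Maclaurin expansion of psi in t; the coefficient of t**n should be E(X**n) / n!
series = sp.series(psi, t, 0, 5).removeO()
for n in range(1, 5):
    moment = sp.simplify(series.coeff(t, n) * sp.factorial(n))
    print(n, moment)   # every moment is p, since X**n = X for a 0-1 variable
```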

Why do we need MGF?

For convenience. It is much easier to get the moments using the MGF than by working from the definition of the expected value. With the MGF, it is possible to find moments by taking derivatives rather than doing integrals! The following are two examples.

Example 1

The expected value of the exponential distribution. \[ f_{X}(x)=\left\{\begin{array}{cl} \lambda \cdot e^{-\lambda x} & x>0 \\ 0 & \text { else } \end{array}\right. \]

\[ \begin{aligned} \psi(t)=E\left[e^{t X}\right]&=\int_{0}^{\infty} e^{t x} \cdot \lambda e^{-\lambda x} \mathrm{d} x \\ & =\lambda \int_{0}^{\infty} e^{(t-\lambda) x} \mathrm{d} x \quad \text{(Note)}\\ &=\lambda\left[\frac{1}{t-\lambda} e^{(t-\lambda) x}\right]_{0}^{\infty} \\ &=\lambda\left(0-\frac{1}{t-\lambda}\right)\\ &=\frac{\lambda}{\lambda-t}\,. \end{aligned} \]

Note: For the MGF to exist, the expected value \(E(e^{tX})\) must exist. So \(t-\lambda<0\), i.e. \(t<\lambda\), is an important condition to meet; otherwise, the integral won't converge.

 Once we have the MGF \(\lambda/(\lambda-t)\), calculating moments becomes just a matter of taking derivatives, which is easier than doing the integrals needed to calculate the moments directly. \[ E\left(X^{3}\right)=\frac{\mathrm{d}^{3}}{\mathrm{d} t^{3}}\left(\frac{\lambda}{\lambda-t}\right)\bigg|_{t=0} \quad \text{is easier than} \quad E\left(X^{3}\right)=\int_{0}^{\infty} x^{3} \lambda e^{-\lambda x} \mathrm{d} x \ . \]
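To see the convenience in action, here is a minimal sympy sketch (just one possible way to check it): it computes \(E(X^{3})\) both by differentiating the MGF and by doing the integral directly, and also verifies the sanity check \(\psi(0)=1\) mentioned in the Notes below.

```python
import sympy as sp

t = sp.symbols('t', real=True)
x, lam = sp.symbols('x lambda', positive=True)

# MGF of the exponential distribution derived above (valid for t < lambda)
psi = lam / (lam - t)

print(psi.subs(t, 0))   # 1 -- any valid MGF satisfies psi(0) = 1

# Third moment by differentiating the MGF three times and evaluating at t = 0 ...
m3_mgf = sp.diff(psi, t, 3).subs(t, 0)

# ... and by computing the integral directly
m3_int = sp.integrate(x**3 * lam * sp.exp(-lam * x), (x, 0, sp.oo))

print(sp.simplify(m3_mgf), sp.simplify(m3_int))   # both give 6/lambda**3
```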

Example 2

Let \(X\) and \(Y\) be independent normal random variables with respective means \(\mu_{1}\) and \(\mu_{2}\) and respective variances \(\sigma_{1}^{2}\) and \(\sigma_{2}^{2}\). Recalling that the MGF of a normal \(N(\mu, \sigma^{2})\) random variable is \(\exp\left\{\mu t+\sigma^{2} t^{2} / 2\right\}\), the moment generating function of their sum is given by \[ \begin{aligned} \psi_{X+Y}(t) &=E\left[e^{t(X+Y)}\right] \\ &=E\left[e^{t X}\right] E\left[e^{t Y}\right] \quad \text{(by independence)}\\ &=\psi_{X}(t) \psi_{Y}(t)\\ &=\exp \left\{\mu_{1} t+\sigma_{1}^{2} t^{2} / 2\right\} \exp \left\{\mu_{2} t+\sigma_{2}^{2} t^{2} / 2\right\}\\ &=\exp \left\{\left(\mu_{1}+\mu_{2}\right) t+\left(\sigma_{1}^{2}+\sigma_{2}^{2}\right) t^{2} / 2\right\} \,. \end{aligned} \] Thus the moment generating function of \(X+Y\) is that of a normal random variable with mean \(\mu_1+\mu_2\) and variance \(\sigma_{1}^{2}+\sigma_{2}^{2}\). By uniqueness, this is the distribution of \(X+Y\).

 This example shows how the MGF turns a non-trivial calculation (convolving two densities) into a trivial one (multiplying two functions).
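As a quick sanity check of that algebra, here is a minimal sympy sketch (the symbols and the helper function below are not from the text, only an illustration) verifying that the product of two normal MGFs is again a normal MGF with the parameters added.

```python
import sympy as sp

t, mu1, mu2 = sp.symbols('t mu1 mu2', real=True)
s1, s2 = sp.symbols('sigma1 sigma2', positive=True)

def normal_mgf(mu, var):
    """MGF of a normal N(mu, var): exp(mu*t + var*t**2/2)."""
    return sp.exp(mu * t + var * t**2 / 2)

product = normal_mgf(mu1, s1**2) * normal_mgf(mu2, s2**2)
target = normal_mgf(mu1 + mu2, s1**2 + s2**2)

# 0 -- the product is exactly the MGF of N(mu1 + mu2, sigma1**2 + sigma2**2)
print(sp.simplify(sp.expand(product - target)))
```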

 In my opinion, the MGF derives from the Laplace transform, and thus inherits the exponential form in \(e\). The power of the exponential lies in converting between multiplication and addition.

Notes

  1. For any valid MGF, \(M(0) = 1\). Whenever you compute an MGF, plug in \(t = 0\) and see if you get \(1\).
  2. Moments provide a way to specify a distribution. For example, you can completely specify the normal distribution by the first two moments which are a mean and variance. As you know multiple different moments of the distribution, you will know more about that distribution. If there is a person that you haven’t met, and you know about their height, weight, skin color, favorite hobby, etc., you still don’t necessarily fully know them but are getting more and more information about them.
  3. The beauty of MGF is, once you have MGF (once the expected value exists), you can get any \(n^{th}\) moment. MGF encodes all the moments of a random variable into a single function from which they can be extracted again later.
  4. A probability distribution is uniquely determined by its MGF. If two random variables have the same MGF, then they must have the same distribution. (Proof)
  5. For the people (like me) who are curious about the terminology “moments”: Why is a moment called moment?
  6. One of the important features of a distribution is how heavy its tails are, especially for risk management in finance. If you recall the 2009 financial crisis, that was essentially the failure to address the possibility of rare events happening. Risk managers understated the kurtosis (kurtosis means ‘bulge’ in Greek) of many financial securities underlying the fund’s trading positions. Sometimes seemingly random distributions with hypothetically smooth curves of risk can have hidden bulges in them. And we can detect those using MGF!

——Aerin Kim

Characteristic Functions (CF)

As the moment generating function of a random variable \(X\) need not exist, it is theoretically convenient to define the characteristic function of \(X\) by \[ \phi(t)=E\left[e^{itX}\right], \ -\infty<t<\infty \ , \tag{5} \] where \(i=\sqrt{-1}\). It can be shown that \(\phi\) always exists and, like the moment generating function, uniquely determines the distribution of \(X\).

 Similar to the MGF above, we can also write out the Maclaurin expansion of formula \((5)\): \[ \begin{aligned} E\left(e^{it X}\right) &=E\left[1+it X+\frac{(it X)^{2}}{2 !}+\frac{(it X)^{3}}{3 !}+\cdots+\frac{(it X)^{n}}{n !}+\cdots\right] \\ &=E(1)+it E(X)+\frac{(it)^{2}}{2 !} E\left(X^{2}\right)+\frac{(it)^{3}}{3 !} E\left(X^{3}\right)+\cdots+\frac{(it)^{n}}{n !} E\left(X^{n}\right)+\cdots \ . \end{aligned} \] We call it the characteristic function because it contains all the moments of the distribution.

A generating function is a clothesline on which we hang up a sequence of numbers for display.

——Herbert Wilf

The same holds for the characteristic function. Compared with the MGF, the CF extends the argument from \(\mathbb{R}\) to \(\mathbb{C}\); however, it is restricted to the imaginary axis rather than the whole complex plane, and that's enough for most cases.
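To see the moments hanging on that clothesline, here is a minimal sympy sketch using the standard normal CF \(\phi(t)=e^{-t^{2}/2}\) as an assumed example: by the expansion above, \(E(X^{n})=\phi^{(n)}(0)/i^{n}\).

```python
import sympy as sp

t = sp.symbols('t', real=True)

# CF of a standard normal N(0, 1), used here only as an illustration: phi(t) = exp(-t**2 / 2)
phi = sp.exp(-t**2 / 2)

# From the expansion above, E(X**n) = phi^(n)(0) / i**n
for n in range(1, 5):
    moment = sp.simplify(sp.diff(phi, t, n).subs(t, 0) / sp.I**n)
    print(n, moment)   # prints 0, 1, 0, 3 -- the same moments as before
```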

Relationship with Fourier Transform

We know that the Fourier transform is defined by \[ F(t)=\int_{-\infty}^{\infty} f(x) e^{-i t x} \mathrm{d} x\ . \tag{6} \] Comparing with the CF \[ \begin{aligned} \phi_{X}(t) &=E\left[e^{i t X}\right] \\ &=\int_{-\infty}^{+\infty} e^{i t x} f(x) \mathrm{d} x \ , \end{aligned}\tag{7} \] it's clear that the two functions are complex conjugates of each other: \(\phi_X(t)=\overline{F(t)}\).

 Similarly, we can recover \(f(x)\) from \(\phi_X(t)\) through Fourier inversion: \[ f(x)=\frac{1}{2 \pi} \int_{-\infty}^{\infty} e^{i t x} F(t) \mathrm{d} t=\frac{1}{2 \pi} \int_{-\infty}^{\infty} e^{i t x} \overline{\phi_{X}(t)} \mathrm{d} t \ . \tag{8} \]
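Here is a minimal numerical sketch of formula \((8)\) with numpy, using the standard normal CF \(\phi(t)=e^{-t^{2}/2}\) as an assumed example and a plain Riemann sum as the integral; it recovers the familiar density \(e^{-x^{2}/2}/\sqrt{2\pi}\).

```python
import numpy as np

# CF of a standard normal, used here only as an illustration: phi(t) = exp(-t**2 / 2)
t, dt = np.linspace(-30.0, 30.0, 20001, retstep=True)
phi = np.exp(-t**2 / 2)

def density(x):
    """Fourier inversion as in formula (8), approximated by a Riemann sum."""
    integrand = np.exp(1j * t * x) * np.conj(phi)
    return np.real(np.sum(integrand) * dt) / (2 * np.pi)

for x in (0.0, 1.0, 2.0):
    exact = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    print(round(density(x), 6), round(exact, 6))   # the two columns agree
```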


We may define the joint moment generating function of the random variables \(X_1,X_2,\dots,X_n\) by \[ \psi\left(t_{1}, \ldots, t_{n}\right)=E\left[\exp \left\{\sum_{i=1}^{n} t_{i} X_{i}\right\}\right]\ , \tag{9} \] or the joint characteristic function by \[ \phi\left(t_{1}, \ldots, t_{n}\right)=E\left[\exp \left\{i \sum_{j=1}^{n} t_{j} X_{j}\right\}\right]\ . \tag{10} \] It may be proven that the joint moment generating function (when it exists) or the joint characteristic function uniquely determines the joint distribution.
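 For example (a standard fact, not something proven here), if \(X_1,\dots,X_n\) are independent, the expectation of a product of functions of the \(X_j\) factorizes, so the joint moment generating function reduces to a product of the marginal ones: \[ \psi\left(t_{1}, \ldots, t_{n}\right)=E\left[\prod_{j=1}^{n} e^{t_{j} X_{j}}\right]=\prod_{j=1}^{n} E\left[e^{t_{j} X_{j}}\right]=\prod_{j=1}^{n} \psi_{X_{j}}\left(t_{j}\right)\ , \] and the same holds for the joint characteristic function.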

Laplace Transforms

When dealing with random variables that only assume non-negative values, it is sometimes more convenient to use Laplace transforms rather than characteristic functions. The Laplace transform of the distribution \(F\) is defined by \[ \mathcal{L}(s)=\int_{0}^{\infty} e^{-sx} \mathrm{d} F(x)\ . \tag{11} \] This integral exists for complex variables \(s=a+bi\), where \(a \geqslant 0\). As in the case of characteristic functions, the Laplace transform uniquely determines the distribution.
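 As a small sympy sketch of formula \((11)\), restricting to real \(s>0\) to keep the symbolic integral simple and reusing the exponential distribution from Example 1 as an assumed example:

```python
import sympy as sp

s, x, lam = sp.symbols('s x lambda', positive=True)

# Laplace transform of the exponential(lambda) distribution, with dF(x) = lambda*exp(-lambda*x) dx
L = sp.integrate(sp.exp(-s * x) * lam * sp.exp(-lam * x), (x, 0, sp.oo))
print(sp.simplify(L))   # lambda/(lambda + s)

# Note the relation to the MGF of Example 1: L(s) = psi(-s) = lambda/(lambda - (-s))
```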

 Here, the argument \(s\) has been extended to the right half of the complex plane (\(\operatorname{Re}(s) \geqslant 0\)) rather than just the imaginary axis.

How to Understand those Functions

In fact, the MGF, the CF, and the Fourier transform can all be viewed as special cases of the (two-sided) Laplace transform, and they belong to the field of harmonic analysis. They all map a distribution into a function on another domain, and in doing so reveal some remarkable properties of the random variables. And once again, we gain a deeper understanding of transforms.

References