Cheat Sheet


Cheatsheet for 18.6501x by Blechturm

1 Important probability distributions

Bernoulli
Parameter p ∈ [0, 1], discrete.
p_X(k) = p if k = 1, and p_X(k) = 1 − p if k = 0.
E[X] = p
Var(X) = p(1 − p)
Likelihood (n trials):
L_n(X_1, ..., X_n, p) = p^(Σ_{i=1}^n X_i) (1 − p)^(n − Σ_{i=1}^n X_i)
Loglikelihood (n trials):
ℓ_n(p) = ln(p) Σ_{i=1}^n X_i + (n − Σ_{i=1}^n X_i) ln(1 − p)
MLE:
p̂_n^MLE = (1/n) Σ_{i=1}^n X_i
Fisher information:
I(p) = 1/(p(1 − p)); for n observations, I_n(p) = n/(p(1 − p)).
Canonical exponential form:
f_θ(y) = exp(yθ − ln(1 + e^θ) + 0), with θ = ln(p/(1 − p)), b(θ) = ln(1 + e^θ), c(y, φ) = 0, φ = 1.

Binomial
Parameters n and p, discrete. Describes the number of successes in n independent Bernoulli trials.
p_X(k) = (n choose k) p^k (1 − p)^(n−k), k = 0, 1, ..., n
E[X] = np
Var(X) = np(1 − p)
Likelihood (X_1, ..., X_n iid Bin(K, θ)):
L_n(X_1, ..., X_n, θ) = [Π_{i=1}^n (K choose X_i)] θ^(Σ_{i=1}^n X_i) (1 − θ)^(nK − Σ_{i=1}^n X_i)
Loglikelihood:
ℓ_n(θ) = C + (Σ_{i=1}^n X_i) log(θ) + (nK − Σ_{i=1}^n X_i) log(1 − θ)
MLE:
θ̂_n^MLE = Σ_{i=1}^n X_i / (nK)
Canonical exponential form:
f_p(y) = exp( y(ln(p) − ln(1 − p)) + n ln(1 − p) + ln(n choose y) )

Multinomial
Parameters n > 0 and p_1, ..., p_r, discrete.
p_X(x) = n!/(x_1! ··· x_r!) p_1^(x_1) ··· p_r^(x_r)
E[X_i] = n p_i
Var(X_i) = n p_i (1 − p_i)
Likelihood (n trials):
L_n = Π_{j=1}^r p_j^(T_j), where T_j = Σ_{i=1}^n 1(X_i = j) counts how often outcome j is seen in the n trials.
Loglikelihood:
ℓ_n = Σ_{j=1}^r T_j ln(p_j)

Poisson
Parameter λ > 0, discrete. Approximates the binomial PMF when n is large, p is small, and λ = np.
p_X(k) = exp(−λ) λ^k / k!  for k = 0, 1, ...
E[X] = λ
Var(X) = λ
Likelihood:
L_n(x_1, ..., x_n, λ) = Π_{i=1}^n e^(−λ) λ^(x_i)/x_i! = e^(−nλ) λ^(Σ_{i=1}^n x_i) / Π_{i=1}^n x_i!
Loglikelihood:
ℓ_n(λ) = −nλ + log(λ) Σ_{i=1}^n x_i − log(Π_{i=1}^n x_i!)
MLE:
λ̂_n^MLE = (1/n) Σ_{i=1}^n X_i
Fisher information:
I(λ) = 1/λ
Canonical exponential form:
f_θ(y) = exp(yθ − e^θ − ln(y!)), with θ = ln(λ), b(θ) = e^θ, c(y, φ) = −ln(y!), φ = 1.

Exponential
Parameter λ > 0, continuous.
f_X(x) = λ exp(−λx) if x ≥ 0, 0 otherwise.
F_X(x) = 1 − exp(−λx) if x ≥ 0, 0 otherwise.
E[X] = 1/λ
Var(X) = 1/λ²
Likelihood:
L(X_1, ..., X_n; λ) = λ^n exp(−λ Σ_{i=1}^n X_i)
Loglikelihood:
ℓ_n(λ) = n ln(λ) − λ Σ_{i=1}^n X_i
MLE:
λ̂_n^MLE = n / Σ_{i=1}^n X_i = 1/X̄_n
Fisher information:
I(λ) = 1/λ²
Canonical exponential form:
f_θ(y) = exp(yθ − (−ln(−θ)) + 0), with θ = −λ = −1/µ, b(θ) = −ln(−θ), c(y, φ) = 0, φ = 1.

Shifted Exponential
Parameters λ > 0 and θ ∈ R, continuous.
f_X(x) = λ exp(−λ(x − θ)) if x ≥ θ, 0 otherwise.
F_X(x) = 1 − exp(−λ(x − θ)) if x ≥ θ, 0 otherwise.
Likelihood:
L(X_1, ..., X_n; λ, θ) = λ^n exp(−λ n (X̄_n − θ)) 1(min_i X_i ≥ θ)

Univariate Gaussian
Parameters µ and σ² > 0, continuous.
f_X(x) = 1/(σ√(2π)) exp(−(x − µ)²/(2σ²))
E[X] = µ, Var(X) = σ²
Gaussians are invariant under affine transformations:
aX + b ∼ N(aµ + b, a²σ²)
Sum of independent Gaussians: let X ∼ N(µ_X, σ_X²) and Y ∼ N(µ_Y, σ_Y²) be independent.
If U = X + Y, then U ∼ N(µ_X + µ_Y, σ_X² + σ_Y²).
If U = X − Y, then U ∼ N(µ_X − µ_Y, σ_X² + σ_Y²).
Symmetry: if X ∼ N(0, σ²), then −X ∼ N(0, σ²) and P(|X| > x) = 2P(X > x).
Standardization: Z = (X − µ)/σ ∼ N(0, 1).
Likelihood:
L(X_1, ..., X_n; µ, σ²) = (1/(σ√(2π)))^n exp(−(1/(2σ²)) Σ_{i=1}^n (X_i − µ)²)
Loglikelihood:
ℓ_n(µ, σ²) = −n log(σ√(2π)) − (1/(2σ²)) Σ_{i=1}^n (X_i − µ)²
MLE:
µ̂_n^MLE = X̄_n, (σ̂²)_n^MLE = (1/n) Σ_{i=1}^n (X_i − X̄_n)²
Fisher information (for the parameter pair (µ, σ²)):
I(µ, σ²) = [[1/σ², 0], [0, 1/(2σ⁴)]]
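The MLE and Fisher-information formulas above are easy to sanity-check numerically. Below is a minimal sketch, assuming numpy is available; the sample size, parameter value, and seed are arbitrary illustration choices, not from the source. It draws Bernoulli data, computes p̂ = X̄_n, and compares the empirical variance of √n(p̂ − p) across repetitions with the inverse Fisher information 1/I(p) = p(1 − p).

import numpy as np

rng = np.random.default_rng(0)
p, n, reps = 0.3, 2000, 5000                    # illustration values, not from the source
phat = rng.binomial(1, p, size=(reps, n)).mean(axis=1)   # p-hat = sample mean, one per repetition
scaled = np.sqrt(n) * (phat - p)                # sqrt(n)(p-hat - p), approximately N(0, p(1-p))
print(scaled.var(), p * (1 - p))                # empirical variance vs 1/I(p) = p(1-p)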

Uniform
Parameters a and b (a < b), continuous.
f_X(x) = 1/(b − a) if a ≤ x ≤ b, 0 otherwise.

Cauchy
Parameter m (location), continuous.
f_X(x) = (1/π) · 1/(1 + (x − m)²)
The median is m: P(X > m) = P(X < m) = ∫_m^∞ (1/π) · 1/(1 + (x − m)²) dx = 1/2.

Chi squared
The χ²_d distribution with d degrees of freedom is the distribution of Z_1² + Z_2² + ··· + Z_d², where Z_1, ..., Z_d ∼ iid N(0, 1).
If V ∼ χ²_d:
E[V] = E[Z_1²] + E[Z_2²] + ... + E[Z_d²] = d
Var(V) = Var(Z_1²) + Var(Z_2²) + ... + Var(Z_d²) = 2d

Student's T distribution
T_n := Z / √(V/n), where Z ∼ N(0, 1), V ∼ χ²_n, and Z and V are independent.

2 Quantiles of a distribution
Let α ∈ (0, 1). The quantile of order 1 − α of a random variable X is the number q_α such that:
P(X ≤ q_α) = 1 − α
P(X ≥ q_α) = α
F_X(q_α) = 1 − α
F_X^(−1)(1 − α) = q_α
If X ∼ N(0, 1): P(|X| > q_(α/2)) = α

3 Expectation
E[X] = ∫_{−∞}^{+∞} x · f_X(x) dx
E[g(X)] = ∫_{−∞}^{+∞} g(x) · f_X(x) dx
E[X | Y = y] = ∫_{−∞}^{+∞} x · f_{X|Y}(x|y) dx
Integration limits only have to be over the support of the pdf. For discrete r.v.s the formulas are the same, with sums and pmfs.
Total expectation theorem:
E[X] = ∫_{−∞}^{+∞} f_Y(y) · E[X | Y = y] dy
Iterated expectation:
E[X · Y] = E[E[X · Y | Y]] = E[Y · E[X | Y]]
Expectation of a constant a: E[a] = a
Linearity of expectation, where a and c are given scalars:
E[aX + cY] = aE[X] + cE[Y]
Product of independent r.v.s X and Y: E[X · Y] = E[X] · E[Y]
If the variance of X is known:
E[X²] = Var(X) + (E[X])²

4 Variance
Variance is the expected squared distance from the mean.
Var(X) = E[(X − E[X])²]
Var(X) = E[X²] − (E[X])²
Variance of a product with a constant a:
Var(aX) = a² Var(X)
Variance of a sum of two dependent r.v.s:
Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
Variance of a sum of two independent r.v.s:
Var(X + Y) = Var(X) + Var(Y)
Var(X − Y) = Var(X) + Var(Y)

5 Sample mean and sample variance
Let X_1, ..., X_n ∼ iid P_µ, where E(X_i) = µ and Var(X_i) = σ² for all i = 1, 2, ..., n.
Sample mean:
X̄_n = (1/n) Σ_{i=1}^n X_i
Sample variance:
S_n = (1/n) Σ_{i=1}^n (X_i − X̄_n)² = (1/n) Σ_{i=1}^n X_i² − X̄_n²
Unbiased estimator of the variance:
S̃_n = 1/(n − 1) Σ_{i=1}^n (X_i − X̄_n)² = n/(n − 1) · S_n
Cochran's theorem:
If X_1, ..., X_n ∼ iid N(µ, σ²), the sample mean X̄_n and the sample variance S_n are independent, X̄_n ⊥⊥ S_n for all n, and n S_n / σ² ∼ χ²_{n−1}.

6 Covariance
The covariance is a measure of how much the values of two correlated random variables determine each other.
Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)]
Cov(X, Y) = E[XY] − E[X]E[Y]
Cov(X, Y) = E[X(Y − µ_Y)]
Possible notations: Cov(X, Y) = σ(X, Y) = σ_{X,Y}
Covariance is commutative: Cov(X, Y) = Cov(Y, X)
Covariance of an r.v. with itself is its variance: Cov(X, X) = E[(X − µ_X)²] = Var(X)
Useful properties:
Cov(aX + h, bY + c) = ab Cov(X, Y)
Cov(X, X + Y) = Var(X) + Cov(X, Y)
Cov(aX + bY, Z) = a Cov(X, Z) + b Cov(Y, Z)
If Cov(X, Y) = 0, we say that X and Y are uncorrelated. If X and Y are independent, their covariance is zero. The converse is not always true: it holds if X and Y form a Gaussian vector, i.e. any linear combination αX + βY is Gaussian for all (α, β) ∈ R² without {0, 0}.

7 Law of large numbers and central limit theorem (univariate)
Let X_1, ..., X_n ∼ iid P_µ, where E(X_i) = µ and Var(X_i) = σ² for all i = 1, 2, ..., n, and let X̄_n = (1/n) Σ_{i=1}^n X_i.
Law of large numbers:
X̄_n → µ  (P, a.s.) as n → ∞
(1/n) Σ_{i=1}^n g(X_i) → E[g(X)]  (P, a.s.) as n → ∞
Central limit theorem:
√n (X̄_n − µ)/σ → N(0, 1)  (d) as n → ∞
√n (X̄_n − µ) → N(0, σ²)  (d) as n → ∞
Expectation of the mean: E[X̄_n] = (1/n) E[X_1 + X_2 + ... + X_n] = µ.
Variance of the mean: Var(X̄_n) = (1/n)² Var(X_1 + X_2 + ... + X_n) = σ²/n.

8 Random vectors
A random vector X = (X^(1), ..., X^(d))^T of dimension d × 1 is a vector-valued function from a probability space Ω to R^d:
X: Ω → R^d, ω ↦ (X^(1)(ω), X^(2)(ω), ..., X^(d)(ω))^T,
where each X^(k) is a (scalar) random variable on Ω.
PDF of X: the joint distribution of its components X^(1), ..., X^(d).
CDF of X:
R^d → [0, 1], x ↦ P(X^(1) ≤ x^(1), ..., X^(d) ≤ x^(d)).
The sequence X_1, X_2, ... converges in probability to X if and only if each component sequence X_1^(k), X_2^(k), ... converges in probability to X^(k).
Expectation of a random vector:
The expectation of a random vector is the elementwise expectation. Let X be a random vector of dimension d × 1:
E[X] = (E[X^(1)], ..., E[X^(d)])^T
The expectation of a random matrix is the expected value of each of its elements. Let X = {X_ij} be an n × p random matrix. Then E[X] is the n × p matrix of numbers (if they exist) with entries E[X_ij].
Let X and Y be random matrices of the same dimension, and let A and B be conformable matrices of constants:
E[X + Y] = E[X] + E[Y]
E[AXB] = A E[X] B
Covariance matrix:
Let X be a random vector of dimension d × 1 with expectation µ_X. Matrix outer products!
Σ = Cov(X) = E[(X − µ_X)(X − µ_X)^T] = E[(X_1 − µ_1, ..., X_d − µ_d)^T (X_1 − µ_1, ..., X_d − µ_d)] = (σ_ij)_{i,j = 1,...,d}
The covariance matrix Σ is a d × d matrix. It is a table of the pairwise covariances of the elements of the random vector. Its diagonal elements are the variances of the elements of the random vector; the off-diagonal elements are its covariances. Note that covariance is commutative, e.g. σ_12 = σ_21.
Alternative forms:
Σ = E[XX^T] − E[X]E[X]^T = E[XX^T] − µ_X µ_X^T
Let the random vector X ∈ R^d and let A and B be conformable matrices of constants:
Cov(AX + B) = Cov(AX) = A Cov(X) A^T = A Σ A^T
Every covariance matrix is positive semi-definite: Σ ⪰ 0.
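As a quick illustration of sections 5 and 8, the sketch below (a toy simulation assuming numpy; the dimension, covariance matrix, and transform A are arbitrary illustration values) estimates a covariance matrix from samples, checks the rule Cov(AX + B) = AΣAᵀ, and compares the biased sample variance S_n with its unbiased correction S̃_n = n/(n−1) S_n.

import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])     # known covariance of a toy 2-d vector
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=100_000)

Sigma_hat = np.cov(X.T, bias=True)             # estimate of Cov(X), close to Sigma
A = np.array([[1.0, 2.0], [0.0, 3.0]])
AX = X @ A.T
print(np.cov(AX.T, bias=True))                 # approximately A Sigma A^T
print(A @ Sigma_hat @ A.T)

x = X[:200, 0]                                 # a single coordinate, small sample
S_n = x.var()                                  # (1/n) sum (x_i - xbar)^2
S_tilde = x.var(ddof=1)                        # (1/(n-1)) sum (x_i - xbar)^2
print(S_tilde, len(x) / (len(x) - 1) * S_n)    # the two agree: S_tilde = n/(n-1) * S_n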

Multivariate CLT
Let X_1, ..., X_n ∈ R^d be independent copies of a random vector X such that E[X] = µ (a d × 1 vector of expectations) and Cov(X) = Σ. Then
√n (X̄_n − µ) → N_d(0, Σ)  (d) as n → ∞
√n Σ^(−1/2) (X̄_n − µ) → N_d(0, I_d)  (d) as n → ∞
where Σ^(−1/2) is the d × d matrix such that Σ^(−1/2) Σ^(−1/2) = Σ^(−1) and I_d is the identity matrix.

Gaussian random vectors
A random vector X = (X^(1), ..., X^(d))^T is a Gaussian vector, or multivariate Gaussian or normal variable, if any linear combination of its components is a (univariate) Gaussian variable or a constant (a "Gaussian" variable with zero variance), i.e., if α^T X is (univariate) Gaussian or constant for any constant non-zero vector α ∈ R^d.

Multivariate Gaussians
The distribution of X, the d-dimensional Gaussian or normal distribution, is completely specified by the mean vector µ = E[X] = (E[X^(1)], ..., E[X^(d)])^T and the d × d covariance matrix Σ. If Σ is invertible, the pdf of X is
f_X(x) = 1/√((2π)^d det(Σ)) · exp(−(1/2)(x − µ)^T Σ^(−1) (x − µ)),  x ∈ R^d,
where det(Σ) is the determinant of Σ, which is positive when Σ is invertible.
If µ = 0 and Σ is the identity matrix, then X is called a standard normal random vector.
If the covariance matrix Σ is diagonal, the pdf factors into pdfs of univariate Gaussians, and hence the components are independent.
The linear transform of a Gaussian X ∼ N_d(µ, Σ) with conformable matrix A and vector b is Gaussian:
AX + b ∼ N_d(Aµ + b, AΣA^T)

Multivariate Delta Method
Gradient matrix of a vector-valued function: given f: R^d → R^k, the gradient (or gradient matrix) of f, denoted ∇f, is the d × k matrix
∇f = (∇f_1 | ∇f_2 | ... | ∇f_k), with entry (i, j) equal to ∂f_j/∂x_i.
This is also the transpose of what is known as the Jacobian matrix J_f of f.
General statement: given
• (T_n)_{n≥1}, a sequence of random vectors satisfying √n (T_n − θ) → T  (d) as n → ∞,
• a function g: R^d → R^k that is continuously differentiable at θ,
then
√n (g(T_n) − g(θ)) → ∇g(θ)^T T  (d) as n → ∞.
With multivariate Gaussians and the sample mean:
Let T_n = X̄_n, where X̄_n is the sample average of X_1, ..., X_n ∼ iid X, and θ = E[X]. The (multivariate) CLT then gives T ∼ N(0, Σ_X), where Σ_X is the covariance matrix of X. In this case
√n (g(T_n) − g(θ)) → ∇g(θ)^T T ∼ N(0, ∇g(θ)^T Σ_X ∇g(θ))  (d) as n → ∞.
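A small simulation of the multivariate delta method (a sketch assuming numpy; the function g(x, y) = x/y, the mean vector, and the covariance matrix are made-up illustration choices): with T_n the sample mean of a 2-d vector, the variance of √n(g(T_n) − g(θ)) should approach ∇g(θ)ᵀ Σ ∇g(θ).

import numpy as np

rng = np.random.default_rng(2)
mu = np.array([2.0, 1.0])                       # illustration values
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
g = lambda t: t[..., 0] / t[..., 1]             # g(x, y) = x / y
grad = np.array([1.0 / mu[1], -mu[0] / mu[1] ** 2])   # gradient of g at mu

reps, n = 2000, 2000
X = rng.multivariate_normal(mu, Sigma, size=(reps, n))
Tn = X.mean(axis=1)                             # sample mean, one 2-d vector per repetition
scaled = np.sqrt(n) * (g(Tn) - g(mu))
print(scaled.var(), grad @ Sigma @ grad)        # empirical vs delta-method asymptotic variance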

9 Statistical models
A statistical model is a pair (E, {P_θ}_{θ∈Θ}), where:
E is a sample space for X, i.e. a set that contains all possible outcomes of X;
{P_θ}_{θ∈Θ} is a family of probability distributions on E;
Θ is a parameter set, i.e. a set consisting of some possible values of θ.
θ is the true parameter and is unknown. In a parametric model we assume that Θ ⊂ R^d, for some d ≥ 1.
Identifiability:
θ ≠ θ′ ⇒ P_θ ≠ P_θ′  (equivalently, P_θ = P_θ′ ⇒ θ = θ′)
A model is well specified if:
∃θ s.t. P = P_θ

10 Estimators
A statistic is any measurable function of the sample, e.g. X̄_n, max(X_i), etc. An estimator of θ is any statistic which does not depend on θ.
An estimator θ̂_n is weakly consistent if θ̂_n → θ in probability as n → ∞; if the convergence is almost sure, it is strongly consistent.
Asymptotic normality of an estimator:
√n (θ̂_n − θ) → N(0, σ²)  (d) as n → ∞
σ² is called the asymptotic variance of θ̂_n. In the case of the sample mean it is the variance of a single X_i. If the estimator is a function of the sample mean, the delta method is needed to compute the asymptotic variance. Asymptotic variance ≠ variance of an estimator.
Bias of an estimator:
Bias(θ̂_n) = E[θ̂_n] − θ
Quadratic risk of an estimator:
R(θ̂_n) = E[(θ̂_n − θ)²] = Bias² + Variance

11 Confidence intervals
Let (E, (P_θ)_{θ∈Θ}) be a statistical model based on observations X_1, ..., X_n, and assume Θ ⊆ R. Let α ∈ (0, 1).
Non-asymptotic confidence interval of level 1 − α for θ:
Any random interval I, depending on the sample X_1, ..., X_n but not on θ, such that
P_θ[I ∋ θ] ≥ 1 − α, ∀θ ∈ Θ.
Confidence interval of asymptotic level 1 − α for θ:
Any random interval I whose boundaries do not depend on θ and such that
lim_{n→∞} P_θ[I ∋ θ] ≥ 1 − α, ∀θ ∈ Θ.
Two-sided asymptotic CI:
Let X_1, ..., X_n ∼ iid P_θ. A two-sided CI is a function of the sample giving an upper and a lower bound within which the estimated parameter lies, I = [l(X_1, ..., X_n), u(X_1, ..., X_n)], with a certain probability: P(θ ∈ I) ≥ 1 − α and conversely P(θ ∉ I) ≤ α.
Since the estimator θ̂_n is an r.v. depending on the sample, it has a variance Var(θ̂_n) and a mean E[θ̂_n]. After finding those it is possible to standardize the estimator using the CLT. This yields an asymptotic CI:
I = [θ̂_n − q_(α/2) √(Var(θ))/√n, θ̂_n + q_(α/2) √(Var(θ))/√n]
This expression depends on the true variance Var(θ) of the X_i's, so the variance has to be estimated. Three possible methods: plug-in (substitute the estimate, e.g. the sample mean, into the variance formula), solve (solve the resulting quadratic inequality for θ), conservative (use the maximum of the variance).
(Univariate) Delta method:
Used when taking a function of the sample mean: it converges to the same function of the population mean, with
√n (g(m̂_1) − g(m_1(θ))) → N(0, g′(m_1(θ))² σ²)  (d) as n → ∞.
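A minimal sketch of the two-sided asymptotic CI with the plug-in method for a Bernoulli proportion (assuming numpy; the data are simulated for illustration, and 1.959964 is the standard normal quantile q_{α/2} for α = 0.05):

import numpy as np

rng = np.random.default_rng(3)
x = rng.binomial(1, 0.4, size=500)              # simulated Bernoulli sample (illustration)
n, p_hat = len(x), x.mean()
q = 1.959964                                    # q_{alpha/2} of N(0, 1) for alpha = 0.05
half_width = q * np.sqrt(p_hat * (1 - p_hat) / n)   # plug-in: Var estimated by p_hat(1 - p_hat)
print(p_hat - half_width, p_hat + half_width)   # asymptotic 95% CI for p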

12 Hypothesis tests
Comparison of two proportions:
Let X_1, ..., X_n ∼ iid Bern(p_x) and Y_1, ..., Y_n ∼ iid Bern(p_y), with X independent of Y. Let p̂_x = (1/n) Σ_{i=1}^n X_i and p̂_y = (1/n) Σ_{i=1}^n Y_i.
H_0: p_x = p_y;  H_1: p_x ≠ p_y
To get the asymptotic variance, use the multivariate delta method. Consider p̂_x − p̂_y = g(p̂_x, p̂_y) with g(x, y) = x − y; then
√n (g(p̂_x, p̂_y) − g(p_x, p_y)) → N(0, ∇g(p_x, p_y)^T Σ ∇g(p_x, p_y)) = N(0, p_x(1 − p_x) + p_y(1 − p_y))  (d) as n → ∞.
Pivot:
Let X_1, ..., X_n be random samples and let T_n be a function of X_1, ..., X_n and a parameter vector θ. Let g(T_n) be a random variable whose distribution is the same for all θ. Then g is called a pivotal quantity or a pivot.
For example, let X be a random variable with mean µ and variance σ², and let X_1, ..., X_n be iid samples of X. Then
g_n = (X̄_n − µ)/σ
is a pivot with θ = [µ, σ²]^T as the parameter vector. The notion of a parameter vector here is not to be confused with the set of parameters that we use to define a statistical model.
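A sketch of the two-proportions test above (assuming numpy; the data and true proportions are simulated illustration values): the statistic standardizes p̂_x − p̂_y by the delta-method variance p_x(1 − p_x) + p_y(1 − p_y), estimated by plug-in, and rejects when it exceeds q_{α/2} of N(0, 1).

import numpy as np

rng = np.random.default_rng(4)
n = 1000
x = rng.binomial(1, 0.50, size=n)               # Bern(p_x) sample (illustration values)
y = rng.binomial(1, 0.55, size=n)               # Bern(p_y) sample
px_hat, py_hat = x.mean(), y.mean()
var_hat = px_hat * (1 - px_hat) + py_hat * (1 - py_hat)   # plug-in delta-method variance
stat = np.sqrt(n) * (px_hat - py_hat) / np.sqrt(var_hat)
reject = abs(stat) > 1.959964                   # two-sided test of level 0.05
print(stat, reject)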

One-sided test: the alternative is of the form H_1: θ > θ_0 or H_1: θ < θ_0.
Two-sided test: the alternative is H_1: θ ≠ θ_0.
P-value: the smallest (asymptotic) level α at which the test rejects H_0; it is random, since it depends on the sample.
Wald's test:
Let X_1, ..., X_n ∼ iid P_θ* for some true parameter θ* ∈ R^d. We construct the associated statistical model (R, {P_θ}_{θ∈R^d}) and the maximum likelihood estimator θ̂_n^MLE for θ*. Decide between the two hypotheses
H_0: θ* = 0  vs  H_1: θ* ≠ 0.
Assuming that the null hypothesis is true, the asymptotic normality of the MLE θ̂_n^MLE implies that the random variable
‖√n I(0)^(1/2) (θ̂_n^MLE − 0)‖² → χ²_d  (d) as n → ∞.
Wald's test in 1 dimension:
In 1 dimension, Wald's test coincides with the two-sided test based on the asymptotic normality of the MLE. Given the hypotheses H_0: θ* = θ_0 vs H_1: θ* ≠ θ_0, a two-sided test of level α based on the asymptotic normality of the MLE is
ψ_α = 1( √(n I(θ_0)) |θ̂^MLE − θ_0| > q_(α/2)(N(0, 1)) ),
where I(θ_0)^(−1) is the asymptotic variance of θ̂^MLE under the null hypothesis. On the other hand, a Wald's test of level α is
ψ_α^Wald = 1( n I(θ_0) (θ̂^MLE − θ_0)² > q_α(χ²_1) ) = 1( √(n I(θ_0)) |θ̂^MLE − θ_0| > √(q_α(χ²_1)) ).

13 Distance between distributions
Total variation:
The total variation distance TV between the probability measures P and Q with a sample space E is defined as
TV(P, Q) = max_{A⊂E} |P(A) − Q(A)|.
Calculation with pmfs/pdfs f and g:
TV(P, Q) = (1/2) Σ_{x∈E} |f(x) − g(x)|  (discrete)
TV(P, Q) = (1/2) ∫_{x∈E} |f(x) − g(x)| dx  (continuous)
TV is a distance:
symmetric: TV(P, Q) = TV(Q, P)
nonnegative: TV(P, Q) ≥ 0
definite: TV(P, Q) = 0 ⇔ P = Q
triangle inequality: TV(P, V) ≤ TV(P, Q) + TV(Q, V)
If the supports of P and Q are disjoint: TV(P, Q) = 1.
TV between a continuous and a discrete r.v.: TV(P, Q) = 1.
KL divergence:
The KL divergence (also known as relative entropy) between the probability measures P and Q with common sample space E and pmf/pdf functions p and q is defined as
KL(P, Q) = Σ_{x∈E} p(x) ln(p(x)/q(x))  (discrete)
KL(P, Q) = ∫_{x∈E} p(x) ln(p(x)/q(x)) dx  (continuous)
Not a distance! Sum (or integrate) over the support of P.
Asymmetric in general: KL(P, Q) ≠ KL(Q, P)
Nonnegative: KL(P, Q) ≥ 0
Definite: if P = Q then KL(P, Q) = 0
Does not satisfy the triangle inequality in general: KL(P, V) ≰ KL(P, Q) + KL(Q, V)
Estimator of the KL divergence:
KL(P_θ*, P_θ) = E_θ*[ln(p_θ*(X)/p_θ(X))]
KL̂(P_θ*, P_θ) = const − (1/n) Σ_{i=1}^n ln(p_θ(X_i))

Maximum likelihood estimation
Cookbook: take the log of the likelihood function. Take the partial derivative of the loglikelihood function with respect to the parameter. Set the partial derivative to zero and solve for the parameter. If an indicator function in the pdf/pmf does not depend on the parameter, it can be ignored. If it depends on the parameter, it cannot be ignored, because there is a discontinuity in the loglikelihood function; the maximum/minimum of the X_i is then the maximum likelihood estimator.
Maximum likelihood estimator:
Let (E, (P_θ)_{θ∈Θ}) be a statistical model associated with a sample of i.i.d. random variables X_1, X_2, ..., X_n. Assume that there exists θ* ∈ Θ such that X_i ∼ P_θ*.
The maximum likelihood estimator is the (unique) parameter that minimizes the estimated KL divergence KL̂(P_θ*, P_θ) over the parameter space. (The minimizer of the KL divergence is unique due to it being strictly convex in the space of distributions once P_θ* is fixed.)
θ̂_n^MLE = argmin_{θ∈Θ} KL̂_n(P_θ*, P_θ) = argmax_{θ∈Θ} Σ_{i=1}^n ln(p_θ(X_i)) = argmax_{θ∈Θ} ln( Π_{i=1}^n p_θ(X_i) )
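The KL properties above (nonnegativity and asymmetry) and the TV formula can be seen on a small discrete example. A sketch assuming numpy, with two made-up pmfs on a common support:

import numpy as np

p = np.array([0.2, 0.5, 0.3])                   # pmf of P (illustration values)
q = np.array([0.4, 0.4, 0.2])                   # pmf of Q

def kl(a, b):
    # sum over the support of the first argument
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

tv = 0.5 * np.abs(p - q).sum()                  # total variation distance
print(tv, kl(p, q), kl(q, p))                   # KL(P,Q) != KL(Q,P) in general, both >= 0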

Gaussian maximum likelihood estimators:
µ̂_n^MLE = (1/n) Σ_{i=1}^n X_i
MLE estimator for σ² = τ in the centered model N(0, τ): τ̂_n^MLE = (1/n) Σ_{i=1}^n X_i²

13.1 Fisher Information
The Fisher information is the covariance matrix of the gradient of the loglikelihood function. It is equal to the negative expectation of the Hessian of the loglikelihood function and captures the negative of the expected curvature of the loglikelihood function.
Let θ ∈ Θ ⊂ R^d and let (E, {P_θ}_{θ∈Θ}) be a statistical model. Let f_θ(x) be the pdf of the distribution P_θ. Then the Fisher information of the statistical model is
I(θ) = Cov(∇ℓ(θ)) = E[∇ℓ(θ) ∇ℓ(θ)^T] − E[∇ℓ(θ)] E[∇ℓ(θ)]^T = −E[Hℓ(θ)],
where ℓ(θ) = ln f_θ(X). If ∇ℓ(θ) ∈ R^d, I(θ) is a d × d matrix. The definition when the distribution has a pmf p_θ(x) is the same, with the expectation taken with respect to the pmf.
Let (R, {P_θ}_{θ∈R}) denote a continuous statistical model and let f_θ(x) denote the pdf of the continuous distribution P_θ. Assume that f_θ(x) is twice differentiable as a function of the parameter θ. Formula for the calculation of the Fisher information of X:
I(θ) = ∫_{−∞}^{∞} (∂f_θ(x)/∂θ)² / f_θ(x) dx
Models with one parameter (e.g. Bernoulli):
I(θ) = Var(ℓ′(θ))
I(θ) = −E(ℓ″(θ))
Models with multiple parameters (e.g. Gaussians):
I(θ) = −E[Hℓ(θ)]
Cookbook (better to use the second derivative):
• Find the loglikelihood.
• Take the second derivative (= Hessian if multivariate).
• Massage the second derivative or Hessian (isolate functions of X_i to use with −E(ℓ″(θ)) or −E[Hℓ(θ)]).
• Find the expectations of the functions of X_i and substitute them back into the Hessian or the second derivative. Be extra careful to substitute the right power back: E[X_i] ≠ E[X_i²].
• Don't forget the minus sign!

Asymptotic normality of the maximum likelihood estimator
Under certain conditions the MLE is asymptotically normal and consistent. This applies even if the MLE is not the sample average. Let the true parameter be θ* ∈ Θ. Necessary assumptions:
• The parameter is identifiable.
• For all θ ∈ Θ, the support of P_θ does not depend on θ (this excludes, e.g., Unif(0, θ)).
• θ* is not on the boundary of Θ.
• The Fisher information I(θ) is invertible in a neighborhood of θ*.
• A few more technical conditions.
The asymptotic variance of the MLE is the inverse of the Fisher information:
√n (θ̂_n^MLE − θ*) → N_d(0, I(θ*)^(−1))  (d) as n → ∞.
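A numerical check of I(θ) = Var(ℓ′(θ)) = −E[ℓ″(θ)] for the Bernoulli model (a sketch assuming numpy; the parameter value and sample size are arbitrary illustration choices). For one observation, ℓ(p) = X ln p + (1 − X) ln(1 − p), so ℓ′(p) = X/p − (1 − X)/(1 − p) and ℓ″(p) = −X/p² − (1 − X)/(1 − p)².

import numpy as np

rng = np.random.default_rng(5)
p = 0.3
x = rng.binomial(1, p, size=1_000_000)          # many draws to approximate the expectations
score = x / p - (1 - x) / (1 - p)               # l'(p), one value per observation
hess = -x / p**2 - (1 - x) / (1 - p) ** 2       # l''(p), one value per observation
print(score.var(), -hess.mean(), 1 / (p * (1 - p)))   # all three should roughly agree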

14 Method of moments
Let X_1, ..., X_n ∼ iid P_θ* associated with the model (E, {P_θ}_{θ∈Θ}), with E ⊆ R and Θ ⊆ R^d, for some d ≥ 1.
Population moments:
m_k(θ) = E_θ[X_1^k], 1 ≤ k ≤ d
Empirical moments:
m̂_k = (1/n) Σ_{i=1}^n X_i^k
Convergence of empirical moments:
m̂_k → m_k  (P, a.s.) as n → ∞
(m̂_1, ..., m̂_d) → (m_1, ..., m_d)  (P, a.s.) as n → ∞
MOM estimator:
M is a map from the parameters of a model to the moments of its distribution. This map is invertible (i.e. it yields a system of equations that can be solved for the true parameter vector θ*). Find the moments (as many as there are parameters), set up the system of equations, solve for the parameters, and use the empirical moments to estimate:
M: Θ → R^d, θ ↦ (m_1(θ), m_2(θ), ..., m_d(θ))
θ* = M^(−1)(m_1(θ*), m_2(θ*), ..., m_d(θ*))
The MOM estimator uses the empirical moments:
θ̂_n^MM = M^(−1)( (1/n) Σ_{i=1}^n X_i, (1/n) Σ_{i=1}^n X_i², ..., (1/n) Σ_{i=1}^n X_i^d )
Assuming M^(−1) is continuously differentiable at M(θ), the asymptotic variance of the MOM estimator is given by
√n (θ̂_n^MM − θ) → N(0, Γ(θ))  (d) as n → ∞,
where
Γ(θ) = [∂M^(−1)/∂θ (M(θ))]^T Σ(θ) [∂M^(−1)/∂θ (M(θ))] = ∇_θ(M^(−1))^T Σ(θ) ∇_θ(M^(−1)),
and Σ(θ) is the covariance matrix of the random vector of moments (X_1^1, X_1^2, ..., X_1^d).
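A sketch of the method of moments for the Gaussian model N(µ, σ²) (assuming numpy; the data are simulated illustration values). Here M(µ, σ²) = (m_1, m_2) = (µ, σ² + µ²), so the inverse map is µ = m_1 and σ² = m_2 − m_1², evaluated at the empirical moments.

import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)    # true mu = 2, sigma^2 = 2.25 (illustration)
m1_hat = x.mean()                                  # first empirical moment
m2_hat = (x ** 2).mean()                           # second empirical moment
mu_mom = m1_hat                                    # invert M: mu = m1
sigma2_mom = m2_hat - m1_hat ** 2                  # sigma^2 = m2 - m1^2
print(mu_mom, sigma2_mom)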

15 OLS
Y | X = x ∼ N(µ(x), σ²I)
E[Y | X = x] = µ(x) = x^T β
Random component of the linear model: Y is continuous and Y | X = x is Gaussian with mean µ(x).

16 Generalized linear models
We relax the assumption that µ is linear. Instead, we assume that g ∘ µ is linear, for some function g:
g(µ(x)) = x^T β
The function g is assumed to be known and is referred to as the link function. It maps the domain of the dependent variable to the entire real line; it has to be strictly increasing, it has to be continuously differentiable, and its range is all of R.

16.1 The exponential family
A family of distributions {P_θ : θ ∈ Θ}, where the parameter space Θ ⊂ R^k is k-dimensional, is called a k-parameter exponential family on R^q if the pmf or pdf f_θ: R^q → R of P_θ can be written in the form
f_θ(y) = h(y) exp( η(θ) · T(y) − B(θ) ),
where
η(θ) = (η_1(θ), ..., η_k(θ)): Θ → R^k,
T(y) = (T_1(y), ..., T_k(y)): R^q → R^k,
B(θ): Θ → R,
h(y): R^q → R.
If k = 1 it reduces to
f_θ(y) = h(y) exp( η(θ) T(y) − B(θ) ).

17 Algebra
Absolute value inequalities:
|f(x)| < a ⇒ −a < f(x) < a
|f(x)| > a ⇒ f(x) > a or f(x) < −a

18 Matrix algebra
‖Ax‖² = (Ax)^T (Ax) = x^T A^T A x
A (real-valued) symmetric d × d matrix A is:
positive semi-definite: x^T A x ≥ 0 for all x ∈ R^d;
positive definite: x^T A x > 0 for all non-zero vectors x ∈ R^d;
negative semi-definite (resp. negative definite): x^T A x ≤ 0 for all x ∈ R^d (resp. x^T A x < 0 for all x ∈ R^d \ {0}).
Positive (or negative) definiteness implies positive (or negative) semi-definiteness.

19 Calculus
Differentiation under the integral sign:
d/dx ∫_{a(x)}^{b(x)} f(x, t) dt = f(x, b(x)) b′(x) − f(x, a(x)) a′(x) + ∫_{a(x)}^{b(x)} ∂f/∂x (x, t) dt
Concavity in 1 dimension:
If g: I → R is twice differentiable in the interval I:
concave: if and only if g″(x) ≤ 0 for all x ∈ I
strictly concave: if g″(x) < 0 for all x ∈ I
convex: if and only if g″(x) ≥ 0 for all x ∈ I
strictly convex: if g″(x) > 0 for all x ∈ I
Multivariate calculus:
The gradient ∇f of a differentiable function f: R^d → R is
∇f: R^d → R^d, θ = (θ_1, ..., θ_d) ↦ (∂f/∂θ_1, ∂f/∂θ_2, ..., ∂f/∂θ_d)^T.
Hessian:
The Hessian of f is the symmetric d × d matrix of second partial derivatives of f:
Hf(θ) = ∇²f(θ) = ( ∂²f/∂θ_i ∂θ_j (θ) )_{i,j = 1,...,d} ∈ R^(d×d).
If the Hessian is positive definite at a, then f attains a local minimum at a (locally convex). If the Hessian is negative definite at a, then f attains a local maximum at a (locally concave). If the Hessian has both positive and negative eigenvalues, then a is a saddle point of f.
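Definiteness of a Hessian is conveniently checked through its eigenvalues (all positive means positive definite, all negative means negative definite, mixed signs mean a saddle point). A sketch assuming numpy, on a made-up symmetric matrix:

import numpy as np

H = np.array([[2.0, 0.5],
              [0.5, 1.0]])                      # example Hessian (illustration values)
eigvals = np.linalg.eigvalsh(H)                 # eigenvalues of a symmetric matrix
if np.all(eigvals > 0):
    print("positive definite -> local minimum (locally convex)")
elif np.all(eigvals < 0):
    print("negative definite -> local maximum (locally concave)")
elif np.any(eigvals > 0) and np.any(eigvals < 0):
    print("indefinite -> saddle point")
else:
    print("semi-definite -> inconclusive from the Hessian alone")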