Probability and Statistics Cheat Sheet
Copyright © Matthias Vallentin, 2011 ([email protected]), 6th March, 2011
This cheat sheet integrates a variety of topics in probability theory and statistics. It is based on literature [1, 6, 3] and in-class material from courses of the statistics department at the University of California, Berkeley, but is also influenced by other sources [4, 5]. If you find errors or have suggestions for further topics, I would appreciate it if you send me an email. The most recent version of this document is available at http://bit.ly/probstat. To reproduce, please contact me.

Contents

1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-based Confidence Interval
  11.3 Empirical Distribution Function
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter Delta Method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Bayesian Inference
  14.1 Credible Intervals
  14.2 Function of Parameters
  14.3 Priors
    14.3.1 Conjugate Priors
  14.4 Bayesian Testing
15 Exponential Family
16 Sampling Methods
  16.1 The Bootstrap
    16.1.1 Bootstrap Confidence Intervals
  16.2 Rejection Sampling
  16.3 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA Models
    21.4.1 Causality and Invertibility
  21.5 Spectral Analysis
22 Math
  22.1 Gamma Function
  22.2 Beta Function
  22.3 Series
  22.4 Combinatorics
1 Distribution Overview

1.1 Discrete Distributions

| Distribution | Notation¹ | F_X(x) | f_X(x) | E[X] | V[X] | M_X(s) |
|---|---|---|---|---|---|---|
| Uniform | Unif{a, …, b} | (⌊x⌋ − a + 1)/(b − a + 1) for a ≤ x ≤ b | I(a ≤ x ≤ b)/(b − a + 1) | (a + b)/2 | ((b − a + 1)² − 1)/12 | (e^{as} − e^{(b+1)s})/(s(b − a)) |
| Bernoulli | Bern(p) | 0 (x < 0), 1 − p (0 ≤ x < 1), 1 (x ≥ 1) | p^x (1 − p)^{1−x}, x ∈ {0, 1} | p | p(1 − p) | 1 − p + pe^s |
| Binomial | Bin(n, p) | I_{1−p}(n − x, x + 1) | C(n, x) p^x (1 − p)^{n−x} | np | np(1 − p) | (1 − p + pe^s)^n |
| Multinomial | Mult(n, p) | — | (n!/(x₁!⋯x_k!)) p₁^{x₁}⋯p_k^{x_k}, Σᵢ₌₁ᵏ xᵢ = n | npᵢ | npᵢ(1 − pᵢ) | (Σᵢ₌₁ᵏ pᵢ e^{sᵢ})ⁿ |
| Hypergeometric | Hyp(N, m, n) | ≈ Φ((x − np)/√(np(1 − p))), p = m/N | C(m, x) C(N − m, n − x)/C(N, n) | nm/N | nm(N − n)(N − m)/(N²(N − 1)) | N/A |
| Negative Binomial | NBin(r, p) | I_p(r, x + 1) | C(x + r − 1, r − 1) p^r (1 − p)^x | r(1 − p)/p | r(1 − p)/p² | (p/(1 − (1 − p)e^s))^r |
| Geometric | Geo(p) | 1 − (1 − p)^x, x ∈ ℕ⁺ | p(1 − p)^{x−1}, x ∈ ℕ⁺ | 1/p | (1 − p)/p² | pe^s/(1 − (1 − p)e^s) |
| Poisson | Po(λ) | e^{−λ} Σᵢ₌₀^{⌊x⌋} λⁱ/i! | λ^x e^{−λ}/x! | λ | λ | e^{λ(e^s − 1)} |

[Figure: PMFs of the Uniform (discrete), Binomial (n = 40, p = 0.3; n = 30, p = 0.6; n = 25, p = 0.9), Geometric (p = 0.2, 0.5, 0.8), and Poisson (λ = 1, 4, 10) distributions.]

¹ We use the notation γ(s, x) and Γ(x) to refer to the Gamma functions (see §22.1), and use B(x, y) and I_x to refer to the Beta functions (see §22.2).
1.2 Continuous Distributions

| Distribution | Notation | F_X(x) | f_X(x) | E[X] | V[X] | M_X(s) |
|---|---|---|---|---|---|---|
| Uniform | Unif(a, b) | (x − a)/(b − a) for a < x < b | I(a < x < b)/(b − a) | (a + b)/2 | (b − a)²/12 | (e^{bs} − e^{as})/(s(b − a)) |
| Normal | N(µ, σ²) | Φ((x − µ)/σ) | (1/(σ√(2π))) exp{−(x − µ)²/(2σ²)} | µ | σ² | exp{µs + σ²s²/2} |
| Log-Normal | ln N(µ, σ²) | Φ((ln x − µ)/σ) | (1/(x√(2πσ²))) exp{−(ln x − µ)²/(2σ²)} | e^{µ + σ²/2} | (e^{σ²} − 1)e^{2µ + σ²} | N/A |
| Multivariate Normal | MVN(µ, Σ) | — | (2π)^{−k/2} det(Σ)^{−1/2} exp{−½(x − µ)ᵀΣ⁻¹(x − µ)} | µ | Σ | exp{µᵀs + ½sᵀΣs} |
| Student's t | Student(ν) | — | (Γ((ν + 1)/2)/(√(νπ) Γ(ν/2))) (1 + x²/ν)^{−(ν+1)/2} | 0 (ν > 1) | ν/(ν − 2) (ν > 2) | N/A |
| Chi-square | χ²_k | γ(k/2, x/2)/Γ(k/2) | x^{k/2 − 1} e^{−x/2}/(2^{k/2} Γ(k/2)) | k | 2k | (1 − 2s)^{−k/2} (s < 1/2) |
| F | F(d₁, d₂) | I_{d₁x/(d₁x + d₂)}(d₁/2, d₂/2) | √((d₁x)^{d₁} d₂^{d₂}/(d₁x + d₂)^{d₁+d₂}) / (x B(d₁/2, d₂/2)) | d₂/(d₂ − 2) (d₂ > 2) | 2d₂²(d₁ + d₂ − 2)/(d₁(d₂ − 2)²(d₂ − 4)) (d₂ > 4) | N/A |
| Exponential | Exp(β) | 1 − e^{−x/β} | (1/β) e^{−x/β} | β | β² | 1/(1 − βs) (s < 1/β) |
| Gamma | Gamma(α, β) | γ(α, x/β)/Γ(α) | x^{α−1} e^{−x/β}/(Γ(α) β^α) | αβ | αβ² | (1/(1 − βs))^α (s < 1/β) |
| Inverse Gamma | InvGamma(α, β) | Γ(α, β/x)/Γ(α) | (β^α/Γ(α)) x^{−α−1} e^{−β/x} | β/(α − 1) (α > 1) | β²/((α − 1)²(α − 2)) (α > 2) | (2(−βs)^{α/2}/Γ(α)) K_α(√(−4βs)) |
| Dirichlet | Dir(α) | — | (Γ(Σᵢαᵢ)/Πᵢ Γ(αᵢ)) Πᵢ xᵢ^{αᵢ − 1} | αᵢ/Σ_k α_k | E[Xᵢ](1 − E[Xᵢ])/(Σ_k α_k + 1) | — |
| Beta | Beta(α, β) | I_x(α, β) | (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1} | α/(α + β) | αβ/((α + β)²(α + β + 1)) | 1 + Σ_{k=1}^∞ (Π_{r=0}^{k−1} (α + r)/(α + β + r)) s^k/k! |
| Weibull | Weibull(λ, k) | 1 − e^{−(x/λ)^k} | (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k} | λΓ(1 + 1/k) | λ²Γ(1 + 2/k) − µ² | Σ_{n=0}^∞ (sⁿλⁿ/n!) Γ(1 + n/k) |
| Pareto | Pareto(x_m, α) | 1 − (x_m/x)^α, x ≥ x_m | αx_m^α/x^{α+1}, x ≥ x_m | αx_m/(α − 1) (α > 1) | x_m²α/((α − 1)²(α − 2)) (α > 2) | α(−x_m s)^α Γ(−α, −x_m s) (s < 0) |

[Figure: PDFs of the Uniform (continuous), Normal, Log-normal, Student's t (ν = 1, 2, 5, ∞), χ² (k = 1, …, 5), F, Exponential, Gamma, Inverse Gamma, Beta, Weibull, and Pareto distributions for several parameter settings.]
2
Probability Theory
Law of Total Probability
Definitions • • • •
P [B] =
Sample space Ω Outcome (point or element) ω ∈ Ω Event A ⊆ Ω σ-algebra A
P [B|Ai ] P [Ai ]
n G
Ai
i=1
Bayes’ Theorem P [B | Ai ] P [Ai ] P [Ai | B] = Pn j=1 P [B | Aj ] P [Aj ] Inclusion-Exclusion Principle [ X n n Ai = (−1)r−1
• Probability distribution P 1. P [A] ≥ 0 for every A 2. P [Ω] = 1 "∞ # ∞ G X 3. P Ai = P [Ai ]
i=1
3
r=1
X
Ω=
n G
Ai
i=1
\ r A ij
i≤i1 x + y | X > y] = P [X > x]
Inequalities
Cauchy-Schwarz
Normal • X ∼ N µ, σ 2
2 E [XY ] ≤ E X 2 E Y 2
Markov P [ϕ(X) ≥ t] ≤
E [ϕ(X)] t
Chebyshev P [|X − E [X]| ≥ t] ≤ Chernoff
P [X ≥ (1 + δ)µ] ≤
• • • •
V [X] t2
eδ (1 + δ)1+δ
δ > −1
E [ϕ(X)] ≥ ϕ(E [X]) ϕ convex
Distribution Relationships
X−µ σ
2
Gamma • X ∼ Gamma (α, β) ⇐⇒ X/β ∼ Gamma (α, 1) Pα • Gamma (α, β) ∼ i=1 Exp (β) P P • Xi ∼ Gamma (αi , β) ∧ Xi ⊥ ⊥ Xj =⇒ i Xi ∼ Gamma ( i αi , β) Z ∞ Γ(α) α−1 −λx • = x e dx λα 0 Beta
Binomial • Xi ∼ Bern (p) =⇒
∼ N (0, 1) X ∼ N µ, σ ∧ Z = aX + b =⇒ Z ∼ N aµ + b, a2 σ 2 X ∼ N µ1 , σ12 ∧ Y ∼ N µ2 , σ22 =⇒ X + Y ∼ N µ1 + µ2 , σ12 + σ22 P P P 2 Xi ∼ N µi , σi2 =⇒ X ∼N i µi , i σi i i − Φ a−µ P [a < X ≤ b] = Φ b−µ σ σ =⇒
• Φ(−x) = 1 − Φ(x) φ0 (x) = −xφ(x) φ00 (x) = (x2 − 1)φ(x) −1 • Upper quantile of N (0, 1): zα = Φ (1 − α)
Jensen
7
n X
1 Γ(α + β) α−1 xα−1 (1 − x)β−1 = x (1 − x)β−1 B(α, β) Γ(α)Γ(β) B(α + k, β) α+k−1 • E Xk = = E X k−1 B(α, β) α+β+k−1 • Beta (1, 1) ∼ Unif (0, 1) •
Xi ∼ Bin (n, p)
i=1
• X ∼ Bin (n, p) , Y ∼ Bin (m, p) =⇒ X + Y ∼ Bin (n + m, p) • limn→∞ Bin (n, p) = Po (np) (n large, p small)
8
8
8 Probability and Moment Generating Functions

• G_X(t) = E[t^X], |t| < 1
• M_X(t) = G_X(e^t) = E[e^{Xt}] = E[Σᵢ₌₀^∞ (Xt)ⁱ/i!] = Σᵢ₌₀^∞ (E[Xⁱ]/i!) tⁱ
• P[X = 0] = G_X(0)
• P[X = 1] = G′_X(0)
• P[X = i] = G_X^{(i)}(0)/i!
• E[X] = G′_X(1⁻)
• E[X^k] = M_X^{(k)}(0)
• E[X!/(X − k)!] = G_X^{(k)}(1⁻)
• V[X] = G″_X(1⁻) + G′_X(1⁻) − (G′_X(1⁻))²
• G_X(t) = G_Y(t) ⟹ X =ᵈ Y

9 Multivariate Distributions

9.1 Standard Bivariate Normal

Let X, Z ∼ N(0, 1) with X ⊥⊥ Z, and let Y = ρX + √(1 − ρ²) Z.

Joint density:

f(x, y) = (1/(2π√(1 − ρ²))) exp{−(x² + y² − 2ρxy)/(2(1 − ρ²))}

Conditionals:

(Y | X = x) ∼ N(ρx, 1 − ρ²)   and   (X | Y = y) ∼ N(ρy, 1 − ρ²)

Independence: X ⊥⊥ Y ⟺ ρ = 0

9.2 Bivariate Normal

Let X ∼ N(µ_x, σ_x²) and Y ∼ N(µ_y, σ_y²).

f(x, y) = (1/(2πσ_xσ_y√(1 − ρ²))) exp{−z/(2(1 − ρ²))}

z = ((x − µ_x)/σ_x)² + ((y − µ_y)/σ_y)² − 2ρ((x − µ_x)/σ_x)((y − µ_y)/σ_y)

Conditional mean and variance:

E[X | Y] = E[X] + ρ (σ_X/σ_Y)(Y − E[Y])
V[X | Y] = σ_X²(1 − ρ²)

9.3 Multivariate Normal

Covariance matrix Σ (precision matrix Σ⁻¹):

Σ = [ V[X₁] ⋯ Cov[X₁, X_k] ; ⋮ ⋱ ⋮ ; Cov[X_k, X₁] ⋯ V[X_k] ]

If X ∼ N(µ, Σ),

f_X(x) = (2π)^{−n/2} |Σ|^{−1/2} exp{−½(x − µ)ᵀΣ⁻¹(x − µ)}

Properties:
• Z ∼ N(0, 1) ∧ X = µ + Σ^{1/2}Z ⟹ X ∼ N(µ, Σ)
• X ∼ N(µ, Σ) ⟹ Σ^{−1/2}(X − µ) ∼ N(0, 1)
• X ∼ N(µ, Σ) ⟹ AX ∼ N(Aµ, AΣAᵀ)
• X ∼ N(µ, Σ) ∧ a a vector of length k ⟹ aᵀX ∼ N(aᵀµ, aᵀΣa)
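The first property above (X = µ + Σ^{1/2}Z) is also how multivariate normal samples are generated in practice. A minimal sketch, assuming Python/NumPy, using a Cholesky factor L in place of the symmetric square root (both satisfy LLᵀ = Σ):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

# X = mu + L Z with L L^T = Sigma reproduces N(mu, Sigma).
L = np.linalg.cholesky(Sigma)
Z = rng.standard_normal((100_000, 2))
X = mu + Z @ L.T

print(X.mean(axis=0))   # ~ mu
print(np.cov(X.T))      # ~ Sigma
```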
3. Almost surely (strongly): Xn → X h i h i P lim Xn = X = P ω ∈ Ω : lim Xn (ω) = X(ω) = 1 n→∞
n→∞
9
qm
4. In quadratic mean (L2 ): Xn → X
CLT Notations Zn ≈ N (0, 1) σ2 ¯ Xn ≈ N µ, n σ2 ¯ Xn − µ ≈ N 0, n √ 2 ¯ n − µ) ≈ N 0, σ n(X √ ¯ n(Xn − µ) ≈ N (0, 1) n
lim E (Xn − X)2 = 0
n→∞
Relationships qm
P
D
• Xn → X =⇒ Xn → X =⇒ Xn → X as P • Xn → X =⇒ Xn → X D P • Xn → X ∧ (∃c ∈ R) P [X = c] = 1 =⇒ Xn → X • • • •
Xn Xn Xn Xn
P
→X qm →X P →X P →X
∧ Yn ∧ Yn ∧ Yn =⇒
P
P
→ Y =⇒ Xn + Yn → X + Y qm qm → Y =⇒ Xn + Yn → X + Y P P → Y =⇒ Xn Yn → XY P ϕ(Xn ) → ϕ(X)
D
Continuity Correction x + 12 − µ √ σ/ n x − 12 − µ ¯ √ P Xn ≥ x ≈ 1 − Φ σ/ n
D
• Xn → X =⇒ ϕ(Xn ) → ϕ(X) qm • Xn → b ⇐⇒ limn→∞ E [Xn ] = b ∧ limn→∞ V [Xn ] = 0 qm ¯n → • X1 , . . . , Xn iid ∧ E [X] = µ ∧ V [X] < ∞ ⇐⇒ X µ
¯n ≤ x ≈ Φ P X
Slutzky’s Theorem D
P
Delta Method
D
• Xn → X and Yn → c =⇒ Xn + Yn → X + c D P D • Xn → X and Yn → c =⇒ Xn Yn → cX D D D • In general: Xn → X and Yn → Y =⇒ 6 Xn + Yn → X + Y
Yn ≈ N
11 10.1
Law of Large Numbers (LLN)
σ2 µ, n
=⇒ ϕ(Yn ) ≈ N
σ2 ϕ(µ), (ϕ (µ)) n 0
2
Statistical Inference iid
Let X1 , · · · , Xn ∼ F if not otherwise noted.
Let {X1 , . . . , Xn } be a sequence of iid rv’s, E [X1 ] = µ, and V [X1 ] < ∞.
11.1
Weak (WLLN) P ¯n → X µ
as n → ∞
as ¯n → X µ
as n → ∞
• Point estimator θbn of θ is a rv: θbn = g(X1 , . . . , Xn ) h i • bias(θbn ) = E θbn − θ
Strong (SLLN)
10.2
Central Limit Theorem (CLT)
Let {X1 , . . . , Xn } be a sequence of iid rv’s, E [X1 ] = µ, and V [X1 ] = σ 2 . √ ¯ ¯n − µ n(Xn − µ) D X →Z Zn := q = σ ¯ V Xn lim P [Zn ≤ z] = Φ(z)
n→∞
Point Estimation
where Z ∼ N (0, 1)
z∈R
P • Consistency: θbn → θ • Sampling distribution: F (θbn ) r h i b • Standard error: se(θn ) = V θbn h i h i • Mean squared error: mse = E (θbn − θ)2 = bias(θbn )2 + V θbn
• limn→∞ bias(θbn ) = 0 ∧ limn→∞ se(θbn ) = 0 =⇒ θbn is consistent θbn − θ D • Asymptotic normality: → N (0, 1) se • Slutzky’s Theorem often lets us replace se(θbn ) by some (weakly) consistent estimator σ bn . 10
11.2 Normal-based Confidence Interval

Suppose θ̂ₙ ≈ N(θ, ŝe²). Let z_{α/2} = Φ⁻¹(1 − α/2), i.e., P[Z > z_{α/2}] = α/2 and P[−z_{α/2} < Z < z_{α/2}] = 1 − α where Z ∼ N(0, 1). Then

Cₙ = θ̂ₙ ± z_{α/2} ŝe

11.3 Empirical Distribution Function

Empirical distribution function (ECDF):

F̂ₙ(x) = (1/n) Σᵢ₌₁ⁿ I(Xᵢ ≤ x),   I(Xᵢ ≤ x) = 1 if Xᵢ ≤ x, 0 if Xᵢ > x

Properties (for any fixed x):
• E[F̂ₙ(x)] = F(x)
• V[F̂ₙ(x)] = F(x)(1 − F(x))/n
• mse = F(x)(1 − F(x))/n → 0
• F̂ₙ(x) →ᴾ F(x)

Dvoretzky-Kiefer-Wolfowitz (DKW) inequality (X₁, …, Xₙ ∼ F):

P[sup_x |F(x) − F̂ₙ(x)| > ε] ≤ 2e^{−2nε²}

Nonparametric 1 − α confidence band for F:

L(x) = max{F̂ₙ(x) − εₙ, 0},   U(x) = min{F̂ₙ(x) + εₙ, 1},   εₙ = √((1/(2n)) log(2/α))

P[L(x) ≤ F(x) ≤ U(x) ∀x] ≥ 1 − α
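A minimal sketch of the ECDF and its DKW band, assuming Python/NumPy:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.normal(size=200))
n = x.size
Fhat = np.arange(1, n + 1) / n          # ECDF evaluated at the order statistics

alpha = 0.05
eps = np.sqrt(np.log(2 / alpha) / (2 * n))
lower = np.clip(Fhat - eps, 0, 1)
upper = np.clip(Fhat + eps, 0, 1)
# (lower, upper) covers the true F everywhere with probability >= 1 - alpha.
print(eps, lower[:3], upper[:3])
```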
11.4 Statistical Functionals

• Statistical functional: T(F)
• Plug-in estimator of θ = T(F): θ̂ₙ = T(F̂ₙ)
• Linear functional: T(F) = ∫ ϕ(x) dF_X(x)
• Plug-in estimator for linear functional: T(F̂ₙ) = ∫ ϕ(x) dF̂ₙ(x) = (1/n) Σᵢ₌₁ⁿ ϕ(Xᵢ)
• Often: T(F̂ₙ) ≈ N(T(F), ŝe²) ⟹ T(F̂ₙ) ± z_{α/2} ŝe
• pth quantile: F⁻¹(p) = inf{x : F(x) ≥ p}
• µ̂ = X̄ₙ
• σ̂² = (1/(n − 1)) Σᵢ₌₁ⁿ (Xᵢ − X̄ₙ)²
• κ̂ = (1/n) Σᵢ₌₁ⁿ (Xᵢ − µ̂)³ / σ̂³
• ρ̂ = Σᵢ₌₁ⁿ (Xᵢ − X̄ₙ)(Yᵢ − Ȳₙ) / (√(Σᵢ₌₁ⁿ (Xᵢ − X̄ₙ)²) √(Σᵢ₌₁ⁿ (Yᵢ − Ȳₙ)²))

12 Parametric Inference

Let 𝔉 = {f(x; θ) : θ ∈ Θ} be a parametric model with parameter space Θ ⊂ ℝᵏ and parameter θ = (θ₁, …, θ_k).

12.1 Method of Moments

jth moment: αⱼ(θ) = E[Xʲ] = ∫ xʲ dF_X(x)

jth sample moment: α̂ⱼ = (1/n) Σᵢ₌₁ⁿ Xᵢʲ

Method of Moments estimator (MoM): solve the system

α₁(θ) = α̂₁
α₂(θ) = α̂₂
⋮
α_k(θ) = α̂_k

Properties of the MoM estimator:
• θ̂ₙ exists with probability tending to 1
• Consistency: θ̂ₙ →ᴾ θ
• Asymptotic normality: √n (θ̂ − θ) →ᴰ N(0, Σ) where Σ = g E[YYᵀ] gᵀ, Y = (X, X², …, Xᵏ)ᵀ, g = (g₁, …, g_k), and gⱼ = ∂αⱼ⁻¹(θ)/∂θ
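A worked sketch of the MoM recipe (Python/NumPy assumed; the Gamma model is my choice of example) in the table's parameterization, where E[X] = αβ and V[X] = αβ²: matching the first two moments gives β̂ = σ̂²/x̄ and α̂ = x̄/β̂.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha_true, beta_true = 3.0, 2.0
x = rng.gamma(shape=alpha_true, scale=beta_true, size=10_000)

# Match alpha*beta = sample mean and alpha*beta^2 = sample variance, then solve.
m, v = x.mean(), x.var()
beta_hat = v / m
alpha_hat = m / beta_hat
print(alpha_hat, beta_hat)   # ~ (3, 2)
```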
12.2 Maximum Likelihood

Likelihood: Lₙ : Θ → [0, ∞), Lₙ(θ) = Πᵢ₌₁ⁿ f(Xᵢ; θ)

Log-likelihood: ℓₙ(θ) = log Lₙ(θ) = Σᵢ₌₁ⁿ log f(Xᵢ; θ)

Maximum likelihood estimator (mle): Lₙ(θ̂ₙ) = sup_θ Lₙ(θ)

Score function: s(X; θ) = ∂ log f(X; θ)/∂θ

Fisher information: I(θ) = V_θ[s(X; θ)], Iₙ(θ) = nI(θ)

Fisher information (exponential family): I(θ) = E_θ[−∂s(X; θ)/∂θ]

Observed Fisher information: Iₙᵒᵇˢ(θ) = −(∂²/∂θ²) Σᵢ₌₁ⁿ log f(Xᵢ; θ)

Properties of the mle:
• Consistency: θ̂ₙ →ᴾ θ
• Equivariance: θ̂ₙ is the mle ⟹ ϕ(θ̂ₙ) is the mle of ϕ(θ)
• Asymptotic normality:
  1. se ≈ √(1/Iₙ(θ)) and (θ̂ₙ − θ)/se →ᴰ N(0, 1)
  2. ŝe ≈ √(1/Iₙ(θ̂ₙ)) and (θ̂ₙ − θ)/ŝe →ᴰ N(0, 1)
• Asymptotic optimality (or efficiency), i.e., smallest variance for large samples. If θ̃ₙ is any other estimator, the asymptotic relative efficiency is are(θ̃ₙ, θ̂ₙ) = V[θ̂ₙ]/V[θ̃ₙ] ≤ 1
• Approximately the Bayes estimator

12.2.1 Delta Method

If τ = ϕ(θ) where ϕ is differentiable and ϕ′(θ) ≠ 0:

(τ̂ₙ − τ)/ŝe(τ̂) →ᴰ N(0, 1)

where τ̂ = ϕ(θ̂) is the mle of τ and ŝe(τ̂) = |ϕ′(θ̂)| ŝe(θ̂ₙ).
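A concrete sketch of §12.2 and §12.2.1 (Python/NumPy; the exponential model with rate λ is my choice of example): for f(x; λ) = λe^{−λx} the mle is λ̂ = 1/x̄, the Fisher information is I(λ) = 1/λ², so ŝe ≈ √(1/Iₙ(λ̂)) = λ̂/√n, and the delta method gives the se of τ = ϕ(λ) = 1/λ.

```python
import numpy as np

rng = np.random.default_rng(4)
lam_true, n = 2.0, 5_000
x = rng.exponential(scale=1 / lam_true, size=n)

lam_hat = 1 / x.mean()                  # mle
se_hat = lam_hat / np.sqrt(n)           # sqrt(1 / I_n(lam_hat)), since I(lam) = 1/lam^2
z = 1.96
print(lam_hat, (lam_hat - z * se_hat, lam_hat + z * se_hat))

tau_hat = 1 / lam_hat                   # delta method: tau = phi(lam) = 1/lam (the mean)
se_tau = se_hat / lam_hat ** 2          # |phi'(lam_hat)| * se(lam_hat)
print(tau_hat, se_tau)
```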
12.3 Multiparameter Models

Let θ = (θ₁, …, θ_k) and let θ̂ = (θ̂₁, …, θ̂_k) be the mle. Define

H_jj = ∂²ℓₙ/∂θⱼ²,   H_jk = ∂²ℓₙ/∂θⱼ∂θ_k

Fisher information matrix:

Iₙ(θ) = −[ E_θ[H₁₁] ⋯ E_θ[H₁ₖ] ; ⋮ ⋱ ⋮ ; E_θ[Hₖ₁] ⋯ E_θ[Hₖₖ] ]

Under appropriate regularity conditions,

(θ̂ − θ) ≈ N(0, Jₙ)   with Jₙ(θ) = Iₙ⁻¹(θ).

Further, if θ̂ⱼ is the jth component of θ̂, then

(θ̂ⱼ − θⱼ)/ŝeⱼ →ᴰ N(0, 1)

where ŝeⱼ² = Jₙ(j, j) and Cov[θ̂ⱼ, θ̂ₖ] = Jₙ(j, k).

12.3.1 Multiparameter Delta Method

Let τ = ϕ(θ₁, …, θ_k) be a function and let the gradient of ϕ be ∇ϕ = (∂ϕ/∂θ₁, …, ∂ϕ/∂θ_k)ᵀ.

Suppose ∇ϕ|_{θ=θ̂} ≠ 0 and τ̂ = ϕ(θ̂). Then

(τ̂ − τ)/ŝe(τ̂) →ᴰ N(0, 1)

where ŝe(τ̂) = √((∇̂ϕ)ᵀ Ĵₙ (∇̂ϕ)), Ĵₙ = Jₙ(θ̂), and ∇̂ϕ = ∇ϕ|_{θ=θ̂}.

12.4 Parametric Bootstrap

Sample from f(x; θ̂ₙ) instead of from F̂ₙ, where θ̂ₙ could be the mle or the method of moments estimator.
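A minimal parametric-bootstrap sketch (Python/NumPy assumed; the exponential model with mle λ̂ = 1/x̄ reuses the earlier example): resample from f(x; λ̂ₙ) and use the spread of the replicated estimates as a standard error.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(scale=0.5, size=1_000)   # "observed" data, true rate 2
lam_hat = 1 / x.mean()                       # mle of the rate

B = 2_000
lam_star = np.empty(B)
for b in range(B):
    xs = rng.exponential(scale=1 / lam_hat, size=x.size)  # draw from f(x; lam_hat)
    lam_star[b] = 1 / xs.mean()

print(lam_star.std(ddof=1))                  # bootstrap estimate of se(lam_hat)
```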
13 Hypothesis Testing

H₀ : θ ∈ Θ₀   versus   H₁ : θ ∈ Θ₁

Definitions
• Null hypothesis H₀
• Alternative hypothesis H₁
• Simple hypothesis: θ = θ₀
• Composite hypothesis: θ > θ₀ or θ < θ₀
• Two-sided test: H₀ : θ = θ₀ versus H₁ : θ ≠ θ₀
• One-sided test: H₀ : θ ≤ θ₀ versus H₁ : θ > θ₀
• Critical value c
• Test statistic T
• Rejection region R = {x : T(x) > c}
• Power function β(θ) = P[X ∈ R]
• Power of a test: 1 − P[Type II error] = 1 − β = inf_{θ∈Θ₁} β(θ)
• Test size: α = P[Type I error] = sup_{θ∈Θ₀} β(θ)

|         | Retain H₀           | Reject H₀          |
|---------|---------------------|--------------------|
| H₀ true | √                   | Type I error (α)   |
| H₁ true | Type II error (β)   | √ (power)          |

p-value
• p-value = sup_{θ∈Θ₀} P_θ[T(X) ≥ T(x)] = inf{α : T(x) ∈ R_α}
• p-value = sup_{θ∈Θ₀} P_θ[T(X*) ≥ T(X)] = inf{α : T(X) ∈ R_α}, since T(X*) ∼ F_θ; equivalently 1 − F_θ(T(X))

| p-value     | evidence                          |
|-------------|-----------------------------------|
| < 0.01      | very strong evidence against H₀   |
| 0.01 - 0.05 | strong evidence against H₀        |
| 0.05 - 0.1  | weak evidence against H₀          |
| > 0.1       | little or no evidence against H₀  |

Wald Test
• Two-sided test
• Reject H₀ when |W| > z_{α/2} where W = (θ̂ − θ₀)/ŝe
• P[|W| > z_{α/2}] → α
• p-value = P_{θ₀}[|W| > |w|] ≈ P[|Z| > |w|] = 2Φ(−|w|)

Likelihood Ratio Test (LRT)
• T(X) = sup_{θ∈Θ} Lₙ(θ)/sup_{θ∈Θ₀} Lₙ(θ) = Lₙ(θ̂ₙ)/Lₙ(θ̂ₙ,₀)
• λ(X) = 2 log T(X) →ᴰ χ²_{r−q}, where Σᵢ₌₁ᵏ Zᵢ² ∼ χ²_k with Z₁, …, Z_k iid∼ N(0, 1)
• p-value = P_{θ₀}[λ(X) > λ(x)] ≈ P[χ²_{r−q} > λ(x)]

Multinomial LRT
• Let p̂ₙ = (X₁/n, …, X_k/n) be the mle
• T(X) = Lₙ(p̂ₙ)/Lₙ(p₀) = Πⱼ₌₁ᵏ (p̂ⱼ/p₀ⱼ)^{Xⱼ}
• λ(X) = 2 Σⱼ₌₁ᵏ Xⱼ log(p̂ⱼ/p₀ⱼ) →ᴰ χ²_{k−1}
• The approximate size-α LRT rejects H₀ when λ(X) ≥ χ²_{k−1,α}
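A sketch of the Wald test for a mean (Python/NumPy; H₀: µ = 0 against a two-sided alternative, with ŝe = s/√n):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(6)
x = rng.normal(loc=0.2, size=100)

theta0 = 0.0
W = (x.mean() - theta0) / (x.std(ddof=1) / np.sqrt(x.size))
Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))
p_value = 2 * Phi(-abs(W))        # p-value = 2 * Phi(-|w|)
print(W, p_value)
```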
Pearson χ² Test
• T = Σⱼ₌₁ᵏ (Xⱼ − E[Xⱼ])²/E[Xⱼ], where E[Xⱼ] = np₀ⱼ under H₀
• T →ᴰ χ²_{k−1}
• p-value = P[χ²_{k−1} > T(x)]
• Converges to χ²_{k−1} faster than the LRT, hence preferable for small n

Independence Testing
• I rows, J columns, X multinomial sample of size n = I · J
• mles unconstrained: p̂ᵢⱼ = Xᵢⱼ/n
• mles under H₀: p̂⁰ᵢⱼ = p̂ᵢ· p̂·ⱼ = (Xᵢ·/n)(X·ⱼ/n)
• LRT: λ = 2 Σᵢ₌₁ᴵ Σⱼ₌₁ᴶ Xᵢⱼ log(nXᵢⱼ/(Xᵢ· X·ⱼ))
• Pearson χ²: T = Σᵢ₌₁ᴵ Σⱼ₌₁ᴶ (Xᵢⱼ − E[Xᵢⱼ])²/E[Xᵢⱼ]
• LRT and Pearson →ᴰ χ²_ν, where ν = (I − 1)(J − 1)

14 Bayesian Inference

Definitions
• Xⁿ = (X₁, …, Xₙ), xⁿ = (x₁, …, xₙ)
• Prior density f(θ)
• Likelihood f(xⁿ | θ): joint density of the data. In particular, Xⁿ iid ⟹ f(xⁿ | θ) = Πᵢ₌₁ⁿ f(xᵢ | θ) = Lₙ(θ)
• Posterior density f(θ | xⁿ)
• Normalizing constant cₙ = f(xⁿ) = ∫ f(x | θ)f(θ) dθ
• Kernel: part of a density that depends on θ
• Posterior mean θ̄ₙ = ∫ θ f(θ | xⁿ) dθ = ∫ θLₙ(θ)f(θ) dθ / ∫ Lₙ(θ)f(θ) dθ

Bayes' Theorem

f(θ | xⁿ) = f(xⁿ | θ)f(θ)/f(xⁿ) = f(xⁿ | θ)f(θ)/∫ f(xⁿ | θ)f(θ) dθ ∝ Lₙ(θ)f(θ)

14.1 Credible Intervals

1 − α posterior interval:

P[θ ∈ (a, b) | xⁿ] = ∫ₐᵇ f(θ | xⁿ) dθ = 1 − α

1 − α equal-tail credible interval:

∫_{−∞}^a f(θ | xⁿ) dθ = ∫_b^∞ f(θ | xⁿ) dθ = α/2

1 − α highest posterior density (HPD) region Rₙ:
1. P[θ ∈ Rₙ] = 1 − α
2. Rₙ = {θ : f(θ | xⁿ) > k} for some k

Rₙ is unimodal ⟹ Rₙ is an interval.

14.2 Function of Parameters

Let τ = ϕ(θ) and A = {θ : ϕ(θ) ≤ τ}.

Posterior CDF for τ: H(τ | xⁿ) = P[ϕ(θ) ≤ τ | xⁿ] = ∫_A f(θ | xⁿ) dθ

Posterior density: h(τ | xⁿ) = H′(τ | xⁿ)

Bayesian delta method: τ | Xⁿ ≈ N(ϕ(θ̂), (ŝe ϕ′(θ̂))²)
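The credible-interval definitions of §14.1 in code, a sketch (Python/NumPy) assuming posterior draws `theta` from any sampler (faked here with normal draws):

```python
import numpy as np

rng = np.random.default_rng(7)
theta = rng.normal(loc=1.0, scale=0.3, size=50_000)  # stand-in posterior draws

alpha = 0.05
# Equal-tail credible interval: alpha/2 posterior mass in each tail.
print(np.quantile(theta, [alpha / 2, 1 - alpha / 2]))

# Shortest interval containing 1 - alpha of the draws (HPD for unimodal posteriors).
s = np.sort(theta)
k = int(np.ceil((1 - alpha) * s.size))
widths = s[k - 1:] - s[:s.size - k + 1]
i = widths.argmin()
print(s[i], s[i + k - 1])
```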
14.3 Priors

Choice
• Subjective Bayesianism: the prior should incorporate as much detail as possible of the researcher's a priori knowledge, via prior elicitation.
• Objective Bayesianism: the prior should incorporate as little detail as possible (non-informative prior).
• Robust Bayesianism: consider various priors and determine the sensitivity of our inferences to changes in the prior.

Types
• Flat: f(θ) ∝ constant
• Proper: ∫_{−∞}^∞ f(θ) dθ = 1
• Improper: ∫_{−∞}^∞ f(θ) dθ = ∞
• Jeffreys' prior (transformation-invariant): f(θ) ∝ √I(θ), f(θ) ∝ √det(I(θ))
• Conjugate: f(θ) and f(θ | xⁿ) belong to the same parametric family

14.3.1 Conjugate Priors

Discrete likelihood:

| Likelihood | Conjugate prior | Posterior hyperparameters |
|---|---|---|
| Bernoulli(p) | Beta(α, β) | α + Σᵢ₌₁ⁿ xᵢ, β + n − Σᵢ₌₁ⁿ xᵢ |
| Binomial(p) | Beta(α, β) | α + Σᵢ₌₁ⁿ xᵢ, β + Σᵢ₌₁ⁿ Nᵢ − Σᵢ₌₁ⁿ xᵢ |
| Negative Binomial(p) | Beta(α, β) | α + rn, β + Σᵢ₌₁ⁿ xᵢ |
| Poisson(λ) | Gamma(α, β) | α + Σᵢ₌₁ⁿ xᵢ, β + n |
| Multinomial(p) | Dirichlet(α) | α + Σᵢ₌₁ⁿ x⁽ⁱ⁾ |
| Geometric(p) | Beta(α, β) | α + n, β + Σᵢ₌₁ⁿ xᵢ |

Continuous likelihood (subscript c denotes a constant):

| Likelihood | Conjugate prior | Posterior hyperparameters |
|---|---|---|
| Uniform(0, θ) | Pareto(x_m, k) | max{x₍ₙ₎, x_m}, k + n |
| Exponential(λ) | Gamma(α, β) | α + n, β + Σᵢ₌₁ⁿ xᵢ |
| Normal(µ, σ_c²) | Normal(µ₀, σ₀²) | (µ₀/σ₀² + Σᵢ₌₁ⁿ xᵢ/σ_c²)/(1/σ₀² + n/σ_c²), (1/σ₀² + n/σ_c²)⁻¹ |
| Normal(µ_c, σ²) | Scaled Inverse Chi-square(ν, σ₀²) | ν + n, (νσ₀² + Σᵢ₌₁ⁿ (xᵢ − µ_c)²)/(ν + n) |
| Normal(µ, σ²) | Normal-scaled Inverse Gamma(λ, ν, α, β) | (νλ + nx̄)/(ν + n), ν + n, α + n/2, β + ½Σᵢ₌₁ⁿ (xᵢ − x̄)² + nν(x̄ − λ)²/(2(n + ν)) |
| MVN(µ, Σ_c) | MVN(µ₀, Σ₀) | (Σ₀⁻¹ + nΣ_c⁻¹)⁻¹(Σ₀⁻¹µ₀ + nΣ_c⁻¹x̄), (Σ₀⁻¹ + nΣ_c⁻¹)⁻¹ |
| MVN(µ_c, Σ) | Inverse-Wishart(κ, Ψ) | n + κ, Ψ + Σᵢ₌₁ⁿ (xᵢ − µ_c)(xᵢ − µ_c)ᵀ |
| Pareto(x_mc, k) | Gamma(α, β) | α + n, β + Σᵢ₌₁ⁿ log(xᵢ/x_mc) |
| Pareto(x_m, k_c) | Pareto(x₀, k₀) | x₀, k₀ − kn where k₀ > kn |
| Gamma(α_c, β) | Gamma(α₀, β₀) | α₀ + nα_c, β₀ + Σᵢ₌₁ⁿ xᵢ |
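The first table row in code: a Beta(α, β) prior with Bernoulli data updates to Beta(α + Σxᵢ, β + n − Σxᵢ). A minimal sketch (Python/NumPy):

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.binomial(1, 0.7, size=50)        # Bernoulli(0.7) data

a0, b0 = 2.0, 2.0                        # Beta(2, 2) prior on p
a_post = a0 + x.sum()
b_post = b0 + x.size - x.sum()

post_mean = a_post / (a_post + b_post)   # posterior mean of p
print(a_post, b_post, post_mean)
```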
14.4 Bayesian Testing

If H₀ : θ ∈ Θ₀:

Prior probability: P[H₀] = ∫_{Θ₀} f(θ) dθ
Posterior probability: P[H₀ | xⁿ] = ∫_{Θ₀} f(θ | xⁿ) dθ

Let H₀, …, H_{K−1} be K hypotheses. Suppose θ ∼ f(θ | H_k). Then

P[H_k | xⁿ] = f(xⁿ | H_k) P[H_k] / Σ_{k=1}^K f(xⁿ | H_k) P[H_k]

Marginal likelihood: f(xⁿ | Hᵢ) = ∫_Θ f(xⁿ | θ, Hᵢ) f(θ | Hᵢ) dθ

Posterior odds (of Hᵢ relative to Hⱼ):

P[Hᵢ | xⁿ]/P[Hⱼ | xⁿ] = (f(xⁿ | Hᵢ)/f(xⁿ | Hⱼ)) × (P[Hᵢ]/P[Hⱼ]) = Bayes factor BFᵢⱼ × prior odds

| log₁₀ BF₁₀ | BF₁₀      | evidence |
|------------|-----------|----------|
| 0 - 0.5    | 1 - 3.2   | Weak     |
| 0.5 - 1    | 3.2 - 10  | Moderate |
| 1 - 2      | 10 - 100  | Strong   |
| > 2        | > 100     | Decisive |

p* = (p/(1 − p)) BF₁₀ / (1 + (p/(1 − p)) BF₁₀),  where p = P[H₁] and p* = P[H₁ | xⁿ]

15 Exponential Family

Scalar parameter:

f_X(x | θ) = h(x) exp{η(θ)T(x) − A(θ)} = h(x)g(θ) exp{η(θ)T(x)}

Vector parameter:

f_X(x | θ) = h(x) exp{Σᵢ₌₁ˢ ηᵢ(θ)Tᵢ(x) − A(θ)} = h(x) exp{η(θ) · T(x) − A(θ)} = h(x)g(θ) exp{η(θ) · T(x)}

Natural form:

f_X(x | η) = h(x) exp{η · T(x) − A(η)} = h(x)g(η) exp{η · T(x)} = h(x)g(η) exp{ηᵀT(x)}

16 Sampling Methods

16.1 The Bootstrap

Let Tₙ = g(X₁, …, Xₙ) be a statistic.

1. Estimate V_F[Tₙ] with V_{F̂ₙ}[Tₙ].
2. Approximate V_{F̂ₙ}[Tₙ] using simulation:
   (a) Repeat the following B times to get T*ₙ,₁, …, T*ₙ,B, an iid sample from the sampling distribution implied by F̂ₙ:
       i. Sample uniformly X₁*, …, Xₙ* ∼ F̂ₙ.
       ii. Compute Tₙ* = g(X₁*, …, Xₙ*).
   (b) Then

       v_boot = V̂_{F̂ₙ} = (1/B) Σ_{b=1}^B (T*ₙ,b − (1/B) Σ_{r=1}^B T*ₙ,r)²

16.1.1 Bootstrap Confidence Intervals

Normal-based interval: Tₙ ± z_{α/2} ŝe_boot

Pivotal interval:
1. Location parameter θ = T(F)
2. Pivot Rₙ = θ̂ₙ − θ
3. Let H(r) = P[Rₙ ≤ r] be the cdf of Rₙ
4. Let R*ₙ,b = θ̂*ₙ,b − θ̂ₙ. Approximate H using the bootstrap: Ĥ(r) = (1/B) Σ_{b=1}^B I(R*ₙ,b ≤ r)
5. Let θ*_β denote the β sample quantile of (θ̂*ₙ,₁, …, θ̂*ₙ,B)
6. Let r*_β denote the β sample quantile of (R*ₙ,₁, …, R*ₙ,B), i.e., r*_β = θ*_β − θ̂ₙ
7. Then, an approximate 1 − α confidence interval is Cₙ = (â, b̂) with
   â = θ̂ₙ − Ĥ⁻¹(1 − α/2) = θ̂ₙ − r*_{1−α/2} = 2θ̂ₙ − θ*_{1−α/2}
   b̂ = θ̂ₙ − Ĥ⁻¹(α/2) = θ̂ₙ − r*_{α/2} = 2θ̂ₙ − θ*_{α/2}

Percentile interval: Cₙ = (θ*_{α/2}, θ*_{1−α/2})
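A sketch of the recipe above for the median (Python/NumPy): bootstrap standard error plus the percentile and pivotal intervals.

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.lognormal(size=300)
theta_hat = np.median(x)

B = 5_000
boot = np.array([np.median(rng.choice(x, size=x.size, replace=True))
                 for _ in range(B)])

se_boot = boot.std(ddof=1)
alpha = 0.05
lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
print("se:", se_boot)
print("percentile:", (lo, hi))
print("pivotal:", (2 * theta_hat - hi, 2 * theta_hat - lo))
```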
16.2 Rejection Sampling

Setup
• We can easily sample from g(θ)
• We want to sample from h(θ), but it is difficult
• We know h(θ) up to a proportionality constant: h(θ) = k(θ)/∫ k(θ) dθ
• Envelope condition: we can find M > 0 such that k(θ) ≤ Mg(θ) ∀θ

Algorithm
1. Draw θ_cand ∼ g(θ)
2. Generate u ∼ Unif(0, 1)
3. Accept θ_cand if u ≤ k(θ_cand)/(Mg(θ_cand))
4. Repeat until B values of θ_cand have been accepted

Example
• We can easily sample from the prior g(θ) = f(θ)
• Target is the posterior with h(θ) ∝ k(θ) = f(xⁿ | θ)f(θ)
• Envelope condition: f(xⁿ | θ) ≤ f(xⁿ | θ̂ₙ) = Lₙ(θ̂ₙ) ≡ M
• Algorithm:
  1. Draw θ_cand ∼ f(θ)
  2. Generate u ∼ Unif(0, 1)
  3. Accept θ_cand if u ≤ Lₙ(θ_cand)/Lₙ(θ̂ₙ)
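A minimal rejection sampler (Python/NumPy; the target and envelope are my toy choices): sample from the Beta(2, 4)-shaped kernel k(θ) = θ(1 − θ)³ using a Unif(0, 1) proposal.

```python
import numpy as np

rng = np.random.default_rng(10)

k = lambda t: t * (1 - t) ** 3          # unnormalized Beta(2, 4) kernel
M = 27 / 256                            # max of k on [0, 1], attained at t = 1/4,
                                        # so k(t) <= M * g(t) with g = Unif(0, 1)
B, accepted = 10_000, []
while len(accepted) < B:
    cand = rng.uniform()
    if rng.uniform() <= k(cand) / M:
        accepted.append(cand)

theta = np.array(accepted)
print(theta.mean())                     # ~ 2 / (2 + 4) = 1/3
```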
16.3 Importance Sampling

Sample from an importance function g rather than the target density h.

Algorithm to obtain an approximation to E[q(θ) | xⁿ]:
1. Sample from the prior θ₁, …, θ_B iid∼ f(θ)
2. For each i = 1, …, B, calculate wᵢ = Lₙ(θᵢ)/Σᵢ₌₁ᴮ Lₙ(θᵢ)
3. E[q(θ) | xⁿ] ≈ Σᵢ₌₁ᴮ q(θᵢ)wᵢ
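A sketch of this prior-as-importance-function scheme (Python/NumPy; the N(θ, 1) likelihood with a N(0, 1) prior is my toy choice): posterior mean of θ via normalized likelihood weights.

```python
import numpy as np

rng = np.random.default_rng(11)
x = rng.normal(loc=0.8, size=40)        # data, true theta = 0.8

B = 50_000
theta = rng.normal(size=B)              # draws from the prior N(0, 1)

# log L_n(theta_i) for a N(theta, 1) likelihood, then normalized weights.
loglik = -0.5 * ((x[None, :] - theta[:, None]) ** 2).sum(axis=1)
w = np.exp(loglik - loglik.max())
w /= w.sum()

print((w * theta).sum())                # ~ posterior mean of theta
```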
17 Decision Theory

Definitions
• Unknown quantity affecting our decision: θ ∈ Θ
• Decision rule: synonymous with an estimator θ̂
• Action a ∈ A: possible value of the decision rule. In the estimation context, the action is just an estimate of θ, θ̂(x).
• Loss function L: consequences of taking action a when the true state is θ, or the discrepancy between θ and θ̂; L : Θ × A → [−k, ∞).

Loss functions
• Squared error loss: L(θ, a) = (θ − a)²
• Linear loss: L(θ, a) = K₁(θ − a) if a − θ < 0, K₂(a − θ) if a − θ ≥ 0
• Absolute error loss: L(θ, a) = |θ − a| (linear loss with K₁ = K₂)
• L_p loss: L(θ, a) = |θ − a|^p
• Zero-one loss: L(θ, a) = 0 if a = θ, 1 if a ≠ θ

17.1 Risk

Posterior risk:

r(θ̂ | x) = ∫ L(θ, θ̂(x)) f(θ | x) dθ = E_{θ|X}[L(θ, θ̂(x))]

(Frequentist) risk:

R(θ, θ̂) = ∫ L(θ, θ̂(x)) f(x | θ) dx = E_{X|θ}[L(θ, θ̂(X))]

Bayes risk:

r(f, θ̂) = ∬ L(θ, θ̂(x)) f(x, θ) dx dθ = E_{θ,X}[L(θ, θ̂(X))]
r(f, θ̂) = E_θ[R(θ, θ̂)] = E_θ[E_{X|θ}[L(θ, θ̂(X))]]
r(f, θ̂) = E_X[r(θ̂ | X)] = E_X[E_{θ|X}[L(θ, θ̂(X))]]

17.2 Admissibility

• θ̂′ dominates θ̂ if ∀θ: R(θ, θ̂′) ≤ R(θ, θ̂) and ∃θ: R(θ, θ̂′) < R(θ, θ̂)
• θ̂ is inadmissible if there is at least one other estimator θ̂′ that dominates it. Otherwise it is called admissible.
17.3 Bayes Rule

Bayes rule (or Bayes estimator):
• r(f, θ̂) = inf_{θ̃} r(f, θ̃)
• θ̂(x) = inf_{θ̃} r(θ̃ | x) ∀x ⟹ r(f, θ̂) = ∫ r(θ̂ | x) f(x) dx

Theorems
• Squared error loss: posterior mean
• Absolute error loss: posterior median
• Zero-one loss: posterior mode

17.4 Minimax Rules

Maximum risk: R̄(θ̂) = sup_θ R(θ, θ̂),  R̄(a) = sup_θ R(θ, a)

Minimax rule: sup_θ R(θ, θ̂) = inf_{θ̃} R̄(θ̃) = inf_{θ̃} sup_θ R(θ, θ̃)

θ̂ = Bayes rule ∧ ∃c : R(θ, θ̂) = c ⟹ θ̂ is minimax

Least favorable prior: θ̂^f = Bayes rule ∧ R(θ, θ̂^f) ≤ r(f, θ̂^f) ∀θ

18 Linear Regression

Definitions
• Response variable Y
• Covariate X (aka predictor variable or feature)

18.1 Simple Linear Regression

Model: Yᵢ = β₀ + β₁Xᵢ + εᵢ with E[εᵢ | Xᵢ] = 0, V[εᵢ | Xᵢ] = σ²

Fitted line: r̂(x) = β̂₀ + β̂₁x
Predicted (fitted) values: Ŷᵢ = r̂(Xᵢ)
Residuals: ε̂ᵢ = Yᵢ − Ŷᵢ = Yᵢ − (β̂₀ + β̂₁Xᵢ)
Residual sums of squares (rss): rss(β̂₀, β̂₁) = Σᵢ₌₁ⁿ ε̂ᵢ²

Least squares estimates β̂ᵀ = (β̂₀, β̂₁)ᵀ minimize rss(β̂₀, β̂₁):

β̂₀ = Ȳₙ − β̂₁X̄ₙ
β̂₁ = Σᵢ₌₁ⁿ (Xᵢ − X̄ₙ)(Yᵢ − Ȳₙ) / Σᵢ₌₁ⁿ (Xᵢ − X̄ₙ)² = (Σᵢ XᵢYᵢ − nX̄Ȳ)/(Σᵢ Xᵢ² − nX̄²)

E[β̂ | Xⁿ] = (β₀, β₁)ᵀ
V[β̂ | Xⁿ] = (σ²/(n s_X²)) [ n⁻¹Σᵢ Xᵢ²  −X̄ₙ ; −X̄ₙ  1 ]

ŝe(β̂₀) = (σ̂/(s_X √n)) √(Σᵢ₌₁ⁿ Xᵢ²/n)
ŝe(β̂₁) = σ̂/(s_X √n)

where s_X² = n⁻¹ Σᵢ₌₁ⁿ (Xᵢ − X̄ₙ)² and σ̂² = (1/(n − 2)) Σᵢ₌₁ⁿ ε̂ᵢ² (an unbiased estimate of σ²).

Further properties:
• Consistency: β̂₀ →ᴾ β₀ and β̂₁ →ᴾ β₁
• Asymptotic normality: (β̂₀ − β₀)/ŝe(β̂₀) →ᴰ N(0, 1) and (β̂₁ − β₁)/ŝe(β̂₁) →ᴰ N(0, 1)
• Approximate 1 − α confidence intervals for β₀ and β₁: β̂₀ ± z_{α/2} ŝe(β̂₀) and β̂₁ ± z_{α/2} ŝe(β̂₁)
• The Wald test for testing H₀: β₁ = 0 vs. H₁: β₁ ≠ 0: reject H₀ if |W| > z_{α/2} where W = β̂₁/ŝe(β̂₁).

R²:

R² = Σᵢ (Ŷᵢ − Ȳ)²/Σᵢ (Yᵢ − Ȳ)² = 1 − Σᵢ ε̂ᵢ²/Σᵢ (Yᵢ − Ȳ)² = 1 − rss/tss

Likelihood:

L = Πᵢ₌₁ⁿ f(Xᵢ, Yᵢ) = Πᵢ₌₁ⁿ f_X(Xᵢ) × Πᵢ₌₁ⁿ f_{Y|X}(Yᵢ | Xᵢ) = L₁ × L₂
L₁ = Πᵢ₌₁ⁿ f_X(Xᵢ)
L₂ = Πᵢ₌₁ⁿ f_{Y|X}(Yᵢ | Xᵢ) ∝ σ^{−n} exp{−(1/(2σ²)) Σᵢ (Yᵢ − (β₀ + β₁Xᵢ))²}

Under the assumption of normality, the least squares estimator is also the mle, with σ̂²_mle = (1/n) Σᵢ₌₁ⁿ ε̂ᵢ².
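The least squares formulas above in code, a sketch assuming Python/NumPy:

```python
import numpy as np

rng = np.random.default_rng(12)
n = 200
X = rng.uniform(0, 10, size=n)
Y = 1.5 + 0.8 * X + rng.normal(scale=1.0, size=n)

b1 = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
b0 = Y.mean() - b1 * X.mean()

resid = Y - (b0 + b1 * X)
sigma2 = (resid ** 2).sum() / (n - 2)       # unbiased estimate of sigma^2
sX = np.sqrt(((X - X.mean()) ** 2).mean())
se_b1 = np.sqrt(sigma2) / (sX * np.sqrt(n))

W = b1 / se_b1                              # Wald statistic for H0: beta1 = 0
print(b0, b1, se_b1, W)
```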
18.2 Prediction

Observe X = x* of the covariate and want to predict the outcome Y*.

Ŷ* = β̂₀ + β̂₁x*
V[Ŷ*] = V[β̂₀] + x*²V[β̂₁] + 2x* Cov[β̂₀, β̂₁]

1 − α prediction interval:

ξ̂ₙ² = σ̂² (Σᵢ (Xᵢ − x*)²/(n Σⱼ (Xⱼ − X̄)²) + 1)

Ŷ* ± z_{α/2} ξ̂ₙ

18.3 Multiple Regression

Y = Xβ + ε, where

X = [ X₁₁ ⋯ X₁ₖ ; ⋮ ⋱ ⋮ ; Xₙ₁ ⋯ Xₙₖ ],  β = (β₁, …, βₖ)ᵀ,  ε = (ε₁, …, εₙ)ᵀ

Likelihood:

L(µ, Σ) = (2πσ²)^{−n/2} exp{−rss/(2σ²)}

rss = (y − Xβ)ᵀ(y − Xβ) = ‖Y − Xβ‖² = Σᵢ₌₁ᴺ (Yᵢ − xᵢᵀβ)²

If the (k × k) matrix XᵀX is invertible,

β̂ = (XᵀX)⁻¹XᵀY
V[β̂ | Xⁿ] = σ²(XᵀX)⁻¹
β̂ ≈ N(β, σ²(XᵀX)⁻¹)

Estimate regression function: r̂(x) = Σⱼ₌₁ᵏ β̂ⱼxⱼ

Unbiased estimate for σ²: σ̂² = (1/(n − k)) Σᵢ₌₁ⁿ ε̂ᵢ², ε̂ = Xβ̂ − Y

mle: µ̂ = X̄, σ̂²_mle = ((n − k)/n) σ̂²

1 − α confidence interval: β̂ⱼ ± z_{α/2} ŝe(β̂ⱼ)

18.4 Model Selection

Consider predicting a new observation Y* for covariates X* and let S ⊂ J denote a subset of the covariates in the model, where |S| = k and |J| = n.

Issues
• Underfitting: too few covariates yields high bias
• Overfitting: too many covariates yields high variance

Procedure
1. Assign a score to each model
2. Search through all models to find the one with the highest score

Hypothesis testing: H₀: βⱼ = 0 vs. H₁: βⱼ ≠ 0 ∀j ∈ J

Mean squared prediction error (mspe): mspe = E[(Ŷ(S) − Y*)²]

Prediction risk: R(S) = Σᵢ₌₁ⁿ mspeᵢ = Σᵢ₌₁ⁿ E[(Ŷᵢ(S) − Yᵢ*)²]
Training error: R̂_tr(S) = Σᵢ₌₁ⁿ (Ŷᵢ(S) − Yᵢ)²

R²(S) = 1 − rss(S)/tss = 1 − R̂_tr(S)/tss

The training error is a downward-biased estimate of the prediction risk:

E[R̂_tr(S)] < R(S)
bias(R̂_tr(S)) = E[R̂_tr(S)] − R(S) = −2 Σᵢ₌₁ⁿ Cov[Ŷᵢ, Yᵢ]

Adjusted R²: R̄²(S) = 1 − ((n − 1)/(n − k)) rss/tss

Mallow's Cp statistic: R̂(S) = R̂_tr(S) + 2kσ̂² = lack of fit + complexity penalty

Akaike information criterion (AIC): AIC(S) = ℓₙ(β̂_S, σ̂_S²) − k

Bayesian information criterion (BIC): BIC(S) = ℓₙ(β̂_S, σ̂_S²) − (k/2) log n

Validation and training: R̂_V(S) = Σᵢ₌₁ᵐ (Ŷᵢ*(S) − Yᵢ*)², m = |{validation data}|, often n/4 or n/2

Leave-one-out cross-validation:

R̂_CV(S) = Σᵢ₌₁ⁿ (Yᵢ − Ŷ₍ᵢ₎)² = Σᵢ₌₁ⁿ ((Yᵢ − Ŷᵢ(S))/(1 − Uᵢᵢ(S)))²

where U(S) = X_S(X_SᵀX_S)⁻¹X_Sᵀ (the "hat matrix").

19 Non-parametric Function Estimation

19.1 Density Estimation

Estimate f(x), where P[X ∈ A] = ∫_A f(x) dx.

Integrated square error (ise):

L(f, f̂ₙ) = ∫ (f(x) − f̂ₙ(x))² dx = J(h) + ∫ f²(x) dx

Frequentist risk:

R(f, f̂ₙ) = E[L(f, f̂ₙ)] = ∫ b²(x) dx + ∫ v(x) dx
b(x) = E[f̂ₙ(x)] − f(x)
v(x) = V[f̂ₙ(x)]

19.1.1 Histograms

Definitions
• Number of bins m
• Binwidth h = 1/m
• Bin Bⱼ has νⱼ observations
• Define p̂ⱼ = νⱼ/n and pⱼ = ∫_{Bⱼ} f(u) du

Histogram estimator:

f̂ₙ(x) = Σⱼ₌₁ᵐ (p̂ⱼ/h) I(x ∈ Bⱼ)

E[f̂ₙ(x)] = pⱼ/h
V[f̂ₙ(x)] = pⱼ(1 − pⱼ)/(nh²)
R(f̂ₙ, f) ≈ (h²/12) ∫ (f′(u))² du + 1/(nh)
h* = (1/n^{1/3}) (6/∫ (f′(u))² du)^{1/3}
R*(f̂ₙ, f) ≈ C/n^{2/3},  C = (3/4)^{2/3} (∫ (f′(u))² du)^{1/3}

Cross-validation estimate of E[J(h)]:

Ĵ_CV(h) = ∫ f̂ₙ²(x) dx − (2/n) Σᵢ₌₁ⁿ f̂₍₋ᵢ₎(Xᵢ) = 2/((n − 1)h) − ((n + 1)/((n − 1)h)) Σⱼ₌₁ᵐ p̂ⱼ²
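The histogram cross-validation identity above in code, a sketch (Python/NumPy; data supported on [0, 1], so h = 1/m; the closed form avoids refitting n leave-one-out histograms):

```python
import numpy as np

rng = np.random.default_rng(13)
x = rng.beta(2, 5, size=500)            # data supported on [0, 1]
n = x.size

def j_cv(m):
    h = 1 / m
    counts, _ = np.histogram(x, bins=m, range=(0.0, 1.0))
    p_hat = counts / n
    return 2 / ((n - 1) * h) - (n + 1) / ((n - 1) * h) * (p_hat ** 2).sum()

ms = np.arange(2, 60)
best = ms[np.argmin([j_cv(m) for m in ms])]
print(best, 1 / best)                   # selected bin count and binwidth
```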
19.1.2 Kernel Density Estimator (KDE)

Kernel K:
• K(x) ≥ 0
• ∫ K(x) dx = 1
• ∫ xK(x) dx = 0
• ∫ x²K(x) dx ≡ σ²_K > 0

KDE:

f̂ₙ(x) = (1/n) Σᵢ₌₁ⁿ (1/h) K((x − Xᵢ)/h)

R(f, f̂ₙ) ≈ (1/4)(hσ_K)⁴ ∫ (f″(x))² dx + (1/(nh)) ∫ K²(x) dx

h* = c₁^{−2/5} c₂^{−1/5} c₃^{−1/5} / n^{1/5},  c₁ = σ_K², c₂ = ∫ K²(x) dx, c₃ = ∫ (f″(x))² dx

R*(f, f̂ₙ) = c₄/n^{4/5},  c₄ = (5/4)(σ_K²)^{2/5} (∫ K²(x) dx)^{4/5} (∫ (f″)² dx)^{1/5} ≡ C(K) (∫ (f″)² dx)^{1/5}

Epanechnikov kernel:

K(x) = (3/(4√5))(1 − x²/5) for |x| < √5, 0 otherwise

Cross-validation estimate of E[J(h)]:

Ĵ_CV(h) = ∫ f̂ₙ²(x) dx − (2/n) Σᵢ₌₁ⁿ f̂₍₋ᵢ₎(Xᵢ)

19.2 Non-parametric Regression

k-nearest neighbor estimator:

r̂(x) = (1/k) Σ_{i : xᵢ ∈ Nₖ(x)} Yᵢ,  where Nₖ(x) = {k values of x₁, …, xₙ closest to x}

Nadaraya-Watson kernel estimator:

r̂(x) = Σᵢ₌₁ⁿ wᵢ(x)Yᵢ,  wᵢ(x) = K((x − xᵢ)/h) / Σⱼ₌₁ⁿ K((x − xⱼ)/h) ∈ [0, 1]

R(r̂ₙ, r) ≈ (h⁴/4) (∫ x²K(x) dx)² ∫ (r″(x) + 2r′(x) f′(x)/f(x))² dx + (σ² ∫ K²(x) dx) ∫ dx/(nh f(x))

h* ≈ c₁/n^{1/5},  R*(r̂ₙ, r) ≈ c₂/n^{4/5}
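A compact Nadaraya-Watson estimator, a sketch assuming Python/NumPy (a Gaussian kernel is my choice):

```python
import numpy as np

rng = np.random.default_rng(14)
x = np.sort(rng.uniform(0, 2 * np.pi, size=300))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

def nw(x0, h=0.3):
    # Gaussian kernel weights w_i(x0), normalized to sum to 1.
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)
    return (w / w.sum()) @ y

grid = np.linspace(0, 2 * np.pi, 7)
print([round(nw(g), 2) for g in grid])   # ~ sin(grid)
```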
20 Stochastic Processes

20.1 Markov Chains

Transition matrix P (n-step: Pₙ)
• (i, j) element is pᵢⱼ
• pᵢⱼ ≥ 0
• Σⱼ pᵢⱼ = 1 (rows sum to 1)

Chapman-Kolmogorov:

pᵢⱼ(m + n) = Σ_k pᵢₖ(m) pₖⱼ(n)

P_{m+n} = P_m P_n,  Pₙ = P × ⋯ × P = Pⁿ

Marginal probability:

µₙ = (µₙ(1), …, µₙ(N)),  where µₙ(i) = P[Xₙ = i]
µ₀ = initial distribution,  µₙ = µ₀Pⁿ

20.2 Poisson Processes

Waiting times: W_t := time at which X_t occurs, W_t ∼ Gamma(t, 1/λ)

Interarrival times: S_t = W_{t+1} − W_t, S_t ∼ Exp(1/λ)

[Figure: interarrival time S_t between the waiting times W_{t−1} and W_t on the time axis.]
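Chapman-Kolmogorov in code, a sketch (Python/NumPy; the 2-state chain is a toy choice): the n-step transition matrix is the matrix power Pⁿ, and the marginal evolves as µₙ = µ₀Pⁿ.

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])          # rows sum to 1
mu0 = np.array([1.0, 0.0])          # start in state 0

P5 = np.linalg.matrix_power(P, 5)   # 5-step transition matrix
print(P5)
print(mu0 @ P5)                     # marginal distribution mu_5
```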
21 Time Series

Mean function: µ_{xt} = E[x_t] = ∫_{−∞}^∞ x f_t(x) dx

Autocovariance function:

γ_x(s, t) = E[(x_s − µ_s)(x_t − µ_t)] = E[x_s x_t] − µ_s µ_t
γ_x(t, t) = E[(x_t − µ_t)²] = V[x_t]

Autocorrelation function (ACF):

ρ(s, t) = Cov[x_s, x_t]/√(V[x_s] V[x_t]) = γ(s, t)/√(γ(s, s) γ(t, t))

Cross-covariance function (CCV): γ_{xy}(s, t) = E[(x_s − µ_{xs})(y_t − µ_{yt})]

Cross-correlation function (CCF): ρ_{xy}(s, t) = γ_{xy}(s, t)/√(γ_x(s, s) γ_y(t, t))

Backshift operator: Bᵏ(x_t) = x_{t−k}

Difference operator: ∇^d = (1 − B)^d

White noise:
• w_t ∼ wn(0, σ_w²)
• Gaussian: w_t iid∼ N(0, σ_w²)
• E[w_t] = 0, t ∈ T
• V[w_t] = σ², t ∈ T
• γ_w(s, t) = 0, s ≠ t, s, t ∈ T

Random walk:
• Drift δ
• x_t = δt + Σⱼ₌₁ᵗ wⱼ
• E[x_t] = δt

Symmetric moving average: m_t = Σⱼ₌₋ₖᵏ aⱼ x_{t−j}, where aⱼ = a₋ⱼ ≥ 0 and Σⱼ₌₋ₖᵏ aⱼ = 1

21.1 Stationary Time Series

Strictly stationary:

P[x_{t₁} ≤ c₁, …, x_{tₖ} ≤ cₖ] = P[x_{t₁+h} ≤ c₁, …, x_{tₖ+h} ≤ cₖ]   ∀k ∈ ℕ, tₖ, cₖ, h ∈ ℤ

Weakly stationary:
• E[x_t²] < ∞ ∀t ∈ ℤ
• E[x_t] = m ∀t ∈ ℤ
• γ_x(s, t) = γ_x(s + r, t + r) ∀r, s, t ∈ ℤ

Autocovariance function (stationary case):
• γ(h) = E[(x_{t+h} − µ)(x_t − µ)] ∀h ∈ ℤ
• γ(0) = E[(x_t − µ)²]
• γ(0) ≥ 0
• γ(0) ≥ |γ(h)|
• γ(h) = γ(−h)

ACF:

ρ_x(h) = Cov[x_{t+h}, x_t]/√(V[x_{t+h}] V[x_t]) = γ(t + h, t)/√(γ(t + h, t + h) γ(t, t)) = γ(h)/γ(0)

Jointly stationary time series:

γ_{xy}(h) = E[(x_{t+h} − µ_x)(y_t − µ_y)],  ρ_{xy}(h) = γ_{xy}(h)/√(γ_x(0) γ_y(0))

Linear process:

x_t = µ + Σⱼ₌₋∞^∞ ψⱼ w_{t−j},  where Σⱼ₌₋∞^∞ |ψⱼ| < ∞;  γ(h) = σ_w² Σⱼ₌₋∞^∞ ψ_{j+h} ψⱼ
21.2 Estimation of Correlation

Sample mean: x̄ = (1/n) Σ_{t=1}^n x_t

V[x̄] = (1/n) Σ_{h=−n}^n (1 − |h|/n) γ_x(h)

Sample autocovariance function: γ̂(h) = (1/n) Σ_{t=1}^{n−h} (x_{t+h} − x̄)(x_t − x̄)

Sample autocorrelation function: ρ̂(h) = γ̂(h)/γ̂(0)

Sample cross-variance function: γ̂_{xy}(h) = (1/n) Σ_{t=1}^{n−h} (x_{t+h} − x̄)(y_t − ȳ)

Sample cross-correlation function: ρ̂_{xy}(h) = γ̂_{xy}(h)/√(γ̂_x(0) γ̂_y(0))

Properties:
• σ_{ρ̂x}(h) = 1/√n if x_t is white noise
• σ_{ρ̂xy}(h) = 1/√n if x_t or y_t is white noise

21.3 Non-Stationary Time Series

21.3.1 Detrending

Least squares:
1. Choose a trend model, e.g., µ_t = β₀ + β₁t + β₂t²
2. Minimize rss to obtain the trend estimate µ̂_t = β̂₀ + β̂₁t + β̂₂t²
3. Residuals ≙ noise w_t

Moving average:
• The low-pass filter v_t is a symmetric moving average m_t with aⱼ = 1/(2k + 1): v_t = (1/(2k + 1)) Σᵢ₌₋ₖᵏ x_{t−i}
• If (1/(2k + 1)) Σᵢ₌₋ₖᵏ w_{t−j} ≈ 0, a linear trend function µ_t = β₀ + β₁t passes without distortion

Differencing:
• µ_t = β₀ + β₁t ⟹ ∇x_t = β₁
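The sample ACF of §21.2 in code, a sketch (Python/NumPy), with the ±1/√n white-noise reference from the properties above:

```python
import numpy as np

rng = np.random.default_rng(15)
x = rng.normal(size=500)                # white noise

def acf(x, max_lag):
    x = x - x.mean()
    n = x.size
    gamma0 = (x ** 2).sum() / n
    return np.array([(x[h:] * x[:n - h]).sum() / n / gamma0
                     for h in range(max_lag + 1)])

r = acf(x, 10)
print(np.round(r, 3))
print("white-noise band: +/-", round(1 / np.sqrt(x.size), 3))
```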
21.4 ARIMA Models

Autoregressive polynomial: φ(z) = 1 − φ₁z − ⋯ − φ_p z^p, z ∈ ℂ ∧ φ_p ≠ 0

Autoregressive operator: φ(B) = 1 − φ₁B − ⋯ − φ_p B^p

Autoregressive model of order p, AR(p):

x_t = φ₁x_{t−1} + ⋯ + φ_p x_{t−p} + w_t ⟺ φ(B)x_t = w_t

AR(1):

x_t = φᵏ(x_{t−k}) + Σⱼ₌₀^{k−1} φʲ(w_{t−j}) → Σⱼ₌₀^∞ φʲ w_{t−j}  as k → ∞, |φ| < 1 (a linear process)

MA(q): x_t = w_t + θ₁w_{t−1} + ⋯ + θ_q w_{t−q}

MA(1): x_t = w_t + θw_{t−1}

γ(h) = (1 + θ²)σ_w² (h = 0), θσ_w² (h = 1), 0 (h > 1)
ρ(h) = θ/(1 + θ²) (h = 1), 0 (h > 1)

ARMA(p, q):

x_t = φ₁x_{t−1} + ⋯ + φ_p x_{t−p} + w_t + θ₁w_{t−1} + ⋯ + θ_q w_{t−q} ⟺ φ(B)x_t = θ(B)w_t

ARIMA(p, d, q): ∇^d x_t = (1 − B)^d x_t is ARMA(p, q); φ(B)(1 − B)^d x_t = θ(B)w_t

Exponentially weighted moving average (EWMA):

x_t = x_{t−1} + w_t − λw_{t−1}
x_t = Σⱼ₌₁^∞ (1 − λ)λ^{j−1} x_{t−j} + w_t  when |λ| < 1
x̃_{n+1} = (1 − λ)x_n + λx̃_n

21.4.1 Causality and Invertibility

ARMA(p, q) is causal (future-independent) ⟺ ∃{ψⱼ} with Σⱼ₌₀^∞ |ψⱼ| < ∞ such that x_t = Σⱼ₌₀^∞ ψⱼ w_{t−j} = ψ(B)w_t

ARMA(p, q) is invertible ⟺ ∃{πⱼ} with Σⱼ₌₀^∞ |πⱼ| < ∞ such that π(B)x_t = Σⱼ₌₀^∞ πⱼ x_{t−j} = w_t

Properties:
• ARMA(p, q) causal ⟺ roots of φ(z) lie outside the unit circle; ψ(z) = Σⱼ₌₀^∞ ψⱼzʲ = θ(z)/φ(z), |z| ≤ 1
• ARMA(p, q) invertible ⟺ roots of θ(z) lie outside the unit circle; π(z) = Σⱼ₌₀^∞ πⱼzʲ = φ(z)/θ(z), |z| ≤ 1

Partial autocorrelation function (PACF):
• x_i^{h−1}: regression of x_i on {x_{h−1}, x_{h−2}, …, x₁}
• φ_{hh} = corr(x_h − x_h^{h−1}, x₀ − x₀^{h−1}) for h ≥ 2
• E.g., φ₁₁ = corr(x₁, x₀) = ρ(1)

Behavior of the ACF and PACF for causal and invertible ARMA models:

|      | AR(p)                | MA(q)                | ARMA(p, q) |
|------|----------------------|----------------------|------------|
| ACF  | tails off            | cuts off after lag q | tails off  |
| PACF | cuts off after lag p | tails off            | tails off  |
21.5 Spectral Analysis

Periodic process:

x_t = A cos(2πωt + φ) = U₁ cos(2πωt) + U₂ sin(2πωt)

• Frequency index ω (cycles per unit time), period 1/ω
• Amplitude A
• Phase φ
• U₁ = A cos φ and U₂ = A sin φ, often normally distributed rv's

Periodic mixture:

x_t = Σₖ₌₁^q (Uₖ₁ cos(2πωₖt) + Uₖ₂ sin(2πωₖt))

• Uₖ₁, Uₖ₂, for k = 1, …, q, are independent zero-mean rv's with variances σₖ²
• γ(h) = Σₖ₌₁^q σₖ² cos(2πωₖh)
• γ(0) = E[x_t²] = Σₖ₌₁^q σₖ²

Spectral representation of a periodic process:

γ(h) = σ² cos(2πω₀h) = (σ²/2) e^{−2πiω₀h} + (σ²/2) e^{2πiω₀h} = ∫_{−1/2}^{1/2} e^{2πiωh} dF(ω)

Spectral distribution function:

F(ω) = 0 (ω < −ω₀), σ²/2 (−ω₀ ≤ ω < ω₀), σ² (ω ≥ ω₀)
• F(−∞) = F(−1/2) = 0
• F(∞) = F(1/2) = γ(0)

Spectral density:

f(ω) = Σ_{h=−∞}^∞ γ(h) e^{−2πiωh},  −1/2 ≤ ω ≤ 1/2

• Needs Σ_{h=−∞}^∞ |γ(h)| < ∞ ⟹ γ(h) = ∫_{−1/2}^{1/2} e^{2πiωh} f(ω) dω, h = 0, ±1, …
• f(ω) ≥ 0
• f(ω) = f(−ω)
• f(ω) = f(1 − ω)
• γ(0) = V[x_t] = ∫_{−1/2}^{1/2} f(ω) dω
• White noise: f_w(ω) = σ_w²
• ARMA(p, q), φ(B)x_t = θ(B)w_t:

  f_x(ω) = σ_w² |θ(e^{−2πiω})|²/|φ(e^{−2πiω})|²

  where φ(z) = 1 − Σₖ₌₁^p φₖzᵏ and θ(z) = 1 + Σₖ₌₁^q θₖzᵏ

Discrete Fourier Transform (DFT): d(ωⱼ) = n^{−1/2} Σ_{t=1}^n x_t e^{−2πiωⱼt}

Fourier/fundamental frequencies: ωⱼ = j/n

Inverse DFT: x_t = n^{−1/2} Σⱼ₌₀^{n−1} d(ωⱼ) e^{2πiωⱼt}

Periodogram: I(j/n) = |d(j/n)|²

Scaled periodogram:

P(j/n) = (4/n) I(j/n) = ((2/n) Σ_{t=1}^n x_t cos(2πtj/n))² + ((2/n) Σ_{t=1}^n x_t sin(2πtj/n))²
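The DFT and periodogram in code, a sketch (Python/NumPy; note NumPy's FFT is unnormalized, so d(ωⱼ) = fft(x)/√n):

```python
import numpy as np

rng = np.random.default_rng(17)
n = 250
t = np.arange(n)
x = 2 * np.cos(2 * np.pi * 0.1 * t) + rng.normal(size=n)  # signal at omega = 0.1

d = np.fft.fft(x) / np.sqrt(n)          # d(omega_j), omega_j = j/n
I = np.abs(d) ** 2                      # periodogram
j = I[1:n // 2].argmax() + 1
print(j / n)                            # ~ 0.1, the dominant frequency
```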
22 Math

22.1 Gamma Function

• Ordinary: Γ(s) = ∫₀^∞ t^{s−1} e^{−t} dt
• Upper incomplete: Γ(s, x) = ∫ₓ^∞ t^{s−1} e^{−t} dt
• Lower incomplete: γ(s, x) = ∫₀ˣ t^{s−1} e^{−t} dt
• Γ(α + 1) = αΓ(α), α > 1
• Γ(n) = (n − 1)!, n ∈ ℕ
• Γ(1/2) = √π

22.2 Beta Function

• Ordinary: B(x, y) = B(y, x) = ∫₀¹ t^{x−1}(1 − t)^{y−1} dt = Γ(x)Γ(y)/Γ(x + y)
• Incomplete: B(x; a, b) = ∫₀ˣ t^{a−1}(1 − t)^{b−1} dt
• Regularized incomplete:

  I_x(a, b) = B(x; a, b)/B(a, b) = Σⱼ₌ₐ^{a+b−1} ((a + b − 1)!/(j!(a + b − 1 − j)!)) xʲ(1 − x)^{a+b−1−j}   for a, b ∈ ℕ

• I₀(a, b) = 0, I₁(a, b) = 1
• I_x(a, b) = 1 − I_{1−x}(b, a)
22.3 Series

Finite:
• Σₖ₌₁ⁿ k = n(n + 1)/2
• Σₖ₌₁ⁿ (2k − 1) = n²
• Σₖ₌₁ⁿ k² = n(n + 1)(2n + 1)/6
• Σₖ₌₁ⁿ k³ = (n(n + 1)/2)²
• Σₖ₌₀ⁿ cᵏ = (c^{n+1} − 1)/(c − 1), c ≠ 1

Binomial:
• Σₖ₌₀ⁿ C(n, k) = 2ⁿ
• Σₖ₌₀ⁿ C(r + k, k) = C(r + n + 1, n)
• Σₖ₌₀ⁿ C(k, m) = C(n + 1, m + 1)
• Binomial theorem: Σₖ₌₀ⁿ C(n, k) a^{n−k} bᵏ = (a + b)ⁿ
• Vandermonde's identity: Σₖ₌₀ʳ C(m, k) C(n, r − k) = C(m + n, r)

Infinite:
• Σₖ₌₀^∞ pᵏ = 1/(1 − p), |p| < 1
• Σₖ₌₁^∞ pᵏ = p/(1 − p), |p| < 1
• Σₖ₌₁^∞ kp^{k−1} = (d/dp) Σₖ₌₀^∞ pᵏ = (d/dp)(1/(1 − p)) = 1/(1 − p)², |p| < 1
• Σₖ₌₀^∞ C(r + k − 1, k) xᵏ = (1 − x)^{−r}, r ∈ ℕ⁺
• Σₖ₌₀^∞ C(α, k) pᵏ = (1 + p)^α, |p| < 1, α ∈ ℂ

22.4 Combinatorics

Sampling (k out of n):

|           | w/o replacement                              | w/ replacement                      |
|-----------|----------------------------------------------|-------------------------------------|
| ordered   | n^(k) = Πᵢ₌₀^{k−1} (n − i) = n!/(n − k)!     | nᵏ                                  |
| unordered | C(n, k) = n^(k)/k! = n!/(k!(n − k)!)         | C(n − 1 + k, k) = C(n − 1 + k, n − 1) |

Stirling numbers, 2nd kind:

{n k} = k{n−1 k} + {n−1 k−1},  1 ≤ k ≤ n
{n 0} = 1 if n = 0, 0 else

Partitions:

P_{n+k,k} = Σᵢ₌₁ᵏ P_{n,i};  n ≥ 1: P_{n,0} = 0, P_{0,0} = 1;  k > n: P_{n,k} = 0

Balls and urns (|B| = n, |U| = m, f : B → U; D = distinguishable, ¬D = indistinguishable):

|              | f arbitrary        | f injective              | f surjective   | f bijective           |
|--------------|--------------------|--------------------------|----------------|-----------------------|
| B: D, U: D   | mⁿ                 | m!/(m − n)! (m ≥ n, else 0) | m! {n m}    | n! (m = n, else 0)    |
| B: ¬D, U: D  | C(m + n − 1, n)    | C(m, n)                  | C(n − 1, m − 1) | 1 (m = n, else 0)    |
| B: D, U: ¬D  | Σₖ₌₁^m {n k}       | 1 (m ≥ n, else 0)        | {n m}          | 1 (m = n, else 0)     |
| B: ¬D, U: ¬D | Σₖ₌₁^m P_{n,k}     | 1 (m ≥ n, else 0)        | P_{n,m}        | 1 (m = n, else 0)     |

References

[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole, 1972.
[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45-53, 2008.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications With R Examples. Springer, 2006.
[4] A. Steger. Diskrete Strukturen - Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.
[5] A. Steger. Diskrete Strukturen - Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.
[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.

[Figure: univariate distribution relationships, courtesy of Leemis and McQueston [2].]