
University of California Los Angeles

Least Squares Method for Factor Analysis

A thesis submitted in partial satisfaction of the requirements for the degree Master of Science in Statistics

by

Jia Chen

2010

© Copyright by

Jia Chen 2010

The thesis of Jia Chen is approved.

Hongquan Xu

Yingnian Wu

Jan de Leeuw, Committee Chair

University of California, Los Angeles 2010


To my parents, for their enduring love, and to my friends and teachers, who have given me precious memories and tremendous encouragement.


Table of Contents

List of Tables

1 Introduction

2 Factor Analysis
2.1 Factor Analysis Models
2.1.1 Random Factor Model
2.1.2 Fixed Factor Model
2.2 Estimation Methods
2.2.1 Principal Component Method
2.2.2 Maximum Likelihood Method
2.2.3 Least Squares Method
2.3 Determining the Number of Factors
2.3.1 Mathematical Approaches
2.3.2 Statistical Approach
2.3.3 The Third Approach

3 Algorithms of Least Squares Methods
3.1 Least Squares on the Covariance Matrix
3.1.1 Loss Function
3.1.2 Projections
3.1.3 Algorithms
3.2 Least Squares on the Data Matrix
3.2.1 Loss Function
3.2.2 Projection
3.2.3 Algorithms

4 Examples
4.1 9 Mental Tests from Holzinger-Swineford
4.2 9 Mental Tests from Thurstone
4.3 17 Mental Tests from Thurstone/Bechtoldt
4.4 16 Health Satisfaction Items from Reise
4.5 9 Emotional Variables from Burt

5 Discussion and Conclusion

A Augmented Procrustes

B Implementation Code

C Program

Bibliography

List of Tables

4.1 LSFA Methods Summary - 9 Mental Tests from Holzinger-Swineford
4.2 Loss Function Summary - 9 Mental Tests from Holzinger-Swineford
4.3 Loading Matrices Summary - 9 Mental Tests from Holzinger-Swineford
4.4 LSFA Methods Summary - 9 Mental Tests from Thurstone
4.5 Loss Function Summary - 9 Mental Tests from Thurstone
4.6 Loading Matrices Summary - 9 Mental Tests from Thurstone
4.7 LSFA Methods Summary - 17 Mental Tests from Thurstone/Bechtoldt
4.8 Loss Function Summary - 17 Mental Tests from Thurstone/Bechtoldt
4.9 Loading Matrices Summary - 17 Mental Tests from Thurstone
4.10 LSFA Methods Summary - 16 Health Satisfaction Items from Reise
4.11 Loss Function Summary - 16 Health Satisfaction Items from Reise
4.12 Loading Matrices Summary - 16 Health Satisfaction Items from Reise
4.13 LSFA Methods Summary - 9 Emotional Variables from Burt
4.14 Loading Matrices Summary - 9 Emotional Variables from Burt

Abstract of the Thesis

Least Squares Method for Factor Analysis

by

Jia Chen

Master of Science in Statistics
University of California, Los Angeles, 2010
Professor Jan de Leeuw, Chair

This paper demonstrates an implementation of alternating least squares for the common factor analysis problem. The algorithm converges, and the accumulation points of the sequences it generates are stationary points. In addition, the implementation of the Procrustes algorithm provides a means of verifying that the solution obtained is at least a local minimum of the loss function.


CHAPTER 1
Introduction

A major objective of scientific or social activities is to summarize, through theoretical formulations, the empirical relationships among a given set of events and to discover the natural laws behind thousands of random events. The events that can be investigated are almost infinite, so it is difficult to make any general statement about phenomena. It can be said, however, that scientists analyze the relationships among a set of variables, where these relationships are evaluated across a set of individuals under specified conditions. The variables are the characteristics being measured and could be anything that can be objectively identified or scored.

Factor analysis can be used for theory and instrument development and for assessing the construct validity of an established instrument when administered to a specific population. Through factor analysis, the original set of variables is reduced to a few factors with minimum loss of information. Each factor represents an area of generalization that is qualitatively distinct from that represented by any other factor. "Within an area where data can be summarized, factor analysis first represents that area by a factor and then seeks to make the degree of generalization between each variable and the factor explicit" [6].

There are many methods available to estimate a factor model. The purpose of this paper is to present and implement a new least squares algorithm, and then to compare its speed of convergence and model accuracy to some existing approaches. To begin, we provide some matrix background and assumptions on the existence of a factor analysis model.


CHAPTER 2
Factor Analysis

Many statistical methods are used to study the relation between independent and dependent variables. Factor analysis is different; its purpose is data reduction and summarization, with the goal of understanding causation. It aims to describe the covariance relationships among a large set of observed variables in terms of a few underlying, but unobservable, random quantities called factors.

Factor analysis is a branch of multivariate analysis that was invented by the psychologist Charles Spearman. He discovered that school children's scores on a wide variety of seemingly unrelated subjects were positively correlated, which led him to postulate that a general mental ability, or g, underlies and shapes human cognitive performance. Raymond Cattell expanded on Spearman's idea of a two-factor theory of intelligence after performing his own tests and factor analysis, and used a multi-factor theory to explain intelligence. Factor analysis was developed to analyze test scores so as to determine whether 'intelligence' is made up of a single underlying general factor or of several more limited factors measuring attributes like 'mathematical ability'. Today factor analysis is the most widely used branch of multivariate analysis in psychology, and, helped by the advent of electronic computers, it has quickly spread to economics, botany, biology and the social sciences.

There are two main motivations for studying factor analysis. One purpose of factor analysis is to reduce the number of variables. In multivariate analysis, one


often has data on a large number of variables Y1, Y2, ..., Ym, and it is reasonable to believe that there is a reduced list of unobserved factors that determine the full dataset. The primary motivation for factor analysis is to detect patterns of relationship among many dependent variables, with the goal of discovering the independent variables that affect them, even though those independent variables cannot be measured directly. "The object of a factor problem is to account for the tests, the smallest possible number that is consistent with acceptable residual errors" [21].

2.1 Factor Analysis Models

Factor analysis is generally presented within the framework of the multivariate linear model for data analysis. Two classical linear common factor models are briefly reviewed here; for a more comprehensive discussion, refer to Anderson and Rubin [1956] and to Anderson [1984] [1] [2]. Common factor analysis (CFA) starts with the assumption that the variance in a given variable can be explained by a small number of underlying common factors. For the common factor model, the factor score matrix can be divided into two parts, a common factor part and a unique factor part. In matrix form the model is:

Y = F + U,    F = HA′,    U = ED,

where Y, F and U are n × m, H is n × p, A is m × p, E is n × m, and D is m × m. It can be rewritten as:

Y = HA′ + ED    (2.1)

where the common factor part is a linear combination of p common factors (H, of order n × p) and factor loadings (A, of order m × p), and the unique factor part is a linear combination of m unique factors (E, of order n × m) and D (of order m × m), which is a diagonal matrix. In the common factor model, the common factors and unique factors are assumed to be orthogonal, to follow a multivariate normal distribution with mean zero, and to be scaled to have unit length. The assumption of normality means that they are statistically independent random variables. The common factors are assumed to be independent of the unique factors. Therefore E(H) = 0, H′H = Ip, E(E) = 0, E′E = Im, E′H = 0m×p, and D is a diagonal matrix. The common factor model (2.1) and these assumptions imply the following correlation structure Σ for the observed variables:

Σ = AA′ + D²    (2.2)
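As a concrete illustration of (2.1) and (2.2), the following R sketch simulates data from the common factor model and checks that the sample covariance approximates AA′ + D². The loading values are hypothetical and are not taken from the thesis.

# Sketch: simulate Y = H A' + E D and compare cov(Y) with A A' + D^2
set.seed(1)
n <- 10000; m <- 6; p <- 2
A <- matrix(c(0.8, 0.7, 0.6, 0.0, 0.0, 0.0,
              0.0, 0.0, 0.0, 0.7, 0.6, 0.5), nrow = m, ncol = p)  # hypothetical loadings
D <- diag(sqrt(1 - rowSums(A^2)))   # unique standard deviations, so variances are one
H <- matrix(rnorm(n * p), n, p)     # common factor scores
E <- matrix(rnorm(n * m), n, m)     # unique factor scores
Y <- H %*% t(A) + E %*% D           # model (2.1)
Sigma <- A %*% t(A) + D^2           # implied structure (2.2)
round(cov(Y) - Sigma, 2)            # approximately the zero matrix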

2.1.1 Random Factor Model

The matrix Y is assumed to be a realization of a matrix-valued random variable Y, where the random variable Y has a random common part F and a random unique part U. Thus

Y = F + U,    F = HA′,    U = ED,

with the same orders as before. Each row of Y corresponds to an individual observation, and these observations are assumed to be independent. Moreover, the specific parts are assumed to be uncorrelated with the common factors, and with the other specific parts.

2.1.2 Fixed Factor Model

The random factor model explained above was criticized soon after it was formally introduced by Lawley. The point is that in factor analysis different individuals are regarded as drawing their scores from different k-way distributions, and in these distributions the mean for each test is the true score of the individual on that test. "Nothing is implied about the distribution of observed scores over a population of individuals, and one makes assumptions only about the error distributions" [25]. This leads to a fixed factor model, which assumes

Y = F + U

The common part is a bilinear combination of common factor loadings and common factor scores,

F = HA′

In the fixed model we merely assume that the specific parts are uncorrelated with the other specific parts.

2.2 Estimation Methods

In common factor analysis, the population covariance matrix Σ of m variables with p common factors can be decomposed as

Σ = AA′ + D²,

where A is the loading matrix of order m × p and D² is the matrix of unique variances of order m, which is diagonal and non-negative definite. These parameters are nearly always unknown and need to be estimated from the sample data. The estimation methods are relatively straightforward ways of breaking down a covariance or correlation matrix into a set of orthogonal components or axes equal in number to the number of variates. The sample covariance matrix is occasionally used, but it is much more common to work with the sample correlation matrix.

Many different estimation methods have been developed; the best known of these is the principal factor method. It extracts the maximum amount of variance that can possibly be extracted by a given number of factors. This method chooses the first factor so as to account for as much as possible of the variance in the correlation matrix, the second factor to account for as much as possible of the remaining variance, and so on. In 1940, a major step forward was made by D. N. Lawley, who developed the maximum likelihood equations. These are fairly complicated and difficult to solve, but recent computational advances, particularly by K. G. Jöreskog, have made maximum likelihood estimation a practical proposition, and computer programs have become widely available [4]. Since then, maximum likelihood has become a dominant estimation method in factor analysis.

2.2.1 Principal Component Method

Let the variance-covariance matrix Σ have eigenvalues

λ̂1, λ̂2, ..., λ̂m,   where λ̂1 ≥ λ̂2 ≥ ... ≥ λ̂m,

with corresponding eigenvectors

ê1, ê2, ..., êm.

The spectral decomposition of Σ [13] says that the variance-covariance matrix can be expressed as the sum of m eigenvalues multiplied by their eigenvectors and their transposes. The idea behind the principal component method is to approximate this expression:

Σ = λ1 e1 e1′ + λ2 e2 e2′ + · · · + λm em em′ = AA′,    (2.3)

where A = ( √λ1 e1   √λ2 e2   · · ·   √λm em ).

Instead of summing equation (2.3) from 1 to m, we sum it from 1 to p to estimate the variance-covariance matrix:

Σ ≈ Σ̂ = λ1 e1 e1′ + λ2 e2 e2′ + · · · + λp ep ep′ = ÂÂ′    (2.4)

Equation (2.4) yields the estimator for the factor loadings:

Â = ( √λ1 e1   √λ2 e2   · · ·   √λp ep )    (2.5)

Recalling equation (2.2), D̂² is the variance-covariance matrix minus ÂÂ′:

D̂² = Σ − ÂÂ′    (2.6)
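The principal component estimates (2.5) and (2.6) follow directly from the eigendecomposition. The R sketch below is a minimal illustration, assuming a correlation or covariance matrix C and a chosen number of factors p are supplied by the user, and taking only the diagonal part of C − ÂÂ′, as is customary; the example data set is R's built-in Harman74.cor, not one of the thesis data sets.

# Sketch: principal component estimates of loadings (2.5) and uniquenesses (2.6)
pc_factor <- function(C, p) {
  e  <- eigen(C, symmetric = TRUE)
  A  <- e$vectors[, 1:p, drop = FALSE] %*% diag(sqrt(e$values[1:p]), p)  # equation (2.5)
  D2 <- diag(C - A %*% t(A))                                             # diagonal of (2.6)
  list(loadings = A, uniquenesses = D2)
}
# Example on the Harman 24-tests correlation matrix shipped with R:
pc_factor(Harman74.cor$cov, p = 4)$uniquenesses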

2.2.2 Maximum Likelihood Method

The maximum likelihood method was first proposed for factor analysis by Lawley (1940, 1941, 1943), but its routine use had to await the development of computers and suitable numerical optimization procedures [19]. The method finds the values of one or more parameters for a given statistic that make the known likelihood distribution a maximum. It consists of finding factor loadings that maximize the likelihood function for a specified set of unique variances. The maximum likelihood method assumes that the data are independently sampled from a multivariate normal distribution. As the common factors (H) and unique factors (E) are assumed to be multivariate normal, Y = HA′ + ED is then multivariate normal with mean vector 0 and variance-covariance matrix Σ. The maximum likelihood method estimates the matrix of factor loadings and the unique factors. The estimators for the factor loadings A and the unique factors D are obtained by finding the Â and D̂ that maximize the log-likelihood, which is given by the following expression:

£(A, D) = −(nm/2) log 2π − (n/2) log |AA′ + D²| − ½ Y′(AA′ + D²)⁻¹Y    (2.7)

There are two types of maximum likelihood methods. The first is the covariance matrix method. This method was first proposed by Lawley [14], and then popularized and programmed by Jöreskog [12]. Since then, multinormal maximum likelihood for the random factor model has become the dominant estimation method in factor analysis. The maximum likelihood principle is applied to the likelihood function of the covariance matrix, assuming multivariate normality. The negative log-likelihood measures the distance between the sample and the population covariance model, and we choose A and D to minimize

£(A, D) = n log |Σ| + n tr(Σ⁻¹S),    (2.8)

where S is the sample covariance matrix of Y, and Σ = AA′ + D². In Anderson and Rubin the impressive machinery developed by the Cowles Commission was applied to both the fixed and the random factor analysis model. Maximum likelihood was applied to the likelihood function of the covariance matrix, assuming multivariate normality.

The other type is the data matrix method. The maximum likelihood procedures proposed by Lawley [14] were criticized soon after they appeared by Young, who wrote: "Such a distribution is specified by the means and variance of each test and the covariance of the tests in pairs; it has no parameters distinguishing different individuals. Such a formulation is therefore inappropriate for factor analysis, where factor loadings of the tests and of the individuals enter in a symmetric fashion in a bilinear form." [25] Young proposed to minimize the negative log-likelihood of the data,

£(H, A, D) = n log |D| + tr (Y − HA′)′ D⁻¹ (Y − HA′),    (2.9)

where D is a known diagonal matrix of column (variable) weights. The solution is given by a weighted singular value decomposition of Y.

The basic problem with Young's method is that it assumes the weights to be known. One solution, suggested by Lawley, is to estimate them along with the loadings and uniquenesses [15]. If there are no person-weights, Lawley suggests alternating minimization over (H, A), which is done by a weighted singular value decomposition, and minimization over diagonal D, which simply amounts to computing the average sum of squares of the residuals for each variable. However, iterating these two minimizations produces a block relaxation algorithm that, although intended to minimize the negative log-likelihood, does not work, even though it produces a decreasing sequence of loss function values. A rather disconcerting feature of the new method is, however, that iterative numerical solutions of the estimation equations either fail to converge, or else converge to unacceptable solutions in which one or more of the measurements have zero error variance. It is apparently impossible to estimate scale as well as location parameters when so many unknowns are involved [24]. In fact, if we look at the loss function we can see it is unbounded below: we can choose scores to fit one variable perfectly, and then let the corresponding variance term approach zero [1].

In 1952, Whittle suggested taking D proportional to the variance of the variables. This amounts to doing a singular value decomposition of the standardized variables. Jöreskog makes the more reasonable choice of setting D proportional to the reciprocals of the diagonals of the inverse of the covariance matrix of the variables [11].
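For reference, covariance matrix maximum likelihood factor analysis, corresponding to the method built around (2.8), is available in base R as factanal(). The short example below uses R's built-in ability.cov covariance matrix, which is not one of the thesis data sets.

# Sketch: ML factor analysis on a covariance matrix with factanal()
fit <- factanal(factors = 2, covmat = ability.cov, rotation = "none")
fit$loadings        # estimated loading matrix A
fit$uniquenesses    # estimated unique variances, the diagonal of D^2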

2.2.3 Least Squares Method

The least squares method is one of the most important estimation methods. It attempts to obtain values of the factor loadings A and the unique variances D² that minimize a different loss function. The least squares loss function is applied either to the residuals of the covariance matrix or to the residuals of the data matrix.


The least squares loss function used on the covariance matrix is

ø(A, D) = ½ SSQ(C − AA′ − D²)

We minimize over A ∈ R^{m×p} and D ∈ D_m, the diagonal matrices of order m. There have been four major approaches to minimizing this loss function.

The least squares loss function used on the data matrix is

σ(H, A, E, D) = ½ SSQ(Y − HA′ − ED)

We minimize over H ∈ R^{n×p}, E ∈ R^{n×m}, A ∈ R^{m×p} and D ∈ D_m, under the conditions that H′H = I, E′E = I, H′E = 0 and D is diagonal.
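Both loss functions are easy to write down in R. The sketch below only illustrates the definitions, with SSQ the sum of squared elements; it is not the implementation used in Appendix B.

# Sketch: the two least squares loss functions
ssq <- function(X) sum(X^2)
# Least squares on a covariance/correlation matrix C,
# with loadings A (m x p) and unique variances d2 (a length-m vector)
loss_cov <- function(A, d2, C) 0.5 * ssq(C - A %*% t(A) - diag(d2))
# Least squares on the data matrix Y, with factor scores H (n x p),
# unique scores E (n x m) and unique weights d (a length-m vector)
loss_data <- function(H, A, E, d, Y) 0.5 * ssq(Y - H %*% t(A) - E %*% diag(d))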

2.3 Determining the Number of Factors

A factor model is not solvable without first determining the number of factors p. How many common factors should be included in the model? This requires determining how many parameters are going to be involved. There are statistical and mathematical approaches to determining the number of factors.

2.3.1 Mathematical Approaches

The mathematical approach to the number of factors is concerned with the number of factors for a particular sample of variables in the population, and the theory behind these approaches is based on a population correlation matrix. Mathematically, the number of factors underlying any given correlation matrix is a function of its rank. Estimating the minimum rank of the correlation matrix is the same as estimating the number of factors [6]. A small sketch of the first two criteria and the scree plot follows the list below.

1. The percentage of variance criterion. This method applies particularly to the principal component method. The percentage of common variance extracted is computed by using the sum of the eigenvalues of the variance-covariance matrix (Σ) in the division. Usually, investigators compute the cumulative percentage of variance after each factor is removed from the matrix and then stop the factoring process when 75, 80 or 85% of the total variance is accounted for.

2. The latent root criterion. This rule is the commonly used criterion of long standing and performs well in practice. This method uses the variance-covariance matrix and chooses the number of factors equal to the number of eigenvalues greater than one.

3. The scree test criterion. The scree test was named after the geological term scree. It also performs well in practice. This rule is applied by plotting the latent roots against the number of factors in their order of extraction, and the shape of the resulting curve is used to evaluate the cutoff point. The scree test is based on a plot of the eigenvalues of Σ: if the graph drops sharply, followed by a straight line with much smaller slope, choose the number of factors equal to the number of eigenvalues before the straight line begins. Because the test involves subjective judgement, it cannot be programmed into the computer run.
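A minimal R sketch of the first two criteria and the scree plot, using R's built-in Harman74.cor correlation matrix as a stand-in for Σ (this data set is not one of the thesis examples):

# Sketch: percentage of variance, latent root criterion and scree plot
ev <- eigen(Harman74.cor$cov, symmetric = TRUE, only.values = TRUE)$values
round(cumsum(ev) / sum(ev), 2)   # percentage of variance criterion: stop near 0.75-0.85
sum(ev > 1)                      # latent root criterion: number of eigenvalues > 1
plot(ev, type = "b", xlab = "Factor number", ylab = "Eigenvalue",
     main = "Scree plot")        # look for the elbow in the curve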

2.3.2 Statistical Approach

In the statistical procedure for determining the number of factors to extract, the following question is asked: is the residual matrix, after p factors have been extracted, statistically significant? A hypothesis test is stated to answer this question,

H0 : Σ = AA′ + D²  vs  H1 : Σ ≠ AA′ + D²,

where A is an m × p matrix. If the statistic is significant beyond the 0.05 level, then the number of factors is insufficient to totally explain the reliable variance. If the statistic is


nonsignificant, then the hypothesized number of factors is correct. Bartlett has presented a chi-square test of the significance of a correlation matrix; the test statistic is

χ² = −(n − 1 − (2ν + 5)/6) ln |Σ|,    (2.10)

where ν = ½[(m − p)² − m − p] and |Σ| is the determinant of the correlation matrix [3].
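Equation (2.10) can be transcribed directly. The following R function is only a sketch of the statistic exactly as written above, with R a sample correlation matrix, n the sample size, and p the hypothesized number of factors, all supplied by the user.

# Sketch: chi-square statistic of equation (2.10) and its degrees of freedom
bartlett_chisq <- function(R, n, p) {
  m    <- ncol(R)
  nu   <- ((m - p)^2 - m - p) / 2                    # degrees of freedom
  chi2 <- -(n - 1 - (2 * nu + 5) / 6) * log(det(R))  # equation (2.10)
  c(statistic = chi2, df = nu,
    p.value = pchisq(chi2, df = nu, lower.tail = FALSE))
}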

2.3.3 The Third Approach

There is no single way to determine the number of factors to extract. A third approach that is sometimes used is to look to the theory within the field of study for indications of how many factors to expect. In many respects this is a better approach, because it lets the science drive the statistics rather than the statistics drive the science.


CHAPTER 3
Algorithms of Least Squares Methods

Suppose we are given a multivariate sample of n independent observations on each of m variables. Collect these data in an n × m matrix Y. The common factor analysis model can be written as:

Y = HA′ + ED    (3.1)

where H ∈ R^{n×p}, A ∈ R^{m×p}, E ∈ R^{n×m}, D ∈ R^{m×m}, H′H = I, E′E = I, H′E = 0, D is diagonal, and p is the number of common factors.

Minimizing the least squares loss function is a form of factor analysis, but it is not the familiar one. In "classical" least squares factor analysis, as described in Young [1941], Whittle [1952] and Jöreskog [1962], the unique factors E are not parameters in the loss function [25] [24] [11]. Instead, the unique variances are used to weight the residuals of each observed variable. Two different loss functions will be illustrated.


3.1 Least Squares on the Covariance Matrix

3.1.1 Loss Function

The least squares loss function used in LSFAC is

ø(A, D) = ½ SSQ(C − AA′ − D²)    (3.2)

We minimize over A ∈ R^{m×p} and D ∈ D_m, the diagonal matrices of order m.

3.1.2 Projections

We also define the two projected or concentrated loss functions, in which one set of parameters is "minimized out",

ø(A) = min over D ∈ D_m of ø(A, D) = Σ_{1 ≤ j < l ≤ m} (c_jl − a_j′ a_l)²,    (3.3)

ø(D) = min over A of ø(A, D) = ½ Σ_{s = p+1}^{m} λ_s²(C − D²),    (3.4)

where λ_s(C − D²) denotes the s-th largest eigenvalue of C − D².
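The concentrated loss (3.4) only requires the m − p smallest eigenvalues of C − D². A minimal R sketch, assuming a correlation matrix C, a vector d2 of unique variances and p factors:

# Sketch: projected loss (3.4), sum of squares of the m - p smallest eigenvalues of C - D^2
loss_projected <- function(C, d2, p) {
  lam <- eigen(C - diag(d2), symmetric = TRUE, only.values = TRUE)$values
  0.5 * sum(lam[(p + 1):length(lam)]^2)
}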

Note that this correlation matrix has a negative eigenvalue, but LSFAY still works. In all LSFAC cases we iterate until the loss function decreases by less than 1e-6, and in the LSFAY cases we iterate for 1000 iterations. For the LSFAC case, in Table 4.13, the simple PFA beats BFGS and CG by a factor of 8. In addition, going from PFA to the Comrey algorithm makes convergence 1.6 times faster, and going from Harman to the PFA algorithm makes convergence 0.02 times faster. Furthermore, going from the Comrey algorithm to the Newton algorithm again makes convergence 1.4 times faster (observe that the Newton algorithm starts with a small number of Comrey iterations to get into an area where the quadratic approximation is safe). For the LSFAY case, the direct alternating least squares beats BFGS and CG by a factor of 13. The PFA algorithm is 6 times faster than the direct alternating least squares. In Table 4.14, each entry for all pairs of loading matrix comparisons is 1.00, which verifies that all algorithms give similar solutions to each other.


Table 4.13: LSFA Methods Summary - 9 Emotional Variables from Burt

                       LSFAC                                         LSFAY
            Newton     PFA     Comrey   Harman    BFGS      CG      InvSqrt    BFGS      CG
loss        0.1696   0.1696    0.1696   0.1696   0.1696   0.1696     0.1287   0.1287   0.1287
iteration   5.0000   134.00    48.000   39.000                       1000.0
user.self   0.0250   0.1560    0.0600   0.1530   0.8240   1.2670     1.0730   10.481   14.244
sys.self    0.0000   0.0000    0.0000   0.0000   0.0030   0.0020     0.0030   0.0450   0.0420

Table 4.14: Loading Matrices Summary - 9 Emotional Variables from Burt

                           LSFAC                               LSFAY
           Newton   PFA   Comrey  Harman  BFGS    CG      InvSqrt  BFGS    CG
Newton      1.00   1.00    1.00    1.00   1.00   1.00       0.98   0.98   0.98
PFA         1.00   1.00    1.00    1.00   1.00   1.00       0.98   0.98   0.98
Comrey      1.00   1.00    1.00    1.00   1.00   1.00       0.98   0.98   0.98
Harman      1.00   1.00    1.00    1.00   1.00   1.00       0.98   0.98   0.98
BFGS        1.00   1.00    1.00    1.00   1.00   1.00       0.98   0.98   0.98
CG          1.00   1.00    1.00    1.00   1.00   1.00       0.98   0.98   0.98
InvSqrt     0.98   0.98    0.98    0.98   0.98   0.98       1.00   1.00   1.00
BFGS        0.98   0.98    0.98    0.98   0.98   0.98       1.00   1.00   1.00
CG          0.98   0.98    0.98    0.98   0.98   0.98       1.00   1.00   1.00


CHAPTER 5
Discussion and Conclusion

This paper demonstrates the feasibility of using an alternating least squares algorithm to minimize the least squares loss function from a common factor analysis perspective. The algorithm leads to clean and solid convergence, and the accumulation points of the sequences it generates are stationary points. In addition, the Procrustes algorithm provides a means of verifying that the solution obtained is at least a local minimum of the loss function.

The Procrustes algorithm was applied to the classical Holzinger-Swineford mental tests problem, and a simple structure of the loadings was achieved. The illustrative results verify that the iteration steps generated by the projection algorithm are exactly the same as those from the Procrustes algorithm. The optimization was easily carried out using the gradient projection algorithm. The Procrustes algorithm has to be compared to other least squares methods.

Current algorithms for fitting common factor analysis models to data are quite good. In almost every case the least squares on the covariance matrix (LSFAC) method has a faster convergence speed than the least squares on the data matrix (LSFAY) method, and the results from the two different methods are almost identical. The Newton algorithm (with a small number of Comrey algorithm iterations to get into an area where the quadratic approximation is safe) is the fastest algorithm among the least squares methods. However, LSFAY has two advantages over LSFAC: 1. the independence assumption is more realistic and optimal scaling is easy (FACTALS), and 2. the square roots of the estimated unique variances always turn out positive.

The illustrative results verify that a common factor analysis solution can be surprisingly similar to the classical maximum likelihood and least squares solutions on the data sets analyzed earlier, suggesting that further research into its properties may be of interest in the future. Although we showed that the proposed factor analysis methodology can yield results that are equivalent to those from standard methods, an important question to consider is whether any variants of this methodology can actually yield improvements over existing methods. If not, the results will be of interest mainly in providing a new theoretical perspective on the relations between components and factor analysis.


APPENDIX A
Augmented Procrustes

Suppose X is an n × m matrix of rank r. Consider the problem of maximizing tr(U′X) over the n × m matrices U satisfying U′U = I. This is known as the Procrustes problem, and it is usually studied for the case n ≥ m = r. We want to generalize it to n ≥ m ≥ r. For this, we use the singular value decomposition

X = ( K1  K0 ) [ Λ  0 ; 0  0 ] ( L1  L0 )′ = K1 Λ L1′,

where K1 is n × r, K0 is n × (n − r), Λ is the r × r diagonal matrix of singular values, L1 is m × r, L0 is m × (m − r), and the middle factor is partitioned conformably with zero blocks.

Theorem 1. The maximum of tr U′X over n × m matrices U satisfying U′U = I is tr Λ, and it is attained for any U of the form U = K1L1′ + K0VL0′, where V is any (n − r) × (m − r) matrix satisfying V′V = I.

Proof. Using a symmetric matrix of Lagrange multipliers leads to the stationary equations X = UM, which imply X′X = M² or M = ±(X′X)^½. They also imply that at a solution of the stationary equations tr U′X = ±tr Λ. The negative sign corresponds with the minimum, the positive sign with the maximum.

Now

M = ( L1  L0 ) [ Λ  0 ; 0  0 ] ( L1  L0 )′ = L1 Λ L1′,

with L1 of order m × r and L0 of order m × (m − r).

If we write U in the form

U = ( K1  K0 ) ( U1 ; U0 ),

with U1 of order r × m and U0 of order (n − r) × m, then X = UM can be simplified to U1L1 = I and U0L1 = 0, with in addition, of course, U1′U1 + U0′U0 = I. It follows that U1 = L1′ and U0 = VL0′, with V of order (n − r) × (m − r) satisfying V′V = I. Thus U = K1L1′ + K0VL0′.
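A small R sketch of the solution in Theorem 1, using the full singular value decomposition; the choice of V below (an identity block) is just one admissible matrix with V′V = I, valid when n ≥ m.

# Sketch: augmented Procrustes, maximize tr(U'X) subject to U'U = I
procrustes_u <- function(X, tol = 1e-12) {
  n <- nrow(X); m <- ncol(X)
  s <- svd(X, nu = n, nv = m)                     # full SVD of X
  r <- sum(s$d > tol)                             # numerical rank
  K1 <- s$u[, seq_len(r), drop = FALSE]
  K0 <- s$u[, setdiff(seq_len(n), seq_len(r)), drop = FALSE]
  L1 <- s$v[, seq_len(r), drop = FALSE]
  L0 <- s$v[, setdiff(seq_len(m), seq_len(r)), drop = FALSE]
  V  <- diag(1, n - r, m - r)                     # any V with V'V = I will do
  K1 %*% t(L1) + K0 %*% V %*% t(L0)               # U = K1 L1' + K0 V L0'
}
# Example: procrustes_u(matrix(rnorm(12), 4, 3))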


APPENDIX B
Implementation Code

B.1 Examples Dataset

library(MASS)
library(psych)
library(optimx)
library(seriation)
data(Harman)
data(bifactor)
data(Psych24)
haho