Multivariate Analysis of Variance (MANOVA)

Rhenium Yuan

2023-03-27

Multivariate Analysis of Variance (MANOVA)

The Multivariate Analysis of Variance (MANOVA) is the multivariate analog of the Analysis of Variance (ANOVA) procedure used for univariate data.

One-way Multivariate Analysis of Variance (One-way MANOVA)

Suppose that we have data on p variables which we can arrange in a table such as the one below:

Treatment 1 Treatment 2 … Treatment \(g\)
Subject 1 \(\textbf{Y}_{11}=\begin{pmatrix} Y_{111} \\ Y_{112} \\ \vdots \\ Y_{11p} \end{pmatrix}\) \(\textbf{Y}_{21}=\begin{pmatrix} Y_{211} \\ Y_{212} \\ \vdots \\ Y_{21p} \end{pmatrix}\) … \(\textbf{Y}_{g1}=\begin{pmatrix} Y_{g11} \\ Y_{g12} \\ \vdots \\ Y_{g1p} \end{pmatrix}\)
Subject 2 \(\textbf{Y}_{12}=\begin{pmatrix} Y_{121} \\ Y_{122} \\ \vdots \\ Y_{12p} \end{pmatrix}\) \(\textbf{Y}_{22}=\begin{pmatrix} Y_{221} \\ Y_{222} \\ \vdots \\ Y_{22p} \end{pmatrix}\) … \(\textbf{Y}_{g2}=\begin{pmatrix} Y_{g21} \\ Y_{g22} \\ \vdots \\ Y_{g2p} \end{pmatrix}\)
⋮
Subject \(n_i\) \(\textbf{Y}_{1n_1}=\begin{pmatrix} Y_{1n_11} \\ Y_{1n_12} \\ \vdots \\ Y_{1n_1p} \end{pmatrix}\) \(\textbf{Y}_{2n_2}=\begin{pmatrix} Y_{2n_21} \\ Y_{2n_22} \\ \vdots \\ Y_{2n_2p} \end{pmatrix}\) … \(\textbf{Y}_{gn_g}=\begin{pmatrix} Y_{gn_g1} \\ Y_{gn_g2} \\ \vdots \\ Y_{gn_gp} \end{pmatrix}\)

In this multivariate case, the scalar quantities \(Y_{ij}\) of the corresponding ANOVA table are replaced by vectors of \(p\) observations.

Assumptions

  1. The data from group i have a common mean vector \(\boldsymbol{\mu}_i=\begin{pmatrix} \mu_{i1} \\ \mu_{i2} \\ \vdots \\ \mu_{ip} \end{pmatrix}\)

  2. The data from all groups have a common variance-covariance matrix \(\Sigma\).

  3. Linearity between the dependent variables, but absence of multicollinearity: the correlation between DVs should not be too high (say, above 0.9).

  4. Normality: The data are multivariate normally distributed.

Here we are interested in testing the null hypothesis that the group mean vectors are all equal to one another. Mathematically this is expressed as:

\[H_0:\boldsymbol{\mu}_1=\boldsymbol{\mu}_2=\dots=\boldsymbol{\mu}_g\]

The alternative hypothesis being:

\[H_1:\mu_{ik}\ne\mu_{jk}\ \text{for at least one } i\ne j \text{ and at least one variable } k\]

Total Sum of Squares and Cross Products

In the univariate Analysis of Variance, we defined the Total Sum of Squares, a scalar quantity. The multivariate analog is the Total Sum of Squares and Cross Products (SSCP) matrix, a \(p \times p\) matrix defined by the expression below:

\[\textbf{T}=\sum_{i=1}^g\sum_{j=1}^{n_i}(\textbf{Y}_{ij}-\bar{\textbf{y}}_{..})(\textbf{Y}_{ij}-\bar{\textbf{y}}_{..})'\]

Here we are looking at the differences between the vectors of observations \(\textbf{Y}_{ij}\) and the grand mean vector \(\bar{\textbf{y}}_{..}\). Each difference is a \(p \times 1\) vector, which is multiplied by its transpose to give a \(p \times p\) matrix.

Here the \((k,l)^{th}\) element of \(\textbf{T}\) is

\[\sum_{i=1}^g\sum_{j=1}^{n_i}(Y_{ijk}-\bar{y}_{..k})(Y_{ijl}-\bar{y}_{..l})\]

For k = l, this is the total sum of squares for variable k, and measures the total variation in the \(k^{th}\) variable. For \(k \ne l\), this measures the dependence between variables k and l across all of the observations.

We may partition the total sum of squares and cross products as follows:

\[\textbf{T}=\sum_{i=1}^g\sum_{j=1}^{n_i}(\textbf{Y}_{ij}-\bar{\textbf{y}}_{..})(\textbf{Y}_{ij}-\bar{\textbf{y}}_{..})'=\underbrace{\sum_{i=1}^g\sum_{j=1}^{n_i}(\textbf{Y}_{ij}-\bar{\textbf{y}}_{i.})(\textbf{Y}_{ij}-\bar{\textbf{y}}_{i.})'}_{\textbf{E}}+\underbrace{\sum_{i=1}^g n_i(\bar{\textbf{y}}_{i.}-\bar{\textbf{y}}_{..})(\bar{\textbf{y}}_{i.}-\bar{\textbf{y}}_{..})'}_{\textbf{H}}\]

where E is the Error Sum of Squares and Cross Products, and H is the Hypothesis Sum of Squares and Cross Products.
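To make the partition concrete, here is a minimal sketch that computes T, E, and H directly from the formulas and verifies that they add up. It uses the built-in iris data (fully qualified as datasets::iris, since the MVTests package loaded later masks iris) rather than the salary data analyzed below:

# A minimal sketch: compute T, E, and H from the definitions and check T = E + H
Y   <- as.matrix(datasets::iris[, 1:4])  # n x p data matrix (p = 4 variables)
grp <- datasets::iris$Species            # grouping factor (g = 3 treatments)
grand <- colMeans(Y)                     # grand mean vector

Tot <- crossprod(sweep(Y, 2, grand))     # T: total SSCP matrix ("T" itself aliases TRUE in R)
E <- matrix(0, ncol(Y), ncol(Y))
H <- matrix(0, ncol(Y), ncol(Y))
for (lev in levels(grp)) {
  Yi <- Y[grp == lev, , drop = FALSE]    # observations in group i
  mi <- colMeans(Yi)                     # group mean vector
  E <- E + crossprod(sweep(Yi, 2, mi))           # within-group SSCP
  H <- H + nrow(Yi) * tcrossprod(mi - grand)     # between-group SSCP
}
all.equal(Tot, E + H, check.attributes = FALSE)  # TRUE: the partition holds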

The partitioning of the total sum of squares and cross products matrix may be summarized in the multivariate analysis of variance table:

Source      df     SSP
Treatments  g - 1  H
Error       N - g  E
Total       N - 1  T

where \(N=n_1+n_2+\dots+n_g\) is the total sample size.

We wish to reject \(H_0\) if the hypothesis sum of squares and cross products matrix H is large relative to the error sum of squares and cross products matrix E.

Test Statistics for MANOVA

  1. Wilks' Lambda \[\Lambda^*=\frac{|\textbf{E}|}{|\textbf{H}+\textbf{E}|}\]

Here, the determinant of the error sum of squares and cross products matrix E is divided by the determinant of the total sum of squares and cross products matrix T = H + E. If H is large relative to E, then |H + E| will be large relative to |E|. Thus, we will reject the null hypothesis if Wilks' lambda is small (close to zero).

  2. Hotelling-Lawley Trace \[T_0^2=\text{trace}(\textbf{HE}^{-1})\]

Here, we are multiplying H by the inverse of E; then we take the trace of the resulting matrix. If H is large relative to E, then the Hotelling-Lawley trace will take a large value. Thus, we will reject the null hypothesis if this test statistic is large.

  3. Pillai Trace \[V=\text{trace}(\textbf{H}(\textbf{H}+\textbf{E})^{-1})\]

Here, we are multiplying H by the inverse of the total sum of squares and cross products matrix T = H + E. If H is large relative to E, then the Pillai trace will take a large value. Thus, we will reject the null hypothesis if this test statistic is large.

  4. Roy's Maximum Root: the largest eigenvalue of \(\textbf{HE}^{-1}\)

Here, we multiply H by the inverse of E and then compute the largest eigenvalue of the resulting matrix. If H is large relative to E, then Roy's root will take a large value. Thus, we will reject the null hypothesis if this test statistic is large.
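All four statistics are simple functions of the eigenvalues \(\lambda\) of \(\textbf{HE}^{-1}\). Continuing the sketch above (reusing its E and H), they can be computed directly:

# Continuing the sketch above: the four MANOVA statistics from E and H
HEinv  <- H %*% solve(E)                 # the matrix H E^{-1}
lambda <- Re(eigen(HEinv)$values)        # its eigenvalues (real up to rounding)

det(E) / det(H + E)                      # Wilks' Lambda ...
prod(1 / (1 + lambda))                   # ... equivalently, via the eigenvalues
sum(diag(HEinv))                         # Hotelling-Lawley trace
sum(diag(H %*% solve(H + E)))            # Pillai trace
max(lambda)                              # Roy's maximum root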

Compute MANOVA in R

library("readxl")
library("rstatix")
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
library("mvnormtest")
library("MVTests")
## Warning: package 'MVTests' was built under R version 4.2.3
## 
## Attaching package: 'MVTests'
## The following object is masked from 'package:datasets':
## 
##     iris
library("heplots")
## Warning: package 'heplots' was built under R version 4.2.3
## Loading required package: car
## Loading required package: carData
## Loading required package: broom
salaryData = read_xlsx("E:/PkuerRhenium/Psy/心理统计2/week5/example (1).xlsx")
DVs = cbind(salaryData$salary, salaryData$salbegin)  # matrix of the two dependent variables
jobcat = as.factor(salaryData$jobcat)                # IV: job category (3 levels)
gender = as.factor(salaryData$gender)                # IV: gender

Check the assumptions

Linearity

Draw a scatterplot of the DVs to check whether they have a linear relationship at each group level.

library(car)
scatterplot(salary ~ salbegin, data=salaryData, ellipse=TRUE, 
  smooth=list(style="lines"))

Multicollinearity

Compute the correlation coefficient between the DVs; it should be below 0.9. Here we have r = 0.88. If it exceeds 0.9, consider deleting the variables that produce the multicollinearity.
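A quick check with base R (the 0.9 cutoff is a common rule of thumb, not a hard threshold):

cor(salaryData$salary, salaryData$salbegin)  # ~0.88 here, below the 0.9 cutoff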

Equal Variance and Covariance

Significant results indicate that the covariance matrices of the DVs are not equal across groups, violating the homogeneity assumption. However, MANOVA is usually robust against violations of this assumption. We can proceed if the sample size is decent and the number of subjects in each cell is roughly equal.

boxM(cbind(salary, salbegin) ~ factor(jobcat), data = salaryData)
## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  Y
## Chi-Sq (approx.) = 508.68, df = 6, p-value < 2.2e-16

Normality

A significant result on the multivariate Shapiro-Wilk test indicates that the DVs deviate from multivariate normality, as is the case here; with large samples, MANOVA is generally considered fairly robust to this violation as well.

mshapiro.test(t(DVs))  # multivariate Shapiro-Wilk test; expects variables in rows
## 
##  Shapiro-Wilk normality test
## 
## data:  Z
## W = 0.70689, p-value < 2.2e-16
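One can also check the univariate normality of each DV within each group. A sketch using rstatix's shapiro_test (this assumes dplyr is available for group_by, as in the post-hoc step below):

library(dplyr)  # for group_by()
salaryData %>%
  group_by(jobcat) %>%
  shapiro_test(salary, salbegin)  # univariate Shapiro-Wilk per DV per group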

MANOVA

fit = manova(DVs ~ jobcat)
summary(fit)  # overall MANOVA; reports the Pillai trace by default
##            Df  Pillai approx F num Df den Df    Pr(>F)    
## jobcat      2 0.67298   119.43      4    942 < 2.2e-16 ***
## Residuals 471                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
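The other test statistics can be requested through the test argument of summary():

summary(fit, test = "Wilks")             # Wilks' Lambda
summary(fit, test = "Hotelling-Lawley")  # Hotelling-Lawley trace
summary(fit, test = "Roy")               # Roy's maximum root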
summary.aov(fit)  # follow-up univariate ANOVAs, one per DV
##  Response 1 :
##              Df     Sum Sq    Mean Sq F value    Pr(>F)    
## jobcat        2 8.9438e+10 4.4719e+10  434.48 < 2.2e-16 ***
## Residuals   471 4.8478e+10 1.0293e+08                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 2 :
##              Df     Sum Sq    Mean Sq F value    Pr(>F)    
## jobcat        2 1.7926e+10 8962772266  371.11 < 2.2e-16 ***
## Residuals   471 1.1375e+10   24151508                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(effectsize)
## 
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared
effectsize::eta_squared(fit)
## # Effect Size for ANOVA (Type I)
## 
## Parameter | Eta2 (partial) |       95% CI
## -----------------------------------------
## jobcat    |           0.34 | [0.30, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
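By commonly cited benchmarks for eta squared (roughly 0.01 small, 0.06 medium, 0.14 large), the partial eta squared of 0.34 corresponds to a large effect of job category on the combined DVs.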

Post-hoc Tests

Games-Howell pairwise comparisons are run separately for each DV. This test does not assume equal variances, which is convenient here given the significant Box's M test above.

library(tidyr)  # for gather()
pwc <- salaryData %>%
  gather(key = "variables", value = "value", salary, salbegin) %>%  # reshape to long format
  group_by(variables) %>%                  # one set of comparisons per DV
  games_howell_test(value ~ jobcat) %>%    # pairwise comparisons across job categories
  select(-estimate, -conf.low, -conf.high) # remove details
pwc
## # A tibble: 6 × 6
##   variables .y.   group1 group2    p.adj p.adj.signif
##   <chr>     <chr> <chr>  <chr>     <dbl> <chr>       
## 1 salary    value 1      2      1.22e- 6 ****        
## 2 salary    value 1      3      4.25e-10 ****        
## 3 salary    value 2      3      4.21e-10 ****        
## 4 salbegin  value 1      2      6   e- 3 **          
## 5 salbegin  value 1      3      2.74e-10 ****        
## 6 salbegin  value 2      3      4.77e-10 ****
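All adjusted p-values are significant: both current salary and beginning salary differ between every pair of job categories.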