Multivariate Analysis of Variance (MANOVA)
The Multivariate Analysis of Variance (MANOVA) is the multivariate analog of the Analysis of Variance (ANOVA) procedure used for univariate data.
One-way Multivariate Analysis of Variance (One-way MANOVA)
Suppose that we have data on p variables which we can arrange in a table such as the one below:
 | Treatment 1 | Treatment 2 | … | Treatment \(g\) |
---|---|---|---|---|
Subject 1 | \(\textbf{Y}_{11}=\begin{pmatrix} Y_{111} \\ Y_{112} \\ \vdots \\ Y_{11p} \end{pmatrix}\) | \(\textbf{Y}_{21}=\begin{pmatrix} Y_{211} \\ Y_{212} \\ \vdots \\ Y_{21p} \end{pmatrix}\) | … | \(\textbf{Y}_{g1}=\begin{pmatrix} Y_{g11} \\ Y_{g12} \\ \vdots \\ Y_{g1p} \end{pmatrix}\) |
Subject 2 | \(\textbf{Y}_{12}=\begin{pmatrix} Y_{121} \\ Y_{122} \\ \vdots \\ Y_{12p} \end{pmatrix}\) | \(\textbf{Y}_{22}=\begin{pmatrix} Y_{221} \\ Y_{222} \\ \vdots \\ Y_{22p} \end{pmatrix}\) | … | \(\textbf{Y}_{g2}=\begin{pmatrix} Y_{g21} \\ Y_{g22} \\ \vdots \\ Y_{g2p} \end{pmatrix}\) |
… | … | … | … | … |
Subject \(n_i\) | \(\textbf{Y}_{1n_1}=\begin{pmatrix} Y_{1n_11} \\ Y_{1n_12} \\ \vdots \\ Y_{1n_1p} \end{pmatrix}\) | \(\textbf{Y}_{2n_2}=\begin{pmatrix} Y_{2n_21} \\ Y_{2n_22} \\ \vdots \\ Y_{2n_2p} \end{pmatrix}\) | … | \(\textbf{Y}_{gn_g}=\begin{pmatrix} Y_{gn_g1} \\ Y_{gn_g2} \\ \vdots \\ Y_{gn_gp} \end{pmatrix}\) |
In this multivariate case the scalar observations \(Y_{ij}\) of the corresponding ANOVA table are replaced by vectors of \(p\) measurements.
Assumptions
- The data from group \(i\) have common mean vector \(\boldsymbol{\mu}_i=\begin{pmatrix} \mu_{i1} \\ \mu_{i2} \\ \vdots \\ \mu_{ip} \end{pmatrix}\).
- The data from all groups have common variance-covariance matrix \(\Sigma\).
- Linearity between the dependent variables, but absence of multicollinearity: the correlation between DVs should not be too high (say, above 0.9).
- Normality: the data are multivariate normally distributed.
Here we are interested in testing the null hypothesis that the group mean vectors are all equal to one another. Mathematically this is expressed as:
\[H_0:\boldsymbol{\mu}_1=\boldsymbol{\mu}_2=\dots=\boldsymbol{\mu}_g\]
The alternative hypothesis being:
\[H_1:\mu_{ik}\ne\mu_{jk}\ \text{for at least one } i\ne j \text{ and at least one variable } k\]
Total Sum of Squares and Cross Products
In the univariate Analysis of Variance we defined the Total Sum of Squares, a scalar quantity. The multivariate analog is the Total Sum of Squares and Cross Products matrix \(\textbf{T}\), a \(p \times p\) matrix defined by the expression below:
\[\textbf{T}=\sum_{i=1}^g\sum_{j=1}^{n_i}(\textbf{Y}_{ij}-\bar{\textbf{y}}_{..})(\textbf{Y}_{ij}-\bar{\textbf{y}}_{..})'\]
Here we are looking at the differences between the vectors of observations \(\textbf{Y}_{ij}\) and the grand mean vector \(\bar{\textbf{y}}_{..}\). Each difference is a vector, which is multiplied by its own transpose.
Here the \((k,l)^{th}\) element of \(\textbf{T}\) is
\[\sum_{i=1}^g\sum_{j=1}^{n_i}(Y_{ijk}-\bar{y}_{..k})(Y_{ijl}-\bar{y}_{..l})\]
For k = l, this is the total sum of squares for variable k, and measures the total variation in the \(k^{th}\) variable. For \(k \ne l\), this measures the dependence between variables k and l across all of the observations.
We may partition the total sum of squares and cross products as follows:
\[\textbf{T}=\sum_{i=1}^g\sum_{j=1}^{n_i}(\textbf{Y}_{ij}-\bar{\textbf{y}}_{..})(\textbf{Y}_{ij}-\bar{\textbf{y}}_{..})'=\underbrace{\sum_{i=1}^g\sum_{j=1}^{n_i}(\textbf{Y}_{ij}-\bar{\textbf{y}}_{i.})(\textbf{Y}_{ij}-\bar{\textbf{y}}_{i.})'}_{\textbf{E}}+\underbrace{\sum_{i=1}^g n_i(\bar{\textbf{y}}_{i.}-\bar{\textbf{y}}_{..})(\bar{\textbf{y}}_{i.}-\bar{\textbf{y}}_{..})'}_{\textbf{H}}\]
where E is the Error Sum of Squares and Cross Products, and H is the Hypothesis Sum of Squares and Cross Products.
The partitioning of the total sum of squares and cross products matrix may be summarized in the multivariate analysis of variance table:
Source | df | SSP |
---|---|---|
Treatments | \(g-1\) | H |
Error | \(N-g\) | E |
Total | \(N-1\) | T |

where \(N=\sum_{i=1}^g n_i\) denotes the total sample size.
We wish to reject \(H_0\) if the hypothesis sum of squares and cross products matrix H is large relative to the error sum of squares and cross products matrix E.
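To make these quantities concrete, here is a minimal sketch that computes \(\textbf{T}\), \(\textbf{E}\), and \(\textbf{H}\) by hand and verifies the partition; it uses R's built-in iris data purely for illustration (an assumption, not the salary example analyzed later).

# Minimal sketch: compute T, E, and H for the built-in iris data
Y     <- as.matrix(iris[, 1:4])   # n x p matrix of observations
group <- iris$Species             # grouping factor (g = 3)
ybar  <- colMeans(Y)              # grand mean vector

# Total SSCP: T = sum_ij (Y_ij - ybar)(Y_ij - ybar)'
T_mat <- crossprod(sweep(Y, 2, ybar))

# Its (k, l)th element is a sum of cross products, e.g. k = 1, l = 2:
sum((Y[, 1] - ybar[1]) * (Y[, 2] - ybar[2]))   # equals T_mat[1, 2]

# Error SSCP: deviations of each observation from its own group mean
E_mat <- Reduce(`+`, lapply(levels(group), function(g) {
  Yg <- Y[group == g, , drop = FALSE]
  crossprod(sweep(Yg, 2, colMeans(Yg)))
}))

# Hypothesis SSCP: H = sum_i n_i (ybar_i - ybar)(ybar_i - ybar)'
H_mat <- Reduce(`+`, lapply(levels(group), function(g) {
  Yg <- Y[group == g, , drop = FALSE]
  d  <- colMeans(Yg) - ybar
  nrow(Yg) * tcrossprod(d)
}))

all.equal(T_mat, E_mat + H_mat)   # TRUE: the partition T = E + H holds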
Test Statistics for MANOVA
- Wilks' Lambda \[\Lambda^*=\frac{|\textbf{E}|}{|\textbf{H}+\textbf{E}|}\]
Here, the determinant of the error sum of squares and cross products matrix E is divided by the determinant of the total sum of squares and cross products matrix T = H + E. If H is large relative to E, then |H + E| will be large relative to |E|. Thus, we will reject the null hypothesis if Wilks' lambda is small (close to zero).
- Hotelling-Lawley Trace \[T_0^2=trace(\textbf{HE}^{-1})\]
Here, we are multiplying H by the inverse of E; then we take the trace of the resulting matrix. If H is large relative to E, then the Hotelling-Lawley trace will take a large value. Thus, we will reject the null hypothesis if this test statistic is large.
- Pillai Trace \[V=trace(\textbf{H}(\textbf{H}+\textbf{E})^{-1})\]
Here, we are multiplying H by the inverse of the total sum of squares and cross products matrix T = H + E. If H is large relative to E, then the Pillai trace will take a large value. Thus, we will reject the null hypothesis if this test statistic is large.
- Roy's Maximum Root: the largest eigenvalue of \(\textbf{HE}^{-1}\)
Here, we multiply H by the inverse of E, and then compute the largest eigenvalue of the resulting matrix. If H is large relative to E, then Roy's root will take a large value. Thus, we will reject the null hypothesis if this test statistic is large.
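All four statistics can be reproduced from \(\textbf{H}\) and \(\textbf{E}\). The sketch below (again on the built-in iris data, an illustrative assumption) derives each one from the eigenvalues of \(\textbf{HE}^{-1}\) and checks the results against summary.manova(), whose test argument supports all four.

# Sketch: the four MANOVA test statistics from H and E (built-in iris data)
Y     <- as.matrix(iris[, 1:4])
group <- iris$Species
fit   <- manova(Y ~ group)

E <- crossprod(residuals(fit))                      # error SSCP
H <- crossprod(sweep(fitted(fit), 2, colMeans(Y)))  # hypothesis SSCP

# Eigenvalues of H E^{-1} (theoretically real; Re() drops numerical noise)
lam <- Re(eigen(H %*% solve(E))$values)

det(E) / det(H + E)     # Wilks' lambda, equal to prod(1 / (1 + lam))
sum(lam)                # Hotelling-Lawley trace = trace(H E^{-1})
sum(lam / (1 + lam))    # Pillai trace = trace(H (H + E)^{-1})
max(lam)                # Roy's maximum root

# Built-in equivalents
summary(fit, test = "Wilks")
summary(fit, test = "Hotelling-Lawley")
summary(fit, test = "Pillai")
summary(fit, test = "Roy")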
Compute MANOVA in R
library("readxl")
library("rstatix")
##
## 载入程辑包:'rstatix'
## The following object is masked from 'package:stats':
##
## filter
library("mvnormtest")
library("MVTests")
## Warning: 程辑包'MVTests'是用R版本4.2.3 来建造的
##
## 载入程辑包:'MVTests'
## The following object is masked from 'package:datasets':
##
## iris
library("heplots")
## Warning: 程辑包'heplots'是用R版本4.2.3 来建造的
## 载入需要的程辑包:car
## 载入需要的程辑包:carData
## 载入需要的程辑包:broom
= read_xlsx("E:/PkuerRhenium/Psy/心理统计2/week5/example (1).xlsx")
salaryData = cbind(salaryData$salary, salaryData$salbegin)
IV = as.factor(salaryData$jobcat)
jobcat = as.factor(salaryData$gender) gender
Check for assumptions
Linearity
Draw a scatterplot of the DVs to check whether they have a linear relationship at each group level.
library(car)
scatterplot(salary ~ salbegin, data = salaryData, ellipse = TRUE,
            smooth = list(style = "lines"))
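Because the assumption concerns linearity within each group, it can help to condition the plot on the grouping factor; the sketch below (assuming the salaryData object from the import step) uses car's `|` formula syntax to draw a separate fit and ellipse per job category.

# Sketch: the same scatterplot conditioned on jobcat, one fit per group
scatterplot(salary ~ salbegin | factor(jobcat), data = salaryData,
            ellipse = TRUE, smooth = list(style = "lines"))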
Multicollinearity
Compute the correlation coefficient between the DVs; it should be below 0.9. Here we have r = 0.88. If it exceeds 0.9, consider deleting one of the variables responsible for the multicollinearity.
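A quick check (a sketch, assuming the salaryData object imported above):

# Pearson correlation between the two DVs; values above 0.9 would be problematic
cor(salaryData$salary, salaryData$salbegin)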
Equal Variance and Covariance
Significant results indicate that the covariance matrices of the DVs are not equal across groups, violating the homogeneity assumption. However, MANOVA is usually robust against violations of this assumption, so we can proceed if the sample size is decent and the number of subjects in each cell is roughly equal. In the output below, Box's M test is highly significant, so we rely on this robustness.
boxM(cbind(salary, salbegin) ~ factor(jobcat), data = salaryData)
##
## Box's M-test for Homogeneity of Covariance Matrices
##
## data: Y
## Chi-Sq (approx.) = 508.68, df = 6, p-value < 2.2e-16
Normality
mshapiro.test(t(IV))
##
## Shapiro-Wilk normality test
##
## data: Z
## W = 0.70689, p-value < 2.2e-16
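mshapiro.test() expects variables in rows, hence the transpose. Strictly speaking, multivariate normality should hold within each group, so a per-group check can be more informative (a sketch, assuming the IV matrix and jobcat factor created above):

# Sketch: multivariate Shapiro-Wilk test within each job category
for (g in levels(jobcat)) {
  print(mshapiro.test(t(IV[jobcat == g, ])))
}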
MANOVA
fit = manova(IV ~ jobcat)
summary(fit) # for manova
## Df Pillai approx F num Df den Df Pr(>F)
## jobcat 2 0.67298 119.43 4 942 < 2.2e-16 ***
## Residuals 471
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.aov(fit) # for individual ANOVAs
## Response 1 :
## Df Sum Sq Mean Sq F value Pr(>F)
## jobcat 2 8.9438e+10 4.4719e+10 434.48 < 2.2e-16 ***
## Residuals 471 4.8478e+10 1.0293e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 2 :
## Df Sum Sq Mean Sq F value Pr(>F)
## jobcat 2 1.7926e+10 8962772266 371.11 < 2.2e-16 ***
## Residuals 471 1.1375e+10 24151508
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(effectsize)
## 
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared
effectsize::eta_squared(fit)
## # Effect Size for ANOVA (Type I)
##
## Parameter | Eta2 (partial) | 95% CI
## -----------------------------------------
## jobcat | 0.34 | [0.30, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
Post-hoc Test
<- salaryData %>%
pwc gather(key = "variables", value = "value", salary, salbegin) %>%
group_by(variables) %>%
games_howell_test(value ~ jobcat) %>%
select(-estimate, -conf.low, -conf.high) # Remove details
pwc
## # A tibble: 6 × 6
## variables .y. group1 group2 p.adj p.adj.signif
## <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 salary value 1 2 1.22e- 6 ****
## 2 salary value 1 3 4.25e-10 ****
## 3 salary value 2 3 4.21e-10 ****
## 4 salbegin value 1 2 6 e- 3 **
## 5 salbegin value 1 3 2.74e-10 ****
## 6 salbegin value 2 3 4.77e-10 ****