To follow up the MANOVA, we applied pairwise PERMANOVAs on the distance matrix between groups, univariate ANOVAs on rank-transformed data, and linear discriminant analyses (LDA). Pairwise PERMANOVAs inform on the multivariate distances between groups, univariate ANOVAs inform on variation driven by a single variable, and LDA finds a linear combination of features that characterizes or separates the classes of objects (Field et al., 2012). As for the assumption tests, we applied a false discovery rate (FDR) correction to control for the inflation of Type I error under multiple testing.
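As a minimal sketch of this correction step (with illustrative p-values, not values from our data), the FDR adjustment can be reproduced with base R's p.adjust:
p_raw <- c(0.004, 0.02, 0.035, 0.21, 0.6)
# Benjamini-Hochberg FDR adjustment, the same method requested below via
# p.method = "fdr" and adjust_pvalue(method = "fdr")
p.adjust(p_raw, method = "fdr")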
Pairwise PERMANOVA
Pairwise PERMANOVA was conducted using the pairwise.perm.manova function from the RVAideMemoire package on a distance matrix produced with the vegdist function from the vegan package.
dist_dml <- vegan::vegdist(x = as.matrix(dISO[, 2:12]), method = "euclidean",
binary = FALSE, diag = TRUE, upper = TRUE, na.rm = FALSE)
pairwise_permanova <- pairwise.perm.manova(resp = dist_dml, fact = dISO$Group,
test = "Wilks", nperm = 999, progress = FALSE, p.method = "fdr",
F = TRUE, R2 = TRUE)
# Extract the relevant information and create a data frame (the p-value
# matrix is square here, so repeating column names with 'each' and row
# names with 'times' matches as.vector()'s column-major order)
pairwisePermanova <- as.data.frame(cbind(
    group1 = rep(dimnames(pairwise_permanova$p.value)[[2]],
        each = ncol(pairwise_permanova$p.value)),
    group2 = rep(dimnames(pairwise_permanova$p.value)[[1]],
        times = nrow(pairwise_permanova$p.value)),
    `statistic(pseudo-F)` = round(as.vector(pairwise_permanova$F.value), 3),
    `adjusted p-value` = round(as.vector(pairwise_permanova$p.value), 3),
    R2 = round(as.vector(pairwise_permanova$R2.value), 3)))
# Drop the redundant rows (NA cells of the pairwise matrix)
pairwisePermanova <- pairwisePermanova |>
    drop_na()
# Highlight significant rows (the column is character after cbind(), so convert)
color.me <- which(as.numeric(pairwisePermanova$`adjusted p-value`) < 0.05)
kbl(pairwisePermanova, format = "html", booktabs = TRUE, caption = pairwise_permanova$method) |>
kable_classic(full_width = F, fixed_thead = T, bootstrap_options = c("hover",
"responsive")) |>
row_spec(color.me, bold = TRUE, background = "#ececec") |>
row_spec(0, italic = TRUE)
permutation MANOVAs on a distance matrix

| group1 | group2 | statistic(pseudo-F) | adjusted p-value | R2 |
|---|---|---|---|---|
| El Molo | Turkana | 9.209 | 0.045 | 0.397 |
| El Molo | Luhya (Webuye) | 0.95 | 0.473 | 0.064 |
| El Molo | Luhya (Port Vict.) | 0.294 | 0.719 | 0.022 |
| El Molo | Luo (Port Vict.) | 8.136 | 0.045 | 0.352 |
| Turkana | Luhya (Webuye) | 3.453 | 0.17 | 0.223 |
| Turkana | Luhya (Port Vict.) | 6.484 | 0.05 | 0.371 |
| Turkana | Luo (Port Vict.) | 0.007 | 0.959 | 0.001 |
| Luhya (Webuye) | Luhya (Port Vict.) | 0.228 | 0.719 | 0.02 |
| Luhya (Webuye) | Luo (Port Vict.) | 3.079 | 0.17 | 0.192 |
| Luhya (Port Vict.) | Luo (Port Vict.) | 5.017 | 0.095 | 0.295 |
By convention, \(R^2\) values of 0.01, 0.09, and 0.25 correspond to small, medium, and large effects, respectively (Cohen, 1988). We found two significant comparisons: between the El Molo and the Turkana, and between the El Molo and the Luo (Port Victoria).
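As a minimal sketch (assuming the pairwisePermanova data frame built above, whose columns are stored as character after cbind()), these conventional cut-offs can be applied directly:
# Label the pairwise PERMANOVA R2 values with Cohen's conventional cut-offs
r2 <- as.numeric(pairwisePermanova$R2)
cut(r2, breaks = c(-Inf, 0.01, 0.09, 0.25, Inf),
    labels = c("negligible", "small", "medium", "large"), right = FALSE)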
Univariate ANOVA
As the data do not meet the ANOVA assumptions, namely normality, we ran the analyses on rank-transformed data, which is equivalent to a Kruskal-Wallis test; for this we used the kruskal_test function from the rstatix package. The toy sketch below illustrates the equivalence.
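This is a minimal illustration on simulated data (unrelated to our sample): the Kruskal-Wallis \(H\) statistic equals \((N-1)\,SS_{between}/SS_{total}\) computed on the ranked values.
# Toy demonstration that Kruskal-Wallis is an ANOVA on ranks:
# H = (N - 1) * SS_between / SS_total of the ranked data
set.seed(42)
toy <- data.frame(value = rnorm(30), Group = rep(LETTERS[1:3], each = 10))
r <- rank(toy$value)
ss_between <- sum(tapply(r, toy$Group, function(x) length(x) * (mean(x) - mean(r))^2))
ss_total <- sum((r - mean(r))^2)
c(manual = (length(r) - 1) * ss_between / ss_total,
    kruskal = unname(kruskal.test(value ~ Group, data = toy)$statistic))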
followup_kruskal <- grouped.data |>
kruskal_test(value ~ Group) |>
adjust_pvalue(method = "fdr")
# followup_kruskal <- followup_kruskal[,
# !names(followup_kruskal) %in% '.y.']
followup_effsize <- grouped.data |>
kruskal_effsize(value ~ Group)
followup_effsize <- followup_effsize |>
select(variable, effsize, magnitude)
merged_data <- followup_kruskal |>
left_join(followup_effsize, by = "variable") |>
select(variable, n, df, statistic, p.adj, method, effsize,
magnitude)
colnames(merged_data) <- c("variable", "n", "df", "statistic (H)",
"adjusted p-value", "method", "eta-squared", "magnitude")
color.me <- which(merged_data$`adjusted p-value` < 0.05)
# Convert the R object into a html table
kbl(merged_data, digits = c(3, 3, 0, 3, 3), format = "html",
booktabs = TRUE, caption = "Univariate Kruskal-Wallis") |>
kable_classic(full_width = F, fixed_thead = T, bootstrap_options = c("hover",
"responsive")) |>
row_spec(color.me, bold = TRUE, background = "#ececec") |>
row_spec(0, italic = TRUE)
Univariate Kruskal-Wallis

| variable | n | df | statistic (H) | adjusted p-value | method | eta-squared | magnitude |
|---|---|---|---|---|---|---|---|
| Asfc | 37 | 4 | 4.322 | 0.500 | Kruskal-Wallis | 0.010 | small |
| HAsfc | 37 | 4 | 10.121 | 0.106 | Kruskal-Wallis | 0.191 | large |
| Sda | 37 | 4 | 7.454 | 0.209 | Kruskal-Wallis | 0.108 | moderate |
| Sdq | 37 | 4 | 2.170 | 0.839 | Kruskal-Wallis | -0.057 | small |
| Sdv | 37 | 4 | 6.537 | 0.255 | Kruskal-Wallis | 0.079 | moderate |
| Sp | 37 | 4 | 1.853 | 0.839 | Kruskal-Wallis | -0.067 | moderate |
| Ssk | 37 | 4 | 23.072 | 0.001 | Kruskal-Wallis | 0.596 | large |
| Sxp | 37 | 4 | 10.294 | 0.106 | Kruskal-Wallis | 0.197 | large |
| Sz | 37 | 4 | 0.809 | 0.937 | Kruskal-Wallis | -0.100 | moderate |
| Tfv | 37 | 4 | 12.638 | 0.073 | Kruskal-Wallis | 0.270 | large |
| epLsar | 37 | 4 | 8.370 | 0.174 | Kruskal-Wallis | 0.137 | moderate |
At p < 0.05, only Ssk was significantly different between groups, although three other variables (HAsfc, Sxp, and Tfv) also have large effect sizes. In our previous study (Correia et al., 2021), HAsfc and Tfv were likewise not significantly different despite large effect sizes.
Next, we conducted pairwise comparisons for the significantly different variable Ssk. We used the dunn_test function from the rstatix package, which respects the ranked data of the Kruskal-Wallis test while controlling for multiple comparisons. We chose Dunn's test over Games-Howell because the former addresses distribution issues, whereas the latter is more appropriate when there is no homogeneity of variances, which is not as problematic here (see @ref(assumptions)). Finally, we calculated Hedges' g as a measure of effect size for each pairwise comparison.
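For reference, the hedges.correction = TRUE option used below applies the standard small-sample bias correction to Cohen's d, approximately \(g \approx d\left(1 - \frac{3}{4(n_1 + n_2) - 9}\right)\), where \(n_1\) and \(n_2\) are the sizes of the two groups being compared.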
pairwise <- dunn_test(dISO, Ssk ~ Group, p.adjust.method = "fdr",
detailed = TRUE)
pairwise_effsize <- dISO |>
rstatix::cohens_d(formula = Ssk ~ Group, paired = FALSE,
hedges.correction = TRUE)
pairwise_effsize <- pairwise_effsize |>
select(group1, group2, effsize, magnitude)
merged_data_effsize <- pairwise |>
left_join(pairwise_effsize, by = c("group1", "group2")) |>
select(group1, group2, estimate, statistic, p.adj, effsize,
magnitude)
colnames(merged_data_effsize) <- c("group1", "group2", "mean difference",
    "statistic (z)", "adjusted p-value", "Hedges' g", "magnitude")
color.me <- which(merged_data_effsize$`adjusted p-value` < 0.05)
kbl(merged_data_effsize, digits = c(3, 3, 3, 3, 3, 3), format = "html",
booktabs = TRUE, caption = "Pairwise comparisons for Ssk using Dunn's test") |>
kable_classic(full_width = F, fixed_thead = T, bootstrap_options = c("hover",
"responsive")) |>
row_spec(color.me, bold = TRUE, background = "#ececec") |>
row_spec(0, italic = TRUE)
Pairwise comparisons for Ssk using Dunn’s test

| group1 | group2 | mean difference | statistic (z) | adjusted p-value | Hedges' g | magnitude |
|---|---|---|---|---|---|---|
| El Molo | Turkana | -9.048 | -1.659 | 0.162 | 1.238 | large |
| El Molo | Luhya (Webuye) | -9.333 | -1.711 | 0.162 | 1.041 | large |
| El Molo | Luhya (Port Vict.) | -16.000 | -2.805 | 0.019 | 1.611 | large |
| El Molo | Luo (Port Vict.) | -24.333 | -4.626 | 0.000 | 2.329 | large |
| Turkana | Luhya (Webuye) | -0.286 | -0.049 | 0.961 | -0.046 | negligible |
| Turkana | Luhya (Port Vict.) | -6.952 | -1.154 | 0.298 | 0.740 | moderate |
| Turkana | Luo (Port Vict.) | -15.286 | -2.729 | 0.019 | 1.588 | large |
| Luhya (Webuye) | Luhya (Port Vict.) | -6.667 | -1.107 | 0.298 | 0.718 | moderate |
| Luhya (Webuye) | Luo (Port Vict.) | -15.000 | -2.678 | 0.019 | 1.529 | large |
| Luhya (Port Vict.) | Luo (Port Vict.) | -8.333 | -1.426 | 0.220 | 0.808 | large |
# boxplot
pairwise <- pairwise |>
add_xy_position(x = "Group")
ggplot(dISO, aes(y = Ssk, x = Group)) +
    geom_boxplot(aes(fill = Group), show.legend = FALSE) +
    scale_fill_manual(values = cbbPalette) +
    stat_pvalue_manual(pairwise, label = "p = {scales::pvalue(p.adj)}",
        tip.length = 0.01, size = 2.5, hide.ns = "p.adj", step.increase = 0.02) +
    my_theme

[Figure: boxplots of Ssk by group, annotated with FDR-adjusted p-values from Dunn's test]
For the Ssk variable, we observe that four comparisons are significantly different. Interestingly, most comparisons have large or moderate effect sizes, suggesting that the sample size is too small to detect more significant comparisons. The one exception is the negligible effect size in the comparison between the Turkana and the Luhya (Webuye).
Canonical discriminant analysis
Canonical discriminant analyses (CDA) find a linear combination of features that characterizes or separates two or more classes of objects or events. This approach makes important demands in terms of data distribution and sample size. However, alternatives such as regularized or quadratic discriminant analyses are more adequate for predictive rather than descriptive analyses. In addition, even when dealing with non-normal and high-dimensional data, CDA axes retain discriminability between classes, although the cut-off points between classes may be incorrect; in other words, a CDA on non-normal data cannot be generalized to other data (Hastie, Tibshirani, & Friedman, 2009, p. 110). Here, CDA was conducted using the candisc function from the candisc package, while a regularized (shrunken) LDA was conducted using the Slda function from the HiDimDA package, with the goal of assessing the reliability of the CDA approach.
ISOmanova <- lm(as.matrix(dISO[, 2:12]) ~ Group, dISO)
# summary(manova(ISOmanova), test = 'Wilks')
ISO.can <- candisc(ISOmanova)
ISO.can
##
## Canonical Discriminant Analysis for Group:
##
## CanRsq Eigenvalue Difference Percent Cumulative
## 1 0.64629 1.82717 0.32596 41.2101 41.210
## 2 0.60019 1.50121 0.32596 33.8584 75.068
## 3 0.45336 0.82935 0.32596 18.7052 93.774
## 4 0.21634 0.27606 0.32596 6.2263 100.000
##
## Test of H0: The canonical correlations in the
## current row and all that follow are zero
##
## LR test stat approx F numDF denDF Pr(> F)
## 1 0.06058 2.11593 44 86.12 0.001524 **
## 2 0.17127 1.87334 30 68.19 0.016817 *
## 3 0.42838 1.40763 18 48.00 0.171547
## 4 0.78366 0.86268 8 25.00 0.559555
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Create a data frame for the scores
scores_df <- data.frame(
Group = ISO.can$scores$Group,
Can1 = ISO.can$scores$Can1,
Can2 = ISO.can$scores$Can2
)
# Extract the relevant information from the candisc object
loadings_df <- data.frame(
Variable = rownames(ISO.can$structure),
Can1 = ISO.can$structure[, 1],
Can2 = ISO.can$structure[, 2])
# Calculate the percentage of variance explained by each canonical variable
eigenvalues <- ISO.can$eigenvalues
total_variance <- sum(eigenvalues)
variance_explained <- eigenvalues / total_variance * 100
# Define axis titles including the percentage of variance explained
x_axis_title <- paste("Can1 (", round(variance_explained[1], 1), "%)", sep = "")
y_axis_title <- paste("Can2 (", round(variance_explained[2], 1), "%)", sep = "")
# Calculate centroids for each group
centroids <- aggregate(cbind(Can1, Can2) ~ Group, scores_df, mean)
# Create the biplot using ggplot2
ggplot() +
geom_point(data = scores_df, aes(x = Can1, y = Can2, color = Group), size = 2) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
stat_ellipse(data = scores_df, aes(
x = Can1, y = Can2, fill = Group),
geom = "polygon", level = 0.68, alpha = 0.2) + # Add ellipses for group
geom_text(data = centroids, aes(
x = Can1, y = Can2, label = "+"),
size = 6, color = cbbPalette) + # Add "+" symbols for centroids
geom_text_repel(data = loadings_df, aes(
x = Can1 * 6, y = Can2 * 6, label = Variable),
size = 3, hjust = 1, vjust = 0.5, box.padding = 0.5) +
geom_segment(data = loadings_df, aes(x = 0, y = 0,
xend = Can1*5.5, yend = Can2*5.5),
arrow = arrow(type = "open", length = unit(0.2, "cm")),
linewidth = 0.5, lineend = "round") +
geom_text(data = centroids, aes(
x = Can1, y = Can2, label = Group),
size = 4, color = cbbPalette, fontface = 2,
nudge_x = c(-1.0, 0.5, -0.5, 1.5, 1.0),
nudge_y = c(-1.5, -1.2, 1.5, 1.2, -1.0)) + # Add group labels
scale_color_manual(values = cbbPalette) +
scale_fill_manual(values = cbbPalette) +
coord_fixed(ratio = 1) +
my_theme + labs(x = x_axis_title, y = y_axis_title) +
guides(colour = "none", fill = "none")

[Figure: CDA biplot of Can1 vs. Can2 with group ellipses, centroids, and variable vectors]
heplot(ISO.can, var.col = "black", fill = TRUE, fill.alpha = 0.1)

[Figure: HE plot of the Group effect on Can1 vs. Can2]
## Vector scale factor set to 6.435736
plot(ISO.can, which = 1, col = cbbPalette, cex.axis = 0.7, cex.lab = 0.9,
    points.1d = TRUE, rev.axes = TRUE, pch = 15:19,
    var.lwd = 2, var.col = "black")

[Figure: canonical scores and variable vectors on Can1]
plot(ISO.can, which = 2, col = cbbPalette, cex.axis = 0.7, cex.lab = 0.8,
    points.1d = TRUE, rev.axes = TRUE, pch = 15:19,
    var.lwd = 2, var.col = "black")

[Figure: canonical scores and variable vectors on Can2]
# Create a data frame for the scores
scores_df <- data.frame(
Group = ISO.can$scores$Group,
Can3 = ISO.can$scores$Can3,
Can4 = ISO.can$scores$Can4
)
# Extract the relevant information from the candisc object
loadings_df <- data.frame(
Variable = rownames(ISO.can$structure),
Can3 = ISO.can$structure[, 3],
Can4 = ISO.can$structure[, 4])
# Define axis titles including the percentage of variance explained
x_axis_title <- paste("Can3 (", round(variance_explained[3], 1), "%)", sep = "")
y_axis_title <- paste("Can4 (", round(variance_explained[4], 1), "%)", sep = "")
# Calculate centroids for each group
centroids <- aggregate(cbind(Can3, Can4) ~ Group, scores_df, mean)
# Create the biplot using ggplot2
ggplot() +
geom_point(data = scores_df, aes(x = Can3, y = Can4, color = Group), size = 2) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
stat_ellipse(data = scores_df, aes(
x = Can3, y = Can4, fill = Group),
geom = "polygon", level = 0.68, alpha = 0.2) + # Add ellipses for group
geom_text(data = centroids, aes(
x = Can3, y = Can4, label = "+"),
size = 6, color = cbbPalette) + # Add "+" symbols for centroids
geom_text_repel(data = loadings_df, aes(
x = Can3 * 6, y = Can4 * 6, label = Variable),
size = 3, hjust = 1, vjust = 0.5, box.padding = 0.5) +
geom_segment(data = loadings_df, aes(x = 0, y = 0,
xend = Can3*5.5, yend = Can4*5.5),
arrow = arrow(type = "open", length = unit(0.2, "cm")),
linewidth = 0.5, lineend = "round") +
geom_text(data = centroids, aes(
x = Can3, y = Can4, label = Group),
size = 4, color = cbbPalette, fontface = 2,
nudge_x = c(1.0, -0.5, 0.0, -1.0, 1.5),
nudge_y = c(1.5, -1.0, -1.0, 1.2, -0.5)) + # Add group labels
scale_color_manual(values = cbbPalette) +
scale_fill_manual(values = cbbPalette) +
coord_fixed(ratio = 1) +
my_theme + labs(x = x_axis_title, y = y_axis_title) +
guides(colour = "none", fill = "none")

[Figure: CDA biplot of Can3 vs. Can4 with group ellipses, centroids, and variable vectors]
heplots::heplot(ISO.can, variables = 3:4, var.col = "black",
    fill = TRUE, fill.alpha = 0.1)

[Figure: HE plot of the Group effect on Can3 vs. Can4]
## Vector scale factor set to 3.034421
The CDA finds four canonical vectors, but only the first two have group means that are significantly different; together, these two explain 75% of the observed variation.
The HE plot represents the sums-of-squares-and-products matrices for the linear hypothesis (H) and for error (E). In the default scaling, such plots have the visual property that a given effect in the multivariate linear model is significant (by Roy's maximum root test) if the H ellipse projects anywhere outside the E ellipse. The size of the H ellipse relative to that of the E ellipse indicates the magnitude of the multivariate effect for group. The further the variable centroids are from the "Error" circle, the stronger the association between groups and variables (Friendly & Sigal, 2017).
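As a cross-check of this visual test, Roy's maximum root statistic for the Group effect can be obtained directly from the fitted multivariate model; this one-liner is a sketch (assuming the ISOmanova object fitted above), not part of the original analysis output:
# Roy's maximum root test for Group, the test the HE plot visualizes
car::Anova(ISOmanova, test.statistic = "Roy")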
In the biplot, each data point is represented by its score on the discriminant axes. The variable vectors represent the structure coefficients: the angle between a vector and a canonical axis represents the correlation of that variable with the canonical scores, while the vector's relative length reflects the sum of the squared correlations and hence the variable's contribution to the discrimination between groups (Field et al., 2012, p. 738).
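As a minimal sketch of that last point (assuming the ISO.can object above, with its first two structure columns corresponding to Can1 and Can2), the vector lengths can be computed directly:
# Vector length in the Can1-Can2 plane: square root of the sum of squared
# structure coefficients, i.e. each variable's contribution to discrimination
vec_len <- sqrt(rowSums(ISO.can$structure[, 1:2]^2))
sort(round(vec_len, 2), decreasing = TRUE)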
scaled_ISO <- dISO |>
mutate(across(where(is.numeric), scale))
Slda <- Slda(data = as.matrix(scaled_ISO[, 2:12]), grouping = scaled_ISO$Group,
StddzData = TRUE, ladfun = "classification")
Slda
## Call:
## Slda.default(data = as.matrix(scaled_ISO[, 2:12]), grouping = scaled_ISO$Group,
## StddzData = TRUE, ladfun = "classification")
##
## Prior probabilities of groups:
## El Molo Turkana Luhya (Webuye) Luhya (Port Vict.)
## 0.2432432 0.1891892 0.1891892 0.1621622
## Luo (Port Vict.)
## 0.2162162
##
## Group means:
## Tfv Ssk Sxp Sda
## El Molo 0.6393859 0.8847521 0.1170062 -0.102954091
## Turkana -0.6359069 0.2133864 -0.5262379 -0.623304691
## Luhya (Webuye) 0.2228006 0.2406359 -0.1939716 -0.166426744
## Luhya (Port Vict.) 0.4264057 -0.3306241 -0.5939760 1.073557817
## Luo (Port Vict.) -0.6776453 -1.1446475 0.9440333 0.001669996
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3 LD4
## Tfv -0.39348495 -0.4456465 0.65633995 0.8074253
## Ssk -1.22926872 0.1839274 0.04278755 -0.7178051
## Sxp -0.04260694 0.8299720 0.81804596 -0.2981733
## Sda 0.53782314 -0.7345409 0.17354765 -0.7349617
## Proportion of trace:
## LD1 LD2 LD3 LD4
## 0.554380573 0.317499098 0.123729033 0.004391297
##
## Variables kept in discriminant rule:
## [1] "Tfv" "Ssk" "Sxp" "Sda"
## Number of variables kept: 4
Slda is a shrunken discriminant analysis that finds the coefficients of a linear discriminant rule based on Fisher and Sun's (2011) estimate and generalizations of Ledoit and Wolf's (2004) optimal shrunken covariance matrix (Silva, 2015). Here, it finds four discriminants, just as the canonical approach does, and keeps four variables in the discriminant rule (Tfv, Ssk, Sxp, Sda), which are also the variables with the highest impact in the CDA. Thus, this approach confirms the discriminatory and descriptive power of the canonical approach. As scores are not produced in this approach, no further analyses are possible.
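For intuition, the core idea behind such shrunken estimators is a convex combination of the sample covariance and a simpler, well-conditioned target. The sketch below shows the general form only; the shrinkage intensity is hand-picked for illustration, whereas HiDimDA estimates it analytically:
# Sketch of covariance shrinkage (not HiDimDA's exact estimator):
# pull the sample covariance toward a diagonal target to stabilize its inverse
S <- cov(as.matrix(scaled_ISO[, 2:12]))
target <- diag(diag(S))  # diagonal target: keep variances, zero the covariances
lambda <- 0.2            # illustrative intensity; Ledoit & Wolf estimate this from the data
S_shrunk <- (1 - lambda) * S + lambda * target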