This file documents the statistical analysis of dental microwear variables from Kenyan living human populations, as well as the code to produce Figure 1, the location map with fieldwork sites. A previous analyses of SSFA parameters (Asfc, epLsar, HAsfc, Tfv) was published in Correia, Foley, & Lahr (2021). Here, ISO 25178 (or STA) parameters are also included in a multivariate analysis.

#save images in folder in directory
knitr::opts_chunk$set(
  echo = TRUE, #shows code
  warning = FALSE, message = FALSE, #stops warning messages
  fig.path = "images/",
  dev = c("svg", "png", "tiff"), #saves figures as svg, tiff, and png in images folder
  dpi = 500, #publishing quality for combination art (Elsevier)
  tidy.opts=list(width.cutoff=60), # stops code from running off page
  tidy=TRUE
)
# fileEnconding specifies character encoding as UTF-8
# rather than Unicode characters; otherwise unicode
# characters are added to column name
ISO <- read.csv("MW_ISO.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE,
    fileEncoding = "UTF-8-BOM")

# making ggplot2 respect the Group order
ISO$Group <- factor(ISO$Group, levels = unique(ISO$Group))
ISO <- ISO |>
    mutate(Group = fct_relevel(Group, "El Molo", "Turkana", "Webuye",
        "Luhya", "Luo"))
levels(ISO$Group) <- c("El Molo", "Turkana", "Luhya (Webuye)",
    "Luhya (Port Vict.)", "Luo (Port Vict.)")

# fieldwork site location
sites <- read.csv("workplaces.txt")
#graphical settings for map
map_theme <- theme(axis.text=element_blank(),
        axis.title=element_blank(),#and for axis titles
        panel.grid.minor = element_blank(), 
        panel.background = element_rect(fill="white"),
        plot.background = element_blank(),
        panel.grid.major = element_blank(),
        axis.ticks=element_blank(),  
        panel.border = element_rect(colour = "black", fill=NA, linewidth = 0.5))

#graphical settings for ggplot
my_theme <- theme(
  axis.text=element_text(size=8,colour="black"),
  axis.ticks=element_line (linewidth = 0.5, colour="black"), 
  axis.title=element_text(size=10),
  panel.grid.minor = element_blank(), 
  panel.background = element_blank(),
  panel.grid.major = element_blank(),
  legend.key=element_blank(),
  panel.border = element_rect(colour = "black", fill=NA, linewidth = 0.5))

#use first row data as column names for transposition
header.true <- function(df) {
  colnames(df) <- as.character(unlist(df[1,]))
  df[-1,]
}

Producing the location map

africa <- ne_countries(type = "countries", continent = "Africa",
    scale = "medium", returnclass = "sf")
lakes <- ne_download(scale = "medium", category = "physical",
    type = "lakes", returnclass = "sf")
## Reading layer `ne_50m_lakes' from data source 
##   `C:\Users\corre\AppData\Local\Temp\RtmpoLOXWw\ne_50m_lakes.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 412 features and 39 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -165.8985 ymin: -50.62002 xmax: 176.0827 ymax: 81.94033
## Geodetic CRS:  WGS 84
lakes <- ne_load(scale = "medium", category = "physical", type = "lakes",
    returnclass = "sf")
## Reading layer `ne_50m_lakes' from data source 
##   `C:\Users\corre\AppData\Local\Temp\RtmpoLOXWw\ne_50m_lakes.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 412 features and 39 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -165.8985 ymin: -50.62002 xmax: 176.0827 ymax: 81.94033
## Geodetic CRS:  WGS 84
# tomorrow start by trying to add lakes to map to do: inset
inset <- africa %>%
    ggplot() + geom_sf(fill = "grey87", color = "grey30") + geom_sf(data = africa |>
    filter(name %in% c("Kenya")), fill = "grey70", color = "grey30") +
    coord_sf(xlim = c(-22, 55), ylim = c(-40, 43), expand = FALSE) +
    geom_rect(aes(xmin = 31, xmax = 43, ymin = -5, ymax = 5.6),
        color = "black", fill = NA, linewidth = 0.6) + map_theme


main <- africa %>%
    ggplot() + geom_sf(fill = "grey87", color = "grey30") + geom_sf(data = africa |>
    filter(name %in% c("Kenya")), fill = "grey70", color = "grey30") +
    geom_sf(data = lakes, fill = "white") + geom_point(data = sites,
    aes(x = long, y = lat)) + coord_sf(xlim = c(31, 43), ylim = c(-5,
    5.6), expand = FALSE) + geom_sf_text(data = africa %>%
    filter(name %in% c("Kenya")), aes(label = name_long), size = 3.5,
    fontface = "bold", nudge_x = 0.6, nudge_y = 0.3) + geom_sf_text(data = lakes %>%
    filter(name %in% c("Lake Victoria", "Lake Turkana")), aes(label = c("Lake \n Victoria",
    "Lake \nTurkana")), size = 2.5, nudge_y = c(0.5, 0.1), nudge_x = c(0.1,
    1)) + ggrepel::geom_text_repel(data = sites, aes(x = long,
    y = lat, label = c("El Molo", "Luo | Luhya", "Luhya", "Turkana"),
    fontface = "plain"), nudge_y = c(-0.3, 0.05, 0.1, -0.5),
    nudge_x = c(0.3, 1, 0.2, 0.05), size = 2.9) + annotation_north_arrow(location = "tl",
    which_north = "true", height = unit(0.5, "cm"), width = unit(0.4,
        "cm"), pad_x = unit(0.4, "cm"), pad_y = unit(0.4, "cm")) +
    annotation_scale(location = "bl", style = "bar") + map_theme

# inset rectangle

main + annotation_custom(ggplotGrob(inset), xmin = 37.5, xmax = 44.5,
    ymin = 1.5, ymax = 5.5)

Addressing collinearity

The STA parameters initially considered here are described in ISO/TC 213 (2021):

  1. skewness (Ssk);
  2. maximum peak height (Sp);
  3. maximum height (Sz);
  4. extreme peak height (Sxp);
  5. root mean square gradient (Sdq);
  6. developed interfacial area ratio (Sdr);
  7. pit void volume (Vvv);
  8. five-point pit height (S5v);
  9. mean dale area (Sda);
  10. mean dale volume (Sdv)

Some of the STA variables likely capture the same underlying variation. Thus, we expect some of the variables to be highly correlated (Ungar, Livengood, & Crittenden, 2019), which affects general linear models since it violates one of the main assumptions which states that dependent variables will be independent from each other, i.e. that there will be no multicollinearity (Field, Miles, & Field, 2012, p. 274). Here, a correlogram was used to evaluate the correlation matrix.

dISO <- ISO %>%
    dplyr::select(Group, Asfc, epLsar, HAsfc, Tfv, Ssk, Sp, Sz,
        Sxp, Sdq, Sdr, Vvv, S5v, Sda, Sdv)

x <- as.matrix(dISO[, 2:15])
M <- x |>
    cor()

cor.mtest <- function(mat, ...) {
    mat <- as.matrix(mat)
    n <- ncol(mat)
    p.mat <- matrix(NA, n, n)
    diag(p.mat) <- 0
    for (i in 1:(n - 1)) {
        for (j in (i + 1):n) {
            tmp <- cor.test(mat[, i], mat[, j], ...)
            p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
        }
    }
    colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
    p.mat
}
# matrix of the p-value of the correlation
p.mat <- cor.mtest(x)

col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD",
    "#4477AA"))
corrplot(M, method = "color", col = col(200), type = "lower",
    tl.col = "black", addCoef.col = "black", p.mat = p.mat, sig.level = 0.01,
    insig = "blank", diag = FALSE)

# remove variables with high correlation
dISO <- dISO |>
    dplyr::select(-c(Sdr, Vvv, S5v))

Squares with colour are significant correlations (\(p<0.01\)) and we observed 3 correlations above 0.90, confirming collinearity. We removed 3 variables from the analyses to avoid this issue (Sdr, Vvv, S5v). In addition, we observe several medium to strong correlations, which suggest an underlying pattern of variation and support a multivariate approach.

Visualization and descriptive statistics

Dotplots1 were used to visually examine the data, and it’s interesting that there is one clear outlier for Sz, which explains the high standard deviation of this variable for the group including it. Descriptive statistics reported include standard and robust measures of both central tendency and dispersion (mean and SD; median and IQR).

grouped.data <- dISO |>
    gather(key = "variable", value = "value", Asfc:Sdv) |>
    group_by(variable)

# Colour-blind palette
cbbPalette <- c("#E69F00", "#CC79A7", "#009E73", "#56B4E9", "#0072B2")

# dotplots
dotplot <- ggplot(grouped.data, aes(x = value, colour = Group,
    fill = Group)) + geom_dotplot(stackdir = "center", stackgroups = TRUE,
    binpositions = "all") + coord_flip() + scale_fill_manual(values = cbbPalette) +
    scale_colour_manual(values = cbbPalette) + scale_y_continuous(NULL,
    breaks = NULL) + facet_wrap(~variable, scales = "free_y") +
    my_theme

reposition_legend(dotplot, "center", panel = "panel-4-3")

# summary table
sISO <- dISO |>
    group_by(Group) |>
    get_summary_stats(type = "common", show = c("n", "mean",
        "sd", "median", "iqr"))

kbl(sISO, format = "html", booktabs = TRUE, longtable = TRUE,
    caption = "summary statistics") |>
    kable_classic(full_width = F, fixed_thead = T, bootstrap_options = c("hover",
        "responsive")) |>
    row_spec(0, italic = TRUE)
summary statistics
Group variable n mean sd median iqr
El Molo Asfc 9 3.375 1.557 3.781 2.486
El Molo epLsar 9 0.004 0.002 0.003 0.002
El Molo HAsfc 9 0.451 0.196 0.460 0.280
El Molo Tfv 9 53682.043 7781.755 52229.147 10723.046
El Molo Ssk 9 -0.034 0.229 -0.028 0.211
El Molo Sp 9 3.402 2.251 2.579 3.714
El Molo Sz 9 5.875 3.157 4.667 4.549
El Molo Sxp 9 0.862 0.173 0.795 0.309
El Molo Sdq 9 0.214 0.046 0.214 0.067
El Molo Sda 9 316.519 159.673 308.716 135.162
El Molo Sdv 9 7.347 5.562 5.717 5.787
Turkana Asfc 7 3.320 0.978 3.231 1.160
Turkana epLsar 7 0.002 0.001 0.001 0.001
Turkana HAsfc 7 0.256 0.124 0.210 0.055
Turkana Tfv 7 41618.790 8024.451 44927.780 5903.392
Turkana Ssk 7 -0.315 0.199 -0.336 0.249
Turkana Sp 7 2.357 1.068 2.043 0.794
Turkana Sz 7 7.159 7.644 3.772 2.754
Turkana Sxp 7 0.713 0.173 0.769 0.159
Turkana Sdq 7 0.229 0.086 0.198 0.073
Turkana Sda 7 234.190 122.793 207.141 126.187
Turkana Sdv 7 4.693 3.332 3.380 3.998
Luhya (Webuye) Asfc 7 3.017 1.816 2.495 1.716
Luhya (Webuye) epLsar 7 0.002 0.001 0.002 0.002
Luhya (Webuye) HAsfc 7 0.504 0.126 0.520 0.155
Luhya (Webuye) Tfv 7 49741.478 8327.653 53447.254 11367.517
Luhya (Webuye) Ssk 7 -0.303 0.260 -0.277 0.325
Luhya (Webuye) Sp 7 2.504 1.612 1.729 1.452
Luhya (Webuye) Sz 7 4.815 2.244 4.223 2.237
Luhya (Webuye) Sxp 7 0.790 0.204 0.792 0.256
Luhya (Webuye) Sdq 7 0.212 0.095 0.179 0.084
Luhya (Webuye) Sda 7 306.476 124.430 274.599 154.342
Luhya (Webuye) Sdv 7 7.309 4.474 7.491 6.915
Luhya (Port Vict.) Asfc 6 2.211 0.966 2.165 1.449
Luhya (Port Vict.) epLsar 6 0.003 0.001 0.002 0.001
Luhya (Port Vict.) HAsfc 6 0.557 0.437 0.430 0.248
Luhya (Port Vict.) Tfv 6 51667.420 5782.717 50000.628 4299.156
Luhya (Port Vict.) Ssk 6 -0.543 0.353 -0.491 0.326
Luhya (Port Vict.) Sp 6 2.466 1.357 2.020 1.671
Luhya (Port Vict.) Sz 6 5.025 1.458 5.084 1.588
Luhya (Port Vict.) Sxp 6 0.697 0.192 0.697 0.214
Luhya (Port Vict.) Sdq 6 0.175 0.043 0.189 0.073
Luhya (Port Vict.) Sda 6 502.664 196.921 519.202 298.244
Luhya (Port Vict.) Sdv 6 12.467 8.062 12.226 12.700
Luo (Port Vict.) Asfc 8 2.474 0.859 2.748 1.304
Luo (Port Vict.) epLsar 8 0.003 0.002 0.003 0.001
Luo (Port Vict.) HAsfc 8 0.441 0.123 0.440 0.150
Luo (Port Vict.) Tfv 8 41223.978 10192.177 45092.266 8974.235
Luo (Port Vict.) Ssk 8 -0.884 0.433 -0.636 0.464
Luo (Port Vict.) Sp 8 2.243 1.436 1.915 0.511
Luo (Port Vict.) Sz 8 5.103 1.945 4.568 0.997
Luo (Port Vict.) Sxp 8 1.055 0.259 1.043 0.300
Luo (Port Vict.) Sdq 8 0.197 0.042 0.195 0.058
Luo (Port Vict.) Sda 8 333.072 101.095 314.123 70.991
Luo (Port Vict.) Sdv 8 10.347 7.417 8.368 5.473

Inferential Statistics

A multivariate analysis of variance (MANOVA) was used to investigate variation in central tendency between groups, since it takes into consideration any relationship between the dependent variables.

Assumptions

The assumptions of the MANOVA are (Field et al., 2012, p. 717):

  1. Independence

  2. Random Sampling

  3. Uni and multivariate normality (within groups)

  4. Uni and multivariate homogeneity of variances

  5. Linearity (linear relationships among all pairs of dependent variables, all pairs of covariates, and all dependent variable-covariate pairs in each cell)

    ADDITIONAL LIMITATIONS

  6. Unequal sample sizes (when cells in a factorial MANOVA have different sample sizes, the sum of squares for effect plus error does not equal the total sum of squares. This causes tests of main effects and interactions to be correlated)

  7. Uni and multivariate outliers (may produce either a Type I or Type II error and give no indication as to which type of error is occurring in the analysis)

Independence and linearity were addressed with the correlogram. Sample sizes are not too unequal, but small. Specifically, the sample size of each group is inferior to the number of variables, which is a risk when conducting multivariate analyses. Next, we will examine the outliers, normality and homocedasticity of the data. Conducting multiple statistical tests increases the chances of Type I error; when necessary, we applied a false discovery rate (FDR) correction, also known as Benjamini & Hochberg.

The aq.plot() function from the mvoutlier package (Filzmoser & Gschwandter, n.d.) identifies multivariate outliers by plotting the ordered squared robust Mahalanobis distances of the observations against the empirical distribution function of the MD2i. The function produces 4 graphs that identify multivariate outliers in red. Univariate outliers were identified in the graphs above.

outliers <- aq.plot(dISO[, 2:12])
## Projection to the first and second robust principal components.
## Proportion of total variation (explained variance): 0.9994766

To test for multivariate normality within groups, we used the mshapiro.test in the mvnormtest package (Jarek, 2012), which suggests that there is no multivariate normality. We also assessed multinormality graphically by examining plots of the ordered Mahalanobis distances (D2) and paired chi-squared values, \(\chi\) 2, of the model residuals. If the data are multivariate normal, the plot should form a relatively straight, diagonal line. The lack of multivariate normality is not surprising considering that the scales of measurement are very distinct between variables.

# Perform the multivariate Shapiro-Wilk test within each
# group
mshapiro_results <- dISO %>%
    group_by(Group) %>%
    do(tidy(mvnormtest::mshapiro.test(as.matrix(select(., where(is.numeric))))))


kbl(mshapiro_results, digits = c(0, 3, 10), format = "html",
    booktabs = TRUE, caption = "Multinormality tests") |>
    kable_classic(full_width = F, fixed_thead = T, bootstrap_options = c("hover",
        "responsive")) |>
    row_spec(0, italic = TRUE)
Multinormality tests
Group statistic p.value method
El Molo 0.345 2.25e-08 Shapiro-Wilk normality test
Turkana 0.345 2.24e-08 Shapiro-Wilk normality test
Luhya (Webuye) 0.347 2.34e-08 Shapiro-Wilk normality test
Luhya (Port Vict.) 0.345 2.25e-08 Shapiro-Wilk normality test
Luo (Port Vict.) 0.345 2.24e-08 Shapiro-Wilk normality test
cqplot(lm(as.matrix(dISO[, 2:12]) ~ Group, data = dISO), main = "Chi-Square Q-Q Plot of MANOVA Model Residuals")

For univariate normality, we used shapiro.test, and found 4 significant tests at p<0.01, suggesting most variables are normal within groups.

uni_normal <- grouped.data |>
    group_by(variable, Group) |>
    do(tidy(shapiro.test(.$value)))

color.me <- which(uni_normal$p.value < 0.01)

kbl(uni_normal, digits = c(3, 3), format = "html", booktabs = TRUE,
    longtable = TRUE, caption = "Uninormality tests") |>
    kable_classic(full_width = F, fixed_thead = T, bootstrap_options = c("hover",
        "responsive")) |>
    row_spec(color.me, bold = TRUE, background = "gray!6") |>
    row_spec(0, italic = TRUE)
Uninormality tests
variable Group statistic p.value method
Asfc El Molo 0.938 0.562 Shapiro-Wilk normality test
Asfc Turkana 0.926 0.520 Shapiro-Wilk normality test
Asfc Luhya (Webuye) 0.902 0.340 Shapiro-Wilk normality test
Asfc Luhya (Port Vict.) 0.944 0.694 Shapiro-Wilk normality test
Asfc Luo (Port Vict.) 0.910 0.355 Shapiro-Wilk normality test
HAsfc El Molo 0.951 0.705 Shapiro-Wilk normality test
HAsfc Turkana 0.650 0.001 Shapiro-Wilk normality test
HAsfc Luhya (Webuye) 0.935 0.595 Shapiro-Wilk normality test
HAsfc Luhya (Port Vict.) 0.747 0.019 Shapiro-Wilk normality test
HAsfc Luo (Port Vict.) 0.935 0.567 Shapiro-Wilk normality test
Sda El Molo 0.893 0.217 Shapiro-Wilk normality test
Sda Turkana 0.947 0.701 Shapiro-Wilk normality test
Sda Luhya (Webuye) 0.968 0.881 Shapiro-Wilk normality test
Sda Luhya (Port Vict.) 0.882 0.279 Shapiro-Wilk normality test
Sda Luo (Port Vict.) 0.916 0.401 Shapiro-Wilk normality test
Sdq El Molo 0.976 0.937 Shapiro-Wilk normality test
Sdq Turkana 0.864 0.163 Shapiro-Wilk normality test
Sdq Luhya (Webuye) 0.850 0.122 Shapiro-Wilk normality test
Sdq Luhya (Port Vict.) 0.821 0.090 Shapiro-Wilk normality test
Sdq Luo (Port Vict.) 0.896 0.264 Shapiro-Wilk normality test
Sdv El Molo 0.854 0.082 Shapiro-Wilk normality test
Sdv Turkana 0.830 0.080 Shapiro-Wilk normality test
Sdv Luhya (Webuye) 0.925 0.513 Shapiro-Wilk normality test
Sdv Luhya (Port Vict.) 0.884 0.288 Shapiro-Wilk normality test
Sdv Luo (Port Vict.) 0.760 0.011 Shapiro-Wilk normality test
Sp El Molo 0.832 0.047 Shapiro-Wilk normality test
Sp Turkana 0.888 0.264 Shapiro-Wilk normality test
Sp Luhya (Webuye) 0.748 0.012 Shapiro-Wilk normality test
Sp Luhya (Port Vict.) 0.919 0.496 Shapiro-Wilk normality test
Sp Luo (Port Vict.) 0.626 0.000 Shapiro-Wilk normality test
Ssk El Molo 0.965 0.849 Shapiro-Wilk normality test
Ssk Turkana 0.944 0.674 Shapiro-Wilk normality test
Ssk Luhya (Webuye) 0.909 0.386 Shapiro-Wilk normality test
Ssk Luhya (Port Vict.) 0.860 0.188 Shapiro-Wilk normality test
Ssk Luo (Port Vict.) 0.785 0.020 Shapiro-Wilk normality test
Sxp El Molo 0.848 0.071 Shapiro-Wilk normality test
Sxp Turkana 0.969 0.887 Shapiro-Wilk normality test
Sxp Luhya (Webuye) 0.903 0.350 Shapiro-Wilk normality test
Sxp Luhya (Port Vict.) 0.985 0.974 Shapiro-Wilk normality test
Sxp Luo (Port Vict.) 0.942 0.628 Shapiro-Wilk normality test
Sz El Molo 0.850 0.074 Shapiro-Wilk normality test
Sz Turkana 0.608 0.000 Shapiro-Wilk normality test
Sz Luhya (Webuye) 0.822 0.067 Shapiro-Wilk normality test
Sz Luhya (Port Vict.) 0.941 0.670 Shapiro-Wilk normality test
Sz Luo (Port Vict.) 0.688 0.002 Shapiro-Wilk normality test
Tfv El Molo 0.920 0.388 Shapiro-Wilk normality test
Tfv Turkana 0.871 0.188 Shapiro-Wilk normality test
Tfv Luhya (Webuye) 0.912 0.408 Shapiro-Wilk normality test
Tfv Luhya (Port Vict.) 0.851 0.161 Shapiro-Wilk normality test
Tfv Luo (Port Vict.) 0.814 0.040 Shapiro-Wilk normality test
epLsar El Molo 0.942 0.608 Shapiro-Wilk normality test
epLsar Turkana 0.848 0.118 Shapiro-Wilk normality test
epLsar Luhya (Webuye) 0.950 0.730 Shapiro-Wilk normality test
epLsar Luhya (Port Vict.) 0.859 0.184 Shapiro-Wilk normality test
epLsar Luo (Port Vict.) 0.915 0.393 Shapiro-Wilk normality test

One may use Box’s test to test for multivariate homoscedasticity, but this test is very sensitive to violations of normality, leading to rejection in most typical cases. Field et al. (2012, p. 275) does not recommend a formal test for multivariate homocedasticity but simply to compare values within covariance matrices. These matrices are quite different, probably as a result, once again, of the difference in scales between variables.

multihomo <- dISO |>
    group_by(Group) |>
    group_map(~cov(.))
levels <- levels(dISO$Group)

for (i in seq_along(multihomo)) {
    table_html <- kbl(multihomo[[i]], format = "html", booktabs = T,
        digits = 3, caption = paste0("Covariance matrix for the ",
            levels[i])) |>
        kable_classic(full_width = F, fixed_thead = T, bootstrap_options = c("hover",
            "responsive"))

    print(table_html)
    cat("\n\n")
}
Covariance matrix for the El Molo
Asfc epLsar HAsfc Tfv Ssk Sp Sz Sxp Sdq Sda Sdv
Asfc 2.423 0.001 0.210 -1304.595 0.086 0.832 1.633 -0.075 0.067 -98.785 -1.781
epLsar 0.001 0.000 0.000 -0.028 0.000 0.001 0.001 0.000 0.000 -0.015 -0.002
HAsfc 0.210 0.000 0.039 665.339 0.004 0.281 0.438 0.006 0.007 4.060 0.290
Tfv -1304.595 -0.028 665.339 60555705.896 -537.263 9415.738 14102.771 706.405 39.049 590742.371 18376.267
Ssk 0.086 0.000 0.004 -537.263 0.053 0.018 0.037 -0.009 0.003 0.733 0.030
Sp 0.832 0.001 0.281 9415.738 0.018 5.068 7.020 0.140 0.054 276.263 10.062
Sz 1.633 0.001 0.438 14102.771 0.037 7.020 9.965 0.171 0.086 351.463 12.636
Sxp -0.075 0.000 0.006 706.405 -0.009 0.140 0.171 0.030 0.000 15.358 0.679
Sdq 0.067 0.000 0.007 39.049 0.003 0.054 0.086 0.000 0.002 -0.576 0.033
Sda -98.785 -0.015 4.060 590742.371 0.733 276.263 351.463 15.358 -0.576 25495.469 806.884
Sdv -1.781 -0.002 0.290 18376.267 0.030 10.062 12.636 0.679 0.033 806.884 30.933
Covariance matrix for the Turkana
Asfc epLsar HAsfc Tfv Ssk Sp Sz Sxp Sdq Sda Sdv
Asfc 0.957 -0.001 0.048 2660.207 -0.120 0.546 0.516 0.125 0.033 20.102 2.460
epLsar -0.001 0.000 0.000 -1.428 0.000 0.000 0.001 0.000 0.000 0.009 -0.002
HAsfc 0.048 0.000 0.015 -4.546 0.002 0.111 -0.054 0.002 0.000 9.563 0.281
Tfv 2660.207 -1.428 -4.546 64391815.611 65.685 -1134.903 -13747.727 832.022 -37.988 360980.503 7393.343
Ssk -0.120 0.000 0.002 65.685 0.040 -0.018 -0.652 -0.010 -0.011 12.462 -0.062
Sp 0.546 0.000 0.111 -1134.903 -0.018 1.140 2.852 0.073 0.036 55.948 2.602
Sz 0.516 0.001 -0.054 -13747.727 -0.652 2.852 58.438 0.337 0.608 -478.238 -4.052
Sxp 0.125 0.000 0.002 832.022 -0.010 0.073 0.337 0.030 0.008 5.455 0.380
Sdq 0.033 0.000 0.000 -37.988 -0.011 0.036 0.608 0.008 0.007 -5.244 0.012
Sda 20.102 0.009 9.563 360980.503 12.462 55.948 -478.238 5.455 -5.244 15078.074 296.869
Sdv 2.460 -0.002 0.281 7393.343 -0.062 2.602 -4.052 0.380 0.012 296.869 11.101
Covariance matrix for the Luhya (Webuye)
Asfc epLsar HAsfc Tfv Ssk Sp Sz Sxp Sdq Sda Sdv
Asfc 3.300 -0.002 0.026 -3954.114 0.244 2.709 3.526 0.328 0.170 -41.268 1.385
epLsar -0.002 0.000 0.000 1.881 0.000 -0.001 -0.001 0.000 0.000 0.070 0.000
HAsfc 0.026 0.000 0.016 -403.211 0.006 0.042 0.045 0.004 0.001 4.631 0.074
Tfv -3954.114 1.881 -403.211 69349799.710 -99.503 -4634.101 -4543.318 -287.156 -174.760 195996.322 2573.441
Ssk 0.244 0.000 0.006 -99.503 0.068 0.206 0.228 0.008 0.011 1.310 0.016
Sp 2.709 -0.001 0.042 -4634.101 0.206 2.600 3.549 0.283 0.145 31.965 3.485
Sz 3.526 -0.001 0.045 -4543.318 0.228 3.549 5.036 0.394 0.194 81.860 6.253
Sxp 0.328 0.000 0.004 -287.156 0.008 0.283 0.394 0.042 0.018 -0.777 0.267
Sdq 0.170 0.000 0.001 -174.760 0.011 0.145 0.194 0.018 0.009 -0.969 0.118
Sda -41.268 0.070 4.631 195996.322 1.310 31.965 81.860 -0.777 -0.969 15482.843 480.245
Sdv 1.385 0.000 0.074 2573.441 0.016 3.485 6.253 0.267 0.118 480.245 20.019
Covariance matrix for the Luhya (Port Vict.)
Asfc epLsar HAsfc Tfv Ssk Sp Sz Sxp Sdq Sda Sdv
Asfc 0.933 0.000 0.067 -1598.681 -0.305 0.283 0.507 0.091 0.039 -76.470 -2.369
epLsar 0.000 0.000 0.000 -0.300 0.000 0.000 0.000 0.000 0.000 -0.003 -0.001
HAsfc 0.067 0.000 0.191 554.237 -0.012 0.476 0.352 0.004 0.006 53.180 2.281
Tfv -1598.681 -0.300 554.237 33439817.261 825.683 -1371.415 -4213.561 -682.916 -100.892 -248610.800 -14410.640
Ssk -0.305 0.000 -0.012 825.683 0.125 -0.044 -0.109 -0.038 -0.012 18.789 0.636
Sp 0.283 0.000 0.476 -1371.415 -0.044 1.841 1.809 0.137 0.030 181.886 8.004
Sz 0.507 0.000 0.352 -4213.561 -0.109 1.809 2.124 0.190 0.042 153.766 7.443
Sxp 0.091 0.000 0.004 -682.916 -0.038 0.137 0.190 0.037 0.005 7.245 0.318
Sdq 0.039 0.000 0.006 -100.892 -0.012 0.030 0.042 0.005 0.002 -1.132 -0.007
Sda -76.470 -0.003 53.180 -248610.800 18.789 181.886 153.766 7.245 -1.132 38777.694 1559.151
Sdv -2.369 -0.001 2.281 -14410.640 0.636 8.004 7.443 0.318 -0.007 1559.151 65.004
Covariance matrix for the Luo (Port Vict.)
Asfc epLsar HAsfc Tfv Ssk Sp Sz Sxp Sdq Sda Sdv
Asfc 0.737 0.000 0.054 -3644.121 -0.047 0.442 0.748 0.073 0.033 -71.302 -4.269
epLsar 0.000 0.000 0.000 -9.981 0.000 0.000 -0.001 0.000 0.000 -0.027 0.002
HAsfc 0.054 0.000 0.015 -776.863 -0.010 0.004 -0.010 0.017 0.003 -9.849 -0.623
Tfv -3644.121 -9.981 -776.863 103880472.855 1369.067 6479.592 6362.517 -297.274 -137.003 537618.558 29244.498
Ssk -0.047 0.000 -0.010 1369.067 0.188 -0.081 -0.239 -0.028 -0.003 1.200 0.322
Sp 0.442 0.000 0.004 6479.592 -0.081 2.063 2.622 0.221 0.034 -9.811 -0.058
Sz 0.748 -0.001 -0.010 6362.517 -0.239 2.622 3.784 0.306 0.051 -7.732 0.034
Sxp 0.073 0.000 0.017 -297.274 -0.028 0.221 0.306 0.067 0.007 -4.313 -0.008
Sdq 0.033 0.000 0.003 -137.003 -0.003 0.034 0.051 0.007 0.002 -2.767 -0.135
Sda -71.302 -0.027 -9.849 537618.558 1.200 -9.811 -7.732 -4.313 -2.767 10220.102 693.161
Sdv -4.269 0.002 -0.623 29244.498 0.322 -0.058 0.034 -0.008 -0.135 693.161 55.015

To asses univariate equality of variances, we used levene_test from the rstatix package (Kassambara, 2023). Even though the test was applied to check the assumptions of the MANOVA, the results already suggest that there are no significant differences in dispersion, something that was investigated in previous studies of living human dental microwear (Ungar et al., 2019).

uni_homocedast <- grouped.data |>
    levene_test(value ~ Group)

kbl(uni_homocedast, digits = c(0, 0, 0, 3, 3), format = "html",
    booktabs = TRUE, caption = "Unihomocedasticity tests") |>
    kable_classic(full_width = F, fixed_thead = T, bootstrap_options = c("hover",
        "responsive")) |>
    row_spec(0, italic = TRUE)
Unihomocedasticity tests
variable df1 df2 statistic p
Asfc 4 32 0.801 0.534
HAsfc 4 32 1.216 0.323
Sda 4 32 1.527 0.218
Sdq 4 32 0.608 0.660
Sdv 4 32 1.256 0.308
Sp 4 32 0.802 0.533
Ssk 4 32 0.450 0.772
Sxp 4 32 0.345 0.846
Sz 4 32 0.732 0.577
Tfv 4 32 0.210 0.931
epLsar 4 32 0.437 0.781

Main Test

Bottom line, the data violates several of the assumptions, and non-parametric alternatives should be used. An alternative is to conduct a permutational multivariate analysis of variance (PERMANOVA) using the adonis.II function in the RVAideMemoire package (Herve, 2023), a non-parametric test used to test whether the centroids and dispersion of a categorical variable are equivalent for all groups (Anderson, 2017).

Another issue with the data is that \(n>p\), i.e. number of samples per group (9 for El Molo, 7 for Turkana, 7 for the Luhya from Webuye and 6 for the Luhya from Port Victoria and 8 for the Luo) is lower than the number of dependent variables (11). This stops us from applying some corrections (such as MCD), and decreases the power of the test, and thus, results should be interpreted cautiously.

permanova <- RVAideMemoire::adonis.II(formula = dISO[, 2:12] ~
    Group, data = dISO, method = "euclidean")

es_permanova <- effectsize::eta_squared(permanova, partial = TRUE)

permanova <- permanova |>
    mutate(partialEta2 = ifelse(row_number() == 1, es_permanova$Eta2_partial[1],
        NA)) |>
    rename(`p-value` = "Pr(>F)")


options(knitr.kable.NA = "")
kbl(permanova, caption = "Permanova Results with 999 permutations",
    digits = c(0, 0, 0, 3, 3, 3, 3), format = "html", booktabs = TRUE,
    format.args = list(big.mark = ",")) |>
    kable_classic(full_width = F, fixed_thead = T, bootstrap_options = c("hover",
        "responsive")) |>
    row_spec(0, italic = TRUE)
Permanova Results with 999 permutations
Sum Sq Mean Sq Df F p-value partialEta2
Group 1,040,145,654 260,036,413 4 3.814 0.01 0.323
Residuals 2,181,912,283 68,184,759 32
Total 3,222,057,937 36

The interpretation of partial-\(\eta\) 2 is, by convention, that 0.01, 0.06, and 0.14 correspond to small, medium, and large effects, respectively (Cohen, 1988). Thus, we can state that group has a large effect on the dependent variable, such that it accounts for a meaningful part of the variance in some linear combination of the dependent variables.

Follow-up tests

To follow up the MANOVA, we applied pairwise PERMANOVA’s on the distance matrix between groups, univariate ANOVA’s on rank-transformed data, and linear discriminant analyses (LDA). Pairwise PERMANOVA’s inform on the multivariate distances between groups, univariate ANOVA’s inform on variation driven by a single variable, and LDA finds a linear combination of features that characterizes or separates the classes of objects (Field et al., 2012). As for the assumptions tests, we applied a false discovery rate (FDR) correction to control for the increase of Type I error when multiple testing.

Pairwise PERMANOVA

Pairwise PERMANOVA was conducted using the pairwise.perm.manova function from the RVAideMemoire package on a distance matrix produced using vegdist function from the vegan package.

dist_dml <- vegan::vegdist(x = as.matrix(dISO[, 2:12]), method = "euclidean",
    binary = FALSE, diag = TRUE, upper = TRUE, na.rm = FALSE)

pairwise_permanova <- pairwise.perm.manova(resp = dist_dml, fact = dISO$Group,
    test = "Wilks", nperm = 999, progress = FALSE, p.method = "fdr",
    F = TRUE, R2 = TRUE)

# Extract the relevant information and create a data frame
pairwisePermanova <- as.data.frame(cbind(group1 = rep(dimnames(pairwise_permanova$p.value)[[2]],
    each = ncol(pairwise_permanova$p.value)), group2 = rep(dimnames(pairwise_permanova$p.value)[[1]],
    times = nrow(pairwise_permanova$p.value)), `statistic(pseudo-F)` = round(as.vector(pairwise_permanova$F.value),
    3), `adjusted p-value` = round(as.vector(pairwise_permanova$p.value),
    3), R2 = round(as.vector(pairwise_permanova$R2.value), 3)))

# Filter out rows with NA values and first column with
# numbers
pairwisePermanova <- pairwisePermanova |>
    drop_na() |>
    select(-0)

color.me <- which(pairwisePermanova$`adjusted p-value` < 0.05)

kbl(pairwisePermanova, format = "html", booktabs = TRUE, caption = pairwise_permanova$method) |>
    kable_classic(full_width = F, fixed_thead = T, bootstrap_options = c("hover",
        "responsive")) |>
    row_spec(color.me, bold = TRUE, background = "gray!6") |>
    row_spec(0, italic = TRUE)
permutation MANOVAs on a distance matrix
group1 group2 statistic(pseudo-F) adjusted p-value R2
El Molo Turkana 9.209 0.045 0.397
El Molo Luhya (Webuye) 0.95 0.473 0.064
El Molo Luhya (Port Vict.) 0.294 0.719 0.022
El Molo Luo (Port Vict.) 8.136 0.045 0.352
Turkana Luhya (Webuye) 3.453 0.17 0.223
Turkana Luhya (Port Vict.) 6.484 0.05 0.371
Turkana Luo (Port Vict.) 0.007 0.959 0.001
Luhya (Webuye) Luhya (Port Vict.) 0.228 0.719 0.02
Luhya (Webuye) Luo (Port Vict.) 3.079 0.17 0.192
Luhya (Port Vict.) Luo (Port Vict.) 5.017 0.095 0.295

Interpretation of \(R\) 2 is, by convention, that 0.01, 0.09, and 0.25 correspond to small, medium, and large effects, respectively (Cohen, 1988). We found two significant comparisons, between the El Molo and the Turkana, and the El Molo and the Luo (Port Victoria).

Univariate ANOVA

As the data does not follow ANOVA assumptions, namely normality, we rank-transformed the data, also known as a Kruskal Wallis test, for which we used the kruskal_test function from the rstatix package.

followup_kruskal <- grouped.data |>
    kruskal_test(value ~ Group) |>
    adjust_pvalue(method = "fdr")

# followup_kruskal <- followup_kruskal[,
# !names(followup_kruskal) %in% '.y.']

followup_effsize <- grouped.data |>
    kruskal_effsize(value ~ Group)

followup_effsize <- followup_effsize |>
    select(variable, effsize, magnitude)

merged_data <- followup_kruskal %>%
    left_join(followup_effsize, by = "variable") |>
    select(variable, n, df, statistic, p.adj, method, effsize,
        magnitude)

colnames(merged_data) <- c("variable", "n", "df", "statistic (H)",
    "adjusted p-value", "method", "eta-squared", "magnitude")

color.me <- which(merged_data$`adjusted p-value` < 0.05)

# Convert the R object into a html table
kbl(merged_data, digits = c(3, 3, 0, 3, 3), format = "html",
    booktabs = TRUE, caption = "Univariate Kruskal-Wallis") |>
    kable_classic(full_width = F, fixed_thead = T, bootstrap_options = c("hover",
        "responsive")) |>
    row_spec(color.me, bold = TRUE, background = "gray!6") |>
    row_spec(0, italic = TRUE)
Univariate Kruskal-Wallis
variable n df statistic (H) adjusted p-value method eta-squared magnitude
Asfc 37 4 4.322 0.500 Kruskal-Wallis 0.010 small
HAsfc 37 4 10.121 0.106 Kruskal-Wallis 0.191 large
Sda 37 4 7.454 0.209 Kruskal-Wallis 0.108 moderate
Sdq 37 4 2.170 0.839 Kruskal-Wallis -0.057 small
Sdv 37 4 6.537 0.255 Kruskal-Wallis 0.079 moderate
Sp 37 4 1.853 0.839 Kruskal-Wallis -0.067 moderate
Ssk 37 4 23.072 0.001 Kruskal-Wallis 0.596 large
Sxp 37 4 10.294 0.106 Kruskal-Wallis 0.197 large
Sz 37 4 0.809 0.937 Kruskal-Wallis -0.100 moderate
Tfv 37 4 12.638 0.073 Kruskal-Wallis 0.270 large
epLsar 37 4 8.370 0.174 Kruskal-Wallis 0.137 moderate

At p < 0.05, only Ssk was significantly different between groups, although other 3 variables – Hasfc, Sxp, and Tfv also have large effect sizes. In our previous study (Correia et al., 2021), Hasfc and Tfv were also not significantly different with large effect sizes .

Next, we conducted pairwise comparisons for the significantly different variable Ssk. We used the function dunn_test from the rstatix package, which respects the ranked data of Kruskall-Wallis, while controlling for multiple comparisons. We chose Dunn’s test over Games-Howells because the first addresses distribution issues, whereas the second is more appropriate when there is no homogeneity of variances, which is not as problematic here (see @ref(assumptions)). Finally, we calculated Hedge’s g2 as a measure of effect size for each pairwise comparison.

pairwise <- dunn_test(dISO, Ssk ~ Group, p.adjust.method = "fdr",
    detailed = TRUE)


pairwise_effsize <- dISO |>
    rstatix::cohens_d(formula = Ssk ~ Group, paired = FALSE,
        hedges.correction = TRUE)

pairwise_effsize <- pairwise_effsize |>
    select(group1, group2, effsize, magnitude)

merged_data_effsize <- pairwise %>%
    left_join(pairwise_effsize, by = c("group1", "group2")) |>
    select(group1, group2, estimate, statistic, p.adj, effsize,
        magnitude)

colnames(merged_data_effsize) <- c("group1", "group2", "mean difference",
    "statistic (z)", "adjusted p-value", "hedge's g", "magnitude")

color.me <- which(merged_data_effsize$`adjusted p-value` < 0.05)

kbl(merged_data_effsize, digits = c(3, 3, 3, 3, 3, 3), format = "html",
    booktabs = TRUE, caption = "Pairwise comparisons for Ssk using Dunn's test") |>
    kable_classic(full_width = F, fixed_thead = T, bootstrap_options = c("hover",
        "responsive")) |>
    row_spec(color.me, bold = TRUE, background = "gray!6") |>
    row_spec(0, italic = TRUE)
Pairwise comparisons for Ssk using Dunn’s test
group1 group2 mean difference statistic (z) adjusted p-value hedge’s g magnitude
El Molo Turkana -9.048 -1.659 0.162 1.238 large
El Molo Luhya (Webuye) -9.333 -1.711 0.162 1.041 large
El Molo Luhya (Port Vict.) -16.000 -2.805 0.019 1.611 large
El Molo Luo (Port Vict.) -24.333 -4.626 0.000 2.329 large
Turkana Luhya (Webuye) -0.286 -0.049 0.961 -0.046 negligible
Turkana Luhya (Port Vict.) -6.952 -1.154 0.298 0.740 moderate
Turkana Luo (Port Vict.) -15.286 -2.729 0.019 1.588 large
Luhya (Webuye) Luhya (Port Vict.) -6.667 -1.107 0.298 0.718 moderate
Luhya (Webuye) Luo (Port Vict.) -15.000 -2.678 0.019 1.529 large
Luhya (Port Vict.) Luo (Port Vict.) -8.333 -1.426 0.220 0.808 large
# boxplot
pairwise <- pairwise |>
    add_xy_position(x = "Group")

ggplot(dISO, aes(y = Ssk, x = Group)) + geom_boxplot(aes(fill = Group),
    show.legend = FALSE) + scale_fill_manual(values = cbbPalette) +
    stat_pvalue_manual(pairwise, label = "p = {scales::pvalue(p.adj)}",
        tip.length = 0.01, size = 2.5, hide.ns = "p.adj", step.increase = 0.02) +
    my_theme

For the Ssk variable, we observe that 4 comparisons are significantly different, Interestingly, most comparisons have large or moderate effect sizes, suggesting that the sample size is too low to detect more significant comparisons. The one exception is the negligible effect size in the comparison between the Turkana and Luhya (Webuye).

Canonical discriminant analysis

Canonical discriminant analyses (CDA) find a linear combination of features that characterizes or separates two or more classes of objects or events. This approach makes important demands in terms of data distribution, and sample sizes. However, alternatives, such as regularized or quadratic discriminant analyses, are more adequate for predictive rather than descriptive analyses3. In addition, even when dealing with non-normal and high dimensional data, CDA axis retain discriminality between classes although cut-off points between classes may be incorrect. In other words, CDA on non-normal data cannot be generalized to other data (Hastie, Tibshirani, & Friedman, 2009, p. 110). Here, CDA was conducted using the candisc function from the candisc package, while rLDA was conducted using the Slda function from the HiDimDA package with the goal of informing the reliability of the CDA approach4.

ISOmanova <- lm(as.matrix(dISO[, 2:12]) ~ Group, dISO)
# summary(manova(ISOmanova), test = 'Wilks')

ISO.can <- candisc(ISOmanova)
ISO.can
## 
## Canonical Discriminant Analysis for Group:
## 
##    CanRsq Eigenvalue Difference Percent Cumulative
## 1 0.64629    1.82717    0.32596 41.2101     41.210
## 2 0.60019    1.50121    0.32596 33.8584     75.068
## 3 0.45336    0.82935    0.32596 18.7052     93.774
## 4 0.21634    0.27606    0.32596  6.2263    100.000
## 
## Test of H0: The canonical correlations in the 
## current row and all that follow are zero
## 
##   LR test stat approx F numDF denDF  Pr(> F)   
## 1      0.06058  2.11593    44 86.12 0.001524 **
## 2      0.17127  1.87334    30 68.19 0.016817 * 
## 3      0.42838  1.40763    18 48.00 0.171547   
## 4      0.78366  0.86268     8 25.00 0.559555   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Create a data frame for the scores
scores_df <- data.frame(
  Group = ISO.can$scores$Group,
  Can1 = ISO.can$scores$Can1,
  Can2 = ISO.can$scores$Can2
)

# Extract the relevant information from the candisc object
loadings_df <- data.frame(
  Variable = rownames(ISO.can$structure),
  Can1 = ISO.can$structure[, 1],
  Can2 = ISO.can$structure[, 2])

# Calculate the percentage of variance explained by each canonical variable
eigenvalues <- ISO.can$eigenvalues
total_variance <- sum(eigenvalues)
variance_explained <- eigenvalues / total_variance * 100

# Define axis titles including the percentage of variance explained
x_axis_title <- paste("Can1 (", round(variance_explained[1], 1), "%)", sep = "")
y_axis_title <- paste("Can2 (", round(variance_explained[2], 1), "%)", sep = "")

# Calculate centroids for each group
centroids <- aggregate(cbind(Can1, Can2) ~ Group, scores_df, mean)

# Create the biplot using ggplot2
ggplot() +
  geom_point(data = scores_df, aes(x = Can1, y = Can2, color = Group), size = 2) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  stat_ellipse(data = scores_df, aes(
    x = Can1, y = Can2, fill = Group), 
    geom = "polygon", level = 0.68, alpha = 0.2) +  # Add ellipses for group
  geom_text(data = centroids, aes(
    x = Can1, y = Can2, label = "+"),
    size = 6, color = cbbPalette) +  # Add "+" symbols for centroids
  geom_text_repel(data = loadings_df, aes(
    x = Can1 * 6, y = Can2 * 6, label = Variable),
    size = 3, hjust = 1, vjust = 0.5, box.padding = 0.5) +
  geom_segment(data = loadings_df, aes(x = 0, y = 0,
                                       xend = Can1*5.5, yend = Can2*5.5), 
               arrow = arrow(type = "open", length = unit(0.2, "cm")), 
               linewidth = 0.5, lineend = "round") +
  geom_text(data = centroids, aes(
    x = Can1, y = Can2, label = Group),
    size = 4, color = cbbPalette, fontface = 2,
    nudge_x = c(-1.0, 0.5, -0.5, 1.5, 1.0),
    nudge_y = c(-1.5, -1.2, 1.5, 1.2, -1.0)) +  # Add group labels
  scale_color_manual(values = cbbPalette) +
  scale_fill_manual(values = cbbPalette) +
  coord_fixed(ratio = 1) +
  my_theme + labs(x = x_axis_title, y = y_axis_title) +
  guides(colour = "none", fill = "none")

heplot(ISO.can, var.col = "black", fill = TRUE, fill.alpha = 0.1)

## Vector scale factor set to  6.435736
plot(ISO.can, which=1, col = cbbPalette, cex.axis = 0.7, cex.lab = 0.9,
     points.1d=TRUE, rev.axes = TRUE,pch = 15:19,
     var.lwd = 2, var.col = "black")

plot(ISO.can, which=2, col = cbbPalette, cex.axis = 0.7, cex.lab = 0.8,
     points.1d=TRUE, rev.axes = TRUE,pch = 15:19,
     var.lwd = 2, var.col = "black")

# Create a data frame for the scores
scores_df <- data.frame(
  Group = ISO.can$scores$Group,
  Can3 = ISO.can$scores$Can3,
  Can4 = ISO.can$scores$Can4
)

# Extract the relevant information from the candisc object
loadings_df <- data.frame(
  Variable = rownames(ISO.can$structure),
  Can3 = ISO.can$structure[, 3],
  Can4 = ISO.can$structure[, 4])

# Define axis titles including the percentage of variance explained
x_axis_title <- paste("Can3 (", round(variance_explained[3], 1), "%)", sep = "")
y_axis_title <- paste("Can4 (", round(variance_explained[4], 1), "%)", sep = "")

# Calculate centroids for each group
centroids <- aggregate(cbind(Can3, Can4) ~ Group, scores_df, mean)

# Create the biplot using ggplot2
ggplot() +
  geom_point(data = scores_df, aes(x = Can3, y = Can4, color = Group), size = 2) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  stat_ellipse(data = scores_df, aes(
    x = Can3, y = Can4, fill = Group), 
    geom = "polygon", level = 0.68, alpha = 0.2) +  # Add ellipses for group
  geom_text(data = centroids, aes(
    x = Can3, y = Can4, label = "+"),
    size = 6, color = cbbPalette) +  # Add "+" symbols for centroids
  geom_text_repel(data = loadings_df, aes(
    x = Can3 * 6, y = Can4 * 6, label = Variable),
    size = 3, hjust = 1, vjust = 0.5, box.padding = 0.5) +
  geom_segment(data = loadings_df, aes(x = 0, y = 0,
                                       xend = Can3*5.5, yend = Can4*5.5), 
               arrow = arrow(type = "open", length = unit(0.2, "cm")), 
               linewidth = 0.5, lineend = "round") +
  geom_text(data = centroids, aes(
    x = Can3, y = Can4, label = Group),
    size = 4, color = cbbPalette, fontface = 2,
    nudge_x = c(1.0, -0.5, 0.0, -1.0, 1.5),
    nudge_y = c(1.5, -1.0, -1.0, 1.2, -0.5)) +  # Add group labels
  scale_color_manual(values = cbbPalette) +
  scale_fill_manual(values = cbbPalette) +
  coord_fixed(ratio = 1) +
  my_theme + labs(x = x_axis_title, y = y_axis_title) +
  guides(colour = "none", fill = "none")

heplots::heplot(ISO.can, variables = 3:4, var.col = "black",
       fill = TRUE, fill.alpha = 0.1)

## Vector scale factor set to  3.034421

The CDA finds 4 canonical vectors, but only two have group means that are significantly different. Together, the 2 explain 75% of the variation observed.

The HE plot represent sums-of-squares-and-products matrices for linear hypotheses (H) and for error (E). In the default scaling, such plots have the visual property that a given effect in the multivariate linear model is significant (by Roy’s maximum root test) if the H ellipse projects anywhere outside the E ellipse. The size of the H ellipse relative to that of the E ellipse is an indication of the magnitude of the multivariate effect for group. The further away variable centroids are from the “Error” circle, the stronger the association between groups and variables (Friendly & Sigal, 2017).

In the biplot, each data point is represented by its score on the discriminant axis. The variable vectors represent the structure coefficients, i.e. the angle between the vector and the canonical axis represent the correlation of variables with canonical scores, while the vector’s relative length reflect the sum of the squared correlations and hence represent the variable’s contribution to the discrimination between groups (Field et al., 2012, p. 738).

scaled_ISO <- dISO |>
    mutate(across(where(is.numeric), scale))

Slda <- Slda(data = as.matrix(scaled_ISO[, 2:12]), grouping = scaled_ISO$Group,
    StddzData = TRUE, ladfun = "classification")
Slda
## Call:
## Slda.default(data = as.matrix(scaled_ISO[, 2:12]), grouping = scaled_ISO$Group, 
##     StddzData = TRUE, ladfun = "classification")
## 
## Prior probabilities of groups:
##            El Molo            Turkana     Luhya (Webuye) Luhya (Port Vict.) 
##          0.2432432          0.1891892          0.1891892          0.1621622 
##   Luo (Port Vict.) 
##          0.2162162 
## 
## Group means:
##                           Tfv        Ssk        Sxp          Sda
## El Molo             0.6393859  0.8847521  0.1170062 -0.102954091
## Turkana            -0.6359069  0.2133864 -0.5262379 -0.623304691
## Luhya (Webuye)      0.2228006  0.2406359 -0.1939716 -0.166426744
## Luhya (Port Vict.)  0.4264057 -0.3306241 -0.5939760  1.073557817
## Luo (Port Vict.)   -0.6776453 -1.1446475  0.9440333  0.001669996
## 
## Coefficients of linear discriminants:
##             LD1        LD2        LD3        LD4
## Tfv -0.39348495 -0.4456465 0.65633995  0.8074253
## Ssk -1.22926872  0.1839274 0.04278755 -0.7178051
## Sxp -0.04260694  0.8299720 0.81804596 -0.2981733
## Sda  0.53782314 -0.7345409 0.17354765 -0.7349617
## Proportion of trace:
##         LD1         LD2         LD3         LD4 
## 0.554380573 0.317499098 0.123729033 0.004391297 
## 
## Variables kept in discriminant rule:
## [1] "Tfv" "Ssk" "Sxp" "Sda"
## Number of variables kept: 4

‘Slda’ is a shrunken discriminant analysis that finds the coefficients of a linear discriminant rule based on Fisher and Sun’s (2011) estimate and generalizations of Ledoit and Wolf’s (2004) optimal shrunken covariance matrix (Silva, 2015). Here, it finds 4 discriminants just as the canonical approach, and keeps 4 variables in the discriminant rule, Tfv, Ssk, Sxp, Sda, which are also the variables with higher impact before. Thus, this approach confirms the discriminatory and descriptive power of the canonical approach. As scores are not produced in these approaches, no further analyses is possible.

References

Anderson, M. J. (2017). Permutational Multivariate Analysis of Variance (PERMANOVA). John Wiley & Sons, Ltd. https://doi.org/10.1002/9781118445112.stat07841
Carlson, D. (2017). Quantitative Methods in Archaeology Using R. Cambridge, UK: Cambridge University Press.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (Second Edi). Hillsdale, New Jersey, USA: Lawrence Erlbaum Associates.
Correia, M. A., Foley, R., & Lahr, M. M. (2021). Applying dental microwear texture analysis to the living: Challenges and prospects. American Journal of Physical Anthropology, 174(3), 542–554. https://doi.org/10.1002/ajpa.24133
Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using r. London, United Kingdom: SAGE Publications Ltd.
Filzmoser, P., & Gschwandter, M. (n.d.). Package ’mvoutlier’: Multivariate outlier detection based on robust methods. Retrieved from https://cran.r-project.org/web/packages/mvoutlier/mvoutlier.pdf
Friendly, M., & Sigal, M. (2017). Graphical methods for multivariate linear models in psychological research: An R tutorial. The Quantitative Methods for Psychology, 13(1), 20–45. https://doi.org/10.20982/tqmp.13.1.p020
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning. New York, NY: Springer New York. https://doi.org/10.1007/978-0-387-84858-7
Herve, M. (2023). Package ’RVAideMemoire’. Retrieved from https://cran.r-project.org/web/packages/RVAideMemoire/RVAideMemoire.pdf
ISO/TC 213. (2021). ISO 25178-2:2021 geometrical product specifications (GPS) surface texture: Areal part 2: Terms, definitions and surface texture parameters. Geneva, Switzerland: International Organization for Standardization. Retrieved from https://www.iso.org/standard/74591.html
Jarek. (2012). Mvnormtest.pdf. Retrieved from https://cran.r-project.org/web/packages/mvnormtest/mvnormtest.pdf
Kassambara, A. (2023). Rstatix: Pipe-friendly framework for basic statistical tests. Retrieved from https://cran.r-project.org/web/packages/rstatix/rstatix.pdf
Silva, A. (2015). Package HiDimDA’ - high dimensional discriminant analysis. Retrieved from https://cran.r-project.org/web/packages/HiDimDA/HiDimDA.pdf
Ungar, P., Livengood, S., & Crittenden, A. (2019). Dental microwear of living hadza foragers. American Journal of Physical Anthropology, 169(2), 356–367. https://doi.org/10.1002/ajpa.23836
## R version 4.3.1 (2023-06-16 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8 
## [2] LC_CTYPE=English_United Kingdom.utf8   
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.utf8    
## 
## time zone: Europe/Lisbon
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] RVAideMemoire_0.9-83-7 vegan_2.6-4            lattice_0.21-8        
##  [4] permute_0.9-7          HiDimDA_0.2-4          ggpubr_0.6.0          
##  [7] lemon_0.4.7            formatR_1.14           kableExtra_1.3.4.9000 
## [10] candisc_0.8-6          heplots_1.6.0          broom_1.0.5           
## [13] rstatix_0.7.2          rrcov_1.7-4            robustbase_0.99-1     
## [16] mvnormtest_0.1-9       mvoutlier_2.1.1        sgeostat_1.0-27       
## [19] car_3.1-2              carData_3.0-5          corrplot_0.92         
## [22] ggspatial_1.1.9        ggrepel_0.9.4          gt_0.10.0             
## [25] rnaturalearth_1.0.1    lubridate_1.9.3        forcats_1.0.0         
## [28] stringr_1.5.1          dplyr_1.1.3            purrr_1.0.2           
## [31] readr_2.1.4            tidyr_1.3.0            tibble_3.2.1          
## [34] ggplot2_3.4.4          tidyverse_2.0.0       
## 
## loaded via a namespace (and not attached):
##  [1] rstudioapi_0.15.0       jsonlite_1.8.8          datawizard_0.9.1       
##  [4] magrittr_2.0.3          TH.data_1.1-2           estimability_1.4.1     
##  [7] farver_2.1.1            rmarkdown_2.25          vctrs_0.6.4            
## [10] base64enc_0.1-3         terra_1.7-65            effectsize_0.8.6       
## [13] webshot_0.5.5           htmltools_0.5.7         sass_0.4.8             
## [16] KernSmooth_2.23-21      bslib_0.6.1             htmlwidgets_1.6.4      
## [19] plyr_1.8.9              sandwich_3.1-0          emmeans_1.9.0          
## [22] zoo_1.8-12              cachem_1.0.8            lifecycle_1.0.4        
## [25] pkgconfig_2.0.3         Matrix_1.6-4            R6_2.5.1               
## [28] fastmap_1.1.1           digest_0.6.33           colorspace_2.1-0       
## [31] labeling_0.4.3          fansi_1.0.5             timechange_0.2.0       
## [34] httr_1.4.7              abind_1.4-5             mgcv_1.8-42            
## [37] compiler_4.3.1          proxy_0.4-27            withr_2.5.2            
## [40] backports_1.4.1         DBI_1.2.0               highr_0.10             
## [43] ggsignif_0.6.4          MASS_7.3-60             classInt_0.4-10        
## [46] tools_4.3.1             units_0.8-4             glue_1.6.2             
## [49] rnaturalearthdata_0.1.0 nlme_3.1-162            grid_4.3.1             
## [52] sf_1.0-14               cluster_2.1.4           generics_0.1.3         
## [55] gtable_0.3.4            tzdb_0.4.0              class_7.3-22           
## [58] hms_1.1.3               sp_2.1-2                xml2_1.3.6             
## [61] utf8_1.2.4              pillar_1.9.0            splines_4.3.1          
## [64] survival_3.5-5          tidyselect_1.2.0        knitr_1.45             
## [67] gridExtra_2.3           svglite_2.1.3           stats4_4.3.1           
## [70] xfun_0.41               DEoptimR_1.1-3          stringi_1.8.3          
## [73] yaml_2.3.8              evaluate_0.23           codetools_0.2-19       
## [76] cli_3.6.1               xtable_1.8-4            parameters_0.21.3      
## [79] systemfonts_1.0.5       munsell_0.5.0           jquerylib_0.1.4        
## [82] Rcpp_1.0.11             parallel_4.3.1          rgl_1.2.8              
## [85] bayestestR_0.13.1       viridisLite_0.4.2       mvtnorm_1.2-4          
## [88] scales_1.3.0            e1071_1.7-13            pcaPP_2.0-4            
## [91] insight_0.19.7          rlang_1.1.1             rvest_1.0.3            
## [94] multcomp_1.4-25

  1. warning about bindiwdth but as binwidth depends on scale and each variable has different scales, setting binwidth produces graph with different sized dots; a possible solution wouldbe to use ‘binwidth = unit(10, “npc”)’ as suggested here: https://github.com/mjskay/ggdist/issues/53 but this gives error of !object is not coercible to a unit, which I think is related to recent changes to unit()?↩︎

  2. Hedge’s g was calculated using the rstatix::cohens_d function, which includes an hedges.correction. According to the package (Kassambara, 2023), this correction should be used when n is small and is a “logical indicating whether to apply the Hedges correction by multiplying the usual value of Cohen’s d by (N-3)/(N-2.25) (for unpaired t-test) and by (n1-2)/(n1-1.25) for paired t-test; where N is the total size of the two groups being compared (N = n1 + n2)”. Now, the effsize::cohen.d function has the same syntax, stating that applying an hedges.correction computes the Hedge’s g statistic. However, this second function is difficult to apply to tidy data, and hence, we opted to use the first option.↩︎

  3. Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher’s linear discriminant, a method used in statistics and other fields. The terms Fisher’s linear discriminant and LDA are often used interchangeably, although Fisher’s original article actually describes a slightly different discriminant, which does not make some of the assumptions of LDA such as normally distributed classes or equal class covariances, although statisticians disagree on the subject. Discriminant analysis (DA) encompasses procedures for classifying observations into groups (predictive discriminant analysis, PDA) and describing the relative importance of variables for distinguishing between groups (descriptive or canonical discriminant analysis, DDA). When following up MANOVA, DA falls into the first category. However, discriminant analysis can make substantial demands on sample size. A rule of thumb states that the data should have at least five more observations in each group than the number of variables (Carlson, 2017, p. 245). If dealing with high dimensional data, one should instead conduct regularised discriminant analyses (rLDA or RDA) or quadratic discriminant analyses (QDA). QDA is a variant of LDA in which an individual covariance matrix is estimated for every class of observations, which accommodates more flexible decision boundaries, but whose number of to-estimate parameters increase faster than those of LDA, whereas RDA is a compromise between LDA and QDA that regularizes the individual class covariance matrices, i.e. restrictions are applied to the estimated parameters. However, RDA and QDA do not produce scores, because they partition the space “by ellipsoid like structures based on the estimation of the inner-class covariance matrices”. Considering that scores are essential to LDA interpretation following MANOVA, we opted to commit to the simpler LDA.↩︎

  4. Note that to conduct rLDA we had to standardise the data (by centering and scaling), a different transformation to the rank-transformation used previously.↩︎