This file documents the statistical analysis of dental microwear variables from Kenyan living human populations, as well as the code to produce Figure 1, the location map with fieldwork sites. A previous analyses of SSFA parameters (Asfc, epLsar, HAsfc, Tfv) was published in Correia, Foley, & Lahr (2021). Here, ISO 25178 (or STA) parameters are also included in a multivariate analysis.

#save images in folder in directory
knitr::opts_chunk$set(
  echo = TRUE, #shows code
  warning = FALSE, message = FALSE, #stops warning messages
  fig.path = "images/",
  dev = c("svg", "png", "tiff"), #saves figures as svg, tiff, and png in images folder
  dpi = 500, #publishing quality for combination art (Elsevier)
  tidy.opts=list(width.cutoff=60), # stops code from running off page
  tidy=TRUE
)

# fileEnconding specifies character encoding as UTF-8
# rather than Unicode characters; otherwise unicode
# characters are added to column name
ISO <- read.csv("MW_ISO.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE,
    fileEncoding = "UTF-8-BOM")

# making ggplot2 respect the Group order
ISO$Group <- factor(ISO$Group, levels = unique(ISO$Group))
ISO <- ISO |>
    mutate(Group = fct_relevel(Group, "El Molo", "Turkana", "Webuye",
        "Luhya", "Luo"))
levels(ISO$Group) <- c("El Molo", "Turkana", "Luhya (Webuye)",
    "Luhya (Port Vict.)", "Luo (Port Vict.)")

# fieldwork site location
sites <- read.csv("workplaces.txt")

#graphical settings for map
map_theme <- theme(axis.text=element_blank(),
        axis.title=element_blank(),#and for axis titles
        panel.grid.minor = element_blank(), 
        panel.background = element_rect(fill="white"),
        plot.background = element_blank(),
        panel.grid.major = element_blank(),
        axis.ticks=element_blank(),  
        panel.border = element_rect(colour = "black", fill=NA, linewidth = 0.5))

#graphical settings for ggplot
my_theme <- theme(
  axis.text=element_text(size=8,colour="black"),
  axis.ticks=element_line (linewidth = 0.5, colour="black"), 
  axis.title=element_text(size=10),
  panel.grid.minor = element_blank(), 
  panel.background = element_blank(),
  panel.grid.major = element_blank(),
  legend.key=element_blank(),
  panel.border = element_rect(colour = "black", fill=NA, linewidth = 0.5))

#use first row data as column names for transposition
header.true <- function(df) {
  colnames(df) <- as.character(unlist(df[1,]))
  df[-1,]
}

Producing the location map

africa <- ne_countries(type = "countries", continent = "Africa",
    scale = "medium", returnclass = "sf")
lakes <- ne_download(scale = "medium", category = "physical",
    type = "lakes", returnclass = "sf")

## Reading layer `ne_50m_lakes' from data source 
##   `C:\Users\corre\AppData\Local\Temp\RtmpoLOXWw\ne_50m_lakes.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 412 features and 39 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -165.8985 ymin: -50.62002 xmax: 176.0827 ymax: 81.94033
## Geodetic CRS:  WGS 84

lakes <- ne_load(scale = "medium", category = "physical", type = "lakes",
    returnclass = "sf")

## Reading layer `ne_50m_lakes' from data source 
##   `C:\Users\corre\AppData\Local\Temp\RtmpoLOXWw\ne_50m_lakes.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 412 features and 39 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -165.8985 ymin: -50.62002 xmax: 176.0827 ymax: 81.94033
## Geodetic CRS:  WGS 84

# tomorrow start by trying to add lakes to map to do: inset
inset <- africa %>%
    ggplot() + geom_sf(fill = "grey87", color = "grey30") + geom_sf(data = africa |>
    filter(name %in% c("Kenya")), fill = "grey70", color = "grey30") +
    coord_sf(xlim = c(-22, 55), ylim = c(-40, 43), expand = FALSE) +
    geom_rect(aes(xmin = 31, xmax = 43, ymin = -5, ymax = 5.6),
        color = "black", fill = NA, linewidth = 0.6) + map_theme


main <- africa %>%
    ggplot() + geom_sf(fill = "grey87", color = "grey30") + geom_sf(data = africa |>
    filter(name %in% c("Kenya")), fill = "grey70", color = "grey30") +
    geom_sf(data = lakes, fill = "white") + geom_point(data = sites,
    aes(x = long, y = lat)) + coord_sf(xlim = c(31, 43), ylim = c(-5,
    5.6), expand = FALSE) + geom_sf_text(data = africa %>%
    filter(name %in% c("Kenya")), aes(label = name_long), size = 3.5,
    fontface = "bold", nudge_x = 0.6, nudge_y = 0.3) + geom_sf_text(data = lakes %>%
    filter(name %in% c("Lake Victoria", "Lake Turkana")), aes(label = c("Lake \n Victoria",
    "Lake \nTurkana")), size = 2.5, nudge_y = c(0.5, 0.1), nudge_x = c(0.1,
    1)) + ggrepel::geom_text_repel(data = sites, aes(x = long,
    y = lat, label = c("El Molo", "Luo | Luhya", "Luhya", "Turkana"),
    fontface = "plain"), nudge_y = c(-0.3, 0.05, 0.1, -0.5),
    nudge_x = c(0.3, 1, 0.2, 0.05), size = 2.9) + annotation_north_arrow(location = "tl",
    which_north = "true", height = unit(0.5, "cm"), width = unit(0.4,
        "cm"), pad_x = unit(0.4, "cm"), pad_y = unit(0.4, "cm")) +
    annotation_scale(location = "bl", style = "bar") + map_theme

# inset rectangle

main + annotation_custom(ggplotGrob(inset), xmin = 37.5, xmax = 44.5,
    ymin = 1.5, ymax = 5.5)

Addressing collinearity

The STA parameters initially considered here are described in ISO/TC 213 (2021):

skewness (Ssk);
maximum peak height (Sp);
maximum height (Sz);
extreme peak height (Sxp);
root mean square gradient (Sdq);
developed interfacial area ratio (Sdr);
pit void volume (Vvv);
five-point pit height (S5v);
mean dale area (Sda);
mean dale volume (Sdv)

Some of the STA variables likely capture the same underlying variation. Thus, we expect some of the variables to be highly correlated (Ungar, Livengood, & Crittenden, 2019), which affects general linear models since it violates one of the main assumptions which states that dependent variables will be independent from each other, i.e. that there will be no multicollinearity (Field, Miles, & Field, 2012, p. 274). Here, a correlogram was used to evaluate the correlation matrix.

dISO <- ISO %>%
    dplyr::select(Group, Asfc, epLsar, HAsfc, Tfv, Ssk, Sp, Sz,
        Sxp, Sdq, Sdr, Vvv, S5v, Sda, Sdv)

x <- as.matrix(dISO[, 2:15])
M <- x |>
    cor()

cor.mtest <- function(mat, ...) {
    mat <- as.matrix(mat)
    n <- ncol(mat)
    p.mat <- matrix(NA, n, n)
    diag(p.mat) <- 0
    for (i in 1:(n - 1)) {
        for (j in (i + 1):n) {
            tmp <- cor.test(mat[, i], mat[, j], ...)
            p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
        }
    }
    colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
    p.mat
}
# matrix of the p-value of the correlation
p.mat <- cor.mtest(x)

col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD",
    "#4477AA"))
corrplot(M, method = "color", col = col(200), type = "lower",
    tl.col = "black", addCoef.col = "black", p.mat = p.mat, sig.level = 0.01,
    insig = "blank", diag = FALSE)

# remove variables with high correlation
dISO <- dISO |>
    dplyr::select(-c(Sdr, Vvv, S5v))

Squares with colour are significant correlations (\(p<0.01\)) and we observed 3 correlations above 0.90, confirming collinearity. We removed 3 variables from the analyses to avoid this issue (Sdr, Vvv, S5v). In addition, we observe several medium to strong correlations, which suggest an underlying pattern of variation and support a multivariate approach.

Visualization and descriptive statistics

Dotplots¹ were used to visually examine the data, and it’s interesting that there is one clear outlier for Sz, which explains the high standard deviation of this variable for the group including it. Descriptive statistics reported include standard and robust measures of both central tendency and dispersion (mean and SD; median and IQR).

grouped.data <- dISO |>
    gather(key = "variable", value = "value", Asfc:Sdv) |>
    group_by(variable)

# Colour-blind palette
cbbPalette <- c("#E69F00", "#CC79A7", "#009E73", "#56B4E9", "#0072B2")

# dotplots
dotplot <- ggplot(grouped.data, aes(x = value, colour = Group,
    fill = Group)) + geom_dotplot(stackdir = "center", stackgroups = TRUE,
    binpositions = "all") + coord_flip() + scale_fill_manual(values = cbbPalette) +
    scale_colour_manual(values = cbbPalette) + scale_y_continuous(NULL,
    breaks = NULL) + facet_wrap(~variable, scales = "free_y") +
    my_theme

reposition_legend(dotplot, "center", panel = "panel-4-3")

# summary table
sISO <- dISO |>
    group_by(Group) |>
    get_summary_stats(type = "common", show = c("n", "mean",
        "sd", "median", "iqr"))

kbl(sISO, format = "html", booktabs = TRUE, longtable = TRUE,
    caption = "summary statistics") |>
    kable_classic(full_width = F, fixed_thead = T, bootstrap_options = c("hover",
        "responsive")) |>
    row_spec(0, italic = TRUE)

summary statistics
Group	variable	n	mean	sd	median	iqr
El Molo	Asfc	9	3.375	1.557	3.781	2.486
El Molo	epLsar	9	0.004	0.002	0.003	0.002
El Molo	HAsfc	9	0.451	0.196	0.460	0.280
El Molo	Tfv	9	53682.043	7781.755	52229.147	10723.046
El Molo	Ssk	9	-0.034	0.229	-0.028	0.211
El Molo	Sp	9	3.402	2.251	2.579	3.714
El Molo	Sz	9	5.875	3.157	4.667	4.549
El Molo	Sxp	9	0.862	0.173	0.795	0.309
El Molo	Sdq	9	0.214	0.046	0.214	0.067
El Molo	Sda	9	316.519	159.673	308.716	135.162
El Molo	Sdv	9	7.347	5.562	5.717	5.787
Turkana	Asfc	7	3.320	0.978	3.231	1.160
Turkana	epLsar	7	0.002	0.001	0.001	0.001
Turkana	HAsfc	7	0.256	0.124	0.210	0.055
Turkana	Tfv	7	41618.790	8024.451	44927.780	5903.392
Turkana	Ssk	7	-0.315	0.199	-0.336	0.249
Turkana	Sp	7	2.357	1.068	2.043	0.794
Turkana	Sz	7	7.159	7.644	3.772	2.754
Turkana	Sxp	7	0.713	0.173	0.769	0.159
Turkana	Sdq	7	0.229	0.086	0.198	0.073
Turkana	Sda	7	234.190	122.793	207.141	126.187
Turkana	Sdv	7	4.693	3.332	3.380	3.998
Luhya (Webuye)	Asfc	7	3.017	1.816	2.495	1.716
Luhya (Webuye)	epLsar	7	0.002	0.001	0.002	0.002
Luhya (Webuye)	HAsfc	7	0.504	0.126	0.520	0.155
Luhya (Webuye)	Tfv	7	49741.478	8327.653	53447.254	11367.517
Luhya (Webuye)	Ssk	7	-0.303	0.260	-0.277	0.325
Luhya (Webuye)	Sp	7	2.504	1.612	1.729	1.452
Luhya (Webuye)	Sz	7	4.815	2.244	4.223	2.237
Luhya (Webuye)	Sxp	7	0.790	0.204	0.792	0.256
Luhya (Webuye)	Sdq	7	0.212	0.095	0.179	0.084
Luhya (Webuye)	Sda	7	306.476	124.430	274.599	154.342
Luhya (Webuye)	Sdv	7	7.309	4.474	7.491	6.915
Luhya (Port Vict.)	Asfc	6	2.211	0.966	2.165	1.449
Luhya (Port Vict.)	epLsar	6	0.003	0.001	0.002	0.001
Luhya (Port Vict.)	HAsfc	6	0.557	0.437	0.430	0.248
Luhya (Port Vict.)	Tfv	6	51667.420	5782.717	50000.628	4299.156
Luhya (Port Vict.)	Ssk	6	-0.543	0.353	-0.491	0.326
Luhya (Port Vict.)	Sp	6	2.466	1.357	2.020	1.671
Luhya (Port Vict.)	Sz	6	5.025	1.458	5.084	1.588
Luhya (Port Vict.)	Sxp	6	0.697	0.192	0.697	0.214
Luhya (Port Vict.)	Sdq	6	0.175	0.043	0.189	0.073
Luhya (Port Vict.)	Sda	6	502.664	196.921	519.202	298.244
Luhya (Port Vict.)	Sdv	6	12.467	8.062	12.226	12.700
Luo (Port Vict.)	Asfc	8	2.474	0.859	2.748	1.304
Luo (Port Vict.)	epLsar	8	0.003	0.002	0.003	0.001
Luo (Port Vict.)	HAsfc	8	0.441	0.123	0.440	0.150
Luo (Port Vict.)	Tfv	8	41223.978	10192.177	45092.266	8974.235
Luo (Port Vict.)	Ssk	8	-0.884	0.433	-0.636	0.464
Luo (Port Vict.)	Sp	8	2.243	1.436	1.915	0.511
Luo (Port Vict.)	Sz	8	5.103	1.945	4.568	0.997
Luo (Port Vict.)	Sxp	8	1.055	0.259	1.043	0.300
Luo (Port Vict.)	Sdq	8	0.197	0.042	0.195	0.058
Luo (Port Vict.)	Sda	8	333.072	101.095	314.123	70.991
Luo (Port Vict.)	Sdv	8	10.347	7.417	8.368	5.473

Inferential Statistics

A multivariate analysis of variance (MANOVA) was used to investigate variation in central tendency between groups, since it takes into consideration any relationship between the dependent variables.

Assumptions

The assumptions of the MANOVA are (Field et al., 2012, p. 717):

Independence
Random Sampling
Uni and multivariate normality (within groups)
Uni and multivariate homogeneity of variances
Linearity (linear relationships among all pairs of dependent variables, all pairs of covariates, and all dependent variable-covariate pairs in each cell)

ADDITIONAL LIMITATIONS
Unequal sample sizes (when cells in a factorial MANOVA have different sample sizes, the sum of squares for effect plus error does not equal the total sum of squares. This causes tests of main effects and interactions to be correlated)
Uni and multivariate outliers (may produce either a Type I or Type II error and give no indication as to which type of error is occurring in the analysis)

Independence and linearity were addressed with the correlogram. Sample sizes are not too unequal, but small. Specifically, the sample size of each group is inferior to the number of variables, which is a risk when conducting multivariate analyses. Next, we will examine the outliers, normality and homocedasticity of the data. Conducting multiple statistical tests increases the chances of Type I error; when necessary, we applied a false discovery rate (FDR) correction, also known as Benjamini & Hochberg.

The aq.plot() function from the mvoutlier package (Filzmoser & Gschwandter, n.d.) identifies multivariate outliers by plotting the ordered squared robust Mahalanobis distances of the observations against the empirical distribution function of the MD2i. The function produces 4 graphs that identify multivariate outliers in red. Univariate outliers were identified in the graphs above.

outliers <- aq.plot(dISO[, 2:12])

## Projection to the first and second robust principal components.
## Proportion of total variation (explained variance): 0.9994766

To test for multivariate normality within groups, we used the mshapiro.test in the mvnormtest package (Jarek, 2012), which suggests that there is no multivariate normality. We also assessed multinormality graphically by examining plots of the ordered Mahalanobis distances (D²) and paired chi-squared values, \(\chi\) ², of the model residuals. If the data are multivariate normal, the plot should form a relatively straight, diagonal line. The lack of multivariate normality is not surprising considering that the scales of measurement are very distinct between variables.

# Perform the multivariate Shapiro-Wilk test within each
# group
mshapiro_results <- dISO %>%
    group_by(Group) %>%
    do(tidy(mvnormtest::mshapiro.test(as.matrix(select(., where(is.numeric))))))


kbl(mshapiro_results, digits = c(0, 3, 10), format = "html",
    booktabs = TRUE, caption = "Multinormality tests") |>
    kable_classic(full_width = F, fixed_thead = T, bootstrap_options = c("hover",
        "responsive")) |>
    row_spec(0, italic = TRUE)

Multinormality tests
Group	statistic	p.value	method
El Molo	0.345	2.25e-08	Shapiro-Wilk normality test
Turkana	0.345	2.24e-08	Shapiro-Wilk normality test
Luhya (Webuye)	0.347	2.34e-08	Shapiro-Wilk normality test
Luhya (Port Vict.)	0.345	2.25e-08	Shapiro-Wilk normality test
Luo (Port Vict.)	0.345	2.24e-08	Shapiro-Wilk normality test

cqplot(lm(as.matrix(dISO[, 2:12]) ~ Group, data = dISO), main = "Chi-Square Q-Q Plot of MANOVA Model Residuals")

For univariate normality, we used shapiro.test, and found 4 significant tests at p<0.01, suggesting most variables are normal within groups.

uni_normal <- grouped.data |>
    group_by(variable, Group) |>
    do(tidy(shapiro.test(.$value)))

color.me <- which(uni_normal$p.value < 0.01)

kbl(uni_normal, digits = c(3, 3), format = "html", booktabs = TRUE,
    longtable = TRUE, caption = "Uninormality tests") |>
    kable_classic(full_width = F, fixed_thead = T, bootstrap_options = c("hover",
        "responsive")) |>
    row_spec(color.me, bold = TRUE, background = "gray!6") |>
    row_spec(0, italic = TRUE)

Uninormality tests
variable	Group	statistic	p.value	method
Asfc	El Molo	0.938	0.562	Shapiro-Wilk normality test
Asfc	Turkana	0.926	0.520	Shapiro-Wilk normality test
Asfc	Luhya (Webuye)	0.902	0.340	Shapiro-Wilk normality test
Asfc	Luhya (Port Vict.)	0.944	0.694	Shapiro-Wilk normality test
Asfc	Luo (Port Vict.)	0.910	0.355	Shapiro-Wilk normality test
HAsfc	El Molo	0.951	0.705	Shapiro-Wilk normality test
HAsfc	Turkana	0.650	0.001	Shapiro-Wilk normality test
HAsfc	Luhya (Webuye)	0.935	0.595	Shapiro-Wilk normality test
HAsfc	Luhya (Port Vict.)	0.747	0.019	Shapiro-Wilk normality test
HAsfc	Luo (Port Vict.)	0.935	0.567	Shapiro-Wilk normality test
Sda	El Molo	0.893	0.217	Shapiro-Wilk normality test
Sda	Turkana	0.947	0.701	Shapiro-Wilk normality test
Sda	Luhya (Webuye)	0.968	0.881	Shapiro-Wilk normality test
Sda	Luhya (Port Vict.)	0.882	0.279	Shapiro-Wilk normality test
Sda	Luo (Port Vict.)	0.916	0.401	Shapiro-Wilk normality test
Sdq	El Molo	0.976	0.937	Shapiro-Wilk normality test
Sdq	Turkana	0.864	0.163	Shapiro-Wilk normality test
Sdq	Luhya (Webuye)	0.850	0.122	Shapiro-Wilk normality test
Sdq	Luhya (Port Vict.)	0.821	0.090	Shapiro-Wilk normality test
Sdq	Luo (Port Vict.)	0.896	0.264	Shapiro-Wilk normality test
Sdv	El Molo	0.854	0.082	Shapiro-Wilk normality test
Sdv	Turkana	0.830	0.080	Shapiro-Wilk normality test
Sdv	Luhya (Webuye)	0.925	0.513	Shapiro-Wilk normality test
Sdv	Luhya (Port Vict.)	0.884	0.288	Shapiro-Wilk normality test
Sdv	Luo (Port Vict.)	0.760	0.011	Shapiro-Wilk normality test
Sp	El Molo	0.832	0.047	Shapiro-Wilk normality test
Sp	Turkana	0.888	0.264	Shapiro-Wilk normality test
Sp	Luhya (Webuye)	0.748	0.012	Shapiro-Wilk normality test
Sp	Luhya (Port Vict.)	0.919	0.496	Shapiro-Wilk normality test
Sp	Luo (Port Vict.)	0.626	0.000	Shapiro-Wilk normality test
Ssk	El Molo	0.965	0.849	Shapiro-Wilk normality test
Ssk	Turkana	0.944	0.674	Shapiro-Wilk normality test
Ssk	Luhya (Webuye)	0.909	0.386	Shapiro-Wilk normality test
Ssk	Luhya (Port Vict.)	0.860	0.188	Shapiro-Wilk normality test
Ssk	Luo (Port Vict.)	0.785	0.020	Shapiro-Wilk normality test
Sxp	El Molo	0.848	0.071	Shapiro-Wilk normality test
Sxp	Turkana	0.969	0.887	Shapiro-Wilk normality test
Sxp	Luhya (Webuye)	0.903	0.350	Shapiro-Wilk normality test
Sxp	Luhya (Port Vict.)	0.985	0.974	Shapiro-Wilk normality test
Sxp	Luo (Port Vict.)	0.942	0.628	Shapiro-Wilk normality test
Sz	El Molo	0.850	0.074	Shapiro-Wilk normality test
Sz	Turkana	0.608	0.000	Shapiro-Wilk normality test
Sz	Luhya (Webuye)	0.822	0.067	Shapiro-Wilk normality test
Sz	Luhya (Port Vict.)	0.941	0.670	Shapiro-Wilk normality test
Sz	Luo (Port Vict.)	0.688	0.002	Shapiro-Wilk normality test
Tfv	El Molo	0.920	0.388	Shapiro-Wilk normality test
Tfv	Turkana	0.871	0.188	Shapiro-Wilk normality test
Tfv	Luhya (Webuye)	0.912	0.408	Shapiro-Wilk normality test
Tfv	Luhya (Port Vict.)	0.851	0.161	Shapiro-Wilk normality test
Tfv	Luo (Port Vict.)	0.814	0.040	Shapiro-Wilk normality test
epLsar	El Molo	0.942	0.608	Shapiro-Wilk normality test
epLsar	Turkana	0.848	0.118	Shapiro-Wilk normality test
epLsar	Luhya (Webuye)	0.950	0.730	Shapiro-Wilk normality test
epLsar	Luhya (Port Vict.)	0.859	0.184	Shapiro-Wilk normality test
epLsar	Luo (Port Vict.)	0.915	0.393	Shapiro-Wilk normality test

One may use Box’s test to test for multivariate homoscedasticity, but this test is very sensitive to violations of normality, leading to rejection in most typical cases. Field et al. (2012, p. 275) does not recommend a formal test for multivariate homocedasticity but simply to compare values within covariance matrices. These matrices are quite different, probably as a result, once again, of the difference in scales between variables.

multihomo <- dISO |>
    group_by(Group) |>
    group_map(~cov(.))

levels <- levels(dISO$Group)

for (i in seq_along(multihomo)) {
    table_html <- kbl(multihomo[[i]], format = "html", booktabs = T,
        digits = 3, caption = paste0("Covariance matrix for the ",
            levels[i])) |>
        kable_classic(full_width = F, fixed_thead = T, bootstrap_options = c("hover",
            "responsive"))

    print(table_html)
    cat("\n\n")
}

Covariance matrix for the El Molo
	Asfc	epLsar	HAsfc	Tfv	Ssk	Sp	Sz	Sxp	Sdq	Sda	Sdv
Asfc	2.423	0.001	0.210	-1304.595	0.086	0.832	1.633	-0.075	0.067	-98.785	-1.781
epLsar	0.001	0.000	0.000	-0.028	0.000	0.001	0.001	0.000	0.000	-0.015	-0.002
HAsfc	0.210	0.000	0.039	665.339	0.004	0.281	0.438	0.006	0.007	4.060	0.290
Tfv	-1304.595	-0.028	665.339	60555705.896	-537.263	9415.738	14102.771	706.405	39.049	590742.371	18376.267
Ssk	0.086	0.000	0.004	-537.263	0.053	0.018	0.037	-0.009	0.003	0.733	0.030
Sp	0.832	0.001	0.281	9415.738	0.018	5.068	7.020	0.140	0.054	276.263	10.062
Sz	1.633	0.001	0.438	14102.771	0.037	7.020	9.965	0.171	0.086	351.463	12.636
Sxp	-0.075	0.000	0.006	706.405	-0.009	0.140	0.171	0.030	0.000	15.358	0.679
Sdq	0.067	0.000	0.007	39.049	0.003	0.054	0.086	0.000	0.002	-0.576	0.033
Sda	-98.785	-0.015	4.060	590742.371	0.733	276.263	351.463	15.358	-0.576	25495.469	806.884
Sdv	-1.781	-0.002	0.290	18376.267	0.030	10.062	12.636	0.679	0.033	806.884	30.933

Covariance matrix for the Turkana
	Asfc	epLsar	HAsfc	Tfv	Ssk	Sp	Sz	Sxp	Sdq	Sda	Sdv
Asfc	0.957	-0.001	0.048	2660.207	-0.120	0.546	0.516	0.125	0.033	20.102	2.460
epLsar	-0.001	0.000	0.000	-1.428	0.000	0.000	0.001	0.000	0.000	0.009	-0.002
HAsfc	0.048	0.000	0.015	-4.546	0.002	0.111	-0.054	0.002	0.000	9.563	0.281
Tfv	2660.207	-1.428	-4.546	64391815.611	65.685	-1134.903	-13747.727	832.022	-37.988	360980.503	7393.343
Ssk	-0.120	0.000	0.002	65.685	0.040	-0.018	-0.652	-0.010	-0.011	12.462	-0.062
Sp	0.546	0.000	0.111	-1134.903	-0.018	1.140	2.852	0.073	0.036	55.948	2.602
Sz	0.516	0.001	-0.054	-13747.727	-0.652	2.852	58.438	0.337	0.608	-478.238	-4.052
Sxp	0.125	0.000	0.002	832.022	-0.010	0.073	0.337	0.030	0.008	5.455	0.380
Sdq	0.033	0.000	0.000	-37.988	-0.011	0.036	0.608	0.008	0.007	-5.244	0.012
Sda	20.102	0.009	9.563	360980.503	12.462	55.948	-478.238	5.455	-5.244	15078.074	296.869
Sdv	2.460	-0.002	0.281	7393.343	-0.062	2.602	-4.052	0.380	0.012	296.869	11.101

Covariance matrix for the Luhya (Webuye)
	Asfc	epLsar	HAsfc	Tfv	Ssk	Sp	Sz	Sxp	Sdq	Sda	Sdv
Asfc	3.300	-0.002	0.026	-3954.114	0.244	2.709	3.526	0.328	0.170	-41.268	1.385
epLsar	-0.002	0.000	0.000	1.881	0.000	-0.001	-0.001	0.000	0.000	0.070	0.000
HAsfc	0.026	0.000	0.016	-403.211	0.006	0.042	0.045	0.004	0.001	4.631	0.074
Tfv	-3954.114	1.881	-403.211	69349799.710	-99.503	-4634.101	-4543.318	-287.156	-174.760	195996.322	2573.441
Ssk	0.244	0.000	0.006	-99.503	0.068	0.206	0.228	0.008	0.011	1.310	0.016
Sp	2.709	-0.001	0.042	-4634.101	0.206	2.600	3.549	0.283	0.145	31.965	3.485
Sz	3.526	-0.001	0.045	-4543.318	0.228	3.549	5.036	0.394	0.194	81.860	6.253
Sxp	0.328	0.000	0.004	-287.156	0.008	0.283	0.394	0.042	0.018	-0.777	0.267
Sdq	0.170	0.000	0.001	-174.760	0.011	0.145	0.194	0.018	0.009	-0.969	0.118
Sda	-41.268	0.070	4.631	195996.322	1.310	31.965	81.860	-0.777	-0.969	15482.843	480.245
Sdv	1.385	0.000	0.074	2573.441	0.016	3.485	6.253	0.267	0.118	480.245	20.019

Covariance matrix for the Luhya (Port Vict.)
	Asfc	epLsar	HAsfc	Tfv	Ssk	Sp	Sz	Sxp	Sdq	Sda	Sdv
Asfc	0.933	0.000	0.067	-1598.681	-0.305	0.283	0.507	0.091	0.039	-76.470	-2.369
epLsar	0.000	0.000	0.000	-0.300	0.000	0.000	0.000	0.000	0.000	-0.003	-0.001
HAsfc	0.067	0.000	0.191	554.237	-0.012	0.476	0.352	0.004	0.006	53.180	2.281
Tfv	-1598.681	-0.300	554.237	33439817.261	825.683	-1371.415	-4213.561	-682.916	-100.892	-248610.800	-14410.640
Ssk	-0.305	0.000	-0.012	825.683	0.125	-0.044	-0.109	-0.038	-0.012	18.789	0.636
Sp	0.283	0.000	0.476	-1371.415	-0.044	1.841	1.809	0.137	0.030	181.886	8.004
Sz	0.507	0.000	0.352	-4213.561	-0.109	1.809	2.124	0.190	0.042	153.766	7.443
Sxp	0.091	0.000	0.004	-682.916	-0.038	0.137	0.190	0.037	0.005	7.245	0.318
Sdq	0.039	0.000	0.006	-100.892	-0.012	0.030	0.042	0.005	0.002	-1.132	-0.007
Sda	-76.470	-0.003	53.180	-248610.800	18.789	181.886	153.766	7.245	-1.132	38777.694	1559.151
Sdv	-2.369	-0.001	2.281	-14410.640	0.636	8.004	7.443	0.318	-0.007	1559.151	65.004

Covariance matrix for the Luo (Port Vict.)
	Asfc	epLsar	HAsfc	Tfv	Ssk	Sp	Sz	Sxp	Sdq	Sda	Sdv
Asfc	0.737	0.000	0.054	-3644.121	-0.047	0.442	0.748	0.073	0.033	-71.302	-4.269
epLsar	0.000	0.000	0.000	-9.981	0.000	0.000	-0.001	0.000	0.000	-0.027	0.002
HAsfc	0.054	0.000	0.015	-776.863	-0.010	0.004	-0.010	0.017	0.003	-9.849	-0.623
Tfv	-3644.121	-9.981	-776.863	103880472.855	1369.067	6479.592	6362.517	-297.274	-137.003	537618.558	29244.498
Ssk	-0.047	0.000	-0.010	1369.067	0.188	-0.081	-0.239	-0.028	-0.003	1.200	0.322
Sp	0.442	0.000	0.004	6479.592	-0.081	2.063	2.622	0.221	0.034	-9.811	-0.058
Sz	0.748	-0.001	-0.010	6362.517	-0.239	2.622	3.784	0.306	0.051	-7.732	0.034
Sxp	0.073	0.000	0.017	-297.274	-0.028	0.221	0.306	0.067	0.007	-4.313	-0.008
Sdq	0.033	0.000	0.003	-137.003	-0.003	0.034	0.051	0.007	0.002	-2.767	-0.135
Sda	-71.302	-0.027	-9.849	537618.558	1.200	-9.811	-7.732	-4.313	-2.767	10220.102	693.161
Sdv	-4.269	0.002	-0.623	29244.498	0.322	-0.058	0.034	-0.008	-0.135	693.161	55.015

To asses univariate equality of variances, we used levene_test from the rstatix package (Kassambara, 2023). Even though the test was applied to check the assumptions of the MANOVA, the results already suggest that there are no significant differences in dispersion, something that was investigated in previous studies of living human dental microwear (Ungar et al., 2019).

uni_homocedast <- grouped.data |>
    levene_test(value ~ Group)

kbl(uni_homocedast, digits = c(0, 0, 0, 3, 3), format = "html",
    booktabs = TRUE, caption = "Unihomocedasticity tests") |>
    kable_classic(full_width = F, fixed_thead = T, bootstrap_options = c("hover",
        "responsive")) |>
    row_spec(0, italic = TRUE)

Unihomocedasticity tests
variable	df1	df2	statistic	p
Asfc	4	32	0.801	0.534
HAsfc	4	32	1.216	0.323
Sda	4	32	1.527	0.218
Sdq	4	32	0.608	0.660
Sdv	4	32	1.256	0.308
Sp	4	32	0.802	0.533
Ssk	4	32	0.450	0.772
Sxp	4	32	0.345	0.846
Sz	4	32	0.732	0.577
Tfv	4	32	0.210	0.931
epLsar	4	32	0.437	0.781

Main Test

Bottom line, the data violates several of the assumptions, and non-parametric alternatives should be used. An alternative is to conduct a permutational multivariate analysis of variance (PERMANOVA) using the adonis.II function in the RVAideMemoire package (Herve, 2023), a non-parametric test used to test whether the centroids and dispersion of a categorical variable are equivalent for all groups (Anderson, 2017).

Another issue with the data is that \(n>p\), i.e. number of samples per group (9 for El Molo, 7 for Turkana, 7 for the Luhya from Webuye and 6 for the Luhya from Port Victoria and 8 for the Luo) is lower than the number of dependent variables (11). This stops us from applying some corrections (such as MCD), and decreases the power of the test, and thus, results should be interpreted cautiously.

permanova <- RVAideMemoire::adonis.II(formula = dISO[, 2:12] ~
    Group, data = dISO, method = "euclidean")

es_permanova <- effectsize::eta_squared(permanova, partial = TRUE)

permanova <- permanova |>
    mutate(partialEta2 = ifelse(row_number() == 1, es_permanova$Eta2_partial[1],
        NA)) |>
    rename(`p-value` = "Pr(>F)")


options(knitr.kable.NA = "")
kbl(permanova, caption = "Permanova Results with 999 permutations",
    digits = c(0, 0, 0, 3, 3, 3, 3), format = "html", booktabs = TRUE,
    format.args = list(big.mark = ",")) |>
    kable_classic(full_width = F, fixed_thead = T, bootstrap_options = c("hover",
        "responsive")) |>
    row_spec(0, italic = TRUE)

Permanova Results with 999 permutations
	Sum Sq	Mean Sq	Df	F	p-value	partialEta2
Group	1,040,145,654	260,036,413	4	3.814	0.01	0.323
Residuals	2,181,912,283	68,184,759	32
Total	3,222,057,937		36

The interpretation of partial-\(\eta\) ² is, by convention, that 0.01, 0.06, and 0.14 correspond to small, medium, and large effects, respectively (Cohen, 1988). Thus, we can state that group has a large effect on the dependent variable, such that it accounts for a meaningful part of the variance in some linear combination of the dependent variables.

Follow-up tests

To follow up the MANOVA, we applied pairwise PERMANOVA’s on the distance matrix between groups, univariate ANOVA’s on rank-transformed data, and linear discriminant analyses (LDA). Pairwise PERMANOVA’s inform on the multivariate distances between groups, univariate ANOVA’s inform on variation driven by a single variable, and LDA finds a linear combination of features that characterizes or separates the classes of objects (Field et al., 2012). As for the assumptions tests, we applied a false discovery rate (FDR) correction to control for the increase of Type I error when multiple testing.

Pairwise PERMANOVA

Pairwise PERMANOVA was conducted using the pairwise.perm.manova function from the RVAideMemoire package on a distance matrix produced using vegdist function from the vegan package.

dist_dml <- vegan::vegdist(x = as.matrix(dISO[, 2:12]), method = "euclidean",
    binary = FALSE, diag = TRUE, upper = TRUE, na.rm = FALSE)

pairwise_permanova <- pairwise.perm.manova(resp = dist_dml, fact = dISO$Group,
    test = "Wilks", nperm = 999, progress = FALSE, p.method = "fdr",
    F = TRUE, R2 = TRUE)

# Extract the relevant information and create a data frame
pairwisePermanova <- as.data.frame(cbind(group1 = rep(dimnames(pairwise_permanova$p.value)[[2]],
    each = ncol(pairwise_permanova$p.value)), group2 = rep(dimnames(pairwise_permanova$p.value)[[1]],
    times = nrow(pairwise_permanova$p.value)), `statistic(pseudo-F)` = round(as.vector(pairwise_permanova$F.value),
    3), `adjusted p-value` = round(as.vector(pairwise_permanova$p.value),
    3), R2 = round(as.vector(pairwise_permanova$R2.value), 3)))

# Filter out rows with NA values and first column with
# numbers
pairwisePermanova <- pairwisePermanova |>
    drop_na() |>
    select(-0)

color.me <- which(pairwisePermanova$`adjusted p-value` < 0.05)

kbl(pairwisePermanova, format = "html", booktabs = TRUE, caption = pairwise_permanova$method) |>
    kable_classic(full_width = F, fixed_thead = T, bootstrap_options = c("hover",
        "responsive")) |>
    row_spec(color.me, bold = TRUE, background = "gray!6") |>
    row_spec(0, italic = TRUE)

permutation MANOVAs on a distance matrix
group1	group2	statistic(pseudo-F)	adjusted p-value	R2
El Molo	Turkana	9.209	0.045	0.397
El Molo	Luhya (Webuye)	0.95	0.473	0.064
El Molo	Luhya (Port Vict.)	0.294	0.719	0.022
El Molo	Luo (Port Vict.)	8.136	0.045	0.352
Turkana	Luhya (Webuye)	3.453	0.17	0.223
Turkana	Luhya (Port Vict.)	6.484	0.05	0.371
Turkana	Luo (Port Vict.)	0.007	0.959	0.001
Luhya (Webuye)	Luhya (Port Vict.)	0.228	0.719	0.02
Luhya (Webuye)	Luo (Port Vict.)	3.079	0.17	0.192
Luhya (Port Vict.)	Luo (Port Vict.)	5.017	0.095	0.295

Interpretation of \(R\) ² is, by convention, that 0.01, 0.09, and 0.25 correspond to small, medium, and large effects, respectively (Cohen, 1988). We found two significant comparisons, between the El Molo and the Turkana, and the El Molo and the Luo (Port Victoria).

Univariate ANOVA

As the data does not follow ANOVA assumptions, namely normality, we rank-transformed the data, also known as a Kruskal Wallis test, for which we used the kruskal_test function from the rstatix package.

followup_kruskal <- grouped.data |>
    kruskal_test(value ~ Group) |>
    adjust_pvalue(method = "fdr")

# followup_kruskal <- followup_kruskal[,
# !names(followup_kruskal) %in% '.y.']

followup_effsize <- grouped.data |>
    kruskal_effsize(value ~ Group)

followup_effsize <- followup_effsize |>
    select(variable, effsize, magnitude)

merged_data <- followup_kruskal %>%
    left_join(followup_effsize, by = "variable") |>
    select(variable, n, df, statistic, p.adj, method, effsize,
        magnitude)

colnames(merged_data) <- c("variable", "n", "df", "statistic (H)",
    "adjusted p-value", "method", "eta-squared", "magnitude")

color.me <- which(merged_data$`adjusted p-value` < 0.05)

# Convert the R object into a html table
kbl(merged_data, digits = c(3, 3, 0, 3, 3), format = "html",
    booktabs = TRUE, caption = "Univariate Kruskal-Wallis") |>
    kable_classic(full_width = F, fixed_thead = T, bootstrap_options = c("hover",
        "responsive")) |>
    row_spec(color.me, bold = TRUE, background = "gray!6") |>
    row_spec(0, italic = TRUE)

Univariate Kruskal-Wallis
variable	n	df	statistic (H)	adjusted p-value	method	eta-squared	magnitude
Asfc	37	4	4.322	0.500	Kruskal-Wallis	0.010	small
HAsfc	37	4	10.121	0.106	Kruskal-Wallis	0.191	large
Sda	37	4	7.454	0.209	Kruskal-Wallis	0.108	moderate
Sdq	37	4	2.170	0.839	Kruskal-Wallis	-0.057	small
Sdv	37	4	6.537	0.255	Kruskal-Wallis	0.079	moderate
Sp	37	4	1.853	0.839	Kruskal-Wallis	-0.067	moderate
Ssk	37	4	23.072	0.001	Kruskal-Wallis	0.596	large
Sxp	37	4	10.294	0.106	Kruskal-Wallis	0.197	large
Sz	37	4	0.809	0.937	Kruskal-Wallis	-0.100	moderate
Tfv	37	4	12.638	0.073	Kruskal-Wallis	0.270	large
epLsar	37	4	8.370	0.174	Kruskal-Wallis	0.137	moderate

At p < 0.05, only Ssk was significantly different between groups, although other 3 variables – Hasfc, Sxp, and Tfv also have large effect sizes. In our previous study (Correia et al., 2021), Hasfc and Tfv were also not significantly different with large effect sizes .

Next, we conducted pairwise comparisons for the significantly different variable Ssk. We used the function dunn_test from the rstatix package, which respects the ranked data of Kruskall-Wallis, while controlling for multiple comparisons. We chose Dunn’s test over Games-Howells because the first addresses distribution issues, whereas the second is more appropriate when there is no homogeneity of variances, which is not as problematic here (see @ref(assumptions)). Finally, we calculated Hedge’s g² as a measure of effect size for each pairwise comparison.

pairwise <- dunn_test(dISO, Ssk ~ Group, p.adjust.method = "fdr",
    detailed = TRUE)


pairwise_effsize <- dISO |>
    rstatix::cohens_d(formula = Ssk ~ Group, paired = FALSE,
        hedges.correction = TRUE)

pairwise_effsize <- pairwise_effsize |>
    select(group1, group2, effsize, magnitude)

merged_data_effsize <- pairwise %>%
    left_join(pairwise_effsize, by = c("group1", "group2")) |>
    select(group1, group2, estimate, statistic, p.adj, effsize,
        magnitude)

colnames(merged_data_effsize) <- c("group1", "group2", "mean difference",
    "statistic (z)", "adjusted p-value", "hedge's g", "magnitude")

color.me <- which(merged_data_effsize$`adjusted p-value` < 0.05)

kbl(merged_data_effsize, digits = c(3, 3, 3, 3, 3, 3), format = "html",
    booktabs = TRUE, caption = "Pairwise comparisons for Ssk using Dunn's test") |>
    kable_classic(full_width = F, fixed_thead = T, bootstrap_options = c("hover",
        "responsive")) |>
    row_spec(color.me, bold = TRUE, background = "gray!6") |>
    row_spec(0, italic = TRUE)

Pairwise comparisons for Ssk using Dunn’s test
group1	group2	mean difference	statistic (z)	adjusted p-value	hedge’s g	magnitude
El Molo	Turkana	-9.048	-1.659	0.162	1.238	large
El Molo	Luhya (Webuye)	-9.333	-1.711	0.162	1.041	large
El Molo	Luhya (Port Vict.)	-16.000	-2.805	0.019	1.611	large
El Molo	Luo (Port Vict.)	-24.333	-4.626	0.000	2.329	large
Turkana	Luhya (Webuye)	-0.286	-0.049	0.961	-0.046	negligible
Turkana	Luhya (Port Vict.)	-6.952	-1.154	0.298	0.740	moderate
Turkana	Luo (Port Vict.)	-15.286	-2.729	0.019	1.588	large
Luhya (Webuye)	Luhya (Port Vict.)	-6.667	-1.107	0.298	0.718	moderate
Luhya (Webuye)	Luo (Port Vict.)	-15.000	-2.678	0.019	1.529	large
Luhya (Port Vict.)	Luo (Port Vict.)	-8.333	-1.426	0.220	0.808	large

# boxplot
pairwise <- pairwise |>
    add_xy_position(x = "Group")

ggplot(dISO, aes(y = Ssk, x = Group)) + geom_boxplot(aes(fill = Group),
    show.legend = FALSE) + scale_fill_manual(values = cbbPalette) +
    stat_pvalue_manual(pairwise, label = "p = {scales::pvalue(p.adj)}",
        tip.length = 0.01, size = 2.5, hide.ns = "p.adj", step.increase = 0.02) +
    my_theme

For the Ssk variable, we observe that 4 comparisons are significantly different, Interestingly, most comparisons have large or moderate effect sizes, suggesting that the sample size is too low to detect more significant comparisons. The one exception is the negligible effect size in the comparison between the Turkana and Luhya (Webuye).

Canonical discriminant analysis

Canonical discriminant analyses (CDA) find a linear combination of features that characterizes or separates two or more classes of objects or events. This approach makes important demands in terms of data distribution, and sample sizes. However, alternatives, such as regularized or quadratic discriminant analyses, are more adequate for predictive rather than descriptive analyses³. In addition, even when dealing with non-normal and high dimensional data, CDA axis retain discriminality between classes although cut-off points between classes may be incorrect. In other words, CDA on non-normal data cannot be generalized to other data (Hastie, Tibshirani, & Friedman, 2009, p. 110). Here, CDA was conducted using the candisc function from the candisc package, while rLDA was conducted using the Slda function from the HiDimDA package with the goal of informing the reliability of the CDA approach⁴.

ISOmanova <- lm(as.matrix(dISO[, 2:12]) ~ Group, dISO)
# summary(manova(ISOmanova), test = 'Wilks')

ISO.can <- candisc(ISOmanova)
ISO.can

## 
## Canonical Discriminant Analysis for Group:
## 
##    CanRsq Eigenvalue Difference Percent Cumulative
## 1 0.64629    1.82717    0.32596 41.2101     41.210
## 2 0.60019    1.50121    0.32596 33.8584     75.068
## 3 0.45336    0.82935    0.32596 18.7052     93.774
## 4 0.21634    0.27606    0.32596  6.2263    100.000
## 
## Test of H0: The canonical correlations in the 
## current row and all that follow are zero
## 
##   LR test stat approx F numDF denDF  Pr(> F)   
## 1      0.06058  2.11593    44 86.12 0.001524 **
## 2      0.17127  1.87334    30 68.19 0.016817 * 
## 3      0.42838  1.40763    18 48.00 0.171547   
## 4      0.78366  0.86268     8 25.00 0.559555   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Create a data frame for the scores
scores_df <- data.frame(
  Group = ISO.can$scores$Group,
  Can1 = ISO.can$scores$Can1,
  Can2 = ISO.can$scores$Can2
)

# Extract the relevant information from the candisc object
loadings_df <- data.frame(
  Variable = rownames(ISO.can$structure),
  Can1 = ISO.can$structure[, 1],
  Can2 = ISO.can$structure[, 2])

# Calculate the percentage of variance explained by each canonical variable
eigenvalues <- ISO.can$eigenvalues
total_variance <- sum(eigenvalues)
variance_explained <- eigenvalues / total_variance * 100

# Define axis titles including the percentage of variance explained
x_axis_title <- paste("Can1 (", round(variance_explained[1], 1), "%)", sep = "")
y_axis_title <- paste("Can2 (", round(variance_explained[2], 1), "%)", sep = "")

# Calculate centroids for each group
centroids <- aggregate(cbind(Can1, Can2) ~ Group, scores_df, mean)

# Create the biplot using ggplot2
ggplot() +
  geom_point(data = scores_df, aes(x = Can1, y = Can2, color = Group), size = 2) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  stat_ellipse(data = scores_df, aes(
    x = Can1, y = Can2, fill = Group), 
    geom = "polygon", level = 0.68, alpha = 0.2) +  # Add ellipses for group
  geom_text(data = centroids, aes(
    x = Can1, y = Can2, label = "+"),
    size = 6, color = cbbPalette) +  # Add "+" symbols for centroids
  geom_text_repel(data = loadings_df, aes(
    x = Can1 * 6, y = Can2 * 6, label = Variable),
    size = 3, hjust = 1, vjust = 0.5, box.padding = 0.5) +
  geom_segment(data = loadings_df, aes(x = 0, y = 0,
                                       xend = Can1*5.5, yend = Can2*5.5), 
               arrow = arrow(type = "open", length = unit(0.2, "cm")), 
               linewidth = 0.5, lineend = "round") +
  geom_text(data = centroids, aes(
    x = Can1, y = Can2, label = Group),
    size = 4, color = cbbPalette, fontface = 2,
    nudge_x = c(-1.0, 0.5, -0.5, 1.5, 1.0),
    nudge_y = c(-1.5, -1.2, 1.5, 1.2, -1.0)) +  # Add group labels
  scale_color_manual(values = cbbPalette) +
  scale_fill_manual(values = cbbPalette) +
  coord_fixed(ratio = 1) +
  my_theme + labs(x = x_axis_title, y = y_axis_title) +
  guides(colour = "none", fill = "none")

heplot(ISO.can, var.col = "black", fill = TRUE, fill.alpha = 0.1)

## Vector scale factor set to  6.435736

plot(ISO.can, which=1, col = cbbPalette, cex.axis = 0.7, cex.lab = 0.9,
     points.1d=TRUE, rev.axes = TRUE,pch = 15:19,
     var.lwd = 2, var.col = "black")

plot(ISO.can, which=2, col = cbbPalette, cex.axis = 0.7, cex.lab = 0.8,
     points.1d=TRUE, rev.axes = TRUE,pch = 15:19,
     var.lwd = 2, var.col = "black")

# Create a data frame for the scores
scores_df <- data.frame(
  Group = ISO.can$scores$Group,
  Can3 = ISO.can$scores$Can3,
  Can4 = ISO.can$scores$Can4
)

# Extract the relevant information from the candisc object
loadings_df <- data.frame(
  Variable = rownames(ISO.can$structure),
  Can3 = ISO.can$structure[, 3],
  Can4 = ISO.can$structure[, 4])

# Define axis titles including the percentage of variance explained
x_axis_title <- paste("Can3 (", round(variance_explained[3], 1), "%)", sep = "")
y_axis_title <- paste("Can4 (", round(variance_explained[4], 1), "%)", sep = "")

# Calculate centroids for each group
centroids <- aggregate(cbind(Can3, Can4) ~ Group, scores_df, mean)

# Create the biplot using ggplot2
ggplot() +
  geom_point(data = scores_df, aes(x = Can3, y = Can4, color = Group), size = 2) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  stat_ellipse(data = scores_df, aes(
    x = Can3, y = Can4, fill = Group), 
    geom = "polygon", level = 0.68, alpha = 0.2) +  # Add ellipses for group
  geom_text(data = centroids, aes(
    x = Can3, y = Can4, label = "+"),
    size = 6, color = cbbPalette) +  # Add "+" symbols for centroids
  geom_text_repel(data = loadings_df, aes(
    x = Can3 * 6, y = Can4 * 6, label = Variable),
    size = 3, hjust = 1, vjust = 0.5, box.padding = 0.5) +
  geom_segment(data = loadings_df, aes(x = 0, y = 0,
                                       xend = Can3*5.5, yend = Can4*5.5), 
               arrow = arrow(type = "open", length = unit(0.2, "cm")), 
               linewidth = 0.5, lineend = "round") +
  geom_text(data = centroids, aes(
    x = Can3, y = Can4, label = Group),
    size = 4, color = cbbPalette, fontface = 2,
    nudge_x = c(1.0, -0.5, 0.0, -1.0, 1.5),
    nudge_y = c(1.5, -1.0, -1.0, 1.2, -0.5)) +  # Add group labels
  scale_color_manual(values = cbbPalette) +
  scale_fill_manual(values = cbbPalette) +
  coord_fixed(ratio = 1) +
  my_theme + labs(x = x_axis_title, y = y_axis_title) +
  guides(colour = "none", fill = "none")

heplots::heplot(ISO.can, variables = 3:4, var.col = "black",
       fill = TRUE, fill.alpha = 0.1)

## Vector scale factor set to  3.034421

The CDA finds 4 canonical vectors, but only two have group means that are significantly different. Together, the 2 explain 75% of the variation observed.

The HE plot represent sums-of-squares-and-products matrices for linear hypotheses (H) and for error (E). In the default scaling, such plots have the visual property that a given effect in the multivariate linear model is significant (by Roy’s maximum root test) if the H ellipse projects anywhere outside the E ellipse. The size of the H ellipse relative to that of the E ellipse is an indication of the magnitude of the multivariate effect for group. The further away variable centroids are from the “Error” circle, the stronger the association between groups and variables (Friendly & Sigal, 2017).

In the biplot, each data point is represented by its score on the discriminant axis. The variable vectors represent the structure coefficients, i.e. the angle between the vector and the canonical axis represent the correlation of variables with canonical scores, while the vector’s relative length reflect the sum of the squared correlations and hence represent the variable’s contribution to the discrimination between groups (Field et al., 2012, p. 738).

scaled_ISO <- dISO |>
    mutate(across(where(is.numeric), scale))

Slda <- Slda(data = as.matrix(scaled_ISO[, 2:12]), grouping = scaled_ISO$Group,
    StddzData = TRUE, ladfun = "classification")
Slda

## Call:
## Slda.default(data = as.matrix(scaled_ISO[, 2:12]), grouping = scaled_ISO$Group, 
##     StddzData = TRUE, ladfun = "classification")
## 
## Prior probabilities of groups:
##            El Molo            Turkana     Luhya (Webuye) Luhya (Port Vict.) 
##          0.2432432          0.1891892          0.1891892          0.1621622 
##   Luo (Port Vict.) 
##          0.2162162 
## 
## Group means:
##                           Tfv        Ssk        Sxp          Sda
## El Molo             0.6393859  0.8847521  0.1170062 -0.102954091
## Turkana            -0.6359069  0.2133864 -0.5262379 -0.623304691
## Luhya (Webuye)      0.2228006  0.2406359 -0.1939716 -0.166426744
## Luhya (Port Vict.)  0.4264057 -0.3306241 -0.5939760  1.073557817
## Luo (Port Vict.)   -0.6776453 -1.1446475  0.9440333  0.001669996
## 
## Coefficients of linear discriminants:
##             LD1        LD2        LD3        LD4
## Tfv -0.39348495 -0.4456465 0.65633995  0.8074253
## Ssk -1.22926872  0.1839274 0.04278755 -0.7178051
## Sxp -0.04260694  0.8299720 0.81804596 -0.2981733
## Sda  0.53782314 -0.7345409 0.17354765 -0.7349617
## Proportion of trace:
##         LD1         LD2         LD3         LD4 
## 0.554380573 0.317499098 0.123729033 0.004391297 
## 
## Variables kept in discriminant rule:
## [1] "Tfv" "Ssk" "Sxp" "Sda"
## Number of variables kept: 4

‘Slda’ is a shrunken discriminant analysis that finds the coefficients of a linear discriminant rule based on Fisher and Sun’s (2011) estimate and generalizations of Ledoit and Wolf’s (2004) optimal shrunken covariance matrix (Silva, 2015). Here, it finds 4 discriminants just as the canonical approach, and keeps 4 variables in the discriminant rule, Tfv, Ssk, Sxp, Sda, which are also the variables with higher impact before. Thus, this approach confirms the discriminatory and descriptive power of the canonical approach. As scores are not produced in these approaches, no further analyses is possible.

References

Anderson, M. J. (2017). Permutational Multivariate Analysis of Variance (PERMANOVA). John Wiley & Sons, Ltd. https://doi.org/10.1002/9781118445112.stat07841

Carlson, D. (2017). Quantitative Methods in Archaeology Using R. Cambridge, UK: Cambridge University Press.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (Second Edi). Hillsdale, New Jersey, USA: Lawrence Erlbaum Associates.

Correia, M. A., Foley, R., & Lahr, M. M. (2021). Applying dental microwear texture analysis to the living: Challenges and prospects. American Journal of Physical Anthropology, 174(3), 542–554. https://doi.org/10.1002/ajpa.24133

Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using r. London, United Kingdom: SAGE Publications Ltd.

Filzmoser, P., & Gschwandter, M. (n.d.). Package ’mvoutlier’: Multivariate outlier detection based on robust methods. Retrieved from https://cran.r-project.org/web/packages/mvoutlier/mvoutlier.pdf

Friendly, M., & Sigal, M. (2017). Graphical methods for multivariate linear models in psychological research: An R tutorial. The Quantitative Methods for Psychology, 13(1), 20–45. https://doi.org/10.20982/tqmp.13.1.p020

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning. New York, NY: Springer New York. https://doi.org/10.1007/978-0-387-84858-7

Herve, M. (2023). Package ’RVAideMemoire’. Retrieved from https://cran.r-project.org/web/packages/RVAideMemoire/RVAideMemoire.pdf

ISO/TC 213. (2021). ISO 25178-2:2021 geometrical product specifications (GPS) surface texture: Areal part 2: Terms, definitions and surface texture parameters. Geneva, Switzerland: International Organization for Standardization. Retrieved from https://www.iso.org/standard/74591.html

Jarek. (2012). Mvnormtest.pdf. Retrieved from https://cran.r-project.org/web/packages/mvnormtest/mvnormtest.pdf

Kassambara, A. (2023). Rstatix: Pipe-friendly framework for basic statistical tests. Retrieved from https://cran.r-project.org/web/packages/rstatix/rstatix.pdf

Silva, A. (2015). Package HiDimDA’ - high dimensional discriminant analysis. Retrieved from https://cran.r-project.org/web/packages/HiDimDA/HiDimDA.pdf

Ungar, P., Livengood, S., & Crittenden, A. (2019). Dental microwear of living hadza foragers. American Journal of Physical Anthropology, 169(2), 356–367. https://doi.org/10.1002/ajpa.23836

## R version 4.3.1 (2023-06-16 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8 
## [2] LC_CTYPE=English_United Kingdom.utf8   
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.utf8    
## 
## time zone: Europe/Lisbon
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] RVAideMemoire_0.9-83-7 vegan_2.6-4            lattice_0.21-8        
##  [4] permute_0.9-7          HiDimDA_0.2-4          ggpubr_0.6.0          
##  [7] lemon_0.4.7            formatR_1.14           kableExtra_1.3.4.9000 
## [10] candisc_0.8-6          heplots_1.6.0          broom_1.0.5           
## [13] rstatix_0.7.2          rrcov_1.7-4            robustbase_0.99-1     
## [16] mvnormtest_0.1-9       mvoutlier_2.1.1        sgeostat_1.0-27       
## [19] car_3.1-2              carData_3.0-5          corrplot_0.92         
## [22] ggspatial_1.1.9        ggrepel_0.9.4          gt_0.10.0             
## [25] rnaturalearth_1.0.1    lubridate_1.9.3        forcats_1.0.0         
## [28] stringr_1.5.1          dplyr_1.1.3            purrr_1.0.2           
## [31] readr_2.1.4            tidyr_1.3.0            tibble_3.2.1          
## [34] ggplot2_3.4.4          tidyverse_2.0.0       
## 
## loaded via a namespace (and not attached):
##  [1] rstudioapi_0.15.0       jsonlite_1.8.8          datawizard_0.9.1       
##  [4] magrittr_2.0.3          TH.data_1.1-2           estimability_1.4.1     
##  [7] farver_2.1.1            rmarkdown_2.25          vctrs_0.6.4            
## [10] base64enc_0.1-3         terra_1.7-65            effectsize_0.8.6       
## [13] webshot_0.5.5           htmltools_0.5.7         sass_0.4.8             
## [16] KernSmooth_2.23-21      bslib_0.6.1             htmlwidgets_1.6.4      
## [19] plyr_1.8.9              sandwich_3.1-0          emmeans_1.9.0          
## [22] zoo_1.8-12              cachem_1.0.8            lifecycle_1.0.4        
## [25] pkgconfig_2.0.3         Matrix_1.6-4            R6_2.5.1               
## [28] fastmap_1.1.1           digest_0.6.33           colorspace_2.1-0       
## [31] labeling_0.4.3          fansi_1.0.5             timechange_0.2.0       
## [34] httr_1.4.7              abind_1.4-5             mgcv_1.8-42            
## [37] compiler_4.3.1          proxy_0.4-27            withr_2.5.2            
## [40] backports_1.4.1         DBI_1.2.0               highr_0.10             
## [43] ggsignif_0.6.4          MASS_7.3-60             classInt_0.4-10        
## [46] tools_4.3.1             units_0.8-4             glue_1.6.2             
## [49] rnaturalearthdata_0.1.0 nlme_3.1-162            grid_4.3.1             
## [52] sf_1.0-14               cluster_2.1.4           generics_0.1.3         
## [55] gtable_0.3.4            tzdb_0.4.0              class_7.3-22           
## [58] hms_1.1.3               sp_2.1-2                xml2_1.3.6             
## [61] utf8_1.2.4              pillar_1.9.0            splines_4.3.1          
## [64] survival_3.5-5          tidyselect_1.2.0        knitr_1.45             
## [67] gridExtra_2.3           svglite_2.1.3           stats4_4.3.1           
## [70] xfun_0.41               DEoptimR_1.1-3          stringi_1.8.3          
## [73] yaml_2.3.8              evaluate_0.23           codetools_0.2-19       
## [76] cli_3.6.1               xtable_1.8-4            parameters_0.21.3      
## [79] systemfonts_1.0.5       munsell_0.5.0           jquerylib_0.1.4        
## [82] Rcpp_1.0.11             parallel_4.3.1          rgl_1.2.8              
## [85] bayestestR_0.13.1       viridisLite_0.4.2       mvtnorm_1.2-4          
## [88] scales_1.3.0            e1071_1.7-13            pcaPP_2.0-4            
## [91] insight_0.19.7          rlang_1.1.1             rvest_1.0.3            
## [94] multcomp_1.4-25

warning about bindiwdth but as binwidth depends on scale and each variable has different scales, setting binwidth produces graph with different sized dots; a possible solution wouldbe to use ‘binwidth = unit(10, “npc”)’ as suggested here: https://github.com/mjskay/ggdist/issues/53 but this gives error of !object is not coercible to a unit, which I think is related to recent changes to unit()?↩︎
Hedge’s g was calculated using the rstatix::cohens_d function, which includes an hedges.correction. According to the package (Kassambara, 2023), this correction should be used when n is small and is a “logical indicating whether to apply the Hedges correction by multiplying the usual value of Cohen’s d by (N-3)/(N-2.25) (for unpaired t-test) and by (n1-2)/(n1-1.25) for paired t-test; where N is the total size of the two groups being compared (N = n1 + n2)”. Now, the effsize::cohen.d function has the same syntax, stating that applying an hedges.correction computes the Hedge’s g statistic. However, this second function is difficult to apply to tidy data, and hence, we opted to use the first option.↩︎
Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher’s linear discriminant, a method used in statistics and other fields. The terms Fisher’s linear discriminant and LDA are often used interchangeably, although Fisher’s original article actually describes a slightly different discriminant, which does not make some of the assumptions of LDA such as normally distributed classes or equal class covariances, although statisticians disagree on the subject. Discriminant analysis (DA) encompasses procedures for classifying observations into groups (predictive discriminant analysis, PDA) and describing the relative importance of variables for distinguishing between groups (descriptive or canonical discriminant analysis, DDA). When following up MANOVA, DA falls into the first category. However, discriminant analysis can make substantial demands on sample size. A rule of thumb states that the data should have at least five more observations in each group than the number of variables (Carlson, 2017, p. 245). If dealing with high dimensional data, one should instead conduct regularised discriminant analyses (rLDA or RDA) or quadratic discriminant analyses (QDA). QDA is a variant of LDA in which an individual covariance matrix is estimated for every class of observations, which accommodates more flexible decision boundaries, but whose number of to-estimate parameters increase faster than those of LDA, whereas RDA is a compromise between LDA and QDA that regularizes the individual class covariance matrices, i.e. restrictions are applied to the estimated parameters. However, RDA and QDA do not produce scores, because they partition the space “by ellipsoid like structures based on the estimation of the inner-class covariance matrices”. Considering that scores are essential to LDA interpretation following MANOVA, we opted to commit to the simpler LDA.↩︎
Note that to conduct rLDA we had to standardise the data (by centering and scaling), a different transformation to the rank-transformation used previously.↩︎

ISO 25178 variables and multivariate analyses complement SSFA variables in an in vivo human dental microwear study

Supplementary material

Maria Ana Correia

18/01/2024