Schistosomiasis Cluster Randomized Trial Analysis

Author

Rob Daniels

Published

December 2, 2025

Description

An end-to-end reproducible analysis of a schistosomiasis cluster randomized trial (CRT), from data ingestion through modeling and reporting.

This project presents an independent analysis of publicly available field data from a schistosomiasis cluster randomized trial (CRT).

The analysis emphasizes transparent statistical reasoning, modern workflow tools, and high-quality communication.

View the complete project structure, analysis files, and reproducible workflow on GitHub:
schisto-cluster-project

Tools and workflow

  • Version control & reproducibility: GitHub, Quarto, Markdown, bibliographic integration, renv, and structured workflows
  • Statistical programming: R
  • Visualization & reporting: manuscript-quality tables, HTML, ggplot2
  • Statistical methods: linear model, diagnostics, generalized estimating equations (GEE), permutation test

Many organizational ideas were inspired by the Proctor Foundation Data Science Handbook and R for Data Science (2e).

For questions or feedback, feel free to reach out at rdanielsstat@gmail.com.

Read and process the data

This study is comprised of measurements from 3,663 pre-school age children (PSAC) from 30 villages in Western Kenya. Two blood tests and one stool test were used to assess infection with Schistosoma spp., the parasites that cause schistosomiasis. Data were collected at three different times: in 2012 (before treatment began), and again in 2013 and 2014 (after treatment started). The study was designed as a cluster randomized trial, meaning that entire villages, not individual children, were randomly assigned to one of two treatment arms. In 15 villages, treatment was provided only through schools (SBT), while in the other 15 villages, treatment was offered to the whole community (CWT).

Read, merge, and show data set details.

Code
# Read in data
mbita_psac <- read.csv(paste0(data_path, "/mbita_schisto.csv"))
mbita_spatial <- read.csv(paste0(data_path, "/mbita_spatial.csv"))

# Merge
mbita_schisto <- mbita_psac %>%
  left_join(mbita_spatial, by = "vid")

# Data set details
str(mbita_schisto)
'data.frame':   3663 obs. of  16 variables:
 $ year         : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
 $ vid          : int  1 1 1 1 1 1 1 1 1 1 ...
 $ arm          : chr  "CWT" "CWT" "CWT" "CWT" ...
 $ pid          : int  1 2 3 4 5 6 7 8 9 10 ...
 $ agey         : num  1.81 2.17 2.95 2.63 2.54 ...
 $ sex          : chr  "male" "female" "female" "female" ...
 $ sea          : num  3136 25605 28549 87 26537 ...
 $ sm25         : num  -2 1344 520 -4 1 ...
 $ sm_epg       : int  12 12 0 36 24 132 0 48 132 0 ...
 $ sea_pos      : int  1 1 1 0 1 1 1 1 1 1 ...
 $ sm25_pos     : int  0 1 1 0 0 0 0 0 0 0 ...
 $ kk_pos       : int  1 1 0 1 1 1 0 1 1 0 ...
 $ elev         : int  1189 1189 1189 1189 1189 1189 1189 1189 1189 1189 ...
 $ tmin         : num  161 161 161 161 161 ...
 $ prec         : num  82.7 82.7 82.7 82.7 82.7 ...
 $ dist_victoria: num  568 568 568 568 568 ...

Define factors and derived variables. Run summary descriptives for full data set.

Code
# Additional processing
mbita_schisto <- mbita_schisto %>%
  mutate(
    pid  = as.character(pid),
    arm  = factor(arm, levels = c("SBT", "CWT")),
    # SBT as reference group
    sex  = factor(sex, levels = c("male", "female")),
    male = ifelse(sex == "male", 1, 0),
    # either sea or sm25 positive:
    sea_or_sm25_pos = ifelse(sea_pos == 1 |
                               sm25_pos == 1, 1, 0),
    agecat = cut(
      agey,
      breaks = c(0, 1, 2, 3, 4, 5, 6),
      labels = c("<1 year", "1 year", "2 years", "3 years", "4 years", 
                 "5 years")
    ),
    # create indicator for baseline/post intervention
    post = ifelse(year %in% c(2013, 2014), 1, 0)
  )

# View data
# View(mbita_schisto)

# Full data set summary
summary(mbita_schisto)
      year           vid         arm           pid                 agey       
 Min.   :2012   Min.   : 1.00   SBT:1837   Length:3663        Min.   :0.2108  
 1st Qu.:2012   1st Qu.: 7.00   CWT:1826   Class :character   1st Qu.:2.4312  
 Median :2013   Median :13.00              Mode  :character   Median :3.4990  
 Mean   :2013   Mean   :14.27                                 Mean   :3.3774  
 3rd Qu.:2014   3rd Qu.:22.00                                 3rd Qu.:4.4435  
 Max.   :2014   Max.   :30.00                                 Max.   :5.5524  
                                                                              
     sex            sea             sm25              sm_epg       
 male  :1759   Min.   :    4   Min.   :  -12.00   Min.   :   0.00  
 female:1904   1st Qu.:  101   1st Qu.:    1.15   1st Qu.:   0.00  
               Median :  387   Median :    6.00   Median :   0.00  
               Mean   :11982   Mean   :  129.26   Mean   :  42.01  
               3rd Qu.:27337   3rd Qu.:   19.00   3rd Qu.:  12.00  
               Max.   :32685   Max.   :18816.00   Max.   :5004.00  
                                                  NA's   :237      
    sea_pos          sm25_pos          kk_pos            elev     
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :1137  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1151  
 Median :0.0000   Median :0.0000   Median :0.0000   Median :1158  
 Mean   :0.4775   Mean   :0.1701   Mean   :0.2536   Mean   :1170  
 3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1178  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1338  
                                   NA's   :237                    
      tmin            prec        dist_victoria          male       
 Min.   :148.9   Min.   : 81.17   Min.   :  32.28   Min.   :0.0000  
 1st Qu.:157.4   1st Qu.: 83.42   1st Qu.: 234.21   1st Qu.:0.0000  
 Median :158.8   Median : 84.92   Median : 712.82   Median :0.0000  
 Mean   :158.6   Mean   : 88.48   Mean   :1011.26   Mean   :0.4802  
 3rd Qu.:160.8   3rd Qu.: 94.50   3rd Qu.:1598.12   3rd Qu.:1.0000  
 Max.   :161.4   Max.   :101.42   Max.   :4718.11   Max.   :1.0000  
                                                                    
 sea_or_sm25_pos      agecat          post       
 Min.   :0.0000   <1 year:  51   Min.   :0.0000  
 1st Qu.:0.0000   1 year : 570   1st Qu.:0.0000  
 Median :1.0000   2 years: 780   Median :1.0000  
 Mean   :0.5089   3 years: 890   Mean   :0.6942  
 3rd Qu.:1.0000   4 years:1214   3rd Qu.:1.0000  
 Max.   :1.0000   5 years: 158   Max.   :1.0000  
                                                 

Describe the data

The following table shows the number of pre-school age children (PSAC) per year and study arm. The summary statistics are based on counts of PSAC per village (cluster).

Code
mbita_schisto %>%
  group_by(year, arm, vid) %>%
  summarize(n_psac = n_distinct(pid), .groups = "drop") %>%
  group_by(year, arm) %>%
  summarize(
    n_villages  = n(), 
    total_psac  = sum(n_psac), 
    mean_psac   = round(mean(n_psac), 1), 
    sd_psac     = round(sd(n_psac), 1), 
    min_psac    = min(n_psac), 
    q1_psac     = quantile(n_psac, 0.25), 
    median_psac = median(n_psac), 
    q3_psac     = quantile(n_psac, 0.75), 
    max_psac    = max(n_psac),
    .groups = "drop"
  ) %>%
  knitr::kable(
    col.names = c("Year", "Arm", "No. villages", "Total PSAC", 
                  "Mean PSAC", "SD PSAC", "Min PSAC", "Q1 PSAC",
                  "Median PSAC", "Q3 PSAC", "Max PSAC"))
Year Arm No. villages Total PSAC Mean PSAC SD PSAC Min PSAC Q1 PSAC Median PSAC Q3 PSAC Max PSAC
2012 SBT 15 582 38.8 15.2 16 27.0 40 47.5 66
2012 CWT 15 538 35.9 16.5 13 23.0 37 48.5 63
2013 SBT 15 624 41.6 15.5 25 32.0 38 47.5 73
2013 CWT 15 563 37.5 18.6 18 24.0 33 50.5 75
2014 SBT 15 631 42.1 18.0 9 28.5 43 53.5 75
2014 CWT 15 725 48.3 16.7 22 37.0 55 63.0 68

On average, there were 36–48 PSAC per village at each survey time point, with a minimum of 9 observed at one time point and a maximum of 75 observed at two time points.

The table below shows the percentage of missing values for blood- and stool-based measures (SEA, Sm25, and Sm-EPG) by year and study arm, summarized across clusters. There are no missing data for the blood-based measures; missing data occur only for the stool-based measure (Sm-EPG), which is summarized below.

Code
# Aggregates missing values per year, arm, and cluster (vid), then summarizes
# across clusters
make_na_summary <- function(data, var) {
  var <- enquo(var)
  
  data %>%
    group_by(year, arm, vid) %>%
    summarize(
      n_psac           = n_distinct(pid),
      measurements_var = sum(!is.na(!!var)),
      na_var           = sum(is.na(!!var)),
      pct_na_var       = 100 * na_var / n_psac,
      .groups = "drop"
    ) %>%
    group_by(year, arm) %>%
    summarize(
      n_villages     = n(),
      n_psac         = sum(n_psac),
      n_measurements = sum(measurements_var),
      na_total       = sum(na_var),
      mean_pct_na    = round(mean(pct_na_var), 1),
      sd_pct_na      = round(sd(pct_na_var), 1),
      min_pct_na     = round(min(pct_na_var), 1),
      q1_pct_na      = round(quantile(pct_na_var, 0.25), 1),
      median_pct_na  = round(median(pct_na_var), 1),
      q3_pct_na      = round(quantile(pct_na_var, 0.75), 1),
      max_pct_na     = round(max(pct_na_var), 1),
      .groups = "drop"
    ) %>%
    mutate(variable = quo_name(var), .before = 1)
}

make_na_summary(mbita_schisto, sm_epg) %>%
  knitr::kable(
    col.names = c("Variable", "Year", "Arm", "No. villages", "Total PSAC", 
                  "No. measurements", "No. NA (total)", "Mean % NA", "SD % NA",
                  "Min % NA", "Q1 % NA", "Median % NA", "Q3 % NA", "Max % NA")
  )
Variable Year Arm No. villages Total PSAC No. measurements No. NA (total) Mean % NA SD % NA Min % NA Q1 % NA Median % NA Q3 % NA Max % NA
sm_epg 2012 SBT 15 582 554 28 4.1 6.2 0.0 0.0 0.0 5.6 20.4
sm_epg 2012 CWT 15 538 518 20 4.8 7.4 0.0 0.0 0.0 6.4 25.0
sm_epg 2013 SBT 15 624 616 8 1.5 3.5 0.0 0.0 0.0 1.5 13.2
sm_epg 2013 CWT 15 563 556 7 1.5 3.0 0.0 0.0 0.0 1.6 11.1
sm_epg 2014 SBT 15 631 555 76 11.9 4.4 1.9 9.8 11.6 14.4 19.5
sm_epg 2014 CWT 15 725 627 98 13.9 7.1 5.5 9.3 11.4 16.9 29.2

In 2014, the community-wide treatment (CWT) arm had 13.5% missing values (98 of 725 total), with one village showing as much as 29% missing.

The following tables show the counts of SEA and Sm-EPG measurements, respectively, along with their missing values, for each year.

Code
# Summarizes counts of measurements and missing values per year
make_meas_summary <- function(data, var) {
  var <- enquo(var)
  
  data %>%
    group_by(year) %>%
    summarize(
      n_psac         = n_distinct(pid),
      n_measurements = sum(!is.na(!!var)),
      na_var         = sum(is.na(!!var)),
      pct_na_var     = round(100 * na_var / n_psac, 1),
      .groups = "drop"
    ) %>%
    mutate(
      n_psac         = format(n_psac, big.mark = ","),
      n_measurements = format(n_measurements, big.mark = ","),
      variable = quo_name(var),
      .before = 1
    )
}

make_meas_summary(mbita_schisto, sea) %>%
  knitr::kable(
    col.names = c("Variable", "Year", "No. PSAC", "No. measurements", "No. NA", 
                  "% NA")
  )
Variable Year No. PSAC No. measurements No. NA % NA
sea 2012 1,120 1,120 0 0
sea 2013 1,187 1,187 0 0
sea 2014 1,356 1,356 0 0
Code
make_meas_summary(mbita_schisto, sm_epg) %>%
  knitr::kable(
    col.names = c("Variable", "Year", "No. PSAC", "No. measurements", "No. NA", 
                  "% NA")
  )
Variable Year No. PSAC No. measurements No. NA % NA
sm_epg 2012 1,120 1,072 48 4.3
sm_epg 2013 1,187 1,172 15 1.3
sm_epg 2014 1,356 1,182 174 12.8

For the blood-based measurements, there were 1,120, 1,187, and 1,356 pre-school age children (PSAC) at baseline, 2013, and 2014, respectively, with no missing data. For the stool-based measurement, there were 1,072, 1,172, and 1,182 PSAC at baseline, 2013, and 2014, respectively, with 4.3%, 1.3%, and 12.8% of values missing at each time point.

The following table shows prevalence (SEA, Sm25, and KK positivity) by age category for all pre-school age children (PSAC).

Code
mbita_schisto %>%
  group_by(agecat) %>%
  summarize(n_sea_or_sm25 = sum(!is.na(sea_or_sm25_pos)),
            n_sea_or_sm25_pos = sum(sea_or_sm25_pos),
            n_kk = sum(!is.na(kk_pos)),
            n_kk_pos = sum(kk_pos, na.rm = TRUE),
            .groups = "drop") %>%
  mutate(sea_or_sm25_prev = 
           sprintf("%1.1f", n_sea_or_sm25_pos / n_sea_or_sm25 * 100),
         kk_prev =  sprintf("%1.1f", n_kk_pos / n_kk * 100),
         group = as.character(agecat)) %>%
  select(group, n_sea_or_sm25, n_sea_or_sm25_pos, sea_or_sm25_prev,
         n_kk, n_kk_pos, kk_prev) %>%
  kable(align = c("l", "r", "r", "r", "r", "r", "r"),
        col.names = c("Group", "No. SEA/Sm25 measurements", 
                      "No. SEA/Sm25 positive", "SEA/Sm25 prevalence (%)", 
                      "No. KK measurements", "No. KK positive", 
                      "KK prevalence (%)")) %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, extra_css = "white-space: nowrap;")
Group No. SEA/Sm25 measurements No. SEA/Sm25 positive SEA/Sm25 prevalence (%) No. KK measurements No. KK positive KK prevalence (%)
<1 year 51 13 25.5 47 5 10.6
1 year 570 162 28.4 503 68 13.5
2 years 780 324 41.5 714 150 21.0
3 years 890 502 56.4 844 211 25.0
4 years 1214 759 62.5 1166 384 32.9
5 years 158 104 65.8 152 51 33.6

There appears to be a strong relationship between age and prevalence.

The following tables shows summary statistics for the quantitative variable SEA at year, arm level.

Code
# Summarizes continuous blood- and stool-based measurements per year and arm,
# pooled across PSAC
make_cont_summary <- function(data, var) {
  var <- enquo(var)
  
  data %>%
    group_by(year, arm) %>%
    summarize(
      n_psac     = n_distinct(pid),
      n_measures = sum(!is.na(!!var)),
      mean       = round(mean(!!var, na.rm = TRUE), 1),
      sd         = round(sd(!!var, na.rm = TRUE), 1),
      min        = round(min(!!var, na.rm = TRUE), 1),
      q1         = round(quantile(!!var, 0.25, na.rm = TRUE), 1),
      median     = round(median(!!var, na.rm = TRUE), 1),
      q3         = round(quantile(!!var, 0.75, na.rm = TRUE), 1),
      max        = round(max(!!var, na.rm = TRUE), 1),
      .groups = "drop"
    ) %>%
    mutate(variable = quo_name(var), .before = 1)
}
make_cont_summary(mbita_schisto, sea) %>%
  knitr::kable(
    col.names = c("Variable", "Year", "Arm", "No. PSAC", "No. measurements",
                  "Mean", "SD", "Min", "Q1", "Median", "Q3", "Max")
  )
Variable Year Arm No. PSAC No. measurements Mean SD Min Q1 Median Q3 Max
sea 2012 SBT 582 582 13718.0 13584.9 4 113.0 8620 27764.2 32685
sea 2012 CWT 538 538 10524.6 12978.7 24 87.0 233 25505.2 32646
sea 2013 SBT 624 624 13134.8 13605.4 17 109.8 4556 27811.8 32625
sea 2013 CWT 563 563 12189.0 13676.2 6 104.5 384 27788.0 32568
sea 2014 SBT 631 631 12710.3 13730.4 7 115.0 762 28249.0 32602
sea 2014 CWT 725 725 9881.8 13144.6 18 92.0 173 25486.0 32502


Construct a histogram for these data.

Code
hist(mbita_schisto$sea, main = "SEA distribution", xlab = "SEA")

All three (Sm25 and Sm-EPG not shown) of the quantitative variables are highly skewed, which could affect assumptions of normality in downstream analyses. Transformations such as log or square root may be necessary, and appropriate statistical methods that account for skewed distributions should be considered when modeling or testing differences.

Finally, the following table shows the prevalence of infection as measured by positive SEA or Sm25 reactivity.

Code
# Calculates pooled prevalence per year and arm
make_prev_summary <- function(data, var) {
  var <- enquo(var)
  
  data %>%
    group_by(year, arm) %>%
    summarize(
      n_psac = n_distinct(pid),
      n_measures = sum(!is.na(!!var)),
      n_positive = sum(!!var, na.rm = TRUE),
      prevalence = round(100 * n_positive / n_measures, 1),
      .groups = "drop"
    ) %>%
    mutate(variable = quo_name(var), .before = 1)
}
make_prev_summary(mbita_schisto, sea_or_sm25_pos) %>%
  knitr::kable(
    col.names = c("Variable", "Year", "Arm", "No. PSAC", "No. measurements",
                  "No. positive", "Prevalence (%)")
  )
Variable Year Arm No. PSAC No. measurements No. positive Prevalence (%)
sea_or_sm25_pos 2012 SBT 582 582 337 57.9
sea_or_sm25_pos 2012 CWT 538 538 241 44.8
sea_or_sm25_pos 2013 SBT 624 624 347 55.6
sea_or_sm25_pos 2013 CWT 563 563 284 50.4
sea_or_sm25_pos 2014 SBT 631 631 343 54.4
sea_or_sm25_pos 2014 CWT 725 725 312 43.0

Summarize baseline characteristics

The following code constructs a manuscript style Table 1 for assessing individual- and cluster-level baseline characteristics. These results are helpful in assessing how well randomization worked in the study. The full code is not shown here for brevity.

Code
source(paste0(here::here(), "/R/3-baseline-table1.R"))

Table 1. Individual- and cluster-level baseline characteristics and infection prevalence by treatment arm.

Characteristic SBT CWT
Individual-level
PSAC, No. (%) 582 (52) 538 (48)
Sex, No. (%)
    Female 305 (52.4) 276 (51.3)
    Male 277 (47.6) 262 (48.7)
Age group, No. (%)
    <1 year 13 (2.2) 4 (0.7)
    1 year 77 (13.2) 77 (14.3)
    2 years 123 (21.1) 102 (19)
    3 years 141 (24.2) 133 (24.7)
    4 years 187 (32.1) 167 (31)
    5 years 41 (7) 55 (10.2)
Age, mean ± SD, y 3.5 ± 1.2 3.5 ± 1.2
Blood-based measures, No. (%)
    SEA positive 326 (56) 231 (42.9)
    Sm25 positive 85 (14.6) 70 (13)
    SEA or Sm25 positive 337 (57.9) 241 (44.8)
Stool-based measure, No. (%) n = 554 n = 518
    KK positive 173 (31.2) 124 (23.9)
Cluster-level
PSAC per village, mean ± SD 38.8 ± 15.2 35.9 ± 16.5
Percent male per village, mean ± SD 46.3 ± 8.7 47.6 ± 8.7
Mean village age, mean ± SD 3.5 ± 0.3 3.5 ± 0.3
Blood-based measures, % per village, mean ± SD
    SEA positive 51.1 ± 25.3 41.4 ± 29.3
    Sm25 positive 14.4 ± 12.6 13.5 ± 12.1
    SEA or Sm25 positive 53.3 ± 24.5 43.5 ± 28.7
Stool-based measure, % per village, mean ± SD
    KK positive 29.0 ± 18.7 24.0 ± 20.5


At baseline, the two study arms were generally well balanced. Both arms had similar average cluster sizes (35.9 vs. 38.8 children per cluster) with comparable variability. Demographic characteristics were nearly identical: the mean age of children was 3.5 years in both arms, with narrow standard deviations, and the proportion of male children was also similar (47.6% vs. 46.3%). These observations indicate excellent demographic balance between the arms.

In terms of infection measures, the mean cluster-level SEA prevalence was somewhat lower in the CWT arm (41.4%) compared to the SBT arm (51.1%), though the large standard deviations (29.3 vs. 25.3) suggest substantial cluster-to-cluster variability and overlap. Mean cluster-level Sm25 prevalence was nearly identical (13.5% vs. 14.4%), and mean cluster-level Kato-Katz prevalence was slightly lower in the CWT arm (24%) than in the SBT arm (29%), again with wide variability across clusters. Overall, while there are some modest differences in baseline infection prevalence, the groups are well balanced on demographics and cluster size.

Reflecting on the study design, this was a cluster randomized trial with 30 villages randomized 1:1 into the two arms. With only 15 clusters per arm, some chance imbalances are more likely than in individually randomized trials with thousands of participants. The observed balance in demographic characteristics and cluster sizes suggests that the randomization performed as intended, while the modest differences in SEA and Kato-Katz prevalence may reflect natural variation between communities or the limited number of clusters.

In conclusion, the arms appear well balanced with respect to demographics and cluster sizes, with only minor differences in baseline infection prevalence that may not be unexpected given the small number of clusters per arm and natural variability in infection rates among villages.

Compare seroprevalence between groups

Calculate observed pooled sero-prevalence for 2013 and 2014 combined at the cluster, arm level.

Code
# Calculate pooled sero-prevalence for 2013 and 2014 combined at the cluster,
# arm level
cluster_post <- mbita_schisto %>%
  filter(post == 1) %>%
  group_by(vid, arm) %>%
  summarize(
    n_pos = sum(sea_or_sm25_pos == 1),
    n_measured = sum(!is.na(sea_or_sm25_pos)),
    prev = if_else(n_measured > 0, 100 * n_pos / n_measured, NA_real_),
    .groups = "drop"
  )

# Observed mean pooled prevalence for each treatment group
cluster_post %>%
  group_by(arm) %>%
  summarize(
    n_clusters = n(),
    mean_prev = round(mean(prev, na.rm = TRUE), 1),
    sd_prev   = round(sd(prev, na.rm = TRUE), 1),
    median_prev = round(median(prev, na.rm = TRUE), 1),
    iqr_prev = round(IQR(prev, na.rm = TRUE), 1)
  ) %>%
  knitr::kable(
    col.names = c("Arm", "No. clusters", "Mean prevalence", "SD prevalence", 
                  "Median prevalence", "IQR prevalence")
  )
Arm No. clusters Mean prevalence SD prevalence Median prevalence IQR prevalence
SBT 15 49.7 21.7 52.3 34.9
CWT 15 40.5 25.1 28.4 40.4


A simple linear model can be used to test for differences at the cluster level.

Code
# Fit a very simple linear model to test for differences in mean prevalence
lm_fit_cluster <- lm(prev ~ arm, data = cluster_post)
summary(lm_fit_cluster)

Call:
lm(formula = prev ~ arm, data = cluster_post)

Residuals:
    Min      1Q  Median      3Q     Max 
-30.344 -21.953  -0.153  15.816  42.404 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   49.699      6.060   8.201  6.3e-09 ***
armCWT        -9.246      8.570  -1.079     0.29    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 23.47 on 28 degrees of freedom
Multiple R-squared:  0.03992,   Adjusted R-squared:  0.005626 
F-statistic: 1.164 on 1 and 28 DF,  p-value: 0.2898


We can visualize the residuals for this model.

Code
# Visualize residuals
par(mfrow = c(1, 2))
plot(lm_fit_cluster, which = 1)
plot(lm_fit_cluster, which = 2)


Extract and print the linear model results.

Code
# Visualize residuals
# Extract lm results
# Effect estimate with confidence interval and p-value:
lm_coef_est <- coef(lm_fit_cluster)["armCWT"]

# 95% confidence interval
lm_conf_int <- confint(lm_fit_cluster)["armCWT", ]

# p-value
lm_pval <- summary(lm_fit_cluster)$coefficients["armCWT", "Pr(>|t|)"]

# Print lm results nicely
cat(
  "Linear model (CWT vs SBT):\n",
  "Estimate:",
  round(lm_coef_est, 3),
  "\n",
  "95% CI: [",
  round(lm_conf_int[1], 3),
  ",",
  round(lm_conf_int[2], 3),
  "]\n",
  "p-value:",
  signif(lm_pval, 5),
  "\n"
)
Linear model (CWT vs SBT):
 Estimate: -9.246 
 95% CI: [ -26.801 , 8.308 ]
 p-value: 0.28983 


As a sensitivity check, we might fit a GEE model at the individual level which accounts for the within village correlation. Because the cluster sizes are relatively small, the standard errors may be biased for this model. Values in this output are expressed as proportions, not percentages.

Code
# Sensitivity analysis with individual-level GEE model
post <- mbita_schisto %>%
  filter(post == 1)

# GEE with identity link: estimates prevalence difference directly
gee_cluster <- geepack::geeglm(
  sea_or_sm25_pos ~ arm,
  id = vid,
  data = post,
  family = binomial(link = "identity"),
  corstr = "exchangeable"
)

gee_cluster_summary <- summary(gee_cluster)
gee_cluster_summary

Call:
geepack::geeglm(formula = sea_or_sm25_pos ~ arm, family = binomial(link = "identity"), 
    data = post, id = vid, corstr = "exchangeable")

 Coefficients:
            Estimate  Std.err   Wald Pr(>|W|)    
(Intercept)  0.50643  0.04015 159.09   <2e-16 ***
armCWT      -0.09565  0.06149   2.42     0.12    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation structure = exchangeable 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)    1.018 0.04207
  Link = identity 

Estimated Correlation Parameters:
      Estimate Std.err
alpha   0.2279 0.03846
Number of clusters:   60  Maximum cluster size: 75 


Extract and print the GEE results. We see that the estimate is very similar to the linear model estimate however the p-value is smaller. Because the cluster sizes are relatively small, the standard errors may be biased downward for this model (result closer to significance as we see).

Code
# Extract and print results
gee_coef_est <- coef(gee_cluster_summary)["armCWT", "Estimate"]
gee_se <- coef(gee_cluster_summary)["armCWT", "Std.err"]
gee_pval <- coef(gee_cluster_summary)["armCWT", "Pr(>|W|)"]

# 95% CI using normal approximation
gee_lower <- gee_coef_est - 1.96 * gee_se
gee_upper <- gee_coef_est + 1.96 * gee_se

# Print GEE results nicely
cat(
  "GEE (CWT vs SBT):\n",
  "Estimate:",
  round(gee_coef_est, 3),
  "\n",
  "95% CI: [",
  round(gee_lower, 3),
  ",",
  round(gee_upper, 3),
  "]\n",
  "p-value:",
  signif(gee_pval, 3),
  "\n"
)
GEE (CWT vs SBT):
 Estimate: -0.096 
 95% CI: [ -0.216 , 0.025 ]
 p-value: 0.12 


Summarize findings in Table 2.

Code
# Observed values
observed_tbl <- cluster_post %>%
  group_by(arm) %>%
  summarize(
    n_clusters = n(),
    mean_prev = mean(prev, na.rm = TRUE),
    sd_prev   = sd(prev, na.rm = TRUE),
    median_prev = median(prev, na.rm = TRUE),
    iqr_prev = IQR(prev, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    Analysis = paste0(arm, " (n = ", n_clusters, " clusters)"),
    Result = paste0(
      "Mean ",
      round(mean_prev, 1),
      " (SD ",
      round(sd_prev, 1),
      "); ",
      "Median ",
      round(median_prev, 1),
      " (IQR ",
      round(iqr_prev, 1),
      ")"
    ),
    `p-value` = ""
  ) %>%
  select(Analysis, Result, `p-value`)

# Model results table
model_tbl <- tribble(
  ~ Analysis,
  ~ Result,
  ~ `p-value`,
  "Linear model (cluster-level)",
  paste0(
    round(lm_coef_est, 1),
    " (95% CI ",
    round(lm_conf_int[1], 1),
    ", ",
    round(lm_conf_int[2], 1),
    ")"
  ),
  as.character(signif(lm_pval, 4)),
  "GEE (individual-level)",
  paste0(
    round(100*gee_coef_est, 1),
    " (95% CI ",
    round(100*gee_lower, 1),
    ", ",
    round(100*gee_upper, 1),
    ")"
  ),
  as.character(signif(gee_pval, 4))
)

# Combine observed and model results
table2 <- bind_rows(
  tibble(
    Analysis = "Observed cluster-level prevalence",
    Result = "",
    `p-value` = ""
  ),
  observed_tbl,
  tibble(
    Analysis = "Statistical comparisons",
    Result = "",
    `p-value` = ""
  ),
  model_tbl
)

cat('<div style="text-align:left; font-weight:bold; margin-bottom:0.3em;">
Table 2. Observed and estimated differences in sero-prevalence between CWT and SBT arms, post-intervention period (2013–2014).
</div>')

Table 2. Observed and estimated differences in sero-prevalence between CWT and SBT arms, post-intervention period (2013–2014).

Analysis Result p-value
Observed cluster-level prevalence
SBT (n = 15 clusters) Mean 49.7 (SD 21.7); Median 52.3 (IQR 34.9)
CWT (n = 15 clusters) Mean 40.5 (SD 25.1); Median 28.4 (IQR 40.4)
Statistical comparisons
Linear model (cluster-level) -9.2 (95% CI -26.8, 8.3) 0.2898
GEE (individual-level) -9.6 (95% CI -21.6, 2.5) 0.1198


The observed mean post-intervention seroprevalence was slightly lower in the CWT arm (40.5) compared with the SBT arm (49.7). The primary linear model analysis estimated a difference of –9.2 (95% CI –26.8 to 8.3, p = 0.29), indicating no statistically significant difference between arms. Sensitivity analyses using a GEE model gave similar conclusions.

Permutation test

We can perform a non-parametric permutation test to get an exact p-value for the comparison between groups. The following code implements this test.

Code
# Calculate observed difference for test statistic
# Unit of analysis is the cluster, calculate the mean for each arm
obs_diff <-
  with(cluster_post, mean(prev[arm == "CWT"]) - mean(prev[arm == "SBT"]))

# Set-up values to run test
set.seed(1001)  # for reproducibility
n_perm <- 50000  # number of permutations
perm_diff <- numeric(n_perm) # vector to hold all permuted statistics

# Vector of cluster-level prevalence
prev <- cluster_post$prev

# Original treatment arms
arm_orig <- cluster_post$arm

for (i in 1:n_perm) {
  # shuffle cluster labels, keeping cluster prevalences fixed
  arm_perm <- sample(arm_orig)
  
  # compute difference in means for permuted arms
  perm_diff[i] <- mean(prev[arm_perm == "CWT"]) - mean(prev[arm_perm == "SBT"])
}

# Calculate p-value as proportion more extreme than test statistic
p_perm <- mean(abs(perm_diff) >= abs(obs_diff))
p_perm
[1] 0.2863

To visualize the test, we can plot the distribution of the test statistic under the null.

Code
# Plot histogram
hist(
  perm_diff,
  breaks = 50,
  main = "Permutation distribution of mean difference",
  xlab = "Difference in mean prevalence (CWT − SBT)",
  col = "lightgray",
  border = "white"
)

# Shade extreme values (two-sided)
extreme <- abs(perm_diff) >= abs(obs_diff)
hist(
  perm_diff[extreme],
  breaks = 50,
  col = "red",
  border = "white",
  add = TRUE
)

# Add observed statistic line
abline(v = obs_diff,
       col = "darkred",
       lwd = 2)

# Add caption with p-value
mtext(
  paste0("Permutation p-value (two-sided) = ", round(p_perm, 3)),
  side = 3,
  line = 0.5,
  cex = 0.9
)


The design-based permutation test confirms the findings from the linear model and GEE analyses.

References

Arnold, B. F., Kanyi, H., Njenga, S. M., Rawago, F. O., Priest, J. W., Secor, W. E., Lammie, P. J., Won, K. Y., & Odiere, M. R. (2020). Fine-scale heterogeneity in schistosoma mansoni force of infection measured through antibody response. Proc. Natl. Acad. Sci. U. S. A., 117(37), 23174–23181.
Campbell, M. K., Piaggio, G., Elbourne, D. R., Altman, D. G., & CONSORT Group. (2012). Consort 2010 statement: Extension to cluster randomised trials. BMJ, 345(sep04 1), e5661.
Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., Sert, N. P. du, Simonsohn, U., Wagenmakers, E.-J., Ware, J. J., & Ioannidis, J. P. A. (2017). A manifesto for reproducible science. Nat. Hum. Behav., 1(1), 0021.
Oldenburg, C. E., Pinsky, B. A., Brogdon, J., Chen, C., Ruder, K., Zhong, L., Nyatigo, F., Cook, C. A., Hinterwirth, A., Lebas, E., Redd, T., Porco, T. C., Lietman, T. M., Arnold, B. F., & Doan, T. (2021). Effect of oral azithromycin vs placebo on COVID-19 symptoms in outpatients with SARS-CoV-2 infection: A randomized clinical trial. JAMA, 326(6), 490–498.
Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten simple rules for reproducible computational research. PLoS Comput. Biol., 9(10), e1003285.
Sur, D., Ochiai, R. L., Bhattacharya, S. K., Ganguly, N. K., Ali, M., Manna, B., Dutta, S., Donner, A., Kanungo, S., Park, J. K., Puri, M. K., Kim, D. R., Dutta, D., Bhaduri, B., Acosta, C. J., & Clemens, J. D. (2009). A cluster-randomized effectiveness trial of vi typhoid vaccine in india. N. Engl. J. Med., 361(4), 335–344.
Won, K. Y., Kanyi, H. M., Mwende, F. M., Wiegand, R. E., Goodhew, E. B., Priest, J. W., Lee, Y.-M., Njenga, S. M., Secor, W. E., Lammie, P. J., & Odiere, M. R. (2017). Multiplex serologic assessment of schistosomiasis in western kenya: Antibody responses in preschool aged children as a measure of reduced transmission. Am. J. Trop. Med. Hyg., 96(6), 1460–1467.