Schistosomiasis Cluster Randomized Trial Analysis

Author

Rob Daniels

Published

December 2, 2025

Description

An end-to-end reproducible analysis of a schistosomiasis cluster randomized trial (CRT), from data ingestion through modeling and reporting.

This project presents an independent analysis of publicly available field data from a schistosomiasis cluster randomized trial (CRT).

The data can be downloaded here.
The study is described in (Won et al., 2017) which can be accessed here.

The analysis emphasizes transparent statistical reasoning, modern workflow tools, and high-quality communication.

View the complete project structure, analysis files, and reproducible workflow on GitHub:
schisto-cluster-project

Tools and workflow

Version control & reproducibility: GitHub, Quarto, Markdown, bibliographic integration, renv, and structured workflows
Statistical programming: R
Visualization & reporting: manuscript-quality tables, HTML, ggplot2
Statistical methods: linear model, diagnostics, generalized estimating equations (GEE), permutation test

Many organizational ideas were inspired by the Proctor Foundation Data Science Handbook and R for Data Science (2e).

For questions or feedback, feel free to reach out at rdanielsstat@gmail.com.

Read and process the data

This study is comprised of measurements from 3,663 pre-school age children (PSAC) from 30 villages in Western Kenya. Two blood tests and one stool test were used to assess infection with Schistosoma spp., the parasites that cause schistosomiasis. Data were collected at three different times: in 2012 (before treatment began), and again in 2013 and 2014 (after treatment started). The study was designed as a cluster randomized trial, meaning that entire villages, not individual children, were randomly assigned to one of two treatment arms. In 15 villages, treatment was provided only through schools (SBT), while in the other 15 villages, treatment was offered to the whole community (CWT).

Read, merge, and show data set details.

Code

# Read in data
mbita_psac <- read.csv(paste0(data_path, "/mbita_schisto.csv"))
mbita_spatial <- read.csv(paste0(data_path, "/mbita_spatial.csv"))

# Merge
mbita_schisto <- mbita_psac %>%
  left_join(mbita_spatial, by = "vid")

# Data set details
str(mbita_schisto)

'data.frame':   3663 obs. of  16 variables:
 $ year         : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
 $ vid          : int  1 1 1 1 1 1 1 1 1 1 ...
 $ arm          : chr  "CWT" "CWT" "CWT" "CWT" ...
 $ pid          : int  1 2 3 4 5 6 7 8 9 10 ...
 $ agey         : num  1.81 2.17 2.95 2.63 2.54 ...
 $ sex          : chr  "male" "female" "female" "female" ...
 $ sea          : num  3136 25605 28549 87 26537 ...
 $ sm25         : num  -2 1344 520 -4 1 ...
 $ sm_epg       : int  12 12 0 36 24 132 0 48 132 0 ...
 $ sea_pos      : int  1 1 1 0 1 1 1 1 1 1 ...
 $ sm25_pos     : int  0 1 1 0 0 0 0 0 0 0 ...
 $ kk_pos       : int  1 1 0 1 1 1 0 1 1 0 ...
 $ elev         : int  1189 1189 1189 1189 1189 1189 1189 1189 1189 1189 ...
 $ tmin         : num  161 161 161 161 161 ...
 $ prec         : num  82.7 82.7 82.7 82.7 82.7 ...
 $ dist_victoria: num  568 568 568 568 568 ...

Define factors and derived variables. Run summary descriptives for full data set.

Code

# Additional processing
mbita_schisto <- mbita_schisto %>%
  mutate(
    pid  = as.character(pid),
    arm  = factor(arm, levels = c("SBT", "CWT")),
    # SBT as reference group
    sex  = factor(sex, levels = c("male", "female")),
    male = ifelse(sex == "male", 1, 0),
    # either sea or sm25 positive:
    sea_or_sm25_pos = ifelse(sea_pos == 1 |
                               sm25_pos == 1, 1, 0),
    agecat = cut(
      agey,
      breaks = c(0, 1, 2, 3, 4, 5, 6),
      labels = c("<1 year", "1 year", "2 years", "3 years", "4 years", 
                 "5 years")
    ),
    # create indicator for baseline/post intervention
    post = ifelse(year %in% c(2013, 2014), 1, 0)
  )

# View data
# View(mbita_schisto)

# Full data set summary
summary(mbita_schisto)

      year           vid         arm           pid                 agey       
 Min.   :2012   Min.   : 1.00   SBT:1837   Length:3663        Min.   :0.2108  
 1st Qu.:2012   1st Qu.: 7.00   CWT:1826   Class :character   1st Qu.:2.4312  
 Median :2013   Median :13.00              Mode  :character   Median :3.4990  
 Mean   :2013   Mean   :14.27                                 Mean   :3.3774  
 3rd Qu.:2014   3rd Qu.:22.00                                 3rd Qu.:4.4435  
 Max.   :2014   Max.   :30.00                                 Max.   :5.5524  
                                                                              
     sex            sea             sm25              sm_epg       
 male  :1759   Min.   :    4   Min.   :  -12.00   Min.   :   0.00  
 female:1904   1st Qu.:  101   1st Qu.:    1.15   1st Qu.:   0.00  
               Median :  387   Median :    6.00   Median :   0.00  
               Mean   :11982   Mean   :  129.26   Mean   :  42.01  
               3rd Qu.:27337   3rd Qu.:   19.00   3rd Qu.:  12.00  
               Max.   :32685   Max.   :18816.00   Max.   :5004.00  
                                                  NA's   :237      
    sea_pos          sm25_pos          kk_pos            elev     
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :1137  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1151  
 Median :0.0000   Median :0.0000   Median :0.0000   Median :1158  
 Mean   :0.4775   Mean   :0.1701   Mean   :0.2536   Mean   :1170  
 3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1178  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1338  
                                   NA's   :237                    
      tmin            prec        dist_victoria          male       
 Min.   :148.9   Min.   : 81.17   Min.   :  32.28   Min.   :0.0000  
 1st Qu.:157.4   1st Qu.: 83.42   1st Qu.: 234.21   1st Qu.:0.0000  
 Median :158.8   Median : 84.92   Median : 712.82   Median :0.0000  
 Mean   :158.6   Mean   : 88.48   Mean   :1011.26   Mean   :0.4802  
 3rd Qu.:160.8   3rd Qu.: 94.50   3rd Qu.:1598.12   3rd Qu.:1.0000  
 Max.   :161.4   Max.   :101.42   Max.   :4718.11   Max.   :1.0000  
                                                                    
 sea_or_sm25_pos      agecat          post       
 Min.   :0.0000   <1 year:  51   Min.   :0.0000  
 1st Qu.:0.0000   1 year : 570   1st Qu.:0.0000  
 Median :1.0000   2 years: 780   Median :1.0000  
 Mean   :0.5089   3 years: 890   Mean   :0.6942  
 3rd Qu.:1.0000   4 years:1214   3rd Qu.:1.0000  
 Max.   :1.0000   5 years: 158   Max.   :1.0000

Describe the data

The following table shows the number of pre-school age children (PSAC) per year and study arm. The summary statistics are based on counts of PSAC per village (cluster).

Code

mbita_schisto %>%
  group_by(year, arm, vid) %>%
  summarize(n_psac = n_distinct(pid), .groups = "drop") %>%
  group_by(year, arm) %>%
  summarize(
    n_villages  = n(), 
    total_psac  = sum(n_psac), 
    mean_psac   = round(mean(n_psac), 1), 
    sd_psac     = round(sd(n_psac), 1), 
    min_psac    = min(n_psac), 
    q1_psac     = quantile(n_psac, 0.25), 
    median_psac = median(n_psac), 
    q3_psac     = quantile(n_psac, 0.75), 
    max_psac    = max(n_psac),
    .groups = "drop"
  ) %>%
  knitr::kable(
    col.names = c("Year", "Arm", "No. villages", "Total PSAC", 
                  "Mean PSAC", "SD PSAC", "Min PSAC", "Q1 PSAC",
                  "Median PSAC", "Q3 PSAC", "Max PSAC"))

Year	Arm	No. villages	Total PSAC	Mean PSAC	SD PSAC	Min PSAC	Q1 PSAC	Median PSAC	Q3 PSAC	Max PSAC
2012	SBT	15	582	38.8	15.2	16	27.0	40	47.5	66
2012	CWT	15	538	35.9	16.5	13	23.0	37	48.5	63
2013	SBT	15	624	41.6	15.5	25	32.0	38	47.5	73
2013	CWT	15	563	37.5	18.6	18	24.0	33	50.5	75
2014	SBT	15	631	42.1	18.0	9	28.5	43	53.5	75
2014	CWT	15	725	48.3	16.7	22	37.0	55	63.0	68

On average, there were 36–48 PSAC per village at each survey time point, with a minimum of 9 observed at one time point and a maximum of 75 observed at two time points.

The table below shows the percentage of missing values for blood- and stool-based measures (SEA, Sm25, and Sm-EPG) by year and study arm, summarized across clusters. There are no missing data for the blood-based measures; missing data occur only for the stool-based measure (Sm-EPG), which is summarized below.

Code

# Aggregates missing values per year, arm, and cluster (vid), then summarizes
# across clusters
make_na_summary <- function(data, var) {
  var <- enquo(var)
  
  data %>%
    group_by(year, arm, vid) %>%
    summarize(
      n_psac           = n_distinct(pid),
      measurements_var = sum(!is.na(!!var)),
      na_var           = sum(is.na(!!var)),
      pct_na_var       = 100 * na_var / n_psac,
      .groups = "drop"
    ) %>%
    group_by(year, arm) %>%
    summarize(
      n_villages     = n(),
      n_psac         = sum(n_psac),
      n_measurements = sum(measurements_var),
      na_total       = sum(na_var),
      mean_pct_na    = round(mean(pct_na_var), 1),
      sd_pct_na      = round(sd(pct_na_var), 1),
      min_pct_na     = round(min(pct_na_var), 1),
      q1_pct_na      = round(quantile(pct_na_var, 0.25), 1),
      median_pct_na  = round(median(pct_na_var), 1),
      q3_pct_na      = round(quantile(pct_na_var, 0.75), 1),
      max_pct_na     = round(max(pct_na_var), 1),
      .groups = "drop"
    ) %>%
    mutate(variable = quo_name(var), .before = 1)
}

make_na_summary(mbita_schisto, sm_epg) %>%
  knitr::kable(
    col.names = c("Variable", "Year", "Arm", "No. villages", "Total PSAC", 
                  "No. measurements", "No. NA (total)", "Mean % NA", "SD % NA",
                  "Min % NA", "Q1 % NA", "Median % NA", "Q3 % NA", "Max % NA")
  )

Variable	Year	Arm	No. villages	Total PSAC	No. measurements	No. NA (total)	Mean % NA	SD % NA	Min % NA	Q1 % NA	Median % NA	Q3 % NA	Max % NA
sm_epg	2012	SBT	15	582	554	28	4.1	6.2	0.0	0.0	0.0	5.6	20.4
sm_epg	2012	CWT	15	538	518	20	4.8	7.4	0.0	0.0	0.0	6.4	25.0
sm_epg	2013	SBT	15	624	616	8	1.5	3.5	0.0	0.0	0.0	1.5	13.2
sm_epg	2013	CWT	15	563	556	7	1.5	3.0	0.0	0.0	0.0	1.6	11.1
sm_epg	2014	SBT	15	631	555	76	11.9	4.4	1.9	9.8	11.6	14.4	19.5
sm_epg	2014	CWT	15	725	627	98	13.9	7.1	5.5	9.3	11.4	16.9	29.2

In 2014, the community-wide treatment (CWT) arm had 13.5% missing values (98 of 725 total), with one village showing as much as 29% missing.

The following tables show the counts of SEA and Sm-EPG measurements, respectively, along with their missing values, for each year.

Code

# Summarizes counts of measurements and missing values per year
make_meas_summary <- function(data, var) {
  var <- enquo(var)
  
  data %>%
    group_by(year) %>%
    summarize(
      n_psac         = n_distinct(pid),
      n_measurements = sum(!is.na(!!var)),
      na_var         = sum(is.na(!!var)),
      pct_na_var     = round(100 * na_var / n_psac, 1),
      .groups = "drop"
    ) %>%
    mutate(
      n_psac         = format(n_psac, big.mark = ","),
      n_measurements = format(n_measurements, big.mark = ","),
      variable = quo_name(var),
      .before = 1
    )
}

make_meas_summary(mbita_schisto, sea) %>%
  knitr::kable(
    col.names = c("Variable", "Year", "No. PSAC", "No. measurements", "No. NA", 
                  "% NA")
  )

Variable	Year	No. PSAC	No. measurements
sea	2012	1,120	1,120
sea	2013	1,187	1,187
sea	2014	1,356	1,356

Code

make_meas_summary(mbita_schisto, sm_epg) %>%
  knitr::kable(
    col.names = c("Variable", "Year", "No. PSAC", "No. measurements", "No. NA", 
                  "% NA")
  )

Variable	Year	No. PSAC	No. measurements	No. NA	% NA
sm_epg	2012	1,120	1,072	48	4.3
sm_epg	2013	1,187	1,172	15	1.3
sm_epg	2014	1,356	1,182	174	12.8

For the blood-based measurements, there were 1,120, 1,187, and 1,356 pre-school age children (PSAC) at baseline, 2013, and 2014, respectively, with no missing data. For the stool-based measurement, there were 1,072, 1,172, and 1,182 PSAC at baseline, 2013, and 2014, respectively, with 4.3%, 1.3%, and 12.8% of values missing at each time point.

The following table shows prevalence (SEA, Sm25, and KK positivity) by age category for all pre-school age children (PSAC).

Code

mbita_schisto %>%
  group_by(agecat) %>%
  summarize(n_sea_or_sm25 = sum(!is.na(sea_or_sm25_pos)),
            n_sea_or_sm25_pos = sum(sea_or_sm25_pos),
            n_kk = sum(!is.na(kk_pos)),
            n_kk_pos = sum(kk_pos, na.rm = TRUE),
            .groups = "drop") %>%
  mutate(sea_or_sm25_prev = 
           sprintf("%1.1f", n_sea_or_sm25_pos / n_sea_or_sm25 * 100),
         kk_prev =  sprintf("%1.1f", n_kk_pos / n_kk * 100),
         group = as.character(agecat)) %>%
  select(group, n_sea_or_sm25, n_sea_or_sm25_pos, sea_or_sm25_prev,
         n_kk, n_kk_pos, kk_prev) %>%
  kable(align = c("l", "r", "r", "r", "r", "r", "r"),
        col.names = c("Group", "No. SEA/Sm25 measurements", 
                      "No. SEA/Sm25 positive", "SEA/Sm25 prevalence (%)", 
                      "No. KK measurements", "No. KK positive", 
                      "KK prevalence (%)")) %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, extra_css = "white-space: nowrap;")

Group	No. SEA/Sm25 measurements	No. SEA/Sm25 positive	SEA/Sm25 prevalence (%)	No. KK measurements	No. KK positive	KK prevalence (%)
<1 year	51	13	25.5	47	5	10.6
1 year	570	162	28.4	503	68	13.5
2 years	780	324	41.5	714	150	21.0
3 years	890	502	56.4	844	211	25.0
4 years	1214	759	62.5	1166	384	32.9
5 years	158	104	65.8	152	51	33.6

There appears to be a strong relationship between age and prevalence.

The following tables shows summary statistics for the quantitative variable SEA at year, arm level.

Code

# Summarizes continuous blood- and stool-based measurements per year and arm,
# pooled across PSAC
make_cont_summary <- function(data, var) {
  var <- enquo(var)
  
  data %>%
    group_by(year, arm) %>%
    summarize(
      n_psac     = n_distinct(pid),
      n_measures = sum(!is.na(!!var)),
      mean       = round(mean(!!var, na.rm = TRUE), 1),
      sd         = round(sd(!!var, na.rm = TRUE), 1),
      min        = round(min(!!var, na.rm = TRUE), 1),
      q1         = round(quantile(!!var, 0.25, na.rm = TRUE), 1),
      median     = round(median(!!var, na.rm = TRUE), 1),
      q3         = round(quantile(!!var, 0.75, na.rm = TRUE), 1),
      max        = round(max(!!var, na.rm = TRUE), 1),
      .groups = "drop"
    ) %>%
    mutate(variable = quo_name(var), .before = 1)
}
make_cont_summary(mbita_schisto, sea) %>%
  knitr::kable(
    col.names = c("Variable", "Year", "Arm", "No. PSAC", "No. measurements",
                  "Mean", "SD", "Min", "Q1", "Median", "Q3", "Max")
  )

Variable	Year	Arm	No. PSAC	No. measurements	Mean	SD	Min	Q1	Median	Q3	Max
sea	2012	SBT	582	582	13718.0	13584.9	4	113.0	8620	27764.2	32685
sea	2012	CWT	538	538	10524.6	12978.7	24	87.0	233	25505.2	32646
sea	2013	SBT	624	624	13134.8	13605.4	17	109.8	4556	27811.8	32625
sea	2013	CWT	563	563	12189.0	13676.2	6	104.5	384	27788.0	32568
sea	2014	SBT	631	631	12710.3	13730.4	7	115.0	762	28249.0	32602
sea	2014	CWT	725	725	9881.8	13144.6	18	92.0	173	25486.0	32502

Construct a histogram for these data.

Code

hist(mbita_schisto$sea, main = "SEA distribution", xlab = "SEA")

All three (Sm25 and Sm-EPG not shown) of the quantitative variables are highly skewed, which could affect assumptions of normality in downstream analyses. Transformations such as log or square root may be necessary, and appropriate statistical methods that account for skewed distributions should be considered when modeling or testing differences.

Finally, the following table shows the prevalence of infection as measured by positive SEA or Sm25 reactivity.

Code

# Calculates pooled prevalence per year and arm
make_prev_summary <- function(data, var) {
  var <- enquo(var)
  
  data %>%
    group_by(year, arm) %>%
    summarize(
      n_psac = n_distinct(pid),
      n_measures = sum(!is.na(!!var)),
      n_positive = sum(!!var, na.rm = TRUE),
      prevalence = round(100 * n_positive / n_measures, 1),
      .groups = "drop"
    ) %>%
    mutate(variable = quo_name(var), .before = 1)
}
make_prev_summary(mbita_schisto, sea_or_sm25_pos) %>%
  knitr::kable(
    col.names = c("Variable", "Year", "Arm", "No. PSAC", "No. measurements",
                  "No. positive", "Prevalence (%)")
  )

Variable	Year	Arm	No. PSAC	No. measurements	No. positive	Prevalence (%)
sea_or_sm25_pos	2012	SBT	582	582	337	57.9
sea_or_sm25_pos	2012	CWT	538	538	241	44.8
sea_or_sm25_pos	2013	SBT	624	624	347	55.6
sea_or_sm25_pos	2013	CWT	563	563	284	50.4
sea_or_sm25_pos	2014	SBT	631	631	343	54.4
sea_or_sm25_pos	2014	CWT	725	725	312	43.0

Summarize baseline characteristics

The following code constructs a manuscript style Table 1 for assessing individual- and cluster-level baseline characteristics. These results are helpful in assessing how well randomization worked in the study. The full code is not shown here for brevity.

Code

source(paste0(here::here(), "/R/3-baseline-table1.R"))

Table 1. Individual- and cluster-level baseline characteristics and infection prevalence by treatment arm.

Characteristic	SBT	CWT
Individual-level
PSAC, No. (%)	582 (52)	538 (48)
Sex, No. (%)
Female	305 (52.4)	276 (51.3)
Male	277 (47.6)	262 (48.7)
Age group, No. (%)
<1 year	13 (2.2)	4 (0.7)
1 year	77 (13.2)	77 (14.3)
2 years	123 (21.1)	102 (19)
3 years	141 (24.2)	133 (24.7)
4 years	187 (32.1)	167 (31)
5 years	41 (7)	55 (10.2)
Age, mean ± SD, y	3.5 ± 1.2	3.5 ± 1.2
Blood-based measures, No. (%)
SEA positive	326 (56)	231 (42.9)
Sm25 positive	85 (14.6)	70 (13)
SEA or Sm25 positive	337 (57.9)	241 (44.8)
Stool-based measure, No. (%)	n = 554	n = 518
KK positive	173 (31.2)	124 (23.9)
Cluster-level
PSAC per village, mean ± SD	38.8 ± 15.2	35.9 ± 16.5
Percent male per village, mean ± SD	46.3 ± 8.7	47.6 ± 8.7
Mean village age, mean ± SD	3.5 ± 0.3	3.5 ± 0.3
Blood-based measures, % per village, mean ± SD
SEA positive	51.1 ± 25.3	41.4 ± 29.3
Sm25 positive	14.4 ± 12.6	13.5 ± 12.1
SEA or Sm25 positive	53.3 ± 24.5	43.5 ± 28.7
Stool-based measure, % per village, mean ± SD
KK positive	29.0 ± 18.7	24.0 ± 20.5

At baseline, the two study arms were generally well balanced. Both arms had similar average cluster sizes (35.9 vs. 38.8 children per cluster) with comparable variability. Demographic characteristics were nearly identical: the mean age of children was 3.5 years in both arms, with narrow standard deviations, and the proportion of male children was also similar (47.6% vs. 46.3%). These observations indicate excellent demographic balance between the arms.

In terms of infection measures, the mean cluster-level SEA prevalence was somewhat lower in the CWT arm (41.4%) compared to the SBT arm (51.1%), though the large standard deviations (29.3 vs. 25.3) suggest substantial cluster-to-cluster variability and overlap. Mean cluster-level Sm25 prevalence was nearly identical (13.5% vs. 14.4%), and mean cluster-level Kato-Katz prevalence was slightly lower in the CWT arm (24%) than in the SBT arm (29%), again with wide variability across clusters. Overall, while there are some modest differences in baseline infection prevalence, the groups are well balanced on demographics and cluster size.

Reflecting on the study design, this was a cluster randomized trial with 30 villages randomized 1:1 into the two arms. With only 15 clusters per arm, some chance imbalances are more likely than in individually randomized trials with thousands of participants. The observed balance in demographic characteristics and cluster sizes suggests that the randomization performed as intended, while the modest differences in SEA and Kato-Katz prevalence may reflect natural variation between communities or the limited number of clusters.

In conclusion, the arms appear well balanced with respect to demographics and cluster sizes, with only minor differences in baseline infection prevalence that may not be unexpected given the small number of clusters per arm and natural variability in infection rates among villages.

Compare seroprevalence between groups

Calculate observed pooled sero-prevalence for 2013 and 2014 combined at the cluster, arm level.

Code

# Calculate pooled sero-prevalence for 2013 and 2014 combined at the cluster,
# arm level
cluster_post <- mbita_schisto %>%
  filter(post == 1) %>%
  group_by(vid, arm) %>%
  summarize(
    n_pos = sum(sea_or_sm25_pos == 1),
    n_measured = sum(!is.na(sea_or_sm25_pos)),
    prev = if_else(n_measured > 0, 100 * n_pos / n_measured, NA_real_),
    .groups = "drop"
  )

# Observed mean pooled prevalence for each treatment group
cluster_post %>%
  group_by(arm) %>%
  summarize(
    n_clusters = n(),
    mean_prev = round(mean(prev, na.rm = TRUE), 1),
    sd_prev   = round(sd(prev, na.rm = TRUE), 1),
    median_prev = round(median(prev, na.rm = TRUE), 1),
    iqr_prev = round(IQR(prev, na.rm = TRUE), 1)
  ) %>%
  knitr::kable(
    col.names = c("Arm", "No. clusters", "Mean prevalence", "SD prevalence", 
                  "Median prevalence", "IQR prevalence")
  )

Arm	No. clusters	Mean prevalence	SD prevalence	Median prevalence	IQR prevalence
SBT	15	49.7	21.7	52.3	34.9
CWT	15	40.5	25.1	28.4	40.4

A simple linear model can be used to test for differences at the cluster level.

Code

# Fit a very simple linear model to test for differences in mean prevalence
lm_fit_cluster <- lm(prev ~ arm, data = cluster_post)
summary(lm_fit_cluster)


Call:
lm(formula = prev ~ arm, data = cluster_post)

Residuals:
    Min      1Q  Median      3Q     Max 
-30.344 -21.953  -0.153  15.816  42.404 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   49.699      6.060   8.201  6.3e-09 ***
armCWT        -9.246      8.570  -1.079     0.29    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 23.47 on 28 degrees of freedom
Multiple R-squared:  0.03992,   Adjusted R-squared:  0.005626 
F-statistic: 1.164 on 1 and 28 DF,  p-value: 0.2898

We can visualize the residuals for this model.

Code

# Visualize residuals
par(mfrow = c(1, 2))
plot(lm_fit_cluster, which = 1)
plot(lm_fit_cluster, which = 2)

Extract and print the linear model results.

Code

# Visualize residuals
# Extract lm results
# Effect estimate with confidence interval and p-value:
lm_coef_est <- coef(lm_fit_cluster)["armCWT"]

# 95% confidence interval
lm_conf_int <- confint(lm_fit_cluster)["armCWT", ]

# p-value
lm_pval <- summary(lm_fit_cluster)$coefficients["armCWT", "Pr(>|t|)"]

# Print lm results nicely
cat(
  "Linear model (CWT vs SBT):\n",
  "Estimate:",
  round(lm_coef_est, 3),
  "\n",
  "95% CI: [",
  round(lm_conf_int[1], 3),
  ",",
  round(lm_conf_int[2], 3),
  "]\n",
  "p-value:",
  signif(lm_pval, 5),
  "\n"
)

Linear model (CWT vs SBT):
 Estimate: -9.246 
 95% CI: [ -26.801 , 8.308 ]
 p-value: 0.28983

As a sensitivity check, we might fit a GEE model at the individual level which accounts for the within village correlation. Because the cluster sizes are relatively small, the standard errors may be biased for this model. Values in this output are expressed as proportions, not percentages.

Code

# Sensitivity analysis with individual-level GEE model
post <- mbita_schisto %>%
  filter(post == 1)

# GEE with identity link: estimates prevalence difference directly
gee_cluster <- geepack::geeglm(
  sea_or_sm25_pos ~ arm,
  id = vid,
  data = post,
  family = binomial(link = "identity"),
  corstr = "exchangeable"
)

gee_cluster_summary <- summary(gee_cluster)
gee_cluster_summary


Call:
geepack::geeglm(formula = sea_or_sm25_pos ~ arm, family = binomial(link = "identity"), 
    data = post, id = vid, corstr = "exchangeable")

 Coefficients:
            Estimate  Std.err   Wald Pr(>|W|)    
(Intercept)  0.50643  0.04015 159.09   <2e-16 ***
armCWT      -0.09565  0.06149   2.42     0.12    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation structure = exchangeable 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)    1.018 0.04207
  Link = identity 

Estimated Correlation Parameters:
      Estimate Std.err
alpha   0.2279 0.03846
Number of clusters:   60  Maximum cluster size: 75

Extract and print the GEE results. We see that the estimate is very similar to the linear model estimate however the p-value is smaller. Because the cluster sizes are relatively small, the standard errors may be biased downward for this model (result closer to significance as we see).

Code

# Extract and print results
gee_coef_est <- coef(gee_cluster_summary)["armCWT", "Estimate"]
gee_se <- coef(gee_cluster_summary)["armCWT", "Std.err"]
gee_pval <- coef(gee_cluster_summary)["armCWT", "Pr(>|W|)"]

# 95% CI using normal approximation
gee_lower <- gee_coef_est - 1.96 * gee_se
gee_upper <- gee_coef_est + 1.96 * gee_se

# Print GEE results nicely
cat(
  "GEE (CWT vs SBT):\n",
  "Estimate:",
  round(gee_coef_est, 3),
  "\n",
  "95% CI: [",
  round(gee_lower, 3),
  ",",
  round(gee_upper, 3),
  "]\n",
  "p-value:",
  signif(gee_pval, 3),
  "\n"
)

GEE (CWT vs SBT):
 Estimate: -0.096 
 95% CI: [ -0.216 , 0.025 ]
 p-value: 0.12

Summarize findings in Table 2.

Code

# Observed values
observed_tbl <- cluster_post %>%
  group_by(arm) %>%
  summarize(
    n_clusters = n(),
    mean_prev = mean(prev, na.rm = TRUE),
    sd_prev   = sd(prev, na.rm = TRUE),
    median_prev = median(prev, na.rm = TRUE),
    iqr_prev = IQR(prev, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    Analysis = paste0(arm, " (n = ", n_clusters, " clusters)"),
    Result = paste0(
      "Mean ",
      round(mean_prev, 1),
      " (SD ",
      round(sd_prev, 1),
      "); ",
      "Median ",
      round(median_prev, 1),
      " (IQR ",
      round(iqr_prev, 1),
      ")"
    ),
    `p-value` = ""
  ) %>%
  select(Analysis, Result, `p-value`)

# Model results table
model_tbl <- tribble(
  ~ Analysis,
  ~ Result,
  ~ `p-value`,
  "Linear model (cluster-level)",
  paste0(
    round(lm_coef_est, 1),
    " (95% CI ",
    round(lm_conf_int[1], 1),
    ", ",
    round(lm_conf_int[2], 1),
    ")"
  ),
  as.character(signif(lm_pval, 4)),
  "GEE (individual-level)",
  paste0(
    round(100*gee_coef_est, 1),
    " (95% CI ",
    round(100*gee_lower, 1),
    ", ",
    round(100*gee_upper, 1),
    ")"
  ),
  as.character(signif(gee_pval, 4))
)

# Combine observed and model results
table2 <- bind_rows(
  tibble(
    Analysis = "Observed cluster-level prevalence",
    Result = "",
    `p-value` = ""
  ),
  observed_tbl,
  tibble(
    Analysis = "Statistical comparisons",
    Result = "",
    `p-value` = ""
  ),
  model_tbl
)

cat('<div style="text-align:left; font-weight:bold; margin-bottom:0.3em;">
Table 2. Observed and estimated differences in sero-prevalence between CWT and SBT arms, post-intervention period (2013–2014).
</div>')

Table 2. Observed and estimated differences in sero-prevalence between CWT and SBT arms, post-intervention period (2013–2014).

Analysis	Result	p-value
Observed cluster-level prevalence
SBT (n = 15 clusters)	Mean 49.7 (SD 21.7); Median 52.3 (IQR 34.9)
CWT (n = 15 clusters)	Mean 40.5 (SD 25.1); Median 28.4 (IQR 40.4)
Statistical comparisons
Linear model (cluster-level)	-9.2 (95% CI -26.8, 8.3)	0.2898
GEE (individual-level)	-9.6 (95% CI -21.6, 2.5)	0.1198

The observed mean post-intervention seroprevalence was slightly lower in the CWT arm (40.5) compared with the SBT arm (49.7). The primary linear model analysis estimated a difference of –9.2 (95% CI –26.8 to 8.3, p = 0.29), indicating no statistically significant difference between arms. Sensitivity analyses using a GEE model gave similar conclusions.

Permutation test

We can perform a non-parametric permutation test to get an exact p-value for the comparison between groups. The following code implements this test.

Code

# Calculate observed difference for test statistic
# Unit of analysis is the cluster, calculate the mean for each arm
obs_diff <-
  with(cluster_post, mean(prev[arm == "CWT"]) - mean(prev[arm == "SBT"]))

# Set-up values to run test
set.seed(1001)  # for reproducibility
n_perm <- 50000  # number of permutations
perm_diff <- numeric(n_perm) # vector to hold all permuted statistics

# Vector of cluster-level prevalence
prev <- cluster_post$prev

# Original treatment arms
arm_orig <- cluster_post$arm

for (i in 1:n_perm) {
  # shuffle cluster labels, keeping cluster prevalences fixed
  arm_perm <- sample(arm_orig)
  
  # compute difference in means for permuted arms
  perm_diff[i] <- mean(prev[arm_perm == "CWT"]) - mean(prev[arm_perm == "SBT"])
}

# Calculate p-value as proportion more extreme than test statistic
p_perm <- mean(abs(perm_diff) >= abs(obs_diff))
p_perm

[1] 0.2863

To visualize the test, we can plot the distribution of the test statistic under the null.

Code

# Plot histogram
hist(
  perm_diff,
  breaks = 50,
  main = "Permutation distribution of mean difference",
  xlab = "Difference in mean prevalence (CWT − SBT)",
  col = "lightgray",
  border = "white"
)

# Shade extreme values (two-sided)
extreme <- abs(perm_diff) >= abs(obs_diff)
hist(
  perm_diff[extreme],
  breaks = 50,
  col = "red",
  border = "white",
  add = TRUE
)

# Add observed statistic line
abline(v = obs_diff,
       col = "darkred",
       lwd = 2)

# Add caption with p-value
mtext(
  paste0("Permutation p-value (two-sided) = ", round(p_perm, 3)),
  side = 3,
  line = 0.5,
  cex = 0.9
)

The design-based permutation test confirms the findings from the linear model and GEE analyses.

References

Arnold, B. F., Kanyi, H., Njenga, S. M., Rawago, F. O., Priest, J. W., Secor, W. E., Lammie, P. J., Won, K. Y., & Odiere, M. R. (2020). Fine-scale heterogeneity in schistosoma mansoni force of infection measured through antibody response. Proc. Natl. Acad. Sci. U. S. A., 117(37), 23174–23181.

Campbell, M. K., Piaggio, G., Elbourne, D. R., Altman, D. G., & CONSORT Group. (2012). Consort 2010 statement: Extension to cluster randomised trials. BMJ, 345(sep04 1), e5661.

Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., Sert, N. P. du, Simonsohn, U., Wagenmakers, E.-J., Ware, J. J., & Ioannidis, J. P. A. (2017). A manifesto for reproducible science. Nat. Hum. Behav., 1(1), 0021.

Oldenburg, C. E., Pinsky, B. A., Brogdon, J., Chen, C., Ruder, K., Zhong, L., Nyatigo, F., Cook, C. A., Hinterwirth, A., Lebas, E., Redd, T., Porco, T. C., Lietman, T. M., Arnold, B. F., & Doan, T. (2021). Effect of oral azithromycin vs placebo on COVID-19 symptoms in outpatients with SARS-CoV-2 infection: A randomized clinical trial. JAMA, 326(6), 490–498.

Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten simple rules for reproducible computational research. PLoS Comput. Biol., 9(10), e1003285.

Sur, D., Ochiai, R. L., Bhattacharya, S. K., Ganguly, N. K., Ali, M., Manna, B., Dutta, S., Donner, A., Kanungo, S., Park, J. K., Puri, M. K., Kim, D. R., Dutta, D., Bhaduri, B., Acosta, C. J., & Clemens, J. D. (2009). A cluster-randomized effectiveness trial of vi typhoid vaccine in india. N. Engl. J. Med., 361(4), 335–344.

Won, K. Y., Kanyi, H. M., Mwende, F. M., Wiegand, R. E., Goodhew, E. B., Priest, J. W., Lee, Y.-M., Njenga, S. M., Secor, W. E., Lammie, P. J., & Odiere, M. R. (2017). Multiplex serologic assessment of schistosomiasis in western kenya: Antibody responses in preschool aged children as a measure of reduced transmission. Am. J. Trop. Med. Hyg., 96(6), 1460–1467.