Bladder cancer recurrence analysis

Author

Rob Daniels

Published

December 9, 2025

Description

This document presents an analysis of the classic bladder cancer recurrence data. It illustrates advanced statistical approaches to recurrent events survival analysis while demonstrating the use of modern data science tools and workflows, including:

  • Version control & reproducibility: GitHub, Quarto, Markdown, bibliographic integration
  • Statistical programming: R
  • Visualization & reporting: HTML, LaTeX, ggplot2, plotly
  • Statistical methods: recurrent events survival analysis

If this work has helped you understand these models in an applied context or unlocked new insights, I would greatly appreciate hearing about it. Please feel free to email me at rdanielsstat@gmail.com and let me know.

Data

Data are from Byar and Blackard (1977). Per the R survival package documentation, the bladder1 data set is the full data set. Patients entered the study after removal of superficial bladder tumors. They were randomized into one of three treatment arms and followed for tumor recurrence for up to 64 months. Censoring occurs either at loss to follow-up or death. The data are recorded in counting-process (start, stop] format and contain the following variables:

id        : subject ID
treatment : treatment group (placebo, pyridoxine, or thiotepa)
number    : number of initial tumors
size      : size of the largest initial tumor (cm)
recur     : recurrence number
start     : start of the interval (0 or previous event time)
stop      : end of interval; recurrence or censoring time
status    : end of interval code
            (0 = censored, 1 = recurrence, 
             2 = death from bladder disease, 
             3 = other death or unknown)
rtumor    : number of tumors at recurrence
rsize     : size of largest tumor at recurrence (cm)
enum      : event number
Click to show full data set.
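
The analyses below assume that the survival package (which ships the bladder1 data) and the data-manipulation and plotting packages used in later chunks have been loaded. A minimal setup sketch, with the package list inferred from the code that follows:

Code
# assumed setup for the analyses in this document (inferred from later chunks)
library(survival)    # bladder1 data set; coxph(), Surv(), strata(), cluster()
library(dplyr)       # data manipulation verbs used throughout
library(tidyr)       # complete(), fill(), pivot_wider()
library(ggplot2)     # static recurrent-event plots
library(plotly)      # ggplotly() for interactive plots
library(htmlwidgets) # saveWidget() for standalone HTML plots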

Full data set descriptives

Number of subjects.

Code
# number of subjects
cat(length(unique(bladder1$id)))
118

Subjects per treatment arm.

Code
# subject count and percent per treatment arm
ids_per_trt <- bladder1 %>%
  distinct(treatment, id) %>%
  count(treatment, name = "subj") %>%
  mutate(pct = round(100 * subj / sum(subj), 1))

print(ids_per_trt, row.names = FALSE)
  treatment subj  pct
    placebo   48 40.7
 pyridoxine   32 27.1
   thiotepa   38 32.2

Total number of recurrences.

Code
# number of recurrences (2 equivalent ways)
# cat(sum(tapply(bladder1$recur, bladder1$id, max)))
cat(sum(bladder1$status == 1))
189

Recurrence frequencies. 52.5% of subjects had at least one recurrence.

Code
# recurrence frequencies
recur_freqs <- bladder1 %>%
  group_by(id) %>%
  summarize(recurrences = max(recur, na.rm = TRUE), .groups = "drop") %>%
  count(recurrences, name = "subj") %>%
  mutate(pct = round(100 * subj / sum(subj), 1)) %>%
  as.data.frame()

print(recur_freqs, row.names = FALSE)
 recurrences subj  pct
           0   56 47.5
           1   23 19.5
           2   11  9.3
           3    8  6.8
           4    4  3.4
           5    8  6.8
           6    1  0.8
           7    1  0.8
           8    3  2.5
           9    3  2.5

Distribution of number of recurrences. The mean number of recurrences was 1.6.

Code
# distribution of number of recurrences
summary(tapply(bladder1$recur, bladder1$id, max), digits = 2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0     0.0     1.0     1.6     2.0     9.0 

Recurrent events plot

Code
# recurrent events plots
plot_data <- bladder1 %>%
  mutate(id_factor = factor(id, levels = unique(id)),
         event = case_when(status == 1 ~ "Recurrence",
                           status == 0 ~ "Censored",
                           status == 2 ~ "Bladder cancer death",
                           status == 3 ~ "Other death/unknown",
                           TRUE        ~ NA_character_),
         event = factor(event, levels = c(
           "Recurrence",
           "Censored",
           "Bladder cancer death",
           "Other death/unknown")))

event_plot <- ggplot(plot_data, aes(y = id_factor)) +
  geom_segment(aes(x = start, xend = stop, yend = id_factor), 
               color = "grey60") + 
  geom_point(data = filter(plot_data, !is.na(event)),
             aes(x = stop, shape = event, colour = event,
                 text = paste("ID:", id, "<br>Status:", event, 
                              "<br>Time:", stop)), 
             size = 1.5, stroke = 0.5) +
  scale_shape_manual(
    values = c(
      "Recurrence"           = 16,
      "Censored"             = 1,
      "Bladder cancer death" = 4,
      "Other death/unknown"  = 17)) +
  scale_color_manual(
    values = c(
      "Recurrence"           = "black",
      "Censored"             = "black",
      "Bladder cancer death" = "red",
      "Other death/unknown"  = "blue")) +
  scale_y_discrete("Subject",
                   expand = expansion(mult = c(0.02, 0.02))) +
  scale_x_continuous("Months since enrollment") +
  facet_wrap(~treatment, ncol = 1, scales = "free_y") +
  coord_cartesian(clip = "off") +
  theme_minimal(base_size = 12) +
  theme(plot.margin     = margin(15, 15, 15, 15),
        legend.position = "right",
        strip.placement = "outside",
        strip.text.y    = element_text(angle = 90),
        axis.text.y     = element_blank(),
        axis.ticks.y    = element_blank()) +
  labs(shape = "Event type", color = "Event type")

interactive_plot <- ggplotly(event_plot, tooltip = "text")

htmlwidgets::saveWidget(interactive_plot,
                        "bladder_recurrence_event_plot.html",
                        selfcontained = TRUE)

ggplotly(event_plot, tooltip = "text")


Click here to open an interactive plot in a new window

Traditional analytic approaches

This section walks through several commonly used models for analyzing recurrent event data. Many of the results align with those published in the SAS PROC PHREG documentation. For consistency, the analyses use a reduced version of the bladder1 data set.

Reduced data set

The reduced data set includes only patients in the placebo and thiotepa treatment arms (86 subjects) and is limited to the first four recurrences for each subject. One subject had no follow-up and is typically excluded from analyses. Each subject contributes exactly four observations (86 × 4 = 344 total); for subjects with fewer than four recurrences, the last observed time is carried forward as a censored record to fill the remaining rows. The data set contains the following variables:

id        : subject ID (newly generated)
id_orig   : original ID from full data set
trt       : treatment group (1 = placebo, 2 = thiotepa)
number    : number of initial tumors
size      : size of the largest initial tumor (cm)
tstart    : start of the interval (0 or previous event time)
tstop     : end of interval; recurrence or censoring time
status    : end of interval code (0 = censored, 1 = recurrence)
visit     : visit/event number

The following code generates the reduced data set from the full data set. IDs have been regenerated to match published versions of the data; id_orig corresponds to the subject ID in the full data set. The code also creates trt1–trt4, number1–number4, and size1–size4, which are visit-specific interaction variables (equal to the covariate value within the given visit and 0 otherwise) used in later analyses.

Code
# reduced data set (combination of WLW and AG styles)
bladder_reduced <- bladder1 %>%
  filter(treatment %in% c("placebo", "thiotepa"),
         enum <= 4) %>%
  mutate(id_orig = id, 
         id = dense_rank(id),
         trt = if_else(treatment == "placebo", 1, 2),
         event = if_else(status == 1, 1, 0)) %>%
  group_by(id) %>%
  complete(enum = 1:4) %>%
  arrange(id, enum) %>%
  fill(id_orig, trt, number, size, stop, .direction = "down") %>%
  mutate(tstart = as.integer(if_else(row_number() == 1, 0, lag(stop))),
         status = if_else(is.na(event), 0, event),
         visit = as.integer(enum)) %>%
  ungroup() %>%
  mutate(trt1 = trt*(visit == 1),
         trt2 = trt*(visit == 2),
         trt3 = trt*(visit == 3),
         trt4 = trt*(visit == 4),
         number1 = number*(visit == 1),
         number2 = number*(visit == 2),
         number3 = number*(visit == 3),
         number4 = number*(visit == 4),
         size1 = size*(visit == 1),
         size2 = size*(visit == 2),
         size3 = size*(visit == 3),
         size4 = size*(visit == 4)) %>%
  select(id, id_orig, trt, number, size, tstart, tstop = stop, status, visit, 
         trt1, trt2, trt3, trt4, number1, number2, number3, number4, size1, 
         size2, size3, size4)
Click to show reduced data set.

Reduced data descriptives

Number of subjects.

Code
# number of subjects
cat(length(unique(bladder_reduced$id)))
86

Subjects per treatment arm.

Code
# subject count and percent per treatment arm
ids_per_trt <- bladder_reduced %>%
  distinct(trt, id) %>%
  count(trt, name = "subj") %>%
  mutate(pct = round(100 * subj / sum(subj), 1)) %>%
  as.data.frame()

print(ids_per_trt, row.names = FALSE)
 trt subj  pct
   1   48 55.8
   2   38 44.2

Total number of recurrences.

Code
# number of recurrences (2 equivalent ways)
# cat(sum(tapply(bladder_reduced$status, bladder_reduced$id, sum)))
cat(sum(bladder_reduced$status == 1))
112

Recurrence frequencies. 54.7% of subjects had at least one recurrence.

Code
# recurrence frequencies
recur_freqs <- bladder_reduced %>%
  group_by(id) %>%
  summarize(recurrences = sum(status), .groups = "drop") %>%
  count(recurrences, name = "subj") %>%
  mutate(pct = round(100 * subj / sum(subj), 1)) %>%
  as.data.frame()

print(recur_freqs, row.names = FALSE)
 recurrences subj  pct
           0   39 45.3
           1   18 20.9
           2    7  8.1
           3    8  9.3
           4   14 16.3

Distribution of number of recurrences. The mean number of recurrences was 1.3.

Code
# distribution of number of recurrences
summary(tapply(bladder_reduced$status, bladder_reduced$id, sum), digits = 2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0     0.0     1.0     1.3     2.8     4.0 

Reduced data recurrent events plot

Code
# recurrent events plots for reduced data
plot_data <- bladder_reduced %>%
  group_by(id) %>%
  mutate(total_followup = max(tstop)) %>%
  ungroup() %>%
  mutate(id_factor = factor(id, levels =
                              unique(id[order(total_followup)])),
         trt = factor(trt, levels = c(1, 2),
                      labels = c("placebo", "thiotepa")),
         event = case_when(status == 1 ~ "Recurrence",
                           status == 0 ~ "Censored",
                           TRUE        ~ NA_character_),
         event = factor(event, levels = c("Recurrence",
                                          "Censored")))

event_plot <- ggplot(plot_data, aes(y = id_factor)) +
  geom_segment(aes(x = tstart, xend = tstop, yend = id_factor), 
               color = "grey60") + 
  geom_point(data = filter(plot_data, !is.na(event)),
             aes(x = tstop, shape = event, colour = event,
                 text = paste("ID:", id, "<br>Original ID:", id_orig, 
                              "<br>Status:", event, "<br>Time:", tstop)),
             size = 1.5, stroke = 0.5) +
  scale_shape_manual(
    values = c("Recurrence" = 16, 
               "Censored"   = 1)) +
  scale_color_manual(
    values = c("Recurrence" = "black",
               "Censored"   = "black")) +
  scale_y_discrete("Subject",
                   expand = expansion(mult = c(0.02, 0.02))) +
  scale_x_continuous("Months since enrollment") +
  facet_wrap(~trt, ncol = 1, scales = "free_y") +
  coord_cartesian(clip = "off") +
  theme_minimal(base_size = 12) +
  theme(plot.margin     = margin(15, 15, 15, 15),
        legend.position = "right",
        strip.placement = "outside",
        strip.text.y    = element_text(angle = 90),
        axis.text.y     = element_blank(),
        axis.ticks.y    = element_blank()) +
  labs(shape = "Event type", color = "Event type")

interactive_plot <- ggplotly(event_plot, tooltip = "text")

htmlwidgets::saveWidget(interactive_plot, 
                        "bladder_recurrence_reduced_event_plot.html", 
                        selfcontained = TRUE)

ggplotly(event_plot, tooltip = "text")


Click here to open an interactive plot in a new window

AG intensity model

First, the Andersen–Gill (AG) intensity model (Andersen & Gill, 1982) is applied to the data. It is a counting-process extension of the semi-parametric Cox proportional hazards model, where the hazard is generalized to an intensity function to accommodate repeated events. Time-varying covariates can be incorporated but they may confound inference about baseline effects (Mao, 2025).

For each subject \(i\), let \(N_i(t)\) be a counting process that records the cumulative number of events experienced up to time \(t\). The intensity function gives the instantaneous rate of a new event at time \(t\), given the observed event history up to but not including \(t\):

\[ \lambda_i(t \mid \mathcal{H}(t)) = \lim_{\Delta t \downarrow 0} \frac{\Pr\{N_i(t+\Delta t) - N_i(t) = 1 \mid \mathcal{H}(t)\}}{\Delta t} \tag{1}\]

In the AG formulation, the intensity is modeled in proportional hazards form, analogous to the Cox model. The risk of a new event is conditional only on covariates observed at time \(t\); event history can affect risk only through its inclusion as a covariate.

\[ \lambda_i(t \mid Z_i(t)) = Y_i(t) \, \lambda_0(t) \, \exp \{ \beta^\top Z_i(t)\} \tag{2}\]

Equation 1 corresponds to equation 2.4 and Equation 2 corresponds to equation 2.1 in Andersen and Gill (1982). \(Y_i(t)\) is an at-risk indicator (1 if subject \(i\) is under observation at time \(t\), 0 otherwise), \(\lambda_0(t)\) is an unspecified baseline intensity function, \(Z_i(t)\) is the vector of (possibly time-varying) covariates for subject \(i\), and \(\beta\) is the vector of regression coefficients.

The code to fit this model in R is below. Only baseline (time-invariant) covariates are included. The Efron method is used to handle tied event times; using the Breslow method produces results that exactly match those in the SAS PROC PHREG documentation.

Code
# fit AG intensity model
ag_intensity <- coxph(
  Surv(tstart, tstop, status) ~ trt + number + size,
  ties = "efron", # change to breslow for exact match with SAS results
  data = bladder_reduced %>% filter(tstop > tstart))

summary(ag_intensity)
Call:
coxph(formula = Surv(tstart, tstop, status) ~ trt + number + 
    size, data = bladder_reduced %>% filter(tstop > tstart), 
    ties = "efron")

  n= 178, number of events= 112 

           coef exp(coef) se(coef)      z Pr(>|z|)    
trt    -0.46469   0.62833  0.19973 -2.327 0.019989 *  
number  0.17496   1.19120  0.04707  3.717 0.000202 ***
size   -0.04366   0.95728  0.06905 -0.632 0.527196    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

       exp(coef) exp(-coef) lower .95 upper .95
trt       0.6283     1.5915    0.4248    0.9294
number    1.1912     0.8395    1.0862    1.3063
size      0.9573     1.0446    0.8361    1.0960

Concordance= 0.634  (se = 0.032 )
Likelihood ratio test= 17.52  on 3 df,   p=6e-04
Wald test            = 19.11  on 3 df,   p=3e-04
Score (logrank) test = 19.52  on 3 df,   p=2e-04

The thiotepa arm shows a statistically significant reduction in the rate of bladder cancer recurrence compared with placebo (HR = 0.63; 95% CI: 0.42–0.93; p = 0.020), corresponding to an estimated 37% lower instantaneous risk of recurrence. Each additional initial tumor is associated with a 19% increase in hazard (HR = 1.19; 95% CI: 1.09–1.31; p < 0.001). Tumor size at baseline shows no clear association with recurrence (HR = 0.96; p = 0.53).

The model’s concordance index (C = 0.63) indicates relatively poor discrimination of which patients recur earlier versus later, and the global tests all confirm that at least one predictor is significantly related to recurrence risk.
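
The hazard ratios and confidence intervals quoted above can be pulled directly from the fitted object; a minimal sketch using the ag_intensity fit:

Code
# hazard ratios with 95% confidence intervals from the AG fit
round(exp(cbind(HR = coef(ag_intensity), confint(ag_intensity))), 3)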

The AG model does not account for the within-subject dependence of recurrent events. This can result in standard errors that are underestimated (biased downward) and coefficients that are attenuated toward 0 (Allison, 2010). The following models address these two issues in different ways.

Proportional means/rates model (LWYY model)

In contrast to the AG model, which focuses on the instantaneous rate of new events, the proportional means/rates model of Lin et al. (2000) targets the marginal mean of the counting process. In their article, Lin, Wei, Yang, and Ying (LWYY) work with the mean and rate functions of the counting process used in the AG model: the mean function is the expected cumulative number of events by time \(t\), and the rate function is its derivative, the instantaneous change in the cumulative count. When covariates are time-independent, the target reduces to a marginal means model, in which the expected cumulative number of events up to time \(t\) is modeled directly. As in the Cox and AG models, covariates act multiplicatively. The model can be expressed as:

\[ \mu_i(t) = \mathbb{E}[N_i(t) \mid Z_i] = \mu_0(t) \, \exp\{\beta^\top Z_i\} \tag{3}\]

In Equation 3, \(\mathbb{E}[N_i(t) \mid Z_i]\) is the expected cumulative number of events (mean of counting process) for subject \(i\) conditional on the covariates, \(\mu_0(t)\) is an unspecified baseline mean function, \(Z_i\) is the vector of covariates for subject \(i\), and \(\beta\) is the vector of regression coefficients. Exponentiated coefficients are interpreted as ratios of mean cumulative functions (often called “rate ratios”), which differ conceptually from the instantaneous hazard (intensity) ratios of the AG model.

The R code to fit the model is shown below. It is identical to the AG model code, except for the addition of the cluster(id) term to account for within-subject correlation. The Efron method is used to handle tied event times; using the Breslow method produces results that exactly match those in the SAS PROC PHREG documentation.

Code
# Lin et al (2000) proportional means model
lwyy_prop_means <- coxph(
  Surv(tstart, tstop, status) ~ trt + number + size + cluster(id),
  ties = "efron", # change to breslow for exact match with SAS results
  data = bladder_reduced %>% filter(tstop > tstart))

summary(lwyy_prop_means)
Call:
coxph(formula = Surv(tstart, tstop, status) ~ trt + number + 
    size, data = bladder_reduced %>% filter(tstop > tstart), 
    ties = "efron", cluster = id)

  n= 178, number of events= 112 

           coef exp(coef) se(coef) robust se      z Pr(>|z|)   
trt    -0.46469   0.62833  0.19973   0.26556 -1.750  0.08015 . 
number  0.17496   1.19120  0.04707   0.06304  2.775  0.00551 **
size   -0.04366   0.95728  0.06905   0.07762 -0.563  0.57376   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

       exp(coef) exp(-coef) lower .95 upper .95
trt       0.6283     1.5915    0.3734     1.057
number    1.1912     0.8395    1.0527     1.348
size      0.9573     1.0446    0.8222     1.115

Concordance= 0.634  (se = 0.032 )
Likelihood ratio test= 17.52  on 3 df,   p=6e-04
Wald test            = 11.54  on 3 df,   p=0.009
Score (logrank) test = 19.52  on 3 df,   p=2e-04,   Robust = 11.27  p=0.01

  (Note: the likelihood ratio and score tests assume independence of
     observations within a cluster, the Wald and robust score tests do not).

The proportional means/rates model yields parameter estimates identical to those of the AG model, since both rely on the same partial likelihood and score equations. An estimated coefficient of −0.46 corresponds to a ratio of mean cumulative counts (rate ratio) of 0.63, indicating that, at any given time \(t\), the treatment arm is expected to have about 37% fewer recurrences than the control arm.
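
That the two fits share the same point estimates can be verified directly from the model objects; a quick check, assuming both fits above are in the workspace:

Code
# the AG and LWYY models share the same partial-likelihood point estimates
all.equal(coef(ag_intensity), coef(lwyy_prop_means))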

The LWYY model uses the robust (sandwich) estimator (Lin & Wei, 1989) to adjust for within-subject correlation among repeated events. While the Andersen–Gill model can also be paired with a robust (sandwich) variance estimator to account for within-subject correlation, this is not its default formulation. With this adjustment, standard errors increase. Consequently, the treatment effect is no longer statistically significant (p = 0.08).

More on the robust variance (sandwich) estimator: the sandwich estimator calculates standard errors that are robust to violations of certain model assumptions, such as correlation among repeated measurements. It does not require specifying the form of dependence and is often referred to as a population-averaged method, because it provides inference that reflects the average effect across the population rather than subject-specific effects. The method uses model residuals to compute a variance-like term that is “sandwiched” between the usual inverse information matrices, producing a robust variance estimate.
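
As a concrete illustration, the model-based and sandwich standard errors can be compared side by side from the fitted object; a sketch using the lwyy_prop_means fit and the naive covariance that coxph stores when cluster() is used:

Code
# naive (model-based) vs robust (sandwich) standard errors from the LWYY fit;
# coxph keeps the model-based covariance in $naive.var when cluster() is used
se_compare <- cbind(
  naive  = sqrt(diag(lwyy_prop_means$naive.var)),
  robust = sqrt(diag(vcov(lwyy_prop_means))))
rownames(se_compare) <- names(coef(lwyy_prop_means))
round(se_compare, 4)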

In effect, the proportional means/rates model corrects the underestimation of standard errors seen in the AG model, though it does not correct the potential attenuation of coefficients caused by unobserved heterogeneity.

Second data transformation

To fit some additional types of models, the data need to be transformed again. In the transformed data, each risk set is restricted to patients who have experienced the corresponding number of prior events; for example, a patient with only two events does not enter the risk sets for the third or fourth recurrence. A new variable, gap_time, is created to represent the time between successive recurrences. trt1–trt4, number1–number4, and size1–size4 are the corresponding visit-interaction variables used in the following analyses.

Code
# second data transformation, style used by PWP (R documentation refers to this 
# as AG style) see https://rdrr.io/cran/survival/man/bladder.html
bladder2_reduced <- bladder_reduced %>%
  arrange(id, tstart, tstop) %>%
  group_by(id) %>%
  mutate(last_status = lag(status, default = 1)) %>%
  filter(!(status == 0 & last_status == 0)) %>%
  mutate(gap_time = tstop - tstart) %>%
  ungroup() %>%
  filter(gap_time != 0) %>%
  mutate(trt1 = trt*(visit == 1),
         trt2 = trt*(visit == 2),
         trt3 = trt*(visit == 3),
         trt4 = trt*(visit == 4),
         number1 = number*(visit == 1),
         number2 = number*(visit == 2),
         number3 = number*(visit == 3),
         number4 = number*(visit == 4),
         size1 = size*(visit == 1),
         size2 = size*(visit == 2),
         size3 = size*(visit == 3),
         size4 = size*(visit == 4)) %>%
  select(id, id_orig, trt, number, size, tstart, tstop, status, visit, gap_time,
         trt1, trt2, trt3, trt4, number1, number2, number3, number4, size1, 
         size2, size3, size4)
Click to show the transformed data set.

The number of events and censored values are summarized below.

Code
summary_tab <- with(bladder2_reduced, table(visit, status)) |>
  as.data.frame() |>
  pivot_wider(names_from = status, values_from = Freq, values_fill = 0) |>
  rename(censored = `0`, event = `1`) |>
  mutate(total = censored + event,
         pct_censored = round(100 * censored / total, 1))

summary_tab <- bind_rows(
  summary_tab,
  summarise(summary_tab,
            visit    = "Total",
            censored = sum(censored),
            event    = sum(event),
            total    = sum(total),
            pct_censored = round(100 * censored / total, 1))) %>%
  as.data.frame()

print(summary_tab, row.names = FALSE)
 visit censored event total pct_censored
     1       38    47    85         44.7
     2       17    29    46         37.0
     3        5    22    27         18.5
     4        6    14    20         30.0
 Total       66   112   178         37.1

PWP total time model

One way to correct for potential bias in coefficient estimation is through stratification. The following models are the total time (TT) models of Prentice, Williams, and Peterson (1981), which are stratified Cox models measuring time since study enrollment (common time origin = 0). Stratification allows each event order to have its own baseline hazard and, by conditioning on prior events, corrects potential bias in coefficient estimates. To obtain valid standard errors in the presence of within-subject correlation, robust variance estimators clustered by subject would typically be used; however, these are not applied in the PWP models that follow in order to align with published results.

The TT model estimates hazard functions for the 1st, 2nd, 3rd, and 4th events separately, while respecting their order. The strata() function instructs the model to estimate a different baseline hazard function for each event number \(k\). This avoids forcing a common baseline hazard across event orders, allowing the underlying risk processes for the four visits to differ freely. The model can be expressed as:

\[ \lambda_k(t \mid Z) = \lambda_{0k}(t) \, \exp\{\beta^\top Z\} \tag{4}\]

Here, each stratum is defined by the event number \(k\) (1st event, 2nd event, etc.). The risk set for the \(k\)-th event includes only those subjects who have experienced exactly \(k - 1\) prior events and are still at risk. The hazard is therefore the risk of the \(k\)-th event at time \(t\) since baseline, conditional on having survived through \(k - 1\) events.

Uncommon effects

The visit-specific variables (trt1–trt4, number1–number4, size1–size4) are used to implement an uncommon effects version of the TT model. This allows treatment, number, and size to have distinct coefficients in each stratum. Covariate effects are not assumed constant across event orders, but may vary from the 1st to the 4th visit.

This is effectively equivalent to fitting four separate Cox models, one per visit number, each with its own baseline hazard. The difference is that they are estimated jointly within the stratified Cox framework, sharing the same likelihood and variance estimation. Subjects who have not yet reached a given visit are excluded from the corresponding stratum, and subjects lost to follow-up contribute to no further risk sets after censoring.

Code
# PWP (1981) total time model with uncommon effects
pwp_tt_ue <- coxph(
  Surv(tstart, tstop, status) ~ trt1 + trt2 + trt3 + trt4 + 
    number1 + number2 + number3 + number4 + size1 + size2 + size3 + size4 + 
    strata(visit), # or, equivalently: (trt + number + size) * strata(visit)
  ties = "efron", # replace with "breslow" for exact match with SAS
  data = bladder2_reduced)

summary(pwp_tt_ue)
Call:
coxph(formula = Surv(tstart, tstop, status) ~ trt1 + trt2 + trt3 + 
    trt4 + number1 + number2 + number3 + number4 + size1 + size2 + 
    size3 + size4 + strata(visit), data = bladder2_reduced, ties = "efron")

  n= 178, number of events= 112 

             coef exp(coef)  se(coef)      z Pr(>|z|)   
trt1    -0.525984  0.590973  0.315826 -1.665   0.0958 . 
trt2    -0.503837  0.604208  0.406167 -1.240   0.2148   
trt3     0.140657  1.151029  0.673063  0.209   0.8345   
trt4     0.050331  1.051619  0.791710  0.064   0.9493   
number1  0.238180  1.268937  0.075885  3.139   0.0017 **
number2 -0.024641  0.975660  0.089873 -0.274   0.7840   
number3  0.049661  1.050915  0.185323  0.268   0.7887   
number4  0.204277  1.226637  0.242040  0.844   0.3987   
size1    0.069613  1.072094  0.101559  0.685   0.4931   
size2   -0.160716  0.851534  0.122467 -1.312   0.1894   
size3    0.168099  1.183053  0.269040  0.625   0.5321   
size4    0.009095  1.009137  0.338928  0.027   0.9786   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
trt1       0.5910     1.6921    0.3182     1.097
trt2       0.6042     1.6551    0.2726     1.339
trt3       1.1510     0.8688    0.3077     4.305
trt4       1.0516     0.9509    0.2228     4.963
number1    1.2689     0.7881    1.0936     1.472
number2    0.9757     1.0249    0.8181     1.164
number3    1.0509     0.9516    0.7308     1.511
number4    1.2266     0.8152    0.7633     1.971
size1      1.0721     0.9328    0.8786     1.308
size2      0.8515     1.1744    0.6698     1.083
size3      1.1831     0.8453    0.6982     2.005
size4      1.0091     0.9909    0.5193     1.961

Concordance= 0.629  (se = 0.037 )
Likelihood ratio test= 14.5  on 12 df,   p=0.3
Wald test            = 15.02  on 12 df,   p=0.2
Score (logrank) test = 15.73  on 12 df,   p=0.2

Results show no significant treatment effect on the total time to any of the four tumor recurrences.

Common effects

The common effects TT model is fit next. Like the uncommon effects model, it stratifies by event order so that each event has its own baseline hazard. However, covariate effects are now constrained to be the same across all strata rather than varying by visit.

This produces a single estimate of each covariate effect, which is interpreted as a shared effect across all recurrences, conditional on prior history. This model is more parsimonious than the uncommon effects version and may be preferred when there is no strong evidence that effects differ by event order.

Code
# PWP (1981) total time model with common effects
pwp_tt_ce <- coxph(
  Surv(tstart, tstop, status) ~ trt + number + size + strata(visit),
  ties = "efron",
  data = bladder2_reduced)

summary(pwp_tt_ce)
Call:
coxph(formula = Surv(tstart, tstop, status) ~ trt + number + 
    size + strata(visit), data = bladder2_reduced, ties = "efron")

  n= 178, number of events= 112 

            coef exp(coef)  se(coef)      z Pr(>|z|)  
trt    -0.333489  0.716420  0.216168 -1.543   0.1229  
number  0.119617  1.127065  0.053338  2.243   0.0249 *
size   -0.008495  0.991541  0.072762 -0.117   0.9071  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

       exp(coef) exp(-coef) lower .95 upper .95
trt       0.7164     1.3958    0.4690     1.094
number    1.1271     0.8873    1.0152     1.251
size      0.9915     1.0085    0.8598     1.144

Concordance= 0.616  (se = 0.038 )
Likelihood ratio test= 6.51  on 3 df,   p=0.09
Wald test            = 6.85  on 3 df,   p=0.08
Score (logrank) test = 6.91  on 3 df,   p=0.07

Results from the common effects total time model show no significant treatment effect on the hazard of tumor recurrence, when assuming a single treatment effect applies across all four events.
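
Whether the more parsimonious common effects specification is adequate can be assessed with a likelihood-ratio comparison of the two nested fits. This check is not part of the SAS-aligned analysis above; a sketch:

Code
# likelihood-ratio comparison: common vs. uncommon effects PWP total time models
# (the common effects model is nested within the uncommon effects model)
anova(pwp_tt_ce, pwp_tt_ue)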

PWP gap-time model

The following models are the gap-time (PWP-GT) models of Prentice, Williams, and Peterson (1981). These are stratified Cox models where the time scale resets after each event. The model can be expressed as:

\[ \lambda_k(u \mid Z) = \lambda_{0k}(u) \, \exp\{\beta^\top Z\} \tag{5}\]

where \(u = t - T_{k-1}\) is the gap time since the previous event. Each stratum is defined by the event number \(k\). The risk set for the \(k\)-th event includes only those individuals who have experienced exactly \(k - 1\) prior events and are still at risk of another event. The interpretation is the risk of the \(k\)-th event as a function of time since previous event, conditional on prior event history.

Uncommon effects

The visit-specific variables (trt1–trt4, number1–number4, size1–size4) are used in the uncommon effects model below, allowing covariate effects to differ across event orders.

Code
# PWP (1981) gap-time model with uncommon effects
pwp_gt_ue <- coxph(
  Surv(gap_time, status) ~ trt1 + trt2 + trt3 + trt4 + 
    number1 + number2 + number3 + number4 + 
    size1 + size2 + size3 + size4 + strata(visit),
  ties = "efron", # replace with "breslow" for exact match with SAS
  data = bladder2_reduced)

summary(pwp_gt_ue)
Call:
coxph(formula = Surv(gap_time, status) ~ trt1 + trt2 + trt3 + 
    trt4 + number1 + number2 + number3 + number4 + size1 + size2 + 
    size3 + size4 + strata(visit), data = bladder2_reduced, ties = "efron")

  n= 178, number of events= 112 

             coef exp(coef)  se(coef)      z Pr(>|z|)   
trt1    -0.525984  0.590973  0.315826 -1.665   0.0958 . 
trt2    -0.270967  0.762642  0.404971 -0.669   0.5034   
trt3     0.210287  1.234032  0.549969  0.382   0.7022   
trt4    -0.220489  0.802126  0.639159 -0.345   0.7301   
number1  0.238180  1.268937  0.075885  3.139   0.0017 **
number2 -0.006414  0.993607  0.096354 -0.067   0.9469   
number3  0.142074  1.152662  0.161533  0.880   0.3791   
number4  0.475472  1.608773  0.202895  2.343   0.0191 * 
size1    0.069613  1.072094  0.101559  0.685   0.4931   
size2   -0.119269  0.887569  0.119103 -1.001   0.3166   
size3    0.278266  1.320838  0.233094  1.194   0.2326   
size4    0.042913  1.043847  0.289676  0.148   0.8822   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
trt1       0.5910     1.6921    0.3182     1.097
trt2       0.7626     1.3112    0.3448     1.687
trt3       1.2340     0.8104    0.4199     3.626
trt4       0.8021     1.2467    0.2292     2.807
number1    1.2689     0.7881    1.0936     1.472
number2    0.9936     1.0064    0.8226     1.200
number3    1.1527     0.8676    0.8399     1.582
number4    1.6088     0.6216    1.0809     2.394
size1      1.0721     0.9328    0.8786     1.308
size2      0.8876     1.1267    0.7028     1.121
size3      1.3208     0.7571    0.8364     2.086
size4      1.0438     0.9580    0.5916     1.842

Concordance= 0.613  (se = 0.034 )
Likelihood ratio test= 19.54  on 12 df,   p=0.08
Wald test            = 20.26  on 12 df,   p=0.06
Score (logrank) test = 21.68  on 12 df,   p=0.04

Results show no significant treatment effect on the gap time to any of the four tumor recurrences. The estimated coefficients for the first recurrence in the gap-time model match those of the total time model, since the two time scales coincide at the initial event.
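
This agreement at the first event can be confirmed directly from the two fits; a quick check:

Code
# first-event coefficients coincide across the total time and gap-time scales
first_event <- c("trt1", "number1", "size1")
round(cbind(total_time = coef(pwp_tt_ue)[first_event],
            gap_time   = coef(pwp_gt_ue)[first_event]), 5)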

Common effects

The common effects PWP-GT model is the more parsimonious alternative. Like the uncommon effects model, it stratifies by event order so that each recurrence has its own baseline hazard, and the time scale resets after each event.

Code
# PWP (1981) gap-time model with common effects
pwp_gt_ce <- coxph(
  Surv(gap_time, status) ~ trt + number + size + strata(visit),
  ties = "efron",
  data = bladder2_reduced)

summary(pwp_gt_ce)
Call:
coxph(formula = Surv(gap_time, status) ~ trt + number + size + 
    strata(visit), data = bladder2_reduced, ties = "efron")

  n= 178, number of events= 112 

            coef exp(coef)  se(coef)      z Pr(>|z|)   
trt    -0.279005  0.756536  0.207348 -1.346  0.17844   
number  0.158046  1.171220  0.051942  3.043  0.00234 **
size    0.007415  1.007443  0.070023  0.106  0.91567   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

       exp(coef) exp(-coef) lower .95 upper .95
trt       0.7565     1.3218    0.5039     1.136
number    1.1712     0.8538    1.0579     1.297
size      1.0074     0.9926    0.8782     1.156

Concordance= 0.596  (se = 0.035 )
Likelihood ratio test= 9.33  on 3 df,   p=0.03
Wald test            = 10.11  on 3 df,   p=0.02
Score (logrank) test = 10.27  on 3 df,   p=0.02

The results show there is no significant treatment effect on the hazard of tumor recurrence when time is measured since the previous event. Because covariate effects are constrained to be the same across event orders, this estimate reflects a shared treatment effect across all four recurrences, rather than event-specific effects.

In both the PWP-TT and PWP-GT models, stratification ensures each event order has its own baseline hazard. Coefficients are estimated jointly and, in the common effects version, constrained to be the same across strata. Both are conditional models: only individuals eligible for the \(k\)-th event contribute to estimation in the \(k\)-th stratum.

Wei-Lin-Weissfeld model

The Wei-Lin-Weissfeld (WLW) model (Wei et al., 1989) is a marginal, multivariate approach for analyzing recurrent events. Unlike the PWP models, it does not condition on previous events and may be thought of as estimating the population-level hazard for the \(k\)-th event. Each event order is treated as a separate stratum, and coefficients may differ across events.

The hazard function can be expressed similarly to the PWP model:

\[ \lambda_k(t \mid Z) = \lambda_{0k}(t) \, \exp\{\beta^\top Z\} \tag{6}\]

Here, \(\lambda_{0k}(t)\) is the baseline hazard for the \(k\)-th event. The key distinction from the PWP model expressed in Equation 4 is the risk set definition: in the WLW model, all subjects are considered at risk for the \(k\)-th event from baseline (\(t = 0\)), regardless of whether they have experienced prior events. The occurrence and order of previous events are ignored.

Because repeated events within subjects are correlated, the WLW model uses robust (sandwich) variance estimators to obtain valid standard errors. In practice, coefficients are first estimated separately for each event order (as in the PWP uncommon effects model), and then weighted averages of the coefficients may be calculated using the robust covariance matrix to produce efficient standard error estimates and allow for joint hypothesis tests.

In R, this distinction from PWP is reflected in the code syntax: the WLW model uses the marginal (stop, status) format, rather than the counting-process (start, stop, status) or gap-time formats used by the conditional PWP models. The time origin is study enrollment for all subjects, and the software constructs the risk sets for each event accordingly. To fit this model, we switch back to the first version of the reduced data set.

Code
# WLW (1989) marginal model
wlw <- coxph(
  Surv(tstop, status) ~ trt1 + trt2 + trt3 + trt4 + 
    number1 + number2 + number3 + number4 + 
    size1 + size2 + size3 + size4 + strata(visit) + cluster(id),
  ties = "efron", # replace with "breslow" to match SAS results
  data = bladder_reduced,
  x = TRUE)

summary(wlw)
Call:
coxph(formula = Surv(tstop, status) ~ trt1 + trt2 + trt3 + trt4 + 
    number1 + number2 + number3 + number4 + size1 + size2 + size3 + 
    size4 + strata(visit), data = bladder_reduced, ties = "efron", 
    x = TRUE, cluster = id)

  n= 344, number of events= 112 

            coef exp(coef) se(coef) robust se      z Pr(>|z|)   
trt1    -0.52598   0.59097  0.31583   0.31524 -1.669  0.09521 . 
trt2    -0.63231   0.53136  0.39308   0.36831 -1.717  0.08602 . 
trt3    -0.69849   0.49733  0.45978   0.42038 -1.662  0.09660 . 
trt4    -0.63544   0.52970  0.57645   0.49729 -1.278  0.20131   
number1  0.23818   1.26894  0.07588   0.07459  3.193  0.00141 **
number2  0.13693   1.14674  0.09171   0.08774  1.561  0.11862   
number3  0.17351   1.18947  0.10463   0.10564  1.642  0.10051   
number4  0.33247   1.39441  0.12467   0.11557  2.877  0.00402 **
size1    0.06961   1.07209  0.10156   0.08863  0.785  0.43220   
size2   -0.07775   0.92519  0.13386   0.11884 -0.654  0.51295   
size3   -0.21384   0.80748  0.18260   0.17358 -1.232  0.21799   
size4   -0.20646   0.81346  0.23087   0.19380 -1.065  0.28674   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
trt1       0.5910     1.6921    0.3186     1.096
trt2       0.5314     1.8820    0.2582     1.094
trt3       0.4973     2.0107    0.2182     1.134
trt4       0.5297     1.8879    0.1999     1.404
number1    1.2689     0.7881    1.0964     1.469
number2    1.1467     0.8720    0.9656     1.362
number3    1.1895     0.8407    0.9670     1.463
number4    1.3944     0.7171    1.1118     1.749
size1      1.0721     0.9328    0.9011     1.275
size2      0.9252     1.0809    0.7330     1.168
size3      0.8075     1.2384    0.5746     1.135
size4      0.8135     1.2293    0.5564     1.189

Concordance= 0.653  (se = 0.035 )
Likelihood ratio test= 30.09  on 12 df,   p=0.003
Wald test            = 34.73  on 12 df,   p=5e-04
Score (logrank) test = 34  on 12 df,   p=7e-04,   Robust = 17.61  p=0.1

  (Note: the likelihood ratio and score tests assume independence of
     observations within a cluster, the Wald and robust score tests do not).

Unlike the PWP models, which condition on prior events, the WLW approach models each recurrence marginally. Inference is based on combining the stratum-specific marginal Cox models into joint tests, using the joint distribution of the estimates and their robust covariance matrix. After fitting the WLW model, we can construct global and average tests of the treatment effect. This allows us to test whether the treatment has any effect across all recurrences and, if desired, to estimate a single “common” treatment effect using optimal weights.

First, a global 4 degrees of freedom Wald chi-square test is obtained by stacking the treatment coefficients (\(\hat{\beta}_1\), \(\hat{\beta}_2\), \(\hat{\beta}_3\), \(\hat{\beta}_4\)) and using their robust covariance matrix. “Stacking” here just means treating the four recurrence-specific treatment estimates as a single parameter vector \((\hat{\beta}_1, \dots, \hat{\beta}_4)^\top\) and testing them jointly.

Code
# extract trt variance-covariance and parameter estimates
vcov_mat <- vcov(wlw)[c("trt1", "trt2", "trt3", "trt4"),
                      c("trt1", "trt2", "trt3", "trt4")]
beta_hat <- coef(wlw)[c("trt1", "trt2", "trt3", "trt4")]

# construct global chi-square test
chisq_4df <- as.numeric(t(beta_hat) %*% solve(vcov_mat, beta_hat))
pval_chisq_4df <- pchisq(chisq_4df, df = length(beta_hat), lower.tail = FALSE)

result <- data.frame(
  chi_sq = round(chisq_4df, 2),
  df = length(beta_hat),
  p_value = signif(pval_chisq_4df, 4))

print(result, row.names = FALSE)
 chi_sq df p_value
   3.91  4  0.4189

This test evaluates the null hypothesis of no treatment effect for any recurrence. In this case, there is insufficient evidence of a treatment effect (\(\chi^2\) = 3.91, df = 4, p = 0.4189).

Next, we can estimate a common treatment effect by taking an optimally weighted average of the four recurrence-specific coefficients. The optimal weights minimize the variance of the linear combination and are derived from the robust covariance matrix:

Code
# get optimal weights by minimizing the variance of the linear combination
one <- rep(1, length(beta_hat))
w <- solve(vcov_mat, one)

# optimal weights
w <- w / sum(w)   # normalize to sum to 1
print(round(w, 5))
    trt1     trt2     trt3     trt4 
 0.66167  0.28021 -0.08544  0.14356 

These weights (which may be positive or negative) reflect how much information each stratum contributes to the overall estimate. A negative weight does not mean the event reduces information; it simply reflects how the covariance structure influences the optimal linear combination.

Finally, the average treatment effect is calculated as the weighted sum of the four estimates, with its standard error derived from the robust covariance matrix via generalized least squares weighting.

Code
# average effect
beta_common <- sum(w * beta_hat)

# variance of average effect
var_common <- as.numeric(1 / (t(one) %*% solve(vcov_mat, one)))
se_common <- sqrt(var_common)

# z-score
z <- beta_common / se_common

# p-value
pval_z <- 2 * pnorm(-abs(z))

result <- data.frame(
  estimate = format(round(beta_common, 4), width = 9),
  std_err  = format(round(se_common, 4), width = 8),
  z_score  = format(round(z, 4), width = 8),
  p_value  = format(signif(pval_z, 4), width = 8))

print(result, row.names = FALSE)
  estimate  std_err  z_score  p_value
   -0.5568   0.2907  -1.9149   0.0555

This produces a single parameter estimate for treatment across all recurrences. The associated Wald test (1 df) is sometimes more powerful than the global chi-square test. With these data, the estimated common effect is –0.56 with SE = 0.29, but the evidence remains borderline (p = 0.0555), consistent with SAS output (change method for handling ties to “breslow” for exact match).
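
On the hazard-ratio scale, the common effect of −0.56 corresponds to roughly exp(−0.56) ≈ 0.57. A short sketch converting the estimate and a Wald-type 95% CI, using the beta_common and se_common objects computed above:

Code
# common treatment effect expressed as a hazard ratio with a 95% Wald CI
round(exp(c(HR    = beta_common,
            lower = beta_common - 1.96 * se_common,
            upper = beta_common + 1.96 * se_common)), 3)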

Cox PH frailty model

The methods above attempt to correct biased estimates and standard errors in different ways. Looking at the data from a different perspective, one can consider that recurrence patterns may differ across subjects, giving rise to unobserved heterogeneity, a feature that earlier approaches did not explicitly model. A frailty extension of the AG model allows us to account for this heterogeneity more directly by including a random effect that reflects individual-level variation in recurrence rates. This frailty induces dependence among repeated events and provides adjusted estimates and standard errors that explicitly incorporate variability at the subject level. The frailty model may be written as:

\[ \lambda_i(t \mid u_i, Z_i(t)) = Y_i(t) \, u_i \, \lambda_0(t) \, \exp \{ \beta^\top Z_i(t)\} \tag{7}\]

In Equation 7, \(u_i\) is a subject-specific frailty term, often modeled as a gamma-distributed random effect with mean 1 and variance \(1/\gamma\).

The following R code implements this frailty model.

Code
# AG frailty model
ag_frailty <- coxph(
  Surv(tstart, tstop, status) ~ trt + number + size + frailty(id),
  ties = "efron",
  data = bladder_reduced %>% filter(tstop > tstart))

summary(ag_frailty)
Call:
coxph(formula = Surv(tstart, tstop, status) ~ trt + number + 
    size + frailty(id), data = bladder_reduced %>% filter(tstop > 
    tstart), ties = "efron")

  n= 178, number of events= 112 

            coef     se(coef) se2     Chisq DF    p      
trt         -0.60771 0.33305  0.21967  3.33  1.00 0.06800
number       0.23874 0.09324  0.05702  6.56  1.00 0.01000
size        -0.02155 0.11400  0.07167  0.04  1.00 0.85000
frailty(id)                           82.92 45.36 0.00056

       exp(coef) exp(-coef) lower .95 upper .95
trt       0.5446     1.8362    0.2835     1.046
number    1.2697     0.7876    1.0576     1.524
size      0.9787     1.0218    0.7827     1.224

Iterations: 6 outer, 29 Newton-Raphson
     Variance of random effect= 1.077623   I-likelihood = -436.8 
Degrees of freedom for terms=  0.4  0.4  0.4 45.4 
Concordance= 0.833  (se = 0.021 )
Likelihood ratio test= 143.6  on 46.56 df,   p=8e-12

Compared with the AG model, the standard errors in the frailty model are larger and the parameter estimates are no longer attenuated. In the AG model, events within the same subject are treated as conditionally independent given the covariates. By contrast, the frailty term induces within-subject correlation, explicitly accounting for dependence among repeated events.
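
A side-by-side look at the coefficient estimates makes this comparison concrete; a sketch using the two fits from above:

Code
# coefficient estimates: AG intensity model vs. AG gamma frailty model
round(cbind(AG      = coef(ag_intensity),
            frailty = coef(ag_frailty)), 3)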

Acknowledgements

Paul Allison’s Survival Analysis Using SAS (Allison, 2010) and the course website for BMI/STAT 741: Applied Survival Analysis (Spring 2025) (Mao, 2025), developed by Prof. Lu Mao (University of Wisconsin–Madison), were consulted to inform aspects of my analyses.

ChatGPT (OpenAI, 2025), Google, Cross Validated, etc. were used for coding assistance, editing, and checking some statistical concepts. All final decisions and interpretations are my own.

References

Allison, P. D. (2010). Survival analysis using SAS: A practical guide (2nd ed.). SAS Institute Inc.
Andersen, P. K., & Gill, R. D. (1982). Cox’s regression model for counting processes: A large sample study. The Annals of Statistics, 10(4), 1100–1120. http://www.jstor.org/stable/2240714
Byar, D., & Blackard, C. (1977). Comparisons of placebo, pyridoxine, and topical thiotepa in preventing recurrence of stage I bladder cancer. Urology, 10, 556–561. https://doi.org/10.1016/0090-4295(77)90101-7
Lin, D. Y., & Wei, L. J. (1989). The robust inference for the Cox proportional hazards model. Journal of the American Statistical Association, 84(408), 1074–1078. https://doi.org/10.1080/01621459.1989.10478874
Lin, D. Y., Wei, L. J., Yang, I., & Ying, Z. (2000). Semiparametric regression for the mean and rate functions of recurrent events. Journal of the Royal Statistical Society Series B: Statistical Methodology, 62(4), 711–730. https://doi.org/10.1111/1467-9868.00259
Mao, L. (2025). BMI/STAT 741: Applied survival analysis. Course website, University of Wisconsin–Madison. https://lmaowisc.github.io/BMI741/
OpenAI. (2025). GPT (ChatGPT). https://chat.openai.com/.
Prentice, R. L., Williams, B. J., & Peterson, A. V. (1981). On the regression analysis of multivariate failure time data. Biometrika, 68(2), 373–379. http://www.jstor.org/stable/2335582
Wei, L. J., Lin, D. Y., & Weissfeld, L. (1989). Regression analysis of multivariate incomplete failure time data by modeling marginal distributions. Journal of the American Statistical Association, 84(408), 1065–1073. http://www.jstor.org/stable/2290084