Submission-Grade Statistical Programming Example (SAS + R)
ADaM ADSL Build + Independent QC + Reproducible Run Artifacts
Jonathan D. Stallings, PhD, MS
December 24, 2025
- 1 Purpose and scope
- 2 Project layout (recommended)
- 3 Example dataset logic (ADSL)
- 4 Section A — Run Driver (controlled execution)
- 5 Section B — Build ADSL (derivation program)
- 6 Section C — Independent QC (dual programming pattern)
- 7 Section D — “Submission-grade” completeness checklist (what you now have)
1 Purpose and scope
This document provides a submission-grade programming example demonstrating how regulated intent becomes executable code. It includes:
- ADaM ADSL build from SDTM-like inputs (DM, EX)
- Independent QC build and comparison
- Run artifacts supporting audit readiness: manifest, checks, and deterministic outputs
- SAS and R implementations with aligned logic and best-practice commentary
Important scope note
This example focuses on: (1) dataset derivation rigor, (2) audit-friendly execution, and (3) QC independence. In real submission packages, additional study-specific artifacts exist (SAP, dataset specs, Define-XML, SDRG, controlled terminology libraries). Those items are intentionally not fully generated here to keep the example readable and website-friendly.
2 Project layout
A clean repo layout supports traceability and controlled execution.
project/
programs/
sas/
00_run_driver.sas
01_build_adsl.sas
02_qc_adsl.sas
r/
00_run_driver.R
01_build_adsl.R
02_qc_adsl.R
data/
sdtm/ (read-only in regulated runs)
adam/ (write outputs here)
outputs/
logs/
manifests/
qc/
docs/
(optional) specs, shells, notes
Industry best practice alignment
- Separation of concerns: one program per dataset (build), plus an independent QC program.
- Controlled I/O: explicit, parameterized paths; read-only inputs; deterministic outputs.
- Audit readiness: run manifests, logs, checksums, and QC reports produced every run.
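The checksum artifact mentioned above is straightforward to add. A minimal R sketch, assuming the path conventions of the run driver; the file names and columns are illustrative, not a required manifest schema.
# Record input checksums so a reviewer can confirm the exact files used in a run.
# tools::md5sum() ships with base R; sdtm_files and the output path are assumptions.
sdtm_files <- c(dm = "data/sdtm/dm.csv", ex = "data/sdtm/ex.csv")
checksums <- data.frame(
  file = unname(sdtm_files),
  md5  = unname(tools::md5sum(sdtm_files)),
  stringsAsFactors = FALSE
)
write.csv(checksums, "outputs/manifests/input_checksums.csv", row.names = FALSE)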
3 Example dataset logic (ADSL)
This example derives a minimal but realistic ADSL:
- Keys: STUDYID, USUBJID
- Dates: RANDDT, TRTSDT, TRTEDT
- Treatment: TRT01P, TRT01A
- Population: SAFFL (any exposure)
Assumptions (declared in code):
- SDTM-like inputs have ISO 8601 date strings (YYYY-MM-DD...); see the date-parsing sketch below
- One primary exposure treatment per subject (simplified)
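A minimal R sketch of the date assumption, using an illustrative iso_to_date_strict() helper (not part of the programs below): with an explicit format, a partial date parses to NA rather than silently shifting, so downstream checks can catch it.
# Strict ISO 8601 parsing: complete dates parse; partial dates become NA
iso_to_date_strict <- function(x) as.Date(substr(x, 1, 10), format = "%Y-%m-%d")
iso_to_date_strict(c("2024-03-15T08:30", "2024-03", NA))
# returns 2024-03-15, NA, NA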
4 Section A — Run Driver (controlled execution)
4.1 Why this matters (best practice)
A “submission-grade” pipeline has a driver that:
- centralizes parameters (study ID, paths, run ID)
- enforces input existence and write locations
- produces a run manifest (what ran, when, with what inputs)
- stops on failure (no silent partial runs)
4.2 SAS vs R
/*=============================================================================
Program: 00_run_driver.sas
Purpose: Controlled execution driver (submission-grade pattern).
Author: Jonathan D. Stallings, PhD, MS
Notes: Centralizes parameters, sets paths, runs build + QC, writes manifest.
=============================================================================*/
options nodate nonumber mprint mlogic symbolgen validvarname=upcase missing=' ';
%let STUDYID = STUDY-XYZ;
%let ROOT = /path/to/project;
%let SDTM_DIR = &ROOT/data/sdtm;
%let ADAM_DIR = &ROOT/data/adam;
%let OUT_DIR = &ROOT/outputs;
%let LOG_DIR = &OUT_DIR/logs;
%let MAN_DIR = &OUT_DIR/manifests;
%let QC_DIR = &OUT_DIR/qc;
%let RUN_DTTM = %sysfunc(datetime(), e8601dt.);
%let RUN_ID = %sysfunc(compress(&RUN_DTTM, :-T));
/* Ensure output folders exist (DLCREATEDIR creates a directory when a libname points to it) */
options dlcreatedir;
libname _mk0 "&OUT_DIR";  libname _mk0 clear;
libname _mk1 "&LOG_DIR";  libname _mk1 clear;
libname _mk2 "&MAN_DIR";  libname _mk2 clear;
libname _mk3 "&QC_DIR";   libname _mk3 clear;
options nodlcreatedir;
libname sdtm "&SDTM_DIR";
libname adam "&ADAM_DIR";
libname out "&OUT_DIR";
libname qc "&QC_DIR";
%macro assert_exist(ds);
%if not %sysfunc(exist(&ds)) %then %do;
%put ERROR: Missing required dataset: &ds;
%abort cancel;
%end;
%mend;
%assert_exist(sdtm.dm);
%assert_exist(sdtm.ex);
/* Run build then QC */
%include "&ROOT/programs/sas/01_build_adsl.sas";
%include "&ROOT/programs/sas/02_qc_adsl.sas";
/* Minimal run manifest */
data out.run_manifest_sas;
length RUN_ID $40 RUN_DTTM $30 STUDYID $40 SDTM_DIR ADAM_DIR OUT_DIR $200;
RUN_ID="&RUN_ID";
RUN_DTTM="&RUN_DTTM";
STUDYID="&STUDYID";
SDTM_DIR="&SDTM_DIR";
ADAM_DIR="&ADAM_DIR";
OUT_DIR="&OUT_DIR";
run;
proc export data=out.run_manifest_sas
outfile="&MAN_DIR/run_manifest_sas_&RUN_ID..csv"
dbms=csv replace;
run;
# =============================================================================
# Program: 00_run_driver.R
# Purpose: Controlled execution driver (submission-grade pattern).
# Author: Jonathan D. Stallings, PhD, MS
# Notes: Centralizes parameters, runs build + QC, writes manifest + session.
# =============================================================================
study_id <- "STUDY-XYZ"
root <- "path/to/project"
paths <- list(
sdtm_dm = file.path(root, "data/sdtm/dm.csv"),
sdtm_ex = file.path(root, "data/sdtm/ex.csv"),
adam_adsl = file.path(root, "data/adam/adsl.csv"),
out_dir = file.path(root, "outputs"),
log_dir = file.path(root, "outputs/logs"),
man_dir = file.path(root, "outputs/manifests"),
qc_dir = file.path(root, "outputs/qc")
)
dir.create(paths$out_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(paths$log_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(paths$man_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(paths$qc_dir, recursive = TRUE, showWarnings = FALSE)
stop_if_missing <- function(p) {
if (!file.exists(p)) stop(sprintf("Missing required input: %s", p), call. = FALSE)
}
stop_if_missing(paths$sdtm_dm)
stop_if_missing(paths$sdtm_ex)
run_dttm_utc <- format(Sys.time(), tz = "UTC", usetz = TRUE)
run_id <- gsub("[^0-9]", "", run_dttm_utc)
# Source build + QC scripts
source(file.path(root, "programs/r/01_build_adsl.R"))
source(file.path(root, "programs/r/02_qc_adsl.R"))
# Write run manifest + session info
manifest <- data.frame(
run_id = run_id,
run_dttm_utc = run_dttm_utc,
study_id = study_id,
sdtm_dm = paths$sdtm_dm,
sdtm_ex = paths$sdtm_ex,
adam_adsl = paths$adam_adsl,
stringsAsFactors = FALSE
)
write.csv(manifest, file.path(paths$man_dir, paste0("run_manifest_r_", run_id, ".csv")), row.names = FALSE)
capture.output(sessionInfo(), file = file.path(paths$man_dir, paste0("sessionInfo_", run_id, ".txt")))
What this demonstrates (FDA reviewer lens)
- Controlled execution: a single entry point with explicit parameters reduces ambiguity.
- Deterministic runs: consistent inputs/outputs with run identifiers support traceability.
- Audit trail artifacts: manifests and logs make it easy to reconstruct what happened.
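The R driver above writes a manifest and sessionInfo; console output can be captured the same way so the logs/ folder is populated on every run. A minimal sketch, assuming the paths and run_id objects from 00_run_driver.R; the sink() calls would wrap the source() lines.
# Mirror console and message output to outputs/logs/ so each run leaves a reviewable log.
log_con <- file(file.path(paths$log_dir, paste0("run_", run_id, ".log")), open = "wt")
sink(log_con, split = TRUE)      # console output, also shown on screen
sink(log_con, type = "message")  # messages and warnings
# ... source(build) and source(QC) calls go here ...
sink(type = "message"); sink()   # restore normal output
close(log_con)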
5 Section B — Build ADSL (derivation program)
5.1 Best-practice principles applied
- Single responsibility: one build program for ADSL
- Declared assumptions: date parsing, exposure summarization rules
- Explicit variable derivations with labels (SAS) and consistent naming (R)
- Integrity checks that stop the run on critical failures (duplicates, missing keys)
5.2 SAS vs R
/*=============================================================================
Program: 01_build_adsl.sas
Purpose: Build ADaM ADSL from SDTM DM and EX (submission-grade pattern).
Inputs: sdtm.dm, sdtm.ex
Output: adam.adsl
Assumptions: ISO 8601 dates; 1 primary treatment per subject (simplified).
=============================================================================*/
%macro stop_if_dups(ds, key);
proc sort data=&ds out=_chk nodupkey dupout=_dups; by &key; run;
%local ndup;
proc sql noprint; select count(*) into :ndup trimmed from _dups; quit;
%if &ndup > 0 %then %do;
%put ERROR: Duplicate key(s) detected in &ds by &key..;
proc export data=_dups outfile="&QC_DIR/dups_%scan(&ds,2,.).csv" dbms=csv replace; run;
%abort cancel;
%end;
%mend;
%macro assert_vars(ds, varlist);
%local i v;
%do i=1 %to %sysfunc(countw(&varlist));
%let v = %scan(&varlist,&i);
proc sql noprint;
select count(*) into :_vexists trimmed
from dictionary.columns
where libname=%upcase("%scan(&ds,1,.)")
and memname=%upcase("%scan(&ds,2,.)")
and name=%upcase("&v");
quit;
%if &_vexists = 0 %then %do;
%put ERROR: Variable &v not found in &ds;
%abort cancel;
%end;
%end;
%mend;
/* Validate required variables exist */
%assert_vars(sdtm.dm, STUDYID USUBJID RFSTDTC AGE AGEU SEX RACE ARM);
%assert_vars(sdtm.ex, STUDYID USUBJID EXTRT EXSTDTC EXENDTC);
/* Step 1: DM subset */
data work.dm0;
set sdtm.dm;
where STUDYID="&STUDYID";
keep STUDYID USUBJID SITEID SUBJID RFSTDTC RFENDTC BRTHDTC AGE AGEU SEX RACE ARM;
run;
%stop_if_dups(work.dm0, USUBJID);
/* Step 2: EX exposure summary */
data work.ex0;
set sdtm.ex;
where STUDYID="&STUDYID";
keep STUDYID USUBJID EXTRT EXSTDTC EXENDTC;
run;
proc sort data=work.ex0; by USUBJID EXSTDTC; run;
data work.ex_summ;
set work.ex0;
by USUBJID;
length TRT01P $20;
retain TRT01P TRTSDT TRTEDT;
format TRTSDT TRTEDT yymmdd10.;
if first.USUBJID then do;
TRT01P = strip(EXTRT);
TRTSDT = input(substr(EXSTDTC,1,10), yymmdd10.);
TRTEDT = .;
end;
/* last non-missing end date */
if not missing(EXENDTC) then TRTEDT = input(substr(EXENDTC,1,10), yymmdd10.);
if last.USUBJID then output;
keep USUBJID TRT01P TRTSDT TRTEDT;
run;
%stop_if_dups(work.ex_summ, USUBJID);
/* Step 3: Build ADSL */
proc sort data=work.dm0; by USUBJID; run;
proc sort data=work.ex_summ; by USUBJID; run;
data adam.adsl(label="Subject-Level Analysis Dataset (ADSL)");
merge work.dm0(in=a) work.ex_summ(in=b);
by USUBJID;
length TRT01A $20 SAFFL $1;
format RANDDT TRTSDT TRTEDT yymmdd10.;
if not a then delete;
RANDDT = input(substr(RFSTDTC,1,10), yymmdd10.);
SAFFL = ifc(b and not missing(TRTSDT), "Y", "N");
TRT01A = TRT01P;
label
RANDDT = "Date of Randomization/Reference Start Date"
TRT01P = "Planned Treatment for Period 01"
TRT01A = "Actual Treatment for Period 01"
TRTSDT = "Treatment Start Date"
TRTEDT = "Treatment End Date"
SAFFL = "Safety Population Flag"
;
/* Hard-stop integrity checks */
if missing(USUBJID) then do;
put "ERROR: Missing USUBJID in ADSL";
abort cancel;
end;
if SAFFL="Y" and missing(TRT01A) then do;
put "ERROR: SAFFL=Y but TRT01A missing for " USUBJID=;
abort cancel;
end;
run;
%stop_if_dups(adam.adsl, USUBJID);
/* Produce a small build summary artifact */
proc sql;
create table qc.adsl_build_summary as
select
count(*) as n_records,
sum(SAFFL="Y") as n_saffl,
sum(missing(USUBJID)) as n_missing_usubjid
from adam.adsl;
quit;
proc export data=qc.adsl_build_summary
outfile="&QC_DIR/adsl_build_summary.csv"
dbms=csv replace;
run;
# =============================================================================
# Program: 01_build_adsl.R
# Purpose: Build ADaM ADSL from SDTM-like DM/EX (submission-grade pattern).
# Inputs: dm.csv, ex.csv
# Output: adsl.csv
# Assump: ISO 8601 dates; 1 primary treatment per subject (simplified).
# =============================================================================
# Explicit namespaces (portable across environments)
dm_path <- paths$sdtm_dm
ex_path <- paths$sdtm_ex
adsl_path <- paths$adam_adsl
iso_to_date <- function(x) {
if (all(is.na(x))) return(as.Date(rep(NA, length(x))))
as.Date(substr(x, 1, 10))
}
assert_cols <- function(df, cols, name) {
missing <- setdiff(cols, names(df))
if (length(missing) > 0) {
stop(sprintf("Missing required columns in %s: %s", name, paste(missing, collapse = ", ")), call. = FALSE)
}
}
stop_if_dups <- function(df, key, out_csv) {
tab <- df |>
dplyr::count(dplyr::across(dplyr::all_of(key))) |>
dplyr::filter(.data$n > 1)
if (nrow(tab) > 0) {
readr::write_csv(tab, out_csv)
stop(sprintf("Duplicate key(s) detected by %s. See: %s", paste(key, collapse = ", "), out_csv), call. = FALSE)
}
}
dm <- readr::read_csv(dm_path, show_col_types = FALSE) |>
dplyr::filter(.data$STUDYID == study_id)
ex <- readr::read_csv(ex_path, show_col_types = FALSE) |>
dplyr::filter(.data$STUDYID == study_id)
assert_cols(dm, c("STUDYID","USUBJID","RFSTDTC","AGE","AGEU","SEX","RACE","ARM"), "DM")
assert_cols(ex, c("STUDYID","USUBJID","EXTRT","EXSTDTC","EXENDTC"), "EX")
dm0 <- dm |>
dplyr::select(STUDYID, USUBJID, SITEID, SUBJID, RFSTDTC, RFENDTC, BRTHDTC, AGE, AGEU, SEX, RACE, ARM)
stop_if_dups(dm0, "USUBJID", file.path(paths$qc_dir, "dups_dm0.csv"))
ex_summ <- ex |>
dplyr::arrange(.data$USUBJID, .data$EXSTDTC) |>
dplyr::group_by(.data$USUBJID) |>
dplyr::summarise(
TRT01P = dplyr::first(.data$EXTRT),
TRTSDT = iso_to_date(dplyr::first(.data$EXSTDTC)),
TRTEDT = {
end_nonmissing <- .data$EXENDTC[!is.na(.data$EXENDTC)]
if (length(end_nonmissing) == 0) as.Date(NA) else iso_to_date(end_nonmissing[length(end_nonmissing)])
},
.groups = "drop"
)
stop_if_dups(ex_summ, "USUBJID", file.path(paths$qc_dir, "dups_ex_summ.csv"))
adsl <- dm0 |>
dplyr::left_join(ex_summ, by = "USUBJID") |>
dplyr::mutate(
RANDDT = iso_to_date(.data$RFSTDTC),
SAFFL = dplyr::if_else(!is.na(.data$TRTSDT), "Y", "N"),
TRT01A = .data$TRT01P
) |>
dplyr::select(
STUDYID, USUBJID, SITEID, SUBJID, ARM,
RANDDT, TRT01P, TRT01A, TRTSDT, TRTEDT, SAFFL,
AGE, AGEU, SEX, RACE
)
# Hard-stop integrity checks
if (any(is.na(adsl$USUBJID) | adsl$USUBJID == "")) stop("Missing USUBJID in ADSL.", call. = FALSE)
stop_if_dups(adsl, "USUBJID", file.path(paths$qc_dir, "dups_adsl.csv"))
if (any(adsl$SAFFL == "Y" & is.na(adsl$TRT01A))) stop("SAFFL=Y but TRT01A missing.", call. = FALSE)
# Write output (deterministic)
readr::write_csv(adsl, adsl_path)
# Build summary artifact
summary <- adsl |>
dplyr::summarise(
n_records = dplyr::n(),
n_saffl = sum(.data$SAFFL == "Y"),
n_missing_usubjid = sum(is.na(.data$USUBJID) | .data$USUBJID == "")
)
readr::write_csv(summary, file.path(paths$qc_dir, "adsl_build_summary_r.csv"))
What this demonstrates (industry best practice)
- Traceability: explicit derivations from SDTM-like sources to ADaM variables (see the derivation-map sketch after this list).
- Integrity: hard-stop checks for keys, duplicates, and internal consistency.
- Audit readiness: run artifacts (summaries, manifests) are produced every run.
- Reproducibility: deterministic outputs and parameterized paths support repeatable execution.
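One lightweight way to make that traceability explicit is a variable-level derivation map written as a run artifact. This is a sketch only; in a real package this information lives in the dataset specification and Define-XML, and the output file name is an assumption.
# Illustrative derivation map for the ADSL variables built above.
derivation_map <- data.frame(
  variable   = c("RANDDT", "TRT01P", "TRTSDT", "TRTEDT", "SAFFL"),
  source     = c("DM.RFSTDTC", "EX.EXTRT", "EX.EXSTDTC", "EX.EXENDTC", "EX"),
  derivation = c(
    "First 10 characters of RFSTDTC converted to a date",
    "EXTRT of the earliest exposure record (by EXSTDTC)",
    "Earliest EXSTDTC converted to a date",
    "Last non-missing EXENDTC converted to a date",
    "Y if any exposure with non-missing TRTSDT, else N"
  ),
  stringsAsFactors = FALSE
)
readr::write_csv(derivation_map, file.path(paths$qc_dir, "adsl_derivation_map.csv"))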
6 Section C — Independent QC (dual programming pattern)
6.1 Why QC independence matters
A submission-grade practice is that QC:
- is independent (separate code path)
- reconstructs critical variables
- compares to production output using a deterministic comparison step
- writes a QC report artifact
6.2 SAS vs R
/*=============================================================================
Program: 02_qc_adsl.sas
Purpose: Independent QC build of ADSL and compare to production ADSL.
Inputs: sdtm.dm, sdtm.ex, adam.adsl
Outputs: qc.adsl_qc, comparison report CSV
QC Approach: Re-derive key fields via independent logic, then PROC COMPARE.
=============================================================================*/
%macro assert_exist(ds);
%if not %sysfunc(exist(&ds)) %then %do;
%put ERROR: Missing required dataset: &ds;
%abort cancel;
%end;
%mend;
%assert_exist(adam.adsl);
/* Independent rebuild (intentionally coded differently than production) */
proc sql;
create table qc.dm_qc as
select
STUDYID, USUBJID, SITEID, SUBJID, RFSTDTC, RFENDTC, BRTHDTC, AGE, AGEU, SEX, RACE, ARM
from sdtm.dm
where STUDYID="&STUDYID";
quit;
proc sql;
create table qc.ex_qc as
select STUDYID, USUBJID, EXTRT, EXSTDTC, EXENDTC
from sdtm.ex
where STUDYID="&STUDYID";
quit;
proc sort data=qc.ex_qc; by USUBJID EXSTDTC; run;
/* Different method: PROC SQL aggregation for dates; remerge idiom for the first treatment.
   (PROC SQL supports neither ORDER BY in subqueries nor FETCH FIRST.) */
proc sql;
create table qc.ex_dates_qc as
select
USUBJID,
min(input(substr(EXSTDTC,1,10), yymmdd10.)) as TRTSDT format=yymmdd10.,
max(input(substr(EXENDTC,1,10), yymmdd10.)) as TRTEDT format=yymmdd10.
from qc.ex_qc
group by USUBJID;
/* planned trt = EXTRT on the record with the earliest EXSTDTC (assumes one such record per subject) */
create table qc.ex_trt_qc as
select USUBJID, EXTRT as TRT01P length=20
from qc.ex_qc
group by USUBJID
having EXSTDTC = min(EXSTDTC);
create table qc.ex_summ_qc as
select d.USUBJID, t.TRT01P, d.TRTSDT, d.TRTEDT
from qc.ex_dates_qc d left join qc.ex_trt_qc t
on d.USUBJID = t.USUBJID;
quit;
proc sort data=qc.dm_qc; by USUBJID; run;
proc sort data=qc.ex_summ_qc; by USUBJID; run;
data qc.adsl_qc;
merge qc.dm_qc(in=a) qc.ex_summ_qc(in=b);
by USUBJID;
length TRT01A $20 SAFFL $1;
format RANDDT TRTSDT TRTEDT yymmdd10.;
if not a then delete;
RANDDT = input(substr(RFSTDTC,1,10), yymmdd10.);
SAFFL = ifc(b and not missing(TRTSDT), "Y", "N");
TRT01A = TRT01P;
keep STUDYID USUBJID SITEID SUBJID ARM RANDDT TRT01P TRT01A TRTSDT TRTEDT
SAFFL AGE AGEU SEX RACE;
run;
/* Compare QC vs PROD (sort a WORK copy so the production dataset is not modified by QC) */
proc sort data=adam.adsl out=work.adsl_prod; by USUBJID; run;
proc sort data=qc.adsl_qc; by USUBJID; run;
proc compare base=work.adsl_prod compare=qc.adsl_qc out=qc.adsl_compare_out outnoequal noprint;
id USUBJID;
run;
/* Create a compact QC summary artifact */
proc sql;
create table qc.adsl_qc_summary as
select a.n_prod, b.n_qc, c.n_differences
from (select count(*) as n_prod from adam.adsl) as a,
(select count(*) as n_qc from qc.adsl_qc) as b,
(select count(*) as n_differences from qc.adsl_compare_out) as c;
quit;
proc export data=qc.adsl_qc_summary
outfile="&QC_DIR/adsl_qc_summary.csv"
dbms=csv replace;
run;
# =============================================================================
# Program: 02_qc_adsl.R
# Purpose: Independent QC build of ADSL and compare to production ADSL.
# Inputs: dm.csv, ex.csv, adsl.csv
# Outputs: qc_adsl.csv, qc_compare.csv, qc_summary.csv
# QC: Re-derive using different approach; compare row/col equality.
# =============================================================================
iso_to_date <- function(x) as.Date(substr(x, 1, 10))
adsl_prod <- readr::read_csv(paths$adam_adsl, show_col_types = FALSE)
dm_qc <- readr::read_csv(paths$sdtm_dm, show_col_types = FALSE) |>
dplyr::filter(.data$STUDYID == study_id) |>
dplyr::select(STUDYID, USUBJID, SITEID, SUBJID, RFSTDTC, RFENDTC, BRTHDTC,
AGE, AGEU, SEX, RACE, ARM)
ex_qc <- readr::read_csv(paths$sdtm_ex, show_col_types = FALSE) |>
dplyr::filter(.data$STUDYID == study_id) |>
dplyr::select(STUDYID, USUBJID, EXTRT, EXSTDTC, EXENDTC)
# Different approach than production: aggregate min/max dates and pick the
# first treatment by earliest EXSTDTC
ex_summ_qc <- ex_qc |>
dplyr::arrange(.data$USUBJID, .data$EXSTDTC) |>
dplyr::group_by(.data$USUBJID) |>
dplyr::summarise(
TRTSDT = min(iso_to_date(.data$EXSTDTC), na.rm = TRUE),
TRTEDT = {
end_dates <- .data$EXENDTC[!is.na(.data$EXENDTC)]
if (length(end_dates) == 0) as.Date(NA) else max(iso_to_date(end_dates), na.rm = TRUE)
},
TRT01P = dplyr::first(.data$EXTRT),
.groups = "drop"
) |>
dplyr::mutate(
# min() over an all-missing group yields Inf; reset to NA without dropping the Date class
TRTSDT = dplyr::if_else(is.infinite(as.numeric(.data$TRTSDT)), as.Date(NA), .data$TRTSDT)
)
adsl_qc <- dm_qc |>
dplyr::left_join(ex_summ_qc, by = "USUBJID") |>
dplyr::mutate(
RANDDT = iso_to_date(.data$RFSTDTC),
SAFFL = dplyr::if_else(!is.na(.data$TRTSDT), "Y", "N"),
TRT01A = .data$TRT01P
) |>
dplyr::select(
STUDYID, USUBJID, SITEID, SUBJID, ARM,
RANDDT, TRT01P, TRT01A, TRTSDT, TRTEDT, SAFFL,
AGE, AGEU, SEX, RACE
) |>
dplyr::arrange(.data$USUBJID)
adsl_prod2 <- adsl_prod |>
dplyr::arrange(.data$USUBJID)
# Compare: full join on USUBJID and flag cell-level differences for key derived
# variables (expandable). An NA-safe inequality is used so a missing-vs-nonmissing
# mismatch is flagged rather than silently producing NA.
neq <- function(a, b) xor(is.na(a), is.na(b)) | (!is.na(a) & !is.na(b) & a != b)
cmp <- adsl_prod2 |>
dplyr::full_join(adsl_qc, by = "USUBJID", suffix = c("_PROD", "_QC"))
row_diff <- cmp |>
dplyr::mutate(
DIFF_RANDDT = neq(.data$RANDDT_PROD, .data$RANDDT_QC),
DIFF_TRT01P = neq(.data$TRT01P_PROD, .data$TRT01P_QC),
DIFF_TRT01A = neq(.data$TRT01A_PROD, .data$TRT01A_QC),
DIFF_TRTSDT = neq(.data$TRTSDT_PROD, .data$TRTSDT_QC),
DIFF_TRTEDT = neq(.data$TRTEDT_PROD, .data$TRTEDT_QC),
DIFF_SAFFL = neq(.data$SAFFL_PROD, .data$SAFFL_QC)
) |>
dplyr::mutate(
DIFF_ANY = dplyr::if_any(dplyr::starts_with("DIFF_"))
) |>
dplyr::filter(.data$DIFF_ANY) |>
dplyr::select(USUBJID, dplyr::starts_with("DIFF_"),
dplyr::ends_with("_PROD"), dplyr::ends_with("_QC"))
readr::write_csv(adsl_qc, file.path(paths$qc_dir, "adsl_qc_r.csv"))
readr::write_csv(row_diff, file.path(paths$qc_dir, "adsl_compare_r.csv"))
qc_summary <- data.frame(
n_prod = nrow(adsl_prod2),
n_qc = nrow(adsl_qc),
n_diff = nrow(row_diff),
stringsAsFactors = FALSE
)
readr::write_csv(qc_summary, file.path(paths$qc_dir, "adsl_qc_summary_r.csv"))
What this demonstrates (FDA reviewer lens)
- QC independence: production and QC derivations are coded differently, reducing shared-mode failure risk.
- Deterministic comparison: explicit compare outputs make discrepancies reviewable and auditable.
- Run artifacts: QC summaries and difference listings are created as persistent records of the QC process.
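If the process requires that a run cannot complete with open discrepancies, the QC summary can also act as a gate. A minimal sketch, assuming the qc_summary object from 02_qc_adsl.R is still in scope; this check is not part of the programs above.
# Hard-stop when the deterministic comparison found any subject-level differences.
if (qc_summary$n_diff > 0) {
  stop(sprintf("QC comparison found %d subject(s) with differences; see %s",
               qc_summary$n_diff,
               file.path(paths$qc_dir, "adsl_compare_r.csv")),
       call. = FALSE)
}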
7 Section D — “Submission-grade” completeness checklist
This example includes the core elements that distinguish “regulated” code from ordinary analytics.
- Controlled entrypoint (driver); see the usage sketch after this list
- Parameterized study ID and paths
- Input dataset and variable existence checks
- Deterministic dataset build logic
- Hard-stop integrity checks (keys, duplicates, consistency)
- Independent QC build with a different derivation approach
- Compare outputs and persistent QC artifacts (CSV summaries)
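For completeness, a usage sketch of the controlled entry point (illustrative; the path is the placeholder used throughout):
# Run the R pipeline from a fresh session so no stale objects leak into the run;
# the SAS driver is executed analogously (batch submit of 00_run_driver.sas).
setwd("/path/to/project")
source("programs/r/00_run_driver.R")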