| Title: | Survival Model-Based Imputation for Laboratory Non-Detect Data |
|---|---|
| Description: | Implements survival-model-based imputation for censored laboratory measurements, including Tobit-type models with several distribution options. It is suitable for data with values below detection or quantification limits. The package identifies the best-fitting distribution and produces realistic imputations that respect the censoring thresholds. |
| Authors: | Luís Pereira [aut, cre] (ORCID: <https://orcid.org/0000-0002-0628-4847>), Paulo Infante [aut] (ORCID: <https://orcid.org/0000-0002-1644-9502>), Teresa Ferreira [ths] (ORCID: <https://orcid.org/0000-0002-3900-1460>), Paulo Quaresma [ths] (ORCID: <https://orcid.org/0000-0002-5086-059X>) |
| Maintainer: | Luís Pereira <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-06-04 06:41:19 UTC |
| Source: | https://github.com/lpereira-ue/survlab |
This function imputes non-detect (censored) values in environmental laboratory analytical data using survival models with automatic distribution selection. It validates data quality requirements and fits multiple distributions to select the best model based on AIC. Each imputed value is guaranteed to be below its respective detection limit and above the specified minimum value.
impute_nondetect(
  dt,
  value_col = "value",
  cens_col = "censored",
  parameter_col = NULL,
  unit_col = NULL,
  dist = c("gaussian", "lognormal", "weibull", "exponential", "logistic", "loglogistic"),
  min_observations = 25,
  max_censored_pct = 75,
  min_value = 0,
  control = survival::survreg.control(),
  verbose = FALSE
)
| Argument | Description |
|---|---|
| dt | A data.frame or data.table containing laboratory analytical data |
| value_col | Character string specifying the column name containing values |
| cens_col | Character string specifying the column name containing censoring indicators (0 = non-detect/censored, 1 = detected/observed) |
| parameter_col | Character string specifying the column name containing parameter names (optional, for validation) |
| unit_col | Character string specifying the column name containing units (optional, for validation) |
| dist | Character vector of distributions to test. The default tests all six shown in the usage: "gaussian", "lognormal", "weibull", "exponential", "logistic", and "loglogistic" |
| min_observations | Minimum number of observations required for modeling (default: 25) |
| max_censored_pct | Maximum percentage of censored values allowed (default: 75) |
| min_value | Minimum allowable value for imputed concentrations (default: 0) |
| control | Control object passed to survreg (default: survival::survreg.control()) |
| verbose | Logical indicating whether to display progress messages and distribution fitting information (default: FALSE) |
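The coding expected in cens_col (0 = non-detect, 1 = detected) often has to be derived from raw laboratory output. The following is a minimal sketch in base R, assuming a hypothetical reporting format in which non-detects are written with a leading "<" followed by the detection limit; the vector `raw` and the data frame `lab_data` are illustrative names, not part of the package.

```r
# Hypothetical raw results as a laboratory might report them:
# a leading "<" marks a non-detect and the number is the detection limit.
raw <- c("12.4", "<5", "30.1", "<10", "7.8")

is_nondetect <- startsWith(raw, "<")

lab_data <- data.frame(
  # Strip the "<" so that non-detects carry their detection limit as the value
  value = as.numeric(sub("^<", "", raw)),
  # 0 = non-detect/censored, 1 = detected/observed (the cens_col convention)
  censored = ifelse(is_nondetect, 0L, 1L)
)

print(lab_data)
```

A data frame shaped like `lab_data` matches the default column names value_col = "value" and cens_col = "censored".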
The function performs several validation checks:
- Ensures sufficient sample size (>= min_observations)
- Checks that censoring percentage is reasonable (<= max_censored_pct)
- Validates that only one parameter and unit are present (if columns provided)
- Tests multiple distributions and selects the best based on AIC
- Generates random imputed values below each observation's detection limit and above min_value
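The sample-size check, the censoring-percentage check, and the AIC comparison can be sketched as follows. This is an illustration under stated assumptions, not the package's internal code: the simulated data, the single detection limit of 12, and the reduced candidate list are all hypothetical, and the fits use `survival::survreg` with a left-censored `Surv` response.

```r
library(survival)

set.seed(1)
# Simulated left-censored data (all numbers here are illustrative)
n <- 100
true_conc <- rlnorm(n, meanlog = 3, sdlog = 0.6)
limit <- 12
dat <- data.frame(
  value = ifelse(true_conc < limit, limit, true_conc),
  censored = ifelse(true_conc < limit, 0L, 1L)
)

# Sample size and censoring percentage, using the documented defaults
min_observations <- 25
max_censored_pct <- 75
censored_pct <- 100 * mean(dat$censored == 0)
stopifnot(nrow(dat) >= min_observations, censored_pct <= max_censored_pct)

# Fit each candidate distribution and record its AIC.
# With type = "left", status 1 is an observed value and 0 is left-censored.
candidates <- c("gaussian", "lognormal", "weibull")
aic <- vapply(candidates, function(d) {
  fit <- tryCatch(
    suppressWarnings(
      survreg(Surv(value, censored, type = "left") ~ 1, data = dat, dist = d)
    ),
    error = function(e) NULL  # a distribution that fails to fit is skipped
  )
  if (is.null(fit)) NA_real_ else AIC(fit)
}, numeric(1))

# The candidate with the smallest AIC is treated as the best fit
best <- names(which.min(aic))
print(aic)
print(best)
```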
For non-detect observations (censored = 0), the value in value_col
is treated as the detection limit for that specific analysis, allowing for
different detection limits across samples or analytical methods.
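One way to draw values that stay below a row-specific detection limit and above min_value is inverse-CDF sampling from the fitted distribution, truncated to that interval. The sketch below assumes a lognormal fit and is not taken from the package source; the simulated data and the two detection limits (5 and 10) are hypothetical.

```r
library(survival)

set.seed(42)
n <- 80
true_conc <- rlnorm(n, meanlog = 2.5, sdlog = 0.7)
# Two detection limits, as if two analytical methods had been used
limit <- rep(c(5, 10), length.out = n)
dat <- data.frame(
  value = ifelse(true_conc < limit, limit, true_conc),
  censored = ifelse(true_conc < limit, 0L, 1L)
)

# Left-censored lognormal fit: the intercept estimates the mean of log(value)
# and the scale estimates its standard deviation
fit <- survreg(Surv(value, censored, type = "left") ~ 1,
               data = dat, dist = "lognormal")
meanlog <- unname(coef(fit)[1])
sdlog <- fit$scale

min_value <- 0
nd <- dat$censored == 0

# Draw uniformly between the CDF at min_value and the CDF at each row's own
# detection limit, then map back through the quantile function
p_low <- plnorm(min_value, meanlog, sdlog)
p_high <- plnorm(dat$value[nd], meanlog, sdlog)
u <- runif(sum(nd), min = p_low, max = p_high)

dat$value_imputed <- NA_real_
dat$value_imputed[nd] <- qlnorm(u, meanlog, sdlog)
dat$value_final <- ifelse(nd, dat$value_imputed, dat$value)

head(dat[nd, ])
```

Because `p_high` is computed per row, each imputed value is bounded by the detection limit of its own analysis rather than by a single global limit.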
Convergence control: The control argument is passed directly to
survreg. Any convergence warnings raised during fitting
are silently captured and stored in the convergence_warnings attribute of
the result, rather than being printed to the console. This makes the function
safe for batch processing while still preserving a full diagnostic record. When
verbose = TRUE, captured warnings are also printed to the console.
Distributions that fail to fit entirely (hard errors) are silently skipped in
all cases.
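The capture-and-store behaviour described above can be reproduced in base R with `withCallingHandlers` and `tryCatch`. This is a sketch of the general pattern, not the package's own code: `capture_fit` and `noisy_fit` are hypothetical helpers, and the warning text is only a stand-in for a convergence message.

```r
# Evaluate an expression, collecting warning messages instead of printing them.
# Returns list(value, warnings); value is NULL if the expression raises an error.
capture_fit <- function(expr) {
  warnings_seen <- character(0)
  value <- tryCatch(
    withCallingHandlers(
      expr,
      warning = function(w) {
        # Record the message, then resume evaluation without printing it
        warnings_seen <<- c(warnings_seen, conditionMessage(w))
        invokeRestart("muffleWarning")
      }
    ),
    error = function(e) NULL  # hard errors are skipped silently
  )
  list(value = value, warnings = warnings_seen)
}

# Stand-in for a model fit that warns about convergence
noisy_fit <- function() {
  warning("Ran out of iterations and did not converge")
  42
}

res <- capture_fit(noisy_fit())
res$warnings

# Stand-in for a fit that fails outright
failed <- capture_fit(stop("singular fit"))
is.null(failed$value)
```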
Note: This function should be applied to data containing only ONE parameter at a time. Different environmental parameters have different distributions and should not be modelled together.
A data.table with additional columns:
- [value_col]_imputed: Imputed values for non-detect observations
- [value_col]_final: Final values combining original detected and imputed non-detect values
The returned object also has attributes containing model information:
- The fitted survival model object
- best_distribution: Name of the best-fitting distribution
- Vector of all detection limits found in the data
- The highest detection limit (for reference)
- Parameter name (if parameter_col provided)
- Unit of measurement (if unit_col provided)
- AIC value of the best model
- Total number of observations
- Percentage of censored observations
- convergence_warnings: Character vector of convergence warning messages emitted by survreg when fitting the best-selected distribution. An empty character vector (character(0)) indicates clean convergence. These warnings are always captured silently; set verbose = TRUE to also print them to the console.
# Load example data
data(multi_censored_data)

# Basic imputation with default settings
set.seed(123)
result <- impute_nondetect(
  dt = multi_censored_data,
  value_col = "value",
  cens_col = "censored",
  verbose = FALSE
)

# View imputed values for non-detects
head(result[censored == 0, .(value, value_imputed, value_final)])

# Check best distribution selected
attr(result, "best_distribution")

# Check whether the best model converged cleanly
attr(result, "convergence_warnings") # character(0) means no warnings

# Increase max iterations for difficult datasets
result <- impute_nondetect(
  dt = multi_censored_data,
  value_col = "value",
  cens_col = "censored",
  control = survival::survreg.control(maxiter = 200)
)

# With parameter and unit validation
result <- impute_nondetect(
  dt = multi_censored_data,
  value_col = "value",
  cens_col = "censored",
  parameter_col = "parameter",
  unit_col = "unit"
)

# For strictly positive values (avoiding exactly zero)
result <- impute_nondetect(
  dt = multi_censored_data,
  value_col = "value",
  cens_col = "censored",
  min_value = 1e-10,
  verbose = FALSE
)
A synthetic dataset containing environmental nitrate measurements with non-detect values, generated from a lognormal distribution. This dataset represents typical water quality monitoring data from an environmental laboratory, designed for demonstrating survival model-based imputation techniques.
multi_censored_data
A data.table with 200 rows and 4 variables:
- parameter: Character string indicating the chemical parameter ("Nitrate")
- unit: Character string indicating the unit of measurement ("mg/l NO3")
- value: Numeric values representing either detected measurements or detection limits for non-detect observations
- censored: Integer indicator where 0 = non-detect (below detection limit), 1 = detected (above detection limit)
This dataset simulates real-world environmental water quality data where nitrate measurements below certain detection limits are reported as non-detects. The data include:
- Single parameter (Nitrate) with consistent units (mg/l NO3)
- Multiple detection limit levels reflecting different analytical conditions
- Realistic distribution of detected vs non-detect values (83.5
- Detection limits ranging from 5 to 25 mg/l NO3
- Lognormal distribution typical of environmental contaminant data
For non-detect observations (censored = 0), the 'value' column contains the detection limit for that specific analysis. For detected measurements (censored = 1), the 'value' column contains the actual measured nitrate concentration.
Synthetic data generated for package demonstration, based on typical environmental water quality monitoring programs
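A dataset with the same shape can be simulated in a few lines of base R. The sketch below is an assumption-laden illustration, not the script used to build multi_censored_data: the lognormal parameters, the set of detection limits, the seed, and the use of a plain data.frame (rather than a data.table) are all hypothetical choices.

```r
set.seed(2024)
n <- 200

# Illustrative lognormal parameters; the settings used for the packaged
# dataset are not stated in this documentation
true_conc <- rlnorm(n, meanlog = 3.3, sdlog = 0.6)

# Each analysis is assigned one of several detection limits (5 to 25 mg/l NO3)
limit <- sample(c(5, 10, 15, 25), n, replace = TRUE)
nondetect <- true_conc < limit

sim <- data.frame(
  parameter = "Nitrate",
  unit = "mg/l NO3",
  # Non-detects report the detection limit, detects report the measurement
  value = ifelse(nondetect, limit, true_conc),
  censored = ifelse(nondetect, 0L, 1L)
)

# Counts of non-detects (0) and detects (1), and the limits that occur
table(sim$censored)
sort(unique(sim$value[sim$censored == 0]))
```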
data(multi_censored_data)

# Basic data exploration
multi_censored_data[, .(
  total_samples = .N,
  non_detects = sum(censored == 0),
  detects = sum(censored == 1)
)]

# View parameter and unit information
multi_censored_data[, .(
  parameter = unique(parameter),
  unit = unique(unit)
)]

# View detection limit levels
multi_censored_data[censored == 0, unique(value)]

# Apply survival model imputation
result <- impute_nondetect(multi_censored_data, parameter_col = "parameter", unit_col = "unit")
validate_imputation(result)
This function validates the quality of non-detect value imputation by checking that imputed values are below their respective limits of quantification and providing comprehensive summary statistics and model diagnostics.
validate_imputation(
  dt_imputed,
  value_col = "value",
  cens_col = "censored",
  verbose = TRUE
)
| Argument | Description |
|---|---|
| dt_imputed | A data.table returned from impute_nondetect() |
| value_col | Character string specifying the column name containing original values |
| cens_col | Character string specifying the column name containing censoring indicators |
| verbose | Logical indicating whether to print validation results to console (default: TRUE) |
The function checks:
- All imputed values are strictly below their respective limits of quantification
- Uniqueness of imputed values
- Summary statistics by limits of quantification level
- Model fit information including parameter and unit details
- Dataset characteristics (sample size, censoring percentage)
Invisibly returns the input data.table. When verbose = TRUE, prints validation results to console including:
- Whether all imputed values are below their detection limits
- Number of duplicate imputed values (if any)
- Summary statistics by detection limit level
- Model fit information
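The first three checks can be reproduced by hand on any table that has the original values, the censoring indicator, and the imputed column. The following base R sketch uses a small hand-made data frame with hypothetical numbers; it illustrates the kind of checks described and is not the function's implementation.

```r
# Hand-made stand-in for an imputed result (all values hypothetical)
res <- data.frame(
  value         = c(5, 5, 10, 10, 18.2, 31.0),
  censored      = c(0L, 0L, 0L, 0L, 1L, 1L),
  value_imputed = c(2.1, 3.7, 4.4, 8.9, NA, NA)
)

# Only non-detect rows carry an imputed value
nd <- res[res$censored == 0, ]

# Check 1: every imputed value is strictly below its own limit
all_below <- all(nd$value_imputed < nd$value)

# Check 2: number of duplicated imputed values
n_duplicates <- sum(duplicated(nd$value_imputed))

# Check 3: summary statistics per detection limit level
by_limit <- aggregate(
  value_imputed ~ value,
  data = nd,
  FUN = function(x) c(n = length(x), min = min(x), mean = mean(x), max = max(x))
)

print(all_below)
print(n_duplicates)
print(by_limit)
```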
data(multi_censored_data)
result <- impute_nondetect(multi_censored_data, verbose = FALSE)
validate_imputation(result)

# Silent validation for batch processing
validate_imputation(result, verbose = FALSE)