Title: | Train and Apply a Gaussian Stochastic Process Model |
---|---|
Description: | Train a Gaussian stochastic process model of an unknown function, possibly observed with error, via maximum likelihood or maximum a posteriori (MAP) estimation, run model diagnostics, and make predictions, following Sacks, J., Welch, W.J., Mitchell, T.J., and Wynn, H.P. (1989) "Design and Analysis of Computer Experiments", Statistical Science, <doi:10.1214/ss/1177012413>. Perform sensitivity analysis and visualize low-order effects, following Schonlau, M. and Welch, W.J. (2006), "Screening the Input Variables to a Computer Model Via Analysis of Variance and Visualization", <doi:10.1007/0-387-28014-6_14>. |
Authors: | William J. Welch [aut, cre, cph] , Yilin Yang [aut] |
Maintainer: | William J. Welch <[email protected]> |
License: | GPL-3 |
Version: | 1.0.6 |
Built: | 2024-10-31 22:09:44 UTC |
Source: | https://github.com/cran/GaSP |
Training and test data for the borehole function; see source for background.
borehole
borehole
A list with the following four data frames:
8-dimensional input for 40 training runs.
Output (the flow) for the 40 training runs in x
.
8-dimensional input for 1000 test runs at which to predict y
.
Output for the 1000 runs in x_pred
.
https://www.sfu.ca/~ssurjano/borehole.html
GaSPModel
object.Compute leave-one-out cross-validated predictions for a GaSPModel
object.
CrossValidate(GaSP_model)
CrossValidate(GaSP_model)
GaSP_model |
Object of class |
A data frame with two columns: the cross-validated predictions
Pred
and their standard errors SE
.
RMSE
computes the root mean squared error
of the predictions.
PlotPredictions
and PlotResiduals
plot the predictions or their residuals;
PlotStdResiduals
and PlotQQ
plot the stanadardized residuals.
borehole_cv <- CrossValidate(borehole_fit)
borehole_cv <- CrossValidate(borehole_fit)
Describe the input variables to set up integration or summation ranges
for Visualize
.
DescribeX( x_names, x_min, x_max, support = NULL, num_levels = NULL, distribution = NULL )
DescribeX( x_names, x_min, x_max, support = NULL, num_levels = NULL, distribution = NULL )
x_names |
A vector of character strings containing the names of the input variables. |
x_min , x_max
|
Vectors of the same length as |
support |
Optional vector of character strings of the same length
as |
num_levels |
An optional vector of integers for the number of levels of each input;
must be present if the |
distribution |
An optional vector of character strings of the same length
as |
A data frame with the following columns:
Variable
(containing x_names
), Min
(containing x_min
),
and Max
(containing x_max
),
plus the optional columns Support
(from support
),
NumberLevels
(from num_levels
), and
Distribution
(from distribution
).
Does not check against GaSPModel
and all characters are CASE SENSITIVE.
borehole_x_names <- colnames(borehole$x) borehole_min <- c(0.05, 100.00, 63070.00, 990.00, 63.10, 700.00, 1120.00, 9855.00) borehole_max <- c(0.15, 50000.00, 115600.00, 1110.00, 116.00, 820.00, 1680.00, 12045.00) borehole_x_desc <- DescribeX(borehole_x_names, borehole_min, borehole_max)
borehole_x_names <- colnames(borehole$x) borehole_min <- c(0.05, 100.00, 63070.00, 990.00, 63.10, 700.00, 1120.00, 9855.00) borehole_max <- c(0.15, 50000.00, 115600.00, 1110.00, 116.00, 820.00, 1680.00, 12045.00) borehole_x_desc <- DescribeX(borehole_x_names, borehole_min, borehole_max)
Fit (train) a GaSP model.
Fit( x, y, reg_model, sp_model = NULL, cor_family = c("PowerExponential", "Matern"), cor_par = data.frame(0), random_error = c(FALSE, TRUE), sp_var = -1, error_var = -1, nugget = 1e-09, tries = 10, seed = 500, fit_objective = c("Likelihood", "Posterior"), theta_standardized_min = 0, theta_standardized_max = .Machine$double.xmax, alpha_min = 0, alpha_max = 1, derivatives_min = 0, derivatives_max = 3, log_obj_tol = 1e-05, log_obj_diff = 0, lambda_prior = 0.1, model_comparison = c("Objective", "CV") )
Fit( x, y, reg_model, sp_model = NULL, cor_family = c("PowerExponential", "Matern"), cor_par = data.frame(0), random_error = c(FALSE, TRUE), sp_var = -1, error_var = -1, nugget = 1e-09, tries = 10, seed = 500, fit_objective = c("Likelihood", "Posterior"), theta_standardized_min = 0, theta_standardized_max = .Machine$double.xmax, alpha_min = 0, alpha_max = 1, derivatives_min = 0, derivatives_max = 3, log_obj_tol = 1e-05, log_obj_diff = 0, lambda_prior = 0.1, model_comparison = c("Objective", "CV") )
x |
A data frame containing the input (explanatory variable) training data. |
y |
A vector or a data frame with one column containing the output (response) training data. |
reg_model |
The regression model, specified as a formula, but note the left-hand side of the formula is unused; see example. |
sp_model |
An optional stochastic process model, specified as a formula,
but note the left-hand side of the formula and the intercept are unused.
The default |
cor_family |
A character string specifying the (product, anisoptropic) correlation-function family: "PowerExponential" for the power-exponential family or "Matern" for the Matern family. |
cor_par |
An optional data frame containing the correlation parameters
with one row per |
random_error |
A boolean for the presence or not of a random (measurement, white-noise) error term. |
sp_var , error_var
|
Starting values of the stochastic process and error variances
for the first try to optimize the objective (see Details);
valid (i.e., nonnegative) values will only be used if |
nugget |
For numerical stability the proportion of the total variance
due to random error is fixed at this value ( |
tries |
Number of optimizations of the objective from different random starting points. |
seed |
The random-number seed to generate starting points. |
fit_objective |
The objective that |
theta_standardized_min , theta_standardized_max
|
The minimum and maximum of the standardized |
alpha_min , alpha_max
|
The minimum and maximum
of the |
derivatives_min , derivatives_max
|
The minimum and maximum
of the |
log_obj_tol |
An absolute tolerance for terminating the optimization of the log of the objective. |
log_obj_diff |
The critical value for the change in the log objective for informal tests during optimization of correlation parameters. No testing is done with the default of 0; a larger critical value such as 2 may give a more parsimonious model. |
lambda_prior |
The rate parameter of an exponential prior
for each |
model_comparison |
The criterion used to select from multiple solutions
when |
Fit numerically optimizes the profile objective function with respect to the correlation parameters; the mean and overall variance parameters are estimated in closed form given the correlation parameters.
A cor_par
data frame supplied by the user is the starting point
for the first optimization try.
If random_error = TRUE
,
then sp_var
/ (sp_var
+ error_var
) is another
correlation parameter to be optimized;
sp_var
and error_var
values supplied by the user
will initialize this parameter for the first try.
Set random_error = TRUE
to estimate the variance of the
random (measurement, white-noise) error;
a small nugget
error variance is for numerical stability.
For term in the stochastic-process model,
the estimate of
is constrained between
theta_standardized_min
/ and
theta_standardized_max
/ ,
where
is the range of term
.
Note that
Fit
returns unscaled estimates relating to the original, unscaled inputs.
A GaSPModel
object, which is a list with the following components:
x |
The data frame containing the input training data. |
y |
The training output data, now as a vector. |
reg_model |
The regression model, now in the form of a data frame. |
sp_model |
The stochastic process model, now in the form of a data frame. |
cor_family |
The correlation family. |
cor_par |
A data frame for the estimated correlation parameters. |
random_error |
The boolean for the presence or not of a random error term. |
sp_var |
The estimated stochastic process variance. |
error_var |
The estimated random error variance. |
beta |
A data frame holding the estimated regression-model parameters. |
objective |
The maximum value found for the objective function: the log likelihood (fit_objective = "Likelihood") or the log posterior (fit_objective = "Posterior"). |
cond_num |
The condition number. |
CVRMSE |
The leave-one-out cross-validation root mean squared error. |
Sacks, J., Welch, W.J., Mitchell, T.J., and Wynn, H.P. (1989) "Design and Analysis of Computer Experiments", Statistical Science, 4, pp. 409-423, doi:10.1214/ss/1177012413.
x <- borehole$x y <- borehole$y borehole_fit <- Fit( reg_model = ~1, x = x, y = y, cor_family = "Matern", random_error = FALSE, nugget = 0, fit_objective = "Posterior" )
x <- borehole$x y <- borehole$y borehole_fit <- Fit( reg_model = ~1, x = x, y = y, cor_family = "Matern", random_error = FALSE, nugget = 0, fit_objective = "Posterior" )
GaSPModel
object.Return a template for a GaSPModel
object.
GaSPModel( x, y, reg_model, sp_model = NULL, cor_family = c("PowerExponential", "Matern"), cor_par, random_error = c(FALSE, TRUE), sp_var, error_var = 0 )
GaSPModel( x, y, reg_model, sp_model = NULL, cor_family = c("PowerExponential", "Matern"), cor_par, random_error = c(FALSE, TRUE), sp_var, error_var = 0 )
x |
A data frame containing the input (explanatory variable) training data. |
y |
A vector or a data frame with one column containing the output (response) training data. |
reg_model |
The regression model, specified as a formula, but note the left-hand side of the formula is unused; see example. |
sp_model |
An optional stochastic process model, specified as a formula,
but note the left-hand side of the formula and the intercept are unused.
The default |
cor_family |
A character string specifying the (product, anisoptropic) correlation-function family: "PowerExponential" for the power-exponential family or "Matern" for the Matern family. |
cor_par |
A data frame containing the correlation parameters
with one row per |
random_error |
A boolean for the presence or not of a random (measurement, white-noise) error term. |
sp_var |
The stochastic process variance. |
error_var |
The random error variance, with default 0. |
The data frame cor_par
contains one row for each term
in the stochastic process model.
There are two columns. The first is named Theta
,
and the second is either Alpha
(power-exponential)
or Derivatives
(Matern).
Let be a distance between points for term
in the stochastic-process model.
For power-exponential, the contribution to the product correlation
from term
depends on
a distance-scale parameter
from the
Theta
column and
a smoothness parameter from the
Alpha
column;
the contribution is .
For example,
gives the squared-exponential (Gaussian) correlation.
The contribution to the product correlation for Matern also depends on
,
and the second parameter is the number of derivatives
from the
Derivatives
column.
The contribution is
for
(the exponential correlation),
for
,
for
, and
for
(the squared-exponential correlation).
Note that
codes for a limiting infinite number of derivatives.
This is not the usual parameterization of the Matern, but it is
consistent with power-exponential for the exponential and squared-exponential
special cases common to both.
A value should be given to error_var
if the model
has a random-error term (random_error = TRUE
),
and a small "nugget" such as may be needed for improved
numerical conditioning.
A GaSPModel
object, which is a list with the following components:
x |
The data frame containing the input training data. |
y |
The training output data, now as a vector. |
reg_model |
The regression model, now in the form of a data frame. |
sp_model |
The stochastic process model, now in the form of a data frame. |
cor_family |
The correlation family. |
cor_par |
The data frame containing the correlation parameters. |
random_error |
The boolean for the presence or not of a random error term. |
sp_var |
The stochastic process variance. |
error_var |
The random error variance. |
beta |
A placeholder for a data frame to hold the regression-model parameters. |
objective |
A placeholder for the maximum fit objective. |
cond_num |
A placeholder for the condition number. |
CVRMSE |
A placeholder for the model's cross-validated root mean squared error. |
This function does not excecute Fit
and is intended
for CrossValidate
, Predict
and Visualize
with models trained otherwise by the user.
Placeholders do not need to be specified to excecute these further functions,
as they are always recomputed as needed.
Sacks, J., Welch, W.J., Mitchell, T.J., and Wynn, H.P. (1989) "Design and Analysis of Computer Experiments", Statistical Science, 4, pp. 409-423, doi:10.1214/ss/1177012413.
x <- borehole$x y <- borehole$y theta <- c( 5.767699e+01, 0.000000e+00, 0.000000e+00, 1.433571e-06, 0.000000e+00, 2.366557e-06, 1.695619e-07, 2.454376e-09 ) alpha <- c( 1.110223e-16, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 2.494862e-03, 0.000000e+00 ) cor_par <- data.frame(Theta = theta, Alpha = alpha) rownames(cor_par) <- colnames(borehole$x) sp_var <- 38783.7 borehole_gasp <- GaSPModel( x = borehole$x, y = borehole$y, reg_model = ~1, cor_family = "PowerExponential", cor_par = cor_par, random_error = FALSE, sp_var = sp_var )
x <- borehole$x y <- borehole$y theta <- c( 5.767699e+01, 0.000000e+00, 0.000000e+00, 1.433571e-06, 0.000000e+00, 2.366557e-06, 1.695619e-07, 2.454376e-09 ) alpha <- c( 1.110223e-16, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00, 2.494862e-03, 0.000000e+00 ) cor_par <- data.frame(Theta = theta, Alpha = alpha) rownames(cor_par) <- colnames(borehole$x) sp_var <- 38783.7 borehole_gasp <- GaSPModel( x = borehole$x, y = borehole$y, reg_model = ~1, cor_family = "PowerExponential", cor_par = cor_par, random_error = FALSE, sp_var = sp_var )
PlotPredictions
, PlotResiduals
, PlotStdResiduals
, PlotMainEffects
, and PlotJointEffects
.Execute PlotPredictions
, PlotResiduals
and PlotStdResiduals
(all applied to cross validation only),
PlotMainEffects
, and PlotJointEffects
.
PlotAll( GaSP_model, cross_validation, visualization, y_name = "y", y_units = "", x_units = NULL, se_plot = TRUE, y_values = NULL, se_values = NULL, pch = 1 )
PlotAll( GaSP_model, cross_validation, visualization, y_name = "y", y_units = "", x_units = NULL, se_plot = TRUE, y_values = NULL, se_values = NULL, pch = 1 )
GaSP_model |
Object of class |
cross_validation |
A data frame returned by |
visualization |
A list object returned by |
y_name |
An optional character string containing the output variable name (for labels). |
y_units |
An optional character string containing the units of the output variable (for labels). |
x_units |
An optional vector of character strings containing the units of the input variables (for labels). |
se_plot |
An optional boolean indicating whether to make standard-error contour plots. |
y_values |
An optional vector of contour values for the estimated joint effects. |
se_values |
An optional vector of contour values for the standard errors. |
pch |
Plotting symbol for |
No return value, generates plots.
PlotAll(borehole_fit, borehole_cv, borehole_vis)
PlotAll(borehole_fit, borehole_cv, borehole_vis)
Plot the estimated joint effects.
PlotJointEffects( joint_effect, anova_percent, x_units = NULL, y_name = "y", y_units = "", se_plot = TRUE, y_values = NULL, se_values = NULL )
PlotJointEffects( joint_effect, anova_percent, x_units = NULL, y_name = "y", y_units = "", se_plot = TRUE, y_values = NULL, se_values = NULL )
joint_effect |
A data frame from |
anova_percent |
A data frame from |
x_units |
An optional vector of character strings containing the units of the input variables (for labels). |
y_name |
An optional character string containing the output variable name (for labels). |
y_units |
An optional character string containing the units of the output variable (for labels). |
se_plot |
An optional boolean indicating whether to make standard-error contour plots. |
y_values |
An optional vector of contour values for the estimated joint effects. |
se_values |
An optional vector of contour values for the standard errors. |
Plots are sent to the active device.
No return value, generates plots.
PlotJointEffects(borehole_vis$joint_effect, borehole_vis$anova_percent)
PlotJointEffects(borehole_vis$joint_effect, borehole_vis$anova_percent)
Plot the estimated main effects.
PlotMainEffects( main_effect, anova_percent, x_units = NULL, y_name = "y", y_units = "" )
PlotMainEffects( main_effect, anova_percent, x_units = NULL, y_name = "y", y_units = "" )
main_effect |
A data frame from |
anova_percent |
A data frame from |
x_units |
An optional vector of character strings containing the units of the input variables (for labels). |
y_name |
An optional character string containing the output variable name (for labels). |
y_units |
An optional character string containing the units of the output variable (for labels). |
Plots are sent to the active device. Each plot shows an estimated main effect (red solid line) and pointwise approximate 95% confidence limits (green dashed line).
No return value, generates plots.
PlotMainEffects(borehole_vis$main_effect, borehole_vis$anova_percent)
PlotMainEffects(borehole_vis$main_effect, borehole_vis$anova_percent)
Plot true versus predicted output (response)
made by Predict
or CrossValidate
.
PlotPredictions( y_pred, y, y_name = "y", y_units = "", title = c("Predict", "CrossValidate"), pch = 1 )
PlotPredictions( y_pred, y, y_name = "y", y_units = "", title = c("Predict", "CrossValidate"), pch = 1 )
y_pred |
A data frame of predicted output values
made by |
y |
A vector of length equal to the number of rows in |
y_name |
An optional character string containing the output variable name (for labels). |
y_units |
An optional character string constaining the units of the output variable (for labels). |
title |
A character string for the name of the function generating
the predictions (for an appropriate title):
"Predict" from |
pch |
Plotting symbol for |
No return value, generates plots.
PlotPredictions(borehole_cv, y, title = "CrossValidate") PlotPredictions(borehole_pred$y_pred, borehole$y_true, title = "Predict")
PlotPredictions(borehole_cv, y, title = "CrossValidate") PlotPredictions(borehole_pred$y_pred, borehole$y_true, title = "Predict")
Normal quantile-quantile (Q-Q) plot of the standardized residuals
of predictions from Predict
or CrossValidate
.
PlotQQ(y_pred, y, y_name = "y")
PlotQQ(y_pred, y, y_name = "y")
y_pred |
A data frame of predicted output values
made by |
y |
A vector of length equal to the number of rows in |
y_name |
An optional character string containing the output variable name (for labels). |
No return value, generates plots.
PlotQQ(borehole_cv, y)
PlotQQ(borehole_cv, y)
Plot residuals versus each input variable.
PlotResiduals( x, y_pred, y, x_units = NULL, y_name = "y", y_units = "", pch = 1 )
PlotResiduals( x, y_pred, y, x_units = NULL, y_name = "y", y_units = "", pch = 1 )
x |
A data frame with number of rows equal to the number of rows in
|
y_pred |
A data frame of predicted output values
made by |
y |
A vector of length equal to the number of rows in |
x_units |
An optional vector of character strings containing the units
of the input variables in |
y_name |
An optional character string containing the output variable name (for labels). |
y_units |
An optional character string constaining the units of the output variable (for labels). |
pch |
Plotting symbol for |
No return value, generates plots.
PlotResiduals(x, borehole_cv, y)
PlotResiduals(x, borehole_cv, y)
Plot standardized residuals versus predictions made by Predict
or CrossValidate
.
PlotStdResiduals( y_pred, y, y_name = "y", y_units = "", title = c("Predict", "CrossValidate"), pch = 1 )
PlotStdResiduals( y_pred, y, y_name = "y", y_units = "", title = c("Predict", "CrossValidate"), pch = 1 )
y_pred |
A data frame of predicted output values
made by |
y |
A vector of length equal to the number of rows in |
y_name |
An optional character string containing the output variable name (for labels). |
y_units |
An optional character string constaining the units of the output variable (for labels). |
title |
A character string for the name of the function generating
the predictions (for an appropriate title):
"Predict" from |
pch |
Plotting symbol for |
No return value, generates plots.
PlotStdResiduals(borehole_cv, y, title = "CrossValidate")
PlotStdResiduals(borehole_cv, y, title = "CrossValidate")
GaSPModel
object.Predict from a GaSPModel
object.
Predict(GaSP_model, x_pred, generate_coefficients = c(FALSE, TRUE))
Predict(GaSP_model, x_pred, generate_coefficients = c(FALSE, TRUE))
GaSP_model |
Object of class |
x_pred |
A data frame containing the values of the input variables at which to predict the output. |
generate_coefficients |
A boolean indicating whether coefficients for further external predictions are generated. |
A list with the following elements:
y_pred |
A data frame with two columns: the predictions
|
pred_coeffs |
A vector of coefficients for further predictions;
|
The vector of prediction coefficients in pred_coeffs
can be used as follows. Let denote the coefficients and let
denote a vector with element
containing the correlation
between the output at a given new point and the output at training point
.
Then the prediction for the output at the new point is the dot product
of
and
.
RMSE
computes the root mean squared error
of the predictions.
PlotPredictions
and PlotResiduals
plot the predictions or their residuals;
PlotStdResiduals
and PlotQQ
plot the standardized residuals.
borehole_pred <- Predict( GaSP_model = borehole_fit, x_pred = borehole$x_pred, generate_coefficients = TRUE )
borehole_pred <- Predict( GaSP_model = borehole_fit, x_pred = borehole$x_pred, generate_coefficients = TRUE )
Calculate the root mean squared error (RMSE) of prediction
RMSE(y_pred, y_true, normalized = FALSE)
RMSE(y_pred, y_true, normalized = FALSE)
y_pred |
A vector of predicted output values. |
y_true |
A vector of true output values. |
normalized |
An optional boolean: if |
The RMSE or normalized RMSE.
RMSE(borehole_pred$y_pred$Pred, borehole$y_true) RMSE(borehole_cv$Pred, y)
RMSE(borehole_pred$y_pred$Pred, borehole$y_true) RMSE(borehole_cv$Pred, y)
GaSPModel
object.Carry out a functional analysis of variance (ANOVA) of a GaSPModel
object
and generate plotting coordinates for its estimated main and 2-input joint effects.
Visualize(GaSP_model, x_description, main_percent = 0, interaction_percent = 0)
Visualize(GaSP_model, x_description, main_percent = 0, interaction_percent = 0)
GaSP_model |
Object of class |
x_description |
A data frame describing the input variables.
See |
main_percent |
An optional minimum percentage of variation explained by an input's main effect to return the effect's plotting coordinates; the default of zero gives plotting coordinates for all inputs. |
interaction_percent |
An optional minimum percentage of variation explained by the interaction effect of a pair of inputs to return the plotting coordinates for their joint effect (main effects plus interaction effect); the default of zero gives plotting coordinates for all pairs of inputs. |
If there are many inputs, to avoid excessive plotting of
many trivial joint effects set interaction_percent = 1
say.
A list with the following elements:
anova_percent |
A data frame containing the ANOVA percentages for all main effects and 2-input interaction effects. |
main_effect |
A data frame with plotting coordinates for the estimated main effects. |
joint_effect |
A data frame with plotting coordinates for the estimated 2-input joint effects. |
total_percent |
Total percentage of the prediction variation accounted for by all main effects and 2-input interaction effects. |
average |
Overall average of the prediction function, averaged with respect to all inputs. |
SE_average |
Standard error of the overall average. |
Schonlau, M. and Welch, W.J. (2006), "Screening the Input Variables to a Computer Model Via Analysis of Variance and Visualization", in Screening: Methods for Experimentation in Industry, Drug Discovery, and Genetics, Dean. A. and Lewis, S., eds., pp. 308-327, Springer, New York, doi:10.1007/0-387-28014-6_14.
borehole_vis <- Visualize(borehole_fit, borehole_x_desc)
borehole_vis <- Visualize(borehole_fit, borehole_x_desc)