Title:  Probability Theory for Selecting Candidates in Plant Breeding 

Description:  Use probability theory under the Bayesian framework for calculating the risk of selecting candidates in a multienvironment context [Dias et al. (2022) <doi:10.1007/s0012202204041y>]. Contained are functions used to fit a Bayesian multienvironment model (based on the available presets), extract posterior values and maximum posterior values, compute the variance components, check the model’s convergence, and calculate the probabilities. For both across and withinenvironments scopes, the package computes the probability of superior performance and the pairwise probability of superior performance. Furthermore, the probability of superior stability and the pairwise probability of superior stability across environments is estimated. A joint probability of superior performance and stability is also provided. 
Authors:  Saulo Chaves [aut, cre] , Kaio Dias [aut, cph] , Matheus Krause [aut] 
Maintainer:  Saulo Chaves <[email protected]> 
License:  AGPL (>= 3) 
Version:  1.0.4.0 
Built:  20240717 15:28:50 UTC 
Source:  https://github.com/saulochaves/probbreed 
Fits a Bayesian multienvironment model using rstan
, the R
interface to Stan
.
bayes_met( data, gen, loc, repl, trait, reg = NULL, year = NULL, res.het = FALSE, iter = 2000, cores = 2, chains = 4, pars = NA, warmup = floor(iter/2), thin = 1, seed = sample.int(.Machine$integer.max, 1), init = "random", verbose = FALSE, algorithm = c("NUTS", "HMC", "Fixed_param"), control = NULL, include = TRUE, show_messages = TRUE, ... )
bayes_met( data, gen, loc, repl, trait, reg = NULL, year = NULL, res.het = FALSE, iter = 2000, cores = 2, chains = 4, pars = NA, warmup = floor(iter/2), thin = 1, seed = sample.int(.Machine$integer.max, 1), init = "random", verbose = FALSE, algorithm = c("NUTS", "HMC", "Fixed_param"), control = NULL, include = TRUE, show_messages = TRUE, ... )
data 
A data frame in which to interpret the variables declared in the other arguments. 
gen , loc

A string. The name of the columns that contain the evaluated candidates and locations (or environments, if you are working with factor combinations), respectively. 
repl 
A string, a vector, or 
trait 
A string. The analysed variable. Currently, only singletrait models are fitted. 
reg 
A string or NULL. The name of the column that contain information on
regions or megaenvironments. 
year 
A string or NULL. The name of the column that contain information on
years (or seasons). 
res.het 
Should the model consider heterogeneous residual variances?
Defaults for 
iter 
A positive integer specifying the number of iterations for each chain (including warmup). The default is 2000. 
cores 
Number of cores to use when executing the chains in parallel,
which defaults to 1 but we recommend setting the 
chains 
A positive integer specifying the number of Markov chains. The default is 4. 
pars 
A vector of character strings specifying parameters of interest.
The default is 
warmup 
A positive integer specifying the number of warmup (aka burnin)
iterations per chain. If stepsize adaptation is on (which it is by default),
this also controls the number of iterations for which adaptation is run (and
hence these warmup samples should not be used for inference). The number of
warmup iterations should be smaller than 
thin 
A positive integer specifying the period for saving samples. The default is 1, which is usually the recommended value. 
seed 
The seed for random number generation. The default is generated
from 1 to the maximum integer supported by R on the machine. Even if
multiple chains are used, only one seed is needed, with other chains having
seeds derived from that of the first chain to avoid dependent samples.
When a seed is specified by a number, 
init 
Initial values specification. See the detailed documentation for
the init argument in 
verbose 

algorithm 
One of sampling algorithms that are implemented in Stan.
Current options are 
control 
A named 
include 
Logical scalar defaulting to 
show_messages 
Either a logical scalar (defaulting to 
... 
Additional arguments can be 
The function has nine available models, which will be fitted according to the options set in the arguments:
Entrymean model : fitted when repl = NULL
, reg = NULL
and year = NULL
:
$y = \mu + g + l + \varepsilon$
Where $y$
is the phenotype, $\mu$
is the intercept, $g$
is the genotypic
effect, $l$
is the location (or environment) effect, and $\varepsilon$
is
the residue (which contains the genotypebylocation interaction, in this case).
Randomized complete blocks design : fitted when repl
is a single string.
It will fit different models depending if reg
and year
are NULL
:
reg = NULL
and year = NULL
:
$y = \mu + g + l + gl + r + \varepsilon$
where $gl$
is the genotypebylocation effect, and $r$
is the replicate effect.
reg = "reg"
and year = NULL
:
$y = \mu + g + m + l + gl + gm + r + \varepsilon$
where $m$
is the region effect, and $gm$
is the genotypebyregion effect.
reg = NULL
and year = "year"
:
$y = \mu + g + t + l + gl + gt + r + \varepsilon$
where $t$
is the year effect, and $gt$
is the genotypebyyear effect.
reg = "reg"
and year = "year"
:
$y = \mu + g + m + t + l + gl + gm + gt + r + \varepsilon$
Incomplete blocks design : fitted when repl
is a string vector of size 2.
It will fit different models depending if reg
and year
are NULL
:
reg = NULL
and year = NULL
:
$y = \mu + g + l + gl + r + b + \varepsilon$
where $b$
is the block within replicates effect.
reg = "reg"
and year = NULL
:
$y = \mu + g + m + l + gl + gm + r + b + \varepsilon$
reg = NULL
and year = "year"
:
$y = \mu + g + t + l + gl + gt + r + b + \varepsilon$
reg = "reg"
and year = "year"
:
$y = \mu + g + m + t + l + gl + gm + gt + r + b + \varepsilon$
The models described above have predefined priors:
$x \sim \mathcal{N} \left( 0, S^{[x]} \right)$
$\sigma \sim \mathcal{HalfCauchy}\left( 0, S^{[\sigma]} \right)$
where $x$
can be any effect but the error, and $\sigma$
is the standard
deviation of the likelihood. If res.het = TRUE
, then $\sigma_k \sim \mathcal{HalfCauchy}\left( 0, S^{\left[ \sigma_k \right]} \right)$
.
The hyperpriors are set as follows:
$S^{[x]} \sim \mathcal{HalfCauchy}\left( 0, \phi \right)$
where $\phi$
is the known global hyperparameter defined such as $\phi = max(y) \times 10$
.
More details about the usage of bayes_met
and other functions of
the ProbBreed
package can be found at https://saulochaves.github.io/ProbBreed_site/.
Solutions to convergence or mixing issues can be found at
https://mcstan.org/misc/warnings.html.
An object of S4 class stanfit
representing
the fitted results. Slot mode
for this object
indicates if the sampling is done or not.
sampling
signature(object = "stanmodel")
Call a sampler (NUTS, HMC, or Fixed_param depending on parameters)
to draw samples from the model defined by S4 class stanmodel
given the data, initial values, etc.
rstan::sampling, rstan::stan, rstan::stanfit
mod = bayes_met(data = soy, gen = "Gen", loc = "Loc", repl = NULL, year = NULL, reg = NULL, res.het = TRUE, trait = 'Y', iter = 6000, cores = 4, chains = 4)
mod = bayes_met(data = soy, gen = "Gen", loc = "Loc", repl = NULL, year = NULL, reg = NULL, res.het = TRUE, trait = 'Y', iter = 6000, cores = 4, chains = 4)
stanfit
objects obtained from bayes_met
Extracts outputs of the Bayesian model fitted
using bayes_met()
, and provides some diagnostics.
extr_outs(model, probs = c(0.025, 0.975), verbose = FALSE)
extr_outs(model, probs = c(0.025, 0.975), verbose = FALSE)
model 
An object of class 
probs 
A vector with two elements representing the probabilities (in decimal scale) that will be considered for computing the quantiles. 
verbose 
A logical value. If 
More details about the usage of extr_outs
and other functions of
the ProbBreed
package can be found at https://saulochaves.github.io/ProbBreed_site/.
The function returns an object of class extr
, which is a list with:
variances
: a data frame containing the variance components of
the model effects, their standard deviation, naive standard error and highest
posterior density interval.
post
: a list with the posterior of the effects, and the data
generated by the model.
map
: a list with the maximum posterior values of each effect
ppcheck
: a matrix containing the pvalues of maximum, minimum,
median, mean and standard deviation; effective number of parameters, WAIC2
value, Rhat and effective sample size.
rstan::stan_diag, ggplot2::ggplot, rstan::check_hmc_diagnostics, plot.extr
mod = bayes_met(data = soy, gen = "Gen", loc = "Loc", repl = NULL, year = NULL, reg = NULL, res.het = TRUE, trait = 'Y', iter = 6000, cores = 4, chains = 4) outs = extr_outs(model = mod, probs = c(0.05, 0.95), verbose = TRUE)
mod = bayes_met(data = soy, gen = "Gen", loc = "Loc", repl = NULL, year = NULL, reg = NULL, res.het = TRUE, trait = 'Y', iter = 6000, cores = 4, chains = 4) outs = extr_outs(model = mod, probs = c(0.05, 0.95), verbose = TRUE)
This dataset belongs to value of cultivation and use maize trials of Embrapa Maize and Sorghum, and was used by Dias et al. (2022). It contains the grain yield of 32 singlecross hybrids and four commercial checks (36 genotypes in total) evaluated in 16 locations across five regions or megaenvironments. These trials were laid out in incomplete blocks design, using a block size of 6 and two replications per trial.
maize
maize
maize
A data frame with 823 rows and 6 columns:
16 locations
5 regions
2 replicates
6 blocks
36 genotypes
Grain yield (phenotypes)
Dias, K. O. G, Santos J. P. R., Krause, M. D., Piepho H. P., Guimarães, L. J. M., Pastina, M. M., and Garcia, A. A. F. (2022). Leveraging probability concepts for cultivar recommendation in multienvironment trials. Theoretical and Applied Genetics, 133(2):443455. doi:10.1007/s0012202204041y
extr
objectBuild plots using the outputs stored in the extr
object.
## S3 method for class 'extr' plot(obj, category = "ppdensity", ...)
## S3 method for class 'extr' plot(obj, category = "ppdensity", ...)
obj 
An object of class 
category 
A string indicating which plot to build. See options in the Details section. 
... 
Passed to ggplot2::geom_histogram, when 
The available options are:
ppdensity
: Density plots of the empirical and sampled data, useful to assess the
model's convergence.
density
: Density plots of the model's effects.
histogram
: Histograms of the model's effects.
traceplot
: Trace plot showing the changes in the effects' values across iterations and chains.
mod = bayes_met(data = soy, gen = "Gen", loc = "Loc", repl = NULL, year = NULL, reg = NULL, res.het = FALSE, trait = 'Y', iter = 6000, cores = 4, chains = 4) outs = extr_outs(model = mod, probs = c(0.05, 0.95), verbose = TRUE) plot(results, category = "ppdensity") plot(results, category = "density") plot(results, category = "histogram") plot(results, category = "traceplot")
mod = bayes_met(data = soy, gen = "Gen", loc = "Loc", repl = NULL, year = NULL, reg = NULL, res.het = FALSE, trait = 'Y', iter = 6000, cores = 4, chains = 4) outs = extr_outs(model = mod, probs = c(0.05, 0.95), verbose = TRUE) plot(results, category = "ppdensity") plot(results, category = "density") plot(results, category = "histogram") plot(results, category = "traceplot")
probsup
objectBuild plots using the outputs stored in the probsup
object.
## S3 method for class 'probsup' plot(obj, category = "perfo", level = "across", ...)
## S3 method for class 'probsup' plot(obj, category = "perfo", level = "across", ...)
obj 
An object of class 
category 
A string indicating which plot to build. See options in the Details section. 
level 
A string indicating the information level to be used for building
the plots. Options are 
... 
currently not used 
The available options are:
hpd
: a caterpillar plot representing the marginal genotypic value of
each genotype, and their respective highest posterior density interval (95% represented by the
thick line, and 97.5% represented by the thin line). Available only if level = "across"
.
perfo
: if level = "across"
, a lollipop plot illustrating the probabilities of superior performance.
If level = "within"
, a heatmap with the probabilities of superior performance within
environments. If a model with reg
and/or year
is fitted, multiple plots are produced.
stabi
: a lollipop plot with the probabilities of superior stability.
If a model with reg
and/or year
is fitted, multiple plots are produced.
Available only if level = "across"
.
pair_perfo
: if level = "across"
, a heatmap representing the pairwise probability of superior
performance (the probability of genotypes at the xaxis being superior
to those on the yaxis). If level = "within"
, a list of heatmaps representing the pairwise probability of superior
performance within environments. If a model with reg
and/or year
is fitted, multiple plots (and multiple lists) are produced.
Should this option is set, it is mandatory to store the outputs in an object
(e.g., pl < plot(obj, category = "pair_perfo", level = "within")
) so they can be visualized one at a time.
pair_stabi
: a heatmap with the pairwise probabilities of superior stability
(the probability of genotypes at the xaxis being more stable than those on the yaxis).
If a model with reg
and/or year
is fitted, multiple plots are produced. Available only if level = "across"
.
joint
: a lollipop plot with the joint probabilities of superior performance and stability.
mod = bayes_met(data = soy, gen = "Gen", loc = "Loc", repl = NULL, year = NULL, reg = NULL, res.het = FALSE, trait = 'Y', iter = 6000, cores = 4, chains = 4) outs = extr_outs(model = mod, probs = c(0.05, 0.95), verbose = TRUE) results = prob_sup(extr = outs, int = .2, increase = TRUE, save.df = FALSE, verbose = FALSE) plot(results, category = "hpd") plot(results, category = "perfo", level = "across") plot(results, category = "perfo", level = "within") plot(results, category = "stabi") plot(results, category = "pair_perfo", level = "across") plwithin = plot(results, category = "pair_perfo", level = "within") plot(results, category = "pair_stabi") plot(results, category = "joint")
mod = bayes_met(data = soy, gen = "Gen", loc = "Loc", repl = NULL, year = NULL, reg = NULL, res.het = FALSE, trait = 'Y', iter = 6000, cores = 4, chains = 4) outs = extr_outs(model = mod, probs = c(0.05, 0.95), verbose = TRUE) results = prob_sup(extr = outs, int = .2, increase = TRUE, save.df = FALSE, verbose = FALSE) plot(results, category = "hpd") plot(results, category = "perfo", level = "across") plot(results, category = "perfo", level = "within") plot(results, category = "stabi") plot(results, category = "pair_perfo", level = "across") plwithin = plot(results, category = "pair_perfo", level = "within") plot(results, category = "pair_stabi") plot(results, category = "joint")
extr
Print a extr
object in R console
## S3 method for class 'extr' print(obj, ...)
## S3 method for class 'extr' print(obj, ...)
obj 
An object of class 
... 
currently not used 
probsup
Print a probsup
object in R console
## S3 method for class 'probsup' print(obj, ...)
## S3 method for class 'probsup' print(obj, ...)
obj 
An object of class 
... 
currently not used 
This function estimates the probabilities of superior performance and stability across environments, and probabilities of superior performance within environments.
prob_sup(extr, int, increase = TRUE, save.df = FALSE, verbose = FALSE)
prob_sup(extr, int, increase = TRUE, save.df = FALSE, verbose = FALSE)
extr 
An object of class 
int 
A numeric representing the selection intensity (between 0 and 1) 
increase 
Logical. 
save.df 
Logical. Should the data frames be saved in the work directory?

verbose 
A logical value. If 
Probabilities provide the risk of recommending a selection candidate for a target
population of environments or for a specific environment. prob_sup
computes the probabilities of superior performance and the probabilities of superior stability:
Probability of superior performance
Let $\Omega$
represent the subset of selected genotypes based on their
performance across environments. A given genotype $j$
will belong to $\Omega$
if its genotypic marginal value ($\hat{g}_j$
) is high or low enough compared to
its peers. prob_sup
leverages the Monte Carlo discretized sampling
from the posterior distribution to emulate the occurrence of $S$
trials. Then,
the probability of the $j^{th}$
genotype belonging to $\Omega$
is the
ratio of success ($\hat{g}_j \in \Omega$
) events and the total number of sampled events,
as follows:
$Pr\left(\hat{g}_j \in \Omega \vert y \right) = \frac{1}{S}\sum_{s=1}^S{I\left(\hat{g}_j^{(s)} \in \Omega \vert y\right)}$
where $S$
is the total number of samples $\left(s = 1, 2, ..., S \right)$
,
and $I\left(g_j^{(s)} \in \Omega \vert y\right)$
is an indicator variable that can assume
two values: (1) if $\hat{g}_j^{(s)} \in \Omega$
in the $s^{th}$
sample,
and (0) otherwise. $S$
is conditioned to the number of iterations and chains
previously set at bayes_met.
Similarly, the withinenvironment probability of superior performance can be applied to
individual environments. Let $\Omega_k$
represent the subset of superior
genotypes in the $k^{th}$
environment, so that the probability of the
$j^{th} \in \Omega_k$
can calculated as follows:
$Pr\left(\hat{g}_{jk} \in \Omega_k \vert y\right) = \frac{1}{S} \sum_{s=1}^S I\left(\hat{g}_{jk}^{(s)} \in \Omega_k \vert y\right)$
where $I\left(\hat{g}_{jk}^{(s)} \in \Omega_k \vert y\right)$
is an indicator variable
mapping success (1) if $\hat{g}_{jk}^{(s)}$
exists in $\Omega_k$
, and
failure (0) otherwise, and $\hat{g}_{jk}^{(s)} = \hat{g}_j^{(s)} + \widehat{ge}_{jk}^{(s)}$
.
Note that when computing withinenvironment probabilities, we are accounting for
the interaction of the $j^{th}$
genotype with the $k^{th}$
environment.
The pairwise probabilities of superior performance can also be calculated across
or within environments. This metric assesses the probability of the $j^{th}$
genotype being superior to another experimental genotype or a commercial check.
The calculations are as follows, across and within environments, respectively:
$Pr\left(\hat{g}_{j} > \hat{g}_{j^\prime} \vert y\right) = \frac{1}{S} \sum_{s=1}^S I\left(\hat{g}_{j}^{(s)} > \hat{g}_{j^\prime}^{(s)} \vert y\right)$
or
$Pr\left(\hat{g}_{jk} > \hat{g}_{j^\prime k} \vert y\right) = \frac{1}{S} \sum_{s=1}^S I\left(\hat{g}_{jk}^{(s)} > \hat{g}_{j^\prime k}^{(s)} \vert y\right)$
These equations are set for when the selection direction is positive. If
increase = FALSE
, $>$
is simply switched by $<$
.
Probability of superior stability
This probability makes a direct analogy with the
method of Shukla (1972): a stable genotype is the one that has a low
genotypebyenvironment interaction variance $[var(\widehat{ge})]$
.
Using the same probability principles previously described, the probability
of superior stability is given as follows:
$Pr \left[var \left(\widehat{ge}_{jk}\right) \in \Omega \vert y \right] = \frac{1}{S} \sum_{s=1}^S I\left[var \left(\widehat{ge}_{jk}^{(s)} \right) \in \Omega \vert y \right]$
where $I\left[var \left(\widehat{ge}_{jk}^{(s)} \right) \in \Omega \vert y \right]$
indicates if
$var\left(\widehat{ge}_{jk}^{(s)}\right)$
exists in $\Omega$
(1) or not (0).
Pairwise probabilities of superior stability are also possible in this context:
$Pr \left[var \left(\widehat{ge}_{jk} \right) < var\left(\widehat{ge}_{j^\prime k} \right) \vert y \right] = \frac{1}{S} \sum_{s=1}^S I \left[var \left(\widehat{ge}_{jk} \right)^{(s)} < var \left(\widehat{ge}_{j^\prime k} \right)^{(s)} \vert y \right]$
Note that $j$
will be superior to $j^\prime$
if it has a lower
variance of the genotypebyenvironment interaction effect. This is true regardless
if increase
is set to TRUE
or FALSE
.
The joint probability independent events is the product of the individual probabilities. The estimated genotypic main effects and the variances of GEI effects are independent by design, thus the joint probability of superior performance and stability as follows:
$Pr \left[\hat{g}_j \in \Omega \cap var \left(\widehat{ge}_{jk} \right) \in \Omega \right] = Pr \left(\hat{g}_j \in \Omega \right) \times Pr \left[var \left(\widehat{ge}_{jk} \right) \in \Omega \right]$
The estimation of these probabilities are strictly related to some key questions that constantly arises in plant breeding:
What is the risk of recommending a selection candidate for a target population of environments?
What is the probability of a given selection candidate having good performance if recommended to a target population of environments? And for a specific environment?
What is the probability of a given selection candidate having better performance than a cultivar check in the target population of environments? And in specific environments?
How probable is it that a given selection candidate performs similarly across environments?
What are the chances that a given selection candidate is more stable than a cultivar check in the target population of environments?
What is the probability that a given selection candidate having a superior and invariable performance across environments?
More details about the usage of prob_sup
, as well as the other function of
the ProbBreed
package can be found at https://saulochaves.github.io/ProbBreed_site/.
The function returns an object of class probsup
, which contains two lists,
one with the across
environments probabilities, and another with the within
environments probabilities.
The across
list has the following elements:
g_hpd
: Highest posterior density (HPD) of the posterior genotypic main effects.
perfo
: the probabilities of superior performance.
pair_perfo
: the pairwise probabilities of superior performance.
stabi
: a list with the probabilities of superior stability. It contains the data frames gl
,
gm
(when reg
is not NULL
) and gt
(when year
is not NULL
).
pair_stabi
: a list with the pairwise probabilities of superior stability. It contains the data frames gl
,
gm
(when reg
is not NULL
) and gt
(when year
is not NULL
).
joint_prob
: the joint probabilities of superior performance and stability.
The within
list has the following elements:
perfo
: a list of data frames containing the probabilities of superior performance
within locations (gl
), regions (gm
) and years (gt
).
pair_perfo
: lists with the pairwise probabilities of superior performance
within locations (gl
), regions (gm
) and years (gt
).
Dias, K. O. G, Santos J. P. R., Krause, M. D., Piepho H. P., Guimarães, L. J. M., Pastina, M. M., and Garcia, A. A. F. (2022). Leveraging probability concepts for cultivar recommendation in multienvironment trials. Theoretical and Applied Genetics, 133(2):443455. doi:10.1007/s0012202204041y
Shukla, G. K. (1972) Some statistical aspects of partioning genotype environmental componentes of variability. Heredity, 29:237245. doi:10.1038/hdy.1972.87
mod = bayes_met(data = soy, gen = "Gen", loc = "Loc", repl = NULL, year = NULL, reg = NULL, res.het = TRUE, trait = 'Y', iter = 6000, cores = 4, chains = 4) outs = extr_outs(model = mod, probs = c(0.05, 0.95), verbose = TRUE) results = prob_sup(extr = outs, int = .2, increase = TRUE, save.df = FALSE, verbose = FALSE)
mod = bayes_met(data = soy, gen = "Gen", loc = "Loc", repl = NULL, year = NULL, reg = NULL, res.het = TRUE, trait = 'Y', iter = 6000, cores = 4, chains = 4) outs = extr_outs(model = mod, probs = c(0.05, 0.95), verbose = TRUE) results = prob_sup(extr = outs, int = .2, increase = TRUE, save.df = FALSE, verbose = FALSE)
This dataset belongs to the USDA Northern Region Uniform Soybean Tests,
and it is a subset of the data used by Krause et al. (2023). It contains the
empirical best linear unbiased estimates of genotypic means of the seed yield
from 39 experimental genotypes evaluated in 14 locations. The original data, available at the package
SoyURT
, has 4,257 experimental genotypes evaluated at 63 locations and
31 years resulting in 591 locationyear combinations (environments) with
39,006 yield values.
soy
soy
soy
A data frame with 823 rows and 3 columns:
14 locations
39 experimental genotypes
435 EBLUEs (phenotypes)
Krause MD, Dias KOG, Singh AK, Beavis WD. 2023. Using soybean historical field trial data to study genotype by environment variation and identify megaenvironments with the integration of genetic and nongenetic factors. bioRxiv : the preprint server for biology. doi: https://doi.org/10.1101/2022.04.11.487885