| Type: | Package | 
| Title: | Training Set Determination for Genomic Selection | 
| Version: | 2.0 | 
| Date: | 2022-06-07 | 
| Description: | We propose an optimality criterion to determine the required training set, r-score, which is derived directly from Pearson's correlation between the genomic estimated breeding values and phenotypic values of the test set <doi:10.1007/s00122-019-03387-0>. This package provides two main functions to determine a good training set and its size. | 
| License: | GPL (≥ 3) | 
| Encoding: | UTF-8 | 
| Imports: | dplyr, ggplot2, latex2exp, lifecycle, parallel, Rcpp (≥ 1.0.8.3) | 
| LinkingTo: | Rcpp, RcppEigen | 
| RoxygenNote: | 7.2.0 | 
| URL: | https://github.com/oumarkme/TSDFGS | 
| BugReports: | https://github.com/oumarkme/TSDFGS/issues | 
| Depends: | R (≥ 2.10) | 
| LazyData: | true | 
| NeedsCompilation: | yes | 
| Packaged: | 2022-06-07 13:21:24 UTC; mark | 
| Author: | Jen-Hsiang Ou  | 
| Maintainer: | Jen-Hsiang Ou <jen-hsiang.ou@imbim.uu.se> | 
| Repository: | CRAN | 
| Date/Publication: | 2022-06-07 14:00:11 UTC | 
TSDFGS: Training Set Determination for Genomic Selection
Description
We propose an optimality criterion to determine the required training set, r-score, which is derived directly from Pearson's correlation between the genomic estimated breeding values and phenotypic values of the test set doi:10.1007/s00122-019-03387-0. This package provides two main functions to determine a good training set and its size.
Author(s)
Maintainer: Jen-Hsiang Ou jen-hsiang.ou@imbim.uu.se (ORCID)
Authors:
Po-Ya Wu Po-Ya.Wu@hhu.de (ORCID)
Chen-Tuo Liao ctliao@ntu.edu.tw (ORCID) [thesis advisor]
See Also
Useful links:
https://github.com/oumarkme/TSDFGS
Report bugs at https://github.com/oumarkme/TSDFGS/issues
Fit logistic growth curve model
Description
A function for fitting a logistic growth curve model.
Usage
FGCM(geno, nt = NULL, n_iter = NULL, multi.threads = TRUE)
Arguments
geno | 
 Genotype information saved as a dataframe. Columns represent variants (SNPs or PCs).  | 
nt | 
 A numeric vector of training set sample sizes used to estimate the logistic growth curve parameters.  | 
n_iter | 
 Number of simulations for each training set size. A suitable number is chosen automatically by default.  | 
multi.threads | 
 Default: TRUE. Set to FALSE to run on a single thread.  | 
Value
Estimation of parameters.
Examples
data(geno)
## Not run: FGCM(geno)
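The documentation above does not spell out the model that FGCM fits. As a rough illustration only, the sketch below fits a three-parameter logistic growth curve to simulated (training set size, mean r-score) pairs with base R's nls() and SSlogis(). The simulated values, the parameter names (Asym, xmid, scal), and the choice of the three-parameter form are assumptions made for illustration, not FGCM's actual implementation.

```r
# Simulated stand-in data (assumption): mean r-score at each training set size.
set.seed(1)
nt <- seq(20, 300, by = 20)
true_curve <- 0.9 / (1 + exp((80 - nt) / 40))
r_mean <- true_curve + rnorm(length(nt), sd = 0.01)

# Fit r = Asym / (1 + exp((xmid - nt) / scal)) using the self-starting model.
fit <- nls(r_mean ~ SSlogis(nt, Asym, xmid, scal))
est <- coef(fit)
print(round(est, 3))
```

In this parameterisation, Asym is the plateau the score approaches as the training set grows, which is the quantity of interest when judging whether adding samples still pays off.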
Sample size determination for genomic selection
Description
This function generates an operating curve for sample size determination.
Usage
SSDFGS(geno, nt = NULL, n_iter = NULL, multi.threads = TRUE)
Arguments
geno | 
 A numeric data frame of genotype information (columns: PCs; rows: samples).  | 
nt | 
 A numeric vector of training set sizes for r-score simulation.  | 
n_iter | 
 Number of iterations for estimating parameters.  | 
multi.threads | 
 Default: TRUE, which uses 75% of the available threads if the computer has more than 4 threads.  | 
Value
An operating curve and its information.
Author(s)
Jen-Hsiang Ou & Po-Ya Wu
Examples
data(geno)
## Not run: SSDFGS(geno)
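The thread rule described for multi.threads can be sketched as follows. pick_threads is a hypothetical helper, not a TSDFGS function. The documentation only covers machines with more than 4 threads, so the single-thread fallback and the rounding down are both assumptions.

```r
# Hypothetical helper (not part of TSDFGS): number of worker threads
# under the documented "75% if more than 4 threads" rule.
pick_threads <- function(n_cores) {
  if (is.na(n_cores) || n_cores <= 4) {
    return(1L)  # assumption: fall back to a single thread
  }
  as.integer(floor(0.75 * n_cores))  # assumption: round down
}

n_cores <- parallel::detectCores()  # may return NA on some platforms
pick_threads(n_cores)
```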
CD-score
Description
This function calculates the CD-score (doi:10.1186/1297-9686-28-4-359) for a given training set and test set.
Usage
cd_score(X, X0)
Arguments
X | 
 A numeric matrix. The training set genotypic information matrix can be given as a genotype matrix (coded as -1, 0, 1) or a principal component matrix (rows: samples; columns: markers).  | 
X0 | 
 A numeric matrix. The test set genotypic information matrix can be given as a genotype matrix (coded as -1, 0, 1) or a principal component matrix (rows: samples; columns: markers).  | 
Value
A floating-point number, CD score.
Author(s)
Jen-Hsiang Ou
Examples
data(geno)
## Not run: cd_score(geno[1:50, ], geno[51:100, ])
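Both X and X0 are described as matrices with samples in rows, so a row subset needs the trailing comma to stay a matrix. A minimal sketch of the split, using a simulated matrix as a stand-in for geno (the matrix pcs and its dimensions are assumptions for illustration):

```r
# Simulated stand-in for geno (assumption): 100 samples x 10 PCs.
set.seed(1)
pcs <- matrix(rnorm(100 * 10), nrow = 100, ncol = 10)

X  <- pcs[1:50, ]    # training set: rows 1-50, all columns
X0 <- pcs[51:100, ]  # test set: rows 51-100, all columns

dim(X)   # 50 10
dim(X0)  # 50 10

# Without the comma the result is a plain vector of 50 values, not a matrix.
length(pcs[51:100])  # 50
```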
Genotype information
Description
A PCA matrix of rice genotype information. This data was published by Zhao et al. (2011) doi:10.1038/ncomms1467
Usage
geno
Format
A numeric matrix (PCA) with 404 rows (sample) and 404 columns (PCs).
Source
http://www.ricediversity.org/data/
Examples
data(geno)
Simulate r-scores of each training set size
Description
Calculate r-scores (un-targeted) in parallel.
Usage
nt2r(geno, nt, n_iter = 30, multi.threads = TRUE)
Arguments
geno | 
 A numeric data frame of genotypes; columns represent sites (genotypes coded as 1, 0, -1).  | 
nt | 
 Numeric. The training set size.  | 
n_iter | 
 Number of iterations (default = 30).  | 
multi.threads | 
 Default: TRUE  | 
Value
A vector of r-scores, one per iteration.
Examples
data(geno)
## Not run: nt2r(geno, 50)
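Since nt2r returns one r-score per iteration for a single training set size, comparing sizes means repeating the call and summarising each result. The sketch below shows one way to do that. mean_r_by_size and stub_scores are hypothetical names introduced here, and the stub scorer exists only so the example runs without the package.

```r
# Hypothetical helper (not part of TSDFGS): mean score per training set size.
# score_fun must accept (geno, nt, n_iter = ...) and return one score per iteration.
mean_r_by_size <- function(geno, sizes, score_fun, n_iter = 30) {
  vapply(
    sizes,
    function(nt) mean(score_fun(geno, nt, n_iter = n_iter)),
    numeric(1)
  )
}

# Stub scorer so the sketch runs without the package (assumption, for illustration).
stub_scores <- function(geno, nt, n_iter = 30) rep(nt / 100, n_iter)
mean_r_by_size(matrix(0, 5, 5), sizes = c(10, 20), score_fun = stub_scores, n_iter = 3)
```

With TSDFGS installed, passing score_fun = TSDFGS::nt2r in place of the stub would be the natural choice, given the signature shown in the Usage section.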
Optimal training set determination
Description
This function determines an optimal training set.
Usage
optTrain(
  geno,
  cand,
  n.train,
  subpop = NULL,
  test = NULL,
  method = "rScore",
  min.iter = NULL
)
Arguments
geno | 
 A numeric matrix of principal components (rows: individuals; columns: PCs).  | 
cand | 
 An integer vector of row indices in the geno matrix identifying the individuals that are candidates for the training set.  | 
n.train | 
 The size of the target training set. This can be determined with the help of the SSDFGS function provided in this package.  | 
subpop | 
 A character vector of sub-population group names. The algorithm ignores the population structure if it remains NULL.  | 
test | 
 An integer vector of row indices in the geno matrix identifying the individuals in the test set. The algorithm uses an un-targeted method if it remains NULL.  | 
method | 
 Choices are rScore, PEV and CD. rScore is used by default.  | 
min.iter | 
 The minimum number of iterations for all methods. One should always check whether the algorithm has converged. If it remains NULL, a minimum number of iterations is set based on the candidate and test set sizes.  | 
Value
This function returns three components: OPTtrain (a vector of the chosen optimal training set), TOPscore (the highest scores before each iteration), and ITERscore (the criterion scores of each iteration).
Author(s)
Jen-Hsiang Ou
Examples
data(geno)
## Not run: optTrain(geno, cand = 1:404, n.train = 100)
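A targeted design supplies a test set through the test argument. The sketch below uses only the arguments documented above. The 300/104 split, the training set size of 50, and the choice of the CD method are arbitrary assumptions, and accessing the result with $ assumes the return value is a list with the component names given in the Value section.

```r
# Index sets for a targeted design (the split itself is an arbitrary assumption).
cand <- 1:300     # candidate rows for the training set
test <- 301:404   # rows treated as the test set
n_train <- 50

# Runs only in an interactive session with TSDFGS installed.
if (interactive() && requireNamespace("TSDFGS", quietly = TRUE)) {
  data(geno, package = "TSDFGS")
  res <- TSDFGS::optTrain(geno, cand = cand, n.train = n_train,
                          test = test, method = "CD")
  res$OPTtrain  # chosen training set, per the Value section
}
```

Keeping cand and test disjoint, as here, means no test individual can be selected into the training set.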
PEV score
Description
This function calculates the prediction error variance (PEV) score (doi:10.1186/s12711-015-0116-6) for a given training set and test set.
Usage
pev_score(X, X0)
Arguments
X | 
 A numeric matrix. The training set genotypic information matrix can be given as a genotype matrix (coded as -1, 0, 1) or a principal component matrix (rows: samples; columns: markers).  | 
X0 | 
 A numeric matrix. The test set genotypic information matrix can be given as a genotype matrix (coded as -1, 0, 1) or a principal component matrix (rows: samples; columns: markers).  | 
Value
A floating-point number, PEV score.
Author(s)
Jen-Hsiang Ou
Examples
data(geno)
## Not run: pev_score(geno[1:50, ], geno[51:100, ])
r-score
Description
This function calculates the r-score (doi:10.1007/s00122-019-03387-0) for a given training set and test set.
Usage
r_score(X, X0)
Arguments
X | 
 A numeric matrix. The training set genotypic information matrix can be given as a genotype matrix (coded as -1, 0, 1) or a principal component matrix (rows: samples; columns: markers).  | 
X0 | 
 A numeric matrix. The test set genotypic information matrix can be given as a genotype matrix (coded as -1, 0, 1) or a principal component matrix (rows: samples; columns: markers).  | 
Value
A floating-point number, r-score.
Author(s)
Jen-Hsiang Ou
Examples
data(geno)
## Not run: r_score(geno[1:50, ], geno[51:100, ])
Sub-population information
Description
Sub-population information of samples. This data was published by Zhao et al. (2011) doi:10.1038/ncomms1467
Usage
subpop
Format
A character vector.
Source
http://www.ricediversity.org/data/
Examples
data(subpop)