vignettes/LTFHPlusExample.Rmd
LTFHPlusExample.Rmd
First, we will load the packages that we will need for the example:
For simplicity’s sake, we will use the same prevalence for both sexes. However, it is worth noting that in real-world applications, the cumulative incidence proportions (CIPs) needed for the thresholds will rarely be identical for both sexes. A common way to estimate the CIP is with the Aalen-Johansen estimator that accounts for competing events.
Single trait LT-FH++ requires 3 types of input, namely the
liability-scale heritability, the family relationships and the status,
age or age of onset, sex, and birth year for each family member, and
finally cumulative incidence proportions (CIPs). The more detailed
cumulative incidence proportions, the better. In the LT-FH++ paper, we
used CIPs that were stratified by birth year and sex and fixed the upper
and lower liability threshold for cases to \(\Phi(1 - CIP)\), since it provided a very
accurate estimate for the full liability. Each CIP curve was a function
of age, where age represented the age of onset or the current age of a
control. If less detailed information, such as CIPs stratified by sex,
but not birth year, we recommend setting the lower threshold to be \(\Phi(1 - CIP)\) and the upper limit to
infinity for cases. With this information, the thresholds needed for the
age-dependent liability threshold model can be assigned. We have
provided a function called prepare_LTFHPlus_input
that can
help users convert the input into a suitable format. An example of how
the input may look and how to use the function can be found in From
CIP and family to LT-FH++ input.
We have implemented a function that allows users to simulate under
the liability threshold model for a given family structure, i.e. mother,
father, sibling, or sibling1, sibling2, child1, child2, etc. The
function simply needs a family structure, heritability, and population
prevalence.For simplicity’s sake, we will use the same prevalence for
both sexes. However, it is worth noting that in real-world applications,
the cumulative incidence proportions needed for the thresholds, will
rarely be identical for both sexes. The function
simulate_under_LTM
outputs a list with two tibbles. The
first one sim_obs
contains the underlying liabilities,
status, and age or age of onset for each family and family members. The
second tibble thresholds
contains the formatted input that
is ready to be input to estimate the genetic liability.
sims <- simulate_under_LTM(fam_vec = c("m","f","s1"),
n_fam = NULL,
add_ind = TRUE,
h2 = h2,
n_sim = 10,
pop_prev = .1)
sims
## $sim_obs
## # A tibble: 10 × 14
## fam_ID g o m f s1 o_status m_status f_status
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <lgl>
## 1 fam_ID_1 -0.0930 0.438 -0.514 -0.244 0.0223 FALSE FALSE FALSE
## 2 fam_ID_2 0.575 0.150 0.175 0.698 -0.208 FALSE FALSE FALSE
## 3 fam_ID_3 -0.219 -0.349 -2.40 -0.441 0.108 FALSE FALSE FALSE
## 4 fam_ID_4 0.692 0.880 -1.09 3.65 -0.202 FALSE FALSE TRUE
## 5 fam_ID_5 0.702 0.846 -0.789 0.0164 0.526 FALSE FALSE FALSE
## 6 fam_ID_6 -0.0935 -0.0723 -0.644 0.0447 -1.34 FALSE FALSE FALSE
## 7 fam_ID_7 -0.00631 -0.497 -0.102 1.12 0.430 FALSE FALSE FALSE
## 8 fam_ID_8 -1.30 -1.33 -2.62 0.318 -1.23 FALSE FALSE FALSE
## 9 fam_ID_9 -0.0503 0.0632 -0.715 -1.00 -1.60 FALSE FALSE FALSE
## 10 fam_ID_10 0.0616 0.528 0.0112 -0.939 -1.68 FALSE FALSE FALSE
## # ℹ 5 more variables: s1_status <lgl>, o_aoo <dbl>, m_aoo <dbl>, f_aoo <dbl>,
## # s1_aoo <int>
##
## $thresholds
## # A tibble: 40 × 5
## fam_ID indiv_ID role lower upper
## <chr> <chr> <chr> <dbl> <dbl>
## 1 fam_ID_1 fam_ID_1_1 o -Inf 3.35
## 2 fam_ID_2 fam_ID_2_1 o -Inf 3.28
## 3 fam_ID_3 fam_ID_3_1 o -Inf 2.59
## 4 fam_ID_4 fam_ID_4_1 o -Inf 2.51
## 5 fam_ID_5 fam_ID_5_1 o -Inf 3.52
## 6 fam_ID_6 fam_ID_6_1 o -Inf 2.91
## 7 fam_ID_7 fam_ID_7_1 o -Inf 3.38
## 8 fam_ID_8 fam_ID_8_1 o -Inf 3.24
## 9 fam_ID_9 fam_ID_9_1 o -Inf 3.28
## 10 fam_ID_10 fam_ID_10_1 o -Inf 2.43
## # ℹ 30 more rows
thresholds
keeps track of families and family members
with fam_id
and indiv_id
. Here we simply use
dummy variables for both. In real-world data, the indiv_id
is usually some pseudonymized identifier unique to each individual.
There is total freedom to set fam_id
to whatever, as long
as it is unique to each family. One choice could be the pseudonymized
identifier for the index person in a family, or simply numerate the
families from \(1\) to the total number
of families. Next, role
identifies each family member’s
relationship to the index person. They follow a system shown in the
documentation for construct_covmat
. If a family member does
not fit one of the shown, a suitable covariance matrix cannot be
constructed. If more detailed family structures are needed, please
contact one of the maintainers of LTFHPlus. Finally, the lower
and upper liability thresholds are provided for each family member.
Previously, we utilised list of lists in R to link information
to the correct individuals. Going forward, we will still support this
format, but in order to ease the use of non-R users, we have
opted for a long format, where each row is an indivdual instead of a
family, such that the input can be generated with the user’s preferred
software.
If you would like to use LTFHPlus, then the above can provide a
template for the input format. The object sims
is a list of
two entries, namely sim_obs
and thresholds
.
The values in sim_obs
are values such as the true
underlying liabilities, simulated ages and onset ages. etc.. The values
in thresholds
are ready to be used in
estimate_liability()
. Generating this input from ones own
register data can be tedious, as it requires identifying parents (role
for parents are \(m\) and \(f\)), then one can identify siblings (roles
for siblings are \(s1, s2, etc.\)), and
so on for all available roles. The available roles can be seen in the
documentation for estimate_liability()
.
With all input parameters generated and a chosen heritability, we have everything we need to run LT-FH++. We highly recommend the package Future to parallelize the computations for LT-FH++. Since all modern computers have more than 2 cores available to them, it should be available to all users. If a user is not able to parallelize, then LTFHPlus can still be used, but it will be far slower. The future framework also allows for easy use on high performance computing (HPC) clusters. In this example, we will utilize 4 threads.
#Setting up parallelization backend
plan(tweak(multisession, workers = nthreads))
#performs LT-FH++ analysis
simu_liab = estimate_liability(.tbl = sims$thresholds,
h2 = h2,
pid = "indiv_ID",
fam_id = "fam_ID",
role = "role")
## The number of workers is 4
simu_liab
## # A tibble: 10 × 4
## fam_ID indiv_ID genetic_est genetic_se
## <chr> <chr> <dbl> <dbl>
## 1 fam_ID_1 fam_ID_1 -0.00998 0.00417
## 2 fam_ID_2 fam_ID_2 -0.0158 0.00391
## 3 fam_ID_3 fam_ID_3 -0.0592 0.00418
## 4 fam_ID_4 fam_ID_4 0.836 0.00315
## 5 fam_ID_5 fam_ID_5 -0.0734 0.00424
## 6 fam_ID_6 fam_ID_6 -0.0353 0.00423
## 7 fam_ID_7 fam_ID_7 -0.0108 0.00397
## 8 fam_ID_8 fam_ID_8 -0.0204 0.00409
## 9 fam_ID_9 fam_ID_9 -0.0479 0.00407
## 10 fam_ID_10 fam_ID_10 -0.0735 0.00388
From here the object simu_liab
can be saved and used in
your GWAS method of choice.