First, we will load the packages that we will need for the example:

For simplicity’s sake, we will use the same prevalence for both sexes. However, it is worth noting that in real-world applications, the cumulative incidence proportions (CIPs) needed for the thresholds will rarely be identical for both sexes. A common way to estimate the CIP is with the Aalen-Johansen estimator that accounts for competing events.

Setting up input

Single trait LT-FH++ requires 3 types of input, namely the liability-scale heritability, the family relationships and the status, age or age of onset, sex, and birth year for each family member, and finally cumulative incidence proportions (CIPs). The more detailed cumulative incidence proportions, the better. In the LT-FH++ paper, we used CIPs that were stratified by birth year and sex and fixed the upper and lower liability threshold for cases to \(\Phi(1 - CIP)\), since it provided a very accurate estimate for the full liability. Each CIP curve was a function of age, where age represented the age of onset or the current age of a control. If less detailed information, such as CIPs stratified by sex, but not birth year, we recommend setting the lower threshold to be \(\Phi(1 - CIP)\) and the upper limit to infinity for cases. With this information, the thresholds needed for the age-dependent liability threshold model can be assigned. We have provided a function called prepare_LTFHPlus_input that can help users convert the input into a suitable format. An example of how the input may look and how to use the function can be found in From CIP and family to LT-FH++ input.

Generating phenotypes

We have implemented a function that allows users to simulate under the liability threshold model for a given family structure, i.e. mother, father, sibling, or sibling1, sibling2, child1, child2, etc. The function simply needs a family structure, heritability, and population prevalence.For simplicity’s sake, we will use the same prevalence for both sexes. However, it is worth noting that in real-world applications, the cumulative incidence proportions needed for the thresholds, will rarely be identical for both sexes. The function simulate_under_LTM outputs a list with two tibbles. The first one sim_obs contains the underlying liabilities, status, and age or age of onset for each family and family members. The second tibble thresholds contains the formatted input that is ready to be input to estimate the genetic liability.

sims <- simulate_under_LTM(fam_vec = c("m","f","s1"), 
                           n_fam = NULL, 
                           add_ind = TRUE, 
                           h2 = h2, 
                           n_sim = 10, 
                           pop_prev = .1)
sims
## $sim_obs
## # A tibble: 10 × 14
##    fam_ID           g       o       m       f      s1 o_status m_status f_status
##    <chr>        <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <lgl>    <lgl>    <lgl>   
##  1 fam_ID_1  -0.0930   0.438  -0.514  -0.244   0.0223 FALSE    FALSE    FALSE   
##  2 fam_ID_2   0.575    0.150   0.175   0.698  -0.208  FALSE    FALSE    FALSE   
##  3 fam_ID_3  -0.219   -0.349  -2.40   -0.441   0.108  FALSE    FALSE    FALSE   
##  4 fam_ID_4   0.692    0.880  -1.09    3.65   -0.202  FALSE    FALSE    TRUE    
##  5 fam_ID_5   0.702    0.846  -0.789   0.0164  0.526  FALSE    FALSE    FALSE   
##  6 fam_ID_6  -0.0935  -0.0723 -0.644   0.0447 -1.34   FALSE    FALSE    FALSE   
##  7 fam_ID_7  -0.00631 -0.497  -0.102   1.12    0.430  FALSE    FALSE    FALSE   
##  8 fam_ID_8  -1.30    -1.33   -2.62    0.318  -1.23   FALSE    FALSE    FALSE   
##  9 fam_ID_9  -0.0503   0.0632 -0.715  -1.00   -1.60   FALSE    FALSE    FALSE   
## 10 fam_ID_10  0.0616   0.528   0.0112 -0.939  -1.68   FALSE    FALSE    FALSE   
## # ℹ 5 more variables: s1_status <lgl>, o_aoo <dbl>, m_aoo <dbl>, f_aoo <dbl>,
## #   s1_aoo <int>
## 
## $thresholds
## # A tibble: 40 × 5
##    fam_ID    indiv_ID    role  lower upper
##    <chr>     <chr>       <chr> <dbl> <dbl>
##  1 fam_ID_1  fam_ID_1_1  o      -Inf  3.35
##  2 fam_ID_2  fam_ID_2_1  o      -Inf  3.28
##  3 fam_ID_3  fam_ID_3_1  o      -Inf  2.59
##  4 fam_ID_4  fam_ID_4_1  o      -Inf  2.51
##  5 fam_ID_5  fam_ID_5_1  o      -Inf  3.52
##  6 fam_ID_6  fam_ID_6_1  o      -Inf  2.91
##  7 fam_ID_7  fam_ID_7_1  o      -Inf  3.38
##  8 fam_ID_8  fam_ID_8_1  o      -Inf  3.24
##  9 fam_ID_9  fam_ID_9_1  o      -Inf  3.28
## 10 fam_ID_10 fam_ID_10_1 o      -Inf  2.43
## # ℹ 30 more rows

thresholds keeps track of families and family members with fam_id and indiv_id. Here we simply use dummy variables for both. In real-world data, the indiv_id is usually some pseudonymized identifier unique to each individual. There is total freedom to set fam_id to whatever, as long as it is unique to each family. One choice could be the pseudonymized identifier for the index person in a family, or simply numerate the families from \(1\) to the total number of families. Next, role identifies each family member’s relationship to the index person. They follow a system shown in the documentation for construct_covmat. If a family member does not fit one of the shown, a suitable covariance matrix cannot be constructed. If more detailed family structures are needed, please contact one of the maintainers of LTFHPlus. Finally, the lower and upper liability thresholds are provided for each family member. Previously, we utilised list of lists in R to link information to the correct individuals. Going forward, we will still support this format, but in order to ease the use of non-R users, we have opted for a long format, where each row is an indivdual instead of a family, such that the input can be generated with the user’s preferred software.

Generating your own input data

If you would like to use LTFHPlus, then the above can provide a template for the input format. The object sims is a list of two entries, namely sim_obs and thresholds. The values in sim_obs are values such as the true underlying liabilities, simulated ages and onset ages. etc.. The values in thresholds are ready to be used in estimate_liability(). Generating this input from ones own register data can be tedious, as it requires identifying parents (role for parents are \(m\) and \(f\)), then one can identify siblings (roles for siblings are \(s1, s2, etc.\)), and so on for all available roles. The available roles can be seen in the documentation for estimate_liability().

Running LT-FH++

With all input parameters generated and a chosen heritability, we have everything we need to run LT-FH++. We highly recommend the package Future to parallelize the computations for LT-FH++. Since all modern computers have more than 2 cores available to them, it should be available to all users. If a user is not able to parallelize, then LTFHPlus can still be used, but it will be far slower. The future framework also allows for easy use on high performance computing (HPC) clusters. In this example, we will utilize 4 threads.

#Setting up parallelization backend
plan(tweak(multisession, workers = nthreads))
#performs LT-FH++ analysis
simu_liab = estimate_liability(.tbl = sims$thresholds,
                               h2 = h2,
                               pid = "indiv_ID",
                               fam_id = "fam_ID",
                               role = "role")
## The number of workers is 4
simu_liab
## # A tibble: 10 × 4
##    fam_ID    indiv_ID  genetic_est genetic_se
##    <chr>     <chr>           <dbl>      <dbl>
##  1 fam_ID_1  fam_ID_1     -0.00998    0.00417
##  2 fam_ID_2  fam_ID_2     -0.0158     0.00391
##  3 fam_ID_3  fam_ID_3     -0.0592     0.00418
##  4 fam_ID_4  fam_ID_4      0.836      0.00315
##  5 fam_ID_5  fam_ID_5     -0.0734     0.00424
##  6 fam_ID_6  fam_ID_6     -0.0353     0.00423
##  7 fam_ID_7  fam_ID_7     -0.0108     0.00397
##  8 fam_ID_8  fam_ID_8     -0.0204     0.00409
##  9 fam_ID_9  fam_ID_9     -0.0479     0.00407
## 10 fam_ID_10 fam_ID_10    -0.0735     0.00388

From here the object simu_liab can be saved and used in your GWAS method of choice.