vignettes/FromCIPFamilyToLTFHPlusInput.Rmd
Here we simulate a possible input format. We create `tbl`, which contains information on each person to attach thresholds to. It should contain each family member along with the needed information. Here, we simply use the proband with no family members. Next, we create `CIP`, which contains the cumulative incidence proportions (CIPs). The CIP is stratified by birth year and sex for illustrative purposes. If users only have CIPs stratified by sex, it would simply have one fewer column. Please note that all values shown here are for illustration only.
```r
library(dplyr)  # re-exports tibble() and %>%

n_sim = 10
tbl = tibble(
  fam_id = paste0("fam", 1:n_sim),
  pid = 1:n_sim,
  role = rep("o", n_sim),  # every row is a proband
  sex = sample(x = 0:1, size = n_sim, replace = TRUE),
  status = sample(size = n_sim, x = 0:1, replace = TRUE),
  age = sample(size = n_sim, x = 1:90, replace = TRUE),
  birth_year = 2023 - age,
  # age of onset: drawn from 1..age for cases, NA for controls
  aoo = purrr::map2_dbl(.x = status, .y = age,
                        .f = ~ ifelse(.x == 1, sample(size = 1, x = 1:.y), NA))
) %>%
  print()
```

```
## # A tibble: 10 × 8
##    fam_id   pid role    sex status   age birth_year   aoo
##    <chr>  <int> <chr> <int>  <int> <int>      <dbl> <dbl>
##  1 fam1       1 o         1      1    52       1971    17
##  2 fam2       2 o         1      1    12       2011    10
##  3 fam3       3 o         0      0    48       1975    NA
##  4 fam4       4 o         0      0    88       1935    NA
##  5 fam5       5 o         1      0    10       2013    NA
##  6 fam6       6 o         0      0    27       1996    NA
##  7 fam7       7 o         0      0    43       1980    NA
##  8 fam8       8 o         0      0    84       1939    NA
##  9 fam9       9 o         0      0    76       1947    NA
## 10 fam10     10 o         0      1    47       1976    43
```
```r
#### THIS IS A DUMMY CIP. DO NOT USE IT WITH REAL-WORLD DATA ####
CIP = expand.grid(list(age = 1:100,
                       birth_year = 1900:2024,
                       sex = 0:1)) %>%
  group_by(sex, birth_year) %>%
  # within each sex and birth-year stratum, cip rises linearly with age
  mutate(cip = (1:n() - 1)/n() * .1) %>%
  ungroup() %>%
  print()
#### THIS IS A DUMMY CIP. DO NOT USE IT WITH REAL-WORLD DATA ####
```

```
## # A tibble: 25,000 × 4
##      age birth_year   sex   cip
##    <int>      <int> <int> <dbl>
##  1     1       1900     0 0
##  2     2       1900     0 0.001
##  3     3       1900     0 0.002
##  4     4       1900     0 0.003
##  5     5       1900     0 0.004
##  6     6       1900     0 0.005
##  7     7       1900     0 0.006
##  8     8       1900     0 0.007
##  9     9       1900     0 0.008
## 10    10       1900     0 0.009
## # ℹ 24,990 more rows
```
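For the case mentioned above where CIPs are only stratified by sex, the table would simply lack the `birth_year` column. The following is a minimal sketch of such a dummy table, using the same made-up linear curve. The name `CIP_sex_only` is hypothetical, and the suggestion that `CIP_merge_columns` would then be `c("age", "sex")` is an assumption based on how that argument is used in this vignette, not a documented recipe.

```r
library(dplyr)

# DUMMY values for illustration only; do not use with real-world data.
# CIP_sex_only is a hypothetical name, not an object from the package.
CIP_sex_only <- expand.grid(list(age = 1:100, sex = 0:1)) %>%
  group_by(sex) %>%
  # same made-up linear curve as in the birth-year stratified example
  mutate(cip = (1:n() - 1) / n() * .1) %>%
  ungroup()

# Assumption: the merge columns would then be c("age", "sex")
# instead of c("age", "birth_year", "sex").
head(CIP_sex_only)
```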
Assigning thresholds to each person in `tbl` can now be done with the function `prepare_LTFHPlus_input()`. The thresholds can be assigned in two ways. The first matches each person's combination of birth year, sex, and age directly to the combinations present in the `CIP` object. The second uses interpolation to predict the CIP value between the observed combinations of birth year, sex, and age that are present in the `CIP` object. Currently, only interpolation with the xgboost package is supported. Interpolation can be useful because real-world data often contain ages or ages of onset expressed as decimals, and rounding them may lead to large jumps in CIP values. The outputs below can be subset such that only the required information is left. For direct input into `estimate_liability()`, only the family and personal ID columns are needed, as well as role (if the graph input is not used) and the lower and upper columns.
Without interpolation, we match on the combinations of birth year, sex, and age that are present in both the `tbl` and `CIP` objects. The thresholds can then be assigned in the following way:
```r
library(LTFHPlus)

tbl2 = prepare_LTFHPlus_input(.tbl = tbl,
                              CIP = CIP,
                              age_col = "age",
                              aoo_col = "aoo",
                              CIP_merge_columns = c("age", "birth_year", "sex"),
                              CIP_cip_col = "cip",
                              status_col = "status",
                              use_fixed_case_thr = FALSE,
                              fam_id_col = "fam_id",
                              personal_id_col = "pid",
                              interpolation = NA,  # no interpolation: match directly on CIP_merge_columns
                              min_CIP_value = 1e-4)
tbl2
```

```
## # A tibble: 10 × 12
##    fam_id   pid role    sex status   age birth_year   aoo   cip   thr  lower
##    <chr>  <int> <chr> <int>  <int> <dbl>      <dbl> <dbl> <dbl> <dbl>  <dbl>
##  1 fam1       1 o         1      1    17       1971    17 0.016  2.14   2.14
##  2 fam2       2 o         1      1    10       2011    10 0.009  2.37   2.37
##  3 fam3       3 o         0      0    48       1975    NA 0.047  1.67   -Inf
##  4 fam4       4 o         0      0    88       1935    NA 0.087  1.36   -Inf
##  5 fam5       5 o         1      0    10       2013    NA 0.009  2.37   -Inf
##  6 fam6       6 o         0      0    27       1996    NA 0.026  1.94   -Inf
##  7 fam7       7 o         0      0    43       1980    NA 0.042  1.73   -Inf
##  8 fam8       8 o         0      0    84       1939    NA 0.083  1.39   -Inf
##  9 fam9       9 o         0      0    76       1947    NA 0.075  1.44   -Inf
## 10 fam10     10 o         0      1    43       1976    43 0.042  1.73   1.73
## # ℹ 1 more variable: upper <dbl>
```
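The printed `thr` values appear to be the upper-tail standard normal quantile of `cip`, and `lower` equals `thr` for cases and `-Inf` for controls. The sketch below reproduces that relationship for the first four rows using base R only. It is an inference from the printed output, not a description of the package internals, and it leaves out `upper` because that column is not shown above.

```r
# cip and status copied from the first four rows of the printed output
cip    <- c(0.016, 0.009, 0.047, 0.087)
status <- c(1, 1, 0, 0)

# Assumption: thr is the upper-tail standard normal quantile of cip
thr <- qnorm(cip, lower.tail = FALSE)

# As seen in the lower column: cases get thr, controls get -Inf
lower <- ifelse(status == 1, thr, -Inf)

data.frame(cip, status, thr = round(thr, 2), lower = round(lower, 2))
```

The rounded values can be compared against the `thr` and `lower` columns in the output above.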
If decimal ages and ages of onset are present, we can interpolate the CIP values with the xgboost package. The input does not change, except for `interpolation = "xgboost"`; the call below additionally sets `xgboost_itr = 30`, which corresponds to the 30 training rounds in the log. The current input data does not contain decimal values, but for illustrative purposes we use it as-is. Parameters can be passed to xgboost through the `bst.params` argument.
```r
tbl2_xgb = prepare_LTFHPlus_input(.tbl = tbl,
                                  CIP = CIP,
                                  age_col = "age",
                                  aoo_col = "aoo",
                                  CIP_merge_columns = c("age", "birth_year", "sex"),
                                  CIP_cip_col = "cip",
                                  status_col = "status",
                                  use_fixed_case_thr = FALSE,
                                  fam_id_col = "fam_id",
                                  personal_id_col = "pid",
                                  interpolation = "xgboost",
                                  xgboost_itr = 30,
                                  min_CIP_value = 1e-4)
```
```
## [1]  train-rmse:0.040135
## [2]  train-rmse:0.028111
## [3]  train-rmse:0.019688
## [4]  train-rmse:0.013790
## [5]  train-rmse:0.009659
## [6]  train-rmse:0.006766
## [7]  train-rmse:0.004740
## [8]  train-rmse:0.003321
## [9]  train-rmse:0.002328
## [10] train-rmse:0.001632
## [11] train-rmse:0.001144
## [12] train-rmse:0.000802
## [13] train-rmse:0.000563
## [14] train-rmse:0.000395
## [15] train-rmse:0.000278
## [16] train-rmse:0.000196
## [17] train-rmse:0.000139
## [18] train-rmse:0.000101
## [19] train-rmse:0.000075
## [20] train-rmse:0.000058
## [21] train-rmse:0.000047
## [22] train-rmse:0.000041
## [23] train-rmse:0.000037
## [24] train-rmse:0.000035
## [25] train-rmse:0.000034
## [26] train-rmse:0.000034
## [27] train-rmse:0.000034
## [28] train-rmse:0.000034
## [29] train-rmse:0.000034
## [30] train-rmse:0.000034
```
```r
tbl2_xgb
```

```
## # A tibble: 10 × 13
##    fam_id   pid role    sex status   age birth_year   aoo cip_pred event_age
##    <chr>  <int> <chr> <int>  <int> <int>      <dbl> <dbl>    <dbl>     <dbl>
##  1 fam2       2 o         1      1    12       2011    10  0.00900        10
##  2 fam5       5 o         1      0    10       2013    NA  0.00900        10
##  3 fam1       1 o         1      1    52       1971    17  0.0160         17
##  4 fam6       6 o         0      0    27       1996    NA  0.0260         27
##  5 fam7       7 o         0      0    43       1980    NA  0.0420         43
##  6 fam10     10 o         0      1    47       1976    43  0.0420         43
##  7 fam3       3 o         0      0    48       1975    NA  0.0470         48
##  8 fam9       9 o         0      0    76       1947    NA  0.0751         76
##  9 fam8       8 o         0      0    84       1939    NA  0.0830         84
## 10 fam4       4 o         0      0    88       1935    NA  0.0871         88
## # ℹ 3 more variables: thr <dbl>, lower <dbl>, upper <dbl>
```
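To illustrate why interpolation can matter for decimal ages, the sketch below compares matching on a rounded age with interpolating between tabulated ages. It uses plain linear interpolation via `stats::approx()` as a simple stand-in; this is not how the xgboost option works internally. The curve and the decimal age of 42.6 are made-up values.

```r
# Dummy CIP curve for a single sex and birth-year stratum (illustrative only)
cip_curve <- data.frame(age = 1:100, cip = (0:99) / 100 * 0.1)

event_age <- 42.6  # hypothetical decimal age

# Matching on a rounded age snaps to the nearest tabulated value ...
cip_rounded <- cip_curve$cip[cip_curve$age == round(event_age)]

# ... while linear interpolation (a stand-in for xgboost) lands between
# the values tabulated for ages 42 and 43.
cip_interp <- approx(x = cip_curve$age, y = cip_curve$cip, xout = event_age)$y

c(rounded = cip_rounded, interpolated = cip_interp)
```

On this smooth dummy curve the difference is small; with steeper real-world CIP curves the gap between neighbouring ages, and hence the effect of rounding, can be larger.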
From here, the objects `tbl2` and `tbl2_xgb` can be subset to the relevant columns and used in `estimate_liability()`. See LT-FH++ Example for an example of this.
The objects can also be subset to contain just the family and personal ID columns, as well as the lower and upper columns, and then used as input in `prepare_graph()` to assign the threshold information to each individual as attributes. See LT-FH++ Graph Example for details.
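The two subsets described above can be sketched with `dplyr::select()`. To keep the example self-contained, it uses a small made-up stand-in (`tbl2_toy`, a hypothetical name) rather than the real output of `prepare_LTFHPlus_input()`. The `upper` values in particular are placeholders, since that column is not printed in this vignette. The calls to `estimate_liability()` and `prepare_graph()` are left out; see the linked examples for their arguments.

```r
library(dplyr)

# Toy stand-in for the output of prepare_LTFHPlus_input(); values are made up
tbl2_toy <- tibble(
  fam_id = c("fam1", "fam3"),
  pid    = c(1L, 3L),
  role   = "o",
  sex    = c(1L, 0L),
  cip    = c(0.016, 0.047),
  thr    = qnorm(cip, lower.tail = FALSE),
  lower  = c(thr[1], -Inf),  # case, then control
  upper  = thr               # placeholder; not taken from the package output
)

# Columns for direct input (role is kept when the graph input is not used)
liability_input <- tbl2_toy %>%
  select(fam_id, pid, role, lower, upper)

# Columns for the graph-based input: IDs and thresholds only
graph_thresholds <- tbl2_toy %>%
  select(fam_id, pid, lower, upper)
```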