Automatic family construction

Manually constructing family history from registers can be tedious. It can be easily done for first and second degree relatives, however, it starts to become very difficult once further degrees of relatives needs to be identified. Other R packages, such as Kinship2 or FamAgg, allows users to manipulate pedigrees and perform certain actions and calculations on them, however, to the best of our knowledge, they do not allow one to find all relatives of degree \(n\) or closer of a given individual. This particular problem is core to LT-FH++, as the family members of a proband (often the genotyped individual) will need to be identified as well as the family members’ mutual kinship coefficient for the covariance matrix. With the introduction of the function prepare_graph(), we have made it possible to automatically construct families for a list of probands. Below, we will provide an example of how the function can be used in connection with the estimate_liability() function of the package. All data is simulated and should not be used as-is in any real-world analysis.

Constructing the population pedigree

The input is inspired by register information such as the trio information provided by the danish CPR register. It usually follows the format of having a column with the child’s ID (here simply id), followed by columns for the mother’s and father’s ID (here momcol and dadcol). Here, we provide an illustrative example of how the register information may look. The names used are representative of their relation to a proband named “pid”, with “pgm” refering to paternal grandmother, “mgf” refering to maternal grandfather, the “hs” prefix means half-sibling, the new parents of half-siblings will appear with the “H” or “W” suffix for husband or wife, cousins are prefaced with the first two letters of their parents, etc..

Additionally, we will also need threshold information for the phenotype(s) of interest. The information is baked into the graph object as attributes, and retrieved as needed later by the estimate_liability() function. It is possible to construct the graph without this information and only construct the desired families.

# example of register trio input
family = tribble(
  ~id, ~momcol, ~dadcol,
  "pid", "mom", "dad",
  "sib", "mom", "dad",
  "mhs", "mom", "dad2",
  "phs", "mom2", "dad",
  "mom", "mgm", "mgf",
  "dad", "pgm", "pgf",
  "dad2", "pgm2", "pgf2",
  "paunt", "pgm", "pgf",
  "pacousin", "paunt", "pauntH",
  "hspaunt", "pgm", "newpgf",
  "hspacousin", "hspaunt", "hspauntH",
  "puncle", "pgm", "pgf",
  "pucousin", "puncleW", "puncle",
  "maunt", "mgm", "mgf",
  "macousin", "maunt", "mauntH",
  "hsmuncle", "newmgm", "mgf",
  "hsmucousin", "hsmuncleW", "hsmuncle"
) %>% print()
## # A tibble: 17 × 3
##    id         momcol    dadcol  
##    <chr>      <chr>     <chr>   
##  1 pid        mom       dad     
##  2 sib        mom       dad     
##  3 mhs        mom       dad2    
##  4 phs        mom2      dad     
##  5 mom        mgm       mgf     
##  6 dad        pgm       pgf     
##  7 dad2       pgm2      pgf2    
##  8 paunt      pgm       pgf     
##  9 pacousin   paunt     pauntH  
## 10 hspaunt    pgm       newpgf  
## 11 hspacousin hspaunt   hspauntH
## 12 puncle     pgm       pgf     
## 13 pucousin   puncleW   puncle  
## 14 maunt      mgm       mgf     
## 15 macousin   maunt     mauntH  
## 16 hsmuncle   newmgm    mgf     
## 17 hsmucousin hsmuncleW hsmuncle
# random thresholds to store as attributes
thrs =  tibble(
 id = family %>% select(1:3) %>% unlist() %>% unique(),
 lower = sample(c(-Inf, 2), size = length(id), replace = TRUE),
 upper = sample(c(2, Inf), size = length(id), replace = TRUE)) %>% print()
## # A tibble: 31 × 3
##    id       lower upper
##    <chr>    <dbl> <dbl>
##  1 pid       -Inf     2
##  2 sib          2     2
##  3 mhs          2   Inf
##  4 phs       -Inf     2
##  5 mom          2   Inf
##  6 dad       -Inf   Inf
##  7 dad2      -Inf     2
##  8 paunt        2     2
##  9 pacousin     2   Inf
## 10 hspaunt   -Inf     2
## # ℹ 21 more rows

Constructing the graph

With the above input, we can construct the graph for the population specified in family and attach the information in thrs:

graph = prepare_graph(.tbl = family, 
                      thresholds = thrs,
                      fcol = "dadcol",
                      mcol = "momcol",
                      icol = "id")
graph
## IGRAPH 940c4ef DN-- 31 44 -- 
## + attr: name (v/c), lower (v/n), upper (v/n)
## + edges from 940c4ef (vertex names):
##  [1] dad     ->pid        mom     ->pid        dad     ->sib       
##  [4] mom     ->sib        dad2    ->mhs        mom     ->mhs       
##  [7] dad     ->phs        mom2    ->phs        mgf     ->mom       
## [10] mgm     ->mom        pgf     ->dad        pgm     ->dad       
## [13] pgf2    ->dad2       pgm2    ->dad2       pgf     ->paunt     
## [16] pgm     ->paunt      pauntH  ->pacousin   paunt   ->pacousin  
## [19] newpgf  ->hspaunt    pgm     ->hspaunt    hspauntH->hspacousin
## [22] hspaunt ->hspacousin pgf     ->puncle     pgm     ->puncle    
## + ... omitted several edges

The object graph is a directed graph with the connections specified in the family object and attributes (attr) from thrs. The package igraph can be used to do any manipulation of the object graph. The object graph only contains a single family in this example, however, it could just as easily contain the entirety of the danish CPR register, totalling more than \(10\) million individuals. We can create a local graph (formally, a neighborhood sub-graph) around a proband, only including individuals that are within \(n\) steps of said proband. The \(n\) steps correspond to \(n\) degree relatives, such that \(n = 1\) yields all first degree relatives, \(n = 2\) yields all second degree relatives, etc.. From this local graph, we can construct a kinship matrix and the covariance matrix required by estimate_liability().

Extracting local graphs

If we have constructed a graph on the entire Danish CPR register and a subset of these individuals have been genotyped, then we can create local graphs around each of the genotyped individuals. Here, we simply use every individual for the illustration:

genotyped_ids = V(graph)$name

family_graphs = tibble(
  pid = genotyped_ids,
  fam_graph = make_ego_graph(graph, order = 2, nodes = pid)
)

Here, family_graphs is a tibble with an id column called “pid” and a column called “fam_graph” with list of all the local graphs, centred on the individual from “pid”.

construct kinship matrix

We can construct a kinship matrix for a local graph in the following way:

get_kinship(family_graphs$fam_graph[[1]], h2 = 1, index_id = family_graphs$pid[1])
##         pid  sib  mhs  phs mom dad paunt puncle maunt  mgm  pgm  mgf  pgf pid_g
## pid    1.00 0.50 0.25 0.25 0.5 0.5  0.25   0.25  0.25 0.25 0.25 0.25 0.25  1.00
## sib    0.50 1.00 0.25 0.25 0.5 0.5  0.25   0.25  0.25 0.25 0.25 0.25 0.25  0.50
## mhs    0.25 0.25 1.00 0.00 0.5 0.0  0.00   0.00  0.25 0.25 0.00 0.25 0.00  0.25
## phs    0.25 0.25 0.00 1.00 0.0 0.5  0.25   0.25  0.00 0.00 0.25 0.00 0.25  0.25
## mom    0.50 0.50 0.50 0.00 1.0 0.0  0.00   0.00  0.50 0.50 0.00 0.50 0.00  0.50
## dad    0.50 0.50 0.00 0.50 0.0 1.0  0.50   0.50  0.00 0.00 0.50 0.00 0.50  0.50
## paunt  0.25 0.25 0.00 0.25 0.0 0.5  1.00   0.50  0.00 0.00 0.50 0.00 0.50  0.25
## puncle 0.25 0.25 0.00 0.25 0.0 0.5  0.50   1.00  0.00 0.00 0.50 0.00 0.50  0.25
## maunt  0.25 0.25 0.25 0.00 0.5 0.0  0.00   0.00  1.00 0.50 0.00 0.50 0.00  0.25
## mgm    0.25 0.25 0.25 0.00 0.5 0.0  0.00   0.00  0.50 1.00 0.00 0.00 0.00  0.25
## pgm    0.25 0.25 0.00 0.25 0.0 0.5  0.50   0.50  0.00 0.00 1.00 0.00 0.00  0.25
## mgf    0.25 0.25 0.25 0.00 0.5 0.0  0.00   0.00  0.50 0.00 0.00 1.00 0.00  0.25
## pgf    0.25 0.25 0.00 0.25 0.0 0.5  0.50   0.50  0.00 0.00 0.00 0.00 1.00  0.25
## pid_g  1.00 0.50 0.25 0.25 0.5 0.5  0.25   0.25  0.25 0.25 0.25 0.25 0.25  1.00

Note: individuals such as “hspaunt” are not present, since they are third degree relation.

Calculating genetic liabilities

With the object family_graphs, we have constructed an object with the required information and format for estimating the genetic liabilities of a list of probands. We have automatically extracted the family present for a given proband up to degree \(2\) (can be generalised to any integer \(n\)), which allows us to construct the kinship matrix. Next, we have attached the lower and upper threshold for each individual during the construction of the graph, and as such follow the indivdual in whatever local graph they are a part of. Finally, we need a liability-scale heritability. We will simply use \(h^2 = 0.5\) for illustrative purposes. We will use estimate_liability() in the folling way:

ltfhpp = estimate_liability(
  family_graphs = family_graphs,
  h2 = .5,
  pid = "pid",
  family_graphs_col = "fam_graph"
) %>% print()
## The number of workers is 1
## # A tibble: 31 × 4
##    fam_ID   pid      genetic_est genetic_se
##    <chr>    <chr>          <dbl>      <dbl>
##  1 pid      pid            1.05     0.00263
##  2 sib      sib            1.43     0.00165
##  3 mhs      mhs            1.65     0.00197
##  4 phs      phs            0.325    0.00347
##  5 mom      mom            1.82     0.00179
##  6 dad      dad            0.721    0.00306
##  7 dad2     dad2           0.779    0.00280
##  8 paunt    paunt          1.23     0.00171
##  9 pacousin pacousin       1.62     0.00193
## 10 hspaunt  hspaunt        0.597    0.00334
## # ℹ 21 more rows