项目作者: chunlinli

项目描述 :
R functions for estimation and inference of an interventional directed acyclic graph
高级语言: R
项目地址: git://github.com/chunlinli/intdag.git
创建时间: 2020-06-11T04:31:30Z
项目社区:https://github.com/chunlinli/intdag

开源协议:MIT License

下载


intdag: An R Package For Interventional DAG Estimation and Inference

This repository contains an implementation of the paper

Installation

R Packages

First, install the package glmtlp which implements the DC projection algorithm in the paper.

  1. devtools::install_github("chunlinli/glmtlp")

Then install the package intdag which offers the peeling causal discovery method and data-perturbation/asymptotic inference.

  1. devtools::install_github("chunlinli/intdag/intdag")

R Scripts of utilities

Before proceeding, source the following R scripts for utility functions required in the following illustration.

Assume the working directory is the cloned repository https://github.com/chunlinli/intdag.

  1. source("utility.R")
  2. set.seed(1110)

A Quick Example

First, we generate a random graph with p=10 and q=20.

  1. p <- 10
  2. q <- 20
  3. graph <- graph_generation(p = p, q = q, graph_type = "random", iv_sufficient = FALSE)

Note that we use the option iv_sufficient = FALSE, which means a crucial assumption in the paper — Assumption 1C is violated.
Let’s print the $U$ matrix.

  1. graph$u

This is a p by p adjacency matrix of the DAG that we want to recover and/or make inference. Note that $U{jk} \neq 0$ means an edge from $Y{j}$ to $Y_{k}$.

Then print the $W$ matrix.

  1. graph$w

This is a q by p matrix, indicating the interventional relations of an intervention varibale $X{l}$ and a primary variable $Y{j}$, where $W{lj} \neq 0$ means $X{l}$ directly intervenes on $Y_{j}$.
In above, w corresponds to the simulation Setup B in the paper.

Next, generate a random sample of size n=200 based on the graph. According to the analysis in the paper, the distribution of intervention variables $X$ does not matter too much. Here we generate $X$ so that they have an AR(1) correlation structure.

  1. n <- 200
  2. x <- matrix(rnorm(n * q), nrow = n, ncol = q)
  3. rho <- 0.5
  4. if (rho != 0) {
  5. for (j in 2:q) {
  6. x[, j] <- sqrt(1 - rho^2) * x[, j] + rho * x[, j - 1]
  7. }
  8. }

Note that x is an n by q matrix.

Then we generate the $Y$ variables, the ones of primary interest.

  1. y <- (x %*% graph$w + rmvnorm(n, sigma = diag(seq(from = 0.5, to = 1, length.out = p), p, p))) %*% solve(diag(p) - graph$u)

Here y is an n by p matrix.

Now fit the model using intdag. Note that intdag calls glmtlp for solving multi-response regression in its first step.

  1. v_out <- v_estimation(y = y, x = x, model_selection = "bic")

Second, use peeling algorithm to recover the topological layers.

  1. top_out <- topological_order(v_out$v)
  2. an_mat <- top_out$an_mat
  3. in_mat <- top_out$in_mat

Third, refit to estimate $U$ and $W$.

  1. discovery_out <- causal_discovery(y = y, x = x, an_mat = an_mat, in_mat = in_mat)

The final estimate for $U$ is:

  1. discovery_out$u

We compare the final estimate with the true graph in the structural Hamming distance.

  1. sum(abs((abs(discovery_out$u) > 0.05) - (graph$u != 0)))

Here we use a truncation threshold 0.05 to screen the small (noisy) coefficients.

Contents

The R package is placed in directory ./intdag/.

The extensive simulation code is placed in directory ./simulation/.

Citing information

If you find the code useful, please consider citing

  1. @article{li2023inference,
  2. author = {Chunlin Li, Xiaotong Shen, Wei Pan},
  3. title = {Inference for a large directed acyclic graph with unspecified interventions},
  4. journal = {Journal of Machine Learning Research},
  5. year = {2023}
  6. }

Implementing these algorithms is error-prone and this code project is still in development.
Please file an issue if you encounter any error when using the code. I will be grateful to be informed.