Implementation and demo of explainable coding of clinical notes with Hierarchical Label-wise Attention Networks (HLAN)
This project proposes an explainable automated medical coding approach based on a Hierarchical Label-wise Attention Network (HLAN) and label embedding initialisation. The approach can be applied to multi-label text classification in any domain.
A detailed explanation of the approach is given in the accompanying paper.
A part of the results (especially regarding label embedding initialisation) was presented as a virtual oral presentation at HealTAC 2020, with slides available.
Update: added the `--do_hierarchical_evaluation` flag for hierarchical evaluation (6 Sep 2021).
The key computation graph is implemented in `def inference_per_label(self)` in `./HLAN/HAN_model.py`.
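For orientation, below is a minimal NumPy sketch of the label-wise attention idea that this graph implements. The function name, array shapes, and dot-product scoring are illustrative assumptions, not the repository's TensorFlow code.

```python
# Minimal NumPy sketch of label-wise attention; illustrative only, the actual
# graph is built in inference_per_label(self) in ./HLAN/HAN_model.py.
import numpy as np

def label_wise_attention(H, U):
    """H: (num_tokens, hidden) token/sentence representations.
       U: (num_labels, hidden) one attention context (query) vector per label."""
    scores = U @ H.T                                   # (num_labels, num_tokens)
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)      # softmax over tokens, per label
    return weights @ H                                  # (num_labels, hidden): one document vector per label
```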
To train on a new dataset:

- Format your data so that each document is one line in the form `doc__label__labelA labelB labelC` (see the parsing sketch after this list); for details see datasets. The data can either be split into train[-validation]-test sets (each split as a single file) or left unsplit (a single data file).
- Train word and label embeddings: see embeddings, and also the notebook from caml-mimic for embeddings from MIMIC-III. The trained embeddings from MIMIC-III can be downloaded from Onedrive (3.5G with other files).
- Add a data block for your dataset (`if FLAGS.dataset == "YOUR_DATASET_NAME":`) with the variables specified in `HAN_train.py`. Please read the example code block and the comments provided closely. For the MIMIC-III dataset settings, use the existing data blocks in the code.
- Run `python HAN_train.py --dataset YOUR_DATASET_NAME` with arguments; see details in Training the models.
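As a minimal sketch of the expected input format, assuming the literal `__label__` separator shown above (the parsing below is purely illustrative, not the repository's loader in `data_util_gensim.py`):

```python
# Toy example: one document per line in the form doc__label__labelA labelB labelC.
line = "patient admitted with chest pain and shortness of breath__label__401.9 427.31 414.01"

text, label_str = line.split("__label__")  # split the free text from the label codes
labels = label_str.split()                 # -> ['401.9', '427.31', '414.01']
print(text, labels)
```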
First, ensure that you have requested the MIMIC-III dataset. Place the files `D_ICD_DIAGNOSES.csv` and `D_ICD_PROCEDURES.csv` under the `knowledge_bases` folder.
Second, download the files in the `checkpoints`, `cache_vocabulary_label_pik`, and `embeddings` folders from Onedrive (3.5G with other files).
Third, run the Jupyter Notebook demo `demo_HLAN_viz.ipynb` and try it with your own discharge summaries or those in the MIMIC-III dataset. By setting `to_input` in the notebook to `True`, the notebook will ask you to input or paste a discharge summary; otherwise, you can save your discharge summaries, one per line, under the `..\dataset\` folder and set `filename_to_predict` to that filename (see Section 2.part2 in the notebook). After running, the predictions are displayed with label-wise attention visualisations. The attention visualisations are further stored as `.xlsx` files in the `..\explanation\` folder.
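For example, the two notebook settings might be set as follows (the filename is a hypothetical placeholder):

```python
# Section 2 of demo_HLAN_viz.ipynb: read summaries from a file instead of pasting them in.
to_input = False
filename_to_predict = 'my_discharge_summaries.txt'  # hypothetical file under ..\dataset\, one summary per line
```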
Project structure:

- `./HLAN/HAN_train.py` contains code for configuration and training.
- `./HLAN/HAN_model.py` contains the computational graph, loss function, and optimisation.
- `./HLAN/data_util_gensim.py` contains code for input and target generation.
- `./HLAN/demo_HLAN_viz.ipynb` and `./HLAN/model_predict_util.py` contain code for the Jupyter Notebook demo and its helper functions.
- `./HLAN/evaluation_setup.py` and `./HLAN/multi_level_eval.py` contain code from CoPHE for the hierarchical evaluation of multi-label classification.
- `./embeddings` contains self-trained word2vec embeddings: word embeddings and label embeddings.
- `./datasets` contains the datasets used.
- `./checkpoints` contains the checkpoints of the HLAN, HA-GRU, and HAN models trained by the author on the MIMIC-III datasets.
- `./explanations` contains the Excel sheets displaying the attention visualisations, generated after running the demo in `./HLAN/demo_HLAN_viz.ipynb`.
- `./knowledge_bases` contains knowledge sources used for label subsumption relations and the ICD code description files.
- `./cache_vocabulary_label_pik` stores the cached `.pik` files about vocabularies and labels.
- `./results-HEALTAC 2020` contains the CNN, CNN+att, Bi-GRU, and BERT results with label embedding initialisation.

After getting access to MIMIC-III, obtain the data split from CAML using their preprocessing script, then follow the steps in "How to Train on New Data".
`--dataset` is set to `mimic3-ds-50` for MIMIC-III-50, `mimic-ds` for MIMIC-III, and `mimic-ds-shielding-th50` for MIMIC-III-shielding.
To use label embedding initialisation (+LE), set `--use_label_embedding` to `True`; otherwise, set it to `False`.
All the `--marking_id` values below are simply labels for the command; they appear in the names of the output files, can be changed to other values, and do not affect training.
To train the HLAN model with the MIMIC-III-50 dataset:

```
python HAN_train.py --dataset mimic3-ds-50 --batch_size 32 --per_label_attention=True --per_label_sent_only=False --num_epochs 100 --report_rand_pred=False --running_times 1 --early_stop_lr 0.00002 --remove_ckpts_before_train=False --use_label_embedding=True --ckpt_dir checkpoint_HLAN+LE_50/ --use_sent_split_padded_version=False --marking_id 50-hlan --gpu=True
```
To train the HA-GRU model, change `--per_label_sent_only` to `True` while keeping `--per_label_attention` as `True`.

To train with the MIMIC-III-50 dataset:

```
python HAN_train.py --dataset mimic3-ds-50 --batch_size 32 --per_label_attention=True --per_label_sent_only=True --num_epochs 100 --report_rand_pred=False --running_times 1 --early_stop_lr 0.00002 --remove_ckpts_before_train=False --use_label_embedding=True --ckpt_dir checkpoint_HAGRU+LE_50/ --use_sent_split_padded_version=False --marking_id 50-hagru --gpu=True
```
To train the HAN model, change `--per_label_attention` to `False`. `--batch_size` is changed to `128` for this model in the experiments.

To train with the MIMIC-III-50 dataset:

```
python HAN_train.py --dataset mimic3-ds-50 --batch_size 128 --per_label_attention=False --per_label_sent_only=False --num_epochs 100 --report_rand_pred=False --running_times 1 --early_stop_lr 0.00002 --remove_ckpts_before_train=False --use_label_embedding=True --ckpt_dir checkpoint_HAN+LE_50/ --use_sent_split_padded_version=False --marking_id 50-han --gpu=True
```
For all the models above, you can set the learning rate (`--learning_rate`), number of epochs (`--num_epochs`), early-stopping learning rate (`--early_stop_lr`), and other configurations when you run the command, or set them in the `*_train.py` files.
Setting `--running_times` to `k` (for example, `--running_times 10`) reports averaged results and standard deviations over `k` runs.
For hierarchical evaluation results using CoPHE, add the flag `--do_hierarchical_evaluation=True`.
Check the full list of configurations in `HAN_train.py`.
To view the training loss and validation loss curves, run the command below, replacing `$PATH-logs$` with the actual path to the log directory:

```
tensorboard --logdir $PATH-logs$
```
The key part of the implementation of label embedding initialisation is in the two functions `def assign_pretrained_label_embedding_per_label` (for HLAN and HA-GRU) and `def assign_pretrained_label_embedding` (for HAN) in `./HLAN/HAN_train.py`.

In addition, below is the implementation of label embedding initialisation on top of `model.py` from the caml-mimic GitHub project.
```python
# based on https://github.com/jamesmullenbach/caml-mimic/blob/master/learn/models.py
# requires: import numpy as np; import torch; from gensim.models import Word2Vec
def _code_emb_init(self, code_emb, code_list):
    # code_emb is the path to a pre-trained Gensim Word2Vec model of label embeddings
    # code_list is a list of codes in the same order as in the multi-hot representation
    # (sorted by frequency from high to low)
    code_embs = Word2Vec.load(code_emb)
    # bound for the uniform random variables used in Xavier initialisation
    bound = np.sqrt(6.0) / np.sqrt(self.num_labels + code_embs.vector_size)
    weights = np.zeros(self.classifier.weight.size())
    n_exist, n_inexist = 0, 0
    for i in range(self.num_labels):
        code = code_list[i]
        if code in code_embs.wv.vocab:  # gensim < 4.0; use code_embs.wv.key_to_index in gensim 4+
            n_exist += 1
            vec = code_embs.wv[code]
            # normalise to unit length
            weights[i] = vec / float(np.linalg.norm(vec) + 1e-6)
            # additional standardisation for BERT models:
            # standardise to match the original initialisation in def _init_weights(self, module) in
            # https://huggingface.co/transformers/_modules/transformers/modeling_bert.html
            # weights[i] = stats.zscore(weights[i]) * self.initializer_range  # self.initializer_range = 0.02
        else:
            n_inexist += 1
            # use the original Xavier uniform initialisation for CNN, CNN+att, and BiGRU
            weights[i] = np.random.uniform(-bound, bound, code_embs.vector_size)
            # or use the original normal-distribution initialisation for BERT
            # weights[i] = np.random.normal(0, std, code_embs.vector_size)
    print("code exists embedding:", n_exist, " ;code not exist embedding:", n_inexist)
    # initialise the weights of the final linear layer with the label embeddings
    self.classifier.weight.data = torch.Tensor(weights).clone()
    print("final layer: code embedding initialised")
```
We used the MIMIC-III dataset with the preprocessing steps from caml-mimic to generate the two dataset settings MIMIC-III and MIMIC-III-50. We also created a MIMIC-III-shielding dataset based on the NHS shielding ICD-10 codes. See details in the datasets page.
We used the Continuous Bag-of-Words (CBoW) algorithm in Gensim word2vec (see `gensim.models.word2vec.Word2Vec`) on all label sets in the training data. The code for training word and label embeddings is available in `train_word_embedding.py` and `train_code_embedding.py`.
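As a minimal sketch of training label embeddings with CBoW in Gensim, assuming toy label sets and a hypothetical output filename (this is not the repository's actual training script):

```python
from gensim.models import Word2Vec

# Each "sentence" is the label set of one training document (toy placeholder data).
label_sets = [["401.9", "427.31", "414.01"],
              ["038.9", "584.9", "401.9"]]

# sg=0 selects the Continuous Bag-of-Words (CBoW) algorithm.
# In gensim < 4.0 the dimension argument is `size`; in gensim 4+ it is `vector_size`.
model = Word2Vec(sentences=label_sets, sg=0, size=100, window=5, min_count=1, workers=4)
model.save("label-embedding.model")  # hypothetical output filename
```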
For CNN, CNN+att, Bi-GRU models:
For BERT models: