SciFive: a text-to-text transformer model for biomedical literature
SciFive provides a text-to-text framework for biomedical language tasks in NLP. Built on the T5 framework and described in the paper "SciFive: a text-to-text transformer model for biomedical literature", SciFive achieves state-of-the-art or competitive results on multiple biomedical natural language tasks.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Use the GPU when available; the model and its inputs must be on the same device.
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("razent/SciFive-base-Pubmed")
model = AutoModelForSeq2SeqLM.from_pretrained("razent/SciFive-base-Pubmed").to(device)

sentence = "Identification of APC2 , a homologue of the adenomatous polyposis coli tumour suppressor ."
# The T5 tokenizer appends the end-of-sequence token </s>, so it is not added by hand.
encoding = tokenizer(sentence, padding="max_length", max_length=256, truncation=True, return_tensors="pt")
input_ids = encoding["input_ids"].to(device)
attention_mask = encoding["attention_mask"].to(device)

outputs = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_length=256,
    early_stopping=True,
)
for output in outputs:
    line = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(line)
Our base Google Cloud Storage URI is gs://scifive
As described in our paper, we have released six versions of SciFive, each benchmarked to achieve state-of-the-art results on different biomedical tasks. They are all available in our Google Cloud bucket, and we are working on releasing the models on HuggingFace as well.
Instructions for accessing Cloud Storage from the command line with the Python-based gsutil tool are described here
The following table contains pretrained SciFive checkpoints.
Model | Size | Step | Config | Checkpoint |
---|---|---|---|---|
SciFive Pubmed | base & large | 1194600 & 1196500 | T5 configs | gs://scifive/models/pubmed/{size}/ |
SciFive Pubmed+PMC | base & large | 1200000 | T5 configs | gs://scifive/models/pubmed_pmc/{size}/ |
SciFive PMC | base & large | 1200000 | T5 configs | gs://scifive/models/pmc/{size}/ |
{size} is either base or large.
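The `{size}` placeholder can also be expanded programmatically. The following is a minimal sketch, not part of the SciFive codebase: the helper names `checkpoint_uri` and `gsutil_copy_command` are hypothetical, and the bucket layout is assumed to match the table above. It only builds a `gsutil cp -r` command as a list of arguments and does not run it.

```python
from typing import List

# Checkpoint URI templates copied from the table above.
CHECKPOINT_TEMPLATES = {
    "pubmed": "gs://scifive/models/pubmed/{size}/",
    "pubmed_pmc": "gs://scifive/models/pubmed_pmc/{size}/",
    "pmc": "gs://scifive/models/pmc/{size}/",
}
SIZES = ("base", "large")


def checkpoint_uri(corpus: str, size: str) -> str:
    """Return the Cloud Storage URI for one checkpoint (hypothetical helper)."""
    if corpus not in CHECKPOINT_TEMPLATES:
        raise ValueError(f"unknown corpus: {corpus!r}")
    if size not in SIZES:
        raise ValueError(f"size must be one of {SIZES}, got {size!r}")
    return CHECKPOINT_TEMPLATES[corpus].format(size=size)


def gsutil_copy_command(corpus: str, size: str, dest: str) -> List[str]:
    """Build, but do not run, a recursive gsutil copy command (hypothetical helper)."""
    return ["gsutil", "-m", "cp", "-r", checkpoint_uri(corpus, size), dest]


if __name__ == "__main__":
    # Print every URI implied by the table.
    for corpus in CHECKPOINT_TEMPLATES:
        for size in SIZES:
            print(checkpoint_uri(corpus, size))
```

The returned list can be passed to `subprocess.run(..., check=True)` on a machine where gsutil is installed and authenticated.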
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Use the GPU when available; the model and its inputs must be on the same device.
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("razent/SciFive-large-Pubmed_PMC-MedNLI")
model = AutoModelForSeq2SeqLM.from_pretrained("razent/SciFive-large-Pubmed_PMC-MedNLI").to(device)

sent_1 = "In the ED, initial VS revealed T 98.9, HR 73, BP 121/90, RR 15, O2 sat 98% on RA."
sent_2 = "The patient is hemodynamically stable"
# The task prefix and field names tell the model which fine-tuned task to perform.
text = f"mednli: sentence1: {sent_1} sentence2: {sent_2}"

encoding = tokenizer(text, padding="max_length", max_length=256, truncation=True, return_tensors="pt")
input_ids = encoding["input_ids"].to(device)
attention_mask = encoding["attention_mask"].to(device)

outputs = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_length=8,
    early_stopping=True,
)
for output in outputs:
    line = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(line)
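To score several premise/hypothesis pairs, the prompt construction from the example above can be factored into a small helper. This is a sketch under stated assumptions: the function names are hypothetical, and `parse_label` assumes the model decodes to one of the three standard MedNLI class names, which should be checked against real outputs.

```python
from typing import List, Optional, Tuple

# The three MedNLI classes. That the model emits exactly these strings is an assumption.
MEDNLI_LABELS = ("entailment", "neutral", "contradiction")


def format_mednli(premise: str, hypothesis: str) -> str:
    """Build the text-to-text prompt in the same format as the example above."""
    return f"mednli: sentence1: {premise} sentence2: {hypothesis}"


def build_prompts(pairs: List[Tuple[str, str]]) -> List[str]:
    """Format a list of (premise, hypothesis) pairs into prompts."""
    return [format_mednli(premise, hypothesis) for premise, hypothesis in pairs]


def parse_label(decoded: str) -> Optional[str]:
    """Normalize decoded text to a MedNLI label, or None if it is not one (hypothetical helper)."""
    label = decoded.strip().lower()
    return label if label in MEDNLI_LABELS else None
```

The list returned by `build_prompts` can be passed to the tokenizer in a single call, for example `tokenizer(prompts, padding=True, return_tensors="pt")`, so that `model.generate` processes the pairs as one batch.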
All of the fine-tuning datasets, already pre-processed into text-to-text format, are also available at this link
If you use the SciFive models or our code in a publication, please cite:
@misc{phan2021scifive,
      title={SciFive: a text-to-text transformer model for biomedical literature},
      author={Long N. Phan and James T. Anibal and Hieu Tran and Shaurya Chanana and Erol Bahadroglu and Alec Peltekian and Grégoire Altan-Bonnet},
      year={2021},
      eprint={2106.03598},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}