A biological entity grounding search service
The identification of sub-cellular biological entities is an important consideration in the use and creation of bioinformatics analysis tools and accessible biological research apps. When research information is uniquely and unambiguously identified, it enables data to be accurately retrieved, cross-referenced, and integrated. In practice, biological entities are “identified” when they are associated with a matching record from a knowledge base that specialises in collecting and organising information of that type (e.g. gene sequences). Our search service increases the efficiency and ease of use for identifying biological entities. This identification may be used to power research apps and tools where common entity synonyms may be provided as input.
For instance, Biofactoid uses this grounding service to allow users to simply specify their preferred synonyms to identify biological entities (e.g. proteins):
https://user-images.githubusercontent.com/989043/140164756-1fa22796-1c60-4f13-9a2a-d65393b89155.mp4
To cite the Pathway Commons Grounding Search Service in a paper, please cite the Journal of Open Source Software paper:
Franz et al., (2021). A flexible search system for high-accuracy identification of biological entities and molecules. Journal of Open Source Software, 6(67), 3756, https://doi.org/10.21105/joss.03756
View the paper at JOSS or view the PDF directly.
The Pathway Commons Grounding Search Service is an academic project built and maintained by the Bader Lab at the University of Toronto, the Sander Lab at Harvard, and the Pathway and Omics Lab at the Oregon Health & Science University.
This project was funded by the US National Institutes of Health (NIH) [U41 HG006623, U41 HG003751, R01 HG009979 and P41 GM103504].
Install Docker (>=20.10.0) and Docker Compose (>=1.29.0).
Clone this repository, or at least download the docker-compose.yml file, then run:
docker-compose up
Swagger documentation can be accessed at http://localhost:3000.
NB: Server start will take some time, because Elasticsearch must initialize, the grounding data must be retrieved, and the index must be restored. If it takes more than 10 minutes, consider increasing the memory allocated to Docker (Preferences > Resources > Memory) and removing this line in docker-compose.yml: ES_JAVA_OPTS=-Xms2g -Xmx2g
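Because startup can take several minutes, a script that depends on the service may want to wait until the server responds. The following is a minimal sketch of such a wait loop, not part of the project itself. The function names, the 5-second polling interval and the choice to treat any HTTP response as "up" are assumptions made for illustration; the 10-minute limit mirrors the note above.

```python
import http.client
import time
import urllib.error
import urllib.request


def is_up(url, timeout=5.0):
    """Return True if an HTTP GET to `url` receives any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a 2xx status.
        return True
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, reset, or timed out: not ready yet.
        return False


def wait_until_ready(check, max_wait=600.0, interval=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call `check()` until it returns True or `max_wait` seconds have passed.

    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + max_wait
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

For example, `wait_until_ready(lambda: is_up("http://localhost:3000"))` returns True once the local server answers, or False after roughly 10 minutes.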
With Node.js (>=8) and Elasticsearch (>=6.6.0, <7) installed with default options, run the following in a cloned copy of the repository:
npm install: Install npm dependencies
npm run update: Download and index the data
npm start: Start the server (by default on port 3000)

Swagger documentation is available on a publicly-hosted instance of the service at https://grounding.baderlab.org. You can run queries to test the API on this instance.
Please do not use https://grounding.baderlab.org for your production apps or scripts.
Here, we provide usage examples in common languages for the main search API. For more details, please refer to the Swagger documentation at https://grounding.baderlab.org, which is also accessible when running a local instance.
const response = await fetch('http://hostname:port/search', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({ // search options here
    q: 'p53'
  })
});
const responseJSON = await response.json();
import requests
url = 'http://hostname:port/search'
body = {'q': 'p53'}
# json= serialises the body as JSON and sets Content-Type: application/json
response = requests.post(url, json=body)
responseJSON = response.json()
curl -X POST "http://hostname:port/search" -H "accept: application/json" -H "Content-Type: application/json" -d "{ \"q\": \"p53\" }"
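The Python example above relies on the third-party requests package. As a hedged alternative, the sketch below builds the same JSON POST with only the Python standard library. The function names are hypothetical, `hostname:port` remains a placeholder for your own instance, and only the `q` option shown in the examples above is used; other search options are described in the Swagger documentation.

```python
import json
import urllib.request


def build_search_request(base_url, query):
    """Build (but do not send) a JSON POST request for the /search endpoint."""
    payload = json.dumps({"q": query}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/search",
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )


def search(base_url, query, timeout=10.0):
    """Send the search request and return the decoded JSON response body."""
    request = build_search_request(base_url, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it possible to inspect the request (URL, headers, body) without a running server. A call such as `search("http://localhost:3000", "p53")` would query a local instance.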
Here, we summarise a set of tools that overlap to some degree with the main use case of the Pathway Commons Grounding Search Service, where a user searches for a biological entity grounding by providing only a commonly-used synonym. This table was last updated on 25 October 2021 (2021-10-25).
If you have developed a new tool in this space or your tool supports new features, let us know by making a pull request, and we’ll add your revision to this table.
Feature | PC Grounding Search | GProfiler | GNormPlus (PubTator) | Gilda | BridgeDB |
---|---|---|---|---|---|
Allows for searching by synonym | ● | ● | ● | ||
Supports multiple organisms | ● | ● | ● | ● | ● |
Accepts organism ranking preference | ● | ||||
Multiple organisms per query | ● | ● | Partial support (only one organism returned) | ||
Multiple results per query | ● | One per type (e.g. protein) | ● | ||
Multiple results are ranked based on relevance | ● | ● | |||
Speed/Throughput | < 100 ms | < 100 ms | < 100 ms | < 100 ms | < 1000 ms |
Allows querying for a particular grounding by ID | ● | ● | ● | ● | ● |
grounding-search uses data files provided by several public databases, referred to by the following source names:
ncbi
chebi
uniprot
fplx
If you have followed the Quick Start (“Run from source”), you can download and index the data provided by the source databases ncbi, chebi and uniprot by running:
npm run update
Downloading and building the index from source ensures that the latest information is indexed. Alternatively, to quickly retrieve and recreate the index, a dump of a previously indexed Elasticsearch instance has been published on Zenodo under the following DOI:
This data is published under the Creative Commons Zero v1.0 Universal license.
To restore, create a running Elasticsearch instance and run:
npm run restore
To both restore and start the grounding-search server run:
npm run boot
NB: The index dump published on Zenodo is offered for demonstration purposes only. We do not guarantee that this data will be up-to-date or that releases of grounding-search software will be compatible with any previously published version of the dump data. To ensure you are using the latest data compatible with grounding-search, follow the instructions in “Build the index database from source database files”.
To let us know about an issue in the software or to provide feedback, please file an issue on GitHub.
To make a contribution to this project, please start by filing an issue on GitHub that describes your proposal. Once your proposal is ready, you can make a pull request.
The following environment variables can be used to configure the server:
NODE_ENV: the environment mode, either production or development (default)
LOG_LEVEL: the level for the log file (info, warn, error)
PORT: the port on which the server runs (default 3000)
ELASTICSEARCH_HOST: the host:port that points to Elasticsearch
MAX_SEARCH_ES: the maximum number of results to return from Elasticsearch
MAX_SEARCH_WS: the maximum number of results to return in JSON from the web service
CHUNK_SIZE: how many grounding entries make up a chunk that gets bulk inserted into Elasticsearch
MAX_SIMULT_CHUNKS: the maximum number of chunks to insert simultaneously into Elasticsearch
INPUT_PATH: the path to the input folder where data files are located
INDEX: the Elasticsearch index name to store data from all data sources
UNIPROT_FILE_NAME: name of the file where uniprot data will be read from
UNIPROT_URL: URL to download the uniprot file from
CHEBI_FILE_NAME: name of the file where chebi data will be read from
CHEBI_URL: URL to download the chebi file from
NCBI_FILE_NAME: name of the file where ncbi data will be read from
NCBI_URL: URL to download the ncbi file from
NCBI_EUTILS_BASE_URL: URL for NCBI EUTILS
NCBI_EUTILS_API_KEY: NCBI EUTILS API key
FAMPLEX_URL: URL to download the FamPlex repository from
FAMPLEX_FILE_NAME: name of the file where FamPlex data will be read from
FAMPLEX_TYPE_FILTER: entity type to include (‘protein’, ‘complex’, ‘all’ [default])
ESDUMP_LOCATION: the location (URL, file path) of elasticdump files (note: terminate with ‘/’)
ZENODO_API_URL: base URL for Zenodo
ZENODO_ACCESS_TOKEN: access token for the Zenodo REST API (scope: deposit:actions, deposit:write)
ZENODO_BUCKET_ID: id for the Zenodo deposition ‘bucket’ (Files API)
ZENODO_DEPOSITION_ID: id for the Zenodo deposition (for a published dataset)

The following run targets are available:
npm start: start the server
npm stop: stop the server
npm run watch: watch mode (debug mode enabled, autoreload)
npm run refresh: run clear, update, then start
npm test: run tests for read-only methods (e.g. search and get), assuming that data already exists
npm run test:sample: run tests with sample data
npm run test:quality: run the search quality tests (expects full db)
npm run test:quality:csv: run the search quality tests and output a csv file
npm run lint: lint the project
npm run benchmark: run all benchmarking
npm run benchmark:source: run benchmarking for source (e.g. ncbi, chebi)
npm run clear: clear all data
npm run clear:source: clear data for source (e.g. ncbi, chebi)
npm run update: update all data (download then index)
npm run update:source: update data for source (e.g. ncbi, chebi) in Elasticsearch
npm run download: download all data
npm run download:source: download data for source (e.g. ncbi, chebi)
npm run index: index all data
npm run index:source: index data for source (e.g. ncbi, chebi) in Elasticsearch
npm run test:inputgen: generate input test file for each source (e.g. uniprot, …)
npm run test:inputgen: generate input test file for source (e.g. uniprot, …)
npm run dump: dump the information for INDEX to ESDUMP_LOCATION
npm run restore: restore the information for INDEX from ESDUMP_LOCATION
npm run boot: run clear, restore, then start; exit on errors

Zenodo lets you store and retrieve digital artefacts related to a scientific project or publication. Here, we use Zenodo to store Elasticsearch index dump data used to quickly recreate the index used by grounding-search.
Briefly, using their RESTful web service API, you can create a ‘Deposition’ for a record that has a ‘bucket’ referenced by a ZENODO_BUCKET_ID, to which you can upload and download ‘files’ (i.e. <ZENODO_API_URL>api/files/<ZENODO_BUCKET_ID>/<filename>; list them with https://zenodo.org/api/deposit/depositions/<deposition id>/files). In particular, there are three files required to recreate an index, corresponding to the Elasticsearch types: data, mapping and analyzer.
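The URL patterns above can be assembled as in the following sketch. The function names are hypothetical, and the sketch assumes ZENODO_API_URL is a base URL such as https://zenodo.org/ (the pattern above implies a trailing slash, which is added if missing). The file name is passed in by the caller because the naming of the three dump files is not specified here.

```python
from urllib.parse import quote


def _with_trailing_slash(api_url):
    """Ensure the base URL ends with '/', as the URL pattern expects."""
    return api_url if api_url.endswith("/") else api_url + "/"


def bucket_file_url(api_url, bucket_id, filename):
    """Build <ZENODO_API_URL>api/files/<ZENODO_BUCKET_ID>/<filename>."""
    base = _with_trailing_slash(api_url)
    return f"{base}api/files/{quote(bucket_id, safe='')}/{quote(filename, safe='')}"


def deposition_files_url(api_url, deposition_id):
    """Build the URL that lists the files of a deposition."""
    base = _with_trailing_slash(api_url)
    return f"{base}api/deposit/depositions/{deposition_id}/files"
```

For instance, `bucket_file_url("https://zenodo.org/", "<uuid>", "<filename>")` would be called with your own bucket id and file name in place of the placeholders.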
To set up, follow these steps:
1. Obtain a ZENODO_ACCESS_TOKEN by creating a ‘Personal access token’ (see docs for details). Be sure to add the deposit:actions and deposit:write scopes.
2. Create a deposition by sending a POST request to https://zenodo.org/api/deposit/depositions with at least the following information, keeping in mind to set the header Authorization = Bearer <ZENODO_ACCESS_TOKEN>:
{
  "metadata": {
    "title": "Elasticsearch data for biofactoid.org grounding-search service",
    "upload_type": "dataset",
    "description": "This deposition contains files with data describing an Elasticsearch index (https://github.com/PathwayCommons/grounding-search). The files were generated from the elasticdump npm package (https://www.npmjs.com/package/elasticdump). The data are the necessary and sufficient information to populate an Elasticsearch index.",
    "creators": [
      {
        "name": "Biofactoid",
        "affiliation": "biofactoid.org"
      }
    ],
    "access_right": "open",
    "license": "cc-zero"
  }
}
3. Note the bucket URL in the response (e.g. "bucket": "https://zenodo.org/api/files/<uuid>") within the links object. The variable ZENODO_BUCKET_ID is the value <uuid> in the example URL.
4. Dump the index (npm run dump). Log in to the Zenodo web page and click ‘Publish’ to make the deposition public. You may need to add a publication date (YYYY-MM-DD).
5. Verify the published deposition: confirm the data files are available; clear the index (npm run clear); do a restore (npm run restore), being sure to update the ZENODO_DEPOSITION_ID; and run the quality tests (npm run test:quality:csv).

Once published, a deposition cannot be updated or altered. However, you can create a new version of a record (below).
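The deposition request and the bucket id lookup described in the setup steps can be sketched as follows. This is an illustration under stated assumptions, not the project's own tooling: the function names are hypothetical, the request is only built (sending it is left to the caller), and the response is assumed to be the parsed JSON object whose links object contains the bucket URL.

```python
import json
import urllib.request

DEPOSITIONS_URL = "https://zenodo.org/api/deposit/depositions"


def build_create_deposition_request(access_token, metadata):
    """Build (but do not send) the POST request that creates a deposition."""
    body = json.dumps({"metadata": metadata}).encode("utf-8")
    return urllib.request.Request(
        url=DEPOSITIONS_URL,
        data=body,
        method="POST",
        headers={
            # Header format from the setup steps: Bearer <ZENODO_ACCESS_TOKEN>
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


def bucket_id_from_response(deposition):
    """Return the <uuid> at the end of links.bucket in a parsed response."""
    bucket_url = deposition["links"]["bucket"]
    return bucket_url.rstrip("/").rsplit("/", 1)[-1]
```

The value returned by `bucket_id_from_response` is what would be stored in ZENODO_BUCKET_ID.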
In this case, you already have a record which points to a published deposition (i.e. elasticsearch index files) and wish to create a new version for that record. Here, you’ll create a new deposition under the same record:
1. Send a POST request to https://zenodo.org/api/deposit/depositions/<deposition id>/actions/newversion to create a new version. Alternatively, visit https://zenodo.org/record/<deposition id>, where deposition id is that of the latest published version (default).
2. Send a GET request to https://zenodo.org/api/deposit/depositions?all_versions to list all your depositions and identify the new deposition bucket id.

All files in /test will be run by Mocha. You can run npm test to run all tests, or you can run npm test -- -g specific-test-name to run specific tests.
Chai is included to make the tests easier to read and write.
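To publish a release:
1. Make sure the tests are passing: npm test
2. Make sure the linting is passing: npm run lint
3. Bump the version number with npm version, in accordance with semver. The version command in npm updates both package.json and git tags, but note that it uses a v prefix on the tags (e.g. v1.2.3).
   For a bug fix or patch release, run npm version patch.
   For a new feature release, run npm version minor.
   For a breaking API change, run npm version major.
   For a specific version number (e.g. 1.2.3), run npm version 1.2.3.
4. Push the release: git push && git push --tags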
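To make the effect of the three version bumps concrete, here is a small illustrative sketch of the semver arithmetic and the v-prefixed tag name. It is not how npm implements npm version; the function names are hypothetical, and pre-release and build-metadata suffixes are deliberately ignored.

```python
def bump(version, part):
    """Return `version` (MAJOR.MINOR.PATCH) with `part` incremented per semver."""
    major, minor, patch = (int(n) for n in version.lstrip("v").split("."))
    if part == "major":
        return f"{major + 1}.0.0"   # breaking change: reset minor and patch
    if part == "minor":
        return f"{major}.{minor + 1}.0"   # new feature: reset patch
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"   # bug fix
    raise ValueError(f"unknown part: {part!r}")


def git_tag(version):
    """Return the tag name with the v prefix described above."""
    return "v" + version
```

Under these rules, bumping 1.2.3 gives 1.2.4 for patch, 1.3.0 for minor and 2.0.0 for major, and the tag for 1.2.3 is v1.2.3.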