# REST web service to compute and query Latent Dirichlet Allocation models
This library provides a Python REST web service exposing a simple pipeline to create and query LDA models. Information about created models is stored in a MongoDB database, and model files are stored in a shared folder on the host filesystem.

The system is composed of two Docker containers:
- `web`, the web service, reachable at `localhost` on port 5000
- `db`, the MongoDB instance, reachable at the `db` hostname on port 27017
Download the code from the repository, then edit the `docker-compose.yml` file to update the mount source directories that will be shared between the containers and the host.
Then, run:

```shell
docker-compose build
docker-compose up
docker-compose exec web python db/load_fake_data.py
```
The web container exposes the service's REST APIs. The following table describes the available endpoints:
Endpoint | HTTP request | Description | Parameters |
---|---|---|---|
`models/` | GET | Lists all models | - |
`models/` | PUT | Creates a new model with the provided parameters | `model_id`: str, the id of the model to be created; `number_of_topics`: int, the number of topics to extract; `language`: str, the language of the documents (e.g. `en`); `use_lemmer`: bool, true to perform lemmatisation, false to perform stemming; `min_df`: int, the minimum number of documents that must contain a term for it to be considered; `max_df`: float, the maximum percentage of documents that may contain a term for it to be considered valid; `chunksize`: int, the size of a chunk in LDA; `num_passes`: int, the minimum number of passes through the dataset during LDA learning; `waiting_seconds`: int, the number of seconds to wait before starting the learning; `data_filename`: str, the name of a file in the `data` folder containing a json dump of documents, each with `doc_id` and `doc_content` keys; `data`: json dictionary mapping document ids to document contents; `assign_topics`: bool, true to assign topics to the newly created model and save them on the db, false to ignore assignments for the learning documents |
`models/<model-id>` | GET | Shows detailed information about the model with id `<model-id>` | - |
`models/<model-id>/documents/` | GET | Lists all documents assigned to the model with id `<model-id>` | - |
`models/<model-id>/` | DELETE | Deletes the model with the specified id; stops the computation if scheduled or running | - |
`models/<model-id>/documents/<doc-id>` | GET | Shows detailed information about the document with id `<doc-id>` in model `<model-id>` | `threshold`: float, the minimum probability a topic must have to be returned as associated with the document |
`models/<model-id>/neighbors/` | GET | Computes and shows documents similar to the specified text | `text`: str, the text to categorize; `limit`: int, the maximum number of similar documents to return |
`models/<model-id>/documents/<doc-id>/neighbors/` | GET | Computes and shows documents similar to the document with id `<doc-id>` | `limit`: int, the maximum number of similar documents to return |
`models/<model-id>/topics/` | GET | Lists all topics of the model with id `<model-id>`, or extracts topics from a text if `text` is specified | Only when extracting topics from a text: `text`: str, the text to compute topics for; `threshold`: float, the minimum weight a topic must have to be returned |
`models/<model-id>/topics/` | SEARCH | Computes and returns all topics assigned to the text | `text`: str, the text to compute topics for; `threshold`: float, the minimum weight a topic must have to be returned |
`models/<model-id>/topics/<topic-id>` | GET | Shows detailed information about the topic with id `<topic-id>` in model `<model-id>` | `threshold`: float, the minimum probability a topic must have to be returned as associated with the document |
`models/<model-id>/topics/<topic-id>/documents` | GET | Shows all documents associated with the topic with id `<topic-id>` in model `<model-id>` | `threshold`: float, the minimum probability of the topic that a document must have to be returned as associated with the topic |
`models/<model-id>/topics/<topic-id>/documents` | PUT | Computes topics associated with the provided documents (single if `doc_id` and `doc_content` are set, multiple if `documents` is set) in model `<model-id>` | `documents`: json dictionary, optional, keys are document ids and values are document contents; `doc_id`: str, optional, the document id (single-document case); `doc_content`: str, optional, the document content; `save_on_db`: bool, default true, true to save documents and topic assignments on the db, false to return them and forget |
`models/<model-id>/topics/<topic-id>` | PATCH | Updates optional information of the topic with id `<topic-id>` in model `<model-id>` | `label`: str, optional, the topic label; `description`: str, optional, the topic description |
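As an illustration of how a client might call these APIs, the sketch below builds a `PUT models/` request in Python using only the standard library. The field names and URL follow the table above and the `localhost:5000` address described earlier; the specific values (`example_model`, the sample documents) are made up for the example.

```python
import json
import urllib.request

# Request body for PUT /models/; field names are taken from the API table above.
payload = {
    "model_id": "example_model",   # id of the model to be created
    "number_of_topics": 10,        # number of topics to extract
    "language": "en",
    "use_lemmer": True,            # lemmatisation instead of stemming
    "min_df": 2,                   # a term must appear in at least 2 documents
    "max_df": 0.8,                 # a term must appear in at most 80% of documents
    "chunksize": 2000,
    "num_passes": 5,
    "waiting_seconds": 0,
    "assign_topics": True,
    # Inline documents: document id -> document content.
    "data": {"doc_1": "first example document", "doc_2": "second example document"},
}

body = json.dumps(payload).encode("utf-8")
request = urllib.request.Request(
    "http://localhost:5000/models/",
    data=body,
    method="PUT",
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(request) would send the request once the service is up.
```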
To load fake data into the database, run the following command from the machine that is running Docker:

```shell
docker-compose exec web python db/load_fake_data.py
```
To connect directly to the MongoDB instance, use the `db` hostname on port 27017.
When asking for a model's detailed information, the model can be in one of the following statuses:

- `scheduled`: the model computation will start after the specified waiting period
- `computing`: the model computation has started and is currently running
- `completed`: the model computation is finished and the model is stable
- `killed`: the model computation has been interrupted by an error

The language can be specified in the model creation message. Each model handles a single language, chosen from:
- `en` for English documents
- `it` for Italian documents; stopwords are available in the `/app/resources` folder and lemmatisation is performed with MorphIt
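Since `scheduled` and `computing` are transient and `completed` and `killed` are final, a client can poll a model until it stabilises. The sketch below is illustrative only: `fetch_status` is a hypothetical stand-in for a `GET models/<model-id>` request that returns the status string.

```python
# The four statuses come from the list above.
TRANSIENT = {"scheduled", "computing"}
FINAL = {"completed", "killed"}

def wait_until_stable(fetch_status):
    """Call fetch_status() repeatedly until a final status is reached."""
    while True:
        status = fetch_status()
        if status in FINAL:
            return status

# Simulated sequence of statuses a model might go through:
statuses = iter(["scheduled", "computing", "completed"])
result = wait_until_stable(lambda: next(statuses))
```

A real client would add a sleep between polls instead of looping as fast as possible.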
During model computation it is possible to load documents in two ways:

- load from file: provide the `data_filename` field in the request. The file must be a json file contained in the `data` folder, holding a list of dictionaries, each representing a document with the keys `doc_id` and `doc_content`. For example:

  ```json
  [
      {"doc_id": "doc_1", "doc_content": "doc content 1"},
      {"doc_id": "doc_2", "doc_content": "doc content 2"},
      {"doc_id": "doc_3", "doc_content": "doc content 3"}
  ]
  ```

- load directly: provide the documents in the `data` field, a dictionary whose keys are document ids and whose values are document contents. For example:

  ```json
  {
      "doc_1": "doc content 1",
      "doc_2": "doc content 2",
      "doc_3": "doc content 3"
  }
  ```
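The two formats carry the same information, so converting between them is a one-liner. The sketch below (a hypothetical client-side helper, not part of the service) turns the file-based list format into the inline `data` dictionary:

```python
# Documents in the file-based format: a list of dicts with doc_id / doc_content keys.
docs_list = [
    {"doc_id": "doc_1", "doc_content": "doc content 1"},
    {"doc_id": "doc_2", "doc_content": "doc content 2"},
    {"doc_id": "doc_3", "doc_content": "doc content 3"},
]

# Equivalent inline `data` dictionary: document id -> document content.
data = {doc["doc_id"]: doc["doc_content"] for doc in docs_list}
```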
To build all containers:

```shell
docker-compose build
```

To run all containers:

```shell
docker-compose up
```

To execute a command within a running container (e.g. to load fake data into the mongo database):

```shell
docker-compose exec web COMMAND ARGS
```