Semantic Annotator
Semantic Annotator is a library that generates semantic annotations based on a system of syntactic patterns. It relies on the Stanford CoreNLP library.
A tagger is a single JSON file which describes how to extract annotations from a text using a list of rules, together with a list of tests that validate the tagger.
Semantic Annotator loads all taggers contained in a given directory and validates all rules and unit tests on load. It can then generate the annotations defined by those taggers.
A tagger can import other taggers, which are executed before its own rules. However, only the generatedTagLabels flagged as exported in the tagger itself are returned.
{
  "importRules": [ "otherTagger1", "otherTagger2" ],
  "rules": []
}
A collection is a single JSON file listing other taggers.
{
  "collection": [
    "otherTagger1",
    "otherTagger2"
  ],
  "unitTests": [
    {
      "verbatim": [ "My cat is red", "Your dog is blue" ],
      "generatedTagLabels": [ "coloredPet" ]
    }
  ]
}
A rule describes how to detect an annotation based on a syntactic pattern (or regular expression) and/or how to apply transformations to the input text. A list of samples allows the rule to be validated.
For instance, the following rule will replace the first matching group with “SMALL”.
{
  "pattern": "(petit@ADJ|mini|minuscule) :NC",
  "substitutions": "1:SMALL",
  "samples": [ "le petit lavabo" ]
}
This second rule will generate the annotation “smallDog” if the syntactic pattern matches.
{
  "pattern": "@SMALL chien",
  "generatedTagLabels": [ { "value": "smallDog", "exported": true } ],
  "samples": [ "ce petit chien", "le petit chien" ]
}
A token pattern describes a single token (aka ‘word’).
Its syntax is:
text@type[;property=value]*
where:
- text: the exact text value or its lemma
- type: the “part of speech” label of the token, such as noun, verb, adjective, etc. It uses the Treebank POS tag set.
- property=value: a list of morphosyntactic properties separated by semicolons.
Any part of a token pattern can be empty.
You can combine multiple patterns by separating them with the boolean operator | (or).
token pattern | will match tokens |
---|---|
samples@NN;lemma=sample;nb=p | samples |
@NN;lemma=sample;nb=p | samples |
samples;lemma=sample;nb=p | samples |
samples;lemma=sample | samples or sample |
samples@V | none |
samples@ | samples or sample |
@ | any token |
samples@ \| example:NN | samples, sample, example or examples |
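The token pattern syntax can be illustrated with a small parser. The following is a minimal sketch, not the library's implementation: the class `TokenPatternSketch` and its `parse` method are hypothetical names introduced here, and the sketch handles a single token pattern only (it does not split alternatives on `|`).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Semantic Annotator: it only illustrates
// the text@type;property=value syntax.
final class TokenPatternSketch {
    final String text; // empty string means "any text"
    final String type; // empty string means "any part of speech"
    final Map<String, String> properties = new LinkedHashMap<>();

    private TokenPatternSketch(String text, String type) {
        this.text = text;
        this.type = type;
    }

    static TokenPatternSketch parse(String pattern) {
        // Properties follow the head, each introduced by a semicolon.
        String[] parts = pattern.trim().split(";");
        String head = parts[0];
        // The head is text@type; a missing '@' means no type was given.
        int at = head.indexOf('@');
        String text = at < 0 ? head : head.substring(0, at);
        String type = at < 0 ? "" : head.substring(at + 1);
        TokenPatternSketch result = new TokenPatternSketch(text, type);
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("Expected property=value: " + parts[i]);
            }
            result.properties.put(parts[i].substring(0, eq), parts[i].substring(eq + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        TokenPatternSketch p = parse("samples@NN;lemma=sample;nb=p");
        // prints "samples / NN / {lemma=sample, nb=p}"
        System.out.println(p.text + " / " + p.type + " / " + p.properties);
    }
}
```

Because every part may be empty, `parse("@")` yields an empty text, an empty type and no properties, which corresponds to the "any token" row of the table.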
A syntactic pattern describes the syntactic structure of a text. It is composed of a sequence of token patterns.
You can define groups using parentheses. This is useful for applying substitutions or a quantifier to the group.
Quantifiers include ? (zero or one occurrence) and * (zero or more occurrences):
syntactic pattern | will match text |
---|---|
@DT (@JJ)? dog@NN | the dog, the big dog, a small dog, etc. |
(@DT)* dog@NN | dog, the dog, the the dog, etc. |
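The behaviour of the two quantified patterns in the table can be reproduced with ordinary `java.util.regex` patterns. This is only an analogy, not how Semantic Annotator matches tokens: it assumes the text has already been tagged and written as `word/TAG` pairs separated by single spaces, and the class `QuantifierSketch` is a hypothetical name introduced for the example.

```java
import java.util.regex.Pattern;

// Hypothetical illustration: regular expressions over "word/TAG" strings
// standing in for syntactic patterns over tokens.
final class QuantifierSketch {
    // Analogue of "@DT (@JJ)? dog@NN": a determiner, an optional adjective, then "dog".
    static final Pattern OPTIONAL_ADJECTIVE =
            Pattern.compile("\\S+/DT (\\S+/JJ )?dog/NN");

    // Analogue of "(@DT)* dog@NN": any number of determiners, then "dog".
    static final Pattern ANY_DETERMINERS =
            Pattern.compile("(\\S+/DT )*dog/NN");

    // True when the whole tagged text matches the pattern.
    static boolean matches(Pattern pattern, String taggedText) {
        return pattern.matcher(taggedText).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(OPTIONAL_ADJECTIVE, "the/DT dog/NN"));        // true
        System.out.println(matches(OPTIONAL_ADJECTIVE, "the/DT big/JJ dog/NN")); // true
        System.out.println(matches(OPTIONAL_ADJECTIVE, "dog/NN"));               // false
        System.out.println(matches(ANY_DETERMINERS, "dog/NN"));                  // true
        System.out.println(matches(ANY_DETERMINERS, "the/DT the/DT dog/NN"));    // true
    }
}
```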
The “substitutions” member of a rule allows a matching group to be replaced by a given tag.
The syntax is: groupIndex:value
where:
- groupIndex: the index of the group to replace (starting at 1)
- value: the value which will replace the specified group

Several substitutions can be given, separated by commas.
syntactic pattern | substitutions | text | result |
---|---|---|---|
@DT (dog@NN \| cat@NN) | 1:Pet | the dog | the Pet |
(hello@ \| hi) (@NNP) | 1:HI,2:WHO | Hello Bryan | HI WHO |
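The `groupIndex:value` mechanism can be sketched with plain `java.util.regex` capture groups. This is an analogy over raw characters, not the library's token-based implementation: the class `SubstitutionSketch`, its `apply` method and the regular expressions used below are assumptions made for the example, and the sketch assumes the groups are not nested.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of "groupIndex:value" substitutions using regex groups.
final class SubstitutionSketch {
    // Applies a spec such as "1:HI,2:WHO" to the first match of regex in text.
    static String apply(String regex, String spec, String text) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(text);
        if (!m.find()) {
            return text; // no match: the text is left unchanged
        }
        // Parse each "groupIndex:value" entry of the comma-separated spec.
        Map<Integer, String> valueByGroup = new HashMap<>();
        for (String entry : spec.split(",")) {
            int colon = entry.indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Expected groupIndex:value: " + entry);
            }
            valueByGroup.put(Integer.parseInt(entry.substring(0, colon).trim()),
                    entry.substring(colon + 1).trim());
        }
        // Replace groups from last to first so that earlier offsets stay valid.
        StringBuilder out = new StringBuilder(text);
        for (int g = m.groupCount(); g >= 1; g--) {
            String value = valueByGroup.get(g);
            if (value != null && m.start(g) >= 0) {
                out.replace(m.start(g), m.end(g), value);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(apply("the (dog|cat)", "1:Pet", "the dog"));              // the Pet
        System.out.println(apply("(hello|hi) (\\w+)", "1:HI,2:WHO", "Hello Bryan")); // HI WHO
    }
}
```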
The Semantic Annotator Console lets you test and debug your taggers. It is a command-line interface which can be used alongside any text editor to validate your tagger files.
Each time you save a file, the Semantic Annotator Console validates its content and displays debug information in case of error. Watch the video.
This video describes how to create substitutions.
This video describes how to use a shared tagger.
Each time you save a file, the Semantic Annotator Console also validates all taggers which depend on it, to check that your modifications do not introduce regressions. Watch the video.
Rules based on regular expressions are also validated. Watch the video.
You can test your taggers easily. Watch the video.
This feature allows you to run all your taggers on a large text file (a book for instance). It is really useful for detecting invalid annotations. Watch the video.
If this is the first time you run the console, build the project first:
mvn -Dmaven.test.skip=true -pl '!semantic-annotator-console-delivery' clean install
Then start the console:
mvn exec:java -pl semantic-annotator-console
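import java.util.Collection;

import cle.nlp.annotator.SemanticAnnotator;
import cle.nlp.tagger.Tag;
// SupportedLanguages must also be imported; its package is not shown in this example.

public class App {
    public static void main(String[] args) {
        // Load and validate every tagger found in the given directory.
        SemanticAnnotator annotator = new SemanticAnnotator(SupportedLanguages.FR, "/path/dir");
        // Generate the annotations defined by the loaded taggers.
        Collection<Tag> generatedTagLabels = annotator.getTags("this is a text");
    }
}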