Graph-Annotator

Textmining annotator

A simple pipeline, using existing pipes, can be created as follows (assuming you have an arangodb instance up and running):

from pyArango.collection import Collection

from cag.framework.annotator.pipeline import Pipeline
from cag.utils.config import Config

## set database configuration
config= Config(
        url="http://127.0.0.1:8529",
        user="root",
        password="root",
        database="_system",
        graph="GenericGraph"
    )

## define the pipeline
pipeline: Pipeline = Pipeline(database_config=config)

pipeline.add_annotation_pipe("NamedEntityAnnotator", save=True)

coll: Collection = pipeline.database_config.db["TextNode"]

## fetch data
docs = coll.fetchAll(limit=500)
processed = []
for txt_node in docs:
    processed.append((txt_node.text, {"_key": txt_node._key}))

## annotating using the defined pipes
pipeline.annotate(processed)

## save to the database
pipeline.save()

General annotator

These annotator fit a more general class, where we only provide basic functionality, similar to the graph creator. To ease the filtering based on the parameters, we provide a simple base class where the documents can be checked in and easily filtered:

from cag.framework import GenericAnnotator
class AnyAnnotator(GenericAnnotator):
    def __init__(self, conf: Config, params={'mode': 'run-1'}, filter_annotatable=True):
        super().__init__(query=f"""FOR dp IN {AnyGraphCreator._ANY_DATASET_NODE_NAME}
        RETURN dp
        """, params=params, conf=conf, filter_annotatable=filter_annotatable)
    def update_graph(self, timestamp, data):
        for d in data:
            d['add-prop']=some_algo(d['text'])
            self.upsert_node(d) #will annotate the data!

You can disable the filtering by providing filter_annotatable=False. When returning more complex data make sure that you also return a root-level field (in your data structure) called '_annotator_params' (from a component that will be annotated) or provide your own fieldname in the parameter annotator_fieldname. Each document that will be upserted (or checked into complete_annotation) will recieve the parameter on this field, providing the next run with the neccessary information to filter.

An example for annotation metadata as a dict() for annotations produced by keyphrase extraction is given below:

{
    "analysis_component": "keyphrase_extraction",
    "parameters": {
        "algorithm": "text_rank",
        "relevance_threshold": 0.75
    }
}