Graph-Annotator
Textmining annotator
A simple pipeline built from existing pipes can be created as follows (assuming you have an ArangoDB instance up and running):
from pyArango.collection import Collection
from cag.framework.annotator.pipeline import Pipeline
from cag.utils.config import Config

# set database configuration
config = Config(
    url="http://127.0.0.1:8529",
    user="root",
    password="root",
    database="_system",
    graph="GenericGraph"
)

# define the pipeline
pipeline: Pipeline = Pipeline(database_config=config)
pipeline.add_annotation_pipe("NamedEntityAnnotator", save=True)
coll: Collection = pipeline.database_config.db["TextNode"]

# fetch data
docs = coll.fetchAll(limit=500)
processed = []
for txt_node in docs:
    processed.append((txt_node.text, {"_key": txt_node._key}))

# annotate using the defined pipes
pipeline.annotate(processed)

# save to the database
pipeline.save()
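The `processed` list above pairs each text with a small context dict carrying the document key, so that annotations can be matched back to their source node. The following is a minimal sketch of that preparation step in isolation, using plain dicts in place of database documents. The helper name `to_annotation_input` and the sample records are hypothetical and not part of cag; skipping empty texts is an assumption made for illustration.

```python
from typing import Any, Dict, Iterable, List, Tuple

# One pipeline input: the text plus a context dict identifying its source node.
AnnotationInput = Tuple[str, Dict[str, Any]]


def to_annotation_input(records: Iterable[Dict[str, Any]]) -> List[AnnotationInput]:
    """Build (text, context) tuples, skipping records without usable text.

    Hypothetical helper for illustration only; not part of cag.
    """
    prepared: List[AnnotationInput] = []
    for record in records:
        text = record.get("text")
        # Skip missing, non-string, or whitespace-only texts.
        if not isinstance(text, str) or not text.strip():
            continue
        prepared.append((text, {"_key": record["_key"]}))
    return prepared


# Made-up sample records standing in for documents fetched from a collection.
sample_records = [
    {"_key": "1", "text": "Berlin is the capital of Germany."},
    {"_key": "2", "text": ""},
    {"_key": "3", "text": "ArangoDB is a multi-model database."},
]
prepared = to_annotation_input(sample_records)
```

A list built this way has the same shape as `processed` in the pipeline example, so it could be handed to `pipeline.annotate` in its place.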
General annotator
These annotators belong to a more general class, for which we provide only basic functionality, similar to the graph creator. To ease filtering based on the parameters, we provide a simple base class where documents can be checked in and easily filtered:
from cag.framework import GenericAnnotator
from cag.utils.config import Config

# AnyGraphCreator and some_algo are not defined here; they stand in for
# your own graph creator class and annotation algorithm.
class AnyAnnotator(GenericAnnotator):
    def __init__(self, conf: Config, params={'mode': 'run-1'}, filter_annotatable=True):
        super().__init__(query=f"""FOR dp IN {AnyGraphCreator._ANY_DATASET_NODE_NAME}
            RETURN dp
        """, params=params, conf=conf, filter_annotatable=filter_annotatable)

    def update_graph(self, timestamp, data):
        for d in data:
            d['add-prop'] = some_algo(d['text'])
            self.upsert_node(d)  # will annotate the data!
You can disable the filtering by providing filter_annotatable=False. When returning more complex data, make sure that you also return a root-level field (in your data structure) called '_annotator_params' (from a component that will be annotated), or provide your own field name in the parameter annotator_fieldname. Each document that is upserted (or checked into complete_annotation) will receive the parameters on this field, providing the next run with the necessary information to filter.
An example of annotation metadata as a dict() for annotations produced by keyphrase extraction is given below:
{
    "analysis_component": "keyphrase_extraction",
    "parameters": {
        "algorithm": "text_rank",
        "relevance_threshold": 0.75
    }
}
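How parameters stored on a document could drive filtering on a later run can be sketched as follows. This is an illustration under stated assumptions, not the cag implementation: the helper `needs_annotation` is hypothetical, and it assumes a document counts as already annotated only when the value stored under its `_annotator_params` field equals the parameters of the current run.

```python
from typing import Any, Dict, List


def needs_annotation(
    doc: Dict[str, Any],
    params: Dict[str, Any],
    fieldname: str = "_annotator_params",
) -> bool:
    """Return True if the document lacks matching annotation metadata.

    Hypothetical helper: a missing field or differing parameters both
    mean the document should be annotated in this run.
    """
    return doc.get(fieldname) != params


current_params = {
    "analysis_component": "keyphrase_extraction",
    "parameters": {"algorithm": "text_rank", "relevance_threshold": 0.75},
}

# Made-up documents: one annotated with the current parameters, one with an
# older threshold, and one that has never been annotated.
docs: List[Dict[str, Any]] = [
    {"_key": "a", "text": "first", "_annotator_params": {
        "analysis_component": "keyphrase_extraction",
        "parameters": {"algorithm": "text_rank", "relevance_threshold": 0.75},
    }},
    {"_key": "b", "text": "second", "_annotator_params": {
        "analysis_component": "keyphrase_extraction",
        "parameters": {"algorithm": "text_rank", "relevance_threshold": 0.5},
    }},
    {"_key": "c", "text": "third"},
]

to_annotate = [d["_key"] for d in docs if needs_annotation(d, current_params)]
```

Comparing the whole metadata dict means that changing any parameter, such as the relevance threshold, marks a document for re-annotation under this assumed rule.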