Ghostdoc is a cross between a Wiki and a data mining tool for unstructured text documents. The front-end is a web app supported by a back-end data processing engine.

Artifacts and gems in a haystack

The main idea behind Ghostdoc is that a plethora of text sources contain information about objects (called Artifacts in Ghostdoc) such as persons and locations. Extracting and structuring the information about these Artifacts into Wiki-like articles makes it easier to get an overview of that information.

Furthermore, Ghostdoc allows the user to define Gems: more general information extraction expressions that are evaluated for each Artifact, further highlighting information hidden in the corpora.

Lastly, the back-end also performs generic analysis of the Artifacts, such as community clustering, centrality scoring and linking between articles.
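To make the graph analysis a little more concrete, here is a minimal sketch of one way such scores could be computed. It assumes that two Artifacts are linked whenever they are mentioned in the same sentence and uses plain degree centrality. Both choices, and the function names `build_cooccurrence_graph` and `degree_centrality`, are illustrative assumptions and not necessarily what the back-end does.

```python
from itertools import combinations

def build_cooccurrence_graph(sentences, artifacts):
    """Link two Artifacts whenever both are mentioned in the same sentence.

    The co-occurrence rule is an assumption made for this sketch.
    """
    graph = {name: set() for name in artifacts}
    for sentence in sentences:
        mentioned = [name for name in artifacts if name in sentence]
        for a, b in combinations(mentioned, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly linked to."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}

artifacts = ["Churchill", "Roosevelt", "Stalin", "Rommel"]
sentences = [
    "Churchill met Roosevelt and Stalin at Tehran in 1943.",
    "Rommel commanded the Afrika Korps.",
    "Churchill wrote to Roosevelt frequently.",
]

graph = build_cooccurrence_graph(sentences, artifacts)
scores = degree_centrality(graph)
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.2f}")
```

In this toy corpus the three leaders who co-occur at Tehran end up fully linked to each other, while Rommel, who never shares a sentence with another Artifact, gets a score of zero.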

Example use case

Currently I consider Ghostdoc primarily a research tool, though other use cases are definitely possible.

Imagine you had 100 textbooks about WWII and you wanted to get an overview of the information. With Ghostdoc you could add the books as sources and add, for example, the major figures as Artifacts. Ghostdoc would then create Wiki-like pages for all these figures.

Perhaps you were particularly interested in when these major figures were born. By adding a Gem that extracts their date of birth, you could add that information to their articles in an information box, making their date of birth extra visible.
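As a rough illustration of the date-of-birth example, the sketch below models a Gem as a regular expression that is evaluated against the sentences mentioning an Artifact. The pattern `BORN_GEM` and the helper `evaluate_gem` are hypothetical names invented for this sketch, and the actual Gem syntax in Ghostdoc may look quite different.

```python
import re
from typing import List, Optional

# Hypothetical Gem: a regular expression capturing a "born (on) <day> <month> <year>" phrase.
# This pattern is an assumption for illustration, not Ghostdoc's real Gem syntax.
BORN_GEM = re.compile(r"\bborn\s+(?:on\s+)?(\d{1,2}\s+[A-Z][a-z]+\s+\d{4})")

def evaluate_gem(gem: "re.Pattern", artifact: str, sentences: List[str]) -> Optional[str]:
    """Return the first Gem match found in a sentence that mentions the Artifact."""
    for sentence in sentences:
        if artifact not in sentence:
            continue  # only look at sentences about this Artifact
        match = gem.search(sentence)
        if match:
            return match.group(1)
    return None

sentences = [
    "Winston Churchill was born on 30 November 1874 at Blenheim Palace.",
    "Churchill became Prime Minister in 1940.",
]

# The extracted value could then be shown in the Artifact's information box.
print(evaluate_gem(BORN_GEM, "Churchill", sentences))
```

Returning `None` when nothing matches lets the information box simply omit the field for Artifacts whose sources never state a date of birth.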

The fine details

The front-end of Ghostdoc is written in Node.js using the Meteor framework. It communicates with the back-end, Ritter, through a RabbitMQ broker. Ritter is written in Python and does all the heavy lifting, which includes extracting Artifacts, evaluating Gems and running graph analysis such as centrality scoring.
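To give a feel for what could travel over the broker, here is a small sketch of a job message serialized as JSON, a common payload format for RabbitMQ. The `AnalysisRequest` class and all of its field names are hypothetical and were made up for this example. The real message format used between Ghostdoc and Ritter may differ, and the actual publish and consume calls of a RabbitMQ client library are left out.

```python
import json
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class AnalysisRequest:
    """Hypothetical job message; the field names are illustrative assumptions."""
    project_id: str
    task: str              # e.g. "extract_artifacts" or "centrality"
    source_ids: List[str]

def encode(request: AnalysisRequest) -> bytes:
    """Serialize the request to UTF-8 JSON, ready to be used as a message body."""
    return json.dumps(asdict(request)).encode("utf-8")

def decode(body: bytes) -> AnalysisRequest:
    """Rebuild the request on the consuming side from the raw message body."""
    return AnalysisRequest(**json.loads(body.decode("utf-8")))

request = AnalysisRequest("wwii-books", "centrality", ["book-1", "book-2"])
body = encode(request)       # what the front-end would hand to the broker
received = decode(body)      # what the back-end would see after consuming
print(received)
```

Keeping the payload as plain JSON means the Node.js side and the Python side only need to agree on field names, not on a shared library.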

The two ends are intended to run on a PaaS such as Dokku.

Trying out Ghostdoc

Both Ghostdoc and Ritter are open source and available on GitHub.

A live version is also running. Feel free to test it out, but please keep in mind that it is not hosted on a production server.

Erik Gärtner

Research Scientist at RADiCAL