A Graph-Based, Large-Scale Data Model for NLP Cross-Referencing

My team needs to combine several natural language processing (NLP) [1] techniques, each applied to the same large set of texts, to derive critical business conclusions. In particular, we need a way to cross-reference analyses based on keywords shared across sentences. Toward that end, I designed the following graph-based data model, implemented in Neo4j [2], for storing the results of each NLP analysis. This design enables large-scale queries of stored results, particularly regarding relations between the conclusions of different analyses.

We illustrate the data model using the text of the “E-textiles” [3] Wikipedia article.

Modeling the Texts Themselves

To store NLP computation results and cross-reference them, we first need to store the texts themselves in a structured fashion:

We first create a source node called “Wikipedia”, then an article node called “E-textiles”, and finally connect the two with a “has source” relationship. We also attach a property to the “Wikipedia” source node indicating the English language, because we are using the English version of the encyclopedia.

We then cycle through the headings of the “E-textiles” article, adding a heading node for each and connecting it to the article node. We also give each heading node an integer-valued property indicating its position (order) in the article. Finally, we connect each heading node, in sequence, to the heading node that follows it with a “has next heading” relationship:

MATCH p=((s:SOURCE)-[rs]-(a:ARTICLE)-[ra]-(h1:HEADING)-[r:HAS_NEXT_HEADING*1..12]->(h2:HEADING)) WHERE h1.title = "summary" RETURN p;
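The loading procedure described above can be sketched in plain Python, with no database involved. This is only an illustration of the node and relationship layout, not the loader from the repository. The `Graph` class, the `language` property name, the `HAS_SOURCE` spelling of the “has source” relationship, and the third heading title are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Tiny in-memory stand-in for the graph database (illustrative only)."""
    nodes: list = field(default_factory=list)  # each node: {"label": ..., **properties}
    edges: list = field(default_factory=list)  # each edge: (from_id, TYPE, to_id)

    def add_node(self, label, **props):
        self.nodes.append({"label": label, **props})
        return len(self.nodes) - 1  # the node's id is its list index

    def add_edge(self, src, rel_type, dst):
        self.edges.append((src, rel_type, dst))


def load_headings(graph, source_name, language, article_name, heading_titles):
    """Create source, article, and heading nodes, chaining the headings in order."""
    source = graph.add_node("SOURCE", name=source_name, language=language)
    article = graph.add_node("ARTICLE", name=article_name)
    graph.add_edge(article, "HAS_SOURCE", source)

    heading_ids = []
    previous = None
    for position, title in enumerate(heading_titles):
        heading = graph.add_node("HEADING", title=title, position_in_article=position)
        graph.add_edge(heading, "HAS_ARTICLE", article)
        if previous is not None:
            # Link each heading to the one that follows it.
            graph.add_edge(previous, "HAS_NEXT_HEADING", heading)
        previous = heading
        heading_ids.append(heading)
    return heading_ids


graph = Graph()
# "example heading" is a placeholder, not a real section title.
heading_ids = load_headings(
    graph, "Wikipedia", "en", "E-textiles", ["summary", "history", "example heading"]
)
```

The explicit position property and the “has next heading” chain are redundant on purpose: the first supports ordering in a result set, the second supports path queries such as the one above.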

We follow an analogous procedure for the sentences under each heading, and the tokens in each sentence:

MATCH p=((s:SOURCE)-[rs]-(a:ARTICLE)-[ra]-(h1:HEADING)-[rst]-(sta:SENTENCE)-[rstab:HAS_NEXT_SENTENCE*1..12]-(rstb:SENTENCE)) WHERE h1.title IN ["summary", "history"] AND sta.position_in_heading = 0 RETURN p;

MATCH p=((s:SOURCE)-[rs]-(a:ARTICLE)-[ra]-(h1:HEADING)-[rst]-(sta:SENTENCE)-[rw:HAS_SENTENCE]-(w1:WORD_LOCAL)-[rww:HAS_NEXT_WORD*1..12]-(w2:WORD_LOCAL)) WHERE w1.position_in_sentence = 0 AND h1.title IN ["summary", "history"] AND sta.position_in_heading = 0 RETURN p;

Note how each sentence or word is attached to its parent heading or sentence, respectively:

MATCH p=((s:SOURCE)-[rs]-(a:ARTICLE)-[ra]-(h1:HEADING)-[rst]-(sta:SENTENCE)-[rw:HAS_SENTENCE]-(w1:WORD_LOCAL)) WHERE h1.title IN ["summary", "history"] AND sta.position_in_heading = 0 RETURN p;
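As a rough sketch of the same idea for sentences and tokens, the function below builds node and edge records for the sentences under one heading. It assumes whitespace splitting as a stand-in for a real tokenizer, uses tuple keys as hypothetical node identifiers, and runs on two made-up sentences rather than text from the article.

```python
def chain_sentences(heading_title, sentences):
    """Return (nodes, edges) for one heading's sentences and their tokens."""
    nodes, edges = [], []
    heading_key = ("HEADING", heading_title)
    previous_sentence = None
    for s_pos, text in enumerate(sentences):
        sentence_key = ("SENTENCE", heading_title, s_pos)
        nodes.append({"key": sentence_key, "text": text, "position_in_heading": s_pos})
        # Attach the sentence to its parent heading.
        edges.append((sentence_key, "HAS_HEADING", heading_key))
        if previous_sentence is not None:
            edges.append((previous_sentence, "HAS_NEXT_SENTENCE", sentence_key))
        previous_sentence = sentence_key

        previous_word = None
        # str.split() is a placeholder for a real tokenizer.
        for w_pos, token in enumerate(text.split()):
            word_key = ("WORD_LOCAL", heading_title, s_pos, w_pos)
            nodes.append({"key": word_key, "text": token, "position_in_sentence": w_pos})
            # Attach the local word to its parent sentence.
            edges.append((word_key, "HAS_SENTENCE", sentence_key))
            if previous_word is not None:
                edges.append((previous_word, "HAS_NEXT_WORD", word_key))
            previous_word = word_key
    return nodes, edges


# Made-up sentences, used only to exercise the function.
nodes, edges = chain_sentences("summary", ["Smart fabrics sense heat.", "They are textiles."])
```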

Adding Basic NLP

Suppose two separate sentences contain the token “textiles”. In the above framework, each occurrence gets its own word node, called a “local word” node. In other words, there are two separate nodes containing the token “textiles”. I call these “local” because they are “local” to their particular source sentences.

Now suppose we lemmatize and stem the word to obtain “textil”. We then create a single “global word” node for this root word and attach to it a “has global word” relationship from every local node with the root “textil” (e.g. “textile”, “textiles”). The following query shows this in action, demonstrating how two sentences share the root words “textil”, “smart”, and “fabric”:

MATCH p=((s:SOURCE)-[rs]-(a:ARTICLE)-[ra]-(h1:HEADING)-[rst]-(sta:SENTENCE)-[rw:HAS_SENTENCE]-(w1:WORD_LOCAL)-[rww:HAS_NEXT_WORD*1..12]-(w2:WORD_LOCAL)), (w1)-[rw1:HAS_WORD_GLOBAL]-(wg1:WORD_GLOBAL), (w2)-[rw2:HAS_WORD_GLOBAL]-(wg2) WHERE w1.position_in_sentence = 0 AND h1.title IN ["summary", "history"] AND sta.position_in_heading = 0 RETURN p, w1, rw1, wg1, w2, rw2, wg2;
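To make the local/global distinction concrete, here is a minimal sketch of how local words might be mapped onto shared global word nodes. The `toy_root` function is a deliberately crude stand-in that only strips a trailing “s” and “e”; it is not the lemmatizer or stemmer used for this article, and a real pipeline would substitute a proper one. The local word ids and tokens are invented for the example.

```python
import string


def toy_root(token):
    """Crude stand-in for lemmatizing and stemming (NOT a real stemmer)."""
    root = token.lower().strip(string.punctuation)
    for suffix in ("s", "e"):
        if len(root) > 3 and root.endswith(suffix):
            root = root[:-1]
    return root


def link_global_words(local_words):
    """local_words: list of (local_id, token). Returns (global_roots, edges)."""
    global_roots = {}  # one WORD_GLOBAL node per distinct root
    edges = []         # one HAS_WORD_GLOBAL edge per local word
    for local_id, token in local_words:
        root = toy_root(token)
        global_roots.setdefault(root, {"label": "WORD_GLOBAL", "text": root})
        edges.append((local_id, "HAS_WORD_GLOBAL", root))
    return global_roots, edges


local_words = [(0, "Smart"), (1, "textiles"), (2, "textile"), (3, "fabrics"), (4, "fabric")]
global_roots, edges = link_global_words(local_words)
```

Five local words collapse onto three global nodes here, which is what lets a query hop from one sentence to another through a shared root.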

This enables us to determine all the sentences in the database containing the root word “textil”:

MATCH (wg:WORD_GLOBAL)<-[rwlwg:HAS_WORD_GLOBAL]-(wl:WORD_LOCAL)-[rs:HAS_SENTENCE]->(s:SENTENCE)-[rh:HAS_HEADING]->(h:HEADING)-[ra:HAS_ARTICLE]->(a:ARTICLE) WHERE wg.text = "textil" RETURN wg.text AS global_word, wl.text AS local_word, a.name AS article, h.title AS heading, s.text AS sentence ORDER BY h.position_in_article, s.position_in_heading
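The traversal this query performs can be mimicked in Python to show one way of returning each sentence only once. A sentence that contains the same root twice yields one match per local word, so the sketch groups matches by sentence before ordering them. Whether this is what causes the repeated content noted under “Issues to Resolve” is not established here. The record layout and sample sentences are assumptions made for the example.

```python
def sentences_with_root(root, local_words, sentences):
    """Return [(sentence_text, [matching local words])], one entry per sentence.

    local_words: dicts with 'text', 'root', and 'sentence_id'.
    sentences: sentence_id -> dict with 'text', 'heading_position',
               and 'position_in_heading'.
    """
    matches = {}
    for word in local_words:
        if word["root"] == root:
            # Group by sentence so a sentence with two hits appears once.
            matches.setdefault(word["sentence_id"], []).append(word["text"])
    ordered = sorted(
        matches,
        key=lambda sid: (
            sentences[sid]["heading_position"],
            sentences[sid]["position_in_heading"],
        ),
    )
    return [(sentences[sid]["text"], matches[sid]) for sid in ordered]


# Made-up data, used only to exercise the function.
sentences = {
    "s0": {"text": "Smart textiles are fabrics.", "heading_position": 0, "position_in_heading": 0},
    "s1": {"text": "A textile can be a smart textile.", "heading_position": 1, "position_in_heading": 0},
}
local_words = [
    {"text": "textile", "root": "textil", "sentence_id": "s1"},
    {"text": "textiles", "root": "textil", "sentence_id": "s0"},
    {"text": "textile", "root": "textil", "sentence_id": "s1"},
    {"text": "Smart", "root": "smart", "sentence_id": "s0"},
]
rows = sentences_with_root("textil", local_words, sentences)
```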

Adding More Advanced NLP

Now that we have a framework, we can attach the results of more advanced NLP analyses (such as OpenIE 5.0) to the sentence and word nodes. For OpenIE, this means connecting the (subject, relation, object) tuples uncovered by the analysis to their source sentences.

Future work will connect these results to the words as well. This will prove powerful once other annotations, such as part of speech, are attached to the words, enabling a more comprehensive view of the data than previously accessible.
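One way to picture the attachment of extraction results is a mapping from each sentence to its tuples. The sketch below assumes the extractor's output has already been parsed into (sentence id, subject, relation, object) records; the record layout, the sentence ids, and the sample tuples are invented for illustration and are not actual OpenIE output.

```python
from collections import defaultdict


def attach_triples(extractions):
    """Group (sentence_id, subject, relation, object) records by sentence."""
    triples_by_sentence = defaultdict(list)
    for sentence_id, subject, relation, obj in extractions:
        triples_by_sentence[sentence_id].append(
            {"subject": subject, "relation": relation, "object": obj}
        )
    return dict(triples_by_sentence)


# Made-up extractions, not real extractor output.
extractions = [
    ("s0", "smart textiles", "are", "fabrics"),
    ("s0", "smart textiles", "sense", "heat"),
    ("s1", "fabrics", "contain", "sensors"),
]
triples = attach_triples(extractions)
```

In the graph itself, each dictionary would become a node (or a small subgraph) connected to its sentence node, so the tuples become reachable from the same global word nodes as everything else.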


Issues to Resolve

  1. The last query in the “Adding Basic NLP” section above returns repeated content. The Cypher command needs to be adjusted.
  2. The code for loading a Wikipedia article (below) runs slowly. I’m not sure if this is due to the Python module, my use of Cypher, my use of single transactions, or the memory configuration of my Neo4j instance. Or perhaps I’m not indexing correctly. This matter needs to be resolved to make progress!
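One experiment worth trying for the slow load is to send rows in batches, so that a single Cypher statement creates many nodes per round trip instead of one. The sketch below only prepares the batches and the statement text; it does not talk to a database, and it is untested against the actual loader. The `id` property on sentence nodes, the row layout, and the batch size are assumptions. Whether batching addresses the slowdown here is unknown.

```python
def batched(rows, batch_size):
    """Yield consecutive slices of rows, each at most batch_size long."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# One statement per batch: UNWIND expands the $rows parameter into one row
# per list element. The sentence "id" property is hypothetical.
CREATE_WORDS = """
UNWIND $rows AS row
MATCH (s:SENTENCE {id: row.sentence_id})
CREATE (w:WORD_LOCAL {text: row.text, position_in_sentence: row.position})
CREATE (w)-[:HAS_SENTENCE]->(s)
"""

# Made-up rows for a single sentence.
rows = [
    {"sentence_id": 0, "text": token, "position": position}
    for position, token in enumerate("smart textiles are fabrics".split())
]
batches = list(batched(rows, batch_size=3))
```

Each batch would be passed as the `rows` parameter of `CREATE_WORDS` through whichever Python driver is in use. The `MATCH` on the sentence only stays fast if the matched property is indexed, and the index-creation syntax differs between Neo4j versions, so it is best checked against the manual for the installed release.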


Source code used in the production of this article is posted for peer review at https://github.com/badassdatascience/nlp-graph-data-model.


  1. https://en.wikipedia.org/wiki/Natural_language_processing
  2. https://neo4j.com
  3. https://en.wikipedia.org/wiki/E-textiles

Post Author: badassdatascience
