E-Scribe : a programmer’s blog

About Me

I'm Paul Bissex. I build web applications using open source software, especially Django. Started my career doing graphic design for newspapers and magazines in the '90s. Then wrote tech commentary and reviews for Wired, Salon, Chicago Tribune, and others you never heard of. Then I built operations software at a photography school. Then I helped big media serve 40 million pages a day. Then I worked on a translation services API doing millions of dollars of business. Now I'm building the core platform of a global startup accelerator. Feel free to email me.

Neo4J and Graph Databases

NoSQL is a big tent with lots of interesting tech in it. A few years ago at work I got an assignment to evaluate graph databases as a possible datastore for our 40-million-pageviews-a-day CMS. Graph DBs are elegant stuff, though not a particularly special fit for that application. Here's what I had to say.

Graph databases are all about "highly connected" data. But instead of tracking relationships through foreign-key mappings RDBMS style, they use pointers that directly connect the related records.

These relationships can also have directionality and descriptive properties.

Graph DBs store and retrieve data in a manner arguably more congruent with the true structure of heavily relational data than an RDBMS does.

Using an RDBMS with foreign keys and joins can mean a significant performance cost in join-heavy situations.
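
To make the contrast concrete, here is a minimal Python sketch (not Neo4j code) that answers the same "what is related to this record?" question two ways: through a join table in an in-memory SQLite database, and through direct references between objects. The table names, the `Node` class, and its `relate` method are made up for illustration; the outlet names are borrowed from the Cypher session later in this post.

```python
import sqlite3

# Relational style: relationships live in a join table and are resolved by joins.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE outlet (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE same_market (src INTEGER REFERENCES outlet(id),
                              dst INTEGER REFERENCES outlet(id));
    INSERT INTO outlet VALUES (1, 'AJC'), (2, 'WSB TV'), (3, 'WSB radio');
    INSERT INTO same_market VALUES (2, 1), (2, 3);
""")
rows = db.execute("""
    SELECT o2.name FROM outlet o1
    JOIN same_market sm ON sm.src = o1.id
    JOIN outlet o2 ON o2.id = sm.dst
    WHERE o1.name = 'WSB TV' ORDER BY o2.name
""").fetchall()
sql_result = [name for (name,) in rows]


# Graph style: each record holds direct references to its neighbours.
class Node:
    def __init__(self, name):
        self.name = name
        self.out = []  # list of (relationship type, target Node)

    def relate(self, rel_type, other):
        self.out.append((rel_type, other))


paper, tv, radio = Node("AJC"), Node("WSB TV"), Node("WSB radio")
tv.relate("SAME_MARKET", paper)
tv.relate("SAME_MARKET", radio)
graph_result = sorted(n.name for rel, n in tv.out if rel == "SAME_MARKET")

print(sql_result)
print(graph_result)
```

Both approaches give the same answer. The difference is in the work done: the SQL version looks up matching rows in two joins, while the object version simply follows the references already attached to the starting record.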

There are many products in the graph database space, many of them relatively new. There are some variations in features and intended niche. I focus on Neo4j, which is the dominant player, mature, and open source.

Neo4j

Neo4j seems to be the most prominent and heavily used graph database product of the "property graph" type. Its sponsor is a company named Neo Technology. It was created in 2003 and open-sourced in 2007. It's under active development, but seems mature enough not to be undergoing disruptive changes. There's an active user community and a good ecosystem of third-party tools, and books are emerging as well.

Querying and Data Access

Cypher

Cypher is Neo4j's SQL-ish declarative query language.

One notable difference from SQL is that every database query has an explicit starting point. Usually this is a specific node in the graph. The Cypher START clause identifies this node. It's selected either by its ID or via an index lookup.

For example, given that almost any $BIGCMS object is attached to a specific site (or sites), many queries of graph-database $BIGCMS might start at a site node.

A common pattern for Cypher queries is START ... MATCH ... RETURN. (Keywords are not case sensitive, but as with SQL it improves overall query readability if they are in all caps.)

Cypher session example ("//" begins a comment):

    // A mutating operation (e.g. CREATE) doesn't have to return anything, but it can.
    // Note that we did not have to declare our nodes' data structure before creating them.
    $ CREATE paper={name:"AJC"}, tv={name: "WSB TV"}, radio={name: "WSB radio"} RETURN paper, tv, radio
    ==> +-----------------------------------------------------------------------------+
    ==> | paper                | tv                      | radio                      |
    ==> +-----------------------------------------------------------------------------+
    ==> | Node[17]{name:"AJC"} | Node[18]{name:"WSB TV"} | Node[19]{name:"WSB radio"} |
    ==> +-----------------------------------------------------------------------------+
    ==> 1 row
    ==> Nodes created: 3
    ==> Properties set: 3
    ==> 3 ms

    // Establish the relationships, fetching start nodes by ID
    $ START tv=node(18), radio=node(19) CREATE tv-[:SAME_MARKET]->radio
    $ START tv=node(18), paper=node(17) CREATE tv-[:SAME_MARKET]->paper

    // Query the graph; "-" indicates relations, with optional "<" or ">" for direction
    $ START a = node(18) MATCH a-[:SAME_MARKET]-b RETURN DISTINCT b
    ==> +----------------------------+
    ==> | b                          |
    ==> +----------------------------+
    ==> | Node[17]{name:"AJC"}       |
    ==> | Node[19]{name:"WSB radio"} |
    ==> +----------------------------+

The Cypher relation syntax looks a bit noisy at first, but it helps to think of it as a sort of ASCII-art diagram: "a-->b", "a<--b", "a-[:LOVES]->b", and "b-[:TOLERATES]->a" are all legal.
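
As a rough illustration of what the direction markers mean, the following Python sketch (a toy model, not how Neo4j is implemented) stores two SAME_MARKET relationships as directed edges, using the node IDs from the CREATE output above (17 for the paper, 18 for TV, 19 for radio) with the TV node as the start of both. The `match` function and its `direction` argument are invented for this example.

```python
# Directed relationships: (start node ID, type, end node ID).
names = {17: "AJC", 18: "WSB TV", 19: "WSB radio"}
rels = [(18, "SAME_MARKET", 19), (18, "SAME_MARKET", 17)]


def match(start, rel_type, direction="any"):
    """Return sorted IDs of nodes related to `start`.

    direction: "out" ~ a-[:T]->b, "in" ~ a<-[:T]-b, "any" ~ a-[:T]-b
    """
    found = set()
    for src, typ, dst in rels:
        if typ != rel_type:
            continue
        if src == start and direction in ("out", "any"):
            found.add(dst)
        if dst == start and direction in ("in", "any"):
            found.add(src)
    return sorted(found)


# Undirected match from the TV node, like a-[:SAME_MARKET]-b
print([names[n] for n in match(18, "SAME_MARKET")])
# Outgoing only, from the paper: nothing points out of it
print([names[n] for n in match(17, "SAME_MARKET", "out")])
# Incoming only, to the paper: the TV node points at it
print([names[n] for n in match(17, "SAME_MARKET", "in")])
```

Leaving the arrowhead off, as the query in the session does, matches a relationship regardless of which way it was created.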

Other access modes

In addition to the declarative-style Cypher, there are other supported ways to access data.

The server has a REST API. In addition to being available for "raw" use it is the basis for many of the tools and language bindings for Neo4j. For example, the provided Python bindings utilize the REST API internally.
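
For a sense of what "raw" use might look like, here is a hedged Python sketch that builds a Cypher request using only the standard library. It assumes a local server on the default port 7474 and the Cypher endpoint at `/db/data/cypher` as I understand the 1.x-era REST API; check both against the manual for your version. The helper names `build_cypher_request` and `run_cypher` are my own, and only the request-building part runs without a server.

```python
import json
import urllib.request

# Assumed defaults for a local 1.x-era server; adjust for your install.
BASE_URL = "http://localhost:7474/db/data"


def build_cypher_request(query, params=None, base_url=BASE_URL):
    """Build (but do not send) a POST request for the Cypher endpoint."""
    body = json.dumps({"query": query, "params": params or {}}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/cypher",
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def run_cypher(query, params=None, base_url=BASE_URL):
    """Send the query and return the decoded JSON response (needs a running server)."""
    request = build_cypher_request(query, params, base_url)
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Build the request only; nothing is sent over the network here.
req = build_cypher_request(
    "START a=node({id}) MATCH a-[:SAME_MARKET]-b RETURN DISTINCT b",
    {"id": 18},
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the node ID as a parameter rather than pasting it into the query string keeps the query text constant across calls.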

The Neo4j shell, in addition to supporting Cypher commands, has utility functions that make interactive manipulation of graph data easier.

Gremlin is a graph traversal language based on Groovy ("the Python of Java"). It's provided as a plugin with the Neo4j distribution.

There's also py2neo, a comprehensive Python library for Neo4j access, with submodules for access via Cypher, Gremlin, Geoff (a graph modeling language by the same author), and raw REST.

Using Neo4j

The Neo4j "Community" version is what we would likely use. It's GPL licensed, and is the complete product.

They also offer two commercial versions, "Advanced" and "Enterprise." The selling points are advanced monitoring features, high availability support, a specialized web management console, and support services.

(The Advanced and Enterprise versions are also available under an Affero GPL license, but this is currently not practical for us.)

The user support ecosystem is what you would expect for an open source project. There's an official Google Group. Using Stack Overflow to ask questions is encouraged. There's a (quiet) IRC channel on Freenode. GitHub is used to distribute the source.

Scaling

Scaling a Neo4j database is not as simple as with a Dynamo-style store like Riak. Graphs are difficult to shard.
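
One way to see why: in a key-value store, records don't point at each other, so placing them on shards by key costs nothing. In a graph, any relationship whose two ends land on different shards turns a pointer hop into a network hop. The Python sketch below is a toy model under stated assumptions: a uniformly random graph (a pessimistic stand-in for real data, which has more locality) and a naive modulo-of-ID placement. All sizes and names are made up.

```python
import random


def cut_fraction(edges, shard_of):
    """Fraction of edges whose endpoints live on different shards."""
    crossing = sum(1 for a, b in edges if shard_of(a) != shard_of(b))
    return crossing / len(edges)


random.seed(42)  # fixed seed so repeated runs build the same graph
NUM_NODES, NUM_EDGES, NUM_SHARDS = 1000, 5000, 4

# Build a random graph as a list of (node ID, node ID) pairs, skipping self-loops.
edges = []
while len(edges) < NUM_EDGES:
    a, b = random.randrange(NUM_NODES), random.randrange(NUM_NODES)
    if a != b:
        edges.append((a, b))

# Key-value style placement: the shard is chosen from the node ID alone.
by_id = cut_fraction(edges, lambda n: n % NUM_SHARDS)
print(f"{by_id:.0%} of relationships cross a shard boundary")
```

With four shards and no locality, roughly three quarters of relationships should cross a boundary. Real data can do better, but finding a placement that keeps related nodes together (and keeps it that way as the graph changes) is the hard part.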

Neo4j has "high availability" features for clustering in the Neo4j Enterprise Edition. This is a master-slave setup. You can write to master or slave nodes, though there's a speed penalty for writing to slaves. All nodes get all writes eventually. Automatic fail-over can be set to elect any cluster member as master. A failed master node can later re-join as slave if desired.

In a cluster setup, backups can be performed by adding a slave to the cluster, which will pick up all the data. To restore, you stop the cluster, restore data from backup to at least one node, and re-start the cluster.

Neo Technology has been working for several years now on a system allowing the graph datastore to be distributed across servers, and to be scaled horizontally. This work (currently known as "Rassilon") will arrive with Neo4j 2.0 at the earliest (current stable version is 1.8).

Technical details

Neo4j is a JVM application (written in Java and Scala), so we would need to cultivate expertise in JVM deployment.

Neo4j likes to have its data in RAM -- specifically its node and property maps, which are mostly pointers. Having space to additionally hold the full property values in RAM is apparently not critical. Given that the vast bulk of $BIGCMS data is in property values, and that the total number of records (i.e. nodes) is nowhere near their hard limit of 32 billion, this seems achievable.
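
A back-of-envelope version of that estimate might look like the Python sketch below. Everything numeric in it is an assumption: the per-record sizes are what I recall from 1.x-era documentation and should be verified for your version, and the record counts are hypothetical, not real $BIGCMS figures. It counts only the fixed-size node, relationship, and property records (the "pointers"), not the string and array stores that hold full property values.

```python
# ASSUMPTION: fixed record sizes in bytes, as recalled from 1.x-era docs. Verify before use.
NODE_RECORD_BYTES = 9
REL_RECORD_BYTES = 33
PROP_RECORD_BYTES = 41

# ASSUMPTION: hypothetical dataset sizes, chosen only to make the arithmetic concrete.
nodes = 50_000_000
rels = 200_000_000
props = 500_000_000


def gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


node_store = nodes * NODE_RECORD_BYTES
rel_store = rels * REL_RECORD_BYTES
prop_store = props * PROP_RECORD_BYTES
total = node_store + rel_store + prop_store

for label, size in [
    ("node records", node_store),
    ("relationship records", rel_store),
    ("property records", prop_store),
    ("total", total),
]:
    print(f"{label:>22}: {gib(size):7.2f} GiB")
```

Under these assumptions the property records dominate, which suggests the property count is the figure to pin down first when sizing a server.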

For best performance, Neo recommends maximizing the host OS's file caching. Making the server's filesystem cache size as big as the entire datastore is recommended when possible.

Their JVM tuning advice is: give the JVM a large heap that will hold as much application data as possible, but also make sure the heap fits in RAM to avoid performance degradation from virtual memory paging. Along those lines Neo advises tuning Linux to be more tolerant of dirty virtual memory pages.

Installation

Ubuntu/Debian: Neo Technology provides an apt repository.

OS X: There's a Homebrew formula for the latest stable version of Neo4j.

Other Unix platforms (e.g. CentOS): Neo Technology provides tarballs containing the full binary release. And the source is available too of course.

Suitability

Graph database technology proponents make a big deal of how well suited it is to relationship-heavy social media applications. While that's not currently a big niche for us, the technology still has some appeal.

One only needs to look at some of $BIGCMS's slowest, join-heavy SQL queries to know that a graph approach has the potential to increase performance greatly, and perhaps allow us to work with data in some ways that we have avoided or ignored because they are impractically slow.

And for our goal of "store structured data, not presentation," a graph database seems like an excellent fit. Graph relationships would give us the ability to record even more (readily usable) structure than we already do.

Final Thoughts

We could certainly speed up many slower $BIGCMS queries by moving from an RDBMS to a graph system. Our most pathologically slow SQL queries can take minutes. Getting our data into graph storage could eliminate many if not all of these.

However, the migration effort would be significant. Getting $BIGCMS data into graph form will require some careful thinking about how the data will be accessed. Common advice on creating a graph store is to think about the relationships first. This might lead to some rethinking of how we store data.

Since a major goal of $BIGCMS is to share content across sites, and we intend to build a library of that content, a graph database could offer a natural and powerful way to work with those connections.

If we were intending to directly replace our RDBMS store with a graph database, many migration challenges would arise that we might not see with other data store types. But since our data store will live behind a REST API, disruption at the application level might be no greater than with some other data store type (e.g. key-value).

As a more detailed design for the data store REST API is developed, we will likely have a better sense of how a graph database would serve in that design, and how its advantages would be felt.

Resources

O'Reilly is working on a Graph Databases book which is currently available in a free pre-release PDF at http://graphdatabases.com/. It heavily emphasizes Neo4j.

Manning is publishing "Neo4j in Action" which is currently available under their Early Access Program.

Saturday, September 16th, 2017