Friday 20 November 2015

The abstraction illusion

It's been an awfully long time since I last posted here about Neo4j. I have actually been using it quite a lot, on and off, on my personal project and doing some prototyping for clients. Sadly it never took off (more due to the political environment of the project than its merits), but I've had some thoughts that I'd like to share here.

Let me start by saying that I absolutely love Neo4j as a database. It has definitely reached a state where you get a lot of bang for your (metaphorical) buck - i.e. you get something working really nicely in a very short space of time. I managed to prepare a simple model, import some CSV data and link it all up within a couple of hours (and most of that time was getting the data into CSV!). Everybody on the project reacted with a lovely "WOW" when they saw me playing around with the GUI to look into some data patterns.
There were voices saying that perhaps having the nodes bounce around like little bunnies is a bit too much, but otherwise - a full success.

As long as my queries are 5-liners that fit into the UI, and my data can be imported from CSV, I'm loving it. Things look quite different when I start thinking about building an actual, proper project on top of Neo4j - one where data needs to be updated selectively, and where I need transactionality, an object model, data retrieval for dynamically built queries, and so on. Basically, if we think of the Neo4j UI app as the equivalent of Embarcadero/SQuirreL SQL/Oracle SQL Developer/pick your poison - then what I'm talking about is getting from there to building an actual application. JDBC, Hibernate, the lot.

And I think this is precisely where the problem is. We're trying to re-write JDBC and Hibernate for Neo4j and it just doesn't feel good.

I understand the appeal of "one size fits all". I understand the rationale behind the Spring Data X initiative, or behind creating the JDBC driver. I was actually really excited about it - so much Spring JDBC goodness to reuse! It's lovely to think that we can abstract away from our underlying storage, have a common API for operating on all our data, and swap things underneath. I started by trying to use Spring Data Neo4j (in both versions, before and after the re-write). I've tried Neo4j JDBC. I really wanted to love it - but I didn't. With all of them, the more I used them the more frustrated I got. The abstraction was taking away parts of the functionality of Neo4j, or at least making it very awkward and unnatural to work with. It felt almost as if I had to hack it to do what I wanted to do. So whilst in theory abstraction is great, in practice... Well. It didn't work for me.

My data is not arranged in rows

The whole point of choosing Neo4j rather than a traditional database is that my data does NOT fit nicely into a tabular world. My data is a graph - and trying to fit it into abstractions that were conceived at a time when data was mostly tabular just doesn't work that well. For NoSQL databases that are closer to the tabular world (like Cassandra) I suppose it makes a bit more sense - but for something like Neo4j, sorry, it just does not work for me.
The simplest example I can think of: say I have a graph with cities and countries, and a relationship between each city and its country. I want to fetch all cities in countries X and Y and build an object model where a Country contains its cities. There are 100 cities in country X and 50 in country Y. The "rows" approach of Neo4j JDBC (or an SDN repository), with a query along the lines of

   MATCH (country:Country) -- (city:City) WHERE country.name IN ['X','Y'] RETURN country,city

will mean that I get the details of country X 100 times, and country Y 50 times. Which is 99+49 times too many. It's not wrong - I can get to my object model from that - it's just inefficient. Neo4j itself (even on the REST endpoint) supports a "graph" view as well as a "rows" view (after all, some queries might return more "tabular" data) - but if you're going through an abstraction, the opportunity to choose whichever is better for any given query is lost.
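
For what it's worth, when I have direct control over the Cypher, I can at least collapse that duplication myself with collect() - a minimal sketch of the same query, returning one row per country with its cities aggregated into a list:

   // one row per country; cities arrive as a collection,
   // which is exactly the shape my object model wants
   MATCH (country:Country) -- (city:City)
   WHERE country.name IN ['X','Y']
   RETURN country, collect(city) AS cities

That's exactly the kind of choice that evaporates once a rows-first abstraction sits between me and the query.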

Cypher is different from SQL

Another thing that really bugged me in SDN/OGM was the lack of proper support for the "MERGE" functionality. One of the things that I absolutely love about Cypher is how easy it is to update/enrich things. If I have 3 sources of data that complement each other, it's super easy to combine them into one superset (enrich properties, create missing nodes etc.). I don't need to try and find a matching node first (if it even exists) - all this is done for me. SQL doesn't really have a direct equivalent (except perhaps in specific dialects), but that's no reason not to use it with Neo4j. Which sort of brings me to the next point...
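
To show what I mean - a hedged sketch (the City label and the name/props parameters are just illustrative): one MERGE statement per source creates the node on first sight and enriches it on every later pass, with no lookup needed beforehand:

   // run once per data source: MERGE finds the node or creates it,
   // then += folds in whatever properties this source knows about
   MERGE (c:City {name: {name}})
   ON CREATE SET c.firstSeen = timestamp()
   SET c += {props}

Later sources simply overwrite the keys they share with earlier ones; everything else accumulates.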

I don't care about the internal ID, I have a business id

Whilst the support for composite (multi-property) business ids in Neo4j could be better, with a tiny bit of magic (AKA string concatenation) - or if you're lucky and the id is a single property - managing updates is super easy. Sadly, Neo4j OGM brings the Neo4j internal ID into the picture, and pretty much all operations are based on it.
Why is that a problem? I have an externally managed business id (e.g. assigned by a database, or an externally generated UUID). I process updates to entities - e.g. I get MQ messages carrying the new state of the entity with a given business id. If I go bare-bones Neo4j with a merge, this is a single, super-simple query. If I go the SDN/OGM route, each message requires me to first fish the entity out of the graph (based on the business id), then update all the properties on it from the received object, and only then can I issue an update. If the object has relationships you have to be really careful about the depth of the fetch, and overall things can get really messy really quickly - I managed to get all my relationships wiped out while trying to update an object's properties, for example... Probably my fault, but it wouldn't have happened if I hadn't been using the "magically" generated queries and had just issued a simple merge with a property update instead.
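
For comparison, that "single super-simple query" looks roughly like this (the Entity and Parent labels, the businessId/props/parentId parameters and the :BELONGS_TO relationship are all hypothetical names for the sketch):

   // one round-trip per MQ message: create-or-update by business id,
   // without fetching the entity first
   MERGE (e:Entity {businessId: {businessId}})
   SET e += {props}
   // relationships are merged the same way, so nothing gets wiped out
   MERGE (p:Parent {businessId: {parentId}})
   MERGE (e)-[:BELONGS_TO]->(p)

No fetch, no fetch-depth to tune, no internal IDs anywhere in sight.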

Quo vadis?

I realize that some of the issues I mentioned here can be fixed. However, the point I'm trying to make is that the abstraction we're starting with pushes us towards working with Neo4j in sub-optimal ways. It brings relational database usage patterns into a graph world. We can try to adjust it to this new world, but ultimately it wasn't designed with that in mind and will probably always feel a little bit awkward. It will always push our thinking into a rows-oriented view first, which then (maybe) gets adjusted into a graph view. It creates an illusion that we're working with something familiar - but IMO we're not.
It might be especially dangerous when you try this approach with a team of people who are very familiar and comfortable with the database world, but don't take the time to understand the difference that Neo4j brings. You'll see queries like "MATCH (foo:Foo), (bar:Bar) WHERE foo.id = bar.id" - and they'll wonder why things are so slow. But it's hard to blame them - if it looks like a database, if it works with Spring JdbcTemplate, shouldn't it behave the same?
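To make the contrast concrete - a hedged sketch (the :LINKED_TO relationship name is mine): in a graph, a property join like that is a modelling smell; the connection between the two nodes should be stored as a relationship and traversed, not recomputed on every query:

   // relational habit: join two full label scans on a property -
   // Cypher will happily do it, just very slowly
   MATCH (foo:Foo), (bar:Bar) WHERE foo.id = bar.id
   RETURN foo, bar

   // graph habit: the connection is an actual relationship,
   // so fetching it is a cheap traversal, not a join
   MATCH (foo:Foo)-[:LINKED_TO]->(bar:Bar)
   RETURN foo, bar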
Abstractions are nice when we're abstracting from an apple and an orange to a fruit (which is why SQL and JDBC were so successful), but what do you abstract to from an apple and a bunny?

So for now, I decided to go with bare Neo4j. I've started creating a mini-abstraction over embedded querying vs the REST API. It is very graph-specific - but I'm fine with that. That's the level of abstraction that I find useful. Neo4j's native APIs are actually quite pleasant to work with, so using them directly rather than through an abstraction works much better for me. And contrary to what I expected, I'm much more productive now that I'm not fighting the tools to do what I want to do.