It's been an awfully long time since I last posted here about Neo4j. I have actually been using it quite a lot, on and off, on my personal project and doing some prototyping for the clients. Sadly it never took off (more due to political environment of the project than due to merits) but I've had some thoughts that I'd like to share here.
Let me start by saying that I absolutely love Neo4j as a database. It has definitely reached the state where you can get a lot for your (metaphorical) buck - i.e. get something working really nicely in a very short space of time. I managed to prepare a simple model, import some CSV data and link it nicely within a couple of hours (and most of that time was getting the data into CSV!). Everybody on the project reacted with a lovely "WOW" when they saw me playing around with the GUI to look into some data patterns.
There were voices that perhaps having nodes bouncing around like little bunnies is a bit too much, but otherwise, a full success.
As long as my queries are 5-liners fitting into the UI, and my data can be imported from CSV, I'm loving it. Things get quite a bit different when I start looking into building an actual, proper project on top of Neo4j - i.e. where data needs to be updated on a selective basis, with transactionality, an object model, data retrieval for dynamically built queries, and so on. Basically, if we think of the Neo4j UI app as the equivalent of Embarcadero/SquirrelSQL/Oracle SQL Developer/pick your poison - then what I'm talking about is getting from there to building an actual application. JDBC, Hibernate, the lot.
And I think this is precisely where the problem is. We're trying to re-write JDBC and Hibernate for Neo4j and it just doesn't feel good.
I understand the appeal of "one size fits all". I understand the rationale behind the Spring Data X initiative or creating the JDBC driver. I was actually really excited about it - so much Spring JDBC goodness to reuse! It's lovely to think that we can abstract away from our underlying storage, have a common API for operating on all our data, and swap things underneath. I started by trying to use Spring Data Neo4j (in both versions, before and after the re-write). I've tried Neo4j JDBC. I really wanted to love it - but I didn't. With all of them, the more I used them the more frustrated I got. The abstraction was taking away parts of the functionality of Neo4j, or at least making it very awkward and unnatural to work with. It felt almost as if I had to hack it to do what I wanted. So whilst in theory abstraction is great, in practice... Well. It didn't work for me.
My data is not arranged in rows
The whole point of my choosing Neo4j and not a traditional database is that my data does NOT fit nicely into a tabular world. My data is a graph - and therefore trying to fit it into abstractions conceived at a time when data was mostly tabular just doesn't work that well. For NoSQL databases which are closer to the tabular world (like Cassandra) I suppose it makes a bit more sense - but for something like Neo4j, sorry, it just does not work for me.

The simplest example I can think of: say I have a graph with cities and countries, and a relationship between city and country. I want to fetch all cities in countries X and Y and build an object model where a Country contains its cities. There are 100 cities in country X and 50 in country Y. The "rows" approach of Neo4j JDBC (or an SDN repository), with a query along the lines of
MATCH (country:Country) -- (city:City) WHERE country.name IN ['X','Y'] RETURN country,city
will mean that I get the details of country X 100 times, and country Y 50 times. Which is 99+49 times too many. It's not wrong. I can get to my object model from that. It's just inefficient. Neo4j itself (even on the REST endpoint) supports a "graph" view as well as a "rows" view (after all, some queries might return more "tabular" data) - but if you're going through an abstraction, the opportunity to choose whichever is better for any given query is lost.
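To make the cost concrete, here is a minimal Java sketch (names are mine, no Neo4j driver involved) of what the client side ends up doing with those rows: regrouping them back into a Country-with-cities object model, throwing away the 148 redundant copies of the country details along the way.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RowGrouping {
    // One "row" of the tabular result: the country details are
    // repeated alongside every one of its cities.
    static final class Row {
        final String country;
        final String city;
        Row(String country, String city) { this.country = country; this.city = city; }
    }

    // Rebuild the Country -> cities object model from the flat rows,
    // discarding the redundant repeated country values as we go.
    static Map<String, List<String>> groupByCountry(List<Row> rows) {
        Map<String, List<String>> byCountry = new LinkedHashMap<>();
        for (Row r : rows) {
            byCountry.computeIfAbsent(r.country, k -> new ArrayList<>()).add(r.city);
        }
        return byCountry;
    }

    public static void main(String[] args) {
        // 3 rows, but country "X" arrives twice - with 100 cities it
        // would arrive 100 times.
        List<Row> rows = List.of(
                new Row("X", "Xville"),
                new Row("X", "Xton"),
                new Row("Y", "Ytown"));
        System.out.println(groupByCountry(rows)); // {X=[Xville, Xton], Y=[Ytown]}
    }
}
```

A "graph" view would hand you each country node once, with its relationships, so none of this regrouping would be needed.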
Cypher is different from SQL
Another thing that really bugged me in SDN/OGM was the lack of proper support for the "MERGE" functionality. One of the things that I absolutely love about Cypher is how easy it is to update/enrich things. If I have 3 sources of data that complement each other, it's super easy to combine them into one superset (enrich properties, create missing nodes etc.). I don't need to try and find a matching node (if it exists) first - all this is done for me. SQL doesn't really have a direct equivalent (except perhaps in specific dialects), but that's no reason not to use it with Neo4j. Which sort of brings me to the next point...

I don't care about internal ID, I have a business id
Whilst the support for composite (multi-property) business ids in Neo4j could be better, with a tiny bit of magic (AKA string concatenation), or if you're lucky and the id is a single property, managing updates is super easy. Sadly, Neo4j/OGM brings the Neo4j internal ID into the picture, and pretty much all the operations are based on it.

Why is that a problem? I have an externally managed business id (e.g. given by a database, or an externally generated UUID). I process updates to entities, e.g. get MQ messages with the new state of the entity with a given business id. If I go bare-bones Neo4j with a merge, this is a single, super-simple query. If I try to go the SDN/OGM route, each message requires me to first fish the entity out of the graph (based on the business id), then update all the properties on it from the received object, and only then can I issue an update. If the object has relationships you have to be really careful about the depth of the fetch, and overall things can get really messy really quickly - I managed to get all my relationships wiped out while trying to update an object's properties, for example... Probably my fault, but it wouldn't have happened if I wasn't using the "magically" generated queries and had just issued a simple merge with an update of properties instead.
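For comparison, the bare-bones version of that update path really is one parameterised statement keyed on the business id. A sketch in plain Java (the :Entity label and property names are made up for illustration, and the `$param` syntax is the newer Cypher parameter style):

```java
import java.util.Map;

public class MergeByBusinessId {
    // One upsert keyed purely on the business id: create the node if
    // it's missing, then overwrite its properties from the incoming
    // message - a single round trip, no prior fetch by internal ID.
    static final String UPSERT =
            "MERGE (e:Entity {businessId: $businessId}) SET e += $props";

    // Bundle the parameters for one incoming update message.
    static Map<String, Object> params(String businessId, Map<String, Object> newState) {
        return Map.of("businessId", businessId, "props", newState);
    }

    public static void main(String[] args) {
        Map<String, Object> p = params("42-abc", Map.of("name", "New name"));
        // With any Cypher-speaking client, this query-plus-parameters
        // pair is all that gets sent per MQ message.
        System.out.println(UPSERT);
        System.out.println(p);
    }
}
```

The point isn't the Java, it's that the whole "fetch by internal ID, copy properties, save" dance collapses into one statement.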
Quo vadis?
I realize that some of the issues I mentioned here can be fixed. However, the point I'm trying to make is that the abstraction we're starting with pushes us towards working with Neo4j in sub-optimal ways. It brings relational database usage patterns into a graph world. We can try to adjust it to this new world but ultimately, it wasn't designed with that in mind and will probably always feel a little bit awkward. It will always push our thinking into a rows-oriented view first, which then (maybe) gets adjusted into a graph view. It creates an illusion that we're working with something familiar - but IMO we're not.

It might be especially dangerous when you try this approach with a team of people who are very familiar and comfortable with the database world, but don't take the time to understand the difference that Neo4j brings. You'll see queries like "MATCH (foo:Foo), (bar:Bar) WHERE foo.id = bar.id" - a cartesian product across the two labels - and they'll wonder why things are so slow. But it's hard to blame them - if it looks like a database, if it works with Spring JdbcTemplate, shouldn't it behave the same?
Abstractions are nice when we're abstracting from an apple and an orange to a fruit (which is why SQL and JDBC were so successful), but what do you abstract to from an apple and a bunny?
So for now, I decided to go with bare Neo4j. I've started creating a mini-abstraction over embedded querying vs REST API. It is very graph-specific - but I'm fine with that. That's the level of abstraction that I find useful. Neo4j native APIs are actually quite pleasant to work with, so I find that using them directly instead of through an abstraction works much better for me. And contrary to what I expected, I'm much more productive now that I'm not fighting the tools to do what I want to do.
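For what it's worth, the level of abstraction I mean is roughly the following (a sketch; the names are mine, and the real embedded and REST implementations are elided): one graph-flavoured seam that takes Cypher plus parameters, so tests, the embedded heavy-lifting process and the remote read-only clients can each plug in their own executor.

```java
import java.util.List;
import java.util.Map;

public class GraphSeam {
    // The single seam: Cypher plus parameters in, rows out.
    // Real implementations (not shown) would wrap the embedded
    // GraphDatabaseService and the REST endpoint respectively.
    interface CypherExecutor {
        List<Map<String, Object>> run(String cypher, Map<String, Object> params);
    }

    // A trivial in-memory fake for tests: records the last query
    // and returns a canned row.
    static final class FakeExecutor implements CypherExecutor {
        String lastCypher;
        @Override
        public List<Map<String, Object>> run(String cypher, Map<String, Object> params) {
            lastCypher = cypher;
            return List.of(Map.of("ok", true));
        }
    }

    public static void main(String[] args) {
        CypherExecutor executor = new FakeExecutor(); // swapped per environment
        System.out.println(executor.run("MATCH (n) RETURN count(n)", Map.of()));
    }
}
```

It's deliberately graph-specific: no pretence of portability to a relational store, just a way to swap where the Cypher runs.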
Great Feedback Liliana, would love to continue to talk to you about it.
Btw. you know that collect provides you with the country + list of cities view.
But you're right we should offer to return the data just as a graph. We're working towards that with the new Neo4j APIs.
MATCH (country:Country) -- (city:City) WHERE country.name IN ['X','Y'] RETURN country,collect(city)
Good feedback on SDN4/OGM too, very valuable.
Would also love to hear what we could improve in terms of APIs and documentation to make the path better for relational developers and also people like you with a deeper understanding.
Cheers, Michael
Hi Michael,
That's a neat trick with collect, I didn't think of that but indeed, it would do what I want it to. :) Not going to work for all cases where you're fetching a graph, but for simple problems like getting a node with its "children" it is indeed a good option.
The much bigger problem I faced in SDN4 (and an ultimate deal-breaker), which I didn't mention above as it's not really related to abstraction as such, was that whilst SDN3 was pretty much embedded-only, SDN4 has become REST-only (as far as I can tell - again, correct me if I'm wrong).
What I really wanted was flexibility in switching between embedded/in-memory and remote via Spring profiles (in-memory for tests, embedded for one process which needs to do heavy lifting, and remote for light read-only queries in another process). This proved to be really difficult to do with SDN4, partly due to it only working with annotations (not Spring XML), which makes it real hackery to re-wire it to work off an embedded GDS within the same process. As far as I remember I managed to do that, but then hit some other issues with it blocking MERGE queries and having trouble mapping enums, and I simply gave up, as it felt like I was spending more time fighting SDN than it was saving me on the mappings etc. It was in the times of milestones so I'm sure some of the issues/bugs were ultimately resolved, but I didn't really look back. I don't think I will, at least until XML config is supported.
As a disclaimer, I should probably say that I'm not a fan of Hibernate, and in the relational world I always go with plain JDBC - so it might also have something to do with not being willing to give up control :)