What is compelling for my purposes is the ability to return a JSON document to the client directly from a SQL query that accesses relational data. This means that a complex JSON document which would otherwise require multiple round-trips to the database can now be built in a single SQL statement. Depending on various factors, this can be 10 to 100 times faster than an equivalent ORM-based solution, and in my opinion it is easier to code and maintain (you need to be proficient in SQL to appreciate this).
I wanted to create a fictitious example and, inspired by this other post by Dan McClain, I tried a similar schema and radically simplified SQL to see how it performs.
I did not understand why Dan designed tags to be unique per note. In that case the JSON he wants to produce is somewhat redundant: if tags are unique per note, they can be embedded inside the notes. And if they aren't, why would tags have a note_id attribute?
In my design notes have tags but different notes can be tagged with the same tag. So there is an intersection table to manage the many-to-many relationship between notes and tags:
CREATE TABLE notes (
    note_id serial NOT NULL PRIMARY KEY,
    note_title text NOT NULL,
    note_content text NOT NULL
);

INSERT INTO notes (note_title, note_content)
SELECT 'Note #' || s, 'Lorem ipsum...'
FROM generate_series(1, 1000) AS g (s);

CREATE TABLE tags (
    tag_id serial NOT NULL PRIMARY KEY,
    tag_name text NOT NULL
);

INSERT INTO tags (tag_name)
SELECT 'Tag #' || s
FROM generate_series(1, 100) AS g (s);

CREATE TABLE note_tags (
    note_id integer NOT NULL REFERENCES notes,
    tag_id integer NOT NULL REFERENCES tags,
    CONSTRAINT nota_pk PRIMARY KEY (note_id, tag_id)
);

INSERT INTO note_tags
SELECT note_id, tag_id FROM notes, tags WHERE random() <= 0.1;
I'm going to fetch a page of 40 notes, and I want to include only the tags that are referenced by the notes in that page. This is a result that could not be extracted by a single SQL statement executed in the traditional manner: normally it would take multiple queries, and a somewhat convoluted procedure, to materialize this JSON in an application program. With a vanilla ORM it may take up to 141 queries (1 for the notes, 40 for each note's note_tags rows, and 100 for the individual tags). By preloading all the tags in a single query we can reduce this to 42 queries; that works because we only have 100 tags, but it may not with a larger data set, and the logic for filtering out only the referenced tags adds complexity to the solution. We could get down to 3 queries by fetching all the note_tags for the fetched notes in a single query, further complicating the code. Eagerly fetching note_tags together with the notes may be simpler, but then we get redundant data over the wire: each note's data is sent roughly ten times instead of once. This may cancel out the benefit of avoiding a nested query (not in this case, since we have few columns to read, but in a more realistic situation where notes have many columns it may actually make things worse).
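The query-count arithmetic above can be reproduced with a toy model. This is a minimal sketch with SQLite standing in for PostgreSQL and a hand-rolled counter standing in for an ORM; the helper q(), the deterministic tag layout, and the identity-map cache are invented for illustration. What matters are the counts: 1 + 40 + 100 = 141 for the naive pattern, 1 + 1 + 40 = 42 when all tags are preloaded.

```python
import sqlite3

# Scaled-down schema: 40 notes, 100 tags, 10 tags per note, laid out
# deterministically so that all 100 tags are referenced by the page.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE notes (note_id INTEGER PRIMARY KEY, note_title TEXT);
    CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, tag_name TEXT);
    CREATE TABLE note_tags (note_id INTEGER, tag_id INTEGER,
                            PRIMARY KEY (note_id, tag_id));
""")
db.executemany("INSERT INTO notes VALUES (?, ?)",
               [(i, f"Note #{i}") for i in range(1, 41)])
db.executemany("INSERT INTO tags VALUES (?, ?)",
               [(i, f"Tag #{i}") for i in range(1, 101)])
db.executemany("INSERT INTO note_tags VALUES (?, ?)",
               [(n, ((n - 1) * 10 + j) % 100 + 1)
                for n in range(1, 41) for j in range(10)])

queries = 0
def q(sql, args=()):
    """Run one SQL statement, counting it as one ORM round-trip."""
    global queries
    queries += 1
    return db.execute(sql, args).fetchall()

# Vanilla lazy loading: one query for the page, one per note for its
# note_tags rows, one per tag (an identity map ensures each distinct
# tag is fetched only once).
notes = q("SELECT note_id, note_title FROM notes ORDER BY note_id LIMIT 40")
tag_cache = {}
for note_id, _ in notes:
    for (tag_id,) in q("SELECT tag_id FROM note_tags WHERE note_id = ?",
                       (note_id,)):
        if tag_id not in tag_cache:
            tag_cache[tag_id] = q("SELECT tag_name FROM tags WHERE tag_id = ?",
                                  (tag_id,))
print("vanilla:", queries)        # 1 + 40 + 100 = 141

# Preloading all tags up front removes the per-tag queries.
queries = 0
all_tags = dict(q("SELECT tag_id, tag_name FROM tags"))
notes = q("SELECT note_id, note_title FROM notes ORDER BY note_id LIMIT 40")
for note_id, _ in notes:
    q("SELECT tag_id FROM note_tags WHERE note_id = ?", (note_id,))
print("preloaded:", queries)      # 1 + 1 + 40 = 42
```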
So we can see that, on the application side, generating the required JSON is not trivial, and being efficient may require a number of optimization decisions and hacks in the code.
Now I present a PostgreSQL 9.3 query that returns the required result. It is, IMHO, far simpler than the solution presented by Dan:
WITH n AS (
    SELECT note_id, note_title, note_content,
           (SELECT array_agg(tag_id)
              FROM note_tags
             WHERE note_id = notes.note_id) AS tags
      FROM notes
     ORDER BY note_id
     LIMIT 40
)
SELECT row_to_json(ROW(
    (SELECT json_agg(t)
       FROM (SELECT DISTINCT tags.tag_id, tags.tag_name
               FROM note_tags nt
               JOIN tags USING (tag_id)
               JOIN n USING (note_id)) AS t),
    (SELECT json_agg(n) FROM n)
));
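The same shape can be reproduced outside PostgreSQL. As a runnable illustration of the technique (not of the exact 9.3 functions), here is a scaled-down SQLite analogue, with json_object()/json_group_array() standing in for row_to_json()/json_agg(); it assumes an SQLite build with the JSON functions available (the default in recent versions). Table sizes, page size, and the tag layout are invented for brevity.

```python
import json
import sqlite3

# Miniature version of the schema: 10 notes, 5 tags, 2 tags per note.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE notes (note_id INTEGER PRIMARY KEY, note_title TEXT);
    CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, tag_name TEXT);
    CREATE TABLE note_tags (note_id INTEGER, tag_id INTEGER,
                            PRIMARY KEY (note_id, tag_id));
""")
db.executemany("INSERT INTO notes VALUES (?, ?)",
               [(i, f"Note #{i}") for i in range(1, 11)])
db.executemany("INSERT INTO tags VALUES (?, ?)",
               [(i, f"Tag #{i}") for i in range(1, 6)])
db.executemany("INSERT INTO note_tags VALUES (?, ?)",
               [(n, t) for n in range(1, 11)
                for t in (n % 5 + 1, (n + 2) % 5 + 1)])

# One statement, one round-trip: the whole page as a JSON document.
# json() re-tags subquery results as JSON so they nest as objects,
# not as escaped strings.
(doc,) = db.execute("""
    WITH n AS (
        SELECT note_id, note_title,
               (SELECT json_group_array(tag_id)
                  FROM note_tags
                 WHERE note_id = notes.note_id) AS tag_ids
          FROM notes
         ORDER BY note_id
         LIMIT 4
    )
    SELECT json_object(
        'tags',  json((SELECT json_group_array(
                                  json_object('tag_id', tag_id,
                                              'tag_name', tag_name))
                         FROM tags
                        WHERE tag_id IN (SELECT tag_id FROM note_tags
                                          WHERE note_id IN (SELECT note_id
                                                              FROM n)))),
        'notes', json((SELECT json_group_array(
                                  json_object('note_id', note_id,
                                              'note_title', note_title,
                                              'tags', json(tag_ids)))
                         FROM n)))
""").fetchone()

page = json.loads(doc)
print(len(page["notes"]), "notes,", len(page["tags"]), "referenced tags")
```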
On my desktop computer (Intel i7, 12 GB RAM), this query takes 5.1 milliseconds to execute. Interestingly, the cost of the JSON rendering is small, around 0.6 ms; this can be read from the EXPLAIN (ANALYZE, VERBOSE) output of the statement execution plan. To get a realistic execution time you have to add the call latency between a hypothetical application server and the database, which in a well engineered network may be of the same order of magnitude (5 ms).
The most optimized application-level solution will be at least 6 times slower than this, due to latencies and to redundant application-level processing and data-format translations. But I presume that programmers will normally prefer to write more readable and maintainable code, so they may opt for the first, vanilla ORM solution, which may be 100 times slower. Maybe this goes into production unnoticed, and if the user population is small and tolerant it may actually be acceptable (users start perceiving latency when response time reaches about 0.7 seconds).
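The round-trip arithmetic behind these estimates can be sketched as a back-of-envelope check, using only the latency figure from this post. Per-query execution cost and application-side processing are ignored, so the multi-query figures are lower bounds:

```python
# 5 ms network latency per round-trip, 5.1 ms for the single statement,
# as measured above. Everything else is deliberately ignored.
latency_ms = 5.0

single_statement = 5.1 + latency_ms   # 1 round trip  -> ~10 ms
vanilla_orm      = 141 * latency_ms   # 141 round trips -> 705 ms, ~0.7 s
preloaded_tags   = 42 * latency_ms    # 42 round trips  -> 210 ms

print(f"single statement: {single_statement:.1f} ms")
print(f"vanilla ORM:      {vanilla_orm:.0f} ms "
      f"(~{vanilla_orm / single_statement:.0f}x slower)")
print(f"preloaded tags:   {preloaded_tags:.0f} ms")
```

Note how 141 round-trips at 5 ms of latency each land almost exactly on the 0.7-second perception threshold mentioned above.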
But the SQL solution is quite simple and enjoyable: 17 lines of code, with no mutable variables, no loops, no assignments, no conditionals. Completely declarative code that is quite easy to understand. The CTE pulls the notes, each one with its own array of tags (the array_agg call). The main SELECT constructs a JSON object from a SQL ROW made of two columns: the first gets all the referenced tags by joining with the CTE result and aggregating the tag rows into a JSON array; the second is the aggregation of the CTE result itself into a JSON array. No messing with performance-related problems: let the database optimize the execution for us.
This solution is fast, scales well and is easy to program.
There is, however, a problem with this solution as of PostgreSQL 9.3: the ROW construct causes the output columns to be named f1 and f2 instead of tags and notes. I've not found a workaround, but 9.4 brings a plethora of new JSON functions, among them constructor functions like json_build_object(VARIADIC "any"), which is exactly what we need in our case:

WITH n AS (
    SELECT note_id, note_title, note_content,
           (SELECT array_agg(tag_id)
              FROM note_tags
             WHERE note_id = notes.note_id) AS tags
      FROM notes
     ORDER BY note_id
     LIMIT 40
)
SELECT json_build_object(
    'tags',  (SELECT json_agg(t)
                FROM (SELECT DISTINCT tags.tag_id, tags.tag_name
                        FROM note_tags nt
                        JOIN tags USING (tag_id)
                        JOIN n USING (note_id)) AS t),
    'notes', (SELECT json_agg(n) FROM n)
);

I've not tested this, anyway.
To conclude, JSON-powered relational databases show interesting possibilities:
- significant performance improvements, by moving code into the data-processing layer;
- code simplification: just express the result you want and let the database do its magic.
In a future post I will explore the possibility of using JSON as a means to efficiently perform database updates. Stay tuned.