Saturday, May 31, 2014

The benefits of PostgreSQL side JSON rendering

Nowadays JSON has become the standard Web data interchange format. Whether your client runs in a browser or as a native mobile app, or you are accessing a web service from a server-side component, you are probably going to model and exchange data in JSON. JSON parsing and rendering is available in every modern programming language, and JSON has become the format of choice for NoSQL databases. It is also supported as a native datatype in several relational databases: PostgreSQL has a json datatype and a set of operators for constructing, accessing and processing JSON data from within the RDBMS.
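
For instance, a json column can be traversed with the -> and ->> operators (a minimal sketch; the events table here is made up for illustration):

CREATE TABLE events (payload json);
INSERT INTO events VALUES ('{"user": {"name": "Ann"}, "clicks": 3}');
SELECT
    payload -> 'user' ->> 'name' AS user_name, -- -> yields json, ->> yields text
    payload ->> 'clicks' AS clicks
FROM events;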

What is compelling for my purposes is the ability to return a JSON document to the client directly from a SQL query that accesses relational data. A complex JSON document that would otherwise require multiple round-trips to the database can now be built in a single SQL statement. Depending on various factors this can be 10 to 100 times faster than an equivalent ORM-based solution, and in my opinion it is easier to code and maintain (you need to be proficient in SQL to appreciate this).
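
The basic building block is the row_to_json function, which renders any row as a JSON object; a trivial sketch with a constant subselect:

SELECT row_to_json(d)
FROM (SELECT 1 AS id, 'hello' AS title) AS d;
-- {"id":1,"title":"hello"}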

I wanted to create a fictitious example and, inspired by this other post by Dan McClain, I tried a similar schema with a radically simplified SQL query to see how it performs.
I did not understand why Dan designed tags to be unique by note. In that case the JSON he wants to produce is somewhat redundant: if tags are unique by note, they can be embedded inside notes; if they aren't, why do tags have a note_id attribute?
In my design notes have tags, but different notes can be tagged with the same tag, so there is an intersection table to manage the many-to-many relationship between notes and tags:


-- 1,000 sample notes
CREATE TABLE notes (
    note_id serial NOT NULL PRIMARY KEY,
    note_title text NOT NULL,
    note_content text NOT NULL );

INSERT INTO notes (
    note_title,
    note_content )
SELECT 'Note #' || s,
    'Lorem ipsum...'
FROM
    generate_series (1, 1000) AS g (s)
;

-- 100 sample tags
CREATE TABLE tags (
    tag_id serial NOT NULL PRIMARY KEY,
    tag_name text NOT NULL);

INSERT INTO tags (tag_name)
SELECT 'Tag #' || s
FROM generate_series (1, 100) AS g (s)
;

-- Intersection table: tag each note with roughly 10% of the tags at random
CREATE TABLE note_tags (
    note_id INTEGER NOT NULL REFERENCES notes,
    tag_id INTEGER NOT NULL REFERENCES tags,
    CONSTRAINT nota_pk PRIMARY KEY (note_id, tag_id));

INSERT INTO note_tags
SELECT
    note_id,
    tag_id
FROM
    notes,
    tags
WHERE random() <= 0.1;

I'm going to fetch a page of 40 notes, and I want to include only the tags that are referenced by the notes in that page. This JSON contains a dataset that could not be extracted with a single SQL statement executed in the traditional manner: normally it would take multiple queries and a somewhat convoluted procedure to materialize it in an application program. Consider the alternatives:
- A vanilla ORM may issue up to 141 queries (one for the notes, one per note for its note_tags, one per distinct tag: 1 + 40 + 100).
- Preloading all the tags in a single query reduces the number of queries to 42. This works because we only have 100 tags, but may not work with a larger data set, and the logic for filtering out only the referenced tags adds complexity to the solution.
- Fetching all the note_tags for the fetched notes in a single query reduces the number of queries to 3, further complicating the code.
- Eagerly fetching note_tags together with notes may be simpler, but we get redundant data over the wire: each note's data is transferred about 10 times instead of once. This may zero out the benefit of avoiding a nested query (not in this case, since we have few columns to read, but in a more realistic situation where notes have many columns it may actually make things worse).
So we can see that, on the application side, generating the required JSON is not trivial, and making it efficient may involve a lot of optimization decisions and hacks in the code.
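
For reference, the document we are after has roughly this shape (abridged, with made-up tag ids; as discussed below, the 9.3 query actually names the two keys f1 and f2):

{"tags":  [{"tag_id": 1, "tag_name": "Tag #1"},
           {"tag_id": 7, "tag_name": "Tag #7"}, ...],
 "notes": [{"note_id": 1, "note_title": "Note #1",
            "note_content": "Lorem ipsum...", "tags": [1, 7]}, ...]}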
Now I present a PostgreSQL 9.3 query that returns the required result. It is far simpler, IMHO, than the solution presented by Dan:


WITH n AS (
    SELECT
        note_id,
        note_title,
        note_content,
        (
            SELECT
                array_agg(tag_id)
            FROM
                note_tags
            WHERE
                note_id = notes.note_id) tags
    FROM
        notes
    ORDER BY
        note_id
    LIMIT 40 )
SELECT
    row_to_json (
        ROW (
            (
                SELECT
                    json_agg(t)
                FROM (
                        SELECT
                            DISTINCT tags.tag_id,
                            tags.tag_name
                        FROM
                            note_tags nt
                            JOIN tags USING (tag_id)
                            JOIN n USING (note_id)) AS t ),
            (
                SELECT
                    json_agg (n)
                FROM
                    n)));

On my desktop computer (Intel i7, 12 GB RAM), this query takes 5.1 milliseconds to execute. Interestingly, the cost of the JSON rendering itself is small, around 0.6 ms. This data can be computed from the EXPLAIN (ANALYZE, VERBOSE) output of the statement execution plan. To get a realistic end-to-end time you have to add the call latency between a hypothetical application server and the database, which in a well-engineered network may be of the same order of magnitude (5 ms).
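
For the curious, the per-node timings come from wrapping the same statement (here in compact form) in the EXPLAIN options:

EXPLAIN (ANALYZE, VERBOSE)
WITH n AS (
    SELECT note_id, note_title, note_content,
        (SELECT array_agg(tag_id) FROM note_tags
         WHERE note_id = notes.note_id) AS tags
    FROM notes ORDER BY note_id LIMIT 40)
SELECT row_to_json(ROW(
    (SELECT json_agg(t)
     FROM (SELECT DISTINCT tags.tag_id, tags.tag_name
           FROM note_tags nt
           JOIN tags USING (tag_id)
           JOIN n USING (note_id)) AS t),
    (SELECT json_agg(n) FROM n)));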
Even the most optimized application-level solution will be at least 6 times slower than this, due to latencies, redundant application-level processing and data format translations. But I presume programmers will normally prefer to write more readable and maintainable code, so they may opt for the vanilla ORM solution, which may be 100 times slower. Maybe this goes into production unnoticed, and if the user population is small and tolerant it may actually be acceptable (users start perceiving latency when response times reach about 0.7 seconds).
But the SQL solution is quite simple and enjoyable. It is 17 lines of code, with no mutable variables, loops, assignments or conditionals: completely declarative code that is quite easy to understand. The CTE pulls the notes, each with its own array of tags (the array_agg call). The main SELECT constructs a JSON object from a SQL ROW made of two columns: the first gets all the referenced tags by joining with the CTE result and aggregating the tag rows into a JSON array; the second is the aggregation of the CTE result itself into a JSON array. No messing with performance-related problems: we let the database optimize the execution for us.
This solution is fast, scales well and is easy to program.
There is, however, a problem with this solution as of PostgreSQL 9.3: the ROW construct causes the output keys to be named f1 and f2 instead of tags and notes. I've not found a workaround, but 9.4 brings a plethora of new JSON functions. Among these are new constructor functions like json_build_object(VARIADIC "any"), which is exactly what we need in our case:

WITH n AS (
    SELECT
        note_id,
        note_title,
        note_content,
        (
            SELECT
                array_agg(tag_id)
            FROM
                note_tags
            WHERE
                note_id = notes.note_id) tags
    FROM
        notes
    ORDER BY
        note_id
    LIMIT 40 )
SELECT
    json_build_object (
        'tags',
        (
            SELECT
                json_agg(t)
            FROM (
                    SELECT
                        DISTINCT tags.tag_id,
                        tags.tag_name
                    FROM
                        note_tags nt
                        JOIN tags USING (tag_id)
                            JOIN n USING (note_id)) AS t ),
        'notes',
        (
            SELECT
                json_agg (n)
            FROM
                n));

I've not tested this, though.
To conclude, JSON-powered relational databases show interesting possibilities:
- significant performance improvements from moving the code into the data processing layer;
- code simplification: just express the result you want and let the database do its magic.

In a future post I will explore the possibility of using JSON as a means to efficiently perform database updates. Stay tuned.

