Tuesday, June 17, 2014

The benefits of PostgreSQL side JSON parsing

In my previous post I talked about the benefits of rendering a JSON response directly from a PostgreSQL stored function or, more generally, from a SQL statement.
In this post I want to deal with the opposite requirement: parsing a JSON string directly inside the database engine, mainly for the purpose of performing a write transaction. I want to show the benefits of this approach with an admittedly simplified example.
There are two main advantages to parsing JSON inside the database when performing updates:
  • you can save several round trips from your application to the database if all the data needed for the transaction is sent to the database in a single shot. The amount of saving depends directly on the complexity of the JSON you need to process: the more complex the JSON, the greater the savings, mainly in terms of reduced latency;
  • you can expose JSON collections as relations inside your transaction and process the transaction in a set-oriented manner, the way relational databases prefer to process data. This means you can get much better performance.
So, enough talking; let's describe our test setting. The example is a customer order acquisition: orders are assumed to come from a web application somewhere, and clients send them back as JSON. An example JSON order is:

{
  "cust_id": 223, 
  "ord_date":"2014-06-02", 
  "payc_id": 22, 
  "rows": [{
     "rowno": 1, 
     "prod_id": 24, 
     "qty": 2, 
     "price": 5.50
    },{
     "rowno": 2, 
     "prod_id": 27, 
     "qty": 1, 
     "price": 12.50}]
}

Given this, the application server logic is trivial. The server takes the JSON coming from the web request and, presumably after some sort of authentication and validation, sends it as a parameter in a SQL call to PostgreSQL. Normally the application server does not even need to parse the JSON string received as a request parameter. This setting is also totally stateless: once the database request has been processed, no state needs to be kept on the server in order to respond to the next request from the same client. The same is true when we render JSON from inside the database, as described in the other post.
The database tables used in the example follow a deliberately simple design to keep things tractable.
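As a sketch of what this single call might look like (written here with PREPARE/EXECUTE for illustration; a real driver would simply bind the parameter), the whole acquisition reduces to one parameterized statement invoking the json_test function defined further down:

-- Hypothetical sketch: one parameterized call per order; json_test is the
-- stored function shown later in this post.
PREPARE acquire_order(json) AS
    SELECT json_test($1);

EXECUTE acquire_order('{"cust_id": 223, "ord_date": "2014-06-02", "payc_id": 22,
    "rows": [{"rowno": 1, "prod_id": 24, "qty": 2, "price": 5.50},
             {"rowno": 2, "prod_id": 27, "qty": 1, "price": 12.50}]}');
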
Here is the script that creates the tables and populates them with sample data:


CREATE TABLE paycond(
    paym_id serial NOT NULL PRIMARY KEY,
    paym_desc text NOT NULL );
INSERT INTO paycond(
    paym_desc)
SELECT
    'Payment method #' || s
FROM
    generate_series(1, 50) AS g (s);
CREATE TABLE cust(
    cust_id serial NOT NULL PRIMARY KEY,
    cust_name text NOT NULL,
    paym_id INTEGER NOT NULL REFERENCES paycond);
INSERT INTO cust(cust_name, paym_id)
SELECT
    'Cust #' || s, ceil(random() * 50)
FROM
    generate_series(1,1000) AS g(s);

CREATE TABLE ord_hdr(
    ord_id serial NOT NULL PRIMARY KEY,
    ord_date date NOT NULL DEFAULT CURRENT_DATE,
    cust_id INTEGER NOT NULL REFERENCES cust,
    paym_id INTEGER NOT NULL REFERENCES paycond,
    ord_amt numeric(12, 2) NOT NULL DEFAULT 0);

CREATE TABLE prod(
    prod_id serial NOT NULL PRIMARY KEY,
    prod_name text NOT NULL,
    prod_price numeric(8, 2) NOT NULL);
INSERT INTO prod(
    prod_name,
    prod_price)
SELECT
    'Prod #' || s, round((random()* 100)::numeric, 2)
FROM
    generate_series(1, 1000) AS g(s);
CREATE TABLE ord_details(
    ord_id INTEGER NOT NULL REFERENCES ord_hdr ON DELETE CASCADE,
    rowno SMALLINT NOT NULL,
    prod_id INTEGER NOT NULL REFERENCES prod,
    qty numeric(8, 2) NOT NULL,
    price numeric(8, 2) NOT NULL,
    amt numeric(12, 2) NOT NULL,
    CONSTRAINT ord_details_pk PRIMARY KEY (ord_id, rowno));

The code for the PL/pgSQL function that processes the order acquisition is:

CREATE FUNCTION json_test(val json) RETURNS INTEGER AS $$
BEGIN
    WITH hdr_data AS(
        SELECT (val ->> 'cust_id')::INTEGER AS cust_id,
            (val ->> 'ord_date')::date AS ord_date,
            (val ->> 'payc_id')::INTEGER AS paym_id,
            val -> 'rows' AS details),
    rows_coll AS (
        SELECT json_array_elements(details) AS r
        FROM hdr_data),
    details AS (
        SELECT (r ->> 'rowno')::SMALLINT rowno,
            (r ->> 'prod_id')::INTEGER prod_id,
            (r ->> 'qty')::DECIMAL qty,
            (r ->> 'price')::DECIMAL price
        FROM rows_coll AS r),
    hdr AS (
        INSERT INTO ord_hdr(
            cust_id,
            ord_date,
            paym_id )
        SELECT
            cust_id,
            ord_date,
            COALESCE(paym_id, (SELECT paym_id FROM cust WHERE cust_id = hdr_data.cust_id))
        FROM hdr_data
        RETURNING ord_id)
    INSERT INTO ord_details(
        ord_id,
        rowno,
        prod_id,
        qty,
        price,
        amt )
    SELECT
      ord_id,
      rowno,
      prod_id,
      qty,
      COALESCE(price, prod.prod_price),
      COALESCE(price, prod.prod_price) * qty
    FROM
      hdr,
      details
      JOIN prod USING(prod_id);

    UPDATE ord_hdr
    SET ord_amt = (
      SELECT sum(amt)
      FROM ord_details
      WHERE ord_id = currval('ord_hdr_ord_id_seq'::regclass))
    WHERE ord_id = currval('ord_hdr_ord_id_seq'::regclass);
    
    RETURN currval('ord_hdr_ord_id_seq'::regclass);
END;
$$ LANGUAGE plpgsql;

This code needs some explanation. It takes advantage of several PostgreSQL features, some of which are also available in other databases. First, the parameter is a JSON value: the caller can send a string cast to json, and this string will be parsed and passed to the function as a JSON value. An example call may be:


SELECT json_test(
        '{"cust_id": 223, "ord_date":"2014-06-02", "payc_id": 22, "rows": [{
        "rowno": 1, "prod_id": 24, "qty": 2, "price": 5.50},{
        "rowno": 2, "prod_id": 27, "qty": 1, "price": 12.50}]}'::json);

Then we use a chain of CTEs (Common Table Expressions) to decompose our JSON message into normalized terms that can be used as relations. To generalize the approach, consider that our input JSON is a master-detail structure, potentially with many detail sections containing further detail sections: a tree of sub-relations that starts from a root node (the approach also works if the root node is itself a collection, but the code is slightly different in that case).
The first thing we want to do is take the root and expose its data items with the expected data types, so they can be used as input values to an INSERT statement. PostgreSQL offers the ->> operator, which extracts a single member from a JSON value and returns it as text. For example, val ->> 'cust_id' returns the cust_id value in the JSON as a string; we can then cast the string to the appropriate data type, as in (val ->> 'cust_id')::integer, to get the final input value.
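For instance, run against a fragment of the sample order, these one-liners show the extraction and the cast (the values in the comments are what PostgreSQL returns):

SELECT ('{"cust_id": 223, "ord_date": "2014-06-02"}'::json ->> 'cust_id')::integer;  -- 223
SELECT ('{"cust_id": 223, "ord_date": "2014-06-02"}'::json ->> 'ord_date')::date;    -- 2014-06-02
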
For each collection embedded in the root node, we want to expose it as a JSON array value. For this we use the -> operator, which returns the JSON value of the referenced member. So in our case val -> 'rows' returns a JSON array of the order rows.
This covers the first CTE. Once we have the JSON for a collection, we need to transform it into a relation of JSON objects before we can use it. This is the task of the second CTE, rows_coll, where the json_array_elements function performs the transformation. Now that we have a relation of JSON objects, we can apply the algorithm recursively and expose these objects as a flattened relation with the ->> operator. This is done in the third CTE, details, where we apply to the order rows the same logic we used for the root. The pattern can be repeated to any depth.
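To make the decomposition concrete, here is the same pipeline run standalone on the rows array of the sample order (essentially what the first three CTEs of json_test do; the expected output is shown in the trailing comment):

WITH rows_coll AS (
    -- -> extracts the embedded array, json_array_elements unnests it
    -- into one JSON object per order row
    SELECT json_array_elements(
        '{"rows": [{"rowno": 1, "prod_id": 24, "qty": 2, "price": 5.50},
                   {"rowno": 2, "prod_id": 27, "qty": 1, "price": 12.50}]}'::json
        -> 'rows') AS r
)
-- ->> flattens each JSON object into typed columns
SELECT (r ->> 'rowno')::SMALLINT  AS rowno,
       (r ->> 'prod_id')::INTEGER AS prod_id,
       (r ->> 'qty')::DECIMAL     AS qty,
       (r ->> 'price')::DECIMAL   AS price
FROM rows_coll;
-- rowno | prod_id | qty | price
--     1 |      24 |   2 |  5.50
--     2 |      27 |   1 | 12.50
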
Now we have proper master and detail relations that we can use to INSERT into the ord_hdr and ord_details tables. Here we use another PostgreSQL feature: the ability to put DML statements inside CTEs and pipeline the output of one statement into the statements that follow it. We INSERT into ord_hdr from the hdr_data relation obtained in the first CTE. The INSERT business logic can be made as elaborate as needed by joining to other tables; for example, here we fall back to the payment method recorded in the customer row when it is missing from the JSON order received from the client. We use the RETURNING clause of the INSERT statement to return the ord_id value assigned by the sequence associated with the serial primary key; we need it to fill the foreign key in the ord_details table. The next INSERT statement, the last member of this multi-operation SQL statement, inserts the rows as a set into ord_details, joining to the hdr CTE result, which in this case is a single-row, single-column relation containing the assigned order id. Incidentally, we also join to the prod table to default the price when it is missing from the input JSON.
To complete our order we need one last step: storing in ord_hdr the order total amount, which is the sum of the amt column over all the ord_details rows belonging to that order. We might think we could simply add an UPDATE operation to compute the sum within the same statement, but unfortunately this does not work: the ord_hdr.ord_amt column remains zero.
The reason is that PostgreSQL processes the set of CTEs and the final INSERT as a single statement: the planner creates one plan for the whole computation. While this is very good for performance, it has a necessary drawback: all the operations in the statement see the state of the database as it was when the statement started, so any change made by DML inside a CTE is not visible to the following CTEs or to the final operation. This means that while we could turn the last INSERT into a CTE with a RETURNING clause to return the amt of the newly inserted rows, there would be no way to UPDATE the ord_hdr row we inserted earlier, since that row is not visible until the statement ends.
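A minimal standalone sketch of this visibility rule (not part of the order function):

-- The outer UPDATE runs against the snapshot taken at statement start, so it
-- cannot see the row inserted by the ins CTE and reports UPDATE 0, even
-- though the RETURNING value is available through the CTE itself.
WITH ins AS (
    INSERT INTO ord_hdr(cust_id, paym_id) VALUES (223, 22)
    RETURNING ord_id
)
UPDATE ord_hdr
SET ord_amt = 99
WHERE ord_id = (SELECT ord_id FROM ins);
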
While we could in principle complicate things and compute ord_amt directly from the JSON details, my objective is to keep the code manageable and readable, and to be able to generate this SQL automatically from a representation of the underlying business rules. So I preferred to let the statement end and add a separate, "classical" UPDATE statement that performs the required logic. Since this statement is independent, it has full visibility of the newly inserted rows and can perform the computation. There is only one small problem: we need to access the newly inserted ord_id, which outside the previous statement is no longer directly available. PostgreSQL lets you read the last value assigned from a sequence with currval('sequence_name'::regclass), where sequence_name is the name of the sequence associated with the ord_id serial column. This returns the last value assigned in the current session for the given sequence and is safe to use in a concurrent environment. We can therefore use this expression to retrieve the last assigned ord_id and update our order, completing the transaction logic.
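Outside the function, the mechanism can be tried in isolation (this is only an illustration; the order function relies on exactly the same call):

-- currval returns the value most recently obtained from the sequence in the
-- current session, so concurrent sessions never see each other's ord_id.
INSERT INTO ord_hdr(cust_id, paym_id) VALUES (223, 22);
SELECT currval('ord_hdr_ord_id_seq'::regclass) AS last_ord_id;
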
We return this ord_id to the caller to allow the client to fetch the processed order from the database.

Now you may want to know how fast this order processing function is. Performing this kind of evaluation is tricky in general. On my development system, which sports an Intel i7 processor, 12 GB of RAM and a 128 GB Patriot Inferno SSD, I got this result:

First execution (no data in cache, function not compiled):

"Result  (cost=0.00..0.26 rows=1 width=0) (actual time=14.324..14.325 rows=1 loops=1)"
"  Output: json_test('{"cust_id": 223, "ord_date":"2014-06-02", "payc_id": 22, "rows": [{"
""rowno": 1, "prod_id": 24, "qty": 2, "price": 5.50},"
"{"
""rowno": 2, "prod_id": 27, "qty": 1, "price": 12.50}]}'::json)"
"Total runtime: 14.348 ms"

Subsequent executions:

"Result  (cost=0.00..0.26 rows=1 width=0) (actual time=1.225..1.225 rows=1 loops=1)"
"  Output: json_test('{"cust_id": 223, "ord_date":"2014-06-02", "payc_id": 22, "rows": [{"
""rowno": 1, "prod_id": 24, "qty": 2, "price": 5.50},"
"{"
""rowno": 2, "prod_id": 27, "qty": 1, "price": 12.50}]}'::json)"
"Total runtime: 1.236 ms"

My tests show a consistent total runtime of around 1 ms for repeated executions with this schema and content (all data is in memory). This is generally difficult to beat in an ORM setting, where the same logic is scattered across many methods issuing many SQL statements to get the same result. Here you pay a single round trip to the database (not accounted for in the timings), while in an ORM setting you would probably need several, depending on the level of optimization you can afford to put in. A single round trip normally costs much more than 1 ms in a production server setting.
I also wanted to measure the impact of the JSON processing on the whole statement execution time. I used this SQL for the test:


EXPLAIN ANALYZE VERBOSE 
WITH t AS (
    SELECT '{"cust_id": 223, "ord_date":"2014-06-02", "payc_id": 22, "rows": [{
        "rowno": 1, "prod_id": 24, "qty": 2, "price": 5.50},{
        "rowno": 2, "prod_id": 27, "qty": 1, "price": 12.50}]}'::json val),
testata AS (
    SELECT (val ->> 'cust_id')::INTEGER AS cust_id,
        (val ->> 'ord_date')::date AS ord_date,
        (val ->> 'payc_id')::INTEGER AS paym_id,
        val -> 'rows' AS details
    FROM t),
rows_coll AS (
    SELECT json_array_elements(details) AS r
    FROM testata),
details AS (
    SELECT (r ->> 'rowno')::SMALLINT rowno,
        (r ->> 'prod_id')::INTEGER prod_id,
        (r ->> 'qty')::DECIMAL qty,
        (r ->> 'price')::DECIMAL price
    FROM rows_coll AS r)
SELECT *
FROM details;



All the CTEs are evaluated, and I measure an average execution time of 0.1 ms, which corresponds to a JSON processing overhead of at most 10% of the total.
Considering the savings in latency, I expect this approach to be significantly faster in terms of response time, and more scalable, for many real-world workloads.
In my opinion this approach is also simpler than an imperative, ORM-based solution for the same functionality. Still, since this representation is not easy to manage at a larger scale in terms of code volume and repetition, my objective is to make this code a compilation target for a business-rules compiler I'm working on.
I will talk about this in a future post.