
Tuesday, June 17, 2014

The benefits of PostgreSQL side JSON parsing

In my previous post I talked about the benefits of directly rendering a JSON response from a PostgreSQL stored function or SQL statement in general.
In this post I want to deal with the opposite requirement: parsing a JSON string directly inside the database engine, mainly in order to perform a write transaction. I want to show the benefits of doing so with an admittedly simplified example.
There are two main advantages in parsing JSON inside the database in order to update it:
  • you save several round trips from your application to the database if all the data needed for the transaction is sent to the database in a single shot. The saving depends directly on the complexity of the JSON you need to process: the more complex the JSON, the greater the savings, mainly in terms of reduced latency;
  • you can expose JSON collections as relations inside your transaction and process the transaction in a set-oriented manner, the way relational databases prefer to process data. This means you can get much better performance.
So, enough talking, let's describe the test setting. The example is a customer order acquisition: orders are assumed to come from a web application somewhere, and clients send them back in JSON format. An example JSON order is:

{
  "cust_id": 223, 
  "ord_date":"2014-06-02", 
  "payc_id": 22, 
  "rows": [{
     "rowno": 1, 
     "prod_id": 24, 
     "qty": 2, 
     "price": 5.50
    },{
     "rowno": 2, 
     "prod_id": 27, 
     "qty": 1, 
     "price": 12.50}]
}

Given this, the application server logic is trivial. The server takes the JSON coming from the web request and, presumably after some sort of authentication and validation, sends it as a parameter in a SQL call to PostgreSQL. Normally the application server does not even need to parse the JSON string received as a request parameter. This setting is also totally stateless: once the database request has been processed, no state needs to be kept on the server in order to respond to the next request from the same client. This is also true when we render the JSON from inside the database, as described in the other post.
The database tables used in the example represent a simple design in order to keep things tractable.
Here is the script that creates the tables and populates them with sample data:


CREATE TABLE paycond(
    paym_id serial NOT NULL PRIMARY KEY,
    paym_desc text NOT NULL );
INSERT INTO paycond(
    paym_desc)
SELECT
    'Payment method #' || s
FROM
    generate_series(1, 50) AS g (s);
CREATE TABLE cust(
    cust_id serial NOT NULL PRIMARY KEY,
    cust_name text NOT NULL,
    paym_id INTEGER NOT NULL REFERENCES paycond);
INSERT INTO cust(cust_name, paym_id)
SELECT
    'Cust #' || s, ceil(random() * 50)
FROM
    generate_series(1,1000) AS g(s);

CREATE TABLE ord_hdr(
    ord_id serial NOT NULL PRIMARY KEY,
    ord_date date NOT NULL DEFAULT CURRENT_DATE,
    cust_id INTEGER NOT NULL REFERENCES cust,
    paym_id INTEGER NOT NULL REFERENCES paycond,
    ord_amt numeric(12, 2) NOT NULL DEFAULT 0);

CREATE TABLE prod(
    prod_id serial NOT NULL PRIMARY KEY,
    prod_name text NOT NULL,
    prod_price numeric(8, 2) NOT NULL);
INSERT INTO prod(
    prod_name,
    prod_price)
SELECT
    'Prod #' || s, round((random()* 100)::numeric, 2)
FROM
    generate_series(1, 1000) AS g(s);
CREATE TABLE ord_details(
    ord_id INTEGER NOT NULL REFERENCES ord_hdr ON DELETE CASCADE,
    rowno SMALLINT NOT NULL,
    prod_id INTEGER NOT NULL REFERENCES prod,
    qty numeric(8, 2) NOT NULL,
    price numeric(8, 2) NOT NULL,
    amt numeric(12, 2) NOT NULL,
    CONSTRAINT ord_details_pk PRIMARY KEY (ord_id, rowno));

The code for the (PL/PgSQL) function that processes the order acquisition is:

CREATE FUNCTION json_test(val json) RETURNS INTEGER AS $$
BEGIN
    -- hdr_data: scalar fields of the JSON root plus the raw "rows" array
    WITH hdr_data AS(
        SELECT (val ->> 'cust_id')::INTEGER AS cust_id,
            (val ->> 'ord_date')::date AS ord_date,
            (val ->> 'payc_id')::INTEGER AS paym_id,
            val -> 'rows' AS details),
    -- rows_coll: unnest the JSON "rows" array into one JSON object per order row
    rows_coll AS (
        SELECT json_array_elements(details) AS r
        FROM hdr_data),
    -- details: typed columns extracted from each row object
    details AS (
        SELECT (r ->> 'rowno')::SMALLINT rowno,
            (r ->> 'prod_id')::INTEGER prod_id,
            (r ->> 'qty')::DECIMAL qty,
            (r ->> 'price')::DECIMAL price
        FROM rows_coll AS r),
    -- hdr: insert the order header, defaulting the payment method from the customer,
    -- and return the generated ord_id for the detail insert
    hdr AS (
        INSERT INTO ord_hdr(
            cust_id,
            ord_date,
            paym_id )
    SELECT
        cust_id,
        ord_date,
        COALESCE(paym_id, (SELECT paym_id FROM cust WHERE cust_id = hdr_data.cust_id))
    FROM hdr_data RETURNING ord_id)
    -- main statement of the WITH: insert the detail rows as a set
    INSERT INTO ord_details(
        ord_id,
        rowno,
        prod_id,
        qty,
        price,
        amt )
    SELECT
      ord_id,
      rowno,
      prod_id,
      qty,
      COALESCE(price, prod.prod_price),
      COALESCE(price, prod.prod_price) * qty
    FROM
      hdr,
      details
      JOIN prod USING(prod_id);

    -- a separate statement, so it can see the rows inserted above:
    -- total the detail amounts into the order header
    UPDATE ord_hdr
    SET ord_amt = (
      SELECT sum(amt)
      FROM ord_details
      WHERE ord_id = currval('ord_hdr_ord_id_seq'::regclass))
    WHERE ord_id = currval('ord_hdr_ord_id_seq'::regclass);
    
    RETURN currval('ord_hdr_ord_id_seq'::regclass);
END;
$$ LANGUAGE plpgsql;

This code needs some explanation. It takes advantage of several PostgreSQL features, some of which are also available in other databases. First, the parameter is a JSON value: the caller can send a string cast to JSON, and that string will be parsed and passed to the function as a JSON value. An example call may be:


SELECT json_test(
        '{"cust_id": 223, "ord_date":"2014-06-02", "payc_id": 22, "rows": [{
        "rowno": 1, "prod_id": 24, "qty": 2, "price": 5.50},{
        "rowno": 2, "prod_id": 27, "qty": 1, "price": 12.50}]}'::json);

Then we use a CTE (Common Table Expression) to decompose our JSON message into normalized terms that can be used as relations. To generalize the approach, let's consider our input JSON to be a master-detail structure, potentially with many detail sections containing further detail sections: a tree of sub-relations starting from a root node (the approach also works if the root node is itself a collection, but the code is slightly different in that case).
The first thing we want to do is take the root and expose the data items at this level with the expected data types, so they can be used as input values for an insert statement. PostgreSQL offers the ->> operator, which accesses a single member of a JSON value and returns it as text. For example, val ->> 'cust_id' returns the cust_id value in the JSON as a string, which we can then cast to the appropriate data type, as in (val ->> 'cust_id')::integer, to get the final input value.
Each collection embedded in the root node is exposed as a JSON array value. For this we use the -> operator, which returns the JSON value of the referenced member; in our case val -> 'rows' returns a JSON array of the order rows.
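A quick illustration of the two operators on a literal JSON value:

-- ->> returns text (to be cast to the target type), -> returns a json value
SELECT ('{"cust_id": 223, "rows": [1, 2]}'::json ->> 'cust_id')::integer AS cust_id,
       '{"cust_id": 223, "rows": [1, 2]}'::json -> 'rows' AS rows_json;
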
That covers the first CTE. Once we have the JSON for a collection, we need to transform it into a relation of JSON objects in order to use it. This is the task of the second CTE, rows_coll, where the json_array_elements function performs the transformation. Now that we have a relation of JSON objects, we can apply the same logic recursively and flatten these objects into a relation with the ->> operator. This is done in the third CTE, details, where we apply to the order rows the same logic we used for the root. The pattern can be repeated to any depth.
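For example, json_array_elements turns a two-element array into a two-row relation of JSON objects:

-- one output row per array element, each holding a json object ready for ->> extraction
SELECT json_array_elements(
    '[{"rowno": 1, "prod_id": 24}, {"rowno": 2, "prod_id": 27}]'::json) AS r;
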
Now that we have proper master and detail relations, we can use them to INSERT into ord_hdr and ord_details. Here we use another PostgreSQL feature: DML statements can appear as CTEs, and the output of one statement can be pipelined into the statements that follow it. We INSERT into ord_hdr from the hdr_data relation produced by the first CTE; the INSERT business logic can be made as elaborate as needed by joining to other tables. For example, here we fall back to the payment method recorded on the customer row when it is missing from the JSON order received from the client. The RETURNING clause of the INSERT gives us the ord_id assigned by the sequence behind the serial primary key, which we need in order to fill the foreign key in ord_details. The next INSERT statement is the last member of this multi-operation SQL statement and inserts the rows as a set into ord_details, joining to the hdr CTE result, which in this case is a single-row, single-column relation containing the assigned order id. Incidentally, we also join to the prod table to default the price when it is missing from the input JSON.
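Stripped to its core, the pattern is an INSERT ... RETURNING wrapped in a CTE that feeds the next INSERT in the same statement. A minimal sketch on the same tables, with hand-written values in place of the JSON-derived relations:

WITH hdr AS (
    INSERT INTO ord_hdr(cust_id, ord_date, paym_id)
    VALUES (223, DATE '2014-06-02', 22)
    RETURNING ord_id)  -- the generated key flows into the next INSERT
INSERT INTO ord_details(ord_id, rowno, prod_id, qty, price, amt)
SELECT ord_id, 1, 24, 2, 5.50, 11.00
FROM hdr;
-- (the header total, ord_amt, would still be filled by the separate UPDATE discussed below)
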
Now, to complete our order we need one last step: we want to store on the ord_hdr row the order total amount, which is the sum of the amt column over all the ord_details rows belonging to it. We might think of adding an UPDATE operation that computes the sum straight in the current statement, but unfortunately this does not work: if we do, the ord_hdr.ord_amt column remains zero.
The reason is that Postgres processes the set of CTEs and the corresponding final INSERT operation as a single statement: the planner creates a single plan for the whole computation. While this is very good in performance terms, it has a necessary drawback: all the operations in the statement see the state of the database as it was when the statement started, so any change made by DML inside a CTE is not visible to the following CTEs or to the final operation. This means that while we could add a RETURNING clause to the last INSERT, turning it into a CTE that returns the amt of the newly inserted rows, there would be no way to UPDATE the ord_hdr row we inserted before, since that row is not visible until the statement ends.
While we could, in principle, complicate things and compute ord_amt from the JSON details, my objective is to keep things manageable and readable, and to be able to automate the generation of this SQL from a representation of the underlying business rules. So I preferred to let the statement end and add a new "classical" UPDATE statement that performs the required logic. Since this statement is independent, it has full visibility of the newly inserted rows and can perform the computation. There is only one small problem: we need the newly inserted ord_id, which is no longer accessible outside the previous statement. Postgres lets you read the last value assigned by a sequence with currval('sequence_name'::regclass), where sequence_name is the name of the sequence associated with the ord_id serial column. This returns the last value assigned in the current session for the given sequence and is fully safe to use in a concurrent environment. So we can use this expression to access the last assigned ord_id, update our order and complete the transaction logic.
We return this ord_id to the caller to allow the client to fetch the processed order from the database.
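With that id the client can, for example, read the stored order back in a follow-up request (assuming the function returned 1):

SELECT h.ord_date, h.ord_amt, d.rowno, d.prod_id, d.qty, d.price, d.amt
FROM ord_hdr h
JOIN ord_details d USING (ord_id)
WHERE h.ord_id = 1;  -- the value returned by json_test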

Now you may want to know how fast this order processing function is. Performing this kind of evaluation is tricky in general. On my development system, which sports an Intel i7 processor, 12 GB of RAM and a Patriot Inferno 128 GB SSD, I got this result:

First execution (no data in cache, function not compiled):

"Result  (cost=0.00..0.26 rows=1 width=0) (actual time=14.324..14.325 rows=1 loops=1)"
"  Output: json_test('{"cust_id": 223, "ord_date":"2014-06-02", "payc_id": 22, "rows": [{"
""rowno": 1, "prod_id": 24, "qty": 2, "price": 5.50},"
"{"
""rowno": 2, "prod_id": 27, "qty": 1, "price": 12.50}]}'::json)"
"Total runtime: 14.348 ms"

Subsequent executions:

"Result  (cost=0.00..0.26 rows=1 width=0) (actual time=1.225..1.225 rows=1 loops=1)"
"  Output: json_test('{"cust_id": 223, "ord_date":"2014-06-02", "payc_id": 22, "rows": [{"
""rowno": 1, "prod_id": 24, "qty": 2, "price": 5.50},"
"{"
""rowno": 2, "prod_id": 27, "qty": 1, "price": 12.50}]}'::json)"
"Total runtime: 1.236 ms"

My tests show a consistent total runtime of around 1 ms for repeated executions with this schema and content (all data is in memory). This is generally difficult to beat in an ORM setting, where all this logic is scattered across many methods performing many SQL statements to get the same result. Here you pay a single round trip to the database (not accounted for in the timings), while in an ORM setting you would probably need several, depending on the level of optimization you can afford to hack in. A single round trip normally costs much more than 1 ms in a production server setting.
I also wanted to compute the impact of JSON processing on the whole statement execution time. I used this SQL to make this test:


EXPLAIN ANALYZE VERBOSE 
WITH t AS (
    SELECT '{"cust_id": 223, "ord_date":"2014-06-02", "payc_id": 22, "rows": [{
        "rowno": 1, "prod_id": 24, "qty": 2, "price": 5.50},{
        "rowno": 2, "prod_id": 27, "qty": 1, "price": 12.50}]}'::json val),
testata AS (
    SELECT (val->> 'cust_id')::INTEGER AS cust_id,
        (val ->> 'ord_date')::date AS ord_date,
        (val ->> 'payc_id')::INTEGER AS paym_id,
        val -> 'rows' AS details
    FROM t),
rows_coll AS (
    SELECT json_array_elements(details) AS r
    FROM testata),
details AS (
    SELECT (r ->> 'rowno')::SMALLINT rowno,
        (r ->> 'prod_id')::INTEGER prod_id,
        (r ->> 'qty')::DECIMAL qty,
        (r ->> 'price')::DECIMAL price
    FROM rows_coll AS r)
SELECT *
FROM details;



All the CTEs are evaluated, and I measure an average execution time of 0.1 ms, which corresponds to a JSON processing overhead of at most about 10%.
Considering the savings in latency I expect this approach to be significantly faster in terms of response time and more scalable for many real world workloads.
In my opinion this approach is also simpler than an imperative, ORM-based solution for the same functionality. Still, this representation is not easy to manage at larger scale, in terms of code volume and repetition, which is why my objective is to make this code a compilation target for a business-rules compiler I'm working on.
I will talk about this in a future post.

Saturday, May 31, 2014

The benefits of PostgreSQL side JSON rendering

Nowadays JSON has become the standard Web data interchange format. No matter whether your client runs in a browser or as a native mobile app, or you are accessing a web service from a server side component, you are probably going to model and exchange data in the JSON format. JSON parsing and rendering is available in every modern programming language, and JSON has been the format of choice of NoSQL databases. JSON is also supported as a native datatype in several relational databases. In PostgreSQL there is a JSON datatype and a set of operators and functions for constructing, accessing and processing JSON data from within the RDBMS.

What is compelling for my purposes is the ability to return a JSON document to the client directly from a SQL query that accesses relational data. This means that a complex JSON document which would otherwise require multiple round trips to the database can now be built in a single SQL statement. Depending on various factors this can be 10 to 100 times faster than an equivalent ORM-based solution, and in my opinion it is easier to code and maintain (you need to be proficient in SQL to appreciate this).
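
As a minimal illustration, independent of the example schema below, a single statement can return a ready-to-send JSON document:

-- one row in, one JSON document out
SELECT row_to_json(t)
FROM (SELECT 42 AS id, 'A note title' AS note_title) AS t;
-- {"id":42,"note_title":"A note title"}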

I wanted to create a fictitious example and, inspired by this other post by Dan McClain, I tried a similar schema and a radically simplified SQL query to see how it performs.
I did not understand why Dan designed tags to be unique by note. In that case the JSON he wants to produce is somewhat redundant: if tags are unique by note they can be embedded inside the notes, and if they aren't, why do tags have a note_id attribute?
In my design notes have tags but different notes can be tagged with the same tag. So there is an intersection table to manage the many-to-many relationship between notes and tags:


CREATE TABLE notes (
    note_id serial NOT NULL PRIMARY KEY,
    note_title text NOT NULL,
    note_content text NOT NULL );
INSERT INTO notes (
    note_title,
    note_content )
SELECT 'Note #' || s,
    'Lorem ipsum...'
FROM
    generate_series (1, 1000) AS g (s)
;
CREATE TABLE tags (
    tag_id serial NOT NULL PRIMARY KEY,
    tag_name text NOT NULL);

INSERT INTO tags (tag_name)
SELECT 'Tag #' || s
FROM generate_series (1, 100) AS g (s)
;
CREATE TABLE note_tags (
    note_id INTEGER NOT NULL REFERENCES notes,
    tag_id INTEGER NOT NULL REFERENCES tags,
    CONSTRAINT nota_pk PRIMARY KEY (note_id, tag_id));

INSERT INTO note_tags
SELECT
    note_id,
    tag_id
FROM
    notes,
    tags
WHERE random() <= 0.1;

I'm going to fetch a page of 40 notes, and I want to include only the tags that are referenced by the notes in the message. Built in the traditional manner, this JSON describes a dataset that could not be extracted by a single SQL statement: it would normally take multiple queries, and a somewhat convoluted procedure, to materialize it in an application program. With a vanilla ORM it may take up to 141 queries to materialize the data for this JSON. By preloading all the tags in a single query we can cut that down to 42 queries; this works because we only have 100 tags, it may not work with a larger data set, and the logic for filtering out only the referenced tags adds complexity to the solution. We could reduce the count to 3 queries by fetching all the note_tags for the fetched notes in a single query, further complicating the code. Eagerly fetching note_tags together with notes may be simpler, but then we ship redundant data over the wire: the notes data is fetched 10 times instead of once. This may cancel out the benefit of avoiding a nested query (not in this case, since we have few columns to read, but in a more realistic situation with wide notes rows it may actually make things worse).
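Roughly, the three-query variant would issue something like this, with the IN lists filled in by the application and the JSON still assembled in application code:

-- 1: the page of notes
SELECT note_id, note_title, note_content
FROM notes ORDER BY note_id LIMIT 40;

-- 2: the tag links for those notes (ids collected from query 1)
SELECT note_id, tag_id
FROM note_tags
WHERE note_id IN (1, 2, 3 /* ... the 40 note ids */);

-- 3: only the referenced tags (ids collected from query 2)
SELECT tag_id, tag_name
FROM tags
WHERE tag_id IN (5, 17, 23 /* ... the referenced tag ids */);
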
So, we can see that, on the application side, generating the required JSON is not trivial and may involve a lot of optimization decisions and hacks in the code to be efficient.
Now I present a PostgreSQL 9.3 query that returns the required result. It is far simpler, IMHO, than the solution presented by Dan:


WITH n AS (
    SELECT
        note_id,
        note_title,
        note_content,
        (
            SELECT
                array_agg(tag_id)
            FROM
                note_tags
            WHERE
                note_id = notes.note_id) tags
    FROM
        notes
    ORDER BY
        note_id
    LIMIT 40 )
SELECT
    row_to_json (
        ROW (
            (
                SELECT
                    json_agg(t)
                FROM (
                        SELECT
                            DISTINCT tags.tag_id,
                            tags.tag_name
                        FROM
                            note_tags nt
                            JOIN tags USING (tag_id)
                            JOIN n USING (note_id)) AS t ),
            (
                SELECT
                    json_agg (n)
                FROM
                    n)))

On my desktop computer (Intel i7, 12 GB RAM) this query takes 5.1 milliseconds to execute. Interestingly, the cost of the JSON rendering itself is small, around 0.6 ms; this can be derived from the EXPLAIN ANALYZE VERBOSE output of the statement execution plan. To get a realistic execution time you have to add the call latency between a hypothetical application server and the database, which in a well engineered network may be of the same order of magnitude (5 ms).
The most optimized application-level solution will be at least 6 times slower than this, due to latencies, redundant application-level processing and data format translations. But programmers will normally prefer to write more readable and maintainable code, so they may opt for the vanilla ORM solution, which may be 100 times slower. Maybe this goes into production unnoticed, and if the user population is small and tolerant it may actually be acceptable (users start perceiving latency when the response time reaches about 0.7 seconds).
But the SQL solution is quite simple and enjoyable. It is 17 lines of code with no mutable variables, loops, assignments or conditionals: completely declarative code that is quite easy to understand. The CTE pulls the notes, each with its own array of tags (the array_agg call). The main SELECT constructs a JSON object from a SQL ROW made of two columns: the first gathers all the referenced tags by joining with the CTE result and aggregating the tags rows into a JSON array; the second is the aggregation of the CTE result itself into a JSON array. No wrestling with performance problems: let the database optimize the execution for us.
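To make the two aggregations concrete, this is roughly what each building block returns on its own against the sample data:

-- array_agg: one SQL array of tag ids per note (the shape produced inside the CTE)
SELECT note_id,
       (SELECT array_agg(tag_id) FROM note_tags WHERE note_id = notes.note_id) AS tags
FROM notes
ORDER BY note_id
LIMIT 3;

-- json_agg: a set of rows folded into a single JSON array value
SELECT json_agg(t)
FROM (SELECT tag_id, tag_name FROM tags ORDER BY tag_id LIMIT 3) AS t;
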
This solution is fast, scales well and is easy to program.
There is, however, a problem with this solution as of PostgreSQL 9.3: the ROW construct causes the output columns to be named f1 and f2 instead of tags and notes. I have not found a workaround, but 9.4 brings a plethora of new JSON functions, among them new constructors such as json_build_object(VARIADIC "any"), which is exactly what we need in our case:

WITH n AS (
    SELECT
        note_id,
        note_title,
        note_content,
        (
            SELECT
                array_agg(tag_id)
            FROM
                note_tags
            WHERE
                note_id = notes.note_id) tags
    FROM
        notes
    ORDER BY
        note_id
    LIMIT 40 )
SELECT
    json_build_object (
        "tags",
        (
            SELECT
                json_agg(t)
            FROM (
                    SELECT
                        DISTINCT tags.tag_id,
                        tags.tag_name
                    FROM
                        note_tags nt
                        JOIN tags USING (tag_id)
                        JOIN n USING (note_id)) AS t ),
        "notes",
        (
            SELECT
                json_agg (n)
            FROM
                n))

I've not tested this, anyway.
To conclude, JSON powered relational databases show interesting possibilities:
- significant performance improvements from moving the code into the data processing layer;
- code simplification. Just express the result you want and let the database do its magic.

In a future post I will explore the possibility of using JSON as a means to efficiently perform database updates. Stay tuned.


Tuesday, May 13, 2014

Renumbering rows in SQL

Today I struggled with a seemingly simple problem that would be quite easy to solve in a traditional imperative programming language but is not so immediate in SQL, so I want to share my solution.

I have a master table and a master_lines table whose lines should be numbered from 1 to N, N being the count of lines per master.
For some reason that does not matter in this context, a program inserted lines for a given master leaving large gaps between them, and I had to renumber those lines, keeping the same sort order, because I was getting numeric overflow errors on the line_number column, which had unfortunately been declared numeric(3, 0). The program kept increasing line_number, starting from the highest value read from the database and adding nearly random offsets from there (yes, an integer column would probably have avoided the overflow, but I could not change the database).
The user was complaining, and I had to apply a quick fix to let him continue entering new lines for that master row.

Since this was a PostgreSQL database I knew I could renumber the lines easily using the row_number window function. 
Window functions are a nice SQL feature that often lets you solve order-dependent problems in SQL. SQL is a set-oriented language: its purpose is to process relations, which are sets, and sets have no predefined order, so it is generally difficult to express order-dependent logic in SQL. Window functions can compute values over windows of contiguous rows in the result returned by the relational operations applied by a query. They are not relational algebra operators themselves, but they allow you to post-process the result of relational algebra operations.

Among the window functions, row_number returns an integer from 1 to N, where N is the cardinality of the result set (or, to be precise, the size of the window). This is exactly the functionality we need. Since the result of a window function depends on the sort order of the result set, we need to specify in which order we want to count the rows. This is achieved by adding a "window" clause to the statement, which can also be written inline with the function call. This clause has many features, but for our purposes we only need to specify the required ordering. So we can write

row_number() over (order by line_number)

to request a sort by line_number and then return the position of the "current row" in the sorted set, counting from 1. To update the rows for a given master I wish I could write:

update master_lines o
set line_number = row_number() over (order by line_number) 
where master_nbr = '14/08925'
;

but this does not work, since SQL window functions are not supported in UPDATE statements. Semantically it would make sense, but it is not allowed; this is one example of the lack of orthogonality still present in SQL. To solve this I need to perform the operation in two steps: first compute the row numbers, then use that result to update the master_lines table. I could use an inline view for this, but I find it more elegant and readable to use a CTE (Common Table Expression).
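For reference, a rough sketch of the inline-view variant (the CTE form below is the one I find more readable):

update master_lines o
set line_number = t.pos
from (select line_number,
        row_number() over (order by line_number) as pos
      from master_lines
      where master_nbr = '14/08925') as t
where o.master_nbr = '14/08925'
and o.line_number = t.line_number
;
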
A CTE is like an assignment in a functional programming language. If we consider that SQL operands are relations (Chris Date, forgive me for the imprecision, I know I should say that SQL allows duplicates and so on...), a CTE like

with t as (
  query
)

is nothing more than an assignment, to the final variable t, of an immutable value as defined by the result of executing the query. It is a relation in the relational-model sense (an immutable value), so I can legally reference it inside other parts of the same SQL statement. In our case we can write:

with t as (
  -- compute each line's new position in line_number order
  select line_number,
    row_number() over (order by line_number) as pos
  from master_lines
  where master_nbr = '14/08925'
)
update master_lines o
set line_number = t.pos
from t
where o.master_nbr = '14/08925'
and o.line_number = t.line_number
;

This solution can be used whenever you want to renumber a set of rows according to some arbitrary sort order.
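
As a final note, the same pattern generalizes to renumbering every master at once by partitioning the window, so the numbering restarts at 1 for each master (a sketch, not something I needed here):

with t as (
  select master_nbr, line_number,
    row_number() over (partition by master_nbr order by line_number) as pos
  from master_lines
)
update master_lines o
set line_number = t.pos
from t
where o.master_nbr = t.master_nbr
and o.line_number = t.line_number
;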