PostgreSQL: Search dates on large tables with pg_IdFromDate()
Introduction
pg_IdFromDate() is a small set of PostgreSQL stored procedures allowing you to search a row based on a date or a timestamp in a few milliseconds, even on very large tables. It uses dichotomy to progressively eliminate ranges of rows until a small number of rows is found then a regular query allows it to find the ID of the row the closest to a specific date. This process is very different from how the database traditionally works and allows you to find the row corresponding to a date in less than 10 ms for a table with > 100 million lines.
Try it yourself
If you would like to try the examples yourself, please create the following table:
-- Create the following table:
CREATE TABLE test (
id serial unique,
date timestamp,
some_data text
);
-- Create en index on the ID
CREATE INDEX ON test (id);
-- The following query will insert about 8,700 random rows in the table
INSERT INTO test (date, some_data)
SELECT generate_series, md5(random()::text)
FROM generate_series('2015-01-01 00:00:00'::timestamp, '2015-12-31 23:59:59'::timestamp, '1 hour');
Source code
You can find the source code on Github:
Watch Star Fork DownloadHow does it works?
pg_IdFromDate() uses a simple mathematical concept called dichotomy. Simply explained, we separate the table into 2 groups, A and B. We will search the middle between A and B and find if the date is below or above this middle row. If the date is below (it's within group A), we will subdivide group A into 2 sub-groups, A' and A'' and repeat the same operation until the final group has less than 50 rows. When that happens we will run a simple query to find the matching row within that range. pg_IdFromDate() takes 3 input: the table name, the name of the "date" column and the timestamp to search.
Examples
Here are a few examples showing how pg_IdFromDate() works:
-- Search a timestamp
gab@gab # SELECT pg_IdFromDate('test', 'date', '2015-08-09 16:53:00');
pg_idfromdate
---------------
5298
(1 row)
Time: 3.710 ms
-- Search a date
SELECT pg_IdFromDate('test', 'date', '2015-09-03');
pg_idfromdate
---------------
5881
(1 row)
Time: 4.754 ms
-- Search a dynamic timestamp
SELECT pg_IdFromDate('test', 'date', (NOW() - interval '1 week')::timestamp);
pg_idfromdate
---------------
2134
(1 row)
Time: 6.868 ms
-- Search a date with no matching result
SELECT pg_IdFromDate('test', 'date', '2016-08-08 23:32:45');
pg_idfromdate
---------------
{NULL}
(1 row)
Time: 3.710 ms
Benchmark
Here is a benchmark comparing the performance of pg_IdFromDate() versus a regular SQL query:
Method | "...WHERE date = '[DATE]';" | pg_IdFromDate() | Result |
---|---|---|---|
Querying a timestamp (no index) | 25,225.50 ms | 7.07 ms | pg_IdFromDate() is 3,568x faster |
Querying a date (no index) | 41,174.18 ms | 6.56 ms | pg_IdFromDate() is 6,276x faster |
Querying a timestamp (+ index) | 12.44 ms | 7.11 ms | pg_IdFromDate() is 2x faster |
Querying a date (+ index) | 43,361.46 ms | 7.27 ms | pg_IdFromDate() is 5,964x faster |
Limitations
pg_IdFromDate() is a project that I hope will evolve soon. Here are the current limitations:
- The ID from the table searched needs to be called ID. However, you could change that in the source code if you needed to and adapt it to any ID name.
- Dates need to be stored as dates/timestamp. If you are storing dates as a Unix epoch, pg_IdFromDate() won't work. But, here again, you can easily adapt this behavior in the source code if you need to.
- Dates need to be stored in the same chronological order as IDs. It means that the first ID needs to have the oldest date, the second ID need to have the second oldest date... If you table does not respect that, pg_IdFromDate() is uncertain.
Conclusion
This tool was developed to easily find rows by dates in tables with hundred of million of rows. It's methodology allows to search rows efficiently even on very large tables. You can view and download the project on Github.