Why PostgreSQL? Why compress?
TwoSix Labs has a vast store of timeseries IoT data; battery voltages, temperature readings, spectral readings, and many others. Our goal is to compress our multi-dimensional timeseries data (specifically, spectral data) in PostgreSQL while still enabling full access to that data from SQL. In other words, no opaque BLOBs.
You can use this technique to achieve massive compression for any multi-dimensional data, providing savings by as much as 80%, reducing cost and increasing performance due to decreased IO.
As with many projects, we store much of our information in PostgreSQL: users, groups, device registrations, etc. However, this data is minuscule compared the spectral data itself:
- huge number of devices sending their spectral readings every second (1 Hz)
- with a spectral resolution of 512, that means 512 * 16-bit integer = 1024 bytes per second (1KiB per second, per device)
- therefore, each device can upload as much as ~84MB of data per day!
Compression across columns is very common for column stores such as Clickhouse, but until very recently, this kind of functionality was unavailable on PostgreSQL (excluding cstore_fdw as it has too many limitations for most users). We wanted to experiment with JOINs between our spectral readings and devices tables.
It’s worth mentioning that there are other databases (like Clickhouse) that are better suited to this kind of data, but we wanted to see what performance would be like if we kept as much data as possible in PostgreSQL.
Tools of the trade
How It Works
Our strategy combines the per-dimension compression of pgpointcloud with the time-based sharding approach of TimescaleDB for scalability. PostgreSQL is one of the few databases that allows–even encourages–third-party development of extensions, and people have been writing extensions for decades.
pgpointcloud is designed to store point clouds from LIDAR, but, as its README describes, pgpointcloud itself is pretty generic:
Much of the complexity in handling LIDAR comes from the need to deal with multiple variables per point. The variables captured by LIDAR sensors varies by sensor and capture process. Some data sets might contain only X/Y/Z values. Others will contain dozens of variables: X, Y, Z; intensity and return number; red, green, and blue values; return times; and many more. There is no consistency in how variables are stored: intensity might be stored in a 4-byte integer, or in a single byte; X/Y/Z might be doubles, or they might be scaled 4-byte integers.
We can take advantage of the compression pgpointcloud provides by mapping our spectral data onto a 3D plane.
Spectral data itself is 2D: X is the channel/bin and Y is the count at that channel. If you add time, you get 3D data: X becomes the time (milliseconds since Unix epoch), Y becomes the channel/bin, and Z becomes the count.
In pgpointcloud parlance, every (time, bin, count) becomes a PcPoint, and PcPoints for a given time-range and sensor get combined into a PcPatch.
---------------------- --- 1) load the necessary extension ------------------------ CREATE EXTENSION pointcloud; ---------------------- --- 2) create the pointcloud schema: X=uint64_t, Y=uint16_t,Z=uint16_t ------------------------ INSERT INTO pointcloud_formats (pcid, srid, schema) VALUES (3, 4326, '<?xml version="1.0" encoding="UTF-8"?> <pc:PointCloudSchema xmlns:pc="http://pointcloud.org/schemas/PC/1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <pc:dimension> <pc:position>1</pc:position> <pc:size>8</pc:size> <pc:description>X coordinate.</pc:description> <pc:name>X</pc:name> <pc:interpretation>uint64_t</pc:interpretation> <pc:scale>1</pc:scale> </pc:dimension> <pc:dimension> <pc:position>2</pc:position> <pc:size>2</pc:size> <pc:description>Y coordinate.</pc:description> <pc:name>Y</pc:name> <pc:interpretation>uint16_t</pc:interpretation> <pc:scale>1</pc:scale> </pc:dimension> <pc:dimension> <pc:position>3</pc:position> <pc:size>2</pc:size> <pc:description>Z coordinate.</pc:description> <pc:name>Z</pc:name> <pc:interpretation>uint16_t</pc:interpretation> <pc:scale>1</pc:scale> </pc:dimension> <pc:metadata> <Metadata name="compression">laz</Metadata> </pc:metadata> </pc:PointCloudSchema>') ; ----------------- ----- 3) create the table that will actually store the data ----------------- create table Spectrum ( sensorId UUID NOT NULL, starttime TIMESTAMP WITH TIME ZONE, endtime TIMESTAMP WITH TIME ZONE, spectrum pcpatch(3), PRIMARY KEY(sensorId, starttime) ); ----------------------- --------- 4) create a timescale hypertable out of the table, with daily chunking ----------------------- SELECT create_hypertable('Spectrum', 'starttime', chunk_time_interval := interval '1 day');
With that DDL, we can insert data using the following SQL…
with patch as( select PC_MakePatch(3, ?) as pa) INSERT INTO Spectrum(sensorId, startTime, endTime, spectrum) VALUES( -- sensorId ?, -- startime (select to_timestamp(timeseries.PC_PatchMin(pa, 'X') / 1000.0) from patch), -- endtime (select to_timestamp(timeseries.PC_PatchMax(pa, 'X') / 1000.0) from patch), (select pa from patch))
…and the following JDBC code:
/* * NOTE: data must be multiple of 3, since we defined the data as 3-dimensional in the schema */ Array array = ps.getConnection().createArrayOf("float8", data); ps.setArray(1, array); ps.setObject(2, "61bda704-0b0f-4c4f-9a0a-60d3f79f2528"); ps.executeUpdate();
It’s worth pointing out that you don’t need Java/JDBC to do this. As long as your language’s PostgreSQL driver supports arrays, you’re good to go!
We have contributed some improvements back to the pgpointcloud open source community:
- Creating a PcPatch directly from a float array, which is extremely useful if you don’t need the intermediary PcPoints
- Improving parallel usage of pgpointcloud functions, for those of you using recent versions of PostgreSQL
Not only did it result in a space reduction of 80%, but it also reduced the number of rows in our spectral readings table by two orders of magnitude!
- Using time as the X dimension, we get compression of our spectral data both within a single spectral reading, and over time (although not across sensors)
- For our use cases, we don’t mind rows that overlap a time period, but if you do, you can use an exclusion constraint by treating startTime and endTime as a range.
- The downside is that you must aggregate/buffer the data on the client-side before sending it to PostgreSQL (we use Spring Integration for Kafka for that).
- Using TimescaleDB completely orthogonal to using pgpointcloud and so, it’s not necessary, but I encourage it because of insertion slowdown over time.