VolumeServer: How it works

This document provides a high-level overview of how the VolumeServer works.

Overview

  • Data is stored using a block layout to reduce the number of disk seeks/reads each query requires.
  • Data is downsampled by 1/2, 1/4, 1/8, ... depending on the size of the input.
  • To keep the server response time/size small, each query is satisfied using the appropriate downsampling level.
  • The server response is encoded using the BinaryCIF format.
  • The contour level is preserved using relative instead of absolute values.

Data Layout

To enable efficient access to the 3D data, the density values are stored in a "blocked" format. This means that the data is split into NxNxN blocks (by default N=96, which corresponds to 96^3 * 4 bytes = 3.375MB of disk read per block access and provides a good size/performance ratio). This layout makes it possible to access the data from a hard drive using a bounded number of disk seeks/reads, which greatly reduces server latency.
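To make the arithmetic concrete, here is a minimal sketch (not the server's actual code; the row-major block ordering and function names are assumptions) of mapping a voxel coordinate to the block that contains it and to that block's byte offset in a file:

```python
BLOCK_SIZE = 96   # default N
VALUE_SIZE = 4    # bytes per 32-bit float value

def block_offset(x, y, z, blocks_per_axis):
    """Return (block_coords, byte_offset_of_block) for voxel (x, y, z).

    Assumes blocks are laid out contiguously in row-major order; the real
    server's on-disk layout may differ in details.
    """
    bx, by, bz = x // BLOCK_SIZE, y // BLOCK_SIZE, z // BLOCK_SIZE
    nx, ny, _ = blocks_per_axis
    block_index = bx + by * nx + bz * nx * ny
    block_bytes = BLOCK_SIZE ** 3 * VALUE_SIZE  # 96^3 * 4 = 3,538,944 bytes
    return (bx, by, bz), block_index * block_bytes
```

Reading any one block therefore costs a single seek plus a single ~3.4MB read, which is what bounds the number of disk operations per query.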

Downsampling

  • The input is density data with [H,K,L] number of samples along each axis (i.e. the extent field in the CCP4 header).
  • To downsample, the kernel C = [1,4,6,4,1] (customizable at the source-code level) is applied along each axis in turn; this works because the corresponding 3D kernel is "separable":

    downsampled[i] = C[0] * source[2 * i - 2] + ... + C[4] * source[2 * i + 2]
    

    The downsampling is applied in 3 passes, one per axis:

    [H,K,L] => [H/2, K, L] => [H/2, K/2, L] => [H/2, K/2, L/2]
    

    (if a dimension D is odd, the downsampled size is (D+1)/2 instead of D/2).

  • Apply the downsampling step iteratively until the number of samples along the largest dimension is smaller than the block size (stopping early once the smallest dimension no longer has more than 2 samples).
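The 1D pass above can be sketched as follows. This is an illustration, not the server's implementation: the zero-padding at the boundaries and the normalization by 16 (the sum of the kernel weights, which keeps the value scale unchanged) are assumptions about details the text does not spell out.

```python
def downsample_axis(values):
    """One 1D pass of the [1, 4, 6, 4, 1] kernel, halving the sample count.

    downsampled[i] = (C[0]*src[2i-2] + ... + C[4]*src[2i+2]) / 16
    Out-of-range source samples are treated as 0 (an assumption; the real
    server may clamp to the boundary instead).
    """
    C = [1, 4, 6, 4, 1]
    n = (len(values) + 1) // 2          # (D + 1) // 2 when D is odd
    out = []
    for i in range(n):
        s = 0.0
        for k in range(5):
            j = 2 * i + k - 2
            if 0 <= j < len(values):
                s += C[k] * values[j]
        out.append(s / 16.0)            # 16 = sum of the kernel weights
    return out
```

Because the kernel is separable, a full downsampling step is just this pass applied along H, then K, then L, matching the [H,K,L] => [H/2, K/2, L/2] sequence above.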

Satisfying the query

When the server receives a query for a 3D region, it chooses the appropriate downsampling level based on the requested detail, so that the number of voxels in the response stays small. This enables sub-second response times even for the largest entries.

Encoding the response

The BinaryCIF format is used to encode the response. Floating-point data are quantized into 1-byte values (256 levels) before being sent back to the client. This quantization is performed by splitting the data's value range into 256 uniform intervals.
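A minimal sketch of this uniform quantization (the exact encoding parameters BinaryCIF stores, and the function names here, are assumptions for illustration):

```python
def quantize(values):
    """Map floats onto 256 uniform levels spanning [min, max].

    Returns (codes, lo, step); the client reconstructs v ~= lo + code * step.
    """
    lo, hi = min(values), max(values)
    step = (hi - lo) / 255 or 1.0       # avoid a zero step for flat data
    codes = [round((v - lo) / step) for v in values]
    return codes, lo, step

def dequantize(codes, lo, step):
    """Approximate reconstruction on the client side."""
    return [lo + c * step for c in codes]
```

The maximum reconstruction error is half a step, i.e. (max - min) / 510, which is why the 4x size reduction over 32-bit floats is usually an acceptable trade-off for visualization.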

Preserving the contour level

Downsampling the data changes the absolute contour levels. To mitigate this effect, relative values are always used when displaying the data.

  • Imagine the input data points are A = [-0.3, 2, 0.1, 6, 3, -0.4]:
  • Downsampling using every other value results in B = [-0.3, 0.1, 3].
  • The "range" of the data went from (-0.4, 6) to (-0.3,3).
  • Attempting to use the same absolute contour level on both "data sets" will likely yield very different results.
  • The effect is similar if, instead of skipping values, they are averaged (or weighted-averaged in the case of the [1 4 6 4 1] kernel), only not as severe.
  • As a result, the "absolute range" of the data changes and some outlier values are lost, but the mean and relative proportions (i.e. the deviation X from the mean in Y = mean + sigma * X) are preserved.
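The bullets above can be sketched as a small helper that converts a relative (sigma-based) contour level into an absolute iso-value using Y = mean + sigma * X; the function name is illustrative, not part of the server's API:

```python
from statistics import mean, pstdev

def relative_to_absolute(values, sigma_multiple):
    """Absolute iso-value for a contour level given in standard deviations
    above the mean: Y = mean + sigma * X."""
    return mean(values) + pstdev(values) * sigma_multiple
```

Because the mean and the relative proportions survive downsampling, the same sigma_multiple picks out comparable surfaces in the full-resolution data and in any downsampled level, even though the absolute values it maps to differ.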

Compression Analysis

  • Downsampling: the i-th level (starting from zero) reduces the size by an approximate factor of 1/(2^i)^3 (i.e. the cube of the sampling frequency).
  • BinaryCIF: CCP4 mode 2 (32-bit floats) is reduced by a factor of 4, CCP4 mode 1 (16-bit integers) by a factor of 2, and CCP4 mode 0 (single bytes) is not reduced. This is achieved by the single-byte quantization described above, which spends its 256 levels on the actual data range and is therefore smarter than CCP4 mode 0.
  • Gzip, from observation:
  • Gzipping BinaryCIF reduces the size by a factor of ~2 to ~7 (2 for "dense" data such as X-ray density, 7 for sparse data such as an envelope of a virus).
  • Gzipping CCP4 reduces the size by 10-25% (whether mode 2 or mode 0).
  • Applying the downsampling kernel helps with the compression ratios because it smooths out the values.

Toy example:

Start with 3.5GB compressed density data in the CCP4 mode 2 format (32-bit float for each value)
    => ~4GB uncompressed CCP4
    => Downsample by 1/4 => 4GB * (1/4)^3 = 62MB
    => Convert to BinaryCIF => 62MB / 4 = ~16MB
    => Gzip: 2 - 8 MB depending on the "density" of the data 
        (e.g. a viral shell data will be smaller because it is "empty" inside)
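The toy-example arithmetic can be reproduced with a small estimator (the gzip factor is an observed range of ~2-7, so the default here is only an assumption, and the function itself is illustrative):

```python
def estimated_sizes(uncompressed_bytes, downsample_level, gzip_factor=4):
    """Estimate response sizes after each stage of the pipeline above.

    downsample_level is i in the 1/(2^i)^3 reduction; the factor of 4 for
    BinaryCIF assumes CCP4 mode 2 (32-bit floats quantized to 1 byte).
    """
    after_downsampling = uncompressed_bytes / (2 ** downsample_level) ** 3
    after_binarycif = after_downsampling / 4
    after_gzip = after_binarycif / gzip_factor
    return after_downsampling, after_binarycif, after_gzip
```

For the 4GB example at level 2 (downsampling by 1/4 along each axis), this yields ~62.5MB after downsampling, ~15.6MB as BinaryCIF, and a few MB after gzip, matching the figures above.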