checksum_impl.h source code [PostgreSQL/src/include/storage/checksum_impl.h]

1	/-------------------------------------------------------------------------*
2	*
3	* checksum_impl.h
4	* Checksum implementation for data pages.
5	*
6	* This file exists for the benefit of external programs that may wish to
7	* check Postgres page checksums. They can #include this to get the code
8	* referenced by storage/checksum.h. (Note: you may need to redefine
9	* Assert() as empty to compile this successfully externally.)
10	*
11	* Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
12	* Portions Copyright (c) 1994, Regents of the University of California
13	*
14	* src/include/storage/checksum_impl.h
15	*
16	*-------------------------------------------------------------------------
17	*/
18
19	/*
20	* The algorithm used to checksum pages is chosen for very fast calculation.
21	* Workloads where the database working set fits into OS file cache but not
22	* into shared buffers can read in pages at a very fast pace and the checksum
23	* algorithm itself can become the largest bottleneck.
24	*
25	* The checksum algorithm itself is based on the FNV-1a hash (FNV is shorthand
26	* for Fowler/Noll/Vo). The primitive of a plain FNV-1a hash folds in data 1
27	* byte at a time according to the formula:
28	*
29	* hash = (hash ^ value) * FNV_PRIME
30	*
31	* FNV-1a algorithm is described at http://www.isthe.com/chongo/tech/comp/fnv/
32	*
33	* PostgreSQL doesn't use FNV-1a hash directly because it has bad mixing of
34	* high bits - high order bits in input data only affect high order bits in
35	* output data. To resolve this we xor in the value prior to multiplication
36	* shifted right by 17 bits. The number 17 was chosen because it doesn't
37	* have common denominator with set bit positions in FNV_PRIME and empirically
38	* provides the fastest mixing for high order bits of final iterations quickly
39	* avalanche into lower positions. For performance reasons we choose to combine
40	* 4 bytes at a time. The actual hash formula used as the basis is:
41	*
42	* hash = (hash ^ value) * FNV_PRIME ^ ((hash ^ value) >> 17)
43	*
44	* The main bottleneck in this calculation is the multiplication latency. To
45	* hide the latency and to make use of SIMD parallelism multiple hash values
46	* are calculated in parallel. The page is treated as a 32 column two
47	* dimensional array of 32 bit values. Each column is aggregated separately
48	* into a partial checksum. Each partial checksum uses a different initial
49	* value (offset basis in FNV terminology). The initial values actually used
50	* were chosen randomly, as the values themselves don't matter as much as that
51	* they are different and don't match anything in real data. After initializing
52	* partial checksums each value in the column is aggregated according to the
53	* above formula. Finally two more iterations of the formula are performed with
54	* value 0 to mix the bits of the last value added.
55	*
56	* The partial checksums are then folded together using xor to form a single
57	* 32-bit checksum. The caller can safely reduce the value to 16 bits
58	* using modulo 2^16-1. That will cause a very slight bias towards lower
59	* values but this is not significant for the performance of the
60	* checksum.
61	*
62	* The algorithm choice was based on what instructions are available in SIMD
63	* instruction sets. This meant that a fast and good algorithm needed to use
64	* multiplication as the main mixing operator. The simplest multiplication
65	* based checksum primitive is the one used by FNV. The prime used is chosen
66	* for good dispersion of values. It has no known simple patterns that result
67	* in collisions. Test of 5-bit differentials of the primitive over 64bit keys
68	* reveals no differentials with 3 or more values out of 100000 random keys
69	* colliding. Avalanche test shows that only high order bits of the last word
70	* have a bias. Tests of 1-4 uncorrelated bit errors, stray 0 and 0xFF bytes,
71	* overwriting page from random position to end with 0 bytes, and overwriting
72	* random segments of page with 0x00, 0xFF and random data all show optimal
73	* 2e-16 false positive rate within margin of error.
74	*
75	* Vectorization of the algorithm requires 32bit x 32bit -> 32bit integer
76	* multiplication instruction. As of 2013 the corresponding instruction is
77	* available on x86 SSE4.1 extensions (pmulld) and ARM NEON (vmul.i32).
78	* Vectorization requires a compiler to do the vectorization for us. For recent
79	* GCC versions the flags -msse4.1 -funroll-loops -ftree-vectorize are enough
80	* to achieve vectorization.
81	*
82	* The optimal amount of parallelism to use depends on CPU specific instruction
83	* latency, SIMD instruction width, throughput and the amount of registers
84	* available to hold intermediate state. Generally, more parallelism is better
85	* up to the point that state doesn't fit in registers and extra load-store
86	* instructions are needed to swap values in/out. The number chosen is a fixed
87	* part of the algorithm because changing the parallelism changes the checksum
88	* result.
89	*
90	* The parallelism number 32 was chosen based on the fact that it is the
91	* largest state that fits into architecturally visible x86 SSE registers while
92	* leaving some free registers for intermediate values. For future processors
93	* with 256bit vector registers this will leave some performance on the table.
94	* When vectorization is not available it might be beneficial to restructure
95	* the computation to calculate a subset of the columns at a time and perform
96	* multiple passes to avoid register spilling. This optimization opportunity
97	* is not used. Current coding also assumes that the compiler has the ability
98	* to unroll the inner loop to avoid loop overhead and minimize register
99	* spilling. For less sophisticated compilers it might be beneficial to
100	* manually unroll the inner loop.
101	*/
102
103	#include "storage/bufpage.h"
104
105	/ number of checksums to calculate in parallel /
106	#define N_SUMS 32
107	/ prime multiplier of FNV-1a hash /
108	#define FNV_PRIME 16777619
109
110	/ Use a union so that this code is valid under strict aliasing /
111	typedef union
112	{
113	PageHeaderData phdr;
114	uint32 data[BLCKSZ / (sizeof(uint32) * N_SUMS)][N_SUMS];
115	} PGChecksummablePage;
116
117	/*
118	* Base offsets to initialize each of the parallel FNV hashes into a
119	* different initial state.
120	*/
121	static const uint32 checksumBaseOffsets[N_SUMS] = {
122	`0x5B1F36E9`, `0xB8525960`, `0x02AB50AA`, `0x1DE66D2A`,
123	`0x79FF467A`, `0x9BB9F8A3`, `0x217E7CD2`, `0x83E13D2C`,
124	`0xF8D4474F`, `0xE39EB970`, `0x42C6AE16`, `0x993216FA`,
125	`0x7B093B5D`, `0x98DAFF3C`, `0xF718902A`, `0x0B1C9CDB`,
126	`0xE58F764B`, `0x187636BC`, `0x5D7B3BB1`, `0xE73DE7DE`,
127	`0x92BEC979`, `0xCCA6C0B2`, `0x304A0979`, `0x85AA43D4`,
128	`0x783125BB`, `0x6CA8EAA2`, `0xE407EAC6`, `0x4B5CFC3E`,
129	`0x9FBF8C76`, `0x15CA20BE`, `0xF2CA9FD3`, `0x959BD756`
130	};
131
132	/*
133	* Calculate one round of the checksum.
134	*/
135	#define CHECKSUM_COMP(checksum, value) \
136	do { \
137	uint32 __tmp = (checksum) ^ (value); \
138	(checksum) = __tmp * FNV_PRIME ^ (__tmp >> 17); \
139	} while (0)
140
141	/*
142	* Block checksum algorithm. The page must be adequately aligned
143	* (at least on 4-byte boundary).
144	*/
145	static uint32
146	pg_checksum_block(const PGChecksummablePage *page)
147	{
148	uint32 sums[N_SUMS];
149	uint32 result = `0`;
150	uint32 i,
151	j;
152
153	/ ensure that the size is compatible with the algorithm /
154	Assert(sizeof(PGChecksummablePage) == BLCKSZ);
155
156	/ initialize partial checksums to their corresponding offsets /
157	memcpy(sums, checksumBaseOffsets, sizeof(checksumBaseOffsets));
158
159	/ main checksum calculation /
160	for (i = `0`; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++)
161	for (j = `0`; j < N_SUMS; j++)
162	CHECKSUM_COMP(sums[j], page->data[i][j]);
163
164	/ finally add in two rounds of zeroes for additional mixing /
165	for (i = `0`; i < `2`; i++)
166	for (j = `0`; j < N_SUMS; j++)
167	CHECKSUM_COMP(sums[j], `0`);
168
169	/ xor fold partial checksums together /
170	for (i = `0`; i < N_SUMS; i++)
171	result ^= sums[i];
172
173	return result;
174	}
175
176	/*
177	* Compute the checksum for a Postgres page.
178	*
179	* The page must be adequately aligned (at least on a 4-byte boundary).
180	* Beware also that the checksum field of the page is transiently zeroed.
181	*
182	* The checksum includes the block number (to detect the case where a page is
183	* somehow moved to a different location), the page header (excluding the
184	* checksum itself), and the page data.
185	*/
186	uint16
187	pg_checksum_page(char *page, BlockNumber blkno)
188	{
189	PGChecksummablePage cpage = (PGChecksummablePage ) page;
190	uint16 save_checksum;
191	uint32 checksum;
192
193	/ We only calculate the checksum for properly-initialized pages /
194	Assert(!PageIsNew(&cpage->phdr));
195
196	/*
197	* Save pd_checksum and temporarily set it to zero, so that the checksum
198	* calculation isn't affected by the old checksum stored on the page.
199	* Restore it after, because actually updating the checksum is NOT part of
200	* the API of this function.
201	*/
202	save_checksum = cpage->phdr.pd_checksum;
203	cpage->phdr.pd_checksum = `0`;
204	checksum = pg_checksum_block(cpage);
205	cpage->phdr.pd_checksum = save_checksum;
206
207	/ Mix in the block number to detect transposed pages /
208	checksum ^= blkno;
209
210	/*
211	* Reduce to a uint16 (to fit in the pd_checksum field) with an offset of
212	* one. That avoids checksums of zero, which seems like a good idea.
213	*/
214	return (checksum % `65535`) + `1`;
215	}
216

Browse the source code of PostgreSQL/src/include/storage/checksum_impl.h