Redis PFCOUNT 命令返回给定 HyperLogLog 的基数估算值。
当 PFCOUNT 命令作用于单个键时, 返回储存在给定键的 HyperLogLog 的近似基数, 如果键不存在, 那么返回 0
。
当 PFCOUNT 命令作用于多个键时, 返回所有给定 HyperLogLog 的并集的近似基数, 这个近似基数是通过将所有给定 HyperLogLog 合并至一个临时 HyperLogLog 来计算得出的。
通过 HyperLogLog 数据结构, 用户可以使用少量固定大小的内存, 来储存集合中的唯一元素 (每个 HyperLogLog 只需使用 12k 字节内存,以及几个字节的内存来储存键本身)。
命令返回的可见集合(observed set)基数并不是精确值, 而是一个带有 0.81% 标准错误(standard error)的近似值。
举个例子, 为了记录一天会执行多少次各不相同的搜索查询, 一个程序可以在每次执行搜索查询时调用一次 PFADD, 并通过调用 PFCOUNT
命令来获取这个记录的近似结果。
*语法
redis Pfcount
命令基本语法如下:
redis 127.0.0.1:6379> PFCOUNT key [key ...]
*返回值
整数, specifically:
- 给定 HyperLogLog 包含的唯一元素的近似数量。
*例子
(integer) 1redis> PFADD hll zap zap zap
(integer) 0redis> PFADD hll foo bar
(integer) 0redis> PFCOUNT hll
(integer) 3redis> PFADD some-other-hll 1 2 3
(integer) 1redis> PFCOUNT hll some-other-hll
(integer) 6
*Performances
When PFCOUNT is called with a single key, performances are excellent even if in theory constant times to process a dense HyperLogLog are high. This is possible because the PFCOUNT uses caching in order to remember the cardinality previously computed, that rarely changes because most PFADD operations will not update any register. Hundreds of operations per second are possible.
When PFCOUNT is called with multiple keys, an on-the-fly merge of the HyperLogLogs is performed, which is slow, moreover the cardinality of the union can't be cached, so when used with multiple keys PFCOUNT may take a time in the order of magnitude of the millisecond, and should be not abused.
The user should take in mind that single-key and multiple-keys executions of this command are semantically different and have different performances.
*HyperLogLog representation
Redis HyperLogLogs are represented using a double representation: the sparse representation suitable for HLLs counting a small number of elements (resulting in a small number of registers set to non-zero value), and a dense representation suitable for higher cardinalities. Redis automatically switches from the sparse to the dense representation when needed.
The sparse representation uses a run-length encoding optimized to store efficiently a big number of registers set to zero. The dense representation is a Redis string of 12288 bytes in order to store 16384 6-bit counters. The need for the double representation comes from the fact that using 12k (which is the dense representation memory requirement) to encode just a few registers for smaller cardinalities is extremely suboptimal.
Both representations are prefixed with a 16 bytes header, that includes a magic, an encoding / version field, and the cached cardinality estimation computed, stored in little endian format (the most significant bit is 1 if the estimation is invalid since the HyperLogLog was updated since the cardinality was computed).
The HyperLogLog, being a Redis string, can be retrieved with GET and restored with SET. Calling PFADD, PFCOUNT or PFMERGE commands with a corrupted HyperLogLog is never a problem, it may return random values but does not affect the stability of the server. Most of the times when corrupting a sparse representation, the server recognizes the corruption and returns an error.
The representation is neutral from the point of view of the processor word size and endianness, so the same representation is used by 32 bit and 64 bit processor, big endian or little endian.
More details about the Redis HyperLogLog implementation can be found in this blog post. The source code of the implementation in the hyperloglog.c
file is also easy to read and understand, and includes a full specification for the exact encoding used for the sparse and dense representations.