Friday, October 7, 2016

A high performance Hyperloglog server

Hyperloglog is an algorithm to get the approximate distinct element count of a multiset. Calculating the exact value for distinct count requires lot of memory and may not be practical for all situations. Generally, the memory required is proportional to the cardinality of the multiset.

The basis of the algorithm is:
Cardinality of a multiset of uniformly distributed random numbers can be calculated by finding the maximum number of leading zeroes in the binary representations of those numbers. If the maximum number of leading zeroes  observed is n, then  is the cardinality of the numbers. It comes from the fact that, probability of encountering a 0 in the first bit is 1/2, probability of encountering 2 consecutive zero bits is 1/ 2 * 1/2 = 1/4, probability of 3 consecutive zero bit is 1/8,.... and so probability of encountering n leading zero is . That means if we encountered n leading zeros in the randomly distributed numbers, then it is highly probable that there are around  distinct numbers in the multiset.

Generally, we apply a hash function on the original elements of the multiset and calculate the maximum number of leading zeroes from the hash values. For example, to find the distinct count of the documents in a large collection of documents, we may apply a hash function on the title (or on the content)  of the documents and compute the leading zeroes of the hash values returned by the hash function.

Simply using the above approach sometimes may give very incorrect results when the multiset has few distinct elements. So, the algorithm uses different approaches when the number of distinct elements are less. Secondly, it splits the input multiset into multiple subsets and calculate maximum leading zeroes in each subset and applies a harmonic mean on the value calculated for each subset to obtain the final result.


I have implemented a Hyperloglog server in Go. It exposes its functionalities over HTTP and Thrift protocol and so we may access the functionalities from almost all the programming languages.
For detailed installation and API information, please check this GITHUB LINK.  It provides persistence (optional) and hence the data can be maintained across restarts. Also, it works blazingly fast and can take advantages of multiple processors of the running machine.

Installing the server:
$ go get github.com/nipuntalukdar/hllserver/hllserverd

Running the sever:
$hllserverd -db /tmp -persist

Detailed usage of hllserverd:

$ hllserverd --help
Usage of hllserverd:
  -db string
        directory for hyperlog db (default "/tmp")
  -dbfile string
        hyperlogdb file (default "hyperlogs.db")
  -http string
        give http lister_address (default ":55123")
  -logbackup int
        maximum backup for logs (default 10)
  -logfile string
        log file path (default "/tmp/hllogs.log")
  -logfilesize int
        log rollover size (default 2048000)
  -loglevel string
        logging level (default "INFO")
  -persist
        should persist the hyperlogs in db?
  -thrift string
        thrift rpc address (default "127.0.0.1:55124")


Below is a simple python program (hllclient.py) demonstrating the use of various hyperloglog server APIs over HTTP:

import httplib
import json
import urllib
import sys
from base64 import b64encode
from uuid import uuid1

hlld_http_address = '127.0.0.1:55123'
if len(sys.argv) > 1:
    hlld_http_address = sys.argv[1]


def check_response(resp):
    if resp.status != 200:
        print 'Failed'
        sys.exit(1)
    else:
        resp_data = resp.read()
        resp_json = json.loads(resp_data)
        if resp_json['status'] != 'success':
            print 'Failed to add logkey'
            sys.exit(1)
        return resp_json
    return None

print 'Adding logkey {} with expiry {}'.format('visitors', 86400)
client = httplib.HTTPConnection(hlld_http_address)
client.request('GET', '/addlogkey?logkey=visitors&expiry=86400')
resp = client.getresponse()
check_response(resp)

print 'Successfully added hyperlog key {}'.format('visitors')
client.close()

print 'Updating logkey visitors'.format('visitors')
i = 0
while i < 100:
    j = 0
    params = {'logkey': 'visitors', 'values': []}
    while j < 100:
        params['values'].append(uuid1().hex)
        j += 1
    client = httplib.HTTPConnection(hlld_http_address)
    client.request("POST", "/updatelog", json.dumps(params))
    resp = client.getresponse()
    check_response(resp)
    i += 1
print 'Updated logkey {} successfully'.format('visitors')
client.close()


print 'Getting cardinality of the logkey {}'.format('visitors')
client = httplib.HTTPConnection(hlld_http_address)
url = '/cardinality?logkey={}'.format('visitors')
client.request('GET', url)
resp = client.getresponse()
resp = check_response(resp)
print 'Cardinality for logkey {}: {}'.format('visitors', resp['cardinality'])
client.close()

print 'Updating expiry of logkey {}'.format('visitors')
client = httplib.HTTPConnection(hlld_http_address)
url = '/updexpiry?logkey={}&expiry={}'.format('visitors', 10)
client.request('GET', url)
resp = client.getresponse()
resp = check_response(resp)
client.close()
print 'Successfully updated expiry of logkey {}'.format('visitors')

print 'Deleting logkey {}'.format('visitors')
client = httplib.HTTPConnection(hlld_http_address)
url = '/dellogkey?logkey={}'.format('visitors')
client.request('GET', url)
resp = client.getresponse()
resp = check_response(resp)
client.close()
print 'Successfully deleted logkey {}'.format('visitors')


Output of a sample run of the above script:

$ python hllclient.py 127.0.0.1:55123
Adding logkey visitors with expiry 86400
Successfully added hyperlog key visitors
Updating logkey visitors
Updated logkey visitors successfully
Getting cardinality of the logkey visitors
Cardinality for logkey visitors: 9540
Updating expiry of logkey visitors
Successfully updated expiry of logkey visitors
Deleting logkey visitors

Successfully deleted logkey visitors