HyperLogLog is an algorithm for estimating the number of distinct elements (the cardinality) of a multiset. Computing the exact distinct count requires a lot of memory and may not be practical in all situations: in general, the memory required is proportional to the cardinality of the multiset.
The basis of the algorithm is:
The cardinality of a multiset of uniformly distributed random numbers can be estimated from the maximum number of leading zeroes in the binary representations of those numbers. If the maximum number of leading zeroes observed is n, then 2^n is an estimate of the cardinality. This comes from the fact that the probability of encountering a 0 in the first bit is 1/2, the probability of encountering 2 consecutive zero bits is 1/2 * 1/2 = 1/4, the probability of 3 consecutive zero bits is 1/8, and so on; the probability of encountering n leading zeroes is 1/2^n. That means if we encounter n leading zeroes among the uniformly distributed numbers, it is highly probable that there are around 2^n distinct numbers in the multiset.
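To make the idea concrete, here is a minimal sketch of this naive estimator in Python. It is only an illustration of the leading-zeroes argument, not the real algorithm; the `hash32` helper (a truncated md5) is an assumption made for the example:

```python
import hashlib

def hash32(item):
    # 32-bit hash of an item (md5 truncated; any well-mixed hash would do)
    return int(hashlib.md5(item.encode()).hexdigest(), 16) & 0xFFFFFFFF

def leading_zeros(x, bits=32):
    # number of leading zero bits of x in a bits-wide representation
    n = 0
    mask = 1 << (bits - 1)
    while mask and not (x & mask):
        n += 1
        mask >>= 1
    return n

def naive_estimate(items):
    # crude estimate: 2 ** (maximum leading-zero count observed)
    return 2 ** max(leading_zeros(hash32(it)) for it in items)

# Typically lands in the right order of magnitude, but with huge variance
print(naive_estimate(str(i) for i in range(1000)))
```

Note the memory cost: whatever the input size, we only track a single small integer (the maximum leading-zero count seen so far).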
In practice, we apply a hash function to the original elements of the multiset and compute the maximum number of leading zeroes over the hash values. For example, to estimate the distinct count of documents in a large collection, we may apply a hash function to the title (or the content) of each document and count the leading zeroes of the hash values returned by the hash function.
Using only the above approach can give very inaccurate results, especially when the multiset has few distinct elements, so the algorithm switches to a different estimator when the number of distinct elements is small. It also splits the input multiset into multiple subsets, computes the maximum number of leading zeroes in each subset, and combines the per-subset values with a harmonic mean to obtain the final result, which greatly reduces the variance of the estimate.
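A toy version of that bucketing-plus-harmonic-mean scheme can be sketched as follows. This is a simplified illustration, not the server's implementation: the first b bits of the hash pick a bucket, each bucket remembers the maximum rank (leading zeroes + 1) of the remaining bits, and the buckets are combined with a harmonic mean. The small-range correction of the real algorithm is omitted, and the md5-based `hash32` is an assumption for the example:

```python
import hashlib

def hash32(item):
    # 32-bit hash (md5 truncated; any well-mixed hash would do)
    return int(hashlib.md5(item.encode()).hexdigest(), 16) & 0xFFFFFFFF

def rank(x, bits):
    # position of the leftmost 1-bit in a bits-wide value (leading zeroes + 1)
    r = 1
    mask = 1 << (bits - 1)
    while mask and not (x & mask):
        r += 1
        mask >>= 1
    return r

def hll_estimate(items, b=10):
    m = 1 << b                       # number of buckets
    buckets = [0] * m
    for it in items:
        h = hash32(it)
        idx = h >> (32 - b)          # first b bits choose the bucket
        rest = h & ((1 << (32 - b)) - 1)
        buckets[idx] = max(buckets[idx], rank(rest, 32 - b))
    # harmonic mean of the per-bucket estimates, with the usual
    # alpha_m bias-correction constant (small-range correction omitted)
    z = 1.0 / sum(2.0 ** -r for r in buckets)
    alpha = 0.7213 / (1 + 1.079 / m)
    return alpha * m * m * z

print(round(hll_estimate(str(i) for i in range(10000))))
```

With m = 1024 buckets the relative standard error is roughly 1.04/sqrt(m), about 3%, while the memory used is just one small counter per bucket.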
I have implemented a HyperLogLog server in Go. It exposes its functionality over HTTP and the Thrift protocol, so it can be accessed from almost any programming language.
For detailed installation and API information, please check the GitHub repository (github.com/nipuntalukdar/hllserver). The server provides optional persistence, so the data can be maintained across restarts. It is also fast and can take advantage of the multiple processors of the machine it runs on.
Installing the server:
$ go get github.com/nipuntalukdar/hllserver/hllserverd
Running the server:
$ hllserverd -db /tmp -persist
Detailed usage of hllserverd:
$ hllserverd --help
Usage of hllserverd:
  -db string
        directory for hyperlog db (default "/tmp")
  -dbfile string
        hyperlogdb file (default "hyperlogs.db")
  -http string
        give http lister_address (default ":55123")
  -logbackup int
        maximum backup for logs (default 10)
  -logfile string
        log file path (default "/tmp/hllogs.log")
  -logfilesize int
        log rollover size (default 2048000)
  -loglevel string
        logging level (default "INFO")
  -persist
        should persist the hyperlogs in db?
  -thrift string
        thrift rpc address (default "127.0.0.1:55124")
Below is a simple python program (hllclient.py) demonstrating the use of various hyperloglog server APIs over HTTP:
import httplib
import json
import sys
from uuid import uuid1

# Address of the running hllserverd HTTP endpoint
hlld_http_address = '127.0.0.1:55123'
if len(sys.argv) > 1:
    hlld_http_address = sys.argv[1]

def check_response(resp):
    # Exit on HTTP or application-level failure, return the parsed JSON otherwise
    if resp.status != 200:
        print 'Failed'
        sys.exit(1)
    resp_json = json.loads(resp.read())
    if resp_json['status'] != 'success':
        print 'Request failed'
        sys.exit(1)
    return resp_json

# Create a hyperlog key with a 86400-second expiry
print 'Adding logkey {} with expiry {}'.format('visitors', 86400)
client = httplib.HTTPConnection(hlld_http_address)
client.request('GET', '/addlogkey?logkey=visitors&expiry=86400')
resp = client.getresponse()
check_response(resp)
print 'Successfully added hyperlog key {}'.format('visitors')
client.close()

# Update the hyperlog with 100 batches of 100 random values each
print 'Updating logkey {}'.format('visitors')
i = 0
while i < 100:
    j = 0
    params = {'logkey': 'visitors', 'values': []}
    while j < 100:
        params['values'].append(uuid1().hex)
        j += 1
    client = httplib.HTTPConnection(hlld_http_address)
    client.request('POST', '/updatelog', json.dumps(params))
    resp = client.getresponse()
    check_response(resp)
    i += 1
print 'Updated logkey {} successfully'.format('visitors')
client.close()

# Query the estimated cardinality
print 'Getting cardinality of the logkey {}'.format('visitors')
client = httplib.HTTPConnection(hlld_http_address)
url = '/cardinality?logkey={}'.format('visitors')
client.request('GET', url)
resp = client.getresponse()
resp = check_response(resp)
print 'Cardinality for logkey {}: {}'.format('visitors', resp['cardinality'])
client.close()

# Shorten the key's expiry to 10 seconds
print 'Updating expiry of logkey {}'.format('visitors')
client = httplib.HTTPConnection(hlld_http_address)
url = '/updexpiry?logkey={}&expiry={}'.format('visitors', 10)
client.request('GET', url)
resp = client.getresponse()
resp = check_response(resp)
client.close()
print 'Successfully updated expiry of logkey {}'.format('visitors')

# Delete the key
print 'Deleting logkey {}'.format('visitors')
client = httplib.HTTPConnection(hlld_http_address)
url = '/dellogkey?logkey={}'.format('visitors')
client.request('GET', url)
resp = client.getresponse()
resp = check_response(resp)
client.close()
print 'Successfully deleted logkey {}'.format('visitors')
Output of a sample run of the above script:
$ python hllclient.py 127.0.0.1:55123
Adding logkey visitors with expiry 86400
Successfully added hyperlog key visitors
Updating logkey visitors
Updated logkey visitors successfully
Getting cardinality of the logkey visitors
Cardinality for logkey visitors: 9540
Updating expiry of logkey visitors
Successfully updated expiry of logkey visitors
Deleting logkey visitors
Successfully deleted logkey visitors