{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Data Structures for Big Data\n", "====\n", "\n", "When dealing with big data, minimizing the amount of memory used is critical to avoid having to use disk based access, which can be 100,000 times slower for random access. This notebook deals with ways to minimizee data storage for several common use case:\n", "\n", "- Large arrays of homogenous data (often numbers)\n", "- Large string collections\n", "- Counting distinct valuees\n", "- Yes/No responses to queries\n", "\n", "Methods covered range from the mundane (use `numpy` arrays rather than lists), to classic but less well known data structures (e.g. prefix trees or tries) to algorihtmically ingenious probabilistic data structures (e.g. bloom filter and hyperloglog)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Storing numbers\n", "----\n", "\n", "Less memory is used when storing numbers in numpy arrays rather than lists." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%load_ext memory_profiler" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "peak memory: 117.23 MiB, increment: 0.21 MiB\n" ] } ], "source": [ "%memit" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "peak memory: 880.07 MiB, increment: 762.77 MiB\n" ] } ], "source": [ "%memit [0]*int(1e8)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "peak memory: 3967.04 MiB, increment: 3849.74 MiB\n" ] } ], "source": [ "%memit list(range(int(1e8)))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "peak memory: 879.11 MiB, increment: 760.94 MiB\n" ] } ], "source": [ "%memit np.arange(int(1e8))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Storing strings\n", "----" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def flatmap(func, items):\n", " return it.chain.from_iterable(map(func, items))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def flatten(xss):\n", " return (x for xs in xss for x in xs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using a list" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "peak memory: 151.86 MiB, increment: 33.65 MiB\n" ] } ], "source": [ "%%memit\n", "\n", "with open('data/Ulysses.txt') as f:\n", " word_list = list(flatten(line.split() for line in f))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "target = 'WARRANTIES'" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 loops, best of 1: 5.59 ms per loop\n" ] } ], "source": [ "%timeit -r1 -n1 word_list.index(target)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 loops, best of 1: 10.1 ms per loop\n" ] } ], "source": [ "%timeit -r1 -n1 target in word_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using a sorted list" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "peak memory: 137.34 MiB, increment: 0.00 MiB\n" ] } ], "source": [ "%memit word_list.sort()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 loops, best of 1: 24.7 µs per loop\n" ] } ], "source": [ "import bisect\n", "%timeit -r1 -n1 bisect.bisect(word_list, target)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using a set" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "peak memory: 140.59 MiB, increment: 3.25 MiB\n" ] } ], "source": [ "%memit word_set = set(word_list)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 loops, best of 1: 3.39 µs per loop\n" ] } ], "source": [ "%timeit -r1 -n1 target in word_set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using a [trie](https://en.wikipedia.org/wiki/Trie) (prefix tree)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from hat_trie import Trie" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "peak memory: 140.76 MiB, increment: 0.00 MiB\n" ] } ], "source": [ "%memit word_trie = Trie(word_list)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 loops, best of 1: 5.73 µs per loop\n" ] } ], "source": [ "%timeit -r1 -n1 target in word_trie" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Probabilisitc Data Structues\n", "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Morris counter\n", "\n", "The Morris counter is used as a simple illustration of a probabiliistic data structure, with the standard trading off of using less memory in return for less accuracy. The algorithm is extreemly simple - keep a counte $c$ that represents the **exponent** - that is, when the Morris counter is $c$, the estimated count is $2^c$. The probabilistic part comes from the way that the counter is incremented by comparing a unform random variate to $1/2^c$." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from random import random\n", "\n", "class MorrisCounter:\n", " def __init__(self, c=0):\n", " self.c = c\n", "\n", " def __len__(self):\n", " return 2 ** self.c\n", "\n", " def add(self, item):\n", " self.c += random() < 1/(2**self.c)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true }, "outputs": [], "source": [ "mc = MorrisCounter()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\t\tMorris\t\tRel Error\n", " 0\t 2\t0.00\n", " 20000\t 16384\t0.18\n", " 40000\t 32768\t0.18\n", " 60000\t 65536\t0.09\n", " 80000\t 65536\t0.18\n", " 100000\t 131072\t0.31\n", " 120000\t 131072\t0.09\n", " 140000\t 131072\t0.06\n", " 160000\t 131072\t0.18\n", " 180000\t 131072\t0.27\n", " 200000\t 131072\t0.34\n", " 220000\t 131072\t0.40\n", " 240000\t 131072\t0.45\n", " 260000\t 131072\t0.50\n" ] } ], "source": [ "print('True\\t\\tMorris\\t\\tRel Error')\n", "for i, word in enumerate(word_list):\n", " mc.add(word)\n", " if i%int(.2e5)==0:\n", " print('%8d\\t%8d\\t%.2f' % (i, len(mc), 0 if i==0 else abs(i - len(mc))/i))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Increasing accuracy\n", "\n", "A simple way to increase the accuracy is to have multiple Morris counters and take the average. These two ideas of using a proabbilisitc calculation and multiple samples to imrprove precision are the basis for the more useful probabilisitc data structures described below." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "mcs = [MorrisCounter() for i in range(10)]" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\t\tMorris\t\tRel Error\n", " 0\t 2\t0.00\n", " 20000\t 14336\t0.28\n", " 40000\t 40140\t0.00\n", " 60000\t 57344\t0.04\n", " 80000\t 93388\t0.17\n", " 100000\t 93388\t0.07\n", " 120000\t 113049\t0.06\n", " 140000\t 127795\t0.09\n", " 160000\t 140902\t0.12\n", " 180000\t 170393\t0.05\n", " 200000\t 173670\t0.13\n", " 220000\t 203161\t0.08\n", " 240000\t 255590\t0.06\n", " 260000\t 255590\t0.02\n" ] } ], "source": [ "print('True\\t\\tMorris\\t\\tRel Error')\n", "for i, word in enumerate(word_list):\n", " for j in range(10):\n", " mcs[j].add(word)\n", " estimate = np.mean([len(m) for m in mcs])\n", " if i%int(.2e5)==0:\n", " print('%8d\\t%8d\\t%.2f' % (i, estimate, 0 if i==0 else abs(i - estimate)/i))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Distinct value Sketches\n", "\n", "The Morris counter is less useful because the degree of memory saved as compared to counitng the number of elemetns exactly is not much unless the numbers are staggeringly huge. In contrast, counting the number of **distinct** elements exactly requires storage of all distinct elements (e.g. in a set) and hence grows with the cardinality $n$. Probabilistic data structures knwon as Distinct Value Sketches can do this with a tiny and fixed memory size.\n", "\n", "Examples where counting distinct values is useful:\n", "\n", "- number of unique users in a Twitter stream\n", "- number of distinct records to be fetched by a databse query\n", "- number of unique IP addresses accessing a website\n", "- number of distinct queries submitted to a search engine\n", "- number of distinct DNA motifs in genomics data sets (e.g. microbiome)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### [Hash functions](https://en.wikipedia.org/wiki/Hash_function_\n", "\n", "A hash function takes data of arbitrary size and converts it into a number in a fixed range. Ideally, given an arbitrary set of data items, the hash function generates numbers that follow a uniform distribution within the fixed range. Hash functions are immensely useful throughout computer science (for example - they power Python sets and dictionaries), and especially for the generation of probabilistic data structures." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### A simple hash function mapping\n", "\n", "Note the **collisions**. If not handled, there is a loss of information. Commonly, practical hash functions return a 32 or 64 bit integer. Also note that there are an arbitrary number of hash functions that can return numbers within a given range.\n", "\n", "Note also that because the hash function is deterministic, the same item will always map to the same bin. " ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def string_hash(word, n):\n", " return sum(ord(char) for char in word) % n" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The 9\n", "quick 1\n", "brown 2\n", "fox 3\n", "jumps 9\n", "over 4\n", "the 1\n", "lazy 8\n", "dog. 0\n" ] } ], "source": [ "sentence = \"The quick brown fox jumps over the lazy dog.\"\n", "for word in sentence.split():\n", " print(word, string_hash(word, 10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Built-in Python hash function" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The -5484744546882122582\n", "quick -7108007350593547279\n", "brown -6814734720224203537\n", "fox -1763212765689429847\n", "jumps 849897467049591091\n", "over -2139207452593339407\n", "the 6020327406432540674\n", "lazy -7708543853339552466\n", "dog. -6627810124593739785\n" ] } ], "source": [ "for word in sentence.split():\n", " print('{:<10s} {:24}'.format(word, hash(word)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using a hash function from the MurmurHash3 library\n", "\n", "Note that the hash function accepts a seed, allowing the creation of multiple hash functions. We also display the hash result as a 32-bit binary string." ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The +0001000011111110001001110101100 +1110110100100101010111100011010\n", "quick -0101111111011110110101100101000 +1000100001101010110000101101100\n", "brown +1000101010000110110010001110101 -1101101110000000010001100010100\n", "fox -1000000010010010000111001111011 +0111011111000011001001001110111\n", "jumps +0000010111000011010000100101010 +0010010001111110100010010110011\n", "over -0110101101111001001101011111011 -1101110111110010000101101000100\n", "the -1000000101110000000110011111001 +0001000111100111011000011100101\n", "lazy -1101011000111111110011111001100 +0010101110101100001000101110000\n", "dog. +0100110101101111101011110111111 -0101111000110000001011110001011\n" ] } ], "source": [ "import mmh3\n", "\n", "for word in sentence.split():\n", " print('{:<10} {:+032b} {:+032b}'.format(word.ljust(10), mmh3.hash(word, seed=1234), \n", " mmh3.hash(word, seed=4321)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### LogLog family\n", "\n", "The binary digits in a (say) 32-bit hash are effectively random, and equivalent to a sequence of fair coin tosses. Hence the probability that we see a run of 5 zeros in the smallest hash so far suggests that we have added $2^5$ unique items so far. This is the intuition behind the loglog family of Distinct Value Sketches. Note that the biggest count we can track with 32 bits is $2^32 = 4294967296$.\n", "\n", "The accuracy of the sketch can be improved by averaging results with multiple coin flippers. In practice, this is done by using the first $k$ bit registers to identify $2^k$ different coin flippers. Hence, the max count is now $2 ** (32 - k)$. The hyperloglog algorithm uses the harmonic mean of the $2^k$ flippers which reduces the effect of outliers and hence the variance of the estimate." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 2\n", "2 2\n", "4 2\n", "8 4\n", "16 4\n", "32 32\n", "64 128\n", "128 256\n", "256 256\n", "512 1024\n", "1024 4096\n", "2048 8192\n", "4096 8192\n", "8192 8192\n", "16384 32768\n", "32768 32768\n" ] } ], "source": [ "best = 0\n", "k = 16\n", "for i in range(2**k):\n", " r = bin(np.random.randint(0, 2**k))[2:].zfill(k)\n", " run = len(list(it.takewhile(lambda x: x=='1', r)))\n", " if run > best:\n", " best = run\n", " if np.log2(i) in range(k):\n", " print(i, 2**best)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from hyperloglog import HyperLogLog" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [], "source": [ "hll = HyperLogLog(0.01)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\t\tHLL\t\tRel Error\n", " 1\t 1\t\t0.00\n", " 6585\t 6560\t\t0.00\n", " 11862\t 11777\t\t0.00\n", " 15390\t 15318\t\t0.00\n", " 18358\t 18236\t\t0.00\n", " 24705\t 24711\t\t0.00\n", " 28693\t 28749\t\t0.00\n", " 30791\t 30945\t\t0.00\n", " 34530\t 34676\t\t0.00\n", " 36002\t 36076\t\t0.00\n", " 41720\t 42091\t\t0.00\n", " 45842\t 46384\t\t0.00\n", " 46389\t 46978\t\t0.00\n", " 49524\t 50226\t\t0.00\n" ] } ], "source": [ "print('True\\t\\tHLL\\t\\tRel Error')\n", "s = set([])\n", "for i, word in enumerate(word_list):\n", " s.add(word)\n", " hll.add(word)\n", " if i%int(.2e5)==0:\n", " print('%8d\\t%8d\\t\\t%.2f' % (len(s), hll.card(), 0 if i==0 else abs(len(s) - hll.card())/i))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bloom filters\n", "----\n", "\n", "Bloom filters are designed to answer queries about whether a specific item is in a collection. If the answer is NO, then it is definitive. However, if the answer is yes, it might be a false positive. The possibility of a false positive makes the Bloom filter a probabilistic data structure.\n", "\n", "A bloom filter consists of a bit vector of length $k$ initially set to zero, and $n$ different hash functions that return a hash value that will fall into one of the $k$ bins. In the construction phase, for every item in the collection, $n$ hash values are generated by the $n$ hash functions, and every position indicated by a hash value is flipped to one. In the query phase, given an item, $n$ hash values are calculated as before - if any of these $n$ positions is a zero, then the item is definitely not in the collection. However, because of the possibility of hash collisions, even if all the positions are one, this could be a false positive. Clearly, the rate of false positives depends on the ratio of zero and one bits, and there are Bloom filter implementations that will dynamically bound the ratio and hence the false positive rate.\n", "\n", "Possible uses of a Bloom filter include:\n", "\n", "- Does a particular sequence motif appear in a DNA string?\n", "- Has this book been recommended to this customer before?\n", "- Check if an element exists on disk before performing I/O\n", "- Check if URL is a potential malware site using in-browser Bloom filter to minimize network communication\n", "- As an alternative way to generate distinct value counts cheaply (only increment count if Bloom filter says NO)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from pybloom import ScalableBloomFilter\n", "\n", "# The Scalable Bloom Filter grows as needed to keep the error rate small \n", "# The default error_rate=0.001\n", "sbf = ScalableBloomFilter()" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": true }, "outputs": [], "source": [ "for word in word_set:\n", " sbf.add(word)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": true }, "outputs": [], "source": [ "test_words = ['banana', 'artist', 'Dublin', 'masochist', 'Obama']" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "banana True\n", "artist True\n", "Dublin True\n", "masochist False\n", "Obama False\n" ] } ], "source": [ "for word in test_words:\n", " print(word, word in sbf)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "banana True\n", "artist True\n", "Dublin True\n", "masochist False\n", "Obama False\n" ] } ], "source": [ "### Chedck\n", "for word in test_words:\n", " print(word, word in word_set)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%load_ext version_information" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "application/json": { "Software versions": [ { "module": "Python", "version": "3.5.1 64bit [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]" }, { "module": "IPython", "version": "4.0.1" }, { "module": "OS", "version": "Linux 4.2.0 23 generic x86_64 with debian jessie sid" }, { "module": "pybloom", "version": "2.0.0" }, { "module": "hyperloglog", "version": "0.0.10" }, { "module": "hat_trie", "version": "0.2" } ] }, "text/html": [ "
SoftwareVersion
Python3.5.1 64bit [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
IPython4.0.1
OSLinux 4.2.0 23 generic x86_64 with debian jessie sid
pybloom2.0.0
hyperloglog0.0.10
hat_trie0.2
Sun Jan 24 22:48:43 2016 EST
" ], "text/latex": [ "\\begin{tabular}{|l|l|}\\hline\n", "{\\bf Software} & {\\bf Version} \\\\ \\hline\\hline\n", "Python & 3.5.1 64bit [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] \\\\ \\hline\n", "IPython & 4.0.1 \\\\ \\hline\n", "OS & Linux 4.2.0 23 generic x86\\_64 with debian jessie sid \\\\ \\hline\n", "pybloom & 2.0.0 \\\\ \\hline\n", "hyperloglog & 0.0.10 \\\\ \\hline\n", "hat_trie & 0.2 \\\\ \\hline\n", "\\hline \\multicolumn{2}{|l|}{Sun Jan 24 22:48:43 2016 EST} \\\\ \\hline\n", "\\end{tabular}\n" ], "text/plain": [ "Software versions\n", "Python 3.5.1 64bit [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]\n", "IPython 4.0.1\n", "OS Linux 4.2.0 23 generic x86_64 with debian jessie sid\n", "pybloom 2.0.0\n", "hyperloglog 0.0.10\n", "hat_trie 0.2\n", "Sun Jan 24 22:48:43 2016 EST" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%version_information pybloom, hyperloglog, hat_trie" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }