
Managing Large Values and Blobs

This tutorial illustrates techniques for storing and managing large values in FoundationDB. We’ll look at using the blob (binary large object) layer, which provides a simple interface for storing unstructured data. We’ll be drawing on Data Modeling and Python API, so you should take a look at those documents if you’re not familiar with them.

For an introductory tutorial that begins with “Hello world” and explains the basic concepts used in FoundationDB, take a look at our class scheduling tutorial.

Although we’ll be using Python, the concepts in this tutorial are also applicable to the other languages supported by FoundationDB.

If you’d like to see the finished version of the code we’ll be working with, take a look at the Appendix: FileLib.py.

Modeling large values

For key-value pairs stored in FoundationDB, values are limited to a size of 100 kB (see Known Limitations). Furthermore, you’ll usually get the best performance by keeping value sizes below 10 kB, as discussed in our performance guidelines.

Splitting structured values

These factors lead to an obvious question: what should you do if your first cut at a data model results in values that are larger than those allowed by the above guidelines?

The answer depends on the nature and size of your values. If your values have some internal structure, consider revising your data model to split the values across multiple keys. For example, suppose you’re recording a directed graph of users who “follow” other users to receive status updates. Your initial thought may be to have a single key for each userID and store the users being followed as its value, using something like:

tr[fdb.tuple.pack(('hasFollowed', userID))] = '/'.join((userID2,userID3, . . .))

However, if the number of users being followed is large or if you’ll often need to access only one of them at a time, it may be better to store them as part of the key:

tr[fdb.tuple.pack(('hasFollowed', userID, userID2))] = ''
tr[fdb.tuple.pack(('hasFollowed', userID, userID3))] = ''
. . .
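
With this model, checking a single relationship is a read of one key, and listing everyone a user follows is a range read over the ('hasFollowed', userID) prefix. As a rough sketch (get_followed is a hypothetical helper of our own, not part of any layer):

@fdb.transactional
def get_followed(tr, userID):
    # Range-read every key starting with ('hasFollowed', userID) and
    # recover the followed user IDs from the third tuple element.
    return [fdb.tuple.unpack(k)[2]
            for k, v in tr[fdb.tuple.range(('hasFollowed', userID))]]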

Similarly, suppose you’d like to store a serialized JSON object. Instead of storing the object as the value of a single key, you could construct a key for each path in the object, as described for documents.
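
For instance, a minimal sketch of this approach (insert_doc and the 'doc' prefix are illustrative names of our own) walks the object and writes one key-value pair per leaf value:

import json
import fdb
import fdb.tuple

def _doc_insert(tr, doc_id, obj, path):
    # Each leaf value becomes one key-value pair whose key encodes the
    # path from the root of the document; dict fields and list indices
    # both become tuple elements.
    if isinstance(obj, dict):
        for field, value in obj.items():
            _doc_insert(tr, doc_id, value, path + (field,))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            _doc_insert(tr, doc_id, value, path + (i,))
    else:
        tr[fdb.tuple.pack(('doc', doc_id) + path)] = json.dumps(obj)

@fdb.transactional
def insert_doc(tr, doc_id, obj):
    _doc_insert(tr, doc_id, obj, ())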

Note

In general, you should consider splitting your values if their sizes are above 10 kB, or if they are above 1 kB and you only use a part of each value after reading it.

Binary large objects

Sometimes, revising your data model using the above approach is not sufficient or even feasible. Some data, even after splitting, would result in values that are still too large. Unstructured data may not have elements that can naturally serve as keys. Binary large objects (blobs) are common examples of the latter sort.

Using the blob layer

You can store large values as binary large objects using our example blob layer, available on GitHub. This layer provides an abstraction for random reads and writes of a blob, allowing it to be partially accessed or streamed. Sequential reads and writes are also supported. The implementation automatically splits a blob into chunks and stores it using an efficient, sparse representation.

The blob layer contains a class with the methods below.

class Blob

An instance of Blob is used to read and write a single blob in the database. It’s initialized with a Subspace that defines the subspace of keys the database will use to store the blob:

my_blob = Blob(Subspace(('my_blob',)))

Blob.delete(tr)

Delete all key-value pairs associated with the blob.

Blob.get_size(tr)

Get the size of the blob in bytes.

Blob.read(tr, offset, n)

Read from the blob, starting at offset, retrieving up to n bytes. Fewer than n bytes will be returned if the end of the blob is reached.

Blob.write(tr, offset, data)

Write data to the blob, starting at offset and overwriting any existing data at that location. The length of the blob will be increased if necessary.

Blob.append(tr, data)

Append the contents of data onto the end of the blob.

Blob.truncate(tr, new_length)

Change the blob length to new_length, erasing any data when shrinking, and filling new bytes with 0 when growing.

Before looking at a larger example, let’s do a quick check by writing and reading back a million bytes:

>>> my_blob = Blob(Subspace(('my_blob',)))
>>> my_blob.delete(db)
>>> size_in = 1000000
>>> my_blob.append(db, '.'*size_in)
>>> size_out = len(my_blob.read(db,0,size_in))
>>> size_out == size_in
True
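
The same session can be used to exercise write(), get_size(), and truncate(). Assuming the million-byte blob from above, a quick sanity check might look like this:

>>> my_blob.write(db, 0, 'Hello')
>>> my_blob.get_size(db)
1000000
>>> my_blob.read(db, 0, 5)
'Hello'
>>> my_blob.truncate(db, 5)
>>> my_blob.get_size(db)
5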

File libraries

Suppose you have files that you’d like to store and manage in the database. The file content could be anything: text, audio, video clips, etc. You might want to group the files into named libraries and record some metadata for each file when you store it.

Let’s use the blob layer to implement a class with the basic methods we’d need to manage this kind of library. The class FileLibrary is initialized with its own subspace (separate from those of the individual blobs) that it will use to store metadata on the files:

class FileLibrary(object):
    def __init__(self, space):
        self._space = space

The class internally uses transactional methods to add or retrieve metadata attributes or to delete all attributes for a file:

@fdb.transactional
def _add_attribute(self, tr, filename, attribute, value):
    tr[self._space.pack((filename, attribute))] = value
    print 'Added', attribute, '=>', value, 'for', filename

@fdb.transactional
def _get_attribute(self, tr, filename, attribute):
    return tr[self._space.pack((filename, attribute))]

@fdb.transactional
def _delete_attributes(self, tr, filename):
    del tr[self._space.range((filename,))]

The class exposes methods to import a file into a library, export a file, remove a file from the library, and list the files a library contains.

To import a file, it would be simple to read it in its entirety and write it to a blob, but let’s assume we may be reading from a CD or DVD that’s subject to damage. We can handle minor damage by reading the file in small chunks with error-checking that replaces a chunk with 0’s if it proves unreadable, but otherwise preserving the file content.

The import_file() method initializes an empty blob and begins to read the file in chunks specified by CHUNK_SIZE. (We can adjust CHUNK_SIZE according to our knowledge of the media we’re working with.) Good chunks are directly appended to the blob, while bad chunks are replaced by 0’s before being appended. Finally, we record the condition of the file with its metadata and return the blob:

def import_file(self, filename):
    target = Blob(Subspace((filename,)))
    target.delete(db)

    input = open(filename, 'rb')
    file_size = os.stat(filename).st_size
    position = 0
    damaged = False

    while (position < file_size):
        try:
            chunk = input.read(CHUNK_SIZE)
            if not chunk: break
            target.append(db, chunk)
            position += CHUNK_SIZE
        except IOError:
            if (file_size - position) > CHUNK_SIZE:
                bytes = CHUNK_SIZE
            else:
                bytes = file_size - position
            target.append(db, "\0" * bytes)
            position += CHUNK_SIZE
            input.seek(position)
            damaged = True

    input.close()

    self._delete_attributes(db, filename)
    print "Adding attributes for", filename
    condition = 'damaged' if damaged else 'intact'
    self._add_attribute(db, filename, 'condition', condition)
    return target

The export_file() method reads successive chunks of CHUNK_SIZE from the blob and writes them out to a file:

def export_file(self, source, filename):
    output = open(filename, 'wb')
    position = 0
    while (position < source.get_size(db)):
        chunk = source.read(db, position, CHUNK_SIZE)
        output.write(chunk)
        position += CHUNK_SIZE
    output.close()

The reason export_file() reads in chunks is not to do error-checking: unlike a CD or DVD, FoundationDB will not lose bits of your data. However, the blob’s read() method is transactional, and the intent of the blob layer is to allow blobs to be large. Transactions in FoundationDB cannot be long (currently defined as over five seconds) and are best kept under one second. You might be tempted to read a blob from the database in a single transaction, like so:

size = source.get_size(db)
data = source.read(db, 0, size)

but doing so with an arbitrary blob may well result in a long transaction. Hence, export_file() reads from the database with one transaction per chunk.
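
If you need the blob contents in memory rather than written to a file, the same pattern applies: read one chunk per transaction and assemble the results. A rough sketch of a hypothetical read_blob() helper, reusing CHUNK_SIZE:

def read_blob(source):
    # Read the blob in CHUNK_SIZE pieces, one short transaction per chunk.
    # Because the reads span multiple transactions, they are not atomic:
    # a blob modified concurrently could be seen in an intermediate state.
    chunks = []
    position = 0
    size = source.get_size(db)
    while position < size:
        chunks.append(source.read(db, position, CHUNK_SIZE))
        position += CHUNK_SIZE
    return ''.join(chunks)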

The remove_file() method deletes the file’s blob and its metadata:

def remove_file(self, filename):
    stored = Blob(Subspace((filename,)))
    stored.delete(db)
    self._delete_attributes(db, filename)

Text files

Let’s try out the import method on a text file. We can grab a plain text copy of Hamlet from Project Gutenberg and save it locally as hamlet.txt. Before we import the file, we’ll use grep -b in the shell to find the byte location of something easy to recognize:

$ grep -b 'To be, or not to be' hamlet.txt
85829:To be, or not to be,--that is the question:--

Now, we import the file:

>>> shakespeare = FileLibrary(Subspace(('shakespeare',)))
>>> shakespeare.list_files(db)
[]
>>> hamlet = shakespeare.import_file('hamlet.txt')
Adding attributes for hamlet.txt
Added condition => intact
>>> shakespeare.list_files(db)
['hamlet.txt']

As a quick check that the layer is preserving the byte order of the data as we’d expect, we can read from the blob beginning at byte 85829:

>>> hamlet.read(db,85829,45)
'To be, or not to be,--that is the question:--'

MP3 metadata

FileLibrary and Blob treat the data to be stored as uninterpreted binary data. Many standard data formats have well-defined metadata that you might want to record.

Let’s say you want to store MP3 files in a library. Most MP3s contain metadata in one of the ID3 formats, with ID3v2 being the most popular. We’ll use pytagger, one of the many Python modules available for ID3 handling, to capture ID3v2 metadata:

from tagger import ID3v2, ID3Exception

We can define an MP3Library class as a subclass of FileLibrary, extending the import_file() method to grab ID3v2 tags. It will use an _id3v2() method that scans for the tags and adds them as attributes:

class MP3Library(FileLibrary):

    def _id3v2(self, filename):
        try:
            id3 = ID3v2(filename)
            if not id3.tag_exists():
                print 'Unable to find ID3v2 tags in', filename
            else:
                self._add_attribute(db, filename, 'tag version',
                                    str(id3.version))
                for frame in id3.frames:
                    val = ' '.join(frame.strings)
                    if val: self._add_attribute(db, filename, frame.fid, val)
        except ID3Exception, e:
            print 'ID3v2 exception', str(e), 'in', filename

    def import_file(self, filename):
        target = super(MP3Library, self).import_file(filename)
        self._id3v2(filename)
        return target

Let’s try out the extended import method. We’ll use an MP3 from LibriVox’s readings of works in the public domain:

>>> mp3_lib = MP3Library(Subspace(('mp3_lib',)))
>>> stored = mp3_lib.import_file('reddeath.mp3')
Adding attributes for reddeath.mp3
Added condition => intact
Added tag version => 2.3
Added TSSE => LAME 64bits version 3.98.4 (http://www.mp3dev.org/)
Added TPE1 => Edgar Allen Poe
Added TIT2 => 12 - The Masque Of The Red Death
Added TALB => LibriVox Short Story Collection Vol. 051

Conclusion

The flexibility of FoundationDB’s ordered key-value store allows it to model a wide variety of data. The goal of data modeling is to design a mapping of data to keys and values that efficiently supports the operations you need. In the case of large data objects, a good model will usually map a single object to multiple key-value pairs.

With structured data, you can often design a good model by encoding selected data elements into keys in a way that splits an object. With unstructured data, whether text or a binary large object, you can take advantage of the simple but powerful interface provided by the blob layer.

Appendix: FileLib.py

Here’s the code for the large value tutorial:

#!/usr/bin/python

import sys
import os

import fdb
import fdb.tuple

from blob import Blob, Subspace
from tagger import ID3v2, ID3Exception

fdb.api_version(22)

db = fdb.open()

##################
## File Library ##
##################

CHUNK_SIZE = 16384 # roughly 1 second of audio in an MP3

class FileLibrary(object):

    def __init__(self, space):
        self._space = space

    @fdb.transactional
    def _add_attribute(self, tr, filename, attribute, value):
        tr[self._space.pack((filename, attribute))] = value
        print 'Added', attribute, '=>', value, 'for', filename

    @fdb.transactional
    def _get_attribute(self, tr, filename, attribute):
        return tr[self._space.pack((filename, attribute))]

    @fdb.transactional
    def _delete_attributes(self, tr, filename):
        del tr[self._space.range((filename,))]

    def import_file(self, filename):
        target = Blob(Subspace((filename,)))
        target.delete(db)

        input = open(filename, 'rb')
        file_size = os.stat(filename).st_size
        position = 0
        damaged = False

        while (position < file_size):
            try:
                chunk = input.read(CHUNK_SIZE)
                if not chunk: break
                target.append(db, chunk)
                position += CHUNK_SIZE
            except IOError:
                if (file_size - position) > CHUNK_SIZE:
                    bytes = CHUNK_SIZE
                else:
                    bytes = file_size - position
                target.append(db, "\0" * bytes)
                position += CHUNK_SIZE
                input.seek(position)
                damaged = True

        input.close()

        self._delete_attributes(db, filename)
        print "Adding attributes for", filename
        condition = 'damaged' if damaged else 'intact'
        self._add_attribute(db, filename, 'condition', condition)
        return target

    def export_file(self, source, filename):
        output = open(filename, 'wb')
        position = 0
        while (position < source.get_size(db)):
            chunk = source.read(db, position, CHUNK_SIZE)
            output.write(chunk)
            position += CHUNK_SIZE
        output.close()

    def remove_file(self, filename):
        stored = Blob(Subspace((filename,)))
        stored.delete(db)
        self._delete_attributes(db, filename)

    @fdb.transactional
    def list_files(self, tr):
        KVs = tr[self._space.range(())]
        return list({self._space.unpack(k)[0] for k,v in KVs})

class MP3Library(FileLibrary):

    def _id3v2(self, filename):
        try:
            id3 = ID3v2(filename)
            if not id3.tag_exists():
                print 'Unable to find ID3v2 tags in', filename
            else:
                self._add_attribute(db, filename, 'tag version',
                                    str(id3.version))
                for frame in id3.frames:
                    val = ' '.join(frame.strings)
                    if val: self._add_attribute(db, filename, frame.fid, val)
        except ID3Exception, e:
            print 'ID3v2 exception', str(e), 'in', filename

    def import_file(self, filename):
        target = super(MP3Library, self).import_file(filename)
        self._id3v2(filename)
        return target