A Lightweight Query Language

FoundationDB decouples its data storage technology from data models and query languages, implementing them as layers. This tutorial illustrates a lightweight query language for FoundationDB in the form of a binding for pyDatalog, an open source implementation of Datalog for use within Python.

Caveat: this layer has not been designed for optimal efficiency. It has deliberately been kept simple to show how you can use a small amount of code to bind a powerful query language to FoundationDB. The code was implemented in less than a day at a Hackathon sponsored by FoundationDB and Open Source Connections.

We’ll be drawing on the Python API, so you should take a look at that document if you’re not familiar with it. For an introductory tutorial that begins with “Hello world” and explains the basic concepts used in FoundationDB, take a look at our class scheduling tutorial. We’ll also be using the code from the class scheduling tutorial to generate some sample data.

Although we’ll be using Python, the concepts in this tutorial are also applicable to the other languages supported by FoundationDB.

If you’d like to see the code files we’ll be working with, take a look at the Appendix: datalog.py, Appendix: dl_examples.py, and Appendix: fdbquery.

Introduction

Datalog is a declarative query language within the family of logic programming languages. You can consult pyDatalog’s Online Datalog Tutorial to learn more about the version we’ll be working with.

Let’s start with a quick look before going into more detail. One of the nice features of the binding is that it can be used to query any data stored in FoundationDB using the tuple layer, whether or not the data was stored with Datalog in mind. We’ll illustrate this feature with the data from the class scheduling tutorial.

To help with interactive querying, we’ve written a command line tool called fdbquery that works with the pyDatalog binding. fdbquery has a query mode and a python mode (described further in Query tool):

$ ./fdbquery
pyDatalog version 0.13.0

query>

We need some test data to query. We’ll use the code from the Class Scheduling tutorial to generate it:

query> python
python> from class_scheduling import *
python> init(db)
python> run(10,10)
Ran 100 transactions

The class scheduling data model employs two key-value patterns:

# ('attends', student, class) = ''
# ('class', class_name) = seats_available

Our binding exposes a Datalog class that can be initialized with any subspace. The class scheduling data happens to have been defined with a scheduling subspace. Initializing Datalog with scheduling will make it evaluate queries within that subspace:

python> Datalog(scheduling, 10)
<datalog.Datalog object at 0x10efa0650>

The class scheduling code generates more data than we want to dump to the screen, but we can easily sample it. The kvp predicate lets us retrieve arbitrary key-value pairs, automatically performing range reads or gets as appropriate:

python> query
query> kvp('attends', 's0008', Class, '')
[('8:00 calc intro',), ('6:00 music 301',), ('13:00 geometry lab',)]
query> kvp('class', '16:00 chem mastery', Attendance)
[('99',)]

Queries in Datalog are defined using logical predicates, which can express either base facts or conclusions derived from other facts. By convention, predicate names begin with a lower-case letter and logical variables begin with an upper-case letter. We can define convenience predicates for data of particular interest:

query> attending(Student,Class) <= kvp('attends',Student,Class,'')
<pyDatalog.pyEngine.Clause object at 0x109c79310>
query> attending('s0006',Class)
[('10:00 chem 301',), ('6:00 geometry seminar',), ('5:00 alg 201',)]

We’ll examine fdbquery further in Query tool.

Binding for pyDatalog

Base facts correspond to rows in a relational model. By default, pyDatalog stores base facts in an in-memory database, but it also provides methods to associate Python generator functions called “resolvers” with predicates. A resolver can read from an external database and yield the data as predicate solutions.

Predicates are defined to have fixed arity, and their arguments can be naturally interpreted as tuples. Our strategy is to map base facts to key-value pairs using FoundationDB’s tuple layer. There are many approaches we could take to this mapping, including optimizations for particular data models, but we’ll focus on a generic approach that makes no assumptions about how data is being stored beyond use of the tuple layer. By taking this generic approach, we’ll be able to use the query language with any tuple-encoded FoundationDB database.
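
Concretely, a base fact such as parent('Henry', 'Mark') corresponds to a single tuple-encoded key with an empty value. A minimal sketch, assuming a tuple-layer subspace app like the one we create below:

tr[app.pack(('parent', 'Henry', 'Mark'))] = ''  # stores the fact parent('Henry', 'Mark')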

Our binding defines a family of predicates, one for each arity n, with the form:

kvp(arg_1, ..., arg_n, value)

Each kvp predicate has a resolver that can read arbitrary key-value pairs with keys corresponding to (arg_1, ..., arg_n). The only restriction on keys is that they've been encoded with the tuple layer using the subspace you specify when you instantiate the Datalog class. We normally create or open a directory to obtain an initial subspace:

app = fdb.directory.create_or_open(db, ('app',))
dl = Datalog(app, 10)

Here’s the method that creates a resolver for a kvp predicate of a given arity:

def _resolve_generic_fdb(self, arity):

    def func(*args):
        assert len(args) == arity, "arity mismatch"
        leftmost_consts = [arg.id for arg in
                           itertools.takewhile(lambda x: x.is_const(), args[:-1])]
        prefix_tuple = tuple(leftmost_consts)
        partial_key = len(leftmost_consts) < arity - 1
        for k, v in self._get_generic_predicate(db, prefix_tuple, partial_key):
            yield self.subspace.unpack(k)+(v,)

    str_arity = str(arity)
    pyEngine.Python_resolvers['kvp'+str_arity+'/'+str_arity] = func

Each resolver is created from the closure func(). It’s designed to read from the database as efficiently as possible given the data it’s been passed. It first examines its Datalog arguments and determines which have already been bound to values, looking in particular for the leftmost prefix of bound values. This information allows it to take advantage of FoundationDB’s prefix range reads, specifying as large a prefix as possible per read. Finally, the method registers the closure with pyDatalog as a resolver.
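
For example, here is what the arity-3 resolver computes for the class scheduling queries we ran earlier (the second query is hypothetical):

# kvp('class', '16:00 chem mastery', Attendance): both key arguments bound
#   leftmost_consts == ['class', '16:00 chem mastery'], partial_key == False
#   -> a single get of one key
# kvp('class', Time, Attendance): only the first key argument bound
#   leftmost_consts == ['class'], partial_key == True
#   -> a range read over the prefix ('class',)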

The reads are performed by a transactional method:

@fdb.transactional
def _get_generic_predicate(self, tr, prefix_tuple, partial_key):
    if partial_key:
        for k, v in tr[self.subspace.range(prefix_tuple)]:
            yield k, v
    else:
        k = self.subspace.pack(prefix_tuple)
        v = tr[k]
        if v.present():
            yield k, v

If all the Datalog arguments happen to be bound to values, the method further optimizes the read by performing a single get() (using the v = tr[k] syntax) rather than a range read.

The resolvers are created by a method called during initialization:

def _create_generic_fdb(self):
    for arity in range(1, self.max_arity+1):
        self._resolve_generic_fdb(arity)

We created resolvers for arities 1 through 10 when we called Datalog(app, 10). You can set the maximum arity as high as you need; just keep in mind that keys cannot exceed 10K bytes in any case.

That’s it: three private methods that bind pyDatalog to FoundationDB when you make an instance of the Datalog class. The class also contains some methods that let you create resolvers for predicates of your choosing (instead of kvp), but their logic is similar to the ones we’ve seen.

Now, let’s take a look at some Datalog.

Basic relational algebra

We’ve seen Datalog retrieving key-value pairs from the database. What other kinds of queries can you do? To begin with, we can easily perform the basic operations familiar from relational algebra.

Suppose we have a predicate parent(X,Y) indicating that X is the parent of Y:

query> parent(X, Y)
[('Henry', 'Mark'), ('Henry', 'Karen'), ('Karen', 'Susan'), ('Mark', 'Joe'), ('Joe', 'Frank')]

We can perform selections by supplying values for any argument:

query> parent(X, 'Susan')
[('Karen',)]

Comparison operators can also be used. For example, given the parent facts above:
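
query> parent('Henry', Y) & (Y != 'Mark')
[('Karen',)]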

We can perform projections by defining a new predicate from an existing one. For example, we could define a unary predicate parent(X) to be true if X has any child:

query> parent(X) <= parent(X, Y)
<pyDatalog.pyEngine.Clause object at 0x1053d1410>
query> parent(X)
[('Joe',), ('Henry',), ('Karen',), ('Mark',)]

We can perform joins by sharing arguments between predicates:

query> school_parent(X, Y, S) <= parent(X, Y) & attends_school(Y, S)
<pyDatalog.pyEngine.Clause object at 0x1033fa310>
query> school_parent(X, Y, S)
[('Joe', 'Frank', 'Pine Elementary'), ('Karen', 'Susan', 'Pine Elementary')]

Beyond relational algebra

Datalog is strictly more powerful than relational algebra due to the availability of recursion. We’ll use a few varieties of graph traversal to illustrate queries that can be formulated in a couple of lines of Datalog.

Transitive closure

Suppose we have a graph that’s been stored in the database using tuples of the form ('edge', Source, Target). We’ll define a predicate edge for convenient access:

edge(Source, Target) <= kvp('edge', Source, Target, '')

Consider finding the transitive closure of the edge relation. Transitive closures arise naturally for many relations. For example, if edge represents the parent relation from the previous example, then the transitive closure would represent an ancestor relation.

We can find the transitive closure with the following right-recursive definition:

closure(Source, Target) <= edge(Source, Target)
closure(Source, Target) <= edge(Source, Intermediate) & closure(Intermediate, Target)

The closure of the edge relation represents directed paths in the graph. There is a path from Source to Target if there is either an edge from Source to Target, or an edge from Source to some Intermediate and a path from Intermediate to Target.
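
For example, if edge holds the parent data from the previous section, then closure('Henry', Target) binds Target to each of Henry's descendants: Mark, Karen, Joe, Susan, and Frank.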

Cyclic graphs

How does the above definition work on cyclic graphs? If pyDatalog were implemented using a simple depth-first evaluation strategy, it would go into an infinite loop on cyclic graphs. However, pyDatalog uses a more sophisticated strategy incorporating memoization, and the above query works fine for cyclic graphs. We can check using a function that creates a single cycle of a specified length:
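
The set_cycle helper, from Appendix: dl_examples.py, writes the edges of a single directed cycle under the 'edge' prefix:

@fdb.transactional
def set_cycle(tr, size, label=''):
    last = size-1
    for i in range(last):
        tr[app.pack(('edge', label+str(i), label+str(i+1)))] = ''
    tr[app.pack(('edge', label+str(last), label+'0'))] = ''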

python> set_cycle(db,9,'a')
python> closure('a0',X)
[('a6',), ('a3',), ('a4',), ('a5',), ('a8',), ('a7',), ('a0',), ('a1',), ('a2',)]

Left recursion

We can also define closure in a left-recursive manner:

closure(Source, Target) <= edge(Source, Target)
closure(Source, Target) <= closure(Source, Intermediate) & edge(Intermediate, Target)

If you know Prolog, you know that it’s restricted to right-recursive definitions due to its depth-first evaluation strategy. The same restriction applies to recursive-descent parsers. In pyDatalog, memoization lets us freely use left-recursion as well. In fact, left-recursion will usually be more efficient than the right-recursive equivalent.

Strongly connected components

Suppose we want to test strong connection in a directed graph, where Source is strongly connected to Target if and only if there is a directed path from Source to Target, and from Target back to Source. This is easy to define in terms of closure:

strongly_connected(Source,Target) <= closure(Source,Target) & closure(Target,Source)

We can illustrate this predicate by constructing a graph with two directed cycles connected by a single directed edge. (Imagine the two wheels of a bicycle connected by its frame.) Each of the cycles will then form its own strongly connected component:
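
The set_bicycle helper, also from Appendix: dl_examples.py, clears any existing edges, writes the two cycles, and adds the single connecting edge:

@fdb.transactional
def set_bicycle(tr, size, label1, label2):
    del tr[app.range(('edge',))]
    set_cycle(tr, size, label1)
    set_cycle(tr, size, label2)
    tr[app.pack(('edge', label1+'0', label2+'0'))] = ''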

python> set_bicycle(db, 9, 'a', 'b')
python> strongly_connected('a0',X)
[('a2',), ('a1',), ('a4',), ('a3',), ('a7',), ('a5',), ('a6',), ('a0',), ('a8',)]
python> strongly_connected('b0',X)
[('b1',), ('b0',), ('b3',), ('b2',), ('b6',), ('b5',), ('b4',), ('b8',), ('b7',)]

The 'a' cycle has a directed edge to the 'b' cycle, but not vice versa. Hence, the query starting from 'a0' must fully explore the 'b' cycle to determine that it is not part of the 'a' component, which it does correctly.

Bipartite graphs

Suppose we want to test whether a graph is bipartite, this time interpreting the edge predicate as undirected. We can use the fact that a graph is bipartite if and only if it has no odd cycles. We define an odd_path predicate using a minor variation of transitive closure that forces an odd number of edge traversals (the base case traverses one edge, and each recursive step adds exactly two):

odd_path(Source, Target) <= edge(Source, Target)
odd_path(Source, Target) <= odd_path(Source, Mid1) & \
                            edge(Mid1, Mid2) & edge(Mid2, Target)

There is an odd cycle in the graph if there is some node X for which odd_path(X,X) is true. Therefore, we can test that the graph is bipartite by verifying the absence of any odd cycles:

bipartite() <= ~odd_path(Source, Source)

We’re using pyDatalog’s ~ operator for predicate negation. We can test the predicate with a couple of cycles. A cycle with 50 nodes and edges is bipartite; it yields a successful solution with no bound values, indicating true:

python> del db[app.range(('edge',))]
python> set_cycle(db, 50, 'a')
python> bipartite()
[()]

A cycle with 51 nodes and edges is not bipartite; it yields no successful solutions, indicating false:

python> del db[app.range(('edge',))]
python> set_cycle(db, 51, 'a')
python> bipartite()
[]

Indexing

A basic data modeling technique in FoundationDB is to index data by storing a tuple in more than one order to allow efficient access. Datalog can freely make use of indexes in your data model when defining relations.

For example, suppose we’re maintaining a family tree database, and the edges of our directed graph represent a parent relation. We’ve also decided to index the edges using a child relation, which stores each pair in the reverse order. We then encounter a need to define members of the same generation within the tree. The recursive definition requires us to move both up and down in the tree, which the indexing allows us to do efficiently. Two people are members of the same generation if they are children of the same parent, or children of parents of the same generation:

same_generation(Child1, Child2) <= child(Child1, Parent) & parent(Parent, Child2)
same_generation(Child1, Child2) <= same_generation(Parent1, Parent2) & \
                                   child(Child1, Parent1) & parent(Parent2, Child2)
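
The child index is maintained by writing each edge in both orders. Here's the helper from Appendix: dl_examples.py:

@fdb.transactional
def set_parent_child(tr, parent, child):
    tr[app.pack(('parent', parent, child))] = ''
    tr[app.pack(('child', child, parent))] = ''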

Functions and tuples

pyDatalog supports several extensions beyond strict relations, as described in its tutorial. One is the use of functions to define unique values for a given set of inputs; another is the use of tuples as values for logical variables.

Let’s illustrate these extensions with a more data-driven problem that may not be fully specified until run time. Consider the following puzzle:

When Theseus traveled to Crete, King Minos summoned him to his throne room and taunted
him with a puzzle devised by Daedalus. "I have a game board," Minos explained,
"on which each place is marked by a number. Likewise, each game piece placed on the
board is marked by a letter. I have a scroll on which is written pairs of numbers
indicating the places between which pieces may move. Further, a piece may move only to
an empty place."

"I will place a number of pieces on the board and tell you how they must be arranged
in the end. I'll give you a day to think about the puzzle, and then you must return
and promptly solve it."

Theseus wrinkled his brow. "What is the board shaped like, and how may the pieces
move?" he asked.

Minos grinned. "I'm not going to tell you now."

"How are the pieces arranged at the start, and how must they be arranged in the end?"
Theseus asked.

"You'll find out tomorrow," Minos replied.

Theseus returned to his cell and pondered the puzzle.

Before you take a look at the Datalog solution, it might be instructive to write a solution in Python or your favorite language. It shouldn’t be too hard: this is just a generalization of the N-puzzle. You don’t even need to find minimum-length solutions, just good ones.
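
Here's one minimal approach in plain Python: a breadth-first search over board configurations. (This sketch is ours, not part of the layer; solve_board and its signature are hypothetical. Since the search is breadth-first, it happens to return a shortest solution.)

from collections import deque

def solve_board(moves, start, end):
    # moves: pairs (a, b) of places between which pieces may move
    # start, end: sequences of piece letters, with '*' marking an empty place
    # Returns a tuple of moves transforming start into end, or None.
    start, end = tuple(start), tuple(end)
    moves = list(moves)
    frontier = deque([(start, ())])
    seen = set([start])
    while frontier:
        board, path = frontier.popleft()
        if board == end:
            return path
        for a, b in moves:
            # A piece may move only to an empty place.
            if board[a] == '*' or board[b] == '*':
                nxt = list(board)
                nxt[a], nxt[b] = nxt[b], nxt[a]
                nxt = tuple(nxt)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, path + ((a, b),)))
    return None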

Our Datalog solution is an extension of transitive closure. We’ll assume that the contents of Minos’ scroll have been written to the database as tuples. Whatever those tuples may be, we know that they generate legal moves when applied to a board configuration, where legality requires an empty place for a piece to move to. Legal moves form the edges of an abstract graph, and transitive closure over those edges gives us paths.

The main extension of transitive closure is to extract the path. We’ll define path[Start, End] as a function and construct its value as a tuple of moves. The board configurations Start and End are tuples of board pieces indexed by place:

(path[Start, End] == Path) <= edge(Start, End, Move) & (Path == (Move,))
(path[Start, End] == Path) <= (path[Start, Mid] == Path1) & edge(Mid, End, Move) & \
                              (Path == Path1+(Move,))

When you make a legal move, your End configuration is just your Start configuration with a pair of pieces swapped (the tuple slicing below assumes A < B, which holds for the actions as we’ll store them):

edge(Start, End, Move) <= legal_move(Start, A, B) & (Move == (A, B)) & \
                (End == Start[0:A]+Start[B:B+1]+Start[A+1:B]+Start[A:A+1]+Start[B+1:])

We could inline the definition of legal_move in the definition of edge, but creating a separate predicate allows pyDatalog to memoize the legal moves, reducing the number of database reads:

legal_move(Start, A, B) <= kvp('action', A, B, '') & legal(Start, A, B)

We’ll use a '*' to represent an empty place on the board:

legal(Start, A, B) <= (Start[A] == '*')
legal(Start, A, B) <= (Start[B] == '*')
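
To see how these clauses combine, consider a hypothetical three-place board (a worked example, not from the puzzle data):

# Start == ('A', '*', 'B') with action (0, 1), so A == 0 and B == 1:
#   legal(Start, 0, 1) holds because Start[1] == '*'
#   End == Start[0:0] + Start[1:2] + Start[1:1] + Start[0:1] + Start[2:]
#       == () + ('*',) + () + ('A',) + ('B',) == ('*', 'A', 'B')
# So edge(('A','*','B'), ('*','A','B'), (0, 1)) holds.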

We can use this definition to solve any number of combination puzzles. For example, we can solve the HASHWAG contest puzzle from PyCon 13. We’ll modify their board description slightly to number places from 0 to 8 rather than 1 to 9 to facilitate tuple indexing. We can then record the operations in the database as follows:

@fdb.transactional
def set_operations(tr):
    del tr[app.range(('action',))]
    tr[app.pack(('action', 0, 1))] = ''
    tr[app.pack(('action', 0, 3))] = ''
    tr[app.pack(('action', 1, 2))] = ''
    tr[app.pack(('action', 1, 4))] = ''
    tr[app.pack(('action', 2, 5))] = ''
    tr[app.pack(('action', 3, 4))] = ''
    tr[app.pack(('action', 3, 6))] = ''
    tr[app.pack(('action', 4, 5))] = ''
    tr[app.pack(('action', 4, 7))] = ''
    tr[app.pack(('action', 5, 8))] = ''
    tr[app.pack(('action', 6, 7))] = ''
    tr[app.pack(('action', 7, 8))] = ''

Solving the puzzle is then easy:

python> set_operations(db)
python> print(path[list('HAAHS*T*G'), list('HASHTAG**')] == Path)
Path
------------------------------------------------------------------------
((4, 5), (6, 7), (1, 4), (1, 2), (2, 5), (4, 5), (4, 7), (7, 8), (6, 7))

Aggregation

pyDatalog also supports a number of aggregation functions, described in its tutorial, that allow us to count various quantities and order results by these counts.

For example, suppose we have a dataset of job opportunities and potential candidates. The jobs have locations and various required skills. Likewise, the data records where candidates live and the skills they possess. We can access the key-value pairs as follows:

has_skill(Candidate,Skill) <= kvp('has_skill',Candidate, Skill, '')
lives_in(Candidate,City) <= kvp('lives_in', Candidate, City, '')
in_location(Job,City) <= kvp('in_location', Job, City, '')
requires(Job, Skill) <= kvp('requires', Job, Skill, '')

We’d like to do some data exploration by quickly writing queries to match candidates with jobs. Ideally, we’d like to generate rough rankings of candidates for a given job and jobs for a given candidate.

We can build our queries bottom-up. First, we need to check if a candidate and a job have a matching skill:

matching_skill(Candidate, Job, Skill) <= has_skill(Candidate, Skill) & \
                                         requires(Job, Skill)

We’ll say that a candidate matches a job if the pair have some matching skill and are located in the same city (ignoring the possibility of relocation for now):

match(Candidate, Job) <= matching_skill(Candidate, Job, Skill) & \
                         lives_in(Candidate, City) & in_location(Job, City)

We’ll also need to count the number of matching skills for a pair and the number of required skills for a job:

(num_matching_skills[Candidate, Job] == len_(Skill)) <= \
                                        matching_skill(Candidate, Job, Skill)

(num_reqs[Job] == len_(Skill)) <= requires(Job, Skill)

We can now assign a score to a match, using the ratio of matching skills to the number of required skills:

match(Candidate, Job, Score) <= match(Candidate,Job) & \
                          (Score == num_matching_skills[Candidate, Job]/num_reqs[Job])

Finally, we can order matches by score to get either the best jobs for a candidate or the best candidates for a job:

(best_jobs[Candidate] == concat_(Job, order_by=Score, sep=',')) <= \
                                                          match(Candidate, Job, Score)

(best_candidates[Job] == concat_(Candidate, order_by=Score, sep=',')) <= \
                                                          match(Candidate, Job, Score)

We’ve written some code (available in the appendix) to generate test data. Let’s give it a try:

python> set_job_data()
python> print(best_jobs['Henry7']==Job)
Job
-------------------------------------
Trianon_1,Xanath_9,Zandian_4,Xanath_7
python> print(best_candidates['Zandian_4']==Candidate)
Candidate
---------------------------------------------------------
Susan5,Mark2,Susan7,Henry7,Susan8,Joe1,Joanna0,Mark4,Joe8

Query tool

The FoundationDB binding for pyDatalog can be used from any Python environment. The fdbquery tool is intended to make the binding more convenient for interactive use. It supports a python mode and a query mode, which you can freely switch between with the commands python or query, respectively:

$ ./fdbquery
pyDatalog version 0.13.0

query> python

The python mode is just a Python environment with the datalog binding imported. By default, pyDatalog requires any predicates or logical variables to be declared prior to use in order to distinguish them from Python function or variable names:

python> pyDatalog.create_atoms('p, q, X, Y')
python> p(X) <= q(X, Y)
<pyDatalog.pyEngine.Clause object at 0x105e93f50>

For interactive querying, having to declare symbols like this would get old pretty quickly. The query mode takes care of symbols automatically, so queries can be formulated on the fly. Symbols beginning with a lower-case letter are assumed to be predicates, and symbols beginning with an upper-case letter are assumed to be logical variables:

python> query
query> scala_programmer(X) <= kvp('has_skill', X, 'Scala','')
<pyDatalog.pyEngine.Clause object at 0x105e88b90>
query> scala_programmer(X)
[('Mark6',), ('Joanna6',), ('Joe6',), ('Joanna4',), ('Joanna5',), ('Susan9',), ('Susan7',), ('Joe4',), ('Henry0',), ('Joe8',), ('Henry1',), ('Mark4',), ('Joanna0',), ('Henry5',), ('Mark9',), ('Joanna8',), ('Joe3',), ('Susan4',), ('Joe5',), ('Joanna3',), ('Mark3',), ('Mark2',), ('Mark1',), ('Mark0',), ('Henry2',), ('Joanna2',), ('Henry9',), ('Henry6',), ('Susan8',)]
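
Under the hood, query mode compiles each line and creates atoms for any new names it finds, applying the capitalization convention above (from globalize_atoms in Appendix: fdbquery):

if name[0] not in string.ascii_uppercase:
    globals()[name] = pyParser.Symbol(name)    # predicate
else:
    globals()[name] = pyParser.Variable(name)  # logical variable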

In python mode, you can also evaluate a single line in query mode (without switching modes) by prefacing it with qry: as follows:

python> qry: dc_candidate(X) <= kvp('lives_in', X, 'DC', '')
<pyDatalog.pyEngine.Clause object at 0x101b70390>
python> dc_candidate(X)
[('Joe4',), ('Mark7',), ('Mark6',), ('Susan0',), ('Mark5',), ('Henry1',), ('Susan2',), ('Joanna5',), ('Joanna4',)]

fdbquery also has command-line options for loading files:

  • -p imports a Python module that may freely mix Python and pyDatalog. The pyDatalog.create_atoms() function must be called for all Datalog predicates and variables.
  • -d loads a file with Datalog statements (only). Datalog predicates and variables may be freely used without declaration.

Future directions

In this binding, we chose to associate resolvers with generic kvp predicates to perform range reads or gets from FoundationDB. Other approaches would allow further query optimization. Given an SQL engine with a good query planner, we could take advantage of it by translating a subset of our Datalog clauses to SQL.

While Datalog with recursion is strictly more powerful than relational algebra, Datalog without recursion is equivalent. It’s straightforward to translate non-recursive Datalog to either relational algebra or SQL.
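
For example, the join we defined earlier has a direct SQL reading (a sketch; the table and column names are hypothetical):

school_parent(X, Y, S) <= parent(X, Y) & attends_school(Y, S)

SELECT p.parent, p.child, a.school
FROM parent AS p
JOIN attends_school AS a ON a.child = p.child;

Shared logical variables become equality conditions on the join, bound arguments become WHERE conditions, and the head's arguments determine the SELECT list.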

One strategy would be to identify the non-recursive predicates in our clause set (taking into account dependencies among them) and translate them to SQL. We would then associate resolvers with these predicates to pass the SQL to the engine for optimized execution.

Conclusion

FoundationDB’s layer concept allows a variety of data models and query languages to be mapped to its ordered key-value store with small amounts of code. Lightweight query languages like pyDatalog can easily be integrated with FoundationDB using our language APIs.

The binding described in this tutorial has been kept simple, mapping base predicates to database reads according to a simple pattern that takes advantage of the information available at evaluation time. If efficiency is a priority, there are more sophisticated optimization strategies that employ global analysis of the query.

Appendix: datalog.py

Here’s the code for the datalog binding:

import itertools

import fdb
import fdb.tuple

from pyDatalog import pyDatalog, pyEngine

fdb.api_version(200)

db = fdb.open()

pyDatalog.create_atoms('kvp')


class Datalog(object):

    def __init__(self, subspace, max_arity):
        self.subspace = subspace
        self.max_arity = max_arity
        self._create_generic_fdb()

    #####################################
    ##   Generic FDB Predicates: kvp   ##
    #####################################

    @fdb.transactional
    def _get_generic_predicate(self, tr, prefix_tuple, partial_key):
        if partial_key:
            for k, v in tr[self.subspace.range(prefix_tuple)]:
                yield k, v
        else:
            k = self.subspace.pack(prefix_tuple)
            v = tr[k]
            if v.present():
                yield k, v

    def _resolve_generic_fdb(self, arity):

        def func(*args):
            assert len(args) == arity, "arity mismatch"
            leftmost_consts = [arg.id for arg in
                               itertools.takewhile(lambda x: x.is_const(), args[:-1])]
            prefix_tuple = tuple(leftmost_consts)
            partial_key = len(leftmost_consts) < arity - 1
            for k, v in self._get_generic_predicate(db, prefix_tuple, partial_key):
                yield self.subspace.unpack(k)+(v,)

        str_arity = str(arity)
        pyEngine.Python_resolvers['kvp'+str_arity+'/'+str_arity] = func

    def _create_generic_fdb(self):
        for arity in range(1, self.max_arity+1):
            self._resolve_generic_fdb(arity)

    ################################
    ##   Custom FDB Predicates    ##
    ################################

    @fdb.transactional
    def _get_custom_predicate(self, tr, prefix_tuple, partial_key):
        if partial_key:
            for k, _ in tr[self.subspace.range(prefix_tuple)]:
                yield k
        else:
            k = self.subspace.pack(prefix_tuple)
            if tr[k].present():
                yield k

    def _resolve_custom_fdb(self, predicate, arity):
        str_arity = str(arity)
        prefix = predicate+str_arity

        def func(*args):
            assert len(args) == arity, "arity mismatch"
            leftmost_consts = [arg.id for arg in
                               itertools.takewhile(lambda x: x.is_const(), args)]
            prefix_tuple = (prefix,) + tuple(leftmost_consts)
            partial_key = len(leftmost_consts) < arity
            for t in self._get_custom_predicate(db, prefix_tuple, partial_key):
                yield self.subspace.unpack(t)[1:]

        pyEngine.Python_resolvers[prefix+'/'+str_arity] = func

    def create_custom_fdb(self, predicates):
        # predicates should be a list of form [('pred_name', arity)]
        for predicate, arity in predicates:
            self._resolve_custom_fdb(predicate, arity)

Appendix: dl_examples.py

Here’s the code for the tutorial examples:

import itertools
import random

import fdb
import fdb.tuple

from pyDatalog import pyDatalog
from datalog import Datalog, kvp

fdb.api_version(200)
db = fdb.open()

# app is a subspace for an open directory
app = fdb.directory.create_or_open(db, ('app',))
Datalog(app, 10)

############################
##   Relational Algebra   ##
############################

pyDatalog.create_atoms('q,r,s,X,Y,Z')

# Select
r(X,'foo')

# Project
q(X) <= r(X,Y)

# Join
q(X, Y, Z) <= r(X, Y) & s(X, Z)

###################################
##   Beyond Relational Algebra   ##
###################################

pyDatalog.create_atoms('closure,edge,Source,Target,Intermediate')

@fdb.transactional
def set_cycle(tr, size, label=''):
    last = size-1
    for i in range(last):
        tr[app.pack(('edge', label+str(i), label+str(i+1)))] = ''
    tr[app.pack(('edge', label+str(last), label+'0'))] = ''

edge(Source, Target) <= kvp('edge', Source, Target, '')

# Transitive closure (left-recursive)

closure(Source, Target) <= edge(Source, Target)
closure(Source, Target) <= closure(Source, Intermediate) & edge(Intermediate, Target)

# Strongly connected components

pyDatalog.create_atoms('strongly_connected')

strongly_connected(Source, Target) <= closure(Source, Target) & closure(Target, Source)

@fdb.transactional
def set_bicycle(tr, size, label1, label2):
    del tr[app.range(('edge',))]
    set_cycle(tr, size, label1)
    set_cycle(tr, size, label2)
    tr[app.pack(('edge', label1+'0', label2+'0'))] = ''

# Bipartite graphs

pyDatalog.create_atoms('odd_path, bipartite, Mid1, Mid2')

odd_path(Source, Target) <= edge(Source, Target)
odd_path(Source, Target) <= odd_path(Source, Mid1) & \
                            edge(Mid1, Mid2) & edge(Mid2, Target)

bipartite() <= ~odd_path(Source, Source)


# Same generation in genealogy

pyDatalog.create_atoms('same_generation,child,parent,attends_school')
pyDatalog.create_atoms('Child,Child1,Child2,Parent,Parent1,Parent2,School')

same_generation(Child1, Child2) <= child(Child1, Parent) & parent(Parent, Child2)
same_generation(Child1, Child2) <= same_generation(Parent1, Parent2) & \
                                   child(Child1, Parent1) & parent(Parent2, Child2)

child(Child, Parent) <= kvp('child', Child, Parent, '')

parent(Parent, Child) <= kvp('parent', Parent, Child, '')

@fdb.transactional
def set_parent_child(tr, parent, child):
    tr[app.pack(('parent', parent, child))] = ''
    tr[app.pack(('child', child, parent))] = ''

@fdb.transactional
def set_genealogy(tr):
    del tr[app.range(('parent',))]
    del tr[app.range(('child',))]

    set_parent_child(tr, 'Henry', 'Mark')
    set_parent_child(tr, 'Henry', 'Karen')
    set_parent_child(tr, 'Mark', 'Joe')
    set_parent_child(tr, 'Karen', 'Susan')
    set_parent_child(tr, 'Joe', 'Frank')

attends_school(Child, School) <= kvp('attends_school', Child, School, '')

@fdb.transactional
def set_school(tr):
    tr[app.pack(('attends_school', 'Susan', 'Pine Elementary'))] = ''
    tr[app.pack(('attends_school', 'Frank', 'Pine Elementary'))] = ''

# Functions and lists (Daedalus puzzle)

pyDatalog.create_atoms('path,legal,legal_move')
pyDatalog.create_atoms('Start,End,Mid,Path,Path1,Move,A,B')

(path[Start, End] == Path) <= edge(Start, End, Move) & (Path == (Move,))
(path[Start, End] == Path) <= (path[Start, Mid] == Path1) & edge(Mid, End, Move) & \
                              (Path == Path1+(Move,))

edge(Start, End, Move) <= legal_move(Start, A, B) & (Move == (A, B)) & \
                (End == Start[0:A]+Start[B:B+1]+Start[A+1:B]+Start[A:A+1]+Start[B+1:])

legal_move(Start, A, B) <= kvp('action', A, B, '') & legal(Start, A, B)

legal(Start, A, B) <= (Start[A] == '*')
legal(Start, A, B) <= (Start[B] == '*')

@fdb.transactional
def set_operations(tr):
    del tr[app.range(('action',))]
    tr[app.pack(('action', 0, 1))] = ''
    tr[app.pack(('action', 0, 3))] = ''
    tr[app.pack(('action', 1, 2))] = ''
    tr[app.pack(('action', 1, 4))] = ''
    tr[app.pack(('action', 2, 5))] = ''
    tr[app.pack(('action', 3, 4))] = ''
    tr[app.pack(('action', 3, 6))] = ''
    tr[app.pack(('action', 4, 5))] = ''
    tr[app.pack(('action', 4, 7))] = ''
    tr[app.pack(('action', 5, 8))] = ''
    tr[app.pack(('action', 6, 7))] = ''
    tr[app.pack(('action', 7, 8))] = ''

# Example: solve_dedalus('HAAHS*T*G', 'HASHTAG**')
def solve_dedalus(start, end):
    print(path[list(start), list(end)] == Path)

# Aggregation (job matching)

pyDatalog.create_atoms('has_skill,lives_in,in_location,requires,match')
pyDatalog.create_atoms('matching_skill,num_matching_skills,num_reqs')
pyDatalog.create_atoms('best_jobs,best_candidates')
pyDatalog.create_atoms('Candidate,Job,Skill,City,Score')

has_skill(Candidate,Skill) <= kvp('has_skill',Candidate, Skill, '')
lives_in(Candidate,City) <= kvp('lives_in', Candidate, City, '')
in_location(Job,City) <= kvp('in_location', Job, City, '')
requires(Job, Skill) <= kvp('requires', Job, Skill, '')

matching_skill(Candidate, Job, Skill) <= has_skill(Candidate, Skill) & \
                                         requires(Job, Skill)

match(Candidate, Job) <= matching_skill(Candidate, Job, Skill) & \
                        lives_in(Candidate, City) & in_location(Job, City)

(num_matching_skills[Candidate, Job] == len_(Skill)) <= \
                                        matching_skill(Candidate, Job, Skill)

(num_reqs[Job] == len_(Skill)) <= requires(Job, Skill)

match(Candidate, Job, Score) <= match(Candidate,Job) & \
                          (Score == num_matching_skills[Candidate, Job]/num_reqs[Job])

(best_jobs[Candidate] == concat_(Job, order_by=Score, sep=',')) <= \
                                                          match(Candidate, Job, Score)

(best_candidates[Job] == concat_(Candidate, order_by=Score, sep=',')) <= \
                                                          match(Candidate, Job, Score)

@fdb.transactional
def clear_job_data(tr):
    del tr[app.range(('has_skill',))]
    del tr[app.range(('lives_in',))]
    del tr[app.range(('requires',))]
    del tr[app.range(('in_location',))]

@fdb.transactional
def set_job_record(tr, pred, entity, value):
    tr[app.pack((pred, entity, value))] = ''

def set_job_data():
    names = ['Joe', 'Henry', 'Susan', 'Mark', 'Joanna']
    companies = ['Zandian', 'Xanath', 'Trianon', 'Micromoves', 'Prionics']

    name_product = itertools.product(names, xrange(10))
    company_product = itertools.product(companies, xrange(10))
    candidates = [name+str(num) for name, num in name_product]
    jobs = [company+'_'+str(num) for company, num in company_product]

    skills = ['Java', 'Scala', 'Python', 'R', 'C++']
    locations = ['San Francisco', 'DC', 'New York City', 'Boston', 'Austin']

    clear_job_data(db)

    for candidate in candidates:
        cand_skills = random.sample(skills, random.randint(1, len(skills)))
        for skill in cand_skills:
            set_job_record(db, 'has_skill', candidate, skill)
        set_job_record(db, 'lives_in', candidate, random.choice(locations))

    for job in jobs:
        job_skills = random.sample(skills, random.randint(1, len(skills)))
        for skill in job_skills:
            set_job_record(db, 'requires', job, skill)
        set_job_record(db, 'in_location', job, random.choice(locations))

Appendix: fdbquery

Here’s the code for the query tool:

#!/usr/bin/python

import argparse
import code
import string
import sys

from pyDatalog import pyDatalog, pyEngine, pyParser, util
from datalog import Datalog, kvp

#################################################
##   Command Line Tool for Interactive Queries ##
#################################################

def import_all_from(module_name):
    mod = __import__(module_name)
    for member_name in dir(mod):
        globals()[member_name] = getattr(mod, member_name)

def globalize_atoms(code):
    for name in code.co_names:
        if name in globals():
            if not isinstance(globals()[name], (pyParser.Symbol, pyParser.Variable)):
                raise util.DatalogError("Name conflict. Can't redefine %s as atom" %
                                        name, None, None)
        else:
            if name[0] not in string.ascii_uppercase:
                globals()[name] = pyParser.Symbol(name)
            else:
                globals()[name] = pyParser.Variable(name)

def exec_datalog(source):
    code = compile(source, '<string>', 'single')
    with pyParser.ProgramContext():
        newglobals = {}
        pyParser.add_symbols(code.co_names, newglobals)
        globalize_atoms(code)
        exec code in newglobals

class fdbqueryConsole(code.InteractiveConsole):
    valid_modes = ['query', 'python']

    def set_mode(self,mode):
        assert mode in fdbqueryConsole.valid_modes
        self.mode = mode
        sys.ps1 = mode+'> '

    def interpolate(self, source):
        # ugly string interpolation
        return """
exec_datalog('''
%s
''')
""" % source

    def runsource(self, source, filename='console', symbol='single'):

        if source in fdbqueryConsole.valid_modes:
            self.set_mode(source)
            return

        if self.mode == 'query':
            new_source = self.interpolate(source)
        elif source.lstrip().startswith('qry:'):
            source = source.lstrip()[len('qry:'):].lstrip()
            new_source = self.interpolate(source)
        else:
            new_source = source

        try:
            code.InteractiveConsole.runsource(self, new_source, filename, symbol)
        except Exception as e:
            print(e)

pyEngine.Auto_print = True

if __name__ == "__main__":

    parser = argparse.ArgumentParser(description='FoundationDB Query Console')
    parser.add_argument('-p', '--python', help='''Python module to be imported.
        pyDatalog.create_atoms must be called for any Datalog included.''')
    parser.add_argument('-d', '--datalog', help='''File with Datalog statements
        (only) to be loaded. Atoms will be automatically created.''')
    args = parser.parse_args()
    if args.python:
        import_all_from(args.python)
    if args.datalog:
        with open(args.datalog, 'r') as f:
            dl_defs = f.read()
        pyDatalog.load(dl_defs)
        globalize_atoms(compile(dl_defs, '<string>', 'exec'))

    console = fdbqueryConsole(locals=locals())
    console.set_mode('query')
    console.interact('')