Aaron Huber presented his paper on FastPDB at this year's SIGMOD last week. Here's a quick rundown in case you missed it!
Probabilistic databases, for those of you who may not have heard of them, are a type of database system that can cope with data whose values aren't precisely known. For example, a typical Amtrak train schedule might say
"Departs at 5:00"
If that schedule were stored in a probabilistic database, you'd instead see
"Departs at 5:00" 50%
"Departs at 5:10" 10%
"Departs at 5:20" 10%
...
"Departs at 10:00" 5%
Why might you possibly want this? Well, it turns out that most data sucks. Unfortunately, people implicitly trust data that pops out of a computer, and if we're going to have any hope of changing that, we need the computer to be able to communicate uncertainty about information. That's where probabilistic databases shine, because they can give you answers couched in the language of probability. For example, the database might be able to tell me that I have a 70% chance of leaving by 5:20 (the 50% + 10% + 10% from the schedule above).
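If you like to think in code, here's a minimal sketch of that example (the names and numbers are invented for illustration; this is not FastPDB's data model): the uncertain departure time is just a distribution over alternatives, and the 70% figure is a sum over the alternatives at or before 5:20.

```python
# A toy version of the uncertain schedule entry: one logical fact, many
# possible values, each with a probability.  (Invented for illustration.)

# Possible departure times, in minutes after 5:00, with probabilities.
departure_distribution = {
    0:   0.50,   # "Departs at 5:00" -- 50%
    10:  0.10,   # "Departs at 5:10" -- 10%
    20:  0.10,   # "Departs at 5:20" -- 10%
    # ... more alternatives ...
    300: 0.05,   # "Departs at 10:00" -- 5%
}

def probability_departs_by(cutoff_minutes, distribution):
    """P(departure <= cutoff): sum the matching alternatives' probabilities."""
    return sum(p for minutes, p in distribution.items() if minutes <= cutoff_minutes)

# 0.50 + 0.10 + 0.10 = 0.70: a 70% chance of leaving by 5:20.
print(probability_departs_by(20, departure_distribution))
```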
Unfortunately, probabilistic databases are... not ideal. Even leaving aside the difficulty of getting meaningful data into the system, or helping users to understand results framed as probabilities (both projects ODIn students have worked on), there's the foundational problem of performance. Probabilistic databases, even ones heavily optimized for speed, tend to be orders of magnitude slower than regular deterministic databases.
FastPDB represents the first steps towards fixing this problem. Aaron first asked whether there was something fundamental about probabilistic databases that made them slower. Probabilistic databases that work with sets are known to be slow (query evaluation is #P-hard), but the general consensus has been that bag probabilistic databases are "fast" (PTIME). However, it turns out that in general, the extra power you get from managing probabilities does come at a cost: bag probabilistic databases (modulo the validity of several well-established complexity assumptions) need time super-linear in the runtime of the corresponding deterministic query.
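To get a little intuition for where that extra cost hides, it helps to know that under bag semantics each query result tuple's multiplicity is a polynomial over the (random) multiplicities of the base tuples that produced it, and answering the probabilistic query means computing that polynomial's expectation. The toy sketch below is my own illustration, not an argument from the paper: it shows that the moment the same variable appears more than once, say through a self-join, you can no longer just push the expectation through the arithmetic.

```python
# Toy illustration: under bag semantics, a result tuple's multiplicity is
# a polynomial over the random multiplicities of the base tuples that
# produced it; the probabilistic part of query answering is computing
# that polynomial's expectation.  (Numbers here are made up.)

p_r = 0.5   # base tuple r exists (multiplicity 1) with probability 0.5
p_s = 0.4   # base tuple s exists with probability 0.4

# Joining r with s: provenance polynomial X_r * X_s.  The variables are
# independent, so the expectation simply factors.
print("E[X_r * X_s] =", p_r * p_s)                            # 0.2

# Joining r with itself: provenance polynomial X_r * X_r.  Both factors
# are the *same* variable, so naively pushing the expectation through
# the multiplication gives the wrong answer.
naive = p_r * p_r                                             # 0.25
exact = sum(x * x * (p_r if x else 1 - p_r) for x in (0, 1))  # E[X_r^2] = 0.5
print("naive:", naive, "exact:", exact)
```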
Given this result, the logical next step was to build an approximation algorithm. Typical probabilistic query processing algorithms operate in two phases: first generating provenance for the query, and then analyzing that provenance to cope with correlated variables, which are what make it difficult to produce query results efficiently. Again looking at the complexity of different approaches, we settled on a strategy based on provenance circuits, and showed that the complexity bottleneck was in sampling from the circuit, rather than constructing it. Aaron implemented an algorithm for estimating the expected count of a bag probabilistic query result over such a circuit (roughly sketched below), and it was off to the experimental benchmarks...
... where the entire process failed. Complexity-wise, building a provenance circuit is fast... but it has a really big constant factor. So we started exploring other options, and wound up making the observation that the structure of our sampling algorithm very closely mirrored the structure of an approximate query processing algorithm called WanderJoin. Indeed, sampling directly from the circuit was no different from sampling directly from the query result using WanderJoin. So Aaron implemented a simple query-to-query transpiler (using GProM) over XDB (the implementation of WanderJoin) that can produce results faster than Postgres can produce deterministic query results!
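To make the last two paragraphs a bit more concrete, here are two toy sketches. Both are my own illustrations written from the descriptions above; the encodings, names, and estimator details are assumptions, not code from the paper, GProM, or XDB. The first estimates an expected count by sampling from a small provenance circuit: draw monomials of the provenance polynomial in proportion to their coefficients, weight each one by the probability that its tuples exist, and scale back up.

```python
import random

# A rough sketch of the "estimate an expected count by sampling from a
# provenance circuit" idea, written from the description above.  The
# circuit encoding, function names, and estimator details are my own
# guesses for illustration -- this is not the algorithm from the paper.
#
# Circuit nodes: ("var", name), ("+", left, right), ("*", left, right).
# The circuit's value in a possible world is the multiplicity of one
# query result tuple; we want that value's expectation.

def one_norm(node):
    """Circuit value with every variable set to 1 (the total coefficient
    mass of the expanded provenance polynomial)."""
    kind = node[0]
    if kind == "var":
        return 1
    left, right = one_norm(node[1]), one_norm(node[2])
    return left + right if kind == "+" else left * right

def sample_monomial(node):
    """Sample one monomial of the expanded polynomial, with probability
    proportional to its coefficient, returned as a set of variables."""
    kind = node[0]
    if kind == "var":
        return {node[1]}
    if kind == "+":
        left, right = one_norm(node[1]), one_norm(node[2])
        child = node[1] if random.random() * (left + right) < left else node[2]
        return sample_monomial(child)
    # "*": combine both sides; the set union collapses repeated variables,
    # which is fine for 0/1 tuple-existence variables (X * X = X).
    return sample_monomial(node[1]) | sample_monomial(node[2])

def estimate_expected_count(circuit, probabilities, samples=100_000):
    """Unbiased Monte Carlo estimate of E[circuit] when each variable is
    an independent 0/1 tuple-existence variable."""
    total = one_norm(circuit)
    acc = 0.0
    for _ in range(samples):
        weight = 1.0
        for var in sample_monomial(circuit):
            weight *= probabilities[var]   # E[monomial] = product of probs
        acc += weight
    return total * acc / samples

# Example: provenance (X + Y) * X, i.e. the polynomial X^2 + X*Y.
circuit = ("*", ("+", ("var", "X"), ("var", "Y")), ("var", "X"))
probabilities = {"X": 0.5, "Y": 0.4}
print(estimate_expected_count(circuit, probabilities))   # converges to 0.7
```

The second estimates the expected size of a join with WanderJoin-style random walks: start at a random tuple of the first relation, hop to a random joining partner in the next, and weight each completed walk by the inverse of its sampling probability and (for the probabilistic twist) by the probability that its tuples exist. Squint a little and the random walk is doing the same kind of recursive descent as sample_monomial above, which is more or less the observation described in the previous paragraph.

```python
import random
from collections import defaultdict

# A toy WanderJoin-style estimator, written from the general description
# of WanderJoin rather than from XDB's code; the relations, probabilities,
# and names are invented for illustration.  Goal: estimate the expected
# size of R(a, b) JOIN S(b, c) when every base tuple exists independently
# with some probability.

R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2")]
S = [("b1", "c1"), ("b1", "c2"), ("b2", "c1")]
prob_R = {t: 0.5 for t in R}     # toy per-tuple existence probabilities
prob_S = {t: 0.4 for t in S}

# WanderJoin relies on an index to pick a random joining partner cheaply.
S_index = defaultdict(list)
for s in S:
    S_index[s[0]].append(s)

def one_walk():
    # Step 1: a uniformly random tuple of R (sampling probability 1/len(R)).
    r = random.choice(R)
    partners = S_index[r[1]]
    if not partners:
        return 0.0               # dead-end walks contribute nothing
    # Step 2: a uniformly random joining tuple of S (probability 1/len(partners)).
    s = random.choice(partners)
    inverse_sampling_prob = len(R) * len(partners)
    # Probabilistic twist: also weight the walk by the probability that
    # both of its tuples actually exist.
    return inverse_sampling_prob * prob_R[r] * prob_S[s]

samples = 100_000
estimate = sum(one_walk() for _ in range(samples)) / samples
# Exact answer: 5 joining pairs, each present with probability 0.5 * 0.4.
print(estimate, "vs exact", 5 * 0.5 * 0.4)
```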
In short, we've developed the world's first Fast Probabilistic Database!
Aaron is presently on the market! You should hire him!