
Postmortem on TreeTracker Join: Simple, Optimal, Fast

After five years of blood, sweat, and tears, TreeTracker Join (\(\mathsf{TTJ}\)) has finally been published. The research itself is very interesting and kept me hooked for five years. As a postmortem, I reflect on some of what I learned during the process. I'm hoping this post can be useful to my fellow PhD students who are struggling to make progress or to find a proper home for their work.

Short background on \(\mathsf{TTJ}\)

This research concerns query processing, specifically the evaluation of queries that involve multiple joins, e.g., \(R \Join S \Join T\) for relations \(R\), \(S\), and \(T\). Such queries are known as conjunctive queries. It has long been known from database theory that an important subclass of conjunctive queries, acyclic conjunctive queries [2], can be evaluated very efficiently using Yannakakis's algorithm (\(\mathsf{YA}\)) [1]. (There are many expositions on Yannakakis's algorithm such as wikipedia, blog post, slides, and video.) However, the issue with this algorithm is that it is somewhat galactic (theoretically optimal but empirically impractical) due to its poor empirical performance and awkward algorithmic structure. On the empirical performance front, the core issue of \(\mathsf{YA}\) is that it uses semijoins to preprocess the input relations of a query so that the follow-up joins only generate tuples that can be part of the final query result. There are two problems with semijoins:

  1. A semijoin is a heavy operation from an empirical perspective: if one implements a hash-based semijoin, building the hash table and probing tuples against it can be quite expensive (a minimal sketch follows this list).

  2. \(\mathsf{YA}\) just uses too many semijoins; if your query has \(k\) relations for a positive integer \(k\), \(\mathsf{YA}\) uses \(2k-2\) semijoins.
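To make the first problem concrete, below is a minimal sketch of a hash-based semijoin \(R \ltimes S\); it is my own illustration for this post (the int[]-based row layout is hypothetical), not code from \(\mathsf{YA}\) or \(\mathsf{TTJ}\). Even this reduced operation builds a hash set over one input and probes every tuple of the other, and \(\mathsf{YA}\) pays that cost \(2k-2\) times before producing any join output.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Hypothetical minimal setting: a tuple is an int[] and the semijoin key is one column.
    final class SemijoinSketch {
        // R semijoin S on R[joinColR] = S[joinColS]: keep the R-tuples that have a join partner in S.
        static List<int[]> semijoin(List<int[]> r, int joinColR,
                                    List<int[]> s, int joinColS) {
            // Build phase: one pass over S, one hash insert per tuple.
            Set<Integer> keys = new HashSet<>();
            for (int[] sTuple : s) {
                keys.add(sTuple[joinColS]);
            }
            // Probe phase: one hash lookup per R tuple.
            List<int[]> kept = new ArrayList<>();
            for (int[] rTuple : r) {
                if (keys.contains(rTuple[joinColR])) {
                    kept.add(rTuple);
                }
            }
            return kept;
        }
    }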

The structural issue of \(\mathsf{YA}\) stems mostly from algorithmic ingredients that are not commonly seen in popular database systems; it uses a data structure called a join tree to guide how semijoins and joins should be computed, whereas existing query systems work with query plans, and it is not obvious how to convert a query plan into a join tree. (\(\mathsf{TTJ}\) takes a stab at this issue; see Lemma 4.6 of the paper. For a more thorough treatment, check out this upcoming ICDT '26 paper.) Furthermore, adding extra semijoins to a query plan for a query that doesn't involve EXISTS can confuse the user and make the query optimizer error-prone.

In the past 3-4 years, many people have taken a stab at the aforementioned issues of \(\mathsf{YA}\) to make the algorithm practical and accessible to database practitioners at large. (This upcoming ICDT '26 paper surveys this research thrust well.) \(\mathsf{TTJ}\) is one of those stabs. The unique angle that \(\mathsf{TTJ}\) brings to the table is that, instead of addressing the challenges within the algorithmic framework laid out by \(\mathsf{YA}\), such as replacing semijoins with Bloom filters or reducing the number of semijoins, \(\mathsf{TTJ}\) is a whole new algorithm that matches \(\mathsf{YA}\)'s worst-case runtime under data complexity. There are no semijoins at all in \(\mathsf{TTJ}\). Instead of preprocessing input relations and working with a join tree, \(\mathsf{TTJ}\) works with traditional query plans and evaluation starts right away. The big plus of \(\mathsf{TTJ}\) is that it differs very little (\(\le 5\) lines of code) from the in-memory binary hash join \(\mathsf{HJ}\). In fact, \(\mathsf{TTJ}\) performs no more hash table lookups than \(\mathsf{HJ}\).
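For readers who have not seen the paper, the following sketch shows the probe loop of the in-memory binary hash join \(\mathsf{HJ}\) that \(\mathsf{TTJ}\) is measured against; it is my own illustration with a hypothetical int[] row layout, not the paper's pseudocode. The comment marks the probe-miss branch where, as I understand it, \(\mathsf{TTJ}\)'s handful of extra lines would hook in for a multi-join plan; the actual logic is in the published algorithm.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical in-memory binary hash join: build a hash table on S, probe with R.
    final class HashJoinSketch {
        static List<int[]> hashJoin(List<int[]> r, int joinColR,
                                    List<int[]> s, int joinColS) {
            // Build phase: join key -> matching S tuples.
            Map<Integer, List<int[]>> table = new HashMap<>();
            for (int[] sTuple : s) {
                table.computeIfAbsent(sTuple[joinColS], k -> new ArrayList<>()).add(sTuple);
            }
            // Probe phase: one hash lookup per R tuple.
            List<int[]> output = new ArrayList<>();
            for (int[] rTuple : r) {
                List<int[]> matches = table.get(rTuple[joinColR]);
                if (matches == null) {
                    // Probe miss: plain HJ simply moves on to the next probe tuple.
                    // In a multiway plan, this is roughly where TTJ's few extra lines
                    // (backjumping) would go; see the paper for the actual logic.
                    continue;
                }
                for (int[] sTuple : matches) {
                    output.add(concat(rTuple, sTuple));
                }
            }
            return output;
        }

        private static int[] concat(int[] a, int[] b) {
            int[] out = new int[a.length + b.length];
            System.arraycopy(a, 0, out, 0, a.length);
            System.arraycopy(b, 0, out, a.length, b.length);
            return out;
        }
    }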

Two forms of \(\mathsf{TTJ}\)

I should point out that during the many attempts at publishing the work, two forms of \(\mathsf{TTJ}\) appeared. One is the form that eventually got published, which mostly uses if, else, and for language constructs. The other takes the iterator form consisting of open(), next(), and close() methods. In the iterator form, \(\mathsf{TTJ}\) is an object in an object-oriented language, and the algorithmic logic of \(\mathsf{TTJ}\) is implemented in the three iterator methods. The iterator version of \(\mathsf{TTJ}\) is only available as Algorithm 3.1 of the arXiv preprint (version 1).

Lessons learned

With the context of \(\mathsf{TTJ}\) and \(\mathsf{YA}\) introduced, I reflect on some lessons I learned during the process of getting the paper published. I list all six past submissions and the corresponding reviews that led to the final publication of the work, which I refer back to during the reflection.

Submission   | Reviews | Supplementary
PODS 2022    | reviews |
VLDB 2024    | reviews |
SIGMOD 2025  | reviews | SIGMOD2025-extended
ICDT 2025    | reviews |
CIDR 2025    | reviews |
TODS 2025    | reviews | revision

Algorithm design in the wild

This was the first time I designed an algorithm outside of a classroom setting. The task was to adapt the TreeTracker algorithms from the constraint satisfaction problem (CSP) setting to the relational database setting, and the result of the adaptation had to take the iterator form (Algorithm 3.1 of the preprint). The difficulty of the task was fair and well-suited to my skillset. There are two algorithms in the TreeTracker paper, \(\mathsf{TT}1\) and \(\mathsf{TT}2\). \(\mathsf{TTJ}\) is an adaptation of \(\mathsf{TT}1\) [3].

There are a few things I tried that I found helpful in designing the algorithm.

  1. Explain the original algorithm in your own words. The first thing I did after reading the original TreeTracker paper was to rewrite the proof of correctness and the runtime analysis in my own words. I found three benefits in doing this. First, rewriting helps fill in the details omitted in the original proof. Each of those details might take only a few seconds to reconstruct in my head, but having to redo that every time added a lot of delay before I could make progress in understanding the proof, especially in the first few readings, when I hadn't yet gained much intuition about the ideas behind it. Second, rewriting helps build intuition. Since I rewrote the proof in my own words, I had to ensure the correctness of the proof remained intact. Thus, for every sentence I wrote, I asked myself whether there were considerations that led the authors to write it the way they did: can I say the sentence differently? If so, does the change break the whole proof? By answering those questions, I felt I gained more intuition about the ideas behind the proof, as well as about the algorithm and its properties. Third, rewriting allows me to forget. Since I had already rewritten the proof with great care in my notes, I had the freedom to forget the details. I usually kept only a high-level idea of how the proof goes in my mind. Since a proof is an application of some properties of the proposed algorithm, the high-level idea of the proof in a way captures the essence of the algorithm. If I had to remember every detail of the proof, there wouldn't be brain power left for understanding new things or thinking about the problem.

  2. "The art of doing mathematics is finding that special case that contains all the germs of generality." This is a quote from David Hilbert. I gained a fairly deep appreciation on this quote during the design of \(\mathsf{TTJ}\). My takeaway on the quote is that we need to find an example that is simple enough that captures all the necessary ingredients of the algorithm. Thus by having an algorithm that handles this example, we obtain an algorithm that automatically handles all possible inputs. In a way, the example itself encodes the algorithm. In the case of \(\mathsf{TTJ}\), my personal favorite simple but general example is Example 1 from the arXiv preprint. The generality is captured by the branching at relation \(S\): instead of every relation in its join tree has at most one child, some relation needs to have at least two children in order to capture backjumping aspect of \(\mathsf{TTJ}\). If one only uses chain query, where each internal node in the join tree has exactly one child, the resulting \(\mathsf{TTJ}\) is not generalized enough to handle all the possible inputs. Certainly, there are multiple issues with this example such as it actually shows \(\mathsf{HJ}\) can also be optimal, and it includes deletion propagation and no-good list (See Appendix R of v3 version of the preprint), which are not necessary key ingredients of \(\mathsf{TTJ}\) for the proposed runtime guarantee. Nevertheless, the example is sufficient enough to have the correct algorithm designed; the rest is optimization of the example.

  3. Documentation to have a sense of progress. Ryan O'Donnell has this quote: "Philosophy about how to do research: Make 1% progress per day, for 100 days. Then you will solve your problem." My takeaway is that it's very unlikely to design an algorithm in one day. The design of \(\mathsf{TTJ}\) took me roughly one year. It would have felt discouraging if I could not visibly see my progress; after all, a failed attempt is progress, because it rules out a possibility and equips me with more understanding of the problem. If I can rule out 100 different possibilities, there is a decent chance that I can find the answer to my question. With this mindset, I recorded my attempts from day one in a LaTeX document, and I could see my progress simply by adding new content to it. I finally obtained \(\mathsf{TTJ}\) after this document had grown to around 1 MiB, and I felt quite satisfied each day after adding a page or so to it.

  4. Writing programs in algorithm design (a.k.a. system-assisted algorithm design). I wrote two major programs during the design of \(\mathsf{TTJ}\). First, I implemented \(\mathsf{TT}2\) of the paper. The role this program served was manifold. First, it helped me better understand the algorithm. Both the \(\mathsf{TT}2\) description and the associated proofs use a lot of prose, which is inherently imprecise [4]. To ensure I had the correct understanding of the algorithm, gained a sense of concreteness, and could continue to build up the correct intuition, I found that implementing the algorithm and running it on a few examples helped. Second, it made me appreciate the complexity of the algorithm, which pushed me to aim for something simpler, which led me to \(\mathsf{TT}1\) and eventually \(\mathsf{TTJ}\). The second major program was a query instance generator that produced a query in the form of a join tree and associated relation instances, which could be randomly generated or manually specified. The major benefit of this program was that it generated a lot of examples for me to look at. Those examples were important because they quickly got me jumpstarted on the design without making the task look too daunting. Additionally, I think of an algorithm as a procedure that handles a class of inputs satisfying certain restrictions (e.g., acyclic conjunctive queries), with the goal of producing the desired output (e.g., the correct query result). Thus the design loop of an algorithm consists of writing an initial procedure, finding inputs on which the procedure breaks, patching the procedure to handle those inputs, and repeating until the procedure handles all possible inputs. Examples helped in all aspects of this design loop: they helped me write the initial algorithm, then some examples broke my algorithm, and after a patch, those same examples allowed me to verify that the patch indeed worked. As one can see, the loop can be easily automated, and that is what I did: the generator itself ran my algorithm and compared the output with the result of evaluating the same query in a third-party database like PostgreSQL (a minimal sketch of this loop follows). I immediately received feedback on whether the algorithm broke. Immediate feedback is crucial because it can be very messy to trace through the evaluation of a query using a plan consisting of \(\mathsf{TTJ}\) iterators to check whether \(\mathsf{TTJ}\) worked correctly; I found it better to do such tracing only when the algorithm actually broke. With the loop largely automated, I could spend my energy on thinking about good examples that could capture all the ingredients of \(\mathsf{TTJ}\) and on designing the algorithm itself.
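A minimal sketch of that automated check is below; the method name, the JDBC URL, and the row encoding are placeholders I made up for this post, and the real generator also produced the join trees and relation instances, which I omit here.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical harness: compare my algorithm's output with the same query
    // evaluated by a reference database (e.g., PostgreSQL) reached over JDBC.
    final class DesignLoopSketch {
        static boolean agreesWithReference(String sql, List<List<Object>> myResult,
                                           String jdbcUrl) throws Exception {
            // Result multiset produced by the reference database.
            Map<List<Object>, Integer> reference = new HashMap<>();
            try (Connection conn = DriverManager.getConnection(jdbcUrl);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(sql)) {
                int cols = rs.getMetaData().getColumnCount();
                while (rs.next()) {
                    List<Object> row = new ArrayList<>();
                    for (int i = 1; i <= cols; i++) {
                        row.add(rs.getObject(i));
                    }
                    reference.merge(row, 1, Integer::sum);
                }
            }
            // Result multiset produced by the candidate algorithm; equality means
            // the algorithm survived this generated instance.
            Map<List<Object>, Integer> mine = new HashMap<>();
            for (List<Object> row : myResult) {
                mine.merge(row, 1, Integer::sum);
            }
            return reference.equals(mine);
        }
    }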

Limitation of system-assisted algorithm design

I found system-assisted algorithm design particularly well-suited for \(\mathsf{TTJ}\) because the problem space is discrete and structured around join trees. However, for continuous problems (like facility placement on a real line), where the problem input space can be large, it's not straightforward to apply this idea immediately; the classical method of conjecture and refinement remains more effective. In such cases, once I have some classification of the problem inputs and have derived some properties that the desired algorithm should have, the algorithm follows fairly naturally and there is no need for system-assisted algorithm design.

When it comes to the empirical study, what should we implement?

For work like \(\mathsf{TTJ}\) that simplifies an existing algorithm without breaking any complexity barrier (\(\mathsf{YA}\) is already optimal, so there is nothing to break anyway), the PODS 2022 reviews taught me that the best bet to get the work published within the database community is to implement something and send it to a systems conference like SIGMOD or VLDB, or to a journal like TODS. (In the broader algorithms and theoretical computer science community, there exist venues like SOSA that might solicit work like \(\mathsf{TTJ}\). However, it was not until very recently that I learned how to write a theory paper properly; our PODS 2022 submission was unlikely to make the cut as a theory paper at SOSA.) A fairly natural question to ask next is: if we implement something, what should we implement? From my experience, there are two types of implementations to consider:

  1. Algorithmic comparison, where one implements the proposed algorithm and compares it with existing algorithms to confirm that the proposed algorithm's behavior indeed aligns with the design and analysis. Furthermore, such a comparison highlights how different algorithms behave on different kinds of inputs, which is usually not well covered by traditional worst-case Big-Oh analysis. Examples include sections 5.3 and 5.4 of our SIGMOD 2025 submission and section 6 of Zip-Tries. (This type of implementation can even motivate the design and analysis of algorithms; see the testimony in section 4 of this paper and section 1 of this paper.)

  2. System comparison, where one integrates the proposed algorithm into a large system (or builds a new system from scratch around the proposed algorithm) and compares the new system with an existing system that serves the same purpose (e.g., query systems) to show end-to-end performance advantages of the new system. Many papers published in systems conferences like SIGMOD, VLDB, or SOSP have this type of implementation. It's immediate to see that system comparison applies in a broader scope than algorithmic comparison: it applies not only to the study of algorithms but also to the exploration of new system design ideas and principles. (This blog post gives a more detailed account of system comparison and the perspectives it entails.)

I used to believe that an algorithmic comparison was sufficient for \(\mathsf{TTJ}\). However, the reviews I received (search for the keywords "system" or "DuckDB" in the VLDB 2024, SIGMOD 2025, CIDR 2025, and TODS 2025 reviews) put a heavy premium on system comparison. My biggest confusion about applying system comparison to algorithmic work like \(\mathsf{TTJ}\) was that it introduces lots of moving parts that may not be relevant to the algorithm itself, so an end-to-end performance advantage of the new system may not indicate that the improvement comes from the algorithm. For example, suppose I have an algorithm and I integrate it into two query systems \(S_1\) and \(S_2\). The only difference between them is that, compared to \(S_2\), \(S_1\) assumes a single-user scenario and does not implement a session manager or worry about transactions. If I show empirically that \(S_1\) is faster than \(S_2\), can I claim that \(S_1\) is faster because it has a better algorithm? More generally, if system \(S_1\) consists of \(X\) components with one of them being the proposed algorithm \(A\), and system \(S_2\) consists of \(Y\) components with one of them being the compared algorithm \(B\), then in order to pin down that \(A\) is indeed empirically better than \(B\), I need to rule out the possibility that the performance advantage comes from the other \(X-1\) components of \(S_1\), which requires substantial effort. Doing so effectively reduces the system comparison to an algorithm comparison, so why not just do an algorithm comparison from the beginning? With this mindset, I implemented a query system from scratch that supports \(\mathsf{TTJ}\), \(\mathsf{YA}\), and other relevant algorithms. Every algorithm being compared, such as \(\mathsf{YA}\), enjoys the same setup as \(\mathsf{TTJ}\), and the only difference in the code being executed is the algorithm itself. One of the reasons that \(\mathsf{TTJ}\) got published is that Remy came in and directed the effort of implementing a variation of \(\mathsf{TTJ}\) in Rust that beats DuckDB, fulfilling the requirement of system comparison. (This blog post articulates the potential impact on database research of forcing people to compare against DuckDB, and a related panel discussion at VLDB 2025 touched on the phenomenon that "the reviewing process may over-index on work that must be faster than some baseline".)

The "approximation from below" perspective

The way I now reconcile and justify the need for system comparison on algorithmic work is to view it as approximating the proposed algorithm's performance from below.

The reasoning is as follows. Suppose I have a new algorithm \(A\) and a target algorithm \(B\) that I would like to compare against. I implement \(A\) in a system \(S_1\) that has \(X\) extra components. Algorithm \(B\) is built inside a system \(S_2\) that has \(Y\) extra components. Suppose \(X < Y\) and assume all the \(X\) components and \(Y\) components have the same performance. System \(S_2\) then serves as a lower bound on the performance of \(A\): even though \(S_1\) running faster than \(S_2\) may not conclusively show that \(A\) is better than \(B\), \(S_1\) running slower than \(S_2\) definitely indicates that \(A\) does not offer a performance advantage. In other words, there is a possibility that algorithm \(B\) in the algorithm comparison is not implemented properly, which makes the baseline too low while \(A\) does not yet offer a real-world advantage; system comparison rules out that possibility.

Certainly, it is a big "if" that the \(X\) components and the \(Y\) components have the same performance. (Or, more generally, that the total runtime of all the non-\(A\) components of system \(S_1\) is less than or equal to the total runtime of all the non-\(B\) components of system \(S_2\).) Ensuring that this holds requires significant engineering effort that is not directly relevant to the innovation of algorithm \(A\). A good heuristic to satisfy the "big if" is to make \(X\) as small as possible through a simple, high-quality prototype design that avoids unnecessary components.
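Treating component runtimes as additive (itself a simplification), the argument is a one-line inequality: write \(T(\cdot)\) for runtime, \(c_1\) for the total runtime of the non-\(A\) components of \(S_1\), and \(c_2\) for that of the non-\(B\) components of \(S_2\). Then

\[
T(S_1) = T(A) + c_1,\qquad T(S_2) = T(B) + c_2,\qquad c_1 \le c_2
\;\Longrightarrow\;
\bigl(T(S_1) > T(S_2) \implies T(A) > T(B)\bigr),
\]

since \(T(A) + c_1 > T(B) + c_2 \ge T(B) + c_1\). A loss in the system comparison therefore cannot be blamed on the extra components, which is exactly the lower-bound role \(S_2\) plays.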

Once I adopted this "approximation" view of system comparison, I could see that algorithm comparison can also be done approximately. In an ideal world, all algorithms are compared in an apples-to-apples fashion such that the only "delta" behind any performance difference is the algorithms themselves. In the SIGMOD 2025 submission, I was instructed to do exactly that. Since both \(\mathsf{TTJ}\) and \(\mathsf{YA}\) use a join tree, and \(\mathsf{TTJ}\) and \(\mathsf{HJ}\) use query plans, the fairest way was to build an optimizer that generates query plans and join trees so that all algorithms use the same query plan and, if needed, the same join tree. However, the issue with this empirical setup was that it muddied the water: we would need to explain how the optimization of join trees and the generation of query plans for acyclic conjunctive queries are done, which could be treated in a separate paper. To keep query optimization from distracting the reader, many papers (such as Lookahead Information Passing, the published version of \(\mathsf{TTJ}\), and Predicate Transfer) consider the proposed algorithm purely as a runtime technique: the algorithm takes in a query plan generated by a baseline system (in the case of \(\mathsf{TTJ}\), plans are taken from SQLite and PostgreSQL) and the empirical performance is then measured. This way we approximate the best possible performance of both the proposed algorithm and the baseline algorithm, with the assumption that additional query optimization could unlock even better performance for each. The comparison is then done on proxies of both algorithms.

Sometimes the algorithm comparison can be mixed with the system comparison because the system being compared is a lower bound for the target algorithm. In the TODS submission, we used PostgreSQL as the lower bound for all the algorithms implemented in the query engine, namely \(\mathsf{TTJ}\), \(\mathsf{YA}\), and \(\mathsf{HJ}\). We ensured \(\mathsf{HJ}\) ran faster than PostgreSQL. Since \(\mathsf{TTJ}\) and \(\mathsf{YA}\) ran faster than \(\mathsf{HJ}\) in most cases, we eliminated the possibility of a "straw man" implementation of the baseline algorithms. An extra consideration in choosing PostgreSQL is that its architecture was mostly aligned with ours, most notably the row-oriented, iterator-based execution model. The challenge here was that the community really likes DuckDB, and many of the bad decisions described later in the design of the engine meant the engine could not easily beat DuckDB. Eventually, we presented the algorithm comparison and the system comparison separately, with two different prototypes.

One might fear that system comparison involves a ton of work; I certainly did, because I had been avoiding system comparison all along (the first review suggesting system comparison came from VLDB 2024, which I received in 2023). Deep down, I didn't have this approximation perspective, and I thought it would be a ton of work to do it in a fully rigorous way, given all the variables we would need to control across the systems being compared. However, if all we do is compare proxies of both algorithms, there is, in fact, a lot of leeway. I think the challenge is not knowing which ingredients are sufficient to carry out the system comparison; a more experienced person can really help a lot here. For example, I didn't need to build a whole optimizer at all for either the algorithm or the system comparison.

"If a thing is worth doing, it's worth doing badly"

During the empirical study of \(\mathsf{TTJ}\), I was haunted by the idea that a valid system comparison required building a perfect, end-to-end system from scratch to beat DuckDB--a task I estimated would take at least a year. However, I later learned that I didn't need a perfect system; a barebone "semi-system" was sufficient to demonstrate the algorithm's potential and fulfill the referees' requirements.

Considerations on the design and implementation of the prototype

With the algorithm in hand and some understanding of the type of implementation we needed to provide for the empirical study, we then needed to decide how to design the prototype to support that study. The decision was naturally coupled with the experimental design. There were again a few considerations in the design of the prototype, on which, in hindsight, I made both good and bad calls.

The common wisdom in building a research prototype is to treat the prototype as "one-shot", meaning it is software that will not be maintained, and the main purpose it serves is to generate the numbers for the plots used in the paper [5]. The biggest merit of this wisdom, I think, is the emphasis on being "agile": it prevents over-engineering and takes the shortcuts necessary to deliver results. However, I think we should be careful about when and what shortcuts we take.

The biggest good call I made was to take a shortcut on the prototype design but not on the actual implementation. That is, I aimed for the simplest possible design, but when it came down to implementing that design, I strove for production-ready engineering quality. Since I was going to build the query engine all by myself in a limited time, it was unrealistic to build a fully-fledged query engine; trade-offs had to be made. As a result, I aimed for the most barebone query engine possible. The guiding principle was that if there is no code, then there is no bug and no performance penalty. The whole query engine was modeled after Trino, which I was fairly familiar with: the engine used JDBC to connect to a relational database (SQLite, PostgreSQL, and DuckDB are supported), and the data were pulled into the engine to perform the join computation. Since SQL is merely a specification of a computation, I could implement the specification imperatively using a set of low-level APIs (a hypothetical sketch follows this paragraph). As a result, no parser was built. The main code path was fairly simple: take the specification of a join query, pull the data from the connected database, and produce the query result. Such a simple design enabled quick adaptation of the prototype to unplanned empirical evaluation needs. For example, since SIGMOD 2025 pursued an apples-to-apples comparison setup, an optimizer that was not planned in the original design had to be put in place. Because the design of the query engine was so simple, I could model the optimization for the different algorithms (\(\mathsf{YA}\), \(\mathsf{TTJ}\), \(\mathsf{HJ}\), and Predicate Transfer) as plugins to the engine, which could be enabled or disabled as needed. Query optimization as a plugin saved me a lot of headache later in the TODS submission, because we switched back to the runtime-technique comparison philosophy and completely removed the query optimization work for the algorithms. In fact, experiments done in the SIGMOD 2025 submission but removed in the TODS submission were all implemented as plugins, which gave me plenty of flexibility to cater to various instructions. When it came to the implementation, I extensively used checkState() and checkArgument() to explicitly solidify the assumptions I made in the implementation of each method. Also, nontrivial methods were thoroughly tested; for example, all the implemented algorithms were covered by the same suite of queries and relation instances.
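To give a flavor of what "implementing the specification imperatively" means, here is a sketch in the spirit of such a low-level API; the class and method names (JoinSpecSketch, scan, joinOn) are invented for this post and are not the prototype's actual API, and I am assuming Guava's Preconditions for checkState()/checkArgument().

    import static com.google.common.base.Preconditions.checkArgument;
    import static com.google.common.base.Preconditions.checkState;

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical "plan builder": instead of parsing SQL, the driver code spells
    // out the join computation directly, e.g.
    //   new JoinSpecSketch().scan("R").scan("S").scan("T")
    //       .joinOn("R", "a", "S", "a")
    //       .joinOn("S", "b", "T", "b");
    final class JoinSpecSketch {
        private final List<String> relations = new ArrayList<>();
        private final List<String[]> joinConditions = new ArrayList<>();

        JoinSpecSketch scan(String relation) {
            // Solidify an assumption about how the API is used.
            checkArgument(!relations.contains(relation), "relation %s scanned twice", relation);
            relations.add(relation);
            return this;
        }

        JoinSpecSketch joinOn(String leftRel, String leftCol, String rightRel, String rightCol) {
            checkState(relations.contains(leftRel) && relations.contains(rightRel),
                    "join references a relation that was never scanned");
            joinConditions.add(new String[] {leftRel, leftCol, rightRel, rightCol});
            return this;
        }
    }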

However, there were many bad calls I made that I wish I had done differently. The biggest one was to extend the database and query generator that facilitated the design of \(\mathsf{TTJ}\) into the query engine used for the empirical study. This decision had two significant drawbacks that I only realized later:

  1. The generator was written in Java for fast development. (The flexibility offered by Python was too much for me; I wanted a moderately restrictive language with native support for type checking and a rich ecosystem. Java was a nice language for this purpose, and I was quite familiar with it.) However, when it came to performance optimization in the late stage, the Java program was not straightforward to optimize. For example, all the data were stored in an underlying database like DuckDB and fetched on demand using JDBC. To compare with an in-memory database system, I needed to make my setup as close to in-memory as possible. In Java, that meant dealing with the on-heap memory managed by the JVM as well as off-heap memory if I wanted to buffer the entire database instance in memory. This was more cumbersome than in a language like Rust.

  2. The generator used a federated database architecture where tuples were fetched from an external database via JDBC. The benefit of this design was that it allowed me to avoid managing storage myself and to ingest new relations easily during algorithm development. Furthermore, for conjunctive queries like the ones in JOB, all the selection predicates are pushed down to and evaluated by the underlying database, so I could focus solely on the join evaluation part of the queries. However, the design was a bad call for a performance-focused prototype. By treating the data source as a separate entity, I introduced a massive communication bottleneck between storage and query processing. As described in this post, JDBC is not designed to transfer a large volume of data; it introduces significant serialization and deserialization overhead. Every tuple fetched requires the underlying database to serialize internal binary records into a wire protocol, which the JDBC driver must then deserialize into Java objects. This marshaling consumes significant CPU cycles and creates memory pressure, often making the communication interface, rather than the join algorithm, the primary bottleneck. To mitigate the issue, I tweaked setFetchSize, which decides how many tuples are fetched in a batch from the data source (a short sketch follows this list). However, there is a trade-off: if the value is set too large, then effectively one maintains two copies of the data, which risks running the JVM out of memory. For example, if the query engine fetched a table of 10000 rows, those 10000 rows were cached in the JDBC connection. Since the engine had to map each row fetched by JDBC into an internal row representation, in the worst case 20000 rows were kept on the JVM heap, which caused significant memory pressure. On the other hand, if the value of setFetchSize is set too small, a next() call on the ResultSet can take a long time. Furthermore, the JDBC driver implementation can greatly impact engine performance. I initially used PostgreSQL to store the data; the PostgreSQL JDBC driver is implemented purely in Java and is slower than the DuckDB JDBC driver, which is backed by C++. The takeaway from working with JDBC is that it is pretty neat for fast prototyping; however, when building a research prototype for system comparison, it's probably better to use something else. One idea used in the eventual TODS paper was to implement the prototype as a barebone program where all the data were stored as Parquet files. We then used Polars to read those Parquet files into memory and stored them as structs. This design was very intuitive and completely removed the hassle of dealing with JDBC during performance optimization.
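The fetch-size tuning mentioned in item 2 comes down to a single JDBC knob; the sketch below is illustrative (the URL, query, and batch size are placeholders), and whether a given value helps depends on the driver and on how much heap can be spared for the doubly-buffered rows.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    final class FetchSizeSketch {
        static long countTuples(String jdbcUrl, String sql, int batchSize) throws Exception {
            try (Connection conn = DriverManager.getConnection(jdbcUrl)) {
                // The PostgreSQL driver only streams results in batches when autocommit
                // is off; otherwise it materializes the whole result set on the client.
                conn.setAutoCommit(false);
                try (Statement stmt = conn.createStatement()) {
                    // Too small: extra round trips to the database; too large: a second
                    // copy of the data sits in the JDBC layer next to the engine's rows.
                    stmt.setFetchSize(batchSize);
                    long n = 0;
                    try (ResultSet rs = stmt.executeQuery(sql)) {
                        while (rs.next()) {
                            n++; // the real engine maps the row to its internal representation here
                        }
                    }
                    return n;
                }
            }
        }
    }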

In hindsight, I disregarded a famous quote from Fred Brooks: "In most projects, the first system built is barely usable ... Hence plan to throw one away; you will, anyhow."

Another important point is to study the benchmark workload. The prototype was expected to run on a fixed workload known beforehand, which offered rich opportunities to simplify the prototype's design and implementation. For example, if every pair of relations joins on a single integer column, it is not wise to implement the more generic case where two tables can join on a composite key or on a string column.

Some tips on Java and software engineering in general

The development of the prototype effectively started from day one, given the decision to extend the query generator used for algorithm design into a research prototype. However, the generator itself was not built with performance in mind. Therefore a huge chunk of time was spent optimizing the implementation so that the engine could outperform various system baselines. The following tips are ones I found useful during the optimization process.

  1. Profiling. I think the biggest tip is to use a profiler to see where the code spends time. There is a famous 90/10 rule for system performance optimization, which says something like: one should spend 90% of the effort on the 10% of the code that makes the biggest impact on system performance. (I believe I read it somewhere on the web but I cannot find the version I read; the closest one I found is from Richard Pattis.) To identify this 10% of mission-critical code, using a profiler is unavoidable. Once we start to profile the code, the rest becomes easy: we enter the loop of squashing the most significant performance anomaly shown by the profiler and repeat until there is nothing left to optimize. Another purpose of profiling is to generate plots like runtime breakdowns (e.g., Figures 7-9 in the preprint (v1)) for the empirical section of the paper. Using print statements like System.out.println is not precise enough, as printing to the screen adds significant runtime overhead, which makes the breakdown of your algorithm's runtime inaccurate.

  2. Replace Java Streams with vanilla loops. I used the Stream API extensively in the code because of its clean look. However, profiling showed that the Stream API added unnecessary abstraction layers. Rewriting these as vanilla loops allowed the just-in-time compiler to optimize the code more effectively and reduced the allocation of short-lived objects.

  3. Use primitive types if possible. For example, instead of Integer, use int. As a concrete example, all queries in JOB join any two relations on a single integer column. When we read tuples from the underlying database, we need to map them into our internal row representation. Naively, we can wrap each integer value in an Integer. However, such wrapping adds a significant penalty. First, we need to do the wrapping for each tuple and each integer column (4000 wrapping operations for 1000 tuples from a table of four integer columns). Second, the wrapping achieves nothing: later on, when we do the join, we extract the join attribute value from the row and use it to find the hash bucket in the hash table, and the only information we need is the integer value that the Integer object contains.

  4. Map out unnecessary columns. As mentioned in the prior section, JDBC can impose a significant performance penalty. This tip is specific to query engines: leave out the columns that are not used by a given query when fetching tuples and mapping them into the internal representation. Doing so reduces the amount of data that needs to be fetched from the underlying database. This optimization falls into the broad category of using query analysis to reduce the work actually done at query runtime.

  5. Preallocate fixed-length arrays. It's better to preallocate arrays with a fixed length that is sufficient for the whole life cycle of the query evaluation. If we instead grow arrays on demand, we end up constantly resizing them, which makes performance less predictable and fragments the JVM heap. In a query engine, the information needed to size these arrays is readily available from the optimizer through cardinality estimation. Sometimes we simply use the relation size as an approximation of the true cardinality, and it works quite well in practice.

  6. Use high-performance libraries. The standard library is sometimes too generic for performance-critical use cases. I used fastutil in the query engine for specialized data structures like IntSet. I found that fastutil's specialized collections avoided the overhead of object pointers: storing primitive keys in contiguous memory improved CPU cache hit rates and minimized pointer chasing during hash table probes. (A small sketch combining this with tips 2 and 3 appears after this list.) The same tip applies to the Rust prototype, where we used the HashMap and HashSet from ahash instead of the standard library to implement hash tables. Further, we used memchr for string predicate evaluation instead of rolling our own.

  7. Use integer comparison instead of string comparison. The initial implementation of the query engine assumed fairly generic queries as the input workload. Specifically, the engine allowed two relations to be joined on a string column, which did not show up in any benchmark I ran. Thus I made the adjustment of assuming a join only happens on an integer column (in fact, the queries in the three benchmarks I ran do not even join on composite keys). This assumption simplified a lot of the implementation and brought a significant performance improvement, because string comparison is a lot slower than integer comparison. (Papers like this exploit the performance gap between string and integer comparison; see section 3.1.) This is a good shortcut to take when building a research prototype.

  8. Use a better JDBC driver. This point was already mentioned in the prior section: the DuckDB JDBC driver, backed by a C++ implementation, has better performance than the PostgreSQL JDBC driver, which is implemented purely in Java.

  9. Avoid repeated object creation (object pooling). When \(\mathsf{TTJ}\) was implemented in the iterator form, backjumping was implemented as message passing. The message itself was just an object signaling which relation should delete a tuple from its hash table. Instead of allocating a new message object for every backjump, I implemented a form of object reuse. This stabilized heap usage and prevented frequent GC pauses from interrupting the query evaluation pipeline.

  10. Optimize performance on a small workload. Running all three benchmarks, with 120-ish queries, in a full-blown way takes multiple days. Doing so is inefficient if the goal is to optimize the prototype's performance. A better approach is to sample a few queries from the workload and optimize the performance of those first. We can sample the queries in different ways, such as by the number of relations that appear in them.
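A small sketch combining tips 2, 3, and 6 (with a nod to tip 5): building the hash index over a relation's join column with a vanilla loop, primitive int keys, and a fastutil map. The int[] row layout is hypothetical, and while the prototype did use fastutil, Int2ObjectOpenHashMap is my illustrative choice here rather than a claim about the exact classes the prototype used.

    import it.unimi.dsi.fastutil.ints.Int2ObjectOpenHashMap;

    import java.util.ArrayList;
    import java.util.List;

    final class BuildSideSketch {
        // Hash index over the join column of a relation stored as int[] rows.
        static Int2ObjectOpenHashMap<List<int[]>> buildIndex(List<int[]> rows, int joinCol) {
            // Vanilla loop instead of rows.stream().collect(groupingBy(...)):
            // no boxing of the key and no short-lived lambda/collector objects (tips 2 and 3).
            Int2ObjectOpenHashMap<List<int[]>> index =
                    new Int2ObjectOpenHashMap<>(rows.size()); // size known up front (tip 5)
            for (int[] row : rows) {
                int key = row[joinCol];               // stays a primitive int
                List<int[]> bucket = index.get(key);  // primitive-keyed lookup, no Integer allocation
                if (bucket == null) {
                    bucket = new ArrayList<>();
                    index.put(key, bucket);
                }
                bucket.add(row);
            }
            return index;
        }
    }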

Departing thoughts

Given the publication of \(\mathsf{TTJ}\), I have kept asking myself whether it was worthwhile to spend five years pushing one piece of work to publication. Would it have been better to work on something else that doesn't rely on \(\mathsf{TTJ}\), so that I would have enough material to write a thesis? My answer right now is mixed. On one hand, research is all about confidence, and I think it is absolutely crucial for the first research project to be successful so that a student gains the confidence to continue pursuing the research path. The tricky thing is how to define success: for a senior professor who has published hundreds of papers, whether one more paper is published is no longer relevant; for a PhD student just starting out, success means publication. They need to see what research is publishable and gain the reassurance that they can do it. It would require strong confidence to not publish a result and instead keep building on it. From this perspective, I think it's absolutely worth pushing a research project to publication. Plus, we discovered new things along the way, such as acyclic convolution and the importance of optimizing the join tree for acyclic conjunctive query evaluation. On the other hand, the \(\mathsf{TTJ}\) algorithm was finished in Summer 2021, and much of the time spent trying different things in the hope of earning the badge that says "this work is published" was simply wasted. I think the takeaway from all those years of trying is to also do some research in a field completely different from the one the struggling manuscript concerns; by working on something completely different, the risk is diversified, and it is good for mental health.

Acknowledgements

I imagine this post serves as a sort of informal thesis. It would be a shame if I did not give proper thanks to the many people who helped me during my time at UT-Austin.

Bailu Ding, Dixin Tang, and Marcelo Arenas provided various thoughtful comments on the manuscript and valuable advice on how to push this work toward publication. Greg Plaxton has been incredibly understanding of my struggles and spent countless hours advising me on all aspects of research--from navigating the PhD program to the nuances of theory research. Juan Sequeda and Nathan Clement generously shared their perspectives and experiences regarding their own PhD journeys. Remy Wang was an indispensable force; he strengthened the work significantly and helped push it across the finish line. Zhiting Zhu has been a wonderful friend and someone I could always talk to while I was away from my family in Austin. Finally, Sherry has unconditionally supported my seemingly insane endeavor to continue doing research after a challenging five years at UT-Austin.


  1. People call this algorithm Yannakakis algorithm, Yannakakis' algorithm, or Yannakakis's algorithm. I think grammatically speaking, the most correct way is either Yannakakis's algorithm or Yannakakis algorithm. 

  2. Acyclic conjunctive queries are everywhere; one may argue that many real life queries are in fact acyclic. See this paper for a survey on acyclicity of many query workloads. 

  3. \(\mathsf{TT}1\) is slower than \(\mathsf{TT}2\) by a factor of \(\log n\), where \(n\) is the number of variables in a constraint network. However, \(\mathsf{TT}1\) is much simpler than \(\mathsf{TT}2\), and the number of variables in a constraint network translates into the number of relations in a query, which is usually treated as a constant in the complexity model called data complexity, where the database instance is the problem input to the query evaluation algorithm. 

  4. For a more in-depth discussion of the limitations of prose proofs, see some writings (1, 2) by Leslie Lamport. 

  5. My friend's PhD advisor once told him that it's sufficient to have the software compile once just for the numbers; it's okay that the software no longer functions once the paper is published. 
