# A Template for Implementing Fast Lock-free Trees Using HTM

Trevor Brown Technion, Israel me@tbrown.pro

## ABSTRACT

Algorithms that use hardware transactional memory (HTM) must provide a software-only fallback path to guarantee progress. The design of the fallback path can have a profound impact on performance. If the fallback path is allowed to run concurrently with hardware transactions, then hardware transactions must be instrumented, adding significant overhead. Otherwise, hardware transactions must wait for any processes on the fallback path, causing concurrency bottlenecks, or move to the fallback path. We introduce an approach that combines the best of both worlds. The key idea is to use three execution paths: an HTM fast path, an HTM middle path, and a software fallback path, such that the middle path can run concurrently with each of the other two. The fast path and fallback path do not run concurrently, so the fast path incurs no instrumentation overhead. Furthermore, fast path transactions can move to the middle path instead of waiting or moving to the software path. We demonstrate our approach by producing an accelerated version of the tree update template of Brown et al., which can be used to implement fast lock-free data structures based on down-trees. We used the accelerated template to implement two lock-free trees: a binary search tree (BST), and an (a, b)-tree (a generalization of a B-tree). Experiments show that, with 72 concurrent processes, our accelerated (a, b)-tree performs between 4.0x and 4.2x as many operations per second as an implementation obtained using the original tree update template.

#### **1 INTRODUCTION**

Concurrent data structures are crucial building blocks in multithreaded software. There are many concurrent data structures implemented using locks, but locks can be inefficient, and are not fault tolerant (since a process that crashes while holding a lock can prevent all other processes from making progress). Thus, it is often preferable to use hardware synchronization primitives like compareand-swap (CAS) instead of locks. This enables the development of *lock-free* (or *non-blocking*) data structures, which guarantee that at least one process will always continue to make progress, even if

PODC '17, July 25-27, 2017, Washington, DC, USA

some processes crash. However, it is notoriously difficult to implement lock-free data structures from CAS, and this has inhibited the development of advanced lock-free data structures.

One way of simplifying this task is to use a higher level synchronization primitive that can atomically access multiple locations. For example, consider a k-word compare-and-swap (k-CAS), which atomically: reads k locations, checks if they contain k expected values, and, if so, writes k new values. k-CAS is highly expressive, and it can be used in a straightforward way to implement *any* atomic operation. Moreover, it can be implemented from CAS and registers [16]. However, since k-CAS is so expressive, it is difficult to implement efficiently.

Brown et al. [6] developed a set of new primitives called LLX and SCX that are less expressive than k-CAS, but can still be used in a natural way to implement many advanced data structures. These primitives can be implemented much more efficiently than k-CAS. At a high level, LLX returns a snapshot of a node in a data structure, and after performing LLXs on one or more nodes, one can perform an SCX to atomically: change a field of one of these nodes, and finalize a subset of them, only if none of these nodes have changed since the process performed LLXs on them. Finalizing a node prevents any further changes to it, which is useful to stop processes from erroneously modifying deleted parts of the data structure. In a subsequent paper, Brown et al. used LLX and SCX to design a tree update template that can be followed to produce lock-free implementations of down-trees (trees in which all nodes except the root have in-degree one) with any kinds of update operations [7]. They demonstrated the use of the template by implementing a chromatic tree, which is an advanced variant of a red-black tree (a type of balanced binary search tree) that offers better scalability. The template has also been used to implement many other advanced data structures, including lists, relaxed AVL trees, relaxed (a, b)-trees, relaxed b-slack trees and weak AVL trees [5, 7, 17]. Some of these data structures are highly efficient, and would be well suited for inclusion in data structure libraries.

In this work, we study how the new hardware transactional memory (HTM) capabilities found in recent processors (e.g., by Intel and IBM) can be used to produce significantly faster implementations of the tree update template. By accelerating the tree update template, we also provide a way to accelerate all of the data structures that have been implemented with it. Since library data structures are reused many times, even minor performance improvements confer a large benefit.

HTM allows a programmer to run blocks of code in transactions, which either commit and take effect atomically, or abort and have no effect on shared memory. Although transactional memory was originally intended to *simplify* concurrent programming, researchers have since realized that HTM can also be used effectively to *improve the performance of existing concurrent code* [21, 22, 28]: Hardware transactions typically have very little overhead, so they can often be

This work was performed while Trevor Brown was a student at the University of Toronto. Funding was provided by the Natural Sciences and Engineering Research Council of Canada. I would also like thank my supervisor Faith Ellen for her helpful comments on this work, and to Oracle Labs for providing access to the 72-thread Intel machine used in my experiments.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

<sup>© 2017</sup> Copyright held by the owner/author(s). Publication rights licensed to Association for Computing Machinery.

ACM ISBN 978-1-4503-4992-5/17/07...\$15.00

https://doi.org/http://dx.doi.org/10.1145/3087801.3087834

used to replace other, more expensive synchronization mechanisms. For example, instead of performing a sequence of CAS primitives, it may be faster to perform reads, if-statements and writes inside a transaction. Note that this represents a *non-standard use of HTM*: we are *not* interested in its ease of use, but, rather, in its ability to reduce synchronization costs.

Although hardware transactions are fast, it is surprisingly difficult to obtain the full performance benefit of HTM. Here, we consider Intel's HTM, which is a *best-effort* implementation. This means it offers *no guarantee* that transactions will ever commit. Even in a single threaded system, a transaction can repeatedly abort because of internal buffer overflows, page faults, interrupts, and many other events. So, to guarantee progress, any code that uses HTM must also provide a *software fallback path* to be executed if a transaction fails. The design of the fallback path profoundly impacts the performance of HTM-based algorithms.

Allowing concurrency between two paths. Consider an operation O that is implemented using the tree update template. One natural way to use HTM to accelerate O is to use the original operation as a fallback path, and then obtain an HTM-based fast path by wrapping O in a transaction, and performing optimizations to improve performance [21]. We call this the **2-path concurrent** algorithm (2-path con). Since the fast path is just an optimized version of the fallback path, transactions on the fast path and fallback path can safely run concurrently. If a transaction aborts, it can either be retried on the fast path, or be executed on the fast path and fallback path can add significant overhead on the fast path.

The first source of overhead is *instrumentation* on the fast path that manipulates the *meta-data* used by the fallback path to synchronize processes. For example, lock-free algorithms often create a *descriptor* for each update operation (so that processes can determine how to help one another make progress), and store pointers to these descriptors in Data-records, where they act as locks. The fast path must also manipulate these descriptors and pointers so that the fallback path can detect changes made by the fast path.

The second source of overhead comes from constraints imposed by algorithmic assumptions made on the fallback path. The tree update template implementation in [7] assumes that only child pointers can change, and all other fields of nodes, such as keys and values, are never changed. Changes to these other *immutable* fields must be made by replacing a node with a new copy that reflects the desired change. Because of this assumption on the fallback path, transactions on the fast path *cannot* directly change any field of a node other than its child pointers. This is because the fallback path has no mechanism to detect such a change (and may, for example, erroneously delete a node that is concurrently being modified by the fast path). Thus, just like the fallback path, the fast path must replace a node with a new copy to change one of its immutable fields, which can be much less efficient than changing the field directly.

**Disallowing concurrency between two paths.** To avoid the overheads described above, concurrency is often *disallowed* between the fast path and fallback path. The simplest example of this approach is a technique called **transactional lock elision** (TLE) [25, 26]. TLE is used to implement an operation by wrapping its sequential code in a transaction, and falling back to acquire a global lock after a

certain number of transactional attempts. At the beginning of each transaction, a process reads the state of the global lock and aborts the transaction if the lock is held (to prevent inconsistencies that might arise because the fallback path is not atomic). Once a process begins executing on the fallback path, all concurrent transactions abort, and processes wait until the fallback path is empty before retrying their transactions.

If transactions never abort, then *TLE represents the best performance we can hope to achieve*, because the fallback path introduces almost no overhead and synchronization is performed entirely by hardware. Note, however, that TLE is not lock-free. Additionally, in workloads where operations periodically run on the fallback path, performance can be very poor.

As a toy example, consider a TLE implementation of a binary search tree, with a workload consisting of insertions, deletions and *range queries*. A range query returns all of the keys in a range [lo,hi]. Range queries access many memory locations, and cause frequent transactional aborts due to internal processor buffer overflows (capacity limits). Thus, range queries periodically run on the fallback path, where they can lead to numerous performance problems. Since the fallback path is sequential, range queries (or any other long-running operations) cause a severe *concurrency bottleneck*, because they prevent transactions from running on the fast path while they slowly complete, serially.

One way to mitigate this bottleneck is to replace the sequential fallback path in TLE with a lock-free algorithm, and replace the global lock with a fetch-and-increment object F that counts how many operations are running on the fallback path. Instead of aborting if the lock is held, transactions on the fast path abort if F is non-zero. We call this the **2-path non-concurrent** algorithm (2-path con). In this algorithm, if transactions on the fast path retry only a few times before moving to the fallback path, or do not wait between retries for the fallback path to become empty, then the lemming effect [13] can occur. (The lemming effect occurs when processes on the fast path rapidly fail and move to the fallback path, simply because other processes are on the fallback path.) This can cause the algorithm to run only as fast as the (much slower) fallback path. However, if transactions avoid the lemming effect by retrying many times before moving to the fallback path, and waiting between retries for the fallback path to become empty, then processes can spend most of their time waiting.

The problem with two paths. In this paper, we study two different types of workloads: **light workloads**, in which transactions rarely run on the fallback path, and **heavy workloads**, in which transactions more frequently run on the fallback path. In light workloads, algorithms that allow concurrency between paths perform very poorly (due to high overhead) in comparison to algorithms that disallow concurrency. However, in heavy workloads, algorithms that disallow concurrency perform very poorly (since transactions on the fallback path prevent transactions from running on the fast path) in comparison to algorithms that allow concurrency between paths. Consequently, all two path algorithms have workloads that yield poor performance. Our experiments confirm this, showing surprisingly poor performance for two path algorithms in many cases.

Using three paths. We introduce a technique that simultaneously achieves high performance for both light and heavy workloads by



using three paths: an HTM fast path, an HTM middle path and a non-transactional fallback path. (See Figure 1.) Each operation begins on the fast path, and moves to the middle path after it retries F times. An operation on the middle path moves to the fallback path after retrying M times on the middle path. The fast path does not manipulate any synchronization meta-data used by the fallback path, so operations on the fast path and fallback path cannot run concurrently. Thus, whenever an operation is on the fallback path, all operations on the fast path move to the middle path. The middle path manipulates the synchronization meta-data used by the fallback path, so operations on the middle path and fallback path can run concurrently. Operations on the middle path can also run concurrently with operations on the fast path (since conflicts are resolved by the HTM system). We call this the **3-path** algorithm (*3-path*).

We briefly discuss why this approach avoids the performance problems described above. Since transactions on the fast path do not run concurrently with transactions on the fallback path, transactions on the fast path run with *no instrumentation overhead*. When a transaction is on the fallback path, transactions can freely execute on the middle path, *without waiting*. The *lemming effect* does not occur, since transactions do not have to move to the fallback path simply because a transaction is on the fallback path. Furthermore, we enable a high degree of concurrency, because the fast and middle paths can run concurrently, and the middle and fallback paths can run concurrently.

We performed experiments to evaluate our new template algorithms by comparing them with the original template algorithm. In order to compare the different template algorithms, we used each algorithm to implement two data structures: a binary search tree (BST) and a relaxed (a,b)-tree. We then ran microbenchmarks to compare the performance (operations per second) of the different implementations in both light and heavy workloads. The results show that our new template algorithms offer significant performance improvements. For example, on an Intel system with 72 concurrent processes, our best implementation of the relaxed (a,b)-tree outperformed the implementation using the original template algorithm by an average of 410% over all workloads.

#### Contributions

• We present four accelerated implementations of the tree update template of Brown et al. that explore the design space for HTM-based implementations: 2-path con, TLE, 2-path con, and 3-path.

- We highlight the importance of studying both light and heavy workloads in the HTM setting. Each serves a distinct role in evaluating algorithms: light workloads demonstrate the potential of HTM to improve performance by reducing overhead, and heavy workloads capture the performance impact of interactions between different execution paths.
- We demonstrate the effectiveness of our approach by accelerating two different lock-free data structures: an unbalanced BST, and a relaxed (*a*,*b*)-tree. Experimental results show a significant performance advantage for our accelerated implementations.

The remainder of the paper is structured as follows. The model is introduced in Section 2. Section 3 describes LLX and SCX, and the tree update template. In Section 4, we describe our four accelerated template implementations, and argue correctness and progress. Experimental results are presented in Section 5. Related work is surveyed in Section 7. Section 6 describes an optimization to two of our accelerated template implementations. Finally, we conclude in Section 8.

# 2 MODEL

We consider an asynchronous shared memory system with n processes, and Intel's implementation of HTM. Arbitrary blocks of code can be executed as transactions, which either commit (and appear to take place instantaneously) or abort (and have no effect on the contents of shared memory). A transaction is started by invoking txBegin, is committed by invoking txEnd, and can be aborted by invoking txAbort. Intel's implementation of HTM is best-effort, which means that the system can force transactions to abort at any time, and no transactions are ever guaranteed to commit. Each time a transaction aborts, the hardware provides a reason why the abort occurred. Two reasons are of particular interest. Conflict aborts occur when two processes contend on the same cache-line. Since a cache-line contains multiple machine words, conflict aborts can occur even if two processes never contend on the same memory location. Capacity aborts occur when a transaction exhausts some shared resource within the HTM system. Intuitively, this occurs when a transaction accesses too many memory locations. (In reality, capacity aborts also occur for a variety of complex reasons that make it difficult to predict when they will occur.)

## **3 BACKGROUND**

The LLX and SCX primitives. The load-link extended (LLX) and store-conditional extended (SCX) primitives are multi-word generalizations of the well-known load-link (LL) and store-conditional (SC), and they have been implemented from single-word CAS [6]. LLX and SCX operate on Data-records, each of which consists of a fixed number of mutable fields (which can change), and a fixed number of immutable fields (which cannot).

LLX(r) attempts to take a snapshot of the mutable fields of a Data-record r. If it is concurrent with an SCX involving r, it may return FAIL, instead. Individual fields of a Data-record can also be read directly. An SCX(V, R, fld, new) takes as its arguments a sequence V of Data-records, a subsequence R of V, a pointer fld to a mutable field of one Data-record in V, and a new value new for that field. The SCX tries to atomically store the value new in



Figure 2: Example of the tree update template.

the field that *fld* points to and *finalize* each Data-record in *R*. Once a Data-record is finalized, its mutable fields cannot be changed by any subsequent SCX, and any LLX of the Data-record will return FINALIZED instead of a snapshot.

Before a process p invokes SCX, it must perform an LLX(r) on each Data-record r in V. For each  $r \in V$ , the last LLX(r) performed by p prior to the SCX is said to be *linked* to the SCX, and this linked LLX must return a snapshot of r (not FAIL or FINALIZED). An SCX(V, R, fld, new) by a process modifies the data structure and returns TRUE (in which case we say it *succeeds*) only if no Data-record r in V has changed since its linked LLX(r); otherwise the SCX fails and returns FALSE. Although LLX and SCX can fail, their failures are limited in such a way that they can be used to build data structures with lock-free progress. See [6] for a more formal specification.

We briefly describe the implementation of SCX. Each Data-record is augmented with two fields: *marked* and *info*. Each SCX(V, R, *fld*, *new*) starts by creating an SCX-record D, which contains all of the information necessary to perform the SCX, and then completes the SCX by invoking a function called HELP, and passing D as its argument. Invocations of SCX synchronize with one another by taking a special kind of lock on each Data-record in V. These locks grant exclusive access to an SCX *operation*, rather than to a *process*. An SCX *S* locks a Data-record u by using CAS to store a pointer to its SCX-record in *u.info*. Suppose *S* successfully locks all Data-records in its *V* sequence. Then, *S* finalizes each Data-record  $u \in R$  by setting *u.marked* := TRUE, and releases its locks on all *other* Data-records (leaving finalized Data-records permanently locked). Whenever a process encounters a (non-finalized) node locked for *S*, it invokes HELP(D) to help *S* complete and release its locks.

Observe that SCX can only change a single value in a Data-record (and finalize a sequence of Data-records) atomically. Thus, to implement an *operation* that changes multiple fields, one must create *new* Data-records that contain the desired changes, and use SCX to change *one* pointer to replace the old Data-records.

The tree update template. The tree update template implements lock-free updates that atomically replace an old connected subgraph R of a down-tree by a new connected subgraph N (as shown in Figure 2). Such an update can implement any change to the tree, such as an insertion into a BST or a rotation in a balanced tree. The old subgraph includes all nodes with a field to be modified. The new subgraph may have pointers to nodes in the old tree. Since every

node in a down-tree has indegree one, the update can be performed by changing a single child pointer of some node *parent*. However, problems could arise if a concurrent operation changes the part of the tree being updated. For example, nodes in the old subgraph, or even *parent*, could be removed from the tree before *parent*'s child pointer is changed. The template takes care of the process coordination required to prevent such problems.

Each tree node is represented by a Data-record with a fixed number of child pointers as its mutable fields. Each child pointer either points to a Data-record or contains NIL (denoted by  $-\infty$  in our figures). Any other data in the node is stored in immutable fields. Thus, if an update must change some of this data, it makes a new copy of the node with the updated data. There is a Data-record *entry* which acts as the entry point to the data structure and is never deleted.

At a high level, an update that follows the template proceeds in two phases: the *search phase* and the *update phase*. In the search phase, the update searches for a location where it should occur. Then, in the update phase, the update performs LLXs on a connected subgraph of nodes in the tree, including *parent* and the set *R* of nodes to be removed from the tree. Next, it decides whether the tree should be modified, and, if so, creates a new subgraph of nodes and performs an SCX that atomically changes a child pointer, as shown in Figure 2, and finalizes any nodes in *R*. See [7] for further details.

# 4 ACCELERATED TEMPLATE IMPLEMENTATIONS

**HTM-based** LLX **and** SCX. In the following, we use  $SCX_O$  and  $LLX_O$  to refer to the original lock-free implementation of LLX and SCX. We give an implementation of SCX that uses an HTM-based fast path called  $SCX_{HTM}$ , and  $SCX_O$  as its fallback path. Hardware transactions are instrumented so they can run concurrently with processes executing  $SCX_O$ . This algorithm guarantees lock-freedom and achieves a high degree of concurrency. Pseudocode appears in Figure 3. At a high level, an  $SCX_{HTM}$  by a process *p* starts a transaction, then attempts to perform a highly optimized version of  $SCX_O$ . Each time a transaction executed by *p* aborts, control jumps to the onAbort label, at the beginning of the SCX procedure. If a process *explicitly* aborts a transaction at line 19, then SCX returns FALSE at line 4. Each process has a budget *AttemptLimit* that specifies how many times it will attempt hardware transactions before it will fall back to executing  $SCX_O$ .

In SCX<sub>0</sub>, SCX-records are used (1) to facilitate helping, and (2) to lock Data-records and detect changes to them. In particular, SCX<sub>0</sub> guarantees the following property. **P1**: between any two changes to (the user-defined fields of) a Data-record u, a new SCX-record pointer is stored in *u.info*. However, SCX<sub>HTM</sub> does not create SCX-records. In a transactional setting, helping causes unnecessary aborts, since executing a transaction that performs the same work as a running transaction will cause at least one (and probably both) to abort. Helping in transactions is also *not necessary* to guarantee progress, since progress is guaranteed by the fallback path. So, to preserve property P1, we give each process p a *tagged sequence number tseq* bit. The *tag* bit is the least significant bit. On modern systems where pointers are word aligned, the least significant bit in a pointer is always zero. Thus, the tag bit allows a process

|                                                                                    | 12 | $SCX_{HTM}(V, R, fld, new)$ by process p                                                |
|------------------------------------------------------------------------------------|----|-----------------------------------------------------------------------------------------|
| Private variable for process p: $attempts_p, tagseq_p$                             | 13 | Let <i>infoFields</i> be a pointer to a table in <i>p</i> 's private memory containing, |
|                                                                                    |    | for each $r$ in $V$ , the value of <i>r.info</i> read by $p$ 's last LLX $(r)$          |
| 2  SCX(V, R, fld, new)  by process  p                                              | 14 | Let <i>old</i> be the value for <i>fld</i> returned by $p$ 's last LLX( $r$ )           |
| <sup>3</sup> onAbort: ▷ jump here on transaction abort                             |    |                                                                                         |
| 4 <b>if</b> we jumped here after an <b>explicit</b> abort <b>then return</b> FALSE | 15 | Begin hardware transaction                                                              |
| 5 <b>if</b> $attempts_p < AttemptLimit$ <b>then</b>                                | 16 | $tagseq_p := tagseq_p + 2^{\lceil \log n \rceil}$                                       |
| - F -                                                                              | 17 | for each $r \in V$ do                                                                   |
| $attempts_p := attempts_p + 1$                                                     | 18 | Let <i>rinfo</i> be the pointer indexed by r in <i>infoFields</i>                       |
| 7 $retval := SCX_{HTM}(V, R, fld, new) \triangleright Fast$                        | 19 | <b>if</b> $r.info \neq rinfo$ <b>then</b> Abort hardware transaction (explicitly)       |
| 8 else                                                                             |    |                                                                                         |
| 9 $retval := SCX_O(V, R, fld, new) \triangleright$ Fallback                        | 20 | for each $r \in V$ do $r.info := tagseq_p$                                              |
|                                                                                    | 21 | for each $r \in R$ do <i>r.marked</i> := TRUE                                           |
| 10 <b>if</b> retval <b>then</b> $attempts_p := 0$                                  | 22 | write <i>new</i> to the field pointed to by <i>fld</i>                                  |
| 11 return retval                                                                   | 23 | Commit hardware transaction                                                             |
|                                                                                    | 24 | return True                                                                             |

Figure 3: HTM-based implementation of SCX.

to distinguish between a tagged sequence number and a pointer. In  $SCX_{HTM}$ , instead of having *p* create a new SCX-record and store pointers to it in Data-records to lock them, *p* increments its sequence number in  $tseq_p$  and stores  $tseq_p$  in Data-records. Since no writes performed by a transaction *T* can be seen until it commits, it never actually needs to hold any locks. Thus, every value of  $tseq_p$  stored in a Data-record represents an unlocked value, and writing  $tseq_p$  represents *p* locking and immediately unlocking a node.

After storing  $tseq_p$  in each  $r \in V$ , SCX<sub>*HTM*</sub> finalizes each  $r \in R$  by setting *r.marked* := TRUE (mimicking the behaviour of SCX<sub>0</sub>). Then, it stores *new* in the field pointed to by *fld*, and commits. Note that eliminating the creation of SCX-records on the fast path also eliminates the need to *reclaim* any created SCX-records, which further reduces overhead.

The SCX<sub>*HTM*</sub> algorithm also necessitates a small change to LLX<sub>O</sub>, to handle tagged sequence numbers. An invocation of LLX<sub>O</sub>(r) reads a pointer *rinfo* to an SCX-record, follows *rinfo* to read one of its fields, and uses the value it reads to determine whether r is locked. However, *rinfo* may now contains a tagged sequence number, instead of a pointer to an SCX-record. So, in our modified algorithm, which we call LLX<sub>*HTM*</sub>, before a process tries to follow *rinfo*, it first checks whether *rinfo* is a tagged sequence number, and, if so, behaves as if r is unlocked.

**Correctness and progress.** Due to a lack of space, we omit the full proof of correctness and progress for our HTM-based implementation of LLX and SCX, and give a brief sketch, instead. The high-level idea is to show that one can start with LLX<sub>O</sub> and SCX<sub>O</sub>, and obtain our HTM-based implementation by applying a sequence of transformations. Intuitively, these transformations preserve the semantics of SCX and maintain backwards compatibility with SCX<sub>O</sub> so that the transformed versions can be run concurrently with invocations of SCX<sub>O</sub>. More formally, for each execution of a transformed algorithm, there is an execution of the original algorithm in which: the same operations are performed, they are linearized in the same order, and they return the same results. After arguing correctness and progress for the HTM-based implementation of LLX and SCX<sub>O</sub> follows immediately from the original proof for LLX<sub>O</sub> and SCX<sub>O</sub>.

**The 2-path con algorithm.** We now use our HTM-based LLX and SCX to obtain an HTM-based implementation of a template operation O. The fallback path for O is simply a lock-free implementation of O using LLX<sub>O</sub> and SCX<sub>O</sub>. The fast path for O starts a transaction, then performs the same code as the fallback path, except that it uses the HTM-based LLX and SCX. Since the *entire operation* is performed inside a transaction, we can optimize the invocations of SCX<sub>HTM</sub> that are performed by O as follows. Lines 15 and 23 can be eliminated, since SCX<sub>HTM</sub> is already running inside a large transaction will abort due to a data conflict if *r.info* changes after it is read in the (preceding) linked invocation of LLX(r), and before the transaction commits. The proof of correctness and progress for 2-*path con* follows immediately from the proof of the original template and the proof of the HTM-based LLX and SCX implementation.

Note that it is not necessary to perform the entire operation in a single transaction. In Section 6, we describe a modification that allows a read-only *searching* prefix of the operation to be performed before the transaction begins.

**The** *TLE* **algorithm.** To obtain a TLE implementation of an operation *O*, we simply take *sequential code* for *O* and wrap it in a transaction on the fast path. The fallback path acquires and releases a global lock instead of starting and committing a transaction, but otherwise executes the same code as the fast path. To prevent the fast path and fallback path from running concurrently, transactions on the fast path start by reading the lock state and aborting if it is held. An operation attempts to run on the fast path up to *AttemptLimit* times (waiting for the lock to be free before each attempt) before resorting to the fallback path. The correctness of TLE is trivial. Note, however, that TLE only satisfies deadlock-freedom (not lock-freedom).

**The** 2-path  $\overline{con}$  algorithm. We can improve concurrency on the fallback path and guarantee lock-freedom by using a lock-free algorithm on the fallback path, and a global fetch-and-increment object F instead of a global lock. Consider an operation O implemented with the tree update template. We describe a 2-path  $\overline{con}$  implementation of O. The fallback path increments F, then executes the lock-free tree update template implementation of O, and finally decrements F. The fast path executes sequential code for O in a transaction. To

prevent the fast path and fallback path from running concurrently, transactions on the fast path start by reading F and aborting if it is nonzero. An operation attempts to run on the fast path up to *At*-*temptLimit* times (waiting for F to become zero before each attempt) before resorting to the fallback path.

Recall that operations implemented using the tree update template can only change a single pointer atomically (and can perform multiple changes atomically only by creating a connected set of new nodes that reflect the desired changes). Thus, each operation on the fallback path simply creates new nodes and changes a single pointer (and assumes that all other operations also behave this way). However, since the fast path and fallback path do not run concurrently, the fallback path does *not* impose this requirement on the fast path. Consequently, the fast path can make (multiple) direct changes to nodes. Unfortunately, as we described above, this algorithm can still suffer from concurrency bottlenecks.

The 3-path algorithm. One can think of the 3-path algorithm as a kind of hybrid between the 2-path con and 2-path con algorithms that obtains their benefits while avoiding their downsides. Consider an operation O implemented with the tree update template. We describe a 3-path implementation of O. As in 2-path con, there is a global fetch-and-increment object F, and the fast path executes sequential code for O in a transaction. The middle path and fallback path behave like the fast path and fallback path in the 2-path con algorithm, respectively. Each time an operation begins (resp., stops) executing on the fallback path, it increments (resp., decrements) F. (If the scalability of fetch-and-increment is of concern, then a scalable non-zero indicator object [15] can be used, instead.) This prevents the fast and fallback paths from running concurrently. As we described above, operations begin on the fast path, and move to the middle path after *FastLimit* attempts, or if they see  $F \neq 0$ . Operations move from the middle path to the fallback path after MiddleLimit attempts. Note that an operation never waits for the fallback path to become empty-it simply moves to the middle path.

Since the fast path and fallback path do not run concurrently, the fallback path does not impose any overhead on the fast path, except checking if F = 0 (offering low overhead for light workloads). Additionally, when there are operations running on the fallback path, hardware transactions can continue to run on the middle path (offering high concurrency for heavy workloads).

**Correctness and progress for 3-path.** The correctness argument is straightforward. The goal is to prove that all template operations are linearizable, regardless of which path they execute on. Recall that the fallback path and middle path behave like the fast path and fallback path in 2-path con. It follows that, if there are no operations on the fast path, then the correctness of operations on the middle path and fallback path is immediate from the correctness of 2-path con. Of course, whenever there is an operation executing on the fallback path, no operation can run on the fast path. Since operations on the fast path and middle path run in transactions, they are atomic, and any conflicts between the fast path and middle path are handled automatically by the HTM system. Therefore, template operations are linearizable.

The progress argument for 3-path relies on three assumptions.

- **A1.** The sequential code for an operation executed on the fast path must terminate after a finite number of steps if it is run on a static tree (which does not change during the operation).
- **A2.** In an operation executed on the middle path or fallback path, the search phase must terminate after a finite number of steps if it is run on a static tree.
- **A3.** In an operation executed on the middle path or fallback path, the update phase can modify only a finite number of nodes.

We give a simple proof that 3-path satisfies lock-freedom. To obtain a contradiction, suppose there is an execution in which after some time t, some process takes infinitely many steps, but no operation terminates. Thus, the tree does not change after t. We first argue that no process takes infinitely many steps in a transaction T. If T occurs on the fast path, then A1 guarantees it will terminate. If T occurs on the middle path, then A2 and A3 guarantee that it will terminate. Therefore, eventually, processes only take steps on the fallback path. Progress then follows from the fact that the original tree update template implementation (our fallback path) is lock-free.

## **5 EXPERIMENTS**

We used two different Intel systems for our experiments: a dualsocket 12-core E7-4830 v3 with hyperthreading for a total of 48 hardware threads (running Ubuntu 14.04LTS), and a dual-socket 18-core E5-2699 v3 with hyperthreading for a total of 72 hardware threads (running Ubuntu 15.04). Each machine had 128GB of RAM. We used the scalable thread-caching allocator (tcmalloc) from the Google perftools library. All code was compiled on GCC 4.8+ with arguments -std=c++0x -02 -mcx16. (Using the higher optimization level -03 did not significantly improve performance for any algorithm, and decreased performance for some algorithms.) On both machines, we *pinned* threads such that we saturate one socket before scheduling any threads on the other.

**Data structures implemented with the template.** We used two data structures to study the performance of our accelerated template implementations: an unbalanced BST, and a relaxed (a, b)-tree. The BST is similar to the chromatic tree in [7], but with no rebalancing. The relaxed (a, b)-tree is a concurrency-friendly generalization of a B-tree that is based on the work of Jacobson and Larsen [18]. Nodes contain up to *b* keys, and, when there are no ongoing updates, they contain at least *a* keys (where  $b \ge 2a - 1$ ). In our experiments, we fix a = 6 and b = 16. With b = 16, each node occupies four consecutive cache lines. Since  $b \ge 2a - 1$ , with b = 16, we must have  $a \le 8$ . We chose to make *a* slightly smaller than 8 in order to exploit a performance tradeoff: a smaller minimum degree may slightly increase depth, but decreases the number of rebalancing steps that are needed to maintain balance.

**Template implementations studied.** We implemented each of data structures with four different template implementations: *3-path*, *2-path con*, *TLE* and the original template implementation, which we call *Non-HTM*. (*2-path con* is omitted, since it performed similarly to TLE, and cluttered the graphs.) The *2-path con* and *TLE* implementations perform up to 20 attempts on the fast path before resorting to the fallback path. *3-path* performs up to 10 attempts (each) on the fast path and middle path. We implemented memory reclamation using DEBRA [4], an epoch based reclamation scheme.



Figure 4: Results (48-thread system) showing throughput (operations per second) versus the number of concurrent processes.

#### 5.1 Light vs. Heavy workloads

Methodology. We study two workloads: in light, n processes perform updates (50% insertion and 50% deletion), and in heavy, n-1processes perform updates, and one thread performs 100% range queries (RQs). For each workload and data structure implementation, and a variety of thread counts, we perform a set of five randomized trials. In each trial, n processes perform either updates or RQs (as appropriate for the workload) for one second, and counted the number of completed operations. Updates are performed on keys drawn uniformly randomly from a fixed key range [0, K). RQs are performed on ranges [lo, lo + s) where lo is uniformly random in [0, K)and s is chosen, according to a probability distribution described below, from [1, 1000] for the BST and [1, 10000] for the (a, b)-tree. (We found that nodes in the (a, b)-tree contained approximately 10 keys, on average, so the respective maximum values of s for the BST and (a,b)-tree resulted in range queries returning keys from approximately the same number of nodes in both data structures.) To ensure that we are measuring steady-state performance, at the start of each trial, the data structure is prefilled by having threads perform 50% insertions and 50% deletions on uniform keys until the data structure contains approximately half of the keys in [0, K).

We verified the correctness of each data structure after each trial by computing *key-sum hashes*. Each thread maintains the sum of all keys it successfully inserts, minus the sum of all keys it successfully deletes. At the end of the trial, the total of these sums over all threads must match the sum of keys in the tree.

**Probability distribution of** *s*. We chose the probability distribution of *s* to produce many small RQs, and a smaller number of very large ones. To achieve this, we chose *s* to be  $\lfloor x^2S \rfloor + 1$ , where *x* is a uniform real number in [0, 1), and *S* = 1000 for the BST and *S* = 10000 for the (*a*,*b*)-tree. By squaring *x*, we bias the uniform distribution towards zero, creating a larger number of small RQs.

**Results.** We briefly discuss the results from the 48 thread machine, which appear in Figure 4. The BST and the relaxed (a, b)-tree behave



Figure 5: Results (72 thread system) showing throughput (operations per second) versus the number of concurrent processes.

fairly similarly. Since the (*a*, *b*)-tree has large nodes, it benefits much more from a low-overhead fast path (in *TLE* or *3-path*) which can avoid creating new nodes during updates. In the light workloads, *3-path* performs significantly better than *2-path* con (which has more overhead) and approximately as well as *TLE*. On average, the *3-path* algorithms completed 2.1x as many operations as their non-HTM counterparts (and with 48 concurrent processes, this increases to 3.0x, on average). In the heavy workloads, *3-path* significantly outperforms *TLE* (completing 2.0x as many operations, on average), which suffers from *excessive waiting*. Interestingly, *3-path* is also significantly faster than *2-path con* in the heavy workloads. This is because, even though RQs are always being performed, some RQs can succeed on the fast path, so many update operations can still run on the fast path in *3-path*, where they incur much less overhead (than they would in *2-path con*).

Results from the 72-thread machine appear in Figure 5. There, *3-path* shows an even larger performance advantage over *Non-HTM*.

#### 5.2 Code path usage and abort rates

To gain further insight into the behaviour of our accelerated template implementations, we gathered some additional metrics about the experiments described above. Here, we only describe results from the 48-thread Intel machine. (Results from the 72-thread Intel machine were similar.)

**Operations completed on each path.** We started by measuring how often operations completed successfully on each execution path. This revealed that operations almost always completed on the fast path. Broadly, over all thread counts, the minimum number of operations completed on the fast path in any trial was 86%, and the average over all trials was 97%.

In each trial that we performed with 48 concurrent threads, at least 96% of operations completed on the fast path, *even in the workloads with RQs*. Recall that RQs are the operations most likely to run on the fallback path, and they are only performed by a single thread, so they



Figure 6: Summary of how many transactions commit vs. how many abort in our experiments on the 48-thread Intel machine.

make up a relatively small fraction of the total operations performed in a trial. In fact, our measurements showed that the number of operations which completed on the fallback path was never more than a fraction of one percent in our trials with 48 concurrent threads.

In light of this, it might be somewhat surprising that the performance of TLE was so much worse in heavy workloads than light ones. However, the cost of serializing threads is high, and this cost is compounded by the fact that the operations which complete on the fallback path are often long-running. Of course, in workloads where more operations run on the fallback path, the advantage of improving concurrency between paths would be even greater.

**Commit/abort rates.** We also measured how many transactions committed and how many aborted, on each execution path, in each of our trials. Figure 6 summarizes the average commit/abort rates for each data structure, template implementation and workload. Since nearly all operations completed on the fast path, we decided not to distinguish between the commit/abort rate on the fast path and the commit/abort rate on the middle path.

#### 5.3 Comparing with hybrid transactional memory

*Hybrid transactional memory* (hybrid TM) combines hardware and software transactions to hide the limitations of HTM and guarantee progress. This offers an alternative way of using HTM to implement concurrent data structures. Note, however, that state of the art hybrid TMs use locks. So, **they cannot be used to implement lock-free data structures.** Regardless, to get an idea of how such implementations would perform, relative to our accelerated template implementations, we implemented the unbalanced BST using Hybrid NOrec, which is arguably the fastest hybrid TM implementation with readily available code [27].

If we were to use a precompiled library implementation of Hybrid NOrec, then the unbalanced BST algorithm would have to perform a library function call for *each read and write to shared memory*,



Figure 7: Results showing throughput (operations per second) versus number of processes for an unbalanced BST implemented with different tree update template algorithms, and with the hybrid TM algorithm *Hybrid NOrec*.

which would incur significant overhead. So, we directly compiled the code for Hybrid NOrec into the code for the BST, allowing the compiler to inline the Hybrid NOrec functions for reading and writing from shared memory into our BST code, eliminating this overhead. Of course, if one intended to use hybrid TM in practice (and not in a research prototype), one would use a precompiled library, with all of the requisite overhead. Thus, the following results are quite charitable towards hybrid TMs.

We implemented the BST using Hybrid NOrec by wrapping sequential code for the BST operations in transactions, and manually replacing each read from (resp., write to) shared memory with a read (resp., write) operation provided by Hybrid NOrec. Figure 7 compares the performance of the resulting implementation to the other BST implementations discussed in Section 5.

The BST implemented with Hybrid NOrec performs relatively well with up to six processes. However, beyond six processes, it experiences severe negative scaling. The negative scaling occurs because Hybrid NOrec increments a global counter in each updating transaction (i.e., each transaction that performs at least one write). This global contention hotspot in updating transactions causes many transactions to abort, simply because they contend on the global counter (and not because they conflict on any data in the tree). However, even without this bottleneck, Hybrid NOrec would still perform poorly in heavy workloads, since it incurs very high instrumentation overhead for software transactions (which must acquire locks, perform repeated validation of read-sets, maintain numerous auxiliary data structures for read-sets and write-sets, and so on). Note that this problem is not unique to Hybrid NOrec, as every hybrid TM must use a software TM as its fallback path in order to guarantee progress. In contrast, in our template implementations, the software-only fallback path is a fast lock-free algorithm.

# 6 MODIFICATIONS FOR PERFORMING SEARCHES OUTSIDE OF TRANSACTIONS

In this section, we describe how the *3-path* implementations of the unbalanced BST and relaxed (a, b)-tree can be modified so that each operation attempt on the fast path or middle path performs its search phase *before* starting a transaction (and only performs its update phase in a transaction). (The same technique also applies to the *2-path con* implementations.) First, note that the lock-free search procedure for each of these data structures is actually a standard, sequential search procedure. Consequently, a simple sequential search procedure will return the correct result, regardless of whether it is performed inside a transaction. (Generally, whenever we produce a *3-path* implementation starting from a lock-free fallback path, we will have access to a correct non-transactional search procedure.)

The difficulty is that, when an operation starts a transaction and performs its update phase, it may be working on a part of the tree that was deleted by another operation. One can imagine an operation  $O_d$  that deletes an entire subtree, and an operation  $O_i$  that inserts a node into that subtree. If the search phase of  $O_i$  is performed, then  $O_d$  is performed, then the update phase of  $O_i$  is performed, then  $O_i$  may erroneously insert a node into the deleted subtree.

We fix this problem as follows. Whenever an operation O on the fast path or middle path removes a node from the tree, it sets a *marked* bit in the node (just like operations on the fallback path do). Whenever O first accesses a node u in its transaction, it checks whether u has its *marked* bit set, and, if so, aborts immediately. This way, O's transaction will commit only if every node that it accessed is in the tree.

We found that this modification yielded small performance improvements (on the order of 5-10%) in our experiments. The reason this improves performance is that fewer memory locations are tracked by the HTM system, which results in fewer capacity aborts. We briefly discuss why the performance benefit is small in our experiments. The relaxed (a,b)-tree has a very small height, because it is balanced, and its nodes contain many keys. The BST also has a fairly small height (although it is considerably taller than the relaxed (a,b)-tree), because processes in our experiments perform insertions and deletions on uniformly random keys, which leads to trees of logarithmic height with high probability. So, in each case, the sequence of nodes visited by searches is relatively small, and is fairly unlikely to cause capacity aborts.

The performance benefit associated with this modification will be greater for data structures, operations or workloads in which an operation's search phase will access a large number of nodes. Additionally, IBM's HTM implementation in their POWER8 processors is far more prone to capacity aborts than Intel's implementation, since a transaction *will* abort if it accesses more than 64 different cache lines [24]. (In contrast, in Intel's implementation, a transaction can potentially commit after accessing tens of thousands of cache lines.) Thus, this modification could lead to significantly better performance on POWER8 processors.

## 7 RELATED WORK

Hybrid TMs share some similarities to our work, since they all feature multiple execution paths. The first hybrid TM algorithms allowed HTM and STM transactions to run concurrently [10, 19].

Hybrid NOrec [9] and Reduced hardware NOrec [23] are hybrid TMs that both use global locks on the fallback path, eliminating any concurrency. We discuss two additional hybrid TMs, Phased TM [20] (PhTM) and Invyswell [8], in more detail.

PhTM alternates between five phases: HTM-only, STM-only, concurrent HTM/STM, and two global locking phases. Roughly speaking, PhTM's HTM-only phase corresponds to our uninstrumented fast path, and its concurrent HTM/STM phase corresponds to our middle HTM and fallback paths. However, their STM-only phase (which allows no concurrent hardware transactions) and global locking phases (which allow no concurrency) have no analogue in our approach. In heavy workloads, PhTM must oscillate between its HTM-only and concurrent HTM/STM phases to maximize the performance benefit it gets from HTM. When changing phases, PhTM typically waits until all in-progress transactions complete before allowing transactions to begin in the new mode. Thus, after a phase change has begun, and before the next phase has begun, there is a window during which new transactions must wait (reducing performance). One can also think of our three path approach as proceeding in two phases: one with concurrent fast/middle transactions and one with concurrent middle/fallback transactions. However, in our approach, "phase changes" do not introduce any waiting, and there is always concurrency between two execution paths.

Invyswell is closest to our three path approach. At a high level, it features an HTM middle path and STM slow path that can run concurrently (sometimes), and an HTM fast path that can run concurrently with the middle path (sometimes) but not the slow path, and two global locking fallback paths (that prevent any concurrency). Invyswell is more complicated than our approach, and has numerous restrictions on when transactions can run concurrently. Our three path methodology does not have these restrictions. The HTM fast path also uses an optimization called lazy subscription. It has been shown that lazy subscription can cause opacity to be violated, which can lead to data corruption or program crashes [11].

Hybrid TM is very general, and it pays for its generality with high overhead. Consequently, data structure designers can extract far better performance for library code by using more specialized techniques. Additionally, we stress that state of the art hybrid TMs use locks, so they cannot be used to produce lock-free data structures.

Different options for concurrency have recently begun to be explored in the context of TLE. Refined TLE [12] and Amalgamated TLE [1] both improve the concurrency of TLE when a process is on the fallback path by allowing HTM transactions to run concurrently with a *single process* on the fallback path. Both of these approaches still serialize processes on the fallback path. They also use locks, so they cannot be used to produce lock-free data structures.

Timnat, Herlihy and Petrank [28] proposed using a strong synchronization primitive called *multiword compare-and-swap* (*k*-CAS) to obtain fast HTM algorithms. They showed how to take an algorithm implemented using *k*-CAS and produce a two-path implementation that allows concurrency between the fast and fallback paths. One of their approaches used a lock-free implementation of *k*-CAS on the fallback path, and an HTM-based implementation of *k*-CAS on the fast path. They also experimented with two-path implementations that do not allow concurrency between paths, and found that allowing concurrency between the fast path and fallback path introduced significant overhead. Makreshanski, Levandoski and Stutsman [22] also independently proposed using HTM-based *k*-CAS in the context of databases.

Liu, Zhou and Spear [21] proposed a methodology for accelerating concurrent data structures using HTM, and demonstrated it on several lock-free data structures. Their methodology uses an HTMbased fast path and a non-transactional fallback path. The fast path implementation of an operation is obtained by encapsulating part (or all) of the operation in a transaction, and then applying sequential optimizations to the transactional code to improve performance. Since the optimizations do not change the code's logic, the resulting fast path implements the same logic as the fallback path, so both paths can run concurrently. Consequently, the fallback path imposes overhead on the fast path.

Some of the optimizations presented in that paper are similar to some optimizations in our HTM-based implementation of LLX and SCX. For instance, when they applied their methodology to the lock-free unbalanced BST of Ellen et al. [14], they observed that helping can be avoided on the fast path, and that the descriptors which are normally created to facilitate helping can be replaced by a small number of statically allocated descriptors. However, they did not give details on exactly how these optimizations work, and did not give correctness arguments for them. In contrast, our optimizations are applied to a more complex algorithm, and are proved correct.

Multiversion concurrency control (MVCC) is another way to implement range queries efficiently [2, 3]. At a high level, it involves maintaining multiple copies of data to allow read-only transactions to see a consistent view of memory and serialize even in the presence of concurrent modifications. However, our approach could also be applied to operations that *modify* a range of keys, so it is more general than MVCC.

## 8 CONCLUDING REMARKS

In this work, we explored the design space for HTM-based implementations of the tree update template of Brown et al. and presented four accelerated implementations. We discussed performance issues affecting HTM-based algorithms with two execution paths, and developed an approach that avoids them by using three paths. We used our template implementations to accelerate two different lock-free data structures, and performed experiments that showed significant performance improvements over several different workloads. This makes our implementations an attractive option for producing fast concurrent data structures for inclusion in libraries, where performance is critical.

#### REFERENCES

- Y. Afek, A. Matveev, O. R. Moll, and N. Shavit. Amalgamated lock-elision. In Distributed Computing: 29th International Symposium, DISC 2015, Tokyo, Japan, October 7-9, 2015, Proceedings, pages 309–324. Springer Berlin Heidelberg, Berlin, Heidelberg, 2015.
- [2] H. Attiya and E. Hillel. A single-version stm that is multi-versioned permissive. *Theory of Computing Systems*, 51(4):425–446, 2012.
- [3] P. A. Bernstein and N. Goodman. Multiversion concurrency control-theory and algorithms. ACM Transactions on Database Systems (TODS), 8(4):465–483, 1983.
- [4] T. Brown. Reclaiming memory for lock-free data structures: There has to be a better way. In Proceedings of the 2015 ACM Symposium on Principles of Distributed Computing, PODC '15, pages 261–270, 2015.
- [5] T. Brown. Techniques for Constructing Efficient Data Structures. PhD thesis, University of Toronto, 2017.

- [6] T. Brown, F. Ellen, and E. Ruppert. Pragmatic primitives for non-blocking data structures. In Proceedings of the 2013 ACM Symposium on Principles of Distributed Computing, PODC '13, pages 13–22, 2013.
- [7] T. Brown, F. Ellen, and E. Ruppert. A general technique for non-blocking trees. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '14, pages 329–342, 2014.
- [8] I. Calciu, J. Gottschlich, T. Shpeisman, G. Pokam, and M. Herlihy. Invyswell: a hybrid transactional memory for haswell's restricted transactional memory. In *Proceedings of the 23rd international conference on Parallel architectures and compilation*, pages 187–200. ACM, 2014.
- [9] L. Dalessandro, F. Carouge, S. White, Y. Lev, M. Moir, M. L. Scott, and M. F. Spear. Hybrid norec: A case study in the effectiveness of best effort hardware transactional memory. In *Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems*, ASPLOS XVI, pages 39–52, New York, NY, USA, 2011. ACM.
- [10] P. Damron, A. Fedorova, Y. Lev, V. Luchangco, M. Moir, and D. Nussbaum. Hybrid transactional memory. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XII, pages 336–346, New York, NY, USA, 2006. ACM.
- [11] D. Dice, T. Harris, A. Kogan, Y. Lev, and M. Moir. Pitfalls of lazy subscription. In Proceedings of the 6th Workshop on the Theory of Transactional Memory, Paris, France, 2014.
- [12] D. Dice, A. Kogan, and Y. Lev. Refined transactional lock elision. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '16, pages 19:1–19:12, New York, NY, USA, 2016. ACM.
- [13] D. Dice, Y. Lev, M. Moir, and D. Nussbaum. Early experience with a commercial hardware transactional memory implementation. In *Proceedings of the 14th Int. Conf. on Architectural Support for Programming Languages and Operating Systems*, ASPLOS XIV, pages 157–168, New York, NY, USA, 2009.
- [14] F. Ellen, P. Fatourou, E. Ruppert, and F. van Breugel. Non-blocking binary search trees. In *Proceedings of the 29th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing*, PODC '10, pages 131–140, 2010. Full version available as Technical Report CSE-2010-04, York University.
- [15] F. Ellen, Y. Lev, V. Luchangco, and M. Moir. Snzi: Scalable nonzero indicators. In Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing, pages 13–22. ACM, 2007.
- [16] T. L. Harris, K. Fraser, and I. A. Pratt. A practical multi-word compare-and-swap operation. In *Proceedings of the 16th International Conference on Distributed Computing*, DISC '02, pages 265–279, 2002.
- [17] M. He and M. Li. Deletion without rebalancing in non-blocking binary search trees. In Proceedings of the 20th International Conference on Principles of Distributed Systems, 2016.
- [18] L. Jacobsen and K. S. Larsen. Variants of (a, b)-trees with relaxed balance. Int. J. Found. Comput. Sci., 12(4):455–478, 2001.
- [19] S. Kumar, M. Chu, C. J. Hughes, P. Kundu, and A. Nguyen. Hybrid transactional memory. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '06, pages 209–220, New York, NY, USA, 2006. ACM.
- [20] Y. Lev, M. Moir, and D. Nussbaum. Phtm: Phased transactional memory. In Workshop on Transactional Computing (Transact), 2007.
- [21] Y. Liu, T. Zhou, and M. Spear. Transactional acceleration of concurrent data structures. In Proc. of 27th ACM Sym. on Parallelism in Algorithms and Arch., SPAA '15, pages 244–253, New York, NY, USA, 2015. ACM.
- [22] D. Makreshanski, J. Levandoski, and R. Stutsman. To lock, swap, or elide: on the interplay of hardware transactional memory and lock-free indexing. *Proceedings* of the VLDB Endowment, 8(11):1298–1309, 2015.
- [23] A. Matveev and N. Shavit. Reduced hardware norec: A safe and scalable hybrid transactional memory. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '15, pages 59–71, New York, NY, USA, 2015. ACM.
- [24] A. T. Nguyen. Investigation of Hardware Transactional Memory. PhD thesis, Massachusetts Institute of Technology, 2015.
- [25] R. Rajwar and J. R. Goodman. Speculative lock elision: Enabling highly concurrent multithreaded execution. In *Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture*, pages 294–305. IEEE Computer Society, 2001.
- [26] R. Rajwar and J. R. Goodman. Transactional lock-free execution of lock-based programs. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS X, pages 5–17, New York, NY, USA, 2002. ACM.
- [27] T. Riegel, P. Marlier, M. Nowack, P. Felber, and C. Fetzer. Optimizing hybrid transactional memory: The importance of nonspeculative operations. In *Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures*, pages 53–64. ACM, 2011.
- [28] S. Timnat, M. Herlihy, and E. Petrank. A practical transactional memory interface. In Euro-Par 2015: Parallel Processing, pages 387–401. Springer, 2015.