# 0x04 A glance into the kernel: turnstile

Sunday, March 15, 2020

/sys/sys/proc.h

# Thread and Turnstile

line 228

A thread may block for a short, medium, or long period of time depending on the reason it needs to wait. A short wait occurs when a thread needs to wait for access to a lock that protects a data structure. A medium wait occurs while a thread waits for an event that is expected to occur quickly, such as waiting for data to be read from a disk. A long wait occurs when a thread is waiting for an event that will happen at an indeterminate time in the future, such as input from a user.

Short-term waits arise only from a lock request. Short-term locks include mutexes, reader-writer locks, and read-mostly locks.

A short-term lock is managed by a turnstile data structure.

# Turnstile structure

The turnstile interface is provided in sys/sys/turnstile.h, and the detailed implementation can be found in sys/kern/subr_turnstile.c.


```c
/*
 * Implementation of turnstiles used to hold queue of threads blocked on
 * non-sleepable locks.  Sleepable locks use condition variables to
 * implement their queues.  Turnstiles differ from a sleep queue in that
 * turnstile queue's are assigned to a lock held by an owning thread.  Thus,
 * when one thread is enqueued onto a turnstile, it can lend its priority
 * to the owning thread.
 *
 * We wish to avoid bloating locks with an embedded turnstile and we do not
 * want to use back-pointers in the locks for the same reason.  Thus, we
 * use a similar approach to that of Solaris 7 as described in Solaris
 * Internals by Jim Mauro and Richard McDougall.  Turnstiles are looked up
 * in a hash table based on the address of the lock.  Each entry in the
 * hash table is a linked-lists of turnstiles and is called a turnstile
 * chain.  Each chain contains a spin mutex that protects all of the
 * turnstiles in the chain.
 *
 * Each time a thread is created, a turnstile is allocated from a UMA zone
 * and attached to that thread.  When a thread blocks on a lock, if it is the
 * first thread to block, it lends its turnstile to the lock.  If the lock
 * already has a turnstile, then it gives its turnstile to the lock's
 * turnstile's free list.  When a thread is woken up, it takes a turnstile from
 * the free list if there are any other waiters.  If it is the only thread
 * blocked on the lock, then it reclaims the turnstile associated with the lock
 * and removes it from the hash table.
 */
```


And here are the constants for the hash table of turnstile chains:

```c
/*
 * Constants for the hash table of turnstile chains.  TC_SHIFT is a magic
 * number chosen because the sleep queue's use the same value for the
 * shift.  Basically, we ignore the lower 8 bits of the address.
 * TC_TABLESIZE must be a power of two for TC_MASK to work properly.
 */
#define	TC_TABLESIZE	128			/* Must be power of 2. */
#define	TC_MASK		(TC_TABLESIZE - 1)
#define	TC_SHIFT	8
#define	TC_HASH(lock)	(((uintptr_t)(lock) >> TC_SHIFT) & TC_MASK)
#define	TC_LOOKUP(lock)	&turnstile_chains[TC_HASH(lock)]
```


```c
/*
 * There are three different lists of turnstiles as follows.  The list
 * connected by ts_link entries is a per-thread list of all the turnstiles
 * attached to locks that we own.  This is used to fixup our priority when
 * a lock is released.  The other two lists use the ts_hash entries.  The
 * first of these two is the turnstile chain list that a turnstile is on
 * when it is attached to a lock.  The second list to use ts_hash is the
 * free list hung off of a turnstile that is attached to a lock.
 *
 * Each turnstile contains three lists of threads.  The two ts_blocked lists
 * are linked list of threads blocked on the turnstile's lock.  One list is
 * for exclusive waiters, and the other is for shared waiters.  The
 * ts_pending list is a linked list of threads previously awakened by
 * turnstile_signal() or turnstile_wait() that are waiting to be put on
 * the run queue.
 *
 * Locking key:
 *  c - turnstile chain lock
 *  q - td_contested lock
 */
struct turnstile {
	struct mtx ts_lock;			/* Spin lock for self. */
	struct threadqueue ts_blocked[2];	/* (c + q) Blocked threads. */
	struct threadqueue ts_pending;		/* (c) Pending threads. */
	LIST_ENTRY(turnstile) ts_hash;		/* (c) Chain and free list. */
	LIST_ENTRY(turnstile) ts_link;		/* (q) Contested locks. */
	LIST_HEAD(, turnstile) ts_free;		/* (c) Free turnstiles. */
	struct lock_object *ts_lockobj;		/* (c) Lock we reference. */
	struct thread *ts_owner;		/* (c + q) Who owns the lock. */
};
```


The most significant difference between a turnstile and a sleepqueue is that a turnstile provides a mechanism called priority propagation, which is used to handle priority inversion. Detailed information about priority inversion can be found on Wikipedia.

```c
static void
propagate_priority(struct thread *td)
{
	struct turnstile *ts, *top;
	int pri;

	pri = td->td_priority;
	top = ts = td->td_blocked;

	/*
	 * The original turnstile lock is held across the entire
	 * operation.  We only ever lock down the chain so the lock
	 * order is constant.
	 */
	for (;;) {
		td = ts->ts_owner;

		if (td == NULL) {
			/*
			 * This might be a read lock with no owner.  There's
			 * not much we can do, so just bail.
			 */
			propagate_unlock_ts(top, ts);
			return;
		}

		/*
		 * Wait for the thread lock to be stable and then only
		 * acquire if it is not the turnstile lock.
		 */
		if (td->td_lock != &ts->ts_lock) {
			propagate_unlock_ts(top, ts);
		}
		MPASS(td->td_proc != NULL);
		MPASS(td->td_proc->p_magic == P_MAGIC);

		/*
		 * If the thread is asleep, then we are probably about
		 * to deadlock.  To make debugging this easier, show
		 * backtrace of misbehaving thread and panic to not
		 * leave the kernel deadlocked.
		 */
		if (TD_IS_SLEEPING(td)) {
			printf(
		"Sleeping thread (tid %d, pid %d) owns a non-sleepable lock\n",
			    td->td_tid, td->td_proc->p_pid);
			kdb_backtrace_thread(td);
			panic("sleeping thread");
		}

		/*
		 * If this thread already has higher priority than the
		 * thread that is being blocked, we are finished.
		 */
		if (td->td_priority <= pri) {
			propagate_unlock_td(top, td);
			return;
		}

		/*
		 * Bump this thread's priority.
		 */
		sched_lend_prio(td, pri);

		/*
		 * If lock holder is actually running or on the run queue
		 * then we are done.
		 */
		if (TD_IS_RUNNING(td) || TD_ON_RUNQ(td)) {
			MPASS(td->td_blocked == NULL);
			propagate_unlock_td(top, td);
			return;
		}

#ifndef SMP
		/*
		 * For UP, we check to see if td is curthread (this shouldn't
		 * ever happen however as it would mean we are in a deadlock.)
		 */
		MPASS(td != curthread);
#endif

		/*
		 * If we aren't blocked on a lock, we should be.
		 */
		KASSERT(TD_ON_LOCK(td), (
		    "thread %d(%s):%d holds %s but isn't blocked on a lock\n",
		    td->td_tid, td->td_name, td->td_state,
		    ts->ts_lockobj->lo_name));

		/*
		 * Pick up the lock that td is blocked on.
		 */
		ts = td->td_blocked;
		MPASS(ts != NULL);
		/* Resort td on the list if needed. */
		if (!turnstile_adjust_thread(ts, td)) {
			propagate_unlock_ts(top, ts);
			return;
		}
		/* The thread lock is released as ts lock above. */
	}
}
```


Basically, the functional part of propagate_priority() is contained in the for (;;) loop. The function keeps walking along the chain of turnstiles, lending the initial thread's priority to each lower-priority lock holder that obstructs the higher-priority thread from executing. It stops either when it reaches a thread whose priority is already at least as high, which means the job is done, or with a kernel panic, which usually means we are in a deadlock. Such panics are reported through KASSERT() and its variant MPASS().

kernel
CC BY 4.0
