# 0x04 A glance into the kernel: turnstile

Sunday, March 15, 2020

/sys/sys/proc.h

# Thread and Turnstile

line 228

A thread may block for a short, medium, or long period of time depending on the reason it needs to wait. A short wait occurs when a thread needs to wait for access to a lock that protects a data structure. A medium wait occurs while a thread waits for an event that is expected to occur quickly, such as waiting for data to be read from a disk. A long wait occurs when a thread is waiting for an event that will happen at an indeterminate time in the future, such as input from a user.

Short-term waits arise only from a lock request. Short-term locks include mutexes, reader-writer locks, and read-mostly locks.

A short-term lock is managed by a turnstile data structure.

# Turnstile structure

The turnstile interface is provided in sys/sys/turnstile.h, and the detailed implementation can be found in sys/kern/subr_turnstile.c.


```c
/*
 * Implementation of turnstiles used to hold queue of threads blocked on
 * non-sleepable locks.  Sleepable locks use condition variables to
 * implement their queues.  Turnstiles differ from a sleep queue in that
 * turnstile queue's are assigned to a lock held by an owning thread.  Thus,
 * when one thread is enqueued onto a turnstile, it can lend its priority
 * to the owning thread.
 *
 * We wish to avoid bloating locks with an embedded turnstile and we do not
 * want to use back-pointers in the locks for the same reason.  Thus, we
 * use a similar approach to that of Solaris 7 as described in Solaris
 * Internals by Jim Mauro and Richard McDougall.  Turnstiles are looked up
 * in a hash table based on the address of the lock.  Each entry in the
 * hash table is a linked-lists of turnstiles and is called a turnstile
 * chain.  Each chain contains a spin mutex that protects all of the
 * turnstiles in the chain.
 *
 * Each time a thread is created, a turnstile is allocated from a UMA zone
 * and attached to that thread.  When a thread blocks on a lock, if it is the
 * first thread to block, it lends its turnstile to the lock.  If the lock
 * already has a turnstile, then it gives its turnstile to the lock's
 * turnstile's free list.  When a thread is woken up, it takes a turnstile from
 * the free list if there are any other waiters.  If it is the only thread
 * blocked on the lock, then it reclaims the turnstile associated with the lock
 * and removes it from the hash table.
 */
```


And here are the constants for the hash table of turnstile chains:

```c
/*
 * Constants for the hash table of turnstile chains.  TC_SHIFT is a magic
 * number chosen because the sleep queue's use the same value for the
 * shift.  Basically, we ignore the lower 8 bits of the address.
 * TC_TABLESIZE must be a power of two for TC_MASK to work properly.
 */
#define	TC_TABLESIZE	128			/* Must be power of 2. */
#define	TC_MASK		(TC_TABLESIZE - 1)
#define	TC_SHIFT	8
#define	TC_HASH(lock)	(((uintptr_t)(lock) >> TC_SHIFT) & TC_MASK)
#define	TC_LOOKUP(lock)	&turnstile_chains[TC_HASH(lock)]
```


```c
/*
 * There are three different lists of turnstiles as follows.  The list
 * connected by ts_link entries is a per-thread list of all the turnstiles
 * attached to locks that we own.  This is used to fixup our priority when
 * a lock is released.  The other two lists use the ts_hash entries.  The
 * first of these two is the turnstile chain list that a turnstile is on
 * when it is attached to a lock.  The second list to use ts_hash is the
 * free list hung off of a turnstile that is attached to a lock.
 *
 * Each turnstile contains three lists of threads.  The two ts_blocked lists
 * are linked list of threads blocked on the turnstile's lock.  One list is
 * for exclusive waiters, and the other is for shared waiters.  The
 * ts_pending list is a linked list of threads previously awakened by
 * turnstile_signal() or turnstile_wait() that are waiting to be put on
 * the run queue.
 *
 * Locking key:
 *  c - turnstile chain lock
 *  q - td_contested lock
 */
struct turnstile {
	struct mtx ts_lock;			/* Spin lock for self. */
	struct threadqueue ts_blocked[2];	/* (c + q) Blocked threads. */
	struct threadqueue ts_pending;		/* (c) Pending threads. */
	LIST_ENTRY(turnstile) ts_hash;		/* (c) Chain and free list. */
	LIST_ENTRY(turnstile) ts_link;		/* (q) Contested locks. */
	LIST_HEAD(, turnstile) ts_free;		/* (c) Free turnstiles. */
	struct lock_object *ts_lockobj;		/* (c) Lock we reference. */
	struct thread *ts_owner;		/* (c + q) Who owns the lock. */
};
```


The most significant difference between a turnstile and a sleepqueue is that a turnstile provides a mechanism called priority propagation, which is used to handle priority inversion. Detailed information about priority inversion can be found on Wikipedia.

```c
static void
propagate_priority(struct thread *td)
{
	struct turnstile *ts, *top;
	int pri;

	pri = td->td_priority;
	top = ts = td->td_blocked;

	/*
	 * The original turnstile lock is held across the entire
	 * operation.  We only ever lock down the chain so the lock
	 * order is constant.
	 */
	for (;;) {
		td = ts->ts_owner;

		if (td == NULL) {
			/*
			 * This might be a read lock with no owner.  There's
			 * not much we can do, so just bail.
			 */
			propagate_unlock_ts(top, ts);
			return;
		}

		/*
		 * Wait for the thread lock to be stable and then only
		 * acquire if it is not the turnstile lock.
		 */
		if (td->td_lock != &ts->ts_lock) {
			propagate_unlock_ts(top, ts);
		}
		MPASS(td->td_proc != NULL);
		MPASS(td->td_proc->p_magic == P_MAGIC);

		/*
		 * If the thread is asleep, then we are probably about
		 * to deadlock.  To make debugging this easier, show
		 * backtrace of misbehaving thread and panic to not
		 * leave the kernel deadlocked.
		 */
		if (TD_IS_SLEEPING(td)) {
			printf(
		"Sleeping thread (tid %d, pid %d) owns a non-sleepable lock\n",
			    td->td_tid, td->td_proc->p_pid);
			kdb_backtrace_thread(td);
			panic("sleeping thread");
		}

		/*
		 * If this thread already has higher priority than the
		 * thread that is being blocked, we are finished.
		 */
		if (td->td_priority <= pri) {
			propagate_unlock_td(top, td);
			return;
		}

		/*
		 * Bump this thread's priority.
		 */
		sched_lend_prio(td, pri);

		/*
		 * If lock holder is actually running or on the run queue
		 * then we are done.
		 */
		if (TD_IS_RUNNING(td) || TD_ON_RUNQ(td)) {
			MPASS(td->td_blocked == NULL);
			propagate_unlock_td(top, td);
			return;
		}

#ifndef SMP
		/*
		 * For UP, we check to see if td is curthread (this shouldn't
		 * ever happen however as it would mean we are in a deadlock.)
		 */
		MPASS(td != curthread);
#endif

		/*
		 * If we aren't blocked on a lock, we should be.
		 */
		KASSERT(TD_ON_LOCK(td), (
		    "thread %d(%s):%d holds %s but isn't blocked on a lock\n",
		    td->td_tid, td->td_name, td->td_state,
		    ts->ts_lockobj->lo_name));

		/*
		 * Pick up the lock that td is blocked on.
		 */
		ts = td->td_blocked;
		MPASS(ts != NULL);
		/* Resort td on the list if needed. */
		if (!turnstile_adjust_thread(ts, td)) {
			propagate_unlock_ts(top, ts);
			return;
		}
		/* The thread lock is released as ts lock above. */
	}
}
```


Basically, the functional part of propagate_priority() is contained in the for (;;) loop. The function keeps walking along the chain of turnstiles, lending the initial thread's priority to each lower-priority lock holder that obstructs the higher-priority thread from executing. It stops either when it reaches a thread whose priority is already at least as high, which means the job is done, or with a kernel panic, which usually means we are in a deadlock. Such panics are reported through KASSERT() and its variant MPASS().

kernel
CC BY 4.0
