spin lock の C 実装

xv6 を見て C で spin lock をどのように実装するのか理解したのでその内容を整理します。また、MIT 6.828 ではそれの代替となる lock についても解説されていたので簡単にまとめます。

spin lock
ticket lock
read write lock
MCS lock
RCU

spin lock

マルチプロセッサ、マルチスレッド間で共有するデータ構造は何も考えずに実装すると壊れたり期待しない動作をする
その簡単な対処の一つとして spin lock を使う
- lock で守られた処理はあるプロセッサ、スレッドのみが実行することを保証する
spin lock はロックが取得できるまで while ループを繰り返す
- x86 の場合 pause 命令で spin するのが良さそう
単純だけどロック待ちの人数が少なく待ち時間も短いならば十分

実装例:

From JOS and xv6.

#include <sys/types.h>

struct spinlock {
	unsigned locked;       // Is the lock held?
};

static inline u_int32_t
xchg(volatile u_int32_t *addr, u_int32_t newval) {
	u_int32_t result;

	// The + in "+m" denotes a read-modify-write operand.
	asm volatile("lock; xchgl %0, %1"
		     : "+m" (*addr), "=a" (result)
		     : "1" (newval)
		     : "cc");
	return result;
}

void spin_lock(struct spinlock *lk) {
    // The xchg is atomic.
    while (xchg(&lk->locked, 1) != 0)
        asm volatile ("pause");

    // The xchg also serializes, so we can do anything with lock from here
    //
    // __sync_synchronize() is a gcc function to tell the C compiler and the processor
    // to not move loads or stores past this point, to ensure that
    // the critical section's memory references happen after the lock is acquired.
    // But it is not necessary here as said in the above.
}

void spin_unlock(struct spinlock *lk) {
    // do something with lock until here

    // The xchg instruction is atomic (i.e. uses the "lock" prefix) with
    // respect to any other instruction which references the same memory.
    // x86 CPUs will not reorder loads/stores across locked instructions
    // (vol 3, 8.2.2). Because xchg() is implemented using asm volatile,
    // gcc will not reorder C statements across the xchg.
    xchg(&lk->locked, 0);
}

使用例:

counter with lock

#include <pthread.h>
#include <stdio.h>

...

#define N 10000

int count;
struct spinlock l;

void *f(void *arg) {
    for (int i = 0; i < N; i++) {
        spin_lock(&l);
        count++;
        spin_unlock(&l);
    }
    return 0;
}

int main(void) {
    pthread_t tid1, tid2;

    pthread_create(&tid1, NULL, f, NULL);
    pthread_create(&tid2, NULL, f, NULL);

    pthread_join(tid1, NULL);
    pthread_join(tid2, NULL);

    printf("%d\n", count); // print 20000 as expected

    return 0;
}

コメントがやたら長いことからわかるように単純そうに見えて考慮すべきことは多くあります。インラインアセンブリに volatile をつけることでコンパイラの reorder を防ぎ、lock 命令を使うことで atomic なメモリ操作を行いつつ CPU の reorder も制限しています。

また上の実装例だと実は割り込みについての考慮ができていません。lock 取得中に割り込みが発生すると dead lock になってしまう可能性があります。主な対策としては以下のようなものが考えられます。

lock 取得中は割り込みを許可しない (lock 取得前に cli, 解放後に sti)
- xv6 のアプローチ
- 以下の push_cli, pop_cli を spin_lock の初め、spin_unlock の最後に行っている
  - スタック管理になっているのは同一プロセッサで複数の lock を取得し得るためだと思う。スタック管理でないと例えば 2 つ lock を取得した場合に 1 つを解放した時点で sti されてしまう
  - EFLAGS レジスタの IF bit を操作するので CPU 毎にこの情報を管理する必要がある
lock を再入可能 (reentrant) にする
- lock に lock 取得者の id (e.g. スレッド毎の ID) と counter 情報を持たせればいい？
- pthread_mutex_init 関数で attr に PTHREAD_MUTEX_RECURSIVE を渡した場合の実装がそんな感じ

// Pushcli/popcli are like cli/sti except that they are matched:
// it takes two popcli to undo two pushcli.  Also, if interrupts
// are off, then pushcli, popcli leaves them off.

void pushcli(void) {
    int eflags;
  
    eflags = readeflags();
    cli();
    if(mycpu()->ncli == 0)
        mycpu()->intena = eflags & FL_IF;
    mycpu()->ncli += 1;
}

void popcli(void) {
    if(readeflags()&FL_IF)
        panic("popcli - interruptible");
    if(--mycpu()->ncli < 0)
        panic("popcli");
    if(mycpu()->ncli == 0 && mycpu()->intena)
        sti();
}

問題点:

fairness ではない
- 先にロック待ちになったプロセッサ、スレッドが先にロックを取得できるとは限らない
  - ticket lock による解決
read, write 間だけでなく read 間でも排他的
- read-write lock による解決

疑問:

上の実装だと lock で守りたいデータ構造の可視性は保証できていないのでは？
- write したはいいけどコアのキャッシュから書き戻されないまま lock を手放すことがあるのかなと
  - mfence のようなメモリバリア命令が必要？
  - あるいは lock 命令単体でメモリバリアと同様の振る舞いが期待できるのか？
Scalable Locking (MIT 授業用スライド) P.21 いわくマルチプロセッサの場合、ロック待ちの人数を N として待ち時間が O(N^2) になるとしている
- ロックを手放すのに O(N) (xchg 命令なので？) かかり、自分の番までそれが N 回必要みたいな話かと
  - MCS lock による解決

ticket lock

Linux で v2.6.25 (2008 年 4 月リリース) から使われていた spin lock の実装
fairness
xchg のような atomic instruction ではなく単純な read 命令で spin できる

実装例:

Scalable Locking P.23 より。

structlock {
    int current_ticket;
    int next_ticket;
}

void acquire(l) {
    int t = atomic_fetch_and_inc(&l->next_ticket);
    while (t != l->current_ticket) ; // spin
}

void release(l) {
    l->current_ticket++;
}

問題点:

結局 spin lock 同様ロックの取得までに O(N^2) かかるのでスケールし辛い
- Scalable Locking P.24 いわくこれはロックを手放すのに他コアのキャッシュを invalid する必要があるためとのこと

read write lock

reader 間は競合しない lock
- read の頻度が write よりも多いという場面で有用

実装例:

Kernel Scalability (MIT 授業用スライド) P.9 より。

typedef struct {
    volatile int cnt;
} rwlock_t;

void read_lock(rwlock_t *l) {
    int x;
    while (1) {
        x = l->cnt;
        if (x < 0) { // is write lock held?
            continue;
        }
        if (CMPXCHG(&l->cnt, x, x + 1)) { // replace l->cnt with x + 1 and return true if l->cnt == x.
            break;
        }
    }
}

void read_unlock(rwlock_t *l) {
    ATOMIC_DEC(&l->cnt); // perform l->cnt++ atomically
}

void write_lock(rwlock_t *l) {
    int x;
    while (1) {
        x = l->cnt;
        if (x != 0) { // avoid to use CMPXCHG at first because its performance cost is higher?
            continue;
        }
        if (CMPXCHG(&l->cnt, 0, -1)) {
            break;
        }
    }
}

void write_unlock(rwlock_t *l) {
    ATOMIC_INC(&l->cnt); // perform l->cnt++ atomically
}

問題点:

write 中は read を待たせてしまう
- RCU による解決

MCS lock

lock の解放を O(1) で行える
- 各コアを異なる cache line 上で spin させる
- これにより ticket lock 時の問題が解決されている
Linux では MCS lock をベースにした実装が 2013 年ごろに追加されている

実装例:

scalable-lock-code.c

実装の概要は Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors の Figure. 1 あたりを見るのがわかりやすかったです。

RCU (Read Copy Update)

Kernel Scalability の P.14 以降。あるいは RCU Usage In the Linux Kernel: One Decade Later

read に lock は発生しない
read と write が競合しない
writer は既存のデータをコピーし更新後、置き換える。その後、古いデータを参照する reader がいなくなったときにそれを破棄する
- なので reader は古いデータを見ている可能性はある