Ghost sync

In a MongoDB replica set, writes always go to the Primary.
Each Secondary (slave) basically syncs from the Primary.
However, when it cannot sync from the Primary for whatever reason, it may sync from another Secondary instead.
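
For reference, one quick way to see which member a node is currently pulling from is rs.status(); as far as I recall, on this series the member entries of secondaries carry a syncingTo field (a rough sketch, field name not guaranteed):

// Print each member's state and, where reported, its current sync source.
rs.status().members.forEach(function (m) {
    print(m.name + '  ' + m.stateStr + '  syncingTo: ' + (m.syncingTo || '-'));
});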

Why does it do this?

To improve the availability of the replica set.
This feature makes it possible to run a replica set even over a network of limited reliability.
It is also handy when you want a physically distant replica, e.g. cross-datacenter DR (disaster recovery).
You could even build a set that includes nodes with no direct route to the Primary at all, though that is probably not realistic...

[Diagram: replica set ghost target (image omitted)]

ghost target
The Secondary chosen as the sync source (the part where it behaves as if it were the Primary)
ghost slave
From the Primary's point of view, the part where a Secondary pretends to be another Secondary
Bug until 2.0.X
If a ghost target in turn becomes a ghost slave, the two end up referencing each other and get cut off from the Primary.
Cases where Ghost sync occurs

At least the ones I have been able to confirm:

  1. A partial network failure (only between the Primary and one particular Secondary)
  2. The Primary's oplog has rolled over while some Secondary holds a larger oplog than the Primary (see the sketch after this list)
  3. A brand-new Secondary is added (seemingly at random?)
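
For case 2, the oplogs can be compared from the mongo shell on each member; db.printReplicationInfo() shows the configured oplog size and the time window it currently covers, and db.printSlaveReplicationInfo() shows the replication lag:

// Run on each member and compare the output.
db.printReplicationInfo()        // oplog size and first/last event times
db.printSlaveReplicationInfo()   // lag of each slave, as seen from this node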

Experiment

Let's use case 1 to actually trigger a Ghost sync.

Setup
192.168.159.133
Primary
192.168.159.134
Secondary
192.168.159.135
Secondary
config = { _id: 'RS', members:
    [
     {_id: 0, host: '192.168.159.133:27017', priority:2},
     {_id: 1, host: '192.168.159.134:27017', priority:1},
     {_id: 2, host: '192.168.159.135:27017', priority:1}
     ]
};
rs.initiate(config);
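
As a quick sanity check, priority 2 should make 192.168.159.133 win the election; db.isMaster() from any member shows the current Primary:

// The "primary" field should report 192.168.159.133:27017 once the election settles.
db.isMaster()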
Procedure (2.3.2)

1. Set up a completely new replica set
2. Partial network outage (192.168.159.133 <- 192.168.159.135)
   (enters the Ghost sync state)
3. Insert data into 192.168.159.133
   The data should be replicated to 192.168.159.135.
4. Restore the network (192.168.159.133 <- 192.168.159.135)
   (should clear the Ghost sync state)
5. Partial network outage (192.168.159.133 <- 192.168.159.134)
   (enters the Ghost sync state)
6. Insert data into 192.168.159.133
   The data should be replicated to 192.168.159.134.

Environment (log excerpt)
Tue Feb  5 19:01:01.113 [initandlisten] ** NOTE: This is a development version (2.3.3-pre-) of MongoDB.
Tue Feb  5 19:01:01.113 [initandlisten] db version v2.3.3-pre-, pdfile version 4.5
Tue Feb  5 19:01:01.113 [initandlisten] build info: Linux centos6 2.6.32-220.el6.x86_64 #1 SMP Tue Dec 6 19:48:22 GMT 2011 x86_64 BOOST_LIB_VERSION=1_49

1. Set up a completely new replica set

Steps omitted.

local.slaves
192.168.159.133
Looks like both Secondaries are syncing from the Primary
db.slaves.find()
{ "_id" : ObjectId("51136dc25310e984dbbcbdff"), "host" : "192.168.159.134", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1360227753000, "i" : 1 } }
{ "_id" : ObjectId("51136dc468286134d0bae2e4"), "host" : "192.168.159.135", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1360227753000, "i" : 1 } }
192.168.159.134
None
192.168.159.135
None

2. Partial network outage (192.168.159.133 <- 192.168.159.135)

 # iptables -A INPUT -p tcp --dport 27017 -s 192.168.159.135 -j DROP
local.slaves

No change

log
192.168.159.133
Detected the failure of 192.168.159.135
Tue Feb  5 19:11:32.682 [rsHealthPoll] replset info 192.168.159.135:27017 thinks that we are down
Tue Feb  5 19:11:34.683 [rsHealthPoll] replset info 192.168.159.135:27017 thinks that we are down
    :
192.168.159.134
Unconcerned...
192.168.159.135
Detected that 192.168.159.133 is down
rsBackgroundSync fails
Tue Feb  5 19:11:30.653 [rsHealthPoll] DBClientCursor::init call() failed
Tue Feb  5 19:11:30.694 [rsHealthPoll] replSet info 192.168.159.133:27017 is down (or slow to respond):
Tue Feb  5 19:11:30.695 [rsHealthPoll] replSet member 192.168.159.133:27017 is now in state DOWN
Tue Feb  5 19:11:30.695 [rsMgr] not electing self, 192.168.159.134:27017 would veto with '192.168.159.135:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb  5 19:11:36.666 [rsMgr] not electing self, 192.168.159.134:27017 would veto with '192.168.159.135:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb  5 19:11:42.664 [rsMgr] not electing self, 192.168.159.134:27017 would veto with '192.168.159.135:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb  5 19:11:48.666 [rsMgr] not electing self, 192.168.159.134:27017 would veto with '192.168.159.135:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb  5 19:11:53.492 [rsBackgroundSync] Socket recv() timeout  192.168.159.133:27017
Tue Feb  5 19:11:53.492 [rsBackgroundSync] SocketException: remote: 192.168.159.133:27017 error: 9001 socket exception [3] server [192.168.159.133:27017]
Tue Feb  5 19:11:53.492 [rsBackgroundSync] replSet db exception in producer: 10278 dbclient error communicating with server: 192.168.159.133:27017
Tue Feb  5 19:11:54.669 [rsMgr] not electing self, 192.168.159.134:27017 would veto with '192.168.159.135:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb  5 19:11:54.696 [rsHealthPoll] couldn't connect to 192.168.159.133:27017: couldn't connect to server 192.168.159.133:27017
Tue Feb  5 19:12:00.671 [rsMgr] not electing self, 192.168.159.134:27017 would veto with '192.168.159.135:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb  5 19:12:04.696 [rsHealthPoll] couldn't connect to 192.168.159.133:27017: couldn't connect to server 192.168.159.133:27017
    :

Contrary to expectations, it does not seem to have entered Ghost sync yet.

3. Insert data into 192.168.159.133

$ mongo 192.168.159.133:27017 <<<'
use testdb
db.testcol.save({key:1})
'
local.slaves
192.168.159.133
No change!
db.slaves.find()
{ "_id" : ObjectId("51136dc25310e984dbbcbdff"), "host" : "192.168.159.134", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1360227753000, "i" : 1 } }
{ "_id" : ObjectId("51136dc468286134d0bae2e4"), "host" : "192.168.159.135", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1360227753000, "i" : 1 } }
192.168.159.134
The record that was on the Primary (same _id) has moved over here!
{ "_id" : ObjectId("51136dc468286134d0bae2e4"), "host" : "192.168.159.135", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1360228165000, "i" : 138} }
192.168.159.135
None

So the records in db.slaves do not necessarily represent the slaves that are actively syncing from this node

log
192.168.159.133
New DB files were created
A suspicious-looking error from [conn9]
Tue Feb  5 19:15:31.560 [FileAllocator] creating directory /usr/local/mongo/data00/testdb/_tmp
Tue Feb  5 19:15:31.567 [FileAllocator] done allocating datafile /usr/local/mongo/data00/testdb/testdb.ns, size: 4MB,  took 0 secs
Tue Feb  5 19:15:31.567 [FileAllocator] allocating new datafile /usr/local/mongo/data00/testdb/testdb.0, filling with zeroes...
Tue Feb  5 19:15:31.568 [FileAllocator] done allocating datafile /usr/local/mongo/data00/testdb/testdb.0, size: 64MB,  took 0 secs
Tue Feb  5 19:15:31.569 [FileAllocator] allocating new datafile /usr/local/mongo/data00/testdb/testdb.1, filling with zeroes...
Tue Feb  5 19:15:31.571 [FileAllocator] done allocating datafile /usr/local/mongo/data00/testdb/testdb.1, size: 128MB,  took 0.001 secs
Tue Feb  5 19:15:31.574 [conn50] build index testdb.testcol { _id: 1 }
Tue Feb  5 19:15:31.574 [conn50] build index done.  scanned 0 total records. 0 secs
Tue Feb  5 19:15:31.588 [conn9] getMore: cursorid not found local.oplog.rs 969546968802276
192.168.159.134
Here comes [rsGhostSync]!!
Tue Feb  5 19:15:31.576 [FileAllocator] allocating new datafile /usr/local/mongo/data01/testdb/testdb.ns, filling with zeroes...
Tue Feb  5 19:15:31.576 [FileAllocator] creating directory /usr/local/mongo/data01/testdb/_tmp
Tue Feb  5 19:15:31.578 [FileAllocator] done allocating datafile /usr/local/mongo/data01/testdb/testdb.ns, size: 4MB,  took 0 secs
Tue Feb  5 19:15:31.579 [FileAllocator] allocating new datafile /usr/local/mongo/data01/testdb/testdb.0, filling with zeroes...
Tue Feb  5 19:15:31.580 [FileAllocator] done allocating datafile /usr/local/mongo/data01/testdb/testdb.0, size: 64MB,  took 0 secs
Tue Feb  5 19:15:31.581 [FileAllocator] allocating new datafile /usr/local/mongo/data01/testdb/testdb.1, filling with zeroes...
Tue Feb  5 19:15:31.583 [repl writer worker 1] build index testdb.testcol { _id: 1 }
Tue Feb  5 19:15:31.584 [FileAllocator] done allocating datafile /usr/local/mongo/data01/testdb/testdb.1, size: 128MB,  took 0.002 secs
Tue Feb  5 19:15:31.584 [repl writer worker 1] build index done.  scanned 0 total records. 0.001 secs
Tue Feb  5 19:15:31.588 [rsSyncNotifier] repl: old cursor isDead, will initiate a new one
Tue Feb  5 19:15:33.563 [rsGhostSync] handshake between 2 and 192.168.159.133:27017
Tue Feb  5 19:15:34.564 [slaveTracking] build index local.slaves { _id: 1 }
Tue Feb  5 19:15:34.565 [slaveTracking] build index done.  scanned 0 total records. 0 secs
192.168.159.135
The sync source has switched to 192.168.159.134, and new DB files were created!
Tue Feb  5 19:15:33.544 [rsBackgroundSync] replSet syncing to: 192.168.159.134:27017
Tue Feb  5 19:15:33.547 [FileAllocator] allocating new datafile /usr/local/mongo/data02/testdb/testdb.ns, filling with zeroes...
Tue Feb  5 19:15:33.547 [FileAllocator] creating directory /usr/local/mongo/data02/testdb/_tmp
Tue Feb  5 19:15:33.550 [FileAllocator] done allocating datafile /usr/local/mongo/data02/testdb/testdb.ns, size: 4MB,  took 0 secs
Tue Feb  5 19:15:33.550 [FileAllocator] allocating new datafile /usr/local/mongo/data02/testdb/testdb.0, filling with zeroes...
Tue Feb  5 19:15:33.550 [FileAllocator] done allocating datafile /usr/local/mongo/data02/testdb/testdb.0, size: 64MB,  took 0 secs
Tue Feb  5 19:15:33.552 [FileAllocator] allocating new datafile /usr/local/mongo/data02/testdb/testdb.1, filling with zeroes...
Tue Feb  5 19:15:33.553 [repl writer worker 1] build index testdb.testcol { _id: 1 }
Tue Feb  5 19:15:33.553 [FileAllocator] done allocating datafile /usr/local/mongo/data02/testdb/testdb.1, size: 128MB,  took 0 secs
Tue Feb  5 19:15:33.554 [repl writer worker 1] build index done.  scanned 0 total records. 0 secs
Tue Feb  5 19:15:33.554 [rsSyncNotifier] replset setting oplog notifier to 192.168.159.134:27017

Ghost sync achieved!!

Incidentally, the '2' here refers to the replica set's config.members._id.

[rsGhostSync] handshake between 2 and 192.168.159.133:27017
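
To map that _id back to a host, the current replica set config can be dumped from the shell, for example:

// List member _id -> host pairs from the replica set config.
rs.conf().members.forEach(function (m) { print(m._id + ' -> ' + m.host); });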

I'll omit the output, but of course {key:1} can now be read from 192.168.159.135.
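
Roughly, such a read looks like this (rs.slaveOk() is needed before querying a secondary):

$ mongo 192.168.159.135:27017 <<<'
rs.slaveOk()
use testdb
db.testcol.find({key:1})
'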

4. Restore the network (192.168.159.133 <- 192.168.159.135)

 # iptables -D INPUT -p tcp --dport 27017 -s 192.168.159.135 -j DROP
local.slaves

No change

log
192.168.159.133
The rsHealthPoll errors stop
Tue Feb  5 19:20:44.903 [rsHealthPoll] replset info 192.168.159.135:27017 thinks that we are down
Tue Feb  5 19:20:46.564 [snapshotthread] cpu: elapsed:4000  writelock: 0%
Tue Feb  5 19:20:50.564 [snapshotthread] cpu: elapsed:4000  writelock: 0%
Tue Feb  5 19:20:54.565 [snapshotthread] cpu: elapsed:4000  writelock: 0%
192.168.159.134
Unconcerned...
192.168.159.135
Detected the Primary coming back up
Tue Feb  5 19:20:45.699 [rsHealthPoll] replSet member 192.168.159.133:27017 is up
Tue Feb  5 19:20:45.699 [rsHealthPoll] replSet member 192.168.159.133:27017 is now in state PRIMARY

Whether it has actually left the Ghost sync state is impossible to tell from this...

5. Partial network outage (192.168.159.133 <- 192.168.159.134)

 # iptables -A INPUT -p tcp --dport 27017 -s 192.168.159.134 -j DROP
local.slaves

No change

log
192.168.159.133
Detected the failure of 192.168.159.134
Tue Feb  5 19:22:42.949 [rsHealthPoll] replset info 192.168.159.134:27017 thinks that we are down
Tue Feb  5 19:22:44.950 [rsHealthPoll] replset info 192.168.159.134:27017 thinks that we are down
192.168.159.134
Detected that 192.168.159.133 is down
rsBackgroundSync fails
Tue Feb  5 19:22:41.069 [rsHealthPoll] DBClientCursor::init call() failed
Tue Feb  5 19:22:41.070 [rsHealthPoll] replSet info 192.168.159.133:27017 is down (or slow to respond):
Tue Feb  5 19:22:41.070 [rsHealthPoll] replSet member 192.168.159.133:27017 is now in state DOWN
Tue Feb  5 19:22:41.071 [rsMgr] not electing self, 192.168.159.135:27017 would veto with '192.168.159.134:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb  5 19:22:47.076 [rsMgr] not electing self, 192.168.159.135:27017 would veto with '192.168.159.134:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb  5 19:22:53.072 [rsMgr] not electing self, 192.168.159.135:27017 would veto with '192.168.159.134:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb  5 19:22:59.080 [rsMgr] not electing self, 192.168.159.135:27017 would veto with '192.168.159.134:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb  5 19:23:03.203 [rsBackgroundSync] Socket recv() timeout  192.168.159.133:27017
Tue Feb  5 19:23:03.203 [rsBackgroundSync] SocketException: remote: 192.168.159.133:27017 error: 9001 socket exception [3] server [192.168.159.133:27017]
Tue Feb  5 19:23:03.203 [rsBackgroundSync] replSet db exception in producer: 10278 dbclient error communicating with server: 192.168.159.133:27017
Tue Feb  5 19:23:05.071 [rsHealthPoll] replset info 192.168.159.133:27017 heartbeat failed, retrying
Tue Feb  5 19:23:05.083 [rsMgr] not electing self, 192.168.159.135:27017 would veto with '192.168.159.134:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb  5 19:23:11.085 [rsMgr] not electing self, 192.168.159.135:27017 would veto with '192.168.159.134:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb  5 19:23:17.087 [rsMgr] not electing self, 192.168.159.135:27017 would veto with '192.168.159.134:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb  5 19:23:18.071 [rsHealthPoll] replset info 192.168.159.133:27017 heartbeat failed, retrying
Tue Feb  5 19:23:23.090 [rsMgr] not electing self, 192.168.159.135:27017 would veto with '192.168.159.134:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
192.168.159.135
Unconcerned...

6. Insert data into 192.168.159.133

$ mongo 192.168.159.133:27017 <<<'
use testdb
db.testcol.save({key:2})
'
local.slaves
192.168.159.133
No change
db.slaves.find()
{ "_id" : ObjectId("51136dc25310e984dbbcbdff"), "host" : "192.168.159.134", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1360227753000, "i" : 1 } }
{ "_id" : ObjectId("51136dc468286134d0bae2e4"), "host" : "192.168.159.135", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1360227753000, "i" : 1 } }
192.168.159.134
No change
{ "_id" : ObjectId("51136dc468286134d0bae2e4"), "host" : "192.168.159.135", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1360228165000, "i" : 138} }
192.168.159.135
As before, the record moves over from the Primary
{ "_id" : ObjectId("51136dc25310e984dbbcbdff"), "host" : "192.168.159.134", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1360228784000, "i" : 229 } }
log
192.168.159.133
Nothing in particular...
192.168.159.134
Now syncing from 192.168.159.135!
Tue Feb  5 19:24:33.117 [rsMgr] not electing self, we are not freshest
Tue Feb  5 19:24:35.224 [rsBackgroundSync] replSet syncing to: 192.168.159.135:27017
Tue Feb  5 19:24:35.288 [rsSyncNotifier] replset setting oplog notifier to 192.168.159.135:27017
192.168.159.135
[rsGhostSync] acting as the go-between for member 1 (192.168.159.134) and 192.168.159.133!
Tue Feb  5 19:24:34.909 [rsBackgroundSync] replSet syncing to: 192.168.159.133:27017
Tue Feb  5 19:24:34.923 [rsSyncNotifier] replset setting oplog notifier to 192.168.159.133:27017
Tue Feb  5 19:24:35.302 [rsGhostSync] handshake between 1 and 192.168.159.133:27017
Tue Feb  5 19:24:36.303 [slaveTracking] build index local.slaves { _id: 1 }
Tue Feb  5 19:24:36.304 [slaveTracking] build index done.  scanned 0 total records. 0 secs

I'll omit the output, but of course {key:2} can be read from 192.168.159.134 as well.

Summary

  • On 2.3.2, Ghost sync works properly!
  • A node enters the Ghost sync state at the first oplog sync after a node failure is detected (i.e. exactly when switching actually becomes necessary).
  • When a node reverts from the Ghost sync state back to normal is unclear.

While I'm at it...

rs.syncFrom('[host]:[port]') (since 2.2)

In fact, the sync source can also be switched manually:

rs.syncFrom('192.168.159.133:27017')
slaveDelay and Ghost sync are a bad combination!

There is an issue where a slaveDelay secondary can end up being chosen as the ghost target.
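
For reference, a delayed member is configured roughly like this (the host below is made up; slaveDelay is in seconds and requires priority 0):

// Hypothetical delayed member; if it is picked as a ghost target, downstream nodes inherit the delay.
rs.add({ _id: 3, host: '192.168.159.136:27017', priority: 0, hidden: true, slaveDelay: 3600 })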

mongo jira

No matter how much you write, the data you can read stays stale...

Nodes located where they cannot reach the Primary

If a node cannot connect while the replica set is being built, it keeps retrying and takes quite a while before it can JOIN.
However, if connectivity is lost after the set has been built, the node can return as a Secondary immediately, even if it has been restarted in the meantime.
The sync-source information does not seem to be persisted, though, so perhaps some information gets passed around from the ghost target side.

I plan to chase this down later.