Ghost sync
In a MongoDB replica set, writes always go to the Primary.
Each Secondary (slave) normally syncs from the Primary.
However, when it cannot sync from the Primary for some reason, it may sync from another Secondary instead.
Why does it do this?
To improve the availability of the replica set.
This feature makes it possible to run a replica set over an unreliable network.
It is also handy when you want to build replicas that are physically far apart, such as cross-datacenter DR (disaster recovery).
You can even build a replica set containing nodes that have no direct connectivity to the Primary at all, though that is probably not realistic.
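A quick way to see which node a Secondary is actually pulling from is to check rs.status() from the mongo shell. A minimal sketch (whether a syncingTo field is reported depends on the version; the "replSet syncing to:" log line that appears later in this post shows the same information):

// Run on the Secondary you are curious about.
var s = rs.status()
print('myState  : ' + s.myState)                              // 2 = SECONDARY
print('syncingTo: ' + (s.syncingTo || '(not reported here)')) // current sync source, if exposed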
replica set ghost target no good
- ghost target
- The Secondary chosen as the sync source (the part where it acts as if it were the Primary)
- ghost slave
- From the Primary's point of view, the part where one Secondary pretends to be another Secondary
- Bug until 2.0.X
- If a ghost target itself becomes a ghost slave, the two end up referencing each other and get cut off from the Primary.
Cases where Ghost sync occurs
At least the ones I was able to confirm:
- Partial network failure (only between the Primary and a specific Secondary)
- When the Primary's oplog overflows and some Secondary has a larger oplog than the Primary
- When adding a brand-new Secondary (random? see the sketch below)
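For the third case, adding a brand-new Secondary is just rs.add(); which sync source the new member picks during and after its initial sync is the part that sometimes looks random. A minimal sketch, using a hypothetical fourth host that is not part of the experiment below:

// Run on the Primary. 192.168.159.136 is a hypothetical new node.
rs.add('192.168.159.136:27017')   // the new member initial-syncs, then becomes a Secondary
rs.status()                       // watch which member it ends up syncing from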
Experiment
Let's actually trigger Ghost sync using case 1.
Setup
- 192.168.159.133
- Primary
- 192.168.159.134
- Secondary
- 192.168.159.135
- Secondary
config = {
  _id: 'RS',
  members: [
    {_id: 0, host: '192.168.159.133:27017', priority: 2},
    {_id: 1, host: '192.168.159.134:27017', priority: 1},
    {_id: 2, host: '192.168.159.135:27017', priority: 1}
  ]
};
rs.initiate(config);
Procedure (2.3.2)
1. Build a completely new replica set
2. Partial network partition (192.168.159.133 <- 192.168.159.135)
(enters Ghost sync state)
3. Insert data on 192.168.159.133
The data gets synced to 192.168.159.135.
4. Restore the network (192.168.159.133 <- 192.168.159.135)
(Ghost sync should be cleared)
5. Partial network partition (192.168.159.133 <- 192.168.159.134)
(enters Ghost sync state)
6. Insert data on 192.168.159.133
The data gets synced to 192.168.159.134.
Environment (log excerpt)
Tue Feb 5 19:01:01.113 [initandlisten] ** NOTE: This is a development version (2.3.3-pre-) of MongoDB.
Tue Feb 5 19:01:01.113 [initandlisten] db version v2.3.3-pre-, pdfile version 4.5
Tue Feb 5 19:01:01.113 [initandlisten] build info: Linux centos6 2.6.32-220.el6.x86_64 #1 SMP Tue Dec 6 19:48:22 GMT 2011 x86_64 BOOST_LIB_VERSION=1_49
1. Build a completely new replica set
Steps omitted.
local.slaves
- 192.168.159.133
- Looks like both Secondaries are syncing from the Primary
db.slaves.find()
{ "_id" : ObjectId("51136dc25310e984dbbcbdff"), "host" : "192.168.159.134", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1360227753000, "i" : 1 } }
{ "_id" : ObjectId("51136dc468286134d0bae2e4"), "host" : "192.168.159.135", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1360227753000, "i" : 1 } }
- 192.168.159.134
- None
- 192.168.159.135
- None
2. Partial network partition (192.168.159.133 <- 192.168.159.135)
# iptables -A INPUT -p tcp --dport 27017 -s 192.168.159.135 -j DROP
local.slaves
No change.
log
- 192.168.159.133
- Detected the failure of 192.168.159.135
Tue Feb 5 19:11:32.682 [rsHealthPoll] replset info 192.168.159.135:27017 thinks that we are down
Tue Feb 5 19:11:34.683 [rsHealthPoll] replset info 192.168.159.135:27017 thinks that we are down
:
- 192.168.159.134
- Unconcerned...
- 192.168.159.135
- Detected that 192.168.159.133 is down
rsBackgroundSync fails
Tue Feb 5 19:11:30.653 [rsHealthPoll] DBClientCursor::init call() failed
Tue Feb 5 19:11:30.694 [rsHealthPoll] replSet info 192.168.159.133:27017 is down (or slow to respond):
Tue Feb 5 19:11:30.695 [rsHealthPoll] replSet member 192.168.159.133:27017 is now in state DOWN
Tue Feb 5 19:11:30.695 [rsMgr] not electing self, 192.168.159.134:27017 would veto with '192.168.159.135:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb 5 19:11:36.666 [rsMgr] not electing self, 192.168.159.134:27017 would veto with '192.168.159.135:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb 5 19:11:42.664 [rsMgr] not electing self, 192.168.159.134:27017 would veto with '192.168.159.135:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb 5 19:11:48.666 [rsMgr] not electing self, 192.168.159.134:27017 would veto with '192.168.159.135:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb 5 19:11:53.492 [rsBackgroundSync] Socket recv() timeout 192.168.159.133:27017
Tue Feb 5 19:11:53.492 [rsBackgroundSync] SocketException: remote: 192.168.159.133:27017 error: 9001 socket exception [3] server [192.168.159.133:27017]
Tue Feb 5 19:11:53.492 [rsBackgroundSync] replSet db exception in producer: 10278 dbclient error communicating with server: 192.168.159.133:27017
Tue Feb 5 19:11:54.669 [rsMgr] not electing self, 192.168.159.134:27017 would veto with '192.168.159.135:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb 5 19:11:54.696 [rsHealthPoll] couldn't connect to 192.168.159.133:27017: couldn't connect to server 192.168.159.133:27017
Tue Feb 5 19:12:00.671 [rsMgr] not electing self, 192.168.159.134:27017 would veto with '192.168.159.135:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb 5 19:12:04.696 [rsHealthPoll] couldn't connect to 192.168.159.133:27017: couldn't connect to server 192.168.159.133:27017
:
Contrary to expectations, it does not appear to have entered Ghost sync yet.
3. Insert data on 192.168.159.133
$ mongo 192.168.159.133:27017 <<<'
use testdb
db.testcol.save({key:1})
'
local.slaves
- 192.168.159.133
- No change!
db.slaves.find()
{ "_id" : ObjectId("51136dc25310e984dbbcbdff"), "host" : "192.168.159.134", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1360227753000, "i" : 1 } }
{ "_id" : ObjectId("51136dc468286134d0bae2e4"), "host" : "192.168.159.135", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1360227753000, "i" : 1 } }
- 192.168.159.134
- The record that was on the Primary (same _id) moved over here!
{ "_id" : ObjectId("51136dc468286134d0bae2e4"), "host" : "192.168.159.135", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1360228165000, "i" : 138} }
- 192.168.159.135
- None
The records in db.slaves do not represent the currently active slaves.
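If you want to know what each member is actually doing, rs.status() is more trustworthy than local.slaves, because each member's position is refreshed via heartbeats. A minimal sketch (field names as reported by the 2.x shell):

// Run on any member; prints each member's replication position.
rs.status().members.forEach(function (m) {
    print(m.name + '  ' + m.stateStr + '  ' + m.optimeDate)
})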
log
- 192.168.159.133
- New DB files were created
A suspicious error from [conn9]
Tue Feb 5 19:15:31.560 [FileAllocator] creating directory /usr/local/mongo/data00/testdb/_tmp
Tue Feb 5 19:15:31.567 [FileAllocator] done allocating datafile /usr/local/mongo/data00/testdb/testdb.ns, size: 4MB, took 0 secs
Tue Feb 5 19:15:31.567 [FileAllocator] allocating new datafile /usr/local/mongo/data00/testdb/testdb.0, filling with zeroes...
Tue Feb 5 19:15:31.568 [FileAllocator] done allocating datafile /usr/local/mongo/data00/testdb/testdb.0, size: 64MB, took 0 secs
Tue Feb 5 19:15:31.569 [FileAllocator] allocating new datafile /usr/local/mongo/data00/testdb/testdb.1, filling with zeroes...
Tue Feb 5 19:15:31.571 [FileAllocator] done allocating datafile /usr/local/mongo/data00/testdb/testdb.1, size: 128MB, took 0.001 secs
Tue Feb 5 19:15:31.574 [conn50] build index testdb.testcol { _id: 1 }
Tue Feb 5 19:15:31.574 [conn50] build index done. scanned 0 total records. 0 secs
Tue Feb 5 19:15:31.588 [conn9] getMore: cursorid not found local.oplog.rs 969546968802276
- 192.168.159.134
- [rsGhostSync] There it is!!
Tue Feb 5 19:15:31.576 [FileAllocator] allocating new datafile /usr/local/mongo/data01/testdb/testdb.ns, filling with zeroes...
Tue Feb 5 19:15:31.576 [FileAllocator] creating directory /usr/local/mongo/data01/testdb/_tmp
Tue Feb 5 19:15:31.578 [FileAllocator] done allocating datafile /usr/local/mongo/data01/testdb/testdb.ns, size: 4MB, took 0 secs
Tue Feb 5 19:15:31.579 [FileAllocator] allocating new datafile /usr/local/mongo/data01/testdb/testdb.0, filling with zeroes...
Tue Feb 5 19:15:31.580 [FileAllocator] done allocating datafile /usr/local/mongo/data01/testdb/testdb.0, size: 64MB, took 0 secs
Tue Feb 5 19:15:31.581 [FileAllocator] allocating new datafile /usr/local/mongo/data01/testdb/testdb.1, filling with zeroes...
Tue Feb 5 19:15:31.583 [repl writer worker 1] build index testdb.testcol { _id: 1 }
Tue Feb 5 19:15:31.584 [FileAllocator] done allocating datafile /usr/local/mongo/data01/testdb/testdb.1, size: 128MB, took 0.002 secs
Tue Feb 5 19:15:31.584 [repl writer worker 1] build index done. scanned 0 total records. 0.001 secs
Tue Feb 5 19:15:31.588 [rsSyncNotifier] repl: old cursor isDead, will initiate a new one
Tue Feb 5 19:15:33.563 [rsGhostSync] handshake between 2 and 192.168.159.133:27017
Tue Feb 5 19:15:34.564 [slaveTracking] build index local.slaves { _id: 1 }
Tue Feb 5 19:15:34.565 [slaveTracking] build index done. scanned 0 total records. 0 secs
- 192.168.159.135
- The sync source switched to 192.168.159.134, and new DB files were created!
Tue Feb 5 19:15:33.544 [rsBackgroundSync] replSet syncing to: 192.168.159.134:27017
Tue Feb 5 19:15:33.547 [FileAllocator] allocating new datafile /usr/local/mongo/data02/testdb/testdb.ns, filling with zeroes...
Tue Feb 5 19:15:33.547 [FileAllocator] creating directory /usr/local/mongo/data02/testdb/_tmp
Tue Feb 5 19:15:33.550 [FileAllocator] done allocating datafile /usr/local/mongo/data02/testdb/testdb.ns, size: 4MB, took 0 secs
Tue Feb 5 19:15:33.550 [FileAllocator] allocating new datafile /usr/local/mongo/data02/testdb/testdb.0, filling with zeroes...
Tue Feb 5 19:15:33.550 [FileAllocator] done allocating datafile /usr/local/mongo/data02/testdb/testdb.0, size: 64MB, took 0 secs
Tue Feb 5 19:15:33.552 [FileAllocator] allocating new datafile /usr/local/mongo/data02/testdb/testdb.1, filling with zeroes...
Tue Feb 5 19:15:33.553 [repl writer worker 1] build index testdb.testcol { _id: 1 }
Tue Feb 5 19:15:33.553 [FileAllocator] done allocating datafile /usr/local/mongo/data02/testdb/testdb.1, size: 128MB, took 0 secs
Tue Feb 5 19:15:33.554 [repl writer worker 1] build index done. scanned 0 total records. 0 secs
Tue Feb 5 19:15:33.554 [rsSyncNotifier] replset setting oplog notifier to 192.168.159.134:27017
Ghost sync achieved!!
Incidentally, the 2 here refers to config.members._id in the replica set configuration.
[rsGhostSync] handshake between 2 and 192.168.159.133:27017
Omitted here, but of course {key:1} can be read from 192.168.159.135 just fine.
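To spell out that read check (a sketch using the same here-string style as above; rs.slaveOk() is needed because we are reading from a Secondary):

$ mongo 192.168.159.135:27017 <<<'
rs.slaveOk()
use testdb
db.testcol.find({key:1})
'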
4. Restore the network (192.168.159.133 <- 192.168.159.135)
# iptables -D INPUT -p tcp --dport 27017 -s 192.168.159.135 -j DROP
local.slaves
No change.
log
- 192.168.159.133
- The rsHealthPoll errors stop
Tue Feb 5 19:20:44.903 [rsHealthPoll] replset info 192.168.159.135:27017 thinks that we are down
Tue Feb 5 19:20:46.564 [snapshotthread] cpu: elapsed:4000 writelock: 0%
Tue Feb 5 19:20:50.564 [snapshotthread] cpu: elapsed:4000 writelock: 0%
Tue Feb 5 19:20:54.565 [snapshotthread] cpu: elapsed:4000 writelock: 0%
- 192.168.159.134
- Unconcerned...
- 192.168.159.135
- Detected that the Primary has recovered
Tue Feb 5 19:20:45.699 [rsHealthPoll] replSet member 192.168.159.133:27017 is up
Tue Feb 5 19:20:45.699 [rsHealthPoll] replSet member 192.168.159.133:27017 is now in state PRIMARY
Whether it has left the Ghost sync state or not is impossible to tell...
5. Partial network partition (192.168.159.133 <- 192.168.159.134)
# iptables -A INPUT -p tcp --dport 27017 -s 192.168.159.134 -j DROP
local.slaves
No change.
log
- 192.168.159.133
- Detected the failure of 192.168.159.134
Tue Feb 5 19:22:42.949 [rsHealthPoll] replset info 192.168.159.134:27017 thinks that we are down
Tue Feb 5 19:22:44.950 [rsHealthPoll] replset info 192.168.159.134:27017 thinks that we are down
- 192.168.159.134
- Detected that 192.168.159.133 is down
rsBackgroundSync fails
Tue Feb 5 19:22:41.069 [rsHealthPoll] DBClientCursor::init call() failed
Tue Feb 5 19:22:41.070 [rsHealthPoll] replSet info 192.168.159.133:27017 is down (or slow to respond):
Tue Feb 5 19:22:41.070 [rsHealthPoll] replSet member 192.168.159.133:27017 is now in state DOWN
Tue Feb 5 19:22:41.071 [rsMgr] not electing self, 192.168.159.135:27017 would veto with '192.168.159.134:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb 5 19:22:47.076 [rsMgr] not electing self, 192.168.159.135:27017 would veto with '192.168.159.134:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb 5 19:22:53.072 [rsMgr] not electing self, 192.168.159.135:27017 would veto with '192.168.159.134:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb 5 19:22:59.080 [rsMgr] not electing self, 192.168.159.135:27017 would veto with '192.168.159.134:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb 5 19:23:03.203 [rsBackgroundSync] Socket recv() timeout 192.168.159.133:27017
Tue Feb 5 19:23:03.203 [rsBackgroundSync] SocketException: remote: 192.168.159.133:27017 error: 9001 socket exception [3] server [192.168.159.133:27017]
Tue Feb 5 19:23:03.203 [rsBackgroundSync] replSet db exception in producer: 10278 dbclient error communicating with server: 192.168.159.133:27017
Tue Feb 5 19:23:05.071 [rsHealthPoll] replset info 192.168.159.133:27017 heartbeat failed, retrying
Tue Feb 5 19:23:05.083 [rsMgr] not electing self, 192.168.159.135:27017 would veto with '192.168.159.134:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb 5 19:23:11.085 [rsMgr] not electing self, 192.168.159.135:27017 would veto with '192.168.159.134:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb 5 19:23:17.087 [rsMgr] not electing self, 192.168.159.135:27017 would veto with '192.168.159.134:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb 5 19:23:18.071 [rsHealthPoll] replset info 192.168.159.133:27017 heartbeat failed, retrying
Tue Feb 5 19:23:23.090 [rsMgr] not electing self, 192.168.159.135:27017 would veto with '192.168.159.134:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
- 192.168.159.135
- Unconcerned...
6. Insert data on 192.168.159.133
$ mongo 192.168.159.133:27017 <<<'
use testdb
db.testcol.save({key:2})
'
local.slaves
- 192.168.159.133
- No change
db.slaves.find()
{ "_id" : ObjectId("51136dc25310e984dbbcbdff"), "host" : "192.168.159.134", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1360227753000, "i" : 1 } }
{ "_id" : ObjectId("51136dc468286134d0bae2e4"), "host" : "192.168.159.135", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1360227753000, "i" : 1 } }
- 192.168.159.134
- No change
{ "_id" : ObjectId("51136dc468286134d0bae2e4"), "host" : "192.168.159.135", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1360228165000, "i" : 138} }
- 192.168.159.135
- As expected, the record moves over from the Primary
{ "_id" : ObjectId("51136dc25310e984dbbcbdff"), "host" : "192.168.159.134", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1360228784000, "i" : 229 } }
log
- 192.168.159.133
- Nothing much...
- 192.168.159.134
- Now syncing from 192.168.159.135
Tue Feb 5 19:24:33.117 [rsMgr] not electing self, we are not freshest
Tue Feb 5 19:24:35.224 [rsBackgroundSync] replSet syncing to: 192.168.159.135:27017
Tue Feb 5 19:24:35.288 [rsSyncNotifier] replset setting oplog notifier to 192.168.159.135:27017
- 192.168.159.135
- [rsGhostSync] Brokering between member _id=1 (192.168.159.134) and 192.168.159.133!
Tue Feb 5 19:24:34.909 [rsBackgroundSync] replSet syncing to: 192.168.159.133:27017
Tue Feb 5 19:24:34.923 [rsSyncNotifier] replset setting oplog notifier to 192.168.159.133:27017
Tue Feb 5 19:24:35.302 [rsGhostSync] handshake between 1 and 192.168.159.133:27017
Tue Feb 5 19:24:36.303 [slaveTracking] build index local.slaves { _id: 1 }
Tue Feb 5 19:24:36.304 [slaveTracking] build index done. scanned 0 total records. 0 secs
Omitted here, but of course {key:2} can be read from 192.168.159.134 just fine.
Summary
- In 2.3.2, Ghost sync works properly!
- The Ghost sync state is entered at the first oplog sync after a node failure is detected (i.e., the moment a switch actually becomes necessary).
- When a node goes back from the Ghost sync state to the normal state is unclear.
As an aside...
rs.syncFrom('[host]:[port]') (since 2.2)
In fact, you can also switch the sync source manually.
rs.syncFrom('192.168.159.133:27017')
slaveDelay and Ghost sync don't mix well!
There is a problem where a slaveDelay Secondary can end up being chosen as the ghost target.
No matter how much you write, the data you can read stays stale...
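For reference, a delayed member is usually configured like this (a minimal sketch with a hypothetical fourth host that is not part of the experiment above; priority: 0 and hidden: true normally accompany slaveDelay):

// Run on the Primary. The host below is hypothetical.
cfg = rs.conf()
cfg.members.push({
    _id: 3,
    host: '192.168.159.136:27017',
    priority: 0,       // must never become Primary
    hidden: true,      // invisible to normal clients
    slaveDelay: 3600   // stays one hour behind the Primary
})
rs.reconfig(cfg)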
Nodes that cannot reach the Primary
If a node cannot connect while the replica set is being built, it keeps retrying and cannot join for quite a while.
However, if connectivity is lost after the set is built, and that node is even restarted, it can still come back as a Secondary immediately.
The sync-source information does not seem to be persisted, though, so perhaps some information is relayed from the ghost target side.
I plan to dig into this later.