Ghost sync
In a MongoDB replica set, writes always go to the Primary.
Each Secondary (slave) normally syncs from the Primary.
However, when it cannot sync from the Primary for some reason, it may sync from another Secondary instead.
Why does it do this?
To improve the availability of the replica set.
This feature makes it possible to run a replica set over an unreliable network.
It is also handy when you want to build replicas that are physically far apart, such as cross-datacenter DR (disaster recovery).
You can even build a replica set containing nodes that have no direct connectivity to the Primary at all, though that is probably not realistic.
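A quick way to see which node a Secondary is actually pulling from is to check rs.status() from the mongo shell. A minimal sketch (whether a syncingTo field is reported depends on the version; the "replSet syncing to:" log line that appears later in this post shows the same information):

// Run on the Secondary you are curious about.
var s = rs.status()
print('myState  : ' + s.myState)                              // 2 = SECONDARY
print('syncingTo: ' + (s.syncingTo || '(not reported here)')) // current sync source, if exposed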
replica set ghost target no good
- ghost target
- The Secondary chosen as the sync source (the part where it acts as if it were the Primary)
- ghost slave
- From the Primary's point of view, the part where one Secondary pretends to be another Secondary
- Bug until 2.0.X
- If a ghost target itself becomes a ghost slave, the two end up referencing each other and get cut off from the Primary.
Cases where Ghost sync occurs
At least the ones I was able to confirm:
- Partial network failure (only between the Primary and a specific Secondary)
- When the Primary's oplog overflows and some Secondary has a larger oplog than the Primary
- When adding a brand-new Secondary (random? see the sketch below)
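For the third case, adding a brand-new Secondary is just rs.add(); which sync source the new member picks during and after its initial sync is the part that sometimes looks random. A minimal sketch, using a hypothetical fourth host that is not part of the experiment below:

// Run on the Primary. 192.168.159.136 is a hypothetical new node.
rs.add('192.168.159.136:27017')   // the new member initial-syncs, then becomes a Secondary
rs.status()                       // watch which member it ends up syncing from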
Experiment
Let's actually trigger Ghost sync using case 1.
Setup
- 192.168.159.133
- Primary
- 192.168.159.134
- Secondary
- 192.168.159.135
- Secondary
config = {
  _id: 'RS',
  members: [
    {_id: 0, host: '192.168.159.133:27017', priority: 2},
    {_id: 1, host: '192.168.159.134:27017', priority: 1},
    {_id: 2, host: '192.168.159.135:27017', priority: 1}
  ]
};
rs.initiate(config);
Procedure (2.3.2)
1. Build a completely new replica set
2. Partial network partition (192.168.159.133 <- 192.168.159.135)
(enters Ghost sync state)
3. Insert data on 192.168.159.133
The data gets synced to 192.168.159.135.
4. Restore the network (192.168.159.133 <- 192.168.159.135)
(Ghost sync should be cleared)
5. Partial network partition (192.168.159.133 <- 192.168.159.134)
(enters Ghost sync state)
6. Insert data on 192.168.159.133
The data gets synced to 192.168.159.134.
Environment (log excerpt)
Tue Feb 5 19:01:01.113 [initandlisten] ** NOTE: This is a development version (2.3.3-pre-) of MongoDB.
Tue Feb 5 19:01:01.113 [initandlisten] db version v2.3.3-pre-, pdfile version 4.5
Tue Feb 5 19:01:01.113 [initandlisten] build info: Linux centos6 2.6.32-220.el6.x86_64 #1 SMP Tue Dec 6 19:48:22 GMT 2011 x86_64 BOOST_LIB_VERSION=1_49
1. Build a completely new replica set
Steps omitted.
local.slaves
- 192.168.159.133
- Looks like both Secondaries are syncing from the Primary
db.slaves.find()
{ "_id" : ObjectId("51136dc25310e984dbbcbdff"), "host" : "192.168.159.134", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1360227753000, "i" : 1 } }
{ "_id" : ObjectId("51136dc468286134d0bae2e4"), "host" : "192.168.159.135", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1360227753000, "i" : 1 } }
- 192.168.159.134
- None
- 192.168.159.135
- None
2. Partial network partition (192.168.159.133 <- 192.168.159.135)
# iptables -A INPUT -p tcp --dport 27017 -s 192.168.159.135 -j DROP
local.slaves
No change.
log
- 192.168.159.133
- Detected the failure of 192.168.159.135
Tue Feb 5 19:11:32.682 [rsHealthPoll] replset info 192.168.159.135:27017 thinks that we are down
Tue Feb 5 19:11:34.683 [rsHealthPoll] replset info 192.168.159.135:27017 thinks that we are down
:
- 192.168.159.134
- Unconcerned...
- 192.168.159.135
- Detected that 192.168.159.133 is down
rsBackgroundSync fails
Tue Feb 5 19:11:30.653 [rsHealthPoll] DBClientCursor::init call() failed
Tue Feb 5 19:11:30.694 [rsHealthPoll] replSet info 192.168.159.133:27017 is down (or slow to respond):
Tue Feb 5 19:11:30.695 [rsHealthPoll] replSet member 192.168.159.133:27017 is now in state DOWN
Tue Feb 5 19:11:30.695 [rsMgr] not electing self, 192.168.159.134:27017 would veto with '192.168.159.135:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb 5 19:11:36.666 [rsMgr] not electing self, 192.168.159.134:27017 would veto with '192.168.159.135:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb 5 19:11:42.664 [rsMgr] not electing self, 192.168.159.134:27017 would veto with '192.168.159.135:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb 5 19:11:48.666 [rsMgr] not electing self, 192.168.159.134:27017 would veto with '192.168.159.135:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb 5 19:11:53.492 [rsBackgroundSync] Socket recv() timeout 192.168.159.133:27017
Tue Feb 5 19:11:53.492 [rsBackgroundSync] SocketException: remote: 192.168.159.133:27017 error: 9001 socket exception [3] server [192.168.159.133:27017]
Tue Feb 5 19:11:53.492 [rsBackgroundSync] replSet db exception in producer: 10278 dbclient error communicating with server: 192.168.159.133:27017
Tue Feb 5 19:11:54.669 [rsMgr] not electing self, 192.168.159.134:27017 would veto with '192.168.159.135:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb 5 19:11:54.696 [rsHealthPoll] couldn't connect to 192.168.159.133:27017: couldn't connect to server 192.168.159.133:27017
Tue Feb 5 19:12:00.671 [rsMgr] not electing self, 192.168.159.134:27017 would veto with '192.168.159.135:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb 5 19:12:04.696 [rsHealthPoll] couldn't connect to 192.168.159.133:27017: couldn't connect to server 192.168.159.133:27017
:
Contrary to expectations, it does not appear to have entered Ghost sync yet.
3. Insert data on 192.168.159.133
$ mongo 192.168.159.133:27017 <<<'
use testdb
db.testcol.save({key:1})
'
local.slaves
- 192.168.159.133
- No change!
db.slaves.find()
{ "_id" : ObjectId("51136dc25310e984dbbcbdff"), "host" : "192.168.159.134", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1360227753000, "i" : 1 } }
{ "_id" : ObjectId("51136dc468286134d0bae2e4"), "host" : "192.168.159.135", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1360227753000, "i" : 1 } }
- 192.168.159.134
- The record that was on the Primary (same _id) moved over here!
{ "_id" : ObjectId("51136dc468286134d0bae2e4"), "host" : "192.168.159.135", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1360228165000, "i" : 138} }
- 192.168.159.135
- None
The records in db.slaves do not represent the currently active slaves.
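If you want to know what each member is actually doing, rs.status() is more trustworthy than local.slaves, because each member's position is refreshed via heartbeats. A minimal sketch (field names as reported by the 2.x shell):

// Run on any member; prints each member's replication position.
rs.status().members.forEach(function (m) {
    print(m.name + '  ' + m.stateStr + '  ' + m.optimeDate)
})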
log
- 192.168.159.133
- New DB files were created
A suspicious error from [conn9]
Tue Feb 5 19:15:31.560 [FileAllocator] creating directory /usr/local/mongo/data00/testdb/_tmp
Tue Feb 5 19:15:31.567 [FileAllocator] done allocating datafile /usr/local/mongo/data00/testdb/testdb.ns, size: 4MB, took 0 secs
Tue Feb 5 19:15:31.567 [FileAllocator] allocating new datafile /usr/local/mongo/data00/testdb/testdb.0, filling with zeroes...
Tue Feb 5 19:15:31.568 [FileAllocator] done allocating datafile /usr/local/mongo/data00/testdb/testdb.0, size: 64MB, took 0 secs
Tue Feb 5 19:15:31.569 [FileAllocator] allocating new datafile /usr/local/mongo/data00/testdb/testdb.1, filling with zeroes...
Tue Feb 5 19:15:31.571 [FileAllocator] done allocating datafile /usr/local/mongo/data00/testdb/testdb.1, size: 128MB, took 0.001 secs
Tue Feb 5 19:15:31.574 [conn50] build index testdb.testcol { _id: 1 }
Tue Feb 5 19:15:31.574 [conn50] build index done. scanned 0 total records. 0 secs
Tue Feb 5 19:15:31.588 [conn9] getMore: cursorid not found local.oplog.rs 969546968802276
- 192.168.159.134
- [rsGhostSync] There it is!!
Tue Feb 5 19:15:31.576 [FileAllocator] allocating new datafile /usr/local/mongo/data01/testdb/testdb.ns, filling with zeroes...
Tue Feb 5 19:15:31.576 [FileAllocator] creating directory /usr/local/mongo/data01/testdb/_tmp
Tue Feb 5 19:15:31.578 [FileAllocator] done allocating datafile /usr/local/mongo/data01/testdb/testdb.ns, size: 4MB, took 0 secs
Tue Feb 5 19:15:31.579 [FileAllocator] allocating new datafile /usr/local/mongo/data01/testdb/testdb.0, filling with zeroes...
Tue Feb 5 19:15:31.580 [FileAllocator] done allocating datafile /usr/local/mongo/data01/testdb/testdb.0, size: 64MB, took 0 secs
Tue Feb 5 19:15:31.581 [FileAllocator] allocating new datafile /usr/local/mongo/data01/testdb/testdb.1, filling with zeroes...
Tue Feb 5 19:15:31.583 [repl writer worker 1] build index testdb.testcol { _id: 1 }
Tue Feb 5 19:15:31.584 [FileAllocator] done allocating datafile /usr/local/mongo/data01/testdb/testdb.1, size: 128MB, took 0.002 secs
Tue Feb 5 19:15:31.584 [repl writer worker 1] build index done. scanned 0 total records. 0.001 secs
Tue Feb 5 19:15:31.588 [rsSyncNotifier] repl: old cursor isDead, will initiate a new one
Tue Feb 5 19:15:33.563 [rsGhostSync] handshake between 2 and 192.168.159.133:27017
Tue Feb 5 19:15:34.564 [slaveTracking] build index local.slaves { _id: 1 }
Tue Feb 5 19:15:34.565 [slaveTracking] build index done. scanned 0 total records. 0 secs
- 192.168.159.135
- The sync source switched to 192.168.159.134, and new DB files were created!
Tue Feb 5 19:15:33.544 [rsBackgroundSync] replSet syncing to: 192.168.159.134:27017
Tue Feb 5 19:15:33.547 [FileAllocator] allocating new datafile /usr/local/mongo/data02/testdb/testdb.ns, filling with zeroes...
Tue Feb 5 19:15:33.547 [FileAllocator] creating directory /usr/local/mongo/data02/testdb/_tmp
Tue Feb 5 19:15:33.550 [FileAllocator] done allocating datafile /usr/local/mongo/data02/testdb/testdb.ns, size: 4MB, took 0 secs
Tue Feb 5 19:15:33.550 [FileAllocator] allocating new datafile /usr/local/mongo/data02/testdb/testdb.0, filling with zeroes...
Tue Feb 5 19:15:33.550 [FileAllocator] done allocating datafile /usr/local/mongo/data02/testdb/testdb.0, size: 64MB, took 0 secs
Tue Feb 5 19:15:33.552 [FileAllocator] allocating new datafile /usr/local/mongo/data02/testdb/testdb.1, filling with zeroes...
Tue Feb 5 19:15:33.553 [repl writer worker 1] build index testdb.testcol { _id: 1 }
Tue Feb 5 19:15:33.553 [FileAllocator] done allocating datafile /usr/local/mongo/data02/testdb/testdb.1, size: 128MB, took 0 secs
Tue Feb 5 19:15:33.554 [repl writer worker 1] build index done. scanned 0 total records. 0 secs
Tue Feb 5 19:15:33.554 [rsSyncNotifier] replset setting oplog notifier to 192.168.159.134:27017
Ghost sync achieved!!
Incidentally, the 2 here refers to config.members._id in the replica set configuration.
[rsGhostSync] handshake between 2 and 192.168.159.133:27017
Omitted here, but of course {key:1} can be read from 192.168.159.135 just fine.
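To spell out that read check (a sketch using the same here-string style as above; rs.slaveOk() is needed because we are reading from a Secondary):

$ mongo 192.168.159.135:27017 <<<'
rs.slaveOk()
use testdb
db.testcol.find({key:1})
'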
4. Restore the network (192.168.159.133 <- 192.168.159.135)
# iptables -D INPUT -p tcp --dport 27017 -s 192.168.159.135 -j DROP
local.slaves
No change.
log
- 192.168.159.133
- The rsHealthPoll errors stop
Tue Feb 5 19:20:44.903 [rsHealthPoll] replset info 192.168.159.135:27017 thinks that we are down
Tue Feb 5 19:20:46.564 [snapshotthread] cpu: elapsed:4000 writelock: 0%
Tue Feb 5 19:20:50.564 [snapshotthread] cpu: elapsed:4000 writelock: 0%
Tue Feb 5 19:20:54.565 [snapshotthread] cpu: elapsed:4000 writelock: 0%
- 192.168.159.134
- Unconcerned...
- 192.168.159.135
- Detected that the Primary has recovered
Tue Feb 5 19:20:45.699 [rsHealthPoll] replSet member 192.168.159.133:27017 is up
Tue Feb 5 19:20:45.699 [rsHealthPoll] replSet member 192.168.159.133:27017 is now in state PRIMARY
Whether it has left the Ghost sync state or not is impossible to tell...
5. Partial network partition (192.168.159.133 <- 192.168.159.134)
# iptables -A INPUT -p tcp --dport 27017 -s 192.168.159.134 -j DROP
local.slaves
No change.
log
- 192.168.159.133
- Detected the failure of 192.168.159.134
Tue Feb 5 19:22:42.949 [rsHealthPoll] replset info 192.168.159.134:27017 thinks that we are down
Tue Feb 5 19:22:44.950 [rsHealthPoll] replset info 192.168.159.134:27017 thinks that we are down
- 192.168.159.134
- Detected that 192.168.159.133 is down
rsBackgroundSync fails
Tue Feb 5 19:22:41.069 [rsHealthPoll] DBClientCursor::init call() failed
Tue Feb 5 19:22:41.070 [rsHealthPoll] replSet info 192.168.159.133:27017 is down (or slow to respond):
Tue Feb 5 19:22:41.070 [rsHealthPoll] replSet member 192.168.159.133:27017 is now in state DOWN
Tue Feb 5 19:22:41.071 [rsMgr] not electing self, 192.168.159.135:27017 would veto with '192.168.159.134:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb 5 19:22:47.076 [rsMgr] not electing self, 192.168.159.135:27017 would veto with '192.168.159.134:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb 5 19:22:53.072 [rsMgr] not electing self, 192.168.159.135:27017 would veto with '192.168.159.134:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb 5 19:22:59.080 [rsMgr] not electing self, 192.168.159.135:27017 would veto with '192.168.159.134:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb 5 19:23:03.203 [rsBackgroundSync] Socket recv() timeout 192.168.159.133:27017
Tue Feb 5 19:23:03.203 [rsBackgroundSync] SocketException: remote: 192.168.159.133:27017 error: 9001 socket exception [3] server [192.168.159.133:27017]
Tue Feb 5 19:23:03.203 [rsBackgroundSync] replSet db exception in producer: 10278 dbclient error communicating with server: 192.168.159.133:27017
Tue Feb 5 19:23:05.071 [rsHealthPoll] replset info 192.168.159.133:27017 heartbeat failed, retrying
Tue Feb 5 19:23:05.083 [rsMgr] not electing self, 192.168.159.135:27017 would veto with '192.168.159.134:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb 5 19:23:11.085 [rsMgr] not electing self, 192.168.159.135:27017 would veto with '192.168.159.134:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb 5 19:23:17.087 [rsMgr] not electing self, 192.168.159.135:27017 would veto with '192.168.159.134:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
Tue Feb 5 19:23:18.071 [rsHealthPoll] replset info 192.168.159.133:27017 heartbeat failed, retrying
Tue Feb 5 19:23:23.090 [rsMgr] not electing self, 192.168.159.135:27017 would veto with '192.168.159.134:27017 is trying to elect itself but 192.168.159.133:27017 is already primary and more up-to-date'
- 192.168.159.135
- Unconcerned...
6. Insert data on 192.168.159.133
$ mongo 192.168.159.133:27017 <<<'
use testdb
db.testcol.save({key:2})
'
local.slaves
- 192.168.159.133
- No change
db.slaves.find()
{ "_id" : ObjectId("51136dc25310e984dbbcbdff"), "host" : "192.168.159.134", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1360227753000, "i" : 1 } }
{ "_id" : ObjectId("51136dc468286134d0bae2e4"), "host" : "192.168.159.135", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1360227753000, "i" : 1 } }
- 192.168.159.134
- No change
{ "_id" : ObjectId("51136dc468286134d0bae2e4"), "host" : "192.168.159.135", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1360228165000, "i" : 138} }
- 192.168.159.135
- As expected, the record moves over from the Primary
{ "_id" : ObjectId("51136dc25310e984dbbcbdff"), "host" : "192.168.159.134", "ns" : "local.oplog.rs", "syncedTo" : { "t" : 1360228784000, "i" : 229 } }
log
- 192.168.159.133
- Nothing much...
- 192.168.159.134
- Now syncing from 192.168.159.135
Tue Feb 5 19:24:33.117 [rsMgr] not electing self, we are not freshest
Tue Feb 5 19:24:35.224 [rsBackgroundSync] replSet syncing to: 192.168.159.135:27017
Tue Feb 5 19:24:35.288 [rsSyncNotifier] replset setting oplog notifier to 192.168.159.135:27017
- 192.168.159.135
- [rsGhostSync] Brokering between member _id=1 (192.168.159.134) and 192.168.159.133!
Tue Feb 5 19:24:34.909 [rsBackgroundSync] replSet syncing to: 192.168.159.133:27017
Tue Feb 5 19:24:34.923 [rsSyncNotifier] replset setting oplog notifier to 192.168.159.133:27017
Tue Feb 5 19:24:35.302 [rsGhostSync] handshake between 1 and 192.168.159.133:27017
Tue Feb 5 19:24:36.303 [slaveTracking] build index local.slaves { _id: 1 }
Tue Feb 5 19:24:36.304 [slaveTracking] build index done. scanned 0 total records. 0 secs
Omitted here, but of course {key:2} can be read from 192.168.159.134 just fine.
Summary
- In 2.3.2, Ghost sync works properly!
- The Ghost sync state is entered at the first oplog sync after a node failure is detected (i.e., the moment a switch actually becomes necessary).
- When a node goes back from the Ghost sync state to the normal state is unclear.
As an aside...
rs.syncFrom('[host]:[port]') (since 2.2)
In fact, you can also switch the sync source manually.
rs.syncFrom('192.168.159.133:27017')
slaveDelay and Ghost sync don't mix well!
There is a problem where a slaveDelay Secondary can end up being chosen as the ghost target.
No matter how much you write, the data you can read stays stale...
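For reference, a delayed member is usually configured like this (a minimal sketch with a hypothetical fourth host that is not part of the experiment above; priority: 0 and hidden: true normally accompany slaveDelay):

// Run on the Primary. The host below is hypothetical.
cfg = rs.conf()
cfg.members.push({
    _id: 3,
    host: '192.168.159.136:27017',
    priority: 0,       // must never become Primary
    hidden: true,      // invisible to normal clients
    slaveDelay: 3600   // stays one hour behind the Primary
})
rs.reconfig(cfg)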
Nodes that cannot reach the Primary
If a node cannot connect while the replica set is being built, it keeps retrying and cannot join for quite a while.
However, if connectivity is lost after the set is built, and that node is even restarted, it can still come back as a Secondary immediately.
The sync-source information does not seem to be persisted, though, so perhaps some information is relayed from the ghost target side.
I plan to dig into this later.