Trying out river-jdbc - radio-keios's diary

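river-jdbc is a handy plugin that feeds data stored in MySQL into Elasticsearch.
There are a couple of things to watch out for when using it, though.
With the default settings and a large number of rows, the number of rows in MySQL and the number of documents added to Elasticsearch sometimes do not match.
In that case, changing max_bulk_requests from its default of 30 may fix it.
One more point: the index is created automatically if it does not exist yet, but it is better to set up the mapping in advance.

Create an Index named "posts" with a Type named "post":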

curl -XPOST localhost:9200/posts/ -d '
{
  "mappings": {
    "post": {
      "properties": {
        "id": { "type": "integer", "index": "not_analyzed" },
        "title": { "type": "string", "index": "not_analyzed" },
        "body": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}
'

Extract data from the "posts" table of the "sample" database and feed it into Elasticsearch:

{
  "type" : "jdbc",
  "jdbc": {
    "url" : "jdbc:mysql://localhost:3306/sample",
    "user" : "root",
    "password" : "",
    "sql" : "select id,title,body from posts",
    "index" : "posts",
    "type" : "post",
    "bulk_size" : 100,
    "max_bulk_requests" : 1 // this is the key setting
  }
}
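
The definition above still has to be registered with Elasticsearch. Rivers are registered by sending a PUT to _river/<river name>/_meta, so here is a minimal Python sketch of that step. It assumes a node on localhost:9200 as in the curl example, and the river name "my_jdbc_river" and all function names are made up for illustration. Treat it as a rough outline rather than the plugin's official procedure.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed local node, as in the curl example


def build_river_definition(max_bulk_requests=1, bulk_size=100):
    """Return the river definition shown above as a plain dict."""
    return {
        "type": "jdbc",
        "jdbc": {
            "url": "jdbc:mysql://localhost:3306/sample",
            "user": "root",
            "password": "",
            "sql": "select id,title,body from posts",
            "index": "posts",
            "type": "post",
            "bulk_size": bulk_size,
            "max_bulk_requests": max_bulk_requests,
        },
    }


def build_register_request(river_name, definition):
    """Build (but do not send) the PUT request that registers a river."""
    url = "{}/_river/{}/_meta".format(ES_URL, river_name)
    body = json.dumps(definition).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def register_river(river_name, definition):
    """Send the request and return the raw response body as text."""
    request = build_register_request(river_name, definition)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With a node running, calling register_river("my_jdbc_river", build_river_definition()) would send the definition. Keeping max_bulk_requests as a parameter makes it easy to retry with a different value when documents go missing.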

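When I tried it with about 320,000 rows, only about 180,000 were registered with max_bulk_requests left at its default.
After changing it to 8, all rows were registered. Registration took roughly 10 to 20 seconds.
I have not checked the source code, but for now this registered everything without any gaps.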
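
To notice this kind of gap, it helps to compare the row count in MySQL with the document count in Elasticsearch after the river has run. The following is a minimal sketch using Elasticsearch's _count endpoint. It assumes the MySQL count was obtained separately (for example with select count(*) from posts), and the function names are hypothetical.

```python
import json
import urllib.request


def parse_count(response_body):
    """Extract the "count" field from a _count API response body."""
    return int(json.loads(response_body)["count"])


def fetch_es_count(base_url="http://localhost:9200", index="posts", doc_type="post"):
    """Ask Elasticsearch how many documents of the given type are indexed."""
    url = "{}/{}/{}/_count".format(base_url, index, doc_type)
    with urllib.request.urlopen(url) as response:
        return parse_count(response.read().decode("utf-8"))


def missing_documents(source_rows, indexed_docs):
    """Number of source rows that did not make it into the index."""
    return max(source_rows - indexed_docs, 0)
```

For example, missing_documents(mysql_rows, fetch_es_count()) should be 0 once everything has been registered. The count only reflects documents that have become searchable, so a check made right after the river finishes may briefly come up short.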