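Trying out kuromoji.js

I wanted to get the readings of sentences that users type into hubot, so I tried kuromoji.js, a pure-JavaScript morphological analyzer, to obtain the reading of a sentence. These are my notes.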

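What is kuromoji.js?

kuromoji.js is a JavaScript port, by @takuya_a, of the Java morphological analyzer Kuromoji.

Its distinguishing feature is that it is not a wrapper around an external morphological analysis program: it runs standalone, so it works in the browser as well as under Node.js.

(The rest of this post assumes an environment where Node.js and CoffeeScript are already installed.)

Reference: ブラウザで自然言語処理 – JavaScriptの形態素解析器kuromoji.jsを作った (Natural language processing in the browser: I built kuromoji.js, a JavaScript morphological analyzer)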

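Trying it out

kuromoji.js can be installed with npm or bower, and the dictionary ships with the package, so it is easy to get started.

First, install it into the current directory with npm.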

$ npm install kuromoji

Next, create the following kuromoji-test.coffee in the same directory.

kuromoji = require 'kuromoji'

# Path to the dictionary files installed with the npm package
DIC_URL = "node_modules/kuromoji/dist/dict/"

kuromoji.builder({ dicPath: DIC_URL }).build (err, tokenizer) ->
  throw err if err  # stop here if the dictionary could not be loaded

  tokens = tokenizer.tokenize(process.argv[2])
  console.log tokens

  readings = []
  for token in tokens
    if token['reading']
      readings.push token['reading']
  console.log readings.join('')

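I wrote it in CoffeeScript with hubot in mind.
The string to analyze is passed as a command-line argument.
The first console.log (line 10) prints the result of the morphological analysis, and the last one (line 16) prints the reading of the string built from that result.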
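
The reading-extraction step can also be pulled out into a small standalone function. The sketch below is plain JavaScript rather than CoffeeScript, so it runs under Node.js without a compile step. `joinReadings` and `sampleTokens` are illustrative names, not part of kuromoji.js, and the sample objects are written by hand with only the fields the function needs.

```javascript
// Hypothetical helper: concatenate the `reading` of every token that has one.
// Tokens without a reading are skipped, as in kuromoji-test.coffee.
function joinReadings(tokens) {
  return tokens
    .filter((token) => token.reading)
    .map((token) => token.reading)
    .join('');
}

// Hand-written stand-in for the array returned by tokenizer.tokenize(...).
const sampleTokens = [
  { surface_form: '残像', reading: 'ザンゾウ' },
  { surface_form: 'に', reading: 'ニ' },
  { surface_form: '口紅', reading: 'クチベニ' },
  { surface_form: 'を', reading: 'ヲ' },
];

console.log(joinReadings(sampleTokens)); // ザンゾウニクチベニヲ
```

Keeping this step free of any kuromoji call makes it easy to check without loading the dictionary.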

Running it

Running the source above with CoffeeScript shows the result of the morphological analysis.

$ coffee kuromoji-test.coffee "残像に口紅を"
[ { word_id: 678760,
    word_type: 'KNOWN',
    word_position: 1,
    surface_form: '残像',
    pos: '名詞',
    pos_detail_1: '一般',
    pos_detail_2: '*',
    pos_detail_3: '*',
    conjugated_type: '*',
    conjugated_form: '*',
    basic_form: '残像',
    reading: 'ザンゾウ',
    pronunciation: 'ザンゾー' },
  { word_id: 2594290,
    word_type: 'KNOWN',
    word_position: 3,
    surface_form: 'に',
    pos: '助詞',
    pos_detail_1: '格助詞',
    pos_detail_2: '一般',
    pos_detail_3: '*',
    conjugated_type: '*',
    conjugated_form: '*',
    basic_form: 'に',
    reading: 'ニ',
    pronunciation: 'ニ' },
  { word_id: 521680,
    word_type: 'KNOWN',
    word_position: 4,
    surface_form: '口紅',
    pos: '名詞',
    pos_detail_1: '一般',
    pos_detail_2: '*',
    pos_detail_3: '*',
    conjugated_type: '*',
    conjugated_form: '*',
    basic_form: '口紅',
    reading: 'クチベニ',
    pronunciation: 'クチベニ' },
  { word_id: 2595140,
    word_type: 'KNOWN',
    word_position: 6,
    surface_form: 'を',
    pos: '助詞',
    pos_detail_1: '格助詞',
    pos_detail_2: '一般',
    pos_detail_3: '*',
    conjugated_type: '*',
    conjugated_form: '*',
    basic_form: 'を',
    reading: 'ヲ',
    pronunciation: 'ヲ' } ]  
ザンゾウニクチベニヲ
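
Each token carries both a `reading` and a `pronunciation`, and the output above shows that they can differ (ザンゾウ vs. ザンゾー for 残像). The following is a minimal JavaScript sketch for spotting such tokens. `outputTokens` is a hand-copied subset of the fields printed above, and `readingDiffers` is a hypothetical helper name, not a kuromoji.js function.

```javascript
// Subset of the fields printed above for "残像に口紅を".
const outputTokens = [
  { surface_form: '残像', reading: 'ザンゾウ', pronunciation: 'ザンゾー' },
  { surface_form: 'に', reading: 'ニ', pronunciation: 'ニ' },
  { surface_form: '口紅', reading: 'クチベニ', pronunciation: 'クチベニ' },
  { surface_form: 'を', reading: 'ヲ', pronunciation: 'ヲ' },
];

// Hypothetical helper: surface forms whose reading and pronunciation differ.
function readingDiffers(tokens) {
  return tokens
    .filter((t) => t.reading !== t.pronunciation)
    .map((t) => t.surface_form);
}

console.log(readingDiffers(outputTokens)); // [ '残像' ]
```

Which field to use depends on the goal: the script in this post uses `reading`.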

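Notes

Based on the script above, I added the morphological analysis feature to hubot and ran it as a Slack bot.
When I ran hubot on Heroku, it failed with an error and the bot did not start. I have not pinned down the cause, but it looks like the process ran out of memory.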
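
One rough way to check the memory suspicion is to compare the process's resident set size before and after the dictionary is built. The sketch below uses Node.js's `process.memoryUsage()`. `reportRssGrowth` and `fakeLoad` are hypothetical names, and the stand-in loader only allocates an array so that the example runs without kuromoji installed. RSS is a coarse figure, so treat the number as a hint rather than a measurement.

```javascript
// Hypothetical helper: run an async loader and report how much the
// process's resident set size (RSS) changed while it ran.
function reportRssGrowth(label, load, done) {
  const before = process.memoryUsage().rss;
  load((err, result) => {
    const deltaMb = (process.memoryUsage().rss - before) / (1024 * 1024);
    console.log(`${label}: RSS changed by ${deltaMb.toFixed(1)} MB`);
    done(err, result, deltaMb);
  });
}

// Stand-in loader so the sketch runs on its own; it just allocates an array.
const fakeLoad = (cb) => cb(null, new Array(1e6).fill(0));

reportRssGrowth('stand-in loader', fakeLoad, (err) => {
  if (err) throw err;
});

// With kuromoji installed as in this post, the loader would instead be:
//   (cb) => kuromoji.builder({ dicPath: DIC_URL }).build(cb)
```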

Separately, I ran hubot on a VPS and confirmed that it can return the reading of an input sentence.
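
The bot script itself is not shown in this post, so the following is only a sketch of how such a hubot script might look, in JavaScript. `yomiScript`, the `yomi` command name, and the injected `loadTokenizer` callback are all assumptions made for illustration. `robot.respond`, `res.match`, `res.send` and `robot.logger` are the standard hubot scripting interface. The tokenizer is built once at startup and reused for every message.

```javascript
// Hypothetical hubot script factory: replies with the reading of a message.
// `loadTokenizer(callback)` is injected so the dictionary loading stays separate.
function yomiScript(loadTokenizer) {
  return (robot) => {
    let tokenizer = null;

    // Build the tokenizer once, when the script is loaded.
    loadTokenizer((err, built) => {
      if (err) {
        robot.logger.error(err);
        return;
      }
      tokenizer = built;
    });

    // Example trigger (assumed): "hubot yomi 残像に口紅を"
    robot.respond(/yomi (.+)/i, (res) => {
      if (!tokenizer) {
        res.send('The dictionary is still loading.');
        return;
      }
      const reading = tokenizer
        .tokenize(res.match[1])
        .filter((token) => token.reading) // skip tokens without a reading
        .map((token) => token.reading)
        .join('');
      res.send(reading);
    });
  };
}

// Wiring with kuromoji as used in kuromoji-test.coffee (not executed here):
//   const kuromoji = require('kuromoji');
//   module.exports = yomiScript((cb) =>
//     kuromoji.builder({ dicPath: 'node_modules/kuromoji/dist/dict/' }).build(cb));
```

Injecting the loader keeps the message handling testable with a fake tokenizer, without touching the dictionary.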

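After trying various inputs, I found that some words are not analyzed correctly, so I am thinking of trying mecab-ipadic-NEologd, a dictionary that is said to cover many words and neologisms. I plan to write that up in a separate post.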