[PYTHON] How to use Japanese with NLTK plot

Output result

figure_1.gif

Overview

This post shows how to display Japanese text in the plots (graph output) produced by NLTK, a natural language processing library. The chapter "Japanese Natural Language Processing with Python" in the O'Reilly book "Introduction to Natural Language Processing" ([-> English version [free]](http://www.nltk.org/book/)) warns: "However, note that Japanese characters are garbled by default in matplotlib." I couldn't find a solution, so I worked one out myself.

Prerequisite knowledge

-> Japanese natural language processing with Python

Environment

Linux Mint 13 (Ubuntu 12.04)

Code

NLTK Japanese plot.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Python 2 only: make UTF-8 the default for implicit str/unicode conversions
import sys
reload(sys)
sys.setdefaultencoding('UTF-8')

import MeCab
import nltk
from numpy import *
from nltk.corpus.reader import *
from nltk.corpus.reader.util import *
from nltk.text import Text
import jptokenizer

### Point 1: explicitly set a Japanese font as matplotlib's default ###
import matplotlib
import matplotlib.font_manager as font_manager
# Path to the TTF (font) file
font_path = '/usr/share/fonts/truetype/fonts-japanese-gothic.ttf'
# Load the font's properties from the file
font_prop = font_manager.FontProperties(fname=font_path)
# Look up the font's family name and make it matplotlib's default font
matplotlib.rcParams['font.family'] = font_prop.get_name()

### Point 2: build the Japanese corpus so that words are handled as unicode ###
# Load the corpus
jp_sent_tokenizer = nltk.RegexpTokenizer(u'[^　「」！？。]*[！？。]')
reader = PlaintextCorpusReader("/home/User/desktop", r'NKMK.txt',
                                encoding='utf-8',
                                para_block_reader=read_line_block,
                                sent_tokenizer=jp_sent_tokenizer,
                                word_tokenizer=jptokenizer.JPMeCabTokenizer())
# Convert every word read from the corpus to unicode
nkmk = Text(unicode(w) for w in reader.words())

### Point 3: pass the plot arguments as unicode too ###
nkmk.dispersion_plot([u'Nico',u'Maki',u'Here',u'Heart'])
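
The sentence tokenizer in the script can be tried out on its own. The following is a minimal sketch using only the standard `re` module, on the assumption that the pattern is meant to contain full-width Japanese punctuation and that a `RegexpTokenizer` in its default mode returns the same matches as `re.findall`. The name `split_sentences` and the sample text are made up for illustration.

```python
# -*- coding: utf-8 -*-
import re

# A sentence is a run of characters that are not a full-width space,
# a bracket, or a terminator, followed by exactly one terminator.
SENTENCE_PATTERN = re.compile(u'[^　「」！？。]*[！？。]')


def split_sentences(text):
    """Return the sentences found in text, each with its terminator."""
    return SENTENCE_PATTERN.findall(text)


sample = u'今日は晴れです。明日は雨でしょうか？'
for sentence in split_sentences(sample):
    print(sentence)
```

Text that does not end in one of the terminators is not matched, so a trailing fragment without 。, ！ or ？ is silently dropped.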

Commentary

(See comments in the source)
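
Point 1 depends on the font file actually existing at the given path, which varies between distributions. As a rough sketch of one way to make that step more robust, the helpers below pick the first font file that exists and register it with matplotlib. This is written for Python 3 and a recent matplotlib (`fontManager.addfont` needs matplotlib 3.2 or later), not the Python 2 setup above. The function names and the second candidate path are assumptions for illustration.

```python
import os


def first_existing_font(candidates):
    """Return the first path in candidates that is an existing file, else None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def use_font_as_default(font_path):
    """Register the font file and make its family matplotlib's default."""
    # Imported here so the path lookup above works without matplotlib.
    import matplotlib
    from matplotlib import font_manager

    # Register the file so matplotlib can resolve the family name.
    font_manager.fontManager.addfont(font_path)
    name = font_manager.FontProperties(fname=font_path).get_name()
    matplotlib.rcParams['font.family'] = name
    return name


# Hypothetical candidate list; adjust it to your own system.
CANDIDATES = [
    # Path used in the script above
    '/usr/share/fonts/truetype/fonts-japanese-gothic.ttf',
    # Assumed location of another Japanese-capable font
    '/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc',
]
```

Before plotting, call `first_existing_font(CANDIDATES)` and, if it returns a path rather than `None`, pass that path to `use_font_as_default`.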

Open issue

The labels drawn by ConditionalFreqDist.plot() cannot be displayed in Japanese. In /usr/local/lib/python2.7/dist-packages/nltk/probability.py, line 1790 reads "kwargs['label'] = str(condition)". In other words, the label string is passed through str(), so Japanese text is always garbled. The fix is to change that line to "kwargs['label'] = unicode(condition)". Similar cases elsewhere will presumably require modifying the library in the same way.
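
The general failure mode behind this can be illustrated without NLTK. In Python 2, str() applied to a unicode object encodes it with the default encoding, which the script above sets to UTF-8, so the label becomes a byte string. The sketch below (Python 3, where bytes and text are separate types) assumes those bytes are later read as one character per byte, which is one common way such garbling arises; it does not reproduce matplotlib's actual internals.

```python
# -*- coding: utf-8 -*-

label = u'日本語'

# What str() effectively produces in Python 2 with a UTF-8 default encoding:
utf8_bytes = label.encode('utf-8')

# Reading those bytes one character per byte (Latin-1) yields mojibake.
garbled = utf8_bytes.decode('latin-1')

print(len(label), len(utf8_bytes))  # 3 characters become 9 bytes
print(repr(garbled))

# Nothing is lost, only misread: reversing the steps recovers the label.
recovered = garbled.encode('latin-1').decode('utf-8')
print(recovered == label)
```

Keeping the label as a text object from start to finish, which is what the unicode(condition) change does, avoids the misreading altogether.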

[Before correction] figure_2.jpeg

[Revised] figure_3.jpg

Reference site

-> About Japanese in Matplotlib
-> How to output Japanese with plot() of nltk.FreqDist and nltk.ConditionalFreqDist - (Mainly) Programming memo
