About this article

Last time, I tried principal component analysis on text data. I wanted to try it on a different dataset, so this time I will take on principal component analysis of the Livedoor News Corpus published by RONDHUIT Co., Ltd.

As preprocessing, I read the text files, which are split one per article, in sequence, run morphological analysis on each, and combine the results into a single CSV file.

I used janome as the morphological analysis library.

Reference

- Livedoor News Corpus

Livedoor News Corpus Directory Structure

If you download the file from the above link and unzip it, you get 9 category folders, such as it-life-hack, under the text folder. Each folder stores the articles of that category, one article per file.
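To confirm that the extracted folders look as described, a small sketch like the following can count the article files per category. The function name `count_articles` is hypothetical, and the set of non-article file names is the same one skipped in the preprocessing program.

```python
import pathlib
from collections import Counter

# Files that ship with the corpus but are not articles
NON_ARTICLE_FILES = {"CHANGES.txt", "README.txt", "LICENSE.txt"}


def count_articles(root):
    """Return {category folder name: number of article files} under root.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    # Only look one level down: <root>/<category>/<article>.txt
    for p in pathlib.Path(root).glob("*/*.txt"):
        if p.name in NON_ARTICLE_FILES:
            continue
        counts[p.parent.name] += 1
    return dict(counts)
```

Calling `count_articles('c:/temp/text')`, assuming the archive was extracted there, should return one entry per category folder.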

Preprocessing program

```python
import pathlib

import pandas as pd
from janome.tokenizer import Tokenizer

tnz = Tokenizer()

# Folder where the corpus archive was extracted
pth = pathlib.Path('c:/temp/text')

rows = []
for p in pth.glob('**/*.txt'):
    # Skip files that are not article data
    if p.name in ['CHANGES.txt', 'README.txt', 'LICENSE.txt']:
        continue

    # Read the article line by line, run morphological analysis with janome,
    # and keep the result in the list, one word per row
    with open(p, 'r', encoding='utf-8-sig') as f:
        rows.extend([[p.parent.name, p.name, t.surface, t.part_of_speech] for s in f for t in tnz.tokenize(s)])

# Convert the list to a DataFrame and name the columns
# (building it directly from the list avoids an intermediate NumPy array)
df = pd.DataFrame(rows, columns=['Article classification', 'file name', 'word', 'Part of speech'])

# Write the DataFrame to a CSV file
df.to_csv('c:/temp/livedoor_corpus.csv', index=False)
```
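
As a sanity check on the output, the following sketch counts the token rows per category. It runs on a tiny hand-made DataFrame with the same four columns as the CSV. The function name `tokens_per_category`, the file names, and the sample rows are assumptions for illustration, not real corpus data.

```python
import pandas as pd


def tokens_per_category(df):
    """Count token rows per article category (hypothetical helper)."""
    return df.groupby("Article classification")["word"].count().to_dict()


# Tiny hand-made sample with the same columns as the CSV (illustrative values only)
sample = pd.DataFrame(
    [
        ["it-life-hack", "a.txt", "猫", "名詞,一般,*,*"],
        ["it-life-hack", "a.txt", "が", "助詞,格助詞,一般,*"],
        ["sports-watch", "b.txt", "走る", "動詞,自立,*,*"],
    ],
    columns=["Article classification", "file name", "word", "Part of speech"],
)
print(tokens_per_category(sample))
```

To run the same check on the real output, pass `pd.read_csv('c:/temp/livedoor_corpus.csv')` instead of `sample`.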