[PYTHON] Language processing 100 knocks-21: Extract lines containing category names

This is a record of knock 21, "Extract lines containing category names", from "Chapter 3: Regular Expressions" of Language Processing 100 Knocks 2015 (.ac.jp/nlp100/#ch3). The previous knock was preparation; this one starts the actual regular expression practice. It draws heavily on basics I had picked up by searching the web: **raw strings, re.VERBOSE, re.MULTILINE, and triple quotes**.

Reference link

| Link | Remarks |
|:--|:--|
| 021.Extract rows containing category names.ipynb | GitHub link to the answer program |
| 100 amateur language processing knocks:21 | Source from which many parts of the code were copied |
| Python regular expression basics and tips to learn from scratch | My summary of what I learned in this knock |
| Regular expression HOWTO | Official Python regular expression HOWTO |
| re --- Regular expression operations | Official Python description of the re package |
| Help:Quick Reference | Quick reference for typical Wikipedia markup |

environment

| type | version | Contents |
|:--|:--|:--|
| OS | Ubuntu18.04.01 LTS | Running in a virtual machine |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed using venv |

In the above environment, I am using the following additional Python packages. Just install with regular pip.

| type | version |
|:--|:--|
| pandas | 0.25.3 |

Chapter 3: Regular Expressions

content of study

By applying regular expressions to the markup description on Wikipedia pages, various information and knowledge can be extracted.

Regular Expressions, JSON, Wikipedia, InfoBox, Web Services

Knock content

There is a file jawiki-country.json.gz that exports Wikipedia articles in the following format.

- Each line stores the information for one article in JSON format
- In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out in JSON format
- The entire file is gzipped
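The file format described above can be read directly with the standard library. The following is only a minimal sketch, not the answer program: the helper name `read_article_text` and the tiny stand-in file `sample.json.gz` are assumptions made for illustration, so that the example runs without the real jawiki-country.json.gz.

```python
import gzip
import json
import os
import tempfile


def read_article_text(path, title):
    """Return the 'text' of the first article whose 'title' matches, or None."""
    # mode='rt' decompresses and decodes to text, one JSON object per line
    with gzip.open(path, mode='rt', encoding='utf-8') as f:
        for line in f:
            article = json.loads(line)
            if article['title'] == title:
                return article['text']
    return None


# Hypothetical stand-in data in the same layout as the real file
sample_articles = [
    {'title': 'A', 'text': 'Body of A\n[[Category:Sample]]'},
    {'title': 'B', 'text': 'Body of B'},
]
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.json.gz')
with gzip.open(sample_path, mode='wt', encoding='utf-8') as f:
    for article in sample_articles:
        f.write(json.dumps(article) + '\n')

print(read_article_text(sample_path, 'A'))
```

Reading line by line avoids loading every article into memory at once, which may matter for larger dumps.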

Create a program that performs the following processing.

21. Extract lines containing category name

Extract the line that declares the category name in the article.

Problem supplement (about "category name")

According to Help:Quick Reference, a "category name" is written in the format [[Category:Help|Hayami Hiyo]]. The goal is to extract the following part of the file with a regular expression.

Excerpt from the "category name" part of the file


[[Category:England|*]]\n'
[[Category:Commonwealth Kingdom|*]]\n'
[[Category:G8 member countries]]\n'
[[Category:European Union member states]]\n'
[[Category:Maritime nation]]\n'
[[Category:Sovereign country]]\n'
[[Category:Island country|Kureito Furiten]]\n'
[[Category:States / Regions Established in 1801]]'

Answer

Answer program [021. Extract lines containing category name.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/03.%E6%AD%A3%E8%A6%8F%E8%A1%A8%E7%8F%BE/021.1%E3%82%AB%E3%83%86%E3%82%B4%E3%83%AA%E5%90%8D%E3%82%92%E5%90%AB%E3%82%80%E8%A1%8C%E3%82%92%E6%8A%BD%E5%87%BA.ipynb)

from pprint import pprint
import re

import pandas as pd

def extract_by_title(title):
    df_wiki = pd.read_json('jawiki-country.json', lines=True)
    return df_wiki[(df_wiki['title'] == title)]['text'].values[0]

wiki_body = extract_by_title('England')

# The r prefix makes a raw string, so backslash escape sequences are not interpreted
# Triple quotes let the pattern span multiple lines
# re.VERBOSE ignores whitespace and comments inside the pattern
# re.MULTILINE makes ^ and $ match at the start and end of every line
pprint(re.findall(r'''
                     ^                  # Start of line (the result is the same without it, but it is included)
                     (                  # Start of group
                     .*                 # Any characters, zero or more
                     \[\[Category:      # Search term (\ escapes the bracket)
                     .*                 # Any characters, zero or more
                     \]\]               # Search term (\ escapes the bracket)
                     .*                 # Any characters, zero or more
                     )                  # End of group
                     $                  # End of line (the result is the same without it, but it is included)
                     ''', wiki_body, re.MULTILINE | re.VERBOSE))
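For comparison, the same lines can be picked out by splitting the text into lines and testing each one. This is a rough sketch rather than the answer program: `sample_body` is a made-up stand-in for `wiki_body`, and the names `category_line` and `category_lines` are introduced here only for illustration.

```python
import re

# Hypothetical stand-in for wiki_body (the real text comes from the JSON file)
sample_body = '''{{Infobox}}
Some article text.
[[Category:England|*]]
[[Category:Island country|Kureito Furiten]]
See also [[Other page]]'''

# A line qualifies if it contains [[Category:...]] anywhere
category_line = re.compile(r'\[\[Category:.*\]\]')

category_lines = [line for line in sample_body.splitlines()
                  if category_line.search(line)]
print(category_lines)
# ['[[Category:England|*]]', '[[Category:Island country|Kureito Furiten]]']
```

Here the line splitting is done by `splitlines()` instead of by `re.MULTILINE`, so the pattern itself needs neither `^` nor `$`.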

Answer commentary

The main subject of this knock is as follows.

pprint(re.findall(r'''
                     ^                  # Start of line (the result is the same without it, but it is included)
                     (                  # Start of group
                     .*                 # Any characters, zero or more
                     \[\[Category:      # Search term (\ escapes the bracket)
                     .*                 # Any characters, zero or more
                     \]\]               # Search term (\ escapes the bracket)
                     .*                 # Any characters, zero or more
                     )                  # End of group
                     $                  # End of line (the result is the same without it, but it is included)
                     ''', wiki_body, re.MULTILINE | re.VERBOSE))

Get all search results with findall function

The findall function **returns all strings that match the pattern as a list**. The following example extracts all adverbs that end in ly (\w matches alphanumeric characters and underscores).

findall example


>>> text = "He was carefully disguised but captured quickly by police."
>>> re.findall(r"\w+ly", text)
['carefully', 'quickly']
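One detail that matters for this knock is how findall treats groups: when the pattern contains a group, the group contents are returned instead of the whole match. The short sketch below reuses the sentence from the example above; the extra patterns are my own additions for illustration.

```python
import re

text = "He was carefully disguised but captured quickly by police."

# No group: each list item is the whole match
print(re.findall(r"\w+ly", text))
# ['carefully', 'quickly']

# One group: each list item is only the text captured by the group
print(re.findall(r"(\w+)ly", text))
# ['careful', 'quick']

# Several groups: each list item is a tuple with one entry per group
print(re.findall(r"(\w)(\w+)ly", text))
# [('c', 'areful'), ('q', 'uick')]
```

The answer program wraps the whole line in a single group, so findall returns the full line either way.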

[raw string](https://qiita.com/FukuharaYohei/items/459f27f0d7bbba551af7#raw%E6%96%87%E5%AD%97%E5%88%97%E3%81%A7%E3%82%A8%E3%82%B9%E3%82%B1%E3%83%BC%E3%83%97%E3%82%B7%E3%83%BC%E3%82%B1%E3%83%B3%E3%82%B9%E7%84%A1%E5%8A%B9)

Prefixing the opening quotation mark with r makes the literal a raw string, in which escape sequences are not interpreted. **A regular expression pattern is hard to read when Python's own escape sequences get in the way, so write it as a raw string to disable them**.

Raw string print output example


>>> print('a\tb\nA\tB')
a   b
A   B

>>> print(r'a\tb\nA\tB')
a\tb\nA\tB
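The effect on regular expressions can be seen with \b, which means "word boundary" to the re module but "backspace" to an ordinary Python string literal. The sample text below is made up for illustration.

```python
import re

text = 'a cat scattered catnip'

# Raw string: the two characters \ and b reach the re module,
# which reads them as a word boundary
print(re.findall(r'\bcat\b', text))
# ['cat']

# Ordinary string: Python turns \b into a backspace character first,
# so the pattern looks for backspaces that the text does not contain
print(re.findall('\bcat\b', text))
# []

# The raw literal holds two characters, the ordinary one holds one
print(len(r'\b'), len('\b'))
# 2 1
```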

[Triple quotes](https://qiita.com/FukuharaYohei/items/459f27f0d7bbba551af7#%E3%83%88%E3%83%AA%E3%83%97%E3%83%AB%E3%82%AF%E3%82%A9%E3%83%BC%E3%83%88%E3%81%A8reverbose%E3%81%A7%E6%94%B9%E8%A1%8C%E3%82%B3%E3%83%A1%E3%83%B3%E3%83%88%E7%A9%BA%E7%99%BD%E7%84%A1%E8%A6%96)

Enclosing a regular expression pattern in triple quotes (''' or """) lets it span multiple lines. **Breaking the pattern across lines makes it easier to read**.

Triple quote usage example


# Without re.VERBOSE, the line breaks and spaces below are part of the pattern
a = re.compile(r'''\d +
                   \.  
                   \d *''')

[re.VERBOSE](https://qiita.com/FukuharaYohei/items/459f27f0d7bbba551af7#%E3%83%88%E3%83%AA%E3%83%97%E3%83%AB%E3%82%AF%E3%82%A9%E3%83%BC%E3%83%88%E3%81%A8reverbose%E3%81%A7%E6%94%B9%E8%A1%8C%E3%82%B3%E3%83%A1%E3%83%B3%E3%83%88%E7%A9%BA%E7%99%BD%E7%84%A1%E8%A6%96)

Passing re.VERBOSE in the flags parameter lets you put whitespace and comments inside the regular expression pattern (both are optional). **Adding comments and spacing makes the pattern easier to read**. It is a readability technique used together with triple quotes.

re.VERBOSE usage example


a = re.compile(r'''\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits''', re.VERBOSE)
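To check what the flag changes, the sketch below compares the commented pattern with a compact one-line equivalent and with the same layout compiled without re.VERBOSE. The variable names and test strings are assumptions made for illustration.

```python
import re

verbose = re.compile(r'''\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits''', re.VERBOSE)
compact = re.compile(r'\d+\.\d*')

# Both patterns accept and reject the same inputs
for s in ['3.14', '10.', 'abc']:
    assert bool(verbose.match(s)) == bool(compact.match(s))

# Same layout without re.VERBOSE: the spaces and line breaks are now
# literal characters that the input would have to contain
literal = re.compile(r'''\d +
                         \.
                         \d *''')
print(literal.match('3.14'))
# None

# In verbose mode a literal space has to be written explicitly,
# for example inside a character class
spaced = re.compile(r'''\d+ [ ] \d+   # two numbers separated by one space''',
                    re.VERBOSE)
print(spaced.match('12 34').group())
# 12 34
```

The last case is the main thing to watch for when converting an existing pattern to verbose style.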

Search multiple lines with re.MULTILINE

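With re.MULTILINE, ^ and $ match at the start and end of every line instead of only at the start and end of the whole string. Use it when you want to search each line of a multi-line string individually.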

re.MULTILINE usage example


# Plain (non-raw) string: the trailing backslash joins the opening line,
# so the text starts directly with "1st line"
string = '''\
1st line
2nd line'''

# Every line is searched
print(re.findall(r'^.*line', string, re.MULTILINE))
# ['1st line', '2nd line']

# Only the first line is searched
print(re.findall(r'^.*line', string))
# ['1st line']
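The flag affects $ in the same way as ^. A small sketch with a made-up two-line string:

```python
import re

text = 'first line\nsecond line'

# Without the flag, $ matches only at the end of the whole string
print(re.findall(r'\w+$', text))
# ['line']

# With the flag, $ also matches just before each line break
print(re.findall(r'\w+$', text, re.MULTILINE))
# ['line', 'line']

# Likewise ^ matches only at the very start without the flag
print(re.findall(r'^\w+', text))
# ['first']

# and at the start of every line with it
print(re.findall(r'^\w+', text, re.MULTILINE))
# ['first', 'second']
```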

Output result (execution result)

When the program is executed, the following results will be output.

Output result


['[[Category:England|*]]',
 '[[Category:Commonwealth Kingdom|*]]',
 '[[Category:G8 member countries]]',
 '[[Category:European Union member states]]',
 '[[Category:Maritime nation]]',
 '[[Category:Sovereign country]]',
 '[[Category:Island country|Kureito Furiten]]',
 '[[Category:States / Regions Established in 1801]]']
