[Chapter 5] Introduction to Python with 100 knocks of language processing

This article is a continuation of my Introduction to Python with 100 Knocks series. It mainly explains classes, using Chapter 5 of the 100 knocks.

First, let's install CaboCha and parse Aozora Bunko's "I am a cat".

!cabocha -f1 < neko.txt > neko.txt.cabocha
!head -n15 neko.txt.cabocha
* 0 -1D 0/0 0.000000

one	noun, number, *, *, *, *, one, one, one
EOS
EOS
* 0 2D 0/0 -0.764522
 	symbol, blank, *, *, *, *, , ,
* 1 2D 0/1 -0.764522
I	noun, pronoun, general, *, *, *, I, wagahai, wagahai
ha	particle, binding particle, *, *, *, *, ha, ha, wa
* 2 -1D 0/2 0.000000
cat	noun, general, *, *, *, *, cat, neko, neko
de	auxiliary verb, *, *, *, special/da, continuative form, da, de, de
aru	auxiliary verb, *, *, *, godan/ra-row aru, base form, aru, aru, aru
.	symbol, kuten, *, *, *, *, ., ., .
EOS

Lines of the form "* clause-number destination-clause-numberD" have been added.

Classes

What is object-oriented programming in the first place? Going down that road invites controversy, so I will focus on Python classes. It will probably be easier to follow if you know a little Java.

Why use classes

In terms of this chapter's problems: if you try to handle the hierarchical structure of sentence -> clause -> morpheme without classes, the variables and functions for sentences, clauses, and morphemes get mixed together, and the for statements become hard to follow. Classes let you group functions by the kind of data they handle and keep variable scope small, which makes coding a little easier.

Explanation of various terms

Let's look at an example of the date data type datetime.date.

import datetime
#Instantiation
a_day = datetime.date(2022,2,22)
print(a_day)
print(type(a_day))
#Instance variables
print(a_day.year, a_day.month, a_day.day)
#Class variables
print(datetime.date.max, a_day.max)
#Instance method
print(a_day.weekday(), datetime.date.weekday(a_day))
#Class method
print(datetime.date.today())
2022-02-22
<class 'datetime.date'>
2022 2 22
9999-12-31 9999-12-31
1 1
2020-05-06

datetime.date() creates an entity (an instance) of the datetime.date type and sets (initializes) its values based on the date passed as arguments.

The instance `a_day` has values (attributes) of its own representing the year, month, and day; these are called instance variables. On the other hand, a value common to all instances of the `datetime.date` type is called a class variable. In this example, the maximum representable date is a class variable. Functions that require an instance are called instance methods, and functions that do not are called class methods.

Note that `a_day.weekday()` and `datetime.date.weekday(a_day)` are equivalent for instance methods. You can think of Python as converting the former into the latter when it executes it. By the way, the return value means 0 is Monday and 1 is Tuesday.

Class definition

A *class* is the blueprint for a data type. Let's actually define a class. A class that imitates a certain well-known monster is probably the easiest to understand.

class Nezumi:
    #Class variables
    types = ('Denki',)
    learnable_moves = {'Denkou Sekka', 'Kaminari', 'Denji'}
    base_hp = 35
    
    #Initialization method
    def __init__(self, name, level):
        #Instance variables
        self.name = name
        self.level = level
        self.learned_moves = []
    
    #Instance method
    def learn(self, move):
        if move in self.learnable_moves:
            self.learned_moves.append(move)
            print(f'{self.name} learned {move}!')
            
    #Class method
    @classmethod
    def hatch(cls):
        nezumi = cls('Mouse', 1)
        nezumi.learned_moves.append('Denkou Sekka')
        print('The egg hatched and a mouse was born!')
        return nezumi

#Instantiation
reo = Nezumi('Leo', 44)
#Member variable confirmation
print(reo.name, reo.level, reo.learned_moves, reo.types)
#Instance method call
reo.learn('Kaminari')

Leo 44 [] ('Denki',)
Leo learned Kaminari!

#Class method call
nezu = Nezumi.hatch()
#Instance variable confirmation
print(nezu.name, nezu.level, nezu.learned_moves)

The egg hatched and a mouse was born!
Mouse 1 ['Denkou Sekka']

__init__() is the method that initializes instance variables; it is called the initialization method or initializer. The first argument, self, represents the instance, and anything assigned inside a method in the form self.variable_name becomes an instance variable.

Calling the defined class as Nezumi() first creates (news) an instance internally and then runs __init__() on the newly created instance as self.

The instance reo is bound to the first argument self of the instance method. A call like reo.learn('Kaminari') executes Nezumi.learn(reo, 'Kaminari'). That is why self is needed.

Class methods are defined with a *decorator* called @classmethod. A standard use of class methods is to write alternative initialization methods. The class object Nezumi is bound to the first argument cls, so cls('Mouse', 1) is the same as Nezumi('Mouse', 1).

By the way, in the example above the instance variables are checked one by one, but the built-in function vars() returns them all at once as a dict.
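For example, continuing the Nezumi example above (the exact dict contents depend on the calls made so far):

print(vars(reo))
# {'name': 'Leo', 'level': 44, 'learned_moves': ['Kaminari']}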

Special method

A method that is called automatically when a particular operation is performed on an object is called a special method; __init__() is one of them. There are [many special methods](https://docs.python.org/ja/3/reference/datamodel.html#special-method-names), so I won't go into detail, but the following are representative.

- __str__(): defines the string shown when the object is passed to print() etc.
- __repr__(): defines the string used for debugging; you can see it by evaluating the object in interactive mode or by passing it to repr()
- __len__(): defines the return value when the object is passed to len()
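As a minimal sketch, here is a hypothetical Team class (not part of the knock answers) defining these three special methods:

class Team:
    def __init__(self, members):
        self.members = members

    def __str__(self):
        # used by print() and str()
        return 'Team of ' + ', '.join(self.members)

    def __repr__(self):
        # used by repr() and the interactive prompt
        return f'Team({self.members!r})'

    def __len__(self):
        # used by len()
        return len(self.members)

t = Team(['Leo', 'Nezu'])
print(t)         # Team of Leo, Nezu
print(repr(t))   # Team(['Leo', 'Nezu'])
print(len(t))    # 2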

Private member

Python does not have true private members. The convention is to prefix names intended for internal use with a single underscore, such as _spam.
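A minimal sketch of that convention, using a hypothetical Counter class (the underscore does not prevent access; it only signals intent):

class Counter:
    def __init__(self):
        self._count = 0  # leading underscore: "internal use only" by convention

    def increment(self):
        self._count += 1

    def value(self):
        return self._count

c = Counter()
c.increment()
print(c.value())  # 1
print(c._count)   # 1 -- still accessible; the underscore is only a convention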

Inheritance

The explanation of inheritance itself is not that long, and it does appear now and then in deep learning code. However, it is not needed in this chapter, and it drags in many related topics (when to use it, super(), namespaces, static methods, and so on), so I will omit it. ~~I had found an example that seemed good for explaining inheritance, though.~~

Data class

The dataclasses module, added in Python 3.7, makes it easy to define classes for holding data: it generates __init__() and __repr__() automatically. It could probably be used for this chapter's problems as well, but I will omit it because it would also require learning type annotations along the way.
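As a minimal sketch, here is a hypothetical Point dataclass, just to show what gets generated:

from dataclasses import dataclass

@dataclass
class Point:
    x: int
    y: int

p = Point(1, 2)
print(p)                 # Point(x=1, y=2) -- __repr__ was generated automatically
print(p == Point(1, 2))  # True -- an __eq__ is generated as well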

40. Reading the dependency analysis result (morpheme)

Implement a class Morph that represents a morpheme. This class has the surface form (surface), base form (base), part of speech (pos), and part-of-speech subdivision 1 (pos1) as member variables. In addition, read the CaboCha analysis result (neko.txt.cabocha), represent each sentence as a list of Morph objects, and display the morpheme sequence of the third sentence.

Below is an example of the answer.

q40.py


import argparse
from itertools import groupby
import sys


class Morph:
    """Read a line from the cabocha lattice format file"""
    __slots__ = ('surface', 'pos', 'pos1', 'base')
    # Example line: I	noun,pronoun,general,*,*,*,I,wagahai,wagahai
    def __init__(self, line):
        self.surface, temp = line.rstrip().split('\t')
        info = temp.split(',')
        self.pos = info[0]
        self.pos1 = info[1]
        self.base = info[6]
   
    @classmethod
    def load_cabocha(cls, fi):
        """Generate Morph instance from cabocha lattice format file"""
        for is_eos, sentence in groupby(fi, key=lambda x: x == 'EOS\n'):
            if not is_eos:
                yield [cls(line) for line in sentence if not line.startswith('* ')]
    
    def __str__(self):
        return self.surface
    
    def __repr__(self):
        return 'q40.Morph({})'.format(', '.join((self.surface, self.pos, self.pos1, self.base)))
    

def main():
    sent_idx = arg_int()
    for i, sent_lis in enumerate(Morph.load_cabocha(sys.stdin), start=1):
        if i == sent_idx:
            # print(*sent_lis)
            print(repr(sent_lis))
            break

def arg_int():
    parser = argparse.ArgumentParser()
    parser.add_argument('-n', '--number', default='1', type=int)
    args = parser.parse_args()
    return args.number


if __name__ == '__main__':
    main()
!python q40.py -n2 < neko.txt.cabocha

[q40.Morph( , symbol, blank,  ), q40.Morph(I, noun, pronoun, I), q40.Morph(ha, particle, binding particle, ha), q40.Morph(cat, noun, general, cat), q40.Morph(de, auxiliary verb, *, da), q40.Morph(aru, auxiliary verb, *, aru), q40.Morph(., symbol, kuten, .)]

Passing the instance variable names to the special class variable __slots__ saves memory and speeds up attribute lookup. In exchange, you can no longer add new instance variables from outside, and you can no longer get the instance variable dict with vars().
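A minimal sketch of that tradeoff, using a hypothetical WithSlots class (not part of the answers):

class WithSlots:
    __slots__ = ('x',)

    def __init__(self, x):
        self.x = x

w = WithSlots(1)
# w.y = 2   # AttributeError: 'WithSlots' object has no attribute 'y'
# vars(w)   # TypeError: vars() argument must have __dict__ attribute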

The design is such that a Morph cannot be instantiated without passing it a line of morpheme information. Whether you like that is a matter of taste.

If you write a string literal at the beginning of a function or class definition, it is treated as a *docstring*. Triple-quoted string literals can span multiple lines. Docstrings can be viewed with the help() function and in Jupyter and editors, and they are also used by the doctest and pydoc modules.
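A minimal sketch with a hypothetical function, showing how a docstring is attached and read back:

def greet(name):
    """Return a greeting for name."""
    return f'Hello, {name}!'

print(greet.__doc__)  # Return a greeting for name.
help(greet)           # shows the signature together with the docstring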

groupby() lets you write elegant code for files in which the end of a sentence is marked with `EOS`. Note, however, that because of blank lines this chapter's data contains stretches like `EOS\nEOS`, so the sentence numbering in the problem statement and the numbering you get from groupby() do not match.
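A minimal sketch of what that groupby() call does, with a toy list standing in for the file:

from itertools import groupby

lines = ['a\n', 'EOS\n', 'EOS\n', 'b\n', 'EOS\n']  # toy stand-in for neko.txt.cabocha
for is_eos, group in groupby(lines, key=lambda x: x == 'EOS\n'):
    if not is_eos:
        print(list(group))
# ['a\n']
# ['b\n']

The two consecutive EOS lines collapse into a single skipped group, so the empty "sentence" between them is never counted, which is exactly why the numbering differs.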

41. Reading the dependency analysis result (phrase / dependency)

In addition to problem 40, implement the clause class Chunk. This class has a list of morphemes (Morph objects) (morphs), the index number of the destination clause (dst), and a list of source clause index numbers (srcs) as member variables. In addition, read the CaboCha analysis result of the input text, represent each sentence as a list of Chunk objects, and display the text of each clause and its destination for the eighth sentence. For the remaining problems in Chapter 5, use the program created here.

In the object-oriented spirit, it would be better to also create a Sentence class, since sentence -> clause -> morpheme form a has-a relationship. That said, even if you create it, its only instance variable is self.chunks, so from the viewpoint of variable management it does not change much and is not essential.

However, since srcs cannot be filled in until the analysis result of a whole sentence has been read, it feels natural to do that when initializing a Sentence object. Another advantage, I think, is that this separates document-level processing (reading the CaboCha file, reading the analysis result of the n-th sentence, receiving n as a command-line argument) from sentence-level processing (the subsequent problems, especially 43 onward).

The following is an example answer, but it also includes the answers to the later problems. Each method's *docstring* says which problem it is for, so skip the code unrelated to problem 41 and come back to it later.

q41.py


from collections import defaultdict
from itertools import groupby, combinations
import sys

from q40 import Morph, arg_int


class Chunk:
    """Read clauses from cabocha lattice format file"""
    __slots__ = ('idx', 'dst', 'morphs', 'srcs')
    
    # * 0 2D 0/0 -0.764522
    def __init__(self, line):
        info = line.rstrip().split()
        self.idx = int(info[1])
        self.dst = int(info[2].rstrip("D"))
        self.morphs = []
        self.srcs = []
    
    def __str__(self):
        return ''.join([morph.surface for morph in self.morphs])
    
    def __repr__(self):
        return 'q41.Chunk({}, {})'.format(self.idx, self.dst)
    
    def srcs_append(self, src_idx):
        """Add the original clause index. Sentence.__init__()Used in."""
        self.srcs.append(src_idx)
    
    def morphs_append(self, line):
        """Add a morpheme. Sentence.__init__()Used in."""
        self.morphs.append(Morph(line))
    
    def tostr(self):
        """Returns the surface form of the clause with the symbol removed. Used in q42 or later."""
        return ''.join([morph.surface for morph in self.morphs if morph.pos != 'symbol'])
    
    def contain_pos(self, pos):
        """Returns whether the part of speech in the clause exists. Used in q43 or later."""
        return pos in (morph.pos for morph in self.morphs)

    def replace_np(self, symbol):
        """Replace noun phrases in phrases with symbols. For q49."""
        morph_lis = []
        for pos, morphs in groupby(self.morphs, key=lambda x: x.pos):
            if pos == 'noun':
                for morph in morphs:
                    morph_lis.append(symbol)
                    break
            elif pos != 'symbol':
                for morph in morphs:
                    morph_lis.append(morph.surface)
        return ''.join(morph_lis)
        
    
class Sentence:
    """Read statements from the cabocha lattice format file."""
    __slots__ = ('chunks', 'idx')
    
    def __init__(self, sent_lines):
        self.chunks = []
        
        for line in sent_lines:                    
            if line.startswith('* '):
                self.chunks.append(Chunk(line))
            else:
                self.chunks[-1].morphs_append(line)

        for chunk in self.chunks:
            if chunk.dst != -1:
                self.chunks[chunk.dst].srcs_append(chunk.idx)
    
    def __str__(self):
        return ' '.join([morph.surface for chunk in self.chunks for morph in chunk.morphs])
    
    @classmethod
    def load_cabocha(cls, fi):
        """Generate Sentence instance from cabocha lattice format file"""
        for is_eos, sentence in groupby(fi, key=lambda x: x == 'EOS\n'):
            if not is_eos:
                yield cls(sentence)
    
    def print_dep_idx(self):
        """q41.Display the original clause index and the destination clause index"""
        for chunk in self.chunks:
            print('{}:{} => {}'.format(chunk.idx, chunk, chunk.dst))
    
    def print_dep(self):
        """q42.Display the surface layer of the original clause and the destination clause separated by tabs"""
        for chunk in self.chunks:
            if chunk.dst != -1:
                print('{}\t{}'.format(chunk.tostr(), self.chunks[chunk.dst].tostr()))

    def print_noun_verb_dep(self):
        """q43.Extract clauses containing nouns related to clauses containing verbs"""
        for chunk in self.chunks:
            if chunk.contain_pos('noun') and self.chunks[chunk.dst].contain_pos('verb'):
                print('{}\t{}'.format(chunk.tostr(), self.chunks[chunk.dst].tostr()))
                
    def dep_edge(self):
        """For making pydot output a dependency with q44"""
        return [(f"{i}: {chunk.tostr()}", f"{chunk.dst}: {self.chunks[chunk.dst].tostr()}")
                    for i, chunk in enumerate(self.chunks) if chunk.dst != -1]
    
    def case_pattern(self):
        """q45.Verb case pattern extraction"""
        for chunk in self.chunks:
            for morph in chunk.morphs:
                if morph.pos == 'verb':
                    verb = morph.base
                    particles = [] #List of particles
                    for src in chunk.srcs:
                        # Add the rightmost particle of the source clause
                        particles.extend([word.base for word in self.chunks[src].morphs 
                                             if word.pos == 'Particle'][-1:])
                    particles.sort()
                    print('{}\t{}'.format(verb, ' '.join(particles)))
                    # Only the leftmost verb is used, so break out early
                    break

    def pred_case_arg(self):
        """q46.Verb case frame information extraction"""
        for chunk in self.chunks:
            for morph in chunk.morphs:
                if morph.pos == 'verb':
                    verb = morph.base
                    particle_chunks = []
                    for src in chunk.srcs:
                        # (particle, surface form of the source clause)
                        particle_chunks.extend([(word.base, self.chunks[src].tostr()) 
                                                for word in self.chunks[src].morphs if word.pos == 'Particle'][-1:])
                    if particle_chunks:
                        particle_chunks.sort()
                        particles, chunks = zip(*particle_chunks)
                    else:
                        particles, chunks = [], []

                    print('{}\t{}\t{}'.format(verb, ' '.join(particles), ' '.join(chunks)))
                    break
                    
    def sahen_case_arg(self):
        """q47.Functional verb syntax mining"""
        # Flag for extracting "sahen noun + wo + verb"
        sahen_flag = 0
        for chunk in self.chunks:
            for morph in chunk.morphs:
                if sahen_flag == 0 and morph.pos1 == 'Change connection':
                    sahen_flag = 1
                    sahen = morph.surface
                elif sahen_flag == 1 and morph.base == 'To' and morph.pos == 'Particle':
                    sahen_flag = 2
                elif sahen_flag == 2 and morph.pos == 'verb':
                    sahen_wo = sahen + 'To'
                    verb = morph.base
                    particle_chunks = []
                    for src in chunk.srcs:
                        # (particle, surface form of the source clause)
                        particle_chunks.extend([(word.base, self.chunks[src].tostr()) for word in self.chunks[src].morphs 
                                         if word.pos == 'Particle'][-1:])
                    for j, part_chunk in enumerate(particle_chunks[:]):
                        if sahen_wo in part_chunk[1]:
                            del particle_chunks[j]

                    if particle_chunks:
                        particle_chunks.sort()
                        particles, chunks = zip(*particle_chunks)
                    else:
                        particles, chunks = [], []

                    print('{}\t{}\t{}'.format(sahen_wo + verb, ' '.join(particles), ' '.join(chunks)))
                    sahen_flag = 0
                    break
                else:
                    sahen_flag = 0 

    def trace_dep_path(self):
        """q48.Track dependency paths from clauses containing nouns to root"""
        path = []
        for chunk in self.chunks:
            if chunk.contain_pos('noun'):
                path.append(chunk)
                d = chunk.dst
                while d != -1:
                    path.append(self.chunks[d])
                    d = self.chunks[d].dst
                
                yield path
                path = []

    def print_noun2noun_path(self):
        """q49.Extraction of dependency paths between nouns"""
        #List of Chunk lists showing paths from clauses containing nouns to root (1 sentence)
        all_paths = list(self.trace_dep_path())
        arrow = ' -> '
        #A list of a set of clause ids for each path
        all_paths_set = [{chunk.idx for chunk in chunks} for chunks in all_paths]
        # Choose a pair of paths from all_paths
        for p1, p2 in combinations(range(len(all_paths)), 2):
            #Find common phrase k
            intersec = all_paths_set[p1] & all_paths_set[p2]
            len_intersec = len(intersec)
            len_smaller = min(len(all_paths_set[p1]), len(all_paths_set[p2]))
            # The intersection is non-empty and neither path is a subset of the other
            if 0 < len_intersec < len_smaller:
                #Show path
                k = min(intersec)
                path1_lis = []
                path1_lis.append(all_paths[p1][0].replace_np('X'))
                for chunk in all_paths[p1][1:]:
                    if chunk.idx < k:
                        path1_lis.append(chunk.tostr())
                    else:
                        break
                path2_lis = []
                rest_lis = []
                path2_lis.append(all_paths[p2][0].replace_np('Y'))                     
                for chunk in all_paths[p2][1:]:
                    if chunk.idx < k:
                        path2_lis.append(chunk.tostr())
                    else:
                        rest_lis.append(chunk.tostr())
                print(' | '.join([arrow.join(path1_lis), arrow.join(path2_lis),
                                 arrow.join(rest_lis)]))
        # Find and display paths where another noun clause lies on the path from a noun clause to the root
        for chunks in all_paths:
            for j in range(1, len(chunks)):
                if chunks[j].contain_pos('noun'):
                    outstr = []
                    outstr.append(chunks[0].replace_np('X'))
                    outstr.extend(chunk.tostr() for chunk in chunks[1:j])
                    outstr.append(chunks[j].replace_np('Y'))
                    print(arrow.join(outstr))
                    
    
def main():
    sent_id = arg_int()
    for i, sent in enumerate(Sentence.load_cabocha(sys.stdin), start=1):
        if i == sent_id:
            sent.print_dep_idx()
            break


if __name__ == '__main__':
    main()

!python q41.py -n8 < neko.txt.cabocha

0:This => 1
1:A student is => 7
2:Sometimes => 4
3:We => 4
4:Catch => 5
5:Boil => 6
6:Eat => 7
7:It's a story. => -1

Since the aim of this chapter is to learn class definitions, I will skip the explanation of the following problems.

Supplement to special methods

It's a little advanced story, so you can skip it.

What bothers me about the Sentence class code is that expressions like for chunk in self.chunks and self.chunks[i] appear over and over. If you define the following special methods, you can write for chunk in self or self[i] instead.

    def __iter__(self):
        return iter(self.chunks)

    def __getitem__(self, key):
        return self.chunks[key]

In fact, when a list is looped over with a for statement, it is internally converted to an iterator by the `iter()` function, and `iter()` calls the object's __iter__() method. So if you define the method above, an instance of Sentence itself can be looped over with a for statement.

Index access becomes possible by defining __getitem__().

Furthermore, if you define the Chunk class in the same way, you can execute the following for statement for the instance.

for chunk in sentence:
    for morph in chunk:

The Python wrappers for dependency parsers out in the world are probably built in a similar way. Also, I think it would be more natural to define the methods for problems 41 and 42 on the Chunk class, write the for statements above on the outside, and call the methods inside them.

42 and 43 are omitted because there is nothing special to say.

44. Visualization of dependent trees

Visualize the dependency tree of a given sentence as a directed graph. For visualization, convert the dependency tree to DOT language and use Graphviz. Also, to visualize directed graphs directly from Python, use pydot.

Installing Graphviz and pydot takes some effort, so good luck with that. Many people say pydot-ng should be used because pydot went unmaintained for a long time, but plain pydot worked in my environment, so I use pydot. I have only ever used pydot for these knocks, so I will not explain it in detail. For the implementation I referred to this blog.

q44.py


import sys

from q40 import arg_int
from q41 import Sentence
import pydot


def main():
    sent_id = arg_int()
    for i, sent in enumerate(Sentence.load_cabocha(sys.stdin), start=1):
        if i == sent_id:
            edges = sent.dep_edge()
            n = pydot.Node('node')
            n.fontname="MS Gothic"
            n.fontsize = 9
            graph = pydot.graph_from_edges(edges, directed=True)
            graph.add_node(n)
            graph.write_jpeg(f"dep_tree_neko{i}.jpg ")
            break

if __name__ == "__main__":
    main()

!python q44.py  -n8 < neko.txt.cabocha
from IPython.display import Image
Image("dep_tree_neko8.jpg ")

(dependency tree image: dep_tree_neko8.jpg)

There are approaches that draw the dependency arrows in the opposite direction (notably following Universal Dependencies), but I think either direction is fine for this problem.

(Due to the specification of graph_from_edges(), if a sentence contains multiple clauses with the same surface form but different ids, they are treated as the same node. To avoid this, you would need to modify the implementation of graph_from_edges() or append the id to the surface form.)

(The font is MS P Gothic because my environment is Ubuntu on WSL and I could not find a Japanese font, so I used the trick of referencing the Windows fonts.)

45. Extraction of verb case patterns

I would like to consider the sentence used this time as a corpus and investigate the cases that Japanese predicates can take. Think of the verb as a predicate and the particle of the phrase related to the verb as a case, and output the predicate and case in tab-delimited format. However, make sure that the output meets the following specifications.

- In a clause containing a verb, use the base form of the leftmost verb as the predicate.
- The cases are the particles attached to the clauses that depend on the predicate.
- If there are multiple particles (clauses) related to the predicate, arrange all the particles in lexicographic order, separated by spaces.

Consider the example sentence "I saw a human being for the first time here" (the 8th sentence of neko.txt.cabocha). This sentence contains two verbs, "begin" and "see"; the clause related to "begin" is analyzed as "here" (ending in the particle "de"), and the clauses related to "see" are analyzed as "I am" (particle "wa") and "thing" (particle "wo"). In that case the program should produce the following output.

begin	de
see	wa wo

Save the output of this program to a file and check the following items using UNIX commands.

- Combinations of predicates and case patterns that appear frequently in the corpus
- The case patterns of the verbs "do", "see", and "give" (arranged in order of frequency of appearance in the corpus)

If you really want to investigate case, you should restrict yourself to case particles, but we reluctantly follow the problem statement. The particles are arranged in lexicographic order. Whether to output verbs whose dependents contain no particle is not specified in the problem statement, but since those are cases where the case is omitted, I decided to include them in the aggregation.

Python's sorted() and list.sort() can sort strings as well, but the order is lexicographic by code point, which differs from the gojūon order used in Japanese dictionaries. Be aware of that.

>>> sorted(['Science', 'Scarecrow']) 

['Scarecrow', 'Science']

The answer example above uses `extend()`, a method that concatenates one list onto another, with the same result as +=. Be careful not to confuse `list.append()` with `list.extend()`.
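A quick sketch of the difference:

a = [1, 2]
a.append([3, 4])
print(a)  # [1, 2, [3, 4]] -- the whole list becomes a single element

b = [1, 2]
b.extend([3, 4])
print(b)  # [1, 2, 3, 4] -- the elements are concatenated, same as b += [3, 4]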

!python q45.py < neko.txt.cabocha | head

Be born
Tsukuka
To do
By crying
To do
At the beginning
To see
Listen
To catch
Boil

!python q45.py < neko.txt.cabocha | sort | uniq -c | sort -rn | head -20

704
452
435
333 I think
202 To become
199
188
175 Look
159
122 say
117
113
108
98 see
97 When you see
94
90
89
85
80 see

!python q45.py < neko.txt.cabocha | grep -E "^(To do|to see|give)\s" | sort | uniq -c | sort -nr | head -20

452
435
188
175 Look
159
117
113
98 see
90
85
80 see
61
60
60
51
51 from
46
40
39 What is
37

Truth value judgment

Since the answer example for problem 46 uses expressions like `if list:`, let me explain truth value testing. You can put any object in an if statement, not just a conditional expression; what happens to the truth value in that case is explained in the [documentation](https://docs.python.org/ja/3/library/stdtypes.html#truth), so please refer to it. In short, `None`, zero-like values, and any `x` with `len(x) == 0` are treated as False, and everything else is treated as True. Knowing this makes it easy to write things like "when the list is not empty". If you are unsure, try applying the built-in function `bool()` to various objects.
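A quick sketch of these rules:

print(bool(None), bool(0), bool(''), bool([]))  # False False False False
print(bool(-1), bool('a'), bool([0]))           # True True True

particles = []
if particles:  # an empty list is falsy, so this branch is skipped
    print(' '.join(particles))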

There is nothing else to say, so 46 is omitted.

47. Functional verb syntax mining

We would like to focus only on cases where the wo-case of a verb contains a sahen connection noun. Modify the program from problem 46 to meet the following specifications.

- Only when a clause consisting of "sahen connection noun + wo (particle)" depends on a verb
- The predicate is "sahen connection noun + wo + base form of the verb"; when the clause contains multiple verbs, use the leftmost verb
- If there are multiple particles (clauses) related to the predicate, arrange all the particles in lexicographic order, separated by spaces
- If there are multiple clauses related to the predicate, arrange all of their terms separated by spaces (in the same order as the particles)

For example, the following output should be obtained from the sentence, "The master will reply to the letter, even if it comes to another place."

When you reply, the owner says

Save the output of this program to a file and check the following items using UNIX commands.

- Predicates that appear frequently in the corpus (sahen connection noun + wo + verb)
- Predicates and particle patterns that appear frequently in the corpus

A sahen connection noun is a noun that can be turned into a sahen verb by attaching "suru" after it, such as "reply" (henji). Incidentally, in school grammar "henji-suru" is a single word, but in morphological analysis it is usually split into "henji" + "suru".

Also, please forgive the answer example for being terrible code.

!python q47.py < neko.txt.cabocha | cut -f1 | sort | uniq -c | sort -nr | head

30 reply
21 Say hello
14 imitate
13 talk
13 quarrel
6 Take a nap
5 Exercise
5 Ask a question
5 Ask a question
5 Listen to the story

!python q47.py < neko.txt.cabocha | cut -f 1,2 | sort | uniq -c | sort -nr | head

8 imitate
6 When you reply
6 quarrel
4 exercise
4 To reply
4 Reply
4 Listen to the story
4 When you say hello
4 I'll say hello
3 To ask a question

48. Extracting paths from nouns to roots

For every clause in the sentence that contains a noun, extract the path from that clause to the root of the syntax tree. The path on the syntax tree shall satisfy the following specifications.

- Each clause is represented by its (surface) morpheme sequence
- Concatenate the representations of the clauses on the path, from the start clause to the end clause, with " -> "

From the sentence "I saw a human being for the first time here" (8th sentence of neko.txt.cabocha), the following output should be obtained.

I am->saw
here->Start with->Human->Things->saw
Human->Things->saw
Things->saw

All you have to do is follow the dst links with a while loop or recursion. With the next problem in mind, I made the method handle everything up to just before producing the output string.

q48.py


import sys

from q40 import arg_int
from q41 import Sentence


def main():
    sent_id = arg_int()
    for i, sent in enumerate(Sentence.load_cabocha(sys.stdin), start=1):
        if i == sent_id:
            for chunks in sent.trace_dep_path():
                print(' -> '.join([chunk.tostr() for chunk in chunks]))
            break

if __name__ == '__main__':
    main()
!python q48.py -n6 < neko.txt.cabocha

I am -> saw
here -> Start with -> Human -> Things -> saw
Human -> Things -> saw
Things -> saw

(Added on 2020/5/18) After receiving a comment, I noticed that the dependency parse shown in the problem statement's example is itself a parsing error. An issue has been filed, and the example sentence used is planned to be changed. The main purpose of Chapter 5 of the 100 knocks is to practice class definition, and I suspect CaboCha is specified simply because it is easy to use. On the other hand, wording the problem so that it allows a wider choice of dependency parsers would also be a reasonable refinement. GiNZA is also easy to use these days.

49. Extraction of dependency paths between nouns

Extract the shortest dependency path connecting every pair of noun phrases in a sentence. When the clause numbers of the noun phrase pair are i and j (i < j), the dependency path shall satisfy the following specifications.

- As in problem 48, a path is expressed by concatenating the representations (surface morpheme sequences) of the clauses from the start clause to the end clause with " -> "
- Replace the noun phrases contained in clauses i and j with X and Y, respectively

In addition, the shape of the dependency path can be considered in the following two ways.

- When clause j exists on the path from clause i to the root of the syntax tree: show the path from clause i to clause j
- Otherwise, when clause i and clause j meet at a common clause k on their paths toward the root of the syntax tree: display the path from clause i to just before clause k, the path from clause j to just before clause k, and the contents of clause k, concatenated with " | "

For example, from the sentence "I saw a human being for the first time here" (8th sentence of neko.txt.cabocha), the following output should be obtained.

X is|In Y->Start with->Human->Things|saw
X is|Called Y->Things|saw
X is|Y|saw
In X->Start with-> Y
In X->Start with->Human-> Y
Called X-> Y

By the way, this is the most difficult problem in the 2020 version of 100 knocks. First of all, the problem statement is very difficult to understand. And even if you understand the meaning, it is still difficult. For the time being, let's understand just the meaning of the problem statement.

In short, the problem is to convert the output of problem 48 to something like this.

The last three lines of the output example correspond to "when clause j exists on the path from clause i to the root of the syntax tree: show the path from clause i to clause j". Fortunately, this part seems easy to implement. Since the paths from problem 48 start at clauses containing nouns, the first clause of each of the four paths is a candidate for i; all you have to do is search along the path from there for j. Finally, replace the noun phrases of clauses i and j with X and Y. (I suspect it should be "X -> Y" rather than "X is -> Y", but let's not worry about that.)

The hard-looking part is "other than the above". It means that when there are paths like "I am -> saw" and "Human -> Things -> saw", they are combined into "I am | Human -> Things | saw", and then the noun phrases of clauses i and j are replaced to give "X is | Called Y -> Things | saw".

Since problem 48 produced four paths, we choose two of them, so the loop runs 4C2 times; this is painful if you don't know itertools.combinations(). You also have to skip pairs that are in a subset relationship, such as "human -> things -> saw" and "things -> saw", or things go wrong, and that processing is annoying. Also, the first clauses of the two chosen paths are the candidates for clauses i and j, but searching from there for the common clause k, and then building the output string once it is found, is tedious. ~~This is neither an introduction to natural language processing nor an introduction to Python, so I don't think you need to force yourself to do it.~~
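A quick sketch of itertools.combinations(), which enumerates those 4C2 pairs (toy stand-in data):

from itertools import combinations

paths = ['path0', 'path1', 'path2', 'path3']  # stand-ins for the four paths from problem 48
for i, j in combinations(range(len(paths)), 2):
    print(i, j)
# 0 1
# 0 2
# 0 3
# 1 2
# 1 3
# 2 3  -- 4C2 = 6 pairs in total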

Summary

- Classes

In conclusion

And with that, the Introduction to Python with 100 Knocks of Language Processing series is (probably?) complete.
