[Chapter 2] Introduction to Python with 100 knocks of language processing

This article is a continuation of my series Introduction to Python with 100 Language Processing Knocks. It is for those who want to learn the basics of Python (and Unix commands) while working through Chapter 2 of the 100 Knocks.

If you can get through problem 12, you have the basics of Python mostly covered.

After that, you will steadily pick up more detailed knowledge. First, download the file specified in the problem statement by whatever method you like.

$ wget https://nlp100.github.io/data/popular-names.txt

Supplement to the problem statement

In natural language processing, there are many situations where you want to process a huge text file line by line, and the problems in this chapter are no exception.

TSV (Tab-Separated Values) and CSV (Comma-Separated Values) are often used to express a structure in which each row is one record and the columns are separate fields. The file dealt with in this chapter is tab-delimited, so it is TSV.

(Although it is confusing, these formats are sometimes collectively referred to as CSV (Character-Separated Values).)

Answer example policy

The problems in this chapter can be solved with pandas or the standard library csv module, but I don't feel they are really needed here, so I will explain the simplest method. The coding style of the answer examples follows PEP 8: for example, variable and function names are snake_case, and indentation is four spaces.

About Unix commands

If you are not familiar with options, pipes, redirects, and less, please read sections 1 and 3 of this Qiita article. Then check the contents of the downloaded file with $ less popular-names.txt.

Since the command names are specified in the problem statements, you can get by with --help even if you don't know them. However, most of the Unix commands in this chapter are common ones, so try to remember them if you can.

File reading

In C we used file pointers, but in Python we use a convenient data type called the file object. File objects are iterable, so to read a text file line by line, write:

with open('popular-names.txt') as f:
    for line in f:
        print(line, end='')

The with statement does f = open('popular-names.txt') on entry and f.close() when exiting the block. The official documentation states that using it is a **good habit**, so be sure to do so.

In each loop of the for statement, the contents of each line are assigned to line.

To open multiple files at once, separate them with commas, as in with open('test1') as f1, open('test2') as f2.
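For example, a minimal sketch (with the hypothetical file names test1 and test2 from above) that reads two files in lockstep:

with open('test1') as f1, open('test2') as f2:
    for line1, line2 in zip(f1, f2):
        # zip() pairs up corresponding lines from the two files
        print(line1.rstrip('\n'), line2.rstrip('\n'))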

Use sys.stdin to read standard input line by line; it is also a file object. All the problems in this chapter could be solved with the approach above, but reading from standard input is somewhat more convenient.

import sys

for line in sys.stdin:
    print(line, end='')
    

(Since standard input is already opened from the beginning, with is unnecessary here.)

(Some Unix commands are designed to accept either standard input or file names, but doing the same in Python is a little troublesome → [Reference article](https://qiita.com/hi-asano/items/010e7e3410ea4e1486cb).)
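If you do want that behavior, the standard library's fileinput module covers it; a minimal sketch:

import fileinput

# Reads the files named as command-line arguments in order, or
# standard input when no file names are given.
for line in fileinput.input():
    print(line, end='')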

10. Counting the number of lines

Count the number of lines. Use the wc command for confirmation.

Save your Python script, feed it popular-names.txt via standard input, and run it.

Below is an example of the answer.

q10.py


import sys

i = 0
for line in sys.stdin:
    i += 1
print(i)

$ python q10.py < popular-names.txt
2780

You can't use C's i++ in Python, so use the augmented assignment operator +=.

(Avoid f.read().splitlines(), f.readlines(), and list(f) when the file is large or when you want to do complicated processing.)

I will also touch on a slightly elegant method. It takes advantage of the fact that sys.stdin is iterable and that the Python for block does not form a scope.

import sys


i = 0
for i, _ in enumerate(sys.stdin, start=1):
    pass

print(i)

The built-in function enumerate(), which counts loop iterations, is useful here. It's a Python convention to receive unused values with _. The pass statement is used when you don't want to do anything but need to write something syntactically.

Use the wc command (*word count*) for confirmation. Used without options it prints several counts, so specify -l / --lines.

$ wc -l popular-names.txt
2780 popular-names.txt

11. Replace tabs with spaces

Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.

Let's use str.replace(old, new). This method returns a copy of the string with each occurrence of the substring old replaced by new. The tab character is '\t', as in C.

Below is an example of the answer.

q11.py


import sys


for line in sys.stdin:
    print(line.replace('\t', ' '), end='')

Since there are many lines, check the result with python q11.py < popular-names.txt | less or similar.

Three Unix commands are listed, but sed -e 's/\t/ /g' popular-names.txt is the most popular. On Twitter I sometimes see people correcting their typos in replies with this s/.../.../ notation. sed stands for Stream EDitor and is a versatile command.

Personally, I find the s/\t/ /g part a hassle, so I would rather use tr '\t' ' ' < popular-names.txt ...

However, sed is a command worth knowing: sed -n 10p extracts the 10th line, sed -n 10,20p extracts the 10th through 20th lines, and so on. Convenient.

File writing

In the next problem you will learn about writing files. Use open(filename, 'w') to open a text file in write mode.

with open('test', 'w') as fo:
    # fo.write('hoge')
    print('hoge', file=fo)

You can use the write() method, but it is easy to forget that it does not add a newline, so I think it is better to use the file keyword argument of print().

12. Save the first column in col1.txt and the second column in col2.txt

Save only the first column of each row to col1.txt, and only the second column to col2.txt. Use the cut command for confirmation.

Below is an example of the answer.

q12.py


import sys


with open('col1.txt', 'w') as fo1,\
     open('col2.txt', 'w') as fo2:
    for line in sys.stdin:
        cols = line.rstrip('\n').split('\t')
        print(cols[0], file=fo1)
        print(cols[1], file=fo2)

It would have been nice to write the two open() calls on one line, but with a backslash the statement is considered to continue across the line break.

You could use open() to read popular-names.txt as well, but I don't want the with statement to get any longer, so I read from standard input.

The line.rstrip('\n').split('\t') part is called a method chain; the methods are executed in order from the left. In this problem the result would not change without rstrip(), but it prevents cols[-1] from including a newline character. It's a good habit when reading text.

For the Unix command, specify the -f / --fields option of cut. You could run the command twice, but you can do it in one line with &&.

$ cut -f1 popular-names.txt > col1.txt && cut -f2 popular-names.txt > col2.txt

13. Merge col1.txt and col2.txt

Combine col1.txt and col2.txt created in 12 to create a text file in which the first and second columns of the original file are arranged tab-delimited. Use the paste command for confirmation.

This is easy, isn't it? Below is an example of the answer.

q13.py


with open('col1.txt') as fi1,\
     open('col2.txt') as fi2:
    for col1, col2 in zip(fi1, fi2):
        col1 = col1.rstrip()
        col2 = col2.rstrip()
        print(f'{col1}\t{col2}')


This is where the built-in function zip() comes into play. With no argument, rstrip() removes all trailing whitespace, including the newline.

(To allow for three or more input files, it is better to receive the columns from zip() as a single tuple and join() them. Making the behavior match the paste command exactly is harder than it looks; please read this article.)
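As a rough sketch of that generalization (my own, not a full paste clone), assuming the file names arrive as command-line arguments:

import sys

files = [open(name) for name in sys.argv[1:]]
for cols in zip(*files):
    # one tuple of corresponding lines per row, joined by tabs
    print('\t'.join(col.rstrip('\n') for col in cols))
for f in files:
    f.close()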

Unix commands are OK with paste col1.txt col2.txt.

14. Output N lines from the beginning

Receive the natural number N by means such as a command line argument and display only the first N lines of the input. Use the head command for confirmation.

Command-line arguments can be obtained with sys.argv, but argparse is somewhat more convenient. For instructions, read the excellent official tutorial from the beginning up to "Short options".

File objects are not sequences and cannot be sliced, so let's keep count of the lines some other way.

Below is an example of the answer.

q14.py


import argparse
import sys


def arg_lines():
    parser = argparse.ArgumentParser()
    parser.add_argument('-n', '--lines', default=1, type=int)
    args = parser.parse_args()
    return args.lines


def head(N):
    for i, line in enumerate(sys.stdin):
        if i < N:
            print(line, end='')
        else:
            break


if __name__ == '__main__':
    head(arg_lines())
$ python q14.py -n 10 < popular-names.txt

The argparse part is made into a separate function so that it can be reused in the next problem. if __name__ == '__main__': prevents the main processing from running unintentionally when the module is imported.

(As an aside, it's not good to write long processing directly under if __name__ == '__main__':, because all of its variables are global. Python's access to global variables is slow, so there is a performance downside too. In fact, code written inline like that becomes slightly faster just by wrapping it in a function.)

break is a control statement used inside a for block that immediately exits the loop. Remember it together with continue (which immediately moves on to the next iteration).
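A tiny illustration of the two:

for i in range(10):
    if i % 2 == 0:
        continue  # skip even numbers; go straight to the next iteration
    if i > 5:
        break     # leave the loop entirely
    print(i)      # prints 1, 3, 5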

The head () function can be written a little more elegantly.

import sys
from itertools import islice

def head(N):
    for line in islice(sys.stdin, N):
        print(line, end='')

The Unix command is head -n 5 popular-names.txt, and so on. If you omit the option, it runs with the default value (probably 10).

Back in problem 11 I suggested piping the output to less because there were so many lines, but if you only want to check the beginning, head is enough.

If you pipe these commands together, you may get a BrokenPipeError at the end. To prevent it, either put head first, as in head popular-names.txt | python q11.py, or discard the error output, as in python q11.py < popular-names.txt 2>/dev/null | head.

15. Output the last N lines

Receive the natural number N by means such as command line arguments and display only the last N lines of the input. Use the tail command for confirmation.

Python file objects (sys.stdin and the return value of open()) can, in principle, only move the read position forward from the beginning. Counting the number of lines and then reopening the file is wasteful, and doing readlines() and slicing from the back wastes memory...

This can be handled smartly if you know the *queue*, a first-in, first-out data structure. Put the contents of the file into a queue of length N, line by line. Elements that overflow the queue's length are pushed out automatically, so in the end only the last N lines remain in the queue.

Use deque from the collections module to implement a queue in Python. In fact, the deque recipes in the official docs include an example of tail(), which is almost the answer to this question.
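A quick demonstration of the bounded-queue behavior:

from collections import deque

d = deque(maxlen=3)
for i in range(6):
    d.append(i)  # once full, the oldest element is discarded
print(d)  # deque([3, 4, 5], maxlen=3)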

Below is an example of the answer.

q15.py


from collections import deque
import sys

from q14 import arg_lines


def tail(N):
    buf = deque(sys.stdin, N)
    print(''.join(buf), end='')  # the buffered lines already end with '\n'


if __name__ == '__main__':
    tail(arg_lines())

A deque can be iterated with a for statement just like a list. So far we have looped with for and print()ed each line, but it is faster to join() and print() all at once ([Reference](https://qiita.com/hi-asano/items/aa2976466739f280b887#%E3%81%8A%E3%81%BE%E3%81%91-%E5%95%8F%E9%A1%8C3-print)). For problem 14, print(''.join(islice(sys.stdin, N)), end='') would have been enough.

Unix commands are OK with tail -n 5.

From here the difficulty rises a little, but I'd like you to stick with it.

Iterables and iterators

In the previous article, I explained that "anything you can loop over with a for statement is called an iterable". The iterable data types (plus a few extras) that have appeared so far can be classified as follows. There is no need to memorize the finer terms, but knowing the landscape makes it easier when you come across new data types in the future.

Now let's talk about **iterators** (which we have so far glossed over). Data types such as lists are cumbersome when large because all the elements are held in memory at once. There is also a lot of waste if you never need len() or indexing and only ever pass the data to a for statement or a function like str.join(); generating the later elements is wasted time, especially if you don't loop all the way to the end. Iterators eliminate these drawbacks: an iterator produces only one element per loop iteration, so memory is handled efficiently. You can't slice one, but itertools.islice() offers something similar. Also, once fully consumed, an iterator can't be used again. Because of these restrictions, iterators are used almost exclusively in for statements or as arguments to functions that accept iterables.

Data types that support the in operator and len() as well as for statements are called collections or containers (the official documentation calls them containers; by the abstract base class definitions, collection is the stricter term).

Indexes and slices can be used for all sequence types.

In addition to deque, the collections module defines useful data types such as Counter and defaultdict, so keep it in mind as well. They may appear in future problems.
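A quick taste of both (the values are only illustrative):

from collections import Counter, defaultdict

# Counter tallies hashable items
print(Counter('abracadabra').most_common(2))  # [('a', 5), ('b', 2)]

# defaultdict supplies a fresh default value for missing keys
groups = defaultdict(list)
for word in ['apple', 'avocado', 'banana']:
    groups[word[0]].append(word)
print(dict(groups))  # {'a': ['apple', 'avocado'], 'b': ['banana']}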

List comprehension and generator expressions

There are times when you want to apply some operation to every element of an iterable one by one, or to extract only the elements that meet a condition. Comprehensions and generator expressions describe such processing concisely. Let's take problem 03 of the 100 knocks, "Pi", as an example.

tokens = 'Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics.'.split()

# List comprehension
stripped_list = [len(token.rstrip('.,')) for token in tokens]
print(stripped_list)
[3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]

First, the list comprehension. Previously we called append() inside a for statement; in a list comprehension you write what you would append() first, attach the for clause after it, and enclose the whole thing in []. It may take some getting used to, but this way of writing runs faster ([Reference](https://qiita.com/hi-asano/items/aa2976466739f280b887#%E5%95%8F%E9%A1%8C1-%E3%83%AA%E3%82%B9%E3%83%88%E7%94%9F%E6%88%90)), so use it actively.

Next, the generator expression. The return value of a generator expression is called a generator, and it is a kind of iterator. It can only be used by looping over it with a for statement or passing it to another function; if anything, the latter is the more common usage.

# Generator expression
stripped_iter = (len(token.rstrip('.,')) for token in tokens)

for token in stripped_iter:
    print(token, end=' ')
3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 
' '.join(str(len(token.rstrip('.,'))) for token in tokens)
'3 1 4 1 5 9 2 6 5 3 5 8 9 7 9'

As you can see, a generator expression is just the list comprehension with [] changed to (). When it is the sole argument to a function, the () can be omitted.

Passing a generator expression to a function has the advantage of eliminating intermediate variables. Compared to passing a list comprehension, memory usage is lower, and generator expressions are often faster.

(In rare cases a function is faster when passed a list, and this join() seems to be one of those exceptions...)

Generator function

The problem with generator expressions is that they are hard to write when the processing gets complicated. In that case you may want to define a function that returns an iterator. The easiest way is to use the yield statement, and a function defined that way is called a generator function. Naturally, the object a generator function produces is also a generator.

To define a generator function, put yield value "in the middle" (or at the end) of the function's processing. The big difference from return is that return ends the function's processing on the spot and its local variables disappear, whereas yield pauses the function so it can resume where it left off.

def tokens2lengths(tokens):
    for token in tokens:
        yield len(token.rstrip('.,'))

for token in tokens2lengths(tokens):
    print(token, end=' ')
3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 

"What's so great about that?" you might think... You won't use generator functions much in these two chapters... They are said to make recursive processing easier to write, but recursive functions themselves don't come up that often... Personally, I use them in cases like the following. Suppose you have your own function process.

for elem in lis:
    if elem is not None:
        outstr = process(elem)
        print(outstr)

With this code, the function-call overhead becomes non-negligible as the number of elements in lis grows. So if process is turned into a generator function, it becomes slightly faster. You can also absorb the conditional into it, which cleans up the calling code.

for outstr in iter_process(lis):
    print(outstr)

This digression has gone on long enough. Let's solve the next problem.

A very fine point of terminology

The documentation states that generator usually refers to the generator function, and that the object a generator function produces is called a generator iterator. However, if you check the return type of a generator function (or generator expression) with type(), you will see generator. Perhaps for this reason, what is officially a generator iterator is often just called a generator in unofficial documents.

16. Divide the file into N

Receive the natural number N by means such as command line arguments, and divide the input file into N line by line. Achieve the same processing with the split command.

This is a difficult problem. There are various possible approaches, but to keep the constraint of not holding the whole file in memory at once, it seems the only way is to first count the total number of lines and then divide. To move the file object's read position back to the beginning, you can either reopen the file or use f.seek(0) (meaning "seek to byte 0 from the start").

And making the N-way split **as even as possible** is annoying. For example, if you divide 14 lines into 4 parts, you want pieces of 4, 4, 3, and 3 lines. Give it some thought.

Once you can do that, it's just a matter of reading and writing the right number of lines. There is a method fi.readline() that reads exactly one line; this may be its turn to shine. Each piece should probably be written to a separate file.

Below is an example of the answer.

q16.py


import argparse
import sys


def main():
    parser = argparse.ArgumentParser(
        description='Output pieces of FILE to FILE1, FILE2, ...;')
    parser.add_argument('file')
    parser.add_argument('-n', '--number', type=int,
                        help='split FILE into n pieces')
    args = parser.parse_args()
    file_split(args.file, args.number)

    
def file_split(filename, N):
    with open(filename) as fi:
        n_lines = sum(1 for _ in fi)
        fi.seek(0)
        for nth, width in enumerate((n_lines+i)//N for i in range(N)):
            with open(f'{filename}.split{nth}', 'w') as fo:
                for _ in range(width):
                    fo.write(fi.readline())


if __name__ == '__main__':
    main()
$ python q16.py -n 3 popular-names.txt
$ wc -l popular-names.txt.split*
  926 popular-names.txt.split0
  927 popular-names.txt.split1
  927 popular-names.txt.split2
 2780 total

You should have no trouble with the argparse part by now. To count the lines, this time we use the built-in function sum(), which totals the elements of an iterable.

Now, how to divide an integer evenly. Suppose the quotient is q and the remainder is r when dividing m items among n people. If you give q items each to (n - r) people and (q + 1) items each to the remaining r people, the split is as even as possible.

The ((n_lines + i) // N for i in range(N)) part expresses this elegantly. // is floor division, which truncates the fractional part. See this Qiita article for why this yields an even split.
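You can check the trick quickly with the 14-lines-into-4 example from above:

n_lines, N = 14, 4
widths = [(n_lines + i) // N for i in range(N)]
print(widths, sum(widths))  # [3, 3, 4, 4] 14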

If you don't care about the order of the lines, you can use itertools.tee() and islice(). If you don't care about memory, zip_longest() may be easier.

The Unix command should be split -n l/5 -d popular-names.txt popular-names.txt, but it may not work depending on the split implementation in your environment.

The remaining problems are easier.

17. Distinct strings in the first column

Find the type of string in the first column (a set of different strings). Use the sort and uniq commands for confirmation.

You just add the first column of each line to a set. Only the list comprehension was explained above, but there are set comprehensions and dict comprehensions as well.

Below is an example of the answer.

q17.py


import sys


names = {line.split('\t')[0] for line in sys.stdin}
print('\n'.join(names))

Keep in mind that a set's iteration order changes from run to run. If you don't like that, use a dict instead (dicts keep key insertion order in CPython 3.6+ as an implementation detail, and officially from Python 3.7).
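A sketch of that dict-based approach (the values are only illustrative):

names = ['Mary', 'Anna', 'Mary', 'Emma', 'Anna']
unique_names = list(dict.fromkeys(names))  # keeps first occurrences, in order
print(unique_names)  # ['Mary', 'Anna', 'Emma']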

The Unix command is cut -f1 popular-names.txt | sort | uniq. Since uniq only removes duplicates on adjacent lines, the preceding sort is required for this purpose.

Lambda expressions and sorting

We will use it in the next problem, so let's cover the lambda expression. Lambda expressions define small functions. For example, lambda a, b: a + b is a function that returns the sum of two numbers. It can be called like a normal function, but it is mainly used for things like the key argument of sort(). It may also be passed to other functions or returned from your own functions.

The official Sorting HOW TO is a good resource for sort(). Reading up to "Ascending and Descending" is enough.
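For instance, sorting pairs by their second element with a lambda key (made-up values):

pairs = [('Anna', 2604), ('Mary', 7065), ('Emma', 2003)]
pairs.sort(key=lambda x: x[1], reverse=True)  # sort by the number, descending
print(pairs)  # [('Mary', 7065), ('Anna', 2604), ('Emma', 2003)]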

18. Sort each row in descending order of the numbers in the third column

Arrange the rows in reverse (descending) order of the numbers in the third column (note: leave the contents of each row unchanged). Use the sort command for confirmation (the result does not have to match the command's output exactly).

(Why "column" now, when it said "row" until a moment ago...) Below is an example of the answer.

q18.py


import sys


sorted_list = sorted(sys.stdin, key=lambda x: int(x.split('\t')[2]), reverse=True)
print(''.join(sorted_list))

Note that the numbers in the third column remain strings unless you cast them to a numeric type. Casting is done with the built-in int().

The Unix command is sort -k3 -nr popular-names.txt: treat the third field as a number (-n) and sort in reverse, i.e. descending, order (-r).

Unix sort is excellent and handles large files without running out of memory. It's also relatively easy to speed up (tweaking the locale, splitting the input and merging at the end, etc.).

19. Find the frequency of appearance of the character string in the first column of each line, and arrange them in descending order of frequency of appearance.

Find the frequency of occurrence of the character string in the first column of each line, and display them in descending order. Use the cut, uniq, and sort commands for confirmation.

The collections.Counter foreshadowing pays off here! If you read the documentation, you should have no problem. Keep in mind that Counter is a subclass of dict.

Below is an example of the answer.

q19.py


from collections import Counter
import sys


col1_freq = Counter(line.split('\t')[0] for line in sys.stdin)
for elem, num in col1_freq.most_common():
    print(num, elem)

The Unix command is cut -f1 popular-names.txt | sort | uniq -c | sort -nr. As you connect the pipes one at a time, it's a good idea to check the intermediate output with head.

Summary

- Unix command basics
- File reading and writing

Next is Chapter 3

JSON files can be read with the json module. Learn regular expressions from the official Regular Expression HOWTO. I will write the sequel if this article gets LGTMs or comments.

(4/30 postscript) The explanation of Chapter 3 has been released. → https://qiita.com/hi-asano/items/8e303425052781d95f09
