[PYTHON] Summary of how to use csvkit

Introduction

--An open source tool for easy handling and cleansing of csv files for data analysis --Can be installed from python ――The official page documents are available, so please check them if you want to use them in earnest.

Installation

If you can use pip, you can install it with the following one line

$ pip install csvkit

Please see the Official Page for details.

Command list

$ ls /usr/local/bin/*csv*
/usr/local/bin/csvclean   /usr/local/bin/csvgrep  /usr/local/bin/csvlook  /usr/local/bin/csvsql    /usr/local/bin/in2csv
/usr/local/bin/csvcut     /usr/local/bin/csvjoin  /usr/local/bin/csvpy    /usr/local/bin/csvstack  /usr/local/bin/sql2csv
/usr/local/bin/csvformat  /usr/local/bin/csvjson  /usr/local/bin/csvsort  /usr/local/bin/csvstat

There are quite a few. Is this all?

Data preparation

--Bring the csv file you want to touch.

$ head -n 10 iris.csv
,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
1,5.1,3.5,1.4,0.2,setosa
2,4.9,3,1.4,0.2,setosa
3,4.7,3.2,1.3,0.2,setosa
4,4.6,3.1,1.5,0.2,setosa
5,5,3.6,1.4,0.2,setosa
6,5.4,3.9,1.7,0.4,setosa
7,4.6,3.4,1.4,0.3,setosa
8,5,3.4,1.5,0.2,setosa
9,4.4,2.9,1.4,0.2,setosa

Data confirmation

csvlook ――It will be shaped and displayed for easy viewing. --If the separator is other than a comma, `-d" | "` etc.

$ csvlook iris.csv | head -n 10
|------+--------------+-------------+--------------+-------------+-------------|
|      | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species     |
|------+--------------+-------------+--------------+-------------+-------------|
|  1   | 5.1          | 3.5         | 1.4          | 0.2         | setosa      |
|  2   | 4.9          | 3           | 1.4          | 0.2         | setosa      |
|  3   | 4.7          | 3.2         | 1.3          | 0.2         | setosa      |
|  4   | 4.6          | 3.1         | 1.5          | 0.2         | setosa      |
|  5   | 5            | 3.6         | 1.4          | 0.2         | setosa      |
|  6   | 5.4          | 3.9         | 1.7          | 0.4         | setosa      |
|  7   | 4.6          | 3.4         | 1.4          | 0.3         | setosa      |

Check only column names

csvcut -n Like names () in R

$ csvcut -n iris.csv
  1:
  2: Sepal.Length
  3: Sepal.Width
  4: Petal.Length
  5: Petal.Width
  6: Species

Column selection

csvcut -c Only one column (by column number)

$ csvcut -c 2 iris.csv | head
Sepal.Length
5.1
4.9
4.7
4.6
5
5.4
4.6
5
4.4

Only one column (by column name)

$ csvcut -c Sepal.Length iris.csv | head
Sepal.Length
5.1
4.9
4.7
4.6
5
5.4
4.6
5
4.4

Select two or more columns & csvlook

$ csvcut -c 2,4 iris.csv | csvlook | head
|---------------+---------------|
|  Sepal.Length | Petal.Length  |
|---------------+---------------|
|  5.1          | 1.4           |
|  4.9          | 1.4           |
|  4.7          | 1.3           |
|  4.6          | 1.5           |
|  5            | 1.4           |
|  5.4          | 1.7           |
|  4.6          | 1.4           |

Except for some columns & csvlook

$ csvcut -C Species iris.csv | csvlook | head
|------+--------------+-------------+--------------+--------------|
|      | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width  |
|------+--------------+-------------+--------------+--------------|
|  1   | 5.1          | 3.5         | 1.4          | 0.2          |
|  2   | 4.9          | 3           | 1.4          | 0.2          |
|  3   | 4.7          | 3.2         | 1.3          | 0.2          |
|  4   | 4.6          | 3.1         | 1.5          | 0.2          |
|  5   | 5            | 3.6         | 1.4          | 0.2          |
|  6   | 5.4          | 3.9         | 1.7          | 0.4          |
|  7   | 4.6          | 3.4         | 1.4          | 0.3          |

Row extraction / filtering

csvgrep --View data for a column under specified conditions --Grep-like --You can also define what goes into the interval

$ csvgrep -c Species -m setosa iris.csv | csvlook | head
|-----+--------------+-------------+--------------+-------------+----------|
|     | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species  |
|-----+--------------+-------------+--------------+-------------+----------|
|  1  | 5.1          | 3.5         | 1.4          | 0.2         | setosa   |
|  2  | 4.9          | 3           | 1.4          | 0.2         | setosa   |
|  3  | 4.7          | 3.2         | 1.3          | 0.2         | setosa   |
|  4  | 4.6          | 3.1         | 1.5          | 0.2         | setosa   |
|  5  | 5            | 3.6         | 1.4          | 0.2         | setosa   |
|  6  | 5.4          | 3.9         | 1.7          | 0.4         | setosa   |
|  7  | 4.6          | 3.4         | 1.4          | 0.3         | setosa   |

Even regular expressions

$ csvgrep -c Species -r ^s.*$ iris.csv | csvlook | head
|-----+--------------+-------------+--------------+-------------+----------|
|     | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species  |
|-----+--------------+-------------+--------------+-------------+----------|
|  1  | 5.1          | 3.5         | 1.4          | 0.2         | setosa   |
|  2  | 4.9          | 3           | 1.4          | 0.2         | setosa   |
|  3  | 4.7          | 3.2         | 1.3          | 0.2         | setosa   |
|  4  | 4.6          | 3.1         | 1.5          | 0.2         | setosa   |
|  5  | 5            | 3.6         | 1.4          | 0.2         | setosa   |
|  6  | 5.4          | 3.9         | 1.7          | 0.4         | setosa   |
|  7  | 4.6          | 3.4         | 1.4          | 0.3         | setosa   |

Even in the section

$ csvgrep -c Sepal.Length  -r "[4.5-5]"  iris.csv | csvlook | head
|------+--------------+-------------+--------------+-------------+-------------|
|      | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species     |
|------+--------------+-------------+--------------+-------------+-------------|
|  1   | 5.1          | 3.5         | 1.4          | 0.2         | setosa      |
|  2   | 4.9          | 3           | 1.4          | 0.2         | setosa      |
|  3   | 4.7          | 3.2         | 1.3          | 0.2         | setosa      |
|  4   | 4.6          | 3.1         | 1.5          | 0.2         | setosa      |
|  5   | 5            | 3.6         | 1.4          | 0.2         | setosa      |
|  6   | 5.4          | 3.9         | 1.7          | 0.4         | setosa      |
|  7   | 4.6          | 3.4         | 1.4          | 0.3         | setosa      |

Format transformation

csvformat --Format can be transformed such as tab delimiter

$ csvformat -T  iris.csv
	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
1	5.1	3.5	1.4	0.2	setosa
2	4.9	3	1.4	0.2	setosa
3	4.7	3.2	1.3	0.2	setosa
<snip>

Join with other files

csvjoin --You can join with another file --Can be specified as Outer, Left, Right

Summary of statistical information

csvstat ――It's convenient.

csvstat iris.csv
  1. _unnamed
	<type 'int'>
	Nulls: False
	Min: 1
	Max: 150
	Sum: 11325
	Mean: 75.5
	Median: 75.5
	Standard Deviation: 43.3003079281
	Unique values: 150
  2. Sepal.Length
<snip>

csv file check

csvclean ――It will check if the number of columns is correct. If there is no problem, ** No errors. ** is output.

$ csvclean iris.csv
No errors.

Convert from csv

csvjson --Converts to JSON --It is easier to see if you specify the indent with the -i option.

$ csvjson -i 2 iris.csv | head -n 20
[
  {
    "": "1",
    "Sepal.Length": "5.1",
    "Sepal.Width": "3.5",
    "Petal.Length": "1.4",
    "Petal.Width": "0.2",
    "Species": "setosa"
  },
  {
    "": "2",
    "Sepal.Length": "4.9",
    "Sepal.Width": "3",
    "Petal.Length": "1.4",
    "Petal.Width": "0.2",
    "Species": "setosa"
  },
  {
    "": "3",
    "Sepal.Length": "4.7",
<snip>

csvsql -i --You can convert to SQL system such as MySQL, create a sample, change the table name and insert it.

List of things that can be converted

$ csvsql -h
<snip>
  -i {access,sybase,sqlite,informix,firebird,mysql,oracle,maxdb,postgresql,mssql}, --dialect {access,sybase,sqlite,informix,firebird,mysql,oracle,maxdb,postgresql,mssql}
<snip>

It looks like this in mysql

$ csvsql -i mysql iris.csv
CREATE TABLE iris (
	_unnamed INTEGER NOT NULL,
	`Sepal.Length` FLOAT NOT NULL,
	`Sepal.Width` FLOAT NOT NULL,
	`Petal.Length` FLOAT NOT NULL,
	`Petal.Width` FLOAT NOT NULL,
	`Species` VARCHAR(10) NOT NULL
);

Convert to csv

in2csv --Conversion from Excel file (.xlsx) to csv. The data received from the business division has many xlsx patterns and is uploaded to the server as it is, so it's actually quite convenient. --In R, I wonder if I don't have to ** read.xlsx **.

in2csv -f json --You can convert a json file to csv by specifying json with the -f option. --It is very convenient to be able to instantly convert the data (JSON format) acquired by REST API into a form that is easy to read with R.

sql2csv --Conversion from the one saved in SQL DB to csv ――This also seems to be useful for people who do not understand SQL grammar to touch it lightly.

in conclusion

--Since the delimiter can be specified, it can also be applied to tsv files, etc. ――Speaking of formatting csv files, there is also NYSOL, so if you have large data, you need to compare the speed with it, but the simple one is csvkit.

reference

-Official page -Try using csvkit -Python csvkit that can be used for data preprocessing and processing -Data science starting from the command line

Recommended Posts

Summary of how to use csvkit
Summary of how to use pandas.DataFrame.loc
Summary of how to use pyenv-virtualenv
[Python] Summary of how to use pandas
[Python2.7] Summary of how to use unittest
Summary of how to use Python list
[Python2.7] Summary of how to use subprocess
Summary of how to use MNIST in Python
How to calculate Use% of df command
Jupyter Notebook Basics of how to use
Summary of how to write AWS Lambda
[Question] How to use plot_surface of python
How to use xml.etree.ElementTree
How to use Python-shell
How to use tf.data
How to use virtualenv
How to use Seaboan
How to use image-match
How to use shogun
How to use Virtualenv
How to use numpy.vectorize
How to use pytest_report_header
How to use partial
How to use Bio.Phylo
How to use SymPy
How to use x-means
How to use WikiExtractor.py
How to use virtualenv
How to use Matplotlib
How to use iptables
How to use numpy
How to use TokyoTechFes2015
How to use venv
How to use Pyenv
How to use list []
How to use python-kabusapi
How to use OptParse
How to use return
How to use dotenv
How to use pyenv-virtualenv
How to use Go.mod
How to use imutils
How to use import
How to use folium (visualization of location information)
A simple example of how to use ArgumentParser
Summary of how to import files in Python 3
Not much mention of how to use Pickle
Summary of studying Python to use AWS Lambda
How to use Qt Designer
How to use search sorted
python3: How to use bottle (2)
Understand how to use django-filter
How to use the generator
How to use FastAPI ③ OpenAPI
How to use Python argparse
How to use IPython Notebook
How to use Pandas Rolling
[Note] How to use virtualenv
How to use redis-py Dictionaries
[Python] How to use checkio
[Go] How to use "... (3 periods)"