Summary of tools needed to analyze data in Python

About this article

This article summarizes the tools and setup you need to analyze data in Python.

If you are interested in data analysis, see also this article: "If you are interested in data scientists, take a look here first: a summary of literature and videos (updated as needed)" - Qiita

Execution environment

Jupyter (formerly IPython Notebook)

http://jupyter.org/ An environment for interactive code execution. It is very well suited to data analysis, and once you get used to it, you may find it hard to go back to doing analysis in other IDEs.

In addition to running code blocks (cells) that you divide up as you like and showing the result of each one immediately, it supports:
・ Inline display of graphs
・ Mathematical formulas (LaTeX)
・ Text written in Markdown

This makes it well suited to exploratory analysis and to sharing and saving results. It is also widely used in the scientific community, because combining text and charts in a notebook lets you write in a paper-like format.

For use by multiple people, there is also a product called JupyterHub. https://github.com/jupyter/jupyterhub

Other options

Google Cloud Datalab https://cloud.google.com/datalab/?hl=ja A Jupyter-based front end for exploring data on Google Cloud. Reference: BigQuery integration for Python users - Qiita

beaker notebook http://beakernotebook.com/

Apache Zeppelin https://zeppelin.incubator.apache.org/

Library

Numerical calculation, data manipulation

NumPy http://www.numpy.org/ A library that provides array objects which, compared with Python's built-in list, are much better suited to array-to-array operations and multidimensional arrays (matrix calculations). Collections of NumPy arrays underlie the pandas DataFrame objects introduced below.
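
As a rough illustration of the difference from built-in lists, here is a minimal sketch. The array values are made up for this example and are not from the original article.

```python
import numpy as np

# Two small 2-D arrays (matrices); the values are arbitrary sample data.
a = np.array([[1, 2], [3, 4]])
b = np.array([[10, 20], [30, 40]])

# Arithmetic operators act element by element on whole arrays, no loop needed.
elementwise_sum = a + b

# The @ operator performs matrix multiplication.
matrix_product = a @ b

# With plain Python lists, + concatenates instead of adding element-wise.
list_concat = [1, 2] + [3, 4]

print(elementwise_sum)
print(matrix_product)
print(list_concat)
```

The point of the comparison is that the same `+` means concatenation for lists but numeric addition for arrays, which is why NumPy arrays are the usual choice for numerical work.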

To learn more about using NumPy and pandas, see this book:

Introduction to Data Analysis with Python: Data Processing Using NumPy and pandas http://www.oreilly.co.jp/books/9784873116556/

pandas http://pandas.pydata.org/ A library for handling data in an RDB-like tabular form (the data frame) in Python. It has become the standard for data analysis, and libraries such as scikit-learn and Matplotlib work smoothly with pandas objects.

Explanatory article

A rudimentary summary of data manipulation in Python Pandas http://qiita.com/hik0107/items/d991cc44c2d1778bb82e
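
To give a feel for the RDB-like handling described above, here is a small sketch. The column names and sales figures are hypothetical sample data chosen for this example.

```python
import pandas as pd

# Hypothetical sales records held as a data frame (rows and named columns).
df = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "product": ["A", "A", "B", "B"],
    "sales": [100, 150, 200, 50],
})

# Filter rows by a condition, similar to a SQL WHERE clause.
big_sales = df[df["sales"] >= 100]

# Aggregate per group, similar to SQL GROUP BY with SUM.
sales_by_region = df.groupby("region")["sales"].sum()

print(big_sales)
print(sales_by_region)
```

Filtering and grouping like this cover a large share of everyday tabulation work, which is one reason the data frame feels familiar to people coming from SQL.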

SciPy http://docs.scipy.org/doc/scipy/reference/ A library for scientific and technical computing. It includes a large number of tools, such as special functions, optimization, and statistical processing.

An example of using scipy.optimize for function approximation (Qiita article):

Non-linear function modeling in Python http://qiita.com/hik0107/items/9bdc236600635a0e61e8

Data linkage

csv http://docs.python.jp/2/library/csv.html#module-csv A convenient standard library module for loading, processing, and writing CSV data. It provides reader and writer objects for CSV files.
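
The reader and writer can be sketched as follows. To keep the example self-contained, an in-memory string stands in for a file on disk; the names and scores are hypothetical.

```python
import csv
import io

# In-memory text standing in for a CSV file (hypothetical sample data).
csv_text = "name,score\nalice,80\nbob,65\n"

# DictReader maps each data row to a dict keyed by the header row.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Every value is read as a string, so convert before doing arithmetic.
total = sum(int(row["score"]) for row in rows)

# DictWriter writes dicts back out as CSV rows.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["name", "score"])
writer.writeheader()
writer.writerows(rows)

print(total)
print(out.getvalue())
```

With a real file, you would pass the object returned by `open(path, newline="")` in place of the `io.StringIO` objects.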

DB connection

There are libraries for connecting to various databases such as MySQL, PostgreSQL, BigQuery, and SQLite.

MySQL : MySQL-Connector-Python https://pypi.python.org/pypi/mysql-connector-python/

PostgreSQL : psycopg2 http://initd.org/psycopg/download/

BigQuery : BigQuery-Python https://github.com/tylertreat/BigQuery-Python

Alternatively, see here for how to do this with pandas: http://qiita.com/hik0107/items/3944ccea04371331c3b4

SQLite: sqlite3 (no installation required because it is part of the standard library) http://docs.python.jp/2/library/sqlite3.html
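
Since sqlite3 needs no installation, it is an easy way to try out the connect, execute, and fetch flow that the other DB libraries also follow in broad outline. This is a minimal sketch; the table name and rows are hypothetical.

```python
import sqlite3

# ":memory:" creates a temporary database that lives only in RAM.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (name TEXT, pages INTEGER)")

# "?" placeholders let the driver bind the values safely.
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("alice", 3), ("bob", 5), ("alice", 2)],
)

# Aggregate in SQL and fetch the result as a list of tuples.
result = conn.execute(
    "SELECT name, SUM(pages) FROM visits GROUP BY name ORDER BY name"
).fetchall()
conn.close()

print(result)
```

To work with a database file instead, pass a file path to `sqlite3.connect`.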

Simple analysis

pivottablejs https://pypi.python.org/pypi/pivottablejs A library that takes pandas objects and lets you work with them like an Excel PivotTable. Useful when you want to do simple tabulations or check your data.

collections (standard library module)

http://docs.python.jp/2/library/collections.html A module containing tools such as Counter, which can be used like COUNT DISTINCT, and namedtuple, which lets you define simple record objects resembling a lightweight row of a data frame.
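
Both tools can be sketched in a few lines. The word list and the `Sale` record type below are hypothetical examples, not taken from the original article.

```python
from collections import Counter, namedtuple

# Counter tallies how often each value appears.
words = ["a", "b", "a", "c", "a", "b"]
counts = Counter(words)

# The number of keys is the number of distinct values.
distinct = len(counts)

# most_common(n) returns the n most frequent (value, count) pairs.
top = counts.most_common(1)

# namedtuple defines a small immutable record type with named fields.
Sale = namedtuple("Sale", ["region", "amount"])
sale = Sale(region="east", amount=100)

print(counts, distinct, top)
print(sale.region, sale.amount)
```

A list of such named tuples can also be passed to the pandas `DataFrame` constructor, which uses the field names as column names.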

Modeling (machine learning)

scikit-learn http://scikit-learn.org/ A machine learning package with a wide range of models for classification and prediction. It is also close to the de facto standard for data analysis in Python.

Graph drawing

matplotlib (+ seaborn) http://matplotlib.org/ http://stanford.edu/~mwaskom/software/seaborn/ matplotlib is the de facto tool for data visualization in Python. seaborn is a wrapper around it that makes it easier to draw attractive graphs.

They support many chart types, such as line graphs, bar graphs, histograms, and scatter plots.

Qiita article

Beautiful graph drawing with Python: seaborn makes data analysis and visualization easier http://qiita.com/hik0107/items/3dc541158fceb3156ee0

Other options

All of these are high-performance graphing tools. If you don't like matplotlib, aren't satisfied with it, or are a former R user, please check them out.

・ Bokeh http://bokeh.pydata.org/en/latest/
・ ggplot (Python version of R's ggplot2 library) http://ggplot.yhathq.com/
・ Plotly https://plot.ly/

Other

Accelerated calculation: Cython

http://cython.org/ Compiles parts of your Python code into C code for faster execution. Useful when the amount of computation is large and speed becomes a bottleneck.

Computer algebra: sympy

http://www.sympy.org/en/index.html

Manipulating and calculating dates: datetime

http://docs.python.jp/2/library/datetime.html
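
A few typical date operations can be sketched as below. The specific dates are arbitrary examples.

```python
from datetime import date, datetime, timedelta

# Two arbitrary example dates.
start = date(2016, 1, 25)
end = date(2016, 3, 1)

# Subtracting dates yields a timedelta; .days gives the gap in whole days.
elapsed_days = (end - start).days

# Adding a timedelta shifts a date forward.
one_week_later = start + timedelta(days=7)

# strptime parses a string into a datetime using a format specification.
parsed = datetime.strptime("2016-01-25 09:30", "%Y-%m-%d %H:%M")

# weekday() returns 0 for Monday through 6 for Sunday.
weekday_index = parsed.weekday()

print(elapsed_days, one_week_later, parsed, weekday_index)
```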

See also this article

It's time to seriously think about the definition and skill set of data scientists http://qiita.com/hik0107/items/f9bf14a7575d5c885a16
