[PYTHON] How to use MultiIndex (personal memorandum)

What is MultiIndex?

Click here for details ↓ https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html According to the document, it is a hierarchical index and can be used when manipulating high-dimensional data (roughly).

Panel that handles 3D data structures in pandas has been abolished, so check how to handle it here. Since it is a personal memorandum, it is not exhaustive at all. Therefore, if you have any questions, you have to investigate it as appropriate.

Environment etc.

python 3.7.7 pandas 1.1.1 Execution environment: jupyter notebook

Data confirmation

As expected, research data cannot be used, so sample data will be created appropriately.

Data creation

#pandas import
import pandas as pd

#Data creation
array =[['cat','cat','cat','cat','cat','dog','dog','dog','bird','bird'],
        ['black','white','white','brown','brown','black','white','brown','white','yellow'],
        ['male','male','female','male','female','female','female','male','male','female'],
        [3,5,5,5,4,7,8,10,1,2], 
        [20,25,26,30,22,77,55,90,10,15], 
        [3.6,4.5,4.0,5.2,3.0,15.6,10.5,20.1,0.52,0.89]]

#Convert to DataFrame.Transpose rows and columns with T
df = pd.DataFrame(array).T

#column settings
df.columns = ['animal','color','male or female','age','height','weight']

Data content

Like this. Since the content is created implicitly, there seems to be a very obese cat.

animal color male or female age height weight
0 cat black male 3 20 3.6
1 cat white male 5 25 4.5
2 cat white female 5 26 4
3 cat brown male 5 30 5.2
4 cat brown female 4 22 3
5 dog black female 7 77 15.6
6 dog white female 8 55 10.5
7 dog brown male 10 90 20.1
8 bird white male 1 10 0.52
9 bird yellow female 2 15 0.89

Digression: You can use .to_markdown () to create a markdown notation table. It seems that it was implemented from 1.0.0 or later of pandas? However, in the end, .to_html () is indebted.

Basic usage

index How to set and reset the index

index setting

The Column name specified by set_index is the index of the level starting from 0 from animal. level→0 : animal, 1 : color, 2 : male or female


df2 = df.set_index(['animal','color','male or female'])
age height weight
animal color male or female
cat black male 3 20 3.6
white male 5 25 4.5
female 5 26 4
brown male 5 30 5.2
female 4 22 3
dog black female 7 77 15.6
white female 8 55 10.5
brown male 10 90 20.1
bird white male 1 10 0.52
yellow female 2 15 0.89

By the way, the index levels can be set in any order you like.

df3 = df.set_index(['animal','male or female','color'])
age height weight
animal male or female color
cat male black 3 20 3.6
white 5 25 4.5
female white 5 26 4
male brown 5 30 5.2
female brown 4 22 3
dog female black 7 77 15.6
white 8 55 10.5
male brown 10 90 20.1
bird male white 1 10 0.52
female yellow 2 15 0.89

Comparing with the above table, it can be confirmed that the values of the first and second levels are interchanged.

As a side note, .set_index () has a parameter drop, which is True by default. This specifies whether to delete the data of the column used as Index. Therefore, when drop = False, the data of the column specified in Index remains as it is as shown below.

df4 = df.set_index(['animal','color','male or female'], drop=False)
                                                                                                                                                                                                                                                 
animalcolormale or femaleageheightweight
animalcolormale or female
catblackmalecatblackmale3203.6
whitemalecatwhitemale5254.5
femalecatwhitefemale5264
brownmalecatbrownmale5305.2
femalecatbrownfemale4223
dogblackfemaledogblackfemale77715.6
whitefemaledogwhitefemale85510.5
brownmaledogbrownmale109020.1
birdwhitemalebirdwhitemale1100.52
yellowfemalebirdyellowfemale 2 15 0.89

Like this. Basically, it seems that there is no problem with the default, so you do not have to worry about it.

index reset

To reset the Index, use .reset_index ().

df1 = df2.reset_index()

If you compare df and df1, you can see that the table is the same.

bonus

Confirmation that the column name specified in Index is likely to be included in names.

df2.index

>
MultiIndex([( 'cat',  'black',   'male'),
            ( 'cat',  'white',   'male'),
            ( 'cat',  'white', 'female'),
            ( 'cat',  'brown',   'male'),
            ( 'cat',  'brown', 'female'),
            ( 'dog',  'black', 'female'),
            ( 'dog',  'white', 'female'),
            ( 'dog',  'brown',   'male'),
            ('bird',  'white',   'male'),
            ('bird', 'yellow', 'female')],
           names=['animal', 'color', 'male or female'])

I wonder if it can be used when checking the column name in Index.

df2.index.names

>
FrozenList(['animal', 'color', 'male or female'])

Data extraction

From here, we will use df2 to see various things.

Extraction of values contained in a particular level

Use .index.get_level_values (). This will return a specific level of label as a vector.

#Specified by a specific level column name
df2.index.get_level_values('animal')

> Index(['cat', 'cat', 'cat', 'cat', 'cat', 'dog', 'dog', 'dog', 'bird', 'bird'], dtype='object', name='animal')

Same result as above ↓

#Can also be specified by level number
df2.index.get_level_values(0)

> Index(['cat', 'cat', 'cat', 'cat', 'cat', 'dog', 'dog', 'dog', 'bird', 'bird'], dtype='object', name='animal')

Avoid duplication with set, convert to list, and sort it to create a new list. If you want to sort in descending order, set the argument reverse to True. (I don't think it makes much sense here.)

Extract labels without duplication

#Get label without duplication
a = sorted(list(set(df2.index.get_level_values('animal'))))
a

> ['bird', 'cat', 'dog']

Extract the value using .xs ()

By using .xs (), you can specify the index column name and its value (label) for selection or extraction.

Let's use the list created earlier to check the operation of xs (). The contents are ['bird','cat','dog'] ← Using the values in this list as keys, find the average height for each male and female animal.

M = {}
F = {}

for s in a:
    m = df2['height'].xs([s, 'male'], level=['animal', 'male or female']).mean()
    M[s] = m
    f = df2['height'].xs([s, 'female'], level=['animal', 'male or female']).mean()
    F[s] = f

#result
M
> {'bird': 10.0, 'cat': 25.0, 'dog': 90.0}
F
> {'bird': 15.0, 'cat': 24.0, 'dog': 66.0}

In addition to .mean (), .min (), .max (), .std (), .count (), etc. can be used according to the purpose.

For the time being, commentary

First, specify the column name of the value you want to find (this time I want to know the average of'height') df2['height']

Next, use .xs () to make detailed specifications. df2['height'].xs([s, 'male'], level=['animal', 'male or female'])

First, about [s,'male']. This time, in order to calculate the average height of each animal for each male (male) and female (female), specify a label corresponding to the information of "which animal and which gender" in the first argument of .xs ().

Next, about level = ['animal','male or female']. This seems to indicate what level the [s,'male'] you specified earlier is. So, it would be nice if you could correspond to the feeling of s in'animal'and'male' in'male or female'. Since the level can be specified by a number, the result is the same even if level = [0,2].

Finally, I want to find the average, so add .mean ().

The contents of the dictionary look like this ↓

#In the program above.mean()Remove and run
m = df2['height'].xs([s, 'male'], level=['animal', 'male or female'])

#result
M
> 
{'bird': color
         white    10
 Name: height, dtype: object,
 'cat': color
         black    20
         white    25
         brown    30
 Name: height, dtype: object,
 'dog': color
         brown    90
 Name: height, dtype: object}

For each animal, the value of'height'corresponding to the male (male) can be obtained.

Extract values using .loc

In order, I felt that I should check from here first. I think the usage of .loc is the same as for a normal DataFrame that is not Multiindex.

Data review

df2
>
                             age height weight
animal color  male or female                  
cat    black  male             3     20    3.6
       white  male             5     25    4.5
              female           5     26      4
       brown  male             5     30    5.2
              female           4     22      3
dog    black  female           7     77   15.6
       white  female           8     55   10.5
       brown  male            10     90   20.1
bird   white  male             1     10   0.52
       yellow female           2     15   0.89

For Index, specify'cat'of'animal'. Specify all column names of Column.

df2.loc['cat']
#Or
# df2.loc['cat', :]
>
                     age height weight
color male or female                  
black male             3     20    3.6
white male             5     25    4.5
      female           5     26      4
brown male             5     30    5.2
      female           4     22      3

You can specify only the column names of Column that you want to narrow down.

df2.loc['cat', ['height', 'weight']]
>
                     height weight
color male or female              
black male               20    3.6
white male               25    4.5
      female             26      4
brown male               30    5.2
      female             22      3

The first argument is the label of a specific level of Index, and the second argument is the column name of the Column you want to narrow down.

df2.loc['cat', 'height']
>
color  male or female
black  male              20
white  male              25
       female            26
brown  male              30
       female            22
Name: height, dtype: object

Specify multiple labels within the same level of Index.

df2.loc[['cat', 'bird'], 'height']
>
animal  color   male or female
cat     black   male              20
        white   male              25
                female            26
        brown   male              30
                female            22
bird    white   male              10
        yellow  female            15
Name: height, dtype: object

This time, how to cross the index level. Narrow down in order of Index level (from 0 in terms of number). If nothing is specified in (), the result is the same as df2.loc [:,'height'].

# 'cat'What is.
df2.loc[('cat'), 'height']
>
color  male or female
black  male              20
white  male              25
       female            26
brown  male              30
       female            22
Name: height, dtype: object


# 'cat'Of'white'What is.
df2.loc[('cat','white'), 'height']
>
male or female
male      25
female    26
Name: height, dtype: object


# 'cat'Of'white'Is'male'What is.
df2.loc[('cat','white','male'), 'height']
> 25

Bonus: Value assignment

If you narrow down to this point, you can rewrite the value.

df3 = df2.copy()
#Substitute 30
df3.loc[('cat','white','male'), 'height'] = 30

df3.loc[('cat','white','male'), 'height']
> 30

Specify by slice & notes

It seems that slices can also be used, but if you do it without thinking, the following error will occur

df2.loc[('cat','brown','male'):('dog','black','female')]

> UnsortedIndexError: 'Key length (3) was greater than MultiIndex lexsort depth (0)'

Solution Sort using .sort_index ().

df5 = df2.sort_index(level=0)
df5
>
                             age height weight
animal color  male or female                  
bird   white  male             1     10   0.52
       yellow female           2     15   0.89
cat    black  male             3     20    3.6
       brown  female           4     22      3
              male             5     30    5.2
       white  female           5     26      4
              male             5     25    4.5
dog    black  female           7     77   15.6
       brown  male            10     90   20.1
       white  female           8     55   10.5

Specify level = number as an argument. Since level = 0 was specified this time, it is sorted based on the animal label.

df5.loc[('cat','brown','male'):('dog','black','female')]
>
                            age height weight
animal color male or female                  
cat    brown male             5     30    5.2
       white female           5     26      4
             male             5     25    4.5
dog    black female           7     77   15.6

('cat','brown'): ('dog','white') can be narrowed down.

pd.indexSlice

idx = pd.IndexSlice
df5.loc[idx[:, :, ['male']], 'height']
>
 animal  color  male or female
bird    white  male              10
cat     black  male              20
        brown  male              30
        white  male              25
dog     brown  male              90
Name: height, dtype: object

Looking only at the output result, the male'height'for each animal can be extracted as in the case of using .xs (). So if you want to get the same result as using .xs (), do this. a = ['bird', 'cat', 'dog']

# pd.Use Index Slice
M_idx = {}
for s in a:
    M_idx[s] = df5.loc[idx[[s], :, ['male']], 'height'].mean()
M_idx
> {'bird': 10.0, 'cat': 25.0, 'dog': 90.0}

Comparison with when using .xs ()

# .xs()use
M = {}
for s in a:
    M[s] = df2['height'].xs([s, 'male'], level=['animal', 'male or female']).mean()
M
> {'bird': 10.0, 'cat': 25.0, 'dog': 90.0}

Which one to use seems to be personal preference

in conclusion

I feel that Multiindex can be used in various ways once I get used to it, but I'm still not sure how to use it. Later, there may be typographical errors or typographical errors that I overlooked, so I will correct them.

Reference site

MultiIndex / advanced indexing

Recommended Posts

How to use MultiIndex (personal memorandum)
How to use cron (personal memo)
Memorandum on how to use gremlin python
How to use xml.etree.ElementTree
How to use virtualenv
How to use Seaboan
How to use image-match
How to use shogun
How to use Pandas 2
How to use Virtualenv
How to use numpy.vectorize
How to use pytest_report_header
How to use partial
How to use SymPy
How to use x-means
How to use WikiExtractor.py
How to use IPython
How to use virtualenv
How to use Matplotlib
How to use iptables
How to use numpy
How to use TokyoTechFes2015
How to use venv
How to use dictionary {}
How to use Pyenv
How to use list []
How to use python-kabusapi
How to use OptParse
How to use dotenv
How to use pyenv-virtualenv
How to use imutils
How to use import
How to use search sorted
[gensim] How to use Doc2Vec
Understand how to use django-filter
How to use the generator
[Python] How to use list 1
How to use FastAPI ③ OpenAPI
How to use Python argparse
How to use IPython Notebook
How to use Pandas Rolling
[Note] How to use virtualenv
How to use redis-py Dictionaries
Python: How to use pydub
[Python] How to use checkio
[Go] How to use "... (3 periods)"
How to use Django's GeoIp2
[Python] How to use input ()
How to use the decorator
[Introduction] How to use open3d
How to use Python lambda
How to use Jupyter Notebook
[Python] How to use virtualenv
python3: How to use bottle (3)
How to use Google Colaboratory
How to use Python bytes
Python: How to use async with
How to use the zip function
How to use the optparse module
Summary of how to use pandas.DataFrame.loc
How to install and use Tesseract-OCR