[PYTHON] How to use MultiIndex (personal memorandum)

What is MultiIndex?

Click here for details ↓ https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html According to the document, it is a hierarchical index and can be used when manipulating high-dimensional data (roughly).

Panel that handles 3D data structures in pandas has been abolished, so check how to handle it here. Since it is a personal memorandum, it is not exhaustive at all. Therefore, if you have any questions, you have to investigate it as appropriate.

Environment etc.

python 3.7.7 pandas 1.1.1 Execution environment: jupyter notebook

Data confirmation

As expected, research data cannot be used, so sample data will be created appropriately.

Data creation

#pandas import
import pandas as pd

#Data creation
array =[['cat','cat','cat','cat','cat','dog','dog','dog','bird','bird'],
        ['black','white','white','brown','brown','black','white','brown','white','yellow'],
        ['male','male','female','male','female','female','female','male','male','female'],
        [3,5,5,5,4,7,8,10,1,2], 
        [20,25,26,30,22,77,55,90,10,15], 
        [3.6,4.5,4.0,5.2,3.0,15.6,10.5,20.1,0.52,0.89]]

#Convert to DataFrame.Transpose rows and columns with T
df = pd.DataFrame(array).T

#column settings
df.columns = ['animal','color','male or female','age','height','weight']

Data content

Like this. Since the content is created implicitly, there seems to be a very obese cat.

	animal	color	male or female	age	height	weight
0	cat	black	male	3	20	3.6
1	cat	white	male	5	25	4.5
2	cat	white	female	5	26	4
3	cat	brown	male	5	30	5.2
4	cat	brown	female	4	22	3
5	dog	black	female	7	77	15.6
6	dog	white	female	8	55	10.5
7	dog	brown	male	10	90	20.1
8	bird	white	male	1	10	0.52
9	bird	yellow	female	2	15	0.89

Digression: You can use .to_markdown () to create a markdown notation table. It seems that it was implemented from 1.0.0 or later of pandas? However, in the end, .to_html () is indebted.

Basic usage

index How to set and reset the index

index setting

The Column name specified by set_index is the index of the level starting from 0 from animal. level→0 : animal, 1 : color, 2 : male or female


df2 = df.set_index(['animal','color','male or female'])

			age	height	weight
animal	color	male or female
cat	black	male	3	20	3.6
	white	male	5	25	4.5
	white	female	5	26	4
	brown	male	5	30	5.2
	brown	female	4	22	3
dog	black	female	7	77	15.6
	white	female	8	55	10.5
	brown	male	10	90	20.1
bird	white	male	1	10	0.52
bird	yellow	female	2	15	0.89

By the way, the index levels can be set in any order you like.

df3 = df.set_index(['animal','male or female','color'])

			age	height	weight
animal	male or female	color
cat	male	black	3	20	3.6
	male	white	5	25	4.5
	female	white	5	26	4
	male	brown	5	30	5.2
	female	brown	4	22	3
dog	female	black	7	77	15.6
	female	white	8	55	10.5
	male	brown	10	90	20.1
bird	male	white	1	10	0.52
bird	female	yellow	2	15	0.89

Comparing with the above table, it can be confirmed that the values of the first and second levels are interchanged.

As a side note, .set_index () has a parameter drop, which is True by default. This specifies whether to delete the data of the column used as Index. Therefore, when drop = False, the data of the column specified in Index remains as it is as shown below.

df4 = df.set_index(['animal','color','male or female'], drop=False)

			animal	color	male or female	age	height	weight
animal	color	male or female
cat	black	male	cat	black	male	3	20	3.6
	white	male	cat	white	male	5	25	4.5
	white	female	cat	white	female	5	26	4
	brown	male	cat	brown	male	5	30	5.2
	brown	female	cat	brown	female	4	22	3
dog	black	female	dog	black	female	7	77	15.6
	white	female	dog	white	female	8	55	10.5
	brown	male	dog	brown	male	10	90	20.1
bird	white	male	bird	white	male	1	10	0.52
bird	yellow	female	bird	yellow	female	2	15	0.89

Like this. Basically, it seems that there is no problem with the default, so you do not have to worry about it.

index reset

To reset the Index, use .reset_index ().

df1 = df2.reset_index()

If you compare df and df1, you can see that the table is the same.

bonus

Confirmation that the column name specified in Index is likely to be included in names.

df2.index

>
MultiIndex([( 'cat',  'black',   'male'),
            ( 'cat',  'white',   'male'),
            ( 'cat',  'white', 'female'),
            ( 'cat',  'brown',   'male'),
            ( 'cat',  'brown', 'female'),
            ( 'dog',  'black', 'female'),
            ( 'dog',  'white', 'female'),
            ( 'dog',  'brown',   'male'),
            ('bird',  'white',   'male'),
            ('bird', 'yellow', 'female')],
           names=['animal', 'color', 'male or female'])

I wonder if it can be used when checking the column name in Index.

df2.index.names

>
FrozenList(['animal', 'color', 'male or female'])

Data extraction

From here, we will use df2 to see various things.

Extraction of values contained in a particular level

Use .index.get_level_values (). This will return a specific level of label as a vector.

#Specified by a specific level column name
df2.index.get_level_values('animal')

> Index(['cat', 'cat', 'cat', 'cat', 'cat', 'dog', 'dog', 'dog', 'bird', 'bird'], dtype='object', name='animal')

Same result as above ↓

#Can also be specified by level number
df2.index.get_level_values(0)

> Index(['cat', 'cat', 'cat', 'cat', 'cat', 'dog', 'dog', 'dog', 'bird', 'bird'], dtype='object', name='animal')

Avoid duplication with set, convert to list, and sort it to create a new list. If you want to sort in descending order, set the argument reverse to True. (I don't think it makes much sense here.)

Extract labels without duplication

#Get label without duplication
a = sorted(list(set(df2.index.get_level_values('animal'))))
a

>　['bird', 'cat', 'dog']

Extract the value using .xs ()

By using .xs (), you can specify the index column name and its value (label) for selection or extraction.

Let's use the list created earlier to check the operation of xs (). The contents are ['bird','cat','dog'] ← Using the values in this list as keys, find the average height for each male and female animal.

M = {}
F = {}

for s in a:
    m = df2['height'].xs([s, 'male'], level=['animal', 'male or female']).mean()
    M[s] = m
    f = df2['height'].xs([s, 'female'], level=['animal', 'male or female']).mean()
    F[s] = f

#result
M
> {'bird': 10.0, 'cat': 25.0, 'dog': 90.0}
F
> {'bird': 15.0, 'cat': 24.0, 'dog': 66.0}

In addition to .mean (), .min (), .max (), .std (), .count (), etc. can be used according to the purpose.

For the time being, commentary

First, specify the column name of the value you want to find (this time I want to know the average of'height') df2['height']

Next, use .xs () to make detailed specifications. df2['height'].xs([s, 'male'], level=['animal', 'male or female'])

First, about [s,'male']. This time, in order to calculate the average height of each animal for each male (male) and female (female), specify a label corresponding to the information of "which animal and which gender" in the first argument of .xs ().

Next, about level = ['animal','male or female']. This seems to indicate what level the [s,'male'] you specified earlier is. So, it would be nice if you could correspond to the feeling of s in'animal'and'male' in'male or female'. Since the level can be specified by a number, the result is the same even if level = [0,2].

Finally, I want to find the average, so add .mean ().

The contents of the dictionary look like this ↓

#In the program above.mean()Remove and run
m = df2['height'].xs([s, 'male'], level=['animal', 'male or female'])

#result
M
> 
{'bird': color
         white    10
 Name: height, dtype: object,
 'cat': color
         black    20
         white    25
         brown    30
 Name: height, dtype: object,
 'dog': color
         brown    90
 Name: height, dtype: object}

For each animal, the value of'height'corresponding to the male (male) can be obtained.

Extract values using .loc

In order, I felt that I should check from here first. I think the usage of .loc is the same as for a normal DataFrame that is not Multiindex.

Data review

df2
>
                             age height weight
animal color  male or female                  
cat    black  male             3     20    3.6
       white  male             5     25    4.5
              female           5     26      4
       brown  male             5     30    5.2
              female           4     22      3
dog    black  female           7     77   15.6
       white  female           8     55   10.5
       brown  male            10     90   20.1
bird   white  male             1     10   0.52
       yellow female           2     15   0.89

For Index, specify'cat'of'animal'. Specify all column names of Column.

df2.loc['cat']
#Or
# df2.loc['cat', :]
>
                     age height weight
color male or female                  
black male             3     20    3.6
white male             5     25    4.5
      female           5     26      4
brown male             5     30    5.2
      female           4     22      3

You can specify only the column names of Column that you want to narrow down.

df2.loc['cat', ['height', 'weight']]
>
                     height weight
color male or female              
black male               20    3.6
white male               25    4.5
      female             26      4
brown male               30    5.2
      female             22      3

The first argument is the label of a specific level of Index, and the second argument is the column name of the Column you want to narrow down.

df2.loc['cat', 'height']
>
color  male or female
black  male              20
white  male              25
       female            26
brown  male              30
       female            22
Name: height, dtype: object

Specify multiple labels within the same level of Index.

df2.loc[['cat', 'bird'], 'height']
>
animal  color   male or female
cat     black   male              20
        white   male              25
                female            26
        brown   male              30
                female            22
bird    white   male              10
        yellow  female            15
Name: height, dtype: object

This time, how to cross the index level. Narrow down in order of Index level (from 0 in terms of number). If nothing is specified in (), the result is the same as df2.loc [:,'height'].

# 'cat'What is.
df2.loc[('cat'), 'height']
>
color  male or female
black  male              20
white  male              25
       female            26
brown  male              30
       female            22
Name: height, dtype: object


# 'cat'Of'white'What is.
df2.loc[('cat','white'), 'height']
>
male or female
male      25
female    26
Name: height, dtype: object


# 'cat'Of'white'Is'male'What is.
df2.loc[('cat','white','male'), 'height']
> 25

Bonus: Value assignment

If you narrow down to this point, you can rewrite the value.

df3 = df2.copy()
#Substitute 30
df3.loc[('cat','white','male'), 'height'] = 30

df3.loc[('cat','white','male'), 'height']
> 30

Specify by slice & notes

It seems that slices can also be used, but if you do it without thinking, the following error will occur

df2.loc[('cat','brown','male'):('dog','black','female')]

> UnsortedIndexError: 'Key length (3) was greater than MultiIndex lexsort depth (0)'

Solution Sort using .sort_index ().

df5 = df2.sort_index(level=0)
df5
>
                             age height weight
animal color  male or female                  
bird   white  male             1     10   0.52
       yellow female           2     15   0.89
cat    black  male             3     20    3.6
       brown  female           4     22      3
              male             5     30    5.2
       white  female           5     26      4
              male             5     25    4.5
dog    black  female           7     77   15.6
       brown  male            10     90   20.1
       white  female           8     55   10.5

Specify level = number as an argument. Since level = 0 was specified this time, it is sorted based on the animal label.

df5.loc[('cat','brown','male'):('dog','black','female')]
>
                            age height weight
animal color male or female                  
cat    brown male             5     30    5.2
       white female           5     26      4
             male             5     25    4.5
dog    black female           7     77   15.6

('cat','brown'): ('dog','white') can be narrowed down.

pd.indexSlice

idx = pd.IndexSlice
df5.loc[idx[:, :, ['male']], 'height']
>
 animal  color  male or female
bird    white  male              10
cat     black  male              20
        brown  male              30
        white  male              25
dog     brown  male              90
Name: height, dtype: object

Looking only at the output result, the male'height'for each animal can be extracted as in the case of using .xs (). So if you want to get the same result as using .xs (), do this. a = ['bird', 'cat', 'dog']

# pd.Use Index Slice
M_idx = {}
for s in a:
    M_idx[s] = df5.loc[idx[[s], :, ['male']], 'height'].mean()
M_idx
> {'bird': 10.0, 'cat': 25.0, 'dog': 90.0}

Comparison with when using .xs ()

# .xs()use
M = {}
for s in a:
    M[s] = df2['height'].xs([s, 'male'], level=['animal', 'male or female']).mean()
M
> {'bird': 10.0, 'cat': 25.0, 'dog': 90.0}

Which one to use seems to be personal preference

in conclusion

I feel that Multiindex can be used in various ways once I get used to it, but I'm still not sure how to use it. Later, there may be typographical errors or typographical errors that I overlooked, so I will correct them.

Reference site

MultiIndex / advanced indexing