[PYTHON] How to change PyYAML's behavior when loading / dumping yaml, and the details behind it

Introduction

PyYAML is a yaml library for Python.

This is a story about trying to change how this library behaves when loading / dumping yaml.

What I want to do

Let's change PyYAML's behavior to suit our needs, using the following two cases as examples.

- Support OrderedDict (input / output in a format that other yaml loaders can read)
- Output in a representation compatible with json

Premise memo

Here is some background you should know first.

- Representer: a hook object that adds a tag at dump time
- Constructor: a hook object that builds the Python representation at load time

Supports OrderedDict (input / output in a format that can be read by other yaml loaders)

To support OrderedDict, add the following code.

from collections import OrderedDict

import yaml

def represent_odict(dumper, instance):
    return dumper.represent_mapping('tag:yaml.org,2002:map', instance.items())

yaml.add_representer(OrderedDict, represent_odict)

def construct_odict(loader, node):
    return OrderedDict(loader.construct_pairs(node))

yaml.add_constructor('tag:yaml.org,2002:map', construct_odict)

The same settings can also be found in the following Qiita article.

- Read with PyYAML in order - Qiita

This time I happened to take a look inside PyYAML, so I decided to write down the details, such as what the settings above actually mean.

The story of the output (yaml.dump)

Use yaml.dump() to turn a Python object into a yaml string.

import yaml

with open("ok.yaml", "w") as wf:
    yaml.dump({200: "ok"}, wf)

yaml.dump() internally calls Representer.represent(). This transforms the Python object into a Node object internal to yaml, which is then converted to a string by Serializer.serialize().
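As a rough sketch of these two stages, the following calls them separately. It assumes that SafeRepresenter can be instantiated on its own and that yaml.serialize() accepts the resulting Node; yaml.dump() itself does all of this internally.

```python
import yaml
from yaml.representer import SafeRepresenter

# Stage 1: Python object -> Node tree (what the Representer does)
representer = SafeRepresenter()
node = representer.represent_data({200: "ok"})
print(type(node).__name__, node.tag)

# Stage 2: Node tree -> yaml text (what the Serializer and Emitter do)
text = yaml.serialize(node)
print(text)
```

The intermediate Node is where the tag lives, which is why the hooks discussed later operate on tags rather than on strings.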

Internally, type() is called on each Python object and processing branches according to the returned type. If no representer is registered for that exact type, the object's mro is walked to search for a candidate. (If nothing matches along the way, the search ends at object, the last entry of the mro, and represent_object() is called.)
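A minimal sketch of the mro-based lookup described above. Base, Child, represent_base and DemoDumper are hypothetical names made up for this illustration. It relies on add_multi_representer, which, as far as I can tell, is the registry that is consulted while walking the mro.

```python
import yaml


class Base:
    def __init__(self, name):
        self.name = name


class Child(Base):
    pass


def represent_base(dumper, instance):
    # Emit any Base (or subclass) instance as a plain map node.
    return dumper.represent_mapping('tag:yaml.org,2002:map', {'name': instance.name})


class DemoDumper(yaml.Dumper):
    """Hypothetical Dumper subclass so the global Dumper stays untouched."""


# Registered for Base only; Child is found by walking its mro.
DemoDumper.add_multi_representer(Base, represent_base)

text = yaml.dump(Child("x"), Dumper=DemoDumper, default_flow_style=False)
print(text)
```

No representer is registered for Child itself, so the one registered for Base is picked up before the fallback for object is reached.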

PyYAML apparently wants to distinguish OrderedDict from dict, so it deliberately registers a dedicated representer function for OrderedDict.

Representer.add_representer(collections.OrderedDict,
        Representer.represent_ordered_dict)

That is why the following Python object

d = OrderedDict()
d["a"] = 1
d["z"] = 2

is output as follows.

!!python/object/apply:collections.OrderedDict
- - [a, 1]
  - [z, 2]

To prevent this, replace the representer setting so that an OrderedDict is output the same way as a normal dict. The output must still preserve the key order, however.

As with OrderedDict, there is a Representer setting for dict. Internally it is treated as a map Node.

SafeRepresenter.add_representer(dict,
        SafeRepresenter.represent_dict)

class SafeRepresenter:
    def represent_dict(self, data):
        return self.represent_mapping('tag:yaml.org,2002:map', data)

That's why you can add the following to output OrderedDict as a map node.

def represent_odict(dumper, instance):
    return dumper.represent_mapping('tag:yaml.org,2002:map', instance.items())

yaml.add_representer(OrderedDict, represent_odict)
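Here is a small sketch to check the effect of this setting. To avoid changing the global Dumper, it registers the representer on a hypothetical subclass named OrderedDumper; the keys are inserted in non-alphabetical order on purpose.

```python
from collections import OrderedDict

import yaml


class OrderedDumper(yaml.Dumper):
    """Hypothetical Dumper subclass used only for this example."""


def represent_odict(dumper, instance):
    return dumper.represent_mapping('tag:yaml.org,2002:map', instance.items())


OrderedDumper.add_representer(OrderedDict, represent_odict)

d = OrderedDict()
d["z"] = 2
d["a"] = 1

text = yaml.dump(d, Dumper=OrderedDumper, default_flow_style=False)
print(text)
# z: 2
# a: 1
```

Passing instance.items() rather than the OrderedDict itself seems to matter here: as far as I can tell, represent_mapping sorts anything that has an items() method, while a plain sequence of pairs is emitted in the order given.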

The story of the input (yaml.load)

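import yaml

with open("ok.yaml") as rf:
    # Recent PyYAML versions require an explicit Loader.
    # yaml.Loader is not safe for untrusted input.
    data = yaml.load(rf, Loader=yaml.Loader)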

The same is true for load: inside yaml.load(), Constructor.get_single_data() is called. It in turn uses get_single_node() to create a Node object and construct_document() to convert that Node into a Python object.

Here too, you can add conversion support per Node. The Node object itself is defined as follows.

class Node(object):
    def __init__(self, tag, value, start_mark, end_mark):
        self.tag = tag
        self.value = value
        self.start_mark = start_mark
        self.end_mark = end_mark

The tag determines how a node is processed.
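To look at the tags without constructing any Python objects, yaml.compose() can be used to stop at the Node stage. A short sketch, passing SafeLoader explicitly:

```python
import yaml

# compose() parses the text and returns the root Node without constructing data.
node = yaml.compose("200: ok", Loader=yaml.SafeLoader)
print(node.tag)

# For a mapping node, value is a list of (key_node, value_node) tuples.
key_node, value_node = node.value[0]
print(key_node.tag, repr(key_node.value))
print(value_node.tag, repr(value_node.value))
```

The key node carries the int tag while its value is still the string '200'; the conversion to a Python int happens only later, in the constructor registered for that tag.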

With the representer change above, the data is stored as a node with the map tag (treated as a dict) regardless of whether it was an OrderedDict. So it is enough to set the conversion for nodes with the map tag.

Incidentally, yaml actually has pairs, which are convenient for preserving order. It is easy to build an OrderedDict via these.

A node with the pairs tag is converted as follows (each pair becomes a tuple of length 2).

import yaml

s = """\
foo:
  !!pairs
  - a: b
  - x: y
"""
yaml.load(s, Loader=yaml.Loader)  # {'foo': [('a', 'b'), ('x', 'y')]}

Taking the same pair-based approach, add the following settings.

def construct_odict(loader, node):
    return OrderedDict(loader.construct_pairs(node))

yaml.add_constructor('tag:yaml.org,2002:map', construct_odict)
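A quick sketch to confirm that the order survives loading. It registers the constructor on a hypothetical SafeLoader subclass named OrderedLoader instead of globally, so other code that loads yaml is not affected.

```python
from collections import OrderedDict

import yaml


class OrderedLoader(yaml.SafeLoader):
    """Hypothetical Loader subclass used only for this example."""


def construct_odict(loader, node):
    return OrderedDict(loader.construct_pairs(node))


OrderedLoader.add_constructor('tag:yaml.org,2002:map', construct_odict)

data = yaml.load("z: 2\na: 1\n", Loader=OrderedLoader)
print(data)
```

Note that yaml.add_constructor() without a Loader argument does not, to my knowledge, touch SafeLoader, so yaml.safe_load() keeps returning plain dicts unless the constructor is registered as above.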

Output in a representation compatible with json (set the dict numeric key to a string)

Managing configuration files in json can be troublesome because you cannot write comments. In such cases they may be written in yaml instead. Settings such as JSON Schema and Swagger are especially hard to write in json, so I sometimes use yaml for them.

Most of the time this causes no problems. Occasionally, though, a dict with numeric keys causes trouble, because json and yaml represent it differently. Specifically, the situation is as follows.

Suppose there is a JSON Schema like this (written here as a Python dict).

schema = {
    "type": "object",
    "patternProperties": {
        r"\d{3}": {"type": "string"}
    },
    "additionalProperties": False
}

Consider the following dict as a value meant to match this schema. A slight inconvenience arises (inconvenient, but not incorrect). Round-tripping it through json validates successfully, but doing the same through yaml fails validation.

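import json
from jsonschema import validate  # assumption: validate() comes from the jsonschema package

validate(json.loads(json.dumps({200: "ok"})), schema)  # ok

import yaml
from io import StringIO

# I wish the yaml module defined loads and dumps like the json module does...

def loads(s):
    io = StringIO(s)
    # An explicit Loader is required in recent PyYAML versions.
    return yaml.load(io, Loader=yaml.Loader)


def dumps(d):
    io = StringIO()
    yaml.dump(d, io)
    return io.getvalue()

validate(loads(dumps({200: "ok"})), schema)  # error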

The reason is that the dict above is represented as follows in yaml.

200: ok

Written out more verbosely and precisely, the yaml above looks like this. In yaml, a numeric key of a map Node stays numeric.

!!int 200: ok

# Written like this, it is recognized as {'200': 'ok'}
'200': ok

# Written like this, it is also recognized as {'200': 'ok'}
!!str 200: ok
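The difference between the three notations, and what a json round trip does, can be checked with a few lines. This sketch uses yaml.safe_load() with the default constructors, that is, without any of the customizations from this article.

```python
import json

import yaml

plain = yaml.safe_load("200: ok")         # key resolved as an int
quoted = yaml.safe_load("'200': ok")      # quoted, so the key is a str
tagged = yaml.safe_load("!!str 200: ok")  # explicit tag forces a str

# json only has string keys, so the int key becomes a str on the way out.
via_json = json.loads(json.dumps({200: "ok"}))

print(plain, quoted, tagged, via_json)
# {200: 'ok'} {'200': 'ok'} {'200': 'ok'} {'200': 'ok'}
```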

JSON allows only strings as object keys, so the json module automatically converts the key to a string. What follows adds a setting that makes PyYAML match this behavior. It is essentially a review of what we have covered so far: add the following settings.

def construct_json_compatible_map(loader, node):
    return {str(k): v for k, v in loader.construct_pairs(node)}

yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG  # 'tag:yaml.org,2002:map'
yaml.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, construct_json_compatible_map)
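As a sketch of the effect, the same constructor can be registered on a hypothetical SafeLoader subclass (JSONCompatLoader is a name invented for this example) and applied to a document with numeric keys at two levels.

```python
import json

import yaml


class JSONCompatLoader(yaml.SafeLoader):
    """Hypothetical Loader subclass used only for this example."""


def construct_json_compatible_map(loader, node):
    # Stringify every key, as json would on dump.
    return {str(k): v for k, v in loader.construct_pairs(node)}


JSONCompatLoader.add_constructor(
    yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG,
    construct_json_compatible_map,
)

data = yaml.load("200: ok\nnested:\n  404: not found\n", Loader=JSONCompatLoader)
print(data)

# The loaded data now survives a json round trip unchanged.
round_tripped = json.loads(json.dumps(data))
```

Because the constructor is attached to the map tag, it also applies to nested mappings, so the inner 404 key becomes a string as well.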
