[LINUX] How to create an OCF compliant resource agent

I want to manage my own application in a cluster with Pacemaker

You can build one by reading the OCF Resource Agent Developer's Guide, but the guide is long. This article is for people who just want to get something working for now.

OCF basics

OCF stands for Open Cluster Framework, which defines an interface for clustering applications. Cluster managers such as Pacemaker treat the applications and virtual IP addresses they manage as "resources", and tell those resources to start, stop, migrate, get promoted to master, get demoted to slave, and so on. A resource agent is an OCF-compliant program that sits between a resource and the cluster manager. By writing one, you can cluster your own application with Pacemaker. The goal of this article is to write such a resource agent ourselves.

Specifically, whenever an OCF-compliant cluster manager needs to control a resource, it runs an executable and passes it the action to perform. This executable is the resource agent, and it is what actually operates the resource. The action arrives as the first command-line argument, and the OCF shell function library makes it available as the variable $__OCF_ACTION; the resource agent reads this variable and performs the requested action. Actions include start, stop, migration, promotion to master, and demotion to slave; some must be implemented and others are optional. Parameters needed by an action are passed in environment variables prefixed with OCF_RESKEY_. A resource agent can be written in any language as long as the executable meets the OCF API requirements, but in practice it is usually a shell script.
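To make the calling convention concrete, here is a minimal sketch of a single invocation. mini_agent is a hypothetical stand-in for a real agent script, written as a function so that the snippet runs on its own. A real agent would get $__OCF_ACTION from .ocf-shellfuncs instead of setting it by hand.

```shell
#!/bin/sh
# Hypothetical stand-in for an agent script (not part of any OCF package).
mini_agent() {
    # In a real agent, .ocf-shellfuncs copies the first argument into
    # $__OCF_ACTION; here it is done by hand to stay self-contained.
    __OCF_ACTION=$1
    case "$__OCF_ACTION" in
    start)  echo "start requested, greet=${OCF_RESKEY_greet:-<unset>}" ;;
    stop)   echo "stop requested" ;;
    *)      echo "unsupported action: $__OCF_ACTION"
            return 3 ;;   # 3 is the value of OCF_ERR_UNIMPLEMENTED
    esac
}

# Roughly what the cluster manager does: export the resource parameters,
# then run the agent with the action as its argument.
(
    OCF_RESKEY_greet='Hi!'
    export OCF_RESKEY_greet
    mini_agent start
)
mini_agent stop
```

The subshell keeps the exported parameter from leaking into the second call, which mirrors the fact that every action is a separate process with its own environment.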

The simplest resource agent

The simplest OCF compliant code

sample-resource


#!/bin/sh

#Initialize
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/resource.d/heartbeat}
. ${OCF_FUNCTIONS_DIR}/.ocf-shellfuncs

RUNNING_FILE=/tmp/.running

sample_meta_data() {
    cat << EOF
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="sample-resource" version="0.1">
    <version>0.1</version>
    <longdesc lang="en">sample resource</longdesc>
    <shortdesc lang="en">sample resource</shortdesc>
    <parameters>
    </parameters>
    <actions>
        <action name="meta-data" timeout="5" />
        <action name="start" timeout="5" />
        <action name="stop" timeout="5" />
        <action name="monitor" timeout="5" />
        <action name="validate-all" timeout="5" />
    </actions>
</resource-agent>
EOF
    return $OCF_SUCCESS
}

sample_validate(){
    return $OCF_SUCCESS
}

sample_start(){
    touch ${RUNNING_FILE}
    return $OCF_SUCCESS
}

sample_stop(){
    rm -f ${RUNNING_FILE}
    return $OCF_SUCCESS
}

sample_monitor(){
    if [ -f ${RUNNING_FILE} ];
    then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

sample_usage(){
    echo "Test Resource."
    return $OCF_SUCCESS
}

# Translate each action into the appropriate function call
case $__OCF_ACTION in
meta-data)      sample_meta_data
                exit $OCF_SUCCESS
                ;;
start)          sample_start;;
stop)           sample_stop;;
monitor)        sample_monitor;;
validate-all)   sample_validate;;
*)              sample_usage
                exit $OCF_ERR_UNIMPLEMENTED
                ;;
esac

Initialization

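This is boilerplate that also appears in the OCF Resource Agent Developer's Guide. It loads the shell function library, which defines the return code constants such as $OCF_SUCCESS and sets $__OCF_ACTION.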

#Initialize
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/resource.d/heartbeat}
. ${OCF_FUNCTIONS_DIR}/.ocf-shellfuncs

Implement required actions

The actions described in this section (meta-data, start, stop, monitor, and validate-all) must be implemented. Everything else is optional, so for now let's define just these.

Each action must return an appropriate return value. The return values are defined by OCF (https://linux-ha.osdn.jp/wp/archives/4328#3).
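For reference when reading exit statuses by hand, the sketch below lists the numeric values behind the most common return code constants. The constants normally come from .ocf-shellfuncs and are repeated here only to keep the snippet self-contained. ocf_rc_name is a hypothetical helper, not part of the OCF library.

```shell
#!/bin/sh
# Numeric values of the OCF return codes used in this article.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_ERR_ARGS=2
OCF_ERR_UNIMPLEMENTED=3
OCF_ERR_PERM=4
OCF_ERR_INSTALLED=5
OCF_ERR_CONFIGURED=6
OCF_NOT_RUNNING=7

# Hypothetical helper: translate an exit status back into its name.
ocf_rc_name() {
    case "$1" in
    "$OCF_SUCCESS")            echo OCF_SUCCESS ;;
    "$OCF_ERR_GENERIC")        echo OCF_ERR_GENERIC ;;
    "$OCF_ERR_ARGS")           echo OCF_ERR_ARGS ;;
    "$OCF_ERR_UNIMPLEMENTED")  echo OCF_ERR_UNIMPLEMENTED ;;
    "$OCF_ERR_PERM")           echo OCF_ERR_PERM ;;
    "$OCF_ERR_INSTALLED")      echo OCF_ERR_INSTALLED ;;
    "$OCF_ERR_CONFIGURED")     echo OCF_ERR_CONFIGURED ;;
    "$OCF_NOT_RUNNING")        echo OCF_NOT_RUNNING ;;
    *)                         echo "unknown ($1)" ;;
    esac
}

ocf_rc_name 7   # prints OCF_NOT_RUNNING
```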

Environment variables used within the resource agent

Before getting into the concrete definition of each required action, here are the environment variables used inside the resource agent.

$__OCF_ACTION variable

Holds the action requested of the resource agent. It is set when the resource agent is called (the shell function library copies it from the first command-line argument). The resource agent first reads this variable and dispatches to the matching action.

Implementation example (see the OCF Resource Agent Developer's Guide)

case $__OCF_ACTION in
meta-data)      sample_meta_data
                exit $OCF_SUCCESS
                ;;
start)          sample_start;;
stop)           sample_stop;;
monitor)        sample_monitor;;
validate-all)   sample_validate;;
*)              sample_usage
                exit $OCF_ERR_UNIMPLEMENTED
                ;;
esac

$OCF_RESKEY_<parameter name> variable

These variables hold the parameters that were set when the resource was created. They are not used in this example, but they are used later when clustering the TCP server daemon.

meta-data

Metadata provides basic information about the resource agent, such as its name, the actions it provides, and the parameters it accepts. It is written in XML, and the resource agent prints it to standard output when asked.

start

Implements resource startup. Normally this is where you write the daemon startup logic. This time no real daemon is started; the action just creates a file. Return $OCF_SUCCESS if startup succeeds.

stop

Implements resource shutdown. Normally this is where you write the daemon shutdown logic. This time it deletes the file created by start.

If stopping succeeds, return $OCF_SUCCESS. This also applies when the resource was already stopped (note that the value is not $OCF_NOT_RUNNING).

Note that the stop action means a "forced stop" of the resource: it must stop the resource even if it cannot be stopped cleanly. If the stop action fails, the cluster manager may fence the node (isolate it, for example by a forced shutdown), because a resource that cannot be stopped can cause fatal problems. The stop action should therefore use every possible means to stop the resource and return an error code only if the resource still cannot be stopped.
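One common way to honor "every possible means" is to escalate from SIGTERM to SIGKILL. The following is only a sketch under that assumption. stop_by_pid and its grace period argument are hypothetical names, and the return codes are defined inline instead of being loaded from .ocf-shellfuncs.

```shell
#!/bin/sh
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# stop_by_pid PID [GRACE_SECONDS]  (hypothetical helper)
stop_by_pid() {
    pid=$1
    grace=${2:-5}

    # Already stopped: the stop action must still report success.
    kill -0 "$pid" 2>/dev/null || return $OCF_SUCCESS

    # Ask politely first and give the process some time to exit.
    kill -TERM "$pid" 2>/dev/null
    while [ "$grace" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return $OCF_SUCCESS
        sleep 1
        grace=$((grace - 1))
    done

    # Last resort: SIGKILL cannot be caught or ignored.
    kill -KILL "$pid" 2>/dev/null
    sleep 1
    kill -0 "$pid" 2>/dev/null && return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

# Demo: stop a throwaway background process.
sleep 60 &
stop_by_pid "$!" 3
echo "stop returned $?"
```

Only the final branch returns an error, so the cluster manager sees a failure (and may fence the node) only when even SIGKILL did not remove the process.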

monitor

Implements the check of the resource's status. Return $OCF_SUCCESS if the resource is running and $OCF_NOT_RUNNING if it is not. If an error occurs, return the appropriate error constant starting with $OCF_ERR_.

validate-all

Validates the resource configuration: check that the parameters are set correctly and that the files used by the resource have appropriate permissions. The return value must be one of the following:

Return value          Meaning
$OCF_SUCCESS          No problem
$OCF_ERR_CONFIGURED   There is a problem with the configuration
$OCF_ERR_INSTALLED    A required component is missing (for example, the daemon to be started is not installed)
$OCF_ERR_PERM         There is a problem with the access permissions of a file needed to manage the resource

(This example is so simple that it always returns $OCF_SUCCESS.)
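For comparison, a validate-all that actually checks something could look like the sketch below. validate_sketch is a hypothetical function, and the rule that an empty greeting is a configuration error is an assumption made for illustration. The return codes are defined inline to keep the snippet self-contained.

```shell
#!/bin/sh
OCF_SUCCESS=0
OCF_ERR_INSTALLED=5
OCF_ERR_CONFIGURED=6

# validate_sketch DAEMON GREET  (hypothetical helper)
validate_sketch() {
    daemon=$1
    greet=$2

    # The program the resource depends on must be installed.
    command -v "$daemon" >/dev/null 2>&1 || return $OCF_ERR_INSTALLED

    # Assumed rule for this sketch: an empty greeting is a configuration error.
    [ -n "$greet" ] || return $OCF_ERR_CONFIGURED

    return $OCF_SUCCESS
}

validate_sketch sh 'Hi!'
echo "validate returned $?"
```

Checking for the binary before checking the parameters means a missing installation is reported as such, rather than being hidden behind a configuration error.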

Try to test

You can test with ocf-tester.

# ocf-tester -n [resource name] [path to resource agent]

salacia@ha1:~/ocf-scr$ sudo ocf-tester -n sample-resource ./sample-resource
Beginning tests for ./sample-resource...
* Your agent does not support the notify action (optional)
* Your agent does not support the demote action (optional)
* Your agent does not support the promote action (optional)
* Your agent does not support master/slave (optional)
* Your agent does not support the reload action (optional)
./sample-resource passed all tests

Running it under Pacemaker

Deploy a resource agent

Pacemaker looks for resource agents under /usr/lib/ocf/resource.d/. Create a directory there named after the provider, and place the resource agent in it.

My handle is kamaboko, so I used kamaboko as the provider name and placed sample-resource there. (The resource agent must be deployed on every node of the cluster.)

salacia@ha1:~/ocf-scr$ ls -al /usr/lib/ocf/resource.d/kamaboko/
total 16
drwxrwxr-x 2 root root 4096 Aug 11 14:07 .
drwxr-xr-x 6 root root 4096 Jun 21 03:36 ..
-rwxr-xr-x 1 root root 1547 Aug 11 14:07 sample-resource
-rwxrwxr-x 1 root root 2103 Jun 21 03:36 sample-tcp-server

Now that Pacemaker can see the resource agent, let's create a resource with the pcs command.

salacia@ha1:~/ocf-scr$ sudo pcs resource create SAMPLE ocf:kamaboko:sample-resource

salacia@ha1:~$ sudo pcs status
Cluster name: c1
Stack: corosync
Current DC: ha2 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Tue Aug 11 14:15:47 2020
Last change: Tue Aug 11 14:15:45 2020 by root via cibadmin on ha1

2 nodes configured
1 resource configured

Online: [ ha1 ha2 ]

Full list of resources:

 SAMPLE (ocf::kamaboko:sample-resource):        Started ha1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

I was able to run my own resource agent in Pacemaker.

A little more practical example

The sample code is here: OCF Resource Agent Sample (https://github.com/kamaboko123/OCF_resource_agent_sample)

For now only a .deb package is provided, so it should work on Debian, Ubuntu, and similar systems. (The development environment is Ubuntu 18.04.) Pacemaker is assumed to be installed already.

Apps to cluster

I needed some daemon to cluster, so I wrote an app that simply returns a greeting when you connect over TCP. (Please go easy on the code itself; it is only a sample.) When you start the application and connect with telnet or similar, it returns the greeting text set in an environment variable together with the node's own hostname.

daemon/tcp-server.py


#!/usr/bin/python3
import os
import signal
import socket
import sys
import threading

PORT = 5678
PID_FILE_DIR = "/var/run/sample-tcp-server"
PID_FILE_NAME = "tcp-server.pid"
PID_FILE = "%s/%s" % (PID_FILE_DIR, PID_FILE_NAME)
EXIT = False
GREET = os.environ.get("GREET", "Hello!")

def signal_handler(signum, stack):
    # Only set a flag here; the accept loop polls it and shuts down cleanly.
    global EXIT
    EXIT = True

def server():
    os.makedirs(PID_FILE_DIR, exist_ok=True)
    if os.path.isfile(PID_FILE):
        raise Exception("Already running")

    # Record our PID so that the startup can be confirmed from outside.
    with open(PID_FILE, "w") as f:
        f.write(str(os.getpid()))

    print("Create Socket")
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(('', PORT))
    s.listen(5)
    # Wake up once a second so that the EXIT flag is noticed even when idle.
    s.settimeout(1.0)

    try:
        while True:
            if EXIT:
                raise Exception("Stop daemon due to receive signal")

            try:
                (con, addr) = s.accept()
            except socket.timeout:
                continue
            t = threading.Thread(
                target=handler,
                args=(con, addr),
                daemon=False
            )
            t.start()
    except Exception as e:
        sys.stderr.write("%s\n" % e)
    finally:
        print("Close Socket")
        s.close()
        os.remove(PID_FILE)
        return

def handler(con, addr):
    # Send the greeting and this node's hostname, then hang up.
    con.send(("%s, This is %s!\n" % (GREET, socket.gethostname())).encode())
    con.close()

if __name__ == '__main__':
    # Register the signal handler (not the connection handler).
    signal.signal(signal.SIGINT, signal_handler)
    signal.signal(signal.SIGTERM, signal_handler)
    server()

It uses only standard modules, so no libraries need to be installed. The app writes a PID file so that its startup can be confirmed. The PID file is deleted when the application exits, so the app's state can be checked by whether the file exists. However, the file is not deleted if the process is killed with SIGKILL or similar, so this is not a very good approach. (This is just a test, so I did it this way for simplicity.)

Start-up

GREET=Hi! python3 tcp-server.py

Try connecting with telnet

salacia@ha1:~$ telnet localhost 5678
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Hi!, This is ha1!
Connection closed by foreign host.

The sample code includes a service file so that this app can be started and stopped with systemd. Now let's run the app in a high-availability cluster to make it redundant.

Installation

Clone the repository and run make to build the package, then install it with dpkg. (It must be installed on every node of the cluster.)

git clone https://github.com/kamaboko123/OCF_resource_agent_sample.git
cd OCF_resource_agent_sample
make
sudo dpkg -i dist/sampletcpserver_1.0_amd64.deb

Operation check

For now, configure a virtual IP (VIP) for the two nodes so that it fails over when a failure occurs.

# Register the VIP as a resource
sudo pcs resource create VIP ocf:heartbeat:IPaddr2 ip=172.16.0.50 cidr_netmask=24 op monitor interval=10s on-fail="standby"

# Register the sample application as a resource
sudo pcs resource create TCP-SERVER ocf:kamaboko:sample-tcp-server greet=Hi!

# Add a colocation constraint so that the sample app runs on the same node as the VIP
sudo pcs constraint colocation add TCP-SERVER with VIP INFINITY

Connect to the virtual IP address by telnet from an external node.

salacia@Vega:~$ telnet 172.16.0.50 5678
Trying 172.16.0.50...
Connected to 172.16.0.50.
Escape character is '^]'.
Hi!, This is ha1!
Connection closed by foreign host.

Shut down the connected node to fail over and verify that the service is still available.

# Shut down the node (ha1) on which the VIP and TCP-SERVER resources are currently running
salacia@ha1:~$ sudo pcs status
[sudo] password for salacia:
Cluster name: c1
Stack: corosync
Current DC: ha2 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Tue Aug 11 14:31:30 2020
Last change: Tue Aug 11 14:17:26 2020 by root via cibadmin on ha1

2 nodes configured
2 resources configured

Online: [ ha1 ha2 ]

Full list of resources:

 VIP    (ocf::heartbeat:IPaddr2):       Started ha1
 TCP-SERVER     (ocf::kamaboko:sample-tcp-server):      Started ha1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

salacia@ha1:~$ sudo shutdown -h now
Connection to 172.16.0.51 closed by remote host.
Connection to 172.16.0.51 closed.



# From the external node, check that the service is still available at the VIP
salacia@Vega:~$ telnet 172.16.0.50 5678
Trying 172.16.0.50...
Connected to 172.16.0.50.
Escape character is '^]'.
Hi!, This is ha2!
Connection closed by foreign host.

# Check the status on the failover destination node (ha2)
salacia@ha2:~$ sudo pcs status
[sudo] password for salacia:
Cluster name: c1
Stack: corosync
Current DC: ha2 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Tue Aug 11 14:35:51 2020
Last change: Tue Aug 11 14:17:26 2020 by root via cibadmin on ha1

2 nodes configured
2 resources configured

Online: [ ha2 ]
OFFLINE: [ ha1 ]

Full list of resources:

 VIP    (ocf::heartbeat:IPaddr2):       Started ha2
 TCP-SERVER     (ocf::kamaboko:sample-tcp-server):      Started ha2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Incidentally, to test this agent with ocf-tester, specify the full path to the installed agent.

sudo ocf-tester -n sample-tcp-server /usr/lib/ocf/resource.d/kamaboko/sample-tcp-server

Commentary

As explained earlier, the resource agent is written as a shell script.

/usr/lib/ocf/resource.d/kamaboko/sample-tcp-server


#!/bin/sh

#Initialize
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/resource.d/heartbeat}
. ${OCF_FUNCTIONS_DIR}/.ocf-shellfuncs

#default value
OCF_RESKEY_greet_default="Hello!"
: ${OCF_RESKEY_greet=${OCF_RESKEY_greet_default}}

#environment variables for systemd
DAEMON_PID_FILE=/var/run/sample-tcp-server/tcp-server.pid

sample_meta_data() {
    cat << EOF
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="sample-tcp-server" version="0.1">
    <version>0.1</version>
    <longdesc lang="en">sample tcp server</longdesc>
    <shortdesc lang="en">sample tcp server</shortdesc>
    <parameters>
        <parameter name="greet" unique="0" required="0">
            <longdesc lang="en">greet message</longdesc>
            <shortdesc lang="en">greet message</shortdesc>
            <content type="string"/>
        </parameter>
    </parameters>
    <actions>
        <action name="meta-data" timeout="5" />
        <action name="start" timeout="5" />
        <action name="stop" timeout="5" />
        <action name="monitor" timeout="5" />
        <action name="validate-all" timeout="5" />
    </actions>
</resource-agent>
EOF
    return $OCF_SUCCESS
}

sample_validate(){
    return $OCF_SUCCESS
}

sample_start(){
    mkdir -p /var/run/sample-tcp-server
    echo "GREET=${OCF_RESKEY_greet}" > /var/run/sample-tcp-server/env
    systemctl start sample-tcp-server
    sleep 1
    return $OCF_SUCCESS
}

sample_stop(){
    systemctl stop sample-tcp-server
    return $OCF_SUCCESS
}

sample_monitor(){
    if [ -f ${DAEMON_PID_FILE} ];
    then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

sample_usage(){
    echo "Test Resource."
    return $OCF_SUCCESS
}

# Translate each action into the appropriate function call
case $__OCF_ACTION in
meta-data)      sample_meta_data
                exit $OCF_SUCCESS
                ;;
start)          sample_start;;
stop)           sample_stop;;
monitor)        sample_monitor;;
validate-all)   sample_validate;;
*)              sample_usage
                exit $OCF_ERR_UNIMPLEMENTED
                ;;
esac

meta-data and default values

In meta-data, parameters are defined in addition to the required items. A parameter is a value that is set when the resource is created, and the resource agent can read it from the variable OCF_RESKEY_<parameter name>. Here the greeting text is defined as a parameter named greet. A required parameter would have its required attribute set to 1, but here it is 0, so the parameter is optional. The agent therefore also defines a default value. (If the parameter is not specified, $OCF_RESKEY_greet holds the default value Hello!.)

#default value
OCF_RESKEY_greet_default="Hello!"
: ${OCF_RESKEY_greet=${OCF_RESKEY_greet_default}}

sample_meta_data() {
    cat << EOF
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="sample-tcp-server" version="0.1">
    <version>0.1</version>
    <longdesc lang="en">sample tcp server</longdesc>
    <shortdesc lang="en">sample tcp server</shortdesc>
    <parameters>
        <parameter name="greet" unique="0" required="0">
            <longdesc lang="en">greet message</longdesc>
            <shortdesc lang="en">greet message</shortdesc>
            <content type="string"/>
        </parameter>
    </parameters>
    <actions>
        <action name="meta-data" timeout="5" />
        <action name="start" timeout="5" />
        <action name="stop" timeout="5" />
        <action name="monitor" timeout="5" />
        <action name="validate-all" timeout="5" />
    </actions>
</resource-agent>
EOF
    return $OCF_SUCCESS
}
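The default-value idiom used above relies on the ${VAR=default} form of parameter expansion, which assigns only when the variable is unset. The sketch below shows its behavior in three cases. show_greet is a hypothetical helper, and each case runs in a subshell so that the cases do not affect one another.

```shell
#!/bin/sh
greet_default='Hello!'

show_greet() {
    # Same idiom as in the agent: assign the default only if the variable is unset.
    : "${OCF_RESKEY_greet=${greet_default}}"
    echo "greet='${OCF_RESKEY_greet}'"
}

( unset OCF_RESKEY_greet; show_greet )   # prints greet='Hello!'
( OCF_RESKEY_greet='Hi!'; show_greet )   # prints greet='Hi!'
( OCF_RESKEY_greet='';    show_greet )   # prints greet=''
```

Note the third case: a parameter that is set but empty keeps its empty value. If an empty value should also fall back to the default, the ${VAR:=default} form does that instead.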

start

This just launches the service with systemd. The service file is explained later. The OCF parameter has to reach the service as an environment variable when it starts, so ${OCF_RESKEY_greet} is written to an environment file.

sample_start(){
    mkdir -p /var/run/sample-tcp-server
    echo "GREET=${OCF_RESKEY_greet}" > /var/run/sample-tcp-server/env
    systemctl start sample-tcp-server
    sleep 1
    return $OCF_SUCCESS
}
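The start action above sleeps for one second and then reports success without checking the result. A more defensive variant polls the monitor logic until the resource is really up. This is only a sketch: start_and_wait, demo_start, demo_monitor, and DEMO_FLAG are hypothetical names, and the demo uses a flag file in place of the real systemd service.

```shell
#!/bin/sh
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# start_and_wait START_CMD MONITOR_CMD [TRIES]  (hypothetical helper)
start_and_wait() {
    start_fn=$1
    monitor_fn=$2
    tries=${3:-10}

    # Starting a resource that is already running is a successful no-op.
    "$monitor_fn" && return $OCF_SUCCESS

    # A failed launch is an error right away.
    "$start_fn" || return $OCF_ERR_GENERIC

    # Report success only once the monitor check confirms the resource is up.
    while [ "$tries" -gt 0 ]; do
        "$monitor_fn" && return $OCF_SUCCESS
        sleep 1
        tries=$((tries - 1))
    done
    return $OCF_ERR_GENERIC
}

# Demo stand-ins for "systemctl start ..." and the PID file check.
DEMO_FLAG="${TMPDIR:-/tmp}/start-sketch.$$"
demo_start()   { touch "$DEMO_FLAG"; }
demo_monitor() { [ -f "$DEMO_FLAG" ]; }

start_and_wait demo_start demo_monitor 3
echo "start returned $?"
rm -f "$DEMO_FLAG"
```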

stop

Nothing special here: it just stops the service.

sample_stop(){
    systemctl stop sample-tcp-server
    return $OCF_SUCCESS
}

monitor

This time it only checks whether the PID file exists.

sample_monitor(){
    if [ -f ${DAEMON_PID_FILE} ];
    then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

I did it this way for simplicity, but I don't think it is a good implementation. The daemon deletes the PID file when the process exits, but the file is not deleted if the process is killed with SIGKILL. Because the stop action is supposed to stop the resource by any means, SIGKILL may end up being sent, depending on how the service file is written. In that case the PID file remains even though the daemon has stopped, and the state reported by the monitor action may differ from the actual state. (Since systemd manages the service, the check should really have gone through systemd.)
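One way to make the check robust against a stale PID file is to test whether the recorded process still exists. The sketch below assumes that approach. pid_file_status is a hypothetical helper, and the return codes are defined inline to keep the snippet self-contained.

```shell
#!/bin/sh
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

# pid_file_status PID_FILE  (hypothetical helper)
pid_file_status() {
    pid_file=$1

    [ -f "$pid_file" ] || return $OCF_NOT_RUNNING

    pid=$(cat "$pid_file" 2>/dev/null)
    # Reject an empty or non-numeric file instead of passing it to kill.
    case "$pid" in
    ''|*[!0-9]*) return $OCF_NOT_RUNNING ;;
    esac

    # Signal 0 delivers nothing; it only tests that the process exists.
    if kill -0 "$pid" 2>/dev/null; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

# Demo: a PID file that points at this shell counts as running.
demo_pid_file="${TMPDIR:-/tmp}/monitor-sketch.$$"
echo "$$" > "$demo_pid_file"
pid_file_status "$demo_pid_file"
echo "monitor returned $?"
rm -f "$demo_pid_file"
```

This still cannot tell whether the PID has been reused by an unrelated process. Since systemd manages the service here, asking systemd directly (for example with systemctl is-active --quiet sample-tcp-server) is probably the simpler option.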

validate-all

This does nothing in particular.

sample_validate(){
    return $OCF_SUCCESS
}

service file

Nothing special here either: it starts the daemon after loading the environment variables from the environment file that the start action writes.

/lib/systemd/system/sample-tcp-server.service


[Unit]
Description=Sample TCP Server

[Service]
Type=simple
ExecStartPre=/bin/touch /var/run/sample-tcp-server/env
EnvironmentFile=/var/run/sample-tcp-server/env
ExecStart=/usr/bin/tcp-server.py
ExecStop=/usr/bin/pkill -F /var/run/sample-tcp-server/tcp-server.pid

[Install]
WantedBy=multi-user.target

Summary

OCF resource agents are surprisingly easy to create. As a first step, this article built a working resource agent with only the required actions, but there are a number of finer points to watch when writing a real one. The OCF Resource Agent Developer's Guide contains many hints and is worth consulting.

Digression

While studying for LPIC304, I didn't really understand what Pacemaker was doing. As I looked into various things, I realized that I could write my own resource agent, and that is how this article came about.
