svaroqui edited this page Feb 4, 2013 · 34 revisions

The heartbeat is part of the state machine.

The state machine collects status changes and propagates state-change events in the local network and in the cloud.

One can view the last global heartbeat with the command:

{"level":"instances","command":{"action":"heartbeat","group":"all","type":"all"}}
or
bin/clmgr instances heartbeat

A list of actions is put in a queue and executed step by step.

Starting a database service requires:

  • A launched instance at the given IP
  • A started instance at the given IP
  • An SSH service answering on the instance
  • A bootstrap of the binaries on the instance
  • A running ScrambleDB on the instance
  • An install of the database files
  • Replication set up and running
  • Reconfiguration of the proxies
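
The queued steps above can be pictured as a first-in, first-out queue that stops at the first failing step. This is only a minimal sketch: the step names and the `run_steps` helper are hypothetical and are not taken from the cluster code.

```python
from collections import deque

# Hypothetical step names mirroring the prerequisites listed above.
STEPS = [
    "launch_instance",
    "start_instance",
    "wait_ssh",
    "bootstrap_binaries",
    "start_scrambledb",
    "install_database_files",
    "setup_replication",
    "reconfigure_proxies",
]

def run_steps(steps, execute):
    """Run steps in order and stop at the first one that fails.

    `execute` is a caller-supplied function returning True on success.
    Returns (completed_steps, failed_step_or_None).
    """
    queue = deque(steps)
    done = []
    while queue:
        step = queue.popleft()
        if not execute(step):
            return done, step
        done.append(step)
    return done, None

# Example: pretend the SSH service is not answering yet.
done, failed = run_steps(STEPS, lambda step: step != "wait_ssh")
print(done, failed)
```

Stopping at the first failure keeps later steps, such as replication setup, from running against an instance that is not ready.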

Within the local network, status is computed from the information in the configuration file:

For instances

  • ssh ping
  • gearman ping

For services

  • connect to and test the direct IP
  • connect to and test the VIP

Until all tests pass, the instance status is not reported as running.
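
The "all tests must pass" rule can be sketched as below. The check names and the fallback state "pending" are assumptions made for illustration; the state names actually reported by the cluster may differ.

```python
def instance_state(checks):
    """Return "running" only when every check passed.

    `checks` maps a check name (for example "ssh_ping") to True or False.
    Anything else is reported as "pending" in this sketch.
    """
    if checks and all(checks.values()):
        return "running"
    return "pending"

all_ok = instance_state({"ssh_ping": True, "gearman_ping": True})
one_failed = instance_state({"ssh_ping": True, "gearman_ping": False})
print(all_ok, one_failed)
```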

The same status is extended with the Amazon or vCloud API on your private office instance. This enables translating an IP into an EC2 instance name and reporting the provider's external state of an instance.

Warning: calls to the API are invoiced. Such API status could later be stored in a cache.
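
One way such a cache could look is a small time-to-live wrapper around whatever function queries the provider. This is a sketch only: `StatusCache`, its parameters and the 60 second lifetime are hypothetical, and the fetch function stands in for the real API call.

```python
import time

class StatusCache:
    """Cache the result of an expensive status call for `ttl_seconds`."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch      # function that performs the billed API call
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the cache can be tested
        self._value = None
        self._expires = None

    def get(self):
        now = self._clock()
        # Refresh only on first use or once the cached value has expired.
        if self._expires is None or now >= self._expires:
            self._value = self._fetch()
            self._expires = now + self._ttl
        return self._value

# Example with a fake provider call and a fake clock.
calls = []
now = [0.0]

def fake_fetch():
    calls.append(1)
    return {"i-987388e6": "running"}

cache = StatusCache(fake_fetch, ttl_seconds=60.0, clock=lambda: now[0])
cache.get()
cache.get()          # served from the cache, no new call
now[0] = 61.0
cache.get()          # expired, so the provider is queried again
print(len(calls))
```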

{ "command" : {

"action" : "ping",
"group" : "all",
"type" : "db"
},
"host" : { "interfaces" : [ { "eth0" : {
"IP" : "10.0.0.102",
"STATE" : "UP"
},
"lo" : {
"IP" : "127.0.0.1",
"STATE" : "UP"
}
} ],
"ram" : "0"
},
"instances_status" : { "instances" : [ { "i-830dcde4" : {
"id" : "i-830dcde4",
"ip" : null,
"state" : "stopped"
} },
{ "i-7fba6203" : {
"id" : "i-7fba6203",
"ip" : "10.0.0.209",
"state" : "stopped"
} },
{ "i-987388e6" : {
"id" : "i-987388e6",
"ip" : "10.0.0.102",
"state" : "running"
} },
{ "i-4436ed3a" : {
"id" : "i-4436ed3a",
"ip" : "10.0.0.48",
"state" : "running"
} },
{ "i-4236ed3c" : {
"id" : "i-4236ed3c",
"ip" : "10.0.0.47",
"state" : "running"
} }
],
"return" : { "cloud" : {
"driver" : "EC2",
"elastic_ip" : "107.21.41.133",
"ex_vdc" : "na",
"host" : "na",
"instance_type" : "t1.micro",
"key" : "SDS145000",
"password" : "xxx",
"public_key" : "SDS145000.pem",
"region" : "us-east",
"security_groups" : "secure-group-vpc",
"status" : "master",
"subnet" : "subnet-49326222",
"template" : "ami-cb23a0a2",
"user" : "AKIAJR7YEOZPXASJCYDQ",
"version" : "1.5",
"vpc" : "vpc-7032621b",
"zone" : "us-east-1b"
}, "command" : {
"action" : "status",
"group" : "all",
"type" : "all"
}
}
},
"level" : "services",
"services_status" : { "services" : [ { "node10" : {
"code" : "000000",
"ip" : "10.0.0.102",
"mode" : "mariadb",
"name" : "node10",
"state" : "running",
"status" : "master",
"time" : "Thu Dec 20 16:10:13 2012"
} },
{ "node11" : { "code" : "ER0003",
"ip" : "10.0.0.47",
"mode" : "mariadb",
"name" : "node11",
"state" : "Database communication failure",
"status" : "slave",
"time" : "Thu Dec 20 16:10:13 2012"
} },
{ "node12" : {
"code" : "000000",
"ip" : "10.0.0.48",
"mode" : "mariadb",
"name" : "node12",
"state" : "running",
"status" : "slave",
"time" : "Thu Dec 20 16:10:14 2012"
} },
{ "nosql1" : {
"code" : "000000",
"ip" : "10.0.0.102",
"mode" : "memcache",
"name" : "nosql1",
"state" : "running",
"status" : "master",
"time" : "Thu Dec 20 16:10:14 2012"
} },
{ "nosql2" : {
"code" : "000000",
"ip" : "10.0.0.48",
"mode" : "memcache",
"name" : "nosql2",
"state" : "running",
"status" : "slave",
"time" : "Thu Dec 20 16:10:14 2012"
} },
{ "proxy1" : { "code" : "000000",
"ip" : "10.0.0.102",
"mode" : "mysql-proxy",
"name" : "proxy1",
"state" : "running",
"status" : "na",
"time" : "Thu Dec 20 16:10:14 2012"
} },
{ "proxy2" : {
"code" : "ER0003",
"ip" : "10.0.0.47",
"mode" : "mysql-proxy",
"name" : "proxy2",
"state" : "Database communication failure",
"status" : "na",
"time" : "Thu Dec 20 16:10:14 2012"
} },
{ "proxy3" : {
"code" : "ER0003",
"ip" : "10.0.0.48",
"mode" : "mysql-proxy",
"name" : "proxy3",
"state" : "Database communication failure",
"status" : "na",
"time" : "Thu Dec 20 16:10:14 2012"
} },
{ "lb1" : {
"code" : "ER0003",
"ip" : "10.0.0.102",
"mode" : "keepalived",
"name" : "lb1",
"state" : "Database communication failure",
"status" : "master",
"time" : "Thu Dec 20 16:10:14 2012"
} },
{ "lb2" : {
"code" : "ER0003",
"ip" : "10.0.0.47",
"mode" : "keepalived",
"name" : "lb2",
"state" : "Database communication failure",
"status" : "slave",
"time" : "Thu Dec 20 16:10:14 2012"
} },
{ "lb3" : {
"code" : "000000",
"ip" : "10.0.0.102",
"mode" : "haproxy",
"name" : "lb3",
"state" : "running",
"status" : "master",
"time" : "Thu Dec 20 16:10:14 2012"
} },
{ "lb4" : {
"code" : "000000",
"ip" : "10.0.0.47",
"mode" : "haproxy",
"name" : "lb4",
"state" : "On",
"status" : "master",
"time" : "Thu Dec 20 16:10:14 2012"
} }
] }
}
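
A status document like the one above can be scanned for services that are not healthy. The sketch below assumes, based only on the sample, that the code "000000" means no error; the `failing_services` helper is hypothetical.

```python
import json

def failing_services(status):
    """Return (name, ip, state) for every service whose code is not "000000"."""
    failing = []
    for entry in status["services_status"]["services"]:
        # Each entry is a one-key object such as {"node11": {...}}.
        for name, info in entry.items():
            if info["code"] != "000000":
                failing.append((name, info["ip"], info["state"]))
    return failing

# Two services copied from the sample output above.
sample = json.loads("""
{ "services_status" : { "services" : [
  { "node10" : { "code" : "000000", "ip" : "10.0.0.102",
                 "state" : "running", "status" : "master" } },
  { "node11" : { "code" : "ER0003", "ip" : "10.0.0.47",
                 "state" : "Database communication failure", "status" : "slave" } }
] } }
""")

result = failing_services(sample)
print(result)
```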

Delayed actions

Delayed actions are placed into memcache when the status of an IP is not running and we need to take action on a service running on that instance.

One can view the current actions with the command:

{"level":"instances","command":{"action":"actions","group":"all","type":"all"}}
or
bin/clmgr instances actions

{"actions":[{
"event_ip":"10.0.0.102",
"event_type":"instances",
"do_action":"bootstrap_ncc",
"do_group":"node10",
"do_level":"services",
"event_state":"running"
},{
"event_ip":"10.0.0.102",
"event_type":"instances",
"do_action":"start",
"do_group":"node10",
"do_level":"services",
"event_state":"running"
}]}
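
Each queued action pairs a trigger (event_ip, event_type, event_state) with something to do (do_action, do_group, do_level). A minimal sketch of how triggers might be matched against an incoming status event follows. The `due_actions` helper is hypothetical, and the event is assumed to carry the "ip", "type" and "state" fields used in the cluster doctor messages documented on this page.

```python
def due_actions(actions, event):
    """Select the delayed actions whose trigger matches a status event."""
    return [
        action for action in actions
        if action["event_ip"] == event["ip"]
        and action["event_type"] == event["type"]
        and action["event_state"] == event["state"]
    ]

actions = [
    {"event_ip": "10.0.0.102", "event_type": "instances", "event_state": "running",
     "do_action": "bootstrap_ncc", "do_group": "node10", "do_level": "services"},
    {"event_ip": "10.0.0.102", "event_type": "instances", "event_state": "running",
     "do_action": "start", "do_group": "node10", "do_level": "services"},
]

# The instance at 10.0.0.102 has just been reported as running.
event = {"ip": "10.0.0.102", "type": "instances", "state": "running"}

for action in due_actions(actions, event):
    print(action["do_level"], action["do_action"], action["do_group"])
```

Keeping the queue order in the result preserves the step-by-step execution: here the bootstrap is returned before the start.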

The type of status to monitor in order to trigger an action in the local network is defined with the parameters:

"event_ip":"10.0.0.102",
"event_type":"instances",
"event_state":"running"

Special events are placed by the cluster to request a cloud action:

"event_type":"cloud",
"do_group":"X.X.X.X",

In this case do_group stores the IP on which to run the cloud command, as in ./clmgr instances start X.X.X.X

The action to perform:

"do_action":"start",
"do_group":"node10",
"do_level":"services",

Status differences are computed on heartbeat and sent to the cluster doctor worker scripts.
Status differences are computed on your private office and sent to the cloud doctor worker scripts.
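
Computing status differences can be sketched as a comparison of two snapshots keyed by name, emitting one event per changed state. The `status_events` helper and the snapshot layout are assumptions; the event field names follow the cluster doctor message format documented on this page. Entries with no previous snapshot are skipped in this sketch.

```python
def status_events(previous, current, event_type):
    """Return one event per entry whose state changed between two snapshots."""
    events = []
    for name, now in current.items():
        before = previous.get(name)
        if before is None or before["state"] == now["state"]:
            continue  # unknown before, or unchanged: nothing to report
        events.append({
            "ip": now["ip"],
            "name": name,
            "type": event_type,
            "previous_state": before["state"],
            "state": now["state"],
            "previous_code": before["code"],
            "code": now["code"],
        })
    return events

previous = {"i-987388e6": {"ip": "10.0.0.102", "state": "pending", "code": 0}}
current = {"i-987388e6": {"ip": "10.0.0.102", "state": "running", "code": "0"}}

events = status_events(previous, current, "instances")
print(events)
```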

The local network messages sent to the cluster doctor worker are defined like this:

{"events":[{
"ip":"10.0.0.102",
"name":"i-987388e6",
"type":"instances",
"previous_state":"pending",
"state":"running",
"previous_code":0,
"code":"0"
},{
"ip":"10.0.0.49",
"name":"i-eb2eb69a",
"type":"instances",
"state":"pending",
"previous_state":"stopped",
"previous_code":0,
"code":"0"
}]}
