api,agent,server,engine-schema: scalability improvements
The following changes and improvements have been added:

- Improvements in handling of PingRoutingCommand

    1. Added global config - `vm.sync.power.state.transitioning`, default value: true, to control syncing of power states for transitioning VMs. This can be set to false to prevent computation of transitioning state VMs.
    2. Improved VirtualMachinePowerStateSync to allow power state sync for host VMs in a batch
    3. Optimized scanning stalled VMs

- Added option to set worker threads for capacity calculation using config - `capacity.calculate.workers`

- Added caching framework based on Caffeine in-memory caching library, https://github.com/ben-manes/caffeine

- Added caching for account/use role API access; the expire-after-write period can be configured using config - `dynamic.apichecker.cache.period`. If set to zero, caching is disabled. Default is 0.

- Added caching for account/use role API access with expiration after write set to 60 seconds.

- Added caching for some recurring DB retrievals

    1. CapacityManager - listing service offerings - beneficial in host capacity calculation
    2. LibvirtServerDiscoverer existing host for the cluster - beneficial for host joins
    3. DownloadListener - hypervisors for zone - beneficial for host joins
    4. VirtualMachineManagerImpl - VMs in progress - beneficial for processing stalled VMs during PingRoutingCommands

- Optimized MS list retrieval for agent connect

- Optimized finding ready systemvm template for zone

- Database retrieval optimisations - fixed and refactored cases where only IDs or counts are needed, mainly for hosts and other infra entities. Similar changes were made for VMs and other host-related entities used in background tasks

- Changes in agent-agentmanager connection with NIO client-server classes

    1. Optimized the use of the executor service
    2. Refactored Agent class to better handle connections.
    3. Do SSL handshakes within worker threads
    4. Added global configs to control the behaviour depending on the infra. SSL handshake could be a bottleneck during agent connections. Configs - `agent.ssl.handshake.min.workers` and `agent.ssl.handshake.max.workers` can be used to control the number of new connections the management server handles at a time. `agent.ssl.handshake.timeout` can be used to set the number of seconds after which the SSL handshake times out at the MS end.
    5. On the agent side, backoff and SSL handshake timeout can be controlled by the agent properties `backoff.seconds` and `ssl.handshake.timeout`.

- Improvements in StatsCollection - minimize DB retrievals.

- Improvements in DeploymentPlanner allow for the retrieval of only desired host fields and fewer retrievals.

- Improvements in hosts connection for a storage pool. Added config - `storage.pool.host.connect.workers` to control the number of worker threads that can be used to connect hosts to a storage pool. Worker thread approach is followed currently only for NFS and ScaleIO pools.

- Minor improvements in resource limit calculations wrt DB retrievals
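
The expire-after-write caching described in the list above, where a period of zero disables caching, can be sketched roughly as follows. This is a stdlib-only illustration, not the Caffeine API and not the code added by this commit; `ExpiringCache`, its constructor parameters and the injectable clock are hypothetical names chosen for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical stand-in for an expire-after-write cache (not the Caffeine API).
class ExpiringCache<K, V> {
    private static final class Entry<T> {
        final T value;
        final long writtenAtNanos;

        Entry(T value, long writtenAtNanos) {
            this.value = value;
            this.writtenAtNanos = writtenAtNanos;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long periodNanos;
    private final LongSupplier nanoClock;

    // periodSeconds plays the role of a cache-period setting; 0 means "no caching".
    ExpiringCache(long periodSeconds, LongSupplier nanoClock) {
        this.periodNanos = TimeUnit.SECONDS.toNanos(periodSeconds);
        this.nanoClock = nanoClock;
    }

    V get(K key, Function<K, V> loader) {
        if (periodNanos <= 0) {
            // Caching disabled: always go to the loader (e.g. the database).
            return loader.apply(key);
        }
        long now = nanoClock.getAsLong();
        Entry<V> entry = entries.get(key);
        // Reload when the entry is missing or was written at least one period ago.
        // Two threads may both reload the same key; this sketch accepts that.
        if (entry == null || now - entry.writtenAtNanos >= periodNanos) {
            entry = new Entry<>(loader.apply(key), now);
            entries.put(key, entry);
        }
        return entry.value;
    }

    public static void main(String[] args) {
        ExpiringCache<String, String> cache = new ExpiringCache<>(60, System::nanoTime);
        System.out.println(cache.get("role-1", id -> "permissions for " + id));
        // Within the 60-second period the second loader is not invoked.
        System.out.println(cache.get("role-1", id -> "reloaded permissions for " + id));
    }
}
```

The clock is passed in only so that expiry can be exercised without sleeping; a production cache library would normally also bound the number of entries.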

Signed-off-by: Abhishek Kumar <[email protected]>

Co-authored-by: Abhishek Kumar <[email protected]>
Co-authored-by: Rohit Yadav <[email protected]>
shwstppr and rohityadavcloud committed Oct 23, 2024
1 parent 019f2c6 commit e3cf7fd
Showing 128 changed files with 3,072 additions and 2,041 deletions.
2 changes: 1 addition & 1 deletion .python-version
Original file line number Diff line number Diff line change
@@ -1 +1 @@
3.6
3.10
6 changes: 6 additions & 0 deletions agent/conf/agent.properties
@@ -433,3 +433,9 @@ iscsi.session.cleanup.enabled=false

# Implicit host tags managed by agent.properties
# host.tags=

# Timeout(in seconds) for SSL handshake when agent connects to server
#ssl.handshake.timeout=

# Wait(in seconds) during agent reconnections
#backoff.seconds=
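
Both new properties are commented out by default, so a reader of `agent.properties` has to cope with missing or empty values. The following is a minimal sketch of such a lookup, not the agent's actual parsing code; `AgentTimeouts`, `intProperty` and the fallback values 5 and 30 are assumptions made for illustration only.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper; the fallback values below are illustrative assumptions.
class AgentTimeouts {
    static final int ASSUMED_BACKOFF_SECONDS = 5;
    static final int ASSUMED_HANDSHAKE_TIMEOUT_SECONDS = 30;

    // Returns the integer value of the property, or the fallback when the
    // property is absent, empty, or not a number.
    static int intProperty(Properties props, String key, int fallback) {
        String raw = props.getProperty(key);
        if (raw == null || raw.trim().isEmpty()) {
            return fallback;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    // Parses properties text; lines starting with '#' are treated as comments.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = parse("backoff.seconds=10\n#ssl.handshake.timeout=\n");
        System.out.println(intProperty(props, "backoff.seconds", ASSUMED_BACKOFF_SECONDS));
        System.out.println(intProperty(props, "ssl.handshake.timeout", ASSUMED_HANDSHAKE_TIMEOUT_SECONDS));
    }
}
```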
772 changes: 401 additions & 371 deletions agent/src/main/java/com/cloud/agent/Agent.java

Large diffs are not rendered by default.

57 changes: 33 additions & 24 deletions agent/src/main/java/com/cloud/agent/AgentShell.java
@@ -16,29 +16,6 @@
// under the License.
package com.cloud.agent;

import com.cloud.agent.Agent.ExitStatus;
import com.cloud.agent.dao.StorageComponent;
import com.cloud.agent.dao.impl.PropertiesStorage;
import com.cloud.agent.properties.AgentProperties;
import com.cloud.agent.properties.AgentPropertiesFileHandler;
import com.cloud.resource.ServerResource;
import com.cloud.utils.LogUtils;
import com.cloud.utils.ProcessUtil;
import com.cloud.utils.PropertiesUtil;
import com.cloud.utils.backoff.BackoffAlgorithm;
import com.cloud.utils.backoff.impl.ConstantTimeBackoff;
import com.cloud.utils.exception.CloudRuntimeException;
import org.apache.commons.daemon.Daemon;
import org.apache.commons.daemon.DaemonContext;
import org.apache.commons.daemon.DaemonInitException;
import org.apache.commons.lang.math.NumberUtils;
import org.apache.commons.lang3.BooleanUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.logging.log4j.Logger;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.core.config.Configurator;

import javax.naming.ConfigurationException;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
@@ -53,6 +30,31 @@
import java.util.Properties;
import java.util.UUID;

import javax.naming.ConfigurationException;

import org.apache.commons.daemon.Daemon;
import org.apache.commons.daemon.DaemonContext;
import org.apache.commons.daemon.DaemonInitException;
import org.apache.commons.lang.math.NumberUtils;
import org.apache.commons.lang3.BooleanUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.apache.logging.log4j.core.config.Configurator;

import com.cloud.agent.Agent.ExitStatus;
import com.cloud.agent.dao.StorageComponent;
import com.cloud.agent.dao.impl.PropertiesStorage;
import com.cloud.agent.properties.AgentProperties;
import com.cloud.agent.properties.AgentPropertiesFileHandler;
import com.cloud.resource.ServerResource;
import com.cloud.utils.LogUtils;
import com.cloud.utils.ProcessUtil;
import com.cloud.utils.PropertiesUtil;
import com.cloud.utils.backoff.BackoffAlgorithm;
import com.cloud.utils.backoff.impl.ConstantTimeBackoff;
import com.cloud.utils.exception.CloudRuntimeException;

public class AgentShell implements IAgentShell, Daemon {
protected static Logger LOGGER = LogManager.getLogger(AgentShell.class);

@@ -406,7 +408,9 @@ public void init(String[] args) throws ConfigurationException {

LOGGER.info("Defaulting to the constant time backoff algorithm");
_backoff = new ConstantTimeBackoff();
_backoff.configure("ConstantTimeBackoff", new HashMap<String, Object>());
Map<String, Object> map = new HashMap<>();
map.put("seconds", _properties.getProperty("backoff.seconds"));
_backoff.configure("ConstantTimeBackoff", map);

}

private void launchAgent() throws ConfigurationException {
@@ -455,6 +459,11 @@ public void launchNewAgent(ServerResource resource) throws ConfigurationExceptio
agent.start();
}

@Override
public Integer getSslHandshakeTimeout() {
return AgentPropertiesFileHandler.getPropertyValue(AgentProperties.SSL_HANDSHAKE_TIMEOUT);
}

public synchronized int getNextAgentId() {
return _nextAgentId++;
}
2 changes: 2 additions & 0 deletions agent/src/main/java/com/cloud/agent/IAgentShell.java
@@ -70,4 +70,6 @@ public interface IAgentShell {
String getConnectedHost();

void launchNewAgent(ServerResource resource) throws ConfigurationException;

Integer getSslHandshakeTimeout();
}
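
The commit message says SSL handshakes now run in worker threads and are bounded by a timeout in seconds. A rough sketch of that pattern using plain `java.util.concurrent` is shown below; it is not the NIO code from this commit. `HandshakeRunner` and `runWithTimeout` are hypothetical names, and the handshake is modelled as an arbitrary `Callable<Boolean>`.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration of running a handshake on a worker with a deadline.
class HandshakeRunner {
    // Runs the handshake on a worker thread and reports success only if it
    // completes with TRUE within timeoutSeconds.
    static boolean runWithTimeout(ExecutorService workers, Callable<Boolean> handshake, long timeoutSeconds) {
        Future<Boolean> result = workers.submit(handshake);
        try {
            return Boolean.TRUE.equals(result.get(timeoutSeconds, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            result.cancel(true); // interrupt the worker that is still handshaking
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            result.cancel(true);
            return false;
        } catch (ExecutionException e) {
            return false; // the handshake itself failed
        }
    }

    public static void main(String[] args) {
        // The pool size stands in for the configured number of handshake workers.
        ExecutorService workers = Executors.newFixedThreadPool(2);
        try {
            System.out.println(runWithTimeout(workers, () -> true, 1));
            System.out.println(runWithTimeout(workers, () -> {
                Thread.sleep(5_000); // simulates a peer that never completes the handshake
                return true;
            }, 1));
        } finally {
            workers.shutdownNow();
        }
    }
}
```

Keeping the handshake off the accepting thread means one slow peer occupies a single worker for at most the timeout instead of blocking all new connections.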
Original file line number Diff line number Diff line change
@@ -810,6 +810,13 @@ public Property<Integer> getWorkers() {
*/
public static final Property<String> HOST_TAGS = new Property<>("host.tags", null, String.class);

/**
* Timeout for SSL handshake in seconds
* Data type: Integer.<br>
* Default value: <code>null</code>
*/
public static final Property<Integer> SSL_HANDSHAKE_TIMEOUT = new Property<>("ssl.handshake.timeout", null, Integer.class);

public static class Property <T>{
private String name;
private T defaultValue;
5 changes: 5 additions & 0 deletions api/src/main/java/org/apache/cloudstack/acl/RoleService.java
@@ -30,6 +30,11 @@ public interface RoleService {
ConfigKey<Boolean> EnableDynamicApiChecker = new ConfigKey<>("Advanced", Boolean.class, "dynamic.apichecker.enabled", "false",
"If set to true, this enables the dynamic role-based api access checker and disables the default static role-based api access checker.", true);

ConfigKey<Integer> DynamicApiCheckerCachePeriod = new ConfigKey<>("Advanced", Integer.class,
"dynamic.apichecker.cache.period", "0",
"Defines the expiration time in seconds for the Dynamic API Checker cache, determining how long cached data is retained before being refreshed. If set to zero then caching will be disabled",
false);

boolean isEnabled();

/**
api/src/main/java/org/apache/cloudstack/api/command/admin/domain/ListDomainsCmd.java
@@ -100,7 +100,7 @@ public EnumSet<DomainDetails> getDetails() throws InvalidParameterValueException
dv = EnumSet.of(DomainDetails.all);
} else {
try {
ArrayList<DomainDetails> dc = new ArrayList<DomainDetails>();
ArrayList<DomainDetails> dc = new ArrayList<>();

for (String detail : viewDetails) {
dc.add(DomainDetails.valueOf(detail));
}
@@ -142,7 +142,10 @@ protected void updateDomainResponse(List<DomainResponse> response) {
if (CollectionUtils.isEmpty(response)) {
return;
}
_resourceLimitService.updateTaggedResourceLimitsAndCountsForDomains(response, getTag());
EnumSet<DomainDetails> details = getDetails();
if (details.contains(DomainDetails.all) || details.contains(DomainDetails.resource)) {
_resourceLimitService.updateTaggedResourceLimitsAndCountsForDomains(response, getTag());
}
if (!getShowIcon()) {
return;
}
Original file line number Diff line number Diff line change
@@ -149,7 +149,10 @@ protected void updateAccountResponse(List<AccountResponse> response) {
if (CollectionUtils.isEmpty(response)) {
return;
}
_resourceLimitService.updateTaggedResourceLimitsAndCountsForAccounts(response, getTag());
EnumSet<DomainDetails> details = getDetails();
if (details.contains(DomainDetails.all) || details.contains(DomainDetails.resource)) {
_resourceLimitService.updateTaggedResourceLimitsAndCountsForAccounts(response, getTag());
}
if (!getShowIcon()) {
return;
}
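
The two hunks above only compute tagged resource limits and counts when the requested details include `all` or `resource`. The guard can be illustrated in isolation as follows; the enum here is a hypothetical stand-in (only `all` and `resource` appear in the diff, `min` is an assumed extra value) and `needsResourceCounts` is a name invented for this sketch.

```java
import java.util.EnumSet;

// Hypothetical stand-in for the details guard shown in the diff.
class DomainDetailsGate {
    // Only `all` and `resource` are taken from the diff; `min` is an assumption.
    enum DomainDetails { all, resource, min }

    // True when the caller asked for details that require resource limits and counts.
    static boolean needsResourceCounts(EnumSet<DomainDetails> details) {
        return details.contains(DomainDetails.all) || details.contains(DomainDetails.resource);
    }

    public static void main(String[] args) {
        for (DomainDetails d : DomainDetails.values()) {
            System.out.println(d + " -> " + needsResourceCounts(EnumSet.of(d)));
        }
    }
}
```

Callers that request a reduced set of details thereby skip the resource-limit lookups entirely.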
Original file line number Diff line number Diff line change
@@ -39,7 +39,7 @@ public interface OutOfBandManagementService {
long getId();
boolean isOutOfBandManagementEnabled(Host host);
void submitBackgroundPowerSyncTask(Host host);
boolean transitionPowerStateToDisabled(List<? extends Host> hosts);
boolean transitionPowerStateToDisabled(List<Long> hostIds);

OutOfBandManagementResponse enableOutOfBandManagement(DataCenter zone);
OutOfBandManagementResponse enableOutOfBandManagement(Cluster cluster);
Original file line number Diff line number Diff line change
@@ -29,6 +29,7 @@ public class CheckNetworkCommand extends Command {

public CheckNetworkCommand(List<PhysicalNetworkSetupInfo> networkInfoList) {
this.networkInfoList = networkInfoList;
setWait(120);
}

public List<PhysicalNetworkSetupInfo> getPhysicalNetworkInfoList() {
8 changes: 8 additions & 0 deletions core/src/main/java/com/cloud/resource/ServerResource.java
@@ -78,4 +78,12 @@ public interface ServerResource extends Manager {

void setAgentControl(IAgentControl agentControl);

default boolean isExitOnFailures() {
return true;
}

default boolean isAppendAgentNameToLogs() {
return false;
}

}
22 changes: 12 additions & 10 deletions engine/api/src/main/java/com/cloud/vm/VirtualMachineManager.java
@@ -22,7 +22,6 @@
import java.util.List;
import java.util.Map;

import com.cloud.exception.ResourceAllocationException;
import org.apache.cloudstack.context.CallContext;
import org.apache.cloudstack.framework.config.ConfigKey;

@@ -38,6 +37,7 @@
import com.cloud.exception.InsufficientCapacityException;
import com.cloud.exception.InsufficientServerCapacityException;
import com.cloud.exception.OperationTimedoutException;
import com.cloud.exception.ResourceAllocationException;
import com.cloud.exception.ResourceUnavailableException;
import com.cloud.host.Host;
import com.cloud.hypervisor.Hypervisor.HypervisorType;
@@ -101,6 +101,10 @@ public interface VirtualMachineManager extends Manager {
"refer documentation",
true, ConfigKey.Scope.Zone);

ConfigKey<Boolean> VmSyncPowerStateTransitioning = new ConfigKey<>("Advanced", Boolean.class, "vm.sync.power.state.transitioning", "true",
"Whether to sync power states of the transitioning and stalled VMs while processing VM power reports.", false);


interface Topics {
String VM_POWER_STATE = "vm.powerstate";
}
@@ -286,24 +290,22 @@ static String getHypervisorHostname(String name) {

/**
* Obtains statistics for a list of VMs; CPU and network utilization
* @param hostId ID of the host
* @param hostName name of the host
* @param host host
* @param vmIds list of VM IDs
* @return map of VM ID and stats entry for the VM
*/
HashMap<Long, ? extends VmStats> getVirtualMachineStatistics(long hostId, String hostName, List<Long> vmIds);
HashMap<Long, ? extends VmStats> getVirtualMachineStatistics(Host host, List<Long> vmIds);
/**
* Obtains statistics for a list of VMs; CPU and network utilization
* @param hostId ID of the host
* @param hostName name of the host
* @param vmMap map of VM IDs and the corresponding VirtualMachine object
* @param host host
* @param vmMap map of VM instanceName and its ID
* @return map of VM ID and stats entry for the VM
*/
HashMap<Long, ? extends VmStats> getVirtualMachineStatistics(long hostId, String hostName, Map<Long, ? extends VirtualMachine> vmMap);
HashMap<Long, ? extends VmStats> getVirtualMachineStatistics(Host host, Map<String, Long> vmMap);

HashMap<Long, List<? extends VmDiskStats>> getVmDiskStatistics(long hostId, String hostName, Map<Long, ? extends VirtualMachine> vmMap);
HashMap<Long, List<? extends VmDiskStats>> getVmDiskStatistics(Host host, Map<String, Long> vmInstanceNameIdMap);

HashMap<Long, List<? extends VmNetworkStats>> getVmNetworkStatistics(long hostId, String hostName, Map<Long, ? extends VirtualMachine> vmMap);
HashMap<Long, List<? extends VmNetworkStats>> getVmNetworkStatistics(Host host, Map<String, Long> vmInstanceNameIdMap);

Map<Long, Boolean> getDiskOfferingSuitabilityForVm(long vmId, List<Long> diskOfferingIds);

Original file line number Diff line number Diff line change
@@ -16,14 +16,11 @@
// under the License.
package com.cloud.capacity;

import java.util.Map;

import org.apache.cloudstack.framework.config.ConfigKey;
import org.apache.cloudstack.storage.datastore.db.StoragePoolVO;

import com.cloud.host.Host;
import com.cloud.offering.ServiceOffering;
import com.cloud.service.ServiceOfferingVO;
import com.cloud.storage.VMTemplateVO;
import com.cloud.utils.Pair;
import com.cloud.vm.VirtualMachine;
@@ -118,6 +115,10 @@ public interface CapacityManager {
"Percentage (as a value between 0 and 1) of secondary storage capacity threshold.",
true);

ConfigKey<Integer> CapacityCalculateWorkers = new ConfigKey<>(ConfigKey.CATEGORY_ADVANCED, Integer.class,
"capacity.calculate.workers", "1",
"Number of worker threads to be used for capacities calculation", true);

public boolean releaseVmCapacity(VirtualMachine vm, boolean moveFromReserved, boolean moveToReservered, Long hostId);

void allocateVmCapacity(VirtualMachine vm, boolean fromLastHost);
@@ -133,8 +134,6 @@ boolean checkIfHostHasCapacity(long hostId, Integer cpu, long ram, boolean check

void updateCapacityForHost(Host host);

void updateCapacityForHost(Host host, Map<Long, ServiceOfferingVO> offeringsMap);

/**
* @param pool storage pool
* @param templateForVmCreation template that will be used for vm creation
@@ -151,12 +150,12 @@ boolean checkIfHostHasCapacity(long hostId, Integer cpu, long ram, boolean check

/**
* Check if specified host has capability to support cpu cores and speed freq
* @param hostId the host to be checked
* @param host the host to be checked
* @param cpuNum cpu number to check
* @param cpuSpeed cpu Speed to check
* @return true if the count of host's running VMs >= hypervisor limit
*/
boolean checkIfHostHasCpuCapability(long hostId, Integer cpuNum, Integer cpuSpeed);
boolean checkIfHostHasCpuCapability(Host host, Integer cpuNum, Integer cpuSpeed);

/**
* Check if cluster will cross threshold if the cpu/memory requested are accommodated
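
`capacity.calculate.workers` sets how many worker threads are used for capacity calculation (default 1). A minimal sketch of fanning per-host work out over such a pool is given below. It is not the CapacityManager implementation; `CapacityWorkers`, `calculateAll` and the per-host function are hypothetical, with the `workers` argument standing in for the configured value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.LongUnaryOperator;

// Hypothetical illustration of a configurable worker pool for per-host calculations.
class CapacityWorkers {
    // Applies calc to every host ID using `workers` threads and returns the
    // results in the same order as the input list.
    static List<Long> calculateAll(List<Long> hostIds, int workers, LongUnaryOperator calc) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, workers));
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (Long hostId : hostIds) {
                Callable<Long> task = () -> calc.applyAsLong(hostId);
                futures.add(pool.submit(task));
            }
            List<Long> results = new ArrayList<>();
            for (Future<Long> future : futures) {
                results.add(future.get()); // waits for each host's calculation
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while calculating capacities", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Capacity calculation failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // With workers = 1 the hosts are processed sequentially, matching the default.
        System.out.println(calculateAll(Arrays.asList(1L, 2L, 3L), 1, hostId -> hostId * 10));
    }
}
```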
Original file line number Diff line number Diff line change
@@ -138,13 +138,13 @@ public interface ResourceManager extends ResourceService, Configurable {

public List<HostVO> listAllHostsInOneZoneNotInClusterByHypervisors(List<HypervisorType> types, long dcId, long clusterId);

public List<HypervisorType> listAvailHypervisorInZone(Long hostId, Long zoneId);
public List<HypervisorType> listAvailHypervisorInZone(Long zoneId);

public HostVO findHostByGuid(String guid);

public HostVO findHostByName(String name);

HostStats getHostStatistics(long hostId);
HostStats getHostStatistics(Host host);

Long getGuestOSCategoryId(long hostId);

Original file line number Diff line number Diff line change
@@ -22,6 +22,7 @@

import org.apache.cloudstack.engine.subsystem.api.storage.DataStore;
import org.apache.cloudstack.engine.subsystem.api.storage.HypervisorHostListener;
import org.apache.cloudstack.engine.subsystem.api.storage.Scope;
import org.apache.cloudstack.framework.config.ConfigKey;
import org.apache.cloudstack.storage.datastore.db.StoragePoolVO;

@@ -42,6 +43,7 @@
import com.cloud.offering.ServiceOffering;
import com.cloud.storage.Storage.ImageFormat;
import com.cloud.utils.Pair;
import com.cloud.utils.exception.CloudRuntimeException;
import com.cloud.vm.DiskProfile;
import com.cloud.vm.VMInstanceVO;

@@ -209,6 +211,10 @@ public interface StorageManager extends StorageService {
ConfigKey<Long> HEURISTICS_SCRIPT_TIMEOUT = new ConfigKey<>("Advanced", Long.class, "heuristics.script.timeout", "3000",
"The maximum runtime, in milliseconds, to execute the heuristic rule; if it is reached, a timeout will happen.", true);

ConfigKey<Integer> StoragePoolHostConnectWorkers = new ConfigKey<>("Storage", Integer.class,
"storage.pool.host.connect.workers", "1",
"Number of worker threads to be used to connect hosts to a primary storage", true);

/**
* should we execute in sequence not involving any storages?
* @return tru if commands should execute in sequence
@@ -360,6 +366,9 @@ static Boolean getFullCloneConfiguration(Long storeId) {

String getStoragePoolMountFailureReason(String error);

void connectHostsToPool(DataStore primaryStore, List<Long> hostIds, Scope scope,
boolean handleStorageConflictException, boolean errorOnNoUpHost) throws CloudRuntimeException;

boolean connectHostToSharedPool(long hostId, long poolId) throws StorageUnavailableException, StorageConflictException;

void disconnectHostFromSharedPool(long hostId, long poolId) throws StorageUnavailableException, StorageConflictException;