An Empirical Approach to a Cloud Workload Health Scoring Framework

Sridhar Nomula

Introduction:

Cloud computing offers a compelling "pay-as-you-go" model, promising flexibility and cost efficiency. However, this very advantage can become a pitfall. Unlike traditional IT infrastructure with upfront costs, the cloud's dynamic nature can cause us to lose sight of resource utilization and the associated expenses. We often find ourselves surprised by unexpected bill spikes, which expose inefficiencies that went unnoticed. Studies suggest that cloud costs can be inflated by as much as 30% when there are no systems for monitoring resource usage and taking timely action.

Today's cloud management tools offer a multitude of optimization recommendations. This abundance of information can be overwhelming. Users are often left wondering:

• Which recommendations should be prioritized?

• When is the best time to act?

• How do I know if my overall cloud health is good or bad?

The dynamic and complex nature of cloud environments makes it hard to manage and optimize workloads effectively. Traditional approaches to workload management often lack transparency and fail to provide actionable insights for decision-making. In response to these challenges, there is a growing need for empirical methods that offer a clear and intuitive means of assessing the health of cloud workloads.

This is where a Cloud Workload Health Score steps in, providing much-needed clarity and direction. By consolidating and interpreting multiple recommendations into a single, easy-to-understand score, this mechanism empowers users to:

• Gain a holistic view of their cloud health at a glance.

• Prioritize actions based on the severity of identified issues.

• Make informed decisions about optimization efforts to improve performance, cost efficiency, and security.

By demystifying the sea of recommendations, the Cloud Workload Health Score becomes a valuable guide, directing users towards a healthy and optimized cloud environment. Our approach leverages mathematical formulations, weights, and linear transformations to evaluate the health of cloud workloads based on key performance indicators, cost factors, best practices, and security posture.

Related work on Cloud Workload Health Scoring:

Cloud Workload Characterization for Performance Prediction and Resource Optimization (You et al., 2011) proposes a workload characterization approach for cloud environments, focusing on resource usage patterns for performance prediction and optimization. This aligns with the concept of using historical data to understand workload behavior for health scoring. While workload characterization helps to understand resource usage patterns, it might not capture all aspects relevant to cloud workload health scoring (e.g., security posture, cost efficiency). The paper focuses on performance prediction, whereas health scoring might require a broader range of metrics beyond resource usage.

An Automated Configuration Management System for Optimizing Cloud Applications (Mao et al., 2016) explores automated configuration management for cloud applications. This concept can be integrated with health scoring to trigger automated actions like configuration adjustments based on identified issues.

The paper doesn't explicitly discuss how configuration changes impact health score calculation. Its focus is on automation, but integration with health scoring might require human intervention for complex configuration adjustments, depending on the severity of the health issue. Information on the health scoring metrics and their relationship with specific configurations would have strengthened the connection between automated configuration management and health optimization.

Cloud Resource Management: A Survey of Literature and Future Directions (Li et al., 2016) provides a comprehensive survey of cloud resource management techniques. This paper serves as a valuable resource to understand the broader context of workload health scoring within the cloud resource management landscape.

By analyzing the survey results, we can potentially identify weaknesses in current cloud workload scoring or management practices. This information can be used to inform the development of a more comprehensive health scoring framework.

A Holistic Approach to Cloud Security Posture Management (Gupta et al., 2018) introduces a holistic approach to cloud security posture management. This aligns with the idea of incorporating security metrics into the Cloud Workload Health Score for a comprehensive assessment.

Machine Learning for Cloud Resource Management: A Survey (Mahmoudi et al., 2019) surveys the application of machine learning for cloud resource management. This highlights the potential of utilizing machine learning models for workload health score prediction and anomaly detection.

A Survey on Cloud Security and Service Level Agreements (Buyya et al., 2011) provides a survey on cloud security and Service Level Agreements (SLAs), emphasizing the importance of security compliance within workload health considerations.

Runtime Enforcement of Security Policies in Cloud Systems (Xu et al., 2014) discusses runtime enforcement of security policies, highlighting the role of security posture in maintaining workload health.

The paper might not explicitly discuss how runtime enforcement mechanisms translate into actionable insights for a health scoring framework. The paper's primary focus might be on enforcement mechanisms. It might not offer specific guidance on integrating security posture assessment into a health scoring system.

These papers showcase different aspects relevant to Cloud Workload Health Scoring. They emphasize workload characterization, automated configuration management, broader resource management techniques, security considerations, and the potential of machine learning for workload health assessment.

Gap Analysis:

One fundamental limitation of previous approaches lies in the completeness of the health score to depict the overall performance and the lack of clear direction for focusing on specific resources. Traditional methods often lack a comprehensive view of workload health and fail to provide actionable insights for resource optimization. This gap has prompted the development of an empirical framework that takes a top-down view to address these challenges more effectively.

Granularity of Recommendation Characteristics:

While previous approaches have attempted to incorporate recommendations for workload optimization, they often lack granularity in considering the characteristics of these recommendations. This hampers the accuracy of health assessments and may lead to suboptimal resource allocation.

There is a pressing need to refine the methodology to account for the diverse characteristics of recommendations, such as the criticality of the recommendation, the resource location (environment: dev, pre-prod, prod), and the policy sub-category. Enhancing the granularity of recommendation analysis will enable more precise health scoring and resource optimization.

Subjectivity in Scoring Transformation and Labelling:

The transformation of recommendation scores and subsequent labelling into discrete categories introduces subjectivity and may not fully capture the nuances of workload health. This ambiguity can impede decision-making and hinder the identification of critical areas requiring attention.

Developing objective criteria for scoring transformation and labelling will enhance the reliability and consistency of health assessments. By establishing clear benchmarks, stakeholders can make more informed decisions based on standardized metrics.

Interpretation of Aggregated Health Scores:

While aggregating health scores provides a holistic view of workload health, the interpretation of aggregated scores may lack actionable insights. Without clear direction on where to focus remediation efforts, organizations may struggle to prioritize resources effectively.

Advancing visualization techniques and decision support mechanisms is essential for translating aggregated scores into actionable insights. By providing intuitive tools for identifying root causes and prioritizing remediation efforts, stakeholders can optimize resource allocation and improve overall workload health.

Integration of Actionable Insights:

While the approach identifies resources contributing to low health scores and prioritizes recommendations, the integration of actionable insights into existing workflows may be challenging. Without seamless integration into operational processes, organizations may struggle to enact meaningful change.

Streamlining the integration of remediation strategies into existing workflows is critical for driving tangible improvements in workload health. Automation and proactive alerting mechanisms can facilitate timely action, ensuring that identified issues are promptly addressed.

This gap analysis highlights areas where further research and development efforts can lead to improvements in the proposed approach, ultimately enabling organizations to achieve better control over their cloud workloads and enhance overall performance, cost optimization, and security posture.

Methodology

The proposed methodology for assessing the health of cloud workloads aims to provide actionable insights for workload management within organizations. The accompanying figure depicts our bottom-up scoring approach, which reveals the health status at various levels. This enables users to navigate through the hierarchy, pinpoint issues, and take the recommended actions to improve health from a 'Red' status to 'Green'.

This methodology integrates recommendations sourced from diverse avenues such as license optimization, identification of idle instances, right-sizing, performance enhancements, and cost-saving strategies, along with key performance indicators. We assign weights to these recommendations based on their attributes, such as the resource environment and policy subcategory, to prioritize impactful actions. Subsequently, scores are normalized using a parameterized hyperbolic tangent function to maintain a consistent range of 0 to 1. These transformed scores are then categorized as 'Red' (below 0.5), 'Amber' (0.5-0.7), or 'Green' (above 0.7) to provide clear indications of health status at the resource level.

Moreover, the health scores are aggregated hierarchically at various levels, including resource type, resource group, subscription level, cloud provider level, and billing account level, offering a comprehensive perspective for targeted optimization. This multi-layered approach makes it possible to identify the resources contributing to a low overall health score and to prioritize the most impactful recommendations for improving workload health.

The objectives of this study are twofold:

A. To develop a transparent and intuitive framework for scoring the health of cloud workloads.

B. To provide practical guidance for organizations seeking to improve their cloud workload management practices.

The score comprises the following key pillars:

• Security and Best Practices

• Performance

• Savings Opportunity

Security and Best Practices:

Cloud management tools offer valuable recommendations for optimizing resource utilization. These recommendations go beyond generic advice and consider the specific characteristics of individual resources. Here are some key factors that shape these recommendations:

1. Resource Properties:

• Location: Where the resource resides (e.g., Production (Prod), Development (Pre-Prod), or Evaluation (EV)). Each environment has distinct performance and cost considerations.

• Criticality: The resource's importance to business operations. Critical resources cause significant disruptions if they fail or become unresponsive.

2. Origin and Importance:

Recommendations don't arise in isolation. Each one has a clear origin:

Policy: A set of rules or guidelines governing resource usage or configuration within an organization.

Policy Subcategory: A specific category within a broader policy (e.g., "Security best practices" within an overall Security Policy).

Importantly, the relevance of these policies varies depending on the project's or client's domain. For instance, security best practices carry higher weight for a financial services application compared to a non-critical internal project. We represent this variable importance with configurable weights assigned to recommendations based on their policy origin.

This forms the basis for ranking individual recommendations, showing their relative importance for a resource. The table below depicts the dynamics involved in every recommendation; each one is weighted according to the business domain.

Values can be configured within the ranges provided below to represent importance: the higher the value, the greater the importance. Users must choose values between the stated minimum and maximum when feeding the framework.

Multiply all the weights to arrive at a relative rank for each individual recommendation.
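The multiplication step can be sketched in a few lines of Python. The attribute names and the weight values below are hypothetical placeholders, since the configurable ranges from the table are not reproduced here; only the "multiply all the weights" rule comes from the text.

```python
from math import prod


def recommendation_rank(weights):
    """Multiply the configured attribute weights into one relative rank."""
    return prod(weights.values())


# Hypothetical weights for one recommendation (illustrative values only).
example_weights = {
    "criticality": 3.0,
    "environment": 2.0,
    "policy_subcategory": 1.5,
}

rank = recommendation_rank(example_weights)
print(rank)
```

Because the rank is a product, raising any single weight scales the whole rank, so a recommendation on a critical production resource quickly outranks the same recommendation on a development resource.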

Aggregation:

The scoring approach combines multiple recommendations into a single weight, which is then passed to an activation function, the hyperbolic tangent, to produce a resource-level score reflecting individual resource health.

Transformation function for resource scoring

Prioritization:

The graph above illustrates how the transformed scores are distributed between 0 and 1. The curve rises steeply at first, so that critical resources with urgent needs are promptly identified at the resource level, and then saturates slowly as it approaches 1. This property is critical for scoring risky resources. To make the result easy to interpret, the resource score is subtracted from 1, yielding the health metric. Recommendations are then prioritized based on this health metric.
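A minimal sketch of this resource-level transformation follows. Two details are assumptions, because the text does not state them: the recommendation ranks are combined by summation, and the "parameterized" tanh is taken to be tanh(k × x) with a tunable steepness k. The Red/Amber/Green thresholds are the ones given in the Methodology section.

```python
import math


def resource_risk_score(ranks, k=0.1):
    """Squash the combined recommendation ranks into [0, 1) with tanh."""
    total = sum(ranks)  # assumption: ranks are combined by summation
    return math.tanh(k * total)  # assumption: k controls the steepness


def resource_health(ranks, k=0.1):
    """Health metric: 1 minus the transformed resource score."""
    return 1.0 - resource_risk_score(ranks, k)


def health_label(health, amber=0.5, green=0.7):
    """Map a health value to Red (< 0.5), Amber (0.5 to 0.7) or Green (> 0.7)."""
    if health < amber:
        return "Red"
    if health <= green:
        return "Amber"
    return "Green"


health = resource_health([9.0, 4.0])
print(health, health_label(health))
```

A resource with no open recommendations gets a health of 1.0, and each additional or heavier recommendation pushes the health towards 0.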

Performance:

Every service has a different set of metrics to measure the performance of applications. For instance, EC2 instances have metrics like CPU utilization (in milli cores) and memory utilization (in gigabytes) to gauge their performance. Similarly, for EMR clusters, metrics like YARN (Yet Another Resource Negotiator) Queue Fair Usage and Number of Running Tasks can be used to evaluate the cluster's health and resource utilization.

The selection of Key Performance Indicators (KPIs) and metrics for scoring directly correlates with the cloud service being evaluated. Recommendations, particularly those based on machine learning (ML) usage patterns, might include suggested CPU and memory resource allocations. The health score calculation then determines the percentage deviation from these recommendations. A positive percentage indicates under-provisioning, while a negative percentage signifies over-provisioning.

Since deviation metrics from recommended resource allocation are relative, we introduce another factor: machine size. This refers to the overall capacity of the cloud instance (e.g., number of CPU cores, memory). A larger deviation for a more powerful machine (higher machine size) can potentially have a greater impact on business processes compared to a smaller machine with a similar deviation. This notion is key for prioritizing the performance recommendations.

The impact score is computed from the machine size and the CPU and memory deviations.

Calculate Individual Impact Scores for performance:

The impact score for both CPU and memory is calculated as follows:

• If the deviation is positive: impact = 0.8 × (1 + deviation)

• If the deviation is negative: impact = ABS(deviation)

The same rule is applied to the CPU deviation and to the memory deviation.

The performance score is a linear transformation between weights and cloud characteristics:

• Performance score = 1 - [0.33 × CPU impact + 0.33 × Memory impact + 0.34 × Machine size]

Note: The performance score is calculated if and only if a percentage deviation exists in the KPIs.
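The performance formulas above translate directly into code. In this sketch, deviations are assumed to be expressed as fractions (0.25 for 25%) and machine size is assumed to be already normalized to the range 0 to 1; neither convention is stated in the text. A missing deviation is represented by None, in which case no score is produced.

```python
def impact(deviation):
    """Impact of a CPU or memory deviation from the recommended allocation."""
    if deviation > 0:  # under-provisioned
        return 0.8 * (1 + deviation)
    return abs(deviation)  # over-provisioned (or no deviation)


def performance_score(cpu_dev, mem_dev, machine_size):
    """Performance score, or None when a deviation is missing from the KPIs."""
    if cpu_dev is None or mem_dev is None:
        return None
    return 1 - (
        0.33 * impact(cpu_dev)
        + 0.33 * impact(mem_dev)
        + 0.34 * machine_size  # assumption: normalized to [0, 1]
    )


print(performance_score(0.25, -0.2, 0.5))
```

Note that the result is not clamped: a large positive deviation on a large machine can push the score below 0, so a real implementation may want to bound it.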

Savings Opportunity:

The potential cost savings achievable through implementing recommendations are a crucial aspect of cloud resource optimization. We can analyze this using two primary metrics:

• Projected Savings Percentage: This metric, provided by the recommendation system, reflects the estimated percentage reduction in cost that could be achieved by following the recommendation. A negative percentage indicates that the recommendation is an upgrade to a higher tier, which increases cost rather than reducing it.

• Dollar Savings: This measures the actual amount saved from the recommendation. Note that savings can be negative if the recommendation involves upgrading to a pricier machine for better performance. While this might incur higher costs, it's prioritized for overall system performance, emphasizing the importance of Performance scores.

Savings Score Calculation

While a single formula might not perfectly capture the nuances of all recommendations, a Savings Score can help prioritize cost-saving opportunities. Here's a potential approach:

Weighted Average: Calculate a weighted average of the projected savings percentage and the dollar savings, with weights assigned based on your organization's priorities. For example, a weight of 0.7 for the projected savings percentage and 0.3 for the dollar savings percentile emphasizes the long-term cost reduction potential.

Scaling: The score can be scaled to a range of 0 (no savings) to 1 (maximum savings potential), allowing for easier comparison across different recommendations.

NOTE: The specific weights assigned in the Savings Score calculation should be customized based on your organization's financial goals and risk tolerance.
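One way to read the weighted-average and scaling steps is sketched below. The percentile definition (share of recommendations with dollar savings at or below this one) and the clamping of negative percentages to 0 are assumptions made for illustration; the 0.7 and 0.3 weights are the example values from the text.

```python
def percentile_rank(value, population):
    """Fraction of values in a non-empty population that are <= value."""
    return sum(v <= value for v in population) / len(population)


def savings_score(pct_savings, dollar_savings, all_dollar_savings,
                  w_pct=0.7, w_dollar=0.3):
    """Weighted average of savings percentage and dollar-savings percentile."""
    # Assumption: percentages are clamped so the score stays within 0..1.
    pct = min(max(pct_savings / 100.0, 0.0), 1.0)
    pctl = percentile_rank(dollar_savings, all_dollar_savings)
    return w_pct * pct + w_dollar * pctl


# Hypothetical dollar savings across four recommendations.
dollars = [10, 50, 200, 400]
print(savings_score(40, 200, dollars))
```

Since both inputs lie between 0 and 1 and the weights sum to 1, the resulting score is already on the 0 (no savings) to 1 (maximum savings potential) scale.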

Overall health

By combining this Savings Score with the previously discussed Performance Score, you can create a more holistic picture of how implementing recommendations can impact both resource efficiency and cost optimization within your cloud environment. The Overall Score provides a comprehensive assessment of a cloud resource's health by considering four key pillars:

Security: This pillar evaluates the resource's adherence to security best practices and configuration guidelines.

Performance: This pillar measures the resource's efficiency and ability to meet performance requirements.

Savings: This pillar assesses the potential cost savings achievable by implementing recommendations.

Best Practices: This pillar evaluates the resource's compliance with recommended configurations and optimization strategies.

Dynamic Weighting for Tailored Scoring:

To prioritize specific pillars based on your unique needs, the Overall Score employs a dynamic weighting system. These weights are:

• Predefined: We establish baseline weights for each pillar, reflecting their general importance.

• Configurable: These weights can be adjusted to emphasize specific pillars for different resource types or your organization's priorities.

• Additive: The weights for all pillars sum to 1, ensuring they contribute proportionally to the Overall Score.

For example, if no security-specific recommendations exist for a particular resource, the predefined weight for Security is distributed proportionately across the other pillars, maintaining their initial relative weightings. The remaining pillars (Performance, Savings, and Best Practices) therefore each receive a slightly increased weight in the Overall Score calculation, reflecting the absence of security-related adjustments.
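The redistribution rule can be illustrated as a renormalization over the pillars that actually have a score. The baseline weights below are hypothetical, and only three pillars are used to keep the example short; the proportional redistribution itself is what the text describes.

```python
def redistribute(weights, available):
    """Renormalize pillar weights over the pillars that have a score."""
    active = {p: w for p, w in weights.items() if p in available}
    total = sum(active.values())
    if total == 0:
        raise ValueError("no scored pillar has a non-zero weight")
    return {p: w / total for p, w in active.items()}


def overall_score(scores, weights):
    """Weighted sum of pillar scores using the redistributed weights."""
    active = redistribute(weights, scores.keys())
    return sum(active[p] * scores[p] for p in active)


# Hypothetical baseline weights (they sum to 1).
baseline = {"security": 0.4, "performance": 0.3, "savings": 0.3}
# No security recommendations exist for this resource.
scores = {"performance": 0.8, "savings": 0.6}

print(redistribute(baseline, scores.keys()))
print(overall_score(scores, baseline))
```

Dividing by the remaining total keeps the ratio between the surviving pillars unchanged while making their weights sum to 1 again.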

Benefits of Overall Score:

• Holistic View: Provides a comprehensive picture of resource health, encompassing security, performance, cost, and adherence to best practices.

• Prioritization: Dynamic weights allow you to tailor the scoring to your specific needs and priorities.

• Actionable Insights: Enables informed decision-making about resource optimization and resource management efforts.

Overall Score empowers you to identify potential issues across various aspects of your cloud resources, allowing for proactive optimization and improved cloud health. This is not the end of the story: these scores are aggregated at higher levels, using the resource-level scores labelled "Red", "Amber", or "Green" with custom thresholds.

Health Score Calculation: A Hierarchical Approach

This research proposes a hierarchical approach for calculating a cloud resource health score. The process involves several key steps:

1. Resource Grouping and Categorization:

Resources are initially grouped by their type (e.g., virtual machines, storage buckets). Within each type, a pivot table is used to categorize resources based on their health status, typically labelled as "Red" (critical), "Amber" (warning), or "Green" (healthy).

2. Normalized Health Score Calculation:

For each health category (Red, Amber, Green), the number of resources is counted.

A normalized score is then calculated for each category using the following formula:

Normalized Score = log2(Count) / 3

This formula applies a logarithmic transformation (log2) to the resource count, compressing the range of values and emphasizing the significance of larger deviations. The division by 3 scales the result within a manageable range.

Additionally, a sigmoid function is applied to the normalized score. The sigmoid function introduces a non-linear transformation, resulting in a smoother transition between health categories and potentially reducing the impact of outliers.

3. Weighted Score and Health Calculation:

Configurable weights are assigned to each health category:

Red (critical): Weight of 0.7 (highest weight)

Amber (warning): Weight of 0.3

Green (healthy): Weight of 0.1 (lowest weight)

These weights reflect the relative importance of each health category in determining the overall health score. Critical issues carry a higher weight, impacting the score more significantly.

A weighted score is calculated by multiplying the normalized score for each category by its corresponding weight and then summing the products:

Weighted Score = (0.7 × Red Score) + (0.3 × Amber Score) + (0.1 × Green Score)

Finally, the health score for the resource type is obtained by subtracting the weighted score from 1:

Health Score (Resource Type) = 1 - Weighted Score
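Steps 2 and 3 can be sketched as follows. This is one interpretation of the text, not a definitive implementation: the sigmoid is applied directly to log2(Count) / 3, and a category with no resources is assumed to contribute 0, since log2(0) is undefined and the text does not say how empty categories are handled.

```python
import math

# Category weights as given in the text (configurable).
CATEGORY_WEIGHTS = {"Red": 0.7, "Amber": 0.3, "Green": 0.1}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def category_score(count):
    """Sigmoid of the normalized score log2(count) / 3."""
    if count <= 0:
        return 0.0  # assumption: empty categories contribute nothing
    return sigmoid(math.log2(count) / 3)


def resource_type_health(counts, weights=CATEGORY_WEIGHTS):
    """Health of a resource type: 1 minus the weighted category scores."""
    weighted = sum(
        w * category_score(counts.get(cat, 0)) for cat, w in weights.items()
    )
    return 1.0 - weighted


print(resource_type_health({"Red": 2, "Amber": 5, "Green": 40}))
```

With these weights, eight "Red" resources lower the health far more than eight "Green" ones, which is the prioritization effect the weights are meant to produce.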

4. Hierarchical Aggregation:

This process can be applied hierarchically. The health scores calculated for individual resource types can be aggregated to compute a health score for a higher level, such as a project or client. The specific aggregation method can be tailored to your specific needs.
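Since the aggregation method is left open, the sketch below shows just one possible choice: a mean of the resource-type health scores weighted by the number of resources of each type. This weighting scheme is an assumption for illustration, not the method prescribed by the framework, and the type names and values are hypothetical.

```python
def aggregate_health(children):
    """Resource-count-weighted mean of child health scores.

    `children` maps a name to a (health_score, resource_count) pair.
    """
    total = sum(count for _, count in children.values())
    if total == 0:
        raise ValueError("cannot aggregate: no resources")
    return sum(score * count for score, count in children.values()) / total


# Hypothetical resource-type scores within one project.
project = {
    "virtual_machines": (0.6, 10),
    "storage_buckets": (0.9, 30),
}

print(aggregate_health(project))
```

The same function can be reused one level up, for example by feeding it project-level scores with their total resource counts to obtain a client-level score.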

Benefits of Hierarchical Approach:

• Standardized Scoring: The consistent use of categories and weights across resource types allows for easy comparison and identification of areas needing attention.

• Weighted Importance: Assigning weights to health categories enables you to prioritize critical issues and tailor the scoring to your organization's priorities.

• Hierarchical Aggregation: The ability to aggregate scores hierarchically provides a comprehensive health overview at different levels of granularity.

This hierarchical health score calculation approach offers a structured and configurable method for evaluating the health of cloud resources, empowering you to make informed decisions about resource optimization and cloud workload management.

To seamlessly integrate this health scoring approach with your existing cloud management system (CMS), you'll need to develop mechanisms for data acquisition and build the pipeline for processing it in the scoring system that we developed.

The system diagram above summarizes what is needed to benefit from this scoring system. You can also monitor the trend of the health score overall and at multiple levels simultaneously. Continuously monitoring resource health and updating the scores helps you keep your resources healthy.

Limitations:

Static Weights:

The current health scoring system utilizes pre-defined weights for different health categories ("Red", "Amber", "Green"). While these weights are configurable, they remain static across all resources. This approach might not capture the nuances of resource health across different contexts.

Impact:

Static weights might not accurately reflect the relative importance of health categories for specific resource types. For example, a "Red" health status for storage might be less critical than a "Red" status for a high-availability database server.

Solution:

Future versions could explore dynamic weight adjustments based on:

Resource Type:

Assign different weights to health categories depending on the resource type (e.g., higher weight for "Red" on critical databases).

Workload Characteristics:

Adjust weights based on workload demands. For instance, a resource experiencing high CPU utilization might have a higher emphasis on performance-related health categories.

Real-time Performance Data:

Utilize real-time metrics to dynamically adapt weights based on the current resource behavior.

Conclusion:

Cloud workloads require continuous monitoring and optimization for optimal performance, cost efficiency, and security. Existing Cloud Workload Health Scoring methods often have limitations, such as focusing on specific aspects or lacking a granular scoring mechanism. This paper proposes a novel, empirical approach that addresses these limitations. Our framework integrates recommendations from diverse sources and KPIs, assigning configurable weights for prioritization. A parameterized hyperbolic tangent function transforms scores into a clear health indication ("Red," "Amber," "Green"), offering an intuitive understanding of the focus areas and the primary actions needed to bring workload health back on track. Additionally, hierarchical aggregation of health scores provides a comprehensive view for targeted optimization. This transparent and intuitive framework empowers organizations to make informed decisions regarding cloud workload deployment, resource allocation, and security measures.
