SLALOM organized two live sessions to present the final versions of our legal terms and technical specifications for #Cloud #SLAs. The sessions provided examples showing how to apply SLALOM in practice to improve current industry practice for #Cloud #SLAs and to support the development of cloud computing metrics.
The first webinar covered the SLALOM Technical track: "Using metrics to improve Cloud SLAs".
3. Problem snapshot
SLA Technological Landscape
• Many ambiguities exist in the SLAs of Cloud providers
• The measurement/auditing process of an SLA cannot be performed non-repudiably
– i.e., the involved parties may be able to challenge the auditing of the SLOs
• Standard models are rare and not widely used
• Differences between Cloud providers cannot be easily assessed
– Absolute percentages cannot be compared across providers
4. Problem snapshot
Ambiguities in SLAs
• The definition of availability (as given by providers) may encapsulate different formulas for its calculation
• The definition and calculation of availability may rely on different ways of identifying a failed (or valid) response, e.g.:
– Response time within a limit
– Returned response within a string enumeration (i.e. a predefined range of string values)
• Preconditions apply
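These divergent failure definitions can be made concrete with a short sketch (illustrative Python, not taken from any provider's SLA): the same monitoring trace yields different availability figures depending on which reading of "failure" is applied.

```python
# Illustrative sketch: two plausible readings of "failure" applied to the
# same samples yield different availability figures. Sample shape is assumed.
samples = [
    {"latency_ms": 120, "body": "OK"},
    {"latency_ms": 950, "body": "OK"},       # slow but well-formed
    {"latency_ms": 700, "body": "OK"},       # slow but well-formed
    {"latency_ms": 80,  "body": "GARBAGE"},  # fast but malformed
    {"latency_ms": 60,  "body": "OK"},
]

# Reading 1: failure = response time exceeding a limit.
def failed_by_latency(s, limit_ms=500):
    return s["latency_ms"] >= limit_ms

# Reading 2: failure = response outside a predefined string enumeration.
def failed_by_enum(s, allowed=("OK", "PARTIAL")):
    return s["body"] not in allowed

def availability(samples, failed):
    ok = sum(1 for s in samples if not failed(s))
    return 100.0 * ok / len(samples)

print(availability(samples, failed_by_latency))  # 60.0
print(availability(samples, failed_by_enum))     # 80.0
```

Same trace, same "availability" metric, two different numbers: exactly the ambiguity the bullets above describe.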
5. Problem snapshot
Real world example of Ambiguity
• Ambiguity in the measurement process of AWS EC2 SLA
• “Unavailable” and “Unavailability” mean:
– When all of your running instances have no external connectivity
• How is external connectivity determined?
– Internet Layer: Pinging (ICMP)?
• Security threat
– Application layer: Endpoint checking?
• Includes application downtime
• Not exclusively the responsibility of AWS EC2
6. Problem snapshot
Examples of preconditions
• For any SLA to apply, a number of preconditions typically exist per
provider
• Examples:
– Deployment: A specified number of Availability Zones must be used
– Deployment: Replication options must be used
– Usage/Measurement: Unavailable resources must first be restarted
– Usage/Measurement: The number of requests must be throttled
7. Problem snapshot
SLALOM Technical objectives
• To define a standard model for SLAs that eliminates ambiguities
• To facilitate the measurement, monitoring and enforcement of SLAs so as to achieve non-repudiability
• To abstract the SLA definition process (SLA → SLO → metric → sub-metric) so as to enable the application of metrics that allow for direct comparability
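The layered decomposition in the last objective can be sketched as a small type hierarchy; the class and field names below are illustrative, not the actual SLALOM or ISO schema.

```python
# Minimal sketch of the SLA -> SLO -> metric -> sub-metric decomposition.
# All names here are illustrative, not the SLALOM/ISO model classes.
from dataclasses import dataclass, field

@dataclass
class SubMetric:            # e.g. a concrete sampling operation
    name: str
    unit: str

@dataclass
class Metric:               # e.g. availability, composed of sub-metrics
    name: str
    sub_metrics: list = field(default_factory=list)

@dataclass
class SLO:                  # a target expressed over a metric
    metric: Metric
    operator: str           # e.g. ">="
    threshold: float

@dataclass
class SLA:                  # the agreement: a set of SLOs
    provider: str
    slos: list = field(default_factory=list)

ping = SubMetric("icmp_ping_success", "boolean")
avail = Metric("availability", [ping])
sla = SLA("ExampleCloud", [SLO(avail, ">=", 99.95)])
print(sla.slos[0].threshold)  # 99.95
```

Because every SLO bottoms out in explicit sub-metrics, two SLAs built this way can be compared metric by metric rather than by reading prose.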
9. SLALOM@ISO
Interaction with ISO
• Mapped SLALOM's initial 3-layer approach to the ISO baseline model
– The ISO approach is powerful at describing more complex metrics (e.g. the MS Azure SLA)
• Demonstrated and suggested extending the ISO model to fully define the way an SLO can be audited: ACCEPTED
– Suggested the inclusion of an Extension class in the ISO model
– Instantiated the ISO Extension class as the base Sample class of SLALOM
– Introduced the SLALOM Sample layer for concretely defining the sampling process
– In the latest revision of the draft ISO model, all classes are extensible
• Applied the approach to different types of objectives in commercial SLAs
– GAE Datastore (PaaS)
– AWS EC2 (IaaS)
– Microsoft Azure (Storage)
• Showed the applicability of the proposed approach for directly creating machine-understandable descriptions of the SLOs
10. SLALOM@ISO
ISO 19086-2 Metric model
• SLALOM's two-fold contribution:
– ISO model class parameters: made machine-understandable
– ISO model extension: definition of the sampling process
[Figure: the SLALOM proposed extension shown against the model from the latest revision of the 19086-2 draft standard (to be made available in the forthcoming weeks); all classes are extensible.]
11. SLALOM@ISO
SLALOM vs. ISO compliance
ISO-compliant SLA:
• Uses the ISO fields (classes, parameters)
• The SLA is not necessarily fully defined
SLALOM-compliant SLA:
• ISO compliant
• Clear and well defined
• Non-repudiable
• SLAs are still not comparable among providers
13. Commercial SLAs @SLALOM
Amazon WS EC2
Amazon EC2
Sample definition
– Expression: sc: UNDEFINED (assumed 'ping' → ICMP)
– Notes: The sampling condition is not defined in the Amazon EC2 SLA. The exact wording is "when all of your running instances have no external connectivity"; however, the way to specify / measure "external connectivity" is not defined. For example, a customer could use a ping operation or a custom monitoring mechanism.
– Expression: Type of operation: ping
– Notes: It is not defined how the condition of connectivity can actually be measured (e.g. the ping operation mentioned previously).
Boundary period and error definitions
– Expression: bp > 60 sec
– Notes: The exact wording is "the percentage of minutes"; thus the period is 60 seconds.
– Expression: ec = 100%
– Notes: The error condition reflects that, for the entire boundary period, the resource must be continuously "unavailable".
Abstract metric definition
– Expression: availability < 99.95 %
– Notes: Availability metric definition given the boundary period and error condition.
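Under the reading summarised in the table, per-minute availability for EC2 could be evaluated as follows. This is an illustrative sketch: the per-minute sample layout and the connectivity check itself are assumptions, since the SLA leaves them undefined.

```python
# Hedged sketch of the table's reading of the EC2 SLA: a minute counts as
# "unavailable" only if every sample in it shows no external connectivity
# (ec = 100% over a 60 s boundary period). Sample layout is illustrative.
def minute_unavailable(samples_in_minute):
    # samples_in_minute: list of booleans, True = connectivity observed
    return not any(samples_in_minute)

def monthly_availability(minutes):
    # minutes: list of per-minute sample lists for the billing period
    down = sum(1 for m in minutes if minute_unavailable(m))
    return 100.0 * (len(minutes) - down) / len(minutes)

# Toy "month" of 4 minutes: one fully dark minute, one partially failing one.
minutes = [
    [True, True, True],
    [False, False, False],   # counts as unavailable
    [False, True, False],    # partial failure: still "available" under ec=100%
    [True, True, True],
]
print(monthly_availability(minutes))  # 75.0
```

Note how the ec = 100% condition makes partially failing minutes count as available: a direct consequence of the wording, and one reason absolute percentages are not comparable across providers.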
14. Commercial SLAs @SLALOM
Google AE Datastore
Google AppEngine Datastore
Sample definition
– Expression: sc: INTERNAL_ERROR
– Notes: Several sampling conditions are defined per type of operation. For example, "INTERNAL_ERROR, TIMEOUT, …" (exact wording) is specified for API calls.
– Expression: Type of operation: API calls
– Notes: Several types of operations are defined; an example is provided here.
Boundary period and error definitions
– Expression: bp > 300 sec
– Notes: The exact wording is "five consecutive minutes".
– Expression: ec > 10%
– Notes: The error condition reflects that the error ratio is "ten percent Error Rate" (exact wording).
Abstract metric definition
– Expression: availability < 99.95 %
– Notes: Availability metric definition given the boundary period and error condition.
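The corresponding evaluation for the Datastore SLA could be sketched as follows; the per-minute grouping of samples into a five-minute window is an assumption made for illustration.

```python
# Hedged sketch of the table's reading of the GAE Datastore SLA: a period of
# five consecutive minutes (bp = 300 s) with an Error Rate above 10% counts
# toward unavailability. Windowing and data layout are illustrative.
ERROR_CODES = {"INTERNAL_ERROR", "TIMEOUT"}   # sampling conditions (sc)

def error_rate(responses):
    errors = sum(1 for r in responses if r in ERROR_CODES)
    return errors / len(responses)

def window_unavailable(per_minute_responses, ec=0.10):
    # per_minute_responses: five consecutive minutes of API-call results
    return all(error_rate(m) > ec for m in per_minute_responses)

five_minutes = [["OK"] * 8 + ["INTERNAL_ERROR", "TIMEOUT"]] * 5  # 20% errors
print(window_unavailable(five_minutes))  # True
```

Contrast this with the EC2 reading above: here a 20% error rate already counts as downtime, whereas under ec = 100% it would not.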
15. Commercial SLAs @SLALOM
Microsoft Azure
Microsoft Azure Storage
Sample definition
– Expression: sc = 60 sec
– Notes: Several sampling conditions are defined per type of operation. For example, "Sixty (60) seconds" (exact wording) is specified for PutBlockList and GetBlockList.
– Expression: Type of operation: PutBlockList and GetBlockList
– Notes: Several types of operations are defined; an example is provided here.
Boundary period and error definitions
– Expression: bp > 3600 sec
– Notes: The exact wording is "given one-hour interval".
– Expression: ec > 0%
– Notes: The error condition reflects that all periods should be taken into account for the availability metric evaluation; the exact wording is "is the sum of Error Rates for each hour".
Abstract metric definition
– Expression: availability < 99.9 %
– Notes: Availability metric definition given the boundary period and error condition.
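A sketch of the hourly aggregation described in the notes; the averaging of hourly Error Rates over the billing period is an assumption about the exact aggregation, made for illustration.

```python
# Hedged sketch of the table's reading of the Azure Storage SLA: an Error
# Rate is computed per one-hour interval (bp = 3600 s) and every hour
# contributes (ec > 0%). Averaging the hourly rates is an assumption.
def hourly_error_rate(failed, total):
    return failed / total if total else 0.0

def monthly_uptime(hours):
    # hours: list of (failed_requests, total_requests) per one-hour interval
    rates = [hourly_error_rate(f, t) for f, t in hours]
    return 100.0 * (1.0 - sum(rates) / len(rates))

hours = [(0, 1000), (5, 1000), (0, 1000), (15, 1000)]
print(monthly_uptime(hours))  # ~99.5
```

Unlike the EC2 and GAE readings, every failed request here moves the monthly figure, since there is no minimum error ratio before an hour starts to count.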
17. SLA comparability
Overview
• Even though SLA descriptions may be aligned through the SLALOM / ISO model, this does not mean that SLAs (or their parameters) become directly comparable
• More abstract metrics are needed that result in direct comparisons:
– SLA success ratio (published* by the Cloud WG of SPEC**)
– SLA strictness (published* by the Cloud WG of SPEC**)
– Standardised datasets
• The SLALOM model enables the application of comparable metrics:
– All SLA parameters are clearly and well defined
– The SLAs are machine readable
– This greatly simplifies the process and its automation
* Ready for Rain? A View from SPEC Research on the Future of Cloud Metrics
** SPEC: Standard Performance Evaluation Corporation
18. SLA comparability
Comparative metrics
• SLA success ratio
– Based on experience with using a service or provider
– Over time, keep track of successful or violated SLAs and of total SLAs
– Calculate the ratio: successful SLAs / total SLAs
• SLA strictness
– Extract static SLA parameters of importance for a given domain or application
– Assign weights to the parameters and normalise
– Map these parameters through a (possibly arbitrary) function
– Results in a comparative ranking of different SLAs
• Standardised datasets
– Define a set of failure scenarios
– Benchmark each provider's SLA definition against the predefined scenarios
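The success ratio described above amounts to a running counter over observed SLA outcomes; a minimal sketch (the class shape is illustrative, not part of the SLALOM model):

```python
# Minimal sketch of the SLA success ratio: track outcomes over time and
# report successful / total. Class and method names are illustrative.
class SLATracker:
    def __init__(self):
        self.successful = 0
        self.total = 0

    def record(self, violated: bool):
        # Called once per SLA evaluation period (e.g. per billing month).
        self.total += 1
        if not violated:
            self.successful += 1

    def success_ratio(self):
        return self.successful / self.total if self.total else None

t = SLATracker()
for violated in [False, False, True, False]:
    t.record(violated)
print(t.success_ratio())  # 0.75
```

Because the ratio is defined over observed outcomes rather than over each provider's own availability formula, it is directly comparable across providers.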
20. Lessons Learnt
Do
1) Target metrics that are directly comparable among providers
2) Consider directly machine-understandable descriptions via standardised templates
3) Look into the ISO 19086 series of standards and adopt it where applicable
4) Think outside the narrow Cloud box. With the advent of *aaS and the emergence of IoT, SLAs may refer to services external to the data center or to specific metrics needed by Cloud services, depending on the individual use case
5) Consider composite services that may create chains of SLAs, and their interdependencies. To guarantee response time for service-support services, consider downstream (reseller) and upstream (e.g. provider's subcontractors) actors' requirements and the need to 'float' SLA clauses down the chain
6) Consider resource management a key part of the SLA upkeep and analysis process
7) Consider mechanisms that allow providers, resellers and users, even non-experts, to easily monitor the SLA in a common and understandable way.
21. Lessons Learnt
Don’t
1) Consider offered terms equivalent, even if they originally seem to refer to the same SLO. Always check the fine print for differences in how metrics are actually calculated
2) Assume that SLAs are monitored by providers.
3) Leave end users out of the loop. Comprehensibility and clarity of an SLA (or its relevant metric) for non-experts should be a key target. Translate your metrics into plain English if necessary.
4) Limit yourself to popular metrics (e.g. availability) in SLAs. Users are also interested in more generic Quality of Experience (QoE) indexes, such as stability
5) Expect the market to bend for you: fit in with current practice to the maximum extent possible and, where that is not possible, hone your value proposition
23. SLALOM contribution
Tender Evaluation
• Usable by various actors
– Adopters to specify their needs
– Providers to describe their value proposition
– Third parties (resellers/brokers) to combine and offer services and
suggest options
• Added value
– Application of comparative metrics
– Automation of the process
• Benefits
– Improve transparency
– Enhance efficiency
– Establish fairness
24. SLALOM contribution
Contract monitoring
• Benefits
– Achieve SLA non-repudiation
– Establish trust and transparency for service execution compliant with the terms, and proper violation management
– Enable automation of contract and performance management and
monitoring
– Aid the involvement of actors like trusted third parties offering
relevant services
25. SLALOM contribution
Your feedback needed
• The SLALOM proposed specification / reference model already takes into account:
– Standardisation approaches and working group outcomes
– Current SLAs and metrics offered by commercial Cloud providers
– Views expressed by Cloud providers and adopters
– Research outcomes
• Further feedback regarding the applicability and practical usage of our model is more than welcome
• Please take the survey on IoT/Cloud metrics here:
https://docs.google.com/forms/d/1JmwDXyO_1hT9iR-lm1c3LCQu_zF64nf-uFnxBeGMv3g/viewform
27. SLALOM Project
SLALOM is a CSA financed by the European Commission under Grant Agreement 644270
For more information on the initiative, contact us:
@CloudSLAlom
www.SLALOM-Project.eu
SLALOM Project Coordinator (daniel.field@atos.net)
29. Backup slide: SLA strictness example
Provider/Service | t | q (s1·q) | q' (s2·q) | p (s3·p) | x | S | S'
Google Compute | 0 | 5 (1.00) | 5 (0.10) | 99.95 (0.50) | 0 | 0.50 | 1.60
Amazon EC2 | 0 | 1 (0.20) | 1 (0.02) | 99.95 (0.50) | 0 | 1.30 | 1.48
MS Azure Compute | 1 | 1 (0.20) | 1 (0.02) | 99.95 (0.50) | 0 | 2.30 | 2.48
• Extract static SLA parameters of importance for a given domain/application
– All these parameters (e.g. boundary period, error rates) are described in the SLALOM model
• Map these parameters through an arbitrary function, e.g.:
S = t + (1 - s1·q) + s3·p + x (and S' analogously, with s2·q' in place of s1·q), where:
– q: size of the boundary period
– p: percentage of availability
– t: running time vs. overall monthly time (boolean), t ∈ {0,1}
– x: existence of performance metrics (boolean), x ∈ {0,1}
– si: normalisation factors for the continuous variables, so that:
(s1·q) ∈ [0,1], (s2·q) ∈ [0,0.1] and (s3·p) ∈ [0,0.5]
• The resulting value may be compared between providers
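The tabulated S values can be reproduced directly from the normalised products, assuming the strictness function S = t + (1 - s1·q) + s3·p + x (with S' computed analogously from s2·q'), a reconstruction consistent with the S column above.

```python
# Sketch reproducing the strictness example, assuming
# S = t + (1 - s1*q) + s3*p + x; the normalised products (s1*q, s3*p)
# are taken directly from the table above.
def strictness(t, sq, sp, x):
    return t + (1.0 - sq) + sp + x

# (t, s1*q, s3*p, x) per provider, normalised values from the table
rows = {
    "Google Compute":   (0, 1.00, 0.50, 0),
    "Amazon EC2":       (0, 0.20, 0.50, 0),
    "MS Azure Compute": (1, 0.20, 0.50, 0),
}
for name, args in rows.items():
    # S: Google 0.5, Amazon 1.3, Azure 2.3 (matching the table)
    print(name, strictness(*args))
```

A longer boundary period (larger s1·q) lowers the score, while running-time accounting (t) and performance metrics (x) raise it, so a higher S indicates a stricter SLA.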
31. AWS EC2 SLA @SLALOM (1/9)
Amazon EC2
Sample definition
– Expression: sc: UNDEFINED (assumed 'ping' → ICMP)
– Notes: The sampling condition is not defined in the Amazon EC2 SLA. The exact wording is "when all of your running instances have no external connectivity"; however, the way to specify / measure "external connectivity" is not defined. For example, a customer could use a ping operation or a custom monitoring mechanism.
– Expression: Type of operation: ping
– Notes: It is not defined how the condition of connectivity can actually be measured (e.g. the ping operation mentioned previously).
Boundary period and error definitions
– Expression: bp > 60 sec
– Notes: The exact wording is "the percentage of minutes"; thus the period is 60 seconds.
– Expression: ec = 100%
– Notes: The error condition reflects that, for the entire boundary period, the resource must be continuously "unavailable".
Abstract metric definition
– Expression: availability < 99.95 %
– Notes: Availability metric definition given the boundary period and error condition.
32. AWS EC2 SLA @SLALOM (2/9)
[Diagram: the EC2 availability SLO decomposed into model blocks: sample definition and retrieval (SAMPLE_001), unreachable sample specification, boundary period specification, unavailability interval definition and calculation (QDT_001), unavailability definition and calculation (UAP_001), billing period specification (BP_001), availability definition and calculation (CFA_002), availability threshold specification, and the condition of SLA violation, with parameters PARAM_001 to PARAM_003.]
33. AWS EC2 SLA @SLALOM (3/9)
• Examples of preconditions:
– Deployment: number of Availability Zones used
– Deployment: replication options used
– Usage/Measurement: restarting of resources when unavailable
– Usage/Measurement: applied throttling of requests
• Practical suggestions:
– Define the Rules class strictly, so that it concerns the necessary preconditions that must apply
– Use the Note field as a placeholder for the actual SLA text that a given block refers to
34. AWS EC2 SLA @SLALOM (4/9)
[Diagram highlight: SAMPLE_001, sample definition and retrieval.]
Sample definition
– Expression: sc: UNDEFINED (assumed 'ping' → ICMP)
– Notes: The sampling condition is not defined in the Amazon EC2 SLA. The exact wording is "when all of your running instances have no external connectivity"; however, the way to specify / measure "external connectivity" is not defined. For example, a customer could use a ping operation or a custom monitoring mechanism.
– Expression: Type of operation: ping
– Notes: It is not defined how the condition of connectivity can actually be measured (e.g. the ping operation mentioned previously).
35. AWS EC2 SLA @SLALOM (5/9)
[Diagram highlight: PARAM_001 and PARAM_002, building on SAMPLE_001.]
Boundary period and error definitions
– Expression: bp > 60 sec
– Notes: The exact wording is "the percentage of minutes"; thus the period is 60 seconds.
– Expression: ec = 100%
– Notes: The error condition reflects that, for the entire boundary period, the resource must be continuously "unavailable".
36. AWS EC2 SLA @SLALOM (6/9)
[Diagram highlight: QDT_001, building on SAMPLE_001, PARAM_001, PARAM_002 and PARAM_003.]
• Calculation of the Cloud Service Unavailability Interval
• Based on:
– The current sample
– The defined boundary period
– The definition of an unreachable sample
37. AWS EC2 SLA @SLALOM (7/9)
[Diagram highlight: UAP_001, building on QDT_001.]
• Calculation of Cloud Service Unavailability
• Based on:
– The Cloud Service Unavailability Interval
38. AWS EC2 SLA @SLALOM (8/9)
[Diagram highlight: CFA_002, building on BP_001 and UAP_001.]
• Calculation of Cloud Service Availability
• Based on:
– The billing period
– The Cloud Service Unavailability
41. GAE Datastore SLA @SLALOM(1/11)
Google AppEngine Datastore
Sample definition
– Expression: sc: INTERNAL_ERROR
– Notes: Several sampling conditions are defined per type of operation. For example, "INTERNAL_ERROR, TIMEOUT, …" (exact wording) is specified for API calls.
– Expression: Type of operation: API calls
– Notes: Several types of operations are defined; an example is provided here.
Boundary period and error definitions
– Expression: bp > 300 sec
– Notes: The exact wording is "five consecutive minutes".
– Expression: ec > 10%
– Notes: The error condition reflects that the error ratio is "ten percent Error Rate" (exact wording).
Abstract metric definition
– Expression: availability < 99.95 %
– Notes: Availability metric definition given the boundary period and error condition.
42. GAE Datastore SLA @SLALOM(2/11)
[Diagram: the GAE Datastore availability SLO decomposed into model blocks: sample definition and retrieval (SAMPLE_001), unreachable sample values specification (PARAM_003), error rate threshold specification, boundary period specification, error rate definition and calculation (ER_001), sampling period duration definition and calculation (DUR_001), unavailability interval definition and calculation (QDT_001), unavailability definition and calculation (UAP_001), billing period specification (BP_001), availability definition and calculation (CFA_002), availability threshold specification, and the condition of SLA violation (with parameters PARAM_001, PARAM_002, PARAM_004 and ASV_001).]
43. GAE Datastore SLA @SLALOM(3/11)
• Examples of preconditions:
– Deployment: number of Availability Zones used
– Deployment: replication options used
– Usage/Measurement: restarting of resources when unavailable
– Usage/Measurement: applied throttling of requests
• Practical suggestions:
– Define the Rules class strictly, so that it concerns the necessary preconditions that must apply
– Use the Note field as a placeholder for the actual SLA text that a given block refers to
44. GAE Datastore SLA @SLALOM(4/11)
[Diagram highlight: SAMPLE_001, sample definition and retrieval.]
Sample definition
– Expression: sc: INTERNAL_ERROR
– Notes: Several sampling conditions are defined per type of operation. For example, "INTERNAL_ERROR, TIMEOUT, …" (exact wording) is specified for API calls.
– Expression: Type of operation: API calls
– Notes: Several types of operations are defined; an example is provided here.
45. GAE Datastore SLA @SLALOM(5/11)
[Diagram highlight: PARAM_003, the unreachable sample values specification, applied to the same sample definition as the previous slide.]
46. GAE Datastore SLA @SLALOM(6/11)
[Diagram highlight: PARAM_001 and PARAM_002.]
Boundary period and error definitions
– Expression: bp > 300 sec
– Notes: The exact wording is "five consecutive minutes".
– Expression: ec > 10%
– Notes: The error condition reflects that the error ratio is "ten percent Error Rate" (exact wording).
47. GAE Datastore SLA @SLALOM(7/11)
[Diagram highlight: DUR_001 and ER_001, building on SAMPLE_001 and PARAM_003.]
• Calculation of the duration of a sampling period:
– The period during which a number of samples was received
– The period duration is calculated from the samples' timestamps
• Calculation of the actual Error Rate for a sampling period:
– Number of violation samples / number of total samples
– Violation samples: samples containing values from a specific pool of values
48. GAE Datastore SLA @SLALOM(8/11)
[Diagram highlight: QDT_001, building on DUR_001, ER_001, PARAM_001 and PARAM_002.]
• Calculation of the Unavailability Interval:
– IF [sampling period duration > boundary period]
– AND IF [Error Rate > threshold (10%)]
– THEN [Unavailability Interval = sampling period duration]
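The rule above can be written as a small predicate; the argument names are illustrative, and the defaults are taken from the GAE figures given earlier (bp = 300 s, 10% error threshold).

```python
# The Unavailability Interval rule as a predicate: a sampling period becomes
# an Unavailability Interval when its duration exceeds the Boundary Period
# AND its Error Rate exceeds the threshold. Names/defaults are illustrative.
def unavailability_interval(duration_s, error_rate,
                            boundary_period_s=300, threshold=0.10):
    if duration_s > boundary_period_s and error_rate > threshold:
        return duration_s   # the whole sampling period counts
    return 0

print(unavailability_interval(420, 0.15))  # 420
print(unavailability_interval(420, 0.05))  # 0
```

Both conditions must hold: a long quiet period or a short error burst alone contributes nothing to unavailability.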
49. GAE Datastore SLA @SLALOM(9/11)
[Diagram highlight: UAP_001, building on QDT_001.]
• Calculation of the Unavailability period
– It equals the SUM of the Unavailability Intervals
50. GAE Datastore SLA @SLALOM(10/11)
[Diagram highlight: CFA_002, building on BP_001 and UAP_001.]
• Calculation of Cloud Service Availability
• Based on:
– The billing period
– The Cloud Service Unavailability