SLALOM organized two live sessions to present the final versions of our legal terms and technical specifications for #Cloud #SLAs. The sessions provided examples showing how to apply SLALOM in practice to improve current industry practice for #Cloud #SLAs and to support the development of cloud computing metrics.
The first webinar covered the SLALOM Technical track: "Using metrics to improve Cloud SLAs".
3. Problem snapshot
SLA Technological Landscape
• Many ambiguities exist in the SLAs of Cloud providers
• The measurement/auditing process of an SLA cannot be performed non-repudiably
– i.e., the involved parties may be able to challenge the auditing of the SLOs
• Standard models are rare and not widely used
• Differences between Cloud providers cannot be easily assessed
– Absolute percentages cannot be compared across providers
4. Problem snapshot
Ambiguities in SLAs
• The definition of availability (as given by providers) may encapsulate different formulas for its calculation
• The definition and calculation of availability may rely on different ways of identifying a failed (or valid) response, e.g.:
– Response time within a limit
– Returned response within a string enumeration (i.e. a predefined range of string values)
• Preconditions apply
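These divergent failure definitions can be made concrete with a short sketch (illustrative Python, not taken from any provider's SLA): the same monitoring trace yields different availability figures depending on which reading of "failure" is applied.

```python
# Illustrative sketch: two plausible readings of "failure" applied to the
# same samples yield different availability figures. Sample shape is assumed.
samples = [
    {"latency_ms": 120, "body": "OK"},
    {"latency_ms": 950, "body": "OK"},       # slow but well-formed
    {"latency_ms": 700, "body": "OK"},       # slow but well-formed
    {"latency_ms": 80,  "body": "GARBAGE"},  # fast but malformed
    {"latency_ms": 60,  "body": "OK"},
]

# Reading 1: failure = response time exceeding a limit.
def failed_by_latency(s, limit_ms=500):
    return s["latency_ms"] >= limit_ms

# Reading 2: failure = response outside a predefined string enumeration.
def failed_by_enum(s, allowed=("OK", "PARTIAL")):
    return s["body"] not in allowed

def availability(samples, failed):
    ok = sum(1 for s in samples if not failed(s))
    return 100.0 * ok / len(samples)

print(availability(samples, failed_by_latency))  # 60.0
print(availability(samples, failed_by_enum))     # 80.0
```

Same trace, same "availability" metric, two different numbers: exactly the ambiguity the bullets above describe.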
5. Problem snapshot
Real world example of Ambiguity
• Ambiguity in the measurement process of AWS EC2 SLA
• “Unavailable” and “Unavailability” mean:
– When all of your running instances have no external connectivity
• How is external connectivity determined?
– Internet Layer: Pinging (ICMP)?
• Security threat
– Application layer: Endpoint checking?
• Includes application downtime
• Not exclusively the responsibility of AWS EC2
6. Problem snapshot
Examples of preconditions
• For any SLA to apply, a number of preconditions typically exist per
provider
• Examples:
– Deployment: A specified number of Availability Zones must be used
– Deployment: Replication options must be used
– Usage/Measurement: Unavailable resources must first be restarted
– Usage/Measurement: The number of requests must be throttled
7. Problem snapshot
SLALOM Technical objectives
• To define a standard model for SLAs that eliminates ambiguities
• To facilitate the measurement, monitoring and enforcement of SLAs so as to achieve non-repudiability
• To abstract the SLA definition process (SLA → SLO → metric → sub-metric) so as to enable the application of metrics that allow for direct comparability
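The layered decomposition in the last objective can be sketched as a small type hierarchy; the class and field names below are illustrative, not the actual SLALOM or ISO schema.

```python
# Minimal sketch of the SLA -> SLO -> metric -> sub-metric decomposition.
# All names here are illustrative, not the SLALOM/ISO model classes.
from dataclasses import dataclass, field

@dataclass
class SubMetric:            # e.g. a concrete sampling operation
    name: str
    unit: str

@dataclass
class Metric:               # e.g. availability, composed of sub-metrics
    name: str
    sub_metrics: list = field(default_factory=list)

@dataclass
class SLO:                  # a target expressed over a metric
    metric: Metric
    operator: str           # e.g. ">="
    threshold: float

@dataclass
class SLA:                  # the agreement: a set of SLOs
    provider: str
    slos: list = field(default_factory=list)

ping = SubMetric("icmp_ping_success", "boolean")
avail = Metric("availability", [ping])
sla = SLA("ExampleCloud", [SLO(avail, ">=", 99.95)])
print(sla.slos[0].threshold)  # 99.95
```

Because every SLO bottoms out in explicit sub-metrics, two SLAs built this way can be compared metric by metric rather than by reading prose.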
9. SLALOM@ISO
Interaction with ISO
• Mapped SLALOM's initial 3-layer approach to the ISO baseline model
– The ISO approach is powerful at describing more complex metrics (e.g. the MS Azure SLA)
• Demonstrated and suggested extending the ISO model to fully define the way an SLO can be audited: ACCEPTED
– Suggested the inclusion of an Extension class in the ISO model
– Instantiated the ISO Extension class as the base Sample class of SLALOM
– Introduced the SLALOM Sample layer for concretely defining the sampling process
– In the latest revision of the draft ISO model, all classes are extensible
• Applied the approach to different types of objectives in commercial SLAs
– GAE Datastore (PaaS)
– AWS EC2 (IaaS)
– Microsoft Azure (Storage)
• Showed the applicability of the proposed approach for directly creating machine-understandable descriptions of the SLOs
10. SLALOM@ISO
ISO 19086-2 Metric model
• SLALOM's two-fold contribution:
– ISO model class parameters: made machine-understandable
– ISO model extension: definition of the sampling process
[Figure: the SLALOM proposed extension shown against the model from the latest revision of the 19086-2 draft standard (to be made available in the forthcoming weeks); all classes are extensible.]
11. SLALOM@ISO
SLALOM vs. ISO compliance
ISO-compliant SLA:
• Uses the ISO fields (classes, parameters)
• The SLA is not necessarily fully defined
SLALOM-compliant SLA:
• ISO compliant
• Clear and well defined
• Non-repudiable
• SLAs are still not comparable among providers
13. Commercial SLAs @SLALOM
Amazon WS EC2
Amazon EC2
Sample definition
– Expression: sc: UNDEFINED (assumed 'ping' → ICMP)
– Notes: The sampling condition is not defined in the Amazon EC2 SLA. The exact wording is "when all of your running instances have no external connectivity"; however, the way to specify / measure "external connectivity" is not defined. For example, a customer could use a ping operation or a custom monitoring mechanism.
– Expression: Type of operation: ping
– Notes: It is not defined how the condition of connectivity can actually be measured (e.g. the ping operation mentioned previously).
Boundary period and error definitions
– Expression: bp > 60 sec
– Notes: The exact wording is "the percentage of minutes"; thus the period is 60 seconds.
– Expression: ec = 100%
– Notes: The error condition reflects that, for the entire boundary period, the resource must be continuously "unavailable".
Abstract metric definition
– Expression: availability < 99.95 %
– Notes: Availability metric definition given the boundary period and error condition.
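Under the reading summarised in the table, per-minute availability for EC2 could be evaluated as follows. This is an illustrative sketch: the per-minute sample layout and the connectivity check itself are assumptions, since the SLA leaves them undefined.

```python
# Hedged sketch of the table's reading of the EC2 SLA: a minute counts as
# "unavailable" only if every sample in it shows no external connectivity
# (ec = 100% over a 60 s boundary period). Sample layout is illustrative.
def minute_unavailable(samples_in_minute):
    # samples_in_minute: list of booleans, True = connectivity observed
    return not any(samples_in_minute)

def monthly_availability(minutes):
    # minutes: list of per-minute sample lists for the billing period
    down = sum(1 for m in minutes if minute_unavailable(m))
    return 100.0 * (len(minutes) - down) / len(minutes)

# Toy "month" of 4 minutes: one fully dark minute, one partially failing one.
minutes = [
    [True, True, True],
    [False, False, False],   # counts as unavailable
    [False, True, False],    # partial failure: still "available" under ec=100%
    [True, True, True],
]
print(monthly_availability(minutes))  # 75.0
```

Note how the ec = 100% condition makes partially failing minutes count as available: a direct consequence of the wording, and one reason absolute percentages are not comparable across providers.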
14. Commercial SLAs @SLALOM
Google AE Datastore
Google AppEngine Datastore
Sample definition
– Expression: sc: INTERNAL_ERROR
– Notes: Several sampling conditions are defined per type of operation. For example, "INTERNAL_ERROR, TIMEOUT, …" (exact wording) is specified for API calls.
– Expression: Type of operation: API calls
– Notes: Several types of operations are defined; an example is provided here.
Boundary period and error definitions
– Expression: bp > 300 sec
– Notes: The exact wording is "five consecutive minutes".
– Expression: ec > 10%
– Notes: The error condition reflects that the error ratio is "ten percent Error Rate" (exact wording).
Abstract metric definition
– Expression: availability < 99.95 %
– Notes: Availability metric definition given the boundary period and error condition.
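The corresponding evaluation for the Datastore SLA could be sketched as follows; the per-minute grouping of samples into a five-minute window is an assumption made for illustration.

```python
# Hedged sketch of the table's reading of the GAE Datastore SLA: a period of
# five consecutive minutes (bp = 300 s) with an Error Rate above 10% counts
# toward unavailability. Windowing and data layout are illustrative.
ERROR_CODES = {"INTERNAL_ERROR", "TIMEOUT"}   # sampling conditions (sc)

def error_rate(responses):
    errors = sum(1 for r in responses if r in ERROR_CODES)
    return errors / len(responses)

def window_unavailable(per_minute_responses, ec=0.10):
    # per_minute_responses: five consecutive minutes of API-call results
    return all(error_rate(m) > ec for m in per_minute_responses)

five_minutes = [["OK"] * 8 + ["INTERNAL_ERROR", "TIMEOUT"]] * 5  # 20% errors
print(window_unavailable(five_minutes))  # True
```

Contrast this with the EC2 reading above: here a 20% error rate already counts as downtime, whereas under ec = 100% it would not.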
15. Commercial SLAs @SLALOM
Microsoft Azure
Microsoft Azure Storage
Sample definition
– Expression: sc = 60 sec
– Notes: Several sampling conditions are defined per type of operation. For example, "Sixty (60) seconds" (exact wording) is specified for PutBlockList and GetBlockList.
– Expression: Type of operation: PutBlockList and GetBlockList
– Notes: Several types of operations are defined; an example is provided here.
Boundary period and error definitions
– Expression: bp > 3600 sec
– Notes: The exact wording is "given one-hour interval".
– Expression: ec > 0%
– Notes: The error condition reflects that all periods should be taken into account for the availability metric evaluation; the exact wording is "is the sum of Error Rates for each hour".
Abstract metric definition
– Expression: availability < 99.9 %
– Notes: Availability metric definition given the boundary period and error condition.
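A sketch of the hourly aggregation described in the notes; the averaging of hourly Error Rates over the billing period is an assumption about the exact aggregation, made for illustration.

```python
# Hedged sketch of the table's reading of the Azure Storage SLA: an Error
# Rate is computed per one-hour interval (bp = 3600 s) and every hour
# contributes (ec > 0%). Averaging the hourly rates is an assumption.
def hourly_error_rate(failed, total):
    return failed / total if total else 0.0

def monthly_uptime(hours):
    # hours: list of (failed_requests, total_requests) per one-hour interval
    rates = [hourly_error_rate(f, t) for f, t in hours]
    return 100.0 * (1.0 - sum(rates) / len(rates))

hours = [(0, 1000), (5, 1000), (0, 1000), (15, 1000)]
print(monthly_uptime(hours))  # ~99.5
```

Unlike the EC2 and GAE readings, every failed request here moves the monthly figure, since there is no minimum error ratio before an hour starts to count.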
17. SLA comparability
Overview
• Even though SLA descriptions may be aligned through the SLALOM / ISO model, this does not mean that SLAs (or their parameters) become directly comparable
• More abstract metrics are needed that result in direct comparisons:
– SLA success ratio (published* by the Cloud WG of SPEC**)
– SLA strictness (published* by the Cloud WG of SPEC**)
– Standardised datasets
• The SLALOM model enables the application of comparable metrics:
– All SLA parameters are clearly and well defined
– The SLAs are machine readable
– This greatly simplifies the process and its automation
* Ready for Rain? A View from SPEC Research on the Future of Cloud Metrics
** SPEC: Standard Performance Evaluation Corporation
18. SLA comparability
Comparative metrics
• SLA success ratio
– Based on experience with using a service or provider
– Over time, keep track of successful or violated SLAs and of total SLAs
– Calculate the ratio: successful SLAs / total SLAs
• SLA strictness
– Extract static SLA parameters of importance for a given domain or application
– Assign weights to the parameters and normalise
– Map these parameters through a (possibly arbitrary) function
– Results in a comparative ranking of different SLAs
• Standardised datasets
– Define a set of failure scenarios
– Benchmark each provider's SLA definition against the predefined scenarios
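The success ratio described above amounts to a running counter over observed SLA outcomes; a minimal sketch (the class shape is illustrative, not part of the SLALOM model):

```python
# Minimal sketch of the SLA success ratio: track outcomes over time and
# report successful / total. Class and method names are illustrative.
class SLATracker:
    def __init__(self):
        self.successful = 0
        self.total = 0

    def record(self, violated: bool):
        # Called once per SLA evaluation period (e.g. per billing month).
        self.total += 1
        if not violated:
            self.successful += 1

    def success_ratio(self):
        return self.successful / self.total if self.total else None

t = SLATracker()
for violated in [False, False, True, False]:
    t.record(violated)
print(t.success_ratio())  # 0.75
```

Because the ratio is defined over observed outcomes rather than over each provider's own availability formula, it is directly comparable across providers.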
20. Lessons Learnt
Do
1) Target metrics that are directly comparable among providers
2) Consider directly machine-understandable descriptions via standardised templates
3) Look into the ISO 19086 series of standards and adopt it where applicable
4) Think outside the narrow Cloud box. With the advent of *aaS and the emergence of IoT, SLAs may refer to services external to the data center or to specific metrics needed by Cloud services, depending on the individual use case
5) Consider composite services that may create chains of SLAs, and their interdependencies. To guarantee response time for service-support services, consider downstream (reseller) and upstream (e.g. provider's subcontractors) actors' requirements and the need to 'float' SLA clauses down the chain
6) Consider resource management a key part of the SLA upkeep and analysis process
7) Consider mechanisms that allow providers, resellers and users, even non-experts, to easily monitor the SLA in a common and understandable way.
21. Lessons Learnt
Don’t
1) Consider offered terms equivalent, even if they originally seem to refer to the same SLO. Always check the fine print for differences in how metrics are actually calculated
2) Assume that SLAs are monitored by providers.
3) Leave end users out of the loop. Comprehensibility and clarity of an SLA (or its relevant metric) for non-experts should be a key target. Translate your metrics into plain English if necessary.
4) Limit yourself to popular metrics (e.g. availability) in SLAs. Users are also interested in more generic Quality of Experience (QoE) indexes, such as stability
5) Expect the market to bend for you: fit in with current practice to the maximum extent possible and, where that is not possible, hone your value proposition
23. SLALOM contribution
Tender Evaluation
• Usable by various actors
– Adopters to specify their needs
– Providers to describe their value proposition
– Third parties (resellers/brokers) to combine and offer services and
suggest options
• Added value
– Application of comparative metrics
– Automation of the process
• Benefits
– Improve transparency
– Enhance efficiency
– Establish fairness
24. SLALOM contribution
Contract monitoring
• Benefits
– Achieve SLA non-repudiation
– Establish trust and transparency for service execution compliant with the terms, and proper violation management
– Enable automation of contract and performance management and
monitoring
– Aid the involvement of actors like trusted third parties offering
relevant services
25. SLALOM contribution
Your feedback needed
• The SLALOM proposed specification / reference model already takes into account:
– Standardisation approaches and working group outcomes
– Current SLAs and metrics offered by commercial Cloud providers
– Views expressed by Cloud providers and adopters
– Research outcomes
• Further feedback regarding the applicability and practical usage of our model is more than welcome
• Please take the survey on IoT/Cloud metrics here:
https://docs.google.com/forms/d/1JmwDXyO_1hT9iR-lm1c3LCQu_zF64nf-uFnxBeGMv3g/viewform
27. SLALOM Project
SLALOM is a CSA financed by the European Commission under Grant Agreement 644270
For more information on the initiative, contact us:
@CloudSLAlom
www.SLALOM-Project.eu
SLALOM Project Coordinator (daniel.field@atos.net)
29. Backup slide: SLA strictness example
Provider/Service | t | q (s1·q) | q' (s2·q) | p (s3·p) | x | S | S'
Google Compute | 0 | 5 (1.00) | 5 (0.10) | 99.95 (0.50) | 0 | 0.50 | 1.60
Amazon EC2 | 0 | 1 (0.20) | 1 (0.02) | 99.95 (0.50) | 0 | 1.30 | 1.48
MS Azure Compute | 1 | 1 (0.20) | 1 (0.02) | 99.95 (0.50) | 0 | 2.30 | 2.48
• Extract static SLA parameters of importance for a given domain/application
– All these parameters (e.g. boundary period, error rates) are described in the SLALOM model
• Map these parameters through an arbitrary function, e.g.:
S = t + (1 - s1·q) + s3·p + x (and S' analogously, with s2·q' in place of s1·q), where:
– q: size of the boundary period
– p: percentage of availability
– t: running time vs. overall monthly time (boolean), t ∈ {0,1}
– x: existence of performance metrics (boolean), x ∈ {0,1}
– si: normalisation factors for the continuous variables, so that:
(s1·q) ∈ [0,1], (s2·q) ∈ [0,0.1] and (s3·p) ∈ [0,0.5]
• The resulting value may be compared between providers
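The tabulated S values can be reproduced directly from the normalised products, assuming the strictness function S = t + (1 - s1·q) + s3·p + x (with S' computed analogously from s2·q'), a reconstruction consistent with the S column above.

```python
# Sketch reproducing the strictness example, assuming
# S = t + (1 - s1*q) + s3*p + x; the normalised products (s1*q, s3*p)
# are taken directly from the table above.
def strictness(t, sq, sp, x):
    return t + (1.0 - sq) + sp + x

# (t, s1*q, s3*p, x) per provider, normalised values from the table
rows = {
    "Google Compute":   (0, 1.00, 0.50, 0),
    "Amazon EC2":       (0, 0.20, 0.50, 0),
    "MS Azure Compute": (1, 0.20, 0.50, 0),
}
for name, args in rows.items():
    # S: Google 0.5, Amazon 1.3, Azure 2.3 (matching the table)
    print(name, strictness(*args))
```

A longer boundary period (larger s1·q) lowers the score, while running-time accounting (t) and performance metrics (x) raise it, so a higher S indicates a stricter SLA.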
31. AWS EC2 SLA @SLALOM (1/9)
Amazon EC2
Sample definition
– Expression: sc: UNDEFINED (assumed 'ping' → ICMP)
– Notes: The sampling condition is not defined in the Amazon EC2 SLA. The exact wording is "when all of your running instances have no external connectivity"; however, the way to specify / measure "external connectivity" is not defined. For example, a customer could use a ping operation or a custom monitoring mechanism.
– Expression: Type of operation: ping
– Notes: It is not defined how the condition of connectivity can actually be measured (e.g. the ping operation mentioned previously).
Boundary period and error definitions
– Expression: bp > 60 sec
– Notes: The exact wording is "the percentage of minutes"; thus the period is 60 seconds.
– Expression: ec = 100%
– Notes: The error condition reflects that, for the entire boundary period, the resource must be continuously "unavailable".
Abstract metric definition
– Expression: availability < 99.95 %
– Notes: Availability metric definition given the boundary period and error condition.
32. AWS EC2 SLA @SLALOM (2/9)
[Diagram: the EC2 availability SLO decomposed into model blocks: sample definition and retrieval (SAMPLE_001), unreachable sample specification, boundary period specification, unavailability interval definition and calculation (QDT_001), unavailability definition and calculation (UAP_001), billing period specification (BP_001), availability definition and calculation (CFA_002), availability threshold specification, and the condition of SLA violation, with parameters PARAM_001 to PARAM_003.]
33. AWS EC2 SLA @SLALOM (3/9)
• Examples of preconditions:
– Deployment: number of Availability Zones used
– Deployment: replication options used
– Usage/Measurement: restarting of resources when unavailable
– Usage/Measurement: applied throttling of requests
• Practical suggestions:
– Define the Rules class strictly, so that it concerns the necessary preconditions that must apply
– Use the Note field as a placeholder for the actual SLA text that a given block refers to
34. AWS EC2 SLA @SLALOM (4/9)
[Diagram highlight: SAMPLE_001, sample definition and retrieval.]
Sample definition
– Expression: sc: UNDEFINED (assumed 'ping' → ICMP)
– Notes: The sampling condition is not defined in the Amazon EC2 SLA. The exact wording is "when all of your running instances have no external connectivity"; however, the way to specify / measure "external connectivity" is not defined. For example, a customer could use a ping operation or a custom monitoring mechanism.
– Expression: Type of operation: ping
– Notes: It is not defined how the condition of connectivity can actually be measured (e.g. the ping operation mentioned previously).
35. AWS EC2 SLA @SLALOM (5/9)
[Diagram highlight: PARAM_001 and PARAM_002, building on SAMPLE_001.]
Boundary period and error definitions
– Expression: bp > 60 sec
– Notes: The exact wording is "the percentage of minutes"; thus the period is 60 seconds.
– Expression: ec = 100%
– Notes: The error condition reflects that, for the entire boundary period, the resource must be continuously "unavailable".
36. AWS EC2 SLA @SLALOM (6/9)
[Diagram highlight: QDT_001, building on SAMPLE_001, PARAM_001, PARAM_002 and PARAM_003.]
• Calculation of the Cloud Service Unavailability Interval
• Based on:
– The current sample
– The defined boundary period
– The definition of an unreachable sample
37. AWS EC2 SLA @SLALOM (7/9)
[Diagram highlight: UAP_001, building on QDT_001.]
• Calculation of Cloud Service Unavailability
• Based on:
– The Cloud Service Unavailability Interval
38. AWS EC2 SLA @SLALOM (8/9)
[Diagram highlight: CFA_002, building on BP_001 and UAP_001.]
• Calculation of Cloud Service Availability
• Based on:
– The billing period
– The Cloud Service Unavailability
41. GAE Datastore SLA @SLALOM(1/11)
Google AppEngine Datastore
Sample definition
– Expression: sc: INTERNAL_ERROR
– Notes: Several sampling conditions are defined per type of operation. For example, "INTERNAL_ERROR, TIMEOUT, …" (exact wording) is specified for API calls.
– Expression: Type of operation: API calls
– Notes: Several types of operations are defined; an example is provided here.
Boundary period and error definitions
– Expression: bp > 300 sec
– Notes: The exact wording is "five consecutive minutes".
– Expression: ec > 10%
– Notes: The error condition reflects that the error ratio is "ten percent Error Rate" (exact wording).
Abstract metric definition
– Expression: availability < 99.95 %
– Notes: Availability metric definition given the boundary period and error condition.
42. GAE Datastore SLA @SLALOM(2/11)
[Diagram: the GAE Datastore availability SLO decomposed into model blocks: sample definition and retrieval (SAMPLE_001), unreachable sample values specification (PARAM_003), error rate threshold specification, boundary period specification, error rate definition and calculation (ER_001), sampling period duration definition and calculation (DUR_001), unavailability interval definition and calculation (QDT_001), unavailability definition and calculation (UAP_001), billing period specification (BP_001), availability definition and calculation (CFA_002), availability threshold specification, and the condition of SLA violation (with parameters PARAM_001, PARAM_002, PARAM_004 and ASV_001).]
43. GAE Datastore SLA @SLALOM(3/11)
• Examples of preconditions:
– Deployment: number of Availability Zones used
– Deployment: replication options used
– Usage/Measurement: restarting of resources when unavailable
– Usage/Measurement: applied throttling of requests
• Practical suggestions:
– Define the Rules class strictly, so that it concerns the necessary preconditions that must apply
– Use the Note field as a placeholder for the actual SLA text that a given block refers to
44. GAE Datastore SLA @SLALOM(4/11)
[Diagram highlight: SAMPLE_001, sample definition and retrieval.]
Sample definition
– Expression: sc: INTERNAL_ERROR
– Notes: Several sampling conditions are defined per type of operation. For example, "INTERNAL_ERROR, TIMEOUT, …" (exact wording) is specified for API calls.
– Expression: Type of operation: API calls
– Notes: Several types of operations are defined; an example is provided here.
45. GAE Datastore SLA @SLALOM(5/11)
[Diagram highlight: PARAM_003, the unreachable sample values specification, applied to the same sample definition as the previous slide.]
46. GAE Datastore SLA @SLALOM(6/11)
[Diagram highlight: PARAM_001 and PARAM_002.]
Boundary period and error definitions
– Expression: bp > 300 sec
– Notes: The exact wording is "five consecutive minutes".
– Expression: ec > 10%
– Notes: The error condition reflects that the error ratio is "ten percent Error Rate" (exact wording).
47. GAE Datastore SLA @SLALOM(7/11)
[Diagram highlight: DUR_001 and ER_001, building on SAMPLE_001 and PARAM_003.]
• Calculation of the duration of a sampling period:
– The period during which a number of samples was received
– The period duration is calculated from the samples' timestamps
• Calculation of the actual Error Rate for a sampling period:
– Number of violation samples / number of total samples
– Violation samples: samples containing values from a specific pool of values
48. GAE Datastore SLA @SLALOM(8/11)
[Diagram highlight: QDT_001, building on DUR_001, ER_001, PARAM_001 and PARAM_002.]
• Calculation of the Unavailability Interval:
– IF [sampling period duration > boundary period]
– AND IF [Error Rate > threshold (10%)]
– THEN [Unavailability Interval = sampling period duration]
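The rule above can be written as a small predicate; the argument names are illustrative, and the defaults are taken from the GAE figures given earlier (bp = 300 s, 10% error threshold).

```python
# The Unavailability Interval rule as a predicate: a sampling period becomes
# an Unavailability Interval when its duration exceeds the Boundary Period
# AND its Error Rate exceeds the threshold. Names/defaults are illustrative.
def unavailability_interval(duration_s, error_rate,
                            boundary_period_s=300, threshold=0.10):
    if duration_s > boundary_period_s and error_rate > threshold:
        return duration_s   # the whole sampling period counts
    return 0

print(unavailability_interval(420, 0.15))  # 420
print(unavailability_interval(420, 0.05))  # 0
```

Both conditions must hold: a long quiet period or a short error burst alone contributes nothing to unavailability.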
49. GAE Datastore SLA @SLALOM(9/11)
[Diagram highlight: UAP_001, building on QDT_001.]
• Calculation of the Unavailability period
– It equals the SUM of the Unavailability Intervals
50. GAE Datastore SLA @SLALOM(10/11)
[Diagram highlight: CFA_002, building on BP_001 and UAP_001.]
• Calculation of Cloud Service Availability
• Based on:
– The billing period
– The Cloud Service Unavailability