The science of network performance
2. ABOUT US
• Martin Geddes Consulting Ltd is a boutique provider of innovation services, based
in London, England.
• We work in collaboration with several industry partners, notably Predictable
Network Solutions Ltd, also based in England.
• Together, we are the world’s first (and only) team of network performance
scientists.
• We offer network measurement and optimisation services to fixed and mobile
operators. We have also provided scientific advice to regulators.
• To get in touch, email mail@martingeddes.com.
3. MARTIN GEDDES
• Martin Geddes is a scholar of emerging telecoms
business models and network technologies.
• He is co-founder and Executive Director of the
Hypervoice Consortium, a new industry association.
• Geddes was formerly Strategy Director at BT's network
division, and Chief Analyst at Telco 2.0.
• He previously worked at Sprint, where he was a named
inventor on 8 patents, and at Oracle as a specialist in
high-scalability database systems.
• He holds an MA in Mathematics & Computation from the
University of Oxford.
6. DISTRIBUTED COMPUTING APPLICATIONS
22 February 2015
© Martin Geddes Consulting Ltd
We are not in the business of networking, but instead are in the business of
distributed computing. Users want to run applications that perform distributed
computations.
Furthermore, networks themselves internally perform distributed
computations. Indeed, modern networks can be thought of as being large,
distributed supercomputers, where the processor interconnects have been
stretched out.
8. PERFORMANCE
What we uniquely manufacture as an industry is performance for these
distributed computing applications. You can get connectivity from the postal
service, but not performance.
Applications have a diversity of performance demands, and we can supply good
performance or bad performance.
10. PERFORMANCE HAZARDS
Since we manufacture performance, we are interested in performance hazards.
These represent the chance of the customer having a bad experience.
A hazard can be latent (can’t currently happen), armed (could happen in the
current circumstances), or matured (has happened). The language of hazards is
taken from the study of safety-critical systems. We want to know how likely it is
that something will go wrong without waiting for a disaster to tell us!
As network performance scientists, our core task is to understand performance
hazards.
11. 3 KEY ACTIVITIES
[Diagram: Measure, Model, Manage; Performance hazards]
12. 3 KEY ACTIVITIES
The three main activities in network performance science are to measure,
model and manage performance hazards. Let’s examine each in turn.
14. MEASURE PERFORMANCE
We measure performance hazards by obtaining network metrics that are also a
strong proxy for the customer experience. After all, the perception of failure is
entirely down to the customer. What we want to know is: how at risk is the
customer of having a bad experience?
16. MODEL PERFORMANCE
We model performance for many reasons. A common example is capacity
planning. When and where should we add more capacity? The answer depends
on whether a lack of capacity is putting the customer experience at risk.
Another example is software-defined networking. The orchestration system
contains a model of the network, and predicts what the effect of resource
allocation decisions will be on network performance & user experience.
As networks contain more internal computing services (e.g. NFV, CDNs) we
use models to predict the right place to locate each function.
17. MANAGE PERFORMANCE
This aspect is a bit more complex. Networks can be thought of as systems that
allocate resources, at all timescales from microseconds to months.
Each resource allocation decision is a “trade”: when you give the supply of
transmission or computation to one source of demand, you don’t give it to
every other one. This transfers the “disappointment” (of worse performance)
around. You can’t get rid of it, only move it to where it does least harm.
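A minimal sketch of such a trade, assuming three equal-sized packets queued at the same instant on one link. The packet names and the unit service time are illustrative only.

```python
# Sketch of a "trade" on one link: all packets are queued at t=0 and are
# equal-sized (assumed), so each takes one service_time to transmit.

def waits(order, service_time=1.0):
    """Waiting time of each packet when served in the given order."""
    return {name: position * service_time for position, name in enumerate(order)}

voice_first = waits(["voice", "backup-1", "backup-2"])
backup_first = waits(["backup-1", "backup-2", "voice"])

# The voice packet's wait depends on the ordering chosen...
print(voice_first["voice"], backup_first["voice"])
# ...but the total wait across all packets does not.
print(sum(voice_first.values()), sum(backup_first.values()))
```

Under these assumptions the total wait is identical in both orderings; only who bears it changes.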
19. MANAGE PERFORMANCE
We’re going to use this chart to create a picture of the “trades” being done.
Along the horizontal axis we have the timescales at which resources are being
reserved. This represents the time-shifting that other demand experiences
when it can’t get hold of the resource immediately.
20. MANAGE PERFORMANCE
[Chart: axis TIMESCALE OF TRADE (SECONDS), ticks at 10^-6, 10^-3, 10^0, 10^3, 10^6. Labels: Core, Gap!, SDN, Access, Provisioning, Gap!]
21. MANAGE PERFORMANCE
At the shortest timescales we have resources being allocated in the network
core. It might take only a microsecond (or less) for a packet to be squirted over
a link. Access networks are typically much slower, with most working at
timescales of milliseconds and above.
22. MANAGE PERFORMANCE
[Chart: axis TIMESCALE OF TRADE (SECONDS), ticks at 10^-6, 10^-3, 10^0, 10^3, 10^6. Labels: Core, Gap!, SDN, Capacity planning, Provisioning, Gap!]
23. MANAGE PERFORMANCE
At longer timescales we have resources being allocated over long periods.
Software Defined Networking (SDN) manages resources over the 10 to 10,000
second range (approximately). We might provision a light path for days or
months. Capacity planning happens at months and longer.
Not shown are trades at very long timescales, like cell tower placement, or
spectrum acquisition. When you place a cell tower in one place, you don’t place
it elsewhere; when you spend a dollar on one slice of spectrum in one place, you
don’t spend it elsewhere. Everything is a trade.
24. MANAGE PERFORMANCE
[Chart: axis TIMESCALE OF TRADE, ticks at 10^-6, 10^-3, 10^0, 10^3, 10^6. Labels: Core, Gap!, SDN, Capacity planning, Access, Provisioning, Gap!]
25. MANAGE PERFORMANCE
What’s interesting is that there are ‘gaps’ in the mechanisms that we have to
perform trades. As we shall soon see, these are opportunities to make money or
deliver a better experience that we are failing to engage with.
27. MANAGE PERFORMANCE
At every timescale we can make good or bad trades. Swapping all the cell
towers in London for those in a small part of rural Wales would be a bad trade.
Swapping packet delay from an overnight backup onto a voice call is usually a
bad trade. But if that delay would otherwise cause the backup to stall and fail,
and the voice call can withstand some more delay, then it’s a good trade.
Although we may not know the impact of every resource decision on the value
delivered to each customer, it certainly has one, however small or large. What
we aim to achieve is a good enough level of trading. The trading system itself has
costs (of acquiring information about demand, and creating trading
mechanisms of supply), so absolute perfection is both unattainable and
undesirable.
28. MANAGE PERFORMANCE
[Chart: axes TIMESCALE OF TRADE (SECONDS), ticks at 10^-6, 10^-3, 10^0, 10^3, 10^6, and VALUE OF TRADE (CREATE / DESTROY)]
29. MANAGE PERFORMANCE
Now we’re in a position to ‘zoom out’ on our business, and look at it from far
away. How well are we doing at performing trades?
31. MANAGE (TDM CORE)
In the TDM circuit world, all trades at short (sub-second) timescales were
externalised from the network. Customers were allocated their path and time
slots, and they had total control over the order in which data was injected and
thus the timing of its exit.
The whole point of broadband is to allow these trades at short timescales to
happen within the network. We allow contention to occur, in order to share the
resources more efficiently. The speed of light means we can’t coordinate
everything from the edge at these timescales. Networks aren’t pipes, they are
‘trading spaces’.
33. MANAGE (TDM CORE)
In the traditional circuit business, we did a good job of performing the trades at
longer timescales. We might occasionally have mis-provisioned capacity where it
wasn’t needed, but overall it was a business that was efficient and effective
(within its inherent constraints).
Then to gain more efficiency we moved to packet-based statistical
multiplexing. What happened?
35. PROBLEMS, PROBLEMS…
The downside of packets is an impact on performance from statistical sharing.
We trade more efficiency for lower effectiveness. We arm new performance
hazards. That means we have to reconstruct the whole supply chain to manage
these hazards. How well are we doing?
We’ll look at examples of:
- Measure (IPX spec; KPSN data)
- Model (small cells, IPX de-jitter)
- Manage (IP core networks)
37. THE WRONG SPECIFICATION
We fall down at the first hurdle. This is part of the specification for IPX, the
standard for quality-assured voice (and other services). (The same issue applies
to standards from the ITU and ETSI, so we’re not picking on the GSM
Association.)
An average monthly packet loss of 0.1% means I could lose every packet for
about the first 40 minutes of the month, as long as I deliver all the rest. That
doesn’t specify a working voice service!
The problem is that there is no quality in averages. We have failed to engage
with the stochastic nature of packet multiplexing from the beginning. What
matters is the probability distribution of loss, not its average.
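The arithmetic behind the "about 40 minutes" figure can be checked with a short sketch. It assumes a 30-day month and a constant packet rate, neither of which is stated in the specification.

```python
# Worked arithmetic: how long a total blackout fits inside an average
# monthly packet loss of 0.1%.
# Assumptions (illustrative): a 30-day month and a constant packet rate.

MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200 minutes
AVERAGE_LOSS = 0.001               # 0.1% average monthly packet loss

def outage_minutes(average_loss, minutes=MINUTES_PER_MONTH):
    """Longest period of 100% loss that still meets the average-loss target."""
    return average_loss * minutes

print(f"{outage_minutes(AVERAGE_LOSS):.1f} minutes of total loss still averages 0.1%")
```

Under these assumptions the budget comes to roughly 43 minutes, consistent with the figure quoted above.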
38. [Chart annotation: Well-run UK public sector network that wants to run voice over IP. Will it work?]
39. THE WRONG METRICS
This chart is a very interesting one. It compares the data typically used to run a
network (in green) with the customer experience (in red). It was done by
performing advanced measurements over a single link in a collector arc
(between an access and core network). It carries mixed traffic types.
The green line is a 5 minute load average over 5 days. The red crosses are a
synthetic metric which shows how at risk a VoIP call is of failing. (Any cross
means there is a risk of failure, and the higher it is, the more the risk.)
There is a correlation between the network being busy, and the QoE risk, but…
40. [Chart annotation: Well-run UK public sector network that wants to run voice over IP. Will it work?]
41. THE WRONG METRICS
In the evenings, the network is running hot, but there is no QoE risk. Yet
traditional capacity planning rules would be suggesting an upgrade, wasting
money.
There is no risk because of the arrival patterns being generated.
Remember, there is no quality in averages, and the number of packets on this
100 Mbit/s link in 5 minutes could number in the millions. Averaging that data
hides essential detail of its structure.
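A toy illustration of why the average hides the structure: two arrival patterns with identical average load but very different queueing. The slot length, link capacity and traffic patterns are invented for the sketch and are not KPSN data.

```python
# Two arrival patterns with the same average load on the same link.
# Assumptions (illustrative): time is divided into slots, and the link
# can forward CAPACITY packets per slot.

def max_backlog(arrivals, capacity):
    """Largest queue (in packets) left waiting at the end of any slot."""
    backlog = 0
    worst = 0
    for arriving in arrivals:
        backlog = max(0, backlog + arriving - capacity)
        worst = max(worst, backlog)
    return worst

SLOTS = 1000
CAPACITY = 10

smooth = [5] * SLOTS                                        # steady 50% load
bursty = [50 if i % 10 == 0 else 0 for i in range(SLOTS)]   # same load, in bursts

print(sum(smooth) == sum(bursty))        # identical average load
print(max_backlog(smooth, CAPACITY))     # queue never builds
print(max_backlog(bursty, CAPACITY))     # queue builds after every burst
```

A load average over the whole run cannot tell these two patterns apart, yet only one of them makes packets wait.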
42. [Chart annotation: Well-run UK public sector network that wants to run voice over IP. Will it work?]
43. THE WRONG METRICS
In the middle of the night, the network is empty, but there is a risk of a VoIP call
failing. The particular burst pattern is arming the hazard. What this tells us is
that over-provisioning just doesn’t work, even at a factor of 10000x.
The common metrics we are using to navigate our networking businesses
simply don’t tell us what we need to know.
45. FAILURE TO COMPOSE
We have more problems: our contracts for our digital supply chains don’t
compose. It’s like the concrete, glass and steel of a building not fitting together,
resulting in a structure that falls apart.
In this example, a mobile operator was building a small cell service, and they
had outsourced the construction to three vendors. These respectively supplied
the radio access, backhaul and core. When the service was turned on, it didn’t
work. Who is to blame?
The vendors were all compliant with their specs. There was a residual
integration risk caused by the technical non-composability of those contracts.
46. FAILURE TO LIMIT DEMAND
This was not the only problem. The contracts for performance failed to
appropriately limit demand. (After all, even a building is only expected to
withstand a certain loading from earthquakes and storms.)
The core network was over-driving the RAN, causing it to fail. Thus the effect
and the cause were in very different places.
These issues of non-composability and failure to limit demand are absolutely
endemic to the packet world. They drive lots of cost, failed projects and
lawsuits.
48. LOCAL OPTIMUM, GLOBAL PESSIMUM
We are still unconsciously tied to circuits. For example, in the IPX world we
de-jitter at every network boundary, because we see jitter as being “bad” (and
un-TDM-like).
The effect is to just add more delay (since you’re just levelling up the best case
towards the worst). It creates an end-to-end delay that results in a non-working
phone call.
Our attempt to locally optimise has created the worst possible global
performance.
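A rough sketch of the effect, assuming three network segments whose delays are a fixed floor plus random jitter. The delay model, the numbers and the idea of holding packets to each segment's worst observed delay are all illustrative assumptions, not measurements of any IPX.

```python
import random

def simulate(segments=3, packets=10_000, seed=1):
    """Compare de-jittering at every boundary with de-jittering once at the edge."""
    rng = random.Random(seed)
    # Assumed per-segment delay in ms: 10 ms floor plus exponential jitter (mean 5 ms).
    delays = [[10 + rng.expovariate(1 / 5) for _ in range(packets)]
              for _ in range(segments)]
    # De-jitter at every boundary: each segment holds all packets until its
    # own worst-case delay, so the end-to-end delay is the sum of the worst cases.
    per_boundary = sum(max(segment) for segment in delays)
    # De-jitter once at the receiving edge: hold packets until the worst
    # end-to-end delay actually experienced by any single packet.
    once_at_edge = max(sum(hops) for hops in zip(*delays))
    return per_boundary, once_at_edge

per_boundary, once_at_edge = simulate()
print(f"de-jitter at every boundary: {per_boundary:.1f} ms")
print(f"de-jitter once at the edge:  {once_at_edge:.1f} ms")
```

The sum of per-segment worst cases can never be smaller than the worst end-to-end case, because no single packet is likely to hit the worst delay on every segment.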
49. MANAGE (IP CORE, NOT OVERDRIVEN)
[Chart: axes TIMESCALE OF TRADE and VALUE OF TRADE (CREATE / DESTROY)]
50. MANAGE (IP CORE, NOT OVERDRIVEN)
This is our ‘heatmap’ of how well a typical IP core business is being run.
The queue mechanisms are designed for trades at the very shortest of
timescales, and they overall make more good than bad ones.
They aren’t able to control the trades at longer timescales. They lack both
appropriate memory, and suitable information about what a ‘good’ trade might
be. As a result, we destroy a lot of value.
51. TCP DOES TRADES, TOO
[Chart: axes TIMESCALE OF TRADE and VALUE OF TRADE (CREATE / DESTROY)]
52. MANAGE (IP CORE, NOT OVERDRIVEN)
Furthermore, at medium timescales we have turned the trades over to our
customers through protocols like TCP. They are all attempting to exploit a quality
arbitrage in our networks that we ourselves created!
Who wouldn’t want the quality of TDM for the price of low-quality bulk data
delivery?
54. FAILURE UNDER LOAD
The result of our failings at measurement, modelling and management is
systems that can fail under load. We’ve not taken control of the
performance hazards.
This issue is going to get worse as we move to an SDN/NFV world. This adds
more statistical resource sharing, more variability of performance, and new
hazards of failure under load.
56. CRISIS OF LEGITIMACY
If we continue down our current path, we will eventually face a crisis of
legitimacy. Our credibility will be lost.
Investors, regulators and the public will no longer have faith that we are able to
predictably engineer reliable systems. They expect this, and take it for granted,
in other domains. Why not packet networking?
59. THE BIG QUESTION
The question we face is how to get beyond being a highly skilled craft. We’re
building the equivalent of medieval cathedrals, each as a one-off project.
There’s no modularity or replicability.
61. THE BIG QUESTION
Instead we want to be able to construct high-performance networks with
managed hazards and replicability over and over – at a previously unheard of
level of resource sharing.
It’s a bit like how a skyscraper allows us to share land. It can only be done if we
have the right materials science and structural engineering.
63. IMMATURE INDUSTRY
This transition requires us to recognise that our industry remains relatively
immature in terms of the science and engineering that underpins it.
65. SCIENCE GAP
Getting to the desired state requires us to address a gap in the fundamental
science.
We had this sorted for voice circuits. Erlang models told us with precision what
to build.
We now need to generalise that for a packet world. How?
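For reference, the best known of these circuit-era models is the Erlang B formula, which gives the probability that a call is blocked for a given offered load and number of circuits. The sketch below uses the standard recursive form; the traffic figures are illustrative, and we assume "Erlang models" here includes Erlang B.

```python
def erlang_b(offered_erlangs, circuits):
    """Blocking probability for the offered load on the given number of circuits."""
    blocking = 1.0  # with zero circuits every call is blocked
    for n in range(1, circuits + 1):
        blocking = offered_erlangs * blocking / (n + offered_erlangs * blocking)
    return blocking

def circuits_needed(offered_erlangs, target_blocking):
    """Smallest number of circuits that meets the blocking target."""
    n = 0
    while erlang_b(offered_erlangs, n) > target_blocking:
        n += 1
    return n

# Illustrative question: how many circuits for 10 erlangs at no more than 1% blocking?
print(circuits_needed(10, 0.01))
```

This is the sense in which the model "told us what to build": a demand figure and a quality target go in, and a quantity of resource comes out.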
68. NEED NEW ‘THINKWARE’
To progress from craft to science we need to think about the problem
differently. By truly engaging with the problem we can reach a state of
“conscious incompetence”. We need to understand the boundaries of our real
(scientific) knowledge, so we can expand them.
From this place of reflective understanding we then can invest in the necessary
human and technical enablers for progress. These give us the tools to manage
the performance hazards and trades along a complete digital supply chain.
69. NEW TOOLS TO MEASURE
Multi-point distributions of packet loss and delay
70. NEW TOOLS TO MEASURE
We are currently measuring average data at (many) single points.
What we need to capture are probability distributions (of loss and delay)
obtained by watching packets go past multiple points.
72. MEASURE
That’s because only multi-point distributions give us the ability to resolve all of
the performance effects in both space and time.
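As a minimal sketch of a two-point measurement, assume we can record the time each packet passes observation points A and B, keyed by some packet identifier. The data layout, function names and sample timestamps are hypothetical.

```python
import math

def two_point_stats(seen_at_a, seen_at_b):
    """Return (sorted delay samples, loss fraction) between two observation points.

    seen_at_a / seen_at_b: dict mapping packet id -> observation time in seconds.
    A packet seen at A but never at B is counted as lost.
    """
    delays = sorted(seen_at_b[pid] - t for pid, t in seen_at_a.items() if pid in seen_at_b)
    loss = 1 - len(delays) / len(seen_at_a)
    return delays, loss

def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list, for q in (0, 100]."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

# Hypothetical observations: packet 3 never reaches point B.
point_a = {1: 0.000, 2: 0.020, 3: 0.040, 4: 0.060}
point_b = {1: 0.005, 2: 0.032, 4: 0.066}

delays, loss = two_point_stats(point_a, point_b)
print(f"loss fraction: {loss:.2f}")
print(f"median delay: {percentile(delays, 50) * 1000:.1f} ms")
print(f"worst delay:  {percentile(delays, 100) * 1000:.1f} ms")
```

The result is a distribution of delay plus a loss fraction for that path segment, instead of a single average at a single point.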
74. STOCHASTIC RESOURCE MODEL
To successfully model networks, we need a high-fidelity model. Anything that
doesn’t accurately represent reality means our models will give wrong
predictions.
Since broadband networks are stochastic, that means our models need to be
stochastic too. (Good news! It is easier than it sounds. Science collapses
complexity.)
There is only one possible model that has the desired properties, just like there
is only one valid model of electromagnetism. We call this model ΔQ. It’s non-
proprietary: everyone is welcome to it.
76. QUANTITATIVE QUALITY CONTRACT
With a stochastic resource model that composes, we can build “quality
contracts” for whole supply chains that compose.
This requires a mathematical language to encode those technical contracts.
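One way to picture composition, offered as a sketch and not as a definition of ΔQ: represent each segment as a discretised delay distribution whose missing probability mass is loss, and combine segments by convolution. This assumes the segments behave independently; the segment names, numbers and the contract bound are made up.

```python
def compose(segment_1, segment_2):
    """Convolve two delay distributions given as {delay_ms: probability}.

    Probability mass missing from 1.0 is treated as packet loss.
    Assumes the two segments are statistically independent.
    """
    combined = {}
    for d1, p1 in segment_1.items():
        for d2, p2 in segment_2.items():
            combined[d1 + d2] = combined.get(d1 + d2, 0.0) + p1 * p2
    return combined

def loss(distribution):
    """Fraction of packets that never arrive."""
    return 1.0 - sum(distribution.values())

def prob_within(distribution, bound_ms):
    """Probability that a packet arrives within bound_ms."""
    return sum(p for delay, p in distribution.items() if delay <= bound_ms)

# Hypothetical segments: access loses 1% of packets, core loses 0.1%.
access = {5: 0.70, 20: 0.29}
core = {2: 0.90, 10: 0.099}

end_to_end = compose(access, core)
print(f"end-to-end loss: {loss(end_to_end):.5f}")
# A hypothetical contract: at least 95% of packets delivered within 30 ms.
print(prob_within(end_to_end, 30) >= 0.95)
```

The point of the sketch is that a requirement on the whole chain can be checked from the per-segment descriptions, which is what a composable contract needs.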
77. REVENUE OPPORTUNITY
[Chart (June 2009 - V 0.4). Labels: HELL, HEAVEN, 90% of load]
78. REVENUE OPPORTUNITY
If we achieve these things, and engage with the resource trades, then we can
build digital supply chains that have a large revenue uplift.
Users see value in services that are fit-for-purpose. We can extract a lot more
value out of broadband networks than we do at present. We can lower our costs
by sharing resources more efficiently.
80. ASSURANCE
When we deliver on our performance promises, and can prove it, then we can
charge extra for the service assurance.
This requires promises that are anchored in the customer’s world: “Salesforce
Seat Erlangs” or “Assured Netflix Erlangs”.
81. NEW SKILL NEEDED
Run networks in saturation (and maintain QoE)
84. EXAMPLE: KENT PUBLIC SERVICE NETWORK
How big is the prize? Let’s go back to KPSN. As a buyer consortium of user
institutions, they are on “both sides” of the trading space, representing supply
and demand. Thus they have every incentive to time-shift demand to lower
demand peaks and save money (since costs track peak demand).
86. THEIR SECRET WEAPON!
Their secret weapon is a meeting room. They negotiate time-shifting demand
where it can be moved. For instance, they alter the time and day of backups so
as not to overlap.
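The effect of such a negotiation can be sketched with three institutions that each run a backup. The durations, rates and start times are invented for illustration and are not KPSN figures.

```python
# De-peaking by time-shifting: same total demand, lower aggregate peak.
# Assumptions (illustrative): each backup runs for 2 hours at 80 Mbit/s.

def backup(start_hour, hours=2, mbps=80):
    """Hourly demand profile (24 values) for one backup job."""
    return [mbps if start_hour <= h < start_hour + hours else 0 for h in range(24)]

def aggregate(profiles):
    """Sum several hourly profiles into one."""
    return [sum(hour) for hour in zip(*profiles)]

overlapping = aggregate([backup(1), backup(1), backup(1)])   # everyone starts at 01:00
staggered = aggregate([backup(1), backup(3), backup(5)])     # agreed in the meeting room

print(sum(overlapping) == sum(staggered))   # same volume of data moved
print(max(overlapping), max(staggered))     # very different peak demand
```

Since costs track peak demand, the staggered schedule needs a third of the capacity in this toy case while moving exactly the same data.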
88. HOW KPSN SAVES MONEY
[Chart: axes TIMESCALE OF TRADE and VALUE OF TRADE (CREATE / DESTROY)]
89. SIZE OF THE SAVING?
50-80% savings versus equivalent service from incumbent
90. SIZE OF THE SAVING?
They agree to time-shift demand by hours or days, which de-peaks
aggregate demand. As a result they get a 50-80% cost saving over buying an
equivalent service from the local incumbent supplier. This is just by ‘manually’
engaging with the long timescale trades.
If you are a telco, this is terrifying if your customers realise they can coordinate
to arbitrage your product; or wonderful if you create suitable pricing incentives
to discover the (non)time-shiftable demand.
93. HOW KPSN SAVES MONEY
KPSN revised their capacity planning rules to a user-centric view rather than a
network-centric one. This identified enough slack for an additional year of
capex-free growth. (This is typical of the networks we see; we just can’t talk
about most of them due to NDAs.)
94. BENEFIT OF USING RIGHT PLANNING METRIC?
1-2 year capex deferral (and bigger savings in specific areas)
95. LOOKING INTO THE FUTURE…
We are working with KPSN to implement (over time) a system to fully exploit
the trades. It’s a five-class model, which is applicable to all operators.
We divide traffic into:
- Different qualities: Economy, standard, superior
- Different resilience levels: Resilient, non-resilient
So the phone lines to the office of a school might be superior class and resilient,
whereas those used by pupils for French class to talk to students abroad might
be superior non-resilient.
96. 5 CLASS ‘POLYSERVICE’ MODEL HAS RATIONAL ECONOMICS
[Diagram: Superior, Standard and Economy class labels, repeated across the diagram. Annotations: Drives capacity planning cost (primary service); Drives resilience & redundancy capacity planning cost; Drives revenue]
99. THE END GAME (>5 YEARS)
[Chart: axes TIMESCALE OF TRADE and VALUE OF TRADE (CREATE / DESTROY)]
100. THE SIZE OF THE PRIZE?
This 5 class model moves us towards the ideal. (We can’t get all the way as there
are limits to the TCP/IP protocol suite.) Based on the operational and
experimental data we have calculated that the amount of potential slack in such
networks is enough to fund…
101. POTENTIAL COST SAVING FOR ALL OPERATORS
5-10 years of “free” growth
102. ASSUMPTIONS
This is a remarkably large claim, so it needs to be qualified. This cost saving is
under the assumption that:
• the “superior” and “standard” traffic make up a small share of overall
demand (e.g. 30% or less), and
• it is ONLY the growth of those that makes you expand your network - i.e. you
are successfully running your network “hot” in multiple locations, and the
“economy” traffic is not mutating into “standard” too quickly.
We believe these assumptions hold for typical general-use networks.
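To get a feel for how these assumptions interact, here is a deliberately simple toy model. It is not the calculation behind the 5-10 year figure: it assumes that only the superior and standard traffic drives upgrades, that it grows at a constant annual rate, and that an upgrade is needed when it alone fills today's capacity. The growth rate is a free parameter that the source does not state.

```python
import math

def years_of_headroom(premium_share, annual_growth):
    """Years until premium traffic alone fills today's capacity.

    premium_share: fraction of capacity used by superior + standard traffic today.
    annual_growth: assumed constant yearly growth of that traffic (0.25 = 25%).
    Solves premium_share * (1 + annual_growth) ** years == 1 for years.
    """
    return math.log(1 / premium_share) / math.log(1 + annual_growth)

# Illustrative inputs only: 30% premium share, a range of assumed growth rates.
for growth in (0.15, 0.25, 0.40):
    print(f"{growth:.0%} growth: {years_of_headroom(0.30, growth):.1f} years")
```

The sketch shows the sensitivity: the smaller the premium share and the slower its growth, the longer capacity upgrades can be deferred.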
103. SUMMARY
1. The digital supply chain is failing
(due to poor science).
2. Measure, model & manage ΔQ
(the science of performance).
3. Deliver fit-for-purpose services
(by exploiting the science).
104. WANT TO LEARN MORE?
THEN SIGN UP FOR MY NEWSLETTER
FOLLOW ME ON TWITTER
READ THE GEDDES THINK TANK
105. FURTHER READING
• How to X-ray a telecoms network describes the
measurement tools needed.
• How to do network performance chemistry describes
how to separate out structural from dynamic effects.
• IPX: Telecoms salvation or suffering? outlines the
general problems in this area.