We presented the idea of coarse-grain lock-stepping (COLO) virtual machines for non-stop service at last year's Xen Summit. We have made significant progress in the past year and have submitted the patch series to the community. It is a good time for us to present the latest status to the community and call for participation.
Status Update of COLO Project
Xiaowei Yang, Huawei and Will Auld, Intel
1. Status of COLO Project
Eddie Dong*, Xiaowei Yang#
*Intel Open Source Technology Center
#Huawei Technology Co.
Key Contributors: Jianshan Lai, Congyang Wen, Tao Hong
4. What is COLO?
COarse-grain LOck-stepping Virtual Machines for Non-stop Service
Solution for client/server applications without application awareness
Dual-VM-based high availability solution
Relaxed constraints for higher performance
5. Replicated network
Copy each client request to both the PVM and the SVM
Compare response packets from the PVM and SVM with a compare module
When both are the same, the response is sent to the client
When they are not the same, sync the PVM and SVM, and then send the response
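The compare-and-release logic above can be sketched as follows. This is an illustrative model only, not the actual Xen code; `send_to_client` and `sync_checkpoint` are hypothetical stand-ins for the real network output and checkpoint paths.

```python
def handle_responses(pvm_pkt: bytes, svm_pkt: bytes,
                     send_to_client, sync_checkpoint) -> bool:
    """Compare a response packet from the PVM with the corresponding
    packet from the SVM. Returns True if a checkpoint was forced."""
    if pvm_pkt == svm_pkt:
        # Outputs agree: the SVM is a valid replica, release the packet.
        send_to_client(pvm_pkt)
        return False
    # Outputs diverge: synchronize the SVM to the PVM (checkpoint),
    # then release the PVM's packet.
    sync_checkpoint()
    send_to_client(pvm_pkt)
    return True
```

The key point is that a checkpoint is triggered only on divergent output, not on a fixed timer.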
6. Non-Stop Service with VM Replication
[Figure: the primary node runs the PVM (APPs + OS) on its own VMM and hardware; the secondary node runs the SVM likewise; VM replication flows from the PVM to the SVM over the network; on a hardware failure the service fails over to the SVM; both nodes access shared storage. Compare with Remus.]
7. Problems with existing approaches
Instruction-level lock-stepping
Excessive overhead from maintaining the exact machine state
Memory access in an MP-guest is non-deterministic
Periodic check-pointing
Extra network latency
Excessive VM checkpoint overhead
8. Relaxed constraints help
Relaxing constraints tends to lower the rate of synchronization
Periodic check-pointing defines the rate of synchronization
Tying the rate of synchronization to dissimilar responses ties it to the application's characteristics
In most cases this lowers the rate as compared to the periodic method
11. Current Status
Patches for Xen have been sent to the mailing list
Academic paper published at the ACM Symposium on Cloud Computing (SoCC'13)
Refer to "COLO: COarse-grained LOck-stepping Virtual Machines for Non-stop Service" for details
http://www.socc2013.org/home/program
Industry announcement
Huawei FusionSphere uses COLO
http://enterprise.huawei.com/ilink/enenterprise/about/news/news-list/HW_308817?KeyTemps=
12. TCP/IP optimization
Per-connection comparison (no modification to TCP/IP)
Coarse-grain TCP timestamp
Coarse-grain TCP notification window size
Deterministic algorithm to segment application data
Deterministic algorithm to generate the initial sequence number
Deterministic algorithm to generate the ID (IP packet header)
Immediate acknowledgement
Use a separate packet to send FIN
…
13. EXAMPLE: Coarse-grain TCP Notification Window Size
Coarse-grain window size rules:
if origin window < 256
round down to the nearest power of 2
else
mask the 8 least significant bits
For example:
1. origin window size = 172 (10101100b): set window size to 128 (10000000b)
2. origin window size = 283 (100011011b): set window size to 256 (100000000b)
3. origin window size = 789 (1100010101b): set window size to 768 (1100000000b)
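The coarsening rule above can be written as a short function (a sketch; the function name is ours, not from the COLO patches):

```python
def coarse_window(win: int) -> int:
    """Coarse-grain a TCP notification window size so the PVM and SVM
    are more likely to advertise identical values."""
    if win <= 0:
        return 0
    if win < 256:
        # Round down to the nearest power of two.
        return 1 << (win.bit_length() - 1)
    # Mask off the 8 least significant bits.
    return win & ~0xFF

# The slide's examples:
# coarse_window(172) -> 128
# coarse_window(283) -> 256
# coarse_window(789) -> 768
```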
14. EXAMPLE: Deterministic segmentation
Application data to send at T1 and T2: app data1 = 3000 B (time point 1) and app data2 = 2000 B (time point 2), each carried in skbs with TCP/IP packet headers (1360 B payload per skb in the figure).
Method 1: find the latest unsent skb and append app data2 to the unused tail of its payload (app data1 as 1360 B + 1360 B + 280 B; app data2 fills the 280 B skb with 1080 B, then 920 B in a new skb)
Method 2: if there is no unsent skb left (skb == NULL), use a new skb to send app data2
COLO deterministic method: do NOT check the latest unsent skb; always use a new skb to send app data2 (app data1 as 1360 B + 1360 B + 280 B; app data2 as 1360 B + 640 B)
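The deterministic rule can be sketched as follows: because each burst of application data always starts in a fresh skb, the resulting segment sizes depend only on the data length, never on what happens to be queued at that moment. This is an illustrative model, not the kernel implementation.

```python
MSS = 1360  # payload bytes per packet, matching the slide's example

def segment(data_len: int) -> list:
    """COLO's deterministic segmentation (sketch): split a burst of
    application data into MSS-sized chunks, starting in a new skb."""
    sizes = []
    while data_len > 0:
        sizes.append(min(MSS, data_len))
        data_len -= MSS
    return sizes

# segment(3000) -> [1360, 1360, 280]
# segment(2000) -> [1360, 640]
```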
15. Storage process
Write
Pnode: DM sends the Write request (offset, len, data) to the PVM cache in the Snode; DM calls the block driver to write to storage
Snode: DM saves the Write request in the SVM cache
Read
Snode: from the SVM cache, or from storage otherwise
Pnode: from storage
Checkpoint
DM calls the block driver to flush the PVM cache
Failover
DM calls the block driver to flush the SVM cache
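A toy model of the Snode side of this flow, under our reading of the slide (class and method names are ours; the real logic lives in the device model and block layer, and whether SVM speculative writes are dropped at checkpoint is our assumption):

```python
class SnodeDisk:
    """Illustrative model of COLO disk replication on the secondary node."""

    def __init__(self):
        self.storage = {}    # committed blocks (offset -> data)
        self.pvm_cache = {}  # forwarded PVM writes, buffered until checkpoint
        self.svm_cache = {}  # the SVM's own speculative writes

    def pvm_write(self, offset, data):
        # The Pnode DM forwards each PVM write here; it is buffered
        # until the next checkpoint.
        self.pvm_cache[offset] = data

    def svm_write(self, offset, data):
        self.svm_cache[offset] = data

    def svm_read(self, offset):
        # SVM reads hit its own cache first, then fall back to storage.
        return self.svm_cache.get(offset, self.storage.get(offset))

    def checkpoint(self):
        # Flush buffered PVM writes to storage; discard SVM speculation
        # (assumption: the SVM is resynced to the PVM at checkpoint).
        self.storage.update(self.pvm_cache)
        self.pvm_cache.clear()
        self.svm_cache.clear()

    def failover(self):
        # The SVM takes over: its cached writes become authoritative.
        self.storage.update(self.svm_cache)
        self.svm_cache.clear()
```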
16. Memory sync
One of the most time-consuming steps
Asynchronously send dirty memory while the PVM/SVM are running
Less dirty memory to transmit during the VM checkpoint
Less CPU pressure and lower latency
Critical when VM checkpoints happen infrequently
17. Faster VBD/VIF frontend/backend suspend/resume
Old method: communication between frontend and backend goes through xenstored, which is inefficient
New method: use an event channel to speed up frontend/backend communication
18. Agenda
Background
Status
Performance
Call for action
*Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance
tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and
functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to
assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
19. Web Server Performance - Web Bench
For more complete information about performance and benchmark results, visit
www.intel.com/benchmarks
Source: Intel
20. Web Server Performance - Web Bench (MP)
21. PostgreSQL Performance - Pgbench
22. PostgreSQL Performance - Pgbench (MP)
23. Upstream
Initial patch series has been posted
More comments are welcome
Depends on the readiness of Remus on top of XL
COLO reuses Remus for VM checkpointing and heartbeat
25. Next Steps and Call for Action
Works well with an HVM Linux guest + PV drivers
Windows guest support is under development
Need more participants and faster turnaround on upstreaming