SlideShare a Scribd company logo
1 of 42
Running the World’s Internet Servers 
By Steve Mushero 
September, 2014 
Running the World’s Internet Servers www.ChinaNetCloud.com 
Zabbix 
At Scale 
Build & Manage Servers  Optimize & Manage Servers  Managed Cloud Servers Copyright © 2014 ChinaNetCloud
Running the World’s Internet Servers www.ChinaNetCloud.com 
Greetings 
I’m Steve 
I’m from Shanghai, China 
We have a big Internet there 
We have a big business 
We have a big monitoring system 
That’s Zabbix 
Let me tell you more about it . . . 
2
First, a Word about Numbers 
My Title: 2500 hosts & 1 million items 
Well, I wrote that in spring 
We’re not quite there yet 
Today – about 1250 hosts & 300K items 
With 75K triggers 
500 templates 
25+ graphs per host 
200 screens & maps (60% customer specific) 
10 proxies around the world 
About 550 new values per sec (NVPS) 
Hope you’re not too disappointed 
Running the World’s Internet Servers www.ChinaNetCloud.com 
Still lots to share 
3
Running the World’s Internet Servers www.ChinaNetCloud.com 
More Numbers 
This is a busy system 
About 1250 hosts & 300K items 
Billions of data points in DB 
Hundreds of users 
78,000 Alerts/Events per month 
2,600 per day 
Mostly false alerts 
Working hard to reduce these 
1,000+ Actionable Tickets per month 
4
Running the World’s Internet Servers www.ChinaNetCloud.com 
What we do 
Internet Managed Service Provider 
But higher-end than most 
We build & run large-scale Internet systems 
In any data center in Asia, USA, and beyond . . . 
Our customers have about 250M users 
We have over 100 customers 
1000 servers for 100 customers <> 1000 for yourself 
We run about 200 DB servers, and dozens or hundreds of 
everything else 
We run every possible service and system (Linux-based) 
5
Running the World’s Internet Servers www.ChinaNetCloud.com 
How we do this 
About 100 employees, 50% in Operations 
About 25 integrated systems 
Zabbix, ticket, customer, procedure, logs, CMDB, etc. 
7x24 Operations 
Continual new customers/systems 
Continual changes to systems/monitoring 
Customers rarely test things 
VERY dynamic environment 
Exact opposite of most Enterprises 
Think Wild, Wild West 
6
How we use Zabbix 
Core monitor system for six years 
Alerts link to our Procedure system 
Via URL in Trigger, customized 
We are on 1.8.3 and -> 2.2 this month 
Other systems feed it via sender (DWM, SEC) 
Heavily modified dashboard with ticket integration 
Automation & reporting 
We have 100 day-time users, about 25 hard-core 
All our customers use it, varying degrees of use 
Lots of macros (and reports) 
Running the World’s Internet Servers www.ChinaNetCloud.com 
7
Our Ops Area – Zabbix Everywhere 
Running the World’s Internet Servers www.ChinaNetCloud.com 
Image of AR/Ops area 
8
How we use Zabbix 
Running the World’s Internet Servers www.ChinaNetCloud.com 
Very dynamic system 
Things changing every day, every hour 
Make changes every day, by dozens of people 
Servers coming & going, updating templates 
Very easy to misalign, misconfigure – have safety report 
We have 500 templates 
About 25 that get used for every customer 
By service – Apache, MySQL, etc. 
9
How we use Zabbix – Templates 
Running the World’s Internet Servers www.ChinaNetCloud.com 
About 500 templates 
Many customer specific or obsolete 
Trying to reduce to < 100 
One core Template for Linux 
163 Items 
79 Triggers 
Will soon split to RedHat vs. Ubuntu 
One Template per Service 
MySQL, Apache, Nginx, many, many more 
SNMP devices also, messy ports 
10
How we use Zabbix – Agents 
Agents on all non-SNMP hosts 
Running the World’s Internet Servers www.ChinaNetCloud.com 
100% Passive Agents 
Server Security issues with Active 
No good way to lock down 100 locations 
So every item/new value is an update to DB 
LOTS of custom scripts – 25+ 
MySQL – Real-time any variable/config/status 
Java – Custom JMX reader 
Lots of services – Haproxy, Apache, Node.js, etc. 
Linux /proc reader – TCP, conntrack 
iostat/vmstat 
RAID/Hardware 
CatchAll low disk filesystem – Very important 
11
How we use Zabbix – Pushing Data 
Several monitors that push data 
We patched Server to allow direct reporting 
Normally must push data to proxy 
Running the World’s Internet Servers www.ChinaNetCloud.com 
Very painful globally 
Syslog via SEC 
Pushes fixed set of items like OOM 
Clear manually in Dashboard GUI 
Distributed Web Monitors (DWM) 
Push status, failures to server 
12
How we use Zabbix – Web Checks 
Running the World’s Internet Servers www.ChinaNetCloud.com 
We don’t use any more 
Not distributed (until 2.2) 
Can’t get error reasons 
Makes troubleshooting harder 
Easy to break triggers 
Built our own Distributed Web Monitor 
Runs on nodes / proxies 
Hosts are in zones/nodes 
Nodes pull config 
Parallel checks on nodes 
Detailed errors/triggers 
Single node per host 
Soon to be multi-node with voting 
More sensitive, less false-alerts 
13
How we use Zabbix – Other Parts 
Running the World’s Internet Servers www.ChinaNetCloud.com 
Screens 
Used heavily 
Hard to build, need automation 
XML Import/Export 
Not used much 
But powerful 
API 
Use for adding servers 
Use to pull graphs 
Maps 
Use a few 
Automate in future 
IT Services & Reports – Don’t use 
Discovery – Thinking how to use 
14
Experience 
Running the World’s Internet Servers www.ChinaNetCloud.com 15
Our General Experience 
Running the World’s Internet Servers www.ChinaNetCloud.com 
Overall it’s very good 
False alerts kill us 
Especially unreachables 
Proxies kill us 
Lots of patches/changes 
Long list of improvements 
We will be world’s largest user 
16
Issues & Improvements 
Time checks – Fuzzy proxies 
Totally useless globally 
No solution yet, just SQL reports 
What is unreachable, when, how ? 
Many agent improvements needed 
Running the World’s Internet Servers www.ChinaNetCloud.com 
Linux memory RSS 
iostat devices 
We have long list 
Latching – Need a way to latch 
To keep short-lived events up 
Can use hysteresis, but not great 
17
Running the World’s Internet Servers www.ChinaNetCloud.com 
Scale issues 
Housekeeper 
Large templates – core Linux template 
Database size 
Disk I/O challenges 
Operations time out 
Or lockup system 
Proxies challenging 
Poor Permission System 
18
Zabbix 
Details 
Running the World’s Internet Servers www.ChinaNetCloud.com 19
Our Zabbix Architecture 
Central Server in Shanghai 
Proxies in 10 locations in 5 countries 
Most support many customers 
Some dedicated in customer data centers 
System consists of 3 large VMs: 
Running the World’s Internet Servers www.ChinaNetCloud.com 
Web GUI 
Zabbix Serer 
MySQL Database 
Looking at geographic HA options 
Need to move all hosts to proxies first 
20
Running the World’s Internet Servers www.ChinaNetCloud.com 
Hardware History 
2008 - Dell R200 Virtualized – up to 350 hosts 
2010 - Dell 1950 Virtualized – up to 750 hosts 
Integrated single larger VM – GUI, Server, DB 
2013 - Dell R420 Virtualized with SAS – up to 1250 hosts 
Split VMs – GUI, Server, DB 
96GB or RAM 
2014 - Dell R420 Virtualized with SSD – should go 5-10X 
Split VMs – GUI, Server, DB & slave DB for reporting/backup 
128GB or RAM, looking at 256GB 
Bottleneck is always DB RAM & I/O 
CPU always okay 
Eventually 256-512GB of RAM and 500GB SSD - Happiness 
21
Running the World’s Internet Servers www.ChinaNetCloud.com 
Database Details 
MySQL Percona 5.5 w/ very optimized config 
Will go to 5.6 when 5.6 is more stable 
No query cache 
Main DB about 200GB in size 
Expect 1TB in 2015 
History is about 2 billion rows of data, 157GB size 
Trends is 500 million, 37GB size 
Backups take 8 hours (including compress/encrypt) 
Has 48GB of Innodb Buffer (will expand to 64+ soon) 
Will go to 96, 128, 256 over time 
Generally runs well, except I/O bottleneck 
22
Zabbix Server Config 
Running the World’s Internet Servers www.ChinaNetCloud.com 
Nothing that special 
Pollers – Use a lot 
Max num ~500 in most cases 
Can be 90-100% busy 
DB Sync is slow, screws up time 
Data time-stamped on db sync, not received 
Housekeeper On, Scheduled 
100% passive 
Agents locked down to server/proxies 
Don’t use Actions 
23
Running the World’s Internet Servers www.ChinaNetCloud.com 
Housekeeper 
Total rows in history tables: 2 BILLION 
Hourly housekeeping delete: 7 MILLION 
Takes 3 hours to run 
Killing us if 8 hour backups are running 
Did not partition yet, problems in 2.0/MySQL 
Changed code to make it configurable 
Still very slow, as item by item 
Now only run during day, avoid backups 
Hourly (shortest option) 
Will go to partitions in 2.2 
24
Running the World’s Internet Servers www.ChinaNetCloud.com 
Proxies 
We hate proxies, sometimes 
Great in theory, but 
Always getting stuck – very poor network handling in 1.8 
Fuzzytime() useless for time check – no solution 
Server time is not data gathered time 
We run them in many data centers, countries 
With often poor connections to main servers 
We build SQL tools, with PHP GUI coming soon 
Poller Queue, Sending Queue, Oldest unsent value 
Coping 
Use cron to check queue and auto restart 
Use auto re-route to improve traffic via TCP relays 
Hoping for better TCP/IP in 2.x 
25
Customizations 
Running the World’s Internet Servers www.ChinaNetCloud.com 26
Customizations – GUI Dashboard 
Mostly change columns & colors 
We only use Top Issues section 
Running the World’s Internet Servers www.ChinaNetCloud.com 
Added custom columns 
Mostly ticket system integration 
Ticket numbers & status/history 
ACKs include person’s initials 
Status if someone logs into host 
Changed popup menus 
Links to various systems 
Create ticket popups 
Triggers Dynamic link to procedures 
Sends hostname and profile fields 
New dynamic priority system 
Rule & time-based priority changes 
Auto escalation, tracking 
27
Customizations – GUI Dashboard 
Running the World’s Internet Servers www.ChinaNetCloud.com 
28
Customizations – GUI Dashboard 
Running the World’s Internet Servers www.ChinaNetCloud.com 
29 
Popup Menu Customized 
Ping – Via ssh/get/snmp 
Server Summary 
Custom history, owner, contact 
Create Ticket 
Wikis for each level
Customizations – GUI Dashboard 
Running the World’s Internet Servers www.ChinaNetCloud.com 
Several Dashboards 
Main Dashboard - Internal 
Secondary Dashboard 
We push things we can’t fix 
Customer Dashboard 
More limited view 
Graphs 
American Dates 
Fonts & Sizes 
Many more coming 
Annotations, Max limits 
LOTS more for future improvements 
30
Customizations – Graphs 
Running the World’s Internet Servers www.ChinaNetCloud.com 
31
Customizations - Other 
Running the World’s Internet Servers www.ChinaNetCloud.com 
Housekeeper 
Configurable schedule 
Avoids conflict with backups 
Direct data sender 
Bypass proxies 
Major improvement 
Ping Host – Can’t use behind proxy 
Changed to use zabbix_get, ssh-key 
Use SNMP tool for network hosts 
A LOT more changes planned for 2.2 
Dozens, including agent, proxy UI 
32
Automation & 
Reporting 
Running the World’s Internet Servers www.ChinaNetCloud.com 33
Running the World’s Internet Servers www.ChinaNetCloud.com 
Automation 
Lots of direct DB interaction 
Dozens of Reports 
Safety checks 
Monthly reports 
Metrics reports 
Auto Ticket 
Creates tickets based on rules 
Auto Closer 
Closes tickets if alert disappears 
Auto Priority/Escalate 
Rules-based event driver 
Dynamic priority engine 
Push new hosts in via API 
Some still direct DB inserts 
34
Running the World’s Internet Servers www.ChinaNetCloud.com 
Reporting 
Two key types: 
Customer Reports – Monthly / on demand 
Custom selection by server, customer, graph 
Use API to pull real graphs 
Pull data from DB for metrics 
Auditing – SQL-based 
Ad Hoc – Lots of SQL-based 
Safety Reports – Look for problems 
35
Running the World’s Internet Servers www.ChinaNetCloud.com 
Safety Reports 
Remember – We are in the Wild West 
Look for mistakes 
Direct from DB 
See my DB talk for details 
So many changes by so many people 
Not enough automation yet 
Dozens of people making changes 
Not all as careful as we’d like 
Easy to break things 
Hard to know or detect 
36
Safety Reports – Examples 
Items that differ from template 
Running the World’s Internet Servers www.ChinaNetCloud.com 
Missing templates 
Disable items/hosts, forget to enable 
Alerts with no URL/Wiki 
Hosts missing profile data 
Items disabled conflict with trigger 
Web alerts with no trigger 
Web alerts with long/short timeouts 
Hosts in wrong, duplicate, conflicting groups 
Servers in Zabbix, not core system 
37
Running the World’s Internet Servers www.ChinaNetCloud.com 
Security 
Pretty good 
Very good for customers 
One group per customer 
Read-only rights 
BUT, not granular enough 
Either read or write 
Can’t have different level employees 
Can’t limit tasks / functions 
Useless for ops company 
Audit system great 
But GUI & reports useless 
We have custom reports 
We’ll modify this in 2.2 
Not sure how yet 
38
Running the World’s Internet Servers www.ChinaNetCloud.com 
Future 
Lots more basic improvements to make 
Changing event processing, rule-based 
Thinking correlation engines 
Adding new Agent features in 2015 
New GUI & Security & Audit & Reports 
Thinking about 10X scale for 2015-16 
10,000 servers, millions of items 
Architecture, Federation, ??? 
Wondering about 100X scale 
39
Running the World’s Internet Servers www.ChinaNetCloud.com 
Summary 
Zabbix is a great system 
It’s critical to our business 
Still the best system out there 
We’ve invested a lot, more to go 
Still lots of improvements to make 
Glad to see vibrant user & developers 
Happy to be part of the Zabbix ecosystem 
Happy to share all we know & have learned 
40
Thanks from ChinaNetCloud 
Pioneers in OaaS – Operations as a Service 
Running the World’s Internet Servers www.ChinaNetCloud.com 41
ChinaNetCloud 
Shanghai Headquarters: 
X2 Space 1-601, 1238 Xietu Lu 
Shanghai, 200032 China 
T: +86-21-6422-1946 F: +86-21-6422-4911 
Beijing Office: 
Lee World Business Building #305 
57 Happiness Village Road, Chaoyang District 
Beijing, 100027 China 
Sales@ChinaNetCloud.com 
www.ChinaNetCloud.com 
Silicon Valley Office: 
California Avenue 
Palo Alto, 94123 USA 
Running the World’s Internet Servers www.ChinaNetCloud.com 42

More Related Content

What's hot

Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer
confluent
 

What's hot (20)

Sizing Your Scylla Cluster
Sizing Your Scylla ClusterSizing Your Scylla Cluster
Sizing Your Scylla Cluster
 
Nagios Conference 2011 - Tony Roman - Cacti Workshop
Nagios Conference 2011 - Tony Roman - Cacti WorkshopNagios Conference 2011 - Tony Roman - Cacti Workshop
Nagios Conference 2011 - Tony Roman - Cacti Workshop
 
Scale and Throughput @ Clicktale with Akka
Scale and Throughput @ Clicktale with AkkaScale and Throughput @ Clicktale with Akka
Scale and Throughput @ Clicktale with Akka
 
Tales from the four-comma club: Managing Kafka as a service at Salesforce | L...
Tales from the four-comma club: Managing Kafka as a service at Salesforce | L...Tales from the four-comma club: Managing Kafka as a service at Salesforce | L...
Tales from the four-comma club: Managing Kafka as a service at Salesforce | L...
 
Data Stores @ Netflix
Data Stores @ NetflixData Stores @ Netflix
Data Stores @ Netflix
 
Using apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at DatadogUsing apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at Datadog
 
25 snowflake
25 snowflake25 snowflake
25 snowflake
 
Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer
 
Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics
Strata+Hadoop 2015 NYC End User Panel on Real-Time Data AnalyticsStrata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics
Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics
 
Scylla Summit 2018: Worry-free ingestion - flow-control of writes in Scylla
Scylla Summit 2018: Worry-free ingestion - flow-control of writes in ScyllaScylla Summit 2018: Worry-free ingestion - flow-control of writes in Scylla
Scylla Summit 2018: Worry-free ingestion - flow-control of writes in Scylla
 
Oracle 12c Parallel Execution New Features
Oracle 12c Parallel Execution New FeaturesOracle 12c Parallel Execution New Features
Oracle 12c Parallel Execution New Features
 
Spark Streaming @ Scale (Clicktale)
Spark Streaming @ Scale (Clicktale)Spark Streaming @ Scale (Clicktale)
Spark Streaming @ Scale (Clicktale)
 
Cloud Performance Benchmarking
Cloud Performance BenchmarkingCloud Performance Benchmarking
Cloud Performance Benchmarking
 
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...
 
Common Patterns of Multi Data-Center Architectures with Apache Kafka
Common Patterns of Multi Data-Center Architectures with Apache KafkaCommon Patterns of Multi Data-Center Architectures with Apache Kafka
Common Patterns of Multi Data-Center Architectures with Apache Kafka
 
Server monitoring using grafana and prometheus
Server monitoring using grafana and prometheusServer monitoring using grafana and prometheus
Server monitoring using grafana and prometheus
 
Streaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in ProductionStreaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in Production
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
 
HBaseCon2017 Highly-Available HBase
HBaseCon2017 Highly-Available HBaseHBaseCon2017 Highly-Available HBase
HBaseCon2017 Highly-Available HBase
 
Nagios Conference 2011 - Jeff Sly - Case Study Nagios @ Nu Skin
Nagios Conference 2011 - Jeff Sly - Case Study Nagios @ Nu SkinNagios Conference 2011 - Jeff Sly - Case Study Nagios @ Nu Skin
Nagios Conference 2011 - Jeff Sly - Case Study Nagios @ Nu Skin
 

Similar to ChinaNetCloud - Using Zabbix Monitoring at Scale - Zabbix Conference 2014

Building for the Cloud | NC CSDN Cloud Conference 2012
Building for the Cloud | NC CSDN Cloud Conference 2012Building for the Cloud | NC CSDN Cloud Conference 2012
Building for the Cloud | NC CSDN Cloud Conference 2012
ChinaNetCloud
 
NZSPC 2013 - Ultimate SharePoint Infrastructure Best Practices Session
NZSPC 2013 - Ultimate SharePoint Infrastructure Best Practices SessionNZSPC 2013 - Ultimate SharePoint Infrastructure Best Practices Session
NZSPC 2013 - Ultimate SharePoint Infrastructure Best Practices Session
Michael Noel
 
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
LarryZaman
 
Drizzle Keynote at the MySQL User's Conference
Drizzle Keynote at the MySQL User's ConferenceDrizzle Keynote at the MySQL User's Conference
Drizzle Keynote at the MySQL User's Conference
Brian Aker
 
Java & SOA Cloud Service for Fusion Middleware Administrators
Java & SOA Cloud Service for Fusion Middleware AdministratorsJava & SOA Cloud Service for Fusion Middleware Administrators
Java & SOA Cloud Service for Fusion Middleware Administrators
Simon Haslam
 
Dealing with Chinese Network Anatomy | NC SZ Architect Event Speech 2012
Dealing with Chinese Network Anatomy | NC SZ Architect Event Speech 2012Dealing with Chinese Network Anatomy | NC SZ Architect Event Speech 2012
Dealing with Chinese Network Anatomy | NC SZ Architect Event Speech 2012
ChinaNetCloud
 

Similar to ChinaNetCloud - Using Zabbix Monitoring at Scale - Zabbix Conference 2014 (20)

Building for the Cloud | NC CSDN Cloud Conference 2012
Building for the Cloud | NC CSDN Cloud Conference 2012Building for the Cloud | NC CSDN Cloud Conference 2012
Building for the Cloud | NC CSDN Cloud Conference 2012
 
ChinaNetCloud - Company & Services Overview
ChinaNetCloud - Company & Services OverviewChinaNetCloud - Company & Services Overview
ChinaNetCloud - Company & Services Overview
 
ChinaNetCloud - OaaS Operations-as-a-Service - HostingCon - Sep 2014
ChinaNetCloud - OaaS Operations-as-a-Service - HostingCon - Sep 2014ChinaNetCloud - OaaS Operations-as-a-Service - HostingCon - Sep 2014
ChinaNetCloud - OaaS Operations-as-a-Service - HostingCon - Sep 2014
 
NZSPC 2013 - Ultimate SharePoint Infrastructure Best Practices Session
NZSPC 2013 - Ultimate SharePoint Infrastructure Best Practices SessionNZSPC 2013 - Ultimate SharePoint Infrastructure Best Practices Session
NZSPC 2013 - Ultimate SharePoint Infrastructure Best Practices Session
 
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
Business_Continuity_Planning_with_SQL_Server_HADR_options_TechEd_Bangalore_20...
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
Sql server 2016 it just runs faster sql bits 2017 edition
Sql server 2016 it just runs faster   sql bits 2017 editionSql server 2016 it just runs faster   sql bits 2017 edition
Sql server 2016 it just runs faster sql bits 2017 edition
 
Drizzle Keynote at the MySQL User's Conference
Drizzle Keynote at the MySQL User's ConferenceDrizzle Keynote at the MySQL User's Conference
Drizzle Keynote at the MySQL User's Conference
 
Management and Automation of MongoDB Clusters - Slides
Management and Automation of MongoDB Clusters - SlidesManagement and Automation of MongoDB Clusters - Slides
Management and Automation of MongoDB Clusters - Slides
 
Managing RightScale on RightScale
Managing RightScale on RightScaleManaging RightScale on RightScale
Managing RightScale on RightScale
 
OaaS - Operations as a Service
OaaS - Operations as a ServiceOaaS - Operations as a Service
OaaS - Operations as a Service
 
Java & SOA Cloud Service for Fusion Middleware Administrators
Java & SOA Cloud Service for Fusion Middleware AdministratorsJava & SOA Cloud Service for Fusion Middleware Administrators
Java & SOA Cloud Service for Fusion Middleware Administrators
 
How we scaled Rudder to 10k, and the road to 50k
How we scaled Rudder to 10k, and the road to 50kHow we scaled Rudder to 10k, and the road to 50k
How we scaled Rudder to 10k, and the road to 50k
 
L21 scalability
L21 scalabilityL21 scalability
L21 scalability
 
ChinaNetCloud - The Zabbix Database - Zabbix Conference 2014
ChinaNetCloud - The Zabbix Database - Zabbix Conference 2014ChinaNetCloud - The Zabbix Database - Zabbix Conference 2014
ChinaNetCloud - The Zabbix Database - Zabbix Conference 2014
 
MNPHP Scalable Architecture 101 - Feb 3 2011
MNPHP Scalable Architecture 101 - Feb 3 2011MNPHP Scalable Architecture 101 - Feb 3 2011
MNPHP Scalable Architecture 101 - Feb 3 2011
 
11g Identity Management - InSync10
11g Identity Management - InSync1011g Identity Management - InSync10
11g Identity Management - InSync10
 
0.Web Application Architecture.ppt
0.Web Application Architecture.ppt0.Web Application Architecture.ppt
0.Web Application Architecture.ppt
 
Dealing with Chinese Network Anatomy | NC SZ Architect Event Speech 2012
Dealing with Chinese Network Anatomy | NC SZ Architect Event Speech 2012Dealing with Chinese Network Anatomy | NC SZ Architect Event Speech 2012
Dealing with Chinese Network Anatomy | NC SZ Architect Event Speech 2012
 

More from ChinaNetCloud

运维安全 抵抗黑客攻击_云络安全沙龙4月上海站主题分享
运维安全 抵抗黑客攻击_云络安全沙龙4月上海站主题分享运维安全 抵抗黑客攻击_云络安全沙龙4月上海站主题分享
运维安全 抵抗黑客攻击_云络安全沙龙4月上海站主题分享
ChinaNetCloud
 

More from ChinaNetCloud (20)

AWS ELB Tips & Best Practices
AWS ELB Tips & Best PracticesAWS ELB Tips & Best Practices
AWS ELB Tips & Best Practices
 
OpsStack--Integrated Operation Platform
OpsStack--Integrated Operation PlatformOpsStack--Integrated Operation Platform
OpsStack--Integrated Operation Platform
 
ChinaNetCloud Online Lecture:Something About Tshark
ChinaNetCloud Online Lecture:Something About TsharkChinaNetCloud Online Lecture:Something About Tshark
ChinaNetCloud Online Lecture:Something About Tshark
 
ChinaNetCloud Online Lecture: Fight Against External Attacks From Different L...
ChinaNetCloud Online Lecture: Fight Against External Attacks From Different L...ChinaNetCloud Online Lecture: Fight Against External Attacks From Different L...
ChinaNetCloud Online Lecture: Fight Against External Attacks From Different L...
 
Steve Mushero on Entrepreneurship - 创业 - 崔牛会
Steve Mushero on Entrepreneurship - 创业 - 崔牛会Steve Mushero on Entrepreneurship - 创业 - 崔牛会
Steve Mushero on Entrepreneurship - 创业 - 崔牛会
 
Dev-Ops与Docker的最佳实践 QCon2016 北京站演讲
Dev-Ops与Docker的最佳实践 QCon2016 北京站演讲Dev-Ops与Docker的最佳实践 QCon2016 北京站演讲
Dev-Ops与Docker的最佳实践 QCon2016 北京站演讲
 
云中漫步 颠覆创新_创业邦春季创新峰会主题演讲 Cloud Innovation in China
云中漫步 颠覆创新_创业邦春季创新峰会主题演讲 Cloud Innovation in China云中漫步 颠覆创新_创业邦春季创新峰会主题演讲 Cloud Innovation in China
云中漫步 颠覆创新_创业邦春季创新峰会主题演讲 Cloud Innovation in China
 
运维安全 抵抗黑客攻击_云络安全沙龙4月上海站主题分享
运维安全 抵抗黑客攻击_云络安全沙龙4月上海站主题分享运维安全 抵抗黑客攻击_云络安全沙龙4月上海站主题分享
运维安全 抵抗黑客攻击_云络安全沙龙4月上海站主题分享
 
AWS Summit OaaS Talk by ChinaNetCloud
AWS Summit OaaS Talk by ChinaNetCloudAWS Summit OaaS Talk by ChinaNetCloud
AWS Summit OaaS Talk by ChinaNetCloud
 
Running Internet Systems in China - The Details You Need to Succeed in Chines...
Running Internet Systems in China - The Details You Need to Succeed in Chines...Running Internet Systems in China - The Details You Need to Succeed in Chines...
Running Internet Systems in China - The Details You Need to Succeed in Chines...
 
Making Internet Operations Easier
Making Internet Operations EasierMaking Internet Operations Easier
Making Internet Operations Easier
 
Internet Cloud Operations - ChinaNetcloud & AWS Event Beijing
Internet Cloud Operations - ChinaNetcloud & AWS Event BeijingInternet Cloud Operations - ChinaNetcloud & AWS Event Beijing
Internet Cloud Operations - ChinaNetcloud & AWS Event Beijing
 
Big Data Security (ChinaNetCloud - Guiyang Conference)
Big Data Security (ChinaNetCloud - Guiyang Conference)Big Data Security (ChinaNetCloud - Guiyang Conference)
Big Data Security (ChinaNetCloud - Guiyang Conference)
 
Internet System Security Overview
Internet System Security OverviewInternet System Security Overview
Internet System Security Overview
 
Why Work at ChinaNetCloud
Why Work at ChinaNetCloudWhy Work at ChinaNetCloud
Why Work at ChinaNetCloud
 
Cloud Operations Challenges - Talk by ChinaNetCloud at Joint Cisco event
Cloud Operations Challenges - Talk by ChinaNetCloud at Joint Cisco eventCloud Operations Challenges - Talk by ChinaNetCloud at Joint Cisco event
Cloud Operations Challenges - Talk by ChinaNetCloud at Joint Cisco event
 
Automatically Managing Internet Operations In The Cloud - 云计算平台的自动化运维
Automatically Managing  Internet Operations  In The Cloud - 云计算平台的自动化运维Automatically Managing  Internet Operations  In The Cloud - 云计算平台的自动化运维
Automatically Managing Internet Operations In The Cloud - 云计算平台的自动化运维
 
ChinaNetCloud - Aliyun Joint Event on Cloud Operations
ChinaNetCloud - Aliyun Joint Event on Cloud Operations ChinaNetCloud - Aliyun Joint Event on Cloud Operations
ChinaNetCloud - Aliyun Joint Event on Cloud Operations
 
Clouds in China
Clouds in ChinaClouds in China
Clouds in China
 
ChinaNetCloud - Public Clouds in China Overview
ChinaNetCloud - Public Clouds in China OverviewChinaNetCloud - Public Clouds in China Overview
ChinaNetCloud - Public Clouds in China Overview
 

Recently uploaded

Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
soniya singh
 
Dwarka Sector 26 Call Girls | Delhi | 9999965857 🫦 Vanshika Verma More Our Se...
Dwarka Sector 26 Call Girls | Delhi | 9999965857 🫦 Vanshika Verma More Our Se...Dwarka Sector 26 Call Girls | Delhi | 9999965857 🫦 Vanshika Verma More Our Se...
Dwarka Sector 26 Call Girls | Delhi | 9999965857 🫦 Vanshika Verma More Our Se...
Call Girls In Delhi Whatsup 9873940964 Enjoy Unlimited Pleasure
 
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
Diya Sharma
 
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptxAWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
ellan12
 
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
soniya singh
 
Hot Service (+9316020077 ) Goa Call Girls Real Photos and Genuine Service
Hot Service (+9316020077 ) Goa  Call Girls Real Photos and Genuine ServiceHot Service (+9316020077 ) Goa  Call Girls Real Photos and Genuine Service
Hot Service (+9316020077 ) Goa Call Girls Real Photos and Genuine Service
sexy call girls service in goa
 
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
SofiyaSharma5
 
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
soniya singh
 
Rohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Call Girls In Delhi Whatsup 9873940964 Enjoy Unlimited Pleasure
 

Recently uploaded (20)

Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Model Towh Delhi 💯Call Us 🔝8264348440🔝
 
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
 
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
'Future Evolution of the Internet' delivered by Geoff Huston at Everything Op...
 
Russian Call Girls in %(+971524965298 )# Call Girls in Dubai
Russian Call Girls in %(+971524965298  )#  Call Girls in DubaiRussian Call Girls in %(+971524965298  )#  Call Girls in Dubai
Russian Call Girls in %(+971524965298 )# Call Girls in Dubai
 
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779Best VIP Call Girls Noida Sector 75 Call Me: 8448380779
Best VIP Call Girls Noida Sector 75 Call Me: 8448380779
 
Dwarka Sector 26 Call Girls | Delhi | 9999965857 🫦 Vanshika Verma More Our Se...
Dwarka Sector 26 Call Girls | Delhi | 9999965857 🫦 Vanshika Verma More Our Se...Dwarka Sector 26 Call Girls | Delhi | 9999965857 🫦 Vanshika Verma More Our Se...
Dwarka Sector 26 Call Girls | Delhi | 9999965857 🫦 Vanshika Verma More Our Se...
 
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort ServiceEnjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
 
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
₹5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
 
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptxAWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
 
VIP 7001035870 Find & Meet Hyderabad Call Girls LB Nagar high-profile Call Girl
VIP 7001035870 Find & Meet Hyderabad Call Girls LB Nagar high-profile Call GirlVIP 7001035870 Find & Meet Hyderabad Call Girls LB Nagar high-profile Call Girl
VIP 7001035870 Find & Meet Hyderabad Call Girls LB Nagar high-profile Call Girl
 
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
 
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
 
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Ashram Chowk Delhi 💯Call Us 🔝8264348440🔝
 
Hot Service (+9316020077 ) Goa Call Girls Real Photos and Genuine Service
Hot Service (+9316020077 ) Goa  Call Girls Real Photos and Genuine ServiceHot Service (+9316020077 ) Goa  Call Girls Real Photos and Genuine Service
Hot Service (+9316020077 ) Goa Call Girls Real Photos and Genuine Service
 
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
 
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
 
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Rohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
 
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
 
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service AvailableCall Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
 

ChinaNetCloud - Using Zabbix Monitoring at Scale - Zabbix Conference 2014

  • 1. Running the World’s Internet Servers By Steve Mushero September, 2014 Running the World’s Internet Servers www.ChinaNetCloud.com Zabbix At Scale Build & Manage Servers  Optimize & Manage Servers  Managed Cloud Servers Copyright © 2014 ChinaNetCloud
  • 2. Running the World’s Internet Servers www.ChinaNetCloud.com Greetings I’m Steve I’m from Shanghai, China We have a big Internet there We have a big business We have a big monitoring system That’s Zabbix Let me tell you more about it . . . 2
  • 3. First, a Word about Numbers My Title: 2500 hosts & 1 million items Well, I wrote that in spring We’re not quite there yet Today – about 1250 hosts & 300K items With 75K triggers 500 templates 25+ graphs per host 200 screens & maps (60% customer specific) 10 proxies around the world About 550 new values per sec (NVPS) Hope you’re not too disappointed Running the World’s Internet Servers www.ChinaNetCloud.com Still lots to share 3
  • 4. Running the World’s Internet Servers www.ChinaNetCloud.com More Numbers This is a busy system About 1250 hosts & 300K items Billions of data points in DB Hundreds of users 78,000 Alerts/Events per month 2,600 per day Mostly false alerts Working hard to reduce these 1,000+ Actionable Tickets per month 4
  • 5. Running the World’s Internet Servers www.ChinaNetCloud.com What we do Internet Managed Service Provider But higher-end than most We build & run large-scale Internet systems In any data center in Asia, USA, and beyond . . . Our customers have about 250M users We have over 100 customers 1000 servers for 100 customers <> 1000 for yourself We run about 200 DB servers, and dozens or hundreds of everything else We run every possible service and system (Linux-based) 5
  • 6. Running the World’s Internet Servers www.ChinaNetCloud.com How we do this About 100 employees, 50% in Operations About 25 integrated systems Zabbix, ticket, customer, procedure, logs, CMDB, etc. 7x24 Operations Continual new customers/systems Continual changes to systems/monitoring Customers rarely test things VERY dynamic environment Exact opposite of most Enterprises Think Wild, Wild West 6
  • 7. How we use Zabbix Core monitor system for six years Alerts link to our Procedure system Via URL in Trigger, customized We are on 1.8.3 and -> 2.2 this month Other systems feed it via sender (DWM, SEC) Heavily modified dashboard with ticket integration Automation & reporting We have 100 day-time users, about 25 hard-core All our customers use it, varying degrees of use Lots of macros (and reports) Running the World’s Internet Servers www.ChinaNetCloud.com 7
  • 8. Our Ops Area – Zabbix Everywhere Running the World’s Internet Servers www.ChinaNetCloud.com Image of AR/Ops area 8
  • 9. How we use Zabbix Running the World’s Internet Servers www.ChinaNetCloud.com Very dynamic system Things changing every day, every hour Make changes every day, by dozens of people Servers coming & going, updating templates Very easy to misalign, misconfigure – have safety report We have 500 templates About 25 that get used for every customer By service – Apache, MySQL, etc. 9
  • 10. How we use Zabbix – Templates Running the World’s Internet Servers www.ChinaNetCloud.com About 500 templates Many customer specific or obsolete Trying to reduce to < 100 One core Template for Linux 163 Items 79 Triggers Will soon split to RedHat vs. Ubuntu One Template per Service MySQL, Apache, Nginx, many, many more SNMP devices also, messy ports 10
  • 11. How we use Zabbix – Agents Agents on all non-SNMP hosts Running the World’s Internet Servers www.ChinaNetCloud.com 100% Passive Agents Server Security issues with Active No good way to lock down 100 locations So every item/new value is an update to DB LOTS of custom scripts – 25+ MySQL – Real-time any variable/config/status Java – Custom JMX reader Lots of services – Haproxy, Apache, Node.js, etc. Linux /proc reader – TCP, conntrack iostat/vmstat RAID/Hardware CatchAll low disk filesystem – Very important 11
  • 12. How we use Zabbix – Pushing Data Several monitors that push data We patched Server to allow direct reporting Normally must push data to proxy Running the World’s Internet Servers www.ChinaNetCloud.com Very painful globally Syslog via SEC Pushes fixed set of items like OOM Clear manually in Dashboard GUI Distributed Web Monitors (DWM) Push status, failures to server 12
  • 13. How we use Zabbix – Web Checks Running the World’s Internet Servers www.ChinaNetCloud.com We don’t use any more Not distributed (until 2.2) Can’t get error reasons Makes troubleshooting harder Easy to break triggers Built our own Distributed Web Monitor Runs on nodes / proxies Hosts are in zones/nodes Nodes pull config Parallel checks on nodes Detailed errors/triggers Single node per host Soon to be multi-node with voting More sensitive, less false-alerts 13
  • 14. How we use Zabbix – Other Parts Running the World’s Internet Servers www.ChinaNetCloud.com Screens Used heavily Hard to build, need automation XML Import/Export Not used much But powerful API Use for adding servers Use to pull graphs Maps Use a few Automate in future IT Services & Reports – Don’t use Discovery – Thinking how to use 14
  • 15. Experience Running the World’s Internet Servers www.ChinaNetCloud.com 15
  • 16. Our General Experience Running the World’s Internet Servers www.ChinaNetCloud.com Overall it’s very good False alerts kill us Especially unreachables Proxies kill us Lots of patches/changes Long list of improvements We will be world’s largest user 16
  • 17. Issues & Improvements Time checks – Fuzzy proxies Totally useless globally No solution yet, just SQL reports What is unreachable, when, how ? Many agent improvements needed Running the World’s Internet Servers www.ChinaNetCloud.com Linux memory RSS iostat devices We have long list Latching – Need a way to latch To keep short-lived events up Can use hysteresis, but not great 17
  • 18. Running the World’s Internet Servers www.ChinaNetCloud.com Scale issues Housekeeper Large templates – core Linux template Database size Disk I/O challenges Operations time out Or lockup system Proxies challenging Poor Permission System 18
  • 19. Zabbix Details Running the World’s Internet Servers www.ChinaNetCloud.com 19
  • 20. Our Zabbix Architecture Central Server in Shanghai Proxies in 10 locations in 5 countries Most support many customers Some dedicated in customer data centers System consists of 3 large VMs: Running the World’s Internet Servers www.ChinaNetCloud.com Web GUI Zabbix Serer MySQL Database Looking at geographic HA options Need to move all hosts to proxies first 20
  • 21. Running the World’s Internet Servers www.ChinaNetCloud.com Hardware History 2008 - Dell R200 Virtualized – up to 350 hosts 2010 - Dell 1950 Virtualized – up to 750 hosts Integrated single larger VM – GUI, Server, DB 2013 - Dell R420 Virtualized with SAS – up to 1250 hosts Split VMs – GUI, Server, DB 96GB or RAM 2014 - Dell R420 Virtualized with SSD – should go 5-10X Split VMs – GUI, Server, DB & slave DB for reporting/backup 128GB or RAM, looking at 256GB Bottleneck is always DB RAM & I/O CPU always okay Eventually 256-512GB of RAM and 500GB SSD - Happiness 21
  • 22. Running the World’s Internet Servers www.ChinaNetCloud.com Database Details MySQL Percona 5.5 w/ very optimized config Will go to 5.6 when 5.6 is more stable No query cache Main DB about 200GB in size Expect 1TB in 2015 History is about 2 billion rows of data, 157GB size Trends is 500 million, 37GB size Backups take 8 hours (including compress/encrypt) Has 48GB of Innodb Buffer (will expand to 64+ soon) Will go to 96, 128, 256 over time Generally runs well, except I/O bottleneck 22
  • 23. Zabbix Server Config Running the World’s Internet Servers www.ChinaNetCloud.com Nothing that special Pollers – Use a lot Max num ~500 in most cases Can be 90-100% busy DB Sync is slow, screws up time Data time-stamped on db sync, not received Housekeeper On, Scheduled 100% passive Agents locked down to server/proxies Don’t use Actions 23
  • 24. Running the World’s Internet Servers www.ChinaNetCloud.com Housekeeper Total rows in history tables: 2 BILLION Hourly housekeeping delete: 7 MILLION Takes 3 hours to run Killing us if 8 hour backups are running Did not partition yet, problems in 2.0/MySQL Changed code to make it configurable Still very slow, as item by item Now only run during day, avoid backups Hourly (shortest option) Will go to partitions in 2.2 24
  • 25. Running the World’s Internet Servers www.ChinaNetCloud.com Proxies We hate proxies, sometimes Great in theory, but Always getting stuck – very poor network handling in 1.8 Fuzzytime() useless for time check – no solution Server time is not data gathered time We run them in many data centers, countries With often poor connections to main servers We build SQL tools, with PHP GUI coming soon Poller Queue, Sending Queue, Oldest unsent value Coping Use cron to check queue and auto restart Use auto re-route to improve traffic via TCP relays Hoping for better TCP/IP in 2.x 25
  • 26. Customizations Running the World’s Internet Servers www.ChinaNetCloud.com 26
  • 27. Customizations – GUI Dashboard Mostly change columns & colors We only use Top Issues section Running the World’s Internet Servers www.ChinaNetCloud.com Added custom columns Mostly ticket system integration Ticket numbers & status/history ACKs include person’s initials Status if someone logs into host Changed popup menus Links to various systems Create ticket popups Triggers Dynamic link to procedures Sends hostname and profile fields New dynamic priority system Rule & time-based priority changes Auto escalation, tracking 27
  • 28. Customizations – GUI Dashboard Running the World’s Internet Servers www.ChinaNetCloud.com 28
  • 29. Customizations – GUI Dashboard Running the World’s Internet Servers www.ChinaNetCloud.com 29 Popup Menu Customized Ping – Via ssh/get/snmp Server Summary Custom history, owner, contact Create Ticket Wikis for each level
  • 30. Customizations – GUI Dashboard Running the World’s Internet Servers www.ChinaNetCloud.com Several Dashboards Main Dashboard - Internal Secondary Dashboard We push things we can’t fix Customer Dashboard More limited view Graphs American Dates Fonts & Sizes Many more coming Annotations, Max limits LOTS more for future improvements 30
  • 31. Customizations – Graphs Running the World’s Internet Servers www.ChinaNetCloud.com 31
  • 32. Customizations - Other Running the World’s Internet Servers www.ChinaNetCloud.com Housekeeper Configurable schedule Avoids conflict with backups Direct data sender Bypass proxies Major improvement Ping Host – Can’t use behind proxy Changed to use zabbix_get, ssh-key Use SNMP tool for network hosts A LOT more changes planned for 2.2 Dozens, including agent, proxy UI 32
  • 33. Automation & Reporting Running the World’s Internet Servers www.ChinaNetCloud.com 33
  • 34. Running the World’s Internet Servers www.ChinaNetCloud.com Automation Lots of direct DB interaction Dozens of Reports Safety checks Monthly reports Metrics reports Auto Ticket Creates tickets based on rules Auto Closer Closes tickets if alert disappears Auto Priority/Escalate Rules-based event driver Dynamic priority engine Push new hosts in via API Some still direct DB inserts 34
  • 35. Running the World’s Internet Servers www.ChinaNetCloud.com Reporting Two key types: Customer Reports – Monthly / on demand Custom selection by server, customer, graph Use API to pull real graphs Pull data from DB for metrics Auditing – SQL-based Ad Hoc – Lots of SQL-based Safety Reports – Look for problems 35
  • 36. Running the World’s Internet Servers www.ChinaNetCloud.com Safety Reports Remember – We are in the Wild West Look for mistakes Direct from DB See my DB talk for details So many changes by so many people Not enough automation yet Dozens of people making changes Not all as careful as we’d like Easy to break things Hard to know or detect 36
  • 37. Safety Reports – Examples Items that differ from template Running the World’s Internet Servers www.ChinaNetCloud.com Missing templates Disable items/hosts, forget to enable Alerts with no URL/Wiki Hosts missing profile data Items disabled conflict with trigger Web alerts with no trigger Web alerts with long/short timeouts Hosts in wrong, duplicate, conflicting groups Servers in Zabbix, not core system 37
  • 38. Running the World’s Internet Servers www.ChinaNetCloud.com Security Pretty good Very good for customers One group per customer Read-only rights BUT, not granular enough Either read or write Can’t have different level employees Can’t limit tasks / functions Useless for ops company Audit system great But GUI & reports useless We have custom reports We’ll modify this in 2.2 Not sure how yet 38
  • 39. Running the World’s Internet Servers www.ChinaNetCloud.com Future Lots more basic improvements to make Changing event processing, rule-based Thinking correlation engines Adding new Agent features in 2015 New GUI & Security & Audit & Reports Thinking about 10X scale for 2015-16 10,000 servers, millions of items Architecture, Federation, ??? Wondering about 100X scale 39
  • 40. Running the World’s Internet Servers www.ChinaNetCloud.com Summary Zabbix is a great system It’s critical to our business Still the best system out there We’ve invested a lot, more to go Still lots of improvements to make Glad to see vibrant user & developers Happy to be part of the Zabbix ecosystem Happy to share all we know & have learned 40
  • 41. Thanks from ChinaNetCloud Pioneers in OaaS – Operations as a Service Running the World’s Internet Servers www.ChinaNetCloud.com 41
  • 42. ChinaNetCloud Shanghai Headquarters: X2 Space 1-601, 1238 Xietu Lu Shanghai, 200032 China T: +86-21-6422-1946 F: +86-21-6422-4911 Beijing Office: Lee World Business Building #305 57 Happiness Village Road, Chaoyang District Beijing, 100027 China Sales@ChinaNetCloud.com www.ChinaNetCloud.com Silicon Valley Office: California Avenue Palo Alto, 94123 USA Running the World’s Internet Servers www.ChinaNetCloud.com 42