Velocity Conference 2011 presentation by New Relic CEO Lew Cirne. New Relic’s multitenant, SaaS web application monitoring service collects and persists over 90,000 metrics every second on a sustained basis, while still delivering an average page load time of 1.5 seconds. In this presentation Lew Cirne discusses how good architecture and good tools can help you handle an extremely large amount of data while still providing extremely fast service. He shows you how we scale to support customer growth, how we monitor our system, and what traps to look out for.
3. What our app does
APM as a Service
In-app agent instrumentation (BCI, etc)
150,000+ app processes monitored, globally (10K customers)
Each process reports a few hundred metrics per minute
5 Languages (Ruby, Java, PHP, .NET, Python)
4. Each day we collect 20 billion measurements,
from 150,000 application processes,
for over 10,000 customers.
5. All on 9 servers.
6. We capture “Timeslices”
Each one is about 250 bytes. A single tweet is about the same size.
Response Time
4 hours from 11:04 to 15:04
Count: 1242
Avg: 337 ms
Min: 0.63 ms
Max: 95669 ms
Std Dev: 782
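To make the timeslice idea concrete, here is a minimal sketch of a record that carries the statistics shown on the slide and can be merged with another record without keeping raw samples. The class name, field names, and the choice to store a running sum and sum of squares are assumptions for illustration; the talk does not describe the actual storage layout.

```python
import math
from dataclasses import dataclass


@dataclass
class Timeslice:
    """Hypothetical summary of one metric over one time window."""
    count: int = 0
    total_ms: float = 0.0       # running sum, used for the average
    sum_sq_ms: float = 0.0      # running sum of squares, used for std dev
    min_ms: float = math.inf
    max_ms: float = -math.inf

    def record(self, ms: float) -> None:
        """Fold one measurement into the summary."""
        self.count += 1
        self.total_ms += ms
        self.sum_sq_ms += ms * ms
        self.min_ms = min(self.min_ms, ms)
        self.max_ms = max(self.max_ms, ms)

    def merge(self, other: "Timeslice") -> "Timeslice":
        """Combine two summaries, e.g. two minutes into a longer window."""
        return Timeslice(
            self.count + other.count,
            self.total_ms + other.total_ms,
            self.sum_sq_ms + other.sum_sq_ms,
            min(self.min_ms, other.min_ms),
            max(self.max_ms, other.max_ms),
        )

    @property
    def avg_ms(self) -> float:
        return self.total_ms / self.count if self.count else 0.0

    @property
    def std_dev_ms(self) -> float:
        """Population standard deviation derived from the running sums."""
        if self.count == 0:
            return 0.0
        variance = self.sum_sq_ms / self.count - self.avg_ms ** 2
        return math.sqrt(max(variance, 0.0))
```

Because every field is a sum, a minimum, or a maximum, merging is associative, so one-minute records could in principle be rolled up into a 4-hour view like the one on the slide.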
7. timeslice insertion rate: 100K/second
>7 billion rows per day
Twitter peak insertion rate:
8K rows per second
9 Servers handle all
data collection
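As a rough sanity check on these figures, the arithmetic can be written out. The sketch below uses only numbers from the slides (100K inserts per second, about 250 bytes per timeslice, 9 servers, Twitter's 8K rows per second); the variable names are made up, and the byte figures cover raw payload only, ignoring index and row overhead.

```python
SECONDS_PER_DAY = 24 * 60 * 60          # 86,400

INSERT_RATE_PER_SEC = 100_000           # sustained rate from the slide
TIMESLICE_BYTES = 250                   # approximate size from the slide
SERVERS = 9
TWITTER_PEAK_PER_SEC = 8_000            # comparison figure from the slide

# A fully sustained 100K/second would be 8.64 billion rows per day,
# which is consistent with the slide's ">7 billion" lower bound.
rows_per_day = INSERT_RATE_PER_SEC * SECONDS_PER_DAY

# Raw payload throughput across the whole cluster: 25 MB/second.
bytes_per_sec = INSERT_RATE_PER_SEC * TIMESLICE_BYTES

# Average load per server if inserts were spread evenly (an assumption).
rows_per_sec_per_server = INSERT_RATE_PER_SEC / SERVERS

# Ratio to the quoted Twitter peak insertion rate: 12.5x.
ratio_to_twitter = INSERT_RATE_PER_SEC / TWITTER_PEAK_PER_SEC
```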
9. Collecting is one thing...
• We provide realtime monitoring
• One minute granularity
• Data is almost always stale
• Each user/account has different data
• Page caching and other easy solutions don’t work for us.
12. Main App Software stack
User Interface & REST API: Rails 2.3
Data Collectors: Servlets on Jetty
Data Store: MySQL, sharded by accounts
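Sharding by account means every row for a given account lives on one database, so a query for one customer touches one server. A minimal sketch of such routing follows; the shard names, the modulo mapping, and the function names are assumptions for illustration, since the talk does not say how accounts are assigned to shards.

```python
from collections import defaultdict

# Hypothetical shard names, one per collector/DB server.
SHARDS = [f"db{i:02d}" for i in range(1, 10)]


def shard_for_account(account_id: int) -> str:
    """Map an account to exactly one shard (simple modulo, an assumption)."""
    return SHARDS[account_id % len(SHARDS)]


def group_by_shard(rows):
    """Group (account_id, payload) rows so each shard gets one batch insert."""
    batches = defaultdict(list)
    for account_id, payload in rows:
        batches[shard_for_account(account_id)].append((account_id, payload))
    return dict(batches)
```

A fixed modulo is the simplest possible mapping. A lookup table from account to shard would make it easier to move an unusually large account to its own server, at the cost of one more piece of state to maintain.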
13. Simplified architecture...
Customer’s environment: HTTPS
9 Collector / Aggregator / DB’s
• Sustained 100K insertion rate per second
• 24 Core Intel Nehalem
• 48 GB RAM
• SAS attached RAID 5
• No Virtualization (either cloud or datacenter)
2 Web App Servers
• 12 Core Intel Nehalem
• 48 GB RAM
14. Even more data!
On May 17, we launched Real User Monitoring
• Using Episodes to measure browser load time of every page view
• Browser reports data to our ‘Beacon’ servers
• Monitoring >1 Billion page views per week
• Doubled our total inbound HTTP requests in a MONTH
15. Beacon Architecture
Real User Browsers: Billions of metrics from across the globe
RUM Beacons (Servlets): Capture and enqueue (in-memory). Response Time 0.15ms
Asynchronously aggregate and forward Timeslices to our Collectors
Currently at EC2
Over 1 Billion user sessions measured for performance in first month.
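The beacon pattern described here (capture and enqueue in memory on the request path, then aggregate asynchronously) can be sketched as follows. This is an illustration in Python, not the Java servlet code behind the beacons; the class name, the method names, and the per-account, per-minute aggregation key are all assumptions.

```python
import queue
import threading
from collections import defaultdict


class BeaconAggregator:
    """Hypothetical in-memory enqueue plus background aggregation."""

    def __init__(self):
        self._queue = queue.Queue()
        # (account_id, minute) -> [count, total_ms]
        self._totals = defaultdict(lambda: [0, 0.0])
        self._lock = threading.Lock()
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def capture(self, account_id, minute, load_ms):
        """Request path: enqueue only, so the response returns quickly."""
        self._queue.put((account_id, minute, load_ms))

    def _drain(self):
        """Background path: fold queued beacons into per-minute totals."""
        while True:
            account_id, minute, load_ms = self._queue.get()
            with self._lock:
                entry = self._totals[(account_id, minute)]
                entry[0] += 1
                entry[1] += load_ms
            self._queue.task_done()

    def flush(self):
        """Wait for queued beacons, then return and reset the aggregates."""
        self._queue.join()
        with self._lock:
            out = {key: (c, t) for key, (c, t) in self._totals.items()}
            self._totals.clear()
        return out
```

In this sketch, `flush()` stands in for the step that forwards aggregated timeslices to the collectors. Keeping the request handler down to a single enqueue is what would make a very low beacon response time plausible, at the cost of losing whatever is still in memory if the process dies.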
16. Challenges
• Data Purging
• Determining what to pre-aggregate
• Large Accounts
• MySQL Optimization and Tuning
• I/O performance - (virtualized to dedicated) ...