Operating systems monitor resources continuously in order to effectively schedule processes.
In this webinar, Evan Mouzakitis (Datadog) discusses how to get operational data from Windows Server 2012 using a variety of native tools.
Lifting the Blinds: Monitoring Windows Server 2012
1. Read the full guide at: http://www.datadoghq.com/blog/monitoring-windows-server/
2. • SaaS-based infrastructure and app monitoring
• Open Source Agent
• Time series data (metrics and events)
• Processing nearly a trillion data points per day
• Intelligent Alerting and Insightful Dashboards
Datadog Overview
3. Operating Systems, Cloud Providers (AWS), Containers, Web Servers, Datastores,
Caches, Queues and more...
Monitor Everything
4. Agenda
- Why should I monitor Windows Server?
- What are some indicators of performance issues?
- How can I collect performance metrics for analysis?
9. CPU: ContextSwitchesPersec
What it tracks:
Number of times the processor switched to a new thread
Correlate with:
Memory: PageFaultsPersec
Disk: DiskTransfersPersec
Network: BytesSentPersec/BytesReceivedPersec
Issue resolution:
Add processors; partition threads, DPCs, or hardware interrupts; disable I/O counters
10. CPU: PercentProcessorTime
What it tracks:
Percentage of time spent performing work (not idle)
Correlate with:
ProcessorQueueLength
Issue resolution:
More processors, bigger instance, optimize offending application
15. Memory: PoolNonpagedBytes
What it tracks:
Amount of non-paged memory in use
Correlate with:
Windows Event 2019 “Nonpaged Memory Pool Empty”
Issue resolution:
Identify troublesome driver/roll back to known good state
16. Memory: PageFaultsPersec
What it tracks:
Rate of page faults
Correlate with:
PagesInputPersec
Issue resolution:
Increase system memory
17. Memory: PagesInputPersec
What it tracks:
Rate pages are read (from disk) into memory
Correlate with:
PageFaultsPersec / DiskTransfersPersec
Issue resolution:
Increase system memory, move page file to separate physical disk
19. Disk: AvgDiskQueueLength
What it tracks:
Running average of I/O ops in queue
Correlate with:
DiskTransfersPersec
Issue resolution:
Move data for I/O-intensive applications to separate disk; add disks to system
20. Disk: DiskTransfersPersec
What it tracks:
Aggregate I/O rate
Correlate with:
AvgDiskQueueLength
Issue resolution:
Move data for I/O-intensive applications to separate disk; add disks to
system; increase disk cache
21. Disk: PercentIdleTime
What it tracks:
Percent of time disk is idle
Correlate with:
AvgDiskQueueLength
Issue resolution:
Move page file to separate disk; add disks to system; use SSDs
24. PowerShell
- Windows’ scripting language (no more batch files!)
- Powerful language with deep OS support
- Integrates with C# natively
- Output is typed (unlike *NIX)
28. Windows Performance Toolkit
Requires Windows Assessment and Deployment Kit (formerly Windows Performance Toolkit)
https://www.microsoft.com/en-US/download/details.aspx?id=39982
Our goal is to help you monitor everything from all levels of your stack
so that you can make intelligent, data-driven decisions about your applications and infrastructure.
Why monitor Windows in the first place?
Monitoring the performance of the applications that run your business is critical, but applications don’t live in a vacuum. Applications often interact with the underlying operating system to request resources, preempt the execution of other processes, access hardware devices, and more.
Being aware of the health and performance of the operating system gives you more information when troubleshooting issues anywhere higher up in the stack (not to mention that monitoring the operating system is critical for insight into hardware issues). For example, is a SQL Server database query slow because of the query itself, or because the SQL Server is also hosted alongside Exchange and they are competing for disk access?
These kinds of issues can only be surfaced when you monitor both the application in question and the underlying operating system.
A monitoring plan typically tries to cover work metrics, resource metrics, and non-metric data like events or code changes. Because the operating system is the broker between applications and hardware resources, when monitoring Windows Server we are primarily focused on resource metrics, since those are what the operating system manages. Work metrics are usually more applicable to application-level monitoring, but as you will see, there are a few work metrics related to disk access that we’ll cover here too.
What kind of resources are we interested in monitoring? What kinds of metrics can we surface from those resources?
Generally speaking, the most useful resources to monitor are CPU, RAM, disk, and network. Data like power consumption, thermal readings, and noise, while useful, don’t usually add meaningful context to application or operating system performance issues.
At the highest level, the following metrics are useful in assessing CPU performance, and can shed light on performance bottlenecks depending on the kind of work the CPU spends most of its time performing.
ContextSwitchesPersec tracks the number of times the processor switched to a new execution context. Context switches are computationally expensive; before the processor can enter the execution context of another thread, it must first save the current context, push the old context to the bottom of its priority queue, find the highest priority queue containing an executable thread, pop it from its queue, load its context, and finally execute the thread.
In a multi-core machine (common today), context switching adds significant overhead. By default, the Windows Task Manager measures I/O per-process, and attributing I/O to a particular process in a multi-core, multithreaded environment can have a drastic performance impact under heavy I/O loads. If that’s the case, you would benefit from disabling global and per-process I/O counters by adding a CountOperations entry as a REG_DWORD with a value of 0 to the registry under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\I/O System\
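The registry change above can be scripted with built-in cmdlets. This is a minimal sketch, not a definitive procedure: run it from an elevated (Administrator) session, and back up the registry before making changes.

```powershell
# Sketch: disable global and per-process I/O counters (run as Administrator).
$path = 'HKLM:\SYSTEM\CurrentControlSet\Control\Session Manager\I/O System'

# The "I/O System" key may not exist yet; create it if needed.
if (-not (Test-Path $path)) {
    New-Item -Path $path -Force | Out-Null
}

# Add CountOperations as a REG_DWORD set to 0.
New-ItemProperty -Path $path -Name 'CountOperations' -PropertyType DWord -Value 0 -Force | Out-Null
```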
PercentProcessorTime is a metric most everyone is familiar with, even if they don’t know the name. It tracks the percentage of time the CPU was doing something. In and of itself, this metric isn’t all that useful. For example, if I’m analyzing data on a single-core machine, I’d expect the CPU to be in use 100 percent of the time.
However, when correlated with ProcessorQueueLength, which tracks the number of pending threads, you have enough information to determine whether or not the system is suffering a CPU bottleneck. A queue length greater than 2 * the number of processors, coupled with prolonged periods of maxed-out CPU utilization, very clearly indicates that the system does not have enough processor resources to perform all of its tasks.
The processor queue length reflects the number of threads that are ready to run but are not able to use the processor. A healthy processor queue length is about 2 * the number of processors on the system. Even on multicore machines, there is only one ProcessorQueueLength performance counter. High values for this counter very clearly indicate CPU contention. You can correlate this metric with other CPU metrics like PercentProcessorTime, PercentPrivilegedTime, PercentDPCTime, and PercentInterruptTime to determine where the CPU is spending its time, and to narrow down whether the CPU is the bottleneck causing the backed-up queue.
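As a sketch, the rule of thumb above (queue length greater than 2 * the processor count) can be checked with the built-in Get-Counter cmdlet; the counter path assumes an English-locale system.

```powershell
# Sketch: flag possible CPU contention when the queue exceeds 2x the logical processor count.
$procs = (Get-CimInstance Win32_ComputerSystem).NumberOfLogicalProcessors
$queue = (Get-Counter '\System\Processor Queue Length').CounterSamples[0].CookedValue

if ($queue -gt 2 * $procs) {
    Write-Output "Possible CPU bottleneck: queue length $queue with $procs logical processors"
}
```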
Hardware devices demand real-time, unfettered access to the CPU in order to ensure that high-priority work (like accepting keyboard input) is performed when it is needed. Interrupts provide a means by which devices can interrupt the processor and force it to perform the requested operation (triggering the processor to perform a context switch). Some work from devices may be put off until later, but it still must be accomplished in a timely manner. Enter DPCs.
Through DPCs, real-time processes like device drivers can schedule lower-priority tasks to be completed after higher-priority interrupts are handled. DPCs are created by the kernel, and can only be called by kernel mode programs.
A large or near-constant number of DPCs could point to issues with low-level system software. An unused but buggy sound driver could be the culprit, for example.
This trio of metrics, taken together, help to shed light on where the CPU is spending its time.
In particular, privileged time reflects the time spent executing instructions for kernel-mode programs. Code executing in privileged mode has unrestricted access to the system’s hardware. This includes device drivers, core operating system functions, etc.
If you observe a system spending 30 percent or more of its time processing privileged instructions, check the values of PercentDPCTime and PercentInterruptTime. If either of those two metrics report values greater than 20%, it is likely that a poorly written device driver, or very busy peripheral is the culprit.
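The check described above can be sketched with Get-Counter (English-locale counter paths assumed):

```powershell
# Sketch: when privileged time is high, break it down into DPC and interrupt time.
$samples = (Get-Counter -Counter @(
    '\Processor(_Total)\% Privileged Time',
    '\Processor(_Total)\% DPC Time',
    '\Processor(_Total)\% Interrupt Time'
)).CounterSamples

foreach ($s in $samples) {
    '{0}: {1:N1}' -f $s.Path, $s.CookedValue
}
```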
As with CPU metrics, Windows exposes a wealth of performance counters tracking memory statistics. We’ve omitted AvailableMemory and similar metrics from this webinar because they are pretty self-explanatory. The three listed here, PageFaultsPersec, PoolNonpagedBytes, and PagesInputPersec, provide insight into the nature of issues which may be impacting performance. We’ll touch on each in turn, but at a high level: PageFaultsPersec tracks the rate of page faults, PoolNonpagedBytes describes the current size of non-pageable memory, and the last, PagesInputPersec, describes the rate of pages read from disk (which is distinct from the number of page reads from disk).
Windows maintains two general pools of memory: a paged pool and a non-paged pool. The paged pool is for general use and is the pool used by all user-space applications for memory allocation. Because user-space applications are more tolerant of latency, or, to put it another way, because user-space applications don’t generally have real-time requirements, they can get by if the requested memory needs to be read in (or paged in) from disk.
Because kernel-level software has real-time execution requirements, device drivers and the like make use of the non-paged pool. The non-paged pool is guaranteed to reside in physical memory at all times, with no possibility of being paged to disk (hence the name “non-paged”). This significantly reduces latency by preventing the possibility of page faults.
No memory pool is infinite, and poorly written device drivers could end up exhausting the entire non-paged pool if left unchecked. If you are seeing reports of Event 2019, it’s already too late. Keeping an eye on the size of this pool and its growth over time is necessary to identify and deal with any troublesome drivers or hardware.
Page faults occur when a thread references a page that is not in the current set of memory-resident pages. Because the thread can’t perform its work without the requested memory, a hardware interrupt occurs, the processor enters into kernel-mode (resulting in a context switch—both upon entering and exiting kernel-mode), and attempts to locate the page in memory. If the page is found somewhere else in memory, it is that address which is returned to the requesting thread. This is called a “soft” page fault. If the page is not elsewhere in memory the kernel will look in the page file and read it into memory. This is called a “hard” page fault. Because this operation requires accessing the disk, it is more computationally expensive to perform this type of lookup.
Page faults occur under normal operating conditions, but a spike in page faults could result in serious performance degradation, depending on the “hardness” of the fault.
By tracking the page fault rate alongside the page input rate, you can differentiate between hard and soft page faults. High values of both metrics unequivocally indicate hard page faults. There’s not much you can do to prevent soft page faults from occurring, but increasing the amount of RAM available on the system is a straightforward way of alleviating hard page faults.
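A sketch of sampling both rates together with Get-Counter (English-locale counter names assumed):

```powershell
# Sketch: sample page faults and page inputs together; high values of both suggest hard faults.
Get-Counter -Counter '\Memory\Page Faults/sec', '\Memory\Pages Input/sec' `
            -SampleInterval 1 -MaxSamples 5
```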
It is worth mentioning that when a hard page fault does occur, Windows attempts to retrieve multiple, contiguous pages into memory, to maximize the work performed by each read. This, in turn, can potentially increase a page fault’s performance impact, as more disk bandwidth is consumed reading in potentially unneeded pages. All of this can potentially be avoided by putting your page file (see next section) on a separate physical (not logical) disk, or increasing the amount of RAM available to your system.
As I mentioned, there are two types of page faults, and tracking PagesInputPersec alongside PageFaultsPersec gives you the information you need to determine the type of page fault occurring. If you are seeing high values of both metrics, the page faults are hard.
The effects of hard page faults can be exacerbated if disk is a contentious resource. To give a simplified example, if you have a system with one disk and it’s running an I/O-intensive application, page faults will hit this system harder (and performance will degrade in the application) because Windows is competing with the application for disk access (and Windows always wins). This goes to show that an excessive number of page faults can be responsible for system-wide effects, completely unrelated to the application experiencing performance degradation.
Though there are many disk metrics worth tracking, I’ve distilled the list to the most essential, while omitting the obvious, like PercentFreeSpace.
The AvgDiskQueueLength counter gives an estimated average of the number of I/O operations currently awaiting execution. Generally speaking, this counter should not exceed 2 * the number of drives on the system. If you are seeing greater values than that, it means the system cannot service the number of I/O requests it’s receiving in a timely manner, which can lead to processing delays, degraded application performance, and more.
DiskTransfersPersec is an aggregate measure of both disk reads and writes. It is useful for shedding light on the cause of bottlenecks. High values for this metric do not always indicate issues; for example, if you are running I/O-intensive applications on your server, you are definitely going to observe high values for this metric (and most likely low values for PercentIdleTime). However, if I/O ops are not being enqueued (per the AvgDiskQueueLength metric) and applications are not hurting for memory (and thus not paging to disk), there should be no observable performance impact.
PercentIdleTime is a pretty intuitive metric that tracks the percent of time disks are idle. Depending on the role of the system under investigation, low idle times may be expected, especially when running I/O-intensive applications like SQL Server or Exchange. If that’s not the case, low values should be investigated. If you don’t already have your page file stored on a separate drive, you should do so. Otherwise, consider either adding disks to the system to increase performance, or swapping out HDDs for SSDs if possible.
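The three disk counters discussed above can be sampled together; a sketch with Get-Counter (English-locale counter names assumed):

```powershell
# Sketch: the three disk counters discussed above, aggregated across physical disks.
Get-Counter -Counter @(
    '\PhysicalDisk(_Total)\Avg. Disk Queue Length',
    '\PhysicalDisk(_Total)\Disk Transfers/sec',
    '\PhysicalDisk(_Total)\% Idle Time'
)
```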
Windows offers numerous methods by which you can collect, store, and visualize system performance data. Because the methods are so varied, I will only go through a couple of the tools that I have experience with. All of the tools mentioned are native to Windows Server 2012 R2 so you can get up and running quickly.
Reading performance counters does not generally appear to have much of an impact on system performance. In my tests, collecting 2631 counters at a 1-second sample rate caused a 4 percent increase in user CPU usage (by perfmon).
There are a few things to keep in mind, though: depending on the data collected and the duration of the collection, the collected data could be very large. To give you an idea about the size of the data collected, in a test collecting handle and kernel base events, pagefaults, cpu, I/O and memory samples, the data grew at a rate approaching 100 MB/min.
Additionally, if you are collecting data from your local machine, you may see occasional spikes in I/O latency; in my tests I observed response times for some user space applications in excess of 2000 ms!
Also, I did not attempt to collect performance counters from user applications which may have an impact on the application’s performance. And as I mentioned earlier in the CPU section, if you are sampling I/O with processor-specific information, you most certainly will observe degradation in performance.
PowerShell is great for collecting performance counters programmatically. You can query the event log from PowerShell as well, and you can use it to collect metrics from both local and remote machines.
Here are some example PowerShell commands for retrieving CPU-related performance counters. As you can see, there is a regular pattern. For a full list of commands to retrieve performance counters for CPU, memory, disk, network, and events, check out my “How to collect Windows Server 2012 metrics” article on the Datadog blog: https://www.datadoghq.com/blog/collect-windows-server-2012-metrics/#toc-powershell
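To illustrate the pattern (the full list is in the linked article), here is a sketch of the kind of WMI-backed queries involved; class and property names follow the standard Win32_PerfFormattedData convention.

```powershell
# Sketch: CPU-related counters via the formatted WMI performance classes.
Get-WmiObject -Query 'SELECT PercentProcessorTime FROM Win32_PerfFormattedData_PerfOS_Processor'
Get-WmiObject -Query 'SELECT ContextSwitchesPersec FROM Win32_PerfFormattedData_PerfOS_System'
Get-WmiObject -Query 'SELECT ProcessorQueueLength FROM Win32_PerfFormattedData_PerfOS_System'
```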
One last thing about PowerShell: if you want to do something and there’s no pre-packaged cmdlet to get you what you want, you can always interface with WMI to get what you’re looking for.
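For example (a sketch; the server name is hypothetical), WMI performance classes can be queried directly, locally or remotely:

```powershell
# Sketch: query a WMI performance class directly when no cmdlet exists.
Get-WmiObject -Class Win32_PerfFormattedData_PerfOS_Memory |
    Select-Object AvailableMBytes, PageFaultsPersec

# The same class on a remote machine ("SERVER01" is a hypothetical host name).
Get-WmiObject -Class Win32_PerfFormattedData_PerfOS_Memory -ComputerName 'SERVER01'
```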
In my honest opinion, perfmon is not nearly as useful as xperf or Windows Performance Recorder when it comes to investigating performance issues. It is a good tool to help spot issues, but not so good for getting into the nitty-gritty. Here’s a screenshot of perfmon collecting the “System Performance” counter set provided out of the box. As you can see, there is a lot going on. My investigation focused on the cause of excessive memory use, visualized as the black bar nearly pinned to the 100 mark. From this image it’s clear that something is going on, but since I was only collecting total memory usage (as opposed to per-process counters), it isn’t clear which process is exhausting RAM. Determining the underlying cause in this case requires me to re-run perfmon, this time collecting per-process counters in addition to the total, and hope that my issue arises again. As you’re about to see, we can do better.
The Windows Performance Toolkit contains the Windows Performance Recorder (WPR) and Windows Performance Analyzer (WPA). Though technically not strictly “native” since it requires a download, it is a useful, graphical tool for collecting and analyzing Windows performance data, and it is made by Microsoft.
Windows Performance Recorder is a modern replacement for xperf. It features both graphical and command-line interfaces. Here you can see the available collection profiles. Collecting data with the Windows Performance Recorder is as easy as clicking “Start”.
Technically, Windows Performance Recorder (and xperf) do not merely collect performance counters; they are a tracing mechanism for collecting fine-grained performance data. As you will see, traces are superior to performance counters when investigating performance issues.