SlideShare una empresa de Scribd logo
1 de 22
Descargar para leer sin conexión
War stories
from .NET team
.NET Core Summer event 2019 – Linz, AT
Karel Zikmund – @ziki_cz
Agenda
• Stories
• Investigations on .NET team
• Not just from me
• Lessons learned on the way
You won’t see any:
• Source code
• Debugger
Not on agenda
My First Serious Investigation
• Build lab for Windows component
• Build break 1x per week
• AccessViolation dialog hangs machine
• Repro:
• Once in ~50 runs
• Overnight run: 247 crashes out of 77,006 runs (0.3%)
mscorwks!UTSemReadWrite::UnlockRead+0xe
mscorwks!CMDSemReadWrite::~CMDSemReadWrite+0x14
mscorwks!RegMeta::DefineParam+0x19
cscomp!EMITTER::EmitParamProp
…
…
…
cscomp!CController::Compile
csc!main
My First Serious Investigation
My First Serious Investigation
• Does it by a chance reproduce on only one machine?
• Answer: How did you know?
• But why always the same callstack?
• Good question, no good answer … magic
• Lesson learned: Debugging HW errors is costly and hard
• Always ask: Does it repro on more than 1 machine?
Another MetaData story
MetaData format background:
• Basically database – rows and columns
• Example – TypeDef table:
• Indexes into tables/heaps are either 2B or 4B
• What happens if last TypeDef has no methods?
• MethodList = Number of methods + 1 = max + 1
• What happens if there is 0xffff methods?
Flags TypeName TypeNamespace Extends MethodList
(Public) “Foo” “Awesome.Story” … Method #10
(Private) “Bar” “Awesome.Story” … Method #11
Another MetaData story
• II.24.2.6 “#~ stream”
• If e is a simple index into a table with index i, it is stored using 2 bytes if table i has less than
2^16 rows, otherwise it is stored using 4 bytes.
• II.22.37 TypeDef : 0x02
• 21. If MethodList is non-null, it shall index a valid row in the MethodDef table, where valid
means 1 <= row <= rowcount+1 [ERROR]
• How do you fix it?
• “I’m on the fence whether we should (fix it), given it looks like people hit this about once in 17
years”
• https://github.com/dotnet/corefx/issues/29554
• Lesson learned: Not all bugs have to be fixed
TypeSystem – Collapsing interfaces
• Table of implemented interfaces:
class A : I, J {}
• With generics:
class C<T> : L<T> {}
class D<T> : C<T>, L<string> {}
class E : D<string>, I {}
0 1
I J
0 1 2
I J K
0 1
L<T> L<string>
0 1
L<string> I
0 1 2
L<string> L<string> I
class B : A, K {}
Fix:
Breaking changes – Intro
• Everyone wants fix for their bug
• But nobody wants to be broken
• Observation: 10% of fixes have unintended side-effects
• Extreme case: Perf improvement can break app
• How many customers?
• Lesson learned: Everything has risk of breaking someone
Breaking changes – Last build
• Finance app crashing – “last” build of Windows 8 on arm (Surface RT)
• Latent bug (introduced months ago)
• Bug triggered by:
1. Method in NGen image has to be across 8KB pages
2. GC has to be triggered at least twice when it’s on stack
• Unrelated change caused “unlucky” method order for:
• System.Net.Configuration.DefaultProxySectionInternal..ctor
• Lesson learned: Anything, really ANYTHING, has risk of breaking
Breaking changes – Huge impact
• Patch to .NET Framework broke certain tax SW
• Printing tax forms
• Update pushed few days before tax deadline in US
• Note: Printing was tested on both sides (Microsoft & tax SW
company)
• But only into file, not to printer
• Lessons learned: Be extra cautious around sensitive dates
Breaking changes – Below you
• RavenDB – blue screen after KB4487017 on .NET Core!
• dotnet/coreclr#22597
• PrefetchVirtualMemory
• Kernel memory
management bug
Networking – Security issue
• January: Researcher running ML models on Cosmos
• Suspicion about buffers – more logging
• March: Repro gone
• May: Similar report
• +2 weeks: It blows up (more teams & impact)
• All hands on-deck
• Small repro (20 min, then 1 min) … yay!
• TTD trace (iDNA / TTT) … bonus & life saver
Networking – Security issue
• Root-cause: HTTP pipelining under stress
• 13 years old bug (.NET 2.0)
Response 1
Request 1
Server
Response 1
Request 1
Server
Request 2
Response 2
Networking – Security issue
Request 1
Server
Request 2Request 3
Response 1Response 2
Networking – Security issue
Request 1
Server
Request 2Request 3
Response 1Response 2
Networking – Security issue
• We have workaround (disable pipelining) – perf impact
• Worked fix …
• Verifying fix …
• Repro fails after 4h 
• Same symptoms
• Repro sensitive to cloud network load (8-17)
• TTD (iDNA / TTT) does not work 
• Suspicion about buffers again
Networking – Security issue
• Bad buffer lifetime management – on sending side!
• 5 years old bug (.NET 4.5.2)
• Trigger found:
• Thanks to Skype team – 24h deployment of experiments
• Change in .NET 4.7.1
• Fix around the problematic area
• Making the opportunity window SMALLER!
• … counter-intuitive
• Code review – similar bug on receiving side (5 years old)
• Same symptoms as HTTP pipelining
Networking – Security issue
• Why so many customers/services hit it at once?
• Maybe Spectre & Meltdown fixes roll out?
• or just … magic
• Lesson learned: Weird coincidences can happen …
Optimizations
• Once upon a time, … there was a service in Microsoft
• List vs. array data structure perf
• Perspectives:
1. The data structure will have in practice 3-5 items
2. There 3 hops between servers for each request!!!
• Lesson learned: Avoid premature optimizations … at all cost
Lessons learned
• Always ask: Does it repro on more than 1 machine?
• Debugging HW bugs is costly
• Some bugs happen once in 17 years
• Spec bugs are hard to fix
• MetaData format bug
• Anything, really ANYTHING, has risk of breaking someone
• Innocent changes can trigger latent bugs elsewhere
• Impact may be huge – e.g. during tax season
• Always try to create small repro
• Make your and everyone’s life easier
• TTD (iDNA / TTT) is life saver
• Avoid premature optimizations … at all cost, save your time
• … sometimes there is just … magic
@ziki_cz
Thank you
• Feedback welcome
• Twitter DM, email, in-person, etc.
• Survey
• What you liked vs. not?
• Too rushed?
• Hard to understand?
• Boring?
• Didn’t meet your expectations?
@ziki_cz

Más contenido relacionado

Similar a .NET Core Summer event 2019 in Linz, AT - War stories from .NET team -- Karel Zikmund

Case Study of the Unexplained
Case Study of the UnexplainedCase Study of the Unexplained
Case Study of the Unexplainedshannomc
 
Patching Windows Executables with the Backdoor Factory | DerbyCon 2013
Patching Windows Executables with the Backdoor Factory | DerbyCon 2013Patching Windows Executables with the Backdoor Factory | DerbyCon 2013
Patching Windows Executables with the Backdoor Factory | DerbyCon 2013midnite_runr
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...smallerror
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...xlight
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitterRoger Xia
 
Vulnerability Inheritance in ICS (English)
Vulnerability Inheritance in ICS (English)Vulnerability Inheritance in ICS (English)
Vulnerability Inheritance in ICS (English)Digital Bond
 
Debugging multiplayer games
Debugging multiplayer gamesDebugging multiplayer games
Debugging multiplayer gamesMaciej Siniło
 
BSides MCR 2016: From CSV to CMD to qwerty
BSides MCR 2016: From CSV to CMD to qwertyBSides MCR 2016: From CSV to CMD to qwerty
BSides MCR 2016: From CSV to CMD to qwertyJerome Smith
 
Алексей Ященко и Ярослав Волощук "False simplicity of front-end applications"
Алексей Ященко и Ярослав Волощук "False simplicity of front-end applications"Алексей Ященко и Ярослав Волощук "False simplicity of front-end applications"
Алексей Ященко и Ярослав Волощук "False simplicity of front-end applications"Fwdays
 
Silicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
Silicon Valley Code Camp 2015 - Advanced MongoDB - The SequelSilicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
Silicon Valley Code Camp 2015 - Advanced MongoDB - The SequelDaniel Coupal
 
Asufe juniors-training session2
Asufe juniors-training session2Asufe juniors-training session2
Asufe juniors-training session2Omar Ahmed
 
TDD and Related Techniques for Non Developers (2012)
TDD and Related Techniques for Non Developers (2012)TDD and Related Techniques for Non Developers (2012)
TDD and Related Techniques for Non Developers (2012)Peter Kofler
 
Chirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterChirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterJohn Adams
 
Interactions complicate debugging
Interactions complicate debuggingInteractions complicate debugging
Interactions complicate debuggingSyed Zaid Irshad
 
Static Code Analysis and AutoLint
Static Code Analysis and AutoLintStatic Code Analysis and AutoLint
Static Code Analysis and AutoLintLeander Hasty
 
Mike Bartley - Innovations for Testing Parallel Software - EuroSTAR 2012
Mike Bartley - Innovations for Testing Parallel Software - EuroSTAR 2012Mike Bartley - Innovations for Testing Parallel Software - EuroSTAR 2012
Mike Bartley - Innovations for Testing Parallel Software - EuroSTAR 2012TEST Huddle
 

Similar a .NET Core Summer event 2019 in Linz, AT - War stories from .NET team -- Karel Zikmund (20)

Case Study of the Unexplained
Case Study of the UnexplainedCase Study of the Unexplained
Case Study of the Unexplained
 
Patching Windows Executables with the Backdoor Factory | DerbyCon 2013
Patching Windows Executables with the Backdoor Factory | DerbyCon 2013Patching Windows Executables with the Backdoor Factory | DerbyCon 2013
Patching Windows Executables with the Backdoor Factory | DerbyCon 2013
 
Surge2012
Surge2012Surge2012
Surge2012
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitter
 
Fixing_Twitter
Fixing_TwitterFixing_Twitter
Fixing_Twitter
 
Introduction to multicore .ppt
Introduction to multicore .pptIntroduction to multicore .ppt
Introduction to multicore .ppt
 
Vulnerability Inheritance in ICS (English)
Vulnerability Inheritance in ICS (English)Vulnerability Inheritance in ICS (English)
Vulnerability Inheritance in ICS (English)
 
Debugging multiplayer games
Debugging multiplayer gamesDebugging multiplayer games
Debugging multiplayer games
 
BSides MCR 2016: From CSV to CMD to qwerty
BSides MCR 2016: From CSV to CMD to qwertyBSides MCR 2016: From CSV to CMD to qwerty
BSides MCR 2016: From CSV to CMD to qwerty
 
Алексей Ященко и Ярослав Волощук "False simplicity of front-end applications"
Алексей Ященко и Ярослав Волощук "False simplicity of front-end applications"Алексей Ященко и Ярослав Волощук "False simplicity of front-end applications"
Алексей Ященко и Ярослав Волощук "False simplicity of front-end applications"
 
Silicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
Silicon Valley Code Camp 2015 - Advanced MongoDB - The SequelSilicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
Silicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
 
Effective C++
Effective C++Effective C++
Effective C++
 
Asufe juniors-training session2
Asufe juniors-training session2Asufe juniors-training session2
Asufe juniors-training session2
 
TDD and Related Techniques for Non Developers (2012)
TDD and Related Techniques for Non Developers (2012)TDD and Related Techniques for Non Developers (2012)
TDD and Related Techniques for Non Developers (2012)
 
Chirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterChirp 2010: Scaling Twitter
Chirp 2010: Scaling Twitter
 
Interactions complicate debugging
Interactions complicate debuggingInteractions complicate debugging
Interactions complicate debugging
 
Static Code Analysis and AutoLint
Static Code Analysis and AutoLintStatic Code Analysis and AutoLint
Static Code Analysis and AutoLint
 
Mike Bartley - Innovations for Testing Parallel Software - EuroSTAR 2012
Mike Bartley - Innovations for Testing Parallel Software - EuroSTAR 2012Mike Bartley - Innovations for Testing Parallel Software - EuroSTAR 2012
Mike Bartley - Innovations for Testing Parallel Software - EuroSTAR 2012
 

Más de Karel Zikmund

.NET Conf 2022 - Networking in .NET 7
.NET Conf 2022 - Networking in .NET 7.NET Conf 2022 - Networking in .NET 7
.NET Conf 2022 - Networking in .NET 7Karel Zikmund
 
NDC London 2020 - Challenges of Managing CoreFx Repo -- Karel Zikmund
NDC London 2020 - Challenges of Managing CoreFx Repo -- Karel ZikmundNDC London 2020 - Challenges of Managing CoreFx Repo -- Karel Zikmund
NDC London 2020 - Challenges of Managing CoreFx Repo -- Karel ZikmundKarel Zikmund
 
NDC Sydney 2019 - Async Demystified -- Karel Zikmund
NDC Sydney 2019 - Async Demystified -- Karel ZikmundNDC Sydney 2019 - Async Demystified -- Karel Zikmund
NDC Sydney 2019 - Async Demystified -- Karel ZikmundKarel Zikmund
 
WUG Days 2022 Brno - Networking in .NET 7.0 and YARP -- Karel Zikmund
WUG Days 2022 Brno - Networking in .NET 7.0 and YARP -- Karel ZikmundWUG Days 2022 Brno - Networking in .NET 7.0 and YARP -- Karel Zikmund
WUG Days 2022 Brno - Networking in .NET 7.0 and YARP -- Karel ZikmundKarel Zikmund
 
.NET Core Summer event 2019 in Vienna, AT - .NET 5 - Future of .NET on Mobile...
.NET Core Summer event 2019 in Vienna, AT - .NET 5 - Future of .NET on Mobile....NET Core Summer event 2019 in Vienna, AT - .NET 5 - Future of .NET on Mobile...
.NET Core Summer event 2019 in Vienna, AT - .NET 5 - Future of .NET on Mobile...Karel Zikmund
 
.NET Core Summer event 2019 in Brno, CZ - Async demystified -- Karel Zikmund
.NET Core Summer event 2019 in Brno, CZ - Async demystified -- Karel Zikmund.NET Core Summer event 2019 in Brno, CZ - Async demystified -- Karel Zikmund
.NET Core Summer event 2019 in Brno, CZ - Async demystified -- Karel ZikmundKarel Zikmund
 
.NET Core Summer event 2019 in Brno, CZ - .NET Core Networking stack and perf...
.NET Core Summer event 2019 in Brno, CZ - .NET Core Networking stack and perf....NET Core Summer event 2019 in Brno, CZ - .NET Core Networking stack and perf...
.NET Core Summer event 2019 in Brno, CZ - .NET Core Networking stack and perf...Karel Zikmund
 
DotNext 2017 in Moscow - Challenges of Managing CoreFX repo -- Karel Zikmund
DotNext 2017 in Moscow - Challenges of Managing CoreFX repo -- Karel ZikmundDotNext 2017 in Moscow - Challenges of Managing CoreFX repo -- Karel Zikmund
DotNext 2017 in Moscow - Challenges of Managing CoreFX repo -- Karel ZikmundKarel Zikmund
 
DotNext 2017 in Moscow - .NET Core Networking stack and Performance -- Karel ...
DotNext 2017 in Moscow - .NET Core Networking stack and Performance -- Karel ...DotNext 2017 in Moscow - .NET Core Networking stack and Performance -- Karel ...
DotNext 2017 in Moscow - .NET Core Networking stack and Performance -- Karel ...Karel Zikmund
 
.NET MeetUp Brno 2017 - Microsoft Engineering teams in Europe -- Karel Zikmund
.NET MeetUp Brno 2017 - Microsoft Engineering teams in Europe -- Karel Zikmund.NET MeetUp Brno 2017 - Microsoft Engineering teams in Europe -- Karel Zikmund
.NET MeetUp Brno 2017 - Microsoft Engineering teams in Europe -- Karel ZikmundKarel Zikmund
 
.NET MeetUp Brno 2017 - Xamarin .NET internals -- Marek Safar
.NET MeetUp Brno 2017 - Xamarin .NET internals -- Marek Safar.NET MeetUp Brno 2017 - Xamarin .NET internals -- Marek Safar
.NET MeetUp Brno 2017 - Xamarin .NET internals -- Marek SafarKarel Zikmund
 
.NET MeetUp Brno - Challenges of Managing CoreFX repo -- Karel Zikmund
.NET MeetUp Brno - Challenges of Managing CoreFX repo -- Karel Zikmund.NET MeetUp Brno - Challenges of Managing CoreFX repo -- Karel Zikmund
.NET MeetUp Brno - Challenges of Managing CoreFX repo -- Karel ZikmundKarel Zikmund
 
.NET Fringe 2017 - Challenges of Managing CoreFX repo -- Karel Zikmund
.NET Fringe 2017 - Challenges of Managing CoreFX repo -- Karel Zikmund.NET Fringe 2017 - Challenges of Managing CoreFX repo -- Karel Zikmund
.NET Fringe 2017 - Challenges of Managing CoreFX repo -- Karel ZikmundKarel Zikmund
 
.NET MeetUp Prague 2017 - Challenges of Managing CoreFX repo -- Karel Zikmund
.NET MeetUp Prague 2017 - Challenges of Managing CoreFX repo -- Karel Zikmund.NET MeetUp Prague 2017 - Challenges of Managing CoreFX repo -- Karel Zikmund
.NET MeetUp Prague 2017 - Challenges of Managing CoreFX repo -- Karel ZikmundKarel Zikmund
 
.NET MeetUp Prague 2017 - .NET Standard -- Karel Zikmund
.NET MeetUp Prague 2017 - .NET Standard -- Karel Zikmund.NET MeetUp Prague 2017 - .NET Standard -- Karel Zikmund
.NET MeetUp Prague 2017 - .NET Standard -- Karel ZikmundKarel Zikmund
 
.NET MeetUp Amsterdam 2017 - Challenges of Managing CoreFX repo -- Karel Zikmund
.NET MeetUp Amsterdam 2017 - Challenges of Managing CoreFX repo -- Karel Zikmund.NET MeetUp Amsterdam 2017 - Challenges of Managing CoreFX repo -- Karel Zikmund
.NET MeetUp Amsterdam 2017 - Challenges of Managing CoreFX repo -- Karel ZikmundKarel Zikmund
 
.NET MeetUp Amsterdam 2017 - .NET Standard -- Karel Zikmund
.NET MeetUp Amsterdam 2017 - .NET Standard -- Karel Zikmund.NET MeetUp Amsterdam 2017 - .NET Standard -- Karel Zikmund
.NET MeetUp Amsterdam 2017 - .NET Standard -- Karel ZikmundKarel Zikmund
 

Más de Karel Zikmund (17)

.NET Conf 2022 - Networking in .NET 7
.NET Conf 2022 - Networking in .NET 7.NET Conf 2022 - Networking in .NET 7
.NET Conf 2022 - Networking in .NET 7
 
NDC London 2020 - Challenges of Managing CoreFx Repo -- Karel Zikmund
NDC London 2020 - Challenges of Managing CoreFx Repo -- Karel ZikmundNDC London 2020 - Challenges of Managing CoreFx Repo -- Karel Zikmund
NDC London 2020 - Challenges of Managing CoreFx Repo -- Karel Zikmund
 
NDC Sydney 2019 - Async Demystified -- Karel Zikmund
NDC Sydney 2019 - Async Demystified -- Karel ZikmundNDC Sydney 2019 - Async Demystified -- Karel Zikmund
NDC Sydney 2019 - Async Demystified -- Karel Zikmund
 
WUG Days 2022 Brno - Networking in .NET 7.0 and YARP -- Karel Zikmund
WUG Days 2022 Brno - Networking in .NET 7.0 and YARP -- Karel ZikmundWUG Days 2022 Brno - Networking in .NET 7.0 and YARP -- Karel Zikmund
WUG Days 2022 Brno - Networking in .NET 7.0 and YARP -- Karel Zikmund
 
.NET Core Summer event 2019 in Vienna, AT - .NET 5 - Future of .NET on Mobile...
.NET Core Summer event 2019 in Vienna, AT - .NET 5 - Future of .NET on Mobile....NET Core Summer event 2019 in Vienna, AT - .NET 5 - Future of .NET on Mobile...
.NET Core Summer event 2019 in Vienna, AT - .NET 5 - Future of .NET on Mobile...
 
.NET Core Summer event 2019 in Brno, CZ - Async demystified -- Karel Zikmund
.NET Core Summer event 2019 in Brno, CZ - Async demystified -- Karel Zikmund.NET Core Summer event 2019 in Brno, CZ - Async demystified -- Karel Zikmund
.NET Core Summer event 2019 in Brno, CZ - Async demystified -- Karel Zikmund
 
.NET Core Summer event 2019 in Brno, CZ - .NET Core Networking stack and perf...
.NET Core Summer event 2019 in Brno, CZ - .NET Core Networking stack and perf....NET Core Summer event 2019 in Brno, CZ - .NET Core Networking stack and perf...
.NET Core Summer event 2019 in Brno, CZ - .NET Core Networking stack and perf...
 
DotNext 2017 in Moscow - Challenges of Managing CoreFX repo -- Karel Zikmund
DotNext 2017 in Moscow - Challenges of Managing CoreFX repo -- Karel ZikmundDotNext 2017 in Moscow - Challenges of Managing CoreFX repo -- Karel Zikmund
DotNext 2017 in Moscow - Challenges of Managing CoreFX repo -- Karel Zikmund
 
DotNext 2017 in Moscow - .NET Core Networking stack and Performance -- Karel ...
DotNext 2017 in Moscow - .NET Core Networking stack and Performance -- Karel ...DotNext 2017 in Moscow - .NET Core Networking stack and Performance -- Karel ...
DotNext 2017 in Moscow - .NET Core Networking stack and Performance -- Karel ...
 
.NET MeetUp Brno 2017 - Microsoft Engineering teams in Europe -- Karel Zikmund
.NET MeetUp Brno 2017 - Microsoft Engineering teams in Europe -- Karel Zikmund.NET MeetUp Brno 2017 - Microsoft Engineering teams in Europe -- Karel Zikmund
.NET MeetUp Brno 2017 - Microsoft Engineering teams in Europe -- Karel Zikmund
 
.NET MeetUp Brno 2017 - Xamarin .NET internals -- Marek Safar
.NET MeetUp Brno 2017 - Xamarin .NET internals -- Marek Safar.NET MeetUp Brno 2017 - Xamarin .NET internals -- Marek Safar
.NET MeetUp Brno 2017 - Xamarin .NET internals -- Marek Safar
 
.NET MeetUp Brno - Challenges of Managing CoreFX repo -- Karel Zikmund
.NET MeetUp Brno - Challenges of Managing CoreFX repo -- Karel Zikmund.NET MeetUp Brno - Challenges of Managing CoreFX repo -- Karel Zikmund
.NET MeetUp Brno - Challenges of Managing CoreFX repo -- Karel Zikmund
 
.NET Fringe 2017 - Challenges of Managing CoreFX repo -- Karel Zikmund
.NET Fringe 2017 - Challenges of Managing CoreFX repo -- Karel Zikmund.NET Fringe 2017 - Challenges of Managing CoreFX repo -- Karel Zikmund
.NET Fringe 2017 - Challenges of Managing CoreFX repo -- Karel Zikmund
 
.NET MeetUp Prague 2017 - Challenges of Managing CoreFX repo -- Karel Zikmund
.NET MeetUp Prague 2017 - Challenges of Managing CoreFX repo -- Karel Zikmund.NET MeetUp Prague 2017 - Challenges of Managing CoreFX repo -- Karel Zikmund
.NET MeetUp Prague 2017 - Challenges of Managing CoreFX repo -- Karel Zikmund
 
.NET MeetUp Prague 2017 - .NET Standard -- Karel Zikmund
.NET MeetUp Prague 2017 - .NET Standard -- Karel Zikmund.NET MeetUp Prague 2017 - .NET Standard -- Karel Zikmund
.NET MeetUp Prague 2017 - .NET Standard -- Karel Zikmund
 
.NET MeetUp Amsterdam 2017 - Challenges of Managing CoreFX repo -- Karel Zikmund
.NET MeetUp Amsterdam 2017 - Challenges of Managing CoreFX repo -- Karel Zikmund.NET MeetUp Amsterdam 2017 - Challenges of Managing CoreFX repo -- Karel Zikmund
.NET MeetUp Amsterdam 2017 - Challenges of Managing CoreFX repo -- Karel Zikmund
 
.NET MeetUp Amsterdam 2017 - .NET Standard -- Karel Zikmund
.NET MeetUp Amsterdam 2017 - .NET Standard -- Karel Zikmund.NET MeetUp Amsterdam 2017 - .NET Standard -- Karel Zikmund
.NET MeetUp Amsterdam 2017 - .NET Standard -- Karel Zikmund
 

Último

Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...OnePlan Solutions
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identityteam-WIBU
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdfAndrey Devyatkin
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptxVinzoCenzo
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITmanoharjgpsolutions
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?Alexandre Beguel
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonApplitools
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLionel Briand
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shardsChristopher Curtin
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfRTS corp
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldRoberto Pérez Alcolea
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolsosttopstonverter
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecturerahul_net
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxRTS corp
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorTier1 app
 

Último (20)

Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identity
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh IT
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository world
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration tools
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
 
Effectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryErrorEffectively Troubleshoot 9 Types of OutOfMemoryError
Effectively Troubleshoot 9 Types of OutOfMemoryError
 

.NET Core Summer event 2019 in Linz, AT - War stories from .NET team -- Karel Zikmund

  • 1. War stories from .NET team .NET Core Summer event 2019 – Linz, AT Karel Zikmund – @ziki_cz
  • 2. Agenda • Stories • Investigations on .NET team • Not just from me • Lessons learned on the way You won’t see any: • Source code • Debugger Not on agenda
  • 3. My First Serious Investigation • Build lab for Windows component • Build break 1x per week • AccessViolation dialog hangs machine • Repro: • Once in ~50 runs • Overnight run: 247 crashes out of 77,006 runs (0.3%)
  • 5. My First Serious Investigation • Does it by a chance reproduce on only one machine? • Answer: How did you know? • But why always the same callstack? • Good question, no good answer … magic • Lesson learned: Debugging HW errors is costly and hard • Always ask: Does it repro on more than 1 machine?
  • 6. Another MetaData story MetaData format background: • Basically database – rows and columns • Example – TypeDef table: • Indexes into tables/heaps are either 2B or 4B • What happens if last TypeDef has no methods? • MethodList = Number of methods + 1 = max + 1 • What happens if there is 0xffff methods? Flags TypeName TypeNamespace Extends MethodList (Public) “Foo” “Awesome.Story” … Method #10 (Private) “Bar” “Awesome.Story” … Method #11
  • 7. Another MetaData story • II.24.2.6 “#~ stream” • If e is a simple index into a table with index i, it is stored using 2 bytes if table i has less than 2^16 rows, otherwise it is stored using 4 bytes. • II.22.37 TypeDef : 0x02 • 21. If MethodList is non-null, it shall index a valid row in the MethodDef table, where valid means 1 <= row <= rowcount+1 [ERROR] • How do you fix it? • “I’m on the fence whether we should (fix it), given it looks like people hit this about once in 17 years” • https://github.com/dotnet/corefx/issues/29554 • Lesson learned: Not all bugs have to be fixed
  • 8. TypeSystem – Collapsing interfaces • Table of implemented interfaces: class A : I, J {} • With generics: class C<T> : L<T> {} class D<T> : C<T>, L<string> {} class E : D<string>, I {} 0 1 I J 0 1 2 I J K 0 1 L<T> L<string> 0 1 L<string> I 0 1 2 L<string> L<string> I class B : A, K {} Fix:
  • 9. Breaking changes – Intro • Everyone wants fix for their bug • But nobody wants to be broken • Observation: 10% of fixes have unintended side-effects • Extreme case: Perf improvement can break app • How many customers? • Lesson learned: Everything has risk of breaking someone
  • 10. Breaking changes – Last build • Finance app crashing – “last” build of Windows 8 on arm (Surface RT) • Latent bug (introduced months ago) • Bug triggered by: 1. Method in NGen image has to be across 8KB pages 2. GC has to be triggered at least twice when it’s on stack • Unrelated change caused “unlucky” method order for: • System.Net.Configuration.DefaultProxySectionInternal..ctor • Lesson learned: Anything, really ANYTHING, has risk of breaking
  • 11. Breaking changes – Huge impact • Patch to .NET Framework broke certain tax SW • Printing tax forms • Update pushed few days before tax deadline in US • Note: Printing was tested on both sides (Microsoft & tax SW company) • But only into file, not to printer • Lessons learned: Be extra cautious around sensitive dates
  • 12. Breaking changes – Below you • RavenDB – blue screen after KB4487017 on .NET Core! • dotnet/coreclr#22597 • PrefetchVirtualMemory • Kernel memory management bug
  • 13. Networking – Security issue • January: Researcher running ML models on Cosmos • Suspicion about buffers – more logging • March: Repro gone • May: Similar report • +2 weeks: It blows up (more teams & impact) • All hands on-deck • Small repro (20 min, then 1 min) … yay! • TTD trace (iDNA / TTT) … bonus & life saver
  • 14. Networking – Security issue • Root-cause: HTTP pipelining under stress • 13 years old bug (.NET 2.0) Response 1 Request 1 Server Response 1 Request 1 Server Request 2 Response 2
  • 15. Networking – Security issue Request 1 Server Request 2Request 3 Response 1Response 2
  • 16. Networking – Security issue Request 1 Server Request 2Request 3 Response 1Response 2
  • 17. Networking – Security issue • We have workaround (disable pipelining) – perf impact • Worked fix … • Verifying fix … • Repro fails after 4h  • Same symptoms • Repro sensitive to cloud network load (8-17) • TTD (iDNA / TTT) does not work  • Suspicion about buffers again
  • 18. Networking – Security issue • Bad buffer lifetime management – on sending side! • 5 years old bug (.NET 4.5.2) • Trigger found: • Thanks to Skype team – 24h deployment of experiments • Change in .NET 4.7.1 • Fix around the problematic area • Making the opportunity window SMALLER! • … counter-intuitive • Code review – similar bug on receiving side (5 years old) • Same symptoms as HTTP pipelining
  • 19. Networking – Security issue • Why so many customers/services hit it at once? • Maybe Spectre & Meltdown fixes roll out? • or just … magic • Lesson learned: Weird coincidences can happen …
  • 20. Optimizations • Once upon a time, … there was a service in Microsoft • List vs. array data structure perf • Perspectives: 1. The data structure will have in practice 3-5 items 2. There 3 hops between servers for each request!!! • Lesson learned: Avoid premature optimizations … at all cost
  • 21. Lessons learned • Always ask: Does it repro on more than 1 machine? • Debugging HW bugs is costly • Some bugs happen once in 17 years • Spec bugs are hard to fix • MetaData format bug • Anything, really ANYTHING, has risk of breaking someone • Innocent changes can trigger latent bugs elsewhere • Impact may be huge – e.g. during tax season • Always try to create small repro • Make your and everyone’s life easier • TTD (iDNA / TTT) is life saver • Avoid premature optimizations … at all cost, save your time • … sometimes there is just … magic @ziki_cz
  • 22. Thank you • Feedback welcome • Twitter DM, email, in-person, etc. • Survey • What you liked vs. not? • Too rushed? • Hard to understand? • Boring? • Didn’t meet your expectations? @ziki_cz

Notas del editor

  1. Quickly about me: .NET team for almost 14 years Started as junior / out of college on Runtime – C++, pieces like Metadata, TypeSystem, Assembly Loader Later on moved to manager role Then moved to BCL (Base Class Libraries) – Networking area mainly (HttpClient) … working in open-source (.NET Core) Community manager of dotnet/corefx repo
  2. Lessons learned – maybe useful to you Maybe just helps you understand what is happening on the other side / below you I already had few people confirm they hit some/all situations Were able to identify with problems and recommendations
  3. 2006 January – 3 months in MS Large code base, dozens of machines, productivity impact on larger team Crash – “hang dialog” with AV Repro – great Getting heap dumps We get to see callstack … but before that, some quotes
  4. Simplified callstack for readability AV in MetaData emitting – defining a parameter Basically stack corruption (dangerous) Proper RW lock Who corrupts memory? …
  5. Costly and hard … and requires quite some expertise Variants: Different machine setup? … driver bugs Extreme from Maoni: Real HW?
  6. 1 year old story – 2018 May First background on MetaData Compressed indexes = just schema which says 2B, 4B … variable between files, but static/stable and given per file MethodList = Start of list of methods, INCLUSIVE
  7. How do you fix that? … You don’t … spec bug / format bug Changing rules means rewriting & recompiling all tools (CCI and command line tools like ildasm, or UI Reflector, ILSpy, Visual Studio, debuggers, profilers, …) Compensate? Rearranging fields/methods/params in a way the last one does not need the +1. Nasty Emitting fake type/method with field/method/param to push row count to 2^16. Also nasty Using 0 as valid value? Readers will be surprised, maybe other bugs?
  8. 2009/4 story VirtualStubDispatch uses indexes of interfaces to call virtual methods – code in D<T>, casting to K<string> will use index 1 for interface and method index. But if instance of E is passed, method on I (interface index 1) SECURITY issue! Missing feature – implementing in security patch … risk of breaking changes Proper description complicates spec a lot Kind of problem when spec had 1 reference implementation – described the implementation Bug since introduction of generics, many smart people looked at it, yet missed it
  9. Read slides
  10. OEM getting builds 2 days Paranoia
  11. Sensitive dates like tax date, shopping season? (December) … online stores usually have stop on any changes
  12. Federico Lois February case Blue screen – not something CoreCLR should be able to cause Used PrefetchVirtualMemory to optimize perf – rarely used When you’re on cutting edge, pushing the limits, you should be prepared for anything
  13. Last July (2018) Story starts 8 months earlier in December 2017 Is it server or client problem? … wireshark traces Around Feb, we know it is client - .NET or Windows March – repro is gone (they upgraded cluster) (fast forward 2 months) May another email thread – similar symptoms Back and forth Heated Realize it is 2 different products on the thread And then couple of more start coming in span of 2 weeks Impact on one customer is huge Potential: Data loss Information disclosure – mixing data in multi-tenant scenarios 3-4 weeks of all-hands on deck + 24/7 We had iDNA trace (TTD / TTT)
  14. What happens when requests are cancelled? If 1st – close connection If last – remove it & and mark for closing If in middle – remove it & and mark for closing
  15. Bad things can happen – imagine you asked: “Does the data exist?” … data loss Multi-tenant scenarios: “Give me data about customer X” … data about Y
  16. Added logging (ETW) – reused buffers Old code – track down bad buffer management
  17. Something like SharePoint Online, Bing, O365 OneNote – where response to customer matters Caring about perf early (with measurements!) != premature optimization
  18. Help me do better job next time