2. Parallel Programming with Visual Studio 2010 and the .NET Framework 4 Stephen Toub Microsoft Corporation October 2009
3. Agenda
Why Parallelism, Why Now?
Difficulties with Visual Studio 2008 & .NET 3.5
Solutions with Visual Studio 2010 & .NET 4:
- Parallel LINQ
- Task Parallel Library
- New Coordination & Synchronization Primitives
- New Parallel Debugger Windows
- New Profiler Concurrency Visualizations
4. Moore’s Law “The number of transistors incorporated in a chip will approximately double every 24 months.” Gordon Moore Intel Co-Founder http://www.intel.com/pressroom/kits/events/moores_law_40th/
5. Moore’s Law: Alive and Well? The number of transistors doubles every two years… More than 1 billion transistors in 2006! http://upload.wikimedia.org/wikipedia/commons/2/25/Transistor_Count_and_Moore%27s_Law_-_2008_1024.png
6. Moore’s Law: Feel the Heat! Chart: power density (W/cm²) on a log scale from 1 to 10,000, plotted from the ’70s through ’10 for the 8080, 386, 486, and Pentium® processors, climbing past the hot plate level toward nuclear reactor, rocket nozzle, and the sun’s surface. Source: Pat Gelsinger, Intel Developer Forum, Spring 2004
7. Moore’s Law: But Different Frequencies will NOT get much faster! Maybe 5 to 10% every year or so, a few more times… And these modest gains would make the chips A LOT hotter! http://www.tomshw.it/cpu.php?guide=20051121
8. The Manycore Shift “[A]fter decades of single core processors, the high volume processor industry has gone from single to dual to quad-core in just the last two years. Moore’s Law scaling should easily let us hit the 80-core mark in mainstream processors within the next ten years and quite possibly even less.”-- Justin Rattner, CTO, Intel (February 2007) “If you haven’t done so already, now is the time to take a hard look at the design of your application, determine what operations are CPU-sensitive now or are likely to become so soon, and identify how those places could benefit from concurrency.”-- Herb Sutter, C++ Architect at Microsoft (March 2005)
9. I'm convinced… now what?
Multithreaded programming is “hard” today:
- Doable only by a subgroup of senior specialists
- Parallel patterns are not prevalent, well known, nor easy to implement
- So many potential problems
Businesses have little desire to “go deep”:
- The best developers should focus on business value, not concurrency
- Need simple ways to allow all developers to write concurrent code
10. Example: “Race Car Drivers”

IEnumerable<RaceCarDriver> drivers = ...;
var results = new List<RaceCarDriver>();
foreach (var driver in drivers)
{
    if (driver.Name == queryName && driver.Wins.Count >= queryWinCount)
    {
        results.Add(driver);
    }
}
results.Sort((b1, b2) => b1.Age.CompareTo(b2.Age));
11. Manual Parallel Solution

IEnumerable<RaceCarDriver> drivers = ...;
var results = new List<RaceCarDriver>();
int partitionsCount = Environment.ProcessorCount;
int remainingCount = partitionsCount;
var enumerator = drivers.GetEnumerator();
try
{
    using (var done = new ManualResetEvent(false))
    {
        for (int i = 0; i < partitionsCount; i++)
        {
            ThreadPool.QueueUserWorkItem(delegate
            {
                while (true)
                {
                    RaceCarDriver driver;
                    lock (enumerator)
                    {
                        if (!enumerator.MoveNext()) break;
                        driver = enumerator.Current;
                    }
                    if (driver.Name == queryName && driver.Wins.Count >= queryWinCount)
                    {
                        lock (results) results.Add(driver);
                    }
                }
                if (Interlocked.Decrement(ref remainingCount) == 0) done.Set();
            });
        }
        done.WaitOne();
        results.Sort((b1, b2) => b1.Age.CompareTo(b2.Age));
    }
}
finally
{
    if (enumerator is IDisposable) ((IDisposable)enumerator).Dispose();
}
12. PLINQ Solution

var results = from driver in drivers.AsParallel()
              where driver.Name == queryName && driver.Wins.Count >= queryWinCount
              orderby driver.Age ascending
              select driver;
13. Visual Studio 2010: Tools, Programming Models, Runtimes
Diagram of the stack:
- Managed (.NET Framework 4): Parallel LINQ, Task Parallel Library, Data Structures; ThreadPool with Task Scheduler and Resource Manager
- Native (Visual C++ 10): Parallel Pattern Library, Agents Library, Data Structures; Concurrency Runtime with Task Scheduler and Resource Manager
- Tooling (Visual Studio IDE): Parallel Debugger Tool Windows; Profiler Concurrency Analysis
- Operating System: Windows Threads, UMS Threads
14. Parallel Extensions
What is it?
- Pure .NET libraries: no compiler changes necessary (mscorlib.dll, System.dll, System.Core.dll)
- Lightweight, user-mode runtime with key ThreadPool enhancements
- Supports imperative and declarative, data and task parallelism: declarative data parallelism (PLINQ), imperative data and task parallelism (Task Parallel Library), and new coordination/synchronization constructs
Why do we need it?
- Supports parallelism in any .NET language
- Delivers reduced concept count and complexity, better time to solution
- Begins to move parallelism capabilities from concurrency experts to domain experts
How do we get it?
- Built into the core of .NET 4
- Debugging and profiling support in Visual Studio 2010
15. Architecture
Diagram: the C#, VB, C++, F#, and other .NET compilers emit IL for a .NET program that runs over threads on processors 1..p, using:
- PLINQ Execution Engine: declarative queries; query analysis; data partitioning (chunk, range, hash, striped, repartitioning, custom); operator types (map, filter, sort, search, reduce, group, join, …); merging (sync and async, order preserving, buffered, inverted)
- Task Parallel Library: loop replacements, imperative task parallelism, scheduling, parallel algorithms
- Coordination Data Structures: thread-safe collections, synchronization types, coordination types
16. Language Integrated Query (LINQ)
Diagram: C# and Visual Basic .NET sit atop the Standard Query Operators, which target LINQ-enabled data sources:
- LINQ to Objects (objects)
- LINQ-enabled ADO.NET: LINQ to SQL, LINQ to DataSets, LINQ to Entities (relational)
- LINQ to XML (XML, e.g. <book> <title/> <author/> <price/> </book>)
- Others…
17. Writing a LINQ-to-Objects Query
Two ways to write queries:
- Comprehensions: syntax extensions to C# and Visual Basic
- APIs: used as extension methods on IEnumerable<T> (the System.Linq.Enumerable class)
The compiler converts the former into the latter; the API implementation does the actual work.

var q = from x in Y where p(x) orderby x.f1 select x.f2;

var q = Y.Where(x => p(x)).OrderBy(x => x.f1).Select(x => x.f2);

var q = Enumerable.Select(
            Enumerable.OrderBy(
                Enumerable.Where(Y, x => p(x)), x => x.f1), x => x.f2);
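The equivalence of the three forms can be checked directly. A minimal sketch, where Y, the predicate, and the projections are stand-ins for the slide's placeholders:

```csharp
using System;
using System.Linq;

class QueryForms
{
    static void Main()
    {
        int[] Y = { 5, 3, 8, 1, 4 };
        Func<int, bool> p = x => x > 2;   // assumed predicate

        // Comprehension syntax
        var q1 = from x in Y where p(x) orderby x select x * 10;

        // Extension-method syntax the compiler produces
        var q2 = Y.Where(x => p(x)).OrderBy(x => x).Select(x => x * 10);

        // Fully explicit static calls on Enumerable
        var q3 = Enumerable.Select(
                     Enumerable.OrderBy(
                         Enumerable.Where(Y, x => p(x)), x => x), x => x * 10);

        // All three produce the same sequence
        Console.WriteLine(q1.SequenceEqual(q2) && q2.SequenceEqual(q3)); // True
    }
}
```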
20. Implementation of a Query Operator
What might an implementation look like? Does it have to be this way? What if we could do this in… parallel?!

public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
    if (source == null || predicate == null) throw new ArgumentNullException();
    foreach (var item in source)
    {
        if (predicate(item)) yield return item;
    }
}

public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
    ...
}
21. Parallel LINQ (PLINQ)
- Utilizes parallel hardware for LINQ queries
- Abstracts away most parallelism details
- Partitions and merges data intelligently
- Supports all .NET Standard Query Operators, plus a few knobs
- Works for any IEnumerable<T>, with optimizations for other types (T[], IList<T>)
- Supports custom partitioning (Partitioner<T>)
- Built on top of the rest of Parallel Extensions
22. Programming Model
Minimal impact to the existing LINQ programming model:
- AsParallel extension method
- ParallelEnumerable class: implements the Standard Query Operators, but for ParallelQuery<T>

public static ParallelQuery<T> AsParallel<T>(this IEnumerable<T> source);

public static ParallelQuery<TSource> Where<TSource>(
    this ParallelQuery<TSource> source, Func<TSource, bool> predicate);
23. Writing a PLINQ Query
Two ways to write queries:
- Comprehensions: syntax extensions to C# and Visual Basic
- APIs: used as extension methods on ParallelQuery<T> (the System.Linq.ParallelEnumerable class)
The compiler converts the former into the latter; as with serial LINQ, the API implementation does the actual work.

var q = from x in Y.AsParallel() where p(x) orderby x.f1 select x.f2;

var q = Y.AsParallel().Where(x => p(x)).OrderBy(x => x.f1).Select(x => x.f2);

var q = ParallelEnumerable.Select(
            ParallelEnumerable.OrderBy(
                ParallelEnumerable.Where(Y.AsParallel(), x => p(x)), x => x.f1), x => x.f2);
24. PLINQ Knobs
Additional extension methods: WithDegreeOfParallelism, AsOrdered, WithCancellation, WithMergeOptions, WithExecutionMode

var results = from driver in drivers.AsParallel().WithDegreeOfParallelism(4)
              where driver.Name == queryName && driver.Wins.Count >= queryWinCount
              orderby driver.Age ascending
              select driver;

var results = from driver in drivers.AsParallel().AsOrdered()
              where driver.Name == queryName && driver.Wins.Count >= queryWinCount
              orderby driver.Age ascending
              select driver;
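The WithCancellation knob ties a query to a CancellationToken; a sketch, with the query body and the 100 ms delay as assumed placeholders:

```csharp
using System;
using System.Linq;
using System.Threading;

class CancelQuery
{
    static void Main()
    {
        var cts = new CancellationTokenSource();
        // Request cancellation from another thread after ~100 ms
        // (CancellationTokenSource.CancelAfter arrived only in .NET 4.5)
        ThreadPool.QueueUserWorkItem(_ => { Thread.Sleep(100); cts.Cancel(); });

        try
        {
            var q = Enumerable.Range(0, int.MaxValue)
                              .AsParallel()
                              .WithCancellation(cts.Token)
                              .Select(x => { Thread.SpinWait(10000); return x; });
            foreach (var item in q) { /* consume */ }
        }
        catch (OperationCanceledException)
        {
            // PLINQ surfaces cooperative cancellation as OperationCanceledException
            Console.WriteLine("Query canceled");
        }
    }
}
```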
29. Partitioning: Algorithms
Several partitioning schemes built in:
- Chunk: works with any IEnumerable<T>; a single enumerator is shared, and chunks are handed out on demand
- Range: works only with IList<T>; input divided into contiguous regions, one per partition
- Stripe: works only with IList<T>; elements handed out round-robin to each partition
- Hash: works with any IEnumerable<T>; elements assigned to partitions based on hash code
Custom partitioning is available through Partitioner<T>; Partitioner.Create is available for tighter control over the built-in partitioning schemes.
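One concrete use of Partitioner.Create is range partitioning for loops over arrays; a minimal sketch (the squaring body is just an assumed stand-in workload):

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

class RangePartitioning
{
    static void Main()
    {
        int[] data = Enumerable.Range(0, 1000).ToArray();
        long[] results = new long[data.Length];

        // Partitioner.Create(from, to) yields contiguous [from, to) ranges,
        // so each worker runs a tight sequential loop over its own region
        // instead of paying per-element scheduling overhead.
        Parallel.ForEach(Partitioner.Create(0, data.Length), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
                results[i] = (long)data[i] * data[i];
        });

        Console.WriteLine(results[30]); // 900
    }
}
```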
31. Example: (from x in D.AsParallel() where p(x) select x*x*x).Sum();
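A self-contained version of the slide's query; D and p here are assumed stand-ins (the integers 1 through 10 and an evenness test):

```csharp
using System;
using System.Linq;

class CubeSum
{
    static void Main()
    {
        var D = Enumerable.Range(1, 10);
        Func<int, bool> p = x => x % 2 == 0;   // assumed predicate

        // Filter, cube, and sum across all cores
        int sum = (from x in D.AsParallel()
                   where p(x)
                   select x * x * x).Sum();

        Console.WriteLine(sum); // 2³+4³+6³+8³+10³ = 1800
    }
}
```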
34. Operator fusion: PLINQ minimizes the number of partitioning/merging steps necessary. Rather than repartitioning the data D between each operator (where p(x), select x³, Sum()), each of Tasks 1..n runs the entire operator chain over its own partition, with a single partitioning step at the start and a single merge at the end.
35. Merging
- Pipelined: separate consumer thread. Default for GetEnumerator() (and hence foreach loops); AutoBuffered and NoBuffering options. Access to data as it's available, but more synchronization overhead.
- Stop-and-go: consumer helps. Used for sorts, ToArray, ToList, etc.; FullyBuffered option. Minimizes context switches, but higher latency and more memory.
- Inverted: no merging needed. ForAll extension method. Most efficient by far, but not always applicable: requires side effects.
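The inverted case can be sketched as follows; the query itself is an assumed example, but it shows why ForAll needs a side effect to deliver results:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;

class InvertedMerge
{
    static void Main()
    {
        var evens = new ConcurrentBag<int>();

        // ForAll runs the final action on each partition's own worker,
        // so results never funnel back through a merge step; the side
        // effect (adding to a thread-safe bag) replaces merging.
        Enumerable.Range(1, 1000).AsParallel()
                  .Where(n => n % 2 == 0)
                  .ForAll(n => evens.Add(n));

        Console.WriteLine(evens.Count); // 500
    }
}
```

Note the output order is unspecified; a ConcurrentBag (or other thread-safe sink) is required because the action runs on multiple threads at once.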
36. Parallelism Blockers
- Ordering not guaranteed:
int[] values = new int[] { 0, 1, 2 };
var q = from x in values.AsParallel() select x * 2;
int[] scaled = q.ToArray(); // == { 0, 2, 4 }?
- Exceptions, surfaced as System.AggregateException:
object[] data = new object[] { "foo", null, null };
var q = from x in data.AsParallel() select x.ToString();
- Thread affinity:
controls.AsParallel().ForAll(c => c.Size = ...);
- Operations with < 1.0 speedup:
IEnumerable<int> input = ...;
var doubled = from x in input.AsParallel() select x * 2;
- Side effects and mutability are serious issues. Most queries do not use side effects, but it's possible:
Random rand = new Random();
var q = from i in Enumerable.Range(0, 10000).AsParallel() select rand.Next();
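The exception blocker above can be handled like this; PLINQ gathers failures from all worker threads and rethrows them as one AggregateException at the point the query is consumed:

```csharp
using System;
using System.Linq;

class QueryExceptions
{
    static void Main()
    {
        object[] data = new object[] { "foo", null, null };
        try
        {
            // Each null element throws NullReferenceException on a worker;
            // consuming the query rethrows them wrapped in AggregateException.
            var q = (from x in data.AsParallel() select x.ToString()).ToArray();
        }
        catch (AggregateException ae)
        {
            Console.WriteLine(ae.InnerExceptions.All(e => e is NullReferenceException)); // True
        }
    }
}
```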
37. Task Parallel Library: Loops
- Loops are a common source of work
- Can be parallelized when iterations are independent (the body doesn't depend on mutable state, or synchronization is used)
- Synchronous: all iterations finish, regularly or exceptionally
- Lots of knobs: breaking, task-local state, custom partitioning, cancellation, scheduling, degree of parallelism
- Visual Studio 2010 profiler support (as with PLINQ)

for (int i = 0; i < n; i++) work(i);
foreach (T e in data) work(e);

Parallel.For(0, n, i => work(i));
Parallel.ForEach(data, e => work(e));
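Two of those knobs, degree of parallelism and breaking, can be sketched as follows (the items array and the negative-value exit condition are assumed for illustration):

```csharp
using System;
using System.Threading.Tasks;

class LoopKnobs
{
    static void Main()
    {
        int[] items = { 4, 7, 1, -2, 9, 3 };
        var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };

        var result = Parallel.For(0, items.Length, options, (i, state) =>
        {
            if (items[i] < 0)
                state.Break(); // lower iterations still run; higher ones stop being scheduled
        });

        Console.WriteLine(result.IsCompleted);          // False: the loop was broken
        Console.WriteLine(result.LowestBreakIteration); // 3
    }
}
```

Break (unlike Stop) guarantees that every iteration below the breaking index executes, which preserves the semantics of a sequential break.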
38. Task Parallel Library: Statements
- A sequence of statements, when independent, can be parallelized
- Synchronous (same as loops)
- Under the covers: may use Parallel.For, may use Tasks

StatementA(); StatementB(); StatementC();

Parallel.Invoke(
    () => StatementA(),
    () => StatementB(),
    () => StatementC());
39. Task Parallel Library: Tasks
System.Threading.Tasks
- Task: represents an asynchronous operation; supports waiting, cancellation, continuations, …; parent/child relationships; first-class debugging support in Visual Studio 2010
- Task<TResult> (derives from Task): tasks that return results
- TaskCompletionSource<TResult>: create Task<TResult>s to represent other operations
- TaskScheduler: represents a scheduler that executes tasks; extensible; TaskScheduler.Default => ThreadPool
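A minimal sketch of the Task<TResult> pattern in .NET 4 style (the summing workload is an assumed example; Task.Factory.StartNew is the .NET 4 entry point, predating Task.Run):

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

class Tasks101
{
    static void Main()
    {
        // Start work on the default scheduler (the ThreadPool)
        Task<int> sum = Task.Factory.StartNew(() => Enumerable.Range(1, 100).Sum());

        // Continuations chain follow-up work to run when the antecedent finishes
        Task print = sum.ContinueWith(t => Console.WriteLine("Sum = {0}", t.Result));

        print.Wait();           // block until the continuation has run
        Console.WriteLine(sum.Result); // 5050
    }
}
```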
44. What Can I Do with These Cores?
- Offload: free up your UI; go faster whenever you can; parallelize the parallelizable
- Do more: use more data to get better results; add more features
- Speculate: pre-fetch, pre-process; evaluate multiple solutions
45. Performance Tips
- Target compute-intensive work and/or large data sets; work done should be at least thousands of cycles
- Measure, and combine/optimize as necessary: use the Visual Studio concurrency profiler; look for common anti-patterns (load imbalance, lock convoys, etc.)
- Parallelize fine-grained, but not too fine-grained: e.g. parallelize the outer loop, unless N is too small to offer enough parallelism; at that point, consider parallelizing only the inner loop, or both; consider unrolling
- Do not be gratuitous in task creation: tasks are lightweight, but still require object allocation, etc.
- Prefer isolation and immutability over synchronization: synchronization limits scalability; try to avoid shared state
- Have realistic expectations
47. To Infinity And Beyond…
The “Manycore Shift” is happening; parallelism in your code is inevitable. Visual Studio 2010 and .NET 4 will help.
- Parallel Computing Dev Center: http://msdn.com/concurrency
- Download Beta 2 (“go-live” license): http://go.microsoft.com/?linkid=9692084
- Team blogs: Managed: http://blogs.msdn.com/pfxteam; Native: http://blogs.msdn.com/nativeconcurrency; Tools: http://blogs.msdn.com/visualizeconcurrency
- Forums: http://social.msdn.microsoft.com/Forums/en-US/category/parallelcomputing
We love feedback!