The advent of multicore processors brings many questions to mind, including perhaps one that nobody has brought up. What if the main purpose of multicore is to distract CPU customers from the fact that CPUs are no longer delivering the clock speed improvements they did in the past?
Ok, ok, that is putting the cart before the horse. The principal issue with multicore is that Moore's Law (that other Moore, not me) split into two directions. Moore's Law dictated that the density of computers doubled about every two years. You'll note that I didn't say "transistor count" or specify integrated circuits specifically. That is because Moore's Law was true before either transistors or ICs existed. It's not so much a basic law of physics as it is a law of human endeavor, and it has held true since the beginnings of the computer.
The integrated circuit folks have been saying for years (perhaps decades) that Moore's Law was due to fail, and it actually did. To be more correct, a corollary of Moore's Law bit the dust: that the speed of integrated circuits would also double as the density doubled. Integrated circuit speeds hit a speed wall based on the capacitance of circuits, leakage, and other issues.
The result of the "speed wall" was that, although the amount of silicon real estate continued to increase, the speed of the circuits there would not keep pace. The answer the CPU makers chose was simple: dust off the idea of parallelism. This really involves two steps, only the second of which we are coming to grips with at the moment. The first is finding a solution. The second is solving the inevitable problems with that solution.
The solution everyone hit on was to place multiple CPUs on the same chip, and to use SMP, or Symmetric MultiProcessing, to tie them together as a whole. Basically, SMP takes two CPUs and ties them together so that they share the exact same memory and peripherals, or at least appear to (an important distinction). Why is that a good idea?
SMP is somewhat akin to putting two cars on the same racetrack. If you do that "blind" without any other preconditions, you get a demolition derby, since they will eventually collide. What turns the demolition into a race is that the two or more CPUs are controlled, and can see each other. They follow each other, and can even share work.
SMP coupled CPUs share work by two mechanisms. The first is hardware coherency, the second is software control. Hardware coherency means that the hardware makes sure that the two CPUs see a consistent picture of their memory, and sometimes the peripherals. This means having either a shared cache, or different caches that talk to each other. If this were not done, one CPU would be caching a copy of memory location X, while the other CPU would have a completely different idea of what was in location X.
The software control aspect is simply that the software controlling the system makes sure that each CPU has its own area to execute code in, and each has its own private data to work on. In fact, this would be madness if it were not for one central unifying principle. That is that multiple executing CPUs in the same memory context can be treated, with few exceptions, the same as multiple executing threads. This does not make dealing with multicore less complex, it simply means that we have decades of multitasking research based on threads that is immediately applicable to SMP.
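That central principle, that SMP cores in one memory context behave like threads, is why ordinary threading code carries straight over to multicore. Here is a minimal sketch (in Python, purely for illustration) of the two mechanisms just described: shared memory that both execution streams see, plus software control to keep them out of each other's way:

```python
import threading

# On an SMP system, two CPUs sharing one memory context look to the
# programmer like two threads sharing variables.
counter = 0
lock = threading.Lock()  # software control: keeps the CPUs coordinated

def worker(n):
    global counter
    for _ in range(n):
        with lock:       # without this, the "demolition derby" begins
            counter += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 200000 -- consistent only because access was coordinated
```

Remove the lock and the two workers can interleave their read-modify-write of `counter`, losing updates: exactly the inconsistent picture of "location X" described above, just at the software level.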
In fact, it is this "apparent" compatibility with multitasking principles that has made SMP the overriding favorite model for multicore application.
Well, above, I said that the first step was the solution (SMP, multicore), and the second step was dealing with the problems. There are two big problems with multicore/SMP. That's the bad news. The good news is that the first huge rock in the (instruction) stream was dealt with so smoothly and easily that virtually nobody noticed it. And that is the fact that SMP, the model itself, hit an iceberg and sank, with all hands, quickly and quietly in the night.
The reason SMP didn't work is that it dictated that all CPUs work with the same memory. From a hardware point of view this was like having all your horses drink from the same well, or all your trains come to the same turntable. It works to an extent, but the system has to fail eventually, as it gets overloaded. There is only so much you can do to increase memory bandwidth, and there are only so many ports you can create to the same cache before the system starts to drag in performance.
The replacement for SMP was NUMA, or Non Uniform Memory Access. NUMA means that you forget giving all processors direct access to the same memory. Instead, if you have (say) four processors and 4 megabytes of memory, you give each processor 1 megabyte of memory. The fact that you now have what looks like 4 CPUs, each with its own memory, and have apparently broken the SMP model, is compensated for by having the memory controllers for each CPU talk to each other like mad, to the point that each CPU "appears" to have 4 megabytes of memory. In fact, each CPU has 1 megabyte of "easy" (or fast) memory, and 3 megabytes of "hard" (or slow) memory that is really accessed indirectly.
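The "easy" versus "hard" memory split can be put in rough numbers. This back-of-the-envelope sketch (in Python; the latency figures are invented for illustration, not measured from any real hardware) shows why keeping a core's memory accesses local matters so much:

```python
# Hypothetical latencies -- illustrative only, not real hardware numbers.
LOCAL_NS, REMOTE_NS = 60, 180   # "easy" vs. "hard" memory access time
CORES = 4

def effective_latency(local_hit_rate):
    """Average access time a core sees, given how often it stays
    in its own local slice of the address space."""
    return local_hit_rate * LOCAL_NS + (1 - local_hit_rate) * REMOTE_NS

# Touching memory uniformly, a core lands in its own local quarter only
# 1/4 of the time; keeping pages local pushes the hit rate much higher.
print(effective_latency(1 / CORES))       # 150.0 -- naive, uniform access
print(round(effective_latency(0.95), 1))  # 66.0  -- mostly-local access
```

Under these made-up numbers, a core that mostly stays local runs its memory more than twice as fast as one that wanders uniformly; that gap is exactly what the hardware and OS work described below is trying to close.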
So we don't really have SMP anymore, but we simulate it via NUMA. That means we had stored program computers with multithreading added, then we "simulated" that with SMP, then we "simulated" SMP with NUMA. That's a lot of conceptual models stacked on each other, which is itself dangerous. The last time I recall anyone saying that we were stacking models on models, and that it was a good thing, was when microcoding CPUs, instead of direct hardware implementation, was the rage. If you recall, internal microcoding of CPUs was dropped faster than a tax reduction bill in Congress. But I digress.
The price for NUMA's simulation of SMP is that certain areas of memory are more efficient than others. The majority of the memory a core faces is the slow kind, but that is not a bad thing. SMP is (was) about intra-CPU communication. That slow memory is not bad; it's "communication memory". There are two consequences of this, both of which can be seen as paramount in multicore design. The first is that hardware must be good, really good, at communicating non-local memory accesses between CPUs. The second is that the operating systems that use multicore must be good at working with NUMA systems.
The hardware requirement, intracore communication, is difficult, but has a good side. And that is, since it is creating an SMP "simulation", once done right, it can be forgotten by the software. AMD led the way in NUMA for the 80x86 series processors by adding a complex memory controller on-chip along with a very high speed intra-CPU communications channel, the HyperTransport bus, then also adding an on-chip communications crossbar switch to route the traffic between cores, technology right out of high speed internet core routers. AMD worked this system out before multicore chips, and was positioned to jump into multicores by simple integration. Intel solved the same problem basically by having the same cache serve as the bonding point for the cores, which in network terms would be described as a "shared memory" switching technology.
The software requirement of NUMA is solved by bigger, more complex operating systems (raise your hand if you are surprised by this!). The OS must choose between making pages of memory local to a given core, sharing them, or even making two copies of the same information (typically executable code) and giving each core its own copy. An operating system can ignore that it is implemented on NUMA and not SMP, but the result of such ignorance is inefficient execution.
The "hardware consistency" rock in the stream was, and is, under control. That does not mean it is easy, it just means it is solvable. The hardware issues are solved, and getting better all of the time. The OS issues are in a similar position. I have tried to synopsize the issues above, but it's a complex and ongoing design problem.
The other problem that multicore has brought to a head is what this letter is all about.
Basically, multicore ties into the multitasking model, and assumes that the multitasking model is a solved problem. The problem is, it is far from solved. Multitasking got dumped on programmers back in the 1960s. That's, errr, some 40 years or more to get used to an idea. The problem is, even though we have had 40 years to figure out how to deal with threads, and have come up with lots of good solutions, applications, also known as "that which we must ship to get paid", are amazingly well stuck in a single tasking world.
The basic answer to why applications still are not very multithreading capable is that "it's hard to do". If you accept for now that multithreading is not cake easy, and not generally practiced, we can talk about what that means in multitasking (and multicore) terms. Even though the impact of multithreading is 40 years old, it is central to multicore solutions today.
Actually, the fact that programmers don't like to build multithreaded applications has been known for a long time. A far easier goal for multiprocessing operating systems was to run multiple processes in parallel. This is what enabled mainframes to have multiple terminals, and finally killed off the batch systems, and struck a massive blow for computer democracy. Large timeshare systems brought mainframes to the people just in time for them to get killed off by microcomputers.
With the advent of windowed user interfaces and the internet, multiprocessing took off again. The first generally available windowing interfaces were single task/cooperative, including Windows and the Mac OS. This changed in 1995 for Windows, and later for the Mac, as they dropped the cooperative model at last, and embraced multitasking. Now, looking at my current desktop, I am editing a web page, looking at another, running a DOS command line window, editing code in another, and at the same time getting my favorite French radio station from half a world away in the background. This would be difficult or impossible if the "cooperative" model had persisted.
So the simple answer about how multicore has affected real application use is that it is a gangbuster success: at multiprocessing, but not at multithreading. In other words, running multiple applications, in different windows, has worked well for multicore implementations. But getting each application to run faster, outside of a few "heroic" applications like Matlab, has been a drag. This is not news, and in fact EE Times has been publishing opinions to this effect as folks ponder the potential speed improvement of advancing multicore use.
There have been several articles published on theories about multicore impact on program speedup. A lot of the studies done on multicore assume that the hardware or software won't make use of the efficiencies of NUMA. In other words, the gloom and doom that multicore will be useless past 8-32 cores is probably wrong (in my not-very humble opinion). The basis for the gloom and doom predictions is that the cores will start getting in each other's way more and more as the number of SMP/NUMA connected cores increases. Instead, I believe, NUMA connected cores will increasingly be formed into groups, with levels of communication managed by the operating system.
However, I foresee another limit on the horizon that is closer and more fundamental than any multicore communication theory.
First, allow me to digress a step or two. I am very late to the multicore revolution. Even though I run three computers here in my office, none of them are multicore. The only multicore machine I run belongs to my teenager, because she needed a computer, and I needed to get her off mine, and there was no real reason to buy an outdated single core system.
The reason that matters is that I am still stuck in a single core world, and, since I both use and write applications for Windows, I observe a lot of interesting behavior with respect to that.
The first observation is that multithreading, even though it is available with good support under Windows, is hardly used in Windows applications. In fact, there is quite a bit of the old cooperative Windows left. It is very common to see what I call "faceless windows". This is an application presenting a window that is locked or frozen. You can tell this if, by moving the window around or placing another window in front of it, the window "loses face", presenting a white surface without buttons or other content. In the worst case, the window will contain chewed up bits of the desktop and other windows (which actually depends on whether the application obeys the "clear window" message in its window handler).
The reason this happens is that the application is single threaded, and worse, that one thread is also used to manage the display for the window. If the thread is hung, then the window can no longer draw itself. This can even occur in multiple windows, since some common applications run multiple windows from the same thread/process.
Now I personally advocate always running your display on its own thread, but that's a different issue. Having a thread dedicated to the user interface basically means you value the user, even if all that thread can do is keep the display consistent and show an "I'm busy..." indicator. The real question is why the thread should hang up in the first place.
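The dedicated UI thread idea can be shown in miniature. In this sketch (Python standing in for a real GUI toolkit, with prints in place of repaint messages), the slow work goes to a worker thread, so the "UI" loop never freezes:

```python
import threading
import time

done = threading.Event()

def slow_device_scan():
    """Stand-in for probing an offline device that takes a while to time out."""
    time.sleep(0.5)
    done.set()

threading.Thread(target=slow_device_scan).start()

# The "UI" thread: wakes ten times a second, so it can always repaint
# and show an "I'm busy..." indicator instead of losing face.
ticks = 0
while not done.wait(timeout=0.1):
    ticks += 1
    print("I'm busy...")

print("scan finished; UI stayed responsive for", ticks, "ticks")
```

Had the scan run on the UI thread instead, the loop could not have printed anything for half a second: a faceless window in miniature.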
Since we, as regular computer users, rarely ask our systems to solve problems like routing an integrated circuit or figuring out how to fold a protein, program stalls occur mainly because devices, as in externally connected peripherals, sometimes take time or are not online. For example, my Microsoft Word will "lock and block" the user interface frequently because it is looking through a series of removable disk/flash drives on my computer. It will soon time out on each of these devices, but the thread running Word is blocked, and the user interface is blocked along with it.
This, believe it or not, is reasonable behavior. Windows gives offline devices a chance to come online, and backs that up with a timeout. You can't complain about the timeout, because there is always a device that could have come online if the timeout were longer, and reducing the timeout to the point it would be unnoticeable would cause some peripherals to be excluded. Word should have another thread running its user interface, but that is another issue.
What scares me far more is that some applications in this same situation drive the CPU utilization to 100% while they are waiting. They are not only waiting for a device to time out, but they are sitting in a polling loop wasting CPU time while that happens. This occurs frequently with Firefox and Thunderbird. What is worse is that they are waiting not for a local device, but for a remote, internet connected resource like a web server or mail server.
The reason this is worse than a locally connected device is that the internet is reached through a complex series of stacked software interfaces, with a highly efficient hardware interface at the bottom, usually with both interrupts and advanced DMA (Direct Memory Access) built in. For a program to hang on an internet communication link means that it is not waiting for the device to set a hardware flag, but rather waiting for other software to signal ready. Why would that be?
Well, the Windows internet interface is done by a software standard called "Winsock", which is the Windows version of the Berkeley network sockets communication standard. The original sockets implementation was based on efficient communicating parallel tasks in a multitasking operating system. When Windows implemented Winsock, Windows itself was not multitasking, but cooperative tasking. The result was that the sockets interface got redone to be compatible with "cooperative multitasking" (an idea that comes close to being an oxymoron, in my opinion), including the ability for the application program to poll the internet connection until data is ready. This idea looks really silly from a lower level implementation standpoint. The internet device being polled is not some hardware bit, but rather one piece of software, the application, asking another piece of software, the operating system/driver, if it is ready. This would be as if you continually went over to your waiter at the coffee shop asking if your coffee was ready. He/she would put up with that a few times, but then tell you to sit down and wait for him/her to bring it over.
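The coffee shop analogy translates directly into code. This sketch (Python sockets rather than Winsock C, purely for illustration) contrasts the polling style, which burns CPU asking the OS "is it ready yet?", with the blocking style, which sleeps until the OS brings the data over:

```python
import select
import socket
import threading

a, b = socket.socketpair()
a.setblocking(False)

# Data will arrive "later", as if from a slow server across the country.
threading.Timer(0.2, lambda: b.send(b"coffee")).start()

# Style 1: the polling loop. A timeout of 0 means "just asking" -- the
# application pesters the OS thousands of times and pegs the CPU.
polls = 0
while True:
    polls += 1
    ready, _, _ = select.select([a], [], [], 0)
    if ready:
        break
a.recv(1024)
print("pestered the waiter", polls, "times")

# Style 2: block until the OS signals ready, consuming no CPU meanwhile.
threading.Timer(0.2, lambda: b.send(b"refill")).start()
select.select([a], [], [])            # sit down and wait
msg = a.recv(1024)
print("served:", msg)
```

Both styles get the data at the same moment; the difference is that the first spends the wait spinning at 100% CPU, which is exactly the Firefox and Thunderbird behavior described above.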
This is why an application can suddenly start taking all of the CPU time on your computer just because a server across the country is taking too long to respond.
All of this leads me to believe that how many cores you can put in your computer before you see the efficiency fall off does not come down to any complex formula about multiple cores contending for access; it's about basic human factors. And it's about why multicore really is making your computer look better and faster.
The scenario I described above was the worst case: an internet (or other) application that is waiting for a remote server or a device that is stuck. However, an application that uses polling instead of more efficient multitask messaging to get work done is very likely wasting time in lots of little ways that you don't notice. There are always going to be little delays in access to a device, local or internet connected, and polling produces small chunks of time when the CPU is pegged by applications polling for events. I can see this often on my single core machine, because I hear my favorite French radio station drop out briefly when applications start up, access a device, or access the internet.
Based on this, I highly suspect that the reason multicore delivers such good results for most users today is not because of any division of tasks or other theoretical principle. Having 2 or more cores is simply giving bad applications their own, private core to waste time with. In a system like Windows that divides the main processes running across multiple cores, the application that hangs up momentarily is not also bringing the other applications on your desktop to a halt, even for a short time. As a result, such bad applications are having less of an effect on your system as a whole.
Unfortunately, the idea you had about the workload being efficiently divided among different cores is sadly not true. That second, third or fourth core is not dividing your workload, it's executing a useless polling loop. Most insidiously, having more cores is actually reducing the general demand for true multithreaded applications, because it is giving single thread oriented applications a private CPU to waste.
Based on this idea, I'll formulate my own law: the returns from multicore are going to fall off rapidly after 4 cores. The reason why is simple human factors. Most folks don't actively use more than about 4 programs on their desktop. Oh, it's true that a lot of "power users" tend to open up a lot of windows on their desktop, but few of those windows are actually doing anything. Instead, most users are running one application intensively, which they expect good response time from, and a smattering of background applications such as internet radio, TV, instant messaging, etc.
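One hedged way to put numbers behind this falloff is Amdahl's law: if only a fraction P of the desktop's work actually runs in parallel, N cores can never deliver more than 1 / ((1 - P) + P / N) speedup. The sketch below (Python, with P = 0.5 as an arbitrary stand-in for a mostly single-threaded application mix) shows the curve flattening right around the 4 core mark:

```python
def speedup(p, cores):
    """Amdahl's law: best-case speedup when a fraction p of the work
    is parallel and the rest stays on one core."""
    return 1 / ((1 - p) + p / cores)

# With half the workload parallel, the gains flatten quickly past 4 cores:
for cores in (2, 4, 8, 16):
    print(cores, "cores ->", round(speedup(0.5, cores), 2))
# 2 -> 1.33, 4 -> 1.6, 8 -> 1.78, 16 -> 1.88
```

Going from 2 to 4 cores buys a noticeable jump; going from 8 to 16 buys almost nothing, which is the same conclusion the human-factors argument reaches from the other direction.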
Of course, this pretty much amounts to the same thing others have already been saying. Multicore implementations of 8, 16, 32 and more cores aren't going to make a difference to the average user unless he/she is running a more complex mix of tasks, and present users of CPU intensive tasks like circuit simulation, layout or protein folding aren't going to see a multicore difference without heavy multithreading on the part of applications. And even then, just dividing applications up into threads is not going to get it; it requires programmers to think hard about how to divide tasks into multiple threads running on multiple cores. Even applications that were multithreaded before the advent of multicore aren't going to see automatic speedup, for the simple reason that there was no incentive before to divide the work evenly among the threads. If you had divided a heavy duty simulation, layout or compile among multiple threads, it would not have finished in half the time on a single core implementation. That kind of speedup is entirely due to multicore.
The result is that multicore has been good for multiple processes, but not for multiple threads. The low hanging fruit has all been picked. The user that was pleased going from single core to 2 or 4 cores is not going to be impressed by going to 8 or more cores.
If we get advantages from moving beyond 4 cores, it's going to be applications that lead the way. And how to get there is a subject for the next time.