Personal Desktop: Thermals and Overclocking
About four years ago I built a performance desktop, partly to satisfy my curiosity about computer hardware and partly to cater to my workload. Two years ago I got myself involved in the International Young Physicists' Tournament, a physics research competition, and I overclocked my machine for the first time, to handle the terabytes of experiment data and hundreds of hours of simulation and theoretical computations more efficiently. And besides, electricity in boarding school is free (or charged at a flat rate, depending on how you look at it), and so there really wasn't a reason not to do it.
That was a modest overclock. The Intel i7-4770K processor that I had runs at 3.50 GHz stock, with a turbo boost frequency of up to 3.90 GHz; and here we must remember that the maximum turbo frequency is only reached if a single core is active, meaning that in most intensive compute applications this optimistic cap is never reached. I had overclocked the processor to run at 4.4 GHz for one to two active cores, 4.3 GHz for three active, and 4.2 GHz for four active; and this was stable at a core voltage of 1.240 V, moderately higher than the 1.134 V stock.
I disabled this overclock when I graduated from high school. I had figured I wouldn't be needing massive computational power over my next two years, for I would be serving in the military, and I also figured I ought to save on the power bills.
I changed my decision. And now I'm redoing the overclocking.
Ever since my first overclock I had noticed that the thermal performance of my desktop had been deteriorating; and I had feared I might have damaged something whilst moving the desktop to and fro between boarding school and my home, though I would discover later that this was not the case. Here's an instance of recorded temperatures, whilst running the Prime95 stress test:
CPU package temperatures hover in the 80-90 degrees Celsius range when under load, at times hitting the 100 degrees thermal throttling ceiling. And this is at stock settings, with a Corsair H100i liquid cooler and its 240 mm radiator, which makes it very suspicious. We can also see that one of the cores seems to be much better cooled than the other three, which might indicate some asymmetry in the mounting of the cooler.
One of the first things to try is to replace the thermal paste between the water block and the processor. The purpose of the thermal paste is to fill scratches and gaps between the surfaces of the two components, so that better thermal contact can be achieved; and so deterioration of the thermal paste could, in theory, lead to the ineffective cooling that we're observing.
I bought myself the Cooler Master X1 Extreme Fusion, which was rated to be one of the best non-electrically conductive and non-metallic thermal pastes through physical testing, at least by one review site.
It is my personal opinion that it really doesn't make sense to skimp on the thermal paste, for they're pretty much all in the same 10-20 dollar price range. If, for a few additional bucks, we're able to push the power limits further and increase the clock speeds by 5% or so, then why not? It is a dirt-cheap way of enhancing compute performance. We have to bear in mind that from generation to generation the performance of processors grows approximately 10% at best, and this is at a cost of a few hundred bucks.
And of course I'll have to partially disassemble the desktop in order to replace the thermal material, so here we go.
And indeed the problem is revealed. The thermal paste had dried and cracked, and some pieces of it had disintegrated and fallen off the socket altogether. It is a known issue that thermal pastes would dry over time—but I hadn't expected it to deteriorate so much in two years. This could single-handedly explain the pathetic thermal performance of the desktop.
Cleaning the old paste off is quite easily done using the grease cleaner that came with the X1 Extreme Fusion. I did use some additional 70% isopropyl alcohol, which can be purchased from regular pharmacies. It's not strictly necessary to get everything sparkling clean; all should be well as long as we make sure nothing would interfere with the coupling between the water cooler and the processor.
We can then apply the new thermal paste to the clean processor, and re-mount the water block. It is advisable to drive the screws in a diagonal sequence, so that pressure is more evenly applied onto the processor, and the thermal paste would be spread more uniformly. And following that we can re-assemble the computer.
I'd also seized this opportunity to update my Corsair Link control software. I had found the default fan profiles on the software to be excessively loud, and so I bumped it down by quite a bit—and anyway it's a common misconception that fan speed is a major factor in the cooling efficiency of radiators, at least in the context of personal computers. There are many other bottlenecks with far greater effect than air flow—such as the thermal conduction rates of the thermal interface material and the water block material, the fluid flow rate, and the thermal transfer between the fluid and the radiator, which is in turn dependent on how much turbulent mixing the radiator manages to induce, and the temperature delta. And so halving fan speeds might at most result in a temperature increase of a few degrees, which is an excellent exchange for relative silence.
Do not apply this reasoning to air coolers, though. Airflow really does matter for air coolers, because their metal-air surface areas are relatively limited. This is the primary difference between water coolers and air coolers: because water coolers can channel the heat towards a large radiator placed away from the crowded electronics, they can, in principle, be engineered to have greater cooling capacity.
Back to the story. With the new thermal paste applied, we can do a quick check on performance. Once again we run the Prime95 stress test, and use the Core Temp package to track temperatures.
And we indeed see a drastic improvement. Load temperatures are now in the 60-70 degrees range, which is much more reasonable. The fourth core is still a tad cooler than the other three, though, and that might be due to asymmetry in the internal thermal interface material of the processor—something we can do nothing about, unless we de-lid the processor.
Speaking of delidding processors, I'd recently discovered a CPU retailer named Silicon Lottery, which offers pre-binned Intel and AMD processors with guaranteed overclock performance, and they also offer a very affordable delidding service. For those not familiar with the technicalities, binning here refers to classifying processors according to their actual performance. Processors aren't all made equal, even if they're of the same model, due to material imperfections and lithographic inconsistencies. Manufacturers like Intel and AMD only make sure that the chips coming off their production lines satisfy the specifications as stated on their websites and datasheets; how well the chips perform, beyond those specifications, is not communicated to consumers.
Purchasing from usual retailers would mean that one doesn't pay any premiums, but in exchange one is subject to the silicon lottery; that is, there is a chance of ending up with a chip that overclocks poorly. The novel aspect of Silicon Lottery is that they perform their own binning tests; and customers can then purchase processors knowing exactly how well they'll perform. For instance, I could buy an Intel i7-7700K that's guaranteed to be overclockable to 5.0 GHz.
Turns out I lost the silicon lottery for my i7-4770K. More on this later. All the more reason for me to advocate for pre-binned processors.
And, again, for those unfamiliar with the terminology, delidding here refers to the act of breaking open the Integrated Heat Spreader (IHS), the outer silver-gray metal covering of the processor, in order to replace the internal thermal interface material. Haswell chips have integrated voltage regulator components built into the processor itself, and so their heat output is a tad higher than other Intel generations; this makes delidding a much more worthwhile practice. On some other chips the benefits of delidding might be too low to be worth the risk.
Anyway, now that we have healthy thermals, we can proceed to the overclocking.
There are a lot of guides available on how to overclock the different generations of processors from Intel and AMD, so if you're looking to do it yourself I suggest you look up the relevant materials and follow those guides. I'm far from being an expert on this, and anyway the relevant settings would differ from chip to chip. For instance, there's the new option of using AVX offset to modify the frequency of Kaby Lake chips.
Before overclocking it is generally a good idea to check for system stability, so that we can be sure whatever instabilities we encounter during overclocking are not due to some undetected underlying issue. The first part of this check is to run MemTest86, a program that loads the RAM with all sorts of patterns in order to check for memory integrity issues. It is somewhat special, for it runs on its own rather than under the installed operating system; and this means it needs to be loaded onto a bootable medium and be started during boot-up. In my case I used a USB stick.
To be safe the test can be run several times—I ran it four times, with each taking an hour or so.
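The idea of a pattern test can be sketched in a few lines of Python. This is only an illustration, not how MemTest86 is implemented: the function names and the choice of patterns (all zeros, all ones, two checkerboards and a walking one) are my own assumptions, and a script running under an operating system cannot reliably exercise physical RAM the way a bootable tester does.

```python
def check_pattern(buf: bytearray, pattern: int) -> list:
    """Fill buf with a one-byte pattern, read it back, return mismatching offsets."""
    for i in range(len(buf)):
        buf[i] = pattern
    return [i for i, b in enumerate(buf) if b != pattern]


def run_patterns(size: int = 1 << 16) -> dict:
    """Run a handful of illustrative patterns over one buffer (hypothetical choice)."""
    buf = bytearray(size)
    # 0x55 and 0xAA are bitwise checkerboards; 1 << bit is a "walking one".
    patterns = [0x00, 0xFF, 0x55, 0xAA] + [1 << bit for bit in range(8)]
    return {p: check_pattern(buf, p) for p in patterns}


if __name__ == "__main__":
    results = run_patterns()
    bad = {hex(p): errs for p, errs in results.items() if errs}
    print("mismatches:", bad if bad else "none")
```

A healthy machine should report no mismatches; any mismatch would mean a byte read back differently from what was written.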
And following that I ran a variety of stress tests. The first is Prime95, which runs iterations of the Lucas-Lehmer primality test on large Mersenne numbers. Prime95 uses FFT-based multiplication to handle these enormous numbers efficiently, and using shorter FFT lengths tends to stress the processor more, thus allowing us to check more thoroughly for stability.
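The Lucas-Lehmer test itself is short enough to sketch. The version below is a plain textbook form using Python's built-in big integers; it is not Prime95's code, which relies on heavily optimized FFT arithmetic to handle exponents in the millions.

```python
def lucas_lehmer(p: int) -> bool:
    """Return True if the Mersenne number 2**p - 1 is prime.

    The exponent p is assumed to be prime itself.
    """
    if p == 2:
        return True  # 2**2 - 1 = 3 is prime; the loop below needs p > 2
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m  # the squaring is the expensive step
    return s == 0


if __name__ == "__main__":
    exponents = [3, 5, 7, 11, 13, 17, 19, 23, 31]
    print([p for p in exponents if lucas_lehmer(p)])
    # prints [3, 5, 7, 13, 17, 19, 31]
```

Almost all of the work is in the repeated squaring modulo 2^p - 1, which is why a fast multiplication routine matters so much, and why the test keeps the processor's floating-point units busy.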
I'd also used the Intel Burn Test for stress testing, which implements LINPACK, a numerical linear algebra library known to be demanding on the processor. In fact the numerical performance of supercomputers is traditionally benchmarked using LINPACK, but the value of this practice has become questionable, for modern computers are built for increasingly specialized functions. For instance, machines meant for machine learning can forgo high-precision floating-point arithmetic for lower-precision ones optimized at the hardware level, and this penalizes them at LINPACK, even though they easily outperform conventional counterparts at what they were built for. That's for another story. For our purposes we don't really need to care about what LINPACK is, so long as it stresses our processor.
Aside from these programs, PCMark and Cinebench can also be used. And of course I ran my own suite of tests as well, on Mathematica code, video rendering, finite element simulations, and a bunch of other tasks.
And now we start overclocking, which really is a fancy name for a simple process, if you don't dive into the internals and advanced settings too much. We set for ourselves a target frequency, which in my case is 4.6 GHz, and we bump the core voltage up bit by bit until we find the minimum voltage at which the system is stable. At that point we're done.
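Written out as a loop, the procedure looks something like the sketch below. Everything here is hypothetical: in reality each "is it stable?" check means setting the voltage by hand in the BIOS, rebooting and running a stress test, so `is_stable` is just a stand-in for that manual step, and the 10 mV step size is an assumed value rather than a recommendation.

```python
from typing import Callable, Optional


def find_min_stable_voltage(
    is_stable: Callable[[float], bool],
    start_v: float,
    max_v: float,
    step_v: float = 0.010,
) -> Optional[float]:
    """Raise the voltage in small steps; return the first stable value.

    Returns None if nothing up to max_v (the self-imposed safe limit) is
    stable, in which case the target frequency has to come down instead.
    """
    steps = int(round((max_v - start_v) / step_v))
    for i in range(steps + 1):
        v = round(start_v + i * step_v, 3)
        if is_stable(v):
            return v
    return None


if __name__ == "__main__":
    # Stand-in for a chip that needs at least 1.290 V at the target frequency.
    fake_chip = lambda v: v >= 1.290
    print(find_min_stable_voltage(fake_chip, start_v=1.200, max_v=1.350))
```

The search goes upward from a low voltage on purpose: the first stable value found is then the minimum, and the chip never sees more voltage than it needs.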
To test for stability we again use Prime95 and the Intel Burn Test, and in principle we ought to let the tests run for hours on end; but in practice half an hour is good enough during the trial-and-error overclocking process. A 24-hour stability test can always be done after the overclock, to make sure everything is well.
All the tweaking of frequencies and voltages can be done quite easily in the BIOS, and in my case Gigabyte had done a decent job at organizing their BIOS, which makes things even easier. A difference between my first overclock, and the overclock now, is that last time I had set different maximum frequencies dependent on the number of active cores, whereas this time I had decided to be lazy and simply set a maximum frequency for the processor, regardless of the number of active cores. I don't think it matters much. The most demanding compute tasks are almost always parallelized, and so what matters for performance is the processor frequency when all cores are active.
After a certain point everything gets blown to hell, because you start hitting thermal limits or safe voltage limits, and your dreams of a good overclock get crushed in a swift, brutal sweep. The mathematics itself is working against you: the power dissipation of a processor is approximately linear in frequency and quadratic in supplied voltage, and as a rule of thumb the core voltage needed to support a stable overclock is roughly proportional to frequency, so what we end up with is a thermal output roughly cubic in frequency. This is the reason why overclocks tend not to be too extreme.
And in addition some chips just won't clock past a certain frequency, even if voltages are raised sky-high. My Intel i7-4770K seems to have hit this ceiling at 4.5 GHz, requiring a whopping 1.360 V to remain somewhat stable. The thermals at such a voltage are insane, so I settled for an overclock of 4.40 GHz at 1.290 V. This was why I said I lost the silicon lottery.
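To put rough numbers on the scaling rule, here is a small sketch that compares the two all-core overclocks mentioned in this post (4.2 GHz at 1.240 V the first time, 4.4 GHz at 1.290 V now). It assumes the rule of thumb that dynamic power goes as frequency times voltage squared, and it ignores leakage and everything else, so treat the result as an estimate only.

```python
def relative_power(f_new: float, v_new: float, f_old: float, v_old: float) -> float:
    """Ratio of dynamic power, assuming power is proportional to f * V**2."""
    return (f_new / f_old) * (v_new / v_old) ** 2


if __name__ == "__main__":
    ratio = relative_power(4.4, 1.290, 4.2, 1.240)
    cubic = (4.4 / 4.2) ** 3  # what a pure "cubic in frequency" rule would predict
    print(f"frequency gain:   {4.4 / 4.2 - 1:.1%}")
    print(f"estimated power:  {ratio:.3f}x")
    print(f"cubic prediction: {cubic:.3f}x")
```

Under these assumptions, a frequency gain of just under 5% costs about 13% more heat, which is close to what the cubic rule predicts.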
And now for some results. The video rendering test was done on Vegas Pro 13, and involves some chroma-keying and keyframe animation; the finite-element computational fluid dynamics simulation was of laminar supersonic flow around an object with adaptive mesh refinement, run in COMSOL Multiphysics. And lastly the Mathematica benchmark was a numerical time-stepped simulation of a Newtonian n-body gravitational system.
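For a sense of what a time-stepped n-body simulation involves, here is a minimal sketch in Python. It is not the Mathematica benchmark itself: the velocity-Verlet integrator, the 2D setup, the units with G = 1 and the two-body test case are all assumptions chosen to keep the example short.

```python
import math


def accelerations(pos, mass, g=1.0):
    """Pairwise Newtonian accelerations in 2D (direct O(n^2) summation)."""
    acc = [[0.0, 0.0] for _ in pos]
    for i in range(len(pos)):
        for j in range(len(pos)):
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            r = math.hypot(dx, dy)
            acc[i][0] += g * mass[j] * dx / r**3
            acc[i][1] += g * mass[j] * dy / r**3
    return acc


def step(pos, vel, mass, dt):
    """Advance the system by one velocity-Verlet time step, in place."""
    acc = accelerations(pos, mass)
    for p, v, a in zip(pos, vel, acc):
        v[0] += 0.5 * dt * a[0]  # half kick
        v[1] += 0.5 * dt * a[1]
        p[0] += dt * v[0]        # drift
        p[1] += dt * v[1]
    acc = accelerations(pos, mass)
    for v, a in zip(vel, acc):
        v[0] += 0.5 * dt * a[0]  # second half kick
        v[1] += 0.5 * dt * a[1]


def total_energy(pos, vel, mass, g=1.0):
    """Kinetic plus gravitational potential energy of the whole system."""
    kinetic = sum(0.5 * m * (v[0] ** 2 + v[1] ** 2) for m, v in zip(mass, vel))
    potential = 0.0
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            r = math.hypot(pos[j][0] - pos[i][0], pos[j][1] - pos[i][1])
            potential -= g * mass[i] * mass[j] / r
    return kinetic + potential


def circular_binary():
    """Two unit masses one unit apart, on a circular orbit about their centre."""
    v = math.sqrt(0.5)
    return [[-0.5, 0.0], [0.5, 0.0]], [[0.0, -v], [0.0, v]], [1.0, 1.0]


if __name__ == "__main__":
    pos, vel, mass = circular_binary()
    e0 = total_energy(pos, vel, mass)
    for _ in range(1000):
        step(pos, vel, mass, 0.001)
    print("energy drift:", abs(total_energy(pos, vel, mass) - e0))
```

The cost per step grows with the square of the number of bodies, and every step is pure floating-point arithmetic, which is what makes this sort of simulation a decent processor benchmark.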
And we can also look at the thermal performance of the overclocked computer. Load temperatures are in the 75-85 degrees Celsius range, though they can climb into the nineties under exceptionally intense workloads. Some overclocking guides claim that it is not advisable to go beyond 80 degrees, because it would shorten the lifespan of the processor greatly; but this is something that I don't entirely agree with.
Death of processors comes about because of two primary factors: electromigration, and stress damage due to repeated thermal loading and unloading. Electromigration refers to the transport of processor die material by the flowing electrons, similar to how water in rivers can transport sediment; given enough time the conductive channels in the processor would wear out. We don't usually observe this phenomenon in our macroscopic world, because electromigration happens at a tiny scale. The dimensions of components in a modern processor are in the nanometer scale, though, and electromigration thus becomes a considerable effect.
But to say that an increase in voltage of a few hundred millivolts would result in a significant shortening of a processor's lifespan might be an overstatement. Processors at stock settings can easily last more than a decade—and more often than not other components would fail before the processor does. A typical timescale for a processor to become outdated might be 4-5 years, which corresponds to about 3 new microarchitecture generations. I doubt a voltage increase of 15% or so would more than halve the lifespan of a processor.
As for stress damage, the same argument applies. I doubt an increase in temperature delta of 15% (we'll have to use Kelvins here) would halve their lifespans.
It is interesting to note that the overclocked performance of my desktop is approximately 115 GFlops, and this would be faster than the fastest supercomputer available in the world, back in 1993. And so in 25 years technology has progressed fast enough to shrink what once would occupy an entire floor of a building to something that can be placed on a desk.
And not only that: that supercomputer in 1993 with similar performance drew 131 kW of power. The desktop draws 230 W. That's a roughly 570-fold improvement, or nearly 3 orders of magnitude.
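The arithmetic behind that comparison is quick to check. The sketch below uses only the figures quoted above, and assumes, as stated, that the two machines have similar performance, so the same 115 GFlops is used for both.

```python
import math

GFLOPS = 115.0             # overclocked desktop; assumed similar for the 1993 machine
DESKTOP_WATTS = 230.0
SUPERCOMPUTER_WATTS = 131_000.0

power_ratio = SUPERCOMPUTER_WATTS / DESKTOP_WATTS
orders_of_magnitude = math.log10(power_ratio)

desktop_efficiency = GFLOPS / DESKTOP_WATTS              # GFlops per watt
supercomputer_efficiency = GFLOPS / SUPERCOMPUTER_WATTS  # GFlops per watt

if __name__ == "__main__":
    print(f"power ratio:         {power_ratio:.0f}x")
    print(f"orders of magnitude: {orders_of_magnitude:.2f}")
    print(f"desktop:             {desktop_efficiency:.2f} GFlops/W")
    print(f"1993 supercomputer:  {supercomputer_efficiency * 1000:.2f} MFlops/W")
```

So the desktop manages about half a GFlops per watt, against a little under one MFlops per watt for the 1993 machine.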
And so, cheers for the ingenious science and engineering! Cheers for our civilization! Things like these make me hopeful for the future. And honestly, amidst all the crap in this world, things like these really do make me feel proud, if only for a little while before other things set in and I return to being apprehensive about it all again. See, the good thing about the physical sciences and technology is that it's relatively hard to taint the beauty of it all with some cancerous aspect of man-made society; but I reckon someone someday will manage to do it big-time, regardless.
Till next time, goodbye!