Datacenter Optimization

The aim of optimization is to:

Reduce infrastructure system failures
Reduce costs (equipment, maintenance, power and water consumption)

Important note: increasing infrastructure reliability should be the main objective. Once the infrastructure is fully stable and able to withstand both external and internal datacenter failures, time will be freed up to optimize and to experiment with new optimizations.

General formulae

PUE

Power usage effectiveness (PUE) is an indicator of data center energy efficiency. PUE should take all energy losses into account. In practice, however, many variations of the formula are seen in datacenters. In theory, PUE should be:

PUE=(Total facility energy used)/(Useful energy used)

Useful energy used is the total energy used to produce something (calculations, heating a swimming pool, growing tomatoes, etc.).
Total facility energy used should be measured at the power supplier's feed, before the transformers, so that it includes all data center losses.

Note that this formula can lead to a PUE below 1 if heat is re-used.
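
As a minimal sketch of the theoretical formula, the Python snippet below computes PUE from two energy totals. The function name and all figures are hypothetical, chosen only to illustrate how counting re-used heat as useful energy can push the result below 1.

```python
def pue(total_facility_kwh: float, useful_kwh: float) -> float:
    """Theoretical PUE: total facility energy divided by useful energy."""
    if useful_kwh <= 0:
        raise ValueError("useful energy must be positive")
    return total_facility_kwh / useful_kwh


# Hypothetical yearly figures in kWh, for illustration only.
total = 1_300_000.0        # measured at the supplier feed, before transformers
it_work = 1_000_000.0      # energy used for calculations
reused_heat = 400_000.0    # heat delivered to a pool, a greenhouse, etc.

print(round(pue(total, it_work), 2))                # 1.3
print(round(pue(total, it_work + reused_heat), 2))  # 0.93
```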

This definition also leaves room for interpretation (sales people really like this). Most of the time, it is reduced to:

PUE=(Total facility energy used)/(IT energy used at PSU)

However, this is not really accurate. For example, the transformers often feed components other than the cluster (lights, offices, etc.). Also, if the measurement is taken at the PSU of an air cooled server, the fans inside the server are counted as IT energy although they are part of the cooling. Finally, this PUE does not reflect IT PSU efficiency, because IT energy is measured at the IT plug, before the IT PSU. Measuring the energy used by the whole IT system after the PSU would be difficult.
In general, the more probes on electrical, cooling and IT equipment, the more accurate the PUE. The remaining variables have to be covered by assumptions.
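
To see how much the measurement point matters, the sketch below compares the PSU-level PUE with a variant that counts server fan energy as cooling rather than as IT load. The function names and numbers are assumptions for illustration; real fan energy would have to come from probes or from an estimate.

```python
def pue_at_plug(total_kwh: float, it_plug_kwh: float) -> float:
    """Common PUE variant: IT energy measured at the IT plug, before the PSU."""
    return total_kwh / it_plug_kwh


def pue_fans_as_cooling(total_kwh: float, it_plug_kwh: float, fan_kwh: float) -> float:
    """Same measurements, but server fan energy is removed from the IT share."""
    return total_kwh / (it_plug_kwh - fan_kwh)


# Hypothetical readings over the same period, in kWh.
total_kwh = 1400.0
it_plug_kwh = 1000.0
fan_kwh = 100.0

print(round(pue_at_plug(total_kwh, it_plug_kwh), 3))                   # 1.4
print(round(pue_fans_as_cooling(total_kwh, it_plug_kwh, fan_kwh), 3))  # 1.556
```

The total energy is unchanged; only the share attributed to IT moves, which is why the same datacenter can report noticeably different PUE values.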

Typical PUE values are 1.4-1.6 for air cooled datacenters and 1.1-1.2 for watercooled datacenters.

The best ways to really reduce PUE are:

Increase the temperature delta of all equipment.
Re-use heat (this leads to interesting effects: for example, pump energy is no longer wasted, because the flow ends up as heat that is re-used).
Use correctly sized equipment (an oversized PSU leads to waste, etc. Driving a 4×4 Hummer in a city is just a waste of fuel; the same applies here).

However, PUE says nothing about computing performance, i.e. flops per watt, which is also an important performance indicator.

Flops performances

With new green computers, it is important to take flops per watt into account, with the energy measured before the IT PSU (local, hence the subscript L):

Flops/W_L=(flops delivered)/(IT energy used)

Still, this indicator does not take into account network performance or real application performance per delivered watt.
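
As a small worked example of the ratio above, the sketch below assumes that "flops delivered" is a sustained rate in floating point operations per second and that IT power is measured in watts before the IT PSUs. The cluster figures are hypothetical.

```python
def flops_per_watt(flops_delivered: float, it_power_w: float) -> float:
    """Flops/W_L: sustained flops divided by IT power measured before the PSU."""
    if it_power_w <= 0:
        raise ValueError("IT power must be positive")
    return flops_delivered / it_power_w


# Hypothetical cluster: 1 PFLOPS sustained, 500 kW measured at the IT plugs.
gflops_per_w = flops_per_watt(1.0e15, 500_000.0) / 1e9
print(gflops_per_w)  # 2.0
```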

Cooling optimization

First and most important: think globally. If optimizing somewhere generates a major loss somewhere else, then it’s not globally efficient.

Second and also important: information is key. To optimize a datacenter, you need probes everywhere, and you need to monitor these probes before a modification, right after it, and then over the long term. If you are short on money, use Arduinos combined with cheap sensors such as the DHT11.
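
A before/after comparison can be as simple as the sketch below, which compares the mean of two series of probe readings. The readings are invented for illustration; in practice both series should cover comparable load and weather conditions.

```python
from statistics import mean


def relative_change(before: list[float], after: list[float]) -> float:
    """Relative change of the mean reading, e.g. -0.05 means 5 % lower."""
    return (mean(after) - mean(before)) / mean(before)


# Hypothetical total facility power readings in kW.
before_kw = [412.0, 408.0, 415.0, 410.0]
after_kw = [396.0, 399.0, 394.0, 398.0]

change = relative_change(before_kw, after_kw)
print(f"{change:+.1%}")  # -3.5%
```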

To optimize cooling, the main objective is to maximize the temperature delta, taking into account:

Server micro-fans are shaped like micro-turbines and can consume from 15 to 50+ watts (an average of 28 W for a 1U OEM server, according to a Facebook datacenter study). Raising the computer room temperature too much may therefore not be globally efficient.
Reduce latent heat exchange with the air in computer rooms at all costs (major loss of energy).
Safety delay before damage: the hotter the water in watercooling, the more efficient it is, but the less time you have to shut down if the cooling infrastructure fails.
Reduce pressure losses.
Maintain water quality (it can be affected by temperature).

A good strategy would be:

Raise the IT room temperature to the maximum supported by the IT equipment (bearing in mind that this reduces the time available in case of emergency).
Check IT power consumption during this temperature rise, and ensure the fans do not draw too much energy as a consequence (there can be a step effect: fans are not regulated linearly but in steps).
Also check the consumption of the air handling unit blowers.
Raise the cold water temperature, using the psychrometric chart to prevent any water condensation (increasing the water temperature should reduce condensation).
Before this step, check the maximum temperature allowed by equipment such as chillers (some cannot operate above a specific temperature).
Check whether the power used globally has been reduced.
In the end, pumps and air handling unit fans should not have to work much harder, because the Delta T was more or less conserved; only the temperatures were increased.
With higher temperatures, chiller efficiency should rise considerably, because the Delta T with the outside air has increased. It also allows more free cooling, if available on the chiller.
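
For the condensation check in the strategy above, a dew point estimate can stand in for reading the psychrometric chart. The sketch below uses the Magnus approximation, which is not taken from this page; the coefficients, the 2 °C safety margin and the function names are assumptions for illustration.

```python
import math


def dew_point_c(air_temp_c: float, rel_humidity_pct: float) -> float:
    """Dew point in °C from the Magnus approximation (assumed coefficients)."""
    a, b = 17.62, 243.12  # dimensionless, °C
    gamma = math.log(rel_humidity_pct / 100.0) + a * air_temp_c / (b + air_temp_c)
    return b * gamma / (a - gamma)


def condensation_risk(water_temp_c: float, air_temp_c: float,
                      rel_humidity_pct: float, margin_c: float = 2.0) -> bool:
    """True if the water is colder than the dew point plus a safety margin."""
    return water_temp_c < dew_point_c(air_temp_c, rel_humidity_pct) + margin_c


# Hypothetical room at 25 °C and 50 % relative humidity: dew point near 14 °C.
print(round(dew_point_c(25.0, 50.0), 1))
print(condensation_risk(12.0, 25.0, 50.0))  # True: 12 °C water can condense
print(condensation_risk(18.0, 25.0, 50.0))  # False: 18 °C water is above the margin
```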

Of course, to increase Delta T in the IT room, use aisle containment. There is no need to buy very expensive containment; just ensure it is fire resistant and does not interfere with the fire detection and extinguishing strategy. This will massively increase the Delta T at the air handling unit.

The same strategy applies to watercooled IT equipment (meaning the CPU is cooled directly by water). Try to maximize Delta T. However, seriously consider the time needed to shut everything down in case of emergency. The main water loop of a watercooled system often has a small volume, and its temperature rises very quickly if cooling fails.
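
A rough order of magnitude for this delay can be estimated from the heat capacity of the water in the loop. The sketch below assumes that no heat is removed once cooling fails and that only the water stores heat (pipes and cold plates are ignored), so it is a simplified estimate rather than a safety calculation. All figures are hypothetical.

```python
WATER_SPECIFIC_HEAT = 4186.0  # J/(kg.K), liquid water


def seconds_until_limit(loop_water_kg: float, heat_load_w: float,
                        current_temp_c: float, max_temp_c: float) -> float:
    """Time for the loop water to heat from current_temp_c to max_temp_c."""
    margin_k = max_temp_c - current_temp_c
    if margin_k <= 0:
        return 0.0
    return loop_water_kg * WATER_SPECIFIC_HEAT * margin_k / heat_load_w


# Hypothetical loop: 200 kg of water, 100 kW of IT heat, 45 °C now, 60 °C limit.
print(round(seconds_until_limit(200.0, 100_000.0, 45.0, 60.0)))  # 126
```

With the same loop running at 55 °C instead of 45 °C, the margin drops from 15 K to 5 K and the available time is divided by three.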

Power optimization

There are not many ways to optimize power consumption. It is mostly a matter of correct sizing:

A PSU that is too big and not loaded to at least 60-80% will have poor efficiency.
The same goes for inverters.
Batteries consume a lot. Do you need to back up all your equipment? For how long? Have you considered a flywheel?
Undersized cables generate heat losses.
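
To put a number on cable losses, the sketch below applies Joule's law (P = I²R) to a single copper conductor. The resistivity value is the usual figure for copper at about 20 °C; the current, length and cross-sections are hypothetical.

```python
COPPER_RESISTIVITY = 1.68e-8  # ohm.m, copper at about 20 °C


def cable_loss_w(current_a: float, length_m: float, section_mm2: float,
                 resistivity: float = COPPER_RESISTIVITY) -> float:
    """Heat dissipated in one conductor: I^2 * R, with R = rho * L / A."""
    resistance_ohm = resistivity * length_m / (section_mm2 * 1e-6)
    return current_a ** 2 * resistance_ohm


# Hypothetical 50 m run carrying 100 A, with two candidate cross-sections.
for section_mm2 in (25.0, 50.0):
    print(section_mm2, round(cable_loss_w(100.0, 50.0, section_mm2)))
# 25.0 336
# 50.0 168
```

Doubling the cross-section halves the loss, which has to be weighed against the cost of the copper.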

In general, optimize the operating range of the equipment: overloaded equipment is at risk, and underloaded equipment has low efficiency.
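
The sizing rule can be turned into a simple check on probe readings, as sketched below. The 60 % lower bound comes from the PSU figure above; the 90 % upper bound and the function name are assumptions to adapt to each piece of equipment.

```python
def load_status(drawn_w: float, rated_w: float,
                low: float = 0.60, high: float = 0.90) -> str:
    """Classify equipment load as a fraction of its rated power."""
    load = drawn_w / rated_w
    if load < low:
        return "underloaded"   # poor efficiency
    if load > high:
        return "overloaded"    # equipment at risk (assumed threshold)
    return "ok"


# Hypothetical 1600 W PSU under three different loads.
for drawn_w in (300.0, 1100.0, 1550.0):
    print(drawn_w, load_status(drawn_w, 1600.0))
# 300.0 underloaded
# 1100.0 ok
# 1550.0 overloaded
```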

Resources

Articles :

http://fire.nist.gov/bfrlpubs/build85/PDF/b85009.pdf
