====== Parallelization ======
===== SIMD instructions =====

SIMD instruction sets such as SSE2 provide modest parallelism (2-4 times) and make it possible to quickly evaluate functions that are expensive for the processor, such as 1/x, sqrt(x), and 1/sqrt(x).

In practice, they speed up calculations by 10 to 100% (more typically 10% ;-) ).
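
As an illustration of the fast 1/sqrt(x) mentioned above, here is a minimal sketch in C using the SSE intrinsic ''_mm_rsqrt_ps'', which returns an approximation good to roughly 12 bits for four floats at once. The function name ''inv_sqrt_sse'' and the single Newton-Raphson refinement step are assumptions made for this example, not something prescribed by the text; whether the refinement is needed depends on the accuracy the force calculation requires.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics, available on every SSE2-capable CPU */

/* Approximate 1/sqrt(x) for four floats, refined by one Newton-Raphson step:
   y' = y * (1.5 - 0.5 * x * y * y). */
static __m128 rsqrt_refined(__m128 x)
{
    __m128 y      = _mm_rsqrt_ps(x);                    /* hardware estimate */
    __m128 half_x = _mm_mul_ps(_mm_set1_ps(0.5f), x);
    __m128 yy     = _mm_mul_ps(y, y);
    __m128 corr   = _mm_sub_ps(_mm_set1_ps(1.5f), _mm_mul_ps(half_x, yy));
    return _mm_mul_ps(y, corr);
}

/* Hypothetical helper: out[i] ~= 1/sqrt(in[i]) for n positive floats. */
void inv_sqrt_sse(const float *in, float *out, size_t n)
{
    size_t i = 0;

    /* Main loop: four values per iteration, unaligned loads and stores. */
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i, rsqrt_refined(_mm_loadu_ps(in + i)));

    /* Tail: broadcast the remaining values one at a time and keep lane 0. */
    for (; i < n; ++i)
        out[i] = _mm_cvtss_f32(rsqrt_refined(_mm_set1_ps(in[i])));
}
```

In a molecular-dynamics inner loop, ''in'' would typically hold squared interatomic distances, so a single call yields the 1/r factors for a batch of pairs.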

===== OpenMP =====

Because of some of its architectural features, this technology is difficult to use for threading a single molecular-dynamics task.
However, it is well suited to running a number of similar models in parallel, as in the //replica exchange// method.
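
A minimal sketch of this pattern in C follows: each replica is an independent model, so a single ''#pragma omp parallel for'' over the replicas needs no synchronization inside the loop. The ''replica_t'' structure and the toy harmonic-oscillator integrator are hypothetical stand-ins for a real model and its propagation step; the exchange attempts between replicas are not shown and would run after the parallel loop.

```c
/* Hypothetical replica state: one coordinate and one velocity
   stand in for the full set of atoms of a real model. */
typedef struct {
    double x;
    double v;
} replica_t;

/* Toy stand-in for an MD run: harmonic oscillator with unit mass
   and unit force constant, semi-implicit Euler integration. */
static void run_replica(replica_t *r, int n_steps, double dt)
{
    for (int s = 0; s < n_steps; ++s) {
        r->v -= r->x * dt;  /* force = -x */
        r->x += r->v * dt;
    }
}

/* Propagate all replicas; each loop iteration touches only its own replica,
   so the iterations can be distributed over threads without locking. */
void run_all_replicas(replica_t *reps, int n_reps, int n_steps, double dt)
{
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n_reps; ++i)
        run_replica(&reps[i], n_steps, dt);

    /* Exchange attempts between replicas would follow here, serially. */
}
```

Compiled without OpenMP support (for example, without ''-fopenmp'' in GCC) the pragma is ignored and the loop runs serially with the same result.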


===== MPI =====

MPI is widely used for parallelization on computer clusters. Typical tasks scale by a factor of 20-30.
Specific tasks in the best programs scale by a factor of up to 200-500.

===== GPU =====

At present, relatively cheap graphics cards that support CUDA offer thousands of execution units, but it is difficult to use their combined power. As with MPI, a molecular-mechanics problem is difficult to scale a thousandfold. In practice, large models with some algorithms are accelerated 50-100 times. In the more typical case of models with 500-2000 atoms, the acceleration is only 2-10 times relative to a single processor core.

**A practical solution would be to run multiple tasks (applications) on a single card.**