Adding cython support to pypy:
http://rguillebert.blogspot.com/
Tuesday, August 30, 2011
Thursday, August 18, 2011
Why OpenCV is so fast
Convolution is the basis of many computer vision algorithms and is straightforward to implement in C, but in a comparison of various implementations OpenCV clearly comes out as the winner.
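To make "straightforward to implement in C" concrete, here is a minimal sketch of a naive 2D convolution. It is not taken from any of the libraries compared in this post. The function name convolve2d_naive and the choice to treat pixels outside the image as zero are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Naive 2D convolution of a float32 image with a small square kernel.
 * Assumption for this sketch: pixels outside the image count as zero. */
void convolve2d_naive(const float *src, float *dst, int rows, int cols,
                      const float *kernel, int ksize)
{
    assert(ksize % 2 == 1);          /* odd size so the kernel has a center */
    int r = ksize / 2;               /* kernel radius */

    for (int y = 0; y < rows; y++) {
        for (int x = 0; x < cols; x++) {
            float acc = 0.0f;
            for (int ky = 0; ky < ksize; ky++) {
                for (int kx = 0; kx < ksize; kx++) {
                    /* Flipped indexing: true convolution, not correlation. */
                    int sy = y + r - ky;
                    int sx = x + r - kx;
                    if (sy < 0 || sy >= rows || sx < 0 || sx >= cols)
                        continue;    /* zero padding at the border */
                    acc += src[(size_t)sy * cols + sx]
                         * kernel[ky * ksize + kx];
                }
            }
            dst[(size_t)y * cols + x] = acc;
        }
    }
}
```

Every output pixel costs ksize * ksize multiply-adds, each done one value at a time, which is the part a vectorized implementation can speed up.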
For the convolution of a 5x5 kernel with a 1000x1000 image of type float32 (time in ms):
opencv 5.43189048767
scipy.ndimage 36.602973938
A similar performance factor shows up in comparisons with the implementations in other libraries as well, e.g. leptonica and theano.
scikits.image aims to operate without too many explicit dependencies, so bringing a fast convolution algorithm into the package itself has been stated as a highly desirable goal.
OpenCV performs so well because it uses SSE instructions. In convolution, where we apply the same operation to many data items, the performance gains are considerable.
For example, the following statement
__m128 t0 = _mm_loadu_ps(S);
loads 4 float32 values from the S pointer into the 128-bit register t0, and all operations on this register operate on these values in parallel, for example:
s0 = _mm_add_ps(s0, s1);
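The two intrinsics above can be put together in a small self-contained example. This is a sketch under assumptions: the function name add_f32_sse is made up for illustration, and the code needs an x86 target with SSE. It adds two float arrays four lanes at a time, with a scalar loop for any leftover elements.

```c
#include <assert.h>
#include <xmmintrin.h>   /* SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* Element-wise sum of two float arrays (hypothetical helper). */
void add_f32_sse(const float *a, const float *b, float *out, int n)
{
    assert(n >= 0);
    int i = 0;

    /* Vector part: each iteration handles 4 floats in one 128-bit register. */
    for (; i + 4 <= n; i += 4) {
        __m128 s0 = _mm_loadu_ps(a + i);   /* unaligned load of 4 floats */
        __m128 s1 = _mm_loadu_ps(b + i);
        s0 = _mm_add_ps(s0, s1);           /* 4 additions in parallel */
        _mm_storeu_ps(out + i, s0);        /* unaligned store of 4 floats */
    }

    /* Scalar tail for the remaining n % 4 elements. */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

The unaligned variants (_mm_loadu_ps, _mm_storeu_ps) are used so the pointers need not be 16-byte aligned.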
I have implemented an SSE-based float32 convolution routine, and though it is a bit slower than OpenCV, it narrows the performance gap considerably. Each data type needs some additional work, including support for row- and column-separable convolutions. With this we will have a good foundation for a fast convolution implementation.
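To give an idea of what an SSE inner loop for the row pass of a separable filter could look like, here is a minimal sketch. It is not the routine described in this post. The name row_filter_sse is hypothetical, only the "valid" output region is computed, and the kernel is applied in correlation form (not flipped), which is identical to convolution for symmetric kernels.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Horizontal pass over one row of n floats (hypothetical helper):
 *   dst[x] = sum over k of kernel[k] * src[x + k],  x in [0, n - ksize]
 * dst must have room for n - ksize + 1 floats. */
void row_filter_sse(const float *src, float *dst, int n,
                    const float *kernel, int ksize)
{
    assert(ksize >= 1 && n >= ksize);
    int out_n = n - ksize + 1;
    int x = 0;

    /* Vector part: produce 4 neighbouring outputs per iteration. */
    for (; x + 4 <= out_n; x += 4) {
        __m128 acc = _mm_setzero_ps();
        for (int k = 0; k < ksize; k++) {
            __m128 w = _mm_set1_ps(kernel[k]);      /* weight in all 4 lanes */
            __m128 s = _mm_loadu_ps(src + x + k);   /* 4 shifted input values */
            acc = _mm_add_ps(acc, _mm_mul_ps(s, w));
        }
        _mm_storeu_ps(dst + x, acc);
    }

    /* Scalar tail for the last out_n % 4 outputs. */
    for (; x < out_n; x++) {
        float acc = 0.0f;
        for (int k = 0; k < ksize; k++)
            acc += src[x + k] * kernel[k];
        dst[x] = acc;
    }
}
```

A column pass can follow the same pattern, loading 4 adjacent pixels from each of ksize consecutive rows instead of shifting along one row.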
Benchmark of current results for the test case:
scikits.image 11.029958725
opencv 5.04112243652
scipy.ndimage 43.2901382446