c - Vectorization: aligned and unaligned arrays
This question is an attempt to get more insight into loop vectorization, particularly using OpenMP 4. The code given below generates 'size' random samples, then extracts from these samples a piece 'q' of 'qsize' samples starting at position 'qpos'. The program then finds the position of 'q' in the 'samples' array. Here is the code:
    #include <float.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <assert.h>
    #include <mm_malloc.h>

    // SIMD size in floats, assuming 1 float = 4 bytes
    #define VEC_SIZE 8
    #define ALIGN (VEC_SIZE*sizeof(float))

    int main(int argc, char *argv[])
    {
        if (argc!=4)
        {
            printf("Usage: %s <SIZE> <QSIZE> <QPOS>",argv[0]);
            exit(1);
        }
        int size = atoi(argv[1]);
        int qsize = atoi(argv[2]);
        int qpos = atoi(argv[3]);
        assert(qsize < size);
        assert((qpos < size - qsize) && (qpos >= 0));

        float *samples;
        float *q;
        samples = (float *) malloc(size*sizeof(float));
        q = (float *) _mm_malloc(size*sizeof(float),ALIGN);

        // Initialization
        // - Randomly filling samples
        samples[0] = 0.0;
        for (int i = 1 ; i < size; i++) //LOOP1
            samples[i] = samples[i-1] + rand()/((float)RAND_MAX) - 0.5;
        // - Getting q from samples
        #pragma omp simd aligned(q:ALIGN)
        for (int i = 0; i < qsize; i++) //LOOP2
            q[i] = samples[qpos+i];

        // Finding the best match (since q is taken from samples itself,
        // the position of the best match must be qpos)
        float best_dist = FLT_MAX;
        int pos = -1;
        for (int i = 0; i < size - qsize; i++) //LOOP3
        {
            float dist = 0;
            #pragma omp simd aligned(q:ALIGN) reduction(+:dist)
            for (int j = 0; j < qsize; j++) //LOOP4
                dist += (q[j] - samples[i+j]) * (q[j] - samples[i+j]);
            if (dist < best_dist)
            {
                best_dist = dist;
                pos = i;
            }
        }
        assert(pos==qpos);
        printf("Done!\n");
        free(samples);
        _mm_free(q);
    }
I'm compiling it using icc 15.0.0 and gcc 4.9.2 with the following commands:

    icc vec-test.c -o icc-vec-test -std=c11 -qopt-report=3 -qopt-report-phase=vec -qopt-report-file=icc.vec -O3 -xHost -fopenmp
    gcc vec-test.c -o gcc-vec-test -std=c11 -fopt-info-vec-missed-optimized=gcc.vec -O3 -march=native -fopenmp
'q' is aligned using _mm_malloc(). It would not make sense to do the same for 'samples', since either way the inner loop (LOOP4) will always access unaligned elements of it.
Both gcc and icc reported vectorization of LOOP4 (actually, icc manages to autovectorize the loop even if I omit '#pragma omp simd', while gcc refuses to do so, but that's just one additional observation). From the vectorization reports, it seems that neither compiler generated a peeling loop. My questions:
1) How do the compilers handle the fact that 'samples' is not aligned?
2) How much can this affect performance?
3) icc had no problem vectorizing LOOP2. However, gcc cannot: "note: not vectorized: not enough data-refs in basic block". Any ideas?
Thanks!
Here is my experience from running the STREAM package to test sustainable memory bandwidth.
1) The Intel compiler does not generate code that checks alignment, as far as I know. It will use the equivalent of movdqu (an unaligned load) for loading samples and movdqa (an aligned load) for loading q.
2) It depends on the ratio of memory bandwidth to FLOPS available. LOOP4 requires only a tiny amount of computation per element loaded, so I guess the current program on a modern HPC machine is memory-bandwidth bound, given that the sizes of samples and q are large. In that case fixing the alignment will not help much. However, if you limit the number of cores used to fewer than 4, you should be able to observe a speed gain from aligning samples.
3) The compiler does not decide whether to vectorize based on alignment. It will refuse to vectorize when vectorization is not safe due to a data dependency. I have little experience with gcc, so I cannot provide any suggestion on this.
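To make point 1 more concrete, here is a minimal sketch of the two kinds of load, written by hand with SSE intrinsics. It is my own illustration and not the code that icc or gcc actually emits; the names sq_dist4, demo_q and demo_samples are hypothetical. For floats, the counterparts of movdqa and movdqu are movaps and movups, which is what _mm_load_ps and _mm_loadu_ps correspond to. On an AVX machine the compiler would most likely use the 256-bit forms instead. A scalar fallback is included for targets without SSE.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Demo data (hypothetical): both arrays start on a 16-byte boundary,
   so demo_samples + 1 is deliberately misaligned by one float. */
_Alignas(16) float demo_q[4]       = {1.0f, 2.0f, 3.0f, 4.0f};
_Alignas(16) float demo_samples[8] = {0.0f, 1.0f, 2.0f, 3.0f,
                                      5.0f, 0.0f, 0.0f, 0.0f};

/* Squared distance over 4 floats, one "vector iteration" of LOOP4.
   q must be 16-byte aligned; s only needs normal float alignment. */
float sq_dist4(const float *q, const float *s)
{
#if HAVE_SSE
    __m128 vq = _mm_load_ps(q);    /* aligned load, corresponds to movaps   */
    __m128 vs = _mm_loadu_ps(s);   /* unaligned load, corresponds to movups */
    __m128 d  = _mm_sub_ps(vq, vs);
    __m128 sq = _mm_mul_ps(d, d);
    float lanes[4];
    _mm_storeu_ps(lanes, sq);      /* spill lanes for a simple horizontal sum */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
#else
    /* Scalar fallback for targets without SSE */
    float sum = 0.0f;
    for (int k = 0; k < 4; k++)
        sum += (q[k] - s[k]) * (q[k] - s[k]);
    return sum;
#endif
}
```

Calling sq_dist4(demo_q, demo_samples + 1) compares {1, 2, 3, 4} against {1, 2, 3, 5}. Swapping the unaligned load for _mm_load_ps on the misaligned pointer would be a bug that can fault at run time, which is why the compiler has to pick the unaligned form for samples.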
For your information, checking alignment at runtime and providing a specialized routine that uses aligned loads and in-register shifting can beat the compiler-generated code. You can check Intel's L1 BLAS routines to see how this is done.
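As a rough sketch of the runtime-check part only, the code below computes how many leading elements must be handled one by one before a pointer reaches a 32-byte boundary, then runs the remaining loop on the aligned pointer. This is an illustration under my own assumptions, not how Intel's BLAS is written: the names peel_count, sum_squares, peel_buf and the 32-byte ALIGN_BYTES are hypothetical, and it handles a single array. In LOOP4 the two arrays generally have different misalignment, so peeling cannot align both, and that is where the in-register shifting mentioned above comes in (not shown here).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ALIGN_BYTES 32  /* assumed vector width: 8 floats of 4 bytes */

/* Demo buffer (hypothetical), aligned so that peel_buf + 1 is
   misaligned by exactly one float. */
_Alignas(ALIGN_BYTES) float peel_buf[16] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};

/* Number of leading floats to process before p + k is aligned to
   ALIGN_BYTES. Assumes p itself is at least float-aligned. */
size_t peel_count(const float *p)
{
    uintptr_t mis = (uintptr_t)p % ALIGN_BYTES;
    return mis ? (ALIGN_BYTES - mis) / sizeof(float) : 0;
}

/* Sum of squares with a runtime alignment check and a peel loop. */
float sum_squares(const float *p, size_t n)
{
    size_t peel = peel_count(p);
    if (peel > n)
        peel = n;

    float sum = 0.0f;
    for (size_t i = 0; i < peel; i++)   /* scalar peel loop */
        sum += p[i] * p[i];

    size_t rest = n - peel;
    if (rest == 0)
        return sum;                     /* nothing left for the main loop */

    const float *ap = p + peel;         /* aligned from here on */
#ifdef _OPENMP
#pragma omp simd aligned(ap:ALIGN_BYTES) reduction(+:sum)
#endif
    for (size_t i = 0; i < rest; i++)   /* main loop, may use aligned loads */
        sum += ap[i] * ap[i];
    return sum;
}
```

For example, sum_squares(peel_buf + 1, 9) peels 7 elements and runs the main loop on the last 2. The pragma is guarded by _OPENMP so the code also builds without -fopenmp.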