c - Vectorization: aligned and unaligned arrays -

- May 15, 2012

This question is an attempt to get more insight into loop vectorization, particularly using OpenMP 4. The code given below generates 'size' random samples, and from these samples it extracts a piece 'q' of 'qsize' samples starting at position 'qpos'. The program then finds the position of 'q' in the 'samples' array. Here is the code:

    #include <float.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <assert.h>
    #include <mm_malloc.h>

    // SIMD size in floats, assuming 1 float = 4 bytes
    #define VEC_SIZE 8
    #define ALIGN (VEC_SIZE*sizeof(float))

    int main(int argc, char *argv[])
    {
        if (argc != 4)
        {
            printf("Usage: %s <SIZE> <QSIZE> <QPOS>\n", argv[0]);
            exit(1);
        }
        int size = atoi(argv[1]);
        int qsize = atoi(argv[2]);
        int qpos = atoi(argv[3]);
        assert(qsize < size);
        assert((qpos < size - qsize) && (qpos >= 0));

        float *samples;
        float *q;

        samples = (float *) malloc(size*sizeof(float));
        q = (float *) _mm_malloc(size*sizeof(float), ALIGN);

        // initialization
        // - randomly filling samples
        samples[0] = 0.0;
        for (int i = 1; i < size; i++) // loop1
            samples[i] = samples[i-1] + rand()/((float)RAND_MAX) - 0.5;

        // - getting q from samples
    #pragma omp simd aligned(q:ALIGN)
        for (int i = 0; i < qsize; i++) // loop2
            q[i] = samples[qpos+i];

        // finding the best match (since q is taken from samples itself,
        // the position of the best match must be qpos)
        float best_dist = FLT_MAX;
        int pos = -1;
        for (int i = 0; i < size - qsize; i++) // loop3
        {
            float dist = 0;
    #pragma omp simd aligned(q:ALIGN) reduction(+:dist)
            for (int j = 0; j < qsize; j++) // loop4
                dist += (q[j] - samples[i+j]) * (q[j] - samples[i+j]);
            if (dist < best_dist)
            {
                best_dist = dist;
                pos = i;
            }
        }
        assert(pos == qpos);
        printf("done!\n");

        free(samples);
        _mm_free(q);
        return 0;
    }

I'm compiling with icc 15.0.0 and gcc 4.9.2, using the following commands:

    icc vec-test.c -o icc-vec-test -std=c11 -qopt-report=3 -qopt-report-phase=vec -qopt-report-file=icc.vec -O3 -xHost -fopenmp
    gcc vec-test.c -o gcc-vec-test -std=c11 -fopt-info-vec-missed-optimized=gcc.vec -O3 -march=native -fopenmp

'q' is aligned using _mm_malloc(). It would not make sense to do the same for 'samples', since either way the inner loop (loop4) will access unaligned elements of it.
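The point about 'samples' can be checked directly: the window read by loop4 starts at samples + i, and that address moves by one float per outer iteration. The sketch below is not part of the original program; the helper names is_aligned and count_aligned_windows are made up for illustration, and it assumes sizeof(float) is 4. It counts how many window starts land on a given alignment boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if p is a multiple of `alignment` bytes.
   `alignment` must be a power of two. */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Hypothetical helper: count the offsets i in [0, n) for which
   base + i sits on an `alignment`-byte boundary. */
static size_t count_aligned_windows(const float *base, size_t n,
                                    size_t alignment)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (is_aligned(base + i, alignment))
            count++;
    return count;
}
```

With 32-byte alignment and 4-byte floats, only one window start in every eight is aligned, no matter how 'samples' itself was allocated.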

Both gcc and icc reported vectorization of loop4 (actually, icc manages to autovectorize the loop even if I omit '#pragma omp simd', while gcc refuses to, but that's just one observation). From the vectorization reports, it seems that neither compiler generated a peeling loop. My questions:

1) How do the compilers handle the fact that 'samples' is not aligned?

2) How much can this affect performance?

3) icc had no problem vectorizing loop2. gcc cannot: "note: not vectorized: not enough data-refs in basic block". Any ideas?

Thanks!

Here is my experience from running the STREAM package to test sustainable memory bandwidth.

1) As far as I know, the Intel compiler does not generate code that checks alignment; it will use the equivalent of movdqu for loading 'samples' and movdqa for loading 'q'.
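As a rough illustration of that aligned/unaligned split, here is a hand-written version of loop4 using SSE intrinsics. It is only a sketch under stated assumptions, not what icc or gcc actually emit: the function name sq_dist is made up, 'q' is assumed to be 16-byte aligned, and 4-wide SSE is used instead of the 8-wide vectors from the question so that it builds without extra flags. For floats, the instructions involved are movaps/movups rather than movdqa/movdqu.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Squared Euclidean distance between q[0..n) and s[0..n).
   Assumption: q is 16-byte aligned; s only needs float alignment. */
static float sq_dist(const float *q, const float *s, size_t n)
{
    assert(((uintptr_t)q & 15u) == 0);
    float dist = 0.0f;
    size_t j = 0;
#if defined(__SSE__)
    __m128 acc = _mm_setzero_ps();
    for (; j + 4 <= n; j += 4) {
        __m128 vq = _mm_load_ps(q + j);   /* aligned load, typically movaps */
        __m128 vs = _mm_loadu_ps(s + j);  /* unaligned load, typically movups */
        __m128 d  = _mm_sub_ps(vq, vs);
        acc = _mm_add_ps(acc, _mm_mul_ps(d, d));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);            /* horizontal sum of the 4 lanes */
    dist = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    for (; j < n; j++) {  /* scalar remainder (whole loop if SSE is absent) */
        float d = q[j] - s[j];
        dist += d * d;
    }
    return dist;
}
```

Passing samples + i as 's' works for any i, because only the 'q' side relies on alignment.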

2) It depends on the ratio of memory bandwidth to FLOPS available. Loop 4 requires only a tiny amount of computation, so my guess is that on a modern HPC machine the current program is memory-bandwidth bound, given that 'samples' and 'q' are large, and fixing the alignment will not help much. However, if you limit the number of cores used to fewer than 4, you should be able to observe a speed gain from aligning 'samples'.

3) The compiler does not decide whether to vectorize based on alignment; it refuses to vectorize when doing so is not safe because of a data dependency. I have little experience with gcc, so I cannot offer a suggestion for this one.

For your information, checking the alignment at runtime and providing a specialized routine that uses aligned loads and in-register shifting can beat compiler-generated code. You can look at Intel's L1 BLAS routines to see how this is done.
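The runtime check can be sketched in plain C as a peel loop followed by an aligned main loop and a remainder loop. This is only an illustration of the structure, not Intel's BLAS code and not measured against compiler output. The names peel_count, sq_dist_peeled and VEC_BYTES are made up, sizeof(float) is assumed to be 4, and the in-register shifting needed to realign the second operand is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 32  /* assumed vector width in bytes (8 floats) */

/* Hypothetical helper: number of leading floats to process one by one
   before p + k reaches a VEC_BYTES boundary. Assumes p is float-aligned. */
static size_t peel_count(const float *p)
{
    size_t mis = (size_t)((uintptr_t)p % VEC_BYTES);
    return mis ? (VEC_BYTES - mis) / sizeof(float) : 0;
}

/* Squared distance with the accesses to s aligned in the main loop. */
static float sq_dist_peeled(const float *q, const float *s, size_t n)
{
    const size_t step = VEC_BYTES / sizeof(float);
    float dist = 0.0f;
    size_t j = 0;
    size_t peel = peel_count(s);
    if (peel > n)
        peel = n;

    /* 1) peel loop: scalar iterations until s + j is aligned */
    for (; j < peel; j++) {
        float d = q[j] - s[j];
        dist += d * d;
    }
    /* 2) main loop: each chunk starts at an aligned s + j, so a vector
          version could use aligned loads for s (q + j may be unaligned) */
    for (; j + step <= n; j += step) {
        assert((uintptr_t)(s + j) % VEC_BYTES == 0);
        for (size_t k = 0; k < step; k++) {
            float d = q[j + k] - s[j + k];
            dist += d * d;
        }
    }
    /* 3) remainder loop: leftover elements after the last full chunk */
    for (; j < n; j++) {
        float d = q[j] - s[j];
        dist += d * d;
    }
    return dist;
}
```

In the program from the question, 's' would be samples + i, so the peel count changes on every outer iteration while 'q' keeps a fixed alignment. That is why a single peel loop cannot align both operands at once.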
