python - Matrix row difference, output a boolean vector


I have an m x 3 matrix a and its row subset b (n x 3). Both are sets of indices into another, large 4D matrix; their data type is dtype('int64'). I would like to generate a boolean vector x, where x[i] = True if b does not contain row a[i,:].

There are no duplicate rows in either a or b.

I was wondering if there is an efficient way to do this in NumPy? I found an answer that's related: https://stackoverflow.com/a/11903368/265289; however, it returns the actual rows (not a boolean vector).

You could follow the same pattern shown in jterrace's answer, except use np.in1d instead of np.setdiff1d:

    import numpy as np
    np.random.seed(2015)

    m, n = 10, 5
    a = np.random.randint(10, size=(m,3))
    b = a[np.random.choice(m, n, replace=False)]
    print(a)
    # [[2 2 9]
    #  [6 8 5]
    #  [7 8 0]
    #  [6 7 8]
    #  [3 8 6]
    #  [9 2 3]
    #  [1 2 6]
    #  [2 9 8]
    #  [5 8 4]
    #  [8 9 1]]

    print(b)
    # [[2 2 9]
    #  [1 2 6]
    #  [2 9 8]
    #  [3 8 6]
    #  [9 2 3]]

    def using_view(a, b, assume_unique=False):
        # View each row as one structured element so rows can be compared whole
        ad = np.ascontiguousarray(a).view([('', a.dtype)] * a.shape[1])
        bd = np.ascontiguousarray(b).view([('', b.dtype)] * b.shape[1])
        return ~np.in1d(ad, bd, assume_unique=assume_unique)

    print(using_view(a, b, assume_unique=True))

yields

    [False  True  True  True False False False False  True  True]

You can use assume_unique=True (which can speed up the calculation) since there are no duplicate rows in a or b.
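As an independent cross-check of the mask above, a slow but transparent sketch is to put the rows of b into a Python set of tuples and test each row of a for membership. The helper name using_set is hypothetical (it is not part of the answer), and the arrays are copied from the printed example output.

```python
import numpy as np

# Arrays copied from the printed example output.
a = np.array([[2, 2, 9], [6, 8, 5], [7, 8, 0], [6, 7, 8], [3, 8, 6],
              [9, 2, 3], [1, 2, 6], [2, 9, 8], [5, 8, 4], [8, 9, 1]])
b = np.array([[2, 2, 9], [1, 2, 6], [2, 9, 8], [3, 8, 6], [9, 2, 3]])

def using_set(a, b):
    """Hypothetical reference helper: True where the row of a is absent from b."""
    b_rows = set(map(tuple, b.tolist()))
    return np.array([tuple(row) not in b_rows for row in a.tolist()], dtype=bool)

mask = using_set(a, b)
print(mask)
```

This runs a Python-level loop over the rows of a, so it is only meant for verifying results on small inputs, not as a replacement for the vectorized version.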


Beware that a.view(...) will raise

    ValueError: new type not compatible with array.

if a.flags['C_CONTIGUOUS'] is False (i.e. if a is not a C-contiguous array). Therefore, in general we need to use np.ascontiguousarray(a) before calling view.
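A minimal sketch of the failure and the fix. The exact contiguity check (and the error text) may differ between NumPy versions, so this example slices columns to make the last axis itself strided, which should be rejected either way. The names big and row_dtype are illustrative only.

```python
import numpy as np

big = np.arange(60, dtype=np.int64).reshape(10, 6)
a = big[:, ::2]  # shape (10, 3), strided along the last axis
row_dtype = [('', a.dtype)] * a.shape[1]

print(a.flags['C_CONTIGUOUS'])

try:
    a.view(row_dtype)
    view_failed = False
except ValueError as exc:
    view_failed = True
    print("view failed:", exc)

# Copy into a C-contiguous buffer first; then each row becomes one element.
ad = np.ascontiguousarray(a).view(row_dtype)
print(ad.shape)
```

np.ascontiguousarray only copies when it has to, so calling it on an array that is already C-contiguous costs next to nothing.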


As B.M. suggests, you could instead view each row using the "void" dtype:

    def using_void(a, b):
        # One opaque void element per row: itemsize * number of columns bytes
        dtype = 'V{}'.format(a.dtype.itemsize * a.shape[-1])
        ad = np.ascontiguousarray(a).view(dtype)
        bd = np.ascontiguousarray(b).view(dtype)
        return ~np.in1d(ad, bd, assume_unique=True)

This is safe to use with integer dtypes. However, note that

    In [342]: np.array([-0.], dtype='float64').view('V8') == np.array([0.], dtype='float64').view('V8')
    Out[342]: array([False], dtype=bool)

so using np.in1d after viewing as void may return incorrect results for arrays with a float dtype, because values that compare equal (such as -0. and 0.) can have different byte representations.
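If the rows hold floats, one way to sidestep the byte-level comparison is to compare values directly with broadcasting, at the cost of an m x n x k boolean temporary. This is an alternative sketch rather than one of the methods from the answer, and the name using_broadcast is hypothetical.

```python
import numpy as np

def using_broadcast(a, b):
    """Hypothetical helper: True where the row of a has no equal row in b."""
    # (m, 1, k) == (1, n, k) -> (m, n, k); rows match when all k columns agree.
    matches = (a[:, None, :] == b[None, :, :]).all(axis=-1)
    return ~matches.any(axis=1)

a = np.array([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]])
b = np.array([[-0.0, 1.0, 2.0]])
mask = using_broadcast(a, b)
print(mask)
```

Here the first row of a is treated as present in b even though one side stores -0.0, since the comparison is by value. NaN never compares equal to itself, so rows containing NaN are always reported as absent.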


Here is a benchmark of the proposed methods:

    import numpy as np
    np.random.seed(2015)

    m, n = 10000, 5000
    # Note a may contain duplicate rows,
    # so don't use assume_unique=True for these benchmarks.
    # In this case, assume_unique=True would not improve the speed much anyway.
    a = np.random.randint(10, size=(2*m,3))
    # make a not C_CONTIGUOUS; the view methods fail for non-contiguous arrays
    a = a[::2]
    b = a[np.random.choice(m, n, replace=False)]

    def using_view(a, b, assume_unique=False):
        ad = np.ascontiguousarray(a).view([('', a.dtype)] * a.shape[1])
        bd = np.ascontiguousarray(b).view([('', b.dtype)] * b.shape[1])
        return ~np.in1d(ad, bd, assume_unique=assume_unique)

    from scipy.spatial import distance
    def using_distance(a, b):
        # A zero distance means the two rows are identical
        return ~np.any(distance.cdist(a,b)==0,1)

    from functools import reduce

    def using_loop(a, b):
        # Compare column by column, then AND the (m, n) results together
        pred = lambda i: a[:, i:i+1] == b[:, i]
        return ~reduce(np.logical_and, map(pred, range(a.shape[1]))).any(axis=1)

    # Private pandas helpers; their import path may differ between pandas versions
    from pandas.core.groupby import get_group_index, _int64_overflow_possible
    from functools import partial
    def using_pandas(a, b):
        shape = [1 + max(a[:, i].max(), b[:, i].max()) for i in range(a.shape[1])]
        assert not _int64_overflow_possible(shape)

        encode = partial(get_group_index, shape=shape, sort=False, xnull=False)
        a1, b1 = map(encode, (a.T, b.T))
        return ~np.in1d(a1, b1)

    def using_void(a, b):
        dtype = 'V{}'.format(a.dtype.itemsize * a.shape[-1])
        ad = np.ascontiguousarray(a).view(dtype)
        bd = np.ascontiguousarray(b).view(dtype)
        return ~np.in1d(ad, bd)

    # Sanity check: make sure all the functions return the same result
    for func in (using_distance, using_loop, using_pandas, using_void):
        assert (func(a, b) == using_view(a, b)).all()

    In [384]: %timeit using_pandas(a, b)
    100 loops, best of 3: 1.99 ms per loop

    In [381]: %timeit using_void(a, b)
    100 loops, best of 3: 6.72 ms per loop

    In [378]: %timeit using_view(a, b)
    10 loops, best of 3: 35.6 ms per loop

    In [383]: %timeit using_loop(a, b)
    1 loops, best of 3: 342 ms per loop

    In [379]: %timeit using_distance(a, b)
    1 loops, best of 3: 502 ms per loop
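The pandas variant appears to get its speed from encoding every row as a single integer before the membership test. A rough sketch of the same idea using only public NumPy functions follows, assuming non-negative integer entries whose combined range fits in an int64. The name using_ravel is hypothetical, and this version was not part of the timings above.

```python
import numpy as np

def using_ravel(a, b):
    """Hypothetical helper: encode each row as one flat index, then test membership."""
    # Per-column upper bounds so that every row maps to a unique flat index.
    dims = tuple(int(d) for d in np.maximum(a.max(axis=0), b.max(axis=0)) + 1)
    a1 = np.ravel_multi_index(tuple(a.T), dims)
    b1 = np.ravel_multi_index(tuple(b.T), dims)
    return ~np.isin(a1, b1)

rng = np.random.RandomState(2015)
a = rng.randint(10, size=(20, 3))
b = a[rng.choice(20, 8, replace=False)]
mask = using_ravel(a, b)

# Cross-check against a plain set-of-tuples membership test.
b_rows = set(map(tuple, b.tolist()))
expected = np.array([tuple(row) not in b_rows for row in a.tolist()])
print((mask == expected).all())
```

Since a and b are index arrays into another matrix, the non-negativity assumption should normally hold, but it is worth checking before relying on this encoding.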
