Python - Matrix row difference, output a boolean vector
I have an m x 3 matrix a and its row subset b (n x 3). Both are sets of indices into another, large 4D matrix; their data type is dtype('int64'). I would like to generate a boolean vector x, where x[i] = True if b does not contain row a[i,:]. There are no duplicate rows in either a or b.
I was wondering if there's an efficient way to do this in NumPy? I found an answer that's related: https://stackoverflow.com/a/11903368/265289; however, it returns the actual rows (not a boolean vector).
You could follow the same pattern shown in jterrace's answer, except use np.in1d instead of np.setdiff1d:
    import numpy as np
    np.random.seed(2015)

    m, n = 10, 5
    a = np.random.randint(10, size=(m, 3))
    b = a[np.random.choice(m, n, replace=False)]
    print(a)
    # [[2 2 9]
    #  [6 8 5]
    #  [7 8 0]
    #  [6 7 8]
    #  [3 8 6]
    #  [9 2 3]
    #  [1 2 6]
    #  [2 9 8]
    #  [5 8 4]
    #  [8 9 1]]
    print(b)
    # [[2 2 9]
    #  [1 2 6]
    #  [2 9 8]
    #  [3 8 6]
    #  [9 2 3]]

    def using_view(a, b, assume_unique=False):
        # view each row as one structured element, so whole rows are compared
        ad = np.ascontiguousarray(a).view([('', a.dtype)] * a.shape[1])
        bd = np.ascontiguousarray(b).view([('', b.dtype)] * b.shape[1])
        return ~np.in1d(ad, bd, assume_unique=assume_unique)

    print(using_view(a, b, assume_unique=True))
yields
    [False  True  True  True False False False False  True  True]
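For small inputs, the result above can be cross-checked against a plain-Python reference. This is only a sketch for verification, not one of the methods proposed in the answer; `using_set` is a hypothetical name, and the arrays are copied from the printed output above.

```python
import numpy as np

# the example arrays, copied from the printed output above
a = np.array([[2, 2, 9], [6, 8, 5], [7, 8, 0], [6, 7, 8], [3, 8, 6],
              [9, 2, 3], [1, 2, 6], [2, 9, 8], [5, 8, 4], [8, 9, 1]])
b = np.array([[2, 2, 9], [1, 2, 6], [2, 9, 8], [3, 8, 6], [9, 2, 3]])

def using_set(a, b):
    # hypothetical helper: hash every row of b as a tuple,
    # then test each row of a for membership
    b_rows = set(map(tuple, b.tolist()))
    return np.array([tuple(row) not in b_rows for row in a.tolist()], dtype=bool)

x = using_set(a, b)
print(x)
# [False  True  True  True False False False False  True  True]
```

It loops in Python, so it is likely to be much slower than the vectorized versions on large arrays, but it makes the intended meaning of x[i] explicit.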
You can use assume_unique=True (which can speed up the calculation) since there are no duplicate rows in a or b.
Beware that a.view(...) will raise ValueError: new type not compatible with array. if a.flags['C_CONTIGUOUS'] is False (i.e. if a is not a C-contiguous array). Therefore, in general we need to use np.ascontiguousarray(a) before calling view.
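The contiguity caveat can be reproduced with a small sketch. The array below is an assumption chosen for illustration (every other column of a 4 x 6 array, so the last axis is not contiguous), and the exact error message may differ between NumPy versions.

```python
import numpy as np

base = np.arange(24, dtype=np.int64).reshape(4, 6)
a = base[:, ::2]  # every other column: shape (4, 3), not C-contiguous
print(a.flags['C_CONTIGUOUS'])  # False

row_dtype = [('', a.dtype)] * a.shape[1]

try:
    a.view(row_dtype)
    view_failed = False
except ValueError as exc:
    # the row view cannot be taken directly on the strided array
    view_failed = True
    print('ValueError:', exc)

# copying into a C-contiguous buffer first makes the row view possible
ad = np.ascontiguousarray(a).view(row_dtype)
print(ad.shape)  # (4, 1): one structured element per row
```

Note that np.ascontiguousarray returns its input unchanged when the array is already C-contiguous, so calling it unconditionally only costs a copy when one is needed.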
As B.M. suggests, you could instead view each row using the "void" dtype:
    def using_void(a, b):
        # one void element per row: itemsize * number of columns bytes
        dtype = 'V{}'.format(a.dtype.itemsize * a.shape[-1])
        ad = np.ascontiguousarray(a).view(dtype)
        bd = np.ascontiguousarray(b).view(dtype)
        return ~np.in1d(ad, bd, assume_unique=True)
This is safe to use with integer dtypes. However, note that
    In [342]: np.array([-0.], dtype='float64').view('V8') == np.array([0.], dtype='float64').view('V8')
    Out[342]: array([False], dtype=bool)
so using np.in1d after viewing as void may return incorrect results for arrays with float dtype, because the comparison is made on the raw bytes rather than on the values.
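For float arrays, one way to sidestep the byte-level comparison is to compare values elementwise with broadcasting, so that -0. and 0. are treated as equal. This is a minimal sketch and not one of the methods benchmarked here; `using_broadcast` is a hypothetical name, and the intermediate boolean array has m x n x 3 elements, so it only suits inputs of moderate size.

```python
import numpy as np

def using_broadcast(a, b):
    # eq[i, j, k] is True where a[i, k] == b[j, k];
    # this is a value comparison, so -0. == 0. holds
    eq = a[:, None, :] == b[None, :, :]
    # row i of a is in b if some row j of b agrees in every column
    row_in_b = eq.all(axis=-1).any(axis=1)
    return ~row_in_b

a = np.array([[0., 1., 2.], [3., 4., 5.]])
b = np.array([[-0., 1., 2.]])
x = using_broadcast(a, b)
print(x)
# [False  True]
```

Rows containing NaN never match under this comparison, since NaN is not equal to itself.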
Here is a benchmark of some of the proposed methods:
    import numpy as np
    np.random.seed(2015)

    m, n = 10000, 5000
    # Note a may contain duplicate rows,
    # so don't use assume_unique=True for these benchmarks.
    # In this case, using assume_unique=False does not improve the speed anyway.
    a = np.random.randint(10, size=(2*m, 3))
    # make a not C_CONTIGUOUS; the view methods fail for non-contiguous arrays
    a = a[::2]
    b = a[np.random.choice(m, n, replace=False)]

    def using_view(a, b, assume_unique=False):
        ad = np.ascontiguousarray(a).view([('', a.dtype)] * a.shape[1])
        bd = np.ascontiguousarray(b).view([('', b.dtype)] * b.shape[1])
        return ~np.in1d(ad, bd, assume_unique=assume_unique)

    from scipy.spatial import distance
    def using_distance(a, b):
        # a row of a is in b if its distance to some row of b is zero
        return ~np.any(distance.cdist(a, b) == 0, 1)

    from functools import reduce
    def using_loop(a, b):
        # compare one column at a time, then AND the per-column matches
        pred = lambda i: a[:, i:i+1] == b[:, i]
        return ~reduce(np.logical_and, map(pred, range(a.shape[1]))).any(axis=1)

    # private pandas helpers; this import path is the one used in the
    # original answer and may not exist in other pandas versions
    from pandas.core.groupby import get_group_index, _int64_overflow_possible
    from functools import partial
    def using_pandas(a, b):
        shape = [1 + max(a[:, i].max(), b[:, i].max()) for i in range(a.shape[1])]
        assert not _int64_overflow_possible(shape)
        # encode each row as a single int64 key
        encode = partial(get_group_index, shape=shape, sort=False, xnull=False)
        a1, b1 = map(encode, (a.T, b.T))
        return ~np.in1d(a1, b1)

    def using_void(a, b):
        dtype = 'V{}'.format(a.dtype.itemsize * a.shape[-1])
        ad = np.ascontiguousarray(a).view(dtype)
        bd = np.ascontiguousarray(b).view(dtype)
        return ~np.in1d(ad, bd)

    # sanity check: make sure all functions return the same result
    for func in (using_distance, using_loop, using_pandas, using_void):
        assert (func(a, b) == using_view(a, b)).all()
    In [384]: %timeit using_pandas(a, b)
    100 loops, best of 3: 1.99 ms per loop

    In [381]: %timeit using_void(a, b)
    100 loops, best of 3: 6.72 ms per loop

    In [378]: %timeit using_view(a, b)
    10 loops, best of 3: 35.6 ms per loop

    In [383]: %timeit using_loop(a, b)
    1 loops, best of 3: 342 ms per loop

    In [379]: %timeit using_distance(a, b)
    1 loops, best of 3: 502 ms per loop
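The idea behind using_pandas is to encode each row as a single integer key and then test membership on the keys. Since it depends on private pandas helpers, the same idea can be sketched with public NumPy functions only. This is an illustration under stated assumptions, not the benchmarked implementation: `using_ravel` is a hypothetical name, the inputs are assumed to be non-empty integer arrays, and the product of the per-column value ranges is assumed to be small enough for np.ravel_multi_index to accept. It has not been timed here.

```python
import numpy as np

def using_ravel(a, b):
    # per-column value range covering both arrays
    lo = np.minimum(a.min(axis=0), b.min(axis=0))
    hi = np.maximum(a.max(axis=0), b.max(axis=0))
    dims = tuple((hi - lo + 1).tolist())
    # shift columns to start at 0, then map each row to one flat index
    a1 = np.ravel_multi_index(tuple((a - lo).T), dims)
    b1 = np.ravel_multi_index(tuple((b - lo).T), dims)
    # equal rows get equal keys, so membership on keys is membership on rows
    return ~np.isin(a1, b1)

a = np.array([[2, 2, 9], [6, 8, 5], [7, 8, 0], [6, 7, 8], [3, 8, 6],
              [9, 2, 3], [1, 2, 6], [2, 9, 8], [5, 8, 4], [8, 9, 1]])
b = np.array([[2, 2, 9], [1, 2, 6], [2, 9, 8], [3, 8, 6], [9, 2, 3]])
x = using_ravel(a, b)
print(x)
# [False  True  True  True False False False False  True  True]
```

Subtracting the column minimum first lets the sketch handle negative indices as well, which np.ravel_multi_index would otherwise reject.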