I have two large NumPy arrays for which I want to calculate the Euclidean distance using sklearn. The following MRE produces the result I want, but since my real-life data is large, I really want a vectorized solution rather than a for loop.
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

n = 3
sample_size = 5
X = np.random.randint(0, 10, size=(sample_size, n))
Y = np.random.randint(0, 10, size=(sample_size, n))

lst = []
for f in range(0, sample_size):
    ed = euclidean_distances([X[f]], [Y[f]])
    lst.append(ed[0][0])
print(lst)
Best Answer
euclidean_distances computes the distance for every combination of X and Y points (a sample_size × sample_size matrix), so it grows large in memory and is unnecessary if you just want the distance between each pair of corresponding rows. Sklearn includes a different function, paired_distances, that does what you want:
from sklearn.metrics.pairwise import paired_distances

d = paired_distances(X, Y)
# array([5.83095189, 9.94987437, 7.34846923, 5.47722558, 4. ])
If you need the full pairwise distances, you can get the same result from the diagonal (as pointed out in the comments):
d = euclidean_distances(X,Y).diagonal()
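To make the memory difference concrete, here is a minimal sketch comparing the two sklearn calls. The seeded generator and the variable names full and paired are only for illustration and are not part of the original question.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances, paired_distances

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
sample_size, n = 5, 3
X = rng.integers(0, 10, size=(sample_size, n))
Y = rng.integers(0, 10, size=(sample_size, n))

full = euclidean_distances(X, Y)  # every row of X against every row of Y
paired = paired_distances(X, Y)   # row i of X against row i of Y only

print(full.shape)    # (5, 5)
print(paired.shape)  # (5,)

# The diagonal of the full matrix holds the row-by-row distances.
print(np.allclose(full.diagonal(), paired))
```

The full matrix has sample_size squared entries, so for large inputs the diagonal trick is only worth it if you need the other entries anyway.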
Lastly: arrays are a NumPy type, so it is useful to know the NumPy API itself (probably close to what sklearn calls under the hood). Here are two examples:
d = np.linalg.norm(X-Y, axis=1)
d = np.sqrt(np.sum((X-Y)**2, axis=1))
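As a quick sanity check, the sketch below compares both NumPy expressions against a plain Python loop. The names looped, by_norm and by_sum, the seed, and the use of math.dist as a reference are assumptions made for this example.

```python
import math
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
X = rng.integers(0, 10, size=(5, 3))
Y = rng.integers(0, 10, size=(5, 3))

# Reference: one distance per pair of corresponding rows, in a Python loop.
looped = [math.dist(x, y) for x, y in zip(X.tolist(), Y.tolist())]

# Vectorized equivalents: reduce over axis=1 (the feature axis).
by_norm = np.linalg.norm(X - Y, axis=1)
by_sum = np.sqrt(np.sum((X - Y) ** 2, axis=1))

print(np.allclose(looped, by_norm), np.allclose(looped, by_sum))
```

Both vectorized forms allocate only the (sample_size, n) difference array and a length-sample_size result, never the full pairwise matrix.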