今天和同事聊天了解到开源工具Mahout,已经提供了all-pairs-similarity的计算功能,有兴趣的同学可以去阅读源码,简介如下(RowSimilarityJob:
https://cwiki.apache.org/confluence/display/MAHOUT/RowSimilarityJob):
The goal is to compute all pairwise similarities between the rows of a
sparse matrix A.
The computation should be executed in a way that only rows that have at
least one non-zero value in the same dimension (column) are compared. We
need this to avoid a quadratic number of pairwise comparisons.
Furthermore we should be able to 'embed' arbitrary similarity measures
and we should always be able to use a combiner in all MapReduce steps.
The computation is executed using three MapReduce passes:
In the first step, the rows of A are preprocessed via
VectorSimilarityMeasure.normalize() (they could e.g. be binarized or
scaled to unit-length), a single number for each row of A is computed
via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed.
The second steps operates on the rows of A' (the columns of A). The
mapper sums up all pairwise cooccurrences using
VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so
called 'stripes' pattern). The reducers sums up all cooccurrence vectors
for one row and uses the similarity measure and the precomputed numbers
from step one to compute all similarities via
VectorSimilarityMeasure.similarity().
The third step ensures that only the top k similar rows per row are kept.
It's hard to see from the code but actually the job performs the matrix
multiplication AA' via outer products with a modified (similarity
measure specific) dot product.