两个(具有不同单词的)文档的交集(intersection)中元素的个数除以并集(union)中元素的个数,就是这两个文档的相似度。例如,{1, 5, 3} 和 {1, 7, 2, 3} 的相似度是 0.4,其中,交集的元素有 2 个,并集的元素有 5 个。给定一系列的长篇文档,每个文档元素各不相同,并与一个 ID 相关联。它们的相似度非常“稀疏”,也就是说任选 2 个文档,相似度都很接近 0。请设计一个算法返回每对文档的 ID 及其相似度。只需输出相似度大于 0 的组合。请忽略空文档。为简单起见,可以假定每个文档由一个含有不同整数的数组表示。

输入为一个二维数组 docs,docs[i] 表示 id 为 i 的文档。返回一个数组,其中每个元素是一个字符串,代表每对相似度大于 0 的文档,其格式为 {id1},{id2}: {similarity},其中 id1 为两个文档中较小的 id,similarity 为相似度,精确到小数点后 4 位。以任意顺序返回数组均可。

示例:

输入:
[
  [14, 15, 100, 9, 3],
  [32, 1, 9, 3, 5],
  [15, 29, 2, 6, 8, 7],
  [7, 10]
]
输出:
[
  "0,1: 0.2500",
  "0,2: 0.1000",
  "2,3: 0.1429"
]

提示:

  • docs.length <= 500
  • docs[i].length <= 500

Python 解答:
1.暴力

class Solution:
    def computeSimilarities(self, docs: List[List[int]]) -> List[str]:
        def transform(num):
            if (num*100000)%10 >= 5:
                num = int(num*10000)+1
            else:
                num = int(num*10000)
            count0 = 0
            p = num
            while p%10 == 0:
                count0 += 1
                if count0 >= 4:
                    break
                p //= 10
            string = str(num/10000)+count0*'0'
            return string
        res = []
        for i in range(len(docs)):
            for j in range(i+1, len(docs)):
                if not docs[i] or not docs[j]:
                    continue
                set1 = set(docs[i])
                set2 = set(docs[j])
                first = len(set1 & set2)
                if first:
                    temp = first/len(set1 | set2)
                    print(format(temp, '.5f'))
                    res.append(str(i)+","+str(j)+": "+transform(temp))
        return res

2.倒排索引

class Solution:
    def computeSimilarities(self, docs: List[List[int]]) -> List[str]:
        dics = {}
        counts = [[0 for i in range(len(docs))] for j in range(len(docs))]
        for i in range(len(docs)):
            for j in range(len(docs[i])):
                if docs[i][j] not in dics.keys():
                    dics[docs[i][j]] = [i]
                else:
                    for index in dics[docs[i][j]]:
                        counts[index][i] += 1
                    dics[docs[i][j]].append(i)
        res = []
        err = 1e-9
        for i in range(len(docs)):
            for j in range(i+1, len(docs)):
                if counts[i][j] > 0:
                    res.append('{0:d},{1:d}: {2:.4f}'.format(i, j, counts[i][j]/(len(docs[i])+len(docs[j])-counts[i][j])+err))
        return res

简化一下:

class Solution:
    def computeSimilarities(self, docs: List[List[int]]) -> List[str]:
        dics = {}
        res = []
        err = 1e-9
        length = len(docs)*(len(docs)-1)//2
        counts = [0 for i in range(length)]
        for i in range(len(docs)):
            for j in range(len(docs[i])):
                if docs[i][j] not in dics.keys():
                    dics[docs[i][j]] = [i]
                else:
                    for index in dics[docs[i][j]]:
                        counts[(2*len(docs)-1-index)*index//2+i-index-1] += 1
                    dics[docs[i][j]].append(i)

            for k in range(i):
                union = (2*len(docs)-1-k)*k//2+i-k-1
                if counts[union] > 0:
                    res.append('{0:d},{1:d}: {2:.4f}'.format(k, i, counts[union]/(len(docs[k])+len(docs[i])-counts[union])+err))
        return res
最后修改日期: 2021年5月27日

留言

撰写回覆或留言

发布留言必须填写的电子邮件地址不会公开。