HCY Blog

Jaccard Similarity

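Jaccard similarity measures how alike two sets are by taking the ratio of the size of their intersection to the size of their union. This post introduces the formula and how to compute it, then shows how it applies to text processing, document deduplication, and recommender systems. By the end, readers should be able to use this simple, intuitive measure to work with set data and assess similarity quickly.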

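Jaccard similarity is a measure of how similar two sets are: the size of their intersection divided by the size of their union. It ranges from 0 (no shared elements) to 1 (identical sets).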

Basic Concepts

Formula

J(A,B) = |A ∩ B| / |A ∪ B|

where:

  • A ∩ B is the intersection of sets A and B (the elements they share)
  • A ∪ B is the union of sets A and B (all distinct elements from either set)

Calculation

Example 1: Simple sets

A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

Intersection = A ∩ B = {3, 4}        # size = 2
Union = A ∪ B = {1, 2, 3, 4, 5, 6}   # size = 6

Jaccard similarity = 2 / 6 ≈ 0.333
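
The arithmetic in Example 1 can be checked with Python's built-in set operators. This is only a minimal sketch: the variable names are chosen here for illustration, and `Fraction` is used just to show the exact ratio before rounding.

```python
from fractions import Fraction

A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

intersection = A & B   # elements present in both sets: {3, 4}
union = A | B          # elements present in either set: {1, 2, 3, 4, 5, 6}

# Exact ratio |A ∩ B| / |A ∪ B|
exact = Fraction(len(intersection), len(union))
print(exact)                   # 1/3
print(round(float(exact), 3))  # 0.333
```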

Example 2: Application in text processing

def jaccard_similarity(set1, set2):
    """Compute the Jaccard similarity of two sets."""
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    # Two empty sets have an empty union; return 0 to avoid dividing by zero
    return intersection / union if union != 0 else 0

# Test
set1 = {"apple", "banana", "orange"}
set2 = {"banana", "orange", "grape"}

similarity = jaccard_similarity(set1, set2)
print(f"Jaccard similarity: {similarity:.3f}")  # Output: 0.500

Application to Text Similarity

1. Word-based Jaccard similarity

def jaccard_text_similarity(text1, text2):
    """Word-based Jaccard similarity between two texts."""
    # Lowercase and split on whitespace to get each text's word set
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())

    return jaccard_similarity(words1, words2)

# Test
text1 = "I love machine learning"
text2 = "I love deep learning"

similarity = jaccard_text_similarity(text1, text2)
print(f"Text similarity: {similarity:.3f}")  # Output: 0.600

2. Character n-gram Jaccard similarity

def get_ngrams(text, n=2):
    """Return the set of character n-grams in the text."""
    return {text[i:i+n] for i in range(len(text) - n + 1)}

def jaccard_ngram_similarity(text1, text2, n=2):
    """Jaccard similarity based on character n-grams."""
    ngrams1 = get_ngrams(text1.lower(), n)
    ngrams2 = get_ngrams(text2.lower(), n)

    return jaccard_similarity(ngrams1, ngrams2)

# Test
text1 = "hello"
text2 = "hallo"

similarity_2gram = jaccard_ngram_similarity(text1, text2, 2)
similarity_3gram = jaccard_ngram_similarity(text1, text2, 3)

# 2-grams: {he, el, ll, lo} vs {ha, al, ll, lo} -> 2 shared out of 6
print(f"2-gram similarity: {similarity_2gram:.3f}")  # Output: 0.333
# 3-grams: {hel, ell, llo} vs {hal, all, llo} -> 1 shared out of 5
print(f"3-gram similarity: {similarity_3gram:.3f}")  # Output: 0.200

Practical Examples

Example 1: Document deduplication

def find_duplicate_documents(documents, threshold=0.8):
    """Find near-duplicate documents using Jaccard similarity."""
    duplicates = []

    # Convert each document to a set of lowercase words
    doc_sets = [set(doc.lower().split()) for doc in documents]

    # Compare every pair of documents once
    for i in range(len(documents)):
        for j in range(i+1, len(documents)):
            similarity = jaccard_similarity(doc_sets[i], doc_sets[j])
            if similarity >= threshold:
                duplicates.append((i, j, similarity))

    return duplicates

# Test
docs = [
    "machine learning is amazing",
    "deep learning is a subset of machine learning",
    "machine learning algorithms are amazing",
    "python programming is fun"
]

duplicates = find_duplicate_documents(docs, 0.5)
for i, j, sim in duplicates:
    print(f"Documents {i} and {j} similarity: {sim:.3f}")
# Output: Documents 0 and 2 similarity: 0.500

Example 2: Recommender system

def jaccard_recommendation(user_preferences, all_items):
    """Simple recommendation based on Jaccard similarity."""
    recommendations = []

    user_set = set(user_preferences)

    # Score each candidate set against the user's preferences
    for item_set in all_items:
        similarity = jaccard_similarity(user_set, item_set)
        recommendations.append((item_set, similarity))

    # Sort by similarity, highest first
    recommendations.sort(key=lambda x: x[1], reverse=True)
    return recommendations

# Test
user_likes = {"python", "machine learning", "data science"}
all_categories = [
    {"python", "programming", "web development"},
    {"machine learning", "deep learning", "ai"},
    {"data science", "statistics", "visualization"},
    {"java", "mobile development", "android"}
]

recs = jaccard_recommendation(user_likes, all_categories)
for category, similarity in recs:
    print(f"Category: {category}, similarity: {similarity:.3f}")

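Characteristics of Jaccard Similarity

Advantages:

  1. Simple and intuitive: easy to compute and easy to understand
  2. Normalized for set size: the score is the proportion of shared elements, so it always falls between 0 and 1
  3. Works with many kinds of data: text, tags, user preferences, and so on

Disadvantages:

  1. Ignores element frequency: it records only whether an element is present, not how often it occurs
  2. Ignores element order: it is a poor fit for order-sensitive data
  3. Sensitive to rare elements: every element carries equal weight, so rare elements can have a large effect on the result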
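
The first two disadvantages are easy to see in a small experiment. The sketch below defines its own word-level helper so that it runs on its own; the name `word_jaccard` and the sample sentences are made up for this illustration.

```python
def word_jaccard(text1, text2):
    """Jaccard similarity of the lowercase word sets of two texts."""
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    union = words1 | words2
    # Assumed convention for this sketch: two empty sets score 0
    return len(words1 & words2) / len(union) if union else 0.0

# Frequency is ignored: repeating a word does not change the word set
freq_score = word_jaccard("spam spam spam eggs", "spam eggs")

# Order is ignored: both sentences contain exactly the same words
order_score = word_jaccard("dog bites man", "man bites dog")

print(freq_score)   # 1.0
print(order_score)  # 1.0
```

Both pairs score a perfect 1.0 even though the texts differ, which is why a count-based or sequence-aware measure may be a better choice when repetition or word order matters.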

Comparison with Other Similarity Measures

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def compare_similarities(text1, text2):
    """Compare Jaccard and cosine similarity on the same pair of texts."""
    # Jaccard similarity (uses jaccard_similarity defined earlier);
    # lowercase first, since CountVectorizer lowercases by default
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    jaccard_sim = jaccard_similarity(words1, words2)

    # Cosine similarity on word-count vectors
    counts = CountVectorizer().fit_transform([text1, text2])
    vectors = counts.toarray()
    cosine_sim = cosine_similarity(vectors[0:1], vectors[1:2])[0][0]

    return jaccard_sim, cosine_sim

# Test
text1 = "machine learning deep neural network"
text2 = "deep learning neural network model"

jaccard, cosine = compare_similarities(text1, text2)
print(f"Jaccard similarity: {jaccard:.3f}")
print(f"Cosine similarity: {cosine:.3f}")
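
As a cross-check that does not need scikit-learn, both measures can be computed by hand for the same pair of texts. This sketch assumes plain whitespace tokenization, which is not identical to `CountVectorizer`'s default tokenizer (for instance, the default drops single-character tokens), although the two agree for these inputs. The helper names `jaccard_words` and `cosine_words` are illustrative.

```python
import math
from collections import Counter

def jaccard_words(text1, text2):
    """Jaccard similarity on lowercase word sets."""
    a, b = set(text1.lower().split()), set(text2.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine_words(text1, text2):
    """Cosine similarity on lowercase word-count vectors."""
    c1, c2 = Counter(text1.lower().split()), Counter(text2.lower().split())
    # Only words present in both texts contribute to the dot product
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

t1 = "machine learning deep neural network"
t2 = "deep learning neural network model"

# 4 shared words, 6 distinct words overall -> 4/6
print(f"{jaccard_words(t1, t2):.3f}")  # 0.667
# Dot product 4, each vector has norm sqrt(5) -> 4/5
print(f"{cosine_words(t1, t2):.3f}")   # 0.800
```

Cosine similarity works on counts, so repeated words would change its value, whereas Jaccard similarity only sees whether a word is present.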

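Jaccard similarity is a simple yet effective similarity measure, particularly well suited to set data, tagging systems, and scenarios that call for fast computation.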