Learning SequenceMatcher: String Similarity Functions

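SequenceMatcher is a class in the difflib module of the Python standard library. It compares pairs of sequences (strings in particular, but any sequence of hashable elements works) and measures how similar they are.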

Overview

from difflib import SequenceMatcher

Main Features

1. Creating a SequenceMatcher object

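# Basic usage (string1 and string2 are the two sequences to compare)
matcher = SequenceMatcher(None, string1, string2)

# Mark certain elements (such as spaces) as junk via the isjunk predicate.
# Junk elements are not removed from the input; they are only kept from anchoring a match.
matcher = SequenceMatcher(lambda x: x == " ", string1, string2)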
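
To make the effect of the junk predicate concrete, here is a small sketch based on the example in the difflib documentation. The variable names plain and junk_aware are chosen here for illustration only.

```python
from difflib import SequenceMatcher

a, b = " abcd", "abcd abcd"

# Without a junk predicate, the leading space takes part in the match.
plain = SequenceMatcher(None, a, b)
print(plain.find_longest_match(0, len(a), 0, len(b)))
# Match(a=0, b=4, size=5)

# With spaces marked as junk, the match has to start on a non-junk element.
junk_aware = SequenceMatcher(lambda x: x == " ", a, b)
print(junk_aware.find_longest_match(0, len(a), 0, len(b)))
# Match(a=1, b=0, size=4)
```

The space is still present in both inputs; the predicate only changes which elements may anchor a match.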

2. Main methods

string1 = "hello world"
string2 = "hello python"

matcher = SequenceMatcher(None, string1, string2)

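# Similarity ratio in [0, 1], computed as 2*M/T
# (M = number of matched elements, T = total length of both sequences)
similarity = matcher.ratio()
print(f"Similarity: {similarity:.2f}")  # Output: Similarity: 0.61

# Upper bound on ratio(), cheaper to compute
quick_similarity = matcher.quick_ratio()

# Looser upper bound on ratio(), cheapest to compute
real_quick_similarity = matcher.real_quick_ratio()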
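
As a sanity check on how ratio() is defined, the sketch below recomputes it as 2*M/T from the matching blocks. The helper name ratio_from_blocks is hypothetical and exists only for this illustration.

```python
from difflib import SequenceMatcher

def ratio_from_blocks(a, b):
    """Recompute ratio() as 2*M/T from the matching blocks."""
    matcher = SequenceMatcher(None, a, b)
    # M: total number of matched elements across all blocks
    matched = sum(block.size for block in matcher.get_matching_blocks())
    # T: combined length of both sequences
    total = len(a) + len(b)
    return 2.0 * matched / total if total else 1.0

# "hello " (6 characters) plus "o" (1 character) match, so M = 7 and T = 23.
print(f"{ratio_from_blocks('hello world', 'hello python'):.2f}")  # 0.61
```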

Practical Examples

Example 1: Basic string comparison

from difflib import SequenceMatcher

def string_similarity(str1, str2):
    return SequenceMatcher(None, str1, str2).ratio()

# Test
test_cases = [
    ("apple", "apple"),      # identical
    ("apple", "appl"),       # highly similar
    ("apple", "orange"),     # mostly different (only "a" and "e" match)
    ("hello world", "hello world!"),  # slightly different
]

for str1, str2 in test_cases:
    similarity = string_similarity(str1, str2)
    print(f"'{str1}' vs '{str2}': {similarity:.3f}")

Output:

'apple' vs 'apple': 1.000
'apple' vs 'appl': 0.889
'apple' vs 'orange': 0.364
'hello world' vs 'hello world!': 0.957
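
One caveat for helpers like string_similarity: the score can depend on the order of the arguments. The snippet below reproduces the example from the difflib documentation. The symmetric_similarity helper is a hypothetical addition, and taking the maximum of both orders is only one possible convention.

```python
from difflib import SequenceMatcher

# Same two strings, different argument order, different score.
print(SequenceMatcher(None, "tide", "diet").ratio())  # 0.25
print(SequenceMatcher(None, "diet", "tide").ratio())  # 0.5

def symmetric_similarity(str1, str2):
    """Hypothetical helper: use the higher score of the two argument orders."""
    forward = SequenceMatcher(None, str1, str2).ratio()
    backward = SequenceMatcher(None, str2, str1).ratio()
    return max(forward, backward)

print(symmetric_similarity("tide", "diet"))  # 0.5
```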

Example 2: Finding the most similar string

def find_most_similar(target, candidates):
    best_match = None
    best_score = 0
    
    for candidate in candidates:
        score = SequenceMatcher(None, target, candidate).ratio()
        if score > best_score:
            best_score = score
            best_match = candidate
    
    return best_match, best_score

# Test
target = "python"
candidates = ["pyton", "pythn", "java", "ruby", "pythoon"]

match, score = find_most_similar(target, candidates)
print(f"Target: '{target}'")
print(f"Most similar: '{match}' (similarity: {score:.3f})")  # 'pythoon' (similarity: 0.923)
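
The standard library also ships difflib.get_close_matches, which performs a similar search and returns several candidates ranked by score. A minimal sketch with the same data follows; the values n=3 and cutoff=0.6 are chosen here for illustration.

```python
from difflib import get_close_matches

target = "python"
candidates = ["pyton", "pythn", "java", "ruby", "pythoon"]

# n caps the number of results; cutoff drops candidates that score below it.
matches = get_close_matches(target, candidates, n=3, cutoff=0.6)
print(matches)  # best match first; "java" and "ruby" fall below the cutoff
```

Unlike find_most_similar, this returns only the strings and not their scores, so the hand-written loop is still useful when the score itself is needed.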

Example 3: Getting detailed match information

def detailed_comparison(str1, str2):
    matcher = SequenceMatcher(None, str1, str2)
    
    print(f"String 1: '{str1}'")
    print(f"String 2: '{str2}'")
    print(f"Similarity: {matcher.ratio():.3f}")
    print(f"Quick ratio (upper bound): {matcher.quick_ratio():.3f}")
    
    # Get the matching blocks
    print("\nMatching blocks:")
    for block in matcher.get_matching_blocks():
        if block.size > 0:  # skip the zero-length sentinel block at the end
            i, j, n = block
            match_str = str1[i:i+n]
            print(f"  Position: str1[{i}:{i+n}] = str2[{j}:{j+n}] = '{match_str}'")

# Test
detailed_comparison("hello world", "hello python")
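
Matching blocks show what the two strings share. The complementary view is get_opcodes(), which lists the edits that turn the first string into the second. The sketch below replays those edits; rebuild_from_opcodes is a hypothetical helper written for this illustration.

```python
from difflib import SequenceMatcher

def rebuild_from_opcodes(str1, str2):
    """Rebuild str2 from str1 by replaying the edit operations."""
    matcher = SequenceMatcher(None, str1, str2)
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        print(f"{tag:7} str1[{i1}:{i2}]={str1[i1:i2]!r} -> str2[{j1}:{j2}]={str2[j1:j2]!r}")
        if tag == "equal":
            pieces.append(str1[i1:i2])  # unchanged text is taken from str1
        elif tag in ("replace", "insert"):
            pieces.append(str2[j1:j2])  # new text is taken from str2
        # "delete" contributes nothing to the result
    return "".join(pieces)

result = rebuild_from_opcodes("hello world", "hello python")
print(result)  # hello python
```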

Advanced Usage

Ignoring Certain Characters

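# Comparing strings that differ only by a space
str1 = "hello world"
str2 = "helloworld"

# No junk predicate
matcher1 = SequenceMatcher(None, str1, str2)
print(f"No junk predicate: {matcher1.ratio():.3f}")  # 0.952

# Spaces marked as junk. isjunk only keeps junk elements from anchoring a
# match; it does not remove them, so the score is unchanged here.
matcher2 = SequenceMatcher(lambda x: x == " ", str1, str2)
print(f"Spaces as junk: {matcher2.ratio():.3f}")     # 0.952

# To ignore spaces entirely, strip them before comparing
matcher3 = SequenceMatcher(None, str1.replace(" ", ""), str2.replace(" ", ""))
print(f"Spaces removed: {matcher3.ratio():.3f}")     # 1.000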

Comparing Lists and Other Sequences

# Lists can be compared as well
list1 = [1, 2, 3, 4, 5]
list2 = [1, 2, 4, 5, 6]

matcher = SequenceMatcher(None, list1, list2)
print(f"List similarity: {matcher.ratio():.3f}")  # 0.800
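
Because any sequence of hashable elements is accepted, one common use is comparing text word by word instead of character by character. A minimal sketch, assuming whitespace tokenization via str.split() is good enough for the input:

```python
from difflib import SequenceMatcher

words1 = "the quick brown fox".split()
words2 = "the quick red fox".split()

# Each word is one element, so "brown" vs "red" counts as a single mismatch.
matcher = SequenceMatcher(None, words1, words2)
print(f"Word-level similarity: {matcher.ratio():.3f}")  # 0.750
```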

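Performance Considerations

ratio(): the exact score, but the slowest; the worst case is quadratic in the sequence lengths
quick_ratio(): an upper bound on ratio(), faster to compute
real_quick_ratio(): the fastest and loosest upper bound on ratio(), based only on the two sequence lengths

For large numbers of comparisons, use the two upper bounds to discard pairs that cannot reach your threshold, and call ratio() only on the pairs that remain.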
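
The filtering idea can be sketched as follows. The function name filter_similar and the 0.6 default cutoff are assumptions made for this illustration. The sketch also relies on the documented behavior that SequenceMatcher caches information about its second sequence, so the fixed string is set once with set_seq2().

```python
from difflib import SequenceMatcher

def filter_similar(target, candidates, cutoff=0.6):
    """Return (candidate, score) pairs that score at least `cutoff`."""
    matcher = SequenceMatcher()
    matcher.set_seq2(target)  # seq2 is cached, so set the fixed string once
    results = []
    for candidate in candidates:
        matcher.set_seq1(candidate)
        # Both quick checks are upper bounds on ratio(), so failing either
        # one means ratio() cannot reach the cutoff.
        if (matcher.real_quick_ratio() >= cutoff
                and matcher.quick_ratio() >= cutoff):
            score = matcher.ratio()
            if score >= cutoff:
                results.append((candidate, score))
    return results

hits = filter_similar("python", ["pyton", "pythn", "java", "ruby", "pythoon"])
print([name for name, _ in hits])  # ['pyton', 'pythn', 'pythoon']
```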

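SequenceMatcher is a simple but powerful tool, well suited to tasks that need a string similarity score, such as spell checking, text comparison, and search suggestions.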