Fuzzy String Matching in Python - RapidFuzz

Author unknown, 02-07

Introduction

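An earlier post covered the applications of fuzzy string matching and the use of the FuzzyWuzzy package. RapidFuzz is now the more widely used option and, as its name suggests, it is faster. RapidFuzz is a string matching library written in Python and C++ that uses the same string similarity calculations as FuzzyWuzzy. The main differences between RapidFuzz and FuzzyWuzzy are: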

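- RapidFuzz is MIT licensed, so it can be used in any project, whereas FuzzyWuzzy is subject to the GPL license;
- RapidFuzz provides more string similarity metrics, such as hamming and jaro_winkler;
- most of it is written in C++, with many algorithmic optimizations on top, so matching is faster while the results stay the same;
- it fixes some bugs in FuzzyWuzzy's partial_ratio method.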

Installation: pip install rapidfuzz

Basic usage of RapidFuzz

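Usage is essentially the same as FuzzyWuzzy. There are four commonly used similarity functions (scorers), and they run far faster than their FuzzyWuzzy counterparts (see the benchmark charts in the project README). There is also a process module for comparing a string against a list, and it is much more efficient than calling a scorer on each list element one by one.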

>>> from rapidfuzz import fuzz, process
>>> fuzz.ratio("this is a test", "this is a test!")
96.55
>>> fuzz.partial_ratio("this is a test", "this is a test!")
100.0
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
100.0
>>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
100.0
>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> process.extractOne("cowboys", choices, scorer=fuzz.WRatio)
>>> process.extract("new york jets", choices, scorer=fuzz.WRatio, limit=2)

Modules in detail

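Some modules and methods are not available in older versions. The description below is based on version 2.8.0, the latest at the time of writing. The main modules are: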

- process
- distance
- fuzz
- string_metric

process

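The process module is mainly used to find the most similar strings in a list of choices, or to compute similarity scores against that list. It has four main functions: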

process.cdist
process.cdist(queries, choices, *, scorer=<cyfunction ratio>, processor=None, score_cutoff=None, dtype=None, workers=1, **kwargs)
The query is a list of strings. Computes the similarity between every query and every element of choices and returns the scores as a matrix.

>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> process.cdist(["new york jets", "new york"], choices, scorer=fuzz.token_set_ratio)
array([[28.571428, 76.92308 , 64.28571 , 14.814815],
       [26.086956, 57.142857, 60.869564, 18.181818]], dtype=float32)

process.extract
process.extract(query, choices, *, scorer=<cyfunction WRatio>, processor=<cyfunction default_process>, limit=5, score_cutoff=None, **kwargs)
The query is a single string. Returns the results sorted by similarity. Each element of the returned list is a 3-tuple: the first value is the matched element of choices; the second is usually the similarity score, although its meaning depends on the scorer (with string_metric.levenshtein it is a distance, so 0 means a perfect match); the third is the list index, or the key when choices is a dictionary.

>>> process.extract("new york jets", choices, scorer=fuzz.token_set_ratio)
[('New York Jets', 100.0, 1), ('New York Giants', 78.57142857142857, 2), ('Atlanta Falcons', 28.57142857142857, 0), ('Dallas Cowboys', 14.81481481481481, 3)]

process.extract_iter
process.extract_iter(query, choices, *, scorer=<cyfunction WRatio>, processor=<cyfunction default_process>, score_cutoff=None, **kwargs)
The query is a single string. Returns an iterator; the results are not sorted and follow the original order of choices. Each result is the same kind of tuple.

process.extractOne
Returns the single best match.

distance

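This module contains the different distance metrics. The distance functions built into rapidfuzz are much faster than python-Levenshtein, so the built-in ones are recommended.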

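Levenshtein
The Levenshtein distance (edit distance) measures the difference between two strings s1 and s2. It is defined as the minimum number of insertions, deletions, or substitutions needed to turn s1 into s2. The implementation supports separate weights for insertion, deletion, and substitution. weights=(1, 1, 1) gives the uniform Levenshtein distance and weights=(1, 1, 2) gives the Indel distance; any other weights give a generic weighted Levenshtein distance.

- distance: Levenshtein.distance(s1, s2, *, weights=(1, 1, 1), processor=None, score_cutoff=None) computes and returns the (weighted) number of edit operations.
- normalized_distance: the normalized edit distance, computed as distance / max, where max is the largest possible edit distance between the two strings. The range is [0, 1]; the smaller the value, the more similar the strings.
- similarity: computed as max - distance.
- normalized_similarity: computed as 1 - normalized_distance. The range is [0, 1]; the larger the value, the more similar the strings.

Indel
Computes the minimum number of insertions and deletions needed to turn s1 into s2. This is equivalent to the Levenshtein distance with the substitution weight set to 2. The same four methods as for Levenshtein are available. Newer versions also add Damerau-Levenshtein which, unlike Indel, extends the Levenshtein distance by counting a transposition of two adjacent characters as a single edit.

Hamming
The Hamming distance counts the positions at which two strings of equal length differ, so both strings must have the same length. It has the same four methods:

- distance: the raw Hamming distance.
- normalized_distance: distance / len1, where len1 is the common length of the two strings. The range is [0, 1]; the smaller the value, the more similar the strings.
- similarity: len1 - distance.
- normalized_similarity: 1 - normalized_distance. The range is [0, 1]; the larger the value, the more similar the strings.

Jaro
distance.Jaro.similarity(s1, s2, *, processor=None, score_cutoff=None)
The Jaro measure is another string similarity metric. Its calculation is somewhat more involved; see other references for the details.

JaroWinkler
distance.JaroWinkler.similarity(s1, s2, *, prefix_weight=0.1, processor=None, score_cutoff=None)
Jaro-Winkler is a variant of the Jaro measure. The smaller the Jaro-Winkler distance, the more similar the two strings; similarity lies in [0, 1], and the larger the value, the more similar the strings. The formula is $Sim_w = Sim_j + l \cdot p \cdot (1 - Sim_j)$, where $Sim_j$ is the Jaro similarity, $l$ is the length of the common prefix (at most 4), and $p$ is a constant scaling factor (the prefix_weight argument). Jaro-Winkler therefore favors strings that share a prefix.

fuzz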

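Besides the functions already covered for FuzzyWuzzy, token_ratio and partial_token_ratio are also available. token_ratio returns the larger of token_set_ratio and token_sort_ratio, and is faster than calling the two separately and comparing. partial_token_ratio returns the larger of partial_token_set_ratio and partial_token_sort_ratio, and is much faster than calling the two separately and comparing.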

string_metric

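Its functionality largely overlaps with the distance module. It provides the following functions, which can be passed as the scorer argument in process.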

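- levenshtein
- normalized_levenshtein
- hamming
- normalized_hamming
- jaro_similarity
- jaro_winkler_similarity

Summary

rapidfuzz is widely used because of its speed. The process module, which finds the most similar strings in a list of candidates, is used more often than calling the fuzz functions on one pair at a time. The distance module is mostly used for distance measurement, while the string_metric functions are mostly passed as the scorer argument. Fuzzy string matching plays an important role in any kind of term linking, chiefly matching variant forms of a term to a standardized entry, as in entity linking.

References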

https://github.com/maxbachmann/rapidfuzz https://maxbachmann.github.io/RapidFuzz/Usage/index.html