
Filtering Sensitive Words in Python

Views: 206 | Date: 2022-06-20 11:47:26

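Overview:

Sensitive-word filtering can be viewed as a text anti-spam technique. A typical exercise: given a file of sensitive words, filtered_words.txt, replace any sensitive word the user types with asterisks (*), one asterisk per character. For example, if the user enters 「北京是个好城市」 ("Beijing is a good city") and 北京 is in the word list, the result is 「**是个好城市」. Code: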

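    # coding=utf-8
    def filterwords(path):
        # Load one sensitive word per line, skipping blank lines.
        with open(path, 'r', encoding='utf-8') as f:
            words = [w for w in f.read().split('\n') if w]
        print(words)
        userinput = input('myinput:')
        # Mask every sensitive word found in the input, one '*' per character.
        for w in words:
            if w in userinput:
                userinput = userinput.replace(w, '*' * len(w))
        return userinput

    print(filterwords('filtered_words.txt'))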

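Another example, filtering pornographic terms:

Write a sensitive-word filter that prompts the user for a comment. If the comment contains any word from the sensitive-word list li = ['苍老师', '东京热', '武藤兰', '波多野结衣'], replace each sensitive word with *** and append the result to a list; if the comment contains no sensitive words, append it to that list unchanged.

    content = input('Enter your comment: ')
    li = ['苍老师', '东京热', '武藤兰', '波多野结衣']
    comments = []  # collects every processed comment

    # replace() leaves the text unchanged when a word is absent,
    # so a clean comment is appended as-is.
    for word in li:
        content = content.replace(word, '***')
    comments.append(content)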


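Practical case:

A BAT (Baidu, Alibaba, Tencent) interview question: how would you quickly replace 50,000 sensitive words in 1 billion titles? The 1 billion titles are stored in one file, one title per line, and the 50,000 sensitive words are stored in another file. Write a program that filters every sensitive word out of every title and saves the result to a third file.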
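Whatever matcher is chosen, a file of this size is best processed as a stream rather than loaded into memory. The sketch below shows one way that outer loop might look. The names filter_stream, filter_file and naive_filter are hypothetical, and naive_filter is only a stand-in with made-up English words; in practice it would be replaced by a DFA or AC-automaton filter such as the ones in the following sections.

```python
def filter_stream(lines, filter_fn):
    """Lazily apply filter_fn to each title; only one line is held at a time."""
    for line in lines:
        yield filter_fn(line.rstrip("\n")) + "\n"


def filter_file(src_path, dst_path, filter_fn):
    """Read titles from src_path line by line and write filtered titles to dst_path."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(filter_stream(src, filter_fn))


def naive_filter(title, words=("spam", "scam")):
    """Stand-in matcher for illustration only (assumed word list)."""
    for w in words:
        title = title.replace(w, "*" * len(w))
    return title


print(list(filter_stream(["buy spam now\n", "hello\n"], naive_filter)))
# → ['buy **** now\n', 'hello\n']
```

Because filter_stream is a generator, memory use stays roughly constant regardless of how many titles the input file holds; the cost per title then depends entirely on the matcher passed in as filter_fn.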

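1. Sensitive-word filtering with a DFA

Among text-filtering algorithms, the DFA is one of the better choices. DFA stands for Deterministic Finite Automaton. The core of the algorithm is to build a set of trees (tries) from the sensitive words, so that the text is checked against all words at once instead of once per word. A Python implementation of the DFA algorithm: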

    # -*- coding:utf-8 -*-
    import time

    time1 = time.time()


    # DFA algorithm
    class DFAFilter():
        def __init__(self):
            self.keyword_chains = {}  # nested dicts, one level per character
            self.delimit = '\x00'     # key that marks the end of a keyword

        def add(self, keyword):
            keyword = keyword.lower()
            chars = keyword.strip()
            if not chars:
                return
            level = self.keyword_chains
            for i in range(len(chars)):
                if chars[i] in level:
                    # Shared prefix: descend into the existing branch.
                    level = level[chars[i]]
                else:
                    if not isinstance(level, dict):
                        break
                    # New suffix: create one level per remaining character.
                    for j in range(i, len(chars)):
                        level[chars[j]] = {}
                        last_level, last_char = level, chars[j]
                        level = level[chars[j]]
                    last_level[last_char] = {self.delimit: 0}
                    break
            if i == len(chars) - 1:
                level[self.delimit] = 0

        def parse(self, path):
            # Load the word list, one keyword per line.
            with open(path, encoding='utf-8') as f:
                for keyword in f:
                    self.add(str(keyword).strip())

        def filter(self, message, repl='*'):
            message = message.lower()
            ret = []
            start = 0
            while start < len(message):
                level = self.keyword_chains
                step_ins = 0
                for char in message[start:]:
                    if char in level:
                        step_ins += 1
                        if self.delimit not in level[char]:
                            level = level[char]
                        else:
                            # A complete keyword ends here: mask it and skip past it.
                            ret.append(repl * step_ins)
                            start += step_ins - 1
                            break
                    else:
                        ret.append(message[start])
                        break
                else:
                    ret.append(message[start])
                start += 1
            return ''.join(ret)


    if __name__ == '__main__':
        gfw = DFAFilter()
        path = 'F:/文本反垃圾算法/sensitive_words.txt'
        gfw.parse(path)
        text = '新疆骚乱苹果新品发布会?八'
        result = gfw.filter(text)
        print(text)
        print(result)
        time2 = time.time()
        print('Total time: ' + str(time2 - time1) + 's')

Output:

    新疆骚乱苹果新品发布会?八
    ****苹果新品发布会**
    Total time: 0.0010344982147216797s
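To make the "tree of sensitive words" idea more concrete, here is a minimal sketch of the nested-dictionary structure that this kind of filter relies on. The helper names build_trie and match_length_at, the END constant, and the English word list are assumptions chosen for illustration; they are not part of the code above.

```python
END = "\x00"  # marker key meaning "a complete keyword ends at this node"


def build_trie(words):
    """Build a nested-dict trie with one dictionary level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = 0
    return root


def match_length_at(trie, text, start):
    """Return the length of the shortest keyword starting at text[start], or 0."""
    node = trie
    for offset, ch in enumerate(text[start:], 1):
        if ch not in node:
            return 0
        node = node[ch]
        if END in node:
            return offset
    return 0


trie = build_trie(["bad", "badly", "ugly"])
print(match_length_at(trie, "a bad day", 2))  # → 3
print(match_length_at(trie, "a bad day", 0))  # → 0
```

Words that share a prefix ("bad" and "badly" here) share the same branch of the trie, so a lookup walks each input character at most once per starting position, no matter how many keywords are loaded.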

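2. Sensitive-word filtering with an AC automaton

AC (Aho-Corasick) automaton: a classic use case is to take n words and an article of m characters and find how many of the words occur in the article. Put simply, an AC automaton is a trie combined with the failure (mismatch) pointers familiar from the KMP algorithm.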

    # -*- coding:utf-8 -*-
    import time
    from collections import deque

    time1 = time.time()


    # AC automaton algorithm
    class node(object):
        def __init__(self):
            self.next = {}
            self.fail = None
            self.isWord = False
            self.word = ''


    class ac_automation(object):
        def __init__(self):
            self.root = node()

        # Add a sensitive word to the trie
        def addword(self, word):
            temp_root = self.root
            for char in word:
                if char not in temp_root.next:
                    temp_root.next[char] = node()
                temp_root = temp_root.next[char]
            temp_root.isWord = True
            temp_root.word = word

        # Build the failure pointers (breadth-first over the trie)
        def make_fail(self):
            temp_que = deque([self.root])
            while temp_que:
                temp = temp_que.popleft()
                for key, value in temp.next.items():
                    if temp == self.root:
                        value.fail = self.root
                    else:
                        p = temp.fail
                        while p is not None:
                            if key in p.next:
                                value.fail = p.next[key]
                                break
                            p = p.fail
                        if p is None:
                            value.fail = self.root
                    temp_que.append(value)

        # Find the sensitive words that occur in content
        def search(self, content):
            p = self.root
            result = []
            for word in content:
                # Follow failure pointers until this character has a transition.
                while word not in p.next and p != self.root:
                    p = p.fail
                if word in p.next:
                    p = p.next[word]
                # Collect every keyword ending here, including shorter suffixes.
                temp = p
                while temp != self.root:
                    if temp.isWord:
                        result.append(temp.word)
                    temp = temp.fail
            return result

        # Load the sensitive-word file
        def parse(self, path):
            with open(path, encoding='utf-8') as f:
                for keyword in f:
                    if keyword.strip():
                        self.addword(str(keyword).strip())
            # Failure pointers must be built once all words are loaded.
            self.make_fail()

        # Replace the sensitive words
        def words_replace(self, text):
            """
            :param text: input text
            :return: text with the sensitive words masked
            """
            # Replace longer words first so a shorter word inside one cannot split it.
            result = sorted(set(self.search(text)), key=len, reverse=True)
            for x in result:
                text = text.replace(x, '*' * len(x))
            return text


    if __name__ == '__main__':
        ah = ac_automation()
        path = 'F:/文本反垃圾算法/sensitive_words.txt'
        ah.parse(path)
        text1 = '新疆骚乱苹果新品发布会?八'
        text2 = ah.words_replace(text1)
        print(text1)
        print(text2)
        time2 = time.time()
        print('Total time: ' + str(time2 - time1) + 's')

Output:

    新疆骚乱苹果新品发布会?八
    ****苹果新品发布会**
    Total time: 0.0010304450988769531s
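The failure pointer is the least obvious part of an AC automaton, so a small illustration may help. For a trie node reached by some prefix, the failure pointer leads to the node for the longest proper suffix of that prefix that is itself a prefix in the trie. The sketch below computes this straight from that definition rather than with the breadth-first construction a real implementation would use; the function name build_failure_links and the English word set are assumptions for illustration.

```python
def build_failure_links(words):
    """Map every trie prefix to its failure target (the root is the empty string)."""
    # Represent the trie implicitly as the set of all prefixes of all words.
    prefixes = {""}
    for w in words:
        for i in range(1, len(w) + 1):
            prefixes.add(w[:i])

    fail = {}
    for p in prefixes:
        if not p:
            continue
        # Try proper suffixes from longest to shortest; "" always matches.
        for k in range(1, len(p) + 1):
            if p[k:] in prefixes:
                fail[p] = p[k:]
                break
    return fail


links = build_failure_links(["he", "she", "his", "hers"])
print(links["she"])   # → he
print(links["hers"])  # → s
```

After reading "she", a mismatch on the next character does not restart from scratch: the automaton falls back to the state for "he", which is also where the keyword "he" is detected inside "she".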

