python过滤敏感词记录

IT界的小小小学生 2020-11-06 01:28:10
python 过滤 敏感 记录


简述:

关于敏感词过滤可以看成是一种文本反垃圾算法,例如
题目:敏感词文本文件 filtered_words.txt,当用户输入敏感词语,则用 星号 * 替换,例如当用户输入「北京是个好城市」,则变成「**是个好城市」
代码:

#coding=utf-8
def filterwords(x):
with open(x,'r') as f:
text=f.read()
print text.split('\n')
userinput=raw_input('myinput:')
for i in text.split('\n'):
if i in userinput:
replace_str='*'*len(i.decode('utf-8'))
word=userinput.replace(i,replace_str)
return word
print filterwords('filtered_words.txt')

再例如反黄系列:

开发敏感词语过滤程序,提示用户输入评论内容,如果用户输入的内容中包含特殊的字符:
敏感词列表 li = ["苍老师","东京热",”武藤兰”,”波多野结衣”]
则将用户输入的内容中的敏感词汇替换成***,并添加到一个列表中;如果用户输入的内容没有敏感词汇,则直接添加到上述的列表中。
content = input('请输入你的内容:')
li = ["苍老师","东京热","武藤兰","波多野结衣"]
i = 0
while i < 4:
for li[i] in content:
li1 = content.replace('苍老师','***')
li2 = li1.replace('东京热','***')
li3 = li2.replace('武藤兰','***')
li4 = li3.replace('波多野结衣','***')
else:
pass
i += 1

在这里插入图片描述
实战案例:
一道bat面试题:快速替换10亿条标题中的5万个敏感词,有哪些解决思路?
有十亿个标题,存在一个文件中,一行一个标题。有5万个敏感词,存在另一个文件。写一个程序过滤掉所有标题中的所有敏感词,保存到另一个文件中。

1、DFA过滤敏感词算法

在实现文字过滤的算法中,DFA是比较好的实现算法。DFA即Deterministic Finite Automaton,也就是确定有穷自动机。
算法核心是建立了以敏感词为基础的许多敏感词树。
python 实现DFA算法:

# -*- coding:utf-8 -*-
import time
time1=time.time()
# DFA算法
class DFAFilter():
def __init__(self):
self.keyword_chains = {}
self.delimit = '\x00'
def add(self, keyword):
keyword = keyword.lower()
chars = keyword.strip()
if not chars:
return
level = self.keyword_chains
for i in range(len(chars)):
if chars[i] in level:
level = level[chars[i]]
else:
if not isinstance(level, dict):
break
for j in range(i, len(chars)):
level[chars[j]] = {}
last_level, last_char = level, chars[j]
level = level[chars[j]]
last_level[last_char] = {self.delimit: 0}
break
if i == len(chars) - 1:
level[self.delimit] = 0
def parse(self, path):
with open(path,encoding='utf-8') as f:
for keyword in f:
self.add(str(keyword).strip())
def filter(self, message, repl="*"):
message = message.lower()
ret = []
start = 0
while start < len(message):
level = self.keyword_chains
step_ins = 0
for char in message[start:]:
if char in level:
step_ins += 1
if self.delimit not in level[char]:
level = level[char]
else:
ret.append(repl * step_ins)
start += step_ins - 1
break
else:
ret.append(message[start])
break
else:
ret.append(message[start])
start += 1
return ''.join(ret)
if __name__ == "__main__":
gfw = DFAFilter()
path="F:/文本反垃圾算法/sensitive_words.txt"
gfw.parse(path)
text="新疆骚乱苹果新品发布会雞八"
result = gfw.filter(text)
print(text)
print(result)
time2 = time.time()
print('总共耗时:' + str(time2 - time1) + 's')

运行效果:

新疆骚乱苹果新品发布会雞八
****苹果新品发布会**
总共耗时:0.0010344982147216797s

2、AC自动机过滤敏感词算法

AC自动机:一个常见的例子就是给出n个单词,再给出一段包含m个字符的文章,让你找出有多少个单词在文章里出现过。
简单地讲,AC自动机就是字典树+kmp算法+失配指针

# -*- coding:utf-8 -*-
import time
time1=time.time()
# AC自动机算法
class node(object):
def __init__(self):
self.next = {}
self.fail = None
self.isWord = False
self.word = ""
class ac_automation(object):
def __init__(self):
self.root = node()
# 添加敏感词函数
def addword(self, word):
temp_root = self.root
for char in word:
if char not in temp_root.next:
temp_root.next[char] = node()
temp_root = temp_root.next[char]
temp_root.isWord = True
temp_root.word = word
# 失败指针函数
def make_fail(self):
temp_que = []
temp_que.append(self.root)
while len(temp_que) != 0:
temp = temp_que.pop(0)
p = None
for key,value in temp.next.item():
if temp == self.root:
temp.next[key].fail = self.root
else:
p = temp.fail
while p is not None:
if key in p.next:
temp.next[key].fail = p.fail
break
p = p.fail
if p is None:
temp.next[key].fail = self.root
temp_que.append(temp.next[key])
# 查找敏感词函数
def search(self, content):
p = self.root
result = []
currentposition = 0
while currentposition < len(content):
word = content[currentposition]
while word in p.next == False and p != self.root:
p = p.fail
if word in p.next:
p = p.next[word]
else:
p = self.root
if p.isWord:
result.append(p.word)
p = self.root
currentposition += 1
return result
# 加载敏感词库函数
def parse(self, path):
with open(path,encoding='utf-8') as f:
for keyword in f:
self.addword(str(keyword).strip())
# 敏感词替换函数
def words_replace(self, text):
"""
:param ah: AC自动机
:param text: 文本
:return: 过滤敏感词之后的文本
"""
result = list(set(self.search(text)))
for x in result:
m = text.replace(x, '*' * len(x))
text = m
return text
if __name__ == '__main__':
ah = ac_automation()
path='F:/文本反垃圾算法/sensitive_words.txt'
ah.parse(path)
text1="新疆骚乱苹果新品发布会雞八"
text2=ah.words_replace(text1)
print(text1)
print(text2)
time2 = time.time()
print('总共耗时:' + str(time2 - time1) + 's')

运行结果:

新疆骚乱苹果新品发布会雞八
****苹果新品发布会**
总共耗时:0.0010304450988769531s
微信号
版权声明
本文为[IT界的小小小学生]所创,转载请带上原文链接,感谢
https://vip01.blog.csdn.net/article/details/86608460

  1. 利用Python爬虫获取招聘网站职位信息
  2. Using Python crawler to obtain job information of recruitment website
  3. Several highly rated Python libraries arrow, jsonpath, psutil and tenacity are recommended
  4. Python装饰器
  5. Python实现LDAP认证
  6. Python decorator
  7. Implementing LDAP authentication with Python
  8. Vscode configures Python development environment!
  9. In Python, how dare you say you can't log module? ️
  10. 我收藏的有关Python的电子书和资料
  11. python 中 lambda的一些tips
  12. python中字典的一些tips
  13. python 用生成器生成斐波那契数列
  14. python脚本转pyc踩了个坑。。。
  15. My collection of e-books and materials about Python
  16. Some tips of lambda in Python
  17. Some tips of dictionary in Python
  18. Using Python generator to generate Fibonacci sequence
  19. The conversion of Python script to PyC stepped on a pit...
  20. Python游戏开发,pygame模块,Python实现扫雷小游戏
  21. Python game development, pyGame module, python implementation of minesweeping games
  22. Python实用工具,email模块,Python实现邮件远程控制自己电脑
  23. Python utility, email module, python realizes mail remote control of its own computer
  24. 毫无头绪的自学Python,你可能连门槛都摸不到!【最佳学习路线】
  25. Python读取二进制文件代码方法解析
  26. Python字典的实现原理
  27. Without a clue, you may not even touch the threshold【 Best learning route]
  28. Parsing method of Python reading binary file code
  29. Implementation principle of Python dictionary
  30. You must know the function of pandas to parse JSON data - JSON_ normalize()
  31. Python实用案例,私人定制,Python自动化生成爱豆专属2021日历
  32. Python practical case, private customization, python automatic generation of Adu exclusive 2021 calendar
  33. 《Python实例》震惊了,用Python这么简单实现了聊天系统的脏话,广告检测
  34. "Python instance" was shocked and realized the dirty words and advertisement detection of the chat system in Python
  35. Convolutional neural network processing sequence for Python deep learning
  36. Python data structure and algorithm (1) -- enum type enum
  37. 超全大厂算法岗百问百答(推荐系统/机器学习/深度学习/C++/Spark/python)
  38. 【Python进阶】你真的明白NumPy中的ndarray吗?
  39. All questions and answers for algorithm posts of super large factories (recommended system / machine learning / deep learning / C + + / spark / Python)
  40. [advanced Python] do you really understand ndarray in numpy?
  41. 【Python进阶】Python进阶专栏栏主自述:不忘初心,砥砺前行
  42. [advanced Python] Python advanced column main readme: never forget the original intention and forge ahead
  43. python垃圾回收和缓存管理
  44. java调用Python程序
  45. java调用Python程序
  46. Python常用函数有哪些?Python基础入门课程
  47. Python garbage collection and cache management
  48. Java calling Python program
  49. Java calling Python program
  50. What functions are commonly used in Python? Introduction to Python Basics
  51. Python basic knowledge
  52. Anaconda5.2 安装 Python 库(MySQLdb)的方法
  53. Python实现对脑电数据情绪分析
  54. Anaconda 5.2 method of installing Python Library (mysqldb)
  55. Python implements emotion analysis of EEG data
  56. Master some advanced usage of Python in 30 seconds, which makes others envy it
  57. python爬取百度图片并对图片做一系列处理
  58. Python crawls Baidu pictures and does a series of processing on them
  59. python链接mysql数据库
  60. Python link MySQL database