{"componentChunkName":"component---src-templates-best-practice-detail-tsx","path":"/best-practice/2020-04-20-filter-sensitive-words","result":{"data":{"currentBlog":{"id":"ebea695d-436a-5cb9-955a-2be16fa5102f","frontmatter":{"thumbnail":"https://img.serverlesscloud.cn/2020511/1589207417716-ZalNtxgQAC_small.jpg","authors":["Anycodes"],"categories":["best-practice"],"date":"2020-04-20T00:00:00.000Z","title":"Filtering Sensitive Words from Text in 3 Minutes on a Serverless Architecture","description":"As social platforms keep growing in popularity, sensitive-word filtering has become an important feature that deserves attention. So which new implementations are available under a Serverless architecture?","authorslink":["https://zhuanlan.zhihu.com/ServerlessGo"],"translators":null,"translatorslink":null,"tags":["Python","文本处理"],"keywords":"Serverless 多环境配置,Serverless 管理环境,Serverless配置方案","outdated":true},"wordCount":{"words":167,"sentences":44,"paragraphs":44},"fileAbsolutePath":"/opt/build/repo/content/best-practice/2020-04-20-filter-sensitive-words.md","fields":{"slug":"/best-practice/2020-04-20-filter-sensitive-words/","keywords":["python","serverless","函数计算","self","root","关键词","敏感","time","level","word"]},"html":"<h2 id=\"前言\"><a href=\"#%E5%89%8D%E8%A8%80\" aria-label=\"前言 permalink\" class=\"anchor\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Introduction</h2>\n<p>Sensitive-word filtering grew up alongside online communities as a technical measure against cybercrime and online abuse. By screening for and blocking keywords associated with crime or abuse, it aims to stop serious harm before it starts.</p>\n<p>As social forums become ever more popular, sensitive-word filtering has become a very important feature. So under a Serverless architecture, using Python, what new ways are there to implement it? And can we build a sensitive-word filtering API in the simplest possible way?</p>\n<h2 id=\"了解敏感过滤的几种方法\"><a 
href=\"#%E4%BA%86%E8%A7%A3%E6%95%8F%E6%84%9F%E8%BF%87%E6%BB%A4%E7%9A%84%E5%87%A0%E7%A7%8D%E6%96%B9%E6%B3%95\" aria-label=\"了解敏感过滤的几种方法 permalink\" class=\"anchor\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Approaches to sensitive-word filtering</h2>\n<h3 id=\"replace方法\"><a href=\"#replace%E6%96%B9%E6%B3%95\" aria-label=\"replace方法 permalink\" class=\"anchor\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>The replace method</h3>\n<p>Sensitive-word filtering is, to a large extent, text replacement. Taking Python as an example, we can implement it with replace: first prepare a list of sensitive words, then call replace once for each word in the list:</p>\n<div\n              class=\"gatsby-code-button-container\"\n              data-toaster-id=\"16824941568349194000\"\n              data-toaster-class=\"gatsby-code-button-toaster\"\n              data-toaster-text-class=\"gatsby-code-button-toaster-text\"\n              data-toaster-text=\"代码复制成功\"\n              data-toaster-duration=\"3500\"\n              onClick=\"copyToClipboard(`def worldFilter(keywords, text):\n    for eve in keywords:\n        text = text.replace(eve, &quot;***&quot;)\n    return text\nkeywords = (&quot;关键词1&quot;, &quot;关键词2&quot;, &quot;关键词3&quot;)\ncontent = 
&quot;这是一个关键词替换的例子，这里涉及到了关键词1还有关键词2，最后还会有关键词3。&quot;\nprint(worldFilter(keywords, content))`, `16824941568349194000`)\"\n            >\n              <div\n                class=\"gatsby-code-button\"\n                data-tooltip=\"\"\n              >\n                复制代码<svg class=\"gatsby-code-button-icon\" xmlns=\"http://www.w3.org/2000/svg\" width=\"24\" height=\"24\" viewBox=\"0 0 24 24\"><path fill=\"none\" d=\"M0 0h24v24H0V0z\"/><path d=\"M16 1H2v16h2V3h12V1zm-1 4l6 6v12H6V5h9zm-1 7h5.5L14 6.5V12z\"/></svg>\n              </div>\n            </div>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre class=\"language-python\"><code class=\"language-python\"><span class=\"token keyword\">def</span> <span class=\"token function\">worldFilter</span><span class=\"token punctuation\">(</span>keywords<span class=\"token punctuation\">,</span> text<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n    <span class=\"token keyword\">for</span> eve <span class=\"token keyword\">in</span> keywords<span class=\"token punctuation\">:</span>\n        text <span class=\"token operator\">=</span> text<span class=\"token punctuation\">.</span>replace<span class=\"token punctuation\">(</span>eve<span class=\"token punctuation\">,</span> <span class=\"token string\">\"***\"</span><span class=\"token punctuation\">)</span>\n    <span class=\"token keyword\">return</span> text\nkeywords <span class=\"token operator\">=</span> <span class=\"token punctuation\">(</span><span class=\"token string\">\"关键词1\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\"关键词2\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\"关键词3\"</span><span class=\"token punctuation\">)</span>\ncontent <span class=\"token operator\">=</span> <span class=\"token string\">\"这是一个关键词替换的例子，这里涉及到了关键词1还有关键词2，最后还会有关键词3。\"</span>\n<span class=\"token keyword\">print</span><span class=\"token 
punctuation\">(</span>worldFilter<span class=\"token punctuation\">(</span>keywords<span class=\"token punctuation\">,</span> content<span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span></code></pre></div>\n<p>This approach is simple, but it has one big problem: every keyword triggers its own full pass over the text, so when both the text and the word list are very large, performance degrades badly.</p>\n<p>As an example, let us first modify the code to run a basic performance test:</p>\n<div\n              class=\"gatsby-code-button-container\"\n              data-toaster-id=\"64567794702223620000\"\n              data-toaster-class=\"gatsby-code-button-toaster\"\n              data-toaster-text-class=\"gatsby-code-button-toaster-text\"\n              data-toaster-text=\"代码复制成功\"\n              data-toaster-duration=\"3500\"\n              onClick=\"copyToClipboard(`import time\n\ndef worldFilter(keywords, text):\n    for eve in keywords:\n        text = text.replace(eve, &quot;***&quot;)\n    return text\nkeywords =[ &quot;关键词&quot; + str(i) for i in range(0,10000)]\ncontent = &quot;这是一个关键词替换的例子，这里涉及到了关键词1还有关键词2，最后还会有关键词3。&quot; * 1000\nstartTime = time.time()\nworldFilter(keywords, content)\nprint(time.time()-startTime)`, `64567794702223620000`)\"\n            >\n              <div\n                class=\"gatsby-code-button\"\n                data-tooltip=\"\"\n              >\n                复制代码<svg class=\"gatsby-code-button-icon\" xmlns=\"http://www.w3.org/2000/svg\" width=\"24\" height=\"24\" viewBox=\"0 0 24 24\"><path fill=\"none\" d=\"M0 0h24v24H0V0z\"/><path d=\"M16 1H2v16h2V3h12V1zm-1 4l6 6v12H6V5h9zm-1 7h5.5L14 6.5V12z\"/></svg>\n              </div>\n            </div>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre class=\"language-python\"><code class=\"language-python\"><span class=\"token keyword\">import</span> time\n\n<span class=\"token keyword\">def</span> <span class=\"token function\">worldFilter</span><span class=\"token punctuation\">(</span>keywords<span class=\"token punctuation\">,</span> text<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n  
  <span class=\"token keyword\">for</span> eve <span class=\"token keyword\">in</span> keywords<span class=\"token punctuation\">:</span>\n        text <span class=\"token operator\">=</span> text<span class=\"token punctuation\">.</span>replace<span class=\"token punctuation\">(</span>eve<span class=\"token punctuation\">,</span> <span class=\"token string\">\"***\"</span><span class=\"token punctuation\">)</span>\n    <span class=\"token keyword\">return</span> text\nkeywords <span class=\"token operator\">=</span><span class=\"token punctuation\">[</span> <span class=\"token string\">\"关键词\"</span> <span class=\"token operator\">+</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">(</span>i<span class=\"token punctuation\">)</span> <span class=\"token keyword\">for</span> i <span class=\"token keyword\">in</span> <span class=\"token builtin\">range</span><span class=\"token punctuation\">(</span><span class=\"token number\">0</span><span class=\"token punctuation\">,</span><span class=\"token number\">10000</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">]</span>\ncontent <span class=\"token operator\">=</span> <span class=\"token string\">\"这是一个关键词替换的例子，这里涉及到了关键词1还有关键词2，最后还会有关键词3。\"</span> <span class=\"token operator\">*</span> <span class=\"token number\">1000</span>\nstartTime <span class=\"token operator\">=</span> time<span class=\"token punctuation\">.</span>time<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\nworldFilter<span class=\"token punctuation\">(</span>keywords<span class=\"token punctuation\">,</span> content<span class=\"token punctuation\">)</span>\n<span class=\"token keyword\">print</span><span class=\"token punctuation\">(</span>time<span class=\"token punctuation\">.</span>time<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span><span class=\"token operator\">-</span>startTime<span class=\"token 
punctuation\">)</span></code></pre></div>\n<p>This prints <code class=\"language-text\">0.12426114082336426</code>, that is, roughly 0.12 seconds for a single call, which is poor.</p>\n<h3 id=\"正则表达方法\"><a href=\"#%E6%AD%A3%E5%88%99%E8%A1%A8%E8%BE%BE%E6%96%B9%E6%B3%95\" aria-label=\"正则表达方法 permalink\" class=\"anchor\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>The regular-expression method</h3>\n<p>Compared with <code class=\"language-text\">replace</code>, an implementation built on the regular-expression function <code class=\"language-text\">re.sub</code> may be faster.</p>\n<div\n              class=\"gatsby-code-button-container\"\n              data-toaster-id=\"56836023025286430000\"\n              data-toaster-class=\"gatsby-code-button-toaster\"\n              data-toaster-text-class=\"gatsby-code-button-toaster-text\"\n              data-toaster-text=\"代码复制成功\"\n              data-toaster-duration=\"3500\"\n              onClick=\"copyToClipboard(`import time\nimport re\ndef worldFilter(keywords, text):\n     return re.sub(&quot;|&quot;.join(keywords), &quot;***&quot;, text)\nkeywords =[ &quot;关键词&quot; + str(i) for i in range(0,10000)]\ncontent = &quot;这是一个关键词替换的例子，这里涉及到了关键词1还有关键词2，最后还会有关键词3。&quot; * 1000\nstartTime = time.time()\nworldFilter(keywords, content)\nprint(time.time()-startTime)`, `56836023025286430000`)\"\n            >\n              <div\n                class=\"gatsby-code-button\"\n                data-tooltip=\"\"\n              >\n                复制代码<svg class=\"gatsby-code-button-icon\" xmlns=\"http://www.w3.org/2000/svg\" width=\"24\" height=\"24\" viewBox=\"0 0 24 24\"><path fill=\"none\" d=\"M0 0h24v24H0V0z\"/><path d=\"M16 
1H2v16h2V3h12V1zm-1 4l6 6v12H6V5h9zm-1 7h5.5L14 6.5V12z\"/></svg>\n              </div>\n            </div>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre class=\"language-python\"><code class=\"language-python\"><span class=\"token keyword\">import</span> time\n<span class=\"token keyword\">import</span> re\n<span class=\"token keyword\">def</span> <span class=\"token function\">worldFilter</span><span class=\"token punctuation\">(</span>keywords<span class=\"token punctuation\">,</span> text<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n     <span class=\"token keyword\">return</span> re<span class=\"token punctuation\">.</span>sub<span class=\"token punctuation\">(</span><span class=\"token string\">\"|\"</span><span class=\"token punctuation\">.</span>join<span class=\"token punctuation\">(</span>keywords<span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span> <span class=\"token string\">\"***\"</span><span class=\"token punctuation\">,</span> text<span class=\"token punctuation\">)</span>\nkeywords <span class=\"token operator\">=</span><span class=\"token punctuation\">[</span> <span class=\"token string\">\"关键词\"</span> <span class=\"token operator\">+</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">(</span>i<span class=\"token punctuation\">)</span> <span class=\"token keyword\">for</span> i <span class=\"token keyword\">in</span> <span class=\"token builtin\">range</span><span class=\"token punctuation\">(</span><span class=\"token number\">0</span><span class=\"token punctuation\">,</span><span class=\"token number\">10000</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">]</span>\ncontent <span class=\"token operator\">=</span> <span class=\"token string\">\"这是一个关键词替换的例子，这里涉及到了关键词1还有关键词2，最后还会有关键词3。\"</span> <span class=\"token operator\">*</span> <span class=\"token number\">1000</span>\nstartTime <span 
class=\"token operator\">=</span> time<span class=\"token punctuation\">.</span>time<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\nworldFilter<span class=\"token punctuation\">(</span>keywords<span class=\"token punctuation\">,</span> content<span class=\"token punctuation\">)</span>\n<span class=\"token keyword\">print</span><span class=\"token punctuation\">(</span>time<span class=\"token punctuation\">.</span>time<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span><span class=\"token operator\">-</span>startTime<span class=\"token punctuation\">)</span></code></pre></div>\n<p>With the same timing code added as in the previous test, this version prints <code class=\"language-text\">0.24773502349853516</code>.</p>\n<p>Comparing the two runs, the methods are in the same order of magnitude at this scale, and <code class=\"language-text\">re.sub</code> is in fact the slower one here (about 0.25 seconds against about 0.12 seconds). As the amount of text grows, however, the regular-expression approach is expected to pull ahead noticeably.</p>\n<h3 id=\"dfa-过滤敏感词\"><a href=\"#dfa-%E8%BF%87%E6%BB%A4%E6%95%8F%E6%84%9F%E8%AF%8D\" aria-label=\"dfa 过滤敏感词 permalink\" class=\"anchor\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Filtering sensitive words with a DFA</h3>\n<p>By comparison, DFA-based filtering is more efficient. For example, if we take 坏人, 坏孩子 and 坏蛋 as the sensitive words (all three start with the character 坏), their tree relationship can be drawn like this:</p>\n<p><img src=\"https://img.serverlesscloud.cn/202058/2-4-1.png\"></p>\n<p>The corresponding DFA dictionary looks like this, where the <code class=\"language-text\">\\x00</code> key marks the end of a word:</p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">{\n    &#39;坏&#39;: {\n        &#39;蛋&#39;: {\n            &#39;\\x00&#39;: 0\n        },\n        &#39;人&#39;: {\n            &#39;\\x00&#39;: 0\n        },\n        &#39;孩&#39;: {\n            &#39;子&#39;: {\n                &#39;\\x00&#39;: 0\n         
   }\n        }\n    }\n}</code></pre></div>\n<p>The main benefit of this tree representation is that it cuts the number of lookups and speeds up matching, because words with a common prefix share a single path. A basic implementation looks like this:</p>\n<div\n              class=\"gatsby-code-button-container\"\n              data-toaster-id=\"27391389513634714000\"\n              data-toaster-class=\"gatsby-code-button-toaster\"\n              data-toaster-text-class=\"gatsby-code-button-toaster-text\"\n              data-toaster-text=\"代码复制成功\"\n              data-toaster-duration=\"3500\"\n              onClick=\"copyToClipboard(`import time\n\nclass DFAFilter(object):\n    def __init__(self):\n        self.keyword_chains = {}  # keyword chains (nested dicts)\n        self.delimit = '\\x00'  # end-of-word marker\n\n    def parse(self, path):\n        with open(path, encoding='utf-8') as f:\n            for keyword in f:\n                chars = str(keyword).strip().lower()  # lowercase Latin letters in the keyword\n                if not chars:  # skip blank lines instead of aborting the load\n                    continue\n                level = self.keyword_chains\n                for i in range(len(chars)):\n                    if chars[i] in level:\n                        level = level[chars[i]]\n                    else:\n                        if not isinstance(level, dict):\n                            break\n                        for j in range(i, len(chars)):\n                            level[chars[j]] = {}\n                            last_level, last_char = level, chars[j]\n                            level = level[chars[j]]\n                        last_level[last_char] = {self.delimit: 0}\n                        break\n                if i == len(chars) - 1:\n                    level[self.delimit] = 0\n\n    def filter(self, message, repl=&quot;*&quot;):\n        message = message.lower()\n        ret = []\n        start = 0\n        while start < len(message):\n            level = self.keyword_chains\n            step_ins = 0\n            for char in message[start:]:\n                if char in level:\n                    step_ins += 1\n                    if self.delimit not in level[char]:\n      
                  level = level[char]\n                    else:\n                        ret.append(repl * step_ins)\n                        start += step_ins - 1\n                        break\n                else:\n                    ret.append(message[start])\n                    break\n            else:\n                ret.append(message[start])\n            start += 1\n\n        return ''.join(ret)\n\n\n\ngfw = DFAFilter()\ngfw.parse( &quot;./sensitive_words&quot;)\ncontent = &quot;这是一个关键词替换的例子，这里涉及到了关键词1还有关键词2，最后还会有关键词3。&quot; * 1000\nstartTime = time.time()\nresult = gfw.filter(content)\nprint(time.time()-startTime)`, `27391389513634714000`)\"\n            >\n              <div\n                class=\"gatsby-code-button\"\n                data-tooltip=\"\"\n              >\n                复制代码<svg class=\"gatsby-code-button-icon\" xmlns=\"http://www.w3.org/2000/svg\" width=\"24\" height=\"24\" viewBox=\"0 0 24 24\"><path fill=\"none\" d=\"M0 0h24v24H0V0z\"/><path d=\"M16 1H2v16h2V3h12V1zm-1 4l6 6v12H6V5h9zm-1 7h5.5L14 6.5V12z\"/></svg>\n              </div>\n            </div>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre class=\"language-python\"><code class=\"language-python\"><span class=\"token keyword\">import</span> time\n\n<span class=\"token keyword\">class</span> <span class=\"token class-name\">DFAFilter</span><span class=\"token punctuation\">(</span><span class=\"token builtin\">object</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n    <span class=\"token keyword\">def</span> <span class=\"token function\">__init__</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n        self<span class=\"token punctuation\">.</span>keyword_chains <span class=\"token operator\">=</span> <span class=\"token punctuation\">{</span><span class=\"token punctuation\">}</span>  <span class=\"token comment\"># 
keyword chains (nested dicts)</span>\n        self<span class=\"token punctuation\">.</span>delimit <span class=\"token operator\">=</span> <span class=\"token string\">'\\x00'</span>  <span class=\"token comment\"># end-of-word marker</span>\n\n    <span class=\"token keyword\">def</span> <span class=\"token function\">parse</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">,</span> path<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n        <span class=\"token keyword\">with</span> <span class=\"token builtin\">open</span><span class=\"token punctuation\">(</span>path<span class=\"token punctuation\">,</span> encoding<span class=\"token operator\">=</span><span class=\"token string\">'utf-8'</span><span class=\"token punctuation\">)</span> <span class=\"token keyword\">as</span> f<span class=\"token punctuation\">:</span>\n            <span class=\"token keyword\">for</span> keyword <span class=\"token keyword\">in</span> f<span class=\"token punctuation\">:</span>\n                chars <span class=\"token operator\">=</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">(</span>keyword<span class=\"token punctuation\">)</span><span class=\"token punctuation\">.</span>strip<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">.</span>lower<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>  <span class=\"token comment\"># lowercase Latin letters in the keyword</span>\n                <span class=\"token keyword\">if</span> <span class=\"token keyword\">not</span> chars<span class=\"token punctuation\">:</span>  <span class=\"token comment\"># skip blank lines instead of aborting the load</span>\n                    <span class=\"token keyword\">continue</span>\n                level <span class=\"token operator\">=</span> self<span class=\"token punctuation\">.</span>keyword_chains\n                <span class=\"token keyword\">for</span> i <span class=\"token 
keyword\">in</span> <span class=\"token builtin\">range</span><span class=\"token punctuation\">(</span><span class=\"token builtin\">len</span><span class=\"token punctuation\">(</span>chars<span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n                    <span class=\"token keyword\">if</span> chars<span class=\"token punctuation\">[</span>i<span class=\"token punctuation\">]</span> <span class=\"token keyword\">in</span> level<span class=\"token punctuation\">:</span>\n                        level <span class=\"token operator\">=</span> level<span class=\"token punctuation\">[</span>chars<span class=\"token punctuation\">[</span>i<span class=\"token punctuation\">]</span><span class=\"token punctuation\">]</span>\n                    <span class=\"token keyword\">else</span><span class=\"token punctuation\">:</span>\n                        <span class=\"token keyword\">if</span> <span class=\"token keyword\">not</span> <span class=\"token builtin\">isinstance</span><span class=\"token punctuation\">(</span>level<span class=\"token punctuation\">,</span> <span class=\"token builtin\">dict</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n                            <span class=\"token keyword\">break</span>\n                        <span class=\"token keyword\">for</span> j <span class=\"token keyword\">in</span> <span class=\"token builtin\">range</span><span class=\"token punctuation\">(</span>i<span class=\"token punctuation\">,</span> <span class=\"token builtin\">len</span><span class=\"token punctuation\">(</span>chars<span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n                            level<span class=\"token punctuation\">[</span>chars<span class=\"token punctuation\">[</span>j<span class=\"token punctuation\">]</span><span class=\"token 
punctuation\">]</span> <span class=\"token operator\">=</span> <span class=\"token punctuation\">{</span><span class=\"token punctuation\">}</span>\n                            last_level<span class=\"token punctuation\">,</span> last_char <span class=\"token operator\">=</span> level<span class=\"token punctuation\">,</span> chars<span class=\"token punctuation\">[</span>j<span class=\"token punctuation\">]</span>\n                            level <span class=\"token operator\">=</span> level<span class=\"token punctuation\">[</span>chars<span class=\"token punctuation\">[</span>j<span class=\"token punctuation\">]</span><span class=\"token punctuation\">]</span>\n                        last_level<span class=\"token punctuation\">[</span>last_char<span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> <span class=\"token punctuation\">{</span>self<span class=\"token punctuation\">.</span>delimit<span class=\"token punctuation\">:</span> <span class=\"token number\">0</span><span class=\"token punctuation\">}</span>\n                        <span class=\"token keyword\">break</span>\n                <span class=\"token keyword\">if</span> i <span class=\"token operator\">==</span> <span class=\"token builtin\">len</span><span class=\"token punctuation\">(</span>chars<span class=\"token punctuation\">)</span> <span class=\"token operator\">-</span> <span class=\"token number\">1</span><span class=\"token punctuation\">:</span>\n                    level<span class=\"token punctuation\">[</span>self<span class=\"token punctuation\">.</span>delimit<span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> <span class=\"token number\">0</span>\n\n    <span class=\"token keyword\">def</span> <span class=\"token function\">filter</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">,</span> message<span class=\"token punctuation\">,</span> repl<span class=\"token 
operator\">=</span><span class=\"token string\">\"*\"</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n        message <span class=\"token operator\">=</span> message<span class=\"token punctuation\">.</span>lower<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n        ret <span class=\"token operator\">=</span> <span class=\"token punctuation\">[</span><span class=\"token punctuation\">]</span>\n        start <span class=\"token operator\">=</span> <span class=\"token number\">0</span>\n        <span class=\"token keyword\">while</span> start <span class=\"token operator\">&lt;</span> <span class=\"token builtin\">len</span><span class=\"token punctuation\">(</span>message<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n            level <span class=\"token operator\">=</span> self<span class=\"token punctuation\">.</span>keyword_chains\n            step_ins <span class=\"token operator\">=</span> <span class=\"token number\">0</span>\n            <span class=\"token keyword\">for</span> char <span class=\"token keyword\">in</span> message<span class=\"token punctuation\">[</span>start<span class=\"token punctuation\">:</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">:</span>\n                <span class=\"token keyword\">if</span> char <span class=\"token keyword\">in</span> level<span class=\"token punctuation\">:</span>\n                    step_ins <span class=\"token operator\">+=</span> <span class=\"token number\">1</span>\n                    <span class=\"token keyword\">if</span> self<span class=\"token punctuation\">.</span>delimit <span class=\"token keyword\">not</span> <span class=\"token keyword\">in</span> level<span class=\"token punctuation\">[</span>char<span class=\"token punctuation\">]</span><span class=\"token punctuation\">:</span>\n                        level <span class=\"token 
operator\">=</span> level<span class=\"token punctuation\">[</span>char<span class=\"token punctuation\">]</span>\n                    <span class=\"token keyword\">else</span><span class=\"token punctuation\">:</span>\n                        ret<span class=\"token punctuation\">.</span>append<span class=\"token punctuation\">(</span>repl <span class=\"token operator\">*</span> step_ins<span class=\"token punctuation\">)</span>\n                        start <span class=\"token operator\">+=</span> step_ins <span class=\"token operator\">-</span> <span class=\"token number\">1</span>\n                        <span class=\"token keyword\">break</span>\n                <span class=\"token keyword\">else</span><span class=\"token punctuation\">:</span>\n                    ret<span class=\"token punctuation\">.</span>append<span class=\"token punctuation\">(</span>message<span class=\"token punctuation\">[</span>start<span class=\"token punctuation\">]</span><span class=\"token punctuation\">)</span>\n                    <span class=\"token keyword\">break</span>\n            <span class=\"token keyword\">else</span><span class=\"token punctuation\">:</span>\n                ret<span class=\"token punctuation\">.</span>append<span class=\"token punctuation\">(</span>message<span class=\"token punctuation\">[</span>start<span class=\"token punctuation\">]</span><span class=\"token punctuation\">)</span>\n            start <span class=\"token operator\">+=</span> <span class=\"token number\">1</span>\n\n        <span class=\"token keyword\">return</span> <span class=\"token string\">''</span><span class=\"token punctuation\">.</span>join<span class=\"token punctuation\">(</span>ret<span class=\"token punctuation\">)</span>\n\n\n\ngfw <span class=\"token operator\">=</span> DFAFilter<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\ngfw<span class=\"token punctuation\">.</span>parse<span class=\"token punctuation\">(</span> <span 
class=\"token string\">\"./sensitive_words\"</span><span class=\"token punctuation\">)</span>\ncontent <span class=\"token operator\">=</span> <span class=\"token string\">\"这是一个关键词替换的例子，这里涉及到了关键词1还有关键词2，最后还会有关键词3。\"</span> <span class=\"token operator\">*</span> <span class=\"token number\">1000</span>\nstartTime <span class=\"token operator\">=</span> time<span class=\"token punctuation\">.</span>time<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\nresult <span class=\"token operator\">=</span> gfw<span class=\"token punctuation\">.</span><span class=\"token builtin\">filter</span><span class=\"token punctuation\">(</span>content<span class=\"token punctuation\">)</span>\n<span class=\"token keyword\">print</span><span class=\"token punctuation\">(</span>time<span class=\"token punctuation\">.</span>time<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span><span class=\"token operator\">-</span>startTime<span class=\"token punctuation\">)</span></code></pre></div>\n<p>The word list file used here is generated as follows:</p>\n<div\n              class=\"gatsby-code-button-container\"\n              data-toaster-id=\"47979518613455360000\"\n              data-toaster-class=\"gatsby-code-button-toaster\"\n              data-toaster-text-class=\"gatsby-code-button-toaster-text\"\n              data-toaster-text=\"代码复制成功\"\n              data-toaster-duration=\"3500\"\n              onClick=\"copyToClipboard(`with open(&quot;./sensitive_words&quot;, 'w') as f:\n    f.write(&quot;\\n&quot;.join( [ &quot;关键词&quot; + str(i) for i in range(0,10000)]))`, `47979518613455360000`)\"\n            >\n              <div\n                class=\"gatsby-code-button\"\n                data-tooltip=\"\"\n              >\n                复制代码<svg class=\"gatsby-code-button-icon\" xmlns=\"http://www.w3.org/2000/svg\" width=\"24\" height=\"24\" viewBox=\"0 0 24 24\"><path fill=\"none\" d=\"M0 0h24v24H0V0z\"/><path d=\"M16 1H2v16h2V3h12V1zm-1 4l6 6v12H6V5h9zm-1 
7h5.5L14 6.5V12z\"/></svg>\n              </div>\n            </div>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre class=\"language-python\"><code class=\"language-python\"><span class=\"token keyword\">with</span> <span class=\"token builtin\">open</span><span class=\"token punctuation\">(</span><span class=\"token string\">\"./sensitive_words\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">'w'</span><span class=\"token punctuation\">)</span> <span class=\"token keyword\">as</span> f<span class=\"token punctuation\">:</span>\n    f<span class=\"token punctuation\">.</span>write<span class=\"token punctuation\">(</span><span class=\"token string\">\"\\n\"</span><span class=\"token punctuation\">.</span>join<span class=\"token punctuation\">(</span> <span class=\"token punctuation\">[</span> <span class=\"token string\">\"关键词\"</span> <span class=\"token operator\">+</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">(</span>i<span class=\"token punctuation\">)</span> <span class=\"token keyword\">for</span> i <span class=\"token keyword\">in</span> <span class=\"token builtin\">range</span><span class=\"token punctuation\">(</span><span class=\"token number\">0</span><span class=\"token punctuation\">,</span><span class=\"token number\">10000</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span></code></pre></div>\n<p>Output of the DFA benchmark:</p>\n<div\n              class=\"gatsby-code-button-container\"\n              data-toaster-id=\"67505138662904530000\"\n              data-toaster-class=\"gatsby-code-button-toaster\"\n              data-toaster-text-class=\"gatsby-code-button-toaster-text\"\n              data-toaster-text=\"代码复制成功\"\n              data-toaster-duration=\"3500\"\n              onClick=\"copyToClipboard(`0.06450581550598145`, `67505138662904530000`)\"\n            
>\n              <div\n                class=\"gatsby-code-button\"\n                data-tooltip=\"\"\n              >\n                复制代码<svg class=\"gatsby-code-button-icon\" xmlns=\"http://www.w3.org/2000/svg\" width=\"24\" height=\"24\" viewBox=\"0 0 24 24\"><path fill=\"none\" d=\"M0 0h24v24H0V0z\"/><path d=\"M16 1H2v16h2V3h12V1zm-1 4l6 6v12H6V5h9zm-1 7h5.5L14 6.5V12z\"/></svg>\n              </div>\n            </div>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">0.06450581550598145</code></pre></div>\n<p>As the result shows, performance has improved further.</p>\n<h3 id=\"ac-自动机过滤敏感词算法\"><a href=\"#ac-%E8%87%AA%E5%8A%A8%E6%9C%BA%E8%BF%87%E6%BB%A4%E6%95%8F%E6%84%9F%E8%AF%8D%E7%AE%97%E6%B3%95\" aria-label=\"ac 自动机过滤敏感词算法 permalink\" class=\"anchor\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>AC Automaton Sensitive-Word Filtering Algorithm</h3>\n<p>What is an AC (Aho-Corasick) automaton? Put simply, it is a trie (dictionary tree) combined with the idea behind the KMP algorithm, realized as failure pointers. A classic example: given n words and an article of m characters, find how many of the words appear in the article.</p>\n<p>Implementation (a simplified version: the <code class=\"language-text\">fail</code> pointers are never populated here, so on a mismatch the search restarts from the root):</p>\n<div\n              class=\"gatsby-code-button-container\"\n              data-toaster-id=\"98269764050986660000\"\n              data-toaster-class=\"gatsby-code-button-toaster\"\n              data-toaster-text-class=\"gatsby-code-button-toaster-text\"\n              data-toaster-text=\"代码复制成功\"\n              data-toaster-duration=\"3500\"\n              onClick=\"copyToClipboard(`import time\nclass Node(object):\n    def __init__(self):\n        self.next = {}\n        self.fail = None\n        self.isWord = False\n        self.word = 
&quot;&quot;\n\n\nclass AcAutomation(object):\n\n    def __init__(self):\n        self.root = Node()\n\n    # 查找敏感词函数\n    def search(self, content):\n        p = self.root\n        result = []\n        currentposition = 0\n\n        while currentposition < len(content):\n            word = content[currentposition]\n            while word in p.next == False and p != self.root:\n                p = p.fail\n\n            if word in p.next:\n                p = p.next[word]\n            else:\n                p = self.root\n\n            if p.isWord:\n                result.append(p.word)\n                p = self.root\n            currentposition += 1\n        return result\n\n    # 加载敏感词库函数\n    def parse(self, path):\n        with open(path, encoding='utf-8') as f:\n            for keyword in f:\n                temp_root = self.root\n                for char in str(keyword).strip():\n                    if char not in temp_root.next:\n                        temp_root.next[char] = Node()\n                    temp_root = temp_root.next[char]\n                temp_root.isWord = True\n                temp_root.word = str(keyword).strip()\n\n    # 敏感词替换函数\n    def wordsFilter(self, text):\n        &quot;&quot;&quot;\n        :param ah: AC自动机\n        :param text: 文本\n        :return: 过滤敏感词之后的文本\n        &quot;&quot;&quot;\n        result = list(set(self.search(text)))\n        for x in result:\n            m = text.replace(x, '*' * len(x))\n            text = m\n        return text\n\n\nacAutomation = AcAutomation()\nacAutomation.parse('./sensitive_words')\nstartTime = time.time()\nprint(acAutomation.wordsFilter(&quot;这是一个关键词替换的例子，这里涉及到了关键词1还有关键词2，最后还会有关键词3。&quot;*1000))\nprint(time.time()-startTime)`, `98269764050986660000`)\"\n            >\n              <div\n                class=\"gatsby-code-button\"\n                data-tooltip=\"\"\n              >\n                复制代码<svg class=\"gatsby-code-button-icon\" xmlns=\"http://www.w3.org/2000/svg\" width=\"24\" 
height=\"24\" viewBox=\"0 0 24 24\"><path fill=\"none\" d=\"M0 0h24v24H0V0z\"/><path d=\"M16 1H2v16h2V3h12V1zm-1 4l6 6v12H6V5h9zm-1 7h5.5L14 6.5V12z\"/></svg>\n              </div>\n            </div>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre class=\"language-python\"><code class=\"language-python\"><span class=\"token keyword\">import</span> time\n<span class=\"token keyword\">class</span> <span class=\"token class-name\">Node</span><span class=\"token punctuation\">(</span><span class=\"token builtin\">object</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n    <span class=\"token keyword\">def</span> <span class=\"token function\">__init__</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n        self<span class=\"token punctuation\">.</span><span class=\"token builtin\">next</span> <span class=\"token operator\">=</span> <span class=\"token punctuation\">{</span><span class=\"token punctuation\">}</span>\n        self<span class=\"token punctuation\">.</span>fail <span class=\"token operator\">=</span> <span class=\"token boolean\">None</span>\n        self<span class=\"token punctuation\">.</span>isWord <span class=\"token operator\">=</span> <span class=\"token boolean\">False</span>\n        self<span class=\"token punctuation\">.</span>word <span class=\"token operator\">=</span> <span class=\"token string\">\"\"</span>\n\n\n<span class=\"token keyword\">class</span> <span class=\"token class-name\">AcAutomation</span><span class=\"token punctuation\">(</span><span class=\"token builtin\">object</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n\n    <span class=\"token keyword\">def</span> <span class=\"token function\">__init__</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n  
      self<span class=\"token punctuation\">.</span>root <span class=\"token operator\">=</span> Node<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n\n    <span class=\"token comment\"># 查找敏感词函数</span>\n    <span class=\"token keyword\">def</span> <span class=\"token function\">search</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">,</span> content<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n        p <span class=\"token operator\">=</span> self<span class=\"token punctuation\">.</span>root\n        result <span class=\"token operator\">=</span> <span class=\"token punctuation\">[</span><span class=\"token punctuation\">]</span>\n        currentposition <span class=\"token operator\">=</span> <span class=\"token number\">0</span>\n\n        <span class=\"token keyword\">while</span> currentposition <span class=\"token operator\">&lt;</span> <span class=\"token builtin\">len</span><span class=\"token punctuation\">(</span>content<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n            word <span class=\"token operator\">=</span> content<span class=\"token punctuation\">[</span>currentposition<span class=\"token punctuation\">]</span>\n            <span class=\"token keyword\">while</span> word <span class=\"token keyword\">in</span> p<span class=\"token punctuation\">.</span><span class=\"token builtin\">next</span> <span class=\"token operator\">==</span> <span class=\"token boolean\">False</span> <span class=\"token keyword\">and</span> p <span class=\"token operator\">!=</span> self<span class=\"token punctuation\">.</span>root<span class=\"token punctuation\">:</span>\n                p <span class=\"token operator\">=</span> p<span class=\"token punctuation\">.</span>fail\n\n            <span class=\"token keyword\">if</span> word <span class=\"token keyword\">in</span> p<span class=\"token 
punctuation\">.</span><span class=\"token builtin\">next</span><span class=\"token punctuation\">:</span>\n                p <span class=\"token operator\">=</span> p<span class=\"token punctuation\">.</span><span class=\"token builtin\">next</span><span class=\"token punctuation\">[</span>word<span class=\"token punctuation\">]</span>\n            <span class=\"token keyword\">else</span><span class=\"token punctuation\">:</span>\n                p <span class=\"token operator\">=</span> self<span class=\"token punctuation\">.</span>root\n\n            <span class=\"token keyword\">if</span> p<span class=\"token punctuation\">.</span>isWord<span class=\"token punctuation\">:</span>\n                result<span class=\"token punctuation\">.</span>append<span class=\"token punctuation\">(</span>p<span class=\"token punctuation\">.</span>word<span class=\"token punctuation\">)</span>\n                p <span class=\"token operator\">=</span> self<span class=\"token punctuation\">.</span>root\n            currentposition <span class=\"token operator\">+=</span> <span class=\"token number\">1</span>\n        <span class=\"token keyword\">return</span> result\n\n    <span class=\"token comment\"># 加载敏感词库函数</span>\n    <span class=\"token keyword\">def</span> <span class=\"token function\">parse</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">,</span> path<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n        <span class=\"token keyword\">with</span> <span class=\"token builtin\">open</span><span class=\"token punctuation\">(</span>path<span class=\"token punctuation\">,</span> encoding<span class=\"token operator\">=</span><span class=\"token string\">'utf-8'</span><span class=\"token punctuation\">)</span> <span class=\"token keyword\">as</span> f<span class=\"token punctuation\">:</span>\n            <span class=\"token keyword\">for</span> keyword <span class=\"token keyword\">in</span> 
f<span class=\"token punctuation\">:</span>\n                temp_root <span class=\"token operator\">=</span> self<span class=\"token punctuation\">.</span>root\n                <span class=\"token keyword\">for</span> char <span class=\"token keyword\">in</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">(</span>keyword<span class=\"token punctuation\">)</span><span class=\"token punctuation\">.</span>strip<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n                    <span class=\"token keyword\">if</span> char <span class=\"token keyword\">not</span> <span class=\"token keyword\">in</span> temp_root<span class=\"token punctuation\">.</span><span class=\"token builtin\">next</span><span class=\"token punctuation\">:</span>\n                        temp_root<span class=\"token punctuation\">.</span><span class=\"token builtin\">next</span><span class=\"token punctuation\">[</span>char<span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> Node<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n                    temp_root <span class=\"token operator\">=</span> temp_root<span class=\"token punctuation\">.</span><span class=\"token builtin\">next</span><span class=\"token punctuation\">[</span>char<span class=\"token punctuation\">]</span>\n                temp_root<span class=\"token punctuation\">.</span>isWord <span class=\"token operator\">=</span> <span class=\"token boolean\">True</span>\n                temp_root<span class=\"token punctuation\">.</span>word <span class=\"token operator\">=</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">(</span>keyword<span class=\"token punctuation\">)</span><span class=\"token punctuation\">.</span>strip<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n\n    <span class=\"token 
comment\"># 敏感词替换函数</span>\n    <span class=\"token keyword\">def</span> <span class=\"token function\">wordsFilter</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">,</span> text<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n        <span class=\"token triple-quoted-string string\">\"\"\"\n        :param ah: AC自动机\n        :param text: 文本\n        :return: 过滤敏感词之后的文本\n        \"\"\"</span>\n        result <span class=\"token operator\">=</span> <span class=\"token builtin\">list</span><span class=\"token punctuation\">(</span><span class=\"token builtin\">set</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">.</span>search<span class=\"token punctuation\">(</span>text<span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span>\n        <span class=\"token keyword\">for</span> x <span class=\"token keyword\">in</span> result<span class=\"token punctuation\">:</span>\n            m <span class=\"token operator\">=</span> text<span class=\"token punctuation\">.</span>replace<span class=\"token punctuation\">(</span>x<span class=\"token punctuation\">,</span> <span class=\"token string\">'*'</span> <span class=\"token operator\">*</span> <span class=\"token builtin\">len</span><span class=\"token punctuation\">(</span>x<span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span>\n            text <span class=\"token operator\">=</span> m\n        <span class=\"token keyword\">return</span> text\n\n\nacAutomation <span class=\"token operator\">=</span> AcAutomation<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\nacAutomation<span class=\"token punctuation\">.</span>parse<span class=\"token punctuation\">(</span><span class=\"token string\">'./sensitive_words'</span><span class=\"token punctuation\">)</span>\nstartTime <span 
class=\"token operator\">=</span> time<span class=\"token punctuation\">.</span>time<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n<span class=\"token keyword\">print</span><span class=\"token punctuation\">(</span>acAutomation<span class=\"token punctuation\">.</span>wordsFilter<span class=\"token punctuation\">(</span><span class=\"token string\">\"这是一个关键词替换的例子，这里涉及到了关键词1还有关键词2，最后还会有关键词3。\"</span><span class=\"token operator\">*</span><span class=\"token number\">1000</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span>\n<span class=\"token keyword\">print</span><span class=\"token punctuation\">(</span>time<span class=\"token punctuation\">.</span>time<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span><span class=\"token operator\">-</span>startTime<span class=\"token punctuation\">)</span></code></pre></div>\n<p>The word list is the same as before:</p>\n<div\n              class=\"gatsby-code-button-container\"\n              data-toaster-id=\"91422863049793830000\"\n              data-toaster-class=\"gatsby-code-button-toaster\"\n              data-toaster-text-class=\"gatsby-code-button-toaster-text\"\n              data-toaster-text=\"代码复制成功\"\n              data-toaster-duration=\"3500\"\n              onClick=\"copyToClipboard(`with open(&quot;./sensitive_words&quot;, 'w') as f:\n    f.write(&quot;\\n&quot;.join( [ &quot;关键词&quot; + str(i) for i in range(0,10000)]))`, `91422863049793830000`)\"\n            >\n              <div\n                class=\"gatsby-code-button\"\n                data-tooltip=\"\"\n              >\n                复制代码<svg class=\"gatsby-code-button-icon\" xmlns=\"http://www.w3.org/2000/svg\" width=\"24\" height=\"24\" viewBox=\"0 0 24 24\"><path fill=\"none\" d=\"M0 0h24v24H0V0z\"/><path d=\"M16 1H2v16h2V3h12V1zm-1 4l6 6v12H6V5h9zm-1 7h5.5L14 6.5V12z\"/></svg>\n              </div>\n            </div>\n<div class=\"gatsby-highlight\" 
data-language=\"python\"><pre class=\"language-python\"><code class=\"language-python\"><span class=\"token keyword\">with</span> <span class=\"token builtin\">open</span><span class=\"token punctuation\">(</span><span class=\"token string\">\"./sensitive_words\"</span><span class=\"token punctuation\">,</span> <span class=\"token string\">'w'</span><span class=\"token punctuation\">)</span> <span class=\"token keyword\">as</span> f<span class=\"token punctuation\">:</span>\n    f<span class=\"token punctuation\">.</span>write<span class=\"token punctuation\">(</span><span class=\"token string\">\"\\n\"</span><span class=\"token punctuation\">.</span>join<span class=\"token punctuation\">(</span> <span class=\"token punctuation\">[</span> <span class=\"token string\">\"关键词\"</span> <span class=\"token operator\">+</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">(</span>i<span class=\"token punctuation\">)</span> <span class=\"token keyword\">for</span> i <span class=\"token keyword\">in</span> <span class=\"token builtin\">range</span><span class=\"token punctuation\">(</span><span class=\"token number\">0</span><span class=\"token punctuation\">,</span><span class=\"token number\">10000</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span></code></pre></div>\n<p>With the method above, the measured time is <code class=\"language-text\">0.017391204833984375</code> seconds.</p>\n<h3 id=\"敏感词过滤方法小结\"><a href=\"#%E6%95%8F%E6%84%9F%E8%AF%8D%E8%BF%87%E6%BB%A4%E6%96%B9%E6%B3%95%E5%B0%8F%E7%BB%93\" aria-label=\"敏感词过滤方法小结 permalink\" class=\"anchor\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 
9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Summary of Sensitive-Word Filtering Methods</h3>\n<p>In the tests above, the AC automaton listing was faster than the preceding method (about 0.017 s versus about 0.065 s). In practice, however, DFA filtering and AC automaton filtering each have their own suitable scenarios, so the choice should follow your specific business needs.</p>\n<h2 id=\"实现敏感词过滤-api\"><a href=\"#%E5%AE%9E%E7%8E%B0%E6%95%8F%E6%84%9F%E8%AF%8D%E8%BF%87%E6%BB%A4-api\" aria-label=\"实现敏感词过滤 api permalink\" class=\"anchor\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Implementing a Sensitive-Word Filtering API</h2>\n<p>To expose sensitive-word filtering as an API, we deploy the code on a Serverless architecture, combining an API gateway with function compute (cloud functions). Taking the AC automaton filter as the example, we only need to add a few lines of code:</p>\n<div\n              class=\"gatsby-code-button-container\"\n              data-toaster-id=\"72664609928237620000\"\n              data-toaster-class=\"gatsby-code-button-toaster\"\n              data-toaster-text-class=\"gatsby-code-button-toaster-text\"\n              data-toaster-text=\"代码复制成功\"\n              data-toaster-duration=\"3500\"\n              onClick=\"copyToClipboard(`# -*- coding:utf-8 -*-\n\nimport json, uuid\n\n\nclass Node(object):\n    def __init__(self):\n        self.next = {}\n        self.fail = None\n        self.isWord = False\n        self.word = &quot;&quot;\n\n\nclass AcAutomation(object):\n\n    def __init__(self):\n        self.root = Node()\n\n    # 查找敏感词函数\n    def search(self, content):\n        p = self.root\n        result = []\n        currentposition = 0\n\n        while currentposition < len(content):\n            word = content[currentposition]\n            while word in 
p.next == False and p != self.root:\n                p = p.fail\n\n            if word in p.next:\n                p = p.next[word]\n            else:\n                p = self.root\n\n            if p.isWord:\n                result.append(p.word)\n                p = self.root\n            currentposition += 1\n        return result\n\n    # 加载敏感词库函数\n    def parse(self, path):\n        with open(path, encoding='utf-8') as f:\n            for keyword in f:\n                temp_root = self.root\n                for char in str(keyword).strip():\n                    if char not in temp_root.next:\n                        temp_root.next[char] = Node()\n                    temp_root = temp_root.next[char]\n                temp_root.isWord = True\n                temp_root.word = str(keyword).strip()\n\n    # 敏感词替换函数\n    def wordsFilter(self, text):\n        &quot;&quot;&quot;\n        :param ah: AC自动机\n        :param text: 文本\n        :return: 过滤敏感词之后的文本\n        &quot;&quot;&quot;\n        result = list(set(self.search(text)))\n        for x in result:\n            m = text.replace(x, '*' * len(x))\n            text = m\n        return text\n\n\ndef response(msg, error=False):\n    return_data = {\n        &quot;uuid&quot;: str(uuid.uuid1()),\n        &quot;error&quot;: error,\n        &quot;message&quot;: msg\n    }\n    print(return_data)\n    return return_data\n\n\nacAutomation = AcAutomation()\npath = './sensitive_words'\nacAutomation.parse(path)\n\n\ndef main_handler(event, context):\n    try:\n        sourceContent = json.loads(event[&quot;body&quot;])[&quot;content&quot;]\n        return response({\n            &quot;sourceContent&quot;: sourceContent,\n            &quot;filtedContent&quot;: acAutomation.wordsFilter(sourceContent)\n        })\n    except Exception as e:\n        return response(str(e), True)`, `72664609928237620000`)\"\n            >\n              <div\n                class=\"gatsby-code-button\"\n                data-tooltip=\"\"\n      
        >\n                复制代码<svg class=\"gatsby-code-button-icon\" xmlns=\"http://www.w3.org/2000/svg\" width=\"24\" height=\"24\" viewBox=\"0 0 24 24\"><path fill=\"none\" d=\"M0 0h24v24H0V0z\"/><path d=\"M16 1H2v16h2V3h12V1zm-1 4l6 6v12H6V5h9zm-1 7h5.5L14 6.5V12z\"/></svg>\n              </div>\n            </div>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre class=\"language-python\"><code class=\"language-python\"><span class=\"token comment\"># -*- coding:utf-8 -*-</span>\n\n<span class=\"token keyword\">import</span> json<span class=\"token punctuation\">,</span> uuid\n\n\n<span class=\"token keyword\">class</span> <span class=\"token class-name\">Node</span><span class=\"token punctuation\">(</span><span class=\"token builtin\">object</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n    <span class=\"token keyword\">def</span> <span class=\"token function\">__init__</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n        self<span class=\"token punctuation\">.</span><span class=\"token builtin\">next</span> <span class=\"token operator\">=</span> <span class=\"token punctuation\">{</span><span class=\"token punctuation\">}</span>\n        self<span class=\"token punctuation\">.</span>fail <span class=\"token operator\">=</span> <span class=\"token boolean\">None</span>\n        self<span class=\"token punctuation\">.</span>isWord <span class=\"token operator\">=</span> <span class=\"token boolean\">False</span>\n        self<span class=\"token punctuation\">.</span>word <span class=\"token operator\">=</span> <span class=\"token string\">\"\"</span>\n\n\n<span class=\"token keyword\">class</span> <span class=\"token class-name\">AcAutomation</span><span class=\"token punctuation\">(</span><span class=\"token builtin\">object</span><span class=\"token punctuation\">)</span><span class=\"token 
punctuation\">:</span>\n\n    <span class=\"token keyword\">def</span> <span class=\"token function\">__init__</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n        self<span class=\"token punctuation\">.</span>root <span class=\"token operator\">=</span> Node<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n\n    <span class=\"token comment\"># 查找敏感词函数</span>\n    <span class=\"token keyword\">def</span> <span class=\"token function\">search</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">,</span> content<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n        p <span class=\"token operator\">=</span> self<span class=\"token punctuation\">.</span>root\n        result <span class=\"token operator\">=</span> <span class=\"token punctuation\">[</span><span class=\"token punctuation\">]</span>\n        currentposition <span class=\"token operator\">=</span> <span class=\"token number\">0</span>\n\n        <span class=\"token keyword\">while</span> currentposition <span class=\"token operator\">&lt;</span> <span class=\"token builtin\">len</span><span class=\"token punctuation\">(</span>content<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n            word <span class=\"token operator\">=</span> content<span class=\"token punctuation\">[</span>currentposition<span class=\"token punctuation\">]</span>\n            <span class=\"token keyword\">while</span> word <span class=\"token keyword\">in</span> p<span class=\"token punctuation\">.</span><span class=\"token builtin\">next</span> <span class=\"token operator\">==</span> <span class=\"token boolean\">False</span> <span class=\"token keyword\">and</span> p <span class=\"token operator\">!=</span> self<span class=\"token punctuation\">.</span>root<span class=\"token 
punctuation\">:</span>\n                p <span class=\"token operator\">=</span> p<span class=\"token punctuation\">.</span>fail\n\n            <span class=\"token keyword\">if</span> word <span class=\"token keyword\">in</span> p<span class=\"token punctuation\">.</span><span class=\"token builtin\">next</span><span class=\"token punctuation\">:</span>\n                p <span class=\"token operator\">=</span> p<span class=\"token punctuation\">.</span><span class=\"token builtin\">next</span><span class=\"token punctuation\">[</span>word<span class=\"token punctuation\">]</span>\n            <span class=\"token keyword\">else</span><span class=\"token punctuation\">:</span>\n                p <span class=\"token operator\">=</span> self<span class=\"token punctuation\">.</span>root\n\n            <span class=\"token keyword\">if</span> p<span class=\"token punctuation\">.</span>isWord<span class=\"token punctuation\">:</span>\n                result<span class=\"token punctuation\">.</span>append<span class=\"token punctuation\">(</span>p<span class=\"token punctuation\">.</span>word<span class=\"token punctuation\">)</span>\n                p <span class=\"token operator\">=</span> self<span class=\"token punctuation\">.</span>root\n            currentposition <span class=\"token operator\">+=</span> <span class=\"token number\">1</span>\n        <span class=\"token keyword\">return</span> result\n\n    <span class=\"token comment\"># 加载敏感词库函数</span>\n    <span class=\"token keyword\">def</span> <span class=\"token function\">parse</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">,</span> path<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n        <span class=\"token keyword\">with</span> <span class=\"token builtin\">open</span><span class=\"token punctuation\">(</span>path<span class=\"token punctuation\">,</span> encoding<span class=\"token operator\">=</span><span class=\"token 
string\">'utf-8'</span><span class=\"token punctuation\">)</span> <span class=\"token keyword\">as</span> f<span class=\"token punctuation\">:</span>\n            <span class=\"token keyword\">for</span> keyword <span class=\"token keyword\">in</span> f<span class=\"token punctuation\">:</span>\n                temp_root <span class=\"token operator\">=</span> self<span class=\"token punctuation\">.</span>root\n                <span class=\"token keyword\">for</span> char <span class=\"token keyword\">in</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">(</span>keyword<span class=\"token punctuation\">)</span><span class=\"token punctuation\">.</span>strip<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n                    <span class=\"token keyword\">if</span> char <span class=\"token keyword\">not</span> <span class=\"token keyword\">in</span> temp_root<span class=\"token punctuation\">.</span><span class=\"token builtin\">next</span><span class=\"token punctuation\">:</span>\n                        temp_root<span class=\"token punctuation\">.</span><span class=\"token builtin\">next</span><span class=\"token punctuation\">[</span>char<span class=\"token punctuation\">]</span> <span class=\"token operator\">=</span> Node<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n                    temp_root <span class=\"token operator\">=</span> temp_root<span class=\"token punctuation\">.</span><span class=\"token builtin\">next</span><span class=\"token punctuation\">[</span>char<span class=\"token punctuation\">]</span>\n                temp_root<span class=\"token punctuation\">.</span>isWord <span class=\"token operator\">=</span> <span class=\"token boolean\">True</span>\n                temp_root<span class=\"token punctuation\">.</span>word <span class=\"token operator\">=</span> <span class=\"token 
builtin\">str</span><span class=\"token punctuation\">(</span>keyword<span class=\"token punctuation\">)</span><span class=\"token punctuation\">.</span>strip<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\n\n    <span class=\"token comment\"># 敏感词替换函数</span>\n    <span class=\"token keyword\">def</span> <span class=\"token function\">wordsFilter</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">,</span> text<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n        <span class=\"token triple-quoted-string string\">\"\"\"\n        :param ah: AC自动机\n        :param text: 文本\n        :return: 过滤敏感词之后的文本\n        \"\"\"</span>\n        result <span class=\"token operator\">=</span> <span class=\"token builtin\">list</span><span class=\"token punctuation\">(</span><span class=\"token builtin\">set</span><span class=\"token punctuation\">(</span>self<span class=\"token punctuation\">.</span>search<span class=\"token punctuation\">(</span>text<span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span>\n        <span class=\"token keyword\">for</span> x <span class=\"token keyword\">in</span> result<span class=\"token punctuation\">:</span>\n            m <span class=\"token operator\">=</span> text<span class=\"token punctuation\">.</span>replace<span class=\"token punctuation\">(</span>x<span class=\"token punctuation\">,</span> <span class=\"token string\">'*'</span> <span class=\"token operator\">*</span> <span class=\"token builtin\">len</span><span class=\"token punctuation\">(</span>x<span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span>\n            text <span class=\"token operator\">=</span> m\n        <span class=\"token keyword\">return</span> text\n\n\n<span class=\"token keyword\">def</span> <span class=\"token function\">response</span><span class=\"token 
punctuation\">(</span>msg<span class=\"token punctuation\">,</span> error<span class=\"token operator\">=</span><span class=\"token boolean\">False</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n    return_data <span class=\"token operator\">=</span> <span class=\"token punctuation\">{</span>\n        <span class=\"token string\">\"uuid\"</span><span class=\"token punctuation\">:</span> <span class=\"token builtin\">str</span><span class=\"token punctuation\">(</span>uuid<span class=\"token punctuation\">.</span>uuid1<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span>\n        <span class=\"token string\">\"error\"</span><span class=\"token punctuation\">:</span> error<span class=\"token punctuation\">,</span>\n        <span class=\"token string\">\"message\"</span><span class=\"token punctuation\">:</span> msg\n    <span class=\"token punctuation\">}</span>\n    <span class=\"token keyword\">print</span><span class=\"token punctuation\">(</span>return_data<span class=\"token punctuation\">)</span>\n    <span class=\"token keyword\">return</span> return_data\n\n\nacAutomation <span class=\"token operator\">=</span> AcAutomation<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span>\npath <span class=\"token operator\">=</span> <span class=\"token string\">'./sensitive_words'</span>\nacAutomation<span class=\"token punctuation\">.</span>parse<span class=\"token punctuation\">(</span>path<span class=\"token punctuation\">)</span>\n\n\n<span class=\"token keyword\">def</span> <span class=\"token function\">main_handler</span><span class=\"token punctuation\">(</span>event<span class=\"token punctuation\">,</span> context<span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n    <span class=\"token keyword\">try</span><span class=\"token 
punctuation\">:</span>\n        sourceContent <span class=\"token operator\">=</span> json<span class=\"token punctuation\">.</span>loads<span class=\"token punctuation\">(</span>event<span class=\"token punctuation\">[</span><span class=\"token string\">\"body\"</span><span class=\"token punctuation\">]</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">[</span><span class=\"token string\">\"content\"</span><span class=\"token punctuation\">]</span>\n        <span class=\"token keyword\">return</span> response<span class=\"token punctuation\">(</span><span class=\"token punctuation\">{</span>\n            <span class=\"token string\">\"sourceContent\"</span><span class=\"token punctuation\">:</span> sourceContent<span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"filtedContent\"</span><span class=\"token punctuation\">:</span> acAutomation<span class=\"token punctuation\">.</span>wordsFilter<span class=\"token punctuation\">(</span>sourceContent<span class=\"token punctuation\">)</span>\n        <span class=\"token punctuation\">}</span><span class=\"token punctuation\">)</span>\n    <span class=\"token keyword\">except</span> Exception <span class=\"token keyword\">as</span> e<span class=\"token punctuation\">:</span>\n        <span class=\"token keyword\">return</span> response<span class=\"token punctuation\">(</span><span class=\"token builtin\">str</span><span class=\"token punctuation\">(</span>e<span class=\"token punctuation\">)</span><span class=\"token punctuation\">,</span> <span class=\"token boolean\">True</span><span class=\"token punctuation\">)</span></code></pre></div>\n<p>Finally, to make local testing easier, we can add the following code:</p>\n<div class=\"gatsby-highlight\" data-language=\"python\"><pre class=\"language-python\"><code class=\"language-python\"><span class=\"token keyword\">def</span> <span class=\"token function\">test</span><span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">:</span>\n    event <span class=\"token operator\">=</span> <span class=\"token punctuation\">{</span>\n        <span class=\"token string\">\"requestContext\"</span><span class=\"token punctuation\">:</span> <span class=\"token punctuation\">{</span>\n            <span class=\"token string\">\"serviceId\"</span><span class=\"token punctuation\">:</span> <span class=\"token string\">\"service-f94sy04v\"</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"path\"</span><span class=\"token punctuation\">:</span> <span class=\"token string\">\"/test/{path}\"</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"httpMethod\"</span><span class=\"token punctuation\">:</span> <span class=\"token string\">\"POST\"</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"requestId\"</span><span class=\"token punctuation\">:</span> <span class=\"token string\">\"c6af9ac6-7b61-11e6-9a41-93e8deadbeef\"</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"identity\"</span><span class=\"token punctuation\">:</span> <span class=\"token punctuation\">{</span>\n                <span class=\"token 
string\">\"secretId\"</span><span class=\"token punctuation\">:</span> <span class=\"token string\">\"abdcdxxxxxxxsdfs\"</span>\n            <span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"sourceIp\"</span><span class=\"token punctuation\">:</span> <span class=\"token string\">\"14.17.22.34\"</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"stage\"</span><span class=\"token punctuation\">:</span> <span class=\"token string\">\"release\"</span>\n        <span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n        <span class=\"token string\">\"headers\"</span><span class=\"token punctuation\">:</span> <span class=\"token punctuation\">{</span>\n            <span class=\"token string\">\"Accept-Language\"</span><span class=\"token punctuation\">:</span> <span class=\"token string\">\"en-US,en,cn\"</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"Accept\"</span><span class=\"token punctuation\">:</span> <span class=\"token string\">\"text/html,application/xml,application/json\"</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"Host\"</span><span class=\"token punctuation\">:</span> <span class=\"token string\">\"service-3ei3tii4-251000691.ap-guangzhou.apigateway.myqloud.com\"</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"User-Agent\"</span><span class=\"token punctuation\">:</span> <span class=\"token string\">\"User Agent String\"</span>\n        <span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n        <span class=\"token string\">\"body\"</span><span class=\"token punctuation\">:</span> <span class=\"token string\">\"{\\\"content\\\":\\\"这是一个测试的文本，我也就呵呵了\\\"}\"</span><span class=\"token punctuation\">,</span>\n        <span class=\"token 
string\">\"pathParameters\"</span><span class=\"token punctuation\">:</span> <span class=\"token punctuation\">{</span>\n            <span class=\"token string\">\"path\"</span><span class=\"token punctuation\">:</span> <span class=\"token string\">\"value\"</span>\n        <span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n        <span class=\"token string\">\"queryStringParameters\"</span><span class=\"token punctuation\">:</span> <span class=\"token punctuation\">{</span>\n            <span class=\"token string\">\"foo\"</span><span class=\"token punctuation\">:</span> <span class=\"token string\">\"bar\"</span>\n        <span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n        <span class=\"token string\">\"headerParameters\"</span><span class=\"token punctuation\">:</span> <span class=\"token punctuation\">{</span>\n            <span class=\"token string\">\"Refer\"</span><span class=\"token punctuation\">:</span> <span class=\"token string\">\"10.0.2.14\"</span>\n        <span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n        <span class=\"token string\">\"stageVariables\"</span><span class=\"token punctuation\">:</span> <span class=\"token punctuation\">{</span>\n            <span class=\"token string\">\"stage\"</span><span class=\"token punctuation\">:</span> <span class=\"token string\">\"release\"</span>\n        <span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n        <span class=\"token string\">\"path\"</span><span class=\"token punctuation\">:</span> <span class=\"token string\">\"/test/value\"</span><span class=\"token punctuation\">,</span>\n        <span class=\"token string\">\"queryString\"</span><span class=\"token punctuation\">:</span> <span class=\"token punctuation\">{</span>\n            <span class=\"token string\">\"foo\"</span><span class=\"token punctuation\">:</span> <span class=\"token 
string\">\"bar\"</span><span class=\"token punctuation\">,</span>\n            <span class=\"token string\">\"bob\"</span><span class=\"token punctuation\">:</span> <span class=\"token string\">\"alice\"</span>\n        <span class=\"token punctuation\">}</span><span class=\"token punctuation\">,</span>\n        <span class=\"token string\">\"httpMethod\"</span><span class=\"token punctuation\">:</span> <span class=\"token string\">\"POST\"</span>\n    <span class=\"token punctuation\">}</span>\n    <span class=\"token keyword\">print</span><span class=\"token punctuation\">(</span>main_handler<span class=\"token punctuation\">(</span>event<span class=\"token punctuation\">,</span> <span class=\"token boolean\">None</span><span class=\"token punctuation\">)</span><span class=\"token punctuation\">)</span>\n\n\n<span class=\"token keyword\">if</span> __name__ <span class=\"token operator\">==</span> <span class=\"token string\">\"__main__\"</span><span class=\"token punctuation\">:</span>\n    test<span class=\"token punctuation\">(</span><span class=\"token punctuation\">)</span></code></pre></div>\n<p>Once that is done, we can run a test. For example, with the following dictionary:</p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">呵呵\n测试</code></pre></div>\n<p>Running the code produces:</p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">{&#39;uuid&#39;: &#39;9961ae2a-5cfc-11ea-a7c2-acde48001122&#39;, &#39;error&#39;: False, &#39;message&#39;: {&#39;sourceContent&#39;: &#39;这是一个测试的文本，我也就呵呵了&#39;, &#39;filtedContent&#39;: &#39;这是一个**的文本，我也就**了&#39;}}</code></pre></div>\n<p>Next, we deploy the code to the cloud. Create a new <code class=\"language-text\">serverless.yaml</code>:</p>\n<div class=\"gatsby-highlight\" data-language=\"yaml\"><pre class=\"language-yaml\"><code class=\"language-yaml\"><span class=\"token key atrule\">sensitive_word_filtering</span><span class=\"token punctuation\">:</span>\n  <span class=\"token key atrule\">component</span><span class=\"token punctuation\">:</span> <span class=\"token string\">\"@serverless/tencent-scf\"</span>\n  <span class=\"token key atrule\">inputs</span><span class=\"token punctuation\">:</span>\n    <span class=\"token key atrule\">name</span><span class=\"token 
punctuation\">:</span> sensitive_word_filtering\n    <span class=\"token key atrule\">codeUri</span><span class=\"token punctuation\">:</span> ./\n    <span class=\"token key atrule\">exclude</span><span class=\"token punctuation\">:</span>\n      <span class=\"token punctuation\">-</span> .gitignore\n      <span class=\"token punctuation\">-</span> .git/**\n      <span class=\"token punctuation\">-</span> .serverless\n      <span class=\"token punctuation\">-</span> .env\n    <span class=\"token key atrule\">handler</span><span class=\"token punctuation\">:</span> index.main_handler\n    <span class=\"token key atrule\">runtime</span><span class=\"token punctuation\">:</span> Python3.6\n    <span class=\"token key atrule\">region</span><span class=\"token punctuation\">:</span> ap<span class=\"token punctuation\">-</span>beijing\n    <span class=\"token key atrule\">description</span><span class=\"token punctuation\">:</span> 敏感词过滤\n    <span class=\"token key atrule\">memorySize</span><span class=\"token punctuation\">:</span> <span class=\"token number\">64</span>\n    <span class=\"token key atrule\">timeout</span><span class=\"token punctuation\">:</span> <span class=\"token number\">2</span>\n    <span class=\"token key atrule\">events</span><span class=\"token punctuation\">:</span>\n      <span class=\"token punctuation\">-</span> <span class=\"token key atrule\">apigw</span><span class=\"token punctuation\">:</span>\n          <span class=\"token key atrule\">name</span><span class=\"token punctuation\">:</span> serverless\n          <span class=\"token key atrule\">parameters</span><span class=\"token punctuation\">:</span>\n            <span class=\"token key atrule\">environment</span><span class=\"token punctuation\">:</span> release\n            <span class=\"token key atrule\">endpoints</span><span class=\"token punctuation\">:</span>\n              <span class=\"token punctuation\">-</span> <span class=\"token key atrule\">path</span><span 
class=\"token punctuation\">:</span> /sensitive_word_filtering\n                <span class=\"token key atrule\">description</span><span class=\"token punctuation\">:</span> 敏感词过滤\n                <span class=\"token key atrule\">method</span><span class=\"token punctuation\">:</span> POST\n                <span class=\"token key atrule\">enableCORS</span><span class=\"token punctuation\">:</span> <span class=\"token boolean important\">true</span>\n                <span class=\"token key atrule\">param</span><span class=\"token punctuation\">:</span>\n                  <span class=\"token punctuation\">-</span> <span class=\"token key atrule\">name</span><span class=\"token punctuation\">:</span> content\n                    <span class=\"token key atrule\">position</span><span class=\"token punctuation\">:</span> BODY\n                    <span class=\"token key atrule\">required</span><span class=\"token punctuation\">:</span> <span class=\"token string\">'FALSE'</span>\n                    <span class=\"token key atrule\">type</span><span class=\"token punctuation\">:</span> string\n                    <span class=\"token key atrule\">desc</span><span class=\"token punctuation\">:</span> 待过滤的句子</code></pre></div>\n<p>Then deploy with <code class=\"language-text\">sls --debug</code>. The deployment result:</p>\n<p><img src=\"https://img.serverlesscloud.cn/202058/2-4-2.png\"></p>\n<p>Finally, test the API with Postman:</p>\n<p><img src=\"https://img.serverlesscloud.cn/202058/2-4-3.png\"></p>\n<h2 id=\"总结\"><a href=\"#%E6%80%BB%E7%BB%93\" aria-label=\"总结 permalink\" class=\"anchor\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 
7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>总结</h2>\n<p>Sensitive-word filtering is a common requirement for businesses today. It lets us curb abusive language and rule-violating speech to a certain extent. In practice, two points deserve extra attention:</p>\n<ul>\n<li>Obtaining a sensitive-word list: GitHub hosts many sensitive-word lists covering a wide range of scenarios, which you can search for and download yourself;</li>\n<li>Where to use the API: we can place this API in a community thread-reply system, a comment system, or a blog publishing system to keep sensitive words from appearing and avoid unnecessary trouble.</li>\n</ul>\n<hr>\n<div id='scf-deploy-iframe-or-md'></div>\n<hr>\n<blockquote>\n<p><strong>Links:</strong></p>\n<ul>\n<li>GitHub: <a href=\"https://github.com/serverless/serverless/blob/master/README_CN.md\">github.com/serverless</a></li>\n<li>Official site: <a href=\"https://serverless.com/\">serverless.com</a></li>\n</ul>\n</blockquote>\n<p>You are welcome to visit <a href=\"https://serverlesscloud.cn/\">Serverless 中文网</a>, where the <a href=\"https://serverlesscloud.cn/best-practice\">Best Practices</a> section has more on developing Serverless applications!</p>","tableOfContents":"<ul>\n<li><a href=\"/best-practice/2020-04-20-filter-sensitive-words/#%E5%89%8D%E8%A8%80\">前言</a></li>\n<li>\n<p><a href=\"/best-practice/2020-04-20-filter-sensitive-words/#%E4%BA%86%E8%A7%A3%E6%95%8F%E6%84%9F%E8%BF%87%E6%BB%A4%E7%9A%84%E5%87%A0%E7%A7%8D%E6%96%B9%E6%B3%95\">了解敏感过滤的几种方法</a></p>\n<ul>\n<li><a href=\"/best-practice/2020-04-20-filter-sensitive-words/#replace%E6%96%B9%E6%B3%95\">Replace方法</a></li>\n<li><a href=\"/best-practice/2020-04-20-filter-sensitive-words/#%E6%AD%A3%E5%88%99%E8%A1%A8%E8%BE%BE%E6%96%B9%E6%B3%95\">正则表达方法</a></li>\n<li><a href=\"/best-practice/2020-04-20-filter-sensitive-words/#dfa-%E8%BF%87%E6%BB%A4%E6%95%8F%E6%84%9F%E8%AF%8D\">DFA 过滤敏感词</a></li>\n<li><a href=\"/best-practice/2020-04-20-filter-sensitive-words/#ac-%E8%87%AA%E5%8A%A8%E6%9C%BA%E8%BF%87%E6%BB%A4%E6%95%8F%E6%84%9F%E8%AF%8D%E7%AE%97%E6%B3%95\">AC 自动机过滤敏感词算法</a></li>\n<li><a href=\"/best-practice/2020-04-20-filter-sensitive-words/#%E6%95%8F%E6%84%9F%E8%AF%8D%E8%BF%87%E6%BB%A4%E6%96%B9%E6%B3%95%E5%B0%8F%E7%BB%93\">敏感词过滤方法小结</a></li>\n</ul>\n</li>\n<li><a 
href=\"/best-practice/2020-04-20-filter-sensitive-words/#%E5%AE%9E%E7%8E%B0%E6%95%8F%E6%84%9F%E8%AF%8D%E8%BF%87%E6%BB%A4-api\">实现敏感词过滤 API</a></li>\n<li><a href=\"/best-practice/2020-04-20-filter-sensitive-words/#%E6%80%BB%E7%BB%93\">总结</a></li>\n</ul>"},"previousBlog":{"id":"6e8a71a3-426e-5101-94db-bd5b1ef4d1ec","frontmatter":{"thumbnail":"https://img.serverlesscloud.cn/2020512/1589274422875-code.jpg","authors":["Anycodes"],"categories":["best-practice"],"date":"2020-04-22T00:00:00.000Z","title":"基于 Serverless 架构的编程学习小工具","description":"这是一个基于 Serverless 架构的编程学习 App，它不仅仅可以让你学习编程，还能实现代码的编写与运行","authorslink":["https://zhuanlan.zhihu.com/ServerlessGo"],"translators":null,"translatorslink":null,"tags":["Serverless","程序员"],"keywords":"Serverless 多环境配置,Serverless 管理环境,Serverless配置方案","outdated":null},"wordCount":{"words":281,"sentences":50,"paragraphs":50},"fileAbsolutePath":"/opt/build/repo/content/best-practice/2020-04-22-coding-learing.md","fields":{"slug":"/best-practice/2020-04-22-coding-learing/","keywords":["go","python","serverless","函数计算","云函数","serverlesscloud","Serverless","req","函数","event"]}},"nextBlog":{"id":"99725ecb-905b-5852-a6c7-80e83df8d7f9","frontmatter":{"thumbnail":"https://img.serverlesscloud.cn/2020512/1589274868260-071529vaxztt.jpg","authors":["Anycodes"],"categories":["best-practice"],"date":"2020-04-19T00:00:00.000Z","title":"基于 Serverless Framework 的人工智能小程序开发","description":"本示例将会通过微信小程序，在 Serverless 架构上，实现一款基于人工智能的相册小工具！","authorslink":["https://zhuanlan.zhihu.com/ServerlessGo"],"translators":null,"translatorslink":null,"tags":["小程序","人工智能"],"keywords":"Serverless 多环境配置,Serverless 
管理环境,Serverless配置方案","outdated":true},"wordCount":{"words":541,"sentences":90,"paragraphs":90},"fileAbsolutePath":"/opt/build/repo/content/best-practice/2020-04-19-applets.md","fields":{"slug":"/best-practice/2020-04-19-applets/","keywords":["go","python","serverless","website","云函数","mysql","函数","功能","remark","Components"]}}},"pageContext":{"isCreatedByStatefulCreatePages":false,"blogId":"ebea695d-436a-5cb9-955a-2be16fa5102f","previousBlogId":"6e8a71a3-426e-5101-94db-bd5b1ef4d1ec","nextBlogId":"99725ecb-905b-5852-a6c7-80e83df8d7f9"}}}