概述
记录一下,学习使用前缀树,来实现从一传字符串中找出敏感词,并替换掉。起始就是一个字符串匹配算法,但是要清楚,铭感词可以有很多个,我们从数据库或者文件中加载我们的敏感词,并且加入到树中,组成一个铭感词树,下面就是当我们传递过来一个字符串时,我们使用这个树过滤,得出我们想要的结果
树的结构
假设我们的敏感词为 abc,abb,aac,ad,那我们期望得到一颗如下的树,按照我们字符前缀组成的一颗树,此外我们还需要对该串的结束字符做一个标识,即叶子节点就是结束字符。 代码设计如下
private class PrefixTreeNode {
private char val;
private boolean isEndKeyWord = false;
private Map<Character, PrefixTreeNode> subRoot = new HashMap<>();
}
整体代码如下
public class SensitiveWordFilter {
private final PrefixTreeNode prefixTreeRoot = new PrefixTreeNode();
private final String REPLACE_SYMBOL = "***";
{
InputStream inputStream = SensitiveWordFilter.class.getClassLoader().getResourceAsStream("sensitive_word.txt");
BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
String data = null;
while (true) {
try {
if ((data = reader.readLine()) == null) break;
this.addKeyWord(data);
} catch (IOException e) {
e.printStackTrace();
}
}
}
public void addKeyWord(String data) {
PrefixTreeNode node = prefixTreeRoot;
for (int i = 0; i < data.length(); i++) {
char val = data.charAt(i);
PrefixTreeNode subNode = node.getSubNodeByKey(val);
if (subNode == null) {
subNode = new PrefixTreeNode(val);
if (data.length() == (i + 1)) subNode.setEndKeyWord(true);
node.addSubNode(val, subNode);
}
node = subNode;
}
}
public String doFilter(String data) {
if (data == null || data.length() == 0) return null;
StringBuilder result = new StringBuilder();
PrefixTreeNode curNode;
for (int i = 0; i < data.length(); i++) {
curNode = prefixTreeRoot;
char val = data.charAt(i);
PrefixTreeNode subNode = curNode.getSubNodeByKey(val);
if (subNode == null) {
result.append(val);
} else {
curNode = subNode;
int startIndex = i;
for (int j = startIndex + 1; j < data.length(); j++) {
char target = data.charAt(j);
PrefixTreeNode curSubNode = curNode.getSubNodeByKey(target);
if (curSubNode == null) {
startIndex = j;
break;
} else if (curSubNode.isEndKeyWord()) {
result.append(REPLACE_SYMBOL);
startIndex = j + 1;
break;
} else {
curNode = curSubNode;
}
}
i = startIndex - 1;
}
}
return result.toString();
}
private class PrefixTreeNode {
private char val;
private boolean isEndKeyWord = false;
public boolean isEndKeyWord() {
return isEndKeyWord;
}
public void setEndKeyWord(boolean endKeyWord) {
isEndKeyWord = endKeyWord;
}
private Map<Character, PrefixTreeNode> subRoot = new HashMap<>();
public PrefixTreeNode(char val, Map<Character, PrefixTreeNode> subRoot) {
this.val = val;
this.subRoot = subRoot;
}
public PrefixTreeNode(char val) {
this.val = val;
}
public PrefixTreeNode getSubNodeByKey(char val) {
return subRoot.get(val);
}
public void addSubNode(char val, PrefixTreeNode node) {
subRoot.put(val, node);
}
public PrefixTreeNode() {
}
public Character getVal() {
return val;
}
public void setVal(Character val) {
this.val = val;
}
public Map<Character, PrefixTreeNode> getSubRoot() {
return subRoot;
}
public void setSubRoot(Map<Character, PrefixTreeNode> subRoot) {
this.subRoot = subRoot;
}
}
}
代码解读,设计思路
首先这个树是一个内部类,因为只有敏感词过滤类使用该类,是一个特有的数据结构,设计思想可以参考JDK,好多类的设计,都是高内聚,对类的操作,高内聚,外部类,仅仅只是要调用而已。 每次创建对象时,都会从类路径下加载一个文件,存储我们铭感词的文件,然后,加入到我们的敏感词树中。 一开始我们就创建一个根节点,不存储数据,他的子节点才开始存储数据,为了和后面的设计保持一致。
|