拓扑空间 (Topological Space): Writing a Simple Word Segmentation Program


Writing a Simple Word Segmentation Program

Posted by 逍凌 on 2007/7/27 23:13:30

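Algorithm in brief: given a string S, scan it from front to back and, for each character the scan reaches, look for the longest match in the dictionary. For example, suppose S = "我是中华人民共和国公民" and the dictionary contains the words "中华人民共和国", "中华", "公民", "人民", "共和国" and so on. When the scan reaches "中", take 1, 2, 3, ... characters starting from it ("中", "中华", "中华人", "中华人民", "中华人民共", "中华人民共和", "中华人民共和国", "中华人民共和国公"). The longest of these strings matched in the dictionary is "中华人民共和国", so the string is cut there and the scanner advances to "公".

Data structure: the choice of data structure has a large effect on performance. I use a Hashtable, _rootTable, to hold the dictionary, with (key, insertion count) as the key-value pairs. For each word of N characters, its first 1, first 2, first 3, ... first N characters are each inserted into _rootTable as keys. If the same key is inserted again, its value is incremented.

The dictionary file is an 800K dictionary downloaded from the Internet.

The core code is as follows (the program carries weights, insertion counts and other factors that the current algorithm does not use; they could serve as the basis for a more effective segmentation algorithm):

    using System;
    using System.Collections;
    using System.IO;
    using System.Text.RegularExpressions;

    // Chinese word segmenter.
    // ChineseWordsHashCountSet and ChineseWordUnit are not shown in this post.
    public class ChineseParse
    {
        private static ChineseWordsHashCountSet _countTable;

        static ChineseParse()
        {
            _countTable = new ChineseWordsHashCountSet();
            InitFromFile("ChineseDictionary.txt");
        }

        // Initializes the Chinese word dictionary and the string count
        // dictionary from the given file.
        // fileName: name of the file
        private static void InitFromFile(string fileName)
        {
            string path = Path.Combine(Directory.GetCurrentDirectory(), fileName);
            if (File.Exists(path))
            {
                using (StreamReader sr = File.OpenText(path))
                {
                    string s;
                    while ((s = sr.ReadLine()) != null)
                    {
                        ChineseWordUnit _tempUnit = InitUnit(s);
                        _countTable.InsertWord(_tempUnit.Word);
                    }
                }
            }
        }

        // Parses a string into a ChineseWordUnit.
        // s: the string
        // Returns the parsed ChineseWordUnit.
        private static ChineseWordUnit InitUnit(string s)
        {
            Regex reg = new Regex(@"\s+");
            string[] temp = reg.Split(s);
            if (temp.Length != 2)
            {
                throw new Exception("String parse error: " + s);
            }
            return new ChineseWordUnit(temp[0], Int32.Parse(temp[1]));
        }

        // Analyzes the input string and cuts it into individual words.
        // s: the string to cut
        // Returns the array of Chinese words obtained.
        public static string[] ParseChinese(string s)
        {
            int _length = s.Length;
            string _temp = String.Empty;
            ArrayList _words = new ArrayList();

            for (int i = 0; i < _length; )
            {
                _temp = s.Substring(i, 1);
                if (_countTable.GetCount(_temp) > 1)
                {
                    int j = 2;
                    // Grow the candidate while it is still a key in the table.
                    for (; i + j <= _length && _countTable.GetCount(s.Substring(i, j)) > 0; j++)
                    {
                    }
                    _temp = s.Substring(i, j - 1);
                    i = i + j - 2;
                }
                i++;
                _words.Add(_temp);
            }

            string[] _tempStringArray = new string[_words.Count];
            _words.CopyTo(_tempStringArray);
            return _tempStringArray;
        }
    }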
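The post does not show the class behind _countTable, so here is a minimal sketch of what such a prefix count table might look like, together with a small scanner that runs the example sentence from the description. The names PrefixCountTable, SegmenterSketch and Segment are hypothetical and not from the original program, and a generic Dictionary stands in for the Hashtable mentioned above. As a simplification, this sketch extends a candidate whenever the longer string is a known key, whereas the posted code first requires the single character's count to be greater than 1.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the table described in the post: every prefix
// of every dictionary word is a key, and the value counts how many times
// that prefix has been inserted.
public class PrefixCountTable
{
    private readonly Dictionary<string, int> _counts = new Dictionary<string, int>();

    // Inserts the prefixes of length 1..N of an N-character word.
    public void InsertWord(string word)
    {
        for (int len = 1; len <= word.Length; len++)
        {
            string prefix = word.Substring(0, len);
            int current;
            _counts.TryGetValue(prefix, out current); // current is 0 if the key is absent
            _counts[prefix] = current + 1;
        }
    }

    // Returns how many times the key was inserted, or 0 if it never was.
    public int GetCount(string key)
    {
        int count;
        return _counts.TryGetValue(key, out count) ? count : 0;
    }
}

public static class SegmenterSketch
{
    // Forward scan: at each position, extend the candidate while the longer
    // string is still a known key, then cut and move past it.
    public static List<string> Segment(string s, PrefixCountTable table)
    {
        var words = new List<string>();
        int i = 0;
        while (i < s.Length)
        {
            int len = 1;
            while (i + len < s.Length && table.GetCount(s.Substring(i, len + 1)) > 0)
            {
                len++;
            }
            words.Add(s.Substring(i, len));
            i += len;
        }
        return words;
    }

    public static void Main()
    {
        var table = new PrefixCountTable();
        foreach (string w in new[] { "中华人民共和国", "中华", "公民", "人民", "共和国" })
        {
            table.InsertWord(w);
        }
        // Prints: 我/是/中华人民共和国/公民
        Console.WriteLine(string.Join("/", Segment("我是中华人民共和国公民", table)));
    }
}
```

One thing this sketch makes visible: because every prefix is a key, the cut can land on a prefix that is not itself a dictionary word. With the five words above, the input "中华人口" is cut as "中华人" followed by "口". Storing a separate flag or count for complete words would be one way to avoid that.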
