Notice: All articles on this blog are reposts, collected for reference purposes only.

[TurboGears] Pumping Up Your Applications with Xapian Full-Text Search
Software Technology

Posted by lhwork on 2007/2/16 12:02:23

What good is an application, no matter how much information it contains, if the inability to search it easily renders it useless?

Xapian to the Rescue

Xapian is an excellent open source (GPL) search engine library. It is written in C++ and comes with bindings for Python as well as many other languages, and it supports everything you'd expect from a modern search engine:

- Ranked probabilistic search – The results that are returned are ranked according to their relevancy, with the most relevant occurring first.
- Boolean search operators – You can use AND, OR, NOT, and XOR in your searches.
- Phrase and proximity searching – For example, "used books" will look for occurrences of these words as an exact phrase, but you can also search for "used NEAR books" to find occurrences of the words "used" and "books" that are within 10 words of each other. You can even write "used NEAR/3 books" to change the proximity threshold to three.
- Stemming of search terms – If, for example, you search for "programmer," you can find articles that mention "programmers" or "programming." Xapian currently supports stemming in Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, and Swedish.
- Stopwords – Xapian already knows which common words to ignore (like "are," "is," and "being").
- Simultaneous update and searching – Xapian allows you to index new articles while the database is being searched. New articles can be searched right away.
- Relevant suggestions – Xapian can automatically suggest documents that are relevant to a given document. As such, you can add a list of "similar items" to each page.
- Value-associated results – You can associate values like word count, date, page views, diggs, and so on with each document. Xapian can return results that are sorted by any of these criteria.
- Document metadata – You can add metadata tags to each document (Xapian calls these terms). These tags can be anything you desire, like author, title, date, tags, and so on.
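The NEAR semantics described above can be illustrated with a toy, pure-Python sketch. This is only an illustration of the idea, not how Xapian implements proximity search, and the near() helper is hypothetical:

```python
import re

def near(text, word1, word2, window=10):
    # Hypothetical helper: True if word1 and word2 occur within
    # `window` words of each other (a naive take on NEAR semantics).
    words = re.findall(r"\w+", text.lower())
    pos1 = [i for i, w in enumerate(words) if w == word1]
    pos2 = [i for i, w in enumerate(words) if w == word2]
    return any(abs(a - b) <= window for a in pos1 for b in pos2)

print(near("these used science books are cheap", "used", "books", 3))
# True: "used" and "books" are two words apart
print(near("used cars are not at all like rare antique books", "used", "books", 3))
# False: nine words apart, beyond the NEAR/3 threshold
```

A real engine works from an inverted index of word positions rather than rescanning the text, but the distance test is the same.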
Users can then search within the metadata by typing "author:john".

[Architecture diagram: the application, the indexer, the search server, and the SQL and Xapian databases]

The diagram above shows the main participants in a typical search-enabled use case. We assume that the data to be searched is stored in a relational database, but it doesn't really matter where the data comes from. The indexer is a Python program that is executed periodically (as a cron job). Its job is to retrieve new or changed documents from the database and to index them. The Xapian library handles the actual read/write operations on the Xapian database.

Since the Xapian library is not thread-safe and Web applications are usually multithreaded, you need to implement a locking mechanism if you want access to a Xapian database to be safe. My preferred way of accomplishing this is to use a separate process: a single-threaded RPC server that handles all searches. The benefit of this strategy is that you can move the search server process (together with the Xapian database) to a different machine, freeing up resources on the server that runs your application. That makes the system very scalable. In general, you can expect any bottlenecks to be more IO- and memory-related than CPU-related.

Alternatively, your search operations can be initiated directly from your application process. This will work as long as you use a mutex to govern access to your database. However, I wouldn't recommend doing this in a production environment, because you'll never be sure what's consuming so much memory: the Xapian library or your application.

The application gets a very clean search API. It simply connects to the XML RPC server (one line of code) and obtains access to a search() method which takes the search query and the number of results needed as arguments.
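The mutex alternative mentioned above boils down to funneling every database call through one lock. A minimal sketch, in which FakeDatabase is a stand-in for a real xapian.Database and safe_search() is a hypothetical wrapper, only to show the locking pattern:

```python
import threading

_db_lock = threading.Lock()

class FakeDatabase(object):
    """Stand-in for xapian.Database, which we assume is not thread-safe."""
    def __init__(self):
        self.docs = {15: "xapian and python", 6: "python web apps"}

    def search(self, word):
        # Pretend this touches non-thread-safe library state.
        return sorted(doc_id for doc_id, text in self.docs.items()
                      if word in text)

db = FakeDatabase()

def safe_search(word):
    # Every thread takes the mutex before touching the database,
    # so only one search runs at a time.
    with _db_lock:
        return db.search(word)

print(safe_search("python"))  # [6, 15]
```

The separate-process design avoids this lock entirely: the single-threaded server serializes requests by construction.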
It then returns a dictionary with the total number of available results and the results themselves.

In this tutorial, we'll create a searchable article database. We assume that you already have Xapian and the Xapian bindings installed. We'll start with the indexer.

The Indexer: The Golden Retriever

Following is the indexer code. It is tailored to TurboGears and SQLAlchemy, but it can easily be adapted to suit any ORM. It accepts three command line arguments: the configuration file, which helps it find the database (either dev.cfg or prod.cfg); the Xapian database location, which is simply a directory name; and a number of hours (all documents that were created or modified within this number of hours will be indexed). If you run the indexer every hour, you can safely give 2 as the third argument. If you'd like to re-index all articles, pass in 0 as the third argument.

Code (python)

#!/usr/bin/env python
import re
import sys
from datetime import *

import turbogears
import xapian

WORD_RE = re.compile(r"\w{1,32}", re.U)
ARTICLE_ID = 0

stemmer = xapian.Stem("en")  # English stemmer

def create_document(article):
    """Gets an article object and returns
    a Xapian document representing it and
    a unique article identifier."""
    doc = xapian.Document()
    text = article.text.encode("utf8")
    # go word by word, stem each word and add it
    # to the document
    for index, term in enumerate(WORD_RE.finditer(text)):
        doc.add_posting(
            stemmer.stem_word(term.group()),
            index)
    doc.add_term("A" + article.submitted_by.user_name)
    doc.add_term("S" + article.subject.encode("utf8"))
    article_id_term = "I" + str(article.article_id)
    doc.add_term(article_id_term)
    doc.add_value(ARTICLE_ID, str(article.article_id))
    return doc, article_id_term

def index(db, period):
    """Index all articles that were modified
    in the last <period> hours into the given
    Xapian database."""
    if period:
        start = datetime.now() - timedelta(hours=period)
        query = (Article.q.last_modified > start)
    else:
        query = None
    articles = session.query(Article).select(query)
    for article in articles:
        doc, id_term = create_document(article)
        db.replace_document(id_term, doc)

if __name__ == "__main__":
    configfile, dbpath, period = sys.argv[1:]
    turbogears.update_config(configfile=configfile,
        modulename="myproject.config")
    from myproject.model import *
    turbogears.database.bind_meta_data()
    db = xapian.WritableDatabase(dbpath,
        xapian.DB_CREATE_OR_OPEN)
    index(db, int(period))

All strings that are passed to Xapian functions must be encoded in UTF-8. The create_document() function splits the article's text into words, stems them, and then adds them one by one to the Xapian document. Next, a term with the username of the article's author, prefixed by the letter 'A,' and a term with the article's subject, prefixed by the letter 'S,' are added. Xapian gives special treatment to the first character of each term (it gives the term its meaning). You'll see how we use these terms when we code the search server. Another term, this one prefixed by the letter 'I,' is added to serve as a unique article ID. The article ID is also assigned to the document as a value. This number relates a Xapian document to its source row in the SQL server.
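The word-splitting and position-numbering step can be seen in isolation with just the standard library. Lowercasing stands in for the real stemmer here, and the postings() helper is hypothetical:

```python
import re

# Same tokenizer idea as the indexer: runs of word characters,
# capped at 32 characters, Unicode-aware.
WORD_RE = re.compile(r"\w{1,32}", re.U)

def postings(text):
    # Mirror the indexer's loop: enumerate() supplies the position
    # that create_document() passes to doc.add_posting().
    return [(match.group().lower(), index)
            for index, match in enumerate(WORD_RE.finditer(text))]

print(postings("Xapian makes search easy"))
# [('xapian', 0), ('makes', 1), ('search', 2), ('easy', 3)]
```

Those (term, position) pairs are what make phrase and NEAR queries possible later on.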
The index() function simply selects the relevant articles and builds a Xapian document for each of them. Instead of using add_document(), which could cause an article to be indexed multiple times, we use replace_document(), which is given a unique term. If a document is already indexed under the given term, it is replaced with the given document; otherwise, a new document is added to the database.

After the data is indexed, it is time to make it searchable.

The Search Server: Seeing Results

The role of the search server is to receive queries from the application and return results. As we strive for a single-threaded implementation, the Twisted framework makes it extremely easy to write the code for this server. If you are not familiar with Twisted or XML RPC, don't worry; just imagine we're writing a controller with only one method exposed: xmlrpc_search().

Code (python)

import sys
from time import time

import xapian
from twisted.web import xmlrpc, server
from twisted.internet import reactor, task

from indexer import ARTICLE_ID

DEFAULT_SEARCH_FLAGS = (
    xapian.QueryParser.FLAG_BOOLEAN |
    xapian.QueryParser.FLAG_PHRASE |
    xapian.QueryParser.FLAG_LOVEHATE |
    xapian.QueryParser.FLAG_BOOLEAN_ANY_CASE
    )

class SearchServer(xmlrpc.XMLRPC):
    def __init__(self, db):
        xmlrpc.XMLRPC.__init__(self)
        self.db = xapian.Database(db)
        self.parser = xapian.QueryParser()
        self.parser.add_prefix("author", "A")
        self.parser.add_prefix("subject", "S")
        self.parser.set_stemmer(xapian.Stem("en"))
        self.parser.set_stemming_strategy(
            xapian.QueryParser.STEM_SOME)
        # make sure the database is reopened every 10 minutes
        lc = task.LoopingCall(self.reopen_db)
        lc.start(600)

    def reopen_db(self):
        try:
            self.db.reopen()
        except IOError:
            print "Unable to open database"

    def xmlrpc_search(self, query, offset, count):
        """Search the database for <query>, return
        results[offset:offset+count], sorted by relevancy."""
        query = self.parser.parse_query(query.encode("utf8"),
            DEFAULT_SEARCH_FLAGS)
        enquire = xapian.Enquire(self.db)
        enquire.set_query(query)
        try:
            mset = enquire.get_mset(offset, count)
        except IOError, e:
            if "DatabaseModifiedError" in str(e):
                self.reopen_db()
            raise
        results = []
        for m in mset:
            results.append(
                {"percent": m[xapian.MSET_PERCENT],
                 "article_id":
                     m[xapian.MSET_DOCUMENT].get_value(ARTICLE_ID)})
        return {"count": mset.get_matches_upper_bound(),
                "results": results}

if len(sys.argv) != 3:
    print "Usage: search.py <port> <db>"
    sys.exit(1)

reactor.listenTCP(int(sys.argv[1]),
    server.Site(SearchServer(sys.argv[2])))
reactor.run()

The search server constructor opens a database and initializes a query parser. We tell the query parser that the keyword 'author' refers to terms that are prefixed with the letter 'A' and that the keyword 'subject' refers to terms that are prefixed with the letter 'S.' This makes it possible to search for "author:john" or "subject:xapian." We then instruct Twisted to call reopen_db() every ten minutes, which makes the latest changes in the database available to the search server; each time Xapian opens the database, it works against a fixed revision of it. The xmlrpc_search() method is the only one that is exposed (since its name is prefixed by xmlrpc_). The offset (zero-based) and count arguments allow for an efficient way to split the search results into several pages. The call returns a dictionary with the total number of available results and a list with the selected subset of results. Each item in the list contains a unique article_id and a percent, which indicates the document's relevancy score.
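The offset/count pair maps naturally onto page numbers. A small sketch makes the arithmetic explicit; both helpers are hypothetical, not part of the server above:

```python
def page_to_slice(page, per_page=10):
    # Translate a 1-based page number into the (offset, count)
    # arguments expected by a search call like xmlrpc_search().
    if page < 1:
        raise ValueError("page numbers are 1-based")
    return ((page - 1) * per_page, per_page)

def num_pages(total_results, per_page=10):
    # How many pages are needed to show `total_results` matches
    # (ceiling division).
    return (total_results + per_page - 1) // per_page

print(page_to_slice(3))  # (20, 10)
print(num_pages(42))     # 5
```

Asking the server for only one page at a time keeps each RPC response small, regardless of how many documents match.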
The program receives the port number to listen on and the Xapian database path. Unless you'd like to expose your search functionality to the world, it is suggested that you block outside access to this port.

By now, you're probably eager to try searching your own database. Here's a quick way to do so. First, start the search server:

Code (text)

$ python search.py 3000 ./my_database

From another terminal, start a Python shell:

Code (python)

>>> import xmlrpclib
>>> s = xmlrpclib.Server('http://localhost:3000')
>>> s.search('python -snake', 0, 10)
{'count': 2, 'results': [{'percent': 94, 'article_id': 15},
                         {'percent': 79, 'article_id': 6}]}

In the same manner, you can use the search server from your application.

Working with Smaller Databases

Search engines are optimized to return results that are sorted according to their relevancy. If you need your results sorted by another criterion, such as date or diggs, it might be useful to run the query over a smaller database, for example one that contains only articles from the previous month. This strategy can significantly increase your overall performance.

Alternatives to Xapian

While I haven't tried working with search engine libraries other than Xapian, you can try the Java-based Lucene, which can be accessed from Python using PyLucene. The TurboLucene library eases using PyLucene from TurboGears.


Sponsored By W3CHINA
W3CHINA Blog 0.8