What good is an application, no matter how much information it contains, if you cannot easily search it?
Xapian to the Rescue
Xapian is an excellent open
source (GPL) search engine library. It is written in C++ and comes with
bindings for Python as well as many other languages, and it supports
everything you’d expect from a modern search engine:
Ranked probabilistic search – The results that are returned are ranked according to their relevance, with the most relevant occurring first.
Boolean search operators – You can use AND, OR, NOT, and XOR in your searches.
Phrase and proximity searching – For example, “used books” will look for occurrences of these words as an exact phrase, but you can also search for “used NEAR books” to find occurrences of the words “used” and “books” that are within 10 words of each other. You can even write “used NEAR/3 books” to change the proximity threshold to three.
Stemming of search terms – If, for example, you search for “programmer,” you can find articles that mention “programmers” or “programming.” Xapian currently supports stemming in Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, and Swedish.
Stopwords – Xapian already knows which common words to ignore (like “are,” “is,” and “being”).
Simultaneous update and searching – Xapian allows you to index new articles while the database is being searched. New articles can be searched right away.
Relevant suggestions – Xapian can automatically suggest documents that are relevant to a given document. As such, you can add a list of “similar items” to each page.
Value-associated results – You can associate values like word count, date, page views, diggs, and so on with each document. Xapian can return results sorted by any of these criteria.
Document metadata – You can add metadata tags to each document (Xapian calls these terms). These tags can be anything you desire, like author, title, date, tags, and so on. Users can then search within the metadata by typing “author:john”.
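To make the syntax concrete, here are a few illustrative query strings of my own (combining the features above):

```text
"used books" AND author:john       (exact phrase, restricted to one author)
subject:xapian OR subject:lucene   (match either subject term)
python NEAR/3 search -java         (proximity search plus an excluded term)
```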
[Diagram: the indexer, search server, application, and the SQL and Xapian databases]
The above diagram shows the main participants in a typical
search-enabled use case. We assume that the data to be searched is
stored in a relational database (the blue SQL server jar), but it
doesn’t really matter where the data comes from. The indexer is a
Python program that is executed periodically (as a cron job). Its
function is to retrieve new or changed documents from the database and
to index them. The Xapian library handles the actual read/write
operations on the Xapian database (in the purple jar).
Since the Xapian library is not thread-safe and because Web
applications are usually multithreaded, you need to implement a locking
mechanism if you want access to a Xapian database to be safe. My
preferred way for accomplishing this aim is to use a separate process
(the orange Search Server box). This process will be a single-threaded
RPC server that will handle all searches. The benefit of this strategy
is that you can move the search server process (together with the
Xapian database) to a different machine. In so doing, you can free up a
lot of resources on the server that runs your application. That
makes the system very scalable. In general, you can expect any
bottlenecks to be more IO and memory related than CPU related.
Alternatively, your search operations can be directly initiated from
your application process. This alternative will work as long as you use
a mutex to govern access to your database. However, I wouldn’t
recommend doing this in a production environment. Why? Because you’ll
never be sure what’s consuming so much memory—the Xapian library or
your application.
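If you do go the direct-access route, the idea is simply to wrap every Xapian call in a single mutex. The sketch below shows the pattern with a hypothetical ThreadSafeSearcher class of my own; the dummy _do_search() stands in for the real Xapian Enquire/get_mset calls:

```python
import threading

class ThreadSafeSearcher(object):
    """Serializes access to a non-thread-safe resource
    (e.g., a Xapian database) behind a single mutex."""

    def __init__(self):
        self._lock = threading.Lock()
        self._calls = []  # stands in for the shared Xapian database handle

    def _do_search(self, query):
        # In a real application this would run a Xapian
        # Enquire/get_mset cycle against the database.
        self._calls.append(query)
        return {"count": 0, "results": []}

    def search(self, query):
        # Only one thread at a time may touch the database.
        with self._lock:
            return self._do_search(query)

searcher = ThreadSafeSearcher()
threads = [threading.Thread(target=searcher.search, args=("q%d" % i,))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# All four queries went through, one at a time.
```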
The application (red box) gets a very clean search API. It simply
connects to the XML-RPC server (one line of code) and obtains access to
a search() method, which takes the search query and the number of
results needed as arguments. It returns a dictionary with the total
number of available results and the results themselves.
In this tutorial, we’ll create a searchable article database. We
assume that you already have Xapian and the Xapian bindings installed.
We’ll start with the indexer.
The Indexer: The Golden Retriever
Following
is the indexer code. It is tailored to TurboGears and SQLAlchemy, but
it can be easily adapted to suit any ORM. It accepts three command line
arguments: the configuration file, which helps it find the database
(either dev.cfg or prod.cfg); the Xapian database location, which is
simply a directory name; and a number of hours (such that all documents
that were created or modified within this number of hours will be
indexed). If you run the indexer every hour, you can safely give 2 as
the third argument. If you’d like to re-index all articles, pass in 0
as the third argument.
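Assuming the script is saved as indexer.py (the filename here is my choice), a manual run and an hourly crontab entry could look like this:

```text
# manual run: re-index everything
$ python indexer.py prod.cfg ./my_database 0

# crontab entry: every hour, index the last 2 hours of changes
0 * * * * python /path/to/indexer.py prod.cfg /path/to/my_database 2
```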
Code (python)
#!/usr/bin/env python
import re
import sys
from datetime import datetime, timedelta

import turbogears
import turbogears.database
import xapian

WORD_RE = re.compile(r"\w{1,32}", re.U)
ARTICLE_ID = 0
stemmer = xapian.Stem("en")  # English stemmer

def create_document(article):
    """Gets an article object and returns
    a Xapian document representing it and
    a unique article identifier."""
    doc = xapian.Document()
    text = article.text.encode("utf8")
    # Go word by word, stem each one, and add
    # it to the document.
    for index, term in enumerate(WORD_RE.finditer(text)):
        doc.add_posting(
            stemmer.stem_word(term.group()),
            index)
    doc.add_term("A" + article.submitted_by.user_name)
    doc.add_term("S" + article.subject.encode("utf8"))
    article_id_term = "I" + str(article.article_id)
    doc.add_term(article_id_term)
    doc.add_value(ARTICLE_ID, str(article.article_id))
    return doc, article_id_term

def index(db, period):
    """Index all articles that were modified
    in the last <period> hours into the given
    Xapian database."""
    if period:
        start = datetime.now() - timedelta(hours=period)
        query = (Article.q.last_modified > start)
    else:
        query = None
    articles = session.query(Article).select(query)
    for article in articles:
        doc, id_term = create_document(article)
        db.replace_document(id_term, doc)

if __name__ == "__main__":
    configfile, dbpath, period = sys.argv[1:]
    turbogears.update_config(configfile=configfile,
        modulename="myproject.config")
    # Article and session come from the model's star import.
    from myproject.model import *
    turbogears.database.bind_meta_data()
    db = xapian.WritableDatabase(dbpath,
        xapian.DB_CREATE_OR_OPEN)
    index(db, int(period))
All strings that are passed to Xapian functions must be encoded in
UTF-8. The create_document() function splits the article’s text into
words, stems them, and then adds them one by one to the Xapian
document. Next, a term with the username of the article’s author,
prefixed by the letter ‘A,’ and the article’s subject, prefixed by the
letter ‘S’ is added. Xapian gives special treatment to the first
character of each term (i.e., it gives the terms its meaning). You’ll
see how we use these terms when we code the search server.
Another term, this one prefixed by the letter ‘I,’ is added to serve as
a unique article ID. The article ID is also assigned to the document as
a value. This number relates a Xapian document back to its source row
in the SQL server.
The index() function simply selects the relevant articles and builds a
Xapian document object for each of them. Instead of using
add_document(), which could cause an article to be indexed multiple
times in the database, we use replace_document(), which is given a
unique term. If a document is already indexed under the given term, it
is replaced with the given document; otherwise, a new document is added
to the database.
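The effect of replace_document() can be pictured as an upsert keyed on the unique ‘I’ term. This toy model (my own illustration, not Xapian code) shows why re-running the indexer never duplicates an article:

```python
class ToyDatabase:
    """Mimics the add_document/replace_document distinction
    with plain Python structures."""

    def __init__(self):
        self.docs = []      # add_document: append-only
        self.by_term = {}   # replace_document: keyed by a unique term

    def add_document(self, doc):
        self.docs.append(doc)        # indexing twice -> two copies

    def replace_document(self, id_term, doc):
        self.by_term[id_term] = doc  # indexing twice -> still one copy

db = ToyDatabase()
for _ in range(2):  # simulate two indexer runs over the same article
    db.add_document({"text": "hello"})
    db.replace_document("I42", {"text": "hello"})

print(len(db.docs))     # 2 -- duplicated
print(len(db.by_term))  # 1 -- replaced in place
```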
After the data is indexed, it is time to make it searchable.
The Search Server: Seeing Results
The role of the search server is to receive queries from the
application and return results. Since we want a single-threaded
implementation, the event-driven Twisted framework is a natural fit and
makes this server extremely easy to write. If you are not familiar with
Twisted or XML-RPC, don’t worry; just imagine we’re writing a
controller with only one method exposed: xmlrpc_search().
Code (python)
import sys

import xapian
from twisted.web import xmlrpc, server
from twisted.internet import reactor, task

from indexer import ARTICLE_ID

DEFAULT_SEARCH_FLAGS = (
    xapian.QueryParser.FLAG_BOOLEAN |
    xapian.QueryParser.FLAG_PHRASE |
    xapian.QueryParser.FLAG_LOVEHATE |
    xapian.QueryParser.FLAG_BOOLEAN_ANY_CASE
    )

class SearchServer(xmlrpc.XMLRPC):
    def __init__(self, db):
        xmlrpc.XMLRPC.__init__(self)
        self.db = xapian.Database(db)
        self.parser = xapian.QueryParser()
        self.parser.add_prefix("author", "A")
        self.parser.add_prefix("subject", "S")
        self.parser.set_stemmer(xapian.Stem("en"))
        self.parser.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
        # make sure the database is reloaded every 10 minutes
        lc = task.LoopingCall(self.reopen_db)
        lc.start(600)

    def reopen_db(self):
        try:
            self.db.reopen()
        except IOError:
            print "Unable to open database"

    def xmlrpc_search(self, query, offset, count):
        """Search the database for <query>; return
        results[offset:offset+count], sorted by relevance."""
        query = self.parser.parse_query(query.encode("utf8"),
            DEFAULT_SEARCH_FLAGS)
        enquire = xapian.Enquire(self.db)
        enquire.set_query(query)
        try:
            mset = enquire.get_mset(offset, count)
        except IOError, e:
            if "DatabaseModifiedError" in str(e):
                self.reopen_db()
            raise
        results = []
        for m in mset:
            results.append(
                {"percent": m[xapian.MSET_PERCENT],
                 "article_id": m[xapian.MSET_DOCUMENT].get_value(ARTICLE_ID)})
        return {"count": mset.get_matches_upper_bound(), "results": results}

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print "Usage: search.py <port> <db>"
        sys.exit(1)
    reactor.listenTCP(int(sys.argv[1]),
        server.Site(SearchServer(sys.argv[2])))
    reactor.run()
The search server constructor opens a database and initializes a
query parser. We tell the query parser that the keyword ‘author’ refers
to terms that are prefixed with the letter ‘A’ and that the keyword
’subject’ refers to terms that are prefixed with the letter ‘S.’ This
specification makes it possible to search for “author:john” or
“subject:xapian.”
We then instruct Twisted to call reopen_db() every ten minutes. This
reopening renders the latest changes in the database available to the
search server. Each time Xapian’s library opens the database, it works
against a fixed revision of it.
xmlrpc_search() is the only method that is exposed (since its name is
prefixed by xmlrpc_). The offset (zero-based) and count arguments allow
for an efficient way to split the search results into several pages.
The call returns a dictionary with the total number of available
results and a list with the selected subset of results. Each item in
the list contains a unique article_id and a percent, which indicates
the document’s relevancy score.
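On the application side, paginating is just a matter of translating a page number into an offset. A small helper (hypothetical, not part of the server code) could look like this:

```python
def page_args(page, per_page=10):
    """Translate a 1-based page number into the (offset, count)
    arguments expected by the search server."""
    return ((page - 1) * per_page, per_page)

print(page_args(1))  # (0, 10)
print(page_args(3))  # (20, 10)
```

The application would then call, e.g., s.search(query, *page_args(3)) to fetch the third page of results.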
The program receives the port number to listen on and the Xapian
database path. Unless you’d like to expose your search functionality to
the world, it is suggested that you block outside access to this port.
By now, you’re probably eager to try searching your own database. Here’s a quick way to do so. First, start the search server:
Code (text)
$ python search.py 3000 ./my_database
From another terminal, start a Python shell:
Code (python)
>>> import xmlrpclib
>>> s = xmlrpclib.Server('http://localhost:3000')
>>> s.search('python -snake', 0, 10)
{'count': 2, 'results': [{'percent': 94, 'article_id': 15}, {'percent': 79, 'article_id': 6}]}
In the same manner, you can use the search server from your application.
Working with Smaller Databases
Search engines are optimized to return results that are sorted
according to their relevancy. If you need your results sorted by
another criterion, such as date or diggs, it might be useful to run the
query over a smaller database. For example, you might try running it
over a database that contains only articles from the previous month.
This strategy can significantly increase your overall performance.
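One way to keep such a “recent” database small is to have the indexer skip old articles before indexing them into it. The date filter itself is plain datetime arithmetic, independent of Xapian; the function and field names below are my own sketch:

```python
from datetime import datetime, timedelta

def recent_only(articles, days=30, now=None):
    """Keep only articles modified within the last <days> days,
    so they can be indexed into a smaller 'recent' database."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=days)
    return [a for a in articles if a["last_modified"] >= cutoff]

now = datetime(2008, 6, 1)
articles = [
    {"id": 1, "last_modified": datetime(2008, 5, 20)},
    {"id": 2, "last_modified": datetime(2008, 1, 5)},
]
print([a["id"] for a in recent_only(articles, days=30, now=now)])  # [1]
```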
Alternatives to Xapian
While I haven’t tried working with search engine libraries other than Xapian, you can try the Java-based Lucene, which can be accessed from Python using PyLucene. The TurboLucene library makes it easy to use PyLucene from TurboGears.