Lucene is the platform on which we index information and make it searchable. Fetching the information is independent of Lucene: you provide the mechanism that retrieves it. Once you have the information, you can use the indexing facilities Lucene provides to add the news articles to an index. To search, we use Lucene's searcher, which provides search functionality against that index. Now, let's have a quick overview of how Lucene manages information.
Continuing our news search application, let's assume we fetched some news bits from a custom source. The following shows the two news items that we are going to add to our index:
For each news bit, we have a title, publishing date, content, and link, which are the typical constituents of a news article. We will treat each news item as a document and add it to our news data store. The act of adding documents to the data store is called indexing, and the data store itself is called an index. Once the index is created, you can query it to locate documents by search terms; this is referred to as searching the index.
So, how does Lucene maintain an index, and how is an index leveraged for search? Think of a scenario where you look for a certain subject in a book. Let's say you are interested in Object Oriented Programming (OOP) and want to learn more about inheritance. You get a book on OOP and start looking for the relevant information about inheritance. You could start from the beginning of the book and read until you land on the inheritance topic. If the relevant topic is at the end of the book, it will certainly take a while to reach. As you may notice, this is not a very efficient way to locate information. To locate information quickly in a book, especially a reference book, you can usually rely on the index, where you will find key-value pairs of keywords and page numbers, sorted alphabetically by keyword. Here, you can look up the word inheritance and go to the related pages immediately without scanning through the entire book. This is a more efficient and standard method to quickly locate the relevant information. This is also how Lucene works behind the scenes, though with more sophisticated algorithms that make searching efficient and flexible.
Internally, Lucene assigns a unique document ID (called DocId) to each document when it is added to an index. The DocId is used to quickly return the details of a document in search results. The following is an example of how Lucene maintains an index. Assume we start a new index and add three documents as follows:
Lucene indexes these documents by tokenizing the phrases into keywords and putting them into an inverted index. Lucene's inverted index is a reverse mapping from keyword to DocId. Within the index, keywords are stored in sorted order, and DocIds are associated with each keyword. A match on a keyword brings up the associated DocIds, which are then used to return the matching documents. This is a simplified view of how Lucene maintains an index, and it should give you a basic idea of Lucene's architecture.
The following is an example of an inverted index table for our current sample data:
As you may notice, the inverted index is designed to optimally answer queries such as: get me all documents with the term xyz. This data structure allows for a very fast full-text search to locate the relevant documents. For example, suppose a user searches for the term Solr. Lucene can quickly locate Solr in the inverted index, because the keywords are sorted, and return DocId 2 and DocId 3 as the result. The search can then retrieve the relevant documents by these DocIds. To a great extent, this architecture contributes to Lucene's speed and efficiency. As you continue to read through this book, you will see Lucene's many techniques for finding relevant information and how you can customize them to suit your needs.
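To make the idea concrete, here is a minimal sketch of an inverted index written with the JDK only. It is not Lucene's implementation or API: the class name `InvertedIndexSketch`, its methods, and the three sample phrases are assumptions made for illustration, with the phrases chosen so that a lookup for Solr returns DocId 2 and DocId 3, as in the example described in the text. Tokenizing is reduced to splitting on whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index for illustration only; this is NOT how Lucene is implemented.
class InvertedIndexSketch {
    // Keywords are kept in sorted order (TreeMap); each keyword maps to a sorted set of DocIds.
    private final TreeMap<String, SortedSet<Integer>> postings = new TreeMap<>();
    // Stored documents; DocId n lives at list position n - 1.
    private final List<String> documents = new ArrayList<>();

    // Adds a document, assigns it the next DocId (numbered from 1 here to
    // match the DocIds mentioned in the text), and records its keywords.
    int addDocument(String text) {
        documents.add(text);
        int docId = documents.size();
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Returns the DocIds associated with a keyword (exact, case-sensitive match).
    SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term, new TreeSet<>());
    }

    // Retrieves the stored document for a DocId.
    String getDocument(int docId) {
        return documents.get(docId - 1);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        // Hypothetical sample phrases, not taken from the book's sample data.
        index.addDocument("Lucene");
        index.addDocument("Lucene and Solr");
        index.addDocument("Solr extends Lucene");

        System.out.println(index.search("Solr")); // prints [2, 3]
        for (int docId : index.search("Solr")) {
            System.out.println(docId + ": " + index.getDocument(docId));
        }
    }
}
```

The sorted map stands in for the sorted keyword list: locating a keyword does not require scanning every document, and the DocIds found there lead straight to the stored text.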
One of the many Lucene features worth noting is text analysis. It's an important feature because it provides extensibility and gives you an opportunity to massage data into a standard format before feeding it into an index. It's analogous to the transform layer in an Extract Transform Load (ETL) process. A typical use is the removal of stop words. These are common words (for example, is, and, the, and so on) that have little or no value in search. For an even more flexible search application, we can also use this analysis layer to turn all keywords into lowercase in order to perform case-insensitive searches. There are many more kinds of analysis you can do with this framework; we will show you the best practices and pitfalls to help you make decisions when customizing your search application.
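The two transformations mentioned above, lowercasing and stop-word removal, can be sketched in a few lines of plain Java. This sketch deliberately avoids Lucene's own analysis classes: `AnalysisSketch`, `analyze`, and the three-word stop list are hypothetical names and data used only to show the idea of normalizing text before it reaches the index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative text-analysis step; not Lucene's analyzer API.
class AnalysisSketch {
    // A tiny, hypothetical stop-word list taken from the examples in the text.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("is", "and", "the"));

    // Splits text on anything that is not a letter or digit, lowercases each
    // token, and drops stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            String term = token.toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Lucene is the Engine and Solr is the Server"));
        // prints [lucene, engine, solr, server]
    }
}
```

For the search to be case-insensitive, the same analysis has to be applied to the user's query as to the indexed text; otherwise a query for Solr would not match the indexed keyword solr.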
Why is Lucene so popular?
A quick overview of Lucene's features is as follows:
- Indexes about 150 GB of data per hour on modern hardware
- Efficient RAM utilization (only 1 MB heap)
- Customizable ranking models
- Supports numerous query types
- Restrictive search (routed to specific fields)
- Sorting by fields
- Real-time indexing and searching
- Faceting, Grouping, Highlighting, and so on
- Suggestions
Lucene is very fast and efficient, and it makes the most of modern hardware. Indexing 20 GB of textual content typically produces an index in the range of 4-6 GB, that is, roughly 20 to 30 percent of the size of the original text. Lucene's speed and low RAM requirement are indicative of its efficiency. Its extensibility in text analysis and search allows you to customize a search engine in virtually any way you want.
Quite a few big companies use Lucene for their search applications, and the list of Lucene users is growing at a steady pace. You can take a look at the list of companies and websites that use Lucene on Lucene's wiki page. More and more data giants are using Lucene nowadays: Netflix, Twitter, MySpace, LinkedIn, FedEx, Apple, Ticketmaster, Salesforce.com, Encyclopedia Britannica CD-ROM/DVD, Eclipse IDE, Mayo Clinic, New Scientist magazine, Atlassian (JIRA), Epiphany, MIT's OpenCourseWare and DSpace, HathiTrust digital library, and Akamai's EdgeComputing platform are all on this list. This wide range of implementations illustrates that Lucene is a stand-out piece of search technology that is trusted by many.
Lucene's wiki page is available at http://wiki.apache.org/lucene-java/FrontPage