Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Apache Solr Search Patterns
Apache Solr Search Patterns

Apache Solr Search Patterns: Leverage the power of Apache Solr to power up your business by navigating your users to their data quickly and efficiently

eBook
$29.99 $43.99
Paperback
$54.99
Subscription
Free Trial
Renews at $19.99p/m

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Billing Address

Table of content icon View table of contents Preview book icon Preview Book

Apache Solr Search Patterns

Chapter 2. Customizing the Solr Scoring Algorithm

In this chapter, we will go through the relevance calculation algorithm used by Solr for ranking results and understand how relevance calculation works with reference to the parameters in the algorithm. In addition to this, we will look at tweaking the algorithm and create our own algorithm for scoring results. Then, we will add it as a plugin to Solr and see how the search results are ranked. We will discuss the problems with the default algorithm used in Solr and define a new algorithm known called the information gain model. This chapter will incorporate the following topics:

  • The relevance calculation algorithm
  • Building a custom scorer
  • Drawback of the TF-IDF model
  • The information gain model
  • Implementing the information gain model
  • Options to TF-IDF similarity
  • BM25 similarity
  • DFR similarity

Relevance calculation

Now that we are aware of how Solr works in the creation of an inverted index and how a search returns results for a query from an index, the question that comes to our mind is how Solr or Lucene (the underlying API) decides which documents should be at the top and how the results are sorted. Of course, we can have custom sorting, where we can sort results based on a particular field. However, how does sorting occur in the absence of any custom sorting query?

The default sorting mechanism in Solr is known as relevance ranking. During a search, Solr calculates the relevance score of each document in the result set and ranks the documents so that the highest scoring documents move to the top. Scoring is the process of determining how relevant a given document is with respect to the input query. The default scoring mechanism is a mix of the Boolean model and the Vector Space Model (VSM) of information retrieval. The binary model is used to figure out documents that match...

Building a custom scorer

Now that we know how the default relevance calculation works, let us look at how to create a custom scorer. The default scorer used for relevance calculation is known as DefaultSimilarity. In order to create a custom scorer, we will need to extend DefaultSimilarity and create our own similarity class and eventually use it in our Solr schema. Solr also provides the option of specifying different similarity classes for different fieldTypes configuration directive in the schema. Thus, we can create different similarity classes and then specify different scoring algorithms for different fieldTypes as also a different global Similarity class.

Let us create a custom scorer that disables the IDF factor of the scoring algorithm. Why would we want to disable the IDF? The IDF boosts documents that have query terms that are rare in the index. Therefore, if a query contains a term that occurs in fewer documents, the documents containing the term will be ranked higher. This does...

Drawbacks of the TF-IDF model

Suppose, on an e-commerce website, a customer is searching for a jacket and intends to purchase a jacket with a unique design. The keyword entered is unique jacket. What happens at the Solr end?

http://solr.server/solr/clothes/?q=unique+jacket

Now, unique is a comparatively rare keyword. There would be fewer items or documents that mention unique in their description. Let us see how this affects the ranking of our results via the TF-IDF scoring algorithm. A relook at the scoring algorithm with respect to this query is shown in the following diagram:

Drawbacks of the TF-IDF model

A relook at the TF-IDF scoring algorithm

The following parameters in the scoring formula do not affect the ranking of the documents in the query result:

  • coord(q,d): This would be constant for a MUST query. Herein we are searching for both unique and jacket, so all documents will have both the keywords and the coord(q,d) value will be the same for all documents.
  • queryNorm(q): This is used to make the scores from different...

The information gain model

The information gain model is a type of machine learning concept that can be used in place of the inverse document frequency approach. The concept being used here is the probability of observing two terms together on the basis of their occurrence in an index. We use an index to evaluate the occurrence of two terms x and y and calculate the information gain for each term in the index:

  • P(x): Probability of a term x appearing in a listing
  • P(x|y): Probability of the term x appearing given a term y also appears

The information gain value of the term y can be computed as follows:

The information gain model

Information gain equation

This equation says that the more number of times term y appears with term x with respect to the total occurrence of term x, the higher is the information gain for that y.

Let us take a few examples to understand the concept.

In the earlier example, if the term unique appears with jacket a large number of times as compared to the total occurrence of the term jacket, then unique...

Implementing the information gain model

The problem with the information gain model is that, for each term in the index, we will have to evaluate the occurrence of every other term. The complexity of the algorithm will be of the order of square of the two terms, square(xy). It is not possible to compute this using a simple machine. What is recommended is that we create a map-reduce job and use a distributed Hadoop cluster to compute the information gain for each term in the index.

Our distributed Hadoop cluster would do the following:

  • Count all occurrences of each term in the index
  • Count all occurrences of each co-occurring term in the index
  • Construct a hash table or a map of co-occurring terms
  • Calculate the information gain for each term and store it in a file in the Hadoop cluster

In order to implement this in our scoring algorithm, we will need to build a custom scorer where the IDF calculation is overwritten by the algorithm for deriving the information gain for the term from the Hadoop cluster...

Relevance calculation


Now that we are aware of how Solr works in the creation of an inverted index and how a search returns results for a query from an index, the question that comes to our mind is how Solr or Lucene (the underlying API) decides which documents should be at the top and how the results are sorted. Of course, we can have custom sorting, where we can sort results based on a particular field. However, how does sorting occur in the absence of any custom sorting query?

The default sorting mechanism in Solr is known as relevance ranking. During a search, Solr calculates the relevance score of each document in the result set and ranks the documents so that the highest scoring documents move to the top. Scoring is the process of determining how relevant a given document is with respect to the input query. The default scoring mechanism is a mix of the Boolean model and the Vector Space Model (VSM) of information retrieval. The binary model is used to figure out documents that match...

Building a custom scorer


Now that we know how the default relevance calculation works, let us look at how to create a custom scorer. The default scorer used for relevance calculation is known as DefaultSimilarity. In order to create a custom scorer, we will need to extend DefaultSimilarity and create our own similarity class and eventually use it in our Solr schema. Solr also provides the option of specifying different similarity classes for different fieldTypes configuration directive in the schema. Thus, we can create different similarity classes and then specify different scoring algorithms for different fieldTypes as also a different global Similarity class.

Let us create a custom scorer that disables the IDF factor of the scoring algorithm. Why would we want to disable the IDF? The IDF boosts documents that have query terms that are rare in the index. Therefore, if a query contains a term that occurs in fewer documents, the documents containing the term will be ranked higher. This does...

Left arrow icon Right arrow icon

Description

This book is for developers who already know how to use Solr and are looking at procuring advanced strategies for improving their search using Solr. This book is also for people who work with analytics to generate graphs and reports using Solr. Moreover, if you are a search architect who is looking forward to scale your search using Solr, this is a must have book for you. It would be helpful if you are familiar with the Java programming language.

Who is this book for?

This book is for developers who already know how to use Solr and are looking at procuring advanced strategies for improving their search using Solr. This book is also for people who work with analytics to generate graphs and reports using Solr. Moreover, if you are a search architect who is looking forward to scale your search using Solr, this is a must have book for you.

What you will learn

  • Customize the Solr scoring algorithm to get better and more relevant search results Use Solr with big data for analytical purposes Get insights into Solr internals-indexing and search Setting up and scaling with Solr cloud Implement spatial search with Solr Understand Finite State Transducers (FST) and implement text tagging using FST Breeze through the strategies used in executing search using Solr in e-commerce, advertising, and real estate websites Learn more about how to use Solr with AJAX

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Apr 24, 2015
Length: 316 pages
Edition : 1st
Language : English
ISBN-13 : 9781783981854
Vendor :
Apache
Category :
Languages :
Tools :

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Billing Address

Product Details

Publication date : Apr 24, 2015
Length: 316 pages
Edition : 1st
Language : English
ISBN-13 : 9781783981854
Vendor :
Apache
Category :
Languages :
Tools :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total $ 158.97
Scaling Big Data with Hadoop and Solr, Second Edition
$48.99
Apache Solr Search Patterns
$54.99
Solr Cookbook - Third Edition
$54.99
Total $ 158.97 Stars icon
Banner background image

Table of Contents

11 Chapters
1. Solr Indexing Internals Chevron down icon Chevron up icon
2. Customizing the Solr Scoring Algorithm Chevron down icon Chevron up icon
3. Solr Internals and Custom Queries Chevron down icon Chevron up icon
4. Solr for Big Data Chevron down icon Chevron up icon
5. Solr in E-commerce Chevron down icon Chevron up icon
6. Solr for Spatial Search Chevron down icon Chevron up icon
7. Using Solr in an Advertising System Chevron down icon Chevron up icon
8. AJAX Solr Chevron down icon Chevron up icon
9. SolrCloud Chevron down icon Chevron up icon
10. Text Tagging with Lucene FST Chevron down icon Chevron up icon
Index Chevron down icon Chevron up icon

Customer reviews

Rating distribution
Full star icon Full star icon Full star icon Full star icon Half star icon 4.5
(2 Ratings)
5 star 50%
4 star 50%
3 star 0%
2 star 0%
1 star 0%
BMille Jul 02, 2016
Full star icon Full star icon Full star icon Full star icon Full star icon 5
It does a great job of in detail on important items.
Amazon Verified review Amazon
Tim Crothers Jun 19, 2015
Full star icon Full star icon Full star icon Full star icon Empty star icon 4
This book is an excellent treatise on how to use and tune Solr for various uses. I really enjoyed how the author struck a good balance between hands-on specifics and details while still covering the breadth of the application needs. In particular including details like how to tune and handle language differences (a necessity typically overlooked in most technical books) was very appreciated. Given the depth covered by the author developers of every level of experience with Solr should find a lot of value in this book. I had moderate experience with Solr when I picked this book up and definitely learned several useful techniques to add to my toolkit.
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

How do I buy and download an eBook? Chevron down icon Chevron up icon

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the ebook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website? Chevron down icon Chevron up icon

If you want to purchase a video course, eBook or Bundle (Print+eBook) please follow below steps:

  1. Register on our website using your email address and the password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Cart, or PayPal)
Where can I access support around an eBook? Chevron down icon Chevron up icon
  • If you experience a problem with using or installing Adobe Reader, the contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats do Packt support? Chevron down icon Chevron up icon

Our eBooks are currently available in a variety of formats such as PDF and ePubs. In the future, this may well change with trends and development in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks? Chevron down icon Chevron up icon
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower price than print
  • They save resources and space
What is an eBook? Chevron down icon Chevron up icon

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply login to your account and click on the link in Your Download Area. We recommend you saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.