Saturday, July 08, 2006

document scoring/calculating relevance in lucene

How is scoring of a document calculated in lucene? Well, first of all what is soring and what does it do. Scoring is a factor which decides the order by which documents are returned after a search is done using lucene. Something like "order by relevance". If we provide an order by/sort by clause in the search query, then ordering is done on this. Else, ordering is done on the score of the document. All documents returned from the search have a score no associated with it. The document with the highest score comes on top and that with the lease score comes at the bottom.

Scoring is implemented in the similarity class of lucene (org.apache.lucene.search.Similarity).

Lets see how the score of a document is calculated.

The score of a query q for document d is defines as follows (t refers to a term):





score(q,d)=
Σ (tf(t in d) * idf(t)^2 * getBoost(t in q) * getBoost(t.field in d) * lengthNorm(t.field in d) ) * coord(q,d) * queryNorm(sumOfSqaredWeights)
t in q


where


sumOfSqaredWeights =
Σ ( idf(t) * getBoost(t in q) )^2
t in q


Now, lets decode it...

tf = term/phrase(t) frequency in a document(d). Terms and phrases repeated in a document indicate the topic of the document, so implementations of this method usually return larger values when freq is large, and smaller values when freq is small.

idf = term document frequency(no of documents which contain the term). Terms that occur in fewer documents are better indicators of topic, so implementations of this method usually return larger values for rare terms, and smaller values for common terms.

getBoost (t in q) = Gets the boost for this clause in the query. Documents matching this clause will (in addition to the normal weightings) have their score multiplied by b. The boost b=1.0 by default.

getBoost (t.field in d) = Gets the boost for the field in the document if the term is in the field. Again boost for the field b is 1.0 by default.

lengthNorm(t.field in d) = normalization value for a field given the total number of terms contained in a field. These values, together with field boosts, are stored in an index and multipled into scores for hits on each field by the search code. Matches in longer fields are less precise, so implementations of this method usually return smaller values when numTokens is large, and larger values when numTokens is small. These values are computed under IndexWriter.addDocument(Document d) and stored. Thus they have limited precision, and documents must be re-indexed if this method is altered.

coord = Score factor based on the fraction of all query terms that a document contains. This value is multiplied into scores. The presence of a large portion of the query terms indicates a better match with the query, so implementations of this method usually return larger values when the ratio between these parameters is large and smaller values when the ratio between them is small.

queryNorm = Computes the normalization value for a query given the sum of the squared weights of each of the query terms. This value is then multipled into the weight of each query term. This does not affect ranking, but rather just attempts to make scores from different queries comparable.

So affectively the score of a document returned, in search done, using a query is calculated keeping in mind the frequency of the term/phrase in the document, frequency of documents containing the term, boosting factor of a clause in the query, boosting factor of a field, total number of terms in a field, fraction of all query terms contained in a document and some normalization factor based in the idf and boost of term in query.

If you are searching a single field using a query, the results should be very relevant, provided the text entered is meaningful and gramatically correct. With the logic here, even if you are searching multiple fields, the order of relevance of results should be good. But relevance can be increased by assigning boost factor to fields either during the query formation process or during the indexing process. Fields could be boosted according to their order of importance resulting in very relevant results.

Boosting during the query formation process provides the flexibility to change the boost factor dynamically but slows down the search and scoring due to dynamic computation of score. Whereas if fields are boosted during the indexing process, then though the search will be much faster, you would loose the flexibility of altering the boost factor dynamically.

Anything i have missed, please let me know and i will add it...

4 comments:

Anonymous said...

Hi.. jayanth,
Can you explain the same (lucene scoring) with sample index. So that it would e wel understood.
If you look at the Luke toolbox, it explains the score of a result; but It wasn't clear.
Any inputs to understand this lucene better woud be appriciated.
Thanks in advance.

gamegeek said...

yes of course, as soon as i get some time i would try to dig into the details of document scoring in lucene and list them out with examples.

Mike said...

An example or two would make this article really useful. Have you found the time to dig out some examples yet ?

casino online said...

great post - very helpful information
I would love it if you'll provide an example. cheers!