Apache lucene doc

APACHE LUCENE DOC FULL

BooleanQuery has long exposed a confusing scoring feature called the coordination factor (coord), to reward hits containing a higher percentage of the search terms. However, this hack is only necessary for scoring models like TF/IDF, which have "weak" term saturation, such that many occurrences of a single term in a document would be more powerful than adding a single occurrence of another term from the query. Since this was specific to one scoring model, TFIDFSimilarity, and since Lucene has now switched to the better Okapi BM25 scoring model by default, we have now fully removed coordination factors in 7.0 from both BooleanQuery and Similarity.

Likewise, the query normalization phase of scoring will be removed. This phase tried to equalize scores across different queries and indices so that they are more comparable, but it didn't alter the sort order of hits, and it was also TF/IDF specific.

With index-time boosts gone, length encoding is also more accurate; in particular, the first nine length values (1 to 9) are distinct.
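The "weak" term saturation mentioned above for TF/IDF can be made concrete with a small sketch. It compares only the term-frequency factor of the two models, assuming the classic square-root factor for TF/IDF and the usual BM25 factor for a document of exactly average length; IDF and length normalization are deliberately left out. The class name, method names and the k1 value of 1.2 are illustrative assumptions, not Lucene's API.

```java
// Simplified term-frequency factors only; IDF and length normalization are omitted.
class TermSaturation {
    // Classic TF/IDF factor: sqrt(freq), which keeps growing with freq.
    static double classicTf(double freq) {
        return Math.sqrt(freq);
    }

    // BM25 factor for a document of average length:
    // freq * (k1 + 1) / (freq + k1), which approaches k1 + 1 as freq grows.
    static double bm25Tf(double freq, double k1) {
        return freq * (k1 + 1) / (freq + k1);
    }

    public static void main(String[] args) {
        double k1 = 1.2; // assumed value, chosen for illustration
        for (int freq : new int[] {1, 4, 16, 64}) {
            System.out.printf("freq=%d classic=%.3f bm25=%.3f%n",
                    freq, classicTf(freq), bm25Tf(freq, k1));
        }
    }
}
```

Under these assumptions the square-root factor has no upper bound, while the BM25 factor can never exceed k1 + 1, so repeating one term many times adds less and less to the score.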

Index-time boosting, which lets you increase the a-priori score for a particular document versus other documents, is now deprecated and will be removed in 7.0. This has always been a fragile feature: it was encoded, along with the field's length, into a single byte value, and thus had very low precision. Furthermore, it is now straightforward to write your custom boost into your own doc values field and use function queries to apply the boost at search time.

With the sparse doc values changes in 7.0, listed below, you finally only pay for what you actually use with doc values, in index size, indexing performance, etc. This is the same as other parts of the index like postings, stored fields, term vectors, etc., and it means users with very sparse doc values no longer see merges taking an unreasonably long time or the index becoming unexpectedly huge while merging. Our nightly sparse benchmarks, based on the NYC Trip Data corpus, show the impressive gains each of these changes (and more!) accomplished.

A new advanceExact method enables more efficient skipping.

Both top-level browse-only facet counts and facet counts for hits in a query are now faster in sparse cases.

Our doc values based queries take advantage of the new API.

Outlier values no longer consume excessive space.

The 7.0 codec now sparsely encodes sparse doc values and length normalization factors ("norms").

The biggest change in 7.0 is changing doc values from a random access API to a more restrictive iterator API.

Doc values are Lucene's column-stride numeric, sorted or binary per-document field storage across all documents. They can be used to hold scoring signals, such as the single-byte (by default) document length encoding or application-dependent signals, or for sorting, faceting or grouping, or even numeric fields that you might use for range filtering in some queries. Their column-stride storage means it's efficient to visit all values for the one field across documents, in contrast to the row-stride storage that stored fields use to retrieve all field values for a single document.

Postings have long been consumed through an iterator, so this was a relatively natural change to make, and the two share the same base class, DocIdSetIterator, to step through or seek to each hit. The initial rote switch to an iterator API was really just a plumbing swap and less interesting than all the subsequent user-impacting improvements, listed above, that became possible thanks to the more restrictive API.
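To illustrate why an iterator API pairs well with sparse storage, here is a toy model of one doc values column that stores entries only for documents that actually have a value. It is a sketch under stated assumptions, not Lucene's implementation: the class SparseLongColumn and its fields are hypothetical, and only the method names (docID, nextDoc, advanceExact, longValue) are chosen to mirror the iterator style described above.

```java
import java.util.Arrays;

// Toy sparse column: only documents with a value occupy space. Not Lucene's classes.
class SparseLongColumn {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs;    // sorted IDs of documents that have a value
    private final long[] values; // values[i] belongs to docs[i]
    private int idx = -1;        // position in docs; -1 before iteration starts
    private int doc = -1;        // current document ID

    SparseLongColumn(int[] docs, long[] values) {
        this.docs = docs;
        this.values = values;
    }

    int docID() {
        return doc;
    }

    // Step forward to the next document that has a value.
    int nextDoc() {
        if (idx < docs.length) {
            idx++;
        }
        doc = idx < docs.length ? docs[idx] : NO_MORE_DOCS;
        return doc;
    }

    // Move to exactly `target` (targets must never go backwards)
    // and report whether that document has a value.
    boolean advanceExact(int target) {
        int from = Math.max(idx, 0);
        int pos = Arrays.binarySearch(docs, from, docs.length, target);
        doc = target;
        if (pos >= 0) {
            idx = pos;
            return true;
        }
        idx = -pos - 2; // index of the last entry before target, or -1
        return false;
    }

    // Only meaningful when the iterator is positioned on a document with a value.
    long longValue() {
        return values[idx];
    }

    public static void main(String[] args) {
        SparseLongColumn col = new SparseLongColumn(new int[] {2, 7, 40}, new long[] {10L, 20L, 30L});
        for (int d = col.nextDoc(); d != NO_MORE_DOCS; d = col.nextDoc()) {
            System.out.println("doc " + d + " -> " + col.longValue());
        }
    }
}
```

Because callers can only move forward, the column never has to answer arbitrary random-access lookups, which is what leaves room for a compact encoding in which documents without a value cost nothing.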


This is only a subset of the new 7.0-only features; for the full list, please see the 7.0.0 section in the upcoming CHANGES.txt.

Of course, with every major release, we also do more mundane things like remove deprecated 6.x APIs, and drop support for old indices (written with Lucene 5.x or earlier). Remember that Lucene developers generally try hard to backport new features for the next non-major (feature) release, and the upcoming 6.5 already has many great changes, so a new major release is exciting because it means the 7.0-only features, which I now describe, are the particularly big ones that we felt could not be backported for 6.5. The Apache Lucene project will likely release its next major release, 7.0, in a few