Lucene in Action: Covers Apache Lucene 3.0 (2nd Edition)
Erik Hatcher, Otis Gospodnetic, Michael McCandless
When Lucene first hit the scene five years ago, it was nothing short of amazing. By using this open-source, highly scalable, super-fast search engine, developers could integrate search into applications quickly and efficiently. A lot has changed since then: search has grown from a "nice-to-have" feature into an indispensable part of most enterprise applications. Lucene now powers search in diverse companies including Akamai, Netflix, LinkedIn, Technorati, HotJobs, Epiphany, FedEx, Mayo Clinic, MIT, New Scientist Magazine, and many others.
Some things remain the same, though. Lucene still delivers high-performance search features in a disarmingly easy-to-use API. Due to its vibrant and diverse open-source community of developers and users, Lucene is relentlessly improving, with evolutions to APIs, significant new features such as payloads, and a huge increase (as much as 8x) in indexing speed with Lucene 2.3.
And with clear writing, reusable examples, and unmatched advice on best practices, Lucene in Action, Second Edition is still the definitive guide to developing with Lucene.
documents for geographic sorting; Implementing custom geographic sort; Accessing values used in custom sorting
6.2 Developing a custom Collector: The Collector base class; Custom collector: BookLinkCollector; AllDocCollector
6.3 Extending QueryParser: Customizing QueryParser’s behavior; Prohibiting fuzzy and wildcard queries; Handling numeric field-range queries; Handling date ranges; Allowing ordered phrase queries
6.4 Custom
shows that 8 of the 16 documents we indexed with Indexer contain the word patent and that the search took a meager 11 milliseconds. Because Indexer stores files’ absolute paths in the index, Searcher can print them. It’s worth noting that storing the file path as a field was our decision and appropriate in this case, but from Lucene’s perspective, it’s arbitrary metadata included in the indexed documents. You can use more sophisticated queries, such as 'patent AND freedom' or 'patent AND NOT
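One way to read operators such as AND and AND NOT in the queries mentioned above is as set operations on the documents that match each term. The following is a minimal sketch in plain Java under that reading. It does not use Lucene's API or mirror its internals, and the class name, file names, and term-to-file sets are hypothetical.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class BooleanSetSketch {
    // AND: keep only documents present in both sets.
    static Set<String> and(Set<String> a, Set<String> b) {
        Set<String> out = new TreeSet<String>(a);
        out.retainAll(b);
        return out;
    }

    // AND NOT: keep documents from the first set that are absent from the second.
    static Set<String> andNot(Set<String> a, Set<String> b) {
        Set<String> out = new TreeSet<String>(a);
        out.removeAll(b);
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical data: the files that contain each word.
        Set<String> patent = new TreeSet<String>(Arrays.asList("a.txt", "b.txt", "c.txt"));
        Set<String> freedom = new TreeSet<String>(Arrays.asList("b.txt", "d.txt"));

        System.out.println(and(patent, freedom));    // prints [b.txt]
        System.out.println(andNot(patent, freedom)); // prints [a.txt, c.txt]
    }
}
```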
the word (text value) of that field. Note that Term objects are also involved in the indexing process. However, they’re created by Lucene’s internals, so you typically don’t need to think about them while indexing. During searching, you may construct Term objects and use them together with TermQuery:
Query q = new TermQuery(new Term("contents", "lucene"));
TopDocs hits = searcher.search(q, 10);
This code instructs Lucene to find the top 10 documents that contain the word lucene in
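To make the idea of "the top n documents containing a word in a field" concrete, here is a toy sketch in plain Java. It illustrates what such a query asks for, not how Lucene answers it: the class TermLookupSketch and its methods are hypothetical, it scans every document instead of using an index, and it ranks by raw occurrence count, which is a deliberate simplification of Lucene's real scoring.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TermLookupSketch {
    // Hypothetical store: document id -> text of its "contents" field.
    private final Map<Integer, String> contents = new LinkedHashMap<Integer, String>();

    void add(int docId, String text) {
        contents.put(docId, text);
    }

    // Counts occurrences of word in text (lowercased, split on whitespace).
    static int count(String text, String word) {
        int n = 0;
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.equals(word)) {
                n++;
            }
        }
        return n;
    }

    // Returns up to n ids of documents containing word, most occurrences first;
    // ties are broken by ascending document id.
    List<Integer> search(String word, int n) {
        final Map<Integer, Integer> freq = new LinkedHashMap<Integer, Integer>();
        for (Map.Entry<Integer, String> e : contents.entrySet()) {
            int c = count(e.getValue(), word);
            if (c > 0) {
                freq.put(e.getKey(), c);
            }
        }
        List<Integer> ids = new ArrayList<Integer>(freq.keySet());
        Collections.sort(ids, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                int byCount = freq.get(b).compareTo(freq.get(a));
                return byCount != 0 ? byCount : a.compareTo(b);
            }
        });
        return new ArrayList<Integer>(ids.subList(0, Math.min(n, ids.size())));
    }

    public static void main(String[] args) {
        TermLookupSketch index = new TermLookupSketch();
        index.add(0, "Lucene in action");
        index.add(1, "lucene lucene search");
        index.add(2, "solr search");
        System.out.println(index.search("lucene", 10)); // prints [1, 0]
    }
}
```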
IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
writer.setInfoStream(System.out);
This reveals detailed diagnostic information about segment flushes and merges, as shown here, and may help you tune indexing parameters described earlier in the chapter. If you’re experiencing an issue during indexing, something you may believe to be a bug in Lucene, and you take your issue to the Lucene user’s list at Apache, the first request you’ll get back is someone asking you to post the
#0]: merging _0:C1010->_0 _1:C1118->_0 _2:C968->_0 _3:C1201->_0 _4:C947->_0 _5:C1084->_0 _6:C1028->_0 _7:C954->_0 _8:C990->_0 _9:C1095->_0 into _a
IW 0 [Lucene Merge Thread #0]: merge: total 10395 docs
In addition, if you need to peek inside your index once it’s built, you can use Luke, a handy third-party tool that we discuss in section 8.1. Our final section covers some advanced indexing topics.
2.13 Advanced indexing concepts
We’ve covered many interesting topics in this chapter; you
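As a quick sanity check on the merge output shown above, the per-segment numbers can be added up and compared with the reported total. The sketch below is plain Java, with a hypothetical class name and regular expression. It assumes the number after C in each entry is that segment's document count, a reading that the matching total of 10395 supports.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MergeLogCheck {
    // Matches entries such as "_3:C1201" and captures the number after "C".
    private static final Pattern SEGMENT = Pattern.compile("_\\w+:C(\\d+)");

    // Sums the captured numbers across all segment entries in one merge line.
    static int totalDocs(String mergeLine) {
        int total = 0;
        Matcher m = SEGMENT.matcher(mergeLine);
        while (m.find()) {
            total += Integer.parseInt(m.group(1));
        }
        return total;
    }

    public static void main(String[] args) {
        String line = "merging _0:C1010->_0 _1:C1118->_0 _2:C968->_0 _3:C1201->_0 "
                + "_4:C947->_0 _5:C1084->_0 _6:C1028->_0 _7:C954->_0 _8:C990->_0 "
                + "_9:C1095->_0 into _a";
        System.out.println(totalDocs(line)); // prints 10395
    }
}
```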