Lucene in Action: Meet Lucene Pt. 1 | WebReference

Lucene in Action: Meet Lucene Pt. 1

Written by Otis Gospodnetic and Erik Hatcher and reproduced from "Lucene in Action" by permission of Manning Publications Co. ISBN 1932394281, copyright 2004. All rights reserved. See https://www.manning.com for more information.

This chapter covers

  • Understanding Lucene
  • Using the basic indexing API
  • Working with the search API
  • Considering alternative products

    One of the key factors behind Lucene's popularity and success is its simplicity. The careful exposure of its indexing and searching API is a sign of well-designed software. Consequently, you don't need in-depth knowledge of how Lucene's information indexing and retrieval work in order to start using it. Moreover, Lucene's straightforward API requires you to learn how to use only a handful of its classes.

    In this chapter, we show you how to perform basic indexing and searching with Lucene, using ready-to-use code examples. We then briefly introduce the core elements you need to know for both of these processes. We also provide brief reviews of competing products: Java and non-Java, free and commercial.

    1.1 Evolution of information organization and access

    In order to make sense of the perceived complexity of the world, humans have invented categorizations, classifications, genera, species, and other types of hierarchical organizational schemes. The Dewey decimal system for categorizing items in a library collection is a classic example of a hierarchical categorization scheme. The explosion of the Internet and electronic data repositories has brought large amounts of information within our reach. Some companies, such as Yahoo!, have made organization and classification of online data their business. With time, however, the amount of data available has become so vast that we need alternative, more dynamic ways of finding information. Although we can classify data, trawling through hundreds or thousands of categories and subcategories of data is no longer an efficient method for finding information.

    The need to quickly locate information in the sea of data isn't limited to the Internet realm—desktop computers can store increasingly more data. Changing directories and expanding and collapsing hierarchies of folders isn't an effective way to access stored documents. Furthermore, we no longer use computers just for their raw computing abilities: They also serve as multimedia players and media storage devices. Those uses for computers require the ability to quickly find a specific piece of data; what's more, we need to make rich media—such as images, video, and audio files in various formats—easy to locate.

    With this abundance of information, and with time being one of the most precious commodities for most people, we need to be able to make flexible, freeform, ad-hoc queries that can quickly cut across rigid category boundaries and find exactly what we're after while requiring the least effort possible. To illustrate the pervasiveness of searching across the Internet and the desktop, figure 1.1 shows a search for lucene at Google. The figure includes a context menu that lets us use Google to search for the highlighted text. Figure 1.2 shows the Apple Mac OS X Finder (the counterpart to Microsoft's Explorer on Windows) and the search feature embedded at upper right. The Mac OS X music player, iTunes, also has embedded search capabilities, as shown in figure 1.3. Search functionality is everywhere! All major operating systems have embedded searching. The most recent innovation is the Spotlight feature (https://www.apple.com/macosx/tiger/spotlighttech.html) announced by Steve Jobs for the next version of Mac OS X (nicknamed Tiger); it integrates indexing and searching across all file types, including rich metadata specific to each type of file, such as emails, contacts, and more.

    Figure 1.1 Convergence of Internet searching with Google and the web browser.

    Figure 1.2 Mac OS X Finder with its embedded search capability.

    Figure 1.3 Apple's iTunes intuitively embeds search functionality.

    Figure 1.4 Microsoft's newly acquired Lookout product, using Lucene.Net underneath.

    Google has gone IPO. Microsoft has released a beta version of its MSN search engine; on a potentially related note, Microsoft acquired Lookout, a product leveraging the Lucene.Net port of Lucene to index and search Microsoft Outlook email and personal folders (as shown in figure 1.4). Yahoo! purchased Overture and is beefing up its custom search capabilities.

    To understand what role Lucene plays in search, let's start from the basics and learn about what Lucene is and how it can help you with your search needs.


    Created: March 27, 2003
    Revised: January 24, 2005

    URL: https://webreference.com/programming/lucene/1