This article is taken from the forthcoming book Lucene in Action, Second Edition. The 475-page guide is a comprehensive tutorial that shows how to use Lucene to add full-text, cross-platform search to nearly any application. This article introduces a new feature of release 2.3 that enables backing up an index without pausing indexing or restarting searches.
Hot Backups with Lucene
By Michael McCandless, April 2008
Picture this: You own a small, very profitable and quickly growing e-commerce Web site. You carefully designed the whole user experience around the powerful open-source search engine Lucene. This search-centric approach is your secret sauce, and you know it’s the reason you are winning over users from your competition. Eighty percent of purchases come through search. You are rightfully proud.
Then the unthinkable happens: One day your hard drive crashes and your search index becomes corrupt and unusable. So what do you do? You restore from your backups! You do have backups of your search index, right? Amazingly, it’s all too common for owners and administrators of search-intensive sites to overlook making regular backups of their search index. In our increasingly agile, always-on, search-driven world, failing to back up your search index is a very costly mistake. Fortunately, as of version 2.3, backing up a Lucene index is now surprisingly simple.
In the modern world of heavyweight, expensive, and complex closed-source enterprise search engines, Lucene is a surprising breath of fresh air. The simple design, carefully exposed API, and incredible feature set make it trivial to add search to your application. Recently Lucene has been under very active development, quickly adding features previously available only in expensive, closed-source commercial offerings. Hot backup is just such a feature.
The challenge
The most obvious way to back up a Lucene index is to close your IndexWriter and make a full or incremental copy of all files in the index. After all, these are just ordinary files stored in a single flat directory in the file system, so this approach will work. While this approach is simple, it has serious limitations. On Windows, if you have an IndexReader open on the index, it can keep files around even when they are no longer needed by the most recent commit. Your backup process then wastes time and space copying these unnecessary files. You can work around that problem by always re-opening your reader after closing the IndexWriter and before running the backup.
But here’s another problem: You can’t open another IndexWriter until the backup finishes because the writer might change the index while the backup is running, which would corrupt your backup. This means you cannot make any updates to your index while the backup is running, making your index read-only. Worse, you can neither predict nor control how long this read-only down time will actually be. It could be 30 seconds or it could be an hour or more, depending on the size of your index and the availability of overall IO bandwidth.
So maybe you decide to work around that by giving the highest priority possible to the backup process. This way it finishes as quickly as possible, right? Well, yes, but this will cause serious interference to any IndexSearchers that you are using to search the index. Really you should do the reverse: Give the backup process a low priority, or carefully throttle its IO, so that it does not interfere with searching.
Suddenly this backup process is really a hassle because it interferes so much with ongoing searches and updates. No wonder so many people just don’t bother with backups and only discover, the hard way, just how important they really are.
The solution
Fortunately, Lucene’s simple segmented architecture, described later in this article, presents an elegant solution. With recent changes in 2.3, it is now possible to make a hot backup of your index, which means backing it up without having to close your IndexWriter, pause indexing, or restart searchers. Furthermore, it’s fine if the backup process takes as much time as it needs, because Lucene will protect the necessary files. The backup will be a point-in-time copy of the search index, even if the index is still being changed by the writer.
Cutting to the chase
For the impatient ones among us, this is all you have to do.
NOTE: All code samples in this article are based upon release 2.3.1 of Lucene.
When you instantiate the IndexWriter, use the new SnapshotDeletionPolicy, like this:
This creates an IndexWriter with a special deletion policy. At this point, use your writer as you normally would. You can also use a different original deletion policy than KeepOnlyLastCommitDeletionPolicy if you need to. Then, when you need to do a backup, initiate it from your writer, like this:
You can do this from a separate thread, and continue using the writer as usual in your application to make changes to the index. The backup will copy the point-in-time snapshot as of the moment when you called the snapshot() method.
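The shape of that backup step can be sketched without depending on Lucene itself. In the sketch below, IndexSnapshotter and HotBackup are hypothetical names introduced only for illustration: with Lucene 2.3.x you would implement snapshot() by calling SnapshotDeletionPolicy.snapshot() and returning the file names of the commit point it hands back, and release() by calling SnapshotDeletionPolicy.release(). The point it tries to show is the try/finally structure, so that the snapshot is released even when a copy fails.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Collection;

/**
 * Hypothetical stand-in so this sketch compiles without Lucene on the classpath.
 * Assumption: snapshot() would wrap SnapshotDeletionPolicy.snapshot() and return
 * the file names of that commit point; release() would wrap its release().
 */
interface IndexSnapshotter {
    Collection<String> snapshot();
    void release();
}

class HotBackup {
    /** Copies every file of the snapshot into backupDir and returns how many were copied. */
    static int run(IndexSnapshotter snapshotter, Path indexDir, Path backupDir)
            throws IOException {
        Files.createDirectories(backupDir);
        // From here on, the files named by the snapshot are protected from deletion.
        Collection<String> fileNames = snapshotter.snapshot();
        try {
            int copied = 0;
            for (String name : fileNames) {
                Files.copy(indexDir.resolve(name), backupDir.resolve(name),
                           StandardCopyOption.REPLACE_EXISTING);
                copied++;
            }
            return copied;
        } finally {
            // Always release, even after a failed copy, so the writer can
            // remove old files again on a later commit.
            snapshotter.release();
        }
    }
}
```

Because the snapshot pins files on disk for as long as it is held, keeping the release in a finally block avoids an index that quietly grows after a failed backup run.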
Here are some important notes to follow when copying the files:
- Don’t copy the write.lock file.
- Always copy the segments.gen file.
- For all other files, Lucene is “write once.” This makes doing incremental backups very easy: Simply compare the file names. Once a file is written, it will never change; therefore, if you’ve already backed up that file, there’s no need to copy it again.
- You can do the copying in Java, or you can take the filenames and launch a shell to run your favorite backup or file archiving utility, such as rsync, robocopy, cp, tar, or zip. However, take extra care to catch and handle any errors that these tools might encounter. For example, if you get a disk full error, then that will certainly lead to a corrupt backup image.
- You can even throttle the IO usage of the backup program so that it doesn’t interfere with ongoing searching or indexing. It really doesn’t matter to Lucene how long your backup takes because your backup will always be a point-in-time copy; however, while the backup is running, it will prevent deletion of any files referenced by the point-in-time commit point. This means your index might temporarily use more disk space.
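The copying rules above can be sketched in plain Java. This is a minimal illustration, not part of Lucene: IncrementalBackup and copyNewFiles are hypothetical names, and the sketch assumes the snapshot's file names are passed in as strings. Since the snapshot's file list may not include segments.gen, the sketch adds it explicitly whenever it exists in the index directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class IncrementalBackup {
    /** Copies only what the backup lacks and returns the names actually copied. */
    static List<String> copyNewFiles(Collection<String> snapshotFiles,
                                     Path indexDir, Path backupDir) throws IOException {
        Files.createDirectories(backupDir);
        Set<String> candidates = new LinkedHashSet<>(snapshotFiles);
        // segments.gen is always copied; add it in case the snapshot did not list it.
        if (Files.exists(indexDir.resolve("segments.gen"))) {
            candidates.add("segments.gen");
        }
        // The lock file is never part of a backup.
        candidates.remove("write.lock");

        List<String> copied = new ArrayList<>();
        for (String name : candidates) {
            Path target = backupDir.resolve(name);
            boolean alwaysCopy = name.equals("segments.gen");
            // All other files are write-once, so an existing name means "already backed up".
            if (alwaysCopy || !Files.exists(target)) {
                Files.copy(indexDir.resolve(name), target,
                           StandardCopyOption.REPLACE_EXISTING);
                copied.add(name);
            }
        }
        return copied;
    }
}
```

Comparing names alone assumes every earlier copy completed. If a previous run might have been interrupted, one option is to copy to a temporary name and rename on success, so a partial file is never mistaken for a finished one.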
Restoring or replicating the index
When it’s time to restore the index, follow this procedure:
1. Make sure all IndexReaders and IndexWriters on the index directory are closed.
2. Remove all files from the index directory.
NOTE
On Windows, if you are unable to remove certain files, this means there are still processes holding the files open. Go back to step 1.
3. Copy the files from your backup into the index directory.
WARNING
Be very careful during the copying that you don’t encounter errors, such as disk full, as this will certainly lead to a corrupt index.
This same approach can easily be used to efficiently replicate the index to other computers, for example, if you have a high search load and distribute searches across multiple search servers.
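Steps 2 and 3 of the restore procedure above can be sketched as follows. IndexRestore is a hypothetical helper, not a Lucene class, and it assumes step 1 has already been done by the caller. Any failure to delete or copy surfaces as an IOException instead of being swallowed, so a half-finished restore is never mistaken for a good one.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class IndexRestore {
    /**
     * Replaces the contents of indexDir with the files in backupDir.
     * Assumes all readers and writers on indexDir are already closed.
     * Returns the number of files restored.
     */
    static int restore(Path backupDir, Path indexDir) throws IOException {
        Files.createDirectories(indexDir);
        // Step 2: remove every file from the index directory. Files.delete
        // throws if a file cannot be removed, for example because it is still open.
        for (Path stale : list(indexDir)) {
            Files.delete(stale);
        }
        // Step 3: copy the backup in. Any error (such as a full disk) aborts here.
        int restored = 0;
        for (Path saved : list(backupDir)) {
            Files.copy(saved, indexDir.resolve(saved.getFileName()));
            restored++;
        }
        return restored;
    }

    /** Lists a flat directory up front so it is not modified while being iterated. */
    private static List<Path> list(Path dir) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                files.add(p);
            }
        }
        return files;
    }
}
```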
Technical details
Let’s dig into how SnapshotDeletionPolicy actually works. To do that, we first need to understand Lucene’s elegant segmented architecture. Figure 1 shows the structure of a Lucene index. The index is stored in separate pieces, each containing a complete index for a subset of the documents. Each segment can have many files associated with it, depending upon whether you are using the compound file format. A new segment is created when IndexWriter’s buffer is flushed. Periodically, according to the MergePolicy and MergeScheduler in use by your application, segments are merged together, at which point one new segment is created and the old merged segments are removed.
Figure 1: A Lucene index is composed of separate, independent segments, each holding a full index for a subset of the documents. A commit point (each segments_N file) references a list of segments that make up the index as it exists at that commit.
Finally, and this is the key point, a separate file named segments_N, where N is an integer, holds references to those segments that make up a given commit point (IndexCommitPoint). Every time the writer commits to the index, N is increased by 1. These files are called commit points because a new one is created whenever the writer commits a change to the index. Lucene first writes all new files for a segment, and only when that is successful, writes a new segments_N file referencing that segment and de-referencing any segments that were just merged.
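One practical use of this naming scheme is to check which commit point a directory listing (for example, your backup) actually contains. The sketch below uses hypothetical helper names (CommitPoints, generationOf, latestCommit). It also relies on a detail not stated above: to the best of my knowledge, Lucene encodes N in base 36 (Character.MAX_RADIX), so segments_a is the tenth commit generation, not an invalid name.

```java
import java.util.Collection;

class CommitPoints {
    private static final String PREFIX = "segments_";

    /** Returns the generation N of a segments_N file name, or -1 for any other file. */
    static long generationOf(String fileName) {
        if (!fileName.startsWith(PREFIX) || fileName.length() == PREFIX.length()) {
            return -1; // segments.gen, segment data files, and so on
        }
        try {
            // Assumption: N is written in base 36.
            return Long.parseLong(fileName.substring(PREFIX.length()), Character.MAX_RADIX);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /** Returns the newest segments_N name in the listing, or null if there is none. */
    static String latestCommit(Collection<String> fileNames) {
        String best = null;
        long bestGen = -1;
        for (String name : fileNames) {
            long gen = generationOf(name);
            if (gen > bestGen) {
                bestGen = gen;
                best = name;
            }
        }
        return best;
    }
}
```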
As of release 2.1, the IndexDeletionPolicy was factored out from IndexWriter, enabling you to customize when an old commit point gets deleted. This is useful for certain filesystems, notably NFS, that do not protect open files from being deleted. Whenever the IndexWriter creates a new commit point, it consults the deletion policy to decide which older commit points should then be deleted. The default policy is KeepOnlyLastCommitDeletionPolicy, which removes the previous commit point whenever a new commit is done.
Listing 1 shows the source code for SnapshotDeletionPolicy. You can see that it is surprisingly simple (less than 100 lines). Thanks to the fact that Lucene is open-source, with the liberal Apache Software Foundation License, you can see and modify any of Lucene’s sources. SnapshotDeletionPolicy simply wraps an existing IndexDeletionPolicy. When you make a snapshot, it grabs the current commit point and holds a reference to it, preventing IndexWriter from removing it. Once you release the commit point, then the next time IndexWriter commits a change to the index, that commit point and any resulting unreferenced files will be removed.
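The wrapping idea can be illustrated with a toy model. The sketch below is not Lucene's source code: Commit, DeletionPolicy, KeepOnlyLast, and SnapshotPolicy are simplified stand-ins invented for illustration, and they only assume what is described above (a wrapped policy, a held reference to the current commit point, and an IllegalStateException on a second snapshot).

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for a commit point; not Lucene's IndexCommitPoint. */
class Commit {
    final String segmentsFileName;
    boolean deleted;
    Commit(String segmentsFileName) { this.segmentsFileName = segmentsFileName; }
    void delete() { deleted = true; }
}

/** Simplified stand-in for a deletion policy; commits are ordered oldest to newest. */
interface DeletionPolicy {
    void onCommit(List<Commit> commits);
}

/** Deletes every commit except the newest one. */
class KeepOnlyLast implements DeletionPolicy {
    public void onCommit(List<Commit> commits) {
        for (int i = 0; i < commits.size() - 1; i++) {
            commits.get(i).delete();
        }
    }
}

/** Wraps another policy and shields one snapshotted commit from deletion. */
class SnapshotPolicy implements DeletionPolicy {
    private final DeletionPolicy primary;
    private Commit lastCommit;
    private Commit snapshot;

    SnapshotPolicy(DeletionPolicy primary) { this.primary = primary; }

    public synchronized void onCommit(List<Commit> commits) {
        // Hand the wrapped policy proxies that refuse to delete the snapshot.
        List<Commit> guarded = new ArrayList<>();
        for (Commit c : commits) {
            guarded.add(new Guarded(c));
        }
        primary.onCommit(guarded);
        lastCommit = commits.get(commits.size() - 1);
    }

    synchronized Commit snapshot() {
        if (snapshot != null) throw new IllegalStateException("snapshot already held");
        if (lastCommit == null) throw new IllegalStateException("no commit yet");
        snapshot = lastCommit;
        return snapshot;
    }

    synchronized void release() {
        if (snapshot == null) throw new IllegalStateException("no snapshot held");
        snapshot = null; // files are reclaimed on the next onCommit, not now
    }

    private class Guarded extends Commit {
        private final Commit target;
        Guarded(Commit target) { super(target.segmentsFileName); this.target = target; }
        @Override void delete() {
            if (target != snapshot) target.delete();
        }
    }
}
```

In this model, releasing the snapshot deletes nothing by itself. The old commit only disappears when the wrapped policy is consulted again on the next commit.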
Some minor limitations
SnapshotDeletionPolicy has a few minor limitations. First off, you can only hold one snapshot open at a time. You can see that calling snapshot a second time will throw an IllegalStateException. This normally isn’t a problem because you usually don’t kick off another backup while a previous one is still running. However, if for some reason you really need more than one snapshot at a time, you could make your own version of SnapshotDeletionPolicy that changes the snapshot attribute to a Collection instead, and updates all methods to use that collection.
The second limitation is that SnapshotDeletionPolicy will not remember the snapshot when you close your IndexWriter. This means your backup process must finish before you can close and open a new IndexWriter. Once again, this is simple to fix: Just change it to store its own file in the index Directory, recording whether or not a snapshot is currently open, and if so, its segments filename (IndexCommitPoint.getSegmentsFileName()). Then, in the onInit method, re-open that file if it exists and locate the matching commit point in commits, and mark that one as the current snapshot. With this change, your backup can keep running even while you close and open new IndexWriters in a new JVM.
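The bookkeeping for that fix might look like the sketch below. Everything in it is hypothetical: SnapshotMarker and the snapshot.marker file name are made up for illustration, and plain java.nio files stand in for Lucene's Directory API. The load method plays the role of the onInit step: it reads the stored segments file name and returns it only if it still matches one of the existing commit points.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class SnapshotMarker {
    // Hypothetical marker file name; Lucene defines no such file.
    static final String MARKER = "snapshot.marker";

    /** Records which commit point is currently snapshotted. */
    static void save(Path indexDir, String segmentsFileName) throws IOException {
        Files.write(indexDir.resolve(MARKER),
                    segmentsFileName.getBytes(StandardCharsets.UTF_8));
    }

    /** Forgets the snapshot; call this when the snapshot is released. */
    static void clear(Path indexDir) throws IOException {
        Files.deleteIfExists(indexDir.resolve(MARKER));
    }

    /**
     * Returns the recorded segments file name if it matches one of the
     * existing commit points, or null if nothing (valid) is recorded.
     */
    static String load(Path indexDir, List<String> commitSegmentsFileNames)
            throws IOException {
        Path marker = indexDir.resolve(MARKER);
        if (!Files.exists(marker)) {
            return null;
        }
        String saved = new String(Files.readAllBytes(marker), StandardCharsets.UTF_8).trim();
        return commitSegmentsFileNames.contains(saved) ? saved : null;
    }
}
```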
The third limitation is what happens, or rather doesn’t happen, when you call release(). When you release a snapshot, it’s likely there are now files that are no longer referenced by any commit point or running backup. However, these files are not deleted immediately. Instead, they are deleted the next time IndexWriter checks for deleted files. This happens when the writer is opened and when it commits a change to the index. This one is not simple to fix yourself, but Lucene is always in flux, so maybe a future release will fix it! In the meantime, simply opening and closing an IndexWriter will do the trick.
Conclusion
Let’s face it: someday your search index will suddenly become unusable and your only fast option is to restore from a backup. Maybe you are an optimist and figure it’ll be a year or two, or maybe you’re a pessimist and count on only a few weeks. Or maybe you figure you can just quickly re-build your entire index when fate comes calling. Whatever your persuasion, it really is only a matter of time until that day comes. Thanks to recent active development in Lucene, making a backup is now a surprisingly simple operation that no longer interferes with ongoing updating and searching. There are no more excuses to delay!