Community Over Code NA 2024 Apache Lucene/Solr Birds of a Feather

Eric Pugh

Community Over Code NA 2024 (formerly ApacheCon) took place from the 7th to the 11th of October in Denver (United States of America). The Search Track was well attended, and the designated room was standing room only quite a few times.

As has now become a tradition, we had a Lucene/Solr Birds of a Feather (BoF) to collect feedback from the community and assess the status of the projects, with a good attendance of 22 people in the room. This one was a bit different than the previous ones as it focused more on tech and less on the people and processes related to the two communities. You can read previous ones here: Bratislava, Halifax, and New Orleans.

Huge thanks to Stefan Vodita, Lucene committer, who took notes during the event. Without his effort, this blog post would not have been possible.

In Solr we do a lot of “Keep the Lights On” (KTLO) work, and don’t pay as much attention to Lucene as we should. What should we know about Lucene 10?

  • KNN Capabilities: Expansion of K-nearest neighbors (KNN) functionality. This has been available since 9.x, and many more features are planned for 10.

  • Vector API: Recent Java versions include support for the Vector API and Project Panama, enabling more efficient handling of vectors.

  • Intra-Segment Concurrency: Each segment can now be searched by multiple threads, which reduces latency.

  • JDK21 Compatibility: Emphasis on utilizing new features in JDK21.
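
To make the intra-segment concurrency point more concrete, here is a minimal plain-Java sketch of the idea. It is not Lucene's implementation: the "segment" is just an array of per-document scores, and the class and method names are invented for illustration. One segment is cut into doc ID ranges, each range is searched by its own task, and the partial results are merged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; none of these names are Lucene APIs.
class IntraSegmentSketch {

    /** Returns the doc ID with the highest score, or -1 for an empty segment. Assumes slices >= 1. */
    static int bestDoc(float[] scores, int slices) {
        if (scores.length == 0) return -1;
        ExecutorService pool = Executors.newFixedThreadPool(slices);
        try {
            int chunk = (scores.length + slices - 1) / slices; // ceil(length / slices)
            List<Future<Integer>> partials = new ArrayList<>();
            for (int start = 0; start < scores.length; start += chunk) {
                final int from = start;
                final int to = Math.min(start + chunk, scores.length);
                // Each task scans only its own doc ID range of the same segment.
                partials.add(pool.submit(() -> {
                    int localBest = from;
                    for (int doc = from + 1; doc < to; doc++) {
                        if (scores[doc] > scores[localBest]) localBest = doc;
                    }
                    return localBest;
                }));
            }
            // Merge the per-range winners into one overall winner.
            int overall = partials.get(0).get();
            for (Future<Integer> partial : partials) {
                int candidate = partial.get();
                if (scores[candidate] > scores[overall]) overall = candidate;
            }
            return overall;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The latency benefit comes from the ranges being scanned at the same time, so a single large segment no longer has to be walked by one thread from start to finish.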

What's coming up in Lucene in the next 2 years?

Binary partitioning will improve how documents are ordered: documents with similar terms end up next to each other, which leads to smaller deltas in postings lists and faster decoding. The aim is to extend this capability to vectors. While some great work was done in Lucene 9, in 10 we’re going to do even more to make vectors powerful.
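
Why do smaller deltas matter? A rough sketch, assuming a simple variable-length integer encoding with 7 payload bits per byte (Lucene's real postings formats are block-based and more sophisticated, and the class below is purely illustrative): a postings list is stored as gaps between consecutive doc IDs, so when documents sharing a term sit close together, the gaps need fewer bytes.

```java
// Hypothetical illustration of delta-encoded postings; not Lucene code.
class PostingsDeltaSketch {

    /** Bytes needed to store one non-negative int with 7 payload bits per byte. */
    static int vIntBytes(int value) {
        int bytes = 1;
        while ((value >>>= 7) != 0) bytes++;
        return bytes;
    }

    /** Total bytes to store a sorted list of doc IDs as gaps from the previous ID. */
    static int encodedSize(int[] sortedDocIds) {
        int total = 0;
        int previous = 0;
        for (int doc : sortedDocIds) {
            total += vIntBytes(doc - previous);
            previous = doc;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] scattered = {5, 40_000, 90_000, 150_000};  // matching docs spread across the index
        int[] clustered = {1_000, 1_001, 1_002, 1_003};  // same number of docs, placed together
        System.out.println(encodedSize(scattered) + " vs " + encodedSize(clustered)); // prints "10 vs 5"
    }
}
```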

What can I do about large stored fields, especially when I have one massive field and few small metadata fields?

  • Stored fields are currently slow; it is recommended to use DocValues where possible.

  • Unlike stored fields, DocValues can be read for a single field without decoding all of a document's stored fields, which improves performance.

  • Suggestion to sort and retrieve top-N results before accessing stored fields to boost efficiency.
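
The last suggestion can be sketched in plain Java. This is only a stand-in for the pattern, not Lucene code: the long[] plays the role of a DocValues column used for sorting, and storedFieldLoader is a hypothetical callback representing the expensive stored-field read. The point is that the loader runs only for the N winners, never for every matching document.

```java
import java.util.Comparator;
import java.util.List;
import java.util.function.IntFunction;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical illustration; the names here are not Lucene APIs.
class TopNThenFetchSketch {

    /** Ranks docs by a cheap column (descending), then loads stored data for the top n only. */
    static List<String> topN(long[] sortColumn, int n, IntFunction<String> storedFieldLoader) {
        return IntStream.range(0, sortColumn.length)
                .boxed()
                .sorted(Comparator.comparingLong((Integer doc) -> sortColumn[doc]).reversed())
                .limit(n)
                // The expensive load happens after limit(), so it is called at most n times.
                .map(doc -> storedFieldLoader.apply(doc))
                .collect(Collectors.toList());
    }
}
```

With one massive field and a few small metadata fields, this ordering means the massive field is only ever decoded for the handful of documents that are actually returned.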

Forward and Backwards compatibility of Lucene versions and limitations in IndexUpgrader.

This was a big topic, both at the BoF and more generally at the conference. We need to make updating Lucene versions easy. We in the Solr community should accept that by default you STORE a copy of all your data, so that upgrades, reindexing, and experimentation are easy. No need to go back to the source!

  • Index Upgrade Tool: A tool exists to update your index to a new codec, but index metadata may still retain the old version.

  • Proposal for Comprehensive Index Updates: There’s a proposal to update the index with full awareness of potential breaking changes (e.g., norms), even if it risks losing some features.

  • Acknowledgment that future concerns may hinder current functionality.

  • Solr's Role in Upgrades: Solr could assist in spinning up clusters with new versions and facilitating migration.

  • Rollback Limitations: Current inability to roll back minor versions raises questions about possible solutions.
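
The point about index metadata retaining the old version can be illustrated with a tiny sketch. It assumes the commonly described rule that Lucene opens an index only if it was originally created by the current or the previous major version, and that rewriting segments to a new codec does not change that recorded creation version. The class and method below are hypothetical, not Lucene APIs.

```java
// Hypothetical illustration of the major-version check; not Lucene code.
class IndexCompatSketch {

    /** True if an index first created on createdMajor can be opened by runningMajor. */
    static boolean canOpen(int createdMajor, int runningMajor) {
        return createdMajor <= runningMajor && createdMajor >= runningMajor - 1;
    }

    public static void main(String[] args) {
        System.out.println(canOpen(9, 10)); // created on 9, opened on 10
        System.out.println(canOpen(8, 10)); // created on 8, upgraded on 9, then opened on 10
    }
}
```

Under that assumption, an index created on 8 and upgraded on 9 still counts as "created on 8" when 10 tries to open it, which is why a full reindex ends up being the reliable path.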

What can the Solr project do for Lucene?

  • Testing Collaboration: Suggestion to revive collaborative testing efforts that existed when Lucene and Solr were a single project.

  • User Insights: Encourage sharing information about user hotspots and issues, leveraging Solr's larger user base to benefit Lucene. Make sure to funnel feedback down to Lucene!

Lastly, there was a short discussion on moving from Jira to GitHub issues, a topic of previous BoFs. The Lucene community is happy with it, though there is less structure overall compared to Jira.
