GSIP 69 - Catalog scalability enhancements
Overview
Improved vertical scalability of Catalog resources (i.e. being able to efficiently manage hundreds of thousands of layers, styles, etc).
Proposed By
Assigned to Release
GeoServer trunk, hopefully the current 2.2.x branch.
State
Under Discussion, In Progress, Completed, Rejected, Deferred
Motivation
With the arrival of Virtual Services, Workspace Local Services, and Workspace Local Settings, GeoServer is becoming better suited to multitenancy, and hence supporting a large number of configuration resources becomes even more important.
Prior art in this regard includes the development of the DBConfig Module, which externalizes the storage of the configuration objects to an RDBMS using Hibernate O/R mapping, and hence adds the ability for the Catalog to scale up to an unbounded number of workspaces, stores, layers, etc.
Regardless of the Catalog backend's ability to scale up, GeoServer itself doesn't scale gracefully as the number of configuration objects in the catalog increases. The current Catalog API is designed around the assumption that full scans and defensive copies of lists of catalog resources are cheap in both processing time and memory consumption.
This proposal aims to solve this problem in a way that allows API changes to be adopted progressively throughout the code base, wherever the benefits are clear and measurable.
Scope
In Scope
Given a relatively large number of Catalog configuration objects:
- Identify some exemplary use cases that result in scalability/performance bottlenecks throughout the GeoServer code base;
- Identify the requirements and main QA goals needed to satisfactorily solve the problems described in the use cases;
- Design Catalog API enhancements that fulfill the requirements;
- Validate the API design by providing more than one concrete backend implementation, and by upgrading the Catalog client code from the exemplary use cases;
- Provide general guidelines on how and when to progressively adopt the new API methods.
Not in Scope
- It is not in this proposal's scope to allow applications outside GeoServer to directly edit the backend's (RDBMS or other) configuration objects. CatalogFacade and GeoServerFacade implementations are free to use whatever storage format and mechanisms they see fit. That said, this proposal doesn't forbid Catalog/Config backend implementations from allowing applications outside GeoServer to directly edit the configuration objects.
Use Case Drivers
The following use cases identified in the GeoServer code base should cover most situations where the main scalability and/or performance bottleneck is in the Catalog's client code and not in the Catalog's ability to serve large amounts of configuration objects.
- Secure Catalog Decorator: a full scan of Catalog resources is performed on each List-returning get*() request, and a separate list is built for the objects accessible to the current user, even if the Catalog returns an immutable list. This affects both memory consumption and processing time.
- Wicket User Interface: The home page gets the whole list of workspaces, stores, and layers only to get their size. Catalog resource list pages (e.g. LayerPage, StorePage, etc) do so to a) return the iterator for the current page of data, b) obtain the full list of objects, c) obtain the filtered list of objects, d) obtain the total number of objects, e) obtain the filtered number of objects.
- WMS GetCapabilities: Generation of a WMS Capabilities document implies fetching the full list of layers multiple times, in order to a) filter the layer list based on the request's NAMESPACE parameter, b) calculate the layer list's aggregated bounds, c) figure out a CRS common to all the layers, and d) build an in-memory layer tree in order to nest layers based on the LayerInfo's wms "path" attribute.
Check the GSIP 69 - Use Cases page for further detail.
Requirements
To address the above use cases, the following high level requirements and QA goals shall be met by the Catalog API change proposal:
- Filtering: Shall allow for filtering of catalog objects through arbitrary query criteria;
- Streaming: Shall allow for a streamed approach to catalog objects retrieval;
- Paging: Shall allow for paged queries. Catalog backends shall provide a consistent "natural order" of resources. This order doesn't need to be based on id or any other prescribed property.
- Leverage query engines: Shall allow any in-process filtering criteria to be pushed back to the backend, allowing for optimization in the common cases;
- Query generality: in-process filtering shall work out of the box for the general case;
- Compactness: API changes should be additive and minimal;
- Usability: Ease of use and compactness are highly desired;
- Incremental adoption: Shall allow for progressive/iterative adoption;
- Leverage sub-system cohesion: Shall introduce no external dependencies at the API level.
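The filtering, streaming, and paging requirements above can be met in-process for any backend by wrapping a plain iterator, which is what the "query generality" goal relies on. The following is a minimal sketch of that fallback and not part of the proposal itself; `Filter` and `FilteredPage` are hypothetical names standing in for whatever predicate type and adapter an implementation ends up using.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for the proposal's predicate type. */
interface Filter<T> {
    boolean accept(T input);
}

/** Lazily filters a source iterator, skipping "offset" matches and returning at most "limit". */
final class FilteredPage<T> implements Iterator<T> {
    private final Iterator<T> source;
    private final Filter<? super T> filter;
    private int toSkip;     // matching elements still to be skipped
    private int remaining;  // matching elements still allowed in this page
    private T next;
    private boolean ready;  // true when "next" holds a not-yet-returned element

    FilteredPage(Iterator<T> source, Filter<? super T> filter, int offset, int limit) {
        this.source = source;
        this.filter = filter;
        this.toSkip = offset;
        this.remaining = limit;
    }

    @Override
    public boolean hasNext() {
        // Pull from the source only until one element of the page is available
        while (!ready && remaining > 0 && source.hasNext()) {
            T candidate = source.next();
            if (!filter.accept(candidate)) {
                continue;
            }
            if (toSkip > 0) {
                toSkip--;
                continue;
            }
            next = candidate;
            ready = true;
        }
        return ready;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        ready = false;
        remaining--;
        T result = next;
        next = null;
        return result;
    }

    @Override
    public void remove() {
        throw new UnsupportedOperationException();
    }
}
```

No intermediate list is built: elements that fail the filter or fall outside the page are discarded as they stream by. A backend with a query engine would instead translate the same filter, offset, and limit into its native query.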
Proposed Catalog API extensions
The following is a summary of the API proposal. Check the GSIP 69 - API Proposal page for further detail.
In essence, the proposal boils down to a two-method addition to the Catalog interface: one method to obtain the count and another to obtain a stream of Catalog configuration objects. Both accept filtering criteria through a domain-specific predicate interface, and the streaming method also supports paged queries.
- Iterable or Iterator: Use java.util.Iterator instead of java.lang.Iterable as query return type. It is better suited to represent a single stream of contents and helps keep the API changes to a minimum.
- CloseableIterator
    interface CloseableIterator<T> extends Iterator<T>, Closeable {
        @Override
        public void close();
    }
- Predicate: simple GeoServer catalog subsystem's query model.
    package org.geoserver.catalog;

    public interface Predicate<T extends Info> {
        /**
         * Returns a boolean representing the result of applying this predicate to {@code input}.
         */
        boolean apply(T input);
    }
- Predicates utility factory methods: Static factory methods utility to build well known types of Predicate instances, that Catalog backends can easily identify and translate to their native query language.
    package org.geoserver.catalog;

    public class Predicates {

        public static <T extends Info> Predicate<T> acceptAll() {...}

        public static <T extends Info> Predicate<T> acceptNone() {...}

        public static <T extends Info> Predicate<T> propertyEquals(
                String property, Object expected) {...}

        public static <T extends Info> Predicate<T> propertyExists(
                String property) {...}

        public static <T extends Info> Predicate<T> contains(
                String property, String subsequence) {...}

        public static <T extends Info> Predicate<T> and(
                Predicate<? super T>... operands) {...}

        public static <T extends Info> Predicate<T> or(
                Predicate<? super T>... operands) {...}

        public static <T extends Info> Predicate<T> isNull(
                String propertyName) {...}

        public static OrderBy sortBy(
                String propertyName, boolean ascending) {...}
    }
- Catalog Extensions: augment the Catalog interface with two methods, to enable counting the number, and obtaining a stream of Catalog objects for a given query predicate, and to enable paged queries.
    public interface Catalog {
        .... previous methods ...

        public <T extends CatalogInfo> int count(
                Class<T> of, Predicate<T> filter);

        public <T extends CatalogInfo> T get(
                Class<T> type, Predicate<? super T> filter)
                throws IllegalArgumentException;

        public boolean canSort(
                Class<? extends CatalogInfo> type, String propertyName);

        public <T extends CatalogInfo> CloseableIterator<T> list(
                Class<T> of, Predicate<? super T> filter);

        public <T extends CatalogInfo> CloseableIterator<T> list(
                Class<T> of, Predicate<? super T> filter,
                @Nullable Integer offset, @Nullable Integer count,
                @Nullable OrderBy sortOrder);
    }
- CatalogFacade Extensions: augment the CatalogFacade interface with three methods in order to cope with the three main use cases in a generic way: a) get a single object, b) get a filtered and possibly paged list of objects, and c) compute the number of objects that satisfy the query criteria.
    public interface CatalogFacade {
        ....

        public <T extends CatalogInfo> int count(Class<T> of, Predicate<T> filter);

        public boolean canSort(Class<? extends CatalogInfo> type, String propertyName);

        public <T extends CatalogInfo> CloseableIterator<T> list(
                final Class<T> of, final Predicate<? super T> filter,
                @Nullable Integer offset, @Nullable Integer count,
                @Nullable OrderBy sortOrder);
    }
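The point of building predicates through factory methods is that a backend can recognize the resulting types and translate them, falling back to in-process evaluation otherwise. The sketch below illustrates that idea only. `MapPredicate`, `PropertyEquals`, `And`, `WhereClause`, and the pseudo-SQL fragment it emits are all illustrative assumptions, not the proposal's actual classes, and objects are modeled as plain property maps rather than Info instances.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Simplified predicate over property maps (the proposal's version is bound to Info types). */
interface MapPredicate {
    boolean apply(Map<String, Object> input);
}

/** Well-known predicate type: exposes its parts so a backend can inspect them. */
final class PropertyEquals implements MapPredicate {
    final String property;
    final Object expected;

    PropertyEquals(String property, Object expected) {
        this.property = property;
        this.expected = expected;
    }

    @Override
    public boolean apply(Map<String, Object> input) {
        Object actual = input.get(property);
        return expected == null ? actual == null : expected.equals(actual);
    }
}

/** Well-known predicate type: logical AND of its operands. */
final class And implements MapPredicate {
    final List<MapPredicate> operands;

    And(List<MapPredicate> operands) {
        this.operands = operands;
    }

    @Override
    public boolean apply(Map<String, Object> input) {
        for (MapPredicate operand : operands) {
            if (!operand.apply(input)) {
                return false;
            }
        }
        return true;
    }
}

/** Hypothetical backend-side translator producing a parameterized pseudo-SQL fragment. */
final class WhereClause {
    private WhereClause() {}

    /**
     * Returns the fragment and appends its parameters to "params", or returns
     * null (leaving "params" untouched) when the predicate is of an unknown
     * type and therefore has to be evaluated in-process.
     */
    static String translate(MapPredicate predicate, List<Object> params) {
        if (predicate instanceof PropertyEquals) {
            PropertyEquals eq = (PropertyEquals) predicate;
            params.add(eq.expected);
            return eq.property + " = ?";
        }
        if (predicate instanceof And) {
            List<MapPredicate> operands = ((And) predicate).operands;
            if (operands.isEmpty()) {
                return "1 = 1"; // an empty AND accepts everything
            }
            List<Object> nested = new ArrayList<Object>();
            StringBuilder sql = new StringBuilder("(");
            for (MapPredicate operand : operands) {
                String part = translate(operand, nested);
                if (part == null) {
                    return null; // one opaque operand makes the whole AND opaque
                }
                if (sql.length() > 1) {
                    sql.append(" AND ");
                }
                sql.append(part);
            }
            params.addAll(nested);
            return sql.append(")").toString();
        }
        return null;
    }
}
```

A predicate written as an anonymous class still works through `apply`, which keeps the general case functional, but only the well-known types can be optimized by the backend.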
API Validation
This section presents two ways of validating the Catalog API extension from this proposal. First, we'll migrate the code from the use cases to the new API to verify its usability and correctness. Then we'll provide a couple of Catalog backend implementations to verify its implementability and effectiveness.
Migration of identified sample offending code
In this section we will go through updating the Catalog client code identified as exemplary performance/scalability offenders in the Use Cases section, to the new API, in order to validate it in terms of usability.
Please note that all usage of Guava utility classes here is an incidental implementation detail. No such requirement exists at the API level.
- Solving Use Case 1: Lower the memory footprint and CPU utilization of SecureCatalogImpl
- Implement new methods: Implement new methods in SecureCatalogImpl in a way that the filtering of catalog objects not accessible to the current user is pushed back to the CatalogFacade, thus avoiding double creation of in-memory list of objects and wrapping all objects in a secure decorator just to throw away the ones not needed.
Short version:

    import static org.geoserver.catalog.Predicates.*;
    import org.geoserver.catalog.Predicate;

    class SecureCatalogImpl implements Catalog {
        ....
        @Override
        public <T extends CatalogInfo> int count(Class<T> of, Predicate<T> filter) {
            if (!isAdmin(user())) {
                filter = and(filter, securityFilter(of, user()));
            }
            return delegate.count(of, filter);
        }

        @Override
        public <T extends CatalogInfo> CloseableIterator<T> list(Class<T> of,
                Predicate<? super T> filter, Integer offset, Integer count,
                OrderBy sortOrder) {
            if (!isAdmin(user())) {
                filter = and(filter, securityFilter(of, user()));
            }
            // create secured decorators on-demand
            final Function<T, T> securityWrapper = securityWrapper(of, user());
            return CloseableIteratorAdapter.transform(
                    delegate.list(of, filter, offset, count, sortOrder), securityWrapper);
        }

        /**
         * @return a Function that applies a security wrapper over the catalog
         *         object given to it as input
         * @see #checkAccess(Authentication, CatalogInfo)
         */
        private <T extends CatalogInfo> Function<T, T> securityWrapper(
                final Class<T> forClass, final Authentication user) {
            ...
        }

        /**
         * @return a catalog Predicate that evaluates if an object of the required
         *         type is accessible to the given user
         */
        private <T extends CatalogInfo> Predicate<T> securityFilter(
                final Class<T> of, final Authentication user) {
            ...
        }
        ...
    }
- Leverage new API: Leverage new API in SecureCatalogImpl's existing code so that current bulk query methods avoid double creation of a java.util.List and in-process filtering of current user's accessible objects.
Short version:

    import static org.geoserver.catalog.Predicates.*;

    class SecureCatalogImpl implements Catalog {
        ...
        //BEFORE
        public List<LayerInfo> getLayers() {
            return filterLayers(user(), delegate.getLayers());
        }
        //AFTER
        public List<LayerInfo> getLayers() {
            return filterLayers(acceptAll());
        }

        //BEFORE
        public List<LayerInfo> getLayers(ResourceInfo resource) {
            return filterLayers(user(), delegate.getLayers(unwrap(resource)));
        }
        //AFTER
        public List<LayerInfo> getLayers(ResourceInfo resource) {
            return filterLayers(propertyEquals("resource.id", resource.getId()));
        }

        //BEFORE
        public List<LayerInfo> getLayers(StyleInfo style) {
            return filterLayers(user(), delegate.getLayers(style));
        }
        //AFTER
        public List<LayerInfo> getLayers(StyleInfo style) {
            Predicate<LayerInfo> filter = or(
                    propertyEquals("defaultStyle.id", style.getId()),
                    propertyEquals("styles.id", style.getId()));
            return filterLayers(filter);
        }

        //BEFORE
        protected List<LayerInfo> filterLayers(Authentication user, List<LayerInfo> layers) {
            List<LayerInfo> result = new ArrayList<LayerInfo>();
            for (LayerInfo original : layers) {
                LayerInfo secured = checkAccess(user, original);
                if (secured != null)
                    result.add(secured);
            }
            return result;
        }
        //AFTER
        private List<LayerInfo> filterLayers(final Predicate<LayerInfo> filter) {
            // the secured list() method already filters and wraps the objects
            CloseableIterator<LayerInfo> iterator = list(LayerInfo.class, filter);
            try {
                return ImmutableList.copyOf(iterator);
            } finally {
                iterator.close();
            }
        }
        ...
    }
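The on-demand decoration in the secured list() method depends on an iterator that applies a function to each element only when that element is requested, and that still forwards close() to the underlying iterator. A minimal sketch of such an adapter follows; `Mapper` and `TransformingIterator` are hypothetical names and this is not the prototype's actual CloseableIteratorAdapter.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Iterator;

/** Hypothetical single-method mapping function. */
interface Mapper<F, T> {
    T map(F input);
}

/** Applies a mapper to each element only when it is requested, and forwards close(). */
final class TransformingIterator<F, T> implements Iterator<T>, Closeable {
    private final Iterator<F> source;
    private final Mapper<? super F, ? extends T> mapper;

    TransformingIterator(Iterator<F> source, Mapper<? super F, ? extends T> mapper) {
        this.source = source;
        this.mapper = mapper;
    }

    @Override
    public boolean hasNext() {
        return source.hasNext();
    }

    @Override
    public T next() {
        // The wrapper object is created here, not ahead of time
        return mapper.map(source.next());
    }

    @Override
    public void remove() {
        throw new UnsupportedOperationException();
    }

    /** Releases the underlying iterator if it holds resources. */
    @Override
    public void close() {
        if (source instanceof Closeable) {
            try {
                ((Closeable) source).close();
            } catch (IOException e) {
                throw new IllegalStateException("Failed to close underlying iterator", e);
            }
        }
    }
}
```

With this shape, objects the caller never reaches (for example, those beyond the requested page) are never wrapped at all.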
- Solving Use Case 2: Leverage Catalog filtering, sorting and paging on LayerPage
BEFORE
    public class LayerProvider extends GeoServerDataProvider<LayerInfo> {

        @Override
        protected List<LayerInfo> getItems() {
            return getCatalog().getLayers();
        }
    }
AFTER
    public class LayerProvider extends GeoServerDataProvider<LayerInfo> {

        @Override
        protected List<LayerInfo> getItems() {
            throw new UnsupportedOperationException(
                    "This method should not be being called! "
                    + "We use the catalog streaming API");
        }

        @Override
        public int size() {
            return getCatalog().count(LayerInfo.class, getFilter());
        }

        @Override
        public int fullSize() {
            return getCatalog().count(LayerInfo.class, acceptAll());
        }

        @Override
        public Iterator<LayerInfo> iterator(final int first, final int count) {
            Iterator<LayerInfo> iterator = filteredItems(first, count);
            if (iterator instanceof CloseableIterator) {
                // don't know how to force wicket to close the iterator, lets return
                // a copy. Shouldn't be much overhead as we're paging
                try {
                    return Lists.newArrayList(iterator).iterator();
                } finally {
                    CloseableIteratorAdapter.close(iterator);
                }
            } else {
                return iterator;
            }
        }

        /**
         * Returns the requested page of layer objects after applying
         * any keyword filtering set on the page
         */
        private Iterator<LayerInfo> filteredItems(Integer first, Integer count) {
            ...
        }

        private Predicate<LayerInfo> getFilter() {
            ...
        }
    }
- Solving Use Case 3: Leverage Catalog filtering and sorting on WMS GetCapabilities generation.
BEFORE
    ...
    private void handleLayers() {
        start("Layer");

        final List<LayerInfo> layers;

        // filter the layers if a namespace filter has been set
        if (request.getNamespace() != null) {
            final List<LayerInfo> allLayers = wmsConfig.getLayers();
            layers = new ArrayList<LayerInfo>();
            String namespace = wmsConfig.getNamespaceByPrefix(
                    request.getNamespace());
            for (LayerInfo layer : allLayers) {
                Name name = layer.getResource().getQualifiedName();
                if (name.getNamespaceURI().equals(namespace)) {
                    layers.add(layer);
                }
            }
        } else {
            layers = wmsConfig.getLayers();
        }
        ...
        handleRootBbox(layers);
        ...
        // now encode each layer individually
        LayerTree featuresLayerTree = new LayerTree(layers);
        handleLayerTree(featuresLayerTree);
        ...
        List<LayerGroupInfo> layerGroups = wmsConfig.getLayerGroups();
        handleLayerGroups(layerGroups.iterator());
        ...
        end("Layer");
    }

    private void handleLayerTree(final LayerTree layerTree) {
        ...
    }
    }
AFTER
    ...
    private void handleLayers() {
        start("Layer");

        //ask for enabled and advertised to start with
        Predicate<LayerInfo> filter;
        {
            Predicate<LayerInfo> enabled = propertyEquals("enabled", Boolean.TRUE);
            Predicate<LayerInfo> advertised = propertyEquals("advertised", Boolean.TRUE);
            filter = Predicates.and(enabled, advertised);
        }

        // filter the layers if a namespace filter has been set
        if (request.getNamespace() != null) {
            //build a query predicate for the namespace prefix
            final String nsPrefix = request.getNamespace();
            final String nsProp = "resource.namespace.prefix";
            Predicate<LayerInfo> equals = propertyEquals(nsProp, nsPrefix);
            filter = Predicates.and(filter, equals);
        }
        ...
        final Catalog catalog = wmsConfig.getCatalog();
        CloseableIterator<LayerInfo> layers;

        OrderBy sortOrder = null;
        if (catalog.canSort(LayerInfo.class, "name")) {
            sortOrder = Predicates.sortBy("name", true);
        }
        layers = catalog.list(LayerInfo.class, filter, null, null, sortOrder);
        try {
            handleRootBbox(layers);
        } finally {
            layers.close();
        }
        ...
        // now encode each layer individually
        layers = catalog.list(LayerInfo.class, filter);
        try {
            handleLayerTree(layers);
        } finally {
            layers.close();
        }

        final Predicate<LayerGroupInfo> lgFilter = acceptAll();
        CloseableIterator<LayerGroupInfo> layerGroups =
                catalog.list(LayerGroupInfo.class, lgFilter);
        try {
            handleLayerGroups(layerGroups);
        } finally {
            layerGroups.close();
        }
        end("Layer");
    }
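Because an iterator returned by list() can be consumed only once, helpers such as handleRootBbox have to compute their result in a single pass, which is also why the code above issues a second query before encoding the layers. A minimal sketch of single-pass aggregation follows, assuming a hypothetical `Box` value type in place of the real envelope class and a plain Iterator as input.

```java
import java.util.Iterator;

/** Hypothetical axis-aligned bounding box; stands in for the real envelope type. */
final class Box {
    final double minX;
    final double minY;
    final double maxX;
    final double maxY;

    Box(double minX, double minY, double maxX, double maxY) {
        this.minX = minX;
        this.minY = minY;
        this.maxX = maxX;
        this.maxY = maxY;
    }

    /** Returns the smallest box containing both this box and the other one. */
    Box expandToInclude(Box other) {
        return new Box(
                Math.min(minX, other.minX), Math.min(minY, other.minY),
                Math.max(maxX, other.maxX), Math.max(maxY, other.maxY));
    }
}

final class Bounds {
    private Bounds() {}

    /** Folds the stream into one box without retaining its elements; null if empty. */
    static Box aggregate(Iterator<Box> boxes) {
        Box total = null;
        while (boxes.hasNext()) {
            Box current = boxes.next();
            total = (total == null) ? current : total.expandToInclude(current);
        }
        return total;
    }
}
```

Memory use stays constant regardless of the number of layers, since only the running total is kept.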
Multiple Back-End Implementations
In addition to the default Catalog implementation, a JDBC based catalog and configuration storage has been developed.
The current prototype for the JDBC backend is located at this github branch.
The jdbcconfig community module is based on the spring-jdbc framework and uses an RDBMS (either H2 or PostgreSQL at the time of writing) as a key/value store, with extra indices for the 'searchable' properties of Catalog objects.
The key in this single-table store is the object identifier and the value is its XStream representation, leveraging exactly the same serialization mechanism GeoServer uses for the on-disk catalog persistence.
This minimizes maintenance costs as the Catalog and configuration object model evolves, since only the XStream persistence code has to be maintained for both the on-disk and database backends.
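The storage layout described above can be pictured with two in-memory maps: one holding the serialized object per identifier, and one indexing searchable property values back to identifiers. This is only a sketch of the idea under those assumptions. The class and method names are hypothetical, plain strings stand in for the XStream payload, and the real module uses database tables rather than maps.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class KeyValueCatalogStore {
    /** id -> serialized object (the "single table"). */
    private final Map<String, String> blobs = new LinkedHashMap<String, String>();

    /** "property=value" -> ids of the objects having that value (the extra index). */
    private final Map<String, Set<String>> index = new HashMap<String, Set<String>>();

    /** Stores or replaces an object together with its searchable properties. */
    void put(String id, String serialized, Map<String, String> searchable) {
        for (Set<String> ids : index.values()) {
            ids.remove(id); // drop stale index entries from a previous version
        }
        blobs.put(id, serialized);
        for (Map.Entry<String, String> entry : searchable.entrySet()) {
            String key = entry.getKey() + "=" + entry.getValue();
            Set<String> ids = index.get(key);
            if (ids == null) {
                ids = new LinkedHashSet<String>();
                index.put(key, ids);
            }
            ids.add(id);
        }
    }

    /** Counts matches from the index alone, without deserializing anything. */
    int count(String property, String value) {
        Set<String> ids = index.get(property + "=" + value);
        return ids == null ? 0 : ids.size();
    }

    /** Returns the serialized objects whose indexed property equals the given value. */
    List<String> findByProperty(String property, String value) {
        Set<String> ids = index.get(property + "=" + value);
        if (ids == null) {
            return Collections.emptyList();
        }
        List<String> result = new ArrayList<String>();
        for (String id : ids) {
            result.add(blobs.get(id));
        }
        return result;
    }
}
```

The object model is never mapped to columns, so adding a property to a configuration object only affects the serialized payload and, if the property is searchable, the index.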
API Adoption Guidelines
- If you need to get a count of Catalog objects, use the count method instead of getXXX().size():
    int allLayers = catalog.count(LayerInfo.class, Predicates.acceptAll());
    int workspaceLayers = catalog.count(LayerInfo.class,
            Predicates.propertyEquals("resource.workspace.id", workspaceId));
- If only a subset of objects is needed, consider using a Predicate instead of in-process filtering:
    //BAD:
    for (LayerInfo layer : catalog.getLayers()) {
        if ("topp".equals(layer.getResource().getStore().getWorkspace().getName())) {
            //do something with layer
        }
    }

    //GOOD:
    Predicate<LayerInfo> filter =
            Predicates.propertyEquals("resource.store.workspace.name", "topp");
    Iterator<LayerInfo> layers = catalog.list(LayerInfo.class, filter);
    try {
        LayerInfo layer;
        while (layers.hasNext()) {
            layer = layers.next();
            // do something with layer
        }
    } finally {
        CloseableIteratorAdapter.close(layers);
    }
- Push sorting to the backend:
    //BAD:
    List<StyleInfo> styles = new ArrayList<StyleInfo>(catalog.getStyles());
    Comparator<StyleInfo> comparator = new Comparator<StyleInfo>() {
        @Override
        public int compare(StyleInfo s1, StyleInfo s2) {
            return s1.getName().compareTo(s2.getName());
        }
    };
    Collections.sort(styles, comparator);

    //GOOD:
    boolean ascending = true;
    OrderBy sortOrder = Predicates.sortBy("name", ascending);
    Iterator<StyleInfo> styles =
            catalog.list(StyleInfo.class, acceptAll(), null, null, sortOrder);
- Use catalog backend's paging, even if what you really want is a List and not an Iterator:
    int startIndex = 50;
    int pageSize = 25;

    //BAD:
    List<LayerInfo> layers = catalog.getLayers();
    List<LayerInfo> page = layers.subList(startIndex, startIndex + pageSize);

    //GOOD:
    Iterator<LayerInfo> pageIterator =
            catalog.list(LayerInfo.class, acceptAll(), startIndex, pageSize, null);
    List<LayerInfo> page;
    try {
        page = com.google.common.collect.Lists.newArrayList(pageIterator);
    } finally {
        CloseableIteratorAdapter.close(pageIterator);
    }
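Since the proposed CloseableIterator extends java.io.Closeable, code running on Java 7 or later could replace the try/finally blocks in these guidelines with try-with-resources. The sketch below shows a small helper built that way. `CatalogIterators` and `drain` are hypothetical names, and the interface is repeated only so that the example compiles on its own.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Same shape as the interface proposed above, repeated so the sketch is self-contained. */
interface CloseableIterator<T> extends Iterator<T>, Closeable {
    @Override
    void close();
}

final class CatalogIterators {
    private CatalogIterators() {}

    /** Copies the remaining elements into a list and always closes the iterator. */
    static <T> List<T> drain(CloseableIterator<T> iterator) {
        // close() declares no checked exception, so no catch clause is needed
        try (CloseableIterator<T> it = iterator) {
            List<T> result = new ArrayList<T>();
            while (it.hasNext()) {
                result.add(it.next());
            }
            return result;
        }
    }
}
```

Such a helper is only appropriate for bounded results, such as a single page. Draining an unfiltered, unpaged query would reintroduce the full in-memory list the new API is meant to avoid.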
Feedback
This section should contain feedback provided by PSC members who may have a problem with the proposal.
Backwards Compatibility
Backwards compatibility is preserved since the API changes are additive only. All existing code using the current API will keep working untouched.
Voting
Andrea Aime:
Alessio Fabiani:
Ben Caradoc-Davies:
Gabriel Roldán:
Justin Deoliveira:
Jody Garnett:
Mark Leslie:
[~roba]:
Simone Giannecchini:
Links
[JIRA Task|]
[Email Discussion|]
[Wiki Page|]