Coherence itself has always had the capability to “run across a WAN” in some fashion or another (*Extend or custom CacheStores being the pre-dominate solution). While solutions in the past have usually required some level of customization, mainly at the application level because every application is completely different in terms of replication strategy, with the introduction of the Push Replication Pattern example on the Coherence Incubator, essentially an open-framework for solving distributed Data Grid replication, things are a lot easier.
To be honest, over the past few years I’ve heard a lot of non-sense about the use of Push Replication Pattern as a strategy for keeping sites in sync (using push rather than a pull model), especially around it’s applicability as a solution for large scale distributed replication challenges. While the comments have seemingly originated from competitors of Coherence in the Data Grid space (feedback is that they don’t have these features) or from people that have never really solved these types of problems (what would we do without critics huh? :P), it’s pretty clear why…. simply understanding the possible combinations of technologies coupled with the replication semantics and business rules makes globally distributed replication a very challenging problem. (Note: I was told this stuff even before I joined Tangosol when I was investigating Data Grids).
For example: Often every name space (say cache) in a system will have completely different semantic requirements for replication. Some may require an active passive strategy, some may require active + active, some may hub-and-spoke, and so on. When you have a system with twenty of these different requirements it’s clear why “one single hard coded solution” or “one button in the tin” is not flexible enough for architects. It’s not about “things being too complex”, it’s about supporting business models that are “complex”. Ultimately what you can do in a single JVM, or the strategies that work between a few servers in a single data center are nothing compared to keeping things running in real-time across the globe. Actually, usually the rules are completely orthogonal.
Having been personally responsible for these types of solutions in the past (prior to joining Tangosol and Oracle over four years ago), the Coherence-based Push Replication Pattern example seems to demonstrate most of these challenges can be solved, especially in the Data Grid space.
In a nutshell here’s what the Oracle solution is designed to deliver;
1. Completely Asynchronous Replication (no blocking between sites – especially those distributed around the world!)
2. Many deployment models; from one to one, one to many, many to one and many to many sites (not just single servers)
3. Simultaneous replication in multiple directions (those active + passive and hub + spoke are also supported)
4. Guaranteed ordering of updates
5. Completely monitorable (from any point using JMX) (and now also has third party monitoring tool integration)
6. Pluggable (business rules based) Conflict Resolution (on a per-cache and site basis)
7. Native Java (but supports portable objects from .NET and C++)
8. Provided with complete source code.
9. You can continue to use the Coherence APIs as they are… no changes in either the product (either statically or dynamically with byte-code manipulation) or your application code.
Ok… enough of the background and marketing crapolla… Here’s a very common deployment scenario.
Consider the requirement where a company has three different regional operating centers. Say for example the deployment model that exists in virtually every investment bank (before most of them recently turned into “savings banks”). Each typically has an East Coast operation (in the United States), one operation in Europe, and lonely centre somewhere in the far reaches of the Asia Pacific :P, each being “active” and each with their own local business continuity (disaster recovery site). So a total of six sites.
ASIDE: As I’ve always said, and recently talked about at the NY and London Coherence SIGs, once you have an architecture that is globally distributed, you effectively have to double the number of sites you need to keep in sync (to allow for regional disaster recovery), regardless of whether everything is hot or not. Why? Because everyone wants everything as hot as possible (and 24 x 7) as their requirement/dream is “anyone should be able to trade from anywhere at anytime”. Even for the smallish firms I’ve worked for in the past eight years, this was exactly the requirement – trading everywhere.
Doing this with a bunch of servers, even say just deploying six around the world, would be a significant challenge for most technologies. The big challenge however is coping with capacity requirements, especially when those servers don’t have enough. In the extreme cases, shipping log files around, or using a messaging bus, simply doesn’t cut it, and this is where a Coherence-based solution can provide the capacity, low-latency and high-resilience typically required.
So what do the Coherence-based deployments look like? Well instead of six servers, there are usually somewhere between say twenty and several hundred Coherence servers at each site, most mainly configured to be part of a regional Compute Grid, usually for a risk management/trade position keeping/reference data solution. Thus in terms of replication the challenge is to keep not six, but somewhere between 6 x 20 = 120 and 6 x (say) 100 = 600 Coherence servers in sync.
While these numbers are just examples, I’ve recently have been involved with implementations and discussions concerning Push Replication that range from three sites to over five thousands sites (sounds a bit extreme but in some industries – non financial – this type of capacity is “normal”), with anything from three servers per-site to hundreds per-site. In reality, almost all investment banks operate the same way; three or four major centers with a few hundred servers per site (typically per application).
Just to be clear and not to sound like I’m taking ownership of these patterns, while the Coherence implementation of Push Replication is relatively new, the concepts are not. Just solving this on a large scale with a flexible example framework is. The fact is, WAN-based solutions are becoming more prevalent and solutions for implementing them more commonly known as the requirements for operating on a global basis in real-time is dramatically increasing.
In some ways it’s what you “don’t have to do” that makes the Coherence solution unique.
1. When you want to scale (even more), you simply add more servers at run-time. There’s no need to shut an entire site down (or the system globally) to reconfigure, use a console or anything else, you just start more servers.
2. You also don’t need to setup up “gateways” (single points of failure) to do the work between the sites. Each cache uses a set of “publishers between sites” and these are managed as an integral “part of the grid(s)”. Basically this means there is no need to separately configure them (from the rest of the infrastructure). Also, if they die (say because they are killed off to be then upgraded), everything just fails over and continues where it left off. This is usually sub-second.
The great thing about the Coherence Push Replication implementation is that it’s been customer driven. Like all of the features in Coherence itself, the features and semantics for the Coherence Push Replication example implementation have come from a variety of firms, some small and some large, each kind enough to invest some of their time in providing guidance around their unique challenges. It’s been great working with these firms, each themselves distributed around the world, and I’m sure they are looking forward to the next generation of the Push Replication Pattern implementation, including adding support for cache-based and directional filtering of updates between sites, together with coalescing of updates for even greater network through-put. (it already supports compression)