Coherence Push Replication

Coherence itself has always had the capability to “run across a WAN” in some fashion or another (*Extend or custom CacheStores being the pre-dominate solution).  While solutions in the past have usually required some level of customization, mainly at the application level because every application is completely different in terms of replication strategy, with the introduction of the Push Replication Pattern example on the Coherence Incubator, essentially an open-framework for solving distributed Data Grid replication, things are a lot easier.

To be honest, over the past few years I’ve heard a lot of non-sense about the use of Push Replication Pattern as a strategy for keeping sites in sync (using push rather than a pull model), especially around it’s applicability as a solution for large scale distributed replication challenges. While the comments have seemingly originated from competitors of Coherence in the Data Grid space (feedback is that they don’t have these features) or from people that have never really solved these types of problems (what would we do without critics huh? :P), it’s pretty clear why…. simply understanding the possible combinations of technologies coupled with the replication semantics and business rules makes globally distributed replication a very challenging problem.  (Note: I was told this stuff even before I joined Tangosol when I was investigating Data Grids).

For example:  Often every name space (say cache) in a system will have completely different semantic requirements for replication.  Some may require an active passive strategy, some may require active + active, some may hub-and-spoke, and so on.  When you have a system with twenty of these different requirements it’s clear why “one single hard coded solution” or “one button in the tin” is not flexible enough for architects.  It’s not about “things being too complex”, it’s about supporting business models that are “complex”.  Ultimately what you can do in a single JVM, or the strategies that work between a few servers in a single data center are nothing compared to keeping things running in real-time across the globe. Actually, usually the rules are completely orthogonal.

Having been personally responsible for these types of solutions in the past (prior to joining Tangosol and Oracle over four years ago), the Coherence-based Push Replication Pattern example seems to demonstrate most of these challenges can be solved, especially in the Data Grid space.  

In a nutshell here’s what the Oracle solution is designed to deliver;

1. Completely Asynchronous Replication (no blocking between sites – especially those distributed around the world!)
2. Many deployment models; from one to one, one to many, many to one and many to many sites (not just single servers)
3. Simultaneous replication in multiple directions (those active + passive and hub + spoke are also supported)
4. Guaranteed ordering of updates
5. Completely monitorable (from any point using JMX) (and now also has third party monitoring tool integration)
6. Pluggable (business rules based) Conflict Resolution (on a per-cache and site basis)
7. Native Java (but supports portable objects from .NET and C++)
8. Provided with complete source code.
9. You can continue to use the Coherence APIs as they are… no changes in either the product (either statically or dynamically with byte-code manipulation) or your application code. 

Ok… enough of the background and marketing crapolla… Here’s a very common deployment scenario.

Consider the requirement where a company has three different regional operating centers.  Say for example the deployment model that exists in virtually every investment bank (before most of them recently turned into “savings banks”).  Each typically has an East Coast operation (in the United States), one operation in Europe, and lonely centre somewhere in the far reaches of the Asia Pacific :P, each being “active” and each with their own local business continuity (disaster recovery site).  So a total of six sites.  

ASIDE: As I’ve always said, and recently talked about at the NY and London Coherence SIGs, once you have an architecture that is globally distributed, you effectively have to double the number of sites you need to keep in sync (to allow for regional disaster recovery), regardless of whether everything is hot or not.   Why?  Because everyone wants everything as hot as possible (and 24 x 7) as their requirement/dream is “anyone should be able to trade from anywhere at anytime”.  Even for the smallish firms I’ve worked for in the past eight years, this was exactly the requirement  – trading everywhere.

Doing this with a bunch of servers, even say just deploying six around the world, would be a significant challenge for most technologies.  The big challenge however is coping with capacity requirements, especially when those servers don’t have enough.  In the extreme cases, shipping log files around, or using a messaging bus, simply doesn’t cut it, and this is where a Coherence-based solution can provide the capacity, low-latency and high-resilience typically required.

So what do the Coherence-based deployments look like?  Well instead of six servers, there are usually somewhere between say twenty and several hundred Coherence servers at each site, most mainly configured to be part of a regional Compute Grid, usually for a risk management/trade position keeping/reference data solution.  Thus in terms of replication the challenge is to keep not six, but somewhere between 6 x 20 = 120 and 6 x (say) 100 = 600 Coherence servers in sync.  

While these numbers are just examples, I’ve recently have been involved with implementations and discussions concerning Push Replication that range from three sites to over five thousands sites (sounds a bit extreme but in some industries – non financial – this type of capacity is “normal”), with anything from three servers per-site to hundreds per-site.  In reality, almost all investment banks operate the same way; three or four major centers with a few hundred servers per site (typically per application).

Just to be clear and not to sound like I’m taking ownership of these patterns, while the Coherence implementation of Push Replication is relatively new, the concepts are not.  Just solving this on a large scale with a flexible example framework is.  The fact is, WAN-based solutions are becoming more prevalent and solutions for implementing them more commonly known as the requirements for operating on a global basis in real-time is dramatically increasing.

In some ways it’s what you “don’t have to do” that makes the Coherence solution unique.

1. When you want to scale (even more), you simply add more servers at run-time. There’s no need to shut an entire site down (or the system globally) to reconfigure, use a console or anything else, you just start more servers.  

2. You also don’t need to setup up “gateways” (single points of failure) to do the work between the sites.  Each cache uses a set of “publishers between sites” and these are managed as an integral “part of the grid(s)”.  Basically this means there is no need to separately configure them (from the rest of the infrastructure). Also, if they die (say because they are killed off to be then upgraded), everything just fails over and continues where it left off.  This is usually sub-second.

The great thing about the Coherence Push Replication implementation is that it’s been customer driven.  Like all of the features in Coherence itself, the features and semantics for the Coherence Push Replication example implementation have come from a variety of firms, some small and some large, each kind enough to invest some of their time in providing guidance around their unique challenges.  It’s been great working with these firms, each themselves distributed around the world, and I’m sure they are looking forward to the next generation of the Push Replication Pattern implementation, including adding support for cache-based and directional filtering of updates between sites, together with coalescing of updates for even greater network through-put.  (it already supports compression)


8 responses to “Coherence Push Replication

  1. Can you share details on how much data is be replicated (transactions / sec, total, # of objects)

    • At the moment it’s difficult to share such information due to confidentiality. Basically no one in the finance industry will give away anything about what they do, especially figures. I’d certainly like to publish the information, but I’ll need a bunch of approvals first 😦

      The answers however are very much dependent on the underlying infrastructure, and less so about Coherence itself. It’s pretty easy to do some benchmarks for yourself. On my setup at home (a couple of new generation Mac’s), it’s easy to sustain over 1000 object updates per-second, per-cache, fully ordered with conflict resolution. However this number varies a lot based on underlying hardware and network topologies. eg: At some sites I’ve seen 2x more than this, at others I’ve seen 0.5x of this. The important part is that everyone (so far) seems happy with the performance, scalability, availability and resilience of the solution, especially in light of the “cost” to implement. eg: Drop in a few jars, add a CacheStore (to each cache you want replicated) and configure the end-points.

      eg: Last week we managed to walk a group through setting up London to Sydney replication in around 20 mins, from start to finish, without recompiling their application.

  2. How does the hub-less messaging concept fit in with this?

    • The Coherence Messaging Pattern implementation (that is essentially hub-less) is used as infrastructure to guarantee cache entry operations (Insert, Update, Deletes) are propagated in-order (batched) asynchronously between sites. However, it’s not really necessary. IN the past we’ve implemented similar push-replication patterns using bulk-standard messaging infrastructure – albeit with lower through-put.

      In reality while I’ve personally been interested in Messaging on-top-of-Coherence for a number of years (I think my first request for the features and initial design was made over five years ago), we essentially created the Messaging Pattern implementation to support the Push Replication Pattern implementation and thus avoid requiring alternative messaging infrastructure.

      That said, the Messaging Pattern 2.3.0 implementation is now more of a “true store-and-forward messaging solution” than ever before as it supports Queues, Durable and Non-Durable Subscriptions together with Transactions (on the Subscriber side). In all practical sense, none of these messaging features are really required for Push Replication. In fact in the first (not released) version of the Push Replication implementation, the “messaging” component was embedded. ie: There was no clear separation between the two patterns. It was only after working with several other customers, explicitly wanting to see how they could do “store and forward messaging” on-top-of Coherence, did I separate it out.

      The same also applies to the Command Pattern implementation. It was once embedded but now you can use it separately without any of the other higher-level patterns. Based on download histories, it’s clear that the Command Pattern is used more often than the higher-level patterns, which makes sense really. The Command Pattern can solve a lot more problems than the other solutions (being more generic).

      In the next version of the Push Replication Pattern we’re going to “relax” some of the ordering constraints to massively increase throughput. Not that it’s been reported as a problem, but it’s something most systems don’t really require. ie: They don’t need ordering-per-cache, but instead ordering-per-entry is sufficient enough. Most of this work is based on a solution we produced nearly four years ago for a customer that wanted transactional updates to propagated between a few hundred sites. We’re just cleaning it up and making it freely available. ie: you could build it yourself now, but we’re going to create a working example for everyone.

  3. So this thing has been implemented to avoid using JMS infrastructure? I thought hub-less was more of an alternative architecture. What does this give above JMS other than simply replacing it?

  4. From our experience and customer architectures, most JMS implementations are based on “hub-and-spoke” architectures. Consequently pushing an entire Data Grid worth of updates into a hub-and-spoke based JMS (or other messaging solution) causes some concerns; a). performance, b). scalability, c). ability to handle the capacity demands, d). configuration and management, e) high-availability (without hitting disk).

    So instead of going down the JMS route (not that there’s anything wrong with JMS, it’s the implementations we were concerned about), we developed a similar store-and-forward solution that uses a Data Grid address the scale-out, performance, availability etc demands. While we *could* have also produced yet-another-JMS implementation, we had a pragmatic approach. We didn’t need a JMS stack, just Topics and Durable Subscriptions, which high-performance resilience and message availability.

    Since then we’ve received more requests for “more JMS-like” features in the Coherence Messaging Pattern implementation, so we’ve experimented and produced some examples.

    We *could* implement a full JMS stack, but this hasn’t really been requested. Further, doing such a thing is a big task, especially if we have to be standards compliant, meet the spec and pass the Java EE TCK (for JMS).

    What most people see when they use this is the ability to embed the messaging system directly into their Data Grid-based application. It means they don’t need to provision and separately manage servers, as it’s all part of the same infrastructure.

    Personally I just see it as yet another programming model that can be used with and/or on-top-of-Coherence.

  5. Brian,

    What thoughts do you have regarding best practices for designing a grid implementation across data centers with regards to disaster recovery? For example, if n grids are set up with push-replication and one site goes down, can the other grids push over their data to re-sync with the one that went down after it comes back up, or does that need to be done locally through a persistence

    • Sorry for the delay. Basically the Push Replication Pattern will support n-way replication and the n-1 sites will hold back (queue) “replications” that are destined for the one (or more) sites that is unavailable (for some reason).

      Currently the Push Replication has no functionality to “send over a copy” of an entire data center. We’ve seen people implement this themselves, but we’re going to add this to the next+1 release (I think).

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s