Streaming Data Clears the Path for Legacy Systems Modernization
At the recent Cloud Foundry Summit, I caught up with my colleague Jeff Cherng, Advisory Data Engineer at Pivotal, and Anupama Pradhan, Senior Technology Architect at Health Care Service Corporation (HCSC). Jeff and Anupama teamed up to co-present a session on how data streaming forms a key ingredient in HCSC’s legacy system modernization.
HCSC’s use of real-time data streams, rather than batch-oriented ETL, is in keeping with a shift that is well underway, and for good reason. With batch ETL, the moment you have completed the ‘E’ in ‘ETL’, the data starts to age until the next batch run. This lack of fresh data is a showstopper for many use cases. Furthermore, the complexity of batch-oriented ETL makes it fragile, slow, operationally costly, and resource-intensive.
In contrast, streaming platforms are built to process high-volume, high-diversity data. They are real-time from the ground up, and they encourage a transition to event-centric thinking. Streaming pipelines, like the ones built by HCSC, can handle large data sets and complex database queries while keeping data latency between the source and destination data stores low. This design is far more tunable and scalable than batch ETL approaches.
HCSC uses data streams for a ‘leave and layer’ approach to legacy systems modernization. In this post, I’ll first discuss this approach in broad terms, and then come back to HCSC’s specific implementation.
Legacies that Don’t Bind
Legacy applications built on Enterprise Service Buses and RDBMSs have hit limitations that make further investment in these systems disappointingly ineffective. The complexity of these applications makes them extremely brittle, with many points of failure, and difficult to scale. These diminishing returns are worsened by the steep pricing curves typically associated with scaling these products (e.g., MIPS-based mainframe pricing, or RDBMS pricing for added capacity).
Yet, it’s often impractical to undertake a wholesale replacement of these systems because they are well entrenched and form the basis of the previous generation of automation, which the business now depends on. Their future potential may be limited, but the value they deliver today is well understood, predictable, and important for the business.
In these cases, what’s needed is an evolutionary approach that preserves the value of legacy systems while opening the door to extending these applications around the edges with modern, scalable, more agile, and cost-effective approaches. Fortunately, this evolutionary approach does exist, and data streaming can be a key technical underpinning that makes it practical. HCSC is a case in point, as we will see later.
The Role of Data Streams In Legacy System Modernization
Data streams originating from the legacy databases can populate a new data layer that forms the basis of a new generation of applications built on modern microservices architectures. These next-generation applications can run on a cloud-native platform, like Pivotal Cloud Foundry, which opens the door to a profound shift in the velocity and flexibility with which these applications are built, not to mention all the operational automation provided by the platform.
The new-generation data layer can use in-memory caches, which provide the performance, availability, and agility that modern applications depend on. These caching layers mediate between the legacy data store and the microservices that rely on each cache as their data layer.
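To make the mediation concrete, here is a minimal sketch of a microservice reading through such a cache. The types (MemberProfile, LegacyMemberDao) are hypothetical, not HCSC's actual code; the streaming pipeline described later is what keeps the cache populated, so the legacy store is only consulted on a miss.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class MemberProfileService {

    // Illustrative types; real code would use the domain model and a legacy-DB DAO.
    public record MemberProfile(String memberId, String name) {}
    public interface LegacyMemberDao { Optional<MemberProfile> findById(String memberId); }

    // Stand-in for the in-memory caching layer (a GemFire region, in HCSC's case).
    private final ConcurrentMap<String, MemberProfile> cache = new ConcurrentHashMap<>();
    private final LegacyMemberDao legacyDao;

    public MemberProfileService(LegacyMemberDao legacyDao) {
        this.legacyDao = legacyDao;
    }

    // Microservices read from the cache; the streaming pipeline keeps it current,
    // so the legacy store is only consulted on a cache miss.
    public Optional<MemberProfile> findMember(String memberId) {
        MemberProfile cached = cache.get(memberId);
        if (cached != null) {
            return Optional.of(cached);
        }
        Optional<MemberProfile> fromLegacy = legacyDao.findById(memberId);
        fromLegacy.ifPresent(profile -> cache.put(memberId, profile));
        return fromLegacy;
    }
}
```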
This evolutionary approach embraces legacy systems by protecting and extending investments in them. Existing workloads that are stable and well supported by the legacy system can continue to run against it until a major change or upgrade is needed, at which point a microservices-based replacement can be considered. Meanwhile, new microservices benefit from the low-latency, highly concurrent performance and the event-driven mechanisms delivered via the caching layers.
This evolutionary build-out can optionally be coupled with other approaches, such as re-platforming parts of the legacy application and, over time, offloading functionality to microservices, commonly referred to as the strangler pattern.
A Case in Point: HCSC’s Legacy System Modernization
As Jeff and Anupama explained in their session, HCSC generates data streams from three main legacy data sources, which are then consolidated into a single stream for pushing to the target. Changes to the three sources are captured through a combination of file extracts and events raised by changes to source data. The event-based approach works well when data latency matters because it pushes changes as they occur, unlike the file-extract or request-response approaches, which only see changes after the next extract or request.
Apart from providing a data layer for the next generation of microservices-based apps, HCSC’s data streaming approach is itself built from microservices. Because the streaming process is broken down into microservices, the stream can be scaled by adding instances wherever more throughput is needed.
HCSC uses an external change data capture tool to assemble changes from the legacy source database logs, in this case DB2. This is less intrusive to the existing workload than using database triggers to capture changes. What is captured is not the changed data values, but metadata about the change. This metadata is posted as events in an event table.
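To give a feel for what such an event-table entry might carry, here is a minimal sketch; the field names and status values are illustrative assumptions, not HCSC's actual schema.

```java
import java.time.Instant;

// A minimal sketch of an event-table row: metadata about a change, not the changed values.
// Field names and statuses are illustrative assumptions, not HCSC's actual schema.
public record ChangeEvent(
        long eventId,          // primary key of the event-table row
        String sourceTable,    // which legacy DB2 table changed
        String rowKey,         // key of the changed row, used to fetch the data later
        String operation,      // INSERT, UPDATE, or DELETE
        Instant capturedAt,    // when the CDC tool observed the change in the log
        String status,         // e.g. UNDELIVERED, DELIVERED, FAILED
        int retryCount         // incremented on each redelivery attempt (see Error Handling)
) {}
```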
A listener microservice associated with the source reads the event table in micro batches, which performs much better than reading events one at a time. The listener then sends events to one of many instances of processors that extract and transform data. The processor instance handling the request uses the event metadata to reach back into the legacy database, extract the data, and transform it as needed by the target system. The extract and transform steps tend to be performance-intensive, and this is where the microservices approach of adding instances to scale the process really adds value. The transformed data is sent to a sink service for writing to the target.
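The overall shape of such a flow can be sketched with Spring Cloud Stream's functional style. This is only a sketch under assumed names: it reuses the ChangeEvent record from the sketch above, and LegacyExtractor and TargetWriter are hypothetical stand-ins for the real DB2 extract and GemFire write code, not HCSC's actual pipeline.

```java
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;

// Sketch of the processor and sink stages as Spring Cloud Stream functions.
// A listener stage (polling the event table in micro batches) would publish
// ChangeEvent messages to the processor's input binding.
@SpringBootApplication
public class StreamPipelineApplication {

    // Hypothetical collaborators standing in for the real extract and target-write code.
    public interface LegacyExtractor { Map<String, Object> fetch(String sourceTable, String rowKey); }
    public interface TargetWriter { void write(String key, Map<String, Object> value); }
    public record TargetRecord(String key, Map<String, Object> attributes) {}

    public static void main(String[] args) {
        SpringApplication.run(StreamPipelineApplication.class, args);
    }

    @Bean
    public LegacyExtractor legacyExtractor() {
        // Placeholder: a real implementation would query DB2 using the event metadata.
        return (table, key) -> Map.<String, Object>of("table", table, "key", key);
    }

    @Bean
    public TargetWriter targetWriter() {
        // Placeholder: a real implementation would put into a GemFire region (see below).
        return (key, value) -> System.out.println("write " + key + " -> " + value);
    }

    // Processor: use the event metadata to reach back into the legacy database and
    // transform the row into the shape the target store expects. Scaling this stage
    // means running more instances of the processor application.
    @Bean
    public Function<ChangeEvent, TargetRecord> process(LegacyExtractor extractor) {
        return event -> new TargetRecord(
                event.rowKey(),
                extractor.fetch(event.sourceTable(), event.rowKey()));
    }

    // Sink: write each transformed record to the target cache.
    @Bean
    public Consumer<TargetRecord> sink(TargetWriter writer) {
        return record -> writer.write(record.key(), record.attributes());
    }
}
```

With a RabbitMQ binder on the classpath, these functions would be wired to exchanges through ordinary Spring Cloud Stream binding configuration, so adding throughput is a matter of running more instances of whichever stage is the bottleneck.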
HCSC’s target system is Pivotal GemFire, an in-memory cache that can handle highly concurrent workloads with high throughput and low latency. The messaging backbone for the data streaming workflow is RabbitMQ, and Spring Cloud Stream is used to define and automate the flow; both are provided as services on Pivotal Cloud Foundry.
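A sink writing into GemFire can be sketched with the Apache Geode client API (the open-source core that GemFire is based on). The locator address, region name, and value type below are assumptions for illustration, not HCSC's configuration; on Pivotal Cloud Foundry the connection details would come from the bound GemFire service instance rather than being hard-coded.

```java
import org.apache.geode.cache.Region;
import org.apache.geode.cache.client.ClientCache;
import org.apache.geode.cache.client.ClientCacheFactory;
import org.apache.geode.cache.client.ClientRegionShortcut;

// A minimal sketch of a sink writing transformed records into a GemFire region.
// Locator address, region name, and value type are illustrative assumptions.
public class GemFireSink {

    private final Region<String, String> members;

    public GemFireSink() {
        ClientCache cache = new ClientCacheFactory()
                .addPoolLocator("locator.example.com", 10334)   // assumed locator
                .create();
        // PROXY regions hold no local state; all puts go straight to the GemFire servers.
        this.members = cache.<String, String>createClientRegionFactory(ClientRegionShortcut.PROXY)
                .create("Members");                               // assumed region name
    }

    // Each transformed record from the pipeline becomes a put into the cache region.
    public void write(String key, String jsonValue) {
        members.put(key, jsonValue);
    }
}
```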
Cache Warming
At startup, HCSC warms the GemFire cache with an initial load that leverages the same mechanism used for incremental loads. The initial load is achieved by generating synthetic events that reflect a snapshot of the source database at startup. Using a single mechanism for both the initial and incremental loads simplifies the process.
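The synthetic-event idea can be sketched as follows, assuming the same hypothetical event-table layout as before and a made-up member table as the source; this illustrates the pattern rather than HCSC's actual load job.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// A minimal sketch of cache warming by generating synthetic change events.
// Table and column names are illustrative assumptions, not HCSC's actual schema.
public class CacheWarmer {

    // For every row in the source snapshot, insert an UNDELIVERED "INSERT" event,
    // so the ordinary listener/processor/sink pipeline performs the initial load.
    public void warm(Connection db) throws SQLException {
        String insertEvent =
                "INSERT INTO change_event (source_table, row_key, operation, status, retry_count) "
              + "VALUES (?, ?, 'INSERT', 'UNDELIVERED', 0)";
        try (Statement snapshot = db.createStatement();
             ResultSet rows = snapshot.executeQuery("SELECT member_id FROM member");
             PreparedStatement event = db.prepareStatement(insertEvent)) {
            while (rows.next()) {
                event.setString(1, "MEMBER");
                event.setString(2, rows.getString("member_id"));
                event.addBatch();   // batch the synthetic events for efficiency
            }
            event.executeBatch();
        }
    }
}
```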
Error Handling
Data that doesn’t make it through the pipeline because of an error is sent through the pipeline again (up to three times), because the HCSC team noticed that in many cases the rejected data goes through on the second or third attempt. Building retries into the flow reduces the number of error conditions and minimizes the amount of data that has to be dealt with through manual intervention.
This error handling is implemented as a feedback loop into the event table that resets the event’s status to undelivered, so that it is picked up again by the listener. The event table maintains a retry counter so that retries stop after three attempts.
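The feedback loop can be sketched against the same hypothetical event table. The SQL and the exact way attempts are counted are assumptions, but the shape, failed events being reset to undelivered until a limit is reached, follows the description above.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// A minimal sketch of the retry feedback loop against the event table.
// Table/column names and status values are illustrative assumptions.
public class RetryHandler {

    private static final int MAX_ATTEMPTS = 3;   // event is attempted at most three times in total

    // Called when an event fails somewhere in the pipeline. Events still under the limit
    // are reset to UNDELIVERED so the listener picks them up again; the rest are parked
    // as FAILED for manual intervention.
    public void recordFailure(Connection db, long eventId) throws SQLException {
        String sql =
                "UPDATE change_event "
              + "SET retry_count = retry_count + 1, "
              + "    status = CASE WHEN retry_count + 1 < ? THEN 'UNDELIVERED' ELSE 'FAILED' END "
              + "WHERE event_id = ?";
        try (PreparedStatement update = db.prepareStatement(sql)) {
            update.setInt(1, MAX_ATTEMPTS);
            update.setLong(2, eventId);
            update.executeUpdate();
        }
    }
}
```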
Field of Streams
Clearly, data streaming is not just another shiny object in a hype cycle that will soon recalibrate. There is real value in this architectural shift, and HCSC’s use of data streams is yet another data point validating this generational change. It’s exciting to see the new use cases for data streams, including legacy systems modernization as featured by HCSC.