So you've just joined a new organization with an established engineering team and a decade-old codebase. You're starting to notice the classic signs of a long-term feature factory—where the focus is more on delivering features than on maintaining quality—and there's a daunting pile of well-known tech debt. There's significant momentum behind doing things the way they've always been done. Yet you've been brought in to disrupt this pattern and help the organization explore new business verticals. With many possible paths forward, how can you lead the team most effectively?
Developing significant new functionality while maintaining legacy systems is a real challenge for many organizations. This balancing act requires a strategic approach to manage risks, optimize resources, and ensure seamless operations. In this article, we delve into strategies for evaluating legacy systems, understanding the associated risks and trade-offs, managing people and processes effectively, achieving system synchronization, and learning from real-world case studies.
Evaluation of Legacy Systems
Observe and Gather Data
Before undertaking any modifications, it's essential to make a concerted effort to fully understand the intricacies of existing legacy systems. Adopting a "read-only" observational mode allows you to identify strengths and weaknesses without prematurely altering functionality that is well-optimized and still effectively solving problems for stakeholders. At any organization, there's plenty of tribal knowledge that won't be on your radar if you're just joining. Keep an eye out for those with this tribal knowledge and consider inviting them for coffee until you understand what they know.
Don't try to turn a chair into a table.
Is the system working from a product perspective? If so, then embarking on a rewrite is a very slippery slope. If not, then you've got a stronger argument for taking a fresh approach and building migrations to pull in the existing legacy data.
There are three categories of information to be aware of:
- Known Knowns
- These are fairly obvious items that are either well-documented, or easy enough to reason about by just reading the code. In an ideal world, there is a comprehensive test suite that covers all scenarios. I'd love to work for an org in the aforementioned ideal world, but sadly I don't think it exists.
- Known Unknowns
- Slightly more amorphous, these are items you know need more research. There are gaps, but you know roughly where the bodies are buried.
- Unknown Unknowns
- These are tricky. The team doesn't know they need to bring you up to speed on this because it's not even on their radar. This includes edge cases you haven't encountered yet and mysterious modules that, "just work, don't touch them."
Not Invented Here (NIH) Syndrome
How do you evaluate whether the system is actually working or whether you simply don't like it?
A common trap organizations fall into is the desire to build new solutions from scratch—often driven by the NIH syndrome. This bias can lead to unnecessary redevelopment instead of enhancing existing systems that are already effective. Critical evaluation helps determine whether to innovate upon the old or start anew, considering both the immediate needs and long-term strategic goals. A sense of NIH permeated the halls when I worked as a Software Developer at Microsoft many years ago.
Over the years, I developed my approach: build the components that truly define and differentiate the organization, and buy or adopt everything else. Auth is a common example on this front. There are maybe five companies in the world that should be building their own auth system from scratch, and chances are high that your company is not one of them. There are great commercial solutions out there like Okta, Auth0, Cognito, Microsoft Entra, and others. If your cost model doesn't support the overhead of a third party, I've implemented the open source solution Keycloak successfully at a few organizations and, other than owning setup and maintenance, it's free.
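To give a sense of how little code "adopt, don't build" can mean in practice, here's a minimal sketch of a backend service requesting a token from Keycloak's standard OpenID Connect token endpoint. The host, realm, client ID, and secret are placeholders, and the exact path can vary slightly between Keycloak versions:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class KeycloakTokenClient {

    public static void main(String[] args) throws Exception {
        // Hypothetical values: host, realm, client id, and secret depend entirely on your setup.
        String tokenUrl = "https://auth.example.com/realms/my-realm/protocol/openid-connect/token";
        String body = "grant_type=client_credentials"
                + "&client_id=my-service"
                + "&client_secret=changeme";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(tokenUrl))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        // The response is a JSON payload containing access_token, expires_in, and so on.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```

Everything else—user management, password policies, token issuance—stays inside the product you adopted rather than code you have to maintain.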
Opportunity Costs and Adaptation
Deciding between adapting an existing system or constructing a new one involves evaluating opportunity costs. For instance, consider the impact on time-to-market and resource allocation. Adaptation might leverage established processes and technologies, speeding up deployment and reducing costs.
Long-Term Costs
The decision between old versus new should also include an analysis of long-term total ownership costs. This includes direct costs, such as upgrades and maintenance, and indirect costs, such as user training and potential downtimes.
Risks and Trade-offs
Complexity and Splitting Focus
Operating parallel systems adds complexity to the IT environment. This can split the focus of your teams, risking a decline in efficiency and effectiveness in both system maintenance and new development efforts.
Managing Time and Blast Radius
Balancing innovation with maintenance requires meticulous time management. It's crucial to understand the potential "blast radius" of changes in one system that may inadvertently impact the other, necessitating robust risk assessment and mitigation strategies.
Case Studies
New Product on a Legacy Platform
Building a Specialized Product Module alongside an existing Records System is one path I've taken. In this instance, the existing solution had a ton of in-house written functionality that made hard things easy and easy things impossible. For example, when we needed ephemeral GPS coordinates for vehicles, it was easy to persist them in a versioned format and keep them forever. However, the system had no provision for a low-overhead way to bring in the data, render the vehicles on the screen, and then persist the data only in raw format for future replay. This was due to an anti-pattern in the API: prior to my arrival at the organization, the ORM layer had been written from scratch in a long-deprecated version of Java. A library like jOOQ or Hibernate would have been a much smarter choice, with less fragility and less custom code required whenever new types of data, such as GPS coordinates, entered the system.
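For contrast, here's a rough sketch of what storing those raw GPS points could look like with jOOQ's dynamic DSL instead of a hand-rolled ORM. The connection string, table, and column names are hypothetical; the point is that a new data type doesn't require new mapping code:

```java
import org.jooq.DSLContext;
import org.jooq.SQLDialect;
import org.jooq.impl.DSL;

import java.sql.Connection;
import java.sql.DriverManager;
import java.time.OffsetDateTime;

import static org.jooq.impl.DSL.field;
import static org.jooq.impl.DSL.table;

public class GpsPositionWriter {

    public static void main(String[] args) throws Exception {
        // Hypothetical connection details and schema.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/fleet", "app", "secret")) {

            DSLContext ctx = DSL.using(conn, SQLDialect.POSTGRES);

            // Raw, low-overhead insert: no versioning, no bespoke ORM mapping code.
            ctx.insertInto(table("vehicle_positions"),
                            field("vehicle_id"), field("recorded_at"),
                            field("latitude"), field("longitude"))
                    .values(42L, OffsetDateTime.now(), 47.6062, -122.3321)
                    .execute();
        }
    }
}
```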
In this case, we took a page from "The Bezos Memo" and gave the new product module its own API and persistence layer, interacting with the existing records system strictly through its API endpoints (a sketch of this boundary follows the sidebar below).
What is "The Bezos Memo?"
The Amazon Bezos API mandate, often referred to as the "Bezos Mandate," was a directive issued by Jeff Bezos in the early 2000s, requiring all internal software at Amazon to be designed as reusable, modular services exposed through well-defined interfaces (APIs). This mandate was foundational in transforming Amazon's technological infrastructure, promoting a service-oriented architecture that significantly enhanced scalability and efficiency. It dictated that teams must communicate with each other strictly through these interfaces, which led to the emergence of a more agile and innovative company structure, allowing Amazon to rapidly expand its product offerings and improve interoperability within its services. The directive also included a stark warning that failure to comply would result in termination, emphasizing the critical nature of this strategic shift towards a decentralized tech ecosystem.
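Back to the records system: the boundary looked roughly like the sketch below. The new module owns its own persistence and reaches the legacy system only over HTTP; the endpoint shown here is illustrative rather than the actual legacy API:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LegacyRecordsClient {

    private final HttpClient http = HttpClient.newHttpClient();

    // Hypothetical endpoint; the real legacy API shape will differ.
    private static final String RECORDS_BASE = "https://legacy.example.com/api/records/";

    /** Fetches a record from the legacy system over its public API, never its database. */
    public String fetchRecord(String recordId) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(RECORDS_BASE + recordId))
                .header("Accept", "application/json")
                .GET()
                .build();
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```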
With full control of our data model, we added a significant stakeholder request: flexible forms and data. On the persistence layer, we captured the core data in standard database columns, while flexible, customer-defined data went into a JSONB column in what we called "sister tables." This allowed more flexible partitioning of data that might otherwise grow in ways that cause performance and storage problems in multi-tenant systems.
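Here's a minimal sketch of the sister-table idea in plain JDBC against Postgres. The schema and field names are invented for illustration; the real tables were considerably richer:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class SisterTableExample {

    public static void main(String[] args) throws Exception {
        // Hypothetical schema: core columns live in the main table,
        // customer-defined fields live in a JSONB "sister" table keyed by the same id.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/app", "app", "secret")) {

            try (Statement ddl = conn.createStatement()) {
                ddl.execute("""
                    CREATE TABLE IF NOT EXISTS form_entries (
                        id         BIGSERIAL PRIMARY KEY,
                        tenant_id  BIGINT NOT NULL,
                        title      TEXT   NOT NULL,
                        created_at TIMESTAMPTZ NOT NULL DEFAULT now()
                    )""");
                ddl.execute("""
                    CREATE TABLE IF NOT EXISTS form_entries_flex (
                        entry_id      BIGINT PRIMARY KEY REFERENCES form_entries(id),
                        custom_fields JSONB  NOT NULL DEFAULT '{}'::jsonb
                    )""");
            }

            // Flexible, customer-defined data goes into the sister table as JSONB.
            try (PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO form_entries_flex (entry_id, custom_fields) VALUES (?, ?::jsonb)")) {
                insert.setLong(1, 1L);
                insert.setString(2, "{\"fleet_number\": \"A-17\", \"preferred_dock\": 4}");
                insert.executeUpdate();
            }
        }
    }
}
```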
New Platform alongside a Legacy System
A registration system had been operating successfully for a number of years when the business reached a crossroads: it needed to penetrate several new verticals. The existing platform was tailored to long-term rentals, while the company was expanding into short-term bookings such as hotels, Airbnb, VRBO, and airlines. Though the existing platform was great for the existing narrow business case, it was not at all flexible for this new, more abstract concept of short-term bookings. On top of that, the existing system was sparsely documented, had significant performance issues, and was built entirely by an external consulting agency.
After significant research eliminated the viability of solving for the new business verticals on the legacy system, the choice was made to build the new system in parallel. This allowed the team to focus on LODO (Lights On Doors Open) for the existing business case while building a tailored solution for the new business cases. The new platform was able to encompass the scenarios of the existing platform, as well as fully support the new verticals. Integration was performed with near-real-time data sync at the database level using Kafka.
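The consuming side of that sync looked roughly like the sketch below, assuming legacy change events are already being published to a Kafka topic. The topic name, serialization, and connection details are all placeholders:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class LegacySyncConsumer {

    public static void main(String[] args) {
        // Hypothetical cluster and topic; real values depend on how change
        // events are published from the legacy database.
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka.example.com:9092");
        props.put("group.id", "new-platform-sync");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("legacy.registrations.changes"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Transform the legacy row into the new platform's model and upsert it.
                    applyToNewPlatform(record.key(), record.value());
                }
            }
        }
    }

    private static void applyToNewPlatform(String key, String payload) {
        // Placeholder: map legacy fields to the new schema and write to the new store.
        System.out.printf("sync %s -> %s%n", key, payload);
    }
}
```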
What are the Downsides?
It's not all sunshine and rainbows when you diverge from the accepted norms at an organization. Expect pushback to change, regardless of the positives the changes might bring. Expect it to take at least 1.5x as long as you anticipated. However, if you know you're on the right path, this adage, often attributed to Gandhi, is pertinent: "First they ignore you, then they laugh at you, then they fight you, then you win."
Additionally, it's going to take more work to manage a distributed platform, or to manage multiple platforms concurrently. This complexity spans everything from the technical stack to how teams are organized.
People, Processes, and Communication
Team Structure
Successful management of dual systems often involves creating dedicated teams for each, with specific roles aimed at either maintaining the old or developing the new. This specialization helps in focusing efforts and avoiding conflicts of interest where team members may prefer one system over the other.
Artifacts and Ceremonies
Implementing iterative delivery mechanisms can be particularly effective. Regular sprints, planning, grooming, and retrospectives help maintain a clear view of progress and challenges on both fronts.
Coordinating Changes and Communication Strategies
To manage changes effectively across systems, develop a communication plan that includes regular updates and feedback loops between teams. This ensures that modifications in one system do not disrupt the functionality or performance of the other.
Synchronization between Systems
Any changes to the legacy system can have significant effects on both upstream and downstream systems. Clear documentation and understanding of these relationships are crucial for maintaining operational continuity.
Data Consistency and Synchronous Updates
Ensuring data consistency across systems is challenging but critical. Strategies might include using middleware to handle data transformations and synchronizations, thus maintaining integrity across the ecosystem.
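For example, a thin transformation step can normalize legacy formats before anything is written to the new system. The record shapes below are purely illustrative:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical shapes: the legacy system's row and the new model are illustrative only.
record LegacyBooking(String id, String startDate, String guestName) {}
record Booking(String id, LocalDate start, String guestName) {}

public class BookingTransformer {

    private static final DateTimeFormatter LEGACY_FORMAT = DateTimeFormatter.ofPattern("MM/dd/yyyy");

    /** Normalizes a legacy row into the new platform's model before it is written downstream. */
    public static Booking transform(LegacyBooking legacy) {
        return new Booking(
                legacy.id().trim(),
                LocalDate.parse(legacy.startDate(), LEGACY_FORMAT),
                legacy.guestName().trim());
    }
}
```

Keeping this mapping in one well-tested place means both systems can evolve without every consumer re-learning the legacy quirks.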
Best Practices
Patterns and Anti-Patterns
Utilizing well-established architectural patterns can facilitate easier integration and maintenance. Conversely, common anti-patterns to avoid include over-engineering solutions or ignoring the potential scalability issues that might arise from tighter coupling of new and old systems.
Have a Backup Plan
Whether incrementally building upon a legacy system or taking a green-field approach, it's critical to ensure your stakeholder needs are being met at all times. Can you roll back? Do you have backups in place? Have you considered DR (Disaster Recovery)? This may generate additional overhead, but business continuity is paramount.
Conclusion
Successfully developing new systems alongside maintaining legacy systems is an art that requires thoughtful planning and execution. By understanding the specific needs and nuances of both systems, organizations can ensure smoother transitions, better resource allocation, and ultimately, a more robust technological ecosystem.