The Chief Architect role — part 2 (the actual work)
So, for the second part, or as I like to call it — down to business. In the first part I presented a bit of my background and how I came to attain my position. In this part I’ll try and dig a bit into what I actually do.
For those of you who have read my posts or seen me talk in the past, it’ll come as no surprise when I say that my most important role is to maintain and keep the “language of the system”. When I talk about the language of the system, I extrapolate from Rich Hickey’s amazing talk and try to look at the entity that is called AppsFlyer (where I’m the chief architect, in case you missed it from the first part ;-)). Every organization, big or small, develops its own “language” over the years. Specifically, for the R&D organization, this revolves a lot around design and tooling questions: “Does this paradigm work for us?”, “Do we use library X or Y?”, “Which database do we use for realtime high throughput key-value storage?” etc. These questions were not specifically answered by me; the answers emerged organically over the decade-plus that AppsFlyer has existed. A huge amount of engineering hours has been poured into getting the system to where it is today: capable of handling hundreds of billions of events per day, both online and offline, in order to produce one of the world’s leading measurement and analytics platforms.
But wait, if most of the questions have already been answered, what do I do now? What do I mean when I say that I “maintain” the language of the system? Well, I often find that complex systems, especially those that have a high scale of traffic/services/machines/data, tend to be a bit more complicated than what can be represented in the multitude of system design diagrams. More often than not, I tend to think of our system as an almost “organic organism” that can not only grow in unexpected ways, but can also react in a bewildering manner to seemingly innocuous changes. This happens because we create a Cartesian product of extremely complex concepts:
- Distributed systems are fallible in nature.
- Distributed systems, while amazing in reducing direct dependencies and accelerating individual growth vectors, introduce indirect dependencies which are WAY harder to grok and reason about.
- Building a Platform that can accept a near infinite amount of user input means that the data complexity needed in order to provide a coherent analytical product is immense.
When you take all of these into account, some questions that seem trivial can become very daunting. Let’s take, for example, a customer-facing API in which we want to add a field. “What’s the big problem?” some of you may ask, “just add another query param and you’re set!”. While this is superficially true, behind the scenes there’s a LOT more going on: things like param validation, internal realtime schema evolution, internal offline datalake evolution, external data representation etc. Compound this with the fact that this API handles millions of HTTP calls a second, and we also need to look at stability, cost, rollout strategy, etc. We must also ask ourselves even more complex questions. How do we make sure that this new parameter is not already in use across the dozens of existing APIs we already have? If it does exist, how do we make sure that we align on the naming and the schema? If it doesn’t, how do we “notify” the rest of the organization that we’re now introducing a potential new data dimension? Is this new dimension applicable only to this current product? Could it potentially be applicable to other existing or future products?
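To make the “is this parameter already in use?” question a bit more concrete, here is a minimal sketch in Python. Everything in it is hypothetical and only for illustration: the `REGISTRY` contents, the API and parameter names, and the `check_new_param` helper are my own assumptions, not a description of AppsFlyer’s actual tooling. It only shows the kind of check that surfaces naming and type mismatches before a new field is introduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Param:
    name: str
    type: str


# Hypothetical registry: API name -> parameters it already accepts.
REGISTRY = {
    "install_api": [Param("app_id", "string"), Param("campaign", "string")],
    "event_api": [Param("app_id", "string"), Param("event_value", "number")],
}


def normalize(name: str) -> str:
    """Fold case and separators so 'appId', 'app-id' and 'app_id' compare equal."""
    return name.replace("_", "").replace("-", "").lower()


def check_new_param(new: Param, registry: dict) -> list:
    """Return human-readable findings about a proposed parameter."""
    findings = []
    for api, params in registry.items():
        for existing in params:
            if normalize(existing.name) != normalize(new.name):
                continue  # unrelated parameter
            if existing.name != new.name:
                findings.append(
                    f"{api}: naming mismatch ({existing.name!r} vs {new.name!r})"
                )
            if existing.type != new.type:
                findings.append(
                    f"{api}: type mismatch ({existing.type!r} vs {new.type!r})"
                )
            if existing == new:
                findings.append(f"{api}: already defined, reuse it")
    return findings


if __name__ == "__main__":
    for finding in check_new_param(Param("appId", "string"), REGISTRY):
        print(finding)
```

An empty result means the proposed name is genuinely new, which is exactly the case where the rest of the organization needs to hear about the new data dimension. A check like this answers only the mechanical part of the question; whether the dimension belongs in other products is still a conversation between people.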
As you can see, there are a lot of questions here, and this is just one example. In reality, we’re faced with a plethora of such questions that we deal with on a daily basis. Not all of them require an Architect; in a lot of them we do “more of the same”, and this can be good because this “sameness” was developed across more than a decade of testing new infra, dealing with production incidents etc. This corpus of knowledge, methodology and practicality is the actual “language of the system”. If that’s the case, when is an Architect actually needed? The simple answer is that every language changes over time, and such changes need to be managed in a manner that is beneficial to all involved. Examples that come to mind are trying out new technologies, changing existing system designs, needing to adhere to new rules and regulations and many, many more. In all of these things, we utilize an Architect whose position is defined by an extreme bias toward action in order to promote both the business needs and the system requirements (note that I’m using the word “system” to denote a very complex entity composed of tens of thousands of machines, thousands of services and hundreds of databases).
But what does this process actually look like? I’ll start by saying that the title of Architect can be misleading: it’s not the Architect’s job to design something and hand it down from his ivory tower (at AppsFlyer, the design is done by the feature lead). The Architect has two very clear objectives stipulated in his R&Rs (roles and responsibilities):
- During the Definition of Ready (DoR: making sure we have all the knowledge we need in order to start the development work) process, he’s the one who “signs off” on the high-level design of the solution. That means he’s responsible for taking all the considerations stipulated above (of which I mentioned only a handful) and making sure that the proposed design and solution tackle them. If the solution fails to tackle item X or Y, the Architect will help tackle it together with both the feature lead and the product lead of said feature.
- During the Definition of Done (DoD: the process from DoR until the solution is pushed to production) process, the Architect actively pursues the goals introduced in the DoR. Why? Because the design never remains exactly the same between the DoR and the DoD. Things change because reality keeps on changing around us: what we thought would work sometimes doesn’t, sometimes product requirements change etc. The Architect is there to make sure that, even though we’re making changes to the plan, the end result still aligns with the “language of the system”.
I don’t know if you noticed, but somewhere along the way in this post, I stopped talking about the Chief Architect role and started talking simply about an Architect. The reason is that in order to more easily explain what I do, I also need to explain what the Architects at AppsFlyer do. Now that you’re more aware of the set of considerations we tackle at AppsFlyer, it’ll be easier to explain how I specifically tackle the things I wrote about here. Hint: it’s mostly about the people I interact and work with. All this and more in part 3 of the series (which will hopefully come soon).