The Data Engineering Case for Developing in Prod

Eric McCarty
10 min read · Nov 10, 2023

Let’s stop pretending.

Live Reaction of an SDLC Purist to this Article’s Title — Photo by 傅甬 华 on Unsplash

Call me crazy, call me names, but you have to admit if you could mostly eliminate the risks, developing in prod sure would be sweet, right? All that trying to make fake data, working with extra small resources, or developing complicated “production to test” pipelines just to work in dev? Yuck.

Well, I have a secret for you: at least in data engineering, plenty of companies do have development happening in prod. They just either A) lie about it, or B) call it something OTHER than prod and lie about it.

In this article, I make the case that we should stop hiding what we are doing, build best practices, and embrace developing in prod.

Prod vs. “Prod”

Of course, when I say “prod,” I don’t mean “end-user production environment.” We have to separate the prod “network” (or “prod account/subscription,” if you are in cloud) and the prod “end-user environment.” I have found these semantics get confusing, so I drew an (admittedly awful) diagram:

Reality for many companies.

Even with a diagram, “prod” stays ambiguous: is the “Development” Environment in the Prod Network/Account box depicted above “prod” if the end users never touch it? Or is it “prod” because it exists within a production network/account? It likely depends on who you ask.

To help clear up the confusion, here are the definitions I will use for the rest of this article:

  • Account — A physical separation of resources. On-premises, this was usually a physical network whose boundaries nothing could cross. Because I suspect most of the audience is cloud heavy, I will call this an “account” for the rest of this article: in the cloud, resources are generally separated by account (or subscription/project/folder, depending on the vendor) boundaries. And yes, in cloud it’s generally separated by network VPCs as well, but we’re data people and nothing loses a data audience faster than talking network stuff.
    - Non-Prod Account — A true by-the-book definition of a development area: only test or fake data, no access to production resources, and much smaller resources to save costs.
    - Production Account — An area that has access to production data and right-sized resources for production volumes of data.
  • Environment — A separation of resources within an account. This separation can be both physical and logical.
    - Development Environment — An area within an account that end-users do not access.
    - Test Environment — An area within an account that end-users do not access for day-to-day operations (but could for testing new features, for instance).
    - Production Environment — An area within an account that end-users do access for day-to-day operations.

So with these definitions, a potential setup of your data landscape could look like this (notice two “Development Environments” across both a non-prod and production account):

Potential setup of your environments across two accounts. Notice a Development Environment in both a Non-Prod and Production Account.
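To make the two axes concrete, here is a minimal sketch of the definitions above as data. The class names and the exact list of environments are assumptions for illustration (in particular, where the test environment lives); the point is only that the account decides what data you see, while the environment decides whether end users are affected.

```python
from dataclasses import dataclass
from enum import Enum


class AccountKind(Enum):
    NON_PROD = "non-prod"  # test or fake data only, smaller resources
    PROD = "prod"          # production data, right-sized resources


class EnvKind(Enum):
    DEVELOPMENT = "development"  # end users never access it
    TEST = "test"                # end users may try new features here
    PRODUCTION = "production"    # end users' day-to-day operations


@dataclass(frozen=True)
class Environment:
    account: AccountKind
    kind: EnvKind

    @property
    def has_production_data(self) -> bool:
        # Data realism is decided by the account, not the environment.
        return self.account is AccountKind.PROD

    @property
    def end_user_facing(self) -> bool:
        # Day-to-day end-user access is decided by the environment.
        return self.kind is EnvKind.PRODUCTION


# Hypothetical landscape: a development environment in BOTH accounts.
LANDSCAPE = [
    Environment(AccountKind.NON_PROD, EnvKind.DEVELOPMENT),
    Environment(AccountKind.PROD, EnvKind.DEVELOPMENT),
    Environment(AccountKind.PROD, EnvKind.TEST),
    Environment(AccountKind.PROD, EnvKind.PRODUCTION),
]

# "Developing in prod" in this article's sense: real data, no end users.
dev_in_prod = [
    env for env in LANDSCAPE
    if env.kind is EnvKind.DEVELOPMENT and env.has_production_data
]
print(dev_in_prod)
```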

Seeing a “Development Environment” in a production account may be setting off alarms in your (and possibly my former employers’) heads. But before we get there, let’s discuss what an actual ideal state for a development environment looks like.

Ideal State

Data engineers don’t love this setup. Developing “in prod” feels icky and goes against all the best practices we were taught. But ask them what is necessary for an ideal development environment and you are likely to hear most, if not all, of the following.

Production Volume of Data — This is generally necessary for a data engineer because it’s difficult to understand all the different scenarios of a data pipeline using only small amounts of data.

Production Equivalent Data — This is generally necessary for a data engineer because production data reveals outcomes that you would not otherwise see with “fake” data.

Production Equivalent Resources — This is generally necessary for a data engineer because it gives a more accurate idea of performance, and of how best to right-size resources, than smaller compute instances can.

Can all three of these be achieved in a non-prod account? Certainly. But it is an investment: in people, processes, security, and cost. And most of the time, data teams are constrained in all of these areas, which leads data teams to do the easier thing: start doing development in production accounts.

And why is that a bad thing?

So, Why Don’t We Develop in Prod?

This is one of those questions that seems obvious on its surface, as the answer has been ingrained in every developer since the moment they picked up a “Programming for Beginners” book during sophomore year.

But let’s really break down the reasons and see if those reasons still apply for data engineers in 2023.

  1. You’ll Break Production Code
    - Them: If you do something directly in prod, the odds of breaking production drastically go up, and your users will be affected.
    - Counterpoint: Going back to our “prod” discussion earlier, separating your development and production environments in a production account can eliminate this risk (more on that later).
  2. You’ll Break Production Data
    - Them: If you are developing a data pipeline in prod, you could wipe out or alter the production database.
    - Counterpoint: With modern data warehouses, you can leverage clones to do your development, which are writable copies of data that don’t affect the base table. You can also separate permissions to ensure no accidental create, update, or delete operations on your base data (more on that later).
  3. Data Engineers will Have Access to Production Data
    - Them: Developers should only have access to dev (aka, fake) data when developing data pipelines.
    - Counterpoint: I think this ship has sailed long ago. Unless your data engineers are mindlessly doing straight moves from source to target, I can’t fathom how they are effective at communicating with the business and translating requirements if they don’t have access to production data. Even companies that have a strict “development only happens in dev accounts” rule usually allow read access to prod for data engineers.
  4. You’ll Impact Prod SLAs
    - Them: If you run inefficient code that you are developing in prod, it will impact prod resources and cause performance challenges for your end-users.
    - Counterpoint: This isn’t really a challenge anymore with cloud. Cloud data warehouses like BigQuery and Snowflake have the ability to separate compute, and your data pipelines can be given isolated compute in your development environment also.
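The clone idea in counterpoint 2 is easier to trust once you see why a write to a clone cannot touch the base table. The following is a toy copy-on-write model in plain Python, not a description of how any particular warehouse implements clones internally; `ToyTable` and its methods are made up for illustration.

```python
class ToyTable:
    """Toy model of a table stored as immutable blocks (tuples of rows)."""

    def __init__(self, blocks):
        # Keep our own list of references to the immutable blocks.
        self.blocks = list(blocks)

    def clone(self):
        # A clone copies only the block references, never the rows.
        return ToyTable(self.blocks)

    def update_block(self, index, new_rows):
        # A write swaps a reference in THIS table only; the old block
        # still exists and is still referenced by any other table.
        self.blocks[index] = tuple(new_rows)

    def rows(self):
        return [row for block in self.blocks for row in block]


prod = ToyTable([(("a", 1), ("b", 2)), (("c", 3),)])
dev = prod.clone()

# A destructive experiment, run against the clone.
dev.update_block(0, [("a", 999)])

print(prod.rows())  # base table: unchanged
print(dev.rows())   # clone: reflects the experiment

# The block nobody modified is still shared, so the clone costs no extra
# storage until something is written to it.
shared = prod.blocks[1] is dev.blocks[1]
print(shared)
```

In this model the only way to damage `prod` is to call `prod.update_block(...)` directly, which is exactly what read-only permissions on the production environment are meant to rule out.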

But most likely, the main reason we don’t develop in prod is that we are just following best practices from our friends in traditional software engineering. You would never think about updating a website or an app in prod, so why would you in data?

But data is different: end-users are almost always internal to your company and small in number (compared to a company’s customers), and all of your assets (data and code) can be fully isolated within a production account.

Also note that there is very little pushback for other data roles to “develop in prod.” At most companies I have seen, Data Analysts, BI Developers, and Data Scientists don’t even have a “development account,” and work directly in prod. They are just isolated from breaking the production environment. Why can’t data engineers have that same luxury?

The Challenges with Development in a Non-Prod Account

Now remember, if the requirements from the Ideal State section of this article are already met in a non-prod account for you, this section doesn’t apply (although I’d be curious how expensive it is or how many people you have dedicated to maintaining it). But for the rest of you, this is probably all too familiar.

Challenge #1 — The Data

Any data engineer who works with fake data will understand this one. If you are joining tables, good luck finding much referential integrity to make the join work. If you are building a report, good luck with that snazzy map visual when you only have minuscule data showing up in 3 states. If you are creating a new aggregate table, good luck estimating how much data will actually be in the final product. If you are converting codes to descriptions, good luck figuring out all the possible combinations of codes in the system.

“Eric, just use faker.” Faker is fine in certain scenarios, like generating a file that mimics an input into an ETL job. But if you have any kind of business logic based on the input data, fake data is very prone to hiding errors that you won’t find until you make it to prod.

“Eric, just copy down some data from prod to dev.” This is easier said than done, and usually still has the same issues as true fake data: no referential integrity, not valuable for any kind of aggregates, likely to miss most production scenarios. Not to mention, you have to open up some kind of hole between your prod and dev networks, which is likely riskier than developing in a production account in the first place.
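The referential integrity complaint is easy to reproduce. Below is a small sketch using only Python’s standard library (not the Faker package); the table names, sizes, and ID space are made-up assumptions. Two fake tables are generated independently, the way per-table generators tend to be used, and then we count how many “orders” would survive a join to “customers.”

```python
import random

random.seed(42)  # fixed seed so the illustration is repeatable

ID_SPACE = 1_000_000  # hypothetical range of possible customer IDs

# Hypothetical fake "customers" table: 1,000 distinct random IDs.
customer_ids = set(random.sample(range(ID_SPACE), 1_000))

# Hypothetical fake "orders" table, generated with no knowledge of customers.
order_customer_ids = [random.randrange(ID_SPACE) for _ in range(5_000)]

matched = sum(1 for cid in order_customer_ids if cid in customer_ids)
match_rate = matched / len(order_customer_ids)
print(f"{matched} of {len(order_customer_ids)} orders join to a customer "
      f"({match_rate:.2%})")


def orders_with_integrity(parent_ids, n):
    # The fix: draw every foreign key from the parent table. This only works
    # if the generator knows each relationship in the schema up front.
    pool = sorted(parent_ids)
    return [random.choice(pool) for _ in range(n)]


fixed = orders_with_integrity(customer_ids, 5_000)
fixed_rate = sum(1 for cid in fixed if cid in customer_ids) / len(fixed)
print(f"with keys drawn from customers: {fixed_rate:.0%}")
```

Getting the second version right for one relationship is simple. Keeping it right across every foreign key, code table, and business rule in a real warehouse is the ongoing maintenance cost this section is describing.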

The other problem is just data volume. While performance testing is usually done in a pre-prod environment, countless times I have built a pipeline that works great in dev and then had to optimize my code right before moving into prod. Generally this means making fundamental logic changes to help performance, which is both A) risky because it’s much later in the development cycle and B) a waste of time since I could have been performance testing the whole time. Developing with production volume means you are doing performance testing continuously, lowering the overall development time and reducing risk with late changes.

Challenge #2 — The Environment

Unless you take great care to have your non-prod account mimic your prod account, there are almost always problems when promoting code to prod due to environment differences. That small single-node compute instance just doesn’t run the same as your multi-node prod cluster. Your VPC and subnet setup is always different than in prod. That service account that is using basic auth in dev is using SSO in prod. And on and on.
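One cheap way to at least see this drift is to write each environment’s settings down and diff them. This is only a sketch: the `config_drift` helper and the setting names are hypothetical, loosely based on the examples in the paragraph above, and real settings would come from your infrastructure-as-code or the vendor’s APIs.

```python
def config_drift(dev: dict, prod: dict) -> dict:
    """Return {setting: (dev_value, prod_value)} for every setting that differs."""
    missing = object()  # sentinel so a key absent on one side still counts
    drift = {}
    for key in sorted(dev.keys() | prod.keys()):
        d, p = dev.get(key, missing), prod.get(key, missing)
        if d != p:
            drift[key] = (
                None if d is missing else d,
                None if p is missing else p,
            )
    return drift


# Hypothetical settings for the two accounts.
non_prod_account = {"compute_nodes": 1, "auth": "basic", "region": "us-east1"}
prod_account = {"compute_nodes": 8, "auth": "sso", "region": "us-east1"}

for key, (dev_value, prod_value) in config_drift(non_prod_account, prod_account).items():
    print(f"{key}: dev={dev_value!r} prod={prod_value!r}")
```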

While you would still have some differences even between development and production environments within a single production account, those differences become much more manageable than when the environments are split across accounts.

How to Develop in Prod Safely

In the “So, Why Don’t We Develop in Prod” section of this article, we laid out 4 risks that we traditionally have when we develop in prod. For the most part, these 4 risks can be eliminated with cloud and the use of modern data warehouses.

  1. Create Separate Development and Production Environments within a Production Account — As discussed in the beginning of this article, you need to create separate environments for development and production. This will vary by your setup, but as an example:
    - In Google Cloud, you could have a production folder that represents your account, and have separate projects that represent your environment.
    - In Azure, you could have a production subscription that represents your account, and have separate resource groups that represent your environment.
    - In Snowflake, you could have a production account, and have separate databases that represent your environment.
  2. Create Service Accounts/Roles in Your Development Environment That Only Have Read Access to Your Production Environment — Arguably the most important part of the setup: take great care that any accounts/roles created in your development environment have only read access to your production environment, with no create, update, or delete privileges there.
  3. Clone Data As Necessary for Development — Make use of BigQuery or Snowflake’s cloning feature to develop as necessary. This will give you access to full volume production data without any risk of breaking your end-user data.
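As a rough sketch of the three steps above for the Snowflake variant, the helper below only builds SQL strings, so it can be read and tested without a warehouse connection. The function names and the `ANALYTICS_PROD`, `ANALYTICS_DEV`, and `DEV_ENGINEER` identifiers are hypothetical, and the statements should be checked against your vendor’s documentation and your own security model before running anything.

```python
def dev_in_prod_setup_sql(prod_db: str, dev_db: str, dev_role: str) -> list[str]:
    """Build Snowflake-style statements for a dev environment in a prod account."""
    return [
        # Step 1: a separate environment, here a dedicated development database.
        f"CREATE DATABASE IF NOT EXISTS {dev_db};",
        f"CREATE ROLE IF NOT EXISTS {dev_role};",
        # Step 2: read-only on production (USAGE to see it, SELECT to read it).
        f"GRANT USAGE ON DATABASE {prod_db} TO ROLE {dev_role};",
        f"GRANT USAGE ON ALL SCHEMAS IN DATABASE {prod_db} TO ROLE {dev_role};",
        f"GRANT SELECT ON ALL TABLES IN DATABASE {prod_db} TO ROLE {dev_role};",
        # Broad privileges are granted only on the development database.
        f"GRANT ALL PRIVILEGES ON DATABASE {dev_db} TO ROLE {dev_role};",
    ]


def clone_table_sql(prod_db: str, dev_db: str, schema: str, table: str) -> str:
    # Step 3: a writable clone in dev; the base table in prod is not modified.
    # Assumes the target schema already exists in the development database.
    return f"CREATE TABLE {dev_db}.{schema}.{table} CLONE {prod_db}.{schema}.{table};"


statements = dev_in_prod_setup_sql("ANALYTICS_PROD", "ANALYTICS_DEV", "DEV_ENGINEER")
statements.append(clone_table_sql("ANALYTICS_PROD", "ANALYTICS_DEV", "SALES", "ORDERS"))
print("\n".join(statements))
```

Note that a grant on all tables covers only the tables that exist when it runs, so new production tables need a follow-up grant. Keeping the grants in a reviewed script like this also makes it easy to audit that nothing stronger than `SELECT` is ever issued against the production database.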

Final Considerations

Once you have this set up, you have a relatively risk-free way to develop in prod. This setup has a number of benefits:

  • Greatly reduces the frustration for data engineers dealing with an inadequate dev environment.
  • Decreases time to market for delivery as you would see all of the production scenarios early in the development cycle.
  • Increases data quality as you can have a more complete test bed earlier in the development cycle.
  • Is much cheaper to set up than maintaining a care and feeding process to create an adequate development environment in a non-prod account.

I do want to mention that I believe you still need a development environment in your non-prod account. This can be useful when you have a new data producer that is currently in development, and you want to test the integration, for instance. You’ll need this assuming your upstream data producers are still developing in dev like suckers.

Another consideration that is more work but could be beneficial is creating environments for every branch of code. That way you have a fully isolated environment even within your development environment. I know there is work in this arena by companies trying to simplify the CI/CD experience for data engineering, which could be an exciting next step.
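If you go the environment-per-branch route, the first practical question is naming. Here is a minimal sketch, assuming each branch gets its own database (or dataset) whose name is derived from the git branch name; the `DEV` prefix and the 64-character limit are arbitrary assumptions, not vendor rules.

```python
import re


def branch_env_name(branch: str, prefix: str = "DEV", max_len: int = 64) -> str:
    """Turn a git branch name into a database-safe environment name."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    slug = re.sub(r"[^A-Za-z0-9]+", "_", branch).strip("_").upper()
    if not slug:
        raise ValueError(f"branch {branch!r} has no usable characters")
    # Truncation keeps names within the limit but can make two long branch
    # names collide, so a real setup may want to append a short hash.
    return f"{prefix}_{slug}"[:max_len]


print(branch_env_name("feature/add-orders-agg"))  # DEV_FEATURE_ADD_ORDERS_AGG
```

A CI job could then create that environment as a clone when the branch is opened and drop it when the branch is merged, which keeps the isolation cheap.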

But overall, I believe the majority of data engineering frustration could be alleviated with the setup laid out here. Developing with dev data may be a relic that, with cloud and modern data warehouses, we can finally leave in the past.

Let me know what you think. I’m really interested in what is working for others. Have you seen other setups work well? Is there something I haven’t considered here? Am I off my rocker? I appreciate any insight.

Eric McCarty

Data Specialist Engineer at Google and former Technical Architect at USAA and Walgreens. Opinions are of my own and not of Google.