Graduating from ETL Developer to Data Engineer

Eric McCarty
Google Cloud - Community
10 min read · Jan 13, 2022


Bring sexy back to your career by ditching your GUI-based comfort.

Does your day-to-day life look like this?

A collection of development environments from various GUI-based ETL tools. Clockwise from top left: Informatica PowerCenter, IBM Datastage, Talend, Microsoft SSIS.
Ah, GUI-based ETL tools, always gettin’ no respect.

If so, welcome, my ETL developer friends. This is where I spent the majority of my development career, too. And like you, I always heard the whispers, and sometimes the outright ridicule, from those ‘oh-so-smart CS-major engineering know-it-alls.’

  • “That’s just visual SQL”
  • “That isn’t real development”
  • “That’s so low-skill”
  • “When are you going to be a REAL coder?”

And it sucks because deep down inside you know there is a hint of truth to these comments: you probably went to school for IT work and found yourself moving data from here to there in a tool that was pretty easy to pick up and learn.

Of course, I think that perception is generally a load of crap. Some of the most technically advanced and valuable work of my career was done in Informatica PowerCenter, the multi-decade king of ETL tools (don’t @ me, Datastage/SSIS/Talend/Ab Initio fans). While it’s absolutely true that I could probably teach my 10-year-old how to create a simple job that picks up a CSV file and loads it into a database using PowerCenter, you can do incredibly advanced engineering within these tools as well. Being easy to pick up doesn’t mean being easy to master, and disciplined engineering principles go a long way in separating beginners from advanced developers. And besides, I don’t really get why “being easy to learn” was ever a knock on these tools in the first place.

So this article isn’t about trashing legacy ETL tools; there are plenty of other blogs and online discussions that do that just fine. No, this article will go into why you need to make the change to become a data engineer, and how to make that leap coming from an ETL developer background.

Why Make the Change?

Photo by Jp Valery on Unsplash

If you are a current GUI-based ETL developer thinking about making the leap to code-based data engineering, you are probably in one of three camps:

  • I already am a data engineer (and your title may even reflect that).
  • ETL tools make my job easier, why do I need to change?
  • I realize I need to become a coder, but I don’t know where to begin.

If you are in the third camp, great: you are in a good place and will get some good information in the “How” section below. But I first want to address the first two groups.

First, in my opinion at least, ETL developers are data engineers; don’t let any of the snobs tell you otherwise. You are engineering data pipelines based on business requirements, after all. But if we look to Wikipedia for truth (it’s never wrong, right?), the definition of a data engineer is:

“A data engineer is someone who creates big data ETL pipelines, and makes it possible to take huge amounts of data and translate it into insights.”

OK, so far so good. But it continues.

“They are focused on the production readiness of data and things like formats, resilience, scaling, and security.”

Well readiness and formats sure. Scaling is a little tougher, but we have some resili…..

“Data engineers usually hail from a software engineering background and are proficient in programming languages like Java, Python and Scala.”

Oof. Gut punch.

One of the main reasons you need to “graduate” to become a “real” data engineer is that the market, right or wrong, expects it. In the words of Aretha Franklin, R-E-S-P-E-C-T. In 2022 especially, the market does not respect someone who is purely a GUI-based ETL developer. You can see some of this in articles like this one, which show a much higher correlation of compensation with skills like Scala and Spark (+17% and +16% increase respectively) than with pure ETL (+7%, and keep in mind that number does not distinguish GUI-based ETL and code-based ETL, so it’s likely even lower for purely GUI-based ETL). You can search a site like Indeed for numbers of jobs based on skill and, as of this writing, in the US there are 191,648 jobs looking for the skill “Python” and 8,474 jobs looking for the skill “Informatica.” You can look at Google Trends data for the top ETL tools and see declining interest over time in all of them.

Anecdotally, when I was a pure ETL developer, I don’t recall ever being recruited on LinkedIn. But since becoming a Google Cloud Certified Professional Data Engineer, I get 4–5 recruitment opportunities a week from prominent companies all over the country. Not one has cared about my experience in PowerCenter.

But there are more than just selfish or market reasons to make the jump to code-first data engineering. There are legitimate, non-snobby reasons it has become the dominant form of ETL over the past few years, and reasons the industry is heading in that direction.

  1. Reducing Vendor Lock-In — When a company standardizes on a GUI-based ETL tool, they are tying themselves to that vendor for a very long time (decades, as is the case for most financial services companies). Not only is this bad financially (customers lose their leverage, as these vendors know how difficult it is to change), but you are also at the vendor’s mercy for any new capabilities. They will nickel-and-dime you on new connectors as you adapt to shifting technology, and new capabilities always seem to be multiple versions out from where you currently are. Not to mention, upgrading these tools is generally a monumental effort. Moving to portable code eliminates these issues.
  2. Abstraction — As Maxime Beauchemin puts it in his famous Rise of the Data Engineer post: “There’s a multitude of reasons why complex pieces of software are not developed using drag and drop tools: it’s that ultimately code is the best abstraction there is for software.” Now, ETL tools were created specifically to make life easier for everyday tasks, not “complex software,” but somewhere along the way complexity was built into many (most?) ETL jobs. Very rarely did I do simple things in ETL, and anything that was simple was generally sent to an offshore or junior resource. And that is where the point stands: in no other field do you build complex software in drag-and-drop tools. While sites like WordPress or Squarespace are fine for beginners building simple websites, every company website you use regularly with any kind of complexity is built using code. And because they built that abstraction for all their sites (not just the complex ones), they can apply standards and practices across the entire thing (think a CSS change that affects the look and feel of a banking site, for example).
  3. Continuous Integration/Continuous Delivery (CI/CD) — Chances are you have heard and maybe even practice some form of CI/CD in your day job. But if you are using a GUI-based ETL tool, you are not seeing the full benefits that other engineering disciplines enjoy. Checking in a representation of your GUI flow in XML or JSON in Git is not the same thing as branching your code and allowing changes to be part of a continuous integration pipeline. GUI-based ETL has always been notorious for being difficult to collaborate on, and version control is spotty at best.
  4. Scalability — When you need more horsepower with a GUI-based ETL tool, you generally need more compute. Most of these tools charge by the core, which means adding compute requires you to expand your licensing. Even if you have a very nimble contracting team (which is a challenge with every big company I have worked with), it’s still a hit to your operational expense and a constant struggle with your operations teams for capacity. With code, there are no increased overhead expenses with increasing scale. Plus, if you optimize for the cloud and only need to run your pipelines during a peak time (like end-of-month processing), you can scale back and save on those costs during non-peak time. Flexible pricing like this is more difficult/non-existent in legacy GUI-based systems.
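To make the abstraction point concrete, here is a minimal Python sketch (all names here, like `clean_record`, are made up for illustration) of what a transform looks like once it is plain code: it can be unit-tested, version-controlled, branched, and reused like any other software, which is exactly what a GUI representation resists.

```python
# Once a "T" step is a plain function, it can be tested and reused
# across pipelines -- no proprietary repository or widget required.

def clean_record(record: dict) -> dict:
    """Trim whitespace and normalize key casing -- a typical transform."""
    return {k.strip().lower(): v.strip() if isinstance(v, str) else v
            for k, v in record.items()}

def run_pipeline(records, transforms):
    """Apply a list of transform functions to every record, in order."""
    for transform in transforms:
        records = [transform(r) for r in records]
    return records

raw = [{" Name ": "  Alice ", " City ": " Austin "}]
print(run_pipeline(raw, [clean_record]))
# -> [{'name': 'Alice', 'city': 'Austin'}]
```

Because the transform is just a function, adding a second one to the pipeline is a one-line change, and both can be covered by ordinary unit tests in CI.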

There are others, but those are the ones that drove me to pivot my preferred approach away from GUI-based ETL tooling.

How to Make the Change

Photo by Kristopher Roller on Unsplash

So how do you make the leap? There are plenty of articles and Github repositories out there around data engineering, but I have found them overwhelming as someone who is primarily coming from a GUI-based ETL background. I will offer a step-by-step guide here in the context of that background.

Step One: Get Comfortable with Cloud

If you have not leveraged the cloud yet, it’s the single best way to advance your career and learn new skills. And when I say “cloud,” I mean getting comfortable with the primary cloud service providers (CSPs: Google Cloud Platform, Amazon Web Services, Microsoft Azure), not just using a cloud-based data warehouse in your ETL. Learn how to spin up a VM, spin up a managed database, and get comfortable with the command line. Each of these CSPs has a free tier that you can use to stand up mini-projects and enhance your skills with integrated tooling that will help you learn new paradigms in data.
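As a rough illustration, standing up that first VM and managed database on Google Cloud is only a handful of CLI commands. Treat this as a sketch: the project ID, instance names, and machine sizes below are placeholders, not real resources.

```shell
# One-time setup: authenticate and point gcloud at a sandbox project.
gcloud auth login
gcloud config set project my-sandbox-project   # placeholder project ID

# Spin up a small (free-tier-eligible) VM...
gcloud compute instances create etl-sandbox \
    --zone=us-central1-a \
    --machine-type=e2-micro

# ...and a managed Postgres database.
gcloud sql instances create etl-db \
    --database-version=POSTGRES_14 \
    --tier=db-f1-micro \
    --region=us-central1
```

Running these a few times, then tearing the resources down, builds exactly the command-line comfort this step is about.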

If you haven’t used the cloud previously, I highly recommend starting with Qwiklabs. With this platform, you can search for topics you are interested in and it will spin up a temporary Google Cloud environment for you and give you step-by-step instructions on accomplishing hands-on tasks within a certain amount of time. You can start with this Quest for free to get a feel for the platform and some of the basics.

There is also plenty of training available on sites like Pluralsight that includes both videos and hands-on examples.

Finally, getting certified is by far the best thing you can do for yourself to graduate to a data engineer. And bonus: the Google Cloud Data Engineer certification is the highest-paying IT certification as of this writing. Your company will probably pay for your training and certification exams, so take advantage. You don’t need to get certified to start Step Two, but you should at least get your feet wet first.

Step Two: Start with an Analytics Engineering Product

The best way for someone experienced in GUI-based ETL to make the leap to a code-first product is to start with a language that will be familiar to anyone doing ETL work: SQL. Tools like dbt (the clear industry leader, with an expansive and helpful community) or Dataform (recently acquired by Google and being integrated into BigQuery) make for an easier transition from a GUI-based ETL background than starting with something like Scala or Python.

For dbt in particular, all your code “compiles” to SQL, which will be familiar to anyone who has done ETL.

dbt describes itself as the “T” part of ETL, so it’s important to know that you cannot use it for extracting and loading: you use it to transform raw data already in your data warehouse to actionable assets. So while this skillset will be limited for an end-to-end data engineer, you can get very comfortable with modern engineering principles (source control, CI/CD, automated testing, etc.) without also struggling with the syntax of a new language.
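For a feel of what that looks like, here is a hypothetical dbt model (the model and source names are made up). It is just a SELECT statement in a file; dbt’s Jinja functions like `ref()` wire up dependencies between models so dbt can build them in the right order:

```sql
-- models/orders_summary.sql  (hypothetical dbt model)
-- dbt materializes this SELECT as a table; {{ ref('stg_orders') }}
-- declares a dependency on another model in the same project.
{{ config(materialized='table') }}

select
    customer_id,
    count(*)    as order_count,
    sum(amount) as lifetime_value
from {{ ref('stg_orders') }}
group by customer_id
```

Everything around that file — version control, environments, tests, documentation — is where the modern engineering practice comes in, while the logic itself stays in the SQL you already know.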

To get started, you can follow the dbt setup tutorial here. You’ll notice it takes advantage of the generous GCP free tier, which is why it helps to get comfortable with the cloud.

Step Three: Resist that Impostor Syndrome You Feel

At this point you are ready to do your first non-SQL data engineering task. You start Googling Python and inevitably run into what feels like a wall.

Hey, how did they find a picture of me?

You see import statements whose purpose you are unsure of, complex code for what were easy tasks in the GUI, and an overall feeling that everyone else is smarter than you. Here’s a little secret: you can be a competent data engineer without being a Python or Scala expert. I am a certified data engineer and my job is to help Google Cloud customers with their data engineering architecture, but if you gave me a blank IDE and no Google, I wouldn’t be able to engineer myself out of a paper sack.

Don’t get me wrong, you need to get competent with a language and begin to understand what some of the details represent. But that’s a marathon, not a sprint, and leveraging code you find online and modifying it to what you are trying to do is something almost everyone does. Don’t feel less-than because you don’t see yourself as a basement-dwelling hacker.

Step Four: Create a Sample End-to-End Data Pipeline

The best way to learn data engineering is to build an end-to-end data pipeline from soup to nuts. Even with fake data, it’s the best way to understand the life of a data engineer in a post-GUI ETL world. It can start small, like ingesting a sample dataset from a Kaggle competition and loading it into a data warehouse. Stand up a managed database and stream its changes into a Pub/Sub topic to learn streaming data principles. Using a managed Airflow instance to orchestrate a data pipeline is a great way to learn Python for someone coming from an ETL background. Learn how all these components work together to make an end-to-end pipeline.
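As a toy illustration of the shape of such a pipeline, here is a self-contained Python sketch that extracts a CSV, transforms it, and loads it into a local SQLite “warehouse.” Everything in it (the data, the table name) is made up; in a real GCP project you would swap SQLite for BigQuery and hand orchestration to something like Airflow.

```python
# Toy end-to-end pipeline: extract -> transform -> load, all local.
import csv
import io
import sqlite3

RAW_CSV = """city,temp_f
Austin,95
Chicago,70
"""

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: derive Celsius -- the kind of step a GUI tool
    would hide inside an expression widget."""
    return [{"city": r["city"],
             "temp_c": round((float(r["temp_f"]) - 32) * 5 / 9, 1)}
            for r in rows]

def load(rows, conn):
    """Load: write the transformed rows into a warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS weather (city TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO weather VALUES (:city, :temp_c)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
print(conn.execute("SELECT * FROM weather").fetchall())
# -> [('Austin', 35.0), ('Chicago', 21.1)]
```

The point isn’t the thirty lines themselves; it’s that once extract, transform, and load are explicit functions, you can test, orchestrate, and swap out each stage independently.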

In a follow-on post, my colleague Sathish Sadagopan and I will show you an end-to-end demo that you will be able to deploy and modify to get you familiar with these concepts. We will orient this demo to an audience that comes from a GUI-based ETL world with concepts that should be familiar to you, but will equip you with the skills to “graduate” to a full-fledged data engineer.

Edit 8/9/2023: No part 2, but I posted a followup here.



Data Specialist Engineer at Google and former Technical Architect at USAA and Walgreens. Opinions are of my own and not of Google.