Intro to databases on Azure: Basics for aspiring data engineers
How do you get started with an Azure database? As a database novice or someone new to Microsoft Azure, there are so many options it can be hard to know where to begin. Which is right for you as you get started on the path to becoming a data engineer?
To do this, let's go through an example discussing how to keep track of customers' orders as they buy products in your store. We'll use as little database jargon as possible, and you'll see that these ideas are more common than what you might think. We'll look at the problem and raise some questions along the way, and I'll point you to some documentation you can explore further. Let's begin!
The many shapes of data: Relational vs. non-relational data
Perhaps one of the big differentiators of today's databases is not the databases themselves but the type of data they can easily handle. That is: whether the database is equipped to deal with relational vs. non-relational data. Yes, you can make a database adapt to different shapes of data, but that may end up hurting your application's performance. That means you need to identify how you want to model your data, and then decide which database is the best tool for the job.
Consider the following hypothetical. Suppose you have Customers that want to buy Products by placing Orders. So, you need to model them into your database, something like this:
Customer: Customer ID, Customer Name, Customer Address
Product: Product ID, Product Name, Description
You have a few options to go when it comes to placing orders with this data. For example, each Order could store the full Customer and Product information for every line in the order, or you could add a unique identifier to each product, and then refer to that identifier from the Orders table. Something like this:
Order: Order ID, Customer ID, Product ID, Quantity<
The question is: which of these options is preferred? If you want to keep track of transactions and preserver data integrity, then perhaps storing Customer, Product and Order entities in different tables is the best option for you.
What do we mean by integrity in this context? If every time a Product is placed into an Order, you add the full product info, then if the product description were to change, you'll have to update it for every instance of that product in the Order table. This would be very expensive to compute and error-prone! If, on the other hand, you only keep track of the product's unique ID, then if you need to update the product description, you just do it in one place. Sounds simple right? Not so fast…
What happens if you want to display data? Say business is booming, and you have thousands and thousands of orders! Now you need to create a report where you'll display a list of orders so your shareholders can see the latest business report. That means your queries to the database must fetch not only the information from the Orders table but also from the Products table. So perhaps having both entities, Order and Product, stored at the same place wasn't a bad idea after all? What do you think?
Data engineers know decisions come with trade-offs
As you progress in your data engineer career, you'll learn that every decision comes with trade-offs. You get data integrity as explained above, but displaying information might end up being more expensive. You could insert all the data into the same table, but then updates might need to manually be replicated across the database. Are there solutions and best practices to follow when facing these kinds of problems? Of course!
I've been avoiding using database-specific terminology so far because I wanted to introduce the problem in a way that you could reason about it without too much noise in place. It's time to uncover that jargon and give you learning resources so you can dive deeper into those topics. Let's look.
Database and data engineer jargon explained
We showed above that one way of storing your business entities in the database is to split them across tables. We have the Product table on one hand, and the Orders table on the other. We call this normalization.
When the Orders table keeps track of the Product by storing only its ID, we call this a relation. In relational database jargon, the product id stored in the Orders table is a foreign key that refers or relates to the id stored in the Products table. You can learn more about these ideas in the Microsoft Learn module Describe concepts of relational data, specifically in the section titled Explore the characteristics of relational data.
Then we're presented with the trade-off that occurs when we split data across tables: now we have the problem that when we want to present this data in a report, we need to fetch the data from multiple tables. This can be cumbersome and lead to performance issues since you need to run queries across tables with tons of data. While we could use a non-relational data model for this problem, like in a document database, there's no need to depart from the relational data model.
When working with relational databases, you can use views. With views, you can mix the data of two or more tables and present it as if it were part of the same virtual table. In the Microsoft Learn section Explore relational data structures you can learn more about views — seeing how to create them and how to retrieve data from views.
We have talked a lot about inserting data, retrieving data, and so on, but we have forgotten about one crucial aspect: how to query data.
Getting started with SQL
In relational databases, you usually work with the SQL language; whether you use MySQL, PostgreSQL or SQL Server, you'll need to learn SQL.
Building up your SQL chops is essential in your career towards becoming a data engineer as most databases will use SQL — or a flavor of it.
To get you started with SQL, Microsoft Learn has the module Query relational data in Azure. This will guide you as you learn to query, insert, update, and delete data from a relational database.
One final tip: Keep in mind that at this stage of your data engineer career, you need to understand what the basic concepts are and the keywords that you need to learn, like relations, normalizations, views, etc. Write them down in your notebook and have them as search terms to come back later to. This way you can keep exploring further as you make your own path as a Data Engineer.
Conclusion
In this article, you saw how one problem (like keeping track of products, orders, and customers) can be stored in many ways. We chose to go the relational data way, where we ended up using foreign keys, views, doing data normalization, and much more. Our goal was to show how everyday problems map to these concepts, so you know the jargon. This will be crucial because you can't start learning if you don't know what to search for. At the same time, we provided you with a couple of links to dig deeper and learn at your own pace.
Want to learn more? There's a Microsoft Learn learning path to get you started with databases on Azure called Azure Data Fundamentals: Explore core data concepts.