Busy working and busy studying this past week, as I ventured deeper into the world of SQL and data manipulation and visualization mainly with Tableau, a visual analytics platform. In this blog post, I want to discuss how to craft effective SQL subqueries and introduce several valuable SQL clauses. Building on our momentum from last week, we'll discuss a different function each week. After that, I want to focus on Pivot tables in spreadsheets and explore the difference between a Temporary(Temp) table and a View in SQL. Towards the end, I want to delve a bit into visualizing your data and principles and the thought process before, during and after visualization.
Subquery:
Subqueries, also known as inner queries or nested queries, are used to retrieve data that will be used in the main query as a condition to further restrict the data to be retrieved. Essentially, they allow you to perform operations that require multiple steps, all within a single query.
Let's look at this Customer database as an example:
Here we have two tables, `Orders`
and `Customers`
.
Orders:
order_id | customer_id | order_date |
1 | 1003 | 2020-09-29 |
2 | 1001 | 2020-11-03 |
3 | 1002 | 2021-03-14 |
4 | 1005 | 2021-06-25 |
5 | 1004 | 2022-02-07 |
Customers:
customer_id | customer_name |
1001 | Kevin |
1002 | Peter |
1003 | Mary |
1004 | Anna |
1005 | Bob |
My task is to retrieve customers' names from the Customer database, but only for customers who had made purchases since 2021. By using a subquery, I can first identify the customers with purchases since the start of 2021, and then use that information to filter the main query.
SELECT customer_name
FROM Customers
WHERE customer_id IN (SELECT customer_id FROM Orders WHERE order_date > '2021-01-01');
This multi-step operation, all performed within a single query, saved me both time and computational resources. However, you can make it more readable by rewriting it:
-- Fetch names of customers who made purchases since the start of 2021.
SELECT customer_name AS name
FROM Customers
WHERE customer_id IN (
-- Subquery to get customer IDs from orders made after 2021-01-01.
SELECT customer_id
FROM Orders
WHERE order_date > '2021-01-01'
);
I've employed three techniques to make my SQL query more readable and easy to follow:
Indentation can help distinguish between different levels of logic.
Aliases help you to be less descriptive on obvious things.
Comments next to your query can explain your intentions.
Now, let's delve a little bit deeper, imagine there are two more tables in the same database:
Products:
product_id | product_name | price |
1 | Apple | 0.50 |
2 | Banana | 0.30 |
3 | Orange | 0.60 |
4 | Pear | 0.70 |
5 | Peach | 0.80 |
Addresses:
address_id | customer_id | street | city | state |
1 | 1001 | 123 Apple St | New York | NY |
2 | 1002 | 456 Banana Blvd | Chicago | IL |
3 | 1003 | 789 Orange Ave | Los Angeles | CA |
4 | 1004 | 321 Pear Pkwy | Houston | TX |
5 | 1005 | 654 Peach Pl | Phoenix | AZ |
The task at hand is to find the number of customers and the total amount spent in each city for orders made since 2021. This seems overwhelming at first glance but let's go through the code:
-- What's being selected is what's being returned
SELECT
a.city, -- Select city from Address table
COUNT(DISTINCT c.customer_id) AS num_customers, -- Count unique customer IDs
SUM(p.price) AS total_spent, -- Sum the prices of the products
CASE -- Classify cities based on total spending
WHEN SUM(p.price) > 5000 THEN 'High spending city'
ELSE 'Low spending city'
END AS spending_level -- Creating a new column called spending_level
FROM
Customers c -- Table aliases can be defined directly after the table names
JOIN
Addresses a ON c.customer_id = a.customer_id -- a -> c
JOIN
Orders o ON c.customer_id = o.customer_id -- o -> a + c
JOIN
Products p ON o.product_id = p.product_id -- p -> o + a + c
WHERE
c.customer_id IN (
SELECT customer_id
FROM Orders
WHERE order_date > '2021-01-01' -- Only consider orders made after 2021-01-01
)
GROUP BY
a.city -- Group results by city
HAVING
SUM(p.price) > 5000; -- Only include cities that are categorized as High spending city.
Let's explain more because it's quite a lengthy query. Essentially, the return table based on the SELECT
statement will look like this:
a.city | num_customers | total_spent | spending_level |
The GROUP BY
statement is used to group rows that have the same values in specified columns into aggregated data, like count, average, maximum, minimum, etc. For each city, you will get the count of unique customer IDs 'num_customers'
, the total spent by these customers 'total_spent'
, and the spending_level
classification based on the total spent.
So, in the final table, each row would represent a city where the total amount spent is more than 5000, and you'll see the number of unique customers and the spending level, which is "High spending city" for all cities.
CAST Function:
The CAST()
function in SQL is used to convert one data type to another. This can be very useful in situations where you need to change the data type of a column either as part of your data cleaning process or to perform certain operations that require a specific type of data type. For example, arithmetic operations can only be performed on INT
, BIGINT
, DECIMAL
, FLOAT
etc.
Here is a basic syntax of the CAST()
function:
CAST(expression AS data_type)
Here expression
is the value or the column that you want to convert and data_type
is the target data type you want to convert it into.
Let's use the `Products`
table as an example:
product_id | product_name | price |
1 | Apple | 0.50 |
2 | Banana | 0.30 |
3 | Orange | 0.60 |
4 | Pear | 0.70 |
5 | Peach | 0.80 |
And the `price`
column is stored as a floating point number. If you want to convert the price to an integer, you could use the CAST()
function like this:
SELECT product_id, CAST(Price AS INT) as Price
FROM Products
However, this would return all values as 0 because casting a decimal number into an integer in SQL will truncate the decimal part, effectively rounding down to the nearest integer which is 0. To round the number as an INT
, rewrite the query as:
SELECT product_id, CAST(ROUND(Price) AS INT) as Price
FROM Products
Here, the CAST()
function has converted the floating point numbers to integers.
product_id | product_name | price |
1 | Apple | 1 |
2 | Banana | 0 |
3 | Orange | 1 |
4 | Pear | 1 |
5 | Peach | 1 |
In a similar manner, CAST()
can be used to convert dates to strings, strings to dates, numbers to strings, etc., as per the requirement. Keep in mind though, not all conversions are possible. For example, trying to convert a string that does not represent a number (like 'abc') to an integer would result in an error.
Pivot table:
A pivot table is a data summarization tool used in spreadsheet programs like Microsoft Excel and Google Sheets, and in data visualization tools like Tableau. Pivot tables are used to sort, reorganize, group, count, total or average data stored in a database or spreadsheet and they are extremely powerful for analyzing large quantities of data.
Imagine you're a detective, and you've got a huge pile of evidence to sort through. It's all important, but in its current state, it's a jumbled mess. That's where a Pivot Table comes in.
Let's take a more visual scenario:
You own a chain of stores selling different categories of products, like electronics, groceries, clothing etc., across several regions. Every day, each of your stores is generating sales data that you collect. This data includes:
Date of sale
Product Category (Electronics, Groceries, Clothing, etc.)
Individual Product sold
Store Region (North, South, East, West)
Salesperson's name
Sales Amount
As you can imagine, in just a short period, you have a mountain of data. It's like a puzzle with thousands of pieces.
Without Pivot Table:
It's like looking at the pile of puzzle pieces in the box. You have all the information, but it's hard to see the full picture. You can see individual sales, but answering questions like these is difficult:
What are the total sales for each product category?
Who is the top-performing salesperson in each region?
How have sales trends changed month over month?
With Pivot Table:
By using a Pivot table you can sort these puzzle pieces to form different parts of the picture. It takes simple column-wise data as input and groups the entries into a two-dimensional table that provides multi-dimensional analysis.
- If you want to see the total sales for each product category, just "pivot" on the product category. Your jumble of data now neatly shows you exactly what you wanted:
Product Category | Total Sales |
Electronics | $X |
Groceries | $Y |
Clothing | $Z |
- Want to see the top-performing salesperson in each region? Pivot on the salesperson and region:
Region | Top Salesperson | Sales |
North | John Doe | $A |
South | Jane Smith | $B |
East | Jim Brown | $C |
West | Jill White | $D |
Just like a detective, you're trying to find insights hidden in your data. The Pivot Table is a powerful tool that sorts through the mess and presents the clues in a way that you can easily make sense of them.
Temp tables vs Views:
The two might sound similar to each other, but here are 4 reasons why they are used and created for different purposes:
Purpose:
They can be very useful for storing intermediate results for complex queries, or for breaking down a large task into smaller, more manageable ones.
A view is a virtual table based on the result set of an SQL statement. It's often used to encapsulate complex queries, limit data access, or join and aggregate data from multiple tables.
Storage and Performance:
Temp tables are physically stored on the database server, they can be indexed, and the database engine can optimize its performance.
Views are not physically stored; it's just saved queries. When you query a view, the underlying queries are run. Therefore, performance depends on the complexity of those queries.
Persistence and Visibility:
Temp tables exist only for the duration of your database session or even a specific procedure or task. After that, it's automatically dropped. Other users can't see your temp tables.
Views persist in the database until it's explicitly dropped, and they can be used by anyone with sufficient privileges, just like a regular table.
Modification of Data:
You can perform DML operations like
INSERT
,UPDATE
,DELETE
on a temp table just like a regular table.In general, simple views such as a subset of one table's columns can be used to modify data, but complex views such as views that involve aggregate functions,
GROUP BY
, or multipleJOIN
s often can't be.
Data Visualization:
Data visualization is the process of transforming complex data into easily digestible visuals, facilitating swift understanding and prompting data-driven decision-making. By uncovering hidden patterns and trends, it tells compelling stories and makes it accessible to a wider audience. Ultimately, it elevates data analysis from a raw numbers game to an insightful, visual narrative.
One useful framework is the David McCandless method, shown in the visuals below, it can help you organize your thoughts about data visualization and give you a useful checklist to reference.
A successful visualization should be a consistent balance between 4 key requirements:
Information: data you are working with
Story: clear and compelling narrative or concept
Goal: specific objective or function for the visual
Visual form: effective use of metaphor or visual expression
Another useful approach is to focus on the pre-attentive attributes, they are the elements of a data visualization that people recognize automatically without conscious effort. The essential, basic building blocks that make visuals immediately understandable are called marks and channels.
Marks: The geometric shapes used to represent data points on a graph. Examples of marks include points, lines, bars, areas, etc.
Channels: These are visual properties used to encode data. Examples of channels include position, color, size, shape etc.
While there are as many as 13 types of graphs commonly used by data analysts, the choice is often dependent on the type of data or story you are trying to present, your business goal as well as your audiences.
There are 9 basic principles of design that data analysts should think about when building their visualizations. These principles focus more on the readers' experience than the data itself, in other words, this is about reaching your audiences or in one word, Art. Here's a summary of the 9 principles.
Principles | Description |
Balance | Key visual elements, like color and shape should be distributed evenly. This doesn’t mean complete symmetry, but your visualization shouldn’t have one side distracting from the other. |
Emphasis | Your data visualization should have a focal point so that your audience knows where to concentrate. |
Movement | Movement in data visualization should mimic the way people usually read. You can use lines and colors to pull the viewer’s attention across the page. |
Pattern | Patterns can be used to highlight similarities between different data sets or break up a pattern with a unique shape, color, or line to create more emphasis. |
Repetition | Repeating chart types, shapes, or colors throughout a project adds to the effectiveness of your presentation. |
Proportion | Using various colors and sizes helps demonstrate that you are calling attention to a specific visual over others. |
Rhythm | Rhythm is closely tied to the movement principle. If your finished design doesn’t successfully create a flow, you might want to rearrange some of the elements to improve the rhythm. |
Variety | Variety in chart types, shapes and colors can keep the audience engaged. But it is good to find balance since too much variety can confuse people. |
Unity | If the visual is disjointed or not well organized, it will be confusing and overwhelming. |
Finally, we will wrap up this week on Design thinking, similar to the 9 principles of design, design thinking is a framework used to identify alternative strategies for your visualizations that might not be clear right away.
Empathize: Thinking about the emotions and needs of the target audience for the data visualization
Define: Figuring out exactly what your audience needs from the data
Ideate: Generating ideas for data visualization
Prototype: Putting visualizations together for testing and feedback
Test: Showing prototype visualizations to people before stakeholders see them
Essentially, we can use this 5-step framework to design any visualization that is targeted at solving your particular business goal, telling a convincing story through data and reaching to your audience.
In two days, I will celebrate my 3rd month on this platform. I'm pleased to have integrated this approach to document my learning journey. It's been a journey of discovery, trial, and continuous growth. As I persist, I find joy in sharing these insights and my journey with you all. Here's to more enlightening moments and the pursuit of knowledge. Until next one, happy learning!