Week 14: Data Viz with Tableau & PowerBI

As I approach the end of the Google Data Analytics and CS50 courses, I've planned out a roadmap for the next 3 months, which I'll share with you towards the end of this week's update. This week's focus revolves around two popular visualization tools.

What is data visualization for:

In the "Microsoft Certified: Data Analyst Associate" course on Logikbot, it's proposed that there are 5 key stages of data analytics. Each stage has a unique purpose and makes use of data visualization in different ways. Let's break down these components and see how visualization fits in each stage:

| Stage | Focus | Visualizations |
|---|---|---|
| Descriptive | Analyze historical data to understand what has happened | Show historical trends and patterns |
| Diagnostic | Discover why something happened | Pinpoint causes and relationships of events |
| Predictive | Predict future outcomes | Illustrate predicted future trends |
| Prescriptive | Recommend actions to achieve desired outcomes | Display recommended actions and pathways |
| Cognitive | Simulate human thought processes in decision-making | Illustrate how LLMs are processing and deciding upon unstructured data |

Data visualization is essential as it enables analysts to translate complex data into simple, intuitive visuals. This is particularly important when communicating with a target audience that often lacks a technical background, such as colleagues from other departments of your company or representatives of your client companies. Visualization is useful in every stage of analytics, from understanding historical data to making future predictions and recommendations.

Machine Learning, especially in the Cognitive stage, is increasingly important in the field of data analytics. Knowledge of machine learning algorithms can allow analysts to handle larger, more complex data sets, and generate more advanced insights.

Tableau or PowerBI:

Choosing between Tableau and PowerBI, or even deciding to use both, largely depends on your specific needs and circumstances. I myself have decided to learn both and apply my newfound skills through case studies and a capstone project.

Tableau thrives on a large, community-driven ecosystem of shared workbooks, forums, and extensions. Its strength lies in its robust data visualization capabilities, enabling users to create complex, detailed, and highly customized visualizations. Mastering Tableau can be challenging, though: its vast array of features and dense interface come with a steep learning curve. If you are willing to surmount that curve, Tableau can provide unparalleled data visualization and analytical capabilities.

In contrast, PowerBI, a product of Microsoft, meshes seamlessly with the Microsoft Office suite and the wider Microsoft ecosystem, and it's often considered more accessible, especially for those already working with Microsoft tools. While that integration is a significant advantage, PowerBI doesn't offer the same degree of customization as Tableau, a limitation that becomes more apparent when tailoring visualizations or analysis methods to very specific requirements. PowerBI's ease of use may compensate for this, but for many, including myself, Tableau's flexibility and breadth are more appealing.

Ultimately, it's about aligning the tool with your team's needs, skills, and overall operational environment. In my opinion, knowing both will never hurt your resume. Alternatively, R in RStudio can be useful if you're looking for a more code-driven, statistical approach to data analysis and visualization. RStudio, paired with R's rich package ecosystem, provides even greater flexibility and control over your data workflows, which can be advantageous for more complex analytical tasks or research-driven projects. However, I'm quite skeptical of where R stands in the industry; Python is far superior in terms of versatility, library availability, and wide-ranging applicability beyond data analysis, including my favorite topic, machine learning.

Linking data in Tableau:

Since both visualization tools are quite similar, with Tableau being "slightly" ahead in the game, I want to talk about something I found interesting in Tableau so far. In Tableau, Join and Relationship are both methods for combining data from different tables, but they work differently.

A Join merges tables into a single, larger one based on a related column. The merge happens immediately, potentially leading to data duplication or loss if mishandled. It's represented by a Venn diagram.

Relationship is more flexible. It creates a dynamic connection between tables without physically merging them. The actual combination of data happens only when needed for specific analyses, mitigating data duplication or loss and making it easier to handle large, complex datasets. It's represented by lines connecting the tables.

The internal workings of SQL and Tableau are not the same, but I have a habit of coming up with analogies to help me memorize new concepts. Analogously, a Join is similar to a temp table in SQL, physically storing the combined data, while a Relationship resembles a view, creating a virtual table and pulling specific data only when required.
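
To make the analogy concrete, here is a minimal SQL sketch of the temp-table-versus-view idea, using two hypothetical tables, 'orders' and 'customers' (this is plain SQL for illustration, not how Tableau works under the hood):

-- Join-like: physically materialize the combined rows once.
CREATE TEMP TABLE order_details AS
SELECT
    o.order_id,
    o.order_date,
    c.customer_name
FROM
    orders o
    JOIN customers c ON o.customer_id = c.customer_id;

-- Relationship-like: define how the tables connect, but pull the
-- combined rows only when the view is actually queried.
CREATE VIEW order_details_view AS
SELECT
    o.order_id,
    o.order_date,
    c.customer_name
FROM
    orders o
    JOIN customers c ON o.customer_id = c.customer_id;

The temp table fixes the combined data in place the moment it's created, while the view stays a lightweight definition that's evaluated on demand, which is roughly the spirit of a Relationship in Tableau.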

PARTITION BY Clause:

Suppose you have a 'stocks' table with the columns 'ticker', 'date', and 'closing_price', and you want to calculate the average closing price for each stock.

| ticker | date | closing_price |
|---|---|---|
| AAPL | 2023-01-01 | 150.69 |
| AAPL | 2023-01-02 | 152.41 |
| AAPL | 2023-01-03 | 153.13 |
| MSFT | 2023-01-01 | 105.66 |
| MSFT | 2023-01-02 | 107.50 |
| MSFT | 2023-01-03 | 108.30 |

Without using PARTITION BY, you would have to GROUP BY 'ticker', which would give you one row per 'ticker' with the average closing price:

SELECT 
    ticker,
    AVG(closing_price) as avg_closing_price_per_stock
FROM 
    stocks
GROUP BY
    ticker;

This query will return a table like this:

| ticker | avg_closing_price_per_stock |
|---|---|
| AAPL | 152.08 |
| MSFT | 107.15 |

But what if you want to retain all the original rows and just add a column with the average closing price per stock? This is where PARTITION BY becomes very useful.

Here is an example:

SELECT 
    ticker,
    date,
    closing_price,
    AVG(closing_price) OVER (
        PARTITION BY ticker
    ) as avg_closing_price_per_stock
FROM 
    stocks;

In this query, AVG(closing_price) OVER (PARTITION BY ticker) calculates the average of 'closing_price' for each partition of 'ticker'. The result is a new column 'avg_closing_price_per_stock' which shows the average closing price per stock alongside every row, without reducing the result set to one row per group.

| ticker | date | closing_price | avg_closing_price_per_stock |
|---|---|---|---|
| AAPL | 2023-01-01 | 150.69 | 152.08 |
| AAPL | 2023-01-02 | 152.41 | 152.08 |
| AAPL | 2023-01-03 | 153.13 | 152.08 |
| MSFT | 2023-01-01 | 105.66 | 107.15 |
| MSFT | 2023-01-02 | 107.50 | 107.15 |
| MSFT | 2023-01-03 | 108.30 | 107.15 |

To wrap up, PARTITION BY is a versatile tool that allows you to perform aggregate calculations on partitions of your data while maintaining the original granularity of your result set.
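
As a small illustration of that versatility, here is a sketch of my own (using the same 'stocks' table and standard window-frame syntax, as supported by databases like PostgreSQL) that combines PARTITION BY with ORDER BY to compute a 3-day moving average per ticker:

SELECT 
    ticker,
    date,
    closing_price,
    AVG(closing_price) OVER (
        PARTITION BY ticker      -- restart the calculation for each stock
        ORDER BY date            -- order each stock's rows by date
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW  -- current row plus the two before it
    ) as moving_avg_3_day
FROM 
    stocks;

Each row now carries the average of its own closing price and the two trading days before it, again without collapsing the result set the way GROUP BY would.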

Study roadmap for the next 3 months:

I previously outlined a checklist I want to complete by the end of 2023, but now I want to share a more concrete roadmap for the next 3 months, so you know what content to expect in the coming blog posts. Currently, I have two final projects in front of me: one from the CS50 course and the other the capstone case study from the Google Data Analytics course.

What I will be doing:

I plan to build a personal website as my CS50 final project, which can then host my portfolio, including the case study from the Google DA course (I'm not exactly sure about the implementation yet, but I plan to involve Kaggle, Tableau, and PowerBI, with a touch of machine learning, fuiyoh!). Both will take a while to complete. While I'm busy working on them, I'll complete two more related courses: the Logikbot course mentioned at the beginning of this blog, and the Google Advanced Data Analytics professional course. At the same time, I'll follow study guides for both the DP-300 and PL-300 exams. I'm expecting a turbulent week or two adapting to my new routine.

What I might be doing:

At some point, I hope to find time in between to continue polishing my Python skills, probably by doing some LeetCode problems and taking a data structures and algorithms course. Last but not least, I'll take up the new CS50 cybersecurity course, because I have a coworker who's interested in breaking into that field but never had the motivation, and I wanted to use that as an excuse to force myself into learning something extra on the side. To motivate myself to motivate my coworker, I plan to do the CS50 cybersecurity course alongside an AWS course.

In short, these will perhaps be the busiest 3 months of my life. However, I don't feel overwhelmed; I want to jump right in after I finish writing this blog post. This might also be the last time I bring up this lengthy subject, so I want to end with this: I have not forgotten about writing a blog post on GitHub. It will get done down the road!