Week 20: Looping in Exploratory Data Analysis

This week, I've begun working on the exploratory data analysis for the Bellabeat case study. Before delving into the analysis, I took a moment to reflect on the practicality and effectiveness of beginning with an Exploratory Data Analysis (EDA) as opposed to jumping straight into the ANALYZE stage of the data analysis cycle. To my surprise, I discovered that I had somewhat underestimated just how important data cleaning and manipulation are in the PROCESS stage.

What an EDA entails:

With newfound wisdom, I've crafted a list of potentially insightful questions that I'm eager to explore during the EDA. These questions hold the key to significantly improving the accuracy of my future analyses (a rough sketch of how I might start on a few of them in R follows the list):

  • Is there missing data? Would imputation or exclusion be the more appropriate strategy?

  • Are there statistically significant differences between groups for certain measures? Can we cross-validate daily_METs and daily_activity data?

  • Are there any inconsistencies in the responses?

  • Have ordinal variables been converted into numeric scales for analysis?

  • Are the classes balanced, or is one category overrepresented?

  • If almost every response in a column is the same, is it providing valuable insight, or is it a candidate for removal?

  • What do the distributions of numeric variables look like? Are they skewed? Any outliers?

  • Are there any notable relationships between two numeric variables?

  • Are there any trends, patterns, correlations or variations over time that we can find in visualizations such as histograms, box plots, scatter plots, correlation matrices, and heat maps?
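
To make a few of these questions less abstract, here is a minimal sketch of how I might start probing them in R. The file name and column names (TotalSteps, Calories) are assumptions based on a FitBit-style daily activity table, not the exact schema from my notebook:

```r
# Quick first pass at a few EDA questions; the names below are assumptions,
# not the exact schema I'm working with.
library(ggplot2)

daily_activity <- read.csv("dailyActivity_merged.csv")  # hypothetical file name

# Missing data: count NAs per column to weigh imputation vs. exclusion
colSums(is.na(daily_activity))

# Near-constant columns: candidates for removal if almost every value repeats
sapply(daily_activity, function(col) length(unique(col)))

# Distributions and outliers for a numeric variable
summary(daily_activity$TotalSteps)
ggplot(daily_activity, aes(x = TotalSteps)) +
  geom_histogram(bins = 30)

# Relationship between two numeric variables
ggplot(daily_activity, aes(x = TotalSteps, y = Calories)) +
  geom_point()
```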

As I tackle these questions, I realize this isn't a simple one-at-a-time process. Data cleaning and manipulation intersect with data visualization, which isn't just for the end of an analysis: visualizations reveal patterns, bias, relationships, and insights along the way. Check out the screenshot below of my initial JOIN query for question #2, where I extracted the relevant columns.

Though the query I actually ended up using differs and proved more effective, this underlines that EDA is an iterative process. You'll go back and forth, improving data quality so the analysis that follows rests on firmer ground. Keep iterating!
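
If you can't make out the screenshot, the sketch below shows the same idea in dplyr. My real query was written in SQL, and the table and column names here (daily_mets, ActivityDay, METs, VeryActiveMinutes) are assumptions rather than the exact schema:

```r
# Rough dplyr sketch of joining daily METs onto daily activity so the two
# sources can be cross-validated; table and column names are assumptions.
# daily_activity comes from the earlier sketch; daily_mets is assumed to be
# loaded the same way.
library(dplyr)

mets_vs_activity <- daily_activity %>%
  inner_join(daily_mets, by = c("Id", "ActivityDate" = "ActivityDay")) %>%
  select(Id, ActivityDate, TotalSteps, VeryActiveMinutes, METs)
```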

After joining the tables, I exported the new table to RStudio for statistical analysis. However, as shown in the screenshot below, I discovered a row with a missing (null) value in the METs column, which needs fixing. Moreover, the imported file only contains 500 of the total 728 rows, another issue to address.
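
Here is roughly how I would confirm both problems after importing the file; the file name mets_vs_activity.csv and the METs column name are assumptions:

```r
# Sanity checks on the imported join; file and column names are assumptions.
library(readr)
library(dplyr)

mets_vs_activity <- read_csv("mets_vs_activity.csv")

nrow(mets_vs_activity)      # expecting 728; 500 would confirm the truncated import
problems(mets_vs_activity)  # readr lists parsing issues that can silently drop or mangle rows

# Locate the row(s) with a null METs value to decide whether to impute, re-query, or drop
filter(mets_vs_activity, is.na(METs))
```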

This data analysis phase is proving time-consuming, and I've been on this journey for almost a month now. But thoroughness is key to reliable results, and a personal project gives me the freedom not to compromise on it!

The scope of this case study:

Finally, let's talk about the scale of this Bellabeat case study. In my Google Data Analytics Certificate course, they suggested completing the project in just one week. But honestly, that's a bit unrealistic, especially for beginners like me. This journey has been extensive, and every step I've taken from raw data to where I am now has been meticulously documented in a separate draft.

The sheer magnitude of the project has left me undecided on the format of publishing. Currently, I'm at around 40-50% completion, and the reading time estimate on Hashnode is already at 25 minutes, with over 5500 words. It's a substantial piece of work!

I've come to accept that rushing to meet any deadline might compromise the quality and depth of my investigation. The initial deadline was September 25, which marks 6 months since I started learning about ML. Since I keep discovering new avenues to explore, it's hard to predict how long each step will take.

Nevertheless, my short-term goal remains firm: to break into the industry before I hit 30, which is just 14 months away. Despite the challenges, I'm determined to make it happen and forge a path towards success!