Continuing from previous weeks, this week was yet another stretch of intensive data wrangling and exploration. The journey was filled with intricate challenges, requiring a deep dive into various methods of data manipulation, SQL queries, and R programming.
The truth is, the complexity of the tasks is starting to take its toll. I'm beginning to feel a sense of burnout, and any time taken off feels like a setback. Getting back into the flow becomes increasingly difficult with each break, due to the overhead required to relearn and reconnect with what I've done previously.
But here's an overview of what unfolded:
Dataset Inspection:
Using the read_csv() function from the readr package, I imported the CSV file and examined the structure of the resulting data frame. With the help of the head() and summary() functions, I quickly gleaned an overview of the data, unearthing intriguing observations and inconsistencies that begged for a closer look. These inconsistencies were not trivial; they signaled deeper complexities in the dataset that warranted further investigation.
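For reference, the import-and-inspect step looked roughly like the sketch below; the file name is a placeholder rather than the actual one from my project:
library(readr)
# Import the CSV and take a first look at the data frame
data <- read_csv("survey_data.csv")   # placeholder file name
head(data)      # first few rows
summary(data)   # per-column summaries, including NA counts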
Handling Missing Data:
As discussed in the previous week, the dataset contained missing values, prompting me to locate the affected rows for future imputation. Initially, I identified the rows containing missing data using the which() and is.na() functions in R. Following that, I backed up the original dataset to ensure data integrity, preserving its state before any imputations.
# Rows where any of the M_* columns are missing
na_m_row <- which(is.na(data$M_light) | is.na(data$M_medium) | is.na(data$M_heavy))
print(na_m_row)
# Rows where any of the A_* columns are missing
na_a_row <- which(is.na(data$A_light) | is.na(data$A_medium) | is.na(data$A_heavy))
print(na_a_row)
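The backup itself was nothing elaborate; something along these lines, with the output file name again being a placeholder:
library(readr)
# Keep an untouched copy of the dataset before any imputation
data_original <- data
write_csv(data, "data_backup.csv")   # placeholder file name for the backup copy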
It's worth noting that this approach might not be the most efficient way to handle imputation, especially with larger datasets. The returned list of row numbers with missing data can become unwieldy and less meaningful as the dataset grows. At this stage, I have not fully decided whether the imputation process will involve both BigQuery and RStudio or be confined to just one of those platforms. What I have done so far is a reflection of my preliminary game plan on how to handle the missing data, setting the stage for more refined and targeted efforts in the coming weeks.
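One alternative I'm weighing, instead of printing raw row numbers, is to summarise the missingness directly; a rough sketch, not yet part of my workflow:
# Tally missing values per column instead of listing individual rows
colSums(is.na(data))
# Or pull every incomplete row into one data frame for inspection
incomplete_rows <- data[!complete.cases(data), ]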
Mean Imputation:
To fill in the missing values, I employed the mean imputation method. This approach required calculating the means for specific columns per individual respondent rather than across the whole dataset, which added an extra layer of complexity: the SQL queries had to compute each mean grouped by respondent, so that the imputed values reflected that respondent's own pattern instead of skewing toward the overall average. A snippet of that work is as follows:
CAST(ROUND(AVG(A_light)) AS INT64) AS mean_A_light,
CAST(ROUND(AVG(A_medium)) AS INT64) AS mean_A_medium,
CAST(ROUND(AVG(A_heavy)) AS INT64) AS mean_A_heavy,
CAST(ROUND(AVG(M_light)) AS INT64) AS mean_M_light,
CAST(ROUND(AVG(M_medium)) AS INT64) AS mean_M_medium,
CAST(ROUND(AVG(M_heavy)) AS INT64) AS mean_M_heavy
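For context, those expressions sit inside a larger SELECT; a minimal sketch of how the full query might be shaped, assuming a table called trip_data keyed by a respondent_id column (both names are placeholders, not the actual ones in my project):
SELECT
  respondent_id,
  CAST(ROUND(AVG(A_light)) AS INT64) AS mean_A_light,
  CAST(ROUND(AVG(A_medium)) AS INT64) AS mean_A_medium,
  CAST(ROUND(AVG(A_heavy)) AS INT64) AS mean_A_heavy,
  CAST(ROUND(AVG(M_light)) AS INT64) AS mean_M_light,
  CAST(ROUND(AVG(M_medium)) AS INT64) AS mean_M_medium,
  CAST(ROUND(AVG(M_heavy)) AS INT64) AS mean_M_heavy
FROM trip_data            -- placeholder table name
GROUP BY respondent_id;   -- one row of means per respondent
Since AVG() in BigQuery skips NULLs, each mean is computed only from the values that respondent actually has; the per-respondent means could then be joined back onto the original rows and applied with COALESCE() to fill the gaps.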
Visualization and Onward:
I'm ashamed to admit that the visualizations have not yet been created. However, the week's work laid the groundwork for future data exploration and visualization. With the data cleaned, pre-processed, and missing values handled (still in progress), the stage is set for diving into the actual analysis. Soon, I'll be poised to uncover the underlying patterns and trends that can drive data-driven decisions for the case study I long to finish.