Week 19: Refactoring code (Bellabeat Case Study)

Over the past couple of weeks, I have been working on the Bellabeat case study, which involved one particularly large table that I had to clean with Python, because Google Sheets cannot stably handle that many rows and kept crashing.

Case Study:

A snippet of the code before refactoring:

import pandas as pd

# Apply the transform_id helper (defined earlier in the full script) to the id column
data['id'] = data['id'].apply(transform_id)

# Adjust METs values by dividing by 10
data['METs'] = data['METs'] / 10

# Convert the 'date' column to datetime format
data['date'] = pd.to_datetime(data['date'], format='%m/%d/%Y %I:%M:%S %p')

# Group by 'id' and resample to daily intervals, calculating the mean, min, and max of METs
resampled_data = (
    data.groupby('id')
    .resample('D', on='date')['METs']
    .agg(['mean', 'min', 'max'])
)

# Reset the index
resampled_data.reset_index(inplace=True)

# Rename the mean, min, and max columns
resampled_data.rename(columns={'mean': 'daily_average_METs',
                      'min': 'min_METs', 'max': 'max_METs'}, inplace=True)

The initial version of my Python code was functional and met the requirements of the task at hand. However, as I added more features to the analysis and navigated the evolving requirements, the code began to look messy. There were areas of repetition, some portions were hard to read, and it was becoming increasingly difficult to maintain. That's when I decided to delve into the world of refactoring.

Refactoring, in the context of programming, is the process of restructuring existing code without changing its external behavior: it's about improving the design, structure, and implementation of the code while preserving its functionality.

A key part of my refactoring process was breaking down my code into functions. Instead of having a long script doing all the work, I divided the script into several functions, each doing a specific task. This made the code cleaner, easier to understand, and more maintainable. Here's the same portion of the code after refactoring:

def transform_ids(data):
    # Apply the function to the id column
    data['id'] = data['id'].apply(transform_id)
    return data


def adjust_mets_values(data):
    # Adjust METs values by dividing by 10
    data['METs'] = data['METs'] / 10
    return data


def convert_date_format(data):
    # Convert the 'date' column to datetime format
    data['date'] = pd.to_datetime(data['date'], format='%m/%d/%Y %I:%M:%S %p')
    return data


def resample_data(data):
    # Group by 'id' and resample to daily intervals, calculating the mean, min, and max of METs
    resampled_data = (
        data.groupby('id')
        .resample('D', on='date')['METs']
        .agg(['mean', 'min', 'max'])
    )
    resampled_data.reset_index(inplace=True)
    return resampled_data


def rename_resampled_columns(resampled_data):
    # Rename the mean, min, and max columns
    resampled_data.rename(columns={'mean': 'daily_average_METs',
                          'min': 'min_METs', 'max': 'max_METs'}, inplace=True)
    return resampled_data

By grouping the logic into individual functions, the code is now more reusable: each function can be called whenever needed, reducing redundancy and improving readability.
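To illustrate how these pieces fit together, here is a minimal sketch of the whole cleaning pipeline chained with pandas' .pipe(); the clean_mets_data name and the raw_mets.csv path are placeholders of mine, not the actual file from the case study:

import pandas as pd


def clean_mets_data(path):
    # Chain the cleaning steps; each function takes and returns a DataFrame
    return (
        pd.read_csv(path)
        .pipe(transform_ids)
        .pipe(adjust_mets_values)
        .pipe(convert_date_format)
        .pipe(resample_data)
        .pipe(rename_resampled_columns)
    )


daily_mets = clean_mets_data('raw_mets.csv')  # placeholder file name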

Another significant improvement I made was in handling date-time data. Working with time-series data can be tricky because of the varying formats and manipulations required. Initially, my code for handling dates was mixed in with other data-cleaning steps, making it hard to follow. To make it cleaner, I created a function called convert_date_format(), which takes care of converting the date column to the desired format. Encapsulating date-time handling this way made the code more organized and the logic behind the date-time manipulation more transparent.

I also realized the importance of meaningful comments in refactoring. Comments that accurately explain what a block of code or a function does can be incredibly helpful for readers or your future self. In my refactoring process, I ensured that each function was accompanied by a brief comment explaining its purpose and functionality.

Lastly, refactoring should not alter the functionality of the code, so it's crucial to verify that the output remains the same after the refactoring process. I did this by comparing the output of the refactored code with the output of the original code. This was my first time refactoring code by organizing it into functions, and the outcome is a neat, satisfactory piece of code that still meets the task requirements.
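One simple way to automate that comparison is pandas' built-in testing helper. This is just a sketch; original_output and refactored_output are placeholder names for the two results being compared:

from pandas.testing import assert_frame_equal

# Raises an AssertionError if the two DataFrames differ in values, dtypes, or shape
assert_frame_equal(
    original_output.reset_index(drop=True),
    refactored_output.reset_index(drop=True),
)
print('Outputs match: refactoring preserved the behavior')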

What's Next:

Indeed, the past two weeks have not unfolded as I had hoped in terms of progress. I must admit, I underestimated the complexity of the project at hand. Moving forward, my immediate task is to merge three tables from Dataset A into one comprehensive table, and then check for missing values, outliers, or conflicting data. Once this is done, I will kickstart the Exploratory Data Analysis (EDA) for both Datasets A and B. This process will help generate initial hypotheses for the in-depth analysis stage.
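As a rough sketch of what that merge and those checks might look like (the table names and the id/date keys here are assumptions of mine, not the actual column names in Dataset A):

import pandas as pd

# Merge the three tables on their shared keys (assumed to be id and date)
merged = (
    daily_activity
    .merge(daily_sleep, on=['id', 'date'], how='outer')
    .merge(daily_steps, on=['id', 'date'], how='outer')
)

# Count missing values per column
print(merged.isna().sum())

# Summary statistics to help spot outliers
print(merged.describe())

# Conflicting data, e.g. duplicate id/date pairs
print(merged.duplicated(subset=['id', 'date']).sum())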

The EDA and the actual analysis both entail creating visualizations. I aim to leverage the capabilities of R or Python to generate graphs for the EDA, while Tableau and Power BI will be my tools of choice for the final analysis.
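On the Python side, here is a quick sketch of the kind of EDA graph I have in mind, using the columns produced by the pipeline above (daily_mets is the placeholder result from the earlier sketch):

import matplotlib.pyplot as plt

# One line per user: daily average METs over time
fig, ax = plt.subplots(figsize=(10, 5))
for user_id, group in daily_mets.groupby('id'):
    ax.plot(group['date'], group['daily_average_METs'], alpha=0.5)
ax.set_xlabel('Date')
ax.set_ylabel('Daily average METs')
ax.set_title('Daily average METs per user')
plt.show()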

Although I anticipate transitioning into the crucial ANALYZE stage this week, I recognize the need to expedite my progress to meet my self-imposed deadline. My goal is to have a fully functioning personal portfolio page featuring a comprehensive data analysis case study by September 25th. Given the magnitude of the task, it's time to buckle down and forge ahead!