Week 3: Scikit-learn, Memory, SQL and Data Analytics

This week, I dedicated more time to my computer science studies as I reached the one-year mark of picking up Japanese. I experienced several moments of insight and some profound realizations about how to translate my studies into tangible success, with the ultimate goal of landing a job in the IT industry.

As I become increasingly occupied, I reflect on the past year, during which I maintained a 40-40 lifestyle: 40 hours of work and 40 hours of study each week. Now, I am transitioning to a 40-55 schedule. While I aspire to push myself even further, I have learned from my self-taught journey in Japanese and programming that steady progress is key: pursuing overly ambitious goals often leads to setbacks, so taking small steps is the more effective approach.

Sklearn on Kaggle:

As mentioned in last week's update, I planned to hone my machine learning skills by working on a small project that will grow over time, and this week's work involved implementing a regression model on a dataset. Let me show you my progress!

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# Standard Kaggle starter cell: import the needed libraries and list the input files

Output

/kaggle/input/world-happiness/2015.csv 
/kaggle/input/world-happiness/2017.csv 
/kaggle/input/world-happiness/2019.csv 
/kaggle/input/world-happiness/2018.csv 
/kaggle/input/world-happiness/2016.csv

data = pd.read_csv("/kaggle/input/world-happiness/2019.csv")
print(data.head())
print(data.shape)
# The dataset I chose for training the model, with a peek at its contents and the shape of the data

Output

   Overall rank Country or region  Score  GDP per capita  Social support  \
0             1           Finland  7.769           1.340           1.587   
1             2           Denmark  7.600           1.383           1.573   
2             3            Norway  7.554           1.488           1.582   
3             4           Iceland  7.494           1.380           1.624   
4             5       Netherlands  7.488           1.396           1.522   

   Healthy life expectancy  Freedom to make life choices  Generosity  \
0                    0.986                         0.596       0.153   
1                    0.996                         0.592       0.252   
2                    1.028                         0.603       0.271   
3                    1.026                         0.591       0.354   
4                    0.999                         0.557       0.322   

   Perceptions of corruption  
0                      0.393  
1                      0.410  
2                      0.341  
3                      0.118  
4                      0.298  
(156, 9)

Below, I did three things:

  1. Separated the data into the input features X and the target y as training data.

  2. Fit the training set with sklearn's built-in linear regression model.

  3. Printed the coefficient of determination for the model's predictions.

X = data[["GDP per capita", "Social support", "Healthy life expectancy", "Freedom to make life choices", "Generosity", "Perceptions of corruption"]]
y = data["Score"]
print(f"Shape of X is {X.shape}")
print(f"Shape of y is {y.shape}")

# Fit an ordinary least-squares linear regression on the full dataset
linear_model = LinearRegression()
linear_model.fit(X, y)
predict = linear_model.predict(X)
print("Prediction on training set (first 5):", predict[:5])
print("Actual target Value (first 5):", y.values[:5])

# model.score() reports the goodness of fit on the training data
goodness = linear_model.score(X, y)
print("Coefficient of Determination:", goodness)

Output

Shape of X is (156, 6)
Shape of y is (156,)
Prediction on training set (first 5): [7.00548205 7.09306376 7.17731778 6.94501295 6.92351457]
Actual target Value (first 5): [7.769 7.6   7.554 7.494 7.488]
Coefficient of Determination: 0.7791638079594221

But I wasn't done with my evaluation, so I did some research with ChatGPT and added a couple more metrics:

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

mse = mean_squared_error(y, predict)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y, predict)
r2 = r2_score(y, predict)

print(f"Mean Squrared Error: {mse:.3f}")
print(f"Root Mean Squrared Error:{rmse:.3f} ")
print(f"Mean Absolute Error: {mae:.3f}")
print(f"R-squared: {r2:.3f}")

Output

Mean Squared Error: 0.272
Root Mean Squared Error: 0.521
Mean Absolute Error: 0.414
R-squared: 0.779
# As you may have noticed, model.score() is in fact also R-squared

My discovery:

Mean Squared Error: A smaller MSE indicates a better fit. However, since the errors are squared, MSE is expressed in the squared units of the target variable, which makes it harder to interpret directly.

Root Mean Squared Error: RMSE is in the same units as the target, making it more interpretable than MSE, and it is widely used in regression problems. A smaller RMSE indicates better model performance.

Mean Absolute Error: MAE is less sensitive to large errors than MSE or RMSE, making it more robust to outliers.

R-squared: The coefficient of determination, or the "goodness of fit" of the model; it measures the proportion of variance in the target that the model explains.

Summary of different metrics:

  • A high MSE or RMSE indicates the model is making large errors in its predictions.

  • A high MAE indicates the model is making consistent errors across the board.

  • Comparing RMSE and MAE:

    • If RMSE is significantly GREATER than MAE, the model is making a few large errors (which may call for more outlier-robust techniques), as illustrated in the sketch below.
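
To convince myself of that last point, here is a tiny sketch with made-up numbers (not the happiness data) showing how a single large error pulls RMSE well above MAE:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical targets and predictions: four small errors and one big outlier
y_true = np.array([7.0, 6.5, 6.0, 5.5, 5.0])
y_pred = np.array([6.9, 6.6, 5.9, 5.6, 2.0])  # last prediction is off by 3.0

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"MAE:  {mae:.3f}")   # 0.680, dominated by the four small errors
print(f"RMSE: {rmse:.3f}")  # about 1.345, pulled up by the single large error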

Review of the first 5 samples from the dataset:

Prediction 1: 7.00548205 | Actual: 7.769 | Error: 0.76351795

Prediction 2: 7.09306376 | Actual: 7.6 | Error: 0.50693624

Prediction 3: 7.17731778 | Actual: 7.554 | Error: 0.37668222

Prediction 4: 6.94501295 | Actual: 7.494 | Error: 0.54898705

Prediction 5: 6.92351457 | Actual: 7.488 | Error: 0.56448543

Evaluation: Overall, the performance of this model seems decent, with an R-squared of 0.7792, indicating that it captures a significant portion (77.92%) of the variance in the target variable. However, there is room for improvement. The errors for the first five samples vary between 0.37 and 0.76 points, while the happiness score sits on a 0-10 scale, so it is fair to say the errors are rather large in comparison and could get in the way of a meaningful interpretation of the results.

Steps for improvement:

In the next couple of weeks, I plan to experiment with these ideas and see where they lead me:

  1. Visualize the data: Create scatter plots for each independent variable against the dependent variable to see if there's any nonlinearity in the relationship.

  2. Fit a polynomial regression model: Transform the independent variables by adding higher-degree terms (e.g., squared or cubic terms) and fit a new custom regression model with gradient descent.

  3. Evaluate the model performance: Compare the performance metrics (e.g. RMSE, MAE, R-squared) between the linear regression model and the polynomial regression model.

  4. Cross-validation: Perform cross-validation to assess the model's performance on different data subsets and check for overfitting (a rough sklearn sketch covering items 2 and 4 follows this list).
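
As a head start on items 2 and 4, here is a rough sketch of what I might try first, using sklearn's built-in PolynomialFeatures and cross_val_score rather than a custom gradient-descent implementation; the degree and fold count are just placeholder choices:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Reuse X and y from the earlier cells; degree=2 is only a starting guess
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)

# 5-fold cross-validated R-squared for both models, to check for overfitting
linear_cv = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
poly_cv = cross_val_score(poly_model, X, y, cv=5, scoring="r2")

print("Linear regression CV R-squared:    ", linear_cv.mean())
print("Polynomial regression CV R-squared:", poly_cv.mean())

One thing to keep in mind: the 2019.csv file is sorted by overall rank, so shuffled folds (for example KFold(n_splits=5, shuffle=True, random_state=0)) would likely give a fairer estimate than the default ordered split.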

CS50:

This week, I delved under the hood of data storage, pointers, and the hexadecimal notation used to represent memory. I am currently working on the Filter assignment, which requires coding several functions in C to modify the color and position of pixels in a BMP file. The objective is to achieve one of four effects: Blur, Greyscale, Sepia, and Reflection.

A simple bitmap, like an 8x8 grid, can illustrate how data is transformed into an image on your screen. Each slot in the grid represents a pixel, which stores an RGB value ranging from 000000 to ffffff. By imagining this grid as your screen, you can better understand the process of image visualization.
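
The assignment itself is written in C, but to make the idea concrete, here is a minimal Python sketch on a hypothetical 2x2 grid of (R, G, B) pixels, using the simple grayscale rule of averaging the three channels:

# A tiny 2x2 "bitmap": each pixel is an (R, G, B) tuple with values 0-255
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

def grayscale(image):
    """Replace each pixel with the average of its R, G and B channels."""
    result = []
    for row in image:
        new_row = []
        for r, g, b in row:
            avg = round((r + g + b) / 3)
            new_row.append((avg, avg, avg))
        result.append(new_row)
    return result

print(grayscale(image))
# [[(85, 85, 85), (85, 85, 85)], [(85, 85, 85), (255, 255, 255)]]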

SQL and Data Analytics:

As mentioned at the beginning of this blog, I looked at where my skills are lacking, and I remembered the breakdown I made of Machine Learning and, to a great extent, AI development:

  1. Data Collection and Optimization

  2. Field Knowledge

  3. Model Architecture

So far I've only been focusing on the third component, but I realized it's time to address the more pressing issues: a good model can only produce high-quality output when it's fed with reliable input. Naturally, I looked up SQL. To refresh my memory, I watched a few brief tutorials, as I had forgotten what I learned during the Python for Everybody course. Additionally, I began a new course on Coursera, Google Data Analytics. I am currently juggling three separate courses, and with my intensified efforts, I aim to complete them within the next three to five months.
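
To tie the SQL refresher back to the modelling work above, here is a minimal sketch of the kind of pipeline I have in mind, using Python's built-in sqlite3 module together with pandas; the database file and table name are made up for illustration:

import sqlite3
import pandas as pd

# Hypothetical local database; in practice it could be built from the Kaggle CSV
conn = sqlite3.connect("happiness.db")

# Pull only the columns the model needs, straight into a DataFrame
query = """
SELECT "GDP per capita", "Social support", "Score"
FROM happiness_2019
WHERE "Score" IS NOT NULL;
"""
df = pd.read_sql_query(query, conn)
conn.close()

print(df.head())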

Tomorrow, I plan to explore the job market and compile a list of skills and experiences necessary for a Python developer position. There is no guarantee, of course, especially with market-disruptive tools such as ChatGPT and GPT-4; at times it feels like I am climbing a vertical cliff rather than fighting an uphill battle. Still, with sufficient preparation, opportunities often arise, and starting as a junior developer will undoubtedly provide a strong foundation for my eventual transition to becoming an ML/AI developer.

See you next week!