This week, I dedicated more time to my computer science studies as I reached the one-year mark of picking up Japanese. I experienced several moments of insight and some profound realizations about how to translate my studies into tangible success and, ultimately, a job in the IT industry.
As I become increasingly occupied, I reflect on the past year, during which I maintained a 40-40 lifestyle: 40 hours of work and 40 hours of study each week. Now I am transitioning to a 40-55 schedule. While I aspire to push myself even further, my self-taught journey through Japanese and programming has taught me that steady progress is key. Pursuing overly ambitious goals often leads to setbacks, so taking small steps is the more effective approach.
Sklearn on Kaggle:
As mentioned in last week's update, I planned to hone my machine learning skills by working on a small project that will grow over time. This week's work involved implementing a regression model on a dataset. Let me show you my progress!
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# Basic setup: import the needed libraries and list the files in the input directory
Output
/kaggle/input/world-happiness/2015.csv
/kaggle/input/world-happiness/2017.csv
/kaggle/input/world-happiness/2019.csv
/kaggle/input/world-happiness/2018.csv
/kaggle/input/world-happiness/2016.csv
data = pd.read_csv("/kaggle/input/world-happiness/2019.csv")
print(data.head())
print(data.shape)
# The dataset I chose for training the model, with a peek at its contents and the shape of the data
Output
Overall rank Country or region Score GDP per capita Social support \
0 1 Finland 7.769 1.340 1.587
1 2 Denmark 7.600 1.383 1.573
2 3 Norway 7.554 1.488 1.582
3 4 Iceland 7.494 1.380 1.624
4 5 Netherlands 7.488 1.396 1.522
Healthy life expectancy Freedom to make life choices Generosity \
0 0.986 0.596 0.153
1 0.996 0.592 0.252
2 1.028 0.603 0.271
3 1.026 0.591 0.354
4 0.999 0.557 0.322
Perceptions of corruption
0 0.393
1 0.410
2 0.341
3 0.118
4 0.298
(156, 9)
Below I did 3 things:
separated the data into the input features X and the actual outcome y as training data.
fit sklearn's built-in linear regression model on the training set.
printed the coefficient of determination of the model's predictions.
X = data[["GDP per capita", "Social support", "Healthy life expectancy", "Freedom to make life choices", "Generosity", "Perceptions of corruption"]]
y = data["Score"]
print(f"Shape of X is {X.shape}")
print(f"Shape of y is {y.shape}")
linear_model = LinearRegression()
linear_model.fit(X, y)
predict = linear_model.predict(X)
print("Prediction on training set (first 5):", predict[:5])
print("Actual target Value (first 5):", y.values[:5] )
goodness = linear_model.score(X, y)
print("Coefficient of Determination:", goodness)
Output
Shape of X is (156, 6)
Shape of y is (156,)
Prediction on training set (first 5): [7.00548205 7.09306376 7.17731778 6.94501295 6.92351457]
Actual target Value (first 5): [7.769 7.6 7.554 7.494 7.488]
Coefficient of Determination: 0.7791638079594221
But I'm not done with my evaluation, so I did some research with ChatGPT and added a couple more metrics:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
mse = mean_squared_error(y, predict)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y, predict)
r2 = r2_score(y, predict)
print(f"Mean Squrared Error: {mse:.3f}")
print(f"Root Mean Squrared Error:{rmse:.3f} ")
print(f"Mean Absolute Error: {mae:.3f}")
print(f"R-squared: {r2:.3f}")
Output
Mean Squared Error: 0.272
Root Mean Squared Error: 0.521
Mean Absolute Error: 0.414
R-squared: 0.779
# As you may have noticed, model.score() is in fact also R-squared
My discovery:
Mean Squared Error: A smaller MSE indicates a better fit. However, since the errors are squared, it can be challenging to interpret the MSE in the same units as the target variable.
Root Mean Squared Error: RMSE is more interpretable than MSE and is widely used in regression problems. A smaller RMSE indicates better model performance.
Mean Absolute Error: MAE is less sensitive to large errors compared to MSE or RMSE, making it more robust to outliers.
R-squared: the coefficient of determination, or the "goodness of fit" of the model; a higher value means the model explains more of the variance in the target.
Summary of different metrics:
A high MSE or RMSE indicates the model is making large errors in its predictions.
A high MAE indicates the model is making sizable errors consistently across the board.
Comparing RMSE and MAE:
- If RMSE is significantly GREATER than MAE, the model is likely making a few large errors (which calls for more outlier-robust techniques).
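To make the RMSE-versus-MAE comparison concrete, here is a tiny sketch I put together with made-up residuals (not part of the notebook) that computes the metrics by hand:
import numpy as np
# Two toy sets of residuals: one with even errors, one with a single large outlier
even_errors = np.array([0.5, 0.5, 0.5, 0.5])
outlier_errors = np.array([0.1, 0.1, 0.1, 1.7])
for name, err in [("even", even_errors), ("outlier", outlier_errors)]:
    mse = np.mean(err ** 2)        # average of the squared errors
    rmse = np.sqrt(mse)            # back in the units of the target
    mae = np.mean(np.abs(err))     # average of the absolute errors
    print(f"{name}: MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}")
Both sets have the same MAE of 0.5, but the outlier set's RMSE jumps to roughly 0.85 while the even set's stays at 0.5, which is exactly the RMSE-greater-than-MAE signature described above.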
Review of the first 5 samples from the dataset:
Prediction 1: 7.00548205 | Actual: 7.769 | Error: 0.76351795
Prediction 2: 7.09306376 | Actual: 7.6 | Error: 0.50693624
Prediction 3: 7.17731778 | Actual: 7.554 | Error: 0.37668222
Prediction 4: 6.94501295 | Actual: 7.494 | Error: 0.54898705
Prediction 5: 6.92351457 | Actual: 7.488 | Error: 0.56448543
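For the record, the comparison above can be generated straight from the arrays instead of copying numbers by hand; a quick snippet along these lines should do it:
# Absolute errors for the first 5 training samples
errors = np.abs(y.values[:5] - predict[:5])
for i, (p, a, e) in enumerate(zip(predict[:5], y.values[:5], errors), start=1):
    print(f"Prediction {i}: {p} | Actual: {a} | Error: {e}")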
Evaluation: Overall, the performance of this model seems decent, with an R-squared of 0.7792, indicating that it captures a significant portion (77.92%) of the variance in the target variable. However, there's room for improvement. The errors vary between 0.37 and 0.76 points in the first 5 samples, but happiness scores only range from 0 to 10, so the errors are rather large in comparison and can get in the way of a meaningful interpretation of the results.
Steps for improvement:
In the next couple of weeks, I plan to experiment with these ideas and see where they lead:
Visualize the data: Create scatter plots for each independent variable against the dependent variable to see if there's any nonlinearity in the relationship.
Fit a polynomial regression model: Transform the independent variables by adding higher-degree terms (e.g., squared or cubic terms) and fit a new custom regression model with gradient descent.
Evaluate the model performance: Compare the performance metrics (e.g. RMSE, MAE, R-squared) between the linear regression model and the polynomial regression model.
Cross-validation: Perform cross-validation to assess the model's performance on different data subsets and check for overfitting (a rough sklearn sketch of this and the polynomial model follows this list).
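Here is that sketch of steps 2 and 4, using sklearn's PolynomialFeatures, make_pipeline, and cross_val_score. I haven't run it against the happiness data yet, and my eventual polynomial model will be a custom one trained with gradient descent, so treat this as a plan rather than a result:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import cross_val_score
# Degree-2 polynomial regression: expand the features, then fit a linear model on them
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
# 5-fold cross-validated R-squared for both models on the same X and y as above
linear_cv = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
poly_cv = cross_val_score(poly_model, X, y, cv=5, scoring="r2")
print("Linear regression CV R-squared:", linear_cv.mean())
print("Polynomial regression CV R-squared:", poly_cv.mean())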
CS50:
This week, I delved under the hood of data storage, pointers, and the hexadecimal representation of bits. I am currently working on the filter assignment, which requires writing various functions in C to modify the color and position of pixels in a BMP file. The objective is to achieve one of four effects: Blur, Greyscale, Sepia, and Reflection.
A simple bitmap, like an 8x8 grid, can illustrate how data is transformed into an image on your screen. Each slot in the grid represents a pixel, which stores an RGB value ranging from 000000 to ffffff. By imagining this grid as your screen, you can better understand the process of image visualization.
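The assignment itself is in C, but to sanity-check my understanding of the greyscale filter, I wrote a tiny Python sketch on a made-up 2x2 grid; each channel of a pixel simply becomes the rounded average of the pixel's original R, G, and B values:
# Toy 2x2 "bitmap": each pixel is an (R, G, B) tuple of 8-bit values
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (128, 64, 32)],
]
def greyscale(pixels):
    # Replace each pixel with the rounded average of its three channels
    out = []
    for row in pixels:
        new_row = []
        for r, g, b in row:
            avg = round((r + g + b) / 3)
            new_row.append((avg, avg, avg))
        out.append(new_row)
    return out
for row in greyscale(image):
    print(row)
The blur, sepia, and reflection filters follow the same pattern of walking the pixel grid; only the per-pixel arithmetic changes (and blur also looks at neighbouring pixels).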
SQL and Data Analytics:
As mentioned at the beginning of this post, I took stock of the skills I'm lacking, and I remembered the breakdown I made of machine learning and, to a large extent, AI development:
Data Collection and Optimization
Field Knowledge
Model Architecture
So far I've only been focusing on the third component, but I realized it's time to address the more pressing issues: a good model can only produce high-quality output when it's fed reliable input. Naturally, I turned to SQL. To refresh my memory, I watched a few brief tutorials, since I had forgotten what I learned during the Python for Everybody course. Additionally, I began a new course on Coursera, Google Data Analytics. Currently, I am juggling three separate courses, and with my intensified efforts, I aim to complete them within the next three to five months.
Tomorrow, I plan to explore the job market and compile a list of the skills and experience needed for a Python developer position. There's no guarantee, of course, especially with market-disruptive tools such as ChatGPT and GPT-4; at times it feels like I'm climbing a vertical cliff rather than fighting an uphill battle. Still, with sufficient preparation, opportunities often arise, and starting as a junior developer will undoubtedly provide a strong foundation for my eventual transition to becoming an ML/AI developer.
See you next week!