Week 2: ML Libraries and Data Preprocessing

This week, I continued my exciting journey into the world of machine learning, exploring topics such as overfitting and underfitting, neural networks, image recognition, and hands-on implementations with scikit-learn and TensorFlow.

Regularization and Model Fitting:

Overfitting and underfitting are common challenges faced by machine learning practitioners. A model with high variance (see picture below) is almost too adaptive to the training data and therefore fails to generalize and make accurate predictions on new examples. On the other hand, a highly biased, over-simplified model will not fit the data set accurately, again resulting in unreliable predictions.

To put it simply, you should avoid building a model that is too complex just to fit every cluster in the data. As far as I can tell, this can be addressed in two different ways. One is regularization: adding a penalty term, scaled by a parameter lambda (λ), to the cost function the model minimizes, which discourages overly large weights. Alternatively, a pre-emptive approach is to preprocess the data so that the effect of clustered data is reduced before it is even fed into the model. I've also come across regularization techniques such as L1 (Lasso) and L2 (Ridge); they haven't been covered in the course material yet, so I'll keep them in the back of my mind as I continue learning.
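
To make the penalty idea concrete, here is a minimal NumPy sketch of a regularized cost function. The squared-error fit term and the L2-style penalty are my own assumptions for illustration; the course may present the formula differently.

```python
import numpy as np

def regularized_cost(w, b, X, y, lam):
    m = len(y)
    predictions = X @ w + b                               # linear model predictions
    fit_term = np.sum((predictions - y) ** 2) / (2 * m)   # how well the model fits the data
    penalty = (lam / (2 * m)) * np.sum(w ** 2)            # penalty on large weights, scaled by lambda
    return fit_term + penalty

# Larger lambda values penalize large weights more heavily (toy data for illustration).
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])
y = np.array([3.0, 2.5, 4.5])
w, b = np.array([0.8, 0.4]), 0.5
print(regularized_cost(w, b, X, y, lam=0.0), regularized_cost(w, b, X, y, lam=10.0))
```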

Scikit-learn and Data Preprocessing:

As discussed in the week-1 update, I've started the scikit-learn crash course by freeCodeCamp on YouTube. I was introduced to GridSearchCV for hyperparameter tuning with cross-validation, which helps in selecting the best model for a given task. During this time, I also explored data preprocessing techniques such as the QuantileTransformer, which spreads out clustered values by mapping features onto a uniform or normal distribution, as well as feature engineering.
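
To tie those pieces together, here is a minimal sketch of a GridSearchCV run over a pipeline that includes a QuantileTransformer step. The dataset, the KNeighborsRegressor model, and the parameter grid are illustrative assumptions on my part, not the exact example from the course.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer

# Dummy regression data standing in for a real dataset.
X, y = make_regression(n_samples=200, n_features=4, noise=10, random_state=0)

# Preprocessing and model chained into a single pipeline.
pipeline = Pipeline([
    ("scale", QuantileTransformer(n_quantiles=100, output_distribution="normal")),
    ("model", KNeighborsRegressor()),
])

# Cross-validated search over the number of neighbors picks the best hyperparameter.
grid = GridSearchCV(pipeline, param_grid={"model__n_neighbors": [1, 3, 5, 10]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```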

To illustrate the concept of feature engineering, let's consider a house price prediction model. The input features might include the length (x1) and the width (x2) of the frontage, as well as the number of floors (x3). Considered individually, these features may not exhibit a significant impact on the price (outcome y). However, by engineering a new feature that combines them (x1 * x2 * x3), we capture how these factors interact to influence property prices. Consequently, this engineered feature often proves more important than each of the three features on its own. In such cases, it may be worthwhile to drop the less impactful original features x1, x2, and x3 to decrease model complexity.
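
As a quick illustration, here is a hedged pandas sketch of that engineered feature. The column names and values are made up for demonstration, and keeping only the combined feature is just one possible choice.

```python
import pandas as pd

# Toy data: frontage length (x1), frontage width (x2), number of floors (x3), price (y).
houses = pd.DataFrame({
    "frontage_length": [10.0, 12.0, 8.0],
    "frontage_width": [20.0, 15.0, 25.0],
    "num_floors": [1, 2, 1],
    "price": [300_000, 420_000, 350_000],
})

# Engineer a single combined feature x1 * x2 * x3 that captures the interaction.
houses["combined_size"] = (
    houses["frontage_length"] * houses["frontage_width"] * houses["num_floors"]
)

# Optionally keep only the engineered feature to reduce model complexity.
features = houses[["combined_size"]]
target = houses["price"]
```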

TensorFlow and Neural Networks:

Later this week, I progressed to the second course in the Machine Learning Specialization, Advanced Learning Algorithms, which focuses on deep learning and neural networks. I learned about the core components and concepts of neural networks, including layers, activations, forward propagation, and multilayer perceptrons (hidden layers). These concepts are essential for understanding how neural networks make inferences and adapt to complex patterns in data.
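
To ground the idea of forward propagation, here is a small NumPy sketch of activations flowing through a hidden layer and then an output layer. The layer sizes, the sigmoid activation, and the random weights are assumptions chosen purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def dense(a_prev, W, b):
    # One dense layer: weighted sum of incoming activations plus a bias, then an activation.
    return sigmoid(a_prev @ W + b)

rng = np.random.default_rng(0)

a0 = np.array([0.5, -1.2, 3.0])                            # input features
a1 = dense(a0, rng.standard_normal((3, 4)), np.zeros(4))   # hidden layer with 4 units
a2 = dense(a1, rng.standard_normal((4, 1)), np.zeros(1))   # output layer with 1 unit
print(a2)                                                   # the network's predicted probability
```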

Additionally, I explored image recognition, for example face recognition, a fascinating application of deep learning techniques. This process essentially boils down to pattern recognition grounded in classification.

Building a neural network in TensorFlow from scratch might sound intimidating, but it's actually a lot simpler than most people think. The process involved the following steps (a minimal sketch follows the list):

  1. Data Preprocessing: Formatting the data into matrices, then normalizing and manipulating it to prepare for model training.

  2. Testing: Instantiating the model and compiling it to define a loss function.

  3. Optimization: Running gradient descent to update the model's weights.

  4. Presentation: Converting the predicted probabilities to decisions using a 0.5 threshold.
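
Here is that workflow as a minimal, hedged sketch using the Keras API in TensorFlow. The dummy data, layer sizes, optimizer, and epoch count are all assumptions for illustration rather than the exact model from the course.

```python
import numpy as np
import tensorflow as tf

# 1. Data preprocessing: feature and label matrices, plus a normalization layer.
X = np.random.rand(200, 3).astype("float32")        # dummy feature matrix
y = (X.sum(axis=1) > 1.5).astype("float32")         # dummy binary labels
norm = tf.keras.layers.Normalization()
norm.adapt(X)

# 2. Model setup: instantiate the model and compile it to define the loss function.
model = tf.keras.Sequential([
    norm,
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# 3. Optimization: gradient descent updates the weights over several epochs.
model.fit(X, y, epochs=20, verbose=0)

# 4. Presentation: convert the predicted probabilities to decisions with a 0.5 threshold.
probs = model.predict(X)
decisions = (probs >= 0.5).astype(int)
```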

CS50:

As I wrapped up the week, I started week 4 of CS50, which covers memory. Although the content in CS50 may not appear directly related to machine learning, I'm determined to complete it to bolster my general understanding of computer science. As I delve deeper into machine learning, I've come to realize that this field encompasses three important aspects:

  1. Data Collection and Optimization

  2. Field Knowledge

  3. Model Architecture

Next week, I will focus on creating a simple ML model with scikit-learn. I've gathered some interesting ideas, but I'm still undecided about what data to use and what predictions to make. Regardless, one thing remains true: the accumulation of knowledge and the ability to think like a computer scientist are the driving forces behind success in this industry.