Machine learning (an overview)

glossing over the many ways a thing can learn
2025-12-18 17:53
// updated 2025-12-19 11:00

Machine learning refers to the gathering, processing and analysis of data by some human-made device, using statistics to find patterns in that data.

Machine learning types

Machine learning approaches generally fall under one of these types:

Supervised learning

  • training a model on previously labelled datasets, so that it can classify new, unseen data or predict an outcome
  • further divided into:
    • classification
      • assigning a label based on inputs
    • regression
      • predicting a numeric quantity based on inputs
      • the predicted quantity draws on other records with similar characteristics
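
A minimal sketch of the difference between the two, using a nearest-neighbours approach as one possible technique among many. The function names and the floor-area data are made up for illustration:

```python
from collections import Counter

def nearest(records, query, k):
    """Return the k labelled records whose input is closest to the query."""
    return sorted(records, key=lambda rec: abs(rec[0] - query))[:k]

def classify(records, query, k=3):
    """Classification: the most common label among the k nearest records."""
    labels = [label for _, label in nearest(records, query, k)]
    return Counter(labels).most_common(1)[0][0]

def regress(records, query, k=3):
    """Regression: the average quantity among the k nearest records."""
    values = [value for _, value in nearest(records, query, k)]
    return sum(values) / len(values)

# Hypothetical labelled data: (floor area in square metres, label or price).
labelled_sizes = [(30, "small"), (35, "small"), (40, "small"),
                  (90, "large"), (100, "large"), (110, "large")]
labelled_prices = [(30, 150.0), (35, 170.0), (40, 200.0),
                   (90, 430.0), (100, 480.0), (110, 520.0)]

print(classify(labelled_sizes, 95))   # a label for an unseen 95 m^2 home
print(regress(labelled_prices, 95))   # a numeric estimate for the same home
```

Both functions look at the same "similar records"; only the kind of answer (a label versus a number) differs.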

Unsupervised learning

  • training a model to classify or group data without the benefit of previously-labelled data
  • further divided into:
    • clustering
      • finding groups of similar or similarly-positioned inputs
    • dimensionality reduction
      • reducing columns (but not rows) of data
        • e.g. for real estate prices, remove duplicate features such as "area in square metres" if another column has "area in square feet"
    • association rule learning
      • finding cases where one variable tends to be a good predictor of another
        • e.g. "customers who bought X also bought Y"
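
The "customers who bought X also bought Y" idea can be sketched with two common measures, support and confidence. The basket data and function names below are assumptions for illustration only:

```python
def support(baskets, items):
    """Fraction of all baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets if items <= basket) / len(baskets)

def confidence(baskets, x, y):
    """Of the baskets containing x, the fraction that also contain y."""
    with_x = [basket for basket in baskets if x in basket]
    if not with_x:
        return 0.0  # x never appears, so there is nothing to measure
    return sum(1 for basket in with_x if y in basket) / len(with_x)

# Hypothetical purchase records, one set of items per customer.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"milk"},
]

print(support(baskets, {"bread", "butter"}))   # how often both appear together
print(confidence(baskets, "bread", "butter"))  # "bought bread -> also bought butter"
print(confidence(baskets, "butter", "bread"))  # the rule in the other direction
```

Note that the rule is not symmetric: here every butter buyer also bought bread, but not every bread buyer bought butter.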

Reinforcement learning

  • a device will interact with an environment and adjust based on feedback (either via "rewards" [positive] or "punishment" [negative])
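
A very small sketch of "adjust based on feedback", assuming a toy environment with two actions and fixed rewards. The names `run_agent` and `rewards`, and the update rule with a learning rate, are illustrative choices rather than a specific algorithm from this post:

```python
def run_agent(reward_for, actions, steps=10, learning_rate=0.5):
    """Keep a running reward estimate per action and favour the best one."""
    estimates = {action: 0.0 for action in actions}
    for step in range(steps):
        if step < len(actions):
            action = actions[step]  # try every action once first
        else:
            action = max(actions, key=lambda a: estimates[a])  # then exploit
        reward = reward_for(action)  # feedback from the environment
        # Nudge the estimate towards the reward that was just observed.
        estimates[action] += learning_rate * (reward - estimates[action])
    return estimates

# Hypothetical environment: "left" is punished, "right" is rewarded.
rewards = {"left": -1.0, "right": 1.0}
estimates = run_agent(rewards.get, ["left", "right"])
print(estimates)
```

After trying both actions once, the agent keeps choosing "right" because its estimate is higher.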

Semi-supervised learning

  • a mix of supervised and unsupervised learning (in which some of the data has labels, while the rest does not)

Deep learning

  • convolutional neural networks
    • recognize spatial patterns
    • great for image-based ("spatially-oriented") data
  • recurrent neural networks
    • recognize patterns in sequential data
    • great for text and speech ("temporally-oriented") data

Machine learning process

Machine learning happens in a series of steps known as a machine learning pipeline. Many versions of this pipeline exist, and the steps and their order may vary, but most versions follow some form of the "input + processing + output" shape:

Data collection

  • gather data from reliable sources

Data processing

  • preparation and transformation of the data before it reaches the model
  • check for missing or incorrect values
  • check for correct data types
  • feature scaling
    • transform (or "normalize") the values to a more interpretable range
      • e.g. transform from a range of "-337.28 to 5828.91", to "0 to 1"
    • this puts values measured on different scales on a comparable footing, which helps both humans and computers more easily distinguish between low and high values
    • an optional step if the data values already make sense to everyone using the data
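
As a rough sketch of the "0 to 1" transformation above, here is min-max scaling in plain Python. The function name and the two middle sample values are assumptions; only the endpoints come from the example range:

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no range to spread out.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

raw = [-337.28, 0.0, 2745.815, 5828.91]
scaled = min_max_scale(raw)
print(scaled)  # first value becomes 0.0, last becomes 1.0
```

Min-max scaling is only one option; other schemes (such as centring on the mean) exist and may suit some data better.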

Model training

  • split data into "training data" and "testing data"
    • ~80% for training and ~20% for testing
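
A minimal sketch of an 80/20 split, assuming the records fit in a list and that a random shuffle is acceptable. The helper name, the seed and the sample records are made up for illustration:

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows, then cut it into training and testing parts."""
    rows = list(rows)          # copy, so the caller's data is left untouched
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    rng.shuffle(rows)
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

# Hypothetical records: (x, y) pairs.
records = [(x, 2 * x + 1) for x in range(10)]
train, test = train_test_split(records)
print(len(train), len(test))
```

Shuffling first avoids a split that depends on the order in which the data happened to be collected.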

Model building

  • formula derivation with data visualization
    • use the training data to create a model (such as a formula)
      • e.g. an equation in the form of y = mx + b or y = m1x1 + m2x2 + ... + mnxn + b (or whatever linear or non-linear equation fits)
    • graph the model if possible and/or desired
    • model types include:
      • k-nearest neighbours
      • decision trees
      • random forest
      • boosting
      • support vector machines
      • neural networks
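
For the simplest case above, y = mx + b, the "formula derivation" could look like the following least-squares sketch. The function name and the training values are assumptions for illustration:

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: return slope m and intercept b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, divided by how much x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

# Hypothetical training data, roughly following y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
m, b = fit_line(xs, ys)
print(f"y = {m:.2f}x + {b:.2f}")
```

The other model types in the list above produce something other than a single equation, but the idea of "training data in, model out" stays the same.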

Model testing

  • formula validation
    • use the testing data to test the model against records it did not see during training
    • plug the independent variables (x) of the testing data into the model and compare its predictions (y_pred) with the actual dependent variable (y_test) of the data
    • use Python libraries and metrics such as:
      • mean absolute error (MAE)
      • mean squared error (MSE)
      • root mean squared error (RMSE)
    • see which x (or combination of x's) has the error metrics closest to 0
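
The three metrics above are short enough to write by hand. This sketch uses plain Python rather than any particular library, and the y_test and y_pred values are made up:

```python
import math

def mean_absolute_error(y_test, y_pred):
    """Average size of the errors, ignoring their sign."""
    return sum(abs(t - p) for t, p in zip(y_test, y_pred)) / len(y_test)

def mean_squared_error(y_test, y_pred):
    """Average of the squared errors, which penalizes large misses more."""
    return sum((t - p) ** 2 for t, p in zip(y_test, y_pred)) / len(y_test)

def root_mean_squared_error(y_test, y_pred):
    """Square root of MSE, which brings the error back to the units of y."""
    return math.sqrt(mean_squared_error(y_test, y_pred))

# Hypothetical actual values and model predictions.
y_test = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 9.5]

print(mean_absolute_error(y_test, y_pred))
print(mean_squared_error(y_test, y_pred))
print(root_mean_squared_error(y_test, y_pred))
```

All three are 0 for a perfect prediction, so lower is better in each case.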

Backward elimination

  • noisy data removal
    • for multiple regression models, remove any variable(s) that would cause the model to classify or predict poorly
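
One way to sketch this is to start with every variable and keep dropping whichever one most improves the error on held-out data, stopping when no removal helps. Note the assumptions: this variant judges variables by held-out error (classical statistical versions often use significance tests instead), and all names and data below are hypothetical:

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - known) / M[i][i]
    return x

def fit(X, y, columns):
    """Least-squares coefficients for the chosen columns; the last one is b."""
    rows = [[row[c] for c in columns] + [1.0] for row in X]
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(XtX, Xty)

def held_out_mse(X_train, y_train, X_val, y_val, columns):
    """Fit on the training rows, then measure MSE on the held-out rows."""
    coefs = fit(X_train, y_train, columns)
    preds = [sum(w * row[c] for w, c in zip(coefs, columns)) + coefs[-1]
             for row in X_val]
    return sum((p - t) ** 2 for p, t in zip(preds, y_val)) / len(y_val)

def backward_eliminate(X_train, y_train, X_val, y_val):
    """Drop one variable at a time for as long as the held-out error improves."""
    kept = list(range(len(X_train[0])))
    best = held_out_mse(X_train, y_train, X_val, y_val, kept)
    while len(kept) > 1:
        trials = [(held_out_mse(X_train, y_train, X_val, y_val,
                                [c for c in kept if c != drop]), drop)
                  for drop in kept]
        error, drop = min(trials)
        if error >= best:
            break  # no single removal helps any more
        kept.remove(drop)
        best = error
    return kept, best

# Hypothetical data: y depends on columns 0 and 1; column 2 is unrelated noise.
X_train = [[1.0, 2.0, 5.0], [2.0, 1.0, 3.0], [3.0, 4.0, 8.0],
           [4.0, 3.0, 1.0], [5.0, 6.0, 9.0], [6.0, 5.0, 2.0]]
y_train = [7.2, 7.9, 17.1, 17.8, 27.1, 27.9]
X_val = [[7.0, 8.0, 4.0], [8.0, 7.0, 7.0], [9.0, 10.0, 2.0]]
y_val = [37.1, 37.9, 47.0]

kept, best = backward_eliminate(X_train, y_train, X_val, y_val)
print("columns kept:", kept, "held-out MSE:", best)
```

With so few rows, the outcome depends heavily on the particular data, so treat the result as a demonstration of the loop rather than evidence about any variable.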

Model deployment

  • taking that model
    • to the internet
    • to an intranet
    • to some private users for their own use cases