Machine learning (an overview)

glossing over the many ways a thing can learn

2025-12-18 17:53
// updated 2025-12-19 11:00

Machine learning refers to the gathering, processing and analyzing of data by some human-made device, which learns from the statistics and patterns in that data rather than from hand-written rules.

Machine learning types

Machine learning approaches commonly fall under:

Supervised learning

training a model on previously labelled datasets, so that it can classify new, unseen data or predict an outcome
further divided into:
- classification
  - assigning a label based on inputs
- regression
  - predicting a numeric quantity based on inputs
  - e.g. estimating that quantity from other records with similar characteristics
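
As a rough illustration of both bullets, here is a minimal sketch of k-nearest neighbours, which labels (or prices) a new record from the labelled records most similar to it. The toy homes, labels and prices are made up for this example, and `k = 3` is an arbitrary choice.

```python
from collections import Counter
from math import dist


def k_nearest(train_x, query, k):
    """Return the indices of the k training points closest to query."""
    order = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    return order[:k]


def knn_classify(train_x, train_y, query, k=3):
    """Classification: majority label among the k nearest neighbours."""
    labels = [train_y[i] for i in k_nearest(train_x, query, k)]
    return Counter(labels).most_common(1)[0][0]


def knn_regress(train_x, train_y, query, k=3):
    """Regression: mean target value of the k nearest neighbours."""
    values = [train_y[i] for i in k_nearest(train_x, query, k)]
    return sum(values) / len(values)


# Hypothetical labelled data: (area in square metres, number of rooms)
homes = [(30, 1), (35, 1), (40, 2), (90, 4), (100, 4), (120, 5)]
kinds = ["flat", "flat", "flat", "house", "house", "house"]  # class labels
prices = [100, 110, 130, 300, 330, 400]                      # in thousands

new_home = (95, 4)
print(knn_classify(homes, kinds, new_home))  # label shared by most of the 3 nearest
print(knn_regress(homes, prices, new_home))  # mean price of the 3 nearest
```

The same "find similar records" step serves both tasks; only the final step differs (a vote for classification, an average for regression).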

Unsupervised learning

training a model to find structure in, or group, data without the benefit of previously labelled data
further divided into:
- clustering
  - finding groups of similar or similarly-positioned inputs
- dimensionality reduction
  - reducing the number of columns (features), but not rows (records), of data
    - e.g. for real estate prices, remove duplicate features such as "area in square metres" if another column has "area in square feet"
- association rule learning
  - finding cases where one variable tends to be a good predictor of another
    - e.g. "customers who bought X also bought Y"
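
The "customers who bought X also bought Y" idea can be measured with two simple ratios, usually called support and confidence. The sketch below assumes a made-up purchase history and computes both by hand.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(baskets, x, y):
    """Of the baskets containing x, the fraction that also contain y."""
    return support(baskets, {x, y}) / support(baskets, {x})


# Hypothetical purchase history, one set of items per customer
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"milk", "jam"},
]

print(support(baskets, {"bread", "butter"}))   # 2 of the 4 baskets
print(confidence(baskets, "bread", "butter"))  # 2 of the 3 bread buyers
```

No labels are involved: the rule emerges from how often the items appear together.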

Reinforcement learning

a device (or "agent") interacts with an environment and adjusts its behaviour based on feedback (either "rewards" [positive] or "punishments" [negative])
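
One small way to picture this loop is an "epsilon-greedy" agent choosing between two actions. This is only a sketch: the environment (fixed feedback of -1 and +1), the step count and the exploration rate `epsilon` are all assumptions chosen for illustration.

```python
import random


def run_bandit(rewards, steps=1000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: mostly pick the best-known action, sometimes explore."""
    rng = random.Random(seed)
    estimates = [0.0] * len(rewards)  # agent's current value estimate per action
    counts = [0] * len(rewards)       # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(rewards))      # explore at random
        else:
            action = estimates.index(max(estimates))  # exploit best estimate
        feedback = rewards[action]                    # environment responds
        counts[action] += 1
        # Move the estimate toward the feedback (incremental average)
        estimates[action] += (feedback - estimates[action]) / counts[action]
    return estimates, counts


# Hypothetical environment: action 0 is punished (-1), action 1 is rewarded (+1)
estimates, counts = run_bandit(rewards=[-1.0, 1.0])
print(estimates, counts)
```

After enough steps the agent should choose the rewarded action far more often than the punished one, without ever having been told which is which.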

Semi-supervised learning

a mix of supervised and unsupervised learning, in which some of the data has labels while the rest does not
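
A minimal sketch of one such approach, often called pseudo-labelling: use the few labelled points to guess labels for the unlabelled ones, then treat the combined set as training data. Here the "model" is simply the nearest labelled point, which is an assumption made to keep the example short, and the points themselves are invented.

```python
from math import dist


def pseudo_label(labelled, unlabelled):
    """Give each unlabelled point the label of its nearest labelled point."""
    result = []
    for point in unlabelled:
        _, nearest_label = min(labelled, key=lambda pair: dist(pair[0], point))
        result.append((point, nearest_label))
    return result


# Hypothetical data: two labelled points, three without labels
labelled = [((1.0, 1.0), "A"), ((9.0, 9.0), "B")]
unlabelled = [(2.0, 1.5), (8.0, 9.5), (1.5, 0.5)]

expanded = labelled + pseudo_label(labelled, unlabelled)
print(expanded)
```

In practice the guessed labels can be wrong, so they are usually kept only when the model is confident about them.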

Deep learning

training neural networks with many layers; common architectures include:
- convolutional neural networks
  - recognize spatial patterns
  - great for image-based ("spatially-oriented") data
- recurrent neural networks
  - recognize patterns in sequential data
  - great for text and speech ("temporally-oriented") data
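
To make "spatial patterns" a little more concrete, the sketch below shows the sliding-window operation at the heart of a convolutional layer. A real network learns its kernels during training; the kernel and the tiny 4x4 image here are hand-picked assumptions. (As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is a cross-correlation.)

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Hypothetical greyscale image: dark (0) on the left, bright (1) on the right
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked kernel that responds where brightness rises from left to right
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

print(convolve2d(image, vertical_edge))  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The non-zero middle column marks where the dark-to-bright edge sits, regardless of which row it is in.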

Machine learning process

Machine learning happens in a sequence of steps, also known as a machine learning pipeline. Many versions of this pipeline exist, and the steps and their order may vary, but most versions follow some form of the "input + processing + output" shape:

Data collection

gather data from reliable sources

Data processing

preparing and transforming the data before it reaches the model
check for missing or incorrect values
check for correct data types
feature scaling
- transform (or "normalize") the values to a more interpretable range
  - e.g. transform from a range of "-337.28 to 5828.91", to "0 to 1"
- this helps both humans and computers more easily compare low and high values, and keeps a feature with a large range from outweighing one with a small range
- an optional step if the data values already make sense to everyone using the data
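
One common way to do this kind of scaling is min-max normalization. The sketch below is a hand-written version using the range from the example above; the two middle values in `raw` are made up.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


raw = [-337.28, 0.0, 2745.815, 5828.91]  # hypothetical column of values
scaled = min_max_scale(raw)
print(scaled)  # first value becomes 0.0, last becomes 1.0
```

The order and relative spacing of the values are preserved; only the range changes.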

Model training

split data into "training data" and "testing data"
- ~80% for training and ~20% for testing
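
A minimal sketch of such a split, assuming the records sit in a plain list. The helper name `split_rows` and the fixed `seed` are choices made for this example; libraries such as scikit-learn offer their own splitting utilities.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows, then cut it into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # shuffle so the split is not ordered
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for 10 records
train, test = split_rows(rows)
print(len(train), len(test))  # 8 2
```

Shuffling first matters when the data arrives sorted (by date or by price, say), since an unshuffled cut would give the model a lopsided view.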

Model building

formula derivation with data visualization
- use the training data to create a model (such as a formula)
  - e.g. an equation in the form of y = mx + b or y = m1x1 + m2x2 + ... + mnxn + b (or some other linear or non-linear equation)
- graph the model if possible and/or desired
- model types include:
  - k-nearest neighbours
  - decision trees
  - random forest
  - boosting
  - support vector machines
  - neural networks
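
For the simplest case, y = mx + b, "creating a model" means finding the m and b that best fit the training data. Below is a sketch using ordinary least squares on a single input variable; the training values are invented and lie roughly along y = 2x + 1.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b with a single input variable."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx            # slope
    b = mean_y - m * mean_x  # intercept
    return m, b


x_train = [1, 2, 3, 4, 5]
y_train = [3.1, 4.9, 7.2, 8.8, 11.0]  # hypothetical, roughly y = 2x + 1

m, b = fit_line(x_train, y_train)
print(f"y = {m:.2f}x + {b:.2f}")  # y = 1.97x + 1.09
```

The other model types listed above replace this formula with a different structure, but the idea of fitting parameters to training data stays the same.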

Model testing

formula validation
- use the testing data to test the model against data it has not seen during training
- plug the independent variables (x) of the testing data into the model to compare its prediction (y_pred) with the actual dependent variable (y_test) of the data
- use Python libraries and metrics such as:
  - mean absolute error (MAE)
  - mean squared error (MSE)
  - root mean squared error (RMSE)
- see which x (or combination of x's) gives error metrics closest to 0
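
The three metrics are short enough to write by hand, which can help show what they measure. This is a plain-Python sketch with made-up `y_test` and `y_pred` values; libraries such as scikit-learn provide equivalents in `sklearn.metrics`.

```python
from math import sqrt


def mae(y_test, y_pred):
    """Mean absolute error: average size of the misses."""
    return sum(abs(a - p) for a, p in zip(y_test, y_pred)) / len(y_test)


def mse(y_test, y_pred):
    """Mean squared error: average squared miss (punishes big misses more)."""
    return sum((a - p) ** 2 for a, p in zip(y_test, y_pred)) / len(y_test)


def rmse(y_test, y_pred):
    """Root mean squared error: MSE brought back to the units of y."""
    return sqrt(mse(y_test, y_pred))


y_test = [3.0, 5.0, 7.0, 9.0]   # hypothetical actual values
y_pred = [2.0, 5.0, 8.0, 11.0]  # hypothetical model predictions

print(mae(y_test, y_pred))   # 1.0
print(mse(y_test, y_pred))   # 1.5
print(rmse(y_test, y_pred))
```

All three equal 0 only when every prediction matches the actual value exactly.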

Backward elimination

noisy variable removal
- for multiple regression models, start with all the variables, then remove, one at a time, any variable(s) that contribute little or cause the model to predict poorly
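
A rough sketch of the idea follows. Textbook backward elimination usually drops variables by statistical significance (p-values); this version swaps in a simpler rule and drops a variable whenever the error on held-out data does not get worse. The data, the `tolerance` and all function names are assumptions made for illustration.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit(rows, ys, features):
    """Least-squares weights for the chosen feature columns plus an intercept."""
    design = [[row[f] for f in features] + [1.0] for row in rows]
    k = len(features) + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


def holdout_mse(weights, rows, ys, features):
    """Mean squared error of the fitted model on held-out rows."""
    preds = [sum(w * row[f] for w, f in zip(weights, features)) + weights[-1]
             for row in rows]
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def backward_eliminate(train, y_train, held, y_held, tolerance=1e-6):
    """Start with every feature; drop one whenever held-out error does not worsen."""
    features = list(range(len(train[0])))
    best = holdout_mse(fit(train, y_train, features), held, y_held, features)
    improved = True
    while improved and len(features) > 1:
        improved = False
        for f in features:
            trial = [g for g in features if g != f]
            err = holdout_mse(fit(train, y_train, trial), held, y_held, trial)
            if err <= best + tolerance:
                features, best, improved = trial, err, True
                break
    return features


# Hypothetical data: y depends only on column 0 (y = 2*x0 + 1); column 1 is noise
train = [(1, 5), (2, 3), (3, 8), (4, 1), (5, 6), (6, 2)]
y_train = [3, 5, 7, 9, 11, 13]
held = [(7, 4), (8, 7)]
y_held = [15, 17]

print(backward_eliminate(train, y_train, held, y_held))  # [0]
```

Only the informative column survives, since removing the noisy one costs nothing in accuracy while removing the useful one makes the predictions much worse.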

Model deployment

taking that model
- to the internet
- to an intranet
- to some private users for their own use cases
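
Whichever audience the model goes to, it usually has to be saved in some form that another program can load. As a minimal sketch, assuming the model is just the m and b of y = mx + b, the parameters can be written to a JSON file and read back; the file name and location are hypothetical.

```python
import json
import os
import tempfile


def save_model(path, m, b):
    """Store the fitted parameters of y = m*x + b as JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"m": m, "b": b}, fh)


def load_model(path):
    """Rebuild a prediction function from the stored parameters."""
    with open(path, encoding="utf-8") as fh:
        params = json.load(fh)
    return lambda x: params["m"] * x + params["b"]


path = os.path.join(tempfile.gettempdir(), "line_model.json")  # hypothetical location
save_model(path, m=2.0, b=1.0)

predict = load_model(path)
print(predict(4))  # 9.0
```

A web service or internal tool would then only need the saved file and the loading code, not the training data.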

