Applied Machine Learning (Tillämpad maskininlärning)

Ludde127 2025-01-06

1. Intro to Machine Learning

  • What are the key differences between supervised and unsupervised learning?
    • In supervised learning you have the target labels and can thus try to optimize the model to predict the correct label from the data. In unsupervised learning you only train on the data and try to find patterns.
  • What is overfitting and underfitting, and how would you detect them?
    • Overfitting is when the model fits the training data too closely (for example by being trained too long or being too complex), so it no longer generalizes and predicts poorly on new data. Underfitting is when the model has not learned the patterns in the data, because it is too simple or not trained enough. Overfitting is detected by the model performing clearly worse on validation than on training data. Underfitting is detected by poor performance on train, validation, and test alike.
  • What is cross-validation? Why is it useful?
    • You split the dataset into folds, hold out one fold as validation and train on the rest, and repeat until every fold has been held out once; the performance is then averaged over the folds. It is used to get a more reliable performance estimate and to choose hyperparameters and model architecture.
  • Explain how the k-Nearest Neighbors (kNN) algorithm works. Is it supervised or unsupervised?
    • It is supervised and works by finding the k nearest neighbours of a given sample; for classification it takes the majority vote of those neighbours, and for regression the mean of their labels. The votes are often weighted by 1/d, where d is the distance to the neighbour.
  • What are the steps of the k-means clustering algorithm? Is it supervised or unsupervised?
    • The k-means clustering algorithm is unsupervised. It is trained on unlabeled data to group it into k clusters. When used for classification, the prediction is the (majority) label of the cluster the sample falls into; for regression, the mean of the cluster, like in kNN.
    • It is trained by first initializing k cluster centers in some way, for example at random. Each data point is then assigned to the cluster with the closest center. When all points have been assigned, new centers are computed by averaging the points in each cluster. This is repeated until the centroids no longer change.
  • What is precision, recall, and F1-score?
    • Precision is the fraction of positive predictions that are actually positive: TP / (TP + FP)
    • Recall is the fraction of actual positives that are found: TP / (TP + FN)
    • The F1 score is the harmonic mean of precision and recall.
    • Give an example where a test set has high precision but low recall.
      • A spam filter which is careful not to produce false positives (legitimate mail marked as spam) and therefore lets some spam through (false negatives).
    • Give an example where a test set has high recall but low precision.
      • A medical screening test which catches nearly all true cases (few false negatives) but also flags many healthy patients (many false positives).
  • What is homogeneity and completeness? When are they used?
    • Homogeneity measures whether each cluster contains only samples with the same label.
    • Completeness measures whether all samples with a given label end up in the same cluster.
    • Both are used to evaluate a clustering against known ground-truth labels.
  • How can you use a clustering algorithm for classification?
    • It can be used by taking the label of the cluster the sample is closest to (the label of the cluster is the majority voted label), or by taking the weighted majority vote of the closest n clusters.
  • What loss functions can you use for multiclass classification? What loss functions can you use for regression?
    • For regression you could use MSE. For multiclass classification you could use cross-entropy loss.
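
A minimal PyTorch sketch of the two losses (the tensors are made-up examples; PyTorch is assumed since it is used in the CNN section later):

import torch
import torch.nn as nn

# Regression: mean squared error between predictions and targets
mse = nn.MSELoss()
preds = torch.tensor([2.5, 0.0, 2.0])
targets = torch.tensor([3.0, -0.5, 2.0])
print(mse(preds, targets))

# Multiclass classification: cross-entropy between raw logits and class indices
ce = nn.CrossEntropyLoss()
logits = torch.tensor([[2.0, 0.5, 0.1],   # one row of logits per sample
                       [0.2, 1.5, 0.3]])
labels = torch.tensor([0, 1])             # correct class index for each sample
print(ce(logits, labels))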

2. Math for ML Part 1 (Probability, Distributions, Information Theory)

  • What is the general product rule (also called chain rule) for probability, describing the relationship between the joint probability, conditional probability, and marginal probability?
    • P(A and B) = P(B | A) * P(A)
  • What is the formula for Bayes’ theorem? (This can be derived from the product rule.)
    • P(A | B) = P(B | A) * P(A) / P(B)
  • What is the likelihood and the posterior?
    • In Bayes' theorem above, the posterior is P(A | B) and the likelihood is P(B | A).
    • Posterior P(A | B): this is what we want to find out, the updated probability of A after observing B. It reflects our updated belief based on the new evidence B.
    • Likelihood P(B | A): this measures how likely it is to observe B given that A is true. It tells us how well A explains the observed evidence B.
  • What is entropy? For what type of probability distribution is entropy highest?
    • Entropy is a measure of the uncertainty or randomness in a probability distribution. It quantifies the average amount of information required to describe or predict the outcome of a random variable.
    • A uniform distribution, where all outcomes are equally likely, has the highest entropy (a small sketch after this list illustrates this).
  • Refer to Exam 2024, Question 2.3 for an example of how to calculate cross-entropy.
  • Refer to Lab 2 for examples of how to calculate mutual information.
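
A small numpy sketch (made-up distributions, not from the course material) illustrating that the uniform distribution has the highest entropy:

import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                    # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 outcomes -> 2.0 bits (maximum)
print(entropy([0.7, 0.1, 0.1, 0.1]))      # skewed -> lower entropy
print(entropy([1.0, 0.0, 0.0, 0.0]))      # deterministic -> 0 bits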

3. Math for ML Part 2 (MLE, Linear Regression)

  • What is Maximum Likelihood Estimation (MLE)?
    • Maximum likelihood estimation is a method used to estimate the parameters of some probability distribution by setting them to be the most probable based on the data.
  • What is Maximum a Posteriori (MAP) estimation?
    • MAP estimates a parameter by maximizing the posterior, i.e. it combines the likelihood of the data with prior knowledge about the parameter. It is useful for generating better estimates when you have limited data, since the prior regularizes the result.
  • What is regularization?
    • Regularization is used to prevent overfitting by adding a penalty to the model so that it limits complexity. It can for example favor smaller parameters (using L1 or L2), or dropout in NN.
  • What assumption is made about the data in linear regression?
    • It is assumed that the relationship between the features and the target is roughly linear, so that it can be approximated by a linear function. It is also assumed that the observations are mutually independent (otherwise extra information could be utilized).
  • What is Ridge regression?
    • Ridge regression is linear regression with L2 regularization; the training loss is modified to also include the squared sum of the weights, scaled by a factor alpha.
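
A minimal sklearn sketch (random made-up data) comparing ordinary linear regression with Ridge regression:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -2.0]) + rng.normal(scale=0.1, size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha scales the L2 penalty on the weights

print(ols.coef_)    # unregularized weights
print(ridge.coef_)  # shrunk towards zero by the L2 penalty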

4. Decision Trees and Regression Trees

  • Explain the concept of a decision tree classifier and how it splits data. How is information gain or Gini impurity used in building decision trees?
    • It builds a tree that splits the data into branches based on feature values. It is easy to see how it “thinks”, it often performs well, and it is fast to train. A decision tree classifier chooses, for a single feature at a time, the split that maximizes information gain. Information gain is the decrease in entropy caused by a split. If you instead use Gini impurity you choose the split with the lowest impurity; Gini otherwise behaves quite similarly to entropy and information gain.
  • Explain how a regression tree calculates the split and the final value.
    • A regression tree calculates the split by evaluating different threshold values for a feature and selecting the one that minimizes a given objective function, such as Mean Squared Error (MSE), to reduce the variance within the resulting subsets. When the tree reaches a leaf node, the predicted value is the mean of the target values for the data points that fall into that node. The splitting process is aimed at minimizing MSE by choosing the best feature and threshold that results in the most homogeneous groups with respect to the target variable.
  • What are the advantages and disadvantages of regression trees compared to linear regression?
    • Advantages: It is easy to understand how it works. It can make more complex decisions when predicting as it can capture non-linear relationships. Does not need feature scaling as it works on each feature independently.
    • Disadvantages: It can only predict a finite set of values (the leaf means), so its output is step-wise rather than continuous. It will not give exact results for problems that are inherently linear. It is prone to overfitting if allowed to grow deep, it is sensitive to small changes in the data, and it is comparatively computationally expensive.
  • Compare CART and ID3.
    • CART splits by minimizing Gini impurity (or entropy) for classification and MSE for regression; ID3 splits by maximizing information gain.
    • CART can be used for both classification and regression, while ID3 only works on discrete (categorical) features and is thus better suited for classification where the output does not need to be very granular.
    • CART builds binary trees (each internal node has exactly 2 children). ID3 can split a node into multiple branches: a feature with multiple categories can split into as many branches as there are categories.
  • How is overfitting prevented in a decision tree?
    • The tree is limited in size by setting max depths, min samples per leaf/branch or a maximum of nodes. Or by only splitting if the gini or MSE is decreased by a certain amount.
    • It is also possible to prune after training based on data. The nodes which do not greatly increase the error when removed will be pruned to decrease the complexity.
  • How do you handle missing values in a decision tree?
    • You can remove rows with missing values; use surrogate splits, where the node instead splits on another, correlated feature; or impute the missing values with, for example, the feature mean or the most common value.
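
A small sklearn sketch (made-up data with one missing value) of mean imputation followed by a depth-limited decision tree, combining the two points above:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],   # missing value in the first feature
              [4.0, 5.0],
              [5.0, 1.0]])
y = np.array([0, 0, 1, 1])

X_filled = SimpleImputer(strategy="mean").fit_transform(X)  # replace NaN with the column mean

# max_depth and min_samples_leaf limit the tree size to reduce overfitting
clf = DecisionTreeClassifier(max_depth=2, min_samples_leaf=1)
clf.fit(X_filled, y)
print(clf.predict([[3.0, 3.0]]))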

5. Random Forests and Ensemble Methods

  • What is the key idea behind ensemble methods in ML?
    • The key idea is that many models combined make better decisions than a single model alone. This decreases the risk of overfitting and improves accuracy on unseen data.
  • How do random forests improve the performance of decision trees?
    • By training many trees on random bootstrap samples of the data, where each tree also only considers a random subset of the features. This creates many smaller, more specialized and decorrelated trees which vote on the result.
  • How does a random forest perform classification or regression? (e.g., voting, weighted average, etc.)
    • For classification majority votes are common, for regression it is common to use an average of the values.
  • What is AdaBoost, and how does it differ from random forests?
    • AdaBoost puts weights on the training samples and trains a sequence of weak classifiers, where each classifier focuses on the samples the previous ones got wrong; the weights are updated so that the ensemble gets better at what it is currently worst at.
    • AdaBoost starts by training a weak classifier, one that is only slightly better than random guessing, on the training data. The trained classifier is then run on the training data again and the sample weights are updated: misclassified samples get higher weight.
    • New weak classifiers are created and trained this way until some stopping condition is met.
    • The weak classifiers are combined using a weighted majority vote.
    • Unlike random forests, where the trees are trained independently and in parallel, the AdaBoost classifiers are trained sequentially and depend on the errors of the previous ones.
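
A minimal sklearn sketch (synthetic data) fitting both ensemble types for comparison:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Random forest: decorrelated trees trained independently, combined by majority vote
rf = RandomForestClassifier(n_estimators=100, random_state=0)
# AdaBoost: weak learners trained sequentially, reweighting misclassified samples
ada = AdaBoostClassifier(n_estimators=100, random_state=0)

print(cross_val_score(rf, X, y, cv=5).mean())
print(cross_val_score(ada, X, y, cv=5).mean())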

6. Bayesian Classifiers

  • What is the key assumption about features in the dataset in a Naive Bayes classifier?
    • They are assumed to be conditionally independent
  • CHECK CANVAS FOR THE QUESTION!
    • 0.24/0.52
  • Explain how Naïve Bayes calculates the probability of a given class C given a new data point. See formula in lecture slides
    • For each class it takes the prior probability of the class (how common it is in the training data) and multiplies it with the product, over all features, of the probability of the observed feature value given that class. The class with the highest resulting value is selected, i.e. the class that is most likely given the previously seen data.
  • How does Naive Bayes handle missing data? Why is missing data problematic?
    • Missing data is problematic because the class score is a product of feature probabilities, so a missing value cannot contribute a probability (and treating it as zero would zero out the whole product). It is handled by skipping that feature for the sample, which is fine since the features are assumed to be conditionally independent.
  • What is a Gaussian Mixture Model (GMM)? How is it defined?
    • It is a probabilistic model that assumes that all the data points are generated from a mixture of a finite number of gaussian distributions which each have unknown parameters. It is similar to k-means clustering but incorporates information about covariance structures of the data
  • What is GMM used for? (e.g., clustering such as customer segmentation in marketing.)
    • It is used for clustering data in a similar way to k-means, but it does not assume the features to be independent within a cluster. It can be used when you want to cluster data with correlated features, such as marketing data where “visited page” is correlated with “asked about page”. It is useful in cases where the data is well modeled by a mixture of Gaussians.
  • What is the expectation-maximization algorithm? How is it used for k-means? How is it used for GMMs?
    • It is an iterative algorithm with two steps, the expectation (E) step and the maximization (M) step, which are repeated until convergence.
      • In the E step you calculate the expected value of the latent variables based on the current estimates of the parameters.
      • In the M step you maximize the likelihood of the observed data, treating the computed variables from E as observed data.
    • This is used in k-means where the E step can be seen as the step where points are assigned to the closest cluster centers, where the cluster centers are the current parameter estimates. The M step is then where you recalculate the centroids by averaging the new cluster points.
    • In GMMs, the E step computes, for each sample, the probability of belonging to each Gaussian component given the current parameters (the responsibilities); the M step then updates the component means, covariances, and weights to maximize the likelihood of the data given those responsibilities.
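
A minimal sklearn sketch (synthetic blobs); GaussianMixture is fitted with the EM algorithm described above:

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Each component has its own mean and full covariance matrix
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)        # hard cluster assignments
probs = gmm.predict_proba(X[:5])   # soft assignments (responsibilities from the E step)

print(labels[:10])
print(probs.round(2))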

7. Logistic Regression

  • How does logistic regression differ from linear regression?
    • Logistic regression is a linear model with a sigmoid applied to the output, so it outputs a value in (0, 1) instead of an arbitrary real value. It is trained by maximizing the log-likelihood (minimizing cross-entropy) rather than minimizing squared error.
  • Explain the sigmoid function and its role in logistic regression.
    • The sigmoid function is used when outputting values and makes sure that the values are between 0 and 1. This makes the output useful in tasks featuring probabilities.
  • What does the output from logistic regression represent, and how is it interpreted?
    • It represents a probability. It can be used to find the probability of being of a certain class.
  • How is it used for classification? (e.g., threshold function.)
    • It can be used with a threshold function: outputs above the threshold (commonly 0.5, or for example 0.95 if false positives must be avoided) are classified as positive.
  • How can logistic regression be represented as a neural network?
    • By a linear layer with only one neuron and then a sigmoid activation function.
  • How are the parameters for logistic regression obtained?
    • They are obtained by maximizing log-likelihood of the parameters given the observed data.
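
A sketch (made-up data) of logistic regression written as the one-neuron network described above, trained by minimizing binary cross-entropy (the negative log-likelihood):

import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).float().unsqueeze(1)       # toy labels

model = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())   # linear layer + sigmoid = logistic regression
loss_fn = nn.BCELoss()                                  # negative log-likelihood for a Bernoulli output
opt = torch.optim.SGD(model.parameters(), lr=0.5)

for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()   # gradients via backpropagation
    opt.step()        # gradient descent step

print(model(torch.tensor([[2.0, 1.0]])))  # estimated probability of the positive class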

8. Feedforward Networks

  • Describe the components of a neural network (e.g., input layer, hidden layers, activation functions, output layers).
    • The input layer is the first layer of the network and receives the data. Hidden layers are the layers between the input and output layers. Activation functions are commonly applied between layers to introduce non-linearity; they can also be applied after the output layer, for example a sigmoid to squeeze the output into the range 0 to 1. The output layer is the last layer and produces the prediction.
  • What is an activation function, and why is it important in a neural network?
    • It is a function applied to the output of a layer. It is used to introduce non-linearity; common examples are ReLU, sigmoid, and tanh. ReLU, for example, sets all negative values to zero.
  • How are weights and biases updated in a feedforward network?
    • The weights and biases in a feedforward network are updated using gradient descent: the gradients of the loss function with respect to each weight and bias are computed with backpropagation, and the weights and biases are then adjusted to minimize the loss, typically by subtracting the gradient scaled by the learning rate.
    • Explain the loss function.
      • The loss function is a function which calculates the loss between the predicted value and the actual value. MSE is common in regression and cross-entropy in classification.
    • Explain the chain rule and its role in backpropagation.
      • The chain rule is used in backpropagation to compute the gradients of the loss function with respect to each weight. It breaks down the derivative of the loss into the product of partial derivatives allowing efficient computation of gradients layer by layer.
        • Chain rule:
          • If h(x) = f(g(x))
          • then h'(x) = f'(g(x)) * g'(x)
        • For backpropagation with loss L = f(g(h(x))):
          • dL/dx = dL/dg * dg/dh * dh/dx
      • Backpropagation:
        • ∂L/∂y_t = y_t - y (for squared-error loss), the gradient of the loss with respect to the prediction y_t.
        • Propagate the gradient backward with the chain rule:
          • ∂L/∂w_h = ∂L/∂h * ∂h/∂w_h, where w_h is a weight into the layer and ∂L/∂h is computed from the gradient of the layer above.
          • The weight is then updated: w_h = w_h - (learning rate) * ∂L/∂w_h.
  • How does gradient descent work? (Explain batch, stochastic, and mini-batch gradient descent.)
    • Gradient descent is a way to find a (local) minimum of a function by stepping in the direction of the steepest negative gradient: a_(n+1) = a_n - l * ∇F(a_n), where l is the learning rate.
    • In batch gradient descent we only take one step per epoch which is the averaged gradient for all training examples.
    • In stochastic gradient descent we instead calculate the gradient for each training example and use those to update the weights.
    • Mini batch gradient descent is a combination of batch gradient and stochastic as we are operating on a fixed batch size. We calculate the gradients for N examples and average those and then apply the mean of the gradients when updating weights.
  • How do sigmoid and hyperbolic tangent activation functions behave?
    • Sigmoid outputs a value between 0 and 1, and tanh a value between -1 and 1; both are steepest around zero.
    • When are they most sensitive?
      • Closest to zero
    • When do they saturate?
      • When the input is close to positive or negative infinity.
  • What is the vanishing gradient problem, and how does ReLU address it?
    • The vanishing gradient problem is when the gradients become very small as they are propagated back through many layers (for example because sigmoid or tanh saturate), making the weight updates tiny. ReLU's derivative is 0 for negative inputs and 1 for positive inputs, so for active units the gradient is passed through unchanged instead of shrinking at every layer.
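
A small PyTorch sketch (arbitrary input values) showing how the sigmoid gradient saturates while the ReLU gradient stays at 1 for positive inputs:

import torch

x = torch.tensor([-6.0, -1.0, 0.5, 6.0], requires_grad=True)
torch.sigmoid(x).sum().backward()
print(x.grad)   # tiny gradients for large |x|: the sigmoid saturates

x = torch.tensor([-6.0, -1.0, 0.5, 6.0], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)   # 0 for negative inputs, 1 for positive inputs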

9. Dimension Reduction (UMAP, PCA)

  • What is normalization (e.g., using Normalizer from sklearn.preprocessing)?
    • Normalizer scales each sample (vector) individually to unit norm (L1 or L2 norm = 1).
  • What is standardization (e.g., using StandardScaler from sklearn.preprocessing)?
    • It scales features so that they look like a gaussian around 0. This is done by subtracting the mean from all values and scaling to unit variance.
  • Why is it important to normalize or standardize the data?
    • It is important for UMAP and PCA as they place the points in space, large values in one feature would “weigh” that feature more heavily. So one of the two should be done so that each feature is seen as equally important.
  • Compare PCA and UMAP in terms of their functionality and use cases.
    • PCA is more focused on preserving global distances, while UMAP prioritizes local (within-cluster) distances while still preserving some global structure. UMAP is faster for large datasets and allows larger embedding dimensions. PCA is more useful if you care about distances between clusters, and UMAP more useful when distances between elements in the same cluster are important.
    • PCA is useful for visualization of data, it can also be used in preprocessing, it is deterministic. It can also be used for feature reduction.
    • UMAP is also useful for visualizing data, especially when local structure is important. It's also useful for manifold learning as it introduces nonlinearity. It can also be used as a preprocessing step for clustering as the structure is preserved.
  • What is an embedding?
    • An embedding is a vector representation of some piece of information (for example a word or an image). In dimensionality reduction, an embedding is a mapping of the data from a high-dimensional space to a lower-dimensional space.
  • What is a manifold? Draw an example of a manifold UMAP can visualize effectively, but PCA struggles with.
    • A manifold is a space which is almost euclidean and tries to capture the conceptual dimensionality of the data. It can capture nonlinear relations in the data.
    • UMAP is better in cases where the data is all continuous in more complicated structures like spirals. PCA would break up the structure while UMAP will still show it correctly.
  • Why does PCA require zero mean?
    • As it would otherwise not necessarily find the correct directions of variance as the mean would be included in the calculations skewing the results.
  • What is the role of eigenvectors and eigenvalues in PCA?
    • The eigenvectors of the data's covariance matrix give the directions of maximum variance (the principal components), and the eigenvalues give the amount of variance along each of those directions.
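
A short numpy sketch (random data) of PCA as described above: center the data, eigendecompose the covariance matrix, and project onto the top components:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

Xc = X - X.mean(axis=0)                 # zero mean is required (see above)
cov = np.cov(Xc, rowvar=False)          # 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh since the covariance matrix is symmetric

order = np.argsort(eigvals)[::-1]       # sort directions by variance, largest first
components = eigvecs[:, order[:2]]      # keep the top 2 principal directions
X_reduced = Xc @ components             # project the data down to 2 dimensions

print(eigvals[order])                   # variance along each principal direction
print(X_reduced.shape)                  # (200, 2)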

10. Convolutional Neural Networks (CNNs)

  • What are some typical ML tasks where CNNs perform well?
    • Image classification, time series data, feature detection
  • Given an input and a kernel, what is the output?
    • The output is the convolution between them: the kernel is moved over the input like a 2D sliding window, and at each position the overlapping elements are multiplied elementwise and summed; that sum becomes one element of the output.
  • What do the numbers in nn.Conv2d(in_channels, out_channels, kernel_size) mean?
    • in_channels is the number of channels in the input, e.g. 3 for RGB images. out_channels is the number of filters the layer learns, which is also the number of channels in its output. kernel_size is the side length of the square kernels that are moved over the input.
  • You have an 64x64 RGB image as input, how many channels does it have? What are the dimensions for the image tensor in PyTorch?
    • It has 3 channels, the dimensions would be (3, 64, 64)
  • What does the stride parameter do?
    • It determines how many elements the kernel moves at each step.
  • What does the dilation parameter do?
    • It skips some elements in the input so that the kernel can view data further away.
  • What does the padding parameter do?
    • It adds zero-valued cells around the input, for example so that the convolution does not shrink the spatial dimensions.

Example Code:

import torch
import torch.nn as nn

class TestClassifier(nn.Module):
    def __init__(self):
        super(TestClassifier, self).__init__()
        # 3 input channels (RGB), 2 learned filters, 3x3 kernel, no padding
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=3, stride=1, padding=0)

    def forward(self, x):
        x = self.conv1(x)
        return x

clf = TestClassifier()
image_tensor = torch.randn(3, 64, 64)  # example RGB input (not defined in the original notes); add a batch dim if your PyTorch version requires it
output = clf(image_tensor)
print(output.shape)

FORMULA FOR OUTPUT SHAPE: (input - kernel + 2 * padding) / stride + 1
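
A tiny helper (hypothetical function, just applying the formula above):

def conv_out(size, kernel, padding=0, stride=1):
    # PyTorch rounds down when the division is not exact
    return (size - kernel + 2 * padding) // stride + 1

print(conv_out(64, 3, padding=0))  # 62 -> matches the 2x62x62 answer below
print(conv_out(64, 3, padding=1))  # 64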

  • If you send in a 64×64 image tensor with dimensions 3×64×64, what is the output shape?
    • 2x62x62
  • What happens to the output shape if padding=1 is used?
    • 2x64x64
  • What does max pooling do? What is the result from the MaxPool2D below?
    • Max pooling slides a window over the input and keeps only the maximum value inside each window, which downsamples the feature map. The MaxPool2d layers below (kernel_size=2, stride=2) halve the height and width.

You are given the classifier below, and the input is an image tensor with dimensions 3x64x64. What is the shape of the output?

  1. After conv1: 2x64x64
  2. After pool1: 2x32x32
  3. After conv2: 1x32x32 (conv2 has out_channels=1)
  4. After pool2: 1x16x16

import torch
import torch.nn as nn

class TestClassifier(nn.Module):
    def __init__(self):
        super(TestClassifier, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=3, stride=1, padding=1, bias=False)
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)   # halves height and width
        self.conv2 = nn.Conv2d(in_channels=2, out_channels=1, kernel_size=3, stride=1, padding=1, bias=False)
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        x = self.conv1(x)   # 3x64x64 -> 2x64x64 (padding=1 keeps the spatial size)
        x = self.pool1(x)   # 2x64x64 -> 2x32x32
        x = self.conv2(x)   # 2x32x32 -> 1x32x32
        x = self.pool2(x)   # 1x32x32 -> 1x16x16
        return x

clf = TestClassifier()
image_tensor = torch.randn(3, 64, 64)  # example input (not defined in the original notes)
output = clf(image_tensor)
print(output.shape)

How many parameters does the model above have?
(You can get it in code using:)
total_params = sum(p.numel() for p in clf.parameters())

print(f"Number of parameters: {total_params}")

  • Note: Each output channel learns its own kernel.
  • If we change bias=True for both convolution layers, how many parameters will the model have?
    Note: The bias is added after the convolution for one channel, and each output channel has its own bias.
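
A worked count, using parameters per conv layer = in_channels * out_channels * kernel_size * kernel_size (plus out_channels biases if bias=True):

  • conv1: 3 * 2 * 3 * 3 = 54 weights and conv2: 2 * 1 * 3 * 3 = 18 weights, so 54 + 18 = 72 parameters with bias=False.
  • With bias=True: (54 + 2) + (18 + 1) = 75 parameters.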

Network Architectures

  • Fill in the blanks:
    • AlexNet was the first time the activation function ReLU was used. Therefore, the network was deeper [deeper, shallower] than previous networks, and it also made the training times faster. They also used data augmentation to translate images and change intensity.
    • The VGG network used smaller [smaller/bigger] convolutions sized 3x3 compared to previous models. Therefore, the network was deeper [deeper/shallower] than previous architectures while still having the same number of [the same number of/fewer] parameters. They also used known weights from previous networks as initialization, so they converged faster [faster/slower] compared to AlexNet.
    • GoogLeNet decreased [increased/decreased] the number of trainable parameters by having a more complex [simple/more complex] architecture.
    • Deep networks can get poor performance because gradients might vanish or explode. ResNet added residual connections that bypass layers (a minimal sketch follows this list), which improved convergence and are now added to many object detection and segmentation networks. It has 152 layers, which was much more [more/less] than previous models. ResNeXt added cardinality, which refers to the number of parallel paths (groups) in each layer.
    • EfficientNet used neural architecture search (NAS) to find an architecture that balanced width, depth, and resolution of the input images, with fewer [fewer/more] parameters than previous models and therefore lower [lower/higher] computational cost.
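
A minimal PyTorch sketch of the residual connection ResNet introduced (simplified: no batch norm, and the channel count is kept constant so the shapes match):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)   # skip connection: the gradient can bypass the conv layers

block = ResidualBlock(channels=8)
print(block(torch.randn(1, 8, 32, 32)).shape)  # shape is preserved: (1, 8, 32, 32)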

11. Explainable AI

  • What are some reasons for deploying explainable AI models?
    • Unwanted biases: to know what biases the model has.
    • Robustness: is the model consistent and accurate in its predictions?
    • Safety: can we find examples with minimal changes that change the outcome?
    • Understanding: can we learn something about the task that could change which model we choose?
  • What does it mean that a model has local explainability?
    • Using a single or a small selection of instances to gain insight into how the model makes its decision.
  • What does it mean that a model has global explainability?
    • Explains the whole model, e.g trends towards classes and decision boundaries.
  • What does it mean that a model is a black box?
    • That it is not known how it works
    • It is not known on what it bases its decisions.
    • Only input and output is known.
  • What does it mean that a neural network model is a white box?
    • We have access to the input and output as well as the model weights and the structure. Often also means that we can compute gradients.
  • Draw lines from the method names to the description that matches the method. Check canvas
    • Integrated gradients -> Explains the predictions of deep neural networks by measuring the contribution of each feature. Works by calculating the integral of the gradients.
    • LRP -> Explains the predictions by going backwards from prediction to input, aggregating relevance. Produces a heat map on the input domain for contributions towards a class.
    • LIME -> Explains individual predictions by perturbing the input locally and training a simple interpretable model on the perturbed data to learn feature importance for the specific instance.
    • SHAP -> Based on cooperative game theory where each feature is assigned contributions for a prediction. It explains individual predictions by calculating how the model output changes when a feature is added or removed, making sure that the more influential features receive higher contribution.
    • Counterfactual -> Finds the minimal change in input that changes the prediction, for example by adding noise to an image until the prediction flips. Demonstrates robustness and safety issues with a model.
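
A naive counterfactual-style sketch (toy model, made-up data, a simple 1-D search over a single feature) of finding a small input change that flips the prediction:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X, y)

x = np.array([[-1.0, 0.2]])     # instance we want to explain
original = model.predict(x)[0]

# Increase feature 0 in small steps until the predicted class flips
candidate = x.copy()
while model.predict(candidate)[0] == original and candidate[0, 0] < 5.0:
    candidate[0, 0] += 0.05

print(original, model.predict(candidate)[0], (candidate - x).round(2))  # how much feature 0 had to change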

12. ML Tasks and Metrics

  • What is image classification, object detection, and segmentation?
    • Image classification is a task where you assign a single class to the whole image. Object detection is finding the object or objects in an image and localizing them, typically with bounding boxes and class labels. Image segmentation is assigning a class to each pixel, splitting the image into regions depending on what is in them.
  • How are bounding boxes generated for an input image?
    • The input image is split into a grid of cells, and each cell predicts k candidate bounding boxes; only the boxes with high confidence are kept.
  • What does it mean that a dataset is imbalanced? How can that be detected?
    • That it has many more samples of some classes than others. It can be detected by analysing the class distribution of the dataset, or by noticing that the model starts predicting the same class almost every time.
  • How can you change your dataset to make it less imbalanced?
    • You can oversample or augment the minority classes to add more data, or undersample (remove data from) the majority classes.
  • What is confirmation bias, historical bias, selection bias, and survivorship bias?
    • Confirmation bias is favouring data or interpretations that confirm what you already expect. Historical bias is when the data reflects biases that existed in the world when it was collected. Selection bias is when the way the data was collected makes it unrepresentative of the population you care about. Survivorship bias is only looking at the samples that “survived” some selection process and overlooking those that did not.
  • How can you automatically augment image data?
    • You can crop out different parts, change colors, rotate it etc.
  • Give an example of how you can generate synthetic image data.
    • It can be generated with generative models such as GANs or diffusion models, or by rendering images from 3D simulations or game engines.
  • Fill in the blanks:
    • RNNs, LSTMs, and transformers are all models used for sequence/timeseries modeling.
    • Which one of RNNs, LSTMs, and transformers uses attention to capture context, has an encoder-decoder architecture, and is used in large language models? Transformers
    • Audio data can be preprocessed into a Mel spectrogram, and since that is an image, we can use CNN for classification.
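
A minimal sketch of the audio preprocessing mentioned above (assuming torchaudio is available; the waveform is made up), turning audio into a Mel spectrogram that a CNN can treat as a 1-channel image:

import torch
import torchaudio

waveform = torch.randn(1, 16000)  # pretend 1 second of mono audio at 16 kHz
to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)

spec = to_mel(waveform)           # shape (1, 64, time_frames): a 1-channel "image"
print(spec.shape)

# The spectrogram can now be fed to a CNN, e.g. nn.Conv2d(in_channels=1, ...)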