Deep learning / Machine learning / Prep large datasets / Query large datasets
Deep Learning … cards/bricks/legos
Tokenization
Positional encoding (tagging each token with its position)
Embedding into another space / another representation of the information (see the sketch below)
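A minimal numpy sketch tying the three blocks above together: a toy whitespace tokenizer, an embedding lookup, and the fixed sinusoidal positional encoding from the transformer recipe. The vocabulary, dimensions, and weights are all illustrative stand-ins, not a real pipeline.

```python
import numpy as np

# Toy tokenizer: split on whitespace and map each word to an integer id.
vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}  # hypothetical vocabulary
tokens = [vocab.get(w, vocab["<unk>"]) for w in "the cat sat".split()]

d_model = 8                      # embedding dimension (illustrative)
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))  # learned in practice

# Embedding lookup: each token id selects a row (a point in the new space).
x = embedding_table[tokens]      # shape: (seq_len, d_model)

# Sinusoidal positional encoding (the fixed variant from the transformer paper).
pos = np.arange(len(tokens))[:, None]        # (seq_len, 1)
i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
angles = pos / (10000 ** (2 * i / d_model))
pe = np.zeros((len(tokens), d_model))
pe[:, 0::2] = np.sin(angles)
pe[:, 1::2] = np.cos(angles)

x = x + pe  # token identity and position now live in the same representation
print(x.shape)  # (3, 8)
```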
Loss functions
Activation functions
Searching a space in an efficient way: random sampling, sampling guided by gradient descent, grid search (sketch below)
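A toy sketch of the three search strategies on a known 2-D quadratic; in a real setting the objective would be a validation loss and the gradient would come from backprop. The ranges, budgets, and step size are illustrative.

```python
import numpy as np

# Toy objective to minimize; minimum is at (3, -1).
def f(x):
    return (x[0] - 3.0) ** 2 + (x[1] + 1.0) ** 2

rng = np.random.default_rng(0)

# 1) Random sampling: draw candidate points uniformly and keep the best.
candidates = rng.uniform(-5, 5, size=(100, 2))
best_random = min(candidates, key=f)

# 2) Grid search: evaluate every point on a fixed lattice.
grid = np.array([[a, b] for a in np.linspace(-5, 5, 11)
                        for b in np.linspace(-5, 5, 11)])
best_grid = min(grid, key=f)

# 3) Gradient descent: follow the (here, analytic) gradient downhill.
#    Too large a step makes this diverge (see "learning rate" below).
x, lr = np.array([-4.0, 4.0]), 0.1
for _ in range(200):
    grad = np.array([2 * (x[0] - 3.0), 2 * (x[1] + 1.0)])
    x = x - lr * grad

print(best_random, best_grid, x)  # all should approach (3, -1)
```

Grid search scales poorly with dimension; random sampling and gradient-guided search are usually preferred in high-dimensional spaces.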
Distribution of weights within each layer
Taming the training
- Batch normalization vs. layer normalization: normalize each feature across the batch vs. each sample across its features (see the sketch after this list)
- Learning rate
- Convergence / divergence of the loss, driven by the scale of the weights and by the learning rate
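A numpy sketch of the axis difference only; real norm layers also learn a per-feature scale and shift, which this omits.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize each feature over the batch dimension (axis 0):
    # statistics are shared across samples, computed per feature.
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    # Normalize each sample over its features (axis 1):
    # statistics are computed per sample, independent of the batch.
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(size=(4, 6))  # (batch, features)
print(batch_norm(x).mean(axis=0).round(6))  # ~0 per feature
print(layer_norm(x).mean(axis=1).round(6))  # ~0 per sample
```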
Feasibility of training / performance
- Metrics for model performance
- Ensemble of metrics
Compute considerations
- Serializing / deserializing / runtime environment
- Size of the model
- Inference time (see the measurement sketch after this list)
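A sketch of how one might measure all three: serialize a toy numpy "model" to get its size, round-trip it as a stand-in for the runtime-environment check, and average a forward pass to estimate inference time. The layer shape and repetition count are arbitrary.

```python
import pickle
import time
import numpy as np

# A stand-in "model": one dense layer's weights and bias (names illustrative).
rng = np.random.default_rng(0)
model = {"W": rng.normal(size=(512, 512)).astype(np.float32),
         "b": np.zeros(512, dtype=np.float32)}

# Size of the model: serialize and measure the byte count.
blob = pickle.dumps(model)
print(f"serialized size: {len(blob) / 1e6:.2f} MB")

# Round-trip: deserialization is what the runtime environment must support.
model = pickle.loads(blob)

# Inference time: average the forward pass over repeated runs.
x = rng.normal(size=(1, 512)).astype(np.float32)
n = 1000
t0 = time.perf_counter()
for _ in range(n):
    y = np.maximum(x @ model["W"] + model["b"], 0.0)  # linear + ReLU
print(f"mean inference time: {(time.perf_counter() - t0) / n * 1e6:.1f} us")
```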
De-duplicate training data: K-means
- Training on data where some points appear multiple times gives those points more weight (fine if that is the intent)
- Repetition in the training data acts like training on the repeated points for extra epochs
- Duplication may therefore lead to overfitting
- One method to de-duplicate: run a clustering method and keep one data point from each cluster (see the sketch after this list)
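A sketch of that recipe with scikit-learn's KMeans, assuming the number of clusters is known (in practice it must be chosen, e.g., with an elbow heuristic): cluster the points, then keep the member closest to each centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy dataset with near-duplicates: 3 "true" points, each repeated with noise.
rng = np.random.default_rng(0)
base = np.array([[0.0, 0.0], [5.0, 5.0], [9.0, 1.0]])
data = np.vstack([b + 0.05 * rng.normal(size=(4, 2)) for b in base])  # 12 pts

# Cluster, then keep the single point closest to each centroid.
k = 3  # assumed known here, for illustration
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
deduped = []
for c in range(k):
    members = np.where(km.labels_ == c)[0]
    dists = np.linalg.norm(data[members] - km.cluster_centers_[c], axis=1)
    deduped.append(data[members[np.argmin(dists)]])

print(np.array(deduped))  # one representative per cluster, ~the 3 base points
```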
Clustering / classification methods (note: Naive Bayes and decision trees below are supervised classifiers, not clustering methods):
K-means: Unsupervised. Groups data points by similarity, where similarity is measured by a distance metric (typically Euclidean) on the points' numerical features.
Naive Bayes: Supervised. An if-this-feature-then-this-class kind of method. It is sensitive to a characteristic of the training data that arises from the combination of (1) the number of data points in each class and (2) the overlap between the features of the classes. If the dataset is unbalanced and the classes' features overlap, the presence of an overlapping feature pushes the prediction toward the class with more training points, because the class prior (the per-class frequency in the training set) dominates the decision. This can be counteracted by a normalization factor that accounts for the number of samples per class, i.e., by forcing uniform class priors (see the sketch after these notes).
Decision Tree: Supervised. Also an if-then style method. Interestingly, it does not suffer from the Naive Bayes problem above in the same way: the presence of an overlapping feature does not by itself pull the prediction toward the class with more training points. Instead, the tree keeps splitting until it reaches a distinguishing feature and only then predicts a class for the data point at hand (although a leaf that still contains a mix of classes predicts its majority class, so extreme imbalance can still leak in).
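A sketch of the imbalance effect and the hypothesized fix, using scikit-learn's GaussianNB on a 1-D toy set; its `priors` argument is exactly the per-class normalization suggested above. The class sizes and test point are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Imbalanced 1-D toy set: class 0 ~ N(0,1) with 900 points,
# class 1 ~ N(2,1) with 100 points. The two distributions overlap.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 900), rng.normal(2, 1, 100)])[:, None]
y = np.array([0] * 900 + [1] * 100)

x_test = np.array([[1.2]])  # in the overlap; likelihood mildly favors class 1

nb = GaussianNB().fit(X, y)
print(nb.predict(x_test))  # -> [0]: the 9:1 class prior dominates

# The "normalization factor" hypothesized above is the class prior:
# forcing uniform priors removes the head start of the majority class.
nb_uniform = GaussianNB(priors=[0.5, 0.5]).fit(X, y)
print(nb_uniform.predict(x_test))  # -> [1]: decided by likelihood alone
```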
Question:
Can you pre-gauge the size of the parametric model needed to fit a dataset, given the statistical characteristics of that data?
Installing Spark:
https://archive.apache.org/dist/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz
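A minimal smoke test for the install, assuming `pyspark` is importable (e.g., the extracted distribution's Python bindings are on PYTHONPATH, or PySpark was pip-installed); the app name and data are illustrative.

```python
from pyspark.sql import SparkSession

# Start a local Spark session; "local[*]" uses all cores on this machine.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("smoke-test")
         .getOrCreate())

# Tiny DataFrame query to confirm the install works end to end.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "label"])
df.filter(df.id > 1).show()

spark.stop()
```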