Deep learning / Machine learning / Prep large datasets / Query large datasets
Deep Learning … cards/bricks/legos
Tokenization
Positional encoding (tagging each token with its position)
Embedding into another space / another representation of the information (see the sketch below)
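A minimal numpy sketch tying the three blocks above together: a toy whitespace tokenizer, an embedding lookup, and the fixed sinusoidal positional encoding from the transformer recipe. The vocabulary, dimensions, and weights are all illustrative stand-ins, not a real pipeline.

```python
import numpy as np

# Toy tokenizer: split on whitespace and map each word to an integer id.
vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}  # hypothetical vocabulary
tokens = [vocab.get(w, vocab["<unk>"]) for w in "the cat sat".split()]

d_model = 8                      # embedding dimension (illustrative)
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))  # learned in practice

# Embedding lookup: each token id selects a row (a point in the new space).
x = embedding_table[tokens]      # shape: (seq_len, d_model)

# Sinusoidal positional encoding (the fixed variant from the transformer paper).
pos = np.arange(len(tokens))[:, None]        # (seq_len, 1)
i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
angles = pos / (10000 ** (2 * i / d_model))
pe = np.zeros((len(tokens), d_model))
pe[:, 0::2] = np.sin(angles)
pe[:, 1::2] = np.cos(angles)

x = x + pe  # token identity and position now live in the same representation
print(x.shape)  # (3, 8)
```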
Loss functions
Activation functions
Searching a space in an efficient way: random sampling, sampling guided by gradient descent, grid search (sketch below)
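A toy sketch of the three search strategies on a known 2-D quadratic; in a real setting the objective would be a validation loss and the gradient would come from backprop. The ranges, budgets, and step size are illustrative.

```python
import numpy as np

# Toy objective to minimize; minimum is at (3, -1).
def f(x):
    return (x[0] - 3.0) ** 2 + (x[1] + 1.0) ** 2

rng = np.random.default_rng(0)

# 1) Random sampling: draw candidate points uniformly and keep the best.
candidates = rng.uniform(-5, 5, size=(100, 2))
best_random = min(candidates, key=f)

# 2) Grid search: evaluate every point on a fixed lattice.
grid = np.array([[a, b] for a in np.linspace(-5, 5, 11)
                        for b in np.linspace(-5, 5, 11)])
best_grid = min(grid, key=f)

# 3) Gradient descent: follow the (here, analytic) gradient downhill.
#    Too large a step makes this diverge (see "learning rate" below).
x, lr = np.array([-4.0, 4.0]), 0.1
for _ in range(200):
    grad = np.array([2 * (x[0] - 3.0), 2 * (x[1] + 1.0)])
    x = x - lr * grad

print(best_random, best_grid, x)  # all should approach (3, -1)
```

Grid search scales poorly with dimension; random sampling and gradient-guided search are usually preferred in high-dimensional spaces.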
Distribution of weights within each layer
Taming the training
- Batch normalization vs. layer normalization: normalize each feature across the batch vs. each sample across its features (see the sketch after this list)
- Learning rate
- Convergence / divergence of the loss, driven by the scale of the weights and by the learning rate
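A numpy sketch of the axis difference only; real norm layers also learn a per-feature scale and shift, which this omits.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize each feature over the batch dimension (axis 0):
    # statistics are shared across samples, computed per feature.
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    # Normalize each sample over its features (axis 1):
    # statistics are computed per sample, independent of the batch.
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(size=(4, 6))  # (batch, features)
print(batch_norm(x).mean(axis=0).round(6))  # ~0 per feature
print(layer_norm(x).mean(axis=1).round(6))  # ~0 per sample
```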
Feasibility of training / performance
- Metrics for model performance
- Ensemble of metrics
Compute considerations
- Serializing / deserializing / runtime environment
- Size of the model
- Inference time (see the measurement sketch after this list)
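A sketch of how one might measure all three: serialize a toy numpy "model" to get its size, round-trip it as a stand-in for the runtime-environment check, and average a forward pass to estimate inference time. The layer shape and repetition count are arbitrary.

```python
import pickle
import time
import numpy as np

# A stand-in "model": one dense layer's weights and bias (names illustrative).
rng = np.random.default_rng(0)
model = {"W": rng.normal(size=(512, 512)).astype(np.float32),
         "b": np.zeros(512, dtype=np.float32)}

# Size of the model: serialize and measure the byte count.
blob = pickle.dumps(model)
print(f"serialized size: {len(blob) / 1e6:.2f} MB")

# Round-trip: deserialization is what the runtime environment must support.
model = pickle.loads(blob)

# Inference time: average the forward pass over repeated runs.
x = rng.normal(size=(1, 512)).astype(np.float32)
n = 1000
t0 = time.perf_counter()
for _ in range(n):
    y = np.maximum(x @ model["W"] + model["b"], 0.0)  # linear + ReLU
print(f"mean inference time: {(time.perf_counter() - t0) / n * 1e6:.1f} us")
```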
De-duplicate training data: K-means
- Training on data where some points appear multiple times gives those points more weight (fine if that is the intent)
- Repetition in the training data acts like training on the repeated points for extra epochs
- Duplication may therefore lead to overfitting
- One method to de-duplicate: run a clustering method and keep one data point from each cluster (see the sketch after this list)
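A sketch of that recipe with scikit-learn's KMeans, assuming the number of clusters is known (in practice it must be chosen, e.g., with an elbow heuristic): cluster the points, then keep the member closest to each centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy dataset with near-duplicates: 3 "true" points, each repeated with noise.
rng = np.random.default_rng(0)
base = np.array([[0.0, 0.0], [5.0, 5.0], [9.0, 1.0]])
data = np.vstack([b + 0.05 * rng.normal(size=(4, 2)) for b in base])  # 12 pts

# Cluster, then keep the single point closest to each centroid.
k = 3  # assumed known here, for illustration
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
deduped = []
for c in range(k):
    members = np.where(km.labels_ == c)[0]
    dists = np.linalg.norm(data[members] - km.cluster_centers_[c], axis=1)
    deduped.append(data[members[np.argmin(dists)]])

print(np.array(deduped))  # one representative per cluster, ~the 3 base points
```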
Clustering / classification methods (note: Naive Bayes and decision trees below are supervised classifiers, not clustering methods):
K-means: Unsupervised. Groups data points by similarity, where similarity is measured by a distance metric (typically Euclidean) on the points' numerical features.
Naive Bayes: Supervised. An if-this-feature-then-this-class kind of method. It is sensitive to a characteristic of the training data that arises from the combination of (1) the number of data points in each class and (2) the overlap between the features of the classes. If the dataset is unbalanced and the classes' features overlap, the presence of an overlapping feature pushes the prediction toward the class with more training points, because the class prior (the per-class frequency in the training set) dominates the decision. This can be counteracted by a normalization factor that accounts for the number of samples per class, i.e., by forcing uniform class priors (see the sketch after these notes).
Decision Tree: Supervised. Also an if-then style method. Interestingly, it does not suffer from the Naive Bayes problem above in the same way: the presence of an overlapping feature does not by itself pull the prediction toward the class with more training points. Instead, the tree keeps splitting until it reaches a distinguishing feature and only then predicts a class for the data point at hand (although a leaf that still contains a mix of classes predicts its majority class, so extreme imbalance can still leak in).
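A sketch of the imbalance effect and the hypothesized fix, using scikit-learn's GaussianNB on a 1-D toy set; its `priors` argument is exactly the per-class normalization suggested above. The class sizes and test point are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Imbalanced 1-D toy set: class 0 ~ N(0,1) with 900 points,
# class 1 ~ N(2,1) with 100 points. The two distributions overlap.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 900), rng.normal(2, 1, 100)])[:, None]
y = np.array([0] * 900 + [1] * 100)

x_test = np.array([[1.2]])  # in the overlap; likelihood mildly favors class 1

nb = GaussianNB().fit(X, y)
print(nb.predict(x_test))  # -> [0]: the 9:1 class prior dominates

# The "normalization factor" hypothesized above is the class prior:
# forcing uniform priors removes the head start of the majority class.
nb_uniform = GaussianNB(priors=[0.5, 0.5]).fit(X, y)
print(nb_uniform.predict(x_test))  # -> [1]: decided by likelihood alone
```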
Question:
Can you pre-gauge the size of the parametric model needed to fit a dataset, given the statistical characteristics of that data?
Installing Spark:
https://archive.apache.org/dist/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz
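A minimal smoke test for the install, assuming `pyspark` is importable (e.g., the extracted distribution's Python bindings are on PYTHONPATH, or PySpark was pip-installed); the app name and data are illustrative.

```python
from pyspark.sql import SparkSession

# Start a local Spark session; "local[*]" uses all cores on this machine.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("smoke-test")
         .getOrCreate())

# Tiny DataFrame query to confirm the install works end to end.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "label"])
df.filter(df.id > 1).show()

spark.stop()
```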