Deep learning / Machine learning / Prep large datasets / Query large datasets

Anudha Mittal
2 min read · Mar 19, 2024


Deep Learning building blocks … the cards/bricks/Legos:

Tokenization

Positional encoding (tagging each token with its position)

Embedding into another space / another representation of information (a toy sketch of these first steps follows this list)

Loss functions

Activation functions

Searching a space in an optimal way: random sampling, sampling guided by gradient descent, grid search

Distribution of weights along each layer
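
To make the first few bricks concrete, here is a toy sketch of tokenization, positional information, and embedding (assumptions: a whitespace tokenizer, a hand-rolled vocabulary, and random embedding/positional matrices; real systems use learned subword tokenizers and trained or sinusoidal encodings):

```python
import numpy as np

# Toy vocabulary and whitespace "tokenizer" (illustrative only).
vocab = {"<unk>": 0, "deep": 1, "learning": 2, "is": 3, "fun": 4}
tokens = "deep learning is fun".split()
ids = [vocab.get(t, vocab["<unk>"]) for t in tokens]  # tokenization

d_model = 8
rng = np.random.default_rng(0)
embed = rng.normal(size=(len(vocab), d_model))  # token embedding table
pos = rng.normal(size=(len(ids), d_model))      # positional encodings

# Embedding into another representation: token vector + position vector.
x = embed[ids] + pos
print(x.shape)  # (4, 8): one d_model-dimensional vector per token
```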

Taming the training

  • Batch normalization vs. layer normalization (see the sketch after this list)
  • Learning rate
  • Convergence / divergence of the loss, driven by the scaling of the weights and by the learning rate
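
The normalization axes are easy to confuse; here is a minimal sketch in PyTorch (the shapes are illustrative): batch norm normalizes each feature across the batch, while layer norm normalizes each sample across its features.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 16)   # batch of 8 samples, 16 features each

bn = nn.BatchNorm1d(16)  # normalizes each feature across the batch dimension
ln = nn.LayerNorm(16)    # normalizes each sample across its features

print(bn(x).mean(dim=0))  # ~0 for every feature (column-wise)
print(ln(x).mean(dim=1))  # ~0 for every sample (row-wise)
```

Layer norm's statistics do not depend on the batch, which is one reason it is preferred in transformers and at small batch sizes.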

Feasibility of training / performance

  • Metrics for model performance
  • An ensemble of metrics rather than a single number (see the sketch below)
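
As a sketch of reporting an ensemble of metrics (assuming scikit-learn; the label arrays are hypothetical):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]  # hypothetical ground truth
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]  # hypothetical predictions

# No single metric tells the whole story; report several together.
for name, fn in [("accuracy", accuracy_score), ("precision", precision_score),
                 ("recall", recall_score), ("f1", f1_score)]:
    print(f"{name}: {fn(y_true, y_pred):.3f}")
```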

Compute considerations

  • Serializing / deserializing the model, and the runtime environment (see the sketch after this list)
  • Size of model
  • Inference time
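
A sketch of checking all three for a trained model (assuming scikit-learn and joblib; the model and data are placeholders):

```python
import os
import time

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Serialize / deserialize: the model must round-trip into the runtime env.
joblib.dump(model, "model.joblib")
restored = joblib.load("model.joblib")

# Size of model on disk.
print("size (bytes):", os.path.getsize("model.joblib"))

# Inference time over the whole dataset.
t0 = time.perf_counter()
restored.predict(X)
print("inference (s):", time.perf_counter() - t0)
```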

De-duplicate training data: K-means

  • Training on data where some points appear multiple times gives those points more weight (fine, if that is what you want)
  • Repetition in the training data is similar to training for extra epochs on the repeated points
  • Duplication may lead to overfitting
  • One way to de-duplicate: run a clustering method and keep one data point from each cluster (see the sketch below)
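
A minimal sketch of that idea with scikit-learn's KMeans (assumptions: the data is already numeric, e.g. embeddings, and the number of clusters k is a guess you would tune; near-duplicates land in the same cluster, and only the point nearest each centroid is kept):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

X = np.random.default_rng(0).normal(size=(500, 32))  # placeholder embeddings

k = 100  # assumed number of distinct groups; tune for your data
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# Keep the single point closest to each centroid: one representative per cluster.
keep_idx, _ = pairwise_distances_argmin_min(km.cluster_centers_, X)
X_dedup = X[np.unique(keep_idx)]
print(X_dedup.shape)  # at most k representatives remain
```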

Clustering and classification methods:

K-means: Unsupervised. Groups data points by similarity, where similarity is measured with a defined metric (e.g. Euclidean distance) on the numerical representation of each point.

Naive Bayes: Supervised. A probabilistic method: it predicts the class that maximizes P(class) × Π P(feature | class), so roughly, "if this feature is present, this class becomes more likely." It is sensitive to a characteristic of the training data, the combined result of (1) the number of data points in each class, which sets the prior P(class), and (2) the overlap between the features of each class. If the dataset is unbalanced and the features of the classes overlap, the presence of an overlapping feature will predict the class with the higher number of data points in the training set. This can be cured by a normalization factor that accounts for the number of samples in each class: forcing uniform class priors removes the imbalance from the prediction.
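
A sketch of that imbalance effect with scikit-learn's MultinomialNB (the data is synthetic: class 0 has 90 samples, class 1 has 10, and feature 0 is the overlapping feature present in both classes; fit_prior=False is the uniform-prior "normalization" mentioned above):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X0 = np.tile([1, 1], (90, 1))  # class 0: overlapping feature 0 plus feature 1
X1 = np.tile([1, 0], (10, 1))  # class 1: only the overlapping feature 0
X = np.vstack([X0, X1])
y = np.array([0] * 90 + [1] * 10)

query = np.array([[1, 0]])  # only the overlapping feature is present

nb = MultinomialNB().fit(X, y)
print(nb.predict(query))  # [0]: the learned prior P(class) favors the big class

nb_uniform = MultinomialNB(fit_prior=False).fit(X, y)  # uniform class priors
print(nb_uniform.predict(query))  # [1]: without the prior, the likelihood wins
```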

Decision Tree: Supervised. A genuinely if-then kind of method. Interestingly, it does not suffer from the problem described above for Naive Bayes: the presence of the overlapping feature does not, by itself, pull the prediction toward the class with the higher number of data points in the training set. Instead, the tree continues down its splits until it reaches a distinguishing feature, and then predicts a class for the data point at hand.
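
Running a decision tree on the same synthetic data (a sketch reusing X, y, and query from the Naive Bayes example above) shows the contrast:

```python
from sklearn.tree import DecisionTreeClassifier

# Reuses X, y, and query from the Naive Bayes sketch above.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.predict(query))  # [1]: the tree splits on the distinguishing
                            # feature 1 and ignores the class imbalance
```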

Question:

Can you pre-gauge the size of the parametric model needed to fit a set of data, given statistical characteristics of that data?

Installing Spark:

https://dlcdn.apache.org/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop$HADOOP_VERSION.tgz

https://archive.apache.org/dist/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz (for Spark 3.3 and later, the prebuilt archives use the hadoop3 suffix rather than hadoop3.2)
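
Once the archive is unpacked (and SPARK_HOME / PATH point at it), a minimal PySpark session for querying a large dataset might look like this (a sketch; data.parquet and the column names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-large-dataset").getOrCreate()

# Placeholder path and columns; swap in your own dataset.
df = spark.read.parquet("data.parquet")
df.filter(df["label"] == 1).groupBy("category").count().show()

spark.stop()
```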
