Tokenization In NLP / PSQL / Merging data
Initializes a vocabulary list with every individual character in the corpus
Each element in the vocabulary list is called a token
Appends to the vocabulary list using a merging algorithm until a desired vocabulary size is reached
The algorithm is iterative
The first iteration merges individual characters into pairs, and these pairs are added as tokens to the vocabulary list
Subsequent iterations merge previously created tokens into longer tokens
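A minimal sketch of the initialization step in Python (the toy corpus is made up; the merge loop itself is sketched under Byte-Pair Tokenization below):

```python
# The starting vocabulary is just every character that appears in the corpus.
corpus = "low lower lowest"      # toy corpus, invented for illustration
vocab = sorted(set(corpus))      # [' ', 'e', 'l', 'o', 'r', 's', 't', 'w']
print(vocab)
```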
# Byte-Pair Tokenization
Merge rule:
Merge the pair of tokens with the highest frequency of adjacent occurrence
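A minimal sketch of that merge loop in Python, assuming a made-up toy corpus and a fixed number of merges (real tokenizers also track word frequencies, special tokens, and byte-level fallbacks):

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair of tokens."""
    # Pre-tokenize into words, then split each word into single-character tokens.
    words = [list(w) for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent token pair occurs across the corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # merge rule: highest adjacent frequency
        merges.append(best)
        # Apply the merge everywhere before the next iteration.
        new_words = []
        for w in words:
            merged, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    merged.append(w[i] + w[i + 1])
                    i += 2
                else:
                    merged.append(w[i])
                    i += 1
            new_words.append(merged)
        words = new_words
    return merges

print(bpe_train("low low low lower lowest", num_merges=4))
# [('l', 'o'), ('lo', 'w'), ('low', 'e'), ('lowe', 'r')]
```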
# Word Piece Tokenization
Merge rule:
1) Merge the pair of tokens with the highest probability of occurring together,
2) normalized by the probability of occurrence of token 1 and the probability of occurrence of token 2, i.e. score(t1, t2) = count(t1 t2) / (count(t1) × count(t2))
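A sketch of that score on the same toy setup as the BPE example above; the normalization is what lets a rare-but-always-adjacent pair beat a merely frequent one:

```python
from collections import Counter

def wordpiece_best_pair(words):
    """Pick the pair maximizing count(t1 t2) / (count(t1) * count(t2))."""
    pair_counts, token_counts = Counter(), Counter()
    for w in words:
        token_counts.update(w)
        for a, b in zip(w, w[1:]):
            pair_counts[(a, b)] += 1
    def score(pair):
        a, b = pair
        return pair_counts[pair] / (token_counts[a] * token_counts[b])
    return max(pair_counts, key=score)

words = [list(w) for w in "low low low lower lowest".split()]
print(wordpiece_best_pair(words))   # ('s', 't') wins over the more frequent ('l', 'o')
```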
# Sentence Piece Tokenization
- Treats spaces as characters
- Can be applied to languages that don’t use spaces to divide words (Chinese, Japanese).
- Note that Sanskrit sometimes merges words (sandhi), and there are rules governing those merges.
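A small illustration of the “spaces as characters” idea: SentencePiece replaces spaces with a visible meta symbol (▁, U+2581) before learning subwords, so no language-specific pre-tokenization is needed. The sketch below only shows that pre-processing step, not the trainer itself:

```python
# Sketch of SentencePiece-style space handling: the space becomes an ordinary symbol,
# so word boundaries are learned like any other character pattern.
text = "new york city"
sp_text = "▁" + text.replace(" ", "▁")
print(sp_text)            # ▁new▁york▁city
print(list(sp_text)[:6])  # the model then operates on this character sequence
```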
Paper on tokenization: https://aclanthology.org/2021.emnlp-main.160.pdf
- Uses a greedy longest-match-first strategy to segment each word against the vocabulary (see the sketch after the list below)
LeetCode-level concepts:
- Split by word and space, aka “pre-tokenize”. Only used in Byte-Pair Encoding (BPE) and WordPiece.
- Substrings
- Regex matching
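A sketch of the greedy longest-match-first strategy applied per pre-tokenized word against a fixed vocabulary. The ## continuation marker follows WordPiece convention; the vocabulary here is made up:

```python
def longest_match_tokenize(word, vocab):
    """Greedily take the longest vocabulary entry that prefixes the remaining text."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:             # WordPiece marks non-initial pieces with "##"
                piece = "##" + piece
            if piece in vocab:
                tokens.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]          # no substring matched the vocabulary
        start = end
    return tokens

vocab = {"token", "##ization", "##ize", "un", "##related"}   # hypothetical vocabulary
print(longest_match_tokenize("tokenization", vocab))   # ['token', '##ization']
print(longest_match_tokenize("unrelated", vocab))      # ['un', '##related']
```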
Parallel subword tokenization schemes:
- Byte-Pair Encoding (BPE) (Sennrich et al., 2016)
- SentencePiece (Kudo, 2018)
- WordPiece (Schuster and Nakajima, 2012; popularized by Google's BERT, 2018)
5-minute video on tokenization
Implementation for sentence transformers: https://sbert.net/
Use tokenization and embedding models to merge databases.
Join tables on column names containing similar words (when the names are not exactly the same, as is common when different people enter data or when different datasets are merged).
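A rough sketch of that idea with the sentence-transformers library linked above; the model name, column lists, and similarity threshold are all assumptions for illustration, and proposed matches should still be reviewed by hand:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")    # assumed model choice

# Column names from two tables entered by different people (made-up examples).
cols_a = ["river_name", "pollution_level", "discharge_m3s"]
cols_b = ["name_of_river", "contamination", "flow_rate"]

emb_a = model.encode(cols_a, convert_to_tensor=True)
emb_b = model.encode(cols_b, convert_to_tensor=True)
sims = util.cos_sim(emb_a, emb_b)                  # pairwise cosine similarities

# Propose, for each column in A, its most similar column in B.
for i, col in enumerate(cols_a):
    j = int(sims[i].argmax())
    if float(sims[i][j]) > 0.5:                    # assumed threshold; tune per dataset
        print(f"{col}  <->  {cols_b[j]}  (cos={float(sims[i][j]):.2f})")
```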

Default structure in PostgreSQL:
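One way to inspect that default structure (the postgres database, the public schema, the system catalogs) from Python with psycopg2; the connection parameters are placeholders:

```python
import psycopg2

# Placeholder connection details; adjust for your setup.
conn = psycopg2.connect(dbname="postgres", user="postgres", host="localhost")
cur = conn.cursor()

# List user-visible tables; system catalogs live in pg_catalog and information_schema.
cur.execute("""
    SELECT table_schema, table_name
    FROM information_schema.tables
    WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
    ORDER BY table_schema, table_name;
""")
for schema, table in cur.fetchall():
    print(schema, table)
cur.close()
conn.close()
```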

Custom tables:
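A hypothetical example of creating a custom table; the rivers table and its columns are invented here so the SELECT example below has something to refer to:

```python
import psycopg2

conn = psycopg2.connect(dbname="postgres", user="postgres", host="localhost")  # placeholder connection
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS rivers (
        id        serial PRIMARY KEY,
        name      text NOT NULL,
        pollution text              -- left NULL when the pollution level is unknown
    );
""")
conn.commit()
cur.close()
conn.close()
```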

Select rows based on a condition, e.g. select all columns from the table rivers where pollution is unknown.
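A sketch of that query against the hypothetical rivers table above, interpreting “pollution is unknown” as a NULL value:

```python
import psycopg2

conn = psycopg2.connect(dbname="postgres", user="postgres", host="localhost")  # placeholder connection
cur = conn.cursor()
cur.execute("SELECT * FROM rivers WHERE pollution IS NULL;")  # all columns, rows with unknown pollution
for row in cur.fetchall():
    print(row)
cur.close()
conn.close()
```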

Link for benchmarking performance of a PostgreSQL database (pgbench): https://www.postgresql.org/docs/current/pgbench.html