1. Word Embedding


  Because language, which consists of words, prepositions, and so on, is not a computable object, it has to be converted into some computable representation. So, in the data preprocessing stage, input text is converted into vectors, and this is called ‘Word Embedding’. There are many word embedding methods. In this post, ‘One-Hot Encoding’ will be introduced.



2. One-Hot Encoding


  There are two kinds of data: categorical values and continuous values. Contrary to continuous values such as temperature and height, categorical values are discrete values such as classes and words. Generally, we convert a categorical value into some computable value such as an integer. But such an integer is only a label; it is not a numeric value that has magnitude or rank. For example, if two temperatures (continuous values) are similar, it means that the kinetic energies of their molecules are similar, and if one temperature is higher than another, it is hotter.

Index | 0    | 1     | 2      | 3      | … | 1,000,000
Class | Cats | Ships | Papers | Knives | … | Dogs
fig 1. a mapping table of some image classes


  Whereas, in a classification task, the indices of image classes such as Cats and Ships are not allocated based on similarity. If the Cats class is indexed to 0, the task works equally well whether the Dogs class is indexed to 1 or to 1,000,000. Moreover, the distance between indices cannot be used as a measure of class similarity. In fig 1, although cats are more similar to dogs than to ships, the distance between the Cats and Ships indices is 1, less than the distance between the Cats and Dogs indices, 1,000,000.

Class  | Index     | Cats | Ships | Papers | Knives | … | Dogs
Cats   | 0         | 1    | 0     | 0      | 0      | … | 0
Ships  | 1         | 0    | 1     | 0      | 0      | … | 0
Papers | 2         | 0    | 0     | 1      | 0      | … | 0
Knives | 3         | 0    | 0     | 0      | 1      | … | 0
Dogs   | 1,000,000 | 0    | 0     | 0      | 0      | … | 1
fig 2. One-Hot vectors by One-Hot Encoding


  One-Hot Encoding is a vector representation in which each dimension is mapped to one categorical value (a word or a class). Each value is converted to a One-Hot vector: the element at the dimension mapped to that value is 1, and every other element is 0. For example, in fig 2, the index of the Papers class is 2, which means that dimension 2 of every One-Hot vector is mapped to the Papers class, so the One-Hot vector of the Papers class is (0, 0, 1, 0, …). A One-Hot vector is a sparse vector, because almost all of its elements are 0 (One-Hot Encoding is a sparse representation). And the cosine similarity between any two distinct One-Hot vectors is 0, because they are orthogonal.
  In NLP, words or tokens are converted to One-Hot vectors.
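  To make this concrete, here is a minimal sketch in Python (not from the original post; the vocabulary and function names are illustrative assumptions) that builds One-Hot vectors for a small vocabulary:

    # A minimal sketch of One-Hot Encoding.
    # The vocabulary below is an illustrative assumption.
    def build_index(vocab):
        # Map each token/class to a unique integer index.
        return {token: i for i, token in enumerate(vocab)}

    def one_hot(token, index):
        # 1 at the token's own dimension, 0 everywhere else.
        vec = [0] * len(index)
        vec[index[token]] = 1
        return vec

    vocab = ["cats", "ships", "papers", "knives", "dogs"]
    index = build_index(vocab)
    print(one_hot("papers", index))  # [0, 0, 1, 0, 0]: 1 at index 2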



3. Drawbacks of One-Hot Encoding


  1. Computing a meaningful distance between One-Hot vectors is impossible: every pair of distinct vectors is equally far apart. However, there is semantic similarity (distance) between some words of a language; for example, cats are more similar to dogs than to ships semantically, but One-Hot vectors cannot express this (see the sketch after this list).
  2. We need a very large memory space to store One-Hot vectors. If there is a dictionary whose size is N, each vector has N dimensions, so storing one vector per word requires O(N²) space.
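  The first drawback can be checked directly. Below is a small Python sketch (an illustration using the same toy vocabulary as above, not from the original post): the cosine similarity between any two distinct One-Hot vectors is 0, so ‘cats’ is exactly as far from ‘dogs’ as from ‘ships’.

    import math

    def cosine_similarity(u, v):
        # cos(u, v) = (u . v) / (|u| * |v|)
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    cats  = [1, 0, 0, 0, 0]
    ships = [0, 1, 0, 0, 0]
    dogs  = [0, 0, 0, 0, 1]

    print(cosine_similarity(cats, dogs))   # 0.0
    print(cosine_similarity(cats, ships))  # 0.0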


  So, One-Hot Encoding is not preferred in practice. There are many other embedding methods, including dense representations (word embedding in the true sense).


