1. Word Embedding


  Because language, which consists of words, prepositions, and so on, is not a computable object, it has to be converted into some computable representation. So, in the data preprocessing stage, input text is converted into vectors, and this is called ‘Word Embedding’. There are many word embedding methods. In this post, ‘One-Hot Encoding’ will be introduced.



2. One-Hot Encoding


  There are two kinds of data: categorical values and continuous values. Contrary to continuous values such as temperature and height, categorical values are discrete values such as classes and words. Generally, we convert a categorical value into some computable value such as an integer. But such an integer is only a label; it is not a numeric value that has magnitude or rank. For example, if two temperatures (continuous values) are similar, it means that the kinetic energies of their molecules are similar, and if one temperature is higher than another, it is hotter.

Index | 0    | 1     | 2      | 3      | … | 1,000,000
Class | Cats | Ships | Papers | Knives | … | Dogs
fig 1. a mapping table of some image classes


  Whereas, in a classification task, the indices of image classes such as Cats and Ships are not allocated based on similarity. If the Cats class is indexed to 0, the task works equally well whether the Dogs class is indexed to 1 or to 1,000,000. Moreover, the distance between indices cannot be used as a measure of class similarity. In fig 1, although cats are more similar to dogs than to ships, the distance between the Cats and Ships indices is 1, less than the distance between the Cats and Dogs indices, 1,000,000.

Class  | Index     | Cats | Ships | Papers | Knives | … | Dogs
Cats   | 0         | 1    | 0     | 0      | 0      | … | 0
Ships  | 1         | 0    | 1     | 0      | 0      | … | 0
Papers | 2         | 0    | 0     | 1      | 0      | … | 0
Knives | 3         | 0    | 0     | 0      | 1      | … | 0
Dogs   | 1,000,000 | 0    | 0     | 0      | 0      | … | 1
fig 2. One-Hot vectors by One-Hot Encoding


  One-Hot Encoding is a vector representation in which each dimension is mapped to one categorical value (a word or a class). Each value is converted to a One-Hot vector: the element at the dimension mapped to that value is 1, and every other element is 0. For example, in fig 2, the index of the Papers class is 2, which means that dimension 2 of every One-Hot vector is mapped to the Papers class, so the One-Hot vector of the Papers class is (0, 0, 1, 0, …). A One-Hot vector is a sparse vector, because almost all of its elements are 0 (One-Hot Encoding is a sparse representation). And the cosine similarity between any two distinct One-Hot vectors is 0, because they are orthogonal.
  In NLP, words or tokens are converted to One-Hot vectors.
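  To make this concrete, here is a minimal sketch in Python (not from the original post; the vocabulary and function names are illustrative assumptions) that builds One-Hot vectors for a small vocabulary:

    # A minimal sketch of One-Hot Encoding.
    # The vocabulary below is an illustrative assumption.
    def build_index(vocab):
        # Map each token/class to a unique integer index.
        return {token: i for i, token in enumerate(vocab)}

    def one_hot(token, index):
        # 1 at the token's own dimension, 0 everywhere else.
        vec = [0] * len(index)
        vec[index[token]] = 1
        return vec

    vocab = ["cats", "ships", "papers", "knives", "dogs"]
    index = build_index(vocab)
    print(one_hot("papers", index))  # [0, 0, 1, 0, 0]: 1 at index 2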



3. Drawbacks of One-Hot Encoding


  1. Computing a meaningful distance between One-Hot vectors is impossible: every pair of distinct vectors is equally far apart. However, there is semantic similarity (distance) between some words of a language; for example, cats are more similar to dogs than to ships semantically, but One-Hot vectors cannot express this (see the sketch after this list).
  2. We need a very large memory space to store One-Hot vectors. If there is a dictionary whose size is N, each vector has N dimensions, so storing one vector per word requires O(N²) space.
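  The first drawback can be checked directly. Below is a small Python sketch (an illustration using the same toy vocabulary as above, not from the original post): the cosine similarity between any two distinct One-Hot vectors is 0, so ‘cats’ is exactly as far from ‘dogs’ as from ‘ships’.

    import math

    def cosine_similarity(u, v):
        # cos(u, v) = (u . v) / (|u| * |v|)
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    cats  = [1, 0, 0, 0, 0]
    ships = [0, 1, 0, 0, 0]
    dogs  = [0, 0, 0, 0, 1]

    print(cosine_similarity(cats, dogs))   # 0.0
    print(cosine_similarity(cats, ships))  # 0.0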


  So, One-Hot Encoding is not preferred in practice. There are many other embedding methods, including dense representations (word embedding in the true sense).


