CS 224D: Deep Learning for NLP
Course Instructor: Richard Socher
Lecture Notes: Part I
Authors: Francois Chaubard, Rohit Mundra, Richard Socher
Spring 2015

Keyphrases: Natural Language Processing. Word Vectors. Singular Value Decomposition. Skip-gram. Continuous Bag of Words (CBOW). Negative Sampling.

This set of notes begins by introducing the concept of Natural Language Processing (NLP) and the problems NLP faces today. We then move forward to discuss the concept of representing words as numeric vectors. Lastly, we discuss popular approaches to designing word vectors.
1 Introduction to Natural Language Processing

We begin with a general discussion of what NLP is. The goal of NLP is to be able to design algorithms to allow computers to understand natural language in order to perform some task. Example tasks come in varying levels of difficulty:

Easy
- Spell Checking
- Keyword Search
- Finding Synonyms

Medium
- Parsing information from websites, documents, etc.

Hard
- Machine Translation (e.g. translate Chinese text to English)
- Semantic Analysis (What is the meaning of a query statement?)
- Coreference (e.g. What does "he" or "it" refer to given a document?)
- Question Answering (e.g. answering Jeopardy questions)

The first and arguably most important common denominator across all NLP tasks is how we represent words as input to any and all of our models. Much of the earlier NLP work that we will not cover treats words as atomic symbols. To perform well on most NLP tasks we first need to have some notion of similarity and difference between words.
With word vectors, we can quite easily encode this ability in the vectors themselves (using distance measures such as Jaccard, Cosine, Euclidean, etc.).

2 Word Vectors

There are an estimated 13 million tokens for the English language, but are they all completely unrelated? Feline to cat, hotel to motel? I think not. Thus, we want to encode word tokens each into some vector that represents a point in some sort of word space. This is paramount for a number of reasons, but the most intuitive reason is that perhaps there actually exists some N-dimensional space (such that $N \ll 13$ million) that is sufficient to encode all semantics of our language. Each dimension would encode some meaning that we transfer using speech. For instance, semantic dimensions might indicate tense (past vs. present vs. future), count (singular vs. plural), and gender (masculine vs. feminine).
So let's dive into our first word vector and arguably the simplest, the one-hot vector: represent every word as an $\mathbb{R}^{|V| \times 1}$ vector with all 0s and one 1 at the index of that word in the sorted English language. In this notation, $|V|$ is the size of our vocabulary. Word vectors in this type of encoding would appear as the following:

$$w^{aardvark} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad w^{a} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad w^{at} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}, \quad \cdots, \quad w^{zebra} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}$$

Fun fact: the term "one-hot" comes from digital circuit design, meaning a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0).

We represent each word as a completely independent entity. As we previously discussed, this word representation does not directly give us any notion of similarity. For instance,

$$(w^{hotel})^T w^{motel} = (w^{hotel})^T w^{cat} = 0$$
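To make this orthogonality concrete, here is a minimal NumPy sketch (not part of the original notes). The toy vocabulary, the word_to_index mapping, and the one_hot helper are hypothetical illustrations, assuming a small sorted vocabulary in place of the full ~13 million tokens.

```python
import numpy as np

# Hypothetical toy vocabulary, sorted as the notes assume; |V| = 4 instead of ~13 million.
vocab = ["aardvark", "cat", "hotel", "motel"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the R^{|V| x 1} one-hot column vector for `word`."""
    v = np.zeros((len(vocab), 1))
    v[word_to_index[word]] = 1.0
    return v

w_hotel, w_motel, w_cat = one_hot("hotel"), one_hot("motel"), one_hot("cat")

# Distinct one-hot vectors are orthogonal, so their dot products encode no similarity.
print((w_hotel.T @ w_motel).item())  # 0.0
print((w_hotel.T @ w_cat).item())    # 0.0
```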
So maybe we can try to reduce the size of this space from $\mathbb{R}^{|V|}$ to something smaller and thus find a subspace that encodes the relationships between words.

3 SVD Based Methods

For this class of methods to find word embeddings (otherwise known as word vectors), we first loop over a massive dataset and accumulate word co-occurrence counts in some form of a matrix X, and then perform Singular Value Decomposition on X to get a $USV^T$ decomposition. We then use the rows of U as the word embeddings for all words in our dictionary. Let us discuss a few choices of X.

3.1 Word-Document Matrix

As our first attempt, we make the bold conjecture that words that are related will often appear in the same documents. For instance, "banks", "bonds", "stocks", "money", etc. are probably likely to appear together. But "banks", "octopus", "banana", and "hockey" would probably not consistently appear together. We use this fact to build a word-document matrix X in the following manner: loop over billions of documents and, for each time word i appears in document j, we add one to entry $X_{ij}$. This is obviously a very large matrix ($\mathbb{R}^{|V| \times M}$) and it scales with the number of documents (M).
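As an illustration of this construction (not from the notes), the sketch below builds a small word-document matrix with NumPy. The three-document corpus and the names documents, word_to_index, and X are hypothetical stand-ins for the "billions of documents" case.

```python
import numpy as np

# Hypothetical toy corpus standing in for "billions of documents".
documents = [
    "banks bonds stocks money",
    "banks money bonds",
    "octopus banana hockey",
]

# Vocabulary index gives word i; position in `documents` gives document j.
vocab = sorted({w for doc in documents for w in doc.split()})
word_to_index = {w: i for i, w in enumerate(vocab)}

# X is |V| x M: add one to X[i, j] each time word i appears in document j.
X = np.zeros((len(vocab), len(documents)))
for j, doc in enumerate(documents):
    for w in doc.split():
        X[word_to_index[w], j] += 1

print(vocab)
print(X)
```

So perhaps we can try something better.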
3.2 Word-Word Co-occurrence Matrix

The same kind of logic applies here; however, the matrix X stores co-occurrences of words, thereby becoming an affinity matrix. We display an example below.

Using a Word-Word Co-occurrence Matrix:
- Generate a $|V| \times |V|$ co-occurrence matrix $X$.
- Apply SVD on $X$ to get $X = USV^T$.
- Select the first $k$ columns of $U$ to get $k$-dimensional word vectors.
- $\frac{\sum_{i=1}^{k} \sigma_i}{\sum_{i=1}^{|V|} \sigma_i}$ indicates the amount of variance captured by the first $k$ dimensions.

Let our corpus contain just three sentences (with a co-occurrence window of size 1):

1. I enjoy flying.
2. I like NLP.
3. I like deep learning.

The resulting counts matrix will then be:

X =
             I   like  enjoy  deep  learning  NLP  flying   .
  I          0    2     1      0      0       0     0      0
  like       2    0     0      1      0       1     0      0
  enjoy      1    0     0      0      0       0     1      0
  deep       0    1     0      0      1       0     0      0
  learning   0    0     0      1      0       0     0      1
  NLP        0    1     0      0      0       0     0      1
  flying     0    0     1      0      0       0     0      1
  .          0    0     0      0      1       1     1      0

We now perform SVD on X, observe the singular values (the diagonal entries in the resulting S matrix), and cut them off at some index k based on the desired percentage variance captured:

$$\frac{\sum_{i=1}^{k} \sigma_i}{\sum_{i=1}^{|V|} \sigma_i}$$

We then take the submatrix $U_{1:|V|,\,1:k}$ to be our word embedding matrix. This would thus give us a k-dimensional representation of every word in the vocabulary.

Applying SVD to X:

$$\underset{|V| \times |V|}{X} = \underset{|V| \times |V|}{\begin{bmatrix} | & | & \\ u_1 & u_2 & \cdots \\ | & | & \end{bmatrix}} \; \underset{|V| \times |V|}{\begin{bmatrix} \sigma_1 & 0 & \cdots \\ 0 & \sigma_2 & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}} \; \underset{|V| \times |V|}{\begin{bmatrix} - & v_1 & - \\ - & v_2 & - \\ & \vdots & \end{bmatrix}}$$

Reducing dimensionality by selecting the first k singular vectors:

$$\underset{|V| \times |V|}{\hat{X}} = \underset{|V| \times k}{\begin{bmatrix} | & | & \\ u_1 & u_2 & \cdots \\ | & | & \end{bmatrix}} \; \underset{k \times k}{\begin{bmatrix} \sigma_1 & 0 & \cdots \\ 0 & \sigma_2 & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}} \; \underset{k \times |V|}{\begin{bmatrix} - & v_1 & - \\ - & v_2 & - \\ & \vdots & \end{bmatrix}}$$
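The following NumPy sketch (again, not from the notes) runs this pipeline on the three-sentence corpus above: build the window-1 co-occurrence matrix, apply SVD, pick k from the captured-variance ratio, and keep the first k columns of U. The 90% threshold and the variable names are illustrative assumptions.

```python
import numpy as np

corpus = ["I enjoy flying .", "I like NLP .", "I like deep learning ."]
tokenized = [s.split() for s in corpus]

# Sorted vocabulary; note the row/column order differs from the table above,
# but the counts are the same.
vocab = sorted({w for sent in tokenized for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# Window-1 co-occurrence counts: bump X[i, j] when word j appears next to word i.
X = np.zeros((V, V))
for sent in tokenized:
    for pos, w in enumerate(sent):
        for neighbor in sent[max(0, pos - 1):pos] + sent[pos + 1:pos + 2]:
            X[idx[w], idx[neighbor]] += 1

# Full SVD: X = U S V^T (NumPy returns the singular values as the vector s).
U, s, Vt = np.linalg.svd(X)

# Choose k so the first k singular values capture ~90% of the variance in the
# sense of the notes: sum_{i<=k} sigma_i / sum_i sigma_i >= 0.90.
ratios = np.cumsum(s) / np.sum(s)
k = int(np.searchsorted(ratios, 0.90)) + 1

# k-dimensional word embeddings: the first k columns of U.
embeddings = U[:, :k]
print(vocab)
print(embeddings.shape)  # (|V|, k)
```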
Both of these methods give us word vectors that are more than sufficient to encode semantic and syntactic (part of speech) information but are associated with many other problems:

- The dimensions of the matrix change very often (new words are added very frequently and the corpus changes in size).
- The matrix is extremely sparse since most words do not co-occur.
- The matrix is very high dimensional in general ($\approx 10^6 \times 10^6$).
- Quadratic cost to train (i.e. to perform SVD).
- Requires the incorporation of some hacks on X to account for the drastic imbalance in word frequency.

Some solutions exist to resolve some of the issues discussed above:

- Ignore function words such as "the", "he", "has", etc.
- Apply a ramp window, i.e. weight the co-occurrence count based on distance between the words in the document (a rough sketch follows below).
- Use Pearson correlation and set negative counts to 0 instead of using just the raw count.

As we see in the next section, iteration-based methods solve many of these issues in a far more elegant manner.
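Picking up the ramp-window item from the list above, here is a rough sketch of one way such a weighting could look. The linear decay, the default window size of 5, and the function name ramp_weighted_cooccurrence are all hypothetical choices, not something prescribed by the notes.

```python
import numpy as np

def ramp_weighted_cooccurrence(tokenized_docs, word_to_index, window=5):
    """Hypothetical ramp-window counts: nearby words contribute more than distant ones."""
    V = len(word_to_index)
    X = np.zeros((V, V))
    for sent in tokenized_docs:
        for pos, w in enumerate(sent):
            for offset in range(1, window + 1):
                for other in (pos - offset, pos + offset):
                    if 0 <= other < len(sent):
                        # Weight decays linearly with distance: 1.0 for adjacent words,
                        # down to 1/window at the edge of the window (the "ramp").
                        X[word_to_index[w], word_to_index[sent[other]]] += (window - offset + 1) / window
    return X
```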