    CS224d-Lecture note.pdf

CS 224D: Deep Learning for NLP
Lecture Notes: Part I
Course Instructor: Richard Socher
Authors: Francois Chaubard, Rohit Mundra, Richard Socher
Spring 2015

Keyphrases: Natural Language Processing. Word Vectors. Singular Value Decomposition. Skip-gram. Continuous Bag of Words (CBOW). Negative Sampling.

This set of notes begins by introducing the concept of Natural Language Processing (NLP) and the problems NLP faces today. We then move forward to discuss the concept of representing words as numeric vectors. Lastly, we discuss popular approaches to designing word vectors.

1 Introduction to Natural Language Processing

We begin with a general discussion of what NLP is. The goal of NLP is to design algorithms that allow computers to understand natural language in order to perform some task. Example tasks come in varying levels of difficulty:

Easy
- Spell Checking
- Keyword Search
- Finding Synonyms

Medium
- Parsing information from websites, documents, etc.

Hard
- Machine Translation (e.g. translate Chinese text to English)
- Semantic Analysis (What is the meaning of a query statement?)
- Coreference (e.g. What does "he" or "it" refer to given a document?)
- Question Answering (e.g. answering Jeopardy questions)

The first and arguably most important common denominator across all NLP tasks is how we represent words as input to any and all of our models. Much of the earlier NLP work, which we will not cover, treats words as atomic symbols. To perform well on most NLP tasks we first need some notion of similarity and difference between words. With word vectors, we can quite easily encode this ability in the vectors themselves (using distance measures such as Jaccard, cosine, Euclidean, etc.).

2 Word Vectors

There are an estimated 13 million tokens in the English language, but are they all completely unrelated? "Feline" to "cat", "hotel" to "motel"? I think not. Thus, we want to encode each word token into some vector that represents a point in some sort of word space. This is paramount for a number of reasons, but the most intuitive reason is that perhaps there actually exists some N-dimensional space (with N ≪ 13 million) that is sufficient to encode all semantics of our language. Each dimension would encode some meaning that we transfer using speech. For instance, semantic dimensions might indicate tense (past vs. present vs. future), count (singular vs. plural), and gender (masculine vs. feminine).

So let's dive into our first word vector, and arguably the most simple: the one-hot vector. Represent every word as an R^{|V|×1} vector with all 0s and one 1 at the index of that word in the sorted English language. In this notation, |V| is the size of our vocabulary. Word vectors in this type of encoding would appear as the following:

    w^{aardvark} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad
    w^{a} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad
    w^{at} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}, \quad
    \cdots, \quad
    w^{zebra} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}

Fun fact: the term "one-hot" comes from digital circuit design, meaning a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0).

We represent each word as a completely independent entity. As we previously discussed, this word representation does not directly give us any notion of similarity. For instance,

    (w^{hotel})^\top w^{motel} = (w^{hotel})^\top w^{cat} = 0

So maybe we can try to reduce the size of this space from R^{|V|} to something smaller and thus find a subspace that encodes the relationships between words.
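To make the one-hot encoding and its lack of similarity structure concrete, here is a minimal numpy sketch over a tiny hypothetical vocabulary; the vocabulary, its ordering, and the helper name are illustrative assumptions, not part of the notes.

```python
import numpy as np

# Toy sorted vocabulary; a real |V| would be in the millions.
vocab = ["a", "aardvark", "at", "cat", "hotel", "motel", "zebra"]
V = len(vocab)

def one_hot(word):
    """Return the R^{|V| x 1} one-hot column vector for `word`."""
    vec = np.zeros((V, 1))
    vec[vocab.index(word)] = 1.0
    return vec

w_hotel, w_motel, w_cat = one_hot("hotel"), one_hot("motel"), one_hot("cat")

# Distinct one-hot vectors are orthogonal, so the dot product encodes
# no similarity: "hotel" is exactly as close to "motel" as to "cat".
print(float(w_hotel.T @ w_motel))  # 0.0
print(float(w_hotel.T @ w_cat))    # 0.0
```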
3 SVD Based Methods

For this class of methods to find word embeddings (otherwise known as word vectors), we first loop over a massive dataset and accumulate word co-occurrence counts in some form of a matrix X, and then perform Singular Value Decomposition on X to get a USV^T decomposition. We then use the rows of U as the word embeddings for all words in our dictionary. Let us discuss a few choices of X.

3.1 Word-Document Matrix

As our first attempt, we make the bold conjecture that words that are related will often appear in the same documents. For instance, "banks", "bonds", "stocks", "money", etc. are probably likely to appear together. But "banks", "octopus", "banana", and "hockey" would probably not consistently appear together. We use this fact to build a word-document matrix X in the following manner: loop over billions of documents and, each time word i appears in document j, add one to entry X_{ij}. This is obviously a very large matrix (R^{|V|×M}) and it scales with the number of documents (M). So perhaps we can try something better.

3.2 Word-Word Co-occurrence Matrix

The same kind of logic applies here; however, the matrix X stores co-occurrences of words, thereby becoming an affinity matrix. We display an example below. Let our corpus contain just three sentences:

1. I enjoy flying.
2. I like NLP.
3. I like deep learning.

The resulting counts matrix will then be:

    X =
                 I  like  enjoy  deep  learning  NLP  flying   .
      I          0    2     1      0       0      0     0      0
      like       2    0     0      1       0      1     0      0
      enjoy      1    0     0      0       0      0     1      0
      deep       0    1     0      0       1      0     0      0
      learning   0    0     0      1       0      0     0      1
      NLP        0    1     0      0       0      0     0      1
      flying     0    0     1      0       0      0     0      1
      .          0    0     0      0       1      1     1      0

We now perform SVD on X, observe the singular values (the diagonal entries in the resulting S matrix), and cut them off at some index k based on the desired percentage of variance captured:

    \frac{\sum_{i=1}^{k} \sigma_i}{\sum_{i=1}^{|V|} \sigma_i}

We then take the submatrix U_{1:|V|, 1:k} to be our word embedding matrix. This gives us a k-dimensional representation of every word in the vocabulary.

Applying SVD to X:

    X = \begin{bmatrix} u_1 & u_2 & \cdots \end{bmatrix}
        \begin{bmatrix} \sigma_1 & 0 & \cdots \\ 0 & \sigma_2 & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}
        \begin{bmatrix} v_1 \\ v_2 \\ \vdots \end{bmatrix}

where the u_i are the columns of U, the σ_i are the singular values, and the v_i are the rows of V^T.

Reducing dimensionality by selecting the first k singular vectors:

    \hat{X} = \begin{bmatrix} u_1 & \cdots & u_k \end{bmatrix}
              \begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_k \end{bmatrix}
              \begin{bmatrix} v_1 \\ \vdots \\ v_k \end{bmatrix}
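As a concrete illustration of this recipe, here is a small numpy sketch that applies SVD to the example co-occurrence matrix above and keeps the first k singular vectors; the choice k = 2 and the printing code are illustrative assumptions only.

```python
import numpy as np

words = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
X = np.array([
    [0, 2, 1, 0, 0, 0, 0, 0],
    [2, 0, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
], dtype=float)

U, S, Vt = np.linalg.svd(X)        # X = U S V^T

k = 2                              # dimensions to keep (illustrative)
captured = S[:k].sum() / S.sum()   # fraction of singular-value mass kept
embeddings = U[:, :k]              # k-dimensional word vectors, one row per word

for word, vec in zip(words, embeddings):
    print(f"{word:>8}: {np.round(vec, 3)}")
print(f"variance captured by first {k} dimensions: {captured:.2f}")
```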
Both of these methods give us word vectors that are more than sufficient to encode semantic and syntactic (part of speech) information, but they are associated with many other problems:

- The dimensions of the matrix change very often (new words are added very frequently and the corpus changes in size).
- The matrix is extremely sparse since most words do not co-occur.
- The matrix is very high dimensional in general (roughly 10^6 × 10^6).
- It has quadratic cost to train (i.e. to perform SVD).
- It requires the incorporation of some hacks on X to account for the drastic imbalance in word frequency.

Some solutions exist to resolve some of the issues discussed above:

- Ignore function words such as "the", "he", "has", etc.
- Apply a ramp window, i.e. weight the co-occurrence count based on the distance between the words in the document.
- Use Pearson correlation and set negative counts to 0 instead of using just the raw count.

As we see in the next section, iteration based methods solve many of these issues in a far more elegant manner.

4 Iteration Based Methods

Let us step back and try a new approach. Instead of computing and storing global information about some huge dataset (which might be billions of sentences), we can try to create a model that learns one iteration at a time and eventually is able to encode the probability of a word given its context. (Context of a word: the context of a word is the set of C surrounding words. For instance, the C = 2 context of the word "fox" in the sentence "The quick brown fox jumped over the lazy dog" is {"quick", "brown", "jumped", "over"}.)

We can set up this probabilistic model of known and unknown parameters and take one training example at a time in order to learn just a little bit of information for the unknown parameters based on the input, the output of the model, and the desired output of the model. At every iteration we run our model, evaluate the errors, and follow an update rule that has some notion of penalizing the model parameters that caused the error. This idea is a very old one, dating back to 1986. We call this method backpropagating the errors (see "Learning representations by back-propagating errors", David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams (1988)).

4.1 Language Models (Unigrams, Bigrams, etc.)

First, we need to create such a model that will assign a probability to a sequence of tokens. Let us start with an example: "The cat jumped over the puddle." A good language model will give this sentence a high probability because it is a completely valid sentence, syntactically and semantically. Similarly, the sentence "stock boil fish is toy" should have a very low probability because it makes no sense. Mathematically, we can call this probability on any given sequence of n words:

    P(w^{(1)}, w^{(2)}, \dots, w^{(n)})

We can take the unary (unigram) language model approach and break apart this probability by assuming the word occurrences are completely independent:

    P(w^{(1)}, w^{(2)}, \dots, w^{(n)}) = \prod_{i=1}^{n} P(w^{(i)})

However, we know this is a bit ludicrous because the next word is highly contingent upon the previous sequence of words, and the silly sentence example might actually score highly. So perhaps we let the probability of the sequence depend on the pairwise probability of a word in the sequence and the word next to it. We call this the bigram model and represent it as:

    P(w^{(1)}, w^{(2)}, \dots, w^{(n)}) = \prod_{i=2}^{n} P(w^{(i)} \mid w^{(i-1)})

Again, this is certainly a bit naive since we are only concerning ourselves with pairs of neighboring words rather than evaluating a whole sentence, but as we will see, this representation gets us pretty far along. Note that with the word-word matrix with a context of size 1, we can basically learn these pairwise probabilities. But again, this would require computing and storing global information about a massive dataset.

Now that we understand how we can think about a sequence of tokens having a probability, let us observe some example models that could learn these probabilities.
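Before moving on, here is a minimal sketch of how unigram and bigram sentence probabilities can be estimated from raw counts; the two-sentence toy corpus and the unsmoothed maximum-likelihood estimates are illustrative assumptions, not part of the notes.

```python
from collections import Counter

# Toy corpus (illustrative); real language models are trained on far more data.
corpus = [
    "the cat jumped over the puddle".split(),
    "the dog jumped over the cat".split(),
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(
    (sent[i - 1], sent[i]) for sent in corpus for i in range(1, len(sent))
)
total_tokens = sum(unigram_counts.values())

def unigram_prob(sentence):
    """P(w1, ..., wn) = prod_i P(wi), treating words as independent."""
    p = 1.0
    for w in sentence:
        p *= unigram_counts[w] / total_tokens
    return p

def bigram_prob(sentence):
    """P(w1, ..., wn) = prod_{i>=2} P(wi | w_{i-1}), estimated from counts."""
    p = 1.0
    for i in range(1, len(sentence)):
        prev, cur = sentence[i - 1], sentence[i]
        p *= bigram_counts[(prev, cur)] / unigram_counts[prev]
    return p

sentence = "the cat jumped over the puddle".split()
print(unigram_prob(sentence))  # small: ignores word order entirely
print(bigram_prob(sentence))   # larger: rewards plausible neighboring pairs
```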
4.2 Continuous Bag of Words Model (CBOW)

One approach is to treat {"The", "cat", "over", "the", "puddle"} as a context and from these words be able to predict or generate the center word "jumped". We call this type of model a Continuous Bag of Words (CBOW) Model: predicting a center word from the surrounding context.

Let's discuss the CBOW model in greater detail. First, we set up our known parameters. Let the known parameters in our model be the sentence represented by one-hot word vectors. We represent the input one-hot vectors, i.e. the context, by x^{(i)}, and the output by y^{(i)}. In the CBOW model, since we only have one output, we just call this y: the one-hot vector of the known center word. Now let's define the unknowns in our model.

Notation for the CBOW model:
- w^{(i)}: word i from vocabulary V
- W^{(1)} ∈ R^{n×|V|}: input word matrix
- u^{(i)}: i-th column of W^{(1)}, the input vector representation of word w^{(i)}
- W^{(2)} ∈ R^{|V|×n}: output word matrix
- v^{(i)}: i-th row of W^{(2)}, the output vector representation of word w^{(i)}

We create two matrices, W^{(1)} ∈ R^{n×|V|} and W^{(2)} ∈ R^{|V|×n}, where n is an arbitrary size which defines the size of our embedding space. W^{(1)} is the input word matrix, such that the i-th column of W^{(1)} is the n-dimensional embedded vector for word w^{(i)} when it is an input to this model. We denote this n×1 vector as u^{(i)}. Similarly, W^{(2)} is the output word matrix. The j-th row of W^{(2)} is an n-dimensional embedded vector for word w^{(j)} when it is an output of the model. We denote this row of W^{(2)} as v^{(j)}. Note that we do in fact learn two vectors for every word w^{(i)} (i.e. the input word vector u^{(i)} and the output word vector v^{(i)}).

We break down the way this model works in these steps:

1. We generate our one-hot word vectors (x^{(i-C)}, ..., x^{(i-1)}, x^{(i+1)}, ..., x^{(i+C)}) for the input context of size C.
2. We get our embedded word vectors for the context: u^{(i-C)} = W^{(1)} x^{(i-C)}, u^{(i-C+1)} = W^{(1)} x^{(i-C+1)}, ..., u^{(i+C)} = W^{(1)} x^{(i+C)}.
3. We average these vectors to get

       h = \frac{u^{(i-C)} + u^{(i-C+1)} + \dots + u^{(i+C)}}{2C}

4. We generate a score vector z = W^{(2)} h.
5. We turn the scores into probabilities ŷ = softmax(z).
6. We desire the probabilities we generate, ŷ, to match the true probabilities y, which also happens to be the one-hot vector of the actual word.

[Figure 1: This image demonstrates how CBOW works and how we must learn the transfer matrices.]

So now that we have an understanding of how our model would work if we had W^{(1)} and W^{(2)}, how would we learn these two matrices? Well, we need to create an objective function. Very often when we are trying to learn a probability from some true probability, we look to information theory to give us a measure of the distance between two distributions. Here, we use a popular choice of distance/loss measure, the cross entropy H(ŷ, y). The intuition for the use of cross entropy in the discrete case can be derived from the formulation of the loss function:

    H(\hat{y}, y) = -\sum_{j=1}^{|V|} y_j \log(\hat{y}_j)

Let us concern ourselves with the case at hand, which is that y is a one-hot vector. Thus we know that the above loss simplifies to simply:

    H(\hat{y}, y) = -y_i \log(\hat{y}_i)

In this formulation, i is the index where the correct word's one-hot vector is 1. We can now consider the case where our prediction was perfect and thus ŷ_i = 1. We can then calculate H(ŷ, y) = −1 · log(1) = 0. Thus, for a perfect prediction, we face no penalty or loss. Now let us consider the opposite case where our prediction was very bad and thus ŷ_i = 0.01. As before, we can calculate our loss to be H(ŷ, y) = −1 · log(0.01) ≈ 4.605. We can thus see that for probability distributions, cross entropy provides us with a good measure of distance. We thus formulate our optimization objective as:

    \text{minimize } J = -\log P(w^{(i)} \mid w^{(i-C)}, \dots, w^{(i-1)}, w^{(i+1)}, \dots, w^{(i+C)})
                       = -\log P(v^{(i)} \mid h)
                       = -\log \frac{\exp(v^{(i)\top} h)}{\sum_{j=1}^{|V|} \exp(v^{(j)\top} h)}
                       = -v^{(i)\top} h + \log \sum_{j=1}^{|V|} \exp(v^{(j)\top} h)

We then use gradient descent to update all relevant word vectors v^{(i)} and u^{(j)}, calculating the gradients in the following manner: [To be added after Assignment 1 is graded.]
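To tie the six steps and the cross-entropy objective together, here is a minimal numpy sketch of one CBOW forward pass and its loss; the tiny vocabulary, embedding size, random initialization, and variable names are illustrative assumptions, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "cat", "jumped", "over", "puddle"]   # toy vocabulary
V, n, C = len(vocab), 3, 2                           # |V|, embedding size, context size

W1 = rng.normal(size=(n, V))   # input word matrix,  columns are u^(i)
W2 = rng.normal(size=(V, n))   # output word matrix, rows are v^(j)

def one_hot(idx):
    x = np.zeros(V)
    x[idx] = 1.0
    return x

def softmax(z):
    e = np.exp(z - z.max())    # shift for numerical stability
    return e / e.sum()

# C = 2 context of "jumped" in "The cat jumped over the puddle".
context_idx = [vocab.index(w) for w in ["the", "cat", "over", "the"]]
center_idx = vocab.index("jumped")

# Steps 1-2: one-hot context vectors and their input embeddings u = W1 x.
U = np.stack([W1 @ one_hot(i) for i in context_idx])

h = U.sum(axis=0) / (2 * C)    # step 3: average the context embeddings
z = W2 @ h                     # step 4: score vector
y_hat = softmax(z)             # step 5: probabilities over the vocabulary

# Step 6 / objective: cross entropy against the one-hot center word,
# i.e. J = -log(y_hat[center]).
J = -np.log(y_hat[center_idx])
print(y_hat, J)
```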
4.3 Skip-Gram Model

Another approach is to create a model such that, given the center word "jumped", it will be able to predict or generate the surrounding words "The", "cat", "over", "the", "puddle": predicting the surrounding context words given a center word. Here we call the word "jumped" the context. We call this type of model a Skip-Gram model.
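The preview breaks off before the Skip-Gram notation is given, so the following is only a rough sketch under the assumption that the two matrices play mirrored roles to CBOW (the center word is embedded through the input matrix and scores over the vocabulary come from the output matrix); these details are not confirmed by the text above.

```python
import numpy as np

rng = np.random.default_rng(1)

vocab = ["the", "cat", "jumped", "over", "puddle"]   # toy vocabulary (illustrative)
V, n = len(vocab), 3

W1 = rng.normal(size=(n, V))   # assumed input word matrix (embeds the center word)
W2 = rng.normal(size=(V, n))   # assumed output word matrix (scores every vocab word)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

center = vocab.index("jumped")
context = [vocab.index(w) for w in ["the", "cat", "over", "the"]]

h = W1[:, center]              # embed the center word
y_hat = softmax(W2 @ h)        # one distribution over the vocabulary

# Sum the negative log-probabilities of the observed context words
# (assuming each context position is predicted from the same distribution).
J = -sum(np.log(y_hat[c]) for c in context)
print(y_hat, J)
```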

    展开