Reading guide: This article is excerpted from the book Natural Language Processing in Action: Understanding, Analyzing, and Generating Text with Python, by Hobson Lane et al.
The book is a practical guide to natural language processing (NLP) and deep learning. NLP has become a core application area for deep learning, and deep learning is an essential tool in NLP research and applications. Written for intermediate to advanced Python developers, it covers both the underlying theory and hands-on programming, making it a practical reference for practitioners of modern NLP.
To learn more about natural language processing, follow AI Technology Base and share your thoughts on NLP in the comments. We will select 10 high-quality comments and send each of their authors a copy of Natural Language Processing in Action: Understanding, Analyzing, and Generating Text with Python. The giveaway closes at 8 p.m. on Friday, November 15.
In 1950, Alan Turing published a paper titled "Computing Machinery and Intelligence", which proposed the famous Turing test. The test uses the automatic interpretation and generation of natural language as a criterion for judging intelligence, and it marks the beginning of natural language processing (NLP) as a field.
Natural language processing is an area of research in computer science and artificial intelligence (AI) concerned with processing natural languages such as English or Mandarin. This processing generally involves translating natural language into data (numbers) that a computer can use to learn about the world. That understanding of the world is sometimes used in turn to generate natural language text that reflects it (natural language generation).
Language was invented to facilitate communication, and it is the foundation on which humans build consensus. Today, programmers working on natural language processing share one goal: enabling computers to understand human language as well.
The charm of NLP: creating machines that communicate
Machines have been processing languages since computers were invented. However, these "formal" languages, such as the early languages Ada, COBOL, and Fortran, were designed to be interpreted (or compiled) in only one correct way.
Wikipedia currently lists more than 700 programming languages. By comparison, Ethnologue has identified 10 times as many natural languages spoken around the world today. Google's index of natural language documents is well over 100 million gigabytes, and that is just the index. The actual volume of natural language content online must exceed 100 billion gigabytes, and even these documents do not cover the entire internet.
The word "natural" in "natural language" means the same thing as in "natural world". Natural, evolved things in the world are different from the mechanical, artificial things that humans design and build. Being able to design and build software that can read and process the language you are reading right now, language about how to build software that processes natural language, is very advanced, and rather magical.
At first, it took a bit of skill to get Google to find what we were looking for, but it quickly grew smarter and began accepting more and more natural word searches. Then text completion on smartphones became more advanced, and the middle button now usually offers the word we want. This is the charm of natural language processing: getting machines to understand what we mean.
A growing share of entertainment, advertising, and financial reporting content can be generated without a human lifting a finger. NLP bots can write entire movie scripts. Video games and virtual worlds often contain bots that talk with us, sometimes even about bots and AI themselves. This "play within a play" will yield even more meta content about movies, and bots in the real world will then write reviews based on it to help you decide which movie to watch.
As NLP technology advances, information flow and computing power keep growing as well. Today, typing just a few characters into a search box is often enough to retrieve the exact information needed to complete a task. The first few autocomplete suggestions are usually so apt that it feels as if a person were helping us search.
A few NLP basics to get started
1. Regular expressions
Regular expressions use a special class of formal language grammar called a regular grammar. The behavior of a regular grammar is predictable and provable, yet it is flexible enough to power some of the most sophisticated dialog engines and chatbots on the market. Amazon Alexa and Google Now are mostly pattern-based dialog engines that rely on regular grammars. Deep, complex regular grammar rules can often be expressed in a single line of code called a regular expression. Python has some successful chatbot frameworks, such as Will, that rely exclusively on this kind of language to produce useful and interesting behavior. Amazon Echo, Google Home, and similarly complex and useful assistants also use this kind of language to encode the logic for most of their user interactions.
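To make the idea of a pattern-based dialog rule concrete, here is a minimal sketch using Python's standard `re` module. The pattern, the `reply` function, and the canned responses are all made up for illustration; they are not taken from Will, Alexa, or any other framework mentioned above.

```python
import re

# Hypothetical greeting pattern: a salutation, then an optional name.
# The \b keeps "hi" from matching the start of a word such as "history".
GREETING = re.compile(
    r"^\s*(hi|hello|hey|good\s+(morning|afternoon|evening))\b"
    r"[\s,!.]*(?P<name>[a-z]+)?[\s!.]*$",
    re.IGNORECASE,
)


def reply(statement: str) -> str:
    """Return a canned greeting if the statement matches, else a fallback."""
    match = GREETING.match(statement)
    if match is None:
        return "Sorry, I did not understand that."
    name = match.group("name")  # None when no name was given
    return f"Hello, {name.title()}!" if name else "Hello!"


print(reply("Good morning Rosa!"))
print(reply("What time is it?"))
```

A single expression like this already covers several phrasings of a greeting, which is why one line of regular grammar can stand in for a surprising amount of dialog logic.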
2. Word order and grammar
Word order matters. The rules that govern word order in a sequence of words, such as a sentence, are called the grammar of a language. This is the information that the earlier bag-of-words and word vector examples discard. Fortunately, those approaches work well for most short phrases and even many complete sentences. If you only want to encode the general sense and sentiment of a short sentence, word order is not terribly important. Take a look at all the orderings of the words in the "Good morning Rosa" example:
>>> from itertools import permutations
>>> [" ".join(combo) for combo in\
...     permutations("Good morning Rosa!".split(), 3)]
['Good morning Rosa!',
 'Good Rosa! morning',
 'morning Good Rosa!',
 'morning Rosa! Good',
 'Rosa! Good morning',
 'Rosa! morning Good']
Now, if you tried to interpret each of these strings in isolation, without looking at the others, you would probably conclude that they all have similar intent or meaning. You might even notice the capitalization of the word "Good" and place it at the front of the phrase in your mind. But you might also think that "Good Rosa" is some sort of proper noun, such as the name of a restaurant or a flower shop. Nonetheless, a smart chatbot, or a clever woman of the 1940s at Bletchley Park, would likely respond to any of these six permutations with the same innocuous greeting: "Good morning my dear General."
Let's try the same thing, in our heads, on a longer and more complex phrase: a logical statement in which word order matters a great deal:
>>> s = """Find textbooks with titles containing 'NLP', or 'natural' and 'language', or 'computational' and 'linguistics'."""
>>> len(set(s.split()))
12
>>> import numpy as np
>>> np.arange(1, 12 + 1).prod()  # factorial(12) = arange(1, 13).prod()
479001600
The number of word permutations explodes from factorial(3) == 6 for the simple greeting to factorial(12) == 479001600 for the longer statement.
Clearly, the logic carried by word order matters to any machine that is expected to reply correctly. Although ordinary greetings are not usually garbled by bag-of-words processing, more complex statements lose most of their meaning when thrown into a bag. A bag of words is not the best way to process a database query, such as the natural language query in the preceding example.
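The loss of word order can be checked directly. The sketch below uses only the standard library; the `bag_of_words` helper and the two shortened queries are made up for illustration. It shows that every ordering of the greeting collapses to the same bag, and that two queries with different logic become indistinguishable as well.

```python
from collections import Counter
from itertools import permutations


def bag_of_words(text: str) -> Counter:
    """Count lowercase whitespace-separated tokens, ignoring their order."""
    return Counter(text.lower().split())


greeting = "Good morning Rosa!"
orderings = [" ".join(p) for p in permutations(greeting.split())]
bags = [bag_of_words(o) for o in orderings]
all_same = all(bag == bags[0] for bag in bags)

# Two queries with different logical structure but identical word counts.
query_a = "natural and language or computational"
query_b = "natural or language and computational"
same_bag = bag_of_words(query_a) == bag_of_words(query_b)

print(len(orderings), all_same, same_bag)
```

For the greeting this is harmless, but for the query it means the bag can no longer tell which terms are joined by "and" and which by "or".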
3. Word vectors
In 2012, Tomas Mikolov, then an intern at Microsoft, found a way to encode the meaning of words in vectors of modest dimensionality. Mikolov trained a neural network to predict the words occurring near each target word. In 2013, Mikolov and his teammates at Google released software for creating these word vectors and called it Word2vec.
Word2vec learns the meaning of words purely by processing a large corpus of unlabeled text. No one has to label the words in the Word2vec vocabulary. No one has to tell the Word2vec algorithm that Marie Curie was a scientist, that the Timbers are a soccer team, that Seattle is a city, or that Portland is a city in both Oregon and Maine. Nor does anyone have to tell it that soccer is a sport, that a team is a group of people, or that a city is both a place and a community. Word2vec can learn all this and more on its own. All you need is a corpus large enough to mention Marie Curie, the Timbers, and Portland near other words associated with science, soccer, or cities.
It is this unsupervised nature that makes Word2vec so powerful, because the world is full of unlabeled, uncategorized, unstructured natural language text.
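The phrase "predict the words occurring near each target word" can be illustrated without any neural network. The sketch below only shows how (target, nearby word) training pairs might be generated from unlabeled text. The function name `skipgram_pairs` and the window size are assumptions made for this example, and this is not the actual Word2vec implementation.

```python
def skipgram_pairs(tokens, window=2):
    """Pair each token with the tokens at most `window` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


tokens = "marie curie was a scientist".split()
pairs = skipgram_pairs(tokens, window=1)
for target, context in pairs:
    print(target, "->", context)
```

No labels are involved: the pairs come entirely from which words happen to sit next to each other, which is what makes the approach unsupervised.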
4. Word2vec and GloVe
Word2vec was a major breakthrough, but it relies on a neural network model that must be trained with backpropagation. Backpropagation is usually less efficient than directly optimizing a cost function with gradient descent. A team of Stanford NLP researchers led by Jeffrey Pennington studied why Word2vec works and looked for a cost function that could be optimized directly. They started by counting word co-occurrences and recording them in a square matrix. They found that they could compute the singular value decomposition (SVD) of this co-occurrence matrix, and that the two weight matrices obtained from the decomposition carry the same meaning as those Word2vec produces. The key was to normalize the co-occurrence matrix in the same way. In some cases the Word2vec model fails to converge, whereas the Stanford researchers were able to reach the global optimum with their SVD approach. Because the method directly optimizes the global vectors of word co-occurrences (co-occurrences across the entire corpus), it is called GloVe (global vectors of word co-occurrences).
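The first step described above, counting co-occurrences into a square matrix, can be sketched with the standard library alone. The `cooccurrence_matrix` helper, the toy corpus, and the window size are assumptions made for illustration. The normalization and decomposition steps are not shown; a factorization could be applied to the resulting matrix with a routine such as `numpy.linalg.svd`.

```python
from collections import Counter


def cooccurrence_matrix(sentences, window=2):
    """Count how often each pair of words appears within `window` tokens."""
    vocab = sorted({word for sent in sentences for word in sent})
    index = {word: i for i, word in enumerate(vocab)}
    counts = Counter()
    for sent in sentences:
        for i, word in enumerate(sent):
            start = max(0, i - window)
            stop = min(len(sent), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(index[word], index[sent[j]])] += 1
    size = len(vocab)
    matrix = [[counts[(r, c)] for c in range(size)] for r in range(size)]
    return vocab, matrix


corpus = ["portland is a city".split(), "seattle is a city".split()]
vocab, matrix = cooccurrence_matrix(corpus, window=1)
print(vocab)
for row in matrix:
    print(row)
```

The matrix is square and symmetric, with one row and one column per vocabulary word, and it summarizes the whole corpus at once rather than one training example at a time.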
GloVe can produce matrices equivalent to the input and output weight matrices of Word2vec, yielding a language model with the same accuracy as Word2vec in much less time. GloVe speeds up training by using the data more efficiently. It can be trained on smaller corpora and still converge. SVD algorithms have been refined for decades, so GloVe has a head start on debugging and algorithm optimization. Word2vec, by contrast, relies on backpropagation to update the weights that form the word embeddings, and neural network backpropagation is less efficient than the more mature optimization algorithms, such as SVD, that GloVe uses.
Even though Word2vec was the first to popularize semantic reasoning with word vectors, you should generally try GloVe first when training a new word vector model. With GloVe you are more likely to find the global optimum for the vector representations, which gives more accurate results.
The advantages of GloVe are as follows:
1. Faster training;
2. More efficient use of CPU and memory (it can handle larger documents);
3. More efficient use of data (helpful for small corpora);
4. Higher accuracy for the same amount of training.
5. Knowledge-based approaches
A.L.I.C.E. and other AIML chatbots rely entirely on pattern matching. ELIZA, the first popular chatbot, also used pattern matching and templates, well before AIML was conceived. But the developers of these chatbots hard-coded the logic of the responses in patterns and templates. Hard-coding does not scale well, not in terms of processing performance but in terms of human effort. The sophistication of a chatbot built this way grows linearly with the human effort put into it. In fact, as the complexity of such a chatbot grows, the returns on that effort begin to diminish, because the interactions among all the "moving parts" multiply and the chatbot's behavior becomes harder and harder to predict and debug.
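A minimal sketch of what hard-coded patterns and templates look like is shown below. The three rules, the fallback text, and the `respond` function are invented for this example; they are not taken from ELIZA or from any AIML rule set.

```python
import re

# Each rule pairs a hand-written pattern with a hand-written template.
RULES = [
    (re.compile(r"\bi am (?P<x>.+)", re.IGNORECASE),
     "Why do you say you are {x}?"),
    (re.compile(r"\bi need (?P<x>.+)", re.IGNORECASE),
     "What would it mean to you to get {x}?"),
    (re.compile(r"\b(hello|hi)\b", re.IGNORECASE),
     "Hello. What would you like to talk about?"),
]
FALLBACK = "Please tell me more."


def respond(statement: str) -> str:
    """Return the template of the first rule whose pattern matches."""
    for pattern, template in RULES:
        match = pattern.search(statement)
        if match:
            # Strip trailing punctuation from captured text before reuse.
            slots = {k: v.rstrip(" .!?") for k, v in match.groupdict().items()}
            return template.format(**slots)
    return FALLBACK


print(respond("I am tired of debugging."))
print(respond("The weather is nice"))
```

Every new behavior requires another hand-written entry in `RULES`, and because the first matching rule wins, each added rule can change the outcome of the ones listed after it. That is the scaling problem described above in miniature.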
Today, data-driven programming is the modern approach to most complex programming challenges. How can you use data to program a chatbot? The previous chapter of the book showed how to create structured knowledge from natural language text (unstructured data) using information extraction. You can build a network of relationships or facts just by reading text, whether that text is Wikipedia articles or even your own personal journal.
By processing the knowledge graph with logical inference, a chatbot can answer questions about the world captured in the knowledge base. The inferred answers can then be used to fill in the variables of templated responses to create natural language answers. Question answering systems such as Watson, IBM's Jeopardy-winning system, were originally built this way, although more recent versions almost certainly also employ search or information retrieval technology. A knowledge graph can be said to "ground" the chatbot in the real world.
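As a rough illustration of answering from a knowledge graph, the sketch below stores a few hand-written facts as (subject, relation, object) triples, follows `located_in` edges as a very small inference step, and fills a response template with the result. The facts, relation names, and function names are all assumptions made for this example, not the design of Watson or of any real system.

```python
# Hypothetical facts; a real system would extract these from text.
TRIPLES = {
    ("Portland", "is_a", "city"),
    ("Portland", "located_in", "Oregon"),
    ("Oregon", "is_a", "state"),
    ("Oregon", "located_in", "United States"),
}


def objects(subject, relation):
    """Return all objects linked to `subject` by `relation`, sorted."""
    return sorted(o for s, r, o in TRIPLES if s == subject and r == relation)


def containing_places(subject):
    """Follow 'located_in' edges transitively, stopping on a cycle."""
    chain, current = [], subject
    while True:
        found = objects(current, "located_in")
        if not found or found[0] in chain:
            return chain
        chain.append(found[0])
        current = found[0]


def answer_where(subject):
    """Fill a response template with facts inferred from the graph."""
    places = containing_places(subject)
    if not places:
        return f"I don't know where {subject} is."
    kinds = objects(subject, "is_a")
    kind = kinds[0] if kinds else "place"
    return f"{subject} is a {kind} in {', '.join(places)}."


print(answer_where("Portland"))
print(answer_where("Tokyo"))
```

The graph never states directly that Portland is in the United States; the answer comes from chaining two facts, and the template turns the inferred result into a sentence.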
The knowledge-based approach is not limited to answering questions about the world. The knowledge base can also be populated in real time with facts about an ongoing conversation. This lets a chatbot quickly learn who its conversation partner is and what they like.
6. Retrieval (search) approaches
Another data-driven way to "listen" to the user is to search for previous statements in logs of historical conversations. This is analogous to a human listener trying to recall where they have heard a question, statement, or word before. A bot can search not only its own conversation logs but also transcripts of human-to-human conversations, bot-to-human conversations, and even bot-to-bot conversations. But, as always, garbage in means garbage out. So you should clean and curate the database of historical conversations to make sure the bot searches for (and imitates) high-quality dialog. You want humans to enjoy their conversations with the bot.
A search-based chatbot should make sure its dialog database contains conversations that were enjoyable or useful, and they should probably be on themes that a bot with its intended personality is expected to converse about. Good sources of dialog for a search-based bot include movie dialog scripts, customer service logs from IRC channels (the portions where users were satisfied), and direct-message interactions between humans (if those people are willing to share them with you). Do not use your own email or SMS logs unless you have the written consent of everyone involved in the conversations you want to use.
If you decide to incorporate bot-to-bot dialog into your corpus, be careful. You only want statements in your database that at least one human appears to have been satisfied with, even if only by continuing the conversation. And bot-to-bot dialog should rarely be used, unless you have access to a really smart chatbot.
A search-based chatbot can use a historical dialog log to find examples of statements similar to what its conversation partner just said. To make searching easier, the dialog corpus should be organized into statement-response pairs. If a response is itself responded to, it should appear twice in the database: once as a response and again as the statement that prompts a response. The response column of the database table can then serve as the basis for the bot's reply to the statements in the "statement" (or prompt) column.
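The statement-response organization described above can be sketched in a few lines. The conversation log, the `to_pairs` helper, the Jaccard word-overlap score, and the 0.2 threshold are all assumptions chosen for illustration; a real retrieval bot would more likely use TF-IDF or semantic vectors for the similarity search.

```python
import re


def to_pairs(turns):
    """Pair each turn with the turn that follows it (statement, response)."""
    return list(zip(turns, turns[1:]))


def words(text):
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Share of words two sets have in common, between 0.0 and 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_reply(statement, pairs, threshold=0.2):
    """Reply with the response paired with the most similar past statement."""
    fallback = "Could you rephrase that?"
    if not pairs:
        return fallback
    query = words(statement)
    best_score, best_reply = max(
        (jaccard(query, words(prompt)), reply) for prompt, reply in pairs
    )
    return best_reply if best_score >= threshold else fallback


# Hypothetical conversation log, one string per turn.
TURNS = [
    "Good morning",
    "Good morning! How can I help?",
    "Can you recommend a book about NLP?",
    "Try Natural Language Processing in Action.",
]
PAIRS = to_pairs(TURNS)

print(retrieve_reply("Good morning, bot!", PAIRS))
print(retrieve_reply("Could you recommend an NLP book?", PAIRS))
```

Note that the second turn appears twice in `PAIRS`, once as a response and once as a statement, which is the duplication the paragraph above describes.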
Knowing just these NLP basics is not enough to master NLP. So how can you get an effective and complete grasp of the overall framework of NLP and everything it involves? We believe the book Natural Language Processing in Action: Understanding, Analyzing, and Generating Text with Python can help.
Natural Language Processing in Action: Understanding, Analyzing, and Generating Text with Python
Authors: [US] Hobson Lane, Cole Howard, Hannes Max Hapke
Translators: Shi Liang, Lu Xiao, Tang Kexin, Wang Bin
Note: The book is divided into three parts. Part 1 introduces NLP fundamentals, including tokenization, TF-IDF vectorization, and the transformation from term frequency vectors to semantic vectors. Part 2 covers deep learning, including neural networks, word vectors, convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short-term memory (LSTM) networks, sequence-to-sequence modeling, attention mechanisms, and other basic deep learning models and methods. Part 3 turns to practice, covering the modeling of real-world systems such as information extraction, question answering, and human-machine dialog, along with performance challenges and how to address them.