Python is a general-purpose programming language that has been widely used in data science over the past decade. In fact, Python is the second most popular programming language in data science, behind only R.
The main purpose of this article is to show you how easy it is to learn data science with Python. You may think you need to become an advanced Python programmer before you can carry out the complex tasks usually associated with data science, but that is not the case. Python comes with many useful libraries that provide powerful support behind the scenes. You don't even need to know what a program is doing internally, and you don't have to care. The only thing you really need to know is that you have certain tasks to perform, and that Python makes those tasks fairly simple.
So, let's get started.
Setting up a Python environment for data science
Whether you use a Mac or Windows, I suggest downloading a free Python distribution that gives you easy access to as many useful modules as possible.
I have tried several Python distributions, and here I recommend Anaconda, provided by Continuum Analytics. This Python distribution includes more than 200 libraries. To understand the difference between packages, modules, and libraries in Python, please refer to this article.
When you download Anaconda, you need to choose between the Python 2 and Python 3 versions. I strongly recommend Python 2.7.12. As of the end of 2016, the vast majority of Python users outside computer science were using this version. It does a good job for data science, it is easier to learn than Python 3, and sites like GitHub host millions of Python scripts and code snippets for you to refer to, which will make your life easier.
Anaconda also comes with the IPython programming environment, which I recommend you use. After installing Anaconda, just navigate to Jupyter Notebook and open the program to use IPython in your web browser. The Jupyter Notebook program automatically launches the application in the browser.
You can refer to this article to learn how to change the path in an IPython notebook.
Learning the basics
Before you dig into Python's data science libraries, you need to learn some Python basics first. Python is an object-oriented programming language. In Python, an object can be assigned to a variable and can also be passed as an argument to a function. The following are all objects in Python: numbers, strings, lists, tuples, sets, dictionaries, functions, and classes.
Functions in Python are basically the same as functions in ordinary mathematics: they take input data, process it, and output a result. The output depends entirely on how the function is designed. Classes in Python, on the other hand, are prototypes of objects designed to produce other objects.
If your goal is to write fast, reusable, easy-to-modify Python code, you have to use functions and classes. Using them helps keep code efficient and clean.
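As a small illustration of the points above, here is a minimal sketch of a function, a class, and a function being assigned to a variable. The names average, Student, and summarize, and the numbers used, are made up for this example.

```python
# A function: takes input, processes it, and returns a result.
def average(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / float(len(values))


# A class: a prototype from which individual objects are created.
class Student(object):
    def __init__(self, name, scores):
        self.name = name
        self.scores = scores

    def mean_score(self):
        # Reuse the function above instead of repeating the arithmetic.
        return average(self.scores)


# Functions are objects too, so they can be assigned to a variable.
summarize = average

alice = Student("Alice", [90, 80, 70])
print(alice.mean_score())
print(summarize([1, 2, 3, 4]))
```

Keeping the arithmetic in one function means a later change (for example, ignoring missing scores) only has to be made in one place.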
Now, let's look at which data science libraries are available in Python.
Scientific computing: NumPy and SciPy
NumPy is a Python package mainly used for working with n-dimensional array objects, while SciPy provides many mathematical algorithms and implementations of complex functions that extend NumPy's capabilities. SciPy adds a number of specialized scientific functions to Python for specific data science tasks.
To use NumPy (or any other Python library) in Python, you must first import it.
np.array(scores) converts a list to an array.
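The original listing is not reproduced here. A minimal sketch of the import and the conversion might look like the following; the values in scores are made up for illustration.

```python
import numpy as np  # "np" is the conventional alias for NumPy

# Hypothetical data: a plain Python list of test scores.
scores = [88, 92, 79, 65]

# Convert the list to a NumPy array.
score_array = np.array(scores)

print(type(score_array))   # the array type, not a list
print(score_array.mean())  # arrays come with numeric methods built in
```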
When you use plain Python, that is, Python without any external extensions such as libraries, you can only store data in one-dimensional lists. If you extend Python with the NumPy library, however, you can work with n-dimensional arrays directly. (In case you are wondering, an n-dimensional array is an array with one or more dimensions.)
The reason to start with NumPy is that it is essential for scientific computing in Python. A solid understanding of NumPy will help you use libraries such as Pandas and SciPy efficiently.
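To make the idea of dimensions concrete, here is a small sketch of a two-dimensional array. The data (two students, three tests) is invented for this example.

```python
import numpy as np

# Made-up data: 2 rows (students) by 3 columns (tests).
grid = np.array([[88, 92, 79],
                 [65, 70, 91]])

print(grid.ndim)          # number of dimensions
print(grid.shape)         # size along each dimension
print(grid.mean(axis=0))  # column-wise mean: one value per test
```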
Data wrangling: Pandas
Pandas is the most widely used tool for data wrangling. It includes high-level data structures and data manipulation tools designed to make data analysis faster and more convenient. If you use R for statistical computing, the name DataFrame will certainly be familiar.
Pandas is one of the key factors behind Python's growth into a powerful and efficient data analysis platform.
Next, I'll show you how to use Pandas to work with a small data set.
A DataFrame is a spreadsheet-like structure containing an ordered collection of columns. Each column can have a different type. A DataFrame has both a row index and a column index.
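The original example is not reproduced here. The sketch below builds a small DataFrame from a dictionary and inspects its row index, column index, and column types; the column names and values are assumptions made for illustration.

```python
import pandas as pd

# Made-up data: each key becomes a column, and columns may differ in type.
data = {
    "name": ["Ann", "Bob", "Cid"],
    "age": [28, 35, 41],
    "score": [88.5, 92.0, 79.5],
}
df = pd.DataFrame(data, columns=["name", "age", "score"])

print(df.index)             # row index
print(df.columns)           # column index
print(df.dtypes)            # one type per column
print(df[df["age"] > 30])   # keep only rows where age exceeds 30
print(df["score"].mean())   # summary statistic for one column
```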
Visualization: Matplotlib + Seaborn + Bokeh
Matplotlib is a Python module for data visualization. It makes it easy to draw line charts, pie charts, histograms, and other professional-looking charts.
You can use Matplotlib to customize every detail of a chart. When you use Matplotlib in IPython, it offers interactive features such as zooming and panning. Matplotlib supports different GUI backends on all operating systems, and it can also export charts to several common formats, such as PDF, SVG, JPG, PNG, BMP, and GIF.
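As a rough sketch of customizing and exporting a chart, the example below draws a line chart from made-up data and saves it as PNG and PDF. The file names are arbitrary, and the Agg backend is chosen only so the script also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no GUI is required
import matplotlib.pyplot as plt

# Made-up data for illustration.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

fig, ax = plt.subplots()
# Color, line style, and marker are all customizable.
ax.plot(x, y, color="green", linestyle="--", marker="o", label="y = x squared")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A simple line chart")
ax.legend()

# The output format is inferred from the file extension.
fig.savefig("line_chart.png")
fig.savefig("line_chart.pdf")
plt.close(fig)
```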
Seaborn is a data visualization library based on Matplotlib, used to create attractive and informative statistical charts in Python. Seaborn's main feature is that it can create complex chart types from Pandas data with relatively simple commands. I drew the following figure with Seaborn:
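The author's figure is not reproduced here. To illustrate the point about simple commands, the sketch below draws a box plot directly from a small Pandas DataFrame; the column names, values, and output file name are assumptions, not the author's data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no GUI is required
import pandas as pd
import seaborn as sns

# Made-up data: one categorical column and one numeric column.
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, 3.0, 2.5, 3.5, 5.0],
})

# One call produces a statistical chart from the DataFrame columns.
ax = sns.boxplot(x="group", y="value", data=df)
ax.set_title("Value by group")
ax.figure.savefig("boxplot.png")
```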
Machine learning: Scikit-learn
The goal of machine learning is to teach machines (software) to perform tasks by giving them examples of how to perform the task, or of how not to.
Python has many machine learning libraries, but Scikit-learn is one of the most popular. Scikit-learn is built on top of NumPy, SciPy, and Matplotlib. With Scikit-learn you can implement almost all common machine learning algorithms, such as regression, clustering, and classification. So if you plan to learn machine learning with Python, I suggest you start with Scikit-learn.
The K-nearest neighbors (KNN) algorithm can be used for classification or regression. The following code shows how to use a KNN model to make predictions on the iris data set.
Other machine learning libraries include:
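The original listing is not reproduced here. A minimal sketch of such a model might look like the following; the 70/30 split, the random_state value, and n_neighbors=5 are assumptions chosen for illustration, not the author's settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# The iris data set ships with Scikit-learn.
iris = load_iris()

# Hold out 30% of the samples to check the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# Classify each sample by a vote among its 5 nearest training samples.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

predictions = knn.predict(X_test)
accuracy = knn.score(X_test, y_test)  # fraction of correct predictions
print(predictions[:5])
print(accuracy)
```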
Statistics: Statsmodels and scipy.stats
Statsmodels and scipy.stats are two popular statistics modules in Python. scipy.stats is mainly used for working with probability distributions. Statsmodels, on the other hand, provides statistical models with an R-like formula framework. It offers extended functionality, including descriptive statistics, statistical tests, plotting functions, and result statistics, for different types of data and for each estimator.
The following code shows how to work with the normal distribution using the scipy.stats module.
A normal distribution is a continuous distribution whose input can be any value on the real line. It is described by two parameters: the mean μ and the variance σ².
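The original listing is not reproduced here. The sketch below shows one way to work with a normal distribution in scipy.stats, assuming a standard normal (μ = 0, σ = 1) for illustration. Note that scipy.stats.norm takes the standard deviation σ as its scale argument, not the variance σ².

```python
import math
from scipy.stats import norm

mu = 0.0      # mean of the distribution
sigma = 1.0   # standard deviation, i.e. the square root of the variance

# "Freeze" a normal distribution with these parameters.
dist = norm(loc=mu, scale=sigma)

print(dist.pdf(0.0))     # density at the mean
print(dist.cdf(0.0))     # probability of a value at or below the mean
print(dist.ppf(0.975))   # value below which 97.5% of the mass lies

# Draw five random samples; random_state makes the draw repeatable.
samples = dist.rvs(size=5, random_state=0)
print(samples)
```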
Web scraping: Requests, Scrapy, and BeautifulSoup
Web scraping means retrieving unstructured data (usually in HTML format) from the web and converting it into a structured format for analysis.
Popular libraries for web scraping include Requests, Scrapy, and BeautifulSoup.
To scrape data from a website, you need some basic knowledge of HTML.
Here is an example of web scraping with the BeautifulSoup library:
The code beautiful = urllib2.urlopen(url).read() goes to bigdataexaminer.com and retrieves the entire HTML text of the site. I then store that text in the variable beautiful.
I use urllib2 to fetch the page at http://www.bigdataexaminer.com/, but you can also use Requests to do the same thing. Here is an article to help you understand the difference between urllib2 and Requests.
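The original listing is not reproduced here. To show the parsing step without depending on a network connection, the sketch below feeds BeautifulSoup a small hand-written HTML string instead of a downloaded page; the markup is invented for illustration and is not the content of bigdataexaminer.com. In a real script, the html variable would hold the text returned by urllib2 or Requests.

```python
from bs4 import BeautifulSoup

# Stand-in for downloaded HTML; this markup is made up for the example.
html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <h1>Latest posts</h1>
    <a href="/post/1">First post</a>
    <a href="/post/2">Second post</a>
  </body>
</html>
"""

# Parse the text with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# Turn unstructured markup into structured data: (link text, target) pairs.
print(soup.title.string)
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]
print(links)
```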
Scrapy is similar to BeautifulSoup. Back-end engineer Prasanna Venkadesh explained the difference between the two toolkits on Quora:
"Scrapy is a web spider, or rather a web scraping framework. You give Scrapy a root URL to start crawling from, and then you can specify constraints, such as how many URLs to crawl. It is a complete framework for web scraping or crawling.
BeautifulSoup, on the other hand, is a parsing library that also does a very good job of fetching page content and lets you easily parse parts of a page. However, BeautifulSoup only fetches the content of the URL you give it. It does not crawl other pages unless you manually add their URLs to a loop of some kind.
Simply put, you can use BeautifulSoup to build something similar to Scrapy. But BeautifulSoup is a Python library, while Scrapy is a complete framework."
Now you know the Python basics and what these libraries are for. It's time to use what you've learned to solve specific data analysis problems. You can start with structured data sets and then move on to more complex, unstructured data analysis problems.
The above is the translated article.
This article was recommended by @Love coco - Love life, a teacher at Beijing University of Posts and Telecommunications, and translated by the Alibaba Cloud Yunqi Community.
Link to the original text: https://yq.aliyun.com/articles/68270