The second most popular language: a concise Python data science tutorial, from introduction to mastery

Python is a general-purpose programming language that has been widely used in data science over the past decade. In fact, Python is the second most popular programming language in data science, second only to R.

The main purpose of this article is to show you how easy it is to learn data science with Python. You may think you first have to become an advanced Python programmer before you can carry out the complex tasks usually associated with data science, but that is not the case. Python comes with many useful libraries that provide powerful support behind the scenes. You do not even need to know what a program is doing internally, and you do not have to care. All you really need to know is which tasks you need to perform, and Python makes those tasks fairly simple.

So, let's get started.

Setting up a Python environment for data science

Whether you use a Mac or a Windows computer, I suggest you download a free Python distribution that gives you easy access to as many useful modules as possible.

I have tried several Python distributions, and here I recommend Anaconda, provided by Continuum Analytics. This Python distribution contains more than 200 libraries. To understand the difference between packages, modules, and libraries in Python, please refer to this article.

When you download Anaconda, you need to choose between the Python 2 version and the Python 3 version. I strongly recommend that you use Python 2.7.12. As of the end of 2016, the vast majority of Python users outside computer science were using this version. It does a fine job for data science, it is easier to learn than Python 3, and sites such as GitHub host millions of Python scripts and code snippets for you to refer to, which will make your life easier.

Anaconda also comes with the IPython programming environment, which I recommend you use. After installing Anaconda, just navigate to Jupyter Notebook and open the program, and you can use IPython in your web browser. The Jupyter Notebook program automatically launches the application in the browser.

You can refer to this article to learn how to change the path in an IPython notebook.

Learning the basics

Before you dive into Python's data science libraries, you first need to learn some Python basics. Python is an object-oriented programming language. In Python, an object can be assigned to a variable, and it can also be passed to a function as an argument. The following are all objects in Python: numbers, strings, lists, tuples, sets, dictionaries, functions, and classes.

Functions in Python are basically the same as functions in ordinary mathematics: they receive input data, process it, and output a result. The output depends entirely on how the function is designed. Classes in Python, on the other hand, are prototypes of objects designed to produce other objects.
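As a minimal sketch of these two ideas, the snippet below defines one function and one class. The names mean and ScoreBook are hypothetical and chosen only for illustration.

```python
def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / float(len(values))


class ScoreBook(object):
    """A hypothetical prototype: each instance holds its own list of scores."""

    def __init__(self, scores):
        self.scores = list(scores)

    def average(self):
        # Reuse the function above instead of repeating the arithmetic.
        return mean(self.scores)


# Functions are objects too: they can be assigned to a variable
# and passed to another function as an argument.
summarize = mean
book = ScoreBook([80, 90, 100])

print(summarize([1, 2, 3]))          # mean of 1, 2, 3
print(book.average())                # mean of 80, 90, 100
print(list(map(abs, [-1, 2, -3])))   # the built-in abs passed as an argument
```

Every ScoreBook instance created from the class carries its own scores, which is what "prototype" means here.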

If your goal is to write Python code that is fast, reusable, and easy to modify, you have to use functions and classes. Using them helps keep your code efficient and clean.

Now, let's see which data science libraries are available in Python.

Scientific computing: NumPy and SciPy

NumPy is a Python package used mainly for working with n-dimensional array objects, while SciPy provides implementations of many mathematical algorithms and complex functions and can be used to extend NumPy. SciPy adds a number of specialized scientific functions to Python for specific tasks in data science.

To use NumPy (or any other Python library), you must first import it.

np.array(scores) converts a list to an array.
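The import and the conversion can be sketched as follows. The contents of scores are made up for illustration, and np is only the conventional alias for NumPy.

```python
import numpy as np

# Hypothetical sample data: a plain Python list of exam scores.
scores = [88, 92, 79, 65]

# np.array() converts the list into a NumPy array.
scores_array = np.array(scores)

print(type(scores_array))    # the array type, numpy.ndarray
print(scores_array.mean())   # arithmetic mean of the four scores
print(scores_array + 5)      # arithmetic applies element by element
```

Unlike a list, the array supports arithmetic on all elements at once, which is why the conversion is usually the first step.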

When you use plain Python (that is, Python without external extensions such as libraries), the built-in list is essentially a one-dimensional container for your data. If you extend Python with NumPy, however, you can work directly with n-dimensional arrays. (In case you are wondering, an n-dimensional array is an array with one or more dimensions.)
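To make the idea of dimensions concrete, here is a small sketch of a two-dimensional array. The data (two students by three exams) is invented for the example.

```python
import numpy as np

# Hypothetical data: two students (rows) by three exams (columns).
grid = np.array([[88, 92, 79],
                 [65, 70, 85]])

print(grid.ndim)           # number of dimensions
print(grid.shape)          # (rows, columns)
print(grid.mean(axis=0))   # per-exam mean, taken down each column
print(grid[1, 2])          # row 1, column 2 (zero-based)
```

The axis argument selects the dimension along which an operation runs, which has no direct equivalent for a plain list.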

We start with NumPy because it is essential whenever you do scientific computing in Python. A solid understanding of NumPy will help you use libraries such as Pandas and SciPy efficiently.

Data wrangling: Pandas

Pandas is the most widely used tool for data wrangling. It includes high-level data structures and data manipulation tools designed to make data analysis faster and more convenient. Anyone who uses R for statistical computing will already be familiar with the name DataFrame.

Pandas is one of the key reasons Python has grown into a powerful and efficient data analysis platform.

Working with a small data set is a good way to get to know Pandas.

A DataFrame is a spreadsheet-like structure that contains an ordered set of columns. Each column can have a different variable type. A DataFrame has both a row index and a column index.
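The following sketch builds a small DataFrame and inspects the properties just described. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data set: the column names and values are made up.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara"],
    "age": [31, 45, 27],
    "score": [88.5, 92.0, 79.5],
})

print(df.dtypes)             # each column has its own type
print(df.index.tolist())     # row index
print(df.columns.tolist())   # column index
print(df[df["age"] > 30])    # keep only rows where age exceeds 30
print(df["score"].mean())    # mean of a single column
```

Filtering with a condition such as df["age"] > 30 is the kind of operation that would need an explicit loop with plain lists.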

Visualization: Matplotlib + Seaborn + Bokeh

Matplotlib is a Python module for data visualization. It makes it easy to draw line charts, pie charts, histograms, and other professional-looking charts.

You can use Matplotlib to customize every detail of a chart. When you use Matplotlib in IPython, it offers interactive features such as zooming and panning. Matplotlib supports different GUI backends on all operating systems, and it can also export charts to several common image formats, such as PDF, SVG, JPG, PNG, BMP, and GIF.
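As a rough sketch of customizing and exporting a chart, the snippet below draws a line chart and writes it out as PNG. The data is invented, and the non-interactive Agg backend is an assumption made so that the example runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no GUI window is needed
import matplotlib.pyplot as plt

# Hypothetical data for illustration.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

fig, ax = plt.subplots()
ax.plot(x, y, color="green", linestyle="--", marker="o", label="y = x squared")
ax.set_title("A simple line chart")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# Export the chart as PNG into an in-memory buffer; passing a file name
# such as "chart.png" to savefig() would write it to disk instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
```

In a Jupyter notebook you would normally skip the backend line and let the notebook display the figure inline.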

Seaborn is a data visualization library built on Matplotlib and used to create attractive, informative statistical charts in Python. Its main feature is that it can create complex chart types from Pandas data with relatively simple commands. I drew the following chart with Seaborn:

[Figure: chart drawn with Seaborn]
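The sketch below is not the chart from the original figure. It is a separate, minimal example of drawing a statistical chart from a Pandas DataFrame with a single Seaborn call, using made-up column names and values.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no GUI window is needed
import pandas as pd
import seaborn as sns

# Hypothetical measurements, made up for illustration.
df = pd.DataFrame({
    "species": ["a", "a", "a", "b", "b", "b"],
    "petal_length": [1.4, 1.5, 1.3, 4.7, 4.5, 4.9],
})

# One call draws a box plot of petal_length for each species.
ax = sns.boxplot(data=df, x="species", y="petal_length")

# Render the figure to an in-memory PNG to confirm that it draws.
ax.figure.savefig(io.BytesIO(), format="png")
```

Seaborn reads the column names straight from the DataFrame and uses them as axis labels, so no manual labeling is needed.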

Machine learning: Scikit-learn

The goal of machine learning is to teach a machine (software) to perform a task by giving it examples of how to perform the task, or of what should not be done.

Python has many libraries for machine learning, and Scikit-learn is one of the most popular. It is built on top of NumPy, SciPy, and Matplotlib. With Scikit-learn you can use most common machine learning algorithms, such as regression, clustering, and classification. So if you plan to learn machine learning with Python, I suggest you start with Scikit-learn.

The k-nearest neighbors (KNN) algorithm can be used for classification or regression. For example, a KNN model can be used to predict species in the iris data set.
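A minimal sketch of KNN classification on the iris data set might look like this. The 70/30 train/test split, the random seed, and n_neighbors=5 are arbitrary choices for the example, not values taken from the original article.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# The iris data set ships with scikit-learn: 150 flowers, 4 features, 3 species.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

# n_neighbors=5 is an arbitrary choice for this sketch.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

predictions = knn.predict(X_test)
accuracy = knn.score(X_test, y_test)  # fraction of test flowers classified correctly

print(iris.target_names[predictions[:5]])  # predicted species of five test flowers
print(accuracy)
```

Holding out part of the data for testing gives an estimate of how well the model handles flowers it has not seen.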

Other machine learning libraries include:

  • Theano

  • Pylearn2

  • Pyevolve

  • Caffe

  • TensorFlow

Statistics: Statsmodels and scipy.stats

Statsmodels and scipy.stats are two popular statistics modules in Python. scipy.stats is mainly used for working with probability distributions. Statsmodels, on the other hand, provides a formula framework for statistical models similar to the one in R. Its extensive functionality, including descriptive statistics, statistical tests, plotting functions, and result statistics, is available for different types of data and for each estimator.

The scipy.stats module can be used to work with the normal distribution.

A normal distribution is a continuous distribution whose input can be any value on the real line. It is parameterized by two values: the mean μ and the variance σ².
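Here is a short sketch of a normal distribution in scipy.stats. The values of mu and variance are made up. Note that scipy.stats.norm expects the standard deviation σ as its scale argument, not the variance, so the code takes a square root first.

```python
import math

from scipy.stats import norm

mu = 0.0         # mean of the distribution
variance = 4.0   # sigma squared
sigma = math.sqrt(variance)

# scipy.stats.norm takes the mean as `loc` and the standard deviation
# (not the variance) as `scale`.
dist = norm(loc=mu, scale=sigma)

print(dist.pdf(mu))   # density at the mean
print(dist.cdf(mu))   # probability of a value at or below the mean
print(dist.cdf(mu + sigma) - dist.cdf(mu - sigma))  # mass within one sigma

samples = dist.rvs(size=1000, random_state=0)  # 1000 random draws
print(samples.mean(), samples.std())
```

The sample mean and standard deviation printed last should land near mu and sigma, though not exactly on them.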

Web scraping: Requests, Scrapy, and BeautifulSoup

Web scraping is the process of retrieving unstructured data from the web (usually in HTML format) and transforming it into a structured format for analysis.

Popular libraries for web scraping include:

  • Scrapy

  • urllib

  • BeautifulSoup

  • Requests

To scrape data from a website, you need to know some HTML basics.

Here is an example of web scraping with the BeautifulSoup library:

import urllib2

import bs4

The code beautiful = urllib2.urlopen(url).read() goes to the given URL and retrieves the entire HTML text of the corresponding page. I then store that text in the variable beautiful.

I used urllib2 to fetch the page at that URL; you can also use Requests to do the same thing. Here is an article to help you understand the difference between urllib2 and Requests.
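Once the HTML text has been downloaded, the parsing step can be sketched as follows. To keep the example runnable without network access, a made-up HTML string stands in for the downloaded page; in practice that string would come from urllib2 or Requests as described above.

```python
from bs4 import BeautifulSoup

# A made-up HTML document standing in for a downloaded page, so that the
# sketch runs without network access.
html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <h1>Data sets</h1>
    <a href="/iris.csv">Iris</a>
    <a href="/wine.csv">Wine</a>
  </body>
</html>
"""

# "html.parser" is the parser bundled with Python's standard library.
soup = BeautifulSoup(html, "html.parser")

title = soup.title.string
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]

print(title)
print(links)
```

The result is a list of (text, link) pairs, which is the kind of structured data that can then be loaded into Pandas for analysis.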

Scrapy is similar to BeautifulSoup. Back-end engineer Prasanna Venkadesh explained the difference between the two toolkits on Quora:

"Scrapy is a web spider or, put another way, a web crawler framework. You give Scrapy a root URL to start crawling from, and then you can specify constraints, such as how many URLs to crawl. It is a complete framework for web scraping or crawling.

BeautifulSoup, on the other hand, is a parsing library. It also does an excellent job of fetching a page and lets you easily parse parts of its content. However, BeautifulSoup only fetches the content of the page at the URL you give it. It does not fetch other pages unless you manually add their URLs to a loop in some way.

Simply put, you can use BeautifulSoup to build something similar to Scrapy. However, BeautifulSoup is a Python library, while Scrapy is a complete framework."


Now you know Python and what these libraries are for. It is time to use what you have learned to solve concrete data analysis problems. You can start with structured data sets and then move on to more complex problems involving unstructured data.

