Pandas is a Python data analysis module built on top of NumPy.

Python has long been excellent at data munging and preparation, but weaker at data analysis and modeling. pandas helps fill that gap, letting you carry out an entire data analysis workflow in Python without switching to a more domain-specific language such as R. Combined with the excellent Jupyter toolkit and other libraries, Python's environment for data analysis is outstanding in performance, productivity, and collaboration.

pandas is the core data analysis support library for Python. It provides fast, flexible, expressive data structures designed to make working with relational or labeled data simple and direct. pandas is an essential high-level tool for doing data analysis in Python.

The main data structures in pandas are Series (one-dimensional data) and DataFrame (two-dimensional data). These two structures are enough to handle the vast majority of cases in finance, statistics, social science, engineering, and other fields. Working with data generally falls into several stages: munging and cleaning, analysis and modeling, and visualization and reporting. pandas is an ideal tool for all of them.
Environment
Editor: Jupyter Notebook
Python version: 3.8.6
OS: Windows 10
Part 1 — Installing pandas
Open a terminal and run:

pip install -i https://pypi.doubanio.com/simple/ --trusted-host pypi.doubanio.com pandas
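To confirm the installation worked, a minimal check (not part of the original post) is to import the library and print its version:

import pandas as pd
print(pd.__version__)  # prints the installed version string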
Part 2 — Data structures

Section 1 — Series

When a Series is created from a list, pandas generates an integer index automatically by default; you can also specify the index yourself.
import numpy as np
import pandas as pd

l = [0,1,7,9,np.NaN,None,1024,512]
# Whether it is NumPy's NaN or Python's None, pandas treats both as the missing value NaN
s1 = pd.Series(data = l)  # pandas adds an integer index automatically
s2 = pd.Series(data = l,index = list('abcdefhi'),dtype='float32')  # specify the row index
# Create from a dict: the keys become the row index
s3 = pd.Series(data = {'a':99,'b':137,'c':149},name = 'Python_score')
display(s1,s2,s3)
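Once created, a Series can be read by label or by position (a small usage sketch, not in the original post):

s2['a']     # access by label -> 0.0
s2.iloc[0]  # access by integer position, the same element
s2.values   # the underlying NumPy array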

Section 2 — DataFrame

A DataFrame is a two-dimensional labeled data structure whose columns can hold different types, similar to an Excel sheet, a SQL table, or a dict of Series objects.

import numpy as np
import pandas as pd
# index gives the row index; the dict keys become the column index. This creates a 3x3 DataFrame
df1 = pd.DataFrame(data = {'Python':[99,107,122],'Math':[111,137,88],'En':[68,108,43]},  # keys become the column index
                   index = ['Zhang San','Li Si','Michael'])  # row index
df2 = pd.DataFrame(data = np.random.randint(0,151,size = (5,3)),
                   index = ['Danial','Brandon','softpo','Ella','Cindy'],  # row index
                   columns=['Python','Math','En'])  # column index
display(df1,df2)

Part 3 — Viewing data

Commonly used DataFrame attributes and methods for getting an overview and summary statistics.
import numpy as np
import pandas as pd
df = pd.DataFrame(data = np.random.randint(0,151,size=(150,3)),
                  index = None,  # the row index defaults to a RangeIndex
                  columns=['A','B','C'])  # column index
df.head(10)  # show the first ten rows (five by default!)
df.tail(10)  # show the last ten rows
df.shape     # number of rows and columns
df.dtypes    # data type of each column
df.index     # view the row index
df.values    # the underlying values, a two-dimensional NumPy array
df.describe()  # summary statistics of the numeric columns: count, mean, std, min, quartiles, max
df.info()    # column index, data types, non-null counts, and memory usage

Part 4 — Data input and output

Section 1 — CSV


import numpy as np
import pandas as pd
df = pd.DataFrame(data = np.random.randint(0,50,size = [50,5]),  # salary data
                  columns=['IT','Chemical','Biology','Teacher','Soldier'])
# Save under the file name salary.csv
df.to_csv('./salary.csv',
          sep = ';',      # separator
          header = True,  # whether to save the column index
          index = True)   # whether to save the row index
# Load
pd.read_csv('./salary.csv',
            sep = ';',    # the default is a comma
            header = [0], # specify the column index
            index_col=0)  # specify the row index
# Load
pd.read_table('./salary.csv',  # like read_csv, reads a delimited text file
              sep = ';',
              header = [0],   # specify the column index
              index_col=1)    # specify the row index; the IT column becomes the row index

Section 2 — Excel

pip install xlrd -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install xlwt -i https://pypi.tuna.tsinghua.edu.cn/simple

import numpy as np
import pandas as pd
df1 = pd.DataFrame(data = np.random.randint(0,50,size = [50,5]),  # salary data
                   columns=['IT','Chemical','Biology','Teacher','Soldier'])
df2 = pd.DataFrame(data = np.random.randint(0,50,size = [150,3]),  # computer-course exam scores
                   columns=['Python','Tensorflow','Keras'])
# Save to the current path under the file name salary.xls
df1.to_excel('./salary.xls',
             sheet_name = 'salary',  # name of the worksheet inside the Excel file
             header = True,   # whether to save the column index
             index = False)   # do not save the row index
pd.read_excel('./salary.xls',
              sheet_name=0,   # which worksheet to read; the first by default
              header = 0,     # use the first row as the column index
              names = list('ABCDE'),  # replace the column names
              index_col=1)    # specify the row index; column B becomes the row index
# Save multiple worksheets in a single Excel file
with pd.ExcelWriter('./data.xlsx') as writer:
    df1.to_excel(writer,sheet_name='salary',index = False)
    df2.to_excel(writer,sheet_name='score',index = False)
pd.read_excel('./data.xlsx',
              sheet_name='salary')  # read the worksheet with the given name
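One note beyond the original post: xlrd 2.0 and later can only read the legacy .xls format, and recent pandas versions read and write .xlsx through openpyxl, which installs the same way:

pip install openpyxl -i https://pypi.tuna.tsinghua.edu.cn/simple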
Section 3 — SQL
pip install sqlalchemy -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install pymysql -i https://pypi.tuna.tsinghua.edu.cn/simple

import numpy as np
import pandas as pd
# SQLAlchemy is an open-source Python toolkit that provides SQL utilities and an object-relational mapper (ORM)
from sqlalchemy import create_engine
df = pd.DataFrame(data = np.random.randint(0,50,size = [150,3]),  # computer-course exam scores
                  columns=['Python','Tensorflow','Keras'])
# Database connection: mysql+pymysql://user:password@host/database (placeholder credentials)
conn = create_engine('mysql+pymysql://root:password@localhost:3306/pandas?charset=UTF8MB4')
# Save to the database
df.to_sql('score',  # table name in the database
          conn,     # database connection
          if_exists='append')  # if the table already exists, append the data
# Load from the database
pd.read_sql('select * from score limit 10',  # SQL query
            conn,  # database connection
            index_col='Python')  # column to use as the row index

--- First update ---

Part 5 — Data selection

Section 1 — Retrieving data

First, build a DataFrame to work with:
df = pd.DataFrame(data = np.random.randint(0,150,size = [10,3]),  # computer-course exam scores
                  index = list('ABCDEFGHIJ'),  # row labels
                  columns=['Python','Tensorflow','Keras'])

df.Python               # select the column as an attribute
df['Python']            # select the column by label
df[['Python','Keras']]  # select multiple columns
df[1:3]                 # row slice; behaves like ordinary list slicing

Use loc[] to retrieve data: loc indexes rows and columns by their labels.

df.loc[['A','B']]                     # select rows by label
df.loc[['A','B'],['Python','Keras']]  # select rows and columns by label
df.loc[:,['Python','Keras']]          # keep all rows
df.loc[::2,['Python','Keras']]        # take every second row
df.loc['A',['Python','Keras']]        # select a single row by label

Use iloc[] to retrieve data: iloc indexes rows and columns by integer position.

df.iloc[2:4]      # integer row slicing, similar to NumPy
df.iloc[1:3,1:2]  # slice rows and columns by integer position
df.iloc[1:3]      # row slice
df.iloc[:,0:1]    # column slice

Boolean indexing

cond1 = df.Python > 100  # is each Python score greater than 100? returns a boolean Series
df[cond1]  # rows where the Python score is greater than 100
cond2 = (df.Python > 50) & (df['Keras'] > 50)  # & combines two conditions
df[cond2]  # rows where both Python and Keras are greater than 50
df[df > 50]  # keep the values that satisfy the condition; everything else becomes NaN
df[df.index.isin(['A','C','F'])]  # isin tests membership and returns boolean values
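As an aside (not in the original post), the same filter can often be written more compactly with DataFrame.query:

df.query('Python > 50 and Keras > 50')  # equivalent to df[cond2] above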

Part 6 — Combining data

Section 1 — Concatenating with concat

# Build two more DataFrames
df1 = pd.DataFrame(np.random.randint(1,151,size=10),
                   index = list('ABCDEFGHIJ'),
                   columns=['Science'])
df2 = pd.DataFrame(data = np.random.randint(0,150,size = [10,3]),
                   index = list('KLMNOPQRST'),
                   columns=['Python','Tensorflow','Keras'])
pd.concat([df,df2],axis=0)  # stack df2 vertically below df
pd.concat([df,df1],axis=1)  # join df1 horizontally, to the right of df
df.append(df1)  # append df1 below df
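One caution beyond the original post: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the last line is written with concat instead:

pd.concat([df,df1])  # modern equivalent of df.append(df1)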

Section 2 — Inserting columns

insert() inserts a single column.

Note: the column passed to insert() must have the same length as the DataFrame's existing rows.

# Insert a column named C++
df.insert(loc=1,
          column='C++',
          value=np.random.randint(0,151,size=(10)))
# A scalar value is broadcast to every row
df.insert(loc = 1,column='Python3.8',value=2048)

Section 3 — Joining data (SQL-style join)

Merge (merge) and join operations combine data sets by linking rows with one or more keys. They are the core operations of relational databases, and pandas' merge function is the main entry point for applying join logic to data sets.
# Create two data sets
df1 = pd.DataFrame(data = {'sex':np.random.randint(0,2,size=6),
                           'name':['Kyushu','Nine weeks','Nineweek','Mrs Tong','Small A','Small C']})
df2 = pd.DataFrame(data = {'score':np.random.randint(90,151,size=6),
                           'name':['Kyushu','Nine weeks','Nineweek','Mrs Tong','Small A','Small Ming']})

pd.merge(df1,df2)  # inner join (the default): rows without a match on the key are dropped
pd.merge(df1,df2,how='left')   # left join
pd.merge(df1,df2,how='right')  # right join
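A small extension (not in the original post): the join key and an outer join can be spelled out explicitly:

pd.merge(df1,df2,on='name',how='outer')  # keep rows from both sides; missing values become NaN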

--- Second update ---

Part 7 — Data cleaning

Section 1 — Filtering duplicates with duplicated

duplicated scans from top to bottom and returns True for any row that repeats an earlier row.

# Create some score data
df2 = pd.DataFrame(data={'Name':['Kyushu','Mrs Tong','Nineweek',None,np.NaN,'Mrs Tong'],
                         'Sex':[0,1,0,1,0,1],
                         'Score':[89,100,67,90,98,100]})

df2.duplicated()       # check for duplicate rows; returns a boolean Series
df2.duplicated().sum() # count the duplicate rows
df2[df2.duplicated()]  # show the duplicate rows
df2[df2.duplicated()==False]  # show the non-duplicate rows
df2.drop_duplicates()  # drop duplicates (does not modify the source data)
df2.drop_duplicates(inplace=True)  # drop duplicates in place (modifies the source data)
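duplicated compares whole rows by default; a subset of columns can be compared instead (an extra example, not in the original post):

df2.duplicated(subset=['Name'])                   # True where a Name repeats an earlier Name
df2.drop_duplicates(subset=['Name'],keep='last')  # keep the last occurrence of each Name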

Section 2 — Filtering null data

df2.isnull()  # detect missing values (finds both NaN and None)
df2.dropna(how = 'any')  # drop rows with missing values (does not modify the source data)
df2.dropna(how = 'any',inplace=True)  # drop rows with missing values in place (modifies the source data)
df2.fillna(value='Small A')  # fill missing values (does not modify the source data)
df2.fillna(value='Small A',inplace=True)  # fill missing values in place (modifies the source data)
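fillna also accepts a dict mapping column names to fill values, so each column can get its own default (an extra example, not in the original post):

df2.fillna(value={'Name':'unknown','Score':df2['Score'].mean()})  # a string for Name, the column mean for Score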

Section 3 — Dropping specific rows or columns

del df2['Sex']  # delete a column directly
df2.drop(labels = ['Score'],axis = 1)  # drop the named column
df2.drop(labels = [0,1,5],axis = 0)  # drop the rows with the given index labels

The filter function keeps the selected data and filters out everything else.

df2.filter(items=['Name', 'Score'])  # keep only the Name and Score columns
df2.filter(like='S',axis = 1)  # keep columns whose label contains 'S' (axis=1 means columns, axis=0 means rows)
df2.filter(regex='S$', axis=1)  # filter with a regular expression (labels ending in 'S')

Part 8 — Data transformation

Section 1 — rename and replace: transforming labels and elements

# Change the row and column index
df2.rename(index = {0:10,1:11},columns={'Name':'StName'})  # rename row labels 0 to 10 and 1 to 11; rename the Name column to StName
# Replace element values
df2.replace(100,102)  # replace every 100 with 102
df2.replace([89,67],78)  # replace every 89 and 67 with 78
df2.replace({'Kyushu':'JZ',None:'Kyushu'})  # replace according to dict key/value pairs
df2.replace({'Sex':1},1024)  # replace the value 1 with 1024, but only in the Sex column

Section 2 — apply and transform

What they share: both run a function over the columns (or rows) of a DataFrame, and both are often combined with the groupby() grouping and aggregation method covered in a later update.

Where they differ: apply hands each whole column (or row) to the function, so the function may aggregate it down to a single value, and it accepts arbitrary functions, whether built-in or custom. transform must return output of the same length as its input, so it performs element-wise, per-column computations and cannot aggregate.

# First create a DataFrame
df = pd.DataFrame(data = np.random.randint(0,150,size = [10,3]),index = list('ABCDEFGHIJ'),columns=['Python','En','Math'])

df['Python'].apply(lambda x:True if x > 50 else False)  # flag the Python scores above 50
df.apply(lambda x : x.median(),axis = 0)  # median of each column

# Custom aggregation function
def avg(x):
    return (x.mean(),x.max(),x.min(),x.var().round(1))
df.apply(avg,axis=0)  # per column: mean, max, min, and variance rounded to one decimal place

# Run several computations on a single column
df['Python'].transform([np.sqrt,np.log10])  # square root and base-10 logarithm of each value

# Custom element-wise function
def convert(x):
    if x > 140:
        x -= 12
    else:
        x += 12
    return x

df.transform({'Python':np.sqrt,'En':np.log10,'Math':convert}).round(1)  # apply a different operation to each column
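Since this section mentions pairing with groupby(), here is a small sketch of that pattern (the grouping column is hypothetical, not in the original post): transform broadcasts each group's aggregate back onto the group's rows.

df['class'] = ['x','y']*5  # hypothetical grouping column, added only for this demo
df['class_avg'] = df.groupby('class')['Python'].transform('mean')  # each row receives its group's mean Python score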

--- Third update ---

Part 9 — Reshaping data

df = pd.DataFrame(data = np.random.randint(0,150,size = [20,3]),
                  index = pd.MultiIndex.from_product([list('ABCDEFHIJK'),['Term 1','Term 2']]),  # two-level row index
                  columns=['Python','En','Math'])

df.unstack(level=1)  # pivot the inner row level (the term) into columns
df.stack()           # pivot the columns into an inner row level
df.mean(level=1)     # average of each subject per term
df.mean(level=0)     # average of each subject per student
df.mean()            # overall average of each subject
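A version note beyond the original post: newer pandas releases removed the level argument from the reduction methods; the equivalent there is a groupby on the index level:

df.groupby(level=1).mean()  # same result as df.mean(level=1) on older pandas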

Part 10 — Statistical methods

pandas offers a wide range of common mathematical and statistical methods. They cover most data-processing needs, operate on Series and DataFrame objects, and return their results as Series.
 
# Create the data
df = pd.DataFrame(data = np.random.randint(0,150,size = [10,3]),
                  index = list('ABCDEFGHIJ'),
                  columns=['Python','En','Math'])
df.count()  # number of non-NA values
df.max(axis = 0)  # maximum along axis 0, i.e. the maximum of each column
df.min()  # minimum, computed along axis 0 by default
df.median()  # median
df.sum()  # sum
df.mean(axis = 1)  # mean of each row
df.quantile(q = [0.2,0.5,0.9])  # quantiles
df.describe()  # summary statistics of the numeric columns: count, mean, std, min, quartiles, max
df['Python'].value_counts()  # count the occurrences of each value
df['Math'].unique()  # distinct values
df.cumsum()   # cumulative sum
df.cumprod()  # cumulative product
df.std()  # standard deviation
df.var()  # variance
df.cummin()  # cumulative minimum
df.cummax()  # cumulative maximum
df.diff()  # difference between consecutive rows
df.pct_change()  # percentage change between consecutive rows
df.cov()  # covariance between the columns
df['Python'].cov(df['Math'])  # covariance of Python and Math
df.corr()  # correlation coefficients between all columns
df.corrwith(df['En'])  # correlation of each column with a single column
# Index-label computations
df['Python'].argmin()  # position of the smallest Python value
df['Math'].argmax()  # position of the largest Math value
df.idxmax()  # index label of each column's maximum
df.idxmin()  # index label of each column's minimum

Part 11 — Sorting

# Create the data
df = pd.DataFrame(data = np.random.randint(0,150,size = [10,3]),
                  index = list('ABCDEFGHIJ'),
                  columns=['Python','En','Math'])
ran = np.random.permutation(10)
df = df.take(ran)  # shuffle the row order

df.sort_index(axis=0,ascending=True)  # sort by row index in ascending order
df.sort_index(axis=1,ascending=True)  # sort by column index in ascending order

df.sort_values(by='Python')  # sort rows by the Python column in ascending order
df.sort_values(by=['Python','Math'])  # sort by Python first, then by Math
large = df.nlargest(3,columns='Math')    # the 3 rows with the largest Math values
small = df.nsmallest(3,columns='Python') # the 3 rows with the smallest Python values
display(large,small)

Part 12 — Binning with cut and qcut

cut bins continuous data: it slices a continuous range into segments and treats each segment as a category. This conversion of continuous values into discrete ones is called binning. cut chooses the cut points so that each bin covers a roughly equal range of values.

qcut bins by quantiles, so that each bin holds roughly the same number of values.

df['py_cut'] = pd.cut(df.Python,bins=4)  # equal-width bins over the Python range
df['en_cut'] = pd.cut(df.En,bins=4)      # equal-width bins over the En range
df['q_grade'] = pd.qcut(df.Python,q = 4,  # four quantile bins
                        labels=['poor','fair','good','excellent'])  # labels attached to the bins
df['c_grade'] = pd.cut(df.En,  # data to bin
                       bins = [0,60,90,120,150],  # explicit bin edges
                       right = False,  # intervals closed on the left, open on the right
                       labels=['poor','fair','good','excellent'])  # labels attached to the bins
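To compare the two strategies (an extra check, not in the original post), count how many rows land in each bin; qcut's bins come out nearly equal in size:

df['py_cut'].value_counts()   # equal widths, usually uneven counts
df['q_grade'].value_counts()  # counts roughly equal across the four grades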
