Use of Python pandas, explained in detail

Tong Dashuai 2021-10-29 12:18:23

pandas is a Python module built on top of NumPy.

Python has long done well at data munging and preparation, but less well at data analysis and modeling. pandas helps fill this gap, letting you carry out an entire data analysis workflow in Python without having to switch to a more domain-specific language such as R. Combined with the excellent Jupyter toolkit and other libraries, the Python environment for data analysis excels in performance, productivity, and collaboration.
pandas is the core data analysis library for Python. It provides fast, flexible, and expressive data structures designed to make working with relational or labeled data simple and direct. pandas is an essential high-level tool for doing data analysis in Python.
pandas' main data structures are Series (one-dimensional data) and DataFrame (two-dimensional data). These two structures are enough to handle most use cases in finance, statistics, social science, engineering, and other fields. Working with data generally falls into a few stages: data wrangling and cleaning, data analysis and modeling, and data visualization and reporting. pandas is an ideal tool for all of them.
Introduction to the environment
Code tool: Jupyter Notebook
Python version: Python 3.8.6
OS: Windows 10
Part 1: pandas installation
Open a terminal and run: pip install -i https://pypi.doubanio.com/simple/  --trusted-host pypi.doubanio.com pandas
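To verify the installation, check the installed version (a quick sketch; any reasonably recent pandas version works for this tutorial):
import pandas as pd
print(pd.__version__) # prints the installed pandas version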
Part 2: Data structures
Section 1: Series
When creating a Series from a list, pandas automatically generates an integer index by default; you can also specify an index.
import numpy as np
import pandas as pd

l = [0,1,7,9,np.nan,None,1024,512]
# Whether it is NumPy's NaN or Python's None, pandas treats both as the missing value NaN
s1 = pd.Series(data = l) # pandas adds a default integer index
s2 = pd.Series(data = l,index = list('abcdefhi'),dtype='float32') # specify the row index
# Create from a dict: the keys become the row index
s3 = pd.Series(data = {'a':99,'b':137,'c':149},name = 'Python_score')
display(s1,s2,s3)
Section 2: DataFrame
A DataFrame is a two-dimensional labeled data structure whose columns can hold different types, similar to an Excel sheet, a SQL table, or a dict of Series objects.
import numpy as np
import pandas as pd
# index is the row index; the dict keys become the column index. This creates a 3*3 DataFrame (a 2-D table)
df1 = pd.DataFrame(data = {'Python':[99,107,122],'Math':[111,137,88],'En':[68,108,43]}, # keys become the column index
                   index = ['Zhang San','Li Si','Michael']) # row index
df2 = pd.DataFrame(data = np.random.randint(0,151,size = (5,3)),
                   index = ['Danial','Brandon','softpo','Ella','Cindy'], # row index
                   columns=['Python','Math','En']) # column index
display(df1,df2)

Part 3: Viewing data
View commonly used DataFrame attributes, and get an overview and statistics of a DataFrame.
import numpy as np
import pandas as pd
df = pd.DataFrame(data = np.random.randint(0,151,size=(150,3)),
                  index = None, # row index defaults to integers
                  columns=['A','B','C']) # column index
df.head(10) # show the first ten rows (the default is five)
df.tail(10) # show the last ten rows
df.shape # number of rows and columns
df.dtypes # data type of each column
df.index # row index
df.values # the underlying values, a 2-D NumPy array
df.describe() # summary statistics of the numeric columns: count, mean, std, min, quartiles, max
df.info() # column index, data types, non-null counts, and memory usage

Part 4: Data input and output

Section 1: CSV

df = pd.DataFrame(data = np.random.randint(0,50,size = [50,5]), # salary data
                  columns=['IT','Chemical','Biology','Teacher','Soldier'])
# Save to a file named salary.csv
df.to_csv('./salary.csv',
          sep = ';', # separator
          header = True, # whether to save the column index
          index = True) # whether to save the row index
# Load
pd.read_csv('./salary.csv',
            sep = ';', # the default is a comma
            header = [0], # specify the column index
            index_col=0) # specify the row index
# Load
pd.read_table('./salary.csv', # similar to read_csv; reads delimited text files
              sep = ';',
              header = [0], # specify the column index
              index_col=1) # specify the row index: the IT column becomes the row index

Section 2: Excel

pip install xlrd -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install xlwt -i https://pypi.tuna.tsinghua.edu.cn/simple
import numpy as np
import pandas as pd
df1 = pd.DataFrame(data = np.random.randint(0,50,size = [50,5]), # salary data
                   columns=['IT','Chemical','Biology','Teacher','Soldier'])
df2 = pd.DataFrame(data = np.random.randint(0,50,size = [150,3]), # exam scores for computer subjects
                   columns=['Python','Tensorflow','Keras'])
# Save to the current path with the file name salary.xls
df1.to_excel('./salary.xls',
             sheet_name = 'salary', # name of the worksheet inside the Excel file
             header = True, # whether to save the column index
             index = False) # do not save the row index
pd.read_excel('./salary.xls',
              sheet_name=0, # which worksheet to read; defaults to the first
              header = 0, # use the first row as the column index
              names = list('ABCDE'), # replace the column labels
              index_col=1) # specify the row index: column B becomes the row index
# Save multiple worksheets in one Excel file
with pd.ExcelWriter('./data.xlsx') as writer:
    df1.to_excel(writer,sheet_name='salary',index = False)
    df2.to_excel(writer,sheet_name='score',index = False)
pd.read_excel('./data.xlsx',
              sheet_name='salary') # read the worksheet with this name from the Excel file
Section 3: SQL
pip install sqlalchemy -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install pymysql -i https://pypi.tuna.tsinghua.edu.cn/simple
import numpy as np
import pandas as pd
# SQLAlchemy is an open-source Python library that provides a SQL toolkit and an object-relational mapper (ORM)
from sqlalchemy import create_engine
df = pd.DataFrame(data = np.random.randint(0,50,size = [150,3]), # exam scores for computer subjects
                  columns=['Python','Tensorflow','Keras'])
# Database connection
conn = create_engine('mysql+pymysql://root:[email protected]/pandas?charset=UTF8MB4')
# Save to the database
df.to_sql('score', # table name in the database
          conn, # database connection
          if_exists='append') # if the table already exists, append the data
# Load from the database
pd.read_sql('select * from score limit 10', # SQL query
            conn, # database connection
            index_col='Python') # specify the column to use as the row index
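With a SQLAlchemy engine, read_sql also accepts a bare table name instead of a query (a minimal sketch, assuming the conn created above):
pd.read_sql('score', conn) # reads the entire score table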

 

---------------------------------------- First update ----------------------------------------

 

Part 5: Data selection

Section 1: Data retrieval

First create some data:

df = pd.DataFrame(data = np.random.randint(0,150,size = [10,3]), # exam scores for computer subjects
                  index = list('ABCDEFGHIJ'), # row labels
                  columns=['Python','Tensorflow','Keras'])

df.Python # view the data in this column
df['Python'] # view the data in this column
df[['Python','Keras']] # get multiple columns of data
df[1:3] # row slicing; works like ordinary list/NumPy slicing

Retrieving data with loc[]: loc selects by row and column labels.

df.loc[['A','B']] # select rows by label
df.loc[['A','B'],['Python','Keras']] # select data matching both row and column labels
df.loc[:,['Python','Keras']] # keep all rows
df.loc[::2,['Python','Keras']] # take every other row (step 2)
df.loc['A',['Python','Keras']] # select the data for a single row label

Retrieving data with iloc[]: iloc selects by the integer positions of rows and columns.

df.iloc[2:4] # integer row slicing, similar to NumPy; works like ordinary list slicing
df.iloc[1:3,1:2] # slice rows and columns by integer position
df.iloc[1:3] # row slice
df.iloc[:,0:1] # column slice

Boolean indexing

cond1 = df.Python > 100 # test whether the Python score is greater than 100; returns a boolean Series
df[cond1] # return all subjects for the rows whose Python score is greater than 100
cond2 = (df.Python > 50) & (df['Keras'] > 50) # & is the AND operation
df[cond2] # return the rows where both Python and Keras are greater than 50
df[df > 50] # keep the values in the DataFrame that satisfy the condition; the rest are returned as NaN
df[df.index.isin(['A','C','F'])] # isin tests membership in the given array and returns boolean values
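The same conditions can also be written as query strings; df.query is an equivalent spelling of boolean indexing (a brief sketch using the df above):
df.query('Python > 100') # equivalent to df[cond1]
df.query('Python > 50 and Keras > 50') # equivalent to df[cond2]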

Part 6: Data integration

Section 1: concat data concatenation

# Build two more DataFrames
df1 = pd.DataFrame(np.random.randint(1,151,size=10),
                   index = list('ABCDEFGHIJ'),
                   columns=['Science'])
df2 = pd.DataFrame(data = np.random.randint(0,150,size = [10,3]),
                   index = list('KLMNOPQRST'),
                   columns=['Python','Tensorflow','Keras'])
pd.concat([df,df2],axis=0) # concatenate df2 vertically below df
pd.concat([df,df1],axis=1) # concatenate df1 horizontally to the right of df
df.append(df1) # append df1 after df (note: append was later deprecated in favor of pd.concat)

Section 2: Insert

insert() inserts a column

Note: when inserting a column with insert(), the length of the inserted values must equal the number of rows (a scalar is broadcast to every row).

# Insert a column named C++
df.insert(loc=1,
          column='C++',
          value=np.random.randint(0,151,size=(10)))
df.insert(loc = 1,column='Python3.8',value=2048) # a scalar is broadcast to all rows
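As a quick check of the length rule noted above, a minimal sketch (assuming the 10-row df from this part):
df.insert(loc=1,column='Const',value=0) # a scalar is broadcast to every row
try:
    df.insert(loc=1,column='Bad',value=[1,2,3]) # length 3 does not match the 10 rows
except ValueError as e:
    print(e) # pandas reports the length mismatch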

Section 3: Data joins (SQL-style join)

Merging (merge) or joining (join) datasets combines data by linking rows through one or more keys. These operations are at the core of relational databases. pandas' merge function is the main entry point for performing join operations on datasets.
# First create two sets of data
df1 = pd.DataFrame(data = {'sex':np.random.randint(0,2,size=6),'name':['Kyushu','Nine weeks','Nineweek','Mrs Tong','Small A','Small C']})
df2 = pd.DataFrame(data = {'score':np.random.randint(90,151,size=6),'name':['Kyushu','Nine weeks','Nineweek','Mrs Tong','Small A','Small Ming']})

    

pd.merge(df1,df2)
# (inner join) merge() keeps only the rows whose key appears in both frames; non-matching rows are dropped
pd.merge(df1,df2,how='left') # left join
pd.merge(df1,df2,how='right') # right join
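The calls above let merge find the shared name column automatically; the key can also be given explicitly with on, or with left_on/right_on when the two frames name it differently (a sketch; the renamed frame is hypothetical):
pd.merge(df1,df2,on='name') # join explicitly on the name column
left = df1.rename(columns={'name':'stu_name'}) # hypothetical frame with a differently named key
pd.merge(left,df2,left_on='stu_name',right_on='name') # join keys with different names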

 

---------------------------------------- Second update ----------------------------------------

Part 7: Data cleaning

Section 1: Filtering duplicate data with duplicated

duplicated scans rows from top to bottom and returns True for any row whose values have appeared before; by default the first occurrence is marked False and later duplicates True.

# Create a score data 
df2 = pd.DataFrame(data={'Name':['Kyushu','Mrs Tong','Nineweek',None,np.nan,'Mrs Tong'],'Sex':[0,1,0,1,0,1],'Score':[89,100,67,90,98,100]})

df2.duplicated() # check for duplicate rows; returns booleans
df2.duplicated().sum() # count the duplicate rows
df2[df2.duplicated()] # show the duplicate rows
df2[df2.duplicated()==False] # show the non-duplicate rows
df2.drop_duplicates() # drop duplicates (does not modify df2 itself)
df2.drop_duplicates(inplace=True) # drop duplicates in place (modifies df2 itself)
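duplicated and drop_duplicates also take subset and keep arguments to control which columns are compared and which occurrence survives (a sketch, assuming the df2 above):
df2.duplicated(subset=['Name']) # compare only the Name column when testing for duplicates
df2.drop_duplicates(subset=['Name'],keep='last') # keep the last occurrence instead of the first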

Section 2: Filtering null data

df2.isnull() # test for null values (detects both NaN and None)
df2.dropna(how = 'any') # drop rows containing nulls (does not modify df2 itself)
df2.dropna(how = 'any',inplace=True) # drop rows containing nulls in place (modifies df2 itself)
df2.fillna(value='Small A') # fill null values (does not modify df2 itself)
df2.fillna(value='Small A',inplace=True) # fill null values in place (modifies df2 itself)
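fillna is not limited to a constant; it can propagate earlier values forward or fill with a statistic (a sketch, assuming df2 still contains nulls):
df2.fillna(method='ffill') # propagate the previous valid value forward
df2['Score'].fillna(df2['Score'].mean()) # fill numeric gaps with the column mean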

Section 3: Dropping specified rows or columns

del df2['Sex'] # delete a column directly
df2.drop(labels = ['Score'],axis = 1) # drop the specified column
df2.drop(labels = [0,1,5],axis = 0) # drop the specified rows

The filter function: select the data to keep and filter out the rest.

df2.filter(items=['Name', 'Score']) # keep only the 'Name' and 'Score' columns
df2.filter(like='S',axis = 1) # keep columns whose label contains 'S' (axis=1 means columns, axis=0 means rows)
df2.filter(regex='S$', axis=1) # filter with a regular expression: keep columns whose label ends with 'S'

Part 8: Data transformation

Section 1: rename and replace to transform labels and values

# Change the row and column indexes
df2.rename(index = {0:10,1:11},columns={'Name':'StName'}) # replace row index 0 with 10 and 1 with 11; rename column Name to StName
# Replace element values
df2.replace(100,102) # replace every 100 with 102
df2.replace([89,67],78) # replace every 89 and 67 with 78
df2.replace({'Kyushu':'JZ',None:'Kyushu'}) # replace according to dict key-value pairs
df2.replace({'Sex':1},1024) # replace every 1 in the Sex column with 1024

Section 2: apply and transform

What they have in common: both can run computations over the columns of a DataFrame, and both are often used together with the groupby() aggregation method (covered in a later update).

Differences: apply accepts arbitrary functions, including custom ones and aggregations that reduce a column to a single value, such as sum, max, or min.

transform requires the function to return output of the same length as its input, so it cannot aggregate; on a DataFrame it works on one column at a time, which also means a transform function cannot combine several columns.

# First create some data
df = pd.DataFrame(data = np.random.randint(0,150,size = [10,3]),index = list('ABCDEFGHIJ'),columns=['Python','En','Math'])

df['Python'].apply(lambda x:True if x > 50 else False) # mark whether each Python score exceeds 50
df.apply(lambda x : x.median(),axis = 0) # median of each column

# Custom aggregation function
def avg(x):
    return (x.mean(),x.max(),x.min(),x.var().round(1))
df.apply(avg,axis=0) # per column: mean, max, min, and variance rounded to one decimal place

# Multiple computations on a single column
df['Python'].transform([np.sqrt,np.log10]) # apply square root and log10 to one column

# Custom element-wise function
def convert(x):
    if x > 140:
        x -= 12
    else:
        x += 12
    return x
df.transform({'Python':np.sqrt,'En':np.log10,'Math':convert}).round(1) # apply a different operation to each column
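A minimal sketch of the aggregation difference described above, using the same df: apply may reduce a column to a single value, while transform with the same reducing function raises, because its result must match the input length.
df['Python'].apply(np.mean) # fine: apply may aggregate to a scalar
try:
    df['Python'].transform(np.mean) # transform must return output the same length as its input
except ValueError as e:
    print(e) # pandas rejects a function that does not transform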


---------------------------------------- Third update ----------------------------------------

Part 9: Data reshaping

df = pd.DataFrame(data = np.random.randint(0,150,size = [20,3]),
                  index = pd.MultiIndex.from_product([list('ABCDEFHIJK'),['Term 1','Term 2']]), # multi-level row index
                  columns=['Python','En','Math'])

df.unstack(level=1) # pivot a row index level into columns
df.stack() # pivot the columns into a row index level
df.mean(level=1) # average of each subject per term
df.mean(level=0) # average of each subject per student
df.mean() # overall average of each subject
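The level argument used above is shorthand for a groupby over that index level; the equivalent groupby spelling (which later pandas versions require) is:
df.groupby(level=1).mean() # same result as df.mean(level=1)
df.groupby(level=0).mean() # same result as df.mean(level=0)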

Part 10: Statistical methods

pandas provides a variety of common mathematical and statistical methods. They cover most data-processing needs, operate on both Series and DataFrame, and return results in Series form.
# Create data 
df = pd.DataFrame(data = np.random.randint(0,150,size = [10,3]),
                  index = list('ABCDEFGHIJ'),
                  columns=['Python','En','Math'])
df.count() # number of non-NA values
df.max(axis = 0) # maximum along axis 0, i.e. the maximum of each column
df.min() # by default, the minimum along axis 0
df.median() # median
df.sum() # sum
df.mean(axis = 1) # mean of each row
df.quantile(q = [0.2,0.5,0.9]) # quantiles
df.describe() # summary statistics of the numeric columns: count, mean, std, min, quartiles, max
df['Python'].value_counts() # count occurrences of each value
df['Math'].unique() # distinct values
df.cumsum() # cumulative sum
df.cumprod() # cumulative product
df.std() # standard deviation
df.var() # variance
df.cummin() # cumulative minimum
df.cummax() # cumulative maximum
df.diff() # difference between consecutive values
df.pct_change() # percentage change
df.cov() # covariance matrix of the columns
df['Python'].cov(df['Math']) # covariance between Python and Math
df.corr() # correlation coefficients of all columns
df.corrwith(df['En']) # correlation of each column with En
# Index-label based methods
df['Python'].argmin() # position of the minimum Python value
df['Math'].argmax() # position of the maximum Math value
df.idxmax() # index label of the maximum value in each column
df.idxmin() # index label of the minimum value in each column
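Several of these statistics can be requested in one call with agg (a brief sketch using the df above):
df.agg(['mean','std','max']) # multiple statistics for every column
df.agg({'Python':'mean','Math':['min','max']}) # different statistics per column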

Part 11: Sorting

# Create data 
df = pd.DataFrame(data = np.random.randint(0,150,size = [10,3]),
                  index = list('ABCDEFGHIJ'),
                  columns=['Python','En','Math'])
ran = np.random.permutation(10)
df = df.take(ran) # shuffle the row order

df.sort_index(axis=0,ascending=True) # sort by row index in ascending order
df.sort_index(axis=1,ascending=True) # sort by column index in ascending order

df.sort_values(by='Python') # sort by the values of the Python column in ascending order
df.sort_values(by=['Python','Math']) # sort by Python first, then by Math
large = df.nlargest(3,columns='Math') # sort by Math and return the 3 largest rows
small = df.nsmallest(3,columns='Python') # sort by Python and return the 3 smallest rows
display(large,small)

Part 12: Binning with cut and qcut

The cut function bins data: it cuts a continuous variable into several segments and treats each segment as one category. This process of turning continuous values into discrete ones is called binning. cut splits the data by value, so that each bin covers a roughly equal range.

qcut splits the variable by quantiles, trying to ensure that each bin contains roughly the same number of values.

df['py_cut'] = pd.cut(df.Python,bins=4) # bin by value range (equal-width bins)
df['en_cut'] = pd.cut(df.En,bins=4) # also equal-width bins over the En range
df['q_rating'] = pd.qcut(df.Python,q = 4, # quartiles: four bins with roughly equal counts
                         labels=['Poor','Fair','Good','Excellent']) # labels for the bins
df['c_rating'] = pd.cut(df.En, # data to bin
                        bins = [0,60,90,120,150], # bin break points
                        right = False, # left-closed, right-open intervals
                        labels=['Poor','Fair','Good','Excellent']) # labels for the bins
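value_counts on the two label columns gives a quick check of the difference described above (a sketch, assuming the df just built): the qcut bins hold roughly equal numbers of rows, while the cut bins follow wherever the En values happen to fall.
df['q_rating'].value_counts() # roughly equal counts per bin
df['c_rating'].value_counts() # counts depend on the distribution of En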

 

Copyright notice
This article was created by [Tong Dashuai]. Please include a link to the original when reposting. Thank you.
https://pythonmana.com/2021/10/20211013223741006Z.html
