Data analysis -- numpy and pandas

Hee hee hee hee hee 2020-11-13 01:53:58
data analysis numpy pandas

numpy and pandas

Basic NumPy operations

Create an array: a, b, and c below all hold the same values, so use whichever form you prefer.
import numpy as np
a = np.array([1, 2, 3, 4, 5])
b = np.array(range(1, 6))
c = np.arange(1, 6)
Check the type of the created array
print(type(a))  # <class 'numpy.ndarray'>
Check the data type of the elements stored in the array. Common data types include the integer, floating-point, boolean, and complex types.
print(a.dtype)  # int64 on most 64-bit platforms; the default integer type depends on the platform
Specify the data type when creating an array
d = np.array([1.9, 0, 1.3, 0], dtype=float)
print(d, d.dtype)
Change the data type of an array
e = d.astype('int64')  # accepts either a type name or a type code; the code for int64 is 'i8'
print(e, e.dtype)
Round floating-point numbers to a given number of decimal places
# Create a random array with three rows and four columns
f = np.random.random((3, 4))
# Round the floating-point values to 3 decimal places
g = np.round(f, 3)
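As a rough sketch of the dtype behaviour described above: converting floats with astype drops the fractional part instead of rounding, while np.round keeps the float dtype. The sample values in f are my own, chosen so the rounding is easy to check by eye.

```python
import numpy as np

# astype to an integer type truncates toward zero; it does not round.
d = np.array([1.9, 0, 1.3, 0], dtype=float)
e = d.astype('int64')        # equivalent to d.astype('i8') or d.astype(np.int64)

# np.round only changes the stored values; the dtype stays float64.
f = np.array([0.123456, 2.718281])   # illustrative values, not from the article
g = np.round(f, 3)

print(e, e.dtype)
print(g, g.dtype)
```

Note that 1.9 becomes 1, not 2. If rounding is what you want, one option is to call np.round first and convert afterwards.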

Array operations and index slicing

import numpy as np
data = np.random.random((3, 4))
# Reshape the data to 2 rows and 6 columns
data = data.reshape((2, 6))
print("Transposed:", data.T)
print("Transposed:", data.transpose())
print("Transposed:", data.swapaxes(1, 0))
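A small check, using a deterministic array of my own instead of random data, that the three transposition spellings above give the same result for a two-dimensional array and that the result is a view onto the original data rather than a copy:

```python
import numpy as np

data = np.arange(12).reshape((2, 6))

t1 = data.T
t2 = data.transpose()
t3 = data.swapaxes(1, 0)

print(data.shape, t1.shape)
print(np.array_equal(t1, t2), np.array_equal(t1, t3))
# A transpose shares memory with the original array.
print(np.shares_memory(data, t1))
```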
Indexing and slicing
import numpy as np
a = np.arange(12).reshape((3, 4))
# ***************** Take a single row or column *********************
# Take the 2nd row
print(a[1])
# Take the 3rd column
print(a[:, 2])
# Take the element in row 2, column 3
print(a[1, 2])
# ***************** Take consecutive rows or columns *********************
# Take the 2nd and 3rd rows
print(a[1:3])
# Take the 3rd and 4th columns
print(a[:, 2:4])
# Rows 1 and 2, column 2
print(a[0:2, 1:2])
# ***************** Take non-consecutive rows or columns *********************
# Rows 1 and 3, all columns
print(a[[0, 2], :])
# All rows, columns 1 and 4
print(a[:, [0, 3]])
# Rows (1, 3) paired with columns (1, 4): the element at row 1, column 1 and the element at row 3, column 4
print(a[[0, 2], [0, 3]])
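The last case is easy to misread, so here is a short sketch of the difference between pairing two index lists and selecting a grid of rows and columns. The variable names pairs and grid are mine.

```python
import numpy as np

a = np.arange(12).reshape((3, 4))

# Two index lists are matched element by element: positions (0, 0) and (2, 3).
pairs = a[[0, 2], [0, 3]]

# Indexing in two steps selects the full grid: rows 0 and 2, columns 0 and 3.
grid = a[[0, 2]][:, [0, 3]]

print(pairs)   # [ 0 11]
print(grid)
```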
Modifying values in an array
import numpy as np
t = np.arange(24).reshape((4, 6))
# Assign by row and column: all rows, columns 3 and 4
t[:, 2:4] = 0
# Boolean indexing
print(t < 10)
t[t < 10] = 100
t[t > 20] = 200  # note: the 100s assigned on the previous line are also > 20, so they become 200 as well
# NumPy's equivalent of the ternary operator t < 100 ? 0 : 10
t1 = np.where(t < 100, 0, 10)
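To make the np.where call above easier to follow, here is a minimal sketch on a tiny array of my own. It shows the three-argument form, the same form with an array as one of the choices, and the one-argument form that returns indices.

```python
import numpy as np

t = np.arange(6)                      # [0 1 2 3 4 5]

# np.where(condition, x, y): x where the condition holds, y elsewhere.
labels = np.where(t < 3, 0, 10)

# x and y can also be arrays, so the original values can be kept.
capped = np.where(t > 3, 3, t)

# With only a condition, np.where returns the indices where it is True.
(idx,) = np.where(t % 2 == 0)

print(labels)
print(capped)
print(idx)
```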


Fancy indexing

Fancy indexing means indexing with an array of integers.

The values in the index array are used as indices along one axis of the target array. With a one-dimensional integer array as the index, the result for a one-dimensional target is the elements at those positions;
for a two-dimensional target, it is the rows at those positions.

Unlike slicing, fancy indexing always copies the data into a new array.

import numpy as np
# Pass in an array of row indices
x = np.arange(32).reshape((8, 4))
print(x[[4, 2, 1, 7]])
# Pass in an array of negative indices (counted from the end)
print(x[[-4, -2, -1, -7]])
# Pass in multiple index arrays (use np.ix_)
# How it works: np.ix_ takes two arrays and builds an indexer for their Cartesian product.
# For [1, 5, 7, 2] and [0, 3, 1, 2] this selects every (row, column) combination,
# so the result is a 4x4 block of those rows with the columns in the given order.
print(x[np.ix_([1, 5, 7, 2], [0, 3, 1, 2])])
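The claim that fancy indexing copies while slicing does not can be checked directly. The sketch below (variable names are mine) also confirms the shape of the np.ix_ result.

```python
import numpy as np

x = np.arange(32).reshape((8, 4))

# np.ix_ selects every combination of the given rows and columns.
block = x[np.ix_([1, 5, 7, 2], [0, 3, 1, 2])]

view = x[1:3]        # basic slice: a view onto x
copy = x[[1, 2]]     # fancy index: an independent copy

view[0, 0] = -1      # also changes x[1, 0]
copy[0, 1] = -2      # x is not affected

print(block.shape)
print(x[1])
print(np.shares_memory(x, view), np.shares_memory(x, copy))
```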
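Shape modification
reshape: changes the shape without changing the data
numpy.reshape(arr, newshape, order='C')
order: 'C' reads the elements by row, 'F' by column, and 'A' in Fortran order if the array is Fortran-contiguous in memory and in C order otherwise. flatten and ravel also accept 'K', the order in which the elements appear in memory.
flat: an iterator over the array elements
flatten: returns a copy of the array flattened to one dimension; changes made to the copy do not affect the original array
ravel: returns the flattened array, as a view of the original when possible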
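A short sketch of these functions on a small array of my own. It compares the two reshape orders and shows the practical difference between flatten (always a copy) and ravel (a view here, because the array is contiguous).

```python
import numpy as np

arr = np.arange(6).reshape((2, 3))          # [[0 1 2], [3 4 5]]

by_rows = arr.reshape((3, 2), order='C')    # read and fill row by row
by_cols = arr.reshape((3, 2), order='F')    # read and fill column by column

m = np.arange(6).reshape((2, 3))
flat_copy = m.flatten()
flat_copy[0] = 99                           # m is unchanged
flat_view = m.ravel()
flat_view[1] = 77                           # m[0, 1] changes as well

# flat iterates over the elements in row-major order.
elements = [int(v) for v in arr.flat]

print(by_rows)
print(by_cols)
print(m)
```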

Joining and splitting arrays

concatenate: joins a sequence of arrays along an existing axis
stack: joins a sequence of arrays along a new axis
hstack: stacks arrays in sequence horizontally (column-wise)
vstack: stacks arrays in sequence vertically (row-wise)

split: divides an array into multiple subarrays
numpy.split(ary, indices_or_sections, axis)
hsplit: splits an array horizontally into multiple subarrays (column-wise)
vsplit: splits an array vertically into multiple subarrays (row-wise)
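The functions listed above can be tried on two small 2x2 arrays. This is only an illustrative sketch with made-up data; the shapes in the comments are the main thing to look at.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

joined_rows = np.concatenate((a, b), axis=0)   # shape (4, 2)
joined_cols = np.hstack((a, b))                # shape (2, 4)
same_as_rows = np.vstack((a, b))               # shape (4, 2)
stacked = np.stack((a, b))                     # new leading axis: shape (2, 2, 2)

# An integer splits into equal parts; a list gives the cut positions.
equal_parts = np.split(np.arange(9), 3)
cut_parts = np.split(np.arange(9), [2, 5])

# hsplit and vsplit undo hstack and vstack here.
left, right = np.hsplit(joined_cols, 2)
top, bottom = np.vsplit(joined_rows, 2)

print(stacked.shape)
print([p.tolist() for p in cut_parts])
```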

Adding and deleting array elements
resize: returns a new array with the specified shape
append: appends values to the end of an array
insert: inserts values along the given axis before the given index
delete: returns a new array with the subarrays along a given axis removed
unique: finds the unique elements of an array
arr: the input array; it is flattened if it is not one-dimensional
return_index: if True, also returns the indices in the original array at which each unique element first occurs
return_counts: if True, also returns the number of times each unique element appears in the original array
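Here is a compact sketch of these functions with sample data of my own. None of them modify the input array; each returns a new one.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

resized = np.resize(arr, (3, 3))                  # repeats the data to fill the new shape
appended_flat = np.append(arr, [7, 8, 9])         # no axis: the result is flattened
appended_row = np.append(arr, [[7, 8, 9]], axis=0)
inserted = np.insert(arr, 1, 0, axis=1)           # a column of zeros before column index 1
deleted = np.delete(arr, 1, axis=0)               # drop the second row

values, first_index, counts = np.unique(
    np.array([5, 2, 6, 2, 7, 5]), return_index=True, return_counts=True
)

print(inserted)
print(values, first_index, counts)
```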

NumPy statistical functions

numpy.amin(): computes the minimum of the array elements along the specified axis.
numpy.amax(): computes the maximum of the array elements along the specified axis.
numpy.ptp(): computes the range of the array elements (maximum - minimum).
numpy.percentile(): a percentile is a statistical measure: the value below which a given percentage of the observations fall.
numpy.median(): computes the median of the elements of array a.
numpy.mean(): returns the arithmetic mean of the array elements, computed along an axis if one is given.
numpy.average(): computes the weighted average of the array elements using the weights given in another array.
numpy.std(): the standard deviation measures how widely the data are spread around their mean.
The formula for the standard deviation is: std = sqrt(mean((x - x.mean())**2))
numpy.var(): the variance is the mean of the squared differences between each value and the mean of all values,
that is, mean((x - x.mean())**2). This is the population variance; pass ddof=1 to get the sample variance.
The standard deviation is the square root of the variance.
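A minimal sketch of the statistical functions on small arrays of my own. It uses np.min and np.max, the names under which the minimum and maximum functions are also available, and it checks the standard deviation formula quoted above against np.std.

```python
import numpy as np

a = np.array([[3, 7, 5], [8, 4, 3], [2, 4, 9]])

row_min = np.min(a, axis=1)        # minimum of each row
col_max = np.max(a, axis=0)        # maximum of each column
value_range = np.ptp(a)            # maximum - minimum over the whole array
p50 = np.percentile(a, 50)         # the 50th percentile is the median
med = np.median(a)
mean_all = np.mean(a)

x = np.array([1, 2, 3, 4])
weighted = np.average(x, weights=[4, 3, 2, 1])
std_manual = np.sqrt(np.mean((x - x.mean()) ** 2))

print(row_min, col_max, value_range)
print(p50, med, mean_all, weighted)
print(std_manual, np.std(x), np.var(x))
```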


Creating the Series data type

Pandas is a powerful toolkit for analyzing structured data. It is built on NumPy (which provides high-performance matrix operations), is used for data mining and data analysis, and also provides data cleaning functionality.
Key tool one: Series
A one-dimensional array-like object made up of a set of data (of any NumPy data type) and an associated set of data labels (the index). A simple Series can also be created from the data alone.
Key tool two: DataFrame
A tabular data structure in Pandas that contains an ordered set of columns, each of which can hold a different value type (numbers, strings, booleans, and so on). A DataFrame has both a row index and a column index, and can be seen as a dictionary of Series.

Common data types:
- One-dimensional: Series
- Two-dimensional: DataFrame
- Three-dimensional: Panel … (no longer available in current pandas releases)
- Four-dimensional: Panel4D … (no longer available in current pandas releases)
- N-dimensional: PanelND … (no longer available in current pandas releases)

Series is the one-dimensional data structure in Pandas. It is similar to a Python list and to a NumPy ndarray. The differences are that a Series is always one-dimensional, can store data of different types, and carries an index with a label for each element.
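Since this section describes how a Series is built but does not show it, here is a minimal sketch of three common ways to create one. The data and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a list: a default integer index (0, 1, 2, ...) is generated.
# Mixed value types are stored with the generic object dtype.
s_list = pd.Series([1, 'two', 3.0])

# From a NumPy array with explicit labels.
s_labeled = pd.Series(np.arange(3), index=['a', 'b', 'c'])

# From a dictionary: the keys become the index.
s_dict = pd.Series({'x': 10, 'y': 20})

print(s_list)
print(s_labeled['b'])
print(s_dict)
```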

Basic Series operations

No.  Attribute or method  Description
1    axes                 Returns a list of the row axis labels.
2    dtype                Returns the data type (dtype) of the object.
3    empty                Returns True if the Series is empty.
4    ndim                 Returns the number of dimensions of the underlying data, by definition 1.
5    size                 Returns the number of elements in the underlying data.
6    values               Returns the Series as an ndarray.
7    head()               Returns the first n rows.
8    tail()               Returns the last n rows.
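The attributes and methods in the table can be tried on a small Series. This is just an illustrative sketch with sample data of my own.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60], index=list('abcdef'))

print(s.axes)      # a list containing the row index
print(s.dtype)
print(s.empty)
print(s.ndim)
print(s.size)
print(s.values)    # the data as a NumPy ndarray
print(s.head(2))   # first 2 rows (5 if n is omitted)
print(s.tail(2))   # last 2 rows (5 if n is omitted)
```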
Series calculation example
import pandas as pd
import numpy as np
import string
s1 = pd.Series(np.arange(5), index=list(string.ascii_lowercase[:5]))  # s1.index = [a, b, c, d, e], s1.values = [0 1 2 3 4]
s2 = pd.Series(np.arange(2, 8), index=list(string.ascii_lowercase[2:8]))  # s2.index = [c, d, e, f, g, h], s2.values = [2 3 4 5 6 7]
# ***************** Operations are aligned on the index; labels present in only one Series give NaN
# Addition: missing value + real value = missing value
print(s1 + s2)
# Subtraction
print(s1 - s2)
# Multiplication
print(s1 * s2)
# Division
print(s1 / s2)
# Median
print(s1.median())
# Sum
print(s1.sum())
# Maximum
print(s1.max())
# Minimum
print(s1.min())
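The index alignment described in the comments can be seen in a short sketch. It also shows one way to avoid the NaN results, the add method with fill_value, which is not used in the article and is included here only as a suggestion.

```python
import string

import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(5), index=list(string.ascii_lowercase[:5]))
s2 = pd.Series(np.arange(2, 8), index=list(string.ascii_lowercase[2:8]))

# Only the labels c, d, and e exist in both Series; every other label gives NaN.
total = s1 + s2

# fill_value treats a label missing from one side as 0 before adding.
filled = s1.add(s2, fill_value=0)

print(total)
print(filled)
```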
The where method
# ********** Series.where behaves differently from np.where: it keeps the values where the condition is True and replaces the rest
s1 = pd.Series(np.arange(5), index=list(string.ascii_lowercase[:5]))
# print(s1.where(s1 > 3))  # without a replacement value, the other elements become NaN
# Elements that are not greater than 3 are replaced with 10
print(s1.where(s1 > 3, 10))
# Elements that are greater than 3 are replaced with 10
print(s1.mask(s1 > 3, 10))
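To make the contrast with NumPy concrete, here is a side-by-side sketch on the same data: Series.where replaces where the condition is False, Series.mask replaces where it is True, and np.where(cond, x, y) picks x where it is True.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5))                # 0, 1, 2, 3, 4

kept_large = s.where(s > 3, 10)            # values not > 3 are replaced with 10
kept_small = s.mask(s > 3, 10)             # values > 3 are replaced with 10
numpy_style = np.where(s > 3, 10, s)       # 10 where True, original value elsewhere
with_nan = s.where(s > 3)                  # no replacement given: NaN elsewhere

print(kept_large.tolist())
print(kept_small.tolist())
print(numpy_style.tolist())
print(with_nan)
```

In other words, np.where(cond, 10, s) matches s.mask(cond, 10), not s.where(cond, 10).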

DataFrame data

A Series has only a row index, whereas a DataFrame has both a row index and a column index.
The row index labels the rows (the horizontal index) and is called index.
The column index labels the columns (the vertical index) and is called columns.

Creating DataFrame data

Method 1: create from a list

li = [
    [1, 2, 3, 4],
    [2, 3, 4, 5]
]
# A DataFrame has two indexes: the row index (axis 0, axis=0) and the column index (axis 1, axis=1)
d1 = pd.DataFrame(data=li, index=['A', 'B'], columns=['views', 'loves', 'comments', 'transfers'])

Method 2: create from a NumPy array

narr = np.arange(8).reshape(2, 4)
# A DataFrame has two indexes: the row index (axis 0, axis=0) and the column index (axis 1, axis=1)
d2 = pd.DataFrame(data=narr, index=['A', 'B'], columns=['views', 'loves', 'comments', 'transfers'])

Method 3: create from a dictionary

# Named data_dict rather than dict so that the built-in dict is not shadowed
data_dict = {
    'views': [1, 2],
    'loves': [2, 3],
    'comments': [3, 4]
}
d3 = pd.DataFrame(data=data_dict, index=['Vermicelli', 'fans'])
DataFrame basic properties and summary information

a) Basic properties
df.shape  # number of rows and columns
df.dtypes  # data type of each column
df.ndim  # number of dimensions
df.index  # row index
df.columns  # column index
df.values  # the underlying values, as a two-dimensional ndarray

b) Summary information
df.head(3)  # show the first rows, 5 by default
df.tail(3)  # show the last rows, 5 by default
df.info()  # overview: number of rows and columns, index, non-null count per column, column types, memory usage
df.describe()  # quick summary statistics: count, mean, standard deviation, maximum, quartiles, minimum, and so on
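The properties above assume an existing DataFrame called df. A self-contained sketch, with a small made-up frame, might look like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(8).reshape(2, 4),
    index=['A', 'B'],
    columns=['views', 'loves', 'comments', 'transfers'],
)

print(df.shape)        # (rows, columns)
print(df.dtypes)       # one dtype per column
print(df.ndim)
print(list(df.index), list(df.columns))
print(df.values)       # two-dimensional ndarray

print(df.head(1))
df.info()              # prints its overview directly
summary = df.describe()
print(summary)
```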

File reading and writing

Writing a CSV file

df.to_csv('doc/csvFile.csv', index=False)  # index=False: do not store the row index

Reading a CSV file

df2 = pd.read_csv('doc/csvFile.csv')

Writing an Excel file

df.to_excel("/tmp/excelFile.xlsx", sheet_name="Provincial Statistics")
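As a sketch of the CSV round trip, the example below writes a small made-up frame to a temporary directory (so it does not depend on the doc/ folder existing) and reads it back. The Excel call is left out because to_excel needs an extra engine package such as openpyxl to be installed.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'views': [1, 2], 'loves': [2, 3], 'comments': [3, 4]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'csvFile.csv')   # hypothetical file name
    df.to_csv(path, index=False)                  # do not write the row index
    df2 = pd.read_csv(path)

print(df2)
print(df2.equals(df))
```

Without index=False the row index is written as an extra unnamed column, which then shows up as a data column when the file is read back.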
The groupby function

pandas provides a flexible and efficient groupby facility:
1). It lets you slice, dice, and summarize data sets in a natural way.
2). It splits a pandas object by one or more keys (which can be functions, arrays, or DataFrame column names).
3). It computes group summary statistics such as count, mean, and standard deviation, or applies user-defined functions.
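The three points above can be sketched on a tiny frame. The column names city and sales and the data are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['A', 'B', 'A', 'B'],
    'sales': [1, 2, 3, 4],
})

# Split by a column name, then compute several built-in summary statistics.
summary = df.groupby('city')['sales'].agg(['count', 'mean', 'sum'])

# A user-defined function: the spread (max - min) within each group.
spread = df.groupby('city')['sales'].agg(lambda x: x.max() - x.min())

print(summary)
print(spread)
```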

This article was written by [Hee hee hee hee hee]. If you repost it, please include a link to the original. Thank you.
