Data analysis -- numpy and pandas

Hee hee hee hee hee 2020-11-13 01:53:58
data analysis numpy pandas

numpy and pandas

Basic numpy operations

Create an array: a, b, and c all hold the same array, so use whichever form you prefer.
import numpy as np
a = np.array([1, 2, 3, 4, 5])
b = np.array(range(1, 6))
c = np.arange(1, 6)
print(type(a))
print(type(b))
print(type(c))
Check the data type (dtype) of the elements stored in the array
print(a.dtype)  # Usually int64 on a 64-bit platform, where numpy's default integer is 64 bits wide
Specify the data type of the created array
d = np.array([1.9, 0, 1.3, 0], dtype=float)
print(d, d.dtype)
Change the data type of the array
e = d.astype('int64')  # Accepts a type name or a type code; 'int64' is the same as 'i8'
print(e, e.dtype)
Modify the number of decimal places for floating-point numbers
# Create a random array with 3 rows and 4 columns
f = np.random.random((3, 4))
print(f)
# Round the floating-point numbers to 3 decimal places
g = np.round(f, 3)
print(g)

Array operations and index slicing

Transposition
import numpy as np
data = np.random.random((3, 4))
# Reshape the data to 2 rows and 6 columns
data = data.reshape((2, 6))
print(data)
print("Transposed:", data.T)
print("Transposed:", data.transpose())
print("Transposed:", data.swapaxes(1, 0))
Index and slice
import numpy as np
a = np.arange(12).reshape((3, 4))
print(a)
# ***************** Take a single row or column *********************
# Take the 2nd row
print(a[1])
# Take the 3rd column
print(a[:, 2])
# Take the element in row 2, column 3
print(a[1, 2])
# ***************** Take consecutive rows or columns *********************
# Take rows 2 and 3
print(a[1:3])
# Take columns 3 and 4
print(a[:, 2:4])
# Rows 1 and 2, column 2
print(a[0:2, 1:2])
# ***************** Take non-consecutive rows or columns *********************
# Rows 1 and 3, all columns
print(a[[0, 2], :])
# All rows, columns 1 and 4
print(a[:, [0, 3]])
# Rows (1, 3) paired with columns (1, 4): the element at row 1, column 1 and the element at row 3, column 4
print("*"*10)
print(a[[0, 2], [0, 3]])
Modification of values in an array
import numpy as np
# Perform row and column changes
t = np.arange(24).reshape((4, 6))
print(t)
# All rows, columns 3 and 4
t[:, 2:4] = 0
print(t)
# Boolean index
print(t < 10)
# Set every element smaller than 10 to 100
t[t < 10] = 100
print(t)
t[t > 20] = 200
print(t)
# numpy's equivalent of the ternary operator t < 100 ? 0 : 10
t1 = np.where(t < 100, 0, 10)
print(t)
print(t1)
Fancy index

Fancy indexing means indexing with an array of integers.

A fancy index uses the values of the index array as indices along one axis of the target array. With a one-dimensional integer array as the index, the result for a one-dimensional target is the elements at those positions;
for a two-dimensional target, it is the rows at those positions.

Unlike slicing, fancy indexing always copies the data into a new array.
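
The difference between a slice and a fancy index can be checked directly. The following is a minimal sketch (the array and variable names are only illustrative): it writes through a slice and through a fancy-indexed result, then looks at what happened to the original array.

```python
import numpy as np

x = np.arange(10)

# A slice is a view: writing through it changes the original array.
view = x[2:5]
view[0] = 100

# A fancy index produces a copy: writing to it leaves the original untouched.
copy = x[[2, 3, 4]]
copy[1] = -1

print(x)  # x[2] was changed through the slice, x[3] was not changed through the copy
print(np.shares_memory(x, view), np.shares_memory(x, copy))
```

np.shares_memory is used here only as a convenient way to confirm which result still points at the original buffer.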

import numpy as np
# Pass in an array of positive indices
x = np.arange(32).reshape((8, 4))
print(x)
print(x[[4, 2, 1, 7]])
# Pass in an array of negative indices (counted from the end)
x = np.arange(32).reshape((8, 4))
print(x[[-4, -2, -1, -7]])
# Pass in several index arrays (this requires np.ix_)
"""
How it works: np.ix_ takes two arrays and builds the index mapping of their Cartesian product.
For the arrays [1, 5, 7, 2] and [0, 3, 1, 2] the Cartesian product is
(1,0),(1,3),(1,1),(1,2);(5,0),(5,3),(5,1),(5,2);(7,0),(7,3),(7,1),(7,2);(2,0),(2,3),(2,1),(2,2);
"""
x = np.arange(32).reshape((8, 4))
print(x)
print(x[np.ix_([1, 5, 7, 2], [0, 3, 1, 2])])
Shape modification
reshape: changes the shape without changing the data
numpy.reshape(arr, newshape, order='C')
order: 'C' -- row-major, 'F' -- column-major, 'A' -- column-major if the array is Fortran-contiguous in memory, row-major otherwise. flatten and ravel additionally accept 'K' -- the order in which the elements occur in memory.
flat: an iterator over the array elements
flatten: returns a copy of the array; changes made to the copy do not affect the original array
ravel: returns the flattened array, as a view of the original whenever possible
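
As a rough illustration of the functions listed above, the sketch below reshapes a small array in row-major and column-major order, iterates with flat, and contrasts flatten (always a copy) with ravel (a view when possible). The variable names are made up for the example.

```python
import numpy as np

arr = np.arange(6)

# reshape keeps the data and changes the shape; order decides how it is filled.
by_rows = arr.reshape((2, 3), order='C')  # filled row by row
by_cols = arr.reshape((2, 3), order='F')  # filled column by column
print(by_rows)
print(by_cols)

grid = np.arange(6).reshape((2, 3))

# flat iterates over every element in row-major order.
print([int(v) for v in grid.flat])

# flatten always copies, so editing the result leaves grid untouched.
flat_copy = grid.flatten()
flat_copy[0] = 99

# ravel returns a view when it can, so editing the result changes grid here.
flat_view = grid.ravel()
flat_view[1] = 77
print(grid)
```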

Joining and splitting arrays

Joining

concatenate: joins a sequence of arrays along an existing axis
stack: joins a sequence of arrays along a new axis
hstack: stacks the arrays in a sequence horizontally (column-wise)
vstack: stacks the arrays in a sequence vertically (row-wise)
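
The four joining functions are easiest to tell apart by the shape of their results. A small sketch, assuming two illustrative 2x2 arrays a and b:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

rows_joined = np.concatenate((a, b), axis=0)  # existing axis 0: shape (4, 2)
cols_joined = np.concatenate((a, b), axis=1)  # existing axis 1: shape (2, 4)
stacked = np.stack((a, b), axis=0)            # new leading axis: shape (2, 2, 2)
side_by_side = np.hstack((a, b))              # column-wise, like axis=1 above
on_top = np.vstack((a, b))                    # row-wise, like axis=0 above

for name, result in [("concatenate axis=0", rows_joined),
                     ("concatenate axis=1", cols_joined),
                     ("stack", stacked),
                     ("hstack", side_by_side),
                     ("vstack", on_top)]:
    print(name, result.shape)
```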

Splitting

split: divides an array into multiple sub-arrays
numpy.split(ary, indices_or_sections, axis)
hsplit: splits an array horizontally into multiple sub-arrays (column-wise)
vsplit: splits an array vertically into multiple sub-arrays (row-wise)
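
A short sketch of the splitting functions, using an illustrative 4x4 array. It shows both forms of indices_or_sections: an integer (number of equal parts) and a list (positions at which to cut).

```python
import numpy as np

grid = np.arange(16).reshape((4, 4))

# An integer splits into that many equal parts along the axis.
halves = np.split(grid, 2, axis=0)

# A list gives the cut positions: columns [0:1], [1:3], and [3:4].
pieces = np.split(grid, [1, 3], axis=1)

left, right = np.hsplit(grid, 2)   # column-wise split
top, bottom = np.vsplit(grid, 2)   # row-wise split

print([h.shape for h in halves])
print([p.shape for p in pieces])
print(left.shape, top.shape)
```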

Adding and deleting array elements
resize: returns a new array with the specified shape
append: adds values to the end of an array
insert: inserts values along the given axis before the given index
delete: removes a sub-array along an axis and returns the new array
unique: finds the unique elements of an array
arr: the input array; it is flattened if it is not one-dimensional
return_index: if True, also returns the positions (indices) of the unique elements in the original array, as an array
return_counts: if True, also returns how many times each unique element occurs in the original array
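
The functions above can be tried on a small array. This is only a sketch with made-up values; note that none of these calls modifies the input array, each returns a new one.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

resized = np.resize(a, (3, 2))                    # same six values, new shape
appended = np.append(a, [7, 8, 9])                # no axis: the result is flattened
appended_row = np.append(a, [[7, 8, 9]], axis=0)  # with axis: adds a third row
inserted = np.insert(a, 1, [0, 0, 0], axis=0)     # new row before row index 1
deleted = np.delete(a, 1, axis=1)                 # drops the column at index 1

values, first_index, counts = np.unique(
    np.array([5, 2, 6, 2, 7, 5]), return_index=True, return_counts=True
)

print(resized)
print(appended)
print(inserted)
print(deleted)
print(values, first_index, counts)
```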

numpy statistical functions

numpy.amin()
Computes the minimum of the array elements along the specified axis.
numpy.amax()
Computes the maximum of the array elements along the specified axis.
numpy.ptp()
Computes the range of the array elements, that is, maximum minus minimum.
numpy.percentile()
A percentile is a statistical measure: the value below which a given percentage of the observations fall. numpy.percentile() computes that value.
numpy.median()
Computes the median of the elements of array a.
numpy.mean()
Returns the arithmetic mean of the array elements. If an axis is given, the mean is computed along it.
numpy.average()
Computes the weighted average of the array elements, using the weights given in another array.
np.std()
The standard deviation measures how widely a set of data is spread around its mean.
The formula is: std = sqrt(mean((x - x.mean())**2))
np.var()
The variance is the mean of the squared differences between each value and the mean of all values,
that is, mean((x - x.mean())**2). With its default arguments np.var() divides by the number of values, which gives the population variance.
The standard deviation is the square root of the variance.
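
The sketch below runs each of these functions on a small illustrative array and checks the std and var formulas quoted above against numpy's own results. The data values are arbitrary.

```python
import numpy as np

a = np.array([[3, 7, 5], [8, 4, 3], [2, 4, 9]])

col_min = np.amin(a, axis=0)       # minimum of each column
row_max = np.amax(a, axis=1)       # maximum of each row
value_range = np.ptp(a)            # maximum minus minimum over the whole array
p50 = np.percentile(a, 50)         # the 50th percentile equals the median
med = np.median(a)
col_mean = np.mean(a, axis=0)      # mean along axis 0
weighted = np.average([1, 2, 3, 4], weights=[4, 3, 2, 1])

# Check the formulas for variance and standard deviation by hand.
x = a.astype(float)
manual_var = np.mean((x - x.mean()) ** 2)
manual_std = np.sqrt(manual_var)

print(col_min, row_max, value_range)
print(p50, med, col_mean, weighted)
print(np.isclose(manual_var, np.var(x)), np.isclose(manual_std, np.std(x)))
```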

pandas

Creating the Series data type

pandas is a powerful toolkit for analyzing structured data. It is built on numpy, which provides high-performance matrix operations. It is used for data mining and data analysis, and it also provides data-cleaning functionality.
Core tool 1: Series
An object similar to a one-dimensional array. It consists of a set of data (of any numpy data type) and an associated set of data labels (the index). A simple Series can also be created from the data alone.
Core tool 2: DataFrame
A tabular data structure in pandas that holds an ordered collection of columns, where each column can have a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and can be viewed as a dictionary of Series.

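Common data types:
- One-dimensional: Series
- Two-dimensional: DataFrame
- Three-dimensional: Panel …
- Four-dimensional: Panel4D …
- N-dimensional: PanelND …
Panel and its higher-dimensional variants have been removed from current pandas releases; Series and DataFrame are the types in everyday use.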

A Series is the one-dimensional data structure in pandas, similar to a Python list or a numpy ndarray. The difference is that a Series carries a set of index labels, one per element, and it can store data of different types.
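
A few common ways to build a Series, as a minimal sketch (the values and labels are invented for illustration): from a list with the default integer index, from a list with explicit labels, from a dictionary, and from mixed-type data.

```python
import pandas as pd

from_list = pd.Series([10, 20, 30])                        # default index 0, 1, 2
labelled = pd.Series([10, 20, 30], index=['a', 'b', 'c'])  # explicit labels
from_dict = pd.Series({'x': 1.5, 'y': 2.5})                # keys become the index
mixed = pd.Series([1, 'two', 3.0])                         # mixed types use dtype object

print(from_list)
print(labelled['b'])
print(from_dict)
print(mixed.dtype)
```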

Basic Series operations

No.  Attribute or method  Description
2    dtype                Returns the data type (dtype) of the object.
3    empty                Returns True if the Series is empty.
4    ndim                 Returns the number of dimensions of the underlying data, which is 1 by definition.
5    size                 Returns the number of elements in the underlying data.
6    values               Returns the Series as an ndarray.
8    tail()               Returns the last n rows.
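
The attributes in the table can be inspected on any Series. A small sketch with an illustrative six-element Series:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10, 16), index=list('abcdef'))

print(s.dtype)     # integer dtype of the underlying data
print(s.empty)     # False, because the Series holds elements
print(s.ndim)      # a Series is one-dimensional
print(s.size)      # number of elements
print(s.values)    # the data as a numpy ndarray
print(s.tail(2))   # the last 2 rows

empty_series = pd.Series(dtype=float)
print(empty_series.empty)
```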
Series Calculation example
import pandas as pd
import numpy as np
import string
s1 = pd.Series(np.arange(5), index=list(string.ascii_lowercase[:5]))  # s1.index = [a, b, c, d, e], s1.values = [0, 1, 2, 3, 4]
s2 = pd.Series(np.arange(2, 8), index=list(string.ascii_lowercase[2:8]))  # s2.index = [c, d, e, f, g, h], s2.values = [2, 3, 4, 5, 6, 7]
print(s1)
print(s2)
# ***************** Operations align on the index; labels present in only one Series yield NaN
# Addition: missing value + real value = missing value
print(s1 + s2)
# Subtraction
print(s1 - s2)
print(s1.sub(s2))
# Multiplication
print(s1 * s2)
print(s1.mul(s2))
# Division
print(s1 / s2)
print(s1.div(s2))
# Median
print(s1)
print(s1.median())
# Sum
print(s1.sum())
# max
print(s1.max())
# min
print(s1.min())
where method
# Series.where works differently from np.where: it keeps the values where the condition holds and replaces the rest
s1 = pd.Series(np.arange(5), index=list(string.ascii_lowercase[:5]))
# print(s1.where(s1 > 3))
# Elements that are not greater than 3 are replaced with 10
print(s1.where(s1 > 3, 10))
# Elements that are greater than 3 are replaced with 10
print(s1.mask(s1 > 3, 10))

DataFrame data

A Series has only a row index, whereas a DataFrame has both a row index and a column index.
The row index labels the rows (axis 0) and is called index.
The column index labels the columns (axis 1) and is called columns.

Creating a DataFrame

Method 1: create from a list

li = [
[1, 2, 3, 4],
[2, 3, 4, 5]
]
# A DataFrame has two indexes: the row index (axis 0, axis=0) and the column index (axis 1, axis=1)
d1 = pd.DataFrame(data=li, index=['A', 'B'], columns=['views', 'loves', 'comments', 'transfers'])

Method 2: create from a numpy array

narr = np.arange(8).reshape(2, 4)
# A DataFrame has two indexes: the row index (axis 0, axis=0) and the column index (axis 1, axis=1)
d2 = pd.DataFrame(data=narr, index=['A', 'B'], columns=['views', 'loves', 'comments', 'transfers'])

Method 3: create from a dictionary

# Named data_dict rather than dict so that the built-in dict type is not shadowed
data_dict = {
    'views': [1, 2],
    'loves': [2, 3],
}
d3 = pd.DataFrame(data=data_dict, index=['Vermicelli', 'fans'])
Basic DataFrame properties and overall information

a) Basic properties
df.shape # (number of rows, number of columns)
df.dtypes # Data type of each column
df.ndim # Number of dimensions
df.index # Row index
df.columns # Column index
df.values # The data as a two-dimensional ndarray

b) Overall situation query
df.tail(3) # Show the last rows (3 here; the default is 5)
df.info() # Overview: number of rows, number of columns, index, non-null count per column, column types, memory usage
df.describe() # Quick summary statistics: count, mean, standard deviation, maximum, quartiles, minimum, and so on
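
To see these properties and queries on real output, here is a minimal sketch. The DataFrame df and its contents are made up for illustration; any DataFrame would do.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    index=['A', 'B', 'C'],
    columns=['views', 'loves', 'comments'],
)

print(df.shape)          # (rows, columns)
print(df.dtypes)         # one dtype per column
print(df.ndim)           # a DataFrame is two-dimensional
print(list(df.index))
print(list(df.columns))
print(df.values)         # the underlying 2-D ndarray

print(df.tail(2))        # last 2 rows
df.info()                # prints its overview directly
summary = df.describe()  # returns a DataFrame of summary statistics
print(summary)
```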

Writing CSV files

df.to_csv('doc/csvFile.csv', index=False) # index=False: do not write the row index
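
A round trip makes the effect of index=False visible. The sketch below is only an illustration: it writes to a temporary directory instead of doc/, since to_csv does not create missing directories, and the DataFrame contents are invented.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'views': [1, 2], 'loves': [2, 3]}, index=['A', 'B'])

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'csvFile.csv')
    df.to_csv(csv_path, index=False)   # the row labels A and B are not written

    with open(csv_path) as f:
        text = f.read()
    restored = pd.read_csv(csv_path)   # gets a fresh default integer index

print(text)
print(restored)
```

Because the row index was dropped on writing, the labels A and B cannot be recovered from the file; omit index=False if they carry information.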