## Data analysis -- numpy and pandas

Hee hee hee hee hee 2020-11-13 01:53:58

## numpy and pandas

#### Basic numpy operations

###### Create an array: a, b, and c all hold the same array, so any one of the three forms works
``````
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array(range(1, 6))
c = np.arange(1, 6)
``````
###### Check the type of the created numpy arrays
``````
print(type(a))
print(type(b))
print(type(c))
``````
###### Check the data type stored in the array; what are the common data types?
``````
# The default integer dtype is platform dependent; it is int64 on most 64-bit systems.
print(a.dtype)
``````
###### Specify the data type of the created array
``````
d = np.array([1.9, 0, 1.3, 0], dtype=float)
print(d, d.dtype)
``````
###### Change the data type of the array
``````
# astype accepts a type name or a type code; 'i8' is the code for int64.
# Converting float to int truncates the fractional part.
e = d.astype('int64')
print(e, e.dtype)
``````
###### Change the number of decimal places of floating-point numbers
``````
# Create a random array with three rows and four columns
f = np.random.random((3, 4))
print(f)
# Round the floating-point numbers to 3 decimal places
g = np.round(f, 3)
print(g)
``````

#### Array operations and index slicing

###### Transposition
``````
import numpy as np

data = np.random.random((3, 4))
# Reshape the data to 2 rows and 6 columns
data = data.reshape((2, 6))
print(data)
# Three equivalent ways to transpose a two-dimensional array
print("Transposition:", data.T)
print("Transposition:", data.transpose())
print("Transposition:", data.swapaxes(1, 0))
``````
###### Indexing and slicing
``````
import numpy as np

a = np.arange(12).reshape((3, 4))
print(a)
# ***************** Take a single row or column *********************
# Take the 2nd row
print(a[1])
# Take the 3rd column
print(a[:, 2])
# Take the element in the 2nd row, 3rd column
print(a[1, 2])
# ***************** Take consecutive rows or columns *********************
# Take the 2nd and 3rd rows
print(a[1:3])
# Take the 3rd and 4th columns
print(a[:, 2:4])
# Rows: 1 and 2, column: 2
print(a[0:2, 1:2])
# ***************** Take non-consecutive rows or columns *********************
# Rows: 1 and 3, columns: all (every element of the 1st and 3rd rows)
print(a[[0, 2], :])
# Rows: all, columns: 1 and 4
print(a[:, [0, 3]])
# Paired indices: the element at row 1, column 1 and the element at row 3, column 4
print("*" * 10)
print(a[[0, 2], [0, 3]])
``````
###### Modifying values in an array
``````
import numpy as np

# Modify whole rows and columns
t = np.arange(24).reshape((4, 6))
print(t)
# Rows: all, columns: 3 and 4
t[:, 2:4] = 0
print(t)
# Boolean index
print(t < 10)
# Assign through a boolean mask
t[t < 10] = 100
print(t)
t[t > 20] = 200
print(t)
# np.where works like a ternary operator: t < 100 ? 0 : 10
# It returns a new array and leaves t unchanged.
t1 = np.where(t < 100, 0, 10)
print(t)
print(t1)
``````

###### Fancy indexing

Fancy indexing means indexing with an array of integers.

The values in the index array are used as indices along one axis of the target array. With a one-dimensional integer array as the index, the result for a one-dimensional target is the elements at those positions; for a two-dimensional target it is the rows at those positions.

Unlike slicing, fancy indexing always copies the data into a new array.
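
The difference between a slice and a fancy index can be checked directly. The following is a minimal sketch, assuming a small one-dimensional array; the variable names are chosen only for illustration.

```python
import numpy as np

x = np.arange(10)

# A slice is a view: it shares memory with the original array.
view = x[2:5]
view[0] = 100          # this also changes x[2]

# A fancy index returns a copy: writes to it leave the original untouched.
copy = x[[2, 3, 4]]
copy[1] = -1           # x[3] is not affected

print(x)
print(np.shares_memory(x, view))
print(np.shares_memory(x, copy))
```

`np.shares_memory` reports whether two arrays overlap in memory, which is a convenient way to tell a view from a copy.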

``````
import numpy as np

# Pass in an index array of row positions
x = np.arange(32).reshape((8, 4))
print(x)
print(x[[4, 2, 1, 7]])
# Pass in an index array of negative positions (counted from the end)
x = np.arange(32).reshape((8, 4))
print(x[[-4, -2, -1, -7]])
# Pass in several index arrays (use np.ix_)
# np.ix_ takes the index arrays and builds the Cartesian product of them.
# For [1, 5, 7, 2] and [0, 3, 1, 2] the selected positions are
# (1,0),(1,3),(1,1),(1,2); (5,0),(5,3),(5,1),(5,2);
# (7,0),(7,3),(7,1),(7,2); (2,0),(2,3),(2,1),(2,2)
x = np.arange(32).reshape((8, 4))
print(x)
print(x[np.ix_([1, 5, 7, 2], [0, 3, 1, 2])])
``````
###### Shape modification
``````
reshape   Changes the shape without changing the data
          numpy.reshape(arr, newshape, order='C')
          order: 'C' -- row-major, 'F' -- column-major,
                 'A' -- column-major if the array is Fortran-contiguous in memory, otherwise row-major
flat      Iterator over the array elements
flatten   Returns a flattened copy of the array; changes to the copy do not affect the original array
ravel     Returns a flattened array (a view of the original whenever possible)
          flatten and ravel also accept order='K': the order in which the elements occur in memory
``````
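
The differences among these functions show up most clearly on a small array. The following is a minimal sketch, assuming a 2x3 integer array; the variable names are illustrative only.

```python
import numpy as np

arr = np.arange(6).reshape((2, 3))      # [[0, 1, 2], [3, 4, 5]]

# order='C' reads and writes row by row; order='F' column by column.
row_major = np.reshape(arr, (3, 2), order='C')
col_major = np.reshape(arr, (3, 2), order='F')
print(row_major)
print(col_major)

# flat is an iterator over the elements in row-major order.
elements = [int(v) for v in arr.flat]
print(elements)

# flatten always returns a copy, so the original is untouched.
flat_copy = arr.flatten()
flat_copy[0] = 99
print(arr[0, 0])

# ravel returns a view when it can, so this write reaches the original.
raveled = arr.ravel()
raveled[0] = 42
print(arr[0, 0])
```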

#### Array joining and splitting

###### Joining

- concatenate: joins a sequence of arrays along an existing axis
- stack: joins a sequence of arrays along a new axis
- hstack: stacks the arrays in a sequence horizontally (column-wise)
- vstack: stacks the arrays in a sequence vertically (row-wise)
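
The four joining functions differ mainly in the shape of the result. The following is a minimal sketch, assuming two 2x2 arrays named a and b for illustration.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# concatenate joins along an axis that already exists.
rows_joined = np.concatenate((a, b), axis=0)   # shape (4, 2)
cols_joined = np.concatenate((a, b), axis=1)   # shape (2, 4)

# stack creates a new axis, so the result has one more dimension.
stacked = np.stack((a, b), axis=0)             # shape (2, 2, 2)

# For 2-D input, hstack and vstack match concatenate along axis 1 and axis 0.
h = np.hstack((a, b))
v = np.vstack((a, b))

for name, result in [("concatenate axis=0", rows_joined),
                     ("concatenate axis=1", cols_joined),
                     ("stack", stacked), ("hstack", h), ("vstack", v)]:
    print(name, result.shape)
```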

###### Splitting

- split: divides an array into multiple subarrays; signature: numpy.split(ary, indices_or_sections, axis)
- hsplit: divides an array horizontally into multiple subarrays (by column)
- vsplit: divides an array vertically into multiple subarrays (by row)
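
The second argument of split can be either a number of equal sections or a list of cut positions. The following is a minimal sketch, assuming a 4x4 array; the variable names are illustrative only.

```python
import numpy as np

arr = np.arange(16).reshape((4, 4))

# An integer asks for that many equal sections along the axis.
halves = np.split(arr, 2, axis=0)        # two arrays of shape (2, 4)

# A list gives the positions to cut at: columns [0:1], [1:3], [3:4].
parts = np.split(arr, [1, 3], axis=1)

# hsplit cuts by column, vsplit cuts by row.
left, right = np.hsplit(arr, 2)
top, bottom = np.vsplit(arr, 2)

print([p.shape for p in halves])
print([p.shape for p in parts])
print(left.shape, top.shape)
```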

###### Adding and deleting array elements
``````
resize   Returns a new array with the specified shape
append   Appends values to the end of an array
insert   Inserts values along the given axis before the given index
delete   Deletes a subarray along an axis and returns the resulting new array
unique   Finds the unique elements of an array
         ar: the input array; it is flattened if it is not one-dimensional
         return_index: if True, also returns the index in the original array of the first occurrence of each unique value
         return_counts: if True, also returns how many times each unique value occurs in the original array
``````
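
Each of these functions returns a new array rather than modifying its input. The following is a minimal sketch, assuming a small 2x3 array and a short list with duplicates; all values are made up for illustration.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

resized = np.resize(arr, (3, 2))                      # same data, new shape
appended_flat = np.append(arr, [7, 8, 9])             # no axis: result is flattened
appended_rows = np.append(arr, [[7, 8, 9]], axis=0)   # add a row
inserted = np.insert(arr, 1, [0, 0, 0], axis=0)       # new row before index 1
deleted = np.delete(arr, 1, axis=1)                   # drop the column at index 1

# unique can also report first positions and occurrence counts.
values, first_index, counts = np.unique(
    np.array([5, 2, 6, 2, 7, 5]), return_index=True, return_counts=True
)

print(resized, appended_flat, appended_rows, inserted, deleted, sep="\n")
print(values, first_index, counts)
print(arr)  # the original array is unchanged
```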

#### numpy statistical functions

``````
numpy.amin()
    Returns the minimum of the array elements along the specified axis.
numpy.amax()
    Returns the maximum of the array elements along the specified axis.
numpy.ptp()
    Returns the range of the array elements (maximum - minimum).
numpy.percentile()
    A percentile is a statistical measure: the value below which the given percentage of observations fall.
numpy.median()
    Returns the median of the elements of array a.
numpy.mean()
    Returns the arithmetic mean of the array elements, computed along the given axis if one is provided.
numpy.average()
    Returns the weighted average of the array elements, using the weights given in another array.
np.std()
    The standard deviation measures how widely a set of data is spread around its mean.
    Formula: std = sqrt(mean((x - x.mean())**2))
np.var()
    The variance (population variance, since ddof defaults to 0) is the mean of the squared differences between each value and the mean of all values,
    that is, mean((x - x.mean())**2).
    The standard deviation is the square root of the variance.
``````
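
The functions above can be tried on one small array. The following is a minimal sketch, assuming a 3x3 integer array with made-up values; it also checks the std and var formulas numerically.

```python
import numpy as np

data = np.array([[3, 7, 5], [8, 4, 3], [2, 4, 9]])

col_min = np.amin(data, axis=0)        # minimum of each column
row_max = np.amax(data, axis=1)        # maximum of each row
value_range = np.ptp(data)             # maximum - minimum over all elements
p50 = np.percentile(data, 50)          # the 50th percentile equals the median
med = np.median(data)
col_mean = np.mean(data, axis=0)
weighted = np.average([1, 2, 3, 4], weights=[4, 3, 2, 1])

# std and var follow the formulas given above (ddof=0 by default).
manual_var = np.mean((data - data.mean()) ** 2)
print(np.isclose(np.var(data), manual_var))
print(np.isclose(np.std(data), np.sqrt(manual_var)))

print(col_min, row_max, value_range, p50, med, col_mean, weighted)
```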

## pandas

#### Creating the Series data type

Pandas is a powerful toolkit for analyzing structured data. It is built on NumPy, which provides high-performance matrix operations, and is used for data mining and data analysis; it also provides data-cleaning functions.

- First core tool: Series. An object similar to a one-dimensional array. It consists of a set of data (of any NumPy data type) and an associated set of data labels (the index). A simple Series can also be created from a set of data alone.
- Second core tool: DataFrame. A tabular data structure in Pandas that contains an ordered set of columns, where each column can hold a different value type (numbers, strings, booleans, and so on). A DataFrame has both a row index and a column index, and can be seen as a dictionary of Series.

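Common data types:
- One-dimensional: Series
- Two-dimensional: DataFrame
- Three-dimensional: Panel …
- Four-dimensional: Panel4D …
- N-dimensional: PanelND …

Note that Panel and its higher-dimensional variants have been removed from recent pandas releases; Series and DataFrame are the structures in everyday use.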

Series is the one-dimensional data structure in Pandas. It is similar to a Python list and to a NumPy ndarray; the difference is that a Series, while one-dimensional, can store data of different types and carries a set of index labels that correspond to its elements.
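
The following is a minimal sketch of three common ways to create a Series, assuming made-up values and labels: from a list with the default index, from mixed-type data with explicit labels, and from a dictionary.

```python
import pandas as pd

# Without an index argument, pandas assigns the positions 0, 1, 2, ...
s_default = pd.Series([10, 20, 30])

# Mixed value types are allowed; the dtype then falls back to object.
s_labeled = pd.Series([1.5, 'text', True], index=['a', 'b', 'c'])

# Dictionary keys become the index labels.
s_from_dict = pd.Series({'x': 1, 'y': 2})

print(s_default)
print(s_labeled)
print(s_labeled['b'])      # look up an element by its label
print(s_from_dict)
```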

#### Basic Series operations

``````
No.  Attribute or method  Description
2    dtype                Returns the data type (dtype) of the object.
3    empty                Returns True if the Series is empty.
4    ndim                 Returns the number of dimensions of the underlying data, which is 1 by definition.
5    size                 Returns the number of elements in the underlying data.
6    values               Returns the Series as an ndarray.
8    tail()               Returns the last n rows.
``````
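
The attributes listed above can be inspected on any Series. The following is a minimal sketch, assuming a short integer Series with made-up values.

```python
import pandas as pd

s = pd.Series([4, 8, 15, 16, 23, 42])

print(s.dtype)          # data type of the values
print(s.empty)          # False, because the Series has elements
print(s.ndim)           # always 1 for a Series
print(s.size)           # number of elements
print(type(s.values))   # the underlying numpy.ndarray
last_two = s.tail(2)    # last n rows; n defaults to 5
print(last_two)
```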
###### Series calculation example
``````
import pandas as pd
import numpy as np
import string

# s1: index a..e, values 0..4
s1 = pd.Series(np.arange(5), index=list(string.ascii_lowercase[:5]))
# s2: index c..h, values 2..7
s2 = pd.Series(np.arange(2, 8), index=list(string.ascii_lowercase[2:8]))
print(s1)
print(s2)
# Operations are aligned on the index; labels present in only one Series yield NaN.
# Addition: missing value + real value = missing value
print(s1 + s2)
# Subtraction
print(s1 - s2)
print(s1.sub(s2))
# Multiplication
print(s1 * s2)
print(s1.mul(s2))
# Division
print(s1 / s2)
print(s1.div(s2))
# Median
print(s1)
print(s1.median())
# Sum
print(s1.sum())
# Maximum
print(s1.max())
# Minimum
print(s1.min())
``````
###### The where method
``````
# Series.where behaves differently from np.where: it keeps the values
# where the condition is True and replaces the others.
s1 = pd.Series(np.arange(5), index=list(string.ascii_lowercase[:5]))
# print(s1.where(s1 > 3))
# Elements that are not greater than 3 are replaced with 10
print(s1.where(s1 > 3, 10))
# Elements that are greater than 3 are replaced with 10
print(s1.mask(s1 > 3, 10))
``````

#### DataFrame data

A Series has only a row index, while a DataFrame has both a row index and a column index.
The row index labels the rows; it runs along axis 0 and is called index.
The column index labels the columns; it runs along axis 1 and is called columns.

###### Creating DataFrame data

Method 1: Create from a list

``````
li = [
    [1, 2, 3, 4],
    [2, 3, 4, 5]
]
# A DataFrame has two indexes: the row index (axis 0, axis=0) and the column index (axis 1, axis=1)
d1 = pd.DataFrame(data=li, index=['A', 'B'], columns=['views', 'loves', 'comments', 'transfers'])
``````

Method 2: Create from a numpy object

``````
narr = np.arange(8).reshape(2, 4)
# A DataFrame has two indexes: the row index (axis 0, axis=0) and the column index (axis 1, axis=1)
d2 = pd.DataFrame(data=narr, index=['A', 'B'], columns=['views', 'loves', 'comments', 'transfers'])
``````

Method 3: Create from a dictionary

``````
# Named data rather than dict to avoid shadowing the built-in type
data = {
    'views': [1, 2],
    'loves': [2, 3],
}
d3 = pd.DataFrame(data=data, index=['Vermicelli', 'fans'])
``````
###### DataFrame basic properties and overall information queries

a) Basic properties

- df.shape: number of rows and columns
- df.dtypes: data type of each column
- df.ndim: number of dimensions
- df.index: row index
- df.columns: column index
- df.values: the underlying values as a two-dimensional ndarray

b) Overall information queries

- df.tail(3): shows the last 3 rows (5 by default)
- df.info(): overview of the data: number of rows and columns, index, non-null count per column, column types, and memory usage
- df.describe(): quick summary statistics: count, mean, standard deviation, maximum, quartiles, and minimum
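
The following is a minimal sketch of these properties and queries, assuming a small made-up DataFrame that reuses the views and loves column names. Note that a DataFrame exposes dtypes (plural), one entry per column.

```python
import pandas as pd

df = pd.DataFrame(
    {'views': [1, 2, 3], 'loves': [2, 3, 4]},
    index=['A', 'B', 'C'],
)

print(df.shape)            # (rows, columns)
print(df.dtypes)           # one dtype per column
print(df.ndim)             # 2 for a DataFrame
print(list(df.index))      # row labels
print(list(df.columns))    # column labels
print(df.values)           # two-dimensional ndarray

last_rows = df.tail(2)     # last 2 rows
print(last_rows)
df.info()                  # prints the overview directly
stats = df.describe()      # count, mean, std, min, quartiles, max
print(stats)
```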

Writing and reading CSV files

``````
df.to_csv('doc/csvFile.csv', index=False)  # index=False: do not store the row index
``````

``````
df2 = pd.read_csv('doc/csvFile.csv')
``````

Writing Excel files

``````
# to_excel needs an Excel engine such as openpyxl to be installed
df.to_excel("/tmp/excelFile.xlsx", sheet_name=" Provincial Statistics ")
``````
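
A CSV round trip can be checked without touching the project directory. The following is a minimal sketch, assuming a made-up DataFrame and a temporary directory in place of the doc/ path used above. Excel is left out because it depends on an extra engine package.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'views': [1, 2], 'loves': [2, 3]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'csvFile.csv')
    # index=False leaves the row index out of the file.
    df.to_csv(csv_path, index=False)
    # Reading the file back rebuilds the columns with a default index.
    df2 = pd.read_csv(csv_path)

print(df2)
print(df2.equals(df))
```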
###### The groupby function

pandas provides a flexible and efficient groupby function. It lets you:
1). Slice, dice, and summarize data sets in a natural way.
2). Split a pandas object by one or more keys (functions, arrays, or DataFrame column names).
3). Compute group summary statistics such as count, mean, and standard deviation, or apply user-defined functions.
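
The following is a minimal sketch of these three points, assuming a made-up table with a city column used as the grouping key and a sales column to summarize; both names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['A', 'A', 'B', 'B', 'B'],
    'sales': [10, 20, 5, 15, 10],
})

# Split the rows by the key column, then select the column to summarize.
grouped = df.groupby('city')['sales']

counts = grouped.count()    # number of rows per group
means = grouped.mean()      # average per group
stds = grouped.std()        # standard deviation per group

# A user-defined function is applied to each group in turn.
ranges = grouped.agg(lambda x: x.max() - x.min())

# Several statistics can be requested at once.
summary = grouped.agg(['count', 'mean', 'std'])

print(counts, means, stds, ranges, summary, sep="\n")
```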