Python is an easy-to-learn programming language and quick to get started with. Today's article is a very long one: a single pass through the basics of Python, NumPy and Pandas.
Setting up the environment
Let's first look at how to install Python and set up a working environment.
Choosing a Python version
There are two major Python versions in circulation, 2.x and 3.x. Because 2.x is no longer maintained, I suggest using 3.x as your main version.
Choosing an IDE
There are many popular editors for Python, such as Sublime Text and Notepad++, but I recommend the following two:
PyCharm: a cross-platform Python development tool. Besides the usual debugging, syntax highlighting and code completion, it also comes with connectors for several databases, so you can inspect a database while debugging without hunting around for separate database clients.
Jupyter: a web-based interactive editor. You run code one cell at a time and see the result immediately, which is very convenient during the debugging phase.
Installing Python
If you are on Linux or macOS, the system often ships with its own Python, which may be an old 2.x release. To get a 3.x version you have to install it yourself, for example through a package manager or by compiling it from source. If you are not comfortable with basic Linux operations, it may be easier to use Windows.
On Windows, you can download the .exe installer directly from the official Python website and click "Next" all the way through to complete the installation.
Hello World
Everyone has probably had this experience: when learning any language, the first step is to output Hello World. Let's see how to output Hello World with Python.
print("Hello World")
sum = 1 + 2
print("sum = %d" %sum)
>>>
Hello World
sum = 3
The print function prints output to the console. The sum = ... syntax declares a variable and assigns it a value, and %d is a placeholder that is replaced by an integer when the string is formatted.
Data types and variables
list
list1 = ["1", "2", "test"]
print(list1)
list1.append("hello")
print(list1)
>>>
['1', '2', 'test']
['1', '2', 'test', 'hello']
list is a built-in Python data type. It is an ordered collection, and you can add and remove elements at any time.
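To make "add and remove elements at any time" concrete, here is a minimal sketch of the most common list operations. The variable name `fruits` and its contents are made up for illustration and do not come from the article.

```python
# Hypothetical sample data, used only for illustration
fruits = ["apple", "banana"]

fruits.append("cherry")       # add an element to the end
fruits.insert(0, "avocado")   # insert an element at index 0
removed = fruits.pop()        # remove and return the last element
fruits.remove("banana")       # remove the first element equal to "banana"

print(fruits)
print(removed)
```

All four methods modify the list in place, which is exactly what distinguishes a list from the tuple described next.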
Tuples
tuple1 = ("zhangsan", "lisi")
print(tuple1[0])
>>>
zhangsan
A tuple is very similar to a list, but once a tuple is initialized it cannot be modified.
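A quick way to see this immutability is to try an assignment and catch the error. This is only a small sketch; the name `names` is an illustrative choice.

```python
names = ("zhangsan", "lisi")

try:
    names[0] = "wangwu"       # tuples do not allow item assignment
except TypeError as err:
    message = str(err)
    print("cannot modify a tuple:", message)
```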
Dictionaries
dict1 = {"name1": "zhangsan", "name2": "lisi", "name3": "wangwu"}
dict1["name1"]
>>>
'zhangsan'
Python has a built-in dictionary type, dict (short for dictionary), known in other languages as a map. It stores data as key-value pairs and offers extremely fast lookups.
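Beyond looking up a key, a few other dictionary operations come up constantly. The sketch below uses a made-up `scores` dictionary to show adding a key, testing membership, reading with a default, and removing a key.

```python
# Hypothetical data for illustration
scores = {"zhangsan": 80, "lisi": 95}

scores["wangwu"] = 70                  # add (or overwrite) a key
has_lisi = "lisi" in scores            # membership test on keys
missing = scores.get("zhaoliu", -1)    # default value instead of a KeyError
removed = scores.pop("zhangsan")       # remove a key and return its value

print(scores, has_lisi, missing, removed)
```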
Sets
s = set([1, 2, 3])
print(s)
>>>
{1, 2, 3}
A set is similar to a dict: it is also a collection of keys, but it stores no values. Because keys cannot repeat, a set contains no duplicate elements.
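The "no duplicates" property, together with the usual set operations, can be sketched as follows. The names `s1`, `s2`, `common` and `merged` are illustrative only.

```python
s1 = set([1, 1, 2, 2, 3])   # duplicates are dropped on construction
s1.add(4)                   # add an element
s1.remove(1)                # remove an element

s2 = {3, 4, 5}
common = s1 & s2            # intersection
merged = s1 | s2            # union

print(s1, common, merged)
```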
Variables
The concept of a variable is basically the same as a variable in a junior high school algebra equation, except that in a computer program a variable can hold not only a number but a value of any data type.
a = 1
a = 3
print(a)
>>>
3
Conditional statements
age = 30
if age >= 18:
    print('your age is', age)
    print('good')
else:
    print('you do not belong here')
>>>
your age is 30
good
if ... else ... is the classic conditional statement. if is followed by a conditional expression; if it holds, the statements below it are executed, otherwise the statements after else are executed. Also note that Python uses indentation to delimit code blocks, usually four spaces or one tab. Do not mix the two.
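When there are more than two cases, `elif` can be chained between `if` and `else`. The function name `describe_age` and the age thresholds below are hypothetical, chosen only to show the syntax.

```python
def describe_age(age):
    # Branches are tested from top to bottom; the first match wins
    if age >= 18:
        return "adult"
    elif age >= 6:
        return "teenager"
    else:
        return "kid"

print(describe_age(30))
print(describe_age(10))
print(describe_age(3))
```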
Loops
names = {"zhangsan", "lisi", "wangwu"}
for name in names:
    print(name)
>>>
lisi
zhangsan
wangwu
names is a set, which is an iterable object. In a for loop, name is assigned each element of names in turn. Because a set is unordered, the output order may differ from the one shown here.
sum = 0
n = 99
while n > 0:
    sum = sum + n
    n = n - 2
print(sum)
>>>
2500
Inside the loop the variable n keeps decreasing. Once it reaches -1 the while condition is no longer satisfied and the loop exits.
Advanced features
Slicing
L = ['zhangsan', 'lisi', 'wangwu', 'zhaoliu']
print(L[1])
print(L[1:3])
>>>
lisi
['lisi', 'wangwu']
In Python all indexes start at 0, and slice ranges are half-open: the start index is included and the end index is excluded.
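Slices also accept omitted bounds, negative indexes and a step. The following sketch reuses the same list of names; the result variable names are illustrative.

```python
L = ['zhangsan', 'lisi', 'wangwu', 'zhaoliu']

first_two = L[:2]        # start omitted: from index 0 up to (not including) 2
last_two = L[-2:]        # negative index counts from the end
every_other = L[::2]     # step of 2
reversed_copy = L[::-1]  # negative step walks the list backwards

print(first_two, last_two, every_other, reversed_copy)
```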
Iteration
Lists, tuples and dictionaries are all iterable objects and can be traversed with a for loop.
L = ['zhangsan', 'lisi', 'wangwu', 'zhaoliu']
D = {"zhangsan": 1, "lisi": 2, "wangwu": 3, "zhaoliu": 4}
for l in L:
    print(l)
print('\n')
for k, v in D.items():
    print("key:", k, ", value:", v)
>>>
zhangsan
lisi
wangwu
zhaoliu


key: zhangsan , value: 1
key: lisi , value: 2
key: wangwu , value: 3
key: zhaoliu , value: 4
For a dictionary, use items() to traverse keys and values at the same time.
Functions
Calling functions
Python has many useful built-in functions that we can call directly.
>>> abs(100)
100
>>> abs(-20)
20
>>> abs(12.34)
12.34
>>> max(1, 2)
2
>>> max(2, 3, 1, -5)
3
If a function is called with invalid arguments, the program raises an exception.
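As a small illustration of that exception, the sketch below passes a string to `abs` and catches the resulting `TypeError`. The wrapper name `safe_abs` is hypothetical and not part of Python.

```python
def safe_abs(value):
    """Return abs(value), or None if the argument type is not supported."""
    try:
        return abs(value)
    except TypeError as err:
        print("bad argument:", err)
        return None

result_ok = safe_abs(-20)
result_bad = safe_abs("a")   # abs() does not accept a string
print(result_ok, result_bad)
```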
All of Python's built-in functions are listed here:
https://docs.python.org/zh-cn/3/library/functions.html
Defining functions
In Python, a function is defined with the def statement: write the function name, the parentheses, the parameters inside the parentheses and a colon :, then write the function body in an indented block. The function's result is returned with the return statement.
def add(num1, num2):
    return num1 + num2

result = add(1, 2)
print(result)
>>>
3
The code defines a function named add that takes two parameters and returns their sum. Once a function is defined, you call it by writing its name followed by (). If the function has a return value, you can assign it to a variable.
Modules
Using modules
Python comes with many very useful built-in modules. Once Python is installed, these modules can be used right away.
import time

def sayTime():
    now = time.time()
    return now

nowtime = sayTime()
print(nowtime)
>>>
1566550687.642805
Use import to import a module; after that we can call the functions and variables the module provides.
A module is simply a collection of tools. We can of course write some tools ourselves and organize them into our own module for later use.
When we write our own modules, the directory structure usually looks like this:
mytest
├─ __init__.py
├─ test1.py
└─ test2.py
Now we can import and use these two test tool files from other files:
import mytest.test1
mytest.test1
You should have noticed the __init__.py file. This file can be empty; a folder that contains an __init__.py file is a "package". For the folder to be treated as a regular package and imported as shown above, it should contain an __init__.py file.
Installing third-party modules
In Python, third-party modules are installed with the package management tool pip.
Generally, third-party libraries are registered on the official Python package index, pypi.python.org. To install a third-party library you first need to know its name, which you can look up on the library's website or on PyPI. For example, the package name of Pillow is Pillow, so the command to install it is:
pip install Pillow
Object-oriented programming
Classes and instances
The most important concepts in object-oriented programming are the class and the instance. Remember that a class is an abstract template, such as a Student class, while instances are the concrete "objects" created from the class. Every object has the same methods, but its data may differ.
In Python, a class is defined with the class keyword:
class Student(object):
    pass
After defining the class, you can instantiate it:
zhangsan = Student()
zhangsan.age = 20
print(Student)
print(zhangsan)
print(zhangsan.age)
>>>
<class '__main__.Student'>
<__main__.Student object at 0x00EA7350>
20
Here the variable zhangsan is an instance of the Student class. We also bound an attribute age to zhangsan and assigned it a value.
Keep in mind the three basic elements of object orientation: abstraction, encapsulation and inheritance. If these ideas do not mean much to you yet, that is fine; you will get a feel for them as you keep learning.
IO programming
Reading files is an operation you will use frequently later on. In Python, the open function makes it easy to open a file:
f = open('/Users/tanxin/test.txt', 'r')
f.read()
f.close()
The mode 'r' means read. With this we have opened a file; we then use the read function to read its contents and finally call close to close the file.
A file must be closed after use, because file objects take up operating system resources and the operating system can only keep a limited number of files open at the same time.
The with statement makes opening files more convenient:
with open('/Users/tanxin/test.txt', 'r') as f:
    print(f.read())
The with statement takes care of calling close for us.
Besides read, there are also the readline() and readlines() functions for reading files. readline() reads one line at a time, while readlines() reads everything at once and returns a list of lines.
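The difference between readline() and readlines() is easier to see on a real file. The sketch below writes a small temporary file first so that it runs anywhere; the file name `demo.txt` and its contents are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")   # hypothetical file name

    # Create a three-line file to read back
    with open(path, "w", encoding="utf-8") as f:
        f.write("first\nsecond\nthird\n")

    with open(path, "r", encoding="utf-8") as f:
        first_line = f.readline()   # reads one line, including its newline
        rest = f.readlines()        # reads the remaining lines into a list

print(repr(first_line))
print(rest)
```

Note that readlines() continues from wherever the previous read stopped, so here it returns only the second and third lines.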
Regular expressions
Regular expressions are a big topic, enough to fill a whole book, so here is only a brief introduction.
Python provides the re module for working with regular expressions:
import re
str1 = "010-56765"
res = re.match(r'(\d{3})-(\d{5})', str1)
print(res)
print(res.group(0))
print(res.group(1))
print(res.group(2))
>>>
<re.Match object; span=(0, 9), match='010-56765'>
010-56765
010
56765
The match() method checks whether the string matches from its beginning. If the match succeeds it returns a Match object, otherwise it returns None.
Combined with the group method, it can be used to extract substrings effectively.
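Because match() returns None on failure, it is worth checking the result before calling group. The helper name `split_number` below is hypothetical; the pattern is the one used above.

```python
import re

pattern = re.compile(r'(\d{3})-(\d{5})')

def split_number(text):
    """Return (area_code, number) if text starts with a match, else None."""
    m = pattern.match(text)
    if m is None:
        return None
    return m.groups()   # all captured groups as a tuple

print(split_number("010-56765"))
print(split_number("hello"))
```

Calling `res.group(1)` directly on a failed match would raise an AttributeError, which is why the None check comes first.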
requests is a very commonly used HTTP request library. We will use it a lot in the later lessons on web crawlers.
import requests
r = requests.get('https://www.baidu.com')
r = requests.post('http://test.com/post', data={'key': 'value'})
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get("http://test.com/get", params=payload)
At this point r is a response object, and we can get the relevant information from it:
r.text # Get the response content as text
r.content # Get the response content as bytes
r.encoding = "utf-8" # Change the encoding
html = r.text # Get the web page content
binary_content = r.content # Get the binary data
url = 'https://www.baidu.com' # The URL requested above
raw = requests.get(url, stream=True) # Get the raw response content
headers = {'user-agent': 'my-test/0.1.1'} # Custom request headers
r = requests.get(url, headers=headers)
cookies = {"cookie": "# your cookie"} # Sending cookies
r = requests.get(url, cookies=cookies)
This has only been a brief introduction to Python's syntax; if you want to learn more, you will need to put in more effort. But as the saying goes, nothing in the world is difficult for someone who is willing to climb. Don't stay at the beginner stage: look for practice sites such as LeetCode and other online judges. Solving problems is a good way to train your programming thinking and algorithm skills.
NumPy is not only the most widely used library in Python scientific computing, it is also the foundation of libraries such as SciPy and Pandas. It provides more advanced and efficient data structures and is a library designed specifically for scientific computing.
NumPy is usually used together with SciPy (Scientific Python) and Matplotlib (a plotting library). This combination is widely used as a replacement for MATLAB and forms a powerful scientific computing environment that helps us learn data science or machine learning with Python.
One of NumPy's most important features is its N-dimensional array object, ndarray. It is a collection of data of the same type, whose elements are indexed starting from 0.
Internal composition of an ndarray
A pointer to the data (a block of data in memory or in a memory-mapped file)
The data type, or dtype, which describes the fixed-size cell that each value occupies in the array
A shape tuple giving the size of each dimension
A strides tuple whose integers give the number of bytes to step over to reach the next element along each dimension
You can get a feel for these concepts gradually as you keep learning.
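These four pieces can be inspected directly on any array. The sketch below builds a small 2 x 3 array with an explicit int32 dtype (an arbitrary choice made so that the byte counts are the same on every platform) and prints its attributes.

```python
import numpy as np

# 6 int32 values laid out row by row in a 2 x 3 array
arr = np.arange(6, dtype=np.int32).reshape(2, 3)

print(arr.dtype)     # data type of each element
print(arr.itemsize)  # bytes per element
print(arr.shape)     # size of each dimension
print(arr.strides)   # bytes to step to the next element along each dimension
```

With 4-byte elements and 3 elements per row, moving to the next row skips 12 bytes and moving to the next column skips 4 bytes.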
To create an ndarray, just call NumPy's array function:
import numpy as np
a = np.array([1, 2, 2])
b = np.array([[1, 2], [5, 5], [7, 8]])
b[1,1]=10
print(a.shape)
print(b.shape)
print(a.dtype)
print(b)
>>>
(3,)
(3, 2)
int32
[[ 1 2]
[ 5 10]
[ 7 8]]
Import the numpy library and call the array function to create an ndarray.
To create a one-dimensional array you only need to pass in a list; to create a multidimensional array, nest lists as elements inside another list.
To access an element of the array, use indexing, as in b[1,1].
Use the shape attribute to get the shape (size) of the array; b, for example, is an array with three rows and two columns.
Use the dtype attribute to get the data type of the array's elements.
NumPy supports more data types than Python's built-in types. Here are some common ones:
Name | Description |
---|---|
bool_ | Boolean data type (True or False) |
int_ | Default integer type |
int32 | Integer (-2147483648 to 2147483647) |
uint32 | Unsigned integer (0 to 4294967295) |
float32 | Single-precision floating point: 1 sign bit, 8 exponent bits, 23 mantissa bits |
float64 | Double-precision floating point: 1 sign bit, 11 exponent bits, 52 mantissa bits |
Data type objects (dtype)
Data type objects can be used to create arrays with the data structure we want:
numpy.dtype(object, align, copy)
object: the object to be converted to a data type object
align: if True, pad the fields so that the layout resembles a C struct
copy: whether to copy the dtype object; if False, the result is a reference to the built-in data type object
Use dtype to create a structured array:
mydtype = np.dtype({
'names': ['name', 'age', 'sex'],
'formats': ['S32', 'i4', 'S32']
})
persons = np.array([
('zhangsan', 20, 'man'),
('lisi', 18, 'woman'),
('wangwu', 30, 'man')
],
dtype=mydtype)
print(persons)
>>>
[(b'zhangsan', 20, b'man') (b'lisi', 18, b'woman') (b'wangwu', 30, b'man')]
First a structured type is defined with the dtype function, then the array is built with the array function; the dtype parameter accepts the type we defined.
The number of dimensions of a NumPy array is called its rank: a one-dimensional array has rank 1, a two-dimensional array has rank 2, and so on.
In NumPy, each linear array is called an axis, that is, a dimension. For example, a two-dimensional array is equivalent to a one-dimensional array whose elements are themselves one-dimensional arrays. So a one-dimensional array is an axis in NumPy: the first axis is the outer array, and the second axis is the arrays inside it. The number of axes, the rank, is the number of dimensions of the array.
Many operations let you specify axis. axis=0 means operating along axis 0, that is, on each column; axis=1 means operating along axis 1, that is, on each row.
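A sum is the simplest way to see what axis does. In this sketch the array `m` is made-up sample data: axis=0 collapses the rows and yields one value per column, while axis=1 collapses the columns and yields one value per row.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m.ndim)           # number of axes (rank)
col_sums = m.sum(axis=0)   # one result per column
row_sums = m.sum(axis=1)   # one result per row
print(col_sums)
print(row_sums)
```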
The more important ndarray object attributes are listed below:
Attribute | Description |
---|---|
ndim | Rank, that is, the number of axes or dimensions |
shape | Shape of the array: a tuple with the size of each dimension |
size | Total number of elements in the array |
dtype | Data type of the elements |
itemsize | Size of each element, in bytes |
An empty array
x = np.empty([3,2], dtype=int)
print(x)
>>>
[[0 0]
[0 0]
[0 0]]
numpy.empty creates an array with the given shape and data type (dtype) without initializing it, so its contents are arbitrary and the zeros shown above are not guaranteed.
Array of zeros
zero1 = np.zeros(5)
zero2 = np.zeros(4, dtype=int)
print(zero1)
print(zero2)
>>>
[0. 0. 0. 0. 0.]
[0 0 0 0]
Array of ones
one1 = np.ones(3)
one2 = np.ones(4, dtype=float)
print(one1)
print(one2)
>>>
[1. 1. 1.]
[1. 1. 1. 1.]
Creating an array from existing data
numpy.asarray creates an array from a list, a tuple or a multidimensional array
list1 = [1, 3, 5]
tuple1 = (1, 2, 3)
one = np.ones((2,3), dtype=int)
array1 = np.asarray(list1)
array2 = np.asarray(tuple1)
array3 = np.asarray(one)
print(array1)
print(array2)
print(array3)
>>>
[1 3 5]
[1 2 3]
[[1 1 1]
[1 1 1]]
numpy.frombuffer reads a buffer as a stream and converts it into a one-dimensional array
str1 = b"Hello world"
buffer1 = np.frombuffer(str1, dtype='S1')
print(buffer1)
>>>
[b'H' b'e' b'l' b'l' b'o' b' ' b'w' b'o' b'r' b'l' b'd']
numpy.fromiter creates an array from an iterable object
range1 = range(5)
iter1 = np.fromiter(range1, dtype=int)
print(iter1)
>>>
[0 1 2 3 4]
numpy.arange creates an array from a range of values
myarray1 = np.arange(5)
print(myarray1)
>>>
[0 1 2 3 4]
numpy.linspace builds an array that forms an arithmetic sequence
myarray2 = np.linspace(1,9,5)
print(myarray2)
>>>
[1. 3. 5. 7. 9.]
Slicing and indexing
The contents of an ndarray can be accessed and modified by index or by slice, just like slicing a Python list.
ndarray elements can be indexed with subscripts 0 to n-1. A slice object can be built with the built-in slice function by setting the start, stop and step parameters, and then used to cut a new array out of the original one.
a = np.arange(10)
print(a)
s = slice(2,7,2) # From index 2 up to (not including) index 7, with a step of 2
print(a[s])
>>>
[0 1 2 3 4 5 6 7 8 9]
[2 4 6]
You can also slice with colons (:)
a = np.arange(10)
print(a)
b = a[2:7:2] # From index 2 up to (not including) index 7, with a step of 2
print(b)
>>>
[0 1 2 3 4 5 6 7 8 9]
[2 4 6]
Modify array shape
numpy.reshape changes the shape of an array without changing its data
a = np.arange(6)
print(" The original array :", a)
b = a.reshape(3, 2)
print(" Array after transformation :", b)
>>>
The original array : [0 1 2 3 4 5]
Array after transformation : [[0 1]
[2 3]
[4 5]]
numpy.ndarray.flat is an iterator over the array's elements, so you can process each element in turn
a = np.arange(9).reshape(3,3)
print (' The original array :')
for row in a:
    print (row)
# To handle every element in the array, use the flat attribute, which is an array element iterator:
print (' Array after iteration :')
for element in a.flat:
    print (element)
>>>
The original array :
[0 1 2]
[3 4 5]
[6 7 8]
Array after iteration :
0
1
2
3
4
5
6
7
8
Transposing an array
numpy.transpose swaps the dimensions of an array
a = np.arange(10).reshape(2, 5)
print(a)
b = a.transpose()
print(b)
>>>
[[0 1 2 3 4]
[5 6 7 8 9]]
[[0 5]
[1 6]
[2 7]
[3 8]
[4 9]]
Joining arrays
numpy.concatenate joins two or more arrays along an axis; the arrays must have the same shape except along the axis being joined
a = np.array([[1,2],[3,4]])
print (' The first array :')
print (a)
b = np.array([[5,6],[7,8]])
print (' The second array :')
print (b)
# The two arrays have the same dimensions
print (' Along axis 0 Concatenate two arrays :')
print (np.concatenate((a,b)))
print (' Along axis 1 Concatenate two arrays :')
print (np.concatenate((a,b),axis = 1))
>>>
The first array :
[[1 2]
[3 4]]
The second array :
[[5 6]
[7 8]]
Along axis 0 Concatenate two arrays :
[[1 2]
[3 4]
[5 6]
[7 8]]
Along axis 1 Concatenate two arrays :
[[1 2 5 6]
[3 4 7 8]]
Split array
numpy.split splits an array into subarrays
a = np.arange(9)
print (' The first array :')
print (a)
print (' Divide the array into three equal sized subarrays :')
b = np.split(a,3)
print (b)
print (' Divide the position indicated in the one-dimensional array :')
b = np.split(a,[4,7])
print (b)
>>>
The first array :
[0 1 2 3 4 5 6 7 8]
Divide the array into three equal sized subarrays :
[array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8])]
Divide the position indicated in the one-dimensional array :
[array([0, 1, 2, 3]), array([4, 5, 6]), array([7, 8])]
In addition, there are functions for adding and removing array elements:
Function | Description |
---|---|
resize | Returns a new array with the specified shape |
append | Appends values to the end of an array |
insert | Inserts values along the specified axis before the given index |
delete | Deletes a subarray along an axis and returns the new array |
unique | Finds the unique elements of an array |
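A short sketch of four of these functions on a one-dimensional array follows. The sample array `a` is made up; note that each call returns a new array and leaves `a` unchanged.

```python
import numpy as np

a = np.array([1, 2, 2, 3])

appended = np.append(a, [4])     # add values at the end
inserted = np.insert(a, 1, 9)    # insert 9 before index 1
deleted = np.delete(a, 0)        # drop the element at index 0
uniques = np.unique(a)           # sorted unique elements

print(appended, inserted, deleted, uniques)
print(a)   # the original array is not modified
```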
Calculating the maximum and minimum
numpy.amin() computes the minimum of the array's elements along the specified axis
numpy.amax() computes the maximum of the array's elements along the specified axis
a = np.array([[3,7,5],[8,4,3],[2,4,9]])
print (' An array is :')
print (a)
print (' call amin() function :')
print (np.amin(a,1))
print (' Call again amin() function :')
print (np.amin(a,0))
print (' call amax() function :')
print (np.amax(a))
print (' Call again amax() function :')
print (np.amax(a, axis = 0))
>>>
An array is :
[[3 7 5]
[8 4 3]
[2 4 9]]
call amin() function :
[3 3 2]
Call again amin() function :
[2 4 3]
call amax() function :
9
Call again amax() function :
[8 7 9]
When axis is not specified, the maximum or minimum of the entire array is returned.
axis = 0 operates on each column, that is, it treats the array as [3, 8, 2], [7, 4, 4], [5, 3, 9] and picks the largest or smallest value from each.
axis = 1 operates on each row, that is, it treats the array as [3, 7, 5], [8, 4, 3], [2, 4, 9].
The axis parameter is not easy to grasp at first, so it is worth spending some extra time practicing with it until it makes sense.
numpy.ptp computes the difference between the maximum and minimum values of the array's elements
a = np.array([[3,7,5],[8,4,3],[2,4,9]])
print (' Our array is :')
print (a)
print (' call ptp() function :')
print (np.ptp(a))
print (' Along axis 1 call ptp() function :')
print (np.ptp(a, axis = 1))
print (' Along axis 0 call ptp() function :')
print (np.ptp(a, axis = 0))
>>>
Our array is :
[[3 7 5]
[8 4 3]
[2 4 9]]
call ptp() function :
7
Along axis 1 call ptp() function :
[4 5 7]
Along axis 0 call ptp() function :
[6 3 6]
numpy.percentile computes percentiles, a measure of what percentage of observations fall below a given value
Understanding percentiles: the p-th percentile is a value such that at least p% of the data items are less than or equal to it, and at least (100 - p)% of the data items are greater than or equal to it.
For example, suppose a student scores 80 on a Chinese test. If this score is exactly the 80th percentile of all students' scores, then the score is at or above those of roughly 80% of the students, and about 20% of the students scored higher.
a = np.array([[10, 7, 4], [3, 2, 1]])
print (' An array is :')
print (a)
print (' call percentile() function :')
# The 50th percentile is the median of a after sorting
print (np.percentile(a, 50))
# axis=0: compute along each column
print (np.percentile(a, 50, axis=0))
# axis=1: compute along each row
print (np.percentile(a, 50, axis=1))
# keepdims=True keeps the number of dimensions unchanged
print (np.percentile(a, 50, axis=1, keepdims=True))
>>>
An array is :
[[10 7 4]
[ 3 2 1]]
call percentile() function :
3.5
[6.5 4.5 2.5]
[7. 2.]
[[7.]
[2.]]
numpy.median computes the median of the array's elements
a = np.array([[10, 7, 4], [3, 2, 1]])
print (' An array is :')
print (a)
print(np.median(a))
>>>
3.5
As you can see, percentile with p equal to 50 gives the median.
numpy.mean computes the arithmetic mean
a = np.array([[10, 7, 4], [3, 2, 1]])
print (' An array is :')
print (a)
print(np.mean(a))
>>>
4.5
numpy.average computes the weighted average
a = np.array([1,2,3,4])
print (' An array is :')
print (a)
print (' call average() function :')
print (np.average(a))
wts = np.array([4,3,2,1])
print (' Call again average() function :')
print (np.average(a,weights = wts))
>>>
An array is :
[1 2 3 4]
call average() function :
2.5
Call again average() function :
2.0
Standard deviation and variance
The standard deviation is a measure of how spread out a set of data is around its mean; it is the arithmetic square root of the variance.
The variance is the average of the squared differences between each sample value and the mean of all sample values.
print (np.std([1,2,3,4]))
print (np.var([1,2,3,4]))
>>>
1.118033988749895
1.25
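The two definitions above can be checked by computing them by hand and comparing against np.var and np.std. This is only a sketch; the variable names are illustrative. It also shows the `ddof=1` argument, which divides by n - 1 instead of n and gives the sample variance.

```python
import numpy as np

data = np.array([1, 2, 3, 4])

mean = data.mean()
variance = np.mean((data - mean) ** 2)   # average squared deviation from the mean
std_dev = np.sqrt(variance)              # square root of the variance

print(variance, std_dev)

# Sample variance: divide by n - 1 instead of n
sample_variance = np.var(data, ddof=1)
print(sample_variance)
```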
In NumPy, sorting takes a single line of code: just call the sort function.
numpy.sort(a, axis, kind, order)
By default the quicksort algorithm is used. The kind parameter can be set to quicksort, mergesort or heapsort, meaning quick sort, merge sort and heap sort respectively. axis defaults to -1, which sorts along the last axis; axis=0 sorts each column and axis=1 sorts each row. As for order, if the array has named fields, you can give the field or fields to sort by.
a = np.array([[3,7],[9,1]])
print (' An array is :')
print (a)
print (' call sort() function :')
print (np.sort(a))
print (' Sort by column :')
print (np.sort(a, axis = 0))
print (' Sort by row :')
print (np.sort(a, axis = 1))
>>>
An array is :
[[3 7]
[9 1]]
call sort() function :
[[3 7]
[1 9]]
Sort by column :
[[3 1]
[9 7]]
Sort by row :
[[3 7]
[1 9]]
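The order parameter only applies to structured arrays, which the example above does not cover. The sketch below builds a small structured array, with field names `name` and `age` chosen for illustration, and sorts it by the `age` field.

```python
import numpy as np

people = np.array(
    [('zhangsan', 20), ('lisi', 18), ('wangwu', 30)],
    dtype=[('name', 'U10'), ('age', 'i4')],   # hypothetical field layout
)

by_age = np.sort(people, order='age')   # sort records by the age field
print(by_age)
print(by_age['name'])
```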
In data analysis we usually use Pandas for data cleaning. In real work the data we get is often messy: null values, duplicate values, invalid values and the like all interfere with our analysis, so we need to clean the data step by step. Data cleaning is a very important step in data analysis and also a very tedious one. Once you have mastered the Pandas library, though, it is like holding a sword that cuts iron like mud, and your data cleaning becomes far more efficient.
Pandas has two main data structures, Series and DataFrame, which represent a one-dimensional sequence and a two-dimensional table structure respectively.
Dimensions | Name | Description |
---|---|---|
1 | Series | A one-dimensional labeled array of a single data type. Labels default to an integer RangeIndex and may repeat. It is a collection of scalars and also the building block of a DataFrame. |
2 | DataFrame | A generally two-dimensional, labeled, size-mutable table structure with potentially heterogeneous columns. |
A Series is a fixed-length, ordered, dictionary-like sequence. It is roughly equivalent to two ndarrays, one holding the index and one holding the values.
import pandas as pd
s = pd.Series(data, index=index)
Here data can be any of the following:
A Python dict
An ndarray
A scalar, such as 4
index defaults to an increasing integer sequence 0, 1, 2, ...
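Of the three kinds of data listed above, the scalar case is the least obvious: the scalar is repeated once for every index label. A minimal sketch, with made-up labels:

```python
import pandas as pd

# A scalar is broadcast to every label in the index
s_scalar = pd.Series(4, index=['a', 'b', 'c'])
print(s_scalar)

# With a dict, the keys become the index
s_dict = pd.Series({'a': 1, 'b': 2})
print(s_dict['b'])
```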
Specifying an index
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)
>>>
a -0.595567
b -0.201314
c 1.516812
d 0.102395
e -1.009924
dtype: float64
Without specifying an index
s1 = pd.Series(['a', 'b', 'c', 'd'])
print(s1)
>>>
0 a
1 b
2 c
3 d
dtype: object
Creating a Series from a dictionary
d = {'a': 1, 'b': 2, 'c': 3}
s2 = pd.Series(d)
print(s2)
>>>
a 1
b 2
c 3
dtype: int64
A DataFrame is a two-dimensional data structure. It can be thought of as a spreadsheet or SQL table, or as a dictionary of Series objects.
d = {"Chinese": [80, 85, 90], "Math": [85, 70, 95], "English": [90, 95, 90]}
df1 = pd.DataFrame(d)
print(df1)
df2 = pd.DataFrame(d, index=['zhangsan', 'lisi', 'wangwu'])
print(df2)
print(df2.columns, df2.index)
>>>
Chinese Math English
0 80 85 90
1 85 70 95
2 90 95 90
Chinese Math English
zhangsan 80 85 90
lisi 85 70 95
wangwu 90 95 90
Index(['Chinese', 'Math', 'English'], dtype='object') Index(['zhangsan', 'lisi', 'wangwu'], dtype='object')
Selecting data from a DataFrame by index
Operation | Syntax | Result type |
---|---|---|
Select a column | df[col] | Series |
Select a row by label | df.loc[label] | Series |
Select a row by integer position | df.iloc[loc] | Series |
Slice rows | df[5:10] | DataFrame |
Select rows with a Boolean vector | df[bool_vec] | DataFrame |
Code:
print(df2['Chinese'], '\n')
print(df2.loc['zhangsan'], '\n')
print(df2.iloc[-1], '\n')
print(df2[0:2], '\n')
print(df2[df2>85], '\n')
>>>
zhangsan    80
lisi        85
wangwu      90
Name: Chinese, dtype: int64

Chinese    80
Math       85
English    90
Name: zhangsan, dtype: int64

Chinese    90
Math       95
English    90
Name: wangwu, dtype: int64

          Chinese  Math  English
zhangsan       80    85       90
lisi           85    70       95

          Chinese  Math  English
zhangsan      NaN   NaN       90
lisi          NaN   NaN       95
wangwu       90.0  95.0       90
Reading data
df = pd.read_csv("test.csv")
print(df.head())
print('\n')
print(type(df))
>>>
name age score
0 zhangsan 30.0 80.0
1 lisi 20.0 NaN
2 wangwu 25.0 100000.0
3 zhaoliu NaN 32.0
4 maqi 33.0 60.0
<class 'pandas.core.frame.DataFrame'>
Save the data
df.to_csv('my.csv')
df.to_excel('my.xlsx')
print(df.index, '\n')
print(df.columns, '\n')
print(df.to_numpy(), '\n')
print(df.describe())
>>>
RangeIndex(start=0, stop=5, step=1)

Index(['name', 'age', 'score'], dtype='object')

[['zhangsan' 30.0 80.0]
 ['lisi' 20.0 nan]
 ['wangwu' 25.0 100000.0]
 ['zhaoliu' nan 32.0]
 ['maqi' 33.0 60.0]]

             age          score
count   4.000000       4.000000
mean   27.000000   25043.000000
std     5.715476   49971.337211
min    20.000000      32.000000
25%    23.750000      53.000000
50%    27.500000      70.000000
75%    30.750000   25060.000000
max    33.000000  100000.000000
describe is a very commonly used function. It gives you an overview of the data and helps you understand it.
Sort by axis
print(df.sort_index(axis=1, ascending=False))
>>>
score name age
0 80.0 zhangsan 30.0
1 NaN lisi 20.0
2 100000.0 wangwu 25.0
3 32.0 zhaoliu NaN
4 60.0 maqi 33.0
Sort by values
print(df.sort_values(by='score'))
>>>
name age score
3 zhaoliu NaN 32.0
4 maqi 33.0 60.0
0 zhangsan 30.0 80.0
2 wangwu 25.0 100000.0
1 lisi 20.0 NaN
View missing values
print(df.isnull(),'\n')
print(df.isnull().any())
>>>
    name    age  score
0  False  False  False
1  False  False   True
2  False  False  False
3  False   True  False
4  False  False  False

name     False
age       True
score     True
dtype: bool
This makes it easy to see which columns contain null values.
Delete / Fill in empty values
df1 = df.copy()
print(df1, '\n')
print(df1.dropna(how='any'), '\n')
print(df1.fillna(value=50))
>>>
       name   age     score
0  zhangsan  30.0      80.0
1      lisi  20.0       NaN
2    wangwu  25.0  100000.0
3   zhaoliu   NaN      32.0
4      maqi  33.0      60.0

       name   age     score
0  zhangsan  30.0      80.0
2    wangwu  25.0  100000.0
4      maqi  33.0      60.0

       name   age     score
0  zhangsan  30.0      80.0
1      lisi  20.0      50.0
2    wangwu  25.0  100000.0
3   zhaoliu  50.0      32.0
4      maqi  33.0      60.0
Renaming columns
df1.rename(columns={'name': 'student'}, inplace=True)
print(df1)
>>>
student age score
0 zhangsan 30.0 80.0
1 lisi 20.0 NaN
2 wangwu 25.0 100000.0
3 zhaoliu NaN 32.0
4 maqi 33.0 60.0
Deleting columns / rows
df1 = df1.drop(columns=['age'])
print(df1, '\n')
df1 = df1.drop(index=[1])
print(df1)
>>>
    student     score
0  zhangsan      80.0
1      lisi       NaN
2    wangwu  100000.0
3   zhaoliu      32.0
4      maqi      60.0

    student     score
0  zhangsan      80.0
2    wangwu  100000.0
3   zhaoliu      32.0
4      maqi      60.0
Remove duplicate values
df = df.drop_duplicates() # Remove duplicate lines
Modify the data format
df1['score'].astype('str')
Applying functions with apply
apply is used to apply a function to the data.
df2 = df1['score'].apply(lambda x: x * 2)
print(df2)
>>>
0 160.0
2 200000.0
3 64.0
4 120.0
Name: score, dtype: float64
The above code is equivalent to
list(map(lambda x: x*2, df1['score']))
>>>
[160.0, 200000.0, 64.0, 120.0]
From this we can see that apply is a concise and convenient function for quickly applying a function to each element.
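For simple arithmetic like doubling, plain vectorized operators give the same result as apply and are usually faster on large data, because the work is done inside NumPy rather than in a Python-level loop. A small sketch with made-up scores:

```python
import pandas as pd

scores = pd.Series([80.0, 100000.0, 32.0, 60.0])

doubled_apply = scores.apply(lambda x: x * 2)   # calls the lambda per element
doubled_vec = scores * 2                        # vectorized arithmetic

print(doubled_vec)
print(doubled_apply.equals(doubled_vec))
```

apply remains the right tool when the per-element logic cannot be expressed with built-in vectorized operations.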
Histogram
The "histogram" here refers to the value_counts function, which shows how many distinct values a column contains and how many times each value occurs.
print(df, '\n')
df3 = df.fillna(60)
df3.loc[5] = ['qianba', 20, 80] # Add a new line
print(df3['score'].value_counts())
>>>
       name   age     score
0  zhangsan  30.0      80.0
1      lisi  20.0       NaN
2    wangwu  25.0  100000.0
3   zhaoliu   NaN      32.0
4      maqi  33.0      60.0

60.0        2
80.0        2
32.0        1
100000.0    1
Name: score, dtype: int64
Merge
1. Use concat to join two Pandas objects
print(df3, '\n')
df4 = df3.copy()
df3 = pd.concat([df3, df4], ignore_index=True)
print(df3)
>>>
name age score
0 zhangsan 30.0 80.0
1 lisi 20.0 60.0
2 wangwu 25.0 100000.0
3 zhaoliu 60.0 32.0
4 maqi 33.0 60.0
5 qianba 20.0 80.0

name age score
0 zhangsan 30.0 80.0
1 lisi 20.0 60.0
2 wangwu 25.0 100000.0
3 zhaoliu 60.0 32.0
4 maqi 33.0 60.0
5 qianba 20.0 80.0
6 zhangsan 30.0 80.0
7 lisi 20.0 60.0
8 wangwu 25.0 100000.0
9 zhaoliu 60.0 32.0
10 maqi 33.0 60.0
11 qianba 20.0 80.0
2. Use the merge function
Join on a column
left = pd.DataFrame({'key': ['foo', 'bar', 'loo'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['foo', 'bar', 'roo'], 'rval': [3, 4, 5]})
print(left, '\n')
print(right, '\n')
print(pd.merge(left, right, on='key'))
>>>
key lval
0 foo 1
1 bar 2
2 loo 3
key rval
0 foo 3
1 bar 4
2 roo 5
key lval rval
0 foo 1 3
1 bar 2 4
Inner join (inner): the intersection of the keys
print(pd.merge(left, right, how='inner'))
>>>
key lval rval
0 foo 1 3
1 bar 2 4
As for left joins, right joins and outer joins, you can try them yourself and see how they differ.
Grouping
Grouping means splitting the data into groups according to some criterion, applying a function to each group independently, and finally combining the results into one data structure.
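For readers who want a starting point, here is a minimal sketch of the three other join types on the same two tables. The result variable names are illustrative.

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar', 'loo'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['foo', 'bar', 'roo'], 'rval': [3, 4, 5]})

# Left join: keep every key from left; rval is NaN where right has no match
left_join = pd.merge(left, right, on='key', how='left')
# Right join: keep every key from right; lval is NaN where left has no match
right_join = pd.merge(left, right, on='key', how='right')
# Outer join: keep the union of the keys from both tables
outer_join = pd.merge(left, right, on='key', how='outer')

print(left_join, '\n')
print(right_join, '\n')
print(outer_join)
```

Columns that receive NaN are converted to float, which is why lval or rval may print as 1.0 rather than 1.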
df = pd.DataFrame({
'A': ['foo', 'bar', 'bar', 'foo', 'foo', 'foo'],
'B': ['one', 'two', 'three', 'one', 'two', 'two'],
'C':[1, 2, 3, 4, 5, 6]})
print(df, '\n')
print(df.groupby('A').sum(), '\n')
print(df.groupby('B').sum())
>>>
     A      B  C
0  foo    one  1
1  bar    two  2
2  bar  three  3
3  foo    one  4
4  foo    two  5
5  foo    two  6

      C
A
bar   5
foo  16

        C
B
one     5
three   3
two    13
You can also group by multiple columns
print(df.groupby(['A', 'B']).sum())
>>>
C
A B
bar three 3
two 2
foo one 5
two 11
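Grouping is not limited to sums. As a sketch, selecting the numeric column first and calling agg with a list of function names computes several statistics per group in one step; the DataFrame is the same as above.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'bar', 'foo', 'foo', 'foo'],
    'B': ['one', 'two', 'three', 'one', 'two', 'two'],
    'C': [1, 2, 3, 4, 5, 6]})

# Several aggregations of column C for each value of A
stats = df.groupby('A')['C'].agg(['sum', 'mean', 'count'])
print(stats)
```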
Pandas also provides plotting functionality
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2018', periods=1000))
print(ts, '\n')
ts = ts.cumsum() # Returns the cumulative value
ts.plot()
>>>
2018-01-01 1.055229
2018-01-02 0.101467
2018-01-03 -2.083537
2018-01-04 1.178102
2018-01-05 -0.084247
...
2020-09-22 -4.316770
2020-09-23 -0.823494
2020-09-24 0.215199
2020-09-25 1.094516
2020-09-26 0.285788
Freq: D, Length: 1000, dtype: float64

<matplotlib.axes._subplots.AxesSubplot at 0x4742270>
That's all for today's article. It was a long one!