Professionals use Excel Data processing has become commonplace . But I believe you must have been helpless , For example, complex computing 、 Repeat the calculation 、 Automatic processing, etc , There's another crash that hasn't been saved , It's not impossible for the whole person to collapse .
If you learn the programming language , These problems are nothing . that , What to learn ？
Countless training institutions and online materials will tell us ：Python!
Python The code looks simple , A few lines can solve a lot of trouble Excel Calculation , It looks great .
But is it true ？ As nonprofessionals , I can really learn Python To help us with our work ？
Python DataFrame
Daily business is mainly dealing with tabular data （ In a professional sense, structured data ）, Like this ：
Each row of data in a table other than the first row is called a record , It corresponds to one thing 、 A person 、 An order ……, The first line is the title , Describes what attributes a record consists of , These records all have the same properties , The whole table is a collection of such records .
Python It's mainly called DataFrame To handle this kind of tabular data , Let's see DataFrame How it's done .
Like the table above , Read in DataFrame The latter is like this ：
Looks like Excel almost , It's just that the line number is from 0 At the beginning .
however ,DataFrame The essence of is a matrix （ Can you remember linear algebra in college ？）,Python There is no record of such a concept , All its operations have to go around the method that the matrix can execute .
Let's look at some simple operations .
Filter
Filtering is a simple and common operation , It is to take a subset that satisfies a certain condition , For example, the table above ：
Question 1 ： Take out R&D Employees of the Department .
Python The code looks like this ：
import pandas as pd data = pd.read_csv('Employees.csv') rd = data.loc[data['DEPT']=='R&D'] print(rd) 
Import Pandas Reading data Filter R&D department see rd data 
Running results ：
The code is simple , And it turns out to be OK . however ：
1. The function used is called loc, yes location（ location ） Abbreviation , There's no filtering at all . in fact , The filtering here is also through positioning （location） The index of the rows that satisfy the condition , Inside the function data[‘DEPT’]==’R&D’ It will work out a Boolean value Series：
and data The index of is the same , The act of satisfying conditions True Otherwise False
then loc According to the value of True The index corresponding to the row of data A new one is obtained by the corresponding line in DataFrame, Essentially, it is the operation of extracting a specified row from a matrix , It's a little tricky to deal with filtering .
2. Filter DataFrame You can't just use loc Function filtering , You can also use query(…) Other methods , But the result is to locate the row and column index of the matrix , Then index the data by row and column , It's basically like this matrix.loc[row,col]
in any case , The basic filtering is simple , I understand . Next, let's try to do two operations on the filtered subset that are not complicated .
Modify the data in the subset
Question two ： take R&D The salary of the Department staff has been raised 5%
Natural thinking , Just filter out R&D Department staff , Then we can modify the salary of these employees .
Write the code according to this logic ：
import pandas as pd data = pd.read_csv('Employees.csv') rd = data.loc[data['DEPT']=='R&D'] rd['SALARY']=rd['SALARY']*1.05 print(data) 
Import Pandas Reading data Filter R&D department modify SALARY 
Running results ：
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandasdocs/stable/user_guide/indexing.html#returningaviewversusacopy
rd['SALARY']=rd['SALARY']*1.05
You can see , Not only did it trigger a warning , The value modification was not successful .
This is because rd = data.loc[data['DEPT']=='R&D'] It's a filtered matrix , Reuse rd['SALARY']=rd['SALARY']*1.05 This sentence modifies SALARY When it's worth it ,rd['SALARY'] It's a new matrix again , So modifying it is actually a modification rd This submatrix , No modification data This initial matrix .
That's a lot to say , Listen to me .
How to write the right code ？
import pandas as pd data = pd.read_csv('Employees.csv') rd_salary = data.loc[data['DEPT']=='R&D','SALARY'] data.loc[data['DEPT']=='R&D','SALARY'] = rd_salary*1.05 print(data) 
find R&D Department staff salaries Intercept R&D Department staff salary and revise

Running results ：
That's right this time . You can't take the subset first and then modify it , To the original matrix , Find the location of the member to be modified and modify it , namely loc[row=data['DEPT']=='R&D',column='SALARY'], Get the data to be modified according to the column index , Assign a value to this matrix . Want to raise 5% We have to get this data before that （rd_salary=… This sentence ）. This kind of writing needs repeated filtering , It's just inefficient , But it's just too much .
Intersection of subsets
Question 3 ： Find out that it's both New York State and R&D Employees of the Department
This is a simpler question , As long as we calculate two subsets and do an intersection operation, it's over . Let's see. Python How is it handled ：
import pandas as pd data = pd.read_csv('Employees.csv') rd = data[data['DEPT']=='R&D'] ny = data[data['STATE']=='New York'] isect_idx = rd.index.intersection(ny.index) rd_isect_ny = data.loc[isect_idx] print(rd_isect_ny) 
R&D Department staff New York State Employees Index intersection Capture data by index intersection

Running results ：
Set intersection is a very basic operation , Many programming languages provide , in fact python Also provided （ It has intersection function ）. However , DataFrame The essence of is a matrix , There's no point in finding the intersection of two matrices ,Python There is no matrix intersection operation . Want to do it with two dataframe The intersection operation of the set represented by , We can only find the intersection of two matrix indexes by detour , Finally, the intersection of the index is used to locate and intercept the original data , There's a feeling of being close to the distance , Do not press “ tricks ” play .
The most commonly used filtering operations in work are so puzzling , I'm dizzy , You can imagine other more complex operations , It's a sour feeling “ Born leisurely ”.
Let's look at a slightly more complex grouping operation ：
grouping
Grouping operation is the most commonly used operation in daily data processing ,Python It also provides a wealth of grouping operation functions , Can complete most of the grouping operations , But it's not so easy to understand and use .
Understand in groups
Grouping is to divide a large set into small sets according to certain rules , The result is a set of sets , And then the set after grouping is calculated , Here's the picture ：
Let's take a look at the most commonly used grouping aggregation operations .
Question 4 ： Summarize the number of people in each department
Python Code ：
import pandas as pd data = pd.read_csv('Employees.csv') group = data.groupby("DEPT") dept_num = group.count() print(dept_num) 
Group by Department Summarize the number of people in each department 
Running results ：
It turned out to be a little embarrassing , Originally, you only need to record the number of members in each group , Just have one column , Why are there so many columns , It's like repeating the same action for each column , That's strange .
Don't worry. , This problem Python It can still be solved , It's just not using count function , It is size function ：
import pandas as pd data = pd.read_csv('Employees.csv') group = data.groupby("DEPT") dept_num = group.size() print(dept_num) 
Group by Department Summarize the number of people in each department 
Running results ：
This result seems to be much more normal , however , It's still weird .
It's a drop , This result is no longer twodimensional DataFrame 了 , It's a onedimensional Seriese.
count The reason why the result of function calculation is strange , Because it counts every column , and size The function is to see the size of each group , But in fact, our natural logic still uses count To count ,size It's hard to think of （ And search the Internet for information ）.
As mentioned earlier , The grouping result should be a set of sets , Let's see. Python Medium DataFrame What does it look like after grouping ？ Put the code above data.groupby(“DEPT”) Print out the results of .
import pandas as pd data = pd.read_csv('Employees.csv') group = data.groupby("DEPT") print(group) 
Group by Department 
Running results ：
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001ADBC9CE0F0>
wow , What kind of thing is this ？
The first time I saw this thing , It's just a circle , Shouldn't the result of grouping be a set of sets , Why is that so ？ It's a nightmare for nonprofessional programmers .
However, searching the Internet can still see that it is a socalled iterative object , After iteration, it is found that each item is indexed in groups + DataFrame Composed of , There are some ways to see what's inside , If you use list(group) You can see the result of grouping . Here's the picture ：
When you look at the picture above, you will understand , go by the name of “ object ” It turns out to be like this . In essence, it is also a collection of sets （ Let's think of a matrix as a set ）, But it can't take a member directly like a normal set ( Such as group[0]), This forced users to memorize this kind of “ object ” Of N The rules of operation , If you can't understand it, you have to memorize it by rote .
See here , It is estimated that many readers have already started to feel dizzy , I have no idea what the above paragraph is nonsense . Um. , That's right , Because this is the normal state of the workplace .
Simple aggregation operations in groups are so hard to understand , Let's burn our brains again , Let's look at a slightly more complex grouping of subsets .
Group subset processing
Although grouping is often used for aggregation operations , But sometimes we don't care about aggregation results , It's about the set itself after grouping . For example, the grouped sets are sorted by a certain column .
Question five ： Sort the employees of each department from morning to evening according to their starting time .
Problem analysis ： After grouping, the subsets are sorted according to the entry time .
Python Code
import pandas as pd employee = pd.read_csv("Employees.csv") employee['HIREDATE']=pd.to_datetime(employee['HIREDATE']) employee_new = employee.groupby('DEPT',as_index=False).apply(lambda x:x.sort_values('HIREDATE')).reset_index(drop=True) print(employee_new) 
Change the entry time format
Press DEPT grouping , And to each group according to HIREDATE Sort , Finally reset the index

Running results ：
It works , The employees of each department are sorted according to their entry time . But let's look at the core sentence in the code employee.groupby('DEPT',as_index=False).apply(lambda x:x.sort_values('HIREDATE')), Abstract this code, and that's it ：
df.groupby(c).apply(lambda x:f(x))
df： Data frame DataFrame
groupby： Group function
c： The columns by which you group
The above three are easy to understand , But apply coordination lambda It's very obscure , Beyond what most nonprofessional programmers understand , It needs to be understood that “ Functional language ” To understand the principle of （ Search for it yourself , I'm too lazy to explain ）.
If you don't use this “ both of you ”（apply+lambda） Well ？ Can also do , It's just going to be a lot of trouble . Have to use for loop , Sort each grouping subset separately , Finally, we have to combine the results .
import pandas as pd employee = pd.read_csv("Employees.csv") employee['HIREDATE']=pd.to_datetime(employee['HIREDATE']) dept_g = employee.groupby('DEPT',as_index=False) dept_list = [] for index,group in dept_g: group = group.sort_values('HIREDATE') dept_list.append(group) employee_new = pd.concat(dept_list,ignore_index=True) print(employee_new) 
Change the entry time format Press DEPT grouping Initialization list for loop Sort each group Sort the results into the list Combine the results of each group 
The running results are the same , But the code is a lot more complicated , And the operating efficiency is getting lower . Which one would you like to use ？
Python Different data types are designed for similar but not identical data , There are different ways of operation , It's not easy to copy knowledge of one kind of data to another similar data , Make people faint .
Said so much , In conclusion, it's just a sentence ：Python It's really hard to understand , It's not something for non professionals . Specifically speaking, it is about three points ：
1. DataFrame The essence is a matrix
All the operations have to be calculated according to the matrix method , It's always very winding .
2. There are many data types and different operation rules
Python Designed in Series,DataFrame, Grouping objects and so on , And different data types , The calculation method is not exactly the same , Such as DataFrame have access to query Function filtering , and Series Can not be , The nature of grouping objects is completely different from Series and DataFrame, The calculation method is more elusive .
3. He who knows it knows not why
Too many data types , The calculation method is quite different , Invisible to increase the amount of memory users , There's more to rote learning , It's hard to be flexible , This creates a strange phenomenon ： A simple operation , The Internet search Python The time of code may be better than that of excel The calculation is longer .
Python The code looks simple , But if you go to a training class, you probably can't learn , The result is only copying examples .
that , Is there no suitable tool for daily data processing in the workplace ？
Yes, there are. .
esProc SPL
esProc SPL It's also a programming language , Focus on structured data computing .SPL Provides a wealth of basic calculation methods , Its conceptual logic is also in line with our thinking habits .
1. An ordinal list is a collection of records
SPL Using ordered tables to carry structured data , Close to daily processing excel surface .
2. Few data types and consistent rules
SPL In structured data processing, there are only two data types: collection and record , The methods involved are generally the same .
3. Know what it is and know why
Just remember two data types , Master the basic algorithm , The more complicated operation is just a combination of simple operation rules . You may write bad code when you are not proficient , But it's unlikely that it won't come out , There will be no Python The phenomenon of spending a lot of time searching for code writing .
Let's use SPL To solve the problem mentioned above , Let's get to know SPL How much “ Amiable and easy of approach ”：
Sequence table
esProc SPL The data structure used to carry twodimensional structured data in is ordinal table , It and excel The results presented in this paper are consistent , Here's the picture ：
Except for the first line in the table above （ Title Line ） Outside , Each other line represents a record , And an ordinal list is a collection of records , Compare with Python Medium DataFrame More intuitive .
Filter
SPL It's not filtered by matrix positioning , It is select（ Screening ） Make records that meet the conditions .
Question 1 ： see R&D Employee information of the Department
A  B  
1  =file("Employees.csv")[email protected]()  / Import data 
2  =A1.select(DEPT=="R&D")  / Filter 
A2 result ：
SPL The filtered results are very understandable , It's a subset of the original data set .
Look again. SPL Modification of subsets and intersection operations
1. Modify the data in the subset
Question two ： take R&D The salary of the Department staff has been raised 5%
A  B  
1  =file("Employees.csv")[email protected]()  / Import data 
2  =A1.select(DEPT=="R&D")  / Filter 
3  =A2.run(SALARY=SALARY*1.05)  / Revise salary 
4  =A1  / View results 
A4 result ：
SPL It's all calculated according to our normal way of thinking , Filter out the results , Revise the salary for the results , Not like Python So much effort .
2. Intersection of subsets
Question 3 ： Find out that it's both New York State and R&D Employees of the Department
A  B  
1  =file("Employees.csv")[email protected]()  / Import data 
2  =A1.select(DEPT=="R&D")  /R&D Department staff 
3  =A1.select(STATE=="New York")  / New York State Employees 
4  =A2^A3  / intersection 
A4 result ：
SPL The intersection operation in is to find the intersection of sets , It's a real set operation , Just use a simple intersection operator “^” that will do . Easy to understand , And it's easy to write , Not like Python In this way, because we can't find the intersection of the matrix, we can find the intersection of the index , And then it intercepts from the original data .SPL Other set operations in, such as union 、 Difference set 、 XOR sets also have corresponding operators , It's easy to use , convenient .
grouping
Understand in groups
SPL The grouping operation of is also in line with natural logic , That is, after grouping, the result is a set of sets , Obvious .
First look at it. SPL Group aggregation operation of .
Question 4 ： Summarize the number of people in each department
A  B  
1  =file("Employees.csv")[email protected]()  
2  =A1.groups(DEPT;count(~):cnt)  / grouping 
A2 result ：
The result of group aggregation , It's still a preface , You can continue to use the ordered list method . Don't like Python The result of the aggregation is onedimensional Series.
And then look at SPL The grouping result of
A  B  
1  =file("Employees.csv")[email protected]()  
2  =A1.group(DEPT)  / grouping 
A2 result ：
The figure above is a collection of ordered lists , Each set is an ordered list of members of a department ; The figure below shows the member of the first group ——Administration Order list of department members .
This result is in line with our normal logic , It's also easy to see the results of grouping , It's easier to do the next operation on the grouped results .
Group subset processing
The result of grouping is a set of sets , Just deal with each subset .
Question five ： Sort the employees of each department from morning to evening according to their starting time
A  B  
1  =file("Employees.csv")[email protected]()  
2  =A1.group(DEPT)  / grouping 
3  =A2.conj(~.sort(HIREDATE))  / Sort subsets and merge 
A3 result ：
because SPL The result of grouping is still a set , So it can be calculated using the set method , There is no need to force memory of the calculation method after grouping , There is no need to use apply()+lambda This kind of heavenly Book combination , It's very natural to complete the grouping + Sort + The job of merging , ordinary 3 Line code , It's easy to write , It's easy to understand , And it's very efficient .
Summary
1. Python When doing structured processing , It's all matrix operations , Simple set operations need to go around the matrix ;esProc SPL The essence is a collection of records , Set operation is simple and convenient .
2. Python Data types are complex and diverse , The rules of operation are unpredictable , It is often known but not why , It's not likely to draw inferences from one instance , Writing code has more memory , It's so hard to understand how it works ;esProc SPL Less data types , And the calculation rules are fixed , You only need to master the basic operation rules, you can draw inferences from one instance to complete complex operations .
3. Study SPL, You can go to :