For people who often use Python to crunch data, the library introduced below is one you will genuinely regret not having met sooner.
I still remember once, while processing data on a server, spending two whole days fighting memory blow-ups when Pandas read more than 20 million rows. I finally solved it through data conversion, data-type tuning, iterative reads, and the GC mechanism (the details are in my blog post: Python: optimizing pandas to read and train on tens of millions of rows).
I had always assumed Python was simply not suited to large-scale data, short of moving to Hadoop. Then I came across a library called Modin, and finally understood what "one line of code solves everything" means.
Pandas is one of the most commonly used Python libraries; anyone doing computing or data science reaches for it constantly. It offers high-performance, easy-to-use data structures and data analysis tools, and it is very beginner-friendly. But once the data gets large, Pandas running on a single core becomes underpowered. After all, enterprise workloads can easily reach GB or even TB scale per day, where a distributed system may be needed for acceptable performance. Under its default settings, Pandas uses only a single CPU core and runs functions in a single process; by comparison, TensorFlow can run in parallel simply by setting its GPU parameters.
For small data the slowness doesn't matter; we may not even notice any difference in speed. But on a huge dataset, using only a single core leads to very poor performance. Some datasets have millions or even hundreds of millions of rows, and if you run just one operation at a time on just one CPU core, it is going to be slow.
Most modern computers have at least two CPU cores. Yet even with two, the default settings leave half or more of the machine's processing power idle when you use pandas. With 4 cores (a modern Intel i5) or 6 cores (a modern Intel i7), the waste is even greater. Pandas was simply not designed to use a computer's compute power efficiently.
What we want is simply for Pandas to run faster, not to optimize its workflow for one particular hardware setup. In other words, the same Pandas script should work whether we are processing a 10KB dataset or a 10TB one. Modin provides exactly such an optimized Pandas solution, so data scientists can spend their time extracting value from data instead of wrestling with the tools that extract it.
Modin is an early-stage project from UC Berkeley's RISELab that aims to bring distributed computing to the field of data science. It is a multi-process dataframe (DataFrame) library with the same API as Pandas, letting users speed up their existing Pandas workflows.
According to the project's experiments, on an 8-core machine a user only needs to modify one line of code for Modin to speed up Pandas query tasks by about 4x.
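That one line is the import. Below is a minimal sketch of the swap, assuming Modin and a supported engine are already installed; the file name and column name are placeholders for illustration only:

# Before: import pandas as pd
# After: the drop-in replacement below; everything else stays unchanged.
import modin.pandas as pd

df = pd.read_csv("big_file.csv")             # placeholder file name
print(df.groupby("some_column").mean())      # placeholder column name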
Take a typical Pandas scenario: given a DataFrame, the goal is to process the data as quickly as possible. We can call .mean() to compute the mean of each column, groupby to bucket the data, drop_duplicates() to remove duplicate rows, and so on through Pandas' many other built-in functions.
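For concreteness, here are those three operations run on a tiny made-up DataFrame (plain Pandas; the column names are invented for the example):

import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b", "b"],
                   "value": [1.0, 1.0, 2.0, 3.0]})

print(df.mean(numeric_only=True))   # mean of each numeric column
print(df.groupby("group").sum())    # aggregate rows by group
print(df.drop_duplicates())         # drop exact duplicate rows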
As mentioned before, Pandas uses only one CPU core for data processing. This is a serious bottleneck, and the resource starvation becomes especially pronounced with larger DataFrames.
In theory, parallelizing a computation is as simple as computing different data points on every available CPU core. For a Pandas DataFrame, the basic idea is to split the DataFrame into as many parts as there are cores, let each core compute its part independently, and then add up the results at the end; that final combine step is computationally cheap.
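Here is a hand-rolled sketch of that idea using only the standard library. This is not Modin's actual implementation; the column name and chunk count are assumptions for the example:

import numpy as np
import pandas as pd
from multiprocessing import Pool

def chunk_sum(chunk):
    # Each worker process sums its own slice of the data.
    return chunk["value"].sum()

if __name__ == "__main__":
    df = pd.DataFrame({"value": np.random.rand(1_000_000)})
    n = 4                                   # one chunk per core
    size = len(df) // n
    chunks = [df.iloc[i * size:(i + 1) * size] for i in range(n)]
    with Pool(n) as pool:
        partials = pool.map(chunk_sum, chunks)
    print(sum(partials))                    # cheap final combine step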
How a multi-core system speeds up data processing: on a single-core system (left), all 10 tasks go to one CPU; on a dual-core system (right), each core handles 5 tasks, doubling the processing speed.
This is essentially how Modin works: it splits the DataFrame into parts and sends each part to a different CPU core. Modin partitions a DataFrame along both its rows and its columns, so DataFrames of any shape can be processed in parallel.
Suppose you get a DataFrame with many columns but only a few rows. Libraries that can only partition along one axis struggle in this case, because there are more columns than rows. But since Modin partitions along both dimensions at once, this parallel layout stays efficient for DataFrames of any shape: many rows, many columns, or both.
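A toy illustration of two-dimensional partitioning (a sketch of the idea only, not Modin's internals):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4))

# Split along rows first, then split each row block along columns,
# yielding a 2x2 grid of sub-frames, each of which could go to a core.
row_blocks = [df.iloc[:2, :], df.iloc[2:, :]]
grid = [[blk.iloc[:, :2], blk.iloc[:, 2:]] for blk in row_blocks]
print(grid[0][0])   # top-left partition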
A Pandas DataFrame (left) is stored as one block and handed to a single CPU core. A Modin DataFrame (right) is partitioned along both rows and columns, and each partition goes to a different core: the more cores you have, the more tasks can be processed at once.
The image above is just a simplified example. In practice, Modin uses a Partition Manager, which can change the size and shape of the partitions according to the type of operation. For instance, an operation may need an entire row or an entire column of data; in that case, the Partition Manager re-cuts the task and hands the pieces to different CPU cores, finding a near-optimal way to schedule the work. It is flexible and convenient.
For the parallel processing itself, Modin runs on top of either Dask or Ray. Both are Python libraries for parallel computing, and Modin can run on either one. As of this writing, Ray is the safer and more stable choice, while the Dask backend is still in the testing phase.
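Per the Modin documentation at the time of writing, the engine can be selected through the MODIN_ENGINE environment variable, set before modin.pandas is imported:

import os
os.environ["MODIN_ENGINE"] = "ray"   # or "dask"

import modin.pandas as pd            # must come after setting the engine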
The system is designed for Pandas users who want their programs to run faster and scale better without major code changes. The ultimate goal of this work is simply to let people keep using Pandas.
Speed comparison: reading an 800MB file, plus various pd operations.
The Modin project is still in its early days, but it is a very promising complement to Pandas. Modin handles all the data partitioning and reshuffling for the user, so we can focus on our workflow. Modin's basic goal is to let users work with the same tools on small data and big data alike, without having to change APIs to fit different data scales.
In this example, using Modin to read the 800MB file saved about 22 seconds, roughly 74% of the time. Imagine having 100 such files to read: the file reads alone would save over half an hour.
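A rough way to reproduce this kind of comparison on your own machine (a minimal sketch; the file name is a placeholder and the actual numbers depend on your hardware and core count):

import time
import pandas
import modin.pandas as mpd

start = time.time()
pandas.read_csv("big_800mb_file.csv")        # placeholder file name
print("pandas:", time.time() - start, "s")

start = time.time()
mpd.read_csv("big_800mb_file.csv")
print("modin: ", time.time() - start, "s")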
pip install "modin[ray]"   # remember to install Ray as well
import modin.pandas as pd
For more Python tips, machine learning, and AI content, you are welcome to follow my WeChat official account, 「Turing's cat」; reply SSR in the background and there are free proxy nodes waiting for you~