Which Python libraries do you wish you had met sooner?

Hearing people's feathers hanging · 2020-11-12 23:12:47

Tags: python, libraries


For someone like me who often uses Python to churn through data, the following library is one I truly regret not meeting sooner.

I remember once, while processing data on a server, spending two full days of optimization to stop pandas from blowing up memory while reading more than 20 million rows. I finally solved it through data conversion, tighter data types, iterative (chunked) reads, and the GC mechanism (the specifics are in my blog post: "Optimizing pandas in Python to read and train on tens of millions of rows").

I had always assumed Python simply wasn't good at large-scale data, short of reaching for Hadoop. Then I came across a library called Modin and finally understood what "one line of code solves everything" really means.

 

First, let's talk about why pandas falls short

Pandas is one of the most commonly used Python libraries; almost everyone in computing and data science touches it. It offers high-performance, easy-to-use data structures and analysis tools, and it is genuinely friendly to newcomers. But as the data grows, pandas running on a single core becomes overwhelmed. After all, enterprise data volumes can easily reach GB or even TB scale, where distributed systems may be needed for performance. Under its default settings, pandas uses only a single CPU core and runs functions in a single process; by comparison, TensorFlow can go multi-core or multi-GPU just by setting a parameter.

Slow speed hardly matters on small data; we may not even notice it. But on a huge dataset, using only a single core leads to very poor performance. Some datasets have millions or even hundreds of millions of rows; if you can only do one operation at a time on one CPU, it is going to be slow.

Most modern computers have at least two CPU cores. Yet even with two, pandas' default settings leave half or more of the machine's processing power unused. With 4 cores (a modern Intel i5) or 6 cores (a modern Intel i7), the waste is even greater. Pandas simply was not designed to use a computer's compute power efficiently.


So the goal is simply to make pandas run faster, rather than to optimize its workflow for a specific hardware setup. In other words, we want the same pandas script to work whether we are processing a 10 KB dataset or a 10 TB one. Modin provides exactly that optimized pandas solution, so data scientists can spend their time extracting value from data rather than wrestling with the tools that extract it.
 

What is Modin?

Modin is an early project from RISELab at the University of California, Berkeley, aimed at bringing distributed computing to the field of data science. It is a multi-process DataFrame library with the same API as pandas, letting users speed up their existing pandas workflows.

According to the project's experiments, on an 8-core machine, users only need to modify one line of code for Modin to speed up pandas query tasks by about 4x.

In pandas, given a DataFrame, the goal is to process the data as quickly as possible: .mean() to compute averages, groupby to aggregate by category, drop_duplicates() to remove duplicates, and a host of other pandas built-in functions.
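As a minimal illustration of those everyday calls (a toy frame invented for this example, not data from the post):

```python
import pandas as pd

# Toy frame to illustrate the built-ins mentioned above.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, 3.0, 3.0, 4.0],
})

mean_value = df["value"].mean()                  # column average
per_group = df.groupby("group")["value"].sum()   # aggregate by category
deduped = df.drop_duplicates()                   # remove exact duplicate rows
```

Every one of these calls runs in a single process on a single core under default pandas, which is exactly the bottleneck discussed next.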

As mentioned before, pandas calls only one CPU for data processing. That is a major bottleneck, and on larger DataFrames the lack of resources becomes even more pronounced.

In theory, parallel computing should be as easy as spreading the calculation across every available CPU core. For a pandas DataFrame, the basic idea is to divide the DataFrame into as many parts as there are cores, let each core compute its part separately, and finally combine the results; computationally, that combine step is cheap.

[Figure] How multiple cores improve processing speed: in a single-core system (left), all 10 tasks go through one CPU; in a dual-core system (right), each core handles 5 tasks, doubling throughput.

This is essentially how Modin works: divide the DataFrame into parts and send each part to a different CPU. Modin can slice a DataFrame along both its rows and its columns, so DataFrames of any shape can be processed in parallel.

Imagine a DataFrame with many columns but only a few rows. Libraries that can only partition along columns struggle in this case, because there are far more columns than rows. But since Modin slices along both dimensions at once, this parallel structure stays efficient for DataFrames of any shape: many rows, many columns, or both.
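A rough sketch of that two-dimensional slicing, using plain numpy/pandas to mirror the idea (this is an illustration of the concept, not Modin's internal API):

```python
import numpy as np
import pandas as pd

def grid_partition(df, n_row, n_col):
    # Slice by rows AND columns, producing an n_row x n_col grid of
    # blocks; each block could be processed independently on its own core.
    rows = np.array_split(np.arange(len(df)), n_row)
    cols = np.array_split(np.arange(df.shape[1]), n_col)
    return [[df.iloc[r, c] for c in cols] for r in rows]

df = pd.DataFrame(np.arange(24).reshape(4, 6))
blocks = grid_partition(df, 2, 3)   # a 2x3 grid of 2x2 blocks
```

Because the grid adapts to both dimensions, a wide-and-short frame and a tall-and-narrow frame both end up with enough blocks to keep every core busy.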

[Figure] A pandas DataFrame (left) is stored as a whole and handed to a single CPU. A Modin DataFrame (right) is sliced by rows and columns, with each part handed to a different CPU; as many parts can be processed at once as there are CPUs.

The image above is just a simplified example. Modin actually uses a partition manager, which can change the size and shape of partitions based on the type of operation. For instance, an operation may need an entire row or an entire column of data; in that case, the partition manager slices the work accordingly and hands the pieces to different CPUs, finding a near-optimal way to schedule the task. It is flexible and convenient.

For the parallel execution itself, Modin delegates to either Dask or Ray, both of which are parallel computing libraries with Python APIs; you can choose either one when running Modin. So far, Ray is the safer and more stable choice, while the Dask backend is still in the testing phase.
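The backend is chosen through an environment variable set before the first Modin import. The sketch below shows only the configuration step and leaves the Modin import commented out, in case Modin is not installed:

```python
import os

# Modin reads this variable to pick its parallel backend;
# it must be set before the first `import modin.pandas`.
os.environ["MODIN_ENGINE"] = "ray"   # or "dask"

# import modin.pandas as pd   # would now run on the Ray backend
```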

The system is designed for pandas users who want their programs to run faster and scale better, without major code changes. The ultimate goal of this work is to let you keep using pandas.



Reading an 800 MB file: speed comparison across common pandas operations


The Modin project is still in its early stages, but it is a very promising complement to pandas. Modin handles all the data partitioning and reshuffling for the user, so we can focus on the workflow. Modin's basic goal is to let users run the same tools on small data and big data alike, without having to change the API to fit different data scales.
In this example, using Modin to read the 800 MB file saved about 22 seconds, roughly 74% of the time. Imagine 100 such files to read: reading alone would save half an hour.
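A comparison like the one above can be reproduced with a small timing helper. The file name and the Modin lines below are placeholders: they assume a large CSV on disk and an installed Modin, so the usage is shown commented out.

```python
import time

def timed(fn, *args, **kwargs):
    # Run fn and return (result, elapsed wall-clock seconds).
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Hypothetical usage against a big file:
# import pandas
# import modin.pandas as mpd
# _, t_pandas = timed(pandas.read_csv, "big_800mb.csv")
# _, t_modin = timed(mpd.read_csv, "big_800mb.csv")
# print(f"pandas: {t_pandas:.1f}s, modin: {t_modin:.1f}s")
```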

 

Installation

pip install "modin[ray]"  (this pulls in the Ray backend along with Modin)

Usage

import modin.pandas as pd


Copyright notice
This article was written by [Hearing people's feathers hanging]; please include a link to the original when reposting. Thanks.
