Outstanding Python libraries for data science and machine learning

osc_ a45vpoh4 2020-11-13 14:52:48

This article introduces some of the best Python libraries currently available for data science and machine learning.



In response to strong demand from the technical community, this article focuses on the best Python packages currently available for implementing data science and machine learning:

  • Introduction to data science and machine learning

  • Why use Python for data science and machine learning?

  • Python libraries for data science and machine learning


Introduction to data science and machine learning

As everyone knows, we live in the era of big data, and data is the “fuel” that drives the development of machine learning models.

In fact, data science and machine learning are skills, not just two isolated technologies.

They require developers to be able to derive practical insights from data and then solve problems by building predictive models.

In terms of their literal definitions:

  • Data science is the process of extracting useful information from data in order to solve practical problems.

  • Machine learning is the process of teaching a machine to solve problems by providing it with large amounts of data.

The relationship between the two can be described as follows: machine learning is a part of data science that uses machine learning algorithms and other statistical techniques to understand how data affects and grows a business.

Why use Python for data science and machine learning?

Python is consistently ranked among the most popular programming languages for machine learning and data science. Why is that?

① Easy to learn: Python uses a very simple syntax that works for anything from simple computations, such as concatenating two strings, to complex processes, such as building a sophisticated machine learning model.

② Less code: Data science and machine learning involve a great many algorithms, but thanks to Python's support for predefined packages, we do not have to write them from scratch.

In addition, to make things easier, Python provides a “check as you code” approach, which effectively reduces the burden of testing code.

③ Pre-built libraries: Python has more than 100 pre-built libraries that can be used to implement various machine learning and deep learning algorithms.

So whenever you want to run an algorithm on a data set, all you need to do is install and load the necessary packages with a single command.

Popular pre-built libraries include NumPy, Keras, TensorFlow, and PyTorch.

④ Platform independent: Python runs on a variety of platforms, including Windows, macOS, Linux, and Unix.

When moving code from one platform to another, you can use packages such as PyInstaller to take care of dependency issues.

⑤ Strong community support: Besides having a huge following, Python has many communities and forums where programmers of all kinds can post their errors and help one another.

Python libraries for data science and machine learning

One of the main reasons Python is so widely used in artificial intelligence (AI) and machine learning is that thousands of libraries are available for it.

Through their built-in functions and methods, these libraries make it easy to carry out data analysis, processing, wrangling, and modeling tasks.

Next, we will focus on libraries for the following types of tasks:

  • Statistical analysis

  • Data visualization

  • Data modeling and machine learning

  • Deep learning

  • Natural language processing (NLP)

Statistical analysis

Statistics is one of the foundations of data science and machine learning. All machine learning and deep learning (DL) algorithms and related techniques are built on the basic principles and concepts of statistics, and Python provides a large number of libraries for statistical analysis.

Here, we will focus on the recommended packages whose built-in functions can perform complex statistical computations.

They are:

  • NumPy

  • SciPy

  • Pandas

  • StatsModels



NumPy, or Numerical Python, is one of the most commonly used Python libraries. Its main feature is support for multidimensional arrays on which mathematical and logical operations can be performed.

NumPy can be used to index, sort, and reshape arrays, and to represent images and sound waves as multidimensional arrays of real numbers.

Here is a list of NumPy's features:

  • Performs mathematical and scientific computations ranging from simple to complex.

  • Provides strong support for multidimensional array objects, along with a collection of functions and methods for processing array elements.

  • Provides Fourier transforms and data manipulation routines.

  • Performs linear algebra computations, which are essential for machine learning algorithms such as linear regression, logistic regression, and naive Bayes.
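
The features above can be illustrated with a minimal sketch. The array values and variable names below are made up for illustration; only standard NumPy calls are used.

```python
import numpy as np

# A 2-D array (3 rows, 4 columns) built by reshaping a 1-D range.
a = np.arange(12, dtype=float).reshape(3, 4)

# Indexing and sorting.
second_row = a[1]                            # the row with index 1
sorted_desc = np.sort(a, axis=None)[::-1]    # all elements, largest first

# Linear algebra: least-squares fit of y = slope * x + intercept.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
design = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(design, y, rcond=None)

# Fourier transform of a sine wave that completes 4 cycles in 64 samples.
t = np.arange(64)
signal = np.sin(2 * np.pi * 4 * t / 64)
spectrum = np.abs(np.fft.rfft(signal))
dominant_bin = int(np.argmax(spectrum))      # index of the strongest frequency
```

The least-squares fit here is the same computation that underlies simple linear regression, which is why linear algebra support matters for the algorithms mentioned above.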


SciPy is built on top of NumPy and is organized as a collection of sub-packages that help solve the most basic problems in statistical analysis.

Because it is designed to work with array elements defined using NumPy, SciPy is often used to solve mathematical problems that cannot be handled with NumPy alone.

Here is a list of SciPy's features:

  • Works together with NumPy arrays to provide a platform for numerical integration and optimization methods.

  • Comes with a collection of sub-packages for vector quantization, Fourier transforms, integration, interpolation, and more.

  • Provides a complete stack of linear algebra functions, which can be used for more advanced computations such as clustering with the k-means algorithm.

  • Provides support for signal processing, data structures, numerical algorithms, and the creation of sparse matrices.
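
As a rough sketch of how SciPy builds on NumPy arrays, the snippet below touches integration, optimization, and sparse matrices. The functions being integrated and minimized, and the matrix values, are arbitrary examples chosen for illustration.

```python
import numpy as np
from scipy import integrate, optimize, sparse

# Numerical integration: the integral of sin(x) over [0, pi] is exactly 2.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# Optimization: minimize a simple quadratic whose minimum lies at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2)

# Sparse matrix: store only the non-zero entries of a mostly empty matrix.
dense = np.array([[0, 0, 5],
                  [0, 0, 0],
                  [7, 0, 0]])
compact = sparse.csr_matrix(dense)
```

After running this, `area` is close to 2, `result.x` is close to 3, and `compact.nnz` reports the two stored non-zero entries.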


Pandas is another important statistical library, used mainly in fields such as statistics, finance, economics, and data analysis.

The library relies on NumPy arrays to process its own data objects. NumPy, Pandas, and SciPy depend heavily on one another when performing scientific computation and data manipulation.

Here is a list of Pandas's features:

  • Creates fast and effective DataFrame objects with predefined or custom indexes.

  • Can be used to manipulate large data sets and to perform subsetting, data slicing, and indexing operations.

  • Provides built-in features for working with Excel files and for performing complex data analysis tasks such as descriptive statistical analysis, data wrangling, transformation, manipulation, and visualization.

  • Provides support for processing time series data.
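
The operations listed above might look like this in practice. This is a minimal sketch: the `sales` table, its column names, and its values are invented for illustration.

```python
import pandas as pd

# A small DataFrame with a custom date index (values are made up).
sales = pd.DataFrame(
    {"region": ["north", "south", "north", "south"],
     "units": [10, 7, 12, 9]},
    index=pd.to_datetime(["2020-01-01", "2020-01-01",
                          "2020-01-02", "2020-01-02"]),
)

# Subsetting and slicing: rows for one region, then a single day by label.
north = sales[sales["region"] == "north"]
first_day = sales.loc["2020-01-01"]

# Descriptive statistics and a group-wise aggregate.
mean_units = sales["units"].mean()
units_by_region = sales.groupby("region")["units"].sum()

# Time series support: total units per calendar day.
daily_total = sales["units"].resample("D").sum()
```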

In my view, Pandas is an excellent library for handling large amounts of data, NumPy provides excellent support for multidimensional arrays, and SciPy provides a set of sub-packages that cover most statistical analysis tasks.


Built on top of NumPy and SciPy, the StatsModels Python package is an excellent choice for creating statistical models, handling data, and evaluating models.

In addition to using NumPy arrays and the scientific models from the SciPy library, it also integrates with Pandas for effective data handling. StatsModels is well known for statistical computation, statistical testing, and data exploration.

Here is a list of StatsModels's features:

  • Fills gaps left by NumPy and SciPy by providing statistical tests and hypothesis tests.

  • Provides an implementation of R-style formulas for better statistical analysis, which makes it feel familiar to statisticians who use the R language.

  • Because of its broad support for statistical computation, it is often used to implement generalized linear models (GLM) and ordinary least squares (OLS) linear regression models.

  • Supports statistical testing, including hypothesis testing against a null hypothesis.
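
To make the R-style formulas and the OLS support concrete, here is a small sketch that assumes the `statsmodels` package is installed. The data is synthetic (a line with slope 3 and intercept 2 plus random noise), generated only for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (an assumption for illustration): y = 3x + 2 plus small noise.
rng = np.random.default_rng(seed=0)
x = np.linspace(0.0, 10.0, 50)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=x.size)
data = pd.DataFrame({"x": x, "y": y})

# R-style formula: regress y on x; an intercept is included by default.
model = smf.ols("y ~ x", data=data).fit()

slope = model.params["x"]
intercept = model.params["Intercept"]
# p-value for the null hypothesis that the slope is zero.
slope_p_value = model.pvalues["x"]
```

Calling `model.summary()` prints a table of coefficients, standard errors, and test statistics similar to what R users expect from `lm`.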

Data visualization

Data visualization is about expressing the key insights from data effectively through graphical representations. It includes graphs, charts, mind maps, heat maps, histograms, density plots, and so on, which are used to study the correlations between data variables.

Here, we will focus on the recommended Python data visualization packages whose built-in functions can be used to study the dependencies between data features.

They are:

  • Matplotlib

  • Seaborn

  • Plotly

  • Bokeh


Matplotlib is the most basic data visualization package in Python. It supports a wide variety of graphs, such as histograms, bar charts, power spectra, and error charts.

With this 2D graphics library, users can produce clear and concise graphs, which are essential for exploratory data analysis (EDA).

Here is a list of Matplotlib's features:

  • Makes it easy to plot graphs by letting users choose appropriate line styles, font styles, axis formatting, and so on.

  • As a tool for reasoning about quantitative information, the graphs it creates help users understand trends, patterns, and correlations.

  • One of the best features of the Matplotlib package is its pyplot module, which provides a user interface very similar to MATLAB's.

  • Provides an object-oriented API for embedding graphs into applications using GUI toolkits such as Tkinter, wxPython, and Qt.
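
A short sketch of the pyplot interface and the object-oriented API working together follows. The sample data is randomly generated, and the figure is rendered into an in-memory buffer with the non-interactive `Agg` backend so that no display is needed; both choices are assumptions made for this example.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

# Made-up sample: 500 draws from a standard normal distribution.
rng = np.random.default_rng(seed=1)
values = rng.normal(size=500)

fig, (ax_hist, ax_line) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram of the sample.
counts, bin_edges, _ = ax_hist.hist(values, bins=20, color="steelblue")
ax_hist.set_title("Histogram")
ax_hist.set_xlabel("value")

# Line plot with a custom line style, axis label, and legend.
x = np.linspace(0, 2 * np.pi, 100)
ax_line.plot(x, np.sin(x), linestyle="--", label="sin(x)")
ax_line.set_title("Line plot")
ax_line.set_xlabel("x")
ax_line.legend()

# Render the figure as PNG bytes in memory.
fig.tight_layout()
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

In an interactive session, `plt.show()` would display the figure instead of writing it to a buffer.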



Although it is built on top of Matplotlib, Seaborn can be used to create more attractive and descriptive statistical graphs than Matplotlib itself.

In addition to extensive support for data visualization, Seaborn comes with a built-in dataset-oriented API for studying the relationships between multiple variables.

Here is a list of Seaborn's features:

  • Provides options for analyzing and visualizing univariate and bivariate data points, and for comparing the data with other subsets of data.

  • Supports automatic statistical estimation and graphical representation of linear regression models for various kinds of target variables.

  • Builds complex visualizations by providing high-level abstractions for structuring multi-plot grids.

  • Comes with a number of built-in themes for styling Matplotlib graphs.


As one of the best-known Python graphing libraries, Plotly provides interactive graphs that help users understand the dependencies between target variables and predictor variables.

It can be used to analyze and visualize statistical, financial, business, and scientific data, producing clear graphs, sub-plots, heat maps, 3D charts, and so on.

Here is a list of Plotly's features:

  • Offers more than 30 chart types, including 3D charts, scientific and statistical graphs, and SVG maps, for clear visualization.

  • Through its Python API, you can create public or private dashboards made up of plots, graphs, text, and web images.

  • Visualizations are serialized in JSON format, so they can be easily accessed from different platforms such as R, MATLAB, and Julia.

  • Through a built-in API called Plotly Grid, users can import data directly into the Plotly environment.



Bokeh is one of the most interactive libraries in Python and can be used to build descriptive graphical representations for web browsers.

It can easily handle large data sets and build versatile graphs, which helps in performing extensive EDA.

With its well-defined functionality, Bokeh can build interactive plots, dashboards, and data applications.

Here is a list of Bokeh's features:

  • Helps users create complex statistical graphs quickly with simple commands.

  • Supports output in the form of HTML, notebooks, and a server. It also supports multiple language bindings, including R, Python, Lua, and Julia.

  • Integrates with Flask and Django, so visualizations can be presented in those applications.

  • Provides support for converting visualizations written in other libraries, such as Matplotlib, Seaborn, and ggplot.

Machine learning

Creating a machine learning model that can accurately predict an outcome or solve a specific problem is the most important part of any data science project.

However, implementing machine learning and deep learning often involves thousands of lines of code, and the models become even more cumbersome when complex problems have to be solved with neural networks.

Fortunately, thanks to the packages available for Python, we can easily implement a variety of machine learning techniques without writing any of the algorithms ourselves.

Here, we will focus on the highly recommended machine learning packages whose built-in functions implement a wide range of machine learning algorithms.

They are:

  • Scikit-learn

  • XGBoost

  • ELI5


As one of the Python libraries for data modeling and model evaluation, Scikit-learn comes with a wide range of supervised and unsupervised machine learning algorithms.

It also provides well-defined support for ensemble learning and boosting.

Here is a list of Scikit-learn's features:

  • Provides a set of standard data sets, such as Iris and Boston House Prices, to help users get started with machine learning. (The Boston data set has been removed from recent Scikit-learn releases.)

  • Provides built-in methods for both supervised and unsupervised machine learning, covering clustering, classification, regression, and anomaly detection problems.

  • Comes with built-in functions for feature extraction and feature selection, which help identify the significant attributes in the data.

  • Provides methods for cross-validation to estimate model performance, along with functions for parameter tuning to improve model performance.
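
The workflow described above (a bundled data set, an ensemble model, cross-validation, and feature importances) can be sketched as follows. The choice of a random forest, the split size, and the random seeds are assumptions made for this example, not recommendations from the article.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Load the bundled Iris data set: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data for a final check.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A random forest is one of the ensemble methods shipped with the library.
model = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validation on the training split.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Fit on the full training split and evaluate on the held-out data.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# Feature importances hint at which attributes matter most.
importances = model.feature_importances_
```

Keeping the test split separate from the cross-validation folds gives an accuracy estimate on data the model never saw during model selection.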


XGBoost, short for “Extreme Gradient Boosting”, is a Python package for boosting-based machine learning. Through gradient boosting, XGBoost can improve the performance and accuracy of machine learning models.

Here is a list of XGBoost's features:

  • Because it is written in C++, XGBoost is regarded as one of the fastest and most effective libraries for improving the performance of machine learning models.

  • Because its core algorithm is parallelizable, it can make effective use of multi-core computers. XGBoost can also process very large data sets and work across a network of data sets.

  • Provides internal parameters for cross-validation, parameter tuning, regularization, and handling missing values, and also offers a Scikit-learn compatible API.

  • XGBoost is frequently used in top data science and machine learning competitions, where it has often outperformed other algorithms.


ELI5 is another Python library, focused mainly on debugging machine learning models and explaining their predictions, which in turn helps improve them. Because it is relatively new, it is usually used together with XGBoost, LightGBM, CatBoost, and similar libraries to improve the accuracy of machine learning models.

Here is a list of ELI5's features:

  • Integrates with the Scikit-learn package to express feature importances and to explain the predictions of decision trees and tree-based ensembles.

  • Analyzes and explains the predictions made by XGBClassifier, XGBRegressor, LGBMClassifier, LGBMRegressor, CatBoostClassifier, CatBoostRegressor, and CatBoost.

  • Provides support for several algorithms for inspecting black-box models. Its TextExplainer module can explain the predictions made by text classifiers.

  • Helps analyze the weights and predictions of Scikit-learn's general linear models (GLM), which include linear regressors and classifiers.

Deep learning

The evolution of machine learning and artificial intelligence is inseparable from deep learning. With the introduction of deep learning, we can build complex models and process enormous data sets.

Python provides a variety of deep learning packages that make it easy to build efficient neural networks.

Here, we will focus on the highly recommended deep learning packages whose built-in functions can be used to implement complex neural networks.

They are:

  • TensorFlow

  • PyTorch

  • Keras


As one of the Python libraries for deep learning, TensorFlow is an open-source library for dataflow programming across a range of tasks.

TensorFlow is a symbolic math library used to build powerful and accurate neural networks. It provides an intuitive multi-platform programming interface that scales well across different domains.

Here is a list of TensorFlow's features:

  • Lets you build and train multiple neural networks to accommodate large projects and data sets.

  • In addition to supporting neural networks, it provides functions and methods for statistical analysis. For example, it comes with built-in functionality for creating probabilistic models and Bayesian networks, with distributions such as Bernoulli, Chi2, Uniform, and Gamma.

  • Provides layered components that perform layered operations on weights and biases, and improves model performance through regularization techniques such as batch normalization and dropout.

  • Comes with a visualizer called TensorBoard, which creates interactive graphs and visuals for understanding the dependencies between data features.


PyTorch is an open-source, Python-based package for scientific computing that is used to implement deep learning techniques and neural networks on large data sets.

Facebook uses this library to develop its neural networks, which support tasks such as face recognition and automatic tagging.

Here is a list of PyTorch's features:

  • Provides an easy-to-use API that integrates with other data science and machine learning frameworks.

  • Like NumPy, PyTorch provides multidimensional arrays, called tensors, which can also be used on a GPU.

  • Can be used not only to model large neural networks, but also provides an interface with more than 200 mathematical operations for statistical analysis.

  • Creates dynamic computation graphs that are built up as the code executes. These graphs help with time series analysis, for example when forecasting sales in real time.
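
A minimal sketch of tensors and the dynamic computation graph follows, assuming the `torch` package is installed. The tensor values are arbitrary, and the example falls back to the CPU when no GPU is present.

```python
import torch

# Tensors behave much like NumPy arrays.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# The computation graph is recorded dynamically as these lines execute.
y = (x ** 2).sum()

# Backpropagation through the recorded graph: dy/dx = 2 * x.
y.backward()
gradient = x.grad

# Tensors can be moved to a GPU when one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x_on_device = x.detach().to(device)
```

Because the graph is rebuilt on every run, ordinary Python control flow such as loops and conditionals can change the model's structure from one input to the next.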


Also regarded as one of the best deep learning libraries in Python, Keras provides comprehensive support for building, analyzing, evaluating, and improving neural networks.

Keras is built on top of the Theano and TensorFlow Python libraries, and provides additional features needed to build complex, large-scale deep learning models.

Here is a list of Keras's features:

  • Supports building all types of neural networks, including fully connected, convolutional, pooling, recurrent, and embedding layers. For large data sets and problems, these models can be further combined to create a full-fledged neural network.

  • Has built-in functions for neural network computation, such as defining layers, objectives, and activation functions, along with optimizers and a host of tools that make it easier to work with image and text data.

  • Comes with several preprocessed data sets and pre-trained models, including MNIST, VGG, Inception, SqueezeNet, and ResNet.

  • Is easily extensible and supports adding new functions and methods.

Natural language processing

Google accurately predicts what users are searching for, and Alexa, Siri, and other chatbots all rely on natural language processing (NLP) technology.

NLP plays a huge role in designing AI-based systems that help describe the interaction between human language and computers.

Here, we will focus on the highly recommended natural language processing packages whose built-in functions can be used to implement advanced AI-based systems.

They are:

  • NLTK

  • spaCy

  • Gensim

NLTK (Natural Language Toolkit) is considered an excellent Python package for analyzing human language and behavior. The first choice of most data scientists, the NLTK library provides easy-to-use interfaces to more than 50 corpora and lexical resources, which help in describing human interactions and in building AI-based systems such as recommendation engines.

Here is a list of NLTK's features:

  • Provides a suite of data and text processing methods for classification, tagging, stemming, parsing, and semantic reasoning in text analysis.

  • Includes wrappers for industrial-strength NLP libraries, which can be used to build complex systems that help with text classification and with finding behavioral trends and patterns in human speech.

  • Comes with a comprehensive guide to implementing computational linguistics and a complete API documentation guide, which help newcomers get started with NLP.

  • Has a huge community of users and professionals who provide comprehensive tutorials and quick guides for learning how to carry out computational linguistics using Python.


spaCy is a free, open-source Python library for implementing advanced natural language processing (NLP) techniques.

When you are working with a lot of text, spaCy makes it easy to understand the morphological meaning of the text and how it can be classified into language that humans can understand.

Here is a list of spaCy's features:

  • Besides linguistic computation, spaCy provides separate modules for building, training, and testing statistical models, which help users better understand the meaning of words.

  • Comes with a variety of built-in linguistic annotations that help analyze the grammatical structure of a sentence. This not only helps in understanding the text, but also in finding the relationships between different words in a sentence.

  • Can apply tokenization to complex, nested tokens that contain abbreviations and multiple punctuation marks.

  • Besides being powerful and efficient, spaCy supports more than 51 languages.


Gensim is another open-source Python package, designed to extract semantic topics from large documents and texts and to process, analyze, and predict human behavior through statistical models and linguistic computation.

Whether the data is raw or unstructured, it is capable of processing enormous data sets.

Here is a list of Gensim's features:

  • Builds effective models for classifying documents by understanding the statistical semantics of each word.

  • Comes with text processing algorithms such as Word2Vec, FastText, and latent semantic analysis (LSA).

    These algorithms study the statistical co-occurrence patterns in a document and filter out unnecessary words, so that a model is built from only the significant features.

  • Provides I/O wrappers and readers for importing and supporting a wide range of data formats.

  • Comes with a simple and intuitive interface that beginners can use easily. Its API also has a gentle learning curve, which is why many developers like it.

This article was written by [osc_ a45vpoh4]. When reposting, please include a link to the original. Thank you.
