A Complete Machine Learning Project in Python (1)

Pan Chuang AI 2020-11-13 12:48:53


People often pick up a data science book or complete an online course to learn machine learning. The reality, however, is that after finishing it is still not clear how those techniques are used in an actual project. It is as if you have a pile of puzzle pieces in your head (the machine learning techniques) but no idea how to fit them together into a real project. If you have had the same problem, this article is for you. This series walks through a complete machine learning solution on a real-world dataset so you can see how all the pieces come together.

This series follows the general machine learning workflow step by step:

  1. Data cleaning and formatting
  2. Exploratory data analysis
  3. Feature engineering and selection
  4. Compare several machine learning models on a performance metric
  5. Perform hyperparameter tuning on the best model
  6. Evaluate the best model on the test set
  7. Interpret the model results
  8. Draw conclusions and document the work
By working through the whole process, we will see how each step connects to the next and how to implement each part in Python. The full project, with the complete implementation, can be found on GitHub. This first article covers steps 1-2; the rest will be introduced in the following articles.

Problem definition

The first step before writing any code is to understand the problem we are trying to solve and the data available. In this project, we will work with publicly available energy data for buildings in New York City:

(http://www.nyc.gov/html/gbee/html/plan/ll84_scores.shtml)

The goal: use the energy data to build a model that can predict a building's ENERGY STAR Score, and then interpret the results to find the factors that influence the score.

The data we have includes the ENERGY STAR Score, which makes this a supervised regression machine learning task:

  • Supervised: we have access to both the features and the target, and our goal is to train a model that can learn the mapping between the two
  • Regression: the ENERGY STAR Score is a continuous variable
We want to develop a model that is both accurate (it predicts the ENERGY STAR Score close to the true value) and interpretable (we can understand why it makes its predictions). Knowing these two objectives up front gives us a clear standard to guide our decisions as we dig deeper into the data and build models.

Data cleaning

Unlike the data used in most data science courses, real data is messy; not every dataset is free of missing values or outliers. That means that before we can start the analysis, we need to clean the data and convert it into a usable format. Data cleaning is an unavoidable part of most data science problems.

First, read the data into a pandas DataFrame and take a look:
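A minimal sketch of this step (the CSV file name below is a placeholder for illustration; the actual file in the project repository may differ):

```python
import pandas as pd

# Load the raw Local Law 84 data into a DataFrame
# (the file name is a placeholder, not the project's exact file)
data = pd.read_csv('Energy_and_Water_Data_Disclosure_2017.csv')

# Check the size of the data and peek at the first rows
print(data.shape)
data.head()
```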

(Figure: a sample of the raw data)

This is a subset of the full data, which contains 60 columns. We can already see a few problems. First, although we know we want to predict the ENERGY STAR Score, we do not know what any of the other columns mean. That is not necessarily a blocker, because we can often build a reasonably accurate model without knowing what every variable means, but we care about the interpretability of the model, so it is important to understand at least some of the columns.

When I first got this task, I started by looking at the name of the data file:

Searching for "Local_Law_84" reveals that this is a New York City law requiring all buildings above a certain size to report their energy use. From there we can find the meaning of each column in the data. This process takes patience.

We do not need to study the exact meaning of every column, but we absolutely must understand the ENERGY STAR Score, which is described as:

A 1-to-100 percentile ranking based on the energy use reported for each reporting year (the higher the score, the better). The ENERGY STAR Score is a relative measure of a building's energy efficiency.

That takes care of the first problem. The second issue is the missing values recorded as "Not Available". We can use dataframe.info() to inspect the data type of each column:
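For example:

```python
# Show column names, non-null counts, and data types;
# many numeric-looking columns appear as 'object' because the
# missing entries are recorded as the string "Not Available"
data.info()
```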

As you can see, some columns that clearly contain numbers (for example those measured in ft²) are stored as objects. We cannot do numerical analysis on strings, so these columns need to be converted to a numeric data type.

Here is a little Python code that replaces every "Not Available" entry with not-a-number (np.nan) and then converts the relevant columns to the float data type:
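A sketch of that conversion (the list of column-name keywords is an assumption based on the units that appear in the column names):

```python
import numpy as np

# Replace the "Not Available" placeholder with a real missing-value marker
data = data.replace({'Not Available': np.nan})

# Convert columns whose names indicate a numeric quantity to floats;
# the keyword list is an illustrative assumption
numeric_keywords = ['ft²', 'kBtu', 'kWh', 'therms', 'gal', 'Metric Tons CO2e', 'Score']
for col in data.columns:
    if any(keyword in col for keyword in numeric_keywords):
        data[col] = data[col].astype(float)
```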

Once the relevant columns have been converted to numbers, we can start analysing the data.

Missing data and outliers

Besides incorrect data types, another common problem with real data is missing values. These gaps can appear for many reasons, and they have to be filled in or removed before we train a machine learning model. First, let's see how many missing values there are in each column (see the code on GitHub).
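A quick sketch of that check:

```python
import pandas as pd

# Number and percentage of missing values in each column, largest first
missing_counts = data.isnull().sum().sort_values(ascending=False)
missing_percent = (100 * data.isnull().mean()).sort_values(ascending=False)
print(pd.concat([missing_counts, missing_percent.round(1)],
                axis=1, keys=['Missing', 'Percent']).head(20))
```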

Deleting information always requires care, but columns with a very high percentage of missing values are probably not useful for training the model. The exact threshold for dropping columns depends on the problem; for this project, we drop any column with more than 50% missing values, as sketched below.
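One possible implementation of that rule:

```python
# Drop any column where more than half of the entries are missing
missing_fraction = data.isnull().mean()
columns_to_drop = missing_fraction[missing_fraction > 0.5].index
data = data.drop(columns=columns_to_drop)
print(f'Dropped {len(columns_to_drop)} columns')
```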

Then we also need to deal with outliers. These may come from typos or measurement errors during data entry, or they may be legitimate extreme values that are nonetheless harmful to model training. For this project, we remove outliers based on the definition of extreme outliers (https://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm):

  • Below the first quartile (Q1) minus 3 times the interquartile range
  • Above the third quartile (Q3) plus 3 times the interquartile range
(The code for dropping columns and removing outliers is on GitHub; a sketch is shown below.) After data cleaning and outlier removal, we are left with more than 11,000 buildings and 49 features.
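A sketch of the outlier removal, applied to the Site EUI column (the column name follows the dataset's naming convention and may differ in your copy):

```python
# Compute the quartiles and the interquartile range of the Site EUI
first_quartile = data['Site EUI (kBtu/ft²)'].quantile(0.25)
third_quartile = data['Site EUI (kBtu/ft²)'].quantile(0.75)
iqr = third_quartile - first_quartile

# Keep only the rows that are not extreme outliers
data = data[(data['Site EUI (kBtu/ft²)'] > (first_quartile - 3 * iqr)) &
            (data['Site EUI (kBtu/ft²)'] < (third_quartile + 3 * iqr))]
```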

Exploratory data analysis (EDA)

Now that the somewhat tedious data cleaning step is done, we can explore our data. Exploratory data analysis (EDA) is the process of analysing a dataset to summarise its main characteristics, usually with visualisations.

In short, the goal of EDA is to learn what our data can tell us, so that we can choose and use the features sensibly.

Univariate plots (a typical EDA technique)

Our goal is to predict the ENERGY STAR Score (renamed to score in our dataset), so a reasonable place to start is to examine the distribution of this single variable. A histogram is a simple and effective way to visualise the distribution of a single variable, and it is easy to produce with matplotlib:
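One way to draw it (assuming the target column has been renamed to score as described above):

```python
import matplotlib.pyplot as plt

# Histogram of the Energy Star Score
plt.hist(data['score'].dropna(), bins=100, edgecolor='black')
plt.xlabel('Score')
plt.ylabel('Number of Buildings')
plt.title('Energy Star Score Distribution')
plt.show()
```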

The chart above shows that the distribution of ENERGY STAR Scores is far from uniform: a large share of buildings sit at the maximum score of 100 or the minimum score of 1. Since the score is a percentile, we would expect a uniform distribution, with the same number of buildings at each score.

If we go back to the definition of the ENERGY STAR Score, we see that it is based on "self-reported energy use", which may explain the many very high scores. Asking building owners to report their own energy use is a bit like asking students to report their own exam marks: high scores tend to dominate. It may therefore not be the most objective measure of building energy efficiency.

If we had unlimited time, we might investigate why so many buildings have extremely high or extremely low scores, perhaps by selecting those buildings and analysing what they have in common. However, our goal is only to predict the score, not to design a better building scoring system. We can note in our report that the scores have a suspicious distribution, but our main concern is predicting the score.

Looking for relationships

Analysing the relationships between the features and the target is one of the main steps of EDA. Variables that are correlated with the target are useful to a model. One way to examine the effect of a categorical variable (one that takes only a limited set of values) on the target is with a density plot from the seaborn library.

A density plot visualises the distribution of a single variable and can be thought of as a smoothed histogram. By colouring the density curves by category, we can see how a categorical variable shifts the distribution. The following code creates a density plot of the ENERGY STAR Score for different building types (limited to types with more than 100 data points):
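A sketch of that plot ('Largest Property Use Type' is the building-type column in this dataset; adjust the name if it differs in your copy):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Keep only the building types with more than 100 observations
types = data.dropna(subset=['score'])['Largest Property Use Type'].value_counts()
types = list(types[types > 100].index)

# Overlay a density curve of the score for each building type
plt.figure(figsize=(12, 8))
for b_type in types:
    subset = data[data['Largest Property Use Type'] == b_type]
    sns.kdeplot(subset['score'].dropna(), label=b_type)

plt.xlabel('Energy Star Score')
plt.ylabel('Density')
plt.title('Density Plot of Energy Star Scores by Building Type')
plt.legend()
plt.show()
```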

We can see that the building type has a significant effect on the ENERGY STAR Score. Office buildings tend to score higher, while hotels score lower. This tells us that we should include the building type in our modelling, because it clearly affects the target. As a categorical variable, the building type will have to be one-hot encoded.

A similar plot can be used to visualise the ENERGY STAR Score by borough:

The borough does not seem to have as large an effect on the score as the building type. Even so, we may want to include it in our model, because there are slight differences between the boroughs.

We can use the Pearson correlation coefficient to quantify the relationships between variables. It measures the strength and direction of a linear relationship between two variables: a value of +1 indicates a perfectly positive linear relationship and -1 a perfectly negative linear relationship. Several values of the correlation coefficient look like this:

Although the correlation coefficient cannot capture non-linear relationships, it is a good starting point for figuring out how the variables are related. In Pandas, we can easily compute the correlations between the data columns:
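A minimal example:

```python
# Pearson correlation of every numeric column with the score
correlations = data.select_dtypes('number').corr()['score'].sort_values()

# Strongest negative and strongest positive correlations with the target
print(correlations.head(15))
print(correlations.tail(15))
```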

The strongest positive correlations with the target (top) and the strongest negative correlations (bottom):

From the figure above, the most negatively correlated features are almost all related to Energy Use Intensity (EUI). EUI is a building's energy use divided by its size or another characteristic (the lower, the better). Intuitively, these correlations make sense: as the EUI increases, the ENERGY STAR Score tends to decline.

Bivariate graph

We can use a scatter plot to show the relationship between two continuous variables, encoding additional information such as a categorical variable in the colour of the points. For example, the chart below plots the ENERGY STAR Score against the Site EUI, coloured by building type:
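One way to produce such a chart (reusing the types list of common building types from the sketch above; the column names follow the dataset's conventions):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot of the score against the Site EUI, coloured by building type
subset = data[data['Largest Property Use Type'].isin(types)].dropna(subset=['score'])
sns.scatterplot(data=subset, x='Site EUI (kBtu/ft²)', y='score',
                hue='Largest Property Use Type', alpha=0.7)
plt.title('Energy Star Score vs Site EUI')
plt.show()
```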

This plot lets us see what a correlation coefficient of -0.7 looks like: as the Site EUI decreases, the ENERGY STAR Score increases, and the relationship holds steady across building types.

Finally, the Pairs Plot. This is a great exploratory tool that lets us see the relationships between multiple pairs of variables as well as the distribution of each single variable. Here we use the seaborn visualisation library and its PairGrid function to create a Pairs Plot with scatter plots in the upper triangle, histograms on the diagonal, and two-dimensional kernel density plots and correlation coefficients in the lower triangle.
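A simplified sketch (the column selection is illustrative, and the correlation-coefficient annotation from the full project is omitted for brevity):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairs Plot over a few numeric columns
plot_data = data[['score', 'Site EUI (kBtu/ft²)',
                  'Weather Normalized Site EUI (kBtu/ft²)']].dropna()

grid = sns.PairGrid(plot_data)
grid.map_upper(sns.scatterplot, alpha=0.6)   # scatter plots in the upper triangle
grid.map_diag(sns.histplot)                  # histograms on the diagonal
grid.map_lower(sns.kdeplot)                  # 2D kernel density plots below
plt.show()
```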

To read the plot, find where a row intersects a column to see the interaction between those two variables. Besides looking cool, these plots can help us decide which variables to include in the modelling.

This article has covered the first two parts of the workflow; the remaining steps will follow in the next articles. (Translated from: https://towardsdatascience.com/a-complete-machine-learning-walk-through-in-python-part-one-c62152f39420)

 

 

Copyright notice
This article was created by [Pan Chuang AI]. Please include a link to the original when reposting. Thank you.
