People often pick up a data science book or complete an online course to learn machine learning. However, the reality is that after learning, it is often unclear how these techniques can be used in an actual project. It's as if you have the pieces of a puzzle (machine learning techniques) in your head but don't know how to fit them together into a working project. If you've had the same problem, this article is for you. This series walks through a complete machine learning solution on a real-world dataset, so you can see how all the pieces come together.
This series follows the general machine learning workflow step by step:
The first step, before any coding, is to understand the problem we are trying to solve and the data available. In this project, we will work with publicly available building energy data from New York City.
The goal: use the energy data to build a model that can predict a building's ENERGY STAR Score, and analyze the results to identify the factors that influence the score.
The data we have includes the ENERGY STAR Score itself, which makes this a supervised regression machine learning task:
First, read the data with pandas (into a DataFrame) and take a look:
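As a minimal sketch of this step (the sample rows and the column names other than "ENERGY STAR Score" are illustrative stand-ins for the real Local Law 84 file), loading and inspecting the data might look like:

```python
import pandas as pd
from io import StringIO

# Tiny inline stand-in for the real CSV file; in the actual project you would
# pass the path of the downloaded Local Law 84 data file to pd.read_csv.
csv_text = """Property Id,Borough,Site EUI (kBtu/ft2),ENERGY STAR Score
1,Manhattan,120.5,55
2,Brooklyn,Not Available,Not Available
3,Queens,88.0,92
"""

data = pd.read_csv(StringIO(csv_text))

# Inspect the shape and the first few rows
print(data.shape)
print(data.head())
```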
This is a subset of the full data, which contains 60 columns. A few problems are already visible. First, although we know we want to predict the ENERGY STAR Score, we don't know what most of the other columns mean. That is not necessarily a blocker, since we can often build an accurate model without knowing what every variable means, but we care about the interpretability of the model, so understanding at least some of the columns matters.
When first given the task, I focused on the name of the data file:
and started searching for information on "Local_Law_84", learning that it is a New York City law requiring all buildings above a certain size to report their energy use. From there we can work out the meaning of each column in the data. Patience is essential in this process.
We don't need to study the exact meaning of every column, but the ENERGY STAR Score is one we must understand precisely. It is described as:
A 1-to-100 percentile ranking based on the energy use reports submitted in each reporting year (higher is better). The ENERGY STAR Score is a relative measure of a building's energy efficiency.
That resolves the first problem. Next we turn to the second: the missing values recorded as "Not Available". We can use the dataframe.info() method to view the data types of the columns:
We can see that some columns that clearly contain numbers (for example, ft²) are stored as objects. We cannot do numerical analysis on strings, so these columns need to be converted to a numeric data type.
Here is a small piece of Python code that replaces every "Not Available" entry with "not a number" (np.nan), then converts the relevant columns to the float data type:
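The code itself did not survive translation; a minimal version of the same idea, on a toy frame (the real code selects columns by units such as ft², kBtu, and kWh appearing in their names), could be:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (column names illustrative)
data = pd.DataFrame({
    'Site EUI (kBtu/ft²)': ['120.5', 'Not Available', '88.0'],
    'ENERGY STAR Score': ['55', 'Not Available', '92'],
    'Borough': ['Manhattan', 'Brooklyn', 'Queens'],
})

# Replace every "Not Available" entry with np.nan
data = data.replace({'Not Available': np.nan})

# Convert columns whose names suggest numeric content to floats
for col in data.columns:
    if any(unit in col for unit in ('ft²', 'kBtu', 'kWh', 'Score')):
        data[col] = data[col].astype(float)
```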
Once the relevant columns are converted to numbers, we can begin analyzing the data.
Deleting information always requires care, but columns with a high percentage of missing values are probably of little use for training a model. The exact threshold for dropping such columns depends on the problem; for this project, we drop columns that are more than 50% missing.
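One way to implement this rule (a sketch on a toy frame; the 50% threshold matches the text):

```python
import numpy as np
import pandas as pd

# Toy frame: column 'b' is 75% missing, column 'a' is complete
data = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [np.nan, np.nan, np.nan, 1.0],
})

# Fraction of missing values in each column
missing_frac = data.isnull().mean()

# Drop columns that are more than 50% missing
to_drop = list(missing_frac[missing_frac > 0.5].index)
data = data.drop(columns=to_drop)
```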
Then we also need to handle outliers. These may come from typos or mis-recorded statistics during data entry, or they may be legitimate extreme values that nonetheless harm model training. For this project, we handle outliers using the definition of extreme outliers (https://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm): values below the first quartile minus 3 times the interquartile range, or above the third quartile plus 3 times the interquartile range.
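A sketch of that rule applied to a single column (the column name and values here are illustrative):

```python
import pandas as pd

# Toy data with one clearly extreme value
data = pd.DataFrame({'Site EUI (kBtu/ft²)':
                     [80.0, 90.0, 95.0, 100.0, 105.0, 110.0, 5000.0]})

col = data['Site EUI (kBtu/ft²)']
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1

# Keep only rows inside the outer fences [Q1 - 3*IQR, Q3 + 3*IQR]
data = data[(col > q1 - 3 * iqr) & (col < q3 + 3 * iqr)]
```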
Exploratory data analysis (EDA)
Now that we have completed the slightly tedious step of data cleaning, we can explore our data. Exploratory data analysis (EDA) is an approach to analyzing a dataset to summarize its main characteristics, often using visualizations.
In short, the goal of EDA is to learn what our data can tell us, so that we can choose and use features sensibly.
The chart above shows that the distribution of ENERGY STAR Scores is uneven, with large spikes at the maximum score of 100 and the minimum score of 1. However, since the score is a percentile, we would expect a uniform distribution, with the same number of buildings assigned to each score.
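A histogram like the one described can be produced along these lines (the scores below are synthetic; in the real project they come from the cleaned DataFrame):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np

# Synthetic scores that mimic the spikes at 1 and 100 described in the text
rng = np.random.default_rng(0)
scores = np.concatenate([rng.integers(1, 101, 500),
                         np.full(200, 100), np.full(100, 1)])

fig, ax = plt.subplots(figsize=(8, 6))
ax.hist(scores, bins=100, edgecolor='black')
ax.set_xlabel('ENERGY STAR Score')
ax.set_ylabel('Number of Buildings')
fig.savefig('score_distribution.png')
```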
If we go back to the definition of the ENERGY STAR Score, we see that it is based on "self-reported energy use", which may explain why there are so many very high scores. Asking building owners to report their own energy use is like asking students to report their own test scores: high marks tend to dominate. Therefore, this may not be the most objective measure of building energy efficiency.
If we had unlimited time, we might investigate why so many buildings have very high or very low scores, perhaps even selecting those buildings and analyzing what they have in common. However, our goal is only to predict the score, not to design a better building scoring system. We can therefore note in our report that the scores have a suspicious distribution, while keeping our main focus on prediction.
Density plots visualize the distribution of a single variable and can be thought of as smoothed histograms. We can color a density plot by category to see how a categorical variable affects the distribution. The following code creates a density plot of ENERGY STAR Scores for different building types (limited to types with more than 100 data points):
We can see that building type has a significant effect on the ENERGY STAR Score: office buildings tend to score higher, while hotels score lower. This tells us we should include building type in our modeling, because it clearly affects the target. As a categorical variable, building type will need to be one-hot encoded.
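For example, one-hot encoding with pandas get_dummies (toy frame, illustrative column name):

```python
import pandas as pd

# Toy frame standing in for the building-type column
data = pd.DataFrame({'Largest Property Use Type': ['Office', 'Hotel', 'Office']})

# One-hot encode the categorical column: one binary column per category
encoded = pd.get_dummies(data, columns=['Largest Property Use Type'])
print(encoded.columns.tolist())
```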
A similar plot can be used to visualize ENERGY STAR Scores by borough:
Borough appears to have much less impact on the score than building type does. Even so, we may want to include it in our model, because there are subtle differences between the boroughs.
We can quantify the relationships between variables with the Pearson correlation coefficient, which measures the strength and direction of a linear relationship between two variables: +1 is a perfect positive linear relationship and -1 a perfect negative linear relationship. Several values of the correlation coefficient are illustrated below:
Although the correlation coefficient cannot capture nonlinear relationships, it is a good starting point for seeing how variables are related. In pandas, we can easily compute correlations between data columns:
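A minimal sketch on synthetic data (the column names are assumptions; the real project correlates every numeric column with the score):

```python
import numpy as np
import pandas as pd

# Synthetic data where the score decreases as Site EUI increases
rng = np.random.default_rng(0)
data = pd.DataFrame({'Site EUI (kBtu/ft²)': rng.normal(100, 20, 200)})
data['score'] = 100 - 0.5 * data['Site EUI (kBtu/ft²)'] + rng.normal(0, 5, 200)

# Correlation of every numeric column with the target, sorted ascending
correlations = data.corr()['score'].sort_values()
print(correlations)
```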
The features most positively correlated with the target (top) and most negatively correlated (bottom):
From the figure above, the most negatively correlated features are almost all related to Energy Use Intensity (EUI). EUI is a building's energy use normalized by its size or other characteristics (lower is better). Intuitively, these correlations make sense: as EUI increases, the ENERGY STAR Score tends to decrease.
This plot shows the relationship behind the -0.7 correlation coefficient: as Site EUI decreases, the ENERGY STAR Score increases, and the relationship holds across building types.
Finally, consider the pairs plot (Pairs Plot). This is a great exploratory analysis tool that lets us see the relationships between multiple pairs of variables, as well as the distribution of each single variable. Here, we use the seaborn visualization library and its PairGrid function to create a pairs plot with scatter plots on the upper triangle, histograms on the diagonal, and two-dimensional kernel density plots with correlation coefficients on the lower triangle.
To read the plot, find where a row intersects a column to see the interaction between those two variables. Besides looking cool, these plots can help us decide which variables to include in the modeling.
This post covered the first two parts of the workflow; the rest of the analysis will follow in later installments. (Translated from: https://towardsdatascience.com/a-complete-machine-learning-walk-through-in-python-part-one-c62152f39420)