Author | Billy Fetzner  Compiled by | VK  Source | Towards Data Science
If you've clicked on this page, you probably have a lot of data to analyze, and you may be wondering about the best and most efficient way to solve some of your data problems. The answer may well be Pandas.
Because Pandas is so popular, it has its own conventional abbreviation, so whenever you import Pandas into Python, use the following name:
import pandas as pd
The main use of the Pandas package is the DataFrame.
The Pandas API defines a DataFrame as:
Two-dimensional, size-mutable, potentially heterogeneous tabular data. Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.
Basically, this means your data is held in a format like the one shown below, with the data arranged in rows and columns:
DataFrames are very useful because they give you an efficient way to view your data and then manipulate it however you want.
Rows can easily be referenced by the index, which is the leftmost number in the DataFrame. The index is zero-based unless you specify a name for each row. Columns can likewise be referenced by column name (for example "Track Name") or by their position in the DataFrame. We'll discuss referencing rows and columns in detail later in this article.
There are several ways to create a Pandas DataFrame:
Import data from a .csv file (or another file type, for example Excel or a SQL database)
From a list
From a dictionary
From a numpy array
Others
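As a minimal sketch of two of the routes in the list above (a list and a numpy array), the snippet below builds small DataFrames by hand. The data and variable names are made up for illustration and are not part of the Spotify dataset used later.

```python
import numpy as np
import pandas as pd

# From a list of rows: each inner list is one row,
# and the column names are supplied separately.
df_from_list = pd.DataFrame(
    [["Running", 250], ["Walking", 1000], ["Cycling", 550]],
    columns=["Exercises", "Mileage"],
)

# From a numpy array: again the column names are supplied separately.
df_from_array = pd.DataFrame(
    np.array([[1, 2], [3, 4], [5, 6]]),
    columns=["a", "b"],
)

print(df_from_list)
print(df_from_array)
```

In both cases the index is filled in automatically as 0, 1, 2 because none was given.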
Usually, you will load data from a .csv file or some other data source into a Pandas DataFrame rather than build it from scratch, since typing it in by hand would take a very long time depending on how much data you have. Here is a quick, simple example that starts from a Python dictionary:
import pandas as pd
dict1 = {'Exercises': ['Running','Walking','Cycling'],
'Mileage': [250, 1000, 550]}
df = pd.DataFrame(dict1)
df
Output :
The dictionary keys ("Exercises" and "Mileage") become the corresponding column headings. The dictionary values, which are lists in this example, become the individual data points in the DataFrame. Running is first in the "Exercises" list, so it sits in the first row, and 250 is listed first in the second column. You'll also notice that, because I didn't specify labels for the DataFrame's index, the rows are automatically labeled 0, 1 and 2.
However, as I said before, the most likely way to create a Pandas DataFrame is from a csv or another type of file that you import in order to analyze the data. That takes just the following:
df = pd.read_csv("file_location.../file_name.csv")
pd.read_csv is a very powerful and versatile function, and depending on how you want to import your data it can be extremely useful. If the csv file already has a header or an index attached to it, you can specify that when importing. To fully understand pd.read_csv, I suggest you look at the Pandas API: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html?highlight=read_csv
Now that you have this huge dataset loaded, you need to look at it and see what it looks like. As the person analyzing the data, you first have to get familiar with the dataset and really understand what is going on in it. I like to get to know my data in four ways.
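To make the header and index options concrete, here is a small sketch. It reads from an in-memory string instead of a real file so that it runs anywhere; the CSV content and the "id" column are invented for the example.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "id,Track Name,Streams\n10,Song A,100\n11,Song B,200\n"

# header=0 takes the column names from the first line;
# index_col="id" uses that column as the row index.
df = pd.read_csv(io.StringIO(csv_text), header=0, index_col="id")
print(df)
```

With a real file you would pass its path as the first argument in place of the StringIO object.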
raw_song.head()
.head shows the first 5 rows of the DataFrame along with its columns, so you can quickly get a sense of what the data looks like. You can also pass a number of rows to the method to show more rows.
.tail shows only the last 5 rows.
raw_song.tail()
From these two quick methods I get a general idea of the column names and of what the data looks like, even though this is only a small sample of the dataset. These methods are especially useful for a dataset like the Spotify one, which holds more than 3 million rows: you can display part of the dataset and understand it quickly without making your computer spend a long time rendering everything.
.info is also very useful. It shows me all the columns, their data types, and whether they contain any null data points.
# show_counts replaces null_counts, which newer pandas versions no longer accept
raw_song.info(verbose=True, show_counts=True)
If you have columns that are entirely integers or floats (for example 'Position' and 'Streams'), then .describe is a useful method for understanding the dataset better, because it displays a number of descriptive statistics about those columns.
raw_song.describe()
Finally, .sample lets you randomly sample the DataFrame. You can use it to check whether any of your operations have incorrectly changed something in the dataset, and when you first explore a dataset it also gives you a good idea of what the data contains.
raw_song.sample(10)
I use these methods all the time when exploring a dataset and preparing it for analysis. Whenever I change the data in a column, rename a column, or add or delete rows or columns, I quickly run at least some of the 5 methods above to make sure every change was made the way I intended.
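For readers without the Spotify file at hand, the sketch below runs the same inspection methods on a tiny made-up DataFrame. The `songs` frame and its values are hypothetical stand-ins for `raw_song`.

```python
import pandas as pd

# Hypothetical stand-in for the Spotify data.
songs = pd.DataFrame({
    "Position": [1, 2, 3, 4, 5, 6],
    "Track Name": ["A", "B", "C", None, "E", "F"],
    "Streams": [600, 500, 400, 300, 200, 100],
})

first_two = songs.head(2)   # first 2 rows instead of the default 5
last_two = songs.tail(2)    # last 2 rows
songs.info()                # prints column names, dtypes and non-null counts
stats = songs.describe()    # summary statistics for the numeric columns
picked = songs.sample(3, random_state=0)  # 3 random rows; random_state makes it repeatable

print(first_two)
print(last_two)
print(stats)
print(picked)
```

Note that .describe skips "Track Name" here because it is not a numeric column.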
Great, now you know how to look at the dataset as a whole, but often you only want to look at a few columns or rows and leave out the rest.
.loc[] and .iloc[]
These two methods do this in different ways, depending on how you are able to refer to a specific row or column.
If you know the label of the row or column, use .loc[].
If you know the integer position (index) of the row or column, use .iloc[].
If you know both, just pick your favorite.
So, back to the Spotify dataset. You can view the "Track Name" column with either .loc[] or .iloc[]. Since I know the label of the column, I can use .loc[] as follows:
raw_song.loc[:,'Track Name']
The colon after the opening bracket specifies the rows I'm referring to. Because I want every row of the "Track Name" column, I use ":".
I get the same output with .iloc[], but this time I need to give the position of the "Track Name" column:
raw_song.iloc[:,1]
.loc[] and .iloc[] work the same way on rows, but in this case the row labels and the row positions are identical, so the two calls look exactly the same.
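The difference between labels and positions is easier to see when the row labels are not simply 0, 1, 2. The following is a small sketch with made-up data and hypothetical row labels "x", "y" and "z".

```python
import pandas as pd

songs = pd.DataFrame(
    {"Track Name": ["A", "B", "C"], "Streams": [300, 200, 100]},
    index=["x", "y", "z"],  # hypothetical row labels
)

by_label = songs.loc["y", "Streams"]  # row labeled "y", column labeled "Streams"
by_position = songs.iloc[1, 1]        # second row, second column
label_slice = songs.loc["x":"y"]      # label slices include the end label
position_slice = songs.iloc[0:1]      # position slices exclude the end position

print(by_label, by_position)
print(label_slice)
print(position_slice)
```

One detail worth remembering: slicing with .loc[] includes the last label, while slicing with .iloc[] follows the usual Python rule and leaves the end position out.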
Another simple way to get part of a DataFrame is to use [] and put the column names inside the square brackets.
raw_song[['Artist','Streams']].head()
If you use only one column and a single set of brackets, you get a Pandas Series.
raw_song['Streams']
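A quick way to convince yourself of the DataFrame versus Series distinction is to check the types directly. The sketch below uses a small made-up frame rather than the Spotify data.

```python
import pandas as pd

songs = pd.DataFrame({
    "Artist": ["A", "B"],
    "Streams": [10, 20],
    "Region": ["ec", "us"],
})

two_cols = songs[["Artist", "Streams"]]  # list of names -> DataFrame
one_col_frame = songs[["Streams"]]       # list with one name -> still a DataFrame
one_col_series = songs["Streams"]        # bare name -> Series

print(type(two_cols).__name__)
print(type(one_col_frame).__name__)
print(type(one_col_series).__name__)
```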
Using what we learned about .loc[], we can use it, or .insert, to add a row or column to a DataFrame.
If you decide to use .loc[] to add a row to the DataFrame, you can only add it at the bottom. If you specify any index that already exists in the DataFrame, the data currently in that row is discarded and replaced with the data you are inserting.
raw_song.loc[3441197] = [0,'hello','bluemen',1,"https://open.spotify.com/track/notarealtrack", '2017-02-05','ec']
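Here is a small sketch of both behaviors, appending and overwriting, on a made-up two-column frame. The labels and values are invented for the example.

```python
import pandas as pd

songs = pd.DataFrame({"Track Name": ["A", "B"], "Streams": [10, 20]})

# Label 2 does not exist yet, so a new row is added at the bottom.
songs.loc[2] = ["C", 30]

# Label 0 already exists, so its data is overwritten.
songs.loc[0] = ["Z", 99]

print(songs)
```

The list on the right-hand side needs one value per column, in column order.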
You can also use .loc[] to add a column to the DataFrame.
raw_song.loc[:,'new_col'] = 0
raw_song.tail()
Besides adding a column at the end this way, there are two other ways to insert a new column into a DataFrame.
The insert method lets you specify where in the DataFrame the column should go. It takes 3 arguments: the position at which to place the column, the name of the new column, and the value to use as the column's data. Note that insert raises an error if a column with that name already exists.
raw_song.insert(2,'new_col',0)
raw_song.tail()
The second way to add a column to a DataFrame is to name the new column inside [] and set it equal to the new data, which makes it part of the DataFrame.
raw_song['new_col'] = 0
raw_song.tail()
With this approach I can't specify where the new column goes, but it is another handy way to add one.
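To compare the two approaches side by side, here is a short sketch on a made-up frame. The column names "Rank" and "Double" are hypothetical.

```python
import pandas as pd

songs = pd.DataFrame({
    "Position": [1, 2],
    "Track Name": ["A", "B"],
    "Streams": [10, 20],
})

# insert: the new column lands at position 1, pushing the others right.
songs.insert(1, "Rank", [1, 2])

# bracket assignment: the new column is appended at the end.
songs["Double"] = songs["Streams"] * 2

print(songs.columns.tolist())
```

Both calls modify the DataFrame in place, so there is nothing to reassign.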
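If you want to delete some rows or columns, it's simple: just drop them.
Specify the axis to drop from (0 for rows, 1 for columns) and the name of the row or column to drop, and you're good to go! Keep in mind that .drop returns a new DataFrame and leaves the original unchanged unless you assign the result back.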
raw_song.drop(labels='new_col',axis=1)
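The sketch below, on a small made-up frame, shows dropping a column, dropping a row, and the fact that the original only changes once the result is assigned back.

```python
import pandas as pd

songs = pd.DataFrame({"Track Name": ["A", "B", "C"], "new_col": [0, 0, 0]})

without_col = songs.drop(labels="new_col", axis=1)  # new DataFrame; songs is unchanged
without_row = songs.drop(labels=0, axis=0)          # drops the row labeled 0

still_there = "new_col" in songs.columns            # True: the original still has the column

# Assign the result back to make the change stick.
songs = songs.drop(labels="new_col", axis=1)

print(without_col)
print(without_row)
print(songs)
```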
If you want to change the DataFrame's index to one of its other columns, use .set_index and give the column's name in the parentheses. However, if you know exactly what you want to name an index entry, use the .rename method.
raw_song.rename(index={0:'first'}).head()
To rename a column, specify in the .rename method the column to rename and, inside the {}, the name you want to give it, just as when renaming the index.
raw_song.rename(columns={'Position':'POSITION_RENAMED'}).head()
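As a compact illustration of all three operations, here is a sketch on a made-up frame. Each call returns a new DataFrame, so the original `songs` is left as it was.

```python
import pandas as pd

songs = pd.DataFrame({"Position": [1, 2], "Track Name": ["A", "B"]})

indexed = songs.set_index("Track Name")         # "Track Name" becomes the index
renamed_rows = songs.rename(index={0: "first"})  # row label 0 becomes "first"
renamed_cols = songs.rename(columns={"Position": "POSITION_RENAMED"})

print(indexed)
print(renamed_rows)
print(renamed_cols)
```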
Often, when you process the data in a DataFrame, you need to change it in some way and iterate over all of its values. The easiest way is the for loop that pandas supports out of the box:
for index, row in raw_song.iterrows():
    # Manipulate the data here; each row arrives as a pandas Series
    pass
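To show what the loop body might look like, here is a minimal sketch that sums a column row by row on a made-up frame. The column-wise alternative at the end is my own addition, not part of the original walkthrough; for simple arithmetic like this it is usually much faster on large DataFrames.

```python
import pandas as pd

songs = pd.DataFrame({"Track Name": ["A", "B"], "Streams": [10, 20]})

# Row-by-row: iterrows yields (index label, row as a Series) pairs.
total = 0
for index, row in songs.iterrows():
    total += row["Streams"]

# Column-wise alternative that avoids the explicit loop.
vector_total = songs["Streams"].sum()

print(total, vector_total)
```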
Once you have finished all your operations on the DataFrame, it's time to export it so it can be sent elsewhere. This is like importing a dataset from a file, only in reverse. Pandas can write DataFrames to many different file types, but the most common is to write to a csv file.
raw_song.to_csv('file_name.csv')
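Note that to_csv is called on the DataFrame itself. The sketch below does a round trip on a made-up frame, writing to a string instead of a file so that it runs without touching the disk; index=False is an optional choice that leaves the row index out of the output.

```python
import io

import pandas as pd

songs = pd.DataFrame({"Track Name": ["A", "B"], "Streams": [10, 20]})

# With no path given, to_csv returns the CSV text as a string.
csv_text = songs.to_csv(index=False)

# Read the text back in to confirm that nothing was lost.
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored)
```

If you keep the default index=True, the index is written as an extra first column, which then shows up as an unnamed column when the file is read back.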
Now you know the basics of Pandas and DataFrames. These are very powerful tools in the data analysis toolbox.
Original article: https://towardsdatascience.com/an-introduction-to-pandas-29d15a7da6d