Basic usage of pandas data structures

Head and tail

head() and tail() give a quick preview of a Series or DataFrame. By default they display 5 rows; you can also specify the number of rows to display.
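A minimal sketch of both calls, assuming a made-up Series named long_series (the name and the data are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical data: a Series holding the integers 0..999.
long_series = pd.Series(np.arange(1000))

first_five = long_series.head()    # no argument: the first 5 rows
last_three = long_series.tail(3)   # explicit count: the last 3 rows

print(first_five)
print(last_three)
```

The same two methods work on a DataFrame, where they return the first or last rows of the table.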

Attributes and underlying data

pandas metadata can be accessed through several attributes:

shape: gives the axis dimensions of the object, consistent with ndarray

Axis labels:

Series: index (the only axis)

DataFrame: index (rows) and columns
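The attributes above can be sketched on a small, made-up DataFrame (the labels and values are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical 3 x 2 frame with explicit row and column labels.
df = pd.DataFrame(
    np.zeros((3, 2)),
    index=["a", "b", "c"],
    columns=["x", "y"],
)
s = df["x"]

print(df.shape)    # two axes: (rows, columns)
print(s.shape)     # a Series has a single axis
print(df.index)    # row labels
print(df.columns)  # column labels

# Axis labels can be assigned to directly.
df.columns = [name.upper() for name in df.columns]
print(df.columns)
```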

pandas objects (Index, Series, DataFrame) can be thought of as containers for arrays, which hold the actual data and do the actual computation. For most types the underlying array is a numpy.ndarray. However, pandas and third-party libraries may extend NumPy's type system to add support for custom arrays.

The .array property extracts the data held in an Index or a Series.

.array always returns an ExtensionArray.

To get a NumPy array instead, use to_numpy() or numpy.asarray().

When the Series or Index is backed by an ExtensionArray, to_numpy() may copy the data and coerce the values.
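As a rough illustration of the three access paths, assuming a small float Series (the values and labels are made up):

```python
import numpy as np
import pandas as pd

s = pd.Series([0.1, 0.2, 0.3], index=["a", "b", "c"])

ext = s.array            # an ExtensionArray wrapping the Series data
idx_ext = s.index.array  # .array is available on an Index as well
arr = s.to_numpy()       # a numpy.ndarray
arr2 = np.asarray(s)     # also a numpy.ndarray

print(type(ext).__name__)
print(type(arr).__name__, arr.dtype)
```

The concrete ExtensionArray class printed for ext depends on the pandas version and on the dtype of the data.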

to_numpy() gives some control over the dtype of the resulting numpy.ndarray. Take datetimes with time zones as an example: NumPy has no dtype for time-zone-aware datetimes, so pandas offers two representations:

1. An object-dtype numpy.ndarray of Timestamp objects, each carrying the correct tz.

2. A datetime64[ns] numpy.ndarray, in which the values have been converted to UTC and the time zone information discarded.

Time zone information is preserved with dtype=object

and discarded with dtype='datetime64[ns]'.
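Both representations can be sketched as follows, assuming a two-element Series in the CET time zone (the dates and the zone are illustrative choices):

```python
import numpy as np
import pandas as pd

ser = pd.Series(pd.date_range("2000-01-01", periods=2, tz="CET"))

# Object dtype: each element is a Timestamp that still knows its time zone.
as_objects = ser.to_numpy(dtype=object)

# datetime64[ns]: values are converted to UTC and the time zone is dropped.
as_utc = ser.to_numpy(dtype="datetime64[ns]")

print(as_objects)
print(as_utc)
```

Midnight CET is one hour ahead of UTC in January, so the first element of as_utc falls on the previous calendar day.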

Extracting the raw data from a DataFrame is slightly more involved. When all columns of the DataFrame share a single data type, DataFrame.to_numpy() returns the underlying data.

If a DataFrame contains homogeneously typed data, the returned ndarray can be modified in place, and the changes are reflected in the data structure (this may differ in pandas versions where Copy-on-Write is enabled). For heterogeneous data, that is, when the columns of the DataFrame do not all have the same data type, this is not the case. Unlike the axis labels, the values attribute itself cannot be assigned to.

Note that when working with heterogeneous data, the dtype of the resulting ndarray is chosen to accommodate all of the data involved. If the DataFrame contains strings, the result has object dtype. If it contains only floats and integers, the result has float dtype.
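The dtype rules described above can be checked with a few small, made-up DataFrames (column names and values are illustrative assumptions):

```python
import numpy as np
import pandas as pd

all_floats = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
ints_and_floats = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
with_strings = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Homogeneous columns: the array keeps the shared dtype.
float_dtype = all_floats.to_numpy().dtype

# Integers mixed with floats: everything is upcast to float.
mixed_dtype = ints_and_floats.to_numpy().dtype

# Strings involved: the only dtype that fits everything is object.
string_dtype = with_strings.to_numpy().dtype

print(float_dtype, mixed_dtype, string_dtype)
```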

In the past, pandas recommended Series.values or DataFrame.values for extracting the data from a Series or DataFrame.

pandas has since improved this functionality. The recommendation now is to extract the data with .array or to_numpy() and to avoid .values.

.values has the following two drawbacks:

1. When a Series holds an extension type, it is unclear whether Series.values returns a NumPy array or an extension array. Series.array always returns an ExtensionArray and never copies the data. Series.to_numpy() always returns a NumPy array, potentially at the cost of copying the data and coercing the values.

2. When a DataFrame holds a mix of data types, DataFrame.values may copy the data and coerce the values to a common dtype, which is a relatively expensive operation. DataFrame.to_numpy(), being a method, makes it clearer that the returned NumPy array may not be a view on the data in the DataFrame.
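The first drawback can be illustrated with a short sketch, assuming one plain integer Series and one categorical Series (both made up for this example):

```python
import numpy as np
import pandas as pd

plain = pd.Series([1, 2, 3])
categorical = pd.Series(["a", "b", "a"], dtype="category")

# .values returns a different kind of object depending on the dtype.
print(type(plain.values).__name__)        # a NumPy array
print(type(categorical.values).__name__)  # an extension array (Categorical)

# .array and to_numpy() each have one predictable return kind.
cat_array = categorical.array       # always an ExtensionArray
cat_numpy = categorical.to_numpy()  # always a numpy.ndarray
```

Code that must handle arbitrary dtypes can therefore pick .array or to_numpy() according to the kind of array it needs, instead of inspecting what .values happened to return.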

Accelerated operations

With the help of the numexpr and bottleneck libraries, pandas can accelerate certain types of binary numerical and boolean operations.

These two libraries are especially useful when dealing with large data sets, where the speedup is substantial. numexpr uses smart chunking, caching, and multiple cores. bottleneck is a set of specialized cython routines that are especially fast when dealing with arrays that contain nans.

Consider, for example, a DataFrame with 100 columns × 100,000 rows.

Both libraries are enabled by default. This can be controlled with the options compute.use_bottleneck and compute.use_numexpr.
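A minimal sketch of toggling these options and timing a binary operation. The frames df1 and df2 and the helper time_add are hypothetical names introduced here, and the frames are smaller than the 100 columns × 100,000 rows mentioned above so that the sketch runs quickly. No timings are quoted because they depend on the machine, on the data size, and on whether numexpr is installed at all.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: two frames of random floats.
rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.standard_normal((10_000, 100)))
df2 = pd.DataFrame(rng.standard_normal((10_000, 100)))


def time_add(use_numexpr: bool) -> float:
    """Time df1 + df2 with the numexpr option temporarily set."""
    with pd.option_context("compute.use_numexpr", use_numexpr):
        return timeit.timeit(lambda: df1 + df2, number=5)


t_on = time_add(True)
t_off = time_add(False)
print(f"compute.use_numexpr=True:  {t_on:.4f} s")
print(f"compute.use_numexpr=False: {t_off:.4f} s")

# The options can also be switched off globally.
pd.set_option("compute.use_bottleneck", False)
pd.set_option("compute.use_numexpr", False)
bottleneck_enabled = pd.get_option("compute.use_bottleneck")
numexpr_enabled = pd.get_option("compute.use_numexpr")

# Restore the default values afterwards.
pd.reset_option("compute.use_bottleneck")
pd.reset_option("compute.use_numexpr")
```

Using pd.option_context for the timing keeps the change local to the with block, which avoids leaking a modified setting into unrelated code.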

