## Using sparse matrices with scipy.sparse, pandas.sparse and sklearn

Understanding oneself 2020-11-13 10:09:06

On a single machine, sparse features in a large matrix quickly lead to memory problems. If you are not going distributed and do not want to bring in tools such as Mars, Dask, or CuPy, sparse matrices are an easy way out.
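To make the memory argument concrete, here is a minimal sketch that compares the bytes held by a dense array with the bytes held by the three arrays behind a CSR matrix. The matrix size and the roughly 1% density are made-up values for illustration only.

```python
import numpy as np
from scipy import sparse

# Hypothetical data: a 2,000 x 1,000 matrix with roughly 1% nonzero entries
rng = np.random.default_rng(0)
dense = np.zeros((2_000, 1_000), dtype=np.float64)
mask = rng.random(dense.shape) < 0.01
dense[mask] = 1.0

csr = sparse.csr_matrix(dense)

# Bytes used by the dense array versus the three arrays that make up the CSR matrix
dense_bytes = dense.nbytes
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print(f"dense: {dense_bytes / 1e6:.2f} MB")
print(f"csr:   {csr_bytes / 1e6:.2f} MB")
```

The saving shrinks as the matrix gets denser, so this is only worth doing when most entries are zero.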

# 1 scipy.sparse

## 1.1 Types of sparse matrix in SciPy

SciPy has 7 data structures for storing sparse matrices:

• bsr_matrix: Block Sparse Row matrix
• coo_matrix: COOrdinate format matrix
• csc_matrix: Compressed Sparse Column matrix
• csr_matrix: Compressed Sparse Row matrix
• dia_matrix: Sparse matrix with DIAgonal storage
• dok_matrix: Dictionary Of Keys based sparse matrix
• lil_matrix: Row-based LInked List sparse matrix

When to use which type:

• For building a new sparse matrix, lil_matrix, dok_matrix and coo_matrix are the more efficient choices, but they are not suited to matrix operations.
• For matrix operations such as multiplication or inversion, use the CSC or CSR types.
• Because of the difference in memory layout, csc_matrix is better suited to column slicing, while csr_matrix is better suited to row slicing.
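The guidance above can be sketched as follows: build the matrix in LIL, then convert once to CSR or CSC before doing arithmetic or slicing. The shape and the values are arbitrary examples.

```python
import numpy as np
from scipy import sparse

# Build incrementally in LIL, which is cheap to modify
builder = sparse.lil_matrix((4, 5), dtype=np.float64)
builder[0, 1] = 2.0
builder[2, 3] = 5.0
builder[3, 0] = 1.0

# Convert once construction is finished
csr = builder.tocsr()  # for arithmetic and row slicing
csc = builder.tocsc()  # for column slicing

product = csr @ csr.T        # sparse matrix product, shape (4, 4)
row_2 = csr[2, :].toarray()  # row 2 as a dense (1, 5) array
col_3 = csc[:, 3].toarray()  # column 3 as a dense (4, 1) array

print(product.toarray())
print(row_2)
print(col_3.ravel())
```

Converting is a one-off cost, so it pays to finish all element-wise assignments before switching format.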

## 1.2 lil_matrix

I will only cover lil_matrix here, because it is the one I use and it is the more convenient one.
lil_matrix is the second most intuitive way of storing a sparse matrix. Its full name is row-based linked list sparse matrix. It has two components: rows, which holds the column indices of the nonzero entries of each row, and data, which holds the corresponding values.

Example 1:

``````
>>> from scipy.sparse import lil_matrix
>>> l = lil_matrix((6,5))
>>> l[2,3] = 1
>>> l[3,4] = 2
>>> l[3,2] = 3
>>> print(l.toarray())
[[ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  0.  3.  0.  2.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]]
>>> print(l.data)
[[] [] [1.0] [3.0, 2.0] [] []]
>>> print(l.rows)
[[] [] [3] [2, 4] [] []]
``````

Example 2:

``````
import numpy as np
from scipy import sparse

# The original matrix
mat = np.array([[1., 0., 0., 0., 0.],
                [0., 0., 2., 0., 3.],
                [0., 0., 0., 0., 0.],
                [0., 0., 0., 4., 0.],
                [0., 0., 0., 0., 5.]])
mat_coo = sparse.coo_matrix(mat)  # assumed here: mat_coo is the COO form of the matrix above
mat_lil = sparse.lil_matrix(mat_coo)  # the sparse formats can be converted into one another
# The two components of mat_lil
mat_lil.rows
# array([list([0]), list([2, 4]), list([]), list([3]), list([4])],
#       dtype=object)
mat_lil.data
# array([list([1.0]), list([2.0, 3.0]), list([]), list([4.0]), list([5.0])],
#       dtype=object)
``````

Example 3:

``````
import numpy as np
from scipy import sparse

# Create the matrix
lil = sparse.lil_matrix((6, 5), dtype=int)
# Set values
# set an individual point
lil[(0, -1)] = -1
# set two points
lil[3, (0, 4)] = [-2] * 2
# set the main diagonal
lil.setdiag(8, k=0)
# set an entire column
lil[:, 2] = np.arange(lil.shape[0]).reshape(-1, 1) + 1
# Convert to a dense array
lil.toarray()
'''
array([[ 8,  0,  1,  0, -1],
       [ 0,  8,  2,  0,  0],
       [ 0,  0,  3,  0,  0],
       [-2,  0,  4,  8, -2],
       [ 0,  0,  5,  0,  8],
       [ 0,  0,  6,  0,  0]])
'''
# View the stored values, one list per row
lil.data
'''
array([list([8, 1, -1]), list([8, 2]), list([3]), list([-2, 4, 8, -2]),
       list([5, 8]), list([6])], dtype=object)
'''
# View the column indices of those values, one list per row
lil.rows
'''
array([list([0, 2, 4]), list([1, 2]), list([2]), list([0, 2, 3, 4]),
       list([2, 4]), list([2])], dtype=object)
'''
``````

## 1.3 General properties of matrices

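Matrix attributes

``````
import numpy as np
from scipy.sparse import bsr_matrix, coo_matrix, csr_matrix

# Illustrative matrices (not from the original post) so that the accesses below run
dense = np.eye(4)
mat = csr_matrix(dense)
coo = coo_matrix(dense)
bsr = bsr_matrix(dense, blocksize=(2, 2))

### Common attributes
mat.shape # matrix shape
mat.dtype # data type
mat.ndim # number of dimensions
mat.nnz # number of stored (nonzero) values
mat.data # stored values (a one-dimensional array for CSR)
### COO only
coo.row # row index of each stored value
coo.col # column index of each stored value
### CSR/CSC/BSR only
bsr.indices # index array
bsr.indptr # index pointer array
bsr.has_sorted_indices # whether the indices are sorted
bsr.blocksize # block size of a BSR matrix
``````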

Common methods

``````
import scipy.sparse as sp
# Reference listing: mat is any sparse matrix; i, j, t, format, other and shape are placeholders
### Convert between matrix formats
# mat.tobsr(), mat.tocoo(), mat.tocsr(), mat.tocsc(), mat.todia(), mat.todok(), mat.tolil()
mat.toarray() # to a NumPy ndarray
mat.todense() # to a dense numpy.matrix
# Return the matrix in the given sparse format, e.g. 'csr'
mat.asformat(format)
# Return the matrix with its elements cast to the given dtype
mat.astype(t)
### Check the matrix format
# sp.issparse, sp.isspmatrix_lil, sp.isspmatrix_csc, sp.isspmatrix_csr
sp.issparse(mat)
### Get matrix data
mat.getcol(j) # copy of column j as an (m x 1) sparse matrix (column vector)
mat.getrow(i) # copy of row i as a (1 x n) sparse matrix (row vector)
mat.nonzero() # indices of the nonzero elements
mat.diagonal() # main diagonal of the matrix
mat.max(axis=0) # maximum along the given axis (here: per column)
### Matrix operations
mat = mat * 5 # multiply by a scalar
mat.dot(other) # matrix product with other
mat.resize(*shape) # resize in place to the given shape
mat.transpose() # transposed matrix
``````
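As a runnable companion to the listing above, here is a small sketch that calls a handful of these methods on a concrete 3 x 3 matrix. The matrix values are arbitrary.

```python
import numpy as np
import scipy.sparse as sp

dense = np.array([[0, 3, 0],
                  [4, 0, 0],
                  [0, 0, 5]])
mat = sp.csr_matrix(dense)

as_csc = mat.asformat("csc")         # same values, CSC storage
as_float = mat.astype(np.float64)    # same format, new dtype

rows, cols = mat.nonzero()           # coordinates of the nonzero entries
col_max = mat.max(axis=0).toarray()  # per-column maximum, shape (1, 3)
scaled = mat * 5                     # scalar multiplication
squared = mat.dot(mat)               # matrix product of mat with itself

print(as_csc.format, as_float.dtype)
print(list(zip(rows.tolist(), cols.tolist())))
print(col_max)
print(squared.toarray())
```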

## 1.4 Saving and loading sparse matrices

Saving with save_npz

``````
scipy.sparse.save_npz('sparse_matrix.npz', sparse_matrix)
``````

``````
# Read back from the npz file
sparse_matrix = scipy.sparse.load_npz('sparse_matrix.npz')
``````
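A quick way to convince yourself that nothing is lost is to save a matrix, load it again and compare. The sketch below does that in a temporary directory; the matrix contents and the file name are just examples.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Example matrix: the 5 x 5 identity with one extra off-diagonal entry
dense = np.eye(5)
dense[0, 4] = 7.0
original = sparse.csr_matrix(dense)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sparse_matrix.npz")
    sparse.save_npz(path, original)
    restored = sparse.load_npz(path)

same_format = restored.format == original.format
same_values = (restored != original).nnz == 0  # no entry differs

print(same_format, same_values)
```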

Storage size comparison

``````
import numpy as np
from scipy import sparse

a = np.arange(100000).reshape(1000,100)
a[10: 300] = 0
b = sparse.csr_matrix(a)
# Save the sparse matrix to a compressed npz file
sparse.save_npz('b_compressed.npz', b, compressed=True) # file size: 100KB
# Save the sparse matrix to an uncompressed npz file
sparse.save_npz('b_uncompressed.npz', b, compressed=False) # file size: 560KB
# Save the dense array to a regular npy file
np.save('a.npy', a) # file size: 391KB
# Save the dense array to a compressed npz file
np.savez_compressed('a_compressed.npz', a=a) # file size: 97KB
``````

# 2 pandas.sparse

Sparse data structures

## 2.1 SparseArray

``````
In [1]: arr = np.random.randn(10)
In [2]: arr[2:-2] = np.nan
In [3]: ts = pd.Series(pd.arrays.SparseArray(arr))
In [4]: ts
Out[4]:
0    0.469112
1   -0.282863
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -0.861849
9   -2.104569
dtype: Sparse[float64, nan]
``````

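In pandas, sparsity is expressed as a dtype, for example `dtype: Sparse[float64, nan]`. The second part of the dtype is the fill value, that is, the value that is not stored explicitly.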
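The effect of the sparse dtype can be checked with a small sketch like the one below, which stores the same values once as a sparse Series and once as a dense one. The array length and the choice of 0.0 as fill value are assumptions made for the example.

```python
import numpy as np
import pandas as pd

values = np.zeros(1000)
values[::100] = 1.0  # 10 nonzero entries

# fill_value is the value that is NOT stored explicitly
sparse_series = pd.Series(pd.arrays.SparseArray(values, fill_value=0.0))
dense_series = pd.Series(values)

density = sparse_series.sparse.density  # fraction of explicitly stored values
sparse_bytes = sparse_series.memory_usage(index=False)
dense_bytes = dense_series.memory_usage(index=False)

print(sparse_series.dtype)
print(density)
print(sparse_bytes, dense_bytes)
```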

## 2.2 Creating a sparse DataFrame

Earlier pandas versions had `pd.SparseDataFrame()`, but it has been removed in newer versions. The pandas documentation says:

SparseSeries and SparseDataFrame were removed in pandas 1.0.0. This migration guide is present to aid in migrating from previous versions.

The first way:

``````
# Previous way
>>> pd.SparseDataFrame({"A": [0, 1]})
# New way
In [31]: pd.DataFrame({"A": pd.arrays.SparseArray([0, 1])})
Out[31]:
   A
0  0
1  1
``````

The SparseDataFrame.default_kind and SparseDataFrame.default_fill_value attributes have no replacement.

The second way, starting from a scipy sparse matrix:

``````
# Previous way
>>> from scipy import sparse
>>> mat = sparse.eye(3)
>>> df = pd.SparseDataFrame(mat, columns=['A', 'B', 'C'])
# New way
In [32]: from scipy import sparse
In [33]: mat = sparse.eye(3)
In [34]: df = pd.DataFrame.sparse.from_spmatrix(mat, columns=['A', 'B', 'C'])
In [35]: df.dtypes
Out[35]:
A    Sparse[float64, 0]
B    Sparse[float64, 0]
C    Sparse[float64, 0]
dtype: object
``````

The third way is to convert an existing dense DataFrame with a sparse dtype:

``````
In [38]: dense = pd.DataFrame({"A": [1, 0, 0, 1]})
In [39]: dtype = pd.SparseDtype(int, fill_value=0)
In [40]: dense.astype(dtype)
Out[40]:
   A
0  1
1  0
2  0
3  1
``````

## 2.3 Format conversion

``````
# sparse DataFrame -> dense DataFrame
In [36]: df.sparse.to_dense()
Out[36]:
     A    B    C
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0
# sparse DataFrame -> scipy.sparse COO matrix
In [37]: df.sparse.to_coo()
Out[37]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
        with 3 stored elements in COOrdinate format>
``````

## 2.4 Properties of sparse matrices

Sparse-specific properties, like density, are available on the .sparse accessor.

``````
In [41]: df.sparse.density
Out[41]: 0.3333333333333333
``````
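The density is simply the share of explicitly stored values among all cells. The sketch below recomputes it by hand for the same 3 x 3 identity matrix used earlier and compares it with the accessor.

```python
import pandas as pd
from scipy import sparse

mat = sparse.eye(3)
df = pd.DataFrame.sparse.from_spmatrix(mat, columns=["A", "B", "C"])

stored = df.sparse.to_coo().nnz    # explicitly stored values
total = df.shape[0] * df.shape[1]  # all cells
manual_density = stored / total

print(manual_density, df.sparse.density)
```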

## 2.5 Converting between scipy.sparse and pandas.sparse

From scipy to pandas, use `pd.DataFrame.sparse.from_spmatrix`:

``````
In [47]: from scipy.sparse import csr_matrix
In [48]: arr = np.random.random(size=(1000, 5))
In [49]: arr[arr < .9] = 0
In [50]: sp_arr = csr_matrix(arr)
In [51]: sp_arr
Out[51]:
<1000x5 sparse matrix of type '<class 'numpy.float64'>'
        with 517 stored elements in Compressed Sparse Row format>
In [52]: sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)
In [53]: sdf.head()
Out[53]:
          0    1    2         3    4
0  0.956380  0.0  0.0  0.000000  0.0
1  0.000000  0.0  0.0  0.000000  0.0
2  0.000000  0.0  0.0  0.000000  0.0
3  0.000000  0.0  0.0  0.000000  0.0
4  0.999552  0.0  0.0  0.956153  0.0
In [54]: sdf.dtypes
Out[54]:
0    Sparse[float64, 0]
1    Sparse[float64, 0]
2    Sparse[float64, 0]
3    Sparse[float64, 0]
4    Sparse[float64, 0]
dtype: object
``````

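From pandas to scipy, a sparse Series with a MultiIndex (`ss` here, with index levels A, B, C and D) can be converted with `Series.sparse.to_coo`: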

``````
In [61]: A, rows, columns = ss.sparse.to_coo(row_levels=['A', 'B'],
   ....:                                     column_levels=['C', 'D'],
   ....:                                     sort_labels=True)
   ....:
In [62]: A
Out[62]:
<3x4 sparse matrix of type '<class 'numpy.float64'>'
        with 3 stored elements in COOrdinate format>
In [63]: A.todense()
Out[63]:
matrix([[0., 0., 1., 3.],
        [3., 0., 0., 0.],
        [0., 0., 0., 0.]])
In [64]: rows
Out[64]: [(1, 1), (1, 2), (2, 1)]
In [65]: columns
Out[65]: [('a', 0), ('a', 1), ('b', 0), ('b', 1)]
``````
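The transcript above does not show how `ss` was built. A self-contained sketch that reproduces the same matrix could look like this; the Series values and index tuples are my assumption, chosen so that the result matches the output shown.

```python
import numpy as np
import pandas as pd

# Assumed input: a Series with a four-level MultiIndex (levels A, B, C, D)
s = pd.Series([3.0, np.nan, 1.0, 3.0, np.nan, np.nan])
s.index = pd.MultiIndex.from_tuples(
    [
        (1, 2, "a", 0),
        (1, 2, "a", 1),
        (1, 1, "b", 0),
        (1, 1, "b", 1),
        (2, 1, "b", 0),
        (2, 1, "b", 1),
    ],
    names=["A", "B", "C", "D"],
)
ss = s.astype("Sparse")  # NaN is the fill value, so only 3 values are stored

# Levels A and B become the rows, levels C and D become the columns
A, rows, columns = ss.sparse.to_coo(
    row_levels=["A", "B"], column_levels=["C", "D"], sort_labels=True
)
print(A.toarray())
print(rows)
print(columns)
```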

# 3 sklearn

In general a scipy.sparse matrix can be used directly, for example passed to `train_test_split`.
If a `pandas.sparse` DataFrame does not work, converting it to a dense DataFrame with `x = x.sparse.to_dense()` should do:

``````
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split

def build_and_split(data, row, col, row_index, max_col, target_list):  # hypothetical wrapper around the original function body
    fea_datasets = csr_matrix((data, (row, col)), shape=(row_index, max_col+1)).toarray()
    # When the feature dimension is too large, use the line below instead (it works with or without toarray()); memory is much less likely to blow up
    #fea_datasets = csr_matrix((data, (row, col)), shape=(row_index, max_col+1))
    x_train, x_test, y_train, y_test = train_test_split(fea_datasets, target_list, test_size=0.2, random_state=0)
    return x_train, x_test, y_train, y_test
``````

As far as I can tell, scipy's csr_matrix format is generally supported for training sklearn models.
A `pandas.sparse` DataFrame may raise an error, in which case it needs to be converted to a regular `dataframe`.
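If densifying is too expensive, one alternative worth trying is to go from the sparse DataFrame back to a scipy CSR matrix and train on that. The sketch below does this with LogisticRegression; the random data, the labels and the choice of model are all made up for illustration and are not from the original workflow.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: 200 samples, 20 features, about 80% zeros
rng = np.random.default_rng(0)
dense = rng.random((200, 20))
dense[dense < 0.8] = 0.0
y = (dense[:, 0] > 0).astype(int)  # hypothetical labels

sdf = pd.DataFrame.sparse.from_spmatrix(sparse.csr_matrix(dense))

# Convert the sparse DataFrame back to CSR without densifying it
X = sdf.sparse.to_coo().tocsr()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LogisticRegression().fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Not every estimator accepts sparse input, so it is worth checking the documentation of the model you plan to use.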