Linear_Algebra_final.pdf

From Vectors to Recommendations: How Linear
Algebra Drives Personalization
Rohit Anand
April 2023
1 Introduction
Recommendation systems have become a crucial component of many businesses,
from e-commerce websites to streaming services. These systems use algorithms
to analyze user data and recommend products, services, or content that users are
likely to be interested in. Linear Algebra, a branch of mathematics that deals
with vector spaces and linear transformations, has several practical implications
in recommendation systems. In this blog post, we’ll explore some of the ways
that Linear Algebra is used in recommendation systems.
2 Basics of Linear Algebra
Scalars: A scalar is a single value or quantity that represents a specific mea-
surement or quantity. In other words, it’s a quantity that has only a magnitude,
or size, and no direction. For example, we might say ”Let x ∈ R be the solution
for a given equation” while defining the real-valued scalars, or ”Let n ∈ N be
the number of units,” while defining the natural number scalar.
Vectors: A vector is an array of numbers. Formally, a vector is an ordered
list of numbers, called its components, which can be written as a column or row
matrix. For example,
x=




x1
x2
:
xn




The first element of x is x1, the second element is x2, and so on. We also
need to say what kind of numbers are stored in the vector. If each element is
in R, and the vector has n elements, then the vector lies in the set formed by
taking the Cartesian product of R n times, denoted as Rn
. We can think of
vectors as identifying points in space, with each element giving the coordinate
along a different axis.
1

Matrices: A 2-D array of numbers, where each element is identified by two
indices instead of just one. If a real-valued matrix A has a height of m and
width of n, then A ϵ Rn
. We can add matrices to each other as long as they
have the same shape just by adding their corrosponsings elements C = A + B
where Ci,j = Ai,j + Bi,j.
A =


A1,1 A1,2
A2,1 A2,2
A3,1 A3,2

 ⇒ AT
=

A1,1 A2,1 A3,1
A1,1 A2,2 A3,2

AT
is called the transpose of the matrix and can be thought of as a mirror image
across the main diagonal.
Why take Transpose?
Taking the transpose of a matrix is an important operation in linear algebra
with several applications.
1. Solving systems of linear equations: In many cases, we need to solve a
system of linear equations, which can be represented as a matrix equation Ax =
b, where A is a matrix of coefficients, x is the vector of unknowns, and b is the
vector of constants. In order to solve for x, we may need to take the transpose
of A to ensure that the matrix multiplication is valid.
2. Orthogonal matrices: An orthogonal matrix is a matrix whose trans-
pose is also its inverse. Orthogonal matrices have several important properties
that make them useful in many applications, such as preserving lengths and
angles.
3. Eigenvalues and eigenvectors: The eigenvalues and eigenvectors of a
matrix are important in many areas of science and engineering. The transpose
of a matrix A has the same eigenvalues as A, but its eigenvectors may be differ-
ent. Taking the transpose can help simplify the calculation of eigenvalues and
eigenvectors.
4. Matrix operations: The transpose is useful for several matrix opera-
tions, such as computing the dot product, finding the determinant, and solving
linear systems. In some cases, taking the transpose of a matrix can simplify the
computation of these operations.
Tensors:A tensor is a mathematical object that extends the concept of scalars,
vectors, and matrices to higher dimensions. Tensors are used to represent and
manipulate multilinear relationships between sets of algebraic objects, and they
are widely used in many areas of physics and engineering, such as relativity,
electromagnetism, fluid dynamics, and elasticity.
2

Relationships between Matrices, vectors and scalars
Vectors can be thought of as matrices that contain only one column. The trans-
pose of a vector is therefore matrix with only one row. Scalar can be thought of
as a matrix with only a single entry, which means scalar is its own transpose: a
= aT
. Scalar can be added or multiplied to the matrix simply just by perform-
ing that operation on each element: D = a.B + c where a and c are scalar.
Multiplying Matrices:- Multiplying matrices involves taking the dot product
of the rows of the first matrix with the columns of the second matrix. The re-
sulting matrix is obtained by combining these dot products in the appropriate
positions. Suppose we have two matrices, A of size m x n and B of size n x p.
To compute the matrix product C = AB, we multiply each row of A by each
column of B. The product operation is defined by
Ci,j =
P
Ai,kBk,j
This is called element wise product or Hadamard product and denoted by A
⊙B
The dot product between two vectors x and y of the same dimensionality is the
matrix product xt
y
Matrix-vector product To define multiplication between a matrix A and
a vector x (i.e., the matrix-vector product), we need to view the vector as a
column matrix. We define the matrix-vector product only for the case when the
number of columns in A equals the number of rows in x . So, if A is an m×n
matrix (i.e., with n columns), then the product Ax is defined for n×1 column
vectors x . If we let Ax=b , then b is an m×1 column vector. In other words,
the number of rows in A (which can be anything) determines the number of
rows in the product b.
Ax=





a11 a12 . . . a1n
a21 a22 . . . a2n
.
.
.
.
.
.
...
.
.
.
am1 am2 . . . amn










x1
x2
.
.
.
xn





=





a11x1 + a12x2 + · · · + a1nxn
a21x1 + a22x2 + · · · + a2nxn
.
.
.
am1x1 + am2x2 + · · · + amnxn





.
Properties of matrix multiplications
• The commutative property of multiplication does not hold! : AB ̸= BA,
however the dot product between two vectors is commutative i.e xT
y =
yT
x.
• Associative property of multiplication: (AB)C=A(BC)
• Distributive properties: A(B+C)=AB+AC, (B+C)A=BA+CA
• Multiplicative identity property: IA=AI, A, equals, A and AI=A
3

• Multiplicative property of zero: OA=O, AO = O
• Dimension property : The product of an m×n matrix and n×k is an m×k
matrix
Identity and Invese matrix A system of linear equation can be represented
as Ax = b, where A ∈ Rm×n
is known as matrix, b ∈ Rm
is known as vector
and x ∈ Rn
is a vector of unknown variables which we would like to solve.
Linear algebra provides tool called matrix inversion to analytically solve this
equation.
An identity matrix is a matrix that does not change any vector when we multiply
that vector by that matrix. In is an identiy matrix that preserves n-dimensional
vectors. The structure of the identity matrix is simple: all of the entries along
the main diagonal are 1, while all of the other entries are zero.


1 0 0
0 1 0
0 0 1


The matrix inverse of A is denoted asA−1
, and it is defined as the matrix such
that A−1
A = In
Now solving Ax=b equation as follows:
Ax = b
A−1
Ax = A−1
b
Inx = A−1
b
x = A−1
b
This is solvavble only if A−1
exists.
When A−1
exists ?
I. The matrix must be square (same number of rows and columns).
II. The determinant of the matrix must not be zero (determinants are covered
later).
A square matrix that has an inverse is called invertible or non-singular.A matrix
that does not have an inverse is called Singular.
A matrix does not have to have an inverse, but if it does, the inverse is unique.
Linear independence and dependence
a set of vectors are said to be linearly independent if you cannot form any vec-
tor in the set using any combination of the other vectors in the set. If a set of
vectors does not have this quality – that is, a vector in the set can be formed
from some combination of others – then the set is said to be linearly dependent.
Given a set of vectors, the span of the set of vectors are all of the vectors that
can be “constructed” by taking linear combinations of vectors in that set
4

Span(S) := {
Pn
i=1 cixi | c1, . . . , cn ∈ R}
Intuitively, you can think of S as a set of “building blocks” and the Span(S) as
the set of all vectors that can be “constructed” from the building blocks in S.
Given a vector space, and a set of vectors S:=x1,x2,. . . ,xn, S is called linearly
independent if for each vector xi ∈ S , it holds that xi /
∈ Span(S/xi) or in simple
terms, A set of vectors are linearly independent if you cannot form any of the
vectors in the set using a linear combination of any of the other vectors.
Determining whether Ax = b has a solution thus amounts to testing whether
b ∈ Rm
is in the span of the columns of A. This particular span is known as
the column space range or the range of A. So if any point in Rm
is excluded
from the column space, that point is a potential value of b that has no solution.
Hence A must have at least m columns, i.e., n ≥ m. Otherwise, the dimension-
ality of the column space would be less than m. For example, consider a 3 × 2
matrix. The target b is 3-D, but x is only 2-D, so modifying the value of x at
best allows us to trace out a 2-D plane within R3
. The equation has a solution
if and only if lies on that plane.
n ≥ m is only a necessary condition but not the sufficient condition because it
is possible for some of the columns to be redundant. Consider a 2 × 2 matrix
where both of the columns are identical. This has the same column space as
a 2 × 1 matrix containing only one copy of the replicated column. In other
words, the column space is still just a line, and fails to encompass all of R2 ,
even though there are two columns. This kind of redundancy is known as linear
dependence.
This means that for the column space of the matrix to encompass all of Rm
, the matrix must contain at least one set of m linearly independent columns.
This condition is both necessary and sufficient for equation to have a solution
for every value of b.
Norms
The norm of any vector x, measures the distance from the origin to point x.In
machine learning, we usually measure the size of vectors using a function called
a norm Which function are Norm function ? Norm is an function that satisfies
following properties:
• f(x) = 0 ⇒ x = 0
• f(x + y) ≤ f(x) + f(y)
• ∀α ∈ R, f(αx) = α f(x)
Lp
norm is given by
∥ x∥p =
P
xi
p1
p
Lp
is know as euclidian norm where p=2, the squared L2
norm is more
convenient to work because it is efficient computationally and mathematically.
5

Forebnius Norms
The Frobenius norm is matrix norm of an m×n matrix ’A’ defined as the square
root of the sum of the absolute squares of its elements,mathematically
∥ A∥F =
qP
Ai,j
2
Also the dot product of two vectors can be rewritten in terms of norms
xT
y =∥ x∥2 ∥ y∥2cos (θ)
where θ is the angle between x and y
Why use Norms?
Optimization:- Regularization penalties in optimization problems to prevent
overfitting and improve generalization performance. Example The L2
norm is
commonly used in least-squares problems norm is commonly used as a regu-
larization term to prevent overfitting in models, while the L1
norm is used in
sparse optimization problems.
Distance metrics:-Norms can be used to define distance metrics between vec-
tors or data points in machine learning. The L2
norm, also known as the
Euclidean distance, is a common distance metric used in clustering and classi-
fication algorithms.
Model complexity::- Norms can be used to measure the complexity of machine
learning models. The Frobenius norm, which measures the size of a matrix, is
commonly used to measure the complexity of deep neural network models.
Loss functions:- Norms can be used as loss functions to measure the error
or loss of a machine learning model. The hinge loss function, which is a type of
norm, is commonly used in support vector machines (SVMs) for classification
tasks
Sparsity:- Norms can be used to promote sparsity in machine learning models.
The L1 norm, also known as the Lasso penalty, is commonly used to create
sparse models that have many zero weights.
Eigendecomposition
It is the process in which we decompose a matrix into a set of eigenvectors
and eigenvalues.
An eigenvector of a square matrix A is a non zero vector υ such that multi-
plication by A alters only the scale of υ:
Aυ = λυ
6

λ represents eigen value corrosponding to eigen vector. If υ is an eigen vector
of A and rescaled to sυ for s ∈ R and s ̸= 0 then it still has the same eigen
value. Hence we only look for unit eigen vectors. The eigendecomposition of A
is given by :
A = Vdiag(λ)V−1
Where matrix V with one eigen vectors per columns and λ is concatenation of
eigen velues to become vector.
What is the practical importance of Eigendecompostion ?
Eigen decomposition, also known as spectral decomposition,it has a wide range
of applications in various fields. :
• Dimensionality reduction: Eigen decomposition can be used to reduce
the dimensionality of a dataset by identifying the most important direc-
tions, or eigenvectors, of the data. This technique is commonly used in
principal component analysis (PCA), a popular method for reducing the
dimensionality of high-dimensional data.
• Linear transformations: Eigen decomposition can be used to decom-
pose a linear transformation into its eigenvectors and eigenvalues. This
technique is commonly used in computer graphics, where it is used to ro-
tate and scale images.
• Signal processing: Eigen decomposition can be used in signal processing
to extract features from signals. For example, in image processing, eigen
decomposition can be used to extract features such as edges and textures
from images.
• Machine learning: Eigen decomposition can be used in machine learn-
ing for tasks such as clustering, dimensionality reduction, and feature
extraction. For example, in clustering, eigen decomposition can be used
to cluster data points based on their similarity in the eigenspace.
• Matrix diagonalization: Eigen decomposition can be used to diago-
nalize a matrix, which is useful for solving systems of linear equations,
computing matrix exponentials, and computing matrix powers.
Singular Value decomposition: This is another way to factorise the matrix
into singular vectors and singular values. Also every real matrix has the singular
value decomposition, but same not true for eigen decomposition. The equation
for SVD look like this
7

A = UDVT
where A is an m×n, U is m×m, D to be m×n and V be an n×n. Each of these
matrix is defined to have the special structure. The matrix V and U defined
to be orthogonal matrix and D defined to be a diagonal matrix. The elements
along the diagonal of D are known as the singular values of the matrix A. The
columns of U are known as the left-singular vectors. The columns of V are
known as as the right-singular vectors.
SVD in terms of eigendecomposition
he left-singular vectors of A are the eigenvectors of AAT
. The right-singular
vectors of A are the eigenvectors of AT
A. The non-zero singular values of A are
the square roots of the eigenvalues of AT
A.
Trace Operator:-
It gives the sum of all diagonal element of a matrix.
Tr(A) =
P
Ai,j
Since the trace of an operator remains invariant under a change of basis,for ex-
ample it is invariant to transpose operator hence it becomes easy to manipulate.
Also, the trace of a square matrix composed of many factors is also invariant to
moving the last factor into the first position.
Determinant
The determinant is equal to the product of all the eigenvalues of the matrix. The
absolute value of the determinant can be thought of as a measure of how much
multiplication by the matrix expands or contracts space. If the determinant
is 0, then space is contracted completely along at least one dimension, causing
it to lose all of its volume. If the determinant is 1, then the transformation
preserves volume
3 Applications of Linear Algebra in Real world
Linear algebra has numerous real-world applications in various fields, including:
Computer Graphics: Linear algebra is used extensively in computer graphics
for tasks such as image processing, computer vision, 3D modeling, and anima-
tion.
Machine Learning: Linear algebra is a foundational tool in machine learning
for tasks such as data preprocessing, feature engineering, dimensionality reduc-
tion, and model optimization.
Cryptography: Linear algebra is used in cryptography for tasks such as en-
cryption, decryption, and code breaking.
8

Physics: Linear algebra is used in physics to solve problems related to quantum
mechanics, electromagnetism, and fluid dynamics.
Engineering: Linear algebra is used in engineering for tasks such as system
modeling and control, signal processing, and optimization.
Economics: Linear algebra is used in economics for tasks such as game theory,
optimization, and modeling of economic systems.
Operations Research: Linear algebra is used in operations research for tasks
such as optimization, decision making, and simulation.
Chemistry: Linear algebra is used in chemistry for tasks such as molecular
modeling, quantum chemistry, and chemical kinetics.
Biology: Linear algebra is used in biology for tasks such as protein struc-
ture prediction, gene expression analysis, and population genetics.
Now, lets move to see how Linear Algebra actual used in Recommendation
system.
4 Linear Algebra and Recommendation System
Linear algebra plays a fundamental role in building recommendation systems,
as it provides powerful techniques for modeling and analyzing large datasets
of user-item interactions. By representing user-item interactions as matrices,
we can apply linear algebraic techniques such as matrix factorization, singular
value decomposition (SVD), and non-negative matrix factorization (NMF) to
extract latent factors that capture the underlying structure of the data. These
techniques can then be used to make personalized recommendations for users
based on their past behavior and preferences.
For example, SVD can be used to decompose a user-item matrix into lower-
dimensional representations that capture the most important patterns of user-
item interactions. By projecting users and items onto these lower-dimensional
representations, we can estimate how much a user is likely to like a particular
item. NMF, on the other hand, can be used to decompose a user-item ma-
trix into non-negative basis vectors that can be used to represent both users
and items. By comparing these basis vectors, we can identify similar users and
items and make personalized recommendations based on their past behavior.
Linear algebra also provides powerful tools for dealing with missing data and
handling large, sparse matrices. For example, iterative algorithms such as alter-
nating least squares (ALS) can be used to factorize large, sparse matrices and
estimate missing values.
Overall, linear algebra provides a powerful framework for modeling and analyz-
9

ing user-item interactions and building effective recommendation systems that
can provide personalized recommendations to users based on their past behavior
and preferences.
Here are some common algorithms used in recommendation systems that have
the application of linear algebra:
Singular Value Decomposition (SVD):
As discussed above, SVD is a matrix factorization technique used to reduce
the dimensionality of a user-item rating matrix. It decomposes the matrix into
three matrices, and the user and item factors are derived using linear alge-
bra.SVD works by decomposing a matrix into three matrices:
• The first matrix represents the user-item ratings in the form of a m x n
matrix, where m is the number of users and n is the number of items.
• The second matrix represents the user factors in the form of a m x k
matrix, where k is the number of latent factors we want to extract.
• The third matrix represents the item factors in the form of a k x n matrix.
The user-item rating matrix is approximated as the product of the user fac-
tors and the item factors. Specifically, the predicted rating for user i and item
j is given by the dot product of the i-th row of the user factors matrix and the
j-th column of the item factors matrix.
To apply SVD to a recommendation system, we start by representing the user-
item ratings as a matrix. We then apply SVD to this matrix to extract the user
and item factors. The number of latent factors k is typically chosen to be much
smaller than the number of users and items to reduce the dimensionality of the
data.
Once we have the user and item factors, we can use them to make personalized
recommendations to users. For example, we can recommend items to a user
based on the items that have high predicted ratings for that user.
One important consideration when using SVD for recommendation systems is
how to handle missing data in the user-item rating matrix. One approach is to
use matrix completion techniques to fill in the missing values before applying
SVD. Another approach is to use regularized SVD, which adds a penalty term to
the SVD objective function to encourage sparsity in the user and item factors.
Let us see the example as well
Suppose we have a user-item rating matrix with 5 users and 4 items, as shown
below in Table 1:
This matrix has missing values, which represent items that users have not yet
rated. To apply SVD to this matrix, we first fill in the missing values using
a matrix completion technique such as Alternating Least Squares (ALS). The
resulting filled-in matrix might look like this in Table 2
10

Item 1 Item 2 Item 3 Item 4
1 3 4 5
2 1 3 4
3 2 4 5
4 4 5 3
5 2 3 4 2
Table 1: Example user-item matrix
1 3 4 5 3.8
2 1 2.5 3 4
3 2.3 2 4 5
4 4 5 3.9 3
5 2 3 4 2
Table 2: Example user-item matrix with ratings
We can then apply SVD to this matrix to extract the user and item factors.
Suppose we choose to extract 2 latent factors. The SVD decomposition of the
filled-in matrix might look like this:
R = U * S * VT
where:
• R matrix represents the filled-in user-item rating matrix, where each row
represents a user and each column represents an item. The values in the
matrix represent the ratings that the users have given to the items. If a
user has not rated an item, the corresponding value is represented by an
empty element.
• U is the user factors matrix (5 x 2) which represents the user factors ma-
trix, where each row represents a user and each column represents a latent
factor. The values in this matrix represent how much each user is associ-
ated with each latent factor.
• S is the diagonal matrix of singular values (2 x 2) which represents the
diagonal matrix of singular values, where each element on the diagonal
represents the strength of the corresponding latent factor.
• VT
is the transpose of the item factors matrix (2 x 4) which represents
the transpose of the item factors matrix, where each row represents a la-
tent factor and each column represents an item. The values in this matrix
11

represent how much each item is associated with each latent factor.
We can then use the user and item factors to make recommendations to users.
For example, suppose we want to recommend items to user 1. We can compute
the predicted rating for user 1 and each item using the dot product of the first
row of the user factors matrix and each column of the item factors matrix:
We take the first row of the U matrix, which represents the first user’s associa-
tions with the latent factors:
U0,: = [u11, u12]
We take the transpose of the VT
matrix, which represents the item associations
with the latent factors:
(VT
)T
= [[v11, v12, v13, v14], [v21, v22, v23, v24]]
We take the dot product of the first row of U and the transpose of VT
:
We can interpret each element of the resulting vector as the predicted rating
that the first user would give to each of the items. For example, the first element
u11 ∗v11 +u12 ∗v21 represents the predicted rating that the first user would give
to the first item. Similarly, the second element u11 ∗ v12 + u12 ∗ v22 represents
the predicted rating that the first user would give to the second item, and so
on.
This yields the following predicted ratings for user 1:
Based on these predicted ratings, we might recommend item 3 to user 1, as it
1 2.95 3.78 4.83 3.88
Table 3: predicted ratings for user 1
has the highest predicted rating.
Alternating Least Squares
In collaborative filtering, matrix factorization is the state-of-the-art solution
for sparse data problem. What is matrix factorization? Matrix factorization is
simply a family of mathematical operations for matrices in linear algebra. To
be specific, a matrix factorization is a factorization of a matrix into a product of
matrices. In the case of collaborative filtering, matrix factorization algorithms
work by decomposing the user-item interaction matrix into the product of two
lower dimensionality rectangular matrices. One matrix can be seen as the user
matrix where rows represent users and columns are latent factors. The other
matrix is the item matrix where rows are latent factors and columns represent
items. How does matrix factorization solve our problems?
12

1. Model learns to factorize rating matrix into user and movie representa-
tions, which allows model to predict better personalized movie ratings for
users
2. With matrix factorization, less-known movies can have rich latent repre-
sentations as much as popular movies have, which improves recommender’s
ability to recommend less-known movies
f
rui =
Pnfactors
f=0 Hu,f Wf,i
Rating of item i given by user u can be expressed as a dot product of the user’s
latent vector and the item’s latent vector. Latent factors are the features in
the lower dimension latent space projected from user-item interaction matrix.
The idea behind matrix factorization is to use latent factors to represent user
preferences or items in a much lower dimension space. Matrix factorization is
one of the very effective dimension-reduction techniques in machine learning.
The objective of matrix factorization is to minimize the error between true
rating and predicted rating:
argminH,W ∥ R − e
R∥F + α ∥ H ∥ +β ∥ W ∥
We can use funkSVD to complete the training process of the Matrix Factor-
ization Algorithm, only problem with this approach is that it’s not scalable as
the amount of data grows today. With terabytes or even petabytes of data,
it’s impossible to load data with such size into a single machine. So we need
a machine learning model (or framework) that can train on dataset spreading
across from cluster of machines. Hence Alternating Least Square (ALS) is also
a matrix factorization algorithm and it runs itself in a parallel fashion. ALS
is implemented in Apache Spark ML and built for a larges-scale collaborative
filtering problems. ALS is doing a pretty good job at solving scalability and
sparseness of the Ratings data, and it’s simple and scales well to very large
datasets. Some high-level ideas behind ALS are:
• Its objective function is slightly different than Funk SVD: ALS uses L2
regularization while Funk uses L1 regularization
• Its training routine is different: ALS minimizes two loss functions alter-
natively; It first holds user matrix fixed and runs gradient descent with
item matrix; then it holds item matrix fixed and runs gradient descent
with user matrix
• Its scalability: ALS runs its gradient descent in parallel across multiple
partitions of the underlying training data from a cluster of machines
So let say we have the objective function which look like this
minX,Y
P
(ru,i − xu
T
yi)2
+ λ(
P
∥ xu ∥2
+
P
∥ yi ∥2
)
13

where X is user’s matrix , Y is item’s matrix and R ≈ XT
Y. Notice that this
objective is non-convex (because of the XT
Yterm); in fact it’s NP-hard to opti-
mize. Gradient descent can be used as an approximate approach here, however
it turns out to be slow and costs lots of iterations. Note however, that if we
fix the set of variables X and treat them as constants, then the objective is a
convex function of Y and vice versa. Our approach will therefore be to fix Y
and optimize X, then fix X and optimize Y , and repeat until convergence. This
approach is known as ALS(Alternating Least Squares).Lets see the algorithm
as well:
Initialize X,Y
repeat
for u = 1 ... n do
xu = (
P
yiyi
T
+ λIk)−1
P
ru,iyi
end for
for i = 1 ... m do
yi = (
P
xuxu
T
+ λIk)−1
P
ru,ixu
end for
until convergence
The output of the algorithm is the factorized matrices X and Y that can be
used to predict missing ratings.The first is to do what was discussed before,
which is to simply predict ru,ixT
u yi for each user u and item i. This approach
will cost O(nmk) if we’d like to estimate every user-item pair. However,this
approach is prohibitively expensive for most real-world datasets. A second (and
more holistic) approach is to use the xu and yi as features in another learning
algorithm, incorporating these features with others that are relevant to the pre-
diction task. There are also several other way to distribute the computation of
ALS algorithms like using method of join or method of broadcast. There are
also a concept called Fast ALS which can be used here to decrease the compu-
tation cost.
Let see with simple example how can this algorithm be used. Suppose we have
a user-item matrix with 4 users and 5 items:
Item1 Item2 Item3 Item4
User1 5 ? ? 1
User2 ? 2 ? 5
User3 1 ? 4 ?
User4 ? 3 1 4
Table 4: User-Item Matrix
We want to predict the missing ratings (denoted by ?) so we can make per-
sonalized recommendations. To do this, we use ALS to factorize the user-item
14

matrix into two low-rank matrices: a user matrix and an item matrix.
The user matrix has a row for each user and k columns where k is the number
of latent factors we want to use. Each element in the matrix represents the
strength of the association between the user and the corresponding latent fac-
tor.
The item matrix has a row for each item and k columns. Each element in the
matrix represents the strength of the association between the item and the cor-
responding latent factor.
We initialize the user and item matrices with random values and then alternate
between fixing the user matrix and optimizing the item matrix and fixing the
item matrix and optimizing the user matrix. We repeat this process until the
error between the predicted and actual ratings is minimized.
We initialize the user and item matrices with random values
After initializing the user and item matrices with random values, we iterate
through a fixed number of epochs. In each epoch, we update the user and item
matrices alternatively while keeping the other matrix constant. We update the
user matrix by solving a least squares problem using the current values of the
item matrix and the ratings matrix. We update the item matrix in a similar
way using the current values of the user matrix and the ratings matrix.
The update rules for user matrix and item matrix are as follows:
For each user u:
Solve the following least squares problem for the user vector
- pu: minpu
P
i∈Ru
(ru,i − pT
u qi)2
+ λpu
2
For each item i:
Solve the following least squares problem for the item vector
- qi: minqi
P
u∈Ri
(ru,i − pT
u qi)2
+ λqi
2
Here, λ is the regularization parameter which controls overfitting.
After we have updated the user and item matrices for all epochs, we can use the
learned matrices to predict the ratings for new user-item pairs. The predicted
rating for user u and item i is given by pT
u qi.
Non Negative Matrix Factorization
Non-negative matrix factorization (NMF) is a popular technique for recommen-
dation systems. The basic idea behind NMF is to factorize a user-item rating
matrix into two non-negative matrices, one that represents the user preferences
for each item and another that represents the item features. By doing so, we can
obtain a low-dimensional representation of the data that can be used for rec-
ommendation. The user-item rating matrix typically has missing values since
not all users rate all items. NMF is a matrix completion technique that can
deal with missing values in the input matrix. It has been shown that NMF
can perform well even when the input matrix is highly sparse. The NMF al-
gorithm finds the non-negative matrices that minimize the reconstruction error
15

between the original matrix and its approximation obtained by multiplying the
two factor matrices. This is achieved by minimizing the Frobenius norm of the
difference between the original matrix and its approximation.
In the context of recommendation systems, the user-item rating matrix is typ-
ically large and sparse, and the factor matrices are of much lower dimension.
The factorization can be interpreted as a form of dimensionality reduction that
captures the underlying latent factors that determine user preferences and item
features. The factor matrices can be used to make recommendations for new
items that a user has not yet rated. This is done by computing the dot product
of the user feature vector with the item feature vectors and recommending the
items with the highest dot products. One of the advantages of NMF over other
matrix factorization techniques is that it produces non-negative factor matri-
ces, which can be interpreted as additive combinations of positive features. This
makes the resulting recommendations more interpretable and intuitive. Overall,
NMF is a powerful and flexible technique for recommendation systems that can
handle large and sparse user-item rating matrices, and produce interpretable
recommendations based on non-negative factor matrices.
Lets assume we want to develop movie recommendation system we have
user-movie rating matrix, which is a 2D matrix with dimensions (number of
users) x (number of movies) and Each entry in the matrix represents the rating
given by a user to a movie, on a scale from 1 to 5.
X=






5 3 0 1
4 0 0 1
1 1 0 5
1 0 0 4
0 1 5 4






Then we initialize NMF algorithm with a specified number of components:In the
context of NMF, ”components” refer to the latent factors that the algorithm
tries to discover in the input matrix. These components are represented as
non-negative vectors in the factor matrices, where each element in the vector
corresponds to a feature of the item or a preference of the user.
The number of components specified during the initialization of the NMF
algorithm is a hyperparameter that determines the dimensionality of the result-
ing factor matrices. In other words, it specifies how many latent factors should
be used to represent the input matrix. Let’s initialize the NMF algorithm with
10 components.
Factorize the user-movie rating matrix into non-negative user features and non-
negative movie features:
• The NMF algorithm aims to factorize the user-movie rating matrix into
two matrices: a matrix of non-negative user features and a matrix of non-
negative movie features.
16

• The user features matrix has dimensions (number of users) x (number of
components).
• The movie features matrix has dimensions (number of components) x
(number of movies).
• The NMF algorithm aims to minimize the error between the original user-
movie rating matrix and the reconstructed matrix, which is the product
of the user features matrix and the movie features matrix.
• The NMF algorithm uses an iterative optimization algorithm to find the
values of the user features and movie features that minimize the recon-
struction error, subject to the non-negativity constraints.
There are several algorithms that can be used to factorize a user-movie rating
matrix into non-negative user features and non-negative movie features. Some
of the most popular algorithms are:
Multiplicative Update Algorithm: This is a widely used iterative algorithm for
NMF that updates the factor matrices using multiplicative updates based on
the gradient of the Frobenius norm.
Alternating Least Squares (ALS): This is another iterative algorithm that al-
ternates between fixing one factor matrix and updating the other using least
squares optimization.
Gradient Descent: This algorithm updates the factor matrices using gradient
descent optimization based on the gradient of the reconstruction error.
Bayesian Non-negative Matrix Factorization (BNMF): This is a probabilistic
model that uses Bayesian inference to estimate the posterior distribution over
the factor matrices.
We apply the NMF algorithm to factorize the user-movie rating matrix X
into two matrices: a matrix of non-negative user features W and a matrix of
non-negative movie features H:
X ≈ WH
The user features matrix W has dimensions (number of users) x (number of
components):
W=






0.00 0.33 0.00 1.06 0.08 0.22 0.30 0.35 0.00 0.40
0.00 0.21 0.10 0.77 0.00 0.23 0.41 0.26 0.00 0.24
0.11 0.08 0.00 0.01 0.63 1.28 0.00 0.01 0.84 0.00
0.07 0.03 0.00 0.01 0.47 0.79 0.00 0.00 0.58 0.00
1.24 0.33 0.00 1.60 0.00 0.68 0.00 0.00 0.00 0.00






The movie features matrix H has dimensions (number of components) x (num-
ber of movies):
17

H=
















2.87 1.83 0.00 0.36
1.12 0.00 0.00 0.62
0.00 1.08 3.23 3.03
0.43 0.00 2.29 2.23
1.54 0.00 1.36 1.31
0.00 1.57 0.92 0.97
0.65 0.73 0.00 0.00
0.00 0.83 0.96 0.89
1.17 0.00 0.37 0.36
0.00 0.71 1.00 0.96
















Now lets generate personalized movie recommendations for a user Suppose we
want to generate movie recommendations for user 1. We first retrieve the user
features vector for user 1 from the user features matrix. This vector has di-
mensions (number of components), and represents the extent to which user 1
exhibits each of the identified user features.
We then calculate the predicted rating for each movie by taking the dot product
of the user features vector with the corresponding column of the movie features
matrix. This gives us a vector of predicted ratings for all movies, with dimen-
sions (number of movies). We then select the top 10 movies with the highest
predicted ratings and return their titles. Overall, the NMF algorithm is able to
identify underlying latent factors in the user-movie rating matrix that capture
the preferences and characteristics of both the users and the movies. By factor-
ing the matrix into non-negative user and movie features, the algorithm is able
to generate more accurate and personalized recommendations for users.
What is the importance of ”Non-Negative” Matrix factorization here:
• Interpretability: In non-negative matrix factorization, the resulting fea-
tures are all non-negative, which can be more interpretable than tra-
ditional matrix factorization methods that allow negative values. Non-
negative features can be more easily interpreted as representing different
aspects or characteristics of the users and items.
• Sparsity: Non-negative matrix factorization can handle sparse data bet-
ter than traditional matrix factorization methods. This is because the
non-negativity constraint encourages the algorithm to learn sparse repre-
sentations of the data, meaning that only a small number of features are
used to represent each user or item.
• Robustness to outliers: The non-negativity constraint can also make the
algorithm more robust to outliers in the data, as it can prevent the algo-
rithm from assigning negative weights to these outliers.
• Better performance: In some cases, non-negative matrix factorization can
outperform traditional matrix factorization methods in terms of prediction
accuracy, especially when the data is highly sparse and the non-negativity
constraint is appropriate for the problem at hand.
18

PCA
PCA stands for Principal Component Analysis. It is a statistical technique used
to reduce the dimensionality of data while retaining as much of the original in-
formation as possible. In other words, it helps to find a smaller set of variables,
called principal components, that explain most of the variance in the original
data.
PCA is helpful in recommendation systems because it can be used to reduce
the dimensionality of the user-item interaction matrix. The user-item interac-
tion matrix is a sparse matrix that contains information about the ratings or
preferences of users for different items. However, this matrix is typically very
large and high-dimensional, making it difficult to compute recommendations
efficiently.
By applying PCA to the user-item interaction matrix, we can reduce its dimen-
sionality by projecting the original data onto a lower-dimensional space, while
still preserving the important information about user-item interactions. This
lower-dimensional representation of the data can then be used to compute rec-
ommendations more efficiently.
PCA can also help to address the problem of sparsity in the user-item interac-
tion matrix. Sparse matrices can lead to inaccurate recommendations because
they lack sufficient information about user-item interactions. By reducing the
dimensionality of the matrix, PCA can help to densify the data and reduce the
impact of sparsity on the recommendations.
Lets see high-level algorithm for performing principal component analysis (PCA):-
Input: A dataset of user-item interactions, where each row represents a user
and each column represents an item, and the cells contain the rating or feedback
given by the user for that item. The dataset can be represented as a matrix
X of size n x m, where n is the number of users and m is the number of items.e.g;
X=




5 3 0 1 4
1 0 5 4 3
0 3 4 0 0
4 0 0 3 1




Calculate the mean of each column of the matrix to obtain the
item averages:
mean(xj) = 1
n
Pn
i=1 Xij
For the above matrix it will look like this:
mean(xj) = [2.5, 1.5, 2.25, 2, 2]
Subtract the item averages from each data point to center the data:
X’ = x’1, x′
2, ..., x′
n,
where x’i = xi−mean(x)∀i = 1, . . , n
19

X
′
=




2.5 1.5 −2.25 −1 2
−1.5 −1.5 2.75 2 1
−2.5 0.5 1.75 −2 −2
1.5 −1.5 −2 1 −1




Compute the covariance matrix:
Cov(X′
) = 1
n−1 X′
X′T
Cov(X
′
)
=




3.75 −2.25 −1.5 1
−2.25 3.5 1 0
−1.5 1 3.5 0
1 0 0 2.5




Compute the eigenvectors and eigenvalues of the covariance matrix:
eigvals, eigvecs = eig(Cov(X’)),
eigvals is a vector of m eigenvalues and eigvecs is a m-by-m matrix of eigenvectors.
eigvals = [5.17, 2.08, 2.08, 0.42]
eigvecs =




−0.529 −0.704 0.388 −0.254
0.043 −0.342 −0.791 −0.515
0.643 −0.191 −0.203 0.716
0.553 0.599 0.432 0.379




The eigenvectors are sorted in descending order of eigenvalue, so we choose
the first two eigenvectors to form the principal components of the data. We
choose k=2 eigenvectors, so we obtain a matrix Vk of size 5x2, containing the
first two eigenvectors as columns:
Vk =






−0.529 0.388
0.043 −0.791
0.643 −0.203
0.553 0.432
−0.704 −0.254






Now we project the centered data onto the k-dimensional space spanned by the
selected eigenvectors:
Xpca = X′
∗ Vk
where Vk is the matrix of the k selected eigenvectors.
20

Xpca = X′
Vk =




−1.17 −1.68
2.53 −0.89
1.05 2.15
−2.41 0.42




The resulting matrix Xpca is of size 4x2, where each row represents a user and
each column represents a principal component.
Finally, we can use the projected data Xpca to make recommendations. For
example, we can compute the cosine similarity between the projected data of a
target user and the projected data of all other users. We can then recommend
items that similar users have rated highly but the target user has not yet rated.
We can also use the transformed dataset Xpca as input to a recommendation
algorithm, such as collaborative filtering or matrix factorization, to predict rat-
ings or recommend items to users..
Content Based Filtering
Content-based filtering uses item features to recommend other items similar
to what the user likes, based on their previous actions or explicit feedback.
How content-based filtering different from collaborative filtering?
Content-based filtering focuses on the attributes of the items being recom-
mended and the preferences of the users. It analyzes the textual or descriptive
features of the items and tries to recommend items that are similar to the items
a user has already shown interest in. For example, if a user has previously
purchased a book on cooking, a content-based recommendation system would
recommend other books on cooking, based on similarities in the attributes of
the books.
On the other hand, collaborative filtering focuses on the behavior of other users
in the system to generate recommendations. It analyzes the patterns of user-
item interactions in the system and identifies users with similar preferences. It
then recommends items that similar users have liked in the past. For example,
if a user has previously liked a particular movie, a collaborative filtering system
would recommend other movies that users with similar preferences have liked.
The key difference between content-based filtering and collaborative filtering is
the source of information used to generate recommendations. Content-based
filtering relies on the attributes of the items being recommended, while collab-
orative filtering relies on the behavior of other users in the system.
Both techniques have their strengths and weaknesses. Content-based filtering is
good at recommending niche or unique items but may struggle to recommend
items that are dissimilar to a user’s previous choices. Collaborative filtering, on
the other hand, can recommend items based on the preferences of other users
but may struggle to recommend items that have not been previously rated or
reviewed by users in the system.
Now imagine you are a game developer who wants to create a personalized
21

game recommendation system for your users. The goal is to recommend games
to each user based on their preferences and playing history. One approach to
this problem is content-based filtering.
• First, we would create a user profile based on the games they have played
and enjoyed in the past. The user profile would consist of TF-IDF weighted
vectors for each game genre.
• Then, we would compute the similarity scores between the user profile
and all the games in the system using a similarity measure such as cosine
similarity.
• Next, we would select the games with the highest similarity scores as rec-
ommendations for the user.
Game Genres
Game1 Action, Adventure
Game2 Simulation, Strategy
Game3 Action, Adventure
Game4 Action, RPG
Game5 Sports
For example, the above table shows Games and Genres and let’s say a user
has played and enjoyed the following games in the past:
Game 1: Action, Adventure
Game 2: Simulation, Strategy
Game 3: Action, Adventure
Game 4: Action, RPG
We can represent this information in a table as follows:
Next, we can compute the TF-IDF weights for each genre as follows: TF-IDF
Game Genre1 Genre2 Genre3 Genre4 Genre5
Game1 1 1 0 0 0
Game2 0 0 1 1 0
Game3 1 1 0 0 0
Game4 1 0 0 0 1
for Action = ((1/4) * log(4/3)) + ((1/4) * log(4/3)) + ((1/4) * log(4/3)) +
((1/4) * log(4/4)) = 0.693
TF-IDF for Adventure = ((1/4) * log(4/2)) + ((0/4) * log(4/1)) + ((1/4) *
log(4/2)) + ((0/4) * log(4/4)) = 0.433
TF-IDF for Simulation = ((0/4) * log(4/1)) + ((1/4) * log(4/1)) + ((0/4) *
log(4/1)) + ((0/4) * log(4/4)) = 0.306
TF-IDF for Strategy = ((0/4) * log(4/1)) + ((1/4) * log(4/1)) + ((0/4) *
22

log(4/1)) + ((0/4) * log(4/4)) = 0.306
TF-IDF for RPG = ((0/4) * log(4/1)) + ((0/4) * log(4/1)) + ((0/4) * log(4/1))
+ ((1/4) * log(4/4)) = 0.0
What is TF-IDF?
TF-IDF (term frequency-inverse document frequency) is a commonly used method
in natural language processing and information retrieval to quantify the impor-
tance of a term in a document or a corpus of documents. It is based on the idea
that the more frequent a term appears in a document, the more important it
is to that document, but at the same time, the more frequent it appears in the
entire corpus, the less important it is in distinguishing between documents.
In content-based filtering, TF-IDF is used to represent the content of the items
(e.g., movies, books, articles) in a vector space model, where each term corre-
sponds to a dimension and the weight of the term is given by its TF-IDF score.
The vector space model allows us to compute the similarity between items based
on their content. Items that have similar content (i.e., similar TF-IDF vectors)
are considered more similar to each other and are more likely to be recommended
to users who have shown interest in similar items in the past.
Now we have the TF-IDF weights for each genre, which can be used to represent
the user’s preferences. We can normalise this vector using Euclidian Norma
then the normalized user preference vector is
[0.706, 0.441, 0.312, 0.312, 0.0]
Now, we can calculate the cosine similarity between the user preference vector
and the TF-IDF weighted genre vectors for each game as follows:
What is Cosine Similarity?
Cosine similarity is a measure of similarity between two non-zero vectors of an
inner product space. It is the cosine of the angle between the two vectors, which
gives a value between -1 and 1. A value of 1 indicates that the two vectors are
identical, 0 indicates that they are orthogonal (i.e., have no correlation), and -1
indicates that they are diametrically opposed. Cosine similarity is commonly
used in recommendation systems to compare the similarity of two items or two
users based on their ratings or preferences.
Cosine Similarity = Sc(A, B) : cos (θ) = A.B
∥A∥∥B∥
Hence following this formula we can calculate as follow
For Game1:
cosine similarity = (0.706 * 0.707) + (0.441 * 0.707) + (0.312 * 0) + (0.312 *
0) + (0 * 0) / ((0.7062
+ 0.4412
+ 0.3122
+ 0.3122
+ 02
)(
1/2) ∗ (0.7072
+ 0.7072
+
02
+ 02
+ 02
)(
1/2))
cosine similarity = 0.574
23

For Game2:
cosine similarity = (0.706 * 0) + (0.441 * 0) + (0.312 * 0.5) + (0.312 * 0.5) + (0 *
0) / ((0.7062
+0.4412
+0.3122
+0.3122
+02
)(
1/2)∗(02
+02
+0.52
+0.52
+02
)(
1/2))
For Game3:
cosine similarity = (0.706 * 0.707) + (0.441 * 0.707) + (0.312 * 0) + (0.312 *
0) + (0 * 0) / ((0.7062
+ 0.4412
+ 0.3122
+ 0.3122
+ 02
)(
1/2) ∗ (0.7072
+ 0.7072
+
02
+ 02
+ 02
)(
1/2))
For Game4:
cosine similarity = (0.706 * 0) + (0.441 * 0) + (0.312 * 0.5) + (0.312 * 0) + (0
* 0.866) / ((0.7062
+ 0.4412
+ 0.3122
+ 0.3122
+ 02
)(
1/2) ∗ (02
+ 02
+ 0.52
+ 02
+
0.8662
)(
1/2))
Therefore, the cosine similarity between the user preference vector and Game1
and Game3 is the highest, which means they are the most similar to the user’s
preferences and are the recommended games. In this way we can generate rec-
ommendations which is basically ranking items based on their similarity scores
and recommend the top N items to the user. This can be done using techniques
such as sorting or machine learning algorithms like regression or clustering.
Furhter we can evaluate the system’s performance by measuring metrics such as
accuracy, precision, recall, and F1-score. Also refining the system by incorporat-
ing user feedback, improving the feature extraction and similarity calculation
techniques, and experimenting with different recommendation algorithms can
be a way to build robust recommendation system.
There are still many algorithms where linear algebra plays a crucial role:
Matrix Completion:Matrix completion is a technique used to fill in miss-
ing entries in a user-item rating matrix. It relies on linear algebra to estimate
the missing values by solving a low-rank matrix completion problem
Factorization Machines
actorization Machines are a type of model that use linear algebra to extract
latent factors from a feature matrix. They are commonly used in recommenda-
tion systems to predict user-item ratings based on the user and item features.
Latent Dirichlet Allocation(LDA)
LDA is a topic modeling technique that is used to identify latent topics in a
corpus of documents. It can be applied to user-item rating matrices to identify
24

latent topics in user preferences.
Graph-based Methods
Graph-based methods are used to represent user-item interactions as a graph
and then use graph-based algorithms to make recommendations. Linear algebra
is used to compute graph properties such as eigenvectors and eigenspaces.
Multi-Armed Bandit
Multi-armed bandit algorithms are used to optimize the tradeoff between ex-
ploration and exploitation in recommendation systems. Linear algebra is used
to model the relationship between user preferences and the expected reward of
recommending a particular item.
5 Conclusion
Linear algebra is a fundamental mathematical tool that is widely used in recom-
mendation systems. Through techniques such as Singular Value Decomposition
(SVD), Alternating Least Squares (ALS), Non-Negative Matrix Factorization
(NMF), and Principal Component Analysis (PCA), recommendation systems
are able to extract latent features from large datasets to make personalized rec-
ommendations for users.
SVD is particularly useful for handling missing values in datasets, while ALS is
well-suited for handling large and sparse datasets. NMF and PCA are effective
in extracting relevant information from text and image data, respectively.
As we saw, Linear algebra is also used in content-based filtering algorithms,
which recommend items based on their similarity to items that a user has pre-
viously shown interest in. Similarity scores between items are computed using
linear algebra techniques such as cosine similarity or Euclidean distance.
Furthermore, linear algebra allows for the optimization of objective functions,
such as minimizing the difference between actual and predicted ratings, in rec-
ommendation algorithms.
Hence, by leveraging the power of linear algebra, recommendation systems are
able to provide personalized, high-quality recommendations to users across a
wide range of applications.
6 Reference
https://mathinsight.org/matrixvectormultiplication
https : //www.khanacademy.org/math/precalculus/x9e81a4f98389efdf : matrices/x9e81a4f98389efdf :
properties−of−matrix−multiplication/a/properties−of−matrix−multiplication
25

https : //people.richland.edu/james/lecture/m116/matrices/inverses.html : :
text = Requirements
https : //mbernste.github.io/posts/linearindependence/
https : //mathworld.wolfram.com/FrobeniusNorm.html : : text = The
https : //physics.stackexchange.com/questions/137158/what−information−
does − the − trace − of − a − matrix − give
https : //towardsdatascience.com/prototyping −a−recommender −system−
step − by − step − part − 2 − alternating − least − square − als − matrix −
4a76c58714a1
http : //stanford.edu/ rezab/classes/cme323/S15/notes/lec14.pdf
https : //en.wikipedia.org/wiki/Non−negativematrixf actorization : : text =
Non
https : //en.wikipedia.org/wiki/Cosinesimilarity : : text = Cosine
26

Linear_Algebra_final.pdf

Recomendados

Recomendados

Más contenido relacionado

Similar a Linear_Algebra_final.pdf

Similar a Linear_Algebra_final.pdf (20)

Último

Último (20)

Linear_Algebra_final.pdf