Book Recommendations on Amazon

Introduction

Project Type: Recommendation System

Problem: Recommendation systems on online shopping platforms can increase sales by showing users the products they are likely to be interested in. This project builds an example recommendation system based on the preferences of other users, an approach called collaborative filtering.

Data Description

The dataset has 4 columns and 22,507,155 rows of book ratings from Amazon. It contains the user id, book id, rating, and timestamp.

import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("Dataset/amazon_book_rating.csv",header=None)
new_header = ['user','book_id','rating','timestamp']
df.columns = new_header
df

Content of Dataset

Find details about the dataset

print('Total Data')
print('-'*50)
print('Total Number of Ratings: ',df.shape[0])
print('Total Number of User: ',len(np.unique(df.user)))
print('Total Number of Books: ',len(np.unique(df.book_id)))

Output:
Total Data
--------------------------------------------------
Total Number of Ratings:  22507155
Total Number of User:  8026324
Total Number of Books:  2330066

Plot a bar chart to get a better picture of the rating distribution

num = sns.countplot(x='rating',data=df,palette='summer')
num.set(xlabel='Rating',ylabel='Total number of ratings')

More than 60% of the ratings are 5.0, while the remainder is spread across the other four rating values.
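The share of each rating value can also be read off numerically instead of from the chart. The snippet below is a minimal sketch on a small invented data frame (the name `toy` and its values are assumptions for illustration, not the Amazon data); the same call would apply to `df['rating']`.

```python
import pandas as pd

# Toy ratings, invented for illustration only (not the Amazon data).
toy = pd.DataFrame({"rating": [5.0, 5.0, 5.0, 4.0, 3.0, 5.0, 1.0, 4.0, 5.0, 2.0]})

# normalize=True returns fractions of the total instead of raw counts.
rating_share = toy["rating"].value_counts(normalize=True).sort_index()

# Show the share of each rating value as a percentage.
print((rating_share * 100).round(1))
```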

Data Preparation

Of the 4 columns provided, the timestamp column is not used in this project, so it can be discarded.

df.drop(['timestamp'],axis=1,inplace=True)

Then, calculate the average rating and the number of ratings received by each book.

books_ratings = pd.DataFrame(df.groupby(by='book_id')['rating'].mean())
books_ratings['num_of_ratings'] = df.groupby(by='book_id')['rating'].count()
books_ratings
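The same two statistics can be computed in a single pass with named aggregation. This is a sketch on invented toy data (the frame `toy` and the name `toy_stats` are assumptions for illustration), not a replacement for the code above.

```python
import pandas as pd

# Toy ratings, invented for illustration only.
toy = pd.DataFrame({
    "user": ["u1", "u2", "u3", "u1", "u2", "u3"],
    "book_id": ["A", "A", "A", "B", "B", "C"],
    "rating": [5.0, 4.0, 3.0, 2.0, 4.0, 5.0],
})

# One groupby, two aggregations: the mean rating and the number of ratings per book.
toy_stats = toy.groupby("book_id")["rating"].agg(rating="mean", num_of_ratings="count")
print(toy_stats)
```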

The result shows that there are 2,330,066 books in the record; the most-rated book has 21,398 ratings and the least-rated has 1. Because there are too many books and the gap between the most and least rated books is so wide, the dataset will be filtered to books with more than 1,000 ratings.

filter_by_ratings = books_ratings.loc[books_ratings['num_of_ratings'] > 1000].sort_values(by='num_of_ratings',ascending=False)
filter_by_ratings

This leaves 903 books, which will be used to build the recommendation system. Let’s create a new data frame that contains only the filtered book ids.

filtered_book_id = filter_by_ratings.index.tolist()

df_new = df[df['book_id'].isin(filtered_book_id)]
df_new

This illustrates one of the disadvantages of collaborative filtering: a product can only be recommended once it has received enough ratings.
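To see how much data such a threshold discards, the filter can be replayed on a toy frame. The sketch below uses invented data and a hypothetical threshold `min_ratings = 2` (the project itself uses 1000); it keeps only books with more than `min_ratings` ratings and reports the fraction of ratings that survive.

```python
import pandas as pd

# Toy ratings, invented for illustration only.
toy = pd.DataFrame({
    "book_id": ["A", "A", "A", "B", "B", "C"],
    "rating": [5, 4, 3, 2, 4, 5],
})

min_ratings = 2  # hypothetical threshold for the toy data

# Count ratings per book and keep the ids above the threshold.
counts = toy["book_id"].value_counts()
kept_ids = counts[counts > min_ratings].index

# Keep only the rows that belong to the surviving books.
toy_new = toy[toy["book_id"].isin(kept_ids)]
kept_fraction = len(toy_new) / len(toy)
print(list(kept_ids), kept_fraction)
```

Books B and C drop out here even though they have ratings, which is the same effect described above on a small scale.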

Create a pivot table that holds the rating each user gave to each book. Where a user has not rated a book, the score is set to 0.

book_mat = df_new.pivot_table(index='user',columns='book_id',values='rating').fillna(0)
book_mat
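The shape of the resulting matrix is easier to see on a tiny example. The following sketch uses invented data (`toy` and `toy_mat` are assumed names): users become rows, books become columns, and missing ratings are filled with 0, so a 0 here means "not rated" rather than a low score.

```python
import pandas as pd

# Toy ratings, invented for illustration only.
toy = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u3"],
    "book_id": ["A", "B", "A", "B"],
    "rating": [5.0, 3.0, 4.0, 2.0],
})

# Rows are users, columns are books; unrated cells become 0.
toy_mat = toy.pivot_table(index="user", columns="book_id", values="rating").fillna(0)
print(toy_mat)
```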

rated_book_id = '1630350168'

rated_book_ratings = book_mat[rated_book_id]
rated_book_ratings

Ratings given by each user to the book with id 1630350168

Item-based without Machine Learning

similar_book_rated = book_mat.corrwith(rated_book_ratings)
df_book_corr = pd.DataFrame(similar_book_rated,columns=['correlation'])
df_book_corr = df_book_corr.join(books_ratings['num_of_ratings'])

book_recommendation = df_book_corr.sort_values(by='correlation', ascending=False).iloc[1:].head(10)

List of 10 book recommendations based on book id 1630350168
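The correlation step can be checked by hand on a small matrix. This is a sketch with invented ratings (`toy_mat`, `target` and `ranked` are assumed names). It drops the target book by label rather than by position, which is one possible alternative to `.iloc[1:]`.

```python
import pandas as pd

# Toy user-by-book matrix, invented for illustration only (0 = not rated).
toy_mat = pd.DataFrame(
    {"A": [5, 4, 0, 1], "B": [4, 5, 0, 2], "C": [0, 1, 5, 4]},
    index=["u1", "u2", "u3", "u4"],
    dtype=float,
)

# Pearson correlation of every book column with the ratings of book A.
target = toy_mat["A"]
corr = toy_mat.corrwith(target)

# Remove book A itself, then rank the rest from most to least similar.
ranked = corr.drop("A").sort_values(ascending=False)
print(ranked)
```

Book B is rated similarly to A by the same users and ranks first, while book C is rated in the opposite pattern and gets a negative correlation.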

Item-based by KNNWithMeans

import surprise
from surprise import KNNWithMeans
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import train_test_split
from surprise import accuracy

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df_new,reader)

Split data for training and testing.

trainset, testset = train_test_split(data, test_size=0.3)

Train a KNNWithMeans model with a k-value of 5.

model = KNNWithMeans(k=5,sim_options={'name': 'pearson_baseline', 'user_based': False})
model.fit(trainset)

Evaluate the model on the test set.

test_pred = model.test(testset)

print("Item-based Model : Test Set")
accuracy.rmse(test_pred, verbose=True)

Output:
Item-based Model : Test Set
RMSE: 1.1336
1.133583609139205
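RMSE is the square root of the mean squared difference between the true and the predicted ratings, so it is expressed in the same units as the ratings themselves. The sketch below computes it by hand on invented numbers (`true_r` and `pred_r` are assumed names and values, unrelated to the model above).

```python
import numpy as np

# Invented true and predicted ratings, for illustration only.
true_r = np.array([5.0, 3.0, 4.0, 1.0])
pred_r = np.array([4.0, 3.0, 5.0, 3.0])

# Square the errors, average them, then take the square root.
rmse = np.sqrt(np.mean((true_r - pred_r) ** 2))
print(rmse)
```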

book_mat_t = book_mat.T
book_mat_t

Transpose of pivot table book_mat

from sklearn.decomposition import TruncatedSVD

SVD = TruncatedSVD(n_components=10)
decomposed_matrix = SVD.fit_transform(book_mat_t)
decomposed_matrix.shape

correlation_matrix = np.corrcoef(decomposed_matrix)
correlation_matrix.shape
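The shapes involved in these two steps can be verified on a small random matrix. This is a sketch under assumed sizes (6 books, 20 users, 3 components; the data is random and the names are invented for illustration). Each book is reduced to a short vector of components, and `np.corrcoef` then compares books row by row across those components only.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Random toy matrix: 6 books (rows) x 20 users (columns), ratings 0 to 5.
rng = np.random.default_rng(0)
toy_items = rng.integers(0, 6, size=(6, 20)).astype(float)

# Reduce each book from 20 user columns to 3 latent components.
svd = TruncatedSVD(n_components=3, random_state=0)
reduced = svd.fit_transform(toy_items)

# corrcoef treats each row as one variable, giving a book-by-book matrix.
corr = np.corrcoef(reduced)
print(reduced.shape, corr.shape)
```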

rated_book_id = '1630350168'

Get index for book_id 1630350168

all_books = list(book_mat_t.index)
rated_book_index = all_books.index(rated_book_id)

Calculate the correlation of the book with other books

correlation_book = correlation_matrix[rated_book_index]

Keep the books whose correlation is greater than 0.7 and remove the rated book itself from the list

book_recommendation_2 = list(book_mat_t.index[correlation_book > 0.7])
book_recommendation_2.remove(rated_book_id)
book_recommendation_2

More than 10 books pass the filter, so choose 10 of them at random

import random

random.sample(book_recommendation_2, 10)

Output:
['014219672X','1461052149','1469982781'
,'061579615X','0989450236','1476763526'
,'1451533969','1455578312','1612186009'
,'1476755590']
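Random sampling gives a different list on every run. One possible alternative, shown here only as a sketch and not as the method used in this project, is to keep the books with the highest correlation. The ids and correlation values below are invented for illustration.

```python
import numpy as np

# Invented book ids and one row of a correlation matrix (the row for book "A").
book_ids = np.array(["A", "B", "C", "D", "E"])
corr_row = np.array([1.0, 0.95, 0.4, 0.8, 0.75])
target = "A"

# Keep books above the threshold, excluding the target book itself.
mask = (corr_row > 0.7) & (book_ids != target)

# Sort the survivors by correlation, highest first, and take the top 2.
order = np.argsort(corr_row[mask])[::-1]
top_n = book_ids[mask][order][:2].tolist()
print(top_n)
```

If random selection is preferred, calling `random.seed` with a fixed value before `random.sample` would make the result repeatable.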

Conclusion

A recommendation system based on machine learning (the KNNWithMeans model and the TruncatedSVD decomposition) appears more reliable than the approach without machine learning, judging by the correlation scores obtained. Without ML, the maximum correlation score is below 0.3, whereas with the TruncatedSVD-based correlation the recommendations can be filtered at a correlation above 0.7.