Group Members: Jason Li and Alex Motyka
For our project, we found two datasets we want to explore. The first is the Video Game Sales dataset, which records per-region sales figures. The second is a larger video game dataset with richer per-game information, such as ratings, playtime, achievements, and completion status.
With these two datasets, we want to understand some general trends in video game sales. For example, we want to understand how video game sales vary with time and location, and how different genres differ in sales and popularity. We also want to explore the relationship genre has with variables like sales, rating, popularity, completion status, etc.
We want to understand whether there is a relationship between Metacritic reviews and user reviews, and whether there is a relationship between the number of achievements in a game and users' completion status for that game. If there is a relationship, we also want to know if we could build a model to predict trends, and whether the relationship is affected by the genre of the game.
Regarding the validity of our data, our first dataset is a web scrape of VGChartz, an industry research firm that publishes video game hardware estimates every week and hosts an ever-expanding game database with over 55,000 titles listed, featuring up-to-date shipment information and legacy sales data. Our second dataset comes from the RAWG API; RAWG is the largest video game database and game discovery service.
This analysis is probably most interesting for individuals who want to make games, or who want a glimpse into the history behind video games. For those who want to make a video game, this analysis helps show which genres are popular, where they are popular, and how different types of games sell. A few relevant resources are listed below:
VGChartz.com - a website detailing video game stats and info, from which the first dataset was scraped
RAWG.io - a video game database API containing 500,000+ games across 50+ platforms. This API was used to obtain the second dataset
history.com/topics/inventions/history-of-video-games - this website can give further insight into why some trends in the data occurred the way they did
github.com/leomaurodesenv/game-datasets - this repository contains many more video-game-related datasets, offering much more in-depth information on different aspects of video games
We plan on meeting weekly via Zoom, with a GitHub repository as our main collaboration method. We may also use Google Colab, since it is what we have been using in class, but we will only move if the GitHub repository does not work out well. We are using GroupMe and texts to communicate for this project, and we are setting up a to-do list to efficiently assign tasks to each group member.
Each milestone starts with a meeting to discuss what needs to be done, and we split up the work. When each person finishes their part, the other person reviews it to make sure it is okay.
#run these if you are missing any of the packages below:
#%pip install pandas
#%pip install seaborn
#%pip install scikit-learn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%cd /content
!git clone https://github.com/21percentme/21percentme.github.io.git
%cd /content/21percentme.github.io
We chose this dataset because video games have gained popularity over the years, especially with the recent mobility of consoles. From here, it would be interesting to see how different video games have sold over the past years, which games are the most popular, and how sales trends correspond with real-life events.
This dataset contains a list of video games with sales greater than 100,000 copies. It was generated by a scrape of vgchartz.com and was uploaded by GregorySmith.
Fields include:
Rank - Ranking of overall sales
Name - The game's name
Platform - Platform of the game's release (e.g., PC, PS4)
Year - Year of the game's release
Genre - Genre of the game
Publisher - Publisher of the game
NA_Sales - Sales in North America (in millions)
EU_Sales - Sales in Europe (in millions)
JP_Sales - Sales in Japan (in millions)
Other_Sales - Sales in the rest of the world (in millions)
Global_Sales - Total worldwide sales (in millions)
#read the video game sales csv into a dataframe
game_sales = pd.read_csv("vgsales.csv")
#display first 5 rows to see what the data looks like
game_sales.head()
Right now we should start by cleaning the dataset up a bit, so let's check how many null entries there are.
#in here we sum up the number of null entries
game_sales.isnull().sum()
Looks like there are 271 null entries in Year, so we do not know when 271 of the games were released. Additionally, 58 games have an unknown/unreported publisher. This could skew our analysis, so let's drop the entries with those null values, making sure to check that they were dropped.
#here we drop all null values
games = game_sales.dropna()
#and we check to see if it worked below
games.isnull().sum()
Let's start looking into some basic analysis of the data! First, let's look at some basic summary stats.
#so here we see some basic summary stats
games.describe()
From here, we also look at each variable's data type.
#and call .info() to see the variable types
games.info()
Note that Platform, Genre, and Publisher are objects (strings). These variables are important to the questions we want to answer using this dataset, so we should consider turning these into individual binary columns based on the number of possible values.
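As a minimal sketch of what that could look like (nothing we are committed to yet), pandas' get_dummies expands a categorical column into binary indicator columns:
#hypothetical sketch: one-hot encode Platform and Genre into binary columns
#(Publisher likely has many more unique values, so encoding it would add many columns)
encoded = pd.get_dummies(games, columns=["Platform", "Genre"])
encoded.head()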
#total global sales per year
games.groupby('Year')['Global_Sales'].sum()
#the year with the highest total global sales
games.groupby('Year').Global_Sales.sum().idxmax()
#the year with the lowest total global sales
games.groupby('Year').Global_Sales.sum().idxmin()
#number of games in each genre
games.Genre.value_counts()
It is interesting to see that video games sold best in 2008 and worst in 2017, and that Action is the most common genre in the dataset.
Some questions we will investigate with this dataset:
1) How does video game popularity vary by region?
2) Were genres popular in North America similar to those popular in the EU and Japan?
3) How did the popularity of video game genres change over time?
4) Why did those changes happen?
5) Which year was the highest selling year for video games?
Because each row breaks each game's sales down by region, and also includes platform, genre, and year information, we will be able to answer the above questions with this dataset.
From here, we want to see how video game sales differ with year and location.
#rename columns for regions to be more intuitive
games = games.rename(columns={'Other_Sales': "Other", "JP_Sales": "Japan", "EU_Sales":"Europe", "NA_Sales":"North America"})
#group the global sales by year to see sales by year
years = games.groupby('Year')['Global_Sales'].sum()
#drop every column except for region, so we get sales per region
location = games.drop(columns=["Rank", "Year", "Global_Sales", "Platform", "Name", "Genre", "Publisher"]).sum()
#create subplots to plot two plots at the same time
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (18,10))
fig.suptitle('Video Game Sales Compared to Time and Region')
years.plot(kind = 'barh', color = sns.color_palette("pastel"), title = "Video Game Sales by Year", ax = ax1)
ax1.set_xlabel("Total Sales (Millions)")
ax1.set_ylabel("Year")
location.plot(kind = 'barh', color = sns.color_palette("pastel"), title = "Video Game Sales by Region", ax = ax2)
ax2.set_xlabel("Total Sales (Millions)")
ax2.set_ylabel("Region")
# fig.savefig('name.png')
# files.download('name.png')
This is very interesting. It seems 2008 had the most worldwide video game sales, whereas 2017 had very few (though the low 2017 figure may simply reflect the dataset being compiled before that year's sales were fully recorded). Compared to the other regions, North America has had far more sales over the last 20 or so years. This makes some sense, as North America is the largest market represented in the data.
Looking at the graph above more closely, there is a sudden spike in video game sales around 1996. Why did that occur? Let's look into the publishers that had the most sales in 1996.
#filter to games released in 1996, then total global sales per publisher
n96 = games[games.Year == 1996]
n96 = n96.groupby("Publisher")["Global_Sales"].sum().sort_values()
fig, (ax1) = plt.subplots(1, 1, figsize = (18,10))
fig.suptitle('Video Game Sales in 1996')
n96 = n96.sort_values(ascending=False)
n96= n96[:10]
n96.plot(kind = 'bar', color = sns.color_palette("pastel"), title = "Video Game Sales by Publisher", ax = ax1)
ax1.set_xlabel("Publisher")
ax1.set_ylabel("Total Sales (Millions)")
It looks like Nintendo blows all the other companies out of the water! Granted, Nintendo is one of the biggest game companies, but to outperform even Sony by this much, something must have happened. After a quick Google search, it looks like 1996 was the year the Nintendo 64 launched, and Pokémon Red and Blue released in Japan; with a new 3D console and hugely popular games, it makes sense that Nintendo was the top publisher that year.
#we know Nintendo was most popular, so now we see which Nintendo games sold the best
platform = games.loc[games.Publisher == "Nintendo"].copy()
platform.drop(columns=["Rank", 'Other', 'Europe', "North America", "Japan"], inplace=True)
#keep 1996 releases and sort by global sales so the best sellers come first
platform = platform[platform.Year == 1996].sort_values("Global_Sales", ascending=False)
#show the top rows for each platform
platform.groupby("Platform").head()
Our findings online are confirmed! Pokémon had the most sales by far, and many of the top sellers are N64 games.
From here, we were wondering if there were any differences in genre popularity in different regions.
#drop every column except for regions and genre, so we can see genre sales per region
grouping = games.drop(columns=["Rank", "Year", "Global_Sales", "Platform", "Name", "Publisher"])
genres = grouping.groupby("Genre").sum()
#Genre sales: JA = Japan, NA = North America, EU = Europe
JA = genres["Japan"]
NA = genres["North America"]
EU = genres["Europe"]
#create subplots to plot three plots at the same time
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize = (18,6))
fig.suptitle('Video Game Sales by Genre in Different Regions')
JA.plot(kind = 'bar', color = sns.color_palette("tab20"), title = "Video Game Sales by Genre in Japan", ax = ax1)
ax1.set_ylabel("Total Sales (Millions)")
ax1.set_xlabel("Genres")
NA.plot(kind = 'bar', color = sns.color_palette("tab20"), title = "Video Game Sales by Genre in North America", ax = ax2)
ax2.set_ylabel("Total Sales (Millions)")
ax2.set_xlabel("Genres")
EU.plot(kind = 'bar', color = sns.color_palette("tab20"), title = "Video Game Sales by Genre in Europe", ax = ax3)
ax3.set_ylabel("Total Sales (Millions)")
ax3.set_xlabel("Genres")
It is interesting to see that in both Europe and North America, Action is the most popular genre, with Sports close behind. However, in Japan, Role-Playing is the most popular by far. Different regions have different tastes!
This dataset contains 474,417 video games across 50+ platforms, including mobile. All game information was obtained with Python via the RAWG API. The dataset was last updated on Dec 22nd, 2020, and was made by Trung Hoang.
Fields include:
id: A unique ID identifying this game in the RAWG database
slug: A unique slug identifying this game in the RAWG database
name: Name of the game
metacritic: Rating of the game on Metacritic
released: The date the game was released
tba: To be announced state
updated: The date the game was last updated
website: Game Website
rating: Rating given by RAWG users
rating_top: Maximum rating
playtime: Hours needed to complete the game
achievements_count: Number of achievements in game
ratings_count: Number of RAWG users who rated the game
suggestions_count: Number of RAWG users who suggested the game
game_series_count: Number of games in the series
reviews_count: Number of RAWG users who reviewed the game
platforms: Platforms game was released on. Separated by ||
developers: Game developers. Separated by ||
genres: Game genres. Separated by ||
publishers: Game publishers. Separated by ||
esrb_rating: ESRB ratings
added_status_yet: Number of RAWG users who marked the game as "Not played"
added_status_owned: Number of RAWG users who marked the game as "Owned"
added_status_beaten: Number of RAWG users who marked the game as "Completed"
added_status_toplay: Number of RAWG users who marked the game as "To play"
added_status_dropped: Number of RAWG users who marked the game as "Played but not beaten"
added_status_playing: Number of RAWG users who marked the game as "Playing"
This dataset was interesting to us because it complements the sales data: it adds user ratings, playtime, achievement counts, and completion statuses for each game.
gameinfo = pd.read_csv("game_info.csv")
gameinfo.head()
Let's again see where the null values are.
gameinfo.isnull().sum()
gameinfo = gameinfo.drop(['slug', 'tba', 'updated', 'website', 'rating_top', 'developers', 'publishers', 'esrb_rating', 'reviews_count', 'suggestions_count', 'game_series_count'], axis=1)
gameinfo.head()
gameinfo.describe()
gameinfo.info()
Now we want to look into whether there is any relationship between sales by genre and user ratings by genre. This means we first need to merge the two datasets.
gameinfo_N = gameinfo.rename(columns={'name': "Name"})
gameinfo_N.head()
merged = games.merge(gameinfo_N, on=["Name"], how='inner')
merged.count()
It looks like we have almost 10,000 data points preserved after merging, which is more than enough to analyze! First, let's check whether the sales-by-genre trends are preserved, essentially recreating the earlier graph.
grouping = merged[["Genre", "Japan", "North America", "Europe"]]
genres = grouping.groupby("Genre").sum()
#Genre sales: JA = Japan, NA = North America, EU = Europe
JA = genres["Japan"]
NA = genres["North America"]
EU = genres["Europe"]
#create subplots to plot three plots at the same time
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize = (18,6))
fig.suptitle('Video Game Sales by Genre in Different Regions in the merged dataset')
JA.plot(kind = 'bar', color = sns.color_palette("tab20"), title = "Video Game Sales by Genre in Japan", ax = ax1)
ax1.set_ylabel("Total Sales (Millions)")
ax1.set_xlabel("Genres")
NA.plot(kind = 'bar', color = sns.color_palette("tab20b"), title = "Video Game Sales by Genre in North America", ax = ax2)
ax2.set_ylabel("Total Sales (Millions)")
ax2.set_xlabel("Genres")
EU.plot(kind = 'bar', color = sns.color_palette("tab20c"), title = "Video Game Sales by Genre in Europe", ax = ax3)
ax3.set_ylabel("Total Sales (Millions)")
ax3.set_xlabel("Genres")
It looks pretty much the same, which is a good sign. Now let's compare average rating per genre to sales per genre.
#mean rating and total global sales per genre
genre = merged[["Genre", "rating", "Global_Sales"]]
genre_rate = genre.groupby("Genre").mean()
genre_sale = genre.groupby("Genre").sum()
gsale = genre_sale.Global_Sales
grate = genre_rate.rating
from sklearn.preprocessing import MinMaxScaler
#because ratings are on a 1-5 scale and sales are in millions, we must rescale. We use MinMaxScaler so there are no negative values
scaler = MinMaxScaler()
#use the genre names from the grouped data as the index
index = gsale.index
#minmax scale data
grate_sc = scaler.fit_transform(grate.to_frame())
gsale_sc = scaler.fit_transform(gsale.to_frame())
#create new df
total = pd.DataFrame()
total['index'] = index
total['Global Sales Scaled'] = gsale_sc
total['Ratings Scaled'] = grate_sc
#now graph df
total_graph = total.set_index('index')
fig, (ax1) = plt.subplots(1, 1, figsize = (25,5))
total_graph.plot(kind = 'bar', title = "Video Game Sales and Ratings by Genre", ax = ax1, alpha = .5)
ax1.set_xlabel("Genre")
ax1.set_ylabel("Min-Max Scale")
It looks like there isn't really a relationship between global sales and ratings across genres. Puzzle is the lowest in both ratings and global sales, and Simulation and Racing have roughly proportional levels of global sales and ratings. Overall, however, the two trends do not match up.
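As a rough numeric check on this impression, we can rank-correlate the two genre-level series computed above (a quick sketch; Spearman correlation only compares orderings, which suits the scaled comparison):
#rank correlation between genre-level total sales and mean ratings
corr = gsale.corr(grate, method="spearman")
print("Spearman correlation between genre sales and ratings: " + str(round(corr, 2)))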
From here, we decided to look into if there was a relationship between the status of completion and genre.
#convert release dates to datetime, coercing bad values to NaT
gameinfo['released'] = pd.to_datetime(gameinfo['released'], errors='coerce')
#keep only rows with complete data (including Metacritic scores) for later use
metacritic_info = gameinfo.dropna()
metacritic_info.info()
#now we create a graph showing status of completion per genre
top = merged[['Genre','added_status_yet', 'added_status_beaten', 'added_status_toplay', 'added_status_dropped', 'added_status_playing']]
top = top.rename(columns = {'added_status_yet': 'Not Played',
'added_status_beaten': 'Beaten',
'added_status_toplay': 'To Play',
'added_status_dropped': 'Stopped Playing not Beaten',
'added_status_playing': 'Currently Playing'
})
#total number of statuses per genre across all categories
total = top.groupby('Genre').sum().sum(axis='columns')
#raw status counts per genre
single = top.groupby('Genre').sum()
#divide each genre's counts by its total so genres are comparable
thPlt = single.div(total, axis='index')
fig, (ax1, ax2) = plt.subplots(2, 1, figsize = (20,15))
single.plot(kind = 'bar', title = "Video Game Status without Scale", ax = ax2, alpha = .5)
thPlt.plot(kind = 'bar', title = "Video Game Status with Scale", ax = ax1,alpha = .5)
We thought that if genre predicts status, then, for example, a genre prone to not being beaten might not be the best type to pursue, or its games should not be made very long. There does not seem to be a general trend, but Sports has the highest ratio of played-but-not-beaten to total statuses. This makes sense, as a lot of sports games are played again and again, online or with friends, rather than "beaten."
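To back that reading with a number, here is a one-line check using the scaled frame computed above:
#genre with the highest share of "Stopped Playing not Beaten" statuses
thPlt['Stopped Playing not Beaten'].idxmax()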
In this model, we would like to use Genre to predict NOA (number of achievements) and playtime, which seems like a pretty complex task. Our current plan:
Split dataset 2 by Genre, if needed drop columns other than NOA and playtime
Given that some of the games may have many genres, we are dropping games with more than 3 genres, as too many genres makes genre-driven attributes less accurate (a sketch of this filter appears below).
Genre tends to influence many properties of a game. For example, sports games and role-playing games have different gameplay loops: a role-playing game could have a larger storyline, leading to a larger game, while the replayability of sports games could mean more achievements linked to that. As a result, we are very interested to see the relationships.
Use a KNN model to explore some questions of interest:
Given playtime and NOA, what Genre of game would it most likely be?
This model can also be expanded to include other attributes of the dataset, adding more depth and possibilities for predicting other attributes.
In this model, there is no single dependent or independent variable, as you can use one attribute to predict the other and vice versa, so it depends on your question. For the classification question above, playtime, NOA, and related attributes (sales, rating) are the features, and Genre is the target.
#the vertical bar is a regex metacharacter, so it must be escaped as \| when used as a delimiter
merged.count()
#create a new dataframe with just the columns we need
genreSplit = merged[['Genre','achievements_count', 'playtime', 'Global_Sales', 'rating', 'metacritic']]
genreSplit.playtime.value_counts()
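As mentioned in the plan above, we want to drop games tagged with too many genres. A minimal sketch of that filter, assuming the merged frame still carries the RAWG genres column (values separated by ||):
#count the RAWG genres per game by splitting on the || separator,
#then keep games tagged with at most 3 genres
genre_counts = merged["genres"].fillna("").apply(lambda s: len(s.split("||")) if s else 0)
merged_few_genres = merged[genre_counts <= 3]
merged_few_genres.shape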
colors = genreSplit["Genre"].map({
"Action": "darkred",
"Sports": "gold",
'Shooter': 'blue',
'Role-Playing': 'orange',
'Misc': 'purple',
'Platform': 'brown',
'Adventure': 'pink',
'Simulation': 'gray',
'Fighting': 'olive',
'Strategy': 'cyan',
'Racing': 'navy',
'Puzzle': 'red',
})
scatter = genreSplit.plot.scatter(
x="Global_Sales", y="rating", c=colors,
alpha=.3, xlim=(0, 40), ylim=(0, 5))
from sklearn.model_selection import train_test_split
X = genreSplit[['playtime', 'achievements_count', 'Global_Sales', 'rating']]
y = genreSplit["Genre"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score, accuracy_score
X_train_np = X_train.to_numpy()
X_test_np = X_test.to_numpy()
scaler = StandardScaler()
scaler.fit(X_train_np)
X_train_sc = scaler.transform(X_train_np)
scores = []
f1 = []
for k in range(1, 50):
# Train and fit the model
model = KNeighborsClassifier(n_neighbors=k)
model.fit(X_train_sc, y_train)
    # Scale the test set with the scaler fit on the training data
    X_test_sc = scaler.transform(X_test_np)
    y_test_pred = model.predict(X_test_sc)
    # Add accuracy score to a list
    scores.append(accuracy_score(y_test, y_test_pred))
    # Weighted F1 averages the per-class F1 scores, weighted by class frequency
    f1.append(f1_score(y_test, y_test_pred, average='weighted'))
best_acc = pd.Series(scores).max()
best_k = pd.Series(scores).idxmax() + 1
print("Max Accuracy is " + str(best_acc) + " at k = " + str(best_k))
#neighbor counts for the x-axis of the accuracy plot
x = np.arange(1, 50)
plt.plot(x,scores)
plt.xlabel("Neighbors")
plt.ylabel("Accuracy")
Not that impressive. Let's see how this compares to a random classifier:
from sklearn.dummy import DummyClassifier
random_scores = []
random_f1 = []
for k in range(1, 50):
    # Fit a classifier that predicts classes uniformly at random
    random_classifier = DummyClassifier(strategy="uniform")
    random_classifier.fit(X_train_sc, y_train)
    # Scale the test set with the scaler fit on the training data
    X_test_sc = scaler.transform(X_test_np)
    y_pred = random_classifier.predict(X_test_sc)
    # Add accuracy score to a list
    random_scores.append(accuracy_score(y_test, y_pred))
    # Weighted F1 averages the per-class F1 scores, weighted by class frequency
    random_f1.append(f1_score(y_test, y_pred, average='weighted'))
best_random = pd.Series(random_scores).max()
print("Max Accuracy is " + str(best_random))
plt.plot(x, scores)
plt.plot(x, random_scores, '-.')
plt.xlabel("Neighbors")
plt.ylabel("Accuracy")
plt.title('KNN vs. Random Classifier Accuracy')
plt.show()
As expected, the random classifier performs poorly; our KNN model beats it but is still unimpressive. Let's try something else.
Since there was little sign that sales, number of achievements, and reviews predict genre, we decided instead to try predicting how well a game would sell globally given its sales in North America and its rating.
sales = merged.drop(columns = ["Name", "Rank", "id", "genres"])
sales = sales.dropna()
sales.info()
from sklearn.neighbors import KNeighborsRegressor
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
features = ["North America", "rating"]
X_train_dict = sales[features].to_dict(orient="records")
y_train = sales['Global_Sales']
def get_cv_error(k):
    #10-fold cross-validated MAE for a KNN regressor with k neighbors
    vec = DictVectorizer(sparse=False)
model = KNeighborsRegressor(n_neighbors=k)
pipeline = Pipeline([("vectorizer", vec), ("fit", model)])
mae = (np.mean(-cross_val_score(
pipeline, X_train_dict, y_train,
cv=10, scoring="neg_mean_absolute_error"
)))
return mae
ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
test_errs = ks.apply(get_cv_error)
test_errs.plot.line()
test_errs.sort_values()
y_train.describe()
We get the lowest MAE at around k = 10, which is why we are choosing KNN with n = 10 as our best model. As we learned in class, MAE and RMSE measure how far off a "typical" prediction is; there are other metrics for measuring the quality of a regression model, but these are the most common. For n = 10, using cross-validation with cv = 10, we got an MAE of 0.565. Put in the context of a mean of 1.18, a typical guess would be off by about 0.565, which isn't great, but is definitely an improvement.
Additionally, we chose MAE over RMSE because RMSE is sensitive to outliers, of which our data has a few.
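As a quick sanity check on that choice, here is a sketch comparing MAE and RMSE for the chosen k = 10 model under the same 10-fold cross-validation; if the outliers matter, RMSE should come out noticeably larger than MAE:
#compare MAE and RMSE for the k = 10 model under the same 10-fold CV
vec = DictVectorizer(sparse=False)
model = KNeighborsRegressor(n_neighbors=10)
pipeline = Pipeline([("vectorizer", vec), ("fit", model)])
mae = -cross_val_score(pipeline, X_train_dict, y_train, cv=10,
                       scoring="neg_mean_absolute_error").mean()
rmse = np.sqrt(-cross_val_score(pipeline, X_train_dict, y_train, cv=10,
                                scoring="neg_mean_squared_error").mean())
print("MAE: " + str(round(mae, 3)) + ", RMSE: " + str(round(rmse, 3)))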