Data Exploration and Predictive Model of North American Videogame Sales using NumPy, Pandas, Matplotlib & Scikit-Learn

Emem Andy
7 min read · Jun 4, 2021

I’m in a graduate certificate program that not only teaches but challenges me positively. I took a Data Science Foundation class where we were asked to perform data analysis and build a predictive model in any area of our choosing, and I decided to go with videogame sales. This project was really exciting and I loved going through the videogame sales dataset, communicating with the data, and gaining insights that were beneficial in feature engineering for my predictive model. I am going to walk through my project in a way that gives you tips and tricks about functions and methods to use when cleaning and analyzing data. Let’s get started!!

First things first, before any data science project or any programming you have to know what libraries you’ll be working with and import them. For this project I imported Pandas, NumPy, Seaborn, Matplotlib and a few others, as shown in the image below.

You might be wondering why after each import line there’s an “as …”. Those are called aliases. Aliases shorten module names so that they are easier to use during the course of the task you’re working on. Imagine having to call NumPy functions by typing out the full library name every time; it’s easier to use the alias you define at import, so that the environment you’re working in knows which library you’re referring to when you call its functions. Over the course of this post, you’ll see what I mean about using aliases.

“%matplotlib inline” is a magic function in IPython (Interactive Python). IPython is typically used in environments like Jupyter and Google Colab. Check my previous post for more information on this.

These are extra modules that I decided to import to support the different operations I want to perform on the dataset.

Importing necessary modules
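A minimal sketch of what those imports might look like (the exact list of extra modules from the screenshot isn’t reproduced here):

```python
import numpy as np                 # numerical arrays and dtype helpers (alias: np)
import pandas as pd                # DataFrames and CSV reading (alias: pd)
import matplotlib.pyplot as plt    # plotting (alias: plt)
import seaborn as sns              # statistical plots built on Matplotlib (alias: sns)

# IPython magic: render Matplotlib figures inline in the notebook
%matplotlib inline
```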

You can import your dataset however you wish to, but this is the way I decided to do it, just because.

This is a different way I used to import my videogame dataset.

You have to call the Pandas read_csv function in order to read CSV files and perform operations on them. I named my variable videogame so that the DataFrame returned by read_csv is assigned to that variable, and then I can perform all kinds of operations using it.
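A minimal sketch of that step, assuming the file is named vgsales.csv and sits next to the notebook (adjust the path to wherever your copy lives):

```python
# Read the CSV into a DataFrame and assign it to the videogame variable
videogame = pd.read_csv("vgsales.csv")
```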

One thing I like to do when I am trying to communicate with data is to get to know the contents of the dataset I’m working with. I do this using the Pandas info() function. This function gives a description of the columns in the dataset, the count of non-null values, and the data type of each column.

Once you see the columns you’re working with, it’s always good to look at a sample of the contents of those columns. This is done using the head() function, which gives a sample of your dataset. You can specify the number of rows you want displayed, but by default it shows 5, which is enough to get a peek at your dataset.
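Both inspection calls are one-liners, roughly like this:

```python
videogame.info()    # column names, non-null counts, and dtypes
videogame.head()    # first 5 rows by default; head(10) would show ten
```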

So from the earlier conversations with the data, you can see that something is off: the Year column is a float and not an integer, and we have to change this. I use the Pandas DataFrame square-bracket selection to pick out the Year column and the astype(np.int64) function to change its data type to an integer. Another thing I want you to notice here is my use of the NumPy alias, np.
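A sketch of that conversion (note that astype(np.int64) will raise an error if the Year column still contains missing values, so those need to be filled or dropped first):

```python
# Cast the Year column from float to a 64-bit integer using the NumPy alias
videogame["Year"] = videogame["Year"].astype(np.int64)
```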

Now, after changing the data type of the Year column, we can see that it looks right.

I also want to change the index of my dataset. Currently it is using Rank as the index, but I don’t want that, so to change it you first drop the current index and then set a new index with whatever name you choose. I went with id, plain and simple.
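One way to do that, sketched here rather than copied from the original notebook:

```python
# Drop the current index, then name the fresh integer index "id"
videogame = videogame.reset_index(drop=True)
videogame.index.name = "id"
```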

I noticed that my continent sales columns were inconsistent, which would prove a problem when it’s time to visualize the data. The values ranged widely: 41.49, 29.02, 0.77, 3.28, 11.01. I didn’t understand what unit the dataset was using, so I decided to rescale the columns so that they are all working on the same scale. I first tested it out on my North American sales, and then did the same for the other continents.
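The post doesn’t show the exact transformation, so here is one hypothetical way to put the continent sales columns on a common scale, using min-max normalization and the column names from the Kaggle videogame sales dataset:

```python
# Hypothetical rescaling: squeeze each sales column into the 0-1 range
sales_cols = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]
for col in sales_cols:
    col_min, col_max = videogame[col].min(), videogame[col].max()
    videogame[col] = (videogame[col] - col_min) / (col_max - col_min)
```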

I had some null values, and the most efficient way I found to fix them was using Scikit-Learn’s SimpleImputer, which imputes the missing values of the numerical features (a.k.a. columns) with the mean strategy. For numerical values, a strategy of mean, median, most frequent, or constant can be used. You can also use SimpleImputer to impute/replace missing values of categorical features, in which case the most frequent or constant strategies apply.
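A minimal sketch of mean imputation with SimpleImputer, assuming these are the numeric columns in the dataset:

```python
from sklearn.impute import SimpleImputer

num_cols = ["Year", "NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales", "Global_Sales"]

# Replace missing numeric values with each column's mean
imputer = SimpleImputer(strategy="mean")
videogame[num_cols] = imputer.fit_transform(videogame[num_cols])
```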

In order to know what features you will use to build your model, you have to see how the features correlate with each other, and you can use a heatmap as I’ve done below. The lighter the cell, the stronger the correlation. From this heatmap I can tell my selected features will be Japan sales (JP_Sales) and European sales (EU_Sales). I wouldn’t pick Global_Sales because that’s just the total of the individual continents’ sales combined.
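A sketch of that heatmap using the Seaborn alias defined earlier:

```python
# Correlation heatmap of the numeric columns, used to pick model features
plt.figure(figsize=(8, 6))
sns.heatmap(videogame.corr(numeric_only=True), annot=True, cmap="viridis")
plt.show()
```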

The next step is to create visualizations to gain insights that will help you better understand your data. I did quite a lot because my data was really interesting and I wanted to see it from every angle, but I’ll just show a few of the ones I made.
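As one hypothetical example of the kind of chart this step produces (not one of the original figures), here is a bar chart of total North American sales by genre:

```python
# Total North American sales per genre, sorted from highest to lowest
genre_sales = videogame.groupby("Genre")["NA_Sales"].sum().sort_values(ascending=False)
genre_sales.plot(kind="bar", figsize=(10, 5), title="North American Sales by Genre")
plt.ylabel("NA_Sales")
plt.show()
```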

So, now it’s time to build the model and to train and test it. The modeling step is really short and pretty straightforward; most of the work is done in the cleaning and analysis. If you have done an elaborate analysis of your data, your model building should be relatively easy. It also depends on what you’re working on, as I know there are much more complex models than this that require more scrutiny, patience, and a lot of back and forth.

The good thing about working in an interactive environment is that you can add code and run cells individually without affecting everything else if you wish. The other benefit is that you can import modules and libraries at any point. If I were using an environment that doesn’t support interactive code, it would be good practice to always do imports at the very top; with an interactive environment there really isn’t any restriction in that sense.

So here I imported the LinearRegression model and the train_test_split function from Scikit-Learn’s model_selection module.
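A sketch of those imports and the fit itself, assuming the feature and target columns named above and a common 80/20 split (the exact split used in the post isn’t shown):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Predict North American sales from Japanese and European sales
X = videogame[["JP_Sales", "EU_Sales"]]
y = videogame["NA_Sales"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))   # R^2 on the held-out test set
```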

Yellowbrick is imported for my residuals plot and for plotting my prediction model’s error. A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis. If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a nonlinear model is more appropriate.
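A minimal Yellowbrick residuals-plot sketch for the model fitted above:

```python
from yellowbrick.regressor import ResidualsPlot

# Plot training and test residuals for the fitted linear model
visualizer = ResidualsPlot(model)
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.show()
```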

The prediction error plot shows the actual targets from the dataset against the predicted values generated by the model.
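And the matching prediction error plot:

```python
from yellowbrick.regressor import PredictionError

# Plot actual target values against the model's predictions
visualizer = PredictionError(model)
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.show()
```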

And so there you have it: I have shown you ways you can interact with your data for analysis and how to build, train & test a model. I hope this post was helpful in some way. If you have any extra questions, please don’t hesitate to ask.

Hope to communicate with you in another blog post. Remember, data will talk to you if you’re willing to listen :).
