Predictive Analysis

Predictive analysis is the use of statistical, data mining, and machine learning techniques to analyze current and historical data in order to make predictions about future events or behaviors. It involves identifying patterns and trends in data, and then using that information to forecast what is likely to happen in the future.

Predictive analysis is used in a wide range of applications, from forecasting sales and demand, to predicting customer behavior, to detecting fraudulent transactions. It involves collecting and analyzing data from a variety of sources, including historical data, customer data, financial data, and social media data, among others.

The process of predictive analysis typically involves the following steps:

  1. Defining the problem and identifying the relevant data sources
  2. Collecting and cleaning the data
  3. Exploring and analyzing the data to identify patterns and trends
  4. Selecting an appropriate model or algorithm to use for predictions
  5. Training and validating the model using historical data
  6. Using the model to make predictions on new data
  7. Monitoring and evaluating the performance of the model over time
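
As a toy illustration of steps 4 through 6, here is a minimal sketch that fits a linear trend to some made-up monthly sales figures with NumPy's polyfit and uses the fitted model to forecast the next month (the data and variable names are invented for the example):

import numpy as np

# hypothetical monthly sales figures (steps 1-3, collecting and cleaning, assumed done)
sales = np.array([100, 105, 113, 118, 126, 131])
months = np.arange(len(sales))

# steps 4-5: select and fit a simple linear trend model on the historical data
slope, intercept = np.polyfit(months, sales, 1)

# step 6: use the model to predict the next month's sales
forecast = slope * len(sales) + intercept
print(f"Forecasted sales for month {len(sales)}: {forecast:.1f}")
Forecasted sales for month 6: 137.8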

Predictive analysis can help organizations make more informed decisions, improve efficiency, and gain a competitive advantage by leveraging insights from data.

It is most commonly used in retail, where analysts try to predict which products will be most popular and advertise them accordingly, and in healthcare, where algorithms analyze patterns to reveal risk factors for diseases and suggest preventive treatment, predict the results of different treatments so the best option can be chosen for each patient individually, and forecast disease outbreaks and epidemics.

1. Intro to NumPy and the features it consists of

Numpy, by definition, is the fundamental package for scientific computing in Python: it provides a multidimensional array object and tools for performing fast mathematical operations on it, which makes data analysis much easier. Numpy is very important and useful when it comes to data analysis, as its features cover essentially any mathematical operation you need and make it straightforward to analyze data files.

If you don't already have numpy installed, you can install it with conda install numpy or pip install numpy

Once that is complete, to import numpy in your code, all you must do is:

import numpy as np

2. Using NumPy to create arrays

An array is the central data structure of the NumPy library. Arrays are containers that can store more than one item at the same time. The function np.array is used to create an array, and it can create multidimensional arrays as well.

Shown below is how to create a 1D array:

a = np.array([1, 2, 3])
print(a) 
# this creates a 1D array
[1 2 3]

How could you create a 3D array based on knowing how to make a 1D array?

b = np.array([[[-6, -5, -4], [-3, -2, -1]], [[1, 2, 3], [4, 5, 6]]])
# a 3D array is a stack of 2D blocks, so every block must have the same shape
print(b)
[[[-6 -5 -4]
  [-3 -2 -1]]

 [[ 1  2  3]
  [ 4  5  6]]]

Arrays can be printed in different ways, including more readable formats. As we have seen, arrays are printed in rows and columns, but we can change the shape they are arranged in by using the reshape function.

c = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(c.reshape(1, 9)) # organizes it all in a single line of output
[[1 2 3 4 5 6 7 8 9]]
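
The reshape function can also infer one dimension for you if you pass -1, which is a quick way to flatten the array back into 1D:

print(c.reshape(-1)) # -1 tells numpy to work out the size of that dimension itself
[1 2 3 4 5 6 7 8 9]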

In the code segment below, we can also select specific rows and columns from the array to further analyze a subset of the data.

print(c[1:, :2])
# the 1: means "start at row 1 and select all the remaining rows"
# the :2 means "select the first two columns"
[[4 5]
 [7 8]]

3. Basic array operations

One of the most basic things you can do with arrays is arithmetic. With numpy, it is very easy: you can add, subtract, multiply, and divide arrays just like you would regular numbers. Numpy applies each operation element-wise, meaning it performs the operation on each element of the arrays separately, which makes it easy to work through large amounts of data quickly and efficiently.

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(a + b) # adds each value based on the column the integer is in
print(a - b) # subtracts each value based on the column the integer is in
print(a * b) # multiplies each value based on the column the integer is in
print(a / b) # divides each value based on the column the integer is in
[5 7 9]
[-3 -3 -3]
[ 4 10 18]
[0.25 0.4  0.5 ]
d = np.exp(b)
e = np.sqrt(b)
print(d)
print(e)
[ 54.59815003 148.4131591  403.42879349]
[2.         2.23606798 2.44948974]

Now that you have seen mathematical operations beyond the basic four, such as exponentials and square roots, can you write code to calculate the three main trig functions (sin, cos, tan), the natural log, and the log base 10 of a 1D array?

print(np.sin(b)) # calculate sin
print(np.cos(b)) # calculate cos
print(np.tan(b)) # calculate tan
print(np.log(b)) # calculate natural log
print(np.log10(b)) # calculate log10
[-0.7568025  -0.95892427 -0.2794155 ]
[-0.65364362  0.28366219  0.96017029]
[ 1.15782128 -3.38051501 -0.29100619]
[1.38629436 1.60943791 1.79175947]
[0.60205999 0.69897    0.77815125]

4. Data analysis using numpy

Numpy provides a convenient and powerful way to perform data analysis tasks on large datasets. One of the most common tasks in data analysis is finding the mean, median, and standard deviation of a dataset. Numpy provides functions to perform these operations quickly and easily. The mean function calculates the average value of the data, while the median function calculates the middle value in the data. The standard deviation function calculates how spread out the data is from the mean. Additionally, numpy provides functions to find the minimum and maximum values in the data. These functions are very useful for gaining insight into the properties of large datasets and can be used for a wide range of data analysis tasks.

data = np.array([2, 5, 12, 13, 19])
print(np.mean(data)) # finds the mean of the dataset
print(np.median(data)) # finds the median of the dataset
print(np.std(data)) # finds the standard deviation of the dataset
print(np.min(data)) # finds the min of the dataset
print(np.max(data)) # finds the max of the dataset
10.2
12.0
6.04648658313239
2
19

Now that you have learned this, can you find a way to compute the sum or product of a dataset other than the methods we learned before?

example = np.array([4, 50, 283])
print(np.sum(example))
print(np.prod(example)) # np.prod is the standard name; np.product was a deprecated alias
337
56600

Numpy also has the ability to handle CSV files, which are commonly used to store and exchange large datasets. By importing CSV files into numpy arrays, we can easily perform complex operations and analysis on the data, making numpy an essential tool for data scientists and researchers.

genfromtxt and loadtxt are two functions in the numpy library that can be used to read data from text files, including CSV files.

genfromtxt is a more advanced function that can be used to read text files that have more complex structures, including CSV files. genfromtxt can handle files that have missing or invalid data, or files that have columns of different data types. It can also be used to skip header lines or to read only specific columns from the file.

import numpy as np

padres = np.genfromtxt('files/padres.csv', delimiter=',', dtype=str, encoding='utf-8')
# genfromtxt reads the csv file itself
# delimiter=',' indicates that the data is separated into columns distinguished by commas
# dtype=str tells numpy to read every field in as a string

print(padres)
[['Name' ' Position' ' Average' ' HR' ' RBI' ' OPS' ' JerseyNumber']
 ['Manny Machado' ' 3B' ' .298' ' 32' ' 102' ' .897' ' 13']
 ['Fernando Tatis Jr' ' RF' ' .281' ' 42' ' 97' ' .975' ' 23']
 ['Juan Soto' ' LF' ' .242' ' 27' ' 62' ' .853' ' 22']
 ['Xander Bogaerts' ' SS' ' .307' ' 15' ' 73' ' .833' ' 2']
 ['Nelson Cruz' ' DH' ' .234' ' 10' ' 64' ' .651' ' 32']
 ['Matt Carpenter' ' DH' ' .305' ' 15' ' 37' ' 1.138' ' 14']
 ['Jake Cronenworth' ' 1B' ' .239' ' 17' ' 88' ' .722' ' 9']
 ['Ha-Seong Kim' ' 2B' ' .251' ' 11' ' 59' ' .708' ' 7']
 ['Trent Grisham' ' CF' ' .184' ' 17' ' 53' ' .626' ' 1']
 ['Luis Campusano' ' C' ' .250' ' 1' ' 5' ' .593' ' 12']
 ['Austin Nola' ' C' ' .251' ' 4' ' 40' ' .649' ' 26']
 ['Jose Azocar' ' OF' ' .257' ' 0' ' 10' ' .630' ' 28']]
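
Because dtype=str reads every field in as text, a common next step is to pull the numeric columns out directly as numbers. Here is a small sketch, assuming the same padres.csv layout shown above (HR in column 3, RBI in column 4):

hr, rbi = np.genfromtxt('files/padres.csv', delimiter=',', skip_header=1, usecols=(3, 4), unpack=True)
# skip_header=1 skips the column-name row; usecols picks out just the HR and RBI columns
print(f"average HR: {hr.mean():.2f}")
print(f"total RBI: {rbi.sum():.0f}")
average HR: 15.92
total RBI: 690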

loadtxt is a simpler function that can be used to read simple text files that have a regular structure, such as files that have only one type of data (such as all integers or all floats). loadtxt can be faster than genfromtxt because it assumes that the data in the file is well-structured and can be easily parsed.

import numpy as np

padres = np.loadtxt('files/padres.csv', delimiter=',', dtype=str, encoding='utf-8')
for i in padres:
    print(",".join(i)) # re-join each row's fields so every line prints like the original file
Name, Position, Average, HR, RBI, OPS, JerseyNumber
Manny Machado, 3B, .298, 32, 102, .897, 13
Fernando Tatis Jr, RF, .281, 42, 97, .975, 23
Juan Soto, LF, .242, 27, 62, .853, 22
Xander Bogaerts, SS, .307, 15, 73, .833, 2
Nelson Cruz, DH, .234, 10, 64, .651, 32
Matt Carpenter, DH, .305, 15, 37, 1.138, 14
Jake Cronenworth, 1B, .239, 17, 88, .722, 9
Ha-Seong Kim, 2B, .251, 11, 59, .708, 7
Trent Grisham, CF, .184, 17, 53, .626, 1
Luis Campusano, C, .250, 1, 5, .593, 12
Austin Nola, C, .251, 4, 40, .649, 26
Jose Azocar, OF, .257, 0, 10, .630, 28

Pandas

What is Pandas

Pandas is a Python library used for working with data sets. A Python library is a collection of prewritten code that you can call from your own programs. Pandas has functions for analyzing, cleaning, exploring, and manipulating data.

Why Use Pandas?

Pandas allows us to analyze big data and draw conclusions based on statistical theories. It can clean messy data sets and make them readable and relevant, which is a core part of data analysis and data manipulation.

What Can Pandas Do?

Pandas gives you answers about the data, like:

  • Is there a correlation between two or more columns?
  • What is the average value?
  • What is the max value?
  • What is the min value?
  • How do you load data?
  • How do you delete data?
  • How do you sort data?

Pandas can also delete rows that are not relevant or that contain bad values, like empty or NULL values. This is called cleaning the data.
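
As a small sketch of what cleaning can look like in practice (the tiny DataFrame here is made up for the example), dropna() drops the row with the missing value, and corr() then measures how strongly two numeric columns are related:

import pandas as pd

df = pd.DataFrame({'hours': [1, 2, 3, None], 'score': [50, 60, 70, 80]})

clean = df.dropna() # remove the row where 'hours' is missing
print(round(clean['hours'].corr(clean['score']), 2)) # correlation between the two columns
1.0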

Basics of Pandas.

import pandas as pd
# this imports the pandas library and gives it the alias pd; this line is needed whenever you use pandas

DICTIONARIES AND DATASETS

  • One way to build a pandas data set is to create a dictionary and pass it to pd.DataFrame, which turns it into a printable table, as seen below with the dict data1.
import pandas as pd

data1 = {
  'teams': ["BARCA", "REAL", "ATLETICO"],
  'standings': [1, 2, 3]
}

myvar = pd.DataFrame(data1)

print(myvar)
      teams  standings
0     BARCA          1
1      REAL          2
2  ATLETICO          3

Indexing and manipulation of data through lists.

  • With pandas you can also organize the data, which is one of its biggest perks. We call this indexing: we define the labels that appear as the first column when the data is printed.
import pandas as pd 

score = [5/5, 5/5, 1/5]

myvar = pd.Series(score, index = ["math", "science", "pe"])

print(myvar)
math       1.0
science    1.0
pe         0.2
dtype: float64
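
Once the Series has labeled indexes, you can look a value up by its label instead of its position:

print(myvar["science"]) # access a value by its index label
1.0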

Pandas Classes

Within pandas, the library consists of a lot of classes and functions that allow you to manipulate datasets held in lists, dictionaries, and CSV files. Here are some of the ones we are going to cover (hint: take notes on these):

  • Series

A one-dimensional array that can have custom keys as indexes. It can hold data of any type.

  • Index

Also called a key, it is used as an identifier or label for a given value.

  • PeriodIndex

This represents a range of time periods. It can divide a span of time into regular intervals, such as the example later on that splits by month. It can also go by year or by day.

  • DataFrameGroupBy

Allows data to be grouped by a certain index, category, etc. in the data.

  • Categorical

Categorical identification helps data be defined under certain categories, making it easier to identify aspects of different data objects.

  • Timestamp

Displays a single time. This is helpful for datasets that rely on time and would benefit from modifying it in some way.

PeriodIndex

  • This allows for a way to lay out data over the time span in which it occurs, as seen below from January 2022 to December 2022. You can use Y for years, M for months, and D for days.
import pandas as pd


time = pd.period_range('2022-01', '2022-12', freq='M')

print(time)
PeriodIndex(['2022-01', '2022-02', '2022-03', '2022-04', '2022-05', '2022-06',
             '2022-07', '2022-08', '2022-09', '2022-10', '2022-11', '2022-12'],
            dtype='period[M]')

Now implement a way to show a period index from June 2022 to July 2023 in days.

timetwo = pd.period_range('2022-06', '2023-07', freq='D')

print(timetwo)
PeriodIndex(['2022-06-01', '2022-06-02', '2022-06-03', '2022-06-04',
             '2022-06-05', '2022-06-06', '2022-06-07', '2022-06-08',
             '2022-06-09', '2022-06-10',
             ...
             '2023-06-22', '2023-06-23', '2023-06-24', '2023-06-25',
             '2023-06-26', '2023-06-27', '2023-06-28', '2023-06-29',
             '2023-06-30', '2023-07-01'],
            dtype='period[D]', length=396)

DataFrameGroupBy

  • This allows you to organize your data into groups and calculate different functions on each group, such as:
  • count(): returns the number of non-null values in each group.
  • sum(): returns the sum of values in each group.
  • mean(): returns the mean of values in each group.
  • min(): returns the minimum value in each group.
  • max(): returns the maximum value in each group.
  • median(): returns the median of values in each group.
  • var(): returns the variance of values in each group.
  • agg(): applies one or more functions to each group and returns a new DataFrame with the results.
import pandas as pd

data = {
    'Category': ['E', 'F', 'E', 'F', 'E', 'F', 'E', 'F'],
    'Value': [100, 250, 156, 255, 240, 303, 253, 3014]
}
df = pd.DataFrame(data)


grouped = df.groupby('Category').mean() #GUESS WHAT THIS WOULD BE IF WE WERE LOOKING FOR COMBINED TOTALS!
# combined totals would be sum()

print(grouped)
           Value
Category        
E         187.25
F         955.50
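
To confirm the answer to the question in the comment above, swapping mean() for sum() gives the combined totals per category:

print(df.groupby('Category').sum()) # sum() adds up every value in each group
          Value
Category       
E           749
F          3822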

Categorical

  • This sets up categories and assigns each value to one of them, which allows for better organization.
import pandas as pd

colors = pd.Categorical(['yellow', 'orange', 'blue', 'yellow', 'orange'], categories=['yellow', 'orange', 'blue'])

print(colors)
['yellow', 'orange', 'blue', 'yellow', 'orange']
Categories (3, object): ['yellow', 'orange', 'blue']
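
Under the hood, each value in a Categorical is stored as a small integer code that points into the list of categories, which is part of what makes it efficient:

print(colors.codes) # the integer code of each value: yellow=0, orange=1, blue=2
[0 1 2 0 1]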

Timestamp Class

  • This displays a single time, which can be useful when working with datasets that deal with time, allowing you to manipulate when something happens and how it is represented.
import pandas as pd


timing = pd.Timestamp('2023-02-05 02:00:00')

print(str(timing) + " is 2:00am on February 5th, 2023.")
2023-02-05 02:00:00 is 2:00am on February 5th, 2023.
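
A Timestamp can also be manipulated rather than just displayed, for example by shifting it with a Timedelta or asking which weekday it falls on:

print(timing + pd.Timedelta(hours=12)) # shift the timestamp forward by 12 hours
print(timing.day_name()) # which day of the week the timestamp falls on
2023-02-05 14:00:00
Sunday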

CSV FILES!

  • A CSV file contains data, and within pandas you can read the file in and then manipulate the data with the classes talked about above. The contents of files/padres.csv are:

Name, Position, Average, HR, RBI, OPS, JerseyNumber
Manny Machado, 3B, .298, 32, 102, .897, 13
Fernando Tatis Jr, RF, .281, 42, 97, .975, 23
Juan Soto, LF, .242, 27, 62, .853, 22
Xander Bogaerts, SS, .307, 15, 73, .833, 2
Nelson Cruz, DH, .234, 10, 64, .651, 32
Matt Carpenter, DH, .305, 15, 37, 1.138, 14
Jake Cronenworth, 1B, .239, 17, 88, .722, 9
Ha-Seong Kim, 2B, .251, 11, 59, .708, 7
Trent Grisham, CF, .184, 17, 53, .626, 1
Luis Campusano, C, .250, 1, 5, .593, 12
Austin Nola, C, .251, 4, 40, .649, 26
Jose Azocar, OF, .257, 0, 10, .630, 28

QUESTION: WHAT DO YOU GUYS THINK THE INDEX FOR THIS WOULD BE?

I think the indexes are Name, Position, Average, HR, RBI, OPS and JerseyNumber, since those are the labels that identify each column of values.

Can you explain what is going on in the code segment below? (hint: define what ascending=False means, and what df.head means)

In the code segment below, the data in the padres .csv file is sorted by Name in reverse alphabetical order, because ascending=False. df.head(10) then shows the first ten rows of the sorted data, and df.tail(10) shows the last ten.

import pandas as pd

# read the csv and sort by 'Name' from Z to A
df = pd.read_csv('files/padres.csv').sort_values(by=['Name'], ascending=False)

print("--Name Top 10---------")
print(df.head(10))

print("--Name Bottom 10------")
print(df.tail(10))
print(', '.join(df.tail(10))) # iterating over a DataFrame yields its column labels, so this joins the header names
--Name Top 10---------
                Name  Position   Average   HR   RBI    OPS   JerseyNumber
3    Xander Bogaerts        SS     0.307   15    73  0.833              2
8      Trent Grisham        CF     0.184   17    53  0.626              1
4        Nelson Cruz        DH     0.234   10    64  0.651             32
5     Matt Carpenter        DH     0.305   15    37  1.138             14
0      Manny Machado        3B     0.298   32   102  0.897             13
9     Luis Campusano         C     0.250    1     5  0.593             12
2          Juan Soto        LF     0.242   27    62  0.853             22
11       Jose Azocar        OF     0.257    0    10  0.630             28
6   Jake Cronenworth        1B     0.239   17    88  0.722              9
7       Ha-Seong Kim        2B     0.251   11    59  0.708              7
--Name Bottom 10------
                 Name  Position   Average   HR   RBI    OPS   JerseyNumber
4         Nelson Cruz        DH     0.234   10    64  0.651             32
5      Matt Carpenter        DH     0.305   15    37  1.138             14
0       Manny Machado        3B     0.298   32   102  0.897             13
9      Luis Campusano         C     0.250    1     5  0.593             12
2           Juan Soto        LF     0.242   27    62  0.853             22
11        Jose Azocar        OF     0.257    0    10  0.630             28
6    Jake Cronenworth        1B     0.239   17    88  0.722              9
7        Ha-Seong Kim        2B     0.251   11    59  0.708              7
1   Fernando Tatis Jr        RF     0.281   42    97  0.975             23
10        Austin Nola         C     0.251    4    40  0.649             26
Name,  Position,  Average,  HR,  RBI,  OPS,  JerseyNumber
import pandas as pd


df = pd.read_csv("./files/housing.csv")


mode_total_rooms = df['total_rooms'].mode()


print(f"The mode of the 'total_rooms' column is: {mode_total_rooms}")
The mode of the 'total_rooms' column is: 0    1527.0
Name: total_rooms, dtype: float64
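
Note that mode() returns a Series rather than a single number, because a column can have several values tied for most common. To pull out just the number, index into the result:

print(f"The mode of the 'total_rooms' column is: {mode_total_rooms.iloc[0]}")
The mode of the 'total_rooms' column is: 1527.0
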
import pandas as pd

df = pd.read_csv("./files/housing.csv")


grouped_df = df.groupby('total_rooms')


agg_df = grouped_df.agg({'total_rooms': 'sum', 'population': 'mean', 'longitude': 'count'})

# WHAT DO YOU GUYS THINK df.agg MEANS IN THE CONTEXT OF PANDAS, AND WHAT DOES IT STAND FOR?
# I think it stands for the aggregation of the dataset, which summarizes data as specified by the values

print(agg_df)
             total_rooms  population  longitude
total_rooms                                    
2.0                  2.0         6.0          1
6.0                  6.0         8.0          1
8.0                  8.0        13.0          1
11.0                11.0        24.0          1
12.0                12.0        18.0          1
...                  ...         ...        ...
30450.0          30450.0      9419.0          1
32054.0          32054.0     15507.0          1
32627.0          32627.0     28566.0          1
37937.0          37937.0     16122.0          1
39320.0          39320.0     16305.0          1

[5926 rows x 3 columns]

Our Frontend Data Analysis Project

Link

Popcorn Hacks

  • Complete the fill-in-the-blanks for Predictive Analysis NumPy
  • Take notes on Pandas where it asks you to
  • Complete the code segment tasks in Pandas and NumPy

Main Hack

  • Make a data file - content is up to you, just make sure there are integer values - and print
  • Run Pandas and NumPy commands
    • Pandas:
      • Find Min and Max values
      • Sort in order - can be order of least to greatest or vice versa
      • Create a smaller dataframe and merge it with your data file
    • Numpy:
      • Random number generation
      • create a multi-dimensional array (multiple elements)
      • create an array with linearly spaced intervals between values

My Dataset

I made a dataset of information about my favorite Pokemon. I figured this would work well because there are a lot of numbers and strings that can be used to describe each Pokemon.

pokedf = pd.read_csv("./files/pokemon.csv")
print(pokedf)
                    Name  Pokedex     Type1     Type2  BST   HP  Attack  \
0               Skarmory      227     Steel    Flying  465   65      80   
1             Toedscruel      949    Ground     Grass  515   80      70   
2              Reuniclus      579   Psychic      None  490  110      65   
3               Ampharos      181  Electric      None  510   90      75   
4                Avalugg      713       Ice      None  514   95     117   
5               Appletun      842     Grass    Dragon  485  110      85   
6                 Furret      162    Normal      None  415   85      76   
7               Accelgor      617       Bug      None  495   80      70   
8               Mudsdale      750    Ground      None  500  100     125   
9                 Durant      632       Bug     Steel  484   58     109   
10         Minior-Orange      774      Rock    Flying  500   60     100   
11            Galvantula      596       Bug  Electric  472   70      77   
12         Meowstic-Male      678   Psychic      None  466   74      48   
13             Dragonite      149    Dragon    Flying  600   91     134   
14  Shaymin (Land Forme)      492     Grass      None  600  100     100   

    Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary  
0       140       40       70     70           2      False  
1        65       80      120    100           9      False  
2        75      125       85     30           5      False  
3        85      115       90     55           2      False  
4       184       44       46     28           6      False  
5        80      100       80     30           8      False  
6        64       45       55     90           2      False  
7        40      100       60    145           5      False  
8       100       55       85     35           7      False  
9       112       48       48    109           5      False  
10       60      100       60    120           7      False  
11       60       97       60    108           5      False  
12       76       83       81    104           6      False  
13       95      100      100     80           1      False  
14      100      100      100    100           4       True  

Pandas

The stuff below is related to the Pandas portion of the main hack.

Maximums and Minimums

I wanted to see the strongest and weakest combination of all of the stats spread among the list of Pokemon.

# use names that don't shadow Python's built-in max() and min()
max_stats = {
    'Stat': [],
    'Value': []
}

min_stats = {
    'Stat': [],
    'Value': []
}

for stat in ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']:
    max_stats['Stat'].append(stat)
    min_stats['Stat'].append(stat)
    max_stats['Value'].append(pokedf[stat].max())
    min_stats['Value'].append(pokedf[stat].min())

maxStats = pd.DataFrame(max_stats)
minStats = pd.DataFrame(min_stats)

print(maxStats)
print(minStats)
      Stat  Value
0       HP    110
1   Attack    134
2  Defense    184
3  Sp. Atk    125
4  Sp. Def    120
5    Speed    145
      Stat  Value
0       HP     58
1   Attack     48
2  Defense     40
3  Sp. Atk     40
4  Sp. Def     46
5    Speed     28

So that I could know which Pokemon were the strongest and weakest overall among my favorites, I filtered on base stat total (BST).

print(pokedf[pokedf['BST'] == pokedf['BST'].max()][['Name', 'BST', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']])
print()
print(pokedf[pokedf['BST'] == pokedf['BST'].min()][['Name', 'BST', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']])
                    Name  BST  Attack  Defense  Sp. Atk  Sp. Def  Speed
13             Dragonite  600     134       95      100      100     80
14  Shaymin (Land Forme)  600     100      100      100      100    100

     Name  BST  Attack  Defense  Sp. Atk  Sp. Def  Speed
6  Furret  415      76       64       45       55     90

Sorting in Order

Below, I sorted the dataset in alphabetical order, only including certain data to save space.

alpha_pokedf = pokedf.sort_values(by=['Name'])
print(alpha_pokedf[['Name', 'Pokedex', 'BST', 'Generation']])
                    Name  Pokedex  BST  Generation
7               Accelgor      617  495           5
3               Ampharos      181  510           2
5               Appletun      842  485           8
4                Avalugg      713  514           6
13             Dragonite      149  600           1
9                 Durant      632  484           5
6                 Furret      162  415           2
11            Galvantula      596  472           5
12         Meowstic-Male      678  466           6
10         Minior-Orange      774  500           7
8               Mudsdale      750  500           7
2              Reuniclus      579  490           5
14  Shaymin (Land Forme)      492  600           4
0               Skarmory      227  465           2
1             Toedscruel      949  515           9

Then, I sorted by Pokedex number to see the density of my favorites in the Pokedex. I then found the mean Pokedex number to get a sense for the general trend.

import math

dex_pokedf = pokedf.sort_values(by=['Pokedex'])
print(dex_pokedf[['Pokedex', 'Name', 'BST', 'Generation']])
print("Mean Pokedex number: " + str(math.floor(pokedf['Pokedex'].mean())))
    Pokedex                  Name  BST  Generation
13      149             Dragonite  600           1
6       162                Furret  415           2
3       181              Ampharos  510           2
0       227              Skarmory  465           2
14      492  Shaymin (Land Forme)  600           4
2       579             Reuniclus  490           5
11      596            Galvantula  472           5
7       617              Accelgor  495           5
9       632                Durant  484           5
12      678         Meowstic-Male  466           6
4       713               Avalugg  514           6
8       750              Mudsdale  500           7
10      774         Minior-Orange  500           7
5       842              Appletun  485           8
1       949            Toedscruel  515           9
Mean Pokedex number: 556

Merging Dataframes

I added a couple more favorites to the list by using the pandas concat function.

dfpokedf = pd.DataFrame(pokedf) # pokedf is already a DataFrame, so this wrapper is not strictly needed

concatdf = pd.DataFrame(
    {
        "Name": ['Sawsbuck', 'Brambleghast'],
        "Pokedex": [632, 947],
        "Type1":['Normal','Grass'],
        "Type2":['Grass','Ghost'],
        "BST":[475,480],
        "HP":[80,55],
        "Attack":[100,115],
        "Defense":[70,70],
        "Sp. Atk":[60,80],
        "Sp. Def":[70,70],
        "Speed":[95,90],
        "Generation":[5,9],
        "Legendary":[False,False]
    }, index=[15, 16]
)

finaldf = pd.concat([dfpokedf, concatdf])

print(finaldf[['Name', 'Pokedex', 'BST', 'Generation']])
                    Name  Pokedex  BST  Generation
0               Skarmory      227  465           2
1             Toedscruel      949  515           9
2              Reuniclus      579  490           5
3               Ampharos      181  510           2
4                Avalugg      713  514           6
5               Appletun      842  485           8
6                 Furret      162  415           2
7               Accelgor      617  495           5
8               Mudsdale      750  500           7
9                 Durant      632  484           5
10         Minior-Orange      774  500           7
11            Galvantula      596  472           5
12         Meowstic-Male      678  466           6
13             Dragonite      149  600           1
14  Shaymin (Land Forme)      492  600           4
15              Sawsbuck      632  475           5
16          Brambleghast      947  480           9

Bonus Content

Because I found this kind of interesting and also because I want the extra 0.1 points, I thought I'd look at some more data about the entries.

pokegens = finaldf.pivot_table(index=['Generation'], aggfunc='size')

print(pokegens)
Generation
1    1
2    3
4    1
5    5
6    2
7    2
8    1
9    2
dtype: int64

The most represented generation in my favorites list is Generation 5, with five Pokemon.

print(finaldf[finaldf['Generation'] == 5][["Name", "Pokedex", "BST", "Generation"]])
          Name  Pokedex  BST  Generation
2    Reuniclus      579  490           5
7     Accelgor      617  495           5
9       Durant      632  484           5
11  Galvantula      596  472           5
15    Sawsbuck      632  475           5

Now looking at typing.

poketypes = finaldf.pivot_table(index=['Type1', 'Type2'], aggfunc='size')

print(poketypes)
Type1     Type2   
Bug       Electric    1
          None        1
          Steel       1
Dragon    Flying      1
Electric  None        1
Grass     Dragon      1
          Ghost       1
          None        1
Ground    Grass       1
          None        1
Ice       None        1
Normal    Grass       1
          None        1
Psychic   None        2
Rock      Flying      1
Steel     Flying      1
dtype: int64

There is only one exact repeat type combination, being pure Psychic type (seen with Reuniclus and Meowstic). However, it would be more interesting to see the overall type density, including both Type1 and Type2 in the equation.

poketype1 = finaldf.pivot_table(index=['Type1'], aggfunc='size')
poketype2 = finaldf.pivot_table(index=['Type2'], aggfunc='size')

totaltypes = {}
# merge the counts from both type columns; dict.get avoids shadowing the built-in type() or using a bare except
for typ, num in poketype1.items():
    totaltypes[typ] = totaltypes.get(typ, 0) + num
for typ, num in poketype2.items():
    totaltypes[typ] = totaltypes.get(typ, 0) + num

print("Type density:")
for typ, num in totaltypes.items():
    print(f'\t{typ}: {num}')
Type density:
	Bug: 3
	Dragon: 2
	Electric: 2
	Grass: 5
	Ground: 2
	Ice: 1
	Normal: 2
	Psychic: 2
	Rock: 1
	Steel: 2
	Flying: 3
	Ghost: 1
	None: 8

The data above shows that Grass is the most common type in my favorites, occurring 5 times. The two runner-ups are Bug and Flying, both with 3. Some types that are missing are Fire, Water, Poison, Fairy and Dark. Water really surprises me, since it's the most common type in the Pokedex.

"None" with a density of 8 means that eight of my favorite Pokemon are monotype, which is interesting since most Pokemon are dual-type.

Numpy

The section below contains the hacks for the NumPy portion.

Random Number Generation

NumPy also ships with a random number generator. Its integers function lets you make a list of random numbers, which is pretty cool.

import numpy as np

rng = np.random.default_rng()

print(rng.integers(0, high=5, size=10))
[2 4 3 3 1 4 0 4 0 0]

You can also make a set of random choices.

selList = rng.integers(50, size=10)

print(rng.choice(selList, 5, replace=False))
[49 11 39 40  3]

Multi-dimensional Array

Here are the multi-dimensional arrays I created.

twod = np.array([["a", "b", "c"], ["d", "e", "f"]])
# this is a 2 by 3 array. Proof:
print(twod.shape)
(2, 3)
threed = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
# this is a 2 by 2 by 2 array (three-dimensional). Proof:
print(threed.shape)
(2, 2, 2)

Array With Linear Spacing

There is a Numpy function called linspace() that does this. I used it below.

print(np.linspace(0, 10, num=21))
[ 0.   0.5  1.   1.5  2.   2.5  3.   3.5  4.   4.5  5.   5.5  6.   6.5
  7.   7.5  8.   8.5  9.   9.5 10. ]
print(np.linspace(300, 700, num=21))
[300. 320. 340. 360. 380. 400. 420. 440. 460. 480. 500. 520. 540. 560.
 580. 600. 620. 640. 660. 680. 700.]

Grading

The grading will be binary - all or nothing; no partial credit

  • 0.3 for all the popcorn hacks
  • 0.6 for the main hack - CSV file
  • 0.1 for going above and beyond in the main hack