Descriptive Statistics —

Anakin

13 min read · Jan 18, 2021

T test, Z score, Gini Index, Expected Value, Bayes Theorem, Central Limit Theorem, ANOVA, Distribution Types

Measures of Central Tendency -

1 Mean = Average

Sum of observations / Number of observations

2 Median = Middle

Step 1: Sort the values

{1,4,2,6,8} sorted is {1,2,4,6,8}

Step 2: Take the middle value. Here 4 is the median

If the number of values is even, what is the median?

{1,4,8,9}

Median = mean of the two middle values 4 and 8 = 12/2 = 6

(sum of the two middle values / 2)

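3 Mode = most frequently occurring observation

Count how often each value occurs; the value with the highest frequency is the mode

If 1 and 8 both occur twice and every other value occurs once, the data is bimodal: 1 and 8 are both modes

If more than two values tie for most frequent, the data is multimodal; if every value occurs equally often, it has NO MODE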
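The three measures above can be checked with Python's standard `statistics` module. This is only a small sketch: the two median lists are the ones used above, while the `repeated` list is made-up data chosen so that 1 and 8 both occur twice.

```python
import statistics

odd_values = [1, 4, 2, 6, 8]      # odd number of values
even_values = [1, 4, 8, 9]        # even number of values
repeated = [1, 1, 3, 5, 8, 8]     # hypothetical data with two modes

mean_odd = statistics.mean(odd_values)         # sum / number of observations
median_odd = statistics.median(odd_values)     # middle of the sorted values
median_even = statistics.median(even_values)   # mean of the two middle values
modes = statistics.multimode(repeated)         # all most-frequent values

print(mean_odd, median_odd, median_even, modes)
```

`statistics.multimode` returns every value that ties for most frequent, so a bimodal sample gives a list with two entries.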

Mean vs Median

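The mean is sensitive to outliers: a single extreme value can pull it far away from the typical value

The median is robust to outliers

For skewed data the median is the better measure of the center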
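A quick way to see the difference is to add one extreme value to a small sample. The incomes below are invented for illustration only.

```python
import statistics

incomes = [30, 32, 35, 38, 40]          # hypothetical incomes (in thousands)
with_outlier = incomes + [1000]         # the same data plus one extreme value

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

# The mean jumps a lot, the median barely moves.
print(mean_before, median_before)
print(round(mean_after, 2), median_after)
```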

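Gini coefficient

A measure of inequality, used in economics for income distributions: 0 means everyone has the same income, and values close to 1 mean a few people hold almost all of it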
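One common way to compute the Gini coefficient is the mean absolute difference between all pairs of incomes, divided by twice the mean. The sketch below assumes that definition; the function name `gini` and the sample incomes are made up for illustration.

```python
def gini(incomes):
    """Gini coefficient via the mean absolute difference (assumed definition)."""
    n = len(incomes)
    mean = sum(incomes) / n
    # Sum of |a - b| over every ordered pair of incomes.
    total_abs_diff = sum(abs(a - b) for a in incomes for b in incomes)
    return total_abs_diff / (2 * n * n * mean)

equal = gini([10, 10, 10, 10])     # everyone earns the same
unequal = gini([0, 0, 0, 100])     # one person earns everything
print(equal, unequal)
```

With only four people the most unequal case gives 0.75 rather than 1; the upper bound approaches 1 as the population grows.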

Skewed Data

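Right skewed (positively skewed): the long tail is on the right, a few very large values pull the mean above the median

Left skewed (negatively skewed): the long tail is on the left, a few very small values pull the mean below the median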

Bi Modal Distribution

For categorical data:

Use the mode. Mean and median are not applicable

Measures of Spread — Measures of Dispersion

1- Range and IQR

2- Variance and Standard Deviation (STD)

Range = difference between the highest value and the lowest value

e.g. if 6 is the highest value and 1 is the lowest, the range is 6 - 1 = 5

Interquartile Range (IQR): how to calculate?

1- Sort the data

2- Find the median, which splits the data into a lower half and an upper half

3- Find the median of the lower half (Q1) and the median of the upper half (Q3)

4- Subtract the two numbers: IQR = Q3 - Q1

For even

for odds
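The steps above can be sketched in a few lines. This assumes the "median of each half" convention, where the middle value is left out of both halves when the count is odd; other quartile conventions (including the default of `statistics.quantiles`) can give slightly different numbers. The helper name `iqr_by_halves` is made up.

```python
import statistics

def iqr_by_halves(values):
    """IQR as median of the upper half minus median of the lower half."""
    data = sorted(values)                 # step 1: sort
    half = len(data) // 2                 # step 2: size of each half
    lower = data[:half]                   # values left of the median
    upper = data[-half:]                  # values right of the median
    q1 = statistics.median(lower)         # step 3: middle of each half
    q3 = statistics.median(upper)
    return q3 - q1                        # step 4: subtract

odd_data = [1, 4, 2, 6, 8]
even_data = [1, 4, 8, 9]
value_range = max(odd_data) - min(odd_data)   # range = highest - lowest

print(value_range, iqr_by_halves(odd_data), iqr_by_halves(even_data))
```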

Variance and Standard Deviation

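Variance (population)

Variance is the average of the squared deviations from the mean

We square the deviations because positive and negative deviations would otherwise cancel each other out
Squaring also gives outliers more weight, so the variance captures their presence

Standard Deviation

It is the square root of the variance, so it is in the same units as the data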
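As a small check, the population variance can be computed by hand and compared with the standard library. The data list is just an example; `pvariance` and `pstdev` are the population versions in Python's `statistics` module.

```python
import math
import statistics

data = [1, 2, 4, 6, 8]                       # example data
mean = sum(data) / len(data)

# Average of squared deviations from the mean (population variance).
manual_variance = sum((x - mean) ** 2 for x in data) / len(data)
manual_std = math.sqrt(manual_variance)

library_variance = statistics.pvariance(data)
library_std = statistics.pstdev(data)

print(manual_variance, manual_std)
```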

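Impact on spread: Scaling and Shifting

Scaling = multiplying or dividing every value by a constant

Shifting = adding or subtracting a constant

When you SHIFT data (add or subtract a constant)

the measures of central tendency change by that constant: Mean, Median, Mode

but

the measures of spread (range, IQR, variance, standard deviation) do not change

When you SCALE data (multiply or divide by a constant)

everything changes: the central measures, range, IQR and standard deviation scale with the constant, and the variance scales with its square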
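The effect of shifting and scaling is easy to verify numerically. The shift of 10 and the scale factor of 3 below are arbitrary choices for illustration.

```python
import statistics

data = [1, 2, 4, 6, 8]
shifted = [x + 10 for x in data]     # shift: add a constant
scaled = [x * 3 for x in data]       # scale: multiply by a constant

base_mean, base_std = statistics.mean(data), statistics.pstdev(data)
shift_mean, shift_std = statistics.mean(shifted), statistics.pstdev(shifted)
scale_mean, scale_std = statistics.mean(scaled), statistics.pstdev(scaled)

print(base_mean, base_std)
print(shift_mean, shift_std)    # mean moves by 10, spread unchanged
print(scale_mean, scale_std)    # mean and spread both multiplied by 3
```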

Standardizing Statistical Moments

Sometimes the values in the data become so large that we must standardize them

These are the formulas to Standardize it

Distribution

Distribution = how often each value occurs

Histogram = bar chart of the frequency of occurrence

Normal Distribution

Also called the Gaussian distribution; described by a probability density function

Empirical Rule:

Within 1 sigma of the mean = 68% of values

Within 2 sigma = 95%

Within 3 sigma = 99.7%
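The empirical rule can be reproduced from the standard normal distribution using `statistics.NormalDist` (Python 3.8+). The helper name `share_within` is made up for this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, standard deviation 1

def share_within(k):
    """Fraction of a normal population within k sigma of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```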

Z Score

Every normal distribution has a mean and a standard deviation (std)
mean = 175
std = 5
Z score = (x - mean) / std
a value x = 180 (example for heights in a population)

Z score for 180 = (180 - 175) / 5 = 1

Go to the Z score table: it gives the area under the curve to the LEFT of the x value

If you know the Z score value

e.g. 1.4 (which is x = 182 here),

what does it mean?

The table value is 0.9192, i.e. 91.92% of the population is below this x

About 92% of people are shorter than x, and x is in the top 8% of the population
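Instead of a printed Z score table, the same lookup can be done with `statistics.NormalDist`. The sketch uses the mean of 175 and std of 5 from the height example.

```python
from statistics import NormalDist

heights = NormalDist(mu=175, sigma=5)

x = 180
z = (x - heights.mean) / heights.stdev     # z score of x
share_below_x = heights.cdf(x)             # area to the left of x

# Table lookup for a known z score, e.g. z = 1.4.
share_below_z14 = NormalDist().cdf(1.4)

print(z, round(share_below_x, 4), round(share_below_z14, 4))
```

`cdf` plays the role of the table: it returns the area under the curve to the left of the value.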

The second way: if you know the percentile (0.98 in this example), you can find x

From the table we find the Z score for that percentile, and from it the x value

z = (x - mu) / sigma

x = z * sigma + mu = 2.05 * 5 + 175 = 185.25

What does this mean?

If your height is about 185, you are in the top 2%: 98% of people have a height less than this x value
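Going from a percentile back to a value is the inverse lookup. A minimal sketch with `NormalDist.inv_cdf`, again assuming mean 175 and std 5; the exact z score for the 98th percentile is about 2.05.

```python
from statistics import NormalDist

heights = NormalDist(mu=175, sigma=5)

percentile = 0.98
z_at_percentile = NormalDist().inv_cdf(percentile)   # z score for the percentile
x_from_z = z_at_percentile * heights.stdev + heights.mean
x_direct = heights.inv_cdf(percentile)                # same result in one call

print(round(z_at_percentile, 2), round(x_from_z, 2), round(x_direct, 2))
```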

Positive Z values are on the right half of the curve

Negative Z values are on the left side of the curve

Negative values can be looked up in the negative-Z part of the Z score table
OR subtract the table value for the positive Z score from 1

e.g. for z = 1 the table gives 0.84, so for z = -1 the value is 1 - 0.84 = 0.16
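The symmetry used for negative Z scores can be confirmed directly; this is just a numerical check of the rule, not a replacement for the table.

```python
from statistics import NormalDist

standard_normal = NormalDist()

positive_side = standard_normal.cdf(1.0)     # table value for z = 1
negative_side = standard_normal.cdf(-1.0)    # table value for z = -1

# By symmetry, the negative value is 1 minus the positive value.
print(round(positive_side, 4), round(negative_side, 4), round(1 - positive_side, 4))
```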

Probability: the chance of an event occurring, a number from 0 to 1

0 <= P(event) <= 1

Bayes Theorem :

Used in the Naive Bayes algorithm

Conditional probability use cases

Weather: given the temperature and outlook, what is the probability that a person will play or not?

Sentiment analysis (NLP classification): based on the comments, the output is 1 or 0

Example of Bayes calculation:
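Bayes' theorem says P(A | B) = P(B | A) * P(A) / P(B). Below is a small sketch for the weather use case; every probability in it is a made-up number for illustration, not data from this article.

```python
# Hypothetical inputs (assumptions for illustration only).
p_play = 0.6                    # P(play)
p_sunny_given_play = 0.3        # P(sunny | play)
p_sunny_given_no_play = 0.55    # P(sunny | no play)

# Total probability of a sunny day.
p_sunny = p_sunny_given_play * p_play + p_sunny_given_no_play * (1 - p_play)

# Bayes' theorem: P(play | sunny) = P(sunny | play) * P(play) / P(sunny)
p_play_given_sunny = p_sunny_given_play * p_play / p_sunny

print(round(p_sunny, 2), round(p_play_given_sunny, 2))
```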

Expected Value

To calculate the expected value of a given scenario:

1- Identify the win and the loss situations

2- Value associated with the win and with the loss: X1 and X2

3- Probability of the win and of the loss: P1 and P2

4- Expected value E(X) = X1 * P1 + X2 * P2

E(X) is the average amount you will win or lose per round if you keep repeating the same scenario, business plan or model

e.g. Question: should a freelancer continue working, and how much will he make on average given the 2 possibilities?

Scenario 1: the freelancer makes 300 dollars; 0.6 of his clients are good

Scenario 2: the freelancer loses 150 dollars; 0.4 of his clients ask for refunds and don't pay

Answer: E(X) = 300 * 0.6 + (-150) * 0.4 = 180 - 60 = +120 dollars, so he makes a profit on average across the 2 scenarios

If you have more possible outcomes X1, X2, X3 … Xn:

E(X) = X1*P1 + X2*P2 + … + Xn*Pn

List all possible outcomes with their probabilities to get the expected outcome if you continue in this situation
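The general formula is a weighted sum, so it fits in a short helper. The function name `expected_value` is made up; the numbers are the freelancer scenario from above.

```python
import math

def expected_value(outcomes):
    """outcomes: list of (value, probability) pairs whose probabilities sum to 1."""
    total_probability = sum(p for _, p in outcomes)
    if not math.isclose(total_probability, 1.0):
        raise ValueError("probabilities must sum to 1")
    return sum(value * p for value, p in outcomes)

# Freelancer: +300 with probability 0.6, -150 with probability 0.4.
freelancer = [(300, 0.6), (-150, 0.4)]
print(expected_value(freelancer))
```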

Law of Large Numbers

If we repeat the experiment many, many times, the average of the results will eventually approach the expected value

Wikipedia — In probability theory, the law of large numbers (LLN) is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value and will tend to become closer to the expected value as more trials are performed.
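A short simulation makes the law visible. This is a sketch that reuses the freelancer numbers (+300 with probability 0.6, -150 with probability 0.4, expected value 120); the function name and the fixed seed are assumptions for reproducibility.

```python
import random

def simulate_average(n_trials, seed=0):
    """Average payoff over n_trials simulated jobs."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    total = 0
    for _ in range(n_trials):
        total += 300 if rng.random() < 0.6 else -150
    return total / n_trials

# The average drifts toward the expected value as the number of trials grows.
for n in (10, 1_000, 100_000):
    print(n, simulate_average(n))
```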

Central Limit Theorem

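If we take many, many samples from a big population, the means of those samples will eventually follow a normal distribution, centered on the population mean, with a standard deviation equal to the population standard deviation divided by the square root of the sample size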

https://www.youtube.com/watch?v=14DPP4Z61Fk

See the simulation below: press begin and draw 10,000 samples from the population at the top, and see how the distribution of the sample means settles around the original population mean
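The same experiment can be sketched offline. The population here is an exponential distribution with mean 1 and standard deviation 1, chosen only because it is clearly not normal; the sample size of 25 and the seed are arbitrary assumptions.

```python
import math
import random
import statistics

def sample_means(n_samples, sample_size, seed=0):
    """Means of repeated samples drawn from a skewed (exponential) population."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]

means = sample_means(n_samples=10_000, sample_size=25)
mean_of_means = statistics.fmean(means)
std_of_means = statistics.pstdev(means)

# Theory: mean of sample means = 1, std of sample means = 1 / sqrt(25) = 0.2
print(round(mean_of_means, 3), round(std_of_means, 3), 1 / math.sqrt(25))
```

Plotting a histogram of `means` would show the bell shape, even though the underlying population is strongly right skewed.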

Sampling Distributions

onlinestatbook.com

Examples

An investment firm has 100 clients who invest 4k on average, with a standard deviation of 6k. For every client who invests 3k or more the firm makes 2k profit, but for every client who invests less than 3k the firm loses 1.5k. Is this model profitable?

Step 1

mu = 4k, sigma = 6k; the sample mean is the same, mu_s = 4k; the sample standard deviation is sigma_s = 6 / sqrt(100) = 6/10 = 0.6k

Step 2

What is the z score at 3k? z = (3 - 4) / 0.6 = -1.66, and the z score table value is 0.048, i.e. 4.8%

Step 3

4.8 * 1.5k loss = 7.2k loss for every 100 clients

The remaining 95.2 is the profit area: 95.2 * 2k = 190.4k profit for every 100 clients, so the model is profitable
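The three steps can be reproduced in code. This sketch follows the article's simplification of applying the sample-mean distribution to the share of clients; the variable names are made up and amounts are in thousands.

```python
import math
from statistics import NormalDist

clients = 100
mu, sigma = 4.0, 6.0                       # average investment and std, in k
sigma_s = sigma / math.sqrt(clients)       # step 1: 6 / 10 = 0.6

z = (3.0 - mu) / sigma_s                   # step 2: z score at 3k
p_below = NormalDist(mu, sigma_s).cdf(3.0) # area to the left of 3k

expected_loss = p_below * clients * 1.5    # step 3: loss side
expected_profit = (1 - p_below) * clients * 2.0
net = expected_profit - expected_loss

print(round(z, 2), round(p_below, 3), round(expected_loss, 1), round(expected_profit, 1))
```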

Example 2

15 workers each make 3k profit a month on average, with std = 15k. Is this business profitable?

What are the chances of profit and loss?

What is the maximum loss this business can have?

Solution

Step 1

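Calculate mu and sigma for the sample mean, then the z score at x = 0, which splits the outcomes into profit and loss: mu_s = 3k, sigma_s = 15 / sqrt(15) = 3.87k, z = (0 - 3) / 3.87 = -0.77. The table value is about 0.22, so the probability of a loss is about 22% and the probability of a profit is about 78%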

Step 2

What is the biggest loss the business can have? Take the worst 1% of outcomes and find the x at that point

The z score at the 0.99 percentile is 2.3, so at the 0.01 percentile it is -2.3

(we want the bottom 1%, not the 99%)

With sigma_s = 15 / sqrt(15) = 3.87k, x at z = -2.3 is 3 - 2.3 * 3.87 = -5.9k per worker

-5.9k is the maximum loss for 1 worker; * 15 workers = -88.5k loss per month

Example 3

25 traders; the distribution of profit has a mean of 5k and a pretty large standard deviation of 20k.

Question

If X is the random variable for our sample mean, what is the probability that on average our traders make less than zero?

Answer: sigma_s = 20 / sqrt(25) = 4k, z = (0 - 5) / 4 = -1.25, table value about 0.10, so roughly a 10% chance of a loss

What is the maximum loss? Take the worst 1%: the z score at 99% is 2.33, so at 1% it is -2.33

x at this z for 1 trader = 5 - 2.33 * 4 = -4.32k; * 25 traders

maximum loss = -108k
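The trader example can be checked the same way. This is a sketch under the approach used above (normal sample mean, worst case taken at the 1st percentile); amounts are in thousands and the names are made up. The result differs slightly from -108k because `inv_cdf` uses an unrounded z score.

```python
import math
from statistics import NormalDist

traders = 25
mu, sigma = 5.0, 20.0                         # mean profit and std, in k
sigma_s = sigma / math.sqrt(traders)          # 20 / 5 = 4
sample_mean_dist = NormalDist(mu, sigma_s)

p_loss = sample_mean_dist.cdf(0.0)            # chance the average is below zero
worst_per_trader = sample_mean_dist.inv_cdf(0.01)   # 1st percentile of the average
worst_total = worst_per_trader * traders

print(round(p_loss, 3), round(worst_per_trader, 2), round(worst_total, 1))
```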

Binomial Distribution

Consists of Bernoulli trials (like coin flips)
Each trial has just 2 outcomes
The outcomes are independent
P(success) = p, P(failure) = 1 - p, where p is the probability of success

Use case: if 20% of customers are willing to buy, what is the probability that exactly 25 of 100 customers will buy?

X = number of successes

n = number of trials

Expected value: E(X) = n * p

Sigma: sqrt(n * p * (1 - p))
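The customer use case can be worked out with the binomial probability mass function, P(X = k) = C(n, k) * p^k * (1 - p)^(n - k). A minimal sketch using only the standard library; `binomial_pmf` is a made-up helper name.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 100, 0.2
p_exactly_25 = binomial_pmf(25, n, p)
expected = n * p                         # expected number of buyers
sigma = math.sqrt(n * p * (1 - p))       # standard deviation

print(round(p_exactly_25, 4), expected, sigma)
```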

Poisson Distribution

Discrete distribution: takes whole-number values (counts)

Calculations

Examples

For a specific value

For all values below a specific number … just add up

For different value of Lambda = expected value
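The Poisson probability for a specific value is P(X = k) = e^(-lambda) * lambda^k / k!, and the probability of "at most k" is the sum of those terms. In the sketch below, lambda = 3 is an arbitrary example value and the helper names are made up.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def poisson_cdf(k, lam):
    """Probability of k or fewer events: just add up the single values."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

lam = 3                                   # hypothetical expected value
print(round(poisson_pmf(2, lam), 4), round(poisson_cdf(2, lam), 4))
```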

Hypothesis testing

Is the mean of the population the same as the mean observed in the experiment? Yes or No

The null hypothesis says they are the same; the proposed (alternative) hypothesis says they are different

Level Of Significance:

Alpha = 0.05

The threshold decided in advance: if the p value is below this threshold, the difference found by the test is considered significant

P Value = Probability value

The probability of seeing a result at least as extreme as the one observed, assuming the null hypothesis is true

e.g. below, the values 0.5, 0.25, 0.125 … 0.03 are all p values
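To make the p value concrete, here is a sketch of a two-sided one-sample z test. The population (mean 175, std 5) is borrowed from the height example; the sample size of 25 and the sample mean of 177 are invented for illustration.

```python
import math
from statistics import NormalDist

population_mean, population_std = 175.0, 5.0
sample_size, sample_mean = 25, 177.0       # hypothetical experiment result
alpha = 0.05

standard_error = population_std / math.sqrt(sample_size)
z = (sample_mean - population_mean) / standard_error

# Two-sided p value: chance of a result at least this extreme if the null is true.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject_null = p_value < alpha

print(z, round(p_value, 4), reject_null)
```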

Type 1 and Type 2 Error

Severity

A Type 1 error is usually considered more severe in its consequences, because we reject the null hypothesis and accept a change that is not real. A Type 2 error is usually less severe, because we continue with the null hypothesis and the change or difference simply goes undiscovered

Alpha 0.05, Type 1 Error: the null hypothesis is correct and there is no difference, but the experiment result shows that there is a difference

In reality there is no difference, but the specific sample values indicate there is one, and we reject the null hypothesis

Beta 0.1, Type 2 Error: in reality there is a difference and the null hypothesis is wrong, but the sample values of this experiment say there is no difference, and we fail to reject the null hypothesis
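A simulation shows why alpha is the Type 1 error rate: when the null hypothesis is true, about 5% of experiments still reject it at alpha = 0.05. This is a sketch with assumed settings (population mean 175, std 5, samples of 25, 2000 experiments, fixed seed); the function name is made up.

```python
import math
import random
from statistics import NormalDist, fmean

def false_positive_rate(n_experiments=2000, sample_size=25, alpha=0.05, seed=0):
    """Share of experiments that reject a null hypothesis that is actually true."""
    rng = random.Random(seed)
    mu, sigma = 175.0, 5.0                     # the null hypothesis holds exactly
    standard_error = sigma / math.sqrt(sample_size)
    rejections = 0
    for _ in range(n_experiments):
        sample_mean = fmean(rng.gauss(mu, sigma) for _ in range(sample_size))
        z = (sample_mean - mu) / standard_error
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))
        if p_value < alpha:                    # Type 1 error: false alarm
            rejections += 1
    return rejections / n_experiments

rate = false_positive_rate()
print(rate)   # should land close to alpha
```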