Descriptive Statistics

Anakin
Jan 18, 2021


T test, Z score, Gini Index, Expected Value, Bayes Theorem, Central Limit Theorem, ANOVA, Distribution Types

Measures of Central Tendency -

1 Mean = Average

Sum of observations / Number of observations

2 Median = Middle

Step 1: Sort the values

{1,4,2,6,8} sorted is {1,2,4,6,8}

4 is the median (the middle value)

If the number of values is even, what is the median?

{1,4,8,9}

Median = mean of the two middle values 4 and 8 = 12/2 = 6

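3 Mode = most frequently occurring observation

Count how often each value occurs; the value with the highest frequency is the mode

eg: if 1 and 8 both occur twice and every other value occurs once, the data is bimodal and 1 and 8 are both modes

If three or more values tie for most frequent, the data is multimodal, and some textbooks then report NO MODE. If every value occurs equally often, there is no mode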
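A minimal Python sketch of the three measures, using the standard library statistics module. The first two lists are the ones from the notes above; mode_values is a made-up list, chosen only so that 1 and 8 tie for most frequent.

```python
from statistics import mean, median, multimode

odd_values = [1, 4, 2, 6, 8]       # example from the notes (odd count)
even_values = [1, 4, 8, 9]         # example from the notes (even count)
mode_values = [1, 1, 3, 5, 8, 8]   # hypothetical data: 1 and 8 both occur twice

mean_odd = mean(odd_values)        # sum of observations / number of observations
median_odd = median(odd_values)    # middle value after sorting
median_even = median(even_values)  # mean of the two middle values
modes = multimode(mode_values)     # every value tied for most frequent

print(mean_odd, median_odd, median_even, modes)
```

multimode returns all tied values, so a bimodal list gives two modes instead of an error.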

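Mean vs Median

The mean is sensitive to outliers: a single extreme value pulls it toward the tail

The median is more robust, because it depends only on the middle of the sorted data

For skewed data the median is the better summary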

Gini coefficient

A measure of inequality, used in economics for income distributions (0 = perfect equality, 1 = maximal inequality)

Skewed Data

Right skewed (positively skewed): the long tail is on the right, with large positive outliers

Left skewed (negatively skewed): the long tail is on the left, with large negative outliers

Bi Modal Distribution

For categorical data:

Use the mode; the mean and median are not applicable (the median needs at least ordered categories)

Measures of Spread — Measures of Dispersion

1- Range and IQR

2- Variance and STD Standard Deviation

Range = difference between the highest value and lowest value

eg: if 6 is the highest value and 1 is the lowest, range = 6 - 1 = 5

Interquartile Range (IQR): how to calculate?

1- Sort the data

2- Find the median, which splits the data into a lower half and an upper half

3- Find the median of each half: the lower one is Q1, the upper one is Q3

4- Subtract: IQR = Q3 - Q1

For even

for odds
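The steps above can be sketched in Python as follows. This is one convention only: for an odd number of values the middle value is left out of both halves, which is an assumption (other quartile definitions give slightly different numbers). The function names are illustrative.

```python
from statistics import median

def value_range(values):
    """Range = highest value minus lowest value."""
    return max(values) - min(values)

def iqr(values):
    """IQR by the median-of-halves method described in the steps above."""
    data = sorted(values)                 # step 1: sort
    if len(data) < 2:
        raise ValueError("need at least two values")
    half = len(data) // 2                 # step 2: split around the median
    lower_half = data[:half]              # for odd counts the middle value
    upper_half = data[-half:]             # is excluded from both halves
    q1 = median(lower_half)               # step 3: median of each half
    q3 = median(upper_half)
    return q3 - q1                        # step 4: subtract

print(value_range([1, 4, 2, 6, 8]))  # odd count
print(iqr([1, 4, 2, 6, 8]))
print(iqr([1, 4, 8, 9]))             # even count
```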

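Variance and Standard Deviation

Variance (population) = the average of the squared differences from the mean

  • we square the differences because positive and negative deviations would otherwise cancel each other out
  • outliers are squared too, so they carry more weight and the variance captures their presence

Standard Deviation

It is the square root of the variance, so it is in the same units as the data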
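A small sketch of the population variance and standard deviation, computed by hand and checked against the standard library. The data list is reused from the median example; it is only an illustration.

```python
from math import sqrt
from statistics import pvariance, pstdev

data = [1, 2, 4, 6, 8]

mu = sum(data) / len(data)                          # mean
squared_deviations = [(x - mu) ** 2 for x in data]  # squaring stops +/- cancelling
variance = sum(squared_deviations) / len(data)      # population variance
std_dev = sqrt(variance)                            # standard deviation

print(variance, std_dev)
print(pvariance(data), pstdev(data))                # library equivalents
```

pvariance and pstdev divide by n (population). For a sample, statistics.variance and statistics.stdev divide by n - 1 instead.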

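Impact on spread: Scaling and Shifting

Scaling = multiply or divide every value by a constant

Shifting = add or subtract a constant

When you SHIFT data (add or subtract a constant)

The measures of central tendency change: mean, median and mode all move by that constant

but

the measures of spread (range, IQR, variance, standard deviation) do not change

When you SCALE data (multiply or divide by a constant)

Everything changes: the central measures are multiplied by the constant, the range, IQR and standard deviation by its absolute value, and the variance by its square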
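A quick check of the shifting and scaling rules in Python. The data and the constants 10 and 3 are arbitrary choices for illustration.

```python
from statistics import mean, pstdev, pvariance

data = [1, 2, 4, 6, 8]
shifted = [x + 10 for x in data]   # SHIFT: add a constant
scaled = [x * 3 for x in data]     # SCALE: multiply by a constant

# Shifting moves the mean but leaves the spread alone.
print(mean(data), mean(shifted))
print(pvariance(data), pvariance(shifted))

# Scaling multiplies the mean and std by 3 and the variance by 3 squared.
print(mean(scaled), pstdev(scaled), pvariance(scaled))
```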

Standardizing Statistical Moments

Sometimes the values in the data become so large that we must standardize them

These are the formulas to Standardize it

Distribution

How often the values occur = Distribution

Histogram = Frequency of occurrence ( bar charts)

Normal Distribution

Also called the Gaussian distribution; described by its probability density function

Empirical Rule:

within 1 sigma of the mean = 68% of values

within 2 sigma = 95%

within 3 sigma = 99.7%
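The empirical rule can be checked with the standard library NormalDist class. The helper name share_within is just for this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of values within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))
```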

Z Score

  • every Normal Distribution has a mean and a standard deviation (std)
  • mean = 175
  • std = 5
  • Z Score = (x - mean) / std
  • a value x = 180 (example for heights in a population)

Z score for 180 = (180 - 175) / 5 = 1

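  • Go to the Z Score table: it gives the area under the curve to the LEFT of the x value
  • you can look this area up if you know the Z Score value

eg z = 1.4, for which the table gives 0.9192

What does it mean?

91.92% of the population lies below this value; with mean 175 and std 5, z = 1.4 is a height of x = 182

About 92% of people are shorter than x, and x is in the top 8% of the population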

The 2nd way: if you know the percentile, you can find x; the percentile is 0.98 in this example

From the table we find the Z score whose area is 0.98, which is about 2.05, and from it the x value

z = (x - mu) / sigma

x = z * sigma + mu = 2.05 * 5 + 175 = 185.25

What does this mean?

If your height is about 185 then you are in the top 2%, and 98% of people are shorter than this x value

Positive Z values are on the right half of the curve

Negative Z values are on the left side of the curve

  • negative values can be found on the negative half of the Z Score table
  • OR subtract the table value for the positive Z Score from 1

eg: for z = -1, 1 - 0.8413 (the table value for z = +1) = 0.1587, which is the value for the negative Z Score
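Instead of a printed Z table, the same lookups can be sketched with NormalDist, using the height example (mean 175, std 5). cdf gives the area to the left of a value, and inv_cdf goes from a percentile back to x.

```python
from statistics import NormalDist

heights = NormalDist(mu=175, sigma=5)      # height example from the notes
standard_normal = NormalDist()             # mean 0, std 1 (the Z table)

z_180 = (180 - heights.mean) / heights.stdev     # z = (x - mean) / std
share_below_180 = heights.cdf(180)               # area to the left of x = 180
height_98th = heights.inv_cdf(0.98)              # x value at the 98th percentile

# Symmetry: the area left of -z equals 1 minus the area left of +z.
left_of_minus_1 = standard_normal.cdf(-1)
one_minus_left_of_plus_1 = 1 - standard_normal.cdf(1)

print(z_180, share_below_180, height_98th)
print(left_of_minus_1, one_minus_left_of_plus_1)
```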

Probability: the chance of an event occurring, a number from 0 to 1

P(event) is in [0,1]

Bayes Theorem :

Used in the Naive Bayes algorithm

Conditional probability use cases

Weather: given the temperature and outlook, what is the probability that a person will play or not?

Sentiment analysis in NLP classification: based on the comment text, the output is 1 or 0

Example of Bayes calculation:
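A minimal sketch of a Bayes calculation for the weather use case, P(play | sunny) = P(sunny | play) * P(play) / P(sunny). All three input probabilities are made-up numbers for illustration; they are not taken from the original example.

```python
# Hypothetical inputs (assumptions, not from the notes)
p_play = 0.6                   # prior: P(play)
p_sunny_given_play = 0.3       # likelihood: P(sunny | play)
p_sunny_given_no_play = 0.5    # P(sunny | not play)

# Total probability of the evidence: P(sunny)
p_sunny = (p_sunny_given_play * p_play
           + p_sunny_given_no_play * (1 - p_play))

# Bayes theorem: posterior = likelihood * prior / evidence
p_play_given_sunny = p_sunny_given_play * p_play / p_sunny

print(round(p_sunny, 4), round(p_play_given_sunny, 4))
```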

Expected Value

To calculate the Expected Value of a given scenario

1- Identify the win and loss situations

2- Value associated with the win and with the loss: X1 and X2

3- Probability of the win and of the loss: P1 and P2

4- E(x) expected value = X1 * P1 + X2 * P2

E(x) is the average amount you will win or lose per trial if you keep repeating the same scenario, business plan or model

eg: Should a freelancer continue working, and how much will he make on average given the 2 possibilities?

Scenario 1: the freelancer makes 300 dollars; 0.6 of his clients are good and pay

Scenario 2: the freelancer loses 150 dollars; 0.4 of his clients ask for refunds and don't pay

Answer: E(x) = 300 * 0.6 + (-150) * 0.4 = 180 - 60 = +120 dollars, so on average he makes a profit considering the 2 scenarios

If you have more possible outcomes: X1, X2, X3 … Xn

E(x) = X1*P1 + X2*P2 + … + Xn*Pn

Example: list all possible outcomes with their probabilities; the sum gives the expected outcome if you continue in this situation
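The freelancer example as a short Python sketch. The function name expected_value is illustrative; it works for any number of outcomes as long as the probabilities sum to 1.

```python
def expected_value(outcomes, probabilities):
    """E(x) = X1*P1 + X2*P2 + ... + Xn*Pn."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in zip(outcomes, probabilities))

# Freelancer example: win 300 with probability 0.6, lose 150 with probability 0.4
freelancer_ev = expected_value([300, -150], [0.6, 0.4])
print(freelancer_ev)
```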

Law of Large Numbers

If we repeat the experiment many, many times, the average of the results will eventually approach the expected value

Wikipedia — In probability theory, the law of large numbers (LLN) is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value and will tend to become closer to the expected value as more trials are performed.
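A small simulation sketch of the law of large numbers, reusing the freelancer outcomes (+300 with probability 0.6, -150 with probability 0.4, expected value 120). The seed and the trial counts are arbitrary choices.

```python
import random

rng = random.Random(0)             # fixed seed so the run is repeatable
outcomes = [300, -150]
probabilities = [0.6, 0.4]

def average_of_trials(n):
    """Average result of n simulated trials."""
    draws = rng.choices(outcomes, weights=probabilities, k=n)
    return sum(draws) / n

avg_10 = average_of_trials(10)            # few trials: can be far from 120
avg_100000 = average_of_trials(100_000)   # many trials: close to 120

print(avg_10, avg_100000)
```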

Central Limit Theorem

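If we take many, many samples of the same size n from a big population, the means of those samples will be approximately normally distributed, whatever the shape of the population. The sample means are centered on the population mean, and their standard deviation is the population standard deviation divided by the square root of n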

https://www.youtube.com/watch?v=14DPP4Z61Fk

See simulation below: start with begin and add 10000 samples from the top population, and see that the distribution of the sample means becomes bell shaped, its mean approaches the population mean, and its spread approaches sigma / square root of n
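A rough Python sketch of the same idea. The population here is an exponential distribution with mean 1 and std 1, chosen as an assumption because it is clearly not normal. The sample size, number of samples and seed are also arbitrary.

```python
import random
from math import sqrt
from statistics import mean, pstdev

rng = random.Random(42)
sample_size = 30        # n: values per sample
num_samples = 5000      # how many samples we draw

# Draw many samples from a skewed population and keep each sample's mean.
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(sample_size)) / sample_size
    for _ in range(num_samples)
]

mean_of_means = mean(sample_means)        # should be near the population mean 1
std_of_means = pstdev(sample_means)       # should be near 1 / sqrt(30)
predicted_std = 1 / sqrt(sample_size)

print(mean_of_means, std_of_means, predicted_std)
```

Plotting a histogram of sample_means would show the bell shape, even though the population itself is strongly right skewed.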

Examples

An investment firm has 100 clients who invest 4k on average, with a standard deviation of 6k. For every 3k invested the firm makes 2k profit, but the firm loses money when the investment is less than 3k, and the loss at that point is 1.5k. Is this model profitable?

Step 1

mu = 4k, std = 6k; the mean of the sample mean is the same, mu_s = 4k; the std of the sample mean is sigma_s = 6 / square root of 100 = 6/10 = 0.6k

Step 2

What is the z score at 3k? z = (3 - 4) / 0.6 = -1.67, and the z score table value is 0.048, which is 4.8% as a percentage

Step 3

4.8 * loss 1.5k = 7.2k loss for every 100 clients

95.2 (100 - 4.8) is the profit area, and 95.2 * 2k = 190.4k profit at the given rate for 100 clients, so the model is profitable

Example 2

15 workers each make 3k profit a month on average, with std = 15k. Is this business profitable?

What are the chances of profit and loss?

What is the maximum loss this business can have?

Solution

Step 1

Calculate mu and sigma and then the z score at x = 0; the area below it is the % not making a profit and the area above it is the % making a profit. For a single worker z = (0 - 3) / 15 = -0.2, so the profit probability is about 60% and the loss probability about 40%

Step 2

What is the biggest loss the business can have? Take it as the worst 1% of outcomes (x)

so we need the z score at the 0.99 percentile, which from the z table is 2.3; because we want the lowest 1%, not the lowest 99%, we use -2.3

Calculate x at z = -2.3 using the std of the sample mean, 15 / square root of 15 = 3.87k: x = 3 - 2.3 * 3.87 = -5.9k per worker

-5.9k is the maximum loss for 1 worker; * 15 workers = -88.5k loss per month

Example 3

25 traders; the distribution of profit has a mean of five thousand. The standard deviation is pretty large: twenty thousand.

Question?

If X is the random variable for our sample mean, what is the probability that on average our traders make less than zero?

Answer: the std of the sample mean is 20 / square root of 25 = 4k, so z = (0 - 5) / 4 = -1.25, which gives about a 10% chance of a loss

What is the maximum loss? Take the worst 1%: the z score at 99% is 2.33, so for 1% it is -2.33

We calculate x with this value for 1 trader: x = 5 - 2.33 * 4 = -4.32k, then * 25 traders

maximum loss = -108k
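The traders example can be reproduced with NormalDist as a sketch. It assumes the sample mean is normally distributed (Central Limit Theorem) and treats the "maximum loss" as the worst 1% outcome. Values are in thousands, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mean_profit = 5.0      # k per trader
std_profit = 20.0      # k per trader
n_traders = 25

# Standard deviation of the sample mean (standard error): sigma / sqrt(n)
standard_error = std_profit / sqrt(n_traders)
sample_mean_dist = NormalDist(mu=mean_profit, sigma=standard_error)

p_loss = sample_mean_dist.cdf(0)                          # P(average profit < 0)
worst_1pct_per_trader = sample_mean_dist.inv_cdf(0.01)    # 1st percentile
worst_1pct_total = worst_1pct_per_trader * n_traders

print(round(p_loss, 3), round(worst_1pct_per_trader, 2), round(worst_1pct_total, 1))
```

The total comes out slightly different from -108k only because the table value 2.33 is a rounded z score.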

Binomial Distribution

  • Consists of Bernoulli trials (coin-flip-like experiments)
  • Each trial has just 2 outcomes
  • The outcomes are independent
  • P(success) = p, P(failure) = 1 - p, where p is the probability of success

Use case: if 20% of customers are willing to buy, what is the probability that 25 of 100 customers will buy?

X = number of successes

n = number of trials

Expected value = n * p

Sigma = square root of n * p * (1 - p)
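A sketch of the customer use case, using the standard binomial formula P(X = k) = C(n, k) * p^k * (1 - p)^(n - k). The function name binomial_pmf is illustrative.

```python
from math import comb, sqrt

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n = 100     # customers (trials)
p = 0.2     # probability that a customer buys
k = 25      # number of buyers we ask about

prob_25 = binomial_pmf(k, n, p)
expected = n * p                   # expected number of buyers
sigma = sqrt(n * p * (1 - p))      # standard deviation

print(round(prob_25, 4), expected, sigma)
```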

Poisson Distribution

  • Discrete distribution: it takes whole-number values (counts)

Calculations

Examples

For a specific value

For all values below a specific number … just add up

For different value of Lambda = expected value
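A sketch of the two Poisson calculations named above, using P(X = k) = e^(-lambda) * lambda^k / k!. The value lambda = 3 and the cut-off k = 2 are assumptions for illustration, not numbers from the original examples.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when lam events are expected."""
    return exp(-lam) * lam ** k / factorial(k)

def poisson_cdf(k, lam):
    """Probability of k or fewer events: just add up the single values."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

lam = 3                                # hypothetical expected value
p_exactly_2 = poisson_pmf(2, lam)      # for a specific value
p_at_most_2 = poisson_cdf(2, lam)      # for all values up to a number

print(round(p_exactly_2, 4), round(p_at_most_2, 4))
```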

Hypothesis testing

Is the mean of the population the same as the mean observed in the experiment? Yes or No

The Null Hypothesis says they are the same; the proposed (alternative) Hypothesis says they are different

Level Of Significance:

Alpha = 0.05

Alpha is a threshold decided before the test: if the P value is lower than the threshold, the test shows a statistically significant difference and we reject the Null Hypothesis

P Value = Probability value

The probability, assuming the Null Hypothesis is true, of getting a result at least as extreme as the one observed; it is a tail area under the distribution

eg below, the values 0.5, 0.25, 0.125 … 0.03 are all P values

Type 1 and Type 2 Error

Severity

A Type 1 error is usually treated as more severe in its consequences, because we reject the Null Hypothesis and accept a change that is not real. A Type 2 error is usually less severe, because we continue with the Null Hypothesis and the change or difference is simply not discovered

Alpha 0.05, Type 1 Error: the Null Hypothesis is correct and there is no difference, but the experiment result shows that there is a difference

In reality there is no difference, but the specific sample values indicate there is one, and we reject the Null Hypothesis

Beta 0.1, Type 2 Error: in reality there is a difference and the Null Hypothesis is wrong, but the sample values of this experiment say that there is no difference, and we fail to reject the Null Hypothesis

How do we reduce Type 1 and Type 2 Errors?

We can increase n, the sample size, so that the sample represents the population better. The errors cannot be eliminated completely: for a fixed n, lowering alpha raises beta

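Calculate Margin of Error and Confidence Interval

1- Find the sample mean = 0.54

2- Calculate the standard deviation of the sample mean (standard error) = 0.05

3- By the Central Limit Theorem the sample mean is approximately normal, so about 95% of its values fall within 2 sigma above or below the mean

4- So the margin of error is 2 * 0.05 = 0.10, and the confidence interval runs from 0.54 - 0.10 = 0.44 to 0.54 + 0.10 = 0.64

If we repeated the sampling many times, about 95% of the intervals built this way would contain the true population mean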
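The same confidence interval as a short sketch. It uses the "2 sigma" rule of thumb from the steps above and also shows the more precise multiplier from the normal distribution. The value 0.05 is taken to be the standard error, which is an interpretation of the notes.

```python
from statistics import NormalDist

sample_mean = 0.54
standard_error = 0.05          # assumed to be the std of the sample mean

# Rule of thumb: 95% of values lie within 2 standard errors
margin_of_error = 2 * standard_error
interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)

# More precise multiplier for 95%: leave 2.5% in each tail
z_95 = NormalDist().inv_cdf(0.975)
precise_interval = (sample_mean - z_95 * standard_error,
                    sample_mean + z_95 * standard_error)

print(interval, round(z_95, 2), precise_interval)
```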

T Test

  • Used when we don't know the population's standard deviation, sigma
  • because we don't have sigma we can't use the Z score, so we estimate it from the sample and use T scores instead

Reading the T Test value

eg below, the T score is 0.883 and the degrees of freedom are 10 - 1 = 9

so the one-tailed P value from the T table is 0.20

This value is higher than alpha 0.05, so we fail to reject the Null Hypothesis:

there is NO evidence of a DIFFERENCE

Calculation of T Test

Find the sample mean, the sample standard deviation s, and the standard error = s / square root of n

t score = (mean of sample - mean under the Null Hypothesis) / standard error

dof = degrees of freedom = n - 1

Now find both in the T Score table:

see example below: in the row dof = 9, the T score 0.883 sits in the column for a one-tailed probability of 0.20
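A sketch of the T score calculation in Python. The ten sample values and the null mean of 5.0 are made up for illustration. The Python standard library has no T distribution, so the last step (the table lookup) is left as a comparison against the standard one-tailed table value for dof = 9 at 0.05, which is 1.833.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample of n = 10 measurements (not from the notes)
sample = [5.1, 4.9, 5.6, 5.0, 5.3, 4.8, 5.2, 5.4, 4.7, 5.5]
null_mean = 5.0                       # mean claimed by the Null Hypothesis

n = len(sample)
sample_mean = mean(sample)
s = stdev(sample)                     # sample std (divides by n - 1)
standard_error = s / sqrt(n)

t_score = (sample_mean - null_mean) / standard_error
dof = n - 1                           # degrees of freedom

critical_t = 1.833                    # T table: dof = 9, one-tailed alpha = 0.05
reject_null = t_score > critical_t

print(round(t_score, 3), dof, reject_null)
```

If SciPy is available, scipy.stats.ttest_1samp performs the same test and also returns the P value.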

Hypothesis Test Example using Z score

P values and corresponding Z Score values

Decision Trees and Gini Index

Categorical calculation: Yes / No

For Decision Trees the Gini Index (Gini impurity) is the loss metric, and we aim for the lowest Gini Index as we move down to the lower splits

Root node: the first, top split

Leaf node: a node where the Gini Index has reached its lowest possible value (close to 0); it is a pure node and no more splits will happen to it

Formula

Gini Index = 1 - sum of the squared probabilities of each class

Calculate the Gini Index for the Chest pain node

Step 1

Calculate the Gini Index for both of the splits

eg: 1 - (Probability of Yes)^2 - (Probability of No)^2

1 - (7 / (7 + 99))^2 - (99 / (7 + 99))^2

= 0.123

In the same way we calculate the Gini Index of the other split = 0.249

Step 2

Combine them as a weighted average, weighting each split by its number of samples

eg: 96/202 * 0.249 + 106/202 * 0.123

= 0.183 is the Gini Index for the Chest pain parent node
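The two steps can be sketched in Python. The Yes/No counts of the second split are not given in the notes, so its Gini Index (0.249) and sample count (96) are taken as stated rather than recomputed. The function name gini is illustrative.

```python
def gini(yes_count, no_count):
    """Gini Index of one split: 1 - P(yes)^2 - P(no)^2."""
    total = yes_count + no_count
    p_yes = yes_count / total
    p_no = no_count / total
    return 1 - p_yes ** 2 - p_no ** 2

# Step 1: Gini Index of each split
gini_split_a = gini(7, 99)     # 106 samples
gini_split_b = 0.249           # taken from the notes, 96 samples

# Step 2: weighted average by number of samples in each split
total_samples = 96 + 106
gini_chest_pain = (96 / total_samples) * gini_split_b \
    + (106 / total_samples) * gini_split_a

print(round(gini_split_a, 3), round(gini_chest_pain, 3))
```

A pure split such as gini(10, 0) gives 0, and an even split such as gini(5, 5) gives the maximum for two classes, 0.5.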
