Measures of Central Tendency
1 Mean = Average
Sum of observations / Number of observations
2 Median = Middle
Step 1: Sort the values
{1,4,2,6,8} sorted is {1,2,4,6,8}
4 is the median (the middle value)
If the number of values is even, what is the median?
{1,4,8,9}
Median = mean of the two middle values 4 and 8 = 12/2 = 6
(mean = sum / number of observations)
3 Mode = most frequently occurring observation
Count the frequency of each value; the value with the highest frequency is the mode
If 1 and 8 both occur twice and nothing occurs more often, the data is bimodal: both values are modes
If three or more values tie for most frequent, the data is multimodal; some textbooks say such data has NO MODE
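A minimal sketch of the three measures using Python's standard `statistics` module. The first two lists are the examples from these notes; `modes_data` is a made-up list (an assumption) in which 1 and 8 each occur twice.

```python
import statistics

odd_values = [1, 4, 2, 6, 8]
even_values = [1, 4, 8, 9]
# Hypothetical data where 1 and 8 both occur twice (bimodal).
modes_data = [1, 1, 3, 5, 8, 8]

mean_odd = statistics.mean(odd_values)        # sum / number of observations
median_odd = statistics.median(odd_values)    # middle of the sorted values
median_even = statistics.median(even_values)  # mean of the two middle values
modes = statistics.multimode(modes_data)      # every most-frequent value

print(mean_odd, median_odd, median_even, modes)
```

`statistics.multimode` returns all tied values, which makes bimodal data visible instead of silently picking one mode.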
Mean vs Median
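The mean is sensitive to outliers: a single extreme value pulls it toward the tail
The median is more robust
For skewed data the median is the better measure of the center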
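A small illustration of why the median holds up better. The income figures are invented for the example (an assumption, in thousands), with one extreme value appended.

```python
import statistics

incomes = [30, 32, 35, 38, 40]      # hypothetical incomes, in thousands
with_outlier = incomes + [1000]     # add one very large value

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

# The mean jumps by a large amount; the median barely moves.
print(mean_before, median_before)
print(mean_after, median_after)
```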
Gini coefficient
A measure of inequality, used in economics for income distributions
Skewed Data
Right skewed (positively skewed): large positive outliers stretch the right tail
Left skewed (negatively skewed): large negative outliers stretch the left tail
Bi Modal Distribution
For categorical data:
Use the mode; mean and median are not applicable
Measures of Spread — Measures of Dispersion
1- Range and IQR
2- Variance and Standard Deviation (STD)
Range = difference between the highest value and the lowest value
e.g. if 6 and 1 are the highest and lowest values, the range is 6 - 1 = 5
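Interquartile Range (IQR): how to calculate?
1- sort the data
2- find the median, which splits the data into a lower half and an upper half
3- find the median of each half: Q1 for the lower half, Q3 for the upper half
4- subtract the two numbers: IQR = Q3 - Q1
For an even number of values: each half contains exactly half of the data
For an odd number of values: the median itself is left out of both halves (one common convention)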
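The IQR steps above can be sketched as follows. This assumes the median-of-halves convention, where the middle value is excluded from both halves for an odd count. Library functions that interpolate percentiles can give slightly different quartiles. The function name is illustrative, and it expects at least two values.

```python
import statistics

def iqr_by_halves(values):
    """IQR as median(upper half) - median(lower half).

    For an odd number of values the middle value belongs to neither half.
    Expects at least two values.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]       # values below the median position
    upper = data[-half:]      # values above the median position
    return statistics.median(upper) - statistics.median(lower)

odd_data = [1, 4, 2, 6, 8]
even_data = [1, 4, 8, 9]

data_range = max(odd_data) - min(odd_data)
print(data_range, iqr_by_halves(odd_data), iqr_by_halves(even_data))
```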
Variance and Standard Deviation
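Variance (population): the mean of the squared deviations from the mean, sigma^2 = sum((x - mu)^2) / N
- we square because positive and negative deviations would otherwise cancel each other out
- outliers get squared too, so they carry more weight and the variance captures their presence
Standard Deviation
it is the square root of the variance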
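A short sketch of the population variance computed by hand, using the list from the median example. It also shows that the raw deviations sum to zero, which is why they are squared.

```python
import math
import statistics

values = [1, 2, 4, 6, 8]
mu = sum(values) / len(values)

deviations = [x - mu for x in values]
raw_sum = sum(deviations)                    # cancels out to (about) zero

squared = [d ** 2 for d in deviations]
variance = sum(squared) / len(values)        # population variance (divide by N)
std_dev = math.sqrt(variance)

print(raw_sum, variance, std_dev)
print(statistics.pvariance(values), statistics.pstdev(values))  # cross-check
```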
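Impact on spread: Scaling and Shifting
Scaling = multiply or divide every value by a constant
Shifting = add or subtract a constant
When you SHIFT data (add or subtract a constant)
Central tendency changes: mean, median and mode all move by that constant
but
Spread (range, IQR, variance, standard deviation) does not change
When you SCALE data (multiply or divide by a constant)
Everything changes: mean, median, mode, range, IQR and standard deviation are multiplied by the constant (by its absolute value, for the spread measures), and the variance by the constant squared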
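A quick check of the shifting and scaling rules on a small list. The shift of 10 and the scale factor of 3 are arbitrary choices for illustration.

```python
import statistics

def summary(values):
    """Return (mean, population std, population variance)."""
    return (
        statistics.mean(values),
        statistics.pstdev(values),
        statistics.pvariance(values),
    )

data = [1, 2, 4, 6, 8]
shifted = [x + 10 for x in data]   # shifting: add a constant
scaled = [x * 3 for x in data]     # scaling: multiply by a constant

print("original", summary(data))
print("shifted ", summary(shifted))   # mean moves, spread unchanged
print("scaled  ", summary(scaled))    # mean and std x3, variance x9
```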
Standardizing Statistical Moments
Sometimes the values in the data become so big that we must standardize them
These are the formulas to Standardize it
Distribution
How often the values occur = Distribution
Histogram = Frequency of occurrence ( bar charts)
Normal Distribution
Also called the Gaussian distribution; described by its probability density function
Empirical Rule:
within 1 sigma of the mean = 68% of values
within 2 sigma = 95%
within 3 sigma = 99.7%
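The empirical rule can be checked against the standard normal curve. This sketch uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of values within k standard deviations of the mean.
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, share in coverage.items():
    print(f"within {k} sigma: {share:.1%}")
```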
Z Score
- every Normal Distribution has a mean and a std
- mean = 175
- std = 5
- Z Score = (x - mean) / std
- a value x = 180 (example for heights in a population)
Z score for 180 = (180 - 175) / 5 = 1
- Go to the Z Score table: it gives the area under the curve to the LEFT of the x value
- if you know the Z Score value
e.g. 1.4,
What does it mean?
The table value is 0.9192, so 91.92% of the population lies below that x (here x = 175 + 1.4 * 5 = 182)
About 92% of people are shorter than x, and x is in the top 8% of the population
The 2nd way: if you know the percentile (0.98 in this example), you can find x
look up the Z score for that percentile in the table, then solve for x
z = (x - mu) / sigma
x = z * sigma + mu = 2.05 * 5 + 175 = 185.25
What does this mean?
if your height is about 185 then you are in the top 2%, and 98% of people are shorter than this x value
Positive Z values are on the right half of the curve
Negative Z values are on the left half of the curve
- negative values can be looked up on the negative-Z page of the Z Score table
- OR subtract the positive-side table value from 1 to get the negative-side value
e.g. for z = -1: 1 - 0.84 (the table value for z = +1) = 0.16
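Instead of a printed Z table, the same lookups can be done in code. A minimal sketch using `statistics.NormalDist` with the height example (mean 175, std 5); `cdf` plays the role of the table (area to the left) and `inv_cdf` goes from a percentile back to x.

```python
from statistics import NormalDist

heights = NormalDist(mu=175, sigma=5)
standard_normal = NormalDist()

# Forward: from a value x to its z score and the area to the left of it.
z_180 = (180 - heights.mean) / heights.stdev
share_below_180 = heights.cdf(180)

# Table lookup for a known z score.
share_below_z14 = standard_normal.cdf(1.4)

# Backward: from a percentile to the x value.
x_98th = heights.inv_cdf(0.98)

# Symmetry: the area left of -z equals 1 minus the area left of +z.
share_below_minus1 = standard_normal.cdf(-1.0)
by_symmetry = 1 - standard_normal.cdf(1.0)

print(z_180, share_below_180, share_below_z14, x_98th)
print(share_below_minus1, by_symmetry)
```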
Probability: the chance of an event occurring, a number from 0 to 1
P(event) is in [0, 1]
Bayes Theorem: P(A|B) = P(B|A) * P(A) / P(B)
Used in the Naive Bayes algorithm
Conditional probability use cases
Weather: given temperature and outlook, what is the probability that a person will play or not?
Sentiment analysis in NLP classification: based on the comments, the output is 1 or 0
Example of Bayes calculation:
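A small worked Bayes calculation in code, in the spirit of the weather use case. All the probabilities here are hypothetical numbers chosen for illustration, not values from these notes: P(play) = 9/14, P(sunny | play) = 2/9, P(sunny | not play) = 3/5.

```python
from fractions import Fraction

def bayes_posterior(prior, likelihood, likelihood_if_not):
    """P(A | B) from P(A), P(B | A) and P(B | not A)."""
    # Total probability of the evidence B.
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical inputs (assumptions for this sketch).
p_play = Fraction(9, 14)
p_sunny_given_play = Fraction(2, 9)
p_sunny_given_no_play = Fraction(3, 5)

p_play_given_sunny = bayes_posterior(
    p_play, p_sunny_given_play, p_sunny_given_no_play
)
print(p_play_given_sunny, float(p_play_given_sunny))
```

Using `Fraction` keeps the arithmetic exact, so the result can be compared with a hand calculation.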
Expected Value
To calculate the Expected Value of a given scenario
1- List the outcomes, e.g. a win and a loss
2- Value associated with the win and with the loss: X1 and X2
3- Probability of the win and of the loss: P1 and P2
4- E(x) expected value = X1 * P1 + X2 * P2
E(x) means that if you keep repeating the same scenario, business plan or model, you will on average win (or lose) E(x) per attempt
e.g. Question: should a freelancer continue working, and how much will he make given the 2 possibilities?
Scenario 1: the freelancer makes 300 dollars; 0.6 of his clients are good
Scenario 2: the freelancer loses 150 dollars; 0.4 of his clients ask for refunds and don't pay
Answer: the expected outcome is 300 * 0.6 + (-150) * 0.4 = +120 dollars, so on average he makes a profit across the 2 scenarios
If you have more possible outcomes: X1, X2, X3 ... Xn
E(x) = X1*P1 + X2*P2 + ... + Xn*Pn
Example = list all possible outcomes and compute the expected outcome if you continue in this situation
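The expected value formula as a small function, applied to the freelancer example. The function name and the probability check are illustrative choices, not part of the original notes.

```python
def expected_value(outcomes):
    """Sum of value * probability over (value, probability) pairs."""
    total_probability = sum(p for _, p in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(value * p for value, p in outcomes)

# Freelancer: +300 with probability 0.6, -150 with probability 0.4.
freelancer = [(300, 0.6), (-150, 0.4)]
ev = expected_value(freelancer)
print(ev)
```

The same function works for any number of outcomes, as long as the probabilities add up to 1.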
Law of Large Numbers
If we repeat the experiment many, many times, the average of the results will eventually approach the Expected Value
Wikipedia — In probability theory, the law of large numbers (LLN) is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value and will tend to become closer to the expected value as more trials are performed.
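A rough simulation of the law of large numbers, reusing the freelancer scenario (+300 with probability 0.6, -150 otherwise, expected value 120). The seed and trial counts are arbitrary assumptions; the point is only that the running average settles near 120 as the number of trials grows.

```python
import random

def average_payoff(n_trials, rng):
    """Average result of n_trials simulated freelancer jobs."""
    total = 0
    for _ in range(n_trials):
        total += 300 if rng.random() < 0.6 else -150
    return total / n_trials

rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (10, 1_000, 100_000):
    print(n, average_payoff(n, rng))
```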
Central Limit Theorem
From a big population, if we take many, many samples, the means of those samples will be approximately normally distributed, centered on the population mean, with standard deviation equal to the population sigma / square root of the sample size n
https://www.youtube.com/watch?v=14DPP4Z61Fk
See simulation below: start with begin and add 10000 samples from the top population, and see that the mean of the sample means approaches the population mean while their variance approaches the population variance divided by the sample size
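A minimal Central Limit Theorem simulation, as a sketch. The population here is an invented right-skewed one (exponential), and the sample size of 50 and the 2,000 repetitions are arbitrary assumptions. The mean of the sample means should land near the population mean, and their standard deviation near sigma / square root of n.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed for repeatability

# Hypothetical right-skewed population (clearly not normal).
population = [rng.expovariate(1.0) for _ in range(20_000)]
pop_mean = statistics.mean(population)
pop_std = statistics.pstdev(population)

sample_size = 50
sample_means = [
    statistics.mean(rng.choices(population, k=sample_size))
    for _ in range(2_000)
]

mean_of_means = statistics.mean(sample_means)
std_of_means = statistics.pstdev(sample_means)
expected_se = pop_std / sample_size ** 0.5   # sigma / sqrt(n)

print(pop_mean, mean_of_means)
print(expected_se, std_of_means)
```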
Examples
An investment firm has 100 clients who invest 4k on average, with a standard deviation of 6k. For every investment of at least 3k the firm makes 2k profit, but it loses 1.5k when the investment is less than 3k. Is this model profitable?
Step 1
mu = 4k, sigma = 6k; the sample mean mu_s is the same = 4k; the sample standard deviation sigma_s = 6 / square root of 100 = 6/10 = 0.6k
Step 2
What is the z score at 3k? z = (3 - 4) / 0.6 = -1.67, and the z table value is about 0.048, i.e. 4.8%
Step 3
4.8 * loss of 1.5k = 7.2k loss for every 100 clients
95.2 is the profit area, and 95.2 * 2k = 190.4k profit for every 100 clients
Net: 190.4k - 7.2k = about 183k, so the model is profitable
Example 2
15 workers each make 3k profit a month on average, with std = 15k. Is this business profitable?
What are the chances of profit and loss?
What is the maximum loss this business can have?
Solution
Step 1
Calculate mu and the sample sigma, then the z score at x = 0: the area to the left is the chance of a loss and the area to the right is the chance of a profit. Here sigma_s = 15 / square root of 15 = 3.87k and z = (0 - 3) / 3.87 = -0.77, so the chance of a loss is about 22% and the chance of a profit about 78%
Step 2
What is the biggest loss the business can have? Take it as the worst 1% of outcomes (x)
so we need the z score for the 0.99 percentile, which is 2.3 in the z table, and flip its sign to -2.3
because we want the bottom 1%, not the bottom 99%
Calculate x at z = -2.3: x = -2.3 * 3.87 + 3 = -5.9k per worker
-5.9k is the maximum loss for 1 worker; * 15 workers = -88.5k loss per month
Example 3
25 traders; the distribution has a mean of 5k profit. The standard deviation is pretty large: 20k.
Question
If X is the random variable for our sample mean, what is the probability that on average our traders make less than zero?
Answer: sigma_s = 20 / square root of 25 = 4k and z = (0 - 5) / 4 = -1.25, so there is about a 10% chance of a loss
What is the maximum loss? Take the worst 1%: the z score at 99% is 2.33, so for 1% it is -2.33
We calculate x with this value for 1 trader = -2.33 * 4 + 5 = -4.32k, then * 25 traders
maximum loss = -108k
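The traders example can be reproduced without a Z table. This sketch uses `statistics.NormalDist` for the sampling distribution of the mean, and it assumes that "maximum loss" means the 1st percentile of that distribution (z of about -2.33). Because it uses the exact quantile rather than a rounded table value, the result differs slightly from the hand calculation.

```python
import math
from statistics import NormalDist

n_traders = 25
mean_profit = 5.0      # thousands, per trader
std_profit = 20.0      # thousands, per trader

# Sampling distribution of the average profit per trader.
standard_error = std_profit / math.sqrt(n_traders)
sample_mean_dist = NormalDist(mu=mean_profit, sigma=standard_error)

p_loss = sample_mean_dist.cdf(0)                    # P(average profit < 0)
worst_per_trader = sample_mean_dist.inv_cdf(0.01)   # assumed 1% worst case
worst_total = worst_per_trader * n_traders

print(standard_error, p_loss, worst_per_trader, worst_total)
```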
Binomial Distribution
- Contains Bernoulli trials (coin-flip like)
- Each trial has just 2 outcomes
- Independent outcomes
- P(success) = p, P(failure) = 1 - p, where p is the probability of success
Use case: if 20% of customers are willing to buy, what is the probability that 25 of 100 customers will buy?
X = number of successes
n = number of trials
Expected value: E(X) = n * p
Sigma: square root of n * p * (1 - p)
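A sketch of the binomial use case above (20% of customers willing to buy, 100 customers, exactly 25 buyers). It uses the standard binomial formulas: P(X = k) = C(n, k) * p^k * (1 - p)^(n - k), expected value n * p, and sigma = square root of n * p * (1 - p). The function name is illustrative.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 100, 0.2
p_exactly_25 = binomial_pmf(25, n, p)
expected = n * p
sigma = math.sqrt(n * p * (1 - p))

print(p_exactly_25, expected, sigma)
```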
Poisson Distribution
- Discrete distribution: it takes whole-number values (counts)
Calculations
Examples
For a specific value
For all values up to a specific number: just add up the probability of each value
For different values of lambda (lambda = the expected value)
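A sketch of the Poisson calculations using the standard formula P(X = k) = e^(-lambda) * lambda^k / k!. The value lambda = 3 is a hypothetical choice for illustration. The cumulative function adds up the probabilities from 0 up to and including k.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def poisson_cdf(k, lam):
    """Probability of k or fewer events: add up the individual values."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

lam = 3  # hypothetical expected number of events
print(poisson_pmf(2, lam))   # exactly 2 events
print(poisson_cdf(2, lam))   # 0, 1 or 2 events
```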
Hypothesis testing
Is the mean of the population the same as the mean of the experiment? Yes or No
The null hypothesis says they are the same; the proposed (alternative) hypothesis says they are different
Level of Significance:
Alpha = 0.05
The threshold decided in advance: if the P value is below it, the test shows a significant difference
P Value = Probability value
The probability, assuming the null hypothesis is true, of getting a result at least as extreme as the one observed
e.g. below, the values 0.5, 0.25, 0.125 ... 0.03 are all P values
Type 1 and Type 2 Error
Severity
A Type 1 error is usually treated as more severe, because we reject the NULL hypothesis and accept a change that is not real. A Type 2 error is usually less severe: we continue with the null hypothesis and the change or difference goes undiscovered
Alpha 0.05, Type 1 Error: the null hypothesis is correct and there is no difference, but the experiment result shows that there is a difference
In reality there is no difference, but the specific sample values indicate there is one, and we reject the null hypothesis
Beta 0.1, Type 2 Error: in reality there is a difference and the null hypothesis is wrong, but the sample values of this experiment say there is no difference, and we fail to reject the null hypothesis
How do we reduce Type 1 and Type 2 Errors?
we can increase n, the sample size, so the sample represents the population better; this makes the errors less likely but cannot eliminate them entirely
Calculate Margin of Error and Confidence Interval
1- find the sample mean = 0.54
2- calculate the standard deviation of the sample mean (the standard error) = 0.05
3- by the Central Limit Theorem, about 95% of sample means fall within 2 sigma above or below the mean
4- so the margin of error is 2 * 0.05 = 0.10, and the confidence interval runs from 0.54 - 0.10 = 0.44 to 0.54 + 0.10 = 0.64
We can be 95% confident that the population mean lies within this range
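The margin of error and confidence interval from the steps above, as a short sketch. It uses the rounded multiplier of 2 from these notes; the more precise 95% multiplier is about 1.96, which the code also computes for comparison.

```python
from statistics import NormalDist

sample_mean = 0.54
standard_error = 0.05

z_rounded = 2                            # "2 sigma" rule of thumb
z_exact = NormalDist().inv_cdf(0.975)    # two-sided 95% multiplier

margin_of_error = z_rounded * standard_error
confidence_interval = (
    sample_mean - margin_of_error,
    sample_mean + margin_of_error,
)

print(margin_of_error, confidence_interval, z_exact)
```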
T Test
- Used when we don't have the population's standard deviation, sigma
- because we don't have sigma we can't use Z scores, so we use T scores here
Calculate T Test value
e.g. below, the degrees of freedom are 10 - 1 = 9
in the dof = 9 row of the T table, a t score of 0.883 corresponds to a one-tailed P value of 0.20
This P value is higher than alpha 0.05, so we fail to reject the Null hypothesis
that there is NO DIFFERENCE
Calculation of T Test
find the sample mean, the sample standard deviation s, and the standard error s / square root of n
t score = (sample mean - mean under the Null hypothesis) / standard error
dof = degrees of freedom = n - 1
Now look up both in the T Score table:
see example below: P value = 0.20, dof = 9, t score = 0.883
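A sketch of the t score calculation for a one-sample test. The ten data points and the null-hypothesis mean of 10 are invented for illustration. The critical value 1.833 is the standard T table entry for a one-tailed alpha of 0.05 with 9 degrees of freedom.

```python
import math
import statistics

def t_score(sample, null_mean):
    """Return (t score, degrees of freedom) for a one-sample t test."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    s = statistics.stdev(sample)            # sample std, divides by n - 1
    standard_error = s / math.sqrt(n)
    return (sample_mean - null_mean) / standard_error, n - 1

# Hypothetical sample of 10 measurements.
sample = [12, 9, 11, 10, 13, 8, 10, 12, 11, 9]
t, dof = t_score(sample, null_mean=10)

critical_t = 1.833   # T table: one-tailed alpha = 0.05, dof = 9
reject_null = t > critical_t

print(t, dof, reject_null)
```

Comparing t with the critical value is equivalent to comparing the P value with alpha: a t score below the critical value means the P value is above 0.05.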
Hypothesis Test Example using Z score
P values and corresponding Z Score values
Decision Trees and Gini Index
Categorical Calculation: Yes / No
For Decision Trees the Gini Index is the loss metric, and we aim for the lowest Gini Index as we move down to the lower splits
Root node: the first (top) split
Leaf node: where the Gini Index has reached the lowest possible value (close to 0); it is a pure node and will not be split any further
Formula
Gini = 1 - sum of the squared class probabilities
Calculate Gini Index for Chest pain node
Step 1
calculate the Gini Index for both splits
e.g. 1 - (Probability of Yes)^2 - (Probability of No)^2
1 - (7 / (7 + 99))^2 - (99 / (7 + 99))^2
= 0.123
The same way we calculate the Gini Index of the other split = 0.249
Step 2
Combine them as a weighted average, weighting each split by its number of samples
e.g. 96/202 * 0.249 + 106/202 * 0.123
= 0.183 is the Gini Index for the Chest pain parent node
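The Gini calculation for the Chest pain node, sketched in code. The 7 / 99 counts and the 0.249 value for the other split (96 samples) come from the example above. The Yes/No counts of that other split are not given in these notes, so its Gini value is used directly. The function name is illustrative.

```python
def gini(yes, no):
    """Gini index of a two-class node: 1 - p_yes^2 - p_no^2."""
    total = yes + no
    p_yes = yes / total
    p_no = no / total
    return 1 - p_yes ** 2 - p_no ** 2

# Split with 7 Yes and 99 No (106 samples).
gini_first = gini(7, 99)

# Other split: 96 samples, Gini taken from the notes (counts not given).
gini_other = 0.249

# Weighted average by the number of samples in each split.
total_samples = 106 + 96
gini_parent = (106 / total_samples) * gini_first + (96 / total_samples) * gini_other

print(round(gini_first, 3), round(gini_parent, 3))
```

A pure node, for example `gini(10, 0)`, gives 0, which is why leaf nodes end up with a Gini Index close to 0.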