20260210 - Very Normal YouTube video: The most important theory in statistics | Maximum Likelihood

0:00 : This simple expression forms the
0:01 : backbone of the most common statistical
0:03 : models that we use today, and yet most
0:06 : people using these models have no idea
0:08 : this theory even exists. This is the
0:10 : theory of maximum likelihood, and it's a
0:12 : fundamental concept that all statistics
0:14 : students need to know. In this video, I'm
0:16 : going to explain what this theory is
0:18 : about and exactly why it's so crucial to
0:20 : modern statistics. If you're new here, my
0:22 : name is Christian, and this is Very
0:24 : Normal, a channel for making you better
0:26 : at statistics. The best way to do that is
0:28 : to understand the unique and interesting
0:30 : problems that statisticians face. This is
0:32 : one of those problems. To understand why
0:35 : this theory is so important, we need to
0:37 : take a step back and understand one of
0:38 : the essential problems in statistics:
0:41 : estimation, or as I like to say, educated
0:44 : guessing. In statistics, we use models to
0:46 : approximate real-world relationships. We
0:49 : use approximations because the real
0:50 : world is often so complex it can be
0:53 : difficult to even know where to start;
0:55 : approximations allow us to take baby
0:57 : steps. A linear regression can be used
0:59 : to model change between groups, a
1:01 : logistic regression can be used to model
1:03 : how changes in covariates influence the
1:05 : odds of an event, and a proportional hazards
1:07 : model tells us how covariates influence
1:09 : the time to an event. What all these
1:11 : models have in common is that they have
1:13 : parameters, which not only help model the
1:15 : randomness in the data we see but also
1:17 : represent useful ideas that we'd like
1:19 : to research. All these parameters here
1:22 : tell us how the covariates influence the
1:24 : outcome. To be clear, lots of different
1:26 : models have parameters, but these
1:28 : regressions are the ones people are the
1:30 : most familiar with. Once we choose a
1:32 : model, we need to collect data. The data
1:34 : is generated from the parameters of the
1:36 : model, but we don't see the parameters; we
1:39 : have to infer the values of the
1:41 : parameters based on the data we
1:43 : collected. This is where maximum
1:45 : likelihood comes in. To really understand
1:47 : maximum likelihood, you should know what
1:49 : a likelihood is first. In this context,
1:52 : the likelihood is defined as a product
1:53 : of all the probability distributions of
1:55 : the data evaluated at each of our
1:57 : observed data points. When we select the
1:59 : model, we often choose a convenient
2:01 : probability distribution for our data. By
2:04 : convenient, I'm referring to the fact
2:06 : that we often use parametric probability
2:08 : distributions to model the data: by
2:11 : specifying just a few numbers called
2:13 : parameters, we can get an entire function.
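The expression shown on screen isn't captured in the transcript; in standard notation, for independent observations x_1, ..., x_n modeled with a parametric density or mass function f, the likelihood being described is

    L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)

read as a function of \theta with the data held fixed.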
2:16 : In everyday speech, likelihood is
2:18 : sometimes used as a synonym for
2:20 : probability. Likelihood and probability
2:22 : are definitely related, but only in how
2:24 : we define what likelihood is. If you just
2:26 : looked at this part of the expression
2:28 : alone, you'd recognize that this is the
2:30 : joint probability distribution of our
2:32 : data, assuming it was all independent. But
2:34 : for a joint probability distribution, the
2:36 : parameter value is assumed to be known
2:38 : or fixed, but here in the likelihood, the
2:41 : parameter value is unknown. That is, the
2:44 : likelihood should be viewed as a
2:45 : function of the parameter. This product
2:48 : allows us to take all of the data we
2:50 : observe and summarize it into a single
2:52 : number that could potentially be useful
2:54 : for making a more educated guess about
2:56 : the value of theta. So why is it a good
2:58 : tool for guessing parameter values from
3:00 : data? To understand this, we'll start with
3:02 : a small example. Let's say that my cousin
3:04 : gave me a loaded dice that's more likely
3:06 : to roll a six. I needed it for reasons,
3:09 : but he didn't tell me the exact
3:10 : probability that I'll roll a six, so I
3:12 : need to figure this out on my own by
3:14 : rolling the dice and collecting data.
3:16 : We'll model the probability of getting a
3:18 : six with a Bernoulli random variable with
3:20 : some parameter pi. I know that some of
3:22 : you guys get upset when I use pi as a
3:24 : parameter, so this was a perfect
3:25 : opportunity to use it again. Our goal is
3:27 : to estimate pi. We know that Bernoulli
3:30 : random variables have the following
3:32 : probability distribution.
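The distribution shown on screen isn't captured in the transcript; the standard Bernoulli pmf, coding a six as x = 1, is

    f(x \mid \pi) = \pi^{x} (1 - \pi)^{1 - x}, \qquad x \in \{0, 1\}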
3:33 : Let's say that I roll the dice once and I actually get
3:35 : a six. If I fill in the value here and
3:38 : keep pi unknown, then we'll have a simple
3:40 : likelihood. If we plot it against the
3:42 : possible values of pi, 0 through 1, then
3:45 : we'll get a simple slope that reaches
3:46 : its maximum at one.
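In symbols: plugging the single observed six (x = 1) into the pmf above gives L(\pi) = \pi, a straight line on [0, 1] that is maximized at \pi = 1.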
3:48 : Now think about making a guess about the value of pi
3:50 : purely from that one observation we made:
3:53 : what is the value of pi that is most
3:55 : likely to have produced a single roll of
3:58 : a six? Well, a pi of 100% would definitely
4:01 : produce a dice roll of six. It's still
4:04 : possible to roll a six if pi were 90 or
4:07 : 80%, but it's just not as likely as if pi
4:09 : were 100%. Let's go through the same
4:11 : thought process again, but after rolling
4:13 : the dice a few more times. After rolling
4:15 : that first six, I get two more six rolls
4:18 : and two non-sixes. That's five dice rolls,
4:21 : so the corresponding likelihood looks
4:23 : like this. You can see the pattern here:
4:25 : each six roll contributes a pi term,
4:27 : while each non-six contributes a 1 minus
4:29 : pi term. If I were to plot this
4:31 : likelihood, we'll get a slightly
4:33 : distorted bell. Since we've both observed
4:35 : a six and a non-six roll, it's impossible
4:38 : that pi is either 0 or 1, so the
4:40 : likelihood at these points is zero.
4:42 : Following the same logic we did earlier,
4:44 : the value of pi that is most likely to
4:47 : produce three sixes out of five is a pi of
4:50 : 60%. Other values of pi can produce the
4:53 : same data set, but it's just not as
4:55 : likely. In a sense, a pi of 60% best
4:58 : matches the data we observed.
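A quick numerical check of this example (a sketch, not the video's code; the grid resolution is arbitrary):

    import numpy as np

    # Likelihood for 3 sixes and 2 non-sixes out of 5 rolls
    pi = np.linspace(0, 1, 1001)   # candidate values of pi
    lik = pi**3 * (1 - pi)**2      # each six contributes pi, each non-six (1 - pi)
    print(pi[np.argmax(lik)])      # prints 0.6, the maximizer

Calculus agrees: setting the derivative of log L, 3/\pi - 2/(1 - \pi), to zero gives \pi = 3/5.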
5:01 : The shape of the likelihood seems to provide a
5:03 : practical way to make an educated guess on
5:05 : the value of pi, or as we'll see, any
5:07 : other parameter in a statistical model.
5:09 : As you might expect, a good guess would
5:11 : be to pick the parameter value that
5:13 : maximizes the likelihood. This simple idea
5:16 : is the core behind the method of maximum
5:18 : likelihood estimation, the most important
5:20 : theory in statistics. This expression
5:23 : from the beginning is really just this
5:25 : idea, but in complicated math notation.
5:27 : Reading this from left to right, we can
5:29 : create a good guess, or estimate, of the
5:31 : parameter, theta hat, by defining it as the
5:34 : value that maximizes the likelihood. Theta
5:37 : is an input or argument to the
5:39 : likelihood, so that's why we have an
5:41 : argmax here. This symbol here is actually
5:43 : a capital Theta, and it's supposed to
5:45 : represent all the possible values that
5:47 : the parameter can take. In the dice
5:49 : rolling example, theta is pi, and pi can
5:52 : be anything between 0 and 1, so that set
5:55 : is what big Theta is.
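Reconstructing the opening expression in standard notation:

    \hat{\theta} = \operatorname*{arg\,max}_{\theta \in \Theta} L(\theta)

In the dice example, \Theta = [0, 1].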
5:57 : The theory of maximum likelihood was popularized in a
5:59 : series of papers written by the
6:00 : eminent statistician R.A. Fisher in the
6:02 : 1910s and 1920s. In developing maximum
6:05 : likelihood, Fisher also introduced
6:07 : several important concepts that are
6:09 : taught to statistics students even
6:11 : today: consistency, efficiency, sufficiency,
6:14 : information, just to name a few. The star
6:16 : of today's video is maximum likelihood,
6:19 : so we'll stay focused on that. Maximum
6:20 : likelihood is a simple idea, but what
6:23 : makes maximum likelihood so powerful is
6:25 : not just its simplicity but the
6:26 : qualities it gives the estimator it
6:28 : produces, which I'll be calling the MLE.
6:31 : Remember that the MLE is still just a
6:33 : guess in the end. Is it a good guess?
6:36 : Furthermore, what even constitutes a good
6:39 : guess? Thankfully, Fisher and many others
6:41 : before us have given us the answer
6:43 : already: the MLE is a great guess, and
6:46 : sometimes it's the best possible guess
6:47 : that we can make. It's not always the
6:50 : best, as we'll see, but it works in many
6:52 : models and situations that are common to
6:54 : everyday research problems. Let's dive
6:56 : deeper into what makes a good guess. When
6:58 : we make models and estimate parameters
7:00 : from data, we often assume that there's
7:02 : some fixed but unknown parameter value
7:04 : that's generating the data. The whole
7:06 : point of estimation is to guess what
7:08 : this unknown value is. One thing we'd
7:10 : like from an estimator is that the value
7:12 : of our guess should get closer to this
7:14 : unknown value as we collect more data;
7:16 : this reflects a natural intuition that
7:18 : more data should produce a better guess,
7:20 : and in fact this is what happens with
7:22 : the MLE. The MLE is what we call a
7:24 : consistent estimator, and this is easily
7:26 : demonstrated in simulation: as the sample
7:29 : size grows on the x-axis, notice that the
7:32 : value of the MLE approaches the true
7:34 : value, represented by the red line.
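The simulation itself isn't reproduced in the transcript; a minimal sketch of the same idea, reusing the loaded-dice example with a hypothetical true pi of 0.6:

    import numpy as np

    rng = np.random.default_rng(0)
    true_pi = 0.6                         # hypothetical true probability of a six
    for n in [10, 100, 1_000, 10_000, 100_000]:
        rolls = rng.random(n) < true_pi   # simulate n rolls; True means "rolled a six"
        print(n, rolls.mean())            # the MLE of pi is the sample proportion;
                                          # it settles toward 0.6 as n grows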
7:36 : We say that an estimator is consistent if it
7:38 : gets closer to the true value as our
7:41 : sample size grows to infinity. Even
7:43 : though the true parameter value is
7:44 : virtually unknowable, a consistent
7:46 : estimator should get very close to it
7:49 : with enough sample size. Keep in mind
7:50 : that consistency is what we call an
7:53 : asymptotic property: the MLE only gets close to
7:55 : this true value if the sample size gets
7:58 : large; we can't say much in small-sample
8:00 : settings. Before I tell you what the
8:01 : second property is, I want you to take a
8:03 : look at the summaries from the three
8:05 : most commonly used statistical models: a
8:07 : logistic regression, a proportional
8:09 : hazards model, and a mixed effects model.
8:11 : What's something that all these models
8:13 : have in common? The common factor, or at
8:16 : least the one I want to focus on, is in
8:18 : these sections here: the estimated
8:19 : parameters either have a z-score, as in the
8:22 : GLM or proportional hazards model, or a t-score,
8:25 : as in the mixed effects model. Now where do
8:26 : you think these estimates get this
8:28 : distribution from? Maximum likelihood. One
8:31 : of the most important properties of the
8:33 : maximum likelihood estimator is that it
8:35 : has an asymptotic normal distribution.
8:37 : This distribution is significant because
8:39 : we know a lot about the normal
8:41 : distribution, and we don't have to derive
8:42 : any situation-specific distribution for
8:45 : different models. As long as you're
8:47 : maximizing the likelihood, you'll get
8:48 : this normal distribution and can
8:50 : therefore easily derive p-values for
8:52 : parameter estimates. For this t-score here,
8:55 : statistics students will know that we can
8:56 : easily go from a normal to a t
8:58 : distribution if we need to estimate the
9:01 : variance, so this, too, is a product of
9:03 : maximum likelihood. The curious among you
9:06 : might wonder what the mean and variance
9:07 : of this limiting normal distribution are.
9:09 : Consistency tells us that the mean of
9:11 : this distribution will be the true
9:13 : parameter value, while the variance is
9:15 : given by the strange expression here.
9:17 : This expression is known as the Fisher
9:20 : information, so the variance of the MLE
9:22 : is the inverse of the Fisher information.
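In symbols, the standard statement of these two properties, for n i.i.d. observations with true parameter value \theta_0 and Fisher information I(\theta):

    \sqrt{n}\,(\hat{\theta}_n - \theta_0) \xrightarrow{d} N\big(0,\ I(\theta_0)^{-1}\big),
    \qquad I(\theta) = -\mathbb{E}\left[ \frac{\partial^2}{\partial \theta^2} \log f(X \mid \theta) \right]

The z-scores in those model summaries are Wald statistics, \hat{\theta} / \widehat{\mathrm{SE}}(\hat{\theta}), which this limiting normal distribution justifies.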
9:25 : The Fisher information is related to the
9:27 : log of the likelihood and deserves a
9:29 : video in its own right, but ain't nobody
9:31 : got time for that. But if you're curious,
9:33 : I highly recommend you watch this video
9:35 : by Mutual Information. What's more
9:36 : important for this video is the special
9:38 : quality that this particular variance
9:40 : grants the MLE, at least in some
9:42 : situations. We care about variance
9:44 : because of its relationship with the
9:46 : confidence interval. It's in our best
9:47 : interest to make the variance as small
9:49 : as possible so that our confidence
9:51 : intervals are as small as possible. When
9:53 : this happens, we have the best chance of
9:55 : rejecting the null hypothesis, assuming
9:57 : it's incorrect. It's important to know
9:59 : that different ways of making guesses or
10:01 : estimates will cause them to have
10:03 : different variances. We'd prefer to pick
10:05 : an estimation method that has the
10:06 : minimum variance, but there's actually a
10:09 : fundamental limit on how low an
10:11 : estimator's variance can be. This limit
10:13 : is called the Cramér-Rao lower bound, and it
10:16 : states that the variance of any
10:17 : unbiased estimator of theta is bounded
10:20 : below by the inverse of the Fisher
10:21 : information.
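The standard statement of the bound, for an unbiased estimator \hat{\theta} built from n i.i.d. observations:

    \mathrm{Var}(\hat{\theta}) \ \geq\ \frac{1}{n\, I(\theta)}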
10:24 : This is to say, under the right conditions, the maximum likelihood
10:26 : estimator has the best, AKA smallest,
10:29 : possible variance among unbiased
10:31 : estimators; it's what we call
10:33 : asymptotically efficient. These three
10:35 : properties are what make the MLE so
10:37 : useful as an estimation tool. There are
10:40 : definitely other desirable aspects to
10:42 : the MLE, but I feel that these three are
10:44 : the ones that real-world analysts get
10:46 : the most benefit from in actual problems.
10:48 : The fact that we get all these benefits
10:50 : is pretty incredible when you consider
10:51 : the fact that the only thing the
10:53 : procedure is doing is picking a maximum
10:55 : value. But that being said, how do we
10:57 : actually go about figuring out what this
10:59 : value is? Our human eyes can look at
11:01 : a likelihood like this and instantly
11:03 : know where the maximum is, but this isn't
11:05 : really a scalable solution. If you've
11:07 : taken statistics before, then you know
11:09 : that finding the MLE amounts to an
11:11 : applied calculus problem, but this only
11:13 : works for the most basic of problems.
11:15 : Consider logistic regression: we know
11:17 : that the likelihood of the outcome looks
11:18 : like this, since it's still a Bernoulli
11:20 : random variable.
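The on-screen likelihood isn't captured in the transcript; for binary outcomes y_i in {0, 1} with covariate vectors x_i, the logistic regression likelihood is standardly written

    L(\beta) = \prod_{i=1}^{n} p_i^{\,y_i} (1 - p_i)^{1 - y_i},
    \qquad p_i = \frac{1}{1 + e^{-x_i^\top \beta}}

Setting the derivative of its log to zero gives equations with no closed-form solution in \beta, which is the point being made next.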
11:22 : Good luck trying to solve for zero for this expression;
11:24 : instead, we rely on computers and
11:26 : iterative algorithms to find this
11:28 : maximum. There's not enough time to cover
11:29 : these here, but the most important part
11:31 : to understand is that their job is to
11:33 : get the precise value of the MLE even
11:36 : though we don't have a nice analytic
11:37 : form to look at. It's the method of
11:39 : maximum likelihood itself that gives us
11:41 : those nice properties I mentioned
11:43 : earlier. This code repeatedly generates
11:45 : data for a logistic regression and picks
11:47 : out the MLE. The end result is these
11:49 : histograms. You can see that the
11:51 : distributions have an approximately
11:52 : normal shape, as predicted by maximum
11:54 : likelihood. No fancy equation solving, all
11:57 : of this just from finding a maximum.
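The video's code isn't reproduced in the transcript; a minimal Python sketch of the same experiment (the sample size, replication count, and coefficients here are hypothetical):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    beta_true = np.array([0.5, -1.0])      # hypothetical true intercept and slope
    estimates = []
    for _ in range(500):                   # repeatedly generate data and refit
        X = np.column_stack([np.ones(200), rng.normal(size=200)])
        p = 1 / (1 + np.exp(-X @ beta_true))
        y = (rng.random(200) < p).astype(float)

        def nll(b):                        # negative log-likelihood of the logistic model
            z = X @ b
            return np.sum(np.log1p(np.exp(z)) - y * z)

        estimates.append(minimize(nll, np.zeros(2)).x)  # iterative maximization

    estimates = np.array(estimates)
    print(estimates.mean(axis=0))          # centered near beta_true; a histogram of each
                                           # column shows the approximately normal shape

The maximization here is generic numerical optimization; statistical software typically uses specialized iterations like Newton-Raphson or Fisher scoring for the same job.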
12:00 : For all the glazing I've been doing for
12:01 : maximum likelihood, this is the part of the
12:03 : video where I tell you that it has its
12:05 : weaknesses. Something I've been glossing
12:07 : over is the fact that several
12:08 : assumptions must be made for maximum
12:10 : likelihood to have the properties I
12:11 : talked about; you may have heard these
12:13 : called regularity conditions. The
12:15 : likelihood I showed earlier is a basic
12:17 : one-dimensional curve, and in most
12:19 : problems it's not guaranteed that the
12:21 : likelihood is going to have this
12:22 : convenient, unimodal shape. Another
12:24 : sobering reminder I need to point out
12:26 : about the MLE is that consistency,
12:28 : normality, and efficiency are all
12:30 : asymptotic properties, properties that
12:33 : arise when the sample size approaches
12:34 : infinity. It's easy for me to say this
12:36 : matter-of-factly, but we both know that you
12:39 : and I are mortal people, and we don't
12:41 : have the time to collect infinite data.
12:43 : We hope that our sample sizes are large
12:45 : enough for, quote, asymptotics to kick in,
12:47 : but this in itself is another
12:49 : assumption that many statistics users
12:51 : may not appreciate. Finally, these three
12:53 : properties are actually not unique to
12:55 : the MLE. Fisher brought maximum
12:57 : likelihood into popular use in the 1920s,
13:00 : but statistics has grown a lot since then.
13:02 : Some statisticians have developed
13:04 : so-called super-efficient estimators
13:06 : that are still asymptotically normal but
13:08 : have even smaller variance than the MLE.
13:10 : Others showed that there are situations
13:12 : where maximum likelihood can produce
13:13 : inconsistent estimates, especially in
13:16 : higher dimensions. This led to work on
13:18 : shrinkage estimators such as the James-Stein
13:20 : estimator, with the aim of getting
13:22 : better estimates in these situations.
13:24 : Even though the MLE isn't special, the
13:26 : existence of these weaknesses doesn't
13:28 : take away from its usefulness. The most
13:30 : commonly used probability distributions
13:32 : in practical applications, normal,
13:34 : binomial, Poisson, all fall in the category
13:37 : where maximum likelihood still has its
13:39 : desirable properties. It drives a lot of
13:41 : analysis today, and it'll continue to do
13:43 : so for the foreseeable future. If you
13:45 : found this video useful, I hope that I've
13:47 : earned a like and a subscription from
13:48 : you. I try to upload statistics videos
13:50 : every week. If you'd like to hear about
13:52 : videos as soon as they come out, you can
13:53 : also subscribe to the channel newsletter;
13:55 : you'll get videos sent straight to your
13:56 : inbox and you can learn a little bit
13:58 : more about what's going on with me
13:59 : behind the scenes. That's it for this one;
14:02 : see you in the next one.
14:04 : [Music]