We have all experienced these events, and it’s part of the dynamics and corresponding uncertainty of life. Our epistemic limitations and shortsighted perceptions ensure that, regardless of our best intentions or most complete effort, there are outcomes we cannot control; and some of those things might, regardless of our ability, always be outside of our control.

So, to evaluate and make sense of a world bound by vast uncertainty, we use probability to quantify and organize events as subsets of the world we experience. We can think of probability as built on nothing more than sets of elements (we will only be considering countable sets -both finite and infinite); we can say, then, that a *probability space* represents our uncertainty regarding an experiment. Now, you may be wondering what a “space” is. Think of a space as a universe -which is, in our case, merely a set that can be divided up into subsets and whose points have some wiggle room such that they don’t leave the original set. Our probability space is composed of two distinct parts:

1) Sample space -this is the set that contains all the possible *outcomes* of our experiment (we denote this set as S -the green circle).

2) The other part is called the *probability measure* -which is a function that maps subsets of our sample space into the set of real numbers.

We shall call our function or measure P. We denote a set of outcomes in S as E, an *event* (the red circle). So, P(E) informs us of the chance that an experiment, if actualized, will produce an outcome that is a member of E; this “chance” is represented by a number in the reals.
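To make the two parts concrete, here is a minimal sketch of a finite probability space. The die experiment, the uniform weights, and the names `S`, `weights`, and `P` are assumptions picked purely for illustration.

```python
from fractions import Fraction

# Hypothetical experiment: one roll of a fair six-sided die.
S = {1, 2, 3, 4, 5, 6}                     # sample space: every possible outcome
weights = {s: Fraction(1, 6) for s in S}   # assumed probability of each outcome


def P(E):
    """Probability measure: maps an event (a subset of S) to a real number."""
    if not E <= S:
        raise ValueError("an event must be a subset of the sample space")
    return sum((weights[s] for s in E), Fraction(0))


E = {2, 4, 6}          # the event "the roll is even"
print(P(E))            # 1/2
print(P(S), P(set()))  # 1 0
```

Using `Fraction` keeps the arithmetic exact, so additivity over disjoint events can be checked without rounding noise.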

What is most fascinating about the nature of probability is its simplicity. What I mean by “simplicity” is that we have a fairly straightforward and directly computational process for determining most basic probabilities. Nonetheless, probability and its relation to sets does not come without some token of weirdness. For example, consider the set of all real numbers. What is the probability that, if I placed all the reals into some bag, I select a rational? The probability is zero. That is, you do not have a chance of selecting a rational from the set of the reals.

But there are an infinite number of rationals! Moreover, they are dense in the reals -meaning that, for whatever open interval $latex I \subseteq \mathbb{R}$ we pick, Q is dense if $latex I \cap \mathbb{Q} \neq \emptyset$ for each such $latex I$. We can also say that $latex \overline{\mathbb{Q}} = \mathbb{R}$ (that is, the closure of Q is a superset of the set it is dense in, but in this case, our set is R so the closure is equal to R). Basically, everywhere you “look” in R, you will see at least one rational. So, why is it that the probability is zero?

First, let’s talk about size. As we have shown, there is some size to Q -Q is not small, since we can find it everywhere in R, and every irrational is an accumulation point for Q. Nonetheless, the size of R is massive. In fact, the set [0,1] has more irrational numbers than there are rationals in all of Q. Consider, then, all the reals, that is, the whole number line. The sheer volume of irrationals makes the set Q look tiny, if not nonexistent. This is the most basic and intuitive way of thinking about the probability of selecting an element of Q as being zero. The other way deals with measure theory, and, sadly, we will not cover that here.

Getting back to the point: probability is useful, but it has some twists. It is a beautiful convention that we have formalized to help us make precise and measure events that take on various states in a dynamic or chaotic nature. Nonetheless, it is merely a convention. In fact, consider the axioms of probability where the value of a probability is assigned a value in the interval [0,1]. Why [0,1]? Is there some overarching divine decree that willed this fact? Not really; it allows for simplicity and smoothness in calculation and is consistent with other perceptions.

The virus is deadly and seems to possess a wide variance of symptoms, morbidities, and other complications; it is particularly threatening to those who are elderly or suffering from certain pre-existing conditions. In fact, the case fatality rate (CFR) is likely to arrive at about 2-3% (source), and the infection fatality rate (IFR) will likely be much lower at about 0.79%; but a single IFR or CFR is not particularly useful given the wide variance of effects the virus has on groups of potential patients.

One major challenge around controlling this pathogen is rooted in its contagiousness and asymptomatic spread. SARS-COV-2 maintains a reproduction number close to about 5.0, making the virus very aggressive (a typical influenza virus possesses a reproduction number (i.e. R0) range of about 2.0-3.0) and difficult to manage with standard tracing and testing. A reproduction number this high allows for rapid and large-scale outbreaks to occur in a short time period (see superspreader events where the R0 can exceed 5.0 by orders of magnitude), and this is compounded by the fact that about 40% of carriers are asymptomatic.

Finally, another difficulty in managing SARS-COV-2 infectiousness in a way that lowers the R0 to a reasonable level, which is R0 <1.0, is *how* it is spread. There has been considerable debate on this subject. Initially, the US NID and CDC declared that the virus transmission was due to contaminated surfaces, and downplayed the other potential routes of transmission. Moreover, the same government entities dismissed the use of other precautionary measures such as mask use with the surgeon general claiming that masks did not prevent the spread of the illness (source), and Dr. Fauci later admitting that the government lied about masks (source).

So, getting to the point: do masks work? The answer to *this* question is actually **easy**, but getting people to understand the simplicity of

Most faultfinders of mask-use leverage a number of claims to reinforce their belief that masks cannot possibly prevent the transmission of an airborne illness like COVID-19 -the following list details these objections:

1) An article published as a comment with the University of Montana (source)

2) A research article that examines the evidence for mask use (source)

3) An additional research article (source)

Before addressing these articles and the “evidence” they don’t have, let me make something very clear:

**Absence of evidence is not tantamount to evidence of absence**

All the major articles or stories about why masks won’t work rest on a research paper claiming “there is little scientific evidence supporting mask use to prevent…” -that actually is not very informative. In fact, that claim is **not** saying “there is evidence that masks do not prevent…”. This is where most people get confused. They seem to think that a lack of empirical knowledge of a fact makes it more likely that the fact does not exist, and we know, very well, that such is not the case. All sorts of things are not empirically verifiable at certain points, but are nonetheless true.

So, on that note, the question that we have to answer is: can mask-use actually be an *inherent* mitigator of transmission risks? The answer is undeniable -that is, undeniable that masks

*“…with p = 1/4 we get R0′ = p² R0. The drop in R becomes 93.75%! You divide R by 16! Even with masks working at 50% we get a 75% drop in R0.”*

The mistake that most studies commit is a complete lack of understanding about the compounding effects across ensembles and how that impacts the virus’s reproduction number over a population set. While the individual benefit of using a cloth mask may be low as a prevention, if we had everyone use them, the risk would fall precipitously and the overall benefit would be significant.
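The arithmetic in the quote is easy to check. The sketch below assumes, as the quote does, that a mask lets a fraction p of the virus through and that both parties wear one, so the reduction compounds to p². The function names are hypothetical, and the baseline R0 of 5.0 is just the figure mentioned earlier in this post.

```python
def r0_with_masks(r0, pass_through):
    """Reproduction number when both parties wear masks that each let a
    fraction `pass_through` of the virus through (the quote's p)."""
    return pass_through ** 2 * r0


def percent_drop(pass_through):
    """Percentage reduction in the reproduction number, 100 * (1 - p^2)."""
    return 100 * (1 - pass_through ** 2)


print(percent_drop(0.25))        # 93.75
print(percent_drop(0.5))         # 75.0
print(r0_with_masks(5.0, 0.25))  # 0.3125
```

The point of squaring is that the mask acts twice: once on the way out of the carrier and once on the way in to the person exposed.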

How much would the risk fall? Well, since we have fallen upon the fact that use across an ensemble tends to exhibit compounding effects, we can estimate (since many countries are requiring masks) what the effects are based on mask efficacy and compliance:

The above image shows, based on current evidence and statistical models, that we can reasonably and with very low cost, lower the R0 by ensuring a high compliance of mask use. Even with a 50% effective mask, and 100% compliance, we crush the R0 to less than 1 (source)!

Now, back to the issue in (1) around masks “filtering” particles. There is this odd hyper-focus on filtering efficacy by respiratory specialists and air-filter engineers as it relates to illness transmission. Often the detractors claim that the virus is smaller than the pore size in all masks, even the notorious N95, and this inhibits their effectiveness. Honestly, I don’t know why this is an issue, for several reasons: first, the virus is carried in an aerosol that is often much larger than the pore size of an N95, and, even with a cloth mask, the *rate of attack* (the dose of virus you are exposed to) is significantly lower than with no protection at all; second, a mask helps reduce the amount of virus an infected person can release into the environment -see the following image:

With just a paper towel, the amount of particles is decreased by over 95% -with about 4% being expelled as a nasal discharge. This study showed that with a full nose-to-mouth mask, the amount of particulate is reduced by about 90%! Considering a rate of attack of, say, 60% with a virus load of 10^6, a further decrease of 90% is **MASSIVE**, which not only impacts transmission rate, but can mean the difference between serious illness and mild illness. Some may be tempted to say, “well… why does this matter if you are not sick. I know I am not sick, so I should be able to go without a mask” -this is a question I will address later, but what matters as it relates to this study is that speech is a potential medium of transmission (source) given the replication of SARS-COV-2 in the upper airway; and, thus, by evaluating droplet dispersion using coverings, we can get a better sense of the mask’s ability to protect others.

There is a claim, derived from (2), that only the best respirators or masks provide any benefit; thus, if you are not using *at least* an N95 or greater, you are not benefiting. This is completely false (source). Studies have shown that even homemade cloth masks slow the dispersion of virus particles -thus, minimizing the overall rate of attack, and benefiting others who are also using masks.

Some other sources have shown that face coverings reduced transmission in households up to 80% (source) and simple masks are also useful (source).

Now, how are mask mandates working for countries that have them in place? First, (source) in countries with universal mask-wearing, per-capita coronavirus mortality increased by just 5.4% each week compared with 48% each week in remaining countries!

Second, all the countries that have implemented mask mandates and have maintained high compliance have done much better than those that do not have similar mandates or compliance. Take, for instance, Thailand, which has had a high organic use of masks since about March 16th of 2020:

Now, look at Vietnam, which *mandated* masks in March:

These images possess general trends that are seen across all mask-mandated countries -there is one exception, however, and that is Saudi Arabia (SA). In SA, the mandate was imposed in the last few days of May with formal execution of the imperative around June 10th. There was a rapidly growing outbreak that persisted since the end of May, but since the mask mandate, there are some encouraging signs as the virus’ transmission rate falls (SEE UPDATE SECTION -END OF POST):

SA has had challenges with certain business compliance and religious groups, but now that the mask use is in full-swing, I expect the decrease they are seeing to continue. Let’s now compare that to the US:

Most of the lull from the end of April to the first week of June is likely due to the same equifinal effects of both the lockdowns and the serious precautionary behavior of certain states, given that outbreak manifestation takes about 24-30 days to be fully realized. The uptick since the beginning of June is not unusual: over 70% of Americans shun masks.

There is even more localized data showing mask efficacy. If we look at Colorado, which had city-based mandates in effect at around the first of May, and moved to a state mandate at the end of July for the most populous cities, we see a very similar trend (which only signals potential effect -not causation):

Finally, there is this claim from certain political positions that asserts something to the effect that “you don’t have the right to force me to use a mask” or that the mandate violates the tenets of liberty. I don’t think the reasoning works -let’s take a look at the fundamental principle that supports almost all (logical) arguments from liberty -the Non-Aggression Principle. The principle asserts that initiating or threatening any forceful interference with an individual’s life or property is inherently immoral. This principle is often considered a fundamental element of conservatism and libertarianism in the United States (source). By not using a mask, you violate the principle in two major ways:

1) You risk harming others and threatening their lives by spreading a lethal virus to them.

2) By not using a mask, you are forcefully augmenting the individual risk of harm the other person is experiencing (which results in (1)).

Now, some will argue that they are not initiating any aggression because they do not possess the proper epistemic conditions to be said to *actually* aggress against another -however, this does not hold water. Given that 20% to 40% of carriers could be either asymptomatic or pre-symptomatic, you could easily, at any time, be a carrier or spreader of the illness (source). Thus, you are obligated to take all precautions not to violate or harm others, given that you cannot completely know that you are not infected.

The foundation of a moral society is one that protects the least among them. It is a society that willingly takes precautions to ensure that the elderly, the sick, or the helpless are protected -particularly when the amount of investment involved is not ruinous to those who are obligated to protect them.

So, to answer your question: are masks effective in minimizing COVID-19? The answer is YES.

**UPDATES:**

1) Source on Dr. Fauci’s deception around masks (source).

2) A very telling bit of evidence where a hairdresser and her partner were both COVID-19 positive and had attended to 140 clients while infected with SARS-COV-2 and not a single client was infected due to policy of both the hairdresser and client having to use masks (source). The probability of this occurring without masks with an R0 from 2 to 5 is almost zero! That is, without masks, we should expect that at least 110 or more should have been infected; thus, mask use blunted the R0 to ZERO in this case!

3) More evidence of aerosol transmission (source)

4) An NPR article highlighting the general evidence for mask use (source)

5) Some very nice explanations and evidence for the effects of mask use (source)

6) There is this concern that masks may not have the overall benefit that proponents suggest. Specifically, there is an argument that, with incorrect use of personal protective equipment (PPE), the reduction in risk of transmission due to mask use may be offset by the mishandling of PPE -namely, that people touch their mask incorrectly, that they may use the mask incorrectly, or that they may not clean or dispose of the mask correctly, resulting in contaminating their hands and, thus, infecting themselves with the virus.

To respond to this, let me say that this is a good issue to consider. We need to take serious precautions with our PPE and how we take it off and on while we maintain handwashing and the like. However, I do not believe that the number of people infected due to *misuse* is going to equal the number of people infected by lack of mask *use*. That is, the reproduction number of the virus is greater with precautions such as hygiene and disinfection protocols and no masks than it is with masks alone. So, if I had to make a bet, I would bet that masks are sufficient for minimizing the pandemic over hygiene/disinfection alone.

Now, given that, I think people still fall under the idea that masks are acting as a one-way protection device. This is NOT how I am looking at masks -their efficacy is two-way: when both people use masks the risk of virus in the environment declines significantly and the exposure to that virus by others is *minimized* (source). This has a ton of spillover effects -particularly when a carrier uses a mask- (a) minimizes rate of attack on others and lowers severity of infection or eliminates the risk, and (b) lowers the bio-burden on surfaces (virus falling out of aerosol and contaminating surfaces as seen in studies in the ICU) which lowers transmission via those surfaces. Thus, if we have everyone using masks, the total bioburden diminishes and drives the reproduction number to less than one -masks may indeed be sufficient regardless of misuse.

7) Another update:

This dovetails with other studies on mask compliance and reproduction number; over time, and with robust mask use, you can effectively end the pandemic.

8) BYU has assembled a report on research supporting mask use –here. Some of the key elements they bring up focus on how masks contain infection and, hence, lower the risk of increasing the viral burden in the environment. They also provide the visual proof of lowering the burden (this should break the perceptions around correlation and causation that anti-maskers focus on -we have empirical evidence).

9) Update on Saudi Arabia ( you can see more of the trend here)

The decline since the mask mandate has been consistent. We are seeing this elsewhere.

10) More evidence in the healthcare setting here.

11) A very interesting observational case where those who used face shields did not avoid an outbreak, but those who used masks were left unscathed (here).

12) Between January 24, 2020 and February 15, 2020, an outbreak of COVID-19 occurred among 335 passengers on a flight from Singapore to Hangzhou in China; 15 already-infected passengers on a full plane of other individuals using masks, and only a single new infection took place -this man was sitting next to an infected person, and was not using a mask at various points (here).

13) Some overview of mask prevention from The Lancet (here)

14) As of 23 Oct 2020, new data showing mask effectiveness has been produced (by CMU):

Not a bull-shit correlation -you can test that by just looking and seeing if a few points determine the trend such that, if removed, the whole thing looks bonkers. In this case, even if you removed WY, SD, ND, DC, MA, RI, FL, LA the regression line would be almost untouched. This helps us understand that the effect is robust.

Simple linear regression is useful, but what about situations where one has to model a significant number of variables? Suppose we have a data set that consists of three predictors and a single response variable. How should we accommodate the two additional predictors?

Multiple linear regression is a way to accommodate additional variables, but we have to address a few concerns. Some may feel that we should proceed by creating distinct linear regression models for each predictor, looking at the goodness of fit, and going with the predictor that fits best. Such an approach is insufficient: some of those predictors may have an effect on others; also, it’s not clear, even with a best-fitting predictor across models, that it is the correct one, since the magnitude of the effect we are measuring is not captured merely by the regression equation.

So, the best approach is to extend linear regression to multiple inputs. We can do this by giving each predictor a separate slope coefficient in a single model. Without loss of generality, suppose we have $latex p$ predictors. Each multiple linear regression model assumes the form:

$latex Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$

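The rules still apply, however, and that means that we must estimate the coefficients of our model. We can proceed much as we did with simple linear regression. We make predictions by using:

$latex \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p$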

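The parameters are estimated using the same least squares approach that we saw in the context of simple linear regression. We choose

$latex \hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$

to minimize the sum of the squared residuals, $latex RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.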
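As a rough illustration of what “choosing the coefficients to minimize the squared residuals” looks like in practice, here is a small sketch that solves the normal equations with plain Gaussian elimination. The data is made up, the helper names `solve`, `fit_mlr`, and `predict` are hypothetical, and in real work one would lean on a numerical library rather than this hand-rolled solver.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                       # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_mlr(X, y):
    """Least squares estimates [b0, b1, ..., bp] from (Z'Z) b = Z'y."""
    Z = [[1.0] + list(row) for row in X]               # intercept column first
    k = len(Z[0])
    ZtZ = [[sum(z[i] * z[j] for z in Z) for j in range(k)] for i in range(k)]
    Zty = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(k)]
    return solve(ZtZ, Zty)


def predict(beta, x):
    """Prediction b0 + b1*x1 + ... + bp*xp for one observation x."""
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))


# Made-up, noise-free data generated from y = 1 + 2*x1 - 3*x2.
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
y = [1 + 2 * a - 3 * b for a, b in X]

beta = fit_mlr(X, y)
print([round(b, 6) for b in beta])
print(round(predict(beta, (3, 2)), 6))
```

Because the toy data has no noise, the fitted coefficients should land on the generating values up to floating point error, which makes it a convenient sanity check.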

When considering multiple linear regression (MLR), we should understand a few items of value. By carefully selecting the right variables for our model, we get a better sense of the variance in the response -the variables add understanding of the variance. Moreover, we would expect, in general, the *explanatory power* of the model to increase.

The pitfalls that we risk when delving into MLR are going to haunt us in many other aspects of statistical modeling, but we should be careful to watch for things like overfitting, wherein we have fitted our model too closely to the data; the next item is multicollinearity. Multicollinearity is a state of very high inter-correlations or inter-associations among the independent variables. It is therefore a type of disturbance in the data, and if present, the statistical inferences made about the data may not be reliable. Another issue associated with this phenomenon is spurious correlation. We may have selected predictors that correlate with the response and, hence, are introduced into our model. However, the correlation may be spurious, such that we are merely modeling the noise and not the effects of the variables.

Okay, now that we got the costs and benefits out, let’s keep going. We need to address a few important questions:

1) Is at least one of the predictors useful in predicting the response variable?

2) Do all the predictors help to explain Y or is only a subset of the predictors useful?

3) How well does the *overall* model fit the data?

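In SLR, we can merely verify whether $latex \beta_1 = 0$. However, with MLR, we have to extend this thinking to all the predictors. That is, are all the coefficients equal to zero? This brings us to the next step of forming a hypothesis test:

$latex H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$

Challenged by the alternative

$latex H_a:$ at least one $latex \beta_j$ is non-zero.

To test the hypothesis, we will compute the F-statistic,

$latex F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)}$

where $latex TSS = \sum_{i=1}^{n} (y_i - \overline{y})^2$ is the total sum of squares and $latex RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the residual sum of squares.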

So, when the F-statistic takes on a value close to 1, we would have no evidence of a relationship between the response and predictors; a value much greater than 1 is evidence against the null hypothesis.
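Here is a small sketch of that computation. It assumes the usual overall-significance form of the statistic, F = ((TSS - RSS)/p) / (RSS/(n - p - 1)), and, to stay short, it demonstrates it on made-up data with a single predictor fitted by the closed-form least squares estimates. The function name `f_statistic` is hypothetical.

```python
def f_statistic(y, y_hat, p):
    """F = ((TSS - RSS) / p) / (RSS / (n - p - 1)) for a model with p predictors."""
    n = len(y)
    y_bar = sum(y) / n
    tss = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares
    return ((tss - rss) / p) / (rss / (n - p - 1))


# Made-up data with a strong linear trend.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Closed-form least squares fit for one predictor.
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

F = f_statistic(y, y_hat, p=1)
print(round(F, 1))
```

For this toy data the statistic comes out far above 1, which is what we would expect when the response tracks the predictor this closely.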

We have to understand that the regression plane, or the MLR, is merely an estimate of the true regression plane across our predictors. Hence, there is error involved in our model. The same thing goes for SLR, and the accuracy or, otherwise, the inaccuracy of the coefficient estimates is related to the reducible error; thus, we can compute a confidence interval in order to determine how close $latex \hat{Y}$ is to $latex f(X)$.

We do not remove, in our efforts at estimating reality using a linear model, potential reducible error -or, model bias. Nonetheless, even if we knew the true values of the coefficients, the response variable cannot be predicted perfectly because of random error that is inherent to the model. We will see, in future posts, that this is called *irreducible error*. We get questions that arise from this: *how much will Y vary from our* $latex \hat{Y}$? This question is one that we will respond to by calculating prediction intervals (which are larger than confidence intervals because they include both the error in the estimate for $latex f(X)$ and the uncertainty as to how much an individual point will differ from the population regression plane).

Here is a source of some more information, along with a video: LINK


$latex Y = \beta_0 + \beta_1 X + \epsilon$ [1]

We can think of (1) as the regression of $latex Y$ on $latex X$ -this is a surjection. Note that $latex \beta_0$ and $latex \beta_1$ are two unknown constants that represent the intercept and the slope values in the linear model in (1) -these values, collectively, are known as coefficients or parameters. Once we obtain some data, we can take these values a step further and produce *estimators*, denoted as $latex \hat{\beta}_0$ and $latex \hat{\beta}_1$. Thus, (1) is now expressed as:

$latex \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ [2]

$latex \hat{y}$ is an estimator for Y based on $latex X$ where X = x. Thus, $latex \hat{y}$ estimates the values for Y -which is an unknown parameter.

Now that we are this far, we can continue our review with some more pragmatic discussion of how to estimate the values of the coefficients. To proceed, we need data; let,

$latex (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

represent *n* observations (pairs), each of which consists of a measurement of X and a measurement of Y -ultimately, we want to find a line that fits all these observations as *closely* as possible. This idea of *closeness* is interesting; there are many ways we can define how close we are to the various observations, but the most common (particularly when we are talking about SLR) is the *least squares* criterion.

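Consider $latex \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ as the equation that predicts Y based on the ith value of X. We can now consider the difference between the observed value $latex y_i$ and the estimate $latex \hat{y}_i$, which is denoted as the ith residual:

$latex e_i = y_i - \hat{y}_i$ [4]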

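Ultimately, (4) will take us to the idea of the residual sum of squares and, as we will see, the various coefficients in (2) and (1). The residual sum of squares (RSS) is direct and expressed as:

$latex RSS = e_1^2 + e_2^2 + \cdots + e_n^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$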

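In LSR, we seek to MINIMIZE the RSS. This provides us the values of our coefficients, and hence the linear regression line:

$latex \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^{n} (x_i - \overline{x})^2}$

and

$latex \hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x}$

With $latex \overline{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $latex \overline{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ as simple means.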
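A short sketch may help here. It assumes the usual closed-form least squares estimates (slope = sum of cross-deviations over sum of squared x-deviations, intercept = mean of y minus slope times mean of x), uses made-up data, and compares the RSS of the fitted line against an arbitrary alternative line. The names `fit_slr` and `rss` are hypothetical.

```python
def fit_slr(x, y):
    """Closed-form least squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def rss(x, y, b0, b1):
    """Residual sum of squares of the line b0 + b1 * x on the data."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up observations.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_slr(x, y)
print(round(b0, 4), round(b1, 4))

# Any other line should do no better than the least squares line.
print(rss(x, y, b0, b1) <= rss(x, y, 0.5, 1.8))
```

Trying a few other intercept and slope pairs in the last line is a quick way to convince yourself that the fitted pair really is the minimizer.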

Understanding the minimization of RSS is essential, but the concept is not as straightforward as one would think. In fact, it’s easy to understand that $latex \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ is merely a line in the form $latex y = mx + b$ with m and b corresponding to our coefficients. The catch is explaining how the line is formed -how we get the coefficients beyond the formula (it’s cheap to just place a formula out there).

So, let’s back-up. Our goal is to find a relationship that maps our observations on to some y. What does this look like geometrically? There is a great tool here that helps show how this works. If we find a line that *does not* minimize the RSS, we get something like this:

The black line is the LSR line, the red line is some other line -note that the red boxes are those that form a square from the point to the red line -the area of those squares is not *minimized*. That is, we never minimized the RSS to derive the correct values for our coefficients. The black boxes, however, are the areas of the squares of the LSR from each point -notice that they are minimized.

In general, LSR exhibits high interpretability and low flexibility. There are implications to model selection that must be considered; first, flexible models often require estimating a greater number of parameters and are more complex -this leads to a phenomenon known as *overfitting*, wherein the noise is modeled too closely (the noise is integrated into the model over the signal). Second, there is the quintessential problem of spuriousness that tends to complicate all statistical models -but is more chronic in problems where highly flexible models are more appropriate (there are methods for dealing with such a problem, but we won’t address that here). Third, there is a tradeoff between flexibility and interpretability that is important to consider -as flexibility increases, interpretability tends to decrease.

It is important to highlight that when estimating $latex \beta_0$ and $latex \beta_1$ from our data, we assume there exists a *true* relationship between X and Y such that $latex Y = \beta_0 + \beta_1 X + \epsilon$; this relationship, however, is one where the error term has mean 0 and is derived from a normal distribution. This relationship is based on the population data. In reality, we rarely have access to such an epistemic vantage point -thus, we are stuck working with a subset of the population data. These observations generate $latex \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ as a distinct line that is an approximation of $latex Y = \beta_0 + \beta_1 X + \epsilon$.

Prima facie, the idea of $latex \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ being different from $latex Y = \beta_0 + \beta_1 X + \epsilon$ is not clear. Why would such a difference exist when we have only one data set to work with? First, this is more a philosophical actualization and fundamental problem with *measurability* than anything inherent to the mathematics. Suppose we seek to understand certain characteristics of a large population -suppose further that we want to know the population mean $latex \mu$ of some random variable Y. We know that we can’t really witness or obtain all the data elements of Y, but we can obtain a sample. This will allow us to *estimate* the mean of Y. Since we have access to the sample, say *n* observations from Y; call that k such that $latex k = \{y_1, y_2, \ldots, y_n\}$, we can find a reasonable estimator $latex \hat{\mu} = \overline{y}$ where $latex \overline{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ is the sample mean. In general, the sample mean is a good *estimator* for the population mean. Thus, in the same way, the coefficients $latex \hat{\beta}_0$ and $latex \hat{\beta}_1$ are estimates for b and m in $latex y = mx + b$.

The analogy drawn above between the sample mean and the SLR is apt based on the concept of *bias*. Think about it… if we use $latex \hat{\mu}$ to estimate $latex \mu$, the estimate is *unbiased* on average -so, in some cases the sample mean will overestimate the population mean, and in some cases it will underestimate it. However, on average, we expect $latex \hat{\mu}$, over a huge number of estimates, to equal the value of the population mean. Hence, such an estimator does not systematically fail in estimating the true parameter. This property of unbiasedness holds for:

$latex \hat{\beta}_0$

and,

$latex \hat{\beta}_1$

Now, if we were to average these over a very large number of data sets (pulled from the same population), we would have estimates that are equal to the “real” coefficients of the population. However, we should be asking ourselves how accurate these estimators, when unbiased, are relative to the population equivalents. Let’s revert back to our example of the population mean of some random variable **Y**. So, how accurate is $latex \hat{\mu}$ as an estimate of $latex \mu$? We have asserted that the average of $latex \hat{\mu}$’s over many data sets will be very close to $latex \mu$ -however, a single estimate $latex \hat{\mu}$ may substantially underestimate or overestimate $latex \mu$. How far off could we get? In general, we answer this question by computing the *standard error* of $latex \hat{\mu}$, written as $latex SE(\hat{\mu})$. We have the well-known formula:

$latex Var(\hat{\mu}) = SE(\hat{\mu})^2 = \frac{\sigma^2}{n}$ [5]

where $latex \sigma$ is the standard deviation of each of the observations $latex y_i$ of Y.

Basically, the standard error tells us the *average amount* that this estimate differs from $latex \mu$. Take note that n in (5) is the number of data points; the *standard error* gets smaller as n gets large -that is, the more observations, the more accurate the estimate.
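One way to see this shrinkage is to simulate it. The sketch below assumes a normal population with hypothetical parameters `mu` and `sigma`, draws many samples of each size, and compares the spread of the sample means against sigma divided by the square root of n. The number of repetitions and the fixed seed are arbitrary choices.

```python
import random
import statistics

random.seed(0)             # fixed seed so repeated runs give the same numbers
mu, sigma = 10.0, 2.0      # hypothetical population mean and standard deviation
REPS = 500                 # number of simulated data sets per sample size


def sample_mean(n):
    """Mean of n independent draws from the assumed normal population."""
    return statistics.fmean(random.gauss(mu, sigma) for _ in range(n))


results = {}
for n in (10, 100, 1000):
    means = [sample_mean(n) for _ in range(REPS)]
    empirical = statistics.stdev(means)   # observed spread of the estimates
    theoretical = sigma / n ** 0.5        # standard error from the formula
    results[n] = (empirical, theoretical)
    print(n, round(empirical, 3), round(theoretical, 3))
```

The two columns should track each other closely, and both should fall by roughly a factor of the square root of 10 each time n grows tenfold.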

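Now, we should be able to obtain a standard error for our coefficients -so, to compute the standard error associated with $latex \hat{\beta}_0$ and $latex \hat{\beta}_1$, we use the following formulas (more details here):

$latex SE(\hat{\beta}_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\overline{x}^2}{\sum_{i=1}^{n} (x_i - \overline{x})^2} \right], \quad SE(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \overline{x})^2}$

where $latex \sigma^2 = Var(\epsilon)$. For these formulas to be valid, the errors $latex \epsilon_i$ must be uncorrelated with common variance $latex \sigma^2$. Note that we cannot know $latex \sigma$ directly, but we can estimate it; this estimate is known as the *residual standard error*, and is given by the formula RSE = (RSS/(n-2))^(1/2).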

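Standard errors are useful for various things; one of these is to compute confidence intervals. A 95% confidence interval (CI) is defined as a range of values such that, with 95% probability, the range will contain the true unknown value of the parameter in question. The range is defined by upper and lower bounds; the 95% CI for $latex \beta_1$ is approximately $latex \hat{\beta}_1 \pm 2 \cdot SE(\hat{\beta}_1)$, which defines the CI:

$latex \left[ \hat{\beta}_1 - 2 \cdot SE(\hat{\beta}_1), \ \hat{\beta}_1 + 2 \cdot SE(\hat{\beta}_1) \right]$

The CI states that there is approximately a 95% chance that this interval contains the true value of our coefficient. We also define the CI for our intercept ($latex \beta_0$) as $latex \hat{\beta}_0 \pm 2 \cdot SE(\hat{\beta}_0)$.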
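To tie the standard errors and intervals together, here is a sketch on made-up data. It assumes the textbook formulas (the RSE stands in for the unknown error standard deviation, and the interval is the estimate plus or minus two standard errors); the function name `slr_with_ci` and the dictionary keys are hypothetical.

```python
def slr_with_ci(x, y):
    """Least squares fit with standard errors and approximate 95% intervals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    rse = (rss / (n - 2)) ** 0.5                      # estimate of sigma
    se_b1 = rse / sxx ** 0.5
    se_b0 = rse * (1 / n + x_bar ** 2 / sxx) ** 0.5
    return {
        "b0": b0, "b1": b1, "se_b0": se_b0, "se_b1": se_b1,
        "ci_b0": (b0 - 2 * se_b0, b0 + 2 * se_b0),    # estimate +/- 2 SE
        "ci_b1": (b1 - 2 * se_b1, b1 + 2 * se_b1),
    }


# Made-up observations.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

fit = slr_with_ci(x, y)
for key, value in fit.items():
    print(key, value)
```

With only five points the factor of 2 is a loose stand-in for the exact t quantile, so the intervals here are somewhat narrower than a proper t-based interval would be.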

Standard errors are more than just interesting measures to help us understand SLR. We can use these tools to perform *hypothesis tests* on these coefficients. We proceed as follows (some great resources here):

- Formulate a null hypothesis (think of this as the opposite of what you are guessing or looking for in your data).
- Provide an alternative hypothesis
- Set an $latex \alpha$ (this involves type 1 and 2 errors; $latex \alpha$ corresponds to the probability of a type 1 error -more on this later).
- Assuming you have correct data, calculate a test statistic (F, T, etc…)
- Determine thresholds of rejection.
- Decide whether to reject the null or fail to reject the null.
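The steps above can be sketched for the slope of an SLR model, with the null hypothesis that the slope is zero. The data is made up, the function name `slope_t_test` is hypothetical, and the critical value 2.447 is the two-sided t-table value for a significance level of 0.05 with 6 degrees of freedom, hard-coded here to avoid depending on a statistics library.

```python
def slope_t_test(x, y, t_crit):
    """Test H0: slope = 0 against Ha: slope != 0 using t = b1 / SE(b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = ((rss / (n - 2)) / sxx) ** 0.5    # standard error of the slope
    t_stat = b1 / se_b1                       # test statistic
    decision = "reject H0" if abs(t_stat) > t_crit else "fail to reject H0"
    return t_stat, decision


T_CRIT = 2.447   # t-table value: alpha = 0.05, two-sided, n - 2 = 6 df

x = [1, 2, 3, 4, 5, 6, 7, 8]
y_trend = [2.3, 3.8, 6.1, 8.2, 9.7, 12.1, 14.2, 15.8]   # clear upward trend
y_flat = [5.1, 4.9, 5.0, 5.2, 4.8, 5.1, 4.9, 5.0]       # no obvious trend

t_trend, decision_trend = slope_t_test(x, y_trend, T_CRIT)
t_flat, decision_flat = slope_t_test(x, y_flat, T_CRIT)
print(round(t_trend, 2), decision_trend)
print(round(t_flat, 2), decision_flat)
```

Note that the critical value depends on the sample size through the degrees of freedom, so it would need to be looked up again for a data set of a different length.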


Now, in the last post we defined . We can now cash out why that is significant. Since we have defined A as the union of a **finite** number of intervals, A is said to be elementary. Let’s talk a little about elementary sets:

Q1: What is an elementary set?

A1: To describe what it is, let’s take a look at its properties:

p1. Every bounded interval is an elementary set.

p2. The intersection of two elementary sets is an elementary set.

p3. The difference of two elementary sets is also an elementary set.

p4. Set E is an elementary set IFF it is the union of finitely many intervals.

Now, ask yourself, do these look familiar? These are the same properties that we described, sans p4, for a ring. Notice, however, that the family of elementary sets, denoted by $latex \mathcal{E}$, is not a $latex \sigma$-ring! Why? Well, because we said that a set is elementary IFF it can be expressed as a finite union of intervals. A $latex \sigma$-ring has the property of being closed under countable (possibly infinite) unions, and a countable union of intervals need not be a finite union of intervals.
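
The ring properties p2 and p3 can be checked by hand on small examples. The sketch below is one way to do it in Python, under a simplifying assumption that is not in the post: every interval is half-open, written as a pair `(a, b)` meaning [a, b), so that endpoint bookkeeping stays trivial. The helper names are hypothetical.

```python
def normalize(intervals):
    """Merge half-open intervals [a, b) into a sorted list of disjoint pieces."""
    merged = []
    for a, b in sorted((a, b) for a, b in intervals if a < b):
        if merged and a <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], b))
        else:
            merged.append((a, b))
    return merged

def union(E, F):
    return normalize(list(E) + list(F))

def intersection(E, F):
    # Intersect every piece of E with every piece of F; empty pieces are dropped.
    return normalize([(max(a, c), min(b, d)) for a, b in E for c, d in F])

def difference(E, F):
    # Remove one piece of F at a time from what is left of E.
    result = normalize(E)
    for c, d in normalize(F):
        pieces = []
        for a, b in result:
            pieces.append((a, min(b, c)))  # part of [a, b) to the left of [c, d)
            pieces.append((max(a, d), b))  # part of [a, b) to the right of [c, d)
        result = normalize(pieces)
    return result

def length(E):
    return sum(b - a for a, b in normalize(E))

E = [(0, 2), (3, 5)]
F = [(1, 4)]
print(union(E, F))         # a finite union of intervals again
print(intersection(E, F))  # likewise
print(difference(E, F))    # likewise
```

Each operation returns a finite list of intervals, which is exactly what it means for the result to be elementary again.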

Now that we have the notion of elementary sets down, we have some definitions that need evaluating.

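Definition 1: A nonnegative additive set function $latex \phi$ defined on the family $latex \mathcal{E}$ of elementary sets is said to be regular if the following is true: to every $latex A \in \mathcal{E}$ and to every $latex \epsilon > 0$, there exist $latex F \in \mathcal{E}$ and $latex G \in \mathcal{E}$, where F is compact (and closed) and G is open, so that: $latex F \subset A \subset G$ and $latex \phi(G) - \epsilon \le \phi(A) \le \phi(F) + \epsilon$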

If $latex A = I_1 \cup I_2 \cup \dots \cup I_n$ and if these intervals are pairwise disjoint, then (think additivity) $latex m(A) = m(I_1) + m(I_2) + \dots + m(I_n)$;

]]>What is a measure? Well, let’s see what Lebesgue had to say about this,

Let us consider a set of points of (a, b); one can enclose in an infinite number of ways these points in an infinite number of intervals; the infimum of the sum of the lengths of the intervals is the measure of the set. A set E is said to be measurable if its measure together with that of the set of points not forming E gives the measure of (a, b).

He continues explaining,

Enclose A in a finite or denumerably infinite number of intervals, and let l1, l2, . . . be the length of these intervals. We obviously wish to have

m(A) ≤ l1 + l2 + · · ·

If we look for the greatest lower bound of the second member for all possible systems of intervals that cover A, this bound will be an upper bound of m(A). For this reason we represent it by m∗(A), and we have

m(A) ≤ m∗(A) (3)

If (a, b) \ A is the set of points of the interval (a, b) that do not belong to A, we have similarly

m((a,b) \ A) ≤ m∗((a,b) \ A).

Now we certainly wish to have

m(A) + m((a, b) \ A) = m(a, b) = b − a (4)

and hence we must have

m(A) ≥ b − a − m∗ ((a, b) \ A)

The inequalities (3) and (4) give us upper and lower bounds for m(A). One can easily see that these two inequalities are never contradictory. When the lower and upper bounds for A are equal, m(A) is defined, and we say that A is measurable.

What does Lebesgue mean by this? What is he getting at? Well, let us consider some set $latex A \subseteq \mathbb{R}$. And, for the moment, let’s not worry about whether A ⊆ (a, b); for example, A could be unbounded. Consider his definition of m∗(A). Let’s cover A by countably many intervals $latex I_1, I_2, I_3, \dots$, so that

$latex A \subseteq \bigcup_{n=1}^{\infty} I_n$

For our purposes, we are going to (in the name of simplicity) assume that all our intervals are open, $latex I_n = (a_n, b_n)$, and note that the term $latex \ell(I_n) = b_n - a_n$ denotes the usual length of $latex I_n$. Now since $latex A \subseteq \bigcup_{n} I_n$, the sum $latex \sum_{n} \ell(I_n)$ should be at least as large as the “true measure” of our set A. The more accurately the union of the sequence of intervals **I** approximates **A**, the smaller our aforementioned sum is. The opposite is also true (i.e. if our countable union does not approximate the set well, our sum is going to be larger).

As Lebesgue did, we take m*(A) to be an upper bound of m(A), but not just any bound: it is the greatest lower bound (the infimum) of all such covering sums. We define,

$latex m^*(A) = \inf \left\{ \sum_{n=1}^{\infty} \ell(I_n) : A \subseteq \bigcup_{n=1}^{\infty} I_n \right\}$

We define this as the outer measure of A. The outer measure basically assigns a “length” to every subset of the reals -that is all!
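
A small sketch of how this definition behaves on a countable set: if we cover the n-th point of a list by an interval of length ε/2^n, the total length of the cover stays below ε no matter how many points we list. This is the standard argument that a countable set, such as the rationals in [0, 1], has outer measure zero. The function names below are made up for illustration, and the code only handles a finite initial segment of the enumeration.

```python
from fractions import Fraction

def rationals_in_unit_interval(max_denominator):
    """List the distinct rationals p/q in [0, 1] with q <= max_denominator."""
    seen = set()
    ordered = []
    for q in range(1, max_denominator + 1):
        for p in range(0, q + 1):
            r = Fraction(p, q)
            if r not in seen:
                seen.add(r)
                ordered.append(r)
    return ordered

def cover(points, eps):
    """Cover the n-th point (n = 1, 2, ...) by an open interval of length eps / 2**n."""
    intervals = []
    for n, r in enumerate(points, start=1):
        half = eps / 2 ** (n + 1)
        intervals.append((r - half, r + half))
    return intervals

eps = Fraction(1, 100)
points = rationals_in_unit_interval(10)
intervals = cover(points, eps)
total_length = sum(b - a for a, b in intervals)
print(len(points), float(total_length), total_length < eps)
```

Because `eps` can be taken as small as we like, the infimum in the definition of the outer measure is forced down to zero for such a set.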

In the next post, we will provide some information on elementary sets, what they do, and why we should care about them. Finally, we will relate this to our m(I), measure, and see how it pushes our definition forward.

1) Definition: Given two non-empty sets $latex A$ and $latex B$, a *function* $latex f$ is a **rule** (i.e. a set of instructions) where each element of $latex A$ is “paired” with exactly one element of $latex B$. Set $latex A$ is called the *domain* of $latex f$ and $latex B$ is called the *range or codomain*. We use the notation, $latex f : A \to B$

Now that we have cashed-out the idea of what a function is, we can move toward defining a “set function.” However, let’s take a step back and address the basics of sets and what they are. A function is perhaps one of the most important mathematical concepts that exists. In fact, the essence of a function is merely, upon reduction and distillation of all the abstract fluff, an encoding of information from one group of objects to another. This encoding process happens according to a well-defined list of rules.

I mean, do you like to cook? Have you ever baked a cake? You take one group of “stuff” and add other “stuff” in a specific *fashion* to produce the end result; this is the essence of functions. Now, to get to the idea of a set, one need not stray from the conception of a group (in the conventional sense of the word). We group things together all the time. We group people, places, and things by attaching place-holders or by “tagging” a name to a group and then “baptizing” (see Kripke’s works, specifically, *Naming and Necessity*) the elements as members of the set we have tagged. An action such as “baptizing” an element as a member of a specific group can be as arbitrary as one wants. For example, consider the group:

M={car, ~car, Fabio, train, shoe, glass, uranium, qualia, Hume, Leibniz, farm, Obama, Ross, jerk, nice and quarks}

Some of the objects in M share similarities, and some do not. The relations among the elements in M are not consistent across all elements of the group M. There are things in M that are inconsistent (i.e. car and ~car, or Hume and Leibniz). The properties shared among the elements are not uniform. Nonetheless, this is a set. That is, we say M is a set, and the stuff in M are its elements.

Cantor proclaimed,

A set is a gathering together into a whole of definite, distinct objects of our perception [Anschauung] or of our thought—which are called elements of the set

Some sets are better than others. I mean, consider the set of real numbers:

$latex \mathbb{R}$ = {0, 1/2, 2, 6, 600, 21, $latex \pi$, e, 4, -5, …}

This set is very specific as to what elements can be contained in it. In fact, there exists a criterion, which is a list of properties that each element must possess in order to be in this set.

Sets are also acted upon by certain rules and operations. For example, what if we take a simple version of M, say M’ and let it possess the following:

M’={car, train, shoe}

and, now, let a new set called R contain the following:

R={car, leaf, stalk}

We can, from a linguistic standpoint, say, “hey, M’ and R, **together**, are the set of {car, train, shoe, leaf, stalk}!” This is true, and we can say that “the set M’ and R **share** {car}!” This is where the simple operations of disjunction and conjunction apply (i.e. or/and), but from a set-theoretical view, they do not necessarily apply in the exact same way we use the words “or” and “and.” In fact, as you can see above, putting the sets together is adding all of the elements of one to the other. This process calls for a disjunction or **uniting** of the sets as a whole. Moreover, the process that entails a conjunction is one where one set **intersects** another set. So, we say,

M’ $latex \cup$ R = {car, leaf, stalk, train, shoe} for M’ united with R, and for M’ intersects R, we say that M’ $latex \cap$ R = {car}.
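
The same two operations, written out with Python's built-in `set` type (the variable names are just labels for the sets above):

```python
M_prime = {"car", "train", "shoe"}
R = {"car", "leaf", "stalk"}

united = M_prime | R   # union: everything in either set
shared = M_prime & R   # intersection: only what both sets contain

print(sorted(united))
print(sorted(shared))
```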

So, what, then, does it mean to say that there exists some $latex \phi$, a set function on A?

2) Definition: A set function is a function (same as above) whose input is a set and whose output is a real number.

Basically, probability is a set function as is a measure (as we shall soon see).

If $latex A$ and $latex B$ are any two sets, we write $latex A \setminus B$ to indicate the set of all $latex x$ such that $latex x \in A$ and $latex x \notin B$, or $latex A \cap B^c$ where $latex B^c$ denotes the complement of B. And if $latex A$ and $latex B$ are *disjoint*, we say that their *intersection* is empty; that is, they do not share anything. We denote such sets as $latex A \cap B = \emptyset$.

Now, since we are trying to work our way down to a very fundamental and important idea in mathematics, we have to continue with some more definitions. Up to now, all this has seemed very basic and, perhaps, piecemeal. Thus, let’s pin-point our goal.

We are trying to develop the theory of the Lebesgue integral; this will require the expounding of set functions, rings, and, as such, the development of measure theory. From there, we arrive at Lebesgue integration.

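1) Definition: A family $latex \mathcal{R}$ of sets is called a ring if $latex A \in \mathcal{R}$ and $latex B \in \mathcal{R}$ $latex \rightarrow$

i) $latex A \cup B \in \mathcal{R}$ and $latex A \setminus B \in \mathcal{R}$

ii) since $latex A \cap B = A \setminus (A \setminus B)$, if $latex \mathcal{R}$ is a ring, then $latex A \cap B \in \mathcal{R}$ as well.

A ring $latex \mathcal{R}$ is called a $latex \sigma$-ring if,

i) $latex \bigcup_{n=1}^{\infty} A_n \in \mathcal{R}$

whenever $latex A_n \in \mathcal{R}$ for $latex n = 1, 2, 3, \dots$ Since $latex \bigcap_{n=1}^{\infty} A_n = A_1 \setminus \bigcup_{n=1}^{\infty} (A_1 \setminus A_n)$, we also have

ii) $latex \bigcap_{n=1}^{\infty} A_n \in \mathcal{R}$

if $latex \mathcal{R}$ is a $latex \sigma$-ring.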

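This is all very nice, but it is very convoluted. Let’s reduce it down a bit, shall we?

Let $latex X$ be any non-empty set. A $latex \sigma$-algebra of subsets of $latex X$ is a family $latex \mathcal{A}$ of subsets of $latex X$ with the following properties:

p1) $latex \mathcal{A}$ is non-empty.

p2) *Closure under complement:* if $latex A \in \mathcal{A}$, then $latex A^c \in \mathcal{A}$ or, you could say, $latex X \setminus A$ is in $latex \mathcal{A}$.

p3) *Closure under countable union:* if $latex A_1, A_2, A_3, \dots$ is a sequence of sets in $latex \mathcal{A}$, then $latex \bigcup_{n=1}^{\infty} A_n \in \mathcal{A}$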

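2) Definition: We say that $latex \phi$ is a set function defined on a ring $latex \mathcal{R}$ if it assigns to every $latex A \in \mathcal{R}$ a number $latex \phi(A)$ of the real number line. Now, we say that $latex \phi$ is *additive* if $latex A \cap B = \emptyset$ implies:

p4) $latex \phi(A \cup B) = \phi(A) + \phi(B)$

Now, we really want to jump a bit further, and define $latex \phi$ as *countably additive* if $latex A_i \cap A_j = \emptyset$ for j not equal to i implies

p5) *Countable additivity:* for any mutually disjoint (see above) sets $latex A_1, A_2, A_3, \dots$, $latex \phi\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} \phi(A_n)$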

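3) **Theorem:** Suppose that our set function $latex \phi$ is countably additive on a ring $latex \mathcal{R}$. Suppose further that $latex A_n \in \mathcal{R}$ where n is a natural, $latex A_1 \subset A_2 \subset A_3 \subset \dots$, $latex A \in \mathcal{R}$, and,

$latex A = \bigcup_{n=1}^{\infty} A_n$

then as $latex n \to \infty$, we have:

$latex \phi(A_n) \to \phi(A)$

**Proof:**

Set $latex B_1 = A_1$ and $latex B_n = A_n \setminus A_{n-1}$ for $latex n = 2, 3, \dots$, and notice that $latex B_i \cap B_j = \emptyset$ for $latex i \neq j$. It follows that $latex A_n = B_1 \cup \dots \cup B_n$ and $latex A = \bigcup_{n=1}^{\infty} B_n$. Thus,

$latex \phi(A_n) = \sum_{i=1}^{n} \phi(B_i)$

and

$latex \phi(A) = \sum_{i=1}^{\infty} \phi(B_i)$

Since the partial sums of a convergent series converge to its sum, $latex \phi(A_n) \to \phi(A)$.

Now that we have established these facts, we arrive at the point where we can construct the Lebesgue measure. This will be done in a new post.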
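
As a quick sanity check of the theorem above, here is a minimal sketch with a concrete increasing sequence. The choices are assumptions made for illustration only: the sets are A_n = [0, 1 - 1/n), the set function is ordinary length, and the disjoint pieces are B_1 = A_1 and B_n = A_n minus A_{n-1} = [1 - 1/(n-1), 1 - 1/n). Exact fractions are used so that the equalities hold without rounding.

```python
from fractions import Fraction

def phi_A(n):
    """Length of A_n = [0, 1 - 1/n)."""
    return 1 - Fraction(1, n)

def phi_B(n):
    """Length of B_n, where B_1 = A_1 (empty) and B_n = [1 - 1/(n-1), 1 - 1/n)."""
    if n == 1:
        return Fraction(0)
    return Fraction(1, n - 1) - Fraction(1, n)

for n in (1, 2, 10, 100, 1000):
    partial = sum(phi_B(i) for i in range(1, n + 1))
    # The partial sum over the disjoint pieces reproduces the length of A_n.
    print(n, partial, phi_A(n), partial == phi_A(n))
```

The union of the A_n is [0, 1), of length 1, and the printed lengths 1 - 1/n climb toward that value as n grows.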

]]>

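We know from basic calculus and our study of integration, differentiation, and continuity that if $latex f_1, f_2, \dots, f_n$ are functions defined on an interval $latex I = [a, b]$, then (taken from Elementary Real Analysis by Andrew Bruckner):

(1) If $latex f_1, f_2, \dots, f_n$ are continuous on $latex I$, so is $latex f = f_1 + f_2 + \dots + f_n$.

(2) If $latex f_1, f_2, \dots, f_n$ are differentiable on $latex I$, then so is $latex f$. And it follows that, $latex f'(x) = f_1'(x) + f_2'(x) + \dots + f_n'(x)$

(3) If $latex f_1, f_2, \dots, f_n$ are integrable on $latex I$, then so is $latex f$. And we have, $latex \int_a^b f(x)\,dx = \int_a^b f_1(x)\,dx + \int_a^b f_2(x)\,dx + \dots + \int_a^b f_n(x)\,dx$

Everything seems to be as expected, right? So, what about adding an infinite number of functions $latex f_1, f_2, f_3, \dots$? That is, what can we say about $latex f = \sum_{k=1}^{\infty} f_k$ if we had to consider an infinite series of functions?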

For points (1)-(3), imagine, for a moment, the nature of an infinite sum. Will it produce the same results?

To spoil the ending, the truth is: (1)-(3) do not necessarily hold as we “stretch” the terms out to infinity. We will see that there is a different type of condition that must hold in order to reproduce the results of (1)-(3) for an infinite series of functions.

**POINTWISE LIMITS**

As is obvious, we are dealing with sequences of real-valued functions. The nature of how these functions behave as a sequence or series is, however, not so obvious. In fact, we know what it means for a series of numerical values to converge, but does the same thing work for functions? This question is among many that we must answer. In fact, is the limit of a sequence of functions still a function? What about differentiation? If each function is differentiable, is $latex \lim_{n \to +\infty} f_{n}$ differentiable? The same goes for integration. And what about continuity? If each of our functions is continuous on some interval, is the limit of these functions continuous as well?

We must cash-out some definitions, look at some properties, and, then, see what theorems follow. From this, we can work some problems and develop an understanding before moving to uniform convergence.

Now, I will be drawing on my own notes for this post, but I recommend that you look at this site as a means to gain some further clarification. You can also find some references and discussions here.

**Definition 1:**

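Let $latex \{f_n\}$ be a sequence of functions defined on some domain $latex D$. If the limit $latex \lim_{n \to \infty} f_n(x)$ exists **for all x in $latex D$**, we say that the sequence *converges pointwise* on our domain. The limit defines a function $latex f$ on $latex D$ by the equation,

$latex f(x) = \lim_{n \to \infty} f_n(x)$

and we say that $latex f_n \to f$ pointwise on $latex D$.

What stands out about this definition? Well, to see it better, let’s write it out in a more logical fashion:

We assert that $latex f_n \to f$ pointwise on $latex D$ if $latex \forall x \in D, \; \forall \epsilon > 0, \; \exists N \in \mathbb{N} : n \ge N \implies |f_n(x) - f(x)| < \epsilon$. Notice that N appears after the quantification of x and $latex \epsilon$. Thus, N may depend on the values of these terms. Make special note as to the quantification of x here as well.

This sets us up for the next big question: what do we make of $latex \sum_{k=1}^{\infty} f_k(x)$?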

**Definition 2:**

For **each x** in $latex D$ and $latex n \in \mathbb{N}$ we let,

$latex S_n(x) = f_1(x) + f_2(x) + \dots + f_n(x)$

If the limit of partial sums is taken (i.e. $latex \lim_{n \to \infty} S_n(x)$) and if it is a real number, we say that the series **converges at x**. If the series converges for all x in $latex D$, we say that the series *converges pointwise on D* to the function defined by,

$latex f(x) = \sum_{k=1}^{\infty} f_k(x) = \lim_{n \to \infty} S_n(x)$

In general we cannot use pointwise limits to provide positive answers to the question of $latex f$ being continuous if each $latex f_n$ is continuous on some interval, nor can we do so for differentiation or integration.

For example, consider the following function $latex f_n(x) = x^n$ for x in [0,1]. Each of the $latex f_n$ is continuous on [0,1]. But what about the limit at each x? Consider that $latex \lim_{n \to \infty} x^n = 0$ for $latex 0 \le x < 1$ and, yet, at x=1 we get $latex \lim_{n \to \infty} 1^n = 1$! Look at the graph of the function to “see” more clearly what is happening here,

Thus, the pointwise limit of the sequence of continuous functions is discontinuous at x=1. The sequence does converge **for all x in our domain D**, but continuity of the limit fails when D=[0,1]. If we restrict our domain to (0,1) we “get away” with continuity.
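
The dependence of N on x can be seen numerically. The sketch below assumes the example sequence is f_n(x) = x^n on [0, 1], with limit 0 for x < 1 and 1 at x = 1, and searches for the smallest N that brings f_N(x) within a tolerance of the limit. The helper names are hypothetical.

```python
def f(n, x):
    return x ** n

def limit(x):
    """Pointwise limit of x**n on [0, 1]."""
    return 1.0 if x == 1.0 else 0.0

def smallest_N(x, eps):
    """Smallest N with |f_N(x) - limit(x)| < eps, for 0 <= x <= 1."""
    N = 1
    while abs(f(N, x) - limit(x)) >= eps:
        N += 1
    return N

for x in (0.5, 0.9, 0.99, 0.999, 1.0):
    print(x, smallest_N(x, 1e-3))
```

The closer x sits to 1, the larger the N that is needed for the same tolerance, so no single N serves every x in [0, 1) at once.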

In reality, pointwise limits are often not very useful because the pointwise limit may not retain any of the useful properties that the original functions shared. We require something stronger and more meaningful, whose existence implies all our previous work involving pointwise limits. Nonetheless, our work was not a complete loss. In fact, our study of pointwise limits sets us up for **Uniform Convergence.**

Before jumping into uniform convergence, let’s get some background.

**UNIFORM LIMITS**

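In general, pointwise limits do NOT allow the interchange of limit operations (i.e. for continuity of the limit $latex f$ at some point $latex x_0$ in our domain we would need $latex \lim_{x \to x_0} f(x) = f(x_0)$, and this would absolutely require that $latex \lim_{x \to x_0} \lim_{n \to \infty} f_n(x) = \lim_{n \to \infty} \lim_{x \to x_0} f_n(x)$). Uniform limits will often allow us to interchange the limit operations.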

**Definition 3:**

Let $latex \{f_n\}$ be a sequence of functions defined on a common domain D. We say that our sequence *converges uniformly* to a function $latex f$ on D if for every $latex \epsilon > 0$ there exists some N so that,

$latex |f_n(x) - f(x)| < \epsilon$

for all $latex n \ge N$ and all $latex x \in D$.

We write, $latex f_n \to f$ uniformly on D.

In logical notation, we have: $latex \forall \epsilon > 0, \; \exists N$ : $latex \forall n \ge N$ and $latex \forall x \in D$ there holds $latex |f_n(x) - f(x)| < \epsilon$

The major difference between this definition and the definition regarding pointwise limits is that N depends on $latex \epsilon$ and NOT x. Essentially, for n large enough, the difference between our sequence of functions and the “actual” function is uniformly small across the whole domain.

An important fact is that if $latex f_n \to f$ uniformly on D, then it also converges pointwise. This often allows us to start proofs by finding the pointwise limit and then moving toward showing uniform convergence. This works because the pointwise limit is the only possible candidate for the uniform limit, so identifying it first sets one up for a path to the solution.
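
Here is a minimal numerical contrast between the two modes of convergence. Both sequences are assumptions chosen for illustration: g_n(x) = x/n, which converges uniformly to 0 on [0, 1], and f_n(x) = x^n, which converges to 0 pointwise on [0, 1) but not uniformly. For the second one, a grid of sample points would be misleading, so the code instead evaluates f_n at the witness point x_n = (1/2)^(1/n), which lies in (0, 1) and always gives an error of 1/2.

```python
grid = [k / 1000 for k in range(1001)]  # sample points in [0, 1]

def sup_error_g(n, points):
    """Largest value of |x/n - 0| over the sample points."""
    return max(abs(x / n) for x in points)

def witness_f(n):
    """Return a point x_n in (0, 1) together with the error |x_n**n - 0| there."""
    x_n = 0.5 ** (1.0 / n)
    return x_n, abs(x_n ** n)

for n in (10, 100, 1000):
    x_n, err = witness_f(n)
    print(n, sup_error_g(n, grid), x_n, err)
```

The worst-case error of g_n shrinks like 1/n, so one N works for every x. The error of f_n at its witness point never drops below roughly 1/2, so no N can work for all x in [0, 1) at once.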

In the next post, we will cover the Cauchy Criterion, and, then, talk about uniform convergence.

]]>

**Definition #1:** Let $latex f$ be real valued on an interval $latex I$. The following holds:

**1)** If $latex f(x_1) \le f(x_2)$ whenever $latex x_1$ and $latex x_2$ are in $latex I$ with $latex x_1 < x_2$, then our function is *nondecreasing* on I.

**2)** However, if $latex f(x_1) < f(x_2)$ holds, then our function is strictly increasing.

The derivative plays a fundamental role in determining the nature of a function’s monotonic behavior; we will show that the “slope” of our function on $latex I$ can aid us in seeing how it is moving along $latex I$. We will prove this directly using the mean value theorem.

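**Theorem:** let $latex f$ be differentiable on some interval $latex I$.

**i)** If $latex f'(x) \ge 0$ for all points in $latex I$, then $latex f$ is nondecreasing on $latex I$.

**ii)** If $latex f'(x) > 0$ for all points in $latex I$, then $latex f$ is increasing on $latex I$.

**iii)** If $latex f'(x) \le 0$ for all points in $latex I$, then $latex f$ is nonincreasing on $latex I$.

**iv)** If $latex f'(x) < 0$ for all points in $latex I$, then $latex f$ is decreasing on $latex I$.

**Proof:** To prove (**i**), let $latex x_1, x_2 \in I$ with $latex x_1 < x_2$. By the mean value theorem, there exists $latex c \in (x_1, x_2)$ so that:

$latex f(x_2) - f(x_1) = f'(c)(x_2 - x_1)$

So, when our derivative at c is positive or equal to zero, $latex f(x_2) \ge f(x_1)$. If this holds on $latex I$, then we say our function is nondecreasing on $latex I$.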

The absolute awesomeness that comes from the prior fact is that we can use it to determine if there is a minimum or maximum at a point. If you imagine a function that is continuous on some real-valued interval $latex I$, and let that function have a minimum (or maximum) at a certain point in $latex I$, then all you have to do is look at the values of the function from both sides as you reach that point. The function will hit a point where the derivatives on either side have opposite signs, so if the derivative exists at that point, the slope there must be zero. Thus, you have landed at a minimum (or maximum).

Furthermore, consider the function $latex f(x) = |x|$. We know that $latex f$ is not differentiable at x=0, but who cares? We can use this result to show that at the point x=0, the sign of $latex f'$ changes, so $latex f$ goes from decreasing to increasing.

I would like to address a concept that is derived from those ideas. It goes by the name of Dini derivatives.

We just discussed how we can “think” of the function by way of our discussion of the values of its derivative. However, what other methods exist to look at this type of function more aggressively? I mean, one would think that there must be some method for evaluating the changes in the derivative itself even at points where the one-sided limits of the difference quotient do not match.

The graph above does not allow us to do much as we take the derivative near 0. In fact, a little work with the limit of the difference quotient while taking the limits as x goes to zero from the right and left will produce the obvious and one will witness their dream shattered and crushed under the weight of non-differentiability.
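
The one-sided behavior is easy to see numerically. The sketch below assumes the function under discussion is f(x) = |x| and computes the difference quotient at 0 for small steps from the right and from the left. The helper name `difference_quotient` is made up for illustration.

```python
def difference_quotient(f, x0, h):
    """Slope of the secant line of f between x0 and x0 + h."""
    return (f(x0 + h) - f(x0)) / h

for k in range(1, 6):
    h = 10.0 ** (-k)
    right = difference_quotient(abs, 0.0, h)    # approach 0 from the right
    left = difference_quotient(abs, 0.0, -h)    # approach 0 from the left
    print(h, right, left)
```

The right-hand quotients and the left-hand quotients settle on two different values, which is exactly why the ordinary derivative fails to exist at 0 while the one-sided limits are still perfectly informative.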

However, in mathematics, derivatives are IMPORTANT in evaluating the local behavior of a function. We will discuss this in a subsequent post.

]]>Open any advanced calculus text or work in any field of analysis, and you are likely to relish the glorious ideas of Augustin-Louis Cauchy and not even know it. Cauchy was born in Paris, France in 1789. He began his professional life as an engineer and progressively influenced mathematics by way of research, professorships (when he was not refusing compliance with local regulations) and dabbling in physics (as the chair of several physics departments).

Today, we present a proof of the Cauchy Criterion for summations, series, and, finally, uniform convergence.

We proceed as follows:

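1) A sequence $latex \{x_n\}$ is a Cauchy sequence $latex \iff$ it is convergent.

Proof: Suppose that $latex \{x_n\}$ is Cauchy. By definition, we have:

$latex \forall \epsilon > 0, \; \exists N \in \mathbb{N} : m, n \ge N \implies |x_m - x_n| < \epsilon$.

By our definition, it follows that for some $latex N$, $latex |x_n - x_N| < 1$ whenever $latex n \ge N$. Pick any such $latex n$; the triangle inequality gives:

$latex |x_n| \le |x_n - x_N| + |x_N| < 1 + |x_N|$

If we let M= $latex \max\{|x_1|, \dots, |x_{N-1}|, 1 + |x_N|\}$, then it follows that $latex |x_n| \le M$ for all $latex n$. So, we say that $latex \{x_n\}$ is bounded.

Since $latex \{x_n\}$ is bounded, by the Bolzano-Weierstrass Property, it follows that:

$latex \{x_n\}$ has a convergent subsequence, say $latex x_{n_k} \to L$.

And, as stated, we take our definition for $latex \{x_n\}$,

$latex \forall \epsilon > 0, \; \exists N : m, n \ge N \implies |x_m - x_n| < \epsilon/2$.

Now, suppose that $latex n \ge N$; set $latex k$ large enough so that $latex n_k \ge N$ and $latex |x_{n_k} - L| < \epsilon/2$, where,

$latex |x_n - L| \le |x_n - x_{n_k}| + |x_{n_k} - L| < \epsilon$

This establishes the fact that $latex \{x_n\}$ is convergent.

Now, we prove that if $latex \{x_n\}$ is convergent, then it is a Cauchy sequence.

The proof of the conditional that a convergent sequence $latex \{x_n\}$ is Cauchy is fairly simple. Consider a sequence $latex x_n \to L$ and, for $latex \epsilon > 0$, choose $latex N$ so that $latex |x_n - L| < \epsilon/2$ for all $latex n \ge N$. Then use $latex |x_m - x_n| \le |x_m - L| + |L - x_n| < \epsilon$ for $latex m, n \ge N$ so as to produce the desired result, being careful to consider the correct use of inequalities.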

Now, consider how this may apply to summations. One of the conveniences of summations is that they basically act as a sequence in disguise -the previous criterion can be used with a few minor additions to prove the criterion for summations. However, we will act more formally.

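The series

$latex \sum_{k=1}^{\infty} a_k$

is said to satisfy the Cauchy criterion provided that:

$latex \forall \epsilon > 0, \; \exists N : n > m \ge N \implies \left| \sum_{k=m+1}^{n} a_k \right| < \epsilon$

Now, the criterion holds: a series converges $latex \iff$ the criterion holds. So, to formalize it, we would take the proofs from above and apply them to the sequence of partial sums, taking $latex x_n$ as:

$latex s_n = \sum_{k=1}^{n} a_k$

so that $latex |s_n - s_m| = \left| \sum_{k=m+1}^{n} a_k \right|$. Everything else falls out: if we move in the other direction of the biconditional, we get a convergent sequence of partial sums, which is Cauchy by the argument above and so fulfills the criterion.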
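
To see the criterion at work, here is a minimal sketch comparing two series that are not taken from the post: the harmonic series, whose terms are 1/k, and the series with terms 1/k². The function `block_sum` is a made-up helper that computes the block a_{m+1} + ... + a_n appearing in the criterion, evaluated here with m = n and upper index 2n.

```python
def block_sum(term, m, n):
    """Return term(m+1) + ... + term(n), the quantity the Cauchy criterion bounds."""
    return sum(term(k) for k in range(m + 1, n + 1))

def harmonic_term(k):
    return 1.0 / k

def square_term(k):
    return 1.0 / k ** 2

for n in (10, 100, 1000, 10000):
    print(n, block_sum(harmonic_term, n, 2 * n), block_sum(square_term, n, 2 * n))
```

For the harmonic series the block from n+1 to 2n has n terms, each at least 1/(2n), so it never drops below 1/2 and the criterion fails. For 1/k² the same block is smaller than 1/n, which is consistent with the criterion holding.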

Moving on, we are not at the end of our post. We have only one more fact to prove: the criterion for uniform convergence. Uniform convergence is the idea that a sequence of functions converges to some limiting function (or just, “a function”) at a rate that can be controlled across the whole domain at once, rather than point by point. For example, let’s consider the following function, and make that a sequence, denoted as:

The basic idea, here, is that as we change n, we want to see if we can find some large value of n so that our sequence lands within any chosen tolerance of some limit function for all values in the domain at once. Let’s show different values of the previously mentioned function and see what its behavior is.

The above image should help you see what is happening here. The “speed” at which we are converging to some limit value is not sufficient. That is, as we push out n, we see that no single large n brings the values close to the limit for all x at once. Also, even if I push the value of n out to 1,000 or 100,000 I still do not arrive at the value required to establish uniform convergence.

So, what is our definition for something to be called uniformly convergent? Let’s take a look:

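**Definition**: Let $latex \{f_n\}$ be a sequence of functions defined on a common domain $latex D$. $latex \{f_n\}$ converges uniformly to a function $latex f$ on $latex D$ if $latex \forall \epsilon > 0, \; \exists N$ such that $latex |f_n(x) - f(x)| < \epsilon$ for all $latex n \ge N$ and all $latex x \in D$.

What does this mean for applying the Cauchy Criterion? Well, two things: (1) we are dealing with sequences that are involved in a limiting process, (2) the idea of functions converging should, depending on their nature, “act” the same way as sequences or series. Let’s move on to the meat-and-potatoes.

**Cauchy Criterion (definition):** Let $latex \{f_n\}$ be a sequence of functions defined on a common domain $latex D$. The sequence is said to be uniformly Cauchy on $latex D$ if $latex \forall \epsilon > 0, \; \exists N$ such that if $latex m, n \ge N$ and $latex x \in D$ then it follows that $latex |f_m(x) - f_n(x)| < \epsilon$

**Criterion:** $latex \{f_n\}$ converges uniformly on $latex D$ IFF $latex \{f_n\}$ is uniformly Cauchy on $latex D$.

**Proof:** Let $latex \{f_n\}$ be uniformly Cauchy; we show that it is uniformly convergent. Suppose that $latex \{f_n\}$ is uniformly Cauchy on $latex D$. By definition, we have:

$latex \forall \epsilon > 0, \; \exists N$ such that if $latex m, n \ge N$ and $latex x \in D$ then it follows that $latex |f_m(x) - f_n(x)| < \epsilon$

By our definition, it follows that for some $latex N$ and each fixed $latex x \in D$, $latex |f_n(x) - f_N(x)| < 1$ whenever $latex n \ge N$. Pick any such $latex n$; the triangle inequality gives:

$latex |f_n(x)| \le |f_n(x) - f_N(x)| + |f_N(x)| < 1 + |f_N(x)|$

If we let M= $latex \max\{|f_1(x)|, \dots, |f_{N-1}(x)|, 1 + |f_N(x)|\}$, then it follows that $latex |f_n(x)| \le M$ for all $latex n$. So, we say that $latex \{f_n(x)\}$ is bounded.

Since $latex \{f_n(x)\}$ is bounded, by the Bolzano-Weierstrass Property, it follows that:

$latex \{f_n(x)\}$ has a convergent subsequence, say $latex f_{n_k}(x) \to f(x)$.

And, as stated, we take our definition for $latex \{f_n\}$,

$latex \forall \epsilon > 0, \; \exists N : m, n \ge N, \; x \in D \implies |f_m(x) - f_n(x)| < \epsilon/2$.

Now, suppose that $latex n \ge N$ and $latex x \in D$; set $latex k$ large enough (it may depend on $latex x$) so that $latex n_k \ge N$ and $latex |f_{n_k}(x) - f(x)| < \epsilon/2$, where,

$latex |f_n(x) - f(x)| \le |f_n(x) - f_{n_k}(x)| + |f_{n_k}(x) - f(x)| < \epsilon$

Since $latex N$ does not depend on $latex x$, this establishes the fact that $latex \{f_n\}$ is uniformly convergent.

The other conditional is easy. If $latex f_n \to f$ uniformly on $latex D$, then the triangle inequality $latex |f_m(x) - f_n(x)| \le |f_m(x) - f(x)| + |f(x) - f_n(x)|$ shows that the sequence is uniformly Cauchy.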

]]>